From patchwork Fri Sep 11 12:32:21 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota, Anand Jain, Johannes Thumshirn, Damien Le Moal
Subject: [PATCH v7 01/39] btrfs: introduce ZONED feature flag
Date: Fri, 11 Sep 2020 21:32:21 +0900
Message-Id: <20200911123259.3782926-2-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>

This patch introduces the ZONED incompat flag. The flag indicates that the
volume management will satisfy the constraints imposed by host-managed
zoned block devices.
Reviewed-by: Anand Jain
Reviewed-by: Johannes Thumshirn
Signed-off-by: Damien Le Moal
Signed-off-by: Naohiro Aota
---
 fs/btrfs/sysfs.c           | 2 ++
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index c8df2edafd85..38c7a57789d8 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -262,6 +262,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
+BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);

 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -277,6 +278,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
+	BTRFS_FEAT_ATTR_PTR(zoned),
 	NULL
 };

diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 2c39d15a2beb..5df73001aad4 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -307,6 +307,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
+#define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)

 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;

From patchwork Fri Sep 11 12:32:22 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota, Damien Le Moal
Subject: [PATCH v7 02/39] btrfs: Get zone information of zoned block devices
Date: Fri, 11 Sep 2020 21:32:22 +0900
Message-Id: <20200911123259.3782926-3-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>

If a zoned block device is found, get its zone information (number of
zones and zone size) using the new helper function
btrfs_get_dev_zone_info(). To avoid costly run-time zone report commands
to test the device zone type during block allocation, attach a seq_zones
bitmap to the device structure to indicate whether a zone is
sequential-write-required or accepts random writes. It also attaches an
empty_zones bitmap to indicate whether a zone is empty.

This patch also introduces the helper function btrfs_dev_is_sequential()
to test whether the zone storing a block is a sequential-write-required
zone, and btrfs_dev_is_empty_zone() to test whether the zone is empty.
Reviewed-by: Josef Bacik
Signed-off-by: Damien Le Moal
Signed-off-by: Naohiro Aota
---
 fs/btrfs/Makefile      |   1 +
 fs/btrfs/dev-replace.c |   5 ++
 fs/btrfs/volumes.c     |  18 ++++-
 fs/btrfs/volumes.h     |   4 +
 fs/btrfs/zoned.c       | 179 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  92 +++++++++++++++++++++
 6 files changed, 297 insertions(+), 2 deletions(-)
 create mode 100644 fs/btrfs/zoned.c
 create mode 100644 fs/btrfs/zoned.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index e738f6206ea5..0497fdc37f90 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -16,6 +16,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
+btrfs-$(CONFIG_BLK_DEV_ZONED) += zoned.o

 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index db93909b25e0..83ee7371136c 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -21,6 +21,7 @@
 #include "rcu-string.h"
 #include "dev-replace.h"
 #include "sysfs.h"
+#include "zoned.h"

 /*
  * Device replace overview
@@ -297,6 +298,10 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 	device->fs_devices = fs_info->fs_devices;

+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error;
+
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	list_add(&device->dev_list, &fs_info->fs_devices->devices);
 	fs_info->fs_devices->num_devices++;

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 214856c4ccb1..ce612cb900cd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -30,6 +30,7 @@
 #include "space-info.h"
 #include "block-group.h"
 #include "discard.h"
+#include "zoned.h"

 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
@@ -372,6 +373,7 @@ void btrfs_free_device(struct btrfs_device *device)
 	rcu_string_free(device->name);
 	extent_io_tree_release(&device->alloc_state);
 	bio_put(device->flush_bio);
+	btrfs_destroy_dev_zone_info(device);
 	kfree(device);
 }

@@ -666,6 +668,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	device->mode = flags;

+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret != 0)
+		goto error_free_page;
+
 	fs_devices->open_devices++;
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
 	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -2553,6 +2560,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	}
 	rcu_assign_pointer(device->name, name);

+	device->fs_info = fs_info;
+	device->bdev = bdev;
+
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error_free_device;
+
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
@@ -2569,8 +2584,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 			 fs_info->sectorsize);
 	device->disk_total_bytes = device->total_bytes;
 	device->commit_total_bytes = device->total_bytes;
-	device->fs_info = fs_info;
-	device->bdev = bdev;
 	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
 	device->mode = FMODE_EXCL;
@@ -2713,6 +2726,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		sb->s_flags |= SB_RDONLY;
 	if (trans)
 		btrfs_end_transaction(trans);
+	btrfs_destroy_dev_zone_info(device);
 error_free_device:
 	btrfs_free_device(device);
 error:

diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 5eea93916fbf..a7ae1a02c6d2 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -51,6 +51,8 @@ struct btrfs_io_geometry {
 #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)

+struct btrfs_zoned_device_info;
+
 struct btrfs_device {
 	struct list_head dev_list; /* device_list_mutex */
 	struct list_head dev_alloc_list; /* chunk mutex */
@@ -64,6 +66,8 @@ struct btrfs_device {

 	struct block_device *bdev;

+	struct btrfs_zoned_device_info *zone_info;
+
 	/* the mode sent to blkdev_get */
 	fmode_t mode;

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
new file mode 100644
index 000000000000..0c908f0e9469
--- /dev/null
+++ b/fs/btrfs/zoned.c
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *	Naohiro Aota
+ *	Damien Le Moal
+ */
+
+#include
+#include
+#include "ctree.h"
+#include "volumes.h"
+#include "zoned.h"
+#include "rcu-string.h"
+
+/* Maximum number of zones to report per blkdev_report_zones() call */
+#define BTRFS_REPORT_NR_ZONES	4096
+
+static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx,
+			     void *data)
+{
+	struct blk_zone *zones = data;
+
+	memcpy(&zones[idx], zone, sizeof(*zone));
+
+	return 0;
+}
+
+static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
+			       struct blk_zone *zones, unsigned int *nr_zones)
+{
+	int ret;
+
+	if (!*nr_zones)
+		return 0;
+
+	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, *nr_zones,
+				  copy_zone_info_cb, zones);
+	if (ret < 0) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "get zone at %llu on %s failed %d", pos,
+				 rcu_str_deref(device->name), ret);
+		return ret;
+	}
+	*nr_zones = ret;
+	if (!ret)
+		return -EIO;
+
+	return 0;
+}
+
+int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = NULL;
+	struct block_device *bdev = device->bdev;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	sector_t sector = 0;
+	struct blk_zone *zones = NULL;
+	unsigned int i, nreported = 0, nr_zones;
+	unsigned int zone_sectors;
+	int ret;
+	char devstr[sizeof(device->fs_info->sb->s_id) +
+		    sizeof(" (device )") - 1];
+
+	if (!bdev_is_zoned(bdev))
+		return 0;
+
+	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
+	if (!zone_info)
+		return -ENOMEM;
+
+	zone_sectors = bdev_zone_sectors(bdev);
+	ASSERT(is_power_of_2(zone_sectors));
+	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
+	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
+	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
+	if (!IS_ALIGNED(nr_sectors, zone_sectors))
+		zone_info->nr_zones++;
+
+	zone_info->seq_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
+	if (!zone_info->seq_zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	zone_info->empty_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
+	if (!zone_info->empty_zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	zones = kcalloc(BTRFS_REPORT_NR_ZONES,
+			sizeof(struct blk_zone), GFP_KERNEL);
+	if (!zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Get zones type */
+	while (sector < nr_sectors) {
+		nr_zones = BTRFS_REPORT_NR_ZONES;
+		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, zones,
+					  &nr_zones);
+		if (ret)
+			goto out;
+
+		for (i = 0; i < nr_zones; i++) {
+			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+				set_bit(nreported, zone_info->seq_zones);
+			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
+				set_bit(nreported, zone_info->empty_zones);
+			nreported++;
+		}
+		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
+	}
+
+	if (nreported != zone_info->nr_zones) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "inconsistent number of zones on %s (%u / %u)",
+				 rcu_str_deref(device->name), nreported,
+				 zone_info->nr_zones);
+		ret = -EIO;
+		goto out;
+	}
+
+	kfree(zones);
+
+	device->zone_info = zone_info;
+
+	devstr[0] = 0;
+	if (device->fs_info)
+		snprintf(devstr, sizeof(devstr), " (device %s)",
+			 device->fs_info->sb->s_id);
+
+	rcu_read_lock();
+	pr_info(
+"BTRFS info%s: host-%s zoned block device %s, %u zones of %llu sectors",
+		devstr,
+		bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
+		rcu_str_deref(device->name), zone_info->nr_zones,
+		zone_info->zone_size >> SECTOR_SHIFT);
+	rcu_read_unlock();
+
+	return 0;
+
+out:
+	kfree(zones);
+	bitmap_free(zone_info->empty_zones);
+	bitmap_free(zone_info->seq_zones);
+	kfree(zone_info);
+
+	return ret;
+}
+
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return;
+
+	bitmap_free(zone_info->seq_zones);
+	bitmap_free(zone_info->empty_zones);
+	kfree(zone_info);
+	device->zone_info = NULL;
+}
+
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone)
+{
+	unsigned int nr_zones = 1;
+	int ret;
+
+	ret = btrfs_get_dev_zones(device, pos, zone, &nr_zones);
+	if (ret != 0 || !nr_zones)
+		return ret ? ret : -EIO;
+
+	return 0;
+}

diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
new file mode 100644
index 000000000000..e4a08ae0a96b
--- /dev/null
+++ b/fs/btrfs/zoned.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *	Naohiro Aota
+ *	Damien Le Moal
+ */
+
+#ifndef BTRFS_ZONED_H
+#define BTRFS_ZONED_H
+
+struct btrfs_zoned_device_info {
+	/*
+	 * Number of zones, zone size and types of zones if bdev is a
+	 * zoned block device.
+	 */
+	u64 zone_size;
+	u8  zone_size_shift;
+	u32 nr_zones;
+	unsigned long *seq_zones;
+	unsigned long *empty_zones;
+};
+
+#ifdef CONFIG_BLK_DEV_ZONED
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone);
+int btrfs_get_dev_zone_info(struct btrfs_device *device);
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+#else /* CONFIG_BLK_DEV_ZONED */
+static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+				     struct blk_zone *zone)
+{
+	return 0;
+}
+static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	return 0;
+}
+static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { }
+#endif
+
+static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return false;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->seq_zones);
+}
+
+static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return true;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
+						u64 pos, bool set)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+	unsigned int zno;
+
+	if (!zone_info)
+		return;
+
+	zno = pos >> zone_info->zone_size_shift;
+	if (set)
+		set_bit(zno, zone_info->empty_zones);
+	else
+		clear_bit(zno, zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_zone_empty(struct btrfs_device *device,
+					    u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, true);
+}
+
+static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
+					      u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, false);
+}
+
+#endif

From patchwork Fri Sep 11 12:32:23 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota, Damien Le Moal
Subject: [PATCH v7 03/39] btrfs: Check and enable ZONED mode
Date: Fri, 11 Sep 2020 21:32:23 +0900
Message-Id: <20200911123259.3782926-4-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>

This commit introduces the function btrfs_check_zoned_mode() to check
whether the ZONED flag is enabled on the file system and whether the
file system consists of zoned devices with equal zone sizes.
Reviewed-by: Josef Bacik
Signed-off-by: Damien Le Moal
Signed-off-by: Naohiro Aota
---
 fs/btrfs/ctree.h       |  3 ++
 fs/btrfs/dev-replace.c |  7 ++++
 fs/btrfs/disk-io.c     |  9 +++++
 fs/btrfs/super.c       |  1 +
 fs/btrfs/volumes.c     |  5 +++
 fs/btrfs/zoned.c       | 78 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       | 26 ++++++++++++++
 7 files changed, 129 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4455eb3f3683..f5ed8f5519dd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -580,6 +580,9 @@ struct btrfs_fs_info {
 	struct btrfs_root *free_space_root;
 	struct btrfs_root *data_reloc_root;

+	/* Zone size when in ZONED mode */
+	u64 zone_size;
+
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 83ee7371136c..18a36973f973 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -243,6 +243,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 		return PTR_ERR(bdev);
 	}

+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		btrfs_err(fs_info,
+			  "zone type of target device mismatch with the filesystem!");
+		ret = -EINVAL;
+		goto error;
+	}
+
 	sync_blockdev(bdev);

 	devices = &fs_info->fs_devices->devices;

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 465bc8372e09..f7c2d1d26026 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -42,6 +42,7 @@
 #include "block-group.h"
 #include "discard.h"
 #include "space-info.h"
+#include "zoned.h"

 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -3212,7 +3213,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device

 	btrfs_free_extra_devids(fs_devices, 1);

+	ret = btrfs_check_zoned_mode(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to init ZONED mode: %d",
+			  ret);
+		goto fail_block_groups;
+	}
+
 	ret = btrfs_sysfs_add_fsid(fs_devices);
+
 	if (ret) {
 		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",
 			  ret);

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 25967ecaaf0a..27a3a053f330 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -44,6 +44,7 @@
 #include "backref.h"
 #include "space-info.h"
 #include "sysfs.h"
+#include "zoned.h"
 #include "tests/btrfs-tests.h"
 #include "block-group.h"
 #include "discard.h"

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ce612cb900cd..d736d5391fac 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2527,6 +2527,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	if (IS_ERR(bdev))
 		return PTR_ERR(bdev);

+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		ret = -EINVAL;
+		goto error;
+	}
+
 	if (fs_devices->seeding) {
 		seeding_dev = 1;
 		down_write(&sb->s_umount);

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 0c908f0e9469..7509888b457a 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -177,3 +177,81 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,

 	return 0;
 }
+
+int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 hmzoned_devices = 0;
+	u64 nr_devices = 0;
+	u64 zone_size = 0;
+	int incompat_zoned = btrfs_fs_incompat(fs_info, ZONED);
+	int ret = 0;
+
+	/* Count zoned devices */
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		enum blk_zoned_model model;
+
+		if (!device->bdev)
+			continue;
+
+		model = bdev_zoned_model(device->bdev);
+		if (model == BLK_ZONED_HM ||
+		    (model == BLK_ZONED_HA && incompat_zoned)) {
+			hmzoned_devices++;
+			if (!zone_size) {
+				zone_size = device->zone_info->zone_size;
+			} else if (device->zone_info->zone_size != zone_size) {
+				btrfs_err(fs_info,
+					  "Zoned block devices must have equal zone sizes");
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+		nr_devices++;
+	}
+
+	if (!hmzoned_devices && !incompat_zoned)
+		goto out;
+
+	if (!hmzoned_devices && incompat_zoned) {
+		/* No zoned block device found on ZONED FS */
+		btrfs_err(fs_info,
+			  "ZONED enabled file system should have zoned devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (hmzoned_devices && !incompat_zoned) {
+		btrfs_err(fs_info,
+			  "Enable ZONED mode to mount HMZONED device");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (hmzoned_devices != nr_devices) {
+		btrfs_err(fs_info,
+			  "zoned devices cannot be mixed with regular devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
+	 * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size,
+	 * check the alignment here.
+	 */
+	if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) {
+		btrfs_err(fs_info,
+			  "zone size is not aligned to BTRFS_STRIPE_LEN");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	fs_info->zone_size = zone_size;
+
+	btrfs_info(fs_info, "ZONED mode enabled, zone size %llu B",
+		   fs_info->zone_size);
+out:
+	return ret;
+}

diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index e4a08ae0a96b..4341630cb756 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -9,6 +9,8 @@
 #ifndef BTRFS_ZONED_H
 #define BTRFS_ZONED_H

+#include
+
 struct btrfs_zoned_device_info {
 	/*
 	 * Number of zones, zone size and types of zones if bdev is a
@@ -26,6 +28,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 		       struct blk_zone *zone);
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -37,6 +40,14 @@ static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
 	return 0;
 }
 static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { }
+static inline int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_fs_incompat(fs_info, ZONED))
+		return 0;
+
+	btrfs_err(fs_info, "Zoned block devices support is not enabled");
+	return -EOPNOTSUPP;
+}
 #endif

 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -89,4 +100,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
 	btrfs_dev_set_empty_zone_bit(device, pos, false);
 }

+static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
+						struct block_device *bdev)
+{
+	u64 zone_size;
+
+	if (btrfs_fs_incompat(fs_info, ZONED)) {
+		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
+		/* Do not allow non-zoned device */
+		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
+	}
+
+	/* Do not allow Host Managed zoned device */
+	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
+}
+
 #endif

From patchwork Fri Sep 11 12:32:24 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 04/39] btrfs: introduce max_zone_append_size
Date: Fri, 11 Sep 2020 21:32:24 +0900
Message-Id: <20200911123259.3782926-5-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
Zone append write commands have a maximum I/O size they accept. Introduce
max_zone_append_size to zone_info and fs_info to track this value.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/ctree.h |  2 ++
 fs/btrfs/zoned.c | 17 +++++++++++++++--
 fs/btrfs/zoned.h |  1 +
 3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f5ed8f5519dd..54c22ad0d633 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -582,6 +582,8 @@ struct btrfs_fs_info {

 	/* Zone size when in ZONED mode */
 	u64 zone_size;
+	/* max size to emit ZONE_APPEND write command */
+	u64 max_zone_append_size;

 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 7509888b457a..2e12fce81abf 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -53,6 +53,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
 {
 	struct btrfs_zoned_device_info *zone_info = NULL;
 	struct block_device *bdev = device->bdev;
+	struct request_queue *q = bdev_get_queue(bdev);
 	sector_t nr_sectors = bdev->bd_part->nr_sects;
 	sector_t sector = 0;
 	struct blk_zone *zones = NULL;
@@ -73,6 +74,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device)
 	ASSERT(is_power_of_2(zone_sectors));
 	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
 	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
+	zone_info->max_zone_append_size =
+		(u64)queue_max_zone_append_sectors(q) << SECTOR_SHIFT;
 	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
 	if (!IS_ALIGNED(nr_sectors, zone_sectors))
 		zone_info->nr_zones++;
@@ -185,6 +188,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	u64 hmzoned_devices = 0;
 	u64 nr_devices = 0;
 	u64 zone_size = 0;
+	u64 max_zone_append_size = 0;
 	int incompat_zoned = btrfs_fs_incompat(fs_info, ZONED);
 	int ret = 0;
@@ -198,15 +202,23 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 		model = bdev_zoned_model(device->bdev);
 		if (model == BLK_ZONED_HM ||
 		    (model == BLK_ZONED_HA && incompat_zoned)) {
+			struct btrfs_zoned_device_info *zone_info =
+				device->zone_info;
+
 			hmzoned_devices++;
 			if (!zone_size) {
-				zone_size = device->zone_info->zone_size;
-			} else if (device->zone_info->zone_size != zone_size) {
+				zone_size = zone_info->zone_size;
+			} else if (zone_info->zone_size != zone_size) {
 				btrfs_err(fs_info,
		"Zoned block devices must have equal zone sizes");
 				ret = -EINVAL;
 				goto out;
 			}
+			if (!max_zone_append_size ||
+			    (zone_info->max_zone_append_size &&
+			     zone_info->max_zone_append_size < max_zone_append_size))
+				max_zone_append_size =
+					zone_info->max_zone_append_size;
 		}
 		nr_devices++;
 	}
@@ -249,6 +261,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	}

 	fs_info->zone_size = zone_size;
+	fs_info->max_zone_append_size = max_zone_append_size;

 	btrfs_info(fs_info, "ZONED mode enabled, zone size %llu B",
 		   fs_info->zone_size);
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 4341630cb756..f200b46a71fb 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -18,6 +18,7 @@ struct btrfs_zoned_device_info {
 	 */
 	u64 zone_size;
 	u8  zone_size_shift;
+	u64 max_zone_append_size;
 	u32 nr_zones;
 	unsigned long *seq_zones;
 	unsigned long *empty_zones;

From patchwork Fri Sep 11 12:32:25 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 05/39] btrfs: disallow space_cache in ZONED mode
Date: Fri, 11 Sep 2020 21:32:25 +0900
Message-Id: <20200911123259.3782926-6-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

As updates to the space cache v1 are done in place, the cache cannot be
located over sequential zones, and there is no guarantee that a device
will have enough conventional zones to store it. Resolve this problem by
disabling space cache v1 completely.

This does not introduce any problems for sequential block groups: all the
free space is located after the allocation pointer and there is none
before it, so there is no need for such a cache.

Note: technically, the free space tree (space cache v2) could be used in
ZONED mode. But since ZONED mode always allocates extents in a block
group sequentially regardless of the underlying device zone type, there
is no point in enabling and maintaining the tree. For the same reason,
NODATACOW is also disabled. INODE_MAP_CACHE is disabled as well, to avoid
preallocation in the INODE_MAP_CACHE inode.
In summary, ZONED will disable:

  | Disabled features | Reason                                              |
  |-------------------+-----------------------------------------------------|
  | RAID/Dup          | Cannot handle two zone append writes to different   |
  |                   | zones                                               |
  |-------------------+-----------------------------------------------------|
  | space_cache (v1)  | In-place updating                                   |
  | NODATACOW         | In-place updating                                   |
  |-------------------+-----------------------------------------------------|
  | fallocate         | Reserved extent will be a write hole                |
  | INODE_MAP_CACHE   | Need pre-allocation (and will be deprecated?)       |
  |-------------------+-----------------------------------------------------|
  | MIXED_BG          | Allocated metadata region will be write holes for   |
  |                   | data writes                                         |

Signed-off-by: Naohiro Aota
---
 fs/btrfs/super.c | 12 ++++++++++--
 fs/btrfs/zoned.c | 18 ++++++++++++++++++
 fs/btrfs/zoned.h |  5 +++++
 3 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 27a3a053f330..3fbffc7ce42b 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -525,8 +525,14 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 	cache_gen = btrfs_super_cache_generation(info->super_copy);
 	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
 		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
-	else if (cache_gen)
-		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	else if (cache_gen) {
+		if (btrfs_fs_incompat(info, ZONED)) {
+			btrfs_info(info,
+			"clearing existing space cache in ZONED mode");
+			btrfs_set_super_cache_generation(info->super_copy, 0);
+		} else
+			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	}

 	/*
 	 * Even the options are empty, we still need to do extra check
@@ -985,6 +991,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 		ret = -EINVAL;
 	}

+	if (!ret)
+		ret = btrfs_check_mountopts_zoned(info);
 	if (!ret && btrfs_test_opt(info, SPACE_CACHE))
 		btrfs_info(info, "disk space caching is enabled");
 	if (!ret && btrfs_test_opt(info, FREE_SPACE_TREE))
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 2e12fce81abf..1629e585ba8c 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -268,3 +268,21 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 out:
 	return ret;
 }
+
+int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
+{
+	if (!btrfs_fs_incompat(info, ZONED))
+		return 0;
+
+	/*
+	 * SPACE CACHE writing is not CoWed. Disable that to avoid write
+	 * errors in sequential zones.
+	 */
+	if (btrfs_test_opt(info, SPACE_CACHE)) {
+		btrfs_err(info,
+			  "space cache v1 not supported in ZONED mode");
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index f200b46a71fb..2e1983188e6f 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -30,6 +30,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
 int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info);
+int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -49,6 +50,10 @@ static inline int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 	btrfs_err(fs_info, "Zoned block devices support is not enabled");
 	return -EOPNOTSUPP;
 }
+static inline int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
+{
+	return 0;
+}
 #endif

 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)

From patchwork Fri Sep 11 12:32:26 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota, Johannes Thumshirn
Subject: [PATCH v7 06/39] btrfs: disallow NODATACOW in ZONED mode
Date: Fri, 11 Sep 2020 21:32:26 +0900
Message-Id: <20200911123259.3782926-7-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

NODATACOW implies overwriting file data in place on a device, which is
impossible in sequential-write-required zones. Disable NODATACOW both
globally (the mount option) and per file (the NODATACOW file attribute),
the latter by masking out FS_NOCOW_FL.
Reviewed-by: Johannes Thumshirn
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
 fs/btrfs/ioctl.c | 3 +++
 fs/btrfs/zoned.c | 6 ++++++
 2 files changed, 9 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ac45f022b495..548692cdc5df 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -91,6 +91,9 @@ struct btrfs_ioctl_send_args_32 {
 static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
 						unsigned int flags)
 {
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), ZONED))
+		flags &= ~FS_NOCOW_FL;
+
 	if (S_ISDIR(inode->i_mode))
 		return flags;
 	else if (S_ISREG(inode->i_mode))
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1629e585ba8c..6bce654bb0e8 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -284,5 +284,11 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
 		return -EOPNOTSUPP;
 	}

+	if (btrfs_test_opt(info, NODATACOW)) {
+		btrfs_err(info, "cannot enable nodatacow with ZONED mode");
+		return -EOPNOTSUPP;
+	}
+
 	return 0;
 }

From patchwork Fri Sep 11 12:32:27 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota, Johannes Thumshirn
Subject: [PATCH v7 07/39] btrfs: disable fallocate in ZONED mode
Date: Fri, 11 Sep 2020 21:32:27 +0900
Message-Id: <20200911123259.3782926-8-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

fallocate() is implemented by reserving an actual extent instead of a
reservation. This can expose the sequential write constraint of
host-managed zoned block devices to the application, which would break
POSIX semantics for the fallocated file. To avoid this, report
fallocate() as not supported in ZONED mode for now.

In the future, we may be able to implement an "in-memory" fallocate() in
ZONED mode by utilizing space_info->bytes_may_use or similar.

Reviewed-by: Johannes Thumshirn
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
 fs/btrfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1395e537ad32..8843696c7f74 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3309,6 +3309,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 	alloc_end = round_up(offset + len, blocksize);
 	cur_offset = alloc_start;

+	/* Do not allow fallocate in ZONED mode */
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), ZONED))
+		return -EOPNOTSUPP;
+
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
 		     FALLOC_FL_ZERO_RANGE))

From patchwork Fri Sep 11 12:32:28 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 08/39] btrfs: disallow mixed-bg in ZONED mode
Date: Fri, 11 Sep 2020 21:32:28 +0900
Message-Id: <20200911123259.3782926-9-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

Placing both data and metadata in a block group is impossible in ZONED
mode. For data, we can allocate a space for it and write it out
immediately after the allocation. For metadata, however, we cannot do so,
because the logical addresses are recorded in other metadata buffers to
build up the trees. As a result, a data buffer can be placed after a
metadata buffer that is not yet written. Writing out the data buffer
would then break the sequential write rule.

This commit checks for and disallows MIXED_BG in ZONED mode.
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
 fs/btrfs/zoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 6bce654bb0e8..8cd43d2d5611 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -260,6 +260,13 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}

+	if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
+		btrfs_err(fs_info,
+			"ZONED mode is not allowed for mixed block groups");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	fs_info->zone_size = zone_size;
 	fs_info->max_zone_append_size = max_zone_append_size;

From patchwork Fri Sep 11 12:32:29 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 09/39] btrfs: disallow inode_cache in ZONED mode
Date: Fri, 11 Sep 2020 21:32:29 +0900
Message-Id: <20200911123259.3782926-10-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

inode_cache uses pre-allocation to write its cache data. However,
pre-allocation is completely disabled in ZONED mode.

We could technically enable inode_cache in the same way as relocation.
However, inode_cache is rarely used and its man page discourages using
it, so just disable it for now.

Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
 fs/btrfs/zoned.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 8cd43d2d5611..e47698d313a5 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -297,5 +297,11 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
 		return -EOPNOTSUPP;
 	}

+	if (btrfs_test_pending(info, SET_INODE_MAP_CACHE)) {
+		btrfs_err(info,
+			"cannot enable inode map caching with ZONED mode");
+		return -EOPNOTSUPP;
+	}
+
 	return 0;
 }

From patchwork Fri Sep 11 12:32:30 2020
x=1631363725; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=X4jH9YP/TnjjTvpPfw/Sf7hrTD+pmtWbkl1pfmdoyRg=; b=oPkHWqbJ2hLW4lNm2VS8mpNt/HIhTPTfksOMy1QJvgkxO1n2fPOcFYvP jQZBE/Da9F2x4fZOdZym+W6Y42435lPMsZOq+GYxCXUoeCaJ/uVl/WWeU paIxC/c1Wwa224HCbIbQasvru4YMTgwUaSxVSNvcGJHuK8VpiA9LDrkEt do95ZIDwjIpoYWhe6BJKnm1xcv9kCM6nlZyRRSnZy0NE2H92gTZjqQ583 J+2pQA7aA7UNggD1WkPNutcO8xbACmQb0fIJVxLOXbWXHpVAuEJyhNmtC t1Kfu9yrBuKqYXzjIaSc7Nc4EqiEzAxsVEwL+to4FT4j6QW/lVPTgRRgr Q==; IronPort-SDR: h7l5u/sYC3i+QmfdFEdQ6HJyW+pKgRfc3oTDGcJVZCGAF4qJTWplxzj5bL3z12xnOoodE0DXNp RXyDoW27H2D0chrLebDaJZRhS+LotOKfaFDTMeY5sddlCMQEGeeI4SFEN3ynPUCZnDoaQtq+1S uV3jKWoRSaTPePF9wvdBGQNuftT2dqv08zan3p94esozULB9ryyJBRk+9qVRuwhcHwGjrB3r6v zasTb3ujPZF499ONy1W7IlM/HRBrZV8PJMVsW4JOLCrr4aB3h7uur6vn6iOOTb6uY9/BZYi1QK nqM= X-IronPort-AV: E=Sophos;i="5.76,415,1592841600"; d="scan'208";a="147125972" Received: from h199-255-45-15.hgst.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 11 Sep 2020 20:33:23 +0800 IronPort-SDR: KPTkHyAnrea1YJjX4HtWT20jHEw0HlPeq/mUKtgKYXdWWGVxix0wFriu0//D6k7lkzvXaL1e2S WOhI/sBQwmvw== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 Sep 2020 05:19:43 -0700 IronPort-SDR: TFUSsTTxwJxB98iTvAHHorKUf+BqdVYcu+6eqCihRk8GLen7sHGGjNUvuTbMwvFeFuNHljHAuv yYw/z9tzF8vg== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 11 Sep 2020 05:33:20 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Hannes Reinecke , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 10/39] btrfs: implement log-structured superblock for ZONED mode Date: Fri, 11 Sep 2020 21:32:30 +0900 Message-Id: <20200911123259.3782926-11-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: 
<20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Superblock (and its copies) is the only data structure in btrfs which has a fixed location on a device. Since we cannot overwrite in a sequential write required zone, we cannot place superblock in the zone. One easy solution is limiting superblock and copies to be placed only in conventional zones. However, this method has two downsides: one is reduced number of superblock copies. The location of the second copy of superblock is 256GB, which is in a sequential write required zone on typical devices in the market today. So, the number of superblock and copies is limited to be two. Second downside is that we cannot support devices which have no conventional zones at all. To solve these two problems, we employ superblock log writing. It uses two zones as a circular buffer to write updated superblocks. Once the first zone is filled up, start writing into the second buffer. Then, when the both zones are filled up and before start writing to the first zone again, it reset the first zone. We can determine the position of the latest superblock by reading write pointer information from a device. One corner case is when the both zones are full. For this situation, we read out the last superblock of each zone, and compare them to determine which zone is older. The following zones are reserved as the circular buffer on ZONED btrfs. - The primary superblock: zones 0 and 1 - The first copy: zones 16 and 17 - The second copy: zones 1024 or zone at 256GB which is minimum, and next to it If these reserved zones are conventional, superblock is written fixed at the start of the zone without logging. 
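To make the two-zone scheme above concrete, here is a small user-space sketch. This is illustrative only, not the kernel code: the offset table and both helper names are this example's own, modeled on the mapping and the Empty/In-use/Full decision table described in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed btrfs superblock mirror offsets: 64 KiB, 64 MiB, 256 GiB
 * (illustrative constants, mirroring what btrfs_sb_offset() returns). */
static const uint64_t sb_offset[3] = {
	65536ULL,
	67108864ULL,
	274877906944ULL,
};

/* First zone of each mirror's two-zone log, per the scheme above:
 * mirror 0 -> zone 0, mirror 1 -> zone 16,
 * mirror 2 -> the zone holding the 256 GiB offset, capped at zone 1024. */
static uint32_t sb_zone_number(uint64_t zone_size, int mirror)
{
	switch (mirror) {
	case 0:
		return 0;
	case 1:
		return 16;
	case 2: {
		uint64_t zone = sb_offset[2] / zone_size;

		return zone < 1024 ? (uint32_t)zone : 1024;
	}
	}
	return 0;
}

enum zcond { Z_EMPTY, Z_INUSE, Z_FULL };

/* Which zone's write pointer locates the latest superblock:
 * -1: neither zone written yet (no superblock to read),
 *  0 or 1: use that zone's write pointer,
 *  2: both zones full, compare the generations of the last superblocks. */
static int sb_log_pick(enum zcond z0, enum zcond z1)
{
	if (z0 == Z_EMPTY && z1 == Z_EMPTY)
		return -1;
	if (z0 == Z_FULL && z1 == Z_FULL)
		return 2;
	if (z0 != Z_FULL)
		return 0;
	return 1;
}
```

With a 1 GiB zone size, mirror 2 maps to zone 256 (the zone at the 256 GiB boundary); with 256 MiB zones the division would give zone 1024 and the cap keeps it there.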
Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 9 ++ fs/btrfs/disk-io.c | 41 +++++- fs/btrfs/scrub.c | 3 + fs/btrfs/volumes.c | 21 ++- fs/btrfs/zoned.c | 313 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 42 ++++++ 6 files changed, 417 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index ea8aaf36647e..4ac4aacfae04 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1723,6 +1723,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, static int exclude_super_stripes(struct btrfs_block_group *cache) { struct btrfs_fs_info *fs_info = cache->fs_info; + bool zoned = btrfs_fs_incompat(fs_info, ZONED); u64 bytenr; u64 *logical; int stripe_len; @@ -1744,6 +1745,14 @@ static int exclude_super_stripes(struct btrfs_block_group *cache) if (ret) return ret; + /* shouldn't have super stripes in sequential zones */ + if (zoned && nr) { + btrfs_err(fs_info, + "Zoned btrfs's block group %llu should not have super blocks", + cache->start); + return -EUCLEAN; + } + while (nr--) { u64 len = min_t(u64, stripe_len, cache->start + cache->length - logical[nr]); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index f7c2d1d26026..362799403285 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3501,10 +3501,17 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev, { struct btrfs_super_block *super; struct page *page; - u64 bytenr; + u64 bytenr, bytenr_orig; struct address_space *mapping = bdev->bd_inode->i_mapping; + int ret; + + bytenr_orig = btrfs_sb_offset(copy_num); + ret = btrfs_sb_log_location_bdev(bdev, copy_num, READ, &bytenr); + if (ret == -ENOENT) + return ERR_PTR(-EINVAL); + else if (ret) + return ERR_PTR(ret); - bytenr = btrfs_sb_offset(copy_num); if (bytenr + BTRFS_SUPER_INFO_SIZE >= i_size_read(bdev->bd_inode)) return ERR_PTR(-EINVAL); @@ -3513,7 +3520,7 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev, 
return ERR_CAST(page); super = page_address(page); - if (btrfs_super_bytenr(super) != bytenr || + if (btrfs_super_bytenr(super) != bytenr_orig || btrfs_super_magic(super) != BTRFS_MAGIC) { btrfs_release_disk_super(super); return ERR_PTR(-EINVAL); @@ -3569,7 +3576,8 @@ static int write_dev_supers(struct btrfs_device *device, SHASH_DESC_ON_STACK(shash, fs_info->csum_shash); int i; int errors = 0; - u64 bytenr; + int ret; + u64 bytenr, bytenr_orig; if (max_mirrors == 0) max_mirrors = BTRFS_SUPER_MIRROR_MAX; @@ -3581,12 +3589,21 @@ static int write_dev_supers(struct btrfs_device *device, struct bio *bio; struct btrfs_super_block *disk_super; - bytenr = btrfs_sb_offset(i); + bytenr_orig = btrfs_sb_offset(i); + ret = btrfs_sb_log_location(device, i, WRITE, &bytenr); + if (ret == -ENOENT) + continue; + else if (ret < 0) { + btrfs_err(device->fs_info, "couldn't get super block location for mirror %d", + i); + errors++; + continue; + } if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->commit_total_bytes) break; - btrfs_set_super_bytenr(sb, bytenr); + btrfs_set_super_bytenr(sb, bytenr_orig); crypto_shash_digest(shash, (const char *)sb + BTRFS_CSUM_SIZE, BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE, @@ -3631,6 +3648,7 @@ static int write_dev_supers(struct btrfs_device *device, bio->bi_opf |= REQ_FUA; btrfsic_submit_bio(bio); + btrfs_advance_sb_log(device, i); } return errors < i ? 
0 : -1; } @@ -3647,6 +3665,7 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) int i; int errors = 0; bool primary_failed = false; + int ret; u64 bytenr; if (max_mirrors == 0) @@ -3655,7 +3674,15 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) for (i = 0; i < max_mirrors; i++) { struct page *page; - bytenr = btrfs_sb_offset(i); + ret = btrfs_sb_log_location(device, i, READ, &bytenr); + if (ret == -ENOENT) + break; + else if (ret < 0) { + errors++; + if (i == 0) + primary_failed = true; + continue; + } if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->commit_total_bytes) break; diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 354ab9985a34..e46c91188a75 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -20,6 +20,7 @@ #include "rcu-string.h" #include "raid56.h" #include "block-group.h" +#include "zoned.h" /* * This is only the first step towards a full-features scrub. It reads all @@ -3704,6 +3705,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx, if (bytenr + BTRFS_SUPER_INFO_SIZE > scrub_dev->commit_total_bytes) break; + if (!btrfs_check_super_location(scrub_dev, bytenr)) + continue; ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr, scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i, diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index d736d5391fac..22384c803ead 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1273,7 +1273,8 @@ void btrfs_release_disk_super(struct btrfs_super_block *super) } static struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev, - u64 bytenr) + u64 bytenr, + u64 bytenr_orig) { struct btrfs_super_block *disk_super; struct page *page; @@ -1304,7 +1305,7 @@ static struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev /* align our pointer to the offset of the super block */ disk_super = p + offset_in_page(bytenr); - if (btrfs_super_bytenr(disk_super) != bytenr || + if (btrfs_super_bytenr(disk_super) != 
bytenr_orig || btrfs_super_magic(disk_super) != BTRFS_MAGIC) { btrfs_release_disk_super(p); return ERR_PTR(-EINVAL); @@ -1339,7 +1340,8 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, bool new_device_added = false; struct btrfs_device *device = NULL; struct block_device *bdev; - u64 bytenr; + u64 bytenr, bytenr_orig; + int ret; lockdep_assert_held(&uuid_mutex); @@ -1349,14 +1351,18 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, * So, we need to add a special mount option to scan for * later supers, using BTRFS_SUPER_MIRROR_MAX instead */ - bytenr = btrfs_sb_offset(0); flags |= FMODE_EXCL; bdev = blkdev_get_by_path(path, flags, holder); if (IS_ERR(bdev)) return ERR_CAST(bdev); - disk_super = btrfs_read_disk_super(bdev, bytenr); + bytenr_orig = btrfs_sb_offset(0); + ret = btrfs_sb_log_location_bdev(bdev, 0, READ, &bytenr); + if (ret) + return ERR_PTR(ret); + + disk_super = btrfs_read_disk_super(bdev, bytenr, bytenr_orig); if (IS_ERR(disk_super)) { device = ERR_CAST(disk_super); goto error_bdev_put; @@ -2023,6 +2029,11 @@ static void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info, if (IS_ERR(disk_super)) continue; + if (bdev_is_zoned(bdev)) { + btrfs_reset_sb_log_zones(bdev, copy_num); + continue; + } + memset(&disk_super->magic, 0, sizeof(disk_super->magic)); page = virt_to_page(disk_super); diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index e47698d313a5..6912b66f3130 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -26,6 +26,27 @@ static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, return 0; } +static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zone, + u64 *wp_ret); + +static inline u32 sb_zone_number(u64 zone_size, int mirror) +{ + ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX); + + switch (mirror) { + case 0: + return 0; + case 1: + return 16; + case 2: + return min(btrfs_sb_offset(mirror) / zone_size, 1024ULL); + default: + BUG(); + } + + return 0; 
+} + static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos, struct blk_zone *zones, unsigned int *nr_zones) { @@ -126,6 +147,40 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device) goto out; } + nr_zones = 2; + for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) { + u32 sb_zone = sb_zone_number(zone_info->zone_size, i); + u64 sb_wp; + + if (sb_zone + 1 >= zone_info->nr_zones) + continue; + + sector = sb_zone << (zone_info->zone_size_shift - SECTOR_SHIFT); + ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, + &zone_info->sb_zones[2 * i], + &nr_zones); + if (ret) + goto out; + if (nr_zones != 2) { + btrfs_err_in_rcu(device->fs_info, + "failed to read SB log zone info at device %s zone %u", + rcu_str_deref(device->name), sb_zone); + ret = -EIO; + goto out; + } + + ret = sb_write_pointer(device->bdev, + &zone_info->sb_zones[2 * i], &sb_wp); + if (ret != -ENOENT && ret) { + btrfs_err_in_rcu(device->fs_info, + "SB log zone corrupted: device %s zone %u", + rcu_str_deref(device->name), sb_zone); + ret = -EUCLEAN; + goto out; + } + } + + kfree(zones); device->zone_info = zone_info; @@ -305,3 +360,261 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) return 0; } + +static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones, + u64 *wp_ret) +{ + bool empty[2]; + bool full[2]; + sector_t sector; + + ASSERT(zones[0].type != BLK_ZONE_TYPE_CONVENTIONAL && + zones[1].type != BLK_ZONE_TYPE_CONVENTIONAL); + if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) { + *wp_ret = zones[0].start << SECTOR_SHIFT; + return -ENOENT; + } + + empty[0] = zones[0].cond == BLK_ZONE_COND_EMPTY; + empty[1] = zones[1].cond == BLK_ZONE_COND_EMPTY; + full[0] = zones[0].cond == BLK_ZONE_COND_FULL; + full[1] = zones[1].cond == BLK_ZONE_COND_FULL; + + /* + * Possible state of log buffer zones + * + * E I F + * E * x 0 + * I 0 x 0 + * F 1 1 C + * + * Row: zones[0] + * Col: zones[1] + * State: + * E: Empty, I: In-Use, F: Full + * Log position: + * *: Special 
case, no superblock is written + * 0: Use write pointer of zones[0] + * 1: Use write pointer of zones[1] + * C: Compare SBs from zones[0] and zones[1], use the newer one + * x: Invalid state + */ + + if (empty[0] && empty[1]) { + /* special case to distinguish no superblock to read */ + *wp_ret = zones[0].start << SECTOR_SHIFT; + return -ENOENT; + } else if (full[0] && full[1]) { + /* Compare two super blocks */ + struct address_space *mapping = bdev->bd_inode->i_mapping; + struct page *page[2]; + struct btrfs_super_block *super[2]; + u64 bytenr[2]; + int i; + + for (i = 0; i < 2; i++) { + bytenr[i] = ((zones[i].start + zones[i].len) << SECTOR_SHIFT) - + BTRFS_SUPER_INFO_SIZE; + page[i] = read_cache_page_gfp(mapping, + bytenr[i] >> PAGE_SHIFT, + GFP_NOFS); + if (IS_ERR(page[i])) { + if (i == 1) + btrfs_release_disk_super(super[0]); + return PTR_ERR(page[i]); + } + super[i] = page_address(page[i]); + } + + if (super[0]->generation > super[1]->generation) + sector = zones[1].start; + else + sector = zones[0].start; + + for (i = 0; i < 2; i++) + btrfs_release_disk_super(super[i]); + } else if (!full[0] && (empty[1] || full[1])) { + sector = zones[0].wp; + } else if (full[0]) { + sector = zones[1].wp; + } else { + return -EUCLEAN; + } + *wp_ret = sector << SECTOR_SHIFT; + return 0; +} + +static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, + int rw, u64 *bytenr_ret) +{ + u64 wp; + int ret; + + if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) { + *bytenr_ret = zones[0].start << SECTOR_SHIFT; + return 0; + } + + ret = sb_write_pointer(bdev, zones, &wp); + if (ret != -ENOENT && ret < 0) + return ret; + + if (rw == WRITE) { + struct blk_zone *reset = NULL; + + if (wp == zones[0].start << SECTOR_SHIFT) + reset = &zones[0]; + else if (wp == zones[1].start << SECTOR_SHIFT) + reset = &zones[1]; + + if (reset) { + ASSERT(reset->cond == BLK_ZONE_COND_FULL); + + ret = blkdev_zone_mgmt(bdev, REQ_OP_ZONE_RESET, + reset->start, reset->len, + GFP_NOFS); + if 
(ret) + return ret; + + reset->cond = BLK_ZONE_COND_EMPTY; + reset->wp = reset->start; + } + } else if (ret != -ENOENT) { + /* For READ, we want the previous one */ + if (wp == zones[0].start << SECTOR_SHIFT) + wp = (zones[1].start + zones[1].len) << SECTOR_SHIFT; + wp -= BTRFS_SUPER_INFO_SIZE; + } + + *bytenr_ret = wp; + return 0; + +} + +int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw, + u64 *bytenr_ret) +{ + struct blk_zone zones[2]; + unsigned int zone_sectors; + u32 sb_zone; + int ret; + u64 zone_size; + u8 zone_sectors_shift; + sector_t nr_sectors = bdev->bd_part->nr_sects; + u32 nr_zones; + + if (!bdev_is_zoned(bdev)) { + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; + } + + ASSERT(rw == READ || rw == WRITE); + + zone_sectors = bdev_zone_sectors(bdev); + if (!is_power_of_2(zone_sectors)) + return -EINVAL; + zone_size = zone_sectors << SECTOR_SHIFT; + zone_sectors_shift = ilog2(zone_sectors); + nr_zones = nr_sectors >> zone_sectors_shift; + + sb_zone = sb_zone_number(zone_size, mirror); + if (sb_zone + 1 >= nr_zones) + return -ENOENT; + + ret = blkdev_report_zones(bdev, sb_zone << zone_sectors_shift, 2, + copy_zone_info_cb, zones); + if (ret < 0) + return ret; + if (ret != 2) + return -EIO; + + return sb_log_location(bdev, zones, rw, bytenr_ret); +} + +int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, + u64 *bytenr_ret) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + u32 zone_num; + + if (!zinfo) { + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; + } + + zone_num = sb_zone_number(zinfo->zone_size, mirror); + if (zone_num + 1 >= zinfo->nr_zones) + return -ENOENT; + + return sb_log_location(device->bdev, &zinfo->sb_zones[2 * mirror], rw, + bytenr_ret); +} + +static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo, + int mirror) +{ + u32 zone_num; + + if (!zinfo) + return false; + + zone_num = sb_zone_number(zinfo->zone_size, mirror); + if (zone_num + 1 >=
zinfo->nr_zones) + return false; + + if (!test_bit(zone_num, zinfo->seq_zones)) + return false; + + return true; +} + +int btrfs_advance_sb_log(struct btrfs_device *device, int mirror) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + struct blk_zone *zone; + + if (!is_sb_log_zone(zinfo, mirror)) + return 0; + + zone = &zinfo->sb_zones[2 * mirror]; + if (zone->cond != BLK_ZONE_COND_FULL) { + if (zone->cond == BLK_ZONE_COND_EMPTY) + zone->cond = BLK_ZONE_COND_IMP_OPEN; + zone->wp += (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT); + if (zone->wp == zone->start + zone->len) + zone->cond = BLK_ZONE_COND_FULL; + return 0; + } + + zone++; + ASSERT(zone->cond != BLK_ZONE_COND_FULL); + if (zone->cond == BLK_ZONE_COND_EMPTY) + zone->cond = BLK_ZONE_COND_IMP_OPEN; + zone->wp += (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT); + if (zone->wp == zone->start + zone->len) + zone->cond = BLK_ZONE_COND_FULL; + + return 0; +} + +int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror) +{ + sector_t zone_sectors; + sector_t nr_sectors = bdev->bd_part->nr_sects; + u8 zone_sectors_shift; + u32 sb_zone; + u32 nr_zones; + + zone_sectors = bdev_zone_sectors(bdev); + zone_sectors_shift = ilog2(zone_sectors); + nr_zones = nr_sectors >> zone_sectors_shift; + + sb_zone = sb_zone_number(zone_sectors << SECTOR_SHIFT, mirror); + if (sb_zone + 1 >= nr_zones) + return -ENOENT; + + return blkdev_zone_mgmt(bdev, REQ_OP_ZONE_RESET, + sb_zone << zone_sectors_shift, zone_sectors * 2, + GFP_NOFS); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 2e1983188e6f..e33c0e409b7d 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -10,6 +10,8 @@ #define BTRFS_ZONED_H #include +#include "volumes.h" +#include "disk-io.h" struct btrfs_zoned_device_info { /* @@ -22,6 +24,7 @@ struct btrfs_zoned_device_info { u32 nr_zones; unsigned long *seq_zones; unsigned long *empty_zones; + struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX]; }; #ifdef CONFIG_BLK_DEV_ZONED @@ -31,6 +34,12
@@ int btrfs_get_dev_zone_info(struct btrfs_device *device); void btrfs_destroy_dev_zone_info(struct btrfs_device *device); int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info); int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info); +int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw, + u64 *bytenr_ret); +int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, + u64 *bytenr_ret); +int btrfs_advance_sb_log(struct btrfs_device *device, int mirror); +int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -54,6 +63,28 @@ static inline int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) { return 0; } +static inline int btrfs_sb_log_location_bdev(struct block_device *bdev, + int mirror, int rw, + u64 *bytenr_ret) +{ + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; +} +static inline int btrfs_sb_log_location(struct btrfs_device *device, int mirror, + int rw, u64 *bytenr_ret) +{ + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; +} +static inline int btrfs_advance_sb_log(struct btrfs_device *device, int mirror) +{ + return 0; +} +static inline int btrfs_reset_sb_log_zones(struct block_device *bdev, + int mirror) +{ + return 0; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -121,4 +152,15 @@ static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info, return bdev_zoned_model(bdev) != BLK_ZONED_HM; } +static inline bool btrfs_check_super_location(struct btrfs_device *device, + u64 pos) +{ + /* + * On a non-zoned device, any address is OK. On a zoned device, + * non-SEQUENTIAL WRITE REQUIRED zones are capable. 
+ */ + return device->zone_info == NULL || + !btrfs_dev_is_sequential(device, pos); +} + #endif

From patchwork Fri Sep 11 12:32:31 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11771175
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 11/39] btrfs: implement zoned chunk allocator
Date: Fri, 11 Sep 2020 21:32:31 +0900
Message-Id: <20200911123259.3782926-12-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

This commit implements a zoned chunk/dev_extent allocator. The zoned allocator aligns device extents to zone boundaries, so that a zone reset affects only its own device extent and does not change the state of blocks in neighboring device extents. It also checks that the allocated region does not overlap the locations of any superblock zones, and ensures that the region is empty.
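The alignment rule above can be sketched in user space as follows. This is an illustrative model, not the kernel implementation: the helper names echo the patch's `btrfs_zone_align()` and `dev_extent_search_start_zoned()`, but the code here is standalone and assumes a power-of-two zone size.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1M (1ULL << 20)

/* Round pos up to the next zone boundary (zone_size must be a power
 * of two), as btrfs_zone_align() does with ALIGN(). */
static uint64_t zone_align(uint64_t pos, uint64_t zone_size)
{
	return (pos + zone_size - 1) & ~(zone_size - 1);
}

/* Model of the zoned dev-extent search start: never allocate below
 * max(zone_size, 1 MiB), and always begin on a zone boundary, so a
 * later zone reset cannot touch a neighboring extent. */
static uint64_t dev_extent_search_start_zoned(uint64_t start,
					      uint64_t zone_size)
{
	uint64_t min_start = zone_size > SZ_1M ? zone_size : SZ_1M;

	if (start < min_start)
		start = min_start;
	return zone_align(start, zone_size);
}
```

For a device with 64 MiB zones, a search starting at offset 0 is pushed to the first zone boundary at 64 MiB, and any offset inside a zone is rounded up to the start of the next zone.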
Signed-off-by: Naohiro Aota --- fs/btrfs/volumes.c | 133 +++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/volumes.h | 1 + fs/btrfs/zoned.c | 128 +++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 30 ++++++++++ 4 files changed, 292 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 22384c803ead..8c439d1ae4c5 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1407,6 +1407,14 @@ static bool contains_pending_extent(struct btrfs_device *device, u64 *start, return false; } +static inline u64 dev_extent_search_start_zoned(struct btrfs_device *device, + u64 start) +{ + start = max_t(u64, start, + max_t(u64, device->zone_info->zone_size, SZ_1M)); + return btrfs_zone_align(device, start); +} + static u64 dev_extent_search_start(struct btrfs_device *device, u64 start) { switch (device->fs_devices->chunk_alloc_policy) { @@ -1417,11 +1425,57 @@ static u64 dev_extent_search_start(struct btrfs_device *device, u64 start) * make sure to start at an offset of at least 1MB. 
*/ return max_t(u64, start, SZ_1M); + case BTRFS_CHUNK_ALLOC_ZONED: + return dev_extent_search_start_zoned(device, start); default: BUG(); } } +static bool dev_extent_hole_check_zoned(struct btrfs_device *device, + u64 *hole_start, u64 *hole_size, + u64 num_bytes) +{ + u64 zone_size = device->zone_info->zone_size; + u64 pos; + int ret; + int changed = 0; + + ASSERT(IS_ALIGNED(*hole_start, zone_size)); + + while (*hole_size > 0) { + pos = btrfs_find_allocatable_zones(device, *hole_start, + *hole_start + *hole_size, + num_bytes); + if (pos != *hole_start) { + *hole_size = *hole_start + *hole_size - pos; + *hole_start = pos; + changed = 1; + if (*hole_size < num_bytes) + break; + } + + ret = btrfs_ensure_empty_zones(device, pos, num_bytes); + + /* range is ensured to be empty */ + if (!ret) + return changed; + + /* given hole range was invalid (outside of device) */ + if (ret == -ERANGE) { + *hole_start += *hole_size; + *hole_size = 0; + return 1; + } + + *hole_start += zone_size; + *hole_size -= zone_size; + changed = 1; + } + + return changed; +} + /** * dev_extent_hole_check - check if specified hole is suitable for allocation * @device: the device which we have the hole @@ -1454,6 +1508,10 @@ static bool dev_extent_hole_check(struct btrfs_device *device, u64 *hole_start, case BTRFS_CHUNK_ALLOC_REGULAR: /* No extra check */ break; + case BTRFS_CHUNK_ALLOC_ZONED: + changed |= dev_extent_hole_check_zoned(device, hole_start, + hole_size, num_bytes); + break; default: BUG(); } @@ -1508,6 +1566,9 @@ static int find_free_dev_extent_start(struct btrfs_device *device, search_start = dev_extent_search_start(device, search_start); + WARN_ON(device->zone_info && + !IS_ALIGNED(num_bytes, device->zone_info->zone_size)); + path = btrfs_alloc_path(); if (!path) return -ENOMEM; @@ -4912,6 +4973,39 @@ static void init_alloc_chunk_ctl_policy_regular( ctl->dev_extent_min = BTRFS_STRIPE_LEN * ctl->dev_stripes; } +static void +init_alloc_chunk_ctl_policy_zoned(struct btrfs_fs_devices 
*fs_devices, + struct alloc_chunk_ctl *ctl) +{ + u64 zone_size = fs_devices->fs_info->zone_size; + u64 limit; + int min_num_stripes = ctl->devs_min * ctl->dev_stripes; + int min_data_stripes = (min_num_stripes - ctl->nparity) / ctl->ncopies; + u64 min_chunk_size = min_data_stripes * zone_size; + u64 type = ctl->type; + + ctl->max_stripe_size = zone_size; + if (type & BTRFS_BLOCK_GROUP_DATA) { + ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE, + zone_size); + } else if (type & BTRFS_BLOCK_GROUP_METADATA) { + ctl->max_chunk_size = ctl->max_stripe_size; + } else if (type & BTRFS_BLOCK_GROUP_SYSTEM) { + ctl->max_chunk_size = 2 * ctl->max_stripe_size; + ctl->devs_max = min_t(int, ctl->devs_max, + BTRFS_MAX_DEVS_SYS_CHUNK); + } else { + BUG(); + } + + /* We don't want a chunk larger than 10% of writable space */ + limit = max(round_down(div_factor(fs_devices->total_rw_bytes, 1), + zone_size), + min_chunk_size); + ctl->max_chunk_size = min(limit, ctl->max_chunk_size); + ctl->dev_extent_min = zone_size * ctl->dev_stripes; +} + static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices, struct alloc_chunk_ctl *ctl) { @@ -4932,6 +5026,9 @@ static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices, case BTRFS_CHUNK_ALLOC_REGULAR: init_alloc_chunk_ctl_policy_regular(fs_devices, ctl); break; + case BTRFS_CHUNK_ALLOC_ZONED: + init_alloc_chunk_ctl_policy_zoned(fs_devices, ctl); + break; default: BUG(); } @@ -5058,6 +5155,40 @@ static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl, return 0; } +static int decide_stripe_size_zoned(struct alloc_chunk_ctl *ctl, + struct btrfs_device_info *devices_info) +{ + u64 zone_size = devices_info[0].dev->zone_info->zone_size; + int data_stripes; /* number of stripes that count for + block group size */ + + /* + * It should hold because: + * dev_extent_min == dev_extent_want == zone_size * dev_stripes + */ + ASSERT(devices_info[ctl->ndevs - 1].max_avail == ctl->dev_extent_min); + + ctl->stripe_size = 
zone_size; + ctl->num_stripes = ctl->ndevs * ctl->dev_stripes; + data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies; + + /* + * stripe_size is fixed in ZONED. Reduce ndevs instead. + */ + if (ctl->stripe_size * data_stripes > ctl->max_chunk_size) { + ctl->ndevs = div_u64(div_u64(ctl->max_chunk_size * ctl->ncopies, + ctl->stripe_size) + ctl->nparity, + ctl->dev_stripes); + ctl->num_stripes = ctl->ndevs * ctl->dev_stripes; + data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies; + ASSERT(ctl->stripe_size * data_stripes <= ctl->max_chunk_size); + } + + ctl->chunk_size = ctl->stripe_size * data_stripes; + + return 0; +} + static int decide_stripe_size(struct btrfs_fs_devices *fs_devices, struct alloc_chunk_ctl *ctl, struct btrfs_device_info *devices_info) @@ -5085,6 +5216,8 @@ static int decide_stripe_size(struct btrfs_fs_devices *fs_devices, switch (fs_devices->chunk_alloc_policy) { case BTRFS_CHUNK_ALLOC_REGULAR: return decide_stripe_size_regular(ctl, devices_info); + case BTRFS_CHUNK_ALLOC_ZONED: + return decide_stripe_size_zoned(ctl, devices_info); default: BUG(); } diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index a7ae1a02c6d2..88b1d59fbc12 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -213,6 +213,7 @@ BTRFS_DEVICE_GETSET_FUNCS(bytes_used); enum btrfs_chunk_allocation_policy { BTRFS_CHUNK_ALLOC_REGULAR, + BTRFS_CHUNK_ALLOC_ZONED, }; struct btrfs_fs_devices { diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 6912b66f3130..916d358dea27 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -6,12 +6,16 @@ * Damien Le Moal */ +#include "asm-generic/bitops/find.h" +#include "linux/blk_types.h" +#include "linux/kernel.h" #include #include #include "ctree.h" #include "volumes.h" #include "zoned.h" #include "rcu-string.h" +#include "disk-io.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 @@ -324,6 +328,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info 
*fs_info) fs_info->zone_size = zone_size; fs_info->max_zone_append_size = max_zone_append_size; + fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED; btrfs_info(fs_info, "ZONED mode enabled, zone size %llu B", fs_info->zone_size); @@ -618,3 +623,126 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror) sb_zone << zone_sectors_shift, zone_sectors * 2, GFP_NOFS); } + +/* + * btrfs_check_allocatable_zones - find allocatable zones within give region + * @device: the device to allocate a region + * @hole_start: the position of the hole to allocate the region + * @num_bytes: the size of wanted region + * @hole_size: the size of hole + * + * Allocatable region should not contain any superblock locations. + */ +u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start, + u64 hole_end, u64 num_bytes) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + u8 shift = zinfo->zone_size_shift; + u64 nzones = num_bytes >> shift; + u64 pos = hole_start; + u64 begin, end; + u64 sb_pos; + bool have_sb; + int i; + + ASSERT(IS_ALIGNED(hole_start, zinfo->zone_size)); + ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size)); + + while (pos < hole_end) { + begin = pos >> shift; + end = begin + nzones; + + if (end > zinfo->nr_zones) + return hole_end; + + /* check if zones in the region are all empty */ + if (btrfs_dev_is_sequential(device, pos) && + find_next_zero_bit(zinfo->empty_zones, end, begin) != end) { + pos += zinfo->zone_size; + continue; + } + + have_sb = false; + for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) { + sb_pos = sb_zone_number(zinfo->zone_size, i); + if (!(end < sb_pos || sb_pos + 1 < begin)) { + have_sb = true; + pos = (sb_pos + 2) << shift; + break; + } + } + if (!have_sb) + break; + } + + return pos; +} + +int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, + u64 length, u64 *bytes) +{ + int ret; + + *bytes = 0; + ret = blkdev_zone_mgmt(device->bdev, REQ_OP_ZONE_RESET, + physical >> 
SECTOR_SHIFT, length >> SECTOR_SHIFT, + GFP_NOFS); + if (ret) + return ret; + + *bytes = length; + while (length) { + btrfs_dev_set_zone_empty(device, physical); + physical += device->zone_info->zone_size; + length -= device->zone_info->zone_size; + } + + return 0; +} + +int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + u8 shift = zinfo->zone_size_shift; + unsigned long begin = start >> shift; + unsigned long end = (start + size) >> shift; + u64 pos; + int ret; + + ASSERT(IS_ALIGNED(start, zinfo->zone_size)); + ASSERT(IS_ALIGNED(size, zinfo->zone_size)); + + if (end > zinfo->nr_zones) + return -ERANGE; + + /* all the zones are conventional */ + if (find_next_bit(zinfo->seq_zones, end, begin) == end) + return 0; + + /* all the zones are sequential and empty */ + if (find_next_zero_bit(zinfo->seq_zones, end, begin) == end && + find_next_zero_bit(zinfo->empty_zones, end, begin) == end) + return 0; + + for (pos = start; pos < start + size; pos += zinfo->zone_size) { + u64 reset_bytes; + + if (!btrfs_dev_is_sequential(device, pos) || + btrfs_dev_is_empty_zone(device, pos)) + continue; + + /* free regions should be empty */ + btrfs_warn_in_rcu( + device->fs_info, + "resetting device %s zone %llu for allocation", + rcu_str_deref(device->name), pos >> shift); + WARN_ON_ONCE(1); + + ret = btrfs_reset_device_zone(device, pos, zinfo->zone_size, + &reset_bytes); + if (ret) + return ret; + } + + return 0; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index e33c0e409b7d..0be58861d922 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -40,6 +40,11 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, u64 *bytenr_ret); int btrfs_advance_sb_log(struct btrfs_device *device, int mirror); int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror); +u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start, + u64 hole_end, u64
num_bytes); +int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, + u64 length, u64 *bytes); +int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -85,6 +90,23 @@ static inline int btrfs_reset_sb_log_zones(struct block_device *bdev, { return 0; } +static inline u64 btrfs_find_allocatable_zones(struct btrfs_device *device, + u64 hole_start, u64 hole_end, + u64 num_bytes) +{ + return hole_start; +} +static inline int btrfs_reset_device_zone(struct btrfs_device *device, + u64 physical, u64 length, u64 *bytes) +{ + *bytes = 0; + return 0; +} +static inline int btrfs_ensure_empty_zones(struct btrfs_device *device, + u64 start, u64 size) +{ + return 0; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -163,4 +185,12 @@ static inline bool btrfs_check_super_location(struct btrfs_device *device, !btrfs_dev_is_sequential(device, pos); } +static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos) +{ + if (!device->zone_info) + return pos; + + return ALIGN(pos, device->zone_info->zone_size); +} + #endif From patchwork Fri Sep 11 12:32:32 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11771173 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3265959D for ; Fri, 11 Sep 2020 17:44:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 14084221ED for ; Fri, 11 Sep 2020 17:44:18 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="bax7xpX+" Received: 
(majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726224AbgIKRoM (ORCPT ); Fri, 11 Sep 2020 13:44:12 -0400 Received: from esa5.hgst.iphmx.com ([216.71.153.144]:38451 "EHLO esa5.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725936AbgIKMfZ (ORCPT ); Fri, 11 Sep 2020 08:35:25 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1599827725; x=1631363725; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=yH1nPkHODBV09kOOFiPhVkyCkpApdtePV0TSia5UdZM=; b=bax7xpX+CPwLujyCVoLBSQUVsbXGTKCROhweLqSrKCN+e8/aMvgsgKhd B5Avq1u8rvoHJn3vUaxvj9yYCdYSaVIdM+/yz0lHFOx2Bi8Y4+s8FwqhB BsL08YjQVvoTrydJfQ+7eGuQWexd2Pl/z/KLSWTpMVLk1hbNsnew2ggfi BcwveP8Br7QzrTOchFbgGyOJtm4G3RidwRKo0WBSgjtWw3t7KDUUSQh8p LA/N4Frd2ZvpkVw+LRk/xTh1fXh4rEsyYNRoY36ujvD9VsJP97k3VVo7X 3MraYotJg/KVytn1r9INOC9PTOGwnDMe+M+84PpWdhb7Qehoa9G7ZhNka w==; IronPort-SDR: qlHqBj7B8XODFqkTsvT84Dq3ggcX4CLo/6vXZF7lgNmD2uavcgmojzn5ckQqniHoyd+TQ5LkJZ 8uPURxQMc+Bm/vfk423FEEa9NZgwtJbalKPjjy44G4rhxSyGYzHiTXZeJSrvaa26l9c4u+kb+e 28denc4aj63XJWC7T92NsIBB88hM/Pg6KCj1pRjqSD6g2XSitHBnqbkAPqawyNO3XX8JSpTPcx 9Y9QaBhTwOWuThEpKNHqva5pyNBkv2vW9VNkr/L4r66YZBsD5Zn0MlspgbEZCbJx1tGe5CQikT IOA= X-IronPort-AV: E=Sophos;i="5.76,415,1592841600"; d="scan'208";a="147125978" Received: from h199-255-45-15.hgst.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 11 Sep 2020 20:33:25 +0800 IronPort-SDR: RhBzNfws+h5iMTWzZ863TTbrL9jP86jXKoyWkGc0feJh6ZI6o32Nr0QpFPM1XmUvO+y+NTlHda Dn6CmQLR51ig== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 Sep 2020 05:19:46 -0700 IronPort-SDR: ZkLh3ZvYdikujK0xzO9wN3MUnwFs5hSDAQFDer1MfztJfgIOYZQmiEd7ST71VdKvu3rjDB9Khp e8vybWwigQEQ== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 
11 Sep 2020 05:33:23 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Hannes Reinecke , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 12/39] btrfs: verify device extent is aligned to zone Date: Fri, 11 Sep 2020 21:32:32 +0900 Message-Id: <20200911123259.3782926-13-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This patch adds a verification to verify_one_dev_extent() to check that the device extent is aligned to the zone boundary. Signed-off-by: Naohiro Aota --- fs/btrfs/volumes.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 8c439d1ae4c5..086cd308e5b6 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -7765,6 +7765,20 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info, ret = -EUCLEAN; goto out; } + + if (dev->zone_info) { + u64 zone_size = dev->zone_info->zone_size; + + if (!IS_ALIGNED(physical_offset, zone_size) || + !IS_ALIGNED(physical_len, zone_size)) { + btrfs_err(fs_info, +"dev extent devid %llu physical offset %llu len %llu is not aligned to device zone", + devid, physical_offset, physical_len); + ret = -EUCLEAN; + goto out; + } + } + out: free_extent_map(em); return ret; From patchwork Fri Sep 11 12:32:33 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11771165 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 47B8913B1 for ; Fri, 11 Sep 2020 17:44:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by
mail.kernel.org (Postfix) with ESMTP id 1E71F221EB for ; Fri, 11 Sep 2020 17:44:02 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="QdlEDEuJ" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725774AbgIKRns (ORCPT ); Fri, 11 Sep 2020 13:43:48 -0400 Received: from esa5.hgst.iphmx.com ([216.71.153.144]:38370 "EHLO esa5.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725935AbgIKMfZ (ORCPT ); Fri, 11 Sep 2020 08:35:25 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1599827725; x=1631363725; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=5A/9fBK17G8sEij8dtmDRpHMYN+v3qYjluhyZkcn2jo=; b=QdlEDEuJu2r+GM8E8HzfbZmDvzW8PdUECLyW7PsflVhUry97VSEpLsxy XN0Xcj+Nzxy7PUiVRy0WWCu6god7x5qjWcGwAO1Li/xSj5zXP1hY7CBPL IekrWOoy7nHCvgV4K85a5on4Sfb9dx9YHMicS2oc19ujou+ZBLraHKIBZ PnD0+R6E+MK1oPR1Ker7obc6U9U4IOOtHBMsf5/5XFUHd7lCzwf311/Ik MQGqG5SqcgQt6dFVaPfyaDPaYyEjmqdYgigIGB6GexD0nIskP39JnXdDn M+y3YmWpQutjCA7VQStYSyg5Du9TmyLFtXpcoNYAi+Dn3J8HCX3zSxImj Q==; IronPort-SDR: ZG694UsPzpdFCD4CoLGvFyEL7t4p6zMEvyfKsomlKgxUWX57EThGMgTWoMWSSoB8WSA5Z45tWG PWvZLStB48FzwYuCX91HG6AjY2RnKPeXkvBUZVOqkIlWRVOLe/ONrAEZlNvGEH9qeYy/m5E+un 3xY1JEXSWBy5uj7Z3QoEkZcfZDUrW4nlWkCU++zH0Vz9M4owlQsh9oWxZ4kUilr5vuz5FPPXNH wRi0Dvt5WtgrSCb4jwWeUqQmirb/gBhZAfmqWsix5CKaX4uRVri5jlXlkgnaOkanyQkQp9nZZN WvE= X-IronPort-AV: E=Sophos;i="5.76,415,1592841600"; d="scan'208";a="147125979" Received: from h199-255-45-15.hgst.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 11 Sep 2020 20:33:27 +0800 IronPort-SDR: TRwt67Cx14tivziVWcVADjQa1iMp7B/cSPZ0mTd0kphRwR+Wa3/50gcjcJJH2QSLB+LZyaP/fD Q86NDDXJpeCQ== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 
11 Sep 2020 05:19:48 -0700 IronPort-SDR: aVIqE+FFeRkK1UDORUtRGvcWYSvyDRRzV2MWSqjHUYYjH/32ol3EXNtbcVS6qjMnRwHO+xizr8 La2rdwK88lYw== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 11 Sep 2020 05:33:25 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Hannes Reinecke , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 13/39] btrfs: load zone's allocation offset Date: Fri, 11 Sep 2020 21:32:33 +0900 Message-Id: <20200911123259.3782926-14-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Zoned btrfs must allocate blocks at the zones' write pointer. The device's write pointer position can be mapped to a logical address within a block group. This commit adds "alloc_offset" to track that logical address. The offset is populated in btrfs_load_block_group_zone_info() from the write pointers of the corresponding zones. For now, zoned btrfs only supports the SINGLE profile. Supporting non-SINGLE profiles with zone append writes is not trivial. For example, in the DUP profile, we send a zone append write IO to two zones on a device. The device replies with the written LBAs for the IOs. If the offsets of the returned addresses from the beginning of each zone differ, the result is different logical addresses. We would need a fine-grained logical-to-physical mapping to handle such diverging physical addresses. Since that would require an additional metadata type, disable non-SINGLE profiles for now.
Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 15 ++++ fs/btrfs/block-group.h | 6 ++ fs/btrfs/zoned.c | 153 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 6 ++ 4 files changed, 180 insertions(+) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 4ac4aacfae04..3ce685a10631 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -15,6 +15,7 @@ #include "delalloc-space.h" #include "discard.h" #include "raid56.h" +#include "zoned.h" /* * Return target flags in extended format or 0 if restripe for this chunk_type @@ -1945,6 +1946,13 @@ static int read_one_block_group(struct btrfs_fs_info *info, goto error; } + ret = btrfs_load_block_group_zone_info(cache); + if (ret) { + btrfs_err(info, "failed to load zone info of bg %llu", + cache->start); + goto error; + } + /* * We need to exclude the super stripes now so that the space info has * super bytes accounted for, otherwise we'll think we have more space @@ -2148,6 +2156,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used, cache->last_byte_to_unpin = (u64)-1; cache->cached = BTRFS_CACHE_FINISHED; cache->needs_free_space = 1; + + ret = btrfs_load_block_group_zone_info(cache); + if (ret) { + btrfs_put_block_group(cache); + return ret; + } + ret = exclude_super_stripes(cache); if (ret) { /* We may have excluded something, so call this just in case */ diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index adfd7583a17b..14e3043c9ce7 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -183,6 +183,12 @@ struct btrfs_block_group { /* Record locked full stripes for RAID5/6 block group */ struct btrfs_full_stripe_locks_tree full_stripe_locks_root; + + /* + * Allocation offset for the block group to implement sequential + * allocation. This is used only with ZONED mode enabled. 
+ */ + u64 alloc_offset; }; static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group) diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 916d358dea27..cc6bc45729b4 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -11,14 +11,20 @@ #include "linux/kernel.h" #include #include +#include #include "ctree.h" #include "volumes.h" #include "zoned.h" #include "rcu-string.h" #include "disk-io.h" +#include "block-group.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 +/* Invalid allocation pointer value for missing devices */ +#define WP_MISSING_DEV ((u64)-1) +/* Pseudo write pointer value for conventional zone */ +#define WP_CONVENTIONAL ((u64)-2) static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, void *data) @@ -746,3 +752,150 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size) return 0; } + +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct extent_map_tree *em_tree = &fs_info->mapping_tree; + struct extent_map *em; + struct map_lookup *map; + struct btrfs_device *device; + u64 logical = cache->start; + u64 length = cache->length; + u64 physical = 0; + int ret; + int i; + unsigned int nofs_flag; + u64 *alloc_offsets = NULL; + u32 num_sequential = 0, num_conventional = 0; + + if (!btrfs_fs_incompat(fs_info, ZONED)) + return 0; + + /* Sanity check */ + if (!IS_ALIGNED(length, fs_info->zone_size)) { + btrfs_err(fs_info, "unaligned block group at %llu + %llu", + logical, length); + return -EIO; + } + + /* Get the chunk mapping */ + read_lock(&em_tree->lock); + em = lookup_extent_mapping(em_tree, logical, length); + read_unlock(&em_tree->lock); + + if (!em) + return -EINVAL; + + map = em->map_lookup; + + /* + * Get the zone type: if the group is mapped to a non-sequential zone, + * there is no need for the allocation offset (fit allocation is OK). 
+ */ + alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets), + GFP_NOFS); + if (!alloc_offsets) { + free_extent_map(em); + return -ENOMEM; + } + + for (i = 0; i < map->num_stripes; i++) { + bool is_sequential; + struct blk_zone zone; + + device = map->stripes[i].dev; + physical = map->stripes[i].physical; + + if (device->bdev == NULL) { + alloc_offsets[i] = WP_MISSING_DEV; + continue; + } + + is_sequential = btrfs_dev_is_sequential(device, physical); + if (is_sequential) + num_sequential++; + else + num_conventional++; + + if (!is_sequential) { + alloc_offsets[i] = WP_CONVENTIONAL; + continue; + } + + /* + * This zone will be used for allocation, so mark this + * zone non-empty. + */ + btrfs_dev_clear_zone_empty(device, physical); + + /* + * The group is mapped to a sequential zone. Get the zone write + * pointer to determine the allocation offset within the zone. + */ + WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size)); + nofs_flag = memalloc_nofs_save(); + ret = btrfs_get_dev_zone(device, physical, &zone); + memalloc_nofs_restore(nofs_flag); + if (ret == -EIO || ret == -EOPNOTSUPP) { + ret = 0; + alloc_offsets[i] = WP_MISSING_DEV; + continue; + } else if (ret) { + goto out; + } + + switch (zone.cond) { + case BLK_ZONE_COND_OFFLINE: + case BLK_ZONE_COND_READONLY: + btrfs_err(fs_info, "Offline/readonly zone %llu", + physical >> device->zone_info->zone_size_shift); + alloc_offsets[i] = WP_MISSING_DEV; + break; + case BLK_ZONE_COND_EMPTY: + alloc_offsets[i] = 0; + break; + case BLK_ZONE_COND_FULL: + alloc_offsets[i] = fs_info->zone_size; + break; + default: + /* Partially used zone */ + alloc_offsets[i] = + ((zone.wp - zone.start) << SECTOR_SHIFT); + break; + } + } + + if (num_conventional > 0) { + /* + * Since conventional zones does not have write pointer, we + * cannot determine alloc_offset from the pointer + */ + ret = -EINVAL; + goto out; + } + + switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) { + case 0: /* single */ + cache->alloc_offset = 
alloc_offsets[0]; + break; + case BTRFS_BLOCK_GROUP_DUP: + case BTRFS_BLOCK_GROUP_RAID1: + case BTRFS_BLOCK_GROUP_RAID0: + case BTRFS_BLOCK_GROUP_RAID10: + case BTRFS_BLOCK_GROUP_RAID5: + case BTRFS_BLOCK_GROUP_RAID6: + /* non-SINGLE profiles are not supported yet */ + default: + btrfs_err(fs_info, "Unsupported profile on ZONED %s", + btrfs_bg_type_to_raid_name(map->type)); + ret = -EINVAL; + goto out; + } + +out: + kfree(alloc_offsets); + free_extent_map(em); + + return ret; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 0be58861d922..1fd7cad19e18 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -45,6 +45,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start, int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, u64 length, u64 *bytes); int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size); +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -107,6 +108,11 @@ static inline int btrfs_ensure_empty_zones(struct btrfs_device *device, { return 0; } +static inline int btrfs_load_block_group_zone_info( + struct btrfs_block_group *cache) +{ + return 0; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Fri Sep 11 12:32:34 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11771159 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6E31D13B1 for ; Fri, 11 Sep 2020 17:43:45 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 53462221ED for ; Fri, 11 Sep 2020 17:43:45 +0000 (UTC) Authentication-Results: 
mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="lDfNTXee" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726017AbgIKRnk (ORCPT ); Fri, 11 Sep 2020 13:43:40 -0400 Received: from esa5.hgst.iphmx.com ([216.71.153.144]:38372 "EHLO esa5.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725813AbgIKMgA (ORCPT ); Fri, 11 Sep 2020 08:36:00 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1599827760; x=1631363760; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=HbmbgVqmxW0JeqvL4wMgkAVAzhmn/HJp2we2iWZC6rE=; b=lDfNTXeeOrmJv6Qeqzj/WvYC6Q6gyYXn/h7UAZhEOPzzdJPWtaWodDkL RVMdF28y0VdtYk+4Kp3jrHF6ZJYcf5/OpGgaDgloEBTOMFKZLTaDloz4x uP8BgS6oM/qaB6a8NFiJcOO08u/0VOMMXkMnrHhOLShBoe5b1IzXXXcbm ocDipu+zjt1fzjeLpG37epS85QufRSqv9BG0Zdv9h+q/MDPBdYBizeNmr MiJjqhTZGnox+ekIKMxKXIhSvH7EXFJhFX9bfxzkGzkH7U2XIbbtIXPDH pl3fqQXSoSKTLriX1wPIfCA3e1xpb+VmDqXUttF5mpVEbZ4Jmhc/yEIdB A==; IronPort-SDR: jinznpqEbUOEWPgyeGboIysqvMu/br/OJpRdNXXhXv3NXYWgFFI2+Npz2uEBC7nEeUQ/I/0RpD ZNrjC6KvV1kS6MXe72cGho9kAIU800YR1/O6dW4geqZAfAHes2bMrsMcIQmiHYSGPp3DFshyI2 9o33giMcY94GfjlzNcj9R6TbLfR7m56ajZiysucE+S9ITWDhQn4m35gI8FR88Bl7iYnRCkg3i2 rlGF+CBiE8hiWJP4Ynv1Uy0aSV+l7KVKE1+qA2GdcHo2uAphl6eboj+qJ5vrqzf7d/67FFMH/G Elg= X-IronPort-AV: E=Sophos;i="5.76,415,1592841600"; d="scan'208";a="147125985" Received: from h199-255-45-15.hgst.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 11 Sep 2020 20:33:28 +0800 IronPort-SDR: Jsh7kdUNperrM96bbfOp1TsNY8adoQPWOmWSnnQEGrEhQkMg0ZNq80KujrAJxusojThcbd6vH0 gv2IHRV5q4bA== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 Sep 2020 05:19:49 -0700 IronPort-SDR: xJL6eNhJG/HyZytgjmGx8Gw61wQlhwHifSXa4lkRA3DLLPlpuH7Y/ck7FYANaKkNI9yTvzQ4dA 
uHGh3l/6CE0Q== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 11 Sep 2020 05:33:26 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Hannes Reinecke , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 14/39] btrfs: emulate write pointer for conventional zones Date: Fri, 11 Sep 2020 21:32:34 +0900 Message-Id: <20200911123259.3782926-15-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Conventional zones do not have a write pointer, so we cannot use one to determine the allocation offset if a block group contains a conventional zone. Instead, we can consider the end of the last allocated extent in the block group as the allocation offset.
Signed-off-by: Naohiro Aota --- fs/btrfs/zoned.c | 119 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 113 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index cc6bc45729b4..ca090a5cdc6e 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -753,6 +753,104 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size) return 0; } +static int emulate_write_pointer(struct btrfs_block_group *cache, + u64 *offset_ret) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct btrfs_root *root = fs_info->extent_root; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_key search_key; + struct btrfs_key found_key; + int slot; + int ret; + u64 length; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + search_key.objectid = cache->start + cache->length; + search_key.type = 0; + search_key.offset = 0; + + ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0); + if (ret < 0) + goto out; + ASSERT(ret != 0); + slot = path->slots[0]; + leaf = path->nodes[0]; + ASSERT(slot != 0); + slot--; + btrfs_item_key_to_cpu(leaf, &found_key, slot); + + if (found_key.objectid < cache->start) { + *offset_ret = 0; + } else if (found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) { + struct btrfs_key extent_item_key; + + if (found_key.objectid != cache->start) { + ret = -EUCLEAN; + goto out; + } + + length = 0; + + /* metadata may have METADATA_ITEM_KEY */ + if (slot == 0) { + btrfs_set_path_blocking(path); + ret = btrfs_prev_leaf(root, path); + if (ret < 0) + goto out; + if (ret == 0) { + slot = btrfs_header_nritems(leaf) - 1; + btrfs_item_key_to_cpu(leaf, &extent_item_key, + slot); + } + } else { + btrfs_item_key_to_cpu(leaf, &extent_item_key, slot - 1); + ret = 0; + } + + if (ret == 0 && + extent_item_key.objectid == cache->start) { + if (extent_item_key.type == BTRFS_METADATA_ITEM_KEY) + length = fs_info->nodesize; + else if (extent_item_key.type == BTRFS_EXTENT_ITEM_KEY) + length = 
extent_item_key.offset; + else { + ret = -EUCLEAN; + goto out; + } + } + + *offset_ret = length; + } else if (found_key.type == BTRFS_EXTENT_ITEM_KEY || + found_key.type == BTRFS_METADATA_ITEM_KEY) { + + if (found_key.type == BTRFS_EXTENT_ITEM_KEY) + length = found_key.offset; + else + length = fs_info->nodesize; + + if (!(found_key.objectid >= cache->start && + found_key.objectid + length <= + cache->start + cache->length)) { + ret = -EUCLEAN; + goto out; + } + *offset_ret = found_key.objectid + length - cache->start; + } else { + ret = -EUCLEAN; + goto out; + } + ret = 0; + +out: + btrfs_free_path(path); + return ret; +} + int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) { struct btrfs_fs_info *fs_info = cache->fs_info; @@ -767,6 +865,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) int i; unsigned int nofs_flag; u64 *alloc_offsets = NULL; + u64 emulated_offset = 0; u32 num_sequential = 0, num_conventional = 0; if (!btrfs_fs_incompat(fs_info, ZONED)) @@ -867,12 +966,12 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) } if (num_conventional > 0) { - /* - * Since conventional zones does not have write pointer, we - * cannot determine alloc_offset from the pointer - */ - ret = -EINVAL; - goto out; + ret = emulate_write_pointer(cache, &emulated_offset); + if (ret || map->num_stripes == num_conventional) { + if (!ret) + cache->alloc_offset = emulated_offset; + goto out; + } } switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) { @@ -894,6 +993,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) } out: + /* an extent is allocated after the write pointer */ + if (num_conventional && emulated_offset > cache->alloc_offset) { + btrfs_err(fs_info, + "got wrong write pointer in BG %llu: %llu > %llu", + logical, emulated_offset, cache->alloc_offset); + ret = -EIO; + } + kfree(alloc_offsets); free_extent_map(em); From patchwork Fri Sep 11 12:32:35 2020 Content-Type: text/plain; 
charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11771161 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id CCA7559D for ; Fri, 11 Sep 2020 17:43:46 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A42A1221EF for ; Fri, 11 Sep 2020 17:43:46 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="YQlU/i0v" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726127AbgIKRnj (ORCPT ); Fri, 11 Sep 2020 13:43:39 -0400 Received: from esa5.hgst.iphmx.com ([216.71.153.144]:38428 "EHLO esa5.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725876AbgIKMgM (ORCPT ); Fri, 11 Sep 2020 08:36:12 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1599827773; x=1631363773; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=5kKRjEpMb1plEY26COWVdyUchowLlJUVQkFHc7CHHKI=; b=YQlU/i0v8KAfuClbH0P+POzx3f50Z2R4Tj65JJb37ZbgcHsysVxA3JIX 9EcFd8KXjWOhwi+CHy1VmfjxMcskWcFUw+1+nbEdZHxg+MPUvZ4wA27zA jT9DsR8d/S5D0Aq7APu+Hgu1g6Y0lKq3J4vtiYi8Zo6RCASmZ0Gs7z2Z6 7FVFZXsyRpIm13XpWoiXWZjxkY4RqTb6IXYaBCeGDRq+ROGcqaCJikzPQ yyMbFJ0SwKWDov5MMyjMH02FbYZENSU8kj5fhCf2XKdtE5vKHW4NLmjRw 7/H97KAW+wD2sue0MdVCtKdl5adJbhnmaA/o4VEy7KY2ErbVtn6hwm3DU w==; IronPort-SDR: Re4iltSrpXpn7zTWQvCt/zr+b+//Ba/bWVe9pOYrXSYzssPjIvr0v1obBxYM5cYuHC19jPqTn/ 9QED2XdIkrmmTt74cfVWywZAdiedQ4u9MRqWCPTp9QDKIlk32rTu0vhZzvLIn4Iw8GaiiH3TS/ VaiNb2j/RrvVCCPBMKzo+eXyiRwMcWAWYX6b3X4TJwrZ5Q/pTtFLw6nlqSGastxXV+lGqlOS1s +4OAk/Dr+u3guqfYVgousmwzXqWm/7L2SxUDrX5rz43V7g/oouKCYbw1O0EpH4B/3uPnL0/6o9 MZM= X-IronPort-AV: 
E=Sophos;i="5.76,415,1592841600"; d="scan'208";a="147125988" Received: from h199-255-45-15.hgst.com (HELO uls-op-cesaep02.wdc.com) ([199.255.45.15]) by ob1.hgst.iphmx.com with ESMTP; 11 Sep 2020 20:33:30 +0800 IronPort-SDR: eTwyPSv/p4NT4jy5md+sW/NGtU8Cqtce0HlcpeBdSu/koymq3UtCHWcXJklu0LJVO+j2vFLtBO Ll3JH/LEcx7A== Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep02.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 11 Sep 2020 05:19:50 -0700 IronPort-SDR: EOalRsz66TAJvvJtcu7cL4pcAaQsNmT8zmCLwU741h9krVEa2tugZoi7F9Lwj10UuMAcv35RJd X2PUkV7zKd3w== WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip02.wdc.com with ESMTP; 11 Sep 2020 05:33:27 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Hannes Reinecke , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 15/39] btrfs: track unusable bytes for zones Date: Fri, 11 Sep 2020 21:32:35 +0900 Message-Id: <20200911123259.3782926-16-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org In zoned btrfs, a region that was once written and then freed is not usable again until the underlying zones are reset. We need to distinguish such unusable space from usable free space. So, this commit introduces "zone_unusable" to the block group, and "bytes_zone_unusable" to the space_info, to track the unusable space. Pinned bytes are always reclaimed to the unusable space. But when an allocated region is returned before being used (e.g., the block group becomes read-only between allocation time and reservation time), we can safely return the region to the block group. For that situation, this commit introduces "btrfs_add_free_space_unused".
This behaves the same as btrfs_add_free_space() on regular btrfs. On zoned btrfs, it rewinds the allocation offset. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 19 +++++++++----- fs/btrfs/block-group.h | 1 + fs/btrfs/extent-tree.c | 15 ++++++++--- fs/btrfs/free-space-cache.c | 52 +++++++++++++++++++++++++++++++++++++ fs/btrfs/free-space-cache.h | 4 +++ fs/btrfs/space-info.c | 13 ++++++---- fs/btrfs/space-info.h | 4 ++- fs/btrfs/sysfs.c | 2 ++ fs/btrfs/zoned.c | 22 ++++++++++++++++ fs/btrfs/zoned.h | 2 ++ 10 files changed, 118 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 3ce685a10631..324a1ef1bf04 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1080,12 +1080,15 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, WARN_ON(block_group->space_info->total_bytes < block_group->length); WARN_ON(block_group->space_info->bytes_readonly - < block_group->length); + < block_group->length - block_group->zone_unusable); + WARN_ON(block_group->space_info->bytes_zone_unusable + < block_group->zone_unusable); WARN_ON(block_group->space_info->disk_total < block_group->length * factor); } block_group->space_info->total_bytes -= block_group->length; - block_group->space_info->bytes_readonly -= block_group->length; + block_group->space_info->bytes_readonly -= + (block_group->length - block_group->zone_unusable); block_group->space_info->disk_total -= block_group->length * factor; spin_unlock(&block_group->space_info->lock); @@ -1229,7 +1232,7 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force) } num_bytes = cache->length - cache->reserved - cache->pinned - - cache->bytes_super - cache->used; + cache->bytes_super - cache->zone_unusable - cache->used; /* * Data never overcommits, even in mixed mode, so do just the straight @@ -1983,6 +1986,8 @@ static int read_one_block_group(struct btrfs_fs_info *info, btrfs_free_excluded_extents(cache); } + 
btrfs_calc_zone_unusable(cache); + ret = btrfs_add_block_group_cache(info, cache); if (ret) { btrfs_remove_free_space_cache(cache); @@ -1990,7 +1995,8 @@ static int read_one_block_group(struct btrfs_fs_info *info, } trace_btrfs_add_block_group(info, cache, 0); btrfs_update_space_info(info, cache->flags, cache->length, - cache->used, cache->bytes_super, &space_info); + cache->used, cache->bytes_super, + cache->zone_unusable, &space_info); cache->space_info = space_info; @@ -2204,7 +2210,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used, */ trace_btrfs_add_block_group(fs_info, cache, 1); btrfs_update_space_info(fs_info, cache->flags, size, bytes_used, - cache->bytes_super, &cache->space_info); + cache->bytes_super, 0, &cache->space_info); btrfs_update_global_block_rsv(fs_info); link_block_group(cache); @@ -2312,7 +2318,8 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group *cache) spin_lock(&cache->lock); if (!--cache->ro) { num_bytes = cache->length - cache->reserved - - cache->pinned - cache->bytes_super - cache->used; + cache->pinned - cache->bytes_super - + cache->zone_unusable - cache->used; sinfo->bytes_readonly -= num_bytes; list_del_init(&cache->ro_list); } diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 14e3043c9ce7..5be47f4bfea7 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -189,6 +189,7 @@ struct btrfs_block_group { * allocation. This is used only with ZONED mode enabled. 
*/ u64 alloc_offset; + u64 zone_unusable; }; static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e9eedc053fc5..4f486277fb6e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -34,6 +34,7 @@ #include "block-group.h" #include "discard.h" #include "rcu-string.h" +#include "zoned.h" #undef SCRAMBLE_DELAYED_REFS @@ -2790,9 +2791,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info, cache = btrfs_lookup_block_group(fs_info, start); BUG_ON(!cache); /* Logic error */ - cluster = fetch_cluster_info(fs_info, - cache->space_info, - &empty_cluster); + if (!btrfs_fs_incompat(fs_info, ZONED)) + cluster = fetch_cluster_info(fs_info, + cache->space_info, + &empty_cluster); + empty_cluster <<= 1; } @@ -2829,7 +2832,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info, space_info->max_extent_size = 0; percpu_counter_add_batch(&space_info->total_bytes_pinned, -len, BTRFS_TOTAL_BYTES_PINNED_BATCH); - if (cache->ro) { + if (btrfs_fs_incompat(fs_info, ZONED)) { + /* need reset before reusing in zoned Block Group */ + space_info->bytes_zone_unusable += len; + readonly = true; + } else if (cache->ro) { space_info->bytes_readonly += len; readonly = true; } diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index dc82fd0c80cb..7701b39b4d57 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2470,6 +2470,8 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, int ret = 0; u64 filter_bytes = bytes; + ASSERT(!btrfs_fs_incompat(fs_info, ZONED)); + info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS); if (!info) return -ENOMEM; @@ -2527,11 +2529,44 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, return ret; } +int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group, + u64 bytenr, u64 size, bool used) +{ + struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl; + u64 
offset = bytenr - block_group->start; + u64 to_free, to_unusable; + + spin_lock(&ctl->tree_lock); + if (!used) + to_free = size; + else if (offset >= block_group->alloc_offset) + to_free = size; + else if (offset + size <= block_group->alloc_offset) + to_free = 0; + else + to_free = offset + size - block_group->alloc_offset; + to_unusable = size - to_free; + + ctl->free_space += to_free; + block_group->zone_unusable += to_unusable; + spin_unlock(&ctl->tree_lock); + if (!used) { + spin_lock(&block_group->lock); + block_group->alloc_offset -= size; + spin_unlock(&block_group->lock); + } + return 0; +} + int btrfs_add_free_space(struct btrfs_block_group *block_group, u64 bytenr, u64 size) { enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED; + if (btrfs_fs_incompat(block_group->fs_info, ZONED)) + return __btrfs_add_free_space_zoned(block_group, bytenr, size, + true); + if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC)) trim_state = BTRFS_TRIM_STATE_TRIMMED; @@ -2540,6 +2575,16 @@ int btrfs_add_free_space(struct btrfs_block_group *block_group, bytenr, size, trim_state); } +int btrfs_add_free_space_unused(struct btrfs_block_group *block_group, + u64 bytenr, u64 size) +{ + if (btrfs_fs_incompat(block_group->fs_info, ZONED)) + return __btrfs_add_free_space_zoned(block_group, bytenr, size, + false); + + return btrfs_add_free_space(block_group, bytenr, size); +} + /* * This is a subtle distinction because when adding free space back in general, * we want it to be added as untrimmed for async. 
But in the case where we add @@ -2550,6 +2595,10 @@ int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group, { enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED; + if (btrfs_fs_incompat(block_group->fs_info, ZONED)) + return __btrfs_add_free_space_zoned(block_group, bytenr, size, + true); + if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC) || btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC)) trim_state = BTRFS_TRIM_STATE_TRIMMED; @@ -2567,6 +2616,9 @@ int btrfs_remove_free_space(struct btrfs_block_group *block_group, int ret; bool re_search = false; + if (btrfs_fs_incompat(block_group->fs_info, ZONED)) + return 0; + spin_lock(&ctl->tree_lock); again: diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h index e3d5e0ad8f8e..7081216257a8 100644 --- a/fs/btrfs/free-space-cache.h +++ b/fs/btrfs/free-space-cache.h @@ -114,8 +114,12 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, struct btrfs_free_space_ctl *ctl, u64 bytenr, u64 size, enum btrfs_trim_state trim_state); +int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group, + u64 bytenr, u64 size, bool used); int btrfs_add_free_space(struct btrfs_block_group *block_group, u64 bytenr, u64 size); +int btrfs_add_free_space_unused(struct btrfs_block_group *block_group, + u64 bytenr, u64 size); int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group, u64 bytenr, u64 size); int btrfs_remove_free_space(struct btrfs_block_group *block_group, diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c index 475968ccbd1d..bcf7c41746d8 100644 --- a/fs/btrfs/space-info.c +++ b/fs/btrfs/space-info.c @@ -163,6 +163,7 @@ u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info, ASSERT(s_info); return s_info->bytes_used + s_info->bytes_reserved + s_info->bytes_pinned + s_info->bytes_readonly + + s_info->bytes_zone_unusable + (may_use_included ? 
s_info->bytes_may_use : 0); } @@ -259,7 +260,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info) void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, u64 total_bytes, u64 bytes_used, - u64 bytes_readonly, + u64 bytes_readonly, u64 bytes_zone_unusable, struct btrfs_space_info **space_info) { struct btrfs_space_info *found; @@ -275,6 +276,7 @@ void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, found->bytes_used += bytes_used; found->disk_used += bytes_used * factor; found->bytes_readonly += bytes_readonly; + found->bytes_zone_unusable += bytes_zone_unusable; if (total_bytes > 0) found->full = 0; btrfs_try_granting_tickets(info, found); @@ -433,10 +435,10 @@ static void __btrfs_dump_space_info(struct btrfs_fs_info *fs_info, info->total_bytes - btrfs_space_info_used(info, true), info->full ? "" : "not "); btrfs_info(fs_info, - "space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu", + "space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu zone_unusable=%llu", info->total_bytes, info->bytes_used, info->bytes_pinned, info->bytes_reserved, info->bytes_may_use, - info->bytes_readonly); + info->bytes_readonly, info->bytes_zone_unusable); DUMP_BLOCK_RSV(fs_info, global_block_rsv); DUMP_BLOCK_RSV(fs_info, trans_block_rsv); @@ -465,9 +467,10 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info, list_for_each_entry(cache, &info->block_groups[index], list) { spin_lock(&cache->lock); btrfs_info(fs_info, - "block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %s", + "block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %llu zone_unusable %s", cache->start, cache->length, cache->used, cache->pinned, - cache->reserved, cache->ro ? "[readonly]" : ""); + cache->reserved, cache->zone_unusable, + cache->ro ? 
"[readonly]" : ""); spin_unlock(&cache->lock); btrfs_dump_free_space(cache, bytes); } diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h index c3c64019950a..3799b703f0eb 100644 --- a/fs/btrfs/space-info.h +++ b/fs/btrfs/space-info.h @@ -17,6 +17,8 @@ struct btrfs_space_info { u64 bytes_may_use; /* number of bytes that may be used for delalloc/allocations */ u64 bytes_readonly; /* total bytes that are read only */ + u64 bytes_zone_unusable; /* total bytes that are unusable until + resetting the device zone */ u64 max_extent_size; /* This will hold the maximum extent size of the space info if we had an ENOSPC in the @@ -119,7 +121,7 @@ DECLARE_SPACE_INFO_UPDATE(bytes_pinned, "pinned"); int btrfs_init_space_info(struct btrfs_fs_info *fs_info); void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, u64 total_bytes, u64 bytes_used, - u64 bytes_readonly, + u64 bytes_readonly, u64 bytes_zone_unusable, struct btrfs_space_info **space_info); struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info, u64 flags); diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c index 38c7a57789d8..1709f5e0e375 100644 --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -626,6 +626,7 @@ SPACE_INFO_ATTR(bytes_pinned); SPACE_INFO_ATTR(bytes_reserved); SPACE_INFO_ATTR(bytes_may_use); SPACE_INFO_ATTR(bytes_readonly); +SPACE_INFO_ATTR(bytes_zone_unusable); SPACE_INFO_ATTR(disk_used); SPACE_INFO_ATTR(disk_total); BTRFS_ATTR(space_info, total_bytes_pinned, @@ -639,6 +640,7 @@ static struct attribute *space_info_attrs[] = { BTRFS_ATTR_PTR(space_info, bytes_reserved), BTRFS_ATTR_PTR(space_info, bytes_may_use), BTRFS_ATTR_PTR(space_info, bytes_readonly), + BTRFS_ATTR_PTR(space_info, bytes_zone_unusable), BTRFS_ATTR_PTR(space_info, disk_used), BTRFS_ATTR_PTR(space_info, disk_total), BTRFS_ATTR_PTR(space_info, total_bytes_pinned), diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index ca090a5cdc6e..68f8224d74c3 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ 
-1006,3 +1006,25 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) return ret; } + +void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) +{ + u64 unusable, free; + + if (!btrfs_fs_incompat(cache->fs_info, ZONED)) + return; + + WARN_ON(cache->bytes_super != 0); + unusable = cache->alloc_offset - cache->used; + free = cache->length - cache->alloc_offset; + /* we only need ->free_space in ALLOC_SEQ BGs */ + cache->last_byte_to_unpin = (u64)-1; + cache->cached = BTRFS_CACHE_FINISHED; + cache->free_space_ctl->free_space = free; + cache->zone_unusable = unusable; + /* + * Should not have any excluded extents. Just + * in case, though. + */ + btrfs_free_excluded_extents(cache); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 1fd7cad19e18..3e3eff8dd0b4 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -46,6 +46,7 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, u64 length, u64 *bytes); int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size); int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache); +void btrfs_calc_zone_unusable(struct btrfs_block_group *cache); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -113,6 +114,7 @@ static inline int btrfs_load_block_group_zone_info( { return 0; } +static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { } #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Fri Sep 11 12:32:36 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11771151 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Hannes Reinecke , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 16/39] btrfs: do sequential extent allocation in ZONED mode Date: Fri, 11 Sep 2020 21:32:36 +0900 Message-Id: <20200911123259.3782926-17-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This commit implements a sequential extent allocator for ZONED mode. The allocator only needs to check whether there is enough space left in the block group, since it never manages bitmaps or clusters. This commit also adds ASSERTs to the corresponding functions. Strictly speaking, with zone append writing it is unnecessary to track the allocation offset; checking space availability would be enough. But by tracking the offset and returning the offset as the allocated region, we can skip modifying ordered extents and checksum information when there is no IO reordering.
Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 4 ++ fs/btrfs/extent-tree.c | 82 ++++++++++++++++++++++++++++++++++--- fs/btrfs/free-space-cache.c | 6 +++ 3 files changed, 86 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 324a1ef1bf04..9df83e687b92 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -683,6 +683,10 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only struct btrfs_caching_control *caching_ctl; int ret = 0; + /* Allocator for ZONED btrfs do not use the cache at all */ + if (btrfs_fs_incompat(fs_info, ZONED)) + return 0; + caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS); if (!caching_ctl) return -ENOMEM; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4f486277fb6e..5f86d552c6cb 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3412,6 +3412,7 @@ btrfs_release_block_group(struct btrfs_block_group *cache, enum btrfs_extent_allocation_policy { BTRFS_EXTENT_ALLOC_CLUSTERED, + BTRFS_EXTENT_ALLOC_ZONED, }; /* @@ -3664,6 +3665,55 @@ static int do_allocation_clustered(struct btrfs_block_group *block_group, return find_free_extent_unclustered(block_group, ffe_ctl); } +/* + * Simple allocator for sequential only block group. It only allows + * sequential allocation. No need to play with trees. This function + * also reserve the bytes as in btrfs_add_reserved_bytes. 
+ */ +static int do_allocation_zoned(struct btrfs_block_group *block_group, + struct find_free_extent_ctl *ffe_ctl, + struct btrfs_block_group **bg_ret) +{ + struct btrfs_space_info *space_info = block_group->space_info; + struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl; + u64 start = block_group->start; + u64 num_bytes = ffe_ctl->num_bytes; + u64 avail; + int ret = 0; + + ASSERT(btrfs_fs_incompat(block_group->fs_info, ZONED)); + + spin_lock(&space_info->lock); + spin_lock(&block_group->lock); + + if (block_group->ro) { + ret = 1; + goto out; + } + + avail = block_group->length - block_group->alloc_offset; + if (avail < num_bytes) { + ffe_ctl->max_extent_size = avail; + ret = 1; + goto out; + } + + ffe_ctl->found_offset = start + block_group->alloc_offset; + block_group->alloc_offset += num_bytes; + spin_lock(&ctl->tree_lock); + ctl->free_space -= num_bytes; + spin_unlock(&ctl->tree_lock); + + ASSERT(IS_ALIGNED(ffe_ctl->found_offset, + block_group->fs_info->stripesize)); + ffe_ctl->search_start = ffe_ctl->found_offset; + +out: + spin_unlock(&block_group->lock); + spin_unlock(&space_info->lock); + return ret; +} + static int do_allocation(struct btrfs_block_group *block_group, struct find_free_extent_ctl *ffe_ctl, struct btrfs_block_group **bg_ret) @@ -3671,6 +3721,8 @@ static int do_allocation(struct btrfs_block_group *block_group, switch (ffe_ctl->policy) { case BTRFS_EXTENT_ALLOC_CLUSTERED: return do_allocation_clustered(block_group, ffe_ctl, bg_ret); + case BTRFS_EXTENT_ALLOC_ZONED: + return do_allocation_zoned(block_group, ffe_ctl, bg_ret); default: BUG(); } @@ -3685,6 +3737,9 @@ static void release_block_group(struct btrfs_block_group *block_group, ffe_ctl->retry_clustered = false; ffe_ctl->retry_unclustered = false; break; + case BTRFS_EXTENT_ALLOC_ZONED: + /* nothing to do */ + break; default: BUG(); } @@ -3713,6 +3768,9 @@ static void found_extent(struct find_free_extent_ctl *ffe_ctl, case BTRFS_EXTENT_ALLOC_CLUSTERED: 
found_extent_clustered(ffe_ctl, ins); break; + case BTRFS_EXTENT_ALLOC_ZONED: + /* nothing to do */ + break; default: BUG(); } @@ -3728,6 +3786,9 @@ static int chunk_allocation_failed(struct find_free_extent_ctl *ffe_ctl) */ ffe_ctl->loop = LOOP_NO_EMPTY_SIZE; return 0; + case BTRFS_EXTENT_ALLOC_ZONED: + /* give up here */ + return -ENOSPC; default: BUG(); } @@ -3896,6 +3957,9 @@ static int prepare_allocation(struct btrfs_fs_info *fs_info, case BTRFS_EXTENT_ALLOC_CLUSTERED: return prepare_allocation_clustered(fs_info, ffe_ctl, space_info, ins); + case BTRFS_EXTENT_ALLOC_ZONED: + /* nothing to do */ + return 0; default: BUG(); } @@ -3958,6 +4022,9 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, ffe_ctl.last_ptr = NULL; ffe_ctl.use_cluster = true; + if (btrfs_fs_incompat(fs_info, ZONED)) + ffe_ctl.policy = BTRFS_EXTENT_ALLOC_ZONED; + ins->type = BTRFS_EXTENT_ITEM_KEY; ins->objectid = 0; ins->offset = 0; @@ -4100,20 +4167,23 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, /* move on to the next group */ if (ffe_ctl.search_start + num_bytes > block_group->start + block_group->length) { - btrfs_add_free_space(block_group, ffe_ctl.found_offset, - num_bytes); + btrfs_add_free_space_unused(block_group, + ffe_ctl.found_offset, + num_bytes); goto loop; } if (ffe_ctl.found_offset < ffe_ctl.search_start) - btrfs_add_free_space(block_group, ffe_ctl.found_offset, - ffe_ctl.search_start - ffe_ctl.found_offset); + btrfs_add_free_space_unused(block_group, + ffe_ctl.found_offset, + ffe_ctl.search_start - ffe_ctl.found_offset); ret = btrfs_add_reserved_bytes(block_group, ram_bytes, num_bytes, delalloc); if (ret == -EAGAIN) { - btrfs_add_free_space(block_group, ffe_ctl.found_offset, - num_bytes); + btrfs_add_free_space_unused(block_group, + ffe_ctl.found_offset, + num_bytes); goto loop; } btrfs_inc_block_group_reservations(block_group); diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 7701b39b4d57..2df8ffd1ef8b 
100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2906,6 +2906,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group, u64 align_gap_len = 0; enum btrfs_trim_state align_gap_trim_state = BTRFS_TRIM_STATE_UNTRIMMED; + ASSERT(!btrfs_fs_incompat(block_group->fs_info, ZONED)); + spin_lock(&ctl->tree_lock); entry = find_free_space(ctl, &offset, &bytes_search, block_group->full_stripe_len, max_extent_size); @@ -3037,6 +3039,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group, struct rb_node *node; u64 ret = 0; + ASSERT(!btrfs_fs_incompat(block_group->fs_info, ZONED)); + spin_lock(&cluster->lock); if (bytes > cluster->max_size) goto out; @@ -3813,6 +3817,8 @@ int btrfs_trim_block_group(struct btrfs_block_group *block_group, int ret; u64 rem = 0; + ASSERT(!btrfs_fs_incompat(block_group->fs_info, ZONED)); + *trimmed = 0; spin_lock(&block_group->lock); From patchwork Fri Sep 11 12:32:37 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11771147 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Hannes Reinecke , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 17/39] btrfs: reset zones of unused block groups Date: Fri, 11 Sep 2020 21:32:37 +0900 Message-Id: <20200911123259.3782926-18-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org For a ZONED volume, a block group maps to a zone of the device. For deleted unused block groups, the zone of the block group can be reset to rewind the zone's write pointer to the start of the zone. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 8 ++++++-- fs/btrfs/extent-tree.c | 17 ++++++++++++----- fs/btrfs/zoned.h | 16 ++++++++++++++++ 3 files changed, 34 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 9df83e687b92..fbc22f0a6744 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1468,8 +1468,12 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) if (!async_trim_enabled && btrfs_test_opt(fs_info, DISCARD_ASYNC)) goto flip_async; - /* DISCARD can flip during remount */ - trimming = btrfs_test_opt(fs_info, DISCARD_SYNC); + /* + * DISCARD can flip during remount. In ZONED mode, we need + * to reset sequential required zones. + */ + trimming = btrfs_test_opt(fs_info, DISCARD_SYNC) || + btrfs_fs_incompat(fs_info, ZONED); /* Implicit trim during transaction commit. 
*/ if (trimming) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 5f86d552c6cb..7fe5b6e3b207 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -1317,6 +1317,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, stripe = bbio->stripes; for (i = 0; i < bbio->num_stripes; i++, stripe++) { + struct btrfs_device *dev = stripe->dev; + u64 physical = stripe->physical; + u64 length = stripe->length; u64 bytes; struct request_queue *req_q; @@ -1324,14 +1327,18 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, ASSERT(btrfs_test_opt(fs_info, DEGRADED)); continue; } + req_q = bdev_get_queue(stripe->dev->bdev); - if (!blk_queue_discard(req_q)) + /* zone reset in ZONED mode */ + if (btrfs_can_zone_reset(dev, physical, length)) + ret = btrfs_reset_device_zone(dev, physical, + length, &bytes); + else if (blk_queue_discard(req_q)) + ret = btrfs_issue_discard(dev->bdev, physical, + length, &bytes); + else continue; - ret = btrfs_issue_discard(stripe->dev->bdev, - stripe->physical, - stripe->length, - &bytes); if (!ret) { discarded_bytes += bytes; } else if (ret != -EOPNOTSUPP) { diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 3e3eff8dd0b4..ccfb63a455dc 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -201,4 +201,20 @@ static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos) return ALIGN(pos, device->zone_info->zone_size); } +static inline bool btrfs_can_zone_reset(struct btrfs_device *device, + u64 physical, u64 length) +{ + u64 zone_size; + + if (!btrfs_dev_is_sequential(device, physical)) + return false; + + zone_size = device->zone_info->zone_size; + if (!IS_ALIGNED(physical, zone_size) || + !IS_ALIGNED(length, zone_size)) + return false; + + return true; +} + #endif From patchwork Fri Sep 11 12:32:38 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11771141 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Hannes Reinecke , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 18/39] btrfs: redirty released extent buffers in ZONED mode Date: Fri, 11 Sep 2020 21:32:38 +0900 Message-Id: <20200911123259.3782926-19-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Tree manipulating operations like merging nodes often release once-allocated tree nodes. Btrfs cleans such nodes so that pages in the node are not uselessly written out. On ZONED volumes, however, this optimization blocks subsequent IOs, because cancelling the write-out of the freed blocks breaks the sequential write sequence expected by the device. This patch introduces a list of clean and unwritten extent buffers that have been released in a transaction. Btrfs redirties these buffers so that btree_write_cache_pages() can send proper bios to the devices. It also clears the entire content of the extent buffer so as not to confuse raw block scanners such as btrfsck. 
Since the cleared content makes csum_dirty_buffer() complain about a bytenr mismatch, skip the check and checksum for such buffers using the newly introduced buffer flag EXTENT_BUFFER_NO_CHECK. Signed-off-by: Naohiro Aota --- fs/btrfs/disk-io.c | 8 ++++++++ fs/btrfs/extent-tree.c | 12 +++++++++++- fs/btrfs/extent_io.c | 3 +++ fs/btrfs/extent_io.h | 2 ++ fs/btrfs/transaction.c | 10 ++++++++++ fs/btrfs/transaction.h | 3 +++ fs/btrfs/tree-log.c | 6 ++++++ fs/btrfs/zoned.c | 37 +++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 6 ++++++ 9 files changed, 86 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 362799403285..d766cb0e1a52 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -510,6 +510,12 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page) return 0; found_start = btrfs_header_bytenr(eb); + + if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)) { + WARN_ON(found_start != 0); + return 0; + } + /* * Please do not consolidate these warnings into a single if. * It is useful to know what went wrong. 
@@ -4689,6 +4695,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans, EXTENT_DIRTY); btrfs_destroy_pinned_extent(fs_info, &cur_trans->pinned_extents); + btrfs_free_redirty_list(cur_trans); + cur_trans->state = TRANS_STATE_COMPLETED; wake_up(&cur_trans->commit_wait); } diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 7fe5b6e3b207..81b9b58d7a9d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3271,8 +3271,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) { ret = check_ref_cleanup(trans, buf->start); - if (!ret) + if (!ret) { + btrfs_redirty_list_add(trans->transaction, buf); goto out; + } } pin = 0; @@ -3284,6 +3286,13 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, goto out; } + if (btrfs_fs_incompat(fs_info, ZONED)) { + btrfs_redirty_list_add(trans->transaction, buf); + pin_down_extent(trans, cache, buf->start, buf->len, 1); + btrfs_put_block_group(cache); + goto out; + } + WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)); btrfs_add_free_space(cache, buf->start, buf->len); @@ -4615,6 +4624,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root, btrfs_tree_lock(buf); btrfs_clean_tree_block(buf); clear_bit(EXTENT_BUFFER_STALE, &buf->bflags); + clear_bit(EXTENT_BUFFER_NO_CHECK, &buf->bflags); btrfs_set_lock_blocking_write(buf); set_extent_buffer_uptodate(buf); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index c15ab6c1897f..53bac37bc4ac 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -24,6 +24,7 @@ #include "rcu-string.h" #include "backref.h" #include "disk-io.h" +#include "zoned.h" static struct kmem_cache *extent_state_cache; static struct kmem_cache *extent_buffer_cache; @@ -4994,6 +4995,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start, btrfs_leak_debug_add(&fs_info->eb_leak_lock, &eb->leak_list, &fs_info->allocated_ebs); +
INIT_LIST_HEAD(&eb->release_list); spin_lock_init(&eb->refs_lock); atomic_set(&eb->refs, 1); @@ -5756,6 +5758,7 @@ void write_extent_buffer(const struct extent_buffer *eb, const void *srcv, WARN_ON(start > eb->len); WARN_ON(start + len > eb->start + eb->len); + WARN_ON(test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)); offset = offset_in_page(start); diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 30794ae58498..29dbb21a5c9a 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -30,6 +30,7 @@ enum { EXTENT_BUFFER_IN_TREE, /* write IO error */ EXTENT_BUFFER_WRITE_ERR, + EXTENT_BUFFER_NO_CHECK, }; /* these are flags for __process_pages_contig */ @@ -119,6 +120,7 @@ struct extent_buffer { */ wait_queue_head_t read_lock_wq; struct page *pages[INLINE_EXTENT_BUFFER_PAGES]; + struct list_head release_list; #ifdef CONFIG_BTRFS_DEBUG int spinning_writers; atomic_t spinning_readers; diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 951b10364fd0..1414b3ade2db 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -22,6 +22,7 @@ #include "qgroup.h" #include "block-group.h" #include "space-info.h" +#include "zoned.h" #define BTRFS_ROOT_TRANS_TAG 0 @@ -334,6 +335,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info, spin_lock_init(&cur_trans->dirty_bgs_lock); INIT_LIST_HEAD(&cur_trans->deleted_bgs); spin_lock_init(&cur_trans->dropped_roots_lock); + INIT_LIST_HEAD(&cur_trans->releasing_ebs); + spin_lock_init(&cur_trans->releasing_ebs_lock); list_add_tail(&cur_trans->list, &fs_info->trans_list); extent_io_tree_init(fs_info, &cur_trans->dirty_pages, IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode); @@ -2334,6 +2337,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) goto scrub_continue; } + /* + * At this point, we should have written all the tree blocks + * allocated in this transaction. So it's now safe to free the + * redirtied extent buffers.
+ */ + btrfs_free_redirty_list(cur_trans); + ret = write_all_supers(fs_info, 0); /* * the super is written, we can safely allow the tree-loggers diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index d60b055b8695..d274b6d9798c 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -85,6 +85,9 @@ struct btrfs_transaction { spinlock_t dropped_roots_lock; struct btrfs_delayed_ref_root delayed_refs; struct btrfs_fs_info *fs_info; + + spinlock_t releasing_ebs_lock; + struct list_head releasing_ebs; }; #define __TRANS_FREEZABLE (1U << 0) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 39da9db35278..4b6a68a81eac 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -20,6 +20,7 @@ #include "inode-map.h" #include "block-group.h" #include "space-info.h" +#include "zoned.h" /* magic values for the inode_only field in btrfs_log_inode: * @@ -2746,6 +2747,8 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans, free_extent_buffer(next); return ret; } + btrfs_redirty_list_add( + trans->transaction, next); } else { if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &next->bflags)) clear_extent_buffer_dirty(next); @@ -3281,6 +3284,9 @@ static void free_log_tree(struct btrfs_trans_handle *trans, clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1, EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT); extent_io_tree_release(&log->log_csum_range); + + if (trans && log->node) + btrfs_redirty_list_add(trans->transaction, log->node); btrfs_put_root(log); } diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 68f8224d74c3..855acbc61d47 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -18,6 +18,7 @@ #include "rcu-string.h" #include "disk-io.h" #include "block-group.h" +#include "transaction.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 @@ -1028,3 +1029,39 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) */ btrfs_free_excluded_extents(cache); } + 
+void btrfs_redirty_list_add(struct btrfs_transaction *trans, + struct extent_buffer *eb) +{ + struct btrfs_fs_info *fs_info = eb->fs_info; + + if (!btrfs_fs_incompat(fs_info, ZONED) || + btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN) || + !list_empty(&eb->release_list)) + return; + + set_extent_buffer_dirty(eb); + set_extent_bits_nowait(&trans->dirty_pages, eb->start, + eb->start + eb->len - 1, EXTENT_DIRTY); + memzero_extent_buffer(eb, 0, eb->len); + set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags); + + spin_lock(&trans->releasing_ebs_lock); + list_add_tail(&eb->release_list, &trans->releasing_ebs); + spin_unlock(&trans->releasing_ebs_lock); + atomic_inc(&eb->refs); +} + +void btrfs_free_redirty_list(struct btrfs_transaction *trans) +{ + spin_lock(&trans->releasing_ebs_lock); + while (!list_empty(&trans->releasing_ebs)) { + struct extent_buffer *eb; + + eb = list_first_entry(&trans->releasing_ebs, + struct extent_buffer, release_list); + list_del_init(&eb->release_list); + free_extent_buffer(eb); + } + spin_unlock(&trans->releasing_ebs_lock); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index ccfb63a455dc..cdb84c758a61 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -47,6 +47,9 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size); int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache); void btrfs_calc_zone_unusable(struct btrfs_block_group *cache); +void btrfs_redirty_list_add(struct btrfs_transaction *trans, + struct extent_buffer *eb); +void btrfs_free_redirty_list(struct btrfs_transaction *trans); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -115,6 +118,9 @@ static inline int btrfs_load_block_group_zone_info( return 0; } static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { } +static inline void 
btrfs_redirty_list_add(struct btrfs_transaction *trans, + struct extent_buffer *eb) { } +static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { } #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)

From patchwork Fri Sep 11 12:32:39 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11770423
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 19/39] btrfs: limit bio size under max_zone_append_size
Date: Fri, 11 Sep 2020 21:32:39 +0900
Message-Id: <20200911123259.3782926-20-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

A zone append write command cannot exceed the device's max zone append size. This commit therefore limits page merging into a bio so the bio never grows past that size.
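[Editor's note] The merge restriction above can be sketched as a small userspace model. The structure and function names below are hypothetical stand-ins for the kernel's struct bio and the check added to submit_extent_page(); a zero limit stands for a non-zoned filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of a bio: only the fields the check needs. */
struct bio_model {
	uint64_t bi_size;	/* bytes already in the bio */
	bool is_write;
};

/*
 * Decide whether another page may be merged into the bio.  Mirrors the
 * patch's logic: the restriction only applies to writes on a zoned
 * filesystem (non-zero max_zone_append_size), and the merged size must
 * not exceed that limit.
 */
static bool can_merge_zone_append(const struct bio_model *bio,
				  uint64_t page_size,
				  uint64_t max_zone_append_size)
{
	if (max_zone_append_size == 0)	/* not a zoned filesystem */
		return true;
	if (!bio->is_write)		/* reads are unrestricted */
		return true;
	return bio->bi_size + page_size <= max_zone_append_size;
}
```

With a 128 KiB limit, a bio holding 31 pages of 4 KiB can still take one more page, but a bio already at 126 KiB cannot.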
Signed-off-by: Naohiro Aota --- fs/btrfs/extent_io.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 53bac37bc4ac..63cdf67e6885 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3041,6 +3041,7 @@ static int submit_extent_page(unsigned int opf, size_t page_size = min_t(size_t, size, PAGE_SIZE); sector_t sector = offset >> 9; struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree; + struct btrfs_fs_info *fs_info = tree->fs_info; ASSERT(bio_ret); @@ -3058,6 +3059,11 @@ static int submit_extent_page(unsigned int opf, if (btrfs_bio_fits_in_stripe(page, page_size, bio, bio_flags)) can_merge = false; + if (fs_info->max_zone_append_size && + bio_op(bio) == REQ_OP_WRITE && + bio->bi_iter.bi_size + size > fs_info->max_zone_append_size) + can_merge = false; + if (prev_bio_flags != bio_flags || !contig || !can_merge || force_bio_submit || bio_add_page(bio, page, page_size, pg_offset) < page_size) {

From patchwork Fri Sep 11 12:32:40 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11771143
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 20/39] btrfs: limit ordered extent size to max_zone_append_size
Date: Fri, 11 Sep 2020 21:32:40 +0900
Message-Id: <20200911123259.3782926-21-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

With zone append writing, the logical address must be modified to match the physical address actually written. If multiple bios serve one ordered extent, those bios can land in non-contiguous physical regions, which would produce a non-contiguous logical region. Handling such a case is troublesome, so one ordered extent must be served by one bio, which is limited to max_zone_append_size. Thus, this commit limits the size of an ordered extent as well.

This size limitation causes file extent fragmentation. In the future, contiguous ordered extents can be merged as an optimization.

Signed-off-by: Naohiro Aota --- fs/btrfs/extent_io.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 63cdf67e6885..c21d1dbe314e 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1865,6 +1865,7 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode, u64 *end) { struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); u64 max_bytes = BTRFS_MAX_EXTENT_SIZE; u64 delalloc_start; u64 delalloc_end; @@ -1873,6 +1874,10 @@ noinline_for_stack bool find_lock_delalloc_range(struct inode *inode, int ret; int loops = 0; + if (fs_info && fs_info->max_zone_append_size) + max_bytes = ALIGN_DOWN(fs_info->max_zone_append_size, + PAGE_SIZE); + again: /* step one, find a bunch of delalloc bytes starting at start */ delalloc_start = *start;

From patchwork Fri Sep 11 12:32:41 2020 X-Patchwork-Submitter:
Naohiro Aota
X-Patchwork-Id: 11771145
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 21/39] btrfs: extend btrfs_rmap_block for specifying a device
Date: Fri, 11 Sep 2020 21:32:41 +0900
Message-Id: <20200911123259.3782926-22-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

btrfs_rmap_block currently reverse-maps the physical addresses on all devices to logical addresses. This commit extends the function so it can restrict the mapping to a specified device; passing NULL as the device still queries all devices. It also exports the function for later use.
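[Editor's note] The per-stripe arithmetic that __btrfs_rmap_block performs for a simple single-copy chunk can be modeled in userspace as below. The function name and parameters are illustrative only; the RAID0/RAID10 stripe-number adjustments and the duplicate-address filtering are omitted. Note how the remainder inside a stripe (the `offset` this patch series adds) is carried into the logical address.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a physical address back to a logical (chunk-relative) address for a
 * single-stripe profile.  The physical offset inside the stripe is split
 * into a stripe number and a remainder, and the remainder is added back so
 * that addresses that are not stripe-aligned round-trip correctly.
 */
static uint64_t rmap_single(uint64_t chunk_start, uint64_t stripe_physical,
			    uint64_t stripe_len, uint64_t physical)
{
	uint64_t stripe_nr = (physical - stripe_physical) / stripe_len;
	uint64_t offset = (physical - stripe_physical) % stripe_len;

	return chunk_start + stripe_nr * stripe_len + offset;
}
```

For example, with a chunk at 1 MiB, a stripe starting at physical 0, and a 64 KiB stripe length, physical address 64 KiB + 4 KiB maps to 1 MiB + 64 KiB + 4 KiB; without the remainder term it would incorrectly snap to the stripe boundary.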
Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 23 ++++++++++++++++++----- fs/btrfs/block-group.h | 3 +++ 2 files changed, 21 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index fbc22f0a6744..be5394c8ec3a 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1655,9 +1655,9 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags) * Used primarily to exclude those portions of a block group that contain super * block copies. */ -EXPORT_FOR_TESTS -int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, - u64 physical, u64 **logical, int *naddrs, int *stripe_len) +int __btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, + struct block_device *bdev, u64 physical, u64 **logical, + int *naddrs, int *stripe_len) { struct extent_map *em; struct map_lookup *map; @@ -1675,6 +1675,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, map = em->map_lookup; data_stripe_length = em->orig_block_len; io_stripe_size = map->stripe_len; + chunk_start = em->start; /* For RAID5/6 adjust to a full IO stripe length */ if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) @@ -1689,14 +1690,18 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, for (i = 0; i < map->num_stripes; i++) { bool already_inserted = false; u64 stripe_nr; + u64 offset; int j; if (!in_range(physical, map->stripes[i].physical, data_stripe_length)) continue; + if (bdev && map->stripes[i].dev->bdev != bdev) + continue; + stripe_nr = physical - map->stripes[i].physical; - stripe_nr = div64_u64(stripe_nr, map->stripe_len); + stripe_nr = div64_u64_rem(stripe_nr, map->stripe_len, &offset); if (map->type & BTRFS_BLOCK_GROUP_RAID10) { stripe_nr = stripe_nr * map->num_stripes + i; @@ -1710,7 +1715,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, * instead of map->stripe_len */ - bytenr = chunk_start + stripe_nr * io_stripe_size; + bytenr = chunk_start + 
stripe_nr * io_stripe_size + offset; /* Ensure we don't add duplicate addresses */ for (j = 0; j < nr; j++) { @@ -1732,6 +1737,14 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, return ret; } +EXPORT_FOR_TESTS +int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, + u64 physical, u64 **logical, int *naddrs, int *stripe_len) +{ + return __btrfs_rmap_block(fs_info, chunk_start, NULL, physical, logical, + naddrs, stripe_len); +} + static int exclude_super_stripes(struct btrfs_block_group *cache) { struct btrfs_fs_info *fs_info = cache->fs_info; diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 5be47f4bfea7..401e9bcefaec 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -275,6 +275,9 @@ void check_system_chunk(struct btrfs_trans_handle *trans, const u64 type); u64 btrfs_get_alloc_profile(struct btrfs_fs_info *fs_info, u64 orig_flags); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); int btrfs_free_block_groups(struct btrfs_fs_info *info); +int __btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, + struct block_device *bdev, u64 physical, u64 **logical, + int *naddrs, int *stripe_len); static inline u64 btrfs_data_alloc_profile(struct btrfs_fs_info *fs_info) {

From patchwork Fri Sep 11 12:32:42 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11771127
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 22/39] btrfs: use ZONE_APPEND write for ZONED btrfs
Date: Fri, 11 Sep 2020 21:32:42 +0900
Message-Id: <20200911123259.3782926-23-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

This commit enables zone append writing for zoned btrfs. Three parts are necessary. First, it modifies the bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjusts bi_sector to point to the beginning of the zone. Second, it records the returned physical address (and disk/partno) in the ordered extent in end_bio_extent_writepage(). Finally, it rewrites the logical addresses of the extent mapping and the checksum data according to the physical address (using __btrfs_rmap_block). If the returned address matches the originally allocated address, the rewriting process can be skipped.
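[Editor's note] The checksum relocation in the third step can be sketched as a tiny userspace model. The helper below mirrors the bytenr shift this patch applies to each checksum entry once the device reports where the zone append actually landed; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Shift a checksum entry's logical address by the same delta that moved
 * the ordered extent: from the originally allocated logical address to
 * the logical address corresponding to the reported physical location.
 * Unsigned arithmetic, so the two directions are handled separately.
 */
static uint64_t relocate_sum_bytenr(uint64_t sum_bytenr,
				    uint64_t orig_logical,
				    uint64_t new_logical)
{
	if (new_logical < orig_logical)
		return sum_bytenr - (orig_logical - new_logical);
	return sum_bytenr + (new_logical - orig_logical);
}
```

Every checksum entry belonging to the ordered extent is shifted by the same delta, so the entries keep their relative positions within the extent.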
Signed-off-by: Naohiro Aota --- fs/btrfs/extent_io.c | 4 +++ fs/btrfs/inode.c | 12 ++++++- fs/btrfs/ordered-data.c | 3 ++ fs/btrfs/ordered-data.h | 4 +++ fs/btrfs/volumes.c | 9 ++++++ fs/btrfs/zoned.c | 70 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 9 ++++++ 7 files changed, 110 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index c21d1dbe314e..00a07cefffeb 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2749,6 +2749,10 @@ static void end_bio_extent_writepage(struct bio *bio) u64 end; struct bvec_iter_all iter_all; + btrfs_record_physical_zoned(bio_iovec(bio).bv_page->mapping->host, + page_offset(bio_iovec(bio).bv_page) + bio_iovec(bio).bv_offset, + bio); + ASSERT(!bio_flagged(bio, BIO_CLONED)); bio_for_each_segment_all(bvec, bio, iter_all) { struct page *page = bvec->bv_page; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index ca0be689e7ad..7fe28a77f9b8 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -49,6 +49,7 @@ #include "delalloc-space.h" #include "block-group.h" #include "space-info.h" +#include "zoned.h" struct btrfs_iget_args { u64 ino; @@ -2198,7 +2199,13 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio, if (btrfs_is_free_space_inode(BTRFS_I(inode))) metadata = BTRFS_WQ_ENDIO_FREE_SPACE; - if (bio_op(bio) != REQ_OP_WRITE) { + if (bio_op(bio) == REQ_OP_WRITE && btrfs_fs_incompat(fs_info, ZONED)) { + /* use zone append writing */ + bio->bi_opf &= ~REQ_OP_MASK; + bio->bi_opf |= REQ_OP_ZONE_APPEND; + } + + if (bio_op(bio) != REQ_OP_WRITE && bio_op(bio) != REQ_OP_ZONE_APPEND) { ret = btrfs_bio_wq_end_io(fs_info, bio, metadata); if (ret) goto out; @@ -2594,6 +2601,9 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) bool clear_reserved_extent = true; unsigned int clear_bits; + if (ordered_extent->disk) + btrfs_rewrite_logical_zoned(ordered_extent); + start = ordered_extent->file_offset; end = start + 
ordered_extent->num_bytes - 1; diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index ebac13389e7e..3cb0d92a3bcf 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -199,6 +199,9 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset entry->compress_type = compress_type; entry->truncated_len = (u64)-1; entry->qgroup_rsv = ret; + entry->physical = (u64)-1; + entry->disk = NULL; + entry->partno = (u8)-1; if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE) set_bit(type, &entry->flags); diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index d61ea9c880a3..7872d566ae1b 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -118,6 +118,10 @@ struct btrfs_ordered_extent { struct completion completion; struct btrfs_work flush_work; struct list_head work_list; + + u64 physical; + struct gendisk *disk; + u8 partno; }; /* diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 086cd308e5b6..6337ce95a088 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6505,6 +6505,15 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio, btrfs_io_bio(bio)->device = dev; bio->bi_end_io = btrfs_end_bio; bio->bi_iter.bi_sector = physical >> 9; + /* + * For zone append writing, bi_sector must point the beginning of the + * zone + */ + if (bio_op(bio) == REQ_OP_ZONE_APPEND) { + u64 zone_start = round_down(physical, fs_info->zone_size); + + bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT; + } btrfs_debug_in_rcu(fs_info, "btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u", bio_op(bio), bio->bi_opf, (u64)bio->bi_iter.bi_sector, diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 855acbc61d47..1744e2649087 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -1065,3 +1065,73 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans) } spin_unlock(&trans->releasing_ebs_lock); } + +void btrfs_record_physical_zoned(struct inode 
*inode, u64 file_offset, + struct bio *bio) +{ + struct btrfs_ordered_extent *ordered; + struct bio_vec bvec = bio_iovec(bio); + u64 physical = ((u64)bio->bi_iter.bi_sector << SECTOR_SHIFT) + + bvec.bv_offset; + + if (bio_op(bio) != REQ_OP_ZONE_APPEND) + return; + + ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), file_offset); + if (WARN_ON(!ordered)) + return; + + ordered->physical = physical; + ordered->disk = bio->bi_disk; + ordered->partno = bio->bi_partno; + + btrfs_put_ordered_extent(ordered); +} + +void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) +{ + struct extent_map_tree *em_tree; + struct extent_map *em; + struct inode *inode = ordered->inode; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_ordered_sum *sum; + struct block_device *bdev; + u64 orig_logical = ordered->disk_bytenr; + u64 *logical = NULL; + int nr, stripe_len; + + bdev = bdget_disk(ordered->disk, ordered->partno); + if (WARN_ON(!bdev)) + return; + + if (WARN_ON(__btrfs_rmap_block(fs_info, orig_logical, bdev, + ordered->physical, &logical, &nr, + &stripe_len))) + goto out; + + WARN_ON(nr != 1); + + if (orig_logical == *logical) + goto out; + + ordered->disk_bytenr = *logical; + + em_tree = &BTRFS_I(inode)->extent_tree; + write_lock(&em_tree->lock); + em = search_extent_mapping(em_tree, ordered->file_offset, + ordered->num_bytes); + em->block_start = *logical; + free_extent_map(em); + write_unlock(&em_tree->lock); + + list_for_each_entry(sum, &ordered->list, list) { + if (*logical < orig_logical) + sum->bytenr -= orig_logical - *logical; + else + sum->bytenr += *logical - orig_logical; + } + +out: + kfree(logical); + bdput(bdev); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index cdb84c758a61..5f4bc746e3e2 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -50,6 +50,9 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache); void btrfs_redirty_list_add(struct btrfs_transaction *trans, struct extent_buffer *eb); void 
btrfs_free_redirty_list(struct btrfs_transaction *trans); +void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset, + struct bio *bio); +void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -121,6 +124,12 @@ static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { } static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans, struct extent_buffer *eb) { } static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { } +static inline void btrfs_record_physical_zoned(struct inode *inode, + u64 file_offset, struct bio *bio) +{ +} +static inline void +btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) { } #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Fri Sep 11 12:32:43 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11771131 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 23/39] btrfs: handle REQ_OP_ZONE_APPEND as writing Date: Fri, 11 Sep 2020 21:32:43 +0900 Message-Id:
<20200911123259.3782926-24-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org ZONED btrfs uses REQ_OP_ZONE_APPEND for bios going to the actual devices. Make btrfs_end_bio() and btrfs_op(), which handle these bios, aware of it. Signed-off-by: Naohiro Aota --- fs/btrfs/volumes.c | 3 ++- fs/btrfs/volumes.h | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 6337ce95a088..ca139c63f63c 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6453,7 +6453,8 @@ static void btrfs_end_bio(struct bio *bio) struct btrfs_device *dev = btrfs_io_bio(bio)->device; ASSERT(dev->bdev); - if (bio_op(bio) == REQ_OP_WRITE) + if (bio_op(bio) == REQ_OP_WRITE || + bio_op(bio) == REQ_OP_ZONE_APPEND) btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS); else if (!(bio->bi_opf & REQ_RAHEAD)) diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 88b1d59fbc12..fc03b386bb8c 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -410,6 +410,7 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio) case REQ_OP_DISCARD: return BTRFS_MAP_DISCARD; case REQ_OP_WRITE: + case REQ_OP_ZONE_APPEND: return BTRFS_MAP_WRITE; default: WARN_ON_ONCE(1); From patchwork Fri Sep 11 12:32:44 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11771119
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 24/39] btrfs: enable zone append writing for direct IO Date: Fri, 11 Sep 2020 21:32:44 +0900 Message-Id: <20200911123259.3782926-25-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This commit enables zone append writing for direct IO, in the same way as for buffered writes. Signed-off-by: Naohiro Aota --- fs/btrfs/inode.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 7fe28a77f9b8..422940d7bb4b 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7351,6 +7351,11 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start, u64 len = length; bool unlock_extents = false; + if (write && fs_info->max_zone_append_size) { + length = min_t(u64, length, fs_info->max_zone_append_size); + len = length; + } + if (!write) len = min_t(u64, len, fs_info->sectorsize); @@ -7692,6 +7697,8 @@ static void btrfs_end_dio_bio(struct bio *bio) if (err) dip->dio_bio->bi_status = err; + btrfs_record_physical_zoned(dip->inode, dip->logical_offset, bio); + bio_put(bio); btrfs_dio_private_put(dip); } @@ -7701,7 +7708,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio, { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_dio_private *dip = bio->bi_private; - bool write = bio_op(bio) == REQ_OP_WRITE; + bool write =
bio_op(bio) == REQ_OP_WRITE || + bio_op(bio) == REQ_OP_ZONE_APPEND; blk_status_t ret; /* Check btrfs_submit_bio_hook() for rules about async submit. */ @@ -7846,6 +7854,12 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap, bio->bi_end_io = btrfs_end_dio_bio; btrfs_io_bio(bio)->logical = file_offset; + if (write && btrfs_fs_incompat(fs_info, ZONED) && + fs_info->max_zone_append_size) { + bio->bi_opf &= ~REQ_OP_MASK; + bio->bi_opf |= REQ_OP_ZONE_APPEND; + } + ASSERT(submit_len >= clone_len); submit_len -= clone_len; From patchwork Fri Sep 11 12:32:45 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11771115
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 25/39] btrfs: introduce dedicated data write path for ZONED mode Date: Fri, 11 Sep 2020 21:32:45 +0900 Message-Id: <20200911123259.3782926-26-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org If more than one IO is issued for one file extent, these IOs can land in separate regions on a device. Since we cannot map one file extent to such separate areas, we need to follow the "one IO == one ordered extent" rule. The normal (buffered, uncompressed, not pre-allocated) write path (cow_file_range()) does not always follow this rule: it can write out only part of an ordered extent when given a sub-range to write, e.g. when called from fdatasync(). This commit introduces a dedicated (uncompressed, buffered) data write path for ZONED mode, which CoWs the region and writes it out at once. Signed-off-by: Naohiro Aota --- fs/btrfs/inode.c | 34 ++++++++++++++++++++++++++++++++-- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 422940d7bb4b..2bd001df4a75 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1355,6 +1355,29 @@ static int cow_file_range_async(struct btrfs_inode *inode, return 0; } +static noinline int run_delalloc_zoned(struct btrfs_inode *inode, + struct page *locked_page, u64 start, + u64 end, int *page_started, + unsigned long *nr_written) +{ + int ret; + + ret = cow_file_range(inode, locked_page, start, end, + page_started, nr_written, 0); + if (ret) + return ret; + + if (*page_started) + return 0; + + __set_page_dirty_nobuffers(locked_page); + account_page_redirty(locked_page); + extent_write_locked_range(&inode->vfs_inode, start, end, WB_SYNC_ALL); + *page_started = 1; + + return 0; +} + static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes) { @@ -1825,17 +1848,24 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page { int ret; int force_cow = need_force_cow(inode, start, end); + int do_compress = inode_can_compress(inode) && + inode_need_compress(inode, start, end); + bool zoned = btrfs_fs_incompat(inode->root->fs_info, ZONED); if (inode->flags & BTRFS_INODE_NODATACOW && !force_cow) { + ASSERT(!zoned); ret =
run_delalloc_nocow(inode, locked_page, start, end, page_started, 1, nr_written); } else if (inode->flags & BTRFS_INODE_PREALLOC && !force_cow) { + ASSERT(!zoned); ret = run_delalloc_nocow(inode, locked_page, start, end, page_started, 0, nr_written); - } else if (!inode_can_compress(inode) || - !inode_need_compress(inode, start, end)) { + } else if (!do_compress && !zoned) { ret = cow_file_range(inode, locked_page, start, end, page_started, nr_written, 1); + } else if (!do_compress && zoned) { + ret = run_delalloc_zoned(inode, locked_page, start, end, + page_started, nr_written); } else { set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &inode->runtime_flags); ret = cow_file_range_async(inode, wbc, locked_page, start, end, From patchwork Fri Sep 11 12:32:46 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11770441 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 26/39] btrfs: serialize meta IOs on ZONED mode Date: Fri, 11 Sep 2020 21:32:46 +0900 Message-Id: <20200911123259.3782926-27-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com>
MIME-Version: 1.0 List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org We cannot use zone append writing for metadata, because the B-tree nodes refer to each other by logical address. Without knowing the address in advance, we cannot construct the tree in the first place. Thus, we need to serialize write IOs for metadata. We cannot simply add a mutex around allocation and submission, because metadata blocks are allocated at an earlier stage to build up the B-trees. Thus, this commit adds zoned_meta_io_lock and holds it during metadata IO submission in btree_write_cache_pages() to serialize IOs. Furthermore, this commit adds a per-block-group metadata IO submission pointer, "meta_write_pointer", to ensure sequential writing, which could otherwise be broken when writing back blocks in an unfinished transaction. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.h | 1 + fs/btrfs/ctree.h | 2 ++ fs/btrfs/disk-io.c | 1 + fs/btrfs/extent_io.c | 27 ++++++++++++++++++++++- fs/btrfs/zoned.c | 50 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 31 ++++++++++++++++++++++++++ 6 files changed, 111 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 401e9bcefaec..b2a8a3beceac 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -190,6 +190,7 @@ struct btrfs_block_group { */ u64 alloc_offset; u64 zone_unusable; + u64 meta_write_pointer; }; static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 54c22ad0d633..e08fe341cd81 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -941,6 +941,8 @@ struct btrfs_fs_info { */ int send_in_progress; + struct mutex zoned_meta_io_lock; + #ifdef CONFIG_BTRFS_FS_REF_VERIFY spinlock_t ref_verify_lock; struct rb_root block_tree; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index d766cb0e1a52..a50436d89d30 100644 --- a/fs/btrfs/disk-io.c +++
b/fs/btrfs/disk-io.c @@ -2732,6 +2732,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info) mutex_init(&fs_info->delete_unused_bgs_mutex); mutex_init(&fs_info->reloc_mutex); mutex_init(&fs_info->delalloc_root_mutex); + mutex_init(&fs_info->zoned_meta_io_lock); seqlock_init(&fs_info->profiles_lock); INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 00a07cefffeb..b660921af935 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -25,6 +25,7 @@ #include "backref.h" #include "disk-io.h" #include "zoned.h" +#include "block-group.h" static struct kmem_cache *extent_state_cache; static struct kmem_cache *extent_buffer_cache; @@ -3986,6 +3987,7 @@ int btree_write_cache_pages(struct address_space *mapping, struct writeback_control *wbc) { struct extent_buffer *eb, *prev_eb = NULL; + struct btrfs_block_group *cache = NULL; struct extent_page_data epd = { .bio = NULL, .extent_locked = 0, @@ -4020,6 +4022,7 @@ int btree_write_cache_pages(struct address_space *mapping, tag = PAGECACHE_TAG_TOWRITE; else tag = PAGECACHE_TAG_DIRTY; + btrfs_zoned_meta_io_lock(fs_info); retry: if (wbc->sync_mode == WB_SYNC_ALL) tag_pages_for_writeback(mapping, index, end); @@ -4062,12 +4065,30 @@ int btree_write_cache_pages(struct address_space *mapping, if (!ret) continue; + if (!btrfs_check_meta_write_pointer(fs_info, eb, + &cache)) { + /* + * If for_sync, this hole will be filled with + * transaction commit.
+ */ + if (wbc->sync_mode == WB_SYNC_ALL && + !wbc->for_sync) + ret = -EAGAIN; + else + ret = 0; + done = 1; + free_extent_buffer(eb); + break; + } + prev_eb = eb; ret = lock_extent_buffer_for_io(eb, &epd); if (!ret) { + btrfs_revert_meta_write_pointer(cache, eb); free_extent_buffer(eb); continue; } else if (ret < 0) { + btrfs_revert_meta_write_pointer(cache, eb); done = 1; free_extent_buffer(eb); break; @@ -4100,10 +4121,12 @@ int btree_write_cache_pages(struct address_space *mapping, index = 0; goto retry; } + if (cache) + btrfs_put_block_group(cache); ASSERT(ret <= 0); if (ret < 0) { end_write_bio(&epd, ret); - return ret; + goto out; } /* * If something went wrong, don't allow any metadata write bio to be @@ -4138,6 +4161,8 @@ int btree_write_cache_pages(struct address_space *mapping, ret = -EROFS; end_write_bio(&epd, ret); } +out: + btrfs_zoned_meta_io_unlock(fs_info); return ret; } diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 1744e2649087..0f790f3a54e5 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -1002,6 +1002,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) ret = -EIO; } + if (!ret) + cache->meta_write_pointer = cache->alloc_offset + cache->start; + kfree(alloc_offsets); free_extent_map(em); @@ -1135,3 +1138,50 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) kfree(logical); bdput(bdev); } + +bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, + struct extent_buffer *eb, + struct btrfs_block_group **cache_ret) +{ + struct btrfs_block_group *cache; + + if (!btrfs_fs_incompat(fs_info, ZONED)) + return true; + + cache = *cache_ret; + + if (cache && (eb->start < cache->start || + cache->start + cache->length <= eb->start)) { + btrfs_put_block_group(cache); + cache = NULL; + *cache_ret = NULL; + } + + if (!cache) + cache = btrfs_lookup_block_group(fs_info, eb->start); + + if (cache) { + *cache_ret = cache; + + if (cache->meta_write_pointer != eb->start) { + 
btrfs_put_block_group(cache); + cache = NULL; + *cache_ret = NULL; + return false; + } + + cache->meta_write_pointer = eb->start + eb->len; + } + + return true; +} + +void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, + struct extent_buffer *eb) +{ + if (!btrfs_fs_incompat(eb->fs_info, ZONED) || !cache) + return; + + ASSERT(cache->meta_write_pointer == eb->start + eb->len); + cache->meta_write_pointer = eb->start; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 5f4bc746e3e2..5d4b132a4d95 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -53,6 +53,11 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans); void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset, struct bio *bio); void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered); +bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, + struct extent_buffer *eb, + struct btrfs_block_group **cache_ret); +void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, + struct extent_buffer *eb); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -130,6 +135,18 @@ static inline void btrfs_record_physical_zoned(struct inode *inode, } static inline void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) { } +static inline bool +btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, + struct extent_buffer *eb, + struct btrfs_block_group **cache_ret) +{ + return true; +} +static inline void +btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, + struct extent_buffer *eb) +{ +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -232,4 +249,18 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device, return true; } +static inline void btrfs_zoned_meta_io_lock(struct btrfs_fs_info *fs_info) +{ + if (!btrfs_fs_incompat(fs_info, ZONED)) + return; + 
mutex_lock(&fs_info->zoned_meta_io_lock); +} + +static inline void btrfs_zoned_meta_io_unlock(struct btrfs_fs_info *fs_info) +{ + if (!btrfs_fs_incompat(fs_info, ZONED)) + return; + mutex_unlock(&fs_info->zoned_meta_io_lock); +} + #endif From patchwork Fri Sep 11 12:32:47 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11771113
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v7 27/39] btrfs: wait existing extents before truncating Date: Fri, 11 Sep 2020 21:32:47 +0900 Message-Id: <20200911123259.3782926-28-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com> References: <20200911123259.3782926-1-naohiro.aota@wdc.com> MIME-Version: 1.0 List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org When truncating a file, buffers which have already been allocated but not yet written out may be truncated. Truncating these buffers could break the sequential write pattern in a block group if the truncated blocks are, for example, followed by blocks allocated to another file.
To avoid this problem, always wait for write-out of all unwritten buffers before proceeding with the truncation.

Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
 fs/btrfs/inode.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2bd001df4a75..7e1a0a5a6e55 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4883,6 +4883,16 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 		btrfs_drew_write_unlock(&root->snapshot_lock);
 		btrfs_end_transaction(trans);
 	} else {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+		if (btrfs_fs_incompat(fs_info, ZONED)) {
+			ret = btrfs_wait_ordered_range(
+				inode,
+				ALIGN(newsize, fs_info->sectorsize),
+				(u64)-1);
+			if (ret)
+				return ret;
+		}

 		/*
 		 * We're truncating a file that used to have good data down to

From patchwork Fri Sep 11 12:32:48 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11771099
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 28/39] btrfs: avoid async metadata checksum on ZONED mode
Date: Fri, 11 Sep 2020 21:32:48 +0900
Message-Id: <20200911123259.3782926-29-naohiro.aota@wdc.com>
In ZONED mode, btrfs uses the per-FS zoned_meta_io_lock to serialize metadata write IOs. Even with this serialization, write bios sent from btree_write_cache_pages can be reordered by the async checksum workers, as these workers are per CPU and not per zone.

To preserve write bio ordering, disable async metadata checksumming on ZONED. This does not lower performance on HDDs, as a single CPU core is fast enough to checksum a single zone's write stream at the maximum possible bandwidth of the device. If multiple zones are written simultaneously, HDD seek overhead lowers the achievable maximum bandwidth, so again the per-zone checksum serialization does not affect performance.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a50436d89d30..cd768030b7bb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -865,6 +865,8 @@ static blk_status_t btree_submit_bio_start(void *private_data, struct bio *bio,
 static int check_async_write(struct btrfs_fs_info *fs_info,
 			     struct btrfs_inode *bi)
 {
+	if (btrfs_fs_incompat(fs_info, ZONED))
+		return 0;
 	if (atomic_read(&bi->sync_writers))
 		return 0;
 	if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))

From patchwork Fri Sep 11 12:32:49 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11771093
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 29/39] btrfs: mark block groups to copy for device-replace
Date: Fri, 11 Sep 2020 21:32:49 +0900
Message-Id: <20200911123259.3782926-30-naohiro.aota@wdc.com>

This is the first of four patches to support device-replace in ZONED mode. There are two types of I/O during the device-replace process. One "copies" (via the scrub functions) all the device extents on the source device to the destination device. The other "clones" (via handle_ops_on_dev_replace()) new incoming write I/Os from users to the source device into the target device.

Cloning incoming I/Os can break the sequential write rule on the target device: when a write is mapped into the middle of a block group, the I/O is directed into the middle of a target device zone, violating sequential order. However, the cloning function cannot simply be disabled, since incoming I/Os targeting already-copied device extents must be cloned so that the I/O is executed on the target device.

We cannot use dev_replace->cursor_{left,right} to determine whether a bio targets a not-yet-copied region.
Since there is a time gap between finishing btrfs_scrub_dev() and rewriting the mapping tree in btrfs_dev_replace_finishing(), a newly allocated device extent may be neither cloned nor copied. So the point is to copy only device extents that already exist. This patch introduces mark_block_group_to_copy() to mark existing block groups as targets of copying. handle_ops_on_dev_replace() and dev-replace can then check the flag to do their job.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/block-group.h |   1 +
 fs/btrfs/dev-replace.c | 175 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dev-replace.h |   3 +
 fs/btrfs/scrub.c       |  17 ++++
 4 files changed, 196 insertions(+)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index b2a8a3beceac..e91123495d68 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -95,6 +95,7 @@ struct btrfs_block_group {
 	unsigned int iref:1;
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
+	unsigned int to_copy:1;

 	int disk_cache_state;

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 18a36973f973..d2db963be985 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -22,6 +22,7 @@
 #include "dev-replace.h"
 #include "sysfs.h"
 #include "zoned.h"
+#include "block-group.h"

 /*
  * Device replace overview
@@ -443,6 +444,176 @@ static char* btrfs_dev_name(struct btrfs_device *device)
 	return rcu_str_deref(device->name);
 }

+static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
+				    struct btrfs_device *src_dev)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_root *root = fs_info->dev_root;
+	struct btrfs_dev_extent *dev_extent = NULL;
+	struct btrfs_block_group *cache;
+	struct extent_buffer *l;
+	struct btrfs_trans_handle *trans;
+	int slot;
+	int ret = 0;
+	u64 chunk_offset, length;
+
+	/* Do not use "to_copy" on non-ZONED for now */
+	if (!btrfs_fs_incompat(fs_info, ZONED))
+		return 0;
+
+	mutex_lock(&fs_info->chunk_mutex);
+
+	/* Ensure we don't have a pending new block group */
+	while (fs_info->running_transaction &&
+	       !list_empty(&fs_info->running_transaction->dev_update_list)) {
+		mutex_unlock(&fs_info->chunk_mutex);
+		trans = btrfs_attach_transaction(root);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			mutex_lock(&fs_info->chunk_mutex);
+			if (ret == -ENOENT)
+				continue;
+			else
+				goto out;
+		}
+
+		ret = btrfs_commit_transaction(trans);
+		mutex_lock(&fs_info->chunk_mutex);
+		if (ret)
+			goto out;
+	}
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	path->reada = READA_FORWARD;
+	path->search_commit_root = 1;
+	path->skip_locking = 1;
+
+	key.objectid = src_dev->devid;
+	key.offset = 0ull;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+
+	while (1) {
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			if (path->slots[0] >=
+			    btrfs_header_nritems(path->nodes[0])) {
+				ret = btrfs_next_leaf(root, path);
+				if (ret < 0)
+					break;
+				if (ret > 0) {
+					ret = 0;
+					break;
+				}
+			} else {
+				ret = 0;
+			}
+		}
+
+		l = path->nodes[0];
+		slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(l, &found_key, slot);
+
+		if (found_key.objectid != src_dev->devid)
+			break;
+
+		if (found_key.type != BTRFS_DEV_EXTENT_KEY)
+			break;
+
+		if (found_key.offset < key.offset)
+			break;
+
+		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
+		length = btrfs_dev_extent_length(l, dev_extent);
+
+		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
+
+		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
+		if (!cache)
+			goto skip;
+
+		spin_lock(&cache->lock);
+		cache->to_copy = 1;
+		spin_unlock(&cache->lock);
+
+		btrfs_put_block_group(cache);
+
+skip:
+		key.offset = found_key.offset + length;
+		btrfs_release_path(path);
+	}
+
+	btrfs_free_path(path);
+out:
+	mutex_unlock(&fs_info->chunk_mutex);
+
+	return ret;
+}
+
+bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group *cache,
+				      u64 physical)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map *em;
+	struct map_lookup *map;
+	u64 chunk_offset = cache->start;
+	int num_extents, cur_extent;
+	int i;
+
+	/* Do not use "to_copy" on non-ZONED for now */
+	if (!btrfs_fs_incompat(fs_info, ZONED))
+		return true;
+
+	spin_lock(&cache->lock);
+	if (cache->removed) {
+		spin_unlock(&cache->lock);
+		return true;
+	}
+	spin_unlock(&cache->lock);
+
+	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
+	BUG_ON(IS_ERR(em));
+	map = em->map_lookup;
+
+	num_extents = cur_extent = 0;
+	for (i = 0; i < map->num_stripes; i++) {
+		/* we have more device extent to copy */
+		if (srcdev != map->stripes[i].dev)
+			continue;
+
+		num_extents++;
+		if (physical == map->stripes[i].physical)
+			cur_extent = i;
+	}
+
+	free_extent_map(em);
+
+	if (num_extents > 1 && cur_extent < num_extents - 1) {
+		/*
+		 * Has more stripes on this device. Keep this BG
+		 * readonly until we finish all the stripes.
+		 */
+		return false;
+	}
+
+	/* last stripe on this device */
+	spin_lock(&cache->lock);
+	cache->to_copy = 0;
+	spin_unlock(&cache->lock);
+
+	return true;
+}
+
 static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 				   const char *tgtdev_name, u64 srcdevid,
 				   const char *srcdev_name, int read_src)
@@ -484,6 +655,10 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		return ret;

+	ret = mark_block_group_to_copy(fs_info, src_device);
+	if (ret)
+		return ret;
+
 	down_write(&dev_replace->rwsem);
 	switch (dev_replace->replace_state) {
 	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:

diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index 60b70dacc299..3911049a5f23 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -18,5 +18,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info);
 void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info);
 int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info);
 int __pure btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace);
+bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group *cache,
+				      u64 physical);

 #endif

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e46c91188a75..f7d750b32cfb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3500,6 +3500,17 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache)
 			goto skip;
+
+		if (sctx->is_dev_replace && btrfs_fs_incompat(fs_info, ZONED)) {
+			spin_lock(&cache->lock);
+			if (!cache->to_copy) {
+				spin_unlock(&cache->lock);
+				ro_set = 0;
+				goto done;
+			}
+			spin_unlock(&cache->lock);
+		}
+
 		/*
 		 * Make sure that while we are scrubbing the corresponding block
 		 * group doesn't get its logical address and its device extents
@@ -3631,6 +3642,12 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		scrub_pause_off(fs_info);

+		if (sctx->is_dev_replace &&
+		    !btrfs_finish_block_group_to_copy(dev_replace->srcdev,
+						      cache, found_key.offset))
+			ro_set = 0;
+
+done:
 		down_write(&dev_replace->rwsem);
 		dev_replace->cursor_left = dev_replace->cursor_right;
 		dev_replace->item_needs_writeback = 1;

From patchwork Fri Sep 11 12:32:50 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11771107
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 30/39] btrfs: implement cloning for ZONED device-replace
Date: Fri, 11 Sep 2020 21:32:50 +0900
Message-Id: <20200911123259.3782926-31-naohiro.aota@wdc.com>

This is the second of four patches to implement device-replace for ZONED mode. In zoned mode, a block group must be either copied (from the source device to the destination device) or cloned (to both devices). This commit implements the cloning part: if a block group targeted by an IO is marked to copy, we should not clone the IO to the destination device, because the block group will eventually be copied by the replace process. This commit also handles cloning of device resets.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/extent-tree.c | 20 ++++++++++++++++++--
 fs/btrfs/volumes.c     | 33 +++++++++++++++++++++++++++++++--
 fs/btrfs/zoned.c       | 11 +++++++++++
 3 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 81b9b58d7a9d..79ac8fcc5c35 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -35,6 +35,7 @@
 #include "discard.h"
 #include "rcu-string.h"
 #include "zoned.h"
+#include "dev-replace.h"

 #undef SCRAMBLE_DELAYED_REFS

@@ -1322,6 +1323,8 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			u64 length = stripe->length;
 			u64 bytes;
 			struct request_queue *req_q;
+			struct btrfs_dev_replace *dev_replace =
+				&fs_info->dev_replace;

 			if (!stripe->dev->bdev) {
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
@@ -1330,15 +1333,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			req_q = bdev_get_queue(stripe->dev->bdev);

 			/* zone reset in ZONED mode */
-			if (btrfs_can_zone_reset(dev, physical, length))
+			if (btrfs_can_zone_reset(dev, physical, length)) {
 				ret = btrfs_reset_device_zone(dev, physical,
 							      length, &bytes);
-			else if (blk_queue_discard(req_q))
+				if (ret)
+					goto next;
+				if (!btrfs_dev_replace_is_ongoing(dev_replace) ||
+				    dev != dev_replace->srcdev)
+					goto next;
+
+				discarded_bytes += bytes;
+				/* send to replace target as well */
+				ret = btrfs_reset_device_zone(dev_replace->tgtdev,
+							      physical, length,
+							      &bytes);
+			} else if (blk_queue_discard(req_q))
 				ret = btrfs_issue_discard(dev->bdev, physical,
 							  length, &bytes);
 			else
 				continue;

+next:
 			if (!ret) {
 				discarded_bytes += bytes;
 			} else if (ret != -EOPNOTSUPP) {

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ca139c63f63c..779ee0452c1b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5971,9 +5971,29 @@ static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info,
 	return ret;
 }

+static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+	bool ret;
+
+	/* non-ZONED mode does not use the "to_copy" flag */
+	if (!btrfs_fs_incompat(fs_info, ZONED))
+		return false;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+
+	spin_lock(&cache->lock);
+	ret = cache->to_copy;
+	spin_unlock(&cache->lock);
+
+	btrfs_put_block_group(cache);
+	return ret;
+}
+
 static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				      struct btrfs_bio **bbio_ret,
 				      struct btrfs_dev_replace *dev_replace,
+				      u64 logical,
 				      int *num_stripes_ret, int *max_errors_ret)
 {
 	struct btrfs_bio *bbio = *bbio_ret;
@@ -5986,6 +6006,15 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 	if (op == BTRFS_MAP_WRITE) {
 		int index_where_to_add;

+		/*
+		 * A block group which has "to_copy" set will eventually be
+		 * copied by the dev-replace process. We can avoid cloning
+		 * IO here.
+		 */
+		if (is_block_group_to_copy(dev_replace->srcdev->fs_info,
+					   logical))
+			return;
+
 		/*
 		 * duplicate the write operations while the dev replace
 		 * procedure is running. Since the copying of the old disk to
@@ -6381,8 +6410,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,

 	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
 	    need_full_stripe(op)) {
-		handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes,
-					  &max_errors);
+		handle_ops_on_dev_replace(op, &bbio, dev_replace, logical,
+					  &num_stripes, &max_errors);
 	}

 	*bbio_ret = bbio;

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 0f790f3a54e5..2fe659bb0709 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -19,6 +19,7 @@
 #include "disk-io.h"
 #include "block-group.h"
 #include "transaction.h"
+#include "dev-replace.h"

 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES 4096

@@ -903,6 +904,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 	for (i = 0; i < map->num_stripes; i++) {
 		bool is_sequential;
 		struct blk_zone zone;
+		struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+		int dev_replace_is_ongoing = 0;

 		device = map->stripes[i].dev;
 		physical = map->stripes[i].physical;
@@ -929,6 +932,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 		 */
 		btrfs_dev_clear_zone_empty(device, physical);

+		down_read(&dev_replace->rwsem);
+		dev_replace_is_ongoing =
+			btrfs_dev_replace_is_ongoing(dev_replace);
+		if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
+			btrfs_dev_clear_zone_empty(dev_replace->tgtdev,
+						   physical);
+		up_read(&dev_replace->rwsem);
+
 		/*
 		 * The group is mapped to a sequential zone. Get the zone write
 		 * pointer to determine the allocation offset within the zone.
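Taken together with the previous patch, the write path during a zoned replace follows a simple dispatch rule, sketched below as a toy userspace function (not kernel code; the name and parameters are illustrative): a user write is cloned to the target device only when a replace is running and the containing block group is no longer marked to_copy, since block groups still marked to_copy will be moved by the scrub-based copy exactly once.

```c
#include <stdbool.h>

/*
 * Toy model of the decision added to handle_ops_on_dev_replace():
 * - no replace running       -> nothing to clone to
 * - block group still to_copy -> skip cloning; the copy process
 *   will transfer it later (and exactly once)
 * - otherwise                -> clone the write to the target
 */
static bool should_clone_write(bool replace_running, bool bg_to_copy)
{
	if (!replace_running)
		return false;
	if (bg_to_copy)
		return false;
	return true;
}
```

Under this rule, every extent on the source device reaches the target either via the copy path or via the clone path, but never both, which is what the sequential write constraint of the target zones requires.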
From patchwork Fri Sep 11 12:32:51 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11771125
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 31/39] btrfs: implement copying for ZONED device-replace
Date: Fri, 11 Sep 2020 21:32:51 +0900
Message-Id: <20200911123259.3782926-32-naohiro.aota@wdc.com>

This is the third of four patches to implement device-replace on ZONED mode. This commit implements the copying part, tracking the write pointer during the device-replace process. Device-replace copying only copies the used extents on the source device, so we have to fill the gaps between them to honor the sequential write rule on the target device.

The device-replace process in ZONED mode must copy or clone all the extents in the source device exactly once.
So, we need to use to ensure allocations started just before the dev-replace process to have their corresponding extent information in the B-trees. finish_extent_writes_for_zoned() implements that functionality, which basically is the removed code in the commit 042528f8d840 ("Btrfs: fix block group remaining RO forever after error during device replace"). Signed-off-by: Naohiro Aota --- fs/btrfs/scrub.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.c | 12 +++++++ fs/btrfs/zoned.h | 7 ++++ 3 files changed, 105 insertions(+) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index f7d750b32cfb..568d90214446 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -169,6 +169,7 @@ struct scrub_ctx { int pages_per_rd_bio; int is_dev_replace; + u64 write_pointer; struct scrub_bio *wr_curr_bio; struct mutex wr_lock; @@ -1623,6 +1624,25 @@ static int scrub_write_page_to_dev_replace(struct scrub_block *sblock, return scrub_add_page_to_wr_bio(sblock->sctx, spage); } +static int fill_writer_pointer_gap(struct scrub_ctx *sctx, u64 physical) +{ + int ret = 0; + u64 length; + + if (!btrfs_fs_incompat(sctx->fs_info, ZONED)) + return 0; + + if (sctx->write_pointer < physical) { + length = physical - sctx->write_pointer; + + ret = btrfs_zoned_issue_zeroout(sctx->wr_tgtdev, + sctx->write_pointer, length); + if (!ret) + sctx->write_pointer = physical; + } + return ret; +} + static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx, struct scrub_page *spage) { @@ -1645,6 +1665,13 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx, if (sbio->page_count == 0) { struct bio *bio; + ret = fill_writer_pointer_gap(sctx, + spage->physical_for_dev_replace); + if (ret) { + mutex_unlock(&sctx->wr_lock); + return ret; + } + sbio->physical = spage->physical_for_dev_replace; sbio->logical = spage->logical; sbio->dev = sctx->wr_tgtdev; @@ -1706,6 +1733,10 @@ static void scrub_wr_submit(struct scrub_ctx *sctx) * doubled the write performance on spinning disks when 
measured * with Linux 3.5 */ btrfsic_submit_bio(sbio->bio); + + if (btrfs_fs_incompat(sctx->fs_info, ZONED)) + sctx->write_pointer = sbio->physical + + sbio->page_count * PAGE_SIZE; } static void scrub_wr_bio_end_io(struct bio *bio) @@ -2973,6 +3004,21 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx, return ret < 0 ? ret : 0; } +static void sync_replace_for_zoned(struct scrub_ctx *sctx) +{ + if (!btrfs_fs_incompat(sctx->fs_info, ZONED)) + return; + + sctx->flush_all_writes = true; + scrub_submit(sctx); + mutex_lock(&sctx->wr_lock); + scrub_wr_submit(sctx); + mutex_unlock(&sctx->wr_lock); + + wait_event(sctx->list_wait, + atomic_read(&sctx->bios_in_flight) == 0); +} + static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, struct map_lookup *map, struct btrfs_device *scrub_dev, @@ -3105,6 +3151,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, */ blk_start_plug(&plug); + if (sctx->is_dev_replace && + btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) { + mutex_lock(&sctx->wr_lock); + sctx->write_pointer = physical; + mutex_unlock(&sctx->wr_lock); + sctx->flush_all_writes = true; + } + /* * now find all extents for each stripe and scrub them */ @@ -3292,6 +3346,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, if (ret) goto out; + if (sctx->is_dev_replace) + sync_replace_for_zoned(sctx); + if (extent_logical + extent_len < key.objectid + bytes) { if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) { @@ -3414,6 +3471,25 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx, return ret; } +static int finish_extent_writes_for_zoned(struct btrfs_root *root, + struct btrfs_block_group *cache) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct btrfs_trans_handle *trans; + + if (!btrfs_fs_incompat(fs_info, ZONED)) + return 0; + + btrfs_wait_block_group_reservations(cache); + btrfs_wait_nocow_writers(cache); + btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, 
cache->length);
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+	return btrfs_commit_transaction(trans);
+}
+
 static noinline_for_stack int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 					   struct btrfs_device *scrub_dev,
 					   u64 start, u64 end)
@@ -3569,6 +3645,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		 * group is not RO.
 		 */
 		ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace);
+		if (!ret && sctx->is_dev_replace) {
+			ret = finish_extent_writes_for_zoned(root, cache);
+			if (ret) {
+				btrfs_dec_block_group_ro(cache);
+				scrub_pause_off(fs_info);
+				btrfs_put_block_group(cache);
+				break;
+			}
+		}
+
 		if (ret == 0) {
 			ro_set = 1;
 		} else if (ret == -ENOSPC && !sctx->is_dev_replace) {

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 2fe659bb0709..ac88d26f1119 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1196,3 +1196,15 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 	ASSERT(cache->meta_write_pointer == eb->start + eb->len);
 	cache->meta_write_pointer = eb->start;
 }
+
+int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+			      u64 length)
+{
+	if (!btrfs_dev_is_sequential(device, physical))
+		return -EOPNOTSUPP;
+
+	return blkdev_issue_zeroout(device->bdev, physical >> SECTOR_SHIFT,
+				    length >> SECTOR_SHIFT, GFP_NOFS, 0);
+}

diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index 5d4b132a4d95..dea313a61a3e 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -58,6 +58,8 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				    struct btrfs_block_group **cache_ret);
 void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				     struct extent_buffer *eb);
+int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+			      u64 length);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -147,6 +149,11 @@ btrfs_revert_meta_write_pointer(struct
btrfs_block_group *cache,
 				struct extent_buffer *eb)
 {
 }
+static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device,
+					    u64 physical, u64 length)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)

From patchwork Fri Sep 11 12:32:52 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 32/39] btrfs: support dev-replace in ZONED mode
Date: Fri, 11 Sep 2020 21:32:52 +0900
Message-Id: <20200911123259.3782926-33-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

This is the 4/4 patch to implement device-replace on ZONED mode.

Even after the copying is done, the write pointers of the source device and the destination device may not be synchronized. For example, when the last allocated extent is freed before the device-replace process, the extent is not copied, leaving a hole there.
This patch synchronizes the write pointers by writing zeros to the destination device.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/scrub.c | 36 +++++++++++++++++++++++++
 fs/btrfs/zoned.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h |  8 ++++++
 3 files changed, 113 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 568d90214446..2356e6d90690 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3019,6 +3019,31 @@ static void sync_replace_for_zoned(struct scrub_ctx *sctx)
 		   atomic_read(&sctx->bios_in_flight) == 0);
 }
 
+static int sync_write_pointer_for_zoned(struct scrub_ctx *sctx, u64 logical,
+					u64 physical, u64 physical_end)
+{
+	struct btrfs_fs_info *fs_info = sctx->fs_info;
+	int ret = 0;
+
+	if (!btrfs_fs_incompat(fs_info, ZONED))
+		return 0;
+
+	wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
+
+	mutex_lock(&sctx->wr_lock);
+	if (sctx->write_pointer < physical_end) {
+		ret = btrfs_sync_zone_write_pointer(sctx->wr_tgtdev, logical,
+						    physical,
+						    sctx->write_pointer);
+		if (ret)
+			btrfs_err(fs_info, "failed to recover write pointer");
+	}
+	mutex_unlock(&sctx->wr_lock);
+	btrfs_dev_clear_zone_empty(sctx->wr_tgtdev, physical);
+
+	return ret;
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
@@ -3416,6 +3441,17 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	blk_finish_plug(&plug);
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
+
+	if (sctx->is_dev_replace && ret >= 0) {
+		int ret2;
+
+		ret2 = sync_write_pointer_for_zoned(sctx, base + offset,
+						    map->stripes[num].physical,
+						    physical_end);
+		if (ret2)
+			ret = ret2;
+	}
+
 	return ret < 0 ?
ret : 0;
 }

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index ac88d26f1119..576f8e333f16 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -20,6 +20,7 @@
 #include "block-group.h"
 #include "transaction.h"
 #include "dev-replace.h"
+#include "space-info.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES	4096
@@ -1208,3 +1209,71 @@ int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
 				    length >> SECTOR_SHIFT,
 				    GFP_NOFS, 0);
 }
+
+static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical,
+			  struct blk_zone *zone)
+{
+	struct btrfs_bio *bbio = NULL;
+	u64 mapped_length = PAGE_SIZE;
+	unsigned int nofs_flag;
+	int nmirrors;
+	int i, ret;
+
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
+			       &mapped_length, &bbio);
+	if (ret || !bbio || mapped_length < PAGE_SIZE) {
+		btrfs_put_bbio(bbio);
+		return -EIO;
+	}
+
+	if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK)
+		return -EINVAL;
+
+	nofs_flag = memalloc_nofs_save();
+	nmirrors = (int)bbio->num_stripes;
+	for (i = 0; i < nmirrors; i++) {
+		u64 physical = bbio->stripes[i].physical;
+		struct btrfs_device *dev = bbio->stripes[i].dev;
+
+		/* missing device */
+		if (!dev->bdev)
+			continue;
+
+		ret = btrfs_get_dev_zone(dev, physical, zone);
+		/* failing device */
+		if (ret == -EIO || ret == -EOPNOTSUPP)
+			continue;
+		break;
+	}
+	memalloc_nofs_restore(nofs_flag);
+
+	return ret;
+}
+
+int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				  u64 physical_start, u64 physical_pos)
+{
+	struct btrfs_fs_info *fs_info = tgt_dev->fs_info;
+	struct blk_zone zone;
+	u64 length;
+	u64 wp;
+	int ret;
+
+	if (!btrfs_dev_is_sequential(tgt_dev, physical_pos))
+		return 0;
+
+	ret = read_zone_info(fs_info, logical, &zone);
+	if (ret)
+		return ret;
+
+	wp = physical_start + ((zone.wp - zone.start) << SECTOR_SHIFT);
+
+	if (physical_pos == wp)
+		return 0;
+
+	if (physical_pos > wp)
+		return -EUCLEAN;
+
+	length = wp
- physical_pos;
+	return btrfs_zoned_issue_zeroout(tgt_dev, physical_pos, length);
+}

diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index dea313a61a3e..61388381c679 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -60,6 +60,8 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache,
 				     struct extent_buffer *eb);
 int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical,
 			      u64 length);
+int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				  u64 physical_start, u64 physical_pos);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -154,6 +156,12 @@ static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device,
 {
 	return -EOPNOTSUPP;
 }
+static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev,
+						u64 logical, u64 physical_start,
+						u64 physical_pos)
+{
+	return -EOPNOTSUPP;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)

From patchwork Fri Sep 11 12:32:53 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 33/39] btrfs: enable relocation in ZONED mode
Date: Fri, 11 Sep 2020 21:32:53 +0900
Message-Id: <20200911123259.3782926-34-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

To serialize allocation and submit_bio, we introduced a mutex around them. As a result, preallocation must be completely disabled to avoid a deadlock. Since the current relocation process relies on preallocation to move file data extents, it must be handled differently.

In ZONED mode, we just truncate the inode to the size that we wanted to pre-allocate. Then, we flush dirty pages on the file before finishing the relocation process. run_delalloc_zoned() will handle all the allocation and submit the IOs to the underlying layers.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/relocation.c | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 4ba1ab9cc76d..5bd1f2e61062 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2600,6 +2600,32 @@ static noinline_for_stack int prealloc_file_extent_cluster(
 	if (ret)
 		return ret;
 
+	/*
+	 * In ZONED mode, we cannot preallocate the file region. Instead, we
+	 * dirty and fiemap_write the region.
+	 */
+
+	if (btrfs_fs_incompat(inode->root->fs_info, ZONED)) {
+		struct btrfs_root *root = inode->root;
+		struct btrfs_trans_handle *trans;
+
+		end = cluster->end - offset + 1;
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans))
+			return PTR_ERR(trans);
+
+		inode->vfs_inode.i_ctime = current_time(&inode->vfs_inode);
+		i_size_write(&inode->vfs_inode, end);
+		ret = btrfs_update_inode(trans, root, &inode->vfs_inode);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+
+		return btrfs_end_transaction(trans);
+	}
+
 	inode_lock(&inode->vfs_inode);
 	for (nr = 0; nr < cluster->nr; nr++) {
 		start = cluster->boundary[nr] - offset;
@@ -2796,6 +2822,8 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		}
 	}
 	WARN_ON(nr != cluster->nr);
+	if (btrfs_fs_incompat(fs_info, ZONED) && !ret)
+		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
 out:
 	kfree(ra);
 	return ret;
@@ -3431,8 +3459,12 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct btrfs_inode_item *item;
 	struct extent_buffer *leaf;
+	u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC;
 	int ret;
 
+	if (btrfs_fs_incompat(trans->fs_info, ZONED))
+		flags &= ~BTRFS_INODE_PREALLOC;
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -3447,8 +3479,7 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	btrfs_set_inode_generation(leaf, item, 1);
 	btrfs_set_inode_size(leaf, item, 0);
 	btrfs_set_inode_mode(leaf, item, S_IFREG | 0600);
-	btrfs_set_inode_flags(leaf, item, BTRFS_INODE_NOCOMPRESS |
-			      BTRFS_INODE_PREALLOC);
+	btrfs_set_inode_flags(leaf, item, flags);
 	btrfs_mark_buffer_dirty(leaf);
 out:
 	btrfs_free_path(path);

From patchwork Fri Sep 11 12:32:54 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 34/39] btrfs: relocate block group to repair IO failure in ZONED
Date: Fri, 11 Sep 2020 21:32:54 +0900
Message-Id: <20200911123259.3782926-35-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

When btrfs finds a checksum error and the file system has a mirror of the damaged data, btrfs reads the correct data from the mirror and writes it to the damaged blocks. This repairing, however, violates the sequential-write-required rule.

We can consider three methods to repair an IO failure in ZONED mode:

(1) Reset and rewrite the damaged zone
(2) Allocate a new device extent and replace the damaged device extent with the new one
(3) Relocate the corresponding block group

Method (1) is most similar to the behavior on regular devices. However, it also wipes non-damaged data in the same device extent, and so it unnecessarily degrades non-damaged data. Method (2) is much like device replacing but done within the same device. It is safe because it keeps the device extent until the replacing finishes. However, extending device replacing is non-trivial.
It assumes "src_dev->physical == dst_dev->physical". Also, the extent mapping replacing function should be extended to support replacing a device extent position within one device.

Method (3) invokes relocation of the damaged block group, so it is straightforward to implement. It relocates all the mirrored device extents, so it is potentially a more costly operation than method (1) or (2). But it relocates only the extents in use, which reduces the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace and apply method (2).

To protect a block group from being relocated multiple times due to multiple IO errors, this commit introduces a "relocating_repair" bit to show that it is now being relocated to repair IO failures. It also uses a new kthread, "btrfs-relocating-repair", so as not to block the IO path with the relocating process.

This commit also supports repairing in the scrub process.

Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/extent_io.c   |  3 ++
 fs/btrfs/scrub.c       |  3 ++
 fs/btrfs/volumes.c     | 71 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h     |  1 +
 5 files changed, 79 insertions(+)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index e91123495d68..50e5ddb0a19b 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -96,6 +96,7 @@ struct btrfs_block_group {
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
 	unsigned int to_copy:1;
+	unsigned int relocating_repair:1;
 
 	int disk_cache_state;

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b660921af935..2fcb78147330 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2273,6 +2273,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 	ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
 	BUG_ON(!mirror_num);
 
+	if (btrfs_fs_incompat(fs_info, ZONED))
+		return btrfs_repair_one_zone(fs_info, logical);
+
 	bio = btrfs_io_bio_alloc(1);
 	bio->bi_iter.bi_size = 0;
 	map_length = length;
diff --git
a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 2356e6d90690..3c59e551b894 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -861,6 +861,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	have_csum = sblock_to_check->pagev[0]->have_csum;
 	dev = sblock_to_check->pagev[0]->dev;
 
+	if (btrfs_fs_incompat(fs_info, ZONED) && !sctx->is_dev_replace)
+		return btrfs_repair_one_zone(fs_info, logical);
+
 	/*
 	 * We must use GFP_NOFS because the scrub task might be waiting for a
 	 * worker task executing this function and in turn a transaction commit

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 779ee0452c1b..9e82cf28662f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7961,3 +7961,74 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr)
 	spin_unlock(&fs_info->swapfile_pins_lock);
 	return node != NULL;
 }
+
+static int relocating_repair_kthread(void *data)
+{
+	struct btrfs_block_group *cache = (struct btrfs_block_group *)data;
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	u64 target;
+	int ret = 0;
+
+	target = cache->start;
+	btrfs_put_block_group(cache);
+
+	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
+		btrfs_info(fs_info,
+			   "skip relocating block group %llu to repair: EBUSY",
+			   target);
+		return -EBUSY;
+	}
+
+	mutex_lock(&fs_info->delete_unused_bgs_mutex);
+
+	/* ensure Block Group still exists */
+	cache = btrfs_lookup_block_group(fs_info, target);
+	if (!cache)
+		goto out;
+
+	if (!cache->relocating_repair)
+		goto out;
+
+	ret = btrfs_may_alloc_data_chunk(fs_info, target);
+	if (ret < 0)
+		goto out;
+
+	btrfs_info(fs_info, "relocating block group %llu to repair IO failure",
+		   target);
+	ret = btrfs_relocate_chunk(fs_info, target);
+
+out:
+	if (cache)
+		btrfs_put_block_group(cache);
+	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
+
+	return ret;
+}
+
+int btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct
btrfs_block_group *cache;
+
+	/* do not attempt to repair in degraded state */
+	if (btrfs_test_opt(fs_info, DEGRADED))
+		return 0;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	if (!cache)
+		return 0;
+
+	spin_lock(&cache->lock);
+	if (cache->relocating_repair) {
+		spin_unlock(&cache->lock);
+		btrfs_put_block_group(cache);
+		return 0;
+	}
+	cache->relocating_repair = 1;
+	spin_unlock(&cache->lock);
+
+	kthread_run(relocating_repair_kthread, cache,
+		    "btrfs-relocating-repair");
+
+	return 0;
+}

diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fc03b386bb8c..25814628e2d5 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -583,5 +583,6 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
+int btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical);
 
 #endif

From patchwork Fri Sep 11 12:32:55 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 35/39] btrfs: split alloc_log_tree()
Date: Fri, 11 Sep 2020 21:32:55 +0900
Message-Id: <20200911123259.3782926-36-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

This is a preparation for the next patch. This commit splits alloc_log_tree() into the part allocating the tree structure (which remains in alloc_log_tree()) and the part allocating the tree node (moved into btrfs_alloc_log_tree_node()). The latter part is also exported to be used in the next patch.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/disk-io.c | 27 ++++++++++++++++++++++++---
 fs/btrfs/disk-io.h |  2 ++
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index cd768030b7bb..4d1851e72031 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1269,7 +1269,6 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 					 struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *root;
-	struct extent_buffer *leaf;
 
 	root = btrfs_alloc_root(fs_info, BTRFS_TREE_LOG_OBJECTID, GFP_NOFS);
 	if (!root)
@@ -1279,6 +1278,14 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 	root->root_key.type = BTRFS_ROOT_ITEM_KEY;
 	root->root_key.offset = BTRFS_TREE_LOG_OBJECTID;
 
+	return root;
+}
+
+int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root)
+{
+	struct extent_buffer *leaf;
+
 	/*
 	 * DON'T set SHAREABLE bit for log trees.
* @@ -1293,24 +1300,31 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans, NULL, 0, 0, 0); if (IS_ERR(leaf)) { btrfs_put_root(root); - return ERR_CAST(leaf); + return PTR_ERR(leaf); } root->node = leaf; btrfs_mark_buffer_dirty(root->node); btrfs_tree_unlock(root->node); - return root; + + return 0; } int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info) { struct btrfs_root *log_root; + int ret; log_root = alloc_log_tree(trans, fs_info); if (IS_ERR(log_root)) return PTR_ERR(log_root); + ret = btrfs_alloc_log_tree_node(trans, log_root); + if (ret) { + kfree(log_root); + return ret; + } WARN_ON(fs_info->log_root_tree); fs_info->log_root_tree = log_root; return 0; @@ -1322,11 +1336,18 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_root *log_root; struct btrfs_inode_item *inode_item; + int ret; log_root = alloc_log_tree(trans, fs_info); if (IS_ERR(log_root)) return PTR_ERR(log_root); + ret = btrfs_alloc_log_tree_node(trans, log_root); + if (ret) { + kfree(log_root); + return ret; + } + log_root->last_trans = trans->transid; log_root->root_key.offset = root->root_key.objectid; diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h index 00dc39d47ed3..85c7d4de765e 100644 --- a/fs/btrfs/disk-io.h +++ b/fs/btrfs/disk-io.h @@ -111,6 +111,8 @@ blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, extent_submit_bio_start_t *submit_bio_start); blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio, int mirror_num); +int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans, + struct btrfs_root *root); int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info); int btrfs_add_log_tree(struct btrfs_trans_handle *trans, From patchwork Fri Sep 11 12:32:56 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit 
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11771091
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 36/39] btrfs: extend zoned allocator to use dedicated tree-log block group
Date: Fri, 11 Sep 2020 21:32:56 +0900
Message-Id: <20200911123259.3782926-37-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

This is the 1/3 patch to enable tree log on ZONED mode.

The tree-log feature does not work on ZONED mode as is. Blocks for a tree-log
tree are allocated mixed with other metadata blocks, and btrfs writes and
syncs the tree-log blocks to devices at fsync() time, which is a different
timing than a global transaction commit. As a result, both writing tree-log
blocks and writing other metadata blocks become non-sequential writes, which
ZONED mode must avoid.

We can introduce a dedicated block group for tree-log blocks, so that
tree-log blocks and other metadata blocks go into separate write streams. As
a result, each write stream can now be written to the devices separately.
"fs_info->treelog_bg" tracks the dedicated block group, and btrfs assigns
"treelog_bg" on demand at tree-log block allocation time. This commit extends
the zoned block allocator to use the block group.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/block-group.c |  7 +++++
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/extent-tree.c | 68 +++++++++++++++++++++++++++++++++++++-----
 3 files changed, 70 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index be5394c8ec3a..d30eba3c484a 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -939,6 +939,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	btrfs_return_cluster_to_free_space(block_group, cluster);
 	spin_unlock(&cluster->refill_lock);

+	if (btrfs_fs_incompat(fs_info, ZONED)) {
+		spin_lock(&fs_info->treelog_bg_lock);
+		if (fs_info->treelog_bg == block_group->start)
+			fs_info->treelog_bg = 0;
+		spin_unlock(&fs_info->treelog_bg_lock);
+	}
+
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e08fe341cd81..6e05eb180a77 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -942,6 +942,8 @@ struct btrfs_fs_info {
 	int send_in_progress;

 	struct mutex zoned_meta_io_lock;
+	spinlock_t treelog_bg_lock;
+	u64 treelog_bg;

 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 	spinlock_t ref_verify_lock;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 79ac8fcc5c35..9e576977f416 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3506,6 +3506,9 @@ struct find_free_extent_ctl {

 	/* Allocation policy */
 	enum btrfs_extent_allocation_policy policy;
+
+	/* Allocation is called for tree-log */
+	bool for_treelog;
 };

@@ -3706,23 +3709,54 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 			       struct find_free_extent_ctl *ffe_ctl,
 			       struct btrfs_block_group **bg_ret)
 {
+	struct btrfs_fs_info *fs_info = block_group->fs_info;
 	struct btrfs_space_info *space_info = block_group->space_info;
 	struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl;
 	u64 start = block_group->start;
 	u64 num_bytes = ffe_ctl->num_bytes;
 	u64 avail;
+	u64 bytenr = block_group->start;
+	u64 log_bytenr;
 	int ret = 0;
+	bool skip;

 	ASSERT(btrfs_fs_incompat(block_group->fs_info, ZONED));

+	/*
+	 * Do not allow non-tree-log blocks in the dedicated tree-log block
+	 * group, and vice versa.
+	 */
+	spin_lock(&fs_info->treelog_bg_lock);
+	log_bytenr = fs_info->treelog_bg;
+	skip = log_bytenr && ((ffe_ctl->for_treelog && bytenr != log_bytenr) ||
+			      (!ffe_ctl->for_treelog && bytenr == log_bytenr));
+	spin_unlock(&fs_info->treelog_bg_lock);
+	if (skip)
+		return 1;
+
 	spin_lock(&space_info->lock);
 	spin_lock(&block_group->lock);
+	spin_lock(&fs_info->treelog_bg_lock);
+
+	ASSERT(!ffe_ctl->for_treelog ||
+	       block_group->start == fs_info->treelog_bg ||
+	       fs_info->treelog_bg == 0);

 	if (block_group->ro) {
 		ret = 1;
 		goto out;
 	}

+	/*
+	 * Do not allow currently using block group to be tree-log dedicated
+	 * block group.
+	 */
+	if (ffe_ctl->for_treelog && !fs_info->treelog_bg &&
+	    (block_group->used || block_group->reserved)) {
+		ret = 1;
+		goto out;
+	}
+
 	avail = block_group->length - block_group->alloc_offset;
 	if (avail < num_bytes) {
 		ffe_ctl->max_extent_size = avail;
@@ -3730,6 +3764,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 		goto out;
 	}

+	if (ffe_ctl->for_treelog && !fs_info->treelog_bg)
+		fs_info->treelog_bg = block_group->start;
+
 	ffe_ctl->found_offset = start + block_group->alloc_offset;
 	block_group->alloc_offset += num_bytes;
 	spin_lock(&ctl->tree_lock);
@@ -3737,10 +3774,13 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group,
 	spin_unlock(&ctl->tree_lock);

 	ASSERT(IS_ALIGNED(ffe_ctl->found_offset,
-			  block_group->fs_info->stripesize));
+			  fs_info->stripesize));
 	ffe_ctl->search_start = ffe_ctl->found_offset;

 out:
+	if (ret && ffe_ctl->for_treelog)
+		fs_info->treelog_bg = 0;
+	spin_unlock(&fs_info->treelog_bg_lock);
 	spin_unlock(&block_group->lock);
 	spin_unlock(&space_info->lock);
 	return ret;
@@ -3990,7 +4030,12 @@ static int prepare_allocation(struct btrfs_fs_info *fs_info,
 		return prepare_allocation_clustered(fs_info, ffe_ctl,
 						    space_info, ins);
 	case BTRFS_EXTENT_ALLOC_ZONED:
-		/* nothing to do */
+		if (ffe_ctl->for_treelog) {
+			spin_lock(&fs_info->treelog_bg_lock);
+			if (fs_info->treelog_bg)
+				ffe_ctl->hint_byte = fs_info->treelog_bg;
+			spin_unlock(&fs_info->treelog_bg_lock);
+		}
 		return 0;
 	default:
 		BUG();
@@ -4025,7 +4070,7 @@ static int prepare_allocation(struct btrfs_fs_info *fs_info,
 static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 				u64 ram_bytes, u64 num_bytes, u64 empty_size,
 				u64 hint_byte_orig, struct btrfs_key *ins,
-				u64 flags, int delalloc)
+				u64 flags, int delalloc, bool for_treelog)
 {
 	int ret = 0;
 	int cache_block_group_error = 0;
@@ -4046,6 +4091,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 	ffe_ctl.orig_have_caching_bg = false;
 	ffe_ctl.found_offset = 0;
 	ffe_ctl.hint_byte = hint_byte_orig;
+	ffe_ctl.for_treelog = for_treelog;
 	ffe_ctl.policy = BTRFS_EXTENT_ALLOC_CLUSTERED;

 	/* For clustered allocation */
@@ -4120,8 +4166,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		struct btrfs_block_group *bg_ret;

 		/*
		 * If the block group is read-only, we can skip it entirely.
 		 */
-		if (unlikely(block_group->ro))
+		if (unlikely(block_group->ro)) {
+			if (btrfs_fs_incompat(fs_info, ZONED) && for_treelog) {
+				spin_lock(&fs_info->treelog_bg_lock);
+				if (block_group->start == fs_info->treelog_bg)
+					fs_info->treelog_bg = 0;
+				spin_unlock(&fs_info->treelog_bg_lock);
+			}
 			continue;
+		}

 		btrfs_grab_block_group(block_group, delalloc);
 		ffe_ctl.search_start = block_group->start;
@@ -4309,12 +4362,13 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	bool final_tried = num_bytes == min_alloc_size;
 	u64 flags;
 	int ret;
+	bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;

 	flags = get_alloc_profile_by_root(root, is_data);
again:
 	WARN_ON(num_bytes < fs_info->sectorsize);
 	ret = find_free_extent(fs_info, ram_bytes, num_bytes, empty_size,
-			       hint_byte, ins, flags, delalloc);
+			       hint_byte, ins, flags, delalloc, for_treelog);
 	if (!ret && !is_data) {
 		btrfs_dec_block_group_reservations(fs_info, ins->objectid);
 	} else if (ret == -ENOSPC) {
@@ -4332,8 +4386,8 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 			sinfo = btrfs_find_space_info(fs_info, flags);
 			btrfs_err(fs_info,
-				  "allocation failed flags %llu, wanted %llu",
-				  flags, num_bytes);
+				  "allocation failed flags %llu, wanted %llu treelog %d",
+				  flags, num_bytes, for_treelog);
 			if (sinfo)
 				btrfs_dump_space_info(fs_info, sinfo,
 						      num_bytes, 1);

From patchwork Fri Sep 11 12:32:57 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11771083
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 37/39] btrfs: serialize log transaction on ZONED mode
Date: Fri, 11 Sep 2020 21:32:57 +0900
Message-Id: <20200911123259.3782926-38-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

This is the 2/3 patch to enable tree-log on ZONED mode.

Since we can start more than one log transaction per subvolume
simultaneously, nodes from multiple transactions can be allocated
interleaved. Such mixed allocation results in non-sequential writes at log
transaction commit time. The nodes of the global log root tree
(fs_info->log_root_tree) have the same mixed allocation problem.

This patch serializes log transactions by waiting for a committing
transaction when someone tries to start a new one, to avoid the mixed
allocation problem. We must also wait for running log transactions from
other subvolumes, but there is no easy way to detect which subvolume root is
running a log transaction. So, this patch forbids starting a new log
transaction when another subvolume has already allocated the global log root
tree.
Signed-off-by: Naohiro Aota
---
 fs/btrfs/tree-log.c | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 4b6a68a81eac..1ffb9a0341e2 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -108,6 +108,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *log,
 				       struct btrfs_path *path,
 				       u64 dirid, int del_all);
+static void wait_log_commit(struct btrfs_root *root, int transid);

 /*
  * tree logging is a special write ahead log used to make sure that
@@ -142,16 +143,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 			   struct btrfs_log_ctx *ctx)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
+	bool zoned = btrfs_fs_incompat(fs_info, ZONED);
 	int ret = 0;

 	mutex_lock(&root->log_mutex);
+again:
 	if (root->log_root) {
+		int index = (root->log_transid + 1) % 2;
+
 		if (btrfs_need_log_full_commit(trans)) {
 			ret = -EAGAIN;
 			goto out;
 		}

+		if (zoned && atomic_read(&root->log_commit[index])) {
+			wait_log_commit(root, root->log_transid - 1);
+			goto again;
+		}
+
 		if (!root->log_start_pid) {
 			clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
 			root->log_start_pid = current->pid;
@@ -160,8 +170,13 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 		}
 	} else {
 		mutex_lock(&fs_info->tree_log_mutex);
-		if (!fs_info->log_root_tree)
+		if (zoned && fs_info->log_root_tree) {
+			ret = -EAGAIN;
+			mutex_unlock(&fs_info->tree_log_mutex);
+			goto out;
+		} else if (!fs_info->log_root_tree) {
 			ret = btrfs_init_log_root_tree(trans, fs_info);
+		}
 		mutex_unlock(&fs_info->tree_log_mutex);
 		if (ret)
 			goto out;
@@ -195,14 +210,22 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
  */
 static int join_running_log_trans(struct btrfs_root *root)
 {
+	bool zoned = btrfs_fs_incompat(root->fs_info, ZONED);
 	int ret = -ENOENT;

 	if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state))
 		return ret;

 	mutex_lock(&root->log_mutex);
+again:
 	if (root->log_root) {
+		int index = (root->log_transid + 1) % 2;
+
 		ret = 0;
+		if (zoned && atomic_read(&root->log_commit[index])) {
+			wait_log_commit(root, root->log_transid - 1);
+			goto again;
+		}
 		atomic_inc(&root->log_writers);
 	}
 	mutex_unlock(&root->log_mutex);

From patchwork Fri Sep 11 12:32:58 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11771079
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 38/39] btrfs: reorder log node allocation
Date: Fri, 11 Sep 2020 21:32:58 +0900
Message-Id: <20200911123259.3782926-39-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

This is the 3/3 patch to enable tree-log on ZONED mode.

The allocation order of the nodes of "fs_info->log_root_tree" and the nodes
of "root->log_root" is not the same as their writing order, so writing them
out causes unaligned write errors.
This patch reorders their allocation by delaying allocation of the root node
of "fs_info->log_root_tree", so that the node buffers can go out to the
devices sequentially.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/disk-io.c  |  6 ------
 fs/btrfs/tree-log.c | 19 +++++++++++++------
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4d1851e72031..0884412977a0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1315,16 +1315,10 @@ int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *log_root;
-	int ret;

 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);

-	ret = btrfs_alloc_log_tree_node(trans, log_root);
-	if (ret) {
-		kfree(log_root);
-		return ret;
-	}
-
 	WARN_ON(fs_info->log_root_tree);
 	fs_info->log_root_tree = log_root;
 	return 0;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1ffb9a0341e2..087c1d0c7307 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3147,6 +3147,11 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	list_add_tail(&root_log_ctx.list, &log_root_tree->log_ctxs[index2]);
 	root_log_ctx.log_transid = log_root_tree->log_transid;

+	mutex_lock(&fs_info->tree_log_mutex);
+	if (!log_root_tree->node)
+		btrfs_alloc_log_tree_node(trans, log_root_tree);
+	mutex_unlock(&fs_info->tree_log_mutex);
+
 	/*
 	 * Now we are safe to update the log_root_tree because we're under the
 	 * log_mutex, and we're a current writer so we're holding the commit
@@ -3296,12 +3301,14 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 		.process_func = process_one_buffer
 	};

-	ret = walk_log_tree(trans, log, &wc);
-	if (ret) {
-		if (trans)
-			btrfs_abort_transaction(trans, ret);
-		else
-			btrfs_handle_fs_error(log->fs_info, ret, NULL);
+	if (log->node) {
+		ret = walk_log_tree(trans, log, &wc);
+		if (ret) {
+			if (trans)
+				btrfs_abort_transaction(trans, ret);
+			else
+				btrfs_handle_fs_error(log->fs_info, ret, NULL);
+		}
 	}

 	clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,

From patchwork Fri Sep 11 12:32:59 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11771071
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Hannes Reinecke, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v7 39/39] btrfs: enable to mount ZONED incompat flag
Date: Fri, 11 Sep 2020 21:32:59 +0900
Message-Id: <20200911123259.3782926-40-naohiro.aota@wdc.com>
In-Reply-To: <20200911123259.3782926-1-naohiro.aota@wdc.com>
References: <20200911123259.3782926-1-naohiro.aota@wdc.com>

This final patch adds the ZONED incompat flag to
BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount a ZONED-flagged
file system.
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
 fs/btrfs/ctree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6e05eb180a77..e8639f6f7dec 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -303,7 +303,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
-	 BTRFS_FEATURE_INCOMPAT_RAID1C34)
+	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
+	 BTRFS_FEATURE_INCOMPAT_ZONED)

 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)