From patchwork Fri Dec 13 04:08:48 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289795
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 01/28] btrfs: introduce HMZONED feature flag
Date: Fri, 13 Dec 2019 13:08:48 +0900
Message-Id: <20191213040915.3502922-2-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>

This patch
introduces the HMZONED incompat flag. The flag indicates that the
volume management will satisfy the constraints imposed by host-managed
zoned block devices.

Reviewed-by: Anand Jain
Reviewed-by: Johannes Thumshirn
Signed-off-by: Damien Le Moal
Signed-off-by: Naohiro Aota
---
 fs/btrfs/sysfs.c           | 2 ++
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 5ebbe8a5ee76..230c7ad90e22 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -260,6 +260,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
+BTRFS_FEAT_ATTR_INCOMPAT(hmzoned, HMZONED);

 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -275,6 +276,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
+	BTRFS_FEAT_ATTR_PTR(hmzoned),
 	NULL
 };

diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 7a8bc8b920f5..62c22bf1f702 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -271,6 +271,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
+#define BTRFS_FEATURE_INCOMPAT_HMZONED		(1ULL << 12)

 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;

From patchwork Fri Dec 13 04:08:49 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289799
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 02/28] btrfs: Get zone information of zoned block devices
Date: Fri, 13 Dec 2019 13:08:49 +0900
Message-Id: <20191213040915.3502922-3-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>

If a zoned block device is found, get its zone information (number of
zones and zone size) using the new helper function
btrfs_get_dev_zone_info(). To avoid costly run-time zone report
commands to test the device zone types during block allocation, attach
the seq_zones bitmap to the device structure to indicate whether a
zone is sequential or accepts random writes.
It also attaches the empty_zones bitmap to indicate whether a zone is
empty. This patch also introduces the helper functions
btrfs_dev_is_sequential(), to test whether the zone storing a block is
a sequential write required zone, and btrfs_dev_is_empty_zone(), to
test whether that zone is empty.

Signed-off-by: Damien Le Moal
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/Makefile  |   1 +
 fs/btrfs/hmzoned.c | 168 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h |  92 +++++++++++++++++++++++++
 fs/btrfs/volumes.c |  18 ++++-
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 281 insertions(+), 2 deletions(-)
 create mode 100644 fs/btrfs/hmzoned.c
 create mode 100644 fs/btrfs/hmzoned.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 82200dbca5ac..64aaeed397a4 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -16,6 +16,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
+btrfs-$(CONFIG_BLK_DEV_ZONED) += hmzoned.o

 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
new file mode 100644
index 000000000000..6a13763d2916
--- /dev/null
+++ b/fs/btrfs/hmzoned.c
@@ -0,0 +1,168 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *	Naohiro Aota
+ *	Damien Le Moal
+ */
+
+#include
+#include
+#include "ctree.h"
+#include "volumes.h"
+#include "hmzoned.h"
+#include "rcu-string.h"
+
+/* Maximum number of zones to report per blkdev_report_zones() call */
+#define BTRFS_REPORT_NR_ZONES	4096
+
+static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
+			       struct blk_zone *zones, unsigned int *nr_zones)
+{
+	int ret;
+
+	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, zones,
+				  nr_zones);
+	if (ret != 0) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "get zone at %llu on %s failed %d", pos,
+				 rcu_str_deref(device->name), ret);
+		return ret;
+	}
+	if (!*nr_zones)
+		return -EIO;
+
+	return 0;
+}
+
+int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = NULL;
+	struct block_device *bdev = device->bdev;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	sector_t sector = 0;
+	struct blk_zone *zones = NULL;
+	unsigned int i, nreported = 0, nr_zones;
+	unsigned int zone_sectors;
+	int ret;
+	char devstr[sizeof(device->fs_info->sb->s_id) +
+		    sizeof(" (device )") - 1];
+
+	if (!bdev_is_zoned(bdev))
+		return 0;
+
+	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
+	if (!zone_info)
+		return -ENOMEM;
+
+	zone_sectors = bdev_zone_sectors(bdev);
+	ASSERT(is_power_of_2(zone_sectors));
+	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
+	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
+	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
+	if (!IS_ALIGNED(nr_sectors, zone_sectors))
+		zone_info->nr_zones++;
+
+	zone_info->seq_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
+	if (!zone_info->seq_zones) {
+		ret = -ENOMEM;
+		goto free_zone_info;
+	}
+
+	zone_info->empty_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
+	if (!zone_info->empty_zones) {
+		ret = -ENOMEM;
+		goto free_seq_zones;
+	}
+
+	zones = kcalloc(BTRFS_REPORT_NR_ZONES,
+			sizeof(struct blk_zone), GFP_KERNEL);
+	if (!zones) {
+		ret = -ENOMEM;
+		goto free_empty_zones;
+	}
+
+	/* Get zones type */
+	while (sector < nr_sectors) {
+		nr_zones = BTRFS_REPORT_NR_ZONES;
+		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, zones,
+					  &nr_zones);
+		if (ret)
+			goto free_zones;
+
+		for (i = 0; i < nr_zones; i++) {
+			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+				set_bit(nreported, zone_info->seq_zones);
+			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
+				set_bit(nreported, zone_info->empty_zones);
+			nreported++;
+		}
+		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
+	}
+
+	if (nreported != zone_info->nr_zones) {
+		btrfs_err_in_rcu(device->fs_info,
+			"inconsistent number of zones on %s (%u / %u)",
+			rcu_str_deref(device->name), nreported,
+			zone_info->nr_zones);
+		ret = -EIO;
+		goto free_zones;
+	}
+
+	kfree(zones);
+
+	device->zone_info = zone_info;
+
+	devstr[0] = 0;
+	if (device->fs_info)
+		snprintf(devstr, sizeof(devstr), " (device %s)",
+			 device->fs_info->sb->s_id);
+
+	rcu_read_lock();
+	pr_info(
+"BTRFS info%s: host-%s zoned block device %s, %u zones of %llu sectors",
+		devstr,
+		bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
+		rcu_str_deref(device->name), zone_info->nr_zones,
+		zone_info->zone_size >> SECTOR_SHIFT);
+	rcu_read_unlock();
+
+	return 0;
+
+free_zones:
+	kfree(zones);
+free_empty_zones:
+	bitmap_free(zone_info->empty_zones);
+free_seq_zones:
+	bitmap_free(zone_info->seq_zones);
+free_zone_info:
+	kfree(zone_info);
+
+	return ret;
+}
+
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return;
+
+	bitmap_free(zone_info->seq_zones);
+	bitmap_free(zone_info->empty_zones);
+	kfree(zone_info);
+	device->zone_info = NULL;
+}
+
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone)
+{
+	unsigned int nr_zones = 1;
+	int ret;
+
+	ret = btrfs_get_dev_zones(device, pos, zone, &nr_zones);
+	if (ret != 0 || !nr_zones)
+		return ret ? ret : -EIO;
+
+	return 0;
+}

diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
new file mode 100644
index 000000000000..0f8006f39aaf
--- /dev/null
+++ b/fs/btrfs/hmzoned.h
@@ -0,0 +1,92 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *	Naohiro Aota
+ *	Damien Le Moal
+ */
+
+#ifndef BTRFS_HMZONED_H
+#define BTRFS_HMZONED_H
+
+struct btrfs_zoned_device_info {
+	/*
+	 * Number of zones, zone size and types of zones if bdev is a
+	 * zoned block device.
+	 */
+	u64 zone_size;
+	u8  zone_size_shift;
+	u32 nr_zones;
+	unsigned long *seq_zones;
+	unsigned long *empty_zones;
+};
+
+#ifdef CONFIG_BLK_DEV_ZONED
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone);
+int btrfs_get_dev_zone_info(struct btrfs_device *device);
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+#else /* CONFIG_BLK_DEV_ZONED */
+static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+				     struct blk_zone *zone)
+{
+	return 0;
+}
+static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	return 0;
+}
+static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { }
+#endif
+
+static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return false;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->seq_zones);
+}
+
+static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return true;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
+						u64 pos, bool set)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+	unsigned int zno;
+
+	if (!zone_info)
+		return;
+
+	zno = pos >> zone_info->zone_size_shift;
+	if (set)
+		set_bit(zno, zone_info->empty_zones);
+	else
+		clear_bit(zno, zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_zone_empty(struct btrfs_device *device,
+					    u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, true);
+}
+
+static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
+					      u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, false);
+}
+
+#endif

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d8e5560db285..18ea8dfce244 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -30,6 +30,7 @@
 #include "tree-checker.h"
 #include "space-info.h"
 #include "block-group.h"
+#include "hmzoned.h"

 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
@@ -366,6 +367,7 @@ void btrfs_free_device(struct btrfs_device *device)
 	rcu_string_free(device->name);
 	extent_io_tree_release(&device->alloc_state);
 	bio_put(device->flush_bio);
+	btrfs_destroy_dev_zone_info(device);
 	kfree(device);
 }

@@ -650,6 +652,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	device->mode = flags;

+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret != 0)
+		goto error_brelse;
+
 	fs_devices->open_devices++;
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
 	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -2421,6 +2428,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	}
 	rcu_assign_pointer(device->name, name);

+	device->fs_info = fs_info;
+	device->bdev = bdev;
+
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error_free_device;
+
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
@@ -2437,8 +2452,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 					 fs_info->sectorsize);
 	device->disk_total_bytes = device->total_bytes;
 	device->commit_total_bytes = device->total_bytes;
-	device->fs_info = fs_info;
-	device->bdev = bdev;
 	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
 	device->mode = FMODE_EXCL;
@@ -2571,6 +2584,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		sb->s_flags |= SB_RDONLY;
 	if (trans)
 		btrfs_end_transaction(trans);
+	btrfs_destroy_dev_zone_info(device);
 error_free_device:
 	btrfs_free_device(device);
 error:

diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fc1b564b9cfe..70cabe65f72a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -53,6 +53,8 @@ struct btrfs_io_geometry {
 #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)

+struct btrfs_zoned_device_info;
+
 struct btrfs_device {
 	struct list_head dev_list;	/* device_list_mutex */
 	struct list_head dev_alloc_list; /* chunk mutex */
@@ -66,6 +68,8 @@ struct btrfs_device {

 	struct block_device *bdev;

+	struct btrfs_zoned_device_info *zone_info;
+
 	/* the mode sent to blkdev_get */
 	fmode_t mode;

From patchwork Fri Dec 13 04:08:50 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289803
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 03/28] btrfs: Check and enable HMZONED mode
Date: Fri, 13 Dec 2019 13:08:50 +0900
Message-Id: <20191213040915.3502922-4-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>

HMZONED mode cannot be used together with the RAID5/6 profile for now.
Introduce the function btrfs_check_hmzoned_mode() to check this. This
function will also check if the HMZONED flag is enabled on the file
system and if the file system consists of zoned devices with equal
zone size.

Additionally, as updates to the space cache are done in place, the
space cache cannot be located over sequential zones, and there are no
guarantees that the device will have enough conventional zones to
store this cache. Resolve this problem by completely disabling the
space cache. This does not introduce any problems in HMZONED mode: all
the free space is located after the allocation pointer and no free
space is located before the pointer. There is no need for such a
cache. For the same reason, NODATACOW is also disabled. INODE_MAP_CACHE
is also disabled to avoid preallocation in the INODE_MAP_CACHE inode.
In summary, HMZONED will disable:

| Disabled features | Reason                                              |
|-------------------+-----------------------------------------------------|
| RAID5/6           | 1) Non-full stripe writes cause overwriting of      |
|                   |    parity blocks                                    |
|                   | 2) Rebuilding on high capacity volumes (usually     |
|                   |    with SMR) can lead to higher failure rates       |
|-------------------+-----------------------------------------------------|
| space_cache (v1)  | In-place updating                                   |
| NODATACOW         | In-place updating                                   |
|-------------------+-----------------------------------------------------|
| fallocate         | Reserved extent will be a write hole                |
| INODE_MAP_CACHE   | Need pre-allocation (and will be deprecated?)       |
|-------------------+-----------------------------------------------------|
| MIXED_BG          | Allocated metadata region will be write holes for   |
|                   |    data writes                                      |
| async checksum    | Not to mix up bios by multiple workers              |

Signed-off-by: Damien Le Moal
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/ctree.h       |  3 ++
 fs/btrfs/dev-replace.c |  8 +++++
 fs/btrfs/disk-io.c     |  8 +++++
 fs/btrfs/hmzoned.c     | 77 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     | 26 ++++++++++++++
 fs/btrfs/super.c       |  1 +
 fs/btrfs/volumes.c     |  5 +++
 7 files changed, 128 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b2e8fd8a8e59..44517802b9e5 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -541,6 +541,9 @@ struct btrfs_fs_info {
 	struct btrfs_root *uuid_root;
 	struct btrfs_root *free_space_root;

+	/* Zone size when in HMZONED mode */
+	u64 zone_size;
+
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index f639dde2a679..9286c6e0b636 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -21,6 +21,7 @@
 #include "rcu-string.h"
 #include "dev-replace.h"
 #include "sysfs.h"
+#include "hmzoned.h"

 static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, int
		scrub_ret);
@@ -202,6 +203,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 		return PTR_ERR(bdev);
 	}

+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		btrfs_err(fs_info,
+			  "zone type of target device mismatch with the filesystem!");
+		ret = -EINVAL;
+		goto error;
+	}
+
 	sync_blockdev(bdev);

 	devices = &fs_info->fs_devices->devices;

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e0edfdc9c82b..ff418e393f82 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -41,6 +41,7 @@
 #include "tree-checker.h"
 #include "ref-verify.h"
 #include "block-group.h"
+#include "hmzoned.h"

 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -3082,6 +3083,13 @@ int __cold open_ctree(struct super_block *sb,

 	btrfs_free_extra_devids(fs_devices, 1);

+	ret = btrfs_check_hmzoned_mode(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to init hmzoned mode: %d",
+			  ret);
+		goto fail_block_groups;
+	}
+
 	ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
 	if (ret) {
 		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 6a13763d2916..0182bfb9c903 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -166,3 +166,80 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,

 	return 0;
 }
+
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 hmzoned_devices = 0;
+	u64 nr_devices = 0;
+	u64 zone_size = 0;
+	int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
+	int ret = 0;
+
+	/* Count zoned devices */
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		enum blk_zoned_model model;
+
+		if (!device->bdev)
+			continue;
+
+		model = bdev_zoned_model(device->bdev);
+		if (model == BLK_ZONED_HM ||
+		    (model == BLK_ZONED_HA && incompat_hmzoned)) {
+			hmzoned_devices++;
+			if (!zone_size) {
+				zone_size = device->zone_info->zone_size;
+			} else if (device->zone_info->zone_size != zone_size) {
+				btrfs_err(fs_info,
+		"Zoned block devices must have equal zone sizes");
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+		nr_devices++;
+	}
+
+	if (!hmzoned_devices && !incompat_hmzoned)
+		goto out;
+
+	if (!hmzoned_devices && incompat_hmzoned) {
+		/* No zoned block device found on HMZONED FS */
+		btrfs_err(fs_info,
+			  "HMZONED enabled file system should have zoned devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (hmzoned_devices && !incompat_hmzoned) {
+		btrfs_err(fs_info,
+			  "Enable HMZONED mode to mount HMZONED device");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (hmzoned_devices != nr_devices) {
+		btrfs_err(fs_info,
+			  "zoned devices cannot be mixed with regular devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
+	 * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size,
+	 * check the alignment here.
+	 */
+	if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) {
+		btrfs_err(fs_info,
+			  "zone size is not aligned to BTRFS_STRIPE_LEN");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	fs_info->zone_size = zone_size;
+
+	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
+		   fs_info->zone_size);
+out:
+	return ret;
+}

diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 0f8006f39aaf..8e17f64ff986 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -9,6 +9,8 @@
 #ifndef BTRFS_HMZONED_H
 #define BTRFS_HMZONED_H

+#include
+
 struct btrfs_zoned_device_info {
 	/*
 	 * Number of zones, zone size and types of zones if bdev is a
@@ -26,6 +28,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 		       struct blk_zone *zone);
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -37,6 +40,14 @@ static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
 	return 0;
 }
 static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { }
+static inline int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	btrfs_err(fs_info, "Zoned block devices support is not enabled");
+	return -EOPNOTSUPP;
+}
 #endif

 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -89,4 +100,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
 	btrfs_dev_set_empty_zone_bit(device, pos, false);
 }

+static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
+						struct block_device *bdev)
+{
+	u64 zone_size;
+
+	if (btrfs_fs_incompat(fs_info, HMZONED)) {
+		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
+		/* Do not allow non-zoned device */
+		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
+	}
+
+	/* Do not allow Host Manged zoned device */
+	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
+}
+
 #endif

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a98c3c71fc54..616f5abec267 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -44,6 +44,7 @@
 #include "backref.h"
 #include "space-info.h"
 #include "sysfs.h"
+#include "hmzoned.h"
 #include "tests/btrfs-tests.h"
 #include "block-group.h"

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 18ea8dfce244..ab3590b310af 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2395,6 +2395,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	if (IS_ERR(bdev))
 		return PTR_ERR(bdev);

+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		ret = -EINVAL;
+		goto error;
+	}
+
 	if (fs_devices->seeding) {
 		seeding_dev = 1;
 		down_write(&sb->s_umount);

From patchwork Fri Dec 13 04:08:51 2019
Naohiro Aota X-Patchwork-Id: 11289807 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BEA4E1593 for ; Fri, 13 Dec 2019 04:10:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 9C81F2073B for ; Fri, 13 Dec 2019 04:10:49 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="nzq6I9eP" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731746AbfLMEKs (ORCPT ); Thu, 12 Dec 2019 23:10:48 -0500 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:11856 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727299AbfLMEKq (ORCPT ); Thu, 12 Dec 2019 23:10:46 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1576210246; x=1607746246; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=nRaA3JRYzcP6k4TuIo2+r6HiEqDsmoiNiX7kFHYjflM=; b=nzq6I9ePgZIQpx07M/eAACSO7Ftn/VpTfxRbnYUF3FfziaVPNv4QJYVn f+VnwUSL6Oq9nd5iEpbhtRrOqGxBCMYuSweOpn+ALRwipgskM5Hr0bsfC BBNcjbc/3eT1rKhwF6hQ/iVoWT0eZ9w/3RUtKqh28odINhSGYAjmYp2yG VVBnXjugqJSwTJYiG0lV8q8gEKa278VRu4JBwymLgD4W7lMQCqOJgC8KK PlRaY3QH5jQoR76CB7xS6lDsFMwj7wm9P/ZfuWCyrV3bPvf6XN+MMl5Cw FiHSwObjnVZK2JkX4mEgtQkDG6YFCqE0Im2vWJDRLIvm7V13jVs8gkFuq A==; IronPort-SDR: UHsSSmVe467NpPPvy5vY/9VIlOgEHmwIZGy+/h10+bPyafHrjXWEryVf+D5EAVRIWlDmNV+nVU oyYArR0/Maki13gpd5uTCkzgUOazePkSPzA+me2MES5/D92KfVMqQJCsdL3SpXd/zFIlqZ3grB 0mvM847u5qQWMthndGXSEa8A8CPtcRoCmfi4bQLSwUVFxh8/wUQobG/5WUPUnCKhJIhmZma3vH sroAY2xOToZ+wYWg9UAHQ3yZP0cioyD5VdC/B/4tRdY6sVuS7BwwtHGCzSoeKWezaLvfhMbJLZ TVM= X-IronPort-AV: E=Sophos;i="5.69,308,1571673600"; d="scan'208";a="126860105" Received: from uls-op-cesaip01.wdc.com (HELO 
uls-op-cesaep01.wdc.com) ([199.255.45.14]) by ob1.hgst.iphmx.com with ESMTP; 13 Dec 2019 12:10:46 +0800 IronPort-SDR: xNi0y9EisKfpA2mi+kfhd32uLWHjtAfL5E0P0MYA971+WdBPgnbSYi876R4aa57GWvtZmdyfX1 NNgNWlSb2PcX4npfvk6sJ5gqIztETFABauh5M2m5mTxGcmhw1FHi1LP+l+eUCxRF0LOzroO7Rj 5jGQ3v0CYy8JTDzj+0BZajMoRYWK56E3jSyuTbI8vxsCYhxPCV2wRMDe7uN+UMaObmcZoBUuWL JvvNl7OlLHG90hPn6nlpYrAgBVr17OKThnUHzJhGGNg7+HdILNglh3ukUfcSiErlEFIH7ztBX2 DA1ZopbFlaWkzGVyK1L7i7Dx Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep01.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 12 Dec 2019 20:05:17 -0800 IronPort-SDR: r+5NYvY5cw/V/r0PJqKMkexmk3ysWt2anI4Ehrh3A5/Q0sAHusfBIAMXprbsH4iwocL8/gn+Qq CUD+67IBM1lTe4VUjwdk11naKe1RD952YS+bKbk1Rp6K5odac4MrXKt3cCACc+i5W2FM2+Bzaz gKjUa1xvMy1r/HzyAGRDW1/yJJfnMq+NP1wehlc/+2vFgGbhBaTFEUNLzvHg5QZ9HUPA6TUmB1 8TS8b48ZWudu5l8j8AsNp7MFFpVce/9N4qo84CxnUSmZiP4PmxJb73IVVVSd5DlChQQfLhAkCV 7CA= WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com (HELO naota.fujisawa.hgst.com) ([10.149.53.115]) by uls-op-cesaip02.wdc.com with ESMTP; 12 Dec 2019 20:10:44 -0800 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v6 04/28] btrfs: disallow RAID5/6 in HMZONED mode Date: Fri, 13 Dec 2019 13:08:51 +0900 Message-Id: <20191213040915.3502922-5-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com> References: <20191213040915.3502922-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Supporting the RAID5/6 profile in HMZONED mode is not trivial. For example, non-full stripe writes will cause overwriting parity blocks. 
When we do a non-full stripe write, the parity block is written with the data present at that moment. A later write to the same stripes must then overwrite the parity block with the new parity value, but sequential write required zones do not allow such overwriting. Furthermore, using RAID5/6 on SMR drives, which usually have a huge capacity, incurs a large rebuild overhead. The increased rebuild time in turn leads to a higher volume failure rate (e.g. an additional drive failing during the rebuild). Thus, disable the RAID5/6 profiles in HMZONED mode for now.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/hmzoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c index 0182bfb9c903..1b24facd46b8 100644 --- a/fs/btrfs/hmzoned.c +++ b/fs/btrfs/hmzoned.c @@ -236,6 +236,13 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info) goto out; } + /* RAID56 is not allowed */ + if (btrfs_fs_incompat(fs_info, RAID56)) { + btrfs_err(fs_info, "HMZONED mode does not support RAID56"); + ret = -EINVAL; + goto out; + } + fs_info->zone_size = zone_size; btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",

From patchwork Fri Dec 13 04:08:52 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289811
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 05/28] btrfs: disallow space_cache in HMZONED mode
Date: Fri, 13 Dec 2019 13:08:52 +0900
Message-Id: <20191213040915.3502922-6-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
References: <20191213040915.3502922-1-naohiro.aota@wdc.com>

Since updates to space cache v1 are done in place, the cache cannot be located in sequential zones, and there is no guarantee that the device has enough conventional zones to store it. Resolve this problem by completely disabling space cache v1. This introduces no problem for sequential block groups: all free space is located after the allocation pointer and there is none before it, so no such cache is needed. Note: the free space tree (space cache v2) could technically be used in HMZONED mode. But since HMZONED mode now always allocates extents within a block group sequentially, regardless of the underlying device zone type, there is no point in enabling and maintaining the tree.
Signed-off-by: Naohiro Aota
---
 fs/btrfs/hmzoned.c | 18 ++++++++++++++++++
 fs/btrfs/hmzoned.h |  5 +++++
 fs/btrfs/super.c   | 11 +++++++++--
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c index 1b24facd46b8..d62f11652973 100644 --- a/fs/btrfs/hmzoned.c +++ b/fs/btrfs/hmzoned.c @@ -250,3 +250,21 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info) out: return ret; } + +int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info) +{ + if (!btrfs_fs_incompat(info, HMZONED)) + return 0; + + /* + * SPACE CACHE writing is not CoWed. Disable that to avoid write + * errors in sequential zones. + */ + if (btrfs_test_opt(info, SPACE_CACHE)) { + btrfs_err(info, + "space cache v1 not supported in HMZONED mode"); + return -EOPNOTSUPP; + } + + return 0; +} diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index 8e17f64ff986..d9ebe11afdf5 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -29,6 +29,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, int btrfs_get_dev_zone_info(struct btrfs_device *device); void btrfs_destroy_dev_zone_info(struct btrfs_device *device); int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info); +int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -48,6 +49,10 @@ static inline int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info) btrfs_err(fs_info, "Zoned block devices support is not enabled"); return -EOPNOTSUPP; } +static inline int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info) +{ + return 0; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 616f5abec267..1424c3c6e3cf 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -442,8 +442,13 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, cache_gen = btrfs_super_cache_generation(info->super_copy); if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE)) btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE); - else if (cache_gen) - btrfs_set_opt(info->mount_opt, SPACE_CACHE); + else if (cache_gen) { + if (btrfs_fs_incompat(info, HMZONED)) + btrfs_info(info, + "ignoring existing space cache in HMZONED mode"); + else + btrfs_set_opt(info->mount_opt, SPACE_CACHE); + } /* * Even the options are empty, we still need to do extra check @@ -879,6 +884,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, ret = -EINVAL; } + if (!ret) + ret = btrfs_check_mountopts_hmzoned(info); if (!ret && btrfs_test_opt(info, SPACE_CACHE)) btrfs_info(info, "disk space caching is enabled"); if (!ret && btrfs_test_opt(info, FREE_SPACE_TREE))

From patchwork Fri Dec 13 04:08:53 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289815
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 06/28] btrfs: disallow NODATACOW in HMZONED mode
Date: Fri, 13 Dec 2019 13:08:53 +0900
Message-Id: <20191213040915.3502922-7-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
References: <20191213040915.3502922-1-naohiro.aota@wdc.com>

NODATACOW implies overwriting file data in place on a device, which is impossible in sequential write required zones. Disable NODATACOW both globally (the mount option) and per file (the NODATACOW attribute) by masking out FS_NOCOW_FL.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/hmzoned.c | 6 ++++++
 fs/btrfs/ioctl.c   | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c index d62f11652973..21b8737dd289 100644 --- a/fs/btrfs/hmzoned.c +++ b/fs/btrfs/hmzoned.c @@ -266,5 +266,11 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info) return -EOPNOTSUPP; } + if (btrfs_test_opt(info, NODATACOW)) { + btrfs_err(info, + "cannot enable nodatacow with HMZONED mode"); + return -EOPNOTSUPP; + } + return 0; } diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index a1ee0b775e65..a67421eb8bd5 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -94,6 +94,9 @@ static int btrfs_clone(struct inode *src, struct inode *inode, static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode, unsigned int flags) { + if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) + flags &= ~FS_NOCOW_FL; + if (S_ISDIR(inode->i_mode)) return flags; else if (S_ISREG(inode->i_mode))

From patchwork Fri Dec 13 04:08:54 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289819
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 07/28] btrfs: disable fallocate in HMZONED mode
Date: Fri, 13 Dec 2019 13:08:54 +0900
Message-Id: <20191213040915.3502922-8-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
References: <20191213040915.3502922-1-naohiro.aota@wdc.com>

fallocate()
is implemented by reserving actual extents rather than mere reservations. This can expose the sequential write constraint of host-managed zoned block devices to applications, which would break the POSIX semantics of a fallocated file. To avoid this, report fallocate() as not supported in HMZONED mode for now. In the future, an "in-memory" fallocate() may be possible in HMZONED mode by using space_info->bytes_may_use or similar.

Reviewed-by: Johannes Thumshirn
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 0cb43b682789..22373d00428b 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -3170,6 +3170,10 @@ static long btrfs_fallocate(struct file *file, int mode, alloc_end = round_up(offset + len, blocksize); cur_offset = alloc_start; + /* Do not allow fallocate in HMZONED mode */ + if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) + return -EOPNOTSUPP; + /* Make sure we aren't being give some crap mode */ if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE | FALLOC_FL_ZERO_RANGE))

From patchwork Fri Dec 13 04:08:55 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289827
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 08/28] btrfs: implement log-structured superblock for HMZONED mode
Date: Fri, 13 Dec 2019 13:08:55 +0900
Message-Id: <20191213040915.3502922-9-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
References: <20191213040915.3502922-1-naohiro.aota@wdc.com>

The superblock (and its copies) is the only data structure in btrfs with a fixed location on a device. Since a sequential write required zone cannot be overwritten, the superblock cannot be placed in such a zone. One easy solution is to restrict the superblock and its copies to conventional zones, but this has two downsides. First, it reduces the number of superblock copies: the second copy is located at 256GB, which falls in a sequential write required zone on typical devices on the market today, so only the superblock and one copy would remain. Second, devices with no conventional zones at all could not be supported. To solve both problems, employ superblock log writing: two zones are used as a circular buffer for writing updated superblocks.
Once the first zone fills up, writing continues into the second zone and the first one is reset. The position of the latest superblock can be determined by reading the write pointer information from the device. The following zones are reserved as the circular buffer on HMZONED btrfs:

- The primary superblock: zones 0 and 1
- The first copy: zones 16 and 17
- The second copy: zone min(256GB / zone_size, 1024) and the zone next to it

Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/block-group.c |   9 ++
 fs/btrfs/disk-io.c     |  19 ++-
 fs/btrfs/hmzoned.c     | 276 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     |  40 ++++++
 fs/btrfs/scrub.c       |   3 +
 fs/btrfs/volumes.c     |  18 ++-
 6 files changed, 354 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 6934a5b8708f..acfa0a9d3c5a 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1519,6 +1519,7 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags) static int exclude_super_stripes(struct btrfs_block_group *cache) { struct btrfs_fs_info *fs_info = cache->fs_info; + bool hmzoned = btrfs_fs_incompat(fs_info, HMZONED); u64 bytenr; u64 *logical; int stripe_len; @@ -1549,6 +1550,14 @@ static int exclude_super_stripes(struct btrfs_block_group *cache) if (logical[nr] + stripe_len <= cache->start) continue; + /* shouldn't have super stripes in sequential zones */ + if (hmzoned) { + btrfs_err(fs_info, + "sequential allocation bg %llu should not have super blocks", + cache->start); + return -EUCLEAN; + } + start = logical[nr]; if (start < cache->start) { start = cache->start; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index ff418e393f82..deca9fd70771 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3386,8 +3386,12 @@ int btrfs_read_dev_one_super(struct block_device *bdev, int copy_num, struct buffer_head *bh; struct btrfs_super_block *super; u64 bytenr; + u64 bytenr_orig; + + bytenr_orig = btrfs_sb_offset(copy_num);
+ if (btrfs_sb_log_location_bdev(bdev, copy_num, READ, &bytenr)) + return -EUCLEAN; - bytenr = btrfs_sb_offset(copy_num); if (bytenr + BTRFS_SUPER_INFO_SIZE >= i_size_read(bdev->bd_inode)) return -EINVAL; @@ -3400,7 +3404,7 @@ int btrfs_read_dev_one_super(struct block_device *bdev, int copy_num, return -EIO; super = (struct btrfs_super_block *)bh->b_data; - if (btrfs_super_bytenr(super) != bytenr || + if (btrfs_super_bytenr(super) != bytenr_orig || btrfs_super_magic(super) != BTRFS_MAGIC) { brelse(bh); return -EINVAL; @@ -3466,7 +3470,7 @@ static int write_dev_supers(struct btrfs_device *device, int i; int ret; int errors = 0; - u64 bytenr; + u64 bytenr, bytenr_orig; int op_flags; if (max_mirrors == 0) @@ -3475,12 +3479,13 @@ static int write_dev_supers(struct btrfs_device *device, shash->tfm = fs_info->csum_shash; for (i = 0; i < max_mirrors; i++) { - bytenr = btrfs_sb_offset(i); + bytenr_orig = btrfs_sb_offset(i); + bytenr = btrfs_sb_log_location(device, i, WRITE); if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->commit_total_bytes) break; - btrfs_set_super_bytenr(sb, bytenr); + btrfs_set_super_bytenr(sb, bytenr_orig); crypto_shash_init(shash); crypto_shash_update(shash, (const char *)sb + BTRFS_CSUM_SIZE, @@ -3518,6 +3523,8 @@ static int write_dev_supers(struct btrfs_device *device, ret = btrfsic_submit_bh(REQ_OP_WRITE, op_flags, bh); if (ret) errors++; + else if (btrfs_advance_sb_log(device, i)) + errors++; } return errors < i ? 
0 : -1; } @@ -3541,7 +3548,7 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) max_mirrors = BTRFS_SUPER_MIRROR_MAX; for (i = 0; i < max_mirrors; i++) { - bytenr = btrfs_sb_offset(i); + bytenr = btrfs_sb_log_location(device, i, READ); if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->commit_total_bytes) break; diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c index 21b8737dd289..a74011650145 100644 --- a/fs/btrfs/hmzoned.c +++ b/fs/btrfs/hmzoned.c @@ -16,6 +16,26 @@ /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 +static int sb_write_pointer(struct blk_zone *zone, u64 *wp_ret); + +static inline u32 sb_zone_number(u64 zone_size, int mirror) +{ + ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX); + + switch (mirror) { + case 0: + return 0; + case 1: + return 16; + case 2: + return min(btrfs_sb_offset(mirror) / zone_size, 1024ULL); + default: + BUG(); + } + + return 0; +} + static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos, struct blk_zone *zones, unsigned int *nr_zones) { @@ -109,6 +129,39 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device) goto free_zones; } + nr_zones = 2; + for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) { + u32 sb_zone = sb_zone_number(zone_info->zone_size, i); + u64 sb_wp; + + if (sb_zone + 1 >= zone_info->nr_zones) + continue; + + sector = sb_zone << (zone_info->zone_size_shift - SECTOR_SHIFT); + ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, + &zone_info->sb_zones[2 * i], + &nr_zones); + if (ret) + goto free_zones; + if (nr_zones != 2) { + btrfs_err_in_rcu(device->fs_info, + "failed to read SB log zone info at device %s zone %u", + rcu_str_deref(device->name), sb_zone); + ret = -EIO; + goto free_zones; + } + + ret = sb_write_pointer(&zone_info->sb_zones[2 * i], &sb_wp); + if (ret != -ENOENT && ret) { + btrfs_err_in_rcu(device->fs_info, + "SB log zone corrupted: device %s zone %u", + rcu_str_deref(device->name), sb_zone); + ret = -EUCLEAN; 
+ goto free_zones; + } + } + + kfree(zones); device->zone_info = zone_info; @@ -274,3 +327,226 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info) return 0; } + +static int sb_write_pointer(struct blk_zone *zones, u64 *wp_ret) +{ + bool empty[2]; + bool full[2]; + sector_t sector; + + if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) { + *wp_ret = zones[0].start << SECTOR_SHIFT; + return -ENOENT; + } + + empty[0] = zones[0].cond == BLK_ZONE_COND_EMPTY; + empty[1] = zones[1].cond == BLK_ZONE_COND_EMPTY; + full[0] = zones[0].cond == BLK_ZONE_COND_FULL; + full[1] = zones[1].cond == BLK_ZONE_COND_FULL; + + /* + * Possible state of log buffer zones + * + * E I F + * E * x 0 + * I 0 x 0 + * F 1 1 x + * + * Row: zones[0] + * Col: zones[1] + * State: + * E: Empty, I: In-Use, F: Full + * Log position: + * *: Special case, no superblock is written + * 0: Use write pointer of zones[0] + * 1: Use write pointer of zones[1] + * x: Invalid state + */ + + if (empty[0] && empty[1]) { + /* special case to distinguish no superblock to read */ + *wp_ret = zones[0].start << SECTOR_SHIFT; + return -ENOENT; + } else if (full[0] && full[1]) { + /* cannot determine which zone has the newer superblock */ + return -EUCLEAN; + } else if (!full[0] && (empty[1] || full[1])) { + sector = zones[0].wp; + } else if (full[0]) { + sector = zones[1].wp; + } else { + return -EUCLEAN; + } + *wp_ret = sector << SECTOR_SHIFT; + return 0; +} + +int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw, + u64 *bytenr_ret) +{ + struct blk_zone zones[2]; + unsigned int nr_zones_rep = 2; + unsigned int zone_sectors; + u32 sb_zone; + int ret; + u64 wp; + u64 zone_size; + u8 zone_sectors_shift; + sector_t nr_sectors = bdev->bd_part->nr_sects; + u32 nr_zones; + + if (!bdev_is_zoned(bdev)) { + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; + } + + ASSERT(rw == READ || rw == WRITE); + + zone_sectors = bdev_zone_sectors(bdev); + if (!is_power_of_2(zone_sectors)) + return -EINVAL; + 
zone_size = zone_sectors << SECTOR_SHIFT; + zone_sectors_shift = ilog2(zone_sectors); + nr_zones = nr_sectors >> zone_sectors_shift; + + sb_zone = sb_zone_number(zone_size, mirror); + if (sb_zone + 1 >= nr_zones) + return -ENOENT; + + ret = blkdev_report_zones(bdev, sb_zone << zone_sectors_shift, zones, + &nr_zones_rep); + if (ret) + return ret; + if (nr_zones_rep != 2) + return -EIO; + + ret = sb_write_pointer(zones, &wp); + if (ret != -ENOENT && ret) + return -EUCLEAN; + + if (rw == READ && ret != -ENOENT) { + if (wp == zones[0].start << SECTOR_SHIFT) + wp = (zones[1].start + zones[1].len) << SECTOR_SHIFT; + wp -= BTRFS_SUPER_INFO_SIZE; + } + *bytenr_ret = wp; + + return 0; +} + +u64 btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + u64 base, wp; + u32 zone_num; + int ret; + + if (!zinfo) + return btrfs_sb_offset(mirror); + + zone_num = sb_zone_number(zinfo->zone_size, mirror); + if (zone_num + 1 >= zinfo->nr_zones) + return U64_MAX - BTRFS_SUPER_INFO_SIZE; + + base = (u64)zone_num << zinfo->zone_size_shift; + if (!test_bit(zone_num, zinfo->seq_zones)) + return base; + + /* sb_zones should be kept valid during runtime */ + ret = sb_write_pointer(&zinfo->sb_zones[2 * mirror], &wp); + if (ret != -ENOENT && ret) + return U64_MAX - BTRFS_SUPER_INFO_SIZE; + if (rw == WRITE || ret == -ENOENT) + return wp; + if (wp == base) + wp = base + zinfo->zone_size * 2; + return wp - BTRFS_SUPER_INFO_SIZE; +} + +static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo, + int mirror) +{ + u32 zone_num; + + if (!zinfo) + return false; + + zone_num = sb_zone_number(zinfo->zone_size, mirror); + if (zone_num + 1 >= zinfo->nr_zones) + return false; + + if (!test_bit(zone_num, zinfo->seq_zones)) + return false; + + return true; +} + +int btrfs_advance_sb_log(struct btrfs_device *device, int mirror) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + struct blk_zone *zone; + 
struct blk_zone *reset = NULL; + int ret; + + if (!is_sb_log_zone(zinfo, mirror)) + return 0; + + zone = &zinfo->sb_zones[2 * mirror]; + if (zone->cond != BLK_ZONE_COND_FULL) { + if (zone->cond == BLK_ZONE_COND_EMPTY) + zone->cond = BLK_ZONE_COND_IMP_OPEN; + zone->wp += (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT); + if (zone->wp == zone->start + zone->len) { + zone->cond = BLK_ZONE_COND_FULL; + reset = zone + 1; + goto reset; + } + return 0; + } + + zone++; + ASSERT(zone->cond != BLK_ZONE_COND_FULL); + if (zone->cond == BLK_ZONE_COND_EMPTY) + zone->cond = BLK_ZONE_COND_IMP_OPEN; + zone->wp += (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT); + if (zone->wp == zone->start + zone->len) { + zone->cond = BLK_ZONE_COND_FULL; + reset = zone - 1; + } + +reset: + if (!reset || reset->cond == BLK_ZONE_COND_EMPTY) + return 0; + + ASSERT(reset->cond == BLK_ZONE_COND_FULL); + + ret = blkdev_reset_zones(device->bdev, reset->start, reset->len, + GFP_NOFS); + if (!ret) { + reset->cond = BLK_ZONE_COND_EMPTY; + reset->wp = reset->start; + } + return ret; +} + +int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror) +{ + sector_t zone_sectors; + sector_t nr_sectors = bdev->bd_part->nr_sects; + u8 zone_sectors_shift; + u32 sb_zone; + u32 nr_zones; + + zone_sectors = bdev_zone_sectors(bdev); + zone_sectors_shift = ilog2(zone_sectors); + nr_zones = nr_sectors >> zone_sectors_shift; + + sb_zone = sb_zone_number(zone_sectors << SECTOR_SHIFT, mirror); + if (sb_zone + 1 >= nr_zones) + return -ENOENT; + + return blkdev_reset_zones(bdev, + sb_zone << zone_sectors_shift, + zone_sectors * 2, + GFP_NOFS); +} diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index d9ebe11afdf5..55041a26ae3c 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -10,6 +10,8 @@ #define BTRFS_HMZONED_H #include +#include "volumes.h" +#include "disk-io.h" struct btrfs_zoned_device_info { /* @@ -21,6 +23,7 @@ struct btrfs_zoned_device_info { u32 nr_zones; unsigned long *seq_zones; unsigned long *empty_zones; + 
struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX]; }; #ifdef CONFIG_BLK_DEV_ZONED @@ -30,6 +33,11 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device); void btrfs_destroy_dev_zone_info(struct btrfs_device *device); int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info); int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info); +int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw, + u64 *bytenr_ret); +u64 btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw); +int btrfs_advance_sb_log(struct btrfs_device *device, int mirror); +int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -53,6 +61,27 @@ static inline int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info) { return 0; } +static inline int btrfs_sb_log_location_bdev(struct block_device *bdev, + int mirror, int rw, + u64 *bytenr_ret) +{ + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; +} +static inline u64 btrfs_sb_log_location(struct btrfs_device *device, int mirror, + int rw) +{ + return btrfs_sb_offset(mirror); +} +static inline int btrfs_advance_sb_log(struct btrfs_device *device, int mirror) +{ + return 0; +} +static inline int btrfs_reset_sb_log_zones(struct block_device *bdev, + int mirror) +{ + return 0; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -120,4 +149,15 @@ static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info, return bdev_zoned_model(bdev) != BLK_ZONED_HM; } +static inline bool btrfs_check_super_location(struct btrfs_device *device, + u64 pos) +{ + /* + * On a non-zoned device, any address is OK. On a zoned + * device, non-SEQUENTIAL WRITE REQUIRED zones are capable. 
+ */ + return device->zone_info == NULL || + !btrfs_dev_is_sequential(device, pos); +} + #endif diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 21de630b0730..af7cec962619 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -19,6 +19,7 @@ #include "rcu-string.h" #include "raid56.h" #include "block-group.h" +#include "hmzoned.h" /* * This is only the first step towards a full-features scrub. It reads all @@ -3709,6 +3710,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx, if (bytenr + BTRFS_SUPER_INFO_SIZE > scrub_dev->commit_total_bytes) break; + if (!btrfs_check_super_location(scrub_dev, bytenr)) + continue; ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr, scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i, diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index ab3590b310af..a260648cecca 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1218,12 +1218,17 @@ static void btrfs_release_disk_super(struct page *page) put_page(page); } -static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr, +static int btrfs_read_disk_super(struct block_device *bdev, int mirror, struct page **page, struct btrfs_super_block **disk_super) { void *p; pgoff_t index; + u64 bytenr; + u64 bytenr_orig = btrfs_sb_offset(mirror); + + if (btrfs_sb_log_location_bdev(bdev, 0, READ, &bytenr)) + return 1; /* make sure our super fits in the device */ if (bytenr + PAGE_SIZE >= i_size_read(bdev->bd_inode)) @@ -1250,7 +1255,7 @@ static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr, /* align our pointer to the offset of the super block */ *disk_super = p + offset_in_page(bytenr); - if (btrfs_super_bytenr(*disk_super) != bytenr || + if (btrfs_super_bytenr(*disk_super) != bytenr_orig || btrfs_super_magic(*disk_super) != BTRFS_MAGIC) { btrfs_release_disk_super(*page); return 1; @@ -1287,7 +1292,6 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, struct btrfs_device *device = NULL; struct 
block_device *bdev; struct page *page; - u64 bytenr; lockdep_assert_held(&uuid_mutex); @@ -1297,14 +1301,13 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, * So, we need to add a special mount option to scan for * later supers, using BTRFS_SUPER_MIRROR_MAX instead */ - bytenr = btrfs_sb_offset(0); flags |= FMODE_EXCL; bdev = blkdev_get_by_path(path, flags, holder); if (IS_ERR(bdev)) return ERR_CAST(bdev); - if (btrfs_read_disk_super(bdev, bytenr, &page, &disk_super)) { + if (btrfs_read_disk_super(bdev, 0, &page, &disk_super)) { device = ERR_PTR(-EINVAL); goto error_bdev_put; } @@ -7371,6 +7374,11 @@ void btrfs_scratch_superblocks(struct block_device *bdev, const char *device_pat if (btrfs_read_dev_one_super(bdev, copy_num, &bh)) continue; + if (bdev_is_zoned(bdev)) { + btrfs_reset_sb_log_zones(bdev, copy_num); + continue; + } + disk_super = (struct btrfs_super_block *)bh->b_data; memset(&disk_super->magic, 0, sizeof(disk_super->magic));

From patchwork Fri Dec 13 04:08:56 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289821
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 09/28] btrfs: align device extent allocation to zone boundary
Date: Fri, 13 Dec 2019 13:08:56 +0900
Message-Id: <20191213040915.3502922-10-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>

In HMZONED mode, align the device extents to zone boundaries so that a zone reset affects only the device extent being reset and does not change the state of blocks in neighboring device extents. Also check that a region being allocated always covers empty zones and does not overlap any super block zone locations. This patch also adds a verification to verify_one_dev_extent() to check that a device extent is aligned to the zone boundary.
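The allocation rules stated above can be sketched in isolation. The following user-space model is illustrative only: the flag names and the `region_allocatable()` helper are invented for the example and are not the kernel's `btrfs_check_allocatable_zones()`. A candidate region must be zone-aligned, lie inside the device, stay on a single zone type, touch only empty sequential zones, and avoid the super block zones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative per-zone flags (not the kernel's bitmap layout). */
#define ZONE_SEQ   (1u << 0) /* SEQUENTIAL WRITE REQUIRED zone */
#define ZONE_EMPTY (1u << 1) /* zone contains no data yet */
#define ZONE_SB    (1u << 2) /* reserved for a super block log */

/*
 * Check whether [pos, pos + len) may back a device extent: the
 * region must be zone aligned, fit inside the device, use a single
 * zone type, contain only empty sequential zones, and not overlap
 * a super block zone.
 */
static bool region_allocatable(const uint8_t *zones, uint32_t nr_zones,
			       uint64_t zone_size, uint64_t pos, uint64_t len)
{
	uint64_t begin, end, i;
	bool seq;

	if (pos % zone_size || len % zone_size)
		return false;
	begin = pos / zone_size;
	end = begin + len / zone_size;
	if (end > nr_zones)
		return false;

	seq = zones[begin] & ZONE_SEQ;
	for (i = begin; i < end; i++) {
		if (((zones[i] & ZONE_SEQ) != 0) != seq)
			return false; /* mixed zone types */
		if (seq && !(zones[i] & ZONE_EMPTY))
			return false; /* sequential zone already written */
		if (zones[i] & ZONE_SB)
			return false; /* overlaps a super block zone */
	}
	return true;
}
```

The kernel version operates on the device's `seq_zones`/`empty_zones` bitmaps with `find_next_bit()` rather than a per-zone byte array, but the accept/reject conditions are the same three listed in the comment above.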
Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/hmzoned.c | 55 ++++++++++++++++++++++++++++++++ fs/btrfs/hmzoned.h | 15 +++++++++ fs/btrfs/volumes.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 148 insertions(+) diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c index a74011650145..6263c8aee082 100644 --- a/fs/btrfs/hmzoned.c +++ b/fs/btrfs/hmzoned.c @@ -12,6 +12,7 @@ #include "volumes.h" #include "hmzoned.h" #include "rcu-string.h" +#include "disk-io.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 @@ -550,3 +551,57 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror) zone_sectors * 2, GFP_NOFS); } + +/* + * btrfs_check_allocatable_zones - check if the specified region is + * suitable for allocation + * @device: the device to allocate a region + * @pos: the position of the region + * @num_bytes: the size of the region + * + * On a non-ZONED device, anywhere is suitable for allocation. On a ZONED + * device, check that + * 1) the region is not on non-empty sequential zones, + * 2) all zones in the region have the same zone type, + * 3) it does not contain any super block location.
+ */ +bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos, + u64 num_bytes) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + u64 nzones, begin, end; + u64 sb_pos; + u8 shift; + int i; + + if (!zinfo) + return true; + + shift = zinfo->zone_size_shift; + nzones = num_bytes >> shift; + begin = pos >> shift; + end = begin + nzones; + + ASSERT(IS_ALIGNED(pos, zinfo->zone_size)); + ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size)); + + if (end > zinfo->nr_zones) + return false; + + /* check if zones in the region are all empty */ + if (btrfs_dev_is_sequential(device, pos) && + find_next_zero_bit(zinfo->empty_zones, end, begin) != end) + return false; + + if (btrfs_dev_is_sequential(device, pos)) { + for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) { + sb_pos = sb_zone_number(zinfo->zone_size, i); + if (!(end < sb_pos || sb_pos + 1 < begin)) + return false; + } + + return find_next_zero_bit(zinfo->seq_zones, end, begin) == end; + } + + return find_next_bit(zinfo->seq_zones, end, begin) == end; +} diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index 55041a26ae3c..d54b4ae8cf8b 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -38,6 +38,8 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw, u64 btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw); int btrfs_advance_sb_log(struct btrfs_device *device, int mirror); int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror); +bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos, + u64 num_bytes); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -82,6 +84,11 @@ static inline int btrfs_reset_sb_log_zones(struct block_device *bdev, { return 0; } +static inline bool btrfs_check_allocatable_zones(struct btrfs_device *device, + u64 pos, u64 num_bytes) +{ + return true; +} #endif static inline bool btrfs_dev_is_sequential(struct 
btrfs_device *device, u64 pos) @@ -160,4 +167,12 @@ static inline bool btrfs_check_super_location(struct btrfs_device *device, !btrfs_dev_is_sequential(device, pos); } +static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos) +{ + if (!device->zone_info) + return pos; + + return ALIGN(pos, device->zone_info->zone_size); +} + #endif diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index a260648cecca..d5b280b59733 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1393,6 +1393,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device, u64 max_hole_size; u64 extent_end; u64 search_end = device->total_bytes; + u64 zone_size = 0; int ret; int slot; struct extent_buffer *l; @@ -1403,6 +1404,15 @@ static int find_free_dev_extent_start(struct btrfs_device *device, * at an offset of at least 1MB. */ search_start = max_t(u64, search_start, SZ_1M); + /* + * For a zoned block device, skip the first zone of the device + * entirely. + */ + if (device->zone_info) { + zone_size = device->zone_info->zone_size; + search_start = max_t(u64, search_start, zone_size); + search_start = btrfs_zone_align(device, search_start); + } path = btrfs_alloc_path(); if (!path) @@ -1467,12 +1477,21 @@ static int find_free_dev_extent_start(struct btrfs_device *device, */ if (contains_pending_extent(device, &search_start, hole_size)) { + search_start = btrfs_zone_align(device, + search_start); if (key.offset >= search_start) hole_size = key.offset - search_start; else hole_size = 0; } + if (!btrfs_check_allocatable_zones(device, search_start, + num_bytes)) { + search_start += zone_size; + btrfs_release_path(path); + goto again; + } + if (hole_size > max_hole_size) { max_hole_start = search_start; max_hole_size = hole_size; @@ -1512,6 +1531,14 @@ static int find_free_dev_extent_start(struct btrfs_device *device, hole_size = search_end - search_start; if (contains_pending_extent(device, &search_start, hole_size)) { + search_start = btrfs_zone_align(device, 
search_start); + btrfs_release_path(path); + goto again; + } + + if (!btrfs_check_allocatable_zones(device, search_start, + num_bytes)) { + search_start += zone_size; btrfs_release_path(path); goto again; } @@ -1529,6 +1556,7 @@ static int find_free_dev_extent_start(struct btrfs_device *device, ret = 0; out: + ASSERT(zone_size == 0 || IS_ALIGNED(max_hole_start, zone_size)); btrfs_free_path(path); *start = max_hole_start; if (len) @@ -4778,6 +4806,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, int i; int j; int index; + bool hmzoned = btrfs_fs_incompat(info, HMZONED); BUG_ON(!alloc_profile_is_valid(type, 0)); @@ -4819,10 +4848,25 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, BUG(); } + if (hmzoned) { + max_stripe_size = info->zone_size; + max_chunk_size = round_down(max_chunk_size, info->zone_size); + } + /* We don't want a chunk larger than 10% of writable space */ max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1), max_chunk_size); + if (hmzoned) { + int min_num_stripes = devs_min * dev_stripes; + int min_data_stripes = (min_num_stripes - nparity) / ncopies; + u64 min_chunk_size = min_data_stripes * info->zone_size; + + max_chunk_size = max(round_down(max_chunk_size, + info->zone_size), + min_chunk_size); + } + devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info), GFP_NOFS); if (!devices_info) @@ -4857,6 +4901,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, if (total_avail == 0) continue; + if (hmzoned && total_avail < max_stripe_size * dev_stripes) + continue; + ret = find_free_dev_extent(device, max_stripe_size * dev_stripes, &dev_offset, &max_avail); @@ -4875,6 +4922,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, continue; } + if (hmzoned && max_avail < max_stripe_size * dev_stripes) + continue; + if (ndevs == fs_devices->rw_devices) { WARN(1, "%s: found more than %llu devices\n", __func__, fs_devices->rw_devices); @@ -4893,6 +4943,7 @@ static 
int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, sort(devices_info, ndevs, sizeof(struct btrfs_device_info), btrfs_cmp_device_info, NULL); +again: /* * Round down to number of usable stripes, devs_increment can be any * number so we can't use round_down() @@ -4934,6 +4985,17 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, * we try to reduce stripe_size. */ if (stripe_size * data_stripes > max_chunk_size) { + if (hmzoned) { + /* + * stripe_size is fixed in HMZONED. Reduce ndevs + * instead. + */ + ASSERT(nparity == 0); + ndevs = div_u64(max_chunk_size * ncopies, + stripe_size * dev_stripes); + goto again; + } + /* * Reduce stripe_size, round it up to a 16MB boundary again and * then use it, unless it ends up being even bigger than the @@ -4947,6 +5009,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, /* align to BTRFS_STRIPE_LEN */ stripe_size = round_down(stripe_size, BTRFS_STRIPE_LEN); + ASSERT(!hmzoned || stripe_size == info->zone_size); + map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS); if (!map) { ret = -ENOMEM; @@ -7541,6 +7605,20 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info, ret = -EUCLEAN; goto out; } + + if (dev->zone_info) { + u64 zone_size = dev->zone_info->zone_size; + + if (!IS_ALIGNED(physical_offset, zone_size) || + !IS_ALIGNED(physical_len, zone_size)) { + btrfs_err(fs_info, +"dev extent devid %llu physical offset %llu len %llu is not aligned to device zone", + devid, physical_offset, physical_len); + ret = -EUCLEAN; + goto out; + } + } + out: free_extent_map(em); return ret;

From patchwork Fri Dec 13 04:08:57 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289831
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 10/28] btrfs: do sequential extent allocation in HMZONED mode
Date: Fri, 13 Dec 2019 13:08:57 +0900
Message-Id: <20191213040915.3502922-11-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>

On HMZONED drives, writes must always be sequential and directed at a block group zone write pointer position. Thus, block allocation in a block group must also be done sequentially using an allocation pointer equal to the block group zone write pointer plus the number of blocks allocated but not yet written.
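The allocation pointer described above can be modeled with a small user-space sketch. This is illustrative only, not the kernel's find_free_extent_zoned(); the struct and helper names are invented. The key property it demonstrates is that the pointer only ever advances, which is why an allocation in this scheme cannot be reverted without leaving a hole behind the write pointer.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of sequential allocation within one block group
 * (illustrative names, not kernel code). alloc_offset plays the
 * role of "zone write pointer + allocated but not yet written".
 */
struct zoned_block_group {
	uint64_t start;	       /* logical start of the block group */
	uint64_t length;       /* block group size in bytes */
	uint64_t alloc_offset; /* next free byte, relative to start */
};

/* Allocate num_bytes sequentially; UINT64_MAX means "group full". */
static uint64_t seq_alloc(struct zoned_block_group *bg, uint64_t num_bytes)
{
	uint64_t bytenr;

	if (bg->alloc_offset + num_bytes > bg->length)
		return UINT64_MAX;
	bytenr = bg->start + bg->alloc_offset;
	bg->alloc_offset += num_bytes; /* the pointer only advances */
	return bytenr;
}
```

Because space freed behind alloc_offset can never be handed out again, such regions stay dead until the underlying zones are reset, which is what the zone_unusable accounting introduced by this patch tracks.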
The sequential allocation function find_free_extent_zoned() bypasses the checks in find_free_extent() and increases the reserved byte counter by itself. A region allocated by the sequential allocator can never be reverted: reverting it might race with other allocations and leave an allocation hole, which would break the sequential write rule.

Furthermore, this commit introduces two new variables to struct btrfs_block_group. "wp_broken" indicates that the write pointer is broken (e.g. not synced on a RAID1 block group) and marks that block group read-only. "zone_unusable" keeps track of the size of regions that were once allocated and then freed in a block group. Such regions are never usable until the underlying zones are reset.

This commit also introduces "bytes_zone_unusable" to track such unusable bytes in a space_info. Pinned bytes are always reclaimed to "bytes_zone_unusable"; they are not usable until they are reset first.

Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 74 ++++-- fs/btrfs/block-group.h | 11 + fs/btrfs/extent-tree.c | 80 +++++- fs/btrfs/free-space-cache.c | 38 +++ fs/btrfs/free-space-cache.h | 2 + fs/btrfs/hmzoned.c | 467 ++++++++++++++++++++++++++++++++++++ fs/btrfs/hmzoned.h | 8 + fs/btrfs/space-info.c | 13 +- fs/btrfs/space-info.h | 4 +- fs/btrfs/sysfs.c | 2 + 10 files changed, 672 insertions(+), 27 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index acfa0a9d3c5a..5c04422f6f5a 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -14,6 +14,7 @@ #include "sysfs.h" #include "tree-log.h" #include "delalloc-space.h" +#include "hmzoned.h" /* * Return target flags in extended format or 0 if restripe for this chunk_type @@ -677,6 +678,9 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only struct btrfs_caching_control *caching_ctl; int ret = 0; + if (btrfs_fs_incompat(fs_info, HMZONED)) + return 0; + caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS); if (!caching_ctl) return -ENOMEM; @@ -1048,12
+1052,15 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, WARN_ON(block_group->space_info->total_bytes < block_group->length); WARN_ON(block_group->space_info->bytes_readonly - < block_group->length); + < block_group->length - block_group->zone_unusable); + WARN_ON(block_group->space_info->bytes_zone_unusable + < block_group->zone_unusable); WARN_ON(block_group->space_info->disk_total < block_group->length * factor); } block_group->space_info->total_bytes -= block_group->length; - block_group->space_info->bytes_readonly -= block_group->length; + block_group->space_info->bytes_readonly -= + (block_group->length - block_group->zone_unusable); block_group->space_info->disk_total -= block_group->length * factor; spin_unlock(&block_group->space_info->lock); @@ -1210,7 +1217,7 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force) } num_bytes = cache->length - cache->reserved - cache->pinned - - cache->bytes_super - cache->used; + cache->bytes_super - cache->zone_unusable - cache->used; sinfo_used = btrfs_space_info_used(sinfo, true); /* @@ -1736,6 +1743,13 @@ static int read_one_block_group(struct btrfs_fs_info *info, goto error; } + ret = btrfs_load_block_group_zone_info(cache); + if (ret) { + btrfs_err(info, "failed to load zone info of bg %llu", + cache->start); + goto error; + } + /* * We need to exclude the super stripes now so that the space info has * super bytes accounted for, otherwise we'll think we have more space @@ -1766,6 +1780,8 @@ static int read_one_block_group(struct btrfs_fs_info *info, btrfs_free_excluded_extents(cache); } + btrfs_calc_zone_unusable(cache); + ret = btrfs_add_block_group_cache(info, cache); if (ret) { btrfs_remove_free_space_cache(cache); @@ -1773,7 +1789,8 @@ static int read_one_block_group(struct btrfs_fs_info *info, } trace_btrfs_add_block_group(info, cache, 0); btrfs_update_space_info(info, cache->flags, key->offset, - cache->used, cache->bytes_super, &space_info); + cache->used, 
cache->bytes_super, + cache->zone_unusable, &space_info); cache->space_info = space_info; @@ -1786,6 +1803,10 @@ static int read_one_block_group(struct btrfs_fs_info *info, ASSERT(list_empty(&cache->bg_list)); btrfs_mark_bg_unused(cache); } + + if (cache->wp_broken) + inc_block_group_ro(cache, 1); + return 0; error: btrfs_put_block_group(cache); @@ -1924,6 +1945,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used, cache->last_byte_to_unpin = (u64)-1; cache->cached = BTRFS_CACHE_FINISHED; cache->needs_free_space = 1; + + ret = btrfs_load_block_group_zone_info(cache); + if (ret) { + btrfs_put_block_group(cache); + return ret; + } + ret = exclude_super_stripes(cache); if (ret) { /* We may have excluded something, so call this just in case */ @@ -1965,7 +1993,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used, */ trace_btrfs_add_block_group(fs_info, cache, 1); btrfs_update_space_info(fs_info, cache->flags, size, bytes_used, - cache->bytes_super, &cache->space_info); + cache->bytes_super, 0, &cache->space_info); btrfs_update_global_block_rsv(fs_info); link_block_group(cache); @@ -2121,7 +2149,8 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group *cache) spin_lock(&cache->lock); if (!--cache->ro) { num_bytes = cache->length - cache->reserved - - cache->pinned - cache->bytes_super - cache->used; + cache->pinned - cache->bytes_super - + cache->zone_unusable - cache->used; sinfo->bytes_readonly -= num_bytes; list_del_init(&cache->ro_list); } @@ -2760,6 +2789,21 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans, return ret; } +void __btrfs_add_reserved_bytes(struct btrfs_block_group *cache, u64 ram_bytes, + u64 num_bytes, int delalloc) +{ + struct btrfs_space_info *space_info = cache->space_info; + + cache->reserved += num_bytes; + space_info->bytes_reserved += num_bytes; + trace_btrfs_space_reservation(cache->fs_info, "space_info", + space_info->flags, num_bytes, 1); + 
btrfs_space_info_update_bytes_may_use(cache->fs_info, space_info, + -ram_bytes); + if (delalloc) + cache->delalloc_bytes += num_bytes; +} + /** * btrfs_add_reserved_bytes - update the block_group and space info counters * @cache: The cache we are manipulating @@ -2778,20 +2822,16 @@ int btrfs_add_reserved_bytes(struct btrfs_block_group *cache, struct btrfs_space_info *space_info = cache->space_info; int ret = 0; + /* should handled by find_free_extent_zoned */ + ASSERT(!btrfs_fs_incompat(cache->fs_info, HMZONED)); + spin_lock(&space_info->lock); spin_lock(&cache->lock); - if (cache->ro) { + if (cache->ro) ret = -EAGAIN; - } else { - cache->reserved += num_bytes; - space_info->bytes_reserved += num_bytes; - trace_btrfs_space_reservation(cache->fs_info, "space_info", - space_info->flags, num_bytes, 1); - btrfs_space_info_update_bytes_may_use(cache->fs_info, - space_info, -ram_bytes); - if (delalloc) - cache->delalloc_bytes += num_bytes; - } + else + __btrfs_add_reserved_bytes(cache, ram_bytes, num_bytes, + delalloc); spin_unlock(&cache->lock); spin_unlock(&space_info->lock); return ret; diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 9b409676c4b2..347605654021 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -82,6 +82,7 @@ struct btrfs_block_group { unsigned int iref:1; unsigned int has_caching_ctl:1; unsigned int removed:1; + unsigned int wp_broken:1; int disk_cache_state; @@ -156,6 +157,14 @@ struct btrfs_block_group { /* Record locked full stripes for RAID5/6 block group */ struct btrfs_full_stripe_locks_tree full_stripe_locks_root; + + u64 zone_unusable; + /* + * Allocation offset for the block group to implement + * sequential allocation. This is used only with HMZONED mode + * enabled. 
+ */ + u64 alloc_offset; }; #ifdef CONFIG_BTRFS_DEBUG @@ -216,6 +225,8 @@ int btrfs_update_block_group(struct btrfs_trans_handle *trans, u64 bytenr, u64 num_bytes, int alloc); int btrfs_add_reserved_bytes(struct btrfs_block_group *cache, u64 ram_bytes, u64 num_bytes, int delalloc); +void __btrfs_add_reserved_bytes(struct btrfs_block_group *cache, u64 ram_bytes, + u64 num_bytes, int delalloc); void btrfs_free_reserved_bytes(struct btrfs_block_group *cache, u64 num_bytes, int delalloc); int btrfs_chunk_alloc(struct btrfs_trans_handle *trans, u64 flags, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 153f71a5bba9..3781a3778696 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -32,6 +32,8 @@ #include "block-rsv.h" #include "delalloc-space.h" #include "block-group.h" +#include "rcu-string.h" +#include "hmzoned.h" #undef SCRAMBLE_DELAYED_REFS @@ -2824,9 +2826,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info, cache = btrfs_lookup_block_group(fs_info, start); BUG_ON(!cache); /* Logic error */ - cluster = fetch_cluster_info(fs_info, - cache->space_info, - &empty_cluster); + if (!btrfs_fs_incompat(fs_info, HMZONED)) + cluster = fetch_cluster_info(fs_info, + cache->space_info, + &empty_cluster); + empty_cluster <<= 1; } @@ -2863,7 +2867,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info, space_info->max_extent_size = 0; percpu_counter_add_batch(&space_info->total_bytes_pinned, -len, BTRFS_TOTAL_BYTES_PINNED_BATCH); - if (cache->ro) { + if (btrfs_fs_incompat(fs_info, HMZONED)) { + /* need reset before reusing in zoned Block Group */ + space_info->bytes_zone_unusable += len; + readonly = true; + } else if (cache->ro) { space_info->bytes_readonly += len; readonly = true; } @@ -3657,6 +3665,57 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg, return 0; } +/* + * Simple allocator for sequential only block group. It only allows + * sequential allocation. No need to play with trees. 
This function + * also reserve the bytes as in btrfs_add_reserved_bytes. + */ + +static int find_free_extent_zoned(struct btrfs_block_group *cache, + struct find_free_extent_ctl *ffe_ctl) +{ + struct btrfs_space_info *space_info = cache->space_info; + struct btrfs_free_space_ctl *ctl = cache->free_space_ctl; + u64 start = cache->start; + u64 num_bytes = ffe_ctl->num_bytes; + u64 avail; + int ret = 0; + + ASSERT(btrfs_fs_incompat(cache->fs_info, HMZONED)); + + spin_lock(&space_info->lock); + spin_lock(&cache->lock); + + if (cache->ro) { + ret = -EAGAIN; + goto out; + } + + avail = cache->length - cache->alloc_offset; + if (avail < num_bytes) { + ffe_ctl->max_extent_size = avail; + ret = 1; + goto out; + } + + ffe_ctl->found_offset = start + cache->alloc_offset; + cache->alloc_offset += num_bytes; + spin_lock(&ctl->tree_lock); + ctl->free_space -= num_bytes; + spin_unlock(&ctl->tree_lock); + + ASSERT(IS_ALIGNED(ffe_ctl->found_offset, + cache->fs_info->stripesize)); + ffe_ctl->search_start = ffe_ctl->found_offset; + __btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes, + ffe_ctl->delalloc); + +out: + spin_unlock(&cache->lock); + spin_unlock(&space_info->lock); + return ret; +} + /* * Return >0 means caller needs to re-search for free extent * Return 0 means we have the needed free extent. @@ -3803,6 +3862,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, struct btrfs_block_group *block_group = NULL; struct find_free_extent_ctl ffe_ctl = {0}; struct btrfs_space_info *space_info; + bool hmzoned = btrfs_fs_incompat(fs_info, HMZONED); bool use_cluster = true; bool full_search = false; @@ -3965,6 +4025,17 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, if (unlikely(block_group->cached == BTRFS_CACHE_ERROR)) goto loop; + if (hmzoned) { + ret = find_free_extent_zoned(block_group, &ffe_ctl); + if (ret) + goto loop; + /* + * find_free_space_seq should ensure that + * everything is OK and reserve the extent. 
+ */ + goto nocheck; + } + /* * Ok we want to try and use the cluster allocator, so * lets look there @@ -4020,6 +4091,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, num_bytes); goto loop; } +nocheck: btrfs_inc_block_group_reservations(block_group); /* we are all good, lets return */ diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 3283da419200..e068325fcfc0 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2336,6 +2336,8 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, struct btrfs_free_space *info; int ret = 0; + ASSERT(!btrfs_fs_incompat(fs_info, HMZONED)); + info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS); if (!info) return -ENOMEM; @@ -2384,9 +2386,36 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, return ret; } +int __btrfs_add_free_space_seq(struct btrfs_block_group *block_group, + u64 bytenr, u64 size) +{ + struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl; + u64 offset = bytenr - block_group->start; + u64 to_free, to_unusable; + + spin_lock(&ctl->tree_lock); + if (block_group->wp_broken) + to_free = 0; + else if (offset >= block_group->alloc_offset) + to_free = size; + else if (offset + size <= block_group->alloc_offset) + to_free = 0; + else + to_free = offset + size - block_group->alloc_offset; + to_unusable = size - to_free; + + ctl->free_space += to_free; + block_group->zone_unusable += to_unusable; + spin_unlock(&ctl->tree_lock); + return 0; +} + int btrfs_add_free_space(struct btrfs_block_group *block_group, u64 bytenr, u64 size) { + if (btrfs_fs_incompat(block_group->fs_info, HMZONED)) + return __btrfs_add_free_space_seq(block_group, bytenr, size); + return __btrfs_add_free_space(block_group->fs_info, block_group->free_space_ctl, bytenr, size); @@ -2400,6 +2429,9 @@ int btrfs_remove_free_space(struct btrfs_block_group *block_group, int ret; bool re_search = false; + if (btrfs_fs_incompat(block_group->fs_info, HMZONED)) 
+ return 0; + spin_lock(&ctl->tree_lock); again: @@ -2635,6 +2667,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group, u64 align_gap = 0; u64 align_gap_len = 0; + ASSERT(!btrfs_fs_incompat(block_group->fs_info, HMZONED)); + spin_lock(&ctl->tree_lock); entry = find_free_space(ctl, &offset, &bytes_search, block_group->full_stripe_len, max_extent_size); @@ -2754,6 +2788,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group, struct rb_node *node; u64 ret = 0; + ASSERT(!btrfs_fs_incompat(block_group->fs_info, HMZONED)); + spin_lock(&cluster->lock); if (bytes > cluster->max_size) goto out; @@ -3401,6 +3437,8 @@ int btrfs_trim_block_group(struct btrfs_block_group *block_group, { int ret; + ASSERT(!btrfs_fs_incompat(block_group->fs_info, HMZONED)); + *trimmed = 0; spin_lock(&block_group->lock); diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h index ba9a23241101..0d3812bbb793 100644 --- a/fs/btrfs/free-space-cache.h +++ b/fs/btrfs/free-space-cache.h @@ -84,6 +84,8 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group *block_group); int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, struct btrfs_free_space_ctl *ctl, u64 bytenr, u64 size); +int __btrfs_add_free_space_seq(struct btrfs_block_group *block_group, + u64 bytenr, u64 size); int btrfs_add_free_space(struct btrfs_block_group *block_group, u64 bytenr, u64 size); int btrfs_remove_free_space(struct btrfs_block_group *block_group, diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c index 6263c8aee082..b067fa84b9a1 100644 --- a/fs/btrfs/hmzoned.c +++ b/fs/btrfs/hmzoned.c @@ -8,14 +8,21 @@ #include #include +#include #include "ctree.h" #include "volumes.h" #include "hmzoned.h" #include "rcu-string.h" #include "disk-io.h" +#include "block-group.h" +#include "locking.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 +/* Invalid allocation pointer value for missing devices */ +#define 
WP_MISSING_DEV ((u64)-1) +/* Pseudo write pointer value for conventional zone */ +#define WP_CONVENTIONAL ((u64)-2) static int sb_write_pointer(struct blk_zone *zone, u64 *wp_ret); @@ -605,3 +612,463 @@ bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos, return find_next_bit(zinfo->seq_zones, end, begin) == end; } + +void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) +{ + u64 unusable, free; + + if (!btrfs_fs_incompat(cache->fs_info, HMZONED)) + return; + + WARN_ON(cache->bytes_super != 0); + if (!cache->wp_broken) { + unusable = cache->alloc_offset - cache->used; + free = cache->length - cache->alloc_offset; + } else { + unusable = cache->length - cache->used; + free = 0; + } + /* we only need ->free_space in ALLOC_SEQ BGs */ + cache->last_byte_to_unpin = (u64)-1; + cache->cached = BTRFS_CACHE_FINISHED; + cache->free_space_ctl->free_space = free; + cache->zone_unusable = unusable; + /* + * Should not have any excluded extents. Just + * in case, though. + */ + btrfs_free_excluded_extents(cache); +} + +static int emulate_write_pointer(struct btrfs_block_group *cache, + u64 *offset_ret) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct btrfs_root *root = fs_info->extent_root; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_key search_key; + struct btrfs_key found_key; + int slot; + int ret; + u64 length; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + search_key.objectid = cache->start + cache->length; + search_key.type = 0; + search_key.offset = 0; + + ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0); + if (ret < 0) + goto out; + ASSERT(ret != 0); + slot = path->slots[0]; + leaf = path->nodes[0]; + ASSERT(slot != 0); + slot--; + btrfs_item_key_to_cpu(leaf, &found_key, slot); + + if (found_key.objectid < cache->start) { + *offset_ret = 0; + } else if (found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) { + struct btrfs_key extent_item_key; + + if (found_key.objectid != 
cache->start) { + ret = -EUCLEAN; + goto out; + } + + length = 0; + + /* metadata may have METADATA_ITEM_KEY */ + if (slot == 0) { + btrfs_set_path_blocking(path); + ret = btrfs_prev_leaf(root, path); + if (ret < 0) + goto out; + if (ret == 0) { + slot = btrfs_header_nritems(leaf) - 1; + btrfs_item_key_to_cpu(leaf, &extent_item_key, + slot); + } + } else { + btrfs_item_key_to_cpu(leaf, &extent_item_key, slot - 1); + ret = 0; + } + + if (ret == 0 && + extent_item_key.objectid == cache->start) { + if (extent_item_key.type == BTRFS_METADATA_ITEM_KEY) + length = fs_info->nodesize; + else if (extent_item_key.type == BTRFS_EXTENT_ITEM_KEY) + length = extent_item_key.offset; + else { + ret = -EUCLEAN; + goto out; + } + } + + *offset_ret = length; + } else if (found_key.type == BTRFS_EXTENT_ITEM_KEY || + found_key.type == BTRFS_METADATA_ITEM_KEY) { + + if (found_key.type == BTRFS_EXTENT_ITEM_KEY) + length = found_key.offset; + else + length = fs_info->nodesize; + + if (!(found_key.objectid >= cache->start && + found_key.objectid + length <= + cache->start + cache->length)) { + ret = -EUCLEAN; + goto out; + } + *offset_ret = found_key.objectid + length - cache->start; + } else { + ret = -EUCLEAN; + goto out; + } + ret = 0; + +out: + btrfs_free_path(path); + return ret; +} + +static u64 offset_in_dev_extent(struct map_lookup *map, u64 *alloc_offsets, + u64 logical, int idx) +{ + u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK; + u64 stripe_nr = logical / map->stripe_len; + u64 full_stripes_cnt; + u32 rest_stripes_cnt; + u64 stripe_start, offset; + int data_stripes = map->num_stripes / map->sub_stripes; + int stripe_idx; + int i; + + ASSERT(profile == BTRFS_BLOCK_GROUP_RAID0 || + profile == BTRFS_BLOCK_GROUP_RAID10); + + full_stripes_cnt = div_u64_rem(stripe_nr, data_stripes, + &rest_stripes_cnt); + stripe_idx = idx / map->sub_stripes; + + if (stripe_idx < rest_stripes_cnt) + return map->stripe_len * (full_stripes_cnt + 1); + + for (i = idx + map->sub_stripes; i < 
map->num_stripes; + i += map->sub_stripes) { + if (alloc_offsets[i] != WP_CONVENTIONAL && + alloc_offsets[i] > map->stripe_len * full_stripes_cnt) + return map->stripe_len * (full_stripes_cnt + 1); + } + + stripe_start = (full_stripes_cnt * data_stripes + stripe_idx) * + map->stripe_len; + if (stripe_start >= logical) + return full_stripes_cnt * map->stripe_len; + offset = min_t(u64, logical - stripe_start, map->stripe_len); + + return full_stripes_cnt * map->stripe_len + offset; +} + +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct extent_map_tree *em_tree = &fs_info->mapping_tree; + struct extent_map *em; + struct map_lookup *map; + struct btrfs_device *device; + u64 logical = cache->start; + u64 length = cache->length; + u64 physical = 0; + int ret; + int i, j; + unsigned int nofs_flag; + u64 *alloc_offsets = NULL; + u64 emulated_offset = 0; + u32 num_sequential = 0, num_conventional = 0; + + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return 0; + + /* Sanity check */ + if (!IS_ALIGNED(length, fs_info->zone_size)) { + btrfs_err(fs_info, "unaligned block group at %llu + %llu", + logical, length); + return -EIO; + } + + /* Get the chunk mapping */ + read_lock(&em_tree->lock); + em = lookup_extent_mapping(em_tree, logical, length); + read_unlock(&em_tree->lock); + + if (!em) + return -EINVAL; + + map = em->map_lookup; + + /* + * Get the zone type: if the group is mapped to a non-sequential zone, + * there is no need for the allocation offset (fit allocation is OK). 
+ */ + alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets), + GFP_NOFS); + if (!alloc_offsets) { + free_extent_map(em); + return -ENOMEM; + } + + for (i = 0; i < map->num_stripes; i++) { + bool is_sequential; + struct blk_zone zone; + + device = map->stripes[i].dev; + physical = map->stripes[i].physical; + + if (device->bdev == NULL) { + alloc_offsets[i] = WP_MISSING_DEV; + continue; + } + + is_sequential = btrfs_dev_is_sequential(device, physical); + if (is_sequential) + num_sequential++; + else + num_conventional++; + + if (!is_sequential) { + alloc_offsets[i] = WP_CONVENTIONAL; + continue; + } + + /* + * This zone will be used for allocation, so mark this + * zone non-empty. + */ + btrfs_dev_clear_zone_empty(device, physical); + + /* + * The group is mapped to a sequential zone. Get the zone write + * pointer to determine the allocation offset within the zone. + */ + WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size)); + nofs_flag = memalloc_nofs_save(); + ret = btrfs_get_dev_zone(device, physical, &zone); + memalloc_nofs_restore(nofs_flag); + if (ret == -EIO || ret == -EOPNOTSUPP) { + ret = 0; + alloc_offsets[i] = WP_MISSING_DEV; + continue; + } else if (ret) { + goto out; + } + + switch (zone.cond) { + case BLK_ZONE_COND_OFFLINE: + case BLK_ZONE_COND_READONLY: + btrfs_err( + fs_info, "Offline/readonly zone %llu", + physical >> device->zone_info->zone_size_shift); + alloc_offsets[i] = WP_MISSING_DEV; + break; + case BLK_ZONE_COND_EMPTY: + alloc_offsets[i] = 0; + break; + case BLK_ZONE_COND_FULL: + alloc_offsets[i] = fs_info->zone_size; + break; + default: + /* Partially used zone */ + alloc_offsets[i] = + ((zone.wp - zone.start) << SECTOR_SHIFT); + break; + } + } + + if (num_conventional > 0) { + ret = emulate_write_pointer(cache, &emulated_offset); + if (ret || map->num_stripes == num_conventional) { + if (!ret) + cache->alloc_offset = emulated_offset; + goto out; + } + } + + switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) { + case 0: /* single 
*/ + case BTRFS_BLOCK_GROUP_DUP: + case BTRFS_BLOCK_GROUP_RAID1: + cache->alloc_offset = WP_MISSING_DEV; + for (i = 0; i < map->num_stripes; i++) { + if (alloc_offsets[i] == WP_MISSING_DEV || + alloc_offsets[i] == WP_CONVENTIONAL) + continue; + if (cache->alloc_offset == WP_MISSING_DEV) + cache->alloc_offset = alloc_offsets[i]; + if (alloc_offsets[i] == cache->alloc_offset) + continue; + + cache->wp_broken = 1; + } + break; + case BTRFS_BLOCK_GROUP_RAID0: + cache->alloc_offset = 0; + for (i = 0; i < map->num_stripes; i++) { + if (alloc_offsets[i] == WP_MISSING_DEV) { + cache->wp_broken = 1; + continue; + } + + if (alloc_offsets[i] == WP_CONVENTIONAL) + alloc_offsets[i] = + offset_in_dev_extent(map, alloc_offsets, + emulated_offset, + i); + + /* sanity check */ + if (i > 0) { + if ((alloc_offsets[i] % BTRFS_STRIPE_LEN != 0 && + alloc_offsets[i - 1] % + BTRFS_STRIPE_LEN != 0) || + (alloc_offsets[i - 1] < alloc_offsets[i]) || + (alloc_offsets[i - 1] - alloc_offsets[i] > + BTRFS_STRIPE_LEN)) { + cache->wp_broken = 1; + continue; + } + } + + cache->alloc_offset += alloc_offsets[i]; + } + break; + case BTRFS_BLOCK_GROUP_RAID10: + /* + * Pass1: check write pointer of RAID1 level: each pointer + * should be equal. 
+ */ + for (i = 0; i < map->num_stripes / map->sub_stripes; i++) { + int base = i * map->sub_stripes; + u64 offset = WP_MISSING_DEV; + int fill = 0, num_conventional = 0; + + for (j = 0; j < map->sub_stripes; j++) { + if (alloc_offsets[base+j] == WP_MISSING_DEV) { + fill++; + continue; + } + if (alloc_offsets[base+j] == WP_CONVENTIONAL) { + fill++; + num_conventional++; + continue; + } + if (offset == WP_MISSING_DEV) + offset = alloc_offsets[base+j]; + if (alloc_offsets[base + j] == offset) + continue; + + cache->wp_broken = 1; + goto out; + } + if (!fill) + continue; + /* this RAID0 stripe is free on conventional zones */ + if (num_conventional == map->sub_stripes) + offset = WP_CONVENTIONAL; + /* fill WP_MISSING_DEV or WP_CONVENTIONAL */ + for (j = 0; j < map->sub_stripes; j++) + alloc_offsets[base + j] = offset; + } + + /* Pass2: check write pointer of RAID0 level */ + cache->alloc_offset = 0; + for (i = 0; i < map->num_stripes / map->sub_stripes; i++) { + int base = i * map->sub_stripes; + + if (alloc_offsets[base] == WP_MISSING_DEV) { + cache->wp_broken = 1; + continue; + } + + if (alloc_offsets[base] == WP_CONVENTIONAL) + alloc_offsets[base] = + offset_in_dev_extent(map, alloc_offsets, + emulated_offset, + base); + + /* sanity check */ + if (i > 0) { + int prev = base - map->sub_stripes; + + if ((alloc_offsets[base] % + BTRFS_STRIPE_LEN != 0 && + alloc_offsets[prev] % + BTRFS_STRIPE_LEN != 0) || + (alloc_offsets[prev] < + alloc_offsets[base]) || + (alloc_offsets[prev] - alloc_offsets[base] > + BTRFS_STRIPE_LEN)) { + cache->wp_broken = 1; + continue; + } + } + + cache->alloc_offset += alloc_offsets[base]; + } + break; + case BTRFS_BLOCK_GROUP_RAID5: + case BTRFS_BLOCK_GROUP_RAID6: + /* RAID5/6 is not supported yet */ + default: + btrfs_err(fs_info, "Unsupported profile on HMZONED %llu", + map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK); + ret = -EINVAL; + goto out; + } + +out: + /* an extent is allocated after the write pointer */ + if (num_conventional && 
emulated_offset > cache->alloc_offset) { + btrfs_err(fs_info, + "got wrong write pointer in BG %llu: %llu > %llu", + logical, emulated_offset, cache->alloc_offset); + cache->wp_broken = 1; + ret = -EIO; + } + + if (cache->wp_broken) { + char buf[128] = {'\0'}; + + btrfs_describe_block_groups(cache->flags, buf, sizeof(buf)); + btrfs_err(fs_info, "broken write pointer: block group %llu %s", + logical, buf); + for (i = 0; i < map->num_stripes; i++) { + char *note; + + device = map->stripes[i].dev; + physical = map->stripes[i].physical; + + if (device->bdev == NULL) + note = " (missing)"; + else if (!btrfs_dev_is_sequential(device, physical)) + note = " (conventional)"; + else + note = ""; + + btrfs_err_in_rcu(fs_info, + "stripe %d dev %s physical %llu write_pointer[i] = %llu%s", + i, rcu_str_deref(device->name), + physical, alloc_offsets[i], note); + } + } + + kfree(alloc_offsets); + free_extent_map(em); + + return ret; +} diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index d54b4ae8cf8b..4ed985d027cc 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -40,6 +40,8 @@ int btrfs_advance_sb_log(struct btrfs_device *device, int mirror); int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror); bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos, u64 num_bytes); +void btrfs_calc_zone_unusable(struct btrfs_block_group *cache); +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -89,6 +91,12 @@ static inline bool btrfs_check_allocatable_zones(struct btrfs_device *device, { return true; } +static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { } +static inline int btrfs_load_block_group_zone_info( + struct btrfs_block_group *cache) +{ + return 0; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) diff --git 
a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c index f09aa6ee9113..322036e49831 100644 --- a/fs/btrfs/space-info.c +++ b/fs/btrfs/space-info.c @@ -16,6 +16,7 @@ u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info, ASSERT(s_info); return s_info->bytes_used + s_info->bytes_reserved + s_info->bytes_pinned + s_info->bytes_readonly + + s_info->bytes_zone_unusable + (may_use_included ? s_info->bytes_may_use : 0); } @@ -112,7 +113,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info) void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, u64 total_bytes, u64 bytes_used, - u64 bytes_readonly, + u64 bytes_readonly, u64 bytes_zone_unusable, struct btrfs_space_info **space_info) { struct btrfs_space_info *found; @@ -128,6 +129,7 @@ void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, found->bytes_used += bytes_used; found->disk_used += bytes_used * factor; found->bytes_readonly += bytes_readonly; + found->bytes_zone_unusable += bytes_zone_unusable; if (total_bytes > 0) found->full = 0; btrfs_try_granting_tickets(info, found); @@ -267,10 +269,10 @@ static void __btrfs_dump_space_info(struct btrfs_fs_info *fs_info, info->total_bytes - btrfs_space_info_used(info, true), info->full ? 
"" : "not "); btrfs_info(fs_info, - "space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu", + "space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu zone_unusable=%llu", info->total_bytes, info->bytes_used, info->bytes_pinned, info->bytes_reserved, info->bytes_may_use, - info->bytes_readonly); + info->bytes_readonly, info->bytes_zone_unusable); DUMP_BLOCK_RSV(fs_info, global_block_rsv); DUMP_BLOCK_RSV(fs_info, trans_block_rsv); @@ -299,9 +301,10 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info, list_for_each_entry(cache, &info->block_groups[index], list) { spin_lock(&cache->lock); btrfs_info(fs_info, - "block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %s", + "block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %llu zone_unusable %s", cache->start, cache->length, cache->used, cache->pinned, - cache->reserved, cache->ro ? "[readonly]" : ""); + cache->reserved, cache->zone_unusable, + cache->ro ? 
"[readonly]" : ""); btrfs_dump_free_space(cache, bytes); spin_unlock(&cache->lock); } diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h index 1a349e3f9cc1..a1a5f6c2611b 100644 --- a/fs/btrfs/space-info.h +++ b/fs/btrfs/space-info.h @@ -17,6 +17,8 @@ struct btrfs_space_info { u64 bytes_may_use; /* number of bytes that may be used for delalloc/allocations */ u64 bytes_readonly; /* total bytes that are read only */ + u64 bytes_zone_unusable; /* total bytes that are unusable until + resetting the device zone */ u64 max_extent_size; /* This will hold the maximum extent size of the space info if we had an ENOSPC in the @@ -111,7 +113,7 @@ DECLARE_SPACE_INFO_UPDATE(bytes_pinned, "pinned"); int btrfs_init_space_info(struct btrfs_fs_info *fs_info); void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, u64 total_bytes, u64 bytes_used, - u64 bytes_readonly, + u64 bytes_readonly, u64 bytes_zone_unusable, struct btrfs_space_info **space_info); struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info, u64 flags); diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c index 230c7ad90e22..c479708537fc 100644 --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -458,6 +458,7 @@ SPACE_INFO_ATTR(bytes_pinned); SPACE_INFO_ATTR(bytes_reserved); SPACE_INFO_ATTR(bytes_may_use); SPACE_INFO_ATTR(bytes_readonly); +SPACE_INFO_ATTR(bytes_zone_unusable); SPACE_INFO_ATTR(disk_used); SPACE_INFO_ATTR(disk_total); BTRFS_ATTR(space_info, total_bytes_pinned, @@ -471,6 +472,7 @@ static struct attribute *space_info_attrs[] = { BTRFS_ATTR_PTR(space_info, bytes_reserved), BTRFS_ATTR_PTR(space_info, bytes_may_use), BTRFS_ATTR_PTR(space_info, bytes_readonly), + BTRFS_ATTR_PTR(space_info, bytes_zone_unusable), BTRFS_ATTR_PTR(space_info, disk_used), BTRFS_ATTR_PTR(space_info, disk_total), BTRFS_ATTR_PTR(space_info, total_bytes_pinned), From patchwork Fri Dec 13 04:08:58 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit 
X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11289839 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v6 11/28] btrfs: make unmirroed BGs readonly only if we have at least one writable BG Date: Fri, 13 Dec 2019 13:08:58 +0900 Message-Id: <20191213040915.3502922-12-naohiro.aota@wdc.com> In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com> References: <20191213040915.3502922-1-naohiro.aota@wdc.com> X-Mailing-List: linux-fsdevel@vger.kernel.org If the btrfs volume has mirrored block groups, it unconditionally makes un-mirrored block groups read only.
When we have mirrored block groups but no writable ones, this would mark every remaining writable block group read-only. So check that at least one writable mirrored block group exists before setting un-mirrored block groups read-only.

This change is necessary to handle e.g. the xfstests btrfs/124 case. When we mount a degraded RAID1 FS, write to it, and then re-mount with the full set of devices, the write pointers of the corresponding zones of the written block group differ. We mark such a block group as "wp_broken" and make it read-only. In this situation, the only RAID1 block groups we have are read-only because of "wp_broken", and the un-mirrored block groups are also marked read-only because RAID1 block groups exist. As a result, all block groups are read-only, so we cannot even start a rebalance to fix the situation.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/block-group.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 5c04422f6f5a..b286359f3876 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1813,6 +1813,27 @@ static int read_one_block_group(struct btrfs_fs_info *info,
 	return ret;
 }
 
+/*
+ * have_mirrored_block_group - check if we have at least one writable
+ * mirrored block group
+ */
+static bool have_mirrored_block_group(struct btrfs_space_info *space_info)
+{
+	struct btrfs_block_group *block_group;
+	int i;
+
+	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		if (i == BTRFS_RAID_RAID0 || i == BTRFS_RAID_SINGLE)
+			continue;
+		list_for_each_entry(block_group, &space_info->block_groups[i],
+				    list) {
+			if (!block_group->ro)
+				return true;
+		}
+	}
+	return false;
+}
+
 int btrfs_read_block_groups(struct btrfs_fs_info *info)
 {
 	struct btrfs_path *path;
@@ -1861,6 +1882,10 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 			      BTRFS_BLOCK_GROUP_RAID56_MASK |
 			      BTRFS_BLOCK_GROUP_DUP)))
 			continue;
+
+		if (!have_mirrored_block_group(space_info))
+			continue;
+
 		/*
 		 * Avoid allocating from un-mirrored block group if there are
 		 * mirrored block groups.
		 */
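The guard added by this patch — "is any mirrored block group still writable?" — can be exercised outside the kernel with simplified stand-in structures. The `bg` struct and `NR_RAID_TYPES` array below are illustrative stand-ins, not the btrfs API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (illustrative only). */
enum { RAID_SINGLE, RAID_RAID0, RAID_RAID1, NR_RAID_TYPES };

struct bg {
	bool ro;          /* block group is read-only */
	struct bg *next;  /* next group of the same RAID type */
};

/*
 * Return true if at least one block group of a mirrored RAID type
 * (i.e. anything other than SINGLE/RAID0) is still writable,
 * mirroring the loop in have_mirrored_block_group().
 */
static bool have_mirrored_bg(struct bg *lists[NR_RAID_TYPES])
{
	for (int i = 0; i < NR_RAID_TYPES; i++) {
		if (i == RAID_RAID0 || i == RAID_SINGLE)
			continue;
		for (struct bg *g = lists[i]; g; g = g->next)
			if (!g->ro)
				return true;
	}
	return false;
}
```

Note that a writable SINGLE or RAID0 group does not count: only then does it make sense to push un-mirrored groups read-only.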
From patchwork Fri Dec 13 04:08:59 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289833
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 12/28] btrfs: ensure metadata space available on/after degraded mount in HMZONED
Date: Fri, 13 Dec 2019 13:08:59 +0900
Message-Id: <20191213040915.3502922-13-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
On or after a degraded mount, we might have no writable metadata block group due to broken write pointers. If you then, e.g., balance the FS before writing any data, alloc_tree_block_no_bg_flush() (called from insert_balance_item()) fails to allocate a tree block, due to a global reservation failure. We can reproduce this situation with xfstests btrfs/124.

While we could work around the failure by writing some data first, so that a new metadata block group gets allocated as a side effect, that is bad practice to rely on. This commit avoids such failures by ensuring that a read-write mounted volume has non-zero free metadata space. If the metadata space is exhausted, it forces allocation of a new metadata block group.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/disk-io.c |  9 +++++++++
 fs/btrfs/hmzoned.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h |  6 ++++++
 3 files changed, 60 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index deca9fd70771..7f4c6a92079a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3246,6 +3246,15 @@ int __cold open_ctree(struct super_block *sb,
 		}
 	}
 
+	ret = btrfs_hmzoned_check_metadata_space(fs_info);
+	if (ret) {
+		btrfs_warn(fs_info, "failed to allocate metadata space: %d",
+			   ret);
+		btrfs_warn(fs_info, "try remount with readonly");
+		close_ctree(fs_info);
+		return ret;
+	}
+
 	down_read(&fs_info->cleanup_work_sem);
 	if ((ret = btrfs_orphan_cleanup(fs_info->fs_root)) ||
 	    (ret = btrfs_orphan_cleanup(fs_info->tree_root))) {
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index b067fa84b9a1..1a2a296e988a 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -16,6 +16,8 @@
 #include "disk-io.h"
 #include "block-group.h"
 #include "locking.h"
+#include "space-info.h"
+#include "transaction.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES 4096
@@ -1072,3 +1074,46 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 
 	return ret;
 }
+
+/*
+ * On/after a degraded mount, we might have no writable metadata block
+ * group due to broken write pointers. If you e.g. balance the FS
+ * before writing any data, alloc_tree_block_no_bg_flush() (called
+ * from insert_balance_item()) fails to allocate a tree block for
+ * it. To avoid such situations, ensure we have some metadata BG here.
+ */
+int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_root *root = fs_info->extent_root;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_space_info *info;
+	u64 left;
+	int ret;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA);
+	spin_lock(&info->lock);
+	left = info->total_bytes - btrfs_space_info_used(info, true);
+	spin_unlock(&info->lock);
+
+	if (left)
+		return 0;
+
+	trans = btrfs_start_transaction(root, 0);
+	if (IS_ERR(trans))
+		return PTR_ERR(trans);
+
+	mutex_lock(&fs_info->chunk_mutex);
+	ret = btrfs_alloc_chunk(trans, btrfs_metadata_alloc_profile(fs_info));
+	if (ret) {
+		mutex_unlock(&fs_info->chunk_mutex);
+		btrfs_abort_transaction(trans, ret);
+		btrfs_end_transaction(trans);
+		return ret;
+	}
+	mutex_unlock(&fs_info->chunk_mutex);
+
+	return btrfs_commit_transaction(trans);
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 4ed985d027cc..8ac758074afd 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -42,6 +42,7 @@ bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 				   u64 num_bytes);
 void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
 int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
+int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -97,6 +98,11 @@ static inline int btrfs_load_block_group_zone_info(
 {
 	return 0;
 }
+static inline int btrfs_hmzoned_check_metadata_space(
+		struct btrfs_fs_info *fs_info)
+{
+	return 0;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
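The core decision of this patch — force a new metadata chunk only when no free metadata space is left — can be sketched in isolation. The `space_info` struct and `alloc_chunk` callback below are simplified stand-ins for the btrfs internals, not the real API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for btrfs_space_info (illustrative only). */
struct space_info {
	uint64_t total_bytes;
	uint64_t used_bytes;
};

/* Example allocator stub that just counts invocations. */
static int alloc_calls;
static int count_alloc(void *ctx) { (void)ctx; alloc_calls++; return 0; }

/*
 * Mirrors the shape of btrfs_hmzoned_check_metadata_space(): do
 * nothing unless the FS is zoned and metadata space is exhausted,
 * otherwise invoke the chunk-allocation callback.
 */
static int check_metadata_space(bool hmzoned, struct space_info *info,
				int (*alloc_chunk)(void *), void *ctx)
{
	if (!hmzoned)
		return 0;              /* only relevant for zoned FS */
	uint64_t left = info->total_bytes - info->used_bytes;
	if (left)
		return 0;              /* still have room, nothing to do */
	return alloc_chunk(ctx);       /* force a new metadata chunk */
}
```

The kernel version does the same test under `info->lock` and runs the allocation inside a transaction; the sketch keeps only the control flow.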
From patchwork Fri Dec 13 04:09:00 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289843
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 13/28] btrfs: reset zones of unused block groups
Date: Fri, 13 Dec 2019 13:09:00 +0900
Message-Id: <20191213040915.3502922-14-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
For an HMZONED volume, a block group maps to a zone of the device. For deleted, unused block groups, the zone of the block group can be reset to rewind the zone write pointer to the start of the zone.

Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/block-group.c |  8 ++++++--
 fs/btrfs/extent-tree.c | 17 ++++++++++++-----
 fs/btrfs/hmzoned.c     | 18 ++++++++++++++++++
 fs/btrfs/hmzoned.h     | 23 +++++++++++++++++++++++
 4 files changed, 59 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index b286359f3876..e78d34a4fb56 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -1369,8 +1369,12 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info)
 		spin_unlock(&block_group->lock);
 		spin_unlock(&space_info->lock);
 
-		/* DISCARD can flip during remount */
-		trimming = btrfs_test_opt(fs_info, DISCARD);
+		/*
+		 * DISCARD can flip during remount. In HMZONED mode,
+		 * we need to reset sequential required zones.
+		 */
+		trimming = btrfs_test_opt(fs_info, DISCARD) ||
+			   btrfs_fs_incompat(fs_info, HMZONED);
 
 		/* Implicit trim during transaction commit. */
 		if (trimming)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3781a3778696..b41a45855bc4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1338,6 +1338,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 
 		stripe = bbio->stripes;
 		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
+			struct btrfs_device *dev = stripe->dev;
+			u64 physical = stripe->physical;
+			u64 length = stripe->length;
 			u64 bytes;
 			struct request_queue *req_q;
 
@@ -1345,14 +1348,18 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
 				continue;
 			}
+
 			req_q = bdev_get_queue(stripe->dev->bdev);
-			if (!blk_queue_discard(req_q))
+			/* zone reset in HMZONED mode */
+			if (btrfs_can_zone_reset(dev, physical, length))
+				ret = btrfs_reset_device_zone(dev, physical,
+							      length, &bytes);
+			else if (blk_queue_discard(req_q))
+				ret = btrfs_issue_discard(dev->bdev, physical,
+							  length, &bytes);
+			else
 				continue;
 
-			ret = btrfs_issue_discard(stripe->dev->bdev,
-						  stripe->physical,
-						  stripe->length,
-						  &bytes);
 			if (!ret) {
 				discarded_bytes += bytes;
 			} else if (ret != -EOPNOTSUPP) {
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 1a2a296e988a..0ca84d888e53 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -1117,3 +1117,21 @@ int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info)
 
 	return btrfs_commit_transaction(trans);
 }
+
+int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
+			    u64 length, u64 *bytes)
+{
+	int ret;
+
+	ret = blkdev_reset_zones(device->bdev, physical >> SECTOR_SHIFT,
+				 length >> SECTOR_SHIFT, GFP_NOFS);
+	if (!ret) {
+		*bytes = length;
+		while (length) {
+			btrfs_dev_set_zone_empty(device, physical);
+			length -= device->zone_info->zone_size;
+		}
+	}
+
+	return ret;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 8ac758074afd..e1fa6a2f2557 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -43,6 +43,8 @@ bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos,
 void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
 int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
 int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
+int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
+			    u64 length, u64 *bytes);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -103,6 +105,11 @@ static inline int btrfs_hmzoned_check_metadata_space(
 {
 	return 0;
 }
+static inline int btrfs_reset_device_zone(struct btrfs_device *device,
+					  u64 physical, u64 length, u64 *bytes)
+{
+	return 0;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -189,4 +196,20 @@ static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos)
 	return ALIGN(pos, device->zone_info->zone_size);
 }
 
+static inline bool btrfs_can_zone_reset(struct btrfs_device *device,
+					u64 physical, u64 length)
+{
+	u64 zone_size;
+
+	if (!btrfs_dev_is_sequential(device, physical))
+		return false;
+
+	zone_size = device->zone_info->zone_size;
+	if (!IS_ALIGNED(physical, zone_size) ||
+	    !IS_ALIGNED(length, zone_size))
+		return false;
+
+	return true;
+}
+
 #endif
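The eligibility check introduced by btrfs_can_zone_reset() can be exercised in isolation. The sketch below reimplements the alignment logic with plain integers; the sequential-zone lookup is reduced to a boolean parameter, which is an assumption for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the kernel's IS_ALIGNED() for power-of-two alignments. */
#define IS_ALIGNED(x, a) (((x) & ((uint64_t)(a) - 1)) == 0)

/*
 * A range may be reset as whole zones only if it lies in a sequential
 * zone and both its start and its length are zone-size aligned;
 * otherwise the range must fall back to a regular discard.
 */
static bool can_zone_reset(bool sequential, uint64_t physical,
			   uint64_t length, uint64_t zone_size)
{
	if (!sequential)
		return false;
	return IS_ALIGNED(physical, zone_size) &&
	       IS_ALIGNED(length, zone_size);
}
```

This is why btrfs_discard_extent() tries the zone reset first and only falls back to blk_queue_discard(): a reset rewinds the write pointer, but only for whole, sequential zones.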
From patchwork Fri Dec 13 04:09:01 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289847
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 14/28] btrfs: redirty released extent buffers in HMZONED mode
Date: Fri, 13 Dec 2019 13:09:01 +0900
Message-Id: <20191213040915.3502922-15-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>

Tree-manipulating operations like merging nodes often release once-allocated tree nodes. Btrfs cleans such nodes so that pages in the node are not uselessly written out. On HMZONED drives, however, this optimization blocks the following IOs: cancelling the write-out of the freed blocks breaks the sequential write order expected by the device.

This patch introduces a list of clean and unwritten extent buffers that have been released in a transaction. Btrfs redirties the buffers so that btree_write_cache_pages() can send proper bios to the devices. Besides, it clears the entire content of the extent buffer so as not to confuse raw block scanners such as btrfsck.
Since clearing the content makes csum_dirty_buffer() complain about a bytenr mismatch, skip the check and the checksum for such buffers using the newly introduced buffer flag EXTENT_BUFFER_NO_CHECK.

Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/disk-io.c     |  8 ++++++++
 fs/btrfs/extent-tree.c | 12 +++++++++++-
 fs/btrfs/extent_io.c   |  3 +++
 fs/btrfs/extent_io.h   |  2 ++
 fs/btrfs/hmzoned.c     | 36 ++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     |  6 ++++++
 fs/btrfs/transaction.c | 10 ++++++++++
 fs/btrfs/transaction.h |  3 +++
 8 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7f4c6a92079a..fbbc313f9f46 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -525,6 +525,12 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 		return 0;
 
 	found_start = btrfs_header_bytenr(eb);
+
+	if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)) {
+		WARN_ON(found_start != 0);
+		return 0;
+	}
+
 	/*
 	 * Please do not consolidate these warnings into a single if.
 	 * It is useful to know what went wrong.
@@ -4521,6 +4527,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	btrfs_destroy_pinned_extent(fs_info,
 				    fs_info->pinned_extents);
 
+	btrfs_free_redirty_list(cur_trans);
+
 	cur_trans->state = TRANS_STATE_COMPLETED;
 	wake_up(&cur_trans->commit_wait);
 }
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b41a45855bc4..e61f69eef4a8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3301,8 +3301,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 
 	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 		ret = check_ref_cleanup(trans, buf->start);
-		if (!ret)
+		if (!ret) {
+			btrfs_redirty_list_add(trans->transaction, buf);
 			goto out;
+		}
 	}
 
 	pin = 0;
@@ -3314,6 +3316,13 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 			goto out;
 		}
 
+		if (btrfs_fs_incompat(fs_info, HMZONED)) {
+			btrfs_redirty_list_add(trans->transaction, buf);
+			pin_down_extent(cache, buf->start, buf->len, 1);
+			btrfs_put_block_group(cache);
+			goto out;
+		}
+
 		WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
 
 		btrfs_add_free_space(cache, buf->start, buf->len);
@@ -4524,6 +4533,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 	btrfs_tree_lock(buf);
 	btrfs_clean_tree_block(buf);
 	clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
+	clear_bit(EXTENT_BUFFER_NO_CHECK, &buf->bflags);
 
 	btrfs_set_lock_blocking_write(buf);
 	set_extent_buffer_uptodate(buf);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index eb8bd0258360..6e25c8790ef4 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -24,6 +24,7 @@
 #include "rcu-string.h"
 #include "backref.h"
 #include "disk-io.h"
+#include "hmzoned.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -4889,6 +4890,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 	init_waitqueue_head(&eb->read_lock_wq);
 
 	btrfs_leak_debug_add(&eb->leak_list, &buffers);
+	INIT_LIST_HEAD(&eb->release_list);
 	spin_lock_init(&eb->refs_lock);
 	atomic_set(&eb->refs, 1);
@@ -5686,6 +5688,7 @@ void write_extent_buffer(struct extent_buffer *eb, const void *srcv,
 
 	WARN_ON(start > eb->len);
 	WARN_ON(start + len > eb->start + eb->len);
+	WARN_ON(test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags));
 
 	offset = offset_in_page(start_offset + start);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index a8551a1f56e2..51a15e93a5cd 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -29,6 +29,7 @@ enum {
 	EXTENT_BUFFER_IN_TREE,
 	/* write IO error */
 	EXTENT_BUFFER_WRITE_ERR,
+	EXTENT_BUFFER_NO_CHECK,
 };
 
 /* these are flags for __process_pages_contig */
@@ -115,6 +116,7 @@ struct extent_buffer {
 	 */
 	wait_queue_head_t read_lock_wq;
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
+	struct list_head release_list;
 #ifdef CONFIG_BTRFS_DEBUG
 	int spinning_writers;
 	atomic_t spinning_readers;
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 0ca84d888e53..0c0ee9a46009 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -1135,3 +1135,39 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 
 	return ret;
 }
+
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED) ||
+	    btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN) ||
+	    !list_empty(&eb->release_list))
+		return;
+
+	set_extent_buffer_dirty(eb);
+	set_extent_bits_nowait(&trans->dirty_pages, eb->start,
+			       eb->start + eb->len - 1, EXTENT_DIRTY);
+	memzero_extent_buffer(eb, 0, eb->len);
+	set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
+
+	spin_lock(&trans->releasing_ebs_lock);
+	list_add_tail(&eb->release_list, &trans->releasing_ebs);
+	spin_unlock(&trans->releasing_ebs_lock);
+	atomic_inc(&eb->refs);
+}
+
+void btrfs_free_redirty_list(struct btrfs_transaction *trans)
+{
+	spin_lock(&trans->releasing_ebs_lock);
+	while (!list_empty(&trans->releasing_ebs)) {
+		struct extent_buffer *eb;
+
+		eb = list_first_entry(&trans->releasing_ebs,
+				      struct extent_buffer, release_list);
+		list_del_init(&eb->release_list);
+		free_extent_buffer(eb);
+	}
+	spin_unlock(&trans->releasing_ebs_lock);
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index e1fa6a2f2557..ddec6aed7283 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -45,6 +45,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
 int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
 int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 			    u64 length, u64 *bytes);
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb);
+void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -110,6 +113,9 @@ static inline int btrfs_reset_device_zone(struct btrfs_device *device,
 {
 	return 0;
 }
+static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+					  struct extent_buffer *eb) { }
+static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { }
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 19de6e2041dc..39628c370bdb 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -21,6 +21,7 @@
 #include "dev-replace.h"
 #include "qgroup.h"
 #include "block-group.h"
+#include "hmzoned.h"
 
 #define BTRFS_ROOT_TRANS_TAG 0
 
@@ -329,6 +330,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
 	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
 	spin_lock_init(&cur_trans->dropped_roots_lock);
+	INIT_LIST_HEAD(&cur_trans->releasing_ebs);
+	spin_lock_init(&cur_trans->releasing_ebs_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
 			    IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);
@@ -2336,6 +2339,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		goto scrub_continue;
 	}
 
+	/*
+	 * At this point, we should have written all the tree blocks
+	 * allocated in this transaction. So it's now safe to free the
+	 * redirtied extent buffers.
+	 */
+	btrfs_free_redirty_list(cur_trans);
+
 	ret = write_all_supers(fs_info, 0);
 	/*
 	 * the super is written, we can safely allow the tree-loggers
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 49f7196368f5..3d60d2213c70 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -84,6 +84,9 @@ struct btrfs_transaction {
 	spinlock_t dropped_roots_lock;
 	struct btrfs_delayed_ref_root delayed_refs;
 	struct btrfs_fs_info *fs_info;
+
+	spinlock_t releasing_ebs_lock;
+	struct list_head releasing_ebs;
 };
 
 #define __TRANS_FREEZABLE	(1U << 0)
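The release-list bookkeeping of this patch can be sketched outside the kernel with a plain intrusive list. The `buffer` struct and its refcount below are simplified stand-ins for extent_buffer and its atomic refs, not the btrfs API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for an extent buffer (illustrative only). */
struct buffer {
	int refs;
	bool on_list;
	struct buffer *next;
};

struct txn {
	struct buffer *releasing;  /* head of the releasing list */
};

/* Queue a released buffer once, taking an extra reference,
 * as btrfs_redirty_list_add() does with release_list. */
static void redirty_list_add(struct txn *t, struct buffer *b)
{
	if (b->on_list)
		return;            /* already queued in this transaction */
	b->on_list = true;
	b->refs++;                 /* keep it alive until commit */
	b->next = t->releasing;
	t->releasing = b;
}

/* Drain the list and drop the references it held, as
 * btrfs_free_redirty_list() does at commit or cleanup time. */
static void free_redirty_list(struct txn *t)
{
	while (t->releasing) {
		struct buffer *b = t->releasing;
		t->releasing = b->next;
		b->on_list = false;
		b->next = NULL;
		b->refs--;
	}
}
```

The extra reference is the important part: the buffer must stay alive until every redirtied block of the transaction has actually been written out in sequential order.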
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v6 15/28] btrfs: serialize data allocation and submit IOs Date: Fri, 13 Dec 2019 13:09:02 +0900 Message-Id: <20191213040915.3502922-16-naohiro.aota@wdc.com> In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com> References: <20191213040915.3502922-1-naohiro.aota@wdc.com> To preserve the sequential write pattern on the drives, we must serialize allocation and submit_bio. This commit adds a per-block group mutex "zone_io_lock", which find_free_extent_zoned() takes while allocating. The lock is kept held even after returning from find_free_extent() and is released only once the IOs corresponding to the allocation have been submitted. Implementing such behavior under __extent_writepage_io() is almost impossible because once pages are unlocked, we cannot tell when the IOs for an allocated region have actually been submitted. Instead, this commit adds run_delalloc_hmzoned() to write out non-compressed data IOs at once using extent_write_locked_range(). After the write, we can call btrfs_hmzoned_data_io_unlock() to unlock the block group for new allocation.
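The locking discipline described above can be sketched in userspace. The following is NOT the kernel code from this patch; it is a hedged, self-contained model using POSIX mutexes, where block_group, alloc_locked() and submit_and_unlock() are illustrative stand-ins for cache->zone_io_lock, find_free_extent_zoned() and the post-submit_bio unlock. It shows the invariant the patch enforces: because the lock is held from allocation until submission, each write lands exactly at the zone's write pointer.

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative model of a zoned block group (not btrfs's struct). */
struct block_group {
	pthread_mutex_t zone_io_lock;      /* models cache->zone_io_lock */
	unsigned long long alloc_offset;   /* next free byte in the zone */
	unsigned long long write_pointer;  /* next byte the device will accept */
};

/*
 * Models find_free_extent_zoned(): take the lock, allocate, and
 * deliberately return with the lock still held.
 */
unsigned long long alloc_locked(struct block_group *bg, unsigned long long len)
{
	unsigned long long start;

	pthread_mutex_lock(&bg->zone_io_lock);
	start = bg->alloc_offset;
	bg->alloc_offset += len;
	return start;
}

/*
 * Models the IO submission path: only now, after the write has been
 * handed to the device, is the lock released for the next allocator.
 */
void submit_and_unlock(struct block_group *bg, unsigned long long start,
		       unsigned long long len)
{
	/* Sequential-zone invariant: writes land at the write pointer. */
	assert(start == bg->write_pointer);
	bg->write_pointer += len;
	pthread_mutex_unlock(&bg->zone_io_lock);
}
```

Holding the mutex across the allocate-then-submit window is what makes the model safe with concurrent writers: a second thread blocks in alloc_locked() until the first thread's submit_and_unlock() runs, so allocation order and submission order cannot diverge.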
Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 1 + fs/btrfs/block-group.h | 1 + fs/btrfs/extent-tree.c | 4 ++++ fs/btrfs/hmzoned.h | 36 +++++++++++++++++++++++++++++++++ fs/btrfs/inode.c | 45 ++++++++++++++++++++++++++++++++++++++++-- 5 files changed, 85 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index e78d34a4fb56..6f7d29171adf 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1642,6 +1642,7 @@ static struct btrfs_block_group *btrfs_create_block_group_cache( btrfs_init_free_space_ctl(cache); atomic_set(&cache->trimming, 0); mutex_init(&cache->free_space_lock); + mutex_init(&cache->zone_io_lock); btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root); return cache; diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 347605654021..57c8d6f4b3d1 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -165,6 +165,7 @@ struct btrfs_block_group { * enabled. */ u64 alloc_offset; + struct mutex zone_io_lock; }; #ifdef CONFIG_BTRFS_DEBUG diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e61f69eef4a8..d1f326b6c4d4 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3699,6 +3699,7 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache, ASSERT(btrfs_fs_incompat(cache->fs_info, HMZONED)); + btrfs_hmzoned_data_io_lock(cache); spin_lock(&space_info->lock); spin_lock(&cache->lock); @@ -3729,6 +3730,9 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache, out: spin_unlock(&cache->lock); spin_unlock(&space_info->lock); + /* if succeeds, unlock after submit_bio */ + if (ret) + btrfs_hmzoned_data_io_unlock(cache); return ret; } diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index ddec6aed7283..f6682ead575b 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -12,6 +12,7 @@ #include #include "volumes.h" #include "disk-io.h" +#include "block-group.h" struct btrfs_zoned_device_info { /* @@ -48,6 
+49,7 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, void btrfs_redirty_list_add(struct btrfs_transaction *trans, struct extent_buffer *eb); void btrfs_free_redirty_list(struct btrfs_transaction *trans); +void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -116,6 +118,8 @@ static inline int btrfs_reset_device_zone(struct btrfs_device *device, static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans, struct extent_buffer *eb) { } static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { } +static inline void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, + u64 start, u64 len) { } #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -218,4 +222,36 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device, return true; } +static inline void btrfs_hmzoned_data_io_lock( + struct btrfs_block_group *cache) +{ + /* No need to lock metadata BGs or non-sequential BGs */ + if (!btrfs_fs_incompat(cache->fs_info, HMZONED) || + !(cache->flags & BTRFS_BLOCK_GROUP_DATA)) + return; + mutex_lock(&cache->zone_io_lock); +} + +static inline void btrfs_hmzoned_data_io_unlock( + struct btrfs_block_group *cache) +{ + if (!btrfs_fs_incompat(cache->fs_info, HMZONED) || + !(cache->flags & BTRFS_BLOCK_GROUP_DATA)) + return; + mutex_unlock(&cache->zone_io_lock); +} + +static inline void btrfs_hmzoned_data_io_unlock_logical( + struct btrfs_fs_info *fs_info, u64 logical) +{ + struct btrfs_block_group *cache; + + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return; + + cache = btrfs_lookup_block_group(fs_info, logical); + btrfs_hmzoned_data_io_unlock(cache); + btrfs_put_block_group(cache); +} + #endif diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 56032c518b26..3677c36999d8 100644 --- a/fs/btrfs/inode.c +++ 
b/fs/btrfs/inode.c @@ -49,6 +49,7 @@ #include "qgroup.h" #include "delalloc-space.h" #include "block-group.h" +#include "hmzoned.h" struct btrfs_iget_args { struct btrfs_key *location; @@ -1325,6 +1326,39 @@ static int cow_file_range_async(struct inode *inode, return 0; } +static noinline int run_delalloc_hmzoned(struct inode *inode, + struct page *locked_page, u64 start, + u64 end, int *page_started, + unsigned long *nr_written) +{ + struct extent_map *em; + u64 logical; + int ret; + + ret = cow_file_range(inode, locked_page, start, end, + page_started, nr_written, 0); + if (ret) + return ret; + + if (*page_started) + return 0; + + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - start + 1, + 0); + ASSERT(em != NULL && em->block_start < EXTENT_MAP_LAST_BYTE); + logical = em->block_start; + free_extent_map(em); + + __set_page_dirty_nobuffers(locked_page); + account_page_redirty(locked_page); + extent_write_locked_range(inode, start, end, WB_SYNC_ALL); + *page_started = 1; + + btrfs_hmzoned_data_io_unlock_logical(btrfs_sb(inode->i_sb), logical); + + return 0; +} + static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes) { @@ -1737,17 +1771,24 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page, { int ret; int force_cow = need_force_cow(inode, start, end); + int do_compress = inode_can_compress(inode) && + inode_need_compress(inode, start, end); + int hmzoned = btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED); if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) { + ASSERT(!hmzoned); ret = run_delalloc_nocow(inode, locked_page, start, end, page_started, 1, nr_written); } else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) { + ASSERT(!hmzoned); ret = run_delalloc_nocow(inode, locked_page, start, end, page_started, 0, nr_written); - } else if (!inode_can_compress(inode) || - !inode_need_compress(inode, start, end)) { + } else if (!do_compress && !hmzoned) { ret = 
cow_file_range(inode, locked_page, start, end, page_started, nr_written, 1); + } else if (!do_compress && hmzoned) { + ret = run_delalloc_hmzoned(inode, locked_page, start, end, + page_started, nr_written); + } else { set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &BTRFS_I(inode)->runtime_flags); From patchwork Fri Dec 13 04:09:03 2019 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11289849
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v6 16/28] btrfs: implement atomic compressed IO submission Date: Fri, 13 Dec 2019 13:09:03 +0900 Message-Id: <20191213040915.3502922-17-naohiro.aota@wdc.com> In-Reply-To: 
<20191213040915.3502922-1-naohiro.aota@wdc.com> References: <20191213040915.3502922-1-naohiro.aota@wdc.com> As with non-compressed IO submission, we must unlock the block group before the next allocation. Signed-off-by: Naohiro Aota --- fs/btrfs/inode.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 3677c36999d8..e09089e24a8f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -793,13 +793,25 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk) * and IO for us. Otherwise, we need to submit * all those pages down to the drive. */ - if (!page_started && !ret) + if (!page_started && !ret) { + struct extent_map *em; + u64 logical; + + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, + async_extent->start, + async_extent->ram_size, + 0); + logical = em->block_start; + free_extent_map(em); + extent_write_locked_range(inode, async_extent->start, async_extent->start + async_extent->ram_size - 1, WB_SYNC_ALL); - else if (ret && async_chunk->locked_page) + btrfs_hmzoned_data_io_unlock_logical(fs_info, + logical); + } else if (ret && async_chunk->locked_page) unlock_page(async_chunk->locked_page); kfree(async_extent); cond_resched(); @@ -899,6 +911,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk) free_async_extent_pages(async_extent); } alloc_hint = ins.objectid + ins.offset; + btrfs_hmzoned_data_io_unlock_logical(fs_info, ins.objectid); kfree(async_extent); cond_resched(); } @@ -906,6 +919,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk) out_free_reserve: btrfs_dec_block_group_reservations(fs_info, ins.objectid); btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1); + btrfs_hmzoned_data_io_unlock_logical(fs_info, ins.objectid); out_free: 
extent_clear_unlock_delalloc(inode, async_extent->start, async_extent->start + From patchwork Fri Dec 13 04:09:04 2019 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11289859
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v6 17/28] btrfs: support direct write IO in HMZONED Date: Fri, 13 Dec 2019 13:09:04 +0900 Message-Id: <20191213040915.3502922-18-naohiro.aota@wdc.com> In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com> References: <20191213040915.3502922-1-naohiro.aota@wdc.com>
As with other IO submission paths, we must unlock the block group before the next allocation. Signed-off-by: Naohiro Aota --- fs/btrfs/inode.c | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e09089e24a8f..44658590c6e8 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -60,6 +60,7 @@ struct btrfs_dio_data { u64 reserve; u64 unsubmitted_oe_range_start; u64 unsubmitted_oe_range_end; + u64 alloc_end; int overwrite; }; @@ -7787,6 +7788,12 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, } } + if (dio_data->alloc_end) { + btrfs_hmzoned_data_io_unlock_logical(fs_info, + dio_data->alloc_end - 1); + dio_data->alloc_end = 0; + } + /* this will cow the extent */ len = bh_result->b_size; free_extent_map(em); @@ -7818,6 +7825,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map, WARN_ON(dio_data->reserve < len); dio_data->reserve -= len; dio_data->unsubmitted_oe_range_end = start + len; + dio_data->alloc_end = em->block_start + (start - em->start) + len; current->journal_info = dio_data; out: return ret; @@ -8585,6 +8593,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, struct btrfs_io_bio *io_bio; bool write = (bio_op(dio_bio) == REQ_OP_WRITE); int ret = 0; + u64 disk_bytenr, len; bio = btrfs_bio_clone(dio_bio); @@ -8628,7 +8637,18 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode, dio_data->unsubmitted_oe_range_end; } + disk_bytenr = dip->disk_bytenr; + len = dip->bytes; ret = btrfs_submit_direct_hook(dip); + if (write) { + struct btrfs_dio_data *dio_data = current->journal_info; + + if (disk_bytenr + len == dio_data->alloc_end) { + btrfs_hmzoned_data_io_unlock_logical( + btrfs_sb(inode->i_sb), disk_bytenr); + dio_data->alloc_end = 0; + } + } if (!ret) return; @@ -8804,6 +8824,11 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct 
iov_iter *iter) btrfs_delalloc_release_space(inode, data_reserved, offset, count - (size_t)ret, true); btrfs_delalloc_release_extents(BTRFS_I(inode), count); + if (dio_data.alloc_end) { + pr_info("unlock final direct %llu", dio_data.alloc_end); + btrfs_hmzoned_data_io_unlock_logical( + fs_info, dio_data.alloc_end - 1); + } } out: if (wakeup) From patchwork Fri Dec 13 04:09:05 2019 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11289863
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v6 18/28] btrfs: serialize meta IOs on HMZONED mode Date: Fri, 13 Dec 2019 13:09:05 +0900 Message-Id: 
<20191213040915.3502922-19-naohiro.aota@wdc.com> In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com> References: <20191213040915.3502922-1-naohiro.aota@wdc.com> As in the data IO path, we must serialize write IOs for metadata. We cannot simply take a mutex around allocation and submission, because metadata blocks are allocated at a much earlier stage, while building up the B-trees. Thus, this commit adds hmzoned_meta_io_lock and holds it during metadata IO submission in btree_write_cache_pages() to serialize IOs. Furthermore, this commit adds a per-block group metadata IO submission pointer, "meta_write_pointer", to ensure sequential writing, which could otherwise be broken by writing back blocks of a not-yet-finished transaction out of order. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.h | 1 + fs/btrfs/ctree.h | 2 ++ fs/btrfs/disk-io.c | 1 + fs/btrfs/extent_io.c | 27 +++++++++++++++++++++- fs/btrfs/hmzoned.c | 52 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/hmzoned.h | 27 ++++++++++++++++++++++ 6 files changed, 109 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 57c8d6f4b3d1..8827869f1744 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -166,6 +166,7 @@ struct btrfs_block_group { */ u64 alloc_offset; struct mutex zone_io_lock; + u64 meta_write_pointer; }; #ifdef CONFIG_BTRFS_DEBUG diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 44517802b9e5..18d2d0581e68 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -905,6 +905,8 @@ struct btrfs_fs_info { spinlock_t ref_verify_lock; struct rb_root block_tree; #endif + + struct mutex hmzoned_meta_io_lock; }; static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index fbbc313f9f46..4abadd9317d1 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ 
-2707,6 +2707,7 @@ int __cold open_ctree(struct super_block *sb, mutex_init(&fs_info->delete_unused_bgs_mutex); mutex_init(&fs_info->reloc_mutex); mutex_init(&fs_info->delalloc_root_mutex); + mutex_init(&fs_info->hmzoned_meta_io_lock); seqlock_init(&fs_info->profiles_lock); INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 6e25c8790ef4..24f7b05e1f4c 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3921,7 +3921,9 @@ int btree_write_cache_pages(struct address_space *mapping, struct writeback_control *wbc) { struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree; + struct btrfs_fs_info *fs_info = tree->fs_info; struct extent_buffer *eb, *prev_eb = NULL; + struct btrfs_block_group *cache = NULL; struct extent_page_data epd = { .bio = NULL, .tree = tree, @@ -3951,6 +3953,7 @@ int btree_write_cache_pages(struct address_space *mapping, tag = PAGECACHE_TAG_TOWRITE; else tag = PAGECACHE_TAG_DIRTY; + btrfs_hmzoned_meta_io_lock(fs_info); retry: if (wbc->sync_mode == WB_SYNC_ALL) tag_pages_for_writeback(mapping, index, end); @@ -3994,12 +3997,30 @@ int btree_write_cache_pages(struct address_space *mapping, if (!ret) continue; + if (!btrfs_check_meta_write_pointer(fs_info, eb, + &cache)) { + /* + * If for_sync, this hole will be + * filled with trasnsaction commit. 
+ */ + if (wbc->sync_mode == WB_SYNC_ALL && + !wbc->for_sync) + ret = -EAGAIN; + else + ret = 0; + done = 1; + free_extent_buffer(eb); + break; + } + prev_eb = eb; ret = lock_extent_buffer_for_io(eb, &epd); if (!ret) { + btrfs_revert_meta_write_pointer(cache, eb); free_extent_buffer(eb); continue; } else if (ret < 0) { + btrfs_revert_meta_write_pointer(cache, eb); done = 1; free_extent_buffer(eb); break; @@ -4032,12 +4053,16 @@ int btree_write_cache_pages(struct address_space *mapping, index = 0; goto retry; } + if (cache) + btrfs_put_block_group(cache); ASSERT(ret <= 0); if (ret < 0) { end_write_bio(&epd, ret); - return ret; + goto out; } ret = flush_write_bio(&epd); +out: + btrfs_hmzoned_meta_io_unlock(fs_info); return ret; } diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c index 0c0ee9a46009..1aa4c9d1032e 100644 --- a/fs/btrfs/hmzoned.c +++ b/fs/btrfs/hmzoned.c @@ -1069,6 +1069,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) } } + if (!ret) + cache->meta_write_pointer = cache->alloc_offset + cache->start; + kfree(alloc_offsets); free_extent_map(em); @@ -1171,3 +1174,52 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans) } spin_unlock(&trans->releasing_ebs_lock); } + +bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, + struct extent_buffer *eb, + struct btrfs_block_group **cache_ret) +{ + struct btrfs_block_group *cache; + + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return true; + + cache = *cache_ret; + + if (cache && + (eb->start < cache->start || + cache->start + cache->length <= eb->start)) { + btrfs_put_block_group(cache); + cache = NULL; + *cache_ret = NULL; + } + + if (!cache) + cache = btrfs_lookup_block_group(fs_info, + eb->start); + + if (cache) { + *cache_ret = cache; + + if (cache->meta_write_pointer != eb->start) { + btrfs_put_block_group(cache); + cache = NULL; + *cache_ret = NULL; + return false; + } + + cache->meta_write_pointer = eb->start + eb->len; + } + + return true; +} + +void 
btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, + struct extent_buffer *eb) +{ + if (!btrfs_fs_incompat(eb->fs_info, HMZONED) || !cache) + return; + + ASSERT(cache->meta_write_pointer == eb->start + eb->len); + cache->meta_write_pointer = eb->start; +} diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index f6682ead575b..54f1affa6919 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -50,6 +50,11 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans, struct extent_buffer *eb); void btrfs_free_redirty_list(struct btrfs_transaction *trans); void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len); +bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, + struct extent_buffer *eb, + struct btrfs_block_group **cache_ret); +void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, + struct extent_buffer *eb); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -120,6 +125,14 @@ static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans, static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { } static inline void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len) { } +static inline bool btrfs_check_meta_write_pointer( + struct btrfs_fs_info *fs_info, struct extent_buffer *eb, + struct btrfs_block_group **cache_ret) +{ + return true; +} +static inline void btrfs_revert_meta_write_pointer( + struct btrfs_block_group *cache, struct extent_buffer *eb) { } #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -254,4 +267,18 @@ static inline void btrfs_hmzoned_data_io_unlock_logical( btrfs_put_block_group(cache); } +static inline void btrfs_hmzoned_meta_io_lock(struct btrfs_fs_info *fs_info) +{ + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return; + mutex_lock(&fs_info->hmzoned_meta_io_lock); +} + +static inline void 
btrfs_hmzoned_meta_io_unlock(struct btrfs_fs_info *fs_info) +{ + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return; + mutex_unlock(&fs_info->hmzoned_meta_io_lock); +} + #endif From patchwork Fri Dec 13 04:09:06 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11289867 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E92421593 for ; Fri, 13 Dec 2019 04:11:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id C638B227BF for ; Fri, 13 Dec 2019 04:11:22 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="X0lOXTIQ" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732005AbfLMELV (ORCPT ); Thu, 12 Dec 2019 23:11:21 -0500 Received: from esa6.hgst.iphmx.com ([216.71.154.45]:11924 "EHLO esa6.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731971AbfLMELV (ORCPT ); Thu, 12 Dec 2019 23:11:21 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1576210281; x=1607746281; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=wD/L2N6Pr/GUhfnWh9kOAQLWKs9x/iWk54ItmaVu6NM=; b=X0lOXTIQ3BG1QQoZl6ahwV9E4ZVfangqJgZNsjT6QBeD1T4M8yWU8H7i ZFrAff9y8cI0i9lnNBgvTyAdGo5wqqmdTf6jSAFALZO3nvxJcEBpCelHY UELl94qSKdBxqWYOCxrlMEtx8fjWrm3mMLPn6CingbLhxAqRPZwfjmHJZ jsw4JQgVKhQaGA+Ajicod3ssUiO1XBYrPY976RAF6KpiiFbymM809wQoI 1YM54+C4Zirf1fn/LoWgawJthgKGZQK3i/bQEwq9kg/KB93Ov6l6Qijen Y12I2Uk4Hff98YzPsNXy6uQHTowMMs2YoC57grmqAOkt5UJdHexTl14yt w==; IronPort-SDR: gAmeiG3Y+NmxZvk+/2hRYKRs6g/x+Eol6nQV5HVv/+RwdQcxfXaNlIF2gbAQk2T0zuh4x5u+6C 
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
    Johannes Thumshirn, Hannes Reinecke, Anand Jain,
    linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 19/28] btrfs: wait existing extents before truncating
Date: Fri, 13 Dec 2019 13:09:06 +0900
Message-Id: <20191213040915.3502922-20-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
When truncating a file, file buffers that have already been allocated
but not yet written may be truncated. Truncating these buffers can
break the sequential write pattern in a block group if, for example,
the truncated blocks are followed by blocks allocated to another file.
To avoid this problem, always wait for writeback of all unwritten
buffers before proceeding with the truncation.

Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/inode.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 44658590c6e8..e7fc217be095 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5323,6 +5323,16 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 		btrfs_end_write_no_snapshotting(root);
 		btrfs_end_transaction(trans);
 	} else {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+		if (btrfs_fs_incompat(fs_info, HMZONED)) {
+			ret = btrfs_wait_ordered_range(
+				inode,
+				ALIGN(newsize, fs_info->sectorsize),
+				(u64)-1);
+			if (ret)
+				return ret;
+		}

 		/*
 		 * We're truncating a file that used to have good data down to
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
    Johannes Thumshirn, Hannes Reinecke, Anand Jain,
    linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 20/28] btrfs: avoid async checksum on HMZONED mode
Date: Fri, 13 Dec 2019 13:09:07 +0900
Message-Id: <20191213040915.3502922-21-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>

In HMZONED mode, btrfs uses a per-block-group zone_io_lock to serialize
data write I/Os and a per-filesystem hmzoned_meta_io_lock to serialize
metadata write I/Os. Even with this serialization, write bios sent from
{btree,btrfs}_write_cache_pages can be reordered by the async checksum
workers, because these workers are per-CPU, not per-zone.

To preserve write bio ordering, disable async checksumming in HMZONED
mode. This does not lower performance on HDDs, as a single CPU core is
fast enough to checksum a single zone's write stream at the maximum
possible bandwidth of the device. If multiple zones are written
simultaneously, HDD seek overhead lowers the achievable maximum
bandwidth, so per-zone checksum serialization again does not affect
performance.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/disk-io.c | 2 ++
 fs/btrfs/inode.c   | 9 ++++++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4abadd9317d1..c3d8fc10d11d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -882,6 +882,8 @@ static blk_status_t btree_submit_bio_start(void *private_data, struct bio *bio,
 static int check_async_write(struct btrfs_fs_info *fs_info,
 			     struct btrfs_inode *bi)
 {
+	if (btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
 	if (atomic_read(&bi->sync_writers))
 		return 0;
 	if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e7fc217be095..bd3384200fc9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2166,7 +2166,8 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
 	enum btrfs_wq_endio_type metadata = BTRFS_WQ_ENDIO_DATA;
 	blk_status_t ret = 0;
 	int skip_sum;
-	int async = !atomic_read(&BTRFS_I(inode)->sync_writers);
+	int async = !atomic_read(&BTRFS_I(inode)->sync_writers) &&
+		!btrfs_fs_incompat(fs_info, HMZONED);

 	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;

@@ -8457,7 +8458,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 	/* Check btrfs_submit_bio_hook() for rules about async submit. */
 	if (async_submit)
-		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers);
+		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers) &&
+			!btrfs_fs_incompat(fs_info, HMZONED);

 	if (!write) {
 		ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
@@ -8522,7 +8524,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	}

 	/* async crcs make it difficult to collect full stripe writes. */
-	if (btrfs_data_alloc_profile(fs_info) & BTRFS_BLOCK_GROUP_RAID56_MASK)
+	if (btrfs_data_alloc_profile(fs_info) & BTRFS_BLOCK_GROUP_RAID56_MASK ||
+	    btrfs_fs_incompat(fs_info, HMZONED))
 		async_submit = 0;
 	else
 		async_submit = 1;
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
    Johannes Thumshirn, Hannes Reinecke, Anand Jain,
    linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 21/28] btrfs: disallow mixed-bg in HMZONED mode
Date: Fri, 13 Dec 2019 13:09:08 +0900
Message-Id: <20191213040915.3502922-22-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
Placing both data and metadata in one block group is impossible in
HMZONED mode. For data, we can allocate space and write it out
immediately after the allocation. For metadata, however, we cannot do
so, because the logical addresses are recorded in other metadata
buffers to build up the trees. As a result, a data buffer can be
placed after a metadata buffer that is not yet written, and writing
out the data buffer would then break the sequential write rule.

This commit checks for and disallows the MIXED_GROUPS feature together
with HMZONED mode.

Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/hmzoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 1aa4c9d1032e..c779232bb003 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -306,6 +306,13 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}

+	if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
+		btrfs_err(fs_info,
+			  "HMZONED mode is not allowed for mixed block groups");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	fs_info->zone_size = zone_size;

 	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
    Johannes Thumshirn, Hannes Reinecke, Anand Jain,
    linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 22/28] btrfs: disallow inode_cache in HMZONED mode
Date: Fri, 13 Dec 2019 13:09:09 +0900
Message-Id: <20191213040915.3502922-23-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>

The inode_cache feature uses pre-allocation to write its cache data,
but pre-allocation is completely disabled in HMZONED mode. We could
technically enable inode_cache in the same way as relocation; however,
inode_cache is rarely used and the man page discourages using it, so
just disable it for now.
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/hmzoned.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index c779232bb003..465db8e6de94 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -342,6 +342,12 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 		return -EOPNOTSUPP;
 	}

+	if (btrfs_test_pending(info, SET_INODE_MAP_CACHE)) {
+		btrfs_err(info,
+			  "cannot enable inode map caching with HMZONED mode");
+		return -EOPNOTSUPP;
+	}
+
 	return 0;
 }
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
    Johannes Thumshirn, Hannes Reinecke, Anand Jain,
    linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 23/28] btrfs: support dev-replace in HMZONED mode
Date: Fri, 13 Dec 2019 13:09:10 +0900
Message-Id: <20191213040915.3502922-24-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>

We have two types of I/O during the device-replace process. One is I/O
to "copy" (via the scrub functions) all the device extents from the
source device to the destination device. The other is I/O to "clone"
(via handle_ops_on_dev_replace()) new incoming write I/Os from users
to the source device into the target device.

Cloning incoming I/Os can break the sequential write rule on the
target device: when a write is mapped to the middle of a block group,
that I/O lands in the middle of a zone on the target device, which
violates sequential writing. However, the cloning function cannot
simply be disabled, since incoming I/Os targeting already-copied
device extents must be cloned so that the I/O is executed on the
target device.

We cannot use dev_replace->cursor_{left,right} to determine whether a
bio targets a not-yet-copied region. Because there is a time gap
between finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), a newly allocated device extent may be
neither cloned nor copied. So the point is to copy only the device
extents that already exist.

This patch introduces mark_block_group_to_copy() to mark existing
block groups as targets of copying. handle_ops_on_dev_replace() and
dev-replace can then check the flag to do their jobs.

The device-replace process in HMZONED mode must copy or clone every
extent on the source device exactly once. So we need to ensure that
allocations started just before the dev-replace process have their
corresponding extent information in the B-trees.
finish_extent_writes_for_hmzoned() implements that functionality; it
is basically the code removed in commit 042528f8d840 ("Btrfs: fix
block group remaining RO forever after error during device replace").

This patch also handles the empty regions between used extents. Since
dev-replace is smart enough to copy only used extents on the source
device, we have to fill the gaps to honor the sequential write rule on
the target device.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/block-group.h |   1 +
 fs/btrfs/dev-replace.c | 178 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dev-replace.h |   3 +
 fs/btrfs/extent-tree.c |  20 ++++-
 fs/btrfs/hmzoned.c     |  91 +++++++++++++++++++++
 fs/btrfs/hmzoned.h     |  16 ++++
 fs/btrfs/scrub.c       | 142 +++++++++++++++++++++++++++-
 fs/btrfs/volumes.c     |  36 ++++++++-
 8 files changed, 481 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 8827869f1744..323ba01ad8a9 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -83,6 +83,7 @@ struct btrfs_block_group {
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
 	unsigned int wp_broken:1;
+	unsigned int to_copy:1;

 	int disk_cache_state;

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 9286c6e0b636..6ac6aa0eb0b6 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -265,6 +265,10 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 	device->fs_devices = fs_info->fs_devices;

+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error;
+
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	list_add(&device->dev_list, &fs_info->fs_devices->devices);
 	fs_info->fs_devices->num_devices++;

@@ -399,6 +403,176 @@ static char* btrfs_dev_name(struct btrfs_device *device)
 	return rcu_str_deref(device->name);
 }
+static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
+				    struct btrfs_device *src_dev)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_root *root = fs_info->dev_root;
+	struct btrfs_dev_extent *dev_extent = NULL;
+	struct btrfs_block_group *cache;
+	struct extent_buffer *l;
+	struct btrfs_trans_handle *trans;
+	int slot;
+	int ret = 0;
+	u64 chunk_offset, length;
+
+	/* Do not use "to_copy" on non-HMZONED for now */
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	mutex_lock(&fs_info->chunk_mutex);
+
+	/* ensure we don't have a pending new block group */
+	while (fs_info->running_transaction &&
+	       !list_empty(&fs_info->running_transaction->dev_update_list)) {
+		mutex_unlock(&fs_info->chunk_mutex);
+		trans = btrfs_attach_transaction(root);
+		if (IS_ERR(trans)) {
+			ret = PTR_ERR(trans);
+			mutex_lock(&fs_info->chunk_mutex);
+			if (ret == -ENOENT)
+				continue;
+			else
+				goto out;
+		}
+
+		ret = btrfs_commit_transaction(trans);
+		mutex_lock(&fs_info->chunk_mutex);
+		if (ret)
+			goto out;
+	}
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	path->reada = READA_FORWARD;
+	path->search_commit_root = 1;
+	path->skip_locking = 1;
+
+	key.objectid = src_dev->devid;
+	key.offset = 0ull;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+
+	while (1) {
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			if (path->slots[0] >=
+			    btrfs_header_nritems(path->nodes[0])) {
+				ret = btrfs_next_leaf(root, path);
+				if (ret < 0)
+					break;
+				if (ret > 0) {
+					ret = 0;
+					break;
+				}
+			} else {
+				ret = 0;
+			}
+		}
+
+		l = path->nodes[0];
+		slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(l, &found_key, slot);
+
+		if (found_key.objectid != src_dev->devid)
+			break;
+
+		if (found_key.type != BTRFS_DEV_EXTENT_KEY)
+			break;
+
+		if (found_key.offset < key.offset)
+			break;
+
+		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
+		length = btrfs_dev_extent_length(l, dev_extent);
+
+		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
+
+		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
+		if (!cache)
+			goto skip;
+
+		spin_lock(&cache->lock);
+		cache->to_copy = 1;
+		spin_unlock(&cache->lock);
+
+		btrfs_put_block_group(cache);
+
+skip:
+		key.offset = found_key.offset + length;
+		btrfs_release_path(path);
+	}
+
+	btrfs_free_path(path);
+out:
+	mutex_unlock(&fs_info->chunk_mutex);
+
+	return ret;
+}
+
+bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group *cache,
+				      u64 physical)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map *em;
+	struct map_lookup *map;
+	u64 chunk_offset = cache->start;
+	int num_extents, cur_extent;
+	int i;
+
+	/* Do not use "to_copy" on non-HMZONED for now */
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return true;
+
+	spin_lock(&cache->lock);
+	if (cache->removed) {
+		spin_unlock(&cache->lock);
+		return true;
+	}
+	spin_unlock(&cache->lock);
+
+	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
+	BUG_ON(IS_ERR(em));
+	map = em->map_lookup;
+
+	num_extents = cur_extent = 0;
+	for (i = 0; i < map->num_stripes; i++) {
+		/* we have more device extents to copy */
+		if (srcdev != map->stripes[i].dev)
+			continue;
+
+		num_extents++;
+		if (physical == map->stripes[i].physical)
+			cur_extent = i;
+	}
+
+	free_extent_map(em);
+
+	if (num_extents > 1 && cur_extent < num_extents - 1) {
+		/*
+		 * Has more stripes on this device. Keep this BG
+		 * readonly until we finish all the stripes.
+		 */
+		return false;
+	}
+
+	/* last stripe on this device */
+	spin_lock(&cache->lock);
+	cache->to_copy = 0;
+	spin_unlock(&cache->lock);
+
+	return true;
+}
+
 static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 		const char *tgtdev_name, u64 srcdevid, const char *srcdev_name,
 		int read_src)
@@ -440,6 +614,10 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		return ret;
 
+	ret = mark_block_group_to_copy(fs_info, src_device);
+	if (ret)
+		return ret;
+
 	down_write(&dev_replace->rwsem);
 	switch (dev_replace->replace_state) {
 	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:

diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index 60b70dacc299..3911049a5f23 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -18,5 +18,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info);
 void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info);
 int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info);
 int __pure btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace);
+bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group *cache,
+				      u64 physical);
 
 #endif

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d1f326b6c4d4..69c4ce8ec83e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -34,6 +34,7 @@
 #include "block-group.h"
 #include "rcu-string.h"
 #include "hmzoned.h"
+#include "dev-replace.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -1343,6 +1344,8 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			u64 length = stripe->length;
 			u64 bytes;
 			struct request_queue *req_q;
+			struct btrfs_dev_replace *dev_replace =
+				&fs_info->dev_replace;
 
 			if (!stripe->dev->bdev) {
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
@@ -1351,15 +1354,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			req_q = bdev_get_queue(stripe->dev->bdev);
 
 			/* zone reset in HMZONED mode */
-			if (btrfs_can_zone_reset(dev, physical, length))
+			if (btrfs_can_zone_reset(dev, physical, length)) {
 				ret = btrfs_reset_device_zone(dev, physical,
 							      length, &bytes);
-			else if (blk_queue_discard(req_q))
+				if (ret)
+					goto next;
+				if (!btrfs_dev_replace_is_ongoing(
+							dev_replace) ||
+				    dev != dev_replace->srcdev)
+					goto next;
+
+				discarded_bytes += bytes;
+				/* send to replace target as well */
+				ret = btrfs_reset_device_zone(
+					dev_replace->tgtdev,
+					physical, length, &bytes);
+			} else if (blk_queue_discard(req_q))
 				ret = btrfs_issue_discard(dev->bdev, physical,
 							  length, &bytes);
 			else
 				continue;
 
+next:
 			if (!ret) {
 				discarded_bytes += bytes;
 			} else if (ret != -EOPNOTSUPP) {

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 465db8e6de94..c26a28bd159e 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -18,6 +18,7 @@
 #include "locking.h"
 #include "space-info.h"
 #include "transaction.h"
+#include "dev-replace.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES 4096
@@ -842,6 +843,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 	for (i = 0; i < map->num_stripes; i++) {
 		bool is_sequential;
 		struct blk_zone zone;
+		struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
+		int dev_replace_is_ongoing = 0;
 
 		device = map->stripes[i].dev;
 		physical = map->stripes[i].physical;
@@ -868,6 +871,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
 		 */
 		btrfs_dev_clear_zone_empty(device, physical);
 
+		down_read(&dev_replace->rwsem);
+		dev_replace_is_ongoing =
+			btrfs_dev_replace_is_ongoing(dev_replace);
+		if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL)
+			btrfs_dev_clear_zone_empty(dev_replace->tgtdev,
+						   physical);
+		up_read(&dev_replace->rwsem);
+
 		/*
 		 * The group is mapped to a sequential zone. Get the zone write
 		 * pointer to determine the allocation offset within the zone.
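The next hunk adds the write-pointer sync helper used after a dev-replace pass over a zone. Its core decision can be sketched standalone (illustrative names, not btrfs code; `EUCLEAN_MODEL` is a stand-in for `-EUCLEAN`; zone positions are in 512-byte sectors as in `struct blk_zone`, while btrfs positions are bytes):

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9		/* 512-byte sectors, as in the kernel */
#define EUCLEAN_MODEL 117	/* stand-in for the kernel's -EUCLEAN */

/*
 * Compare the position we have copied up to (bytes) against the
 * on-device zone write pointer (sectors) and compute how much must be
 * zero-filled for the target zone to catch up.
 */
int sync_wp_decision(uint64_t physical_start, uint64_t physical_pos,
		     uint64_t zone_start, uint64_t zone_wp,
		     uint64_t *fill_len)
{
	uint64_t wp = physical_start + ((zone_wp - zone_start) << SECTOR_SHIFT);

	*fill_len = 0;
	if (physical_pos == wp)
		return 0;			/* already in sync */
	if (physical_pos > wp)
		return -EUCLEAN_MODEL;		/* copied past the pointer */
	*fill_len = wp - physical_pos;		/* gap to zero-fill */
	return 0;
}
```

If the copied position lags the write pointer, the gap is zero-filled (cf. `btrfs_hmzoned_issue_zeroout()`); a position beyond the pointer indicates corruption.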
@@ -1236,3 +1247,83 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, ASSERT(cache->meta_write_pointer == eb->start + eb->len); cache->meta_write_pointer = eb->start; } + +int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, u64 physical, + u64 length) +{ + if (!btrfs_dev_is_sequential(device, physical)) + return -EOPNOTSUPP; + + return blkdev_issue_zeroout(device->bdev, + physical >> SECTOR_SHIFT, + length >> SECTOR_SHIFT, + GFP_NOFS, 0); +} + +static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical, + struct blk_zone *zone) +{ + struct btrfs_bio *bbio = NULL; + u64 mapped_length = PAGE_SIZE; + unsigned int nofs_flag; + int nmirrors; + int i, ret; + + ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical, + &mapped_length, &bbio); + if (ret || !bbio || mapped_length < PAGE_SIZE) { + btrfs_put_bbio(bbio); + return -EIO; + } + + if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) + return -EINVAL; + + nofs_flag = memalloc_nofs_save(); + nmirrors = (int)bbio->num_stripes; + for (i = 0; i < nmirrors; i++) { + u64 physical = bbio->stripes[i].physical; + struct btrfs_device *dev = bbio->stripes[i].dev; + + /* missing device */ + if (!dev->bdev) + continue; + + ret = btrfs_get_dev_zone(dev, physical, zone); + /* failing device */ + if (ret == -EIO || ret == -EOPNOTSUPP) + continue; + break; + } + memalloc_nofs_restore(nofs_flag); + + return ret; +} + +int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical, + u64 physical_start, u64 physical_pos) +{ + struct btrfs_fs_info *fs_info = tgt_dev->fs_info; + struct blk_zone zone; + u64 length; + u64 wp; + int ret; + + if (!btrfs_dev_is_sequential(tgt_dev, physical_pos)) + return 0; + + ret = read_zone_info(fs_info, logical, &zone); + if (ret) + return ret; + + wp = physical_start + ((zone.wp - zone.start) << SECTOR_SHIFT); + + if (physical_pos == wp) + return 0; + + if (physical_pos > wp) + return -EUCLEAN; + + length = wp - physical_pos; + return 
btrfs_hmzoned_issue_zeroout(tgt_dev, physical_pos, length); +} diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index 54f1affa6919..8558dd692b08 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -55,6 +55,10 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, struct btrfs_block_group **cache_ret); void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, struct extent_buffer *eb); +int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, u64 physical, + u64 length); +int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical, + u64 physical_start, u64 physical_pos); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -133,6 +137,18 @@ static inline bool btrfs_check_meta_write_pointer( } static inline void btrfs_revert_meta_write_pointer( struct btrfs_block_group *cache, struct extent_buffer *eb) { } +static inline int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, + u64 physical, u64 length) +{ + return -EOPNOTSUPP; +} +static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, + u64 logical, + u64 physical_start, + u64 physical_pos) +{ + return -EOPNOTSUPP; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index af7cec962619..e88f32256ccc 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -168,6 +168,7 @@ struct scrub_ctx { int pages_per_rd_bio; int is_dev_replace; + u64 write_pointer; struct scrub_bio *wr_curr_bio; struct mutex wr_lock; @@ -1627,6 +1628,25 @@ static int scrub_write_page_to_dev_replace(struct scrub_block *sblock, return scrub_add_page_to_wr_bio(sblock->sctx, spage); } +static int fill_writer_pointer_gap(struct scrub_ctx *sctx, u64 physical) +{ + int ret = 0; + u64 length; + + if (!btrfs_fs_incompat(sctx->fs_info, HMZONED)) + return 0; + + if (sctx->write_pointer < physical) { + 
length = physical - sctx->write_pointer; + + ret = btrfs_hmzoned_issue_zeroout(sctx->wr_tgtdev, + sctx->write_pointer, length); + if (!ret) + sctx->write_pointer = physical; + } + return ret; +} + static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx, struct scrub_page *spage) { @@ -1649,6 +1669,13 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx, if (sbio->page_count == 0) { struct bio *bio; + ret = fill_writer_pointer_gap(sctx, + spage->physical_for_dev_replace); + if (ret) { + mutex_unlock(&sctx->wr_lock); + return ret; + } + sbio->physical = spage->physical_for_dev_replace; sbio->logical = spage->logical; sbio->dev = sctx->wr_tgtdev; @@ -1710,6 +1737,10 @@ static void scrub_wr_submit(struct scrub_ctx *sctx) * doubled the write performance on spinning disks when measured * with Linux 3.5 */ btrfsic_submit_bio(sbio->bio); + + if (btrfs_fs_incompat(sctx->fs_info, HMZONED)) + sctx->write_pointer = sbio->physical + + sbio->page_count * PAGE_SIZE; } static void scrub_wr_bio_end_io(struct bio *bio) @@ -3040,6 +3071,46 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx, return ret < 0 ? 
ret : 0; } +static void sync_replace_for_hmzoned(struct scrub_ctx *sctx) +{ + if (!btrfs_fs_incompat(sctx->fs_info, HMZONED)) + return; + + sctx->flush_all_writes = true; + scrub_submit(sctx); + mutex_lock(&sctx->wr_lock); + scrub_wr_submit(sctx); + mutex_unlock(&sctx->wr_lock); + + wait_event(sctx->list_wait, + atomic_read(&sctx->bios_in_flight) == 0); +} + +static int sync_write_pointer_for_hmzoned(struct scrub_ctx *sctx, u64 logical, + u64 physical, u64 physical_end) +{ + struct btrfs_fs_info *fs_info = sctx->fs_info; + int ret = 0; + + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return 0; + + wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0); + + mutex_lock(&sctx->wr_lock); + if (sctx->write_pointer < physical_end) { + ret = btrfs_sync_zone_write_pointer(sctx->wr_tgtdev, logical, + physical, + sctx->write_pointer); + if (ret) + btrfs_err(fs_info, "failed to recover write pointer"); + } + mutex_unlock(&sctx->wr_lock); + btrfs_dev_clear_zone_empty(sctx->wr_tgtdev, physical); + + return ret; +} + static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, struct map_lookup *map, struct btrfs_device *scrub_dev, @@ -3052,7 +3123,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, struct btrfs_extent_item *extent; struct blk_plug plug; u64 flags; - int ret; + int ret, ret2; int slot; u64 nstripes; struct extent_buffer *l; @@ -3171,6 +3242,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, */ blk_start_plug(&plug); + if (sctx->is_dev_replace && + btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) { + mutex_lock(&sctx->wr_lock); + sctx->write_pointer = physical; + mutex_unlock(&sctx->wr_lock); + sctx->flush_all_writes = true; + } + /* * now find all extents for each stripe and scrub them */ @@ -3343,6 +3422,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, if (ret) goto out; + if (sctx->is_dev_replace) + sync_replace_for_hmzoned(sctx); + if (extent_logical + extent_len < 
key.objectid + bytes) { if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) { @@ -3410,6 +3492,15 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, blk_finish_plug(&plug); btrfs_free_path(path); btrfs_free_path(ppath); + + if (sctx->is_dev_replace && ret >= 0) { + ret2 = sync_write_pointer_for_hmzoned( + sctx, base + offset, + map->stripes[num].physical, physical_end); + if (ret2) + ret = ret2; + } + return ret < 0 ? ret : 0; } @@ -3465,6 +3556,25 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx, return ret; } +static int finish_extent_writes_for_hmzoned(struct btrfs_root *root, + struct btrfs_block_group *cache) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct btrfs_trans_handle *trans; + + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return 0; + + btrfs_wait_block_group_reservations(cache); + btrfs_wait_nocow_writers(cache); + btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length); + + trans = btrfs_join_transaction(root); + if (IS_ERR(trans)) + return PTR_ERR(trans); + return btrfs_commit_transaction(trans); +} + static noinline_for_stack int scrub_enumerate_chunks(struct scrub_ctx *sctx, struct btrfs_device *scrub_dev, u64 start, u64 end) @@ -3483,6 +3593,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, struct btrfs_key found_key; struct btrfs_block_group *cache; struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace; + bool do_chunk_alloc = btrfs_fs_incompat(fs_info, HMZONED); path = btrfs_alloc_path(); if (!path) @@ -3551,6 +3662,18 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, if (!cache) goto skip; + + if (sctx->is_dev_replace && + btrfs_fs_incompat(fs_info, HMZONED)) { + spin_lock(&cache->lock); + if (!cache->to_copy) { + spin_unlock(&cache->lock); + ro_set = 0; + goto done; + } + spin_unlock(&cache->lock); + } + /* * we need call btrfs_inc_block_group_ro() with scrubs_paused, * to avoid deadlock caused by: @@ -3579,7 +3702,16 @@ int scrub_enumerate_chunks(struct scrub_ctx 
*sctx, * thread can't be triggered fast enough, and use up all space * of btrfs_super_block::sys_chunk_array */ - ret = btrfs_inc_block_group_ro(cache, false); + ret = btrfs_inc_block_group_ro(cache, do_chunk_alloc); + if (!ret && sctx->is_dev_replace) { + ret = finish_extent_writes_for_hmzoned(root, cache); + if (ret) { + btrfs_dec_block_group_ro(cache); + scrub_pause_off(fs_info); + btrfs_put_block_group(cache); + break; + } + } scrub_pause_off(fs_info); if (ret == 0) { @@ -3641,6 +3773,12 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, scrub_pause_off(fs_info); + if (sctx->is_dev_replace && + !btrfs_finish_block_group_to_copy(dev_replace->srcdev, + cache, found_key.offset)) + ro_set = 0; + +done: down_write(&dev_replace->rwsem); dev_replace->cursor_left = dev_replace->cursor_right; dev_replace->item_needs_writeback = 1; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index d5b280b59733..adc9dfd655a6 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1414,6 +1414,9 @@ static int find_free_dev_extent_start(struct btrfs_device *device, search_start = btrfs_zone_align(device, search_start); } + WARN_ON(device->zone_info && + !IS_ALIGNED(num_bytes, device->zone_info->zone_size)); + path = btrfs_alloc_path(); if (!path) return -ENOMEM; @@ -5721,9 +5724,29 @@ static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info, return ret; } +static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical) +{ + struct btrfs_block_group *cache; + bool ret; + + /* non-HMZONED mode does not use "to_copy" flag */ + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return false; + + cache = btrfs_lookup_block_group(fs_info, logical); + + spin_lock(&cache->lock); + ret = cache->to_copy; + spin_unlock(&cache->lock); + + btrfs_put_block_group(cache); + return ret; +} + static void handle_ops_on_dev_replace(enum btrfs_map_op op, struct btrfs_bio **bbio_ret, struct btrfs_dev_replace *dev_replace, + u64 logical, int *num_stripes_ret, int 
*max_errors_ret)
 {
 	struct btrfs_bio *bbio = *bbio_ret;
@@ -5736,6 +5759,15 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 	if (op == BTRFS_MAP_WRITE) {
 		int index_where_to_add;
 
+		/*
+		 * A block group which has "to_copy" set will eventually be
+		 * copied by the dev-replace process. We can avoid cloning
+		 * IO here.
+		 */
+		if (is_block_group_to_copy(dev_replace->srcdev->fs_info,
+					   logical))
+			return;
+
 		/*
 		 * duplicate the write operations while the dev replace
 		 * procedure is running. Since the copying of the old disk to
@@ -6146,8 +6178,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
 	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
 	    need_full_stripe(op)) {
-		handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes,
-					  &max_errors);
+		handle_ops_on_dev_replace(op, &bbio, dev_replace, logical,
+					  &num_stripes, &max_errors);
 	}
 
 	*bbio_ret = bbio;

From patchwork Fri Dec 13 04:09:11 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289887
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 24/28] btrfs: enable relocation in HMZONED mode
Date: Fri, 13 Dec 2019 13:09:11 +0900
Message-Id: <20191213040915.3502922-25-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
References: <20191213040915.3502922-1-naohiro.aota@wdc.com>

To serialize allocation and submit_bio, we introduced a mutex around them. As a result, preallocation must be completely disabled to avoid a deadlock.

Since the current relocation process relies on preallocation to move file data extents, it must be handled in another way. In HMZONED mode, we just truncate the inode to the size that we wanted to preallocate. Then, we flush the dirty pages on the file before finishing the relocation process. run_delalloc_hmzoned() will handle all the allocation and will submit the IOs to the underlying layers.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/relocation.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d897a8e5e430..2d17b7566df4 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3159,6 +3159,34 @@ int prealloc_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
+	/*
+	 * In HMZONED, we cannot preallocate the file region. Instead,
+	 * we dirty and filemap_write the region.
+ */ + + if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) { + struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_trans_handle *trans; + + end = cluster->end - offset + 1; + trans = btrfs_start_transaction(root, 1); + if (IS_ERR(trans)) + return PTR_ERR(trans); + + inode->i_ctime = current_time(inode); + i_size_write(inode, end); + btrfs_ordered_update_i_size(inode, end, NULL); + ret = btrfs_update_inode(trans, root, inode); + if (ret) { + btrfs_abort_transaction(trans, ret); + btrfs_end_transaction(trans); + return ret; + } + ret = btrfs_end_transaction(trans); + + goto out; + } + cur_offset = prealloc_start; while (nr < cluster->nr) { start = cluster->boundary[nr] - offset; @@ -3346,6 +3374,10 @@ static int relocate_file_extent_cluster(struct inode *inode, btrfs_throttle(fs_info); } WARN_ON(nr != cluster->nr); + if (btrfs_fs_incompat(fs_info, HMZONED) && !ret) { + ret = btrfs_wait_ordered_range(inode, 0, (u64)-1); + WARN_ON(ret); + } out: kfree(ra); return ret; @@ -4186,8 +4218,12 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans, struct btrfs_path *path; struct btrfs_inode_item *item; struct extent_buffer *leaf; + u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC; int ret; + if (btrfs_fs_incompat(trans->fs_info, HMZONED)) + flags &= ~BTRFS_INODE_PREALLOC; + path = btrfs_alloc_path(); if (!path) return -ENOMEM; @@ -4202,8 +4238,7 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans, btrfs_set_inode_generation(leaf, item, 1); btrfs_set_inode_size(leaf, item, 0); btrfs_set_inode_mode(leaf, item, S_IFREG | 0600); - btrfs_set_inode_flags(leaf, item, BTRFS_INODE_NOCOMPRESS | - BTRFS_INODE_PREALLOC); + btrfs_set_inode_flags(leaf, item, flags); btrfs_mark_buffer_dirty(leaf); out: btrfs_free_path(path); From patchwork Fri Dec 13 04:09:12 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11289895 Return-Path: 
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 25/28] btrfs: relocate block group to repair IO failure in HMZONED
Date: Fri, 13 Dec 2019 13:09:12 +0900
Message-Id: <20191213040915.3502922-26-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
References: <20191213040915.3502922-1-naohiro.aota@wdc.com>

When btrfs finds a checksum error and the file system has a mirror of the damaged data, btrfs reads the correct data from the mirror and writes it back to the damaged blocks.
This repair write, however, violates the sequential write rule.

We can consider three methods to repair an IO failure in HMZONED mode:

(1) Reset and rewrite the damaged zone
(2) Allocate a new device extent and replace the damaged device extent
    with the new one
(3) Relocate the corresponding block group

Method (1) is the most similar to the behavior on regular devices. However, it also wipes non-damaged data in the same device extent, and so it unnecessarily degrades the non-damaged data.

Method (2) is much like device replacing, but done within the same device. It is safe because it keeps the device extent until the replacing finishes. However, extending device replacing is non-trivial: it assumes "src_dev->physical == dst_dev->physical", and the extent mapping replacing function would need to be extended to support moving a device extent to a different position within one device.

Method (3) invokes relocation of the damaged block group, so it is straightforward to implement. It relocates all the mirrored device extents, so it is potentially a more costly operation than method (1) or (2), but it relocates only the extents in use, which reduces the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace and apply method (2).

To protect a block group from being relocated multiple times by multiple IO errors, this commit introduces a "relocating_repair" bit to indicate that the group is already being relocated for repair. It also uses a new kthread, "btrfs-relocating-repair", so that the relocation does not block the IO path.

This commit also supports repairing in the scrub process.
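The sequential write rule that rules out rewriting the damaged blocks in place can be sketched in a few lines (a standalone model with illustrative names, not btrfs or kernel code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of a sequential-required zone: a write is accepted only at
 * the current write pointer, so a block behind the pointer cannot be
 * rewritten without resetting the whole zone.
 */
struct zone_model {
	uint64_t start;	/* first byte of the zone */
	uint64_t len;	/* zone size in bytes */
	uint64_t wp;	/* write pointer: the only writable position */
};

/* Returns 0 on success, -1 if the write violates the sequential rule. */
int zone_write(struct zone_model *z, uint64_t pos, uint64_t len)
{
	if (pos != z->wp || pos + len > z->start + z->len)
		return -1;
	z->wp += len;
	return 0;
}
```

Method (3) sidesteps this restriction by writing the relocated data at the write pointer of a fresh zone instead of over the damaged blocks.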
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/block-group.h |  1 +
 fs/btrfs/extent_io.c   |  3 ++
 fs/btrfs/scrub.c       |  3 ++
 fs/btrfs/volumes.c     | 71 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h     |  1 +
 5 files changed, 79 insertions(+)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 323ba01ad8a9..4a5bd87345a1 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -84,6 +84,7 @@ struct btrfs_block_group {
 	unsigned int removed:1;
 	unsigned int wp_broken:1;
 	unsigned int to_copy:1;
+	unsigned int relocating_repair:1;

 	int disk_cache_state;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 24f7b05e1f4c..83f5e5883723 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2197,6 +2197,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 	ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
 	BUG_ON(!mirror_num);

+	if (btrfs_fs_incompat(fs_info, HMZONED))
+		return btrfs_repair_one_hmzone(fs_info, logical);
+
 	bio = btrfs_io_bio_alloc(1);
 	bio->bi_iter.bi_size = 0;
 	map_length = length;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e88f32256ccc..5ed54523f036 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -861,6 +861,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	have_csum = sblock_to_check->pagev[0]->have_csum;
 	dev = sblock_to_check->pagev[0]->dev;

+	if (btrfs_fs_incompat(fs_info, HMZONED) && !sctx->is_dev_replace)
+		return btrfs_repair_one_hmzone(fs_info, logical);
+
 	/*
 	 * We must use GFP_NOFS because the scrub task might be waiting for a
 	 * worker task executing this function and in turn a transaction commit
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index adc9dfd655a6..21801aaa77c2 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7794,3 +7794,74 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr)
 	spin_unlock(&fs_info->swapfile_pins_lock);
 	return node != NULL;
 }
+
+static int relocating_repair_kthread(void *data)
+{
+	struct btrfs_block_group *cache = (struct btrfs_block_group *)data;
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	u64 target;
+	int ret = 0;
+
+	target = cache->start;
+	btrfs_put_block_group(cache);
+
+	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
+		btrfs_info(fs_info,
+			   "skip relocating block group %llu to repair: EBUSY",
+			   target);
+		return -EBUSY;
+	}
+
+	mutex_lock(&fs_info->delete_unused_bgs_mutex);
+
+	/* ensure block group still exists */
+	cache = btrfs_lookup_block_group(fs_info, target);
+	if (!cache)
+		goto out;
+
+	if (!cache->relocating_repair)
+		goto out;
+
+	ret = btrfs_may_alloc_data_chunk(fs_info, target);
+	if (ret < 0)
+		goto out;
+
+	btrfs_info(fs_info, "relocating block group %llu to repair IO failure",
+		   target);
+	ret = btrfs_relocate_chunk(fs_info, target);
+
+out:
+	if (cache)
+		btrfs_put_block_group(cache);
+	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
+
+	return ret;
+}
+
+int btrfs_repair_one_hmzone(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group *cache;
+
+	/* do not attempt to repair in degraded state */
+	if (btrfs_test_opt(fs_info, DEGRADED))
+		return 0;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	if (!cache)
+		return 0;
+
+	spin_lock(&cache->lock);
+	if (cache->relocating_repair) {
+		spin_unlock(&cache->lock);
+		btrfs_put_block_group(cache);
+		return 0;
+	}
+	cache->relocating_repair = 1;
+	spin_unlock(&cache->lock);
+
+	kthread_run(relocating_repair_kthread, cache,
+		    "btrfs-relocating-repair");
+
+	return 0;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 70cabe65f72a..e5a2e7fc3a08 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -576,5 +576,6 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
+int btrfs_repair_one_hmzone(struct btrfs_fs_info *fs_info, u64 logical);

 #endif

From patchwork Fri Dec 13 04:09:13 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289893
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 26/28] btrfs: split alloc_log_tree()
Date: Fri, 13 Dec 2019 13:09:13 +0900
Message-Id: <20191213040915.3502922-27-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
References: <20191213040915.3502922-1-naohiro.aota@wdc.com>
This is a preparation for the next patch. This commit splits alloc_log_tree() into two parts: allocating the tree structure, which remains in alloc_log_tree(), and allocating the tree node, which moves to btrfs_alloc_log_tree_node(). The latter is also exported for use in the next patch.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/disk-io.c | 31 +++++++++++++++++++++++++------
 fs/btrfs/disk-io.h |  2 ++
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c3d8fc10d11d..914c517d26b0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1315,7 +1315,6 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 					 struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *root;
-	struct extent_buffer *leaf;

 	root = btrfs_alloc_root(fs_info, GFP_NOFS);
 	if (!root)
@@ -1327,6 +1326,14 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 	root->root_key.type = BTRFS_ROOT_ITEM_KEY;
 	root->root_key.offset = BTRFS_TREE_LOG_OBJECTID;

+	return root;
+}
+
+int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root)
+{
+	struct extent_buffer *leaf;
+
 	/*
 	 * DON'T set REF_COWS for log trees
 	 *
@@ -1338,26 +1345,31 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
 	leaf = btrfs_alloc_tree_block(trans, root, 0, BTRFS_TREE_LOG_OBJECTID,
 			NULL, 0, 0, 0);
-	if (IS_ERR(leaf)) {
-		kfree(root);
-		return ERR_CAST(leaf);
-	}
+	if (IS_ERR(leaf))
+		return PTR_ERR(leaf);

 	root->node = leaf;

 	btrfs_mark_buffer_dirty(root->node);
 	btrfs_tree_unlock(root->node);
-	return root;
+
+	return 0;
 }

 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *log_root;
+	int ret;

 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
+
+	ret = btrfs_alloc_log_tree_node(trans, log_root);
+	if (ret) {
+		kfree(log_root);
+		return ret;
+	}
+
 	WARN_ON(fs_info->log_root_tree);
 	fs_info->log_root_tree = log_root;
 	return 0;
@@ -1369,11 +1381,18 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_root *log_root;
 	struct btrfs_inode_item *inode_item;
+	int ret;

 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);

+	ret = btrfs_alloc_log_tree_node(trans, log_root);
+	if (ret) {
+		kfree(log_root);
+		return ret;
+	}
+
 	log_root->last_trans = trans->transid;
 	log_root->root_key.offset = root->root_key.objectid;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 76f123ebb292..21e8d936c705 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -121,6 +121,8 @@ blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
 			extent_submit_bio_start_t *submit_bio_start);
 blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio,
 			int mirror_num);
+int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root);
 int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info);
 int btrfs_add_log_tree(struct btrfs_trans_handle *trans,

From patchwork Fri Dec 13 04:09:14 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289897
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 27/28] btrfs: enable tree-log on HMZONED mode
Date: Fri, 13 Dec 2019 13:09:14 +0900
Message-Id: <20191213040915.3502922-28-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
References: <20191213040915.3502922-1-naohiro.aota@wdc.com>

The tree-log feature does not work in HMZONED mode as is. Blocks for a tree-log tree are allocated mixed with other metadata blocks, and btrfs writes and syncs the tree-log blocks to devices at fsync() time, which is a different time than a global transaction commit. As a result, both writing tree-log blocks and writing other metadata blocks become non-sequential writes, which HMZONED mode must avoid.

Also, since more than one log transaction can be started per subvolume at the same time, nodes from multiple transactions can be allocated interleaved. Such mixed allocation results in non-sequential writes at log transaction commit time. The nodes of the global log root tree (fs_info->log_root_tree) have the same mixed allocation problem.

This patch assigns a dedicated block group to tree-log blocks, separating the two metadata writing streams (tree-log blocks and other metadata blocks) so that each stream can be written to devices separately. "fs_info->treelog_bg" tracks the dedicated block group, and btrfs assigns "treelog_bg" on demand at tree-log block allocation time.

The patch then serializes log transactions by waiting for any committing transaction when someone tries to start a new one, to avoid the mixed allocation problem. We must also wait for running log transactions of other subvolumes, but there is no easy way to detect which subvolume root is running a log transaction. So this patch forbids starting a new log transaction when the global log root tree is already allocated by another subvolume.

Furthermore, the patch aligns the allocation order of the nodes of "fs_info->log_root_tree" and of "root->log_root" with their writing order by delaying allocation of the root node of "fs_info->log_root_tree", so that the node buffers can go out to devices sequentially.
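The skip decision described above (keep tree-log blocks inside the dedicated block group, keep everything else out of it) can be sketched as a tiny standalone model. This is an illustrative simplification, not the kernel code: `treelog_bg` stands in for `fs_info->treelog_bg` (0 meaning "no dedicated block group chosen yet"), locking is omitted, and the names and block-group addresses are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for fs_info->treelog_bg: start address of the dedicated
 * tree-log block group, or 0 if none has been chosen yet. */
static uint64_t treelog_bg;

/* Mirror of the skip test added to find_free_extent(): once a dedicated
 * tree-log block group exists, tree-log allocations must land inside it
 * and every other metadata allocation must stay out of it. */
static bool should_skip_bg(uint64_t bytenr, bool for_treelog)
{
	uint64_t log_bytenr = treelog_bg;

	return log_bytenr &&
	       ((for_treelog && bytenr != log_bytenr) ||
		(!for_treelog && bytenr == log_bytenr));
}
```

Before a dedicated block group exists, nothing is skipped; afterwards, the two write streams can no longer interleave inside one block group, which is what keeps each stream sequential on a zoned device.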
Signed-off-by: Naohiro Aota
---
 fs/btrfs/block-group.c |  7 +++++
 fs/btrfs/ctree.h       |  2 ++
 fs/btrfs/disk-io.c     |  8 ++---
 fs/btrfs/extent-tree.c | 71 +++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/tree-log.c    | 49 ++++++++++++++++++++++++-----
 5 files changed, 116 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 6f7d29171adf..93e6c617d68e 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -910,6 +910,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	btrfs_return_cluster_to_free_space(block_group, cluster);
 	spin_unlock(&cluster->refill_lock);

+	if (btrfs_fs_incompat(fs_info, HMZONED)) {
+		spin_lock(&fs_info->treelog_bg_lock);
+		if (fs_info->treelog_bg == block_group->start)
+			fs_info->treelog_bg = 0;
+		spin_unlock(&fs_info->treelog_bg_lock);
+	}
+
 	path = btrfs_alloc_path();
 	if (!path) {
 		ret = -ENOMEM;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 18d2d0581e68..cba8a169002c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -907,6 +907,8 @@ struct btrfs_fs_info {
 #endif

 	struct mutex hmzoned_meta_io_lock;
+	spinlock_t treelog_bg_lock;
+	u64 treelog_bg;
 };

 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 914c517d26b0..9c2b2fbf0cdb 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1360,16 +1360,10 @@ int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *log_root;
-	int ret;

 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
-
-	ret = btrfs_alloc_log_tree_node(trans, log_root);
-	if (ret) {
-		kfree(log_root);
-		return ret;
-	}
 	WARN_ON(fs_info->log_root_tree);
 	fs_info->log_root_tree = log_root;
 	return 0;
@@ -2841,6 +2835,8 @@ int __cold open_ctree(struct super_block *sb,

 	fs_info->send_in_progress = 0;

+	spin_lock_init(&fs_info->treelog_bg_lock);
+
 	ret = btrfs_alloc_stripe_hash_table(fs_info);
 	if (ret) {
 		err = ret;
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 69c4ce8ec83e..9b9608097f7f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3704,8 +3704,10 @@ static int find_free_extent_unclustered(struct btrfs_block_group *bg,
  */
 static int find_free_extent_zoned(struct btrfs_block_group *cache,
-				  struct find_free_extent_ctl *ffe_ctl)
+				  struct find_free_extent_ctl *ffe_ctl,
+				  bool for_treelog)
 {
+	struct btrfs_fs_info *fs_info = cache->fs_info;
 	struct btrfs_space_info *space_info = cache->space_info;
 	struct btrfs_free_space_ctl *ctl = cache->free_space_ctl;
 	u64 start = cache->start;
@@ -3718,12 +3720,26 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
 	btrfs_hmzoned_data_io_lock(cache);
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
+	spin_lock(&fs_info->treelog_bg_lock);
+
+	ASSERT(!for_treelog || cache->start == fs_info->treelog_bg ||
+	       fs_info->treelog_bg == 0);

 	if (cache->ro) {
 		ret = -EAGAIN;
 		goto out;
 	}

+	/*
+	 * Do not allow currently using block group to be tree-log
+	 * dedicated block group.
+	 */
+	if (for_treelog && !fs_info->treelog_bg &&
+	    (cache->used || cache->reserved)) {
+		ret = 1;
+		goto out;
+	}
+
 	avail = cache->length - cache->alloc_offset;
 	if (avail < num_bytes) {
 		ffe_ctl->max_extent_size = avail;
@@ -3731,6 +3747,9 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
 		goto out;
 	}

+	if (for_treelog && !fs_info->treelog_bg)
+		fs_info->treelog_bg = cache->start;
+
 	ffe_ctl->found_offset = start + cache->alloc_offset;
 	cache->alloc_offset += num_bytes;
 	spin_lock(&ctl->tree_lock);
@@ -3738,12 +3757,15 @@ static int find_free_extent_zoned(struct btrfs_block_group *cache,
 	spin_unlock(&ctl->tree_lock);

 	ASSERT(IS_ALIGNED(ffe_ctl->found_offset,
-			  cache->fs_info->stripesize));
+			  fs_info->stripesize));
 	ffe_ctl->search_start = ffe_ctl->found_offset;
 	__btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes,
 				   ffe_ctl->delalloc);

 out:
+	if (ret && for_treelog)
+		fs_info->treelog_bg = 0;
+	spin_unlock(&fs_info->treelog_bg_lock);
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
 	/* if succeeds, unlock after submit_bio */
@@ -3891,7 +3913,7 @@ static int find_free_extent_update_loop(struct btrfs_fs_info *fs_info,
 static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 				u64 ram_bytes, u64 num_bytes, u64 empty_size,
 				u64 hint_byte, struct btrfs_key *ins,
-				u64 flags, int delalloc)
+				u64 flags, int delalloc, bool for_treelog)
 {
 	int ret = 0;
 	struct btrfs_free_cluster *last_ptr = NULL;
@@ -3970,6 +3992,13 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		spin_unlock(&last_ptr->lock);
 	}

+	if (hmzoned && for_treelog) {
+		spin_lock(&fs_info->treelog_bg_lock);
+		if (fs_info->treelog_bg)
+			hint_byte = fs_info->treelog_bg;
+		spin_unlock(&fs_info->treelog_bg_lock);
+	}
+
 	ffe_ctl.search_start = max(ffe_ctl.search_start,
 				   first_logical_byte(fs_info, 0));
 	ffe_ctl.search_start = max(ffe_ctl.search_start, hint_byte);
@@ -4015,8 +4044,15 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		list_for_each_entry(block_group,
 				    &space_info->block_groups[ffe_ctl.index], list) {
 			/* If the block group is read-only, we can skip it entirely. */
-			if (unlikely(block_group->ro))
+			if (unlikely(block_group->ro)) {
+				if (hmzoned && for_treelog) {
+					spin_lock(&fs_info->treelog_bg_lock);
+					if (block_group->start == fs_info->treelog_bg)
+						fs_info->treelog_bg = 0;
+					spin_unlock(&fs_info->treelog_bg_lock);
+				}
 				continue;
+			}

 			btrfs_grab_block_group(block_group, delalloc);
 			ffe_ctl.search_start = block_group->start;
@@ -4062,7 +4098,25 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 			goto loop;

 		if (hmzoned) {
-			ret = find_free_extent_zoned(block_group, &ffe_ctl);
+			u64 bytenr = block_group->start;
+			u64 log_bytenr;
+			bool skip;
+
+			/*
+			 * Do not allow non-tree-log blocks in the
+			 * dedicated tree-log block group, and vice versa.
+			 */
+			spin_lock(&fs_info->treelog_bg_lock);
+			log_bytenr = fs_info->treelog_bg;
+			skip = log_bytenr &&
+				((for_treelog && bytenr != log_bytenr) ||
+				 (!for_treelog && bytenr == log_bytenr));
+			spin_unlock(&fs_info->treelog_bg_lock);
+			if (skip)
+				goto loop;
+
+			ret = find_free_extent_zoned(block_group, &ffe_ctl,
+						     for_treelog);
 			if (ret)
 				goto loop;
 			/*
@@ -4222,12 +4276,13 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,
 	bool final_tried = num_bytes == min_alloc_size;
 	u64 flags;
 	int ret;
+	bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID;

 	flags = get_alloc_profile_by_root(root, is_data);
again:
 	WARN_ON(num_bytes < fs_info->sectorsize);
 	ret = find_free_extent(fs_info, ram_bytes, num_bytes, empty_size,
-			       hint_byte, ins, flags, delalloc);
+			       hint_byte, ins, flags, delalloc, for_treelog);
 	if (!ret && !is_data) {
 		btrfs_dec_block_group_reservations(fs_info, ins->objectid);
 	} else if (ret == -ENOSPC) {
@@ -4245,8 +4300,8 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes,

 			sinfo = btrfs_find_space_info(fs_info, flags);
 			btrfs_err(fs_info,
-				  "allocation failed flags %llu, wanted %llu",
-				  flags, num_bytes);
+				  "allocation failed flags %llu, wanted %llu treelog %d",
+				  flags, num_bytes, for_treelog);
 			if (sinfo)
 				btrfs_dump_space_info(fs_info, sinfo,
 						      num_bytes, 1);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 6f757361db53..e155418f24ba 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -18,6 +18,7 @@
 #include "compression.h"
 #include "qgroup.h"
 #include "inode-map.h"
+#include "hmzoned.h"

 /* magic values for the inode_only field in btrfs_log_inode:
  *
@@ -105,6 +106,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *log,
 				       struct btrfs_path *path,
 				       u64 dirid, int del_all);
+static void wait_log_commit(struct btrfs_root *root, int transid);

 /*
  * tree logging is a special write ahead log used to make sure that
@@ -139,16 +141,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 			   struct btrfs_log_ctx *ctx)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
+	bool hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
 	int ret = 0;

 	mutex_lock(&root->log_mutex);

+again:
 	if (root->log_root) {
+		int index = (root->log_transid + 1) % 2;
+
 		if (btrfs_need_log_full_commit(trans)) {
 			ret = -EAGAIN;
 			goto out;
 		}

+		if (hmzoned && atomic_read(&root->log_commit[index])) {
+			wait_log_commit(root, root->log_transid - 1);
+			goto again;
+		}
+
 		if (!root->log_start_pid) {
 			clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state);
 			root->log_start_pid = current->pid;
@@ -157,8 +168,13 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
 		}
 	} else {
 		mutex_lock(&fs_info->tree_log_mutex);
-		if (!fs_info->log_root_tree)
+		if (hmzoned && fs_info->log_root_tree) {
+			ret = -EAGAIN;
+			mutex_unlock(&fs_info->tree_log_mutex);
+			goto out;
+		} else if (!fs_info->log_root_tree) {
 			ret = btrfs_init_log_root_tree(trans, fs_info);
+		}
 		mutex_unlock(&fs_info->tree_log_mutex);
 		if (ret)
 			goto out;
@@ -191,11 +207,19 @@ static int start_log_trans(struct btrfs_trans_handle *trans,
  */
 static int join_running_log_trans(struct btrfs_root *root)
 {
+	bool hmzoned = btrfs_fs_incompat(root->fs_info, HMZONED);
 	int ret = -ENOENT;

 	mutex_lock(&root->log_mutex);
+again:
 	if (root->log_root) {
+		int index = (root->log_transid + 1) % 2;
+
 		ret = 0;
+		if (hmzoned && atomic_read(&root->log_commit[index])) {
+			wait_log_commit(root, root->log_transid - 1);
+			goto again;
+		}
 		atomic_inc(&root->log_writers);
 	}
 	mutex_unlock(&root->log_mutex);
@@ -2724,6 +2748,8 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans,
 				btrfs_clean_tree_block(next);
 				btrfs_wait_tree_block_writeback(next);
 				btrfs_tree_unlock(next);
+				btrfs_redirty_list_add(
+					trans->transaction, next);
 			} else {
 				if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &next->bflags))
 					clear_extent_buffer_dirty(next);
@@ -3128,6 +3154,11 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,

 	mutex_lock(&log_root_tree->log_mutex);

+	mutex_lock(&fs_info->tree_log_mutex);
+	if (!log_root_tree->node)
+		btrfs_alloc_log_tree_node(trans, log_root_tree);
+	mutex_unlock(&fs_info->tree_log_mutex);
+
 	/*
 	 * Now we are safe to update the log_root_tree because we're under the
 	 * log_mutex, and we're a current writer so we're holding the commit
@@ -3285,16 +3316,20 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 		.process_func = process_one_buffer
 	};

-	ret = walk_log_tree(trans, log, &wc);
-	if (ret) {
-		if (trans)
-			btrfs_abort_transaction(trans, ret);
-		else
-			btrfs_handle_fs_error(log->fs_info, ret, NULL);
+	if (log->node) {
+		ret = walk_log_tree(trans, log, &wc);
+		if (ret) {
+			if (trans)
+				btrfs_abort_transaction(trans, ret);
+			else
+				btrfs_handle_fs_error(log->fs_info, ret, NULL);
+		}
 	}

 	clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,
 			  EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT);
+	if (trans && log->node)
+		btrfs_redirty_list_add(trans->transaction, log->node);
 	free_extent_buffer(log->node);
 	kfree(log);
 }

From patchwork Fri Dec 13 04:09:15 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11289901
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v6 28/28] btrfs: enable to mount HMZONED incompat flag
Date: Fri, 13 Dec 2019 13:09:15 +0900
Message-Id: <20191213040915.3502922-29-naohiro.aota@wdc.com>
In-Reply-To: <20191213040915.3502922-1-naohiro.aota@wdc.com>
References: <20191213040915.3502922-1-naohiro.aota@wdc.com>

This final patch adds the HMZONED incompat flag to BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount an HMZONED-flagged file system.
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/ctree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index cba8a169002c..79c8695ba4b4 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -293,7 +293,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
-	 BTRFS_FEATURE_INCOMPAT_RAID1C34)
+	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
+	 BTRFS_FEATURE_INCOMPAT_HMZONED)

 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
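The effect of extending BTRFS_FEATURE_INCOMPAT_SUPP is that the mount-time gate now accepts superblocks carrying the HMZONED bit. A minimal standalone model of that gate is sketched below; the flag values and function name are illustrative stand-ins, not the on-disk btrfs constants.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative incompat flag bits (NOT the real on-disk values). */
#define FEATURE_RAID1C34 (1ULL << 0)
#define FEATURE_HMZONED  (1ULL << 1)

/* A file system is mountable only if every incompat flag set in its
 * superblock is also present in the kernel's supported mask; an unknown
 * incompat bit means the kernel cannot safely interpret the on-disk data. */
static int can_mount(uint64_t sb_incompat_flags, uint64_t supported_mask)
{
	return (sb_incompat_flags & ~supported_mask) == 0;
}
```

Adding FEATURE_HMZONED to the supported mask is exactly what flips the HMZONED case from "refused" to "accepted", which is why this one-line change is the final patch of the series.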