From patchwork Fri Aug 23 10:10:10 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111313
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v4 01/27] btrfs: introduce HMZONED feature flag
Date: Fri, 23 Aug 2019 19:10:10 +0900
Message-Id: <20190823101036.796932-2-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>
References: <20190823101036.796932-1-naohiro.aota@wdc.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org
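As a quick illustration of what the flag below means for userspace (a hypothetical Python sketch, not kernel code; the helper name `has_hmzoned()` is invented), the HMZONED incompat bit is simply bit 11 of the 64-bit incompat feature word, such as the `incompat_flags` field reported by the `BTRFS_IOC_GET_FEATURES` ioctl:

```python
# Hypothetical sketch: testing the HMZONED incompat bit in an incompat
# feature word. The constant mirrors the patch below (1ULL << 11); the
# helper name has_hmzoned() is invented for illustration only.
BTRFS_FEATURE_INCOMPAT_HMZONED = 1 << 11

def has_hmzoned(incompat_flags: int) -> bool:
    """Return True if the HMZONED incompat bit is set in the flags word."""
    return bool(incompat_flags & BTRFS_FEATURE_INCOMPAT_HMZONED)

# A flags word with only HMZONED set, and one with only METADATA_UUID (bit 10).
assert has_hmzoned(1 << 11)
assert not has_hmzoned(1 << 10)
```

A kernel that does not know this bit will refuse to mount a filesystem that has it set, which is exactly the point of an incompat flag.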
This patch introduces the HMZONED incompat flag. The flag indicates that
the volume management will satisfy the constraints imposed by
host-managed zoned block devices.

Reviewed-by: Anand Jain
Signed-off-by: Damien Le Moal
Signed-off-by: Naohiro Aota
Reviewed-by: Johannes Thumshirn
---
 fs/btrfs/sysfs.c           | 2 ++
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index e6493b068294..ad708a9edd0b 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -193,6 +193,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
 BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
 BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
+BTRFS_FEAT_ATTR_INCOMPAT(hmzoned, HMZONED);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
@@ -207,6 +208,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(skinny_metadata),
 	BTRFS_FEAT_ATTR_PTR(no_holes),
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
+	BTRFS_FEAT_ATTR_PTR(hmzoned),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	NULL
 };

diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index c195896d478f..2d5e8f801135 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -270,6 +270,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
+#define BTRFS_FEATURE_INCOMPAT_HMZONED		(1ULL << 11)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;

From patchwork Fri Aug 23 10:10:11 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111317
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v4 02/27] btrfs: Get zone information of zoned block devices
Date: Fri, 23 Aug 2019 19:10:11 +0900
Message-Id: <20190823101036.796932-3-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>

If a zoned block device is found, get its zone information (number of
zones and zone size) using the new helper function
btrfs_get_dev_zone_info().
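The bookkeeping this patch introduces can be sketched in a few lines of Python (a toy model with invented names, not the kernel code): the device records its zone size as a power-of-two shift, and per-zone state lives in bitmaps indexed by `pos >> zone_size_shift`.

```python
# Toy model of the per-device zone bookkeeping (invented names; the real
# code keeps bitmaps in struct btrfs_zoned_device_info).
class ZoneInfo:
    def __init__(self, zone_size: int, nr_zones: int):
        # Zone sizes are powers of two, so a shift replaces a division.
        assert zone_size > 0 and zone_size & (zone_size - 1) == 0
        self.zone_size_shift = zone_size.bit_length() - 1
        self.seq_zones = [False] * nr_zones    # sequential write required?
        self.empty_zones = [True] * nr_zones   # nothing written yet?

    def zone_number(self, pos: int) -> int:
        return pos >> self.zone_size_shift

    def is_sequential(self, pos: int) -> bool:
        return self.seq_zones[self.zone_number(pos)]

# With 256 MiB zones, byte offset 0x10000000 falls in zone 1.
zi = ZoneInfo(zone_size=256 * 1024 * 1024, nr_zones=4)
zi.seq_zones[1] = True
assert zi.is_sequential(0x10000000)
assert not zi.is_sequential(0)
```

This is why the two bitmaps below make run-time zone report commands unnecessary on the allocation path: classifying a byte position costs one shift and one bit test.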
To avoid costly run-time zone report commands to test the device zone
type during block allocation, attach the seq_zones bitmap to the device
structure to indicate whether a zone is sequential or accepts random
writes. This patch also attaches the empty_zones bitmap to indicate
whether a zone is empty, and introduces the helper functions
btrfs_dev_is_sequential(), to test whether the zone storing a block is a
sequential write required zone, and btrfs_dev_is_empty_zone(), to test
whether the zone is an empty zone.

Signed-off-by: Damien Le Moal
Signed-off-by: Naohiro Aota
---
 fs/btrfs/Makefile  |   2 +-
 fs/btrfs/hmzoned.c | 159 +++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h |  79 ++++++++++++++++++++++
 fs/btrfs/volumes.c |  18 ++++-
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 259 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/hmzoned.c
 create mode 100644 fs/btrfs/hmzoned.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 76a843198bcb..8d93abb31074 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -11,7 +11,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
 	   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
 	   uuid-tree.o props.o free-space-tree.o tree-checker.o space-info.o \
-	   block-rsv.o delalloc-space.o
+	   block-rsv.o delalloc-space.o hmzoned.o
 
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
new file mode 100644
index 000000000000..23bf58d3d7bb
--- /dev/null
+++ b/fs/btrfs/hmzoned.c
@@ -0,0 +1,159 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *	Naohiro Aota
+ *	Damien Le Moal
+ */
+
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include "ctree.h"
+#include "volumes.h"
+#include "hmzoned.h"
+#include "rcu-string.h"
+
+/* Maximum number of zones to report per blkdev_report_zones() call */
+#define BTRFS_REPORT_NR_ZONES	4096
+
+static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
+			       struct blk_zone *zones,
+			       unsigned int *nr_zones, gfp_t gfp_mask)
+{
+	int ret;
+
+	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT,
+				  zones, nr_zones, gfp_mask);
+	if (ret != 0) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "get zone at %llu on %s failed %d", pos,
+				 rcu_str_deref(device->name), ret);
+		return ret;
+	}
+	if (!*nr_zones)
+		return -EIO;
+
+	return 0;
+}
+
+int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = NULL;
+	struct block_device *bdev = device->bdev;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	sector_t sector = 0;
+	struct blk_zone *zones = NULL;
+	unsigned int i, nreported = 0, nr_zones;
+	unsigned int zone_sectors;
+	int ret;
+
+	if (!bdev_is_zoned(bdev))
+		return 0;
+
+	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
+	if (!zone_info)
+		return -ENOMEM;
+
+	zone_sectors = bdev_zone_sectors(bdev);
+	ASSERT(is_power_of_2(zone_sectors));
+	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
+	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
+	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
+	if (nr_sectors & (bdev_zone_sectors(bdev) - 1))
+		zone_info->nr_zones++;
+
+	zone_info->seq_zones = kcalloc(BITS_TO_LONGS(zone_info->nr_zones),
+				       sizeof(*zone_info->seq_zones),
+				       GFP_KERNEL);
+	if (!zone_info->seq_zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	zone_info->empty_zones = kcalloc(BITS_TO_LONGS(zone_info->nr_zones),
+					 sizeof(*zone_info->empty_zones),
+					 GFP_KERNEL);
+	if (!zone_info->empty_zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	zones = kcalloc(BTRFS_REPORT_NR_ZONES,
+			sizeof(struct blk_zone),
+			GFP_KERNEL);
+	if (!zones)
+		return -ENOMEM;
+
+	/* Get zones type */
+	while (sector < nr_sectors) {
+		nr_zones = BTRFS_REPORT_NR_ZONES;
+		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT,
+					  zones, &nr_zones, GFP_KERNEL);
+		if (ret)
+			goto out;
+
+		for (i = 0; i < nr_zones; i++) {
+			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+				set_bit(nreported, zone_info->seq_zones);
+			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
+				set_bit(nreported, zone_info->empty_zones);
+			nreported++;
+		}
+		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
+	}
+
+	if (nreported != zone_info->nr_zones) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "inconsistent number of zones on %s (%u / %u)",
+				 rcu_str_deref(device->name), nreported,
+				 zone_info->nr_zones);
+		ret = -EIO;
+		goto out;
+	}
+
+	device->zone_info = zone_info;
+
+	btrfs_info_in_rcu(
+		device->fs_info,
+		"host-%s zoned block device %s, %u zones of %llu sectors",
+		bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
+		rcu_str_deref(device->name), zone_info->nr_zones,
+		zone_info->zone_size >> SECTOR_SHIFT);
+
+out:
+	kfree(zones);
+
+	if (ret) {
+		kfree(zone_info->seq_zones);
+		kfree(zone_info->empty_zones);
+		kfree(zone_info);
+	}
+
+	return ret;
+}
+
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return;
+
+	kfree(zone_info->seq_zones);
+	kfree(zone_info->empty_zones);
+	kfree(zone_info);
+	device->zone_info = NULL;
+}
+
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone, gfp_t gfp_mask)
+{
+	unsigned int nr_zones = 1;
+	int ret;
+
+	ret = btrfs_get_dev_zones(device, pos, zone, &nr_zones, gfp_mask);
+	if (ret != 0 || !nr_zones)
+		return ret ?
+			ret : -EIO;
+
+	return 0;
+}

diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
new file mode 100644
index 000000000000..ffc70842135e
--- /dev/null
+++ b/fs/btrfs/hmzoned.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2019 Western Digital Corporation or its affiliates.
+ * Authors:
+ *	Naohiro Aota
+ *	Damien Le Moal
+ */
+
+#ifndef BTRFS_HMZONED_H
+#define BTRFS_HMZONED_H
+
+struct btrfs_zoned_device_info {
+	/*
+	 * Number of zones, zone size and types of zones if bdev is a
+	 * zoned block device.
+	 */
+	u64 zone_size;
+	u8 zone_size_shift;
+	u32 nr_zones;
+	unsigned long *seq_zones;
+	unsigned long *empty_zones;
+};
+
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone, gfp_t gfp_mask);
+int btrfs_get_dev_zone_info(struct btrfs_device *device);
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+
+static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return false;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->seq_zones);
+}
+
+static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return true;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
+						u64 pos, bool set)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+	unsigned int zno;
+
+	if (!zone_info)
+		return;
+
+	zno = pos >> zone_info->zone_size_shift;
+	if (set)
+		set_bit(zno, zone_info->empty_zones);
+	else
+		clear_bit(zno, zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_zone_empty(struct btrfs_device *device,
+					    u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, true);
+}
+
+static inline void
+btrfs_dev_clear_zone_empty(struct btrfs_device *device,
+			   u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, false);
+}
+
+#endif

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a447d3ec48d5..a8c550562057 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -29,6 +29,7 @@
 #include "sysfs.h"
 #include "tree-checker.h"
 #include "space-info.h"
+#include "hmzoned.h"
 
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
@@ -342,6 +343,7 @@ void btrfs_free_device(struct btrfs_device *device)
 	rcu_string_free(device->name);
 	extent_io_tree_release(&device->alloc_state);
 	bio_put(device->flush_bio);
+	btrfs_destroy_dev_zone_info(device);
 	kfree(device);
 }
 
@@ -847,6 +849,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	device->mode = flags;
 
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret != 0)
+		goto error_brelse;
+
 	fs_devices->open_devices++;
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
 	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -2598,6 +2605,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	}
 	rcu_assign_pointer(device->name, name);
 
+	device->fs_info = fs_info;
+	device->bdev = bdev;
+
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error_free_device;
+
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
@@ -2614,8 +2629,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		fs_info->sectorsize);
 	device->disk_total_bytes = device->total_bytes;
 	device->commit_total_bytes = device->total_bytes;
-	device->fs_info = fs_info;
-	device->bdev = bdev;
 	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
 	device->mode =
		FMODE_EXCL;
@@ -2756,6 +2769,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		sb->s_flags |= SB_RDONLY;
 	if (trans)
 		btrfs_end_transaction(trans);
+	btrfs_destroy_dev_zone_info(device);
 error_free_device:
 	btrfs_free_device(device);
 error:

diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 7f6aa1816409..5da1f354db93 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -57,6 +57,8 @@ struct btrfs_io_geometry {
 #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
 
+struct btrfs_zoned_device_info;
+
 struct btrfs_device {
 	struct list_head dev_list; /* device_list_mutex */
 	struct list_head dev_alloc_list; /* chunk mutex */
@@ -77,6 +79,8 @@ struct btrfs_device {
 
 	struct block_device *bdev;
 
+	struct btrfs_zoned_device_info *zone_info;
+
 	/* the mode sent to blkdev_get */
 	fmode_t mode;

From patchwork Fri Aug 23 10:10:12 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111325
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v4 03/27] btrfs: Check and enable HMZONED mode
Date: Fri, 23 Aug 2019 19:10:12 +0900
Message-Id: <20190823101036.796932-4-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>

HMZONED mode cannot be used together with the RAID5/6 profile for now.
Introduce the function btrfs_check_hmzoned_mode() to check this. This
function will also check whether the HMZONED flag is enabled on the file
system and whether the file system consists of zoned devices with equal
zone sizes.

Additionally, since updates to the space cache are done in place, the
space cache cannot be located over sequential zones, and there is no
guarantee that the device will have enough conventional zones to store
this cache. Resolve this problem by completely disabling the space
cache. This does not introduce any problems with sequential block
groups: all the free space is located after the allocation pointer and
none before it, so there is no need for such a cache. For the same
reason, NODATACOW is also disabled. INODE_MAP_CACHE is disabled as well,
to avoid preallocation in the INODE_MAP_CACHE inode.
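The claim that a sequential block group needs no free-space cache can be sketched with a toy allocator (Python, invented names; not btrfs code): allocation only ever appends at a single pointer, so the free space is always the one contiguous run [alloc_ptr, end) and a single integer fully describes it.

```python
# Toy model of allocation in a sequential block group: append-only.
class SequentialBlockGroup:
    def __init__(self, start: int, length: int):
        self.start = start
        self.end = start + length
        self.alloc_ptr = start  # everything before is used, everything after is free

    def free_space(self) -> int:
        # One pointer fully describes free space -- no extent cache needed.
        return self.end - self.alloc_ptr

    def allocate(self, size: int) -> int:
        if size > self.free_space():
            raise MemoryError("block group full")
        addr = self.alloc_ptr
        self.alloc_ptr += size
        return addr

bg = SequentialBlockGroup(start=0, length=1024)
assert bg.allocate(100) == 0
assert bg.allocate(100) == 100
assert bg.free_space() == 824
```

An in-place cache like space_cache v1, by contrast, must rewrite its on-disk representation, which is exactly what a sequential write required zone forbids.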
In summary, HMZONED will disable:

| Disabled features | Reason                                              |
|-------------------+-----------------------------------------------------|
| RAID5/6           | 1) Non-full stripe write causes overwriting of the  |
|                   |    parity block                                     |
|                   | 2) Rebuilding a high-capacity volume (usually SMR)  |
|                   |    can lead to a higher failure rate                |
|-------------------+-----------------------------------------------------|
| space_cache (v1)  | In-place updating                                   |
| NODATACOW         | In-place updating                                   |
|-------------------+-----------------------------------------------------|
| tree-log          | Partial write out of metadata creates write holes   |
|-------------------+-----------------------------------------------------|
| fallocate         | Reserved extent will be a write hole                |
| INODE_MAP_CACHE   | Needs pre-allocation (and will be deprecated?)      |
|-------------------+-----------------------------------------------------|
| MIXED_BG          | Allocated metadata region will be write holes for   |
|                   |    data writes                                      |
| async checksum    | Not to mix up bios by multiple workers              |

Signed-off-by: Damien Le Moal
Signed-off-by: Naohiro Aota
---
 fs/btrfs/ctree.h       |  3 ++
 fs/btrfs/dev-replace.c |  8 +++++
 fs/btrfs/disk-io.c     |  8 +++++
 fs/btrfs/hmzoned.c     | 67 ++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     | 18 ++++++++++++
 fs/btrfs/super.c       |  1 +
 fs/btrfs/volumes.c     |  5 ++++
 7 files changed, 110 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 94660063a162..221259737703 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -712,6 +712,9 @@ struct btrfs_fs_info {
 	struct btrfs_root *uuid_root;
 	struct btrfs_root *free_space_root;
 
+	/* Zone size when in HMZONED mode */
+	u64 zone_size;
+
 	/* the log root tree is a directory of all the other log roots */
 	struct btrfs_root *log_root_tree;

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 6b2e9aa83ffa..2cc3ac4d101d 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -20,6 +20,7 @@
 #include "rcu-string.h"
 #include "dev-replace.h"
 #include "sysfs.h"
+#include "hmzoned.h"
 
 static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 				       int scrub_ret);
@@ -201,6 +202,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 		return PTR_ERR(bdev);
 	}
 
+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		btrfs_err(fs_info,
+			  "zone type of target device mismatch with the filesystem!");
+		ret = -EINVAL;
+		goto error;
+	}
+
 	sync_blockdev(bdev);
 
 	devices = &fs_info->fs_devices->devices;

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 97beb351a10c..3f5ea92f546c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -40,6 +40,7 @@
 #include "compression.h"
 #include "tree-checker.h"
 #include "ref-verify.h"
+#include "hmzoned.h"
 
 #define BTRFS_SUPER_FLAG_SUPP	(BTRFS_HEADER_FLAG_WRITTEN |\
 				 BTRFS_HEADER_FLAG_RELOC |\
@@ -3121,6 +3122,13 @@ int open_ctree(struct super_block *sb,
 
 	btrfs_free_extra_devids(fs_devices, 1);
 
+	ret = btrfs_check_hmzoned_mode(fs_info);
+	if (ret) {
+		btrfs_err(fs_info, "failed to init hmzoned mode: %d",
+			  ret);
+		goto fail_block_groups;
+	}
+
 	ret = btrfs_sysfs_add_fsid(fs_devices, NULL);
 	if (ret) {
 		btrfs_err(fs_info, "failed to init sysfs fsid interface: %d",

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 23bf58d3d7bb..ca58eee08a70 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -157,3 +157,70 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 
 	return 0;
 }
+
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
+	struct btrfs_device *device;
+	u64 hmzoned_devices = 0;
+	u64 nr_devices = 0;
+	u64 zone_size = 0;
+	int incompat_hmzoned = btrfs_fs_incompat(fs_info, HMZONED);
+	int ret = 0;
+
+	/* Count zoned devices */
+	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+		if (!device->bdev)
+			continue;
+		if (bdev_zoned_model(device->bdev) == BLK_ZONED_HM ||
+		    (bdev_zoned_model(device->bdev) == BLK_ZONED_HA &&
+		     incompat_hmzoned)) {
+			hmzoned_devices++;
+			if (!zone_size) {
+				zone_size = device->zone_info->zone_size;
+			} else if (device->zone_info->zone_size != zone_size) {
+				btrfs_err(fs_info,
+					  "Zoned block devices must have equal zone sizes");
+				ret = -EINVAL;
+				goto out;
+			}
+		}
+		nr_devices++;
+	}
+
+	if (!hmzoned_devices && incompat_hmzoned) {
+		/* No zoned block device found on HMZONED FS */
+		btrfs_err(fs_info, "HMZONED enabled file system should have zoned devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (!hmzoned_devices && !incompat_hmzoned)
+		goto out;
+
+	fs_info->zone_size = zone_size;
+
+	if (hmzoned_devices != nr_devices) {
+		btrfs_err(fs_info,
+			  "zoned devices cannot be mixed with regular devices");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/*
+	 * stripe_size is always aligned to BTRFS_STRIPE_LEN in
+	 * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size,
+	 * check the alignment here.
+	 */
+	if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) {
+		btrfs_err(fs_info,
+			  "zone size is not aligned to BTRFS_STRIPE_LEN");
+		ret = -EINVAL;
+		goto out;
+	}
+
+	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
+		   fs_info->zone_size);
+out:
+	return ret;
+}

diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index ffc70842135e..29cfdcabff2f 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -9,6 +9,8 @@
 #ifndef BTRFS_HMZONED_H
 #define BTRFS_HMZONED_H
 
+#include <linux/blkdev.h>
+
 struct btrfs_zoned_device_info {
 	/*
 	 * Number of zones, zone size and types of zones if bdev is a
@@ -25,6 +27,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 		       struct blk_zone *zone, gfp_t gfp_mask);
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
@@ -76,4 +79,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
 	btrfs_dev_set_empty_zone_bit(device, pos,
				     false);
 }
 
+static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
+						struct block_device *bdev)
+{
+	u64 zone_size;
+
+	if (btrfs_fs_incompat(fs_info, HMZONED)) {
+		zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT;
+		/* Do not allow non-zoned device */
+		return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size;
+	}
+
+	/* Do not allow Host Managed zoned device */
+	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
+}
+
 #endif

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 78de9d5d80c6..d7879a5a2536 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -43,6 +43,7 @@
 #include "free-space-cache.h"
 #include "backref.h"
 #include "space-info.h"
+#include "hmzoned.h"
 #include "tests/btrfs-tests.h"
 #include "qgroup.h"

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a8c550562057..ffa4de09666d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2572,6 +2572,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	if (IS_ERR(bdev))
 		return PTR_ERR(bdev);
 
+	if (!btrfs_check_device_zone_type(fs_info, bdev)) {
+		ret = -EINVAL;
+		goto error;
+	}
+
 	if (fs_devices->seeding) {
 		seeding_dev = 1;
 		down_write(&sb->s_umount);

From patchwork Fri Aug 23 10:10:13 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111321
vger.kernel.org via listexpand id S2404179AbfHWKLT (ORCPT ); Fri, 23 Aug 2019 06:11:19 -0400 Received: from esa3.hgst.iphmx.com ([216.71.153.141]:47768 "EHLO esa3.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2404176AbfHWKLS (ORCPT ); Fri, 23 Aug 2019 06:11:18 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1566555078; x=1598091078; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=4gCxc5DIg70UY+tbNqleaiUt+WjvhXHfvAnkm9bxdLg=; b=m/9kKJ9WT9lv8gaYhZdaDC+rh4bF/FF0Xp+KyKu0r+/1GwvBjJqjYVSA IaxdJUFLDgDAxSN4A+FUe+TiKFlhe25LWIvgUMM3jla0oXs8TUZQg6q+j pMUihZP8kACLLviwU0SDrUOqtccFxTD6yg5+gfBb8YLd/FSQivUrVlKu7 4B09uvZNVHAw9eQWichlsUjsz99JSx6rHhWuYRFZQ82b9UrxriqKNva2k h3JGvvIbeuviV6wTgPprl3/m5YYF1qKP2DhhK3iXUegImTxjAkvNKpDKU S8kiHZnIL9OCL+5M17X3heleuN/8qGcZKhS3ZhWq0jsZkD4nLhM/lZQU4 A==; IronPort-SDR: XnI+L8ZXDtVmtdeozyKEw/rEV/8Jz4nPpuxKuziou2Vzhs4FP0XKgvJb+1dJMJbXVxHx9aInq5 6Xe9sMiUPuOc2VbBO99tjxHVvb8AkiXH4jcCyznfhaJKh1MLa/em4DzSPiAeR0LC/UAdOznEMb uSReSPgeaFUReGckRZJKdPfcAPzqerwdVtH18ntoSAUWIZjjhRCSFSu75mNXKdRWnGfD4LjeOy +AtTMHALHkqxiSzW3WOGb1EXM7a8A2fHADyiPq5HP+xeKj8Wf4amrmggk1nFagPki1+kH+YSYK 88Q= X-IronPort-AV: E=Sophos;i="5.64,420,1559491200"; d="scan'208";a="121096235" Received: from uls-op-cesaip01.wdc.com (HELO uls-op-cesaep01.wdc.com) ([199.255.45.14]) by ob1.hgst.iphmx.com with ESMTP; 23 Aug 2019 18:11:18 +0800 IronPort-SDR: 7SR+ZERmZ90f65yXHRC8NbXYNnBo+VxOTQ4UztCmzOZlpuEN+P9+xEs2IWGiUn1mW1Q5G3a4KJ yXARMxiJpo/qXGxMZgzhOmBfG6/TcDZksOj0KzQBltqW45paHASLhBDVDxmo0FXIcQVzjTgMCo +e1UXTcW9vKBs1+6tjsiOMF33U/K7Fa8heHeJ8KN9InXSk0wzSllX+MURLD/00zoWtex58zIME 6PkI3vnO09h2xwUHELJL+oaAbmiiVWR6rkv0Sz2fVgaF0nCYEkNBvb7f2vHf306H5zC4VRBWOO 75XtW/UxKvp5M2d9xtRhzNXb Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep01.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 23 Aug 2019 03:08:36 -0700 IronPort-SDR: 
YH/N4wuDz3ZNlCSTMLlfpf9YD8Igbzjbb7p0F3cbE/FXzY4kP1bacMfULJHq8FrXhSm73ngoip Zlk6DaOA9Ye+5vP9DQSClBQw4T9EPqWAvKNmo+u1PNoiCWQ3d2cFKKZYDHalPrboMb/E6+hmxe i6Mx43Zid1qCXB2UL/TV4vyQAnKj1fz7DAnqI6aX6tIUfdwzOBd8KDeNFLweQ0a/SCIrVZZeqW YMXrADx4QhJv6obQFdcvtr2OrpPzQWGmzYg7ImISqLEkodvuK4rEsZ3NtRMiQRJUqGD8yBE6WQ /uA= WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com (HELO naota.fujisawa.hgst.com) ([10.149.53.115]) by uls-op-cesaip02.wdc.com with ESMTP; 23 Aug 2019 03:11:16 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Matias Bjorling , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v4 04/27] btrfs: disallow RAID5/6 in HMZONED mode Date: Fri, 23 Aug 2019 19:10:13 +0900 Message-Id: <20190823101036.796932-5-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com> References: <20190823101036.796932-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Supporting the RAID5/6 profile in HMZONED mode is not trivial. For example, non-full stripe writes will cause overwriting parity blocks. When we do a non-full stripe write, it writes to the parity block with the data at that moment. Then, another write to the stripes will try to overwrite the parity block with new parity value. However, sequential zones do not allow such parity overwriting. Furthermore, using RAID5/6 on SMR drives, which usually have a huge capacity, incur large overhead of rebuild. Such overhead can lead to higher to higher volume failure rate (e.g. additional drive failure during rebuild) because of the increased rebuild time. Thus, let's disable RAID5/6 profile in HMZONED mode for now. 
Signed-off-by: Naohiro Aota
Reviewed-by: Johannes Thumshirn
---
 fs/btrfs/hmzoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index ca58eee08a70..84b7b561840d 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -219,6 +219,13 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	/* RAID56 is not allowed */
+	if (btrfs_fs_incompat(fs_info, RAID56)) {
+		btrfs_err(fs_info, "HMZONED mode does not support RAID56");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
 		   fs_info->zone_size);
 out:

From patchwork Fri Aug 23 10:10:14 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111329
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik,
    Nikolay Borisov, Damien Le Moal, Matias Bjorling, Johannes Thumshirn,
    Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v4 05/27] btrfs: disallow space_cache in HMZONED mode
Date: Fri, 23 Aug 2019 19:10:14 +0900
Message-Id: <20190823101036.796932-6-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>
References: <20190823101036.796932-1-naohiro.aota@wdc.com>

Updates to the space cache are done in place, so the cache cannot be
located in sequential zones, and there is no guarantee that the device has
enough conventional zones to store it. Resolve this problem by completely
disabling the space cache.

This does not introduce any problem for sequential block groups: all the
free space is located after the allocation pointer and there is no free
space before it, so no such cache is needed.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/hmzoned.c | 18 ++++++++++++++++++
 fs/btrfs/hmzoned.h |  1 +
 fs/btrfs/super.c   | 10 ++++++++--
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 84b7b561840d..8f0b17eba4b3 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -231,3 +231,21 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
 out:
 	return ret;
 }
+
+int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
+{
+	if (!btrfs_fs_incompat(info, HMZONED))
+		return 0;
+
+	/*
+	 * SPACE CACHE writing is not CoWed. Disable that to avoid
+	 * write errors in sequential zones.
+	 */
+	if (btrfs_test_opt(info, SPACE_CACHE)) {
+		btrfs_err(info,
+			  "cannot enable disk space caching with HMZONED mode");
+		return -EINVAL;
+	}
+
+	return 0;
+}

diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 29cfdcabff2f..83579b2dc0a4 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -28,6 +28,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 int btrfs_get_dev_zone_info(struct btrfs_device *device);
 void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
 int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info);
+int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index d7879a5a2536..496d8b74f9a2 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -440,8 +440,12 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 	cache_gen = btrfs_super_cache_generation(info->super_copy);
 	if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
 		btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE);
-	else if (cache_gen)
-		btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	else if (cache_gen) {
+		if (btrfs_fs_incompat(info, HMZONED))
+			WARN_ON(1);
+		else
+			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
+	}
 
 	/*
 	 * Even the options are empty, we still need to do extra check
@@ -877,6 +881,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			ret = -EINVAL;
 	}
 
+	if (!ret)
+		ret = btrfs_check_mountopts_hmzoned(info);
 	if (!ret && btrfs_test_opt(info, SPACE_CACHE))
 		btrfs_info(info, "disk space caching is enabled");
 	if (!ret && btrfs_test_opt(info, FREE_SPACE_TREE))

From patchwork Fri Aug 23 10:10:15 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111333
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
    Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
    linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v4 06/27] btrfs: disallow NODATACOW in HMZONED mode
Date: Fri, 23 Aug 2019 19:10:15 +0900
Message-Id: <20190823101036.796932-7-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>
References: <20190823101036.796932-1-naohiro.aota@wdc.com>

NODATACOW implies overwriting file data in place on a device, which is
impossible in sequential write required zones. Disable NODATACOW both
globally (the mount option) and per file (the NODATACOW file attribute) by
masking out FS_NOCOW_FL.
Signed-off-by: Naohiro Aota
---
 fs/btrfs/hmzoned.c | 6 ++++++
 fs/btrfs/ioctl.c   | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 8f0b17eba4b3..edddf52d2c5e 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -247,5 +247,11 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 		return -EINVAL;
 	}
 
+	if (btrfs_test_opt(info, NODATACOW)) {
+		btrfs_err(info,
+			  "cannot enable nodatacow with HMZONED mode");
+		return -EINVAL;
+	}
+
 	return 0;
 }

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d0743ec1231d..06783c489023 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -93,6 +93,9 @@ static int btrfs_clone(struct inode *src, struct inode *inode,
 static unsigned int btrfs_mask_fsflags_for_type(struct inode *inode,
 						unsigned int flags)
 {
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED))
+		flags &= ~FS_NOCOW_FL;
+
 	if (S_ISDIR(inode->i_mode))
 		return flags;
 	else if (S_ISREG(inode->i_mode))

From patchwork Fri Aug 23 10:10:16 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111339
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
    Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
    linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v4 07/27] btrfs: disable tree-log in HMZONED mode
Date: Fri, 23 Aug 2019 19:10:16 +0900
Message-Id: <20190823101036.796932-8-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>
References: <20190823101036.796932-1-naohiro.aota@wdc.com>

Extent buffers for the tree-log tree are allocated scattered between the
extent buffers of other metadata trees, while btrfs_sync_log() writes out
only the tree-log buffers. This behavior breaks the sequential writing
rule, which is mandatory in sequential write required zones.

Until tree-log buffers can be allocated sequentially, tree-logging brings
little benefit in HMZONED mode. So, disable tree-log entirely in HMZONED
mode.
Signed-off-by: Naohiro Aota
---
 fs/btrfs/hmzoned.c | 6 ++++++
 fs/btrfs/super.c   | 4 ++++
 2 files changed, 10 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index edddf52d2c5e..4e4e727302d4 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -253,5 +253,11 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 		return -EINVAL;
 	}
 
+	if (!btrfs_test_opt(info, NOTREELOG)) {
+		btrfs_err(info,
+			  "cannot enable tree log with HMZONED mode");
+		return -EINVAL;
+	}
+
 	return 0;
 }

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 496d8b74f9a2..396238e099bc 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -447,6 +447,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
 			btrfs_set_opt(info->mount_opt, SPACE_CACHE);
 	}
 
+	if (btrfs_fs_incompat(info, HMZONED))
+		btrfs_set_and_info(info, NOTREELOG,
+				   "disabling tree log with HMZONED mode");
+
 	/*
 	 * Even the options are empty, we still need to do extra check
 	 * against new flags

From patchwork Fri Aug 23 10:10:17 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111343
0gDwBRTcMZH1pnFdQxeL64H2mOEBLtUUsB4b+89x+Z9sZq5u+ZWLpIdbWQChYI9TLrVethV0sn TCI= WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com (HELO naota.fujisawa.hgst.com) ([10.149.53.115]) by uls-op-cesaip02.wdc.com with ESMTP; 23 Aug 2019 03:11:25 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Matias Bjorling , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v4 08/27] btrfs: disable fallocate in HMZONED mode Date: Fri, 23 Aug 2019 19:10:17 +0900 Message-Id: <20190823101036.796932-9-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com> References: <20190823101036.796932-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org fallocate() is implemented by reserving actual extent instead of reservations. This can result in exposing the sequential write constraint of host-managed zoned block devices to the application, which would break the POSIX semantic for the fallocated file. To avoid this, report fallocate() as not supported when in HMZONED mode for now. In the future, we may be able to implement "in-memory" fallocate() in HMZONED mode by utilizing space_info->bytes_may_use or so. 
Signed-off-by: Naohiro Aota
---
 fs/btrfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 58a18ed11546..7474010a997d 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3023,6 +3023,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 	alloc_end = round_up(offset + len, blocksize);
 	cur_offset = alloc_start;
 
+	/* Do not allow fallocate in HMZONED mode */
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED))
+		return -EOPNOTSUPP;
+
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
 		     FALLOC_FL_ZERO_RANGE))

From patchwork Fri Aug 23 10:10:18 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111351
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
    Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
    linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v4 09/27] btrfs: align device extent allocation to zone boundary
Date: Fri, 23 Aug 2019 19:10:18 +0900
Message-Id: <20190823101036.796932-10-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>
References: <20190823101036.796932-1-naohiro.aota@wdc.com>

In HMZONED mode, align device extents to zone boundaries so that a zone
reset affects only its own device extent and does not change the state of
blocks in neighboring device extents. Also check that an allocated region
always covers only empty zones of the same type and does not overlap any
super block copy location.

This patch also adds a check in verify_one_dev_extent() that the device
extent is aligned to the zone boundary.
Signed-off-by: Naohiro Aota
---
 fs/btrfs/extent-tree.c |  6 ++++
 fs/btrfs/hmzoned.c     | 56 ++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     | 10 ++++++
 fs/btrfs/volumes.c     | 77 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 149 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8b7eb22d508a..1020469ca61b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7638,6 +7638,12 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
 		min_free = div64_u64(min_free, dev_min);
 	}
 
+	/* We cannot allocate size less than zone_size anyway */
+	if (index == BTRFS_RAID_DUP)
+		min_free = max_t(u64, min_free, 2 * fs_info->zone_size);
+	else
+		min_free = max_t(u64, min_free, fs_info->zone_size);
+
 	mutex_lock(&fs_info->chunk_mutex);
 	list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list) {
 		u64 dev_offset;

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 4e4e727302d4..94f4f67e0548 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -12,6 +12,7 @@
 #include "volumes.h"
 #include "hmzoned.h"
 #include "rcu-string.h"
+#include "disk-io.h"
 
 /* Maximum number of zones to report per blkdev_report_zones() call */
 #define BTRFS_REPORT_NR_ZONES 4096
@@ -261,3 +262,58 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 
 	return 0;
 }
+
+/*
+ * btrfs_check_allocatable_zones - check if a specified region is
+ * suitable for allocation
+ * @device:	the device to allocate a region on
+ * @pos:	the position of the region
+ * @num_bytes:	the size of the region
+ *
+ * On a non-ZONED device, anywhere is suitable for allocation. On a ZONED
+ * device, check that
+ * 1) the region is not on non-empty zones,
+ * 2) all zones in the region have the same zone type,
+ * 3) it does not contain a super block location, if the zones are
+ *    sequential.
+ */ +bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos, + u64 num_bytes) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + u64 nzones, begin, end; + u64 sb_pos; + u8 shift; + int i; + + if (!zinfo) + return true; + + shift = zinfo->zone_size_shift; + nzones = num_bytes >> shift; + begin = pos >> shift; + end = begin + nzones; + + ASSERT(IS_ALIGNED(pos, zinfo->zone_size)); + ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size)); + + if (end > zinfo->nr_zones) + return false; + + /* check if zones in the region are all empty */ + if (find_next_zero_bit(zinfo->empty_zones, end, begin) != end) + return false; + + if (btrfs_dev_is_sequential(device, pos)) { + for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) { + sb_pos = btrfs_sb_offset(i); + if (!(sb_pos + BTRFS_SUPER_INFO_SIZE <= pos || + pos + end <= sb_pos)) + return false; + } + + return find_next_zero_bit(zinfo->seq_zones, end, begin) == end; + } + + return find_next_bit(zinfo->seq_zones, end, begin) == end; +} diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index 83579b2dc0a4..396ece5f9410 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -29,6 +29,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device); void btrfs_destroy_dev_zone_info(struct btrfs_device *device); int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info); int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info); +bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos, + u64 num_bytes); static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) { @@ -95,4 +97,12 @@ static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info, return bdev_zoned_model(bdev) != BLK_ZONED_HM; } +static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos) +{ + if (!device->zone_info) + return pos; + + return ALIGN(pos, device->zone_info->zone_size); +} + #endif diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 
ffa4de09666d..16094fc68552 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1572,6 +1572,7 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes, u64 max_hole_size; u64 extent_end; u64 search_end = device->total_bytes; + u64 zone_size = 0; int ret; int slot; struct extent_buffer *l; @@ -1582,6 +1583,14 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes, * at an offset of at least 1MB. */ search_start = max_t(u64, search_start, SZ_1M); + /* + * For a zoned block device, skip the first zone of the device + * entirely. + */ + if (device->zone_info) + zone_size = device->zone_info->zone_size; + search_start = max_t(u64, search_start, zone_size); + search_start = btrfs_zone_align(device, search_start); path = btrfs_alloc_path(); if (!path) @@ -1646,12 +1655,21 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes, */ if (contains_pending_extent(device, &search_start, hole_size)) { + search_start = btrfs_zone_align(device, + search_start); if (key.offset >= search_start) hole_size = key.offset - search_start; else hole_size = 0; } + if (!btrfs_check_allocatable_zones(device, search_start, + num_bytes)) { + search_start += zone_size; + btrfs_release_path(path); + goto again; + } + if (hole_size > max_hole_size) { max_hole_start = search_start; max_hole_size = hole_size; @@ -1691,6 +1709,14 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes, hole_size = search_end - search_start; if (contains_pending_extent(device, &search_start, hole_size)) { + search_start = btrfs_zone_align(device, search_start); + btrfs_release_path(path); + goto again; + } + + if (!btrfs_check_allocatable_zones(device, search_start, + num_bytes)) { + search_start += zone_size; btrfs_release_path(path); goto again; } @@ -1708,6 +1734,7 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes, ret = 0; out: + ASSERT(zone_size == 0 || IS_ALIGNED(max_hole_start, zone_size)); 
btrfs_free_path(path); *start = max_hole_start; if (len) @@ -4951,6 +4978,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, int i; int j; int index; + int hmzoned = btrfs_fs_incompat(info, HMZONED); BUG_ON(!alloc_profile_is_valid(type, 0)); @@ -4991,10 +5019,25 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, BUG(); } + if (hmzoned) { + max_stripe_size = info->zone_size; + max_chunk_size = round_down(max_chunk_size, info->zone_size); + } + /* We don't want a chunk larger than 10% of writable space */ max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1), max_chunk_size); + if (hmzoned) { + int min_num_stripes = devs_min * dev_stripes; + int min_data_stripes = (min_num_stripes - nparity) / ncopies; + u64 min_chunk_size = min_data_stripes * info->zone_size; + + max_chunk_size = max(round_down(max_chunk_size, + info->zone_size), + min_chunk_size); + } + devices_info = kcalloc(fs_devices->rw_devices, sizeof(*devices_info), GFP_NOFS); if (!devices_info) @@ -5029,6 +5072,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, if (total_avail == 0) continue; + if (hmzoned && total_avail < max_stripe_size * dev_stripes) + continue; + ret = find_free_dev_extent(device, max_stripe_size * dev_stripes, &dev_offset, &max_avail); @@ -5047,6 +5093,9 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, continue; } + if (hmzoned && max_avail < max_stripe_size * dev_stripes) + continue; + if (ndevs == fs_devices->rw_devices) { WARN(1, "%s: found more than %llu devices\n", __func__, fs_devices->rw_devices); @@ -5065,6 +5114,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, sort(devices_info, ndevs, sizeof(struct btrfs_device_info), btrfs_cmp_device_info, NULL); +again: /* round down to number of usable stripes */ ndevs = round_down(ndevs, devs_increment); @@ -5103,6 +5153,17 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, * we try to reduce stripe_size. 
*/ if (stripe_size * data_stripes > max_chunk_size) { + if (hmzoned) { + /* + * stripe_size is fixed in HMZONED. Reduce ndevs + * instead. + */ + ASSERT(nparity == 0); + ndevs = div_u64(max_chunk_size * ncopies, + stripe_size * dev_stripes); + goto again; + } + /* * Reduce stripe_size, round it up to a 16MB boundary again and * then use it, unless it ends up being even bigger than the @@ -5116,6 +5177,8 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, /* align to BTRFS_STRIPE_LEN */ stripe_size = round_down(stripe_size, BTRFS_STRIPE_LEN); + ASSERT(!hmzoned || stripe_size == info->zone_size); + map = kmalloc(map_lookup_size(num_stripes), GFP_NOFS); if (!map) { ret = -ENOMEM; @@ -7742,6 +7805,20 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info, ret = -EUCLEAN; goto out; } + + if (dev->zone_info) { + u64 zone_size = dev->zone_info->zone_size; + + if (!IS_ALIGNED(physical_offset, zone_size) || + !IS_ALIGNED(physical_len, zone_size)) { + btrfs_err(fs_info, +"dev extent devid %llu physical offset %llu len %llu is not aligned to device zone", + devid, physical_offset, physical_len); + ret = -EUCLEAN; + goto out; + } + } + out: free_extent_map(em); return ret;

From patchwork Fri Aug 23 10:10:19 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111349
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Matias Bjorling , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v4 10/27] btrfs: do sequential extent allocation in HMZONED mode
Date: Fri, 23 Aug 2019 19:10:19 +0900
Message-Id: <20190823101036.796932-11-naohiro.aota@wdc.com>
X-Mailer: git-send-email 2.23.0
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>
References: <20190823101036.796932-1-naohiro.aota@wdc.com>
MIME-Version: 1.0

On HMZONED drives, writes must always be sequential and directed at a block group's zone write pointer position. Thus, block allocation in a block group must also be done sequentially, using an allocation pointer equal to the block group's zone write pointer plus the number of blocks allocated but not yet written. The sequential allocation function find_free_extent_seq() bypasses the checks in find_free_extent() and increases the reserved bytes counter by itself. A region allocated sequentially can never be reverted, since reverting it might race with other allocations and leave an allocation hole, which would break the sequential write rule. Furthermore, this commit introduces two new fields in struct btrfs_block_group_cache: "wp_broken" indicates that the write pointer is broken (e.g.
not synced on a RAID1 block group) and marks that block group read-only. "zone_unusable" keeps track of the size of regions that were once allocated and then freed in a block group. Such regions are never usable until the underlying zones are reset. This commit also introduces "bytes_zone_unusable" to track such unusable bytes in a space_info. Pinned bytes are always reclaimed to "bytes_zone_unusable"; they are not usable until the zones are reset first.

Signed-off-by: Naohiro Aota --- fs/btrfs/ctree.h | 25 ++++ fs/btrfs/extent-tree.c | 179 +++++++++++++++++++++++++--- fs/btrfs/free-space-cache.c | 35 ++++++ fs/btrfs/free-space-cache.h | 5 + fs/btrfs/hmzoned.c | 231 ++++++++++++++++++++++++++++++ fs/btrfs/hmzoned.h | 1 + fs/btrfs/space-info.c | 13 +- fs/btrfs/space-info.h | 4 +- fs/btrfs/sysfs.c | 2 + 9 files changed, 471 insertions(+), 24 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 221259737703..3b24ce49e84b 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -481,6 +481,20 @@ struct btrfs_full_stripe_locks_tree { struct mutex lock; }; +/* Block group allocation types */ +enum btrfs_alloc_type { + + /* Regular first fit allocation */ + BTRFS_ALLOC_FIT = 0, + + /* + * Sequential allocation: this is for HMZONED mode and + * will result in ignoring free space before a block + * group allocation offset. + */ + BTRFS_ALLOC_SEQ = 1, +}; + struct btrfs_block_group_cache { struct btrfs_key key; struct btrfs_block_group_item item; @@ -520,6 +534,7 @@ struct btrfs_block_group_cache { unsigned int iref:1; unsigned int has_caching_ctl:1; unsigned int removed:1; + unsigned int wp_broken:1; int disk_cache_state; @@ -593,6 +608,16 @@ struct btrfs_block_group_cache { /* Record locked full stripes for RAID5/6 block group */ struct btrfs_full_stripe_locks_tree full_stripe_locks_root; + + enum btrfs_alloc_type alloc_type; + u64 zone_unusable; + /* + * Allocation offset for the block group to implement + * sequential allocation. 
This is used only with HMZONED mode + * enabled and if the block group resides on a sequential + * zone. + */ + u64 alloc_offset; }; /* delayed seq elem */ diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 1020469ca61b..922592e82fb9 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -32,6 +32,8 @@ #include "space-info.h" #include "block-rsv.h" #include "delalloc-space.h" +#include "rcu-string.h" +#include "hmzoned.h" #undef SCRAMBLE_DELAYED_REFS @@ -544,6 +546,8 @@ static int cache_block_group(struct btrfs_block_group_cache *cache, struct btrfs_caching_control *caching_ctl; int ret = 0; + ASSERT(cache->alloc_type == BTRFS_ALLOC_FIT); + caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS); if (!caching_ctl) return -ENOMEM; @@ -4430,6 +4434,20 @@ void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg) wait_var_event(&bg->reservations, !atomic_read(&bg->reservations)); } +static void __btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache, + u64 ram_bytes, u64 num_bytes, + int delalloc) +{ + struct btrfs_space_info *space_info = cache->space_info; + + cache->reserved += num_bytes; + space_info->bytes_reserved += num_bytes; + btrfs_space_info_update_bytes_may_use(cache->fs_info, space_info, + -ram_bytes); + if (delalloc) + cache->delalloc_bytes += num_bytes; +} + /** * btrfs_add_reserved_bytes - update the block_group and space info counters * @cache: The cache we are manipulating @@ -4448,18 +4466,16 @@ static int btrfs_add_reserved_bytes(struct btrfs_block_group_cache *cache, struct btrfs_space_info *space_info = cache->space_info; int ret = 0; + /* should handled by find_free_extent_seq */ + ASSERT(cache->alloc_type != BTRFS_ALLOC_SEQ); + spin_lock(&space_info->lock); spin_lock(&cache->lock); - if (cache->ro) { + if (cache->ro) ret = -EAGAIN; - } else { - cache->reserved += num_bytes; - space_info->bytes_reserved += num_bytes; - btrfs_space_info_update_bytes_may_use(cache->fs_info, - space_info, 
-ram_bytes); - if (delalloc) - cache->delalloc_bytes += num_bytes; - } + else + __btrfs_add_reserved_bytes(cache, ram_bytes, num_bytes, + delalloc); spin_unlock(&cache->lock); spin_unlock(&space_info->lock); return ret; @@ -4577,9 +4593,13 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info, cache = btrfs_lookup_block_group(fs_info, start); BUG_ON(!cache); /* Logic error */ - cluster = fetch_cluster_info(fs_info, - cache->space_info, - &empty_cluster); + if (cache->alloc_type == BTRFS_ALLOC_FIT) + cluster = fetch_cluster_info(fs_info, + cache->space_info, + &empty_cluster); + else + cluster = NULL; + empty_cluster <<= 1; } @@ -4619,7 +4639,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info, space_info->max_extent_size = 0; percpu_counter_add_batch(&space_info->total_bytes_pinned, -len, BTRFS_TOTAL_BYTES_PINNED_BATCH); - if (cache->ro) { + if (cache->alloc_type == BTRFS_ALLOC_SEQ) { + /* need reset before reusing in ALLOC_SEQ BG */ + space_info->bytes_zone_unusable += len; + readonly = true; + } else if (cache->ro) { space_info->bytes_readonly += len; readonly = true; } @@ -5465,6 +5489,60 @@ static int find_free_extent_unclustered(struct btrfs_block_group_cache *bg, return 0; } +/* + * Simple allocator for sequential only block group. It only allows + * sequential allocation. No need to play with trees. This function + * also reserve the bytes as in btrfs_add_reserved_bytes. 
+ */ + +static int find_free_extent_seq(struct btrfs_block_group_cache *cache, + struct find_free_extent_ctl *ffe_ctl) +{ + struct btrfs_space_info *space_info = cache->space_info; + struct btrfs_free_space_ctl *ctl = cache->free_space_ctl; + u64 start = cache->key.objectid; + u64 num_bytes = ffe_ctl->num_bytes; + u64 avail; + int ret = 0; + + /* Sanity check */ + if (cache->alloc_type != BTRFS_ALLOC_SEQ) + return 1; + + spin_lock(&space_info->lock); + spin_lock(&cache->lock); + + if (cache->ro) { + ret = -EAGAIN; + goto out; + } + + spin_lock(&ctl->tree_lock); + avail = cache->key.offset - cache->alloc_offset; + if (avail < num_bytes) { + ffe_ctl->max_extent_size = avail; + spin_unlock(&ctl->tree_lock); + ret = 1; + goto out; + } + + ffe_ctl->found_offset = start + cache->alloc_offset; + cache->alloc_offset += num_bytes; + ctl->free_space -= num_bytes; + spin_unlock(&ctl->tree_lock); + + ASSERT(IS_ALIGNED(ffe_ctl->found_offset, + cache->fs_info->stripesize)); + ffe_ctl->search_start = ffe_ctl->found_offset; + __btrfs_add_reserved_bytes(cache, ffe_ctl->ram_bytes, num_bytes, + ffe_ctl->delalloc); + +out: + spin_unlock(&cache->lock); + spin_unlock(&space_info->lock); + return ret; +} + /* * Return >0 means caller needs to re-search for free extent * Return 0 means we have the needed free extent. @@ -5765,6 +5843,17 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, if (unlikely(block_group->cached == BTRFS_CACHE_ERROR)) goto loop; + if (block_group->alloc_type == BTRFS_ALLOC_SEQ) { + ret = find_free_extent_seq(block_group, &ffe_ctl); + if (ret) + goto loop; + /* + * find_free_space_seq should ensure that + * everything is OK and reserve the extent. 
+ */ + goto nocheck; + } + /* * Ok we want to try and use the cluster allocator, so * lets look there @@ -5820,6 +5909,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info, num_bytes); goto loop; } +nocheck: btrfs_inc_block_group_reservations(block_group); /* we are all good, lets return */ @@ -7371,7 +7461,8 @@ static int inc_block_group_ro(struct btrfs_block_group_cache *cache, int force) } num_bytes = cache->key.offset - cache->reserved - cache->pinned - - cache->bytes_super - btrfs_block_group_used(&cache->item); + cache->bytes_super - cache->zone_unusable - + btrfs_block_group_used(&cache->item); sinfo_used = btrfs_space_info_used(sinfo, true); if (sinfo_used + num_bytes + min_allocable_bytes <= @@ -7520,6 +7611,7 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache) if (!--cache->ro) { num_bytes = cache->key.offset - cache->reserved - cache->pinned - cache->bytes_super - + cache->zone_unusable - btrfs_block_group_used(&cache->item); sinfo->bytes_readonly -= num_bytes; list_del_init(&cache->ro_list); @@ -7981,6 +8073,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info, atomic_set(&cache->trimming, 0); mutex_init(&cache->free_space_lock); btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root); + cache->alloc_type = BTRFS_ALLOC_FIT; return cache; } @@ -8053,6 +8146,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) int need_clear = 0; u64 cache_gen; u64 feature; + u64 unusable = 0; int mixed; feature = btrfs_super_incompat_flags(info->super_copy); @@ -8122,6 +8216,14 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) key.objectid = found_key.objectid + found_key.offset; btrfs_release_path(path); + ret = btrfs_load_block_group_zone_info(cache); + if (ret) { + btrfs_err(info, "failed to load zone info of bg %llu", + cache->key.objectid); + btrfs_put_block_group(cache); + goto error; + } + /* * We need to exclude the super stripes now so that the space * info has super bytes accounted 
for, otherwise we'll think @@ -8158,6 +8260,31 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) free_excluded_extents(cache); } + if (cache->alloc_type == BTRFS_ALLOC_SEQ) { + u64 free; + + WARN_ON(cache->bytes_super != 0); + if (!cache->wp_broken) { + unusable = cache->alloc_offset - + btrfs_block_group_used(&cache->item); + free = cache->key.offset - cache->alloc_offset; + } else { + unusable = cache->key.offset - + btrfs_block_group_used(&cache->item); + free = 0; + } + /* we only need ->free_space in ALLOC_SEQ BGs */ + cache->last_byte_to_unpin = (u64)-1; + cache->cached = BTRFS_CACHE_FINISHED; + cache->free_space_ctl->free_space = free; + cache->zone_unusable = unusable; + /* + * Should not have any excluded extents. Just + * in case, though. + */ + free_excluded_extents(cache); + } + ret = btrfs_add_block_group_cache(info, cache); if (ret) { btrfs_remove_free_space_cache(cache); @@ -8168,7 +8295,8 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) trace_btrfs_add_block_group(info, cache, 0); btrfs_update_space_info(info, cache->flags, found_key.offset, btrfs_block_group_used(&cache->item), - cache->bytes_super, &space_info); + cache->bytes_super, unusable, + &space_info); cache->space_info = space_info; @@ -8181,6 +8309,9 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) ASSERT(list_empty(&cache->bg_list)); btrfs_mark_bg_unused(cache); } + + if (cache->wp_broken) + inc_block_group_ro(cache, 1); } list_for_each_entry_rcu(space_info, &info->space_info, list) { @@ -8273,6 +8404,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used, cache->last_byte_to_unpin = (u64)-1; cache->cached = BTRFS_CACHE_FINISHED; cache->needs_free_space = 1; + + ret = btrfs_load_block_group_zone_info(cache); + if (ret) { + btrfs_put_block_group(cache); + return ret; + } + ret = exclude_super_stripes(cache); if (ret) { /* @@ -8317,7 +8455,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used, */ 
trace_btrfs_add_block_group(fs_info, cache, 1); btrfs_update_space_info(fs_info, cache->flags, size, bytes_used, - cache->bytes_super, &cache->space_info); + cache->bytes_super, 0, &cache->space_info); btrfs_update_global_block_rsv(fs_info); link_block_group(cache); @@ -8567,12 +8705,17 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, WARN_ON(block_group->space_info->total_bytes < block_group->key.offset); WARN_ON(block_group->space_info->bytes_readonly - < block_group->key.offset); + < block_group->key.offset - block_group->zone_unusable); + WARN_ON(block_group->space_info->bytes_zone_unusable + < block_group->zone_unusable); WARN_ON(block_group->space_info->disk_total < block_group->key.offset * factor); } block_group->space_info->total_bytes -= block_group->key.offset; - block_group->space_info->bytes_readonly -= block_group->key.offset; + block_group->space_info->bytes_readonly -= + (block_group->key.offset - block_group->zone_unusable); + block_group->space_info->bytes_zone_unusable -= + block_group->zone_unusable; block_group->space_info->disk_total -= block_group->key.offset * factor; spin_unlock(&block_group->space_info->lock); diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index 062be9dde4c6..2aeb3620645c 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2326,8 +2326,11 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, u64 offset, u64 bytes) { struct btrfs_free_space *info; + struct btrfs_block_group_cache *block_group = ctl->private; int ret = 0; + ASSERT(!block_group || block_group->alloc_type != BTRFS_ALLOC_SEQ); + info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS); if (!info) return -ENOMEM; @@ -2376,6 +2379,30 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, return ret; } +int __btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group, + u64 bytenr, u64 size) +{ + struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl; + u64 offset 
= bytenr - block_group->key.objectid; + u64 to_free, to_unusable; + + spin_lock(&ctl->tree_lock); + if (block_group->wp_broken) + to_free = 0; + else if (offset >= block_group->alloc_offset) + to_free = size; + else if (offset + size <= block_group->alloc_offset) + to_free = 0; + else + to_free = offset + size - block_group->alloc_offset; + to_unusable = size - to_free; + + ctl->free_space += to_free; + block_group->zone_unusable += to_unusable; + spin_unlock(&ctl->tree_lock); + return 0; +} + int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group, u64 offset, u64 bytes) { @@ -2384,6 +2411,8 @@ int btrfs_remove_free_space(struct btrfs_block_group_cache *block_group, int ret; bool re_search = false; + ASSERT(block_group->alloc_type != BTRFS_ALLOC_SEQ); + spin_lock(&ctl->tree_lock); again: @@ -2619,6 +2648,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group_cache *block_group, u64 align_gap = 0; u64 align_gap_len = 0; + ASSERT(block_group->alloc_type != BTRFS_ALLOC_SEQ); + spin_lock(&ctl->tree_lock); entry = find_free_space(ctl, &offset, &bytes_search, block_group->full_stripe_len, max_extent_size); @@ -2738,6 +2769,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group_cache *block_group, struct rb_node *node; u64 ret = 0; + ASSERT(block_group->alloc_type != BTRFS_ALLOC_SEQ); + spin_lock(&cluster->lock); if (bytes > cluster->max_size) goto out; @@ -3384,6 +3417,8 @@ int btrfs_trim_block_group(struct btrfs_block_group_cache *block_group, { int ret; + ASSERT(block_group->alloc_type != BTRFS_ALLOC_SEQ); + *trimmed = 0; spin_lock(&block_group->lock); diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h index 8760acb55ffd..d30667784f73 100644 --- a/fs/btrfs/free-space-cache.h +++ b/fs/btrfs/free-space-cache.h @@ -73,10 +73,15 @@ void btrfs_init_free_space_ctl(struct btrfs_block_group_cache *block_group); int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, struct btrfs_free_space_ctl *ctl, u64 bytenr, u64 size); +int 
__btrfs_add_free_space_seq(struct btrfs_block_group_cache *block_group, + u64 bytenr, u64 size); static inline int btrfs_add_free_space(struct btrfs_block_group_cache *block_group, u64 bytenr, u64 size) { + if (block_group->alloc_type == BTRFS_ALLOC_SEQ) + return __btrfs_add_free_space_seq(block_group, bytenr, size); + return __btrfs_add_free_space(block_group->fs_info, block_group->free_space_ctl, bytenr, size); diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c index 94f4f67e0548..55c00410e2f1 100644 --- a/fs/btrfs/hmzoned.c +++ b/fs/btrfs/hmzoned.c @@ -17,6 +17,9 @@ /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 +/* Invalid allocation pointer value for missing devices */ +#define WP_MISSING_DEV ((u64)-1) + static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos, struct blk_zone *zones, unsigned int *nr_zones, gfp_t gfp_mask) @@ -317,3 +320,231 @@ bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos, return find_next_bit(zinfo->seq_zones, end, begin) == end; } + +int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct extent_map_tree *em_tree = &fs_info->mapping_tree; + struct extent_map *em; + struct map_lookup *map; + struct btrfs_device *device; + u64 logical = cache->key.objectid; + u64 length = cache->key.offset; + u64 physical = 0; + int ret, alloc_type; + int i, j; + u64 *alloc_offsets = NULL; + + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return 0; + + /* Sanity check */ + if (!IS_ALIGNED(length, fs_info->zone_size)) { + btrfs_err(fs_info, "unaligned block group at %llu + %llu", + logical, length); + return -EIO; + } + + /* Get the chunk mapping */ + read_lock(&em_tree->lock); + em = lookup_extent_mapping(em_tree, logical, length); + read_unlock(&em_tree->lock); + + if (!em) + return -EINVAL; + + map = em->map_lookup; + + /* + * Get the zone type: if the group is mapped to a 
non-sequential zone, + * there is no need for the allocation offset (fit allocation is OK). + */ + alloc_type = -1; + alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets), + GFP_NOFS); + if (!alloc_offsets) { + free_extent_map(em); + return -ENOMEM; + } + + for (i = 0; i < map->num_stripes; i++) { + bool is_sequential; + struct blk_zone zone; + + device = map->stripes[i].dev; + physical = map->stripes[i].physical; + + if (device->bdev == NULL) { + alloc_offsets[i] = WP_MISSING_DEV; + continue; + } + + is_sequential = btrfs_dev_is_sequential(device, physical); + if (alloc_type == -1) + alloc_type = is_sequential ? + BTRFS_ALLOC_SEQ : BTRFS_ALLOC_FIT; + + if ((is_sequential && alloc_type != BTRFS_ALLOC_SEQ) || + (!is_sequential && alloc_type == BTRFS_ALLOC_SEQ)) { + btrfs_err(fs_info, "found block group of mixed zone types"); + ret = -EIO; + goto out; + } + + if (!is_sequential) + continue; + + /* + * This zone will be used for allocation, so mark this + * zone non-empty. + */ + btrfs_dev_clear_zone_empty(device, physical); + + /* + * The group is mapped to a sequential zone. Get the zone write + * pointer to determine the allocation offset within the zone. 
+ */ + WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size)); + ret = btrfs_get_dev_zone(device, physical, &zone, GFP_NOFS); + if (ret == -EIO || ret == -EOPNOTSUPP) { + ret = 0; + alloc_offsets[i] = WP_MISSING_DEV; + continue; + } else if (ret) { + goto out; + } + + + switch (zone.cond) { + case BLK_ZONE_COND_OFFLINE: + case BLK_ZONE_COND_READONLY: + btrfs_err( + fs_info, "Offline/readonly zone %llu", + physical >> device->zone_info->zone_size_shift); + alloc_offsets[i] = WP_MISSING_DEV; + break; + case BLK_ZONE_COND_EMPTY: + alloc_offsets[i] = 0; + break; + case BLK_ZONE_COND_FULL: + alloc_offsets[i] = fs_info->zone_size; + break; + default: + /* Partially used zone */ + alloc_offsets[i] = + ((zone.wp - zone.start) << SECTOR_SHIFT); + break; + } + } + + if (alloc_type == BTRFS_ALLOC_FIT) + goto out; + + switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) { + case 0: /* single */ + case BTRFS_BLOCK_GROUP_DUP: + case BTRFS_BLOCK_GROUP_RAID1: + cache->alloc_offset = WP_MISSING_DEV; + for (i = 0; i < map->num_stripes; i++) { + if (alloc_offsets[i] == WP_MISSING_DEV) + continue; + if (cache->alloc_offset == WP_MISSING_DEV) + cache->alloc_offset = alloc_offsets[i]; + if (alloc_offsets[i] == cache->alloc_offset) + continue; + + btrfs_err(fs_info, + "write pointer mismatch: block group %llu", + logical); + cache->wp_broken = 1; + } + break; + case BTRFS_BLOCK_GROUP_RAID0: + cache->alloc_offset = 0; + for (i = 0; i < map->num_stripes; i++) { + if (alloc_offsets[i] == WP_MISSING_DEV) { + btrfs_err(fs_info, + "cannot recover write pointer: block group %llu", + logical); + cache->wp_broken = 1; + continue; + } + + if (alloc_offsets[0] < alloc_offsets[i]) { + btrfs_err(fs_info, + "write pointer mismatch: block group %llu", + logical); + cache->wp_broken = 1; + continue; + } + + cache->alloc_offset += alloc_offsets[i]; + } + break; + case BTRFS_BLOCK_GROUP_RAID10: + /* + * Pass1: check write pointer of RAID1 level: each pointer + * should be equal. 
+ */ + for (i = 0; i < map->num_stripes / map->sub_stripes; i++) { + int base = i * map->sub_stripes; + u64 offset = WP_MISSING_DEV; + + for (j = 0; j < map->sub_stripes; j++) { + if (alloc_offsets[base + j] == WP_MISSING_DEV) + continue; + if (offset == WP_MISSING_DEV) + offset = alloc_offsets[base + j]; + if (alloc_offsets[base + j] == offset) + continue; + + btrfs_err(fs_info, + "write pointer mismatch: block group %llu", + logical); + cache->wp_broken = 1; + } + for (j = 0; j < map->sub_stripes; j++) + alloc_offsets[base + j] = offset; + } + + /* Pass2: check write pointer of RAID0 level */ + cache->alloc_offset = 0; + for (i = 0; i < map->num_stripes / map->sub_stripes; i++) { + int base = i * map->sub_stripes; + + if (alloc_offsets[base] == WP_MISSING_DEV) { + btrfs_err(fs_info, + "cannot recover write pointer: block group %llu", + logical); + cache->wp_broken = 1; + continue; + } + + if (alloc_offsets[0] < alloc_offsets[base]) { + btrfs_err(fs_info, + "write pointer mismatch: block group %llu", + logical); + cache->wp_broken = 1; + continue; + } + + cache->alloc_offset += alloc_offsets[base]; + } + break; + case BTRFS_BLOCK_GROUP_RAID5: + case BTRFS_BLOCK_GROUP_RAID6: + /* RAID5/6 is not supported yet */ + default: + btrfs_err(fs_info, "Unsupported profile on HMZONED %llu", + map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK); + ret = -EINVAL; + goto out; + } + +out: + cache->alloc_type = alloc_type; + kfree(alloc_offsets); + free_extent_map(em); + + return ret; +} diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index 396ece5f9410..399d9e9543aa 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -31,6 +31,7 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info); int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info); bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos, u64 num_bytes); +int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache); static inline bool btrfs_dev_is_sequential(struct
btrfs_device *device, u64 pos) { diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c index ab7b9ec4c240..4c6457bd1b9c 100644 --- a/fs/btrfs/space-info.c +++ b/fs/btrfs/space-info.c @@ -15,6 +15,7 @@ u64 btrfs_space_info_used(struct btrfs_space_info *s_info, ASSERT(s_info); return s_info->bytes_used + s_info->bytes_reserved + s_info->bytes_pinned + s_info->bytes_readonly + + s_info->bytes_zone_unusable + (may_use_included ? s_info->bytes_may_use : 0); } @@ -133,7 +134,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info) void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, u64 total_bytes, u64 bytes_used, - u64 bytes_readonly, + u64 bytes_readonly, u64 bytes_zone_unusable, struct btrfs_space_info **space_info) { struct btrfs_space_info *found; @@ -149,6 +150,7 @@ void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, found->bytes_used += bytes_used; found->disk_used += bytes_used * factor; found->bytes_readonly += bytes_readonly; + found->bytes_zone_unusable += bytes_zone_unusable; if (total_bytes > 0) found->full = 0; btrfs_space_info_add_new_bytes(info, found, @@ -372,10 +374,10 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info, info->total_bytes - btrfs_space_info_used(info, true), info->full ? 
"" : "not "); btrfs_info(fs_info, - "space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu", + "space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu zone_unusable=%llu", info->total_bytes, info->bytes_used, info->bytes_pinned, info->bytes_reserved, info->bytes_may_use, - info->bytes_readonly); + info->bytes_readonly, info->bytes_zone_unusable); spin_unlock(&info->lock); DUMP_BLOCK_RSV(fs_info, global_block_rsv); @@ -392,10 +394,11 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info, list_for_each_entry(cache, &info->block_groups[index], list) { spin_lock(&cache->lock); btrfs_info(fs_info, - "block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %s", + "block group %llu has %llu bytes, %llu used %llu pinned %llu reserved zone_unusable %llu %s", cache->key.objectid, cache->key.offset, btrfs_block_group_used(&cache->item), cache->pinned, - cache->reserved, cache->ro ? "[readonly]" : ""); + cache->reserved, cache->zone_unusable, + cache->ro ? 
"[readonly]" : ""); btrfs_dump_free_space(cache, bytes); spin_unlock(&cache->lock); } diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h index c2b54b8e1a14..b3837b2c41e4 100644 --- a/fs/btrfs/space-info.h +++ b/fs/btrfs/space-info.h @@ -17,6 +17,8 @@ struct btrfs_space_info { u64 bytes_may_use; /* number of bytes that may be used for delalloc/allocations */ u64 bytes_readonly; /* total bytes that are read only */ + u64 bytes_zone_unusable; /* total bytes that are unusable until + resetting the device zone */ u64 max_extent_size; /* This will hold the maximum extent size of the space info if we had an ENOSPC in the @@ -115,7 +117,7 @@ void btrfs_space_info_add_old_bytes(struct btrfs_fs_info *fs_info, int btrfs_init_space_info(struct btrfs_fs_info *fs_info); void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, u64 total_bytes, u64 bytes_used, - u64 bytes_readonly, + u64 bytes_readonly, u64 bytes_zone_unusable, struct btrfs_space_info **space_info); struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info, u64 flags); diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c index ad708a9edd0b..37733ec8e437 100644 --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -349,6 +349,7 @@ SPACE_INFO_ATTR(bytes_pinned); SPACE_INFO_ATTR(bytes_reserved); SPACE_INFO_ATTR(bytes_may_use); SPACE_INFO_ATTR(bytes_readonly); +SPACE_INFO_ATTR(bytes_zone_unusable); SPACE_INFO_ATTR(disk_used); SPACE_INFO_ATTR(disk_total); BTRFS_ATTR(space_info, total_bytes_pinned, @@ -362,6 +363,7 @@ static struct attribute *space_info_attrs[] = { BTRFS_ATTR_PTR(space_info, bytes_reserved), BTRFS_ATTR_PTR(space_info, bytes_may_use), BTRFS_ATTR_PTR(space_info, bytes_readonly), + BTRFS_ATTR_PTR(space_info, bytes_zone_unusable), BTRFS_ATTR_PTR(space_info, disk_used), BTRFS_ATTR_PTR(space_info, disk_total), BTRFS_ATTR_PTR(space_info, total_bytes_pinned), From patchwork Fri Aug 23 10:10:20 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 
7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11111355 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7269913B1 for ; Fri, 23 Aug 2019 10:11:35 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id 510B6233FE for ; Fri, 23 Aug 2019 10:11:35 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="G3xbyuqo" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2404336AbfHWKLe (ORCPT ); Fri, 23 Aug 2019 06:11:34 -0400 Received: from esa3.hgst.iphmx.com ([216.71.153.141]:47768 "EHLO esa3.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2404176AbfHWKLd (ORCPT ); Fri, 23 Aug 2019 06:11:33 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1566555092; x=1598091092; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=FsE4gqf95mAO+bFZo5M2DPXEATDME6q89JF3cIOcjzg=; b=G3xbyuqoBdgVy5Th2iehySTz3zVTawqvQqWGEf/9t96yWYRx0uFWR6MN rIYzswVvtr2KBkuQihkMuXtuvq4orwkQosVxBohudh2xgPUUwcHG3saYk UH5q95SUwttYH5Ys7cvUqIUqi3eBYL/IKpIKtzvby8mVB5j1p7k5lgtOW Ig9mcrKbJMw3ry7Lm4sSgp4UvwgIKgp27GoaFoo12sc97+KUBhobnZ03Y Mnm6xGPoN3E/EFPKygOjVw420ERAy1nS02tN3y+BjnxGdT2xSPib1gl/4 bTxG9+bYWf3JNHXfHW6Pi3uuO4dXa1Ix/+gumbG+15D/GS+TzrjbwOPod g==; IronPort-SDR: govh+MOgX2RfwmIipjjnlnapfIoZzwqA2a9UuNx08HVu6LfcMHEilMxO9DpUyE+zb1yQOVQIo3 bVC5AE+I8FBckOEh89uo3vqH1vbEkqvpGQe6Cvj8ubWBnHXgDmfjfK6tPNPcOsQ885vqbP06we +XfHs9Cf1CSSV6suYvur9NjhMTvmeZydCqIkvcX7lq7H1ICLbbxHVteQ9XTcQ+Inlf6y4jqGWA KwvftVirjsNgIwRrUlimpbgX3sbcWvOBaLe7hoczwjvRTi6etkpu5t2OQ3NuxhV76hiwuFAWVu D0Y= X-IronPort-AV: E=Sophos;i="5.64,420,1559491200"; d="scan'208";a="121096247" Received: from 
uls-op-cesaip01.wdc.com (HELO uls-op-cesaep01.wdc.com) ([199.255.45.14]) by ob1.hgst.iphmx.com with ESMTP; 23 Aug 2019 18:11:32 +0800 IronPort-SDR: VvR/eEliK1r3wUuswzmX9KV6taNQJuSC3aiHYvz7wbv57RumRqEl0rUMyLPDpc5NnTAhqmrbUC viwqHGTBtWKR3i7abMO9Hyqd9Jknyg832LlrGGyUcPsaDeraEdRx0XTJ1emqPbcqsocWJznqHk 0HO7nqanafZ953G1e4C9kyylXobk/fryphB9P6GVKAt6x/NeUgu2wQXMAOuetL6no8KFmA+fPD HA6eWSWAJuq1WDsFgAm76xs9ZljYURHl+ensaBjkUPVTXiMEdgQl/GuMeNY0xzoOAqQupd5dCp 532aueDQ9/b/taNHvpb2PvW3 Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by uls-op-cesaep01.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 23 Aug 2019 03:08:50 -0700 IronPort-SDR: mPQ7Py/iXRg5sKVYvP0dKd4n5vggF1h370Bvrblq0kPaCtuw4ByCq2ROIPhKkyTwMKYkR8xI4c RbrqNaOg7Ymp98FosLuT1LxO4vAGXHCym4GmJrDvtwWcac8T8i23Tw0SzXLPePUPqDXeM76mpl EVFK/RVeaPCT3Hl/Xc7aOYy19wvt1078jgwwFN0Ws7ubEV/9Qy4n2oZxCg88bZRbzWDcfEWO+x iIyGi2PHtHIrY57imMGw4oMZUV3nYPuXh0rkWWZz3ibTezkCopfTyhaS+NGBOVTaB6TgBykwzW Q8Y= WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com (HELO naota.fujisawa.hgst.com) ([10.149.53.115]) by uls-op-cesaip02.wdc.com with ESMTP; 23 Aug 2019 03:11:31 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Matias Bjorling , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v4 11/27] btrfs: make unmirroed BGs readonly only if we have at least one writable BG Date: Fri, 23 Aug 2019 19:10:20 +0900 Message-Id: <20190823101036.796932-12-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com> References: <20190823101036.796932-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org If the btrfs volume has mirrored block groups, it unconditionally makes un-mirrored block groups read only. 
When we have mirrored block groups, but don't have writable block groups, this will drop all writable block groups. So, check if we have at least one writable mirrored block group before setting un-mirrored block groups read only. This change is necessary to handle e.g. xfstests btrfs/124 case. When we mount degraded RAID1 FS and write to it, and then re-mount with full device, the write pointers of corresponding zones of written BG differ. We mark such block group as "wp_broken" and make it read only. In this situation, we only have read only RAID1 BGs because of "wp_broken" and un-mirrored BGs are also marked read only, because we have RAID1 BGs. As a result, all the BGs are now read only, so that we cannot even start the rebalance to fix the situation. Signed-off-by: Naohiro Aota --- fs/btrfs/extent-tree.c | 25 +++++++++++++++++++++++++ 1 file changed, 25 insertions(+) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 922592e82fb9..0f845cfb2442 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -8134,6 +8134,27 @@ static int check_chunk_block_group_mappings(struct btrfs_fs_info *fs_info) return ret; } +/* + * have_mirrored_block_group - check if we have at least one writable + * mirrored Block Group + */ +static bool have_mirrored_block_group(struct btrfs_space_info *space_info) +{ + struct btrfs_block_group_cache *cache; + int i; + + for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) { + if (i == BTRFS_RAID_RAID0 || i == BTRFS_RAID_SINGLE) + continue; + list_for_each_entry(cache, &space_info->block_groups[i], + list) { + if (!cache->ro) + return true; + } + } + return false; +} + int btrfs_read_block_groups(struct btrfs_fs_info *info) { struct btrfs_path *path; @@ -8321,6 +8342,10 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info) BTRFS_BLOCK_GROUP_RAID56_MASK | BTRFS_BLOCK_GROUP_DUP))) continue; + + if (!have_mirrored_block_group(space_info)) + continue; + /* * avoid allocating from un-mirrored block group if there are * 
mirrored block groups.
From patchwork Fri Aug 23 10:10:21 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111359
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Subject: [PATCH v4 12/27] btrfs: ensure metadata space available on/after degraded mount in HMZONED
Date: Fri, 23 Aug 2019 19:10:21 +0900
Message-Id: <20190823101036.796932-13-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>
On or after a degraded mount, we might have no writable metadata block group due to broken write pointers. If we, for example, balance the FS before writing any data, alloc_tree_block_no_bg_flush() (called from insert_balance_item()) fails to allocate a tree block, due to global reservation failure. We can reproduce this situation with xfstests btrfs/124. While we can work around the failure by writing some data first, so that a new metadata block group gets allocated, that is a poor practice to rely on. This commit avoids such failures by ensuring that a read-write mounted volume has non-zero free metadata space. If the metadata space is empty, it forces allocation of a new metadata block group. Signed-off-by: Naohiro Aota --- fs/btrfs/disk-io.c | 9 +++++++++ fs/btrfs/hmzoned.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/hmzoned.h | 1 + 3 files changed, 55 insertions(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 3f5ea92f546c..b25cff8af3b7 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3285,6 +3285,15 @@ int open_ctree(struct super_block *sb, } } + ret = btrfs_hmzoned_check_metadata_space(fs_info); + if (ret) { + btrfs_warn(fs_info, "failed to allocate metadata space: %d", + ret); + btrfs_warn(fs_info, "try remount with readonly"); + close_ctree(fs_info); + return ret; + } + down_read(&fs_info->cleanup_work_sem); if ((ret = btrfs_orphan_cleanup(fs_info->fs_root)) || (ret = btrfs_orphan_cleanup(fs_info->tree_root))) { diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c index 55c00410e2f1..b5fd3e280b65 100644 --- a/fs/btrfs/hmzoned.c +++ b/fs/btrfs/hmzoned.c @@ -13,6 +13,8 @@ #include "hmzoned.h" #include "rcu-string.h" #include "disk-io.h" +#include "space-info.h" +#include "transaction.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 @@ -548,3 +550,46 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache)
return ret; } + +/* + * On/After degraded mount, we might have no writable metadata block + * group due to broken write pointers. If you e.g. balance the FS + * before writing any data, alloc_tree_block_no_bg_flush() (called + * from insert_balance_item()) fails to allocate a tree block for + * it. To avoid such situations, ensure we have some metadata BG here. + */ +int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info) +{ + struct btrfs_root *root = fs_info->extent_root; + struct btrfs_trans_handle *trans; + struct btrfs_space_info *info; + u64 left; + int ret; + + if (!btrfs_fs_incompat(fs_info, HMZONED)) + return 0; + + info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_METADATA); + spin_lock(&info->lock); + left = info->total_bytes - btrfs_space_info_used(info, true); + spin_unlock(&info->lock); + + if (left) + return 0; + + trans = btrfs_start_transaction(root, 0); + if (IS_ERR(trans)) + return PTR_ERR(trans); + + mutex_lock(&fs_info->chunk_mutex); + ret = btrfs_alloc_chunk(trans, btrfs_metadata_alloc_profile(fs_info)); + if (ret) { + mutex_unlock(&fs_info->chunk_mutex); + btrfs_abort_transaction(trans, ret); + btrfs_end_transaction(trans); + return ret; + } + mutex_unlock(&fs_info->chunk_mutex); + + return btrfs_commit_transaction(trans); +} diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index 399d9e9543aa..e95139d4c072 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -32,6 +32,7 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info); bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos, u64 num_bytes); int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache); +int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info); static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) {
From patchwork Fri Aug 23 10:10:22 2019
X-Patchwork-Submitter: Naohiro
Aota
X-Patchwork-Id: 11111367
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Subject: [PATCH v4 13/27] btrfs: reset zones of unused block groups
Date: Fri, 23 Aug 2019 19:10:22 +0900
Message-Id: <20190823101036.796932-14-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>

For an HMZONED volume, a block group maps to a zone of the device.
For deleted unused block groups, the zone of the block group can be reset to rewind the zone write pointer at the start of the zone. Signed-off-by: Naohiro Aota --- fs/btrfs/extent-tree.c | 27 +++++++++++++++++++-------- fs/btrfs/hmzoned.c | 21 +++++++++++++++++++++ fs/btrfs/hmzoned.h | 18 ++++++++++++++++++ 3 files changed, 58 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 0f845cfb2442..457252ac7782 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -1937,6 +1937,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, for (i = 0; i < bbio->num_stripes; i++, stripe++) { + struct btrfs_device *dev = stripe->dev; + u64 physical = stripe->physical; + u64 length = stripe->length; u64 bytes; struct request_queue *req_q; @@ -1944,19 +1947,23 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, ASSERT(btrfs_test_opt(fs_info, DEGRADED)); continue; } + req_q = bdev_get_queue(stripe->dev->bdev); - if (!blk_queue_discard(req_q)) + + /* zone reset in HMZONED mode */ + if (btrfs_can_zone_reset(dev, physical, length)) + ret = btrfs_reset_device_zone(dev, physical, + length, &bytes); + else if (blk_queue_discard(req_q)) + ret = btrfs_issue_discard(dev->bdev, physical, + length, &bytes); + else continue; - ret = btrfs_issue_discard(stripe->dev->bdev, - stripe->physical, - stripe->length, - &bytes); if (!ret) discarded_bytes += bytes; else if (ret != -EOPNOTSUPP) break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */ - /* * Just in case we get back EOPNOTSUPP for some reason, * just ignore the return value so we don't screw up @@ -8976,8 +8983,12 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) spin_unlock(&block_group->lock); spin_unlock(&space_info->lock); - /* DISCARD can flip during remount */ - trimming = btrfs_test_opt(fs_info, DISCARD); + /* + * DISCARD can flip during remount. 
In HMZONED mode, + * we need to reset sequential required zones. + */ + trimming = btrfs_test_opt(fs_info, DISCARD) || + btrfs_fs_incompat(fs_info, HMZONED); /* Implicit trim during transaction commit. */ if (trimming) diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c index b5fd3e280b65..3d7db7d480d4 100644 --- a/fs/btrfs/hmzoned.c +++ b/fs/btrfs/hmzoned.c @@ -593,3 +593,24 @@ int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info) return btrfs_commit_transaction(trans); } + +int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, + u64 length, u64 *bytes) +{ + int ret; + + ret = blkdev_reset_zones(device->bdev, + physical >> SECTOR_SHIFT, + length >> SECTOR_SHIFT, + GFP_NOFS); + if (!ret) { + *bytes = length; + while (length) { + set_bit(physical >> device->zone_info->zone_size_shift, + device->zone_info->empty_zones); + length -= device->zone_info->zone_size; + } + } + + return ret; +} diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index e95139d4c072..40b4151fc935 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -32,6 +32,8 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info); bool btrfs_check_allocatable_zones(struct btrfs_device *device, u64 pos, u64 num_bytes); int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache); +int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, + u64 length, u64 *bytes); int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info); static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -107,4 +109,20 @@ static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos) return ALIGN(pos, device->zone_info->zone_size); } +static inline bool btrfs_can_zone_reset(struct btrfs_device *device, + u64 physical, u64 length) +{ + u64 zone_size; + + if (!btrfs_dev_is_sequential(device, physical)) + return false; + + zone_size = device->zone_info->zone_size; + if (!IS_ALIGNED(physical, zone_size) || + 
!IS_ALIGNED(length, zone_size)) + return false; + + return true; +} + #endif
From patchwork Fri Aug 23 10:10:23 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111363
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Subject: [PATCH v4 14/27] btrfs: limit super block locations in HMZONED mode
Date: Fri, 23 Aug 2019 19:10:23 +0900
Message-Id: <20190823101036.796932-15-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>

When in HMZONED mode, make sure that device super blocks are located in randomly writable zones of zoned block devices. That is, do not write super blocks in sequential write required zones of host-managed zoned block devices, as updating them in place would not be possible. Signed-off-by: Damien Le Moal Signed-off-by: Naohiro Aota --- fs/btrfs/disk-io.c | 4 ++++ fs/btrfs/extent-tree.c | 8 ++++++++ fs/btrfs/hmzoned.h | 12 ++++++++++++ fs/btrfs/scrub.c | 3 +++ 4 files changed, 27 insertions(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index b25cff8af3b7..38a9830b4893 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3545,6 +3545,8 @@ static int write_dev_supers(struct btrfs_device *device, if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->commit_total_bytes) break; + if (!btrfs_check_super_location(device, bytenr)) + continue; btrfs_set_super_bytenr(sb, bytenr); @@ -3611,6 +3613,8 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->commit_total_bytes) break; + if (!btrfs_check_super_location(device, bytenr)) + continue; bh = __find_get_block(device->bdev, bytenr / BTRFS_BDEV_BLOCKSIZE, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 457252ac7782..ddf5c26b9f58 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -239,6 +239,14 @@ static int exclude_super_stripes(struct btrfs_block_group_cache *cache) if (logical[nr] + stripe_len <= cache->key.objectid) continue; + /* shouldn't have super stripes in sequential zones */ + if (cache->alloc_type == BTRFS_ALLOC_SEQ) { + btrfs_err(fs_info, + "sequential allocation bg %llu should not have super blocks", + cache->key.objectid); + return -EUCLEAN; + } + start = logical[nr]; if (start < cache->key.objectid) { start = cache->key.objectid; diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h index
40b4151fc935..9de26d6b8c4e 100644 --- a/fs/btrfs/hmzoned.h +++ b/fs/btrfs/hmzoned.h @@ -10,6 +10,7 @@ #define BTRFS_HMZONED_H #include +#include "volumes.h" struct btrfs_zoned_device_info { /* @@ -125,4 +126,15 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device, return true; } +static inline bool btrfs_check_super_location(struct btrfs_device *device, + u64 pos) +{ + /* + * On a non-zoned device, any address is OK. On a zoned + * device, non-SEQUENTIAL WRITE REQUIRED zones are capable. + */ + return device->zone_info == NULL || + !btrfs_dev_is_sequential(device, pos); +} + #endif diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 0c99cf9fb595..e15d846c700a 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -18,6 +18,7 @@ #include "check-integrity.h" #include "rcu-string.h" #include "raid56.h" +#include "hmzoned.h" /* * This is only the first step towards a full-features scrub. It reads all @@ -3732,6 +3733,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx, if (bytenr + BTRFS_SUPER_INFO_SIZE > scrub_dev->commit_total_bytes) break; + if (!btrfs_check_super_location(scrub_dev, bytenr)) + continue; ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr, scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i, From patchwork Fri Aug 23 10:10:24 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11111371 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E444C13B1 for ; Fri, 23 Aug 2019 10:11:42 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id B857521726 for ; Fri, 23 Aug 2019 10:11:42 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com 
From: Naohiro Aota
Subject: [PATCH v4 15/27] btrfs: redirty released extent buffers in sequential BGs
Date: Fri, 23 Aug 2019 19:10:24 +0900
Message-Id: <20190823101036.796932-16-naohiro.aota@wdc.com>

Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
node are not uselessly written out. On HMZONED drives, however, this
optimization blocks subsequent IOs: cancelling the write-out of the freed
blocks breaks the sequential write sequence expected by the device.

This patch introduces a per-transaction list of clean, unwritten extent
buffers that have been released. Btrfs redirties these buffers so that
btree_write_cache_pages() can send proper bios to the disk. It also
clears the entire content of each such extent buffer so as not to confuse
raw block scanners such as btrfsck. Since the cleared content makes
csum_dirty_buffer() complain about a bytenr mismatch, skip that check and
the checksumming for these buffers using the newly introduced buffer flag
EXTENT_BUFFER_NO_CHECK.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/disk-io.c     |  5 +++++
 fs/btrfs/extent-tree.c | 11 ++++++++++-
 fs/btrfs/extent_io.c   |  2 ++
 fs/btrfs/extent_io.h   |  2 ++
 fs/btrfs/hmzoned.c     | 34 ++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h     |  3 +++
 fs/btrfs/transaction.c | 10 ++++++++++
 fs/btrfs/transaction.h |  3 +++
 8 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 38a9830b4893..d36cdb1b1421 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -513,6 +513,9 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 	if (page != eb->pages[0])
 		return 0;
 
+	if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags))
+		return 0;
+
 	found_start = btrfs_header_bytenr(eb);
 	/*
 	 * Please do not consolidate these warnings into a single if.
@@ -4575,6 +4578,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans,
 	btrfs_destroy_pinned_extent(fs_info, fs_info->pinned_extents);
 
+	btrfs_free_redirty_list(cur_trans);
+
 	cur_trans->state = TRANS_STATE_COMPLETED;
 	wake_up(&cur_trans->commit_wait);
 }
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ddf5c26b9f58..c0d7cb95a8c9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5085,8 +5085,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 	if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
 		ret = check_ref_cleanup(trans, buf->start);
-		if (!ret)
+		if (!ret) {
+			btrfs_redirty_list_add(trans->transaction, buf);
 			goto out;
+		}
 	}
 
 	pin = 0;
@@ -5098,6 +5100,13 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
+	if (btrfs_fs_incompat(fs_info, HMZONED)) {
+		btrfs_redirty_list_add(trans->transaction, buf);
+		pin_down_extent(cache, buf->start, buf->len, 1);
+		btrfs_put_block_group(cache);
+		goto out;
+	}
+
 	WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
 
 	btrfs_add_free_space(cache, buf->start, buf->len);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index aea990473392..4e67b16c9f80 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -23,6 +23,7 @@
 #include "rcu-string.h"
 #include "backref.h"
 #include "disk-io.h"
+#include "hmzoned.h"
 
 static struct kmem_cache *extent_state_cache;
 static struct kmem_cache *extent_buffer_cache;
@@ -4863,6 +4864,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 	init_waitqueue_head(&eb->read_lock_wq);
 
 	btrfs_leak_debug_add(&eb->leak_list, &buffers);
+	INIT_LIST_HEAD(&eb->release_list);
 
 	spin_lock_init(&eb->refs_lock);
 	atomic_set(&eb->refs, 1);
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 401423b16976..c63b58438f90 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -58,6 +58,7 @@ enum {
 	EXTENT_BUFFER_IN_TREE,
 	/* write IO error */
 	EXTENT_BUFFER_WRITE_ERR,
+	EXTENT_BUFFER_NO_CHECK,
 };
 
 /* these are flags for __process_pages_contig */
@@ -186,6 +187,7 @@ struct extent_buffer {
 	 */
 	wait_queue_head_t read_lock_wq;
 	struct page *pages[INLINE_EXTENT_BUFFER_PAGES];
+	struct list_head release_list;
 #ifdef CONFIG_BTRFS_DEBUG
 	int spinning_writers;
 	atomic_t spinning_readers;
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 3d7db7d480d4..81d8037ae7f6 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -614,3 +614,37 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 
 	return ret;
 }
+
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb)
+{
+	struct btrfs_fs_info *fs_info = eb->fs_info;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED) ||
+	    btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN) ||
+	    !list_empty(&eb->release_list))
+		return;
+
+	set_extent_buffer_dirty(eb);
+	memzero_extent_buffer(eb, 0, eb->len);
+	set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
+
+	spin_lock(&trans->releasing_ebs_lock);
+	list_add_tail(&eb->release_list, &trans->releasing_ebs);
+	spin_unlock(&trans->releasing_ebs_lock);
+	atomic_inc(&eb->refs);
+}
+
+void btrfs_free_redirty_list(struct btrfs_transaction *trans)
+{
+	spin_lock(&trans->releasing_ebs_lock);
+	while (!list_empty(&trans->releasing_ebs)) {
+		struct extent_buffer *eb;
+
+		eb = list_first_entry(&trans->releasing_ebs,
+				      struct extent_buffer, release_list);
+		list_del_init(&eb->release_list);
+		free_extent_buffer(eb);
+	}
+	spin_unlock(&trans->releasing_ebs_lock);
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 9de26d6b8c4e..3a73c3c5e1da 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -36,6 +36,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache);
 int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
 			    u64 length, u64 *bytes);
 int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
+void btrfs_redirty_list_add(struct btrfs_transaction *trans,
+			    struct extent_buffer *eb);
+void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index e3adb714c04b..45bd7c25bebf 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -19,6 +19,7 @@
 #include "volumes.h"
 #include "dev-replace.h"
 #include "qgroup.h"
+#include "hmzoned.h"
 
 #define BTRFS_ROOT_TRANS_TAG 0
 
@@ -257,6 +258,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info,
 	spin_lock_init(&cur_trans->dirty_bgs_lock);
 	INIT_LIST_HEAD(&cur_trans->deleted_bgs);
 	spin_lock_init(&cur_trans->dropped_roots_lock);
+	INIT_LIST_HEAD(&cur_trans->releasing_ebs);
+	spin_lock_init(&cur_trans->releasing_ebs_lock);
 	list_add_tail(&cur_trans->list, &fs_info->trans_list);
 	extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
 			    IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);
@@ -2269,6 +2272,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
 		goto scrub_continue;
 	}
 
+	/*
+	 * At this point, we should have written all the tree blocks
+	 * allocated in this transaction. So it's now safe to free the
+	 * redirtied extent buffers.
+	 */
+	btrfs_free_redirty_list(cur_trans);
+
 	ret = write_all_supers(fs_info, 0);
 	/*
 	 * the super is written, we can safely allow the tree-loggers
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 2c5a6f6e5bb0..09329d2901b7 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -85,6 +85,9 @@ struct btrfs_transaction {
 	spinlock_t dropped_roots_lock;
 	struct btrfs_delayed_ref_root delayed_refs;
 	struct btrfs_fs_info *fs_info;
+
+	spinlock_t releasing_ebs_lock;
+	struct list_head releasing_ebs;
 };
 
 #define __TRANS_FREEZABLE	(1U << 0)
From: Naohiro Aota
Subject: [PATCH v4 16/27] btrfs: serialize data allocation and submit IOs
Date: Fri, 23 Aug 2019 19:10:25 +0900
Message-Id: <20190823101036.796932-17-naohiro.aota@wdc.com>

To preserve the sequential write pattern on the drives, we must serialize
allocation and submit_bio. This commit adds a per-block-group mutex,
"zone_io_lock", which find_free_extent_seq() takes. The lock is kept even
after returning from find_free_extent() and is released only once the IOs
corresponding to the allocation have been submitted.

Implementing such behavior under __extent_writepage_io() is almost
impossible because once pages are unlocked we cannot tell when the IO
submission for an allocated region has finished. Instead, this commit
adds run_delalloc_hmzoned() to write out non-compressed data IOs at once
using extent_write_locked_range(). After the write, we can call
btrfs_hmzoned_data_io_unlock_logical() to unlock the block group for new
allocations.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/ctree.h       |  1 +
 fs/btrfs/extent-tree.c |  5 +++++
 fs/btrfs/hmzoned.h     | 34 +++++++++++++++++++++++++++++++
 fs/btrfs/inode.c       | 45 ++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3b24ce49e84b..d4df9624cb04 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -618,6 +618,7 @@ struct btrfs_block_group_cache {
 	 * zone.
 	 */
 	u64 alloc_offset;
+	struct mutex zone_io_lock;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c0d7cb95a8c9..9f9c09e28b5b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5533,6 +5533,7 @@ static int find_free_extent_seq(struct btrfs_block_group_cache *cache,
 	if (cache->alloc_type != BTRFS_ALLOC_SEQ)
 		return 1;
 
+	btrfs_hmzoned_data_io_lock(cache);
 	spin_lock(&space_info->lock);
 	spin_lock(&cache->lock);
@@ -5564,6 +5565,9 @@ static int find_free_extent_seq(struct btrfs_block_group_cache *cache,
 out:
 	spin_unlock(&cache->lock);
 	spin_unlock(&space_info->lock);
+	/* if succeeds, unlock after submit_bio */
+	if (ret)
+		btrfs_hmzoned_data_io_unlock(cache);
 	return ret;
 }
@@ -8096,6 +8100,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info *fs_info,
 	btrfs_init_free_space_ctl(cache);
 	atomic_set(&cache->trimming, 0);
 	mutex_init(&cache->free_space_lock);
+	mutex_init(&cache->zone_io_lock);
 	btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
 	cache->alloc_type = BTRFS_ALLOC_FIT;
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index 3a73c3c5e1da..a8e7286708d4 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -39,6 +39,7 @@ int btrfs_hmzoned_check_metadata_space(struct btrfs_fs_info *fs_info);
 void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
+void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
@@ -140,4 +141,37 @@ static inline bool btrfs_check_super_location(struct btrfs_device *device,
 	       !btrfs_dev_is_sequential(device, pos);
 }
 
+static inline void btrfs_hmzoned_data_io_lock(
+	struct btrfs_block_group_cache *cache)
+{
+	/* No need to lock metadata BGs or non-sequential BGs */
+	if (!(cache->flags & BTRFS_BLOCK_GROUP_DATA) ||
+	    cache->alloc_type != BTRFS_ALLOC_SEQ)
+		return;
+	mutex_lock(&cache->zone_io_lock);
+}
+
+static inline void btrfs_hmzoned_data_io_unlock(
+	struct btrfs_block_group_cache *cache)
+{
+	if (!(cache->flags & BTRFS_BLOCK_GROUP_DATA) ||
+	    cache->alloc_type != BTRFS_ALLOC_SEQ)
+		return;
+	mutex_unlock(&cache->zone_io_lock);
+}
+
+static inline void btrfs_hmzoned_data_io_unlock_logical(
+	struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group_cache *cache;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	btrfs_hmzoned_data_io_unlock(cache);
+	btrfs_put_block_group(cache);
+}
+
 #endif
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ee582a36653d..d504200c9767 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -48,6 +48,7 @@
 #include "qgroup.h"
 #include "dedupe.h"
 #include "delalloc-space.h"
+#include "hmzoned.h"
 
 struct btrfs_iget_args {
 	struct btrfs_key *location;
@@ -1279,6 +1280,39 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
 	return 0;
 }
 
+static noinline int run_delalloc_hmzoned(struct inode *inode,
+					 struct page *locked_page, u64 start,
+					 u64 end, int *page_started,
+					 unsigned long *nr_written)
+{
+	struct extent_map *em;
+	u64 logical;
+	int ret;
+
+	ret = cow_file_range(inode, locked_page, start, end, end,
+			     page_started, nr_written, 0, NULL);
+	if (ret)
+		return ret;
+
+	if (*page_started)
+		return 0;
+
+	em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start,
+			      end - start + 1, 0);
+	ASSERT(em != NULL && em->block_start < EXTENT_MAP_LAST_BYTE);
+	logical = em->block_start;
+	free_extent_map(em);
+
+	__set_page_dirty_nobuffers(locked_page);
+	account_page_redirty(locked_page);
+	extent_write_locked_range(inode, start, end, WB_SYNC_ALL);
+	*page_started = 1;
+
+	btrfs_hmzoned_data_io_unlock_logical(btrfs_sb(inode->i_sb), logical);
+
+	return 0;
+}
+
 static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info,
 					u64 bytenr, u64 num_bytes)
 {
@@ -1645,17 +1679,24 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
 	int ret;
 	int force_cow = need_force_cow(inode, start, end);
 	unsigned int write_flags = wbc_to_write_flags(wbc);
+	int do_compress = inode_can_compress(inode) &&
+		inode_need_compress(inode, start, end);
+	int hmzoned = btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED);
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW && !force_cow) {
+		ASSERT(!hmzoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 1, nr_written);
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
+		ASSERT(!hmzoned);
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_can_compress(inode) ||
-		   !inode_need_compress(inode, start, end)) {
+	} else if (!do_compress && !hmzoned) {
 		ret = cow_file_range(inode, locked_page, start, end, end,
 				     page_started, nr_written, 1, NULL);
+	} else if (!do_compress && hmzoned) {
+		ret = run_delalloc_hmzoned(inode, locked_page, start, end,
+					   page_started, nr_written);
 	} else {
 		set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
 			&BTRFS_I(inode)->runtime_flags);
From: Naohiro Aota
Subject: [PATCH v4 17/27] btrfs: implement atomic compressed IO submission
Date: Fri, 23 Aug 2019 19:10:26 +0900
Message-Id: <20190823101036.796932-18-naohiro.aota@wdc.com>

Just as with non-compressed IO submission, we must unlock the block group
for the next allocation once the compressed IO has been submitted.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/inode.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d504200c9767..283ac11849b1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -776,13 +776,26 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 			 * and IO for us. Otherwise, we need to submit
 			 * all those pages down to the drive.
 			 */
-			if (!page_started && !ret)
+			if (!page_started && !ret) {
+				struct extent_map *em;
+				u64 logical;
+
+				em = btrfs_get_extent(BTRFS_I(inode), NULL, 0,
+						      async_extent->start,
+						      async_extent->ram_size,
+						      0);
+				logical = em->block_start;
+				free_extent_map(em);
+
 				extent_write_locked_range(inode,
 						  async_extent->start,
 						  async_extent->start +
 						  async_extent->ram_size - 1,
 						  WB_SYNC_ALL);
-			else if (ret)
+
+				btrfs_hmzoned_data_io_unlock_logical(fs_info,
+								     logical);
+			} else if (ret)
 				unlock_page(async_chunk->locked_page);
 			kfree(async_extent);
 			cond_resched();
@@ -883,6 +896,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 			free_async_extent_pages(async_extent);
 		}
 		alloc_hint = ins.objectid + ins.offset;
+		btrfs_hmzoned_data_io_unlock_logical(fs_info, ins.objectid);
 		kfree(async_extent);
 		cond_resched();
 	}
@@ -890,6 +904,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
 out_free_reserve:
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
 	btrfs_free_reserved_extent(fs_info, ins.objectid, ins.offset, 1);
+	btrfs_hmzoned_data_io_unlock_logical(fs_info, ins.objectid);
out_free:
 	extent_clear_unlock_delalloc(inode, async_extent->start,
 				     async_extent->start +
IronPort-SDR: FoR4xtdV+xPrfbPGuNf80/8NBoJN0NrU373iGoEIePdrba0goFVrV01R/weRrky7GOSuhbU8aw bvWN6pxGWm0FLvb37m0WjHYz8ydc5S0xyoKlzFAKwDTUmUFHKEMrJrS5RMjULSu6hbxfKax36q H/+wOTplB/Sf2iHpZo5k9kQ6BN0vT4RpiNNqqAAKNBs2MToXE6XOaRE/EQN/SceKKxhjVs7dtZ Hb4ZP8EcSKNkm/HkXJuiUOwyNd3nd+XGw9K/drcoB3Fu5+7QASEjR2fd37eajqgoJtvo0a4Dys odM= WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com (HELO naota.fujisawa.hgst.com) ([10.149.53.115]) by uls-op-cesaip02.wdc.com with ESMTP; 23 Aug 2019 03:11:45 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Matias Bjorling , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v4 18/27] btrfs: support direct write IO in HMZONED Date: Fri, 23 Aug 2019 19:10:27 +0900 Message-Id: <20190823101036.796932-19-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com> References: <20190823101036.796932-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org As same as with other IO submission, we must unlock a block group for the next allocation. 
Signed-off-by: Naohiro Aota
---
 fs/btrfs/inode.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 283ac11849b1..d7be97c6a069 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8519,6 +8519,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 	struct btrfs_io_bio *io_bio;
 	bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
 	int ret = 0;
+	u64 disk_bytenr;
 
 	bio = btrfs_bio_clone(dio_bio);
 
@@ -8562,7 +8563,11 @@ static void btrfs_submit_direct(struct bio *dio_bio, struct inode *inode,
 			dio_data->unsubmitted_oe_range_end;
 	}
 
+	disk_bytenr = dip->disk_bytenr;
 	ret = btrfs_submit_direct_hook(dip);
+	if (write)
+		btrfs_hmzoned_data_io_unlock_logical(
+			btrfs_sb(inode->i_sb), disk_bytenr);
 	if (!ret)
 		return;

From patchwork Fri Aug 23 10:10:28 2019
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org
Subject: [PATCH v4 19/27] btrfs: serialize meta IOs on HMZONED mode
Date: Fri, 23 Aug 2019 19:10:28 +0900
Message-Id: <20190823101036.796932-20-naohiro.aota@wdc.com>

As in the data IO path, we must serialize write IOs for metadata. We cannot simply add a mutex around allocation and submission, because metadata blocks are allocated at an earlier stage to build up the B-trees. Thus, this commit adds hmzoned_meta_io_lock and holds it during metadata IO submission in btree_write_cache_pages() to serialize IOs. Furthermore, this commit adds a per-block-group metadata IO submission pointer, "meta_write_pointer", to ensure sequential writing.
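[Editorial note: the write-pointer discipline described above can be modeled in a few lines of userspace C. This is a hedged sketch, not the kernel code: the struct and function names (`zone_bg`, `check_meta_write_pointer`) are illustrative stand-ins for the patch's `btrfs_block_group_cache` fields, and the in-kernel version additionally handles block-group lookup and refcounting.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model of a sequential-only block group: a metadata
 * buffer may be submitted only if it starts exactly at the group's
 * current write pointer; on success the pointer advances past it. */
struct zone_bg {
	uint64_t start;              /* block group logical start */
	uint64_t len;                /* block group length */
	uint64_t meta_write_pointer; /* next expected metadata bytenr */
};

bool check_meta_write_pointer(struct zone_bg *bg,
			      uint64_t eb_start, uint64_t eb_len)
{
	if (eb_start < bg->start || eb_start >= bg->start + bg->len)
		return false; /* buffer is not in this block group */
	if (bg->meta_write_pointer != eb_start)
		return false; /* out of order: caller must stop and retry */
	bg->meta_write_pointer = eb_start + eb_len; /* advance sequentially */
	return true;
}
```

An out-of-order buffer is simply rejected; in the patch this makes btree_write_cache_pages() break out of the loop and retry later, so writes always hit the zone in strictly increasing offsets.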
Signed-off-by: Naohiro Aota
---
 fs/btrfs/ctree.h     |  3 +++
 fs/btrfs/disk-io.c   |  1 +
 fs/btrfs/extent_io.c | 17 ++++++++++++++++-
 fs/btrfs/hmzoned.c   | 45 ++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/hmzoned.h   | 17 +++++++++++++++++
 5 files changed, 82 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d4df9624cb04..e974174e12a2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -619,6 +619,7 @@ struct btrfs_block_group_cache {
 	 */
 	u64 alloc_offset;
 	struct mutex zone_io_lock;
+	u64 meta_write_pointer;
 };
 
 /* delayed seq elem */
@@ -1105,6 +1106,8 @@ struct btrfs_fs_info {
 	spinlock_t ref_verify_lock;
 	struct rb_root block_tree;
 #endif
+
+	struct mutex hmzoned_meta_io_lock;
 };
 
 static inline struct btrfs_fs_info *btrfs_sb(struct super_block *sb)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d36cdb1b1421..a9632e455eb5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2701,6 +2701,7 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->delete_unused_bgs_mutex);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
+	mutex_init(&fs_info->hmzoned_meta_io_lock);
 	seqlock_init(&fs_info->profiles_lock);
 
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4e67b16c9f80..ff963b2214aa 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3892,7 +3892,9 @@ int btree_write_cache_pages(struct address_space *mapping,
 				   struct writeback_control *wbc)
 {
 	struct extent_io_tree *tree = &BTRFS_I(mapping->host)->io_tree;
+	struct btrfs_fs_info *fs_info = tree->fs_info;
 	struct extent_buffer *eb, *prev_eb = NULL;
+	struct btrfs_block_group_cache *cache = NULL;
 	struct extent_page_data epd = {
 		.bio = NULL,
 		.tree = tree,
@@ -3922,6 +3924,7 @@ int btree_write_cache_pages(struct address_space *mapping,
 		tag = PAGECACHE_TAG_TOWRITE;
 	else
 		tag = PAGECACHE_TAG_DIRTY;
+	btrfs_hmzoned_meta_io_lock(fs_info);
retry:
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		tag_pages_for_writeback(mapping, index, end);
@@ -3965,6 +3968,14 @@ int btree_write_cache_pages(struct address_space *mapping,
 			if (!ret)
 				continue;
 
+			if (!btrfs_check_meta_write_pointer(fs_info, eb,
+							    &cache)) {
+				ret = 0;
+				done = 1;
+				free_extent_buffer(eb);
+				break;
+			}
+
 			prev_eb = eb;
 			ret = lock_extent_buffer_for_io(eb, &epd);
 			if (!ret) {
@@ -3999,12 +4010,16 @@ int btree_write_cache_pages(struct address_space *mapping,
 		index = 0;
 		goto retry;
 	}
+	if (cache)
+		btrfs_put_block_group(cache);
 	ASSERT(ret <= 0);
 	if (ret < 0) {
 		end_write_bio(&epd, ret);
-		return ret;
+		goto out;
 	}
 	ret = flush_write_bio(&epd);
+out:
+	btrfs_hmzoned_meta_io_unlock(fs_info);
 	return ret;
 }
diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 81d8037ae7f6..bfc95a0443d0 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -545,6 +545,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group_cache *cache)
out:
 	cache->alloc_type = alloc_type;
+	if (!ret)
+		cache->meta_write_pointer =
+			cache->alloc_offset + cache->key.objectid;
 	kfree(alloc_offsets);
 	free_extent_map(em);
 
@@ -648,3 +651,45 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans)
 	}
 	spin_unlock(&trans->releasing_ebs_lock);
 }
+
+bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb,
+				    struct btrfs_block_group_cache **cache_ret)
+{
+	struct btrfs_block_group_cache *cache;
+
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return true;
+
+	cache = *cache_ret;
+
+	if (cache &&
+	    (eb->start < cache->key.objectid ||
+	     cache->key.objectid + cache->key.offset <= eb->start)) {
+		btrfs_put_block_group(cache);
+		cache = NULL;
+		*cache_ret = NULL;
+	}
+
+	if (!cache)
+		cache = btrfs_lookup_block_group(fs_info, eb->start);
+
+	if (cache) {
+		*cache_ret = cache;
+
+		if (cache->alloc_type != BTRFS_ALLOC_SEQ)
+			return true;
+
+		if (cache->meta_write_pointer != eb->start) {
+			btrfs_put_block_group(cache);
+			cache = NULL;
+			*cache_ret = NULL;
+			return false;
+		}
+
+		cache->meta_write_pointer = eb->start + eb->len;
+	}
+
+	return true;
+}
diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index a8e7286708d4..c68c4b8056a4 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -40,6 +40,9 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
 			    struct extent_buffer *eb);
 void btrfs_free_redirty_list(struct btrfs_transaction *trans);
 void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len);
+bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
+				    struct extent_buffer *eb,
+				    struct btrfs_block_group_cache **cache_ret);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {
@@ -174,4 +177,18 @@ static inline void btrfs_hmzoned_data_io_unlock_logical(
 	btrfs_put_block_group(cache);
 }
 
+static inline void btrfs_hmzoned_meta_io_lock(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return;
+	mutex_lock(&fs_info->hmzoned_meta_io_lock);
+}
+
+static inline void btrfs_hmzoned_meta_io_unlock(struct btrfs_fs_info *fs_info)
+{
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return;
+	mutex_unlock(&fs_info->hmzoned_meta_io_lock);
+}
+
 #endif

From patchwork Fri Aug 23 10:10:29 2019
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org
Subject: [PATCH v4 20/27] btrfs: wait existing extents before truncating
Date: Fri, 23 Aug 2019 19:10:29 +0900
Message-Id: <20190823101036.796932-21-naohiro.aota@wdc.com>

When truncating a file, file buffers which have already been allocated but not yet written may be truncated. Truncating these buffers could break the sequential write pattern in a block group if the truncated blocks are, for example, followed by blocks allocated to another file. To avoid this problem, always wait for write-out of all unwritten buffers before proceeding with the truncate.
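[Editorial note: the patch below starts the ordered-extent wait at `ALIGN(newsize, fs_info->sectorsize)`, i.e. the new size rounded up to the next sector boundary. A minimal re-statement of that rounding, assuming the alignment is a power of two as the kernel's ALIGN() macro requires:]

```c
#include <stdint.h>

/* Round x up to the next multiple of a, where a is a power of two.
 * This mirrors what the kernel's ALIGN() macro computes; the wait
 * then covers everything from that boundary to the end of file. */
uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}
```

So for a 4096-byte sector size, truncating to 1 byte waits on the range starting at byte 4096, while a size already on a sector boundary is left unchanged.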
Signed-off-by: Naohiro Aota
---
 fs/btrfs/inode.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d7be97c6a069..95f4ce8ac8d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5236,6 +5236,16 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 		btrfs_end_write_no_snapshotting(root);
 		btrfs_end_transaction(trans);
 	} else {
+		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+
+		if (btrfs_fs_incompat(fs_info, HMZONED)) {
+			ret = btrfs_wait_ordered_range(
+				inode,
+				ALIGN(newsize, fs_info->sectorsize),
+				(u64)-1);
+			if (ret)
+				return ret;
+		}
 
 		/*
 		 * We're truncating a file that used to have good data down to

From patchwork Fri Aug 23 10:10:30 2019
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org
Subject: [PATCH v4 21/27] btrfs: avoid async checksum/submit on HMZONED mode
Date: Fri, 23 Aug 2019 19:10:30 +0900
Message-Id: <20190823101036.796932-22-naohiro.aota@wdc.com>

In HMZONED mode, btrfs uses the per-block-group zone_io_lock to serialize data write IOs and the per-filesystem hmzoned_meta_io_lock to serialize metadata write IOs. Even with this serialization, write bios sent from {btree,btrfs}_write_cache_pages can be reordered by the async checksum workers, because those workers are per CPU rather than per zone. To preserve write BIO ordering, we disable async checksum in HMZONED mode. This does not lower performance with HDDs, as a single CPU core is fast enough to checksum a single zone write stream at the maximum possible bandwidth of the device. If multiple zones are written simultaneously, HDD seek overhead lowers the achievable maximum bandwidth, so per-zone checksum serialization again does not affect performance. In addition, this commit disables async_submit in btrfs_submit_compressed_write() for the same reason. This part will become unnecessary once btrfs gets the "btrfs: fix cgroup writeback support" series.
Signed-off-by: Naohiro Aota
---
 fs/btrfs/compression.c | 5 +++--
 fs/btrfs/disk-io.c     | 2 ++
 fs/btrfs/inode.c       | 9 ++++++---
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 60c47b417a4b..058dea5e432f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -322,6 +322,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 	struct block_device *bdev;
 	blk_status_t ret;
 	int skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
+	int async_submit = !btrfs_fs_incompat(fs_info, HMZONED);
 
 	WARN_ON(!PAGE_ALIGNED(start));
 	cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
@@ -377,7 +378,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 			BUG_ON(ret); /* -ENOMEM */
 		}
 
-		ret = btrfs_map_bio(fs_info, bio, 0, 1);
+		ret = btrfs_map_bio(fs_info, bio, 0, async_submit);
 		if (ret) {
 			bio->bi_status = ret;
 			bio_endio(bio);
@@ -408,7 +409,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 		BUG_ON(ret); /* -ENOMEM */
 	}
 
-	ret = btrfs_map_bio(fs_info, bio, 0, 1);
+	ret = btrfs_map_bio(fs_info, bio, 0, async_submit);
 	if (ret) {
 		bio->bi_status = ret;
 		bio_endio(bio);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a9632e455eb5..9bae051f9f44 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -873,6 +873,8 @@ static blk_status_t btree_submit_bio_start(void *private_data, struct bio *bio,
 static int check_async_write(struct btrfs_fs_info *fs_info,
 			     struct btrfs_inode *bi)
 {
+	if (btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
 	if (atomic_read(&bi->sync_writers))
 		return 0;
 	if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags))
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 95f4ce8ac8d0..bb0ae3107e60 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2075,7 +2075,8 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
 	enum btrfs_wq_endio_type metadata = BTRFS_WQ_ENDIO_DATA;
 	blk_status_t ret = 0;
 	int skip_sum;
-	int async = !atomic_read(&BTRFS_I(inode)->sync_writers);
+	int async = !atomic_read(&BTRFS_I(inode)->sync_writers) &&
+		!btrfs_fs_incompat(fs_info, HMZONED);
 
 	skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 
@@ -8383,7 +8384,8 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
 
 	/* Check btrfs_submit_bio_hook() for rules about async submit. */
 	if (async_submit)
-		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers);
+		async_submit = !atomic_read(&BTRFS_I(inode)->sync_writers) &&
+			!btrfs_fs_incompat(fs_info, HMZONED);
 
 	if (!write) {
 		ret = btrfs_bio_wq_end_io(fs_info, bio, BTRFS_WQ_ENDIO_DATA);
@@ -8448,7 +8450,8 @@ static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 	}
 
 	/* async crcs make it difficult to collect full stripe writes. */
-	if (btrfs_data_alloc_profile(fs_info) & BTRFS_BLOCK_GROUP_RAID56_MASK)
+	if (btrfs_data_alloc_profile(fs_info) & BTRFS_BLOCK_GROUP_RAID56_MASK ||
+	    btrfs_fs_incompat(fs_info, HMZONED))
 		async_submit = 0;
 	else
 		async_submit = 1;

From patchwork Fri Aug 23 10:10:31 2019
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org
Subject: [PATCH v4 22/27] btrfs: disallow mixed-bg in HMZONED mode
Date: Fri, 23 Aug 2019 19:10:31 +0900
Message-Id: <20190823101036.796932-23-naohiro.aota@wdc.com>

Placing both data and metadata in a block group is impossible in HMZONED mode. For data, we can allocate a space for it and write it immediately after the allocation. For metadata, however, we cannot do so, because the logical addresses are recorded in other metadata buffers to build up the trees. As a result, a data buffer can be placed after a metadata buffer that is not yet written, and writing out the data buffer would then break the sequential write rule. This commit checks for and disallows mixed block groups (MIXED_GROUPS) together with HMZONED mode.
Signed-off-by: Naohiro Aota
---
 fs/btrfs/hmzoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index bfc95a0443d0..871befbbb23b 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -232,6 +232,13 @@ int btrfs_check_hmzoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
+		btrfs_err(fs_info,
+			  "HMZONED mode is not allowed for mixed block groups");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	btrfs_info(fs_info, "HMZONED mode enabled, zone size %llu B",
 		   fs_info->zone_size);
out:

From patchwork Fri Aug 23 10:10:32 2019
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org
Subject: [PATCH v4 23/27] btrfs: disallow inode_cache in HMZONED mode
Date: Fri, 23 Aug 2019 19:10:32 +0900
Message-Id: <20190823101036.796932-24-naohiro.aota@wdc.com>

inode_cache uses pre-allocation to write its cache data. However, pre-allocation is completely disabled in HMZONED mode. We could technically enable inode_cache in the same way as relocation, but inode_cache is rarely used and the man page discourages using it, so just disable it for now.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/hmzoned.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index 871befbbb23b..f8f41cb3d22a 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -272,6 +272,12 @@ int btrfs_check_mountopts_hmzoned(struct btrfs_fs_info *info)
 		return -EINVAL;
 	}
 
+	if (btrfs_test_pending(info, SET_INODE_MAP_CACHE)) {
+		btrfs_err(info,
+			  "cannot enable inode map caching with HMZONED mode");
+		return -EINVAL;
+	}
+
 	return 0;
 }

From patchwork Fri Aug 23 10:10:33 2019
failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="ftmBsz56" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2404772AbfHWKMA (ORCPT ); Fri, 23 Aug 2019 06:12:00 -0400 Received: from esa3.hgst.iphmx.com ([216.71.153.141]:47806 "EHLO esa3.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2404700AbfHWKMA (ORCPT ); Fri, 23 Aug 2019 06:12:00 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1566555119; x=1598091119; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=qXafMpKw0+kuKPC45Do+6KF/ank9FZH7H2/GFFBNxVI=; b=ftmBsz56ahTmbxV+Y9yfEeD9F6qaXjI/IYpATo93DdXunmTyomzC29qM uWu8p825dP8qEVPLGgNReHWBMoFUKOGHNNCowH7xwu08GhcjVrsX0CNLJ gLZrQhhGjm1Y624hd2xTv8tZF0BJxE6GoXOebRwdjt42mK42hSsr4BcaM oBktq1A6buh0fn4oxhVt4R6w0zq0bC0RCL5I3Jo/yBdZ6UTFpx3Vt9h4c M8NahgfvRf/TxRRh4Q6FeluCKRU0kEdlMh1CAWQ8EDEbtYnkFL5CfBjkt 7mHX7NQ+4qyTRYthtrrHFnxBj0C12AdS5fQROQxA34agQQ7q6OpFCFxnx Q==; IronPort-SDR: EMcbv1uwO//NU0aNxy3AsLgntCgA1DW+2hZaC2O8bon/Qrjho74VfWkSXb8p2mIiOYJnc1WUvE QuY0qyl+plepEzXvZv0LzB5EOiJC6W0so5JdG280wPBKJs0wb8C/8sWrRpxfQH4OyR+LLE7XCO rrYYqkuZRuWfW35DFwYxU+XTancBf4HYTa29XlPcReU7C0rVpJbxK+u9WS4je9K81vUsUqwSB5 o4M2utPBGehntwravLJoxnU3P34LIVMZE/WYgk6OvJ5mvy1KWfVHl1kaLgZfpSqzAg1+7TyPJA s5U= X-IronPort-AV: E=Sophos;i="5.64,420,1559491200"; d="scan'208";a="121096274" Received: from uls-op-cesaip01.wdc.com (HELO uls-op-cesaep01.wdc.com) ([199.255.45.14]) by ob1.hgst.iphmx.com with ESMTP; 23 Aug 2019 18:11:59 +0800 IronPort-SDR: MOxuRID4zuuBv9uNqZfDw/p7abuQ1xl1x5QCfuOR8uZGTgoFQZdREX1Hv2Jpp0PX0Urym8ZhF2 u8qS60EcKBUJ13hYAyo8EeA6Mx7402w4pqNuZaKZHjTr9mOMGTsnHgVnmin3gtnYFC7d+a22ml 0EUuj4LTKF/kXdLcmaHYsN5JLNWy82eILxhOW6N4sS8W6SAvwiQyuSvGH0qaQxAYyrRcpHtvkv XMAQaO5ib7BJkj4/UpGL1Sgu9iSCLMwproBLi+AQDcNpE+Paj8LtZrQhg6YIWabt49M+qU/Obn S0m1qtPPArHxhUg5J8f+y21d Received: from uls-op-cesaip02.wdc.com ([10.248.3.37]) by 
uls-op-cesaep01.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 23 Aug 2019 03:09:17 -0700 IronPort-SDR: XDjz1rNludSycUUIYlxpXlXE8+aJA1IItNsfemCB46V/++YYDncpG8VvIs1pc7StamsRG2hbz5 y3H8VARuCWDFcX2dh49VF3i+qM/QrNwZRLjN7bmaJzog4Enqi/6coEBMDvMSEvK69W3ah88Nv/ bkbrVAf3qHpyr6Bo+iiV3F+CBv7UHLC0p9S2CizqqKH1M+ofMZsJa2jWSGkjTCtOXn+1aTgr7S evqoXOy2djd6ltfH86uNqssohHsp7pyrfAFfGx8xlNyO/+3KrPe5pz4v9OoO8AKInQ1+4OYZdg vEE= WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com (HELO naota.fujisawa.hgst.com) ([10.149.53.115]) by uls-op-cesaip02.wdc.com with ESMTP; 23 Aug 2019 03:11:57 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, David Sterba Cc: Chris Mason , Josef Bacik , Nikolay Borisov , Damien Le Moal , Matias Bjorling , Johannes Thumshirn , Hannes Reinecke , Anand Jain , linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v4 24/27] btrfs: support dev-replace in HMZONED mode Date: Fri, 23 Aug 2019 19:10:33 +0900 Message-Id: <20190823101036.796932-25-naohiro.aota@wdc.com> X-Mailer: git-send-email 2.23.0 In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com> References: <20190823101036.796932-1-naohiro.aota@wdc.com> MIME-Version: 1.0 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Currently, dev-replace copy all the device extents on source device to the target device, and it also clones new incoming write I/Os from users to the source device into the target device. Cloning incoming IOs can break the sequential write rule in the target device. When write is mapped in the middle of block group, that I/O is directed in the middle of a zone of target device, which breaks the sequential write rule. However, the cloning function cannot be simply disabled since incoming I/Os targeting already copied device extents must be cloned so that the I/O is executed on the target device. 
We cannot use dev_replace->cursor_{left,right} to determine whether a bio
targets a not-yet-copied region. Since there is a time gap between
finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have a newly allocated device extent
which is never cloned (by handle_ops_on_dev_replace) nor copied (by the
dev-replace process).

So the point is to copy only already existing device extents. This patch
introduces mark_block_group_to_copy() to mark existing block groups as
targets of copying. Then, handle_ops_on_dev_replace() and dev-replace can
check the flag to do their jobs.

This patch also handles empty regions between used extents. Since
dev-replace is smart enough to copy only the used extents on the source
device, we have to fill the gaps to honor the sequential write rule on the
target device.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/ctree.h       |   1 +
 fs/btrfs/dev-replace.c | 147 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/dev-replace.h |   3 +
 fs/btrfs/extent-tree.c |  20 +++++-
 fs/btrfs/hmzoned.c     |  77 +++++++++++++++++++++
 fs/btrfs/hmzoned.h     |   4 ++
 fs/btrfs/scrub.c       |  83 ++++++++++++++++++++++-
 fs/btrfs/volumes.c     |  40 ++++++++++-
 8 files changed, 370 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e974174e12a2..a353d5901558 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -535,6 +535,7 @@ struct btrfs_block_group_cache {
 	unsigned int has_caching_ctl:1;
 	unsigned int removed:1;
 	unsigned int wp_broken:1;
+	unsigned int to_copy:1;
 
 	int disk_cache_state;

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 2cc3ac4d101d..7ef1654aed9d 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -264,6 +264,10 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 	device->fs_devices = fs_info->fs_devices;
 
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error;
+
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	list_add(&device->dev_list, &fs_info->fs_devices->devices);
 	fs_info->fs_devices->num_devices++;
@@ -398,6 +402,143 @@ static char* btrfs_dev_name(struct btrfs_device *device)
 		return rcu_str_deref(device->name);
 }
 
+static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info,
+				    struct btrfs_device *src_dev)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_key found_key;
+	struct btrfs_root *root = fs_info->dev_root;
+	struct btrfs_dev_extent *dev_extent = NULL;
+	struct btrfs_block_group_cache *cache;
+	struct extent_buffer *l;
+	int slot;
+	int ret;
+	u64 chunk_offset, length;
+
+	/* Do not use "to_copy" on non-HMZONED for now */
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return 0;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	path->reada = READA_FORWARD;
+	path->search_commit_root = 1;
+	path->skip_locking = 1;
+
+	key.objectid = src_dev->devid;
+	key.offset = 0ull;
+	key.type = BTRFS_DEV_EXTENT_KEY;
+
+	while (1) {
+		ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			if (path->slots[0] >=
+			    btrfs_header_nritems(path->nodes[0])) {
+				ret = btrfs_next_leaf(root, path);
+				if (ret < 0)
+					break;
+				if (ret > 0) {
+					ret = 0;
+					break;
+				}
+			} else {
+				ret = 0;
+			}
+		}
+
+		l = path->nodes[0];
+		slot = path->slots[0];
+
+		btrfs_item_key_to_cpu(l, &found_key, slot);
+
+		if (found_key.objectid != src_dev->devid)
+			break;
+
+		if (found_key.type != BTRFS_DEV_EXTENT_KEY)
+			break;
+
+		if (found_key.offset < key.offset)
+			break;
+
+		dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent);
+		length = btrfs_dev_extent_length(l, dev_extent);
+
+		chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent);
+
+		cache = btrfs_lookup_block_group(fs_info, chunk_offset);
+		if (!cache)
+			goto skip;
+
+		spin_lock(&cache->lock);
+		cache->to_copy = 1;
+		spin_unlock(&cache->lock);
+
+		btrfs_put_block_group(cache);
+
+skip:
+		key.offset = found_key.offset + length;
+		btrfs_release_path(path);
+	}
+
+	btrfs_free_path(path);
+
+	return ret;
+}
+
+void btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group_cache *cache,
+				      u64 physical)
+{
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	struct extent_map *em;
+	struct map_lookup *map;
+	u64 chunk_offset = cache->key.objectid;
+	int num_extents, cur_extent;
+	int i;
+
+	em = btrfs_get_chunk_map(fs_info, chunk_offset, 1);
+	BUG_ON(IS_ERR(em));
+	map = em->map_lookup;
+
+	num_extents = cur_extent = 0;
+	for (i = 0; i < map->num_stripes; i++) {
+		/* we have more device extent to copy */
+		if (srcdev != map->stripes[i].dev)
+			continue;
+
+		num_extents++;
+		if (physical == map->stripes[i].physical)
+			cur_extent = i;
+	}
+
+	free_extent_map(em);
+
+	if (num_extents > 1) {
+		if (cur_extent == 0) {
+			/*
+			 * first stripe on this device. Keep this BG
+			 * readonly until we finish all the stripes.
+			 */
+			btrfs_inc_block_group_ro(cache);
+		} else if (cur_extent == num_extents - 1) {
+			/* last stripe on this device */
+			btrfs_dec_block_group_ro(cache);
+			spin_lock(&cache->lock);
+			cache->to_copy = 0;
+			spin_unlock(&cache->lock);
+		}
+	} else {
+		spin_lock(&cache->lock);
+		cache->to_copy = 0;
+		spin_unlock(&cache->lock);
+	}
+}
+
 static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 		const char *tgtdev_name, u64 srcdevid, const char *srcdev_name,
 		int read_src)
@@ -439,6 +580,12 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info,
 	if (ret)
 		return ret;
 
+	mutex_lock(&fs_info->chunk_mutex);
+	ret = mark_block_group_to_copy(fs_info, src_device);
+	mutex_unlock(&fs_info->chunk_mutex);
+	if (ret)
+		return ret;
+
 	down_write(&dev_replace->rwsem);
 	switch (dev_replace->replace_state) {
 	case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED:

diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index 78c5d8f1adda..5ba60345dbf8 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -18,5 +18,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info);
 void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info);
 int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info);
 int btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace);
+void btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev,
+				      struct btrfs_block_group_cache *cache,
+				      u64 physical);
 
 #endif

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9f9c09e28b5b..3289d2164860 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -34,6 +34,7 @@
 #include "delalloc-space.h"
 #include "rcu-string.h"
 #include "hmzoned.h"
+#include "dev-replace.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -1950,6 +1951,8 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			u64 length = stripe->length;
 			u64 bytes;
 			struct request_queue *req_q;
+			struct btrfs_dev_replace *dev_replace =
+				&fs_info->dev_replace;
 
 			if (!stripe->dev->bdev) {
 				ASSERT(btrfs_test_opt(fs_info, DEGRADED));
@@ -1959,15 +1962,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
 			req_q = bdev_get_queue(stripe->dev->bdev);
 
 			/* zone reset in HMZONED mode */
-			if (btrfs_can_zone_reset(dev, physical, length))
+			if (btrfs_can_zone_reset(dev, physical, length)) {
 				ret = btrfs_reset_device_zone(dev, physical,
 							      length, &bytes);
-			else if (blk_queue_discard(req_q))
+				if (ret)
+					goto next;
+				if (!btrfs_dev_replace_is_ongoing(
+					    dev_replace) ||
+				    dev != dev_replace->srcdev)
+					goto next;
+
+				discarded_bytes += bytes;
+				/* send to replace target as well */
+				ret = btrfs_reset_device_zone(
+					dev_replace->tgtdev,
+					physical, length, &bytes);
+			} else if (blk_queue_discard(req_q))
 				ret = btrfs_issue_discard(dev->bdev, physical,
 							  length, &bytes);
 			else
 				continue;
 
+next:
 			if (!ret)
 				discarded_bytes += bytes;
 			else if (ret != -EOPNOTSUPP)

diff --git a/fs/btrfs/hmzoned.c b/fs/btrfs/hmzoned.c
index f8f41cb3d22a..1012e1e49645 100644
--- a/fs/btrfs/hmzoned.c
+++ b/fs/btrfs/hmzoned.c
@@ -706,3 +706,80 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 
 	return true;
 }
+
+int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+				u64 length)
+{
+	if (!btrfs_dev_is_sequential(device, physical))
+		return -EOPNOTSUPP;
+
+	return blkdev_issue_zeroout(device->bdev,
+				    physical >> SECTOR_SHIFT,
+				    length >> SECTOR_SHIFT,
+				    GFP_NOFS, 0);
+}
+
+static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical,
+			  struct blk_zone *zone)
+{
+	struct btrfs_bio *bbio = NULL;
+	u64 mapped_length = PAGE_SIZE;
+	int nmirrors;
+	int i, ret;
+
+	ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
+			       &mapped_length, &bbio);
+	if (ret || !bbio || mapped_length < PAGE_SIZE) {
+		btrfs_put_bbio(bbio);
+		return -EIO;
+	}
+
+	if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK)
+		return -EINVAL;
+
+	nmirrors = (int)bbio->num_stripes;
+	for (i = 0; i < nmirrors; i++) {
+		u64 physical = bbio->stripes[i].physical;
+		struct btrfs_device *dev = bbio->stripes[i].dev;
+
+		/* missing device */
+		if (!dev->bdev)
+			continue;
+
+		ret = btrfs_get_dev_zone(dev, physical, zone, GFP_NOFS);
+		/* failing device */
+		if (ret == -EIO || ret == -EOPNOTSUPP)
+			continue;
+		break;
+	}
+
+	return ret;
+}
+
+int btrfs_sync_hmzone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				    u64 physical_start, u64 physical_pos)
+{
+	struct btrfs_fs_info *fs_info = tgt_dev->fs_info;
+	struct blk_zone zone;
+	u64 length;
+	u64 wp;
+	int ret;
+
+	if (!btrfs_dev_is_sequential(tgt_dev, physical_pos))
+		return 0;
+
+	ret = read_zone_info(fs_info, logical, &zone);
+	if (ret)
+		return ret;
+
+	wp = physical_start + ((zone.wp - zone.start) << SECTOR_SHIFT);
+
+	if (physical_pos == wp)
+		return 0;
+
+	if (physical_pos > wp)
+		return -EUCLEAN;
+
+	length = wp - physical_pos;
+	return btrfs_hmzoned_issue_zeroout(tgt_dev, physical_pos, length);
+}

diff --git a/fs/btrfs/hmzoned.h b/fs/btrfs/hmzoned.h
index c68c4b8056a4..b0bb96404a24 100644
--- a/fs/btrfs/hmzoned.h
+++ b/fs/btrfs/hmzoned.h
@@ -43,6 +43,10 @@ void btrfs_hmzoned_data_io_unlock_at(struct inode *inode, u64 start, u64 len);
 bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
 				    struct extent_buffer *eb,
 				    struct btrfs_block_group_cache **cache_ret);
+int btrfs_hmzoned_issue_zeroout(struct btrfs_device *device, u64 physical,
+				u64 length);
+int btrfs_sync_hmzone_write_pointer(struct btrfs_device *tgt_dev, u64 logical,
+				    u64 physical_start, u64 physical_pos);
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
 {

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e15d846c700a..9f3484597338 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -167,6 +167,7 @@ struct scrub_ctx {
 	int			pages_per_rd_bio;
 
 	int			is_dev_replace;
+	u64			write_pointer;
 
 	struct scrub_bio	*wr_curr_bio;
 	struct mutex		wr_lock;
@@ -1648,6 +1649,23 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
 	sbio = sctx->wr_curr_bio;
 	if (sbio->page_count == 0) {
 		struct bio *bio;
+		u64 physical = spage->physical_for_dev_replace;
+
+		if (btrfs_fs_incompat(sctx->fs_info, HMZONED) &&
+		    sctx->write_pointer < physical) {
+			u64 length = physical - sctx->write_pointer;
+
+			ret = btrfs_hmzoned_issue_zeroout(sctx->wr_tgtdev,
+							  sctx->write_pointer,
+							  length);
+			if (ret == -EOPNOTSUPP)
+				ret = 0;
+			if (ret) {
+				mutex_unlock(&sctx->wr_lock);
+				return ret;
+			}
+			sctx->write_pointer = physical;
+		}
 
 		sbio->physical = spage->physical_for_dev_replace;
 		sbio->logical = spage->logical;
@@ -1710,6 +1728,10 @@ static void scrub_wr_submit(struct scrub_ctx *sctx)
 	 * doubled the write performance on spinning disks when measured
 	 * with Linux 3.5 */
 	btrfsic_submit_bio(sbio->bio);
+
+	if (btrfs_fs_incompat(sctx->fs_info, HMZONED))
+		sctx->write_pointer = sbio->physical +
+			sbio->page_count * PAGE_SIZE;
 }
 
 static void scrub_wr_bio_end_io(struct bio *bio)
@@ -3043,6 +3065,21 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx,
 	return ret < 0 ? ret : 0;
 }
 
+void sync_replace_for_hmzoned(struct scrub_ctx *sctx)
+{
+	if (!btrfs_fs_incompat(sctx->fs_info, HMZONED))
+		return;
+
+	sctx->flush_all_writes = true;
+	scrub_submit(sctx);
+	mutex_lock(&sctx->wr_lock);
+	scrub_wr_submit(sctx);
+	mutex_unlock(&sctx->wr_lock);
+
+	wait_event(sctx->list_wait,
+		   atomic_read(&sctx->bios_in_flight) == 0);
+}
+
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct map_lookup *map,
 					   struct btrfs_device *scrub_dev,
@@ -3174,6 +3211,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	 */
 	blk_start_plug(&plug);
 
+	if (sctx->is_dev_replace &&
+	    btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) {
+		mutex_lock(&sctx->wr_lock);
+		sctx->write_pointer = physical;
+		mutex_unlock(&sctx->wr_lock);
+		sctx->flush_all_writes = true;
+	}
+
 	/*
 	 * now find all extents for each stripe and scrub them
 	 */
@@ -3346,6 +3391,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 			if (ret)
 				goto out;
 
+			if (sctx->is_dev_replace)
+				sync_replace_for_hmzoned(sctx);
+
 			if (extent_logical + extent_len <
 			    key.objectid + bytes) {
 				if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) {
@@ -3413,6 +3461,26 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	blk_finish_plug(&plug);
 	btrfs_free_path(path);
 	btrfs_free_path(ppath);
+
+	if (btrfs_fs_incompat(fs_info, HMZONED) && sctx->is_dev_replace &&
+	    ret >= 0) {
+		wait_event(sctx->list_wait,
+			   atomic_read(&sctx->bios_in_flight) == 0);
+
+		mutex_lock(&sctx->wr_lock);
+		if (sctx->write_pointer < physical_end) {
+			ret = btrfs_sync_hmzone_write_pointer(
+				sctx->wr_tgtdev, base + offset,
+				map->stripes[num].physical,
+				sctx->write_pointer);
+			if (ret)
+				btrfs_err(fs_info,
+					  "failed to recover write pointer");
+		}
+		mutex_unlock(&sctx->wr_lock);
+		btrfs_dev_clear_zone_empty(sctx->wr_tgtdev,
+					   map->stripes[num].physical);
+	}
+
 	return ret < 0 ? ret : 0;
 }
 
@@ -3554,6 +3622,14 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		if (!cache)
 			goto skip;
 
+		spin_lock(&cache->lock);
+		if (sctx->is_dev_replace && !cache->to_copy) {
+			spin_unlock(&cache->lock);
+			ro_set = 0;
+			goto done;
+		}
+		spin_unlock(&cache->lock);
+
 		/*
 		 * we need call btrfs_inc_block_group_ro() with scrubs_paused,
 		 * to avoid deadlock caused by:
@@ -3588,7 +3664,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 		ret = btrfs_wait_ordered_roots(fs_info, U64_MAX,
 					       cache->key.objectid,
 					       cache->key.offset);
-		if (ret > 0) {
+		if (ret >= 0) {
 			struct btrfs_trans_handle *trans;
 
 			trans = btrfs_join_transaction(root);
@@ -3664,6 +3740,11 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 
 		scrub_pause_off(fs_info);
 
+		if (sctx->is_dev_replace)
+			btrfs_finish_block_group_to_copy(
+				dev_replace->srcdev, cache, found_key.offset);
+
+done:
 		down_write(&fs_info->dev_replace.rwsem);
 		dev_replace->cursor_left = dev_replace->cursor_right;
 		dev_replace->item_needs_writeback = 1;

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 16094fc68552..632001aea19e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1592,6 +1592,9 @@ int find_free_dev_extent_start(struct btrfs_device *device, u64 num_bytes,
 	search_start = max_t(u64, search_start, zone_size);
 	search_start = btrfs_zone_align(device, search_start);
 
+	WARN_ON(device->zone_info &&
+		!IS_ALIGNED(num_bytes, device->zone_info->zone_size));
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -5886,9 +5889,29 @@ static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info,
 	return ret;
 }
 
+static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group_cache *cache;
+	bool ret;
+
+	/* non-HMZONED mode does not use "to_copy" flag */
+	if (!btrfs_fs_incompat(fs_info, HMZONED))
+		return false;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+
+	spin_lock(&cache->lock);
+	ret = cache->to_copy;
+	spin_unlock(&cache->lock);
+
+	btrfs_put_block_group(cache);
+	return ret;
+}
+
 static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				      struct btrfs_bio **bbio_ret,
 				      struct btrfs_dev_replace *dev_replace,
+				      u64 logical,
 				      int *num_stripes_ret, int *max_errors_ret)
 {
 	struct btrfs_bio *bbio = *bbio_ret;
@@ -5901,6 +5924,15 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 	if (op == BTRFS_MAP_WRITE) {
 		int index_where_to_add;
 
+		/*
+		 * a block group which have "to_copy" set will
+		 * eventually copied by dev-replace process. We can
+		 * avoid cloning IO here.
+		 */
+		if (is_block_group_to_copy(dev_replace->srcdev->fs_info,
+					   logical))
+			return;
+
 		/*
 		 * duplicate the write operations while the dev replace
 		 * procedure is running. Since the copying of the old disk to
@@ -5928,6 +5960,10 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				index_where_to_add++;
 				max_errors++;
 				tgtdev_indexes++;
+
+				/* mark this zone as non-empty */
+				btrfs_dev_clear_zone_empty(new->dev,
+							   new->physical);
 			}
 		}
 		num_stripes = index_where_to_add;
@@ -6313,8 +6349,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
 	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
 	    need_full_stripe(op)) {
-		handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes,
-					  &max_errors);
+		handle_ops_on_dev_replace(op, &bbio, dev_replace, logical,
+					  &num_stripes, &max_errors);
 	}
 
 	*bbio_ret = bbio;

From patchwork Fri Aug 23 10:10:34 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111413
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
 Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
 linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v4 25/27] btrfs: enable relocation in HMZONED mode
Date: Fri, 23 Aug 2019 19:10:34 +0900
Message-Id: <20190823101036.796932-26-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>
References: <20190823101036.796932-1-naohiro.aota@wdc.com>

To serialize allocation and submit_bio, we introduced a mutex around them.
As a result, preallocation must be completely disabled to avoid a deadlock.

Since the current relocation process relies on preallocation to move file
data extents, it must be handled another way. In HMZONED mode, we just
truncate the inode to the size that we wanted to pre-allocate. Then, we
flush the dirty pages on the file before finishing the relocation process.
run_delalloc_hmzoned() will handle all the allocation and submit the I/Os
to the underlying layers.
Signed-off-by: Naohiro Aota
---
 fs/btrfs/relocation.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 7f219851fa23..d852e3389ee2 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3152,6 +3152,34 @@ int prealloc_file_extent_cluster(struct inode *inode,
 	if (ret)
 		goto out;
 
+	/*
+	 * In HMZONED, we cannot preallocate the file region. Instead,
+	 * we dirty and fiemap_write the region.
+	 */
+
+	if (btrfs_fs_incompat(btrfs_sb(inode->i_sb), HMZONED)) {
+		struct btrfs_root *root = BTRFS_I(inode)->root;
+		struct btrfs_trans_handle *trans;
+
+		end = cluster->end - offset + 1;
+		trans = btrfs_start_transaction(root, 1);
+		if (IS_ERR(trans))
+			return PTR_ERR(trans);
+
+		inode->i_ctime = current_time(inode);
+		i_size_write(inode, end);
+		btrfs_ordered_update_i_size(inode, end, NULL);
+		ret = btrfs_update_inode(trans, root, inode);
+		if (ret) {
+			btrfs_abort_transaction(trans, ret);
+			btrfs_end_transaction(trans);
+			return ret;
+		}
+		ret = btrfs_end_transaction(trans);
+
+		goto out;
+	}
+
 	cur_offset = prealloc_start;
 	while (nr < cluster->nr) {
 		start = cluster->boundary[nr] - offset;
@@ -3340,6 +3368,10 @@ static int relocate_file_extent_cluster(struct inode *inode,
 		btrfs_throttle(fs_info);
 	}
 	WARN_ON(nr != cluster->nr);
+	if (btrfs_fs_incompat(fs_info, HMZONED) && !ret) {
+		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+		WARN_ON(ret);
+	}
 out:
 	kfree(ra);
 	return ret;
@@ -4180,8 +4212,12 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct btrfs_inode_item *item;
 	struct extent_buffer *leaf;
+	u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC;
 	int ret;
 
+	if (btrfs_fs_incompat(trans->fs_info, HMZONED))
+		flags &= ~BTRFS_INODE_PREALLOC;
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -4196,8 +4232,7 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans,
 	btrfs_set_inode_generation(leaf, item, 1);
 	btrfs_set_inode_size(leaf, item, 0);
 	btrfs_set_inode_mode(leaf, item, S_IFREG | 0600);
-	btrfs_set_inode_flags(leaf, item, BTRFS_INODE_NOCOMPRESS |
-			      BTRFS_INODE_PREALLOC);
+	btrfs_set_inode_flags(leaf, item, flags);
 	btrfs_mark_buffer_dirty(leaf);
 out:
 	btrfs_free_path(path);

From patchwork Fri Aug 23 10:10:35 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111417
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal,
 Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain,
 linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v4 26/27] btrfs: relocate block group to repair IO failure in HMZONED
Date: Fri, 23 Aug 2019 19:10:35 +0900
Message-Id: <20190823101036.796932-27-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>
References: <20190823101036.796932-1-naohiro.aota@wdc.com>

When btrfs finds a checksum error and the file system has a mirror of the
damaged data, btrfs reads the correct data from the mirror and writes it
back to the damaged blocks. This repair write, however, violates the
sequential write rule.

We can consider three methods to repair an I/O failure in HMZONED mode:

(1) Reset and rewrite the damaged zone
(2) Allocate a new device extent and replace the damaged device extent
    with the new one
(3) Relocate the corresponding block group

Method (1) is most similar to the behavior on regular devices. However, it
also wipes non-damaged data in the same device extent, and so it
unnecessarily degrades non-damaged data.

Method (2) is much like device replacing, but done within the same device.
It is safe because it keeps the device extent until the replacing
finishes. However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function would need to be extended to support replacing a device
extent's position within one device.

Method (3) invokes relocation of the damaged block group, so it is
straightforward to implement. It relocates all the mirrored device
extents, so it is, potentially, a more costly operation than method (1) or
(2). But it relocates only the used extents, which reduces the total I/O
size.

Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).

To protect a block group from being relocated multiple times by multiple
I/O errors, this commit introduces a "relocating_repair" bit to show that
the block group is already being relocated to repair I/O failures.
Also, the relocation runs on a new kthread, "btrfs-relocating-repair", so that the IO path is not blocked by the relocation process. This commit also supports repairing in the scrub process.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/ctree.h     |  1 +
 fs/btrfs/extent_io.c |  3 ++
 fs/btrfs/scrub.c     |  3 ++
 fs/btrfs/volumes.c   | 72 ++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/volumes.h   |  1 +
 5 files changed, 80 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a353d5901558..8b00798ca3a1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -536,6 +536,7 @@ struct btrfs_block_group_cache {
 	unsigned int removed:1;
 	unsigned int wp_broken:1;
 	unsigned int to_copy:1;
+	unsigned int relocating_repair:1;

 	int disk_cache_state;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ff963b2214aa..0d3b61606b15 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2187,6 +2187,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start,
 	ASSERT(!(fs_info->sb->s_flags & SB_RDONLY));
 	BUG_ON(!mirror_num);

+	if (btrfs_fs_incompat(fs_info, HMZONED))
+		return btrfs_repair_one_hmzone(fs_info, logical);
+
 	bio = btrfs_io_bio_alloc(1);
 	bio->bi_iter.bi_size = 0;
 	map_length = length;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 9f3484597338..6dd5fa4ad657 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -861,6 +861,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 	have_csum = sblock_to_check->pagev[0]->have_csum;
 	dev = sblock_to_check->pagev[0]->dev;

+	if (btrfs_fs_incompat(fs_info, HMZONED) && !sctx->is_dev_replace)
+		return btrfs_repair_one_hmzone(fs_info, logical);
+
 	/*
 	 * We must use GFP_NOFS because the scrub task might be waiting for a
 	 * worker task executing this function and in turn a transaction commit
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 632001aea19e..e34032a438db 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7998,3 +7998,75 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr)
 	spin_unlock(&fs_info->swapfile_pins_lock);
 	return node != NULL;
 }
+
+static int relocating_repair_kthread(void *data)
+{
+	struct btrfs_block_group_cache *cache =
+		(struct btrfs_block_group_cache *)data;
+	struct btrfs_fs_info *fs_info = cache->fs_info;
+	u64 target;
+	int ret = 0;
+
+	target = cache->key.objectid;
+	btrfs_put_block_group(cache);
+
+	if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
+		btrfs_info(fs_info,
+			   "skip relocating block group %llu to repair: EBUSY",
+			   target);
+		return -EBUSY;
+	}
+
+	mutex_lock(&fs_info->delete_unused_bgs_mutex);
+
+	/* ensure block group still exists */
+	cache = btrfs_lookup_block_group(fs_info, target);
+	if (!cache)
+		goto out;
+
+	if (!cache->relocating_repair)
+		goto out;
+
+	ret = btrfs_may_alloc_data_chunk(fs_info, target);
+	if (ret < 0)
+		goto out;
+
+	btrfs_info(fs_info, "relocating block group %llu to repair IO failure",
+		   target);
+	ret = btrfs_relocate_chunk(fs_info, target);
+
+out:
+	if (cache)
+		btrfs_put_block_group(cache);
+	mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+	clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
+
+	return ret;
+}
+
+int btrfs_repair_one_hmzone(struct btrfs_fs_info *fs_info, u64 logical)
+{
+	struct btrfs_block_group_cache *cache;
+
+	/* do not attempt to repair in degraded state */
+	if (btrfs_test_opt(fs_info, DEGRADED))
+		return 0;
+
+	cache = btrfs_lookup_block_group(fs_info, logical);
+	if (!cache)
+		return 0;
+
+	spin_lock(&cache->lock);
+	if (cache->relocating_repair) {
+		spin_unlock(&cache->lock);
+		btrfs_put_block_group(cache);
+		return 0;
+	}
+	cache->relocating_repair = 1;
+	spin_unlock(&cache->lock);
+
+	kthread_run(relocating_repair_kthread, cache,
+		    "btrfs-relocating-repair");
+
+	return 0;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 5da1f354db93..ccb139d1f9c4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -593,5 +593,6 @@ bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 int btrfs_bg_type_to_factor(u64 flags);
 const char *btrfs_bg_type_to_raid_name(u64 flags);
 int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info);
+int btrfs_repair_one_hmzone(struct btrfs_fs_info *fs_info, u64 logical);

 #endif

From patchwork Fri Aug 23 10:10:36 2019
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11111421
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, David Sterba
Cc: Chris Mason, Josef Bacik, Nikolay Borisov, Damien Le Moal, Matias Bjorling, Johannes Thumshirn, Hannes Reinecke, Anand Jain, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v4 27/27] btrfs: enable to mount HMZONED incompat flag
Date: Fri, 23 Aug 2019 19:10:36 +0900
Message-Id: <20190823101036.796932-28-naohiro.aota@wdc.com>
In-Reply-To: <20190823101036.796932-1-naohiro.aota@wdc.com>
References: <20190823101036.796932-1-naohiro.aota@wdc.com>

This final patch adds the HMZONED incompat flag to BTRFS_FEATURE_INCOMPAT_SUPP, which enables btrfs to mount a file system that has the HMZONED flag set.

Signed-off-by: Naohiro Aota
---
 fs/btrfs/ctree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8b00798ca3a1..597159c2b6b0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -294,7 +294,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES |		\
-	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
+	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID |		\
+	 BTRFS_FEATURE_INCOMPAT_HMZONED)

 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)