From patchwork Fri Oct 30 13:51:08 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869609
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Johannes Thumshirn, Jens Axboe
Subject: [PATCH v9 01/41] block: add bio_add_zone_append_page
Date: Fri, 30 Oct 2020 22:51:08 +0900
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Johannes Thumshirn

Add bio_add_zone_append_page(), a wrapper around bio_add_hw_page() which
is intended to be used by file systems that directly add pages to a bio
instead of using bio_iov_iter_get_pages().
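The key property of the new helper is that a zone-append bio may never grow past the device's `max_zone_append_sectors` limit, because the block layer cannot split such a bio without breaking the append semantics. That size cap can be modeled in user space; the sketch below is illustrative only (`struct model_bio` and `model_add_page()` are invented names, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SHIFT 9

/* User-space model of a zone-append bio: only the running size and the
 * device cap matter for the add-page decision. */
struct model_bio {
	size_t size;		 /* bytes currently in the bio */
	size_t max_append_bytes; /* queue_max_zone_append_sectors << 9 */
};

/* Returns the number of bytes added: len on success, 0 if adding the page
 * would exceed the cap, mirroring the 0-on-failure convention of
 * bio_add_zone_append_page(). */
static size_t model_add_page(struct model_bio *bio, size_t len)
{
	if (bio->size + len > bio->max_append_bytes)
		return 0;
	bio->size += len;
	return len;
}
```

Because the real helper also requires the device to accept bios of at least PAGE_SIZE, adding a single page to an empty bio can never fail; in the model this corresponds to `max_append_bytes` being at least one page.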
Cc: Jens Axboe
Signed-off-by: Johannes Thumshirn
---
 block/bio.c         | 36 ++++++++++++++++++++++++++++++++++++
 include/linux/bio.h |  2 ++
 2 files changed, 38 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 58d765400226..2dfe40be4d6b 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -853,6 +853,42 @@ int bio_add_pc_page(struct request_queue *q, struct bio *bio,
 }
 EXPORT_SYMBOL(bio_add_pc_page);
 
+/**
+ * bio_add_zone_append_page - attempt to add page to zone-append bio
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist of a bio that will be submitted
+ * for a zone-append request. This can fail for a number of reasons, such as the
+ * bio being full, the target block device not being a zoned block device, or
+ * other limitations of the target block device. The target block device must
+ * allow bios up to PAGE_SIZE, so it is always possible to add a single page
+ * to an empty bio.
+ */
+int bio_add_zone_append_page(struct bio *bio, struct page *page,
+			     unsigned int len, unsigned int offset)
+{
+	struct request_queue *q;
+	bool same_page = false;
+
+	if (WARN_ON_ONCE(bio_op(bio) != REQ_OP_ZONE_APPEND))
+		return 0;
+
+	if (WARN_ON_ONCE(!bio->bi_disk))
+		return 0;
+
+	q = bio->bi_disk->queue;
+
+	if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
+		return 0;
+
+	return bio_add_hw_page(q, bio, page, len, offset,
+			       queue_max_zone_append_sectors(q), &same_page);
+}
+EXPORT_SYMBOL_GPL(bio_add_zone_append_page);
+
 /**
  * __bio_try_merge_page - try appending data to an existing bvec.
  * @bio: destination bio
diff --git a/include/linux/bio.h b/include/linux/bio.h
index c6d765382926..7ef300cb4e9a 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -442,6 +442,8 @@ void bio_chain(struct bio *, struct bio *);
 extern int bio_add_page(struct bio *, struct page *, unsigned int, unsigned int);
 extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
 			   unsigned int, unsigned int);
+int bio_add_zone_append_page(struct bio *bio, struct page *page,
+			     unsigned int len, unsigned int offset);
 bool __bio_try_merge_page(struct bio *bio, struct page *page,
 		unsigned int len, unsigned int off, bool *same_page);
 void __bio_add_page(struct bio *bio, struct page *page,

From patchwork Fri Oct 30 13:51:09 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869635
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Christoph Hellwig, "Darrick J. Wong", linux-xfs@vger.kernel.org
Subject: [PATCH v9 02/41] iomap: support REQ_OP_ZONE_APPEND
Date: Fri, 30 Oct 2020 22:51:09 +0900
X-Mailing-List: linux-fsdevel@vger.kernel.org

A ZONE_APPEND bio must follow the hardware restrictions (e.g. not
exceeding max_zone_append_sectors) so that it is not split.
bio_iov_iter_get_pages() builds such a restricted bio using
__bio_iov_append_get_pages() if bio_op(bio) == REQ_OP_ZONE_APPEND. To use
it, we need to set the bio_op before calling bio_iov_iter_get_pages().

This commit introduces IOMAP_F_ZONE_APPEND so that iomap users can set the
flag to indicate that they want REQ_OP_ZONE_APPEND and a restricted bio.

Cc: Christoph Hellwig
Cc: "Darrick J. Wong"
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Naohiro Aota
---
 fs/iomap/direct-io.c  | 22 ++++++++++++------
 include/linux/iomap.h |  1 +
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index c1aafb2ab990..e534703c5594 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -210,6 +210,7 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 	struct bio *bio;
 	bool need_zeroout = false;
 	bool use_fua = false;
+	bool zone_append = false;
 	int nr_pages, ret = 0;
 	size_t copied = 0;
 	size_t orig_count;
@@ -241,6 +242,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		use_fua = true;
 	}
 
+	zone_append = iomap->flags & IOMAP_F_ZONE_APPEND;
+
 	/*
 	 * Save the original count and trim the iter to just the extent we
 	 * are operating on right now.  The iter will be re-expanded once
@@ -278,6 +281,19 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		bio->bi_private = dio;
 		bio->bi_end_io = iomap_dio_bio_end_io;
 
+		if (dio->flags & IOMAP_DIO_WRITE) {
+			bio->bi_opf = (zone_append ? REQ_OP_ZONE_APPEND : REQ_OP_WRITE) |
+				REQ_SYNC | REQ_IDLE;
+			if (use_fua)
+				bio->bi_opf |= REQ_FUA;
+			else
+				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
+		} else {
+			WARN_ON_ONCE(zone_append);
+
+			bio->bi_opf = REQ_OP_READ;
+		}
+
 		ret = bio_iov_iter_get_pages(bio, dio->submit.iter);
 		if (unlikely(ret)) {
 			/*
@@ -292,14 +308,8 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
 		n = bio->bi_iter.bi_size;
 		if (dio->flags & IOMAP_DIO_WRITE) {
-			bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_IDLE;
-			if (use_fua)
-				bio->bi_opf |= REQ_FUA;
-			else
-				dio->flags &= ~IOMAP_DIO_WRITE_FUA;
 			task_io_account_write(n);
 		} else {
-			bio->bi_opf = REQ_OP_READ;
 			if (dio->flags & IOMAP_DIO_DIRTY)
 				bio_set_pages_dirty(bio);
 		}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 4d1d3c3469e9..1bccd1880d0d 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -54,6 +54,7 @@ struct vm_fault;
 #define IOMAP_F_SHARED		0x04
 #define IOMAP_F_MERGED		0x08
 #define IOMAP_F_BUFFER_HEAD	0x10
+#define IOMAP_F_ZONE_APPEND	0x20
 
 /*
  * Flags set by the core iomap code during operations:

From patchwork Fri Oct 30 13:51:10 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869633
failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="KaYgFfEe" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726711AbgJ3Nw2 (ORCPT ); Fri, 30 Oct 2020 09:52:28 -0400 Received: from esa3.hgst.iphmx.com ([216.71.153.141]:21971 "EHLO esa3.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726653AbgJ3Nw0 (ORCPT ); Fri, 30 Oct 2020 09:52:26 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1604065945; x=1635601945; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=JURNtChtu4jU2Kkb/XX8ESd3HK5ZuzHxwYwetvjZgvM=; b=KaYgFfEe+OlVzcXbKvkRVEFBfxsCugMhNgCW1V/7TWYlVsOzgfQDP9Fb F0x3/rrCM02GlNt3RRybPIfcfjkt/VkUMC40hFehLdy2dLOW1oSBfQWGg sHqDNW02tgLnX/bzdQ432H3P4Mn7aXvsoPvwcp5UlU5XRPn8PZKUiqUx7 pfxA0k9EkURux7DU5/SjcSoV+Fo1vhQ9oTAAQZzFrOXIATqSL1kOMUuLQ YULa7GO4Sa1wIIo6tiSSP+eiTzUYjvESNTVTBH8F2OVquqPvsKWa8q7mz i++elXznb5gUbrWIYkOJOVwbjMqZKGxCadKTiIvfsemDPwuHJGl+Mu7kf g==; IronPort-SDR: IFtpAChb7/Op4ku3LNqPL3/KnbsZzOaGw0vM4DvoXfthlAKUQ371tSch+G7CMdLGxhDYPIH3lG Ej3wk0ZiDmLeSuNZeP6V1doMCfAh9jD99ERASresqTq2t8lC57NgK00bb+jChuByngC1XwYIX/ ZxuPd5S7NDJzVHgI+FnB7ctei9PiDkMAW4q10Pobw6aTjTSDtM6Kd02nGFw9P4R1ZIcbcb56/G A4/zY7fnOGeAE8DJPvzEM25dzpbSB+1EFHcDY0bPa9+MmSLXcyvWntYkDU4Gc28YN/RCI/xMQo u0k= X-IronPort-AV: E=Sophos;i="5.77,433,1596470400"; d="scan'208";a="155806575" Received: from uls-op-cesaip01.wdc.com (HELO uls-op-cesaep01.wdc.com) ([199.255.45.14]) by ob1.hgst.iphmx.com with ESMTP; 30 Oct 2020 21:52:25 +0800 IronPort-SDR: TO1PllebxaXdDRrF+DKy9mt17G5iT3fdvj5TkBUq6pUsWRk/iV2CHouC0dUbfPzrFbw7XO7UGR UMAwMGzOswSFV8mEoXqaOQ8QATinYNySaMeMg7DPhgNbc7oSjNaAGSocL1QClD4vA9o7nL3Q/9 30VsGcjP23gdJ5ERxWZnVIF2j4UuYK+FF9omo/m4L/yceT6tMfuI7gzabeo6hgkYR2A8bm1C45 ySXJkTBdgrH/qSj7Smojg+yMKInKoeyW25Is6mRc/f8lVkSuVSt3UvVpIhu0USJ61HlWVbEBf1 hxXsjb8GzTR5B2XC9ooWbPWQ Received: from uls-op-cesaip01.wdc.com ([10.248.3.36]) by 
uls-op-cesaep01.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 30 Oct 2020 06:38:39 -0700 IronPort-SDR: UZcBDI6LSnPh6yTruXKpN60LM6cfP+FqbuIclkJvZxBcZaeDlnlXH/O3U9N9BtvUmh+46ni2cM n9xInb2vIEEkx4cnuzmqGWH/yRnbvr0gEea4CIH9Z87u6OqG/Adaol0WmDr/TO2r5i4zXf5Yaa VSouM7eEPlHE3vS+wC8dpvs+SGEytl4CbZjH4CLmTtQETYXAfiwV0W6UXQ1YwngCZPkIp9ncxc Cd9SDGaQReK399RzB30p4AlAntXcqwd5q/rTovT8RZOBPj27U557o6G52v/dVzJlQGzgZ9zoR4 qnk= WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip01.wdc.com with ESMTP; 30 Oct 2020 06:52:24 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota , Damien Le Moal , Anand Jain , Johannes Thumshirn Subject: [PATCH v9 03/41] btrfs: introduce ZONED feature flag Date: Fri, 30 Oct 2020 22:51:10 +0900 Message-Id: <2b5ff5d6f08e483f71cfdeff47267c0dbe2acf4e.1604065694.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This patch introduces the ZONED incompat flag. The flag indicates that the volume management will satisfy the constraints imposed by host-managed zoned block devices. 
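Because ZONED is an *incompat* flag, a kernel that does not know the bit must refuse to mount the filesystem. The general incompat check can be sketched in plain C using the flag values from this patch (the helper `incompat_supported()` is an invented name for illustration, not the btrfs code path):

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in include/uapi/linux/btrfs.h by this patch. */
#define BTRFS_FEATURE_INCOMPAT_RAID1C34 (1ULL << 11)
#define BTRFS_FEATURE_INCOMPAT_ZONED    (1ULL << 12)

/* A filesystem is mountable only if every incompat bit set in the
 * superblock is also in the set this kernel supports. */
static bool incompat_supported(uint64_t sb_flags, uint64_t supported)
{
	return (sb_flags & ~supported) == 0;
}
```

A kernel built before this patch would not include `BTRFS_FEATURE_INCOMPAT_ZONED` in its supported mask, so a superblock carrying the bit fails the check and the mount is rejected.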
Signed-off-by: Damien Le Moal
Signed-off-by: Naohiro Aota
Reviewed-by: Anand Jain
Reviewed-by: Johannes Thumshirn
---
 fs/btrfs/sysfs.c           | 2 ++
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 279d9262b676..828006020bbd 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -263,6 +263,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
 BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
+BTRFS_FEAT_ATTR_INCOMPAT(zoned, ZONED);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -278,6 +279,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
 	BTRFS_FEAT_ATTR_PTR(raid1c34),
+	BTRFS_FEAT_ATTR_PTR(zoned),
 	NULL
 };
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 2c39d15a2beb..5df73001aad4 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -307,6 +307,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
 #define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
+#define BTRFS_FEATURE_INCOMPAT_ZONED		(1ULL << 12)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;

From patchwork Fri Oct 30 13:51:11 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869613
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Damien Le Moal, Josef Bacik
Subject: [PATCH v9 04/41] btrfs: Get zone information of zoned block devices
Date: Fri, 30 Oct 2020 22:51:11 +0900
X-Mailing-List: linux-fsdevel@vger.kernel.org

If a zoned block device is found, get its zone information (number of
zones and zone size) using the new helper function
btrfs_get_dev_zone_info(). To avoid costly run-time zone report commands
when testing the device zone type during block allocation, attach the
seq_zones bitmap to the device structure to indicate whether a zone is
sequential or accepts random writes. It also attaches the empty_zones
bitmap to indicate whether a zone is empty.

This patch also introduces the helper function btrfs_dev_is_sequential(),
to test if the zone storing a block is a sequential write required zone,
and btrfs_dev_is_empty_zone(), to test if the zone is an empty zone.
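The bitmap lookups are cheap because the zone size is a power of two: the zone number of any byte position is just `pos >> zone_size_shift`, and one bit per zone records the zone's type. A user-space sketch of the same indexing scheme (names such as `model_zone_info` are invented for illustration; the kernel uses `test_bit()`/`set_bit()` on `unsigned long` bitmaps instead of the byte array shown here):

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_ZONES 64

/* Per-device zone-type cache: zone size is a power of two, so a shift
 * maps a byte position to its zone number; one bit per zone records
 * whether the zone is sequential-write-required. */
struct model_zone_info {
	uint8_t zone_size_shift;               /* e.g. 26 for 64 MiB zones */
	uint8_t seq_zones[MODEL_NR_ZONES / 8]; /* 1 bit per zone */
};

static bool model_is_sequential(const struct model_zone_info *zi, uint64_t pos)
{
	uint64_t zno = pos >> zi->zone_size_shift;

	return zi->seq_zones[zno / 8] & (1u << (zno % 8));
}

static void model_set_sequential(struct model_zone_info *zi, uint64_t zno)
{
	zi->seq_zones[zno / 8] |= 1u << (zno % 8);
}
```

This is exactly why the patch asserts `is_power_of_2(zone_sectors)`: with a non-power-of-two zone size the shift trick would not work and every lookup would need a division.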
Signed-off-by: Damien Le Moal
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/Makefile      |   1 +
 fs/btrfs/dev-replace.c |   5 ++
 fs/btrfs/super.c       |   5 ++
 fs/btrfs/volumes.c     |  19 ++++-
 fs/btrfs/volumes.h     |   4 +
 fs/btrfs/zoned.c       | 176 +++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/zoned.h       |  86 ++++++++++++++++++++
 7 files changed, 294 insertions(+), 2 deletions(-)
 create mode 100644 fs/btrfs/zoned.c
 create mode 100644 fs/btrfs/zoned.h

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index e738f6206ea5..0497fdc37f90 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -16,6 +16,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
 btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
+btrfs-$(CONFIG_BLK_DEV_ZONED) += zoned.o
 
 btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \
 	tests/extent-buffer-tests.o tests/btrfs-tests.o \
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 20ce1970015f..6f6d77224c2b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -21,6 +21,7 @@
 #include "rcu-string.h"
 #include "dev-replace.h"
 #include "sysfs.h"
+#include "zoned.h"
 
 /*
  * Device replace overview
@@ -291,6 +292,10 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	set_blocksize(device->bdev, BTRFS_BDEV_BLOCKSIZE);
 	device->fs_devices = fs_info->fs_devices;
 
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error;
+
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	list_add(&device->dev_list, &fs_info->fs_devices->devices);
 	fs_info->fs_devices->num_devices++;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8840a4fa81eb..ed55014fd1bd 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2462,6 +2462,11 @@ static void __init btrfs_print_mod_info(void)
 #endif
 #ifdef CONFIG_BTRFS_FS_REF_VERIFY
 			", ref-verify=on"
+#endif
+#ifdef CONFIG_BLK_DEV_ZONED
+			", zoned=yes"
+#else
+			", zoned=no"
 #endif
 			;
 	pr_info("Btrfs loaded, crc32c=%s%s\n", crc32c_impl(), options);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 58b9c419a2b6..e787bf89f761 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -31,6 +31,7 @@
 #include "space-info.h"
 #include "block-group.h"
 #include "discard.h"
+#include "zoned.h"
 
 const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 	[BTRFS_RAID_RAID10] = {
@@ -374,6 +375,7 @@ void btrfs_free_device(struct btrfs_device *device)
 	rcu_string_free(device->name);
 	extent_io_tree_release(&device->alloc_state);
 	bio_put(device->flush_bio);
+	btrfs_destroy_dev_zone_info(device);
 	kfree(device);
 }
 
@@ -667,6 +669,11 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 	clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	device->mode = flags;
 
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret != 0)
+		goto error_free_page;
+
 	fs_devices->open_devices++;
 	if (test_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state) &&
 	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -1143,6 +1150,7 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 		device->bdev = NULL;
 	}
 	clear_bit(BTRFS_DEV_STATE_WRITEABLE, &device->dev_state);
+	btrfs_destroy_dev_zone_info(device);
 	device->fs_info = NULL;
 	atomic_set(&device->dev_stats_ccnt, 0);
 
@@ -2543,6 +2551,14 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 	}
 	rcu_assign_pointer(device->name, name);
 
+	device->fs_info = fs_info;
+	device->bdev = bdev;
+
+	/* Get zone type information of zoned block devices */
+	ret = btrfs_get_dev_zone_info(device);
+	if (ret)
+		goto error_free_device;
+
 	trans = btrfs_start_transaction(root, 0);
 	if (IS_ERR(trans)) {
 		ret = PTR_ERR(trans);
@@ -2559,8 +2575,6 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 					 fs_info->sectorsize);
 	device->disk_total_bytes = device->total_bytes;
 	device->commit_total_bytes = device->total_bytes;
-	device->fs_info = fs_info;
-	device->bdev = bdev;
 	set_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
 	clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, &device->dev_state);
 	device->mode = FMODE_EXCL;
@@ -2707,6 +2721,7 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path
 		sb->s_flags |= SB_RDONLY;
 	if (trans)
 		btrfs_end_transaction(trans);
+	btrfs_destroy_dev_zone_info(device);
 error_free_device:
 	btrfs_free_device(device);
 error:
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index bf27ac07d315..9c07b97a2260 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -51,6 +51,8 @@ struct btrfs_io_geometry {
 #define BTRFS_DEV_STATE_REPLACE_TGT	(3)
 #define BTRFS_DEV_STATE_FLUSH_SENT	(4)
 
+struct btrfs_zoned_device_info;
+
 struct btrfs_device {
 	struct list_head dev_list; /* device_list_mutex */
 	struct list_head dev_alloc_list; /* chunk mutex */
@@ -64,6 +66,8 @@ struct btrfs_device {
 
 	struct block_device *bdev;
 
+	struct btrfs_zoned_device_info *zone_info;
+
 	/* the mode sent to blkdev_get */
 	fmode_t mode;
 
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
new file mode 100644
index 000000000000..5657d654bc44
--- /dev/null
+++ b/fs/btrfs/zoned.c
@@ -0,0 +1,176 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/slab.h>
+#include <linux/blkdev.h>
+#include "ctree.h"
+#include "volumes.h"
+#include "zoned.h"
+#include "rcu-string.h"
+
+/* Maximum number of zones to report per blkdev_report_zones() call */
+#define BTRFS_REPORT_NR_ZONES   4096
+
+static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx,
+			     void *data)
+{
+	struct blk_zone *zones = data;
+
+	memcpy(&zones[idx], zone, sizeof(*zone));
+
+	return 0;
+}
+
+static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
+			       struct blk_zone *zones, unsigned int *nr_zones)
+{
+	int ret;
+
+	if (!*nr_zones)
+		return 0;
+
+	ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, *nr_zones,
+				  copy_zone_info_cb, zones);
+	if (ret < 0) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "get zone at %llu on %s failed %d", pos,
+				 rcu_str_deref(device->name), ret);
+		return ret;
+	}
+	*nr_zones = ret;
+	if (!ret)
+		return -EIO;
+
+	return 0;
+}
+
+int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = NULL;
+	struct block_device *bdev = device->bdev;
+	sector_t nr_sectors = bdev->bd_part->nr_sects;
+	sector_t sector = 0;
+	struct blk_zone *zones = NULL;
+	unsigned int i, nreported = 0, nr_zones;
+	unsigned int zone_sectors;
+	int ret;
+	char devstr[sizeof(device->fs_info->sb->s_id) +
+		    sizeof(" (device )") - 1];
+
+	if (!bdev_is_zoned(bdev))
+		return 0;
+
+	if (device->zone_info)
+		return 0;
+
+	zone_info = kzalloc(sizeof(*zone_info), GFP_KERNEL);
+	if (!zone_info)
+		return -ENOMEM;
+
+	zone_sectors = bdev_zone_sectors(bdev);
+	ASSERT(is_power_of_2(zone_sectors));
+	zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT;
+	zone_info->zone_size_shift = ilog2(zone_info->zone_size);
+	zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev));
+	if (!IS_ALIGNED(nr_sectors, zone_sectors))
+		zone_info->nr_zones++;
+
+	zone_info->seq_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
+	if (!zone_info->seq_zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	zone_info->empty_zones = bitmap_zalloc(zone_info->nr_zones, GFP_KERNEL);
+	if (!zone_info->empty_zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	zones = kcalloc(BTRFS_REPORT_NR_ZONES,
+			sizeof(struct blk_zone), GFP_KERNEL);
+	if (!zones) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	/* Get zones type */
+	while (sector < nr_sectors) {
+		nr_zones = BTRFS_REPORT_NR_ZONES;
+		ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, zones,
+					  &nr_zones);
+		if (ret)
+			goto out;
+
+		for (i = 0; i < nr_zones; i++) {
+			if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
+				set_bit(nreported, zone_info->seq_zones);
+			if (zones[i].cond == BLK_ZONE_COND_EMPTY)
+				set_bit(nreported, zone_info->empty_zones);
+			nreported++;
+		}
+		sector = zones[nr_zones - 1].start + zones[nr_zones - 1].len;
+	}
+
+	if (nreported != zone_info->nr_zones) {
+		btrfs_err_in_rcu(device->fs_info,
+				 "inconsistent number of zones on %s (%u / %u)",
+				 rcu_str_deref(device->name), nreported,
+				 zone_info->nr_zones);
+		ret = -EIO;
+		goto out;
+	}
+
+	kfree(zones);
+
+	device->zone_info = zone_info;
+
+	devstr[0] = 0;
+	if (device->fs_info)
+		snprintf(devstr, sizeof(devstr), " (device %s)",
+			 device->fs_info->sb->s_id);
+
+	rcu_read_lock();
+	pr_info(
+"BTRFS info%s: host-%s zoned block device %s, %u zones of %llu sectors",
+		devstr,
+		bdev_zoned_model(bdev) == BLK_ZONED_HM ? "managed" : "aware",
+		rcu_str_deref(device->name), zone_info->nr_zones,
+		zone_info->zone_size >> SECTOR_SHIFT);
+	rcu_read_unlock();
+
+	return 0;
+
+out:
+	kfree(zones);
+	bitmap_free(zone_info->empty_zones);
+	bitmap_free(zone_info->seq_zones);
+	kfree(zone_info);
+
+	return ret;
+}
+
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return;
+
+	bitmap_free(zone_info->seq_zones);
+	bitmap_free(zone_info->empty_zones);
+	kfree(zone_info);
+	device->zone_info = NULL;
+}
+
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone)
+{
+	unsigned int nr_zones = 1;
+	int ret;
+
+	ret = btrfs_get_dev_zones(device, pos, zone, &nr_zones);
+	if (ret != 0 || !nr_zones)
+		return ret ? ret : -EIO;
+
+	return 0;
+}
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
new file mode 100644
index 000000000000..483229e27908
--- /dev/null
+++ b/fs/btrfs/zoned.h
@@ -0,0 +1,86 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef BTRFS_ZONED_H
+#define BTRFS_ZONED_H
+
+struct btrfs_zoned_device_info {
+	/*
+	 * Number of zones, zone size and types of zones if bdev is a
+	 * zoned block device.
+	 */
+	u64 zone_size;
+	u8  zone_size_shift;
+	u32 nr_zones;
+	unsigned long *seq_zones;
+	unsigned long *empty_zones;
+};
+
+#ifdef CONFIG_BLK_DEV_ZONED
+int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+		       struct blk_zone *zone);
+int btrfs_get_dev_zone_info(struct btrfs_device *device);
+void btrfs_destroy_dev_zone_info(struct btrfs_device *device);
+#else /* CONFIG_BLK_DEV_ZONED */
+static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
+				     struct blk_zone *zone)
+{
+	return 0;
+}
+static inline int btrfs_get_dev_zone_info(struct btrfs_device *device)
+{
+	return 0;
+}
+static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { }
+#endif
+
+static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return false;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->seq_zones);
+}
+
+static inline bool btrfs_dev_is_empty_zone(struct btrfs_device *device, u64 pos)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+
+	if (!zone_info)
+		return true;
+
+	return test_bit(pos >> zone_info->zone_size_shift,
+			zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_empty_zone_bit(struct btrfs_device *device,
+						u64 pos, bool set)
+{
+	struct btrfs_zoned_device_info *zone_info = device->zone_info;
+	unsigned int zno;
+
+	if (!zone_info)
+		return;
+
+	zno = pos >> zone_info->zone_size_shift;
+	if (set)
+		set_bit(zno, zone_info->empty_zones);
+	else
+		clear_bit(zno, zone_info->empty_zones);
+}
+
+static inline void btrfs_dev_set_zone_empty(struct btrfs_device *device,
+					    u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, true);
+}
+
+static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device,
+					      u64 pos)
+{
+	btrfs_dev_set_empty_zone_bit(device, pos, false);
+}
+
+#endif

From patchwork Fri Oct 30 13:51:12 2020
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869637
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Johannes Thumshirn, Damien Le Moal, Josef Bacik
Subject: [PATCH v9 05/41] btrfs: Check and enable ZONED mode
Date: Fri, 30 Oct 2020 22:51:12 +0900
Message-Id: <599d306d41880e3e3242120a40a78b81f6ed0473.1604065695.git.naohiro.aota@wdc.com>

This commit introduces the function btrfs_check_zoned_mode() to check whether the ZONED flag is enabled on the file system and whether the file system consists of zoned devices with equal zone sizes.
Signed-off-by: Johannes Thumshirn Signed-off-by: Damien Le Moal Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/ctree.h | 10 ++++++ fs/btrfs/dev-replace.c | 7 ++++ fs/btrfs/disk-io.c | 11 ++++++ fs/btrfs/super.c | 1 + fs/btrfs/volumes.c | 5 +++ fs/btrfs/zoned.c | 78 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 26 ++++++++++++++ 7 files changed, 138 insertions(+) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index aac3d6f4e35b..25fd4e97dd2a 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -948,6 +948,12 @@ struct btrfs_fs_info { /* Type of exclusive operation running */ unsigned long exclusive_operation; + /* Zone size when in ZONED mode */ + union { + u64 zone_size; + u64 zoned; + }; + #ifdef CONFIG_BTRFS_FS_REF_VERIFY spinlock_t ref_verify_lock; struct rb_root block_tree; @@ -3595,4 +3601,8 @@ static inline int btrfs_is_testing(struct btrfs_fs_info *fs_info) } #endif +static inline bool btrfs_is_zoned(struct btrfs_fs_info *fs_info) +{ + return fs_info->zoned != 0; +} #endif diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 6f6d77224c2b..5e3554482af1 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -238,6 +238,13 @@ static int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, return PTR_ERR(bdev); } + if (!btrfs_check_device_zone_type(fs_info, bdev)) { + btrfs_err(fs_info, + "zone type of target device mismatch with the filesystem!"); + ret = -EINVAL; + goto error; + } + sync_blockdev(bdev); list_for_each_entry(device, &fs_info->fs_devices->devices, dev_list) { diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 764001609a15..9bc51cff48b8 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -42,6 +42,7 @@ #include "block-group.h" #include "discard.h" #include "space-info.h" +#include "zoned.h" #define BTRFS_SUPER_FLAG_SUPP (BTRFS_HEADER_FLAG_WRITTEN |\ BTRFS_HEADER_FLAG_RELOC |\ @@ -2976,6 +2977,8 @@ int __cold open_ctree(struct super_block *sb, struct 
btrfs_fs_devices *fs_device if (features & BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA) btrfs_info(fs_info, "has skinny extents"); + fs_info->zoned = features & BTRFS_FEATURE_INCOMPAT_ZONED; + /* * flag our filesystem as having big metadata blocks if * they are bigger than the page size @@ -3130,7 +3133,15 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device btrfs_free_extra_devids(fs_devices, 1); + ret = btrfs_check_zoned_mode(fs_info); + if (ret) { + btrfs_err(fs_info, "failed to init ZONED mode: %d", + ret); + goto fail_block_groups; + } + ret = btrfs_sysfs_add_fsid(fs_devices); + if (ret) { btrfs_err(fs_info, "failed to init sysfs fsid interface: %d", ret); diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index ed55014fd1bd..3312fe08168f 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -44,6 +44,7 @@ #include "backref.h" #include "space-info.h" #include "sysfs.h" +#include "zoned.h" #include "tests/btrfs-tests.h" #include "block-group.h" #include "discard.h" diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index e787bf89f761..10827892c086 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2518,6 +2518,11 @@ int btrfs_init_new_device(struct btrfs_fs_info *fs_info, const char *device_path if (IS_ERR(bdev)) return PTR_ERR(bdev); + if (!btrfs_check_device_zone_type(fs_info, bdev)) { + ret = -EINVAL; + goto error; + } + if (fs_devices->seeding) { seeding_dev = 1; down_write(&sb->s_umount); diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 5657d654bc44..e1cdff5af3a3 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -174,3 +174,81 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, return 0; } + +int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info) +{ + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; + struct btrfs_device *device; + u64 hmzoned_devices = 0; + u64 nr_devices = 0; + u64 zone_size = 0; + int incompat_zoned = btrfs_is_zoned(fs_info); + int ret = 0; + + /* Count zoned 
devices */ + list_for_each_entry(device, &fs_devices->devices, dev_list) { + enum blk_zoned_model model; + + if (!device->bdev) + continue; + + model = bdev_zoned_model(device->bdev); + if (model == BLK_ZONED_HM || + (model == BLK_ZONED_HA && incompat_zoned)) { + hmzoned_devices++; + if (!zone_size) { + zone_size = device->zone_info->zone_size; + } else if (device->zone_info->zone_size != zone_size) { + btrfs_err(fs_info, + "Zoned block devices must have equal zone sizes"); + ret = -EINVAL; + goto out; + } + } + nr_devices++; + } + + if (!hmzoned_devices && !incompat_zoned) + goto out; + + if (!hmzoned_devices && incompat_zoned) { + /* No zoned block device found on ZONED FS */ + btrfs_err(fs_info, + "ZONED enabled file system should have zoned devices"); + ret = -EINVAL; + goto out; + } + + if (hmzoned_devices && !incompat_zoned) { + btrfs_err(fs_info, + "Enable ZONED mode to mount HMZONED device"); + ret = -EINVAL; + goto out; + } + + if (hmzoned_devices != nr_devices) { + btrfs_err(fs_info, + "zoned devices cannot be mixed with regular devices"); + ret = -EINVAL; + goto out; + } + + /* + * stripe_size is always aligned to BTRFS_STRIPE_LEN in + * __btrfs_alloc_chunk(). Since we want stripe_len == zone_size, + * check the alignment here. 
+ */ + if (!IS_ALIGNED(zone_size, BTRFS_STRIPE_LEN)) { + btrfs_err(fs_info, + "zone size is not aligned to BTRFS_STRIPE_LEN"); + ret = -EINVAL; + goto out; + } + + fs_info->zone_size = zone_size; + + btrfs_info(fs_info, "ZONED mode enabled, zone size %llu B", + fs_info->zone_size); +out: + return ret; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 483229e27908..c4c63c4294f2 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -3,6 +3,8 @@ #ifndef BTRFS_ZONED_H #define BTRFS_ZONED_H +#include + struct btrfs_zoned_device_info { /* * Number of zones, zone size and types of zones if bdev is a @@ -20,6 +22,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone); int btrfs_get_dev_zone_info(struct btrfs_device *device); void btrfs_destroy_dev_zone_info(struct btrfs_device *device); +int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -31,6 +34,14 @@ static inline int btrfs_get_dev_zone_info(struct btrfs_device *device) return 0; } static inline void btrfs_destroy_dev_zone_info(struct btrfs_device *device) { } +static inline int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info) +{ + if (!btrfs_is_zoned(fs_info)) + return 0; + + btrfs_err(fs_info, "Zoned block devices support is not enabled"); + return -EOPNOTSUPP; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -83,4 +94,19 @@ static inline void btrfs_dev_clear_zone_empty(struct btrfs_device *device, btrfs_dev_set_empty_zone_bit(device, pos, false); } +static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info, + struct block_device *bdev) +{ + u64 zone_size; + + if (btrfs_is_zoned(fs_info)) { + zone_size = (u64)bdev_zone_sectors(bdev) << SECTOR_SHIFT; + /* Do not allow non-zoned device */ + return bdev_is_zoned(bdev) && fs_info->zone_size == zone_size; + } + + /* Do 
not allow Host Managed zoned device */ + return bdev_zoned_model(bdev) != BLK_ZONED_HM; +} + #endif From patchwork Fri Oct 30 13:51:13 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869615
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 06/41] btrfs: introduce max_zone_append_size
Date: Fri, 30 Oct 2020 22:51:13 +0900
Message-Id: <066af35477cd9dd3a096128df4aef3b583e93f52.1604065695.git.naohiro.aota@wdc.com>

The zone append write command has a maximum I/O size it accepts.
This is because a zone append write command cannot be split, as we ask the device to place the data into a specific target zone and the device responds with the actual written location of the data. Introduce max_zone_append_size to zone_info and fs_info to track the value, so we can limit all I/O to a zoned block device that we want to write using the zone append command to the device's limits. Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/ctree.h | 2 ++ fs/btrfs/zoned.c | 19 ++++++++++++++++--- fs/btrfs/zoned.h | 1 + 3 files changed, 19 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 25fd4e97dd2a..383c83a1f5b5 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -953,6 +953,8 @@ struct btrfs_fs_info { u64 zone_size; u64 zoned; }; + /* max size to emit ZONE_APPEND write command */ + u64 max_zone_append_size; #ifdef CONFIG_BTRFS_FS_REF_VERIFY spinlock_t ref_verify_lock; diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index e1cdff5af3a3..1b42e13b8227 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -47,6 +47,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device) { struct btrfs_zoned_device_info *zone_info = NULL; struct block_device *bdev = device->bdev; + struct request_queue *q = bdev_get_queue(bdev); sector_t nr_sectors = bdev->bd_part->nr_sects; sector_t sector = 0; struct blk_zone *zones = NULL; @@ -70,6 +71,8 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device) ASSERT(is_power_of_2(zone_sectors)); zone_info->zone_size = (u64)zone_sectors << SECTOR_SHIFT; zone_info->zone_size_shift = ilog2(zone_info->zone_size); + zone_info->max_zone_append_size = + (u64)queue_max_zone_append_sectors(q) << SECTOR_SHIFT; zone_info->nr_zones = nr_sectors >> ilog2(bdev_zone_sectors(bdev)); if (!IS_ALIGNED(nr_sectors, zone_sectors)) zone_info->nr_zones++; @@ -182,7 +185,8 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info) u64 hmzoned_devices = 0; u64 nr_devices = 0; u64 zone_size = 0; - int 
incompat_zoned = btrfs_is_zoned(fs_info); + u64 max_zone_append_size = 0; + bool incompat_zoned = btrfs_is_zoned(fs_info); int ret = 0; /* Count zoned devices */ @@ -195,15 +199,23 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info) model = bdev_zoned_model(device->bdev); if (model == BLK_ZONED_HM || (model == BLK_ZONED_HA && incompat_zoned)) { + struct btrfs_zoned_device_info *zone_info = + device->zone_info; + hmzoned_devices++; if (!zone_size) { - zone_size = device->zone_info->zone_size; - } else if (device->zone_info->zone_size != zone_size) { + zone_size = zone_info->zone_size; + } else if (zone_info->zone_size != zone_size) { btrfs_err(fs_info, "Zoned block devices must have equal zone sizes"); ret = -EINVAL; goto out; } + if (!max_zone_append_size || + (zone_info->max_zone_append_size && + zone_info->max_zone_append_size < max_zone_append_size)) + max_zone_append_size = + zone_info->max_zone_append_size; } nr_devices++; } @@ -246,6 +258,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info) } fs_info->zone_size = zone_size; + fs_info->max_zone_append_size = max_zone_append_size; btrfs_info(fs_info, "ZONED mode enabled, zone size %llu B", fs_info->zone_size); diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index c4c63c4294f2..a63f6177f9ee 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -12,6 +12,7 @@ struct btrfs_zoned_device_info { */ u64 zone_size; u8 zone_size_shift; + u64 max_zone_append_size; u32 nr_zones; unsigned long *seq_zones; unsigned long *empty_zones; From patchwork Fri Oct 30 13:51:14 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869617
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 07/41] btrfs: disallow space_cache in ZONED mode
Date: Fri, 30 Oct 2020 22:51:14 +0900

As updates to the space cache v1 are made in place, the space cache cannot be located over sequential zones, and there is no guarantee that the device will have enough conventional zones to store this cache. Resolve this problem by disabling the space cache v1 completely. This does not introduce any problems for sequential block groups: all the free space is located after the allocation pointer and there is no free space before the pointer, so there is no need for such a cache. Note: we could technically use the free-space-tree (space cache v2) in ZONED mode. But since ZONED mode now always allocates extents in a block group sequentially regardless of the underlying device zone type, there is no point in enabling and maintaining the tree. For the same reason, NODATACOW is also disabled.
INODE_MAP_CACHE is also disabled to avoid preallocation in the INODE_MAP_CACHE inode. In summary, ZONED will disable:

| Disabled features | Reason                                            |
|-------------------+---------------------------------------------------|
| RAID/Dup          | Cannot handle two zone append writes to different |
|                   | zones                                             |
|-------------------+---------------------------------------------------|
| space_cache (v1)  | In-place updating                                 |
| NODATACOW         | In-place updating                                 |
|-------------------+---------------------------------------------------|
| fallocate         | Reserved extent will be a write hole              |
| INODE_MAP_CACHE   | Need pre-allocation. (and will be deprecated?)    |
|-------------------+---------------------------------------------------|
| MIXED_BG          | Allocated metadata region will be write holes for |
|                   | data writes                                       |

Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/super.c | 12 ++++++++++-- fs/btrfs/zoned.c | 18 ++++++++++++++++++ fs/btrfs/zoned.h | 5 +++++ 3 files changed, 33 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 3312fe08168f..9064ca62b0a0 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -525,8 +525,14 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, cache_gen = btrfs_super_cache_generation(info->super_copy); if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE)) btrfs_set_opt(info->mount_opt, FREE_SPACE_TREE); - else if (cache_gen) - btrfs_set_opt(info->mount_opt, SPACE_CACHE); + else if (cache_gen) { + if (btrfs_is_zoned(info)) { + btrfs_info(info, + "clearing existing space cache in ZONED mode"); + btrfs_set_super_cache_generation(info->super_copy, 0); + } else + btrfs_set_opt(info->mount_opt, SPACE_CACHE); + } /* * Even the options are empty, we still need to do extra check @@ -985,6 +991,8 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, ret = -EINVAL; } + if (!ret) + ret = btrfs_check_mountopts_zoned(info); if (!ret && btrfs_test_opt(info, SPACE_CACHE)) 
btrfs_info(info, "disk space caching is enabled"); if (!ret && btrfs_test_opt(info, FREE_SPACE_TREE)) diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 1b42e13b8227..3885fa327049 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -265,3 +265,21 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info) out: return ret; } + +int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) +{ + if (!btrfs_is_zoned(info)) + return 0; + + /* + * SPACE CACHE writing is not CoWed. Disable that to avoid write + * errors in sequential zones. + */ + if (btrfs_test_opt(info, SPACE_CACHE)) { + btrfs_err(info, + "space cache v1 not supported in ZONED mode"); + return -EOPNOTSUPP; + } + + return 0; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index a63f6177f9ee..0b7756a7104d 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -24,6 +24,7 @@ int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, int btrfs_get_dev_zone_info(struct btrfs_device *device); void btrfs_destroy_dev_zone_info(struct btrfs_device *device); int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info); +int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -43,6 +44,10 @@ static inline int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info) btrfs_err(fs_info, "Zoned block devices support is not enabled"); return -EOPNOTSUPP; } +static inline int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) +{ + return 0; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Fri Oct 30 13:51:15 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869621
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Johannes Thumshirn
Subject: [PATCH v9 08/41] btrfs: disallow NODATACOW in ZONED mode
Date: Fri, 30 Oct 2020 22:51:15 +0900
Message-Id: <4129ba21e887cff5dc707b34920fb825ca1c61a4.1604065695.git.naohiro.aota@wdc.com>

NODATACOW implies overwriting file data in place on the device, which is impossible in sequential write required zones. Disable NODATACOW both globally via the mount option and per file, by masking the FS_NOCOW_FL attribute.
Signed-off-by: Johannes Thumshirn
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/ioctl.c | 16 ++++++++++++++++
 fs/btrfs/zoned.c |  6 ++++++
 2 files changed, 22 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ab408a23ba32..e1036a9ce881 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -193,6 +193,18 @@ static int check_fsflags(unsigned int old_flags, unsigned int flags)
 	return 0;
 }
 
+static int check_fsflags_compatible(struct btrfs_fs_info *fs_info,
+				    unsigned int flags)
+{
+	bool zoned = btrfs_is_zoned(fs_info);
+
+	if (zoned && (flags & FS_NOCOW_FL))
+		return -EPERM;
+
+	return 0;
+}
+
+
 static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 {
 	struct inode *inode = file_inode(file);
@@ -230,6 +242,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 	if (ret)
 		goto out_unlock;
 
+	ret = check_fsflags_compatible(fs_info, fsflags);
+	if (ret)
+		goto out_unlock;
+
 	binode_flags = binode->flags;
 	if (fsflags & FS_SYNC_FL)
 		binode_flags |= BTRFS_INODE_SYNC;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 3885fa327049..1939b3ee6c10 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -281,5 +281,11 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info)
 		return -EOPNOTSUPP;
 	}
 
+	if (btrfs_test_opt(info, NODATACOW)) {
+		btrfs_err(info,
+			  "cannot enable nodatacow with ZONED mode");
+		return -EOPNOTSUPP;
+	}
+
 	return 0;
 }

From patchwork Fri Oct 30 13:51:16 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Johannes Thumshirn, Josef Bacik
Subject: [PATCH v9 09/41] btrfs: disable fallocate in ZONED mode
Date: Fri, 30 Oct 2020 22:51:16 +0900
Message-Id: <51d6321cf98f8176e6214aea5ab83325832194a7.1604065695.git.naohiro.aota@wdc.com>
X-Mailer: git-send-email 2.27.0
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

fallocate() is implemented by allocating an actual extent instead of just
making a reservation. Doing so can expose the sequential write constraint
of host-managed zoned block devices to the application, which would break
the POSIX semantics of the fallocated file. To avoid this, report
fallocate() as not supported when in ZONED mode for now.

In the future, we may be able to implement an "in-memory" fallocate() in
ZONED mode by accounting the reservation in space_info->bytes_may_use or
similar.
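The gate this patch adds sits before btrfs_fallocate()'s regular mode validation, so on a zoned filesystem every mode is rejected up front. A sketch of that ordering (mode flag values are from include/uapi/linux/falloc.h; the function is a simplified stand-in for the kernel code path, not the real one):

```c
#include <errno.h>
#include <stdbool.h>

/* Mode flags as defined in include/uapi/linux/falloc.h */
#define FALLOC_FL_KEEP_SIZE	0x01
#define FALLOC_FL_PUNCH_HOLE	0x02
#define FALLOC_FL_ZERO_RANGE	0x10

/*
 * Simplified stand-in for the entry checks in btrfs_fallocate():
 * ZONED mode rejects every request first; otherwise only the
 * supported mode-flag combinations pass.
 */
static int fallocate_mode_check(bool zoned, int mode)
{
	if (zoned)
		return -EOPNOTSUPP;	/* the check this patch adds */

	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
		     FALLOC_FL_ZERO_RANGE))
		return -EOPNOTSUPP;	/* the existing mode check */

	return 0;
}
```

Note that placing the ZONED check first means even plain preallocation (mode 0), which is otherwise always valid, fails with EOPNOTSUPP.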
Signed-off-by: Naohiro Aota
Reviewed-by: Johannes Thumshirn
Reviewed-by: Josef Bacik
---
 fs/btrfs/file.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0ff659455b1e..68938a43081e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -3341,6 +3341,10 @@ static long btrfs_fallocate(struct file *file, int mode,
 	alloc_end = round_up(offset + len, blocksize);
 	cur_offset = alloc_start;
 
+	/* Do not allow fallocate in ZONED mode */
+	if (btrfs_is_zoned(btrfs_sb(inode->i_sb)))
+		return -EOPNOTSUPP;
+
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
 		     FALLOC_FL_ZERO_RANGE))

From patchwork Fri Oct 30 13:51:17 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Josef Bacik
Subject: [PATCH v9 10/41] btrfs: disallow mixed-bg in ZONED mode
Date: Fri, 30 Oct 2020 22:51:17 +0900
Message-Id: <1aec842d98f9b38674aebce12606b9267f475164.1604065695.git.naohiro.aota@wdc.com>
X-Mailer: git-send-email 2.27.0
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

Placing both data and metadata in a block group is impossible in ZONED
mode. For data, we can allocate space for it and write it out immediately
after the allocation. For metadata, however, we cannot do so, because the
logical addresses are recorded in other metadata buffers to build up the
trees. As a result, a data buffer can be placed after a metadata buffer
that has not been written out yet. Writing out the data buffer first
would break the sequential write rule.

Check for and disallow MIXED_BG with ZONED mode.

Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/zoned.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1939b3ee6c10..ae509699da14 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -257,6 +257,13 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info)
 		goto out;
 	}
 
+	if (btrfs_fs_incompat(fs_info, MIXED_GROUPS)) {
+		btrfs_err(fs_info,
+			"ZONED mode is not allowed for mixed block groups");
+		ret = -EINVAL;
+		goto out;
+	}
+
 	fs_info->zone_size = zone_size;
 	fs_info->max_zone_append_size = max_zone_append_size;

From patchwork Fri Oct 30 13:51:18 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 11/41] btrfs: implement log-structured superblock for ZONED mode
Date: Fri, 30 Oct 2020 22:51:18 +0900
X-Mailer: git-send-email 2.27.0
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

The superblock (and its copies) is the only data structure in btrfs which
has a fixed location on a device. Since we cannot overwrite it in a
sequential write required zone, we cannot place the superblock in such a
zone.

One easy solution is to limit the superblock and its copies to
conventional zones only. However, this method has two downsides. First,
the number of superblock copies is reduced: the location of the second
copy is at 256GB, which falls in a sequential write required zone on
typical devices on the market today, so only the primary superblock and
one copy would remain. Second, we could not support devices which have no
conventional zones at all.

To solve these two problems, we employ superblock log writing.
It uses two zones as a circular buffer to write updated superblocks. Once the first zone is filled up, start writing into the second buffer. Then, when the both zones are filled up and before start writing to the first zone again, it reset the first zone. We can determine the position of the latest superblock by reading write pointer information from a device. One corner case is when the both zones are full. For this situation, we read out the last superblock of each zone, and compare them to determine which zone is older. The following zones are reserved as the circular buffer on ZONED btrfs. - The primary superblock: zones 0 and 1 - The first copy: zones 16 and 17 - The second copy: zones 1024 or zone at 256GB which is minimum, and next to it If these reserved zones are conventional, superblock is written fixed at the start of the zone without logging. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 9 ++ fs/btrfs/disk-io.c | 41 +++++- fs/btrfs/scrub.c | 3 + fs/btrfs/volumes.c | 21 ++- fs/btrfs/zoned.c | 311 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 40 ++++++ 6 files changed, 413 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index c0f1d6818df7..e989c66aa764 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1723,6 +1723,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, static int exclude_super_stripes(struct btrfs_block_group *cache) { struct btrfs_fs_info *fs_info = cache->fs_info; + bool zoned = btrfs_is_zoned(fs_info); u64 bytenr; u64 *logical; int stripe_len; @@ -1744,6 +1745,14 @@ static int exclude_super_stripes(struct btrfs_block_group *cache) if (ret) return ret; + /* shouldn't have super stripes in sequential zones */ + if (zoned && nr) { + btrfs_err(fs_info, + "Zoned btrfs's block group %llu should not have super blocks", + cache->start); + return -EUCLEAN; + } + while (nr--) { u64 len = min_t(u64, stripe_len, cache->start + cache->length - 
logical[nr]); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 9bc51cff48b8..fd8b970ee92c 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3423,10 +3423,17 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev, { struct btrfs_super_block *super; struct page *page; - u64 bytenr; + u64 bytenr, bytenr_orig; struct address_space *mapping = bdev->bd_inode->i_mapping; + int ret; + + bytenr_orig = btrfs_sb_offset(copy_num); + ret = btrfs_sb_log_location_bdev(bdev, copy_num, READ, &bytenr); + if (ret == -ENOENT) + return ERR_PTR(-EINVAL); + else if (ret) + return ERR_PTR(ret); - bytenr = btrfs_sb_offset(copy_num); if (bytenr + BTRFS_SUPER_INFO_SIZE >= i_size_read(bdev->bd_inode)) return ERR_PTR(-EINVAL); @@ -3440,7 +3447,7 @@ struct btrfs_super_block *btrfs_read_dev_one_super(struct block_device *bdev, return ERR_PTR(-ENODATA); } - if (btrfs_super_bytenr(super) != bytenr) { + if (btrfs_super_bytenr(super) != bytenr_orig) { btrfs_release_disk_super(super); return ERR_PTR(-EINVAL); } @@ -3495,7 +3502,8 @@ static int write_dev_supers(struct btrfs_device *device, SHASH_DESC_ON_STACK(shash, fs_info->csum_shash); int i; int errors = 0; - u64 bytenr; + int ret; + u64 bytenr, bytenr_orig; if (max_mirrors == 0) max_mirrors = BTRFS_SUPER_MIRROR_MAX; @@ -3507,12 +3515,21 @@ static int write_dev_supers(struct btrfs_device *device, struct bio *bio; struct btrfs_super_block *disk_super; - bytenr = btrfs_sb_offset(i); + bytenr_orig = btrfs_sb_offset(i); + ret = btrfs_sb_log_location(device, i, WRITE, &bytenr); + if (ret == -ENOENT) + continue; + else if (ret < 0) { + btrfs_err(device->fs_info, "couldn't get super block location for mirror %d", + i); + errors++; + continue; + } if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->commit_total_bytes) break; - btrfs_set_super_bytenr(sb, bytenr); + btrfs_set_super_bytenr(sb, bytenr_orig); crypto_shash_digest(shash, (const char *)sb + BTRFS_CSUM_SIZE, BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE, @@ -3557,6 
+3574,7 @@ static int write_dev_supers(struct btrfs_device *device, bio->bi_opf |= REQ_FUA; btrfsic_submit_bio(bio); + btrfs_advance_sb_log(device, i); } return errors < i ? 0 : -1; } @@ -3573,6 +3591,7 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) int i; int errors = 0; bool primary_failed = false; + int ret; u64 bytenr; if (max_mirrors == 0) @@ -3581,7 +3600,15 @@ static int wait_dev_supers(struct btrfs_device *device, int max_mirrors) for (i = 0; i < max_mirrors; i++) { struct page *page; - bytenr = btrfs_sb_offset(i); + ret = btrfs_sb_log_location(device, i, READ, &bytenr); + if (ret == -ENOENT) + break; + else if (ret < 0) { + errors++; + if (i == 0) + primary_failed = true; + continue; + } if (bytenr + BTRFS_SUPER_INFO_SIZE >= device->commit_total_bytes) break; diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index cf63f1e27a27..aa1b36cf5c88 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -20,6 +20,7 @@ #include "rcu-string.h" #include "raid56.h" #include "block-group.h" +#include "zoned.h" /* * This is only the first step towards a full-features scrub. 
It reads all @@ -3704,6 +3705,8 @@ static noinline_for_stack int scrub_supers(struct scrub_ctx *sctx, if (bytenr + BTRFS_SUPER_INFO_SIZE > scrub_dev->commit_total_bytes) break; + if (!btrfs_check_super_location(scrub_dev, bytenr)) + continue; ret = scrub_pages(sctx, bytenr, BTRFS_SUPER_INFO_SIZE, bytenr, scrub_dev, BTRFS_EXTENT_FLAG_SUPER, gen, i, diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 10827892c086..db884b96a5ea 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1282,7 +1282,8 @@ void btrfs_release_disk_super(struct btrfs_super_block *super) } static struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev, - u64 bytenr) + u64 bytenr, + u64 bytenr_orig) { struct btrfs_super_block *disk_super; struct page *page; @@ -1313,7 +1314,7 @@ static struct btrfs_super_block *btrfs_read_disk_super(struct block_device *bdev /* align our pointer to the offset of the super block */ disk_super = p + offset_in_page(bytenr); - if (btrfs_super_bytenr(disk_super) != bytenr || + if (btrfs_super_bytenr(disk_super) != bytenr_orig || btrfs_super_magic(disk_super) != BTRFS_MAGIC) { btrfs_release_disk_super(p); return ERR_PTR(-EINVAL); @@ -1348,7 +1349,8 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, bool new_device_added = false; struct btrfs_device *device = NULL; struct block_device *bdev; - u64 bytenr; + u64 bytenr, bytenr_orig; + int ret; lockdep_assert_held(&uuid_mutex); @@ -1358,14 +1360,18 @@ struct btrfs_device *btrfs_scan_one_device(const char *path, fmode_t flags, * So, we need to add a special mount option to scan for * later supers, using BTRFS_SUPER_MIRROR_MAX instead */ - bytenr = btrfs_sb_offset(0); flags |= FMODE_EXCL; bdev = blkdev_get_by_path(path, flags, holder); if (IS_ERR(bdev)) return ERR_CAST(bdev); - disk_super = btrfs_read_disk_super(bdev, bytenr); + bytenr_orig = btrfs_sb_offset(0); + ret = btrfs_sb_log_location_bdev(bdev, 0, READ, &bytenr); + if (ret) + return ERR_PTR(ret); + + 
disk_super = btrfs_read_disk_super(bdev, bytenr, bytenr_orig); if (IS_ERR(disk_super)) { device = ERR_CAST(disk_super); goto error_bdev_put; @@ -2029,6 +2035,11 @@ void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info, if (IS_ERR(disk_super)) continue; + if (bdev_is_zoned(bdev)) { + btrfs_reset_sb_log_zones(bdev, copy_num); + continue; + } + memset(&disk_super->magic, 0, sizeof(disk_super->magic)); page = virt_to_page(disk_super); diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index ae509699da14..d5487cba203b 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -20,6 +20,25 @@ static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, return 0; } +static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zone, + u64 *wp_ret); + +static inline u32 sb_zone_number(u8 shift, int mirror) +{ + ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX); + + switch (mirror) { + case 0: + return 0; + case 1: + return 16; + case 2: + return min(btrfs_sb_offset(mirror) >> shift, 1024ULL); + } + + return 0; +} + static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos, struct blk_zone *zones, unsigned int *nr_zones) { @@ -123,6 +142,49 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device) goto out; } + /* validate superblock log */ + nr_zones = 2; + for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) { + u32 sb_zone = sb_zone_number(zone_info->zone_size_shift, i); + u64 sb_wp; + + if (sb_zone + 1 >= zone_info->nr_zones) + continue; + + sector = sb_zone << (zone_info->zone_size_shift - SECTOR_SHIFT); + ret = btrfs_get_dev_zones(device, sector << SECTOR_SHIFT, + &zone_info->sb_zones[2 * i], + &nr_zones); + if (ret) + goto out; + if (nr_zones != 2) { + btrfs_err_in_rcu(device->fs_info, + "failed to read SB log zone info at device %s zone %u", + rcu_str_deref(device->name), sb_zone); + ret = -EIO; + goto out; + } + + /* + * If zones[0] is conventional, always use the beggining of + * the zone to record superblock. No need to validate in + * that case. 
+ */ + if (zone_info->sb_zones[2 * i].type == BLK_ZONE_TYPE_CONVENTIONAL) + continue; + + ret = sb_write_pointer(device->bdev, + &zone_info->sb_zones[2 * i], &sb_wp); + if (ret != -ENOENT && ret) { + btrfs_err_in_rcu(device->fs_info, + "SB log zone corrupted: device %s zone %u", + rcu_str_deref(device->name), sb_zone); + ret = -EUCLEAN; + goto out; + } + } + + kfree(zones); device->zone_info = zone_info; @@ -296,3 +358,252 @@ int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) return 0; } + +static int sb_write_pointer(struct block_device *bdev, struct blk_zone *zones, + u64 *wp_ret) +{ + bool empty[2]; + bool full[2]; + sector_t sector; + + ASSERT(zones[0].type != BLK_ZONE_TYPE_CONVENTIONAL && + zones[1].type != BLK_ZONE_TYPE_CONVENTIONAL); + + empty[0] = zones[0].cond == BLK_ZONE_COND_EMPTY; + empty[1] = zones[1].cond == BLK_ZONE_COND_EMPTY; + full[0] = zones[0].cond == BLK_ZONE_COND_FULL; + full[1] = zones[1].cond == BLK_ZONE_COND_FULL; + + /* + * Possible state of log buffer zones + * + * E I F + * E * x 0 + * I 0 x 0 + * F 1 1 C + * + * Row: zones[0] + * Col: zones[1] + * State: + * E: Empty, I: In-Use, F: Full + * Log position: + * *: Special case, no superblock is written + * 0: Use write pointer of zones[0] + * 1: Use write pointer of zones[1] + * C: Compare SBs from zones[0] and zones[1], use the newer one + * x: Invalid state + */ + + if (empty[0] && empty[1]) { + /* special case to distinguish no superblock to read */ + *wp_ret = zones[0].start << SECTOR_SHIFT; + return -ENOENT; + } else if (full[0] && full[1]) { + /* Compare two super blocks */ + struct address_space *mapping = bdev->bd_inode->i_mapping; + struct page *page[2]; + struct btrfs_super_block *super[2]; + int i; + + for (i = 0; i < 2; i++) { + u64 bytenr = ((zones[i].start + zones[i].len) << SECTOR_SHIFT) - + BTRFS_SUPER_INFO_SIZE; + + page[i] = read_cache_page_gfp(mapping, bytenr >> PAGE_SHIFT, GFP_NOFS); + if (IS_ERR(page[i])) { + if (i == 1) + btrfs_release_disk_super(super[0]); + 
return PTR_ERR(page[i]); + } + super[i] = page_address(page[i]); + } + + if (super[0]->generation > super[1]->generation) + sector = zones[1].start; + else + sector = zones[0].start; + + for (i = 0; i < 2; i++) + btrfs_release_disk_super(super[i]); + } else if (!full[0] && (empty[1] || full[1])) { + sector = zones[0].wp; + } else if (full[0]) { + sector = zones[1].wp; + } else { + return -EUCLEAN; + } + *wp_ret = sector << SECTOR_SHIFT; + return 0; +} + +static int sb_log_location(struct block_device *bdev, struct blk_zone *zones, + int rw, u64 *bytenr_ret) +{ + u64 wp; + int ret; + + if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) { + *bytenr_ret = zones[0].start << SECTOR_SHIFT; + return 0; + } + + ret = sb_write_pointer(bdev, zones, &wp); + if (ret != -ENOENT && ret < 0) + return ret; + + if (rw == WRITE) { + struct blk_zone *reset = NULL; + + if (wp == zones[0].start << SECTOR_SHIFT) + reset = &zones[0]; + else if (wp == zones[1].start << SECTOR_SHIFT) + reset = &zones[1]; + + if (reset && reset->cond != BLK_ZONE_COND_EMPTY) { + ASSERT(reset->cond == BLK_ZONE_COND_FULL); + + ret = blkdev_zone_mgmt(bdev, REQ_OP_ZONE_RESET, + reset->start, reset->len, + GFP_NOFS); + if (ret) + return ret; + + reset->cond = BLK_ZONE_COND_EMPTY; + reset->wp = reset->start; + } + } else if (ret != -ENOENT) { + /* For READ, we want the precious one */ + if (wp == zones[0].start << SECTOR_SHIFT) + wp = (zones[1].start + zones[1].len) << SECTOR_SHIFT; + wp -= BTRFS_SUPER_INFO_SIZE; + } + + *bytenr_ret = wp; + return 0; + +} + +int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw, + u64 *bytenr_ret) +{ + struct blk_zone zones[2]; + unsigned int zone_sectors; + u32 sb_zone; + int ret; + u64 zone_size; + u8 zone_sectors_shift; + sector_t nr_sectors = bdev->bd_part->nr_sects; + u32 nr_zones; + + if (!bdev_is_zoned(bdev)) { + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; + } + + ASSERT(rw == READ || rw == WRITE); + + zone_sectors = bdev_zone_sectors(bdev); + if 
(!is_power_of_2(zone_sectors)) + return -EINVAL; + zone_size = zone_sectors << SECTOR_SHIFT; + zone_sectors_shift = ilog2(zone_sectors); + nr_zones = nr_sectors >> zone_sectors_shift; + + sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror); + if (sb_zone + 1 >= nr_zones) + return -ENOENT; + + ret = blkdev_report_zones(bdev, sb_zone << zone_sectors_shift, 2, + copy_zone_info_cb, zones); + if (ret < 0) + return ret; + if (ret != 2) + return -EIO; + + return sb_log_location(bdev, zones, rw, bytenr_ret); +} + +int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, + u64 *bytenr_ret) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + u32 zone_num; + + if (!zinfo) { + *bytenr_ret = btrfs_sb_offset(mirror); + return 0; + } + + zone_num = sb_zone_number(zinfo->zone_size_shift, mirror); + if (zone_num + 1 >= zinfo->nr_zones) + return -ENOENT; + + return sb_log_location(device->bdev, &zinfo->sb_zones[2 * mirror], rw, + bytenr_ret); +} + +static inline bool is_sb_log_zone(struct btrfs_zoned_device_info *zinfo, + int mirror) +{ + u32 zone_num; + + if (!zinfo) + return false; + + zone_num = sb_zone_number(zinfo->zone_size_shift, mirror); + if (zone_num + 1 >= zinfo->nr_zones) + return false; + + if (!test_bit(zone_num, zinfo->seq_zones)) + return false; + + return true; +} + +void btrfs_advance_sb_log(struct btrfs_device *device, int mirror) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + struct blk_zone *zone; + + if (!is_sb_log_zone(zinfo, mirror)) + return; + + zone = &zinfo->sb_zones[2 * mirror]; + if (zone->cond != BLK_ZONE_COND_FULL) { + if (zone->cond == BLK_ZONE_COND_EMPTY) + zone->cond = BLK_ZONE_COND_IMP_OPEN; + zone->wp += (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT); + if (zone->wp == zone->start + zone->len) + zone->cond = BLK_ZONE_COND_FULL; + return; + } + + zone++; + ASSERT(zone->cond != BLK_ZONE_COND_FULL); + if (zone->cond == BLK_ZONE_COND_EMPTY) + zone->cond = BLK_ZONE_COND_IMP_OPEN; + zone->wp 
+= (BTRFS_SUPER_INFO_SIZE >> SECTOR_SHIFT); + if (zone->wp == zone->start + zone->len) + zone->cond = BLK_ZONE_COND_FULL; +} + +int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror) +{ + sector_t zone_sectors; + sector_t nr_sectors = bdev->bd_part->nr_sects; + u8 zone_sectors_shift; + u32 sb_zone; + u32 nr_zones; + + zone_sectors = bdev_zone_sectors(bdev); + zone_sectors_shift = ilog2(zone_sectors); + nr_zones = nr_sectors >> zone_sectors_shift; + + sb_zone = sb_zone_number(zone_sectors_shift + SECTOR_SHIFT, mirror); + if (sb_zone + 1 >= nr_zones) + return -ENOENT; + + return blkdev_zone_mgmt(bdev, REQ_OP_ZONE_RESET, + sb_zone << zone_sectors_shift, zone_sectors * 2, + GFP_NOFS); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 0b7756a7104d..447c4e5ffcbb 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -4,6 +4,8 @@ #define BTRFS_ZONED_H #include +#include "volumes.h" +#include "disk-io.h" struct btrfs_zoned_device_info { /* @@ -16,6 +18,7 @@ struct btrfs_zoned_device_info { u32 nr_zones; unsigned long *seq_zones; unsigned long *empty_zones; + struct blk_zone sb_zones[2 * BTRFS_SUPER_MIRROR_MAX]; }; #ifdef CONFIG_BLK_DEV_ZONED @@ -25,6 +28,12 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device); void btrfs_destroy_dev_zone_info(struct btrfs_device *device); int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info); int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info); +int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw, + u64 *bytenr_ret); +int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, + u64 *bytenr_ret); +void btrfs_advance_sb_log(struct btrfs_device *device, int mirror); +int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -48,6 +57,26 @@ static inline int btrfs_check_mountopts_zoned(struct btrfs_fs_info *info) { 
	return 0;
 }
+static inline int btrfs_sb_log_location_bdev(struct block_device *bdev,
+					     int mirror, int rw,
+					     u64 *bytenr_ret)
+{
+	*bytenr_ret = btrfs_sb_offset(mirror);
+	return 0;
+}
+static inline int btrfs_sb_log_location(struct btrfs_device *device, int mirror,
+					int rw, u64 *bytenr_ret)
+{
+	*bytenr_ret = btrfs_sb_offset(mirror);
+	return 0;
+}
+static inline void btrfs_advance_sb_log(struct btrfs_device *device,
+					int mirror) { }
+static inline int btrfs_reset_sb_log_zones(struct block_device *bdev,
+					   int mirror)
+{
+	return 0;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -115,4 +144,15 @@ static inline bool btrfs_check_device_zone_type(struct btrfs_fs_info *fs_info,
 	return bdev_zoned_model(bdev) != BLK_ZONED_HM;
 }
 
+static inline bool btrfs_check_super_location(struct btrfs_device *device,
+					      u64 pos)
+{
+	/*
+	 * On a non-zoned device, any address is OK. On a zoned device,
+	 * non-SEQUENTIAL WRITE REQUIRED zones are capable.
+	 */
+	return device->zone_info == NULL ||
+	       !btrfs_dev_is_sequential(device, pos);
+}
+
 #endif

From patchwork Fri Oct 30 13:51:19 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 12/41] btrfs: implement zoned chunk allocator
Date: Fri, 30 Oct 2020 22:51:19 +0900
Message-Id: <5b9798f6e4c317e6a2c433ef88ffeabe00b93bb3.1604065695.git.naohiro.aota@wdc.com>

This commit implements a zoned chunk/dev_extent allocator. The zoned
allocator aligns device extents to zone boundaries, so that a zone reset
affects only its own device extent and does not change the state of blocks
in neighboring device extents. It also checks that a region allocation
does not overlap any of the super block zones, and ensures the region is
empty.
Signed-off-by: Naohiro Aota --- fs/btrfs/volumes.c | 131 +++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/volumes.h | 1 + fs/btrfs/zoned.c | 126 +++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 30 +++++++++++ 4 files changed, 288 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index db884b96a5ea..78c62ef02e6f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1416,6 +1416,14 @@ static bool contains_pending_extent(struct btrfs_device *device, u64 *start, return false; } +static inline u64 dev_extent_search_start_zoned(struct btrfs_device *device, + u64 start) +{ + start = max_t(u64, start, + max_t(u64, device->zone_info->zone_size, SZ_1M)); + return btrfs_zone_align(device, start); +} + static u64 dev_extent_search_start(struct btrfs_device *device, u64 start) { switch (device->fs_devices->chunk_alloc_policy) { @@ -1426,11 +1434,57 @@ static u64 dev_extent_search_start(struct btrfs_device *device, u64 start) * make sure to start at an offset of at least 1MB. 
*/ return max_t(u64, start, SZ_1M); + case BTRFS_CHUNK_ALLOC_ZONED: + return dev_extent_search_start_zoned(device, start); default: BUG(); } } +static bool dev_extent_hole_check_zoned(struct btrfs_device *device, + u64 *hole_start, u64 *hole_size, + u64 num_bytes) +{ + u64 zone_size = device->zone_info->zone_size; + u64 pos; + int ret; + int changed = 0; + + ASSERT(IS_ALIGNED(*hole_start, zone_size)); + + while (*hole_size > 0) { + pos = btrfs_find_allocatable_zones(device, *hole_start, + *hole_start + *hole_size, + num_bytes); + if (pos != *hole_start) { + *hole_size = *hole_start + *hole_size - pos; + *hole_start = pos; + changed = 1; + if (*hole_size < num_bytes) + break; + } + + ret = btrfs_ensure_empty_zones(device, pos, num_bytes); + + /* range is ensured to be empty */ + if (!ret) + return changed; + + /* given hole range was invalid (outside of device) */ + if (ret == -ERANGE) { + *hole_start += *hole_size; + *hole_size = 0; + return 1; + } + + *hole_start += zone_size; + *hole_size -= zone_size; + changed = 1; + } + + return changed; +} + /** * dev_extent_hole_check - check if specified hole is suitable for allocation * @device: the device which we have the hole @@ -1463,6 +1517,10 @@ static bool dev_extent_hole_check(struct btrfs_device *device, u64 *hole_start, case BTRFS_CHUNK_ALLOC_REGULAR: /* No extra check */ break; + case BTRFS_CHUNK_ALLOC_ZONED: + changed |= dev_extent_hole_check_zoned(device, hole_start, + hole_size, num_bytes); + break; default: BUG(); } @@ -1517,6 +1575,9 @@ static int find_free_dev_extent_start(struct btrfs_device *device, search_start = dev_extent_search_start(device, search_start); + WARN_ON(device->zone_info && + !IS_ALIGNED(num_bytes, device->zone_info->zone_size)); + path = btrfs_alloc_path(); if (!path) return -ENOMEM; @@ -4907,6 +4968,37 @@ static void init_alloc_chunk_ctl_policy_regular( ctl->dev_extent_min = BTRFS_STRIPE_LEN * ctl->dev_stripes; } +static void +init_alloc_chunk_ctl_policy_zoned(struct btrfs_fs_devices 
*fs_devices, + struct alloc_chunk_ctl *ctl) +{ + u64 zone_size = fs_devices->fs_info->zone_size; + u64 limit; + int min_num_stripes = ctl->devs_min * ctl->dev_stripes; + int min_data_stripes = (min_num_stripes - ctl->nparity) / ctl->ncopies; + u64 min_chunk_size = min_data_stripes * zone_size; + u64 type = ctl->type; + + ctl->max_stripe_size = zone_size; + if (type & BTRFS_BLOCK_GROUP_DATA) { + ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE, + zone_size); + } else if (type & BTRFS_BLOCK_GROUP_METADATA) { + ctl->max_chunk_size = ctl->max_stripe_size; + } else if (type & BTRFS_BLOCK_GROUP_SYSTEM) { + ctl->max_chunk_size = 2 * ctl->max_stripe_size; + ctl->devs_max = min_t(int, ctl->devs_max, + BTRFS_MAX_DEVS_SYS_CHUNK); + } + + /* We don't want a chunk larger than 10% of writable space */ + limit = max(round_down(div_factor(fs_devices->total_rw_bytes, 1), + zone_size), + min_chunk_size); + ctl->max_chunk_size = min(limit, ctl->max_chunk_size); + ctl->dev_extent_min = zone_size * ctl->dev_stripes; +} + static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices, struct alloc_chunk_ctl *ctl) { @@ -4927,6 +5019,9 @@ static void init_alloc_chunk_ctl(struct btrfs_fs_devices *fs_devices, case BTRFS_CHUNK_ALLOC_REGULAR: init_alloc_chunk_ctl_policy_regular(fs_devices, ctl); break; + case BTRFS_CHUNK_ALLOC_ZONED: + init_alloc_chunk_ctl_policy_zoned(fs_devices, ctl); + break; default: BUG(); } @@ -5053,6 +5148,40 @@ static int decide_stripe_size_regular(struct alloc_chunk_ctl *ctl, return 0; } +static int decide_stripe_size_zoned(struct alloc_chunk_ctl *ctl, + struct btrfs_device_info *devices_info) +{ + u64 zone_size = devices_info[0].dev->zone_info->zone_size; + /* number of stripes that count for block group size */ + int data_stripes; + + /* + * It should hold because: + * dev_extent_min == dev_extent_want == zone_size * dev_stripes + */ + ASSERT(devices_info[ctl->ndevs - 1].max_avail == ctl->dev_extent_min); + + ctl->stripe_size = zone_size; + 
ctl->num_stripes = ctl->ndevs * ctl->dev_stripes; + data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies; + + /* + * stripe_size is fixed in ZONED. Reduce ndevs instead. + */ + if (ctl->stripe_size * data_stripes > ctl->max_chunk_size) { + ctl->ndevs = div_u64(div_u64(ctl->max_chunk_size * ctl->ncopies, + ctl->stripe_size) + ctl->nparity, + ctl->dev_stripes); + ctl->num_stripes = ctl->ndevs * ctl->dev_stripes; + data_stripes = (ctl->num_stripes - ctl->nparity) / ctl->ncopies; + ASSERT(ctl->stripe_size * data_stripes <= ctl->max_chunk_size); + } + + ctl->chunk_size = ctl->stripe_size * data_stripes; + + return 0; +} + static int decide_stripe_size(struct btrfs_fs_devices *fs_devices, struct alloc_chunk_ctl *ctl, struct btrfs_device_info *devices_info) @@ -5080,6 +5209,8 @@ static int decide_stripe_size(struct btrfs_fs_devices *fs_devices, switch (fs_devices->chunk_alloc_policy) { case BTRFS_CHUNK_ALLOC_REGULAR: return decide_stripe_size_regular(ctl, devices_info); + case BTRFS_CHUNK_ALLOC_ZONED: + return decide_stripe_size_zoned(ctl, devices_info); default: BUG(); } diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 9c07b97a2260..0249aca668fb 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -213,6 +213,7 @@ BTRFS_DEVICE_GETSET_FUNCS(bytes_used); enum btrfs_chunk_allocation_policy { BTRFS_CHUNK_ALLOC_REGULAR, + BTRFS_CHUNK_ALLOC_ZONED, }; struct btrfs_fs_devices { diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index d5487cba203b..4411d786597a 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -1,11 +1,13 @@ // SPDX-License-Identifier: GPL-2.0 +#include #include #include #include "ctree.h" #include "volumes.h" #include "zoned.h" #include "rcu-string.h" +#include "disk-io.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 @@ -328,6 +330,7 @@ int btrfs_check_zoned_mode(struct btrfs_fs_info *fs_info) fs_info->zone_size = zone_size; fs_info->max_zone_append_size = 
max_zone_append_size;
+	fs_info->fs_devices->chunk_alloc_policy = BTRFS_CHUNK_ALLOC_ZONED;
 
 	btrfs_info(fs_info, "ZONED mode enabled, zone size %llu B",
 		   fs_info->zone_size);
@@ -607,3 +610,126 @@ int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror)
 				sb_zone << zone_sectors_shift, zone_sectors * 2,
 				GFP_NOFS);
 }
+
+/*
+ * btrfs_find_allocatable_zones - find allocatable zones within a given region
+ * @device:	the device to allocate a region on
+ * @hole_start:	the position of the hole to allocate the region
+ * @hole_end:	the end of the hole
+ * @num_bytes:	the size of the wanted region
+ *
+ * An allocatable region must not contain any superblock locations.
+ */
+u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
+				 u64 hole_end, u64 num_bytes)
+{
+	struct btrfs_zoned_device_info *zinfo = device->zone_info;
+	u8 shift = zinfo->zone_size_shift;
+	u64 nzones = num_bytes >> shift;
+	u64 pos = hole_start;
+	u64 begin, end;
+	u64 sb_pos;
+	bool have_sb;
+	int i;
+
+	ASSERT(IS_ALIGNED(hole_start, zinfo->zone_size));
+	ASSERT(IS_ALIGNED(num_bytes, zinfo->zone_size));
+
+	while (pos < hole_end) {
+		begin = pos >> shift;
+		end = begin + nzones;
+
+		if (end > zinfo->nr_zones)
+			return hole_end;
+
+		/* check if zones in the region are all empty */
+		if (btrfs_dev_is_sequential(device, pos) &&
+		    find_next_zero_bit(zinfo->empty_zones, end, begin) != end) {
+			pos += zinfo->zone_size;
+			continue;
+		}
+
+		have_sb = false;
+		for (i = 0; i < BTRFS_SUPER_MIRROR_MAX; i++) {
+			sb_pos = sb_zone_number(zinfo->zone_size, i);
+			if (!(end < sb_pos || sb_pos + 1 < begin)) {
+				have_sb = true;
+				pos = (sb_pos + 2) << shift;
+				break;
+			}
+		}
+		if (!have_sb)
+			break;
+	}
+
+	return pos;
+}
+
+int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
+			    u64 length, u64 *bytes)
+{
+	int ret;
+
+	*bytes = 0;
+	ret = blkdev_zone_mgmt(device->bdev, REQ_OP_ZONE_RESET,
+			       physical >> SECTOR_SHIFT, length >> SECTOR_SHIFT,
+			       GFP_NOFS);
+	if (ret)
+		return ret;
+
+
*bytes = length; + while (length) { + btrfs_dev_set_zone_empty(device, physical); + physical += device->zone_info->zone_size; + length -= device->zone_info->zone_size; + } + + return 0; +} + +int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size) +{ + struct btrfs_zoned_device_info *zinfo = device->zone_info; + u8 shift = zinfo->zone_size_shift; + unsigned long begin = start >> shift; + unsigned long end = (start + size) >> shift; + u64 pos; + int ret; + + ASSERT(IS_ALIGNED(start, zinfo->zone_size)); + ASSERT(IS_ALIGNED(size, zinfo->zone_size)); + + if (end > zinfo->nr_zones) + return -ERANGE; + + /* all the zones are conventional */ + if (find_next_bit(zinfo->seq_zones, begin, end) == end) + return 0; + + /* all the zones are sequential and empty */ + if (find_next_zero_bit(zinfo->seq_zones, begin, end) == end && + find_next_zero_bit(zinfo->empty_zones, begin, end) == end) + return 0; + + for (pos = start; pos < start + size; pos += zinfo->zone_size) { + u64 reset_bytes; + + if (!btrfs_dev_is_sequential(device, pos) || + btrfs_dev_is_empty_zone(device, pos)) + continue; + + /* free regions should be empty */ + btrfs_warn_in_rcu( + device->fs_info, + "resetting device %s zone %llu for allocation", + rcu_str_deref(device->name), pos >> shift); + WARN_ON_ONCE(1); + + ret = btrfs_reset_device_zone(device, pos, zinfo->zone_size, + &reset_bytes); + if (ret) + return ret; + } + + return 0; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 447c4e5ffcbb..24dd0c9561f9 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -34,6 +34,11 @@ int btrfs_sb_log_location(struct btrfs_device *device, int mirror, int rw, u64 *bytenr_ret); void btrfs_advance_sb_log(struct btrfs_device *device, int mirror); int btrfs_reset_sb_log_zones(struct block_device *bdev, int mirror); +u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start, + u64 hole_end, u64 num_bytes); +int btrfs_reset_device_zone(struct btrfs_device *device, u64 
physical,
+			    u64 length, u64 *bytes);
+int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
 #else /* CONFIG_BLK_DEV_ZONED */
 static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
 				     struct blk_zone *zone)
@@ -77,6 +82,23 @@ static inline int btrfs_reset_sb_log_zones(struct block_device *bdev,
 {
 	return 0;
 }
+static inline u64 btrfs_find_allocatable_zones(struct btrfs_device *device,
+					       u64 hole_start, u64 hole_end,
+					       u64 num_bytes)
+{
+	return hole_start;
+}
+static inline int btrfs_reset_device_zone(struct btrfs_device *device,
+					  u64 physical, u64 length, u64 *bytes)
+{
+	*bytes = 0;
+	return 0;
+}
+static inline int btrfs_ensure_empty_zones(struct btrfs_device *device,
+					   u64 start, u64 size)
+{
+	return 0;
+}
 #endif
 
 static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
@@ -155,4 +177,12 @@ static inline bool btrfs_check_super_location(struct btrfs_device *device,
 	       !btrfs_dev_is_sequential(device, pos);
 }
 
+static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos)
+{
+	if (!device->zone_info)
+		return pos;
+
+	return ALIGN(pos, device->zone_info->zone_size);
+}
+
 #endif

From patchwork Fri Oct 30 13:51:20 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 13/41] btrfs: verify device extent is aligned to zone
Date: Fri, 30 Oct 2020 22:51:20 +0900
Message-Id: <6d9191ed858e0a0283a8d52fefa538896d8d125b.1604065695.git.naohiro.aota@wdc.com>

Add a check in verify_one_dev_extent() that a device extent on a zoned
block device is aligned to the respective zone boundary.
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/volumes.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 78c62ef02e6f..9e69222086ae 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7778,6 +7778,20 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 		ret = -EUCLEAN;
 		goto out;
 	}
+
+	if (dev->zone_info) {
+		u64 zone_size = dev->zone_info->zone_size;
+
+		if (!IS_ALIGNED(physical_offset, zone_size) ||
+		    !IS_ALIGNED(physical_len, zone_size)) {
+			btrfs_err(fs_info,
+"dev extent devid %llu physical offset %llu len %llu is not aligned to device zone",
+				  devid, physical_offset, physical_len);
+			ret = -EUCLEAN;
+			goto out;
+		}
+	}
+
 out:
 	free_extent_map(em);
 	return ret;

From patchwork Fri Oct 30 13:51:21 2020
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 14/41] btrfs: load zone's allocation offset
Date: Fri, 30 Oct 2020 22:51:21 +0900
Message-Id: <1bbbf9d4ade0c5aeeaebd0772c90f360ceafa9b3.1604065695.git.naohiro.aota@wdc.com>

Zoned btrfs must allocate blocks at the zones' write pointer. The device's
write pointer position can be mapped to a logical address within a block
group. This commit adds "alloc_offset" to track that logical address. The
offset is populated in btrfs_load_block_group_zone_info() from the write
pointers of the corresponding zones.

For now, zoned btrfs only supports the SINGLE profile. Supporting a
non-SINGLE profile with zone append writing is not trivial. For example,
in the DUP profile, we send a zone append write IO to two zones on a
device. The device replies with the written LBAs for the IOs. If the
offsets of the returned addresses from the beginning of the zone differ,
the writes end up at different logical addresses. We would need a
fine-grained logical-to-physical mapping to support such diverged physical
addresses. Since that would require an additional metadata type, disable
non-SINGLE profiles for now.
Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/block-group.c | 15 ++++ fs/btrfs/block-group.h | 6 ++ fs/btrfs/zoned.c | 153 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 6 ++ 4 files changed, 180 insertions(+) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index e989c66aa764..920b2708c7f2 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -15,6 +15,7 @@ #include "delalloc-space.h" #include "discard.h" #include "raid56.h" +#include "zoned.h" /* * Return target flags in extended format or 0 if restripe for this chunk_type @@ -1935,6 +1936,13 @@ static int read_one_block_group(struct btrfs_fs_info *info, goto error; } + ret = btrfs_load_block_group_zone_info(cache); + if (ret) { + btrfs_err(info, "failed to load zone info of bg %llu", + cache->start); + goto error; + } + /* * We need to exclude the super stripes now so that the space info has * super bytes accounted for, otherwise we'll think we have more space @@ -2161,6 +2169,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used, cache->last_byte_to_unpin = (u64)-1; cache->cached = BTRFS_CACHE_FINISHED; cache->needs_free_space = 1; + + ret = btrfs_load_block_group_zone_info(cache); + if (ret) { + btrfs_put_block_group(cache); + return ret; + } + ret = exclude_super_stripes(cache); if (ret) { /* We may have excluded something, so call this just in case */ diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index adfd7583a17b..14e3043c9ce7 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -183,6 +183,12 @@ struct btrfs_block_group { /* Record locked full stripes for RAID5/6 block group */ struct btrfs_full_stripe_locks_tree full_stripe_locks_root; + + /* + * Allocation offset for the block group to implement sequential + * allocation. This is used only with ZONED mode enabled. 
+ */ + u64 alloc_offset; }; static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group) diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 4411d786597a..0aa821893a51 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -3,14 +3,20 @@ #include #include #include +#include #include "ctree.h" #include "volumes.h" #include "zoned.h" #include "rcu-string.h" #include "disk-io.h" +#include "block-group.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 +/* Invalid allocation pointer value for missing devices */ +#define WP_MISSING_DEV ((u64)-1) +/* Pseudo write pointer value for conventional zone */ +#define WP_CONVENTIONAL ((u64)-2) static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, void *data) @@ -733,3 +739,150 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size) return 0; } + +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct extent_map_tree *em_tree = &fs_info->mapping_tree; + struct extent_map *em; + struct map_lookup *map; + struct btrfs_device *device; + u64 logical = cache->start; + u64 length = cache->length; + u64 physical = 0; + int ret; + int i; + unsigned int nofs_flag; + u64 *alloc_offsets = NULL; + u32 num_sequential = 0, num_conventional = 0; + + if (!btrfs_is_zoned(fs_info)) + return 0; + + /* Sanity check */ + if (!IS_ALIGNED(length, fs_info->zone_size)) { + btrfs_err(fs_info, "unaligned block group at %llu + %llu", + logical, length); + return -EIO; + } + + /* Get the chunk mapping */ + read_lock(&em_tree->lock); + em = lookup_extent_mapping(em_tree, logical, length); + read_unlock(&em_tree->lock); + + if (!em) + return -EINVAL; + + map = em->map_lookup; + + /* + * Get the zone type: if the group is mapped to a non-sequential zone, + * there is no need for the allocation offset (fit allocation is OK). 
+ */ + alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets), + GFP_NOFS); + if (!alloc_offsets) { + free_extent_map(em); + return -ENOMEM; + } + + for (i = 0; i < map->num_stripes; i++) { + bool is_sequential; + struct blk_zone zone; + + device = map->stripes[i].dev; + physical = map->stripes[i].physical; + + if (device->bdev == NULL) { + alloc_offsets[i] = WP_MISSING_DEV; + continue; + } + + is_sequential = btrfs_dev_is_sequential(device, physical); + if (is_sequential) + num_sequential++; + else + num_conventional++; + + if (!is_sequential) { + alloc_offsets[i] = WP_CONVENTIONAL; + continue; + } + + /* + * This zone will be used for allocation, so mark this + * zone non-empty. + */ + btrfs_dev_clear_zone_empty(device, physical); + + /* + * The group is mapped to a sequential zone. Get the zone write + * pointer to determine the allocation offset within the zone. + */ + WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size)); + nofs_flag = memalloc_nofs_save(); + ret = btrfs_get_dev_zone(device, physical, &zone); + memalloc_nofs_restore(nofs_flag); + if (ret == -EIO || ret == -EOPNOTSUPP) { + ret = 0; + alloc_offsets[i] = WP_MISSING_DEV; + continue; + } else if (ret) { + goto out; + } + + switch (zone.cond) { + case BLK_ZONE_COND_OFFLINE: + case BLK_ZONE_COND_READONLY: + btrfs_err(fs_info, "Offline/readonly zone %llu", + physical >> device->zone_info->zone_size_shift); + alloc_offsets[i] = WP_MISSING_DEV; + break; + case BLK_ZONE_COND_EMPTY: + alloc_offsets[i] = 0; + break; + case BLK_ZONE_COND_FULL: + alloc_offsets[i] = fs_info->zone_size; + break; + default: + /* Partially used zone */ + alloc_offsets[i] = + ((zone.wp - zone.start) << SECTOR_SHIFT); + break; + } + } + + if (num_conventional > 0) { + /* + * Since conventional zones does not have write pointer, we + * cannot determine alloc_offset from the pointer + */ + ret = -EINVAL; + goto out; + } + + switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) { + case 0: /* single */ + cache->alloc_offset = 
alloc_offsets[0]; + break; + case BTRFS_BLOCK_GROUP_DUP: + case BTRFS_BLOCK_GROUP_RAID1: + case BTRFS_BLOCK_GROUP_RAID0: + case BTRFS_BLOCK_GROUP_RAID10: + case BTRFS_BLOCK_GROUP_RAID5: + case BTRFS_BLOCK_GROUP_RAID6: + /* non-SINGLE profiles are not supported yet */ + default: + btrfs_err(fs_info, "Unsupported profile on ZONED %s", + btrfs_bg_type_to_raid_name(map->type)); + ret = -EINVAL; + goto out; + } + +out: + kfree(alloc_offsets); + free_extent_map(em); + + return ret; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 24dd0c9561f9..90ed43a25595 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -39,6 +39,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start, int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, u64 length, u64 *bytes); int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size); +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -99,6 +100,11 @@ static inline int btrfs_ensure_empty_zones(struct btrfs_device *device, { return 0; } +static inline int btrfs_load_block_group_zone_info( + struct btrfs_block_group *cache) +{ + return 0; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Fri Oct 30 13:51:22 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869727 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 88B4561C for ; Fri, 30 Oct 2020 13:54:11 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 60F7F2083B for ; Fri, 30 Oct 2020 13:54:11 +0000 (UTC) Authentication-Results: 
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 15/41] btrfs: emulate write pointer for conventional zones
Date: Fri, 30 Oct 2020 22:51:22 +0900

Conventional zones do not have a write pointer, so we cannot use one to determine the allocation offset if a block group contains a conventional zone. Instead, we can consider the end of the last allocated extent in the block group as the allocation offset.
Signed-off-by: Naohiro Aota --- fs/btrfs/zoned.c | 119 ++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 113 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 0aa821893a51..8f58d0853cc3 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -740,6 +740,104 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size) return 0; } +static int emulate_write_pointer(struct btrfs_block_group *cache, + u64 *offset_ret) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct btrfs_root *root = fs_info->extent_root; + struct btrfs_path *path; + struct extent_buffer *leaf; + struct btrfs_key search_key; + struct btrfs_key found_key; + int slot; + int ret; + u64 length; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + search_key.objectid = cache->start + cache->length; + search_key.type = 0; + search_key.offset = 0; + + ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0); + if (ret < 0) + goto out; + ASSERT(ret != 0); + slot = path->slots[0]; + leaf = path->nodes[0]; + ASSERT(slot != 0); + slot--; + btrfs_item_key_to_cpu(leaf, &found_key, slot); + + if (found_key.objectid < cache->start) { + *offset_ret = 0; + } else if (found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) { + struct btrfs_key extent_item_key; + + if (found_key.objectid != cache->start) { + ret = -EUCLEAN; + goto out; + } + + length = 0; + + /* metadata may have METADATA_ITEM_KEY */ + if (slot == 0) { + btrfs_set_path_blocking(path); + ret = btrfs_prev_leaf(root, path); + if (ret < 0) + goto out; + if (ret == 0) { + slot = btrfs_header_nritems(leaf) - 1; + btrfs_item_key_to_cpu(leaf, &extent_item_key, + slot); + } + } else { + btrfs_item_key_to_cpu(leaf, &extent_item_key, slot - 1); + ret = 0; + } + + if (ret == 0 && + extent_item_key.objectid == cache->start) { + if (extent_item_key.type == BTRFS_METADATA_ITEM_KEY) + length = fs_info->nodesize; + else if (extent_item_key.type == BTRFS_EXTENT_ITEM_KEY) + length = 
extent_item_key.offset; + else { + ret = -EUCLEAN; + goto out; + } + } + + *offset_ret = length; + } else if (found_key.type == BTRFS_EXTENT_ITEM_KEY || + found_key.type == BTRFS_METADATA_ITEM_KEY) { + + if (found_key.type == BTRFS_EXTENT_ITEM_KEY) + length = found_key.offset; + else + length = fs_info->nodesize; + + if (!(found_key.objectid >= cache->start && + found_key.objectid + length <= + cache->start + cache->length)) { + ret = -EUCLEAN; + goto out; + } + *offset_ret = found_key.objectid + length - cache->start; + } else { + ret = -EUCLEAN; + goto out; + } + ret = 0; + +out: + btrfs_free_path(path); + return ret; +} + int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) { struct btrfs_fs_info *fs_info = cache->fs_info; @@ -754,6 +852,7 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) int i; unsigned int nofs_flag; u64 *alloc_offsets = NULL; + u64 emulated_offset = 0; u32 num_sequential = 0, num_conventional = 0; if (!btrfs_is_zoned(fs_info)) @@ -854,12 +953,12 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) } if (num_conventional > 0) { - /* - * Since conventional zones does not have write pointer, we - * cannot determine alloc_offset from the pointer - */ - ret = -EINVAL; - goto out; + ret = emulate_write_pointer(cache, &emulated_offset); + if (ret || map->num_stripes == num_conventional) { + if (!ret) + cache->alloc_offset = emulated_offset; + goto out; + } } switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) { @@ -881,6 +980,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) } out: + /* an extent is allocated after the write pointer */ + if (num_conventional && emulated_offset > cache->alloc_offset) { + btrfs_err(fs_info, + "got wrong write pointer in BG %llu: %llu > %llu", + logical, emulated_offset, cache->alloc_offset); + ret = -EIO; + } + kfree(alloc_offsets); free_extent_map(em); From patchwork Fri Oct 30 13:51:23 2020 Content-Type: text/plain; charset="utf-8" 
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869729
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 16/41] btrfs: track unusable bytes for zones
Date: Fri, 30 Oct 2020 22:51:23 +0900

In zoned btrfs, a region that was once written and then freed is not usable until we reset the underlying zone, so we need to distinguish such unusable space from usable free space. This commit therefore introduces the "zone_unusable" field in the block group structure and "bytes_zone_unusable" in the space_info structure to track the unusable space. Pinned bytes are always reclaimed to the unusable space.
But when an allocated region is returned before use (e.g., the block group becomes read-only between allocation time and reservation time), we can safely return the region to the block group. For this situation, this commit introduces btrfs_add_free_space_unused(). This behaves the same as btrfs_add_free_space() on regular btrfs; on zoned btrfs, it rewinds the allocation offset. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 19 +++++++++----- fs/btrfs/block-group.h | 1 + fs/btrfs/extent-tree.c | 15 ++++++++--- fs/btrfs/free-space-cache.c | 52 +++++++++++++++++++++++++++++++++++++ fs/btrfs/free-space-cache.h | 4 +++ fs/btrfs/space-info.c | 13 ++++++---- fs/btrfs/space-info.h | 4 ++- fs/btrfs/sysfs.c | 2 ++ fs/btrfs/zoned.c | 22 ++++++++++++++++ fs/btrfs/zoned.h | 2 ++ 10 files changed, 118 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 920b2708c7f2..c34bd2dbdf82 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1080,12 +1080,15 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, WARN_ON(block_group->space_info->total_bytes < block_group->length); WARN_ON(block_group->space_info->bytes_readonly - < block_group->length); + < block_group->length - block_group->zone_unusable); + WARN_ON(block_group->space_info->bytes_zone_unusable + < block_group->zone_unusable); WARN_ON(block_group->space_info->disk_total < block_group->length * factor); } block_group->space_info->total_bytes -= block_group->length; - block_group->space_info->bytes_readonly -= block_group->length; + block_group->space_info->bytes_readonly -= + (block_group->length - block_group->zone_unusable); block_group->space_info->disk_total -= block_group->length * factor; spin_unlock(&block_group->space_info->lock); @@ -1229,7 +1232,7 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force) } num_bytes = cache->length - cache->reserved - cache->pinned -
cache->bytes_super - cache->zone_unusable - cache->used; /* * Data never overcommits, even in mixed mode, so do just the straight @@ -1973,6 +1976,8 @@ static int read_one_block_group(struct btrfs_fs_info *info, btrfs_free_excluded_extents(cache); } + btrfs_calc_zone_unusable(cache); + ret = btrfs_add_block_group_cache(info, cache); if (ret) { btrfs_remove_free_space_cache(cache); @@ -1980,7 +1985,8 @@ static int read_one_block_group(struct btrfs_fs_info *info, } trace_btrfs_add_block_group(info, cache, 0); btrfs_update_space_info(info, cache->flags, cache->length, - cache->used, cache->bytes_super, &space_info); + cache->used, cache->bytes_super, + cache->zone_unusable, &space_info); cache->space_info = space_info; @@ -2217,7 +2223,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used, */ trace_btrfs_add_block_group(fs_info, cache, 1); btrfs_update_space_info(fs_info, cache->flags, size, bytes_used, - cache->bytes_super, &cache->space_info); + cache->bytes_super, 0, &cache->space_info); btrfs_update_global_block_rsv(fs_info); link_block_group(cache); @@ -2325,7 +2331,8 @@ void btrfs_dec_block_group_ro(struct btrfs_block_group *cache) spin_lock(&cache->lock); if (!--cache->ro) { num_bytes = cache->length - cache->reserved - - cache->pinned - cache->bytes_super - cache->used; + cache->pinned - cache->bytes_super - + cache->zone_unusable - cache->used; sinfo->bytes_readonly -= num_bytes; list_del_init(&cache->ro_list); } diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 14e3043c9ce7..5be47f4bfea7 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -189,6 +189,7 @@ struct btrfs_block_group { * allocation. This is used only with ZONED mode enabled. 
*/ u64 alloc_offset; + u64 zone_unusable; }; static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3b21fee13e77..fad53c702d8a 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -34,6 +34,7 @@ #include "block-group.h" #include "discard.h" #include "rcu-string.h" +#include "zoned.h" #undef SCRAMBLE_DELAYED_REFS @@ -2807,9 +2808,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info, cache = btrfs_lookup_block_group(fs_info, start); BUG_ON(!cache); /* Logic error */ - cluster = fetch_cluster_info(fs_info, - cache->space_info, - &empty_cluster); + if (!btrfs_is_zoned(fs_info)) + cluster = fetch_cluster_info(fs_info, + cache->space_info, + &empty_cluster); + empty_cluster <<= 1; } @@ -2846,7 +2849,11 @@ static int unpin_extent_range(struct btrfs_fs_info *fs_info, space_info->max_extent_size = 0; percpu_counter_add_batch(&space_info->total_bytes_pinned, -len, BTRFS_TOTAL_BYTES_PINNED_BATCH); - if (cache->ro) { + if (btrfs_is_zoned(fs_info)) { + /* need reset before reusing in zoned Block Group */ + space_info->bytes_zone_unusable += len; + readonly = true; + } else if (cache->ro) { space_info->bytes_readonly += len; readonly = true; } diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index af0013d3df63..cfa466319166 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2467,6 +2467,8 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, int ret = 0; u64 filter_bytes = bytes; + ASSERT(!btrfs_is_zoned(fs_info)); + info = kmem_cache_zalloc(btrfs_free_space_cachep, GFP_NOFS); if (!info) return -ENOMEM; @@ -2524,11 +2526,44 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, return ret; } +int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group, + u64 bytenr, u64 size, bool used) +{ + struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl; + u64 offset = bytenr - 
block_group->start; + u64 to_free, to_unusable; + + spin_lock(&ctl->tree_lock); + if (!used) + to_free = size; + else if (offset >= block_group->alloc_offset) + to_free = size; + else if (offset + size <= block_group->alloc_offset) + to_free = 0; + else + to_free = offset + size - block_group->alloc_offset; + to_unusable = size - to_free; + + ctl->free_space += to_free; + block_group->zone_unusable += to_unusable; + spin_unlock(&ctl->tree_lock); + if (!used) { + spin_lock(&block_group->lock); + block_group->alloc_offset -= size; + spin_unlock(&block_group->lock); + } + return 0; +} + int btrfs_add_free_space(struct btrfs_block_group *block_group, u64 bytenr, u64 size) { enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED; + if (btrfs_is_zoned(block_group->fs_info)) + return __btrfs_add_free_space_zoned(block_group, bytenr, size, + true); + if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC)) trim_state = BTRFS_TRIM_STATE_TRIMMED; @@ -2537,6 +2572,16 @@ int btrfs_add_free_space(struct btrfs_block_group *block_group, bytenr, size, trim_state); } +int btrfs_add_free_space_unused(struct btrfs_block_group *block_group, + u64 bytenr, u64 size) +{ + if (btrfs_is_zoned(block_group->fs_info)) + return __btrfs_add_free_space_zoned(block_group, bytenr, size, + false); + + return btrfs_add_free_space(block_group, bytenr, size); +} + /* * This is a subtle distinction because when adding free space back in general, * we want it to be added as untrimmed for async. 
But in the case where we add @@ -2547,6 +2592,10 @@ int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group, { enum btrfs_trim_state trim_state = BTRFS_TRIM_STATE_UNTRIMMED; + if (btrfs_is_zoned(block_group->fs_info)) + return __btrfs_add_free_space_zoned(block_group, bytenr, size, + true); + if (btrfs_test_opt(block_group->fs_info, DISCARD_SYNC) || btrfs_test_opt(block_group->fs_info, DISCARD_ASYNC)) trim_state = BTRFS_TRIM_STATE_TRIMMED; @@ -2564,6 +2613,9 @@ int btrfs_remove_free_space(struct btrfs_block_group *block_group, int ret; bool re_search = false; + if (btrfs_is_zoned(block_group->fs_info)) + return 0; + spin_lock(&ctl->tree_lock); again: diff --git a/fs/btrfs/free-space-cache.h b/fs/btrfs/free-space-cache.h index e3d5e0ad8f8e..7081216257a8 100644 --- a/fs/btrfs/free-space-cache.h +++ b/fs/btrfs/free-space-cache.h @@ -114,8 +114,12 @@ int __btrfs_add_free_space(struct btrfs_fs_info *fs_info, struct btrfs_free_space_ctl *ctl, u64 bytenr, u64 size, enum btrfs_trim_state trim_state); +int __btrfs_add_free_space_zoned(struct btrfs_block_group *block_group, + u64 bytenr, u64 size, bool used); int btrfs_add_free_space(struct btrfs_block_group *block_group, u64 bytenr, u64 size); +int btrfs_add_free_space_unused(struct btrfs_block_group *block_group, + u64 bytenr, u64 size); int btrfs_add_free_space_async_trimmed(struct btrfs_block_group *block_group, u64 bytenr, u64 size); int btrfs_remove_free_space(struct btrfs_block_group *block_group, diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c index 64099565ab8f..bbbf3c1412a4 100644 --- a/fs/btrfs/space-info.c +++ b/fs/btrfs/space-info.c @@ -163,6 +163,7 @@ u64 __pure btrfs_space_info_used(struct btrfs_space_info *s_info, ASSERT(s_info); return s_info->bytes_used + s_info->bytes_reserved + s_info->bytes_pinned + s_info->bytes_readonly + + s_info->bytes_zone_unusable + (may_use_included ? 
s_info->bytes_may_use : 0); } @@ -257,7 +258,7 @@ int btrfs_init_space_info(struct btrfs_fs_info *fs_info) void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, u64 total_bytes, u64 bytes_used, - u64 bytes_readonly, + u64 bytes_readonly, u64 bytes_zone_unusable, struct btrfs_space_info **space_info) { struct btrfs_space_info *found; @@ -273,6 +274,7 @@ void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, found->bytes_used += bytes_used; found->disk_used += bytes_used * factor; found->bytes_readonly += bytes_readonly; + found->bytes_zone_unusable += bytes_zone_unusable; if (total_bytes > 0) found->full = 0; btrfs_try_granting_tickets(info, found); @@ -422,10 +424,10 @@ static void __btrfs_dump_space_info(struct btrfs_fs_info *fs_info, info->total_bytes - btrfs_space_info_used(info, true), info->full ? "" : "not "); btrfs_info(fs_info, - "space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu", + "space_info total=%llu, used=%llu, pinned=%llu, reserved=%llu, may_use=%llu, readonly=%llu zone_unusable=%llu", info->total_bytes, info->bytes_used, info->bytes_pinned, info->bytes_reserved, info->bytes_may_use, - info->bytes_readonly); + info->bytes_readonly, info->bytes_zone_unusable); DUMP_BLOCK_RSV(fs_info, global_block_rsv); DUMP_BLOCK_RSV(fs_info, trans_block_rsv); @@ -454,9 +456,10 @@ void btrfs_dump_space_info(struct btrfs_fs_info *fs_info, list_for_each_entry(cache, &info->block_groups[index], list) { spin_lock(&cache->lock); btrfs_info(fs_info, - "block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %s", + "block group %llu has %llu bytes, %llu used %llu pinned %llu reserved %llu zone_unusable %s", cache->start, cache->length, cache->used, cache->pinned, - cache->reserved, cache->ro ? "[readonly]" : ""); + cache->reserved, cache->zone_unusable, + cache->ro ? 
"[readonly]" : ""); spin_unlock(&cache->lock); btrfs_dump_free_space(cache, bytes); } diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h index 5646393b928c..ee003ffba956 100644 --- a/fs/btrfs/space-info.h +++ b/fs/btrfs/space-info.h @@ -17,6 +17,8 @@ struct btrfs_space_info { u64 bytes_may_use; /* number of bytes that may be used for delalloc/allocations */ u64 bytes_readonly; /* total bytes that are read only */ + u64 bytes_zone_unusable; /* total bytes that are unusable until + resetting the device zone */ u64 max_extent_size; /* This will hold the maximum extent size of the space info if we had an ENOSPC in the @@ -119,7 +121,7 @@ DECLARE_SPACE_INFO_UPDATE(bytes_pinned, "pinned"); int btrfs_init_space_info(struct btrfs_fs_info *fs_info); void btrfs_update_space_info(struct btrfs_fs_info *info, u64 flags, u64 total_bytes, u64 bytes_used, - u64 bytes_readonly, + u64 bytes_readonly, u64 bytes_zone_unusable, struct btrfs_space_info **space_info); struct btrfs_space_info *btrfs_find_space_info(struct btrfs_fs_info *info, u64 flags); diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c index 828006020bbd..ea679803da9b 100644 --- a/fs/btrfs/sysfs.c +++ b/fs/btrfs/sysfs.c @@ -635,6 +635,7 @@ SPACE_INFO_ATTR(bytes_pinned); SPACE_INFO_ATTR(bytes_reserved); SPACE_INFO_ATTR(bytes_may_use); SPACE_INFO_ATTR(bytes_readonly); +SPACE_INFO_ATTR(bytes_zone_unusable); SPACE_INFO_ATTR(disk_used); SPACE_INFO_ATTR(disk_total); BTRFS_ATTR(space_info, total_bytes_pinned, @@ -648,6 +649,7 @@ static struct attribute *space_info_attrs[] = { BTRFS_ATTR_PTR(space_info, bytes_reserved), BTRFS_ATTR_PTR(space_info, bytes_may_use), BTRFS_ATTR_PTR(space_info, bytes_readonly), + BTRFS_ATTR_PTR(space_info, bytes_zone_unusable), BTRFS_ATTR_PTR(space_info, disk_used), BTRFS_ATTR_PTR(space_info, disk_total), BTRFS_ATTR_PTR(space_info, total_bytes_pinned), diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 8f58d0853cc3..d94a2c363a47 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -993,3 
+993,25 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) return ret; } + +void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) +{ + u64 unusable, free; + + if (!btrfs_is_zoned(cache->fs_info)) + return; + + WARN_ON(cache->bytes_super != 0); + unusable = cache->alloc_offset - cache->used; + free = cache->length - cache->alloc_offset; + /* we only need ->free_space in ALLOC_SEQ BGs */ + cache->last_byte_to_unpin = (u64)-1; + cache->cached = BTRFS_CACHE_FINISHED; + cache->free_space_ctl->free_space = free; + cache->zone_unusable = unusable; + /* + * Should not have any excluded extents. Just + * in case, though. + */ + btrfs_free_excluded_extents(cache); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 90ed43a25595..42b1a4217c6b 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -40,6 +40,7 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, u64 length, u64 *bytes); int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size); int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache); +void btrfs_calc_zone_unusable(struct btrfs_block_group *cache); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -105,6 +106,7 @@ static inline int btrfs_load_block_group_zone_info( { return 0; } +static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { } #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Fri Oct 30 13:51:24 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869725 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 81A8B92C for ; Fri, 30 Oct 2020 13:54:09 +0000 (UTC) Received: from vger.kernel.org 
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 17/41] btrfs: do sequential extent allocation in ZONED mode
Date: Fri, 30 Oct 2020 22:51:24 +0900
Message-Id: <0b0775b15f3fd97b04b3b3f1650701330e9392b5.1604065695.git.naohiro.aota@wdc.com>

This commit implements a sequential extent allocator for the ZONED mode. The allocator only needs to check whether there is enough space in the block group, so it never manages bitmaps or clusters. Also add ASSERTs to the corresponding functions.

Actually, with zone append writing, it is unnecessary to track the allocation offset; we only need to check space availability. But by tracking the offset and returning the offset as the allocated region, we can skip modification of ordered extents and checksum information when there is no IO reordering.
Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/block-group.c | 4 ++ fs/btrfs/extent-tree.c | 85 ++++++++++++++++++++++++++++++++++--- fs/btrfs/free-space-cache.c | 6 +++ 3 files changed, 89 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index c34bd2dbdf82..d67f9cabe5c1 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -683,6 +683,10 @@ int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only struct btrfs_caching_control *caching_ctl; int ret = 0; + /* The allocator for ZONED btrfs does not use the cache at all */ + if (btrfs_is_zoned(fs_info)) + return 0; + caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS); if (!caching_ctl) return -ENOMEM; diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index fad53c702d8a..5e6b4d1712f2 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3562,6 +3562,7 @@ btrfs_release_block_group(struct btrfs_block_group *cache, enum btrfs_extent_allocation_policy { BTRFS_EXTENT_ALLOC_CLUSTERED, + BTRFS_EXTENT_ALLOC_ZONED, }; /* @@ -3814,6 +3815,58 @@ static int do_allocation_clustered(struct btrfs_block_group *block_group, return find_free_extent_unclustered(block_group, ffe_ctl); } +/* + * Simple allocator for sequential-only block groups. It only allows + * sequential allocation. No need to play with trees. This function + * also reserves the bytes as in btrfs_add_reserved_bytes. 
+ */ +static int do_allocation_zoned(struct btrfs_block_group *block_group, + struct find_free_extent_ctl *ffe_ctl, + struct btrfs_block_group **bg_ret) +{ + struct btrfs_space_info *space_info = block_group->space_info; + struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl; + u64 start = block_group->start; + u64 num_bytes = ffe_ctl->num_bytes; + u64 avail; + int ret = 0; + + ASSERT(btrfs_is_zoned(block_group->fs_info)); + + spin_lock(&space_info->lock); + spin_lock(&block_group->lock); + + if (block_group->ro) { + ret = 1; + goto out; + } + + avail = block_group->length - block_group->alloc_offset; + if (avail < num_bytes) { + ffe_ctl->max_extent_size = avail; + ret = 1; + goto out; + } + + ffe_ctl->found_offset = start + block_group->alloc_offset; + block_group->alloc_offset += num_bytes; + spin_lock(&ctl->tree_lock); + ctl->free_space -= num_bytes; + spin_unlock(&ctl->tree_lock); + + /* + * We do not check if found_offset is aligned to stripesize. The + * address is anyway rewritten when using zone append writing. 
+ */ + + ffe_ctl->search_start = ffe_ctl->found_offset; + +out: + spin_unlock(&block_group->lock); + spin_unlock(&space_info->lock); + return ret; +} + static int do_allocation(struct btrfs_block_group *block_group, struct find_free_extent_ctl *ffe_ctl, struct btrfs_block_group **bg_ret) @@ -3821,6 +3874,8 @@ static int do_allocation(struct btrfs_block_group *block_group, switch (ffe_ctl->policy) { case BTRFS_EXTENT_ALLOC_CLUSTERED: return do_allocation_clustered(block_group, ffe_ctl, bg_ret); + case BTRFS_EXTENT_ALLOC_ZONED: + return do_allocation_zoned(block_group, ffe_ctl, bg_ret); default: BUG(); } @@ -3835,6 +3890,9 @@ static void release_block_group(struct btrfs_block_group *block_group, ffe_ctl->retry_clustered = false; ffe_ctl->retry_unclustered = false; break; + case BTRFS_EXTENT_ALLOC_ZONED: + /* nothing to do */ + break; default: BUG(); } @@ -3863,6 +3921,9 @@ static void found_extent(struct find_free_extent_ctl *ffe_ctl, case BTRFS_EXTENT_ALLOC_CLUSTERED: found_extent_clustered(ffe_ctl, ins); break; + case BTRFS_EXTENT_ALLOC_ZONED: + /* nothing to do */ + break; default: BUG(); } @@ -3878,6 +3939,9 @@ static int chunk_allocation_failed(struct find_free_extent_ctl *ffe_ctl) */ ffe_ctl->loop = LOOP_NO_EMPTY_SIZE; return 0; + case BTRFS_EXTENT_ALLOC_ZONED: + /* give up here */ + return -ENOSPC; default: BUG(); } @@ -4046,6 +4110,9 @@ static int prepare_allocation(struct btrfs_fs_info *fs_info, case BTRFS_EXTENT_ALLOC_CLUSTERED: return prepare_allocation_clustered(fs_info, ffe_ctl, space_info, ins); + case BTRFS_EXTENT_ALLOC_ZONED: + /* nothing to do */ + return 0; default: BUG(); } @@ -4109,6 +4176,9 @@ static noinline int find_free_extent(struct btrfs_root *root, ffe_ctl.last_ptr = NULL; ffe_ctl.use_cluster = true; + if (btrfs_is_zoned(fs_info)) + ffe_ctl.policy = BTRFS_EXTENT_ALLOC_ZONED; + ins->type = BTRFS_EXTENT_ITEM_KEY; ins->objectid = 0; ins->offset = 0; @@ -4251,20 +4321,23 @@ static noinline int find_free_extent(struct btrfs_root *root, /* move 
on to the next group */ if (ffe_ctl.search_start + num_bytes > block_group->start + block_group->length) { - btrfs_add_free_space(block_group, ffe_ctl.found_offset, - num_bytes); + btrfs_add_free_space_unused(block_group, + ffe_ctl.found_offset, + num_bytes); goto loop; } if (ffe_ctl.found_offset < ffe_ctl.search_start) - btrfs_add_free_space(block_group, ffe_ctl.found_offset, - ffe_ctl.search_start - ffe_ctl.found_offset); + btrfs_add_free_space_unused(block_group, + ffe_ctl.found_offset, + ffe_ctl.search_start - ffe_ctl.found_offset); ret = btrfs_add_reserved_bytes(block_group, ram_bytes, num_bytes, delalloc); if (ret == -EAGAIN) { - btrfs_add_free_space(block_group, ffe_ctl.found_offset, - num_bytes); + btrfs_add_free_space_unused(block_group, + ffe_ctl.found_offset, + num_bytes); goto loop; } btrfs_inc_block_group_reservations(block_group); diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index cfa466319166..7ad046d33c7e 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -2903,6 +2903,8 @@ u64 btrfs_find_space_for_alloc(struct btrfs_block_group *block_group, u64 align_gap_len = 0; enum btrfs_trim_state align_gap_trim_state = BTRFS_TRIM_STATE_UNTRIMMED; + ASSERT(!btrfs_is_zoned(block_group->fs_info)); + spin_lock(&ctl->tree_lock); entry = find_free_space(ctl, &offset, &bytes_search, block_group->full_stripe_len, max_extent_size); @@ -3034,6 +3036,8 @@ u64 btrfs_alloc_from_cluster(struct btrfs_block_group *block_group, struct rb_node *node; u64 ret = 0; + ASSERT(!btrfs_is_zoned(block_group->fs_info)); + spin_lock(&cluster->lock); if (bytes > cluster->max_size) goto out; @@ -3810,6 +3814,8 @@ int btrfs_trim_block_group(struct btrfs_block_group *block_group, int ret; u64 rem = 0; + ASSERT(!btrfs_is_zoned(block_group->fs_info)); + *trimmed = 0; spin_lock(&block_group->lock); From patchwork Fri Oct 30 13:51:25 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit 
X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869639 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3AD5F92C for ; Fri, 30 Oct 2020 13:53:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0EA8D221FA for ; Fri, 30 Oct 2020 13:53:05 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="RzXiCMOY" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726822AbgJ3NxD (ORCPT ); Fri, 30 Oct 2020 09:53:03 -0400 Received: from esa3.hgst.iphmx.com ([216.71.153.141]:21997 "EHLO esa3.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726812AbgJ3Nwy (ORCPT ); Fri, 30 Oct 2020 09:52:54 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1604065973; x=1635601973; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=QQGbEtdKYhE6rWKNVcyaLM+Ube5SIByZddq0G034/rY=; b=RzXiCMOYIWJOLFdZEg7lgY0FkKPGfxtC8D6v7kOnl4Cq+dYP6Nq/WfBo vLpu9FCAE8u0cAPCoW/623wR/1808ayLlk+xCZqdwDdfKkBfhuLETJRRB ALbagrSw3QC3q5jcBemGXzKEWhVBWNKTKTF0+Uj56Rn5SbrdnqziacuCQ Zhk5oPBd5/RpuP4V/WCTnq7uNqdTOkY1hglugGdXptduNUDAG78o444py K+aILZUFk5IuqQpoahw2a46wnai5ojbxi3weyS2QhFB3IaCx9BQZh9Mfr aqLk0LbYH5WguI7EtyEZYydQ0eb/Gu62eanO6jGnzt3WbVbuPMKJVc/qa A==; IronPort-SDR: mhPQhj5ImoX9UG2xh+pTTgeMnU4kX9Bk512NPQrUqViT+cL6v/H2d0W5YY/nt/nB2iBTI3uMNp pIQnEMXcA/eQKRi9yI34wiTsoYJPWseglzQcZARwna1+EMmgcYHAi+dkD9cpf86dcfDoMYjiyx RhBZf3wwPm44TTnqLbztUVtFd66pQmiezmY4rfJtCIb5/aCdoB3nmHQ4x6+YJ3f6kCuZsQ4l3p Wc/gXgaRFVFydihlrro4JfcKTxCowdHx/SQg4k/8KyQfvdzdR5vDzy+vdKuB+JSgL0cLfCyVNp bgQ= X-IronPort-AV: E=Sophos;i="5.77,433,1596470400"; d="scan'208";a="155806612" Received: from 
uls-op-cesaip01.wdc.com (HELO uls-op-cesaep01.wdc.com) ([199.255.45.14]) by ob1.hgst.iphmx.com with ESMTP; 30 Oct 2020 21:52:43 +0800 IronPort-SDR: HQVOYAKVL2Dcy8KPUEbuxZI3qMN270GVcnPONSaN4QqNloLPq/b3IDGAOQl6JcKPTnWzsmbX7n IE9L9F+O6cz5FPVcZsu5zkKypQHFOpPhtWlx9nTtJuKd7Vk+BNiRM9pmsOtwziLRMQEyPqoje0 mNNlhS6qatijaNBu1U1tjDsEEmFpG15HUAbSR5Bhq2FwvnLyaCD0O1hiOUQrZ1ORivbGOc39w9 INy949KeSwpTuf3KtHM+Jqh/r6XpXR5YcBjxone7+mNOfDXN6Wj4pwKQ08bSobcUrNp5uU8Bkw +Nl7SGEdZOofoCAfmyPR1oLC Received: from uls-op-cesaip01.wdc.com ([10.248.3.36]) by uls-op-cesaep01.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 30 Oct 2020 06:38:58 -0700 IronPort-SDR: a8n9AzINFsib9qfupn2kw+5yaMgdiIQNK1obqZocNCNqLVKJspmUm4Q/a7U5ENBWDQVAfOUP2y b7+RU+v7Y0IaNxYkwE5CubEVKCgFZSybGwy0W1B6XYqHqLRx8+JMtH/RZze8e3TIVYAtEt3mmC mHQ3/YLXNLO3kH8Rn6FXP3SEDtbbklx3hmaRFclzlJWQm4LJ39APtWyZatpwDglH10AHEF4bOj 7J3T3kMb1QvOLugnBr9ibDuufWAyaWNX1n7d/kVkDmKDMudMOZXRYVWGDOD7NwzxOrmqFSJHjm 2Fk= WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip01.wdc.com with ESMTP; 30 Oct 2020 06:52:43 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 18/41] btrfs: reset zones of unused block groups Date: Fri, 30 Oct 2020 22:51:25 +0900 Message-Id: <575e495d534c44aded9e6ae042a9d6bda5c84162.1604065695.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org For a ZONED volume, a block group maps to a zone of the device. When an unused block group is deleted, its zone can be reset to rewind the zone write pointer back to the start of the zone. 
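The reset-eligibility test this patch adds to zoned.h boils down to two conditions: the range must sit on a sequential zone, and it must cover whole zones (start and length both multiples of the zone size). A standalone sketch of the same check (`can_zone_reset` and its `is_sequential` flag are stand-ins for `btrfs_can_zone_reset()` and `btrfs_dev_is_sequential()`, not kernel APIs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A zone can only be reset when the range covers whole sequential zones:
 * both the physical start and the length must be multiples of the zone
 * size. Conventional zones (is_sequential == false) have no write pointer
 * and take the discard path instead.
 */
static bool can_zone_reset(bool is_sequential, uint64_t physical,
			   uint64_t length, uint64_t zone_size)
{
	if (!is_sequential)
		return false;
	if (physical % zone_size || length % zone_size)
		return false;
	return true;
}
```

When the check fails, the caller falls back to a regular discard if the queue supports it, which matches the branch structure added to btrfs_discard_extent() below.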
Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 8 ++++++-- fs/btrfs/extent-tree.c | 17 ++++++++++++----- fs/btrfs/zoned.h | 16 ++++++++++++++++ 3 files changed, 34 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index d67f9cabe5c1..82d556368c85 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1468,8 +1468,12 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info) if (!async_trim_enabled && btrfs_test_opt(fs_info, DISCARD_ASYNC)) goto flip_async; - /* DISCARD can flip during remount */ - trimming = btrfs_test_opt(fs_info, DISCARD_SYNC); + /* + * DISCARD can flip during remount. In ZONED mode, we need + * to reset sequential required zones. + */ + trimming = btrfs_test_opt(fs_info, DISCARD_SYNC) || + btrfs_is_zoned(fs_info); /* Implicit trim during transaction commit. */ if (trimming) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 5e6b4d1712f2..c134746d7417 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -1331,6 +1331,9 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, stripe = bbio->stripes; for (i = 0; i < bbio->num_stripes; i++, stripe++) { + struct btrfs_device *dev = stripe->dev; + u64 physical = stripe->physical; + u64 length = stripe->length; u64 bytes; struct request_queue *req_q; @@ -1338,14 +1341,18 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, ASSERT(btrfs_test_opt(fs_info, DEGRADED)); continue; } + req_q = bdev_get_queue(stripe->dev->bdev); - if (!blk_queue_discard(req_q)) + /* zone reset in ZONED mode */ + if (btrfs_can_zone_reset(dev, physical, length)) + ret = btrfs_reset_device_zone(dev, physical, + length, &bytes); + else if (blk_queue_discard(req_q)) + ret = btrfs_issue_discard(dev->bdev, physical, + length, &bytes); + else continue; - ret = btrfs_issue_discard(stripe->dev->bdev, - stripe->physical, - stripe->length, - &bytes); if (!ret) { discarded_bytes += bytes; } else if (ret != 
-EOPNOTSUPP) { diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 42b1a4217c6b..d13bc6d70ea4 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -193,4 +193,20 @@ static inline u64 btrfs_zone_align(struct btrfs_device *device, u64 pos) return ALIGN(pos, device->zone_info->zone_size); } +static inline bool btrfs_can_zone_reset(struct btrfs_device *device, + u64 physical, u64 length) +{ + u64 zone_size; + + if (!btrfs_dev_is_sequential(device, physical)) + return false; + + zone_size = device->zone_info->zone_size; + if (!IS_ALIGNED(physical, zone_size) || + !IS_ALIGNED(length, zone_size)) + return false; + + return true; +} + #endif From patchwork Fri Oct 30 13:51:26 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869641 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 88C9F14B4 for ; Fri, 30 Oct 2020 13:53:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 54C4C221FA for ; Fri, 30 Oct 2020 13:53:06 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="VzZj5IuT" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726839AbgJ3NxC (ORCPT ); Fri, 30 Oct 2020 09:53:02 -0400 Received: from esa3.hgst.iphmx.com ([216.71.153.141]:22001 "EHLO esa3.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726822AbgJ3Nwy (ORCPT ); Fri, 30 Oct 2020 09:52:54 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1604065974; x=1635601974; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=S/dQskmKO8j3MY3gFEXv2q8BEHlNOVyBWF+Bw+PXmzA=; 
b=VzZj5IuTHESWbB1Qc4C0wlCHtpkM26/jfg3Z3riKrOsWsx9ejm7xaEg1 OJgqph6VSAmhqAcoSdL+EkA4+Fx8w3brw7TC1aOw91jYBVfIKeVVPU2D1 DAq8cH95s/ljiKcsh51p0SbSeQKv1cdrPT4EExJ5LORy9AbajqDpTueLK jKJ6ltGEeqms+b83xoq+OcEthVHjKr6IMvEr26h1KxnRoOT8GfKOTroyW uHDSYtWZ7G9d2+WFZwWefMaSur6ZWgr8ELDyB1P4KIdfpnmCOhwZ/PzCe rIOIICjrE0krUsyyDvDAxqH4PlgJju5YkKer/nxqD3xzGkXMgIuBWAuaW A==; IronPort-SDR: ygbZtGSJaVkqn79plI4+yOwXLuLOzecj1Vr34w3DIu/mtrmmy+ZbLW6eZKLcINdBwC/U94N4eO XQqUcpntuI7btgQnZHVpyaxXn3KB6tSdiAnRauqIvnUPzSdY/y9EJPRrv/ubDaTxzPm0CqlbfQ DVPDDtLFet4sEKUSy6jlQzOrhrgFHUO0XIKL40eNcCVFIPlp7dvNT4Qabkmlfl9AxfNyP9hCqm GxYX/K5CvVOOpKgHTb1b6uDBsPtrE0I2AtICwv5FSzv1TuqMueSeNHdGtPvFqw1r3SuFzDlkBg Uco= X-IronPort-AV: E=Sophos;i="5.77,433,1596470400"; d="scan'208";a="155806615" Received: from uls-op-cesaip01.wdc.com (HELO uls-op-cesaep01.wdc.com) ([199.255.45.14]) by ob1.hgst.iphmx.com with ESMTP; 30 Oct 2020 21:52:45 +0800 IronPort-SDR: seG1wholAwjChyf/gahL4jwgBdcL7EPbQbzl/w7L9EFWi3eCDcwphq6OoQPDt54jAa24XaYKE8 a+QHp29/PpqYj6U70wsa1kBpnX8BUoiwuGkqi6/GoSidhIsvVhi2wpH+HrxJOfajjE9GOvWFHV Q8jN7tlC63rSzgXB/7uCPLCZVAZsKyBBZjSnDfvFfhBveZcGrxP5eRncqkySTbKiIQuGPZ55Av EIiHvpkrrFSzH/ffn2kZxxp4Cf8BNNjoar701SSGwnk7ywVkiREqM8NcSo2Y+LuxaXvLQVfUlS pfnJMSSORl5Gbpd7FBt2i+RH Received: from uls-op-cesaip01.wdc.com ([10.248.3.36]) by uls-op-cesaep01.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 30 Oct 2020 06:38:59 -0700 IronPort-SDR: a9kFZpXppYU01dZKfL24wA++w/F1Vy/pifmxVvf4k6/5q6kdzrcLO2whObMr/2Hri112+c6ntI KZRq1O7GntU/cYpa0A/uKLa1E8JQyL59rfm/dP2Ixndc43ml49jveXqf0aARfMMm0F4btfB8ey 3lrjq0RkG/8DvxUGJ/uALFwO6vufL/nhX9QhohpKRr4KGtyQJMuVoUDBoJv79+liQ0xhjDI4wR 0RwV1Ac98nkPRsIeCdHxx+HX/6+hi848qIiTSFVb5kEixhtWOqgP7M5UeFg0qrR6h3+nS9eJuD O/I= WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip01.wdc.com with ESMTP; 30 Oct 2020 06:52:44 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota 
Subject: [PATCH v9 19/41] btrfs: redirty released extent buffers in ZONED mode Date: Fri, 30 Oct 2020 22:51:26 +0900 Message-Id: X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Tree-manipulating operations like merging nodes often release once-allocated tree nodes. Btrfs cleans such nodes so that pages in the node are not uselessly written out. On ZONED volumes, however, this optimization blocks the following IOs, as cancelling the write-out of the freed blocks breaks the sequential write order expected by the device. This patch introduces a list of clean and unwritten extent buffers that have been released in a transaction. Btrfs redirties these buffers so that btree_write_cache_pages() can send proper bios to the devices. It also clears the entire content of the extent buffer so as not to confuse raw block scanners such as btrfsck. Since clearing the content makes csum_dirty_buffer() complain about a bytenr mismatch, skip the check and checksum using the newly introduced buffer flag EXTENT_BUFFER_NO_CHECK. Signed-off-by: Naohiro Aota --- fs/btrfs/disk-io.c | 8 ++++++++ fs/btrfs/extent-tree.c | 12 +++++++++++- fs/btrfs/extent_io.c | 4 ++++ fs/btrfs/extent_io.h | 2 ++ fs/btrfs/transaction.c | 10 ++++++++++ fs/btrfs/transaction.h | 3 +++ fs/btrfs/tree-log.c | 6 ++++++ fs/btrfs/zoned.c | 37 +++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 6 ++++++ 9 files changed, 87 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index fd8b970ee92c..9750b4e6d538 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -462,6 +462,12 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page) return 0; found_start = btrfs_header_bytenr(eb); + + if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)) { + WARN_ON(found_start != 0); + return 0; + } + /* * Please do not consolidate these warnings into a single if. 
* It is useful to know what went wrong. @@ -4616,6 +4622,8 @@ void btrfs_cleanup_one_transaction(struct btrfs_transaction *cur_trans, EXTENT_DIRTY); btrfs_destroy_pinned_extent(fs_info, &cur_trans->pinned_extents); + btrfs_free_redirty_list(cur_trans); + cur_trans->state =TRANS_STATE_COMPLETED; wake_up(&cur_trans->commit_wait); } diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index c134746d7417..57454ef4c91e 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3421,8 +3421,10 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) { ret = check_ref_cleanup(trans, buf->start); - if (!ret) + if (!ret) { + btrfs_redirty_list_add(trans->transaction, buf); goto out; + } } pin = 0; @@ -3434,6 +3436,13 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans, goto out; } + if (btrfs_is_zoned(fs_info)) { + btrfs_redirty_list_add(trans->transaction, buf); + pin_down_extent(trans, cache, buf->start, buf->len, 1); + btrfs_put_block_group(cache); + goto out; + } + WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)); btrfs_add_free_space(cache, buf->start, buf->len); @@ -4770,6 +4779,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root, __btrfs_tree_lock(buf, nest); btrfs_clean_tree_block(buf); clear_bit(EXTENT_BUFFER_STALE, &buf->bflags); + clear_bit(EXTENT_BUFFER_NO_CHECK, &buf->bflags); btrfs_set_lock_blocking_write(buf); set_extent_buffer_uptodate(buf); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 60f5f68d892d..e91c504fe973 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -24,6 +24,7 @@ #include "rcu-string.h" #include "backref.h" #include "disk-io.h" +#include "zoned.h" static struct kmem_cache *extent_state_cache; static struct kmem_cache *extent_buffer_cache; @@ -4959,6 +4960,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start, btrfs_leak_debug_add(&fs_info->eb_leak_lock, &eb->leak_list, 
&fs_info->allocated_ebs); + INIT_LIST_HEAD(&eb->release_list); spin_lock_init(&eb->refs_lock); atomic_set(&eb->refs, 1); @@ -5744,6 +5746,8 @@ void write_extent_buffer(const struct extent_buffer *eb, const void *srcv, char *src = (char *)srcv; unsigned long i = start >> PAGE_SHIFT; + WARN_ON(test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)); + if (check_eb_range(eb, start, len)) return; diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index f39d02e7f7ef..5f2ccfd0205e 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -30,6 +30,7 @@ enum { EXTENT_BUFFER_IN_TREE, /* write IO error */ EXTENT_BUFFER_WRITE_ERR, + EXTENT_BUFFER_NO_CHECK, }; /* these are flags for __process_pages_contig */ @@ -107,6 +108,7 @@ struct extent_buffer { */ wait_queue_head_t read_lock_wq; struct page *pages[INLINE_EXTENT_BUFFER_PAGES]; + struct list_head release_list; #ifdef CONFIG_BTRFS_DEBUG int spinning_writers; atomic_t spinning_readers; diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 52ada47aff50..5c561cdcce42 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -22,6 +22,7 @@ #include "qgroup.h" #include "block-group.h" #include "space-info.h" +#include "zoned.h" #define BTRFS_ROOT_TRANS_TAG 0 @@ -336,6 +337,8 @@ static noinline int join_transaction(struct btrfs_fs_info *fs_info, spin_lock_init(&cur_trans->dirty_bgs_lock); INIT_LIST_HEAD(&cur_trans->deleted_bgs); spin_lock_init(&cur_trans->dropped_roots_lock); + INIT_LIST_HEAD(&cur_trans->releasing_ebs); + spin_lock_init(&cur_trans->releasing_ebs_lock); list_add_tail(&cur_trans->list, &fs_info->trans_list); extent_io_tree_init(fs_info, &cur_trans->dirty_pages, IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode); @@ -2345,6 +2348,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) goto scrub_continue; } + /* + * At this point, we should have written all the tree blocks + * allocated in this transaction. So it's now safe to free the + * redirtied extent buffers. 
+ */ + btrfs_free_redirty_list(cur_trans); + ret = write_all_supers(fs_info, 0); /* * the super is written, we can safely allow the tree-loggers diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 858d9153a1cd..380e0aaa15b3 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -92,6 +92,9 @@ struct btrfs_transaction { */ atomic_t pending_ordered; wait_queue_head_t pending_wait; + + spinlock_t releasing_ebs_lock; + struct list_head releasing_ebs; }; #define __TRANS_FREEZABLE (1U << 0) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 56cbc1706b6f..5f585cf57383 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -20,6 +20,7 @@ #include "inode-map.h" #include "block-group.h" #include "space-info.h" +#include "zoned.h" /* magic values for the inode_only field in btrfs_log_inode: * @@ -2742,6 +2743,8 @@ static noinline int walk_down_log_tree(struct btrfs_trans_handle *trans, free_extent_buffer(next); return ret; } + btrfs_redirty_list_add( + trans->transaction, next); } else { if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &next->bflags)) clear_extent_buffer_dirty(next); @@ -3277,6 +3280,9 @@ static void free_log_tree(struct btrfs_trans_handle *trans, clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1, EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT); extent_io_tree_release(&log->log_csum_range); + + if (trans && log->node) + btrfs_redirty_list_add(trans->transaction, log->node); btrfs_put_root(log); } diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index d94a2c363a47..b45ca33282d9 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -10,6 +10,7 @@ #include "rcu-string.h" #include "disk-io.h" #include "block-group.h" +#include "transaction.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 @@ -1015,3 +1016,39 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) */ btrfs_free_excluded_extents(cache); } + +void btrfs_redirty_list_add(struct 
btrfs_transaction *trans, + struct extent_buffer *eb) +{ + struct btrfs_fs_info *fs_info = eb->fs_info; + + if (!btrfs_is_zoned(fs_info) || + btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN) || + !list_empty(&eb->release_list)) + return; + + set_extent_buffer_dirty(eb); + set_extent_bits_nowait(&trans->dirty_pages, eb->start, + eb->start + eb->len - 1, EXTENT_DIRTY); + memzero_extent_buffer(eb, 0, eb->len); + set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags); + + spin_lock(&trans->releasing_ebs_lock); + list_add_tail(&eb->release_list, &trans->releasing_ebs); + spin_unlock(&trans->releasing_ebs_lock); + atomic_inc(&eb->refs); +} + +void btrfs_free_redirty_list(struct btrfs_transaction *trans) +{ + spin_lock(&trans->releasing_ebs_lock); + while (!list_empty(&trans->releasing_ebs)) { + struct extent_buffer *eb; + + eb = list_first_entry(&trans->releasing_ebs, + struct extent_buffer, release_list); + list_del_init(&eb->release_list); + free_extent_buffer(eb); + } + spin_unlock(&trans->releasing_ebs_lock); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index d13bc6d70ea4..845623932fa5 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -41,6 +41,9 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical, int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size); int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache); void btrfs_calc_zone_unusable(struct btrfs_block_group *cache); +void btrfs_redirty_list_add(struct btrfs_transaction *trans, + struct extent_buffer *eb); +void btrfs_free_redirty_list(struct btrfs_transaction *trans); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -107,6 +110,9 @@ static inline int btrfs_load_block_group_zone_info( return 0; } static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { } +static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans, + struct 
extent_buffer *eb) { } +static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { } #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Fri Oct 30 13:51:27 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869717 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7C1AE92C for ; Fri, 30 Oct 2020 13:54:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 52B3B2083B for ; Fri, 30 Oct 2020 13:54:00 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="Li3HndKn" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726996AbgJ3Nx7 (ORCPT ); Fri, 30 Oct 2020 09:53:59 -0400 Received: from esa3.hgst.iphmx.com ([216.71.153.141]:22003 "EHLO esa3.hgst.iphmx.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726708AbgJ3NxA (ORCPT ); Fri, 30 Oct 2020 09:53:00 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=wdc.com; i=@wdc.com; q=dns/txt; s=dkim.wdc.com; t=1604065979; x=1635601979; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=75fAmRUO5nDwJooOw48asWhVJ81jKELWO2GhE82J9JA=; b=Li3HndKnO0hj7Cj1HfHxewe5w//qmq4DilNR3snq85hiDun+x9EML6m5 HNKJlK2Hcgd9ifthS7yNPIfEssxp/8zja4nKdRZnh5bJ/gos3cleU/9Lt WkqjqZik1sjTCm0uf+733+v3jUAXCK3o5fDun0cHu0K3IeqZnkkSBbiSK ihuEHr71xGiKw5tRJENfIjRZzmsm/1RP71ahB7u28V7NSvj2aQqj1B9+C yhdMsxvesIarSAsMSrkINSJK9fT8pbokmkrvUlAcV/CeZIWK+HvbA7rvA eCgGFTNCR2T6lQSF352SVmZAjfnd+Jmw6aKQ5kTWML/xtuwWKX7oxl2Ib Q==; IronPort-SDR: FkfNQi74RMkd8mKnOLAUtGlPsEhhQZrg9qSNssiE95ai4rlXynXeli9MtBzqzyAXd//mhD7Rfl 
8biYlkkVadkCwnWssgZ2NTX8uXT4KJYiIQDdSOqppQ1gdPAWzFk/YC2gjhwRYr4YjT3cGlM8D/ c3OHXSGdS4VQJIxo4Me+xhPfHPVgQlGwr/JwyeaJ7QI50o3xzhvx95IDyArOQHkpo9FSQRY/VF 4ClM8/BIk1m0LavoV37oDq78MTE1rdIXe3Y4IPoGUKYSrH8i3kJ/gKseVN1AcSYhSS+/OKGQ9b CaY= X-IronPort-AV: E=Sophos;i="5.77,433,1596470400"; d="scan'208";a="155806617" Received: from uls-op-cesaip01.wdc.com (HELO uls-op-cesaep01.wdc.com) ([199.255.45.14]) by ob1.hgst.iphmx.com with ESMTP; 30 Oct 2020 21:52:46 +0800 IronPort-SDR: M0qOG84xUQNT8jZ5h8W+xABVZgnJCU/Kx4YtUCVJsYwJjWU1lWMQrcXd4DsgYmiTUtMQxHuTDu bY0FB3VYXOu/pTYq/dwwvwO9cdTT2rOW9fiFzaP61Td2kf1czRwWhld1ejvBQrhEqXlisTTLx7 JY1QlTudBdhijgvsQgvJ0hqBrO0IOvVg879iCMIuVBYOqW/89weeuUea2jpcKEGJSvdkunKhwH 9jGKMQWg2XNLhTn5XLQ9DH9zRTjvokB/88OzmFf3oIbwkiI6884DWFMB29z63Q9TIibb+FFMsc NDEyKeUEH1GZnZB7yidWC7gR Received: from uls-op-cesaip01.wdc.com ([10.248.3.36]) by uls-op-cesaep01.wdc.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 30 Oct 2020 06:39:00 -0700 IronPort-SDR: e7JzAysIXUEdQsx08XAZNIWGtmSWDB2UU/6Fde7TDpSfPeXFryEhu6+P7/PeVo7SVYLoitatSD TZ3MRXbAm+AFRExCIdeVPlP+DgYMQlUVK8wtzov2G7pF7AIpKSzJyK0SpHNrVcER7cAphRad6c /RUWGmr6vmtVP84TuybPlCZqeTR9Wpasrv1WBPNx+2XvBlSwprn9Zw2QMBjAPD97cdByGhhimk W0yIDJnmAol0zK1kYFUqFs3OIOe74DQ3FbinnWZjIcFuwVJi+Pmw0RW0fosRFKcETR5HDIsfI1 G68= WDCIronportException: Internal Received: from naota.dhcp.fujisawa.hgst.com ([10.149.52.155]) by uls-op-cesaip01.wdc.com with ESMTP; 30 Oct 2020 06:52:45 -0700 From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 20/41] btrfs: extract page adding function Date: Fri, 30 Oct 2020 22:51:27 +0900 Message-Id: X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This commit extracts the page-adding part of submit_extent_page() into a new helper. 
The page is added only when the bio_flags match, the page is contiguous, and it fits in the same stripe as the pages already in the bio. The condition checks are reordered to return early and avoid a possibly heavy btrfs_bio_fits_in_stripe() call. Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/extent_io.c | 55 ++++++++++++++++++++++++++++++++------------ 1 file changed, 40 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index e91c504fe973..17285048fb5a 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3012,6 +3012,43 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size) return bio; } +/** + * btrfs_bio_add_page - attempt to add a page to bio + * @bio: destination bio + * @page: page to add to the bio + * @logical: offset of the new bio or to check whether we are adding + * a contiguous page to the previous one + * @pg_offset: starting offset in the page + * @size: portion of page that we want to write + * @prev_bio_flags: flags of previous bio to see if we can merge the current one + * @bio_flags: flags of the current bio to see if we can merge them + * + * Attempt to add a page to the bio considering stripe alignment etc. Return + * true if the page was successfully added, false otherwise. 
+ */ +bool btrfs_bio_add_page(struct bio *bio, struct page *page, u64 logical, + unsigned int size, unsigned int pg_offset, + unsigned long prev_bio_flags, unsigned long bio_flags) +{ + sector_t sector = logical >> SECTOR_SHIFT; + bool contig; + + if (prev_bio_flags != bio_flags) + return false; + + if (prev_bio_flags & EXTENT_BIO_COMPRESSED) + contig = bio->bi_iter.bi_sector == sector; + else + contig = bio_end_sector(bio) == sector; + if (!contig) + return false; + + if (btrfs_bio_fits_in_stripe(page, size, bio, bio_flags)) + return false; + + return bio_add_page(bio, page, size, pg_offset) == size; +} + /* * @opf: bio REQ_OP_* and REQ_* flags as one value * @wbc: optional writeback control for io accounting @@ -3040,27 +3077,15 @@ static int submit_extent_page(unsigned int opf, int ret = 0; struct bio *bio; size_t page_size = min_t(size_t, size, PAGE_SIZE); - sector_t sector = offset >> 9; struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree; ASSERT(bio_ret); if (*bio_ret) { - bool contig; - bool can_merge = true; - bio = *bio_ret; - if (prev_bio_flags & EXTENT_BIO_COMPRESSED) - contig = bio->bi_iter.bi_sector == sector; - else - contig = bio_end_sector(bio) == sector; - - if (btrfs_bio_fits_in_stripe(page, page_size, bio, bio_flags)) - can_merge = false; - - if (prev_bio_flags != bio_flags || !contig || !can_merge || - force_bio_submit || - bio_add_page(bio, page, page_size, pg_offset) < page_size) { + if (force_bio_submit || + !btrfs_bio_add_page(bio, page, offset, page_size, pg_offset, + prev_bio_flags, bio_flags)) { ret = submit_one_bio(bio, mirror_num, prev_bio_flags); if (ret < 0) { *bio_ret = NULL; From patchwork Fri Oct 30 13:51:28 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869719 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) 
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 21/41] btrfs: use bio_add_zone_append_page for zoned btrfs Date: Fri, 30 Oct 2020 22:51:28 +0900 A zoned device has its own hardware restrictions, e.g. max_zone_append_size, when using REQ_OP_ZONE_APPEND. To comply with these restrictions, use bio_add_zone_append_page() instead of bio_add_page(). bio_add_zone_append_page() needs the target device, so this commit reads the chunk information to memoize the target device in btrfs_io_bio(bio)->device. Currently, zoned btrfs only supports the SINGLE profile. In the future, btrfs_io_bio can hold an extent_map and check the restrictions for all the devices the bio will be mapped to.
Signed-off-by: Naohiro Aota --- fs/btrfs/extent_io.c | 37 ++++++++++++++++++++++++++++++++++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 17285048fb5a..764257eb658f 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -3032,6 +3032,7 @@ bool btrfs_bio_add_page(struct bio *bio, struct page *page, u64 logical, { sector_t sector = logical >> SECTOR_SHIFT; bool contig; + int ret; if (prev_bio_flags != bio_flags) return false; @@ -3046,7 +3047,19 @@ bool btrfs_bio_add_page(struct bio *bio, struct page *page, u64 logical, if (btrfs_bio_fits_in_stripe(page, size, bio, bio_flags)) return false; - return bio_add_page(bio, page, size, pg_offset) == size; + if (bio_op(bio) == REQ_OP_ZONE_APPEND) { + struct bio orig_bio; + + memset(&orig_bio, 0, sizeof(orig_bio)); + bio_copy_dev(&orig_bio, bio); + bio_set_dev(bio, btrfs_io_bio(bio)->device->bdev); + ret = bio_add_zone_append_page(bio, page, size, pg_offset); + bio_copy_dev(bio, &orig_bio); + } else { + ret = bio_add_page(bio, page, size, pg_offset); + } + + return ret == size; } /* @@ -3077,7 +3090,9 @@ static int submit_extent_page(unsigned int opf, int ret = 0; struct bio *bio; size_t page_size = min_t(size_t, size, PAGE_SIZE); - struct extent_io_tree *tree = &BTRFS_I(page->mapping->host)->io_tree; + struct btrfs_inode *inode = BTRFS_I(page->mapping->host); + struct extent_io_tree *tree = &inode->io_tree; + struct btrfs_fs_info *fs_info = inode->root->fs_info; ASSERT(bio_ret); @@ -3108,11 +3123,27 @@ static int submit_extent_page(unsigned int opf, if (wbc) { struct block_device *bdev; - bdev = BTRFS_I(page->mapping->host)->root->fs_info->fs_devices->latest_bdev; + bdev = fs_info->fs_devices->latest_bdev; bio_set_dev(bio, bdev); wbc_init_bio(wbc, bio); wbc_account_cgroup_owner(wbc, page, page_size); } + if (btrfs_is_zoned(fs_info) && + bio_op(bio) == REQ_OP_ZONE_APPEND) { + struct extent_map *em; + struct map_lookup *map; + + em = 
btrfs_get_chunk_map(fs_info, offset, page_size); + if (IS_ERR(em)) + return PTR_ERR(em); + + map = em->map_lookup; + /* only support SINGLE profile for now */ + ASSERT(map->num_stripes == 1); + btrfs_io_bio(bio)->device = map->stripes[0].dev; + + free_extent_map(em); + } *bio_ret = bio; From patchwork Fri Oct 30 13:51:29 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869721
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 22/41] btrfs: handle REQ_OP_ZONE_APPEND as writing Date: Fri, 30 Oct 2020 22:51:29 +0900 Message-Id: <4f9bb21cb378fa0b123caf56c37c06dedccbbf1f.1604065695.git.naohiro.aota@wdc.com> ZONED
btrfs uses REQ_OP_ZONE_APPEND bios for writing to actual devices. Let btrfs_end_bio() and btrfs_op be aware of it. Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/disk-io.c | 4 ++-- fs/btrfs/inode.c | 10 +++++----- fs/btrfs/volumes.c | 8 ++++---- fs/btrfs/volumes.h | 1 + 4 files changed, 12 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 9750b4e6d538..778716e223ff 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -652,7 +652,7 @@ static void end_workqueue_bio(struct bio *bio) fs_info = end_io_wq->info; end_io_wq->status = bio->bi_status; - if (bio_op(bio) == REQ_OP_WRITE) { + if (btrfs_op(bio) == BTRFS_MAP_WRITE) { if (end_io_wq->metadata == BTRFS_WQ_ENDIO_METADATA) wq = fs_info->endio_meta_write_workers; else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_FREE_SPACE) @@ -827,7 +827,7 @@ blk_status_t btrfs_submit_metadata_bio(struct inode *inode, struct bio *bio, int async = check_async_write(fs_info, BTRFS_I(inode)); blk_status_t ret; - if (bio_op(bio) != REQ_OP_WRITE) { + if (btrfs_op(bio) != BTRFS_MAP_WRITE) { /* * called for a read, do the setup so that checksum validation * can happen in the async kernel threads diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 936c3137c646..591ca539e444 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2192,7 +2192,7 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio, if (btrfs_is_free_space_inode(BTRFS_I(inode))) metadata = BTRFS_WQ_ENDIO_FREE_SPACE; - if (bio_op(bio) != REQ_OP_WRITE) { + if (btrfs_op(bio) != BTRFS_MAP_WRITE) { ret = btrfs_bio_wq_end_io(fs_info, bio, metadata); if (ret) goto out; @@ -7526,7 +7526,7 @@ static void btrfs_dio_private_put(struct btrfs_dio_private *dip) if (!refcount_dec_and_test(&dip->refs)) return; - if (bio_op(dip->dio_bio) == REQ_OP_WRITE) { + if (btrfs_op(dip->dio_bio) == BTRFS_MAP_WRITE) { __endio_write_update_ordered(BTRFS_I(dip->inode), dip->logical_offset, dip->bytes, @@ -7695,7 +7695,7 
@@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio, { struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); struct btrfs_dio_private *dip = bio->bi_private; - bool write = bio_op(bio) == REQ_OP_WRITE; + bool write = btrfs_op(bio) == BTRFS_MAP_WRITE; blk_status_t ret; /* Check btrfs_submit_bio_hook() for rules about async submit. */ @@ -7746,7 +7746,7 @@ static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio, struct inode *inode, loff_t file_offset) { - const bool write = (bio_op(dio_bio) == REQ_OP_WRITE); + const bool write = (btrfs_op(dio_bio) == BTRFS_MAP_WRITE); const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM); size_t dip_size; struct btrfs_dio_private *dip; @@ -7777,7 +7777,7 @@ static struct btrfs_dio_private *btrfs_create_dio_private(struct bio *dio_bio, static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap, struct bio *dio_bio, loff_t file_offset) { - const bool write = (bio_op(dio_bio) == REQ_OP_WRITE); + const bool write = (btrfs_op(dio_bio) == BTRFS_MAP_WRITE); const bool csum = !(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM); struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); const bool raid56 = (btrfs_data_alloc_profile(fs_info) & diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 9e69222086ae..a0056b205964 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6446,7 +6446,7 @@ static void btrfs_end_bio(struct bio *bio) struct btrfs_device *dev = btrfs_io_bio(bio)->device; ASSERT(dev->bdev); - if (bio_op(bio) == REQ_OP_WRITE) + if (btrfs_op(bio) == BTRFS_MAP_WRITE) btrfs_dev_stat_inc_and_print(dev, BTRFS_DEV_STAT_WRITE_ERRS); else if (!(bio->bi_opf & REQ_RAHEAD)) @@ -6559,10 +6559,10 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio, atomic_set(&bbio->stripes_pending, bbio->num_stripes); if ((bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) && - ((bio_op(bio) == REQ_OP_WRITE) || (mirror_num > 1))) { + ((btrfs_op(bio) == 
BTRFS_MAP_WRITE) || (mirror_num > 1))) { /* In this case, map_length has been set to the length of a single stripe; not the whole write */ - if (bio_op(bio) == REQ_OP_WRITE) { + if (btrfs_op(bio) == BTRFS_MAP_WRITE) { ret = raid56_parity_write(fs_info, bio, bbio, map_length); } else { @@ -6585,7 +6585,7 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio, dev = bbio->stripes[dev_nr].dev; if (!dev || !dev->bdev || test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) || - (bio_op(first_bio) == REQ_OP_WRITE && + (btrfs_op(first_bio) == BTRFS_MAP_WRITE && !test_bit(BTRFS_DEV_STATE_WRITEABLE, &dev->dev_state))) { bbio_error(bbio, first_bio, logical); continue; diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 0249aca668fb..cff1f7689eac 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -410,6 +410,7 @@ static inline enum btrfs_map_op btrfs_op(struct bio *bio) case REQ_OP_DISCARD: return BTRFS_MAP_DISCARD; case REQ_OP_WRITE: + case REQ_OP_ZONE_APPEND: return BTRFS_MAP_WRITE; default: WARN_ON_ONCE(1); From patchwork Fri Oct 30 13:51:30 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869723
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 23/41] btrfs: split ordered extent when bio is sent Date: Fri, 30 Oct 2020 22:51:30 +0900 Message-Id: <003ea43d3ee954cdb95efa0638a3fdc289cb34c0.1604065695.git.naohiro.aota@wdc.com> For a zone append write, the device decides the location the data is written to. Therefore we cannot ensure that two bios are written consecutively on the device. In order to ensure that an ordered extent maps to a contiguous region on disk, we need to maintain a "one bio == one ordered extent" rule. This commit implements the splitting of an ordered extent and extent map on bio submission to adhere to the rule.
Signed-off-by: Naohiro Aota --- fs/btrfs/inode.c | 89 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/ordered-data.c | 76 +++++++++++++++++++++++++++++++++++ fs/btrfs/ordered-data.h | 2 + 3 files changed, 167 insertions(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 591ca539e444..6b2569dfc3bd 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2158,6 +2158,86 @@ static blk_status_t btrfs_submit_bio_start(void *private_data, struct bio *bio, return btrfs_csum_one_bio(BTRFS_I(inode), bio, 0, 0); } +int extract_ordered_extent(struct inode *inode, struct bio *bio, + loff_t file_offset) +{ + struct btrfs_ordered_extent *ordered; + struct extent_map *em = NULL, *em_new = NULL; + struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree; + u64 start = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT; + u64 len = bio->bi_iter.bi_size; + u64 end = start + len; + u64 ordered_end; + u64 pre, post; + int ret = 0; + + ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), file_offset); + if (WARN_ON_ONCE(!ordered)) + return -EIO; + + /* no need to split */ + if (ordered->disk_num_bytes == len) + goto out; + + /* cannot split once end_bio'd ordered extent */ + if (WARN_ON_ONCE(ordered->bytes_left != ordered->disk_num_bytes)) { + ret = -EINVAL; + goto out; + } + + /* we cannot split compressed ordered extent */ + if (WARN_ON_ONCE(ordered->disk_num_bytes != ordered->num_bytes)) { + ret = -EINVAL; + goto out; + } + + /* cannot split waited ordered extent */ + if (WARN_ON_ONCE(wq_has_sleeper(&ordered->wait))) { + ret = -EINVAL; + goto out; + } + + ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes; + /* bio must be in one ordered extent */ + if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) { + ret = -EINVAL; + goto out; + } + + /* checksum list should be empty */ + if (WARN_ON_ONCE(!list_empty(&ordered->list))) { + ret = -EINVAL; + goto out; + } + + pre = start - ordered->disk_bytenr; + post = ordered_end - end; + + 
btrfs_split_ordered_extent(ordered, pre, post); + + read_lock(&em_tree->lock); + em = lookup_extent_mapping(em_tree, ordered->file_offset, len); + if (!em) { + read_unlock(&em_tree->lock); + ret = -EIO; + goto out; + } + read_unlock(&em_tree->lock); + + ASSERT(!test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)); + em_new = create_io_em(BTRFS_I(inode), em->start + pre, len, + em->start + pre, em->block_start + pre, len, + len, len, BTRFS_COMPRESS_NONE, + BTRFS_ORDERED_REGULAR); + free_extent_map(em_new); + +out: + free_extent_map(em); + btrfs_put_ordered_extent(ordered); + + return ret; +} + /* * extent_io.c submission hook. This does the right thing for csum calculation * on write, or reading the csums from the tree before a read. @@ -2192,6 +2272,15 @@ blk_status_t btrfs_submit_data_bio(struct inode *inode, struct bio *bio, if (btrfs_is_free_space_inode(BTRFS_I(inode))) metadata = BTRFS_WQ_ENDIO_FREE_SPACE; + if (bio_op(bio) == REQ_OP_ZONE_APPEND) { + struct page *page = bio_first_bvec_all(bio)->bv_page; + loff_t file_offset = page_offset(page); + + ret = extract_ordered_extent(inode, bio, file_offset); + if (ret) + goto out; + } + if (btrfs_op(bio) != BTRFS_MAP_WRITE) { ret = btrfs_bio_wq_end_io(fs_info, bio, metadata); if (ret) diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 87bac9ecdf4c..28fb9d5f48d3 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -943,6 +943,82 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start, } } +static void clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos, + u64 len) +{ + struct inode *inode = ordered->inode; + u64 file_offset = ordered->file_offset + pos; + u64 disk_bytenr = ordered->disk_bytenr + pos; + u64 num_bytes = len; + u64 disk_num_bytes = len; + int type; + unsigned long flags_masked = + ordered->flags & ~(1 << BTRFS_ORDERED_DIRECT); + int compress_type = ordered->compress_type; + unsigned long weight; + + weight = hweight_long(flags_masked); + 
WARN_ON_ONCE(weight > 1); + if (!weight) + type = 0; + else + type = __ffs(flags_masked); + + if (test_bit(BTRFS_ORDERED_COMPRESSED, &ordered->flags)) { + WARN_ON_ONCE(1); + btrfs_add_ordered_extent_compress(BTRFS_I(inode), file_offset, + disk_bytenr, num_bytes, + disk_num_bytes, type, + compress_type); + } else if (test_bit(BTRFS_ORDERED_DIRECT, &ordered->flags)) { + btrfs_add_ordered_extent_dio(BTRFS_I(inode), file_offset, + disk_bytenr, num_bytes, + disk_num_bytes, type); + } else { + btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, + disk_bytenr, num_bytes, disk_num_bytes, + type); + } +} + +void btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, + u64 post) +{ + struct inode *inode = ordered->inode; + struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree; + struct rb_node *node; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + + spin_lock_irq(&tree->lock); + /* remove from tree once */ + node = &ordered->rb_node; + rb_erase(node, &tree->tree); + RB_CLEAR_NODE(node); + if (tree->last == node) + tree->last = NULL; + + ordered->file_offset += pre; + ordered->disk_bytenr += pre; + ordered->num_bytes -= (pre + post); + ordered->disk_num_bytes -= (pre + post); + ordered->bytes_left -= (pre + post); + + /* re-insert the node */ + node = tree_insert(&tree->tree, ordered->file_offset, + &ordered->rb_node); + if (node) + btrfs_panic(fs_info, -EEXIST, + "inconsistency in ordered tree at offset %llu", + ordered->file_offset); + + spin_unlock_irq(&tree->lock); + + if (pre) + clone_ordered_extent(ordered, 0, pre); + if (post) + clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post); +} + int __init ordered_data_init(void) { btrfs_ordered_extent_cache = kmem_cache_create("btrfs_ordered_extent", diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index c3a2325e64a4..e346b03bd66a 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -193,6 +193,8 @@ void btrfs_wait_ordered_roots(struct 
btrfs_fs_info *fs_info, u64 nr, void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start, u64 end, struct extent_state **cached_state); +void btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, + u64 post); int __init ordered_data_init(void); void __cold ordered_data_exit(void); From patchwork Fri Oct 30 13:51:31 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869643
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 24/41] btrfs: extend btrfs_rmap_block for specifying a device Date: Fri, 30 Oct 2020 22:51:31 +0900 Message-Id: <3ee4958e7ebcc06973ed2d7c84a9cf9240d6e7d7.1604065695.git.naohiro.aota@wdc.com>
btrfs_rmap_block currently reverse-maps the physical addresses on all devices to the corresponding logical addresses. This commit extends the function to match a specific device. The old functionality of querying all devices is left intact by specifying NULL as the target device. We pass a block_device instead of a btrfs_device to __btrfs_rmap_block, since the function is intended to reverse-map the result of a bio, which only has a block_device. This commit also exports the function for later use. Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 28 ++++++++++++++++++++++------ fs/btrfs/block-group.h | 3 +++ 2 files changed, 25 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 82d556368c85..21e40046dce1 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1644,8 +1644,11 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags) } /** - * btrfs_rmap_block - Map a physical disk address to a list of logical addresses + * __btrfs_rmap_block - Map a physical disk address to a list of logical + * addresses * @chunk_start: logical address of block group + * @bdev: physical device to resolve. Can be NULL to indicate any + * device. * @physical: physical address to map to logical addresses * @logical: return array of logical addresses which map to @physical * @naddrs: length of @logical @@ -1655,9 +1658,9 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags) * Used primarily to exclude those portions of a block group that contain super * block copies.
*/ -EXPORT_FOR_TESTS -int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, - u64 physical, u64 **logical, int *naddrs, int *stripe_len) +int __btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, + struct block_device *bdev, u64 physical, u64 **logical, + int *naddrs, int *stripe_len) { struct extent_map *em; struct map_lookup *map; @@ -1675,6 +1678,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, map = em->map_lookup; data_stripe_length = em->orig_block_len; io_stripe_size = map->stripe_len; + chunk_start = em->start; /* For RAID5/6 adjust to a full IO stripe length */ if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) @@ -1689,14 +1693,18 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, for (i = 0; i < map->num_stripes; i++) { bool already_inserted = false; u64 stripe_nr; + u64 offset; int j; if (!in_range(physical, map->stripes[i].physical, data_stripe_length)) continue; + if (bdev && map->stripes[i].dev->bdev != bdev) + continue; + stripe_nr = physical - map->stripes[i].physical; - stripe_nr = div64_u64(stripe_nr, map->stripe_len); + stripe_nr = div64_u64_rem(stripe_nr, map->stripe_len, &offset); if (map->type & BTRFS_BLOCK_GROUP_RAID10) { stripe_nr = stripe_nr * map->num_stripes + i; @@ -1710,7 +1718,7 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, * instead of map->stripe_len */ - bytenr = chunk_start + stripe_nr * io_stripe_size; + bytenr = chunk_start + stripe_nr * io_stripe_size + offset; /* Ensure we don't add duplicate addresses */ for (j = 0; j < nr; j++) { @@ -1732,6 +1740,14 @@ int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, return ret; } +EXPORT_FOR_TESTS +int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, + u64 physical, u64 **logical, int *naddrs, int *stripe_len) +{ + return __btrfs_rmap_block(fs_info, chunk_start, NULL, physical, logical, + naddrs, stripe_len); +} + static int exclude_super_stripes(struct 
btrfs_block_group *cache) { struct btrfs_fs_info *fs_info = cache->fs_info; diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 5be47f4bfea7..401e9bcefaec 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -275,6 +275,9 @@ void check_system_chunk(struct btrfs_trans_handle *trans, const u64 type); u64 btrfs_get_alloc_profile(struct btrfs_fs_info *fs_info, u64 orig_flags); void btrfs_put_block_group_cache(struct btrfs_fs_info *info); int btrfs_free_block_groups(struct btrfs_fs_info *info); +int __btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, + struct block_device *bdev, u64 physical, u64 **logical, + int *naddrs, int *stripe_len); static inline u64 btrfs_data_alloc_profile(struct btrfs_fs_info *fs_info) {

From patchwork Fri Oct 30 13:51:32 2020
Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869645
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Johannes Thumshirn
Subject: [PATCH v9 25/41] btrfs: use ZONE_APPEND write for ZONED btrfs
Date: Fri, 30 Oct 2020 22:51:32 +0900
Message-Id: <51cdc96d68f84eb93a310a96b6b7ad6e070dd1ac.1604065695.git.naohiro.aota@wdc.com>
X-Mailer: git-send-email 2.27.0
Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

This commit enables zone append writing for zoned btrfs. When using zone append, a bio is issued to the start of a target zone and the device decides where to place it inside the zone. Upon completion, the device reports the actual written position back to the host.

Three parts are necessary to enable zone append in btrfs. First, modify the bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the bi_sector to point to the beginning of the zone. Second, record the returned physical address (and disk/partno) in the ordered extent in end_bio_extent_writepage() after the bio has completed. We cannot resolve the physical address to the logical address there, because we can neither take locks nor allocate a buffer in this end_bio context; so we record the physical address and resolve it later in btrfs_finish_ordered_io(). Finally, rewrite the logical addresses of the extent mapping and checksum data according to the physical address (using __btrfs_rmap_block). If the returned address matches the originally allocated address, we can skip this rewriting process. 
Signed-off-by: Johannes Thumshirn Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/extent_io.c | 12 +++++++- fs/btrfs/file.c | 2 +- fs/btrfs/inode.c | 4 +++ fs/btrfs/ordered-data.c | 3 ++ fs/btrfs/ordered-data.h | 8 +++++ fs/btrfs/volumes.c | 9 ++++++ fs/btrfs/zoned.c | 68 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 9 ++++++ 8 files changed, 113 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 764257eb658f..3f49febafc69 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2743,6 +2743,7 @@ static void end_bio_extent_writepage(struct bio *bio) u64 start; u64 end; struct bvec_iter_all iter_all; + bool first_bvec = true; ASSERT(!bio_flagged(bio, BIO_CLONED)); bio_for_each_segment_all(bvec, bio, iter_all) { @@ -2769,6 +2770,11 @@ static void end_bio_extent_writepage(struct bio *bio) start = page_offset(page); end = start + bvec->bv_offset + bvec->bv_len - 1; + if (first_bvec) { + btrfs_record_physical_zoned(inode, start, bio); + first_bvec = false; + } + end_extent_writepage(page, error, start, end); end_page_writeback(page); } @@ -3531,6 +3537,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode, size_t blocksize; int ret = 0; int nr = 0; + int opf = REQ_OP_WRITE; const unsigned int write_flags = wbc_to_write_flags(wbc); bool compressed; @@ -3543,6 +3550,9 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode, return 1; } + if (btrfs_is_zoned(inode->root->fs_info)) + opf = REQ_OP_ZONE_APPEND; + /* * we don't want to touch the inode after unlocking the page, * so we update the mapping writeback index now @@ -3603,7 +3613,7 @@ static noinline_for_stack int __extent_writepage_io(struct btrfs_inode *inode, page->index, cur, end); } - ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, + ret = submit_extent_page(opf | write_flags, wbc, page, offset, iosize, pg_offset, &epd->bio, end_bio_extent_writepage, diff --git 
a/fs/btrfs/file.c b/fs/btrfs/file.c index 68938a43081e..f41cdcbf44f5 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2226,7 +2226,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) * the current transaction commits before the ordered extents complete * and a power failure happens right after that. */ - if (full_sync) { + if (full_sync || fs_info->zoned) { ret = btrfs_wait_ordered_range(inode, start, len); } else { /* diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 6b2569dfc3bd..6d40affa833f 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -51,6 +51,7 @@ #include "delalloc-space.h" #include "block-group.h" #include "space-info.h" +#include "zoned.h" struct btrfs_iget_args { u64 ino; @@ -2676,6 +2677,9 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) bool clear_reserved_extent = true; unsigned int clear_bits; + if (ordered_extent->disk) + btrfs_rewrite_logical_zoned(ordered_extent); + start = ordered_extent->file_offset; end = start + ordered_extent->num_bytes - 1; diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 28fb9d5f48d3..b2d185d4479e 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -199,6 +199,9 @@ static int __btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset entry->compress_type = compress_type; entry->truncated_len = (u64)-1; entry->qgroup_rsv = ret; + entry->physical = (u64)-1; + entry->disk = NULL; + entry->partno = (u8)-1; if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE) set_bit(type, &entry->flags); diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h index e346b03bd66a..084c609afd83 100644 --- a/fs/btrfs/ordered-data.h +++ b/fs/btrfs/ordered-data.h @@ -127,6 +127,14 @@ struct btrfs_ordered_extent { struct completion completion; struct btrfs_work flush_work; struct list_head work_list; + + /* + * used to reverse-map physical address returned from ZONE_APPEND + * write command in a 
workqueue context. + */ + u64 physical; + struct gendisk *disk; + u8 partno; }; /* diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index a0056b205964..26669007d367 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6498,6 +6498,15 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio, btrfs_io_bio(bio)->device = dev; bio->bi_end_io = btrfs_end_bio; bio->bi_iter.bi_sector = physical >> 9; + /* + * For zone append writing, bi_sector must point the beginning of the + * zone + */ + if (bio_op(bio) == REQ_OP_ZONE_APPEND) { + u64 zone_start = round_down(physical, fs_info->zone_size); + + bio->bi_iter.bi_sector = zone_start >> SECTOR_SHIFT; + } btrfs_debug_in_rcu(fs_info, "btrfs_map_bio: rw %d 0x%x, sector=%llu, dev=%lu (%s id %llu), size=%u", bio_op(bio), bio->bi_opf, (u64)bio->bi_iter.bi_sector, diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index b45ca33282d9..50393d560c9a 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -1052,3 +1052,71 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans) } spin_unlock(&trans->releasing_ebs_lock); } + +void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset, + struct bio *bio) +{ + struct btrfs_ordered_extent *ordered; + u64 physical = (u64)bio->bi_iter.bi_sector << SECTOR_SHIFT; + + if (bio_op(bio) != REQ_OP_ZONE_APPEND) + return; + + ordered = btrfs_lookup_ordered_extent(BTRFS_I(inode), file_offset); + if (WARN_ON(!ordered)) + return; + + ordered->physical = physical; + ordered->disk = bio->bi_disk; + ordered->partno = bio->bi_partno; + + btrfs_put_ordered_extent(ordered); +} + +void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) +{ + struct extent_map_tree *em_tree; + struct extent_map *em; + struct inode *inode = ordered->inode; + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_ordered_sum *sum; + struct block_device *bdev; + u64 orig_logical = ordered->disk_bytenr; + u64 *logical = NULL; + int nr, stripe_len; + + bdev = 
bdget_disk(ordered->disk, ordered->partno); + if (WARN_ON(!bdev)) + return; + + if (WARN_ON(__btrfs_rmap_block(fs_info, orig_logical, bdev, + ordered->physical, &logical, &nr, + &stripe_len))) + goto out; + + WARN_ON(nr != 1); + + if (orig_logical == *logical) + goto out; + + ordered->disk_bytenr = *logical; + + em_tree = &BTRFS_I(inode)->extent_tree; + write_lock(&em_tree->lock); + em = search_extent_mapping(em_tree, ordered->file_offset, + ordered->num_bytes); + em->block_start = *logical; + free_extent_map(em); + write_unlock(&em_tree->lock); + + list_for_each_entry(sum, &ordered->list, list) { + if (*logical < orig_logical) + sum->bytenr -= orig_logical - *logical; + else + sum->bytenr += *logical - orig_logical; + } + +out: + kfree(logical); + bdput(bdev); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 845623932fa5..d3ed4d7dae2b 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -44,6 +44,9 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache); void btrfs_redirty_list_add(struct btrfs_transaction *trans, struct extent_buffer *eb); void btrfs_free_redirty_list(struct btrfs_transaction *trans); +void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset, + struct bio *bio); +void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -113,6 +116,12 @@ static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { } static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans, struct extent_buffer *eb) { } static inline void btrfs_free_redirty_list(struct btrfs_transaction *trans) { } +static inline void btrfs_record_physical_zoned(struct inode *inode, + u64 file_offset, struct bio *bio) +{ +} +static inline void +btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) { } #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device 
*device, u64 pos)

From patchwork Fri Oct 30 13:51:33 2020
Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869691
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 26/41] btrfs: enable zone append writing for direct IO
Date: Fri, 30 Oct 2020 22:51:33 +0900
Message-Id: <0ac6b4d4630d97838ea6532ba81c8ebf1cfaffd3.1604065695.git.naohiro.aota@wdc.com>
X-Mailer: git-send-email 2.27.0
Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

As with buffered IO, enable zone append writing for direct IO when it is used on a zoned block device. 
Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/inode.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 6d40affa833f..bc853a4f22cc 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7542,6 +7542,9 @@ static int btrfs_dio_iomap_begin(struct inode *inode, loff_t start, iomap->bdev = fs_info->fs_devices->latest_bdev; iomap->length = len; + if (write && btrfs_is_zoned(fs_info) && fs_info->max_zone_append_size) + iomap->flags |= IOMAP_F_ZONE_APPEND; + free_extent_map(em); return 0; @@ -7779,6 +7782,8 @@ static void btrfs_end_dio_bio(struct bio *bio) if (err) dip->dio_bio->bi_status = err; + btrfs_record_physical_zoned(dip->inode, dip->logical_offset, bio); + bio_put(bio); btrfs_dio_private_put(dip); } @@ -7933,6 +7938,18 @@ static blk_qc_t btrfs_submit_direct(struct inode *inode, struct iomap *iomap, bio->bi_end_io = btrfs_end_dio_bio; btrfs_io_bio(bio)->logical = file_offset; + WARN_ON_ONCE(write && btrfs_is_zoned(fs_info) && + fs_info->max_zone_append_size && + bio_op(bio) != REQ_OP_ZONE_APPEND); + + if (bio_op(bio) == REQ_OP_ZONE_APPEND) { + ret = extract_ordered_extent(inode, bio, file_offset); + if (ret) { + bio_put(bio); + goto out_err; + } + } + ASSERT(submit_len >= clone_len); submit_len -= clone_len;

From patchwork Fri Oct 30 13:51:34 2020
Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869715
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 27/41] btrfs: introduce dedicated data write path for ZONED mode
Date: Fri, 30 Oct 2020 22:51:34 +0900
Message-Id: <72df5edeab150be3f081b0d96b174285f238eb0f.1604065695.git.naohiro.aota@wdc.com>
X-Mailer: git-send-email 2.27.0
Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

If more than one IO is issued for one file extent, these IOs can end up written to separate regions on a device. Since we cannot map one file extent to such separate areas, we need to follow the "one IO == one ordered extent" rule.

The normal buffered, uncompressed, non-pre-allocated write path (used by cow_file_range()) sometimes does not follow this rule: it can write out only part of an ordered extent when a specific region is requested, e.g., when it is called from fdatasync().

Introduce a dedicated (uncompressed, buffered) data write path for ZONED mode. This write path CoWs the region and writes it out at once. 
Signed-off-by: Naohiro Aota --- fs/btrfs/inode.c | 34 ++++++++++++++++++++++++++++++++-- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index bc853a4f22cc..fdc367a39194 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1350,6 +1350,29 @@ static int cow_file_range_async(struct btrfs_inode *inode, return 0; } +static noinline int run_delalloc_zoned(struct btrfs_inode *inode, + struct page *locked_page, u64 start, + u64 end, int *page_started, + unsigned long *nr_written) +{ + int ret; + + ret = cow_file_range(inode, locked_page, start, end, + page_started, nr_written, 0); + if (ret) + return ret; + + if (*page_started) + return 0; + + __set_page_dirty_nobuffers(locked_page); + account_page_redirty(locked_page); + extent_write_locked_range(&inode->vfs_inode, start, end, WB_SYNC_ALL); + *page_started = 1; + + return 0; +} + static noinline int csum_exist_in_range(struct btrfs_fs_info *fs_info, u64 bytenr, u64 num_bytes) { @@ -1820,17 +1843,24 @@ int btrfs_run_delalloc_range(struct btrfs_inode *inode, struct page *locked_page { int ret; int force_cow = need_force_cow(inode, start, end); + int do_compress = inode_can_compress(inode) && + inode_need_compress(inode, start, end); + bool zoned = btrfs_is_zoned(inode->root->fs_info); if (inode->flags & BTRFS_INODE_NODATACOW && !force_cow) { + ASSERT(!zoned); ret = run_delalloc_nocow(inode, locked_page, start, end, page_started, 1, nr_written); } else if (inode->flags & BTRFS_INODE_PREALLOC && !force_cow) { + ASSERT(!zoned); ret = run_delalloc_nocow(inode, locked_page, start, end, page_started, 0, nr_written); - } else if (!inode_can_compress(inode) || - !inode_need_compress(inode, start, end)) { + } else if (!do_compress && !zoned) { ret = cow_file_range(inode, locked_page, start, end, page_started, nr_written, 1); + } else if (!do_compress && zoned) { + ret = run_delalloc_zoned(inode, locked_page, start, end, + page_started, nr_written); } else { 
set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT, &inode->runtime_flags); ret = cow_file_range_async(inode, wbc, locked_page, start, end,

From patchwork Fri Oct 30 13:51:35 2020
Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869647
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 28/41] btrfs: serialize meta IOs on ZONED mode
Date: Fri, 30 Oct 2020 22:51:35 +0900
Message-Id: <61771fe28bda89abcdb55b2a00be05eb82d2216e.1604065695.git.naohiro.aota@wdc.com>
X-Mailer: git-send-email 2.27.0
Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org

We cannot use zone append for writing metadata, because the B-tree nodes have references to each other using the logical address. 
Without knowing the address in advance, we cannot construct the tree in the first place. So we need to serialize write IOs for metadata. We cannot add a mutex around allocation and submission, because metadata blocks are allocated in an earlier stage to build up the B-trees. Add a zoned_meta_io_lock and hold it during metadata IO submission in btree_write_cache_pages() to serialize IOs. Furthermore, this adds a per-block group metadata IO submission pointer, "meta_write_pointer", to ensure sequential writing, which can otherwise be broken when writing back blocks of an unfinished transaction.

Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/block-group.h | 1 + fs/btrfs/ctree.h | 1 + fs/btrfs/disk-io.c | 1 + fs/btrfs/extent_io.c | 27 ++++++++++++++++++++++- fs/btrfs/zoned.c | 50 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 31 ++++++++++++++++++++++++++ 6 files changed, 110 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 401e9bcefaec..b2a8a3beceac 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -190,6 +190,7 @@ struct btrfs_block_group { */ u64 alloc_offset; u64 zone_unusable; + u64 meta_write_pointer; }; static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 383c83a1f5b5..736f679f1310 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -955,6 +955,7 @@ struct btrfs_fs_info { }; /* max size to emit ZONE_APPEND write command */ u64 max_zone_append_size; + struct mutex zoned_meta_io_lock; #ifdef CONFIG_BTRFS_FS_REF_VERIFY spinlock_t ref_verify_lock; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 778716e223ff..f02b121d8213 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2652,6 +2652,7 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info) mutex_init(&fs_info->delete_unused_bgs_mutex); mutex_init(&fs_info->reloc_mutex); mutex_init(&fs_info->delalloc_root_mutex); + 
mutex_init(&fs_info->zoned_meta_io_lock); seqlock_init(&fs_info->profiles_lock); INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 3f49febafc69..3cce444d5dbb 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -25,6 +25,7 @@ #include "backref.h" #include "disk-io.h" #include "zoned.h" +#include "block-group.h" static struct kmem_cache *extent_state_cache; static struct kmem_cache *extent_buffer_cache; @@ -4001,6 +4002,7 @@ int btree_write_cache_pages(struct address_space *mapping, struct writeback_control *wbc) { struct extent_buffer *eb, *prev_eb = NULL; + struct btrfs_block_group *cache = NULL; struct extent_page_data epd = { .bio = NULL, .extent_locked = 0, @@ -4035,6 +4037,7 @@ int btree_write_cache_pages(struct address_space *mapping, tag = PAGECACHE_TAG_TOWRITE; else tag = PAGECACHE_TAG_DIRTY; + btrfs_zoned_meta_io_lock(fs_info); retry: if (wbc->sync_mode == WB_SYNC_ALL) tag_pages_for_writeback(mapping, index, end); @@ -4077,12 +4080,30 @@ int btree_write_cache_pages(struct address_space *mapping, if (!ret) continue; + if (!btrfs_check_meta_write_pointer(fs_info, eb, + &cache)) { + /* + * If for_sync, this hole will be filled with + * transaction commit. 
+ */ + if (wbc->sync_mode == WB_SYNC_ALL && + !wbc->for_sync) + ret = -EAGAIN; + else + ret = 0; + done = 1; + free_extent_buffer(eb); + break; + } + prev_eb = eb; ret = lock_extent_buffer_for_io(eb, &epd); if (!ret) { + btrfs_revert_meta_write_pointer(cache, eb); free_extent_buffer(eb); continue; } else if (ret < 0) { + btrfs_revert_meta_write_pointer(cache, eb); done = 1; free_extent_buffer(eb); break; @@ -4115,10 +4136,12 @@ int btree_write_cache_pages(struct address_space *mapping, index = 0; goto retry; } + if (cache) + btrfs_put_block_group(cache); ASSERT(ret <= 0); if (ret < 0) { end_write_bio(&epd, ret); - return ret; + goto out; } /* * If something went wrong, don't allow any metadata write bio to be @@ -4153,6 +4176,8 @@ int btree_write_cache_pages(struct address_space *mapping, ret = -EROFS; end_write_bio(&epd, ret); } +out: + btrfs_zoned_meta_io_unlock(fs_info); return ret; } diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 50393d560c9a..15bc7d451348 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -989,6 +989,9 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) ret = -EIO; } + if (!ret) + cache->meta_write_pointer = cache->alloc_offset + cache->start; + kfree(alloc_offsets); free_extent_map(em); @@ -1120,3 +1123,50 @@ void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) kfree(logical); bdput(bdev); } + +bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, + struct extent_buffer *eb, + struct btrfs_block_group **cache_ret) +{ + struct btrfs_block_group *cache; + + if (!btrfs_is_zoned(fs_info)) + return true; + + cache = *cache_ret; + + if (cache && (eb->start < cache->start || + cache->start + cache->length <= eb->start)) { + btrfs_put_block_group(cache); + cache = NULL; + *cache_ret = NULL; + } + + if (!cache) + cache = btrfs_lookup_block_group(fs_info, eb->start); + + if (cache) { + *cache_ret = cache; + + if (cache->meta_write_pointer != eb->start) { + btrfs_put_block_group(cache); + 
cache = NULL; + *cache_ret = NULL; + return false; + } + + cache->meta_write_pointer = eb->start + eb->len; + } + + return true; +} + +void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, + struct extent_buffer *eb) +{ + if (!btrfs_is_zoned(eb->fs_info) || !cache) + return; + + ASSERT(cache->meta_write_pointer == eb->start + eb->len); + cache->meta_write_pointer = eb->start; +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index d3ed4d7dae2b..262b248a9085 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -47,6 +47,11 @@ void btrfs_free_redirty_list(struct btrfs_transaction *trans); void btrfs_record_physical_zoned(struct inode *inode, u64 file_offset, struct bio *bio); void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered); +bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, + struct extent_buffer *eb, + struct btrfs_block_group **cache_ret); +void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, + struct extent_buffer *eb); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -122,6 +127,18 @@ static inline void btrfs_record_physical_zoned(struct inode *inode, } static inline void btrfs_rewrite_logical_zoned(struct btrfs_ordered_extent *ordered) { } +static inline bool +btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, + struct extent_buffer *eb, + struct btrfs_block_group **cache_ret) +{ + return true; +} +static inline void +btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, + struct extent_buffer *eb) +{ +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) @@ -224,4 +241,18 @@ static inline bool btrfs_can_zone_reset(struct btrfs_device *device, return true; } +static inline void btrfs_zoned_meta_io_lock(struct btrfs_fs_info *fs_info) +{ + if (!btrfs_is_zoned(fs_info)) + return; + mutex_lock(&fs_info->zoned_meta_io_lock); +} + +static inline void 
btrfs_zoned_meta_io_unlock(struct btrfs_fs_info *fs_info) +{ + if (!btrfs_is_zoned(fs_info)) + return; + mutex_unlock(&fs_info->zoned_meta_io_lock); +} + #endif From patchwork Fri Oct 30 13:51:36 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869711
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Josef Bacik Subject: [PATCH v9 29/41] btrfs: wait existing extents before truncating Date: Fri, 30 Oct 2020 22:51:36 +0900 Message-Id: <764d5b98df15deecef0541f872ee6df756986d46.1604065695.git.naohiro.aota@wdc.com> When truncating a file, file buffers which have already been allocated but not
yet written may be truncated. Truncating these buffers could cause breakage of a sequential write pattern in a block group if the truncated blocks are, for example, followed by blocks allocated to another file. To avoid this problem, always wait for the write-out of all unwritten buffers before proceeding with the truncate execution. Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/inode.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index fdc367a39194..de8b58abed1d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -4955,6 +4955,16 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr) btrfs_drew_write_unlock(&root->snapshot_lock); btrfs_end_transaction(trans); } else { + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + + if (btrfs_is_zoned(fs_info)) { + ret = btrfs_wait_ordered_range( + inode, + ALIGN(newsize, fs_info->sectorsize), + (u64)-1); + if (ret) + return ret; + } /* * We're truncating a file that used to have good data down to From patchwork Fri Oct 30 13:51:37 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869713
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 30/41] btrfs: avoid async metadata checksum on ZONED mode Date: Fri, 30 Oct 2020 22:51:37 +0900 Message-Id: <3434830b6fadf644c5a47eaaea6759c375b39ad7.1604065695.git.naohiro.aota@wdc.com> In ZONED mode, btrfs uses the per-FS zoned_meta_io_lock to serialize metadata write IOs. Even with this serialization, write bios sent from btree_write_cache_pages can be reordered by the async checksum workers, as these workers are per CPU and not per zone. To preserve write bio ordering, disable async metadata checksum on ZONED. This does not lower performance on HDDs, as a single CPU core is fast enough to checksum a single zone write stream at the maximum possible bandwidth of the device. If multiple zones are written simultaneously, HDD seek overhead lowers the achievable maximum bandwidth, so per-zone checksum serialization again does not affect performance.
Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/disk-io.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index f02b121d8213..2b30ef8a7034 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -813,6 +813,8 @@ static blk_status_t btree_submit_bio_start(void *private_data, struct bio *bio, static int check_async_write(struct btrfs_fs_info *fs_info, struct btrfs_inode *bi) { + if (btrfs_is_zoned(fs_info)) + return 0; if (atomic_read(&bi->sync_writers)) return 0; if (test_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags)) From patchwork Fri Oct 30 13:51:38 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869705
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 31/41] btrfs: mark block groups to copy
for device-replace Date: Fri, 30 Oct 2020 22:51:38 +0900 Message-Id: <500edb962a76f3b9a6df9a6183efdc814d68d4d6.1604065695.git.naohiro.aota@wdc.com> This is the 1/4 patch to support device-replace in ZONED mode. We have two types of I/Os during the device-replace process. One is an I/O to "copy" (by the scrub functions) all the device extents on the source device to the destination device. The other is an I/O to "clone" (by handle_ops_on_dev_replace()) new incoming write I/Os from users to the source device into the target device. Cloning incoming I/Os can break the sequential write rule in the target device: when a write is mapped into the middle of a block group, the I/O is directed to the middle of a target device zone, which breaks the sequential write rule. However, the cloning function cannot simply be disabled, since incoming I/Os targeting already-copied device extents must be cloned so that the I/O is executed on the target device. We cannot use dev_replace->cursor_{left,right} to determine whether a bio is going to a not-yet-copied region. Since there is a time gap between finishing btrfs_scrub_dev() and rewriting the mapping tree in btrfs_dev_replace_finishing(), we can have a newly allocated device extent which is never cloned nor copied. So the point is to copy only the already existing device extents. This patch introduces mark_block_group_to_copy() to mark existing block groups as targets of copying. Then, handle_ops_on_dev_replace() and the dev-replace process can check the flag to do their jobs.
Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.h | 1 + fs/btrfs/dev-replace.c | 175 +++++++++++++++++++++++++++++++++++++++++ fs/btrfs/dev-replace.h | 3 + fs/btrfs/scrub.c | 17 ++++ 4 files changed, 196 insertions(+) diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index b2a8a3beceac..e91123495d68 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -95,6 +95,7 @@ struct btrfs_block_group { unsigned int iref:1; unsigned int has_caching_ctl:1; unsigned int removed:1; + unsigned int to_copy:1; int disk_cache_state; diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 5e3554482af1..e86aff38aea4 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -22,6 +22,7 @@ #include "dev-replace.h" #include "sysfs.h" #include "zoned.h" +#include "block-group.h" /* * Device replace overview @@ -437,6 +438,176 @@ static char* btrfs_dev_name(struct btrfs_device *device) return rcu_str_deref(device->name); } +static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info, + struct btrfs_device *src_dev) +{ + struct btrfs_path *path; + struct btrfs_key key; + struct btrfs_key found_key; + struct btrfs_root *root = fs_info->dev_root; + struct btrfs_dev_extent *dev_extent = NULL; + struct btrfs_block_group *cache; + struct extent_buffer *l; + struct btrfs_trans_handle *trans; + int slot; + int ret = 0; + u64 chunk_offset, length; + + /* Do not use "to_copy" on non-ZONED for now */ + if (!btrfs_fs_incompat(fs_info, ZONED)) + return 0; + + mutex_lock(&fs_info->chunk_mutex); + + /* ensure we don't have a pending new block group */ + while (fs_info->running_transaction && + !list_empty(&fs_info->running_transaction->dev_update_list)) { + mutex_unlock(&fs_info->chunk_mutex); + trans = btrfs_attach_transaction(root); + if (IS_ERR(trans)) { + ret = PTR_ERR(trans); + mutex_lock(&fs_info->chunk_mutex); + if (ret == -ENOENT) + continue; + else + goto out; + } + + ret = btrfs_commit_transaction(trans); +
mutex_lock(&fs_info->chunk_mutex); + if (ret) + goto out; + } + + path = btrfs_alloc_path(); + if (!path) { + ret = -ENOMEM; + goto out; + } + + path->reada = READA_FORWARD; + path->search_commit_root = 1; + path->skip_locking = 1; + + key.objectid = src_dev->devid; + key.offset = 0ull; + key.type = BTRFS_DEV_EXTENT_KEY; + + while (1) { + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); + if (ret < 0) + break; + if (ret > 0) { + if (path->slots[0] >= + btrfs_header_nritems(path->nodes[0])) { + ret = btrfs_next_leaf(root, path); + if (ret < 0) + break; + if (ret > 0) { + ret = 0; + break; + } + } else { + ret = 0; + } + } + + l = path->nodes[0]; + slot = path->slots[0]; + + btrfs_item_key_to_cpu(l, &found_key, slot); + + if (found_key.objectid != src_dev->devid) + break; + + if (found_key.type != BTRFS_DEV_EXTENT_KEY) + break; + + if (found_key.offset < key.offset) + break; + + dev_extent = btrfs_item_ptr(l, slot, struct btrfs_dev_extent); + length = btrfs_dev_extent_length(l, dev_extent); + + chunk_offset = btrfs_dev_extent_chunk_offset(l, dev_extent); + + cache = btrfs_lookup_block_group(fs_info, chunk_offset); + if (!cache) + goto skip; + + spin_lock(&cache->lock); + cache->to_copy = 1; + spin_unlock(&cache->lock); + + btrfs_put_block_group(cache); + +skip: + key.offset = found_key.offset + length; + btrfs_release_path(path); + } + + btrfs_free_path(path); +out: + mutex_unlock(&fs_info->chunk_mutex); + + return ret; +} + +bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev, + struct btrfs_block_group *cache, + u64 physical) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct extent_map *em; + struct map_lookup *map; + u64 chunk_offset = cache->start; + int num_extents, cur_extent; + int i; + + /* Do not use "to_copy" on non-ZONED for now */ + if (!btrfs_fs_incompat(fs_info, ZONED)) + return true; + + spin_lock(&cache->lock); + if (cache->removed) { + spin_unlock(&cache->lock); + return true; + } + spin_unlock(&cache->lock); + + em = 
btrfs_get_chunk_map(fs_info, chunk_offset, 1); + BUG_ON(IS_ERR(em)); + map = em->map_lookup; + + num_extents = cur_extent = 0; + for (i = 0; i < map->num_stripes; i++) { + /* we have more device extent to copy */ + if (srcdev != map->stripes[i].dev) + continue; + + num_extents++; + if (physical == map->stripes[i].physical) + cur_extent = i; + } + + free_extent_map(em); + + if (num_extents > 1 && cur_extent < num_extents - 1) { + /* + * Has more stripes on this device. Keep this BG + * readonly until we finish all the stripes. + */ + return false; + } + + /* last stripe on this device */ + spin_lock(&cache->lock); + cache->to_copy = 0; + spin_unlock(&cache->lock); + + return true; +} + static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info, const char *tgtdev_name, u64 srcdevid, const char *srcdev_name, int read_src) @@ -478,6 +649,10 @@ static int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info, if (ret) return ret; + ret = mark_block_group_to_copy(fs_info, src_device); + if (ret) + return ret; + down_write(&dev_replace->rwsem); switch (dev_replace->replace_state) { case BTRFS_IOCTL_DEV_REPLACE_STATE_NEVER_STARTED: diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h index 60b70dacc299..3911049a5f23 100644 --- a/fs/btrfs/dev-replace.h +++ b/fs/btrfs/dev-replace.h @@ -18,5 +18,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info); void btrfs_dev_replace_suspend_for_unmount(struct btrfs_fs_info *fs_info); int btrfs_resume_dev_replace_async(struct btrfs_fs_info *fs_info); int __pure btrfs_dev_replace_is_ongoing(struct btrfs_dev_replace *dev_replace); +bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev, + struct btrfs_block_group *cache, + u64 physical); #endif diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index aa1b36cf5c88..d0d7db3c8b0b 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3500,6 +3500,17 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, if (!cache) goto skip; + + if (sctx->is_dev_replace && 
btrfs_fs_incompat(fs_info, ZONED)) { + spin_lock(&cache->lock); + if (!cache->to_copy) { + spin_unlock(&cache->lock); + ro_set = 0; + goto done; + } + spin_unlock(&cache->lock); + } + /* * Make sure that while we are scrubbing the corresponding block * group doesn't get its logical address and its device extents @@ -3631,6 +3642,12 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, scrub_pause_off(fs_info); + if (sctx->is_dev_replace && + !btrfs_finish_block_group_to_copy(dev_replace->srcdev, + cache, found_key.offset)) + ro_set = 0; + +done: down_write(&dev_replace->rwsem); dev_replace->cursor_left = dev_replace->cursor_right; dev_replace->item_needs_writeback = 1; From patchwork Fri Oct 30 13:51:39 2020 X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869699
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc:
hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 32/41] btrfs: implement cloning for ZONED device-replace Date: Fri, 30 Oct 2020 22:51:39 +0900 This is the 2/4 patch to implement device-replace for ZONED mode. In ZONED mode, a block group must be either copied (from the source device to the destination device) or cloned (to both devices). This commit implements the cloning part. If a block group targeted by an IO is marked to copy, we should not clone the IO to the destination device, because the block group is eventually copied by the replace process. This commit also handles cloning of device resets. Signed-off-by: Naohiro Aota --- fs/btrfs/dev-replace.c | 4 ++-- fs/btrfs/extent-tree.c | 20 ++++++++++++++++++-- fs/btrfs/scrub.c | 2 +- fs/btrfs/volumes.c | 33 +++++++++++++++++++++++++++++++-- fs/btrfs/zoned.c | 11 +++++++++++ 5 files changed, 63 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index e86aff38aea4..848f8063dc1c 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -454,7 +454,7 @@ static int mark_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 chunk_offset, length; /* Do not use "to_copy" on non-ZONED for now */ - if (!btrfs_fs_incompat(fs_info, ZONED)) + if (!btrfs_is_zoned(fs_info)) return 0; mutex_lock(&fs_info->chunk_mutex); @@ -565,7 +565,7 @@ bool btrfs_finish_block_group_to_copy(struct btrfs_device *srcdev, int i; /* Do not use "to_copy" on non-ZONED for now */ - if (!btrfs_fs_incompat(fs_info, ZONED)) + if (!btrfs_is_zoned(fs_info)) return true; spin_lock(&cache->lock); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 57454ef4c91e..1bb95d5aaed2 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -35,6 +35,7 @@ #include "discard.h" #include "rcu-string.h" #include
"zoned.h" +#include "dev-replace.h" #undef SCRAMBLE_DELAYED_REFS @@ -1336,6 +1337,8 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, u64 length = stripe->length; u64 bytes; struct request_queue *req_q; + struct btrfs_dev_replace *dev_replace = + &fs_info->dev_replace; if (!stripe->dev->bdev) { ASSERT(btrfs_test_opt(fs_info, DEGRADED)); @@ -1344,15 +1347,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr, req_q = bdev_get_queue(stripe->dev->bdev); /* zone reset in ZONED mode */ - if (btrfs_can_zone_reset(dev, physical, length)) + if (btrfs_can_zone_reset(dev, physical, length)) { ret = btrfs_reset_device_zone(dev, physical, length, &bytes); - else if (blk_queue_discard(req_q)) + if (ret) + goto next; + if (!btrfs_dev_replace_is_ongoing( + dev_replace) || + dev != dev_replace->srcdev) + goto next; + + discarded_bytes += bytes; + /* send to replace target as well */ + ret = btrfs_reset_device_zone( + dev_replace->tgtdev, + physical, length, &bytes); + } else if (blk_queue_discard(req_q)) ret = btrfs_issue_discard(dev->bdev, physical, length, &bytes); else continue; +next: if (!ret) { discarded_bytes += bytes; } else if (ret != -EOPNOTSUPP) { diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index d0d7db3c8b0b..371bb6437cab 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3501,7 +3501,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, goto skip; - if (sctx->is_dev_replace && btrfs_fs_incompat(fs_info, ZONED)) { + if (sctx->is_dev_replace && btrfs_is_zoned(fs_info)) { spin_lock(&cache->lock); if (!cache->to_copy) { spin_unlock(&cache->lock); diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 26669007d367..920292d0fca7 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5964,9 +5964,29 @@ static int get_extra_mirror_from_replace(struct btrfs_fs_info *fs_info, return ret; } +static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical) +{ + struct btrfs_block_group *cache; + bool 
ret; + + /* non-ZONED mode does not use "to_copy" flag */ + if (!btrfs_is_zoned(fs_info)) + return false; + + cache = btrfs_lookup_block_group(fs_info, logical); + + spin_lock(&cache->lock); + ret = cache->to_copy; + spin_unlock(&cache->lock); + + btrfs_put_block_group(cache); + return ret; +} + static void handle_ops_on_dev_replace(enum btrfs_map_op op, struct btrfs_bio **bbio_ret, struct btrfs_dev_replace *dev_replace, + u64 logical, int *num_stripes_ret, int *max_errors_ret) { struct btrfs_bio *bbio = *bbio_ret; @@ -5979,6 +5999,15 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op, if (op == BTRFS_MAP_WRITE) { int index_where_to_add; + /* + * a block group which have "to_copy" set will + * eventually copied by dev-replace process. We can + * avoid cloning IO here. + */ + if (is_block_group_to_copy(dev_replace->srcdev->fs_info, + logical)) + return; + /* * duplicate the write operations while the dev replace * procedure is running. Since the copying of the old disk to @@ -6374,8 +6403,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op)) { - handle_ops_on_dev_replace(op, &bbio, dev_replace, &num_stripes, - &max_errors); + handle_ops_on_dev_replace(op, &bbio, dev_replace, logical, + &num_stripes, &max_errors); } *bbio_ret = bbio; diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 15bc7d451348..f672465d1bb1 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -11,6 +11,7 @@ #include "disk-io.h" #include "block-group.h" #include "transaction.h" +#include "dev-replace.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 @@ -890,6 +891,8 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) for (i = 0; i < map->num_stripes; i++) { bool is_sequential; struct blk_zone zone; + struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace; + int dev_replace_is_ongoing = 0; device = 
map->stripes[i].dev; physical = map->stripes[i].physical; @@ -916,6 +919,14 @@ int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache) */ btrfs_dev_clear_zone_empty(device, physical); + down_read(&dev_replace->rwsem); + dev_replace_is_ongoing = + btrfs_dev_replace_is_ongoing(dev_replace); + if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL) + btrfs_dev_clear_zone_empty(dev_replace->tgtdev, + physical); + up_read(&dev_replace->rwsem); + /* * The group is mapped to a sequential zone. Get the zone write * pointer to determine the allocation offset within the zone. From patchwork Fri Oct 30 13:51:40 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869689
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 33/41] btrfs: implement copying for
ZONED device-replace Date: Fri, 30 Oct 2020 22:51:40 +0900 Message-Id: <2370c99060300c9438ca1727215d398592f52434.1604065695.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This is the 3/4 patch to implement device-replace on ZONED mode. This commit implements the copying part, which tracks the write pointer during the device-replace process. Since device-replace's copying is smart enough to copy only the used extents on the source device, we have to fill the resulting gaps on the target device to honor the sequential write rule. The device-replace process in ZONED mode must copy or clone all the extents in the source device exactly once, so we need to ensure that allocations started just before the dev-replace process have their corresponding extent information in the B-trees. finish_extent_writes_for_zoned() implements that functionality, which is basically the code removed in commit 042528f8d840 ("Btrfs: fix block group remaining RO forever after error during device replace").
Signed-off-by: Naohiro Aota --- fs/btrfs/scrub.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.c | 12 +++++++ fs/btrfs/zoned.h | 7 ++++ 3 files changed, 105 insertions(+) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 371bb6437cab..aaf7882dee06 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -169,6 +169,7 @@ struct scrub_ctx { int pages_per_rd_bio; int is_dev_replace; + u64 write_pointer; struct scrub_bio *wr_curr_bio; struct mutex wr_lock; @@ -1623,6 +1624,25 @@ static int scrub_write_page_to_dev_replace(struct scrub_block *sblock, return scrub_add_page_to_wr_bio(sblock->sctx, spage); } +static int fill_writer_pointer_gap(struct scrub_ctx *sctx, u64 physical) +{ + int ret = 0; + u64 length; + + if (!btrfs_is_zoned(sctx->fs_info)) + return 0; + + if (sctx->write_pointer < physical) { + length = physical - sctx->write_pointer; + + ret = btrfs_zoned_issue_zeroout(sctx->wr_tgtdev, + sctx->write_pointer, length); + if (!ret) + sctx->write_pointer = physical; + } + return ret; +} + static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx, struct scrub_page *spage) { @@ -1645,6 +1665,13 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx, if (sbio->page_count == 0) { struct bio *bio; + ret = fill_writer_pointer_gap(sctx, + spage->physical_for_dev_replace); + if (ret) { + mutex_unlock(&sctx->wr_lock); + return ret; + } + sbio->physical = spage->physical_for_dev_replace; sbio->logical = spage->logical; sbio->dev = sctx->wr_tgtdev; @@ -1706,6 +1733,10 @@ static void scrub_wr_submit(struct scrub_ctx *sctx) * doubled the write performance on spinning disks when measured * with Linux 3.5 */ btrfsic_submit_bio(sbio->bio); + + if (btrfs_is_zoned(sctx->fs_info)) + sctx->write_pointer = sbio->physical + + sbio->page_count * PAGE_SIZE; } static void scrub_wr_bio_end_io(struct bio *bio) @@ -2973,6 +3004,21 @@ static noinline_for_stack int scrub_raid56_parity(struct scrub_ctx *sctx, return ret < 0 ? 
ret : 0; } +static void sync_replace_for_zoned(struct scrub_ctx *sctx) +{ + if (!btrfs_is_zoned(sctx->fs_info)) + return; + + sctx->flush_all_writes = true; + scrub_submit(sctx); + mutex_lock(&sctx->wr_lock); + scrub_wr_submit(sctx); + mutex_unlock(&sctx->wr_lock); + + wait_event(sctx->list_wait, + atomic_read(&sctx->bios_in_flight) == 0); +} + static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, struct map_lookup *map, struct btrfs_device *scrub_dev, @@ -3105,6 +3151,14 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, */ blk_start_plug(&plug); + if (sctx->is_dev_replace && + btrfs_dev_is_sequential(sctx->wr_tgtdev, physical)) { + mutex_lock(&sctx->wr_lock); + sctx->write_pointer = physical; + mutex_unlock(&sctx->wr_lock); + sctx->flush_all_writes = true; + } + /* * now find all extents for each stripe and scrub them */ @@ -3292,6 +3346,9 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, if (ret) goto out; + if (sctx->is_dev_replace) + sync_replace_for_zoned(sctx); + if (extent_logical + extent_len < key.objectid + bytes) { if (map->type & BTRFS_BLOCK_GROUP_RAID56_MASK) { @@ -3414,6 +3471,25 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx, return ret; } +static int finish_extent_writes_for_zoned(struct btrfs_root *root, + struct btrfs_block_group *cache) +{ + struct btrfs_fs_info *fs_info = cache->fs_info; + struct btrfs_trans_handle *trans; + + if (!btrfs_is_zoned(fs_info)) + return 0; + + btrfs_wait_block_group_reservations(cache); + btrfs_wait_nocow_writers(cache); + btrfs_wait_ordered_roots(fs_info, U64_MAX, cache->start, cache->length); + + trans = btrfs_join_transaction(root); + if (IS_ERR(trans)) + return PTR_ERR(trans); + return btrfs_commit_transaction(trans); +} + static noinline_for_stack int scrub_enumerate_chunks(struct scrub_ctx *sctx, struct btrfs_device *scrub_dev, u64 start, u64 end) @@ -3569,6 +3645,16 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, * group is not RO. 
*/ ret = btrfs_inc_block_group_ro(cache, sctx->is_dev_replace); + if (!ret && sctx->is_dev_replace) { + ret = finish_extent_writes_for_zoned(root, cache); + if (ret) { + btrfs_dec_block_group_ro(cache); + scrub_pause_off(fs_info); + btrfs_put_block_group(cache); + break; + } + } + if (ret == 0) { ro_set = 1; } else if (ret == -ENOSPC && !sctx->is_dev_replace) { diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index f672465d1bb1..1b080184440d 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -1181,3 +1181,15 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, ASSERT(cache->meta_write_pointer == eb->start + eb->len); cache->meta_write_pointer = eb->start; } + +int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, + u64 length) +{ + if (!btrfs_dev_is_sequential(device, physical)) + return -EOPNOTSUPP; + + return blkdev_issue_zeroout(device->bdev, + physical >> SECTOR_SHIFT, + length >> SECTOR_SHIFT, + GFP_NOFS, 0); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index 262b248a9085..fb4f1cfce1e5 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -52,6 +52,8 @@ bool btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info, struct btrfs_block_group **cache_ret); void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, struct extent_buffer *eb); +int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, + u64 length); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -139,6 +141,11 @@ btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, struct extent_buffer *eb) { } +static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device, + u64 physical, u64 length) +{ + return -EOPNOTSUPP; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Fri Oct 30 13:51:41 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 
Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869649
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 34/41] btrfs: support dev-replace in ZONED mode Date: Fri, 30 Oct 2020 22:51:41 +0900 Message-Id: X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This is the 4/4 patch to implement device-replace on ZONED mode. Even after the copying is done, the write pointers of the source device and the destination device may not be synchronized. For example, when the last allocated extent is freed before the device-replace process, that extent is not copied, leaving a hole there. This patch synchronizes the write pointers by writing zeros to the destination device.
Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/scrub.c | 36 +++++++++++++++++++++++++ fs/btrfs/zoned.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/zoned.h | 8 ++++++ 3 files changed, 113 insertions(+) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index aaf7882dee06..0e2211b9c810 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3019,6 +3019,31 @@ static void sync_replace_for_zoned(struct scrub_ctx *sctx) atomic_read(&sctx->bios_in_flight) == 0); } +static int sync_write_pointer_for_zoned(struct scrub_ctx *sctx, u64 logical, + u64 physical, u64 physical_end) +{ + struct btrfs_fs_info *fs_info = sctx->fs_info; + int ret = 0; + + if (!btrfs_is_zoned(fs_info)) + return 0; + + wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0); + + mutex_lock(&sctx->wr_lock); + if (sctx->write_pointer < physical_end) { + ret = btrfs_sync_zone_write_pointer(sctx->wr_tgtdev, logical, + physical, + sctx->write_pointer); + if (ret) + btrfs_err(fs_info, "failed to recover write pointer"); + } + mutex_unlock(&sctx->wr_lock); + btrfs_dev_clear_zone_empty(sctx->wr_tgtdev, physical); + + return ret; +} + static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, struct map_lookup *map, struct btrfs_device *scrub_dev, @@ -3416,6 +3441,17 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, blk_finish_plug(&plug); btrfs_free_path(path); btrfs_free_path(ppath); + + if (sctx->is_dev_replace && ret >= 0) { + int ret2; + + ret2 = sync_write_pointer_for_zoned(sctx, base + offset, + map->stripes[num].physical, + physical_end); + if (ret2) + ret = ret2; + } + return ret < 0 ? 
ret : 0; } diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c index 1b080184440d..f98aa0fae849 100644 --- a/fs/btrfs/zoned.c +++ b/fs/btrfs/zoned.c @@ -12,6 +12,7 @@ #include "block-group.h" #include "transaction.h" #include "dev-replace.h" +#include "space-info.h" /* Maximum number of zones to report per blkdev_report_zones() call */ #define BTRFS_REPORT_NR_ZONES 4096 @@ -1193,3 +1194,71 @@ int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, length >> SECTOR_SHIFT, GFP_NOFS, 0); } + +static int read_zone_info(struct btrfs_fs_info *fs_info, u64 logical, + struct blk_zone *zone) +{ + struct btrfs_bio *bbio = NULL; + u64 mapped_length = PAGE_SIZE; + unsigned int nofs_flag; + int nmirrors; + int i, ret; + + ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical, + &mapped_length, &bbio); + if (ret || !bbio || mapped_length < PAGE_SIZE) { + btrfs_put_bbio(bbio); + return -EIO; + } + + if (bbio->map_type & BTRFS_BLOCK_GROUP_RAID56_MASK) + return -EINVAL; + + nofs_flag = memalloc_nofs_save(); + nmirrors = (int)bbio->num_stripes; + for (i = 0; i < nmirrors; i++) { + u64 physical = bbio->stripes[i].physical; + struct btrfs_device *dev = bbio->stripes[i].dev; + + /* missing device */ + if (!dev->bdev) + continue; + + ret = btrfs_get_dev_zone(dev, physical, zone); + /* failing device */ + if (ret == -EIO || ret == -EOPNOTSUPP) + continue; + break; + } + memalloc_nofs_restore(nofs_flag); + + return ret; +} + +int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical, + u64 physical_start, u64 physical_pos) +{ + struct btrfs_fs_info *fs_info = tgt_dev->fs_info; + struct blk_zone zone; + u64 length; + u64 wp; + int ret; + + if (!btrfs_dev_is_sequential(tgt_dev, physical_pos)) + return 0; + + ret = read_zone_info(fs_info, logical, &zone); + if (ret) + return ret; + + wp = physical_start + ((zone.wp - zone.start) << SECTOR_SHIFT); + + if (physical_pos == wp) + return 0; + + if (physical_pos > wp) + return -EUCLEAN; + + length = wp 
- physical_pos; + return btrfs_zoned_issue_zeroout(tgt_dev, physical_pos, length); +} diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h index fb4f1cfce1e5..bd73f3c48c0c 100644 --- a/fs/btrfs/zoned.h +++ b/fs/btrfs/zoned.h @@ -54,6 +54,8 @@ void btrfs_revert_meta_write_pointer(struct btrfs_block_group *cache, struct extent_buffer *eb); int btrfs_zoned_issue_zeroout(struct btrfs_device *device, u64 physical, u64 length); +int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, u64 logical, + u64 physical_start, u64 physical_pos); #else /* CONFIG_BLK_DEV_ZONED */ static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos, struct blk_zone *zone) @@ -146,6 +148,12 @@ static inline int btrfs_zoned_issue_zeroout(struct btrfs_device *device, { return -EOPNOTSUPP; } +static inline int btrfs_sync_zone_write_pointer(struct btrfs_device *tgt_dev, + u64 logical, u64 physical_start, + u64 physical_pos) +{ + return -EOPNOTSUPP; +} #endif static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos) From patchwork Fri Oct 30 13:51:42 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869707
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota Subject: [PATCH v9 35/41] btrfs: enable relocation in ZONED mode Date: Fri, 30 Oct 2020 22:51:42 +0900 Message-Id: <669d00d499b702413a51364b405280798df9c6c3.1604065695.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org To serialize allocation and submit_bio, we introduced a mutex around them. As a result, preallocation must be completely disabled to avoid a deadlock. Since the current relocation process relies on preallocation to move file data extents, it must be handled in another way. In ZONED mode, we just truncate the inode to the size that we wanted to pre-allocate. Then, we flush the dirty pages on the file before finishing the relocation process. run_delalloc_zoned() will handle all the allocation and submit IOs to the underlying layers. Signed-off-by: Naohiro Aota --- fs/btrfs/relocation.c | 35 +++++++++++++++++++++++++++++++++-- 1 file changed, 33 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index 3602806d71bd..44b697b881b6 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -2603,6 +2603,32 @@ static noinline_for_stack int prealloc_file_extent_cluster( if (ret) return ret; + /* + * In ZONED mode, we cannot preallocate the file region. Instead, we + * dirty and fiemap_write the region.
+ */ + + if (btrfs_is_zoned(inode->root->fs_info)) { + struct btrfs_root *root = inode->root; + struct btrfs_trans_handle *trans; + + end = cluster->end - offset + 1; + trans = btrfs_start_transaction(root, 1); + if (IS_ERR(trans)) + return PTR_ERR(trans); + + inode->vfs_inode.i_ctime = current_time(&inode->vfs_inode); + i_size_write(&inode->vfs_inode, end); + ret = btrfs_update_inode(trans, root, &inode->vfs_inode); + if (ret) { + btrfs_abort_transaction(trans, ret); + btrfs_end_transaction(trans); + return ret; + } + + return btrfs_end_transaction(trans); + } + inode_lock(&inode->vfs_inode); for (nr = 0; nr < cluster->nr; nr++) { start = cluster->boundary[nr] - offset; @@ -2799,6 +2825,8 @@ static int relocate_file_extent_cluster(struct inode *inode, } } WARN_ON(nr != cluster->nr); + if (btrfs_is_zoned(fs_info) && !ret) + ret = btrfs_wait_ordered_range(inode, 0, (u64)-1); out: kfree(ra); return ret; @@ -3434,8 +3462,12 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans, struct btrfs_path *path; struct btrfs_inode_item *item; struct extent_buffer *leaf; + u64 flags = BTRFS_INODE_NOCOMPRESS | BTRFS_INODE_PREALLOC; int ret; + if (btrfs_is_zoned(trans->fs_info)) + flags &= ~BTRFS_INODE_PREALLOC; + path = btrfs_alloc_path(); if (!path) return -ENOMEM; @@ -3450,8 +3482,7 @@ static int __insert_orphan_inode(struct btrfs_trans_handle *trans, btrfs_set_inode_generation(leaf, item, 1); btrfs_set_inode_size(leaf, item, 0); btrfs_set_inode_mode(leaf, item, S_IFREG | 0600); - btrfs_set_inode_flags(leaf, item, BTRFS_INODE_NOCOMPRESS | - BTRFS_INODE_PREALLOC); + btrfs_set_inode_flags(leaf, item, flags); btrfs_mark_buffer_dirty(leaf); out: btrfs_free_path(path); From patchwork Fri Oct 30 13:51:43 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869703
From: Naohiro Aota To: linux-btrfs@vger.kernel.org, dsterba@suse.com Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota , Josef Bacik Subject: [PATCH v9 36/41] btrfs: relocate block group to repair IO failure in ZONED Date: Fri, 30 Oct 2020 22:51:43 +0900 Message-Id: <1a4cf83e2685980c0958e0425b2804bcbc1642a9.1604065695.git.naohiro.aota@wdc.com> X-Mailer: git-send-email 2.27.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org When btrfs finds a checksum error and the file system has a mirror of the damaged data, btrfs reads the correct data from the mirror and writes it to the damaged blocks. This repairing, however, conflicts with the sequential write rule required by zoned devices.
We can consider three methods to repair an IO failure in ZONED mode:

(1) Reset and rewrite the damaged zone
(2) Allocate a new device extent and replace the damaged device extent with the new one
(3) Relocate the corresponding block group

Method (1) is the closest to the behavior on regular devices. However, it also wipes non-damaged data in the same device extent, so it unnecessarily degrades data that is still intact.

Method (2) is much like device replacing, but done within the same device. It is safe because it keeps the old device extent until the replacing finishes. However, extending device replacing for this is non-trivial: the current code assumes "src_dev->physical == dst_dev->physical", and the extent mapping replacement function would have to be extended to support moving a device extent to a different position on the same device.

Method (3) invokes relocation of the damaged block group, so it is straightforward to implement. It relocates all the mirrored device extents, so it is potentially more costly than method (1) or (2). But it relocates only the extents actually in use, which reduces the total IO size.

Let's apply method (3) for now. In the future, we can extend device-replace and apply method (2).

To prevent a block group from being relocated multiple times on repeated IO errors, this commit introduces a "relocating_repair" bit that indicates the group is already being relocated for repair. It also runs the relocation in a new kthread, "btrfs-relocating-repair", so the relocation does not block the IO path. This commit also supports repairing in the scrub process.
Signed-off-by: Naohiro Aota Reviewed-by: Josef Bacik --- fs/btrfs/block-group.h | 1 + fs/btrfs/extent_io.c | 3 ++ fs/btrfs/scrub.c | 3 ++ fs/btrfs/volumes.c | 71 ++++++++++++++++++++++++++++++++++++++++++ fs/btrfs/volumes.h | 1 + 5 files changed, 79 insertions(+) diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index e91123495d68..50e5ddb0a19b 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -96,6 +96,7 @@ struct btrfs_block_group { unsigned int has_caching_ctl:1; unsigned int removed:1; unsigned int to_copy:1; + unsigned int relocating_repair:1; int disk_cache_state; diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 3cce444d5dbb..8ab5161a68b4 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2268,6 +2268,9 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start, ASSERT(!(fs_info->sb->s_flags & SB_RDONLY)); BUG_ON(!mirror_num); + if (btrfs_is_zoned(fs_info)) + return btrfs_repair_one_zone(fs_info, logical); + bio = btrfs_io_bio_alloc(1); bio->bi_iter.bi_size = 0; map_length = length; diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 0e2211b9c810..e6a8df8a8f4f 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -861,6 +861,9 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check) have_csum = sblock_to_check->pagev[0]->have_csum; dev = sblock_to_check->pagev[0]->dev; + if (btrfs_is_zoned(fs_info) && !sctx->is_dev_replace) + return btrfs_repair_one_zone(fs_info, logical); + /* * We must use GFP_NOFS because the scrub task might be waiting for a * worker task executing this function and in turn a transaction commit diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 920292d0fca7..10e678f88f2a 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -7973,3 +7973,74 @@ bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr) spin_unlock(&fs_info->swapfile_pins_lock); return node != NULL; } + +static int relocating_repair_kthread(void 
*data) +{ + struct btrfs_block_group *cache = (struct btrfs_block_group *) data; + struct btrfs_fs_info *fs_info = cache->fs_info; + u64 target; + int ret = 0; + + target = cache->start; + btrfs_put_block_group(cache); + + if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE)) { + btrfs_info(fs_info, + "skip relocating block group %llu to repair: EBUSY", + target); + return -EBUSY; + } + + mutex_lock(&fs_info->delete_unused_bgs_mutex); + + /* ensure Block Group still exists */ + cache = btrfs_lookup_block_group(fs_info, target); + if (!cache) + goto out; + + if (!cache->relocating_repair) + goto out; + + ret = btrfs_may_alloc_data_chunk(fs_info, target); + if (ret < 0) + goto out; + + btrfs_info(fs_info, "relocating block group %llu to repair IO failure", + target); + ret = btrfs_relocate_chunk(fs_info, target); + +out: + if (cache) + btrfs_put_block_group(cache); + mutex_unlock(&fs_info->delete_unused_bgs_mutex); + btrfs_exclop_finish(fs_info); + + return ret; +} + +int btrfs_repair_one_zone(struct btrfs_fs_info *fs_info, u64 logical) +{ + struct btrfs_block_group *cache; + + /* do not attempt to repair in degraded state */ + if (btrfs_test_opt(fs_info, DEGRADED)) + return 0; + + cache = btrfs_lookup_block_group(fs_info, logical); + if (!cache) + return 0; + + spin_lock(&cache->lock); + if (cache->relocating_repair) { + spin_unlock(&cache->lock); + btrfs_put_block_group(cache); + return 0; + } + cache->relocating_repair = 1; + spin_unlock(&cache->lock); + + kthread_run(relocating_repair_kthread, cache, + "btrfs-relocating-repair"); + + return 0; +} diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index cff1f7689eac..7c1ad6901791 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -584,5 +584,6 @@ void btrfs_scratch_superblocks(struct btrfs_fs_info *fs_info, int btrfs_bg_type_to_factor(u64 flags); const char *btrfs_bg_type_to_raid_name(u64 flags); int btrfs_verify_dev_extents(struct btrfs_fs_info *fs_info); +int btrfs_repair_one_zone(struct 
btrfs_fs_info *fs_info, u64 logical); #endif

From patchwork Fri Oct 30 13:51:44 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869709
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Johannes Thumshirn
Subject: [PATCH v9 37/41] btrfs: split alloc_log_tree()
Date: Fri, 30 Oct 2020 22:51:44 +0900
Message-Id: <71b8f94034f04da6f69f1ea0720825aabc852a54.1604065695.git.naohiro.aota@wdc.com>

This is a preparation for the next patch. This commit splits alloc_log_tree() into the part that allocates the tree structure, which remains in alloc_log_tree(), and the part that allocates the tree node, which moves into btrfs_alloc_log_tree_node().
The latter part is also exported to be used in the next patch. Signed-off-by: Johannes Thumshirn Signed-off-by: Naohiro Aota --- fs/btrfs/disk-io.c | 31 +++++++++++++++++++++++++------ fs/btrfs/disk-io.h | 2 ++ 2 files changed, 27 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 2b30ef8a7034..70885f3d3321 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1211,7 +1211,6 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info) { struct btrfs_root *root; - struct extent_buffer *leaf; root = btrfs_alloc_root(fs_info, BTRFS_TREE_LOG_OBJECTID, GFP_NOFS); if (!root) @@ -1221,6 +1220,14 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans, root->root_key.type = BTRFS_ROOT_ITEM_KEY; root->root_key.offset = BTRFS_TREE_LOG_OBJECTID; + return root; +} + +int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans, + struct btrfs_root *root) +{ + struct extent_buffer *leaf; + /* * DON'T set SHAREABLE bit for log trees. 
* @@ -1233,26 +1240,31 @@ static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans, leaf = btrfs_alloc_tree_block(trans, root, 0, BTRFS_TREE_LOG_OBJECTID, NULL, 0, 0, 0, BTRFS_NESTING_NORMAL); - if (IS_ERR(leaf)) { - btrfs_put_root(root); - return ERR_CAST(leaf); - } + if (IS_ERR(leaf)) + return PTR_ERR(leaf); root->node = leaf; btrfs_mark_buffer_dirty(root->node); btrfs_tree_unlock(root->node); - return root; + + return 0; } int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info) { struct btrfs_root *log_root; + int ret; log_root = alloc_log_tree(trans, fs_info); if (IS_ERR(log_root)) return PTR_ERR(log_root); + ret = btrfs_alloc_log_tree_node(trans, log_root); + if (ret) { + kfree(log_root); + return ret; + } WARN_ON(fs_info->log_root_tree); fs_info->log_root_tree = log_root; return 0; @@ -1264,11 +1276,18 @@ int btrfs_add_log_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_root *log_root; struct btrfs_inode_item *inode_item; + int ret; log_root = alloc_log_tree(trans, fs_info); if (IS_ERR(log_root)) return PTR_ERR(log_root); + ret = btrfs_alloc_log_tree_node(trans, log_root); + if (ret) { + btrfs_put_root(log_root); + return ret; + } + log_root->last_trans = trans->transid; log_root->root_key.offset = root->root_key.objectid; diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h index fee69ced58b4..b82ae3711c42 100644 --- a/fs/btrfs/disk-io.h +++ b/fs/btrfs/disk-io.h @@ -115,6 +115,8 @@ blk_status_t btrfs_wq_submit_bio(struct btrfs_fs_info *fs_info, struct bio *bio, extent_submit_bio_start_t *submit_bio_start); blk_status_t btrfs_submit_bio_done(void *private_data, struct bio *bio, int mirror_num); +int btrfs_alloc_log_tree_node(struct btrfs_trans_handle *trans, + struct btrfs_root *root); int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info); int btrfs_add_log_tree(struct btrfs_trans_handle *trans, From patchwork Fri 
Oct 30 13:51:45 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869701
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Johannes Thumshirn
Subject: [PATCH v9 38/41] btrfs: extend zoned allocator to use dedicated tree-log block group
Date: Fri, 30 Oct 2020 22:51:45 +0900
Message-Id: <6640d3c034c9c347958860743501aff59da7a5a0.1604065695.git.naohiro.aota@wdc.com>

This is the 1/3 patch to enable tree-log on ZONED mode. The tree-log feature does not work on ZONED mode as is.
Blocks for a tree-log tree are allocated mixed with other metadata blocks, and btrfs writes and syncs the tree-log blocks to devices at fsync() time, which is a different timing from a global transaction commit. As a result, both writing tree-log blocks and writing other metadata blocks become non-sequential writes, which ZONED mode must avoid. We can introduce a dedicated block group for tree-log blocks, so that tree-log blocks and other metadata blocks get separate write streams. As a result, each write stream can then be written to devices separately. "fs_info->treelog_bg" tracks the dedicated block group, and btrfs assigns "treelog_bg" on demand at tree-log block allocation time. This commit extends the zoned block allocator to use the block group. Signed-off-by: Johannes Thumshirn Signed-off-by: Naohiro Aota --- fs/btrfs/block-group.c | 7 +++++ fs/btrfs/ctree.h | 2 ++ fs/btrfs/extent-tree.c | 63 +++++++++++++++++++++++++++++++++++++++--- 3 files changed, 68 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 21e40046dce1..ac4fb954db28 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -939,6 +939,13 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, btrfs_return_cluster_to_free_space(block_group, cluster); spin_unlock(&cluster->refill_lock); + if (btrfs_is_zoned(fs_info)) { + spin_lock(&fs_info->treelog_bg_lock); + if (fs_info->treelog_bg == block_group->start) + fs_info->treelog_bg = 0; + spin_unlock(&fs_info->treelog_bg_lock); + } + path = btrfs_alloc_path(); if (!path) { ret = -ENOMEM; diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 736f679f1310..026d34c8f2ee 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -956,6 +956,8 @@ struct btrfs_fs_info { /* max size to emit ZONE_APPEND write command */ u64 max_zone_append_size; struct mutex zoned_meta_io_lock; + spinlock_t treelog_bg_lock; + u64 treelog_bg; #ifdef CONFIG_BTRFS_FS_REF_VERIFY spinlock_t ref_verify_lock; diff
--git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 1bb95d5aaed2..c570fe576fb3 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3619,6 +3619,9 @@ struct find_free_extent_ctl { bool have_caching_bg; bool orig_have_caching_bg; + /* Allocation is called for tree-log */ + bool for_treelog; + /* RAID index, converted from flags */ int index; @@ -3856,23 +3859,54 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group, struct find_free_extent_ctl *ffe_ctl, struct btrfs_block_group **bg_ret) { + struct btrfs_fs_info *fs_info = block_group->fs_info; struct btrfs_space_info *space_info = block_group->space_info; struct btrfs_free_space_ctl *ctl = block_group->free_space_ctl; u64 start = block_group->start; u64 num_bytes = ffe_ctl->num_bytes; u64 avail; + u64 bytenr = block_group->start; + u64 log_bytenr; int ret = 0; + bool skip; ASSERT(btrfs_is_zoned(block_group->fs_info)); + /* + * Do not allow non-tree-log blocks in the dedicated tree-log block + * group, and vice versa. + */ + spin_lock(&fs_info->treelog_bg_lock); + log_bytenr = fs_info->treelog_bg; + skip = log_bytenr && ((ffe_ctl->for_treelog && bytenr != log_bytenr) || + (!ffe_ctl->for_treelog && bytenr == log_bytenr)); + spin_unlock(&fs_info->treelog_bg_lock); + if (skip) + return 1; + spin_lock(&space_info->lock); spin_lock(&block_group->lock); + spin_lock(&fs_info->treelog_bg_lock); + + ASSERT(!ffe_ctl->for_treelog || + block_group->start == fs_info->treelog_bg || + fs_info->treelog_bg == 0); if (block_group->ro) { ret = 1; goto out; } + /* + * Do not allow currently using block group to be tree-log dedicated + * block group. 
+ */ + if (ffe_ctl->for_treelog && !fs_info->treelog_bg && + (block_group->used || block_group->reserved)) { + ret = 1; + goto out; + } + avail = block_group->length - block_group->alloc_offset; if (avail < num_bytes) { ffe_ctl->max_extent_size = avail; @@ -3880,6 +3914,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group, goto out; } + if (ffe_ctl->for_treelog && !fs_info->treelog_bg) + fs_info->treelog_bg = block_group->start; + ffe_ctl->found_offset = start + block_group->alloc_offset; block_group->alloc_offset += num_bytes; spin_lock(&ctl->tree_lock); @@ -3894,6 +3931,9 @@ static int do_allocation_zoned(struct btrfs_block_group *block_group, ffe_ctl->search_start = ffe_ctl->found_offset; out: + if (ret && ffe_ctl->for_treelog) + fs_info->treelog_bg = 0; + spin_unlock(&fs_info->treelog_bg_lock); spin_unlock(&block_group->lock); spin_unlock(&space_info->lock); return ret; @@ -4143,7 +4183,12 @@ static int prepare_allocation(struct btrfs_fs_info *fs_info, return prepare_allocation_clustered(fs_info, ffe_ctl, space_info, ins); case BTRFS_EXTENT_ALLOC_ZONED: - /* nothing to do */ + if (ffe_ctl->for_treelog) { + spin_lock(&fs_info->treelog_bg_lock); + if (fs_info->treelog_bg) + ffe_ctl->hint_byte = fs_info->treelog_bg; + spin_unlock(&fs_info->treelog_bg_lock); + } return 0; default: BUG(); @@ -4187,6 +4232,7 @@ static noinline int find_free_extent(struct btrfs_root *root, struct find_free_extent_ctl ffe_ctl = {0}; struct btrfs_space_info *space_info; bool full_search = false; + bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID; WARN_ON(num_bytes < fs_info->sectorsize); @@ -4200,6 +4246,7 @@ static noinline int find_free_extent(struct btrfs_root *root, ffe_ctl.orig_have_caching_bg = false; ffe_ctl.found_offset = 0; ffe_ctl.hint_byte = hint_byte_orig; + ffe_ctl.for_treelog = for_treelog; ffe_ctl.policy = BTRFS_EXTENT_ALLOC_CLUSTERED; /* For clustered allocation */ @@ -4274,8 +4321,15 @@ static noinline int 
find_free_extent(struct btrfs_root *root, struct btrfs_block_group *bg_ret; /* If the block group is read-only, we can skip it entirely. */ - if (unlikely(block_group->ro)) + if (unlikely(block_group->ro)) { + if (btrfs_is_zoned(fs_info) && for_treelog) { + spin_lock(&fs_info->treelog_bg_lock); + if (block_group->start == fs_info->treelog_bg) + fs_info->treelog_bg = 0; + spin_unlock(&fs_info->treelog_bg_lock); + } continue; + } btrfs_grab_block_group(block_group, delalloc); ffe_ctl.search_start = block_group->start; @@ -4463,6 +4517,7 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes, bool final_tried = num_bytes == min_alloc_size; u64 flags; int ret; + bool for_treelog = root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID; flags = get_alloc_profile_by_root(root, is_data); again: @@ -4486,8 +4541,8 @@ int btrfs_reserve_extent(struct btrfs_root *root, u64 ram_bytes, sinfo = btrfs_find_space_info(fs_info, flags); btrfs_err(fs_info, - "allocation failed flags %llu, wanted %llu", - flags, num_bytes); + "allocation failed flags %llu, wanted %llu treelog %d", + flags, num_bytes, for_treelog); if (sinfo) btrfs_dump_space_info(fs_info, sinfo, num_bytes, 1); From patchwork Fri Oct 30 13:51:46 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Naohiro Aota X-Patchwork-Id: 11869697 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B99CF92C for ; Fri, 30 Oct 2020 13:53:33 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 91E192076E for ; Fri, 30 Oct 2020 13:53:33 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=wdc.com header.i=@wdc.com header.b="EcC0KBNb" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id 
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota
Subject: [PATCH v9 39/41] btrfs: serialize log transaction on ZONED mode
Date: Fri, 30 Oct 2020 22:51:46 +0900

This is the 2/3 patch to enable tree-log on ZONED mode.

Since we can start more than one log transaction per subvolume simultaneously, nodes from multiple transactions can be allocated interleaved. Such mixed allocation results in non-sequential writes at log transaction commit time. The nodes of the global log root tree (fs_info->log_root_tree) have the same mixed allocation problem.

This patch serializes log transactions by waiting for a committing transaction when someone tries to start a new one, to avoid the mixed allocation problem. We must also wait for running log transactions of other subvolumes, but there is no easy way to detect which subvolume root is running a log transaction. So, this patch forbids starting a new log transaction when another subvolume has already allocated the global log root tree.
Signed-off-by: Naohiro Aota --- fs/btrfs/tree-log.c | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 5f585cf57383..826af2ff4740 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -106,6 +106,7 @@ static noinline int replay_dir_deletes(struct btrfs_trans_handle *trans, struct btrfs_root *log, struct btrfs_path *path, u64 dirid, int del_all); +static void wait_log_commit(struct btrfs_root *root, int transid); /* * tree logging is a special write ahead log used to make sure that @@ -140,16 +141,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans, struct btrfs_log_ctx *ctx) { struct btrfs_fs_info *fs_info = root->fs_info; + bool zoned = btrfs_is_zoned(fs_info); int ret = 0; mutex_lock(&root->log_mutex); +again: if (root->log_root) { + int index = (root->log_transid + 1) % 2; + if (btrfs_need_log_full_commit(trans)) { ret = -EAGAIN; goto out; } + if (zoned && atomic_read(&root->log_commit[index])) { + wait_log_commit(root, root->log_transid - 1); + goto again; + } + if (!root->log_start_pid) { clear_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state); root->log_start_pid = current->pid; @@ -158,7 +168,9 @@ static int start_log_trans(struct btrfs_trans_handle *trans, } } else { mutex_lock(&fs_info->tree_log_mutex); - if (!fs_info->log_root_tree) + if (zoned && fs_info->log_root_tree) + ret = -EAGAIN; + else if (!fs_info->log_root_tree) ret = btrfs_init_log_root_tree(trans, fs_info); mutex_unlock(&fs_info->tree_log_mutex); if (ret) @@ -193,14 +205,22 @@ static int start_log_trans(struct btrfs_trans_handle *trans, */ static int join_running_log_trans(struct btrfs_root *root) { + bool zoned = btrfs_is_zoned(root->fs_info); int ret = -ENOENT; if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &root->state)) return ret; mutex_lock(&root->log_mutex); +again: if (root->log_root) { + int index = (root->log_transid + 1) % 2; + ret = 0; + if (zoned && 
atomic_read(&root->log_commit[index])) { + wait_log_commit(root, root->log_transid - 1); + goto again; + } atomic_inc(&root->log_writers); } mutex_unlock(&root->log_mutex);

From patchwork Fri Oct 30 13:51:47 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869695
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Johannes Thumshirn
Subject: [PATCH v9 40/41] btrfs: reorder log node allocation
Date: Fri, 30 Oct 2020 22:51:47 +0900

This is the 3/3 patch to enable tree-log on ZONED mode.
The nodes of "fs_info->log_root_tree" and the nodes of "root->log_root" are
not allocated in the same order in which they are written out, so writing
them causes unaligned write errors. Reorder the allocations by delaying
allocation of the root node of "fs_info->log_root_tree", so that the node
buffers can go out to the devices sequentially.

Signed-off-by: Johannes Thumshirn
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/disk-io.c  |  6 ------
 fs/btrfs/tree-log.c | 24 ++++++++++++++++++------
 2 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 70885f3d3321..2c2fa5ebfef1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1255,16 +1255,10 @@ int btrfs_init_log_root_tree(struct btrfs_trans_handle *trans,
 			     struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_root *log_root;
-	int ret;
 
 	log_root = alloc_log_tree(trans, fs_info);
 	if (IS_ERR(log_root))
 		return PTR_ERR(log_root);
-	ret = btrfs_alloc_log_tree_node(trans, log_root);
-	if (ret) {
-		kfree(log_root);
-		return ret;
-	}
 	WARN_ON(fs_info->log_root_tree);
 	fs_info->log_root_tree = log_root;
 	return 0;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 826af2ff4740..46ff92a18474 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3140,6 +3140,16 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
 	list_add_tail(&root_log_ctx.list, &log_root_tree->log_ctxs[index2]);
 	root_log_ctx.log_transid = log_root_tree->log_transid;
 
+	mutex_lock(&fs_info->tree_log_mutex);
+	if (!log_root_tree->node) {
+		ret = btrfs_alloc_log_tree_node(trans, log_root_tree);
+		if (ret) {
+			mutex_unlock(&fs_info->tree_log_mutex);
+			goto out;
+		}
+	}
+	mutex_unlock(&fs_info->tree_log_mutex);
+
 	/*
 	 * Now we are safe to update the log_root_tree because we're under the
 	 * log_mutex, and we're a current writer so we're holding the commit
@@ -3289,12 +3299,14 @@ static void free_log_tree(struct btrfs_trans_handle *trans,
 		.process_func = process_one_buffer
 	};
 
-	ret = walk_log_tree(trans, log, &wc);
-	if (ret) {
-		if (trans)
-			btrfs_abort_transaction(trans, ret);
-		else
-			btrfs_handle_fs_error(log->fs_info, ret, NULL);
+	if (log->node) {
+		ret = walk_log_tree(trans, log, &wc);
+		if (ret) {
+			if (trans)
+				btrfs_abort_transaction(trans, ret);
+			else
+				btrfs_handle_fs_error(log->fs_info, ret, NULL);
+		}
 	}
 
 	clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1,

From patchwork Fri Oct 30 13:51:48 2020
X-Patchwork-Submitter: Naohiro Aota
X-Patchwork-Id: 11869693
From: Naohiro Aota
To: linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Naohiro Aota, Josef Bacik
Subject: [PATCH v9 41/41] btrfs: enable to mount ZONED incompat flag
Date: Fri, 30 Oct 2020 22:51:48 +0900
Message-Id: <477669128f8e874f07c2c636e651833b6620d4c4.1604065695.git.naohiro.aota@wdc.com>

This final patch adds the ZONED incompat flag to
BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount a ZONED-flagged
file system.

Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
 fs/btrfs/ctree.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 026d34c8f2ee..c15f73c21f69 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -302,7 +302,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
-	 BTRFS_FEATURE_INCOMPAT_RAID1C34)
+	 BTRFS_FEATURE_INCOMPAT_RAID1C34	|	\
+	 BTRFS_FEATURE_INCOMPAT_ZONED)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)