[v5,02/10] block: Introduce REQ_OP_ZONE_APPEND

Message ID	20200409165352.2126-3-johannes.thumshirn@wdc.com (mailing list archive)
State	Superseded
Series	Introduce Zone Append for writing to zoned block devices
	[v5,00/10] Introduce Zone Append for writing to zoned block devices
	[v5,01/10] block: provide fallbacks for blk_queue_zone_is_seq and blk_queue_zone_no
	[v5,02/10] block: Introduce REQ_OP_ZONE_APPEND
	[v5,03/10] block: introduce blk_req_zone_write_trylock
	[v5,04/10] block: Modify revalidate zones
	[v5,05/10] scsi: sd_zbc: factor out sanity checks for zoned commands
	[v5,06/10] scsi: export scsi_mq_free_sgtables
	[v5,07/10] scsi: sd_zbc: emulate ZONE_APPEND commands
	[v5,08/10] null_blk: Support REQ_OP_ZONE_APPEND
	[v5,09/10] block: export bio_release_pages and bio_iov_iter_get_pages
	[v5,10/10] zonefs: use REQ_OP_ZONE_APPEND for sync DIO

Headers

From: Johannes Thumshirn <johannes.thumshirn@wdc.com>
To: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>,
        linux-block <linux-block@vger.kernel.org>,
        Damien Le Moal <Damien.LeMoal@wdc.com>,
        Keith Busch <kbusch@kernel.org>,
        "linux-scsi @ vger . kernel . org" <linux-scsi@vger.kernel.org>,
        "Martin K . Petersen" <martin.petersen@oracle.com>,
        "linux-fsdevel @ vger . kernel . org" <linux-fsdevel@vger.kernel.org>,
        Johannes Thumshirn <johannes.thumshirn@wdc.com>
Subject: [PATCH v5 02/10] block: Introduce REQ_OP_ZONE_APPEND
Date: Fri, 10 Apr 2020 01:53:44 +0900
Message-Id: <20200409165352.2126-3-johannes.thumshirn@wdc.com>
In-Reply-To: <20200409165352.2126-1-johannes.thumshirn@wdc.com>
References: <20200409165352.2126-1-johannes.thumshirn@wdc.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: linux-scsi-owner@vger.kernel.org
Precedence: bulk

Commit Message

Johannes Thumshirn April 9, 2020, 4:53 p.m. UTC

From: Keith Busch <kbusch@kernel.org>

Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
block device. This is a no-merge write operation.

A zone append write BIO must:
* Target a zoned block device
* Have a sector position indicating the start sector of the target zone
* Target a sequential write zone
* Not cross a zone boundary
* Not be split, to ensure that a single range of LBAs is written with
  a single command.
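
As an illustration only, the requirements above can be modelled in
userspace C. This is a sketch, not the kernel helper: struct
zoned_dev_model, its first_seq_zone field and check_zone_append() are
hypothetical names, the zone size is assumed to be a power of two, and
the boundary rule is expressed as a length comparison against the zone
size rather than in the way the patch expresses it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a zoned device; not a kernel structure. */
struct zoned_dev_model {
	bool zoned;                  /* device is a zoned block device */
	uint64_t zone_sectors;       /* zone size in sectors, assumed power of two */
	uint64_t first_seq_zone;     /* assumption: zones below this index are conventional */
	uint32_t max_append_sectors; /* largest zone append the device accepts */
};

enum append_check { APPEND_OK, APPEND_NOTSUPP, APPEND_IOERR };

static enum append_check check_zone_append(const struct zoned_dev_model *dev,
					    uint64_t pos, uint32_t nr_sectors)
{
	/* Only applicable to zoned block devices. */
	if (!dev->zoned)
		return APPEND_NOTSUPP;

	/* The alignment test below relies on a power-of-two zone size. */
	assert(dev->zone_sectors &&
	       !(dev->zone_sectors & (dev->zone_sectors - 1)));

	/* The sector must be the start of a sequential write zone. */
	if ((pos & (dev->zone_sectors - 1)) ||
	    pos / dev->zone_sectors < dev->first_seq_zone)
		return APPEND_IOERR;

	/* From a zone start, more than one zone of data crosses a boundary. */
	if (nr_sectors > dev->zone_sectors)
		return APPEND_IOERR;

	/* Anything above the append limit would have to be split. */
	if (nr_sectors > dev->max_append_sectors)
		return APPEND_IOERR;

	return APPEND_OK;
}
```

With 1024-sector zones of which the first two are conventional, an
append at sector 2048 passes, while one at sector 2049 (unaligned) or
at sector 0 (conventional zone) is rejected.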

Implement these checks in generic_make_request_checks() using the
helper function blk_check_zone_append(). To avoid splitting zone
append BIOs, introduce the new max_zone_append_sectors queue limit
attribute and ensure that a BIO size never exceeds this limit.
Export this new limit through sysfs and check these limits in bio_full().
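
For illustration, the way the limit is derived and reported can be
reproduced in a small userspace sketch. It follows the arithmetic of
blk_queue_max_zone_append_sectors() and queue_zone_append_max_show() in
this patch, but the function names below are hypothetical and the
helpers are plain C stand-ins for the kernel's min() and min_not_zero().

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SECTOR_SHIFT 9 /* 512-byte sectors */

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Smaller of the two values, ignoring a value of zero ("no limit"). */
static uint32_t min_not_zero_u32(uint32_t a, uint32_t b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return min_u32(a, b);
}

/* Hypothetical name: clamp the requested limit to hardware and zone size. */
static uint32_t clamp_zone_append_sectors(uint32_t max_hw_sectors,
					  uint32_t chunk_sectors,
					  uint32_t requested)
{
	uint32_t max_sectors = min_u32(max_hw_sectors, requested);

	/* A zero limit means "not supported" and must stay zero. */
	if (max_sectors)
		max_sectors = min_not_zero_u32(chunk_sectors, max_sectors);

	assert(max_sectors <= max_hw_sectors);
	return max_sectors;
}

/* Value as it would appear in the zone_append_max_bytes sysfs attribute. */
static uint64_t zone_append_max_bytes(uint32_t max_sectors)
{
	return (uint64_t)max_sectors << MODEL_SECTOR_SHIFT;
}
```

A request for 2048 sectors on a queue with max_hw_sectors of 1024 and
chunk_sectors of 512 yields 512 sectors, which the sysfs attribute
would report as 262144 bytes.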

Also when an LLDD can't dispatch a request to a specific zone, it
will return BLK_STS_ZONE_RESOURCE indicating this request needs to
be delayed, e.g. because the zone it will be dispatched to is still
write-locked. If this happens set the request aside in a local list
to continue trying to dispatch requests such as READ requests or
WRITE/ZONE_APPEND requests targeting other zones. This way we can
still keep a high queue depth without starving other requests even if
one request can't be served due to zone write-locking.
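
The set-aside behaviour can be sketched in userspace as follows. This
is a simplified model, not the blk-mq code: struct model_request,
model_queue_rq() and model_dispatch() are hypothetical names, the
dispatch list is a plain array, and deferred requests are kept in their
original order at the front of the array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_status { MODEL_STS_OK, MODEL_STS_ZONE_RESOURCE };

/* Hypothetical request: which zone it targets and whether it writes. */
struct model_request {
	unsigned int zone;
	bool is_write;
};

/* Hypothetical driver hook: writes to a locked zone must wait, reads need not. */
static enum model_status model_queue_rq(const struct model_request *rq,
					const bool *zone_locked)
{
	if (rq->is_write && zone_locked[rq->zone])
		return MODEL_STS_ZONE_RESOURCE;
	return MODEL_STS_OK;
}

/*
 * Try every request in list[0..n). A request that hits a zone resource
 * shortage is set aside and the loop keeps going, so requests for other
 * zones are not starved. Deferred requests are compacted to the front of
 * the array; the return value is how many remain for a later retry.
 */
static size_t model_dispatch(struct model_request *list, size_t n,
			     const bool *zone_locked)
{
	size_t deferred = 0;

	assert(n == 0 || (list && zone_locked));

	for (size_t i = 0; i < n; i++) {
		if (model_queue_rq(&list[i], zone_locked) ==
		    MODEL_STS_ZONE_RESOURCE)
			list[deferred++] = list[i];
	}
	return deferred;
}
```

With zone 1 write-locked, a list holding two writes to zone 1, one read
from zone 1 and one write to zone 2 dispatches the read and the zone 2
write and leaves the two zone 1 writes for a later run.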

Finally, make sure that the bio sector position indicates the actual
write position as indicated by the device on completion.
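
A minimal userspace sketch of this completion rule, mirroring the
req_bio_endio() hunk of this patch, is shown below. struct model_bio
and model_zone_append_endio() are hypothetical names, and the error is
modelled as a negative errno instead of a blk_status_t.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of a bio that matter here. */
struct model_bio {
	uint64_t sector;          /* zone start on submission */
	uint32_t remaining_bytes; /* bytes not yet completed */
	int status;               /* 0 on success, negative errno on failure */
};

/*
 * Complete nbytes of a successful zone append. written_sector is the
 * position the device reports it actually wrote to.
 */
static void model_zone_append_endio(struct model_bio *bio,
				    uint64_t written_sector, uint32_t nbytes)
{
	assert(nbytes <= bio->remaining_bytes);
	bio->remaining_bytes -= nbytes;

	if (bio->remaining_bytes) {
		/* Partial completions cannot be supported: fail the bio. */
		bio->status = -EIO;
	} else {
		/* Report where the data landed back to the submitter. */
		bio->sector = written_sector;
	}
}
```

A bio submitted at zone start 2048 that the device wrote at sector 2056
completes with its sector set to 2056; a partial completion leaves the
sector untouched and sets the status to -EIO.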

Signed-off-by: Keith Busch <kbusch@kernel.org>
[ jth: added zone-append specific add_page and merge_page helpers ]
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>

---
Changes to v4:
- fix page merging for zone-append bios
- remove unneeded variable
---
 block/bio.c               | 70 +++++++++++++++++++++++++++++++++++++--
 block/blk-core.c          | 52 +++++++++++++++++++++++++++++
 block/blk-mq.c            | 27 +++++++++++++++
 block/blk-settings.c      | 23 +++++++++++++
 block/blk-sysfs.c         | 13 ++++++++
 drivers/scsi/scsi_lib.c   |  1 +
 include/linux/blk_types.h | 14 ++++++++
 include/linux/blkdev.h    | 11 ++++++
 8 files changed, 209 insertions(+), 2 deletions(-)

Comments

Christoph Hellwig April 10, 2020, 7:10 a.m. UTC | #1

I've just been auditing the bio code for now and have a few suggestions:

 - we really should be reusing the passthrough bio handling for
   zone append instead of reinventing it
 - I think __bio_iov_iter_get_pages should be split into a separate
   append version.  That matches the bvec split (which we fail to
   handle properly for append), avoids a branch for every page in
   the fast path and generally seems to look cleaner.

Patch on top of your whole branch attached:

diff --git a/block/bio.c b/block/bio.c
index 4029a48f3828..dd84bd5adc24 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -679,54 +679,6 @@ struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 }
 EXPORT_SYMBOL(bio_clone_fast);
 
-static bool bio_try_merge_zone_append_page(struct bio *bio, struct page *page,
-					   unsigned int len, unsigned int off,
-					   bool *same_page)
-{
-	struct request_queue *q = bio->bi_disk->queue;
-	struct bio_vec *bv;
-	unsigned long mask = queue_segment_boundary(q);
-	phys_addr_t addr1, addr2;
-
-	if (bio->bi_vcnt < 1)
-		return false;
-
-	bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
-
-	addr1 = page_to_phys(bv->bv_page) + bv->bv_offset;
-	addr2 = page_to_phys(page) + off + len - 1;
-
-	if ((addr1 | mask) != (addr2 | mask))
-		return false;
-	if (bv->bv_len + len > queue_max_segment_size(q))
-		return false;
-	return __bio_try_merge_page(bio, page, len, off, same_page);
-}
-
-static int bio_add_append_page(struct bio *bio, struct page *page, unsigned len,
-			       size_t offset)
-{
-	struct request_queue *q = bio->bi_disk->queue;
-	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
-	bool same_page = false;
-
-	if (WARN_ON_ONCE(!max_append_sectors))
-		return 0;
-
-	if (((bio->bi_iter.bi_size + len) >> 9) > max_append_sectors)
-		return 0;
-
-	if (bio_try_merge_zone_append_page(bio, page, len, offset, &same_page))
-		return len;
-
-	if (bio->bi_vcnt >= queue_max_segments(q))
-		return 0;
-
-	__bio_add_page(bio, page, len, offset);
-
-	return len;
-}
-
 static inline bool page_is_mergeable(const struct bio_vec *bv,
 		struct page *page, unsigned int len, unsigned int off,
 		bool *same_page)
@@ -746,9 +698,13 @@ static inline bool page_is_mergeable(const struct bio_vec *bv,
 	return true;
 }
 
-static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio,
-		struct page *page, unsigned len, unsigned offset,
-		bool *same_page)
+/*
+ * Try to merge a page into a segment, while obeying the hardware segment
+ * size limit.  This is not for normal read/write bios, but for passthrough
+ * or Zone Append operations that we can't split.
+ */
+static bool bio_try_merge_hw_seg(struct request_queue *q, struct bio *bio,
+		struct page *page, unsigned len, unsigned offset, bool *same_page)
 {
 	struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 	unsigned long mask = queue_segment_boundary(q);
@@ -762,39 +718,24 @@ static bool bio_try_merge_pc_page(struct request_queue *q, struct bio *bio,
 	return __bio_try_merge_page(bio, page, len, offset, same_page);
 }
 
-/**
- *	__bio_add_pc_page	- attempt to add page to passthrough bio
- *	@q: the target queue
- *	@bio: destination bio
- *	@page: page to add
- *	@len: vec entry length
- *	@offset: vec entry offset
- *	@same_page: return if the merge happen inside the same page
- *
- *	Attempt to add a page to the bio_vec maplist. This can fail for a
- *	number of reasons, such as the bio being full or target block device
- *	limitations. The target block device must allow bio's up to PAGE_SIZE,
- *	so it is always possible to add a single page to an empty bio.
- *
- *	This should only be used by passthrough bios.
+/*
+ * Add a page to a bio while respecting the hardware max_sectors, max_segment
+ * and gap limitations.
  */
-static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
+static int bio_add_hw_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset,
-		bool *same_page)
+		unsigned int max_sectors, bool *same_page)
 {
 	struct bio_vec *bvec;
 
-	/*
-	 * cloned bio must not modify vec list
-	 */
-	if (unlikely(bio_flagged(bio, BIO_CLONED)))
+	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
 		return 0;
 
-	if (((bio->bi_iter.bi_size + len) >> 9) > queue_max_hw_sectors(q))
+	if (((bio->bi_iter.bi_size + len) >> 9) > max_sectors)
 		return 0;
 
 	if (bio->bi_vcnt > 0) {
-		if (bio_try_merge_pc_page(q, bio, page, len, offset, same_page))
+		if (bio_try_merge_hw_seg(q, bio, page, len, offset, same_page))
 			return len;
 
 		/*
@@ -821,11 +762,27 @@ static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
 	return len;
 }
 
+/**
+ * bio_add_pc_page	- attempt to add page to passthrough bio
+ * @q: the target queue
+ * @bio: destination bio
+ * @page: page to add
+ * @len: vec entry length
+ * @offset: vec entry offset
+ *
+ * Attempt to add a page to the bio_vec maplist. This can fail for a
+ * number of reasons, such as the bio being full or target block device
+ * limitations. The target block device must allow bio's up to PAGE_SIZE,
+ * so it is always possible to add a single page to an empty bio.
+ *
+ * This should only be used by passthrough bios.
+ */
 int bio_add_pc_page(struct request_queue *q, struct bio *bio,
 		struct page *page, unsigned int len, unsigned int offset)
 {
 	bool same_page = false;
-	return __bio_add_pc_page(q, bio, page, len, offset, &same_page);
+	return bio_add_hw_page(q, bio, page, len, offset,
+			queue_max_hw_sectors(q), &same_page);
 }
 EXPORT_SYMBOL(bio_add_pc_page);
 
@@ -993,27 +950,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		struct page *page = pages[i];
 
 		len = min_t(size_t, PAGE_SIZE - offset, left);
-		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
-			int ret;
-
-			if (bio_try_merge_zone_append_page(bio, page, len,
-							   offset,
-							   &same_page)) {
-				if (same_page)
-					put_page(page);
-			} else {
-				ret = bio_add_append_page(bio, page, len,
-							  offset);
-				if (ret != len)
-					return -EINVAL;
-			}
-		} else if (__bio_try_merge_page(bio, page, len, offset,
-						&same_page)) {
+		if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
 			if (same_page)
 				put_page(page);
 		} else {
 			if (WARN_ON_ONCE(bio_full(bio, len)))
-                                return -EINVAL;
+				return -EINVAL;
 			__bio_add_page(bio, page, len, offset);
 		}
 		offset = 0;
@@ -1023,6 +965,50 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	return 0;
 }
 
+static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
+	struct request_queue *q = bio->bi_disk->queue;
+	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
+	struct page **pages = (struct page **)bv;
+	ssize_t size, left;
+	unsigned len, i;
+	size_t offset;
+
+	if (WARN_ON_ONCE(!max_append_sectors))
+		return 0;
+
+	/*
+	 * Move page array up in the allocated memory for the bio vecs as far as
+	 * possible so that we can start filling biovecs from the beginning
+	 * without overwriting the temporary page array.
+	 */
+	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
+	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
+
+	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	if (unlikely(size <= 0))
+		return size ? size : -EFAULT;
+
+	for (left = size, i = 0; left > 0; left -= len, i++) {
+		struct page *page = pages[i];
+		bool same_page = false;
+
+		len = min_t(size_t, PAGE_SIZE - offset, left);
+		if (bio_add_hw_page(q, bio, page, len, offset,
+				max_append_sectors, &same_page) != len)
+			return -EINVAL;
+		if (same_page)
+			put_page(page);
+		offset = 0;
+	}
+
+	iov_iter_advance(iter, size);
+	return 0;
+}
+
 /**
  * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
@@ -1052,10 +1038,16 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return -EINVAL;
 
 	do {
-		if (is_bvec)
-			ret = __bio_iov_bvec_add_pages(bio, iter);
-		else
-			ret = __bio_iov_iter_get_pages(bio, iter);
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			if (WARN_ON_ONCE(is_bvec))
+				return -EINVAL;
+			ret = __bio_iov_append_get_pages(bio, iter);
+		} else {
+			if (is_bvec)
+				ret = __bio_iov_bvec_add_pages(bio, iter);
+			else
+				ret = __bio_iov_iter_get_pages(bio, iter);
+		}
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
 	if (is_bvec)
@@ -1455,6 +1447,7 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 			     struct iov_iter *iter,
 			     gfp_t gfp_mask)
 {
+	unsigned int max_sectors = queue_max_hw_sectors(q);
 	int j;
 	struct bio *bio;
 	int ret;
@@ -1492,8 +1485,8 @@ struct bio *bio_map_user_iov(struct request_queue *q,
 				if (n > bytes)
 					n = bytes;
 
-				if (!__bio_add_pc_page(q, bio, page, n, offs,
-						&same_page)) {
+				if (!bio_add_hw_page(q, bio, page, n, offs,
+						max_sectors, &same_page)) {
 					if (same_page)
 						put_page(page);
 					break;

Johannes Thumshirn April 14, 2020, 9:43 a.m. UTC | #2

On 10/04/2020 09:10, Christoph Hellwig wrote:
> -/**
> - *	__bio_add_pc_page	- attempt to add page to passthrough bio
> - *	@q: the target queue
> - *	@bio: destination bio
> - *	@page: page to add
> - *	@len: vec entry length
> - *	@offset: vec entry offset
> - *	@same_page: return if the merge happen inside the same page
> - *
> - *	Attempt to add a page to the bio_vec maplist. This can fail for a
> - *	number of reasons, such as the bio being full or target block device
> - *	limitations. The target block device must allow bio's up to PAGE_SIZE,
> - *	so it is always possible to add a single page to an empty bio.
> - *
> - *	This should only be used by passthrough bios.
> +/*
> + * Add a page to a bio while respecting the hardware max_sectors, max_segment
> + * and gap limitations.
>    */
> -static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
> +static int bio_add_hw_page(struct request_queue *q, struct bio *bio,
>   		struct page *page, unsigned int len, unsigned int offset,
> -		bool *same_page)
> +		unsigned int max_sectors, bool *same_page)

Should I split that rename into a prep patch and if yes add you as the 
author?

Christoph Hellwig April 14, 2020, 11:28 a.m. UTC | #3

On Tue, Apr 14, 2020 at 09:43:24AM +0000, Johannes Thumshirn wrote:
> > -static int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
> > +static int bio_add_hw_page(struct request_queue *q, struct bio *bio,
> >   		struct page *page, unsigned int len, unsigned int offset,
> > -		bool *same_page)
> > +		unsigned int max_sectors, bool *same_page)
> 
> Should I split that rename into a prep patch and if yes add you as the 
> author?

It is not just a rename but also passing max_sectors explicitly.  I'm
kinda torn if it is worth a prep patch or not, but then why not..

diff --git a/block/bio.c b/block/bio.c
index 94d697217887..689f31357d30 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -679,6 +679,54 @@  struct bio *bio_clone_fast(struct bio *bio, gfp_t gfp_mask, struct bio_set *bs)
 }
 EXPORT_SYMBOL(bio_clone_fast);
 
+static bool bio_try_merge_zone_append_page(struct bio *bio, struct page *page,
+					   unsigned int len, unsigned int off,
+					   bool *same_page)
+{
+	struct request_queue *q = bio->bi_disk->queue;
+	struct bio_vec *bv;
+	unsigned long mask = queue_segment_boundary(q);
+	phys_addr_t addr1, addr2;
+
+	if (bio->bi_vcnt < 1)
+		return false;
+
+	bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
+
+	addr1 = page_to_phys(bv->bv_page) + bv->bv_offset;
+	addr2 = page_to_phys(page) + off + len - 1;
+
+	if ((addr1 | mask) != (addr2 | mask))
+		return false;
+	if (bv->bv_len + len > queue_max_segment_size(q))
+		return false;
+	return __bio_try_merge_page(bio, page, len, off, same_page);
+}
+
+static int bio_add_append_page(struct bio *bio, struct page *page, unsigned len,
+			       size_t offset)
+{
+	struct request_queue *q = bio->bi_disk->queue;
+	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+	bool same_page = false;
+
+	if (WARN_ON_ONCE(!max_append_sectors))
+		return 0;
+
+	if (((bio->bi_iter.bi_size + len) >> 9) > max_append_sectors)
+		return 0;
+
+	if (bio_try_merge_zone_append_page(bio, page, len, offset, &same_page))
+		return len;
+
+	if (bio->bi_vcnt >= queue_max_segments(q))
+		return 0;
+
+	__bio_add_page(bio, page, len, offset);
+
+	return len;
+}
+
 static inline bool page_is_mergeable(const struct bio_vec *bv,
 		struct page *page, unsigned int len, unsigned int off,
 		bool *same_page)
@@ -944,8 +992,22 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		struct page *page = pages[i];
 
 		len = min_t(size_t, PAGE_SIZE - offset, left);
-
-		if (__bio_try_merge_page(bio, page, len, offset, &same_page)) {
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			int ret;
+
+			if (bio_try_merge_zone_append_page(bio, page, len,
+							   offset,
+							   &same_page)) {
+				if (same_page)
+					put_page(page);
+			} else {
+				ret = bio_add_append_page(bio, page, len,
+							  offset);
+				if (ret != len)
+					return -EINVAL;
+			}
+		} else if (__bio_try_merge_page(bio, page, len, offset,
+						&same_page)) {
 			if (same_page)
 				put_page(page);
 		} else {
@@ -1895,6 +1957,10 @@  struct bio *bio_split(struct bio *bio, int sectors,
 	BUG_ON(sectors <= 0);
 	BUG_ON(sectors >= bio_sectors(bio));
 
+	/* Zone append commands cannot be split */
+	if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
+		return NULL;
+
 	split = bio_clone_fast(bio, gfp, bs);
 	if (!split)
 		return NULL;
diff --git a/block/blk-core.c b/block/blk-core.c
index 60dc9552ef8d..57127092d816 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -135,6 +135,7 @@  static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_OPEN),
 	REQ_OP_NAME(ZONE_CLOSE),
 	REQ_OP_NAME(ZONE_FINISH),
+	REQ_OP_NAME(ZONE_APPEND),
 	REQ_OP_NAME(WRITE_SAME),
 	REQ_OP_NAME(WRITE_ZEROES),
 	REQ_OP_NAME(SCSI_IN),
@@ -240,6 +241,17 @@  static void req_bio_endio(struct request *rq, struct bio *bio,
 
 	bio_advance(bio, nbytes);
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND && error == BLK_STS_OK) {
+		/*
+		 * Partial zone append completions cannot be supported as the
+		 * BIO fragments may end up not being written sequentially.
+		 */
+		if (bio->bi_iter.bi_size)
+			bio->bi_status = BLK_STS_IOERR;
+		else
+			bio->bi_iter.bi_sector = rq->__sector;
+	}
+
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
 		bio_endio(bio);
@@ -865,6 +877,41 @@  static inline int blk_partition_remap(struct bio *bio)
 	return ret;
 }
 
+/*
+ * Check write append to a zoned block device.
+ */
+static inline blk_status_t blk_check_zone_append(struct request_queue *q,
+						 struct bio *bio)
+{
+	sector_t pos = bio->bi_iter.bi_sector;
+	int nr_sectors = bio_sectors(bio);
+
+	/* Only applicable to zoned block devices */
+	if (!blk_queue_is_zoned(q))
+		return BLK_STS_NOTSUPP;
+
+	/* The bio sector must point to the start of a sequential zone */
+	if (pos & (blk_queue_zone_sectors(q) - 1) ||
+	    !blk_queue_zone_is_seq(q, pos))
+		return BLK_STS_IOERR;
+
+	/*
+	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
+	 * split and could result in non-contiguous sectors being written in
+	 * different zones.
+	 */
+	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
+		return BLK_STS_IOERR;
+
+	/* Make sure the BIO is small enough and will not get split */
+	if (nr_sectors > q->limits.max_zone_append_sectors)
+		return BLK_STS_IOERR;
+
+	bio->bi_opf |= REQ_NOMERGE;
+
+	return BLK_STS_OK;
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -937,6 +984,11 @@  generic_make_request_checks(struct bio *bio)
 		if (!q->limits.max_write_same_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_ZONE_APPEND:
+		status = blk_check_zone_append(q, bio);
+		if (status != BLK_STS_OK)
+			goto end_io;
+		break;
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_OPEN:
 	case REQ_OP_ZONE_CLOSE:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d92088dec6c3..ce60a071660f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1178,6 +1178,19 @@  static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
 
 #define BLK_MQ_RESOURCE_DELAY	3		/* ms units */
 
+static void blk_mq_handle_zone_resource(struct request *rq,
+					struct list_head *zone_list)
+{
+	/*
+	 * If we end up here it is because we cannot dispatch a request to a
+	 * specific zone due to LLD level zone-write locking or other zone
+	 * related resource not being available. In this case, set the request
+	 * aside in zone_list for retrying it later.
+	 */
+	list_add(&rq->queuelist, zone_list);
+	__blk_mq_requeue_request(rq);
+}
+
 /*
  * Returns true if we did some work AND can potentially do more.
  */
@@ -1189,6 +1202,7 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	bool no_tag = false;
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
+	LIST_HEAD(zone_list);
 
 	if (list_empty(list))
 		return false;
@@ -1257,6 +1271,16 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			list_add(&rq->queuelist, list);
 			__blk_mq_requeue_request(rq);
 			break;
+		} else if (ret == BLK_STS_ZONE_RESOURCE) {
+			/*
+			 * Move the request to zone_list and keep going through
+			 * the dispatch list to find more requests the drive can
+			 * accept.
+			 */
+			blk_mq_handle_zone_resource(rq, &zone_list);
+			if (list_empty(list))
+				break;
+			continue;
 		}
 
 		if (unlikely(ret != BLK_STS_OK)) {
@@ -1268,6 +1292,9 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		queued++;
 	} while (!list_empty(list));
 
+	if (!list_empty(&zone_list))
+		list_splice_tail_init(&zone_list, list);
+
 	hctx->dispatched[queued_to_index(queued)]++;
 
 	/*
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c8eda2e7b91e..5388965841df 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@  void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_zone_append_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -83,6 +84,7 @@  void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_zone_append_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -257,6 +259,25 @@  void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
 
+/**
+ * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
+ * @q:  the request queue for the device
+ * @max_zone_append_sectors: maximum number of sectors to write per command
+ **/
+void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors)
+{
+	unsigned int max_sectors;
+
+	max_sectors = min(q->limits.max_hw_sectors, max_zone_append_sectors);
+	if (max_sectors)
+		max_sectors = min_not_zero(q->limits.chunk_sectors,
+					   max_sectors);
+
+	q->limits.max_zone_append_sectors = max_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_zone_append_sectors);
+
 /**
  * blk_queue_max_segments - set max hw segments for a request for this queue
  * @q:  the request queue for the device
@@ -506,6 +527,8 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
+					b->max_zone_append_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..02643e149d5e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -218,6 +218,13 @@  static ssize_t queue_write_zeroes_max_show(struct request_queue *q, char *page)
 		(unsigned long long)q->limits.max_write_zeroes_sectors << 9);
 }
 
+static ssize_t queue_zone_append_max_show(struct request_queue *q, char *page)
+{
+	unsigned long long max_sectors = q->limits.max_zone_append_sectors;
+
+	return sprintf(page, "%llu\n", max_sectors << SECTOR_SHIFT);
+}
+
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -639,6 +646,11 @@  static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
 	.show = queue_write_zeroes_max_show,
 };
 
+static struct queue_sysfs_entry queue_zone_append_max_entry = {
+	.attr = {.name = "zone_append_max_bytes", .mode = 0444 },
+	.show = queue_zone_append_max_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = 0644 },
 	.show = queue_show_nonrot,
@@ -749,6 +761,7 @@  static struct attribute *queue_attrs[] = {
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
+	&queue_zone_append_max_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_zoned_entry.attr,
 	&queue_nr_zones_entry.attr,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41fa54c..ea327f320b7f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1706,6 +1706,7 @@  static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_ZONE_RESOURCE:
 		if (atomic_read(&sdev->device_busy) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..824ec2d89954 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -63,6 +63,18 @@  typedef u8 __bitwise blk_status_t;
  */
 #define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
 
+/*
+ * BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if zone
+ * related resources are unavailable, but the driver can guarantee the queue
+ * will be rerun in the future once the resources become available again.
+ *
+ * This is different from BLK_STS_DEV_RESOURCE in that it explicitly references
+ * a zone specific resource and IO to a different zone on the same device could
+ * still be served. Examples of that are zones that are write-locked, but a read
+ * to the same zone could be served.
+ */
+#define BLK_STS_ZONE_RESOURCE	((__force blk_status_t)14)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
@@ -296,6 +308,8 @@  enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* write data at the current zone write pointer */
+	REQ_OP_ZONE_APPEND	= 13,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25b63f714619..36111b10d514 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@  struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -757,6 +758,9 @@  static inline bool rq_mergeable(struct request *rq)
 	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
 		return false;
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+		return false;
+
 	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
 		return false;
 	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
@@ -1088,6 +1092,8 @@  extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
+extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors);
 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
 				       unsigned int alignment);
@@ -1301,6 +1307,11 @@  static inline unsigned int queue_max_segment_size(const struct request_queue *q)
 	return q->limits.max_segment_size;
 }
 
+static inline unsigned int queue_max_zone_append_sectors(const struct request_queue *q)
+{
+	return q->limits.max_zone_append_sectors;
+}
+
 static inline unsigned queue_logical_block_size(const struct request_queue *q)
 {
 	int retval = 512;