
[v8,04/11] block: Introduce REQ_OP_ZONE_APPEND

Message ID 20200427113153.31246-5-johannes.thumshirn@wdc.com (mailing list archive)
State New, archived
Series Introduce Zone Append for writing to zoned block devices

Commit Message

Johannes Thumshirn April 27, 2020, 11:31 a.m. UTC
From: Keith Busch <kbusch@kernel.org>

Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
block device. This is a no-merge write operation.

A zone append write BIO must:
* Target a zoned block device
* Have a sector position indicating the start sector of the target zone
* Target a sequential write zone
* Not cross a zone boundary
* Not be split, to ensure that a single range of LBAs is written with a
  single command

Implement these checks in generic_make_request_checks() using the
helper function blk_check_zone_append(). To avoid splitting zone append
BIOs, introduce the new max_zone_append_sectors queue limit attribute
and ensure that a BIO's size never exceeds this limit. Export the new
limit through sysfs and check it in bio_full().
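
To illustrate, here is a minimal, hypothetical sketch (not part of this
patch) of how a zoned block driver could advertise the new limit during
queue setup; "dev_max_append_sectors" stands in for whatever per-command
maximum the hardware reports:

#include <linux/blkdev.h>

/*
 * Illustrative sketch only: advertise the per-command zone append limit.
 * blk_queue_max_zone_append_sectors() caps the value against
 * max_hw_sectors and chunk_sectors internally (see blk-settings.c below).
 */
static void example_set_zone_append_limit(struct request_queue *q,
                                          unsigned int dev_max_append_sectors)
{
        blk_queue_max_zone_append_sectors(q, dev_max_append_sectors);
}

Userspace can then read the resulting limit, in bytes, from the new
zone_append_max_bytes queue attribute in sysfs.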

Also, when an LLDD cannot dispatch a request to a specific zone, it
returns BLK_STS_ZONE_RESOURCE to indicate that the request needs to be
delayed, e.g. because the zone it would be dispatched to is still
write-locked. If this happens, set the request aside in a local list
and keep dispatching other requests, such as READ requests or
WRITE/ZONE_APPEND requests targeting other zones. This way we can
maintain a high queue depth without starving other requests even if
one request cannot be served due to zone write-locking.
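
As a hedged illustration (not taken from any driver in this series), an
LLDD's ->queue_rq() handler might use the new status value along these
lines; example_zone_is_busy() is a hypothetical driver-internal helper:

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Hypothetical driver-internal check for a zone-related resource conflict. */
static bool example_zone_is_busy(struct request *rq);

static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
                                     const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        /*
         * The target zone is temporarily unavailable (e.g. write-locked):
         * ask the block layer to set this request aside and keep dispatching
         * requests that target other zones.
         */
        if ((req_op(rq) == REQ_OP_WRITE ||
             req_op(rq) == REQ_OP_ZONE_APPEND) && example_zone_is_busy(rq))
                return BLK_STS_ZONE_RESOURCE;

        /* Normal command setup and dispatch would go here. */
        return BLK_STS_OK;
}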

Finally, make sure that on completion the bio sector position reflects
the actual write position as reported by the device.
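
For illustration, a minimal sketch of a possible in-kernel user
(hypothetical, not part of this patch) that issues a zone append and picks
up the actual write position in its completion handler; the single page
added here is assumed to be well below the advertised append limit, so
plain bio_add_page() suffices:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/printk.h>

static void example_append_end_io(struct bio *bio)
{
        /*
         * On success, the block layer has updated bi_sector to the sector
         * the device actually wrote the data to.
         */
        if (!bio->bi_status)
                pr_info("appended at sector %llu\n",
                        (unsigned long long)bio->bi_iter.bi_sector);
        bio_put(bio);
}

static void example_submit_zone_append(struct block_device *bdev,
                                       sector_t zone_start_sector,
                                       struct page *page, unsigned int len)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 1);

        bio_set_dev(bio, bdev);
        /* The sector must point at the start of the target sequential zone. */
        bio->bi_iter.bi_sector = zone_start_sector;
        bio->bi_opf = REQ_OP_ZONE_APPEND | REQ_SYNC;
        bio->bi_end_io = example_append_end_io;
        bio_add_page(bio, page, len, 0);
        submit_bio(bio);
}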

Signed-off-by: Keith Busch <kbusch@kernel.org>
[ jth: added zone-append specific add_page and merge_page helpers ]
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 block/bio.c               | 62 ++++++++++++++++++++++++++++++++++++---
 block/blk-core.c          | 52 ++++++++++++++++++++++++++++++++
 block/blk-mq.c            | 27 +++++++++++++++++
 block/blk-settings.c      | 23 +++++++++++++++
 block/blk-sysfs.c         | 13 ++++++++
 drivers/scsi/scsi_lib.c   |  1 +
 include/linux/blk_types.h | 14 +++++++++
 include/linux/blkdev.h    | 11 +++++++
 8 files changed, 199 insertions(+), 4 deletions(-)

Comments

Hannes Reinecke April 27, 2020, 12:30 p.m. UTC | #1
On 4/27/20 1:31 PM, Johannes Thumshirn wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
> block device. This is a no-merge write operation.
> 
> A zone append write BIO must:
> * Target a zoned block device
> * Have a sector position indicating the start sector of the target zone
> * Target a sequential write zone
> * Not cross a zone boundary
> * Not be split, to ensure that a single range of LBAs is written with a
>   single command
> 
> Implement these checks in generic_make_request_checks() using the
> helper function blk_check_zone_append(). To avoid splitting zone append
> BIOs, introduce the new max_zone_append_sectors queue limit attribute
> and ensure that a BIO's size never exceeds this limit. Export the new
> limit through sysfs and check it in bio_full().
> 
> Also, when an LLDD cannot dispatch a request to a specific zone, it
> returns BLK_STS_ZONE_RESOURCE to indicate that the request needs to be
> delayed, e.g. because the zone it would be dispatched to is still
> write-locked. If this happens, set the request aside in a local list
> and keep dispatching other requests, such as READ requests or
> WRITE/ZONE_APPEND requests targeting other zones. This way we can
> maintain a high queue depth without starving other requests even if
> one request cannot be served due to zone write-locking.
> 
> Finally, make sure that on completion the bio sector position reflects
> the actual write position as reported by the device.
> 
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> [ jth: added zone-append specific add_page and merge_page helpers ]
> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/bio.c               | 62 ++++++++++++++++++++++++++++++++++++---
>   block/blk-core.c          | 52 ++++++++++++++++++++++++++++++++
>   block/blk-mq.c            | 27 +++++++++++++++++
>   block/blk-settings.c      | 23 +++++++++++++++
>   block/blk-sysfs.c         | 13 ++++++++
>   drivers/scsi/scsi_lib.c   |  1 +
>   include/linux/blk_types.h | 14 +++++++++
>   include/linux/blkdev.h    | 11 +++++++
>   8 files changed, 199 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
Bart Van Assche April 28, 2020, 5:20 a.m. UTC | #2
On 2020-04-27 04:31, Johannes Thumshirn wrote:
> +/*
> + * Check write append to a zoned block device.
> + */
> +static inline blk_status_t blk_check_zone_append(struct request_queue *q,
> +						 struct bio *bio)
> +{
> +	sector_t pos = bio->bi_iter.bi_sector;
> +	int nr_sectors = bio_sectors(bio);
> +
> +	/* Only applicable to zoned block devices */
> +	if (!blk_queue_is_zoned(q))
> +		return BLK_STS_NOTSUPP;
> +
> +	/* The bio sector must point to the start of a sequential zone */
> +	if (pos & (blk_queue_zone_sectors(q) - 1) ||
> +	    !blk_queue_zone_is_seq(q, pos))
> +		return BLK_STS_IOERR;
> +
> +	/*
> +	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
> +	 * split and could result in non-contiguous sectors being written in
> +	 * different zones.
> +	 */
> +	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
> +		return BLK_STS_IOERR;
> +
> +	/* Make sure the BIO is small enough and will not get split */
> +	if (nr_sectors > q->limits.max_zone_append_sectors)
> +		return BLK_STS_IOERR;
> +
> +	bio->bi_opf |= REQ_NOMERGE;
> +
> +	return BLK_STS_OK;
> +}

Since the above function has not changed compared to v7, I will repeat
my question about this function. Since 'pos' refers to the start of a
zone, is the "blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos +
nr_sectors)" check identical to nr_sectors < q->limits.chunk_sectors?
Since q->limits.max_zone_append_sectors is guaranteed to be less than or
equal to the size of a zone, does that mean that the check
"blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors)" is
superfluous?

Thanks,

Bart.
Damien Le Moal April 28, 2020, 5:42 a.m. UTC | #3
On 2020/04/28 14:20, Bart Van Assche wrote:
> On 2020-04-27 04:31, Johannes Thumshirn wrote:
>> +/*
>> + * Check write append to a zoned block device.
>> + */
>> +static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>> +						 struct bio *bio)
>> +{
>> +	sector_t pos = bio->bi_iter.bi_sector;
>> +	int nr_sectors = bio_sectors(bio);
>> +
>> +	/* Only applicable to zoned block devices */
>> +	if (!blk_queue_is_zoned(q))
>> +		return BLK_STS_NOTSUPP;
>> +
>> +	/* The bio sector must point to the start of a sequential zone */
>> +	if (pos & (blk_queue_zone_sectors(q) - 1) ||
>> +	    !blk_queue_zone_is_seq(q, pos))
>> +		return BLK_STS_IOERR;
>> +
>> +	/*
>> +	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
>> +	 * split and could result in non-contiguous sectors being written in
>> +	 * different zones.
>> +	 */
>> +	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
>> +		return BLK_STS_IOERR;
>> +
>> +	/* Make sure the BIO is small enough and will not get split */
>> +	if (nr_sectors > q->limits.max_zone_append_sectors)
>> +		return BLK_STS_IOERR;
>> +
>> +	bio->bi_opf |= REQ_NOMERGE;
>> +
>> +	return BLK_STS_OK;
>> +}
> 
> Since the above function has not changed compared to v7, I will repeat
> my question about this function. Since 'pos' refers to the start of a
> zone, is the "blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos +
> nr_sectors)" check identical to nr_sectors < q->limits.chunk_sectors?

Bart, I think I already answered... But while writing an answer to your
question again, I realized that you are correct. My previous answer was: no,
the tests are not equivalent. But thinking about this again, since the block
layer BIO splitting code decides whether to split a BIO based on whether
pos + nr_sectors exceeds the zone size, yes, the first test is not necessary.

We can reduce this to only testing that nr_sectors does not exceed
q->limits.max_zone_append_sectors since we already tested pos alignment to the
zone start.
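
For reference, a sketch of what the reduced check could look like
(illustrative only, this is not the code in the posted patch):

static inline blk_status_t blk_check_zone_append(struct request_queue *q,
                                                 struct bio *bio)
{
        sector_t pos = bio->bi_iter.bi_sector;
        int nr_sectors = bio_sectors(bio);

        if (!blk_queue_is_zoned(q))
                return BLK_STS_NOTSUPP;

        /* The bio sector must point to the start of a sequential zone */
        if (pos & (blk_queue_zone_sectors(q) - 1) ||
            !blk_queue_zone_is_seq(q, pos))
                return BLK_STS_IOERR;

        /*
         * max_zone_append_sectors is capped at the zone size, so this bound
         * alone also keeps the BIO from crossing a zone boundary.
         */
        if (nr_sectors > q->limits.max_zone_append_sectors)
                return BLK_STS_IOERR;

        bio->bi_opf |= REQ_NOMERGE;

        return BLK_STS_OK;
}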

> Since q->limits.max_zone_append_sectors is guaranteed to be less than or
> equal to the size of a zone, does that mean that the check
> "blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors)" is
> superfluous?

Yes, it is.

> 
> Thanks,
> 
> Bart.
>

Patch

diff --git a/block/bio.c b/block/bio.c
index 0f0e337e46b4..6dffc45ba895 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1016,6 +1016,50 @@  static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	return 0;
 }
 
+static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
+	struct request_queue *q = bio->bi_disk->queue;
+	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
+	struct page **pages = (struct page **)bv;
+	ssize_t size, left;
+	unsigned len, i;
+	size_t offset;
+
+	if (WARN_ON_ONCE(!max_append_sectors))
+		return 0;
+
+	/*
+	 * Move page array up in the allocated memory for the bio vecs as far as
+	 * possible so that we can start filling biovecs from the beginning
+	 * without overwriting the temporary page array.
+	 */
+	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
+	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
+
+	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	if (unlikely(size <= 0))
+		return size ? size : -EFAULT;
+
+	for (left = size, i = 0; left > 0; left -= len, i++) {
+		struct page *page = pages[i];
+		bool same_page = false;
+
+		len = min_t(size_t, PAGE_SIZE - offset, left);
+		if (bio_add_hw_page(q, bio, page, len, offset,
+				max_append_sectors, &same_page) != len)
+			return -EINVAL;
+		if (same_page)
+			put_page(page);
+		offset = 0;
+	}
+
+	iov_iter_advance(iter, size);
+	return 0;
+}
+
 /**
  * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
@@ -1045,10 +1089,16 @@  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return -EINVAL;
 
 	do {
-		if (is_bvec)
-			ret = __bio_iov_bvec_add_pages(bio, iter);
-		else
-			ret = __bio_iov_iter_get_pages(bio, iter);
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			if (WARN_ON_ONCE(is_bvec))
+				return -EINVAL;
+			ret = __bio_iov_append_get_pages(bio, iter);
+		} else {
+			if (is_bvec)
+				ret = __bio_iov_bvec_add_pages(bio, iter);
+			else
+				ret = __bio_iov_iter_get_pages(bio, iter);
+		}
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
 	if (is_bvec)
@@ -1451,6 +1501,10 @@  struct bio *bio_split(struct bio *bio, int sectors,
 	BUG_ON(sectors <= 0);
 	BUG_ON(sectors >= bio_sectors(bio));
 
+	/* Zone append commands cannot be split */
+	if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
+		return NULL;
+
 	split = bio_clone_fast(bio, gfp, bs);
 	if (!split)
 		return NULL;
diff --git a/block/blk-core.c b/block/blk-core.c
index dffff2100888..c6a34d2a4eea 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -135,6 +135,7 @@  static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_OPEN),
 	REQ_OP_NAME(ZONE_CLOSE),
 	REQ_OP_NAME(ZONE_FINISH),
+	REQ_OP_NAME(ZONE_APPEND),
 	REQ_OP_NAME(WRITE_SAME),
 	REQ_OP_NAME(WRITE_ZEROES),
 	REQ_OP_NAME(SCSI_IN),
@@ -240,6 +241,17 @@  static void req_bio_endio(struct request *rq, struct bio *bio,
 
 	bio_advance(bio, nbytes);
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND && error == BLK_STS_OK) {
+		/*
+		 * Partial zone append completions cannot be supported as the
+		 * BIO fragments may end up not being written sequentially.
+		 */
+		if (bio->bi_iter.bi_size)
+			bio->bi_status = BLK_STS_IOERR;
+		else
+			bio->bi_iter.bi_sector = rq->__sector;
+	}
+
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
 		bio_endio(bio);
@@ -871,6 +883,41 @@  static inline int blk_partition_remap(struct bio *bio)
 	return ret;
 }
 
+/*
+ * Check write append to a zoned block device.
+ */
+static inline blk_status_t blk_check_zone_append(struct request_queue *q,
+						 struct bio *bio)
+{
+	sector_t pos = bio->bi_iter.bi_sector;
+	int nr_sectors = bio_sectors(bio);
+
+	/* Only applicable to zoned block devices */
+	if (!blk_queue_is_zoned(q))
+		return BLK_STS_NOTSUPP;
+
+	/* The bio sector must point to the start of a sequential zone */
+	if (pos & (blk_queue_zone_sectors(q) - 1) ||
+	    !blk_queue_zone_is_seq(q, pos))
+		return BLK_STS_IOERR;
+
+	/*
+	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
+	 * split and could result in non-contiguous sectors being written in
+	 * different zones.
+	 */
+	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
+		return BLK_STS_IOERR;
+
+	/* Make sure the BIO is small enough and will not get split */
+	if (nr_sectors > q->limits.max_zone_append_sectors)
+		return BLK_STS_IOERR;
+
+	bio->bi_opf |= REQ_NOMERGE;
+
+	return BLK_STS_OK;
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -943,6 +990,11 @@  generic_make_request_checks(struct bio *bio)
 		if (!q->limits.max_write_same_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_ZONE_APPEND:
+		status = blk_check_zone_append(q, bio);
+		if (status != BLK_STS_OK)
+			goto end_io;
+		break;
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_OPEN:
 	case REQ_OP_ZONE_CLOSE:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index bcc3a2397d4a..0b8567ef23bb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1183,6 +1183,19 @@  static void blk_mq_handle_dev_resource(struct request *rq,
 	__blk_mq_requeue_request(rq);
 }
 
+static void blk_mq_handle_zone_resource(struct request *rq,
+					struct list_head *zone_list)
+{
+	/*
+	 * If we end up here it is because we cannot dispatch a request to a
+	 * specific zone due to LLD level zone-write locking or other zone
+	 * related resource not being available. In this case, set the request
+	 * aside in zone_list for retrying it later.
+	 */
+	list_add(&rq->queuelist, zone_list);
+	__blk_mq_requeue_request(rq);
+}
+
 /*
  * Returns true if we did some work AND can potentially do more.
  */
@@ -1195,6 +1208,7 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
 	bool no_budget_avail = false;
+	LIST_HEAD(zone_list);
 
 	if (list_empty(list))
 		return false;
@@ -1256,6 +1270,16 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_mq_handle_dev_resource(rq, list);
 			break;
+		} else if (ret == BLK_STS_ZONE_RESOURCE) {
+			/*
+			 * Move the request to zone_list and keep going through
+			 * the dispatch list to find more requests the drive can
+			 * accept.
+			 */
+			blk_mq_handle_zone_resource(rq, &zone_list);
+			if (list_empty(list))
+				break;
+			continue;
 		}
 
 		if (unlikely(ret != BLK_STS_OK)) {
@@ -1267,6 +1291,9 @@  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		queued++;
 	} while (!list_empty(list));
 
+	if (!list_empty(&zone_list))
+		list_splice_tail_init(&zone_list, list);
+
 	hctx->dispatched[queued_to_index(queued)]++;
 
 	/*
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 2ab1967b9716..8452ffd33cfc 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@  void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_zone_append_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -83,6 +84,7 @@  void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_zone_append_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -221,6 +223,25 @@  void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
 
+/**
+ * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
+ * @q:  the request queue for the device
+ * @max_zone_append_sectors: maximum number of sectors to write per command
+ **/
+void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors)
+{
+	unsigned int max_sectors;
+
+	max_sectors = min(q->limits.max_hw_sectors, max_zone_append_sectors);
+	if (max_sectors)
+		max_sectors = min_not_zero(q->limits.chunk_sectors,
+					   max_sectors);
+
+	q->limits.max_zone_append_sectors = max_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_zone_append_sectors);
+
 /**
  * blk_queue_max_segments - set max hw segments for a request for this queue
  * @q:  the request queue for the device
@@ -470,6 +491,8 @@  int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
+					b->max_zone_append_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..02643e149d5e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -218,6 +218,13 @@  static ssize_t queue_write_zeroes_max_show(struct request_queue *q, char *page)
 		(unsigned long long)q->limits.max_write_zeroes_sectors << 9);
 }
 
+static ssize_t queue_zone_append_max_show(struct request_queue *q, char *page)
+{
+	unsigned long long max_sectors = q->limits.max_zone_append_sectors;
+
+	return sprintf(page, "%llu\n", max_sectors << SECTOR_SHIFT);
+}
+
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -639,6 +646,11 @@  static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
 	.show = queue_write_zeroes_max_show,
 };
 
+static struct queue_sysfs_entry queue_zone_append_max_entry = {
+	.attr = {.name = "zone_append_max_bytes", .mode = 0444 },
+	.show = queue_zone_append_max_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = 0644 },
 	.show = queue_show_nonrot,
@@ -749,6 +761,7 @@  static struct attribute *queue_attrs[] = {
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
+	&queue_zone_append_max_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_zoned_entry.attr,
 	&queue_nr_zones_entry.attr,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index f0cb26b3da6a..af00e4a3f006 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1712,6 +1712,7 @@  static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_ZONE_RESOURCE:
 		if (atomic_read(&sdev->device_busy) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 31eb92876be7..dbee2b0d56b6 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -63,6 +63,18 @@  typedef u8 __bitwise blk_status_t;
  */
 #define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
 
+/*
+ * BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if zone
+ * related resources are unavailable, but the driver can guarantee the queue
+ * will be rerun in the future once the resources become available again.
+ *
+ * This is different from BLK_STS_DEV_RESOURCE in that it explicitly references
+ * a zone specific resource and IO to a different zone on the same device could
+ * still be served. Examples of that are zones that are write-locked, but a read
+ * to the same zone could be served.
+ */
+#define BLK_STS_ZONE_RESOURCE	((__force blk_status_t)14)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
@@ -296,6 +308,8 @@  enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* write data at the current zone write pointer */
+	REQ_OP_ZONE_APPEND	= 13,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 91c6e413bf6b..158641fbc7cd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -331,6 +331,7 @@  struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -749,6 +750,9 @@  static inline bool rq_mergeable(struct request *rq)
 	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
 		return false;
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+		return false;
+
 	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
 		return false;
 	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
@@ -1083,6 +1087,8 @@  extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
+extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors);
 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
 				       unsigned int alignment);
@@ -1300,6 +1306,11 @@  static inline unsigned int queue_max_segment_size(const struct request_queue *q)
 	return q->limits.max_segment_size;
 }
 
+static inline unsigned int queue_max_zone_append_sectors(const struct request_queue *q)
+{
+	return q->limits.max_zone_append_sectors;
+}
+
 static inline unsigned queue_logical_block_size(const struct request_queue *q)
 {
 	int retval = 512;