Message ID | 20200417121536.5393-5-johannes.thumshirn@wdc.com (mailing list archive) |
---|---|
State | New, archived |
Series | Introduce Zone Append for writing to zoned block devices |
On 2020-04-17 05:15, Johannes Thumshirn wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
> block device. This is a no-merge write operation.
>
> A zone append write BIO must:
> * Target a zoned block device
> * Have a sector position indicating the start sector of the target zone

Why the start sector instead of any sector in the target zone? Wouldn't
the latter make it easier to write software that uses REQ_OP_ZONE_APPEND?

> * The target zone must be a sequential write zone
> * The BIO must not cross a zone boundary
> * The BIO size must not be split to ensure that a single range of LBAs
>   is written with a single command.

"BIO size must" -> "BIO must"?

> diff --git a/block/bio.c b/block/bio.c
> index 0f0e337e46b4..97baadc6d964 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1006,7 +1006,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  				put_page(page);
>  		} else {
>  			if (WARN_ON_ONCE(bio_full(bio, len)))
> -				return -EINVAL;
> +				return -EINVAL;
>  			__bio_add_page(bio, page, len, offset);
>  		}
>  		offset = 0;

Has the 'return' statement been indented correctly? I see a tab and a
space in front of that statement instead of only a tab.

> @@ -1451,6 +1501,10 @@ struct bio *bio_split(struct bio *bio, int sectors,
>  	BUG_ON(sectors <= 0);
>  	BUG_ON(sectors >= bio_sectors(bio));
>
> +	/* Zone append commands cannot be split */
> +	if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
> +		return NULL;
> +
>  	split = bio_clone_fast(bio, gfp, bs);
>  	if (!split)
>  		return NULL;

Zone append commands -> Zone append bio's?

> +/*
> + * Check write append to a zoned block device.
> + */
> +static inline blk_status_t blk_check_zone_append(struct request_queue *q,
> +						 struct bio *bio)
> +{
> +	sector_t pos = bio->bi_iter.bi_sector;
> +	int nr_sectors = bio_sectors(bio);
> +
> +	/* Only applicable to zoned block devices */
> +	if (!blk_queue_is_zoned(q))
> +		return BLK_STS_NOTSUPP;
> +
> +	/* The bio sector must point to the start of a sequential zone */
> +	if (pos & (blk_queue_zone_sectors(q) - 1) ||
> +	    !blk_queue_zone_is_seq(q, pos))
> +		return BLK_STS_IOERR;
> +
> +	/*
> +	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
> +	 * split and could result in non-contiguous sectors being written in
> +	 * different zones.
> +	 */
> +	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
> +		return BLK_STS_IOERR;

Can the above statement be simplified into the following?

	if (nr_sectors > q->limits.chunk_sectors)
		return BLK_STS_IOERR;

> +	/* Make sure the BIO is small enough and will not get split */
> +	if (nr_sectors > q->limits.max_zone_append_sectors)
> +		return BLK_STS_IOERR;

Do we really need a new request queue limit parameter? In which cases
will max_zone_append_sectors differ from the zone size?

Thanks,

Bart.
On 2020/04/19 1:46, Bart Van Assche wrote:
> On 2020-04-17 05:15, Johannes Thumshirn wrote:
>> From: Keith Busch <kbusch@kernel.org>
>>
>> Define REQ_OP_ZONE_APPEND to append-write sectors to a zone of a zoned
>> block device. This is a no-merge write operation.
>>
>> A zone append write BIO must:
>> * Target a zoned block device
>> * Have a sector position indicating the start sector of the target zone
>
> Why the start sector instead of any sector in the target zone? Wouldn't
> the latter make it easier to write software that uses REQ_OP_ZONE_APPEND?

We could do that, but we chose to have the interface match that of other zone
operations (e.g. REQ_OP_ZONE_RESET/OPEN/CLOSE/FINISH) which also require the
zone start sector.

>> * The target zone must be a sequential write zone
>> * The BIO must not cross a zone boundary
>> * The BIO size must not be split to ensure that a single range of LBAs
>>   is written with a single command.
>
> "BIO size must" -> "BIO must"?
>
>> diff --git a/block/bio.c b/block/bio.c
>> index 0f0e337e46b4..97baadc6d964 100644
>> --- a/block/bio.c
>> +++ b/block/bio.c
>> @@ -1006,7 +1006,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>>  				put_page(page);
>>  		} else {
>>  			if (WARN_ON_ONCE(bio_full(bio, len)))
>> -				return -EINVAL;
>> +				return -EINVAL;
>>  			__bio_add_page(bio, page, len, offset);
>>  		}
>>  		offset = 0;
>
> Has the 'return' statement been indented correctly? I see a tab and a
> space in front of that statement instead of only a tab.
>
>> @@ -1451,6 +1501,10 @@ struct bio *bio_split(struct bio *bio, int sectors,
>>  	BUG_ON(sectors <= 0);
>>  	BUG_ON(sectors >= bio_sectors(bio));
>>
>> +	/* Zone append commands cannot be split */
>> +	if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
>> +		return NULL;
>> +
>>  	split = bio_clone_fast(bio, gfp, bs);
>>  	if (!split)
>>  		return NULL;
>
> Zone append commands -> Zone append bio's?
>
>> +/*
>> + * Check write append to a zoned block device.
>> + */
>> +static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>> +						 struct bio *bio)
>> +{
>> +	sector_t pos = bio->bi_iter.bi_sector;
>> +	int nr_sectors = bio_sectors(bio);
>> +
>> +	/* Only applicable to zoned block devices */
>> +	if (!blk_queue_is_zoned(q))
>> +		return BLK_STS_NOTSUPP;
>> +
>> +	/* The bio sector must point to the start of a sequential zone */
>> +	if (pos & (blk_queue_zone_sectors(q) - 1) ||
>> +	    !blk_queue_zone_is_seq(q, pos))
>> +		return BLK_STS_IOERR;
>> +
>> +	/*
>> +	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
>> +	 * split and could result in non-contiguous sectors being written in
>> +	 * different zones.
>> +	 */
>> +	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
>> +		return BLK_STS_IOERR;
>
> Can the above statement be simplified into the following?
>
> 	if (nr_sectors > q->limits.chunk_sectors)
> 		return BLK_STS_IOERR;

That would be equivalent only if the zone is empty. If the zone is not empty, we
need to check that the zone append request does not cross over to the next zone,
which would result in the BIO being split by the block layer.

>
>> +	/* Make sure the BIO is small enough and will not get split */
>> +	if (nr_sectors > q->limits.max_zone_append_sectors)
>> +		return BLK_STS_IOERR;
>
> Do we really need a new request queue limit parameter? In which cases
> will max_zone_append_sectors differ from the zone size?

Yes it can differ from the zone size. On real hardware, max_zone_append_sectors
will most of the time be equal to max_hw_sectors_kb. But it could be smaller
than that too. For the host, since a zone append is a write, it is also subject
to the max_sectors_kb limit and cannot exceed that value.

>
> Thanks,
>
> Bart.
>
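As a rough sketch of the clamping described above (based on the
blk_queue_max_zone_append_sectors() helper added by this patch; the
"hw_zasl_sectors" parameter is a made-up stand-in for whatever per-device
zone append limit a driver knows about), the value a zoned driver ends up
exposing is the minimum of its own limit, max_hw_sectors and the zone size:

	#include <linux/blkdev.h>

	/*
	 * Illustration only, not part of the patch. The helper clamps the
	 * hypothetical device limit to max_hw_sectors and chunk_sectors
	 * (zone size), with chunk_sectors == 0 meaning "no zone limit",
	 * so the resulting limit can be smaller than the zone size.
	 */
	static void example_set_zone_append_limit(struct request_queue *q,
						  unsigned int hw_zasl_sectors)
	{
		blk_queue_max_zone_append_sectors(q, hw_zasl_sectors);

		/*
		 * q->limits.max_zone_append_sectors is now
		 * min(hw_zasl_sectors, max_hw_sectors, chunk_sectors),
		 * where a zero chunk_sectors is ignored by min_not_zero().
		 */
	}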
On 4/19/20 5:30 PM, Damien Le Moal wrote:
> On 2020/04/19 1:46, Bart Van Assche wrote:
>> On 2020-04-17 05:15, Johannes Thumshirn wrote:
>>> From: Keith Busch <kbusch@kernel.org>
>>> +/*
>>> + * Check write append to a zoned block device.
>>> + */
>>> +static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>>> +						 struct bio *bio)
>>> +{
>>> +	sector_t pos = bio->bi_iter.bi_sector;
>>> +	int nr_sectors = bio_sectors(bio);
>>> +
>>> +	/* Only applicable to zoned block devices */
>>> +	if (!blk_queue_is_zoned(q))
>>> +		return BLK_STS_NOTSUPP;
>>> +
>>> +	/* The bio sector must point to the start of a sequential zone */
>>> +	if (pos & (blk_queue_zone_sectors(q) - 1) ||
>>> +	    !blk_queue_zone_is_seq(q, pos))
>>> +		return BLK_STS_IOERR;
>>> +
>>> +	/*
>>> +	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
>>> +	 * split and could result in non-contiguous sectors being written in
>>> +	 * different zones.
>>> +	 */
>>> +	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
>>> +		return BLK_STS_IOERR;
>>
>> Can the above statement be simplified into the following?
>>
>> 	if (nr_sectors > q->limits.chunk_sectors)
>> 		return BLK_STS_IOERR;
>
> That would be equivalent only if the zone is empty. If the zone is not empty, we
> need to check that the zone append request does not cross over to the next zone,
> which would result in the BIO being split by the block layer.

At the start of the blk_check_zone_append() function there is a check that
'pos' is aligned with a zone boundary. How can 'pos' at the same time
represent a zone boundary and the exact offset at which the write will
happen? I do not understand this.

>>> +	/* Make sure the BIO is small enough and will not get split */
>>> +	if (nr_sectors > q->limits.max_zone_append_sectors)
>>> +		return BLK_STS_IOERR;
>>
>> Do we really need a new request queue limit parameter? In which cases
>> will max_zone_append_sectors differ from the zone size?
>
> Yes it can differ from the zone size. On real hardware, max_zone_append_sectors
> will most of the time be equal to max_hw_sectors_kb. But it could be smaller
> than that too. For the host, since a zone append is a write, it is also subject
> to the max_sectors_kb limit and cannot exceed that value.

Would it be possible to use min(max_hw_sectors, zone_size) instead of
introducing a new user-visible parameter?

Thanks,

Bart.
On 2020/04/20 9:49, Bart Van Assche wrote:
> On 4/19/20 5:30 PM, Damien Le Moal wrote:
>> On 2020/04/19 1:46, Bart Van Assche wrote:
>>> On 2020-04-17 05:15, Johannes Thumshirn wrote:
>>>> From: Keith Busch <kbusch@kernel.org>
>>>> +/*
>>>> + * Check write append to a zoned block device.
>>>> + */
>>>> +static inline blk_status_t blk_check_zone_append(struct request_queue *q,
>>>> +						 struct bio *bio)
>>>> +{
>>>> +	sector_t pos = bio->bi_iter.bi_sector;
>>>> +	int nr_sectors = bio_sectors(bio);
>>>> +
>>>> +	/* Only applicable to zoned block devices */
>>>> +	if (!blk_queue_is_zoned(q))
>>>> +		return BLK_STS_NOTSUPP;
>>>> +
>>>> +	/* The bio sector must point to the start of a sequential zone */
>>>> +	if (pos & (blk_queue_zone_sectors(q) - 1) ||
>>>> +	    !blk_queue_zone_is_seq(q, pos))
>>>> +		return BLK_STS_IOERR;
>>>> +
>>>> +	/*
>>>> +	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
>>>> +	 * split and could result in non-contiguous sectors being written in
>>>> +	 * different zones.
>>>> +	 */
>>>> +	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
>>>> +		return BLK_STS_IOERR;
>>>
>>> Can the above statement be simplified into the following?
>>>
>>> 	if (nr_sectors > q->limits.chunk_sectors)
>>> 		return BLK_STS_IOERR;
>>
>> That would be equivalent only if the zone is empty. If the zone is not empty, we
>> need to check that the zone append request does not cross over to the next zone,
>> which would result in the BIO being split by the block layer.
>
> At the start of the blk_check_zone_append() function there is a check that
> 'pos' is aligned with a zone boundary. How can 'pos' at the same time
> represent a zone boundary and the exact offset at which the write will
> happen? I do not understand this.

pos indicates the zone to be written, not the actual position the data will be
written to. The actual position will be determined by the drive when it receives
the zone append command. That position will be, of course, the zone write pointer
which the drive manages. The actual position the data was written to is returned
by the drive and passed along in the IO stack using bio->bi_iter.bi_sector.

Example: A thread sends a 4KB zone append write to a zone starting at sector Z
with a write pointer value of W. On submission, bio->bi_iter.bi_sector must be
equal to Z. On completion, bio->bi_iter.bi_sector will be W. If another zone
append is sent, again bio->bi_iter.bi_sector must be equal to Z and on
completion, bio->bi_iter.bi_sector will be equal to W+8 (since 4KB have been
written with the previous zone append).

Zone append allows an application to send writes to any zone without knowing the
zone write pointer. The only thing the user needs to know is if the zone has
enough free space or not. For a file system, this changes the current classic
model of (a) allocate blocks, write allocated blocks, commit metadata, to
something like (b) pick a target zone with free space, write blocks, mark used
blocks as allocated, commit metadata.

>
>>>> +	/* Make sure the BIO is small enough and will not get split */
>>>> +	if (nr_sectors > q->limits.max_zone_append_sectors)
>>>> +		return BLK_STS_IOERR;
>>>
>>> Do we really need a new request queue limit parameter? In which cases
>>> will max_zone_append_sectors differ from the zone size?
>>
>> Yes it can differ from the zone size. On real hardware, max_zone_append_sectors
>> will most of the time be equal to max_hw_sectors_kb. But it could be smaller
>> than that too. For the host, since a zone append is a write, it is also subject
>> to the max_sectors_kb limit and cannot exceed that value.
>
> Would it be possible to use min(max_hw_sectors, zone_size) instead of
> introducing a new user-visible parameter?

For SCSI emulation, that is what is being used to set max_zone_append_sectors.
But that value may not be appropriate for a ZNS NVMe device as that device may
have different constraints for regular writes and zone append writes.

>
> Thanks,
>
> Bart.
>
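A minimal sketch of the submission/completion contract described above (not
part of this patch; the example_* names are illustrative and error handling
is omitted):

	#include <linux/bio.h>
	#include <linux/blkdev.h>
	#include <linux/printk.h>

	/* Completion: on success, bi_sector now holds the sector the device
	 * actually wrote to (W in the example above). */
	static void example_zone_append_end_io(struct bio *bio)
	{
		if (!bio->bi_status)
			pr_debug("zone append landed at sector %llu\n",
				 (unsigned long long)bio->bi_iter.bi_sector);
		bio_put(bio);
	}

	/* Submission: bi_sector is set to the zone start (Z), not the write
	 * pointer; the drive picks the actual write position. */
	static void example_submit_zone_append(struct block_device *bdev,
					       struct page *page,
					       unsigned int len,
					       sector_t zone_start_sector)
	{
		struct bio *bio = bio_alloc(GFP_KERNEL, 1);

		bio_set_dev(bio, bdev);
		bio->bi_iter.bi_sector = zone_start_sector;
		bio->bi_opf = REQ_OP_ZONE_APPEND;
		bio->bi_end_io = example_zone_append_end_io;
		__bio_add_page(bio, page, len, 0);
		submit_bio(bio);
	}

The caller records the sector reported at completion in its metadata instead
of deciding the write position up front, which is what enables the (b) model
above.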
On Mon, Apr 20, 2020 at 12:30:25AM +0000, Damien Le Moal wrote:
> >
> > Why the start sector instead of any sector in the target zone? Wouldn't
> > the latter make it easier to write software that uses REQ_OP_ZONE_APPEND?
>
> We could do that, but we chose to have the interface match that of other zone
> operations (e.g. REQ_OP_ZONE_RESET/OPEN/CLOSE/FINISH) which also require the
> zone start sector.

It also would not simplify anything. The fs block allocator needs to pick a
zone, and the device picks the actual LBA, so specifying any "random" LBA in
the zone would not actually benefit the fs.
diff --git a/block/bio.c b/block/bio.c
index 0f0e337e46b4..97baadc6d964 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1006,7 +1006,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 				put_page(page);
 		} else {
 			if (WARN_ON_ONCE(bio_full(bio, len)))
-				return -EINVAL;
+				return -EINVAL;
 			__bio_add_page(bio, page, len, offset);
 		}
 		offset = 0;
@@ -1016,6 +1016,50 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	return 0;
 }
 
+static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
+{
+	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
+	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
+	struct request_queue *q = bio->bi_disk->queue;
+	unsigned int max_append_sectors = queue_max_zone_append_sectors(q);
+	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
+	struct page **pages = (struct page **)bv;
+	ssize_t size, left;
+	unsigned len, i;
+	size_t offset;
+
+	if (WARN_ON_ONCE(!max_append_sectors))
+		return 0;
+
+	/*
+	 * Move page array up in the allocated memory for the bio vecs as far as
+	 * possible so that we can start filling biovecs from the beginning
+	 * without overwriting the temporary page array.
+	 */
+	BUILD_BUG_ON(PAGE_PTRS_PER_BVEC < 2);
+	pages += entries_left * (PAGE_PTRS_PER_BVEC - 1);
+
+	size = iov_iter_get_pages(iter, pages, LONG_MAX, nr_pages, &offset);
+	if (unlikely(size <= 0))
+		return size ? size : -EFAULT;
+
+	for (left = size, i = 0; left > 0; left -= len, i++) {
+		struct page *page = pages[i];
+		bool same_page = false;
+
+		len = min_t(size_t, PAGE_SIZE - offset, left);
+		if (bio_add_hw_page(q, bio, page, len, offset,
+				max_append_sectors, &same_page) != len)
+			return -EINVAL;
+		if (same_page)
+			put_page(page);
+		offset = 0;
+	}
+
+	iov_iter_advance(iter, size);
+	return 0;
+}
+
 /**
  * bio_iov_iter_get_pages - add user or kernel pages to a bio
  * @bio: bio to add pages to
@@ -1045,10 +1089,16 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 		return -EINVAL;
 
 	do {
-		if (is_bvec)
-			ret = __bio_iov_bvec_add_pages(bio, iter);
-		else
-			ret = __bio_iov_iter_get_pages(bio, iter);
+		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
+			if (WARN_ON_ONCE(is_bvec))
+				return -EINVAL;
+			ret = __bio_iov_append_get_pages(bio, iter);
+		} else {
+			if (is_bvec)
+				ret = __bio_iov_bvec_add_pages(bio, iter);
+			else
+				ret = __bio_iov_iter_get_pages(bio, iter);
+		}
 	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
 
 	if (is_bvec)
@@ -1451,6 +1501,10 @@ struct bio *bio_split(struct bio *bio, int sectors,
 	BUG_ON(sectors <= 0);
 	BUG_ON(sectors >= bio_sectors(bio));
 
+	/* Zone append commands cannot be split */
+	if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
+		return NULL;
+
 	split = bio_clone_fast(bio, gfp, bs);
 	if (!split)
 		return NULL;
diff --git a/block/blk-core.c b/block/blk-core.c
index 7e4a1da0715e..34fe47a728c3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -135,6 +135,7 @@ static const char *const blk_op_name[] = {
 	REQ_OP_NAME(ZONE_OPEN),
 	REQ_OP_NAME(ZONE_CLOSE),
 	REQ_OP_NAME(ZONE_FINISH),
+	REQ_OP_NAME(ZONE_APPEND),
 	REQ_OP_NAME(WRITE_SAME),
 	REQ_OP_NAME(WRITE_ZEROES),
 	REQ_OP_NAME(SCSI_IN),
@@ -240,6 +241,17 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 
 	bio_advance(bio, nbytes);
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND && error == BLK_STS_OK) {
+		/*
+		 * Partial zone append completions cannot be supported as the
+		 * BIO fragments may end up not being written sequentially.
+		 */
+		if (bio->bi_iter.bi_size)
+			bio->bi_status = BLK_STS_IOERR;
+		else
+			bio->bi_iter.bi_sector = rq->__sector;
+	}
+
 	/* don't actually finish bio if it's part of flush sequence */
 	if (bio->bi_iter.bi_size == 0 && !(rq->rq_flags & RQF_FLUSH_SEQ))
 		bio_endio(bio);
@@ -871,6 +883,41 @@ static inline int blk_partition_remap(struct bio *bio)
 	return ret;
 }
 
+/*
+ * Check write append to a zoned block device.
+ */
+static inline blk_status_t blk_check_zone_append(struct request_queue *q,
+						 struct bio *bio)
+{
+	sector_t pos = bio->bi_iter.bi_sector;
+	int nr_sectors = bio_sectors(bio);
+
+	/* Only applicable to zoned block devices */
+	if (!blk_queue_is_zoned(q))
+		return BLK_STS_NOTSUPP;
+
+	/* The bio sector must point to the start of a sequential zone */
+	if (pos & (blk_queue_zone_sectors(q) - 1) ||
+	    !blk_queue_zone_is_seq(q, pos))
+		return BLK_STS_IOERR;
+
+	/*
+	 * Not allowed to cross zone boundaries. Otherwise, the BIO will be
+	 * split and could result in non-contiguous sectors being written in
+	 * different zones.
+	 */
+	if (blk_queue_zone_no(q, pos) != blk_queue_zone_no(q, pos + nr_sectors))
+		return BLK_STS_IOERR;
+
+	/* Make sure the BIO is small enough and will not get split */
+	if (nr_sectors > q->limits.max_zone_append_sectors)
+		return BLK_STS_IOERR;
+
+	bio->bi_opf |= REQ_NOMERGE;
+
+	return BLK_STS_OK;
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -943,6 +990,11 @@ generic_make_request_checks(struct bio *bio)
 		if (!q->limits.max_write_same_sectors)
 			goto not_supported;
 		break;
+	case REQ_OP_ZONE_APPEND:
+		status = blk_check_zone_append(q, bio);
+		if (status != BLK_STS_OK)
+			goto end_io;
+		break;
 	case REQ_OP_ZONE_RESET:
 	case REQ_OP_ZONE_OPEN:
 	case REQ_OP_ZONE_CLOSE:
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8e56884fd2e9..50e216a218ee 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1195,6 +1195,19 @@ static void blk_mq_handle_dev_resource(struct request *rq,
 	__blk_mq_requeue_request(rq);
 }
 
+static void blk_mq_handle_zone_resource(struct request *rq,
+					struct list_head *zone_list)
+{
+	/*
+	 * If we end up here it is because we cannot dispatch a request to a
+	 * specific zone due to LLD level zone-write locking or other zone
+	 * related resource not being available. In this case, set the request
+	 * aside in zone_list for retrying it later.
+	 */
+	list_add(&rq->queuelist, zone_list);
+	__blk_mq_requeue_request(rq);
+}
+
 /*
  * Returns true if we did some work AND can potentially do more.
  */
@@ -1206,6 +1219,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 	bool no_tag = false;
 	int errors, queued;
 	blk_status_t ret = BLK_STS_OK;
+	LIST_HEAD(zone_list);
 
 	if (list_empty(list))
 		return false;
@@ -1264,6 +1278,16 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
 			blk_mq_handle_dev_resource(rq, list);
 			break;
+		} else if (ret == BLK_STS_ZONE_RESOURCE) {
+			/*
+			 * Move the request to zone_list and keep going through
+			 * the dispatch list to find more requests the drive can
+			 * accept.
+			 */
+			blk_mq_handle_zone_resource(rq, &zone_list);
+			if (list_empty(list))
+				break;
+			continue;
 		}
 
 		if (unlikely(ret != BLK_STS_OK)) {
@@ -1275,6 +1299,9 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		queued++;
 	} while (!list_empty(list));
 
+	if (!list_empty(&zone_list))
+		list_splice_tail_init(&zone_list, list);
+
 	hctx->dispatched[queued_to_index(queued)]++;
 
 	/*
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 14397b4c4b53..58d5b49fb131 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -48,6 +48,7 @@ void blk_set_default_limits(struct queue_limits *lim)
 	lim->chunk_sectors = 0;
 	lim->max_write_same_sectors = 0;
 	lim->max_write_zeroes_sectors = 0;
+	lim->max_zone_append_sectors = 0;
 	lim->max_discard_sectors = 0;
 	lim->max_hw_discard_sectors = 0;
 	lim->discard_granularity = 0;
@@ -83,6 +84,7 @@ void blk_set_stacking_limits(struct queue_limits *lim)
 	lim->max_dev_sectors = UINT_MAX;
 	lim->max_write_same_sectors = UINT_MAX;
 	lim->max_write_zeroes_sectors = UINT_MAX;
+	lim->max_zone_append_sectors = UINT_MAX;
 }
 EXPORT_SYMBOL(blk_set_stacking_limits);
 
@@ -221,6 +223,25 @@ void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 }
 EXPORT_SYMBOL(blk_queue_max_write_zeroes_sectors);
 
+/**
+ * blk_queue_max_zone_append_sectors - set max sectors for a single zone append
+ * @q:  the request queue for the device
+ * @max_zone_append_sectors: maximum number of sectors to write per command
+ **/
+void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors)
+{
+	unsigned int max_sectors;
+
+	max_sectors = min(q->limits.max_hw_sectors, max_zone_append_sectors);
+	if (max_sectors)
+		max_sectors = min_not_zero(q->limits.chunk_sectors,
+					   max_sectors);
+
+	q->limits.max_zone_append_sectors = max_sectors;
+}
+EXPORT_SYMBOL_GPL(blk_queue_max_zone_append_sectors);
+
 /**
  * blk_queue_max_segments - set max hw segments for a request for this queue
  * @q:  the request queue for the device
@@ -470,6 +491,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
 					b->max_write_same_sectors);
 	t->max_write_zeroes_sectors = min(t->max_write_zeroes_sectors,
 					b->max_write_zeroes_sectors);
+	t->max_zone_append_sectors = min(t->max_zone_append_sectors,
+					b->max_zone_append_sectors);
 	t->bounce_pfn = min_not_zero(t->bounce_pfn, b->bounce_pfn);
 
 	t->seg_boundary_mask = min_not_zero(t->seg_boundary_mask,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..02643e149d5e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -218,6 +218,13 @@ static ssize_t queue_write_zeroes_max_show(struct request_queue *q, char *page)
 		(unsigned long long)q->limits.max_write_zeroes_sectors << 9);
 }
 
+static ssize_t queue_zone_append_max_show(struct request_queue *q, char *page)
+{
+	unsigned long long max_sectors = q->limits.max_zone_append_sectors;
+
+	return sprintf(page, "%llu\n", max_sectors << SECTOR_SHIFT);
+}
+
 static ssize_t
 queue_max_sectors_store(struct request_queue *q, const char *page, size_t count)
 {
@@ -639,6 +646,11 @@ static struct queue_sysfs_entry queue_write_zeroes_max_entry = {
 	.show = queue_write_zeroes_max_show,
 };
 
+static struct queue_sysfs_entry queue_zone_append_max_entry = {
+	.attr = {.name = "zone_append_max_bytes", .mode = 0444 },
+	.show = queue_zone_append_max_show,
+};
+
 static struct queue_sysfs_entry queue_nonrot_entry = {
 	.attr = {.name = "rotational", .mode = 0644 },
 	.show = queue_show_nonrot,
@@ -749,6 +761,7 @@ static struct attribute *queue_attrs[] = {
 	&queue_discard_zeroes_data_entry.attr,
 	&queue_write_same_max_entry.attr,
 	&queue_write_zeroes_max_entry.attr,
+	&queue_zone_append_max_entry.attr,
 	&queue_nonrot_entry.attr,
 	&queue_zoned_entry.attr,
 	&queue_nr_zones_entry.attr,
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ad97369ffabd..b9e8f55cf8c4 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1690,6 +1690,7 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
 	case BLK_STS_OK:
 		break;
 	case BLK_STS_RESOURCE:
+	case BLK_STS_ZONE_RESOURCE:
 		if (atomic_read(&sdev->device_busy) ||
 		    scsi_device_blocked(sdev))
 			ret = BLK_STS_DEV_RESOURCE;
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 70254ae11769..824ec2d89954 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -63,6 +63,18 @@ typedef u8 __bitwise blk_status_t;
  */
 #define BLK_STS_DEV_RESOURCE	((__force blk_status_t)13)
 
+/*
+ * BLK_STS_ZONE_RESOURCE is returned from the driver to the block layer if zone
+ * related resources are unavailable, but the driver can guarantee the queue
+ * will be rerun in the future once the resources become available again.
+ *
+ * This is different from BLK_STS_DEV_RESOURCE in that it explicitly references
+ * a zone specific resource and IO to a different zone on the same device could
+ * still be served. Examples of that are zones that are write-locked, but a read
+ * to the same zone could be served.
+ */
+#define BLK_STS_ZONE_RESOURCE	((__force blk_status_t)14)
+
 /**
  * blk_path_error - returns true if error may be path related
  * @error: status the request was completed with
@@ -296,6 +308,8 @@ enum req_opf {
 	REQ_OP_ZONE_CLOSE	= 11,
 	/* Transition a zone to full */
 	REQ_OP_ZONE_FINISH	= 12,
+	/* write data at the current zone write pointer */
+	REQ_OP_ZONE_APPEND	= 13,
 
 	/* SCSI passthrough using struct scsi_request */
 	REQ_OP_SCSI_IN		= 32,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index e47888a7d80b..774947365341 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -336,6 +336,7 @@ struct queue_limits {
 	unsigned int		max_hw_discard_sectors;
 	unsigned int		max_write_same_sectors;
 	unsigned int		max_write_zeroes_sectors;
+	unsigned int		max_zone_append_sectors;
 	unsigned int		discard_granularity;
 	unsigned int		discard_alignment;
 
@@ -757,6 +758,9 @@ static inline bool rq_mergeable(struct request *rq)
 	if (req_op(rq) == REQ_OP_WRITE_ZEROES)
 		return false;
 
+	if (req_op(rq) == REQ_OP_ZONE_APPEND)
+		return false;
+
 	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)
 		return false;
 	if (rq->rq_flags & RQF_NOMERGE_FLAGS)
@@ -1091,6 +1095,8 @@ extern void blk_queue_max_write_same_sectors(struct request_queue *q,
 extern void blk_queue_max_write_zeroes_sectors(struct request_queue *q,
 		unsigned int max_write_same_sectors);
 extern void blk_queue_logical_block_size(struct request_queue *, unsigned int);
+extern void blk_queue_max_zone_append_sectors(struct request_queue *q,
+		unsigned int max_zone_append_sectors);
 extern void blk_queue_physical_block_size(struct request_queue *, unsigned int);
 extern void blk_queue_alignment_offset(struct request_queue *q,
 				       unsigned int alignment);
@@ -1303,6 +1309,11 @@ static inline unsigned int queue_max_segment_size(const struct request_queue *q)
 	return q->limits.max_segment_size;
 }
 
+static inline unsigned int queue_max_zone_append_sectors(const struct request_queue *q)
+{
+	return q->limits.max_zone_append_sectors;
+}
+
 static inline unsigned queue_logical_block_size(const struct request_queue *q)
 {
 	int retval = 512;
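For context on how the new BLK_STS_ZONE_RESOURCE status is meant to be used,
here is a hypothetical driver-side sketch (not from this series; the
example_* names are made up, and a real driver would track its own per-zone
state instead of the stub shown):

	#include <linux/blk-mq.h>
	#include <linux/blkdev.h>

	/* Stand-in for a driver's real per-zone write lock bookkeeping. */
	static bool example_zone_is_write_locked(struct request *rq)
	{
		return false;	/* a real driver would consult its zone state */
	}

	static blk_status_t example_queue_rq(struct blk_mq_hw_ctx *hctx,
					     const struct blk_mq_queue_data *bd)
	{
		struct request *rq = bd->rq;

		/*
		 * Only the target zone is unavailable (e.g. write-locked);
		 * I/O to other zones can still be served, so return the
		 * zone-specific status rather than BLK_STS_DEV_RESOURCE.
		 * blk_mq_dispatch_rq_list() above then sets the request aside
		 * on its zone_list and keeps walking the dispatch list.
		 */
		if (example_zone_is_write_locked(rq))
			return BLK_STS_ZONE_RESOURCE;

		blk_mq_start_request(rq);
		/* ... build and issue the device command here ... */
		return BLK_STS_OK;
	}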