Message ID: 20230522183845.354920-3-bvanassche@acm.org (mailing list archive)
State:      New, archived
Series:     Submit zoned writes in order
On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
> Send requeued requests to the I/O scheduler such that the I/O scheduler
> can control the order in which requests are dispatched.
>
> This patch reworks commit aef1897cd36d ("blk-mq: insert rq with DONTPREP
> to hctx dispatch list when requeue").

But you need to explain why it is safe. I think it is because you add
DONTPREP to RQF_NOMERGE_FLAGS, but that really belongs into the commit
log.

> +	list_for_each_entry_safe(rq, next, &requeue_list, queuelist) {
> +		if (!(rq->rq_flags & RQF_DONTPREP)) {
> 			list_del_init(&rq->queuelist);
> 			blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
> 		}
> 	}
>
> +	while (!list_empty(&requeue_list)) {
> +		rq = list_entry(requeue_list.next, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		blk_mq_insert_request(rq, 0);

So now both started and unstarted requests go through
blk_mq_insert_request, and thus potentially the I/O scheduler, but
RQF_DONTPREP requests, which are by definition further along in
processing, are inserted at the tail, while !RQF_DONTPREP ones get
inserted at the head.

This feels wrong, and we probably need to sort out why and how we are
doing head insertion on a requeue first.

> @@ -2064,14 +2060,28 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
> 	bool no_tag = prep == PREP_DISPATCH_NO_TAG &&
> 		((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
> 		blk_mq_is_shared_tags(hctx->flags));
> +	LIST_HEAD(for_sched);
> +	struct request *next;
>
> 	if (nr_budgets)
> 		blk_mq_release_budgets(q, list);
>
> 	spin_lock(&hctx->lock);
> +	list_for_each_entry_safe(rq, next, list, queuelist)
> +		if (rq->rq_flags & RQF_USE_SCHED)
> +			list_move_tail(&rq->queuelist, &for_sched);
> 	list_splice_tail_init(list, &hctx->dispatch);
> 	spin_unlock(&hctx->lock);
>
> +	if (q->elevator) {
> +		if (q->elevator->type->ops.requeue_request)
> +			list_for_each_entry(rq, &for_sched, queuelist)
> +				q->elevator->type->ops.
> +					requeue_request(rq);
> +		q->elevator->type->ops.insert_requests(hctx, &for_sched,
> +				BLK_MQ_INSERT_AT_HEAD);
> +	}
> +

None of this is explained in the commit log, please elaborate on the
rationale for it. Also:

 - this code block really belongs into a helper
 - calling ->requeue_request in a loop before calling ->insert_requests
   feels wrong. A BLK_MQ_INSERT_REQUEUE flag feels like the better
   interface here.
On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
> Send requeued requests to the I/O scheduler such that the I/O scheduler
> can control the order in which requests are dispatched.

I guess you are addressing UFS zoned for REQ_OP_WRITE:

https://lore.kernel.org/linux-block/8e88b22e-fdf2-5182-02fe-9876e8148947@acm.org/

I am wondering how this approach can maintain order for zoned write
requests.

Requeued writes may happen in any order. For example, req A and req B are
submitted in order, but req B is requeued first, followed by req A. So
req B is requeued to the scheduler first and issued to the LLD, then
request A is requeued and issued to the LLD later. Isn't that still a
reordering? Or does sd_zbc provide a requeue order guarantee?

Thanks,
Ming
On 5/23/23 02:03, Ming Lei wrote:
> On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
>> Send requeued requests to the I/O scheduler such that the I/O scheduler
>> can control the order in which requests are dispatched.
>
> I guess you are addressing UFS zoned for REQ_OP_WRITE:
>
> https://lore.kernel.org/linux-block/8e88b22e-fdf2-5182-02fe-9876e8148947@acm.org/
>
> I am wondering how this way can maintain order for zoned write request.
>
> requeued WRITE may happen in any order, for example, req A and req B is
> in order, now req B is requeued first, then follows req A.
>
> So req B is requeued to scheduler first, and issued to LLD, then
> request A is requeued and issued to LLD later, then still re-order?
>
> Or sd_zbc can provide requeue order guarantee?

Hi Ming,

The mq-deadline scheduler restricts the queue depth to one per zone for
zoned storage, so at any time there is at most one write command
(REQ_OP_WRITE) in flight per zone.

Thanks,

Bart.
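The per-zone depth limit that Bart describes can be modeled in a few lines. The following is a minimal userspace sketch of the idea, not the actual mq-deadline implementation; all names (`zone_write_trylock`, `NR_ZONES`) are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of per-zone write locking: at most one write request may be
 * in flight per zone at any time. */
#define NR_ZONES 8

static bool zone_write_locked[NR_ZONES];

/* Try to dispatch a write to @zone; returns false if a write is already
 * in flight for that zone, in which case the request stays queued. */
static bool zone_write_trylock(unsigned int zone)
{
	if (zone_write_locked[zone])
		return false;
	zone_write_locked[zone] = true;
	return true;
}

/* Called on write completion: allows the next write for the zone. */
static void zone_write_unlock(unsigned int zone)
{
	zone_write_locked[zone] = false;
}
```

With this invariant, a second write for a zone is never dispatched while the first is still in flight, which is why Bart argues reordering cannot happen between dispatched writes of the same zone.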
On 5/23/23 00:18, Christoph Hellwig wrote:
> On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
>> Send requeued requests to the I/O scheduler such that the I/O scheduler
>> can control the order in which requests are dispatched.
>>
>> This patch reworks commit aef1897cd36d ("blk-mq: insert rq with DONTPREP
>> to hctx dispatch list when requeue").
>
> But you need to explain why it is safe. I think it is because you add
> DONTPREP to RQF_NOMERGE_FLAGS, but that really belongs into the commit
> log.

Hi Christoph,

I will add the above explanation in the commit message.

>> +	list_for_each_entry_safe(rq, next, &requeue_list, queuelist) {
>> +		if (!(rq->rq_flags & RQF_DONTPREP)) {
>> 			list_del_init(&rq->queuelist);
>> 			blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
>> 		}
>> 	}
>>
>> +	while (!list_empty(&requeue_list)) {
>> +		rq = list_entry(requeue_list.next, struct request, queuelist);
>> +		list_del_init(&rq->queuelist);
>> +		blk_mq_insert_request(rq, 0);
>
> So now both started and unstarted requests go through
> blk_mq_insert_request, and thus potentially the I/O scheduler, but
> RQF_DONTPREP requests that are by definition further along in processing
> are inserted at the tail, while !RQF_DONTPREP ones get inserted at the
> head.
>
> This feels wrong, and we probably need to sort out why and how we are
> doing head insertion on a requeue first.

The use cases for requeuing requests are as follows:
* Resubmitting a partially completed request (ATA, SCSI, loop, ...).
* Handling connection recovery (nbd, ublk, ...).
* Simulating the "device busy" condition (null_blk).
* Handling the queue full condition (virtio-blk).
* Handling path errors by dm-mpath (DM_ENDIO_REQUEUE).
* Handling retryable I/O errors (mmc, NVMe).
* Handling unit attentions (SCSI).
* Unplugging a CPU.

Do you perhaps want me to select BLK_MQ_INSERT_AT_HEAD for all requeued
requests?
>> +	if (q->elevator) {
>> +		if (q->elevator->type->ops.requeue_request)
>> +			list_for_each_entry(rq, &for_sched, queuelist)
>> +				q->elevator->type->ops.
>> +					requeue_request(rq);
>> +		q->elevator->type->ops.insert_requests(hctx, &for_sched,
>> +				BLK_MQ_INSERT_AT_HEAD);
>> +	}
>> +
>
> None of this is explained in the commit log, please elaborate on the
> rationale for it. Also:
>
>  - this code block really belongs into a helper
>  - calling ->requeue_request in a loop before calling ->insert_requests
>    feels wrong. A BLK_MQ_INSERT_REQUEUE flag feels like the better
>    interface here.

Pushing a request back to hctx->dispatch is a requeuing mechanism. I want
to send requeued requests to the I/O scheduler. Regarding the
BLK_MQ_INSERT_REQUEUE suggestion, how about adding the patch below near
the start of this patch series?

Thanks,

Bart.


block: Remove elevator_type.requeue_request()

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 117aec1791c0..319d3c5a0f85 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6232,6 +6232,7 @@ static inline void bfq_update_insert_stats(struct request_queue *q,
 #endif /* CONFIG_BFQ_CGROUP_DEBUG */
 
 static struct bfq_queue *bfq_init_rq(struct request *rq);
+static void bfq_finish_requeue_request(struct request *rq);
 
 static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 			       blk_insert_t flags)
@@ -6243,6 +6244,11 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
 	blk_opf_t cmd_flags;
 	LIST_HEAD(free);
 
+	if (rq->rq_flags & RQF_REQUEUED) {
+		rq->rq_flags &= ~RQF_REQUEUED;
+		bfq_finish_requeue_request(rq);
+	}
+
 #ifdef CONFIG_BFQ_GROUP_IOSCHED
 	if (!cgroup_subsys_on_dfl(io_cgrp_subsys) && rq->bio)
 		bfqg_stats_update_legacy_io(q, rq);
@@ -7596,7 +7602,6 @@ static struct elevator_type iosched_bfq_mq = {
 	.ops = {
 		.limit_depth		= bfq_limit_depth,
 		.prepare_request	= bfq_prepare_request,
-		.requeue_request	= bfq_finish_requeue_request,
 		.finish_request		= bfq_finish_request,
 		.exit_icq		= bfq_exit_icq,
 		.insert_requests	= bfq_insert_requests,
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1326526bb733..8d4b835539b1 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -58,13 +58,8 @@ static inline void blk_mq_sched_completed_request(struct request *rq, u64 now)
 
 static inline void blk_mq_sched_requeue_request(struct request *rq)
 {
-	if (rq->rq_flags & RQF_USE_SCHED) {
-		struct request_queue *q = rq->q;
-		struct elevator_queue *e = q->elevator;
-
-		if (e->type->ops.requeue_request)
-			e->type->ops.requeue_request(rq);
-	}
+	if (rq->rq_flags & RQF_USE_SCHED)
+		rq->rq_flags |= RQF_REQUEUED;
 }
 
 static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
diff --git a/block/elevator.h b/block/elevator.h
index 7ca3d7b6ed82..c1f459eebc9f 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -43,7 +43,6 @@ struct elevator_mq_ops {
 	struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
 	bool (*has_work)(struct blk_mq_hw_ctx *);
 	void (*completed_request)(struct request *, u64);
-	void (*requeue_request)(struct request *);
 	struct request *(*former_request)(struct request_queue *, struct request *);
 	struct request *(*next_request)(struct request_queue *, struct request *);
 	void (*init_icq)(struct io_cq *);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index f495be433276..b1f2e172fe61 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -608,6 +608,10 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,
 
 		spin_lock(&kcq->lock);
 		trace_block_rq_insert(rq);
+		if (rq->rq_flags & RQF_REQUEUED) {
+			rq->rq_flags &= ~RQF_REQUEUED;
+			kyber_finish_request(rq);
+		}
 		if (flags & BLK_MQ_INSERT_AT_HEAD)
 			list_move(&rq->queuelist, head);
 		else
@@ -1022,7 +1026,6 @@ static struct elevator_type kyber_sched = {
 		.prepare_request = kyber_prepare_request,
 		.insert_requests = kyber_insert_requests,
 		.finish_request = kyber_finish_request,
-		.requeue_request = kyber_finish_request,
 		.completed_request = kyber_completed_request,
 		.dispatch_request = kyber_dispatch_request,
 		.has_work = kyber_has_work,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d778cb6b2112..861ca3bb0bce 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -59,6 +59,9 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
 /* ->timeout has been called, don't expire again */
 #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
+/* The request is about to be requeued. */
+#define RQF_REQUEUED		((__force req_flags_t)(1 << 22))
+/* A reserved tag has been assigned to the request. */
 #define RQF_RESV		((__force req_flags_t)(1 << 23))
 
 /* flags that prevent us from merging requests: */
On Tue, May 23, 2023 at 10:19:34AM -0700, Bart Van Assche wrote:
> On 5/23/23 02:03, Ming Lei wrote:
> > On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
> > > Send requeued requests to the I/O scheduler such that the I/O scheduler
> > > can control the order in which requests are dispatched.
> >
> > I guess you are addressing UFS zoned for REQ_OP_WRITE:
> >
> > https://lore.kernel.org/linux-block/8e88b22e-fdf2-5182-02fe-9876e8148947@acm.org/
> >
> > I am wondering how this way can maintain order for zoned write request.
> >
> > requeued WRITE may happen in any order, for example, req A and req B is
> > in order, now req B is requeued first, then follows req A.
> >
> > So req B is requeued to scheduler first, and issued to LLD, then
> > request A is requeued and issued to LLD later, then still re-order?
> >
> > Or sd_zbc can provide requeue order guarantee?
>
> Hi Ming,
>
> The mq-deadline scheduler restricts the queue depth to one per zone for
> zoned storage so at any time there is at most one write command
> (REQ_OP_WRITE) in flight per zone.

But if the write queue depth is 1 per zone, the requeued request won't
be re-ordered at all, given that no other write request can be issued
from the scheduler for this zone before the requeued request completes.

So why bother to requeue the BLK_STS_RESOURCE request via the scheduler?

Thanks,
Ming
On Tue, May 23, 2023 at 03:30:16PM -0700, Bart Van Assche wrote:
>  static inline void blk_mq_sched_requeue_request(struct request *rq)
>  {
> -	if (rq->rq_flags & RQF_USE_SCHED) {
> -		struct request_queue *q = rq->q;
> -		struct elevator_queue *e = q->elevator;
> -
> -		if (e->type->ops.requeue_request)
> -			e->type->ops.requeue_request(rq);
> -	}
> +	if (rq->rq_flags & RQF_USE_SCHED)
> +		rq->rq_flags |= RQF_REQUEUED;
>  }

I'd drop this helper function if we go down this way. But maybe we might
just want to keep the method. Sorry for the noise.
On 5/23/23 17:31, Ming Lei wrote:
> On Tue, May 23, 2023 at 10:19:34AM -0700, Bart Van Assche wrote:
>> The mq-deadline scheduler restricts the queue depth to one per zone for
>> zoned storage so at any time there is at most one write command
>> (REQ_OP_WRITE) in flight per zone.
>
> But if the write queue depth is 1 per zone, the requeued request won't
> be re-ordered at all given no other write request can be issued from
> scheduler in this zone before this requeued request is completed.
>
> So why bother to requeue the BLK_STS_RESOURCE request via scheduler?

Hi Ming,

It seems like my previous email was not clear enough. The mq-deadline
scheduler restricts the queue depth per zone for commands passed to the
SCSI core. It does not restrict how many requests a filesystem can
submit per zone to the block layer. Without this patch there is a risk
of reordering if a request is requeued, e.g. by the SCSI core, while
other requests are pending for the same zone.

Thanks,

Bart.
On 5/23/23 23:13, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:30:16PM -0700, Bart Van Assche wrote:
>>  static inline void blk_mq_sched_requeue_request(struct request *rq)
>>  {
>> -	if (rq->rq_flags & RQF_USE_SCHED) {
>> -		struct request_queue *q = rq->q;
>> -		struct elevator_queue *e = q->elevator;
>> -
>> -		if (e->type->ops.requeue_request)
>> -			e->type->ops.requeue_request(rq);
>> -	}
>> +	if (rq->rq_flags & RQF_USE_SCHED)
>> +		rq->rq_flags |= RQF_REQUEUED;
>>  }
>
> I'd drop this helper function if we go down this way. But maybe
> we might just want to keep the method.

My understanding is that every .requeue_request() call is followed by an
.insert_requests() call, and hence that we no longer need the
.requeue_request() method if the RQF_REQUEUED flag is introduced?

Thanks,

Bart.
On 5/25/23 02:56, Bart Van Assche wrote:
> On 5/23/23 17:31, Ming Lei wrote:
>> On Tue, May 23, 2023 at 10:19:34AM -0700, Bart Van Assche wrote:
>>> The mq-deadline scheduler restricts the queue depth to one per zone for
>>> zoned storage so at any time there is at most one write command
>>> (REQ_OP_WRITE) in flight per zone.
>>
>> But if the write queue depth is 1 per zone, the requeued request won't
>> be re-ordered at all given no other write request can be issued from
>> scheduler in this zone before this requeued request is completed.
>>
>> So why bother to requeue the BLK_STS_RESOURCE request via scheduler?
>
> Hi Ming,
>
> It seems like my previous email was not clear enough. The mq-deadline
> scheduler restricts the queue depth per zone for commands passed to the
> SCSI core. It does not restrict how many requests a filesystem can
> submit per zone to the block layer. Without this patch there is a risk
> of reordering if a request is requeued, e.g. by the SCSI core, and other
> requests are pending for the same zone.

Yes there is, but the contract we established for zoned devices in the
block layer, from the start of the support, is that users *must* write
sequentially. The block layer does not attempt, generally speaking, to
reorder requests.

When mq-deadline is used, the scheduler's LBA reordering *may* reorder
writes, thus hiding potential bugs in the user write sequence for a zone.
That is fine. However, once a write request is dispatched, we should keep
the assumption that it is well formed, namely directed at the zone write
pointer. So any consideration of requeues solving write ordering issues
is moot to me.

Furthermore, when the requeue happens, the target zone is still locked
and the only write request that can be in flight for that target zone is
the one being requeued. Add to that the above assumption that the request
is the one we must dispatch first, and there are absolutely zero chances
of seeing a reordering happen for writes to a particular zone. I really
do not see the point of requeuing that request through the I/O scheduler
at all.

In general, even for reads, requeuing through the scheduler is I think a
really bad idea, as it can potentially significantly increase the request
latency (time to completion), with the user seeing long tail latencies.
E.g. if the request has a high priority or a short CDL time limit,
requeuing through the scheduler will go against the user-indicated
urgency for that request and degrade the effectiveness of latency control
measures such as I/O priority and CDL. Requeues should go to the head of
the dispatch queue, not through the scheduler.

As long as we keep zone write locking for zoned devices, requeuing to the
head of the dispatch queue is fine. But maybe this work is preparatory to
removing zone write locking? If that is the case, I would like to see
that as well to get the big picture. Otherwise, the latency concerns I
raised above are, in my opinion, a blocker for this change.

> Thanks,
>
> Bart.
On Thu, May 25, 2023 at 08:06:31AM +0900, Damien Le Moal wrote:
> On 5/25/23 02:56, Bart Van Assche wrote:
> > It seems like my previous email was not clear enough. The mq-deadline
> > scheduler restricts the queue depth per zone for commands passed to the
> > SCSI core. It does not restrict how many requests a filesystem can
> > submit per zone to the block layer. Without this patch there is a risk
> > of reordering if a request is requeued, e.g. by the SCSI core, and other
> > requests are pending for the same zone.
>
> Yes there is, but the contract we established for zoned devices in the block
> layer, from the start of the support, is that users *must* write sequentially.
> The block layer does not attempt, generally speaking, to reorder requests.
>
> When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
> thus hiding potential bugs in the user write sequence for a zone. That is fine.
> However, once a write request is dispatched, we should keep the assumption that
> it is a well formed one, namely directed at the zone write pointer. So any
> consideration of requeue solving write ordering issues is moot to me.
>
> Furthermore, when the requeue happens, the target zone is still locked and the
> only write request that can be in flight for that target zones is that one being
> requeued. Add to that the above assumption that the request is the one we must
> dispatch first, there are absolutely zero chances of seeing a reordering happen
> for writes to a particular zone. I really do not see the point of requeuing that
> request through the IO scheduler at all.

Totally agree, that is exactly what I meant.

> In general, even for reads, requeuing through the scheduler is I think a really
> bad idea as that can potentially significantly increase the request latency
> (time to completion), with the user seeing long tail latencies. E.g. if the
> request has high priority or a short CDL time limit, requeuing through the
> scheduler will go against the user indicated urgency for that request and
> degrade the effectiveness of latency control measures such as IO priority and CDL.
>
> Requeues should be at the head of the dispatch queue, not through the scheduler.

It is pretty easy to run into BLK_STS_RESOURCE for some SCSI devices, and
requeuing via the scheduler does add more overhead for these devices.

> As long as we keep zone write locking for zoned devices, requeue to the head of
> the dispatch queue is fine. But maybe this work is preparatory to removing zone
> write locking ? If that is the case, I would like to see that as well to get the
> big picture. Otherwise, the latency concerns I raised above are in my opinion, a
> blocker for this change.

Yeah, it looks like Bart needs to add more words about the requirement
and motivation of this patchset.

Thanks,
Ming
On Wed, May 24, 2023 at 11:22:00AM -0700, Bart Van Assche wrote:
>>> static inline void blk_mq_sched_requeue_request(struct request *rq)
>>> {
>>> -	if (rq->rq_flags & RQF_USE_SCHED) {
>>> -		struct request_queue *q = rq->q;
>>> -		struct elevator_queue *e = q->elevator;
>>> -
>>> -		if (e->type->ops.requeue_request)
>>> -			e->type->ops.requeue_request(rq);
>>> -	}
>>> +	if (rq->rq_flags & RQF_USE_SCHED)
>>> +		rq->rq_flags |= RQF_REQUEUED;
>>> }
>>
>> I'd drop this helper function if we go down this way. But maybe
>> we might just want to keep the method.
>
> My understanding is that every .requeue_request() call is followed by a
> .insert_requests() call and hence that we don't need the .requeue_request()
> method anymore if the RQF_REQUEUED flag would be introduced?

Yes, but at the same time RQF_REQUEUED now creates global state instead
of state that lives just in a call chain. That's why I originally
suggested a flag to ->insert_requests instead of leaving state on every
request.
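Christoph's preference for a BLK_MQ_INSERT_REQUEUE insert flag over a per-request RQF_REQUEUED flag boils down to keeping the "this is a requeue" fact inside the call chain rather than as state stored on the request that every path must remember to clear. A hypothetical userspace contrast, with entirely made-up names, might look like:

```c
#include <assert.h>

/* Illustrative only: the requeue intent is stated at the call site as a
 * flag, so no per-request state lingers afterwards. */
enum insert_flags {
	INSERT_AT_HEAD = 1 << 0,
	INSERT_REQUEUE = 1 << 1,	/* analogous to BLK_MQ_INSERT_REQUEUE */
};

struct mock_request {
	int sched_finished;	/* times scheduler accounting was undone */
};

static void sched_finish(struct mock_request *rq)
{
	rq->sched_finished++;
}

/* Call-chain style: the caller passes the requeue intent down; the
 * insert path undoes scheduler accounting exactly once and nothing is
 * left behind on the request. */
static void insert_request(struct mock_request *rq, unsigned int flags)
{
	if (flags & INSERT_REQUEUE)
		sched_finish(rq);
	/* ... insert at head or tail according to the remaining flags ... */
}
```

Compared with a stored RQF_REQUEUED bit, this design cannot leak: there is no flag to forget to clear on a path that never reaches the insert code.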
On 5/24/23 16:06, Damien Le Moal wrote:
> When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
> thus hiding potential bugs in the user write sequence for a zone. That is fine.
> However, once a write request is dispatched, we should keep the assumption that
> it is a well formed one, namely directed at the zone write pointer. So any
> consideration of requeue solving write ordering issues is moot to me.
>
> Furthermore, when the requeue happens, the target zone is still locked and the
> only write request that can be in flight for that target zones is that one being
> requeued. Add to that the above assumption that the request is the one we must
> dispatch first, there are absolutely zero chances of seeing a reordering happen
> for writes to a particular zone. I really do not see the point of requeuing that
> request through the IO scheduler at all.

I will drop the changes from this patch that send requeued dispatched
requests to the I/O scheduler instead of back to the dispatch list. It
took me considerable effort to find all the potential causes of
reordering. As you may have noticed, I also came up with some changes
that are not essential. I should have realized myself that this change
is not essential.

> As long as we keep zone write locking for zoned devices, requeue to the head of
> the dispatch queue is fine. But maybe this work is preparatory to removing zone
> write locking ? If that is the case, I would like to see that as well to get the
> big picture. Otherwise, the latency concerns I raised above are in my opinion, a
> blocker for this change.

Regarding removing zone write locking, would it be acceptable to
implement a solution for SCSI devices before it is clear how to
implement a solution for NVMe devices? I think a potential solution for
SCSI devices is to send requests that should be requeued to the SCSI
error handler instead of to the block layer requeue list. The SCSI error
handler waits until all pending requests have timed out or have been
sent to the error handler. The SCSI error handler can be modified such
that requests are sorted in LBA order before being resubmitted. This
would solve the nasty issues that would otherwise arise when requeuing
requests while multiple write requests for the same zone are pending.

Thanks,

Bart.
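The LBA-sorting step Bart proposes for the error handler could be sketched as follows. This is an illustrative userspace model only, with `qsort()` standing in for the kernel's `list_sort()`; none of the names come from actual SCSI code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy stand-in for a requeued request: only the starting LBA matters
 * for ordering. */
struct req {
	uint64_t lba;
};

/* Comparator returning <0, 0, >0 for ascending-LBA order. */
static int cmp_lba(const void *a, const void *b)
{
	uint64_t la = ((const struct req *)a)->lba;
	uint64_t lb = ((const struct req *)b)->lba;

	return (la > lb) - (la < lb);
}

/* Sort requeued requests by LBA before resubmission, so writes to the
 * same zone go back out in write-pointer order. */
static void sort_requeued(struct req *reqs, size_t n)
{
	qsort(reqs, n, sizeof(*reqs), cmp_lba);
}
```

The design point is that the error handler already waits for all in-flight commands to drain, so at sort time the full set of requests to resubmit is known and a single sort restores the per-zone ordering.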
On 6/21/23 09:34, Bart Van Assche wrote:
> On 5/24/23 16:06, Damien Le Moal wrote:
>> When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
>> thus hiding potential bugs in the user write sequence for a zone. That is fine.
>> However, once a write request is dispatched, we should keep the assumption that
>> it is a well formed one, namely directed at the zone write pointer. So any
>> consideration of requeue solving write ordering issues is moot to me.
>>
>> Furthermore, when the requeue happens, the target zone is still locked and the
>> only write request that can be in flight for that target zones is that one being
>> requeued. Add to that the above assumption that the request is the one we must
>> dispatch first, there are absolutely zero chances of seeing a reordering happen
>> for writes to a particular zone. I really do not see the point of requeuing that
>> request through the IO scheduler at all.
>
> I will drop the changes from this patch that send requeued dispatched
> requests to the I/O scheduler instead of back to the dispatch list.
>
>> As long as we keep zone write locking for zoned devices, requeue to the head of
>> the dispatch queue is fine. But maybe this work is preparatory to removing zone
>> write locking ? If that is the case, I would like to see that as well to get the
>> big picture. Otherwise, the latency concerns I raised above are in my opinion, a
>> blocker for this change.
>
> Regarding removing zone write locking, would it be acceptable to
> implement a solution for SCSI devices before it is clear how to
> implement a solution for NVMe devices? I think a potential solution for
> SCSI devices is to send requests that should be requeued to the SCSI
> error handler instead of to the block layer requeue list. The SCSI error
> handler waits until all pending requests have timed out or have been
> sent to the error handler. The SCSI error handler can be modified such
> that requests are sorted in LBA order before being resubmitted. This
> would solve the nasty issues that would otherwise arise when requeuing
> requests if multiple write requests for the same zone are pending.

I am still thinking that a dedicated hctx for writes to sequential zones
may be the simplest solution for all device types:

1) For SCSI HBAs, we can likely gain high-QD zone writes, but that needs
   to be checked. For AHCI though, we need to keep the max write QD of 1
   per zone because of the chipsets reordering command submissions. So
   we will need a queue flag saying "needs zone write locking", indicated
   by the adapter when creating the queue.

2) For NVMe, this would allow high-QD writes, with only the penalty of
   heavier locking overhead when writes are issued from multiple CPUs.

But I have not started looking at all the details. I need to start
prototyping something. We can try working on this together if you want.
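Damien's dedicated-hctx idea amounts to a queue-mapping decision: route every write to a sequential zone through one submission queue so those writes cannot be reordered across queues, while everything else keeps the usual per-CPU spread. A toy model, with entirely hypothetical names and queue counts, could look like:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative-only model of routing sequential-zone writes to one
 * dedicated hardware queue. Queue count and index are arbitrary. */
#define NR_HW_QUEUES   4
#define SEQ_WRITE_HCTX 0	/* hypothetical dedicated queue index */

static unsigned int pick_hctx(bool is_seq_zone_write, unsigned int cpu)
{
	/* All writes to sequential zones share a single queue so the
	 * device sees them in submission order. */
	if (is_seq_zone_write)
		return SEQ_WRITE_HCTX;

	/* Reads and conventional writes keep a CPU-based mapping over
	 * the remaining queues. */
	return 1 + cpu % (NR_HW_QUEUES - 1);
}
```

Bart's follow-up concern applies exactly here: this mapping only helps if the requests reach the dedicated queue in order in the first place, i.e. if software-queue flushing cannot reorder them before `pick_hctx()`-style routing takes effect.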
On 6/22/23 16:45, Damien Le Moal wrote:
> On 6/21/23 09:34, Bart Van Assche wrote:
>> Regarding removing zone write locking, would it be acceptable to
>> implement a solution for SCSI devices before it is clear how to
>> implement a solution for NVMe devices? I think a potential solution for
>> SCSI devices is to send requests that should be requeued to the SCSI
>> error handler instead of to the block layer requeue list. The SCSI error
>> handler waits until all pending requests have timed out or have been
>> sent to the error handler. The SCSI error handler can be modified such
>> that requests are sorted in LBA order before being resubmitted. This
>> would solve the nasty issues that would otherwise arise when requeuing
>> requests if multiple write requests for the same zone are pending.
>
> I am still thinking that a dedicated hctx for writes to sequential zones may be
> the simplest solution for all device types:
> 1) For scsi HBAs, we can likely gain high qd zone writes, but that needs to be
> checked. For AHCI though, we need to keep the max write qd=1 per zone because of
> the chipsets reordering command submissions. So we'll need a queue flag saying
> "need zone write locking" indicated by the adapter when creating the queue.
> 2) For NVMe, this would allow high QD writes, with only the penalty of heavier
> locking overhead when writes are issued from multiple CPUs.
>
> But I have not started looking at all the details. Need to start prototyping
> something. We can try working on this together if you want.

Hi Damien,

I'm interested in collaborating on this. But I'm not sure whether a
dedicated hardware queue for sequential writes is a full solution.
Applications must submit zoned writes (other than zone append writes) in
order. These zoned writes may end up in a software queue. It is possible
that the software queues are flushed in such a way that the zoned writes
are reordered. Or do you perhaps want to send all zoned writes directly
to a hardware queue? If so, is this really a better solution than a
single-queue I/O scheduler? Is the difference perhaps that higher read
IOPS can be achieved because multiple hardware queues are used for
reads?

Even if all sequential writes were sent to a single hardware queue, to
support queue depths > 1 we would still need a mechanism for
resubmitting requests in order after a request has been requeued. If,
e.g., three zoned writes are in flight and a unit attention is reported
for the second write, then resubmitting the two writes that have to be
resubmitted must only happen after both writes have completed.

Another possibility is to introduce a new request queue flag that
specifies that only writes should be sent to the I/O scheduler. I'm
interested in this because of the following observation for zoned UFS
devices with a block size of 4 KiB and a random read workload:
* mq-deadline scheduler: 59 K IOPS.
* no I/O scheduler: 100 K IOPS.

In other words, 70% more IOPS without an I/O scheduler compared to
mq-deadline. I don't think this indicates a performance bug in the
mq-deadline scheduler: from a quick measurement with the null_blk driver
it seems to me that all I/O schedulers saturate around 150 K - 170 K
IOPS.

Thanks,

Bart.
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e79cc34ad962..632aee9af60f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1438,30 +1438,26 @@ static void blk_mq_requeue_work(struct work_struct *work)
 		container_of(work, struct request_queue, requeue_work.work);
 	LIST_HEAD(requeue_list);
 	LIST_HEAD(flush_list);
-	struct request *rq;
+	struct request *rq, *next;
 
 	spin_lock_irq(&q->requeue_lock);
 	list_splice_init(&q->requeue_list, &requeue_list);
 	list_splice_init(&q->flush_list, &flush_list);
 	spin_unlock_irq(&q->requeue_lock);
 
-	while (!list_empty(&requeue_list)) {
-		rq = list_entry(requeue_list.next, struct request, queuelist);
-		/*
-		 * If RQF_DONTPREP ist set, the request has been started by the
-		 * driver already and might have driver-specific data allocated
-		 * already.  Insert it into the hctx dispatch list to avoid
-		 * block layer merges for the request.
-		 */
-		if (rq->rq_flags & RQF_DONTPREP) {
-			list_del_init(&rq->queuelist);
-			blk_mq_request_bypass_insert(rq, 0);
-		} else {
+	list_for_each_entry_safe(rq, next, &requeue_list, queuelist) {
+		if (!(rq->rq_flags & RQF_DONTPREP)) {
 			list_del_init(&rq->queuelist);
 			blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
 		}
 	}
 
+	while (!list_empty(&requeue_list)) {
+		rq = list_entry(requeue_list.next, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_insert_request(rq, 0);
+	}
+
 	while (!list_empty(&flush_list)) {
 		rq = list_entry(flush_list.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
@@ -2064,14 +2060,28 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 	bool no_tag = prep == PREP_DISPATCH_NO_TAG &&
 		((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
 		blk_mq_is_shared_tags(hctx->flags));
+	LIST_HEAD(for_sched);
+	struct request *next;
 
 	if (nr_budgets)
 		blk_mq_release_budgets(q, list);
 
 	spin_lock(&hctx->lock);
+	list_for_each_entry_safe(rq, next, list, queuelist)
+		if (rq->rq_flags & RQF_USE_SCHED)
+			list_move_tail(&rq->queuelist, &for_sched);
 	list_splice_tail_init(list, &hctx->dispatch);
 	spin_unlock(&hctx->lock);
 
+	if (q->elevator) {
+		if (q->elevator->type->ops.requeue_request)
+			list_for_each_entry(rq, &for_sched, queuelist)
+				q->elevator->type->ops.
+					requeue_request(rq);
+		q->elevator->type->ops.insert_requests(hctx, &for_sched,
+				BLK_MQ_INSERT_AT_HEAD);
+	}
+
 	/*
 	 * Order adding requests to hctx->dispatch and checking
 	 * SCHED_RESTART flag. The pair of this smp_mb() is the one
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d778cb6b2112..363894aea0e8 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -62,8 +62,8 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_RESV		((__force req_flags_t)(1 << 23))
 
 /* flags that prevent us from merging requests: */
-#define RQF_NOMERGE_FLAGS \
-	(RQF_STARTED | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
+#define RQF_NOMERGE_FLAGS \
+	(RQF_STARTED | RQF_FLUSH_SEQ | RQF_DONTPREP | RQF_SPECIAL_PAYLOAD)
 
 enum mq_rq_state {
 	MQ_RQ_IDLE = 0,
Send requeued requests to the I/O scheduler such that the I/O scheduler
can control the order in which requests are dispatched.

This patch reworks commit aef1897cd36d ("blk-mq: insert rq with DONTPREP
to hctx dispatch list when requeue").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c         | 36 +++++++++++++++++++++++-------------
 include/linux/blk-mq.h |  4 ++--
 2 files changed, 25 insertions(+), 15 deletions(-)