
[v3,2/7] block: Send requeued requests to the I/O scheduler

Message ID 20230522183845.354920-3-bvanassche@acm.org (mailing list archive)
State New, archived
Series: Submit zoned writes in order

Commit Message

Bart Van Assche May 22, 2023, 6:38 p.m. UTC
Send requeued requests to the I/O scheduler such that the I/O scheduler
can control the order in which requests are dispatched.

This patch reworks commit aef1897cd36d ("blk-mq: insert rq with DONTPREP
to hctx dispatch list when requeue").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c         | 36 +++++++++++++++++++++++-------------
 include/linux/blk-mq.h |  4 ++--
 2 files changed, 25 insertions(+), 15 deletions(-)

Comments

Christoph Hellwig May 23, 2023, 7:18 a.m. UTC | #1
On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
> Send requeued requests to the I/O scheduler such that the I/O scheduler
> can control the order in which requests are dispatched.
> 
> This patch reworks commit aef1897cd36d ("blk-mq: insert rq with DONTPREP
> to hctx dispatch list when requeue").

But you need to explain why it is safe.  I think it is because you add
DONTPREP to RQF_NOMERGE_FLAGS, but that really belongs into the commit
log.

> +	list_for_each_entry_safe(rq, next, &requeue_list, queuelist) {
> +		if (!(rq->rq_flags & RQF_DONTPREP)) {
>  			list_del_init(&rq->queuelist);
>  			blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
>  		}
>  	}
>  
> +	while (!list_empty(&requeue_list)) {
> +		rq = list_entry(requeue_list.next, struct request, queuelist);
> +		list_del_init(&rq->queuelist);
> +		blk_mq_insert_request(rq, 0);

So now both started and unstarted requests go through
blk_mq_insert_request, and thus potentially the I/O scheduler, but
RQF_DONTPREP requests that are by definition further along in processing
are inserted at the tail, while !RQF_DONTPREP ones get inserted at the
head.

This feels wrong, and we probably need to sort out why and how we are
doing head insertion on a requeue first.

> @@ -2064,14 +2060,28 @@ bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
>  		bool no_tag = prep == PREP_DISPATCH_NO_TAG &&
>  			((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
>  			blk_mq_is_shared_tags(hctx->flags));
> +		LIST_HEAD(for_sched);
> +		struct request *next;
>  
>  		if (nr_budgets)
>  			blk_mq_release_budgets(q, list);
>  
>  		spin_lock(&hctx->lock);
> +		list_for_each_entry_safe(rq, next, list, queuelist)
> +			if (rq->rq_flags & RQF_USE_SCHED)
> +				list_move_tail(&rq->queuelist, &for_sched);
>  		list_splice_tail_init(list, &hctx->dispatch);
>  		spin_unlock(&hctx->lock);
>  
> +		if (q->elevator) {
> +			if (q->elevator->type->ops.requeue_request)
> +				list_for_each_entry(rq, &for_sched, queuelist)
> +					q->elevator->type->ops.
> +						requeue_request(rq);
> +			q->elevator->type->ops.insert_requests(hctx, &for_sched,
> +							BLK_MQ_INSERT_AT_HEAD);
> +		}
> +

None of this is explained in the commit log, please elaborate on the
rationale for it.  Also:

 - this code block really belongs in a helper
 - calling ->requeue_request in a loop before calling ->insert_requests
   feels wrong.  A BLK_MQ_INSERT_REQUEUE flag feels like the better
   interface here.
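
A rough sketch of what such an interface could look like (the flag value
and the call sites below are assumptions for illustration, not existing
code):

	/* block/blk-mq.h: hypothetical new insert flag */
	#define BLK_MQ_INSERT_REQUEUE	((__force blk_insert_t)0x02)

	/* in blk_mq_dispatch_rq_list(): hand the batch to the scheduler in
	 * one call instead of calling ->requeue_request() per request */
	q->elevator->type->ops.insert_requests(hctx, &for_sched,
			BLK_MQ_INSERT_AT_HEAD | BLK_MQ_INSERT_REQUEUE);

	/* a scheduler's ->insert_requests() then does its own requeue
	 * bookkeeping when it sees the flag, e.g. for kyber: */
	if (flags & BLK_MQ_INSERT_REQUEUE)
		kyber_finish_request(rq);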
Ming Lei May 23, 2023, 9:03 a.m. UTC | #2
On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
> Send requeued requests to the I/O scheduler such that the I/O scheduler
> can control the order in which requests are dispatched.

I guess you are addressing UFS zoned for REQ_OP_WRITE:

https://lore.kernel.org/linux-block/8e88b22e-fdf2-5182-02fe-9876e8148947@acm.org/

I am wondering how this approach can maintain ordering for zoned write
requests.

Requeued WRITEs may happen in any order. For example, req A and req B are
submitted in order, but req B is requeued first and req A follows.

So req B is requeued to the scheduler first and issued to the LLD, then
req A is requeued and issued to the LLD later. Doesn't that still
reorder the writes?

Or can sd_zbc provide a requeue ordering guarantee?

Thanks,
Ming
Bart Van Assche May 23, 2023, 5:19 p.m. UTC | #3
On 5/23/23 02:03, Ming Lei wrote:
> On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
>> Send requeued requests to the I/O scheduler such that the I/O scheduler
>> can control the order in which requests are dispatched.
> 
> I guess you are addressing UFS zoned for REQ_OP_WRITE:
> 
> https://lore.kernel.org/linux-block/8e88b22e-fdf2-5182-02fe-9876e8148947@acm.org/
> 
> I am wondering how this approach can maintain ordering for zoned write
> requests.
> 
> Requeued WRITEs may happen in any order. For example, req A and req B are
> submitted in order, but req B is requeued first and req A follows.
> 
> So req B is requeued to the scheduler first and issued to the LLD, then
> req A is requeued and issued to the LLD later. Doesn't that still
> reorder the writes?
> 
> Or can sd_zbc provide a requeue ordering guarantee?

Hi Ming,

The mq-deadline scheduler restricts the queue depth to one per zone for zoned
storage so at any time there is at most one write command (REQ_OP_WRITE) in
flight per zone.
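
For reference, a simplified sketch of how mq-deadline enforces that limit
(not a verbatim excerpt; it relies on the zone write locking helpers from
include/linux/blk-mq.h):

	/* when picking the next write: skip zones that already have a
	 * write in flight */
	if (!blk_req_can_dispatch_to_zone(rq))
		return NULL;

	/* when dispatching a write: lock its target zone; the zone is
	 * only unlocked again from dd_finish_request() */
	blk_req_zone_write_lock(rq);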

Thanks,

Bart.
Bart Van Assche May 23, 2023, 10:30 p.m. UTC | #4
On 5/23/23 00:18, Christoph Hellwig wrote:
> On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
>> Send requeued requests to the I/O scheduler such that the I/O scheduler
>> can control the order in which requests are dispatched.
>>
>> This patch reworks commit aef1897cd36d ("blk-mq: insert rq with DONTPREP
>> to hctx dispatch list when requeue").
> 
> But you need to explain why it is safe.  I think it is because you add
> DONTPREP to RQF_NOMERGE_FLAGS, but that really belongs into the commit
> log.

Hi Christoph,

I will add the above explanation in the commit message.

> 
>> +	list_for_each_entry_safe(rq, next, &requeue_list, queuelist) {
>> +		if (!(rq->rq_flags & RQF_DONTPREP)) {
>>   			list_del_init(&rq->queuelist);
>>   			blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
>>   		}
>>   	}
>>   
>> +	while (!list_empty(&requeue_list)) {
>> +		rq = list_entry(requeue_list.next, struct request, queuelist);
>> +		list_del_init(&rq->queuelist);
>> +		blk_mq_insert_request(rq, 0);
> 
> So now both started and unstarted requests go through
> blk_mq_insert_request, and thus potentially the I/O scheduler, but
> RQF_DONTPREP requests that are by definition further along in processing
> are inserted at the tail, while !RQF_DONTPREP ones get inserted at the
> head.
> 
> This feels wrong, and we probably need to sort out why and how we are
> doing head insertion on a requeue first.

The use cases for requeuing requests are as follows:
* Resubmitting a partially completed request (ATA, SCSI, loop, ...).
* Handling connection recovery (nbd, ublk, ...).
* Simulating the "device busy" condition (null_blk).
* Handling the queue full condition (virtio-blk).
* Handling path errors by dm-mpath (DM_ENDIO_REQUEUE).
* Handling retryable I/O errors (mmc, NVMe).
* Handling unit attentions (SCSI).
* Unplugging a CPU.

Do you perhaps want me to select BLK_MQ_INSERT_AT_HEAD for all requeued requests?

>> +		if (q->elevator) {
>> +			if (q->elevator->type->ops.requeue_request)
>> +				list_for_each_entry(rq, &for_sched, queuelist)
>> +					q->elevator->type->ops.
>> +						requeue_request(rq);
>> +			q->elevator->type->ops.insert_requests(hctx, &for_sched,
>> +							BLK_MQ_INSERT_AT_HEAD);
>> +		}
>> +
> 
> None of this is explained in the commit log, please elaborate on the
> rationale for it.  Also:
> 
>   - this code block really belongs in a helper
>   - calling ->requeue_request in a loop before calling ->insert_requests
>     feels wrong.  A BLK_MQ_INSERT_REQUEUE flag feels like the better
>     interface here.

Pushing a request back to hctx->dispatch is a requeuing mechanism. I want
to send requeued requests to the I/O scheduler.

Regarding the BLK_MQ_INSERT_REQUEUE suggestion, how about adding the patch
below near the start of this patch series?

Thanks,

Bart.


block: Remove elevator_type.requeue_request()

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 117aec1791c0..319d3c5a0f85 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6232,6 +6232,7 @@ static inline void bfq_update_insert_stats(struct request_queue *q,
  #endif /* CONFIG_BFQ_CGROUP_DEBUG */

  static struct bfq_queue *bfq_init_rq(struct request *rq);
+static void bfq_finish_requeue_request(struct request *rq);

  static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
  			       blk_insert_t flags)
@@ -6243,6 +6244,11 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq,
  	blk_opf_t cmd_flags;
  	LIST_HEAD(free);

+	if (rq->rq_flags & RQF_REQUEUED) {
+		rq->rq_flags &= ~RQF_REQUEUED;
+		bfq_finish_requeue_request(rq);
+	}
+
  #ifdef CONFIG_BFQ_GROUP_IOSCHED
  	if (!cgroup_subsys_on_dfl(io_cgrp_subsys) && rq->bio)
  		bfqg_stats_update_legacy_io(q, rq);
@@ -7596,7 +7602,6 @@ static struct elevator_type iosched_bfq_mq = {
  	.ops = {
  		.limit_depth		= bfq_limit_depth,
  		.prepare_request	= bfq_prepare_request,
-		.requeue_request        = bfq_finish_requeue_request,
  		.finish_request		= bfq_finish_request,
  		.exit_icq		= bfq_exit_icq,
  		.insert_requests	= bfq_insert_requests,
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 1326526bb733..8d4b835539b1 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -58,13 +58,8 @@ static inline void blk_mq_sched_completed_request(struct request *rq, u64 now)

  static inline void blk_mq_sched_requeue_request(struct request *rq)
  {
-	if (rq->rq_flags & RQF_USE_SCHED) {
-		struct request_queue *q = rq->q;
-		struct elevator_queue *e = q->elevator;
-
-		if (e->type->ops.requeue_request)
-			e->type->ops.requeue_request(rq);
-	}
+	if (rq->rq_flags & RQF_USE_SCHED)
+		rq->rq_flags |= RQF_REQUEUED;
  }

  static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
diff --git a/block/elevator.h b/block/elevator.h
index 7ca3d7b6ed82..c1f459eebc9f 100644
--- a/block/elevator.h
+++ b/block/elevator.h
@@ -43,7 +43,6 @@ struct elevator_mq_ops {
  	struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);
  	bool (*has_work)(struct blk_mq_hw_ctx *);
  	void (*completed_request)(struct request *, u64);
-	void (*requeue_request)(struct request *);
  	struct request *(*former_request)(struct request_queue *, struct request *);
  	struct request *(*next_request)(struct request_queue *, struct request *);
  	void (*init_icq)(struct io_cq *);
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index f495be433276..b1f2e172fe61 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -608,6 +608,10 @@ static void kyber_insert_requests(struct blk_mq_hw_ctx *hctx,

  		spin_lock(&kcq->lock);
  		trace_block_rq_insert(rq);
+		if (rq->rq_flags & RQF_REQUEUED) {
+			rq->rq_flags &= ~RQF_REQUEUED;
+			kyber_finish_request(rq);
+		}
  		if (flags & BLK_MQ_INSERT_AT_HEAD)
  			list_move(&rq->queuelist, head);
  		else
@@ -1022,7 +1026,6 @@ static struct elevator_type kyber_sched = {
  		.prepare_request = kyber_prepare_request,
  		.insert_requests = kyber_insert_requests,
  		.finish_request = kyber_finish_request,
-		.requeue_request = kyber_finish_request,
  		.completed_request = kyber_completed_request,
  		.dispatch_request = kyber_dispatch_request,
  		.has_work = kyber_has_work,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d778cb6b2112..861ca3bb0bce 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -59,6 +59,9 @@ typedef __u32 __bitwise req_flags_t;
  #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
  /* ->timeout has been called, don't expire again */
  #define RQF_TIMED_OUT		((__force req_flags_t)(1 << 21))
+/* The request is about to be requeued. */
+#define RQF_REQUEUED		((__force req_flags_t)(1 << 22))
+/* A reserved tag has been assigned to the request. */
  #define RQF_RESV		((__force req_flags_t)(1 << 23))

  /* flags that prevent us from merging requests: */
Ming Lei May 24, 2023, 12:31 a.m. UTC | #5
On Tue, May 23, 2023 at 10:19:34AM -0700, Bart Van Assche wrote:
> On 5/23/23 02:03, Ming Lei wrote:
> > On Mon, May 22, 2023 at 11:38:37AM -0700, Bart Van Assche wrote:
> > > Send requeued requests to the I/O scheduler such that the I/O scheduler
> > > can control the order in which requests are dispatched.
> > 
> > I guess you are addressing UFS zoned for REQ_OP_WRITE:
> > 
> > https://lore.kernel.org/linux-block/8e88b22e-fdf2-5182-02fe-9876e8148947@acm.org/
> > 
> > I am wondering how this approach can maintain ordering for zoned write
> > requests.
> > 
> > Requeued WRITEs may happen in any order. For example, req A and req B are
> > submitted in order, but req B is requeued first and req A follows.
> > 
> > So req B is requeued to the scheduler first and issued to the LLD, then
> > req A is requeued and issued to the LLD later. Doesn't that still
> > reorder the writes?
> > 
> > Or can sd_zbc provide a requeue ordering guarantee?
> 
> Hi Ming,
> 
> The mq-deadline scheduler restricts the queue depth to one per zone for zoned
> storage so at any time there is at most one write command (REQ_OP_WRITE) in
> flight per zone.

But if the write queue depth is 1 per zone, the requeued request won't
be reordered at all, given that no other write request for this zone can
be issued from the scheduler before the requeued request has completed.

So why bother to requeue the BLK_STS_RESOURCE request via the scheduler?

Thanks,
Ming
Christoph Hellwig May 24, 2023, 6:13 a.m. UTC | #6
On Tue, May 23, 2023 at 03:30:16PM -0700, Bart Van Assche wrote:
>  static inline void blk_mq_sched_requeue_request(struct request *rq)
>  {
> -	if (rq->rq_flags & RQF_USE_SCHED) {
> -		struct request_queue *q = rq->q;
> -		struct elevator_queue *e = q->elevator;
> -
> -		if (e->type->ops.requeue_request)
> -			e->type->ops.requeue_request(rq);
> -	}
> +	if (rq->rq_flags & RQF_USE_SCHED)
> +		rq->rq_flags |= RQF_REQUEUED;
>  }

I'd drop this helper function if we go down this way.  But maybe
we might just want to keep the method.  Sorry for the noise.
Bart Van Assche May 24, 2023, 5:56 p.m. UTC | #7
On 5/23/23 17:31, Ming Lei wrote:
> On Tue, May 23, 2023 at 10:19:34AM -0700, Bart Van Assche wrote:
>> The mq-deadline scheduler restricts the queue depth to one per zone for zoned
>> storage so at any time there is at most one write command (REQ_OP_WRITE) in
>> flight per zone.
> 
> But if the write queue depth is 1 per zone, the requeued request won't
> be re-ordered at all given no other write request can be issued from
> scheduler in this zone before this requeued request is completed.
> 
> So why bother to requeue the BLK_STS_RESOURCE request via scheduler?

Hi Ming,

It seems like my previous email was not clear enough. The mq-deadline 
scheduler restricts the queue depth per zone for commands passed to the 
SCSI core. It does not restrict how many requests a filesystem can 
submit per zone to the block layer. Without this patch there is a risk 
of reordering if a request is requeued, e.g. by the SCSI core, and other 
requests are pending for the same zone.

Thanks,

Bart.
Bart Van Assche May 24, 2023, 6:22 p.m. UTC | #8
On 5/23/23 23:13, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:30:16PM -0700, Bart Van Assche wrote:
>>   static inline void blk_mq_sched_requeue_request(struct request *rq)
>>   {
>> -	if (rq->rq_flags & RQF_USE_SCHED) {
>> -		struct request_queue *q = rq->q;
>> -		struct elevator_queue *e = q->elevator;
>> -
>> -		if (e->type->ops.requeue_request)
>> -			e->type->ops.requeue_request(rq);
>> -	}
>> +	if (rq->rq_flags & RQF_USE_SCHED)
>> +		rq->rq_flags |= RQF_REQUEUED;
>>   }
> 
> I'd drop this helper function if we go down this way.  But maybe
> we might just want to keep the method.

My understanding is that every .requeue_request() call is followed by a 
.insert_requests() call and hence that we don't need the 
.requeue_request() method anymore if the RQF_REQUEUED flag would be 
introduced?
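
For context, the hook is invoked from blk_mq_requeue_request(), roughly
as follows (abridged), after which the request sits on the requeue list
until blk_mq_requeue_work() re-inserts it:

	void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
	{
		...
		__blk_mq_requeue_request(rq);

		/* let the scheduler know the request is being requeued */
		blk_mq_sched_requeue_request(rq);

		/* park the request on q->requeue_list; the requeue work
		 * will insert it later */
		...
	}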

Thanks,

Bart.
Damien Le Moal May 24, 2023, 11:06 p.m. UTC | #9
On 5/25/23 02:56, Bart Van Assche wrote:
> On 5/23/23 17:31, Ming Lei wrote:
>> On Tue, May 23, 2023 at 10:19:34AM -0700, Bart Van Assche wrote:
>>> The mq-deadline scheduler restricts the queue depth to one per zone for zoned
>>> storage so at any time there is at most one write command (REQ_OP_WRITE) in
>>> flight per zone.
>>
>> But if the write queue depth is 1 per zone, the requeued request won't
>> be re-ordered at all given no other write request can be issued from
>> scheduler in this zone before this requeued request is completed.
>>
>> So why bother to requeue the BLK_STS_RESOURCE request via scheduler?
> 
> Hi Ming,
> 
> It seems like my previous email was not clear enough. The mq-deadline 
> scheduler restricts the queue depth per zone for commands passed to the 
> SCSI core. It does not restrict how many requests a filesystem can 
> submit per zone to the block layer. Without this patch there is a risk 
> of reordering if a request is requeued, e.g. by the SCSI core, and other 
> requests are pending for the same zone.

Yes there is, but the contract we established for zoned devices in the block
layer, from the start of the support, is that users *must* write sequentially.
The block layer does not attempt, generally speaking, to reorder requests.

When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
thus hiding potential bugs in the user write sequence for a zone. That is fine.
However, once a write request is dispatched, we should keep the assumption that
it is a well formed one, namely directed at the zone write pointer. So any
consideration of requeue solving write ordering issues is moot to me.

Furthermore, when the requeue happens, the target zone is still locked and the
only write request that can be in flight for that target zone is the one being
requeued. Add to that the above assumption that the request is the one we must
dispatch first, there are absolutely zero chances of seeing a reordering happen
for writes to a particular zone. I really do not see the point of requeuing that
request through the IO scheduler at all.

In general, even for reads, requeuing through the scheduler is I think a really
bad idea as that can potentially significantly increase the request latency
(time to completion), with the user seeing long tail latencies. E.g. if the
request has high priority or a short CDL time limit, requeuing through the
scheduler will go against the user indicated urgency for that request and
degrade the effectiveness of latency control measures such as IO priority and CDL.

Requeues should be at the head of the dispatch queue, not through the scheduler.
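
For reference, requeueing at the head of the dispatch queue is roughly
the following inside blk-mq (what blk_mq_request_bypass_insert(rq,
BLK_MQ_INSERT_AT_HEAD) boils down to):

	spin_lock(&hctx->lock);
	list_add(&rq->queuelist, &hctx->dispatch);	/* head, not tail */
	spin_unlock(&hctx->lock);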

As long as we keep zone write locking for zoned devices, requeue to the head of
the dispatch queue is fine. But maybe this work is preparatory to removing zone
write locking ? If that is the case, I would like to see that as well to get the
big picture. Otherwise, the latency concerns I raised above are in my opinion, a
blocker for this change.

> 
> Thanks,
> 
> Bart.
>
Ming Lei May 25, 2023, 12:53 a.m. UTC | #10
On Thu, May 25, 2023 at 08:06:31AM +0900, Damien Le Moal wrote:
> On 5/25/23 02:56, Bart Van Assche wrote:
> > On 5/23/23 17:31, Ming Lei wrote:
> >> On Tue, May 23, 2023 at 10:19:34AM -0700, Bart Van Assche wrote:
> >>> The mq-deadline scheduler restricts the queue depth to one per zone for zoned
> >>> storage so at any time there is at most one write command (REQ_OP_WRITE) in
> >>> flight per zone.
> >>
> >> But if the write queue depth is 1 per zone, the requeued request won't
> >> be re-ordered at all given no other write request can be issued from
> >> scheduler in this zone before this requeued request is completed.
> >>
> >> So why bother to requeue the BLK_STS_RESOURCE request via scheduler?
> > 
> > Hi Ming,
> > 
> > It seems like my previous email was not clear enough. The mq-deadline 
> > scheduler restricts the queue depth per zone for commands passed to the 
> > SCSI core. It does not restrict how many requests a filesystem can 
> > submit per zone to the block layer. Without this patch there is a risk 
> > of reordering if a request is requeued, e.g. by the SCSI core, and other 
> > requests are pending for the same zone.
> 
> Yes there is, but the contract we established for zoned devices in the block
> layer, from the start of the support, is that users *must* write sequentially.
> The block layer does not attempt, generally speaking, to reorder requests.
> 
> When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
> thus hiding potential bugs in the user write sequence for a zone. That is fine.
> However, once a write request is dispatched, we should keep the assumption that
> it is a well formed one, namely directed at the zone write pointer. So any
> consideration of requeue solving write ordering issues is moot to me.
> 
> Furthermore, when the requeue happens, the target zone is still locked and the
> only write request that can be in flight for that target zones is that one being
> requeued. Add to that the above assumption that the request is the one we must
> dispatch first, there are absolutely zero chances of seeing a reordering happen
> for writes to a particular zone. I really do not see the point of requeuing that
> request through the IO scheduler at all.

Totally agree, that is exactly what I meant.

> 
> In general, even for reads, requeuing through the scheduler is I think a really
> bad idea as that can potentially significantly increase the request latency
> (time to completion), with the user seeing long tail latencies. E.g. if the
> request has high priority or a short CDL time limit, requeuing through the
> scheduler will go against the user indicated urgency for that request and
> degrade the effectiveness of latency control measures such as IO priority and CDL.
> 
> Requeues should be at the head of the dispatch queue, not through the scheduler.

It is pretty easy to run into BLK_STS_RESOURCE for some SCSI devices, and
requeuing via the scheduler does add more overhead for these devices.

> 
> As long as we keep zone write locking for zoned devices, requeue to the head of
> the dispatch queue is fine. But maybe this work is preparatory to removing zone
> write locking ? If that is the case, I would like to see that as well to get the
> big picture. Otherwise, the latency concerns I raised above are in my opinion, a
> blocker for this change.

Yeah, it looks like Bart needs to add more words about the requirements &
motivation of this patchset.


Thanks,
Ming
Christoph Hellwig May 25, 2023, 8:25 a.m. UTC | #11
On Wed, May 24, 2023 at 11:22:00AM -0700, Bart Van Assche wrote:
>>>   static inline void blk_mq_sched_requeue_request(struct request *rq)
>>>   {
>>> -	if (rq->rq_flags & RQF_USE_SCHED) {
>>> -		struct request_queue *q = rq->q;
>>> -		struct elevator_queue *e = q->elevator;
>>> -
>>> -		if (e->type->ops.requeue_request)
>>> -			e->type->ops.requeue_request(rq);
>>> -	}
>>> +	if (rq->rq_flags & RQF_USE_SCHED)
>>> +		rq->rq_flags |= RQF_REQUEUED;
>>>   }
>>
>> I'd drop this helper function if we go down this way.  But maybe
>> we might just want to keep the method.
>
> My understanding is that every .requeue_request() call is followed by a 
> .insert_requests() call and hence that we don't need the .requeue_request() 
> method anymore if the RQF_REQUEUED flag would be introduced?

Yes, but at the same time RQF_REQUEUED now creates global state instead
of state that lives just in a callchain.  That's why I originally
suggested a flag to ->insert_requests instead of leaving state on every
request.

>
> Thanks,
>
> Bart.
---end quoted text---
Bart Van Assche June 21, 2023, 12:34 a.m. UTC | #12
On 5/24/23 16:06, Damien Le Moal wrote:
> When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
> thus hiding potential bugs in the user write sequence for a zone. That is fine.
> However, once a write request is dispatched, we should keep the assumption that
> it is a well formed one, namely directed at the zone write pointer. So any
> consideration of requeue solving write ordering issues is moot to me.
> 
> Furthermore, when the requeue happens, the target zone is still locked and the
> only write request that can be in flight for that target zones is that one being
> requeued. Add to that the above assumption that the request is the one we must
> dispatch first, there are absolutely zero chances of seeing a reordering happen
> for writes to a particular zone. I really do not see the point of requeuing that
> request through the IO scheduler at all.

I will drop the changes from this patch that send requeued dispatched 
requests to the I/O scheduler instead of back to the dispatch list. It 
took me considerable effort to find all the potential causes of 
reordering. As you may have noticed I also came up with some changes 
that are not essential. I should have realized myself that this change 
is not essential.

> As long as we keep zone write locking for zoned devices, requeue to the head of
> the dispatch queue is fine. But maybe this work is preparatory to removing zone
> write locking ? If that is the case, I would like to see that as well to get the
> big picture. Otherwise, the latency concerns I raised above are in my opinion, a
> blocker for this change.

Regarding removing zone write locking, would it be acceptable to 
implement a solution for SCSI devices before it is clear how to 
implement a solution for NVMe devices? I think a potential solution for 
SCSI devices is to send requests that should be requeued to the SCSI 
error handler instead of to the block layer requeue list. The SCSI error 
handler waits until all pending requests have timed out or have been 
sent to the error handler. The SCSI error handler can be modified such 
that requests are sorted in LBA order before being resubmitted. This 
would solve the nasty issues that would otherwise arise when requeuing 
requests if multiple write requests for the same zone are pending.
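
For illustration, the LBA sort itself could look as follows (a sketch
only; the list name and the exact hook into the SCSI error handler are
assumed):

	#include <linux/list_sort.h>

	static int cmp_rq_pos(void *priv, const struct list_head *a,
			      const struct list_head *b)
	{
		struct request *rqa = list_entry(a, struct request, queuelist);
		struct request *rqb = list_entry(b, struct request, queuelist);

		return blk_rq_pos(rqa) < blk_rq_pos(rqb) ? -1 :
		       blk_rq_pos(rqa) > blk_rq_pos(rqb);
	}

	/* in the resubmission path, before reissuing the requests: */
	list_sort(NULL, &resubmit_list, cmp_rq_pos);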

Thanks,

Bart.
Damien Le Moal June 22, 2023, 11:45 p.m. UTC | #13
On 6/21/23 09:34, Bart Van Assche wrote:
> On 5/24/23 16:06, Damien Le Moal wrote:
>> When mq-deadline is used, the scheduler lba reordering *may* reorder writes,
>> thus hiding potential bugs in the user write sequence for a zone. That is fine.
>> However, once a write request is dispatched, we should keep the assumption that
>> it is a well formed one, namely directed at the zone write pointer. So any
>> consideration of requeue solving write ordering issues is moot to me.
>>
>> Furthermore, when the requeue happens, the target zone is still locked and the
>> only write request that can be in flight for that target zones is that one being
>> requeued. Add to that the above assumption that the request is the one we must
>> dispatch first, there are absolutely zero chances of seeing a reordering happen
>> for writes to a particular zone. I really do not see the point of requeuing that
>> request through the IO scheduler at all.
> 
> I will drop the changes from this patch that send requeued dispatched 
> requests to the I/O scheduler instead of back to the dispatch list. It 
> took me considerable effort to find all the potential causes of 
> reordering. As you may have noticed I also came up with some changes 
> that are not essential. I should have realized myself that this change 
> is not essential.
> 
>> As long as we keep zone write locking for zoned devices, requeue to the head of
>> the dispatch queue is fine. But maybe this work is preparatory to removing zone
>> write locking ? If that is the case, I would like to see that as well to get the
>> big picture. Otherwise, the latency concerns I raised above are in my opinion, a
>> blocker for this change.
> 
> Regarding removing zone write locking, would it be acceptable to 
> implement a solution for SCSI devices before it is clear how to 
> implement a solution for NVMe devices? I think a potential solution for 
> SCSI devices is to send requests that should be requeued to the SCSI 
> error handler instead of to the block layer requeue list. The SCSI error 
> handler waits until all pending requests have timed out or have been 
> sent to the error handler. The SCSI error handler can be modified such 
> that requests are sorted in LBA order before being resubmitted. This 
> would solve the nasty issues that would otherwise arise when requeuing 
> requests if multiple write requests for the same zone are pending.

I am still thinking that a dedicated hctx for writes to sequential zones may be
the simplest solution for all device types:
1) For SCSI HBAs, we can likely gain high-QD zone writes, but that needs to be
checked. For AHCI though, we need to keep the max write qd=1 per zone because
those chipsets reorder command submissions. So we'll need a queue flag saying
"need zone write locking", indicated by the adapter when creating the queue
(see the sketch after this list).
2) For NVMe, this would allow high-QD writes, with only the penalty of heavier
locking overhead when writes are issued from multiple CPUs.
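
A purely hypothetical sketch of such a flag (name, bit value and call
sites are assumptions; nothing like this exists today):

	/* request_queue flag: the adapter may reorder submissions, so the
	 * block layer must keep zone write locking (qd=1 per zone) */
	#define QUEUE_FLAG_NEED_ZONE_WLOCK	30	/* bit number assumed */

	/* set by e.g. the AHCI/ATA code when setting up the queue: */
	blk_queue_flag_set(QUEUE_FLAG_NEED_ZONE_WLOCK, q);

	/* checked wherever sequential zone writes are dispatched: */
	bool need_zone_write_locking =
		test_bit(QUEUE_FLAG_NEED_ZONE_WLOCK, &q->queue_flags);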

But I have not started looking at all the details. Need to start prototyping
something. We can try working on this together if you want.
Bart Van Assche June 23, 2023, 8:31 p.m. UTC | #14
On 6/22/23 16:45, Damien Le Moal wrote:
> On 6/21/23 09:34, Bart Van Assche wrote:
>> Regarding removing zone write locking, would it be acceptable to
>> implement a solution for SCSI devices before it is clear how to
>> implement a solution for NVMe devices? I think a potential solution for
>> SCSI devices is to send requests that should be requeued to the SCSI
>> error handler instead of to the block layer requeue list. The SCSI error
>> handler waits until all pending requests have timed out or have been
>> sent to the error handler. The SCSI error handler can be modified such
>> that requests are sorted in LBA order before being resubmitted. This
>> would solve the nasty issues that would otherwise arise when requeuing
>> requests if multiple write requests for the same zone are pending.
> 
> I am still thinking that a dedicated hctx for writes to sequential zones may be
> the simplest solution for all device types:
> 1) For scsi HBAs, we can likely gain high qd zone writes, but that needs to be
> checked. For AHCI though, we need to keep the max write qd=1 per zone because of
> the chipsets reordering command submissions. So we'll need a queue flag saying
> "need zone write locking" indicated by the adapter when creating the queue.
> 2) For NVMe, this would allow high QD writes, with only the penalty of heavier
> locking overhead when writes are issued from multiple CPUs.
> 
> But I have not started looking at all the details. Need to start prototyping
> something. We can try working on this together if you want.

Hi Damien,

I'm interested in collaborating on this. But I'm not sure whether a 
dedicated hardware queue for sequential writes is a full solution. 
Applications must submit zoned writes (other than write appends) in 
order. These zoned writes may end up in a software queue. It is possible 
that the software queues are flushed in such a way that the zoned writes 
are reordered. Or do you perhaps want to send all zoned writes directly 
to a hardware queue? If so, is this really a better solution than a 
single-queue I/O scheduler? Is the difference perhaps that higher read 
IOPS can be achieved because multiple hardware queues are used for reads?

Even if all sequential writes were sent to a single hardware queue, we 
would still need a mechanism for resubmitting requests in order after a 
request has been requeued if we want to support queue depths > 1. If 
e.g. three zoned writes are in flight and a unit attention is reported 
for the second write, then the two writes that have to be resubmitted 
must only be resubmitted after both of them have completed.

Another possibility is to introduce a new request queue flag that 
specifies that only writes should be sent to the I/O scheduler. I'm 
interested in this because of the following observation for zoned UFS 
devices for a block size of 4 KiB and a random read workload:
* mq-deadline scheduler:  59 K IOPS.
* no I/O scheduler:      100 K IOPS.
In other words, 70% more IOPS with no I/O scheduler compared to 
mq-deadline. I don't think that this indicates a performance bug in the 
mq-deadline scheduler. From a quick measurement with the null_blk driver 
it seems to me that all I/O schedulers saturate around 150 K - 170 K IOPS.
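
A hypothetical sketch of how such a flag could steer that decision (the
flag name and the hook point are assumptions; nothing like this exists):

	/* route only writes through the elevator; reads behave as if no
	 * I/O scheduler were attached */
	bool use_sched = q->elevator &&
		(op_is_write(req_op(rq)) ||
		 !test_bit(QUEUE_FLAG_SCHED_WRITES_ONLY, &q->queue_flags));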

Thanks,

Bart.

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e79cc34ad962..632aee9af60f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1438,30 +1438,26 @@  static void blk_mq_requeue_work(struct work_struct *work)
 		container_of(work, struct request_queue, requeue_work.work);
 	LIST_HEAD(requeue_list);
 	LIST_HEAD(flush_list);
-	struct request *rq;
+	struct request *rq, *next;
 
 	spin_lock_irq(&q->requeue_lock);
 	list_splice_init(&q->requeue_list, &requeue_list);
 	list_splice_init(&q->flush_list, &flush_list);
 	spin_unlock_irq(&q->requeue_lock);
 
-	while (!list_empty(&requeue_list)) {
-		rq = list_entry(requeue_list.next, struct request, queuelist);
-		/*
-		 * If RQF_DONTPREP ist set, the request has been started by the
-		 * driver already and might have driver-specific data allocated
-		 * already.  Insert it into the hctx dispatch list to avoid
-		 * block layer merges for the request.
-		 */
-		if (rq->rq_flags & RQF_DONTPREP) {
-			list_del_init(&rq->queuelist);
-			blk_mq_request_bypass_insert(rq, 0);
-		} else {
+	list_for_each_entry_safe(rq, next, &requeue_list, queuelist) {
+		if (!(rq->rq_flags & RQF_DONTPREP)) {
 			list_del_init(&rq->queuelist);
 			blk_mq_insert_request(rq, BLK_MQ_INSERT_AT_HEAD);
 		}
 	}
 
+	while (!list_empty(&requeue_list)) {
+		rq = list_entry(requeue_list.next, struct request, queuelist);
+		list_del_init(&rq->queuelist);
+		blk_mq_insert_request(rq, 0);
+	}
+
 	while (!list_empty(&flush_list)) {
 		rq = list_entry(flush_list.next, struct request, queuelist);
 		list_del_init(&rq->queuelist);
@@ -2064,14 +2060,28 @@  bool blk_mq_dispatch_rq_list(struct blk_mq_hw_ctx *hctx, struct list_head *list,
 		bool no_tag = prep == PREP_DISPATCH_NO_TAG &&
 			((hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
 			blk_mq_is_shared_tags(hctx->flags));
+		LIST_HEAD(for_sched);
+		struct request *next;
 
 		if (nr_budgets)
 			blk_mq_release_budgets(q, list);
 
 		spin_lock(&hctx->lock);
+		list_for_each_entry_safe(rq, next, list, queuelist)
+			if (rq->rq_flags & RQF_USE_SCHED)
+				list_move_tail(&rq->queuelist, &for_sched);
 		list_splice_tail_init(list, &hctx->dispatch);
 		spin_unlock(&hctx->lock);
 
+		if (q->elevator) {
+			if (q->elevator->type->ops.requeue_request)
+				list_for_each_entry(rq, &for_sched, queuelist)
+					q->elevator->type->ops.
+						requeue_request(rq);
+			q->elevator->type->ops.insert_requests(hctx, &for_sched,
+							BLK_MQ_INSERT_AT_HEAD);
+		}
+
 		/*
 		 * Order adding requests to hctx->dispatch and checking
 		 * SCHED_RESTART flag. The pair of this smp_mb() is the one
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d778cb6b2112..363894aea0e8 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -62,8 +62,8 @@  typedef __u32 __bitwise req_flags_t;
 #define RQF_RESV		((__force req_flags_t)(1 << 23))
 
 /* flags that prevent us from merging requests: */
-#define RQF_NOMERGE_FLAGS \
-	(RQF_STARTED | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
+#define RQF_NOMERGE_FLAGS                                               \
+	(RQF_STARTED | RQF_FLUSH_SEQ | RQF_DONTPREP | RQF_SPECIAL_PAYLOAD)
 
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,