
[4/8] block/mq-deadline: Only use zone locking if necessary

Message ID 20230109232738.169886-5-bvanassche@acm.org (mailing list archive)
State New, archived
Series Enable zoned write pipelining for UFS devices

Commit Message

Bart Van Assche Jan. 9, 2023, 11:27 p.m. UTC
Measurements have shown that limiting the queue depth to one for zoned
writes has a significant negative performance impact on zoned UFS devices.
Hence this patch that disables zone locking from the mq-deadline scheduler
for storage controllers that support pipelining zoned writes. This patch is
based on the following assumptions:
- Applications submit write requests to sequential write required zones
  in order.
- It happens infrequently that zoned write requests are reordered by the
  block layer.
- The storage controller does not reorder write requests that have been
  submitted to the same hardware queue. This is the case for UFS: the
  UFSHCI specification requires that UFS controllers process requests in
  order per hardware queue.
- The I/O priority of all pipelined write requests is the same per zone.
- Either no I/O scheduler is used or an I/O scheduler is used that
  submits write requests per zone in LBA order.

If applications submit write requests to sequential write required zones
in order, at least one of the pending requests will succeed. Hence, the
number of retries that is required is at most (number of pending
requests) - 1.
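
As an illustration: suppose four pipelined writes W0..W3 target consecutive
LBA ranges of one zone and reach the device out of order. Only the write that
starts at the write pointer completes; the other three fail with an unaligned
write error and are retried. Each subsequent round completes at least the
write at the (now advanced) write pointer, so none of the four writes needs
more than three retries.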

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-zoned.c   |  3 ++-
 block/mq-deadline.c | 14 +++++++++-----
 2 files changed, 11 insertions(+), 6 deletions(-)

Comments

Damien Le Moal Jan. 9, 2023, 11:46 p.m. UTC | #1
On 1/10/23 08:27, Bart Van Assche wrote:
> Measurements have shown that limiting the queue depth to one for zoned
> writes has a significant negative performance impact on zoned UFS devices.
> Hence this patch that disables zone locking from the mq-deadline scheduler
> for storage controllers that support pipelining zoned writes. This patch is
> based on the following assumptions:
> - Applications submit write requests to sequential write required zones
>   in order.
> - It happens infrequently that zoned write requests are reordered by the
>   block layer.
> - The storage controller does not reorder write requests that have been
>   submitted to the same hardware queue. This is the case for UFS: the
>   UFSHCI specification requires that UFS controllers process requests in
>   order per hardware queue.
> - The I/O priority of all pipelined write requests is the same per zone.
> - Either no I/O scheduler is used or an I/O scheduler is used that
>   submits write requests per zone in LBA order.
> 
> If applications submit write requests to sequential write required zones
> in order, at least one of the pending requests will succeed. Hence, the
> number of retries that is required is at most (number of pending
> requests) - 1.

But if the mid-layer decides to requeue a write request, the workqueue
used in the mq block layer for requeuing is going to completely destroy
write ordering as that is outside of the submission path, working in
parallel with it... Does blk_queue_pipeline_zoned_writes() == true also
guarantee that a write request will *never* be requeued before hitting the
adapter/device ?

> 
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-zoned.c   |  3 ++-
>  block/mq-deadline.c | 14 +++++++++-----
>  2 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index db829401d8d0..158638091e39 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -520,7 +520,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>  		break;
>  	case BLK_ZONE_TYPE_SEQWRITE_REQ:
>  	case BLK_ZONE_TYPE_SEQWRITE_PREF:
> -		if (!args->seq_zones_wlock) {
> +		if (!blk_queue_pipeline_zoned_writes(q) &&
> +		    !args->seq_zones_wlock) {
>  			args->seq_zones_wlock =
>  				blk_alloc_zone_bitmap(q->node, args->nr_zones);
>  			if (!args->seq_zones_wlock)
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index f10c2a0d18d4..e41808c0b007 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -339,7 +339,8 @@ deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  		return NULL;
>  
>  	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
> -	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
> +	    blk_queue_pipeline_zoned_writes(rq->q))
>  		return rq;
>  
>  	/*
> @@ -378,7 +379,8 @@ deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
>  	if (!rq)
>  		return NULL;
>  
> -	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
> +	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
> +	    blk_queue_pipeline_zoned_writes(rq->q))
>  		return rq;
>  
>  	/*
> @@ -503,8 +505,9 @@ static struct request *__dd_dispatch_request(struct deadline_data *dd,
>  	}
>  
>  	/*
> -	 * For a zoned block device, if we only have writes queued and none of
> -	 * them can be dispatched, rq will be NULL.
> +	 * For a zoned block device that requires write serialization, if we
> +	 * only have writes queued and none of them can be dispatched, rq will
> +	 * be NULL.
>  	 */
>  	if (!rq)
>  		return NULL;
> @@ -893,7 +896,8 @@ static void dd_finish_request(struct request *rq)
>  
>  	atomic_inc(&per_prio->stats.completed);
>  
> -	if (blk_queue_is_zoned(q)) {
> +	if (blk_queue_is_zoned(rq->q) &&
> +	    !blk_queue_pipeline_zoned_writes(q)) {
>  		unsigned long flags;
>  
>  		spin_lock_irqsave(&dd->zone_lock, flags);
Bart Van Assche Jan. 9, 2023, 11:51 p.m. UTC | #2
On 1/9/23 15:46, Damien Le Moal wrote:
> On 1/10/23 08:27, Bart Van Assche wrote:
>> Measurements have shown that limiting the queue depth to one for zoned
>> writes has a significant negative performance impact on zoned UFS devices.
>> Hence this patch that disables zone locking from the mq-deadline scheduler
>> for storage controllers that support pipelining zoned writes. This patch is
>> based on the following assumptions:
>> - Applications submit write requests to sequential write required zones
>>    in order.
>> - It happens infrequently that zoned write requests are reordered by the
>>    block layer.
>> - The storage controller does not reorder write requests that have been
>>    submitted to the same hardware queue. This is the case for UFS: the
>>    UFSHCI specification requires that UFS controllers process requests in
>>    order per hardware queue.
>> - The I/O priority of all pipelined write requests is the same per zone.
>> - Either no I/O scheduler is used or an I/O scheduler is used that
>>    submits write requests per zone in LBA order.
>>
>> If applications submit write requests to sequential write required zones
>> in order, at least one of the pending requests will succeed. Hence, the
>> number of retries that is required is at most (number of pending
>> requests) - 1.
> 
> But if the mid-layer decides to requeue a write request, the workqueue
> used in the mq block layer for requeuing is going to completely destroy
> write ordering as that is outside of the submission path, working in
> parallel with it... Does blk_queue_pipeline_zoned_writes() == true also
> guarantee that a write request will *never* be requeued before hitting the
> adapter/device ?

We don't need the guarantee that reordering will never happen. What we 
need is that reordering happens infrequently (e.g. less than 1% of the 
cases). This is what the last paragraph before your reply refers to. 
Maybe I should expand that paragraph.

Thanks,

Bart.
Damien Le Moal Jan. 9, 2023, 11:56 p.m. UTC | #3
On 1/10/23 08:51, Bart Van Assche wrote:
> On 1/9/23 15:46, Damien Le Moal wrote:
>> On 1/10/23 08:27, Bart Van Assche wrote:
>>> Measurements have shown that limiting the queue depth to one for zoned
>>> writes has a significant negative performance impact on zoned UFS devices.
>>> Hence this patch that disables zone locking from the mq-deadline scheduler
>>> for storage controllers that support pipelining zoned writes. This patch is
>>> based on the following assumptions:
>>> - Applications submit write requests to sequential write required zones
>>>    in order.
>>> - It happens infrequently that zoned write requests are reordered by the
>>>    block layer.
>>> - The storage controller does not reorder write requests that have been
>>>    submitted to the same hardware queue. This is the case for UFS: the
>>>    UFSHCI specification requires that UFS controllers process requests in
>>>    order per hardware queue.
>>> - The I/O priority of all pipelined write requests is the same per zone.
>>> - Either no I/O scheduler is used or an I/O scheduler is used that
>>>    submits write requests per zone in LBA order.
>>>
>>> If applications submit write requests to sequential write required zones
>>> in order, at least one of the pending requests will succeed. Hence, the
>>> number of retries that is required is at most (number of pending
>>> requests) - 1.
>>
>> But if the mid-layer decides to requeue a write request, the workqueue
>> used in the mq block layer for requeuing is going to completely destroy
>> write ordering as that is outside of the submission path, working in
>> parallel with it... Does blk_queue_pipeline_zoned_writes() == true also
>> guarantee that a write request will *never* be requeued before hitting the
>> adapter/device ?
> 
> We don't need the guarantee that reordering will never happen. What we 
> need is that reordering happens infrequently (e.g. less than 1% of the 
> cases). This is what the last paragraph before your reply refers to. 
> Maybe I should expand that paragraph.

But my point is that if a request goes through the block layer requeue, it
will be out of order, and will be submitted out of order again, and will
fail again. Unless you stall dispatching, wait for all requeues to come
back in the scheduler, and then start trying again, I do not see how you
can guarantee that retrying the unaligned writes will ever succeed.

I am talking in the context of host-managed devices here.
Bart Van Assche Jan. 10, 2023, 12:19 a.m. UTC | #4
On 1/9/23 15:56, Damien Le Moal wrote:
> But my point is that if a request goes through the block layer requeue, it
> will be out of order, and will be submitted out of order again, and will
> fail again. Unless you stall dispatching, wait for all requeues to come
> back in the scheduler, and then start trying again, I do not see how you
> can guarantee that retrying the unaligned writes will ever succeed.
> 
> I am talking in the context of host-managed devices here.

Hi Damien,

How about changing the NEEDS_RETRY in patch 7/8 into a value other than
SUCCESS, NEEDS_RETRY or ADD_TO_MLQUEUE? That will make the SCSI core 
wait until all pending commands have finished before it starts its error 
handling strategy and before it requeues any pending commands.
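
A minimal sketch of what that could look like (the helper name and the exact
hook point are assumptions for illustration, not code from patch 7/8):

#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_eh.h>
#include <scsi/scsi_proto.h>

/*
 * Hypothetical helper: decide the disposition of a failed zoned write.
 * Returning FAILED instead of NEEDS_RETRY makes the SCSI core kick the
 * error handler, and the error handler only starts once all outstanding
 * commands on the host have completed, so no pending zoned write gets
 * requeued while other writes for the same zone are still in flight.
 */
static enum scsi_disposition zoned_write_disposition(struct scsi_cmnd *cmd)
{
        struct scsi_sense_hdr sshdr;

        if (!scsi_command_normalize_sense(cmd, &sshdr))
                return NEEDS_RETRY;

        /* ILLEGAL REQUEST / UNALIGNED WRITE COMMAND (ASC 0x21, ASCQ 0x04) */
        if (sshdr.sense_key == ILLEGAL_REQUEST &&
            sshdr.asc == 0x21 && sshdr.ascq == 0x04)
                return FAILED;

        return NEEDS_RETRY;
}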

Thanks,

Bart.
Damien Le Moal Jan. 10, 2023, 12:32 a.m. UTC | #5
On 1/10/23 09:19, Bart Van Assche wrote:
> On 1/9/23 15:56, Damien Le Moal wrote:
>> But my point is that if a request goes through the block layer requeue, it
>> will be out of order, and will be submitted out of order again, and will
>> fail again. Unless you stall dispatching, wait for all requeues to come
>> back in the scheduler, and then start trying again, I do not see how you
>> can guarantee that retrying the unaligned writes will ever succeed.
>>
>> I am talking in the context of host-managed devices here.
> 
> Hi Damien,
> 
> How about changing the NEEDS_RETRY in patch 7/8 into a value other than
> SUCCESS, NEEDS_RETRY or ADD_TO_MLQUEUE? That will make the SCSI core 
> wait until all pending commands have finished before it starts its error 
> handling strategy and before it requeues any pending commands.

Considering a sequence of sequential write requests, the request can be in:
1) The scheduler
2) The dispatch queue (out of the scheduler)
3) In the requeue list, waiting to be put back in the scheduler
4) In flight, being processed for dispatch by the SCSI mid-layer and SCSI
disk driver
5) Being processed by the driver
6) Already dispatched to the device

A requeue back to the scheduler can happen anywhere from (2) up to (5)
(I would need to check again to be 100% sure though). So I do not see how
changes to the SCSI layer only, adding a new state, can cover all possible
cases and eventually get back to a clean ordering. But if you have ideas to
prove me wrong, I will be happy to review them :)

So far, the only thing that I think could work is: stall everything and
put back all write requests in the scheduler, and restart dispatching.
That will definitely have a performance impact. How that compares to the
performance impact of zone write locking, I am not sure...

It may be way simpler to rely on:
1) none scheduler
2) some light re-ordering of write requests in the driver itself to avoid
any requeue to higher level (essentially, handle requeueing in the driver
itself).

> 
> Thanks,
> 
> Bart.
Jens Axboe Jan. 10, 2023, 12:38 a.m. UTC | #6
On 1/9/23 5:32 PM, Damien Le Moal wrote:
> It may be way simpler to rely on:
> 1) none scheduler
> 2) some light re-ordering of write requests in the driver itself to avoid
> any requeue to higher level (essentially, handle requeueing in the driver
> itself).

Let's not start adding requeueing back into the drivers, that was a
giant mess that we tried to get out of for a while.

And all of this strict ordering mess really makes me think that we
should just have a special scheduler mandated for devices that have that
kind of stupid constraints...
Jens Axboe Jan. 10, 2023, 12:41 a.m. UTC | #7
On 1/9/23 5:38 PM, Jens Axboe wrote:
> On 1/9/23 5:32 PM, Damien Le Moal wrote:
>> It may be way simpler to rely on:
>> 1) none scheduler
>> 2) some light re-ordering of write requests in the driver itself to avoid
>> any requeue to higher level (essentially, handle requeueing in the driver
>> itself).
> 
> Let's not start adding requeueing back into the drivers, that was a
> giant mess that we tried to get out of for a while.
> 
> And all of this strict ordering mess really makes me think that we
> should just have a special scheduler mandated for devices that have that
> kind of stupid constraints...

Or, probably better, a stacked scheduler where the bottom one can be zone
aware. Then we can get rid of littering the entire stack and IO schedulers
with silly blk_queue_pipeline_zoned_writes() or blk_is_zoned_write() etc.

It really is a mess right now...
Bart Van Assche Jan. 10, 2023, 12:44 a.m. UTC | #8
On 1/9/23 16:41, Jens Axboe wrote:
> Or, probably better, a stacked scheduler where the bottom one can be zone
> aware. Then we can get rid of littering the entire stack and IO schedulers
> with silly blk_queue_pipeline_zoned_writes() or blk_is_zoned_write() etc.

Hi Jens,

Isn't one of Damien's viewpoints that an I/O scheduler should not do the 
reordering of write requests since reordering of write requests may 
involve waiting for write requests, write requests that will never be
received if all tags have been allocated?

Thanks,

Bart.
Jens Axboe Jan. 10, 2023, 12:48 a.m. UTC | #9
On 1/9/23 5:44 PM, Bart Van Assche wrote:
> On 1/9/23 16:41, Jens Axboe wrote:
>> Or, probably better, a stacked scheduler where the bottom one can be zone
>> aware. Then we can get rid of littering the entire stack and IO schedulers
>> with silly blk_queue_pipeline_zoned_writes() or blk_is_zoned_write() etc.
> 
> Hi Jens,
> 
> Isn't one of Damien's viewpoints that an I/O scheduler should not do
> the reordering of write requests since reordering of write requests
> may involve waiting for write requests, write requests that will never
> be received if all tags have been allocated?

It should be work conserving, it should not wait for anything. If
there are holes or gaps, then there's nothing the scheduler can do.

My point is that the strict ordering was pretty hacky when it went in,
and rather than get better, it's proliferating. That's not a good
direction.
Bart Van Assche Jan. 10, 2023, 12:56 a.m. UTC | #10
On 1/9/23 16:48, Jens Axboe wrote:
> On 1/9/23 5:44 PM, Bart Van Assche wrote:
>> On 1/9/23 16:41, Jens Axboe wrote:
>>> Or, probably better, a stacked scheduler where the bottom one can be zone
>>> aware. Then we can get rid of littering the entire stack and IO schedulers
>>> with silly blk_queue_pipeline_zoned_writes() or blk_is_zoned_write() etc.
>>
>> Hi Jens,
>>
>> Isn't one of Damien's viewpoints that an I/O scheduler should not do
>> the reordering of write requests since reordering of write requests
>> may involve waiting for write requests, write requests that will never
>> be received if all tags have been allocated?
> 
> It should be work conserving, it should not wait for anything. If
> there are holes or gaps, then there's nothing the scheduler can do.
> 
> My point is that the strict ordering was pretty hacky when it went in,
> and rather than get better, it's proliferating. That's not a good
> direction.

Hi Jens,

As you know one of the deeply embedded design choices in the blk-mq code 
is that reordering can happen at any time between submission of a 
request to the blk-mq code and request completion. I agree with that 
design choice.

For the use cases I'm looking at the sequential write required zone type 
works best. This zone type works best since it guarantees that data on 
the storage medium is sequential. This results in optimal sequential 
read performance.

Combining these two approaches is not ideal and I agree that the 
combination of these two approaches adds some complexity. Personally I 
prefer to add a limited amount of complexity rather than implementing a 
new block layer from scratch.

Thanks,

Bart.
Jens Axboe Jan. 10, 2023, 1:03 a.m. UTC | #11
On 1/9/23 5:56 PM, Bart Van Assche wrote:
> On 1/9/23 16:48, Jens Axboe wrote:
>> On 1/9/23 5:44 PM, Bart Van Assche wrote:
>>> On 1/9/23 16:41, Jens Axboe wrote:
>>>> Or, probably better, a stacked scheduler where the bottom one can be zone
>>>> aware. Then we can get rid of littering the entire stack and IO schedulers
>>>> with silly blk_queue_pipeline_zoned_writes() or blk_is_zoned_write() etc.
>>>
>>> Hi Jens,
>>>
>>> Isn't one of Damien's viewpoints that an I/O scheduler should not do
>>> the reordering of write requests since reordering of write requests
>>> may involve waiting for write requests, write request that will never
>>> be received if all tags have been allocated?
>>
>> It should be work conserving, it should not wait for anything. If
>> there are holes or gaps, then there's nothing the scheduler can do.
>>
>> My point is that the strict ordering was pretty hacky when it went in,
>> and rather than get better, it's proliferating. That's not a good
>> direction.
> 
> Hi Jens,
> 
> As you know one of the deeply embedded design choices in the blk-mq
> code is that reordering can happen at any time between submission of a
> request to the blk-mq code and request completion. I agree with that
> design choice.

Indeed. And getting rid of any ordering ops like barriers greatly
simplified things and fixed a number of issues related to that.

> For the use cases I'm looking at the sequential write required zone
> type works best. This zone type works best since it guarantees that
> data on the storage medium is sequential. This results in optimal
> sequential read performance.

That's a given.

> Combining these two approaches is not ideal and I agree that the
> combination of these two approaches adds some complexity. Personally I
> prefer to add a limited amount of complexity rather than implementing
> a new block layer from scratch.

I'm not talking about a new block layer at all, ordered devices are not
nearly important enough to warrant that kind of attention. Nor would it
be a good solution even if they were. I'm merely saying that I'm getting
more and more disgruntled with the direction that is being taken to
cater to these kinds of devices, and perhaps a much better idea is to
contain that complexity in a separate scheduler (be it stacked or not).
Because I'm really not thrilled to see the addition of various "is this
device ordered" all over the place, and now we are getting "is this
device ordered AND pipelined". Do you see what I mean? It's making
things _worse_, not better, and we really should be making it better
rather than pile more stuff on top of it.
Bart Van Assche Jan. 10, 2023, 1:17 a.m. UTC | #12
On 1/9/23 17:03, Jens Axboe wrote:
> Because I'm really not thrilled to see the addition of various "is this
> device ordered" all over the place, and now we are getting "is this
> device ordered AND pipelined". Do you see what I mean? It's making
> things _worse_, not better, and we really should be making it better
> rather than pile more stuff on top of it.

Hi Jens,

I agree with you that the additional complexity is unfortunate.

For most zoned storage use cases a queue depth above one is not an 
option if the zoned device expects zoned write commands in LBA order. 
ATA controllers do not support preserving the command order. Transports 
like NVMeOF do not support preserving the command order either. UFS is 
an exception. The only use case supported by the UFS specification is a 
1:1 connection between UFS controller and UFS device with a link with a 
low BER between controller and device. UFS controllers must preserve the 
command order per command queue. I think this context is well suited for 
pipelining zoned write commands.

Thanks,

Bart.
Jens Axboe Jan. 10, 2023, 1:48 a.m. UTC | #13
On 1/9/23 6:17 PM, Bart Van Assche wrote:
> On 1/9/23 17:03, Jens Axboe wrote:
>> Because I'm really not thrilled to see the addition of various "is this
>> device ordered" all over the place, and now we are getting "is this
>> device ordered AND pipelined". Do you see what I mean? It's making
>> things _worse_, not better, and we really should be making it better
>> rather than pile more stuff on top of it.
> 
> Hi Jens,
> 
> I agree with you that the additional complexity is unfortunate.
> 
> For most zoned storage use cases a queue depth above one is not an
> option if the zoned device expects zoned write commands in LBA order.
> ATA controllers do not support preserving the command order.
> Transports like NVMeOF do not support preserving the command order
> either. UFS is an exception. The only use case supported by the UFS
> specification is a 1:1 connection between UFS controller and UFS
> device with a link with a low BER between controller and device. UFS
> controllers must preserve the command order per command queue. I think
> this context is well suited for pipelining zoned write commands.

But it should not matter, if the scheduler handles it, and requeues are
ordered correctly. If the queue depth isn't known at init time, surely
we'd get a retry condition on submitting a request if it can't accept
another one. That'd trigger a retry, and the retry should be the first
one submitted when the device can accept another one.

Any setup that handles queue depth > 1 will do just fine at 1 as well.
Damien Le Moal Jan. 10, 2023, 2:24 a.m. UTC | #14
On 1/10/23 09:48, Jens Axboe wrote:
> On 1/9/23 5:44 PM, Bart Van Assche wrote:
>> On 1/9/23 16:41, Jens Axboe wrote:
>>> Or, probably better, a stacked scheduler where the bottom one can be zone
>>> aware. Then we can get rid of littering the entire stack and IO schedulers
>>> with silly blk_queue_pipeline_zoned_writes() or blk_is_zoned_write() etc.
>>
>> Hi Jens,
>>
>> Isn't one of Damien's viewpoints that an I/O scheduler should not do
>> the reordering of write requests since reordering of write requests
>> may involve waiting for write requests, write requests that will never
>> be received if all tags have been allocated?
> 
> It should be work conserving, it should not wait for anything. If
> there are holes or gaps, then there's nothing the scheduler can do.
> 
> My point is that the strict ordering was pretty hacky when it went in,
> and rather than get better, it's proliferating. That's not a good
> direction.

Yes, and hard to maintain/avoid breaking something.

Given that only writes need special handling, I am thinking that having a
dedicated write queue for submission/scheduling/requeue could
significantly clean things up. Essentially, we would have a different code
path for zoned device write from submit_bio(). Something like:

if (queue_is_zoned() && op_is_write())
	return blk_zoned_write_submit();

at the top of submit_bio(). That zone write code can be isolated in
block/blk-zoned.c and avoid spreading "if (zoned)" all over the place.
E.g. the flush machinery reorders writes right now... That needs fixing,
more "if (zoned)" coming...

That special zone write queue could also do its own dispatch scheduling,
so no need to hack existing schedulers.

Need to try coding something to see how it goes.
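
For illustration only, a rough sketch of that direction; blk_zoned_write_submit()
is the name from the snippet above, while the per-disk bio list, the work item
and the plain FIFO dispatch are assumptions (requeue handling and per-zone LBA
ordering are left out):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

/* Hypothetical per-disk state; a real patch would hang this off the gendisk. */
struct blk_zwq {
        spinlock_t              lock;
        struct bio_list         pending;        /* zoned writes, submission order */
        struct work_struct      work;
};

/* Dispatch pending zoned writes one by one, preserving submission order. */
static void blk_zwq_work(struct work_struct *work)
{
        struct blk_zwq *zwq = container_of(work, struct blk_zwq, work);
        struct bio *bio;

        for (;;) {
                spin_lock_irq(&zwq->lock);
                bio = bio_list_pop(&zwq->pending);
                spin_unlock_irq(&zwq->lock);
                if (!bio)
                        break;
                submit_bio_noacct(bio);
        }
}

static void blk_zwq_init(struct blk_zwq *zwq)
{
        spin_lock_init(&zwq->lock);
        bio_list_init(&zwq->pending);
        INIT_WORK(&zwq->work, blk_zwq_work);
}

/*
 * Called from submit_bio() for writes to zoned devices only. The zwq argument
 * stands in for per-disk state that real code would look up from bio->bi_bdev.
 */
void blk_zoned_write_submit(struct bio *bio, struct blk_zwq *zwq)
{
        unsigned long flags;

        spin_lock_irqsave(&zwq->lock, flags);
        bio_list_add(&zwq->pending, bio);
        spin_unlock_irqrestore(&zwq->lock, flags);
        queue_work(system_wq, &zwq->work);
}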
Jens Axboe Jan. 10, 2023, 3 a.m. UTC | #15
On 1/9/23 7:24 PM, Damien Le Moal wrote:
> On 1/10/23 09:48, Jens Axboe wrote:
>> On 1/9/23 5:44 PM, Bart Van Assche wrote:
>>> On 1/9/23 16:41, Jens Axboe wrote:
>>>> Or, probably better, a stacked scheduler where the bottom one can be zone
>>>> aware. Then we can get rid of littering the entire stack and IO schedulers
>>>> with silly blk_queue_pipeline_zoned_writes() or blk_is_zoned_write() etc.
>>>
>>> Hi Jens,
>>>
>>> Isn't one of Damien's viewpoints that an I/O scheduler should not do
>>> the reordering of write requests since reordering of write requests
>>> may involve waiting for write requests, write requests that will never
>>> be received if all tags have been allocated?
>>
>> It should be work conserving, it should not wait for anything. If
>> there are holes or gaps, then there's nothing the scheduler can do.
>>
>> My point is that the strict ordering was pretty hacky when it went in,
>> and rather than get better, it's proliferating. That's not a good
>> direction.
> 
> Yes, and hard to maintain/avoid breaking something.

Indeed! It's both fragile and ends up adding branches in a bunch of
spots in the generic code, which isn't ideal either from an efficiency
pov.

> Given that only writes need special handling, I am thinking that having a
> dedicated write queue for submission/scheduling/requeue could
> significantly clean things up. Essentially, we would have a different code
> path for zoned device write from submit_bio(). Something like:
> 
> if (queue_is_zoned() && op_is_write())
> 	return blk_zoned_write_submit();
> 
> at the top of submit_bio(). That zone write code can be isolated in
> block/blk-zoned.c and avoid spreading "if (zoned)" all over the place.
> E.g. the flush machinery reorders writes right now... That needs fixing,
> more "if (zoned)" coming...
> 
> That special zone write queue could also do its own dispatch scheduling,
> so no need to hack existing schedulers.

This seems very reasonable, and would just have the one check at queue
time, and then one at requeue time (which is fine, that's not a fast
path in any case).

Patch

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index db829401d8d0..158638091e39 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -520,7 +520,8 @@  static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
 		break;
 	case BLK_ZONE_TYPE_SEQWRITE_REQ:
 	case BLK_ZONE_TYPE_SEQWRITE_PREF:
-		if (!args->seq_zones_wlock) {
+		if (!blk_queue_pipeline_zoned_writes(q) &&
+		    !args->seq_zones_wlock) {
 			args->seq_zones_wlock =
 				blk_alloc_zone_bitmap(q->node, args->nr_zones);
 			if (!args->seq_zones_wlock)
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f10c2a0d18d4..e41808c0b007 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -339,7 +339,8 @@  deadline_fifo_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 		return NULL;
 
 	rq = rq_entry_fifo(per_prio->fifo_list[data_dir].next);
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
+	    blk_queue_pipeline_zoned_writes(rq->q))
 		return rq;
 
 	/*
@@ -378,7 +379,8 @@  deadline_next_request(struct deadline_data *dd, struct dd_per_prio *per_prio,
 	if (!rq)
 		return NULL;
 
-	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q))
+	if (data_dir == DD_READ || !blk_queue_is_zoned(rq->q) ||
+	    blk_queue_pipeline_zoned_writes(rq->q))
 		return rq;
 
 	/*
@@ -503,8 +505,9 @@  static struct request *__dd_dispatch_request(struct deadline_data *dd,
 	}
 
 	/*
-	 * For a zoned block device, if we only have writes queued and none of
-	 * them can be dispatched, rq will be NULL.
+	 * For a zoned block device that requires write serialization, if we
+	 * only have writes queued and none of them can be dispatched, rq will
+	 * be NULL.
 	 */
 	if (!rq)
 		return NULL;
@@ -893,7 +896,8 @@  static void dd_finish_request(struct request *rq)
 
 	atomic_inc(&per_prio->stats.completed);
 
-	if (blk_queue_is_zoned(q)) {
+	if (blk_queue_is_zoned(rq->q) &&
+	    !blk_queue_pipeline_zoned_writes(q)) {
 		unsigned long flags;
 
 		spin_lock_irqsave(&dd->zone_lock, flags);