
block: Submit flush requests to the I/O scheduler

Message ID 20220812210355.2252143-1-bvanassche@acm.org (mailing list archive)
State New, archived

Commit Message

Bart Van Assche Aug. 12, 2022, 9:03 p.m. UTC
When a REQ_OP_WRITE | REQ_FUA request is submitted to a zoned storage
device, it must be passed to the (mq-deadline) I/O scheduler to ensure
that the write happens at the write pointer. It has been verified that
this patch prevents the write pointer violations that otherwise happen
sporadically when f2fs is using a zoned storage device.
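
For reference, a minimal sketch of how a filesystem could submit such a
write (illustrative only, not the actual f2fs code; 'bdev', 'page' and
'wp_sector' are placeholders):

	/* Allocate a FUA write bio aimed at the zone write pointer. */
	struct bio *bio = bio_alloc(bdev, 1, REQ_OP_WRITE | REQ_FUA, GFP_NOIO);

	bio->bi_iter.bi_sector = wp_sector;
	__bio_add_page(bio, page, PAGE_SIZE, 0);
	submit_bio(bio);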

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq.c | 31 ++++---------------------------
 1 file changed, 4 insertions(+), 27 deletions(-)

Comments

Christoph Hellwig Aug. 13, 2022, 6:41 a.m. UTC | #1
On Fri, Aug 12, 2022 at 02:03:55PM -0700, Bart Van Assche wrote:
> When a REQ_OP_WRITE | REQ_FUA request is submitted to a zoned storage
> device, it must be passed to the (mq-deadline) I/O scheduler to ensure
> that the write happens at the write pointer.

Yes.

Maybe I'm stupid, but how is the patch related to fixing that?
blk_mq_plug_issue_direct is called from blk_mq_flush_plug_list for
only the !has_elevator case.  How does that change a thing?
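
For reference, that call sits under a guard roughly like this (a
paraphrased sketch of the current blk_mq_flush_plug_list(), not the
exact code):

	if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
		...
		blk_mq_run_dispatch_ops(q,
				blk_mq_plug_issue_direct(plug, false));
		...
	}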

Also please include a description of why these changes are otherwise
good and won't regress other cases.

> +		blk_mq_sched_insert_request(rq, /*at_head=*/false,
> +			/*run_queue=*/last, /*async=*/false);

I find this comment style very hard to read.  Yes, maybe the three
bools here should become flags, but this is even worse than just
passing the arguments.
Damien Le Moal Aug. 14, 2022, 5:13 p.m. UTC | #2
On 2022/08/12 23:41, Christoph Hellwig wrote:
> On Fri, Aug 12, 2022 at 02:03:55PM -0700, Bart Van Assche wrote:
>> When a REQ_OP_WRITE | REQ_FUA request is submitted to a zoned storage
>> device, it must be passed to the (mq-deadline) I/O scheduler to ensure
>> that the write happens at the write pointer.
> 
> Yes.
> 
> Maybe I'm stupid, but how is the patch related to fixing that?
> blk_mq_plug_issue_direct is called from blk_mq_flush_plug_list for
> only the !has_elevator case.  How does that change a thing?

And writes to zoned drives never get plugged in the first place, scheduler
present or not.

> 
> Also please include a description of why these changes are otherwise
> good and won't regress other cases.
> 
>> +		blk_mq_sched_insert_request(rq, /*at_head=*/false,
>> +			/*run_queue=*/last, /*async=*/false);
> 
> I find this comment style very hard to read.  Yes, maybe the three
> bools here should become flags, but this is even worse than just
> passing the arguments.
Bart Van Assche Aug. 14, 2022, 11:44 p.m. UTC | #3
On 8/14/22 10:13, Damien Le Moal wrote:
> And writes to zoned drives never get plugged in the first place, scheduler
> present or not.

Hi Damien,

I agree that blk_mq_submit_bio() does not plug writes to zoned drives 
because of the following code in blk_mq_plug():

/* Zoned block device write operation case: do not plug the BIO */
if (bdev_is_zoned(bio->bi_bdev) && op_is_write(bio_op(bio)))
	return NULL;

However, I have not found any code in blk_execute_rq_nowait() that 
causes the plugging mechanism to be skipped for zoned writes. Did I 
perhaps overlook something? The current blk_execute_rq_nowait() 
implementation is as follows:

void blk_execute_rq_nowait(struct request *rq, bool at_head)
{
	WARN_ON(irqs_disabled());
	WARN_ON(!blk_rq_is_passthrough(rq));

	blk_account_io_start(rq);
	if (current->plug)
		blk_add_rq_to_plug(current->plug, rq);
	else
		blk_mq_sched_insert_request(rq, at_head, true, false);
}

Thanks,

Bart.
Pankaj Raghav Aug. 15, 2022, 9:06 a.m. UTC | #4
On Sun, Aug 14, 2022 at 04:44:31PM -0700, Bart Van Assche wrote:
> I agree that blk_mq_submit_bio() does not plug writes to zoned drives
> because of the following code in blk_mq_plug():
> 
> /* Zoned block device write operation case: do not plug the BIO */
> if (bdev_is_zoned(bio->bi_bdev) && op_is_write(bio_op(bio)))
> 	return NULL;
> 
> However, I have not found any code in blk_execute_rq_nowait() that causes
> the plugging mechanism to be skipped for zoned writes. Did I perhaps
> overlook something? The current blk_execute_rq_nowait() implementation is as
> follows:
> 
IIUC, blk_execute_rq_nowait() is mainly used by lower-level drivers to send
commands, but current->plug is not initialized with blk_start_plug() in those
drivers, so the rqs are not added to the plug list.
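
For reference, plugging only takes effect when the submitter sets up a plug
on its own stack, roughly like this (a minimal sketch, not taken from any
particular driver):

	struct blk_plug plug;

	blk_start_plug(&plug);
	/* Requests submitted here may be added to current->plug. */
	...
	blk_finish_plug(&plug);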

I did a quick test with fio with the new uring_cmd IO path that uses
blk_execute_rq_nowait() and it never plugged the rqs.

fio --filename=/dev/ng0n3 --size=128M --rw=write --bs=4k --zonemode=zbd --ioengine=io_uring_cmd --name=zoned

Did you observe otherwise?

But I think it is better if we change current->plug to blk_mq_plug() to
be on the safer side.
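
Something along these lines, perhaps (a hypothetical, untested sketch;
blk_mq_plug() takes a bio, so the zoned-write check is open-coded here):

	blk_account_io_start(rq);
	/* Hypothetical: skip the plug for writes to zoned devices, mirroring
	 * the bio-level check done in blk_mq_plug(). */
	if (current->plug &&
	    !(blk_queue_is_zoned(rq->q) && op_is_write(req_op(rq))))
		blk_add_rq_to_plug(current->plug, rq);
	else
		blk_mq_sched_insert_request(rq, at_head, true, false);
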
> void blk_execute_rq_nowait(struct request *rq, bool at_head)
> {
> 	WARN_ON(irqs_disabled());
> 	WARN_ON(!blk_rq_is_passthrough(rq));
> 
> 	blk_account_io_start(rq);
> 	if (current->plug)
> 		blk_add_rq_to_plug(current->plug, rq);
> 	else
> 		blk_mq_sched_insert_request(rq, at_head, true, false);
> }
> 
> Thanks,
> 
> Bart.
Damien Le Moal Aug. 15, 2022, 4:31 p.m. UTC | #5
On 2022/08/14 16:44, Bart Van Assche wrote:
> On 8/14/22 10:13, Damien Le Moal wrote:
>> And writes to zoned drives never get plugged in the first place, scheduler
>> present or not.
> 
> Hi Damien,
> 
> I agree that blk_mq_submit_bio() does not plug writes to zoned drives 
> because of the following code in blk_mq_plug():
> 
> /* Zoned block device write operation case: do not plug the BIO */
> if (bdev_is_zoned(bio->bi_bdev) && op_is_write(bio_op(bio)))
> 	return NULL;
> 
> However, I have not found any code in blk_execute_rq_nowait() that 
> causes the plugging mechanism to be skipped for zoned writes. Did I 
> perhaps overlook something? The current blk_execute_rq_nowait() 
> implementation is as follows:
> 
> void blk_execute_rq_nowait(struct request *rq, bool at_head)
> {
> 	WARN_ON(irqs_disabled());
> 	WARN_ON(!blk_rq_is_passthrough(rq));
> 
> 	blk_account_io_start(rq);
> 	if (current->plug)
> 		blk_add_rq_to_plug(current->plug, rq);
> 	else
> 		blk_mq_sched_insert_request(rq, at_head, true, false);
> }

As far as I understand it, and from checking the call sites, this is for LLD
internal commands only. I think Pankaj has a good point for a fix to this one,
though I would hate to see an LLD issue a write request.

For f2fs, it seems to me that the problem is more with the code in
block/blk-flush.c where functions bypassing the scheduler are used for writes,
e.g. blk_insert_flush() / blk_mq_request_bypass_insert(). I am not 100% sure,
but that definitely looks very suspicious.
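
The data part of a flush sequence appears to be inserted roughly like this
(paraphrased from blk_flush_complete_seq() in block/blk-flush.c, not
verbatim):

	case REQ_FSEQ_DATA:
		list_move_tail(&rq->queuelist, &fq->flush_data_in_flight);
		/* Inserted directly, bypassing the I/O scheduler. */
		blk_mq_request_bypass_insert(rq, false, true);
		break;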

> 
> Thanks,
> 
> Bart.

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 5ee62b95f3e5..530aad95cc33 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2546,16 +2546,14 @@  static blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last)
 	return __blk_mq_try_issue_directly(rq->mq_hctx, rq, true, last);
 }
 
-static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
+static void blk_mq_plug_issue(struct blk_plug *plug, bool from_schedule)
 {
 	struct blk_mq_hw_ctx *hctx = NULL;
 	struct request *rq;
 	int queued = 0;
-	int errors = 0;
 
 	while ((rq = rq_list_pop(&plug->mq_list))) {
 		bool last = rq_list_empty(plug->mq_list);
-		blk_status_t ret;
 
 		if (hctx != rq->mq_hctx) {
 			if (hctx)
@@ -2563,29 +2561,9 @@  static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule)
 			hctx = rq->mq_hctx;
 		}
 
-		ret = blk_mq_request_issue_directly(rq, last);
-		switch (ret) {
-		case BLK_STS_OK:
-			queued++;
-			break;
-		case BLK_STS_RESOURCE:
-		case BLK_STS_DEV_RESOURCE:
-			blk_mq_request_bypass_insert(rq, false, last);
-			blk_mq_commit_rqs(hctx, &queued, from_schedule);
-			return;
-		default:
-			blk_mq_end_request(rq, ret);
-			errors++;
-			break;
-		}
+		blk_mq_sched_insert_request(rq, /*at_head=*/false,
+			/*run_queue=*/last, /*async=*/false);
 	}
-
-	/*
-	 * If we didn't flush the entire list, we could have told the driver
-	 * there was more coming, but that turned out to be a lie.
-	 */
-	if (errors)
-		blk_mq_commit_rqs(hctx, &queued, from_schedule);
 }
 
 static void __blk_mq_flush_plug_list(struct request_queue *q,
@@ -2655,8 +2633,7 @@  void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 				return;
 		}
 
-		blk_mq_run_dispatch_ops(q,
-				blk_mq_plug_issue_direct(plug, false));
+		blk_mq_run_dispatch_ops(q, blk_mq_plug_issue(plug, false));
 		if (rq_list_empty(plug->mq_list))
 			return;
 	}