Message ID | 20220812210355.2252143-1-bvanassche@acm.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | block: Submit flush requests to the I/O scheduler | expand |
On Fri, Aug 12, 2022 at 02:03:55PM -0700, Bart Van Assche wrote: > When submitting a REQ_OP_WRITE | REQ_FUA request to a zoned storage > device, these requests must be passed to the (mq-deadline) I/O scheduler > to ensure that these happen at the write pointer. Yes. But maybe I'm stupid, but how is the patch related to fixing that? blk_mq_plug_issue_direct is called from blk_mq_flush_plug_list for only the !has_elevator case. How does that change a thing? Also please include a description of why these changes are otherwise good and won't regress other cases. > + blk_mq_sched_insert_request(rq, /*at_head=*/false, > + /*run_queue=*/last, /*async=*/false); I find thise comment style very hard to read. Yes, maybe the three bools here should become flags, but this is even worse than just passing the arguments.
On 2022/08/12 23:41, Christoph Hellwig wrote: > On Fri, Aug 12, 2022 at 02:03:55PM -0700, Bart Van Assche wrote: >> When submitting a REQ_OP_WRITE | REQ_FUA request to a zoned storage >> device, these requests must be passed to the (mq-deadline) I/O scheduler >> to ensure that these happen at the write pointer. > > Yes. > > But maybe I'm stupid, but how is the patch related to fixing that? > blk_mq_plug_issue_direct is called from blk_mq_flush_plug_list for > only the !has_elevator case. How does that change a thing? And writes to zoned drives never get plugged in the first place, scheduler present or not. > > Also please include a description of why these changes are otherwise > good and won't regress other cases. > >> + blk_mq_sched_insert_request(rq, /*at_head=*/false, >> + /*run_queue=*/last, /*async=*/false); > > I find thise comment style very hard to read. Yes, maybe the three > bools here should become flags, but this is even worse than just > passing the arguments.
On 8/14/22 10:13, Damien Le Moal wrote: > And writes to zoned drives never get plugged in the first place, scheduler > present or not. Hi Damien, I agree that blk_mq_submit_bio() does not plug writes to zoned drives because of the following code in blk_mq_plug(): /* Zoned block device write operation case: do not plug the BIO */ if (bdev_is_zoned(bio->bi_bdev) && op_is_write(bio_op(bio))) return NULL; However, I have not found any code in blk_execute_rq_nowait() that causes the plugging mechanism to be skipped for zoned writes. Did I perhaps overlook something? The current blk_execute_rq_nowait() implementation is as follows: void blk_execute_rq_nowait(struct request *rq, bool at_head) { WARN_ON(irqs_disabled()); WARN_ON(!blk_rq_is_passthrough(rq)); blk_account_io_start(rq); if (current->plug) blk_add_rq_to_plug(current->plug, rq); else blk_mq_sched_insert_request(rq, at_head, true, false); } Thanks, Bart.
On Sun, Aug 14, 2022 at 04:44:31PM -0700, Bart Van Assche wrote: > I agree that blk_mq_submit_bio() does not plug writes to zoned drives > because of the following code in blk_mq_plug(): > > /* Zoned block device write operation case: do not plug the BIO */ > if (bdev_is_zoned(bio->bi_bdev) && op_is_write(bio_op(bio))) > return NULL; > > However, I have not found any code in blk_execute_rq_nowait() that causes > the plugging mechanism to be skipped for zoned writes. Did I perhaps > overlook something? The current blk_execute_rq_nowait() implementation is as > follows: > IIUC, blk_execute_rq_nowait() is used mainly by lower level drivers to send commands but current->plug is not initialized with blk_start_plug() in those drivers. So, the rqs are not added to the plug list. I did a quick test with fio with the new uring_cmd IO path that uses blk_execute_rq_nowait() and it never plugged the rqs. fio --filename=/dev/ng0n3 --size=128M --rw=write --bs=4k --zonemode=zbd --ioengine=io_uring_cmd --name=zoned Did you notice it otherwise? But I think it is better if we change current->plug to blk_mq_plug() to be on the safer side. > void blk_execute_rq_nowait(struct request *rq, bool at_head) > { > WARN_ON(irqs_disabled()); > WARN_ON(!blk_rq_is_passthrough(rq)); > > blk_account_io_start(rq); > if (current->plug) > blk_add_rq_to_plug(current->plug, rq); > else > blk_mq_sched_insert_request(rq, at_head, true, false); > } > > Thanks, > > Bart.
On 2022/08/14 16:44, Bart Van Assche wrote: > On 8/14/22 10:13, Damien Le Moal wrote: >> And writes to zoned drives never get plugged in the first place, scheduler >> present or not. > > Hi Damien, > > I agree that blk_mq_submit_bio() does not plug writes to zoned drives > because of the following code in blk_mq_plug(): > > /* Zoned block device write operation case: do not plug the BIO */ > if (bdev_is_zoned(bio->bi_bdev) && op_is_write(bio_op(bio))) > return NULL; > > However, I have not found any code in blk_execute_rq_nowait() that > causes the plugging mechanism to be skipped for zoned writes. Did I > perhaps overlook something? The current blk_execute_rq_nowait() > implementation is as follows: > > void blk_execute_rq_nowait(struct request *rq, bool at_head) > { > WARN_ON(irqs_disabled()); > WARN_ON(!blk_rq_is_passthrough(rq)); > > blk_account_io_start(rq); > if (current->plug) > blk_add_rq_to_plug(current->plug, rq); > else > blk_mq_sched_insert_request(rq, at_head, true, false); > } As far as I understand it, and checking the call sites, this is for LLD internal commands only. And I think Pankaj has a good point for a fix to this one. Though I would hate to see an LLD issue a write request though. For f2fs, it seems to me that the problem is more with the code in block/blk-flush.c where functions bypassing the scheduler are used for writes, e.g. blk_insert_flush() / blk_mq_request_bypass_insert(). I am not 100% sure though but that definitely looks very suspicious. > > Thanks, > > Bart.
diff --git a/block/blk-mq.c b/block/blk-mq.c index 5ee62b95f3e5..530aad95cc33 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -2546,16 +2546,14 @@ static blk_status_t blk_mq_request_issue_directly(struct request *rq, bool last) return __blk_mq_try_issue_directly(rq->mq_hctx, rq, true, last); } -static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule) +static void blk_mq_plug_issue(struct blk_plug *plug, bool from_schedule) { struct blk_mq_hw_ctx *hctx = NULL; struct request *rq; int queued = 0; - int errors = 0; while ((rq = rq_list_pop(&plug->mq_list))) { bool last = rq_list_empty(plug->mq_list); - blk_status_t ret; if (hctx != rq->mq_hctx) { if (hctx) @@ -2563,29 +2561,9 @@ static void blk_mq_plug_issue_direct(struct blk_plug *plug, bool from_schedule) hctx = rq->mq_hctx; } - ret = blk_mq_request_issue_directly(rq, last); - switch (ret) { - case BLK_STS_OK: - queued++; - break; - case BLK_STS_RESOURCE: - case BLK_STS_DEV_RESOURCE: - blk_mq_request_bypass_insert(rq, false, last); - blk_mq_commit_rqs(hctx, &queued, from_schedule); - return; - default: - blk_mq_end_request(rq, ret); - errors++; - break; - } + blk_mq_sched_insert_request(rq, /*at_head=*/false, + /*run_queue=*/last, /*async=*/false); } - - /* - * If we didn't flush the entire list, we could have told the driver - * there was more coming, but that turned out to be a lie. - */ - if (errors) - blk_mq_commit_rqs(hctx, &queued, from_schedule); } static void __blk_mq_flush_plug_list(struct request_queue *q, @@ -2655,8 +2633,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule) return; } - blk_mq_run_dispatch_ops(q, - blk_mq_plug_issue_direct(plug, false)); + blk_mq_run_dispatch_ops(q, blk_mq_plug_issue(plug, false)); if (rq_list_empty(plug->mq_list)) return; }
When submitting a REQ_OP_WRITE | REQ_FUA request to a zoned storage device, these requests must be passed to the (mq-deadline) I/O scheduler to ensure that these happen at the write pointer. It has been verfied that this patch prevents that write pointer violations happen sporadically when f2fs is using a zoned storage device. Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> --- block/blk-mq.c | 31 ++++--------------------------- 1 file changed, 4 insertions(+), 27 deletions(-)