Message ID | 20180410215456.GD793541@devbig577.frc2.facebook.com (mailing list archive)
---|---
State | New, archived
On Tue, 2018-04-10 at 14:54 -0700, tj@kernel.org wrote:
> Ah, yeah, I was moving it out of add_timer but forgot to actually add
> it to the issue path. Fixed patch below.
>
> BTW, no matter what we do w/ the request handover between normal and
> timeout paths, we'd need something similar. Otherwise, while we can
> reduce the window, we can't get rid of it.

(+Martin Steigerwald)

Hello Tejun,

Thank you for having shared this patch. It looks interesting to me. What I
know about the blk-mq timeout handling is as follows:
* Nobody who has reviewed the blk-mq timeout handling code with this patch
  applied has reported any shortcomings for that code.
* However, several people have reported kernel crashes that disappear when
  the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts
  memory upon timeout"
  (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
  and also to a "RIP: scsi_times_out+0x17" crash during boot
  (https://bugzilla.kernel.org/show_bug.cgi?id=199077).

So we have the choice between two approaches:
(1) apply the patch from your previous e-mail, then root-cause and fix the
    crashes referred to above.
(2) apply a patch that makes the crashes reported against v4.16 disappear,
    and remove the atomic instructions introduced by such a patch at a
    later time.

Since crashes have been reported for kernel v4.16, I think we should follow
approach (2). That will remove the time pressure from root-causing and
fixing the crashes reported for the NVMeOF initiator and SCSI initiator
drivers.

Thanks,

Bart.
Hello, Bart.

On Wed, Apr 11, 2018 at 12:50:51PM +0000, Bart Van Assche wrote:
> Thank you for having shared this patch. It looks interesting to me. What I
> know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this patch
>   applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear when
>   the blk-mq timeout code is reworked. I'm referring to "nvme-rdma
>   corrupts memory upon timeout"
>   (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
>   and also to a "RIP: scsi_times_out+0x17" crash during boot
>   (https://bugzilla.kernel.org/show_bug.cgi?id=199077).
>
> So we have the choice between two approaches:
> (1) apply the patch from your previous e-mail and root-cause and fix the
>     crashes referred to above.
> (2) apply a patch that makes the crashes reported against v4.16 disappear
>     and remove the atomic instructions introduced by such a patch at a
>     later time.
>
> Since crashes have been reported for kernel v4.16 I think we should follow
> approach (2). That will remove the time pressure from root-causing and
> fixing the crashes reported for the NVMeOF initiator and SCSI initiator
> drivers.

So, it really bothers me how blind we're going about this. It isn't an
insurmountable emergency that we have to adopt whatever solution which
passed a couple of tests this minute. We can debug and root-cause this
properly and pick the right solution. We even have the two most likely
causes already analysed and patches proposed, one of them months ago.

If we wanna change the handover model, let's do that because the new one is
better, not because of vague fear.

Thanks.
> Ah, yeah, I was moving it out of add_timer but forgot to actually add
> it to the issue path. Fixed patch below.
>
> BTW, no matter what we do w/ the request handover between normal and
> timeout paths, we'd need something similar. Otherwise, while we can
> reduce the window, we can't get rid of it.
>
> Thanks.

Just noticed this one, this looks interesting to me as well. Israel,
can you run your test with this patch?

> ---
>  block/blk-mq.c         | 45 ++++++++++++++++++++++++++++++++-------------
>  include/linux/blkdev.h |  2 ++
>  2 files changed, 34 insertions(+), 13 deletions(-)
>
> [ ... ]
>
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -227,6 +227,8 @@ struct request {
>
>  	unsigned int extra_len;	/* length of alignment and padding */
>
> +	bool missed_completion;
> +

Would be nicer if we can flag this somewhere instead of adding a hole to
struct request...
On 4/11/2018 5:24 PM, Sagi Grimberg wrote:
>
>> Ah, yeah, I was moving it out of add_timer but forgot to actually add
>> it to the issue path. Fixed patch below.
>>
>> BTW, no matter what we do w/ the request handover between normal and
>> timeout paths, we'd need something similar. Otherwise, while we can
>> reduce the window, we can't get rid of it.
>>
>> Thanks.
>
> Just noticed this one, this looks interesting to me as well. Israel,
> can you run your test with this patch?

Yes, I just did and it looks good.

>> [ ... ]
>>
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -227,6 +227,8 @@ struct request {
>>
>>  	unsigned int extra_len;	/* length of alignment and padding */
>>
>> +	bool missed_completion;
>> +
>
> Would be nicer if we can flag this somewhere instead of adding a hole to
> struct request...
Hello, Israel.

On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> > Just noticed this one, this looks interesting to me as well. Israel,
> > can you run your test with this patch?
>
> Yes, I just did and it looks good.

Awesome.

> >> +++ b/include/linux/blkdev.h
> >> @@ -227,6 +227,8 @@ struct request {
> >>  	unsigned int extra_len;	/* length of alignment and padding */
> >>
> >> +	bool missed_completion;
> >> +
> >
> > Would be nicer if we can flag this somewhere instead of adding a hole
> > to struct request...

I missed it before. It's actually being put in an existing hole, so the
struct size stays the same before and after. It's a bit of cheating because
this is one of the two holes which can be removed by swapping two fields.

Re. making it a flag: regardless of whether this is a flag or a separate
field, we need to add a new field, because there currently is no field
which can be modified by the party who doesn't own the request. So if we
make it a flag, we need to add something like an unsigned long atom_flags.

Thanks.
Bart Van Assche - 11.04.18, 14:50:
> On Tue, 2018-04-10 at 14:54 -0700, tj@kernel.org wrote:
> > Ah, yeah, I was moving it out of add_timer but forgot to actually add
> > it to the issue path. Fixed patch below.
> >
> > BTW, no matter what we do w/ the request handover between normal and
> > timeout paths, we'd need something similar. Otherwise, while we can
> > reduce the window, we can't get rid of it.
>
> (+Martin Steigerwald)
[…]
> Thank you for having shared this patch. It looks interesting to me.
> What I know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this
>   patch applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear
>   when the blk-mq timeout code is reworked. I'm referring to "nvme-rdma
>   corrupts memory upon timeout"
>   (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
>   and also to a "RIP: scsi_times_out+0x17" crash during boot
>   (https://bugzilla.kernel.org/show_bug.cgi?id=199077).

Yes, with the three patches:

- '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a request.mbox'
- '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into rcu_read_{lock,unlock}().mbox'
- '[PATCH v4] blk-mq_Fix race conditions in request timeout handling.mbox'

the occasional hangs on some boots / resumes from hibernation appear to be
gone. It also appears that the error loading SMART data issue is gone as
well (see my bug report). However, it is still too early to say for sure. I
think I need at least 2-3 days of additional testing with this kernel to be
sufficiently sure about it.

However… I could also test another patch, but from reading the rest of this
thread so far it is not clear to me whether to try one of the new patches
and, if so, which one, and whether to add it on top of the patches I
already applied or use it as a replacement for them.

So while doing a training this and next week I can apply a patch here and
there, but I won't have much time to read the complete discussion to figure
out what to apply.

Personally, as a stable kernel has been released with those issues, I think
it's good to fix them soon. On the other hand, it may take quite some time
until popular distros carry 4.16 for regular users. And I have no idea how
frequent the reported issues are, i.e. how many users would be affected.

Thanks,
Hey, again.

On Wed, Apr 11, 2018 at 10:07:33AM -0700, tj@kernel.org wrote:
> Hello, Israel.
>
> On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> > > Just noticed this one, this looks interesting to me as well. Israel,
> > > can you run your test with this patch?
> >
> > Yes, I just did and it looks good.
>
> Awesome.

Just to be sure, you tested the combined patch and saw the XXX debug
messages, right?

Thanks.
On 4/12/2018 12:31 AM, tj@kernel.org wrote:
> Hey, again.
>
> On Wed, Apr 11, 2018 at 10:07:33AM -0700, tj@kernel.org wrote:
>> Hello, Israel.
>>
>> On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
>>>> Just noticed this one, this looks interesting to me as well. Israel,
>>>> can you run your test with this patch?
>>> Yes, I just did and it looks good.
>> Awesome.
> Just to be sure, you tested the combined patch and saw the XXX debug
> messages, right?

I am not sure I understand. What is the combined patch? What are the debug
messages that I need to see?

> Thanks.
On 04/10/18 14:54, tj@kernel.org wrote:
> On Tue, Apr 10, 2018 at 09:46:23PM +0000, Bart Van Assche wrote:
>> On Tue, 2018-04-10 at 14:33 -0700, tj@kernel.org wrote:
>>> +	else
>>> +		rq->missed_completion = true;
>>
>> In this patch I found code that sets rq->missed_completion but no code
>> that clears it. Did I perhaps miss something?
>
> Ah, yeah, I was moving it out of add_timer but forgot to actually add
> it to the issue path. Fixed patch below.
>
> BTW, no matter what we do w/ the request handover between normal and
> timeout paths, we'd need something similar. Otherwise, while we can
> reduce the window, we can't get rid of it.
>
> [ ... ]

Hello Tejun,

This patch no longer applies cleanly on Jens' tree. Can you rebase and
repost it?

Thanks,

Bart.
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else
+		rq->missed_completion = true;
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
 	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 	blk_add_timer(rq);
+	rq->missed_completion = false;
 
 	write_seqcount_end(&rq->gstate_seq);
 	preempt_enable();
@@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct 
 	}
 }
 
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
 	struct blk_mq_hw_ctx *hctx;
 	bool has_rcu = false;
@@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
 		else
 			synchronize_srcu(hctx->srcu);
 
-		hctx->need_sync_rcu = false;
+		if (clear)
+			hctx->need_sync_rcu = false;
 	}
 	if (has_rcu)
 		synchronize_rcu();
@@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
 		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }
 
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	/*
 	 * @rq's timer reset has gone through rcu synchronization and is
 	 * visible now.  Allow normal completions again by resetting
 	 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-	 * there's no memory ordering around ->aborted_gstate making it the
-	 * only field safe to update.  Let blk_add_timer() clear it later
-	 * when the request is recycled or times out again.
-	 *
-	 * As nothing prevents from completion happening while
-	 * ->aborted_gstate is set, this may lead to ignored completions
-	 * and further spurious timeouts.
+	 * blk_mq_timeout_reset_cleanup() needs it again and there's no
+	 * memory ordering around ->aborted_gstate making it the only field
+	 * safe to update.  Let blk_add_timer() clear it later when the
+	 * request is recycled or times out again.
 	 */
 	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
 		blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * @rq is now fully returned to the normal path.  If normal
+	 * completion raced timeout handling, execute the missed completion
+	 * here.  This is safe because 1. ->missed_completion can no longer
+	 * be asserted because nothing is timing out right now and 2. if
+	 * ->missed_completion is set, @rq is safe from recycling because
+	 * nobody could have completed it.
+	 */
+	if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+		blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
 	 * becomes a problem, we can add per-hw_ctx rcu_head and
 	 * wait in parallel.
 	 */
-	blk_mq_timeout_sync_rcu(q);
+	blk_mq_timeout_sync_rcu(q, true);
 
 	/* terminate the ones we won */
 	blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
 	 * reset racing against recycling.
 	 */
 	if (nr_resets) {
-		blk_mq_timeout_sync_rcu(q);
+		blk_mq_timeout_sync_rcu(q, false);
+		blk_mq_queue_tag_busy_iter(q,
+				blk_mq_timeout_reset_return, NULL);
+		blk_mq_timeout_sync_rcu(q, true);
 		blk_mq_queue_tag_busy_iter(q,
-				blk_mq_finish_timeout_reset, NULL);
+				blk_mq_timeout_reset_cleanup, NULL);
 	}
 }
 
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -227,6 +227,8 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
+	bool missed_completion;
+
 	/*
 	 * On blk-mq, the lower bits of ->gstate (generation number and
 	 * state) carry the MQ_RQ_* state value and the upper bits the