
[v4] blk-mq: Fix race conditions in request timeout handling

Message ID 20180410215456.GD793541@devbig577.frc2.facebook.com (mailing list archive)
State New, archived

Commit Message

Tejun Heo April 10, 2018, 9:54 p.m. UTC
On Tue, Apr 10, 2018 at 09:46:23PM +0000, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 14:33 -0700, tj@kernel.org wrote:
> > +       else
> > +               rq->missed_completion = true;
> 
> In this patch I found code that sets rq->missed_completion but no code that
> clears it. Did I perhaps miss something?

Ah, yeah, I was moving it out of add_timer but forgot to actually add
it to the issue path.  Fixed patch below.

BTW, no matter what we do w/ the request handover between normal and
timeout paths, we'd need something similar.  Otherwise, while we can
reduce the window, we can't get rid of it.
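
A minimal userspace model of the window described above, assuming simplified
types (names mirror the patch, but this is an illustrative sketch, not the
actual blk-mq code):

/*
 * Stand-alone model of the completion handover: if a normal completion
 * arrives while the timeout path owns the request (aborted_gstate ==
 * gstate), it is recorded in missed_completion and replayed once the
 * timeout path hands the request back.
 */
#include <stdbool.h>
#include <stdio.h>

struct model_rq {
	unsigned long gstate;		/* generation/state owned by issuer */
	unsigned long aborted_gstate;	/* set by the timeout path */
	bool missed_completion;		/* field added by this patch */
};

/* Modeled after blk_mq_complete_request() with this patch applied. */
static void model_complete(struct model_rq *rq)
{
	if (rq->aborted_gstate != rq->gstate)
		printf("completed normally\n");
	else
		rq->missed_completion = true;	/* timeout path owns rq */
}

/* Modeled after blk_mq_timeout_reset_cleanup(). */
static void model_timeout_cleanup(struct model_rq *rq)
{
	if (rq->missed_completion) {
		rq->missed_completion = false;
		printf("replaying missed completion\n");
	}
}

int main(void)
{
	struct model_rq rq = { .gstate = 1 };

	rq.aborted_gstate = rq.gstate;	/* timeout path marks rq aborted */
	model_complete(&rq);		/* completion races in, gets recorded */
	model_timeout_cleanup(&rq);	/* timeout path replays it later */
	return 0;
}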

Thanks.

---
 block/blk-mq.c         |   45 ++++++++++++++++++++++++++++++++-------------
 include/linux/blkdev.h |    2 ++
 2 files changed, 34 insertions(+), 13 deletions(-)

Comments

Bart Van Assche April 11, 2018, 12:50 p.m. UTC | #1
On Tue, 2018-04-10 at 14:54 -0700, tj@kernel.org wrote:
> Ah, yeah, I was moving it out of add_timer but forgot to actually add
> it to the issue path.  Fixed patch below.
> 
> BTW, no matter what we do w/ the request handover between normal and
> timeout paths, we'd need something similar.  Otherwise, while we can
> reduce the window, we can't get rid of it.

(+Martin Steigerwald)

Hello Tejun,

Thank you for having shared this patch. It looks interesting to me. What I
know about the blk-mq timeout handling is as follows:
* Nobody who has reviewed the blk-mq timeout handling code with this patch
  applied has reported any shortcomings for that code.
* However, several people have reported kernel crashes that disappear when
  the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts
  memory upon timeout"
  (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
  and also to a "RIP: scsi_times_out+0x17" crash during boot
  (https://bugzilla.kernel.org/show_bug.cgi?id=199077).

So we have the choice between two approaches:
(1) apply the patch from your previous e-mail and root-cause and fix the
    crashes referred to above.
(2) apply a patch that makes the crashes reported against v4.16 disappear and
    remove the atomic instructions introduced by such a patch at a later time.

Since crashes have been reported for kernel v4.16 I think we should follow
approach (2). That will remove the time pressure from root-causing and fixing
the crashes reported for the NVMeOF initiator and SCSI initiator drivers.

Thanks,

Bart.
Tejun Heo April 11, 2018, 2:16 p.m. UTC | #2
Hello, Bart.

On Wed, Apr 11, 2018 at 12:50:51PM +0000, Bart Van Assche wrote:
> Thank you for having shared this patch. It looks interesting to me. What I
> know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this patch
>   applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear when
>   the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts
>   memory upon timeout"
>   (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
>   and also to a "RIP: scsi_times_out+0x17" crash during boot
>   (https://bugzilla.kernel.org/show_bug.cgi?id=199077).
> 
> So we have the choice between two approaches:
> (1) apply the patch from your previous e-mail and root-cause and fix the
>     crashes referred to above.
> (2) apply a patch that makes the crashes reported against v4.16 disappear and
>     remove the atomic instructions introduced by such a patch at a later time.
> 
> Since crashes have been reported for kernel v4.16 I think we should follow
> approach (2). That will remove the time pressure from root-causing and fixing
> the crashes reported for the NVMeOF initiator and SCSI initiator drivers.

So, it really bothers me how blindly we're going about this.  This
isn't an insurmountable emergency where we have to adopt whatever
solution passed a couple of tests this minute.  We can debug and
root-cause this properly and pick the right solution.  We even have
the two most likely causes already analysed and patches proposed, one
of them months ago.  If we want to change the handover model, let's do
that because the new one is better, not because of vague fear.

Thanks.
Sagi Grimberg April 11, 2018, 2:24 p.m. UTC | #3
> Ah, yeah, I was moving it out of add_timer but forgot to actually add
> it to the issue path.  Fixed patch below.
> 
> BTW, no matter what we do w/ the request handover between normal and
> timeout paths, we'd need something similar.  Otherwise, while we can
> reduce the window, we can't get rid of it.
> 
> Thanks.

Just noticed this one, this looks interesting to me as well. Israel,
can you run your test with this patch?

> 
> ---
>   block/blk-mq.c         |   45 ++++++++++++++++++++++++++++++++-------------
>   include/linux/blkdev.h |    2 ++
>   2 files changed, 34 insertions(+), 13 deletions(-)
> 
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
>   	hctx_lock(hctx, &srcu_idx);
>   	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>   		__blk_mq_complete_request(rq);
> +	else
> +		rq->missed_completion = true;
>   	hctx_unlock(hctx, srcu_idx);
>   }
>   EXPORT_SYMBOL(blk_mq_complete_request);
> @@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
>   
>   	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>   	blk_add_timer(rq);
> +	rq->missed_completion = false;
>   
>   	write_seqcount_end(&rq->gstate_seq);
>   	preempt_enable();
> @@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
>   	}
>   }
>   
> -static void blk_mq_timeout_sync_rcu(struct request_queue *q)
> +static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
>   {
>   	struct blk_mq_hw_ctx *hctx;
>   	bool has_rcu = false;
> @@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
>   		else
>   			synchronize_srcu(hctx->srcu);
>   
> -		hctx->need_sync_rcu = false;
> +		if (clear)
> +			hctx->need_sync_rcu = false;
>   	}
>   	if (has_rcu)
>   		synchronize_rcu();
> @@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
>   		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
>   }
>   
> -static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
> +static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
>   		struct request *rq, void *priv, bool reserved)
>   {
>   	/*
>   	 * @rq's timer reset has gone through rcu synchronization and is
>   	 * visible now.  Allow normal completions again by resetting
>   	 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
> -	 * there's no memory ordering around ->aborted_gstate making it the
> -	 * only field safe to update.  Let blk_add_timer() clear it later
> -	 * when the request is recycled or times out again.
> -	 *
> -	 * As nothing prevents from completion happening while
> -	 * ->aborted_gstate is set, this may lead to ignored completions
> -	 * and further spurious timeouts.
> +	 * blk_mq_timeout_reset_cleanup() needs it again and there's no
> +	 * memory ordering around ->aborted_gstate making it the only field
> +	 * safe to update.  Let blk_add_timer() clear it later when the
> +	 * request is recycled or times out again.
>   	 */
>   	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
>   		blk_mq_rq_update_aborted_gstate(rq, 0);
>   }
>   
> +static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
> +		struct request *rq, void *priv, bool reserved)
> +{
> +	/*
> +	 * @rq is now fully returned to the normal path.  If normal
> +	 * completion raced timeout handling, execute the missed completion
> +	 * here.  This is safe because 1. ->missed_completion can no longer
> +	 * be asserted because nothing is timing out right now and 2. if
> +	 * ->missed_completion is set, @rq is safe from recycling because
> +	 * nobody could have completed it.
> +	 */
> +	if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
> +		blk_mq_complete_request(rq);
> +}
> +
>   static void blk_mq_timeout_work(struct work_struct *work)
>   {
>   	struct request_queue *q =
> @@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
>   		 * becomes a problem, we can add per-hw_ctx rcu_head and
>   		 * wait in parallel.
>   		 */
> -		blk_mq_timeout_sync_rcu(q);
> +		blk_mq_timeout_sync_rcu(q, true);
>   
>   		/* terminate the ones we won */
>   		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
> @@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
>   		 * reset racing against recycling.
>   		 */
>   		if (nr_resets) {
> -			blk_mq_timeout_sync_rcu(q);
> +			blk_mq_timeout_sync_rcu(q, false);
> +			blk_mq_queue_tag_busy_iter(q,
> +					blk_mq_timeout_reset_return, NULL);
> +			blk_mq_timeout_sync_rcu(q, true);
>   			blk_mq_queue_tag_busy_iter(q,
> -					blk_mq_finish_timeout_reset, NULL);
> +					blk_mq_timeout_reset_cleanup, NULL);
>   		}
>   	}
>   
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -227,6 +227,8 @@ struct request {
>   
>   	unsigned int extra_len;	/* length of alignment and padding */
>   
> +	bool missed_completion;
> +

Would be nicer if we can flag this somewhere instead of adding a hole to
struct request...
Israel Rukshin April 11, 2018, 4:16 p.m. UTC | #4
On 4/11/2018 5:24 PM, Sagi Grimberg wrote:
>
>> Ah, yeah, I was moving it out of add_timer but forgot to actually add
>> it to the issue path.  Fixed patch below.
>>
>> BTW, no matter what we do w/ the request handover between normal and
>> timeout paths, we'd need something similar.  Otherwise, while we can
>> reduce the window, we can't get rid of it.
>>
>> Thanks.
>
> Just noticed this one, this looks interesting to me as well. Israel,
> can you run your test with this patch?

Yes, I just did and it looks good.

>
>>
>> ---
>>   block/blk-mq.c         |   45 ++++++++++++++++++++++++++++++++-------------
>>   include/linux/blkdev.h |    2 ++
>>   2 files changed, 34 insertions(+), 13 deletions(-)
>>
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
>>       hctx_lock(hctx, &srcu_idx);
>>       if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>>           __blk_mq_complete_request(rq);
>> +    else
>> +        rq->missed_completion = true;
>>       hctx_unlock(hctx, srcu_idx);
>>   }
>>   EXPORT_SYMBOL(blk_mq_complete_request);
>> @@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
>>         blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>>       blk_add_timer(rq);
>> +    rq->missed_completion = false;
>>         write_seqcount_end(&rq->gstate_seq);
>>       preempt_enable();
>> @@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
>>       }
>>   }
>>   -static void blk_mq_timeout_sync_rcu(struct request_queue *q)
>> +static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
>>   {
>>       struct blk_mq_hw_ctx *hctx;
>>       bool has_rcu = false;
>> @@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
>>           else
>>               synchronize_srcu(hctx->srcu);
>>   -        hctx->need_sync_rcu = false;
>> +        if (clear)
>> +            hctx->need_sync_rcu = false;
>>       }
>>       if (has_rcu)
>>           synchronize_rcu();
>> @@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
>>           blk_mq_rq_timed_out(hctx, rq, priv, reserved);
>>   }
>>   -static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
>> +static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
>>           struct request *rq, void *priv, bool reserved)
>>   {
>>       /*
>>        * @rq's timer reset has gone through rcu synchronization and is
>>        * visible now.  Allow normal completions again by resetting
>>        * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
>> -     * there's no memory ordering around ->aborted_gstate making it the
>> -     * only field safe to update.  Let blk_add_timer() clear it later
>> -     * when the request is recycled or times out again.
>> -     *
>> -     * As nothing prevents from completion happening while
>> -     * ->aborted_gstate is set, this may lead to ignored completions
>> -     * and further spurious timeouts.
>> +     * blk_mq_timeout_reset_cleanup() needs it again and there's no
>> +     * memory ordering around ->aborted_gstate making it the only field
>> +     * safe to update.  Let blk_add_timer() clear it later when the
>> +     * request is recycled or times out again.
>>        */
>>       if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
>>           blk_mq_rq_update_aborted_gstate(rq, 0);
>>   }
>>   +static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
>> +        struct request *rq, void *priv, bool reserved)
>> +{
>> +    /*
>> +     * @rq is now fully returned to the normal path.  If normal
>> +     * completion raced timeout handling, execute the missed completion
>> +     * here.  This is safe because 1. ->missed_completion can no longer
>> +     * be asserted because nothing is timing out right now and 2. if
>> +     * ->missed_completion is set, @rq is safe from recycling because
>> +     * nobody could have completed it.
>> +     */
>> +    if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
>> +        blk_mq_complete_request(rq);
>> +}
>> +
>>   static void blk_mq_timeout_work(struct work_struct *work)
>>   {
>>       struct request_queue *q =
>> @@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
>>            * becomes a problem, we can add per-hw_ctx rcu_head and
>>            * wait in parallel.
>>            */
>> -        blk_mq_timeout_sync_rcu(q);
>> +        blk_mq_timeout_sync_rcu(q, true);
>>             /* terminate the ones we won */
>>           blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
>> @@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
>>            * reset racing against recycling.
>>            */
>>           if (nr_resets) {
>> -            blk_mq_timeout_sync_rcu(q);
>> +            blk_mq_timeout_sync_rcu(q, false);
>> +            blk_mq_queue_tag_busy_iter(q,
>> +                    blk_mq_timeout_reset_return, NULL);
>> +            blk_mq_timeout_sync_rcu(q, true);
>>               blk_mq_queue_tag_busy_iter(q,
>> -                    blk_mq_finish_timeout_reset, NULL);
>> +                    blk_mq_timeout_reset_cleanup, NULL);
>>           }
>>       }
>>   --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -227,6 +227,8 @@ struct request {
>>         unsigned int extra_len;    /* length of alignment and padding */
>>   +    bool missed_completion;
>> +
>
> Would be nicer if we can flag this somewhere instead of adding a hole to
> struct request...
Tejun Heo April 11, 2018, 5:07 p.m. UTC | #5
Hello, Israel.

On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> >Just noticed this one, this looks interesting to me as well. Israel,
> >can you run your test with this patch?
> 
> Yes, I just did and it looks good.

Awesome.

> >>+++ b/include/linux/blkdev.h
> >>@@ -227,6 +227,8 @@ struct request {
> >>        unsigned int extra_len;    /* length of alignment and padding */
> >>  +    bool missed_completion;
> >>+
> >
> >Would be nicer if we can flag this somewhere instead of adding a hole to
> >struct request...

I missed it before.  It's actually being put in an existing hole, so
the struct size stays the same before and after.  It's a bit of
cheating because this is one of the two holes which could be removed
by swapping two fields.

Re. making it a flag: regardless of whether this is a flag or a
separate field, we need to add a new field, because there currently is
no field which can be modified by a party that doesn't own the
request.  So if we make it a flag, we'd need to add something like an
unsigned long atom_flags.
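
A userspace sketch of the alternative described above, assuming a
hypothetical atom_flags field and bit name (RQ_ATOM_MISSED_COMPLETION);
it only illustrates why an atomically-updated word would be needed for a
non-owner to set a flag, and is not a proposal for the actual struct
request layout:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define RQ_ATOM_MISSED_COMPLETION	(1UL << 0)

struct model_rq {
	atomic_ulong atom_flags;	/* hypothetical new field */
};

/* Completion path: record the missed completion without owning rq. */
static void model_mark_missed(struct model_rq *rq)
{
	atomic_fetch_or(&rq->atom_flags, RQ_ATOM_MISSED_COMPLETION);
}

/* Timeout path: consume the flag exactly once. */
static bool model_test_and_clear_missed(struct model_rq *rq)
{
	unsigned long old;

	old = atomic_fetch_and(&rq->atom_flags, ~RQ_ATOM_MISSED_COMPLETION);
	return old & RQ_ATOM_MISSED_COMPLETION;
}

int main(void)
{
	struct model_rq rq;

	atomic_init(&rq.atom_flags, 0);
	model_mark_missed(&rq);
	printf("missed=%d\n", model_test_and_clear_missed(&rq));	/* 1 */
	printf("missed=%d\n", model_test_and_clear_missed(&rq));	/* 0 */
	return 0;
}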

Thanks.
Martin Steigerwald April 11, 2018, 6:38 p.m. UTC | #6
Bart Van Assche - 11.04.18, 14:50:
> On Tue, 2018-04-10 at 14:54 -0700, tj@kernel.org wrote:
> > Ah, yeah, I was moving it out of add_timer but forgot to actually add
> > it to the issue path.  Fixed patch below.
> > 
> > BTW, no matter what we do w/ the request handover between normal and
> > timeout paths, we'd need something similar.  Otherwise, while we can
> > reduce the window, we can't get rid of it.
> 
> (+Martin Steigerwald)
[…]
> Thank you for having shared this patch. It looks interesting to me.
> What I know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this
> patch applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear
> when the blk-mq timeout code is reworked. I'm referring to "nvme-rdma
> corrupts memory upon timeout"
>  
> (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848
> .html) and also to a "RIP: scsi_times_out+0x17" crash during boot
> (https://bugzilla.kernel.org/show_bug.cgi?id=199077).

Yes, with the three patches:

- '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a request.mbox'

- '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into rcu_read_{lock,unlock}().mbox'

- '[PATCH v4] blk-mq_Fix race conditions in request timeout handling.mbox'

the occasional hangs on some boots / resumes from hibernation appear to 
be gone.

It also appears that the "error loading SMART data" issue is gone as
well (see my bug report). However, it is still too early to say for
sure; I think I need at least 2-3 days of additional testing with this
kernel to be sufficiently sure about it.

However… I could also test another patch, but from reading the rest of
this thread so far it is not clear to me whether to try one of the new
patches, and if so which one, and whether to add it on top of the
patches I already applied or use it as a replacement.

So while doing a training this and next week I can apply a patch here
and there, but I won't have much time to read the complete discussion
to figure out what to apply.

Personally, as a stable kernel has been released with those issues, I
think it's good to fix them up soon. On the other hand it may take
quite some time until popular distros carry 4.16 for regular users.
And I have no idea how frequent the reported issues are, i.e. how many
users would be affected.

Thanks,
Tejun Heo April 11, 2018, 9:31 p.m. UTC | #7
Hey, again.

On Wed, Apr 11, 2018 at 10:07:33AM -0700, tj@kernel.org wrote:
> Hello, Israel.
> 
> On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> > >Just noticed this one, this looks interesting to me as well. Israel,
> > >can you run your test with this patch?
> > 
> > Yes, I just did and it looks good.
> 
> Awesome.

Just to be sure, you tested the combined patch and saw the XXX debug
messages, right?

Thanks.
Israel Rukshin April 12, 2018, 8:59 a.m. UTC | #8
On 4/12/2018 12:31 AM, tj@kernel.org wrote:
> Hey, again.
>
> On Wed, Apr 11, 2018 at 10:07:33AM -0700, tj@kernel.org wrote:
>> Hello, Israel.
>>
>> On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
>>>> Just noticed this one, this looks interesting to me as well. Israel,
>>>> can you run your test with this patch?
>>> Yes, I just did and it looks good.
>> Awesome.
> Just to be sure, you tested the combined patch and saw the XXX debug
> messages, right?

I am not sure I understand.
What is the combined patch?
What are the debug messages that I need to see?

> Thanks.
>
Bart Van Assche April 18, 2018, 4:34 p.m. UTC | #9
On 04/10/18 14:54, tj@kernel.org wrote:
> On Tue, Apr 10, 2018 at 09:46:23PM +0000, Bart Van Assche wrote:
>> On Tue, 2018-04-10 at 14:33 -0700, tj@kernel.org wrote:
>>> +       else
>>> +               rq->missed_completion = true;
>>
>> In this patch I found code that sets rq->missed_completion but no code that
>> clears it. Did I perhaps miss something?
> 
> Ah, yeah, I was moving it out of add_timer but forgot to actually add
> it to the issue path.  Fixed patch below.
> 
> BTW, no matter what we do w/ the request handover between normal and
> timeout paths, we'd need something similar.  Otherwise, while we can
> reduce the window, we can't get rid of it.
> 
> [ ... ]

Hello Tejun,

This patch no longer applies cleanly on Jens' tree. Can you rebase and 
repost it?

Thanks,

Bart.

Patch

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@  void blk_mq_complete_request(struct requ
 	hctx_lock(hctx, &srcu_idx);
 	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
 		__blk_mq_complete_request(rq);
+	else
+		rq->missed_completion = true;
 	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@  void blk_mq_start_request(struct request
 
 	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
 	blk_add_timer(rq);
+	rq->missed_completion = false;
 
 	write_seqcount_end(&rq->gstate_seq);
 	preempt_enable();
@@ -881,7 +884,7 @@  static void blk_mq_check_expired(struct
 	}
 }
 
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
 {
 	struct blk_mq_hw_ctx *hctx;
 	bool has_rcu = false;
@@ -896,7 +899,8 @@  static void blk_mq_timeout_sync_rcu(stru
 		else
 			synchronize_srcu(hctx->srcu);
 
-		hctx->need_sync_rcu = false;
+		if (clear)
+			hctx->need_sync_rcu = false;
 	}
 	if (has_rcu)
 		synchronize_rcu();
@@ -917,25 +921,37 @@  static void blk_mq_terminate_expired(str
 		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
 }
 
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	/*
 	 * @rq's timer reset has gone through rcu synchronization and is
 	 * visible now.  Allow normal completions again by resetting
 	 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-	 * there's no memory ordering around ->aborted_gstate making it the
-	 * only field safe to update.  Let blk_add_timer() clear it later
-	 * when the request is recycled or times out again.
-	 *
-	 * As nothing prevents from completion happening while
-	 * ->aborted_gstate is set, this may lead to ignored completions
-	 * and further spurious timeouts.
+	 * blk_mq_timeout_reset_cleanup() needs it again and there's no
+	 * memory ordering around ->aborted_gstate making it the only field
+	 * safe to update.  Let blk_add_timer() clear it later when the
+	 * request is recycled or times out again.
 	 */
 	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
 		blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * @rq is now fully returned to the normal path.  If normal
+	 * completion raced timeout handling, execute the missed completion
+	 * here.  This is safe because 1. ->missed_completion can no longer
+	 * be asserted because nothing is timing out right now and 2. if
+	 * ->missed_completion is set, @rq is safe from recycling because
+	 * nobody could have completed it.
+	 */
+	if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+		blk_mq_complete_request(rq);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -976,7 +992,7 @@  static void blk_mq_timeout_work(struct w
 		 * becomes a problem, we can add per-hw_ctx rcu_head and
 		 * wait in parallel.
 		 */
-		blk_mq_timeout_sync_rcu(q);
+		blk_mq_timeout_sync_rcu(q, true);
 
 		/* terminate the ones we won */
 		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@  static void blk_mq_timeout_work(struct w
 		 * reset racing against recycling.
 		 */
 		if (nr_resets) {
-			blk_mq_timeout_sync_rcu(q);
+			blk_mq_timeout_sync_rcu(q, false);
+			blk_mq_queue_tag_busy_iter(q,
+					blk_mq_timeout_reset_return, NULL);
+			blk_mq_timeout_sync_rcu(q, true);
 			blk_mq_queue_tag_busy_iter(q,
-					blk_mq_finish_timeout_reset, NULL);
+					blk_mq_timeout_reset_cleanup, NULL);
 		}
 	}
 
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -227,6 +227,8 @@  struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
+	bool missed_completion;
+
 	/*
 	 * On blk-mq, the lower bits of ->gstate (generation number and
 	 * state) carry the MQ_RQ_* state value and the upper bits the