[v4] blk-mq: Fix race conditions in request timeout handling

Message ID 20180410013455.7448-1-bart.vanassche@wdc.com (mailing list archive)
State New, archived

Commit Message

Bart Van Assche April 10, 2018, 1:34 a.m. UTC
If a completion occurs after blk_mq_rq_timed_out() has reset
rq->aborted_gstate and the request is again in flight when the timeout
expires then a request will be completed twice: a first time by the
timeout handler and a second time when the regular completion occurs.

Additionally, the blk-mq timeout handling code ignores completions that
occur after blk_mq_check_expired() has been called and before
blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver
timeout handler always returns BLK_EH_RESET_TIMER then the result will
be that the request never terminates.

Since the request state can be updated from two different contexts,
namely regular completion and request timeout, this race cannot be
fixed with RCU synchronization only. Fix this race as follows:
- Split __deadline into two variables, namely lq_deadline for legacy
  request queues and mq_deadline for blk-mq request queues. Use atomic
  operations to update mq_deadline.
- Use the deadline instead of the request generation to detect whether
  or not a request timer fired after reinitialization of a request.
- Store the request state in the lowest two bits of the deadline instead
  of the lowest two bits of 'gstate'.
- Remove all request member variables that became superfluous due to
  this change: gstate, aborted_gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to this
  patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the hctx member that became superfluous due to these changes,
  namely nr_expired.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.
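
The encoding described above can be illustrated with a minimal userspace
sketch (C11 atomics stand in for the kernel's atomic_long_t; the helper
mirrors the blk_mq_change_rq_state() introduced by this patch, but this is
an illustration rather than the kernel code):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum mq_rq_state { MQ_RQ_IDLE = 0, MQ_RQ_IN_FLIGHT = 1, MQ_RQ_COMPLETE = 2 };
#define MQ_RQ_STATE_MASK 0x3UL

struct request { atomic_ulong mq_deadline; /* deadline | state bits */ };

/* Atomically switch the state from @old_s to @new_s, keeping the deadline. */
static bool blk_mq_change_rq_state(struct request *rq, enum mq_rq_state old_s,
				   enum mq_rq_state new_s)
{
	unsigned long old_d = (atomic_load(&rq->mq_deadline) &
			       ~MQ_RQ_STATE_MASK) | old_s;
	unsigned long new_d = (old_d & ~MQ_RQ_STATE_MASK) | new_s;

	return atomic_compare_exchange_strong(&rq->mq_deadline, &old_d, new_d);
}

int main(void)
{
	struct request rq = { .mq_deadline = (1024 & ~MQ_RQ_STATE_MASK) |
					     MQ_RQ_IN_FLIGHT };

	/* The regular completion claims the request first... */
	printf("completion claims rq: %d\n",
	       blk_mq_change_rq_state(&rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE));
	/* ...so the identical cmpxchg in the timeout path must fail. */
	printf("timeout claims rq:    %d\n",
	       blk_mq_change_rq_state(&rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE));
	return 0;
}

Whichever context wins the compare-and-exchange owns the request; the loser
observes that the state has already changed and backs off.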

This patch fixes the following kernel crashes:

BUG: unable to handle kernel NULL pointer dereference at           (null)
Oops: 0000 [#1] PREEMPT SMP
CPU: 2 PID: 151 Comm: kworker/2:1H Tainted: G        W        4.15.0-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:scsi_times_out+0x17/0x2c0 [scsi_mod]
Call Trace:
blk_mq_terminate_expired+0x42/0x80
bt_iter+0x3d/0x50
blk_mq_queue_tag_busy_iter+0xe9/0x200
blk_mq_timeout_work+0x181/0x2e0
process_one_work+0x21c/0x6d0
worker_thread+0x35/0x380
kthread+0x117/0x130
ret_from_fork+0x24/0x30

This patch also fixes a double completion problem in the NVMeOF
initiator driver. See also http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html.

Fixes: 1d9bd5161ba3 ("blk-mq: replace timeout synchronization with a RCU and generation based scheme")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
Cc: <stable@vger.kernel.org> # v4.16
---

Changes compared to v3 (see also https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.

 block/blk-core.c       |   2 -
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 166 +++----------------------------------------------
 block/blk-mq.h         |  47 ++++++++------
 block/blk-timeout.c    |  57 ++++++++++++-----
 block/blk.h            |  41 ++++++++++--
 include/linux/blk-mq.h |   1 -
 include/linux/blkdev.h |  32 +++-------
 8 files changed, 122 insertions(+), 225 deletions(-)

Comments

jianchao.wang April 10, 2018, 7:59 a.m. UTC | #1
Hi Bart

On 04/10/2018 09:34 AM, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout
> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs

Would you please detail more here about why the request could be completed twice ?

Is it the scenario you described as below in https://marc.info/?l=linux-block&m=151796816127318

The following race can occur between the code that resets the timer
and completion handling:
- The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
- A completion occurs and blk_mq_complete_request() calls
  __blk_mq_complete_request().
- The timeout code calls blk_add_timer() and that function sets the
  request deadline and adjusts the timer.
- __blk_mq_complete_request() frees the request tag.
- The timer fires and the timeout handler gets called for a freed
  request.
If yes, how does the timeout handler get the freed request when the tag has been freed ?

Thanks
Jianchao
Ming Lei April 10, 2018, 8:41 a.m. UTC | #2
On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> If a completion occurs after blk_mq_rq_timed_out() has reset
> rq->aborted_gstate and the request is again in flight when the timeout

Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
think you are addressing the race between normal completion and
the timeout of BLK_EH_RESET_TIMER.

> expires then a request will be completed twice: a first time by the
> timeout handler and a second time when the regular completion occurs.

This patch looks simpler, and more like the previous model of
using blk_mark_rq_complete().

I have one question:

- with this patch, rq's state is updated atomically as cmpxchg. Suppose
one rq is timed out, the rq's state is updated as MQ_RQ_COMPLETE by
blk_mq_check_expired(), then ops->timeout() is run and
BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
MQ_RQ_IN_FLIGHT) is called to update rq's state into MQ_RQ_IN_FLIGHT.

Now the original normal completion still may occur after rq's state
becomes MQ_RQ_IN_FLIGHT, and seems there is still the double completion
with this patch? Maybe I am wrong, please explain a bit.

And synchronize_rcu() is needed by Tejun's approach between marking
COMPLETE and handling this rq's timeout, and the time can be quite long,
I guess it might be easier to trigger?

Thanks,
Ming
Ming Lei April 10, 2018, 10:04 a.m. UTC | #3
On Tue, Apr 10, 2018 at 03:59:30PM +0800, jianchao.wang wrote:
> Hi Bart
> 
> On 04/10/2018 09:34 AM, Bart Van Assche wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs
> 
> Would you please detail more here about why the request could be completed twice ?
> 
> Is it the scenario you described as below in https://marc.info/?l=linux-block&m=151796816127318
> 
> The following race can occur between the code that resets the timer
> and completion handling:
> - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
> - A completion occurs and blk_mq_complete_request() calls
>   __blk_mq_complete_request().
> - The timeout code calls blk_add_timer() and that function sets the
>   request deadline and adjusts the timer.
> - __blk_mq_complete_request() frees the request tag.
> - The timer fires and the timeout handler gets called for a freed
>   request.
> If yes, how does the timeout handler get the freed request when the tag has been freed ?

Thinking of this patch further.

The issue may not be a double completion issue, and it may be the
following behaviour which breaks NVMe or other drivers easily:

1) there is long delay(synchronize_rcu()) between setting rq->aborted_gstate
and handling the timeout by blk_mq_rq_timed_out().

2) during the long delay, the rq may be completed by hardware, then
if the following timeout is handled as BLK_EH_RESET_TIMER, it is
driver's bug, and driver's .timeout() may be confused about this
behaviour, I guess.

In theory this behaviour should exist in all these approaches,
but just easier to trigger if long delay is introduced before handling
timeout.

Thanks,
Ming
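
A hypothetical sketch of the defensive behaviour this implies for a driver
.timeout() handler (all driver-side names below are invented for
illustration; the enum mirrors the v4.16 block layer, and a real driver
such as NVMe would consult its own completion-queue state):

#include <stdbool.h>
#include <stdio.h>

enum blk_eh_timer_return { BLK_EH_NOT_HANDLED, BLK_EH_HANDLED,
			   BLK_EH_RESET_TIMER };

struct my_cmd {
	bool hw_done;	/* set by the (modelled) interrupt handler */
};

static enum blk_eh_timer_return my_timeout(struct my_cmd *cmd)
{
	/*
	 * The request may have completed in hardware during the long gap
	 * between timeout detection and this call. Treat that as handled
	 * instead of blindly returning BLK_EH_RESET_TIMER.
	 */
	if (cmd->hw_done) {
		printf("completed while timeout was pending: handled\n");
		return BLK_EH_HANDLED;
	}
	printf("still outstanding: reset the timer\n");
	return BLK_EH_RESET_TIMER;
}

int main(void)
{
	struct my_cmd racing = { .hw_done = true };
	struct my_cmd stuck = { .hw_done = false };

	my_timeout(&racing);
	my_timeout(&stuck);
	return 0;
}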
Shan Hai April 10, 2018, 12:04 p.m. UTC | #4
On 04/10/2018 06:04 PM, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 03:59:30PM +0800, jianchao.wang wrote:
>> Hi Bart
>>
>> On 04/10/2018 09:34 AM, Bart Van Assche wrote:
>>> If a completion occurs after blk_mq_rq_timed_out() has reset
>>> rq->aborted_gstate and the request is again in flight when the timeout
>>> expires then a request will be completed twice: a first time by the
>>> timeout handler and a second time when the regular completion occurs
>> Would you please detail more here about why the request could be completed twice ?
>>
>> Is it the scenario you described as below in https://marc.info/?l=linux-block&m=151796816127318
>>
>> The following race can occur between the code that resets the timer
>> and completion handling:
>> - The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
>> - A completion occurs and blk_mq_complete_request() calls
>>    __blk_mq_complete_request().
>> - The timeout code calls blk_add_timer() and that function sets the
>>    request deadline and adjusts the timer.
>> - __blk_mq_complete_request() frees the request tag.
>> - The timer fires and the timeout handler gets called for a freed
>>    request.
>> If yes, how does the timeout handler get the freed request when the tag has been freed ?
> Thinking of this patch further.
>
> The issue may not be a double completion issue, and it may be the
> following behaviour which breaks NVMe or other drivers easily:
>
> 1) there is long delay(synchronize_rcu()) between setting rq->aborted_gstate
> and handling the timeout by blk_mq_rq_timed_out().
>
> 2) during the long delay, the rq may be completed by hardware, then
> if the following timeout is handled as BLK_EH_RESET_TIMER, it is
> driver's bug, and driver's .timeout() may be confused about this
> behaviour, I guess.
>
> In theory this behaviour should exist in all these approaches,
> but just easier to trigger if long delay is introduced before handling
> timeout.

Or think of it as below?


                    C                           C                C
+-----------------------------+---------------+-----------+
S                                   T F              E

Request life time line:
S: start
T: timed out
F: found (by timer), inflight but timed out because of a long delay
E: completed by timeout handler
C: regular completion

normal request completion time range: (S, T)
completion in (T, F) is fine since it's not inflight anymore
race window time range: (F, E)

if the delayed request is completed in the (F, E) range then it could be
completed twice: first by the regular completion and then by the timeout
handler.

Thanks
Shan Hai
> Thanks,
> Ming
Bart Van Assche April 10, 2018, 12:58 p.m. UTC | #5
On Tue, 2018-04-10 at 16:41 +0800, Ming Lei wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > If a completion occurs after blk_mq_rq_timed_out() has reset
> > rq->aborted_gstate and the request is again in flight when the timeout
> 
> Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
> think you are addressing the race between normal completion and
> the timeout of BLK_EH_RESET_TIMER.

Yes, that's correct. I will make this more explicit.

> > expires then a request will be completed twice: a first time by the
> > timeout handler and a second time when the regular completion occurs.
> 
> This patch looks simpler, and more like the previous model of
> using blk_mark_rq_complete().

That's also correct.

> I have one question:
> 
> - with this patch, rq's state is updated atomically as cmpxchg. Suppose
> one rq is timed out, the rq's state is updated as MQ_RQ_COMPLETE by
> blk_mq_check_expired(), then ops->timeout() is run and
> BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
> MQ_RQ_IN_FLIGHT) is called to update rq's state into MQ_RQ_IN_FLIGHT.
> 
> Now the original normal completion still may occur after rq's state
> becomes MQ_RQ_IN_FLIGHT, and seems there is still the double completion
> with this patch? Maybe I am wrong, please explain a bit.

That scenario won't result in a double completion. After the timer has
been reset the block driver blk_mq_complete_request() call will change
the request state from MQ_RQ_IN_FLIGHT into MQ_RQ_COMPLETE. The next
time blk_mq_check_expired() is called it will execute the following code:

	blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);

That function call only changes the request state if the current state is
IN_FLIGHT. However, the blk_mq_complete_request() call changed the request
state into COMPLETE. Hence, the above blk_mq_change_rq_state() call will
return false and the blk-mq timeout code will skip this request. If the
request would already have been reused and would have been marked again as
IN_FLIGHT then its deadline will also have been updated and the request
will be skipped by the timeout code because its deadline has not yet
expired.

> And synchronize_rcu() is needed by Tejun's approach between marking
> COMPLETE and handling this rq's timeout, and the time can be quite long,
> I guess it might be easier to trigger?

I have done what I could to trigger races between the regular completion
path and the timeout code in my tests. Without this patch if I run the
srp-test software I see crashes being reported in the rdma_rxe driver but
with this patch applied I don't see any crashes with my tests.

Bart.
Bart Van Assche April 10, 2018, 1:01 p.m. UTC | #6
On Tue, 2018-04-10 at 15:59 +0800, jianchao.wang wrote:
> If yes, how does the timeout handler get the freed request when the tag has been freed ?

Hello Jianchao,

Have you noticed that the timeout handler does not check whether or not the request
tag is freed? Additionally, I don't think it would be possible to add such a check
to the timeout code without introducing a new race condition.

Bart.
Ming Lei April 10, 2018, 1:55 p.m. UTC | #7
On Tue, Apr 10, 2018 at 12:58:04PM +0000, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 16:41 +0800, Ming Lei wrote:
> > On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > > If a completion occurs after blk_mq_rq_timed_out() has reset
> > > rq->aborted_gstate and the request is again in flight when the timeout
> > 
> > Given rq->aborted_gstate is reset only for BLK_EH_RESET_TIMER, I
> > think you are addressing the race between normal completion and
> > the timeout of BLK_EH_RESET_TIMER.
> 
> Yes, that's correct. I will make this more explicit.
> 
> > > expires then a request will be completed twice: a first time by the
> > > timeout handler and a second time when the regular completion occurs.
> > 
> > This patch looks simpler, and more like the previous model of
> > using blk_mark_rq_complete().
> 
> That's also correct.
> 
> > I have one question:
> > 
> > - with this patch, rq's state is updated atomically as cmpxchg. Suppose
> > one rq is timed out, the rq's state is updated as MQ_RQ_COMPLETE by
> > blk_mq_check_expired(), then ops->timeout() is run and
> > BLK_EH_RESET_TIMER is returned, so blk_mq_add_timer(req, MQ_RQ_COMPLETE,
> > MQ_RQ_IN_FLIGHT) is called to update rq's state into MQ_RQ_IN_FLIGHT.
> > 
> > Now the original normal completion still may occur after rq's state
> > becomes MQ_RQ_IN_FLIGHT, and seems there is still the double completion
> > with this patch? Maybe I am wrong, please explain a bit.
> 
> That scenario won't result in a double completion. After the timer has
> been reset the block driver blk_mq_complete_request() call will change
> the request state from MQ_RQ_IN_FLIGHT into MQ_RQ_COMPLETE. The next
> time blk_mq_check_expired() is called it will execute the following code:
> 
> 	blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE);
> 
> That function call only changes the request state if the current state is
> IN_FLIGHT. However, the blk_mq_complete_request() call changed the request
> state into COMPLETE. Hence, the above blk_mq_change_rq_state() call will
> return false and the blk-mq timeout code will skip this request. If the
> request would already have been reused and would have been marked again as
> IN_FLIGHT then its deadline will also have been updated and the request
> will be skipped by the timeout code because its deadline has not yet
> expired.

OK.

Then I have same question with Jianchao, what is the actual double
complete in linus tree between BLK_EH_RESET_TIMER and normal completion?

Follows my understanding:

1) when timeout is detected on one request, its aborted_gstate is
updated in blk_mq_check_expired().

2) run synchronize_rcu(), and make sure all pending completion is done

3) run blk_mq_rq_timed_out()
- ret = ops->timeout
- blk_mq_rq_update_aborted_gstate(req, 0)
- blk_add_timer(req);

If normal completion is done between 1) and reset aborted_gstate in 3),
blk_mq_complete_request() will be called, and found that aborted_gstate
is set, then the rq won't be completed really.

If normal completion is done after reset aborted_gstate in 3), it should
be same with applying this patch.

> 
> > And synchronize_rcu() is needed by Tejun's approach between marking
> > COMPLETE and handling this rq's timeout, and the time can be quite long,
> > I guess it might be easier to trigger?
> 
> I have done what I could to trigger races between the regular completion
> path and the timeout code in my tests. Without this patch if I run the
> srp-test software I see crashes being reported in the rdma_rxe driver but
> with this patch applied I don't see any crashes with my tests.

I believe this patch may fix this issue, but I think the idea behind
should be understood.

Thanks,
Ming
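
A simplified userspace model of the handshake outlined in steps 1) to 3)
above (a plain atomic generation stands in for the seqcount/u64_stats
machinery; this is a sketch of the idea, not the kernel code). It shows how
resetting aborted_gstate for BLK_EH_RESET_TIMER reopens the window in which
a late regular completion and a later timeout can both end the same request:

#include <stdatomic.h>
#include <stdio.h>

struct request {
	atomic_ulong gstate;		/* generation of the current instance */
	atomic_ulong aborted_gstate;	/* generation claimed by the timeout */
	int completions;		/* how many times the rq was ended */
};

static void really_complete(struct request *rq)
{
	printf("request completed (%d time(s))\n", ++rq->completions);
}

/* Completion path: yield iff the timeout path has claimed this instance. */
static void blk_mq_complete_request(struct request *rq)
{
	if (atomic_load(&rq->aborted_gstate) != atomic_load(&rq->gstate))
		really_complete(rq);
}

int main(void)
{
	struct request rq = { .gstate = 4, .aborted_gstate = 0 };

	/* 1) timeout detected: claim the current generation */
	atomic_store(&rq.aborted_gstate, atomic_load(&rq.gstate));

	/* a completion here yields, since aborted_gstate == gstate */
	blk_mq_complete_request(&rq);

	/* 3) ->timeout() returned BLK_EH_RESET_TIMER: reset the claim */
	atomic_store(&rq.aborted_gstate, 0);

	/* a late regular completion no longer yields... */
	blk_mq_complete_request(&rq);

	/* ...and if the re-armed timer expires before the request is
	 * reused, the timeout handler completes it a second time. */
	really_complete(&rq);
	return 0;
}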
Bart Van Assche April 10, 2018, 2:09 p.m. UTC | #8
On Tue, 2018-04-10 at 21:55 +0800, Ming Lei wrote:
> Then I have same question with Jianchao, what is the actual double
> complete in linus tree between BLK_EH_RESET_TIMER and normal completion?
> 
> Follows my understanding:
> 
> 1) when timeout is detected on one request, its aborted_gstate is
> updated in blk_mq_check_expired().
> 
> 2) run synchronize_rcu(), and make sure all pending completion is done
> 
> 3) run blk_mq_rq_timed_out()
> - ret = ops->timeout
> - blk_mq_rq_update_aborted_gstate(req, 0)
> - blk_add_timer(req);
> 
> If normal completion is done between 1) and reset aborted_gstate in 3),
> blk_mq_complete_request() will be called, and found that aborted_gstate
> is set, then the rq won't be completed really.
> 
> If normal completion is done after reset aborted_gstate in 3), it should
> be same with applying this patch.

Hello Ming,

Please keep in mind that all synchronize_rcu() does is to wait for pre-
existing RCU readers to finish. synchronize_rcu() does not prevent that new
rcu_read_lock() calls happen. It is e.g. possible that after
blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
completion occurs. If that request is not reused before the timer that was
restarted by the timeout code expires, that request will be completed twice.

Bart.
Tejun Heo April 10, 2018, 2:20 p.m. UTC | #9
Hello, Bart.

On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> Since the request state can be updated from two different contexts,
> namely regular completion and request timeout, this race cannot be
> fixed with RCU synchronization only. Fix this race as follows:

Well, it can be and the patches have been posted months ago.  It just
needed a repro case to confirm the fix, which we now seem to have.

Switching to another model might be better but let's please do that
with the right rationales.  A good portion of this seems to be built
on misunderstandings.

Thanks.
Ming Lei April 10, 2018, 2:30 p.m. UTC | #10
On Tue, Apr 10, 2018 at 02:09:33PM +0000, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 21:55 +0800, Ming Lei wrote:
> > Then I have same question with Jianchao, what is the actual double
> > complete in linus tree between BLK_EH_RESET_TIMER and normal completion?
> > 
> > Follows my understanding:
> > 
> > 1) when timeout is detected on one request, its aborted_gstate is
> > updated in blk_mq_check_expired().
> > 
> > 2) run synchronize_rcu(), and make sure all pending completion is done
> > 
> > 3) run blk_mq_rq_timed_out()
> > - ret = ops->timeout
> > - blk_mq_rq_update_aborted_gstate(req, 0)
> > - blk_add_timer(req);
> > 
> > If normal completion is done between 1) and reset aborted_gstate in 3),
> > blk_mq_complete_request() will be called, and found that aborted_gstate
> > is set, then the rq won't be completed really.
> > 
> > If normal completion is done after reset aborted_gstate in 3), it should
> > be same with applying this patch.
> 
> Hello Ming,
> 
> Please keep in mind that all synchronize_rcu() does is to wait for pre-
> existing RCU readers to finish. synchronize_rcu() does not prevent that new
> rcu_read_lock() calls happen. It is e.g. possible that after

That is right, and I also mentioned normal completion can be done between
1) and reset aborted_gstate in 3).

> blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> completion occurs. If that request is not reused before the timer that was
> restarted by the timeout code expires, that request will be completed twice.

In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
to MQ_RQ_IN_FLIGHT, normal completion still can come and complete this rq,
just like the above you described, right?


Thanks,
Ming
Bart Van Assche April 10, 2018, 2:30 p.m. UTC | #11
On Tue, 2018-04-10 at 07:20 -0700, Tejun Heo wrote:
> On Mon, Apr 09, 2018 at 06:34:55PM -0700, Bart Van Assche wrote:
> > Since the request state can be updated from two different contexts,
> > namely regular completion and request timeout, this race cannot be
> > fixed with RCU synchronization only. Fix this race as follows:
> 
> Well, it can be and the patches have been posted months ago.

That's not correct. I have explained to you in detail that the two patches
you posted do not fix all the races fixed by the patch at the start of this
e-mail thread.

> Switching to another model might be better but let's please do that
> with the right rationales.  A good portion of this seems to be built
> on misunderstandings.

Which misunderstandings? I'm not aware of any misunderstandings on my side.
Additionally, tests with two different block drivers (the NVMeOF initiator
and the SRP initiator driver) have shown that the current blk-mq timeout
implementation, with or without your two patches applied, results in subtle
and hard to debug crashes and/or memory corruption. That is not the case for
the patch at the start of this thread. The latest report of a crash I ran
into myself and that is fixed by the patch at the start of this thread is
available here: https://www.spinics.net/lists/linux-rdma/msg63240.html.

Please also keep in mind that accepting this patch does not prevent it from
being replaced with an RCU-based solution later on. If anyone comes up at
any time with a reliably working RCU-based solution I will be happy to
accept a revert of this patch and I will help review that RCU-based
solution.

Bart.
jianchao.wang April 10, 2018, 2:32 p.m. UTC | #12
Hi Bart

Thanks for your kindly response.

On 04/10/2018 09:01 PM, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 15:59 +0800, jianchao.wang wrote:
>> If yes, how does the timeout handler get the freed request when the tag has been freed ?
> 
> Hello Jianchao,
> 
> Have you noticed that the timeout handler does not check whether or not the request
> tag is freed? Additionally, I don't think it would be possible to add such a check
> to the timeout code without introducing a new race condition.

Doesn't blk_mq_queue_tag_busy_iter() only iterate over the tags that have been allocated/set?
When the request is freed, the tag will be cleared through blk_mq_put_tag() -> sbitmap_queue_clear().
Am I missing something else?

Thanks
Jianchao

Tejun Heo April 10, 2018, 2:33 p.m. UTC | #13
Hello,

On Tue, Apr 10, 2018 at 02:30:26PM +0000, Bart Van Assche wrote:
> > Switching to another model might be better but let's please do that
> > with the right rationales.  A good portion of this seems to be built
> > on misunderstandings.
> 
> Which misunderstandings? I'm not aware of any misunderstandings on my side.
> Additionally, tests with two different block drivers (the NVMeOF initiator
> and the SRP initiator driver) have shown that the current blk-mq timeout
> implementation, with or without your two patches applied, results in subtle
> and hard to debug crashes and/or memory corruption. That is not the case for
> the patch at the start of this thread.

I must have missed that part.  Which tests were they?

> The latest report of a crash I ran
> into myself and that is fixed by the patch at the start of this thread is
> available here: https://www.spinics.net/lists/linux-rdma/msg63240.html.
> 
> Please also keep in mind that accepting this patch does not prevent it from
> being replaced with an RCU-based solution later on. If anyone comes up at
> any time with a reliably working RCU-based solution I will be happy to
> accept a revert of this patch and I will help review that RCU-based
> solution.

Oh, switching is fine but let's get in sync first.  Who has the repro
cases and what were tested?

Thanks.
Bart Van Assche April 10, 2018, 3:02 p.m. UTC | #14
On Tue, 2018-04-10 at 22:30 +0800, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 02:09:33PM +0000, Bart Van Assche wrote:
> > Please keep in mind that all synchronize_rcu() does is to wait for pre-
> > existing RCU readers to finish. synchronize_rcu() does not prevent that new
> > rcu_read_lock() calls happen. It is e.g. possible that after
> 
> That is right, and I also mentioned normal completion can be done between
> 1) and reset aborted_gstate in 3).
> 
> > blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> > completion occurs. If that request is not reused before the timer that was
> > restarted by the timeout code expires, that request will be completed twice.
> 
> In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
> called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
> to MQ_RQ_IN_FLIGHT, normal completion still can come and complete this rq,
> just like the above you described, right?

I should have added the following in my previous e-mail: "if the completion
occurs after blk_mq_check_expired() examined rq->gstate and before it updated
rq->aborted_gstate". That race can occur with the current upstream blk-mq
timeout handling code but not after my patch has been applied.

Bart.
Ming Lei April 10, 2018, 3:25 p.m. UTC | #15
On Tue, Apr 10, 2018 at 03:02:11PM +0000, Bart Van Assche wrote:
> On Tue, 2018-04-10 at 22:30 +0800, Ming Lei wrote:
> > On Tue, Apr 10, 2018 at 02:09:33PM +0000, Bart Van Assche wrote:
> > > Please keep in mind that all synchronize_rcu() does is to wait for pre-
> > > existing RCU readers to finish. synchronize_rcu() does not prevent that new
> > > rcu_read_lock() calls happen. It is e.g. possible that after
> > 
> > That is right, and I also mentioned normal completion can be done between
> > 1) and reset aborted_gstate in 3).
> > 
> > > blk_mq_rq_update_aborted_gstate(req, 0) has been executed that a regular
> > > completion occurs. If that request is not reused before the timer that was
> > > restarted by the timeout code expires, that request will be completed twice.
> > 
> > In this patch, blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT) is
> > called for handling BLK_EH_RESET_TIMER. And after rq's state is changed
> > to MQ_RQ_IN_FLIGHT, normal completion still can come and complete this rq,
> > just like the above you described, right?
> 
> I should have added the following in my previous e-mail: "if the completion
> occurs after blk_mq_check_expired() examined rq->gstate and before it updated
> rq->aborted_gstate".

That is the difference between tj's approach and your patch, but the
difference is just in the timing.

See this patch:

+   if (time_after_eq(jiffies, deadline) &&
+       blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
+       blk_mq_rq_timed_out(rq, reserved);

Normal completion still can happen between blk_mq_change_rq_state()
and blk_mq_rq_timed_out().

In tj's approach, there is synchronize_rcu() between writing aborted_gstate
and blk_mq_rq_timed_out, it is easier for normal completion to happen during
the big window.

> That race can occur with the current upstream blk-mq
> timeout handling code but not after my patch has been applied.

In theory, the 'race' can occur with your patch too, but the window
is just very small.

If you think my comment is correct, please update your commit log.
Tejun Heo April 10, 2018, 3:30 p.m. UTC | #16
Hello, Ming.

On Tue, Apr 10, 2018 at 11:25:54PM +0800, Ming Lei wrote:
> +   if (time_after_eq(jiffies, deadline) &&
> +       blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
> +       blk_mq_rq_timed_out(rq, reserved);
> 
> Normal completion still can happen between blk_mq_change_rq_state()
> and blk_mq_rq_timed_out().
> 
> In tj's approach, there is synchronize_rcu() between writing aborted_gstate
> and blk_mq_rq_timed_out, it is easier for normal completion to happen during
> the big window.

I don't think plugging this hole is all that difficult, but this
shouldn't lead to any critical failures.  If so, that'd be a driver
bug.

Thanks.
Ming Lei April 10, 2018, 3:38 p.m. UTC | #17
Hi Tejun,

On Tue, Apr 10, 2018 at 08:30:31AM -0700, tj@kernel.org wrote:
> Hello, Ming.
> 
> On Tue, Apr 10, 2018 at 11:25:54PM +0800, Ming Lei wrote:
> > +   if (time_after_eq(jiffies, deadline) &&
> > +       blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
> > +       blk_mq_rq_timed_out(rq, reserved);
> > 
> > Normal completion still can happen between blk_mq_change_rq_state()
> > and blk_mq_rq_timed_out().
> > 
> > In tj's approach, there is synchronize_rcu() between writing aborted_gstate
> > and blk_mq_rq_timed_out, it is easier for normal completion to happen during
> > the big window.
> 
> I don't think plugging this hole is all that difficult, but this
> shouldn't lead to any critical failures.  If so, that'd be a driver
> bug.

I agree, the issue should be in driver's irq handler and .timeout in
theory.

For example, even though one request has been done by irq handler, .timeout
still may return RESET_TIMER.

Thanks,
Ming
Tejun Heo April 10, 2018, 3:40 p.m. UTC | #18
Hello,

On Tue, Apr 10, 2018 at 11:38:18PM +0800, Ming Lei wrote:
> I agree, the issue should be in driver's irq handler and .timeout in
> theory.
> 
> For example, even though one request has been done by irq handler, .timeout
> still may return RESET_TIMER.

blk-mq can use a separate flag to track normal completions during
timeout and complete the request normally on RESET_TIMER if the flag
is set while EH was in progress.  With a bit of care, we'd be able to
plug the race completely.

Thanks.
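
A hypothetical userspace sketch of such a flag (nothing like this exists in
the patch under discussion; all names are invented). While the timeout
handler owns a request, a concurrent normal completion only records itself;
on BLK_EH_RESET_TIMER the EH code consumes the flag and completes the
request instead of re-arming the timer:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct request {
	atomic_bool eh_owned;		/* timeout handling owns the request */
	atomic_bool missed_completion;	/* completion arrived during EH */
};

/* Normal completion path. */
static void complete_request(struct request *rq)
{
	if (atomic_load(&rq->eh_owned)) {
		/* EH in progress: record the completion instead of running it. */
		atomic_store(&rq->missed_completion, true);
		return;
	}
	printf("completed normally\n");
}

/* Timeout path, after ->timeout() returned BLK_EH_RESET_TIMER. */
static void eh_reset_timer(struct request *rq)
{
	atomic_store(&rq->eh_owned, false);
	if (atomic_exchange(&rq->missed_completion, false))
		printf("completing request that finished during EH\n");
	else
		printf("re-arming the timer\n");
}

int main(void)
{
	struct request rq = { false, false };

	atomic_store(&rq.eh_owned, true);	/* timeout fires, EH claims rq */
	complete_request(&rq);			/* hardware completes meanwhile */
	eh_reset_timer(&rq);			/* flag wins: complete, do not re-arm */
	return 0;
}

Note that as written the sketch still has a small window of its own (a
completion can observe eh_owned just before the EH code clears it); closing
it, e.g. by folding the flag into the same atomic word as the request state,
is the "bit of care" mentioned above.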

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 0c48bef8490f..422b79b61bb9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -200,8 +200,6 @@  void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 6f72413b6cab..80c7c585769f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -345,7 +345,6 @@  static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7816d28b7219..337e10a5a30c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -305,7 +305,6 @@  static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->special = NULL;
 	/* tag was already set */
 	rq->extra_len = 0;
-	rq->__deadline = 0;
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
@@ -527,8 +526,7 @@  static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -577,36 +575,6 @@  static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-	unsigned long flags;
-
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context.  u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup.  Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-	return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -618,27 +586,12 @@  static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost.  This is synchronized through
-	 * hctx_lock().  See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us.  Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE))
 		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -662,27 +615,8 @@  void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, &rq->issue_stat);
 	}
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-
-	/*
-	 * Mark @rq in-flight which also advances the generation number,
-	 * and register for timeout.  Protect with a seqcount to allow the
-	 * timeout path to read both @rq->gstate and @rq->deadline
-	 * coherently.
-	 *
-	 * This is the only place where a request is marked in-flight.  If
-	 * the timeout path reads an in-flight @rq->gstate, the
-	 * @rq->deadline it reads together under @rq->gstate_seq is
-	 * guaranteed to be the matching one.
-	 */
-	preempt_disable();
-	write_seqcount_begin(&rq->gstate_seq);
-
-	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
-	blk_add_timer(rq);
-
-	write_seqcount_end(&rq->gstate_seq);
-	preempt_enable();
+	/* Mark @rq in-flight and set its deadline. */
+	blk_mq_add_timer(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -695,11 +629,6 @@  void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
- */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -811,7 +740,6 @@  EXPORT_SYMBOL(blk_mq_tag_to_rq);
 struct blk_mq_timeout_data {
 	unsigned long next;
 	unsigned int next_set;
-	unsigned int nr_expired;
 };
 
 static void blk_mq_rq_timed_out(struct request *req, bool reserved)
@@ -819,8 +747,6 @@  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -829,13 +755,7 @@  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -849,60 +769,23 @@  static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long gstate, deadline;
-	int start;
-
-	might_sleep();
-
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
-		return;
-
-	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
-	while (true) {
-		start = read_seqcount_begin(&rq->gstate_seq);
-		gstate = READ_ONCE(rq->gstate);
-		deadline = blk_rq_deadline(rq);
-		if (!read_seqcount_retry(&rq->gstate_seq, start))
-			break;
-		cond_resched();
-	}
+	unsigned long deadline = blk_mq_rq_deadline(rq);
 
-	/* if in-flight && overdue, mark for abortion */
-	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
-	    time_after_eq(jiffies, deadline)) {
-		blk_mq_rq_update_aborted_gstate(rq, gstate);
-		data->nr_expired++;
-		hctx->nr_expired++;
+	if (time_after_eq(jiffies, deadline) &&
+	    blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE)) {
+		blk_mq_rq_timed_out(rq, reserved);
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
-}
 
-static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
-		struct request *rq, void *priv, bool reserved)
-{
-	/*
-	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
-	 * completions that we lost to, they would have finished and
-	 * updated @rq->gstate by now; otherwise, the completion path is
-	 * now guaranteed to see @rq->aborted_gstate and yield.  If
-	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
-	 */
-	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
-		blk_mq_rq_timed_out(rq, reserved);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
 		container_of(work, struct request_queue, timeout_work);
-	struct blk_mq_timeout_data data = {
-		.next		= 0,
-		.next_set	= 0,
-		.nr_expired	= 0,
-	};
+	struct blk_mq_timeout_data data = { };
 	struct blk_mq_hw_ctx *hctx;
 	int i;
 
@@ -925,33 +808,6 @@  static void blk_mq_timeout_work(struct work_struct *work)
 	/* scan for the expired ones and set their ->aborted_gstate */
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
-	if (data.nr_expired) {
-		bool has_rcu = false;
-
-		/*
-		 * Wait till everyone sees ->aborted_gstate.  The
-		 * sequential waits for SRCUs aren't ideal.  If this ever
-		 * becomes a problem, we can add per-hw_ctx rcu_head and
-		 * wait in parallel.
-		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
-
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
-
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
-
-		/* terminate the ones we won */
-		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
-	}
-
 	if (data.next_set) {
 		data.next = blk_rq_timeout(round_jiffies_up(data.next));
 		mod_timer(&q->timeout, data.next);
@@ -2087,8 +1943,6 @@  static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..4f96fd66eb8a 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -27,10 +27,7 @@  struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-/*
- * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
- */
+/* Lowest two bits of request->mq_deadline. */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
@@ -38,7 +35,6 @@  enum mq_rq_state {
 
 	MQ_RQ_STATE_BITS	= 2,
 	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -104,9 +100,30 @@  void blk_mq_release(struct request_queue *q);
  * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
  */
-static inline int blk_mq_rq_state(struct request *rq)
+static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+	return atomic_long_read(&rq->mq_deadline) & MQ_RQ_STATE_MASK;
+}
+
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old: Old request state.
+ * @new: New request state.
+ *
+ * Returns %true if and only if the old state was @old and if the state has
+ * been changed into @new.
+ */
+static inline bool blk_mq_change_rq_state(struct request *rq,
+					  enum mq_rq_state old_s,
+					  enum mq_rq_state new_s)
+{
+	unsigned long old_d = (atomic_long_read(&rq->mq_deadline) &
+			       ~(unsigned long)MQ_RQ_STATE_MASK) | old_s;
+	unsigned long new_d = (old_d & ~(unsigned long)MQ_RQ_STATE_MASK) |
+			      new_s;
+
+	return atomic_long_cmpxchg(&rq->mq_deadline, old_d, new_d) == old_d;
 }
 
 /**
@@ -114,23 +131,13 @@  static inline int blk_mq_rq_state(struct request *rq)
  * @rq: target request.
  * @state: new state to set.
  *
- * Set @rq's state to @state.  The caller is responsible for ensuring that
- * there are no other updaters.  A request can transition into IN_FLIGHT
- * only from IDLE and doing so increments the generation number.
+ * Set @rq's state to @state.
  */
 static inline void blk_mq_rq_update_state(struct request *rq,
-					  enum mq_rq_state state)
+					  enum mq_rq_state new_s)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
+	while (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), new_s)) {
 	}
-
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 50a191720055..3ca829dce2d6 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -165,8 +165,9 @@  void blk_abort_request(struct request *req)
 		 * immediately and that scan sees the new timeout value.
 		 * No need for fancy synchronizations.
 		 */
-		blk_rq_set_deadline(req, jiffies);
-		kblockd_schedule_work(&req->q->timeout_work);
+		if (blk_mq_rq_set_deadline(req, jiffies, MQ_RQ_IN_FLIGHT,
+					   MQ_RQ_IN_FLIGHT))
+			kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
 			return;
@@ -187,15 +188,8 @@  unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
-/**
- * blk_add_timer - Start timeout timer for a single request
- * @req:	request that is about to start running.
- *
- * Notes:
- *    Each request has its own timer, and as it is added to the queue, we
- *    set up the timer. When the request completes, we cancel the timer.
- */
-void blk_add_timer(struct request *req)
+static void __blk_add_timer(struct request *req, enum mq_rq_state old,
+			    enum mq_rq_state new)
 {
 	struct request_queue *q = req->q;
 	unsigned long expiry;
@@ -216,15 +210,17 @@  void blk_add_timer(struct request *req)
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
-	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
-
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
 	 * For the mq case we simply scan the tag map.
 	 */
-	if (!q->mq_ops)
+	if (!q->mq_ops) {
+		blk_rq_set_deadline(req, jiffies + req->timeout);
 		list_add_tail(&req->timeout_list, &req->q->timeout_list);
+	} else {
+		WARN_ON_ONCE(!blk_mq_rq_set_deadline(req, jiffies +
+						     req->timeout, old, new));
+	}
 
 	/*
 	 * If the timer isn't already pending or this timeout is earlier
@@ -249,3 +245,34 @@  void blk_add_timer(struct request *req)
 	}
 
 }
+
+/**
+ * blk_add_timer - Start timeout timer for a single request
+ * @req:	request that is about to start running.
+ *
+ * Notes:
+ *    Each request has its own timer, and as it is added to the queue, we
+ *    set up the timer. When the request completes, we cancel the timer.
+ */
+void blk_add_timer(struct request *req)
+{
+	return __blk_add_timer(req, MQ_RQ_IDLE/*ignored*/,
+			       MQ_RQ_IDLE/*ignored*/);
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req:	request for which to set the deadline.
+ * @old:	current request state.
+ * @new:	new request state.
+ *
+ * Sets the deadline of a request if and only if it has state @old and
+ * at the same time changes the request state from @old into @new. The caller
+ * must guarantee that the request state won't be modified while this function
+ * is in progress.
+ */
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new)
+{
+	return __blk_add_timer(req, old, new);
+}
diff --git a/block/blk.h b/block/blk.h
index b034fd2460c4..7665d4af777e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,8 @@  static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new);
 void blk_delete_timer(struct request *);
 
 
@@ -191,21 +193,21 @@  void blk_account_io_done(struct request *req);
 /*
  * EH timer and IO completion will both attempt to 'grab' the request, make
  * sure that only one of them succeeds. Steal the bottom bit of the
- * __deadline field for this.
+ * lq_deadline field for this.
  */
 static inline int blk_mark_rq_complete(struct request *rq)
 {
-	return test_and_set_bit(0, &rq->__deadline);
+	return test_and_set_bit(0, &rq->lq_deadline);
 }
 
 static inline void blk_clear_rq_complete(struct request *rq)
 {
-	clear_bit(0, &rq->__deadline);
+	clear_bit(0, &rq->lq_deadline);
 }
 
 static inline bool blk_rq_is_complete(struct request *rq)
 {
-	return test_bit(0, &rq->__deadline);
+	return test_bit(0, &rq->lq_deadline);
 }
 
 /*
@@ -311,15 +313,42 @@  static inline void req_set_nomerge(struct request_queue *q, struct request *req)
  * Steal a bit from this field for legacy IO path atomic IO marking. Note that
  * setting the deadline clears the bottom bit, potentially clearing the
  * completed bit. The user has to be OK with this (current ones are fine).
+ * Must be called with the request queue lock held.
  */
 static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
 {
-	rq->__deadline = time & ~0x1UL;
+	rq->lq_deadline = time & ~0x1UL;
 }
 
 static inline unsigned long blk_rq_deadline(struct request *rq)
 {
-	return rq->__deadline & ~0x1UL;
+	return rq->lq_deadline & ~0x1UL;
+}
+
+/*
+ * If the state of request @rq equals @old_s, update deadline and request state
+ * atomically to @time and @new_s. blk-mq only.
+ */
+static inline bool blk_mq_rq_set_deadline(struct request *rq,
+					  unsigned long time,
+					  enum mq_rq_state old_s,
+					  enum mq_rq_state new_s)
+{
+	unsigned long old_d, new_d;
+
+	do {
+		old_d = atomic_long_read(&rq->mq_deadline);
+		if ((old_d & MQ_RQ_STATE_MASK) != old_s)
+			return false;
+		new_d = (time & ~0x3UL) | (new_s & 3UL);
+	} while (atomic_long_cmpxchg(&rq->mq_deadline, old_d, new_d) != old_d);
+
+	return true;
+}
+
+static inline unsigned long blk_mq_rq_deadline(struct request *rq)
+{
+	return atomic_long_read(&rq->mq_deadline) & ~0x3UL;
 }
 
 /*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 8efcf49796a3..13ccbb418e89 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,7 +51,6 @@  struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 
 	atomic_t		nr_active;
-	unsigned int		nr_expired;
 
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6075d1a6760c..abf78819014b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,7 +27,6 @@ 
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
-#include <linux/seqlock.h>
 #include <linux/u64_stats_sync.h>
 
 struct module;
@@ -125,8 +124,6 @@  typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
 #define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
 
@@ -226,28 +223,15 @@  struct request {
 	unsigned int extra_len;	/* length of alignment and padding */
 
 	/*
-	 * On blk-mq, the lower bits of ->gstate (generation number and
-	 * state) carry the MQ_RQ_* state value and the upper bits the
-	 * generation number which is monotonically incremented and used to
-	 * distinguish the reuse instances.
-	 *
-	 * ->gstate_seq allows updates to ->gstate and other fields
-	 * (currently ->deadline) during request start to be read
-	 * atomically from the timeout path, so that it can operate on a
-	 * coherent set of information.
+	 * Access through blk_rq_set_deadline(), blk_rq_deadline() and
+	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
+	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_set_deadline(),
+	 * blk_mq_rq_deadline() and blk_mq_rq_state() for blk-mq queues.
 	 */
-	seqcount_t gstate_seq;
-	u64 gstate;
-
-	/*
-	 * ->aborted_gstate is used by the timeout to claim a specific
-	 * recycle instance of this request.  See blk_mq_timeout_work().
-	 */
-	struct u64_stats_sync aborted_gstate_sync;
-	u64 aborted_gstate;
-
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
-	unsigned long __deadline;
+	union {
+		unsigned long lq_deadline;
+		atomic_long_t mq_deadline;
+	};
 
 	struct list_head timeout_list;