[v13] blk-mq: Rework blk-mq timeout handling again

Message ID 20180522162515.20650-1-bart.vanassche@wdc.com (mailing list archive)
State New, archived

Commit Message

Bart Van Assche May 22, 2018, 4:25 p.m. UTC
Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change prevents the blk-mq timeout handling code from ignoring
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has been called.

Fix this race as follows:
- Reduce the gstate field from 64 to 32 bits such that cmpxchg() can
  be used to update it. Introduce deadline_seq for updating the deadline
  on 32-bit systems.
- Remove the request member variables that became superfluous due to
  this change, namely gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

Notes:
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- blk_add_timer() has been split into two functions - one for the
  legacy block layer and one for blk-mq.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Sebastian Ott <sebott@linux.ibm.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
---

Changes compared to v12:
- Switched from cmpxchg64() to cmpxchg(). This became possible because the
  deadline is now updated before the request state.
- Introduced a new request state to ensure that completions that occur while
  the timeout function is in progress are not lost.
- Left out the ARM cmpxchg64() patch.

Changes compared to v11:
- Reworked patch 1/2: instead of introducing CONFIG_ARCH_HAVE_CMPXCHG64, make
  sure that cmpxchg64() is only defined if it can be used.

Changes compared to v10:
- In patch 1/2, added "default y if 64BIT" to the "config ARCH_HAVE_CMPXCHG64"
  entry in arch/Kconfig. Left out the "select ARCH_HAVE_CMPXCHG64" statements
  that became superfluous due to this change (alpha, arm64, powerpc and s390).
- Also in patch 1/2, only select ARCH_HAVE_CMPXCHG64 if X86_CMPXCHG64 has been
  selected.
- In patch 2/2, moved blk_mq_change_rq_state() from blk-mq.h to blk-mq.c.
- Added a comment header above __blk_mq_requeue_request() and
  blk_mq_requeue_request().
- Documented the MQ_RQ_* state transitions in block/blk-mq.h.
- Left out the fourth argument of blk_mq_rq_set_deadline().

Changes compared to v9:
- Addressed multiple comments related to patch 1/2: added
  CONFIG_ARCH_HAVE_CMPXCHG64 for riscv, modified
  features/locking/cmpxchg64/arch-support.txt as requested and made the
  order of the symbols in the arch/*/Kconfig alphabetical where possible.

Changes compared to v8:
- Split into two patches.
- Moved the spin_lock_init() call from blk_mq_rq_ctx_init() into
  blk_mq_init_request().
- Fixed the deadline set by blk_add_timer().
- Surrounded the das_lock member with #ifndef CONFIG_ARCH_HAVE_CMPXCHG64 /
  #endif.

Changes compared to v7:
- Fixed the generation number mechanism. Note: with this patch applied the
  behavior of the block layer does not depend on the generation number.
- Added more 32-bit architectures to the list of architectures on which
  cmpxchg64() should not be used.

Changes compared to v6:
- Used a union instead of bit manipulations to store multiple values into
  a single 64-bit field.
- Reduced the size of the timeout field from 64 to 32 bits.
- Made sure that the block layer still builds with this patch applied
  for the sh and mips architectures.
- Fixed two sparse warnings that were introduced by this patch in the
  WRITE_ONCE() calls.

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling to prevent timeout handling from
  starting while .queue_rq() is still in progress if the timeout is very short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.


 block/blk-core.c       |   9 +-
 block/blk-mq-debugfs.c |   4 +-
 block/blk-mq.c         | 250 ++++++++++++++++++++++++++-----------------------
 block/blk-mq.h         |  54 +++++------
 block/blk-timeout.c    | 107 +++++++++++++--------
 block/blk.h            |   1 +
 include/linux/blkdev.h |  44 ++++-----
 7 files changed, 250 insertions(+), 219 deletions(-)

Comments

Jens Axboe May 22, 2018, 4:44 p.m. UTC | #1
On 5/22/18 10:25 AM, Bart Van Assche wrote:
> Recently the blk-mq timeout handling code was reworked. See also Tejun
> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
> This patch reworks the blk-mq timeout handling code again. The timeout
> handling code is simplified by introducing a state machine per request.
> This change avoids that the blk-mq timeout handling code ignores
> completions that occur after blk_mq_check_expired() has been called and
> before blk_mq_rq_timed_out() has been called.

I'll take a look at this again, getting rid of cmpxchg64 makes me
much more comfortable.
Jens Axboe May 22, 2018, 5:17 p.m. UTC | #2
On 5/22/18 10:44 AM, Jens Axboe wrote:
> 
> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>> This patch reworks the blk-mq timeout handling code again. The timeout
>> handling code is simplified by introducing a state machine per request.
>> This change avoids that the blk-mq timeout handling code ignores
>> completions that occur after blk_mq_check_expired() has been called and
>> before blk_mq_rq_timed_out() has been called.
> 
> I'll take a look at this again, getting rid of cmpxchg64 makes me
> much more comfortable.

FWIW, a quick pass on runtime testing works fine. As expected, it's
more efficient than what's currently in the kernel, testing with both
null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
is that we shrink the request size from 360 bytes to 312, and I did
a small followup patch that brings that to 304. That's a 15% reduction,
massive.
Jens Axboe May 22, 2018, 6:47 p.m. UTC | #3
On 5/22/18 11:17 AM, Jens Axboe wrote:
> On 5/22/18 10:44 AM, Jens Axboe wrote:
>>
>> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>>> This patch reworks the blk-mq timeout handling code again. The timeout
>>> handling code is simplified by introducing a state machine per request.
>>> This change avoids that the blk-mq timeout handling code ignores
>>> completions that occur after blk_mq_check_expired() has been called and
>>> before blk_mq_rq_timed_out() has been called.
>>
>> I'll take a look at this again, getting rid of cmpxchg64 makes me
>> much more comfortable.
> 
> FWIW, a quick pass on runtime testing works fine. As expected, it's
> more efficient than what's currently in the kernel, testing with both
> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
> is that we shrink the request size from 360 bytes to 312, and I did
> a small followup patch that brings that to 304. That's a 15% reduction,
> massive.

Ran into this, running block/014 from blktests:

[ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
[ 5750.723000] null: rq 00000000ff68f103 timed out
[ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 __blk_mq_complete_request+0xa6/0x0
[ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core sb_edac x86_pkg_temp_therma]
[ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 4.17.0-rc6+ #712
[ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
[ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
[ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
[ 5750.795850] RSP: 0018:ffff883ffb417d68 EFLAGS: 00010202
[ 5750.802187] RAX: 0000000000000003 RBX: ffff881ff100d800 RCX: 0000000000000000
[ 5750.810649] RDX: ffff88407fd9e040 RSI: ffff88407fd956b8 RDI: ffff881ff100d800
[ 5750.819119] RBP: ffffe8ffffd91800 R08: 0000000000000000 R09: ffffffff82066eb8
[ 5750.827588] R10: ffff883ffa386138 R11: ffff883ffa385900 R12: 0000000000000001
[ 5750.836050] R13: ffff881fe7da6000 R14: 0000000000000020 R15: 0000000000000002
[ 5750.844529] FS:  0000000000000000(0000) GS:ffff88407fd80000(0000) knlGS:0000000000000000
[ 5750.854482] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5750.861397] CR2: 00007ffc92f97f68 CR3: 000000000201d005 CR4: 00000000003606e0
[ 5750.869861] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5750.878333] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5750.886805] Call Trace:
[ 5750.890033]  bt_iter+0x42/0x50
[ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
[ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
[ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
[ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
[ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
[ 5750.920220]  process_one_work+0x21b/0x510
[ 5750.925197]  worker_thread+0x3a/0x390
[ 5750.929781]  ? process_one_work+0x510/0x510
[ 5750.934944]  kthread+0x11c/0x140
[ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
[ 5750.945169]  ret_from_fork+0x1f/0x30
[ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 75 3b 48 89 df ff 90 2 
[ 5750.972139] ---[ end trace 40065cb1764bf500 ]---

which is this:

        WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);                    

hitting when blk_mq_terminate_expired() completes the request through BLK_EH_HANDLED.
Jens Axboe May 22, 2018, 7:03 p.m. UTC | #4
On 5/22/18 12:47 PM, Jens Axboe wrote:
> On 5/22/18 11:17 AM, Jens Axboe wrote:
>> On 5/22/18 10:44 AM, Jens Axboe wrote:
>>>
>>> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>>>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>>>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>>>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>>>> This patch reworks the blk-mq timeout handling code again. The timeout
>>>> handling code is simplified by introducing a state machine per request.
>>>> This change avoids that the blk-mq timeout handling code ignores
>>>> completions that occur after blk_mq_check_expired() has been called and
>>>> before blk_mq_rq_timed_out() has been called.
>>>
>>> I'll take a look at this again, getting rid of cmpxchg64 makes me
>>> much more comfortable.
>>
>> FWIW, a quick pass on runtime testing works fine. As expected, it's
>> more efficient than what's currently in the kernel, testing with both
>> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
>> is that we shrink the request size from 360 bytes to 312, and I did
>> a small followup patch that brings that to 304. That's a 15% reduction,
>> massive.
> 
> Ran into this, running block/014 from blktests:
> 
> [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
> [ 5750.723000] null: rq 00000000ff68f103 timed out
> [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 __blk_mq_complete_request+0xa6/0x0
> [ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core sb_edac x86_pkg_temp_therma]
> [ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 4.17.0-rc6+ #712
> [ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
> [ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
> [ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
> [ 5750.795850] RSP: 0018:ffff883ffb417d68 EFLAGS: 00010202
> [ 5750.802187] RAX: 0000000000000003 RBX: ffff881ff100d800 RCX: 0000000000000000
> [ 5750.810649] RDX: ffff88407fd9e040 RSI: ffff88407fd956b8 RDI: ffff881ff100d800
> [ 5750.819119] RBP: ffffe8ffffd91800 R08: 0000000000000000 R09: ffffffff82066eb8
> [ 5750.827588] R10: ffff883ffa386138 R11: ffff883ffa385900 R12: 0000000000000001
> [ 5750.836050] R13: ffff881fe7da6000 R14: 0000000000000020 R15: 0000000000000002
> [ 5750.844529] FS:  0000000000000000(0000) GS:ffff88407fd80000(0000) knlGS:0000000000000000
> [ 5750.854482] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 5750.861397] CR2: 00007ffc92f97f68 CR3: 000000000201d005 CR4: 00000000003606e0
> [ 5750.869861] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 5750.878333] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 5750.886805] Call Trace:
> [ 5750.890033]  bt_iter+0x42/0x50
> [ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
> [ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
> [ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
> [ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
> [ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
> [ 5750.920220]  process_one_work+0x21b/0x510
> [ 5750.925197]  worker_thread+0x3a/0x390
> [ 5750.929781]  ? process_one_work+0x510/0x510
> [ 5750.934944]  kthread+0x11c/0x140
> [ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
> [ 5750.945169]  ret_from_fork+0x1f/0x30
> [ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 75 3b 48 89 df ff 90 2 
> [ 5750.972139] ---[ end trace 40065cb1764bf500 ]---
> 
> which is this:
> 
>         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);

That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
state transition. So that check should be:

        WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
                     blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);
Jens Axboe May 22, 2018, 7:38 p.m. UTC | #5
On 5/22/18 1:03 PM, Jens Axboe wrote:
> On 5/22/18 12:47 PM, Jens Axboe wrote:
>> On 5/22/18 11:17 AM, Jens Axboe wrote:
>>> On 5/22/18 10:44 AM, Jens Axboe wrote:
>>>>
>>>> On 5/22/18 10:25 AM, Bart Van Assche wrote:
>>>>> Recently the blk-mq timeout handling code was reworked. See also Tejun
>>>>> Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
>>>>> (https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
>>>>> This patch reworks the blk-mq timeout handling code again. The timeout
>>>>> handling code is simplified by introducing a state machine per request.
>>>>> This change avoids that the blk-mq timeout handling code ignores
>>>>> completions that occur after blk_mq_check_expired() has been called and
>>>>> before blk_mq_rq_timed_out() has been called.
>>>>
>>>> I'll take a look at this again, getting rid of cmpxchg64 makes me
>>>> much more comfortable.
>>>
>>> FWIW, a quick pass on runtime testing works fine. As expected, it's
>>> more efficient than what's currently in the kernel, testing with both
>>> null_blk (1 and nr_cpus worth of queues), and nvme as well. A huge win
>>> is that we shrink the request size from 360 bytes to 312, and I did
>>> a small followup patch that brings that to 304. That's a 15% reduction,
>>> massive.
>>
>> Ran into this, running block/014 from blktests:
>>
>> [ 5744.949839] run blktests block/014 at 2018-05-22 12:41:25
>> [ 5750.723000] null: rq 00000000ff68f103 timed out
>> [ 5750.728181] WARNING: CPU: 45 PID: 2480 at block/blk-mq.c:585 __blk_mq_complete_request+0xa6/0x0
>> [ 5750.738187] Modules linked in: null_blk(+) configfs nvme nvme_core sb_edac x86_pkg_temp_therma]
>> [ 5750.765509] CPU: 45 PID: 2480 Comm: kworker/45:1H Not tainted 4.17.0-rc6+ #712
>> [ 5750.774087] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
>> [ 5750.783369] Workqueue: kblockd blk_mq_timeout_work
>> [ 5750.789223] RIP: 0010:__blk_mq_complete_request+0xa6/0x110
>> [ 5750.795850] RSP: 0018:ffff883ffb417d68 EFLAGS: 00010202
>> [ 5750.802187] RAX: 0000000000000003 RBX: ffff881ff100d800 RCX: 0000000000000000
>> [ 5750.810649] RDX: ffff88407fd9e040 RSI: ffff88407fd956b8 RDI: ffff881ff100d800
>> [ 5750.819119] RBP: ffffe8ffffd91800 R08: 0000000000000000 R09: ffffffff82066eb8
>> [ 5750.827588] R10: ffff883ffa386138 R11: ffff883ffa385900 R12: 0000000000000001
>> [ 5750.836050] R13: ffff881fe7da6000 R14: 0000000000000020 R15: 0000000000000002
>> [ 5750.844529] FS:  0000000000000000(0000) GS:ffff88407fd80000(0000) knlGS:0000000000000000
>> [ 5750.854482] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 5750.861397] CR2: 00007ffc92f97f68 CR3: 000000000201d005 CR4: 00000000003606e0
>> [ 5750.869861] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> [ 5750.878333] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> [ 5750.886805] Call Trace:
>> [ 5750.890033]  bt_iter+0x42/0x50
>> [ 5750.894000]  blk_mq_queue_tag_busy_iter+0x12b/0x220
>> [ 5750.899941]  ? blk_mq_tag_to_rq+0x20/0x20
>> [ 5750.904913]  ? __rcu_read_unlock+0x50/0x50
>> [ 5750.909978]  ? blk_mq_tag_to_rq+0x20/0x20
>> [ 5750.914948]  blk_mq_timeout_work+0x14b/0x240
>> [ 5750.920220]  process_one_work+0x21b/0x510
>> [ 5750.925197]  worker_thread+0x3a/0x390
>> [ 5750.929781]  ? process_one_work+0x510/0x510
>> [ 5750.934944]  kthread+0x11c/0x140
>> [ 5750.939028]  ? kthread_create_worker_on_cpu+0x50/0x50
>> [ 5750.945169]  ret_from_fork+0x1f/0x30
>> [ 5750.949656] Code: 48 02 00 00 80 e6 80 74 29 8b 95 80 00 00 00 44 39 e2 75 3b 48 89 df ff 90 2 
>> [ 5750.972139] ---[ end trace 40065cb1764bf500 ]---
>>
>> which is this:
>>
>>         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
> 
> That check looks wrong, since TIMED_OUT -> COMPLETE is also a valid
> state transition. So that check should be:
> 
>         WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE &&
>                      blk_mq_rq_state(rq) != MQ_RQ_TIMED_OUT);

I guess it would be cleaner to actually do the transition, in
blk_mq_rq_timed_out():

        case BLK_EH_HANDLED:                                                    
                if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,                
                                                MQ_RQ_COMPLETE))                
                        __blk_mq_complete_request(req);                         
                break;        

This works for me.
Keith Busch May 22, 2018, 8:26 p.m. UTC | #6
On Tue, May 22, 2018 at 01:38:06PM -0600, Jens Axboe wrote:
> I guess it would be cleaner to actually do the transition, in
> blk_mq_rq_timed_out():
> 
>         case BLK_EH_HANDLED:                                                    
>                 if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,                
>                                                 MQ_RQ_COMPLETE))                
>                         __blk_mq_complete_request(req);                         
>                 break;        
> 
> This works for me.

Works for me as well on manual fault injection tests.

I think this change above goes back to Christoph's point earlier on usage
of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
when the driver actually knows the request has been completed before
returning the status?
Jens Axboe May 22, 2018, 8:29 p.m. UTC | #7
On 5/22/18 2:26 PM, Keith Busch wrote:
> On Tue, May 22, 2018 at 01:38:06PM -0600, Jens Axboe wrote:
>> I guess it would be cleaner to actually do the transition, in
>> blk_mq_rq_timed_out():
>>
>>         case BLK_EH_HANDLED:                                                    
>>                 if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,                
>>                                                 MQ_RQ_COMPLETE))                
>>                         __blk_mq_complete_request(req);                         
>>                 break;        
>>
>> This works for me.
> 
> Works for me as well on manual fault injection tests.
> 
> I think this change above goes back to Christoph's point earlier on usage
> of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
> when the driver actually knows the request has been completed before
> returning the status?

If the driver knows it's completed, it'd have to return BLK_EH_NOT_HANDLED.
Or BLK_EH_HANDLED would work too, since the above state transition would
then fail.
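
For illustration, here is a hedged sketch of what that looks like from a
driver's perspective. The example_cmd context, the hw_completed flag and the
example_hw_abort() helper are hypothetical and not taken from any real driver;
the point is only that with the state machine in this patch either return
value is safe, because the TIMED_OUT -> COMPLETE transition in
blk_mq_rq_timed_out() simply fails once the normal completion path has already
claimed the request:

static enum blk_eh_timer_return example_timeout(struct request *rq, bool reserved)
{
	struct example_cmd *cmd = blk_mq_rq_to_pdu(rq);	/* hypothetical per-request data */

	if (READ_ONCE(cmd->hw_completed)) {
		/*
		 * The hardware already finished this command and
		 * blk_mq_complete_request() has been (or is about to be)
		 * called, so do not ask the core to complete it for us.
		 */
		return BLK_EH_NOT_HANDLED;
	}

	example_hw_abort(cmd);		/* hypothetical: cancel the command in hardware */
	return BLK_EH_HANDLED;		/* blk-mq completes the request */
}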
Keith Busch May 22, 2018, 8:33 p.m. UTC | #8
On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> @@ -848,13 +843,22 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  	case BLK_EH_RESET_TIMER:
> +		blk_mq_add_timer(req);
>  		/*
> +		 * The loop is for the unlikely case of a race with the
> +		 * completion code. There is no need to reset the deadline
> +		 * if the transition to the in-flight state fails because
> +		 * the deadline only matters in the in-flight state.
>  		 */
> -		blk_mq_rq_update_aborted_gstate(req, 0);
> -		blk_add_timer(req);
> +		while (true) {
> +			if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> +						   MQ_RQ_IN_FLIGHT))
> +				break;
> +			if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
> +				__blk_mq_complete_request(req);
> +				break;
> +			}
> +		}

I'm having some trouble triggering this case where the state is already
MQ_RQ_COMPLETE, so I'm just trying to figure this out through inspection.

It looks like the new blk_mq_complete_request() already called
__blk_mq_complete_request() when it gets the state to MQ_RQ_COMPLETE,
so doing that again completes it a second time.
Bart Van Assche May 22, 2018, 8:36 p.m. UTC | #9
On Tue, 2018-05-22 at 14:33 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> > @@ -848,13 +843,22 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
> >  	case BLK_EH_RESET_TIMER:
> > +		blk_mq_add_timer(req);
> >  		/*
> > +		 * The loop is for the unlikely case of a race with the
> > +		 * completion code. There is no need to reset the deadline
> > +		 * if the transition to the in-flight state fails because
> > +		 * the deadline only matters in the in-flight state.
> >  		 */
> > -		blk_mq_rq_update_aborted_gstate(req, 0);
> > -		blk_add_timer(req);
> > +		while (true) {
> > +			if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
> > +						   MQ_RQ_IN_FLIGHT))
> > +				break;
> > +			if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
> > +				__blk_mq_complete_request(req);
> > +				break;
> > +			}
> > +		}
> 
> I'm having some trouble triggering this case where the state is already
> MQ_RQ_COMPLETE, so I'm just trying to figure this out through inspection.
> 
> It looks like the new blk_mq_complete_request() already called
> __blk_mq_complete_request() when it gets the state to MQ_RQ_COMPLETE,
> so doing that again completes it a second time.

Hello Keith,

Have you noticed that if blk_mq_complete_request() encounters a request with
state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I think
the code in blk_mq_complete_request() together with the above code guarantees
that __blk_mq_complete_request() is only called once per request generation.

Bart.
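
To make the guarantee Bart describes concrete, here is a worked interleaving
for the BLK_EH_RESET_TIMER case, based on the loops in this patch (illustration
only):

   timeout path:     IN_FLIGHT -> TIMED_OUT transition succeeds, .timeout() runs
   completion path:  IN_FLIGHT -> COMPLETE fails, TIMED_OUT -> COMPLETE succeeds,
                     returns without calling __blk_mq_complete_request()
   timeout path:     .timeout() returns BLK_EH_RESET_TIMER,
                     TIMED_OUT -> IN_FLIGHT fails, state is MQ_RQ_COMPLETE,
                     so __blk_mq_complete_request() is called exactly once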
Keith Busch May 22, 2018, 8:40 p.m. UTC | #10
On Tue, May 22, 2018 at 08:36:27PM +0000, Bart Van Assche wrote:
> Have you noticed that if blk_mq_complete_request() encounters a request with
> state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I think
> the code in blk_mq_complete_request() together with the above code guarantees
> that __blk_mq_complete_request() is only called once per request generation.

Right, my mistake. I noticed that when I saw your reply on the EH_HANDLED
case, so looks fine.
Keith Busch May 22, 2018, 8:44 p.m. UTC | #11
On Tue, May 22, 2018 at 08:36:27PM +0000, Bart Van Assche wrote:
> 
> Have you noticed that if blk_mq_complete_request() encounters a request with
> state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I think
> the code in blk_mq_complete_request() together with the above code guarantees
> that __blk_mq_complete_request() is only called once per request generation.

Okay, now to the BLK_EH_NOT_HANDLED case: that's supposedly the correct
status to return if the driver knows blk_mq_complete_request() was called
prior to returning from the timeout handler, so we need a similar check
there, right?
Bart Van Assche May 22, 2018, 8:47 p.m. UTC | #12
On Tue, 2018-05-22 at 14:44 -0600, Keith Busch wrote:
> On Tue, May 22, 2018 at 08:36:27PM +0000, Bart Van Assche wrote:
> > Have you noticed that if blk_mq_complete_request() encounters a request with
> > state MQ_RQ_TIMED_OUT that it doesn't call __blk_mq_complete_request()? I think
> > the code in blk_mq_complete_request() together with the above code guarantees
> > that __blk_mq_complete_request() is only called once per request generation.
> 
> Okay, now to the BLK_EH_NOT_HANDLED case: that's supposedly the correct
> status to return if the driver knows blk_mq_complete_request() was called
> prior to returning from the timeout handler, so we need a similar check
> there, right?

Good catch. To me that seems like the right place to handle that case.

Bart.
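
A minimal sketch of the check being discussed, as it might look in the
BLK_EH_NOT_HANDLED arm of blk_mq_rq_timed_out() (this is an assumption about
the shape of such a fix, not code from this series):

	case BLK_EH_NOT_HANDLED:
		/*
		 * Sketch only: if blk_mq_complete_request() ran while the
		 * timeout handler was in progress, it moved the request from
		 * MQ_RQ_TIMED_OUT to MQ_RQ_COMPLETE and left the actual
		 * completion to the timeout path, so finish it here.
		 */
		if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE)
			__blk_mq_complete_request(req);
		break;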
Christoph Hellwig May 22, 2018, 9:02 p.m. UTC | #13
On Tue, May 22, 2018 at 02:29:21PM -0600, Jens Axboe wrote:
> > of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
> > when the driver actually knows the request has been completed before
> > returning the status?
> 
> If the driver knows it's completed, it'd have to return BLK_EH_NOT_HANDLED.
> Or BLK_EH_HANDLED would work too, since the above state transition would
> then fail.

Btw, I think we should just kill off BLK_EH_HANDLED.  WIP totally
untested progress toward that is here:

http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-ret

The only real missing bit is SCSI overloading the value for internal
use.
Jens Axboe May 22, 2018, 9:02 p.m. UTC | #14
On 5/22/18 3:02 PM, Christoph Hellwig wrote:
> On Tue, May 22, 2018 at 02:29:21PM -0600, Jens Axboe wrote:
>>> of BLK_EH_HANDLED. Is the driver supposed to return BLK_EH_NOT_HANDLED
>>> when the driver actually knows the request has been completed before
>>> returning the status?
>>
>> If the driver knows it's completed, it'd have to return BLK_EH_NOT_HANDLED.
>> Or BLK_EH_HANDLED would work too, since the above state transition would
>> then fail.
> 
> Btw, I think we should just kill off BLK_EH_HANDLED.  WIP totally
> untested progress toward that is here:
> 
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/blk-eh-ret
> 
> The only real missing bit is SCSI overloading the value for internal
> use.

I think that's a great idea; the use cases aren't clear at all, and driver
writers will get this wrong.
Keith Busch May 23, 2018, 2:02 p.m. UTC | #15
On Tue, May 22, 2018 at 09:25:15AM -0700, Bart Van Assche wrote:
> +static bool blk_mq_change_rq_state(struct request *rq,
> +				   enum mq_rq_state old_state,
> +				   enum mq_rq_state new_state)
> +{
> +	union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
> +	union blk_generation_and_state old_val = gstate;
> +	union blk_generation_and_state new_val = gstate;
> +
> +	old_val.state = old_state;
> +	new_val.state = new_state;
> +	if (new_state == MQ_RQ_IN_FLIGHT)
> +		new_val.generation++;
> +	/*
> +	 * For transitions from state in-flight to another state cmpxchg()
> +	 * must be used. For other state transitions it is safe to use
> +	 * WRITE_ONCE().
> +	 */
> +	if (old_state != MQ_RQ_IN_FLIGHT) {
> +		WRITE_ONCE(rq->gstate.val, new_val.val);
> +		return true;
> +	}
> +	return blk_mq_set_rq_state(rq, old_val, new_val);
> +}

<snip>

>  void blk_mq_complete_request(struct request *rq)
>  {
>  	struct request_queue *q = rq->q;
> -	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
> -	int srcu_idx;
>  
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
>  
> -	/*
> -	 * If @rq->aborted_gstate equals the current instance, timeout is
> -	 * claiming @rq and we lost.  This is synchronized through
> -	 * hctx_lock().  See blk_mq_timeout_work() for details.
> -	 *
> -	 * Completion path never blocks and we can directly use RCU here
> -	 * instead of hctx_lock() which can be either RCU or SRCU.
> -	 * However, that would complicate paths which want to synchronize
> -	 * against us.  Let stay in sync with the issue path so that
> -	 * hctx_lock() covers both issue and completion paths.
> -	 */
> -	hctx_lock(hctx, &srcu_idx);
> -	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
> -		__blk_mq_complete_request(rq);
> -	hctx_unlock(hctx, srcu_idx);
> +	/* The loop is for the unlikely case of a race with the timeout code. */
> +	while (true) {
> +		if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT,
> +					   MQ_RQ_COMPLETE)) {
> +			__blk_mq_complete_request(rq);
> +			break;
> +		}
> +		if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
> +			break;
> +	}
>  }

Looks like the cmpxchg is also needed if old_state is MQ_RQ_TIMED_OUT,
otherwise it's guaranteed to return 'true' and there's no point to the
loop and 'if' check.
Keith Busch May 23, 2018, 2:08 p.m. UTC | #16
On Wed, May 23, 2018 at 08:02:31AM -0600, Keith Busch wrote:
> Looks like the cmpxchg is also needed if old_state is MQ_RQ_TIMED_OUT,
> otherwise its guaranteed to return 'true' and there's no point to the
> loop and 'if' check.

And I see v14 is already posted with that fix! :)

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 341501c5e239..42b055292cdc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,9 @@  void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * See comment of blk_mq_init_request
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
+#ifndef CONFIG_64BIT
+	seqcount_init(&rq->deadline_seq);
+#endif
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..9c539ab2c0dc 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,15 +344,15 @@  static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
 
 static const char *const blk_mq_rq_state_name_array[] = {
 	[MQ_RQ_IDLE]		= "idle",
-	[MQ_RQ_IN_FLIGHT]	= "in_flight",
+	[MQ_RQ_IN_FLIGHT]	= "in flight",
 	[MQ_RQ_COMPLETE]	= "complete",
+	[MQ_RQ_TIMED_OUT]	= "timed out",
 };
 
 static const char *blk_mq_rq_state_name(enum mq_rq_state rq_state)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 64630caaf27e..6bfc7679a5df 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -319,6 +319,7 @@  static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	/* tag was already set */
 	rq->extra_len = 0;
 	rq->__deadline = 0;
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
@@ -465,6 +466,39 @@  struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
+ *
+ * Returns %true if and only if the old state was @old and if the state has
+ * been changed into @new.
+ */
+static bool blk_mq_change_rq_state(struct request *rq,
+				   enum mq_rq_state old_state,
+				   enum mq_rq_state new_state)
+{
+	union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
+	union blk_generation_and_state old_val = gstate;
+	union blk_generation_and_state new_val = gstate;
+
+	old_val.state = old_state;
+	new_val.state = new_state;
+	if (new_state == MQ_RQ_IN_FLIGHT)
+		new_val.generation++;
+	/*
+	 * For transitions from state in-flight to another state cmpxchg()
+	 * must be used. For other state transitions it is safe to use
+	 * WRITE_ONCE().
+	 */
+	if (old_state != MQ_RQ_IN_FLIGHT) {
+		WRITE_ONCE(rq->gstate.val, new_val.val);
+		return true;
+	}
+	return blk_mq_set_rq_state(rq, old_val, new_val);
+}
+
 void blk_mq_free_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -494,7 +528,8 @@  void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+		WARN_ON_ONCE(true);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -547,8 +582,7 @@  static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -593,36 +627,6 @@  static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-	unsigned long flags;
-
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context.  u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup.  Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-	return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -634,27 +638,20 @@  static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost.  This is synchronized through
-	 * hctx_lock().  See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us.  Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
-		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
+	/* The loop is for the unlikely case of a race with the timeout code. */
+	while (true) {
+		if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT,
+					   MQ_RQ_COMPLETE)) {
+			__blk_mq_complete_request(rq);
+			break;
+		}
+		if (blk_mq_change_rq_state(rq, MQ_RQ_TIMED_OUT, MQ_RQ_COMPLETE))
+			break;
+	}
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -681,27 +678,8 @@  void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, rq);
 	}
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-
-	/*
-	 * Mark @rq in-flight which also advances the generation number,
-	 * and register for timeout.  Protect with a seqcount to allow the
-	 * timeout path to read both @rq->gstate and @rq->deadline
-	 * coherently.
-	 *
-	 * This is the only place where a request is marked in-flight.  If
-	 * the timeout path reads an in-flight @rq->gstate, the
-	 * @rq->deadline it reads together under @rq->gstate_seq is
-	 * guaranteed to be the matching one.
-	 */
-	preempt_disable();
-	write_seqcount_begin(&rq->gstate_seq);
-
-	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
-	blk_add_timer(rq);
-
-	write_seqcount_end(&rq->gstate_seq);
-	preempt_enable();
+	blk_mq_add_timer(rq);
+	blk_mq_change_rq_state(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -714,27 +692,46 @@  void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
+/**
+ * __blk_mq_requeue_request - requeue a request
+ * @rq: request to be requeued
+ *
+ * This function is either called by blk_mq_requeue_request() or by the block
+ * layer core if .queue_rq() returned BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE.
+ * If the request state is MQ_RQ_IN_FLIGHT and if this function is called from
+ * inside .queue_rq() then it is guaranteed that the timeout code won't try to
+ * modify the request state while this function is in progress because an RCU
+ * read lock is held around .queue_rq() and because the timeout code calls
+ * synchronize_rcu() after having marked requests and before processing
+ * time-outs.
  */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, rq);
 
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
-		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (old_state != MQ_RQ_IDLE) {
+		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+			WARN_ON_ONCE(true);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
 }
 
+/**
+ * blk_mq_requeue_request - requeue a request
+ * @rq: request to be requeued
+ * @kick_requeue_list: whether or not to kick the requeue_list
+ *
+ * This function is called after a request has completed, after a request has
+ * timed out or from inside .queue_rq(). In the latter case the request may
+ * already have been started.
+ */
 void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
 {
 	__blk_mq_requeue_request(rq);
@@ -838,8 +835,6 @@  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -848,13 +843,22 @@  static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
+		blk_mq_add_timer(req);
 		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
+		 * The loop is for the unlikely case of a race with the
+		 * completion code. There is no need to reset the deadline
+		 * if the transition to the in-flight state fails because
+		 * the deadline only matters in the in-flight state.
 		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		while (true) {
+			if (blk_mq_change_rq_state(req, MQ_RQ_TIMED_OUT,
+						   MQ_RQ_IN_FLIGHT))
+				break;
+			if (blk_mq_rq_state(req) == MQ_RQ_COMPLETE) {
+				__blk_mq_complete_request(req);
+				break;
+			}
+		}
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -868,48 +872,60 @@  static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long gstate, deadline;
-	int start;
+	union blk_generation_and_state gstate = READ_ONCE(rq->gstate);
+	unsigned long deadline;
 
 	might_sleep();
 
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
-		return;
-
-	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
+#ifdef CONFIG_64BIT
+	deadline = rq->__deadline;
+#else
 	while (true) {
-		start = read_seqcount_begin(&rq->gstate_seq);
-		gstate = READ_ONCE(rq->gstate);
-		deadline = blk_rq_deadline(rq);
-		if (!read_seqcount_retry(&rq->gstate_seq, start))
+		int start;
+
+		start = read_seqcount_begin(&rq->deadline_seq);
+		deadline = rq->__deadline;
+		if (!read_seqcount_retry(&rq->deadline_seq, start))
 			break;
 		cond_resched();
 	}
+#endif
 
-	/* if in-flight && overdue, mark for abortion */
-	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
+	/*
+	 * Make sure that rq->aborted_gstate != rq->gstate if rq->deadline has
+	 * not yet been reached even if a request gets recycled before
+	 * blk_mq_terminate_expired() is called and the value of rq->deadline
+	 * is not modified due to the request recycling.
+	 */
+	rq->aborted_gstate = gstate;
+	rq->aborted_gstate.generation ^= (1UL << 29);
+	if (gstate.state == MQ_RQ_IN_FLIGHT &&
 	    time_after_eq(jiffies, deadline)) {
-		blk_mq_rq_update_aborted_gstate(rq, gstate);
+		/* request timed out */
+		rq->aborted_gstate = gstate;
 		data->nr_expired++;
 		hctx->nr_expired++;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
+
 }
 
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
+	union blk_generation_and_state old_val = rq->aborted_gstate;
+	union blk_generation_and_state new_val = old_val;
+
+	new_val.state = MQ_RQ_TIMED_OUT;
+
 	/*
-	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
-	 * completions that we lost to, they would have finished and
-	 * updated @rq->gstate by now; otherwise, the completion path is
-	 * now guaranteed to see @rq->aborted_gstate and yield.  If
-	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 * We marked @rq->aborted_gstate and waited for ongoing .queue_rq()
+	 * calls. If rq->gstate has not changed that means that it
+	 * is now safe to change the request state and to handle the timeout.
 	 */
-	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
+	if (blk_mq_set_rq_state(rq, old_val, new_val))
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
@@ -948,10 +964,12 @@  static void blk_mq_timeout_work(struct work_struct *work)
 		bool has_rcu = false;
 
 		/*
-		 * Wait till everyone sees ->aborted_gstate.  The
-		 * sequential waits for SRCUs aren't ideal.  If this ever
-		 * becomes a problem, we can add per-hw_ctx rcu_head and
-		 * wait in parallel.
+		 * For very short timeouts it can happen that
+		 * blk_mq_check_expired() modifies the state of a request
+		 * while .queue_rq() is still in progress. Hence wait until
+		 * these .queue_rq() calls have finished. This is also
+		 * necessary to avoid races with blk_mq_requeue_request() for
+		 * requests that have already been started.
 		 */
 		queue_for_each_hw_ctx(q, hctx, i) {
 			if (!hctx->nr_expired)
@@ -967,7 +985,7 @@  static void blk_mq_timeout_work(struct work_struct *work)
 		if (has_rcu)
 			synchronize_rcu();
 
-		/* terminate the ones we won */
+		/* Terminate the requests marked by blk_mq_check_expired(). */
 		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
 	}
 
@@ -2063,14 +2081,9 @@  static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 			return ret;
 	}
 
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * start gstate with gen 1 instead of 0, otherwise it will be equal
-	 * to aborted_gstate, and be identified timed out by
-	 * blk_mq_terminate_expired.
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
+#ifndef CONFIG_64BIT
+	seqcount_init(&rq->deadline_seq);
+#endif
 
 	return 0;
 }
@@ -3105,7 +3118,8 @@  static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 
 	hrtimer_init_sleeper(&hs, current);
 	do {
-		if (blk_mq_rq_state(rq) == MQ_RQ_COMPLETE)
+		if (blk_mq_rq_state(rq) == MQ_RQ_COMPLETE ||
+		    blk_mq_rq_state(rq) == MQ_RQ_TIMED_OUT)
 			break;
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		hrtimer_start_expires(&hs.timer, mode);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e1bb420dc5d6..7b810c934732 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -2,6 +2,7 @@ 
 #ifndef INT_BLK_MQ_H
 #define INT_BLK_MQ_H
 
+#include <asm/cmpxchg.h>
 #include "blk-stat.h"
 #include "blk-mq-tag.h"
 
@@ -30,18 +31,25 @@  struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-/*
- * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
+/**
+ * enum mq_rq_state - blk-mq request state
+ *
+ * The legal state transitions are:
+ * - idle      -> in-flight: blk_mq_start_request()
+ * - in-flight -> complete:  blk_mq_complete_request()
+ * - timed out -> complete:  blk_mq_complete_request()
+ * - in-flight -> timed out: request times out
+ * - complete/tmo -> idle:   blk_mq_requeue_request() or blk_mq_free_request()
+ * - in-flight -> idle:      blk_mq_requeue_request() or blk_mq_free_request()
+ * - timed out -> in-flight: request restart due to BLK_EH_RESET_TIMER
+ *
+ * See also blk_generation_and_state.state.
  */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
-
-	MQ_RQ_STATE_BITS	= 2,
-	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
+	MQ_RQ_TIMED_OUT		= 3,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -103,37 +111,21 @@  extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_release(struct request_queue *q);
 
-/**
- * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
- * @rq: target request.
- */
-static inline int blk_mq_rq_state(struct request *rq)
+static inline bool blk_mq_set_rq_state(struct request *rq,
+				       union blk_generation_and_state old_val,
+				       union blk_generation_and_state new_val)
 {
-	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+	return cmpxchg(&rq->gstate.val, old_val.val, new_val.val) ==
+		old_val.val;
 }
 
 /**
- * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
+ * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
- * @state: new state to set.
- *
- * Set @rq's state to @state.  The caller is responsible for ensuring that
- * there are no other updaters.  A request can transition into IN_FLIGHT
- * only from IDLE and doing so increments the generation number.
  */
-static inline void blk_mq_rq_update_state(struct request *rq,
-					  enum mq_rq_state state)
+static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
-	}
-
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
+	return READ_ONCE(rq->gstate).state;
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..3abbaa332a91 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -145,6 +145,22 @@  void blk_timeout_work(struct work_struct *work)
 	spin_unlock_irqrestore(q->queue_lock, flags);
 }
 
+/* Update deadline to @time. May be called from interrupt context. */
+static void blk_mq_rq_set_deadline(struct request *rq, unsigned long new_time)
+{
+#ifdef CONFIG_64BIT
+	rq->__deadline = new_time;
+#else
+	unsigned long flags;
+
+	local_irq_save(flags);
+	write_seqcount_begin(&rq->deadline_seq);
+	rq->__deadline = new_time;
+	write_seqcount_end(&rq->deadline_seq);
+	local_irq_restore(flags);
+#endif
+}
+
 /**
  * blk_abort_request -- Request request recovery for the specified command
  * @req:	pointer to the request of interest
@@ -158,11 +174,10 @@  void blk_abort_request(struct request *req)
 {
 	if (req->q->mq_ops) {
 		/*
-		 * All we need to ensure is that timeout scan takes place
-		 * immediately and that scan sees the new timeout value.
-		 * No need for fancy synchronizations.
+		 * Ensure that a timeout scan takes place immediately and that
+		 * that scan sees the new timeout value.
 		 */
-		blk_rq_set_deadline(req, jiffies);
+		blk_mq_rq_set_deadline(req, jiffies);
 		kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
@@ -184,52 +199,17 @@  unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
-/**
- * blk_add_timer - Start timeout timer for a single request
- * @req:	request that is about to start running.
- *
- * Notes:
- *    Each request has its own timer, and as it is added to the queue, we
- *    set up the timer. When the request completes, we cancel the timer.
- */
-void blk_add_timer(struct request *req)
+static void __blk_add_timer(struct request *req, unsigned long deadline)
 {
 	struct request_queue *q = req->q;
 	unsigned long expiry;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
-
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
-		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
-	if (!req->timeout)
-		req->timeout = q->rq_timeout;
-
-	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
-
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
 	/*
 	 * If the timer isn't already pending or this timeout is earlier
 	 * than an existing one, modify the timer. Round up to next nearest
 	 * second.
 	 */
-	expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
-
+	expiry = blk_rq_timeout(round_jiffies_up(deadline));
 	if (!timer_pending(&q->timeout) ||
 	    time_before(expiry, q->timeout.expires)) {
 		unsigned long diff = q->timeout.expires - expiry;
@@ -244,5 +224,50 @@  void blk_add_timer(struct request *req)
 		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
 			mod_timer(&q->timeout, expiry);
 	}
+}
 
+/**
+ * blk_add_timer - Start timeout timer for a single request
+ * @req:	request that is about to start running.
+ *
+ * Notes:
+ *    Each request has its own timer, and as it is added to the queue, we
+ *    set up the timer. When the request completes, we cancel the timer.
+ */
+void blk_add_timer(struct request *req)
+{
+	struct request_queue *q = req->q;
+	unsigned long deadline;
+
+	lockdep_assert_held(q->queue_lock);
+
+	if (!q->rq_timed_out_fn)
+		return;
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
+
+	deadline = jiffies + req->timeout;
+	blk_rq_set_deadline(req, deadline);
+	list_add_tail(&req->timeout_list, &req->q->timeout_list);
+
+	return __blk_add_timer(req, deadline);
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req:	request for which to set the deadline.
+ *
+ * Sets the deadline of a request. The caller must guarantee that the request
+ * state won't be modified while this function is in progress.
+ */
+void blk_mq_add_timer(struct request *req)
+{
+	struct request_queue *q = req->q;
+	unsigned long deadline;
+
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
+	deadline = jiffies + req->timeout;
+	blk_mq_rq_set_deadline(req, deadline);
+	return __blk_add_timer(req, deadline);
 }
diff --git a/block/blk.h b/block/blk.h
index eaf1a8e87d11..ffd44cbf3095 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,7 @@  static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req);
 void blk_delete_timer(struct request *);
 
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f3999719f828..acc08806a6e5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -7,6 +7,7 @@ 
 
 #ifdef CONFIG_BLOCK
 
+#include <linux/atomic.h>	/* cmpxchg() */
 #include <linux/major.h>
 #include <linux/genhd.h>
 #include <linux/list.h>
@@ -28,7 +29,6 @@ 
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
 #include <linux/seqlock.h>
-#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -125,15 +125,21 @@  typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 20))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
 	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
 
+union blk_generation_and_state {
+	struct {
+		uint32_t generation:30;
+		uint32_t state:2;
+	};
+	uint32_t val;
+};
+
 /*
  * Try to put the fields that are referenced together in the same cacheline.
  *
@@ -236,28 +242,24 @@  struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
-	/*
-	 * On blk-mq, the lower bits of ->gstate (generation number and
-	 * state) carry the MQ_RQ_* state value and the upper bits the
-	 * generation number which is monotonically incremented and used to
-	 * distinguish the reuse instances.
-	 *
-	 * ->gstate_seq allows updates to ->gstate and other fields
-	 * (currently ->deadline) during request start to be read
-	 * atomically from the timeout path, so that it can operate on a
-	 * coherent set of information.
-	 */
-	seqcount_t gstate_seq;
-	u64 gstate;
-
 	/*
 	 * ->aborted_gstate is used by the timeout to claim a specific
 	 * recycle instance of this request.  See blk_mq_timeout_work().
 	 */
-	struct u64_stats_sync aborted_gstate_sync;
-	u64 aborted_gstate;
+	union blk_generation_and_state aborted_gstate;
 
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
+	/*
+	 * Access through blk_rq_deadline() and blk_rq_set_deadline(),
+	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
+	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_state(),
+	 * blk_mq_change_rq_state() and blk_mq_rq_set_deadline() for
+	 * blk-mq queues. deadline_seq protects __deadline in blk-mq mode
+	 * only.
+	 */
+	union blk_generation_and_state gstate;
+#ifndef CONFIG_64BIT
+	seqcount_t deadline_seq;
+#endif
 	unsigned long __deadline;
 
 	struct list_head timeout_list;