
[2/6] blk-mq: replace timeout synchronization with a RCU and generation based scheme

Message ID 20171212190134.535941-3-tj@kernel.org (mailing list archive)
State New, archived

Commit Message

Tejun Heo Dec. 12, 2017, 7:01 p.m. UTC
Currently, blk-mq timeout path synchronizes against the usual
issue/completion path using a complex scheme involving atomic
bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
rules.  Unfortunatley, it contains quite a few holes.

There's a complex dancing around REQ_ATOM_STARTED and
REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
they don't have a synchronization point across request recycle
instances and it isn't clear what the barriers add.
blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
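
Schematically (an illustration of the above, with N-2, N-1 and N denoting
successive recycle instances of the same tag):

	timeout worker				issue/completion path

	reads STARTED (set by N-2)
						N-2 completes, tag recycled as N-1
	reads ->deadline (set by N-1)
						N-1 completes, tag recycled as N
	blk_mark_rq_complete()
	blk_mq_rq_timed_out()
						instance N, which never timed
						out, gets terminated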

In fact, it's pretty easy to make blk_mq_check_expired() terminate a
later instance of a request.  If we induce 5 sec delay before
time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
2s, and issue back-to-back large IOs, blk-mq starts timing out
requests spuriously pretty quickly.  Nothing actually timed out.  It
just made the call on a recycle instance of a request and then
terminated a later instance long after the original instance finished.
The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with a RCU
and generation number based one.

1. Each request has a u64 generation + state value, which can be
   updated only by the request owner.  Whenever a request becomes
   in-flight, the generation number gets bumped up too.  This provides
   the basis for the timeout path to distinguish different recycle
   instances of the request.

   Also, marking a request in-flight and setting its deadline are
   protected with a seqcount so that the timeout path can fetch both
   values coherently.

2. The timeout path fetches the generation, state and deadline.  If
   the verdict is timeout, it records the generation into a dedicated
   request abortion field and does RCU wait.

3. The completion path is also protected by RCU (from the previous
   patch) and checks whether the current generation number and state
   match the abortion field.  If so, it skips completion.

4. The timeout path, after RCU wait, scans requests again and
   terminates the ones whose generation and state still match the ones
   requested for abortion.

   By now, the timeout path knows that either the generation number
   and state changed if it lost the race or the completion will yield
   to it and can safely timeout the request.

While it's more lines of code, it's conceptually simpler, doesn't
depend on direct use of subtle memory ordering or coherence, and
hopefully doesn't terminate the wrong instance.
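
To make the encoding concrete, a condensed standalone sketch (illustrative
only; a simplified userspace mirror of the blk_mq_rq_update_state() helper
and of the post-RCU ownership check added by this patch, with made-up names
for the example):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum mq_rq_state {
	MQ_RQ_IDLE		= 0,
	MQ_RQ_IN_FLIGHT		= 1,

	MQ_RQ_STATE_BITS	= 2,
	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
};

/* replace the state bits; entering IN_FLIGHT starts a new recycle instance */
static uint64_t gstate_update(uint64_t gstate, enum mq_rq_state state)
{
	uint64_t new_val = (gstate & ~(uint64_t)MQ_RQ_STATE_MASK) | state;

	if (state == MQ_RQ_IN_FLIGHT)
		new_val += MQ_RQ_GEN_INC;
	return new_val;
}

int main(void)
{
	uint64_t gstate = MQ_RQ_IDLE;
	uint64_t aborted_gstate;

	gstate = gstate_update(gstate, MQ_RQ_IN_FLIGHT);  /* instance #1 issued */
	aborted_gstate = gstate;   /* timeout path decides to abort instance #1 */

	/* instance #1 completes and the tag is recycled before the timeout
	 * path gets around to terminating it */
	gstate = gstate_update(gstate, MQ_RQ_IDLE);
	gstate = gstate_update(gstate, MQ_RQ_IN_FLIGHT);  /* instance #2 issued */

	/* post-RCU-wait check: generations differ, the new instance is spared */
	assert(gstate != aborted_gstate);
	printf("aborted=%#llx current=%#llx -> not ours, skip\n",
	       (unsigned long long)aborted_gstate,
	       (unsigned long long)gstate);
	return 0;
}

Compiled and run, the assert holds because the recycled instance carries a
newer generation than the one recorded for abortion, which is exactly why the
timeout path leaves it alone.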

While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
removed yet as it's still used in other places.  Future patches will
move all state tracking to the new mechanism and remove all bitops in
the hot paths.

v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
---
 block/blk-core.c       |   2 +
 block/blk-mq.c         | 206 +++++++++++++++++++++++++++++++------------------
 block/blk-mq.h         |  45 +++++++++++
 block/blk-timeout.c    |   2 +-
 block/blk.h            |   6 --
 include/linux/blk-mq.h |   1 +
 include/linux/blkdev.h |  23 ++++++
 7 files changed, 204 insertions(+), 81 deletions(-)

Comments

Bart Van Assche Dec. 12, 2017, 9:37 p.m. UTC | #1
On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> +/*
> + * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
> + * and the upper bits the generation number.
> + */
> +enum mq_rq_state {
> +	MQ_RQ_IDLE		= 0,
> +	MQ_RQ_IN_FLIGHT		= 1,
> +
> +	MQ_RQ_STATE_BITS	= 2,
> +	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
> +	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
> +};
> +
> @@ -85,6 +98,38 @@ extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
> +/**
> + * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
> + * @rq: target request.
> + */
> +static inline int blk_mq_rq_state(struct request *rq)
> +{
> +	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
> +}
> +
> +/**
> + * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
> + * @rq: target request.
> + * @state: new state to set.
> + *
> + * Set @rq's state to @state.  The caller is responsible for ensuring that
> + * there are no other updaters.  A request can transition into IN_FLIGHT
> + * only from IDLE and doing so increments the generation number.
> + */
> +static inline void blk_mq_rq_update_state(struct request *rq,
> +					  enum mq_rq_state state)
> +{
> +	u64 new_val = (READ_ONCE(rq->gstate) & ~MQ_RQ_STATE_MASK) | state;
> +
> +	if (state == MQ_RQ_IN_FLIGHT) {
> +		WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
> +		new_val += MQ_RQ_GEN_INC;
> +	}
> +
> +	/* avoid exposing interim values */
> +	WRITE_ONCE(rq->gstate, new_val);
> +}

Hello Tejun,

Have you considered the following instead of introducing MQ_RQ_IDLE and
MQ_RQ_IN_FLIGHT? I think this could help to limit the number of new atomic
operations introduced in the hot path by this patch series.

static inline bool blk_mq_rq_in_flight(struct request *rq)
{
	return list_empty(&rq->queuelist);
}

Thanks,

Bart.
Tejun Heo Dec. 12, 2017, 9:44 p.m. UTC | #2
Hello, Bart.

On Tue, Dec 12, 2017 at 09:37:11PM +0000, Bart Van Assche wrote:
> Have you considered the following instead of introducing MQ_RQ_IDLE and
> MQ_RQ_IN_FLIGHT? I think this could help to limit the number of new atomic
> operations introduced in the hot path by this patch series.

But nothing in the hot paths is atomic.

> static inline bool blk_mq_rq_in_flight(struct request *rq)
> {
> 	return list_empty(&rq->queuelist);
> }

And the fact that we encode the generation number and state into a
single variable contributes to not needing atomic operations.
Breaking up the state and generation like the above would need more
synchronization, not less.

Thanks.
jianchao.wang Dec. 13, 2017, 5:07 a.m. UTC | #3
Hi Tejun

On 12/13/2017 03:01 AM, Tejun Heo wrote:
> Currently, blk-mq timeout path synchronizes against the usual
> issue/completion path using a complex scheme involving atomic
> bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
> rules.  Unfortunatley, it contains quite a few holes.
> 
> There's a complex dancing around REQ_ATOM_STARTED and
> REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
> they don't have a synchronization point across request recycle
> instances and it isn't clear what the barriers add.
> blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
> deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
> 
> In fact, it's pretty easy to make blk_mq_check_expired() terminate a
> later instance of a request.  If we induce 5 sec delay before
> time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
> 2s, and issue back-to-back large IOs, blk-mq starts timing out
> requests spuriously pretty quickly.  Nothing actually timed out.  It
> just made the call on a recycle instance of a request and then
> terminated a later instance long after the original instance finished.
> The scenario isn't theoretical either.
> 
> This patch replaces the broken synchronization mechanism with a RCU
> and generation number based one.
> 
> 1. Each request has a u64 generation + state value, which can be
>    updated only by the request owner.  Whenever a request becomes
>    in-flight, the generation number gets bumped up too.  This provides
>    the basis for the timeout path to distinguish different recycle
>    instances of the request.
> 
>    Also, marking a request in-flight and setting its deadline are
>    protected with a seqcount so that the timeout path can fetch both
>    values coherently.
> 
> 2. The timeout path fetches the generation, state and deadline.  If
>    the verdict is timeout, it records the generation into a dedicated
>    request abortion field and does RCU wait.
> 
> 3. The completion path is also protected by RCU (from the previous
>    patch) and checks whether the current generation number and state
>    match the abortion field.  If so, it skips completion.
> 
> 4. The timeout path, after RCU wait, scans requests again and
>    terminates the ones whose generation and state still match the ones
>    requested for abortion.
> 
>    By now, the timeout path knows that either the generation number
>    and state changed if it lost the race or the completion will yield
>    to it and can safely timeout the request.
> 
> While it's more lines of code, it's conceptually simpler, doesn't
> depend on direct use of subtle memory ordering or coherence, and
> hopefully doesn't terminate the wrong instance.
> 
> While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
> between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
> removed yet as it's still used in other places.  Future patches will
> move all state tracking to the new mechanism and remove all bitops in
> the hot paths.
> 
> v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
>     - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
>     - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
> ---
>  block/blk-core.c       |   2 +
>  block/blk-mq.c         | 206 +++++++++++++++++++++++++++++++------------------
>  block/blk-mq.h         |  45 +++++++++++
>  block/blk-timeout.c    |   2 +-
>  block/blk.h            |   6 --
>  include/linux/blk-mq.h |   1 +
>  include/linux/blkdev.h |  23 ++++++
>  7 files changed, 204 insertions(+), 81 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b888175..6034623 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->start_time = jiffies;
>  	set_start_time_ns(rq);
>  	rq->part = NULL;
> +	seqcount_init(&rq->gstate_seq);
> +	u64_stats_init(&rq->aborted_gstate_sync);
>  }
>  EXPORT_SYMBOL(blk_rq_init);
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index acf4fbb..b4e733b 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -530,6 +530,9 @@ static void __blk_mq_complete_request(struct request *rq)
>  	bool shared = false;
>  	int cpu;
>  
> +	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
> +
>  	if (rq->internal_tag != -1)
>  		blk_mq_sched_completed_request(rq);
>  	if (rq->rq_flags & RQF_STATS) {
> @@ -557,6 +560,19 @@ static void __blk_mq_complete_request(struct request *rq)
>  	put_cpu();
>  }
>  
> +static u64 blk_mq_rq_aborted_gstate(struct request *rq)
> +{
> +	unsigned int start;
> +	u64 aborted_gstate;
> +
> +	do {
> +		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
> +		aborted_gstate = rq->aborted_gstate;
> +	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
> +
> +	return aborted_gstate;
> +}
> +
>  /**
>   * blk_mq_complete_request - end I/O on a request
>   * @rq:		the request being processed
> @@ -574,14 +590,21 @@ void blk_mq_complete_request(struct request *rq)
>  	if (unlikely(blk_should_fake_timeout(q)))
>  		return;
>  
> +	/*
> +	 * If @rq->aborted_gstate equals the current instance, timeout is
> +	 * claiming @rq and we lost.  This is synchronized through RCU.
> +	 * See blk_mq_timeout_work() for details.
> +	 */
>  	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
>  		rcu_read_lock();
> -		if (!blk_mark_rq_complete(rq))
> +		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
> +		    !blk_mark_rq_complete(rq))
>  			__blk_mq_complete_request(rq);
>  		rcu_read_unlock();
>  	} else {
>  		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
> -		if (!blk_mark_rq_complete(rq))
> +		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
> +		    !blk_mark_rq_complete(rq))
>  			__blk_mq_complete_request(rq);
>  		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
>  	}
> @@ -608,34 +631,28 @@ void blk_mq_start_request(struct request *rq)
>  		wbt_issue(q->rq_wb, &rq->issue_stat);
>  	}
>  
> -	blk_add_timer(rq);
> -
> +	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
>  	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
>  
>  	/*
> -	 * Mark us as started and clear complete. Complete might have been
> -	 * set if requeue raced with timeout, which then marked it as
> -	 * complete. So be sure to clear complete again when we start
> -	 * the request, otherwise we'll ignore the completion event.
> +	 * Mark @rq in-flight which also advances the generation number,
> +	 * and register for timeout.  Protect with a seqcount to allow the
> +	 * timeout path to read both @rq->gstate and @rq->deadline
> +	 * coherently.
>  	 *
> -	 * Ensure that ->deadline is visible before we set STARTED, such that
> -	 * blk_mq_check_expired() is guaranteed to observe our ->deadline when
> -	 * it observes STARTED.
> +	 * This is the only place where a request is marked in-flight.  If
> +	 * the timeout path reads an in-flight @rq->gstate, the
> +	 * @rq->deadline it reads together under @rq->gstate_seq is
> +	 * guaranteed to be the matching one.
>  	 */
> -	smp_wmb();
> +	write_seqcount_begin(&rq->gstate_seq);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> +	blk_add_timer(rq);
> +	write_seqcount_end(&rq->gstate_seq);
> +
>  	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
> -	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
> -		/*
> -		 * Coherence order guarantees these consecutive stores to a
> -		 * single variable propagate in the specified order. Thus the
> -		 * clear_bit() is ordered _after_ the set bit. See
> -		 * blk_mq_check_expired().
> -		 *
> -		 * (the bits must be part of the same byte for this to be
> -		 * true).
> -		 */
> +	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
>  		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
> -	}
>  
>  	if (q->dma_drain_size && blk_rq_bytes(rq)) {
>  		/*
> @@ -668,6 +685,7 @@ static void __blk_mq_requeue_request(struct request *rq)
>  	blk_mq_sched_requeue_request(rq);
>  
>  	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
> +		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
>  		if (q->dma_drain_size && blk_rq_bytes(rq))
>  			rq->nr_phys_segments--;
>  	}
> @@ -765,6 +783,7 @@ EXPORT_SYMBOL(blk_mq_tag_to_rq);
>  struct blk_mq_timeout_data {
>  	unsigned long next;
>  	unsigned int next_set;
> +	unsigned int nr_expired;
>  };
>  
>  void blk_mq_rq_timed_out(struct request *req, bool reserved)
> @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  		__blk_mq_complete_request(req);
>  		break;
>  	case BLK_EH_RESET_TIMER:
> +		/*
> +		 * As nothing prevents from completion happening while
> +		 * ->aborted_gstate is set, this may lead to ignored
> +		 * completions and further spurious timeouts.
> +		 */
> +		u64_stats_update_begin(&req->aborted_gstate_sync);
> +		req->aborted_gstate = 0;
> +		u64_stats_update_end(&req->aborted_gstate_sync);
>  		blk_add_timer(req);
>  		blk_clear_rq_complete(req);
Test ok with NVMe

Thanks
Jianchao
Tejun Heo Dec. 13, 2017, 4:13 p.m. UTC | #4
Hi, Jianchao.

On Wed, Dec 13, 2017 at 01:07:30PM +0800, jianchao.wang wrote:
> Test ok with NVMe

Awesome, thanks for testing!
Bart Van Assche Dec. 14, 2017, 6:51 p.m. UTC | #5
On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> rules.  Unfortunatley, it contains quite a few holes.

          ^^^^^^^^^^^^^
          Unfortunately?

> While this change makes REQ_ATOM_COMPLETE synchornization unnecessary

                                            ^^^^^^^^^^^^^^^
                                            synchronization?

> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
>  	rq->start_time = jiffies;
>  	set_start_time_ns(rq);
>  	rq->part = NULL;
> +	seqcount_init(&rq->gstate_seq);
> +	u64_stats_init(&rq->aborted_gstate_sync);
>  }
>  EXPORT_SYMBOL(blk_rq_init);

Sorry but the above change looks ugly to me. My understanding is that 
blk_rq_init() is only used inside the block layer to initialize legacy block
layer requests while gstate_seq and aborted_gstate_sync are only relevant
for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
called for blk-mq requests such that the above change can be left out? The
only callers outside the block layer core of blk_rq_init() I know of are
ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
code if you want.

> +	write_seqcount_begin(&rq->gstate_seq);
> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> +	blk_add_timer(rq);
> +	write_seqcount_end(&rq->gstate_seq);

My understanding is that both write_seqcount_begin() and write_seqcount_end()
trigger a write memory barrier. Is a seqcount really faster than a spinlock?

> @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
>  		__blk_mq_complete_request(req);
>  		break;
>  	case BLK_EH_RESET_TIMER:
> +		/*
> +		 * As nothing prevents from completion happening while
> +		 * ->aborted_gstate is set, this may lead to ignored
> +		 * completions and further spurious timeouts.
> +		 */
> +		u64_stats_update_begin(&req->aborted_gstate_sync);
> +		req->aborted_gstate = 0;
> +		u64_stats_update_end(&req->aborted_gstate_sync);

If a blk-mq request is resubmitted 2**62 times, can that result in the above
code setting aborted_gstate to the same value as gstate? Isn't that a bug?
If so, how about setting aborted_gstate in the above code to e.g. gstate ^ (2**63)?

> @@ -228,6 +230,27 @@ struct request {
>  
>  	unsigned short write_hint;
>  
> +	/*
> +	 * On blk-mq, the lower bits of ->gstate carry the MQ_RQ_* state
> +	 * value and the upper bits the generation number which is
> +	 * monotonically incremented and used to distinguish the reuse
> +	 * instances.
> +	 *
> +	 * ->gstate_seq allows updates to ->gstate and other fields
> +	 * (currently ->deadline) during request start to be read
> +	 * atomically from the timeout path, so that it can operate on a
> +	 * coherent set of information.
> +	 */
> +	seqcount_t gstate_seq;
> +	u64 gstate;
> +
> +	/*
> +	 * ->aborted_gstate is used by the timeout to claim a specific
> +	 * recycle instance of this request.  See blk_mq_timeout_work().
> +	 */
> +	struct u64_stats_sync aborted_gstate_sync;
> +	u64 aborted_gstate;
> +
>  	unsigned long deadline;
>  	struct list_head timeout_list;

Why are gstate and aborted_gstate 64-bit variables? What makes you think that
32 bits would not be enough?

Thanks,

Bart.
Tejun Heo Dec. 14, 2017, 7:19 p.m. UTC | #6
Hello,

On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > rules.  Unfortunatley, it contains quite a few holes.
>           ^^^^^^^^^^^^^
>           Unfortunately?
> 
> > While this change makes REQ_ATOM_COMPLETE synchornization unnecessary
>                                             ^^^^^^^^^^^^^^^
>                                             synchronization?

lol, believe it or not, my english spelling is a lot better than my
korean.  Will fix them.

> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
> >  	rq->start_time = jiffies;
> >  	set_start_time_ns(rq);
> >  	rq->part = NULL;
> > +	seqcount_init(&rq->gstate_seq);
> > +	u64_stats_init(&rq->aborted_gstate_sync);
> >  }
> >  EXPORT_SYMBOL(blk_rq_init);
> 
> Sorry but the above change looks ugly to me. My understanding is that 
> blk_rq_init() is only used inside the block layer to initialize legacy block
> layer requests while gstate_seq and aborted_gstate_sync are only relevant
> for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
> called for blk-mq requests such that the above change can be left out? The
> only callers outside the block layer core of blk_rq_init() I know of are
> ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
> code if you want.

This is also used by flush path.  We probably should clean that up,
but let's worry about that later cuz flush handling has enough of its
own complications.

> > +	write_seqcount_begin(&rq->gstate_seq);
> > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > +	blk_add_timer(rq);
> > +	write_seqcount_end(&rq->gstate_seq);
> 
> My understanding is that both write_seqcount_begin() and write_seqcount_end()
> trigger a write memory barrier. Is a seqcount really faster than a spinlock?

Write memory barrier has no cost on x86 and usually pretty low cost
elsewhere too and they're likely in the same cacheline as other rq
fields.

> > @@ -792,6 +811,14 @@ void blk_mq_rq_timed_out(struct request *req, bool reserved)
> >  		__blk_mq_complete_request(req);
> >  		break;
> >  	case BLK_EH_RESET_TIMER:
> > +		/*
> > +		 * As nothing prevents from completion happening while
> > +		 * ->aborted_gstate is set, this may lead to ignored
> > +		 * completions and further spurious timeouts.
> > +		 */
> > +		u64_stats_update_begin(&req->aborted_gstate_sync);
> > +		req->aborted_gstate = 0;
> > +		u64_stats_update_end(&req->aborted_gstate_sync);
> 
> If a blk-mq request is resubmitted 2**62 times, can that result in the above
> code setting aborted_gstate to the same value as gstate? Isn't that a bug?
> If so, how about setting aborted_gstate in the above code to e.g. gstate ^ (2**63)?

A request gets aborted only if the state is in-flight, 0 isn't that.
Also, how many years would 2^62 times be?

> > +	struct u64_stats_sync aborted_gstate_sync;
> > +	u64 aborted_gstate;
> > +
> >  	unsigned long deadline;
> >  	struct list_head timeout_list;
> 
> Why are gstate and aborted_gstate 64-bit variables? What makes you think that
> 32 bits would not be enough?

Because 32 bits puts it in the range where a false hit is still
theoretically possible in a reasonable amount of time.
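
Back-of-the-envelope, assuming for illustration ~10^6 reissues per second of
a single tag: with 62 generation bits a wrap takes 2^62 / 10^6 ~= 4.6e12
seconds, i.e. on the order of 10^5 years, whereas 30 generation bits (32 bits
minus the two state bits) would wrap after only ~2^30 / 10^6 ~= 10^3 seconds.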

Thanks.
Peter Zijlstra Dec. 14, 2017, 8:20 p.m. UTC | #7
On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > +	write_seqcount_begin(&rq->gstate_seq);
> > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > +	blk_add_timer(rq);
> > +	write_seqcount_end(&rq->gstate_seq);
> 
> My understanding is that both write_seqcount_begin() and write_seqcount_end()
> trigger a write memory barrier. Is a seqcount really faster than a spinlock?

Yes lots, no atomic operations and no waiting.

The only constraint for write_seqlock is that there must not be any
concurrency.

But now that I look at this again, TJ, why can't the below happen?

	write_seqlock_begin();
	blk_mq_rq_update_state(rq, IN_FLIGHT);
	blk_add_timer(rq);
	<timer-irq>
		read_seqcount_begin()
			while (seq & 1)
				cpurelax();
		// life-lock
	</timer-irq>
	write_seqlock_end();
Bart Van Assche Dec. 14, 2017, 9:13 p.m. UTC | #8
On Thu, 2017-12-14 at 11:19 -0800, tj@kernel.org wrote:
> On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -126,6 +126,8 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
> > >  	rq->start_time = jiffies;
> > >  	set_start_time_ns(rq);
> > >  	rq->part = NULL;
> > > +	seqcount_init(&rq->gstate_seq);
> > > +	u64_stats_init(&rq->aborted_gstate_sync);
> > >  }
> > >  EXPORT_SYMBOL(blk_rq_init);
> > 
> > Sorry but the above change looks ugly to me. My understanding is that
> > blk_rq_init() is only used inside the block layer to initialize legacy block
> > layer requests while gstate_seq and aborted_gstate_sync are only relevant
> > for blk-mq requests. Wouldn't it be better to avoid that blk_rq_init() is
> > called for blk-mq requests such that the above change can be left out? The
> > only callers outside the block layer core of blk_rq_init() I know of are
> > ide_prep_sense() and scsi_ioctl_reset(). I can help with converting the SCSI
> > code if you want.
> 
> This is also used by flush path.  We probably should clean that up,
> but let's worry about that later cuz flush handling has enough of its
> own complications.

We may have a different opinion about this but I think it is more than a
detail. This patch needs gstate_seq and aborted_gstate_sync to be preserved
across request state transitions from MQ_RQ_IN_FLIGHT to MQ_RQ_IDLE.
blk_mq_init_request() is called at request allocation time so it's the right
context to initialize gstate_seq and aborted_gstate_sync. blk_rq_init()
however is called before every use of a request. Sorry but I'm not
enthusiastic about the code in blk_rq_init() that reinitializes state
information that should survive request reuse.

Bart.
Bart Van Assche Dec. 14, 2017, 9:42 p.m. UTC | #9
On Thu, 2017-12-14 at 21:20 +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > +	write_seqcount_begin(&rq->gstate_seq);
> > > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > > +	blk_add_timer(rq);
> > > +	write_seqcount_end(&rq->gstate_seq);
> > 
> > My understanding is that both write_seqcount_begin() and write_seqcount_end()
> > trigger a write memory barrier. Is a seqcount really faster than a spinlock?
> 
> Yes lots, no atomic operations and no waiting.
> 
> The only constraint for write_seqlock is that there must not be any
> concurrency.
> 
> But now that I look at this again, TJ, why can't the below happen?
> 
> 	write_seqlock_begin();
> 	blk_mq_rq_update_state(rq, IN_FLIGHT);
> 	blk_add_timer(rq);
> 	<timer-irq>
> 		read_seqcount_begin()
> 			while (seq & 1)
> 				cpurelax();
> 		// life-lock
> 	</timer-irq>
> 	write_seqlock_end();

Hello Peter,

Some time ago the block layer was changed to handle timeouts in thread context
instead of interrupt context. See also commit 287922eb0b18 ("block: defer
timeouts to a workqueue").

Bart.
Peter Zijlstra Dec. 14, 2017, 9:54 p.m. UTC | #10
On Thu, Dec 14, 2017 at 09:42:48PM +0000, Bart Van Assche wrote:
> On Thu, 2017-12-14 at 21:20 +0100, Peter Zijlstra wrote:
> > On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
> > > On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
> > > > +	write_seqcount_begin(&rq->gstate_seq);
> > > > +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
> > > > +	blk_add_timer(rq);
> > > > +	write_seqcount_end(&rq->gstate_seq);
> > > 
> > > My understanding is that both write_seqcount_begin() and write_seqcount_end()
> > > trigger a write memory barrier. Is a seqcount really faster than a spinlock?
> > 
> > Yes lots, no atomic operations and no waiting.
> > 
> > The only constraint for write_seqlock is that there must not be any
> > concurrency.
> > 
> > But now that I look at this again, TJ, why can't the below happen?
> > 
> > 	write_seqlock_begin();
> > 	blk_mq_rq_update_state(rq, IN_FLIGHT);
> > 	blk_add_timer(rq);
> > 	<timer-irq>
> > 		read_seqcount_begin()
> > 			while (seq & 1)
> > 				cpurelax();
> > 		// life-lock
> > 	</timer-irq>
> > 	write_seqlock_end();
> 
> Hello Peter,
> 
> Some time ago the block layer was changed to handle timeouts in thread context
> instead of interrupt context. See also commit 287922eb0b18 ("block: defer
> timeouts to a workqueue").

That only makes it a little better:

	Task-A					Worker

	write_seqcount_begin()
	blk_mq_rw_update_state(rq, IN_FLIGHT)
	blk_add_timer(rq)
	<timer>
		schedule_work()
	</timer>
	<context-switch to worker>
						read_seqcount_begin()
							while(seq & 1)
								cpu_relax();


Now normally this isn't fatal because Worker will simply spin its entire
time slice away and we'll eventually schedule our Task-A back in, which
will complete the seqcount and things will work.

But if, for some reason, our Worker was to have RT priority higher than
our Task-A we'd be up some creek without no paddles.

We don't happen to have preemption of IRQs off here? That would fix
things nicely.
jianchao.wang Dec. 15, 2017, 2:12 a.m. UTC | #11
On 12/15/2017 05:54 AM, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 09:42:48PM +0000, Bart Van Assche wrote:
>> On Thu, 2017-12-14 at 21:20 +0100, Peter Zijlstra wrote:
>>> On Thu, Dec 14, 2017 at 06:51:11PM +0000, Bart Van Assche wrote:
>>>> On Tue, 2017-12-12 at 11:01 -0800, Tejun Heo wrote:
>>>>> +	write_seqcount_begin(&rq->gstate_seq);
>>>>> +	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>>>>> +	blk_add_timer(rq);
>>>>> +	write_seqcount_end(&rq->gstate_seq);
>>>>
>>>> My understanding is that both write_seqcount_begin() and write_seqcount_end()
>>>> trigger a write memory barrier. Is a seqcount really faster than a spinlock?
>>>
>>> Yes lots, no atomic operations and no waiting.
>>>
>>> The only constraint for write_seqlock is that there must not be any
>>> concurrency.
>>>
>>> But now that I look at this again, TJ, why can't the below happen?
>>>
>>> 	write_seqlock_begin();
>>> 	blk_mq_rq_update_state(rq, IN_FLIGHT);
>>> 	blk_add_timer(rq);
>>> 	<timer-irq>
>>> 		read_seqcount_begin()
>>> 			while (seq & 1)
>>> 				cpurelax();
>>> 		// life-lock
>>> 	</timer-irq>
>>> 	write_seqlock_end();
>>
>> Hello Peter,
>>
>> Some time ago the block layer was changed to handle timeouts in thread context
>> instead of interrupt context. See also commit 287922eb0b18 ("block: defer
>> timeouts to a workqueue").
> 
> That only makes it a little better:
> 
> 	Task-A					Worker
> 
> 	write_seqcount_begin()
> 	blk_mq_rw_update_state(rq, IN_FLIGHT)
> 	blk_add_timer(rq)
> 	<timer>
> 		schedule_work()
> 	</timer>
> 	<context-switch to worker>
> 						read_seqcount_begin()
> 							while(seq & 1)
> 								cpu_relax();
> 
Hi Peter

The current seqcount read side is as below:
	do {
		start = read_seqcount_begin(&rq->gstate_seq);
		gstate = READ_ONCE(rq->gstate);
		deadline = rq->deadline;
	} while (read_seqcount_retry(&rq->gstate_seq, start));
read_seqcount_retry() doesn't check bit 0; it only checks whether the value saved by
read_seqcount_begin() is still equal to the current value of the seqcount.
Please refer to:
static inline int __read_seqcount_retry(const seqcount_t *s, unsigned start)
{
	return unlikely(s->sequence != start);
}

Thanks
Jianchao
> 
> Now normally this isn't fatal because Worker will simply spin its entire
> time slice away and we'll eventually schedule our Task-A back in, which
> will complete the seqcount and things will work.
> 
> But if, for some reason, our Worker was to have RT priority higher than
> our Task-A we'd be up some creek without no paddles.
> 
> We don't happen to have preemption of IRQs off here? That would fix
> things nicely.
>
Mike Galbraith Dec. 15, 2017, 2:39 a.m. UTC | #12
On Thu, 2017-12-14 at 22:54 +0100, Peter Zijlstra wrote:
> On Thu, Dec 14, 2017 at 09:42:48PM +0000, Bart Van Assche wrote:
> 
> > Some time ago the block layer was changed to handle timeouts in thread context
> > instead of interrupt context. See also commit 287922eb0b18 ("block: defer
> > timeouts to a workqueue").
> 
> That only makes it a little better:
> 
> 	Task-A					Worker
> 
> 	write_seqcount_begin()
> 	blk_mq_rw_update_state(rq, IN_FLIGHT)
> 	blk_add_timer(rq)
> 	<timer>
> 		schedule_work()
> 	</timer>
> 	<context-switch to worker>
> 						read_seqcount_begin()
> 							while(seq & 1)
> 								cpu_relax();
> 
> 
> Now normally this isn't fatal because Worker will simply spin its entire
> time slice away and we'll eventually schedule our Task-A back in, which
> will complete the seqcount and things will work.
> 
> But if, for some reason, our Worker was to have RT priority higher than
> our Task-A we'd be up some creek without no paddles.

Most kthreads, including kworkers, are very frequently SCHED_FIFO here.

	-Mike
Peter Zijlstra Dec. 15, 2017, 7:31 a.m. UTC | #13
On Fri, Dec 15, 2017 at 10:12:50AM +0800, jianchao.wang wrote:
> > That only makes it a little better:
> > 
> > 	Task-A					Worker
> > 
> > 	write_seqcount_begin()
> > 	blk_mq_rw_update_state(rq, IN_FLIGHT)
> > 	blk_add_timer(rq)
> > 	<timer>
> > 		schedule_work()
> > 	</timer>
> > 	<context-switch to worker>
> > 						read_seqcount_begin()
> > 							while(seq & 1)
> > 								cpu_relax();
> > 
> Hi Peter
> 
> The current seqcount read side is as below:
> 	do {
> 		start = read_seqcount_begin(&rq->gstate_seq);


static inline unsigned read_seqcount_begin(const seqcount_t *s)
{
	seqcount_lockdep_reader_access(s);
	return raw_read_seqcount_begin(s);
}

static inline unsigned raw_read_seqcount_begin(const seqcount_t *s)
{
	unsigned ret = __read_seqcount_begin(s);
	smp_rmb();
	return ret;
}

static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
	unsigned ret;

repeat:
	ret = READ_ONCE(s->sequence);
	if (unlikely(ret & 1)) {
		cpu_relax();
		goto repeat;
	}
	return ret;
}
Tejun Heo Dec. 15, 2017, 1:30 p.m. UTC | #14
Hello, Bart.

On Thu, Dec 14, 2017 at 09:13:32PM +0000, Bart Van Assche wrote:
...
> however is called before every use of a request. Sorry but I'm not
> enthusiastic about the code in blk_rq_init() that reinitializes state
> information that should survive request reuse.

If it wasn't clear, me neither.  I think what'd be better is making
those paths use the usual request allocation path instead of custom
one but for flush handling, that's not gonna be trivial, so let's deal
with that later.

Thanks.
Tejun Heo Dec. 15, 2017, 1:50 p.m. UTC | #15
Hello, Peter.

On Thu, Dec 14, 2017 at 09:20:42PM +0100, Peter Zijlstra wrote:
> But now that I look at this again, TJ, why can't the below happen?
> 
> 	write_seqlock_begin();
> 	blk_mq_rq_update_state(rq, IN_FLIGHT);
> 	blk_add_timer(rq);
> 	<timer-irq>
> 		read_seqcount_begin()
> 			while (seq & 1)
> 				cpurelax();
> 		// life-lock
> 	</timer-irq>
> 	write_seqlock_end();

Ah, you're right.  For both gstate_seq and aborted_gstate_sync, we can
push all synchronization to the timeout side - i.e. the gstate_seq read can
yield, pause or synchronize_rcu() and the aborted_gstate_sync side can
disable irqs around the update.

Thanks.
jianchao.wang Dec. 15, 2017, 3:14 p.m. UTC | #16
On 12/15/2017 03:31 PM, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 10:12:50AM +0800, jianchao.wang wrote:
>>> That only makes it a little better:
>>>
>>> 	Task-A					Worker
>>>
>>> 	write_seqcount_begin()
>>> 	blk_mq_rw_update_state(rq, IN_FLIGHT)
>>> 	blk_add_timer(rq)
>>> 	<timer>
>>> 		schedule_work()
>>> 	</timer>
>>> 	<context-switch to worker>
>>> 						read_seqcount_begin()
>>> 							while(seq & 1)
>>> 								cpu_relax();
>>>
>> Hi Peter
>>
>> The current seqcount read side is as below:
>> 	do {
>> 		start = read_seqcount_begin(&rq->gstate_seq);
> 
> 
> static inline unsigned read_seqcount_begin(const seqcount_t *s)
> {
> 	seqcount_lockdep_reader_access(s);
> 	return raw_read_seqcount_begin(s);
> }
> 
> static inline unsigned raw_read_seqcount_begin(const seqcount_t *s)
> {
> 	unsigned ret = __read_seqcount_begin(s);
> 	smp_rmb();
> 	return ret;
> }
> 
> static inline unsigned __read_seqcount_begin(const seqcount_t *s)
> {
> 	unsigned ret;
> 
> repeat:
> 	ret = READ_ONCE(s->sequence);
> 	if (unlikely(ret & 1)) {
> 		cpu_relax();
> 		goto repeat;
> 	}
> 	return ret;
> }
> 
Thanks for kindly pointing this out.
jianchao

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index b888175..6034623 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -126,6 +126,8 @@  void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
 	rq->part = NULL;
+	seqcount_init(&rq->gstate_seq);
+	u64_stats_init(&rq->aborted_gstate_sync);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index acf4fbb..b4e733b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -530,6 +530,9 @@  static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
+	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
 	if (rq->rq_flags & RQF_STATS) {
@@ -557,6 +560,19 @@  static void __blk_mq_complete_request(struct request *rq)
 	put_cpu();
 }
 
+static u64 blk_mq_rq_aborted_gstate(struct request *rq)
+{
+	unsigned int start;
+	u64 aborted_gstate;
+
+	do {
+		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
+		aborted_gstate = rq->aborted_gstate;
+	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
+
+	return aborted_gstate;
+}
+
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -574,14 +590,21 @@  void blk_mq_complete_request(struct request *rq)
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
+	/*
+	 * If @rq->aborted_gstate equals the current instance, timeout is
+	 * claiming @rq and we lost.  This is synchronized through RCU.
+	 * See blk_mq_timeout_work() for details.
+	 */
 	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
 		rcu_read_lock();
-		if (!blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
+		    !blk_mark_rq_complete(rq))
 			__blk_mq_complete_request(rq);
 		rcu_read_unlock();
 	} else {
 		srcu_idx = srcu_read_lock(hctx->queue_rq_srcu);
-		if (!blk_mark_rq_complete(rq))
+		if (blk_mq_rq_aborted_gstate(rq) != rq->gstate &&
+		    !blk_mark_rq_complete(rq))
 			__blk_mq_complete_request(rq);
 		srcu_read_unlock(hctx->queue_rq_srcu, srcu_idx);
 	}
@@ -608,34 +631,28 @@  void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, &rq->issue_stat);
 	}
 
-	blk_add_timer(rq);
-
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 	WARN_ON_ONCE(test_bit(REQ_ATOM_STARTED, &rq->atomic_flags));
 
 	/*
-	 * Mark us as started and clear complete. Complete might have been
-	 * set if requeue raced with timeout, which then marked it as
-	 * complete. So be sure to clear complete again when we start
-	 * the request, otherwise we'll ignore the completion event.
+	 * Mark @rq in-flight which also advances the generation number,
+	 * and register for timeout.  Protect with a seqcount to allow the
+	 * timeout path to read both @rq->gstate and @rq->deadline
+	 * coherently.
 	 *
-	 * Ensure that ->deadline is visible before we set STARTED, such that
-	 * blk_mq_check_expired() is guaranteed to observe our ->deadline when
-	 * it observes STARTED.
+	 * This is the only place where a request is marked in-flight.  If
+	 * the timeout path reads an in-flight @rq->gstate, the
+	 * @rq->deadline it reads together under @rq->gstate_seq is
+	 * guaranteed to be the matching one.
 	 */
-	smp_wmb();
+	write_seqcount_begin(&rq->gstate_seq);
+	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
+	blk_add_timer(rq);
+	write_seqcount_end(&rq->gstate_seq);
+
 	set_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
-	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags)) {
-		/*
-		 * Coherence order guarantees these consecutive stores to a
-		 * single variable propagate in the specified order. Thus the
-		 * clear_bit() is ordered _after_ the set bit. See
-		 * blk_mq_check_expired().
-		 *
-		 * (the bits must be part of the same byte for this to be
-		 * true).
-		 */
+	if (test_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags))
 		clear_bit(REQ_ATOM_COMPLETE, &rq->atomic_flags);
-	}
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -668,6 +685,7 @@  static void __blk_mq_requeue_request(struct request *rq)
 	blk_mq_sched_requeue_request(rq);
 
 	if (test_and_clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags)) {
+		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -765,6 +783,7 @@  EXPORT_SYMBOL(blk_mq_tag_to_rq);
 struct blk_mq_timeout_data {
 	unsigned long next;
 	unsigned int next_set;
+	unsigned int nr_expired;
 };
 
 void blk_mq_rq_timed_out(struct request *req, bool reserved)
@@ -792,6 +811,14 @@  void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
+		/*
+		 * As nothing prevents from completion happening while
+		 * ->aborted_gstate is set, this may lead to ignored
+		 * completions and further spurious timeouts.
+		 */
+		u64_stats_update_begin(&req->aborted_gstate_sync);
+		req->aborted_gstate = 0;
+		u64_stats_update_end(&req->aborted_gstate_sync);
 		blk_add_timer(req);
 		blk_clear_rq_complete(req);
 		break;
@@ -807,50 +834,48 @@  static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long deadline;
+	unsigned long gstate, deadline;
+	int start;
 
 	if (!test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
 		return;
 
-	/*
-	 * Ensures that if we see STARTED we must also see our
-	 * up-to-date deadline, see blk_mq_start_request().
-	 */
-	smp_rmb();
-
-	deadline = READ_ONCE(rq->deadline);
-
-	/*
-	 * The rq being checked may have been freed and reallocated
-	 * out already here, we avoid this race by checking rq->deadline
-	 * and REQ_ATOM_COMPLETE flag together:
-	 *
-	 * - if rq->deadline is observed as new value because of
-	 *   reusing, the rq won't be timed out because of timing.
-	 * - if rq->deadline is observed as previous value,
-	 *   REQ_ATOM_COMPLETE flag won't be cleared in reuse path
-	 *   because we put a barrier between setting rq->deadline
-	 *   and clearing the flag in blk_mq_start_request(), so
-	 *   this rq won't be timed out too.
-	 */
-	if (time_after_eq(jiffies, deadline)) {
-		if (!blk_mark_rq_complete(rq)) {
-			/*
-			 * Again coherence order ensures that consecutive reads
-			 * from the same variable must be in that order. This
-			 * ensures that if we see COMPLETE clear, we must then
-			 * see STARTED set and we'll ignore this timeout.
-			 *
-			 * (There's also the MB implied by the test_and_clear())
-			 */
-			blk_mq_rq_timed_out(rq, reserved);
-		}
+	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
+	do {
+		start = read_seqcount_begin(&rq->gstate_seq);
+		gstate = READ_ONCE(rq->gstate);
+		deadline = rq->deadline;
+	} while (read_seqcount_retry(&rq->gstate_seq, start));
+
+	/* if in-flight && overdue, mark for abortion */
+	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
+	    time_after_eq(jiffies, deadline)) {
+		u64_stats_update_begin(&rq->aborted_gstate_sync);
+		rq->aborted_gstate = gstate;
+		u64_stats_update_end(&rq->aborted_gstate_sync);
+		data->nr_expired++;
+		hctx->nr_expired++;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
 }
 
+static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
+	 * completions that we lost to, they would have finished and
+	 * updated @rq->gstate by now; otherwise, the completion path is
+	 * now guaranteed to see @rq->aborted_gstate and yield.  If
+	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 */
+	if (READ_ONCE(rq->gstate) == rq->aborted_gstate &&
+	    !blk_mark_rq_complete(rq))
+		blk_mq_rq_timed_out(rq, reserved);
+}
+
 static void blk_mq_timeout_work(struct work_struct *work)
 {
 	struct request_queue *q =
@@ -858,7 +883,9 @@  static void blk_mq_timeout_work(struct work_struct *work)
 	struct blk_mq_timeout_data data = {
 		.next		= 0,
 		.next_set	= 0,
+		.nr_expired	= 0,
 	};
+	struct blk_mq_hw_ctx *hctx;
 	int i;
 
 	/* A deadlock might occur if a request is stuck requiring a
@@ -877,14 +904,40 @@  static void blk_mq_timeout_work(struct work_struct *work)
 	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
+	/* scan for the expired ones and set their ->aborted_gstate */
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
+	if (data.nr_expired) {
+		bool has_rcu = false;
+
+		/*
+		 * Wait till everyone sees ->aborted_gstate.  The
+		 * sequential waits for SRCUs aren't ideal.  If this ever
+		 * becomes a problem, we can add per-hw_ctx rcu_head and
+		 * wait in parallel.
+		 */
+		queue_for_each_hw_ctx(q, hctx, i) {
+			if (!hctx->nr_expired)
+				continue;
+
+			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+				has_rcu = true;
+			else
+				synchronize_srcu(hctx->queue_rq_srcu);
+
+			hctx->nr_expired = 0;
+		}
+		if (has_rcu)
+			synchronize_rcu();
+
+		/* terminate the ones we won */
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+	}
+
 	if (data.next_set) {
 		data.next = blk_rq_timeout(round_jiffies_up(data.next));
 		mod_timer(&q->timeout, data.next);
 	} else {
-		struct blk_mq_hw_ctx *hctx;
-
 		queue_for_each_hw_ctx(q, hctx, i) {
 			/* the hctx may be unmapped, so check it here */
 			if (blk_mq_hw_queue_mapped(hctx))
@@ -1879,6 +1932,22 @@  static size_t order_to_size(unsigned int order)
 	return (size_t)PAGE_SIZE << order;
 }
 
+static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
+			       unsigned int hctx_idx, int node)
+{
+	int ret;
+
+	if (set->ops->init_request) {
+		ret = set->ops->init_request(set, rq, hctx_idx, node);
+		if (ret)
+			return ret;
+	}
+
+	seqcount_init(&rq->gstate_seq);
+	u64_stats_init(&rq->aborted_gstate_sync);
+	return 0;
+}
+
 int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 		     unsigned int hctx_idx, unsigned int depth)
 {
@@ -1940,12 +2009,9 @@  int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 			struct request *rq = p;
 
 			tags->static_rqs[i] = rq;
-			if (set->ops->init_request) {
-				if (set->ops->init_request(set, rq, hctx_idx,
-						node)) {
-					tags->static_rqs[i] = NULL;
-					goto fail;
-				}
+			if (blk_mq_init_request(set, rq, hctx_idx, node)) {
+				tags->static_rqs[i] = NULL;
+				goto fail;
 			}
 
 			p += rq_size;
@@ -2084,9 +2150,7 @@  static int blk_mq_init_hctx(struct request_queue *q,
 	if (!hctx->fq)
 		goto sched_exit_hctx;
 
-	if (set->ops->init_request &&
-	    set->ops->init_request(set, hctx->fq->flush_rq, hctx_idx,
-				   node))
+	if (blk_mq_init_request(set, hctx->fq->flush_rq, hctx_idx, node))
 		goto free_fq;
 
 	if (hctx->flags & BLK_MQ_F_BLOCKING)
@@ -2980,12 +3044,6 @@  static bool blk_mq_poll(struct request_queue *q, blk_qc_t cookie)
 
 static int __init blk_mq_init(void)
 {
-	/*
-	 * See comment in block/blk.h rq_atomic_flags enum
-	 */
-	BUILD_BUG_ON((REQ_ATOM_STARTED / BITS_PER_BYTE) !=
-			(REQ_ATOM_COMPLETE / BITS_PER_BYTE));
-
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
 	return 0;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 6c7c3ff..154b2fa 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -27,6 +27,19 @@  struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
+/*
+ * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
+ * and the upper bits the generation number.
+ */
+enum mq_rq_state {
+	MQ_RQ_IDLE		= 0,
+	MQ_RQ_IN_FLIGHT		= 1,
+
+	MQ_RQ_STATE_BITS	= 2,
+	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
+	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
+};
+
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_free_queue(struct request_queue *q);
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr);
@@ -85,6 +98,38 @@  extern void blk_mq_rq_timed_out(struct request *req, bool reserved);
 
 void blk_mq_release(struct request_queue *q);
 
+/**
+ * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
+ * @rq: target request.
+ */
+static inline int blk_mq_rq_state(struct request *rq)
+{
+	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+}
+
+/**
+ * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
+ * @rq: target request.
+ * @state: new state to set.
+ *
+ * Set @rq's state to @state.  The caller is responsible for ensuring that
+ * there are no other updaters.  A request can transition into IN_FLIGHT
+ * only from IDLE and doing so increments the generation number.
+ */
+static inline void blk_mq_rq_update_state(struct request *rq,
+					  enum mq_rq_state state)
+{
+	u64 new_val = (READ_ONCE(rq->gstate) & ~MQ_RQ_STATE_MASK) | state;
+
+	if (state == MQ_RQ_IN_FLIGHT) {
+		WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
+		new_val += MQ_RQ_GEN_INC;
+	}
+
+	/* avoid exposing interim values */
+	WRITE_ONCE(rq->gstate, new_val);
+}
+
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
 					   unsigned int cpu)
 {
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 764ecf9..6427be7 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -208,7 +208,7 @@  void blk_add_timer(struct request *req)
 	if (!req->timeout)
 		req->timeout = q->rq_timeout;
 
-	WRITE_ONCE(req->deadline, jiffies + req->timeout);
+	req->deadline = jiffies + req->timeout;
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
diff --git a/block/blk.h b/block/blk.h
index 3f14469..9cb2739 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -123,12 +123,6 @@  void blk_account_io_done(struct request *req);
  * Internal atomic flags for request handling
  */
 enum rq_atomic_flags {
-	/*
-	 * Keep these two bits first - not because we depend on the
-	 * value of them, but we do depend on them being in the same
-	 * byte of storage to ensure ordering on writes. Keeping them
-	 * first will achieve that nicely.
-	 */
 	REQ_ATOM_COMPLETE = 0,
 	REQ_ATOM_STARTED,
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 95c9a5c..460798d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,6 +51,7 @@  struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 
 	atomic_t		nr_active;
+	unsigned int		nr_expired;
 
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8089ca1..2d6fd11 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,6 +27,8 @@ 
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
+#include <linux/seqlock.h>
+#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -228,6 +230,27 @@  struct request {
 
 	unsigned short write_hint;
 
+	/*
+	 * On blk-mq, the lower bits of ->gstate carry the MQ_RQ_* state
+	 * value and the upper bits the generation number which is
+	 * monotonically incremented and used to distinguish the reuse
+	 * instances.
+	 *
+	 * ->gstate_seq allows updates to ->gstate and other fields
+	 * (currently ->deadline) during request start to be read
+	 * atomically from the timeout path, so that it can operate on a
+	 * coherent set of information.
+	 */
+	seqcount_t gstate_seq;
+	u64 gstate;
+
+	/*
+	 * ->aborted_gstate is used by the timeout to claim a specific
+	 * recycle instance of this request.  See blk_mq_timeout_work().
+	 */
+	struct u64_stats_sync aborted_gstate_sync;
+	u64 aborted_gstate;
+
 	unsigned long deadline;
 	struct list_head timeout_list;