
[8/8] blk-mq: drain I/O when all CPUs in a hctx are offline

Message ID 20200527180644.514302-9-hch@lst.de (mailing list archive)
State New, archived
Series [1/8] blk-mq: remove the bio argument to ->prepare_request

Commit Message

Christoph Hellwig May 27, 2020, 6:06 p.m. UTC
From: Ming Lei <ming.lei@redhat.com>

Most blk-mq drivers depend on managed IRQ auto-affinity to set up the
queue mapping. Thomas mentioned the following point[1]:

"That was the constraint of managed interrupts from the very beginning:

 The driver/subsystem has to quiesce the interrupt line and the associated
 queue _before_ it gets shutdown in CPU unplug and not fiddle with it
 until it's restarted by the core when the CPU is plugged in again."

However, the current blk-mq implementation doesn't quiesce the hw queue
before the last CPU in the hctx is shut down.  Even worse, CPUHP_BLK_MQ_DEAD
is a cpuhp state handled after the CPU is down, so there isn't any chance to
quiesce the hctx before shutting down the CPU.

Add a new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
whose last CPU goes away, and wait for completion of in-flight requests.
This guarantees that there is no in-flight I/O before shutting down the
managed IRQ.

Add a BLK_MQ_F_STACKING flag and set it for dm-rq and loop, so we skip
waiting for completion of in-flight requests from these drivers and avoid
a potential deadlock. It is safe to do this for stacking drivers as they
do not use interrupts at all and their I/O completions are triggered by
the underlying devices' I/O completions.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[hch: different retry mechanism, merged two patches, minor cleanups]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   8 +++
 block/blk-mq.c             | 121 ++++++++++++++++++++++++++++++++++++-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |  10 +++
 include/linux/cpuhotplug.h |   1 +
 7 files changed, 142 insertions(+), 4 deletions(-)

Comments

Hannes Reinecke May 27, 2020, 6:26 p.m. UTC | #1
On 5/27/20 8:06 PM, Christoph Hellwig wrote:
> From: Ming Lei <ming.lei@redhat.com>
> 
> Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
> up queue mapping. Thomas mentioned the following point[1]:
> 
> "That was the constraint of managed interrupts from the very beginning:
> 
>   The driver/subsystem has to quiesce the interrupt line and the associated
>   queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>   until it's restarted by the core when the CPU is plugged in again."
> 
> However, current blk-mq implementation doesn't quiesce hw queue before
> the last CPU in the hctx is shutdown.  Even worse, CPUHP_BLK_MQ_DEAD is a
> cpuhp state handled after the CPU is down, so there isn't any chance to
> quiesce the hctx before shutting down the CPU.
> 
> Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
> where the last CPU goes away, and wait for completion of in-flight
> requests.  This guarantees that there is no inflight I/O before shutting
> down the managed IRQ.
> 
> Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
> to wait for completion of in-flight requests from these drivers to avoid
> a potential dead-lock. It is safe to do this for stacking drivers as those
> do not use interrupts at all and their I/O completions are triggered by
> underlying devices I/O completion.
> 
> [1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> [hch: different retry mechanism, merged two patches, minor cleanups]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>   block/blk-mq-debugfs.c     |   2 +
>   block/blk-mq-tag.c         |   8 +++
>   block/blk-mq.c             | 121 ++++++++++++++++++++++++++++++++++++-
>   drivers/block/loop.c       |   2 +-
>   drivers/md/dm-rq.c         |   2 +-
>   include/linux/blk-mq.h     |  10 +++
>   include/linux/cpuhotplug.h |   1 +
>   7 files changed, 142 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 96b7a35c898a7..15df3a36e9fa4 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -213,6 +213,7 @@ static const char *const hctx_state_name[] = {
>   	HCTX_STATE_NAME(STOPPED),
>   	HCTX_STATE_NAME(TAG_ACTIVE),
>   	HCTX_STATE_NAME(SCHED_RESTART),
> +	HCTX_STATE_NAME(INACTIVE),
>   };
>   #undef HCTX_STATE_NAME
>   
> @@ -239,6 +240,7 @@ static const char *const hctx_flag_name[] = {
>   	HCTX_FLAG_NAME(TAG_SHARED),
>   	HCTX_FLAG_NAME(BLOCKING),
>   	HCTX_FLAG_NAME(NO_SCHED),
> +	HCTX_FLAG_NAME(STACKING),
>   };
>   #undef HCTX_FLAG_NAME
>   
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9f74064768423..1c548d9f67ee7 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   	sbitmap_finish_wait(bt, ws, &wait);
>   
>   found_tag:
> +	/*
> +	 * Give up this allocation if the hctx is inactive.  The caller will
> +	 * retry on an active hctx.
> +	 */
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> +		return -1;

BLK_MQ_NO_TAG, please, to be consistent with the caller checks later on.
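
I.e., something along these lines (sketch only; BLK_MQ_NO_TAG is the sentinel
the caller already tests further down):

	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
		return BLK_MQ_NO_TAG;
	}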

> +	}
>   	return tag + tag_offset;
>   }
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 898400452b1cf..e4580cd6c6f49 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -375,14 +375,39 @@ static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
>   			e->type->ops.limit_depth(data->cmd_flags, data);
>   	}
>   
> +retry:
>   	data->ctx = blk_mq_get_ctx(q);
>   	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
>   	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
>   		blk_mq_tag_busy(data->hctx);
>   
> +	/*
> +	 * Waiting allocations only fail because of an inactive hctx.  In that
> +	 * case just retry the hctx assignment and tag allocation as CPU hotplug
> +	 * should have migrated us to an online CPU by now.
> +	 */
>   	tag = blk_mq_get_tag(data);
> -	if (tag == BLK_MQ_NO_TAG)
> -		return NULL;
> +	if (tag == BLK_MQ_NO_TAG) {
> +		if (data->flags & BLK_MQ_REQ_NOWAIT)
> +			return NULL;
> +
> +		/*
> +		 * All kthreads that can perform I/O should have been moved off
> +		 * this CPU by the time the CPU hotplug state machine has
> +		 * shut down a hctx.  But better be sure with an extra sanity
> +		 * check.
> +		 */
> +		if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
> +			return NULL;
> +
> +		/*
> +		 * Give up the CPU and sleep for a random short time to ensure
> +		 * that threads using a realtime scheduling class are migrated
> +		 * off the CPU.
> +		 */
> +		msleep(3);
> +		goto retry;
> +	}
>   	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
>   }
>   
> @@ -2324,6 +2349,86 @@ int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
>   	return -ENOMEM;
>   }
>   
> +struct rq_iter_data {
> +	struct blk_mq_hw_ctx *hctx;
> +	bool has_rq;
> +};
> +
> +static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
> +{
> +	struct rq_iter_data *iter_data = data;
> +
> +	if (rq->mq_hctx != iter_data->hctx)
> +		return true;
> +	iter_data->has_rq = true;
> +	return false;
> +}
> +
> +static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
> +{
> +	struct blk_mq_tags *tags = hctx->sched_tags ?
> +			hctx->sched_tags : hctx->tags;
> +	struct rq_iter_data data = {
> +		.hctx	= hctx,
> +	};
> +
> +	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
> +	return data.has_rq;
> +}
> +
> +static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
> +		struct blk_mq_hw_ctx *hctx)
> +{
> +	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
> +		return false;
> +	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
> +		return false;
> +	return true;
> +}
> +
> +static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
> +	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
> +		return 0;
> +
> +	/*
> +	 * Prevent new requests from being allocated on the current hctx.
> +	 *
> +	 * The smp_mb__after_atomic() pairs with the implied barrier in
> +	 * test_and_set_bit_lock() in sbitmap_get().  It ensures the inactive
> +	 * flag is seen once we return from the tag allocator.
> +	 */
> +	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> +	smp_mb__after_atomic();
> +
> +	/*
> +	 * Try to grab a reference to the queue and wait for any outstanding
> +	 * requests.  If we could not grab a reference the queue has been
> +	 * frozen and there are no requests.
> +	 */
> +	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
> +		while (blk_mq_hctx_has_requests(hctx))
> +			msleep(5);
> +		percpu_ref_put(&hctx->queue->q_usage_counter);
> +	}
> +
> +	return 0;
> +}
> +
> +static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
> +{
> +	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
> +			struct blk_mq_hw_ctx, cpuhp_online);
> +
> +	if (cpumask_test_cpu(cpu, hctx->cpumask))
> +		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> +	return 0;
> +}
> +
>   /*
>    * 'cpu' is going away. splice any existing rq_list entries from this
>    * software queue to the hw queue dispatch list, and ensure that it
> @@ -2337,6 +2442,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>   	enum hctx_type type;
>   
>   	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
> +	if (!cpumask_test_cpu(cpu, hctx->cpumask))
> +		return 0;
> +
>   	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
>   	type = hctx->type;
>   
> @@ -2360,6 +2468,9 @@ static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
>   
>   static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
>   {
> +	if (!(hctx->flags & BLK_MQ_F_STACKING))
> +		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> +						    &hctx->cpuhp_online);
>   	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
>   					    &hctx->cpuhp_dead);
>   }
> @@ -2419,6 +2530,9 @@ static int blk_mq_init_hctx(struct request_queue *q,
>   {
>   	hctx->queue_num = hctx_idx;
>   
> +	if (!(hctx->flags & BLK_MQ_F_STACKING))
> +		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
> +				&hctx->cpuhp_online);
>   	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
>   
>   	hctx->tags = set->tags[hctx_idx];
> @@ -3673,6 +3787,9 @@ static int __init blk_mq_init(void)
>   {
>   	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
>   				blk_mq_hctx_notify_dead);
> +	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
> +				blk_mq_hctx_notify_online,
> +				blk_mq_hctx_notify_offline);
>   	return 0;
>   }
>   subsys_initcall(blk_mq_init);
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index da693e6a834e5..d7904b4d8d126 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -2037,7 +2037,7 @@ static int loop_add(struct loop_device **l, int i)
>   	lo->tag_set.queue_depth = 128;
>   	lo->tag_set.numa_node = NUMA_NO_NODE;
>   	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
> -	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
> +	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
>   	lo->tag_set.driver_data = lo;
>   
>   	err = blk_mq_alloc_tag_set(&lo->tag_set);
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index 3f8577e2c13be..f60c025121215 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -547,7 +547,7 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
>   	md->tag_set->ops = &dm_mq_ops;
>   	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
>   	md->tag_set->numa_node = md->numa_node_id;
> -	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
> +	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
>   	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
>   	md->tag_set->driver_data = md;
>   
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index d7307795439a4..a20f8c241d665 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -140,6 +140,8 @@ struct blk_mq_hw_ctx {
>   	 */
>   	atomic_t		nr_active;
>   
> +	/** @cpuhp_online: List to store request if CPU is going to die */
> +	struct hlist_node	cpuhp_online;
>   	/** @cpuhp_dead: List to store request if some CPU die. */
>   	struct hlist_node	cpuhp_dead;
>   	/** @kobj: Kernel object for sysfs. */
> @@ -391,6 +393,11 @@ struct blk_mq_ops {
>   enum {
>   	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
>   	BLK_MQ_F_TAG_SHARED	= 1 << 1,
> +	/*
> +	 * Set when this device requires underlying blk-mq device for
> +	 * completing IO:
> +	 */
> +	BLK_MQ_F_STACKING	= 1 << 2,
>   	BLK_MQ_F_BLOCKING	= 1 << 5,
>   	BLK_MQ_F_NO_SCHED	= 1 << 6,
>   	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> @@ -400,6 +407,9 @@ enum {
>   	BLK_MQ_S_TAG_ACTIVE	= 1,
>   	BLK_MQ_S_SCHED_RESTART	= 2,
>   
> +	/* hw queue is inactive after all its CPUs become offline */
> +	BLK_MQ_S_INACTIVE	= 3,
> +
>   	BLK_MQ_MAX_DEPTH	= 10240,
>   
>   	BLK_MQ_CPU_WORK_BATCH	= 8,
> diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
> index 77d70b6335318..24b3a77810b6d 100644
> --- a/include/linux/cpuhotplug.h
> +++ b/include/linux/cpuhotplug.h
> @@ -152,6 +152,7 @@ enum cpuhp_state {
>   	CPUHP_AP_SMPBOOT_THREADS,
>   	CPUHP_AP_X86_VDSO_VMA_ONLINE,
>   	CPUHP_AP_IRQ_AFFINITY_ONLINE,
> +	CPUHP_AP_BLK_MQ_ONLINE,
>   	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
>   	CPUHP_AP_X86_INTEL_EPB_ONLINE,
>   	CPUHP_AP_PERF_ONLINE,
> 
Otherwise looks okay.

Cheers,

Hannes
Bart Van Assche May 27, 2020, 11:09 p.m. UTC | #2
On 2020-05-27 11:06, Christoph Hellwig wrote:
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  	sbitmap_finish_wait(bt, ws, &wait);
>  
>  found_tag:
> +	/*
> +	 * Give up this allocation if the hctx is inactive.  The caller will
> +	 * retry on an active hctx.
> +	 */
> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> +		return -1;
> +	}
>  	return tag + tag_offset;
>  }

The code that has been added in blk_mq_hctx_notify_offline() will only
work correctly if blk_mq_get_tag() tests BLK_MQ_S_INACTIVE after the
store instructions involved in the tag allocation happened. Does this
mean that a memory barrier should be added in the above function before
the test_bit() call?

Thanks,

Bart.
Ming Lei May 28, 2020, 1:46 a.m. UTC | #3
On Wed, May 27, 2020 at 04:09:19PM -0700, Bart Van Assche wrote:
> On 2020-05-27 11:06, Christoph Hellwig wrote:
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
> >  	sbitmap_finish_wait(bt, ws, &wait);
> >  
> >  found_tag:
> > +	/*
> > +	 * Give up this allocation if the hctx is inactive.  The caller will
> > +	 * retry on an active hctx.
> > +	 */
> > +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> > +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> > +		return -1;
> > +	}
> >  	return tag + tag_offset;
> >  }
> 
> The code that has been added in blk_mq_hctx_notify_offline() will only
> work correctly if blk_mq_get_tag() tests BLK_MQ_S_INACTIVE after the
> store instructions involved in the tag allocation happened. Does this
> mean that a memory barrier should be added in the above function before
> the test_bit() call?

Please see comment in blk_mq_hctx_notify_offline():

+       /*
+        * Prevent new request from being allocated on the current hctx.
+        *
+        * The smp_mb__after_atomic() Pairs with the implied barrier in
+        * test_and_set_bit_lock in sbitmap_get().  Ensures the inactive flag is
+        * seen once we return from the tag allocator.
+        */
+       set_bit(BLK_MQ_S_INACTIVE, &hctx->state);


Thanks, 
Ming
Bart Van Assche May 28, 2020, 3:33 a.m. UTC | #4
On 2020-05-27 18:46, Ming Lei wrote:
> On Wed, May 27, 2020 at 04:09:19PM -0700, Bart Van Assche wrote:
>> On 2020-05-27 11:06, Christoph Hellwig wrote:
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>>>  	sbitmap_finish_wait(bt, ws, &wait);
>>>  
>>>  found_tag:
>>> +	/*
>>> +	 * Give up this allocation if the hctx is inactive.  The caller will
>>> +	 * retry on an active hctx.
>>> +	 */
>>> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
>>> +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
>>> +		return -1;
>>> +	}
>>>  	return tag + tag_offset;
>>>  }
>>
>> The code that has been added in blk_mq_hctx_notify_offline() will only
>> work correctly if blk_mq_get_tag() tests BLK_MQ_S_INACTIVE after the
>> store instructions involved in the tag allocation happened. Does this
>> mean that a memory barrier should be added in the above function before
>> the test_bit() call?
> 
> Please see comment in blk_mq_hctx_notify_offline():
> 
> +       /*
> +        * Prevent new request from being allocated on the current hctx.
> +        *
> +        * The smp_mb__after_atomic() Pairs with the implied barrier in
> +        * test_and_set_bit_lock in sbitmap_get().  Ensures the inactive flag is
> +        * seen once we return from the tag allocator.
> +        */
> +       set_bit(BLK_MQ_S_INACTIVE, &hctx->state);

From Documentation/atomic_bitops.txt: "Except for a successful
test_and_set_bit_lock() which has ACQUIRE semantics and
clear_bit_unlock() which has RELEASE semantics."

My understanding is that operations that have acquire semantics pair
with operations that have release semantics. I haven't been able to find
any documentation that shows that smp_mb__after_atomic() has release
semantics. So I looked up its definition. This is what I found:

$ git grep -nH 'define __smp_mb__after_atomic'
arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()	barrier()
arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()	smp_llsc_mb()
arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()	barrier()
arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()	barrier()
arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do { } while (0)
arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	barrier()
include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()	__smp_mb()

My interpretation of the above is that not all smp_mb__after_atomic()
implementations have release semantics. Do you agree with this conclusion?

Thanks,

Bart.
Ming Lei May 28, 2020, 5:19 a.m. UTC | #5
On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> On 2020-05-27 18:46, Ming Lei wrote:
> > On Wed, May 27, 2020 at 04:09:19PM -0700, Bart Van Assche wrote:
> >> On 2020-05-27 11:06, Christoph Hellwig wrote:
> >>> --- a/block/blk-mq-tag.c
> >>> +++ b/block/blk-mq-tag.c
> >>> @@ -180,6 +180,14 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
> >>>  	sbitmap_finish_wait(bt, ws, &wait);
> >>>  
> >>>  found_tag:
> >>> +	/*
> >>> +	 * Give up this allocation if the hctx is inactive.  The caller will
> >>> +	 * retry on an active hctx.
> >>> +	 */
> >>> +	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
> >>> +		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
> >>> +		return -1;
> >>> +	}
> >>>  	return tag + tag_offset;
> >>>  }
> >>
> >> The code that has been added in blk_mq_hctx_notify_offline() will only
> >> work correctly if blk_mq_get_tag() tests BLK_MQ_S_INACTIVE after the
> >> store instructions involved in the tag allocation happened. Does this
> >> mean that a memory barrier should be added in the above function before
> >> the test_bit() call?
> > 
> > Please see comment in blk_mq_hctx_notify_offline():
> > 
> > +       /*
> > +        * Prevent new request from being allocated on the current hctx.
> > +        *
> > +        * The smp_mb__after_atomic() Pairs with the implied barrier in
> > +        * test_and_set_bit_lock in sbitmap_get().  Ensures the inactive flag is
> > +        * seen once we return from the tag allocator.
> > +        */
> > +       set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
> 
> From Documentation/atomic_bitops.txt: "Except for a successful
> test_and_set_bit_lock() which has ACQUIRE semantics and
> clear_bit_unlock() which has RELEASE semantics."

test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state) is called right after
one tag is allocated, which means test_and_set_bit_lock() has succeeded
before the test_bit(). The ACQUIRE semantics guarantee that
test_bit(BLK_MQ_S_INACTIVE) is always done after the successful
test_and_set_bit_lock(), so the tag bit is always set before
BLK_MQ_S_INACTIVE is tested.
 
See Documentation/memory-barriers.txt:
 (5) ACQUIRE operations.

     This acts as a one-way permeable barrier.  It guarantees that all memory
     operations after the ACQUIRE operation will appear to happen after the
     ACQUIRE operation with respect to the other components of the system.
     ACQUIRE operations include LOCK operations and both smp_load_acquire()
     and smp_cond_load_acquire() operations.
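
As a generic illustration of that one-way property (stand-in variables, not
code from this patch):

	WRITE_ONCE(x, 1);				/* may still sink below the ACQUIRE */
	if (!test_and_set_bit_lock(0, &bitmap)) {	/* returns 0 on success, with ACQUIRE semantics */
		r = READ_ONCE(y);			/* cannot be hoisted above the successful ACQUIRE */
	}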

> 
> My understanding is that operations that have acquire semantics pair
> with operations that have release semantics. I haven't been able to find
> any documentation that shows that smp_mb__after_atomic() has release
> semantics. So I looked up its definition. This is what I found:
> 
> $ git grep -nH 'define __smp_mb__after_atomic'
> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> barrier()
> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> smp_llsc_mb()
> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> barrier()
> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> barrier()
> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> } while (0)
> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> barrier()
> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> __smp_mb()
> 
> My interpretation of the above is that not all smp_mb__after_atomic()
> implementations have release semantics. Do you agree with this conclusion?

I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
against reading the tag bit, which is done in blk_mq_all_tag_iter().

So the two pairs of operations are ordered:

1) if a request (tag bit) is allocated before BLK_MQ_S_INACTIVE is set,
the tag bit will be observed in blk_mq_all_tag_iter() from
blk_mq_hctx_has_requests(), so the request will be drained.

OR

2) if a request (tag bit) is allocated after BLK_MQ_S_INACTIVE is set,
the request (tag bit) will eventually be released and retried on another
CPU, see __blk_mq_alloc_request().
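
For reference, the two code paths being paired in 1) and 2), condensed from
the patch (a sketch, not the exact code):

	/* CPU hotplug (offline) side, from blk_mq_hctx_notify_offline(): */
	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	smp_mb__after_atomic();
	while (blk_mq_hctx_has_requests(hctx))	/* iterates the tag bits */
		msleep(5);

	/* allocation side, from blk_mq_get_tag(), right after the tag bit has
	 * been set by test_and_set_bit_lock() inside sbitmap: */
	if (test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)) {
		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
		return -1;	/* == BLK_MQ_NO_TAG, the caller retries on another hctx */
	}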

Cc Paul and linux-kernel list.


Thanks,
Ming
Bart Van Assche May 28, 2020, 1:37 p.m. UTC | #6
On 2020-05-27 22:19, Ming Lei wrote:
> On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
>> My understanding is that operations that have acquire semantics pair
>> with operations that have release semantics. I haven't been able to find
>> any documentation that shows that smp_mb__after_atomic() has release
>> semantics. So I looked up its definition. This is what I found:
>>
>> $ git grep -nH 'define __smp_mb__after_atomic'
>> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
>> barrier()
>> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
>> smp_llsc_mb()
>> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
>> barrier()
>> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
>> barrier()
>> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
>> } while (0)
>> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
>> barrier()
>> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
>> __smp_mb()
>>
>> My interpretation of the above is that not all smp_mb__after_atomic()
>> implementations have release semantics. Do you agree with this conclusion?
> 
> I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> and reading the tag bit which is done in blk_mq_all_tag_iter().
> 
> So the two pair of OPs are ordered:
> 
> 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> so the request will be drained.
> 
> OR
> 
> 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> the request(tag bit) will be released and retried on another CPU
> finally, see __blk_mq_alloc_request().
> 
> Cc Paul and linux-kernel list.

I do not agree with the above conclusion. My understanding of
acquire/release labels is that if the following holds:
(1) A store operation that stores the value V into memory location M has
a release label.
(2) A load operation that reads memory location M has an acquire label.
(3) The load operation (2) retrieves the value V that was stored by (1).

that the following ordering property holds: all load and store
instructions that happened before the store instruction (1) in program
order are guaranteed to happen before the load and store instructions
that follow (2) in program order.
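
For example (stand-in variables, not code from this patch):

	void writer(void)
	{
		WRITE_ONCE(data, 42);
		smp_store_release(&flag, 1);		/* (1): stores V == 1 into M == &flag */
	}

	void reader(void)
	{
		if (smp_load_acquire(&flag) == 1)	/* (2) and (3): reads M and observes V */
			r = READ_ONCE(data);		/* guaranteed to observe 42 */
	}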

In the ARM manual these semantics have been described as follows: "A
Store-Release instruction is multicopy atomic when observed with a
Load-Acquire instruction".

In this case the load-acquire operation is the
"test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
code is executed indirectly by blk_mq_get_tag(). Since there is no
matching store-release instruction in __blk_mq_alloc_request() for
'word', ordering of the &data->hctx->state and 'tag' memory locations is
not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
word)" statement from the sbitmap code.

Thanks,

Bart.
Paul E. McKenney May 28, 2020, 5:21 p.m. UTC | #7
On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> On 2020-05-27 22:19, Ming Lei wrote:
> > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> >> My understanding is that operations that have acquire semantics pair
> >> with operations that have release semantics. I haven't been able to find
> >> any documentation that shows that smp_mb__after_atomic() has release
> >> semantics. So I looked up its definition. This is what I found:
> >>
> >> $ git grep -nH 'define __smp_mb__after_atomic'
> >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> >> smp_llsc_mb()
> >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> >> } while (0)
> >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> >> barrier()
> >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> >> __smp_mb()
> >>
> >> My interpretation of the above is that not all smp_mb__after_atomic()
> >> implementations have release semantics. Do you agree with this conclusion?
> > 
> > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > 
> > So the two pair of OPs are ordered:
> > 
> > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > so the request will be drained.
> > 
> > OR
> > 
> > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > the request(tag bit) will be released and retried on another CPU
> > finally, see __blk_mq_alloc_request().
> > 
> > Cc Paul and linux-kernel list.
> 
> I do not agree with the above conclusion. My understanding of
> acquire/release labels is that if the following holds:
> (1) A store operation that stores the value V into memory location M has
> a release label.
> (2) A load operation that reads memory location M has an acquire label.
> (3) The load operation (2) retrieves the value V that was stored by (1).
> 
> that the following ordering property holds: all load and store
> instructions that happened before the store instruction (1) in program
> order are guaranteed to happen before the load and store instructions
> that follow (2) in program order.
> 
> In the ARM manual these semantics have been described as follows: "A
> Store-Release instruction is multicopy atomic when observed with a
> Load-Acquire instruction".
> 
> In this case the load-acquire operation is the
> "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> code is executed indirectly by blk_mq_get_tag(). Since there is no
> matching store-release instruction in __blk_mq_alloc_request() for
> 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> word)" statement from the sbitmap code.

I feel like I just parachuted into the middle of the conversation,
so let me start by giving a (silly) example illustrating the limits of
smp_mb__{before,after}_atomic() that might be tangling things up.

But please please please avoid doing this in real code unless you have
an extremely good reason included in a comment.

void t1(void)
{
	WRITE_ONCE(a, 1);
	smp_mb__before_atomic();
	WRITE_ONCE(b, 1);  // Just Say No to code here!!!
	atomic_inc(&c);
	WRITE_ONCE(d, 1);  // Just Say No to code here!!!
	smp_mb__after_atomic();
	WRITE_ONCE(e, 1);
}

void t2(void)
{
	r1 = READ_ONCE(e);
	smp_mb();
	r2 = READ_ONCE(d);
	smp_mb();
	r3 = READ_ONCE(c);
	smp_mb();
	r4 = READ_ONCE(b);
	smp_mb();
	r5 = READ_ONCE(a);
}

Each platform must provide strong ordering for either atomic_inc()
on the one hand (as ia64 does) or for smp_mb__{before,after}_atomic()
on the other (as powerpc does).  Note that both ia64 and powerpc are
weakly ordered.

So ia64 could see (r1 == 1 && r2 == 0) on the one hand as well as (r4 ==
1 && r5 == 0).  So clearly smp_mb__{before,after}_atomic() need not have
any ordering properties whatsoever.

Similarly, powerpc could see (r3 == 1 && r4 == 0) on the one hand as well
as (r2 == 1 && r3 == 0) on the other.  Or even both at the same time.
So clearly atomic_inc() need not have any ordering properties whatsoever.

But the combination of smp_mb__before_atomic() and the later atomic_inc()
does provide full ordering, so that no architecture can see (r3 == 1 &&
r5 == 0), and either of r1 or r2 can be substituted for r3.

Similarly, atomic_inc() and the later smp_mb__after_atomic() also
provide full ordering, so that no architecture can see (r1 == 1 && r3 ==
0), and either r4 or r5 can be substituted for r3.


So a call to set_bit() followed by a call to smp_mb__after_atomic() will
provide a full memory barrier (implying release semantics) for any write
access after the smp_mb__after_atomic() with respect to the set_bit() or
any access preceding it.  But the set_bit() by itself won't have release
semantics, nor will the smp_mb__after_atomic(), only their combination
further combined with some write following the smp_mb__after_atomic().

More generally, there will be the equivalent of smp_mb() somewhere between
the set_bit() and every access following the smp_mb__after_atomic().
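
In other words (a minimal sketch of the rule above, with stand-in variables):

	set_bit(0, &flags);		/* no ordering guarantee by itself */
	smp_mb__after_atomic();		/* no ordering guarantee by itself */
	WRITE_ONCE(x, 1);		/* but the combination orders this access after
					 * the set_bit() and anything preceding it */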

Does that help, or am I missing the point?

							Thanx, Paul
Ming Lei May 29, 2020, 1:13 a.m. UTC | #8
On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> On 2020-05-27 22:19, Ming Lei wrote:
> > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> >> My understanding is that operations that have acquire semantics pair
> >> with operations that have release semantics. I haven't been able to find
> >> any documentation that shows that smp_mb__after_atomic() has release
> >> semantics. So I looked up its definition. This is what I found:
> >>
> >> $ git grep -nH 'define __smp_mb__after_atomic'
> >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> >> smp_llsc_mb()
> >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> >> barrier()
> >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> >> } while (0)
> >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> >> barrier()
> >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> >> __smp_mb()
> >>
> >> My interpretation of the above is that not all smp_mb__after_atomic()
> >> implementations have release semantics. Do you agree with this conclusion?
> > 
> > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > 
> > So the two pair of OPs are ordered:
> > 
> > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > so the request will be drained.
> > 
> > OR
> > 
> > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > the request(tag bit) will be released and retried on another CPU
> > finally, see __blk_mq_alloc_request().
> > 
> > Cc Paul and linux-kernel list.
> 
> I do not agree with the above conclusion. My understanding of
> acquire/release labels is that if the following holds:
> (1) A store operation that stores the value V into memory location M has
> a release label.
> (2) A load operation that reads memory location M has an acquire label.
> (3) The load operation (2) retrieves the value V that was stored by (1).
> 
> that the following ordering property holds: all load and store
> instructions that happened before the store instruction (1) in program
> order are guaranteed to happen before the load and store instructions
> that follow (2) in program order.
> 
> In the ARM manual these semantics have been described as follows: "A
> Store-Release instruction is multicopy atomic when observed with a
> Load-Acquire instruction".
> 
> In this case the load-acquire operation is the
> "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> code is executed indirectly by blk_mq_get_tag(). Since there is no
> matching store-release instruction in __blk_mq_alloc_request() for
> 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> word)" statement from the sbitmap code.

If the order isn't guaranteed, then one of the following two documents has to be wrong:

Documentation/memory-barriers.txt:
	...
	In all cases there are variants on "ACQUIRE" operations and "RELEASE" operations
	for each construct.  These operations all imply certain barriers:
	
	 (1) ACQUIRE operation implication:
	
	     Memory operations issued after the ACQUIRE will be completed after the
	     ACQUIRE operation has completed.

Documentation/atomic_bitops.txt:
	...
	Except for a successful test_and_set_bit_lock() which has ACQUIRE semantics and
	clear_bit_unlock() which has RELEASE semantics.

Setting the tag bit is part of a successful test_and_set_bit_lock(), which has
ACQUIRE semantics, and any memory operation (such as test_bit(INACTIVE)) issued
after the ACQUIRE will be completed after the ACQUIRE has completed, according
to the two documents above.

Thanks,
Ming
Ming Lei May 29, 2020, 1:53 a.m. UTC | #9
Hi Paul,

Thanks for your response!

On Thu, May 28, 2020 at 10:21:21AM -0700, Paul E. McKenney wrote:
> On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> > On 2020-05-27 22:19, Ming Lei wrote:
> > > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> > >> My understanding is that operations that have acquire semantics pair
> > >> with operations that have release semantics. I haven't been able to find
> > >> any documentation that shows that smp_mb__after_atomic() has release
> > >> semantics. So I looked up its definition. This is what I found:
> > >>
> > >> $ git grep -nH 'define __smp_mb__after_atomic'
> > >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> > >> barrier()
> > >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> > >> smp_llsc_mb()
> > >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> > >> barrier()
> > >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> > >> barrier()
> > >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> > >> } while (0)
> > >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> > >> barrier()
> > >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> > >> __smp_mb()
> > >>
> > >> My interpretation of the above is that not all smp_mb__after_atomic()
> > >> implementations have release semantics. Do you agree with this conclusion?
> > > 
> > > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > > 
> > > So the two pair of OPs are ordered:
> > > 
> > > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > > so the request will be drained.
> > > 
> > > OR
> > > 
> > > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > > the request(tag bit) will be released and retried on another CPU
> > > finally, see __blk_mq_alloc_request().
> > > 
> > > Cc Paul and linux-kernel list.
> > 
> > I do not agree with the above conclusion. My understanding of
> > acquire/release labels is that if the following holds:
> > (1) A store operation that stores the value V into memory location M has
> > a release label.
> > (2) A load operation that reads memory location M has an acquire label.
> > (3) The load operation (2) retrieves the value V that was stored by (1).
> > 
> > that the following ordering property holds: all load and store
> > instructions that happened before the store instruction (1) in program
> > order are guaranteed to happen before the load and store instructions
> > that follow (2) in program order.
> > 
> > In the ARM manual these semantics have been described as follows: "A
> > Store-Release instruction is multicopy atomic when observed with a
> > Load-Acquire instruction".
> > 
> > In this case the load-acquire operation is the
> > "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> > code is executed indirectly by blk_mq_get_tag(). Since there is no
> > matching store-release instruction in __blk_mq_alloc_request() for
> > 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> > not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> > word)" statement from the sbitmap code.
> 
> I feel like I just parachuted into the middle of the conversation,
> so let me start by giving a (silly) example illustrating the limits of
> smp_mb__{before,after}_atomic() that might be tangling things up.
> 
> But please please please avoid doing this in real code unless you have
> an extremely good reason included in a comment.
> 
> void t1(void)
> {
> 	WRITE_ONCE(a, 1);
> 	smp_mb__before_atomic();
> 	WRITE_ONCE(b, 1);  // Just Say No to code here!!!
> 	atomic_inc(&c);
> 	WRITE_ONCE(d, 1);  // Just Say No to code here!!!
> 	smp_mb__after_atomic();
> 	WRITE_ONCE(e, 1);
> }
> 
> void t2(void)
> {
> 	r1 = READ_ONCE(e);
> 	smp_mb();
> 	r2 = READ_ONCE(d);
> 	smp_mb();
> 	r3 = READ_ONCE(c);
> 	smp_mb();
> 	r4 = READ_ONCE(b);
> 	smp_mb();
> 	r5 = READ_ONCE(a);
> }
> 
> Each platform must provide strong ordering for either atomic_inc()
> on the one hand (as ia64 does) or for smp_mb__{before,after}_atomic()
> on the other (as powerpc does).  Note that both ia64 and powerpc are
> weakly ordered.
> 
> So ia64 could see (r1 == 1 && r2 == 0) on the one hand as well as (r4 ==
> 1 && r5 == 0).  So clearly smp_mb_{before,after}_atomic() need not have
> any ordering properties whatsoever.
> 
> Similarly, powerpc could see (r3 == 1 && r4 == 0) on the one hand as well
> as (r2 == 1 && r3 == 0) on the other.  Or even both at the same time.
> So clearly atomic_inc() need not have any ordering properties whatsoever.
> 
> But the combination of smp_mb__before_atomic() and the later atomic_inc()
> does provide full ordering, so that no architecture can see (r3 == 1 &&
> r5 == 0), and either of r1 or r2 can be substituted for r3.
> 
> Similarly, atomic_inc() and the late4r smp_mb__after_atomic() also
> provide full ordering, so that no architecture can see (r1 == 1 && r3 ==
> 0), and either r4 or r5 can be substituted for r3.
> 
> 
> So a call to set_bit() followed by a call to smp_mb__after_atomic() will
> provide a full memory barrier (implying release semantics) for any write
> access after the smp_mb__after_atomic() with respect to the set_bit() or
> any access preceding it.  But the set_bit() by itself won't have release
> semantics, nor will the smp_mb__after_atomic(), only their combination
> further combined with some write following the smp_mb__after_atomic().
> 
> More generally, there will be the equivalent of smp_mb() somewhere between
> the set_bit() and every access following the smp_mb__after_atomic().
> 
> Does that help, or am I missing the point?

Yeah, it does help.

BTW, can we replace the smp_mb__after_atomic() with smp_mb() to order
set_bit() against the memory operation following the smp_mb()?


Thanks,
Ming
Paul E. McKenney May 29, 2020, 3:07 a.m. UTC | #10
On Fri, May 29, 2020 at 09:53:04AM +0800, Ming Lei wrote:
> Hi Paul,
> 
> Thanks for your response!
> 
> On Thu, May 28, 2020 at 10:21:21AM -0700, Paul E. McKenney wrote:
> > On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> > > On 2020-05-27 22:19, Ming Lei wrote:
> > > > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> > > >> My understanding is that operations that have acquire semantics pair
> > > >> with operations that have release semantics. I haven't been able to find
> > > >> any documentation that shows that smp_mb__after_atomic() has release
> > > >> semantics. So I looked up its definition. This is what I found:
> > > >>
> > > >> $ git grep -nH 'define __smp_mb__after_atomic'
> > > >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> > > >> barrier()
> > > >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> > > >> smp_llsc_mb()
> > > >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> > > >> barrier()
> > > >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> > > >> barrier()
> > > >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> > > >> } while (0)
> > > >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> > > >> barrier()
> > > >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> > > >> __smp_mb()
> > > >>
> > > >> My interpretation of the above is that not all smp_mb__after_atomic()
> > > >> implementations have release semantics. Do you agree with this conclusion?
> > > > 
> > > > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > > > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > > > 
> > > > So the two pair of OPs are ordered:
> > > > 
> > > > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > > > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > > > so the request will be drained.
> > > > 
> > > > OR
> > > > 
> > > > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > > > the request(tag bit) will be released and retried on another CPU
> > > > finally, see __blk_mq_alloc_request().
> > > > 
> > > > Cc Paul and linux-kernel list.
> > > 
> > > I do not agree with the above conclusion. My understanding of
> > > acquire/release labels is that if the following holds:
> > > (1) A store operation that stores the value V into memory location M has
> > > a release label.
> > > (2) A load operation that reads memory location M has an acquire label.
> > > (3) The load operation (2) retrieves the value V that was stored by (1).
> > > 
> > > that the following ordering property holds: all load and store
> > > instructions that happened before the store instruction (1) in program
> > > order are guaranteed to happen before the load and store instructions
> > > that follow (2) in program order.
> > > 
> > > In the ARM manual these semantics have been described as follows: "A
> > > Store-Release instruction is multicopy atomic when observed with a
> > > Load-Acquire instruction".
> > > 
> > > In this case the load-acquire operation is the
> > > "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> > > code is executed indirectly by blk_mq_get_tag(). Since there is no
> > > matching store-release instruction in __blk_mq_alloc_request() for
> > > 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> > > not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> > > word)" statement from the sbitmap code.
> > 
> > I feel like I just parachuted into the middle of the conversation,
> > so let me start by giving a (silly) example illustrating the limits of
> > smp_mb__{before,after}_atomic() that might be tangling things up.
> > 
> > But please please please avoid doing this in real code unless you have
> > an extremely good reason included in a comment.
> > 
> > void t1(void)
> > {
> > 	WRITE_ONCE(a, 1);
> > 	smp_mb__before_atomic();
> > 	WRITE_ONCE(b, 1);  // Just Say No to code here!!!
> > 	atomic_inc(&c);
> > 	WRITE_ONCE(d, 1);  // Just Say No to code here!!!
> > 	smp_mb__after_atomic();
> > 	WRITE_ONCE(e, 1);
> > }
> > 
> > void t2(void)
> > {
> > 	r1 = READ_ONCE(e);
> > 	smp_mb();
> > 	r2 = READ_ONCE(d);
> > 	smp_mb();
> > 	r3 = READ_ONCE(c);
> > 	smp_mb();
> > 	r4 = READ_ONCE(b);
> > 	smp_mb();
> > 	r5 = READ_ONCE(a);
> > }
> > 
> > Each platform must provide strong ordering for either atomic_inc()
> > on the one hand (as ia64 does) or for smp_mb__{before,after}_atomic()
> > on the other (as powerpc does).  Note that both ia64 and powerpc are
> > weakly ordered.
> > 
> > So ia64 could see (r1 == 1 && r2 == 0) on the one hand as well as (r4 ==
> > 1 && r5 == 0).  So clearly smp_mb_{before,after}_atomic() need not have
> > any ordering properties whatsoever.
> > 
> > Similarly, powerpc could see (r3 == 1 && r4 == 0) on the one hand as well
> > as (r2 == 1 && r3 == 0) on the other.  Or even both at the same time.
> > So clearly atomic_inc() need not have any ordering properties whatsoever.
> > 
> > But the combination of smp_mb__before_atomic() and the later atomic_inc()
> > does provide full ordering, so that no architecture can see (r3 == 1 &&
> > r5 == 0), and either of r1 or r2 can be substituted for r3.
> > 
> > Similarly, atomic_inc() and the late4r smp_mb__after_atomic() also
> > provide full ordering, so that no architecture can see (r1 == 1 && r3 ==
> > 0), and either r4 or r5 can be substituted for r3.
> > 
> > 
> > So a call to set_bit() followed by a call to smp_mb__after_atomic() will
> > provide a full memory barrier (implying release semantics) for any write
> > access after the smp_mb__after_atomic() with respect to the set_bit() or
> > any access preceding it.  But the set_bit() by itself won't have release
> > semantics, nor will the smp_mb__after_atomic(), only their combination
> > further combined with some write following the smp_mb__after_atomic().
> > 
> > More generally, there will be the equivalent of smp_mb() somewhere between
> > the set_bit() and every access following the smp_mb__after_atomic().
> > 
> > Does that help, or am I missing the point?
> 
> Yeah, it does help.
> 
> BTW, can we replace the smp_mb__after_atomic() with smp_mb() for
> ordering set_bit() and the memory OP following the smp_mb()?

Placing an smp_mb() between set_bit() and a later access will indeed
order set_bit() with that later access.

That said, I don't know this code well enough to say whether or not
that ordering is sufficient.

						Thanx, Paul
Ming Lei May 29, 2020, 3:53 a.m. UTC | #11
Hi Paul,

On Thu, May 28, 2020 at 08:07:28PM -0700, Paul E. McKenney wrote:
> On Fri, May 29, 2020 at 09:53:04AM +0800, Ming Lei wrote:
> > Hi Paul,
> > 
> > Thanks for your response!
> > 
> > On Thu, May 28, 2020 at 10:21:21AM -0700, Paul E. McKenney wrote:
> > > On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> > > > On 2020-05-27 22:19, Ming Lei wrote:
> > > > > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> > > > >> My understanding is that operations that have acquire semantics pair
> > > > >> with operations that have release semantics. I haven't been able to find
> > > > >> any documentation that shows that smp_mb__after_atomic() has release
> > > > >> semantics. So I looked up its definition. This is what I found:
> > > > >>
> > > > >> $ git grep -nH 'define __smp_mb__after_atomic'
> > > > >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()
> > > > >> barrier()
> > > > >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()
> > > > >> smp_llsc_mb()
> > > > >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()
> > > > >> barrier()
> > > > >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()
> > > > >> barrier()
> > > > >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do {
> > > > >> } while (0)
> > > > >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	
> > > > >> barrier()
> > > > >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()
> > > > >> __smp_mb()
> > > > >>
> > > > >> My interpretation of the above is that not all smp_mb__after_atomic()
> > > > >> implementations have release semantics. Do you agree with this conclusion?
> > > > > 
> > > > > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > > > > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > > > > 
> > > > > So the two pair of OPs are ordered:
> > > > > 
> > > > > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > > > > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > > > > so the request will be drained.
> > > > > 
> > > > > OR
> > > > > 
> > > > > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > > > > the request(tag bit) will be released and retried on another CPU
> > > > > finally, see __blk_mq_alloc_request().
> > > > > 
> > > > > Cc Paul and linux-kernel list.
> > > > 
> > > > I do not agree with the above conclusion. My understanding of
> > > > acquire/release labels is that if the following holds:
> > > > (1) A store operation that stores the value V into memory location M has
> > > > a release label.
> > > > (2) A load operation that reads memory location M has an acquire label.
> > > > (3) The load operation (2) retrieves the value V that was stored by (1).
> > > > 
> > > > that the following ordering property holds: all load and store
> > > > instructions that happened before the store instruction (1) in program
> > > > order are guaranteed to happen before the load and store instructions
> > > > that follow (2) in program order.
> > > > 
> > > > In the ARM manual these semantics have been described as follows: "A
> > > > Store-Release instruction is multicopy atomic when observed with a
> > > > Load-Acquire instruction".
> > > > 
> > > > In this case the load-acquire operation is the
> > > > "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> > > > code is executed indirectly by blk_mq_get_tag(). Since there is no
> > > > matching store-release instruction in __blk_mq_alloc_request() for
> > > > 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> > > > not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> > > > word)" statement from the sbitmap code.
> > > 
> > > I feel like I just parachuted into the middle of the conversation,
> > > so let me start by giving a (silly) example illustrating the limits of
> > > smp_mb__{before,after}_atomic() that might be tangling things up.
> > > 
> > > But please please please avoid doing this in real code unless you have
> > > an extremely good reason included in a comment.
> > > 
> > > void t1(void)
> > > {
> > > 	WRITE_ONCE(a, 1);
> > > 	smp_mb__before_atomic();
> > > 	WRITE_ONCE(b, 1);  // Just Say No to code here!!!
> > > 	atomic_inc(&c);
> > > 	WRITE_ONCE(d, 1);  // Just Say No to code here!!!
> > > 	smp_mb__after_atomic();
> > > 	WRITE_ONCE(e, 1);
> > > }
> > > 
> > > void t2(void)
> > > {
> > > 	r1 = READ_ONCE(e);
> > > 	smp_mb();
> > > 	r2 = READ_ONCE(d);
> > > 	smp_mb();
> > > 	r3 = READ_ONCE(c);
> > > 	smp_mb();
> > > 	r4 = READ_ONCE(b);
> > > 	smp_mb();
> > > 	r5 = READ_ONCE(a);
> > > }
> > > 
> > > Each platform must provide strong ordering for either atomic_inc()
> > > on the one hand (as ia64 does) or for smp_mb__{before,after}_atomic()
> > > on the other (as powerpc does).  Note that both ia64 and powerpc are
> > > weakly ordered.
> > > 
> > > So ia64 could see (r1 == 1 && r2 == 0) on the one hand as well as (r4 ==
> > > 1 && r5 == 0).  So clearly smp_mb_{before,after}_atomic() need not have
> > > any ordering properties whatsoever.
> > > 
> > > Similarly, powerpc could see (r3 == 1 && r4 == 0) on the one hand as well
> > > as (r2 == 1 && r3 == 0) on the other.  Or even both at the same time.
> > > So clearly atomic_inc() need not have any ordering properties whatsoever.
> > > 
> > > But the combination of smp_mb__before_atomic() and the later atomic_inc()
> > > does provide full ordering, so that no architecture can see (r3 == 1 &&
> > > r5 == 0), and either of r1 or r2 can be substituted for r3.
> > > 
> > > Similarly, atomic_inc() and the late4r smp_mb__after_atomic() also
> > > provide full ordering, so that no architecture can see (r1 == 1 && r3 ==
> > > 0), and either r4 or r5 can be substituted for r3.
> > > 
> > > 
> > > So a call to set_bit() followed by a call to smp_mb__after_atomic() will
> > > provide a full memory barrier (implying release semantics) for any write
> > > access after the smp_mb__after_atomic() with respect to the set_bit() or
> > > any access preceding it.  But the set_bit() by itself won't have release
> > > semantics, nor will the smp_mb__after_atomic(), only their combination
> > > further combined with some write following the smp_mb__after_atomic().
> > > 
> > > More generally, there will be the equivalent of smp_mb() somewhere between
> > > the set_bit() and every access following the smp_mb__after_atomic().
> > > 
> > > Does that help, or am I missing the point?
> > 
> > Yeah, it does help.
> > 
> > BTW, can we replace the smp_mb__after_atomic() with smp_mb() for
> > ordering set_bit() and the memory OP following the smp_mb()?
> 
> Placing an smp_mb() between set_bit() and a later access will indeed
> order set_bit() with that later access.
> 
> That said, I don't know this code well enough to say whether or not
> that ordering is sufficient.

Another pair is in blk_mq_get_tag(), and we expect the following two
memory operations to be ordered:

1) setting the tag bit in a successful test_and_set_bit_lock(), which is
called from sbitmap_get()

2) test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)

Do you think the above two operations are ordered?

Thanks,
Ming
Paul E. McKenney May 29, 2020, 6:13 p.m. UTC | #12
On Fri, May 29, 2020 at 11:53:15AM +0800, Ming Lei wrote:
> Hi Paul,
> 
> On Thu, May 28, 2020 at 08:07:28PM -0700, Paul E. McKenney wrote:
> > On Fri, May 29, 2020 at 09:53:04AM +0800, Ming Lei wrote:
> > > Hi Paul,
> > > 
> > > Thanks for your response!
> > > 
> > > On Thu, May 28, 2020 at 10:21:21AM -0700, Paul E. McKenney wrote:
> > > > On Thu, May 28, 2020 at 06:37:47AM -0700, Bart Van Assche wrote:
> > > > > On 2020-05-27 22:19, Ming Lei wrote:
> > > > > > On Wed, May 27, 2020 at 08:33:48PM -0700, Bart Van Assche wrote:
> > > > > >> My understanding is that operations that have acquire semantics pair
> > > > > >> with operations that have release semantics. I haven't been able to find
> > > > > >> any documentation that shows that smp_mb__after_atomic() has release
> > > > > >> semantics. So I looked up its definition. This is what I found:
> > > > > >>
> > > > > >> $ git grep -nH 'define __smp_mb__after_atomic'
> > > > > >> arch/ia64/include/asm/barrier.h:49:#define __smp_mb__after_atomic()	barrier()
> > > > > >> arch/mips/include/asm/barrier.h:133:#define __smp_mb__after_atomic()	smp_llsc_mb()
> > > > > >> arch/s390/include/asm/barrier.h:50:#define __smp_mb__after_atomic()	barrier()
> > > > > >> arch/sparc/include/asm/barrier_64.h:57:#define __smp_mb__after_atomic()	barrier()
> > > > > >> arch/x86/include/asm/barrier.h:83:#define __smp_mb__after_atomic()	do { } while (0)
> > > > > >> arch/xtensa/include/asm/barrier.h:20:#define __smp_mb__after_atomic()	barrier()
> > > > > >> include/asm-generic/barrier.h:116:#define __smp_mb__after_atomic()	__smp_mb()
> > > > > >>
> > > > > >> My interpretation of the above is that not all smp_mb__after_atomic()
> > > > > >> implementations have release semantics. Do you agree with this conclusion?
> > > > > > 
> > > > > > I understand smp_mb__after_atomic() orders set_bit(BLK_MQ_S_INACTIVE)
> > > > > > and reading the tag bit which is done in blk_mq_all_tag_iter().
> > > > > > 
> > > > > > So the two pair of OPs are ordered:
> > > > > > 
> > > > > > 1) if one request(tag bit) is allocated before setting BLK_MQ_S_INACTIVE,
> > > > > > the tag bit will be observed in blk_mq_all_tag_iter() from blk_mq_hctx_has_requests(),
> > > > > > so the request will be drained.
> > > > > > 
> > > > > > OR
> > > > > > 
> > > > > > 2) if one request(tag bit) is allocated after setting BLK_MQ_S_INACTIVE,
> > > > > > the request(tag bit) will be released and retried on another CPU
> > > > > > finally, see __blk_mq_alloc_request().
> > > > > > 
> > > > > > Cc Paul and linux-kernel list.
> > > > > 
> > > > > I do not agree with the above conclusion. My understanding of
> > > > > acquire/release labels is that if the following holds:
> > > > > (1) A store operation that stores the value V into memory location M has
> > > > > a release label.
> > > > > (2) A load operation that reads memory location M has an acquire label.
> > > > > (3) The load operation (2) retrieves the value V that was stored by (1).
> > > > > 
> > > > > that the following ordering property holds: all load and store
> > > > > instructions that happened before the store instruction (1) in program
> > > > > order are guaranteed to happen before the load and store instructions
> > > > > that follow (2) in program order.
> > > > > 
> > > > > In the ARM manual these semantics have been described as follows: "A
> > > > > Store-Release instruction is multicopy atomic when observed with a
> > > > > Load-Acquire instruction".
> > > > > 
> > > > > In this case the load-acquire operation is the
> > > > > "test_and_set_bit_lock(nr, word)" statement from the sbitmap code. That
> > > > > code is executed indirectly by blk_mq_get_tag(). Since there is no
> > > > > matching store-release instruction in __blk_mq_alloc_request() for
> > > > > 'word', ordering of the &data->hctx->state and 'tag' memory locations is
> > > > > not guaranteed by the acquire property of the "test_and_set_bit_lock(nr,
> > > > > word)" statement from the sbitmap code.
> > > > 
> > > > I feel like I just parachuted into the middle of the conversation,
> > > > so let me start by giving a (silly) example illustrating the limits of
> > > > smp_mb__{before,after}_atomic() that might be tangling things up.
> > > > 
> > > > But please please please avoid doing this in real code unless you have
> > > > an extremely good reason included in a comment.
> > > > 
> > > > void t1(void)
> > > > {
> > > > 	WRITE_ONCE(a, 1);
> > > > 	smp_mb__before_atomic();
> > > > 	WRITE_ONCE(b, 1);  // Just Say No to code here!!!
> > > > 	atomic_inc(&c);
> > > > 	WRITE_ONCE(d, 1);  // Just Say No to code here!!!
> > > > 	smp_mb__after_atomic();
> > > > 	WRITE_ONCE(e, 1);
> > > > }
> > > > 
> > > > void t2(void)
> > > > {
> > > > 	r1 = READ_ONCE(e);
> > > > 	smp_mb();
> > > > 	r2 = READ_ONCE(d);
> > > > 	smp_mb();
> > > > 	r3 = READ_ONCE(c);
> > > > 	smp_mb();
> > > > 	r4 = READ_ONCE(b);
> > > > 	smp_mb();
> > > > 	r5 = READ_ONCE(a);
> > > > }
> > > > 
> > > > Each platform must provide strong ordering for either atomic_inc()
> > > > on the one hand (as ia64 does) or for smp_mb__{before,after}_atomic()
> > > > on the other (as powerpc does).  Note that both ia64 and powerpc are
> > > > weakly ordered.
> > > > 
> > > > So ia64 could see (r1 == 1 && r2 == 0) on the one hand as well as (r4 ==
> > > > 1 && r5 == 0).  So clearly smp_mb__{before,after}_atomic() need not have
> > > > any ordering properties whatsoever.
> > > > 
> > > > Similarly, powerpc could see (r3 == 1 && r4 == 0) on the one hand as well
> > > > as (r2 == 1 && r3 == 0) on the other.  Or even both at the same time.
> > > > So clearly atomic_inc() need not have any ordering properties whatsoever.
> > > > 
> > > > But the combination of smp_mb__before_atomic() and the later atomic_inc()
> > > > does provide full ordering, so that no architecture can see (r3 == 1 &&
> > > > r5 == 0), and either of r1 or r2 can be substituted for r3.
> > > > 
> > > > Similarly, atomic_inc() and the later smp_mb__after_atomic() also
> > > > provide full ordering, so that no architecture can see (r1 == 1 && r3 ==
> > > > 0), and either r4 or r5 can be substituted for r3.
> > > > 
> > > > 
> > > > So a call to set_bit() followed by a call to smp_mb__after_atomic() will
> > > > provide a full memory barrier (implying release semantics) for any write
> > > > access after the smp_mb__after_atomic() with respect to the set_bit() or
> > > > any access preceding it.  But the set_bit() by itself won't have release
> > > > semantics, nor will the smp_mb__after_atomic(), only their combination
> > > > further combined with some write following the smp_mb__after_atomic().
> > > > 
> > > > More generally, there will be the equivalent of smp_mb() somewhere between
> > > > the set_bit() and every access following the smp_mb__after_atomic().
> > > > 
> > > > Does that help, or am I missing the point?
> > > 
> > > Yeah, it does help.
> > > 
> > > BTW, can we replace the smp_mb__after_atomic() with smp_mb() for
> > > ordering set_bit() and the memory OP following the smp_mb()?
> > 
> > Placing an smp_mb() between set_bit() and a later access will indeed
> > order set_bit() with that later access.
> > 
> > That said, I don't know this code well enough to say whether or not
> > that ordering is sufficient.
> 
> Another pair is in blk_mq_get_tag(), and we expect the following two
> memory OPs are ordered:
> 
> 1) set bit in successful test_and_set_bit_lock(), which is called
> from sbitmap_get()
> 
> 2) test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)
> 
> Do you think that the above two OPs are ordered?

Given that he has been through the code, I would like to hear Bart's
thoughts, actually.

							Thanx, Paul
Bart Van Assche May 29, 2020, 7:55 p.m. UTC | #13
On 2020-05-29 11:13, Paul E. McKenney wrote:
> On Fri, May 29, 2020 at 11:53:15AM +0800, Ming Lei wrote:
>> Another pair is in blk_mq_get_tag(), and we expect the following two
>> memory OPs are ordered:
>>
>> 1) set bit in successful test_and_set_bit_lock(), which is called
>> from sbitmap_get()
>>
>> 2) test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)
>>
>> Do you think that the above two OPs are ordered?
> 
> Given that he has been through the code, I would like to hear Bart's
> thoughts, actually.

Hi Paul,

My understanding of the involved instructions is as follows (see also
https://lore.kernel.org/linux-block/b98f055f-6f38-a47c-965d-b6bcf4f5563f@huawei.com/T/#t
for the entire e-mail thread):
* blk_mq_hctx_notify_offline() sets the BLK_MQ_S_INACTIVE bit in
hctx->state, calls smp_mb__after_atomic() and waits in a loop until all
tags have been freed. Each tag is an integer number that has a 1:1
correspondence with a block layer request structure. The code that
iterates over block layer request tags relies on
__sbitmap_for_each_set(). That function examines both the 'word' and
'cleared' members of struct sbitmap_word.
* What blk_mq_hctx_notify_offline() waits for is freeing of tags by
blk_mq_put_tag(). blk_mq_put_tag() frees a tag by setting a bit in
sbitmap_word.cleared (see also sbitmap_deferred_clear_bit()).
* Tag allocation by blk_mq_get_tag() relies on test_and_set_bit_lock().
The actual allocation happens by sbitmap_get() that sets a bit in
sbitmap_word.word. blk_mq_get_tag() tests the BLK_MQ_S_INACTIVE bit
after tag allocation succeeded.

What confuses me is that blk_mq_hctx_notify_offline() uses
smp_mb__after_atomic() to enforce the order of memory accesses while
blk_mq_get_tag() relies on the acquire semantics of
test_and_set_bit_lock(). Usually ordering is enforced by combining two
smp_mb() calls or by combining a store-release with a load-acquire.
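
The conventional pairing referred to above, in a minimal sketch with
hypothetical variables (not blk-mq code):

	/* writer */
	WRITE_ONCE(data, 1);
	smp_store_release(&flag, 1);	/* orders the data store before the flag store */

	/* reader */
	if (smp_load_acquire(&flag))	/* orders the flag load before the data load */
		r1 = READ_ONCE(data);	/* guaranteed to observe data == 1 */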

Does the Linux memory model provide the expected ordering guarantees
when combining load-acquire with smp_mb__after_atomic() as used in patch
8/8 of this series?

Thanks,

Bart.
Paul E. McKenney May 29, 2020, 9:12 p.m. UTC | #14
On Fri, May 29, 2020 at 12:55:43PM -0700, Bart Van Assche wrote:
> On 2020-05-29 11:13, Paul E. McKenney wrote:
> > On Fri, May 29, 2020 at 11:53:15AM +0800, Ming Lei wrote:
> >> Another pair is in blk_mq_get_tag(), and we expect the following two
> >> memory OPs are ordered:
> >>
> >> 1) set bit in successful test_and_set_bit_lock(), which is called
> >> from sbitmap_get()
> >>
> >> 2) test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state)
> >>
> >> Do you think that the above two OPs are ordered?
> > 
> > Given that he has been through the code, I would like to hear Bart's
> > thoughts, actually.
> 
> Hi Paul,
> 
> My understanding of the involved instructions is as follows (see also
> https://lore.kernel.org/linux-block/b98f055f-6f38-a47c-965d-b6bcf4f5563f@huawei.com/T/#t
> for the entire e-mail thread):
> * blk_mq_hctx_notify_offline() sets the BLK_MQ_S_INACTIVE bit in
> hctx->state, calls smp_mb__after_atomic() and waits in a loop until all
> tags have been freed. Each tag is an integer number that has a 1:1
> correspondence with a block layer request structure. The code that
> iterates over block layer request tags relies on
> __sbitmap_for_each_set(). That function examines both the 'word' and
> 'cleared' members of struct sbitmap_word.
> * What blk_mq_hctx_notify_offline() waits for is freeing of tags by
> blk_mq_put_tag(). blk_mq_put_tag() frees a tag by setting a bit in
> sbitmap_word.cleared (see also sbitmap_deferred_clear_bit()).
> * Tag allocation by blk_mq_get_tag() relies on test_and_set_bit_lock().
> The actual allocation happens by sbitmap_get() that sets a bit in
> sbitmap_word.word. blk_mq_get_tag() tests the BLK_MQ_S_INACTIVE bit
> after tag allocation succeeded.
> 
> What confuses me is that blk_mq_hctx_notify_offline() uses
> smp_mb__after_atomic() to enforce the order of memory accesses while
> blk_mq_get_tag() relies on the acquire semantics of
> test_and_set_bit_lock(). Usually ordering is enforced by combining two
> smp_mb() calls or by combining a store-release with a load-acquire.
> 
> Does the Linux memory model provide the expected ordering guarantees
> when combining load-acquire with smp_mb__after_atomic() as used in patch
> 8/8 of this series?

Strictly speaking, smp_mb__after_atomic() works only in combination
with a non-value-returning atomic operation. Let's look at a (silly)
example where smp_mb__after_atomic() would not help in conjunction
with smp_store_release():

void thread1(void)
{
	smp_store_release(&x, 1);
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);
}

void thread2(void)
{
	smp_store_release(&y, 1);
	smp_mb__after_atomic();
	r2 = READ_ONCE(x);
}

Even on x86 (or perhaps especially on x86) it is quite possible that
execution could end with r1 == r2 == 0 because on x86 there is no
ordering whatsoever from smp_mb__after_atomic().  In this case,
the CPU is well within its rights to reorder each thread's store
with its later load.  Yes, even x86.

On the other hand, suppose that the stores are non-value-returning
atomics:

void thread1(void)
{
	atomic_inc(&x);
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);
}

void thread2(void)
{
	atomic_inc(&y);
	smp_mb__after_atomic();
	r2 = READ_ONCE(x);
}

In this case, for all architectures, there would be the equivalent
of an smp_mb() full barrier associated with either the atomic_inc()
or the smp_mb__after_atomic(), which would rule out the case where
execution ends with r1 == r2 == 0.
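
Mapping that back to patch 8/8 (a sketch, not a full proof): set_bit()
is a non-value-returning atomic, so the offline side falls into the
second category above:

	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
	smp_mb__after_atomic();	/* combination with set_bit() acts as smp_mb() */
	/* ... iterate the tag bits and wait for in-flight requests ... */

The question raised earlier in the thread is then about the allocation
side, which relies on the acquire semantics of test_and_set_bit_lock()
rather than on a full barrier.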

Does that help?

							Thanx, Paul
diff mbox series

Patch

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 96b7a35c898a7..15df3a36e9fa4 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -213,6 +213,7 @@  static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
 	HCTX_STATE_NAME(TAG_ACTIVE),
 	HCTX_STATE_NAME(SCHED_RESTART),
+	HCTX_STATE_NAME(INACTIVE),
 };
 #undef HCTX_STATE_NAME
 
@@ -239,6 +240,7 @@  static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(TAG_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
+	HCTX_FLAG_NAME(STACKING),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9f74064768423..1c548d9f67ee7 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -180,6 +180,14 @@  unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	sbitmap_finish_wait(bt, ws, &wait);
 
 found_tag:
+	/*
+	 * Give up this allocation if the hctx is inactive.  The caller will
+	 * retry on an active hctx.
+	 */
+	if (unlikely(test_bit(BLK_MQ_S_INACTIVE, &data->hctx->state))) {
+		blk_mq_put_tag(tags, data->ctx, tag + tag_offset);
+		return -1;
+	}
 	return tag + tag_offset;
 }
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 898400452b1cf..e4580cd6c6f49 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -375,14 +375,39 @@  static struct request *__blk_mq_alloc_request(struct blk_mq_alloc_data *data)
 			e->type->ops.limit_depth(data->cmd_flags, data);
 	}
 
+retry:
 	data->ctx = blk_mq_get_ctx(q);
 	data->hctx = blk_mq_map_queue(q, data->cmd_flags, data->ctx);
 	if (!(data->flags & BLK_MQ_REQ_INTERNAL))
 		blk_mq_tag_busy(data->hctx);
 
+	/*
+	 * Waiting allocations only fail because of an inactive hctx.  In that
+	 * case just retry the hctx assignment and tag allocation as CPU hotplug
+	 * should have migrated us to an online CPU by now.
+	 */
 	tag = blk_mq_get_tag(data);
-	if (tag == BLK_MQ_NO_TAG)
-		return NULL;
+	if (tag == BLK_MQ_NO_TAG) {
+		if (data->flags & BLK_MQ_REQ_NOWAIT)
+			return NULL;
+
+		/*
+		 * All kthreads that can perform I/O should have been moved off
+		 * this CPU by the time the CPU hotplug state machine has
+		 * shut down a hctx.  But better be sure with an extra sanity
+		 * check.
+		 */
+		if (WARN_ON_ONCE(current->flags & PF_KTHREAD))
+			return NULL;
+
+		/*
+		 * Give up the CPU and sleep for a random short time to ensure
+		 * that threads using a realtime scheduling class are migrated
+		 * off the CPU.
+		 */
+		msleep(3);
+		goto retry;
+	}
 	return blk_mq_rq_ctx_init(data, tag, alloc_time_ns);
 }
 
@@ -2324,6 +2349,86 @@  int blk_mq_alloc_rqs(struct blk_mq_tag_set *set, struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
+struct rq_iter_data {
+	struct blk_mq_hw_ctx *hctx;
+	bool has_rq;
+};
+
+static bool blk_mq_has_request(struct request *rq, void *data, bool reserved)
+{
+	struct rq_iter_data *iter_data = data;
+
+	if (rq->mq_hctx != iter_data->hctx)
+		return true;
+	iter_data->has_rq = true;
+	return false;
+}
+
+static bool blk_mq_hctx_has_requests(struct blk_mq_hw_ctx *hctx)
+{
+	struct blk_mq_tags *tags = hctx->sched_tags ?
+			hctx->sched_tags : hctx->tags;
+	struct rq_iter_data data = {
+		.hctx	= hctx,
+	};
+
+	blk_mq_all_tag_iter(tags, blk_mq_has_request, &data);
+	return data.has_rq;
+}
+
+static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
+		struct blk_mq_hw_ctx *hctx)
+{
+	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
+		return false;
+	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
+		return false;
+	return true;
+}
+
+static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
+	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
+		return 0;
+
+	/*
+	 * Prevent new requests from being allocated on the current hctx.
+	 *
+	 * The smp_mb__after_atomic() pairs with the implied barrier in
+	 * test_and_set_bit_lock() in sbitmap_get(), and ensures the inactive
+	 * flag is seen once we return from the tag allocator.
+	 */
+	set_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	smp_mb__after_atomic();
+
+	/*
+	 * Try to grab a reference to the queue and wait for any outstanding
+	 * requests.  If we could not grab a reference the queue has been
+	 * frozen and there are no requests.
+	 */
+	if (percpu_ref_tryget(&hctx->queue->q_usage_counter)) {
+		while (blk_mq_hctx_has_requests(hctx))
+			msleep(5);
+		percpu_ref_put(&hctx->queue->q_usage_counter);
+	}
+
+	return 0;
+}
+
+static int blk_mq_hctx_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
+			struct blk_mq_hw_ctx, cpuhp_online);
+
+	if (cpumask_test_cpu(cpu, hctx->cpumask))
+		clear_bit(BLK_MQ_S_INACTIVE, &hctx->state);
+	return 0;
+}
+
 /*
  * 'cpu' is going away. splice any existing rq_list entries from this
  * software queue to the hw queue dispatch list, and ensure that it
@@ -2337,6 +2442,9 @@  static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 	enum hctx_type type;
 
 	hctx = hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);
+	if (!cpumask_test_cpu(cpu, hctx->cpumask))
+		return 0;
+
 	ctx = __blk_mq_get_ctx(hctx->queue, cpu);
 	type = hctx->type;
 
@@ -2360,6 +2468,9 @@  static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
 
 static void blk_mq_remove_cpuhp(struct blk_mq_hw_ctx *hctx)
 {
+	if (!(hctx->flags & BLK_MQ_F_STACKING))
+		cpuhp_state_remove_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+						    &hctx->cpuhp_online);
 	cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
 					    &hctx->cpuhp_dead);
 }
@@ -2419,6 +2530,9 @@  static int blk_mq_init_hctx(struct request_queue *q,
 {
 	hctx->queue_num = hctx_idx;
 
+	if (!(hctx->flags & BLK_MQ_F_STACKING))
+		cpuhp_state_add_instance_nocalls(CPUHP_AP_BLK_MQ_ONLINE,
+				&hctx->cpuhp_online);
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
@@ -3673,6 +3787,9 @@  static int __init blk_mq_init(void)
 {
 	cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
 				blk_mq_hctx_notify_dead);
+	cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
+				blk_mq_hctx_notify_online,
+				blk_mq_hctx_notify_offline);
 	return 0;
 }
 subsys_initcall(blk_mq_init);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e5..d7904b4d8d126 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2037,7 +2037,7 @@  static int loop_add(struct loop_device **l, int i)
 	lo->tag_set.queue_depth = 128;
 	lo->tag_set.numa_node = NUMA_NO_NODE;
 	lo->tag_set.cmd_size = sizeof(struct loop_cmd);
-	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	lo->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
 	lo->tag_set.driver_data = lo;
 
 	err = blk_mq_alloc_tag_set(&lo->tag_set);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13be..f60c025121215 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -547,7 +547,7 @@  int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
 	md->tag_set->ops = &dm_mq_ops;
 	md->tag_set->queue_depth = dm_get_blk_mq_queue_depth();
 	md->tag_set->numa_node = md->numa_node_id;
-	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE;
+	md->tag_set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_STACKING;
 	md->tag_set->nr_hw_queues = dm_get_blk_mq_nr_hw_queues();
 	md->tag_set->driver_data = md;
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d7307795439a4..a20f8c241d665 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -140,6 +140,8 @@  struct blk_mq_hw_ctx {
 	 */
 	atomic_t		nr_active;
 
+	/** @cpuhp_online: List to store request if CPU is going to die */
+	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
 	struct hlist_node	cpuhp_dead;
 	/** @kobj: Kernel object for sysfs. */
@@ -391,6 +393,11 @@  struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	/*
+	 * Set when this device requires an underlying blk-mq device for
+	 * completing IO.
+	 */
+	BLK_MQ_F_STACKING	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
@@ -400,6 +407,9 @@  enum {
 	BLK_MQ_S_TAG_ACTIVE	= 1,
 	BLK_MQ_S_SCHED_RESTART	= 2,
 
+	/* hw queue is inactive after all its CPUs become offline */
+	BLK_MQ_S_INACTIVE	= 3,
+
 	BLK_MQ_MAX_DEPTH	= 10240,
 
 	BLK_MQ_CPU_WORK_BATCH	= 8,
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 77d70b6335318..24b3a77810b6d 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -152,6 +152,7 @@  enum cpuhp_state {
 	CPUHP_AP_SMPBOOT_THREADS,
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
+	CPUHP_AP_BLK_MQ_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,