[v7,3/5] blk-mq: Fix races between iterating over requests and freeing requests

Message ID 20210421000235.2028-4-bvanassche@acm.org (mailing list archive)
State New, archived
Series blk-mq: Fix a race between iterating over requests and freeing requests

Commit Message

Bart Van Assche April 21, 2021, 12:02 a.m. UTC
When multiple request queues share a tag set and when switching the I/O
scheduler for one of the request queues associated with that tag set, the
following race can happen:
* blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
  a pointer to a scheduler request to the local variable 'rq'.
* blk_mq_sched_free_requests() is called to free hctx->sched_tags.
* blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.

Fix this race as follows:
* Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
  The performance impact of the assignments added to the hot path is minimal
  (about 1% according to one particular test).
* Protect hctx->tags->rqs[] reads with an RCU read-side lock or with a
  semaphore. Which mechanism is used depends on whether or not it is allowed
  to sleep and also on whether or not the callback function may sleep.
* Wait for all preexisting readers to finish before freeing scheduler
  requests.
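
A condensed sketch of the resulting access pattern (the actual patch below
is authoritative; 'tags' is the struct blk_mq_tags instance and iter_rwsem
is the semaphore mentioned above):

	/* hot path: publish or clear an ->rqs[] entry */
	rcu_assign_pointer(tags->rqs[tag], rq);
	...
	rcu_assign_pointer(tags->rqs[tag], NULL);

	/* iteration with a non-sleeping callback */
	rcu_read_lock();
	rq = rcu_dereference(tags->rqs[tag]);
	...
	rcu_read_unlock();

	/* iteration with a sleeping callback */
	down_read(&tags->iter_rwsem);
	rq = rcu_dereference_check(tags->rqs[tag],
				   lockdep_is_held(&tags->iter_rwsem));
	...
	up_read(&tags->iter_rwsem);

	/* before freeing scheduler requests: wait for preexisting readers */
	down_write(&tags->iter_rwsem);
	up_write(&tags->iter_rwsem);
	synchronize_rcu();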

Another race is as follows:
* blk_mq_sched_free_requests() is called to free hctx->sched_tags.
* blk_mq_queue_tag_busy_iter() iterates over the same tag set but for another
  request queue than the queue for which scheduler tags are being freed.
* bt_iter() examines rq->q after *rq has been freed.

Fix this race by protecting the rq->q read in bt_iter() with an RCU read
lock and by calling synchronize_rcu() before freeing scheduler tags.
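
In bt_iter() and in the tag-freeing path this pairing looks roughly as
follows (a sketch condensed from the patch below):

	/* bt_iter(): reader side */
	rcu_read_lock();
	rq = rcu_dereference(tags->rqs[bitnr]);
	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx) {
		rcu_read_unlock();
		return iter_data->fn(hctx, rq, iter_data->data, reserved);
	}
	rcu_read_unlock();

	/* elevator switch / queue cleanup: updater side */
	blk_mq_wait_for_tag_iter(q->tag_set);	/* ends with synchronize_rcu() */
	blk_mq_sched_free_requests(q);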

Multiple users have reported use-after-free complaints similar to the
following (from https://lore.kernel.org/linux-block/1545261885.185366.488.camel@acm.org/ ):

BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
Read of size 8 at addr ffff88803b335240 by task fio/21412

CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 dump_stack+0x86/0xca
 print_address_description+0x71/0x239
 kasan_report.cold.5+0x242/0x301
 __asan_load8+0x54/0x90
 bt_iter+0x86/0xf0
 blk_mq_queue_tag_busy_iter+0x373/0x5e0
 blk_mq_in_flight+0x96/0xb0
 part_in_flight+0x40/0x140
 part_round_stats+0x18e/0x370
 blk_account_io_start+0x3d7/0x670
 blk_mq_bio_to_request+0x19c/0x3a0
 blk_mq_make_request+0x7a9/0xcb0
 generic_make_request+0x41d/0x960
 submit_bio+0x9b/0x250
 do_blockdev_direct_IO+0x435c/0x4c70
 __blockdev_direct_IO+0x79/0x88
 ext4_direct_IO+0x46c/0xc00
 generic_file_direct_write+0x119/0x210
 __generic_file_write_iter+0x11c/0x280
 ext4_file_write_iter+0x1b8/0x6f0
 aio_write+0x204/0x310
 io_submit_one+0x9d3/0xe80
 __x64_sys_io_submit+0x115/0x340
 do_syscall_64+0x71/0x210

Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: John Garry <john.garry@huawei.com>
Cc: Khazhy Kumykov <khazhy@google.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-core.c   | 34 ++++++++++++++++++++++++++++++-
 block/blk-mq-tag.c | 51 ++++++++++++++++++++++++++++++++++++++++------
 block/blk-mq-tag.h |  4 +++-
 block/blk-mq.c     | 21 +++++++++++++++----
 block/blk-mq.h     |  1 +
 block/blk.h        |  2 ++
 block/elevator.c   |  1 +
 7 files changed, 102 insertions(+), 12 deletions(-)

Comments

Ming Lei April 22, 2021, 2:25 a.m. UTC | #1
On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
> When multiple request queues share a tag set and when switching the I/O
> scheduler for one of the request queues associated with that tag set, the
> following race can happen:
> * blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
>   a pointer to a scheduler request to the local variable 'rq'.
> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
> * blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.
> 
> Fix this race as follows:
> * Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
>   The performance impact of the assignments added to the hot path is minimal
>   (about 1% according to one particular test).
> * Protect hctx->tags->rqs[] reads with an RCU read-side lock or with a
>   semaphore. Which mechanism is used depends on whether or not it is allowed
>   to sleep and also on whether or not the callback function may sleep.
> * Wait for all preexisting readers to finish before freeing scheduler
>   requests.
> 
> Another race is as follows:
> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
> * blk_mq_queue_tag_busy_iter() iterates over the same tag set but for another
>   request queue than the queue for which scheduler tags are being freed.
> * bt_iter() examines rq->q after *rq has been freed.
> 
> Fix this race by protecting the rq->q read in bt_iter() with an RCU read
> lock and by calling synchronize_rcu() before freeing scheduler tags.
> 
> Multiple users have reported use-after-free complaints similar to the
> following (from https://lore.kernel.org/linux-block/1545261885.185366.488.camel@acm.org/ ):
> 
> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
> Read of size 8 at addr ffff88803b335240 by task fio/21412
> 
> CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> Call Trace:
>  dump_stack+0x86/0xca
>  print_address_description+0x71/0x239
>  kasan_report.cold.5+0x242/0x301
>  __asan_load8+0x54/0x90
>  bt_iter+0x86/0xf0
>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>  blk_mq_in_flight+0x96/0xb0
>  part_in_flight+0x40/0x140
>  part_round_stats+0x18e/0x370
>  blk_account_io_start+0x3d7/0x670
>  blk_mq_bio_to_request+0x19c/0x3a0
>  blk_mq_make_request+0x7a9/0xcb0
>  generic_make_request+0x41d/0x960
>  submit_bio+0x9b/0x250
>  do_blockdev_direct_IO+0x435c/0x4c70
>  __blockdev_direct_IO+0x79/0x88
>  ext4_direct_IO+0x46c/0xc00
>  generic_file_direct_write+0x119/0x210
>  __generic_file_write_iter+0x11c/0x280
>  ext4_file_write_iter+0x1b8/0x6f0
>  aio_write+0x204/0x310
>  io_submit_one+0x9d3/0xe80
>  __x64_sys_io_submit+0x115/0x340
>  do_syscall_64+0x71/0x210
> 
> Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Khazhy Kumykov <khazhy@google.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-core.c   | 34 ++++++++++++++++++++++++++++++-
>  block/blk-mq-tag.c | 51 ++++++++++++++++++++++++++++++++++++++++------
>  block/blk-mq-tag.h |  4 +++-
>  block/blk-mq.c     | 21 +++++++++++++++----
>  block/blk-mq.h     |  1 +
>  block/blk.h        |  2 ++
>  block/elevator.c   |  1 +
>  7 files changed, 102 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 9bcdae93f6d4..ca7f833e25a8 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -279,6 +279,36 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
>  }
>  EXPORT_SYMBOL(blk_dump_rq_flags);
>  
> +/**
> + * blk_mq_wait_for_tag_iter - wait for preexisting tag iteration functions to finish
> + * @set: Tag set to wait on.
> + *
> + * Waits for preexisting calls of blk_mq_all_tag_iter(),
> + * blk_mq_tagset_busy_iter() and also for their atomic variants to finish
> + * accessing hctx->tags->rqs[]. New readers may start while this function is
> + * in progress or after this function has returned. Use this function to make
> + * sure that hctx->tags->rqs[] changes have become globally visible.
> + *
> + * Waits for preexisting blk_mq_queue_tag_busy_iter(q, fn, priv) calls to
> + * finish accessing requests associated with other request queues than 'q'.
> + */
> +void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set)
> +{
> +	struct blk_mq_tags *tags;
> +	int i;
> +
> +	if (set->tags) {
> +		for (i = 0; i < set->nr_hw_queues; i++) {
> +			tags = set->tags[i];
> +			if (!tags)
> +				continue;
> +			down_write(&tags->iter_rwsem);
> +			up_write(&tags->iter_rwsem);
> +		}
> +	}
> +	synchronize_rcu();
> +}
> +
>  /**
>   * blk_sync_queue - cancel any pending callbacks on a queue
>   * @q: the queue
> @@ -412,8 +442,10 @@ void blk_cleanup_queue(struct request_queue *q)
>  	 * it is safe to free requests now.
>  	 */
>  	mutex_lock(&q->sysfs_lock);
> -	if (q->elevator)
> +	if (q->elevator) {
> +		blk_mq_wait_for_tag_iter(q->tag_set);
>  		blk_mq_sched_free_requests(q);
> +	}
>  	mutex_unlock(&q->sysfs_lock);
>  
>  	percpu_ref_exit(&q->q_usage_counter);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index d8eaa38a1bd1..39d5c9190a6b 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -209,14 +209,24 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  
>  	if (!reserved)
>  		bitnr += tags->nr_reserved_tags;
> -	rq = tags->rqs[bitnr];
> -
> +	rcu_read_lock();
> +	/*
> +	 * The request 'rq' points at is protected by an RCU read lock until
> +	 * its queue pointer has been verified and by q_usage_count while the
> +	 * callback function is being invoked. See also the
> +	 * percpu_ref_tryget() and blk_queue_exit() calls in
> +	 * blk_mq_queue_tag_busy_iter().
> +	 */
> +	rq = rcu_dereference(tags->rqs[bitnr]);
>  	/*
>  	 * We can hit rq == NULL here, because the tagging functions
>  	 * test and set the bit before assigning ->rqs[].
>  	 */
> -	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
> +	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx) {
> +		rcu_read_unlock();
>  		return iter_data->fn(hctx, rq, iter_data->data, reserved);
> +	}
> +	rcu_read_unlock();
>  	return true;
>  }
>  
> @@ -254,11 +264,17 @@ struct bt_tags_iter_data {
>  	unsigned int flags;
>  };
>  
> +/* Include reserved tags. */
>  #define BT_TAG_ITER_RESERVED		(1 << 0)
> +/* Only include started requests. */
>  #define BT_TAG_ITER_STARTED		(1 << 1)
> +/* Iterate over tags->static_rqs[] instead of tags->rqs[]. */
>  #define BT_TAG_ITER_STATIC_RQS		(1 << 2)
> +/* The callback function may sleep. */
> +#define BT_TAG_ITER_MAY_SLEEP		(1 << 3)
>  
> -static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> +static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
> +			   void *data)
>  {
>  	struct bt_tags_iter_data *iter_data = data;
>  	struct blk_mq_tags *tags = iter_data->tags;
> @@ -275,7 +291,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
>  		rq = tags->static_rqs[bitnr];
>  	else
> -		rq = tags->rqs[bitnr];
> +		rq = rcu_dereference_check(tags->rqs[bitnr],
> +					   lockdep_is_held(&tags->iter_rwsem));
>  	if (!rq)
>  		return true;
>  	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
> @@ -284,6 +301,25 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	return iter_data->fn(rq, iter_data->data, reserved);
>  }
>  
> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> +{
> +	struct bt_tags_iter_data *iter_data = data;
> +	struct blk_mq_tags *tags = iter_data->tags;
> +	bool res;
> +
> +	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
> +		down_read(&tags->iter_rwsem);
> +		res = __bt_tags_iter(bitmap, bitnr, data);
> +		up_read(&tags->iter_rwsem);
> +	} else {
> +		rcu_read_lock();
> +		res = __bt_tags_iter(bitmap, bitnr, data);
> +		rcu_read_unlock();
> +	}
> +
> +	return res;
> +}
> +
>  /**
>   * bt_tags_for_each - iterate over the requests in a tag map
>   * @tags:	Tag map to iterate over.
> @@ -357,10 +393,12 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>  {
>  	int i;
>  
> +	might_sleep();
> +
>  	for (i = 0; i < tagset->nr_hw_queues; i++) {
>  		if (tagset->tags && tagset->tags[i])
>  			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> -					      BT_TAG_ITER_STARTED);
> +				BT_TAG_ITER_STARTED | BT_TAG_ITER_MAY_SLEEP);
>  	}
>  }
>  EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
> @@ -544,6 +582,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>  
>  	tags->nr_tags = total_tags;
>  	tags->nr_reserved_tags = reserved_tags;
> +	init_rwsem(&tags->iter_rwsem);
>  
>  	if (blk_mq_is_sbitmap_shared(flags))
>  		return tags;
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 0290c308ece9..d1d73d7cc7df 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -17,9 +17,11 @@ struct blk_mq_tags {
>  	struct sbitmap_queue __bitmap_tags;
>  	struct sbitmap_queue __breserved_tags;
>  
> -	struct request **rqs;
> +	struct request __rcu **rqs;
>  	struct request **static_rqs;
>  	struct list_head page_list;
> +
> +	struct rw_semaphore iter_rwsem;
>  };
>  
>  extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 79c01b1f885c..8b59f6b4ec8e 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -496,8 +496,10 @@ static void __blk_mq_free_request(struct request *rq)
>  	blk_crypto_free_request(rq);
>  	blk_pm_mark_last_busy(rq);
>  	rq->mq_hctx = NULL;
> -	if (rq->tag != BLK_MQ_NO_TAG)
> +	if (rq->tag != BLK_MQ_NO_TAG) {
>  		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> +		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);

After the tag is released, it can be re-allocated from another context
immediately, and the above rcu_assign_pointer(NULL) could then run after
the re-allocation and the new ->rqs[rq->tag] assignment in the submission
path. Isn't that an issue?


Thanks, 
Ming
Ming Lei April 22, 2021, 3:15 a.m. UTC | #2
On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
> [...]
>  
> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> +{
> +	struct bt_tags_iter_data *iter_data = data;
> +	struct blk_mq_tags *tags = iter_data->tags;
> +	bool res;
> +
> +	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
> +		down_read(&tags->iter_rwsem);
> +		res = __bt_tags_iter(bitmap, bitnr, data);
> +		up_read(&tags->iter_rwsem);
> +	} else {
> +		rcu_read_lock();
> +		res = __bt_tags_iter(bitmap, bitnr, data);
> +		rcu_read_unlock();
> +	}
> +
> +	return res;
> +}

Holding the rwsem or the RCU read lock won't avoid the issue completely,
because the request may be completed remotely from iter_data->fn(), for
example by nbd_clear_req(), nvme_cancel_request(), complete_all_cmds_iter()
or mtip_no_dev_cleanup(). blk_mq_complete_request() may complete a request
in softirq context, via a remote IPI or even via a workqueue, so the
request is still referenced in those contexts after bt_tags_iter() returns.


Thanks,
Ming
Bart Van Assche April 22, 2021, 3:54 a.m. UTC | #3
On 4/21/21 8:15 PM, Ming Lei wrote:
> On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
>> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> +{
>> +	struct bt_tags_iter_data *iter_data = data;
>> +	struct blk_mq_tags *tags = iter_data->tags;
>> +	bool res;
>> +
>> +	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
>> +		down_read(&tags->iter_rwsem);
>> +		res = __bt_tags_iter(bitmap, bitnr, data);
>> +		up_read(&tags->iter_rwsem);
>> +	} else {
>> +		rcu_read_lock();
>> +		res = __bt_tags_iter(bitmap, bitnr, data);
>> +		rcu_read_unlock();
>> +	}
>> +
>> +	return res;
>> +}
> 
> Holding one rwsem or rcu read lock won't avoid the issue completely
> because request may be completed remotely in iter_data->fn(), such as
> nbd_clear_req(), nvme_cancel_request(), complete_all_cmds_iter(),
> mtip_no_dev_cleanup(), because blk_mq_complete_request() may complete
> request in softirq, remote IPI, even wq, and the request is still
> referenced in these contexts after bt_tags_iter() returns.

The rwsem and RCU read lock are used to serialize iterating over
requests against blk_mq_sched_free_requests() calls. I don't think it
matters for this patch from which context requests are freed.

Bart.
Bart Van Assche April 22, 2021, 4:01 a.m. UTC | #4
On 4/21/21 7:25 PM, Ming Lei wrote:
> On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
>> [...]
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 79c01b1f885c..8b59f6b4ec8e 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -496,8 +496,10 @@ static void __blk_mq_free_request(struct request *rq)
>>  	blk_crypto_free_request(rq);
>>  	blk_pm_mark_last_busy(rq);
>>  	rq->mq_hctx = NULL;
>> -	if (rq->tag != BLK_MQ_NO_TAG)
>> +	if (rq->tag != BLK_MQ_NO_TAG) {
>>  		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
>> +		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
> 
> After the tag is released, it can be re-allocated from another context
> immediately. And the above rcu_assign_pointer(NULL) could be run after
> the re-allocation and new ->rqs[rq->tag] assignment in submission path,
> is it one issue?

How about swapping the rcu_assign_pointer() and blk_mq_put_tag() calls?
That should be safe since the tag iteration functions check whether or
not hctx->tags->rqs[] is NULL.

There are already barriers in sbitmap_queue_clear() so I don't think
that any new barriers would need to be added.
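
Concretely, the swap in __blk_mq_free_request() would look roughly like
this (untested sketch):

	if (rq->tag != BLK_MQ_NO_TAG) {
		/* Clear the slot before the tag can be handed out again. */
		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
	}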

Thanks,

Bart.
Ming Lei April 22, 2021, 7:13 a.m. UTC | #5
On Wed, Apr 21, 2021 at 08:54:30PM -0700, Bart Van Assche wrote:
> On 4/21/21 8:15 PM, Ming Lei wrote:
> > On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
> >> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >> +{
> >> +	struct bt_tags_iter_data *iter_data = data;
> >> +	struct blk_mq_tags *tags = iter_data->tags;
> >> +	bool res;
> >> +
> >> +	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
> >> +		down_read(&tags->iter_rwsem);
> >> +		res = __bt_tags_iter(bitmap, bitnr, data);
> >> +		up_read(&tags->iter_rwsem);
> >> +	} else {
> >> +		rcu_read_lock();
> >> +		res = __bt_tags_iter(bitmap, bitnr, data);
> >> +		rcu_read_unlock();
> >> +	}
> >> +
> >> +	return res;
> >> +}
> > 
> > Holding one rwsem or rcu read lock won't avoid the issue completely
> > because request may be completed remotely in iter_data->fn(), such as
> > nbd_clear_req(), nvme_cancel_request(), complete_all_cmds_iter(),
> > mtip_no_dev_cleanup(), because blk_mq_complete_request() may complete
> > request in softirq, remote IPI, even wq, and the request is still
> > referenced in these contexts after bt_tags_iter() returns.
> 
> The rwsem and RCU read lock are used to serialize iterating over
> requests against blk_mq_sched_free_requests() calls. I don't think it
> matters for this patch from which context requests are freed.

Requests can still be referenced from other contexts after blk_mq_wait_for_tag_iter()
returns, and the request pool is freed right after that. So a use-after-free still
exists, doesn't it?

Thanks,
Ming
Ming Lei April 22, 2021, 7:23 a.m. UTC | #6
On Wed, Apr 21, 2021 at 09:01:50PM -0700, Bart Van Assche wrote:
> On 4/21/21 7:25 PM, Ming Lei wrote:
> > On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
> >> [...]
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index 79c01b1f885c..8b59f6b4ec8e 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -496,8 +496,10 @@ static void __blk_mq_free_request(struct request *rq)
> >>  	blk_crypto_free_request(rq);
> >>  	blk_pm_mark_last_busy(rq);
> >>  	rq->mq_hctx = NULL;
> >> -	if (rq->tag != BLK_MQ_NO_TAG)
> >> +	if (rq->tag != BLK_MQ_NO_TAG) {
> >>  		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
> >> +		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
> > 
> > After the tag is released, it can be re-allocated from another context
> > immediately. And the above rcu_assign_pointer(NULL) could be run after
> > the re-allocation and new ->rqs[rq->tag] assignment in submission path,
> > is it one issue?
> 
> How about swapping the rcu_assign_pointer() and blk_mq_put_tag() calls?
> That should be safe since the tag iteration functions check whether or
> not hctx->tags->rqs[] is NULL.
> 
> There are already barriers in sbitmap_queue_clear() so I don't think
> that any new barriers would need to be added.

That looks fine, since a barrier pair is also implied between allocating a tag
and assigning ->rqs[tag].

Thanks,
Ming
Bart Van Assche April 22, 2021, 3:51 p.m. UTC | #7
On 4/22/21 12:13 AM, Ming Lei wrote:
> On Wed, Apr 21, 2021 at 08:54:30PM -0700, Bart Van Assche wrote:
>> On 4/21/21 8:15 PM, Ming Lei wrote:
>>> On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
>>>> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>>> +{
>>>> +	struct bt_tags_iter_data *iter_data = data;
>>>> +	struct blk_mq_tags *tags = iter_data->tags;
>>>> +	bool res;
>>>> +
>>>> +	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
>>>> +		down_read(&tags->iter_rwsem);
>>>> +		res = __bt_tags_iter(bitmap, bitnr, data);
>>>> +		up_read(&tags->iter_rwsem);
>>>> +	} else {
>>>> +		rcu_read_lock();
>>>> +		res = __bt_tags_iter(bitmap, bitnr, data);
>>>> +		rcu_read_unlock();
>>>> +	}
>>>> +
>>>> +	return res;
>>>> +}
>>>
>>> Holding one rwsem or rcu read lock won't avoid the issue completely
>>> because request may be completed remotely in iter_data->fn(), such as
>>> nbd_clear_req(), nvme_cancel_request(), complete_all_cmds_iter(),
>>> mtip_no_dev_cleanup(), because blk_mq_complete_request() may complete
>>> request in softirq, remote IPI, even wq, and the request is still
>>> referenced in these contexts after bt_tags_iter() returns.
>>
>> The rwsem and RCU read lock are used to serialize iterating over
>> requests against blk_mq_sched_free_requests() calls. I don't think it
>> matters for this patch from which context requests are freed.
> 
> Requests still can be referred in other context after blk_mq_wait_for_tag_iter()
> returns, then follows freeing request pool. And use-after-free exists too, doesn't it?

The request pool should only be freed after it has been guaranteed that
all pending requests have finished and also that no new requests will be
started. This patch series adds two blk_mq_wait_for_tag_iter() calls.
Both calls happen while the queue is frozen so I don't think that the
issue mentioned in your email can happen.
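
Roughly, the intended ordering is as follows (a sketch; the freeze is
performed by the existing elevator switch and queue cleanup code, not by
this patch):

	blk_mq_freeze_queue(q);			/* no new requests can start */
	blk_mq_wait_for_tag_iter(q->tag_set);	/* preexisting iterations drain */
	blk_mq_sched_free_requests(q);		/* now safe to free the request pool */
	blk_mq_unfreeze_queue(q);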

Bart.
Ming Lei April 23, 2021, 3:52 a.m. UTC | #8
On Thu, Apr 22, 2021 at 08:51:06AM -0700, Bart Van Assche wrote:
> On 4/22/21 12:13 AM, Ming Lei wrote:
> > On Wed, Apr 21, 2021 at 08:54:30PM -0700, Bart Van Assche wrote:
> >> On 4/21/21 8:15 PM, Ming Lei wrote:
> >>> On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
> >>>> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> >>>> +{
> >>>> +	struct bt_tags_iter_data *iter_data = data;
> >>>> +	struct blk_mq_tags *tags = iter_data->tags;
> >>>> +	bool res;
> >>>> +
> >>>> +	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
> >>>> +		down_read(&tags->iter_rwsem);
> >>>> +		res = __bt_tags_iter(bitmap, bitnr, data);
> >>>> +		up_read(&tags->iter_rwsem);
> >>>> +	} else {
> >>>> +		rcu_read_lock();
> >>>> +		res = __bt_tags_iter(bitmap, bitnr, data);
> >>>> +		rcu_read_unlock();
> >>>> +	}
> >>>> +
> >>>> +	return res;
> >>>> +}
> >>>
> >>> Holding one rwsem or rcu read lock won't avoid the issue completely
> >>> because request may be completed remotely in iter_data->fn(), such as
> >>> nbd_clear_req(), nvme_cancel_request(), complete_all_cmds_iter(),
> >>> mtip_no_dev_cleanup(), because blk_mq_complete_request() may complete
> >>> request in softirq, remote IPI, even wq, and the request is still
> >>> referenced in these contexts after bt_tags_iter() returns.
> >>
> >> The rwsem and RCU read lock are used to serialize iterating over
> >> requests against blk_mq_sched_free_requests() calls. I don't think it
> >> matters for this patch from which context requests are freed.
> > 
> > Requests still can be referred in other context after blk_mq_wait_for_tag_iter()
> > returns, then follows freeing request pool. And use-after-free exists too, doesn't it?
> 
> The request pool should only be freed after it has been guaranteed that
> all pending requests have finished and also that no new requests will be
> started. This patch series adds two blk_mq_wait_for_tag_iter() calls.
> Both calls happen while the queue is frozen so I don't think that the
> issue mentioned in your email can happen.

For example, consider aacraid normal completion racing with an adapter reset
and an elevator switch. aacraid is a single-queue SCSI HBA, and its requests
are completed asynchronously via IPI or softirq, which means a request isn't
really completed yet when blk_mq_complete_request() returns.

1) An interrupt arrives and request A is completed via blk_mq_complete_request(),
called from aacraid's interrupt handler via ->scsi_done().

2) _aac_reset_adapter() runs because of a reset event, which can be triggered
by a sysfs store or anything else. The irq is drained in _aac_reset_adapter(),
so the blk_mq_complete_request(request A) call from aacraid irq context has
returned, but request A has only been scheduled for asynchronous completion
via IPI or softirq and is not really done yet.

3) scsi_host_complete_all_commands() is called from _aac_reset_adapter() to
fail all pending requests. Request A is still visible to
scsi_host_complete_all_commands() because its tag hasn't been freed yet, but
the tag and request A can be completed and freed right after
scsi_host_complete_all_commands() reads ->rqs[bitnr] in bt_tags_iter().
bt_tags_iter() calls complete_all_cmds_iter() -> .scsi_done() ->
blk_mq_complete_request(), so the same request A is scheduled again via IPI
or softirq and added to the IPI or softirq list.

4) Meanwhile request A is freed by the normal completion triggered by the
interrupt, and a pending elevator switch can move on since request A dropped
the last reference. bt_tags_iter() returns from the reset path, so
blk_mq_wait_for_tag_iter() can return too, and the whole scheduler request
pool is then freed.

5) Request A, which was scheduled onto the IPI/softirq list from
_aac_reset_adapter(), is read and a use-after-free is triggered.

The driver is supposed to cover normal completion vs. error handling, but with
respect to remote completion I am not sure the driver is capable of covering that.

Thanks,
Ming
Bart Van Assche April 23, 2021, 5:52 p.m. UTC | #9
On 4/22/21 8:52 PM, Ming Lei wrote:
> For example, scsi aacraid normal completion vs. reset together with elevator
> switch, aacraid is one single queue HBA, and the request will be completed
> via IPI or softirq asynchronously, that said request isn't really completed
> after blk_mq_complete_request() returns.
> 
> 1) interrupt comes, and request A is completed via blk_mq_complete_request()
> from aacraid's interrupt handler via ->scsi_done()
> 
> 2) _aac_reset_adapter() comes because of reset event which can be
> triggered by sysfs store or whatever, irq is drained in 
> _aac_reset_adpter(), so blk_mq_complete_request(request A) from aacraid irq
> context is done, but request A is just scheduled to be completed via IPI
> or softirq asynchronously, not really done yet.
> 
> 3) scsi_host_complete_all_commands() is called from _aac_reset_adapter() for
> failing all pending requests. request A is still visible in
> scsi_host_complete_all_commands, because its tag isn't freed yet. But the
> tag & request A can be completed & freed exactly after scsi_host_complete_all_commands()
> reads ->rqs[bitnr] in bt_tags_iter(), which calls complete_all_cmds_iter()
> -> .scsi_done() -> blk_mq_complete_request(), and same request A is scheduled via
> IPI or softirq, and request A is addded in ipi or softirq list.
> 
> 4) meantime request A is freed from normal completion triggered by interrupt, one
> pending elevator switch can move on since request A drops the last reference; and
> bt_tags_iter() returns from reset path, so blk_mq_wait_for_tag_iter() can return
> too, then the whole scheduler request pool is freed now.
> 
> 5) request A in ipi/softirq list scheduled from _aac_reset_adapter is read , UAF
> is triggered.
> 
> It is supposed that driver covers normal completion vs. error handling, but wrt.
> remove completion, not sure driver is capable of covering that.

Hi Ming,

I agree that the scenario above can happen and also that a fix is
needed. However, I do not agree that this should be fixed by modifying
the tag iteration functions. I see scsi_host_complete_all_commands() as
a workaround for broken storage controllers - storage controllers that
do not have a facility for terminating all pending commands. NVMe
controllers can be told to terminate all pending I/O commands by e.g.
deleting all I/O submission queues. Many SCSI controllers also provide a
facility for aborting all pending commands. For SCSI controllers that do
not provide such a facility I propose to fix races like the race
described above in the SCSI LLD instead of modifying the block layer tag
iteration functions.

Thanks,

Bart.
Ming Lei April 25, 2021, 12:09 a.m. UTC | #10
On Fri, Apr 23, 2021 at 10:52:43AM -0700, Bart Van Assche wrote:
> On 4/22/21 8:52 PM, Ming Lei wrote:
> > For example, scsi aacraid normal completion vs. reset together with elevator
> > switch, aacraid is one single queue HBA, and the request will be completed
> > via IPI or softirq asynchronously, that said request isn't really completed
> > after blk_mq_complete_request() returns.
> > 
> > 1) interrupt comes, and request A is completed via blk_mq_complete_request()
> > from aacraid's interrupt handler via ->scsi_done()
> > 
> > 2) _aac_reset_adapter() comes because of reset event which can be
> > triggered by sysfs store or whatever, irq is drained in 
> > _aac_reset_adpter(), so blk_mq_complete_request(request A) from aacraid irq
> > context is done, but request A is just scheduled to be completed via IPI
> > or softirq asynchronously, not really done yet.
> > 
> > 3) scsi_host_complete_all_commands() is called from _aac_reset_adapter() for
> > failing all pending requests. request A is still visible in
> > scsi_host_complete_all_commands, because its tag isn't freed yet. But the
> > tag & request A can be completed & freed exactly after scsi_host_complete_all_commands()
> > reads ->rqs[bitnr] in bt_tags_iter(), which calls complete_all_cmds_iter()
> > -> .scsi_done() -> blk_mq_complete_request(), and same request A is scheduled via
> > IPI or softirq, and request A is addded in ipi or softirq list.
> > 
> > 4) meantime request A is freed from normal completion triggered by interrupt, one
> > pending elevator switch can move on since request A drops the last reference; and
> > bt_tags_iter() returns from reset path, so blk_mq_wait_for_tag_iter() can return
> > too, then the whole scheduler request pool is freed now.
> > 
> > 5) request A in ipi/softirq list scheduled from _aac_reset_adapter is read , UAF
> > is triggered.
> > 
> > It is supposed that driver covers normal completion vs. error handling, but wrt.
> > remove completion, not sure driver is capable of covering that.
> 
> Hi Ming,
> 
> I agree that the scenario above can happen and also that a fix is
> needed. However, I do not agree that this should be fixed by modifying
> the tag iteration functions. I see scsi_host_complete_all_commands() as
> a workaround for broken storage controllers - storage controllers that
> do not have a facility for terminating all pending commands. NVMe
> controllers can be told to terminate all pending I/O commands by e.g.
> deleting all I/O submission queues. Many SCSI controllers also provide a
> facility for aborting all pending commands. For SCSI controllers that do
> not provide such a facility I propose to fix races like the race
> described above in the SCSI LLD instead of modifying the block layer tag
> iteration functions.

Hi Bart,

Terminating all pending commands cannot avoid the request UAF issue; that is why
blk_mq_tagset_wait_completed_request() has been used so far to make sure
that all pending requests have really been aborted.
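
I.e. the usual driver-side pattern is (a sketch; abort_one_rq() is just an
illustrative callback that completes the request):

	/* in the reset/error handler, after telling the hardware to abort: */
	blk_mq_tagset_busy_iter(set, abort_one_rq, NULL);
	blk_mq_tagset_wait_completed_request(set);	/* wait until those completions have really run */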

However, blk_mq_wait_for_tag_iter() may still return while the request is still
being referenced, because blk_mq_wait_for_tag_iter() assumes that all request
references are confined to bt_tags_iter(); in particular, .iter_rwsem and the
RCU read lock are only held inside bt_tags_iter().

What I really meant is that the added .iter_rwsem / RCU read lock and
blk_mq_wait_for_tag_iter() cannot do what is expected here.


Thanks,
Ming
Bart Van Assche April 25, 2021, 9:01 p.m. UTC | #11
On 4/24/21 5:09 PM, Ming Lei wrote:
> However, blk_mq_wait_for_tag_iter() still may return before
> blk_mq_wait_for_tag_iter() is done because blk_mq_wait_for_tag_iter()
> supposes all request reference is just done inside bt_tags_iter(),
> especially .iter_rwsem and read rcu lock is added in bt_tags_iter().

The comment above blk_mq_wait_for_tag_iter() needs to be updated but I
believe that the code is fine. Waiting for bt_tags_iter() to finish
should be sufficient to fix the UAF. What matters is that the pointer
read by rcu_dereference(tags->rqs[bitnr]) remains valid until the
callback function has finished. I think that is guaranteed by the
current implementation.
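
Condensed, the pairing that the patch below relies on is:

	/* read side: bt_tags_iter() */
	rcu_read_lock();	/* or down_read(&tags->iter_rwsem) if ->fn() may sleep */
	rq = rcu_dereference_check(tags->rqs[bitnr], lockdep_is_held(&tags->iter_rwsem));
	if (rq)
		ret = iter_data->fn(rq, iter_data->data, reserved); /* rq stays valid across the call */
	rcu_read_unlock();	/* or up_read(&tags->iter_rwsem) */

	/* write side: blk_mq_wait_for_tag_iter(), called before freeing scheduler requests */
	down_write(&tags->iter_rwsem);	/* waits for sleeping iterators */
	up_write(&tags->iter_rwsem);
	synchronize_rcu();		/* waits for the non-sleeping ones */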

Bart.
Ming Lei April 26, 2021, 12:55 a.m. UTC | #12
On Sun, Apr 25, 2021 at 02:01:11PM -0700, Bart Van Assche wrote:
> On 4/24/21 5:09 PM, Ming Lei wrote:
> > However, blk_mq_wait_for_tag_iter() still may return before
> > blk_mq_wait_for_tag_iter() is done because blk_mq_wait_for_tag_iter()
> > supposes all request reference is just done inside bt_tags_iter(),
> > especially .iter_rwsem and read rcu lock is added in bt_tags_iter().
> 
> The comment above blk_mq_wait_for_tag_iter() needs to be updated but I
> believe that the code is fine. Waiting for bt_tags_iter() to finish
> should be sufficient to fix the UAF. What matters is that the pointer
> read by rcu_dereference(tags->rqs[bitnr]) remains valid until the
> callback function has finished. I think that is guaranteed by the
> current implementation.

It depends on whether 'rq' is passed to another context from ->fn(), since 'rq'
can still be used in that new context after ->fn() returns.


Thanks,
Ming
Bart Van Assche April 26, 2021, 4:29 p.m. UTC | #13
On 4/24/21 5:09 PM, Ming Lei wrote:
> Terminating all pending commands can't avoid the issue wrt. request UAF,
> so far blk_mq_tagset_wait_completed_request() is used for making sure
> that all pending requests are really aborted.
> 
> However, blk_mq_wait_for_tag_iter() still may return before
> blk_mq_wait_for_tag_iter() is done because blk_mq_wait_for_tag_iter()
> supposes all request reference is just done inside bt_tags_iter(),
> especially .iter_rwsem and read rcu lock is added in bt_tags_iter().

Hi Ming,

I think that we agree that completing a request from inside a tag
iteration callback function may cause the request completion to happen
after tag iteration has finished. This can happen because
blk_mq_complete_request() may redirect completion processing to another
CPU via an IPI.
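
For reference, blk_mq_complete_request() is essentially the following
(paraphrased from block/blk-mq.c):

void blk_mq_complete_request(struct request *rq)
{
	/* May queue rq on another CPU's list and raise an IPI/softirq ... */
	if (!blk_mq_complete_request_remote(rq))
		/* ... otherwise complete locally, synchronously. */
		rq->q->mq_ops->complete(rq);
}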

But can this mechanism trigger a use-after-free by itself? If request
completion is redirected to another CPU, the request is still considered
pending and request queue freezing won't complete. Request queue
freezing will only succeed after __blk_mq_free_request() has been called
because it is __blk_mq_free_request() that calls blk_queue_exit().
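
The lifetime chain is roughly the following (simplified from the existing code,
not part of this patch):

	/* submission: blk_queue_enter() takes a reference per request */
	percpu_ref_tryget_live(&q->q_usage_counter);

	/* __blk_mq_free_request() -> blk_queue_exit() drops it only when the request is freed */
	percpu_ref_put(&q->q_usage_counter);

	/* blk_mq_freeze_queue() cannot finish while any such reference is held */
	percpu_ref_kill(&q->q_usage_counter);
	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));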

In other words, do we really need the new
blk_mq_complete_request_locally() function?

Did I perhaps miss something?

Thanks,

Bart.
Ming Lei April 27, 2021, 12:11 a.m. UTC | #14
On Mon, Apr 26, 2021 at 09:29:54AM -0700, Bart Van Assche wrote:
> On 4/24/21 5:09 PM, Ming Lei wrote:
> > Terminating all pending commands can't avoid the issue wrt. request UAF,
> > so far blk_mq_tagset_wait_completed_request() is used for making sure
> > that all pending requests are really aborted.
> > 
> > However, blk_mq_wait_for_tag_iter() still may return before
> > blk_mq_wait_for_tag_iter() is done because blk_mq_wait_for_tag_iter()
> > supposes all request reference is just done inside bt_tags_iter(),
> > especially .iter_rwsem and read rcu lock is added in bt_tags_iter().
> 
> Hi Ming,
> 
> I think that we agree that completing a request from inside a tag
> iteration callback function may cause the request completion to happen
> after tag iteration has finished. This can happen because
> blk_mq_complete_request() may redirect completion processing to another
> CPU via an IPI.
> 
> But can this mechanism trigger a use-after-free by itself? If request
> completion is redirected to another CPU, the request is still considered
> pending and request queue freezing won't complete. Request queue
> freezing will only succeed after __blk_mq_free_request() has been called
> because it is __blk_mq_free_request() that calls blk_queue_exit().
> 
> In other words, do we really need the new
> blk_mq_complete_request_locally() function?
> 
> Did I perhaps miss something?

Please see the example in the following link:

https://lore.kernel.org/linux-block/20210421000235.2028-4-bvanassche@acm.org/T/#m4d7bc9aa01108f03d5b4b7ee102eb26eb0c778aa

In short:

1) One async completion from interrupt context is pending, and the request
isn't really freed yet because its driver tag hasn't been released.

2) Meanwhile the iterating code can still visit this request, and ->fn() schedules
a new remote completion and returns.

3) The async completion scheduled in 1) is really done now,
but the async completion scheduled in 2) hasn't started or finished yet.

4) The queue becomes frozen because of a pending elevator switch, so the
sched request pool is freed since there is no pending iterator ->fn() anymore.

5) A request UAF occurs when the async completion scheduled in 2) finally
runs.


Thanks,
Ming
diff mbox series

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 9bcdae93f6d4..ca7f833e25a8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -279,6 +279,36 @@  void blk_dump_rq_flags(struct request *rq, char *msg)
 }
 EXPORT_SYMBOL(blk_dump_rq_flags);
 
+/**
+ * blk_mq_wait_for_tag_iter - wait for preexisting tag iteration functions to finish
+ * @set: Tag set to wait on.
+ *
+ * Waits for preexisting calls of blk_mq_all_tag_iter(),
+ * blk_mq_tagset_busy_iter() and also for their atomic variants to finish
+ * accessing hctx->tags->rqs[]. New readers may start while this function is
+ * in progress or after this function has returned. Use this function to make
+ * sure that hctx->tags->rqs[] changes have become globally visible.
+ *
+ * Waits for preexisting blk_mq_queue_tag_busy_iter(q, fn, priv) calls to
+ * finish accessing requests associated with other request queues than 'q'.
+ */
+void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set)
+{
+	struct blk_mq_tags *tags;
+	int i;
+
+	if (set->tags) {
+		for (i = 0; i < set->nr_hw_queues; i++) {
+			tags = set->tags[i];
+			if (!tags)
+				continue;
+			down_write(&tags->iter_rwsem);
+			up_write(&tags->iter_rwsem);
+		}
+	}
+	synchronize_rcu();
+}
+
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
  * @q: the queue
@@ -412,8 +442,10 @@  void blk_cleanup_queue(struct request_queue *q)
 	 * it is safe to free requests now.
 	 */
 	mutex_lock(&q->sysfs_lock);
-	if (q->elevator)
+	if (q->elevator) {
+		blk_mq_wait_for_tag_iter(q->tag_set);
 		blk_mq_sched_free_requests(q);
+	}
 	mutex_unlock(&q->sysfs_lock);
 
 	percpu_ref_exit(&q->q_usage_counter);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d8eaa38a1bd1..39d5c9190a6b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -209,14 +209,24 @@  static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 
 	if (!reserved)
 		bitnr += tags->nr_reserved_tags;
-	rq = tags->rqs[bitnr];
-
+	rcu_read_lock();
+	/*
+	 * The request 'rq' points at is protected by an RCU read lock until
+	 * its queue pointer has been verified and by q_usage_count while the
+	 * callback function is being invoked. See also the
+	 * percpu_ref_tryget() and blk_queue_exit() calls in
+	 * blk_mq_queue_tag_busy_iter().
+	 */
+	rq = rcu_dereference(tags->rqs[bitnr]);
 	/*
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
+	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx) {
+		rcu_read_unlock();
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
+	}
+	rcu_read_unlock();
 	return true;
 }
 
@@ -254,11 +264,17 @@  struct bt_tags_iter_data {
 	unsigned int flags;
 };
 
+/* Include reserved tags. */
 #define BT_TAG_ITER_RESERVED		(1 << 0)
+/* Only include started requests. */
 #define BT_TAG_ITER_STARTED		(1 << 1)
+/* Iterate over tags->static_rqs[] instead of tags->rqs[]. */
 #define BT_TAG_ITER_STATIC_RQS		(1 << 2)
+/* The callback function may sleep. */
+#define BT_TAG_ITER_MAY_SLEEP		(1 << 3)
 
-static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
+static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
+			   void *data)
 {
 	struct bt_tags_iter_data *iter_data = data;
 	struct blk_mq_tags *tags = iter_data->tags;
@@ -275,7 +291,8 @@  static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
 		rq = tags->static_rqs[bitnr];
 	else
-		rq = tags->rqs[bitnr];
+		rq = rcu_dereference_check(tags->rqs[bitnr],
+					   lockdep_is_held(&tags->iter_rwsem));
 	if (!rq)
 		return true;
 	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
@@ -284,6 +301,25 @@  static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	return iter_data->fn(rq, iter_data->data, reserved);
 }
 
+static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
+{
+	struct bt_tags_iter_data *iter_data = data;
+	struct blk_mq_tags *tags = iter_data->tags;
+	bool res;
+
+	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
+		down_read(&tags->iter_rwsem);
+		res = __bt_tags_iter(bitmap, bitnr, data);
+		up_read(&tags->iter_rwsem);
+	} else {
+		rcu_read_lock();
+		res = __bt_tags_iter(bitmap, bitnr, data);
+		rcu_read_unlock();
+	}
+
+	return res;
+}
+
 /**
  * bt_tags_for_each - iterate over the requests in a tag map
  * @tags:	Tag map to iterate over.
@@ -357,10 +393,12 @@  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 {
 	int i;
 
+	might_sleep();
+
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
-					      BT_TAG_ITER_STARTED);
+				BT_TAG_ITER_STARTED | BT_TAG_ITER_MAY_SLEEP);
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
@@ -544,6 +582,7 @@  struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
+	init_rwsem(&tags->iter_rwsem);
 
 	if (blk_mq_is_sbitmap_shared(flags))
 		return tags;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 0290c308ece9..d1d73d7cc7df 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -17,9 +17,11 @@  struct blk_mq_tags {
 	struct sbitmap_queue __bitmap_tags;
 	struct sbitmap_queue __breserved_tags;
 
-	struct request **rqs;
+	struct request __rcu **rqs;
 	struct request **static_rqs;
 	struct list_head page_list;
+
+	struct rw_semaphore iter_rwsem;
 };
 
 extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 79c01b1f885c..8b59f6b4ec8e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -496,8 +496,10 @@  static void __blk_mq_free_request(struct request *rq)
 	blk_crypto_free_request(rq);
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
-	if (rq->tag != BLK_MQ_NO_TAG)
+	if (rq->tag != BLK_MQ_NO_TAG) {
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
+		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
+	}
 	if (sched_tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
@@ -839,9 +841,20 @@  EXPORT_SYMBOL(blk_mq_delay_kick_requeue_list);
 
 struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
 {
+	struct request *rq;
+
 	if (tag < tags->nr_tags) {
-		prefetch(tags->rqs[tag]);
-		return tags->rqs[tag];
+		/*
+		 * Freeing tags happens with the request queue frozen so the
+		 * rcu dereference below is protected by the request queue
+		 * usage count. We can only verify that usage count after
+		 * having read the request pointer.
+		 */
+		rq = rcu_dereference_check(tags->rqs[tag], true);
+		WARN_ON_ONCE(IS_ENABLED(CONFIG_PROVE_RCU) && rq &&
+			     percpu_ref_is_zero(&rq->q->q_usage_counter));
+		prefetch(rq);
+		return rq;
 	}
 
 	return NULL;
@@ -1112,7 +1125,7 @@  static bool blk_mq_get_driver_tag(struct request *rq)
 		rq->rq_flags |= RQF_MQ_INFLIGHT;
 		__blk_mq_inc_active_requests(hctx);
 	}
-	hctx->tags->rqs[rq->tag] = rq;
+	rcu_assign_pointer(hctx->tags->rqs[rq->tag], rq);
 	return true;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 3616453ca28c..9ccb1818303b 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -226,6 +226,7 @@  static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
 	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
+	rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
 	rq->tag = BLK_MQ_NO_TAG;
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
diff --git a/block/blk.h b/block/blk.h
index 529233957207..d88b0823738c 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -185,6 +185,8 @@  bool blk_bio_list_merge(struct request_queue *q, struct list_head *list,
 void blk_account_io_start(struct request *req);
 void blk_account_io_done(struct request *req, u64 now);
 
+void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set);
+
 /*
  * Internal elevator interface
  */
diff --git a/block/elevator.c b/block/elevator.c
index 7c486ce858e0..aae9cff6d5ae 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -201,6 +201,7 @@  static void elevator_exit(struct request_queue *q, struct elevator_queue *e)
 {
 	lockdep_assert_held(&q->sysfs_lock);
 
+	blk_mq_wait_for_tag_iter(q->tag_set);
 	blk_mq_sched_free_requests(q);
 	__elevator_exit(q, e);
 }