
[RFC,v3,3/3] blk-mq: Lockout tagset iterator when exiting elevator

Message ID 1614957294-188540-4-git-send-email-john.garry@huawei.com (mailing list archive)
State New, archived
Series blk-mq: Avoid use-after-free for accessing old requests

Commit Message

John Garry March 5, 2021, 3:14 p.m. UTC
All queues associated with a tagset are frozen when one queue is exiting
an elevator. This ensures that a queue running
blk_mq_queue_tag_busy_iter() cannot hold a stale request reference for
the queue that is exiting the elevator.

However, there is nothing to stop blk_mq_all_tag_iter() from being run for
the tagset and, again, getting hold of a stale request reference. A KASAN
use-after-free (UAF) can be triggered in this scenario:

BUG: KASAN: use-after-free in bt_tags_iter+0xe0/0x128 
Read of size 4 at addr ffff001085330fcc by task more/3038 
 
 CPU: 1 PID: 3038 Comm: more Not tainted 5.12.0-rc1-11926-g7359e4a1604d-dirty #750 
 Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018
Call trace: 
dump_backtrace+0x0/0x2d0 
show_stack+0x18/0x68 
dump_stack+0x100/0x16c 
print_address_description.constprop.13+0x68/0x30c
kasan_report+0x1d8/0x240 
__asan_load4+0x9c/0xd8 
bt_tags_iter+0xe0/0x128
__blk_mq_all_tag_iter+0x320/0x3a8
blk_mq_tagset_busy_iter+0x84/0xb8
scsi_host_busy+0x88/0xb8 
show_host_busy+0x1c/0x48 
dev_attr_show+0x44/0x90
sysfs_kf_seq_show+0x128/0x1c8
kernfs_seq_show+0xa0/0xb8
seq_read_iter+0x210/0x660
kernfs_fop_read_iter+0x208/0x2b0 
new_sync_read+0x1ec/0x2d0
vfs_read+0x188/0x248 
ksys_read+0xc8/0x178 
__arm64_sys_read+0x44/0x58 
el0_svc_common.constprop.1+0xc4/0x190
do_el0_svc+0x90/0xa0 
el0_svc+0x24/0x38
el0_sync_handler+0x90/0xb8 
el0_sync+0x154/0x180 
 
To avoid this, reject the tagset iterators when the queue is exiting
the elevator.

This should not break any semantics in blk_mq_all_tag_iter(): since
all queues are frozen, there should be no active tags to iterate.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-tag.c     | 5 +++++
 block/blk-mq.c         | 1 +
 block/blk.h            | 4 ++++
 include/linux/blk-mq.h | 1 +
 4 files changed, 11 insertions(+)

Comments

Bart Van Assche March 6, 2021, 4:43 a.m. UTC | #1
On 3/5/21 7:14 AM, John Garry wrote:
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 7ff1b20d58e7..5950fee490e8 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>  {
>  	int i;
>  
> +	if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
> +		return;
> +
>  	for (i = 0; i < tagset->nr_hw_queues; i++) {
>  		if (tagset->tags && tagset->tags[i])
>  			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>  					      BT_TAG_ITER_STARTED);
>  	}
> +
> +	atomic_dec(&tagset->iter_usage_counter);
>  }
>  EXPORT_SYMBOL(blk_mq_tagset_busy_iter);

This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
call and if that causes all mtip_abort_cmd() calls to be skipped?

> +	while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1);

Isn't it recommended to call cpu_relax() inside busy-waiting loops?
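For illustration, the drain loop from the patch with the suggested relax hint could look like the following portable C11 sketch (a userspace analogue only: `cpu_relax()` is stubbed out here, since the real one is arch-specific kernel code, and the kernel's atomic_t is modelled with C11 atomics):

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for the kernel's cpu_relax(); on x86 it lowers to a PAUSE
 * hint that eases memory-bus pressure while spinning. */
static inline void cpu_relax(void)
{
    /* e.g. __builtin_ia32_pause() on x86; no-op in this sketch */
}

/* Spin until the counter drops from its base value 1 to 0 (closing the
 * gate), with cpu_relax() in the loop body as suggested. */
static void drain_iter_counter(atomic_int *ctr)
{
    int expected = 1;
    while (!atomic_compare_exchange_strong(ctr, &expected, 0)) {
        expected = 1;   /* a failed CAS overwrote it with the seen value */
        cpu_relax();
    }
}
```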

>  	blk_mq_sched_free_requests(q);
>  	__elevator_exit(q, e);
>  
> +	atomic_set(&set->iter_usage_counter, 1);

Can it happen that the above atomic_set() call happens while a
blk_mq_tagset_busy_iter() call is in progress? Should that atomic_set()
call perhaps be changed into an atomic_inc() call?

Thanks,

Bart.
John Garry March 8, 2021, 11:17 a.m. UTC | #2
On 06/03/2021 04:43, Bart Van Assche wrote:
> On 3/5/21 7:14 AM, John Garry wrote:
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 7ff1b20d58e7..5950fee490e8 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>   {
>>   	int i;
>>   
>> +	if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
>> +		return;
>> +
>>   	for (i = 0; i < tagset->nr_hw_queues; i++) {
>>   		if (tagset->tags && tagset->tags[i])
>>   			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>   					      BT_TAG_ITER_STARTED);
>>   	}
>> +
>> +	atomic_dec(&tagset->iter_usage_counter);
>>   }
>>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);

Hi Bart,

> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
> call and if that causes all mtip_abort_cmd() calls to be skipped?

I'm not sure that I understand the problem you describe. If 
blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called, one of 
two things can happen:
a. Normal operation: iter_usage_counter initially holds >= 1, it is 
incremented in blk_mq_tagset_busy_iter(), and we iterate the busy tags. 
Any parallel call to blk_mq_tagset_busy_iter() also increments 
iter_usage_counter.
b. We're switching IO scheduler. In this scenario, we first quiesce all 
queues. After that, there should be no active requests. At that point, 
we ensure any calls to blk_mq_tagset_busy_iter() have finished and block 
(or discard may be a better term) any further calls. Blocking further 
calls should be safe as there are no requests to iterate. atomic_cmpxchg() 
is used to set iter_usage_counter to 0, blocking any further calls.
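The two scenarios above can be sketched in portable C11 (a userspace analogue only — the kernel's atomic_inc_not_zero()/atomic_dec()/atomic_cmpxchg() helpers are modelled with C11 atomics, and the function names are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Reader side: mirrors the atomic_inc_not_zero() guard added to
 * blk_mq_tagset_busy_iter(). Returns false when the gate is closed
 * (counter held at 0). */
static bool iter_enter(atomic_int *ctr)
{
    int old = atomic_load(ctr);
    while (old != 0) {
        if (atomic_compare_exchange_weak(ctr, &old, old + 1))
            return true;
        /* a failed CAS reloaded 'old'; retry unless it hit 0 */
    }
    return false;
}

static void iter_exit(atomic_int *ctr)
{
    atomic_fetch_sub(ctr, 1);   /* atomic_dec() */
}

/* Elevator-exit side: wait until only the base reference (1) remains,
 * then swap it to 0 to close the gate. */
static void gate_close(atomic_int *ctr)
{
    int expected;
    do {
        expected = 1;
    } while (!atomic_compare_exchange_strong(ctr, &expected, 0));
}

static void gate_open(atomic_int *ctr)
{
    atomic_store(ctr, 1);       /* atomic_set(..., 1) */
}
```

Scenario (a) is iter_enter()/iter_exit() succeeding while the counter is >= 1; scenario (b) is gate_close() draining the counter and causing subsequent iter_enter() calls to return false until gate_open() restores the base reference.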

> 
>> +	while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1);
> Isn't it recommended to call cpu_relax() inside busy-waiting loops?

Maybe, but I am considering changing this patch to use percpu_refcnt() - 
I need to check it further.
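percpu_ref is kernel-only, but its lifecycle (tryget/put on the fast path, kill and drain, reinit) can be modelled crudely in userspace with a single atomic counter. This is only a rough analogue with hypothetical names: a real percpu_ref keeps per-CPU counters until killed, which is what would make the reader fast path cheap here, and its tryget fails immediately once killed, whereas this sketch only fails tryget after the count reaches zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Crude single-counter stand-in for the kernel's percpu_ref. */
struct mini_ref {
    atomic_long count;   /* > 0: live (base ref of 1 plus users); 0: dead */
};

static void mini_ref_init(struct mini_ref *r)
{
    atomic_init(&r->count, 1);           /* base reference */
}

static bool mini_ref_tryget(struct mini_ref *r)  /* ~percpu_ref_tryget() */
{
    long old = atomic_load(&r->count);
    while (old > 0) {
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;
}

static void mini_ref_put(struct mini_ref *r)
{
    atomic_fetch_sub(&r->count, 1);
}

/* ~percpu_ref_kill() plus a wait: drop the base ref, then spin until
 * all in-flight users have put their references. */
static void mini_ref_kill_and_drain(struct mini_ref *r)
{
    atomic_fetch_sub(&r->count, 1);
    while (atomic_load(&r->count) > 0)
        ;   /* cpu_relax() / wait_event() in real kernel code */
}

static void mini_ref_reinit(struct mini_ref *r)  /* ~percpu_ref_reinit() */
{
    atomic_store(&r->count, 1);
}
```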

> 
>>   	blk_mq_sched_free_requests(q);
>>   	__elevator_exit(q, e);
>>   
>> +	atomic_set(&set->iter_usage_counter, 1);
> Can it happen that the above atomic_set() call happens while a
> blk_mq_tagset_busy_iter() call is in progress?

No: at this point iter_usage_counter is guaranteed to hold 0 from the 
atomic_cmpxchg(), so there should be no active processes in the 
blk_mq_tagset_busy_iter() critical section. Calls to 
blk_mq_tagset_busy_iter() are blocked while iter_usage_counter holds 0.

> Should that atomic_set()
> call perhaps be changed into an atomic_inc() call?

They have the same effect in practice, but we use atomic_set() in 
blk_mq_alloc_tag_set(), so this is at least consistent.

Thanks,
John
Bart Van Assche March 8, 2021, 7:59 p.m. UTC | #3
On 3/8/21 3:17 AM, John Garry wrote:
> On 06/03/2021 04:43, Bart Van Assche wrote:
>> On 3/5/21 7:14 AM, John Garry wrote:
>>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>>> index 7ff1b20d58e7..5950fee490e8 100644
>>> --- a/block/blk-mq-tag.c
>>> +++ b/block/blk-mq-tag.c
>>> @@ -358,11 +358,16 @@ void blk_mq_tagset_busy_iter(struct 
>>> blk_mq_tag_set *tagset,
>>>   {
>>>       int i;
>>> +    if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
>>> +        return;
>>> +
>>>       for (i = 0; i < tagset->nr_hw_queues; i++) {
>>>           if (tagset->tags && tagset->tags[i])
>>>               __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>>>                             BT_TAG_ITER_STARTED);
>>>       }
>>> +
>>> +    atomic_dec(&tagset->iter_usage_counter);
>>>   }
>>>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
> 
> Hi Bart,
> 
>> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
>> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
>> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
>> call and if that causes all mtip_abort_cmd() calls to be skipped?
> 
> I'm not sure that I understand this problem you describe. So if 
> blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called, either 
> can happen:
> a. normal operation, iter_usage_counter initially holds >= 1, and then 
> iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we 
> iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter() will 
> also increase iter_usage_counter.
> b. we're switching IO scheduler. In this scenario, first we quiesce all 
> queues. After that, there should be no active requests. At that point, 
> we ensure any calls to blk_mq_tagset_busy_iter() are finished and block 
> (or discard may be a better term) any more calls. Blocking any more 
> calls should be safe as there are no requests to iter. atomic_cmpxchg() 
> is used to set iter_usage_counter to 0, blocking any more calls.


Hi John,

My concern is about the insertion of the early return statement in 
blk_mq_tagset_busy_iter(). Although most blk_mq_tagset_busy_iter() 
callers can handle skipping certain blk_mq_tagset_busy_iter() calls 
(e.g. when gathering statistics), I'm not sure this is safe for all 
blk_mq_tagset_busy_iter() callers. The example I cited is an example of 
a blk_mq_tagset_busy_iter() call with side effects.

The mtip driver allocates one tag set per request queue so quiescing 
queues should be sufficient to address my concern for the mtip driver.

The NVMe core and SCSI core however share a single tag set across 
multiple namespaces / LUNs. In the error path of nvme_rdma_setup_ctrl() 
I found a call to nvme_cancel_tagset(). nvme_cancel_tagset() calls 
blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl). I'm 
not sure it is safe to skip the nvme_cancel_request() calls if the I/O 
scheduler for another NVMe namespace is being modified.

Thanks,

Bart.
John Garry March 9, 2021, 5:47 p.m. UTC | #4
On 08/03/2021 19:59, Bart Van Assche wrote:
>>> This changes the behavior of blk_mq_tagset_busy_iter(). What will e.g.
>>> happen if the mtip driver calls blk_mq_tagset_busy_iter(&dd->tags,
>>> mtip_abort_cmd, dd) concurrently with another blk_mq_tagset_busy_iter()
>>> call and if that causes all mtip_abort_cmd() calls to be skipped?
>>
>> I'm not sure that I understand this problem you describe. So if 
>> blk_mq_tagset_busy_iter(&dd->tags, mtip_abort_cmd, dd) is called, 
>> either can happen:
>> a. normal operation, iter_usage_counter initially holds >= 1, and then 
>> iter_usage_counter is incremented in blk_mq_tagset_busy_iter() and we 
>> iter the busy tags. Any parallel call to blk_mq_tagset_busy_iter() 
>> will also increase iter_usage_counter.
>> b. we're switching IO scheduler. In this scenario, first we quiesce 
>> all queues. After that, there should be no active requests. At that 
>> point, we ensure any calls to blk_mq_tagset_busy_iter() are finished 
>> and block (or discard may be a better term) any more calls. Blocking 
>> any more calls should be safe as there are no requests to iter. 
>> atomic_cmpxchg() is used to set iter_usage_counter to 0, blocking any 
>> more calls.
> 


Hi Bart,

> My concern is about the insertion of the early return statement in 
> blk_mq_tagset_busy_iter(). 

So I take this approach as I don't see any way to use a mutual exclusion 
waiting mechanism to block calls to blk_mq_tagset_busy_iter() while the 
IO scheduler is being switched.

The reason is that blk_mq_tagset_busy_iter() can be called from any 
context, including hardirq.

> Although most blk_mq_tagset_busy_iter() 
> callers can handle skipping certain blk_mq_tagset_busy_iter() calls 
> (e.g. when gathering statistics), I'm not sure this is safe for all 
> blk_mq_tagset_busy_iter() callers. The example I cited is an example of 
> a blk_mq_tagset_busy_iter() call with side effects.

I don't like to think of it as skipping, which may imply that there 
are active requests to iterate and we're just ignoring them.

It's more like: we know that there are no active requests, so don't 
bother trying to iterate.

> 
> The mtip driver allocates one tag set per request queue so quiescing 
> queues should be sufficient to address my concern for the mtip driver.
> 
> The NVMe core and SCSI core however share a single tag set across 
> multiple namespaces / LUNs. In the error path of nvme_rdma_setup_ctrl()
> I found a call to nvme_cancel_tagset(). nvme_cancel_tagset() calls 
> blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl). I'm 
> not sure it is safe to skip the nvme_cancel_request() calls if the I/O 
> scheduler for another NVMe namespace is being modified.

Again, I am relying on all request_queues associated with that 
tagset being quiesced when switching IO scheduler at the point 
blk_mq_tagset_busy_iter() is called and returns early.

Now if there were active requests, I am relying on the request queue 
quiescing to flush them. So blk_mq_tagset_busy_iter() could be called 
during that quiescing period, and would continue to iter the requests.

This does fall over if some tags are allocated without associated 
request queue, which I do not know exists.

Thanks,
John
Bart Van Assche March 9, 2021, 7:21 p.m. UTC | #5
On 3/9/21 9:47 AM, John Garry wrote:
> This does fall over if some tags are allocated without associated
> request queue, which I do not know exists.

The only tag allocation mechanism I know of is blk_mq_get_tag(). The
only blk_mq_get_tag() callers I know of are __blk_mq_alloc_request() and
blk_mq_alloc_request_hctx(). So I think all allocated tags are
associated with a request queue.

Regarding this patch series, I have shared the feedback I wanted to
share so I would appreciate it if someone else could also take a look.

Thanks,

Bart.
John Garry March 10, 2021, 8:52 a.m. UTC | #6
On 09/03/2021 19:21, Bart Van Assche wrote:
> On 3/9/21 9:47 AM, John Garry wrote:
>> This does fall over if some tags are allocated without associated
>> request queue, which I do not know exists.
> 

Hi Bart,

> The only tag allocation mechanism I know of is blk_mq_get_tag(). The
> only blk_mq_get_tag() callers I know of are __blk_mq_alloc_request() and
> blk_mq_alloc_request_hctx(). So I think all allocated tags are
> associated with a request queue.
> 

ok, good.

> Regarding this patch series, I have shared the feedback I wanted to
> share so I would appreciate it if someone else could also take a look.
> 

So I can incorporate any changes and suggestions so far and send a 
non-RFC version - that may get more attention if none extra comes.

As mentioned on the cover letter, if patch 2+3/3 are accepted, then 
patch 1/3 could be simplified. But I plan to leave as is.

BTW, any issue with putting your suggested-by on patch 2/3?

Thanks,
John
Bart Van Assche March 10, 2021, 4 p.m. UTC | #7
On 3/10/21 12:52 AM, John Garry wrote:
> On 09/03/2021 19:21, Bart Van Assche wrote:
>> Regarding this patch series, I have shared the feedback I wanted to
>> share so I would appreciate it if someone else could also take a look.
> 
> So I can incorporate any changes and suggestions so far and send a 
> non-RFC version - that may get more attention if none extra comes.
> 
> As mentioned on the cover letter, if patch 2+3/3 are accepted, then 
> patch 1/3 could be simplified. But I plan to leave as is.
> 
> BTW, any issue with putting your suggested-by on patch 2/3?

Hi John,

I have added my Reviewed-by to patch 2/3.

Regarding the other two patches in this series: I do not agree with 
patch 3/3. As I have explained, I am concerned that that patch breaks 
existing block drivers.

Are patches 1/3 and 3/3 necessary? Or in other words, is patch 2/3 
sufficient to fix the use-after-free?

Thanks,

Bart.
John Garry March 10, 2021, 5:26 p.m. UTC | #8
On 10/03/2021 16:00, Bart Van Assche wrote:
>> So I can incorporate any changes and suggestions so far and send a 
>> non-RFC version - that may get more attention if none extra comes.
>>
>> As mentioned on the cover letter, if patch 2+3/3 are accepted, then 
>> patch 1/3 could be simplified. But I plan to leave as is.
>>
>> BTW, any issue with putting your suggested-by on patch 2/3?
> 

Hi Bart,

> 
> I have added my Reviewed-by to patch 2/3.
> 

OK, thanks.

Please note that I still want to check further whether some of Ming's 
series "blk-mq: implement queue quiesce via percpu_ref for 
BLK_MQ_F_BLOCKING" can be used.

> Regarding the other two patches in this series: I do not agree with 
> patch 3/3. As I have explained, I am concerned that that patch breaks 
> existing block drivers.

Understood. I need to check your concern further to allay any fears.

So I could probably change that patch to drop the early return.

Instead we just need to ensure that any existing calls to 
blk_mq_tagset_busy_iter() complete prior to freeing the IO scheduler 
requests. Then we don't need to return early and can iterate as before - 
but, as I said previously, there should be no active tags to iterate.
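That revised scheme — count iterators in and out, and have the elevator-exit path wait for the in-flight ones rather than reject new ones — could be sketched as follows (a userspace C11 analogue with illustrative names, not the actual patch):

```c
#include <assert.h>
#include <stdatomic.h>

/* Readers count themselves in and out unconditionally; there is no
 * early return, so iterators always run (and simply find no busy tags
 * once the queues are quiesced). */
static void iter_begin(atomic_int *in_flight)
{
    atomic_fetch_add(in_flight, 1);
}

static void iter_end(atomic_int *in_flight)
{
    atomic_fetch_sub(in_flight, 1);
}

/* Elevator-exit side: wait for any iterator already inside the region
 * to leave before freeing the sched requests. On its own this races
 * with a reader entering just after the wait returns - it is only safe
 * because all queues are quiesced first, so a late iterator finds
 * nothing to reference. */
static void wait_for_iters(atomic_int *in_flight)
{
    while (atomic_load(in_flight) != 0)
        ;   /* cpu_relax() in kernel code */
}
```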

> 
> Are patches 1/3 and 3/3 necessary? Or in other words, is patch 2/3 
> sufficient to fix the use-after-free?

No, we need them all in some form.

So far, reports are that 1/3 solves the most common seen UAF. It is 
pretty easy to trigger.

But the scenarios associated with 2/3 and 3/3 are much harder to 
trigger, and I needed to add delays in the code just to trigger them.

Thanks,
John

Patch

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7ff1b20d58e7..5950fee490e8 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -358,11 +358,16 @@  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 {
 	int i;
 
+	if (!atomic_inc_not_zero(&tagset->iter_usage_counter))
+		return;
+
 	for (i = 0; i < tagset->nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
 					      BT_TAG_ITER_STARTED);
 	}
+
+	atomic_dec(&tagset->iter_usage_counter);
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9cb60bf7ac24..326e1b0e5b83 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3493,6 +3493,7 @@  int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 			goto out_free_mq_rq_maps;
 		}
 	}
+	atomic_set(&set->iter_usage_counter, 1);
 
 	mutex_init(&set->tag_list_lock);
 	INIT_LIST_HEAD(&set->tag_list);
diff --git a/block/blk.h b/block/blk.h
index 1a948bfd91e4..461e5b54eb5f 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -214,9 +214,13 @@  static inline void elevator_exit(struct request_queue *q,
 		blk_mq_quiesce_queue(tmp);
 	}
 
+	while (atomic_cmpxchg(&set->iter_usage_counter, 1, 0) != 1);
+
 	blk_mq_sched_free_requests(q);
 	__elevator_exit(q, e);
 
+	atomic_set(&set->iter_usage_counter, 1);
+
 	list_for_each_entry(tmp, &set->tag_list, tag_set_list) {
 		if (tmp == q)
 			continue;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 2c473c9b8990..30a21335767b 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -263,6 +263,7 @@  struct blk_mq_tag_set {
 
 	struct mutex		tag_list_lock;
 	struct list_head	tag_list;
+	atomic_t		iter_usage_counter;
 };
 
 /**