diff mbox series

[-next] block: fix uaf for flush rq while iterating tags

Message ID 20241104110005.1412161-1-yukuai1@huaweicloud.com (mailing list archive)
State New
Headers show
Series [-next] block: fix uaf for flush rq while iterating tags | expand

Commit Message

Yu Kuai Nov. 4, 2024, 11 a.m. UTC
From: Yu Kuai <yukuai3@huawei.com>

blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
atomic mode after del_gendisk"), hence for disk like scsi, following
blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
cause following uaf that is found by our syzkaller for v6.6:

==================================================================
BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909

CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32
Workqueue: kblockd blk_mq_timeout_work
Call Trace:

__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364
print_report+0x3e/0x70 mm/kasan/report.c:475
kasan_report+0xb8/0xf0 mm/kasan/report.c:588
blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
bt_iter block/blk-mq-tag.c:288 [inline]
__sbitmap_for_each_set include/linux/sbitmap.h:295 [inline]
sbitmap_for_each_set include/linux/sbitmap.h:316 [inline]
bt_for_each+0x455/0x790 block/blk-mq-tag.c:325
blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534
blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673
process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
process_scheduled_works kernel/workqueue.c:2704 [inline]
worker_thread+0x804/0xe40 kernel/workqueue.c:2785
kthread+0x346/0x450 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293

Allocated by task 942:
kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
____kasan_kmalloc mm/kasan/common.c:374 [inline]
__kasan_kmalloc mm/kasan/common.c:383 [inline]
__kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380
kasan_kmalloc include/linux/kasan.h:198 [inline]
__do_kmalloc_node mm/slab_common.c:1007 [inline]
__kmalloc_node+0x69/0x170 mm/slab_common.c:1014
kmalloc_node include/linux/slab.h:620 [inline]
kzalloc_node include/linux/slab.h:732 [inline]
blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499
blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788
blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261
blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294
blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350
blk_mq_init_queue_data block/blk-mq.c:4166 [inline]
blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176
scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335
scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189
__scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727
scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline]
scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791
scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844
scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151
store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191
dev_attr_store+0x5c/0x90 drivers/base/core.c:2388
sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136
kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338
call_write_iter include/linux/fs.h:2083 [inline]
new_sync_write+0x1b4/0x2d0 fs/read_write.c:493
vfs_write+0x76c/0xb00 fs/read_write.c:586
ksys_write+0x127/0x250 fs/read_write.c:639
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81
entry_SYSCALL_64_after_hwframe+0x78/0xe2

Freed by task 244687:
kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
____kasan_slab_free mm/kasan/common.c:236 [inline]
__kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244
kasan_slab_free include/linux/kasan.h:164 [inline]
slab_free_hook mm/slub.c:1815 [inline]
slab_free_freelist_hook mm/slub.c:1841 [inline]
slab_free mm/slub.c:3807 [inline]
__kmem_cache_free+0xe4/0x520 mm/slub.c:3820
blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520
blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37
kobject_cleanup+0x136/0x410 lib/kobject.c:689
kobject_release lib/kobject.c:720 [inline]
kref_put include/linux/kref.h:65 [inline]
kobject_put+0x119/0x140 lib/kobject.c:737
blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144
blk_free_queue block/blk-core.c:298 [inline]
blk_put_queue+0xe2/0x180 block/blk-core.c:314
blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144
process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
process_scheduled_works kernel/workqueue.c:2704 [inline]
worker_thread+0x804/0xe40 kernel/workqueue.c:2785
kthread+0x346/0x450 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293

Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in
blk_register_queue() from initialization path, hence it's safe not to
clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already
make sure that queue should only be registered once, there is no need
to test the flag as well.

Fixes: 6cfeadbff3f8 ("blk-mq: don't clear flush_rq from tags->rqs[]")
Depends-on: commit aec89dc5d421 ("block: keep q_usage_counter in atomic mode after del_gendisk")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 block/blk-sysfs.c | 6 ++----
 block/genhd.c     | 9 +++------
 2 files changed, 5 insertions(+), 10 deletions(-)

Comments

Ming Lei Nov. 5, 2024, 7:18 a.m. UTC | #1
On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
> atomic mode after del_gendisk"), hence for disk like scsi, following
> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
> cause following uaf that is found by our syzkaller for v6.6:
> 
> ==================================================================
> BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
> Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909
> 
> CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32
> Workqueue: kblockd blk_mq_timeout_work
> Call Trace:
> 
> __dump_stack lib/dump_stack.c:88 [inline]
> dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
> print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364
> print_report+0x3e/0x70 mm/kasan/report.c:475
> kasan_report+0xb8/0xf0 mm/kasan/report.c:588
> blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
> bt_iter block/blk-mq-tag.c:288 [inline]
> __sbitmap_for_each_set include/linux/sbitmap.h:295 [inline]
> sbitmap_for_each_set include/linux/sbitmap.h:316 [inline]
> bt_for_each+0x455/0x790 block/blk-mq-tag.c:325
> blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534
> blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673
> process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
> process_scheduled_works kernel/workqueue.c:2704 [inline]
> worker_thread+0x804/0xe40 kernel/workqueue.c:2785
> kthread+0x346/0x450 kernel/kthread.c:388
> ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
> ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293
> 
> Allocated by task 942:
> kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
> kasan_set_track+0x25/0x30 mm/kasan/common.c:52
> ____kasan_kmalloc mm/kasan/common.c:374 [inline]
> __kasan_kmalloc mm/kasan/common.c:383 [inline]
> __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380
> kasan_kmalloc include/linux/kasan.h:198 [inline]
> __do_kmalloc_node mm/slab_common.c:1007 [inline]
> __kmalloc_node+0x69/0x170 mm/slab_common.c:1014
> kmalloc_node include/linux/slab.h:620 [inline]
> kzalloc_node include/linux/slab.h:732 [inline]
> blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499
> blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788
> blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261
> blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294
> blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350
> blk_mq_init_queue_data block/blk-mq.c:4166 [inline]
> blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176
> scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335
> scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189
> __scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727
> scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline]
> scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791
> scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844
> scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151
> store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191
> dev_attr_store+0x5c/0x90 drivers/base/core.c:2388
> sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136
> kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338
> call_write_iter include/linux/fs.h:2083 [inline]
> new_sync_write+0x1b4/0x2d0 fs/read_write.c:493
> vfs_write+0x76c/0xb00 fs/read_write.c:586
> ksys_write+0x127/0x250 fs/read_write.c:639
> do_syscall_x64 arch/x86/entry/common.c:51 [inline]
> do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81
> entry_SYSCALL_64_after_hwframe+0x78/0xe2
> 
> Freed by task 244687:
> kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
> kasan_set_track+0x25/0x30 mm/kasan/common.c:52
> kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
> ____kasan_slab_free mm/kasan/common.c:236 [inline]
> __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244
> kasan_slab_free include/linux/kasan.h:164 [inline]
> slab_free_hook mm/slub.c:1815 [inline]
> slab_free_freelist_hook mm/slub.c:1841 [inline]
> slab_free mm/slub.c:3807 [inline]
> __kmem_cache_free+0xe4/0x520 mm/slub.c:3820
> blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520
> blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37
> kobject_cleanup+0x136/0x410 lib/kobject.c:689
> kobject_release lib/kobject.c:720 [inline]
> kref_put include/linux/kref.h:65 [inline]
> kobject_put+0x119/0x140 lib/kobject.c:737
> blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144
> blk_free_queue block/blk-core.c:298 [inline]
> blk_put_queue+0xe2/0x180 block/blk-core.c:314
> blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144
> process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
> process_scheduled_works kernel/workqueue.c:2704 [inline]
> worker_thread+0x804/0xe40 kernel/workqueue.c:2785
> kthread+0x346/0x450 kernel/kthread.c:388
> ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
> ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293
> 
> Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in
> blk_register_queue() from initialization path, hence it's safe not to
> clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already
> make sure that queue should only be registered once, there is no need
> to test the flag as well.

But disk can be added again in case of sd rebind, so the check should be
kept.

> 
> Fixes: 6cfeadbff3f8 ("blk-mq: don't clear flush_rq from tags->rqs[]")
> Depends-on: commit aec89dc5d421 ("block: keep q_usage_counter in atomic mode after del_gendisk")
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> ---
>  block/blk-sysfs.c | 6 ++----
>  block/genhd.c     | 9 +++------
>  2 files changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 741b95dfdbf6..a7c540728f3f 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -821,10 +821,8 @@ int blk_register_queue(struct gendisk *disk)
>  	 * faster to shut down and is made fully functional here as
>  	 * request_queues for non-existent devices never get registered.
>  	 */
> -	if (!blk_queue_init_done(q)) {
> -		blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q);
> -		percpu_ref_switch_to_percpu(&q->q_usage_counter);
> -	}
> +	blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q);
> +	percpu_ref_switch_to_percpu(&q->q_usage_counter);
>  
>  	return ret;
>  
> diff --git a/block/genhd.c b/block/genhd.c
> index dfee66146bd1..87f9c2457ca6 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -742,13 +742,10 @@ void del_gendisk(struct gendisk *disk)
>  	 * If the disk does not own the queue, allow using passthrough requests
>  	 * again.  Else leave the queue frozen to fail all I/O.
>  	 */
> -	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
> -		blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
> +	if (!test_bit(GD_OWNS_QUEUE, &disk->state))
>  		__blk_mq_unfreeze_queue(q, true);
> -	} else {
> -		if (queue_is_mq(q))
> -			blk_mq_exit_queue(q);
> -	}
> +	else if (queue_is_mq(q))
> +		blk_mq_exit_queue(q);

Clearing INIT_DONE only may regress sd rebind, and I'd suggest to move
the clearing into blk_mq_destroy_queue() for fixing this issue.


Thanks,
Ming
Yu Kuai Nov. 5, 2024, 7:36 a.m. UTC | #2
Hi,

在 2024/11/05 15:18, Ming Lei 写道:
> On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
>> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
>> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
>> atomic mode after del_gendisk"), hence for disk like scsi, following
>> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
>> cause following uaf that is found by our syzkaller for v6.6:
>>
>> ==================================================================
>> BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
>> Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909
>>
>> CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32
>> Workqueue: kblockd blk_mq_timeout_work
>> Call Trace:
>>
>> __dump_stack lib/dump_stack.c:88 [inline]
>> dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
>> print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364
>> print_report+0x3e/0x70 mm/kasan/report.c:475
>> kasan_report+0xb8/0xf0 mm/kasan/report.c:588
>> blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
>> bt_iter block/blk-mq-tag.c:288 [inline]
>> __sbitmap_for_each_set include/linux/sbitmap.h:295 [inline]
>> sbitmap_for_each_set include/linux/sbitmap.h:316 [inline]
>> bt_for_each+0x455/0x790 block/blk-mq-tag.c:325
>> blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534
>> blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673
>> process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
>> process_scheduled_works kernel/workqueue.c:2704 [inline]
>> worker_thread+0x804/0xe40 kernel/workqueue.c:2785
>> kthread+0x346/0x450 kernel/kthread.c:388
>> ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
>> ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293
>>
>> Allocated by task 942:
>> kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
>> kasan_set_track+0x25/0x30 mm/kasan/common.c:52
>> ____kasan_kmalloc mm/kasan/common.c:374 [inline]
>> __kasan_kmalloc mm/kasan/common.c:383 [inline]
>> __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380
>> kasan_kmalloc include/linux/kasan.h:198 [inline]
>> __do_kmalloc_node mm/slab_common.c:1007 [inline]
>> __kmalloc_node+0x69/0x170 mm/slab_common.c:1014
>> kmalloc_node include/linux/slab.h:620 [inline]
>> kzalloc_node include/linux/slab.h:732 [inline]
>> blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499
>> blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788
>> blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261
>> blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294
>> blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350
>> blk_mq_init_queue_data block/blk-mq.c:4166 [inline]
>> blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176
>> scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335
>> scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189
>> __scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727
>> scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline]
>> scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791
>> scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844
>> scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151
>> store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191
>> dev_attr_store+0x5c/0x90 drivers/base/core.c:2388
>> sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136
>> kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338
>> call_write_iter include/linux/fs.h:2083 [inline]
>> new_sync_write+0x1b4/0x2d0 fs/read_write.c:493
>> vfs_write+0x76c/0xb00 fs/read_write.c:586
>> ksys_write+0x127/0x250 fs/read_write.c:639
>> do_syscall_x64 arch/x86/entry/common.c:51 [inline]
>> do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81
>> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>>
>> Freed by task 244687:
>> kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
>> kasan_set_track+0x25/0x30 mm/kasan/common.c:52
>> kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
>> ____kasan_slab_free mm/kasan/common.c:236 [inline]
>> __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244
>> kasan_slab_free include/linux/kasan.h:164 [inline]
>> slab_free_hook mm/slub.c:1815 [inline]
>> slab_free_freelist_hook mm/slub.c:1841 [inline]
>> slab_free mm/slub.c:3807 [inline]
>> __kmem_cache_free+0xe4/0x520 mm/slub.c:3820
>> blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520
>> blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37
>> kobject_cleanup+0x136/0x410 lib/kobject.c:689
>> kobject_release lib/kobject.c:720 [inline]
>> kref_put include/linux/kref.h:65 [inline]
>> kobject_put+0x119/0x140 lib/kobject.c:737
>> blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144
>> blk_free_queue block/blk-core.c:298 [inline]
>> blk_put_queue+0xe2/0x180 block/blk-core.c:314
>> blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144
>> process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
>> process_scheduled_works kernel/workqueue.c:2704 [inline]
>> worker_thread+0x804/0xe40 kernel/workqueue.c:2785
>> kthread+0x346/0x450 kernel/kthread.c:388
>> ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
>> ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293
>>
>> Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in
>> blk_register_queue() from initialization path, hence it's safe not to
>> clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already
>> make sure that queue should only be registered once, there is no need
>> to test the flag as well.
> 
> But disk can be added again in case of sd rebind, so the check should be
> kept.

I don't understand yet, I mean the other flag QUEUE_FLAG_REGISTERED can
cover the case to call percpu_ref_switch_to_percpu() for the rebind
case, the flag QUEUE_FLAG_INIT_DONE now is useless other than
blk_mq_clear_flush_rq_mapping().
> 
>>
>> Fixes: 6cfeadbff3f8 ("blk-mq: don't clear flush_rq from tags->rqs[]")
>> Depends-on: commit aec89dc5d421 ("block: keep q_usage_counter in atomic mode after del_gendisk")
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> ---
>>   block/blk-sysfs.c | 6 ++----
>>   block/genhd.c     | 9 +++------
>>   2 files changed, 5 insertions(+), 10 deletions(-)
>>
>> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
>> index 741b95dfdbf6..a7c540728f3f 100644
>> --- a/block/blk-sysfs.c
>> +++ b/block/blk-sysfs.c
>> @@ -821,10 +821,8 @@ int blk_register_queue(struct gendisk *disk)
>>   	 * faster to shut down and is made fully functional here as
>>   	 * request_queues for non-existent devices never get registered.
>>   	 */
>> -	if (!blk_queue_init_done(q)) {
>> -		blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q);
>> -		percpu_ref_switch_to_percpu(&q->q_usage_counter);
>> -	}
>> +	blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q);
>> +	percpu_ref_switch_to_percpu(&q->q_usage_counter);
>>   
>>   	return ret;
>>   
>> diff --git a/block/genhd.c b/block/genhd.c
>> index dfee66146bd1..87f9c2457ca6 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -742,13 +742,10 @@ void del_gendisk(struct gendisk *disk)
>>   	 * If the disk does not own the queue, allow using passthrough requests
>>   	 * again.  Else leave the queue frozen to fail all I/O.
>>   	 */
>> -	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
>> -		blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
>> +	if (!test_bit(GD_OWNS_QUEUE, &disk->state))
>>   		__blk_mq_unfreeze_queue(q, true);
>> -	} else {
>> -		if (queue_is_mq(q))
>> -			blk_mq_exit_queue(q);
>> -	}
>> +	else if (queue_is_mq(q))
>> +		blk_mq_exit_queue(q);
> 
> Clearing INIT_DONE only may regress sd rebind, and I'd suggest to move
> the clearing into blk_mq_destroy_queue() for fixing this issue.

No, the problem here is not related to sd rebind, if the elevator is
none, the uaf is very easy to trigger:

1) issue a flush and delete the disk, the flush_rq will be left in the
tags->rqs[];
2) trigger iterating tags by another disk in the same host;

Thanks,
Kuai

> 
> 
> Thanks,
> Ming
> 
> 
> .
>
Ming Lei Nov. 5, 2024, 7:58 a.m. UTC | #3
On Tue, Nov 05, 2024 at 03:36:36PM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2024/11/05 15:18, Ming Lei 写道:
> > On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote:
> > > From: Yu Kuai <yukuai3@huawei.com>
> > > 
> > > blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
> > > checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
> > > in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
> > > atomic mode after del_gendisk"), hence for disk like scsi, following
> > > blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
> > > cause following uaf that is found by our syzkaller for v6.6:
> > > 
> > > ==================================================================
> > > BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
> > > Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909
> > > 
> > > CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32
> > > Workqueue: kblockd blk_mq_timeout_work
> > > Call Trace:
> > > 
> > > __dump_stack lib/dump_stack.c:88 [inline]
> > > dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
> > > print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364
> > > print_report+0x3e/0x70 mm/kasan/report.c:475
> > > kasan_report+0xb8/0xf0 mm/kasan/report.c:588
> > > blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
> > > bt_iter block/blk-mq-tag.c:288 [inline]
> > > __sbitmap_for_each_set include/linux/sbitmap.h:295 [inline]
> > > sbitmap_for_each_set include/linux/sbitmap.h:316 [inline]
> > > bt_for_each+0x455/0x790 block/blk-mq-tag.c:325
> > > blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534
> > > blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673
> > > process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
> > > process_scheduled_works kernel/workqueue.c:2704 [inline]
> > > worker_thread+0x804/0xe40 kernel/workqueue.c:2785
> > > kthread+0x346/0x450 kernel/kthread.c:388
> > > ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
> > > ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293
> > > 
> > > Allocated by task 942:
> > > kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
> > > kasan_set_track+0x25/0x30 mm/kasan/common.c:52
> > > ____kasan_kmalloc mm/kasan/common.c:374 [inline]
> > > __kasan_kmalloc mm/kasan/common.c:383 [inline]
> > > __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380
> > > kasan_kmalloc include/linux/kasan.h:198 [inline]
> > > __do_kmalloc_node mm/slab_common.c:1007 [inline]
> > > __kmalloc_node+0x69/0x170 mm/slab_common.c:1014
> > > kmalloc_node include/linux/slab.h:620 [inline]
> > > kzalloc_node include/linux/slab.h:732 [inline]
> > > blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499
> > > blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788
> > > blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261
> > > blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294
> > > blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350
> > > blk_mq_init_queue_data block/blk-mq.c:4166 [inline]
> > > blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176
> > > scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335
> > > scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189
> > > __scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727
> > > scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline]
> > > scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791
> > > scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844
> > > scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151
> > > store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191
> > > dev_attr_store+0x5c/0x90 drivers/base/core.c:2388
> > > sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136
> > > kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338
> > > call_write_iter include/linux/fs.h:2083 [inline]
> > > new_sync_write+0x1b4/0x2d0 fs/read_write.c:493
> > > vfs_write+0x76c/0xb00 fs/read_write.c:586
> > > ksys_write+0x127/0x250 fs/read_write.c:639
> > > do_syscall_x64 arch/x86/entry/common.c:51 [inline]
> > > do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81
> > > entry_SYSCALL_64_after_hwframe+0x78/0xe2
> > > 
> > > Freed by task 244687:
> > > kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
> > > kasan_set_track+0x25/0x30 mm/kasan/common.c:52
> > > kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
> > > ____kasan_slab_free mm/kasan/common.c:236 [inline]
> > > __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244
> > > kasan_slab_free include/linux/kasan.h:164 [inline]
> > > slab_free_hook mm/slub.c:1815 [inline]
> > > slab_free_freelist_hook mm/slub.c:1841 [inline]
> > > slab_free mm/slub.c:3807 [inline]
> > > __kmem_cache_free+0xe4/0x520 mm/slub.c:3820
> > > blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520
> > > blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37
> > > kobject_cleanup+0x136/0x410 lib/kobject.c:689
> > > kobject_release lib/kobject.c:720 [inline]
> > > kref_put include/linux/kref.h:65 [inline]
> > > kobject_put+0x119/0x140 lib/kobject.c:737
> > > blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144
> > > blk_free_queue block/blk-core.c:298 [inline]
> > > blk_put_queue+0xe2/0x180 block/blk-core.c:314
> > > blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144
> > > process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
> > > process_scheduled_works kernel/workqueue.c:2704 [inline]
> > > worker_thread+0x804/0xe40 kernel/workqueue.c:2785
> > > kthread+0x346/0x450 kernel/kthread.c:388
> > > ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
> > > ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293
> > > 
> > > Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in
> > > blk_register_queue() from initialization path, hence it's safe not to
> > > clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already
> > > make sure that queue should only be registered once, there is no need
> > > to test the flag as well.
> > 
> > But disk can be added again in case of sd rebind, so the check should be
> > kept.
> 
> I don't understand yet, I mean the other flag QUEUE_FLAG_REGISTERED can
> cover the case to call percpu_ref_switch_to_percpu() for the rebind
> case, the flag QUEUE_FLAG_INIT_DONE now is useless other than
> blk_mq_clear_flush_rq_mapping().

OK, looks fine, since queue usage counter only works in percpu mode with
disk attached for !GD_OWNS_QUEUE, maybe QUEUE_FLAG_INIT_DONE can be renamed
too, but anyway:

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming
Yu Kuai Nov. 5, 2024, 8:09 a.m. UTC | #4
Hi,

在 2024/11/05 15:58, Ming Lei 写道:
> On Tue, Nov 05, 2024 at 03:36:36PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2024/11/05 15:18, Ming Lei 写道:
>>> On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
>>>> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
>>>> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
>>>> atomic mode after del_gendisk"), hence for disk like scsi, following
>>>> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
>>>> cause following uaf that is found by our syzkaller for v6.6:
>>>>
>>>> ==================================================================
>>>> BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
>>>> Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909
>>>>
>>>> CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32
>>>> Workqueue: kblockd blk_mq_timeout_work
>>>> Call Trace:
>>>>
>>>> __dump_stack lib/dump_stack.c:88 [inline]
>>>> dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
>>>> print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364
>>>> print_report+0x3e/0x70 mm/kasan/report.c:475
>>>> kasan_report+0xb8/0xf0 mm/kasan/report.c:588
>>>> blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261
>>>> bt_iter block/blk-mq-tag.c:288 [inline]
>>>> __sbitmap_for_each_set include/linux/sbitmap.h:295 [inline]
>>>> sbitmap_for_each_set include/linux/sbitmap.h:316 [inline]
>>>> bt_for_each+0x455/0x790 block/blk-mq-tag.c:325
>>>> blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534
>>>> blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673
>>>> process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
>>>> process_scheduled_works kernel/workqueue.c:2704 [inline]
>>>> worker_thread+0x804/0xe40 kernel/workqueue.c:2785
>>>> kthread+0x346/0x450 kernel/kthread.c:388
>>>> ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
>>>> ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293
>>>>
>>>> Allocated by task 942:
>>>> kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
>>>> kasan_set_track+0x25/0x30 mm/kasan/common.c:52
>>>> ____kasan_kmalloc mm/kasan/common.c:374 [inline]
>>>> __kasan_kmalloc mm/kasan/common.c:383 [inline]
>>>> __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380
>>>> kasan_kmalloc include/linux/kasan.h:198 [inline]
>>>> __do_kmalloc_node mm/slab_common.c:1007 [inline]
>>>> __kmalloc_node+0x69/0x170 mm/slab_common.c:1014
>>>> kmalloc_node include/linux/slab.h:620 [inline]
>>>> kzalloc_node include/linux/slab.h:732 [inline]
>>>> blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499
>>>> blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788
>>>> blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261
>>>> blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294
>>>> blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350
>>>> blk_mq_init_queue_data block/blk-mq.c:4166 [inline]
>>>> blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176
>>>> scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335
>>>> scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189
>>>> __scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727
>>>> scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline]
>>>> scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791
>>>> scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844
>>>> scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151
>>>> store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191
>>>> dev_attr_store+0x5c/0x90 drivers/base/core.c:2388
>>>> sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136
>>>> kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338
>>>> call_write_iter include/linux/fs.h:2083 [inline]
>>>> new_sync_write+0x1b4/0x2d0 fs/read_write.c:493
>>>> vfs_write+0x76c/0xb00 fs/read_write.c:586
>>>> ksys_write+0x127/0x250 fs/read_write.c:639
>>>> do_syscall_x64 arch/x86/entry/common.c:51 [inline]
>>>> do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81
>>>> entry_SYSCALL_64_after_hwframe+0x78/0xe2
>>>>
>>>> Freed by task 244687:
>>>> kasan_save_stack+0x22/0x50 mm/kasan/common.c:45
>>>> kasan_set_track+0x25/0x30 mm/kasan/common.c:52
>>>> kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
>>>> ____kasan_slab_free mm/kasan/common.c:236 [inline]
>>>> __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244
>>>> kasan_slab_free include/linux/kasan.h:164 [inline]
>>>> slab_free_hook mm/slub.c:1815 [inline]
>>>> slab_free_freelist_hook mm/slub.c:1841 [inline]
>>>> slab_free mm/slub.c:3807 [inline]
>>>> __kmem_cache_free+0xe4/0x520 mm/slub.c:3820
>>>> blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520
>>>> blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37
>>>> kobject_cleanup+0x136/0x410 lib/kobject.c:689
>>>> kobject_release lib/kobject.c:720 [inline]
>>>> kref_put include/linux/kref.h:65 [inline]
>>>> kobject_put+0x119/0x140 lib/kobject.c:737
>>>> blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144
>>>> blk_free_queue block/blk-core.c:298 [inline]
>>>> blk_put_queue+0xe2/0x180 block/blk-core.c:314
>>>> blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144
>>>> process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631
>>>> process_scheduled_works kernel/workqueue.c:2704 [inline]
>>>> worker_thread+0x804/0xe40 kernel/workqueue.c:2785
>>>> kthread+0x346/0x450 kernel/kthread.c:388
>>>> ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
>>>> ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293
>>>>
>>>> Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in
>>>> blk_register_queue() from initialization path, hence it's safe not to
>>>> clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already
>>>> make sure that queue should only be registered once, there is no need
>>>> to test the flag as well.
>>>
>>> But disk can be added again in case of sd rebind, so the check should be
>>> kept.
>>
>> I don't understand yet, I mean the other flag QUEUE_FLAG_REGISTERED can
>> cover the case to call percpu_ref_switch_to_percpu() for the rebind
>> case, the flag QUEUE_FLAG_INIT_DONE now is useless other than
>> blk_mq_clear_flush_rq_mapping().
> 
> OK, looks fine, since queue usage counter only works in percpu mode with
> disk attached for !GD_OWNS_QUEUE, maybe QUEUE_FLAG_INIT_DONE can be renamed
> too, but anyway:

Yeah, I do think about rename this flag, but I don't come up with a
better name myself. :)
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>

Thanks for the review!
Kuai

> 
> 
> Thanks,
> Ming
> 
> 
> .
>
Christoph Hellwig Nov. 5, 2024, 11:19 a.m. UTC | #5
On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
> atomic mode after del_gendisk"), hence for disk like scsi, following
> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
> cause following uaf that is found by our syzkaller for v6.6:

Which means we leave the flush request lingering after del_gendisk,
which sounds like the real bug.  I suspect we just need to move the
call to blk_mq_clear_flush_rq_mapping so that it is called from
del_gendisk and doesn't leave the flush tag lingering around.
Yu Kuai Nov. 6, 2024, 2:58 a.m. UTC | #6
Hi,Ming and Christoph

在 2024/11/05 19:19, Christoph Hellwig 写道:
> On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
>> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
>> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
>> atomic mode after del_gendisk"), hence for disk like scsi, following
>> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
>> cause following uaf that is found by our syzkaller for v6.6:
> 
> Which means we leave the flush request lingering after del_gendisk,
> which sounds like the real bug.  I suspect we just need to move the
> call to blk_mq_clear_flush_rq_mapping so that it is called from
> del_gendisk and doesn't leave the flush tag lingering around.
> 

This remind me that del_gendisk is still too late to do that. Noted that
flush_rq can acquire different tags, so if the multiple flush_rq is done
and those tags are not reused, the flush_rq can exist in multiple
entries in tags->rqs[]. The consequence I can think of is that iterating
tags can found the same flush_rq multiple times, and the flush_rq can be
inflight.

Therefor, I think the best solution is to clear the flush_rq mapping
when flush request is done.

Thanks,
Kuai
> .
>
Ming Lei Nov. 6, 2024, 3:11 a.m. UTC | #7
On Wed, Nov 06, 2024 at 10:58:40AM +0800, Yu Kuai wrote:
> Hi,Ming and Christoph
> 
> 在 2024/11/05 19:19, Christoph Hellwig 写道:
> > On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote:
> > > From: Yu Kuai <yukuai3@huawei.com>
> > > 
> > > blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
> > > checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
> > > in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
> > > atomic mode after del_gendisk"), hence for disk like scsi, following
> > > blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
> > > cause following uaf that is found by our syzkaller for v6.6:
> > 
> > Which means we leave the flush request lingering after del_gendisk,
> > which sounds like the real bug.  I suspect we just need to move the
> > call to blk_mq_clear_flush_rq_mapping so that it is called from
> > del_gendisk and doesn't leave the flush tag lingering around.
> > 
> 
> This remind me that del_gendisk is still too late to do that. Noted that
> flush_rq can acquire different tags, so if the multiple flush_rq is done
> and those tags are not reused, the flush_rq can exist in multiple
> entries in tags->rqs[]. The consequence I can think of is that iterating
> tags can found the same flush_rq multiple times, and the flush_rq can be
> inflight.

How can that be one problem?

Please look at

commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]")
commit bd63141d585b ("blk-mq: clear stale request in tags->rq[] before freeing one request pool")

and understand the motivation.

That also means it is just fine to delay blk_mq_clear_flush_rq_mapping()
after disk is deleted.

Thanks,
Ming
Yu Kuai Nov. 6, 2024, 3:25 a.m. UTC | #8
Hi,

在 2024/11/06 11:11, Ming Lei 写道:
> On Wed, Nov 06, 2024 at 10:58:40AM +0800, Yu Kuai wrote:
>> Hi,Ming and Christoph
>>
>> 在 2024/11/05 19:19, Christoph Hellwig 写道:
>>> On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
>>>> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
>>>> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
>>>> atomic mode after del_gendisk"), hence for disk like scsi, following
>>>> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
>>>> cause following uaf that is found by our syzkaller for v6.6:
>>>
>>> Which means we leave the flush request lingering after del_gendisk,
>>> which sounds like the real bug.  I suspect we just need to move the
>>> call to blk_mq_clear_flush_rq_mapping so that it is called from
>>> del_gendisk and doesn't leave the flush tag lingering around.
>>>
>>
>> This remind me that del_gendisk is still too late to do that. Noted that
>> flush_rq can acquire different tags, so if the multiple flush_rq is done
>> and those tags are not reused, the flush_rq can exist in multiple
>> entries in tags->rqs[]. The consequence I can think of is that iterating
>> tags can found the same flush_rq multiple times, and the flush_rq can be
>> inflight.
> 
> How can that be one problem?
> 
> Please look at
> 
> commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]")
> commit bd63141d585b ("blk-mq: clear stale request in tags->rq[] before freeing one request pool")
> 
> and understand the motivation.
> 
> That also means it is just fine to delay blk_mq_clear_flush_rq_mapping()
> after disk is deleted.

I do understand what you mean. Let's see if you want this to be avoided,
for example(no disk is deleted):

1) issue a flush, and tag 0 is used, after the flush is done,
tags->rqs[0] = flush_rq
2) issue another flush, and tag 1 is used, after the flush is done,
tags->rqs[1] = flush_rq
3) issue a flush again, and tag 2 is used, and the flush_rq is
dispatched to disk;
4) Then in this case, blk_mq_in_flight_rw() will found the same flush_rq
3 times and think there are 3 inflight request, same for
hctx_busy_show()...

Thanks,
Kuai

> 
> Thanks,
> Ming
> 
> 
> .
>
Ming Lei Nov. 6, 2024, 3:36 a.m. UTC | #9
On Wed, Nov 06, 2024 at 11:25:07AM +0800, Yu Kuai wrote:
> Hi,
> 
> 在 2024/11/06 11:11, Ming Lei 写道:
> > On Wed, Nov 06, 2024 at 10:58:40AM +0800, Yu Kuai wrote:
> > > Hi,Ming and Christoph
> > > 
> > > 在 2024/11/05 19:19, Christoph Hellwig 写道:
> > > > On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote:
> > > > > From: Yu Kuai <yukuai3@huawei.com>
> > > > > 
> > > > > blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
> > > > > checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
> > > > > in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
> > > > > atomic mode after del_gendisk"), hence for disk like scsi, following
> > > > > blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
> > > > > cause following uaf that is found by our syzkaller for v6.6:
> > > > 
> > > > Which means we leave the flush request lingering after del_gendisk,
> > > > which sounds like the real bug.  I suspect we just need to move the
> > > > call to blk_mq_clear_flush_rq_mapping so that it is called from
> > > > del_gendisk and doesn't leave the flush tag lingering around.
> > > > 
> > > 
> > > This remind me that del_gendisk is still too late to do that. Noted that
> > > flush_rq can acquire different tags, so if the multiple flush_rq is done
> > > and those tags are not reused, the flush_rq can exist in multiple
> > > entries in tags->rqs[]. The consequence I can think of is that iterating
> > > tags can found the same flush_rq multiple times, and the flush_rq can be
> > > inflight.
> > 
> > How can that be one problem?
> > 
> > Please look at
> > 
> > commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]")
> > commit bd63141d585b ("blk-mq: clear stale request in tags->rq[] before freeing one request pool")
> > 
> > and understand the motivation.
> > 
> > That also means it is just fine to delay blk_mq_clear_flush_rq_mapping()
> > after disk is deleted.
> 
> I do understand what you mean. Let's see if you want this to be avoided,
> for example(no disk is deleted):

It is definitely another issue, and not supposed to be covered by
blk_mq_clear_flush_rq_mapping().

> 
> 1) issue a flush, and tag 0 is used, after the flush is done,
> tags->rqs[0] = flush_rq
> 2) issue another flush, and tag 1 is used, after the flush is done,
> tags->rqs[1] = flush_rq
> 3) issue a flush again, and tag 2 is used, and the flush_rq is
> dispatched to disk;

Yes, this kind of thing exists since blk-mq begins, and you can't expect
blk_mq_in_flight_rw() to get accurate inflight requests.

> 4) Then in this case, blk_mq_in_flight_rw() will found the same flush_rq
> 3 times and think there are 3 inflight request, same for
> hctx_busy_show()...

But we have tried to avoid it, such as in blk_mq_find_and_get_req()
req->tag is checked against the current 'bitnr' of sbitmap when walking
over tags. Then the same flush_rq won't be counted 3 times.



Thanks,
Ming
Yu Kuai Nov. 6, 2024, 3:51 a.m. UTC | #10
Hi,

在 2024/11/06 11:36, Ming Lei 写道:
> On Wed, Nov 06, 2024 at 11:25:07AM +0800, Yu Kuai wrote:
>> Hi,
>>
>> 在 2024/11/06 11:11, Ming Lei 写道:
>>> On Wed, Nov 06, 2024 at 10:58:40AM +0800, Yu Kuai wrote:
>>>> Hi,Ming and Christoph
>>>>
>>>> 在 2024/11/05 19:19, Christoph Hellwig 写道:
>>>>> On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote:
>>>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>>>
>>>>>> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
>>>>>> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
>>>>>> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
>>>>>> atomic mode after del_gendisk"), hence for disk like scsi, following
>>>>>> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
>>>>>> cause following uaf that is found by our syzkaller for v6.6:
>>>>>
>>>>> Which means we leave the flush request lingering after del_gendisk,
>>>>> which sounds like the real bug.  I suspect we just need to move the
>>>>> call to blk_mq_clear_flush_rq_mapping so that it is called from
>>>>> del_gendisk and doesn't leave the flush tag lingering around.
>>>>>
>>>>
>>>> This remind me that del_gendisk is still too late to do that. Noted that
>>>> flush_rq can acquire different tags, so if the multiple flush_rq is done
>>>> and those tags are not reused, the flush_rq can exist in multiple
>>>> entries in tags->rqs[]. The consequence I can think of is that iterating
>>>> tags can found the same flush_rq multiple times, and the flush_rq can be
>>>> inflight.
>>>
>>> How can that be one problem?
>>>
>>> Please look at
>>>
>>> commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]")
>>> commit bd63141d585b ("blk-mq: clear stale request in tags->rq[] before freeing one request pool")
>>>
>>> and understand the motivation.
>>>
>>> That also means it is just fine to delay blk_mq_clear_flush_rq_mapping()
>>> after disk is deleted.
>>
>> I do understand what you mean. Let's see if you want this to be avoided,
>> for example(no disk is deleted):
> 
> It is definitely another issue, and not supposed to be covered by
> blk_mq_clear_flush_rq_mapping().
> 
>>
>> 1) issue a flush, and tag 0 is used, after the flush is done,
>> tags->rqs[0] = flush_rq
>> 2) issue another flush, and tag 1 is used, after the flush is done,
>> tags->rqs[1] = flush_rq
>> 3) issue a flush again, and tag 2 is used, and the flush_rq is
>> dispatched to disk;
> 
> Yes, this kind of thing exists since blk-mq begins, and you can't expect
> blk_mq_in_flight_rw() to get accurate inflight requests.
> 
>> 4) Then in this case, blk_mq_in_flight_rw() will found the same flush_rq
>> 3 times and think there are 3 inflight request, same for
>> hctx_busy_show()...
> 
> But we have tried to avoid it, such as in blk_mq_find_and_get_req()
> req->tag is checked against the current 'bitnr' of sbitmap when walking
> over tags. Then the same flush_rq won't be counted 3 times.

Yes, I missed that req->tag is checked. Looks like there is no porblem
now. :)

Just to make sure, we're still on the same page for this patch, right?

Thanks for the explanation.
Kuai

> 
> 
> 
> Thanks,
> Ming
> 
> 
> .
>
Ming Lei Nov. 6, 2024, 3:58 a.m. UTC | #11
On Wed, Nov 6, 2024 at 11:52 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi,
>
> 在 2024/11/06 11:36, Ming Lei 写道:
> > On Wed, Nov 06, 2024 at 11:25:07AM +0800, Yu Kuai wrote:
> >> Hi,
> >>
> >> 在 2024/11/06 11:11, Ming Lei 写道:
> >>> On Wed, Nov 06, 2024 at 10:58:40AM +0800, Yu Kuai wrote:
> >>>> Hi,Ming and Christoph
> >>>>
> >>>> 在 2024/11/05 19:19, Christoph Hellwig 写道:
> >>>>> On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote:
> >>>>>> From: Yu Kuai <yukuai3@huawei.com>
> >>>>>>
> >>>>>> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by
> >>>>>> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared
> >>>>>> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in
> >>>>>> atomic mode after del_gendisk"), hence for disk like scsi, following
> >>>>>> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well,
> >>>>>> cause following uaf that is found by our syzkaller for v6.6:
> >>>>>
> >>>>> Which means we leave the flush request lingering after del_gendisk,
> >>>>> which sounds like the real bug.  I suspect we just need to move the
> >>>>> call to blk_mq_clear_flush_rq_mapping so that it is called from
> >>>>> del_gendisk and doesn't leave the flush tag lingering around.
> >>>>>
> >>>>
> >>>> This remind me that del_gendisk is still too late to do that. Noted that
> >>>> flush_rq can acquire different tags, so if the multiple flush_rq is done
> >>>> and those tags are not reused, the flush_rq can exist in multiple
> >>>> entries in tags->rqs[]. The consequence I can think of is that iterating
> >>>> tags can found the same flush_rq multiple times, and the flush_rq can be
> >>>> inflight.
> >>>
> >>> How can that be one problem?
> >>>
> >>> Please look at
> >>>
> >>> commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]")
> >>> commit bd63141d585b ("blk-mq: clear stale request in tags->rq[] before freeing one request pool")
> >>>
> >>> and understand the motivation.
> >>>
> >>> That also means it is just fine to delay blk_mq_clear_flush_rq_mapping()
> >>> after disk is deleted.
> >>
> >> I do understand what you mean. Let's see if you want this to be avoided,
> >> for example(no disk is deleted):
> >
> > It is definitely another issue, and not supposed to be covered by
> > blk_mq_clear_flush_rq_mapping().
> >
> >>
> >> 1) issue a flush, and tag 0 is used, after the flush is done,
> >> tags->rqs[0] = flush_rq
> >> 2) issue another flush, and tag 1 is used, after the flush is done,
> >> tags->rqs[1] = flush_rq
> >> 3) issue a flush again, and tag 2 is used, and the flush_rq is
> >> dispatched to disk;
> >
> > Yes, this kind of thing exists since blk-mq begins, and you can't expect
> > blk_mq_in_flight_rw() to get accurate inflight requests.
> >
> >> 4) Then in this case, blk_mq_in_flight_rw() will found the same flush_rq
> >> 3 times and think there are 3 inflight request, same for
> >> hctx_busy_show()...
> >
> > But we have tried to avoid it, such as in blk_mq_find_and_get_req()
> > req->tag is checked against the current 'bitnr' of sbitmap when walking
> > over tags. Then the same flush_rq won't be counted 3 times.
>
> Yes, I missed that req->tag is checked. Looks like there is no porblem
> now. :)
>
> Just to make sure, we're still on the same page for this patch, right?

Yeah, my reviewed-by still stands, and it is fine to delay to clear flush req
wrt. the original motivation, what matters is that it is cleared before the
flush req is freed.

Thanks,
diff mbox series

Patch

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 741b95dfdbf6..a7c540728f3f 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -821,10 +821,8 @@  int blk_register_queue(struct gendisk *disk)
 	 * faster to shut down and is made fully functional here as
 	 * request_queues for non-existent devices never get registered.
 	 */
-	if (!blk_queue_init_done(q)) {
-		blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q);
-		percpu_ref_switch_to_percpu(&q->q_usage_counter);
-	}
+	blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q);
+	percpu_ref_switch_to_percpu(&q->q_usage_counter);
 
 	return ret;
 
diff --git a/block/genhd.c b/block/genhd.c
index dfee66146bd1..87f9c2457ca6 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -742,13 +742,10 @@  void del_gendisk(struct gendisk *disk)
 	 * If the disk does not own the queue, allow using passthrough requests
 	 * again.  Else leave the queue frozen to fail all I/O.
 	 */
-	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
-		blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
+	if (!test_bit(GD_OWNS_QUEUE, &disk->state))
 		__blk_mq_unfreeze_queue(q, true);
-	} else {
-		if (queue_is_mq(q))
-			blk_mq_exit_queue(q);
-	}
+	else if (queue_is_mq(q))
+		blk_mq_exit_queue(q);
 
 	if (start_drain)
 		blk_unfreeze_release_lock(q, true, queue_dying);