Message ID | 20241104110005.1412161-1-yukuai1@huaweicloud.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [-next] block: fix uaf for flush rq while iterating tags | expand |
On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote: > From: Yu Kuai <yukuai3@huawei.com> > > blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by > checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared > in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in > atomic mode after del_gendisk"), hence for disk like scsi, following > blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, > cause following uaf that is found by our syzkaller for v6.6: > > ================================================================== > BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 > Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909 > > CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32 > Workqueue: kblockd blk_mq_timeout_work > Call Trace: > > __dump_stack lib/dump_stack.c:88 [inline] > dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 > print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364 > print_report+0x3e/0x70 mm/kasan/report.c:475 > kasan_report+0xb8/0xf0 mm/kasan/report.c:588 > blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 > bt_iter block/blk-mq-tag.c:288 [inline] > __sbitmap_for_each_set include/linux/sbitmap.h:295 [inline] > sbitmap_for_each_set include/linux/sbitmap.h:316 [inline] > bt_for_each+0x455/0x790 block/blk-mq-tag.c:325 > blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534 > blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673 > process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 > process_scheduled_works kernel/workqueue.c:2704 [inline] > worker_thread+0x804/0xe40 kernel/workqueue.c:2785 > kthread+0x346/0x450 kernel/kthread.c:388 > ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 > ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 > > Allocated by task 942: > kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 > kasan_set_track+0x25/0x30 mm/kasan/common.c:52 > ____kasan_kmalloc mm/kasan/common.c:374 [inline] > __kasan_kmalloc mm/kasan/common.c:383 [inline] > __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380 > kasan_kmalloc include/linux/kasan.h:198 [inline] > __do_kmalloc_node mm/slab_common.c:1007 [inline] > __kmalloc_node+0x69/0x170 mm/slab_common.c:1014 > kmalloc_node include/linux/slab.h:620 [inline] > kzalloc_node include/linux/slab.h:732 [inline] > blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499 > blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788 > blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261 > blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294 > blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350 > blk_mq_init_queue_data block/blk-mq.c:4166 [inline] > blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176 > scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335 > scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189 > __scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727 > scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline] > scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791 > scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844 > scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151 > store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191 > dev_attr_store+0x5c/0x90 drivers/base/core.c:2388 > sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136 > kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338 > call_write_iter include/linux/fs.h:2083 [inline] > new_sync_write+0x1b4/0x2d0 fs/read_write.c:493 > vfs_write+0x76c/0xb00 fs/read_write.c:586 > ksys_write+0x127/0x250 fs/read_write.c:639 > do_syscall_x64 arch/x86/entry/common.c:51 [inline] > do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 > entry_SYSCALL_64_after_hwframe+0x78/0xe2 > > Freed by task 244687: > kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 > kasan_set_track+0x25/0x30 mm/kasan/common.c:52 > kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 > ____kasan_slab_free mm/kasan/common.c:236 [inline] > __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244 > kasan_slab_free include/linux/kasan.h:164 [inline] > slab_free_hook mm/slub.c:1815 [inline] > slab_free_freelist_hook mm/slub.c:1841 [inline] > slab_free mm/slub.c:3807 [inline] > __kmem_cache_free+0xe4/0x520 mm/slub.c:3820 > blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520 > blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37 > kobject_cleanup+0x136/0x410 lib/kobject.c:689 > kobject_release lib/kobject.c:720 [inline] > kref_put include/linux/kref.h:65 [inline] > kobject_put+0x119/0x140 lib/kobject.c:737 > blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144 > blk_free_queue block/blk-core.c:298 [inline] > blk_put_queue+0xe2/0x180 block/blk-core.c:314 > blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144 > process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 > process_scheduled_works kernel/workqueue.c:2704 [inline] > worker_thread+0x804/0xe40 kernel/workqueue.c:2785 > kthread+0x346/0x450 kernel/kthread.c:388 > ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 > ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 > > Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in > blk_register_queue() from initialization path, hence it's safe not to > clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already > make sure that queue should only be registered once, there is no need > to test the flag as well. But disk can be added again in case of sd rebind, so the check should be kept. > > Fixes: 6cfeadbff3f8 ("blk-mq: don't clear flush_rq from tags->rqs[]") > Depends-on: commit aec89dc5d421 ("block: keep q_usage_counter in atomic mode after del_gendisk") > Signed-off-by: Yu Kuai <yukuai3@huawei.com> > --- > block/blk-sysfs.c | 6 ++---- > block/genhd.c | 9 +++------ > 2 files changed, 5 insertions(+), 10 deletions(-) > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index 741b95dfdbf6..a7c540728f3f 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -821,10 +821,8 @@ int blk_register_queue(struct gendisk *disk) > * faster to shut down and is made fully functional here as > * request_queues for non-existent devices never get registered. > */ > - if (!blk_queue_init_done(q)) { > - blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q); > - percpu_ref_switch_to_percpu(&q->q_usage_counter); > - } > + blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q); > + percpu_ref_switch_to_percpu(&q->q_usage_counter); > > return ret; > > diff --git a/block/genhd.c b/block/genhd.c > index dfee66146bd1..87f9c2457ca6 100644 > --- a/block/genhd.c > +++ b/block/genhd.c > @@ -742,13 +742,10 @@ void del_gendisk(struct gendisk *disk) > * If the disk does not own the queue, allow using passthrough requests > * again. Else leave the queue frozen to fail all I/O. > */ > - if (!test_bit(GD_OWNS_QUEUE, &disk->state)) { > - blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q); > + if (!test_bit(GD_OWNS_QUEUE, &disk->state)) > __blk_mq_unfreeze_queue(q, true); > - } else { > - if (queue_is_mq(q)) > - blk_mq_exit_queue(q); > - } > + else if (queue_is_mq(q)) > + blk_mq_exit_queue(q); Clearing INIT_DONE only may regress sd rebind, and I'd suggest to move the clearing into blk_mq_destroy_queue() for fixing this issue. Thanks, Ming
Hi, 在 2024/11/05 15:18, Ming Lei 写道: > On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote: >> From: Yu Kuai <yukuai3@huawei.com> >> >> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by >> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared >> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in >> atomic mode after del_gendisk"), hence for disk like scsi, following >> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, >> cause following uaf that is found by our syzkaller for v6.6: >> >> ================================================================== >> BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 >> Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909 >> >> CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32 >> Workqueue: kblockd blk_mq_timeout_work >> Call Trace: >> >> __dump_stack lib/dump_stack.c:88 [inline] >> dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 >> print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364 >> print_report+0x3e/0x70 mm/kasan/report.c:475 >> kasan_report+0xb8/0xf0 mm/kasan/report.c:588 >> blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 >> bt_iter block/blk-mq-tag.c:288 [inline] >> __sbitmap_for_each_set include/linux/sbitmap.h:295 [inline] >> sbitmap_for_each_set include/linux/sbitmap.h:316 [inline] >> bt_for_each+0x455/0x790 block/blk-mq-tag.c:325 >> blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534 >> blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673 >> process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 >> process_scheduled_works kernel/workqueue.c:2704 [inline] >> worker_thread+0x804/0xe40 kernel/workqueue.c:2785 >> kthread+0x346/0x450 kernel/kthread.c:388 >> ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 >> ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 >> >> Allocated by task 942: >> kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 >> kasan_set_track+0x25/0x30 mm/kasan/common.c:52 >> ____kasan_kmalloc mm/kasan/common.c:374 [inline] >> __kasan_kmalloc mm/kasan/common.c:383 [inline] >> __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380 >> kasan_kmalloc include/linux/kasan.h:198 [inline] >> __do_kmalloc_node mm/slab_common.c:1007 [inline] >> __kmalloc_node+0x69/0x170 mm/slab_common.c:1014 >> kmalloc_node include/linux/slab.h:620 [inline] >> kzalloc_node include/linux/slab.h:732 [inline] >> blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499 >> blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788 >> blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261 >> blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294 >> blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350 >> blk_mq_init_queue_data block/blk-mq.c:4166 [inline] >> blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176 >> scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335 >> scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189 >> __scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727 >> scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline] >> scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791 >> scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844 >> scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151 >> store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191 >> dev_attr_store+0x5c/0x90 drivers/base/core.c:2388 >> sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136 >> kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338 >> call_write_iter include/linux/fs.h:2083 [inline] >> new_sync_write+0x1b4/0x2d0 fs/read_write.c:493 >> vfs_write+0x76c/0xb00 fs/read_write.c:586 >> ksys_write+0x127/0x250 fs/read_write.c:639 >> do_syscall_x64 arch/x86/entry/common.c:51 [inline] >> do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 >> entry_SYSCALL_64_after_hwframe+0x78/0xe2 >> >> Freed by task 244687: >> kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 >> kasan_set_track+0x25/0x30 mm/kasan/common.c:52 >> kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 >> ____kasan_slab_free mm/kasan/common.c:236 [inline] >> __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244 >> kasan_slab_free include/linux/kasan.h:164 [inline] >> slab_free_hook mm/slub.c:1815 [inline] >> slab_free_freelist_hook mm/slub.c:1841 [inline] >> slab_free mm/slub.c:3807 [inline] >> __kmem_cache_free+0xe4/0x520 mm/slub.c:3820 >> blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520 >> blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37 >> kobject_cleanup+0x136/0x410 lib/kobject.c:689 >> kobject_release lib/kobject.c:720 [inline] >> kref_put include/linux/kref.h:65 [inline] >> kobject_put+0x119/0x140 lib/kobject.c:737 >> blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144 >> blk_free_queue block/blk-core.c:298 [inline] >> blk_put_queue+0xe2/0x180 block/blk-core.c:314 >> blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144 >> process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 >> process_scheduled_works kernel/workqueue.c:2704 [inline] >> worker_thread+0x804/0xe40 kernel/workqueue.c:2785 >> kthread+0x346/0x450 kernel/kthread.c:388 >> ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 >> ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 >> >> Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in >> blk_register_queue() from initialization path, hence it's safe not to >> clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already >> make sure that queue should only be registered once, there is no need >> to test the flag as well. > > But disk can be added again in case of sd rebind, so the check should be > kept. I don't understand yet, I mean the other flag QUEUE_FLAG_REGISTERED can cover the case to call percpu_ref_switch_to_percpu() for the rebind case, the flag QUEUE_FLAG_INIT_DONE now is useless other than blk_mq_clear_flush_rq_mapping(). > >> >> Fixes: 6cfeadbff3f8 ("blk-mq: don't clear flush_rq from tags->rqs[]") >> Depends-on: commit aec89dc5d421 ("block: keep q_usage_counter in atomic mode after del_gendisk") >> Signed-off-by: Yu Kuai <yukuai3@huawei.com> >> --- >> block/blk-sysfs.c | 6 ++---- >> block/genhd.c | 9 +++------ >> 2 files changed, 5 insertions(+), 10 deletions(-) >> >> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c >> index 741b95dfdbf6..a7c540728f3f 100644 >> --- a/block/blk-sysfs.c >> +++ b/block/blk-sysfs.c >> @@ -821,10 +821,8 @@ int blk_register_queue(struct gendisk *disk) >> * faster to shut down and is made fully functional here as >> * request_queues for non-existent devices never get registered. >> */ >> - if (!blk_queue_init_done(q)) { >> - blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q); >> - percpu_ref_switch_to_percpu(&q->q_usage_counter); >> - } >> + blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q); >> + percpu_ref_switch_to_percpu(&q->q_usage_counter); >> >> return ret; >> >> diff --git a/block/genhd.c b/block/genhd.c >> index dfee66146bd1..87f9c2457ca6 100644 >> --- a/block/genhd.c >> +++ b/block/genhd.c >> @@ -742,13 +742,10 @@ void del_gendisk(struct gendisk *disk) >> * If the disk does not own the queue, allow using passthrough requests >> * again. Else leave the queue frozen to fail all I/O. >> */ >> - if (!test_bit(GD_OWNS_QUEUE, &disk->state)) { >> - blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q); >> + if (!test_bit(GD_OWNS_QUEUE, &disk->state)) >> __blk_mq_unfreeze_queue(q, true); >> - } else { >> - if (queue_is_mq(q)) >> - blk_mq_exit_queue(q); >> - } >> + else if (queue_is_mq(q)) >> + blk_mq_exit_queue(q); > > Clearing INIT_DONE only may regress sd rebind, and I'd suggest to move > the clearing into blk_mq_destroy_queue() for fixing this issue. No, the problem here is not related to sd rebind, if the elevator is none, the uaf is very easy to trigger: 1) issue a flush and delete the disk, the flush_rq will be left in the tags->rqs[]; 2) trigger iterating tags by another disk in the same host; Thanks, Kuai > > > Thanks, > Ming > > > . >
On Tue, Nov 05, 2024 at 03:36:36PM +0800, Yu Kuai wrote: > Hi, > > 在 2024/11/05 15:18, Ming Lei 写道: > > On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote: > > > From: Yu Kuai <yukuai3@huawei.com> > > > > > > blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by > > > checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared > > > in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in > > > atomic mode after del_gendisk"), hence for disk like scsi, following > > > blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, > > > cause following uaf that is found by our syzkaller for v6.6: > > > > > > ================================================================== > > > BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 > > > Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909 > > > > > > CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32 > > > Workqueue: kblockd blk_mq_timeout_work > > > Call Trace: > > > > > > __dump_stack lib/dump_stack.c:88 [inline] > > > dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 > > > print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364 > > > print_report+0x3e/0x70 mm/kasan/report.c:475 > > > kasan_report+0xb8/0xf0 mm/kasan/report.c:588 > > > blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 > > > bt_iter block/blk-mq-tag.c:288 [inline] > > > __sbitmap_for_each_set include/linux/sbitmap.h:295 [inline] > > > sbitmap_for_each_set include/linux/sbitmap.h:316 [inline] > > > bt_for_each+0x455/0x790 block/blk-mq-tag.c:325 > > > blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534 > > > blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673 > > > process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 > > > process_scheduled_works kernel/workqueue.c:2704 [inline] > > > worker_thread+0x804/0xe40 kernel/workqueue.c:2785 > > > kthread+0x346/0x450 kernel/kthread.c:388 > > > ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 > > > ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 > > > > > > Allocated by task 942: > > > kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 > > > kasan_set_track+0x25/0x30 mm/kasan/common.c:52 > > > ____kasan_kmalloc mm/kasan/common.c:374 [inline] > > > __kasan_kmalloc mm/kasan/common.c:383 [inline] > > > __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380 > > > kasan_kmalloc include/linux/kasan.h:198 [inline] > > > __do_kmalloc_node mm/slab_common.c:1007 [inline] > > > __kmalloc_node+0x69/0x170 mm/slab_common.c:1014 > > > kmalloc_node include/linux/slab.h:620 [inline] > > > kzalloc_node include/linux/slab.h:732 [inline] > > > blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499 > > > blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788 > > > blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261 > > > blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294 > > > blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350 > > > blk_mq_init_queue_data block/blk-mq.c:4166 [inline] > > > blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176 > > > scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335 > > > scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189 > > > __scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727 > > > scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline] > > > scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791 > > > scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844 > > > scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151 > > > store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191 > > > dev_attr_store+0x5c/0x90 drivers/base/core.c:2388 > > > sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136 > > > kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338 > > > call_write_iter include/linux/fs.h:2083 [inline] > > > new_sync_write+0x1b4/0x2d0 fs/read_write.c:493 > > > vfs_write+0x76c/0xb00 fs/read_write.c:586 > > > ksys_write+0x127/0x250 fs/read_write.c:639 > > > do_syscall_x64 arch/x86/entry/common.c:51 [inline] > > > do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 > > > entry_SYSCALL_64_after_hwframe+0x78/0xe2 > > > > > > Freed by task 244687: > > > kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 > > > kasan_set_track+0x25/0x30 mm/kasan/common.c:52 > > > kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 > > > ____kasan_slab_free mm/kasan/common.c:236 [inline] > > > __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244 > > > kasan_slab_free include/linux/kasan.h:164 [inline] > > > slab_free_hook mm/slub.c:1815 [inline] > > > slab_free_freelist_hook mm/slub.c:1841 [inline] > > > slab_free mm/slub.c:3807 [inline] > > > __kmem_cache_free+0xe4/0x520 mm/slub.c:3820 > > > blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520 > > > blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37 > > > kobject_cleanup+0x136/0x410 lib/kobject.c:689 > > > kobject_release lib/kobject.c:720 [inline] > > > kref_put include/linux/kref.h:65 [inline] > > > kobject_put+0x119/0x140 lib/kobject.c:737 > > > blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144 > > > blk_free_queue block/blk-core.c:298 [inline] > > > blk_put_queue+0xe2/0x180 block/blk-core.c:314 > > > blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144 > > > process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 > > > process_scheduled_works kernel/workqueue.c:2704 [inline] > > > worker_thread+0x804/0xe40 kernel/workqueue.c:2785 > > > kthread+0x346/0x450 kernel/kthread.c:388 > > > ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 > > > ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 > > > > > > Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in > > > blk_register_queue() from initialization path, hence it's safe not to > > > clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already > > > make sure that queue should only be registered once, there is no need > > > to test the flag as well. > > > > But disk can be added again in case of sd rebind, so the check should be > > kept. > > I don't understand yet, I mean the other flag QUEUE_FLAG_REGISTERED can > cover the case to call percpu_ref_switch_to_percpu() for the rebind > case, the flag QUEUE_FLAG_INIT_DONE now is useless other than > blk_mq_clear_flush_rq_mapping(). OK, looks fine, since queue usage counter only works in percpu mode with disk attached for !GD_OWNS_QUEUE, maybe QUEUE_FLAG_INIT_DONE can be renamed too, but anyway: Reviewed-by: Ming Lei <ming.lei@redhat.com> Thanks, Ming
Hi, 在 2024/11/05 15:58, Ming Lei 写道: > On Tue, Nov 05, 2024 at 03:36:36PM +0800, Yu Kuai wrote: >> Hi, >> >> 在 2024/11/05 15:18, Ming Lei 写道: >>> On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote: >>>> From: Yu Kuai <yukuai3@huawei.com> >>>> >>>> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by >>>> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared >>>> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in >>>> atomic mode after del_gendisk"), hence for disk like scsi, following >>>> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, >>>> cause following uaf that is found by our syzkaller for v6.6: >>>> >>>> ================================================================== >>>> BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 >>>> Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909 >>>> >>>> CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32 >>>> Workqueue: kblockd blk_mq_timeout_work >>>> Call Trace: >>>> >>>> __dump_stack lib/dump_stack.c:88 [inline] >>>> dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 >>>> print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364 >>>> print_report+0x3e/0x70 mm/kasan/report.c:475 >>>> kasan_report+0xb8/0xf0 mm/kasan/report.c:588 >>>> blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 >>>> bt_iter block/blk-mq-tag.c:288 [inline] >>>> __sbitmap_for_each_set include/linux/sbitmap.h:295 [inline] >>>> sbitmap_for_each_set include/linux/sbitmap.h:316 [inline] >>>> bt_for_each+0x455/0x790 block/blk-mq-tag.c:325 >>>> blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534 >>>> blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673 >>>> process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 >>>> process_scheduled_works kernel/workqueue.c:2704 [inline] >>>> worker_thread+0x804/0xe40 kernel/workqueue.c:2785 >>>> kthread+0x346/0x450 kernel/kthread.c:388 >>>> ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 >>>> ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 >>>> >>>> Allocated by task 942: >>>> kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 >>>> kasan_set_track+0x25/0x30 mm/kasan/common.c:52 >>>> ____kasan_kmalloc mm/kasan/common.c:374 [inline] >>>> __kasan_kmalloc mm/kasan/common.c:383 [inline] >>>> __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380 >>>> kasan_kmalloc include/linux/kasan.h:198 [inline] >>>> __do_kmalloc_node mm/slab_common.c:1007 [inline] >>>> __kmalloc_node+0x69/0x170 mm/slab_common.c:1014 >>>> kmalloc_node include/linux/slab.h:620 [inline] >>>> kzalloc_node include/linux/slab.h:732 [inline] >>>> blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499 >>>> blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788 >>>> blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261 >>>> blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294 >>>> blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350 >>>> blk_mq_init_queue_data block/blk-mq.c:4166 [inline] >>>> blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176 >>>> scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335 >>>> scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189 >>>> __scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727 >>>> scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline] >>>> scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791 >>>> scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844 >>>> scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151 >>>> store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191 >>>> dev_attr_store+0x5c/0x90 drivers/base/core.c:2388 >>>> sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136 >>>> kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338 >>>> call_write_iter include/linux/fs.h:2083 [inline] >>>> new_sync_write+0x1b4/0x2d0 fs/read_write.c:493 >>>> vfs_write+0x76c/0xb00 fs/read_write.c:586 >>>> ksys_write+0x127/0x250 fs/read_write.c:639 >>>> do_syscall_x64 arch/x86/entry/common.c:51 [inline] >>>> do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 >>>> entry_SYSCALL_64_after_hwframe+0x78/0xe2 >>>> >>>> Freed by task 244687: >>>> kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 >>>> kasan_set_track+0x25/0x30 mm/kasan/common.c:52 >>>> kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 >>>> ____kasan_slab_free mm/kasan/common.c:236 [inline] >>>> __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244 >>>> kasan_slab_free include/linux/kasan.h:164 [inline] >>>> slab_free_hook mm/slub.c:1815 [inline] >>>> slab_free_freelist_hook mm/slub.c:1841 [inline] >>>> slab_free mm/slub.c:3807 [inline] >>>> __kmem_cache_free+0xe4/0x520 mm/slub.c:3820 >>>> blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520 >>>> blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37 >>>> kobject_cleanup+0x136/0x410 lib/kobject.c:689 >>>> kobject_release lib/kobject.c:720 [inline] >>>> kref_put include/linux/kref.h:65 [inline] >>>> kobject_put+0x119/0x140 lib/kobject.c:737 >>>> blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144 >>>> blk_free_queue block/blk-core.c:298 [inline] >>>> blk_put_queue+0xe2/0x180 block/blk-core.c:314 >>>> blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144 >>>> process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 >>>> process_scheduled_works kernel/workqueue.c:2704 [inline] >>>> worker_thread+0x804/0xe40 kernel/workqueue.c:2785 >>>> kthread+0x346/0x450 kernel/kthread.c:388 >>>> ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 >>>> ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 >>>> >>>> Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in >>>> blk_register_queue() from initialization path, hence it's safe not to >>>> clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already >>>> make sure that queue should only be registered once, there is no need >>>> to test the flag as well. >>> >>> But disk can be added again in case of sd rebind, so the check should be >>> kept. >> >> I don't understand yet, I mean the other flag QUEUE_FLAG_REGISTERED can >> cover the case to call percpu_ref_switch_to_percpu() for the rebind >> case, the flag QUEUE_FLAG_INIT_DONE now is useless other than >> blk_mq_clear_flush_rq_mapping(). > > OK, looks fine, since queue usage counter only works in percpu mode with > disk attached for !GD_OWNS_QUEUE, maybe QUEUE_FLAG_INIT_DONE can be renamed > too, but anyway: Yeah, I do think about rename this flag, but I don't come up with a better name myself. :) > > Reviewed-by: Ming Lei <ming.lei@redhat.com> Thanks for the review! Kuai > > > Thanks, > Ming > > > . >
On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote: > From: Yu Kuai <yukuai3@huawei.com> > > blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by > checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared > in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in > atomic mode after del_gendisk"), hence for disk like scsi, following > blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, > cause following uaf that is found by our syzkaller for v6.6: Which means we leave the flush request lingering after del_gendisk, which sounds like the real bug. I suspect we just need to move the call to blk_mq_clear_flush_rq_mapping so that it is called from del_gendisk and doesn't leave the flush tag lingering around.
Hi,Ming and Christoph 在 2024/11/05 19:19, Christoph Hellwig 写道: > On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote: >> From: Yu Kuai <yukuai3@huawei.com> >> >> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by >> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared >> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in >> atomic mode after del_gendisk"), hence for disk like scsi, following >> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, >> cause following uaf that is found by our syzkaller for v6.6: > > Which means we leave the flush request lingering after del_gendisk, > which sounds like the real bug. I suspect we just need to move the > call to blk_mq_clear_flush_rq_mapping so that it is called from > del_gendisk and doesn't leave the flush tag lingering around. > This remind me that del_gendisk is still too late to do that. Noted that flush_rq can acquire different tags, so if the multiple flush_rq is done and those tags are not reused, the flush_rq can exist in multiple entries in tags->rqs[]. The consequence I can think of is that iterating tags can found the same flush_rq multiple times, and the flush_rq can be inflight. Therefor, I think the best solution is to clear the flush_rq mapping when flush request is done. Thanks, Kuai > . >
On Wed, Nov 06, 2024 at 10:58:40AM +0800, Yu Kuai wrote: > Hi,Ming and Christoph > > 在 2024/11/05 19:19, Christoph Hellwig 写道: > > On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote: > > > From: Yu Kuai <yukuai3@huawei.com> > > > > > > blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by > > > checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared > > > in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in > > > atomic mode after del_gendisk"), hence for disk like scsi, following > > > blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, > > > cause following uaf that is found by our syzkaller for v6.6: > > > > Which means we leave the flush request lingering after del_gendisk, > > which sounds like the real bug. I suspect we just need to move the > > call to blk_mq_clear_flush_rq_mapping so that it is called from > > del_gendisk and doesn't leave the flush tag lingering around. > > > > This remind me that del_gendisk is still too late to do that. Noted that > flush_rq can acquire different tags, so if the multiple flush_rq is done > and those tags are not reused, the flush_rq can exist in multiple > entries in tags->rqs[]. The consequence I can think of is that iterating > tags can found the same flush_rq multiple times, and the flush_rq can be > inflight. How can that be one problem? Please look at commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]") commit bd63141d585b ("blk-mq: clear stale request in tags->rq[] before freeing one request pool") and understand the motivation. That also means it is just fine to delay blk_mq_clear_flush_rq_mapping() after disk is deleted. Thanks, Ming
Hi, 在 2024/11/06 11:11, Ming Lei 写道: > On Wed, Nov 06, 2024 at 10:58:40AM +0800, Yu Kuai wrote: >> Hi,Ming and Christoph >> >> 在 2024/11/05 19:19, Christoph Hellwig 写道: >>> On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote: >>>> From: Yu Kuai <yukuai3@huawei.com> >>>> >>>> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by >>>> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared >>>> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in >>>> atomic mode after del_gendisk"), hence for disk like scsi, following >>>> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, >>>> cause following uaf that is found by our syzkaller for v6.6: >>> >>> Which means we leave the flush request lingering after del_gendisk, >>> which sounds like the real bug. I suspect we just need to move the >>> call to blk_mq_clear_flush_rq_mapping so that it is called from >>> del_gendisk and doesn't leave the flush tag lingering around. >>> >> >> This remind me that del_gendisk is still too late to do that. Noted that >> flush_rq can acquire different tags, so if the multiple flush_rq is done >> and those tags are not reused, the flush_rq can exist in multiple >> entries in tags->rqs[]. The consequence I can think of is that iterating >> tags can found the same flush_rq multiple times, and the flush_rq can be >> inflight. > > How can that be one problem? > > Please look at > > commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]") > commit bd63141d585b ("blk-mq: clear stale request in tags->rq[] before freeing one request pool") > > and understand the motivation. > > That also means it is just fine to delay blk_mq_clear_flush_rq_mapping() > after disk is deleted. I do understand what you mean. Let's see if you want this to be avoided, for example(no disk is deleted): 1) issue a flush, and tag 0 is used, after the flush is done, tags->rqs[0] = flush_rq 2) issue another flush, and tag 1 is used, after the flush is done, tags->rqs[1] = flush_rq 3) issue a flush again, and tag 2 is used, and the flush_rq is dispatched to disk; 4) Then in this case, blk_mq_in_flight_rw() will found the same flush_rq 3 times and think there are 3 inflight request, same for hctx_busy_show()... Thanks, Kuai > > Thanks, > Ming > > > . >
On Wed, Nov 06, 2024 at 11:25:07AM +0800, Yu Kuai wrote: > Hi, > > 在 2024/11/06 11:11, Ming Lei 写道: > > On Wed, Nov 06, 2024 at 10:58:40AM +0800, Yu Kuai wrote: > > > Hi,Ming and Christoph > > > > > > 在 2024/11/05 19:19, Christoph Hellwig 写道: > > > > On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote: > > > > > From: Yu Kuai <yukuai3@huawei.com> > > > > > > > > > > blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by > > > > > checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared > > > > > in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in > > > > > atomic mode after del_gendisk"), hence for disk like scsi, following > > > > > blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, > > > > > cause following uaf that is found by our syzkaller for v6.6: > > > > > > > > Which means we leave the flush request lingering after del_gendisk, > > > > which sounds like the real bug. I suspect we just need to move the > > > > call to blk_mq_clear_flush_rq_mapping so that it is called from > > > > del_gendisk and doesn't leave the flush tag lingering around. > > > > > > > > > > This remind me that del_gendisk is still too late to do that. Noted that > > > flush_rq can acquire different tags, so if the multiple flush_rq is done > > > and those tags are not reused, the flush_rq can exist in multiple > > > entries in tags->rqs[]. The consequence I can think of is that iterating > > > tags can found the same flush_rq multiple times, and the flush_rq can be > > > inflight. > > > > How can that be one problem? > > > > Please look at > > > > commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]") > > commit bd63141d585b ("blk-mq: clear stale request in tags->rq[] before freeing one request pool") > > > > and understand the motivation. > > > > That also means it is just fine to delay blk_mq_clear_flush_rq_mapping() > > after disk is deleted. > > I do understand what you mean. Let's see if you want this to be avoided, > for example(no disk is deleted): It is definitely another issue, and not supposed to be covered by blk_mq_clear_flush_rq_mapping(). > > 1) issue a flush, and tag 0 is used, after the flush is done, > tags->rqs[0] = flush_rq > 2) issue another flush, and tag 1 is used, after the flush is done, > tags->rqs[1] = flush_rq > 3) issue a flush again, and tag 2 is used, and the flush_rq is > dispatched to disk; Yes, this kind of thing exists since blk-mq begins, and you can't expect blk_mq_in_flight_rw() to get accurate inflight requests. > 4) Then in this case, blk_mq_in_flight_rw() will found the same flush_rq > 3 times and think there are 3 inflight request, same for > hctx_busy_show()... But we have tried to avoid it, such as in blk_mq_find_and_get_req() req->tag is checked against the current 'bitnr' of sbitmap when walking over tags. Then the same flush_rq won't be counted 3 times. Thanks, Ming
Hi, 在 2024/11/06 11:36, Ming Lei 写道: > On Wed, Nov 06, 2024 at 11:25:07AM +0800, Yu Kuai wrote: >> Hi, >> >> 在 2024/11/06 11:11, Ming Lei 写道: >>> On Wed, Nov 06, 2024 at 10:58:40AM +0800, Yu Kuai wrote: >>>> Hi,Ming and Christoph >>>> >>>> 在 2024/11/05 19:19, Christoph Hellwig 写道: >>>>> On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote: >>>>>> From: Yu Kuai <yukuai3@huawei.com> >>>>>> >>>>>> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by >>>>>> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared >>>>>> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in >>>>>> atomic mode after del_gendisk"), hence for disk like scsi, following >>>>>> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, >>>>>> cause following uaf that is found by our syzkaller for v6.6: >>>>> >>>>> Which means we leave the flush request lingering after del_gendisk, >>>>> which sounds like the real bug. I suspect we just need to move the >>>>> call to blk_mq_clear_flush_rq_mapping so that it is called from >>>>> del_gendisk and doesn't leave the flush tag lingering around. >>>>> >>>> >>>> This remind me that del_gendisk is still too late to do that. Noted that >>>> flush_rq can acquire different tags, so if the multiple flush_rq is done >>>> and those tags are not reused, the flush_rq can exist in multiple >>>> entries in tags->rqs[]. The consequence I can think of is that iterating >>>> tags can found the same flush_rq multiple times, and the flush_rq can be >>>> inflight. >>> >>> How can that be one problem? >>> >>> Please look at >>> >>> commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]") >>> commit bd63141d585b ("blk-mq: clear stale request in tags->rq[] before freeing one request pool") >>> >>> and understand the motivation. >>> >>> That also means it is just fine to delay blk_mq_clear_flush_rq_mapping() >>> after disk is deleted. >> >> I do understand what you mean. Let's see if you want this to be avoided, >> for example(no disk is deleted): > > It is definitely another issue, and not supposed to be covered by > blk_mq_clear_flush_rq_mapping(). > >> >> 1) issue a flush, and tag 0 is used, after the flush is done, >> tags->rqs[0] = flush_rq >> 2) issue another flush, and tag 1 is used, after the flush is done, >> tags->rqs[1] = flush_rq >> 3) issue a flush again, and tag 2 is used, and the flush_rq is >> dispatched to disk; > > Yes, this kind of thing exists since blk-mq begins, and you can't expect > blk_mq_in_flight_rw() to get accurate inflight requests. > >> 4) Then in this case, blk_mq_in_flight_rw() will found the same flush_rq >> 3 times and think there are 3 inflight request, same for >> hctx_busy_show()... > > But we have tried to avoid it, such as in blk_mq_find_and_get_req() > req->tag is checked against the current 'bitnr' of sbitmap when walking > over tags. Then the same flush_rq won't be counted 3 times. Yes, I missed that req->tag is checked. Looks like there is no porblem now. :) Just to make sure, we're still on the same page for this patch, right? Thanks for the explanation. Kuai > > > > Thanks, > Ming > > > . >
On Wed, Nov 6, 2024 at 11:52 AM Yu Kuai <yukuai1@huaweicloud.com> wrote: > > Hi, > > 在 2024/11/06 11:36, Ming Lei 写道: > > On Wed, Nov 06, 2024 at 11:25:07AM +0800, Yu Kuai wrote: > >> Hi, > >> > >> 在 2024/11/06 11:11, Ming Lei 写道: > >>> On Wed, Nov 06, 2024 at 10:58:40AM +0800, Yu Kuai wrote: > >>>> Hi,Ming and Christoph > >>>> > >>>> 在 2024/11/05 19:19, Christoph Hellwig 写道: > >>>>> On Mon, Nov 04, 2024 at 07:00:05PM +0800, Yu Kuai wrote: > >>>>>> From: Yu Kuai <yukuai3@huawei.com> > >>>>>> > >>>>>> blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by > >>>>>> checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared > >>>>>> in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in > >>>>>> atomic mode after del_gendisk"), hence for disk like scsi, following > >>>>>> blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, > >>>>>> cause following uaf that is found by our syzkaller for v6.6: > >>>>> > >>>>> Which means we leave the flush request lingering after del_gendisk, > >>>>> which sounds like the real bug. I suspect we just need to move the > >>>>> call to blk_mq_clear_flush_rq_mapping so that it is called from > >>>>> del_gendisk and doesn't leave the flush tag lingering around. > >>>>> > >>>> > >>>> This remind me that del_gendisk is still too late to do that. Noted that > >>>> flush_rq can acquire different tags, so if the multiple flush_rq is done > >>>> and those tags are not reused, the flush_rq can exist in multiple > >>>> entries in tags->rqs[]. The consequence I can think of is that iterating > >>>> tags can found the same flush_rq multiple times, and the flush_rq can be > >>>> inflight. > >>> > >>> How can that be one problem? > >>> > >>> Please look at > >>> > >>> commit 364b61818f65 ("blk-mq: clearing flush request reference in tags->rqs[]") > >>> commit bd63141d585b ("blk-mq: clear stale request in tags->rq[] before freeing one request pool") > >>> > >>> and understand the motivation. > >>> > >>> That also means it is just fine to delay blk_mq_clear_flush_rq_mapping() > >>> after disk is deleted. > >> > >> I do understand what you mean. Let's see if you want this to be avoided, > >> for example(no disk is deleted): > > > > It is definitely another issue, and not supposed to be covered by > > blk_mq_clear_flush_rq_mapping(). > > > >> > >> 1) issue a flush, and tag 0 is used, after the flush is done, > >> tags->rqs[0] = flush_rq > >> 2) issue another flush, and tag 1 is used, after the flush is done, > >> tags->rqs[1] = flush_rq > >> 3) issue a flush again, and tag 2 is used, and the flush_rq is > >> dispatched to disk; > > > > Yes, this kind of thing exists since blk-mq begins, and you can't expect > > blk_mq_in_flight_rw() to get accurate inflight requests. > > > >> 4) Then in this case, blk_mq_in_flight_rw() will found the same flush_rq > >> 3 times and think there are 3 inflight request, same for > >> hctx_busy_show()... > > > > But we have tried to avoid it, such as in blk_mq_find_and_get_req() > > req->tag is checked against the current 'bitnr' of sbitmap when walking > > over tags. Then the same flush_rq won't be counted 3 times. > > Yes, I missed that req->tag is checked. Looks like there is no porblem > now. :) > > Just to make sure, we're still on the same page for this patch, right? Yeah, my reviewed-by still stands, and it is fine to delay to clear flush req wrt. the original motivation, what matters is that it is cleared before the flush req is freed. Thanks,
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 741b95dfdbf6..a7c540728f3f 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -821,10 +821,8 @@ int blk_register_queue(struct gendisk *disk) * faster to shut down and is made fully functional here as * request_queues for non-existent devices never get registered. */ - if (!blk_queue_init_done(q)) { - blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q); - percpu_ref_switch_to_percpu(&q->q_usage_counter); - } + blk_queue_flag_set(QUEUE_FLAG_INIT_DONE, q); + percpu_ref_switch_to_percpu(&q->q_usage_counter); return ret; diff --git a/block/genhd.c b/block/genhd.c index dfee66146bd1..87f9c2457ca6 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -742,13 +742,10 @@ void del_gendisk(struct gendisk *disk) * If the disk does not own the queue, allow using passthrough requests * again. Else leave the queue frozen to fail all I/O. */ - if (!test_bit(GD_OWNS_QUEUE, &disk->state)) { - blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q); + if (!test_bit(GD_OWNS_QUEUE, &disk->state)) __blk_mq_unfreeze_queue(q, true); - } else { - if (queue_is_mq(q)) - blk_mq_exit_queue(q); - } + else if (queue_is_mq(q)) + blk_mq_exit_queue(q); if (start_drain) blk_unfreeze_release_lock(q, true, queue_dying);