Message ID | 20240722091633.13128-1-yang.yang@vivo.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] block: fix deadlock between sd_remove & sd_release | expand |
On Mon, Jul 22, 2024 at 5:17 PM Yang Yang <yang.yang@vivo.com> wrote: > > Our test report the following hung task: > > [ 2538.459400] INFO: task "kworker/0:0":7 blocked for more than 188 seconds. > [ 2538.459427] Call trace: > [ 2538.459430] __switch_to+0x174/0x338 > [ 2538.459436] __schedule+0x628/0x9c4 > [ 2538.459442] schedule+0x7c/0xe8 > [ 2538.459447] schedule_preempt_disabled+0x24/0x40 > [ 2538.459453] __mutex_lock+0x3ec/0xf04 > [ 2538.459456] __mutex_lock_slowpath+0x14/0x24 > [ 2538.459459] mutex_lock+0x30/0xd8 > [ 2538.459462] del_gendisk+0xdc/0x350 > [ 2538.459466] sd_remove+0x30/0x60 > [ 2538.459470] device_release_driver_internal+0x1c4/0x2c4 > [ 2538.459474] device_release_driver+0x18/0x28 > [ 2538.459478] bus_remove_device+0x15c/0x174 > [ 2538.459483] device_del+0x1d0/0x358 > [ 2538.459488] __scsi_remove_device+0xa8/0x198 > [ 2538.459493] scsi_forget_host+0x50/0x70 > [ 2538.459497] scsi_remove_host+0x80/0x180 > [ 2538.459502] usb_stor_disconnect+0x68/0xf4 > [ 2538.459506] usb_unbind_interface+0xd4/0x280 > [ 2538.459510] device_release_driver_internal+0x1c4/0x2c4 > [ 2538.459514] device_release_driver+0x18/0x28 > [ 2538.459518] bus_remove_device+0x15c/0x174 > [ 2538.459523] device_del+0x1d0/0x358 > [ 2538.459528] usb_disable_device+0x84/0x194 > [ 2538.459532] usb_disconnect+0xec/0x300 > [ 2538.459537] hub_event+0xb80/0x1870 > [ 2538.459541] process_scheduled_works+0x248/0x4dc > [ 2538.459545] worker_thread+0x244/0x334 > [ 2538.459549] kthread+0x114/0x1bc > > [ 2538.461001] INFO: task "fsck.":15415 blocked for more than 188 seconds. > [ 2538.461014] Call trace: > [ 2538.461016] __switch_to+0x174/0x338 > [ 2538.461021] __schedule+0x628/0x9c4 > [ 2538.461025] schedule+0x7c/0xe8 > [ 2538.461030] blk_queue_enter+0xc4/0x160 > [ 2538.461034] blk_mq_alloc_request+0x120/0x1d4 > [ 2538.461037] scsi_execute_cmd+0x7c/0x23c > [ 2538.461040] ioctl_internal_command+0x5c/0x164 > [ 2538.461046] scsi_set_medium_removal+0x5c/0xb0 > [ 2538.461051] sd_release+0x50/0x94 > [ 2538.461054] blkdev_put+0x190/0x28c > [ 2538.461058] blkdev_release+0x28/0x40 > [ 2538.461063] __fput+0xf8/0x2a8 > [ 2538.461066] __fput_sync+0x28/0x5c > [ 2538.461070] __arm64_sys_close+0x84/0xe8 > [ 2538.461073] invoke_syscall+0x58/0x114 > [ 2538.461078] el0_svc_common+0xac/0xe0 > [ 2538.461082] do_el0_svc+0x1c/0x28 > [ 2538.461087] el0_svc+0x38/0x68 > [ 2538.461090] el0t_64_sync_handler+0x68/0xbc > [ 2538.461093] el0t_64_sync+0x1a8/0x1ac > > T1: T2: > sd_remove > del_gendisk > __blk_mark_disk_dead > blk_freeze_queue_start > ++q->mq_freeze_depth > bdev_release > mutex_lock(&disk->open_mutex) > sd_release > scsi_execute_cmd > blk_queue_enter > wait_event(!q->mq_freeze_depth) > mutex_lock(&disk->open_mutex) > > SCSI does not set GD_OWNS_QUEUE, so QUEUE_FLAG_DYING is not set in > this scenario. This is a classic ABBA deadlock. To fix the deadlock, del_gendisk() shouldn't fail scsi_execute_cmd(), so QUEUE_FLAG_DYING can't be set. > make sure we don't try to acquire disk->open_mutex after freezing > the queue. > > Signed-off-by: Yang Yang <yang.yang@vivo.com> > > --- > Changes from v1: > - Modify commit message by suggestion > --- > block/genhd.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/block/genhd.c b/block/genhd.c > index 8f1f3c6b4d67..c5fca3e893a0 100644 > --- a/block/genhd.c > +++ b/block/genhd.c > @@ -663,12 +663,12 @@ void del_gendisk(struct gendisk *disk) > */ > if (!test_bit(GD_DEAD, &disk->state)) > blk_report_disk_dead(disk, false); > - __blk_mark_disk_dead(disk); > > /* > * Drop all partitions now that the disk is marked dead. > */ > mutex_lock(&disk->open_mutex); > + __blk_mark_disk_dead(disk); > xa_for_each_start(&disk->part_tbl, idx, part, 1) > drop_partition(part); > mutex_unlock(&disk->open_mutex); This fix looks fine, and the added lock dependency is safe since mq_freeze_lock is block layer internal lock, so: Reviewed-by: Ming Lei <ming.lei@redhat.com> Thanks,
On 7/22/24 2:16 AM, Yang Yang wrote: > SCSI does not set GD_OWNS_QUEUE, so QUEUE_FLAG_DYING is not set in > this scenario. This is a classic ABBA deadlock. To fix the deadlock, > make sure we don't try to acquire disk->open_mutex after freezing > the queue. > > Signed-off-by: Yang Yang <yang.yang@vivo.com> Fixes: and Cc: stable tags are missing. Otherwise this patch looks fine to me, hence: Reviewed-by: Bart Van Assche <bvanassche@acm.org>
diff --git a/block/genhd.c b/block/genhd.c index 8f1f3c6b4d67..c5fca3e893a0 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -663,12 +663,12 @@ void del_gendisk(struct gendisk *disk) */ if (!test_bit(GD_DEAD, &disk->state)) blk_report_disk_dead(disk, false); - __blk_mark_disk_dead(disk); /* * Drop all partitions now that the disk is marked dead. */ mutex_lock(&disk->open_mutex); + __blk_mark_disk_dead(disk); xa_for_each_start(&disk->part_tbl, idx, part, 1) drop_partition(part); mutex_unlock(&disk->open_mutex);
Our test report the following hung task: [ 2538.459400] INFO: task "kworker/0:0":7 blocked for more than 188 seconds. [ 2538.459427] Call trace: [ 2538.459430] __switch_to+0x174/0x338 [ 2538.459436] __schedule+0x628/0x9c4 [ 2538.459442] schedule+0x7c/0xe8 [ 2538.459447] schedule_preempt_disabled+0x24/0x40 [ 2538.459453] __mutex_lock+0x3ec/0xf04 [ 2538.459456] __mutex_lock_slowpath+0x14/0x24 [ 2538.459459] mutex_lock+0x30/0xd8 [ 2538.459462] del_gendisk+0xdc/0x350 [ 2538.459466] sd_remove+0x30/0x60 [ 2538.459470] device_release_driver_internal+0x1c4/0x2c4 [ 2538.459474] device_release_driver+0x18/0x28 [ 2538.459478] bus_remove_device+0x15c/0x174 [ 2538.459483] device_del+0x1d0/0x358 [ 2538.459488] __scsi_remove_device+0xa8/0x198 [ 2538.459493] scsi_forget_host+0x50/0x70 [ 2538.459497] scsi_remove_host+0x80/0x180 [ 2538.459502] usb_stor_disconnect+0x68/0xf4 [ 2538.459506] usb_unbind_interface+0xd4/0x280 [ 2538.459510] device_release_driver_internal+0x1c4/0x2c4 [ 2538.459514] device_release_driver+0x18/0x28 [ 2538.459518] bus_remove_device+0x15c/0x174 [ 2538.459523] device_del+0x1d0/0x358 [ 2538.459528] usb_disable_device+0x84/0x194 [ 2538.459532] usb_disconnect+0xec/0x300 [ 2538.459537] hub_event+0xb80/0x1870 [ 2538.459541] process_scheduled_works+0x248/0x4dc [ 2538.459545] worker_thread+0x244/0x334 [ 2538.459549] kthread+0x114/0x1bc [ 2538.461001] INFO: task "fsck.":15415 blocked for more than 188 seconds. [ 2538.461014] Call trace: [ 2538.461016] __switch_to+0x174/0x338 [ 2538.461021] __schedule+0x628/0x9c4 [ 2538.461025] schedule+0x7c/0xe8 [ 2538.461030] blk_queue_enter+0xc4/0x160 [ 2538.461034] blk_mq_alloc_request+0x120/0x1d4 [ 2538.461037] scsi_execute_cmd+0x7c/0x23c [ 2538.461040] ioctl_internal_command+0x5c/0x164 [ 2538.461046] scsi_set_medium_removal+0x5c/0xb0 [ 2538.461051] sd_release+0x50/0x94 [ 2538.461054] blkdev_put+0x190/0x28c [ 2538.461058] blkdev_release+0x28/0x40 [ 2538.461063] __fput+0xf8/0x2a8 [ 2538.461066] __fput_sync+0x28/0x5c [ 2538.461070] __arm64_sys_close+0x84/0xe8 [ 2538.461073] invoke_syscall+0x58/0x114 [ 2538.461078] el0_svc_common+0xac/0xe0 [ 2538.461082] do_el0_svc+0x1c/0x28 [ 2538.461087] el0_svc+0x38/0x68 [ 2538.461090] el0t_64_sync_handler+0x68/0xbc [ 2538.461093] el0t_64_sync+0x1a8/0x1ac T1: T2: sd_remove del_gendisk __blk_mark_disk_dead blk_freeze_queue_start ++q->mq_freeze_depth bdev_release mutex_lock(&disk->open_mutex) sd_release scsi_execute_cmd blk_queue_enter wait_event(!q->mq_freeze_depth) mutex_lock(&disk->open_mutex) SCSI does not set GD_OWNS_QUEUE, so QUEUE_FLAG_DYING is not set in this scenario. This is a classic ABBA deadlock. To fix the deadlock, make sure we don't try to acquire disk->open_mutex after freezing the queue. Signed-off-by: Yang Yang <yang.yang@vivo.com> --- Changes from v1: - Modify commit message by suggestion --- block/genhd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)