Message ID: 20241009113831.557606-2-hch@lst.de (mailing list archive)
State: New
Series: [1/2] block: also mark disk-owned queues as dying in __blk_mark_disk_dead
On (24/10/09 13:38), Christoph Hellwig wrote:
[..]
> @@ -589,8 +589,16 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
>  	if (test_and_set_bit(GD_DEAD, &disk->state))
>  		return;
>
> -	if (test_bit(GD_OWNS_QUEUE, &disk->state))
> -		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> +	/*
> +	 * Also mark the disk dead if it is not owned by the gendisk. This
> +	 * means we can't allow /dev/sg passthrough or SCSI internal commands
> +	 * while unbinding a ULP. That is more than just a bit ugly, but until
> +	 * we untangle q_usage_counter into one owned by the disk and one owned
> +	 * by the queue this is as good as it gets. The flag will be cleared
> +	 * at the end of del_gendisk if it wasn't set before.
> +	 */
> +	if (!test_and_set_bit(QUEUE_FLAG_DYING, &disk->queue->queue_flags))
> +		set_bit(QUEUE_FLAG_RESURRECT, &disk->queue->queue_flags);
>
>  	/*
>  	 * Stop buffered writers from dirtying pages that can't be written out.
> @@ -719,6 +727,10 @@ void del_gendisk(struct gendisk *disk)
>  	 * again. Else leave the queue frozen to fail all I/O.
>  	 */
>  	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
> +		if (test_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags)) {
> +			clear_bit(QUEUE_FLAG_DYING, &q->queue_flags);
> +			clear_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags);
> +		}

Christoph, shouldn't the QUEUE_FLAG_RESURRECT handling be outside of the
GD_OWNS_QUEUE if-block? Because __blk_mark_disk_dead() sets
QUEUE_FLAG_DYING/QUEUE_FLAG_RESURRECT regardless of GD_OWNS_QUEUE.

// A silly nit: it seems the code uses the blk_queue_flag_set() and
// blk_queue_flag_clear() helpers, but there is no queue_flag_test(),
// so I don't know what the preference is here - stick to the queue_flag
// helpers, or is it OK to mix them.

>  		blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
>  		__blk_mq_unfreeze_queue(q, true);
On Wed, Oct 09, 2024 at 09:31:23PM +0900, Sergey Senozhatsky wrote:
> >  	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
> > +		if (test_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags)) {
> > +			clear_bit(QUEUE_FLAG_DYING, &q->queue_flags);
> > +			clear_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags);
> > +		}
>
> Christoph, shouldn't the QUEUE_FLAG_RESURRECT handling be outside of the
> GD_OWNS_QUEUE if-block? Because __blk_mark_disk_dead() sets
> QUEUE_FLAG_DYING/QUEUE_FLAG_RESURRECT regardless of GD_OWNS_QUEUE.

For !GD_OWNS_QUEUE the queue is freed right below, so there isn't much
of a point.

> // A silly nit: it seems the code uses the blk_queue_flag_set() and
> // blk_queue_flag_clear() helpers, but there is no queue_flag_test(),
> // so I don't know what the preference is here - stick to the queue_flag
> // helpers, or is it OK to mix them.

Yeah. I looked into a test_and_set wrapper, but then saw how pointless
the existing wrappers are. So for now this just open codes it, and once
we're done with the fixes I plan to just send a patch to remove the
wrappers entirely.
On (24/10/09 14:41), Christoph Hellwig wrote:
> On Wed, Oct 09, 2024 at 09:31:23PM +0900, Sergey Senozhatsky wrote:
> > >  	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
> > > +		if (test_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags)) {
> > > +			clear_bit(QUEUE_FLAG_DYING, &q->queue_flags);
> > > +			clear_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags);
> > > +		}
> >
> > Christoph, shouldn't the QUEUE_FLAG_RESURRECT handling be outside of
> > the GD_OWNS_QUEUE if-block? Because __blk_mark_disk_dead() sets
> > QUEUE_FLAG_DYING/QUEUE_FLAG_RESURRECT regardless of GD_OWNS_QUEUE.
>
> For !GD_OWNS_QUEUE the queue is freed right below, so there isn't much
> of a point.

Oh, right.

> > // A silly nit: it seems the code uses the blk_queue_flag_set() and
> > // blk_queue_flag_clear() helpers, but there is no queue_flag_test(),
> > // so I don't know what the preference is here - stick to the
> > // queue_flag helpers, or is it OK to mix them.
>
> Yeah. I looked into a test_and_set wrapper, but then saw how pointless
> the existing wrappers are.

Likewise.

> So for now this just open codes it, and once we're done with the fixes
> I plan to just send a patch to remove the wrappers entirely.

Ack.
On 10/9/24 6:41 AM, Christoph Hellwig wrote:
>> // A silly nit: it seems the code uses the blk_queue_flag_set() and
>> // blk_queue_flag_clear() helpers, but there is no queue_flag_test(),
>> // so I don't know what the preference is here - stick to the
>> // queue_flag helpers, or is it OK to mix them.
>
> Yeah. I looked into a test_and_set wrapper, but then saw how pointless
> the existing wrappers are. So for now this just open codes it, and
> once we're done with the fixes I plan to just send a patch to remove
> the wrappers entirely.

Agree, but that's because you didn't do it back when you changed them to
be just set/clear bit operations ;-). They should definitely just go
away now.
On 2024/10/9 19:38, Christoph Hellwig wrote:
> When del_gendisk shuts down access to a gendisk, it could lead to a
> deadlock with sd or sr, which try to submit passthrough SCSI commands
> from their ->release method under open_mutex. The submission can be
> blocked in blk_enter_queue while del_gendisk can't get to actually
> telling them to stop and wake them up.
>
> As the disk is going away there is no real point in sending these
> commands, but we have no really good way to distinguish between the
> cases. For now mark even standalone (aka SCSI) queues as dying in
> del_gendisk to avoid this deadlock, but the real fix will be to split
> freeing a disk from freezing a queue for non-disk-associated requests.
>
> Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  block/genhd.c          | 16 ++++++++++++++--
>  include/linux/blkdev.h |  1 +
>  2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 1c05dd4c6980b5..7026569fa8a0be 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -589,8 +589,16 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
>  	if (test_and_set_bit(GD_DEAD, &disk->state))
>  		return;
>
> -	if (test_bit(GD_OWNS_QUEUE, &disk->state))
> -		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> +	/*
> +	 * Also mark the disk dead if it is not owned by the gendisk. This
> +	 * means we can't allow /dev/sg passthrough or SCSI internal commands
> +	 * while unbinding a ULP. That is more than just a bit ugly, but until
> +	 * we untangle q_usage_counter into one owned by the disk and one owned
> +	 * by the queue this is as good as it gets. The flag will be cleared
> +	 * at the end of del_gendisk if it wasn't set before.
> +	 */
> +	if (!test_and_set_bit(QUEUE_FLAG_DYING, &disk->queue->queue_flags))
> +		set_bit(QUEUE_FLAG_RESURRECT, &disk->queue->queue_flags);
>
>  	/*
>  	 * Stop buffered writers from dirtying pages that can't be written out.
> @@ -719,6 +727,10 @@ void del_gendisk(struct gendisk *disk)
>  	 * again. Else leave the queue frozen to fail all I/O.
>  	 */
>  	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
> +		if (test_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags)) {
> +			clear_bit(QUEUE_FLAG_DYING, &q->queue_flags);
> +			clear_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags);
> +		}
>  		blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
>  		__blk_mq_unfreeze_queue(q, true);
>  	} else {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 50c3b959da2816..391e3eb3bb5e61 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -590,6 +590,7 @@ struct request_queue {
>  /* Keep blk_queue_flag_name[] in sync with the definitions below */
>  enum {
>  	QUEUE_FLAG_DYING,		/* queue being torn down */
> +	QUEUE_FLAG_RESURRECT,		/* temporarily dying */
>  	QUEUE_FLAG_NOMERGES,		/* disable merge attempts */
>  	QUEUE_FLAG_SAME_COMP,		/* complete on same CPU-group */
>  	QUEUE_FLAG_FAIL_IO,		/* fake timeout */

Looks good. Feel free to add:

Reviewed-by: Yang Yang <yang.yang@vivo.com>

Thanks.
On Wed, Oct 09, 2024 at 01:38:20PM +0200, Christoph Hellwig wrote:
> When del_gendisk shuts down access to a gendisk, it could lead to a
> deadlock with sd or sr, which try to submit passthrough SCSI commands
> from their ->release method under open_mutex. The submission can be
> blocked in blk_enter_queue while del_gendisk can't get to actually
> telling them to stop and wake them up.
>
> As the disk is going away there is no real point in sending these
> commands, but we have no really good way to distinguish between the
> cases. For now mark even standalone (aka SCSI) queues as dying in
> del_gendisk to avoid this deadlock, but the real fix will be to split
> freeing a disk from freezing a queue for non-disk-associated requests.
>
> Reported-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
> ---
>  block/genhd.c          | 16 ++++++++++++++--
>  include/linux/blkdev.h |  1 +
>  2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 1c05dd4c6980b5..7026569fa8a0be 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -589,8 +589,16 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
>  	if (test_and_set_bit(GD_DEAD, &disk->state))
>  		return;
>
> -	if (test_bit(GD_OWNS_QUEUE, &disk->state))
> -		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> +	/*
> +	 * Also mark the disk dead if it is not owned by the gendisk. This
> +	 * means we can't allow /dev/sg passthrough or SCSI internal commands
> +	 * while unbinding a ULP. That is more than just a bit ugly, but until
> +	 * we untangle q_usage_counter into one owned by the disk and one owned
> +	 * by the queue this is as good as it gets. The flag will be cleared
> +	 * at the end of del_gendisk if it wasn't set before.
> +	 */
> +	if (!test_and_set_bit(QUEUE_FLAG_DYING, &disk->queue->queue_flags))
> +		set_bit(QUEUE_FLAG_RESURRECT, &disk->queue->queue_flags);

Setting QUEUE_FLAG_DYING may fail passthrough requests for
!GD_OWNS_QUEUE; I guess this may cause a SCSI regression.

blk_queue_enter() needs to wait until RESURRECT & DYING are cleared
instead of returning failure.

Thanks,
Ming
On Wed, Oct 16, 2024 at 07:09:48PM +0800, Ming Lei wrote:
> Setting QUEUE_FLAG_DYING may fail passthrough requests for
> !GD_OWNS_QUEUE; I guess this may cause a SCSI regression.

Yes, as clearly documented in the commit log.

> blk_queue_enter() needs to wait until RESURRECT & DYING are cleared
> instead of returning failure.

What we really need is to split the enter conditions between the disk
and the standalone queue. But until then I think the current version is
reasonable enough.
On Wed, Oct 16, 2024 at 02:32:40PM +0200, Christoph Hellwig wrote:
> On Wed, Oct 16, 2024 at 07:09:48PM +0800, Ming Lei wrote:
> > Setting QUEUE_FLAG_DYING may fail passthrough requests for
> > !GD_OWNS_QUEUE; I guess this may cause a SCSI regression.
>
> Yes, as clearly documented in the commit log.

The change needs to Cc linux-scsi.

> As the disk is going away there is no real point in sending these
> commands, but we have no really good way to distinguish between the
> cases.

The SCSI request queue has a very different lifetime from the gendisk,
so I'm not sure the above comment is correct.

Thanks,
Ming
On Wed, Oct 09, 2024 at 01:38:20PM +0200, Christoph Hellwig wrote:
> When del_gendisk shuts down access to a gendisk, it could lead to a
> deadlock with sd or sr, which try to submit passthrough SCSI commands
> from their ->release method under open_mutex. The submission can be
> blocked in blk_enter_queue while del_gendisk can't get to actually
> telling them to stop and wake them up.

When ->release() waits in blk_enter_queue(), the following code block

	mutex_lock(&disk->open_mutex);
	__blk_mark_disk_dead(disk);
	xa_for_each_start(&disk->part_tbl, idx, part, 1)
		drop_partition(part);
	mutex_unlock(&disk->open_mutex);

in del_gendisk() should have been done. Then del_gendisk() should move
on and finally unfreeze the queue, so I still don't see how the above
deadlock is triggered.

Thanks,
Ming
On (24/10/16 21:35), Ming Lei wrote:
> On Wed, Oct 09, 2024 at 01:38:20PM +0200, Christoph Hellwig wrote:
> > When del_gendisk shuts down access to a gendisk, it could lead to a
> > deadlock with sd or sr, which try to submit passthrough SCSI commands
> > from their ->release method under open_mutex. The submission can be
> > blocked in blk_enter_queue while del_gendisk can't get to actually
> > telling them to stop and wake them up.
>
> When ->release() waits in blk_enter_queue(), the following code block
>
> 	mutex_lock(&disk->open_mutex);
> 	__blk_mark_disk_dead(disk);
> 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
> 		drop_partition(part);
> 	mutex_unlock(&disk->open_mutex);

blk_enter_queue()->schedule() holds ->open_mutex, so that block of code
sleeps on ->open_mutex. We can't drain under ->open_mutex.
On Sat, Oct 19, 2024 at 10:25:41AM +0900, Sergey Senozhatsky wrote:
> On (24/10/16 21:35), Ming Lei wrote:
> > On Wed, Oct 09, 2024 at 01:38:20PM +0200, Christoph Hellwig wrote:
> > > When del_gendisk shuts down access to a gendisk, it could lead to
> > > a deadlock with sd or sr, which try to submit passthrough SCSI
> > > commands from their ->release method under open_mutex. The
> > > submission can be blocked in blk_enter_queue while del_gendisk
> > > can't get to actually telling them to stop and wake them up.
> >
> > When ->release() waits in blk_enter_queue(), the following code block
> >
> > 	mutex_lock(&disk->open_mutex);
> > 	__blk_mark_disk_dead(disk);
> > 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
> > 		drop_partition(part);
> > 	mutex_unlock(&disk->open_mutex);
>
> blk_enter_queue()->schedule() holds ->open_mutex, so that block of code
> sleeps on ->open_mutex. We can't drain under ->open_mutex.

We haven't started to drain yet, so why does blk_enter_queue() sleep,
and what is it waiting for?

Thanks,
Ming
On (24/10/19 20:32), Ming Lei wrote:
[..]
> > > When ->release() waits in blk_enter_queue(), the following code block
> > >
> > > 	mutex_lock(&disk->open_mutex);
> > > 	__blk_mark_disk_dead(disk);
> > > 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
> > > 		drop_partition(part);
> > > 	mutex_unlock(&disk->open_mutex);
> >
> > blk_enter_queue()->schedule() holds ->open_mutex, so that block of
> > code sleeps on ->open_mutex. We can't drain under ->open_mutex.
>
> We haven't started to drain yet, so why does blk_enter_queue() sleep,
> and what is it waiting for?

Unfortunately I don't have a device to repro this, but it happens to a
number of our customers (using different peripheral devices, but, as
far as I can tell, all running the 6.6 kernel).
On Sat, Oct 19, 2024 at 09:37:27PM +0900, Sergey Senozhatsky wrote:
> On (24/10/19 20:32), Ming Lei wrote:
> [..]
> > > > When ->release() waits in blk_enter_queue(), the following code
> > > > block
> > > >
> > > > 	mutex_lock(&disk->open_mutex);
> > > > 	__blk_mark_disk_dead(disk);
> > > > 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
> > > > 		drop_partition(part);
> > > > 	mutex_unlock(&disk->open_mutex);
> > >
> > > blk_enter_queue()->schedule() holds ->open_mutex, so that block of
> > > code sleeps on ->open_mutex. We can't drain under ->open_mutex.
> >
> > We haven't started to drain yet, so why does blk_enter_queue()
> > sleep, and what is it waiting for?
>
> Unfortunately I don't have a device to repro this, but it happens to a
> number of our customers (using different peripheral devices, but, as
> far as I can tell, all running the 6.6 kernel).

I can understand the issue on v6.6 because it doesn't have commit
7e04da2dc701 ("block: fix deadlock between sd_remove & sd_release").

But for the latest upstream, I don't see how it can happen.

Thanks,
Ming
On (24/10/19 20:50), Ming Lei wrote:
> On Sat, Oct 19, 2024 at 09:37:27PM +0900, Sergey Senozhatsky wrote:
> > On (24/10/19 20:32), Ming Lei wrote:
> > [..]
> > Unfortunately I don't have a device to repro this, but it happens to
> > a number of our customers (using different peripheral devices, but,
> > as far as I can tell, all running the 6.6 kernel).
>
> I can understand the issue on v6.6 because it doesn't have commit
> 7e04da2dc701 ("block: fix deadlock between sd_remove & sd_release").

We have that one in 6.6, as far as I can tell:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/block/genhd.c?h=v6.6.57#n663
On Sat, Oct 19, 2024 at 09:58:04PM +0900, Sergey Senozhatsky wrote:
> On (24/10/19 20:50), Ming Lei wrote:
> > On Sat, Oct 19, 2024 at 09:37:27PM +0900, Sergey Senozhatsky wrote:
> > > On (24/10/19 20:32), Ming Lei wrote:
> > > [..]
> > > Unfortunately I don't have a device to repro this, but it happens
> > > to a number of our customers (using different peripheral devices,
> > > but, as far as I can tell, all running the 6.6 kernel).
> >
> > I can understand the issue on v6.6 because it doesn't have commit
> > 7e04da2dc701 ("block: fix deadlock between sd_remove & sd_release").
>
> We have that one in 6.6, as far as I can tell:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/block/genhd.c?h=v6.6.57#n663

Then we need to root-cause it first.

If you can reproduce it, please provide the dmesg log and the
deadlock-related process stack traces collected via sysrq.

thanks,
Ming
On (24/10/19 21:09), Ming Lei wrote:
> On Sat, Oct 19, 2024 at 09:58:04PM +0900, Sergey Senozhatsky wrote:
> > On (24/10/19 20:50), Ming Lei wrote:
> > > On Sat, Oct 19, 2024 at 09:37:27PM +0900, Sergey Senozhatsky wrote:
[..]
> Then we need to root-cause it first.
>
> If you can reproduce it

I cannot.

All I have are backtraces from various crash reports; I posted some of
them earlier [1] (and in that entire thread). This looks like a
close()->bio_queue_enter() vs usb_disconnect()->del_gendisk() deadlock,
and del_gendisk() cannot drain. Draining under the same lock that the
things we want to drain currently hold looks troublesome in general.

[1] https://lore.kernel.org/linux-block/20241008051948.GB10794@google.com
On Sat, Oct 19, 2024 at 10:50:10PM +0900, Sergey Senozhatsky wrote:
> On (24/10/19 21:09), Ming Lei wrote:
> > On Sat, Oct 19, 2024 at 09:58:04PM +0900, Sergey Senozhatsky wrote:
> [..]
> > Then we need to root-cause it first.
> >
> > If you can reproduce it
>
> I cannot.
>
> All I have are backtraces from various crash reports; I posted some of
> them earlier [1] (and in that entire thread). This looks like a
> close()->bio_queue_enter() vs usb_disconnect()->del_gendisk()
> deadlock, and del_gendisk() cannot drain. Draining under the same lock
> that the things we want to drain currently hold looks troublesome in
> general.
>
> [1] https://lore.kernel.org/linux-block/20241008051948.GB10794@google.com

Probably bio_queue_enter() waits for runtime PM: the queue is in
->pm_only state, and BLK_MQ_REQ_PM isn't actually passed from
ioctl_internal_command() <- scsi_set_medium_removal().

And if you have a vmcore collected, it shouldn't be hard to root-cause.

Also, I'd suggest collecting the full related dmesg log in the future
instead of providing a selective log; there isn't even a kernel
version...

Thanks,
Ming
On (24/10/19 23:03), Ming Lei wrote:
> Probably bio_queue_enter() waits for runtime PM: the queue is in
> ->pm_only state, and BLK_MQ_REQ_PM isn't actually passed from
> ioctl_internal_command() <- scsi_set_medium_removal().
>
> And if you have a vmcore collected, it shouldn't be hard to root-cause.

We don't collect those.

> Also, I'd suggest collecting the full related dmesg log in the future
> instead of providing a selective log; there isn't even a kernel
> version...

These "selected" backtraces are the only backtraces in the dmesg. I
literally have reports that contain just two backtraces of tasks blocked
for over 120 seconds: one in close()->bio_queue_enter()->schedule()
(under ->open_mutex) and the other in
del_gendisk()->mutex_lock()->schedule().
On (24/10/19 23:03), Ming Lei wrote:
> there isn't even a kernel version...

Well, that's on me, yes, I admit it. I completely missed that, but it
was never a secret [1]. I missed it, probably, because I would not have
reached out to upstream with a 5.4 bug report; and 6.6, in that part of
the code, looked quite close to upstream. But well, I forgot to add the
kernel version, yes.

[1] https://lore.kernel.org/linux-block/20241003135504.GL11458@google.com
diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980b5..7026569fa8a0be 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -589,8 +589,16 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
 	if (test_and_set_bit(GD_DEAD, &disk->state))
 		return;
 
-	if (test_bit(GD_OWNS_QUEUE, &disk->state))
-		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
+	/*
+	 * Also mark the disk dead if it is not owned by the gendisk. This
+	 * means we can't allow /dev/sg passthrough or SCSI internal commands
+	 * while unbinding a ULP. That is more than just a bit ugly, but until
+	 * we untangle q_usage_counter into one owned by the disk and one owned
+	 * by the queue this is as good as it gets. The flag will be cleared
+	 * at the end of del_gendisk if it wasn't set before.
+	 */
+	if (!test_and_set_bit(QUEUE_FLAG_DYING, &disk->queue->queue_flags))
+		set_bit(QUEUE_FLAG_RESURRECT, &disk->queue->queue_flags);
 
 	/*
 	 * Stop buffered writers from dirtying pages that can't be written out.
@@ -719,6 +727,10 @@ void del_gendisk(struct gendisk *disk)
 	 * again. Else leave the queue frozen to fail all I/O.
 	 */
 	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
+		if (test_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags)) {
+			clear_bit(QUEUE_FLAG_DYING, &q->queue_flags);
+			clear_bit(QUEUE_FLAG_RESURRECT, &q->queue_flags);
+		}
 		blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
 		__blk_mq_unfreeze_queue(q, true);
 	} else {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50c3b959da2816..391e3eb3bb5e61 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -590,6 +590,7 @@ struct request_queue {
 /* Keep blk_queue_flag_name[] in sync with the definitions below */
 enum {
 	QUEUE_FLAG_DYING,		/* queue being torn down */
+	QUEUE_FLAG_RESURRECT,		/* temporarily dying */
 	QUEUE_FLAG_NOMERGES,		/* disable merge attempts */
 	QUEUE_FLAG_SAME_COMP,		/* complete on same CPU-group */
 	QUEUE_FLAG_FAIL_IO,		/* fake timeout */