Message ID | 20220214095107.3t5en5a3tosaeoo6@ipetronik.com (mailing list archive)
---|---
State | New, archived
Series | [RFC] nvme: prevent hang on surprise removal of NVMe disk
On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
> After the surprise removal of a mounted NVMe disk the pciehp task
> reliably hangs forever with a trace similar to this one:
>
> INFO: task irq/43-pciehp:64 blocked for more than 120 seconds.
> Call Trace:
>  <TASK>
>  __bio_queue_enter
>  blk_mq_submit_bio
>  submit_bio_noacct
>  submit_bio_wait
>  blkdev_issue_flush
>  ext4_sync_fs
>  sync_filesystem
>  fsync_bdev
>  delete_partition
>  blk_drop_partitions
>  del_gendisk
>  nvme_ns_remove
>  nvme_remove_namespaces
>  nvme_remove
>  pci_device_remove
>  __device_release_driver
>  device_release_driver
>  pci_stop_bus_device
>  pci_stop_and_remove_bus_device
>  pciehp_unconfigure_device
>  pciehp_disable_slot
>  pciehp_handle_presence_or_link_change
>  pciehp_ist
>  </TASK>
>
> I observed this with 5.15.5 from debian bullseye-backports and confirmed
> with 5.17.0-rc3 but previous kernels may be affected as well.

Thanks for the patch.

Entering the queue used to fail if blk_queue_dying() was true. The
condition was changed in commit:

  8e141f9eb803e209714a80aa6ec073893f94c526
  block: drain file system I/O on del_gendisk

I can't actually tell if not checking the DYING flag was intentional or
not, since the comments in blk_queue_start_drain() say otherwise.

Christoph, do you know the intention here? Should __bio_queue_enter()
check the queue DYING flag, or do you prefer that drivers explicitly set
the disk state like this? It looks to me like the queue flags should be
checked since that's already tied to the freeze wait_queue_head_t.

> @@ -4573,6 +4573,8 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
>  	if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
>  		return;
>
> +	set_bit(GD_DEAD, &ns->disk->state);
> +
>  	blk_set_queue_dying(ns->queue);
>  	nvme_start_ns_queue(ns);
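For context, the wait loop in question looks roughly like this (a much
simplified paraphrase of the v5.16/v5.17 __bio_queue_enter(), not verbatim
kernel code; REQ_NOWAIT and runtime-PM handling are omitted). The waiter only
re-checks GD_DEAD on the disk, never QUEUE_FLAG_DYING on the queue, which is
why the flush bio issued during del_gendisk() sleeps forever once the queue
has been frozen:

```
/*
 * Simplified paraphrase of __bio_queue_enter() around v5.16/v5.17.
 * The real function also handles REQ_NOWAIT and runtime PM.
 */
static int __bio_queue_enter(struct request_queue *q, struct bio *bio)
{
	while (!blk_try_enter_queue(q, false)) {
		struct gendisk *disk = bio->bi_bdev->bd_disk;

		/*
		 * Sleep until the queue is unfrozen or the disk is marked
		 * GD_DEAD.  QUEUE_FLAG_DYING is not part of the condition,
		 * so a frozen queue that is merely "dying" never wakes
		 * this waiter up.
		 */
		wait_event(q->mq_freeze_wq,
			   !q->mq_freeze_depth ||
			   test_bit(GD_DEAD, &disk->state));
		if (test_bit(GD_DEAD, &disk->state)) {
			bio_io_error(bio);
			return -ENODEV;
		}
	}
	return 0;
}
```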
On Tue, Feb 15, 2022 at 07:22:40AM -0800, Keith Busch wrote:
> I can't actually tell if not checking the DYING flag was intentional or
> not, since the comments in blk_queue_start_drain() say otherwise.
>
> Christoph, do you know the intention here? Should __bio_queue_enter()
> check the queue DYING flag, or do you prefer that drivers explicitly set
> the disk state like this? It looks to me like the queue flags should be
> checked since that's already tied to the freeze wait_queue_head_t.

It was intentional, but maybe not fully thought out. Do you remember why
we're doing the manual setting of the dying flag instead of just calling
del_gendisk early on in nvme? Calling del_gendisk is supposed to be all
that a driver needs to do.
On Tue, Feb 15, 2022 at 07:47:04PM +0100, Christoph Hellwig wrote:
> On Tue, Feb 15, 2022 at 07:22:40AM -0800, Keith Busch wrote:
> > I can't actually tell if not checking the DYING flag was intentional or
> > not, since the comments in blk_queue_start_drain() say otherwise.
> >
> > Christoph, do you know the intention here? Should __bio_queue_enter()
> > check the queue DYING flag, or do you prefer that drivers explicitly set
> > the disk state like this? It looks to me like the queue flags should be
> > checked since that's already tied to the freeze wait_queue_head_t.
>
> It was intentional, but maybe not fully thought out. Do you remember why
> we're doing the manual setting of the dying flag instead of just calling
> del_gendisk early on in nvme? Calling del_gendisk is supposed to be all
> that a driver needs to do.

When the driver concludes that new requests can't ever succeed, we had
been setting the queue to DYING first so that new requests can't enter
and prevent forward progress.

AFAICT, just calling del_gendisk() is fine for a graceful removal. It
calls fsync_bdev() to flush out pending writes before setting the disk
state to "DEAD".

Setting the queue to dying first will "freeze" the queue, which is why
fsync_bdev() is blocked. We were relying on the queue DYING flag to
prevent that from blocking.

Perhaps another way to do this would be to remove the queue DYING
setting and let the driver fail new requests internally instead. There
may be some issues with doing it that way IIRC, but blk-mq has evolved a
bit from where we started, so I'd need to test it out to confirm.
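To make that ordering concrete, here is a rough sketch of the v5.16-era
helpers the surprise-removal path calls before del_gendisk() (paraphrased,
not verbatim kernel code). The queue freeze started here is exactly what the
later fsync_bdev() flush gets stuck behind in __bio_queue_enter():

```
/* Rough paraphrase of the v5.16-era block layer helpers, not verbatim. */

void blk_queue_start_drain(struct request_queue *q)
{
	/* Block new submitters: they will now wait in __bio_queue_enter(). */
	blk_freeze_queue_start(q);
	if (queue_is_mq(q))
		blk_mq_wake_waiters(q);
	/* Intended to make blk_queue_enter() re-examine the DYING flag. */
	wake_up_all(&q->mq_freeze_wq);
}

void blk_set_queue_dying(struct request_queue *q)
{
	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
	blk_queue_start_drain(q);
}
```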
On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
> After the surprise removal of a mounted NVMe disk the pciehp task
> reliably hangs forever with a trace similar to this one:

Do you have a specific reproducer? At least when doing a

  echo 1 > /sys/.../remove

while running fsx on a file system I can't actually reproduce it.
On Tue, Feb 15, 2022 at 08:17:31PM +0100, Christoph Hellwig wrote:
> On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
> > After the surprise removal of a mounted NVMe disk the pciehp task
> > reliably hangs forever with a trace similar to this one:
>
> Do you have a specific reproducer? At least when doing a
>
>   echo 1 > /sys/.../remove
>
> while running fsx on a file system I can't actually reproduce it.

That's a graceful removal. You need to do something that terminates the
connection without the driver knowing about it.

If you don't have a hotplug capable system, you can do something
slightly destructive to the PCI link to force an ungraceful teardown,
though you'll need to wait for the IO timeout before anything
interesting will happen:

  # setpci -s "${slot}" CAP_EXP+10.w=10:10

The "$slot" needs to be the B:D.f of the bridge connecting to your NVMe
end device. An example of getting it for a non-multipath PCIe attached
nvme0n1:

  # readlink -f /sys/block/nvme0n1/device | \
      grep -Eo '[0-9a-f]{4,5}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | \
      tail -2 | head -1
So, I think in the short run setting GD_DEAD is the right thing as a
non-invasive fix, but I'd rather do it in blk_set_queue_dying to also
cover the other drivers with a similar pattern. blk_set_queue_dying is
never used for cases where q->disk isn't valid, so that should be fine.

In the long run I think we just need to remove the fsync_bdev in
del_gendisk, or at least make it conditional.
On 2/15/22 20:17, Christoph Hellwig wrote:
> On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
>> After the surprise removal of a mounted NVMe disk the pciehp task
>> reliably hangs forever with a trace similar to this one:
>
> Do you have a specific reproducer? At least when doing a
>
>   echo 1 > /sys/.../remove
>
> while running fsx on a file system I can't actually reproduce it.

You should be able to reproduce it by doing a PCI hotplug from qemu.

Cheers,

Hannes
On Tue, Feb 15, 2022 at 08:17:31PM +0100, Christoph Hellwig wrote:
> On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
> > After the surprise removal of a mounted NVMe disk the pciehp task
> > reliably hangs forever with a trace similar to this one:
>
> Do you have a specific reproducer? At least when doing a
>
>   echo 1 > /sys/.../remove
>
> while running fsx on a file system I can't actually reproduce it.

We built our own enclosures with a custom connector to plug the disks.
So an external enclosure for thunderbolt is probably very similar.
(Or just ripping an unscrewed NVMe out of the M.2 ...)

But as already suggested, qemu might also be very useful here, as it
also allows us to test multiple namespaces and multipath I/O, if
you/someone wants to check those too (hotplug with multipath I/O really
scares me).
On 2/16/22 12:18, Markus Blöchl wrote:
> On Tue, Feb 15, 2022 at 08:17:31PM +0100, Christoph Hellwig wrote:
>> On Mon, Feb 14, 2022 at 10:51:07AM +0100, Markus Blöchl wrote:
>>> After the surprise removal of a mounted NVMe disk the pciehp task
>>> reliably hangs forever with a trace similar to this one:
>>
>> Do you have a specific reproducer? At least when doing a
>>
>>   echo 1 > /sys/.../remove
>>
>> while running fsx on a file system I can't actually reproduce it.
>
> We built our own enclosures with a custom connector to plug the disks.
> So an external enclosure for thunderbolt is probably very similar.
> (Or just ripping an unscrewed NVMe out of the M.2 ...)
>
> But as already suggested, qemu might also be very useful here, as it
> also allows us to test multiple namespaces and multipath I/O, if
> you/someone wants to check those too (hotplug with multipath I/O really
> scares me).

Nothing to be scared of. I've tested this extensively in the run-up to
commit 5396fdac56d8 ("nvme: fix refcounting imbalance when all paths are
down"), which, incidentally, is something you need if you want to test
things.

Let me see if I can dig up the testbed.

Cheers,

Hannes
On Tue, Feb 15, 2022 at 09:17:38PM +0100, Christoph Hellwig wrote:
> So, I think in the short run setting GD_DEAD is the right thing as a
> non-invasive fix, but I'd rather do it in blk_set_queue_dying to also
> cover the other drivers with a similar pattern. blk_set_queue_dying is
> never used for cases where q->disk isn't valid, so that should be fine.

blk_set_queue_dying() is called from blk_cleanup_queue(), which in turn
is called for all kinds of queues without an associated disk, even ones
with failed allocations. But according to disk_release() a disk which
was once associated should always outlive its queue, so q->disk should
indeed always be safe. It's a little unfortunate that the existing doc
on request queues was so outdated that it had to be removed.

So I was trying to figure out if blk_queue_dying(q) should always imply
(!q->disk || test_bit(GD_DEAD, &q->disk->state)). Assuming that there is
a 1:1 relationship between q and q->disk and that blk_queue_dying(q) was
previously used where GD_DEAD is now, I do hope so, since GD_DEAD should
just be a weaker form of QUEUE_FLAG_DYING that allows killing the disk
well before the queue.

> In the long run I think we just need to remove the fsync_bdev in
> del_gendisk, or at least make it conditional.

But wouldn't that require passing a flag like SURPRISE_REMOVAL into
del_gendisk() and passing it along through delete_partition()? You would
still need GD_DEAD since someone might remove the disk while you are
syncing, so this only sounds like an error-prone optimization to me.
After all, syncing on graceful removal seems to be a sensible thing to
do.

Fiddling with qemu will probably take me too long unless Hannes already
has something up his sleeve. So I'll submit a v2 after I have regained
access to the test setup. It will probably just look something like
this:

```
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -286,6 +286,8 @@ void blk_queue_start_drain(struct request_queue *q)
 
 void blk_set_queue_dying(struct request_queue *q)
 {
+	if (q->disk)
+		set_bit(GD_DEAD, &q->disk->state);
 	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
 	blk_queue_start_drain(q);
 }
```

Thanks for all your input,

Markus
I'd do something like this, which gets us a properly documented interface
(the del_gendisk change will be split into a separate patch):

diff --git a/block/blk-core.c b/block/blk-core.c
index d93e3bb9a769b..15d5c5ba5bbe5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -284,12 +284,19 @@ void blk_queue_start_drain(struct request_queue *q)
 	wake_up_all(&q->mq_freeze_wq);
 }
 
-void blk_set_queue_dying(struct request_queue *q)
+/**
+ * blk_mark_disk_dead - mark a disk as dead
+ * @disk: disk to mark as dead
+ *
+ * Mark the disk as dead (e.g. surprise removed) and don't accept any new I/O
+ * to this disk.
+ */
+void blk_mark_disk_dead(struct gendisk *disk)
 {
-	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
-	blk_queue_start_drain(q);
+	set_bit(GD_DEAD, &disk->state);
+	blk_queue_start_drain(disk->queue);
 }
-EXPORT_SYMBOL_GPL(blk_set_queue_dying);
+EXPORT_SYMBOL_GPL(blk_mark_disk_dead);
 
 /**
  * blk_cleanup_queue - shutdown a request queue
@@ -308,7 +315,8 @@ void blk_cleanup_queue(struct request_queue *q)
 	WARN_ON_ONCE(blk_queue_registered(q));
 
 	/* mark @q DYING, no new request or merges will be allowed afterwards */
-	blk_set_queue_dying(q);
+	blk_queue_flag_set(QUEUE_FLAG_DYING, q);
+	blk_queue_start_drain(q);
 
 	blk_queue_flag_set(QUEUE_FLAG_NOMERGES, q);
 	blk_queue_flag_set(QUEUE_FLAG_NOXMERGES, q);
diff --git a/block/genhd.c b/block/genhd.c
index 626c8406f21a6..35e49b7f1112a 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -584,7 +584,8 @@ void del_gendisk(struct gendisk *disk)
 	blk_drop_partitions(disk);
 	mutex_unlock(&disk->open_mutex);
 
-	fsync_bdev(disk->part0);
+	if (!test_bit(GD_DEAD, &disk->state))
+		fsync_bdev(disk->part0);
 	__invalidate_device(disk->part0, true);
 
 	/*
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index e6005c2323281..2b588b62cbbb2 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -4112,7 +4112,7 @@ static void mtip_pci_remove(struct pci_dev *pdev)
 			"Completion workers still active!\n");
 	}
 
-	blk_set_queue_dying(dd->queue);
+	blk_mark_disk_dead(dd->disk);
 	set_bit(MTIP_DDF_REMOVE_PENDING_BIT, &dd->dd_flag);
 
 	/* Clean up the block layer. */
diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 4203cdab8abfd..b844432bad20b 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -7185,7 +7185,7 @@ static ssize_t do_rbd_remove(struct bus_type *bus,
 		 * IO to complete/fail.
 		 */
 		blk_mq_freeze_queue(rbd_dev->disk->queue);
-		blk_set_queue_dying(rbd_dev->disk->queue);
+		blk_mark_disk_dead(rbd_dev->disk);
 	}
 
 	del_gendisk(rbd_dev->disk);
diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index ccd0dd0c6b83c..ca71a0585333f 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2126,7 +2126,7 @@ static void blkfront_closing(struct blkfront_info *info)
 
 	/* No more blkif_request(). */
 	blk_mq_stop_hw_queues(info->rq);
-	blk_set_queue_dying(info->rq);
+	blk_mark_disk_dead(info->gd);
 	set_capacity(info->gd, 0);
 
 	for_each_rinfo(info, rinfo, i) {
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index dcbd6d201619d..997ace47bbd54 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2077,7 +2077,7 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
 	set_bit(DMF_FREEING, &md->flags);
 	spin_unlock(&_minor_lock);
 
-	blk_set_queue_dying(md->queue);
+	blk_mark_disk_dead(md->disk);
 
 	/*
 	 * Take suspend_lock so that presuspend and postsuspend methods
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 79005ea1a33e3..469f23186159c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4574,7 +4574,7 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
 	if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
 		return;
 
-	blk_set_queue_dying(ns->queue);
+	blk_mark_disk_dead(ns->disk);
 	nvme_start_ns_queue(ns);
 
 	set_capacity_and_notify(ns->disk, 0);
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index f8bf6606eb2fc..ff775235534cf 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -848,7 +848,7 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 {
 	if (!head->disk)
 		return;
-	blk_set_queue_dying(head->disk->queue);
+	blk_mark_disk_dead(head->disk);
 	/* make sure all pending bios are cleaned up */
 	kblockd_schedule_work(&head->requeue_work);
 	flush_work(&head->requeue_work);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f35aea98bc351..16b47035e4b06 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -748,7 +748,8 @@ extern bool blk_queue_can_use_dma_map_merging(struct request_queue *q,
 
 bool __must_check blk_get_queue(struct request_queue *);
 extern void blk_put_queue(struct request_queue *);
-extern void blk_set_queue_dying(struct request_queue *);
+
+void blk_mark_disk_dead(struct gendisk *disk);
 
 #ifdef CONFIG_BLOCK
 /*
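As a usage illustration only (hypothetical driver code, not part of the patch
above; blk_mark_disk_dead() and del_gendisk() come from the patch, while the
mydrv_* names are invented), a surprise-removal path would then look roughly
like this:

```
/*
 * Hypothetical surprise-removal handler sketching how a driver would use
 * blk_mark_disk_dead() from the patch above.  The mydrv_* names are made
 * up for illustration.
 */
static void mydrv_surprise_remove(struct mydrv_dev *dev)
{
	/*
	 * The device is gone: mark the disk dead so that new bios are
	 * failed instead of waiting forever on the frozen queue.
	 */
	blk_mark_disk_dead(dev->disk);

	/* Fail or reap whatever I/O is still outstanding in the driver. */
	mydrv_abort_inflight_io(dev);

	/*
	 * Tear down the disk.  With GD_DEAD already set, del_gendisk()
	 * skips the fsync_bdev() that previously hung here.
	 */
	del_gendisk(dev->disk);
}
```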
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 961a5f8a44d2..0654cbe9b80e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4573,6 +4573,8 @@ static void nvme_set_queue_dying(struct nvme_ns *ns)
 	if (test_and_set_bit(NVME_NS_DEAD, &ns->flags))
 		return;
 
+	set_bit(GD_DEAD, &ns->disk->state);
+
 	blk_set_queue_dying(ns->queue);
 	nvme_start_ns_queue(ns);
After the surprise removal of a mounted NVMe disk the pciehp task
reliably hangs forever with a trace similar to this one:

INFO: task irq/43-pciehp:64 blocked for more than 120 seconds.
Call Trace:
 <TASK>
 __bio_queue_enter
 blk_mq_submit_bio
 submit_bio_noacct
 submit_bio_wait
 blkdev_issue_flush
 ext4_sync_fs
 sync_filesystem
 fsync_bdev
 delete_partition
 blk_drop_partitions
 del_gendisk
 nvme_ns_remove
 nvme_remove_namespaces
 nvme_remove
 pci_device_remove
 __device_release_driver
 device_release_driver
 pci_stop_bus_device
 pci_stop_and_remove_bus_device
 pciehp_unconfigure_device
 pciehp_disable_slot
 pciehp_handle_presence_or_link_change
 pciehp_ist
 </TASK>

I observed this with 5.15.5 from debian bullseye-backports and confirmed
with 5.17.0-rc3, but previous kernels may be affected as well.

I read that del_gendisk() prevents any new I/O only after flushing and
dropping all partitions. But in the case of a surprise removal any new
blocking I/O must be prevented first. I assume that
nvme_set_queue_dying() is supposed to do that.

Is there any other mechanism in place which should achieve this?
Unfortunately I am not very familiar with the blk_mq infrastructure, so
any comments and suggestions are very welcome.

Best regards,

Markus

Signed-off-by: Markus Blöchl <markus.bloechl@ipetronik.com>
---
 drivers/nvme/host/core.c | 2 ++
 1 file changed, 2 insertions(+)


base-commit: f1baf68e1383f6ed93eb9cff2866d46562607a43