Message ID: 20210322073726.788347-3-hch@lst.de (mailing list archive)
State: New, archived
Series: [1/2] blk-mq: add a blk_mq_submit_bio_direct API
On 3/22/21 8:37 AM, Christoph Hellwig wrote:
> When we reset/teardown a controller, we must freeze and quiesce the
> namespaces request queues to make sure that we safely stop inflight I/O
> submissions. Freeze is mandatory because if our hctx map changed between
> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
> the queue, and if it still has pending submissions (that are still
> quiesced) it will hang.
>
> However, by freezing the namespaces request queues, and only unfreezing
> them when we successfully reconnect, inflight submissions that are
> running concurrently can now block grabbing the nshead srcu until either
> we successfully reconnect or ctrl_loss_tmo expired (or the user
> explicitly disconnected).
>
> This caused a deadlock when a different controller (different path on the
> same subsystem) became live (i.e. optimized/non-optimized). This is
> because nvme_mpath_set_live needs to synchronize the nshead srcu before
> requeueing I/O in order to make sure that current_path is visible to
> future (re-)submissions. However the srcu lock is taken by a blocked
> submission on a frozen request queue, and we have a deadlock.
>
> In order to fix this use the blk_mq_submit_bio_direct API to submit the
> bio to the low-level driver, which does not block on the queue freeze
> but instead allows nvme-multipath to pick another path or queue up the
> bio.
>
> Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
> Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
>
> Reported-by: Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/nvme/host/multipath.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a1d476e1ac020f..92adebfaf86fd1 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>   	 */
>   	blk_queue_split(&bio);
>  
> +retry:
>   	srcu_idx = srcu_read_lock(&head->srcu);
>   	ns = nvme_find_path(head);
>   	if (likely(ns)) {
> @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>   		bio->bi_opf |= REQ_NVME_MPATH;
>   		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>   				      bio->bi_iter.bi_sector);
> -		ret = submit_bio_noacct(bio);
> +
> +		if (!blk_mq_submit_bio_direct(bio, &ret)) {
> +			nvme_mpath_clear_current_path(ns);
> +			srcu_read_unlock(&head->srcu, srcu_idx);
> +			goto retry;
> +		}
>   	} else if (nvme_available_path(head)) {
>   		dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");

Ah. We've run into the same issue, and I've come up with basically the
same patch to have it fixed. Tests are still outstanding, so I haven't
been able to validate it properly. Thanks for fixing it up.

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
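(For readers following along, the retry loop the patch introduces can be sketched in userspace C: a direct submission that refuses to block on a frozen queue reports failure, the failed path is cleared, and path selection runs again. All names below are illustrative stand-ins, not the kernel API.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the patched nvme_ns_head_submit_bio retry loop. */
struct path { bool live; bool frozen; };

/* models nvme_find_path: first live path wins */
static struct path *find_path(struct path *p, int n)
{
    for (int i = 0; i < n; i++)
        if (p[i].live)
            return &p[i];
    return NULL;
}

/* models blk_mq_submit_bio_direct: returns false instead of blocking
 * on a frozen queue */
static bool submit_direct(struct path *p) { return !p->frozen; }

static int head_submit(struct path *paths, int n)
{
    struct path *p;

retry:
    p = find_path(paths, n);          /* under srcu_read_lock(...) */
    if (!p)
        return -1;                    /* no usable path: requeue the bio */
    if (!submit_direct(p)) {
        p->live = false;              /* nvme_mpath_clear_current_path */
        goto retry;                   /* srcu_read_unlock + retry */
    }
    return 0;                         /* submitted; srcu_read_unlock */
}
```

The point of the model: the submitter never sleeps on a frozen queue while holding the srcu read lock; it either makes progress on another path or falls back to requeueing.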
On Mon, Mar 22, 2021 at 08:37:26AM +0100, Christoph Hellwig wrote:
> When we reset/teardown a controller, we must freeze and quiesce the
> namespaces request queues to make sure that we safely stop inflight I/O
> submissions.
> [...]
> In order to fix this use the blk_mq_submit_bio_direct API to submit the
> bio to the low-level driver, which does not block on the queue freeze
> but instead allows nvme-multipath to pick another path or queue up the
> bio.

Looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>
> When we reset/teardown a controller, we must freeze and quiesce the
> namespaces request queues to make sure that we safely stop inflight I/O
> submissions.
> [...]
> In order to fix this use the blk_mq_submit_bio_direct API to submit the
> bio to the low-level driver, which does not block on the queue freeze
> but instead allows nvme-multipath to pick another path or queue up the
> bio.

Almost...

This still has the same issue but instead of blocking on
blk_queue_enter() it is blocked on blk_mq_get_tag():
--
 __schedule+0x22b/0x6e0
 schedule+0x46/0xb0
 io_schedule+0x42/0x70
 blk_mq_get_tag+0x11d/0x270
 ? blk_bio_segment_split+0x235/0x2a0
 ? finish_wait+0x80/0x80
 __blk_mq_alloc_request+0x65/0xe0
 blk_mq_submit_bio+0x144/0x500
 blk_mq_submit_bio_direct+0x78/0xa0
 nvme_ns_head_submit_bio+0xc3/0x2f0 [nvme_core]
 __submit_bio_noacct+0xcf/0x2e0
 __blkdev_direct_IO+0x413/0x440
 ? __io_complete_rw.constprop.0+0x150/0x150
 generic_file_read_iter+0x92/0x160
 io_iter_do_read+0x1a/0x40
 io_read+0xc5/0x350
 ? common_interrupt+0x14/0xa0
 ? update_load_avg+0x7a/0x5e0
 io_issue_sqe+0xa28/0x1020
 ? lock_timer_base+0x61/0x80
 io_wq_submit_work+0xaa/0x120
 io_worker_handle_work+0x121/0x330
 io_wqe_worker+0xb6/0x190
 ? io_worker_handle_work+0x330/0x330
 ret_from_fork+0x22/0x30
--

--
 ? usleep_range+0x80/0x80
 __schedule+0x22b/0x6e0
 ? usleep_range+0x80/0x80
 schedule+0x46/0xb0
 schedule_timeout+0xff/0x140
 ? del_timer_sync+0x67/0xb0
 ? __prepare_to_swait+0x4b/0x70
 __wait_for_common+0xb3/0x160
 __synchronize_srcu.part.0+0x75/0xe0
 ? __bpf_trace_rcu_utilization+0x10/0x10
 nvme_mpath_set_live+0x61/0x130 [nvme_core]
 nvme_update_ana_state+0xd7/0x100 [nvme_core]
 nvme_parse_ana_log+0xa5/0x160 [nvme_core]
 ? nvme_mpath_set_live+0x130/0x130 [nvme_core]
 nvme_read_ana_log+0x7b/0xe0 [nvme_core]
 process_one_work+0x1e6/0x380
 worker_thread+0x49/0x300
--

If I were to always start the queues in nvme_tcp_teardown_io_queues
right after I cancel the tagset inflights like:
--
@@ -1934,8 +1934,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
        nvme_sync_io_queues(ctrl);
        nvme_tcp_stop_io_queues(ctrl);
        nvme_cancel_tagset(ctrl);
-       if (remove)
-               nvme_start_queues(ctrl);
+       nvme_start_queues(ctrl);
        nvme_tcp_destroy_io_queues(ctrl, remove);
--

then a simple reset during traffic bricks the host in an infinite loop,
because in the setup sequence we freeze the queue in nvme_update_ns_info,
so the queue is frozen but we still have an available path (because the
controller is back to live!), so nvme-mpath keeps calling
blk_mq_submit_bio_direct and fails, and nvme_update_ns_info cannot
properly freeze the queue -> deadlock.

So this is obviously incorrect.

Also, if we make nvme-mpath submit with REQ_NOWAIT we will basically
fail as soon as we run out of tags, even in the normal path...

So I'm not exactly sure what we should do to fix this...
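(The freeze/enter interaction behind both hangs above can be reduced to a toy model: a queue usage counter that freezing must drain, where any submitter that has entered and then blocks is exactly what the freezer waits on. This is an illustrative userspace sketch, not the kernel's percpu_ref implementation; all names are made up.)

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of request-queue freezing (the kernel's q_usage_counter). */
struct queue { int usage; bool frozen; };

/* models blk_queue_enter: blocked entry is represented by -EBUSY,
 * a REQ_NOWAIT-style entry fails fast with -EAGAIN instead */
static int queue_enter(struct queue *q, bool nowait)
{
    if (q->frozen)
        return nowait ? -EAGAIN : -EBUSY;  /* -EBUSY: "would sleep here" */
    q->usage++;
    return 0;
}

static void queue_exit(struct queue *q) { q->usage--; }

/* models blk_mq_freeze_queue: flip the gate, then wait for usage == 0;
 * returns true iff the freeze could complete without waiting */
static bool freeze_queue(struct queue *q)
{
    q->frozen = true;
    return q->usage == 0;   /* the real code blocks until this holds */
}
```

A submission that entered the queue and then parked (on blk_queue_enter before the patch, on blk_mq_get_tag after it) keeps `usage` elevated, so the freezer cannot complete; if that submitter also holds the nshead srcu read lock, `synchronize_srcu` in nvme_mpath_set_live joins the cycle.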
On 3/22/21 7:57 PM, Sagi Grimberg wrote:
> [...]
> This still has the same issue but instead of blocking on
> blk_queue_enter() it is blocked on blk_mq_get_tag():
> [...]
> Also, if we make nvme-mpath submit a REQ_NOWAIT we basically
> will fail as soon as we run out of tags, even in the normal path...
>
> So I'm not exactly sure what we should do to fix this...

It's still not too late to go with my original approach... ;)
On 2021/3/23 11:23, Sagi Grimberg wrote:
> On 3/22/21 7:57 PM, Sagi Grimberg wrote:
>> [...]
>> So I'm not exactly sure what we should do to fix this...
>
> It's still not too late to go with my original approach... ;)

I checked it again. I still think the patch below can avoid the bug:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41

The process:
1. nvme_ns_head_submit_bio calls srcu_read_lock(&head->srcu).
2. nvme_ns_head_submit_bio adds the bio to current->bio_list instead
   of waiting for the frozen queue.
3. nvme_ns_head_submit_bio calls srcu_read_unlock(&head->srcu, srcu_idx).

So nvme_ns_head_submit_bio does not hold head->srcu for long when the
queue is frozen, which avoids the deadlock.

Sagi, I suggest trying this patch.
On 3/23/21 3:57 AM, Sagi Grimberg wrote:
> [...]
> This still has the same issue but instead of blocking on
> blk_queue_enter() it is blocked on blk_mq_get_tag():
> [...]

Actually, I had been playing around with marking the entire bio as
'NOWAIT'; that would avoid the tag stall, too:

@@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
        ns = nvme_find_path(head);
        if (likely(ns)) {
                bio_set_dev(bio, ns->disk->part0);
-               bio->bi_opf |= REQ_NVME_MPATH;
+               bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
                trace_block_bio_remap(bio, disk_devt(ns->head->disk),
                                      bio->bi_iter.bi_sector);
                ret = submit_bio_noacct(bio);

My only worry here is that we might incur spurious failures under high
load; but then this is not necessarily a bad thing.

Cheers,

Hannes
> Actually, I had been playing around with marking the entire bio as
> 'NOWAIT'; that would avoid the tag stall, too:
>
> @@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>         ns = nvme_find_path(head);
>         if (likely(ns)) {
>                 bio_set_dev(bio, ns->disk->part0);
> -               bio->bi_opf |= REQ_NVME_MPATH;
> +               bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
>                 trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>                                       bio->bi_iter.bi_sector);
>                 ret = submit_bio_noacct(bio);
>
> My only worry here is that we might incur spurious failures under high
> load; but then this is not necessarily a bad thing.

What? Introducing spurious failures is not OK under any load. What
filesystem will take into account that you may have run out of tags?
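(The tag-budget tradeoff being argued here can be sketched in a few lines: a REQ_NOWAIT-style allocation returns -EAGAIN instead of sleeping, so once every candidate path is out of tags the error escapes to the submitter even though every path is healthy, which is the objection above. A hypothetical userspace model, not blk-mq code:)

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of blk-mq tag allocation: a fixed tag budget per queue. */
struct tagset { int free_tags; };

/* REQ_NOWAIT-style allocation: -EAGAIN instead of io_schedule() */
static int get_tag_nowait(struct tagset *ts)
{
    if (ts->free_tags == 0)
        return -EAGAIN;       /* spurious failure under plain high load */
    ts->free_tags--;
    return 0;
}

struct path { struct tagset *tags; bool alive; };

/* "spurious failover": on -EAGAIN try the next path; with every path
 * out of tags, the error still surfaces to the fs/application */
static int submit_nowait(struct path *paths, int npaths)
{
    for (int i = 0; i < npaths; i++)
        if (paths[i].alive && get_tag_nowait(paths[i].tags) == 0)
            return 0;
    return -EAGAIN;
}
```

The model makes the disagreement concrete: failover masks tag exhaustion only while some path still has tags; blocking allocation would simply have waited.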
> I check it again. I still think the below patch can avoid the bug.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41

I don't understand what you are saying...

> The process:
> 1. nvme_ns_head_submit_bio call srcu_read_lock(&head->srcu).
> 2. nvme_ns_head_submit_bio will add the bio to current->bio_list instead
>    of waiting for the frozen queue.

Nothing guarantees that you have a bio_list active at any point in time;
in fact, for a workload that submits one by one you will always drain
that list directly in the submission...

> 3. nvme_ns_head_submit_bio call srcu_read_unlock(&head->srcu, srcu_idx).
> So nvme_ns_head_submit_bio do not hold head->srcu long when the queue is
> frozen, can avoid deadlock.
>
> Sagi, suggest trying this patch.

The above reproduces with the patch applied on upstream nvme code.
On 2021/3/23 15:36, Sagi Grimberg wrote:
>> I check it again. I still think the below patch can avoid the bug.
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41
>
> I don't understand what you are saying...
>
>> The process:
>> 1. nvme_ns_head_submit_bio call srcu_read_lock(&head->srcu).
>> 2. nvme_ns_head_submit_bio will add the bio to current->bio_list
>>    instead of waiting for the frozen queue.
>
> Nothing guarantees that you have a bio_list active at any point in time,
> in fact for a workload that submits one by one you will always drain
> that list directly in the submission...

submit_bio and nvme_requeue_work both guarantee current->bio_list.
The process:
1. submit_bio and nvme_requeue_work call submit_bio_noacct.
2. submit_bio_noacct calls __submit_bio_noacct because
   bio->bi_disk->fops->submit_bio = nvme_ns_head_submit_bio.
3. __submit_bio_noacct sets current->bio_list, and then __submit_bio
   calls bio->bi_disk->fops->submit_bio (nvme_ns_head_submit_bio).
4. nvme_ns_head_submit_bio adds the bio to current->bio_list.
5. __submit_bio_noacct drains current->bio_list.

When draining current->bio_list, it will wait for the frozen queue but
does not hold head->srcu, because it calls blk_mq_submit_bio directly
instead of ->submit_bio (nvme_ns_head_submit_bio). So it is safe.

>> 3. nvme_ns_head_submit_bio call srcu_read_unlock(&head->srcu, srcu_idx).
>> So nvme_ns_head_submit_bio do not hold head->srcu long when the queue
>> is frozen, can avoid deadlock.
>>
>> Sagi, suggest trying this patch.
>
> The above reproduces with the patch applied on upstream nvme code.

The new patch (blk_mq_submit_bio_direct) will cause the bug again,
because it reverts adding the bio to current->bio_list. Just try the
upstream nvme code, and do not apply the new patch
(blk_mq_submit_bio_direct).
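(The five steps above can be modeled in a few lines of userspace C: a nested submit_bio_noacct call only queues the bio on a thread-local list, and the outermost caller drains that list after ->submit_bio has returned, i.e. after the srcu read side has been dropped. Names are illustrative stand-ins for the kernel code, and the single-threaded counters merely tag the invariant being claimed.)

```c
#include <assert.h>
#include <stddef.h>

struct bio { struct bio *next; int state; };
enum { BIO_NEW, BIO_REMAPPED, BIO_ISSUED };

static struct bio *bio_list;   /* models current->bio_list            */
static int list_active;        /* models current->bio_list != NULL    */
static int srcu_readers;       /* models the nshead srcu read side    */

static void ns_head_submit_bio(struct bio *bio);

static void submit_bio_noacct(struct bio *bio)
{
    if (list_active) {                 /* nested call: defer, return */
        bio->next = bio_list;
        bio_list = bio;
        return;
    }
    list_active = 1;
    ns_head_submit_bio(bio);           /* the ->submit_bio hook */
    while (bio_list) {                 /* drain: srcu NOT held here */
        struct bio *b = bio_list;
        bio_list = b->next;
        assert(srcu_readers == 0);     /* safe to block on a frozen queue */
        b->state = BIO_ISSUED;         /* real code: blk_mq_submit_bio */
    }
    list_active = 0;
}

static void ns_head_submit_bio(struct bio *bio)
{
    srcu_readers++;                    /* srcu_read_lock(&head->srcu)  */
    bio->state = BIO_REMAPPED;         /* pick a path, remap the bio   */
    submit_bio_noacct(bio);            /* only queues: cannot block    */
    srcu_readers--;                    /* srcu_read_unlock(...)        */
}
```

The assertion in the drain loop encodes the claim: any wait on a frozen queue happens with the srcu read lock already released, so synchronize_srcu in nvme_mpath_set_live cannot be blocked by it.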
On 3/23/21 8:31 AM, Sagi Grimberg wrote:
>> My only worry here is that we might incur spurious failures under high
>> load; but then this is not necessarily a bad thing.
>
> What? making spurious failures is not ok under any load. what fs will
> take into account that you may have run out of tags?

Well, it's not actually a spurious failure but rather a spurious
failover, as we're still in a multipath scenario, and bios will still be
re-routed to other paths, or queued if all paths are out of tags. Hence
the OS would not see any difference in behaviour.

But in the end, we abandoned this attempt, as the crash we've been
seeing was in bio_endio (due to bi_bdev still pointing to the removed
path device):

[ 6552.155251] bio_endio+0x74/0x120
[ 6552.155260] nvme_ns_head_submit_bio+0x36f/0x3e0 [nvme_core]
[ 6552.155271] submit_bio_noacct+0x175/0x490
[ 6552.155284] ? nvme_requeue_work+0x5a/0x70 [nvme_core]
[ 6552.155290] nvme_requeue_work+0x5a/0x70 [nvme_core]
[ 6552.155296] process_one_work+0x1f4/0x3e0
[ 6552.155299] worker_thread+0x2d/0x3e0
[ 6552.155302] ? process_one_work+0x3e0/0x3e0
[ 6552.155305] kthread+0x10d/0x130
[ 6552.155307] ? kthread_park+0xa0/0xa0
[ 6552.155311] ret_from_fork+0x35/0x40

So we're not blocked on blk_queue_enter(), and it's a crash, not a
deadlock. Blocking on blk_queue_enter() certainly plays a part here, but
it seems not to be the full picture.

Cheers,

Hannes
On Tue, Mar 23, 2021 at 09:36:47AM +0100, Hannes Reinecke wrote:
> Well, it's not actually a spurious failure but rather a spurious
> failover, as we're still on a multipath scenario, and bios will still
> be re-routed to other paths. Or queued if all paths are out of tags.
> Hence the OS would not see any difference in behaviour.

Failover might be overkill. We can run out of tags in a perfectly normal
situation, and simply waiting may be the best option, or even scheduling
on a different CPU may be sufficient to get a viable tag rather than
selecting a different path.

Does it make sense to just abort all allocated tags during a reset and
let the original bio requeue for multipath IO?

> But in the end, we abandoned this attempt, as the crash we've been
> seeing was in bio_endio (due to bi_bdev still pointing to the removed
> path device):
> [...]
> So we're not blocked on blk_queue_enter(), and it's a crash, not a
> deadlock. Blocking on blk_queue_enter() certainly plays a part here,
> but is seems not to be the full picture.
On Tue, Mar 23, 2021 at 12:36:40AM -0700, Sagi Grimberg wrote:
>> The process:
>> 1. nvme_ns_head_submit_bio call srcu_read_lock(&head->srcu).
>> 2. nvme_ns_head_submit_bio will add the bio to current->bio_list
>>    instead of waiting for the frozen queue.
>
> Nothing guarantees that you have a bio_list active at any point in time,
> in fact for a workload that submits one by one you will always drain
> that list directly in the submission...

It should always be active when ->submit_bio is called.

>> 3. nvme_ns_head_submit_bio call srcu_read_unlock(&head->srcu, srcu_idx).
>> So nvme_ns_head_submit_bio do not hold head->srcu long when the queue
>> is frozen, can avoid deadlock.
>>
>> Sagi, suggest trying this patch.
>
> The above reproduces with the patch applied on upstream nvme code.

Weird. I don't think the deadlock in your original report should happen
due to this. Can you take a look at the callstacks in the reproduced
deadlock? Either we're missing something obvious or it is a somewhat
different deadlock.
On Tue, Mar 23, 2021 at 04:13:09PM +0800, Chao Leng wrote:
>> The above reproduces with the patch applied on upstream nvme code.
>
> The new patch (blk_mq_submit_bio_direct) will cause the bug again,
> because it reverts adding the bio to current->bio_list. Just try the
> upstream nvme code, and do not apply the new patch
> (blk_mq_submit_bio_direct).

Indeed, the patch removes the deferred submission, so the original
deadlock should not happen.
On Tue, Mar 23, 2021 at 11:53:30PM +0900, Keith Busch wrote:
> Failover might be overkill. We can run out of tags in a perfectly normal
> situation, and simply waiting may be the best option, or even scheduling
> on a different CPU may be sufficient to get a viable tag rather than
> selecting a different path.

Indeed. Then again, IFF there are multiple optimized paths, picking one
could also be a way to make progress.

> Does it make sense to just abort all allocated tags during a reset and
> let the original bio requeue for multipath IO?

Well, this would again hard-code multipath information into the lower
levels. But then again we could at least do it so that we check for
REQ_NVME_MPATH to make it clear what is going on.
>>> Sagi, suggest trying this patch.
>>
>> The above reproduces with the patch applied on upstream nvme code.
>
> Weird. I don't think the deadlock in your original report should
> happen due to this. Can you take a look at the callstacks in the
> reproduced deadlock? Either we're missing something obvious or it is
> a somewhat different deadlock.

The deadlock in this patchset reproduces upstream. It is not possible to
update the kernel in the env in the original report.

So IFF we assume that this does not reproduce in upstream (pending
proof), is there something that we can do with stable fixes? This will
probably go back to everything that is before 5.8...
On Tue, Mar 23, 2021 at 11:13:13AM -0700, Sagi Grimberg wrote:
> The deadlock in this patchset reproduces upstream. It is not possible to
> update the kernel in the env in the original report.
>
> So IFF we assume that this does not reproduce in upstream (pending
> proof), is there something that we can do with stable fixes? This will
> probably go back to everything that is before 5.8...

The direct_make_request removal should be pretty easily backportable. In
old kernels without the streamlined normal path it might cause minor
performance degradations, but the actual change should be trivial.
>> The deadlock in this patchset reproduces upstream. It is not possible to
>> update the kernel in the env in the original report.
>>
>> So IFF we assume that this does not reproduce in upstream (pending
>> proof), is there something that we can do with stable fixes? This will
>> probably go back to everything that is before 5.8...
>
> The direct_make_request removal should be pretty easily backportable.

Umm, the direct_make_request removal is based on the submit_bio_noacct
rework you've done. Are you suggesting that we replace it with
generic_make_request for these kernels?

> In old kernels without the streamlined normal path it might cause minor
> performance degradations, but the actual change should be trivial.

That is better than getting stuck in I/O failover...
On Tue, Mar 23, 2021 at 12:00:34PM -0700, Sagi Grimberg wrote:
>>>> The deadlock in this patchset reproduces upstream. It is not possible to
>>>> update the kernel in the env in the original report.
>>>>
>>>> So IFF we assume that this does not reproduce in upstream (pending
>>>> proof), is there something that we can do with stable fixes? This will
>>>> probably go back to everything that is before 5.8...
>>>
>>> The direct_make_request removal should be pretty easily backportable.
>>
>> Umm, the direct_make_request removal is based on the submit_bio_noacct
>> rework you've done. Are you suggesting that we replace it with
>> generic_make_request for these kernels?

Yes.
>>>> The deadlock in this patchset reproduces upstream. It is not possible to
>>>> update the kernel in the env in the original report.
>>>>
>>>> So IFF we assume that this does not reproduce in upstream (pending
>>>> proof), is there something that we can do with stable fixes? This will
>>>> probably go back to everything that is before 5.8...
>>>
>>> The direct_make_request removal should be pretty easily backportable.
>>
>> Umm, the direct_make_request removal is based on the submit_bio_noacct
>> rework you've done. Are you suggesting that we replace it with
>> generic_make_request for these kernels?
>
> Yes.

OK, that should be testable...
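For pre-5.8 stable kernels, the change agreed on above would presumably amount to a one-liner in nvme_ns_head_make_request(). This is an untested sketch of what such a backport might look like, not the actual stable patch:

```c
	/*
	 * Pre-5.8 backport sketch (untested): in nvme_ns_head_make_request(),
	 * replace the direct_make_request() call with generic_make_request(),
	 * the old-kernel equivalent of submit_bio_noacct(), mirroring the
	 * upstream direct_make_request removal.
	 */
	ret = generic_make_request(bio);	/* was: direct_make_request(bio); */
```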
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac020f..92adebfaf86fd1 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
 	 */
 	blk_queue_split(&bio);
 
+retry:
 	srcu_idx = srcu_read_lock(&head->srcu);
 	ns = nvme_find_path(head);
 	if (likely(ns)) {
@@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
 		bio->bi_opf |= REQ_NVME_MPATH;
 		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
 				      bio->bi_iter.bi_sector);
-		ret = submit_bio_noacct(bio);
+
+		if (!blk_mq_submit_bio_direct(bio, &ret)) {
+			nvme_mpath_clear_current_path(ns);
+			srcu_read_unlock(&head->srcu, srcu_idx);
+			goto retry;
+		}
 	} else if (nvme_available_path(head)) {
 		dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
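For readers without patch 1/2 of the series in front of them, the contract the hunk relies on is presumably along these lines (my paraphrase of the new API, not the authoritative kerneldoc):

```c
/*
 * Assumed contract of the helper added in patch 1/2: try to submit the
 * bio directly to the blk-mq queue without sleeping in blk_queue_enter().
 * Returns true if the bio was submitted (*qc holds the completion cookie),
 * false if the queue could not be entered (e.g. frozen or dying); in the
 * false case the bio is left untouched so the caller can retry it on
 * another path, as the retry loop above does.
 */
bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *qc);
```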
When we reset/teardown a controller, we must freeze and quiesce the
namespaces request queues to make sure that we safely stop inflight I/O
submissions. Freeze is mandatory because if our hctx map changed between
reconnects, blk_mq_update_nr_hw_queues will immediately attempt to
freeze the queue, and if it still has pending submissions (that are
still quiesced) it will hang.

However, by freezing the namespaces request queues, and only unfreezing
them when we successfully reconnect, inflight submissions that are
running concurrently can now block grabbing the nshead srcu until either
we successfully reconnect or ctrl_loss_tmo expires (or the user
explicitly disconnects).

This caused a deadlock when a different controller (a different path on
the same subsystem) became live (i.e. optimized/non-optimized). This is
because nvme_mpath_set_live needs to synchronize the nshead srcu before
requeueing I/O in order to make sure that current_path is visible to
future (re-)submissions. However the srcu lock is taken by a blocked
submission on a frozen request queue, and we have a deadlock.

In order to fix this, use the blk_mq_submit_bio_direct API to submit the
bio to the low-level driver, which does not block on the queue freeze
but instead allows nvme-multipath to pick another path or queue up the
bio.

Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")

Reported-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/multipath.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)