diff mbox series

[2/2] nvme-multipath: don't block on blk_queue_enter of the underlying device

Message ID 20210322073726.788347-3-hch@lst.de (mailing list archive)
State New, archived
Headers show
Series [1/2] blk-mq: add a blk_mq_submit_bio_direct API | expand

Commit Message

Christoph Hellwig March 22, 2021, 7:37 a.m. UTC
When we reset/teardown a controller, we must freeze and quiesce the
namespaces request queues to make sure that we safely stop inflight I/O
submissions. Freeze is mandatory because if our hctx map changed between
reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
the queue, and if it still has pending submissions (that are still
quiesced) it will hang.

However, by freezing the namespaces request queues, and only unfreezing
them when we successfully reconnect, inflight submissions that are
running concurrently can now block while holding the nshead srcu until
either we successfully reconnect or ctrl_loss_tmo expires (or the user
explicitly disconnects).

This caused a deadlock when a different controller (different path on the
same subsystem) became live (i.e. optimized/non-optimized). This is
because nvme_mpath_set_live needs to synchronize the nshead srcu before
requeueing I/O in order to make sure that current_path is visible to
future (re-)submissions. However, the srcu lock is taken by a blocked
submission on a frozen request queue, and we have a deadlock.

In order to fix this, use the blk_mq_submit_bio_direct API to submit the
bio to the low-level driver, which does not block on the queue freeze
but instead allows nvme-multipath to pick another path or queue up the
bio.

Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")

Reported-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/nvme/host/multipath.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
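
Patch 1/2 in this series adds the blk_mq_submit_bio_direct API used below, but
is not included on this page. As a rough sketch only, inferred from the call
site in this patch (the body and the helper's internals are assumptions, not
the actual patch 1/2), such a helper could look like:

--
/*
 * Sketch only, not the actual patch 1/2: try to enter the request queue
 * without sleeping and hand the bio straight to blk-mq.  If the queue is
 * frozen (q_usage_counter killed during a reset), return false so the
 * caller can fail over to another path or requeue the bio.
 */
bool blk_mq_submit_bio_direct(struct bio *bio, blk_qc_t *ret)
{
	struct request_queue *q = bio->bi_bdev->bd_disk->queue;

	/* non-blocking equivalent of blk_queue_enter() */
	if (!percpu_ref_tryget_live(&q->q_usage_counter))
		return false;

	/* assumes blk_mq_submit_bio() consumes the queue reference */
	*ret = blk_mq_submit_bio(bio);
	return true;
}
--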

Comments

Hannes Reinecke March 22, 2021, 11:22 a.m. UTC | #1
On 3/22/21 8:37 AM, Christoph Hellwig wrote:
> When we reset/teardown a controller, we must freeze and quiesce the
> namespaces request queues to make sure that we safely stop inflight I/O
> submissions. Freeze is mandatory because if our hctx map changed between
> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
> the queue, and if it still has pending submissions (that are still
> quiesced) it will hang.
> 
> However, by freezing the namespaces request queues, and only unfreezing
> them when we successfully reconnect, inflight submissions that are
> running concurrently can now block grabbing the nshead srcu until either
> we successfully reconnect or ctrl_loss_tmo expired (or the user
> explicitly disconnected).
> 
> This caused a deadlock when a different controller (different path on the
> same subsystem) became live (i.e. optimized/non-optimized). This is
> because nvme_mpath_set_live needs to synchronize the nshead srcu before
> requeueing I/O in order to make sure that current_path is visible to
> future (re-)submisions. However the srcu lock is taken by a blocked
> submission on a frozen request queue, and we have a deadlock.
> 
> In order to fix this use the blk_mq_submit_bio_direct API to submit the
> bio to the low-level driver, which does not block on the queue free
> but instead allows nvme-multipath to pick another path or queue up the
> bio.
> 
> Fixes: 9f98772ba307 ("nvme-rdma: fix controller reset hang during traffic")
> Fixes: 2875b0aecabe ("nvme-tcp: fix controller reset hang during traffic")
> 
> Reported-by Sagi Grimberg <sagi@grimberg.me>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  drivers/nvme/host/multipath.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a1d476e1ac020f..92adebfaf86fd1 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -309,6 +309,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>  	 */
>  	blk_queue_split(&bio);
>  
> +retry:
>  	srcu_idx = srcu_read_lock(&head->srcu);
>  	ns = nvme_find_path(head);
>  	if (likely(ns)) {
> @@ -316,7 +317,12 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>  		bio->bi_opf |= REQ_NVME_MPATH;
>  		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>  				      bio->bi_iter.bi_sector);
> -		ret = submit_bio_noacct(bio);
> +
> +		if (!blk_mq_submit_bio_direct(bio, &ret)) {
> +			nvme_mpath_clear_current_path(ns);
> +			srcu_read_unlock(&head->srcu, srcu_idx);
> +			goto retry;
> +		}
>  	} else if (nvme_available_path(head)) {
>  		dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");
>  
> 
Ah. We've run into the same issue, and I've come up with basically the
same patch to fix it.
Tests are still outstanding, so I haven't been able to validate it properly.
Thanks for fixing it up.

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
Keith Busch March 22, 2021, 3:31 p.m. UTC | #2
On Mon, Mar 22, 2021 at 08:37:26AM +0100, Christoph Hellwig wrote:
> When we reset/teardown a controller, we must freeze and quiesce the
> namespaces request queues to make sure that we safely stop inflight I/O
> submissions. Freeze is mandatory because if our hctx map changed between
> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
> the queue, and if it still has pending submissions (that are still
> quiesced) it will hang.
> 
> However, by freezing the namespaces request queues, and only unfreezing
> them when we successfully reconnect, inflight submissions that are
> running concurrently can now block grabbing the nshead srcu until either
> we successfully reconnect or ctrl_loss_tmo expired (or the user
> explicitly disconnected).
> 
> This caused a deadlock when a different controller (different path on the
> same subsystem) became live (i.e. optimized/non-optimized). This is
> because nvme_mpath_set_live needs to synchronize the nshead srcu before
> requeueing I/O in order to make sure that current_path is visible to
> future (re-)submisions. However the srcu lock is taken by a blocked
> submission on a frozen request queue, and we have a deadlock.
> 
> In order to fix this use the blk_mq_submit_bio_direct API to submit the
> bio to the low-level driver, which does not block on the queue free
> but instead allows nvme-multipath to pick another path or queue up the
> bio.

Looks good.

Reviewed-by: Keith Busch <kbusch@kernel.org>
Sagi Grimberg March 23, 2021, 2:57 a.m. UTC | #3
> When we reset/teardown a controller, we must freeze and quiesce the
> namespaces request queues to make sure that we safely stop inflight I/O
> submissions. Freeze is mandatory because if our hctx map changed between
> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
> the queue, and if it still has pending submissions (that are still
> quiesced) it will hang.
> 
> However, by freezing the namespaces request queues, and only unfreezing
> them when we successfully reconnect, inflight submissions that are
> running concurrently can now block grabbing the nshead srcu until either
> we successfully reconnect or ctrl_loss_tmo expired (or the user
> explicitly disconnected).
> 
> This caused a deadlock when a different controller (different path on the
> same subsystem) became live (i.e. optimized/non-optimized). This is
> because nvme_mpath_set_live needs to synchronize the nshead srcu before
> requeueing I/O in order to make sure that current_path is visible to
> future (re-)submisions. However the srcu lock is taken by a blocked
> submission on a frozen request queue, and we have a deadlock.
> 
> In order to fix this use the blk_mq_submit_bio_direct API to submit the
> bio to the low-level driver, which does not block on the queue free
> but instead allows nvme-multipath to pick another path or queue up the
> bio.

Almost...

This still has the same issue but instead of blocking on
blk_queue_enter() it is blocked on blk_mq_get_tag():
--
  __schedule+0x22b/0x6e0
  schedule+0x46/0xb0
  io_schedule+0x42/0x70
  blk_mq_get_tag+0x11d/0x270
  ? blk_bio_segment_split+0x235/0x2a0
  ? finish_wait+0x80/0x80
  __blk_mq_alloc_request+0x65/0xe0
  blk_mq_submit_bio+0x144/0x500
  blk_mq_submit_bio_direct+0x78/0xa0
  nvme_ns_head_submit_bio+0xc3/0x2f0 [nvme_core]
  __submit_bio_noacct+0xcf/0x2e0
  __blkdev_direct_IO+0x413/0x440
  ? __io_complete_rw.constprop.0+0x150/0x150
  generic_file_read_iter+0x92/0x160
  io_iter_do_read+0x1a/0x40
  io_read+0xc5/0x350
  ? common_interrupt+0x14/0xa0
  ? update_load_avg+0x7a/0x5e0
  io_issue_sqe+0xa28/0x1020
  ? lock_timer_base+0x61/0x80
  io_wq_submit_work+0xaa/0x120
  io_worker_handle_work+0x121/0x330
  io_wqe_worker+0xb6/0x190
  ? io_worker_handle_work+0x330/0x330
  ret_from_fork+0x22/0x30
--

--
  ? usleep_range+0x80/0x80
  __schedule+0x22b/0x6e0
  ? usleep_range+0x80/0x80
  schedule+0x46/0xb0
  schedule_timeout+0xff/0x140
  ? del_timer_sync+0x67/0xb0
  ? __prepare_to_swait+0x4b/0x70
  __wait_for_common+0xb3/0x160
  __synchronize_srcu.part.0+0x75/0xe0
  ? __bpf_trace_rcu_utilization+0x10/0x10
  nvme_mpath_set_live+0x61/0x130 [nvme_core]
  nvme_update_ana_state+0xd7/0x100 [nvme_core]
  nvme_parse_ana_log+0xa5/0x160 [nvme_core]
  ? nvme_mpath_set_live+0x130/0x130 [nvme_core]
  nvme_read_ana_log+0x7b/0xe0 [nvme_core]
  process_one_work+0x1e6/0x380
  worker_thread+0x49/0x300
--



If I were to always start the queues in nvme_tcp_teardown_ctrl
right after I cancel the tagset inflights like:
--
@@ -1934,8 +1934,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
         nvme_sync_io_queues(ctrl);
         nvme_tcp_stop_io_queues(ctrl);
         nvme_cancel_tagset(ctrl);
-       if (remove)
-               nvme_start_queues(ctrl);
+       nvme_start_queues(ctrl);
         nvme_tcp_destroy_io_queues(ctrl, remove);
--

then a simple reset during traffic bricks the host in an infinite loop,
because in the setup sequence we freeze the queue in
nvme_update_ns_info, so the queue is frozen but we still have an
available path (because the controller is live again!) so nvme-mpath
keeps calling blk_mq_submit_bio_direct and fails, and
nvme_update_ns_info cannot properly freeze the queue..
-> deadlock.

So this is obviously incorrect.

Also, if we make nvme-mpath submit a REQ_NOWAIT we basically
will fail as soon as we run out of tags, even in the normal path...

So I'm not exactly sure what we should do to fix this...
Sagi Grimberg March 23, 2021, 3:23 a.m. UTC | #4
On 3/22/21 7:57 PM, Sagi Grimberg wrote:
> 
>> When we reset/teardown a controller, we must freeze and quiesce the
>> namespaces request queues to make sure that we safely stop inflight I/O
>> submissions. Freeze is mandatory because if our hctx map changed between
>> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
>> the queue, and if it still has pending submissions (that are still
>> quiesced) it will hang.
>>
>> However, by freezing the namespaces request queues, and only unfreezing
>> them when we successfully reconnect, inflight submissions that are
>> running concurrently can now block grabbing the nshead srcu until either
>> we successfully reconnect or ctrl_loss_tmo expired (or the user
>> explicitly disconnected).
>>
>> This caused a deadlock when a different controller (different path on the
>> same subsystem) became live (i.e. optimized/non-optimized). This is
>> because nvme_mpath_set_live needs to synchronize the nshead srcu before
>> requeueing I/O in order to make sure that current_path is visible to
>> future (re-)submisions. However the srcu lock is taken by a blocked
>> submission on a frozen request queue, and we have a deadlock.
>>
>> In order to fix this use the blk_mq_submit_bio_direct API to submit the
>> bio to the low-level driver, which does not block on the queue free
>> but instead allows nvme-multipath to pick another path or queue up the
>> bio.
> 
> Almost...
> 
> This still has the same issue but instead of blocking on
> blk_queue_enter() it is blocked on blk_mq_get_tag():
> -- 
>   __schedule+0x22b/0x6e0
>   schedule+0x46/0xb0
>   io_schedule+0x42/0x70
>   blk_mq_get_tag+0x11d/0x270
>   ? blk_bio_segment_split+0x235/0x2a0
>   ? finish_wait+0x80/0x80
>   __blk_mq_alloc_request+0x65/0xe0
>   blk_mq_submit_bio+0x144/0x500
>   blk_mq_submit_bio_direct+0x78/0xa0
>   nvme_ns_head_submit_bio+0xc3/0x2f0 [nvme_core]
>   __submit_bio_noacct+0xcf/0x2e0
>   __blkdev_direct_IO+0x413/0x440
>   ? __io_complete_rw.constprop.0+0x150/0x150
>   generic_file_read_iter+0x92/0x160
>   io_iter_do_read+0x1a/0x40
>   io_read+0xc5/0x350
>   ? common_interrupt+0x14/0xa0
>   ? update_load_avg+0x7a/0x5e0
>   io_issue_sqe+0xa28/0x1020
>   ? lock_timer_base+0x61/0x80
>   io_wq_submit_work+0xaa/0x120
>   io_worker_handle_work+0x121/0x330
>   io_wqe_worker+0xb6/0x190
>   ? io_worker_handle_work+0x330/0x330
>   ret_from_fork+0x22/0x30
> -- 
> 
> -- 
>   ? usleep_range+0x80/0x80
>   __schedule+0x22b/0x6e0
>   ? usleep_range+0x80/0x80
>   schedule+0x46/0xb0
>   schedule_timeout+0xff/0x140
>   ? del_timer_sync+0x67/0xb0
>   ? __prepare_to_swait+0x4b/0x70
>   __wait_for_common+0xb3/0x160
>   __synchronize_srcu.part.0+0x75/0xe0
>   ? __bpf_trace_rcu_utilization+0x10/0x10
>   nvme_mpath_set_live+0x61/0x130 [nvme_core]
>   nvme_update_ana_state+0xd7/0x100 [nvme_core]
>   nvme_parse_ana_log+0xa5/0x160 [nvme_core]
>   ? nvme_mpath_set_live+0x130/0x130 [nvme_core]
>   nvme_read_ana_log+0x7b/0xe0 [nvme_core]
>   process_one_work+0x1e6/0x380
>   worker_thread+0x49/0x300
> -- 
> 
> 
> 
> If I were to always start the queues in nvme_tcp_teardown_ctrl
> right after I cancel the tagset inflights like:
> -- 
> @@ -1934,8 +1934,7 @@ static void nvme_tcp_teardown_io_queues(struct 
> nvme_ctrl *ctrl,
>          nvme_sync_io_queues(ctrl);
>          nvme_tcp_stop_io_queues(ctrl);
>          nvme_cancel_tagset(ctrl);
> -       if (remove)
> -               nvme_start_queues(ctrl);
> +       nvme_start_queues(ctrl);
>          nvme_tcp_destroy_io_queues(ctrl, remove);
> -- 
> 
> then a simple reset during traffic bricks the host on infinite loop
> because in the setup sequence we freeze the queue in
> nvme_update_ns_info, so the queue is frozen but we still have an
> available path (because the controller is back to live!) so nvme-mpath
> keeps calling blk_mq_submit_bio_direct and fails, and
> nvme_update_ns_info cannot properly freeze the queue..
> -> deadlock.
> 
> So this is obviously incorrect.
> 
> Also, if we make nvme-mpath submit a REQ_NOWAIT we basically
> will fail as soon as we run out of tags, even in the normal path...
> 
> So I'm not exactly sure what we should do to fix this...

It's still not too late to go with my original approach... ;)
Chao Leng March 23, 2021, 7:04 a.m. UTC | #5
On 2021/3/23 11:23, Sagi Grimberg wrote:
> 
> 
> On 3/22/21 7:57 PM, Sagi Grimberg wrote:
>>
>>> When we reset/teardown a controller, we must freeze and quiesce the
>>> namespaces request queues to make sure that we safely stop inflight I/O
>>> submissions. Freeze is mandatory because if our hctx map changed between
>>> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
>>> the queue, and if it still has pending submissions (that are still
>>> quiesced) it will hang.
>>>
>>> However, by freezing the namespaces request queues, and only unfreezing
>>> them when we successfully reconnect, inflight submissions that are
>>> running concurrently can now block grabbing the nshead srcu until either
>>> we successfully reconnect or ctrl_loss_tmo expired (or the user
>>> explicitly disconnected).
>>>
>>> This caused a deadlock when a different controller (different path on the
>>> same subsystem) became live (i.e. optimized/non-optimized). This is
>>> because nvme_mpath_set_live needs to synchronize the nshead srcu before
>>> requeueing I/O in order to make sure that current_path is visible to
>>> future (re-)submisions. However the srcu lock is taken by a blocked
>>> submission on a frozen request queue, and we have a deadlock.
>>>
>>> In order to fix this use the blk_mq_submit_bio_direct API to submit the
>>> bio to the low-level driver, which does not block on the queue free
>>> but instead allows nvme-multipath to pick another path or queue up the
>>> bio.
>>
>> Almost...
>>
>> This still has the same issue but instead of blocking on
>> blk_queue_enter() it is blocked on blk_mq_get_tag():
>> -- 
>>   __schedule+0x22b/0x6e0
>>   schedule+0x46/0xb0
>>   io_schedule+0x42/0x70
>>   blk_mq_get_tag+0x11d/0x270
>>   ? blk_bio_segment_split+0x235/0x2a0
>>   ? finish_wait+0x80/0x80
>>   __blk_mq_alloc_request+0x65/0xe0
>>   blk_mq_submit_bio+0x144/0x500
>>   blk_mq_submit_bio_direct+0x78/0xa0
>>   nvme_ns_head_submit_bio+0xc3/0x2f0 [nvme_core]
>>   __submit_bio_noacct+0xcf/0x2e0
>>   __blkdev_direct_IO+0x413/0x440
>>   ? __io_complete_rw.constprop.0+0x150/0x150
>>   generic_file_read_iter+0x92/0x160
>>   io_iter_do_read+0x1a/0x40
>>   io_read+0xc5/0x350
>>   ? common_interrupt+0x14/0xa0
>>   ? update_load_avg+0x7a/0x5e0
>>   io_issue_sqe+0xa28/0x1020
>>   ? lock_timer_base+0x61/0x80
>>   io_wq_submit_work+0xaa/0x120
>>   io_worker_handle_work+0x121/0x330
>>   io_wqe_worker+0xb6/0x190
>>   ? io_worker_handle_work+0x330/0x330
>>   ret_from_fork+0x22/0x30
>> -- 
>>
>> -- 
>>   ? usleep_range+0x80/0x80
>>   __schedule+0x22b/0x6e0
>>   ? usleep_range+0x80/0x80
>>   schedule+0x46/0xb0
>>   schedule_timeout+0xff/0x140
>>   ? del_timer_sync+0x67/0xb0
>>   ? __prepare_to_swait+0x4b/0x70
>>   __wait_for_common+0xb3/0x160
>>   __synchronize_srcu.part.0+0x75/0xe0
>>   ? __bpf_trace_rcu_utilization+0x10/0x10
>>   nvme_mpath_set_live+0x61/0x130 [nvme_core]
>>   nvme_update_ana_state+0xd7/0x100 [nvme_core]
>>   nvme_parse_ana_log+0xa5/0x160 [nvme_core]
>>   ? nvme_mpath_set_live+0x130/0x130 [nvme_core]
>>   nvme_read_ana_log+0x7b/0xe0 [nvme_core]
>>   process_one_work+0x1e6/0x380
>>   worker_thread+0x49/0x300
>> -- 
>>
>>
>>
>> If I were to always start the queues in nvme_tcp_teardown_ctrl
>> right after I cancel the tagset inflights like:
>> -- 
>> @@ -1934,8 +1934,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
>>          nvme_sync_io_queues(ctrl);
>>          nvme_tcp_stop_io_queues(ctrl);
>>          nvme_cancel_tagset(ctrl);
>> -       if (remove)
>> -               nvme_start_queues(ctrl);
>> +       nvme_start_queues(ctrl);
>>          nvme_tcp_destroy_io_queues(ctrl, remove);
>> -- 
>>
>> then a simple reset during traffic bricks the host on infinite loop
>> because in the setup sequence we freeze the queue in
>> nvme_update_ns_info, so the queue is frozen but we still have an
>> available path (because the controller is back to live!) so nvme-mpath
>> keeps calling blk_mq_submit_bio_direct and fails, and
>> nvme_update_ns_info cannot properly freeze the queue..
>> -> deadlock.
>>
>> So this is obviously incorrect.
>>
>> Also, if we make nvme-mpath submit a REQ_NOWAIT we basically
>> will fail as soon as we run out of tags, even in the normal path...
>>
>> So I'm not exactly sure what we should do to fix this...
> 
> It's still not too late to go with my original approach... ;)
I checked it again. I still think the patch below can avoid the bug.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41
The process:
1.nvme_ns_head_submit_bio call srcu_read_lock(&head->srcu).
2.nvme_ns_head_submit_bio will add the bio to current->bio_list instead of waiting for the frozen queue.
3.nvme_ns_head_submit_bio call srcu_read_unlock(&head->srcu, srcu_idx).
So nvme_ns_head_submit_bio does not hold head->srcu for long when the queue is frozen, which avoids the deadlock.

Sagi, suggest trying this patch.

> .
Hannes Reinecke March 23, 2021, 7:28 a.m. UTC | #6
On 3/23/21 3:57 AM, Sagi Grimberg wrote:
> 
>> When we reset/teardown a controller, we must freeze and quiesce the
>> namespaces request queues to make sure that we safely stop inflight I/O
>> submissions. Freeze is mandatory because if our hctx map changed between
>> reconnects, blk_mq_update_nr_hw_queues will immediately attempt to freeze
>> the queue, and if it still has pending submissions (that are still
>> quiesced) it will hang.
>>
>> However, by freezing the namespaces request queues, and only unfreezing
>> them when we successfully reconnect, inflight submissions that are
>> running concurrently can now block grabbing the nshead srcu until either
>> we successfully reconnect or ctrl_loss_tmo expired (or the user
>> explicitly disconnected).
>>
>> This caused a deadlock when a different controller (different path on the
>> same subsystem) became live (i.e. optimized/non-optimized). This is
>> because nvme_mpath_set_live needs to synchronize the nshead srcu before
>> requeueing I/O in order to make sure that current_path is visible to
>> future (re-)submisions. However the srcu lock is taken by a blocked
>> submission on a frozen request queue, and we have a deadlock.
>>
>> In order to fix this use the blk_mq_submit_bio_direct API to submit the
>> bio to the low-level driver, which does not block on the queue free
>> but instead allows nvme-multipath to pick another path or queue up the
>> bio.
> 
> Almost...
> 
> This still has the same issue but instead of blocking on
> blk_queue_enter() it is blocked on blk_mq_get_tag():
> -- 
>   __schedule+0x22b/0x6e0
>   schedule+0x46/0xb0
>   io_schedule+0x42/0x70
>   blk_mq_get_tag+0x11d/0x270
>   ? blk_bio_segment_split+0x235/0x2a0
>   ? finish_wait+0x80/0x80
>   __blk_mq_alloc_request+0x65/0xe0
>   blk_mq_submit_bio+0x144/0x500
>   blk_mq_submit_bio_direct+0x78/0xa0
>   nvme_ns_head_submit_bio+0xc3/0x2f0 [nvme_core]
>   __submit_bio_noacct+0xcf/0x2e0
>   __blkdev_direct_IO+0x413/0x440
>   ? __io_complete_rw.constprop.0+0x150/0x150
>   generic_file_read_iter+0x92/0x160
>   io_iter_do_read+0x1a/0x40
>   io_read+0xc5/0x350
>   ? common_interrupt+0x14/0xa0
>   ? update_load_avg+0x7a/0x5e0
>   io_issue_sqe+0xa28/0x1020
>   ? lock_timer_base+0x61/0x80
>   io_wq_submit_work+0xaa/0x120
>   io_worker_handle_work+0x121/0x330
>   io_wqe_worker+0xb6/0x190
>   ? io_worker_handle_work+0x330/0x330
>   ret_from_fork+0x22/0x30
> -- 
> 
> -- 
>   ? usleep_range+0x80/0x80
>   __schedule+0x22b/0x6e0
>   ? usleep_range+0x80/0x80
>   schedule+0x46/0xb0
>   schedule_timeout+0xff/0x140
>   ? del_timer_sync+0x67/0xb0
>   ? __prepare_to_swait+0x4b/0x70
>   __wait_for_common+0xb3/0x160
>   __synchronize_srcu.part.0+0x75/0xe0
>   ? __bpf_trace_rcu_utilization+0x10/0x10
>   nvme_mpath_set_live+0x61/0x130 [nvme_core]
>   nvme_update_ana_state+0xd7/0x100 [nvme_core]
>   nvme_parse_ana_log+0xa5/0x160 [nvme_core]
>   ? nvme_mpath_set_live+0x130/0x130 [nvme_core]
>   nvme_read_ana_log+0x7b/0xe0 [nvme_core]
>   process_one_work+0x1e6/0x380
>   worker_thread+0x49/0x300
> -- 
> 
Actually, I had been playing around with marking the entire bio as 
'NOWAIT'; that would avoid the tag stall, too:

@@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
         ns = nvme_find_path(head);
         if (likely(ns)) {
                 bio_set_dev(bio, ns->disk->part0);
-               bio->bi_opf |= REQ_NVME_MPATH;
+               bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
                 trace_block_bio_remap(bio, disk_devt(ns->head->disk),
                                       bio->bi_iter.bi_sector);
                 ret = submit_bio_noacct(bio);


My only worry here is that we might incur spurious failures under high 
load; but then this is not necessarily a bad thing.

Cheers,

Hannes
Sagi Grimberg March 23, 2021, 7:31 a.m. UTC | #7
> Actually, I had been playing around with marking the entire bio as 
> 'NOWAIT'; that would avoid the tag stall, too:
> 
> @@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>          ns = nvme_find_path(head);
>          if (likely(ns)) {
>                  bio_set_dev(bio, ns->disk->part0);
> -               bio->bi_opf |= REQ_NVME_MPATH;
> +               bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
>                  trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>                                        bio->bi_iter.bi_sector);
>                  ret = submit_bio_noacct(bio);
> 
> 
> My only worry here is that we might incur spurious failures under high 
> load; but then this is not necessarily a bad thing.

What? making spurious failures is not ok under any load. what fs will
take into account that you may have run out of tags?
Sagi Grimberg March 23, 2021, 7:36 a.m. UTC | #8
> I check it again. I still think the below patch can avoid the bug.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41 

I don't understand what you are saying...

> 
> The process:
> 1.nvme_ns_head_submit_bio call srcu_read_lock(&head->srcu).
> 2.nvme_ns_head_submit_bio will add the bio to current->bio_list instead 
> of waiting for the frozen queue.

Nothing guarantees that you have a bio_list active at any point in time,
in fact for a workload that submits one by one you will always drain
that list directly in the submission...

> 3.nvme_ns_head_submit_bio call srcu_read_unlock(&head->srcu, srcu_idx).
> So nvme_ns_head_submit_bio do not hold head->srcu long when the queue is 
> frozen, can avoid deadlock.
> 
> Sagi, suggest trying this patch.

The above reproduces with the patch applied on upstream nvme code.
Chao Leng March 23, 2021, 8:13 a.m. UTC | #9
On 2021/3/23 15:36, Sagi Grimberg wrote:
> 
>> I check it again. I still think the below patch can avoid the bug.
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a6c35f9af416114588298aa7a90b15bbed15a41 
> 
> I don't understand what you are saying...
> 
>>
>> The process:
>> 1.nvme_ns_head_submit_bio call srcu_read_lock(&head->srcu).
>> 2.nvme_ns_head_submit_bio will add the bio to current->bio_list instead of waiting for the frozen queue.
> 
> Nothing guarantees that you have a bio_list active at any point in time,
> in fact for a workload that submits one by one you will always drain
> that list directly in the submission...
submit_bio and nvme_requeue_work both guarantee that current->bio_list is set.
The process:
1.submit_bio and nvme_requeue_work will call submit_bio_noacct.
2.submit_bio_noacct will call __submit_bio_noacct because bio->bi_disk->fops->submit_bio = nvme_ns_head_submit_bio.
3.__submit_bio_noacct sets current->bio_list, and then __submit_bio will call bio->bi_disk->fops->submit_bio (nvme_ns_head_submit_bio).
4.nvme_ns_head_submit_bio will add the bio to current->bio_list.
5.__submit_bio_noacct drains current->bio_list.
When draining current->bio_list, it will wait for the frozen queue but does not hold head->srcu,
because it calls blk_mq_submit_bio directly instead of ->submit_bio (nvme_ns_head_submit_bio).
So it is safe.
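
For reference, a heavily simplified sketch of the submission loop being
described (paraphrased from block/blk-core.c of that era, not the literal
upstream code):

--
/*
 * Simplified paraphrase of __submit_bio_noacct() (block/blk-core.c,
 * ~v5.12); details such as the second bio list are omitted.  While
 * ->submit_bio (nvme_ns_head_submit_bio for the head disk) runs,
 * current->bio_list is set, so a nested submit_bio_noacct() only queues
 * the bio; it is actually submitted by the loop below, after the srcu
 * read lock has been dropped.
 */
static blk_qc_t __submit_bio_noacct(struct bio *bio)
{
	struct bio_list bio_list_on_stack[2];
	blk_qc_t ret = BLK_QC_T_NONE;

	bio_list_init(&bio_list_on_stack[0]);
	current->bio_list = bio_list_on_stack;

	do {
		ret = __submit_bio(bio);	/* may wait on a frozen queue */
	} while ((bio = bio_list_pop(&bio_list_on_stack[0])));

	current->bio_list = NULL;
	return ret;
}
--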
> 
>> 3.nvme_ns_head_submit_bio call srcu_read_unlock(&head->srcu, srcu_idx).
>> So nvme_ns_head_submit_bio do not hold head->srcu long when the queue is frozen, can avoid deadlock.
>>
>> Sagi, suggest trying this patch.
> 
> The above reproduces with the patch applied on upstream nvme code.
The new patch (blk_mq_submit_bio_direct) will cause the bug again,
because it reverts adding the bio to current->bio_list.
Just try the upstream nvme code, and do not apply the new patch (blk_mq_submit_bio_direct).
> .
Hannes Reinecke March 23, 2021, 8:36 a.m. UTC | #10
On 3/23/21 8:31 AM, Sagi Grimberg wrote:
> 
>> Actually, I had been playing around with marking the entire bio as 
>> 'NOWAIT'; that would avoid the tag stall, too:
>>
>> @@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>>          ns = nvme_find_path(head);
>>          if (likely(ns)) {
>>                  bio_set_dev(bio, ns->disk->part0);
>> -               bio->bi_opf |= REQ_NVME_MPATH;
>> +               bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
>>                  trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>>                                        bio->bi_iter.bi_sector);
>>                  ret = submit_bio_noacct(bio);
>>
>>
>> My only worry here is that we might incur spurious failures under high 
>> load; but then this is not necessarily a bad thing.
> 
> What? making spurious failures is not ok under any load. what fs will
> take into account that you may have run out of tags?

Well, it's not actually a spurious failure but rather a spurious 
failover, as we're still in a multipath scenario, and bios will still be
re-routed to other paths. Or queued if all paths are out of tags.
Hence the OS would not see any difference in behaviour.

But in the end, we abandoned this attempt, as the crash we've been 
seeing was in bio_endio (due to bi_bdev still pointing to the removed 
path device):

[ 6552.155251]  bio_endio+0x74/0x120
[ 6552.155260]  nvme_ns_head_submit_bio+0x36f/0x3e0 [nvme_core]
[ 6552.155271]  submit_bio_noacct+0x175/0x490
[ 6552.155284]  ? nvme_requeue_work+0x5a/0x70 [nvme_core]
[ 6552.155290]  nvme_requeue_work+0x5a/0x70 [nvme_core]
[ 6552.155296]  process_one_work+0x1f4/0x3e0
[ 6552.155299]  worker_thread+0x2d/0x3e0
[ 6552.155302]  ? process_one_work+0x3e0/0x3e0
[ 6552.155305]  kthread+0x10d/0x130
[ 6552.155307]  ? kthread_park+0xa0/0xa0
[ 6552.155311]  ret_from_fork+0x35/0x40

So we're not blocked on blk_queue_enter(), and it's a crash, not a 
deadlock. Blocking on blk_queue_enter() certainly plays a part here,
but it seems not to be the full picture.

Cheers,

Hannes
Keith Busch March 23, 2021, 2:53 p.m. UTC | #11
On Tue, Mar 23, 2021 at 09:36:47AM +0100, Hannes Reinecke wrote:
> On 3/23/21 8:31 AM, Sagi Grimberg wrote:
> > 
> > > Actually, I had been playing around with marking the entire bio as
> > > 'NOWAIT'; that would avoid the tag stall, too:
> > > 
> > > @@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
> > >          ns = nvme_find_path(head);
> > >          if (likely(ns)) {
> > >                  bio_set_dev(bio, ns->disk->part0);
> > > -               bio->bi_opf |= REQ_NVME_MPATH;
> > > +               bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
> > >                  trace_block_bio_remap(bio, disk_devt(ns->head->disk),
> > >                                        bio->bi_iter.bi_sector);
> > >                  ret = submit_bio_noacct(bio);
> > > 
> > > 
> > > My only worry here is that we might incur spurious failures under
> > > high load; but then this is not necessarily a bad thing.
> > 
> > What? making spurious failures is not ok under any load. what fs will
> > take into account that you may have run out of tags?
> 
> Well, it's not actually a spurious failure but rather a spurious failover,
> as we're still on a multipath scenario, and bios will still be re-routed to
> other paths. Or queued if all paths are out of tags.
> Hence the OS would not see any difference in behaviour.

Failover might be overkill. We can run out of tags in a perfectly normal
situation, and simply waiting may be the best option, or even scheduling
on a different CPU may be sufficient to get a viable tag rather than
selecting a different path.

Does it make sense to just abort all allocated tags during a reset and
let the original bio requeue for multipath IO?

 
> But in the end, we abandoned this attempt, as the crash we've been seeing
> was in bio_endio (due to bi_bdev still pointing to the removed path device):
> 
> [ 6552.155251]  bio_endio+0x74/0x120
> [ 6552.155260]  nvme_ns_head_submit_bio+0x36f/0x3e0 [nvme_core]
> [ 6552.155271]  submit_bio_noacct+0x175/0x490
> [ 6552.155284]  ? nvme_requeue_work+0x5a/0x70 [nvme_core]
> [ 6552.155290]  nvme_requeue_work+0x5a/0x70 [nvme_core]
> [ 6552.155296]  process_one_work+0x1f4/0x3e0
> [ 6552.155299]  worker_thread+0x2d/0x3e0
> [ 6552.155302]  ? process_one_work+0x3e0/0x3e0
> [ 6552.155305]  kthread+0x10d/0x130
> [ 6552.155307]  ? kthread_park+0xa0/0xa0
> [ 6552.155311]  ret_from_fork+0x35/0x40
> 
> So we're not blocked on blk_queue_enter(), and it's a crash, not a deadlock.
> Blocking on blk_queue_enter() certainly plays a part here,
> but is seems not to be the full picture.
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Christoph Hellwig March 23, 2021, 4:15 p.m. UTC | #12
On Tue, Mar 23, 2021 at 12:36:40AM -0700, Sagi Grimberg wrote:
>> The process:
>> 1.nvme_ns_head_submit_bio call srcu_read_lock(&head->srcu).
>> 2.nvme_ns_head_submit_bio will add the bio to current->bio_list instead of 
>> waiting for the frozen queue.
>
> Nothing guarantees that you have a bio_list active at any point in time,
> in fact for a workload that submits one by one you will always drain
> that list directly in the submission...

It should always be active when ->submit_bio is called.

>
>> 3.nvme_ns_head_submit_bio call srcu_read_unlock(&head->srcu, srcu_idx).
>> So nvme_ns_head_submit_bio do not hold head->srcu long when the queue is 
>> frozen, can avoid deadlock.
>>
>> Sagi, suggest trying this patch.
>
> The above reproduces with the patch applied on upstream nvme code.

Weird.  I don't think the deadlock in your original report should
happen due to this.  Can you take a look at the callstacks in the
reproduced deadlock?  Either we're missing something obvious or it is
a somewhat different deadlock.
Christoph Hellwig March 23, 2021, 4:17 p.m. UTC | #13
On Tue, Mar 23, 2021 at 04:13:09PM +0800, Chao Leng wrote:
>> The above reproduces with the patch applied on upstream nvme code.The new patch(blk_mq_submit_bio_direct) will cause the bug again.
> Because it revert add the bio to current->bio_list.
> Just try the upstream nvme code, and do not apply the new patch(blk_mq_submit_bio_direct).

Indeed, the patch removes the deferred submission, so the original
deadlock should not happen.
Christoph Hellwig March 23, 2021, 4:19 p.m. UTC | #14
On Tue, Mar 23, 2021 at 11:53:30PM +0900, Keith Busch wrote:
> Failover might be overkill. We can run out of tags in a perfectly normal
> situation, and simply waiting may be the best option, or even scheduling
> on a different CPU may be sufficient to get a viable tag  rather than
> selecting a different path.

Indeed.  Then again, IFF there are multiple optimized paths, picking one
could also be a way to make progress.

> Does it make sense to just abort all allocated tags during a reset and
> let the original bio requeue for multipath IO?

Well, this would again hard code multipath information into the lower
levels.  But then again we could at least do it so that we check for
the REQ_NVME_MPATH to make it clear what is going on.
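
Purely as an illustration of the suggestion above (this is not a patch from
the thread, and the callback name and status code are assumptions), a tagset
iteration callback along these lines could fail only multipath requests with
a path error so the failover path requeues them:

--
/*
 * Illustration only -- not a patch from this thread.  A transport could
 * run this via blk_mq_tagset_busy_iter() during reset to abort inflight
 * multipath I/O with a path error, letting the nvme-multipath completion
 * path requeue it on the ns_head instead of blocking until reconnect.
 */
static bool nvme_abort_mpath_request(struct request *req, void *data,
		bool reserved)
{
	if (req->cmd_flags & REQ_NVME_MPATH) {
		nvme_req(req)->status = NVME_SC_HOST_PATH_ERROR;
		blk_mq_complete_request(req);
	}
	return true;	/* keep iterating over the tagset */
}
--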
Sagi Grimberg March 23, 2021, 6:13 p.m. UTC | #15
>>> Sagi, suggest trying this patch.
>>
>> The above reproduces with the patch applied on upstream nvme code.
> 
> Weird.  I don't think the deadlock in your original report should
> happen due to this.  Can you take a look at the callstacks in the
> reproduced deadlock?  Either we're missing something obvious or it is a
> a somewhat different deadlock.

The deadlock in this patchset reproduces upstream. It is not possible to
update the kernel in the env in the original report.

So IFF we assume that this does not reproduce in upstream (pending
proof), is there something that we can do with stable fixes? This will
probably go back to everything that is before 5.8...
Christoph Hellwig March 23, 2021, 6:22 p.m. UTC | #16
On Tue, Mar 23, 2021 at 11:13:13AM -0700, Sagi Grimberg wrote:
> The deadlock in this patchset reproduces upstream. It is not possible to
> update the kernel in the env in the original report.
>
> So IFF we assume that this does not reproduce in upstream (pending
> proof), is there something that we can do with stable fixes? This will
> probably go back to everything that is before 5.8...

The direct_make_request removal should be pretty easily backportable.
In old kernels without the streamlined normal path it might cause minor
performance degradations, but the actual change should be trivial.
Sagi Grimberg March 23, 2021, 7 p.m. UTC | #17
>> The deadlock in this patchset reproduces upstream. It is not possible to
>> update the kernel in the env in the original report.
>>
>> So IFF we assume that this does not reproduce in upstream (pending
>> proof), is there something that we can do with stable fixes? This will
>> probably go back to everything that is before 5.8...
> 
> The direct_make_request removal should be pretty easily backportable.

Umm, the direct_make_request removal is based on the submit_bio_noacct
rework you've done. Are you suggesting that we replace it with
generic_make_request for these kernels?

> In old kernels without the streamlined normal path it might cause minor
> performance degradations, but the actual change should be trivial.

That is better than getting stuck in I/O failover...
Christoph Hellwig March 23, 2021, 7:01 p.m. UTC | #18
On Tue, Mar 23, 2021 at 12:00:34PM -0700, Sagi Grimberg wrote:
>
>>> The deadlock in this patchset reproduces upstream. It is not possible to
>>> update the kernel in the env in the original report.
>>>
>>> So IFF we assume that this does not reproduce in upstream (pending
>>> proof), is there something that we can do with stable fixes? This will
>>> probably go back to everything that is before 5.8...
>>
>> The direct_make_request removal should be pretty easily backportable.
>
> Umm, the direct_make_request removal is based on the submit_bio_noacct
> rework you've done. Are you suggesting that we replace it with
> generic_make_request for these kernels?

Yes.
Sagi Grimberg March 23, 2021, 7:10 p.m. UTC | #19
>>>> The deadlock in this patchset reproduces upstream. It is not possible to
>>>> update the kernel in the env in the original report.
>>>>
>>>> So IFF we assume that this does not reproduce in upstream (pending
>>>> proof), is there something that we can do with stable fixes? This will
>>>> probably go back to everything that is before 5.8...
>>>
>>> The direct_make_request removal should be pretty easily backportable.
>>
>> Umm, the direct_make_request removal is based on the submit_bio_noacct
>> rework you've done. Are you suggesting that we replace it with
>> generic_make_request for these kernels?
> 
> Yes.

OK, that should be testable...
diff mbox series

Patch

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac020f..92adebfaf86fd1 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -309,6 +309,7 @@  blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
 	 */
 	blk_queue_split(&bio);
 
+retry:
 	srcu_idx = srcu_read_lock(&head->srcu);
 	ns = nvme_find_path(head);
 	if (likely(ns)) {
@@ -316,7 +317,12 @@  blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
 		bio->bi_opf |= REQ_NVME_MPATH;
 		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
 				      bio->bi_iter.bi_sector);
-		ret = submit_bio_noacct(bio);
+
+		if (!blk_mq_submit_bio_direct(bio, &ret)) {
+			nvme_mpath_clear_current_path(ns);
+			srcu_read_unlock(&head->srcu, srcu_idx);
+			goto retry;
+		}
 	} else if (nvme_available_path(head)) {
 		dev_warn_ratelimited(dev, "no usable path - requeuing I/O\n");