Message ID | 20230620013349.906601-1-ming.lei@redhat.com (mailing list archive)
---|---
Series | nvme: fix two kinds of IO hang from removing NSs
> Hello,
>
> The 1st three patches fix io hang when controller removal interrupts error
> recovery, then the queue is left frozen.
>
> The 4th patch fixes io hang when the controller is left unquiesced.

Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
connect patches?

I think that these are patches that should go in regardless because they
also fix I/O failover blocked on request queue enter.
On Tue, Jun 20, 2023 at 01:04:33PM +0300, Sagi Grimberg wrote:
> > Hello,
> >
> > The 1st three patches fix io hang when controller removal interrupts error
> > recovery, then the queue is left frozen.
> >
> > The 4th patch fixes io hang when the controller is left unquiesced.
>
> Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
> connect patches?

I'd suggest handling all drivers (including nvme-pci) with the same logic
to avoid extra maintenance burden wrt. error handling, but it looks like
Keith worries that the delayed freezing may cause too many requests queued
during error handling, and that might cause user reports.

Thanks,
Ming
>>> Hello,
>>>
>>> The 1st three patches fix io hang when controller removal interrupts error
>>> recovery, then the queue is left frozen.
>>>
>>> The 4th patch fixes io hang when the controller is left unquiesced.
>>
>> Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
>> connect patches?
>
> I'd suggest handling all drivers (including nvme-pci) with the same logic
> to avoid extra maintenance burden wrt. error handling, but it looks like
> Keith worries that the delayed freezing may cause too many requests queued
> during error handling, and that might cause user reports.

For nvme-tcp/rdma your patch also addresses IO not failing over because
they block on queue enter. So I definitely want this for fabrics.

AFAICT nvme-pci would also want to failover asap for dual-ported
multipathed devices, not sure if this is something that we are interested
in optimizing though, as pci either succeeds the reset, or removes the
gendisk. But the time-frame is different for fabrics for sure.
On Tue, Jun 20, 2023 at 04:40:49PM +0300, Sagi Grimberg wrote:
> > > > Hello,
> > > >
> > > > The 1st three patches fix io hang when controller removal interrupts error
> > > > recovery, then the queue is left frozen.
> > > >
> > > > The 4th patch fixes io hang when the controller is left unquiesced.
> > >
> > > Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
> > > connect patches?
> >
> > I'd suggest handling all drivers (including nvme-pci) with the same logic
> > to avoid extra maintenance burden wrt. error handling, but it looks like
> > Keith worries that the delayed freezing may cause too many requests queued
> > during error handling, and that might cause user reports.
>
> For nvme-tcp/rdma your patch also addresses IO not failing over because
> they block on queue enter. So I definitely want this for fabrics.

The patch in the following link should fix these issues too:

https://lore.kernel.org/linux-block/ZJGmW7lEaipT6saa@ovpn-8-23.pek2.redhat.com/T/#u

I guess you still want the paired freeze patch because it makes freeze &
unfreeze more reliable in error handling. If yes, I can make one
fabric-only change for you.

Thanks,
Ming
>>>>> Hello,
>>>>>
>>>>> The 1st three patches fix io hang when controller removal interrupts error
>>>>> recovery, then the queue is left frozen.
>>>>>
>>>>> The 4th patch fixes io hang when the controller is left unquiesced.
>>>>
>>>> Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
>>>> connect patches?
>>>
>>> I'd suggest handling all drivers (including nvme-pci) with the same logic
>>> to avoid extra maintenance burden wrt. error handling, but it looks like
>>> Keith worries that the delayed freezing may cause too many requests queued
>>> during error handling, and that might cause user reports.
>>
>> For nvme-tcp/rdma your patch also addresses IO not failing over because
>> they block on queue enter. So I definitely want this for fabrics.
>
> The patch in the following link should fix these issues too:
>
> https://lore.kernel.org/linux-block/ZJGmW7lEaipT6saa@ovpn-8-23.pek2.redhat.com/T/#u
>
> I guess you still want the paired freeze patch because it makes freeze &
> unfreeze more reliable in error handling. If yes, I can make one
> fabric-only change for you.

Not sure exactly what reliability is referred to here. I agree that there
is an issue with controller delete during error recovery. The patch was a
way to side-step it, great. But it addressed I/O blocked on enter and not
failing over.

So yes, for fabrics we should have it. I would argue that it would be the
right thing to do for pci as well. But I won't argue if Keith feels
otherwise.
On Wed, Jun 21, 2023 at 01:13:05PM +0300, Sagi Grimberg wrote:
> > > > > > Hello,
> > > > > >
> > > > > > The 1st three patches fix io hang when controller removal interrupts error
> > > > > > recovery, then the queue is left frozen.
> > > > > >
> > > > > > The 4th patch fixes io hang when the controller is left unquiesced.
> > > > >
> > > > > Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
> > > > > connect patches?
> > > >
> > > > I'd suggest handling all drivers (including nvme-pci) with the same logic
> > > > to avoid extra maintenance burden wrt. error handling, but it looks like
> > > > Keith worries that the delayed freezing may cause too many requests queued
> > > > during error handling, and that might cause user reports.
> > >
> > > For nvme-tcp/rdma your patch also addresses IO not failing over because
> > > they block on queue enter. So I definitely want this for fabrics.
> >
> > The patch in the following link should fix these issues too:
> >
> > https://lore.kernel.org/linux-block/ZJGmW7lEaipT6saa@ovpn-8-23.pek2.redhat.com/T/#u
> >
> > I guess you still want the paired freeze patch because it makes freeze &
> > unfreeze more reliable in error handling. If yes, I can make one
> > fabric-only change for you.
>
> Not sure exactly what reliability is referred to here.

freeze and unfreeze have to be called strictly in pairs, but nvme calls
the two from different contexts, so the unfreeze may easily be missed,
which causes a mismatched freeze count. There have been many such reports
so far.

> I agree that there
> is an issue with controller delete during error recovery. The patch was a
> way to side-step it, great. But it addressed I/O blocked on enter and not
> failing over.
>
> So yes, for fabrics we should have it. I would argue that it would be the
> right thing to do for pci as well. But I won't argue if Keith feels
> otherwise.

Keith, can you let us know if you are fine with moving
nvme_start_freeze() into nvme_reset_work() for the nvme pci driver?

Thanks,
Ming
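For reference, the pairing requirement Ming describes can be sketched with the existing nvme core helpers. The function below is only an illustration of keeping freeze and unfreeze in one context; example_reset_ctrl() is a made-up name and this is not code from the series:

/*
 * Sketch only: a reset path that issues freeze and unfreeze from the same
 * context, so the namespace queue freeze depth always returns to zero.
 * The helpers are the existing ones declared in drivers/nvme/host/nvme.h.
 */
static void example_reset_ctrl(struct nvme_ctrl *ctrl)
{
	nvme_start_freeze(ctrl);	/* bump the freeze depth on all namespace queues */

	/* ... tear down the old IO queues and re-create/reconnect them ... */

	nvme_unquiesce_io_queues(ctrl);	/* let already-allocated requests proceed */
	nvme_wait_freeze(ctrl);		/* wait for in-flight requests to drain */
	/* blk_mq_update_nr_hw_queues() would run here, with the queues frozen */
	nvme_unfreeze(ctrl);		/* always pairs with the freeze above */
}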
On Wed, Jun 21, 2023 at 09:27:31PM +0800, Ming Lei wrote:
> On Wed, Jun 21, 2023 at 01:13:05PM +0300, Sagi Grimberg wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > The 1st three patches fix io hang when controller removal interrupts error
> > > > > > > recovery, then the queue is left frozen.
> > > > > > >
> > > > > > > The 4th patch fixes io hang when the controller is left unquiesced.
> > > > > >
> > > > > > Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
> > > > > > connect patches?
> > > > >
> > > > > I'd suggest handling all drivers (including nvme-pci) with the same logic
> > > > > to avoid extra maintenance burden wrt. error handling, but it looks like
> > > > > Keith worries that the delayed freezing may cause too many requests queued
> > > > > during error handling, and that might cause user reports.
> > > >
> > > > For nvme-tcp/rdma your patch also addresses IO not failing over because
> > > > they block on queue enter. So I definitely want this for fabrics.
> > >
> > > The patch in the following link should fix these issues too:
> > >
> > > https://lore.kernel.org/linux-block/ZJGmW7lEaipT6saa@ovpn-8-23.pek2.redhat.com/T/#u
> > >
> > > I guess you still want the paired freeze patch because it makes freeze &
> > > unfreeze more reliable in error handling. If yes, I can make one
> > > fabric-only change for you.
> >
> > Not sure exactly what reliability is referred to here.
>
> freeze and unfreeze have to be called strictly in pairs, but nvme calls
> the two from different contexts, so the unfreeze may easily be missed,
> which causes a mismatched freeze count. There have been many such reports
> so far.
>
> > I agree that there
> > is an issue with controller delete during error recovery. The patch was a
> > way to side-step it, great. But it addressed I/O blocked on enter and not
> > failing over.
> >
> > So yes, for fabrics we should have it. I would argue that it would be the
> > right thing to do for pci as well. But I won't argue if Keith feels
> > otherwise.
>
> Keith, can you let us know if you are fine with moving
> nvme_start_freeze() into nvme_reset_work() for the nvme pci driver?

The point was to contain requests from entering while the hctx's are
being reconfigured. If you're going to pair up the freezes as you've
suggested, we might as well just not call freeze at all.
On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> On Wed, Jun 21, 2023 at 09:27:31PM +0800, Ming Lei wrote:
> > On Wed, Jun 21, 2023 at 01:13:05PM +0300, Sagi Grimberg wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > The 1st three patches fix io hang when controller removal interrupts error
> > > > > > > > recovery, then the queue is left frozen.
> > > > > > > >
> > > > > > > > The 4th patch fixes io hang when the controller is left unquiesced.
> > > > > > >
> > > > > > > Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
> > > > > > > connect patches?
> > > > > >
> > > > > > I'd suggest handling all drivers (including nvme-pci) with the same logic
> > > > > > to avoid extra maintenance burden wrt. error handling, but it looks like
> > > > > > Keith worries that the delayed freezing may cause too many requests queued
> > > > > > during error handling, and that might cause user reports.
> > > > >
> > > > > For nvme-tcp/rdma your patch also addresses IO not failing over because
> > > > > they block on queue enter. So I definitely want this for fabrics.
> > > >
> > > > The patch in the following link should fix these issues too:
> > > >
> > > > https://lore.kernel.org/linux-block/ZJGmW7lEaipT6saa@ovpn-8-23.pek2.redhat.com/T/#u
> > > >
> > > > I guess you still want the paired freeze patch because it makes freeze &
> > > > unfreeze more reliable in error handling. If yes, I can make one
> > > > fabric-only change for you.
> > >
> > > Not sure exactly what reliability is referred to here.
> >
> > freeze and unfreeze have to be called strictly in pairs, but nvme calls
> > the two from different contexts, so the unfreeze may easily be missed,
> > which causes a mismatched freeze count. There have been many such reports
> > so far.
> >
> > > I agree that there
> > > is an issue with controller delete during error recovery. The patch was a
> > > way to side-step it, great. But it addressed I/O blocked on enter and not
> > > failing over.
> > >
> > > So yes, for fabrics we should have it. I would argue that it would be the
> > > right thing to do for pci as well. But I won't argue if Keith feels
> > > otherwise.
> >
> > Keith, can you let us know if you are fine with moving
> > nvme_start_freeze() into nvme_reset_work() for the nvme pci driver?
>
> The point was to contain requests from entering while the hctx's are
> being reconfigured. If you're going to pair up the freezes as you've
> suggested, we might as well just not call freeze at all.

blk_mq_update_nr_hw_queues() requires queue to be frozen.

Thanks,
Ming
On Thu, Jun 22, 2023 at 09:51:12PM +0800, Ming Lei wrote:
> On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> > The point was to contain requests from entering while the hctx's are
> > being reconfigured. If you're going to pair up the freezes as you've
> > suggested, we might as well just not call freeze at all.
>
> blk_mq_update_nr_hw_queues() requires queue to be frozen.

It's too late at that point. Let's work through a real example. You'll
need a system that has more CPU's than your nvme has IO queues.

Boot without any special nvme parameters. Every possible nvme IO queue
will be assigned "default" hctx type. Now start IO to every queue, then
run:

# echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller

Today, we freeze prior to tearing down the "default" IO queues, so
there's nothing entered into them while the driver reconfigures the
queues.

What you're suggesting will allow IO to queue up in a quiesced "default"
queue, which will become "polled" without an interrupt handler on the
other side of the reset. The application doesn't know that, so the IO
you're allowing to queue up will time out.
On Thu, Jun 22, 2023 at 08:35:49AM -0600, Keith Busch wrote:
> On Thu, Jun 22, 2023 at 09:51:12PM +0800, Ming Lei wrote:
> > On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> > > The point was to contain requests from entering while the hctx's are
> > > being reconfigured. If you're going to pair up the freezes as you've
> > > suggested, we might as well just not call freeze at all.
> >
> > blk_mq_update_nr_hw_queues() requires queue to be frozen.
>
> It's too late at that point. Let's work through a real example. You'll
> need a system that has more CPU's than your nvme has IO queues.
>
> Boot without any special nvme parameters. Every possible nvme IO queue
> will be assigned "default" hctx type. Now start IO to every queue, then
> run:
>
> # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
>
> Today, we freeze prior to tearing down the "default" IO queues, so
> there's nothing entered into them while the driver reconfigures the
> queues.

nvme_start_freeze() just prevents new IO from being queued, and old ones
may still be entering the block layer queue, and what matters here is
actually quiesce, which prevents new IO from being queued to
driver/hardware.

> What you're suggesting will allow IO to queue up in a quiesced "default"
> queue, which will become "polled" without an interrupt handler on the
> other side of the reset. The application doesn't know that, so the IO
> you're allowing to queue up will time out.

A timeout only happens after the request is queued to driver/hardware, or
after blk_mq_start_request() is called in nvme_queue_rq(), but quiesce
actually prevents new IOs from being dispatched to the driver or queued
via .queue_rq(); meantime old requests have been canceled, so no request
can be timed out.

Thanks,
Ming
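The freeze/quiesce distinction argued here can be summarized with the generic blk-mq helpers; the sketch below is an illustration for the reader (the wrapper function name is made up), not code from the thread:

#include <linux/blk-mq.h>

/* Illustration only: freeze vs. quiesce as discussed above. */
static void example_freeze_vs_quiesce(struct request_queue *q)
{
	/*
	 * Freeze blocks *new* requests at blk_queue_enter(); requests that
	 * already entered the queue still run to completion. Every freeze
	 * must eventually be paired with an unfreeze, or the freeze depth
	 * stays elevated and later submitters hang.
	 */
	blk_mq_freeze_queue(q);
	blk_mq_unfreeze_queue(q);

	/*
	 * Quiesce keeps already-allocated requests from being handed to
	 * ->queue_rq(). Their timeout timers only start once the driver
	 * calls blk_mq_start_request(), so requests held back by quiesce
	 * do not time out on their own.
	 */
	blk_mq_quiesce_queue(q);
	blk_mq_unquiesce_queue(q);
}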
On Thu, Jun 22, 2023 at 10:53:05PM +0800, Ming Lei wrote:
> On Thu, Jun 22, 2023 at 08:35:49AM -0600, Keith Busch wrote:
> > On Thu, Jun 22, 2023 at 09:51:12PM +0800, Ming Lei wrote:
> > > On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> > > > The point was to contain requests from entering while the hctx's are
> > > > being reconfigured. If you're going to pair up the freezes as you've
> > > > suggested, we might as well just not call freeze at all.
> > >
> > > blk_mq_update_nr_hw_queues() requires queue to be frozen.
> >
> > It's too late at that point. Let's work through a real example. You'll
> > need a system that has more CPU's than your nvme has IO queues.
> >
> > Boot without any special nvme parameters. Every possible nvme IO queue
> > will be assigned "default" hctx type. Now start IO to every queue, then
> > run:
> >
> > # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
> >
> > Today, we freeze prior to tearing down the "default" IO queues, so
> > there's nothing entered into them while the driver reconfigures the
> > queues.
>
> nvme_start_freeze() just prevents new IO from being queued, and old ones
> may still be entering the block layer queue, and what matters here is
> actually quiesce, which prevents new IO from being queued to
> driver/hardware.
>
> > What you're suggesting will allow IO to queue up in a quiesced "default"
> > queue, which will become "polled" without an interrupt handler on the
> > other side of the reset. The application doesn't know that, so the IO
> > you're allowing to queue up will time out.
>
> A timeout only happens after the request is queued to driver/hardware, or
> after blk_mq_start_request() is called in nvme_queue_rq(), but quiesce
> actually prevents new IOs from being dispatched to the driver or queued
> via .queue_rq(); meantime old requests have been canceled, so no request
> can be timed out.

Quiesce doesn't prevent requests from entering an hctx, and you can't
back it out to put on another hctx later. It doesn't matter that you
haven't dispatched it to hardware yet. The request's queue was set the
moment it was allocated, so after you unquiesce and freeze for the new
queue mapping, the requests previously blocked on quiesce will time out
in the scenario I've described.

There are certainly gaps in the existing code where error'ed requests
can be requeued or stuck elsewhere and hit the exact same problem, but
the current way at least tries to contain it.
On Thu, Jun 22, 2023 at 09:19:19AM -0600, Keith Busch wrote:
> On Thu, Jun 22, 2023 at 10:53:05PM +0800, Ming Lei wrote:
> > On Thu, Jun 22, 2023 at 08:35:49AM -0600, Keith Busch wrote:
> > > On Thu, Jun 22, 2023 at 09:51:12PM +0800, Ming Lei wrote:
> > > > On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> > > > > The point was to contain requests from entering while the hctx's are
> > > > > being reconfigured. If you're going to pair up the freezes as you've
> > > > > suggested, we might as well just not call freeze at all.
> > > >
> > > > blk_mq_update_nr_hw_queues() requires queue to be frozen.
> > >
> > > It's too late at that point. Let's work through a real example. You'll
> > > need a system that has more CPU's than your nvme has IO queues.
> > >
> > > Boot without any special nvme parameters. Every possible nvme IO queue
> > > will be assigned "default" hctx type. Now start IO to every queue, then
> > > run:
> > >
> > > # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
> > >
> > > Today, we freeze prior to tearing down the "default" IO queues, so
> > > there's nothing entered into them while the driver reconfigures the
> > > queues.
> >
> > nvme_start_freeze() just prevents new IO from being queued, and old ones
> > may still be entering the block layer queue, and what matters here is
> > actually quiesce, which prevents new IO from being queued to
> > driver/hardware.
> >
> > > What you're suggesting will allow IO to queue up in a quiesced "default"
> > > queue, which will become "polled" without an interrupt handler on the
> > > other side of the reset. The application doesn't know that, so the IO
> > > you're allowing to queue up will time out.
> >
> > A timeout only happens after the request is queued to driver/hardware, or
> > after blk_mq_start_request() is called in nvme_queue_rq(), but quiesce
> > actually prevents new IOs from being dispatched to the driver or queued
> > via .queue_rq(); meantime old requests have been canceled, so no request
> > can be timed out.
>
> Quiesce doesn't prevent requests from entering an hctx, and you can't
> back it out to put on another hctx later. It doesn't matter that you
> haven't dispatched it to hardware yet. The request's queue was set the
> moment it was allocated, so after you unquiesce and freeze for the new
> queue mapping, the requests previously blocked on quiesce will time out
> in the scenario I've described.
>
> There are certainly gaps in the existing code where error'ed requests
> can be requeued or stuck elsewhere and hit the exact same problem, but
> the current way at least tries to contain it.

Yeah, but you can't remove the gap at all with start_freeze, that said
the current code has to live with the situation of new mapping change
and old request with old mapping.

Actually I considered handling this kind of situation before; one
approach is to reuse the bio steal logic taken in nvme mpath:

1) for FS IO, re-submit bios, meantime free the request

2) for PT requests, simply fail them

It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
always set for PT requests, but I don't see any better approach for
handling PT requests.

Thanks,
Ming
>>>>>> The point was to contain requests from entering while the hctx's are
>>>>>> being reconfigured. If you're going to pair up the freezes as you've
>>>>>> suggested, we might as well just not call freeze at all.
>>>>>
>>>>> blk_mq_update_nr_hw_queues() requires queue to be frozen.
>>>>
>>>> It's too late at that point. Let's work through a real example. You'll
>>>> need a system that has more CPU's than your nvme has IO queues.
>>>>
>>>> Boot without any special nvme parameters. Every possible nvme IO queue
>>>> will be assigned "default" hctx type. Now start IO to every queue, then
>>>> run:
>>>>
>>>> # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
>>>>
>>>> Today, we freeze prior to tearing down the "default" IO queues, so
>>>> there's nothing entered into them while the driver reconfigures the
>>>> queues.
>>>
>>> nvme_start_freeze() just prevents new IO from being queued, and old ones
>>> may still be entering the block layer queue, and what matters here is
>>> actually quiesce, which prevents new IO from being queued to
>>> driver/hardware.
>>>
>>>> What you're suggesting will allow IO to queue up in a quiesced "default"
>>>> queue, which will become "polled" without an interrupt handler on the
>>>> other side of the reset. The application doesn't know that, so the IO
>>>> you're allowing to queue up will time out.
>>>
>>> A timeout only happens after the request is queued to driver/hardware, or
>>> after blk_mq_start_request() is called in nvme_queue_rq(), but quiesce
>>> actually prevents new IOs from being dispatched to the driver or queued
>>> via .queue_rq(); meantime old requests have been canceled, so no request
>>> can be timed out.
>>
>> Quiesce doesn't prevent requests from entering an hctx, and you can't
>> back it out to put on another hctx later. It doesn't matter that you
>> haven't dispatched it to hardware yet. The request's queue was set the
>> moment it was allocated, so after you unquiesce and freeze for the new
>> queue mapping, the requests previously blocked on quiesce will time out
>> in the scenario I've described.
>>
>> There are certainly gaps in the existing code where error'ed requests
>> can be requeued or stuck elsewhere and hit the exact same problem, but
>> the current way at least tries to contain it.
>
> Yeah, but you can't remove the gap at all with start_freeze, that said
> the current code has to live with the situation of new mapping change
> and old request with old mapping.
>
> Actually I considered handling this kind of situation before; one
> approach is to reuse the bio steal logic taken in nvme mpath:
>
> 1) for FS IO, re-submit bios, meantime free the request
>
> 2) for PT requests, simply fail them
>
> It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
> always set for PT requests, but I don't see any better approach for
> handling PT requests.

Ming, I suggest submitting patches for tcp/rdma and continuing the
discussion on the pci driver.
On Sun, Jun 25, 2023 at 08:26:48AM +0800, Ming Lei wrote:
> Yeah, but you can't remove the gap at all with start_freeze, that said
> the current code has to live with the situation of new mapping change
> and old request with old mapping.
>
> Actually I considered handling this kind of situation before; one
> approach is to reuse the bio steal logic taken in nvme mpath:
>
> 1) for FS IO, re-submit bios, meantime free the request
>
> 2) for PT requests, simply fail them
>
> It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
> always set for PT requests, but I don't see any better approach for
> handling PT requests.

I think that's acceptable for PT requests, or any request that doesn't
have a bio. I tried something similar a while back that was almost
working, but I either never posted it, or it's in that window when
infradead lost all the emails. :(

Anyway, for the pci controller, I think I see the problem you're fixing.
When reset_work fails, we used to do the mark dead + unquiesce via
"nvme_kill_queues()", which doesn't exist anymore, but I think your
scenario worked back then. Currently a failed nvme_reset_work simply
marks them dead without the unquiesce. Would it be enough to just bring
that unquiesce behavior back?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b027e5e3f4acb..8eaa954aa6ed4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2778,6 +2778,7 @@ static void nvme_reset_work(struct work_struct *work)
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
 	nvme_dev_disable(dev, true);
 	nvme_mark_namespaces_dead(&dev->ctrl);
+	nvme_unquiesce_io_queues(&dev->ctrl);
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
 }
--
>> Yeah, but you can't remove the gap at all with start_freeze, that said
>> the current code has to live with the situation of new mapping change
>> and old request with old mapping.
>>
>> Actually I considered handling this kind of situation before; one
>> approach is to reuse the bio steal logic taken in nvme mpath:
>>
>> 1) for FS IO, re-submit bios, meantime free the request
>>
>> 2) for PT requests, simply fail them
>>
>> It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
>> always set for PT requests, but I don't see any better approach for
>> handling PT requests.
>
> I think that's acceptable for PT requests, or any request that doesn't
> have a bio. I tried something similar a while back that was almost
> working, but I either never posted it, or it's in that window when
> infradead lost all the emails. :(
>
> Anyway, for the pci controller, I think I see the problem you're fixing.
> When reset_work fails, we used to do the mark dead + unquiesce via
> "nvme_kill_queues()", which doesn't exist anymore, but I think your
> scenario worked back then. Currently a failed nvme_reset_work simply
> marks them dead without the unquiesce. Would it be enough to just bring
> that unquiesce behavior back?
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b027e5e3f4acb..8eaa954aa6ed4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2778,6 +2778,7 @@ static void nvme_reset_work(struct work_struct *work)
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>  	nvme_dev_disable(dev, true);
>  	nvme_mark_namespaces_dead(&dev->ctrl);
> +	nvme_unquiesce_io_queues(&dev->ctrl);
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>  }
>
> --

I think this should work.
On Tue, Jun 27, 2023 at 11:21:36AM -0600, Keith Busch wrote:
> On Sun, Jun 25, 2023 at 08:26:48AM +0800, Ming Lei wrote:
> > Yeah, but you can't remove the gap at all with start_freeze, that said
> > the current code has to live with the situation of new mapping change
> > and old request with old mapping.
> >
> > Actually I considered handling this kind of situation before; one
> > approach is to reuse the bio steal logic taken in nvme mpath:
> >
> > 1) for FS IO, re-submit bios, meantime free the request
> >
> > 2) for PT requests, simply fail them
> >
> > It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
> > always set for PT requests, but I don't see any better approach for
> > handling PT requests.
>
> I think that's acceptable for PT requests, or any request that doesn't
> have a bio. I tried something similar a while back that was almost
> working, but I either never posted it, or it's in that window when
> infradead lost all the emails. :(
>
> Anyway, for the pci controller, I think I see the problem you're fixing.
> When reset_work fails, we used to do the mark dead + unquiesce via
> "nvme_kill_queues()", which doesn't exist anymore, but I think your
> scenario worked back then. Currently a failed nvme_reset_work simply
> marks them dead without the unquiesce. Would it be enough to just bring
> that unquiesce behavior back?
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b027e5e3f4acb..8eaa954aa6ed4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2778,6 +2778,7 @@ static void nvme_reset_work(struct work_struct *work)
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>  	nvme_dev_disable(dev, true);
>  	nvme_mark_namespaces_dead(&dev->ctrl);
> +	nvme_unquiesce_io_queues(&dev->ctrl);
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>  }

That may not be enough:

- What if nvme_sysfs_delete() is called from sysfs before the 1st check in
  nvme_reset_work()?

- What if one pending nvme_dev_disable()<-nvme_timeout() comes after
  the added nvme_unquiesce_io_queues() returns?

Thanks,
Ming
On Tue, Jun 27, 2023 at 11:21:36AM -0600, Keith Busch wrote:
> On Sun, Jun 25, 2023 at 08:26:48AM +0800, Ming Lei wrote:
> > Yeah, but you can't remove the gap at all with start_freeze, that said
> > the current code has to live with the situation of new mapping change
> > and old request with old mapping.
> >
> > Actually I considered handling this kind of situation before; one
> > approach is to reuse the bio steal logic taken in nvme mpath:
> >
> > 1) for FS IO, re-submit bios, meantime free the request
> >
> > 2) for PT requests, simply fail them
> >
> > It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
> > always set for PT requests, but I don't see any better approach for
> > handling PT requests.
>
> I think that's acceptable for PT requests, or any request that doesn't
> have a bio. I tried something similar a while back that was almost
> working, but I either never posted it, or it's in that window when
> infradead lost all the emails. :(

If you are fine to fail PT requests, I'd suggest handling the problem in
the following way:

1) moving freeze into reset

2) during resetting

   - freeze NS queues
   - unquiesce NS queues
   - nvme_wait_freeze()
   - update_nr_hw_queues
   - unfreeze NS queues

3) meantime change the driver's ->queue_rq() in case the ctrl state is
   NVME_CTRL_CONNECTING:

   - if the request is FS IO with data, re-submit all bios of this
     request, and free the request

   - otherwise, fail the request

With this approach, not only is freeze paired with unfreeze; more
importantly, it becomes impossible to trigger a new timeout while
handling NVME_CTRL_CONNECTING, so the fallback to ctrl removal can be
avoided.

Any comment on this approach?

Thanks,
Ming
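A rough sketch of what step 3) might look like inside a fabrics ->queue_rq() path, for illustration only; the function name and the requeue_list parameter are assumptions, not part of a posted patch:

/*
 * Sketch only: handling a request in ->queue_rq() while the controller is
 * in NVME_CTRL_CONNECTING. FS I/O gives its bios back for re-submission
 * (the nvme multipath "bio steal" pattern); passthrough requests are
 * failed directly. "requeue_list" is an assumed per-controller bio list.
 */
static blk_status_t example_requeue_or_fail(struct request *req,
					    struct bio_list *requeue_list)
{
	/* PT requests cannot be transparently re-submitted, so fail them;
	 * they carry REQ_FAILFAST_DRIVER anyway. */
	if (blk_rq_is_passthrough(req))
		return BLK_STS_IOERR;

	/* FS I/O: detach the bios for later re-submission, then end the request */
	blk_mq_start_request(req);
	blk_steal_bios(requeue_list, req);
	blk_mq_end_request(req, BLK_STS_OK);
	return BLK_STS_OK;
}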
>>> Yeah, but you can't remove the gap at all with start_freeze, that said
>>> the current code has to live with the situation of new mapping change
>>> and old request with old mapping.
>>>
>>> Actually I considered handling this kind of situation before; one
>>> approach is to reuse the bio steal logic taken in nvme mpath:
>>>
>>> 1) for FS IO, re-submit bios, meantime free the request
>>>
>>> 2) for PT requests, simply fail them
>>>
>>> It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
>>> always set for PT requests, but I don't see any better approach for
>>> handling PT requests.
>>
>> I think that's acceptable for PT requests, or any request that doesn't
>> have a bio. I tried something similar a while back that was almost
>> working, but I either never posted it, or it's in that window when
>> infradead lost all the emails. :(
>
> If you are fine to fail PT requests, I'd suggest handling the problem in
> the following way:
>
> 1) moving freeze into reset
>
> 2) during resetting
>
>    - freeze NS queues
>    - unquiesce NS queues
>    - nvme_wait_freeze()
>    - update_nr_hw_queues
>    - unfreeze NS queues
>
> 3) meantime change the driver's ->queue_rq() in case the ctrl state is
>    NVME_CTRL_CONNECTING:
>
>    - if the request is FS IO with data, re-submit all bios of this
>      request, and free the request
>
>    - otherwise, fail the request
>
> With this approach, not only is freeze paired with unfreeze; more
> importantly, it becomes impossible to trigger a new timeout while
> handling NVME_CTRL_CONNECTING, so the fallback to ctrl removal can be
> avoided.
>
> Any comment on this approach?

As I said, for tcp/rdma I agree with this approach.
No need to worry about the non-mpath case, I don't think it is really
used anyway nowadays.
On Wed, Jun 28, 2023 at 09:30:16AM +0800, Ming Lei wrote:
> That may not be enough:
>
> - What if nvme_sysfs_delete() is called from sysfs before the 1st check in
>   nvme_reset_work()?
>
> - What if one pending nvme_dev_disable()<-nvme_timeout() comes after
>   the added nvme_unquiesce_io_queues() returns?

Okay, the following will handle both:

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b027e5e3f4acb..c9224d39195e5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2690,7 +2690,8 @@ static void nvme_reset_work(struct work_struct *work)
 	if (dev->ctrl.state != NVME_CTRL_RESETTING) {
 		dev_warn(dev->ctrl.device, "ctrl state %d is not RESETTING\n",
 			 dev->ctrl.state);
-		return;
+		result = -ENODEV;
+		goto out;
 	}
 
 	/*
@@ -2777,7 +2778,9 @@ static void nvme_reset_work(struct work_struct *work)
 		 result);
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
 	nvme_dev_disable(dev, true);
+	nvme_sync_queues(&dev->ctrl);
 	nvme_mark_namespaces_dead(&dev->ctrl);
+	nvme_unquiesce_io_queues(&dev->ctrl);
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
 }
--
On Wed, Jun 28, 2023 at 08:35:04AM -0600, Keith Busch wrote:
> On Wed, Jun 28, 2023 at 09:30:16AM +0800, Ming Lei wrote:
> > That may not be enough:
> >
> > - What if nvme_sysfs_delete() is called from sysfs before the 1st check in
> >   nvme_reset_work()?
> >
> > - What if one pending nvme_dev_disable()<-nvme_timeout() comes after
> >   the added nvme_unquiesce_io_queues() returns?
>
> Okay, the following will handle both:
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b027e5e3f4acb..c9224d39195e5 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2690,7 +2690,8 @@ static void nvme_reset_work(struct work_struct *work)
>  	if (dev->ctrl.state != NVME_CTRL_RESETTING) {
>  		dev_warn(dev->ctrl.device, "ctrl state %d is not RESETTING\n",
>  			 dev->ctrl.state);
> -		return;
> +		result = -ENODEV;
> +		goto out;
>  	}
> 
>  	/*
> @@ -2777,7 +2778,9 @@ static void nvme_reset_work(struct work_struct *work)
>  		 result);
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>  	nvme_dev_disable(dev, true);
> +	nvme_sync_queues(&dev->ctrl);
>  	nvme_mark_namespaces_dead(&dev->ctrl);
> +	nvme_unquiesce_io_queues(&dev->ctrl);
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>  }

This one looks better, but reset may not be scheduled successfully because
of removal, such as when the removal comes exactly before changing the
state to NVME_CTRL_RESETTING.

Thanks,
Ming
On Thu, Jun 29, 2023 at 08:08:31AM +0800, Ming Lei wrote:
> On Wed, Jun 28, 2023 at 08:35:04AM -0600, Keith Busch wrote:
> > On Wed, Jun 28, 2023 at 09:30:16AM +0800, Ming Lei wrote:
> > > That may not be enough:
> > >
> > > - What if nvme_sysfs_delete() is called from sysfs before the 1st check in
> > >   nvme_reset_work()?
> > >
> > > - What if one pending nvme_dev_disable()<-nvme_timeout() comes after
> > >   the added nvme_unquiesce_io_queues() returns?
> >
> > Okay, the following will handle both:
> >
> > ---
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index b027e5e3f4acb..c9224d39195e5 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2690,7 +2690,8 @@ static void nvme_reset_work(struct work_struct *work)
> >  	if (dev->ctrl.state != NVME_CTRL_RESETTING) {
> >  		dev_warn(dev->ctrl.device, "ctrl state %d is not RESETTING\n",
> >  			 dev->ctrl.state);
> > -		return;
> > +		result = -ENODEV;
> > +		goto out;
> >  	}
> > 
> >  	/*
> > @@ -2777,7 +2778,9 @@ static void nvme_reset_work(struct work_struct *work)
> >  		 result);
> >  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> >  	nvme_dev_disable(dev, true);
> > +	nvme_sync_queues(&dev->ctrl);
> >  	nvme_mark_namespaces_dead(&dev->ctrl);
> > +	nvme_unquiesce_io_queues(&dev->ctrl);
> >  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
> >  }
>
> This one looks better, but reset may not be scheduled successfully because
> of removal, such as when the removal comes exactly before changing the
> state to NVME_CTRL_RESETTING.

For example, io timeout disables the controller, but can't schedule the
reset because someone requested driver unbinding between the disabling
and the reset_work queueing? I think this needs some error checks, like
below. Fortunately there are not many places in nvme-pci that need this.

I do think the unconditional unquiesce in nvme_remove_namespaces() you
proposed earlier looks good though, too.

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c9224d39195e5..bf21bd08ee7ed 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -108,6 +108,7 @@ MODULE_PARM_DESC(noacpi, "disable acpi bios quirks");
 struct nvme_dev;
 struct nvme_queue;
 
+static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static void nvme_delete_io_queues(struct nvme_dev *dev);
 static void nvme_update_attrs(struct nvme_dev *dev);
@@ -1298,9 +1299,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	 */
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_DONE;
+		goto disable;
 	}
 
 	/*
@@ -1351,10 +1350,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 		 "I/O %d QID %d timeout, reset controller\n",
 		 req->tag, nvmeq->qid);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
-
-		return BLK_EH_DONE;
+		goto disable;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -1391,6 +1387,12 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	 * as the device then is in a faulty state.
 	 */
 	return BLK_EH_RESET_TIMER;
+
+disable:
+	if (!nvme_disable_prepare_reset(dev, false) &&
+	    nvme_try_sched_reset(&dev->ctrl))
+		nvme_unquiesce_io_queues(&dev->ctrl);
+	return BLK_EH_DONE;
 }
 
 static void nvme_free_queue(struct nvme_queue *nvmeq)
@@ -3278,7 +3280,8 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
 	case pci_channel_io_frozen:
 		dev_warn(dev->ctrl.device,
 			 "frozen state error detected, reset controller\n");
-		nvme_dev_disable(dev, false);
+		if (nvme_disable_prepare_reset(dev, false))
+			return PCI_ERS_RESULT_DISCONNECT;
 		return PCI_ERS_RESULT_NEED_RESET;
 	case pci_channel_io_perm_failure:
 		dev_warn(dev->ctrl.device,
@@ -3294,7 +3297,8 @@ static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
 
 	dev_info(dev->ctrl.device, "restart after slot reset\n");
 	pci_restore_state(pdev);
-	nvme_reset_ctrl(&dev->ctrl);
+	if (!nvme_try_sched_reset(&dev->ctrl))
+		nvme_unquiesce_io_queues(&dev->ctrl);
 	return PCI_ERS_RESULT_RECOVERED;
 }
--