Message ID | 20230620013349.906601-1-ming.lei@redhat.com (mailing list archive)
---|---
Series | nvme: fix two kinds of IO hang from removing NSs
> Hello,
>
> The 1st three patches fix io hang when controller removal interrupts error
> recovery, then the queue is left frozen.
>
> The 4th patch fixes io hang when the controller is left unquiesced.

Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
connect patches?

I think that these are patches that should go in regardless because they
also fix I/O failover blocked on request queue enter.
On Tue, Jun 20, 2023 at 01:04:33PM +0300, Sagi Grimberg wrote:
> > Hello,
> >
> > The 1st three patches fix io hang when controller removal interrupts error
> > recovery, then the queue is left frozen.
> >
> > The 4th patch fixes io hang when the controller is left unquiesced.
>
> Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
> connect patches?

I'd suggest handling all drivers (including nvme-pci) with the same logic
to avoid extra maintenance burden wrt. error handling, but it looks like
Keith worries that the delayed freezing may cause too many requests queued
during error handling, and that might cause user reports.

Thanks,
Ming
>>> Hello,
>>>
>>> The 1st three patches fix io hang when controller removal interrupts error
>>> recovery, then the queue is left frozen.
>>>
>>> The 4th patch fixes io hang when the controller is left unquiesced.
>>
>> Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
>> connect patches?
>
> I'd suggest handling all drivers (including nvme-pci) with the same logic
> to avoid extra maintenance burden wrt. error handling, but it looks like
> Keith worries that the delayed freezing may cause too many requests queued
> during error handling, and that might cause user reports.

For nvme-tcp/rdma your patch also addresses IO not failing over because
they block on queue enter. So I definitely want this for fabrics.

AFAICT nvme-pci would also want to failover asap for dual-ported
multipathed devices, not sure if this is something that we are interested
in optimizing though, as pci either succeeds the reset, or removes the
gendisk. But the time-frame is different for fabrics for sure.
On Tue, Jun 20, 2023 at 04:40:49PM +0300, Sagi Grimberg wrote:
> > > > Hello,
> > > >
> > > > The 1st three patches fix io hang when controller removal interrupts error
> > > > recovery, then the queue is left frozen.
> > > >
> > > > The 4th patch fixes io hang when the controller is left unquiesced.
> > >
> > > Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
> > > connect patches?
> >
> > I'd suggest handling all drivers (including nvme-pci) with the same logic
> > to avoid extra maintenance burden wrt. error handling, but it looks like
> > Keith worries that the delayed freezing may cause too many requests queued
> > during error handling, and that might cause user reports.
>
> For nvme-tcp/rdma your patch also addresses IO not failing over because
> they block on queue enter. So I definitely want this for fabrics.

The patch in the following link should fix these issues too:

https://lore.kernel.org/linux-block/ZJGmW7lEaipT6saa@ovpn-8-23.pek2.redhat.com/T/#u

I guess you still want the paired freeze patch because it makes freeze &
unfreeze more reliable in error handling. If yes, I can make one
fabric-only change for you.

Thanks,
Ming
>>>>> Hello,
>>>>>
>>>>> The 1st three patches fix io hang when controller removal interrupts error
>>>>> recovery, then the queue is left frozen.
>>>>>
>>>>> The 4th patch fixes io hang when the controller is left unquiesced.
>>>>
>>>> Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
>>>> connect patches?
>>>
>>> I'd suggest handling all drivers (including nvme-pci) with the same logic
>>> to avoid extra maintenance burden wrt. error handling, but it looks like
>>> Keith worries that the delayed freezing may cause too many requests queued
>>> during error handling, and that might cause user reports.
>>
>> For nvme-tcp/rdma your patch also addresses IO not failing over because
>> they block on queue enter. So I definitely want this for fabrics.
>
> The patch in the following link should fix these issues too:
>
> https://lore.kernel.org/linux-block/ZJGmW7lEaipT6saa@ovpn-8-23.pek2.redhat.com/T/#u
>
> I guess you still want the paired freeze patch because it makes freeze &
> unfreeze more reliable in error handling. If yes, I can make one
> fabric-only change for you.

Not sure exactly what reliability is referred to here. I agree that there
is an issue with controller delete during error recovery. The patch was a
way to side-step it, great. But it addressed I/O blocked on enter and not
failing over.

So yes, for fabrics we should have it. I would argue that it would be the
right thing to do for pci as well. But I won't argue if Keith feels
otherwise.
On Wed, Jun 21, 2023 at 01:13:05PM +0300, Sagi Grimberg wrote:
> > > > > > Hello,
> > > > > >
> > > > > > The 1st three patches fix io hang when controller removal interrupts error
> > > > > > recovery, then the queue is left frozen.
> > > > > >
> > > > > > The 4th patch fixes io hang when the controller is left unquiesced.
> > > > >
> > > > > Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
> > > > > connect patches?
> > > >
> > > > I'd suggest handling all drivers (including nvme-pci) with the same logic
> > > > to avoid extra maintenance burden wrt. error handling, but it looks like
> > > > Keith worries that the delayed freezing may cause too many requests queued
> > > > during error handling, and that might cause user reports.
> > >
> > > For nvme-tcp/rdma your patch also addresses IO not failing over because
> > > they block on queue enter. So I definitely want this for fabrics.
> >
> > The patch in the following link should fix these issues too:
> >
> > https://lore.kernel.org/linux-block/ZJGmW7lEaipT6saa@ovpn-8-23.pek2.redhat.com/T/#u
> >
> > I guess you still want the paired freeze patch because it makes freeze &
> > unfreeze more reliable in error handling. If yes, I can make one
> > fabric-only change for you.
>
> Not sure exactly what reliability is referred to here.

freeze and unfreeze have to be called strictly in pairs, but nvme calls
the two from different contexts, so the unfreeze may easily be missed,
which causes a mismatched freeze count. There have been many such reports
so far.

> I agree that there
> is an issue with controller delete during error recovery. The patch was a
> way to side-step it, great. But it addressed I/O blocked on enter and not
> failing over.
>
> So yes, for fabrics we should have it. I would argue that it would be the
> right thing to do for pci as well. But I won't argue if Keith feels
> otherwise.

Keith, can you let us know if you are fine with moving
nvme_start_freeze() into nvme_reset_work() for the nvme pci driver?

Thanks,
Ming
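For reference, the pairing requirement Ming describes can be sketched with the existing nvme core helpers. The function below is only an illustration of keeping freeze and unfreeze in one context; example_reset_ctrl() is a made-up name and this is not code from the series:

/*
 * Sketch only: a reset path that issues freeze and unfreeze from the same
 * context, so the namespace queue freeze depth always returns to zero.
 * The helpers are the existing ones declared in drivers/nvme/host/nvme.h.
 */
static void example_reset_ctrl(struct nvme_ctrl *ctrl)
{
	nvme_start_freeze(ctrl);	/* bump the freeze depth on all namespace queues */

	/* ... tear down the old IO queues and re-create/reconnect them ... */

	nvme_unquiesce_io_queues(ctrl);	/* let already-allocated requests proceed */
	nvme_wait_freeze(ctrl);		/* wait for in-flight requests to drain */
	/* blk_mq_update_nr_hw_queues() would run here, with the queues frozen */
	nvme_unfreeze(ctrl);		/* always pairs with the freeze above */
}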
On Wed, Jun 21, 2023 at 09:27:31PM +0800, Ming Lei wrote:
> On Wed, Jun 21, 2023 at 01:13:05PM +0300, Sagi Grimberg wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > The 1st three patches fix io hang when controller removal interrupts error
> > > > > > > recovery, then the queue is left frozen.
> > > > > > >
> > > > > > > The 4th patch fixes io hang when the controller is left unquiesced.
> > > > > >
> > > > > > Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
> > > > > > connect patches?
> > > > >
> > > > > I'd suggest handling all drivers (including nvme-pci) with the same logic
> > > > > to avoid extra maintenance burden wrt. error handling, but it looks like
> > > > > Keith worries that the delayed freezing may cause too many requests queued
> > > > > during error handling, and that might cause user reports.
> > > >
> > > > For nvme-tcp/rdma your patch also addresses IO not failing over because
> > > > they block on queue enter. So I definitely want this for fabrics.
> > >
> > > The patch in the following link should fix these issues too:
> > >
> > > https://lore.kernel.org/linux-block/ZJGmW7lEaipT6saa@ovpn-8-23.pek2.redhat.com/T/#u
> > >
> > > I guess you still want the paired freeze patch because it makes freeze &
> > > unfreeze more reliable in error handling. If yes, I can make one
> > > fabric-only change for you.
> >
> > Not sure exactly what reliability is referred to here.
>
> freeze and unfreeze have to be called strictly in pairs, but nvme calls
> the two from different contexts, so the unfreeze may easily be missed,
> which causes a mismatched freeze count. There have been many such reports
> so far.
>
> > I agree that there
> > is an issue with controller delete during error recovery. The patch was a
> > way to side-step it, great. But it addressed I/O blocked on enter and not
> > failing over.
> >
> > So yes, for fabrics we should have it. I would argue that it would be the
> > right thing to do for pci as well. But I won't argue if Keith feels
> > otherwise.
>
> Keith, can you let us know if you are fine with moving
> nvme_start_freeze() into nvme_reset_work() for the nvme pci driver?

The point was to contain requests from entering while the hctx's are
being reconfigured. If you're going to pair up the freezes as you've
suggested, we might as well just not call freeze at all.
On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> On Wed, Jun 21, 2023 at 09:27:31PM +0800, Ming Lei wrote:
> > On Wed, Jun 21, 2023 at 01:13:05PM +0300, Sagi Grimberg wrote:
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > The 1st three patches fix io hang when controller removal interrupts error
> > > > > > > > recovery, then the queue is left frozen.
> > > > > > > >
> > > > > > > > The 4th patch fixes io hang when the controller is left unquiesced.
> > > > > > >
> > > > > > > Ming, what happened to nvme-tcp/rdma move of freeze/unfreeze to the
> > > > > > > connect patches?
> > > > > >
> > > > > > I'd suggest handling all drivers (including nvme-pci) with the same logic
> > > > > > to avoid extra maintenance burden wrt. error handling, but it looks like
> > > > > > Keith worries that the delayed freezing may cause too many requests queued
> > > > > > during error handling, and that might cause user reports.
> > > > >
> > > > > For nvme-tcp/rdma your patch also addresses IO not failing over because
> > > > > they block on queue enter. So I definitely want this for fabrics.
> > > >
> > > > The patch in the following link should fix these issues too:
> > > >
> > > > https://lore.kernel.org/linux-block/ZJGmW7lEaipT6saa@ovpn-8-23.pek2.redhat.com/T/#u
> > > >
> > > > I guess you still want the paired freeze patch because it makes freeze &
> > > > unfreeze more reliable in error handling. If yes, I can make one
> > > > fabric-only change for you.
> > >
> > > Not sure exactly what reliability is referred to here.
> >
> > freeze and unfreeze have to be called strictly in pairs, but nvme calls
> > the two from different contexts, so the unfreeze may easily be missed,
> > which causes a mismatched freeze count. There have been many such reports
> > so far.
> >
> > > I agree that there
> > > is an issue with controller delete during error recovery. The patch was a
> > > way to side-step it, great. But it addressed I/O blocked on enter and not
> > > failing over.
> > >
> > > So yes, for fabrics we should have it. I would argue that it would be the
> > > right thing to do for pci as well. But I won't argue if Keith feels
> > > otherwise.
> >
> > Keith, can you let us know if you are fine with moving
> > nvme_start_freeze() into nvme_reset_work() for the nvme pci driver?
>
> The point was to contain requests from entering while the hctx's are
> being reconfigured. If you're going to pair up the freezes as you've
> suggested, we might as well just not call freeze at all.

blk_mq_update_nr_hw_queues() requires queue to be frozen.

Thanks,
Ming
On Thu, Jun 22, 2023 at 09:51:12PM +0800, Ming Lei wrote:
> On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> > The point was to contain requests from entering while the hctx's are
> > being reconfigured. If you're going to pair up the freezes as you've
> > suggested, we might as well just not call freeze at all.
>
> blk_mq_update_nr_hw_queues() requires queue to be frozen.

It's too late at that point. Let's work through a real example. You'll
need a system that has more CPU's than your nvme has IO queues.

Boot without any special nvme parameters. Every possible nvme IO queue
will be assigned "default" hctx type. Now start IO to every queue, then
run:

# echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller

Today, we freeze prior to tearing down the "default" IO queues, so
there's nothing entered into them while the driver reconfigures the
queues.

What you're suggesting will allow IO to queue up in a quiesced "default"
queue, which will become "polled" without an interrupt handler on the
other side of the reset. The application doesn't know that, so the IO
you're allowing to queue up will time out.
On Thu, Jun 22, 2023 at 08:35:49AM -0600, Keith Busch wrote:
> On Thu, Jun 22, 2023 at 09:51:12PM +0800, Ming Lei wrote:
> > On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> > > The point was to contain requests from entering while the hctx's are
> > > being reconfigured. If you're going to pair up the freezes as you've
> > > suggested, we might as well just not call freeze at all.
> >
> > blk_mq_update_nr_hw_queues() requires queue to be frozen.
>
> It's too late at that point. Let's work through a real example. You'll
> need a system that has more CPU's than your nvme has IO queues.
>
> Boot without any special nvme parameters. Every possible nvme IO queue
> will be assigned "default" hctx type. Now start IO to every queue, then
> run:
>
> # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
>
> Today, we freeze prior to tearing down the "default" IO queues, so
> there's nothing entered into them while the driver reconfigures the
> queues.

nvme_start_freeze() just prevents new IO from being queued, and old ones
may still be entering the block layer queue, and what matters here is
actually quiesce, which prevents new IO from being queued to
driver/hardware.

> What you're suggesting will allow IO to queue up in a quiesced "default"
> queue, which will become "polled" without an interrupt handler on the
> other side of the reset. The application doesn't know that, so the IO
> you're allowing to queue up will time out.

A timeout only happens after the request is queued to driver/hardware, or
after blk_mq_start_request() is called in nvme_queue_rq(), but quiesce
actually prevents new IOs from being dispatched to the driver or queued
via .queue_rq(); meantime old requests have been canceled, so no request
can be timed out.

Thanks,
Ming
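The freeze/quiesce distinction argued here can be summarized with the generic blk-mq helpers; the sketch below is an illustration for the reader (the wrapper function name is made up), not code from the thread:

#include <linux/blk-mq.h>

/* Illustration only: freeze vs. quiesce as discussed above. */
static void example_freeze_vs_quiesce(struct request_queue *q)
{
	/*
	 * Freeze blocks *new* requests at blk_queue_enter(); requests that
	 * already entered the queue still run to completion. Every freeze
	 * must eventually be paired with an unfreeze, or the freeze depth
	 * stays elevated and later submitters hang.
	 */
	blk_mq_freeze_queue(q);
	blk_mq_unfreeze_queue(q);

	/*
	 * Quiesce keeps already-allocated requests from being handed to
	 * ->queue_rq(). Their timeout timers only start once the driver
	 * calls blk_mq_start_request(), so requests held back by quiesce
	 * do not time out on their own.
	 */
	blk_mq_quiesce_queue(q);
	blk_mq_unquiesce_queue(q);
}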
On Thu, Jun 22, 2023 at 10:53:05PM +0800, Ming Lei wrote:
> On Thu, Jun 22, 2023 at 08:35:49AM -0600, Keith Busch wrote:
> > On Thu, Jun 22, 2023 at 09:51:12PM +0800, Ming Lei wrote:
> > > On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> > > > The point was to contain requests from entering while the hctx's are
> > > > being reconfigured. If you're going to pair up the freezes as you've
> > > > suggested, we might as well just not call freeze at all.
> > >
> > > blk_mq_update_nr_hw_queues() requires queue to be frozen.
> >
> > It's too late at that point. Let's work through a real example. You'll
> > need a system that has more CPU's than your nvme has IO queues.
> >
> > Boot without any special nvme parameters. Every possible nvme IO queue
> > will be assigned "default" hctx type. Now start IO to every queue, then
> > run:
> >
> > # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
> >
> > Today, we freeze prior to tearing down the "default" IO queues, so
> > there's nothing entered into them while the driver reconfigures the
> > queues.
>
> nvme_start_freeze() just prevents new IO from being queued, and old ones
> may still be entering the block layer queue, and what matters here is
> actually quiesce, which prevents new IO from being queued to
> driver/hardware.
>
> > What you're suggesting will allow IO to queue up in a quiesced "default"
> > queue, which will become "polled" without an interrupt handler on the
> > other side of the reset. The application doesn't know that, so the IO
> > you're allowing to queue up will time out.
>
> A timeout only happens after the request is queued to driver/hardware, or
> after blk_mq_start_request() is called in nvme_queue_rq(), but quiesce
> actually prevents new IOs from being dispatched to the driver or queued
> via .queue_rq(); meantime old requests have been canceled, so no request
> can be timed out.

Quiesce doesn't prevent requests from entering an hctx, and you can't
back it out to put on another hctx later. It doesn't matter that you
haven't dispatched it to hardware yet. The request's queue was set the
moment it was allocated, so after you unquiesce and freeze for the new
queue mapping, the requests previously blocked on quiesce will time out
in the scenario I've described.

There are certainly gaps in the existing code where error'ed requests
can be requeued or stuck elsewhere and hit the exact same problem, but
the current way at least tries to contain it.
On Thu, Jun 22, 2023 at 09:19:19AM -0600, Keith Busch wrote:
> On Thu, Jun 22, 2023 at 10:53:05PM +0800, Ming Lei wrote:
> > On Thu, Jun 22, 2023 at 08:35:49AM -0600, Keith Busch wrote:
> > > On Thu, Jun 22, 2023 at 09:51:12PM +0800, Ming Lei wrote:
> > > > On Wed, Jun 21, 2023 at 09:48:49AM -0600, Keith Busch wrote:
> > > > > The point was to contain requests from entering while the hctx's are
> > > > > being reconfigured. If you're going to pair up the freezes as you've
> > > > > suggested, we might as well just not call freeze at all.
> > > >
> > > > blk_mq_update_nr_hw_queues() requires queue to be frozen.
> > >
> > > It's too late at that point. Let's work through a real example. You'll
> > > need a system that has more CPU's than your nvme has IO queues.
> > >
> > > Boot without any special nvme parameters. Every possible nvme IO queue
> > > will be assigned "default" hctx type. Now start IO to every queue, then
> > > run:
> > >
> > > # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
> > >
> > > Today, we freeze prior to tearing down the "default" IO queues, so
> > > there's nothing entered into them while the driver reconfigures the
> > > queues.
> >
> > nvme_start_freeze() just prevents new IO from being queued, and old ones
> > may still be entering the block layer queue, and what matters here is
> > actually quiesce, which prevents new IO from being queued to
> > driver/hardware.
> >
> > > What you're suggesting will allow IO to queue up in a quiesced "default"
> > > queue, which will become "polled" without an interrupt handler on the
> > > other side of the reset. The application doesn't know that, so the IO
> > > you're allowing to queue up will time out.
> >
> > A timeout only happens after the request is queued to driver/hardware, or
> > after blk_mq_start_request() is called in nvme_queue_rq(), but quiesce
> > actually prevents new IOs from being dispatched to the driver or queued
> > via .queue_rq(); meantime old requests have been canceled, so no request
> > can be timed out.
>
> Quiesce doesn't prevent requests from entering an hctx, and you can't
> back it out to put on another hctx later. It doesn't matter that you
> haven't dispatched it to hardware yet. The request's queue was set the
> moment it was allocated, so after you unquiesce and freeze for the new
> queue mapping, the requests previously blocked on quiesce will time out
> in the scenario I've described.
>
> There are certainly gaps in the existing code where error'ed requests
> can be requeued or stuck elsewhere and hit the exact same problem, but
> the current way at least tries to contain it.

Yeah, but you can't remove the gap at all with start_freeze, that said
the current code has to live with the situation of new mapping change
and old request with old mapping.

Actually I considered handling this kind of situation before; one
approach is to reuse the bio steal logic taken in nvme mpath:

1) for FS IO, re-submit bios, meantime free the request

2) for PT requests, simply fail them

It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
always set for PT requests, but I don't see any better approach for
handling PT requests.

Thanks,
Ming
>>>>>> The point was to contain requests from entering while the hctx's are
>>>>>> being reconfigured. If you're going to pair up the freezes as you've
>>>>>> suggested, we might as well just not call freeze at all.
>>>>>
>>>>> blk_mq_update_nr_hw_queues() requires queue to be frozen.
>>>>
>>>> It's too late at that point. Let's work through a real example. You'll
>>>> need a system that has more CPU's than your nvme has IO queues.
>>>>
>>>> Boot without any special nvme parameters. Every possible nvme IO queue
>>>> will be assigned "default" hctx type. Now start IO to every queue, then
>>>> run:
>>>>
>>>> # echo 8 > /sys/module/nvme/parameters/poll_queues && echo 1 > /sys/class/nvme/nvme0/reset_controller
>>>>
>>>> Today, we freeze prior to tearing down the "default" IO queues, so
>>>> there's nothing entered into them while the driver reconfigures the
>>>> queues.
>>>
>>> nvme_start_freeze() just prevents new IO from being queued, and old ones
>>> may still be entering the block layer queue, and what matters here is
>>> actually quiesce, which prevents new IO from being queued to
>>> driver/hardware.
>>>
>>>> What you're suggesting will allow IO to queue up in a quiesced "default"
>>>> queue, which will become "polled" without an interrupt handler on the
>>>> other side of the reset. The application doesn't know that, so the IO
>>>> you're allowing to queue up will time out.
>>>
>>> A timeout only happens after the request is queued to driver/hardware, or
>>> after blk_mq_start_request() is called in nvme_queue_rq(), but quiesce
>>> actually prevents new IOs from being dispatched to the driver or queued
>>> via .queue_rq(); meantime old requests have been canceled, so no request
>>> can be timed out.
>>
>> Quiesce doesn't prevent requests from entering an hctx, and you can't
>> back it out to put on another hctx later. It doesn't matter that you
>> haven't dispatched it to hardware yet. The request's queue was set the
>> moment it was allocated, so after you unquiesce and freeze for the new
>> queue mapping, the requests previously blocked on quiesce will time out
>> in the scenario I've described.
>>
>> There are certainly gaps in the existing code where error'ed requests
>> can be requeued or stuck elsewhere and hit the exact same problem, but
>> the current way at least tries to contain it.
>
> Yeah, but you can't remove the gap at all with start_freeze, that said
> the current code has to live with the situation of new mapping change
> and old request with old mapping.
>
> Actually I considered handling this kind of situation before; one
> approach is to reuse the bio steal logic taken in nvme mpath:
>
> 1) for FS IO, re-submit bios, meantime free the request
>
> 2) for PT requests, simply fail them
>
> It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
> always set for PT requests, but I don't see any better approach for
> handling PT requests.

Ming, I suggest submitting patches for tcp/rdma and continuing the
discussion on the pci driver.
On Sun, Jun 25, 2023 at 08:26:48AM +0800, Ming Lei wrote:
> Yeah, but you can't remove the gap at all with start_freeze, that said
> the current code has to live with the situation of new mapping change
> and old request with old mapping.
>
> Actually I considered handling this kind of situation before; one
> approach is to reuse the bio steal logic taken in nvme mpath:
>
> 1) for FS IO, re-submit bios, meantime free the request
>
> 2) for PT requests, simply fail them
>
> It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
> always set for PT requests, but I don't see any better approach for
> handling PT requests.

I think that's acceptable for PT requests, or any request that doesn't
have a bio. I tried something similar a while back that was almost
working, but I either never posted it, or it's in that window when
infradead lost all the emails. :(

Anyway, for the pci controller, I think I see the problem you're fixing.
When reset_work fails, we used to do the mark dead + unquiesce via
"nvme_kill_queues()", which doesn't exist anymore, but I think your
scenario worked back then. Currently a failed nvme_reset_work simply
marks them dead without the unquiesce. Would it be enough to just bring
that unquiesce behavior back?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b027e5e3f4acb..8eaa954aa6ed4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2778,6 +2778,7 @@ static void nvme_reset_work(struct work_struct *work)
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
 	nvme_dev_disable(dev, true);
 	nvme_mark_namespaces_dead(&dev->ctrl);
+	nvme_unquiesce_io_queues(&dev->ctrl);
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
 }
--
>> Yeah, but you can't remove the gap at all with start_freeze, that said
>> the current code has to live with the situation of new mapping change
>> and old request with old mapping.
>>
>> Actually I considered handling this kind of situation before; one
>> approach is to reuse the bio steal logic taken in nvme mpath:
>>
>> 1) for FS IO, re-submit bios, meantime free the request
>>
>> 2) for PT requests, simply fail them
>>
>> It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
>> always set for PT requests, but I don't see any better approach for
>> handling PT requests.
>
> I think that's acceptable for PT requests, or any request that doesn't
> have a bio. I tried something similar a while back that was almost
> working, but I either never posted it, or it's in that window when
> infradead lost all the emails. :(
>
> Anyway, for the pci controller, I think I see the problem you're fixing.
> When reset_work fails, we used to do the mark dead + unquiesce via
> "nvme_kill_queues()", which doesn't exist anymore, but I think your
> scenario worked back then. Currently a failed nvme_reset_work simply
> marks them dead without the unquiesce. Would it be enough to just bring
> that unquiesce behavior back?
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b027e5e3f4acb..8eaa954aa6ed4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2778,6 +2778,7 @@ static void nvme_reset_work(struct work_struct *work)
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>  	nvme_dev_disable(dev, true);
>  	nvme_mark_namespaces_dead(&dev->ctrl);
> +	nvme_unquiesce_io_queues(&dev->ctrl);
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>  }
>
> --

I think this should work.
On Tue, Jun 27, 2023 at 11:21:36AM -0600, Keith Busch wrote:
> On Sun, Jun 25, 2023 at 08:26:48AM +0800, Ming Lei wrote:
> > Yeah, but you can't remove the gap at all with start_freeze, that said
> > the current code has to live with the situation of new mapping change
> > and old request with old mapping.
> >
> > Actually I considered handling this kind of situation before; one
> > approach is to reuse the bio steal logic taken in nvme mpath:
> >
> > 1) for FS IO, re-submit bios, meantime free the request
> >
> > 2) for PT requests, simply fail them
> >
> > It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
> > always set for PT requests, but I don't see any better approach for
> > handling PT requests.
>
> I think that's acceptable for PT requests, or any request that doesn't
> have a bio. I tried something similar a while back that was almost
> working, but I either never posted it, or it's in that window when
> infradead lost all the emails. :(
>
> Anyway, for the pci controller, I think I see the problem you're fixing.
> When reset_work fails, we used to do the mark dead + unquiesce via
> "nvme_kill_queues()", which doesn't exist anymore, but I think your
> scenario worked back then. Currently a failed nvme_reset_work simply
> marks them dead without the unquiesce. Would it be enough to just bring
> that unquiesce behavior back?
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b027e5e3f4acb..8eaa954aa6ed4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2778,6 +2778,7 @@ static void nvme_reset_work(struct work_struct *work)
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>  	nvme_dev_disable(dev, true);
>  	nvme_mark_namespaces_dead(&dev->ctrl);
> +	nvme_unquiesce_io_queues(&dev->ctrl);
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>  }

That may not be enough:

- What if nvme_sysfs_delete() is called from sysfs before the 1st check in
  nvme_reset_work()?

- What if one pending nvme_dev_disable()<-nvme_timeout() comes after
  the added nvme_unquiesce_io_queues() returns?

Thanks,
Ming
On Tue, Jun 27, 2023 at 11:21:36AM -0600, Keith Busch wrote:
> On Sun, Jun 25, 2023 at 08:26:48AM +0800, Ming Lei wrote:
> > Yeah, but you can't remove the gap at all with start_freeze, that said
> > the current code has to live with the situation of new mapping change
> > and old request with old mapping.
> >
> > Actually I considered handling this kind of situation before; one
> > approach is to reuse the bio steal logic taken in nvme mpath:
> >
> > 1) for FS IO, re-submit bios, meantime free the request
> >
> > 2) for PT requests, simply fail them
> >
> > It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
> > always set for PT requests, but I don't see any better approach for
> > handling PT requests.
>
> I think that's acceptable for PT requests, or any request that doesn't
> have a bio. I tried something similar a while back that was almost
> working, but I either never posted it, or it's in that window when
> infradead lost all the emails. :(

If you are fine to fail PT requests, I'd suggest handling the problem in
the following way:

1) moving freeze into reset

2) during resetting

   - freeze NS queues
   - unquiesce NS queues
   - nvme_wait_freeze()
   - update_nr_hw_queues
   - unfreeze NS queues

3) meantime change the driver's ->queue_rq() in case the ctrl state is
   NVME_CTRL_CONNECTING:

   - if the request is FS IO with data, re-submit all bios of this
     request, and free the request

   - otherwise, fail the request

With this approach, not only is freeze paired with unfreeze; more
importantly, it becomes impossible to trigger a new timeout while
handling NVME_CTRL_CONNECTING, so the fallback to ctrl removal can be
avoided.

Any comment on this approach?

Thanks,
Ming
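A rough sketch of what step 3) might look like inside a fabrics ->queue_rq() path, for illustration only; the function name and the requeue_list parameter are assumptions, not part of a posted patch:

/*
 * Sketch only: handling a request in ->queue_rq() while the controller is
 * in NVME_CTRL_CONNECTING. FS I/O gives its bios back for re-submission
 * (the nvme multipath "bio steal" pattern); passthrough requests are
 * failed directly. "requeue_list" is an assumed per-controller bio list.
 */
static blk_status_t example_requeue_or_fail(struct request *req,
					    struct bio_list *requeue_list)
{
	/* PT requests cannot be transparently re-submitted, so fail them;
	 * they carry REQ_FAILFAST_DRIVER anyway. */
	if (blk_rq_is_passthrough(req))
		return BLK_STS_IOERR;

	/* FS I/O: detach the bios for later re-submission, then end the request */
	blk_mq_start_request(req);
	blk_steal_bios(requeue_list, req);
	blk_mq_end_request(req, BLK_STS_OK);
	return BLK_STS_OK;
}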
>>> Yeah, but you can't remove the gap at all with start_freeze, that said
>>> the current code has to live with the situation of new mapping change
>>> and old request with old mapping.
>>>
>>> Actually I considered handling this kind of situation before; one
>>> approach is to reuse the bio steal logic taken in nvme mpath:
>>>
>>> 1) for FS IO, re-submit bios, meantime free the request
>>>
>>> 2) for PT requests, simply fail them
>>>
>>> It could be a bit violent for 2) even though REQ_FAILFAST_DRIVER is
>>> always set for PT requests, but I don't see any better approach for
>>> handling PT requests.
>>
>> I think that's acceptable for PT requests, or any request that doesn't
>> have a bio. I tried something similar a while back that was almost
>> working, but I either never posted it, or it's in that window when
>> infradead lost all the emails. :(
>
> If you are fine to fail PT requests, I'd suggest handling the problem in
> the following way:
>
> 1) moving freeze into reset
>
> 2) during resetting
>
>    - freeze NS queues
>    - unquiesce NS queues
>    - nvme_wait_freeze()
>    - update_nr_hw_queues
>    - unfreeze NS queues
>
> 3) meantime change the driver's ->queue_rq() in case the ctrl state is
>    NVME_CTRL_CONNECTING:
>
>    - if the request is FS IO with data, re-submit all bios of this
>      request, and free the request
>
>    - otherwise, fail the request
>
> With this approach, not only is freeze paired with unfreeze; more
> importantly, it becomes impossible to trigger a new timeout while
> handling NVME_CTRL_CONNECTING, so the fallback to ctrl removal can be
> avoided.
>
> Any comment on this approach?

As I said, for tcp/rdma I agree with this approach.
No need to worry about the non-mpath case, I don't think it is really
used anyway nowadays.
On Wed, Jun 28, 2023 at 09:30:16AM +0800, Ming Lei wrote:
> That may not be enough:
>
> - What if nvme_sysfs_delete() is called from sysfs before the 1st check in
>   nvme_reset_work()?
>
> - What if one pending nvme_dev_disable()<-nvme_timeout() comes after
>   the added nvme_unquiesce_io_queues() returns?

Okay, the following will handle both:

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b027e5e3f4acb..c9224d39195e5 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2690,7 +2690,8 @@ static void nvme_reset_work(struct work_struct *work)
 	if (dev->ctrl.state != NVME_CTRL_RESETTING) {
 		dev_warn(dev->ctrl.device, "ctrl state %d is not RESETTING\n",
 			 dev->ctrl.state);
-		return;
+		result = -ENODEV;
+		goto out;
 	}
 
 	/*
@@ -2777,7 +2778,9 @@ static void nvme_reset_work(struct work_struct *work)
 		 result);
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
 	nvme_dev_disable(dev, true);
+	nvme_sync_queues(&dev->ctrl);
 	nvme_mark_namespaces_dead(&dev->ctrl);
+	nvme_unquiesce_io_queues(&dev->ctrl);
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
 }
--
On Wed, Jun 28, 2023 at 08:35:04AM -0600, Keith Busch wrote:
> On Wed, Jun 28, 2023 at 09:30:16AM +0800, Ming Lei wrote:
> > That may not be enough:
> >
> > - What if nvme_sysfs_delete() is called from sysfs before the 1st check in
> >   nvme_reset_work()?
> >
> > - What if one pending nvme_dev_disable()<-nvme_timeout() comes after
> >   the added nvme_unquiesce_io_queues() returns?
>
> Okay, the following will handle both:
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b027e5e3f4acb..c9224d39195e5 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2690,7 +2690,8 @@ static void nvme_reset_work(struct work_struct *work)
>  	if (dev->ctrl.state != NVME_CTRL_RESETTING) {
>  		dev_warn(dev->ctrl.device, "ctrl state %d is not RESETTING\n",
>  			 dev->ctrl.state);
> -		return;
> +		result = -ENODEV;
> +		goto out;
>  	}
> 
>  	/*
> @@ -2777,7 +2778,9 @@ static void nvme_reset_work(struct work_struct *work)
>  		 result);
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
>  	nvme_dev_disable(dev, true);
> +	nvme_sync_queues(&dev->ctrl);
>  	nvme_mark_namespaces_dead(&dev->ctrl);
> +	nvme_unquiesce_io_queues(&dev->ctrl);
>  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
>  }

This one looks better, but reset may not be scheduled successfully because
of removal, such as when the removal comes exactly before changing the
state to NVME_CTRL_RESETTING.

Thanks,
Ming
On Thu, Jun 29, 2023 at 08:08:31AM +0800, Ming Lei wrote:
> On Wed, Jun 28, 2023 at 08:35:04AM -0600, Keith Busch wrote:
> > On Wed, Jun 28, 2023 at 09:30:16AM +0800, Ming Lei wrote:
> > > That may not be enough:
> > >
> > > - What if nvme_sysfs_delete() is called from sysfs before the 1st check in
> > >   nvme_reset_work()?
> > >
> > > - What if one pending nvme_dev_disable()<-nvme_timeout() comes after
> > >   the added nvme_unquiesce_io_queues() returns?
> >
> > Okay, the following will handle both:
> >
> > ---
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index b027e5e3f4acb..c9224d39195e5 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2690,7 +2690,8 @@ static void nvme_reset_work(struct work_struct *work)
> >  	if (dev->ctrl.state != NVME_CTRL_RESETTING) {
> >  		dev_warn(dev->ctrl.device, "ctrl state %d is not RESETTING\n",
> >  			 dev->ctrl.state);
> > -		return;
> > +		result = -ENODEV;
> > +		goto out;
> >  	}
> > 
> >  	/*
> > @@ -2777,7 +2778,9 @@ static void nvme_reset_work(struct work_struct *work)
> >  		 result);
> >  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> >  	nvme_dev_disable(dev, true);
> > +	nvme_sync_queues(&dev->ctrl);
> >  	nvme_mark_namespaces_dead(&dev->ctrl);
> > +	nvme_unquiesce_io_queues(&dev->ctrl);
> >  	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
> >  }
>
> This one looks better, but reset may not be scheduled successfully because
> of removal, such as when the removal comes exactly before changing the
> state to NVME_CTRL_RESETTING.

For example, io timeout disables the controller, but can't schedule the
reset because someone requested driver unbinding between the disabling
and the reset_work queueing? I think this needs some error checks, like
below. Fortunately there are not many places in nvme-pci that need this.

I do think the unconditional unquiesce in nvme_remove_namespaces() you
proposed earlier looks good though, too.

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c9224d39195e5..bf21bd08ee7ed 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -108,6 +108,7 @@ MODULE_PARM_DESC(noacpi, "disable acpi bios quirks");
 struct nvme_dev;
 struct nvme_queue;
 
+static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown);
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
 static void nvme_delete_io_queues(struct nvme_dev *dev);
 static void nvme_update_attrs(struct nvme_dev *dev);
@@ -1298,9 +1299,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	 */
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
-		return BLK_EH_DONE;
+		goto disable;
 	}
 
 	/*
@@ -1351,10 +1350,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 		 "I/O %d QID %d timeout, reset controller\n",
 		 req->tag, nvmeq->qid);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		nvme_dev_disable(dev, false);
-		nvme_reset_ctrl(&dev->ctrl);
-
-		return BLK_EH_DONE;
+		goto disable;
 	}
 
 	if (atomic_dec_return(&dev->ctrl.abort_limit) < 0) {
@@ -1391,6 +1387,12 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	 * as the device then is in a faulty state.
 	 */
 	return BLK_EH_RESET_TIMER;
+
+disable:
+	if (!nvme_disable_prepare_reset(dev, false) &&
+	    nvme_try_sched_reset(&dev->ctrl))
+		nvme_unquiesce_io_queues(&dev->ctrl);
+	return BLK_EH_DONE;
 }
 
 static void nvme_free_queue(struct nvme_queue *nvmeq)
@@ -3278,7 +3280,8 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
 	case pci_channel_io_frozen:
 		dev_warn(dev->ctrl.device,
 			 "frozen state error detected, reset controller\n");
-		nvme_dev_disable(dev, false);
+		if (nvme_disable_prepare_reset(dev, false))
+			return PCI_ERS_RESULT_DISCONNECT;
 		return PCI_ERS_RESULT_NEED_RESET;
 	case pci_channel_io_perm_failure:
 		dev_warn(dev->ctrl.device,
@@ -3294,7 +3297,8 @@ static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
 
 	dev_info(dev->ctrl.device, "restart after slot reset\n");
 	pci_restore_state(pdev);
-	nvme_reset_ctrl(&dev->ctrl);
+	if (!nvme_try_sched_reset(&dev->ctrl))
+		nvme_unquiesce_io_queues(&dev->ctrl);
 	return PCI_ERS_RESULT_RECOVERED;
 }
--