
[1/4] blk-mq: add API of blk_mq_unfreeze_queue_force

Message ID 20230615143236.297456-2-ming.lei@redhat.com (mailing list archive)
State New, archived
Series nvme: fix two kinds of IO hang from removing NSs

Commit Message

Ming Lei June 15, 2023, 2:32 p.m. UTC
NVMe calls freeze/unfreeze in different contexts, and controller removal
may break in-progress error recovery and leave queues in a frozen state.
This causes an IO hang in del_gendisk() because pending writeback IOs are
still blocked in bio_queue_enter().

Prepare for fixing this issue by adding blk_mq_unfreeze_queue_force(),
which will be called when removing the device.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 25 ++++++++++++++++++++++---
 block/blk.h            |  3 ++-
 block/genhd.c          |  2 +-
 include/linux/blk-mq.h |  1 +
 4 files changed, 26 insertions(+), 5 deletions(-)

Comments

Keith Busch June 15, 2023, 3:16 p.m. UTC | #1
On Thu, Jun 15, 2023 at 10:32:33PM +0800, Ming Lei wrote:
> NVMe calls freeze/unfreeze in different contexts, and controller removal
> may break in-progress error recovery, then leave queues in frozen state.
> So cause IO hang in del_gendisk() because pending writeback IOs are
> still waited in bio_queue_enter().

Shouldn't those writebacks be unblocked by the existing check in
bio_queue_enter, test_bit(GD_DEAD, &disk->state))? Or are we missing a
disk state update or wakeup on this condition?
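
For reference, a simplified sketch of the bio_queue_enter() behavior being
discussed (paraphrased from the __bio_queue_enter() slow path, not verbatim
kernel source; REQ_NOWAIT and runtime-PM handling are omitted): a bio only
gets past a frozen queue once the freeze depth drops to zero or GD_DEAD is
set on the disk.

int bio_queue_enter_sketch(struct request_queue *q, struct bio *bio)
{
	struct gendisk *disk = bio->bi_bdev->bd_disk;

	/* q_usage_counter is killed while the queue is frozen */
	while (!percpu_ref_tryget_live(&q->q_usage_counter)) {
		/* sleep until the queue is unfrozen or the disk is marked dead */
		wait_event(q->mq_freeze_wq,
			   !q->mq_freeze_depth ||
			   test_bit(GD_DEAD, &disk->state));

		if (test_bit(GD_DEAD, &disk->state)) {
			/* fail the bio instead of waiting on a frozen queue */
			bio_io_error(bio);
			return -ENODEV;
		}
	}
	return 0;
}
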
Ming Lei June 15, 2023, 3:43 p.m. UTC | #2
On Thu, Jun 15, 2023 at 09:16:27AM -0600, Keith Busch wrote:
> On Thu, Jun 15, 2023 at 10:32:33PM +0800, Ming Lei wrote:
> > NVMe calls freeze/unfreeze in different contexts, and controller removal
> > may break in-progress error recovery, then leave queues in frozen state.
> > So cause IO hang in del_gendisk() because pending writeback IOs are
> > still waited in bio_queue_enter().
> 
> Shouldn't those writebacks be unblocked by the existing check in
> bio_queue_enter, test_bit(GD_DEAD, &disk->state))? Or are we missing a
> disk state update or wakeup on this condition?

GD_DEAD is only set if the device is really dead, and then all pending IO
will be failed.

If the device hasn't been marked dead via blk_mark_disk_dead(), we need to
try to handle these IOs first.

Thanks,
Ming
Christoph Hellwig June 16, 2023, 5:48 a.m. UTC | #3
On Thu, Jun 15, 2023 at 11:43:51PM +0800, Ming Lei wrote:
> On Thu, Jun 15, 2023 at 09:16:27AM -0600, Keith Busch wrote:
> > On Thu, Jun 15, 2023 at 10:32:33PM +0800, Ming Lei wrote:
> > > NVMe calls freeze/unfreeze in different contexts, and controller removal
> > > may break in-progress error recovery, then leave queues in frozen state.
> > > So cause IO hang in del_gendisk() because pending writeback IOs are
> > > still waited in bio_queue_enter().
> > 
> > Shouldn't those writebacks be unblocked by the existing check in
> > bio_queue_enter, test_bit(GD_DEAD, &disk->state))? Or are we missing a
> > disk state update or wakeup on this condition?
> 
> GD_DEAD is only set if the device is really dead, then all pending IO
> will be failed.

del_gendisk also sets GD_DEAD early on.
Ming Lei June 16, 2023, 7:20 a.m. UTC | #4
On Fri, Jun 16, 2023 at 07:48:00AM +0200, Christoph Hellwig wrote:
> On Thu, Jun 15, 2023 at 11:43:51PM +0800, Ming Lei wrote:
> > On Thu, Jun 15, 2023 at 09:16:27AM -0600, Keith Busch wrote:
> > > On Thu, Jun 15, 2023 at 10:32:33PM +0800, Ming Lei wrote:
> > > > NVMe calls freeze/unfreeze in different contexts, and controller removal
> > > > may break in-progress error recovery, then leave queues in frozen state.
> > > > So cause IO hang in del_gendisk() because pending writeback IOs are
> > > > still waited in bio_queue_enter().
> > > 
> > > Shouldn't those writebacks be unblocked by the existing check in
> > > bio_queue_enter, test_bit(GD_DEAD, &disk->state))? Or are we missing a
> > > disk state update or wakeup on this condition?
> > 
> > GD_DEAD is only set if the device is really dead, then all pending IO
> > will be failed.
> 
> del_gendisk also sets GD_DEAD early on.

No.

The hang happens in fsync_bdev() of del_gendisk(), and there are IOs pending on
bio_queue_enter().

Thanks,
Ming
Christoph Hellwig June 16, 2023, 7:27 a.m. UTC | #5
On Fri, Jun 16, 2023 at 03:20:38PM +0800, Ming Lei wrote:
> > > > Shouldn't those writebacks be unblocked by the existing check in
> > > > bio_queue_enter, test_bit(GD_DEAD, &disk->state))? Or are we missing a
> > > > disk state update or wakeup on this condition?
> > > 
> > > GD_DEAD is only set if the device is really dead, then all pending IO
> > > will be failed.
> > 
> > del_gendisk also sets GD_DEAD early on.
> 
> No.
> 
> The hang happens in fsync_bdev() of del_gendisk(), and there are IOs pending on
> bio_queue_enter().

What is the workload here?  If del_gendisk is called to remove a disk
that is in a perfectly fine state and can do I/O, fsync_bdev should write
back data, which is what it exists for.  If the disk is dead, we should
have called blk_mark_disk_dead before.
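
For context, a simplified sketch of what blk_mark_disk_dead() contributes in
the "disk is dead" case (paraphrased, not the verbatim kernel source): it sets
GD_DEAD and starts draining the queue, which wakes the freeze waitqueue so
bios parked in bio_queue_enter() observe GD_DEAD and fail instead of waiting
forever on a frozen queue.

void blk_mark_disk_dead_sketch(struct gendisk *disk)
{
	/* new and parked bios will observe this and be failed */
	set_bit(GD_DEAD, &disk->state);

	/*
	 * blk_queue_start_drain() wakes q->mq_freeze_wq, so waiters in
	 * bio_queue_enter() re-check the disk state and error out.
	 */
	blk_queue_start_drain(disk->queue);
}
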
Ming Lei June 16, 2023, 7:33 a.m. UTC | #6
On Fri, Jun 16, 2023 at 09:27:21AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 16, 2023 at 03:20:38PM +0800, Ming Lei wrote:
> > > > > Shouldn't those writebacks be unblocked by the existing check in
> > > > > bio_queue_enter, test_bit(GD_DEAD, &disk->state))? Or are we missing a
> > > > > disk state update or wakeup on this condition?
> > > > 
> > > > GD_DEAD is only set if the device is really dead, then all pending IO
> > > > will be failed.
> > > 
> > > del_gendisk also sets GD_DEAD early on.
> > 
> > No.
> > 
> > The hang happens in fsync_bdev() of del_gendisk(), and there are IOs pending on
> > bio_queue_enter().
> 
> What is the workload here?  If del_gendisk is called to remove a disk
> that is in perfectly fine state and can do I/O, fsync_bdev should write
> back data, which is what is exists for.  If the disk is dead, we should
> have called blk_mark_disk_dead before.

It is basically that removing the controller breaks in-progress error
recovery, and then the queues are left quiesced and frozen.

https://lore.kernel.org/linux-nvme/CAHj4cs-4gQHnp5aiekvJmb6o8qAcb6nLV61uOGFiisCzM49_dg@mail.gmail.com/T/#u

https://lore.kernel.org/linux-nvme/cover.1685350577.git.chunguang.xu@shopee.com/

Thanks, 
Ming
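
Piecing the thread together, a rough sketch of the failure sequence being
described (simplified; the exact call chains are in the linked reports and
may differ):

	error recovery (e.g. reset work)     controller removal
	--------------------------------     ------------------
	nvme_start_freeze(ctrl)
	  -> mq_freeze_depth: 0 -> 1
	... recovery is interrupted by
	removal, so the matching
	nvme_unfreeze(ctrl) never runs ...
	                                     nvme_remove_namespaces(ctrl)
	                                       nvme_ns_remove()
	                                         del_gendisk(ns->disk)
	                                           fsync_bdev()
	                                             submit_bio()
	                                               bio_queue_enter()
	                                                 -> waits forever: queue
	                                                    still frozen and
	                                                    GD_DEAD not set
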
Christoph Hellwig June 20, 2023, 5:41 a.m. UTC | #7
On Fri, Jun 16, 2023 at 03:20:38PM +0800, Ming Lei wrote:
> > > GD_DEAD is only set if the device is really dead, then all pending IO
> > > will be failed.
> > 
> > del_gendisk also sets GD_DEAD early on.
> 
> No.
> 
> The hang happens in fsync_bdev() of del_gendisk(), and there are IOs pending on
> bio_queue_enter().

IFF we know we can't do I/O by the time del_gendisk is called, we
need to call mark_disk_dead first and not paper over the problem.

An API that force unfreezes is just broken and will leave us with
freezecount mismatches.
Sagi Grimberg June 20, 2023, 10:01 a.m. UTC | #8
>>>> GD_DEAD is only set if the device is really dead, then all pending IO
>>>> will be failed.
>>>
>>> del_gendisk also sets GD_DEAD early on.
>>
>> No.
>>
>> The hang happens in fsync_bdev() of del_gendisk(), and there are IOs pending on
>> bio_queue_enter().
> 
> IFF we know we can't do I/O by the time del_gendisk is called, we
> need to call mark_disk_dead first and not paper over the problem.
> 
> An API that force unfreezes is just broken and will leaves us with
> freezecount mismatches.

I absolutely agree here.
Ming Lei June 20, 2023, 1:15 p.m. UTC | #9
On Tue, Jun 20, 2023 at 07:41:41AM +0200, Christoph Hellwig wrote:
> On Fri, Jun 16, 2023 at 03:20:38PM +0800, Ming Lei wrote:
> > > > GD_DEAD is only set if the device is really dead, then all pending IO
> > > > will be failed.
> > > 
> > > del_gendisk also sets GD_DEAD early on.
> > 
> > No.
> > 
> > The hang happens in fsync_bdev() of del_gendisk(), and there are IOs pending on
> > bio_queue_enter().
> 
> IFF we know we can't do I/O by the time del_gendisk is called, we
> need to call mark_disk_dead first and not paper over the problem.

In theory, device removal can happen at any time, when it isn't clear
whether the controller has recovered by then; that is why this API is
added, to avoid failing IO unnecessarily.

However, maybe it is just fine to mark the controller as dead when removal
breaks in-progress error recovery, given that the current nvme driver's
error handling isn't very fine-grained, so how about something like the
following:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c3d72fc677f7..120d98f348de 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -558,6 +558,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
        }

        if (changed) {
+               ctrl->old_state = ctrl->state;
                ctrl->state = new_state;
                wake_up_all(&ctrl->state_wq);
        }
@@ -4654,7 +4655,7 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
         * removing the namespaces' disks; fail all the queues now to avoid
         * potentially having to clean up the failed sync later.
         */
-       if (ctrl->state == NVME_CTRL_DEAD) {
+       if (ctrl->state == NVME_CTRL_DEAD || ctrl->old_state != NVME_CTRL_LIVE) {
                nvme_mark_namespaces_dead(ctrl);
                nvme_unquiesce_io_queues(ctrl);
        }
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 953e59f56139..7da53cc76f11 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -246,8 +246,9 @@ enum nvme_ctrl_flags {

 struct nvme_ctrl {
        bool comp_seen;
-       enum nvme_ctrl_state state;
        bool identified;
+       enum nvme_ctrl_state old_state;
+       enum nvme_ctrl_state state;
        spinlock_t lock;
        struct mutex scan_lock;
        const struct nvme_ctrl_ops *ops;

> 
> An API that force unfreezes is just broken and will leaves us with
> freezecount mismatches.

The freezecount mismatch problem is actually in the nvme driver; please
see the previous patch [1] I posted.


[1] https://lore.kernel.org/linux-block/20230613005847.1762378-1-ming.lei@redhat.com/T/#m17ac1aa71056b6517f8aefbae58c301f296f0a73


Thanks,
Ming
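
As an aside, the commit message says the new helper is to be called when
removing the device; the actual wiring lives in the later patches of this
series and is not shown here, but a purely hypothetical caller in the NVMe
namespace-removal path could look roughly like this (names prefixed with
example_ are illustrative, not from the series):

static void example_force_unfreeze_on_removal(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	/*
	 * Let writeback IO parked in bio_queue_enter() make progress before
	 * del_gendisk() syncs the block device, even if an interrupted error
	 * recovery left the queues frozen.
	 */
	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_mq_unfreeze_queue_force(ns->queue);
	up_read(&ctrl->namespaces_rwsem);
}
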

Patch

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 24dc8fe0a9d2..6ac58dc9e648 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -185,12 +185,16 @@  void blk_mq_freeze_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
-void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
+void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic,
+		bool force)
 {
 	mutex_lock(&q->mq_freeze_lock);
 	if (force_atomic)
 		q->q_usage_counter.data->force_atomic = true;
-	q->mq_freeze_depth--;
+	if (force)
+		q->mq_freeze_depth = 0;
+	else
+		q->mq_freeze_depth--;
 	WARN_ON_ONCE(q->mq_freeze_depth < 0);
 	if (!q->mq_freeze_depth) {
 		percpu_ref_resurrect(&q->q_usage_counter);
@@ -201,10 +205,25 @@  void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic)
 
 void blk_mq_unfreeze_queue(struct request_queue *q)
 {
-	__blk_mq_unfreeze_queue(q, false);
+	__blk_mq_unfreeze_queue(q, false, false);
 }
 EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue);
 
+/*
+ * Force-unfreeze a queue.
+ *
+ * Be careful: this API should only be used to avoid an IO hang in
+ * bio_queue_enter() when removing a disk that needs to drain pending
+ * writeback IO.
+ *
+ * Please don't use it for other cases.
+ */
+void blk_mq_unfreeze_queue_force(struct request_queue *q)
+{
+	__blk_mq_unfreeze_queue(q, false, true);
+}
+EXPORT_SYMBOL_GPL(blk_mq_unfreeze_queue_force);
+
 /*
  * FIXME: replace the scsi_internal_device_*block_nowait() calls in the
  * mpt3sas driver such that this function can be removed.
diff --git a/block/blk.h b/block/blk.h
index 768852a84fef..5c9f99051837 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -33,7 +33,8 @@  struct blk_flush_queue *blk_alloc_flush_queue(int node, int cmd_size,
 void blk_free_flush_queue(struct blk_flush_queue *q);
 
 void blk_freeze_queue(struct request_queue *q);
-void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic);
+void __blk_mq_unfreeze_queue(struct request_queue *q, bool force_atomic, bool
+		force);
 void blk_queue_start_drain(struct request_queue *q);
 int __bio_queue_enter(struct request_queue *q, struct bio *bio);
 void submit_bio_noacct_nocheck(struct bio *bio);
diff --git a/block/genhd.c b/block/genhd.c
index f71f82991434..184aa968b453 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -708,7 +708,7 @@  void del_gendisk(struct gendisk *disk)
 	 */
 	if (!test_bit(GD_OWNS_QUEUE, &disk->state)) {
 		blk_queue_flag_clear(QUEUE_FLAG_INIT_DONE, q);
-		__blk_mq_unfreeze_queue(q, true);
+		__blk_mq_unfreeze_queue(q, true, false);
 	} else {
 		if (queue_is_mq(q))
 			blk_mq_exit_queue(q);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f401067ac03a..fa265e85d753 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -890,6 +890,7 @@  void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset);
 void blk_mq_freeze_queue(struct request_queue *q);
 void blk_mq_unfreeze_queue(struct request_queue *q);
+void blk_mq_unfreeze_queue_force(struct request_queue *q);
 void blk_freeze_queue_start(struct request_queue *q);
 void blk_mq_freeze_queue_wait(struct request_queue *q);
 int blk_mq_freeze_queue_wait_timeout(struct request_queue *q,