
[V7,9/9] nvme: hold request queue's refcount in ns's whole lifetime

Message ID 20190424110221.17435-10-ming.lei@redhat.com (mailing list archive)
State Superseded
Series blk-mq: fix races related with freeing queue

Commit Message

Ming Lei April 24, 2019, 11:02 a.m. UTC
Hennes reported the following kernel oops:

    There is a race condition between namespace rescanning and
    controller reset; during controller reset all namespaces are
    quiesced via nams_stop_ctrl(), and after reset all namespaces
    are unquiesced again.
    If namespace scanning was active by the time the controller reset
    was triggered, the rescan code will call nvme_ns_remove(), which
    then causes a kernel crash in nvme_start_ctrl() as it trips
    over uninitialized namespaces.

Patch "blk-mq: free hw queue's resource in hctx's release handler"
should make this issue quite difficult to trigger. However it can't
kill the issue completely becasue pre-condition of that patch is to
hold request queue's refcount before calling block layer API, and
there is still a small window between blk_cleanup_queue() and removing
the ns from the controller namspace list in nvme_ns_remove().

Hold the request queue's refcount until the ns is freed, so the above
race can be avoided completely. Given that 'namespaces_rwsem' is always
held when retrieving a ns to start/stop its request queue, this lock
prevents namespaces from being freed.
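
For reference, a simplified sketch of the unquiesce path that makes this
window dangerous (an approximation of the driver code of this era, not a
verbatim copy): the reset path walks ctrl->namespaces under
namespaces_rwsem and dereferences ns->queue, so a ns whose queue has
already gone through blk_cleanup_queue() but which is still on the list
can be touched:

/*
 * Simplified sketch, not verbatim driver code: after a reset, the
 * unquiesce path walks the namespace list and touches ns->queue.
 * Between blk_cleanup_queue(ns->queue) and list_del_init(&ns->list)
 * in nvme_ns_remove(), this loop can still find the ns and operate on
 * a queue whose resources are being torn down.
 */
void nvme_start_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);
	list_for_each_entry(ns, &ctrl->namespaces, list)
		blk_mq_unquiesce_queue(ns->queue);
	up_read(&ctrl->namespaces_rwsem);
}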

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E. J. Bottomley <jejb@linux.vnet.ibm.com>
Reported-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/core.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Comments

Christoph Hellwig April 24, 2019, 4:27 p.m. UTC | #1
On Wed, Apr 24, 2019 at 07:02:21PM +0800, Ming Lei wrote:
> Hennes reported the following kernel oops:

Hannes?

> +	if (!blk_get_queue(ns->queue)) {
> +		ret = -ENXIO;
> +		goto out_free_queue;
> +	}

If we always need to hold a reference, shouldn't blk_mq_init_queue
return with that reference held (and yes, that means changes to
every driver, but it seems like we need to audit all of them anyway..)

It seems like the queue lifetimes are a bit of a mess, and I'm not sure
if this just papers over the problem.
Ming Lei April 25, 2019, 1 a.m. UTC | #2
On Wed, Apr 24, 2019 at 06:27:46PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 24, 2019 at 07:02:21PM +0800, Ming Lei wrote:
> > Hennes reported the following kernel oops:
> 
> Hannes?
> 
> > +	if (!blk_get_queue(ns->queue)) {
> > +		ret = -ENXIO;
> > +		goto out_free_queue;
> > +	}
> 
> If we always need to hold a reference, shouldn't blk_mq_init_queue
> return with that reference held (and yes, that means changes to
> every driver, but it seems like we need to audit all of them anyway..)

The issue is driver (NVMe) specific; the race window is just between
blk_cleanup_queue() and removing the ns from the controller namespace
list in nvme_ns_remove().

blk_mq_init_queue() does hold one refcount, and its counter-part is
blk_cleanup_queue().

It is simply ugly to ask blk_mq_init_queue() to grab a refcnt for the driver;
then who is the counter-part for releasing the extra refcount?

> 
> It seems like the queue lifetimes are a bit of a mess, and I'm not sure
> if this just papers over the problem.

Could you explain a bit what the mess is?


Thanks,
Ming
Christoph Hellwig April 26, 2019, 3:11 p.m. UTC | #3
On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> The issue is driver (NVMe) specific; the race window is just between
> blk_cleanup_queue() and removing the ns from the controller namespace
> list in nvme_ns_remove().

And I wouldn't be surprised if others have the same issue.

> 
> blk_mq_init_queue() does hold one refcount, and its counter-part is
> blk_cleanup_queue().
> 
> It is simply ugly to ask blk_mq_init_queue() to grab a refcnt for the driver;
> then who is the counter-part for releasing the extra refcount?

Well, the problem is exactly that blk_cleanup_queue drops the reference.
If we move the blk_put_queue() call from the end of it to the callers, the
callers can keep the reference as long as they need it, and we wouldn't
need an extra reference.
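
Roughly, a sketch of that convention (not existing code; nvme_ns_remove()
is used only as an example of what a caller would then look like):

	/*
	 * Rough sketch of the proposed convention, not existing code:
	 * blk_cleanup_queue() stops dropping the final reference, so the
	 * caller keeps a usable queue pointer until it drops it itself.
	 */
	del_gendisk(ns->disk);
	blk_cleanup_queue(ns->queue);		/* tear down, reference kept */

	down_write(&ns->ctrl->namespaces_rwsem);
	list_del_init(&ns->list);		/* no longer visible to reset paths */
	up_write(&ns->ctrl->namespaces_rwsem);

	blk_put_queue(ns->queue);		/* final reference dropped by the caller */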
Bart Van Assche April 26, 2019, 5:04 p.m. UTC | #4
On Fri, 2019-04-26 at 17:11 +0200, Christoph Hellwig wrote:
> On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> > The issue is driver (NVMe) specific; the race window is just between
> > blk_cleanup_queue() and removing the ns from the controller namespace
> > list in nvme_ns_remove().
> 
> And I wouldn't be surprised if others have the same issue.
> 
> > 
> > blk_mq_init_queue() does hold one refcount, and its counter-part is
> > blk_cleanup_queue().
> > 
> > It is simply ugly to ask blk_mq_init_queue() to grab a refcnt for the driver;
> > then who is the counter-part for releasing the extra refcount?
> 
> Well, the problem is exactly that blk_cleanup_queue drops the reference.
> If we move the blk_put_queue() call from the end of it to the callers, the
> callers can keep the reference as long as they need it, and we wouldn't
> need an extra reference.

Hi Christoph,

There are more than a hundred callers of blk_cleanup_queue(), so that change
would cause a lot of churn. Since blk_get_queue() and blk_put_queue() are
available, how about inserting a pair of calls to these functions where necessary?

Thanks,

Bart.
Ming Lei April 26, 2019, 10:45 p.m. UTC | #5
On Fri, Apr 26, 2019 at 05:11:14PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> > The issue is driver (NVMe) specific; the race window is just between
> > blk_cleanup_queue() and removing the ns from the controller namespace
> > list in nvme_ns_remove().
> 
> And I wouldn't be surprised if others have the same issue.

SCSI doesn't have this issue, and neither do loop/null/virtio-blk.

> 
> > 
> > blk_mq_init_queue() does hold one refcount, and its counter-part is
> > blk_cleanup_queue().
> > 
> > It is simply ugly to ask blk_mq_init_queue() to grab a refcnt for the driver;
> > then who is the counter-part for releasing the extra refcount?
> 
> Well, the problem is exactly that blk_cleanup_queue drops the reference.
> If we move the blk_put_queue() call from the end of it to the callers, the
> callers can keep the reference as long as they need it, and we wouldn't
> need an extra reference.

It depends on the driver.

There is still no difference between your suggestion and the approach in this
patch, given that a driver-specific change is a must. It is even cleaner to
hold the queue refcount in the driver explicitly, because we usually do the
get/put pair in one place, instead of getting the refcount in one subsystem
and doing the put in another place.

I am going to drop this patch in V8, since my original plan is to fix the
block layer's race in this patchset.


Thanks,
Ming
Ming Lei April 26, 2019, 10:49 p.m. UTC | #6
On Fri, Apr 26, 2019 at 10:04:23AM -0700, Bart Van Assche wrote:
> On Fri, 2019-04-26 at 17:11 +0200, Christoph Hellwig wrote:
> > On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> > > The issue is driver (NVMe) specific; the race window is just between
> > > blk_cleanup_queue() and removing the ns from the controller namespace
> > > list in nvme_ns_remove().
> > 
> > And I wouldn't be surprised if others have the same issue.
> > 
> > > 
> > > blk_mq_init_queue() does hold one refcount, and its counter-part is
> > > blk_cleanup_queue().
> > > 
> > > It is simply ugly to ask blk_mq_init_queue() to grab a refcnt for the driver;
> > > then who is the counter-part for releasing the extra refcount?
> > 
> > Well, the problem is exactly that blk_cleanup_queue drops the reference.
> > If we move the blk_put_queue() call from the end of it to the callers, the
> > callers can keep the reference as long as they need it, and we wouldn't
> > need an extra reference.
> 
> Hi Christoph,
> 
> There are more than a hundred callers of blk_cleanup_queue(), so that change
> would cause a lot of churn. Since blk_get_queue() and blk_put_queue() are
> available, how about inserting a pair of calls to these functions where necessary?

The problem is that the queue might still be used by some drivers after
blk_cleanup_queue() returns.

The gendisk is removed before the queue is cleaned up, and the other
activities on the queue depend on the driver itself. There can't be a
universal way to deal with that.

Thanks,
Ming
Christoph Hellwig April 27, 2019, 5:54 a.m. UTC | #7
On Sat, Apr 27, 2019 at 06:45:43AM +0800, Ming Lei wrote:
> There is still no difference between your suggestion and the approach in this
> patch, given that a driver-specific change is a must. It is even cleaner to
> hold the queue refcount in the driver explicitly, because we usually do the
> get/put pair in one place, instead of getting the refcount in one subsystem
> and doing the put in another place.
> 
> I am going to drop this patch in V8, since my original plan is to fix the
> block layer's race in this patchset.

Doing the unregistration and the final reference drop together is just a bad
pattern, leading to possible bugs. I'm fine with you just dropping it for now;
I guess I'll just send Jens a scripted conversion moving the put to the drivers
after -rc1 to sort it out, probably including a rename of blk_cleanup_queue so
that everyone notices the difference.

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 248ff3b48041..82cda6602ca7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -406,6 +406,7 @@  static void nvme_free_ns(struct kref *kref)
 		nvme_nvm_unregister(ns);
 
 	put_disk(ns->disk);
+	blk_put_queue(ns->queue);
 	nvme_put_ns_head(ns->head);
 	nvme_put_ctrl(ns->ctrl);
 	kfree(ns);
@@ -3229,6 +3230,11 @@  static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		goto out_free_ns;
 	}
 
+	if (!blk_get_queue(ns->queue)) {
+		ret = -ENXIO;
+		goto out_free_queue;
+	}
+
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
 	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
 		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
@@ -3245,7 +3251,7 @@  static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	id = nvme_identify_ns(ctrl, nsid);
 	if (!id) {
 		ret = -EIO;
-		goto out_free_queue;
+		goto out_put_queue;
 	}
 
 	if (id->ncap == 0) {
@@ -3304,6 +3310,8 @@  static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	nvme_put_ns_head(ns->head);
  out_free_id:
 	kfree(id);
+ out_put_queue:
+	blk_put_queue(ns->queue);
  out_free_queue:
 	blk_cleanup_queue(ns->queue);
  out_free_ns: