Message ID: 20190424110221.17435-10-ming.lei@redhat.com (mailing list archive)
State: Superseded
Series: blk-mq: fix races related with freeing queue
On Wed, Apr 24, 2019 at 07:02:21PM +0800, Ming Lei wrote:
> Hennes reported the following kernel oops:

Hannes?

> +	if (!blk_get_queue(ns->queue)) {
> +		ret = -ENXIO;
> +		goto out_free_queue;
> +	}

If we always need to hold a reference, shouldn't blk_mq_init_queue
return with that reference held? (And yes, that means changes to every
driver, but it seems like we need to audit all of them anyway.)

It seems like the queue lifetimes are a bit of a mess, and I'm not sure
if this just papers over the problem.
On Wed, Apr 24, 2019 at 06:27:46PM +0200, Christoph Hellwig wrote:
> On Wed, Apr 24, 2019 at 07:02:21PM +0800, Ming Lei wrote:
> > Hennes reported the following kernel oops:
>
> Hannes?
>
> > +	if (!blk_get_queue(ns->queue)) {
> > +		ret = -ENXIO;
> > +		goto out_free_queue;
> > +	}
>
> If we always need to hold a reference, shouldn't blk_mq_init_queue
> return with that reference held? (And yes, that means changes to every
> driver, but it seems like we need to audit all of them anyway.)

The issue is driver (NVMe) specific; the race window is between
blk_cleanup_queue() and removing the ns from the controller's namespace
list in nvme_ns_remove().

blk_mq_init_queue() does hold one refcount, and its counterpart is
blk_cleanup_queue(). It is simply ugly to ask blk_mq_init_queue() to
grab an extra refcount for the driver; then who is the counterpart that
releases that extra refcount?

> It seems like the queue lifetimes are a bit of a mess, and I'm not sure
> if this just papers over the problem.

Could you explain a bit what the mess is?

Thanks,
Ming
On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> The issue is driver (NVMe) specific; the race window is between
> blk_cleanup_queue() and removing the ns from the controller's namespace
> list in nvme_ns_remove().

And I wouldn't be surprised if others have the same issue.

> blk_mq_init_queue() does hold one refcount, and its counterpart is
> blk_cleanup_queue(). It is simply ugly to ask blk_mq_init_queue() to
> grab an extra refcount for the driver; then who is the counterpart that
> releases that extra refcount?

Well, the problem is exactly that blk_cleanup_queue drops the reference.
If we move the blk_put_queue() call from the end of it to the callers,
the callers can keep the reference as long as they need it, and we
wouldn't need an extra reference.
On Fri, 2019-04-26 at 17:11 +0200, Christoph Hellwig wrote:
> On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> > The issue is driver (NVMe) specific; the race window is between
> > blk_cleanup_queue() and removing the ns from the controller's
> > namespace list in nvme_ns_remove().
>
> And I wouldn't be surprised if others have the same issue.
>
> > blk_mq_init_queue() does hold one refcount, and its counterpart is
> > blk_cleanup_queue(). It is simply ugly to ask blk_mq_init_queue() to
> > grab an extra refcount for the driver; then who is the counterpart
> > that releases that extra refcount?
>
> Well, the problem is exactly that blk_cleanup_queue drops the
> reference. If we move the blk_put_queue() call from the end of it to
> the callers, the callers can keep the reference as long as they need
> it, and we wouldn't need an extra reference.

Hi Christoph,

There are more than a hundred callers of blk_cleanup_queue(), so that
change would cause a lot of churn. Since blk_get_queue() and
blk_put_queue() are available, how about inserting a pair of calls to
these functions where necessary?

Thanks,

Bart.
On Fri, Apr 26, 2019 at 05:11:14PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> > The issue is driver (NVMe) specific; the race window is between
> > blk_cleanup_queue() and removing the ns from the controller's
> > namespace list in nvme_ns_remove().
>
> And I wouldn't be surprised if others have the same issue.

SCSI doesn't have this issue, and neither do loop/null/virtio-blk.

> > blk_mq_init_queue() does hold one refcount, and its counterpart is
> > blk_cleanup_queue(). It is simply ugly to ask blk_mq_init_queue() to
> > grab an extra refcount for the driver; then who is the counterpart
> > that releases that extra refcount?
>
> Well, the problem is exactly that blk_cleanup_queue drops the
> reference. If we move the blk_put_queue() call from the end of it to
> the callers, the callers can keep the reference as long as they need
> it, and we wouldn't need an extra reference.

It depends on the driver. There is still no difference between your
suggestion and the approach in this patch, given that a driver-specific
change is a must. It is even cleaner for the driver to hold the queue
refcount explicitly, because then the get/put pair lives in one place,
instead of the get happening in one subsystem and the put somewhere
else.

I am going to drop this patch in V8, since my original plan was to fix
the block layer's races in this patchset.

Thanks,
Ming
On Fri, Apr 26, 2019 at 10:04:23AM -0700, Bart Van Assche wrote:
> On Fri, 2019-04-26 at 17:11 +0200, Christoph Hellwig wrote:
> > On Thu, Apr 25, 2019 at 09:00:31AM +0800, Ming Lei wrote:
> > > The issue is driver (NVMe) specific; the race window is between
> > > blk_cleanup_queue() and removing the ns from the controller's
> > > namespace list in nvme_ns_remove().
> >
> > And I wouldn't be surprised if others have the same issue.
> >
> > > blk_mq_init_queue() does hold one refcount, and its counterpart is
> > > blk_cleanup_queue(). It is simply ugly to ask blk_mq_init_queue()
> > > to grab an extra refcount for the driver; then who is the
> > > counterpart that releases that extra refcount?
> >
> > Well, the problem is exactly that blk_cleanup_queue drops the
> > reference. If we move the blk_put_queue() call from the end of it to
> > the callers, the callers can keep the reference as long as they need
> > it, and we wouldn't need an extra reference.
>
> Hi Christoph,
>
> There are more than a hundred callers of blk_cleanup_queue(), so that
> change would cause a lot of churn. Since blk_get_queue() and
> blk_put_queue() are available, how about inserting a pair of calls to
> these functions where necessary?

The problem is that the queue might still be used by some drivers after
blk_cleanup_queue() returns. The gendisk is removed before the queue is
cleaned up, and any other activity on the queue depends on the driver
itself, so there can't be a universal way to deal with that.

Thanks,
Ming
On Sat, Apr 27, 2019 at 06:45:43AM +0800, Ming Lei wrote:
> There is still no difference between your suggestion and the approach
> in this patch, given that a driver-specific change is a must. It is
> even cleaner for the driver to hold the queue refcount explicitly,
> because then the get/put pair lives in one place, instead of the get
> happening in one subsystem and the put somewhere else.
>
> I am going to drop this patch in V8, since my original plan was to fix
> the block layer's races in this patchset.

Doing the unregistration and the final reference drop in one call is
just a bad pattern, leading to possible bugs. I'm fine with you dropping
it for now; I guess I'll send Jens a scripted conversion after -rc1 that
moves the put into the drivers to sort it out, probably including a
rename of blk_cleanup_queue so that everyone notices the difference.
```diff
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 248ff3b48041..82cda6602ca7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -406,6 +406,7 @@ static void nvme_free_ns(struct kref *kref)
 		nvme_nvm_unregister(ns);
 
 	put_disk(ns->disk);
+	blk_put_queue(ns->queue);
 	nvme_put_ns_head(ns->head);
 	nvme_put_ctrl(ns->ctrl);
 	kfree(ns);
@@ -3229,6 +3230,11 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 		goto out_free_ns;
 	}
 
+	if (!blk_get_queue(ns->queue)) {
+		ret = -ENXIO;
+		goto out_free_queue;
+	}
+
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
 	if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
 		blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);
@@ -3245,7 +3251,7 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	id = nvme_identify_ns(ctrl, nsid);
 	if (!id) {
 		ret = -EIO;
-		goto out_free_queue;
+		goto out_put_queue;
 	}
 
 	if (id->ncap == 0) {
@@ -3304,6 +3310,8 @@ static int nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	nvme_put_ns_head(ns->head);
  out_free_id:
 	kfree(id);
+ out_put_queue:
+	blk_put_queue(ns->queue);
  out_free_queue:
 	blk_cleanup_queue(ns->queue);
  out_free_ns:
```