| Message ID | 20190404084320.24681-2-ming.lei@redhat.com (mailing list archive) |
|---|---|
| State | Superseded |
| Series | blk-mq: fix races related with freeing queue |
On Thu, 2019-04-04 at 16:43 +0800, Ming Lei wrote:
> Just like aio/io_uring, we need to grab two refcounts for queuing one
> request: one for submission and one for completion.
>
> [ full patch quoted ]

Although this patch looks fine to me: have you considered inserting one
percpu_ref_get() call at the start of blk_mq_flush_plug_list() and one
percpu_ref_put() call at the end of the same function?

Thanks,

Bart.
On Thu, Apr 4, 2019 at 11:58 PM Bart Van Assche <bvanassche@acm.org> wrote:
> On Thu, 2019-04-04 at 16:43 +0800, Ming Lei wrote:
> > [ full patch quoted ]
>
> Although this patch looks fine to me: have you considered inserting one
> percpu_ref_get() call at the start of blk_mq_flush_plug_list() and one
> percpu_ref_put() call at the end of the same function?

Requests from different request queues can be added to the same per-task
plug list, so we can't simply do it that way.

Thanks,

Ming Lei
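To make this point concrete, here is a minimal C sketch of the constraint Ming describes. It is illustrative pseudologic, not the real blk_mq_flush_plug_list(); only list_for_each_entry_safe(), percpu_ref_get()/percpu_ref_put(), and the rq->q / q_usage_counter fields are real kernel symbols, and the function name is hypothetical.

/*
 * Simplified sketch, not the real implementation: one per-task plug
 * list may mix requests for several request queues, so a single
 * get/put around the whole flush would only pin one queue.  Each
 * queue has to be pinned while its own requests are inserted and its
 * hardware queue is run.
 */
static void flush_plug_list_sketch(struct list_head *plug_list)
{
	struct request *rq, *next;

	list_for_each_entry_safe(rq, next, plug_list, queuelist) {
		struct request_queue *q = rq->q;	/* can differ per request */

		percpu_ref_get(&q->q_usage_counter);	/* pin this queue */
		/* ... insert rq and run the matching hardware queue ... */
		percpu_ref_put(&q->q_usage_counter);
	}
}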
On Thu, 2019-04-04 at 16:43 +0800, Ming Lei wrote:
> Just like aio/io_uring, we need to grab two refcounts for queuing one
> request: one for submission and one for completion.
>
> [ full patch quoted ]

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Hi Ming,

On 04/04/2019 04:43 PM, Ming Lei wrote:
> [ commit message quoted ]
>
> @@ -1728,9 +1728,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
>  		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
>  			if (this_hctx) {
>  				trace_block_unplug(this_q, depth, !from_schedule);
> +
> +				percpu_ref_get(&this_q->q_usage_counter);

Sorry to bother, but I would just like to double-confirm the reason to use
"percpu_ref_get()" here, which does not check whether the queue has been
frozen.

Is it because there is an assumption that any direct/indirect caller of
blk_mq_flush_plug_list() must have already grabbed q_usage_counter, which is
similar to blk_queue_enter_live()?

Thank you very much!

Dongli Zhang
On Fri, Apr 05, 2019 at 05:26:24PM +0800, Dongli Zhang wrote:
> Sorry to bother, but I would just like to double-confirm the reason to use
> "percpu_ref_get()" here, which does not check whether the queue has been
> frozen.
>
> Is it because there is an assumption that any direct/indirect caller of
> blk_mq_flush_plug_list() must have already grabbed q_usage_counter, which is
> similar to blk_queue_enter_live()?

Because there is a request in the plug list to be queued.

Thanks,

Ming
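For readers puzzling over the same question, a hedged sketch of the distinction: the normal submission path must cope with a queue that may already be frozen, while the plug-flush path runs with a not-yet-queued request whose own reference keeps the counter from reaching zero, so an unconditional get suffices. Only percpu_ref_get(), percpu_ref_tryget_live() and q_usage_counter are real kernel symbols here; the two function names are illustrative.

/*
 * Sketch only.  The submission path must detect a frozen queue, so it
 * uses the "tryget_live" form (roughly what blk_queue_enter() does).
 * In blk_mq_flush_plug_list() a pending request already holds a
 * reference until it completes, so q_usage_counter cannot drop to
 * zero here and a plain get is safe (roughly what the
 * blk_queue_enter_live() helper mentioned above does).
 */
static int enter_queue_sketch(struct request_queue *q)
{
	if (!percpu_ref_tryget_live(&q->q_usage_counter))
		return -ENODEV;		/* queue is (being) frozen */
	return 0;
}

static void pin_queue_for_plug_flush_sketch(struct request_queue *q)
{
	percpu_ref_get(&q->q_usage_counter);	/* counter known non-zero */
}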
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 3ff3d7b49969..5b586affee09 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1728,9 +1728,12 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 		if (rq->mq_hctx != this_hctx || rq->mq_ctx != this_ctx) {
 			if (this_hctx) {
 				trace_block_unplug(this_q, depth, !from_schedule);
+
+				percpu_ref_get(&this_q->q_usage_counter);
 				blk_mq_sched_insert_requests(this_hctx, this_ctx,
 							     &rq_list,
 							     from_schedule);
+				percpu_ref_put(&this_q->q_usage_counter);
 			}
 
 			this_q = rq->q;
@@ -1749,8 +1752,11 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	 */
 	if (this_hctx) {
 		trace_block_unplug(this_q, depth, !from_schedule);
+
+		percpu_ref_get(&this_q->q_usage_counter);
 		blk_mq_sched_insert_requests(this_hctx, this_ctx, &rq_list,
 					     from_schedule);
+		percpu_ref_put(&this_q->q_usage_counter);
 	}
 }
Just like aio/io_uring, we need to grab two refcounts for queuing one
request: one for submission and one for completion.

If the request isn't queued from the plug code path, the refcount grabbed
in generic_make_request() covers submission. In theory, this refcount
should be released once submission (the async run-queue work) is done.
blk_freeze_queue() works together with blk_sync_queue() to avoid races
between queue cleanup and IO submission: async run-queue activity is
canceled because hctx->run_work is scheduled with the refcount held, so it
is fine not to hold the refcount while the run-queue work function
dispatches IO.

However, if the request is staged on a plug list and finally queued from
the plug code path, the submission-side refcount is actually missing. We
may then start to run the queue after the queue has been removed, because
the queue's kobject refcount isn't guaranteed to be held in the
plug-flushing context, and a kernel oops is triggered. See the following
race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
        blk_mq_run_hw_queue

Because of concurrent queue runs, all requests inserted above may be
completed before the call to blk_mq_run_hw_queue(), so the queue can be
freed during that blk_mq_run_hw_queue().

Fix the issue by grabbing .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This is safe
because the queue is certainly alive before inserting a request.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c | 6 ++++++
 1 file changed, 6 insertions(+)
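As a closing illustration, a hedged sketch of the reference lifetime the commit message describes: the submission reference normally taken via generic_make_request()/blk_queue_enter() is not held on the plug path, so the patch pins q_usage_counter around the insert-and-run window. The blk-mq symbols below are real; the wrapper function is hypothetical and only restates what the diff above already does.

/*
 * Sketch of the two-reference model borrowed from aio/io_uring:
 *
 *   submission reference: taken when the bio enters the block layer
 *                         (generic_make_request() -> blk_queue_enter()),
 *                         dropped once submission work is done.
 *   completion reference: held until the request completes.
 *
 * On the plug path the submission reference is not held, which is the
 * window the patch closes:
 */
static void plug_flush_window_sketch(struct request_queue *q,
				     struct blk_mq_hw_ctx *hctx,
				     struct blk_mq_ctx *ctx,
				     struct list_head *rq_list)
{
	percpu_ref_get(&q->q_usage_counter);	/* submission-side pin */
	/*
	 * blk_mq_sched_insert_requests() inserts the batch and kicks the
	 * hardware queue.  Without the pin, every inserted request could
	 * complete concurrently and the queue could be freed while the
	 * hardware queue is still being run.
	 */
	blk_mq_sched_insert_requests(hctx, ctx, rq_list, false);
	percpu_ref_put(&q->q_usage_counter);
}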