
[0/5] blk-mq: allow to run queue if queue refcount is held

Message ID 20190331030954.22320-1-ming.lei@redhat.com

Message

Ming Lei March 31, 2019, 3:09 a.m. UTC
Hi,

Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), running the queue
isn't allowed during queue cleanup even though the queue's refcount is
held.

This change has caused lots of kernel oopses triggered in the run-queue
path, and it turns out they aren't easy to fix one by one.

So move the freeing of hw queue resources into the queue's release
handler, which fixes the above issue. This approach is also safe, given
that freeing hw queue resources doesn't require the use of tags.
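
(A minimal sketch of the idea for illustration, not the actual patch: the
helper name below is hypothetical, while the freed fields are the real hctx
resources discussed later in this thread. The point is that teardown is
driven from the queue's release handler, so a late run-queue caller holding
only a queue kobject reference never sees freed memory.)

static void blk_mq_hctx_free_resources(struct blk_mq_hw_ctx *hctx)
{
	/*
	 * None of this requires tags, so it can safely wait until
	 * the last queue kobject reference is dropped.
	 */
	if (hctx->flags & BLK_MQ_F_BLOCKING)
		cleanup_srcu_struct(hctx->srcu);
	blk_free_flush_queue(hctx->fq);
	sbitmap_free(&hctx->ctx_map);
}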

Thanks,

Ming Lei (5):
  blk-mq: re-organize blk_mq_exit_hctx() into two parts
  blk-mq: re-organize blk_mq_exit_hw_queues() into two parts
  blk-mq: free hw queues in queue's release handler
  block: don't drain in-progress dispatch in blk_cleanup_queue()
  SCSI: don't grab queue usage counter before run queue

 block/blk-core.c        | 14 +-------------
 block/blk-mq.c          | 32 +++++++++++++++++++++++++-------
 block/blk-mq.h          |  3 ++-
 drivers/scsi/scsi_lib.c |  7 -------
 4 files changed, 28 insertions(+), 28 deletions(-)

Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
Cc: jianchao wang <jianchao.w.wang@oracle.com>

Comments

Bart Van Assche March 31, 2019, 3:27 p.m. UTC | #1
On 3/30/19 8:09 PM, Ming Lei wrote:
> Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), run queue isn't
> allowed during cleanup queue even though queue refcount is held.
> 
> This change has caused lots of kernel oops triggered in run queue path,
> turns out it isn't easy to fix them all.
> 
> So move freeing of hw queue resources into queue's release handler, then
> the above issue is fixed. Meantime, this way is safe given freeing hw
> queue resource doesn't require to use tags.

I'm not sure the approach of this patch series is really the direction
we should pursue. There are many block drivers that free resources
immediately after blk_cleanup_queue() returns. An example from the brd
driver:

static void brd_free(struct brd_device *brd)
{
	put_disk(brd->brd_disk);
	blk_cleanup_queue(brd->brd_queue);
	brd_free_pages(brd);
	kfree(brd);
}

I'd like to avoid having to modify all block drivers that free resources
immediately after blk_cleanup_queue() has returned. Have you considered
modifying blk_mq_run_hw_queues() such that it becomes safe to call that
function while blk_cleanup_queue() is in progress, e.g. by inserting a
percpu_ref_tryget_live(&q->q_usage_counter) /
percpu_ref_put(&q->q_usage_counter) pair?
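
(For illustration, a minimal sketch of the suggested guard, assuming the
blk_mq_run_hw_queues() loop of that kernel era; this is not Bart's actual
patch:)

void blk_mq_run_hw_queues(struct request_queue *q, bool async)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	/* bail out if the queue is frozen or being cleaned up */
	if (!percpu_ref_tryget_live(&q->q_usage_counter))
		return;

	queue_for_each_hw_ctx(q, hctx, i) {
		if (blk_mq_hctx_stopped(hctx))
			continue;
		blk_mq_run_hw_queue(hctx, async);
	}

	percpu_ref_put(&q->q_usage_counter);
}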

Thanks,

Bart.
Ming Lei April 1, 2019, 2 a.m. UTC | #2
On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
> On 3/30/19 8:09 PM, Ming Lei wrote:
> > Since 45a9c9d909b2 ("blk-mq: Fix a use-after-free"), run queue isn't
> > allowed during cleanup queue even though queue refcount is held.
> > 
> > This change has caused lots of kernel oops triggered in run queue path,
> > turns out it isn't easy to fix them all.
> > 
> > So move freeing of hw queue resources into queue's release handler, then
> > the above issue is fixed. Meantime, this way is safe given freeing hw
> > queue resource doesn't require to use tags.
> 
> I'm not sure the approach of this patch series is really the direction we
> should pursue. There are many block driver that free resources immediately

Please see scsi_run_queue(): the queue refcount is always held before
running the queue. That is, this approach has been working well for a long
time. However, 45a9c9d909b2 ("blk-mq: Fix a use-after-free") breaks this
behaviour and causes kernel oopses.

> after blk_cleanup_queue() returns. An example from the brd driver:
> 
> static void brd_free(struct brd_device *brd)
> {
> 	put_disk(brd->brd_disk);
> 	blk_cleanup_queue(brd->brd_queue);
> 	brd_free_pages(brd);
> 	kfree(brd);
> }

Once blk_cleanup_queue() has returned, and in particular once the queue is
frozen, there isn't any request left to be queued to the driver, so freeing
driver resources isn't a problem: we won't call into the driver any more in
this situation.

This patchset fixes the race between queue cleanup and run queue:

1) run queue may start before the queue is frozen, yet blk_mq_run_hw_queues()
may not have finished when blk_freeze_queue() returns;

2) run queue may happen after blk_freeze_queue() returns.

> 
> I'd like to avoid having to modify all block drivers that free resources
> immediately after blk_cleanup_queue() has returned. Have you considered to
> modify blk_mq_run_hw_queues() such that it becomes safe to call that
> function while blk_cleanup_queue() is in progress, e.g. by inserting a
> percpu_ref_tryget_live(&q->q_usage_counter) /
> percpu_ref_put(&q->q_usage_counter) pair?

It can't work, because blk_mq_run_hw_queues() may be called after
percpu_ref_exit() is done.

However, if we move percpu_ref_exit() into the queue's release handler, we
don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
and we still have to free hw queue resources in the queue's release handler;
that is exactly what this patchset is doing.

In short, getting q->q_usage_counter doesn't make a difference for this
issue.

Thanks,
Ming
Bart Van Assche April 1, 2019, 2:39 a.m. UTC | #3
On 3/31/19 7:00 PM, Ming Lei wrote:
> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
>> I'm not sure the approach of this patch series is really the direction we
>> should pursue. There are many block driver that free resources immediately
> 
> Please see scsi_run_queue(), and the queue refcount is always held
> before run queue.

That's not correct. There is no guarantee that q->q_usage_counter > 0 
when scsi_run_queue() is called from inside scsi_requeue_run_queue().

>> I'd like to avoid having to modify all block drivers that free resources
>> immediately after blk_cleanup_queue() has returned. Have you considered to
>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
>> percpu_ref_tryget_live(&q->q_usage_counter) /
>> percpu_ref_put(&q->q_usage_counter) pair?
> 
> It can't work because blk_mq_run_hw_queues may happen after
> percpu_ref_exit() is done.
> 
> However, if we move percpu_ref_exit() into queue's release handler, we
> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
> and we still have to free hw queue resources in queue's release handler,
> that is exactly what this patchset is doing.
> 
> In short, getting q->q_usage_counter doesn't make a difference on this
> issue.

percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" 
state. percpu_ref_kill() changes the state of a per-cpu counter to the 
"dead" state. blk_freeze_queue_start() calls percpu_ref_kill(). 
blk_cleanup_queue() already calls blk_set_queue_dying() and that last 
function calls blk_freeze_queue_start(). So I think that what you wrote 
is not correct and that inserting a 
percpu_ref_tryget_live()/percpu_ref_put() pair in blk_mq_run_hw_queues() 
or blk_mq_run_hw_queue() would make a difference and also that moving 
the percpu_ref_exit() call into blk_release_queue() makes sense.
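
(For reference, the call chain described above, abridged from the block
layer of that kernel era; elisions marked, treat details as an
approximation:)

void blk_set_queue_dying(struct request_queue *q)
{
	blk_queue_flag_set(QUEUE_FLAG_DYING, q);

	/* starts the freeze, which marks q_usage_counter dead */
	blk_freeze_queue_start(q);
	/* ... wake up waiters ... */
}

void blk_freeze_queue_start(struct request_queue *q)
{
	if (atomic_inc_return(&q->mq_freeze_depth) == 1) {
		/* after this, percpu_ref_tryget_live() fails */
		percpu_ref_kill(&q->q_usage_counter);
		if (queue_is_mq(q))
			blk_mq_run_hw_queues(q, false);
	}
}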

Bart.
jianchao.wang April 1, 2019, 2:44 a.m. UTC | #4
On 4/1/19 10:39 AM, Bart Van Assche wrote:
> On 3/31/19 7:00 PM, Ming Lei wrote:
>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
>>> I'm not sure the approach of this patch series is really the direction we
>>> should pursue. There are many block driver that free resources immediately
>>
>> Please see scsi_run_queue(), and the queue refcount is always held
>> before run queue.
> 
> That's not correct. There is no guarantee that q->q_usage_counter > 0 when scsi_run_queue() is called from inside scsi_requeue_run_queue().
> 
>>> I'd like to avoid having to modify all block drivers that free resources
>>> immediately after blk_cleanup_queue() has returned. Have you considered to
>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
>>> percpu_ref_tryget_live(&q->q_usage_counter) /
>>> percpu_ref_put(&q->q_usage_counter) pair?
>>
>> It can't work because blk_mq_run_hw_queues may happen after
>> percpu_ref_exit() is done.
>>
>> However, if we move percpu_ref_exit() into queue's release handler, we
>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
>> and we still have to free hw queue resources in queue's release handler,
>> that is exactly what this patchset is doing.
>>
>> In short, getting q->q_usage_counter doesn't make a difference on this
>> issue.
> 
> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state. percpu_ref_kill() changes the state of a per-cpu counter to the "dead" state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() already calls blk_set_queue_dying() and that last function calls blk_freeze_queue_start(). So I think that what you wrote is not correct and that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and also that moving the percpu_ref_exit() call into blk_release_queue() makes sense.
> 

percpu_ref_tryget() would be better: it still allows pending requests to be issued while the queue is frozen.

Thanks
Jianchao
Ming Lei April 1, 2019, 2:52 a.m. UTC | #5
On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
> On 3/31/19 7:00 PM, Ming Lei wrote:
> > On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
> > > I'm not sure the approach of this patch series is really the direction we
> > > should pursue. There are many block driver that free resources immediately
> > 
> > Please see scsi_run_queue(), and the queue refcount is always held
> > before run queue.
> 
> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
> scsi_run_queue() is called from inside scsi_requeue_run_queue().

We don't need the guarantee of 'q->q_usage_counter > 0'; I mean the
queue's kobject reference counter.

What we need is to allow run queue to work correctly after the queue is
frozen or cleaned up.

> 
> > > I'd like to avoid having to modify all block drivers that free resources
> > > immediately after blk_cleanup_queue() has returned. Have you considered to
> > > modify blk_mq_run_hw_queues() such that it becomes safe to call that
> > > function while blk_cleanup_queue() is in progress, e.g. by inserting a
> > > percpu_ref_tryget_live(&q->q_usage_counter) /
> > > percpu_ref_put(&q->q_usage_counter) pair?
> > 
> > It can't work because blk_mq_run_hw_queues may happen after
> > percpu_ref_exit() is done.
> > 
> > However, if we move percpu_ref_exit() into queue's release handler, we
> > don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
> > and we still have to free hw queue resources in queue's release handler,
> > that is exactly what this patchset is doing.
> > 
> > In short, getting q->q_usage_counter doesn't make a difference on this
> > issue.
> 
> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> already calls blk_set_queue_dying() and that last function calls
> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> sense.

If percpu_ref_exit() is moved to blk_release_queue(), we still need to
move the freeing of hw queue resources into blk_release_queue(), as this
patchset is doing.

Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any
more, do we?


Thanks,
Ming
jianchao.wang April 1, 2019, 3:25 a.m. UTC | #6
Hi Ming

On 4/1/19 10:52 AM, Ming Lei wrote:
>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>> already calls blk_set_queue_dying() and that last function calls
>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>> sense.
> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> move freeing of hw queue's resource into blk_release_queue() like what
> the patchset is doing.
> 
> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
> do we?

IMO, if we could find a way to prevent any attempt to run the queue, it
would be better and clearer.

Thanks
Jianchao
Ming Lei April 1, 2019, 3:27 a.m. UTC | #7
On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
> On 3/31/19 7:00 PM, Ming Lei wrote:
> > On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
> > > I'm not sure the approach of this patch series is really the direction we
> > > should pursue. There are many block driver that free resources immediately
> > 
> > Please see scsi_run_queue(), and the queue refcount is always held
> > before run queue.
> 
> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
> scsi_run_queue() is called from inside scsi_requeue_run_queue().
> 
> > > I'd like to avoid having to modify all block drivers that free resources
> > > immediately after blk_cleanup_queue() has returned. Have you considered to
> > > modify blk_mq_run_hw_queues() such that it becomes safe to call that
> > > function while blk_cleanup_queue() is in progress, e.g. by inserting a
> > > percpu_ref_tryget_live(&q->q_usage_counter) /
> > > percpu_ref_put(&q->q_usage_counter) pair?
> > 
> > It can't work because blk_mq_run_hw_queues may happen after
> > percpu_ref_exit() is done.
> > 
> > However, if we move percpu_ref_exit() into queue's release handler, we
> > don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
> > and we still have to free hw queue resources in queue's release handler,
> > that is exactly what this patchset is doing.
> > 
> > In short, getting q->q_usage_counter doesn't make a difference on this
> > issue.
> 
> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> already calls blk_set_queue_dying() and that last function calls
> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> sense.

This approach can easily cause a deadlock!

If percpu_ref_tryget_live() is called at the entry of blk_mq_run_hw_queues()
while blk_freeze_queue_start() is called at the same time, percpu_ref_tryget_live()
will fail, the run queue can't move on, and blk_mq_freeze_queue_wait() will wait
forever.
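
(A timeline of the hang, assuming the percpu_ref_tryget_live() guard
sketched earlier in the thread were added to blk_mq_run_hw_queues(); the
interleaving is illustrative:)

/*
 * CPU0 (freeze side)                 CPU1 (dispatch side)
 *
 * blk_freeze_queue_start()
 *   percpu_ref_kill(&q->q_usage_counter)
 *                                    blk_mq_run_hw_queues(q, async)
 *                                      percpu_ref_tryget_live() -> false
 *                                      return, nothing dispatched
 * blk_mq_freeze_queue_wait()
 *   waits for q_usage_counter to reach zero, but the requests
 *   already allocated can only complete if the queue is run,
 *   and every run attempt now bails out -> waits forever
 */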

Thanks,
Ming
Ming Lei April 1, 2019, 3:28 a.m. UTC | #8
On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 4/1/19 10:52 AM, Ming Lei wrote:
> >> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> >> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> >> already calls blk_set_queue_dying() and that last function calls
> >> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> >> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> >> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> >> sense.
> > If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> > move freeing of hw queue's resource into blk_release_queue() like what
> > the patchset is doing.
> > 
> > Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
> > do we?
> 
> IMO, if we could get a way to prevent any attempt to run queue, it would be
> better and clearer.

It is hard to do it that way, and it isn't necessary.

I will post V2 soon for review.

Thanks,
Ming
jianchao.wang April 1, 2019, 3:32 a.m. UTC | #9
On 4/1/19 11:27 AM, Ming Lei wrote:
> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
>> On 3/31/19 7:00 PM, Ming Lei wrote:
>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
>>>> I'm not sure the approach of this patch series is really the direction we
>>>> should pursue. There are many block driver that free resources immediately
>>>
>>> Please see scsi_run_queue(), and the queue refcount is always held
>>> before run queue.
>>
>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
>> scsi_run_queue() is called from inside scsi_requeue_run_queue().
>>
>>>> I'd like to avoid having to modify all block drivers that free resources
>>>> immediately after blk_cleanup_queue() has returned. Have you considered to
>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
>>>> percpu_ref_tryget_live(&q->q_usage_counter) /
>>>> percpu_ref_put(&q->q_usage_counter) pair?
>>>
>>> It can't work because blk_mq_run_hw_queues may happen after
>>> percpu_ref_exit() is done.
>>>
>>> However, if we move percpu_ref_exit() into queue's release handler, we
>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
>>> and we still have to free hw queue resources in queue's release handler,
>>> that is exactly what this patchset is doing.
>>>
>>> In short, getting q->q_usage_counter doesn't make a difference on this
>>> issue.
>>
>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>> already calls blk_set_queue_dying() and that last function calls
>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>> sense.
> 
> This way is easy to cause deadlock!!!
> 
> If percpu_ref_tryget_live() is called in the entry of blk_mq_run_hw_queues(), 
> at the same time, blk_freeze_queue_start() is called, then percpu_ref_tryget_live()
> will fail, and run queue can't move on, then blk_mq_freeze_queue_wait() will wait
> forever.
> 

percpu_ref_tryget() would be better: it still allows pending requests to be issued while the queue is frozen.

Thanks
Jianchao
Dongli Zhang April 1, 2019, 5:05 a.m. UTC | #10
On 4/1/19 10:52 AM, Ming Lei wrote:
> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
>> On 3/31/19 7:00 PM, Ming Lei wrote:
>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
>>>> I'm not sure the approach of this patch series is really the direction we
>>>> should pursue. There are many block driver that free resources immediately
>>>
>>> Please see scsi_run_queue(), and the queue refcount is always held
>>> before run queue.
>>
>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
>> scsi_run_queue() is called from inside scsi_requeue_run_queue().
> 
> We don't need the guarantee of 'q->q_usage_counter > 0', I mean the
> queue's kobj reference counter.
> 
> What we need is to allow run queue to work correctly after queue is frozen
> or cleaned up.
> 
>>
>>>> I'd like to avoid having to modify all block drivers that free resources
>>>> immediately after blk_cleanup_queue() has returned. Have you considered to
>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
>>>> percpu_ref_tryget_live(&q->q_usage_counter) /
>>>> percpu_ref_put(&q->q_usage_counter) pair?
>>>
>>> It can't work because blk_mq_run_hw_queues may happen after
>>> percpu_ref_exit() is done.
>>>
>>> However, if we move percpu_ref_exit() into queue's release handler, we
>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
>>> and we still have to free hw queue resources in queue's release handler,
>>> that is exactly what this patchset is doing.
>>>
>>> In short, getting q->q_usage_counter doesn't make a difference on this
>>> issue.
>>
>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>> already calls blk_set_queue_dying() and that last function calls
>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>> sense.
> 
> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> move freeing of hw queue's resource into blk_release_queue() like what
> the patchset is doing.

Hi Ming,

Would you mind helping to explain why we still need to move the freeing of
hw queue resources into blk_release_queue(), as the patchset is doing?

Let's assume there is no deadlock when percpu_ref_tryget_live() is used:
blk_mq_run_hw_queues() would not be able to move forward, as __PERCPU_REF_DEAD
is already set. Why do we still need to move the freeing of hw queue resources
into blk_release_queue()?

Thank you very much!

Dongli Zhang

> 
> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
> do we?
> 
> 
> Thanks,
> Ming
>
Ming Lei April 1, 2019, 5:16 a.m. UTC | #11
Hi Dongli,

On Mon, Apr 01, 2019 at 01:05:46PM +0800, Dongli Zhang wrote:
> 
> 
> On 4/1/19 10:52 AM, Ming Lei wrote:
> > On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
> >> On 3/31/19 7:00 PM, Ming Lei wrote:
> >>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
> >>>> I'm not sure the approach of this patch series is really the direction we
> >>>> should pursue. There are many block driver that free resources immediately
> >>>
> >>> Please see scsi_run_queue(), and the queue refcount is always held
> >>> before run queue.
> >>
> >> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
> >> scsi_run_queue() is called from inside scsi_requeue_run_queue().
> > 
> > We don't need the guarantee of 'q->q_usage_counter > 0', I mean the
> > queue's kobj reference counter.
> > 
> > What we need is to allow run queue to work correctly after queue is frozen
> > or cleaned up.
> > 
> >>
> >>>> I'd like to avoid having to modify all block drivers that free resources
> >>>> immediately after blk_cleanup_queue() has returned. Have you considered to
> >>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
> >>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
> >>>> percpu_ref_tryget_live(&q->q_usage_counter) /
> >>>> percpu_ref_put(&q->q_usage_counter) pair?
> >>>
> >>> It can't work because blk_mq_run_hw_queues may happen after
> >>> percpu_ref_exit() is done.
> >>>
> >>> However, if we move percpu_ref_exit() into queue's release handler, we
> >>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
> >>> and we still have to free hw queue resources in queue's release handler,
> >>> that is exactly what this patchset is doing.
> >>>
> >>> In short, getting q->q_usage_counter doesn't make a difference on this
> >>> issue.
> >>
> >> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> >> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> >> already calls blk_set_queue_dying() and that last function calls
> >> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> >> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> >> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> >> sense.
> > 
> > If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> > move freeing of hw queue's resource into blk_release_queue() like what
> > the patchset is doing.
> 
> Hi Ming,
> 
> Would you mind help explain why we still need to move freeing of hw queue's
> resource into blk_release_queue() like what the patchset is doing?
> 
> Let's assume there is no deadlock when percpu_ref_tryget_live() is used,

Could you explain why that assumption is true?

We have to run the queue after starting to freeze it, for draining the
allocated requests and making forward progress. Inside blk_freeze_queue_start(),
percpu_ref_kill() marks this ref as DEAD, so percpu_ref_tryget_live() returns
false, and then the queue won't be run.


Thanks, 
Ming
Dongli Zhang April 1, 2019, 5:30 a.m. UTC | #12
On 4/1/19 1:16 PM, Ming Lei wrote:
> Hi Dongli,
> 
> On Mon, Apr 01, 2019 at 01:05:46PM +0800, Dongli Zhang wrote:
>>
>>
>> On 4/1/19 10:52 AM, Ming Lei wrote:
>>> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
>>>> On 3/31/19 7:00 PM, Ming Lei wrote:
>>>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
>>>>>> I'm not sure the approach of this patch series is really the direction we
>>>>>> should pursue. There are many block driver that free resources immediately
>>>>>
>>>>> Please see scsi_run_queue(), and the queue refcount is always held
>>>>> before run queue.
>>>>
>>>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
>>>> scsi_run_queue() is called from inside scsi_requeue_run_queue().
>>>
>>> We don't need the guarantee of 'q->q_usage_counter > 0', I mean the
>>> queue's kobj reference counter.
>>>
>>> What we need is to allow run queue to work correctly after queue is frozen
>>> or cleaned up.
>>>
>>>>
>>>>>> I'd like to avoid having to modify all block drivers that free resources
>>>>>> immediately after blk_cleanup_queue() has returned. Have you considered to
>>>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
>>>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
>>>>>> percpu_ref_tryget_live(&q->q_usage_counter) /
>>>>>> percpu_ref_put(&q->q_usage_counter) pair?
>>>>>
>>>>> It can't work because blk_mq_run_hw_queues may happen after
>>>>> percpu_ref_exit() is done.
>>>>>
>>>>> However, if we move percpu_ref_exit() into queue's release handler, we
>>>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
>>>>> and we still have to free hw queue resources in queue's release handler,
>>>>> that is exactly what this patchset is doing.
>>>>>
>>>>> In short, getting q->q_usage_counter doesn't make a difference on this
>>>>> issue.
>>>>
>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>>>> already calls blk_set_queue_dying() and that last function calls
>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>>>> sense.
>>>
>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
>>> move freeing of hw queue's resource into blk_release_queue() like what
>>> the patchset is doing.
>>
>> Hi Ming,
>>
>> Would you mind help explain why we still need to move freeing of hw queue's
>> resource into blk_release_queue() like what the patchset is doing?
>>
>> Let's assume there is no deadlock when percpu_ref_tryget_live() is used,
> 
> Could you explain why the assumption is true?
> 
> We have to run queue after starting to freeze queue for draining
> allocated requests and making forward progress. Inside blk_freeze_queue_start(),
> percpu_ref_kill() marks this ref as DEAD, then percpu_ref_tryget_live() returns
> false, then queue won't be run.

Hi Ming,

I understand that the assumption is invalid and that there is an issue when
using percpu_ref_tryget_live(). And I also understand that we have to run the
queue after starting to freeze it, for draining the allocated requests and
making forward progress.


I am just wondering specifically why "If percpu_ref_exit() is moved to
blk_release_queue(), we still need to move freeing of hw queue's resource into
blk_release_queue() like what the patchset is doing." based on Bart's
statement below:

"percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
percpu_ref_kill() changes the state of a per-cpu counter to the "dead" state.
blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() already
calls blk_set_queue_dying() and that last function call
blk_freeze_queue_start(). So I think that what you wrote is not correct and that
inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and also
that moving the percpu_ref_exit() call into blk_release_queue() makes sense."

That is, what is the penalty if we do not move the freeing of hw queue
resources into blk_release_queue(), as the patchset is doing, in the above
situation?

I ask this question just because I would like to better understand the source
code. Does "hw queue's resource" refer to the code below?

+        if (hctx->flags & BLK_MQ_F_BLOCKING)
+                cleanup_srcu_struct(hctx->srcu);
+        blk_free_flush_queue(hctx->fq);
+        sbitmap_free(&hctx->ctx_map);

Thank you very much!

Dongli Zhang
Ming Lei April 1, 2019, 7:15 a.m. UTC | #13
On Mon, Apr 1, 2019 at 1:27 PM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>
>
>
> On 4/1/19 1:16 PM, Ming Lei wrote:
> > Hi Dongli,
> >
> > On Mon, Apr 01, 2019 at 01:05:46PM +0800, Dongli Zhang wrote:
> >>
> >>
> >> On 4/1/19 10:52 AM, Ming Lei wrote:
> >>> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
> >>>> On 3/31/19 7:00 PM, Ming Lei wrote:
> >>>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
> >>>>>> I'm not sure the approach of this patch series is really the direction we
> >>>>>> should pursue. There are many block driver that free resources immediately
> >>>>>
> >>>>> Please see scsi_run_queue(), and the queue refcount is always held
> >>>>> before run queue.
> >>>>
> >>>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
> >>>> scsi_run_queue() is called from inside scsi_requeue_run_queue().
> >>>
> >>> We don't need the guarantee of 'q->q_usage_counter > 0', I mean the
> >>> queue's kobj reference counter.
> >>>
> >>> What we need is to allow run queue to work correctly after queue is frozen
> >>> or cleaned up.
> >>>
> >>>>
> >>>>>> I'd like to avoid having to modify all block drivers that free resources
> >>>>>> immediately after blk_cleanup_queue() has returned. Have you considered to
> >>>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
> >>>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
> >>>>>> percpu_ref_tryget_live(&q->q_usage_counter) /
> >>>>>> percpu_ref_put(&q->q_usage_counter) pair?
> >>>>>
> >>>>> It can't work because blk_mq_run_hw_queues may happen after
> >>>>> percpu_ref_exit() is done.
> >>>>>
> >>>>> However, if we move percpu_ref_exit() into queue's release handler, we
> >>>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
> >>>>> and we still have to free hw queue resources in queue's release handler,
> >>>>> that is exactly what this patchset is doing.
> >>>>>
> >>>>> In short, getting q->q_usage_counter doesn't make a difference on this
> >>>>> issue.
> >>>>
> >>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> >>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> >>>> already calls blk_set_queue_dying() and that last function calls
> >>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> >>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> >>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> >>>> sense.
> >>>
> >>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> >>> move freeing of hw queue's resource into blk_release_queue() like what
> >>> the patchset is doing.
> >>
> >> Hi Ming,
> >>
> >> Would you mind help explain why we still need to move freeing of hw queue's
> >> resource into blk_release_queue() like what the patchset is doing?
> >>
> >> Let's assume there is no deadlock when percpu_ref_tryget_live() is used,
> >
> > Could you explain why the assumption is true?
> >
> > We have to run queue after starting to freeze queue for draining
> > allocated requests and making forward progress. Inside blk_freeze_queue_start(),
> > percpu_ref_kill() marks this ref as DEAD, then percpu_ref_tryget_live() returns
> > false, then queue won't be run.
>
> Hi Ming,
>
> I understand the assumption is invalid and there is issue when using
> percpu_ref_tryget_live. And I also understand we have to run queue after
> starting to freeze queue for draining allocated requests and making forward
> progress.

OK.

>
>
> I am just wondering specifically on why "If percpu_ref_exit() is moved to
> blk_release_queue(), we still need to move freeing of hw queue's resource into
> blk_release_queue() like what the patchset is doing." based on below Bart's
> statement:
>
> "percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> percpu_ref_kill() changes the state of a per-cpu counter to the "dead" state.
> blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() already
> calls blk_set_queue_dying() and that last function call
> blk_freeze_queue_start(). So I think that what you wrote is not correct and that
> inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and also
> that moving the percpu_ref_exit() call into blk_release_queue() makes sense."

As you mentioned, percpu_ref_tryget_live() can't be used to prevent run queue
during cleanup, so a run queue can still come in while the queue is being
cleaned up.

>
> That's is, what is penalty if we do not  move freeing of hw queue's resource
> into blk_release_queue() like what the patchset is doing in above situation?

A kernel oops, as reported by James: some fields of the hctx are used by
run queue, and they can be freed by blk_mq_free_queue() in
blk_cleanup_queue().
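
(A timeline of that oops; the function names are real, the interleaving is
illustrative:)

/*
 * CPU0 (teardown)                    CPU1 (late run queue)
 *
 * blk_cleanup_queue(q)
 *   blk_freeze_queue(q)
 *   blk_mq_free_queue(q)
 *     frees hctx->fq, hctx->ctx_map,
 *     hctx->srcu, ...
 *                                    blk_mq_run_hw_queues(q, async)
 *                                      dereferences the freed hctx
 *                                      fields -> use-after-free
 *
 * CPU1 still holds a queue kobject reference, which is why moving
 * the freeing into the kobject release handler closes the window.
 */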

>
> I ask this question just because I would like to better understand the source
> code. Does "hw queue's resource" indicate the below?
>
> +        if (hctx->flags & BLK_MQ_F_BLOCKING)
> +                cleanup_srcu_struct(hctx->srcu);
> +        blk_free_flush_queue(hctx->fq);
> +        sbitmap_free(&hctx->ctx_map);

Right.

Thanks,
Ming Lei
jianchao.wang April 1, 2019, 9:19 a.m. UTC | #14
Hi Ming

On 4/1/19 11:28 AM, Ming Lei wrote:
> On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 4/1/19 10:52 AM, Ming Lei wrote:
>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>>>> already calls blk_set_queue_dying() and that last function calls
>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>>>> sense.
>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
>>> move freeing of hw queue's resource into blk_release_queue() like what
>>> the patchset is doing.
>>>
>>> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
>>> do we?
>>
>> IMO, if we could get a way to prevent any attempt to run queue, it would be
>> better and clearer.
> 
> It is hard to do that way, and not necessary.
> 
> I will post V2 soon for review.
> 

Putting a percpu_ref_tryget()/percpu_ref_put() pair into blk_mq_run_hw_queues()
could stop run queue after the request_queue is frozen and drained (running the
queue is also unnecessary at that point, because no requests have entered). And
percpu_ref_tryget() could also avoid the IO hang issue you mentioned. We have a
similar pattern in blk_mq_timeout_work.
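
(The precedent referred to looks roughly like this, abridged from the
blk_mq_timeout_work() of that era; treat details as an approximation:)

static void blk_mq_timeout_work(struct work_struct *work)
{
	struct request_queue *q =
		container_of(work, struct request_queue, timeout_work);

	/*
	 * percpu_ref_tryget(), unlike percpu_ref_tryget_live(), still
	 * succeeds after percpu_ref_kill(), i.e. while the queue is
	 * frozen but not yet released, so timeout handling keeps
	 * making progress during a freeze.
	 */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;

	/* ... scan and expire in-flight requests ... */

	blk_queue_exit(q);
}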

Freezing and draining the queue stops new attempts to run it, blk_sync_queue()
syncs and stops the already-started ones, and then hctx->run_work is quiesced
completely.

IMO, it would be better to have a checkpoint after which there are no
in-flight asynchronous activities on the request_queue (hctx->run_work,
q->requeue_work, q->timeout_work) and any attempt to start them will fail.

Perhaps this will be a good chance to do this ;)

Thanks
Jianchao
Ming Lei April 1, 2019, 10:03 a.m. UTC | #15
On Mon, Apr 01, 2019 at 05:19:01PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 4/1/19 11:28 AM, Ming Lei wrote:
> > On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> On 4/1/19 10:52 AM, Ming Lei wrote:
> >>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> >>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> >>>> already calls blk_set_queue_dying() and that last function calls
> >>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> >>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> >>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> >>>> sense.
> >>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> >>> move freeing of hw queue's resource into blk_release_queue() like what
> >>> the patchset is doing.
> >>>
> >>> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
> >>> do we?
> >>
> >> IMO, if we could get a way to prevent any attempt to run queue, it would be
> >> better and clearer.
> > 
> > It is hard to do that way, and not necessary.
> > 
> > I will post V2 soon for review.
> > 
> 
> Put percpu_ref_tryget/put pair into blk_mq_run_hw_queues could stop run queue after
> requet_queue is frozen and drained (run queue is also unnecessary because there is no
> entered requests). And also percpu_ref_tryget could avoid the io hung issue you mentioned.
> We have similar one in blk_mq_timeout_work.

If percpu_ref_tryget() is used, percpu_ref_exit() has to be moved into the
queue's release handler.

Then we still have to move the freeing of hctx resources into the hctx's or
queue's release handler, which is exactly what this patchset is doing. Then
percpu_ref_tryget() becomes unnecessary again, right?

> 
> freeze and drain queue to stop new attempt to run queue, blk_sync_queue syncs and stops
> the started ones, then hctx->run_queue is cleaned totally.
> 
> IMO, it would be better to have a checkpoint after which there will be no any in-flight
> asynchronous activities of the request_queue (hctx->run_work, q->requeue_work, q-> timeout_work)
> and any attempt to start them will fail.

All of them are canceled in blk_cleanup_queue(), but that is not enough,
given that the queue can be run in sync mode (such as via plug, direct
issue, ...), or via a driver's requeue, such as SCSI's. SCSI's requeue may
run another LUN's queue while holding only the queue's kobject refcount.

> 
> Perhaps, this will be a good change to do this ;)

However, I don't see that it is necessary if we simply move the freeing of
hctx resources into their release handler, just like V2 does.


Thanks,
Ming
jianchao.wang April 2, 2019, 2:02 a.m. UTC | #16
Hi Ming

On 4/1/19 6:03 PM, Ming Lei wrote:
> On Mon, Apr 01, 2019 at 05:19:01PM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 4/1/19 11:28 AM, Ming Lei wrote:
>>> On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
>>>> Hi Ming
>>>>
>>>> On 4/1/19 10:52 AM, Ming Lei wrote:
>>>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>>>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>>>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>>>>>> already calls blk_set_queue_dying() and that last function calls
>>>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>>>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>>>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>>>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>>>>>> sense.
>>>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
>>>>> move freeing of hw queue's resource into blk_release_queue() like what
>>>>> the patchset is doing.
>>>>>
>>>>> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
>>>>> do we?
>>>>
>>>> IMO, if we could get a way to prevent any attempt to run queue, it would be
>>>> better and clearer.
>>>
>>> It is hard to do that way, and not necessary.
>>>
>>> I will post V2 soon for review.
>>>
>>
>> Put percpu_ref_tryget/put pair into blk_mq_run_hw_queues could stop run queue after
>> requet_queue is frozen and drained (run queue is also unnecessary because there is no
>> entered requests). And also percpu_ref_tryget could avoid the io hung issue you mentioned.
>> We have similar one in blk_mq_timeout_work.
> 
> If percpu_ref_tryget() is used, percpu_ref_exit() has to be moved into
> queue's release handler.
> 
> Then we still have to move freeing hctx's resource into hctx or queue's
> release handler, that is exactly what this patch is doing. Then
> percpu_ref_tryget() becomes unnecessary again, right?

I'm not sure about percpu_ref_exit(); perhaps I have some misunderstanding
about it.

From its code, it frees the percpu count and sets ref->percpu_count_ptr to
__PERCPU_REF_ATOMIC_DEAD. The comment says 'the caller is responsible for
ensuring that @ref is no longer in active use', but if we use the ref after
kill, does that count as active use? Based on the code, __ref_is_percpu() is
always false during this window, so percpu_ref_tryget() will not touch the
freed percpu counter but only the atomic ref->count.

It looks safe.
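
(For reference, percpu_ref_exit() of that era, abridged from
lib/percpu-refcount.c:)

void percpu_ref_exit(struct percpu_ref *ref)
{
	unsigned long __percpu *percpu_count = percpu_count_ptr(ref);

	if (percpu_count) {
		free_percpu(percpu_count);
		/*
		 * Leaves only the DEAD|ATOMIC marker behind, so a
		 * later percpu_ref_tryget() falls back to the atomic
		 * ref->count instead of the freed per-cpu counters.
		 */
		ref->percpu_count_ptr = __PERCPU_REF_ATOMIC_DEAD;
	}
}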


> 
>>
>> freeze and drain queue to stop new attempt to run queue, blk_sync_queue syncs and stops
>> the started ones, then hctx->run_queue is cleaned totally.
>>
>> IMO, it would be better to have a checkpoint after which there will be no any in-flight
>> asynchronous activities of the request_queue (hctx->run_work, q->requeue_work, q-> timeout_work)
>> and any attempt to start them will fail.
> 
> All are canceled in blk_cleanup_queue(), but not enough, given queue can
> be run in sync mode(such as via plug, direct issue, ...), or driver's
> requeue, such as SCSI's requeue. SCSI's requeue may run other LUN's queue
> just by holding queue's kobject refcount.

Yes, so we need a checkpoint here to ensure the request_queue enters a
certain state: we provide a guarantee that all of these activities are
stopped after the checkpoint. That makes it convenient to do the things
that follow, for example releasing the request_queue's resources.

Thanks
Jianchao

> 
>>
>> Perhaps, this will be a good change to do this ;)
> 
> However, I don't see it is necessary if we simply move freeing hctx's
> resource into its release handler, just like V2.
> 
> 
> Thanks,
> Ming
>
Dongli Zhang April 2, 2019, 2:10 a.m. UTC | #17
On 4/1/19 3:15 PM, Ming Lei wrote:
> On Mon, Apr 1, 2019 at 1:27 PM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>>
>>
>>
>> On 4/1/19 1:16 PM, Ming Lei wrote:
>>> Hi Dongli,
>>>
>>> On Mon, Apr 01, 2019 at 01:05:46PM +0800, Dongli Zhang wrote:
>>>>
>>>>
>>>> On 4/1/19 10:52 AM, Ming Lei wrote:
>>>>> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
>>>>>> On 3/31/19 7:00 PM, Ming Lei wrote:
>>>>>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
>>>>>>>> I'm not sure the approach of this patch series is really the direction we
>>>>>>>> should pursue. There are many block driver that free resources immediately
>>>>>>>
>>>>>>> Please see scsi_run_queue(), and the queue refcount is always held
>>>>>>> before run queue.
>>>>>>
>>>>>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
>>>>>> scsi_run_queue() is called from inside scsi_requeue_run_queue().
>>>>>
>>>>> We don't need the guarantee of 'q->q_usage_counter > 0', I mean the
>>>>> queue's kobj reference counter.
>>>>>
>>>>> What we need is to allow run queue to work correctly after queue is frozen
>>>>> or cleaned up.
>>>>>
>>>>>>
>>>>>>>> I'd like to avoid having to modify all block drivers that free resources
>>>>>>>> immediately after blk_cleanup_queue() has returned. Have you considered to
>>>>>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
>>>>>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
>>>>>>>> percpu_ref_tryget_live(&q->q_usage_counter) /
>>>>>>>> percpu_ref_put(&q->q_usage_counter) pair?
>>>>>>>
>>>>>>> It can't work because blk_mq_run_hw_queues may happen after
>>>>>>> percpu_ref_exit() is done.
>>>>>>>
>>>>>>> However, if we move percpu_ref_exit() into queue's release handler, we
>>>>>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
>>>>>>> and we still have to free hw queue resources in queue's release handler,
>>>>>>> that is exactly what this patchset is doing.
>>>>>>>
>>>>>>> In short, getting q->q_usage_counter doesn't make a difference on this
>>>>>>> issue.
>>>>>>
>>>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>>>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>>>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>>>>>> already calls blk_set_queue_dying() and that last function calls
>>>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>>>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>>>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>>>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>>>>>> sense.
>>>>>
>>>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
>>>>> move freeing of hw queue's resource into blk_release_queue() like what
>>>>> the patchset is doing.
>>>>
>>>> Hi Ming,
>>>>
>>>> Would you mind help explain why we still need to move freeing of hw queue's
>>>> resource into blk_release_queue() like what the patchset is doing?
>>>>
>>>> Let's assume there is no deadlock when percpu_ref_tryget_live() is used,
>>>
>>> Could you explain why the assumption is true?
>>>
>>> We have to run queue after starting to freeze queue for draining
>>> allocated requests and making forward progress. Inside blk_freeze_queue_start(),
>>> percpu_ref_kill() marks this ref as DEAD, then percpu_ref_tryget_live() returns
>>> false, then queue won't be run.
>>
>> Hi Ming,
>>
>> I understand the assumption is invalid and there is issue when using
>> percpu_ref_tryget_live. And I also understand we have to run queue after
>> starting to freeze queue for draining allocated requests and making forward
>> progress.
> 
> OK.
> 
>>
>>
>> I am just wondering specifically on why "If percpu_ref_exit() is moved to
>> blk_release_queue(), we still need to move freeing of hw queue's resource into
>> blk_release_queue() like what the patchset is doing." based on below Bart's
>> statement:
>>
>> "percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead" state.
>> blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() already
>> calls blk_set_queue_dying() and that last function call
>> blk_freeze_queue_start(). So I think that what you wrote is not correct and that
>> inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and also
>> that moving the percpu_ref_exit() call into blk_release_queue() makes sense."
> 
> As you mentioned, percpu_ref_tryget_live() can't be used to avoid run queue
> during cleanup, then run queue can come when queue is cleaned up.
> 
>>
>> That's is, what is penalty if we do not  move freeing of hw queue's resource
>> into blk_release_queue() like what the patchset is doing in above situation?
> 
> kernel oops as reported by James, because some fields of hctx will be used
> by run queue, and they can be freed by blk_mq_free_queue() in
> blk_cleanup_queue().
> 
>>
>> I ask this question just because I would like to better understand the source
>> code. Does "hw queue's resource" indicate the below?
>>
>> +        if (hctx->flags & BLK_MQ_F_BLOCKING)
>> +                cleanup_srcu_struct(hctx->srcu);
>> +        blk_free_flush_queue(hctx->fq);
>> +        sbitmap_free(&hctx->ctx_map);
> 
> Right.

Hi Ming,

Thank you very much for the detailed explanation.

I think maybe I misunderstood your message in the email.

In another direction, posted by Bart below, regardless of which direction
is better, that implementation does not move the freeing of hw queue
resources into blk_release_queue(), although percpu_ref_exit() is moved
there. That's why I would like to confirm.

https://lore.kernel.org/linux-block/20190401212014.192753-1-bvanassche@acm.org/

In that direction, the friendlier percpu_ref_tryget(), as suggested by
Jianchao, is used. I would just like to confirm that there is no need to
move the freeing of hw queue resources into blk_release_queue() when the
get/put method is friendly and fair enough.

I think I should not have focused on only part of your message.

Sorry for spamming you.

Dongli Zhang
Ming Lei April 2, 2019, 2:20 a.m. UTC | #18
On Tue, Apr 2, 2019 at 10:06 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>
>
>
> On 4/1/19 3:15 PM, Ming Lei wrote:
> > On Mon, Apr 1, 2019 at 1:27 PM Dongli Zhang <dongli.zhang@oracle.com> wrote:
> >>
> >>
> >>
> >> On 4/1/19 1:16 PM, Ming Lei wrote:
> >>> Hi Dongli,
> >>>
> >>> On Mon, Apr 01, 2019 at 01:05:46PM +0800, Dongli Zhang wrote:
> >>>>
> >>>>
> >>>> On 4/1/19 10:52 AM, Ming Lei wrote:
> >>>>> On Sun, Mar 31, 2019 at 07:39:17PM -0700, Bart Van Assche wrote:
> >>>>>> On 3/31/19 7:00 PM, Ming Lei wrote:
> >>>>>>> On Sun, Mar 31, 2019 at 08:27:35AM -0700, Bart Van Assche wrote:
> >>>>>>>> I'm not sure the approach of this patch series is really the direction we
> >>>>>>>> should pursue. There are many block driver that free resources immediately
> >>>>>>>
> >>>>>>> Please see scsi_run_queue(), and the queue refcount is always held
> >>>>>>> before run queue.
> >>>>>>
> >>>>>> That's not correct. There is no guarantee that q->q_usage_counter > 0 when
> >>>>>> scsi_run_queue() is called from inside scsi_requeue_run_queue().
> >>>>>
> >>>>> We don't need the guarantee of 'q->q_usage_counter > 0', I mean the
> >>>>> queue's kobj reference counter.
> >>>>>
> >>>>> What we need is to allow run queue to work correctly after queue is frozen
> >>>>> or cleaned up.
> >>>>>
> >>>>>>
> >>>>>>>> I'd like to avoid having to modify all block drivers that free resources
> >>>>>>>> immediately after blk_cleanup_queue() has returned. Have you considered to
> >>>>>>>> modify blk_mq_run_hw_queues() such that it becomes safe to call that
> >>>>>>>> function while blk_cleanup_queue() is in progress, e.g. by inserting a
> >>>>>>>> percpu_ref_tryget_live(&q->q_usage_counter) /
> >>>>>>>> percpu_ref_put(&q->q_usage_counter) pair?
> >>>>>>>
> >>>>>>> It can't work because blk_mq_run_hw_queues may happen after
> >>>>>>> percpu_ref_exit() is done.
> >>>>>>>
> >>>>>>> However, if we move percpu_ref_exit() into queue's release handler, we
> >>>>>>> don't need to grab q->q_usage_counter any more in blk_mq_run_hw_queues(),
> >>>>>>> and we still have to free hw queue resources in queue's release handler,
> >>>>>>> that is exactly what this patchset is doing.
> >>>>>>>
> >>>>>>> In short, getting q->q_usage_counter doesn't make a difference on this
> >>>>>>> issue.
> >>>>>>
> >>>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >>>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> >>>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> >>>>>> already calls blk_set_queue_dying() and that last function calls
> >>>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> >>>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >>>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> >>>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> >>>>>> sense.
> >>>>>
> >>>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> >>>>> move freeing of hw queue's resource into blk_release_queue() like what
> >>>>> the patchset is doing.
> >>>>
> >>>> Hi Ming,
> >>>>
> >>>> Would you mind help explain why we still need to move freeing of hw queue's
> >>>> resource into blk_release_queue() like what the patchset is doing?
> >>>>
> >>>> Let's assume there is no deadlock when percpu_ref_tryget_live() is used,
> >>>
> >>> Could you explain why the assumption is true?
> >>>
> >>> We have to run queue after starting to freeze queue for draining
> >>> allocated requests and making forward progress. Inside blk_freeze_queue_start(),
> >>> percpu_ref_kill() marks this ref as DEAD, then percpu_ref_tryget_live() returns
> >>> false, then queue won't be run.
> >>
> >> Hi Ming,
> >>
> >> I understand the assumption is invalid and there is issue when using
> >> percpu_ref_tryget_live. And I also understand we have to run queue after
> >> starting to freeze queue for draining allocated requests and making forward
> >> progress.
> >
> > OK.
> >
> >>
> >>
> >> I am just wondering specifically on why "If percpu_ref_exit() is moved to
> >> blk_release_queue(), we still need to move freeing of hw queue's resource into
> >> blk_release_queue() like what the patchset is doing." based on below Bart's
> >> statement:
> >>
> >> "percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >> percpu_ref_kill() changes the state of a per-cpu counter to the "dead" state.
> >> blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue() already
> >> calls blk_set_queue_dying() and that last function call
> >> blk_freeze_queue_start(). So I think that what you wrote is not correct and that
> >> inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and also
> >> that moving the percpu_ref_exit() call into blk_release_queue() makes sense."
> >
> > As you mentioned, percpu_ref_tryget_live() can't be used to avoid run queue
> > during cleanup, then run queue can come when queue is cleaned up.
> >
> >>
> >> That's is, what is penalty if we do not  move freeing of hw queue's resource
> >> into blk_release_queue() like what the patchset is doing in above situation?
> >
> > kernel oops as reported by James, because some fields of hctx will be used
> > by run queue, and they can be freed by blk_mq_free_queue() in
> > blk_cleanup_queue().
> >
> >>
> >> I ask this question just because I would like to better understand the source
> >> code. Does "hw queue's resource" indicate the below?
> >>
> >> +        if (hctx->flags & BLK_MQ_F_BLOCKING)
> >> +                cleanup_srcu_struct(hctx->srcu);
> >> +        blk_free_flush_queue(hctx->fq);
> >> +        sbitmap_free(&hctx->ctx_map);
> >
> > Right.
>
> Hi Ming,
>
> Thank you very much for the detailed explanation.
>
> I think maybe I misunderstood your message in the email.
>
> In another direction posted by Bart as below, regardless about which direction
> is better, that implementation does not move freeing of hw queue's resource into
> blk_release_queue(), although percpu_ref_exit() is moved to blk_release_queue().
> That's why I would like to confirm.
>
> https://lore.kernel.org/linux-block/20190401212014.192753-1-bvanassche@acm.org/
>
> In that direction, the more friendly percpu_ref_tryget(), which is suggested by
> Jianchao, is used. I would like just to confirm that there is no need to move
> freeing of hw queue's resource into blk_release_queue() when the get/put method
> is friendly and fair enough.

I don't think it is a good idea to add such unnecessary overhead in the
very fast path.



Thanks,
Ming Lei
Ming Lei April 2, 2019, 2:55 a.m. UTC | #19
On Tue, Apr 02, 2019 at 10:02:43AM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 4/1/19 6:03 PM, Ming Lei wrote:
> > On Mon, Apr 01, 2019 at 05:19:01PM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> On 4/1/19 11:28 AM, Ming Lei wrote:
> >>> On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
> >>>> Hi Ming
> >>>>
> >>>> On 4/1/19 10:52 AM, Ming Lei wrote:
> >>>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >>>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> >>>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> >>>>>> already calls blk_set_queue_dying() and that last function calls
> >>>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> >>>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >>>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> >>>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> >>>>>> sense.
> >>>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> >>>>> move freeing of hw queue's resource into blk_release_queue() like what
> >>>>> the patchset is doing.
> >>>>>
> >>>>> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
> >>>>> do we?
> >>>>
> >>>> IMO, if we could get a way to prevent any attempt to run queue, it would be
> >>>> better and clearer.
> >>>
> >>> It is hard to do that way, and not necessary.
> >>>
> >>> I will post V2 soon for review.
> >>>
> >>
> >> Put percpu_ref_tryget/put pair into blk_mq_run_hw_queues could stop run queue after
> >> requet_queue is frozen and drained (run queue is also unnecessary because there is no
> >> entered requests). And also percpu_ref_tryget could avoid the io hung issue you mentioned.
> >> We have similar one in blk_mq_timeout_work.
> > 
> > If percpu_ref_tryget() is used, percpu_ref_exit() has to be moved into
> > queue's release handler.
> > 
> > Then we still have to move freeing hctx's resource into hctx or queue's
> > release handler, that is exactly what this patch is doing. Then
> > percpu_ref_tryget() becomes unnecessary again, right?
> 
> I'm not sure about the percpu_ref_exit. Perhaps I have some misunderstanding about it.
> 
> From the code of it, it frees the percpu_count and set ref->percpu_count_ptr to __PERCPU_REF_ATOMIC_DEAD.
> The comment says 'the caller is responsible for ensuring that @ref is no longer in active use'
> But if we use it after kill, does it count a active use ?
> Based on the code, the __ref_is_percpu is always false during this, and percpu_ref_tryget will not
> touch the freed percpu counter but just the atomic ref->count.
> 
> It looks safe.

OK, you are right.

However, I still think it isn't necessary to hold the percpu_ref in the
very fast I/O path.
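
For reference, a minimal sketch of the variant under discussion: guarding
blk_mq_run_hw_queues() with q_usage_counter. This is illustrative only,
layered on the then-current function body:

void blk_mq_run_hw_queues(struct request_queue *q, bool async)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	/*
	 * percpu_ref_tryget() (unlike the _live variant) still succeeds
	 * between percpu_ref_kill() and the refcount dropping to zero,
	 * so requests allocated before the freeze started can still make
	 * forward progress.
	 */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;

	queue_for_each_hw_ctx(q, hctx, i) {
		if (blk_mq_hctx_stopped(hctx))
			continue;
		blk_mq_run_hw_queue(hctx, async);
	}

	percpu_ref_put(&q->q_usage_counter);
}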

> 
> 
> > 
> >>
> >> freeze and drain queue to stop new attempt to run queue, blk_sync_queue syncs and stops
> >> the started ones, then hctx->run_queue is cleaned totally.
> >>
> >> IMO, it would be better to have a checkpoint after which there will be no any in-flight
> >> asynchronous activities of the request_queue (hctx->run_work, q->requeue_work, q-> timeout_work)
> >> and any attempt to start them will fail.
> > 
> > All are canceled in blk_cleanup_queue(), but not enough, given queue can
> > be run in sync mode(such as via plug, direct issue, ...), or driver's
> > requeue, such as SCSI's requeue. SCSI's requeue may run other LUN's queue
> > just by holding queue's kobject refcount.
> 
> Yes, so we need a checkpoint here to ensure the request_queue to enter into a certain state.
> We provide a guarantee that all of the activities are stopped after this checkpoint.
> It will be convenient for us to do other things following, for example release request_queue's
> resource.

We already have such a checkpoint:

	blk_freeze_queue() together with blk_sync_queue()

Once both are done, there shouldn't be any driver activity at all.
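
For context, a condensed sketch (not verbatim kernel code) of where that
checkpoint sits in the teardown path:

	/* condensed sketch of blk_cleanup_queue() ordering */
	blk_set_queue_dying(q);	/* calls blk_freeze_queue_start() */
	blk_freeze_queue(q);	/* wait for q_usage_counter to drain */
	blk_sync_queue(q);	/* flush q->timeout_work and the
				 * hctx->run_work instances */
	/* past this point no driver-initiated activity should remain */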

The current issue is related to the blk-mq internal implementation: it
should have been safe to complete run-queue activity during queue cleanup
as long as the request queue's kobject refcount hadn't been released.

However, 45a9c9d909b2 ("blk-mq: Fix a use-after-free") frees hctx
resources too early, which causes the kernel oops.

Also, isn't it typical practice to release kobject-related resources in
the kobject's release handler?


Thanks,
Ming
jianchao.wang April 2, 2019, 8:07 a.m. UTC | #20
Hi Ming

On 4/2/19 10:55 AM, Ming Lei wrote:
> On Tue, Apr 02, 2019 at 10:02:43AM +0800, jianchao.wang wrote:
>> Hi Ming
>>
>> On 4/1/19 6:03 PM, Ming Lei wrote:
>>> On Mon, Apr 01, 2019 at 05:19:01PM +0800, jianchao.wang wrote:
>>>> Hi Ming
>>>>
>>>> On 4/1/19 11:28 AM, Ming Lei wrote:
>>>>> On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
>>>>>> Hi Ming
>>>>>>
>>>>>> On 4/1/19 10:52 AM, Ming Lei wrote:
>>>>>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
>>>>>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
>>>>>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
>>>>>>>> already calls blk_set_queue_dying() and that last function calls
>>>>>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
>>>>>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
>>>>>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
>>>>>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
>>>>>>>> sense.
>>>>>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
>>>>>>> move freeing of hw queue's resource into blk_release_queue() like what
>>>>>>> the patchset is doing.
>>>>>>>
>>>>>>> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
>>>>>>> do we?
>>>>>>
>>>>>> IMO, if we could get a way to prevent any attempt to run queue, it would be
>>>>>> better and clearer.
>>>>>
>>>>> It is hard to do that way, and not necessary.
>>>>>
>>>>> I will post V2 soon for review.
>>>>>
>>>>
>>>> Put percpu_ref_tryget/put pair into blk_mq_run_hw_queues could stop run queue after
>>>> requet_queue is frozen and drained (run queue is also unnecessary because there is no
>>>> entered requests). And also percpu_ref_tryget could avoid the io hung issue you mentioned.
>>>> We have similar one in blk_mq_timeout_work.
>>>
>>> If percpu_ref_tryget() is used, percpu_ref_exit() has to be moved into
>>> queue's release handler.
>>>
>>> Then we still have to move freeing hctx's resource into hctx or queue's
>>> release handler, that is exactly what this patch is doing. Then
>>> percpu_ref_tryget() becomes unnecessary again, right?
>>
>> I'm not sure about the percpu_ref_exit. Perhaps I have some misunderstanding about it.
>>
>> From the code of it, it frees the percpu_count and set ref->percpu_count_ptr to __PERCPU_REF_ATOMIC_DEAD.
>> The comment says 'the caller is responsible for ensuring that @ref is no longer in active use'
>> But if we use it after kill, does it count a active use ?
>> Based on the code, the __ref_is_percpu is always false during this, and percpu_ref_tryget will not
>> touch the freed percpu counter but just the atomic ref->count.
>>
>> It looks safe.
> 
> OK, you are right.
> 
> However, I still think it isn't necessary to hold the perpcu_ref in the
> very fast io path.

percpu_ref was born for the fast path.
Some drivers use it even in the completion path, such as SCSI; does it really
matter for that kind of device? If it does, I guess we should remove
blk_mq_run_hw_queues(), which is the really heavyweight part, and depend on
the hctx restart mechanism instead.

> 
>>
>>
>>>
>>>>
>>>> freeze and drain queue to stop new attempt to run queue, blk_sync_queue syncs and stops
>>>> the started ones, then hctx->run_queue is cleaned totally.
>>>>
>>>> IMO, it would be better to have a checkpoint after which there will be no any in-flight
>>>> asynchronous activities of the request_queue (hctx->run_work, q->requeue_work, q-> timeout_work)
>>>> and any attempt to start them will fail.
>>>
>>> All are canceled in blk_cleanup_queue(), but not enough, given queue can
>>> be run in sync mode(such as via plug, direct issue, ...), or driver's
>>> requeue, such as SCSI's requeue. SCSI's requeue may run other LUN's queue
>>> just by holding queue's kobject refcount.
>>
>> Yes, so we need a checkpoint here to ensure the request_queue to enter into a certain state.
>> We provide a guarantee that all of the activities are stopped after this checkpoint.
>> It will be convenient for us to do other things following, for example release request_queue's
>> resource.
> 
> We have such checkpoint already:
> 
> 	blk_freeze_queue() together with blk_sync_queue()
> 
> Once the two are done, there shouldn't be any driver activities at all.
> 
> The current issue is related with blk-mq internal implementation, in which
> it should have been safe to complete the run queue activity during queue
> cleanup if the request queue's kobject refcount isn't released.
> 
> However, 45a9c9d909b2 ("blk-mq: Fix a use-after-free") frees hctx
> resource too early, and causes the kernel oops.
> 
> Also, isn't it the typical practice to release kobject related resources in
> its release handler?

I agree with this.

Thanks
Jianchao
Ming Lei April 2, 2019, 11:05 a.m. UTC | #21
On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> Hi Ming
> 
> On 4/2/19 10:55 AM, Ming Lei wrote:
> > On Tue, Apr 02, 2019 at 10:02:43AM +0800, jianchao.wang wrote:
> >> Hi Ming
> >>
> >> On 4/1/19 6:03 PM, Ming Lei wrote:
> >>> On Mon, Apr 01, 2019 at 05:19:01PM +0800, jianchao.wang wrote:
> >>>> Hi Ming
> >>>>
> >>>> On 4/1/19 11:28 AM, Ming Lei wrote:
> >>>>> On Mon, Apr 01, 2019 at 11:25:50AM +0800, jianchao.wang wrote:
> >>>>>> Hi Ming
> >>>>>>
> >>>>>> On 4/1/19 10:52 AM, Ming Lei wrote:
> >>>>>>>> percpu_ref_tryget_live() fails if a per-cpu counter is in the "dead" state.
> >>>>>>>> percpu_ref_kill() changes the state of a per-cpu counter to the "dead"
> >>>>>>>> state. blk_freeze_queue_start() calls percpu_ref_kill(). blk_cleanup_queue()
> >>>>>>>> already calls blk_set_queue_dying() and that last function calls
> >>>>>>>> blk_freeze_queue_start(). So I think that what you wrote is not correct and
> >>>>>>>> that inserting a percpu_ref_tryget_live()/percpu_ref_put() pair in
> >>>>>>>> blk_mq_run_hw_queues() or blk_mq_run_hw_queue() would make a difference and
> >>>>>>>> also that moving the percpu_ref_exit() call into blk_release_queue() makes
> >>>>>>>> sense.
> >>>>>>> If percpu_ref_exit() is moved to blk_release_queue(), we still need to
> >>>>>>> move freeing of hw queue's resource into blk_release_queue() like what
> >>>>>>> the patchset is doing.
> >>>>>>>
> >>>>>>> Then we don't need to get/put q_usage_counter in blk_mq_run_hw_queues() any more,
> >>>>>>> do we?
> >>>>>>
> >>>>>> IMO, if we could get a way to prevent any attempt to run queue, it would be
> >>>>>> better and clearer.
> >>>>>
> >>>>> It is hard to do that way, and not necessary.
> >>>>>
> >>>>> I will post V2 soon for review.
> >>>>>
> >>>>
> >>>> Put percpu_ref_tryget/put pair into blk_mq_run_hw_queues could stop run queue after
> >>>> requet_queue is frozen and drained (run queue is also unnecessary because there is no
> >>>> entered requests). And also percpu_ref_tryget could avoid the io hung issue you mentioned.
> >>>> We have similar one in blk_mq_timeout_work.
> >>>
> >>> If percpu_ref_tryget() is used, percpu_ref_exit() has to be moved into
> >>> queue's release handler.
> >>>
> >>> Then we still have to move freeing hctx's resource into hctx or queue's
> >>> release handler, that is exactly what this patch is doing. Then
> >>> percpu_ref_tryget() becomes unnecessary again, right?
> >>
> >> I'm not sure about the percpu_ref_exit. Perhaps I have some misunderstanding about it.
> >>
> >> From the code of it, it frees the percpu_count and set ref->percpu_count_ptr to __PERCPU_REF_ATOMIC_DEAD.
> >> The comment says 'the caller is responsible for ensuring that @ref is no longer in active use'
> >> But if we use it after kill, does it count a active use ?
> >> Based on the code, the __ref_is_percpu is always false during this, and percpu_ref_tryget will not
> >> touch the freed percpu counter but just the atomic ref->count.
> >>
> >> It looks safe.
> > 
> > OK, you are right.
> > 
> > However, I still think it isn't necessary to hold the perpcu_ref in the
> > very fast io path.
> 
> percpu_ref is born for fast path.
> There are some drivers use it in completion path, such as scsi, does it really
> matter for this kind of device ? If yes, I guess we should remove blk_mq_run_hw_queues
> which is the really bulk and depend on hctx restart mechanism.

Yes, it is designed for the fast path, but that doesn't mean percpu_ref
has no cost. blk_mq_run_hw_queues() is called for all blk-mq devices,
including fast NVMe ones.

Also:

It may not be enough to grab the percpu_ref for blk_mq_run_hw_queues()
only, given the idea is to use the percpu_ref to protect the hctx's
resources.

There are lots of users of 'hctx', such as other exported blk-mq APIs.
If this approach were chosen, we might have to audit those APIs as well,
because they might also be called after the queue is frozen.

So this usage would probably be a misuse of percpu_ref.

> 
> > 
> >>
> >>
> >>>
> >>>>
> >>>> freeze and drain queue to stop new attempt to run queue, blk_sync_queue syncs and stops
> >>>> the started ones, then hctx->run_queue is cleaned totally.
> >>>>
> >>>> IMO, it would be better to have a checkpoint after which there will be no any in-flight
> >>>> asynchronous activities of the request_queue (hctx->run_work, q->requeue_work, q-> timeout_work)
> >>>> and any attempt to start them will fail.
> >>>
> >>> All are canceled in blk_cleanup_queue(), but not enough, given queue can
> >>> be run in sync mode(such as via plug, direct issue, ...), or driver's
> >>> requeue, such as SCSI's requeue. SCSI's requeue may run other LUN's queue
> >>> just by holding queue's kobject refcount.
> >>
> >> Yes, so we need a checkpoint here to ensure the request_queue to enter into a certain state.
> >> We provide a guarantee that all of the activities are stopped after this checkpoint.
> >> It will be convenient for us to do other things following, for example release request_queue's
> >> resource.
> > 
> > We have such checkpoint already:
> > 
> > 	blk_freeze_queue() together with blk_sync_queue()
> > 
> > Once the two are done, there shouldn't be any driver activities at all.
> > 
> > The current issue is related with blk-mq internal implementation, in which
> > it should have been safe to complete the run queue activity during queue
> > cleanup if the request queue's kobject refcount isn't released.
> > 
> > However, 45a9c9d909b2 ("blk-mq: Fix a use-after-free") frees hctx
> > resource too early, and causes the kernel oops.
> > 
> > Also, isn't it the typical practice to release kobject related resources in
> > its release handler?
> 
> I agree with this.

OK.

Another point in favor of freeing hctx resources in the release handler
is that things become much simpler: as long as the queue's kobject
refcount is held, almost all blk-mq APIs can be called safely. This
approach has worked perfectly on the legacy I/O path for ages.
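
A minimal sketch of that old rule as a caller would follow it
(illustrative only):

	/* hold the queue's kobject refcount around blk-mq API calls */
	if (blk_get_queue(q)) {		/* takes a reference on q->kobj */
		/* safe if hctx resources are only freed in the queue's
		 * release handler */
		blk_mq_run_hw_queues(q, true);
		blk_put_queue(q);	/* release handler runs at zero */
	}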

Thanks,
Ming
Bart Van Assche April 2, 2019, 5:53 p.m. UTC | #22
On Tue, 2019-04-02 at 19:05 +0800, Ming Lei wrote:
> On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> > percpu_ref is born for fast path.
> > There are some drivers use it in completion path, such as scsi, does it really
> > matter for this kind of device ? If yes, I guess we should remove blk_mq_run_hw_queues
> > which is the really bulk and depend on hctx restart mechanism.
> 
> Yes, it is designed for fast path, but it doesn't mean percpu_ref
> hasn't any cost. blk_mq_run_hw_queues() is called for all blk-mq devices,
> includes the fast NVMe.

I think the overhead of adding a percpu_ref_get/put pair is acceptable for
SCSI drivers. The NVMe driver doesn't call blk_mq_run_hw_queues() directly.
Additionally, I don't think that any of the blk_mq_run_hw_queues() calls from
the block layer matter for the fast path code in the NVMe driver. In other
words, adding a percpu_ref_get/put pair in blk_mq_run_hw_queues() shouldn't
affect the performance of the NVMe driver.

> Also:
> 
> It may not be enough to just grab the percpu_ref for blk_mq_run_hw_queues
> only, given the idea is to use the percpu_ref to protect hctx's resources.
> 
> There are lots of uses on 'hctx', such as other exported blk-mq APIs.
> If this approach were chosen, we may have to audit other blk-mq APIs,
> cause they might be called after queue is frozen too.

The only blk_mq_hw_ctx user I have found so far that needs additional
protection is the q->mq_ops->poll() call in blk_poll(). However, that is not
a new issue. Functions like nvme_poll() access data structures (NVMe
completion queue) that shouldn't be accessed while blk_cleanup_queue() is in
progress. If blk_poll() is modified such that it becomes safe to call that
function while blk_cleanup_queue() is in progress then blk_poll() won't
access any hardware queue that it shouldn't access.
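
If blk_poll() were hardened the same way, a hypothetical sketch could look
like the following; __blk_poll() is a made-up helper standing in for
today's polling loop:

int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
{
	int ret;

	/* bail out once cleanup has started and the queue is drained */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return 0;

	ret = __blk_poll(q, cookie, spin);	/* hypothetical helper */
	percpu_ref_put(&q->q_usage_counter);
	return ret;
}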

Bart.
Bart Van Assche April 2, 2019, 6:07 p.m. UTC | #23
On Mon, 2019-04-01 at 10:44 +0800, jianchao.wang wrote:
> percpu_ref_tryget would be better to get pending requests to be issued when queue is frozen.

Agreed.

Thanks,

Bart.
Bart Van Assche April 2, 2019, 6:11 p.m. UTC | #24
On Tue, 2019-04-02 at 10:55 +0800, Ming Lei wrote:
> Also, isn't it the typical practice to release kobject related resources in
> its release handler?

A typical approach is to call kobject_del() before starting to clean up a
resource and to defer the final kfree() call to the release handler. I think
that's how it works today in the block layer core. That is easy to see in the
blk_mq_hw_sysfs_release() implementation:

static void blk_mq_hw_sysfs_release(struct kobject *kobj)
{
	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
						  kobj);
	free_cpumask_var(hctx->cpumask);
	kfree(hctx->ctxs);
	kfree(hctx);
}

Bart.
Ming Lei April 3, 2019, 3:20 a.m. UTC | #25
On Tue, Apr 02, 2019 at 10:53:00AM -0700, Bart Van Assche wrote:
> On Tue, 2019-04-02 at 19:05 +0800, Ming Lei wrote:
> > On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> > > percpu_ref is born for fast path.
> > > There are some drivers use it in completion path, such as scsi, does it really
> > > matter for this kind of device ? If yes, I guess we should remove blk_mq_run_hw_queues
> > > which is the really bulk and depend on hctx restart mechanism.
> > 
> > Yes, it is designed for fast path, but it doesn't mean percpu_ref
> > hasn't any cost. blk_mq_run_hw_queues() is called for all blk-mq devices,
> > includes the fast NVMe.
> 
> I think the overhead of adding a percpu_ref_get/put pair is acceptable for
> SCSI drivers. The NVMe driver doesn't call blk_mq_run_hw_queues() directly.
> Additionally, I don't think that any of the blk_mq_run_hw_queues() calls from
> the block layer matter for the fast path code in the NVMe driver. In other
> words, adding a percpu_ref_get/put pair in blk_mq_run_hw_queues() shouldn't
> affect the performance of the NVMe driver.

But it can be avoided easily and cleanly, so why abuse it to protect the hctx?

> 
> > Also:
> > 
> > It may not be enough to just grab the percpu_ref for blk_mq_run_hw_queues
> > only, given the idea is to use the percpu_ref to protect hctx's resources.
> > 
> > There are lots of uses on 'hctx', such as other exported blk-mq APIs.
> > If this approach were chosen, we may have to audit other blk-mq APIs,
> > cause they might be called after queue is frozen too.
> 
> The only blk_mq_hw_ctx user I have found so far that needs additional
> protection is the q->mq_ops->poll() call in blk_poll(). However, that is not
> a new issue. Functions like nvme_poll() access data structures (NVMe
> completion queue) that shouldn't be accessed while blk_cleanup_queue() is in
> progress. If blk_poll() is modified such that it becomes safe to call that
> function while blk_cleanup_queue() is in progress then blk_poll() won't
> access any hardware queue that it shouldn't access.

There can be lots of such cases (a timeline for case 2 is sketched after
the list):

1) blk_mq_run_hw_queue() from blk_mq_flush_plug_list()
- requests can be completed just after being added to the ctx queue or
scheduler queue, because there can be a concurrent run queue, and queue
freezing may then complete

- then the following blk_mq_run_hw_queue() in blk_mq_sched_insert_requests()
may see freed hctx fields

2) blk_mq_delay_run_hw_queue()
- what if it is called after blk_sync_queue() is done in
  blk_cleanup_queue()?
- but the caller follows the old rule by holding the request queue's
  refcount

3) blk_mq_quiesce_queue()
- called after blk_mq_free_queue() is done, then use-after-free on hctx->srcu
- but the caller follows the old rule by holding the request queue's
  refcount

...
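
To make case 2) concrete, a rough timeline, assuming the caller follows
the old rule of only holding the queue's kobject refcount:

/*
 *   CPU0 (blk_cleanup_queue)          CPU1 (driver, holds q->kobj ref)
 *
 *   blk_freeze_queue(q)
 *   blk_sync_queue(q)
 *                                     blk_mq_delay_run_hw_queue(hctx, ms)
 *   blk_mq_free_queue(q)
 *     frees hctx resources
 *                                     hctx->run_work fires
 *                                       -> use-after-free on hctx fields
 */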

Thanks,
Ming
Ming Lei April 3, 2019, 3:24 a.m. UTC | #26
On Tue, Apr 02, 2019 at 11:11:08AM -0700, Bart Van Assche wrote:
> On Tue, 2019-04-02 at 10:55 +0800, Ming Lei wrote:
> > Also, isn't it the typical practice to release kobject related resources in
> > its release handler?
> 
> A typical approach is to call kobject_del() before starting to clean up a

Yes.

> resource and to defer the final kfree() call to the release handler. I think
> that's how it works today in the block layer core. That is easy to see in the
> blk_mq_hw_sysfs_release() implementation:
> 
> static void blk_mq_hw_sysfs_release(struct kobject *kobj)
> {
> 	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
> 						  kobj);
> 	free_cpumask_var(hctx->cpumask);
> 	kfree(hctx->ctxs);
> 	kfree(hctx);
> }

Right, then we should free the other hctx fields here too; that is safe
and reliable.
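
Concretely, that could look like the sketch below; it simply folds the hw
queue resources listed earlier in the thread into the existing release
handler (a sketch only, not a tested patch):

static void blk_mq_hw_sysfs_release(struct kobject *kobj)
{
	struct blk_mq_hw_ctx *hctx = container_of(kobj, struct blk_mq_hw_ctx,
						  kobj);

	/* the hw queue resources discussed earlier in the thread */
	if (hctx->flags & BLK_MQ_F_BLOCKING)
		cleanup_srcu_struct(hctx->srcu);
	blk_free_flush_queue(hctx->fq);
	sbitmap_free(&hctx->ctx_map);

	free_cpumask_var(hctx->cpumask);
	kfree(hctx->ctxs);
	kfree(hctx);
}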


Thanks,
Ming
Ming Lei April 3, 2019, 8:29 a.m. UTC | #27
On Wed, Apr 03, 2019 at 11:20:53AM +0800, Ming Lei wrote:
> On Tue, Apr 02, 2019 at 10:53:00AM -0700, Bart Van Assche wrote:
> > On Tue, 2019-04-02 at 19:05 +0800, Ming Lei wrote:
> > > On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> > > > percpu_ref is born for fast path.
> > > > There are some drivers use it in completion path, such as scsi, does it really
> > > > matter for this kind of device ? If yes, I guess we should remove blk_mq_run_hw_queues
> > > > which is the really bulk and depend on hctx restart mechanism.
> > > 
> > > Yes, it is designed for fast path, but it doesn't mean percpu_ref
> > > hasn't any cost. blk_mq_run_hw_queues() is called for all blk-mq devices,
> > > includes the fast NVMe.
> > 
> > I think the overhead of adding a percpu_ref_get/put pair is acceptable for
> > SCSI drivers. The NVMe driver doesn't call blk_mq_run_hw_queues() directly.
> > Additionally, I don't think that any of the blk_mq_run_hw_queues() calls from
> > the block layer matter for the fast path code in the NVMe driver. In other
> > words, adding a percpu_ref_get/put pair in blk_mq_run_hw_queues() shouldn't
> > affect the performance of the NVMe driver.
> 
> But it can be avoided easily and cleanly, why abuse it for protecting hctx?
> 
> > 
> > > Also:
> > > 
> > > It may not be enough to just grab the percpu_ref for blk_mq_run_hw_queues
> > > only, given the idea is to use the percpu_ref to protect hctx's resources.
> > > 
> > > There are lots of uses on 'hctx', such as other exported blk-mq APIs.
> > > If this approach were chosen, we may have to audit other blk-mq APIs,
> > > cause they might be called after queue is frozen too.
> > 
> > The only blk_mq_hw_ctx user I have found so far that needs additional
> > protection is the q->mq_ops->poll() call in blk_poll(). However, that is not
> > a new issue. Functions like nvme_poll() access data structures (NVMe
> > completion queue) that shouldn't be accessed while blk_cleanup_queue() is in
> > progress. If blk_poll() is modified such that it becomes safe to call that
> > function while blk_cleanup_queue() is in progress then blk_poll() won't
> > access any hardware queue that it shouldn't access.
> 
> There can be lots of such case:
> 
> 1) blk_mq_run_hw_queue() from blk_mq_flush_plug_list()
> - requests can be completed just after added to ctx queue or scheduler queue
> becasue there can be concurrent run queue, then queue freezing may be done
> 
> - then the following blk_mq_run_hw_queue() in blk_mq_sched_insert_requests()
> may see freed hctx fields

Actually this one is a blk-mq internal race, and the queue's refcount
isn't guaranteed to be held when blk_mq_run_hw_queue() is called.

We might have to address this one by grabbing .q_usage_counter in
blk_mq_sched_insert_requests(), just like commit 8dc765d438f1 ("SCSI: fix
queue cleanup race before queue initialization is done"), but I'd like to
avoid that.
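
A sketch (untested; signatures approximate the then-current tree) of that
workaround:

void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
				  struct blk_mq_ctx *ctx,
				  struct list_head *list, bool run_queue_async)
{
	struct request_queue *q = hctx->queue;

	/*
	 * The caller owns at least one request on 'list', so the counter
	 * cannot be zero here and a plain percpu_ref_get() is safe even
	 * after percpu_ref_kill().
	 */
	percpu_ref_get(&q->q_usage_counter);

	/* ... insert the requests into ctx/scheduler lists as today ... */

	blk_mq_run_hw_queue(hctx, run_queue_async);

	percpu_ref_put(&q->q_usage_counter);
}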

Thanks,
Ming
Ming Lei April 3, 2019, 8:43 a.m. UTC | #28
On Wed, Apr 03, 2019 at 04:29:41PM +0800, Ming Lei wrote:
> On Wed, Apr 03, 2019 at 11:20:53AM +0800, Ming Lei wrote:
> > On Tue, Apr 02, 2019 at 10:53:00AM -0700, Bart Van Assche wrote:
> > > On Tue, 2019-04-02 at 19:05 +0800, Ming Lei wrote:
> > > > On Tue, Apr 02, 2019 at 04:07:04PM +0800, jianchao.wang wrote:
> > > > > percpu_ref is born for fast path.
> > > > > There are some drivers use it in completion path, such as scsi, does it really
> > > > > matter for this kind of device ? If yes, I guess we should remove blk_mq_run_hw_queues
> > > > > which is the really bulk and depend on hctx restart mechanism.
> > > > 
> > > > Yes, it is designed for fast path, but it doesn't mean percpu_ref
> > > > hasn't any cost. blk_mq_run_hw_queues() is called for all blk-mq devices,
> > > > includes the fast NVMe.
> > > 
> > > I think the overhead of adding a percpu_ref_get/put pair is acceptable for
> > > SCSI drivers. The NVMe driver doesn't call blk_mq_run_hw_queues() directly.
> > > Additionally, I don't think that any of the blk_mq_run_hw_queues() calls from
> > > the block layer matter for the fast path code in the NVMe driver. In other
> > > words, adding a percpu_ref_get/put pair in blk_mq_run_hw_queues() shouldn't
> > > affect the performance of the NVMe driver.
> > 
> > But it can be avoided easily and cleanly, why abuse it for protecting hctx?
> > 
> > > 
> > > > Also:
> > > > 
> > > > It may not be enough to just grab the percpu_ref for blk_mq_run_hw_queues
> > > > only, given the idea is to use the percpu_ref to protect hctx's resources.
> > > > 
> > > > There are lots of uses on 'hctx', such as other exported blk-mq APIs.
> > > > If this approach were chosen, we may have to audit other blk-mq APIs,
> > > > cause they might be called after queue is frozen too.
> > > 
> > > The only blk_mq_hw_ctx user I have found so far that needs additional
> > > protection is the q->mq_ops->poll() call in blk_poll(). However, that is not
> > > a new issue. Functions like nvme_poll() access data structures (NVMe
> > > completion queue) that shouldn't be accessed while blk_cleanup_queue() is in
> > > progress. If blk_poll() is modified such that it becomes safe to call that
> > > function while blk_cleanup_queue() is in progress then blk_poll() won't
> > > access any hardware queue that it shouldn't access.
> > 
> > There can be lots of such case:
> > 
> > 1) blk_mq_run_hw_queue() from blk_mq_flush_plug_list()
> > - requests can be completed just after added to ctx queue or scheduler queue
> > becasue there can be concurrent run queue, then queue freezing may be done
> > 
> > - then the following blk_mq_run_hw_queue() in blk_mq_sched_insert_requests()
> > may see freed hctx fields
> 
> Actually this one is blk-mq internal race, and queue's refcount isn't
> guaranteed to be held when blk_mq_run_hw_queue is called.
> 
> We might have to address this one by grabbing .q_usage_count in
> blk_mq_sched_insert_requests just like commit 8dc765d438f1 ("SCSI: fix queue cleanup
> race before queue initialization is done"), but I do want to avoid it.

The issue is very similar to the kiocb refcount in aio/io_uring: we might
need to grab two references for handling one request, one for the
submission side and another for the completion side. The problem now is
that we don't grab the ref on the submission side.
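
Mapping the aio/io_uring analogy onto blk-mq (conceptual only):

/*
 *   submission side:  get q->q_usage_counter      <- currently missing
 *                     insert request(s), run hw queue
 *                     put q->q_usage_counter
 *
 *   completion side:  the reference taken at request allocation
 *                     (blk_queue_enter) is dropped when the request
 *                     is freed (blk_queue_exit)
 */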

Thanks,
Ming