
[V2,0/5] blk-mq: improvement on handling IO during CPU hotplug

Message ID 20190812134312.16732-1-ming.lei@redhat.com

Message

Ming Lei Aug. 12, 2019, 1:43 p.m. UTC
Hi,

Thomas mentioned:
    "
     That was the constraint of managed interrupts from the very beginning:
    
      The driver/subsystem has to quiesce the interrupt line and the associated
      queue _before_ it gets shutdown in CPU unplug and not fiddle with it
      until it's restarted by the core when the CPU is plugged in again.
    "

But neither drivers nor blk-mq do that before one hctx becomes dead (all
CPUs mapped to the hctx are offline), and even worse, blk-mq still tries
to run the hw queue after the hctx is dead, see blk_mq_hctx_notify_dead().

This patchset addresses the issue in two stages:

1) add a new cpuhp state, CPUHP_AP_BLK_MQ_ONLINE

- mark the hctx as internally stopped, and drain all in-flight requests
if the hctx is going to become dead.

2) re-submit IO from the CPUHP_BLK_MQ_DEAD handler after the hctx becomes dead

- steal the bios from each request and resubmit them via
generic_make_request(); the IO will then be mapped to other live hctxs
for dispatch (a rough sketch of both hooks follows)
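
To make the two stages more concrete, here is a minimal sketch of how the
hooks could be wired up via the multi-instance cpuhp API. It is only an
illustration: the drain/resubmit helpers below are empty placeholders, and
only the names introduced by this series (BLK_MQ_S_INTERNAL_STOPPED,
CPUHP_AP_BLK_MQ_ONLINE, hctx->cpuhp_online) are taken from the patches.

#include <linux/blk-mq.h>
#include <linux/cpuhotplug.h>
#include <linux/cpumask.h>
#include <linux/init.h>

/* Placeholders for the real drain/resubmit work done in patches 3-5. */
static void blk_mq_hctx_drain_inflight(struct blk_mq_hw_ctx *hctx) { }
static void blk_mq_hctx_resubmit(struct blk_mq_hw_ctx *hctx) { }

/* Is @cpu the only online CPU mapped to @hctx? */
static bool blk_mq_hctx_last_online_cpu(unsigned int cpu,
					struct blk_mq_hw_ctx *hctx)
{
	unsigned int c;

	for_each_cpu_and(c, hctx->cpumask, cpu_online_mask)
		if (c != cpu)
			return false;
	return true;
}

/* Stage 1: runs while the CPU is still going down (CPUHP_AP_BLK_MQ_ONLINE). */
static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
			struct blk_mq_hw_ctx, cpuhp_online);

	/* only act when the hctx's last online CPU is being unplugged */
	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
	    !blk_mq_hctx_last_online_cpu(cpu, hctx))
		return 0;

	set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
	blk_mq_hctx_drain_inflight(hctx);
	return 0;
}

/*
 * Stage 2: runs after the CPU is dead; registration on the existing
 * CPUHP_BLK_MQ_DEAD state is already done by blk-mq today.
 */
static int blk_mq_hctx_notify_dead(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx = hlist_entry_safe(node,
			struct blk_mq_hw_ctx, cpuhp_dead);

	if (cpumask_test_cpu(cpu, hctx->cpumask) &&
	    test_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state))
		blk_mq_hctx_resubmit(hctx);	/* steal bios, generic_make_request() */
	return 0;
}

static int __init blk_mq_hotplug_example_init(void)
{
	/* each hctx adds itself via cpuhp_state_add_instance_nocalls() */
	return cpuhp_setup_state_multi(CPUHP_AP_BLK_MQ_ONLINE, "block/mq:online",
				       NULL, blk_mq_hctx_notify_offline);
}

The interesting part is how the drain and resubmit are implemented, which
is what patches 3-5 do; the sketch only shows where the two hooks sit in
the hotplug flow.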

Please comment & review, thanks!

V2:
	- patch 4 & patch 5 in V1 have been merged into the block tree, so
	  they are removed
	- address comments from John Garry and Minwoo


Ming Lei (5):
  blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
  blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
  blk-mq: stop to handle IO before hctx's all CPUs become offline
  blk-mq: re-submit IO in case that hctx is dead
  blk-mq: handle requests dispatched from IO scheduler in case that hctx
    is dead

 block/blk-mq-debugfs.c     |   2 +
 block/blk-mq-tag.c         |   2 +-
 block/blk-mq-tag.h         |   2 +
 block/blk-mq.c             | 143 +++++++++++++++++++++++++++++++++----
 block/blk-mq.h             |   3 +-
 drivers/block/loop.c       |   2 +-
 drivers/md/dm-rq.c         |   2 +-
 include/linux/blk-mq.h     |   5 ++
 include/linux/cpuhotplug.h |   1 +
 9 files changed, 146 insertions(+), 16 deletions(-)

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Keith Busch <keith.busch@intel.com>

Comments

Ming Lei Aug. 12, 2019, 1:46 p.m. UTC | #1
Hi John,

On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote:
> Hi,
> 
> Thomas mentioned:
>     "
>      That was the constraint of managed interrupts from the very beginning:
>     
>       The driver/subsystem has to quiesce the interrupt line and the associated
>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>       until it's restarted by the core when the CPU is plugged in again.
>     "
> 
> But no drivers or blk-mq do that before one hctx becomes dead(all
> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> 
> This patchset tries to address the issue by two stages:
> 
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> 
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
> 
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> 
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
> 
> Please comment & review, thanks!
> 
> V2:
> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
> 	  them
> 	- address comments from John Garry and Minwoo
> 
> 
> Ming Lei (5):
>   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
>   blk-mq: stop to handle IO before hctx's all CPUs become offline
>   blk-mq: re-submit IO in case that hctx is dead
>   blk-mq: handle requests dispatched from IO scheduler in case that hctx
>     is dead
> 
>  block/blk-mq-debugfs.c     |   2 +
>  block/blk-mq-tag.c         |   2 +-
>  block/blk-mq-tag.h         |   2 +
>  block/blk-mq.c             | 143 +++++++++++++++++++++++++++++++++----
>  block/blk-mq.h             |   3 +-
>  drivers/block/loop.c       |   2 +-
>  drivers/md/dm-rq.c         |   2 +-
>  include/linux/blk-mq.h     |   5 ++
>  include/linux/cpuhotplug.h |   1 +
>  9 files changed, 146 insertions(+), 16 deletions(-)
> 
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
> -- 
> 2.20.1
> 

Sorry for forgetting to Cc you.


Thanks,
Ming
John Garry Aug. 12, 2019, 4:21 p.m. UTC | #2
On 12/08/2019 14:46, Ming Lei wrote:
> Hi John,
>
> On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote:
>> Hi,
>>
>> Thomas mentioned:
>>     "
>>      That was the constraint of managed interrupts from the very beginning:
>>
>>       The driver/subsystem has to quiesce the interrupt line and the associated
>>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>>       until it's restarted by the core when the CPU is plugged in again.
>>     "
>>
>> But no drivers or blk-mq do that before one hctx becomes dead(all
>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
>>
>> This patchset tries to address the issue by two stages:
>>
>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>>
>> - mark the hctx as internal stopped, and drain all in-flight requests
>> if the hctx is going to be dead.
>>
>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
>>
>> - steal bios from the request, and resubmit them via generic_make_request(),
>> then these IO will be mapped to other live hctx for dispatch
>>
>> Please comment & review, thanks!
>>
>> V2:
>> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
>> 	  them
>> 	- address comments from John Garry and Minwoo
>>
>>
>> Ming Lei (5):
>>   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>>   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
>>   blk-mq: stop to handle IO before hctx's all CPUs become offline
>>   blk-mq: re-submit IO in case that hctx is dead
>>   blk-mq: handle requests dispatched from IO scheduler in case that hctx
>>     is dead
>>
>>  block/blk-mq-debugfs.c     |   2 +
>>  block/blk-mq-tag.c         |   2 +-
>>  block/blk-mq-tag.h         |   2 +
>>  block/blk-mq.c             | 143 +++++++++++++++++++++++++++++++++----
>>  block/blk-mq.h             |   3 +-
>>  drivers/block/loop.c       |   2 +-
>>  drivers/md/dm-rq.c         |   2 +-
>>  include/linux/blk-mq.h     |   5 ++
>>  include/linux/cpuhotplug.h |   1 +
>>  9 files changed, 146 insertions(+), 16 deletions(-)
>>
>> Cc: Bart Van Assche <bvanassche@acm.org>
>> Cc: Hannes Reinecke <hare@suse.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Keith Busch <keith.busch@intel.com>
>> --
>> 2.20.1
>>
>
> Sorry for forgetting to Cc you.

Already subscribed :)

I don't mean to hijack this thread, but JFYI we're getting around to
testing https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2 -
unfortunately we're still seeing a performance regression, and I can't
see where it's coming from. We're double-checking the test though.

Thanks,
John

>
>
> Thanks,
> Ming
>
> .
>
Ming Lei Aug. 12, 2019, 10:45 p.m. UTC | #3
On Mon, Aug 12, 2019 at 05:21:44PM +0100, John Garry wrote:
> On 12/08/2019 14:46, Ming Lei wrote:
> > Hi John,
> > 
> > On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote:
> > > Hi,
> > > 
> > > Thomas mentioned:
> > >     "
> > >      That was the constraint of managed interrupts from the very beginning:
> > > 
> > >       The driver/subsystem has to quiesce the interrupt line and the associated
> > >       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
> > >       until it's restarted by the core when the CPU is plugged in again.
> > >     "
> > > 
> > > But no drivers or blk-mq do that before one hctx becomes dead(all
> > > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> > > 
> > > This patchset tries to address the issue by two stages:
> > > 
> > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > > 
> > > - mark the hctx as internal stopped, and drain all in-flight requests
> > > if the hctx is going to be dead.
> > > 
> > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
> > > 
> > > - steal bios from the request, and resubmit them via generic_make_request(),
> > > then these IO will be mapped to other live hctx for dispatch
> > > 
> > > Please comment & review, thanks!
> > > 
> > > V2:
> > > 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
> > > 	  them
> > > 	- address comments from John Garry and Minwoo
> > > 
> > > 
> > > Ming Lei (5):
> > >   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
> > >   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
> > >   blk-mq: stop to handle IO before hctx's all CPUs become offline
> > >   blk-mq: re-submit IO in case that hctx is dead
> > >   blk-mq: handle requests dispatched from IO scheduler in case that hctx
> > >     is dead
> > > 
> > >  block/blk-mq-debugfs.c     |   2 +
> > >  block/blk-mq-tag.c         |   2 +-
> > >  block/blk-mq-tag.h         |   2 +
> > >  block/blk-mq.c             | 143 +++++++++++++++++++++++++++++++++----
> > >  block/blk-mq.h             |   3 +-
> > >  drivers/block/loop.c       |   2 +-
> > >  drivers/md/dm-rq.c         |   2 +-
> > >  include/linux/blk-mq.h     |   5 ++
> > >  include/linux/cpuhotplug.h |   1 +
> > >  9 files changed, 146 insertions(+), 16 deletions(-)
> > > 
> > > Cc: Bart Van Assche <bvanassche@acm.org>
> > > Cc: Hannes Reinecke <hare@suse.com>
> > > Cc: Christoph Hellwig <hch@lst.de>
> > > Cc: Thomas Gleixner <tglx@linutronix.de>
> > > Cc: Keith Busch <keith.busch@intel.com>
> > > --
> > > 2.20.1
> > > 
> > 
> > Sorry for forgetting to Cc you.
> 
> Already subscribed :)
> 
> I don't mean to hijack this thread, but JFYI we're getting around to test
> https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2 - unfortunately
> we're still seeing a performance regression. I can't see where it's coming
> from. We're double-checking the test though.

The host-tags patchset is only for several particular drivers which use a
private reply queue as the completion queue.

This patchset is for handling the generic blk-mq CPU hotplug issue, and
those particular SCSI drivers (hisi_sas_v3, hpsa, megaraid_sas and
mpt3sas) aren't covered by it so far.

I'd suggest we move on with generic blk-mq devices first, given blk-mq is
now the only request IO path.

There are at least two choices for handling drivers/devices with a
private completion queue:

1) host-tags
- the performance issue shouldn't be hard to solve, given it is the same
as single tags in theory, and only corner cases remain.

What I don't like about this approach is that the blk-mq-tag code
becomes a mess.

2) private callback
- we could simply define a private callback to drain each completion
  queue in the driver (a rough sketch follows).
- the problem is that the four drivers have to duplicate the same job

Thanks,
Ming
John Garry Aug. 22, 2019, 5:39 p.m. UTC | #4
On 12/08/2019 14:43, Ming Lei wrote:
> Hi,
>
> Thomas mentioned:
>     "
>      That was the constraint of managed interrupts from the very beginning:
>
>       The driver/subsystem has to quiesce the interrupt line and the associated
>       queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>       until it's restarted by the core when the CPU is plugged in again.
>     "
>
> But no drivers or blk-mq do that before one hctx becomes dead(all
> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
>
> This patchset tries to address the issue by two stages:
>
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>
> - mark the hctx as internal stopped, and drain all in-flight requests
> if the hctx is going to be dead.
>
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
>
> - steal bios from the request, and resubmit them via generic_make_request(),
> then these IO will be mapped to other live hctx for dispatch
>
> Please comment & review, thanks!
>
> V2:
> 	- patch4 & patch 5 in V1 have been merged to block tree, so remove
> 	  them
> 	- address comments from John Garry and Minwoo
>
>
> Ming Lei (5):
>   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
>   blk-mq: stop to handle IO before hctx's all CPUs become offline
>   blk-mq: re-submit IO in case that hctx is dead
>   blk-mq: handle requests dispatched from IO scheduler in case that hctx
>     is dead

Hi Ming,

This looks to fix the hotplug issue for me.

Previously I could manufacture a scenario while running fio where I got 
IO timeouts, like this:

root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
[  296.897627] process 891 (fio) no longer affine to cpu0
[  296.898488] process 893 (fio) no longer affine to cpu0
[  296.910270] process 890 (fio) no longer affine to cpu0
[  296.927322] IRQ 775: no longer affine to CPU0
[  296.932762] CPU0: shutdown
[  296.935469] psci: CPU0 killed.
root@(none)$ [  326.971962] sas: Enter sas_scsi_recover_host busy: 61 
failed: 61
[  326.977978] sas: sas_scsi_find_task: aborting task 0x00000000e2cdc79b
root@(none)$ [  333.047964] hisi_sas_v3_hw 0000:74:02.0: internal task 
abort: timeout and not done.
[  333.055616] hisi_sas_v3_hw 0000:74:02.0: abort task: internal abort (-5)
[  333.062306] sas: sas_scsi_find_task: querying task 0x00000000e2cdc79b
[  333.068776] sas: sas_scsi_find_task: task 0x00000000e2cdc79b not at LU
[  333.075295] sas: task 0x00000000e2cdc79b is not at LU: I_T recover
[  333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49

Please notice the 30-second delay for the SCSI IO timeout.

And now I don't see it; here's a sample for irq shutdown:
root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
[  344.608148] process 849 (fio) no longer affine to cpu0
[  344.608639] process 848 (fio) no longer affine to cpu0
[  344.609454] process 850 (fio) no longer affine to cpu0
[  344.643481] process 847 (fio) no longer affine to cpu0
[  346.213842] IRQ 775: no longer affine to CPU0
[  346.219712] CPU0: shutdown
[  346.222425] psci: CPU0 killed.

Please notice the ~1.5s pause, which would be the queue draining.

So FWIW:
Tested-by: John Garry <john.garry@huawei.com>

JFYI, I tested on 5.3-rc5 and cherry-picked 
https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6 
and "blk-mq: add callback of .cleanup_rq".

Cheers,
John

>
>  block/blk-mq-debugfs.c     |   2 +
>  block/blk-mq-tag.c         |   2 +-
>  block/blk-mq-tag.h         |   2 +
>  block/blk-mq.c             | 143 +++++++++++++++++++++++++++++++++----
>  block/blk-mq.h             |   3 +-
>  drivers/block/loop.c       |   2 +-
>  drivers/md/dm-rq.c         |   2 +-
>  include/linux/blk-mq.h     |   5 ++
>  include/linux/cpuhotplug.h |   1 +
>  9 files changed, 146 insertions(+), 16 deletions(-)
>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
>
John Garry Oct. 2, 2019, 9:56 a.m. UTC | #5
On 22/08/2019 18:39, John Garry wrote:
> On 12/08/2019 14:43, Ming Lei wrote:
>> Hi,
>>
>> Thomas mentioned:
>>     "
>>      That was the constraint of managed interrupts from the very
>> beginning:
>>
>>       The driver/subsystem has to quiesce the interrupt line and the
>> associated
>>       queue _before_ it gets shutdown in CPU unplug and not fiddle
>> with it
>>       until it's restarted by the core when the CPU is plugged in again.
>>     "
>>
>> But no drivers or blk-mq do that before one hctx becomes dead(all
>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
>>
>> This patchset tries to address the issue by two stages:
>>
>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>>
>> - mark the hctx as internal stopped, and drain all in-flight requests
>> if the hctx is going to be dead.
>>
>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
>> becomes dead
>>
>> - steal bios from the request, and resubmit them via
>> generic_make_request(),
>> then these IO will be mapped to other live hctx for dispatch
>>
>> Please comment & review, thanks!
>>
>> V2:
>>     - patch4 & patch 5 in V1 have been merged to block tree, so remove
>>       them
>>     - address comments from John Garry and Minwoo
>>
>>
>> Ming Lei (5):
>>   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>>   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
>>   blk-mq: stop to handle IO before hctx's all CPUs become offline
>>   blk-mq: re-submit IO in case that hctx is dead
>>   blk-mq: handle requests dispatched from IO scheduler in case that hctx
>>     is dead
>
> Hi Ming,
>
> This looks to fix the hotplug issue for me.
>
> Previously I could manufacture a scenario while running fio where I got
> IO timeouts, like this:
>
> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
> [  296.897627] process 891 (fio) no longer affine to cpu0
> [  296.898488] process 893 (fio) no longer affine to cpu0
> [  296.910270] process 890 (fio) no longer affine to cpu0
> [  296.927322] IRQ 775: no longer affine to CPU0
> [  296.932762] CPU0: shutdown
> [  296.935469] psci: CPU0 killed.
> root@(none)$ [  326.971962] sas: Enter sas_scsi_recover_host busy: 61
> failed: 61
> [  326.977978] sas: sas_scsi_find_task: aborting task 0x00000000e2cdc79b
> root@(none)$ [  333.047964] hisi_sas_v3_hw 0000:74:02.0: internal task
> abort: timeout and not done.
> [  333.055616] hisi_sas_v3_hw 0000:74:02.0: abort task: internal abort (-5)
> [  333.062306] sas: sas_scsi_find_task: querying task 0x00000000e2cdc79b
> [  333.068776] sas: sas_scsi_find_task: task 0x00000000e2cdc79b not at LU
> [  333.075295] sas: task 0x00000000e2cdc79b is not at LU: I_T recover
> [  333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49
>
> Please notice the 30-second delay for the SCSI IO timeout.
>
> And now I don't see it; here's a sample for irq shutdown:
> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
> [  344.608148] process 849 (fio) no longer affine to cpu0
> [  344.608639] process 848 (fio) no longer affine to cpu0
> [  344.609454] process 850 (fio) no longer affine to cpu0
> [  344.643481] process 847 (fio) no longer affine to cpu0
> [  346.213842] IRQ 775: no longer affine to CPU0
> [  346.219712] CPU0: shutdown
> [  346.222425] psci: CPU0 killed.
>
> Please notice the ~1.5s pause, which would be the queue draining.
>
> So FWIW:
> Tested-by: John Garry <john.garry@huawei.com>
>
> JFYI, I tested on 5.3-rc5 and cherry-picked
> https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6
> and "blk-mq: add callback of .cleanup_rq".
>
> Cheers,
> John

Hi Jens,

I don't mean to be pushy, but can we consider getting these patches from
Ming merged?

As above, I tested on my SCSI driver and it works. I also tested on an 
NVMe disk, and it solves the condition which generates this message:
root@(none)$ echo 0 > /sys/devices/system/cpu/cpu2/online
[  465.635960] CPU2: shutdown
[  465.638662] psci: CPU2 killed.
[  111.381653] nvme nvme0: I/O 705 QID 18 timeout, completion polled

(that's on top of v5.4-rc1)

Thanks,
John



>
>>
>>  block/blk-mq-debugfs.c     |   2 +
Jens Axboe Oct. 2, 2019, 2:36 p.m. UTC | #6
On 10/2/19 3:56 AM, John Garry wrote:
> On 22/08/2019 18:39, John Garry wrote:
>> On 12/08/2019 14:43, Ming Lei wrote:
>>> Hi,
>>>
>>> Thomas mentioned:
>>>      "
>>>       That was the constraint of managed interrupts from the very
>>> beginning:
>>>
>>>        The driver/subsystem has to quiesce the interrupt line and the
>>> associated
>>>        queue _before_ it gets shutdown in CPU unplug and not fiddle
>>> with it
>>>        until it's restarted by the core when the CPU is plugged in again.
>>>      "
>>>
>>> But no drivers or blk-mq do that before one hctx becomes dead(all
>>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
>>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
>>>
>>> This patchset tries to address the issue by two stages:
>>>
>>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>>>
>>> - mark the hctx as internal stopped, and drain all in-flight requests
>>> if the hctx is going to be dead.
>>>
>>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
>>> becomes dead
>>>
>>> - steal bios from the request, and resubmit them via
>>> generic_make_request(),
>>> then these IO will be mapped to other live hctx for dispatch
>>>
>>> Please comment & review, thanks!
>>>
>>> V2:
>>>      - patch4 & patch 5 in V1 have been merged to block tree, so remove
>>>        them
>>>      - address comments from John Garry and Minwoo
>>>
>>>
>>> Ming Lei (5):
>>>    blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>>>    blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
>>>    blk-mq: stop to handle IO before hctx's all CPUs become offline
>>>    blk-mq: re-submit IO in case that hctx is dead
>>>    blk-mq: handle requests dispatched from IO scheduler in case that hctx
>>>      is dead
>>
>> Hi Ming,
>>
>> This looks to fix the hotplug issue for me.
>>
>> Previously I could manufacture a scenario while running fio where I got
>> IO timeouts, like this:
>>
>> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
>> [  296.897627] process 891 (fio) no longer affine to cpu0
>> [  296.898488] process 893 (fio) no longer affine to cpu0
>> [  296.910270] process 890 (fio) no longer affine to cpu0
>> [  296.927322] IRQ 775: no longer affine to CPU0
>> [  296.932762] CPU0: shutdown
>> [  296.935469] psci: CPU0 killed.
>> root@(none)$ [  326.971962] sas: Enter sas_scsi_recover_host busy: 61
>> failed: 61
>> [  326.977978] sas: sas_scsi_find_task: aborting task 0x00000000e2cdc79b
>> root@(none)$ [  333.047964] hisi_sas_v3_hw 0000:74:02.0: internal task
>> abort: timeout and not done.
>> [  333.055616] hisi_sas_v3_hw 0000:74:02.0: abort task: internal abort (-5)
>> [  333.062306] sas: sas_scsi_find_task: querying task 0x00000000e2cdc79b
>> [  333.068776] sas: sas_scsi_find_task: task 0x00000000e2cdc79b not at LU
>> [  333.075295] sas: task 0x00000000e2cdc79b is not at LU: I_T recover
>> [  333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49
>>
>> Please notice the 30-second delay for the SCSI IO timeout.
>>
>> And now I don't see it; here's a sample for irq shutdown:
>> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
>> [  344.608148] process 849 (fio) no longer affine to cpu0
>> [  344.608639] process 848 (fio) no longer affine to cpu0
>> [  344.609454] process 850 (fio) no longer affine to cpu0
>> [  344.643481] process 847 (fio) no longer affine to cpu0
>> [  346.213842] IRQ 775: no longer affine to CPU0
>> [  346.219712] CPU0: shutdown
>> [  346.222425] psci: CPU0 killed.
>>
>> Please notice the ~1.5s pause, which would be the queue draining.
>>
>> So FWIW:
>> Tested-by: John Garry <john.garry@huawei.com>
>>
>> JFYI, I tested on 5.3-rc5 and cherry-picked
>> https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6
>> and "blk-mq: add callback of .cleanup_rq".
>>
>> Cheers,
>> John
> 
> Hi Jens,
> 
> I don't mean to be pushy, but can we consider to get these patches from
> Ming merged?
> 
> As above, I tested on my SCSI driver and it works. I also tested on an
> NVMe disk, and it solves the condition which generates this message:
> root@(none)$ echo 0 > /sys/devices/system/cpu/cpu2/online
> [  465.635960] CPU2: shutdown
> [  465.638662] psci: CPU2 killed.
> [  111.381653] nvme nvme0: I/O 705 QID 18 timeout, completion polled
> 
> (that's on top off v5.4-rc1)

Ming, can you repost the series?
Ming Lei Oct. 6, 2019, 2:47 a.m. UTC | #7
On Wed, Oct 02, 2019 at 08:36:52AM -0600, Jens Axboe wrote:
> On 10/2/19 3:56 AM, John Garry wrote:
> > On 22/08/2019 18:39, John Garry wrote:
> >> On 12/08/2019 14:43, Ming Lei wrote:
> >>> Hi,
> >>>
> >>> Thomas mentioned:
> >>>      "
> >>>       That was the constraint of managed interrupts from the very
> >>> beginning:
> >>>
> >>>        The driver/subsystem has to quiesce the interrupt line and the
> >>> associated
> >>>        queue _before_ it gets shutdown in CPU unplug and not fiddle
> >>> with it
> >>>        until it's restarted by the core when the CPU is plugged in again.
> >>>      "
> >>>
> >>> But no drivers or blk-mq do that before one hctx becomes dead(all
> >>> CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> >>> to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> >>>
> >>> This patchset tries to address the issue by two stages:
> >>>
> >>> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> >>>
> >>> - mark the hctx as internal stopped, and drain all in-flight requests
> >>> if the hctx is going to be dead.
> >>>
> >>> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
> >>> becomes dead
> >>>
> >>> - steal bios from the request, and resubmit them via
> >>> generic_make_request(),
> >>> then these IO will be mapped to other live hctx for dispatch
> >>>
> >>> Please comment & review, thanks!
> >>>
> >>> V2:
> >>>      - patch4 & patch 5 in V1 have been merged to block tree, so remove
> >>>        them
> >>>      - address comments from John Garry and Minwoo
> >>>
> >>>
> >>> Ming Lei (5):
> >>>    blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
> >>>    blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
> >>>    blk-mq: stop to handle IO before hctx's all CPUs become offline
> >>>    blk-mq: re-submit IO in case that hctx is dead
> >>>    blk-mq: handle requests dispatched from IO scheduler in case that hctx
> >>>      is dead
> >>
> >> Hi Ming,
> >>
> >> This looks to fix the hotplug issue for me.
> >>
> >> Previously I could manufacture a scenario while running fio where I got
> >> IO timeouts, like this:
> >>
> >> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
> >> [  296.897627] process 891 (fio) no longer affine to cpu0
> >> [  296.898488] process 893 (fio) no longer affine to cpu0
> >> [  296.910270] process 890 (fio) no longer affine to cpu0
> >> [  296.927322] IRQ 775: no longer affine to CPU0
> >> [  296.932762] CPU0: shutdown
> >> [  296.935469] psci: CPU0 killed.
> >> root@(none)$ [  326.971962] sas: Enter sas_scsi_recover_host busy: 61
> >> failed: 61
> >> [  326.977978] sas: sas_scsi_find_task: aborting task 0x00000000e2cdc79b
> >> root@(none)$ [  333.047964] hisi_sas_v3_hw 0000:74:02.0: internal task
> >> abort: timeout and not done.
> >> [  333.055616] hisi_sas_v3_hw 0000:74:02.0: abort task: internal abort (-5)
> >> [  333.062306] sas: sas_scsi_find_task: querying task 0x00000000e2cdc79b
> >> [  333.068776] sas: sas_scsi_find_task: task 0x00000000e2cdc79b not at LU
> >> [  333.075295] sas: task 0x00000000e2cdc79b is not at LU: I_T recover
> >> [  333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49
> >>
> >> Please notice the 30-second delay for the SCSI IO timeout.
> >>
> >> And now I don't see it; here's a sample for irq shutdown:
> >> root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
> >> [  344.608148] process 849 (fio) no longer affine to cpu0
> >> [  344.608639] process 848 (fio) no longer affine to cpu0
> >> [  344.609454] process 850 (fio) no longer affine to cpu0
> >> [  344.643481] process 847 (fio) no longer affine to cpu0
> >> [  346.213842] IRQ 775: no longer affine to CPU0
> >> [  346.219712] CPU0: shutdown
> >> [  346.222425] psci: CPU0 killed.
> >>
> >> Please notice the ~1.5s pause, which would be the queue draining.
> >>
> >> So FWIW:
> >> Tested-by: John Garry <john.garry@huawei.com>
> >>
> >> JFYI, I tested on 5.3-rc5 and cherry-picked
> >> https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6
> >> and "blk-mq: add callback of .cleanup_rq".
> >>
> >> Cheers,
> >> John
> > 
> > Hi Jens,
> > 
> > I don't mean to be pushy, but can we consider to get these patches from
> > Ming merged?
> > 
> > As above, I tested on my SCSI driver and it works. I also tested on an
> > NVMe disk, and it solves the condition which generates this message:
> > root@(none)$ echo 0 > /sys/devices/system/cpu/cpu2/online
> > [  465.635960] CPU2: shutdown
> > [  465.638662] psci: CPU2 killed.
> > [  111.381653] nvme nvme0: I/O 705 QID 18 timeout, completion polled
> > 
> > (that's on top off v5.4-rc1)
> 
> Ming, can you repost the series?

It has just been resent.

Thanks,
Ming