Message ID: 20190812134312.16732-1-ming.lei@redhat.com
Series:     blk-mq: improvement on handling IO during CPU hotplug

Hi John,

On Mon, Aug 12, 2019 at 09:43:07PM +0800, Ming Lei wrote:
> Hi,
>
> Thomas mentioned:
> "
> That was the constraint of managed interrupts from the very beginning:
>
>  The driver/subsystem has to quiesce the interrupt line and the associated
>  queue _before_ it gets shutdown in CPU unplug and not fiddle with it
>  until it's restarted by the core when the CPU is plugged in again.
> "
>
> But no driver or blk-mq does that before one hctx becomes dead (all
> CPUs for one hctx are offline), and worse, blk-mq still tries to run
> the hw queue after the hctx is dead; see blk_mq_hctx_notify_dead().
>
> This patchset tries to address the issue in two stages:
>
> 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
>
> - mark the hctx as internally stopped, and drain all in-flight requests
>   if the hctx is going to be dead.
>
> 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx becomes dead
>
> - steal bios from the request, and resubmit them via generic_make_request(),
>   then these IOs will be mapped to other live hctxs for dispatch.
>
> Please comment & review, thanks!
>
> V2:
> 	- patch 4 & patch 5 in V1 have been merged to the block tree, so remove them
> 	- address comments from John Garry and Minwoo
>
>
> Ming Lei (5):
>   blk-mq: add new state of BLK_MQ_S_INTERNAL_STOPPED
>   blk-mq: add blk-mq flag of BLK_MQ_F_NO_MANAGED_IRQ
>   blk-mq: stop to handle IO before hctx's all CPUs become offline
>   blk-mq: re-submit IO in case that hctx is dead
>   blk-mq: handle requests dispatched from IO scheduler in case that hctx
>     is dead
>
>  block/blk-mq-debugfs.c     |   2 +
>  block/blk-mq-tag.c         |   2 +-
>  block/blk-mq-tag.h         |   2 +
>  block/blk-mq.c             | 143 +++++++++++++++++++++++++++++++++----
>  block/blk-mq.h             |   3 +-
>  drivers/block/loop.c       |   2 +-
>  drivers/md/dm-rq.c         |   2 +-
>  include/linux/blk-mq.h     |   5 ++
>  include/linux/cpuhotplug.h |   1 +
>  9 files changed, 146 insertions(+), 16 deletions(-)
>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Keith Busch <keith.busch@intel.com>
> --
> 2.20.1

Sorry for forgetting to Cc you.

Thanks,
Ming

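To make the two stages described in the cover letter above easier to picture, here is a minimal, illustrative sketch in C. blk_steal_bios(), generic_make_request(), blk_mq_end_request(), hlist_entry_safe() and the cpuhp-callback shape are real kernel APIs of that era, and BLK_MQ_S_INTERNAL_STOPPED is the flag patch 1 introduces; everything marked as a stand-in (blk_mq_last_online_cpu_in_hctx(), blk_mq_drain_inflight(), the reuse of hctx->cpuhp_dead, the simplified request retirement) is an assumption for this sketch, not the patchset's actual code.

#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <linux/bio.h>

/* Stand-ins for this sketch only; not real kernel or patchset helpers. */
static bool blk_mq_last_online_cpu_in_hctx(struct blk_mq_hw_ctx *hctx,
					   unsigned int cpu)
{
	return true;	/* pretend @cpu is the last online CPU mapped to @hctx */
}

static void blk_mq_drain_inflight(struct blk_mq_hw_ctx *hctx)
{
	/* would wait here until all of @hctx's in-flight requests complete */
}

/*
 * Stage 1: teardown callback of the new CPUHP_AP_BLK_MQ_ONLINE state, run
 * while the outgoing CPU (and the hctx's managed IRQ) is still alive.  The
 * reuse of hctx->cpuhp_dead as the hlist node is only for illustration.
 */
static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	struct blk_mq_hw_ctx *hctx =
		hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_dead);

	if (!blk_mq_last_online_cpu_in_hctx(hctx, cpu))
		return 0;

	/* BLK_MQ_S_INTERNAL_STOPPED is the new hctx state added by patch 1 */
	set_bit(BLK_MQ_S_INTERNAL_STOPPED, &hctx->state);
	blk_mq_drain_inflight(hctx);
	return 0;
}

/*
 * Stage 2: once the hctx is dead, detach the bios from a request that can no
 * longer be dispatched and feed them back in at the top of the stack, where
 * they get remapped to a live hctx.  How the old request is retired here is
 * simplified.
 */
static void blk_mq_resubmit_dead_rq(struct request *rq)
{
	struct bio_list list;
	struct bio *bio;

	bio_list_init(&list);
	blk_steal_bios(&list, rq);		/* move rq's bios onto our list */
	blk_mq_end_request(rq, BLK_STS_OK);	/* retire the now-empty request */

	while ((bio = bio_list_pop(&list)))
		generic_make_request(bio);	/* picks a live hctx for each bio */
}

Registration of such a callback would presumably go through cpuhp_setup_state_multi() with the new CPUHP_AP_BLK_MQ_ONLINE state, mirroring how the existing CPUHP_BLK_MQ_DEAD callback blk_mq_hctx_notify_dead() is hooked up.
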
On 12/08/2019 14:46, Ming Lei wrote:
> Hi John,
> [cover letter snipped]
>
> Sorry for forgetting to Cc you.

Already subscribed :)

I don't mean to hijack this thread, but JFYI we're getting around to test
https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2 - unfortunately
we're still seeing a performance regression. I can't see where it's coming
from. We're double-checking the test though.

Thanks,
John

On Mon, Aug 12, 2019 at 05:21:44PM +0100, John Garry wrote:
> On 12/08/2019 14:46, Ming Lei wrote:
> [cover letter snipped]
>
> I don't mean to hijack this thread, but JFYI we're getting around to test
> https://github.com/ming1/linux/commits/v5.2-rc-host-tags-V2 - unfortunately
> we're still seeing a performance regression. I can't see where it's coming
> from. We're double-checking the test though.

The host-tags patchset is only for the few particular drivers which use a
private reply queue as their completion queue.

This patchset is for handling the generic blk-mq CPU hotplug issue, and those
particular SCSI drivers (hisi_sas_v3, hpsa, megaraid_sas and mpt3sas) aren't
covered by it so far.

I'd suggest we move on for generic blk-mq devices first, given blk-mq is now
the only request IO path.

There are at least two choices for handling drivers/devices with a private
completion queue:

1) host-tags
- the performance issue shouldn't be hard to solve, given it is the same as
  single tags in theory, and only corner cases remain.
- what I am not happy with in this approach is that the blk-mq-tag code
  becomes a mess.

2) private callback
- we could define a private callback to drain each completion queue in the
  driver simply.
- the problem is that the four drivers have to duplicate the same job.

Thanks,
Ming

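For option 2) in Ming's message above, a hypothetical sketch of the driver-side drain hook each of the four drivers would have to implement is shown below. Nothing here is an existing interface: struct mydrv_reply_queue, mydrv_reap_one() and mydrv_drain_reply_queue() are made up to illustrate the shape of the idea; only struct blk_mq_hw_ctx, hctx->driver_data and cond_resched() are real kernel items.

#include <linux/blk-mq.h>
#include <linux/sched.h>

struct mydrv_reply_queue {
	int	qid;
	/* hardware completion ring, MSI vector, etc. would live here */
};

/* Pretend poll routine: returns number of completions reaped, 0 when empty. */
static int mydrv_reap_one(struct mydrv_reply_queue *q)
{
	return 0;
}

/* The private drain callback each driver would have to provide. */
static void mydrv_drain_reply_queue(struct blk_mq_hw_ctx *hctx)
{
	struct mydrv_reply_queue *q = hctx->driver_data;

	/*
	 * Reap completions synchronously until the private reply queue is
	 * empty, so that no interrupt from the soon-to-be-shut-down managed
	 * IRQ is needed to finish the remaining in-flight commands.
	 */
	while (mydrv_reap_one(q))
		cond_resched();
}

The sketch also shows the downside Ming points out: each of the four drivers would carry a near-identical copy of this polling loop.
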
On 12/08/2019 14:43, Ming Lei wrote:
> Hi,
> [cover letter snipped]

Hi Ming,

This looks to fix the hotplug issue for me.

Previously I could manufacture a scenario while running fio where I got
IO timeouts, like this:

root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
[ 296.897627] process 891 (fio) no longer affine to cpu0
[ 296.898488] process 893 (fio) no longer affine to cpu0
[ 296.910270] process 890 (fio) no longer affine to cpu0
[ 296.927322] IRQ 775: no longer affine to CPU0
[ 296.932762] CPU0: shutdown
[ 296.935469] psci: CPU0 killed.
root@(none)$ [ 326.971962] sas: Enter sas_scsi_recover_host busy: 61 failed: 61
[ 326.977978] sas: sas_scsi_find_task: aborting task 0x00000000e2cdc79b
root@(none)$ [ 333.047964] hisi_sas_v3_hw 0000:74:02.0: internal task abort: timeout and not done.
[ 333.055616] hisi_sas_v3_hw 0000:74:02.0: abort task: internal abort (-5)
[ 333.062306] sas: sas_scsi_find_task: querying task 0x00000000e2cdc79b
[ 333.068776] sas: sas_scsi_find_task: task 0x00000000e2cdc79b not at LU
[ 333.075295] sas: task 0x00000000e2cdc79b is not at LU: I_T recover
[ 333.081464] sas: I_T nexus reset for dev 5000c500a7b95a49

Please notice the 30-second delay for the SCSI IO timeout.

And now I don't see it; here's a sample for irq shutdown:

root@(none)$ echo 0 > ./sys/devices/system/cpu/cpu0/online
[ 344.608148] process 849 (fio) no longer affine to cpu0
[ 344.608639] process 848 (fio) no longer affine to cpu0
[ 344.609454] process 850 (fio) no longer affine to cpu0
[ 344.643481] process 847 (fio) no longer affine to cpu0
[ 346.213842] IRQ 775: no longer affine to CPU0
[ 346.219712] CPU0: shutdown
[ 346.222425] psci: CPU0 killed.

Please notice the ~1.5s pause, which would be the queue draining.

So FWIW:
Tested-by: John Garry <john.garry@huawei.com>

JFYI, I tested on 5.3-rc5 and cherry-picked
https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6
and "blk-mq: add callback of .cleanup_rq".

Cheers,
John

On 22/08/2019 18:39, John Garry wrote:
> On 12/08/2019 14:43, Ming Lei wrote:
> [cover letter snipped]
>
> This looks to fix the hotplug issue for me.
> [test results snipped]
>
> So FWIW:
> Tested-by: John Garry <john.garry@huawei.com>
>
> JFYI, I tested on 5.3-rc5 and cherry-picked
> https://github.com/ming1/linux/commit/0d2cd3c99bb0fe81d2c0ca5d68e02bdc4521d4d6
> and "blk-mq: add callback of .cleanup_rq".

Hi Jens,

I don't mean to be pushy, but can we consider getting these patches from
Ming merged?

As above, I tested on my SCSI driver and it works. I also tested on an
NVMe disk, and it solves the condition which generates this message:

root@(none)$ echo 0 > /sys/devices/system/cpu/cpu2/online
[ 465.635960] CPU2: shutdown
[ 465.638662] psci: CPU2 killed.
[ 111.381653] nvme nvme0: I/O 705 QID 18 timeout, completion polled

(that's on top of v5.4-rc1)

Thanks,
John

On 10/2/19 3:56 AM, John Garry wrote:
> On 22/08/2019 18:39, John Garry wrote:
> [...]
>
> Hi Jens,
>
> I don't mean to be pushy, but can we consider getting these patches from
> Ming merged?
>
> As above, I tested on my SCSI driver and it works. I also tested on an
> NVMe disk, and it solves the condition which generates this message:
> [...]
>
> (that's on top of v5.4-rc1)

Ming, can you repost the series?

On Wed, Oct 02, 2019 at 08:36:52AM -0600, Jens Axboe wrote:
> On 10/2/19 3:56 AM, John Garry wrote:
> [...]
>
> Ming, can you repost the series?

It has been resent out just now.

Thanks,
Ming