Message ID: 20221118113052.1324140-1-yukuai1@huaweicloud.com (mailing list archive)
State: Changes Requested
Series: [RFC] scsi: core: remove unsed 'restarts' from scsi_device
On 11/18/22 03:30, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
Regarding the subject: unsed -> unused?
Hi,

On 2022/11/19 9:11, Bart Van Assche wrote:
> On 11/18/22 03:30, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>
> Regarding the subject: unsed -> unused?

Yes, sorry for the spelling mistake.
Hi,

On 2022/11/18 19:30, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> During code review, I found that 'restarts' is no longer useful after
> the following commits:
>
> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>    is a reason to kick")
> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>    is the last request")
> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>    and completion")
>
> Now, if getting budget ever fails, the block layer will make sure to
> trigger a new queue run for the hctx. Hence there is no need to run the
> queue from the SCSI layer in this case.

Does anyone have suggestions about this patch?

More info on why I tried to remove this:

While testing megaraid with 4 NVMe drives and the none elevator, the
default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
in the latter case.

Thanks,
Kuai
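For context, the block-layer behavior the commit message relies on sits at
the tail of blk_mq_dispatch_rq_list(). The following is a simplified,
non-verbatim sketch of the v6.1-era logic; the variable names match the
upstream code, but error paths and tag-wait handling are trimmed:

	/* end of blk_mq_dispatch_rq_list(), simplified sketch */
	needs_restart = blk_mq_sched_needs_restart(hctx);
	if (prep == PREP_DISPATCH_NO_BUDGET)
		needs_resource = true;	/* dispatch stopped: budget grab failed */
	if (!needs_restart ||
	    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
		blk_mq_run_hw_queue(hctx, true);
	else if (needs_restart && needs_resource)
		/* the block layer itself re-kicks this hctx after a delay */
		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

So once a budget grab fails during dispatch, a re-run of the failing hctx
is already scheduled by this path, which is what the patch relies on.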
On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
> Hi,
>
> On 2022/11/18 19:30, Yu Kuai wrote:
> > From: Yu Kuai <yukuai3@huawei.com>
> >
> > During code review, I found that 'restarts' is no longer useful after
> > the following commits:
> >
> > 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
> >    is a reason to kick")
> > 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
> >    is the last request")
> > 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
> >    and completion")
> >
> > Now, if getting budget ever fails, the block layer will make sure to
> > trigger a new queue run for the hctx. Hence there is no need to run the
> > queue from the SCSI layer in this case.
>
> Does anyone have suggestions about this patch?
>
> More info on why I tried to remove this:
>
> While testing megaraid with 4 NVMe drives and the none elevator, the
> default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
> bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
> to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
> in the latter case.

I will look at this patch next week. Can you investigate a bit where the
performance boost from this patch comes from?

Thanks,
Ming
On 2022/11/26 16:54, Yu Kuai wrote:
> Hi,
>
> On 2022/11/18 19:30, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> During code review, I found that 'restarts' is no longer useful after
>> the following commits:
>>
>> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>>    is a reason to kick")
>> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>>    is the last request")
>> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>>    and completion")
>>
>> Now, if getting budget ever fails, the block layer will make sure to
>> trigger a new queue run for the hctx. Hence there is no need to run the
>> queue from the SCSI layer in this case.
>
> Does anyone have suggestions about this patch?
>
> More info on why I tried to remove this:
>
> While testing megaraid with 4 NVMe drives and the none elevator, the
> default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
> bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
> to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
> in the latter case.

Hi Yu Kuai,

This information should be included in the commit message.

Thanks,
Jason
On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
> Hi,
>
> On 2022/11/18 19:30, Yu Kuai wrote:
> > From: Yu Kuai <yukuai3@huawei.com>
> >
> > During code review, I found that 'restarts' is no longer useful after
> > the following commits:
> >
> > 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
> >    is a reason to kick")
> > 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
> >    is the last request")
> > 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
> >    and completion")
> >
> > Now, if getting budget ever fails, the block layer will make sure to
> > trigger a new queue run for the hctx. Hence there is no need to run the
> > queue from the SCSI layer in this case.

But scsi_run_queue_async() needs to run all hw queues because the budget
is actually LUN/request-queue wide.

> Does anyone have suggestions about this patch?
>
> More info on why I tried to remove this:
>
> While testing megaraid with 4 NVMe drives and the none elevator, the
> default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
> bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
> to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
> in the latter case.

What are .can_queue and nr_hw_queues in your setup?

thanks,
Ming
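"Budget" here is the per-LUN budget_map in struct scsi_device, an sbitmap
sized by the device's queue_depth, which every hctx of the request queue
allocates from. A rough sketch of the v6.1-era allocation path (simplified,
not verbatim):

	/* the budget is per scsi_device (per LUN), not per hctx */
	static int scsi_mq_get_budget(struct request_queue *q)
	{
		struct scsi_device *sdev = q->queuedata;
		/*
		 * All hw queues of q compete for tokens from the same
		 * sdev->budget_map sbitmap inside scsi_dev_queue_ready().
		 */
		int token = scsi_dev_queue_ready(q, sdev);

		if (token >= 0)
			return token;
		/* budget-contention handling follows; see the patch at the
		 * end of this thread */
		...
	}

Because the tokens are shared, a token freed by a request completing on one
hctx may unblock a dispatch that previously failed on a different hctx.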
On 2022/11/28 11:27, Ming Lei wrote:
> On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2022/11/18 19:30, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> During code review, I found that 'restarts' is no longer useful after
>>> the following commits:
>>>
>>> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>>>    is a reason to kick")
>>> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>>>    is the last request")
>>> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>>>    and completion")
>>>
>>> Now, if getting budget ever fails, the block layer will make sure to
>>> trigger a new queue run for the hctx. Hence there is no need to run the
>>> queue from the SCSI layer in this case.
>
> But scsi_run_queue_async() needs to run all hw queues because the budget
> is actually LUN/request-queue wide.

Why does the hw queue need to run if getting budget never failed in this
hw queue?

>> Does anyone have suggestions about this patch?
>>
>> More info on why I tried to remove this:
>>
>> While testing megaraid with 4 NVMe drives and the none elevator, the
>> default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
>> bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
>> to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
>> in the latter case.
>
> What are .can_queue and nr_hw_queues in your setup?

test cmd:
fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022
-rwmixread=70 -refill_buffers -filename=/dev/sdg -numjobs=128 -size=1TB
-runtime=60s -bs=4k -iodepth=2 -rw=randwrite

test environment:
arm64 Kunpeng-920, 128 cpus
megaraid with 4 NVMe drives, 128 hctxs, queue_depth 128
On 2022/11/28 11:35, Yu Kuai wrote:
>
> Why does the hw queue need to run if getting budget never failed in this
> hw queue?

And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
IO schedulers with multiple HW queues", scsi_run_queue_async() can only
guarantee to run the hw queue for the current cpu, not all the hw queues.
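The behavior referred to here is Jan's patch plus the later
BLK_MQ_F_SQ_SCHED refactoring. A simplified, non-verbatim sketch of the
v6.1-era blk_mq_run_hw_queues(): with a single-queue scheduler such as
mq-deadline or bfq, only the hctx mapped to the current CPU is preferred,
plus any hctx holding requests that bypass the scheduler:

	/* simplified sketch of blk_mq_run_hw_queues(), v6.1 era */
	void blk_mq_run_hw_queues(struct request_queue *q, bool async)
	{
		struct blk_mq_hw_ctx *hctx, *sq_hctx = NULL;
		unsigned long i;

		if (blk_queue_sq_sched(q))	/* mq-deadline, bfq */
			sq_hctx = blk_mq_get_sq_hctx(q);	/* current CPU's hctx */
		queue_for_each_hw_ctx(q, hctx, i) {
			if (blk_mq_hctx_stopped(hctx))
				continue;
			/*
			 * Run this hctx if there is no preferred hctx, if it
			 * is the preferred one, or if it holds requests that
			 * bypass the scheduler on its ->dispatch list.
			 */
			if (!sq_hctx || sq_hctx == hctx ||
			    !list_empty_careful(&hctx->dispatch))
				blk_mq_run_hw_queue(hctx, async);
		}
	}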
On Mon, Nov 28, 2022 at 11:35:18AM +0800, Yu Kuai wrote:
> On 2022/11/28 11:27, Ming Lei wrote:
> > On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
> > > Hi,
> > >
> > > On 2022/11/18 19:30, Yu Kuai wrote:
> > > > From: Yu Kuai <yukuai3@huawei.com>
> > > >
> > > > During code review, I found that 'restarts' is no longer useful after
> > > > the following commits:
> > > >
> > > > 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
> > > >    is a reason to kick")
> > > > 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
> > > >    is the last request")
> > > > 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
> > > >    and completion")
> > > >
> > > > Now, if getting budget ever fails, the block layer will make sure to
> > > > trigger a new queue run for the hctx. Hence there is no need to run the
> > > > queue from the SCSI layer in this case.
> >
> > But scsi_run_queue_async() needs to run all hw queues because the budget
> > is actually LUN/request-queue wide.
>
> Why does the hw queue need to run if getting budget never failed in this
> hw queue?

Because all hw queues share the queue-wide budget; once budget becomes
available, all hw queues are re-run, and a hw queue won't actually be
scheduled if there is nothing for it to run, see blk_mq_run_hw_queue().

> > > Does anyone have suggestions about this patch?
> > >
> > > More info on why I tried to remove this:
> > >
> > > While testing megaraid with 4 NVMe drives and the none elevator, the
> > > default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
> > > bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
> > > to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
> > > in the latter case.
> >
> > What are .can_queue and nr_hw_queues in your setup?
>
> test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022
> -rwmixread=70 -refill_buffers -filename=/dev/sdg -numjobs=128 -size=1TB
> -runtime=60s -bs=4k -iodepth=2 -rw=randwrite
>
> test environment:
> arm64 Kunpeng-920, 128 cpus
> megaraid with 4 NVMe drives, 128 hctxs, queue_depth 128

From your setup, megaraid should set ->host_tagset, which means there are
only 128 tags for all 4 NVMe drives (and the 128 hw queues share those 128
tags too). That looks like a really bad setup.

BTW, why do you drive the NVMe devices via megaraid instead of the nvme
driver?

> And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
> IO schedulers with multiple HW queues", scsi_run_queue_async() can only
> guarantee to run the hw queue for the current cpu, not all the hw queues.

That isn't true; each hctx is still run in the case of the none and kyber
schedulers.

thanks,
Ming
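The "won't actually be scheduled" part refers to the pending check at the
top of blk_mq_run_hw_queue(). A paraphrased sketch of the v6.1-era code
(the SRCU/RCU dispatch_ops protection around the check is omitted here):

	void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
	{
		bool need_run;

		/* a quiesced queue is re-run by blk_mq_unquiesce_queue() */
		need_run = !blk_queue_quiesced(hctx->queue) &&
			   blk_mq_hctx_has_pending(hctx);
		if (!need_run)
			return;		/* idle hctx: the re-run is a cheap no-op */

		__blk_mq_delay_run_hw_queue(hctx, async, 0);
	}

So re-running every hctx after budget becomes available mostly costs a
has-pending check per idle hctx, not a full dispatch.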
Hi,

On 2022/11/28 12:12, Ming Lei wrote:
> On Mon, Nov 28, 2022 at 11:35:18AM +0800, Yu Kuai wrote:
>> On 2022/11/28 11:27, Ming Lei wrote:
>>> On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> On 2022/11/18 19:30, Yu Kuai wrote:
>>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>>
>>>>> During code review, I found that 'restarts' is no longer useful after
>>>>> the following commits:
>>>>>
>>>>> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>>>>>    is a reason to kick")
>>>>> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>>>>>    is the last request")
>>>>> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>>>>>    and completion")
>>>>>
>>>>> Now, if getting budget ever fails, the block layer will make sure to
>>>>> trigger a new queue run for the hctx. Hence there is no need to run the
>>>>> queue from the SCSI layer in this case.
>>>
>>> But scsi_run_queue_async() needs to run all hw queues because the budget
>>> is actually LUN/request-queue wide.
>>
>> Why does the hw queue need to run if getting budget never failed in this
>> hw queue?
>
> Because all hw queues share the queue-wide budget; once budget becomes
> available, all hw queues are re-run, and a hw queue won't actually be
> scheduled if there is nothing for it to run, see blk_mq_run_hw_queue().

I still don't get why all hw queues should be re-run. Is this just for
performance, or does it fix a bug? And I'm not sure this behavior is good
for performance.

>>>> Does anyone have suggestions about this patch?
>>>>
>>>> More info on why I tried to remove this:
>>>>
>>>> While testing megaraid with 4 NVMe drives and the none elevator, the
>>>> default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
>>>> bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
>>>> to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
>>>> in the latter case.
>>>
>>> What are .can_queue and nr_hw_queues in your setup?
>>
>> test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022
>> -rwmixread=70 -refill_buffers -filename=/dev/sdg -numjobs=128 -size=1TB
>> -runtime=60s -bs=4k -iodepth=2 -rw=randwrite
>>
>> test environment:
>> arm64 Kunpeng-920, 128 cpus
>> megaraid with 4 NVMe drives, 128 hctxs, queue_depth 128
>
> From your setup, megaraid should set ->host_tagset, which means there are
> only 128 tags for all 4 NVMe drives (and the 128 hw queues share those 128
> tags too). That looks like a really bad setup.

It's true that the setup is bad, but the point here is to trigger budget
allocation failures frequently. If I set queue_depth to 2048 and use 128
jobs with iodepth 32, performance is still bad.

> BTW, why do you drive the NVMe devices via megaraid instead of the nvme
> driver?
>
>> And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
>> IO schedulers with multiple HW queues", scsi_run_queue_async() can only
>> guarantee to run the hw queue for the current cpu, not all the hw queues.
>
> That isn't true; each hctx is still run in the case of the none and kyber
> schedulers.

Yes, but the current default elevator shared across hctxs is deadline.
Hi,

> Hi,
>
> On 2022/11/28 12:12, Ming Lei wrote:
>> BTW, why do you drive the NVMe devices via megaraid instead of the nvme
>> driver?
>>
>>> And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
>>> IO schedulers with multiple HW queues", scsi_run_queue_async() can only
>>> guarantee to run the hw queue for the current cpu, not all the hw queues.
>>
>> That isn't true; each hctx is still run in the case of the none and kyber
>> schedulers.

And I really suspect that the reason Jan's patch improves performance is
that it avoids many queue runs from scsi_run_queue_async().

> Yes, but the current default elevator shared across hctxs is deadline.
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 56f641ba1261..f6325a0f80fb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -507,24 +507,6 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
 	if (scsi_target(sdev)->single_lun ||
 	    !list_empty(&sdev->host->starved_list)) {
 		kblockd_schedule_work(&sdev->requeue_work);
-	} else {
-		/*
-		 * smp_mb() present in sbitmap_queue_clear() or implied in
-		 * .end_io is for ordering writing .device_busy in
-		 * scsi_device_unbusy() and reading sdev->restarts.
-		 */
-		int old = atomic_read(&sdev->restarts);
-
-		/*
-		 * ->restarts has to be kept as non-zero if new budget
-		 *  contention occurs.
-		 *
-		 * No need to run queue when either another re-run
-		 * queue wins in updating ->restarts or a new budget
-		 * contention occurs.
-		 */
-		if (old && atomic_cmpxchg(&sdev->restarts, old, 0) == old)
-			blk_mq_run_hw_queues(sdev->request_queue, true);
 	}
 }
 
@@ -1666,23 +1648,6 @@ static int scsi_mq_get_budget(struct request_queue *q)
 	if (token >= 0)
 		return token;
 
-	atomic_inc(&sdev->restarts);
-
-	/*
-	 * Orders atomic_inc(&sdev->restarts) and atomic_read(&sdev->device_busy).
-	 * .restarts must be incremented before .device_busy is read because the
-	 * code in scsi_run_queue_async() depends on the order of these operations.
-	 */
-	smp_mb__after_atomic();
-
-	/*
-	 * If all in-flight requests originated from this LUN are completed
-	 * before reading .device_busy, sdev->device_busy will be observed as
-	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
-	 * soon. Otherwise, completion of one of these requests will observe
-	 * the .restarts flag, and the request queue will be run for handling
-	 * this request, see scsi_end_request().
-	 */
 	if (unlikely(scsi_device_busy(sdev) == 0 &&
 		     !scsi_device_blocked(sdev)))
 		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 24bdbf7999ab..66345de80897 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -115,7 +115,6 @@ struct scsi_device {
 	struct sbitmap budget_map;
 	atomic_t device_blocked;	/* Device returned QUEUE_FULL. */
 
-	atomic_t restarts;
 	spinlock_t list_lock;
 	struct list_head starved_entry;
 	unsigned short queue_depth;	/* How deep of a queue we want */