Message ID: 20221118113052.1324140-1-yukuai1@huaweicloud.com (mailing list archive)
State: Changes Requested
Series: [RFC] scsi: core: remove unsed 'restarts' from scsi_device
On 11/18/22 03:30, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
Regarding the subject: unsed -> unused?
Hi,

On 2022/11/19 9:11, Bart Van Assche wrote:
> On 11/18/22 03:30, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>
> Regarding the subject: unsed -> unused?

Yes, sorry for the spelling mistake.
Hi,

On 2022/11/18 19:30, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> During code review, I found that 'restarts' is no longer useful after
> the following commits:
>
> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>    is a reason to kick")
> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>    is the last request")
> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>    and completion")
>
> Now, if getting budget ever fails, the block layer will make sure to
> trigger a new queue run for the hctx. Hence there is no need to run the
> queue from the SCSI layer in this case.

Does anyone have suggestions about this patch?

More info on why I tried to remove this:

While testing megaraid with 4 NVMe drives and the none elevator, the
default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
in the latter case.

Thanks,
Kuai
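For context, the block-layer behavior the commit message relies on sits at
the tail of blk_mq_dispatch_rq_list(). The following is a simplified,
non-verbatim sketch of the v6.1-era logic; the variable names match the
upstream code, but error paths and tag-wait handling are trimmed:

	/* end of blk_mq_dispatch_rq_list(), simplified sketch */
	needs_restart = blk_mq_sched_needs_restart(hctx);
	if (prep == PREP_DISPATCH_NO_BUDGET)
		needs_resource = true;	/* dispatch stopped: budget grab failed */
	if (!needs_restart ||
	    (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
		blk_mq_run_hw_queue(hctx, true);
	else if (needs_restart && needs_resource)
		/* the block layer itself re-kicks this hctx after a delay */
		blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

So once a budget grab fails during dispatch, a re-run of the failing hctx
is already scheduled by this path, which is what the patch relies on.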
On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
> Hi,
>
> On 2022/11/18 19:30, Yu Kuai wrote:
> > From: Yu Kuai <yukuai3@huawei.com>
> >
> > During code review, I found that 'restarts' is no longer useful after
> > the following commits:
> >
> > 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
> >    is a reason to kick")
> > 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
> >    is the last request")
> > 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
> >    and completion")
> >
> > Now, if getting budget ever fails, the block layer will make sure to
> > trigger a new queue run for the hctx. Hence there is no need to run the
> > queue from the SCSI layer in this case.
>
> Does anyone have suggestions about this patch?
>
> More info on why I tried to remove this:
>
> While testing megaraid with 4 NVMe drives and the none elevator, the
> default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
> bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
> to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
> in the latter case.

I will look at this patch next week. Can you investigate a bit where the
performance boost from this patch comes from?

Thanks,
Ming
On 2022/11/26 16:54, Yu Kuai wrote:
> Hi,
>
> On 2022/11/18 19:30, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> During code review, I found that 'restarts' is no longer useful after
>> the following commits:
>>
>> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>>    is a reason to kick")
>> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>>    is the last request")
>> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>>    and completion")
>>
>> Now, if getting budget ever fails, the block layer will make sure to
>> trigger a new queue run for the hctx. Hence there is no need to run the
>> queue from the SCSI layer in this case.
>
> Does anyone have suggestions about this patch?
>
> More info on why I tried to remove this:
>
> While testing megaraid with 4 NVMe drives and the none elevator, the
> default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
> bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
> to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
> in the latter case.

Hi Yu Kuai,

This information should be included in the commit message.

Thanks,
Jason
On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
> Hi,
>
> On 2022/11/18 19:30, Yu Kuai wrote:
> > From: Yu Kuai <yukuai3@huawei.com>
> >
> > During code review, I found that 'restarts' is no longer useful after
> > the following commits:
> >
> > 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
> >    is a reason to kick")
> > 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
> >    is the last request")
> > 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
> >    and completion")
> >
> > Now, if getting budget ever fails, the block layer will make sure to
> > trigger a new queue run for the hctx. Hence there is no need to run the
> > queue from the SCSI layer in this case.

But scsi_run_queue_async() needs to run all hw queues because the budget
is actually LUN/request-queue wide.

> Does anyone have suggestions about this patch?
>
> More info on why I tried to remove this:
>
> While testing megaraid with 4 NVMe drives and the none elevator, the
> default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
> bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
> to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
> in the latter case.

What are .can_queue and nr_hw_queues in your setup?

thanks,
Ming
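"Budget" here is the per-LUN budget_map in struct scsi_device, an sbitmap
sized by the device's queue_depth, which every hctx of the request queue
allocates from. A rough sketch of the v6.1-era allocation path (simplified,
not verbatim):

	/* the budget is per scsi_device (per LUN), not per hctx */
	static int scsi_mq_get_budget(struct request_queue *q)
	{
		struct scsi_device *sdev = q->queuedata;
		/*
		 * All hw queues of q compete for tokens from the same
		 * sdev->budget_map sbitmap inside scsi_dev_queue_ready().
		 */
		int token = scsi_dev_queue_ready(q, sdev);

		if (token >= 0)
			return token;
		/* budget-contention handling follows; see the patch at the
		 * end of this thread */
		...
	}

Because the tokens are shared, a token freed by a request completing on one
hctx may unblock a dispatch that previously failed on a different hctx.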
On 2022/11/28 11:27, Ming Lei wrote:
> On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
>> Hi,
>>
>> On 2022/11/18 19:30, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> During code review, I found that 'restarts' is no longer useful after
>>> the following commits:
>>>
>>> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>>>    is a reason to kick")
>>> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>>>    is the last request")
>>> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>>>    and completion")
>>>
>>> Now, if getting budget ever fails, the block layer will make sure to
>>> trigger a new queue run for the hctx. Hence there is no need to run the
>>> queue from the SCSI layer in this case.
>
> But scsi_run_queue_async() needs to run all hw queues because the budget
> is actually LUN/request-queue wide.

Why does the hw queue need to run if getting budget never failed in this
hw queue?

>> Does anyone have suggestions about this patch?
>>
>> More info on why I tried to remove this:
>>
>> While testing megaraid with 4 NVMe drives and the none elevator, the
>> default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
>> bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
>> to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
>> in the latter case.
>
> What are .can_queue and nr_hw_queues in your setup?

test cmd:
fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022
-rwmixread=70 -refill_buffers -filename=/dev/sdg -numjobs=128 -size=1TB
-runtime=60s -bs=4k -iodepth=2 -rw=randwrite

test environment:
arm64 Kunpeng-920, 128 cpus
megaraid with 4 NVMe drives, 128 hctxs, queue_depth 128
On 2022/11/28 11:35, Yu Kuai wrote:
>
> Why does the hw queue need to run if getting budget never failed in this
> hw queue?

And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
IO schedulers with multiple HW queues", scsi_run_queue_async() can only
guarantee to run the hw queue for the current cpu, not all the hw queues.
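The behavior referred to here is Jan's patch plus the later
BLK_MQ_F_SQ_SCHED refactoring. A simplified, non-verbatim sketch of the
v6.1-era blk_mq_run_hw_queues(): with a single-queue scheduler such as
mq-deadline or bfq, only the hctx mapped to the current CPU is preferred,
plus any hctx holding requests that bypass the scheduler:

	/* simplified sketch of blk_mq_run_hw_queues(), v6.1 era */
	void blk_mq_run_hw_queues(struct request_queue *q, bool async)
	{
		struct blk_mq_hw_ctx *hctx, *sq_hctx = NULL;
		unsigned long i;

		if (blk_queue_sq_sched(q))	/* mq-deadline, bfq */
			sq_hctx = blk_mq_get_sq_hctx(q);	/* current CPU's hctx */
		queue_for_each_hw_ctx(q, hctx, i) {
			if (blk_mq_hctx_stopped(hctx))
				continue;
			/*
			 * Run this hctx if there is no preferred hctx, if it
			 * is the preferred one, or if it holds requests that
			 * bypass the scheduler on its ->dispatch list.
			 */
			if (!sq_hctx || sq_hctx == hctx ||
			    !list_empty_careful(&hctx->dispatch))
				blk_mq_run_hw_queue(hctx, async);
		}
	}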
On Mon, Nov 28, 2022 at 11:35:18AM +0800, Yu Kuai wrote:
> On 2022/11/28 11:27, Ming Lei wrote:
> > On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
> > > Hi,
> > >
> > > On 2022/11/18 19:30, Yu Kuai wrote:
> > > > From: Yu Kuai <yukuai3@huawei.com>
> > > >
> > > > During code review, I found that 'restarts' is no longer useful after
> > > > the following commits:
> > > >
> > > > 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
> > > >    is a reason to kick")
> > > > 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
> > > >    is the last request")
> > > > 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
> > > >    and completion")
> > > >
> > > > Now, if getting budget ever fails, the block layer will make sure to
> > > > trigger a new queue run for the hctx. Hence there is no need to run the
> > > > queue from the SCSI layer in this case.
> >
> > But scsi_run_queue_async() needs to run all hw queues because the budget
> > is actually LUN/request-queue wide.
>
> Why does the hw queue need to run if getting budget never failed in this
> hw queue?

Because all hw queues share the queue-wide budget; once budget becomes
available, all hw queues are re-run, and a hw queue won't actually be
scheduled if there is nothing for it to run, see blk_mq_run_hw_queue().

> > > Does anyone have suggestions about this patch?
> > >
> > > More info on why I tried to remove this:
> > >
> > > While testing megaraid with 4 NVMe drives and the none elevator, the
> > > default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
> > > bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
> > > to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
> > > in the latter case.
> >
> > What are .can_queue and nr_hw_queues in your setup?
>
> test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022
> -rwmixread=70 -refill_buffers -filename=/dev/sdg -numjobs=128 -size=1TB
> -runtime=60s -bs=4k -iodepth=2 -rw=randwrite
>
> test environment:
> arm64 Kunpeng-920, 128 cpus
> megaraid with 4 NVMe drives, 128 hctxs, queue_depth 128

From your setup, megaraid should set ->host_tagset, which means there are
only 128 tags for all 4 NVMe drives (and the 128 hw queues share those 128
tags too). That looks like a really bad setup.

BTW, why do you drive the NVMe devices via megaraid instead of the nvme
driver?

> And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
> IO schedulers with multiple HW queues", scsi_run_queue_async() can only
> guarantee to run the hw queue for the current cpu, not all the hw queues.

That isn't true; each hctx is still run in the case of the none and kyber
schedulers.

thanks,
Ming
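The "won't actually be scheduled" part refers to the pending check at the
top of blk_mq_run_hw_queue(). A paraphrased sketch of the v6.1-era code
(the SRCU/RCU dispatch_ops protection around the check is omitted here):

	void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
	{
		bool need_run;

		/* a quiesced queue is re-run by blk_mq_unquiesce_queue() */
		need_run = !blk_queue_quiesced(hctx->queue) &&
			   blk_mq_hctx_has_pending(hctx);
		if (!need_run)
			return;		/* idle hctx: the re-run is a cheap no-op */

		__blk_mq_delay_run_hw_queue(hctx, async, 0);
	}

So re-running every hctx after budget becomes available mostly costs a
has-pending check per idle hctx, not a full dispatch.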
Hi,

On 2022/11/28 12:12, Ming Lei wrote:
> On Mon, Nov 28, 2022 at 11:35:18AM +0800, Yu Kuai wrote:
>> On 2022/11/28 11:27, Ming Lei wrote:
>>> On Sat, Nov 26, 2022 at 04:54:46PM +0800, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> On 2022/11/18 19:30, Yu Kuai wrote:
>>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>>
>>>>> During code review, I found that 'restarts' is no longer useful after
>>>>> the following commits:
>>>>>
>>>>> 1) commit ab3cee3762e5 ("blk-mq: In blk_mq_dispatch_rq_list() "no budget"
>>>>>    is a reason to kick")
>>>>> 2) commit d3b38596875d ("blk-mq: run queue no matter whether the request
>>>>>    is the last request")
>>>>> 3) commit 673235f91531 ("scsi: core: Fix race between handling STS_RESOURCE
>>>>>    and completion")
>>>>>
>>>>> Now, if getting budget ever fails, the block layer will make sure to
>>>>> trigger a new queue run for the hctx. Hence there is no need to run the
>>>>> queue from the SCSI layer in this case.
>>>
>>> But scsi_run_queue_async() needs to run all hw queues because the budget
>>> is actually LUN/request-queue wide.
>>
>> Why does the hw queue need to run if getting budget never failed in this
>> hw queue?
>
> Because all hw queues share the queue-wide budget; once budget becomes
> available, all hw queues are re-run, and a hw queue won't actually be
> scheduled if there is nothing for it to run, see blk_mq_run_hw_queue().

I still don't get why all hw queues should be re-run. Is this just for
performance, or does it fix a bug? And I'm not sure this behavior is good
for performance.

>>>> Does anyone have suggestions about this patch?
>>>>
>>>> More info on why I tried to remove this:
>>>>
>>>> While testing megaraid with 4 NVMe drives and the none elevator, the
>>>> default queue_depth is 128. Testing with fio at 128 jobs and iodepth 1,
>>>> bandwidth is about 4 Gib/s; with 128 jobs and iodepth 2, bandwidth drops
>>>> to about 0.8 Gib/s. With this patch applied, bandwidth stays at 4 Gib/s
>>>> in the latter case.
>>>
>>> What are .can_queue and nr_hw_queues in your setup?
>>
>> test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022
>> -rwmixread=70 -refill_buffers -filename=/dev/sdg -numjobs=128 -size=1TB
>> -runtime=60s -bs=4k -iodepth=2 -rw=randwrite
>>
>> test environment:
>> arm64 Kunpeng-920, 128 cpus
>> megaraid with 4 NVMe drives, 128 hctxs, queue_depth 128
>
> From your setup, megaraid should set ->host_tagset, which means there are
> only 128 tags for all 4 NVMe drives (and the 128 hw queues share those 128
> tags too). That looks like a really bad setup.

It's true that the setup is bad, but the point here is to trigger budget
allocation failures frequently. If I set queue_depth to 2048 and use 128
jobs with iodepth 32, performance is still bad.

> BTW, why do you drive the NVMe devices via megaraid instead of the nvme
> driver?
>
>> And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
>> IO schedulers with multiple HW queues", scsi_run_queue_async() can only
>> guarantee to run the hw queue for the current cpu, not all the hw queues.
>
> That isn't true; each hctx is still run in the case of the none and kyber
> schedulers.

Yes, but the current default elevator shared across hctxs is deadline.
Hi,

> Hi,
>
> On 2022/11/28 12:12, Ming Lei wrote:
>> BTW, why do you drive the NVMe devices via megaraid instead of the nvme
>> driver?
>>
>>> And by the way, after Jan's patch "blk-mq: Improve performance of non-mq
>>> IO schedulers with multiple HW queues", scsi_run_queue_async() can only
>>> guarantee to run the hw queue for the current cpu, not all the hw queues.
>>
>> That isn't true; each hctx is still run in the case of the none and kyber
>> schedulers.

And I really suspect that the reason Jan's patch improves performance is
that it avoids many queue runs from scsi_run_queue_async().

> Yes, but the current default elevator shared across hctxs is deadline.
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 56f641ba1261..f6325a0f80fb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -507,24 +507,6 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
 	if (scsi_target(sdev)->single_lun ||
 	    !list_empty(&sdev->host->starved_list)) {
 		kblockd_schedule_work(&sdev->requeue_work);
-	} else {
-		/*
-		 * smp_mb() present in sbitmap_queue_clear() or implied in
-		 * .end_io is for ordering writing .device_busy in
-		 * scsi_device_unbusy() and reading sdev->restarts.
-		 */
-		int old = atomic_read(&sdev->restarts);
-
-		/*
-		 * ->restarts has to be kept as non-zero if new budget
-		 *  contention occurs.
-		 *
-		 * No need to run queue when either another re-run
-		 * queue wins in updating ->restarts or a new budget
-		 * contention occurs.
-		 */
-		if (old && atomic_cmpxchg(&sdev->restarts, old, 0) == old)
-			blk_mq_run_hw_queues(sdev->request_queue, true);
 	}
 }
 
@@ -1666,23 +1648,6 @@ static int scsi_mq_get_budget(struct request_queue *q)
 	if (token >= 0)
 		return token;
 
-	atomic_inc(&sdev->restarts);
-
-	/*
-	 * Orders atomic_inc(&sdev->restarts) and atomic_read(&sdev->device_busy).
-	 * .restarts must be incremented before .device_busy is read because the
-	 * code in scsi_run_queue_async() depends on the order of these operations.
-	 */
-	smp_mb__after_atomic();
-
-	/*
-	 * If all in-flight requests originated from this LUN are completed
-	 * before reading .device_busy, sdev->device_busy will be observed as
-	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
-	 * soon. Otherwise, completion of one of these requests will observe
-	 * the .restarts flag, and the request queue will be run for handling
-	 * this request, see scsi_end_request().
-	 */
 	if (unlikely(scsi_device_busy(sdev) == 0 &&
 		     !scsi_device_blocked(sdev)))
 		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 24bdbf7999ab..66345de80897 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -115,7 +115,6 @@ struct scsi_device {
 	struct sbitmap budget_map;
 	atomic_t device_blocked;	/* Device returned QUEUE_FULL. */
 
-	atomic_t restarts;
 	spinlock_t list_lock;
 	struct list_head starved_entry;
 	unsigned short queue_depth;	/* How deep of a queue we want */