[v2] block: reduce kblockd_mod_delayed_work_on() CPU consumption

Message ID 0eb94fa3-a1d0-f9b3-fb51-c22eaad225a7@kernel.dk (mailing list archive)
State New, archived
Series [v2] block: reduce kblockd_mod_delayed_work_on() CPU consumption

Commit Message

Jens Axboe Dec. 14, 2021, 8:49 p.m. UTC
Dexuan reports that he's seeing spikes of very heavy CPU utilization when
running 24 disks and using the 'none' scheduler. This happens off the
sched restart path, because SCSI requires the queue to be restarted async,
and hence we're hammering on mod_delayed_work_on() to ensure that the work
item gets run appropriately.

Avoid hammering on the timer and just use queue_work_on() if no delay
has been specified.

Reported-and-tested-by: Dexuan Cui <decui@microsoft.com>
Link: https://lore.kernel.org/linux-block/BYAPR21MB1270C598ED214C0490F47400BF719@BYAPR21MB1270.namprd21.prod.outlook.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

Comments

Ming Lei Dec. 15, 2021, 2:51 a.m. UTC | #1
On Tue, Dec 14, 2021 at 01:49:34PM -0700, Jens Axboe wrote:
> Dexuan reports that he's seeing spikes of very heavy CPU utilization when
> running 24 disks and using the 'none' scheduler. This happens off the
> sched restart path, because SCSI requires the queue to be restarted async,
> and hence we're hammering on mod_delayed_work_on() to ensure that the work
> item gets run appropriately.
> 
> Avoid hammering on the timer and just use queue_work_on() if no delay
> has been specified.
> 
> Reported-and-tested-by: Dexuan Cui <decui@microsoft.com>
> Link: https://lore.kernel.org/linux-block/BYAPR21MB1270C598ED214C0490F47400BF719@BYAPR21MB1270.namprd21.prod.outlook.com/
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 1378d084c770..c1833f95cb97 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1484,6 +1484,8 @@ EXPORT_SYMBOL(kblockd_schedule_work);
>  int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
>  				unsigned long delay)
>  {
> +	if (!delay)
> +		return queue_work_on(cpu, kblockd_workqueue, &dwork->work);
>  	return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);

Reviewed-by: Ming Lei <ming.lei@redhat.com>


Thanks,
Ming
John Garry Dec. 15, 2021, 10:25 a.m. UTC | #2
On 14/12/2021 20:49, Jens Axboe wrote:
> Dexuan reports that he's seeing spikes of very heavy CPU utilization when
> running 24 disks and using the 'none' scheduler. This happens off the
> sched restart path, because SCSI requires the queue to be restarted async,
> and hence we're hammering on mod_delayed_work_on() to ensure that the work
> item gets run appropriately.
> 
> Avoid hammering on the timer and just use queue_work_on() if no delay
> has been specified.
> 
> Reported-and-tested-by: Dexuan Cui <decui@microsoft.com>
> Link: https://lore.kernel.org/linux-block/BYAPR21MB1270C598ED214C0490F47400BF719@BYAPR21MB1270.namprd21.prod.outlook.com/
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 1378d084c770..c1833f95cb97 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1484,6 +1484,8 @@ EXPORT_SYMBOL(kblockd_schedule_work);
>   int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
>   				unsigned long delay)
>   {
> +	if (!delay)
> +		return queue_work_on(cpu, kblockd_workqueue, &dwork->work);
>   	return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
>   }
>   EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
> 

Hi Jens,

I have a related comment on the current code and the interface it uses, if
you don't mind, as I did wonder whether we are doing a msecs_to_jiffies(0 [not
built-in const]) call somewhere.

So we pass msecs to blk-mq.c, and we do a msecs_to_jiffies() call on it
before calling kblockd_mod_delayed_work_on(). Now most/all callsites
use a const value for the msec value, so if we did the msecs_to_jiffies()
conversion at the callsites and passed a jiffies value, the conversion
should be compiled out by gcc. This is my current __blk_mq_delay_run_hw_queue
assembler:

0000000000001ef0 <__blk_mq_delay_run_hw_queue>:
     [snip]
     2024: a942dfb6 ldp x22, x23, [x29, #40]
     2028: 2a1503e0 mov w0, w21
     202c: 94000000 bl 0 <__msecs_to_jiffies>
kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
     2030: aa0003e2 mov x2, x0
     2034: 91010261 add x1, x19, #0x40
     2038: 2a1403e0 mov w0, w20
     203c: 94000000 bl 0 <kblockd_mod_delayed_work_on>

I'm not sure if you would want to change so many APIs, or if jiffies is
sensible to pass, or whether there would even be any performance gain.
Additionally, blk_mq_delay_kick_requeue_list() would not see much gain
from such a change, as its msec value is not const. Any thoughts? Maybe
testing performance would not do much harm.

Thanks,
John
Jens Axboe Dec. 15, 2021, 3:47 p.m. UTC | #3
On 12/15/21 3:25 AM, John Garry wrote:
> On 14/12/2021 20:49, Jens Axboe wrote:
>> Dexuan reports that he's seeing spikes of very heavy CPU utilization when
>> running 24 disks and using the 'none' scheduler. This happens off the
>> sched restart path, because SCSI requires the queue to be restarted async,
>> and hence we're hammering on mod_delayed_work_on() to ensure that the work
>> item gets run appropriately.
>>
>> Avoid hammering on the timer and just use queue_work_on() if no delay
>> has been specified.
>>
>> Reported-and-tested-by: Dexuan Cui <decui@microsoft.com>
>> Link: https://lore.kernel.org/linux-block/BYAPR21MB1270C598ED214C0490F47400BF719@BYAPR21MB1270.namprd21.prod.outlook.com/
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 1378d084c770..c1833f95cb97 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1484,6 +1484,8 @@ EXPORT_SYMBOL(kblockd_schedule_work);
>>   int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
>>   				unsigned long delay)
>>   {
>> +	if (!delay)
>> +		return queue_work_on(cpu, kblockd_workqueue, &dwork->work);
>>   	return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
>>   }
>>   EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
>>
> 
> Hi Jens,
> 
> I have a related comment on the current code and the interface it uses, if
> you don't mind, as I did wonder whether we are doing a msecs_to_jiffies(0 [not
> built-in const]) call somewhere.
> 
> So we pass msecs to blk-mq.c, and we do a msecs_to_jiffies() call on it
> before calling kblockd_mod_delayed_work_on(). Now most/all callsites
> use a const value for the msec value, so if we did the msecs_to_jiffies()
> conversion at the callsites and passed a jiffies value, the conversion
> should be compiled out by gcc. This is my current __blk_mq_delay_run_hw_queue
> assembler:
> 
> 0000000000001ef0 <__blk_mq_delay_run_hw_queue>:
>      [snip]
>      2024: a942dfb6 ldp x22, x23, [x29, #40]
>      2028: 2a1503e0 mov w0, w21
>      202c: 94000000 bl 0 <__msecs_to_jiffies>
> kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
>      2030: aa0003e2 mov x2, x0
>      2034: 91010261 add x1, x19, #0x40
>      2038: 2a1403e0 mov w0, w20
>      203c: 94000000 bl 0 <kblockd_mod_delayed_work_on>
> 
> I'm not sure if you would want to change so many APIs, or if jiffies is
> sensible to pass, or whether there would even be any performance gain.
> Additionally, blk_mq_delay_kick_requeue_list() would not see much gain
> from such a change, as its msec value is not const. Any thoughts? Maybe
> testing performance would not do much harm.

In general I totally agree with you, it'd be smarter to flip the
conversion so it can be done in a more efficient manner. At the same
time, the queue delay running is not at all a fast path, so shouldn't
really matter in practice.
John Garry Dec. 16, 2021, 12:43 p.m. UTC | #4
On 15/12/2021 15:47, Jens Axboe wrote:
>> 0000000000001ef0 <__blk_mq_delay_run_hw_queue>:
>>       [snip]
>>       2024: a942dfb6 ldp x22, x23, [x29, #40]
>>       2028: 2a1503e0 mov w0, w21
>>       202c: 94000000 bl 0 <__msecs_to_jiffies>
>> kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
>>       2030: aa0003e2 mov x2, x0
>>       2034: 91010261 add x1, x19, #0x40
>>       2038: 2a1403e0 mov w0, w20
>>       203c: 94000000 bl 0 <kblockd_mod_delayed_work_on>
>>
>> I'm not sure if you would want to change so many APIs, or if jiffies is
>> sensible to pass, or whether there would even be any performance gain.
>> Additionally, blk_mq_delay_kick_requeue_list() would not see much gain
>> from such a change, as its msec value is not const. Any thoughts? Maybe
>> testing performance would not do much harm.
> In general I totally agree with you, it'd be smarter to flip the
> conversion so it can be done in a more efficient manner. At the same
> time, the queue delay running is not at all a fast path, so shouldn't
> really matter in practice.

OK, I just thought, from checking your change, that we have a
frequent msecs_to_jiffies(0 [non const]) call in
__blk_mq_delay_run_hw_queue() -> kblockd_mod_delayed_work_on().

Thanks,
John
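
For readers following the conversion discussion above, here is a minimal
user-space sketch of why converting at the call site can be folded away
when the millisecond value is a literal. It is an illustration under
stated assumptions, not kernel code: SKETCH_HZ (250 here) and the
sketch_* names are hypothetical, and __builtin_constant_p is a GCC/Clang
extension.

```c
#include <assert.h>

/* Assumed tick rate for this sketch only; the kernel's HZ is set at build time. */
#define SKETCH_HZ 250ULL

/*
 * Hypothetical out-of-line conversion, standing in for the function call
 * visible in the disassembly. Rounds up to the next whole tick.
 */
static unsigned long sketch_msecs_to_jiffies_slow(unsigned int ms)
{
	return (unsigned long)(((unsigned long long)ms * SKETCH_HZ + 999ULL) / 1000ULL);
}

/*
 * Hypothetical call-site conversion: when the argument is a compile-time
 * constant the arithmetic is an integer constant expression the compiler
 * can fold; otherwise fall back to the out-of-line helper.
 */
#define sketch_msecs_to_jiffies(ms)					\
	((unsigned long)(__builtin_constant_p(ms)			\
		? (((unsigned long long)(ms) * SKETCH_HZ + 999ULL) / 1000ULL) \
		: (unsigned long long)sketch_msecs_to_jiffies_slow(ms)))

/* Self-check: the folded form and the runtime helper agree on literals. */
static void sketch_selftest(void)
{
	assert(sketch_msecs_to_jiffies(0) == sketch_msecs_to_jiffies_slow(0));
	assert(sketch_msecs_to_jiffies(3) == sketch_msecs_to_jiffies_slow(3));
	assert(sketch_msecs_to_jiffies(100) == sketch_msecs_to_jiffies_slow(100));
}
```

With a literal argument such as sketch_msecs_to_jiffies(3), an optimizing
compiler can emit the result as an immediate; with a runtime value, as in
the blk_mq_delay_kick_requeue_list() case mentioned above, the helper call
remains.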
Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 1378d084c770..c1833f95cb97 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1484,6 +1484,8 @@  EXPORT_SYMBOL(kblockd_schedule_work);
 int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
 				unsigned long delay)
 {
+	if (!delay)
+		return queue_work_on(cpu, kblockd_workqueue, &dwork->work);
 	return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
 }
 EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
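
The effect of the zero-delay short circuit can be illustrated with a small
user-space model. This is only a sketch: the mock_* names, the struct, and
the counters are hypothetical stand-ins, and the real workqueue code does
considerably more (locking, per-CPU pools, timer handling). It models just
the return-value convention, where queueing returns false if the work is
already pending and modifying returns whether the work was pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a delayed work item. */
struct mock_delayed_work {
	bool pending;		/* models "work is already queued" */
	unsigned long delay;	/* last delay programmed via the timer path */
};

static unsigned long timer_mods;	/* calls that took the timer path */
static unsigned long direct_queues;	/* calls that actually queued work */

/* Models queue_work_on(): a no-op returning false if already pending. */
static bool mock_queue_work(struct mock_delayed_work *dw)
{
	assert(dw != NULL);
	if (dw->pending)
		return false;
	dw->pending = true;
	direct_queues++;
	return true;
}

/* Models mod_delayed_work_on(): always reprograms, even if pending. */
static bool mock_mod_delayed_work(struct mock_delayed_work *dw,
				  unsigned long delay)
{
	bool was_pending;

	assert(dw != NULL);
	was_pending = dw->pending;
	dw->pending = true;
	dw->delay = delay;
	timer_mods++;
	return was_pending;
}

/* Same shape as the patched function: skip the timer path when delay is 0. */
static bool mock_kblockd_mod_delayed_work(struct mock_delayed_work *dw,
					  unsigned long delay)
{
	if (!delay)
		return mock_queue_work(dw);
	return mock_mod_delayed_work(dw, delay);
}
```

In this model, calling mock_kblockd_mod_delayed_work() repeatedly with a
zero delay queues the work once and then returns early on every later
call, while a non-zero delay still goes through the modify path each time.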