[RFC,0/2] block: CPU latency PM QoS tuning

Series

block: CPU latency PM QoS tuning
  [RFC,0/2] block: CPU latency PM QoS tuning
  [RFC,1/2] bdev: add support for CPU latency PM QoS tuning
  [RFC,2/2] block/genhd: add sysfs knobs for the CPU latency PM QoS settings

Message ID

20240829075423.1345042-1-tero.kristo@linux.intel.com (mailing list archive)

Headers

From: Tero Kristo <tero.kristo@linux.intel.com>
To: axboe@kernel.dk
Cc: linux-block@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: [RFC PATCH 0/2] block: CPU latency PM QoS tuning
Date: Thu, 29 Aug 2024 10:18:18 +0300
Message-ID: <20240829075423.1345042-1-tero.kristo@linux.intel.com>
Precedence: bulk
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

Message

Tero Kristo Aug. 29, 2024, 7:18 a.m. UTC

Hello,

These patches introduce a mechanism for limiting deep CPU idle states
during block IO. With certain workloads, the CPU can enter a deep idle
state while waiting for IO completion, which adds significant latency
to the completion interrupt. See the example below, where I used an
Intel Icelake Xeon system to run a simple 'fio' test with random reads,
with the CPU C6 state enabled and disabled (results from 2 * 2min
runs):

C6 enabled:
    slat (nsec): min=1769, max=73247, avg=6960.96, stdev=2115.90
    clat (nsec): min=442, max=242706, avg=23767.06, stdev=13348.74
     lat (usec): min=12, max=250, avg=30.73, stdev=13.96

    slat (nsec): min=1849, max=58824, avg=6970.61, stdev=2134.38
    clat (nsec): min=1684, max=241880, avg=23545.68, stdev=13448.87
     lat (usec): min=12, max=249, avg=30.52, stdev=14.03

C6 disabled:
    slat (nsec): min=2110, max=57871, avg=6867.86, stdev=1711.55
    clat (nsec): min=486, max=98292, avg=22185.50, stdev=10473.34
     lat (usec): min=13, max=105, avg=29.05, stdev=10.99

    slat (nsec): min=2128, max=67730, avg=6913.52, stdev=1714.89
    clat (nsec): min=552, max=93409, avg=22582.50, stdev=10407.53
     lat (usec): min=13, max=108, avg=29.50, stdev=10.93

The maximum latency with C6 enabled is about 2.5x that seen with C6
disabled.
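
As a rough cross-check of that ratio, the short Python sketch below parses
the total-latency ("lat") lines quoted above and compares the worst-case
maxima. The helper names are illustrative only and are not part of fio or
of the patches.

```python
import re

# "lat" summary lines copied from the fio output above (values in usec).
C6_ENABLED = [
    "lat (usec): min=12, max=250, avg=30.73, stdev=13.96",
    "lat (usec): min=12, max=249, avg=30.52, stdev=14.03",
]
C6_DISABLED = [
    "lat (usec): min=13, max=105, avg=29.05, stdev=10.99",
    "lat (usec): min=13, max=108, avg=29.50, stdev=10.93",
]

def parse_fio_lat(line):
    """Return the key=value fields of one fio latency line as floats."""
    return {k: float(v) for k, v in re.findall(r"(\w+)=([\d.]+)", line)}

def worst_max(lines):
    """Largest 'max' value seen across several runs."""
    return max(parse_fio_lat(line)["max"] for line in lines)

ratio = worst_max(C6_ENABLED) / worst_max(C6_DISABLED)
print(f"max lat ratio (C6 enabled / C6 disabled): {ratio:.2f}x")
```

On the total-latency maxima this gives roughly 2.3x; the completion-latency
(clat) maxima give a slightly higher ratio, so both are in the same range
as the figure quoted above.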

Now, the patches provided here introduce a mechanism for the block
layer to limit the maximum CPU latency, with user-configurable sysfs
knobs per block device. I applied the following configuration on my
test system:

  /sys/block/nvme0n1/cpu_lat_limit_us = 10
  /sys/block/nvme0n1/cpu_lat_timeout_ms = 3

This limits the maximum CPU latency for the active CPUs doing block IO
to 10us, and the limit is removed if there is no block IO for 3ms.
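
For illustration, a minimal Python sketch of applying such a configuration
is shown below. The two attribute names are taken from this cover letter
and only exist on a kernel with these RFC patches applied; the helper name
and its `sysfs_root` parameter are assumptions made for the example.

```python
from pathlib import Path

def set_cpu_latency_knobs(disk, limit_us, timeout_ms, sysfs_root="/sys/block"):
    """Write the two knobs proposed by this RFC for one block device.

    The attribute names come from the cover letter; they are not present
    on kernels without these patches. sysfs_root is a parameter so the
    helper can be pointed at a scratch directory.
    """
    dev_dir = Path(sysfs_root) / disk
    values = {
        "cpu_lat_limit_us": limit_us,
        "cpu_lat_timeout_ms": timeout_ms,
    }
    for name, value in values.items():
        knob = dev_dir / name
        if not knob.exists():
            raise FileNotFoundError(f"{knob} not found (RFC patches not applied?)")
        knob.write_text(f"{value}\n")
    # Read the values back so the caller can see what is now stored.
    return {name: (dev_dir / name).read_text().strip() for name in values}
```

On a patched kernel, running `set_cpu_latency_knobs("nvme0n1", 10, 3)` as
root would correspond to the configuration above.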

Running the same fio test used above with C6 enabled, I get:

    slat (nsec): min=1887, max=71037, avg=7239.68, stdev=1850.67
    clat (nsec): min=438, max=103628, avg=22488.75, stdev=10457.86
     lat (usec): min=12, max=133, avg=29.73, stdev=11.04

    slat (nsec): min=1942, max=69159, avg=7194.01, stdev=1788.63
    clat (nsec): min=418, max=115739, avg=22239.51, stdev=10448.37
     lat (usec): min=12, max=123, avg=29.43, stdev=10.96

... so the maximum latencies are cut by well over 100us (from about 250us
to 123-133us) and are quite close to the levels seen with C6 disabled
completely system-wide.
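
The size of the improvement can be worked out from the maximum "lat" values
reported for the three setups. The sketch below averages the two runs of
each setup; the variable names are illustrative.

```python
# Maximum total latency ("lat" max, in usec) from the two runs of each setup.
c6_enabled = [250, 249]
c6_disabled = [105, 108]
c6_enabled_limited = [133, 123]  # C6 enabled, cpu_lat_limit_us = 10

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# How much the limit cuts the worst case, and how far it remains from
# the C6-disabled baseline.
reduction = mean(c6_enabled) - mean(c6_enabled_limited)
gap = mean(c6_enabled_limited) - mean(c6_disabled)

print(f"reduction vs C6 enabled: {reduction:.1f} us")
print(f"remaining gap to C6 disabled: {gap:.1f} us")
```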

Any thoughts about the patches and the approach taken?

-Tero

Comments

Bart Van Assche Aug. 29, 2024, 11:04 a.m. UTC | #1

On 8/29/24 3:18 AM, Tero Kristo wrote:
> Any thoughts about the patches and the approach taken?

The optimal value for the PM QoS latency depends on the request size
and on the storage device characteristics. I think it would be better
if the latency value would be chosen automatically rather than
introducing yet another set of tunable sysfs parameters.

Thanks,

Bart.

Tero Kristo Aug. 30, 2024, 12:01 p.m. UTC | #2

On Thu, 2024-08-29 at 07:04 -0400, Bart Van Assche wrote:
> On 8/29/24 3:18 AM, Tero Kristo wrote:
> > Any thoughts about the patches and the approach taken?
> 
> The optimal value for the PM QoS latency depends on the request size
> and on the storage device characteristics. I think it would be better
> if the latency value would be chosen automatically rather than
> introducing yet another set of tunable sysfs parameters.

Are these device parameters stored somewhere in the kernel? I did try
looking for this kind of data but could not find anything useful; that's
the main reason I implemented the sysfs tunables.

-Tero

> 
> Thanks,
> 
> Bart.
>

Tero Kristo Sept. 4, 2024, 11:35 a.m. UTC | #3

On Thu, 2024-08-29 at 07:04 -0400, Bart Van Assche wrote:
> On 8/29/24 3:18 AM, Tero Kristo wrote:
> > Any thoughts about the patches and the approach taken?
> 
> The optimal value for the PM QoS latency depends on the request size
> and on the storage device characteristics. I think it would be better
> if the latency value would be chosen automatically rather than
> introducing yet another set of tunable sysfs parameters.
> 
> Thanks,
> 
> Bart.
> 

Hi all,

Based on the feedback received, I've updated my patch to work at the
NVMe driver level instead of the block layer. I'll send that to the
corresponding list as a separate RFC, but for now these two patches can
be ignored.

-Tero