[PATCHSET,RFC,0/2] mq-deadline scalability improvements

Message ID 20240118180541.930783-1-axboe@kernel.dk

Message

Jens Axboe Jan. 18, 2024, 6:04 p.m. UTC
Hi,

It's no secret that mq-deadline doesn't scale very well - it was
originally done as a proof-of-concept conversion from deadline, when the
blk-mq multiqueue layer was written. In the single queue world, the
queue lock protected the IO scheduler as well, and mq-deadline simply
adopted an internal dd->lock to take its place.

While mq-deadline works under blk-mq and doesn't suffer any scaling
issues on that side, as soon as request insertion or dispatch is done,
we're hitting the per-queue dd->lock quite intensely. On a basic test
box with 16 cores / 32 threads, running a number of IO intensive
threads against either null_blk (single hw queue) or nvme0n1 (many hw
queues) shows this quite easily:

Device		QD	Jobs	IOPS	Lock contention
=======================================================
null_blk	4	32	1090K	92%
nvme0n1		4	32	1070K	94%

which looks pretty miserable; most of the time is spent contending on
the queue lock.

This RFC patchset attempts to address that by:

1) Serializing dispatch of requests. If we fail dispatching, rely on
   the next completion to dispatch the next one. This could potentially
   reduce the overall depth achieved on the device side, however even
   for the heavily contended test I'm running here, no observable
   change is seen. This is patch 1.

2) Serializing request insertion, using internal per-cpu lists to
   temporarily store requests until insertion can proceed. This is
   patch 2. A rough sketch of both ideas follows below.
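
To make the shape of both changes concrete, here's a rough userspace C
analogue - emphatically not the actual patch. Everything below
(sched_data, drain_buckets, NR_BUCKETS, the cpu argument standing in
for smp_processor_id()) is made up for illustration, with pthread
mutexes playing the role of dd->lock and the per-cpu list protection:

/* Userspace analogue of patches 1+2; not the actual kernel code. */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_BUCKETS 8			/* stands in for per-cpu lists */

struct request {
	struct request *next;
};

struct bucket {
	pthread_mutex_t lock;		/* cheap, rarely contended */
	struct request *head;
};

struct sched_data {
	pthread_mutex_t lock;		/* the contended dd->lock analogue */
	atomic_flag dispatching;	/* serializes dispatch attempts */
	struct request *queue;		/* simplified scheduler queue */
	struct bucket buckets[NR_BUCKETS];
};

void sched_init(struct sched_data *sd)
{
	pthread_mutex_init(&sd->lock, NULL);
	atomic_flag_clear(&sd->dispatching);
	sd->queue = NULL;
	for (int i = 0; i < NR_BUCKETS; i++) {
		pthread_mutex_init(&sd->buckets[i].lock, NULL);
		sd->buckets[i].head = NULL;
	}
}

/* Splice all staged requests into the main queue; caller holds sd->lock. */
static void drain_buckets(struct sched_data *sd)
{
	for (int i = 0; i < NR_BUCKETS; i++) {
		struct bucket *b = &sd->buckets[i];

		pthread_mutex_lock(&b->lock);
		while (b->head) {
			struct request *rq = b->head;

			b->head = rq->next;
			rq->next = sd->queue;
			sd->queue = rq;
		}
		pthread_mutex_unlock(&b->lock);
	}
}

/* Patch 2 idea: stage the request in a per-"cpu" bucket, and only take
 * the big lock opportunistically. If the trylock fails, the request
 * stays staged until the next context that does get the lock drains
 * the buckets. */
void insert_request(struct sched_data *sd, struct request *rq, int cpu)
{
	struct bucket *b = &sd->buckets[cpu % NR_BUCKETS];

	pthread_mutex_lock(&b->lock);
	rq->next = b->head;
	b->head = rq;
	pthread_mutex_unlock(&b->lock);

	if (pthread_mutex_trylock(&sd->lock) == 0) {
		drain_buckets(sd);
		pthread_mutex_unlock(&sd->lock);
	}
}

/* Patch 1 idea: only one thread dispatches at a time. Losers return
 * NULL and rely on a later event (e.g. a completion) to trigger the
 * next dispatch attempt. */
struct request *dispatch_request(struct sched_data *sd)
{
	struct request *rq;

	if (atomic_flag_test_and_set(&sd->dispatching))
		return NULL;

	pthread_mutex_lock(&sd->lock);
	drain_buckets(sd);
	rq = sd->queue;
	if (rq)
		sd->queue = rq->next;
	pthread_mutex_unlock(&sd->lock);

	atomic_flag_clear(&sd->dispatching);
	return rq;
}

The point in both halves is the same: a context that can't get the big
lock doesn't spin on it. Inserters park their request in a bucket and
move on, and losing dispatchers return and let the next completion
retry.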

With that in place, the same test case now does:

Device		QD	Jobs	IOPS	Contention	Diff
=============================================================
null_blk	4	32	2250K	28%		+106%
nvme0n1		4	32	2560K	23%		+112%

and while that doesn't completely eliminate the lock contention, it's
oodles better than what it was before. The throughput increase shows
that nicely, with more than 100% improvement for both cases.

 block/mq-deadline.c | 146 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 133 insertions(+), 13 deletions(-)

Comments

Jens Axboe Jan. 18, 2024, 7:29 p.m. UTC | #1
On 1/18/24 11:04 AM, Jens Axboe wrote:
> With that in place, the same test case now does:
> 
> Device	QD	Jobs	IOPS	Contention	Diff
> =============================================================
> null_blk	4	32	2250K	28%		+106%
> nvme0n1	4	32	2560K	23%		+112%

nvme0n1		4	32	2560K	23%		+139%

Apparently I can't math - this is a +139% improvement for the nvme
case. Just wanted to make it clear that the IOPS number was correct;
it's just the diff math that was wrong.
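(For the record, the math: 2560K against the 1070K baseline is
(2560 - 1070) / 1070 ~= 1.39, hence +139% rather than the +112% in the
cover letter.)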
Jens Axboe Jan. 18, 2024, 8:22 p.m. UTC | #2
On 1/18/24 12:29 PM, Jens Axboe wrote:
> On 1/18/24 11:04 AM, Jens Axboe wrote:
>> With that in place, the same test case now does:
>>
>> Device	QD	Jobs	IOPS	Contention	Diff
>> =============================================================
>> null_blk	4	32	2250K	28%		+106%
>> nvme0n1	4	32	2560K	23%		+112%
> 
> nvme0n1		4	32	2560K	23%		+139%
> 
> Apparently I can't math, this is a +139% improvement for the nvme
> case... Just wanted to make it clear that the IOPS number was correct,
> it's just the diff math that was wrong.

And a further followup, since I ran some quick testing on another box
that has a raid1 of more normal drives (SATA, 32 tags). Both pre and
post the patches, the performance is roughly the same. The bigger
difference is that the pre result uses 8% systime to do ~73K IOPS,
while with the patches we're using 1% systime to do the same work.

This should help answer the question "does this matter at all?". The
answer is definitely yes. As is usually the case with improvements
like this, it's not just about scalability, it's about efficiency as
well. 8x the sys time is ridiculous.