
[RFC] blk-mq: provide more tags for woken-up process when tag allocation is busy

Message ID 20200603073931.94435-1-houtao1@huawei.com (mailing list archive)
State New, archived
Series [RFC] blk-mq: provide more tags for woken-up process when tag allocation is busy

Commit Message

Hou Tao June 3, 2020, 7:39 a.m. UTC
When there are many free-bit waiters, the current batch wakeup method
wakes up at most wake_batch processes when wake_batch bits are freed.
Ideally each woken process would get a free bit; in practice, however,
a woken-up process may be unable to get a free bit and may call
io_schedule() multiple times. That's because other processes in the
same wake-up batch (e.g. those woken up earlier) may have already
allocated multiple free bits.
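
For reference, the current wakeup path looks roughly like the sketch
below (a simplified, non-authoritative sketch of __sbq_wake_up() in
lib/sbitmap.c around v5.6, reusing its internal helpers; the real code
rearms wait_cnt with a cmpxchg, which is omitted here):

/*
 * Simplified sketch of the existing behaviour: each wait queue keeps a
 * wait_cnt that starts at wake_batch; every freed bit decrements it and
 * only when it reaches zero are up to wake_batch sleepers woken at once.
 */
static bool sketch_sbq_wake_up(struct sbitmap_queue *sbq)
{
	struct sbq_wait_state *ws = sbq_wake_ptr(sbq);
	unsigned int wake_batch = READ_ONCE(sbq->wake_batch);

	if (!ws)
		return false;

	if (atomic_dec_return(&ws->wait_cnt) <= 0) {
		/* rearm the counter and wake a whole batch of sleepers */
		atomic_set(&ws->wait_cnt, wake_batch);
		sbq_index_atomic_inc(&sbq->wake_index);
		wake_up_nr(&ws->wait, wake_batch);
		return true;
	}
	return false;
}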

This race leads to two problems. The first is unnecessary context
switches, because multiple processes are woken up only to go back to
sleep shortly afterwards. The second is performance degradation when
there is spatial locality between requests from one process (e.g.
split IO for an HDD), because the process cannot allocate requests
for the split IOs continuously, so the sequential IOs end up being
dispatched separately.

To fix the problem, we mimic the way SQ handles this situation (see
the sketch after the list):
1) stash a bulk of free bits
2) wake up a process when a new bit is freed
3) woken-up process consumes the stashed free bits
4) when stashed free bits are exhausted, goto step 1)
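
A condensed sketch of this cycle on the bit-free path (mirroring
sbitmap_queue_do_stash_and_wake_up() from the patch below; the
ws_active and minimum-stash-size checks are dropped for brevity):

/*
 * Called for every freed bit.  Phase 1: bank freed bits into the stash
 * without waking anyone.  Phase 2: once the stash is full, each freed
 * bit wakes at most one waiter; the woken process then drains the stash
 * in blk_mq_get_tag() until it is empty and the cycle restarts.
 */
static void stash_wake_use(struct sbitmap_queue *sbq, unsigned int stash_size)
{
	struct sbq_wait_state *ws;

	if (!READ_ONCE(sbq->stash_ready)) {
		/* phase 1: keep banking until stash_size bits are stashed */
		if (atomic_add_unless(&sbq->stashed_bits, 1, stash_size))
			return;
		WRITE_ONCE(sbq->stash_ready, true);
	}

	/* phase 2: one freed bit wakes (at most) one waiter */
	ws = sbq_wake_ptr(sbq);
	if (ws) {
		sbq_index_atomic_inc(&sbq->wake_index);
		wake_up(&ws->wait);
	}
}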

Because the tag allocation path (the IO submit path) is much faster
than the tag free path, when contention for free tags is intense we
can ensure:
1) only a few processes will be woken up, and they will exhaust the
   stashed free bits quickly.
2) these processes will be able to allocate multiple requests
   continuously.

An alternative fix is to dynamically adjust the number of woken-up
processes according to the number of waiters and busy bits, instead
of always using wake_batch in __sbq_wake_up(). However, that would
require tracking the number of busy bits all the time, so the
stash-wake-use method is used instead.
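
That alternative would look roughly like the following (a hypothetical
sketch only; bit_busy is a counter that the current code does not
maintain, and keeping it updated on every allocation/free is exactly
the cost being avoided):

/* hypothetical: wake only as many waiters as there are free bits */
static void wake_by_free_count(struct sbitmap_queue *sbq)
{
	struct sbq_wait_state *ws = sbq_wake_ptr(sbq);
	int free_bits, waiters, nr_wake;

	if (!ws)
		return;

	/* bit_busy would have to be maintained on every alloc/free */
	free_bits = (int)sbq->sb.depth - atomic_read(&sbq->bit_busy);
	waiters = atomic_read(&sbq->ws_active);
	nr_wake = min(free_bits, waiters);
	if (nr_wake > 0) {
		sbq_index_atomic_inc(&sbq->wake_index);
		wake_up_nr(&ws->wait, nr_wake);
	}
}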

The following is the result of a simple fio test:

1. fio (random read, 1MB, libaio, iodepth=1024)

(1) 4TB HDD (max_sectors_kb=256)

IOPS (bs=1MB)
jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
1    | 120      | 120    | 119
24   | 120      | 105    | 121
48   | 122      | 102    | 121
72   | 120      | 100    | 119

context switch per second
jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
1    | 1058     | 1162   | 1188
24   | 1047     | 1715   | 1105
48   | 1109     | 1967   | 1105
72   | 1084     | 1908   | 1106

(2) 1.8TB SSD (set max_sectors_kb=256)

IOPS (bs=1MB)
jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
1    | 1077     | 1075   | 1076
24   | 1079     | 1075   | 1076
48   | 1077     | 1076   | 1076
72   | 1077     | 1076   | 1077

context switch per second
jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
1    | 1833     | 5123   | 5264
24   | 2143     | 15238  | 3859
48   | 2182     | 19015  | 3617
72   | 2268     | 19050  | 3662

(3) 1.5TB nvme (set max_sectors_kb=256)

4 read queue, 72 CPU

IOPS (bs=1MB)
jobs | 5.6.15 | 5.6.15-patched |
1    | 3018   | 3018
18   | 3015   | 3016
36   | 3001   | 3005
54   | 2993   | 2997
72   | 2984   | 2990

context switch per second
jobs | 5.6.15 | 5.6.15-patched |
1    | 6292   | 6469
18   | 19428  | 4253
36   | 21290  | 3928
54   | 23060  | 3957
72   | 24221  | 4054

Signed-off-by: Hou Tao <houtao1@huawei.com>
---
Hi,

We found these problems (excessive context switches and some
performance degradation) while comparing the performance of blk-sq
(4.18) and blk-mq (5.6) on HDD, but we could not find a better way
to fix them.

It seems that in order to implement batched request allocation for a
single process, we need to use an atomic variable to track the number
of busy bits. That is acceptable for HDD or SSD, because the IO
latency is greater than 1ms, but we are not sure whether it is OK for
NVMe devices.
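
A minimal sketch of that tracking (the bit_busy field is hypothetical,
and the extra atomics on every allocation/free are exactly the overhead
we are unsure about for NVMe):

/* hypothetical busy-bit accounting wrapped around the existing helpers */
static int tracked_sbitmap_get(struct sbitmap_queue *sbq)
{
	int nr = __sbitmap_queue_get(sbq);

	if (nr >= 0)
		atomic_inc(&sbq->bit_busy);	/* one more tag in flight */
	return nr;
}

static void tracked_sbitmap_clear(struct sbitmap_queue *sbq, unsigned int nr,
				  unsigned int cpu)
{
	atomic_dec(&sbq->bit_busy);		/* tag returns to the pool */
	sbitmap_queue_clear(sbq, nr, cpu);
}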

Suggestions and comments are welcome.

Regards,
Tao
---
 block/blk-mq-tag.c      |  4 ++++
 include/linux/sbitmap.h |  7 ++++++
 lib/sbitmap.c           | 49 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 60 insertions(+)

Comments

Ming Lei June 4, 2020, 10:01 a.m. UTC | #1
Hi Hou Tao,

On Wed, Jun 03, 2020 at 03:39:31PM +0800, Hou Tao wrote:
> When there are many free-bit waiters, current batch wakeup method will
> wake up at most wake_batch processes when wake_batch bits are freed.
> The perfect result is each process will get a free bit, however the
> real result is that a waken-up process may being unable to get
> a free bit and will call io_schedule() multiple times. That's because
> other processes (e.g. wake-up before) in the same wake-up batch
> may have already allocated multiple free bits.
> 
> And the race leads to two problems. The first one is the unnecessary
> context switch, because multiple processes are waken up and then
> go to sleep afterwards. And the second one is the performance
> degradation when there is spatial locality between requests from
> one process (e.g. split IO for HDD), because one process can not
> allocated requests continuously for the split IOs, and
> the sequential IOs will be dispatched separatedly.

I guess this way is a bit worse for HDD since sequential IO may be
interrupted by other contexts.

> 
> To fix the problem, we mimic the way how SQ handles this situation:

Do you mean the SQ way is the congestion control code in __get_request()?
If not, could you provide more background on SQ's way of handling this
issue? It isn't easy for me to associate your approach with SQ's code.

> 1) stash a bulk of free bits
> 2) wake up a process when a new bit is freed
> 3) woken-up process consumes the stashed free bits
> 4) when stashed free bits are exhausted, goto step 1)
> 
> Because the tag allocation path or io submit path is much faster than
> the tag free path, so when the race for free tags is intensive,

Indeed, I guess you mean bio_endio is slow.

> we can ensure:
> 1) only few processes will be waken up and will exhaust the stashed
>    free bits quickly.
> 2) these processes will be able to allocate multiple requests
>    continuously.
> 
> An alternative fix is to dynamically adjust the number of woken-up
> process according to the number of waiters and busy bits, instead of
> using wake_batch each time in __sbq_wake_up(). However it will need
> to record the number of busy bits all the time, so use the
> stash-wake-use method instead.
> 
> The following is the result of a simple fio test:
> 
> 1. fio (random read, 1MB, libaio, iodepth=1024)
> 
> (1) 4TB HDD (max_sectors_kb=256)
> 
> IOPS (bs=1MB)
> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> 1    | 120      | 120    | 119
> 24   | 120      | 105    | 121
> 48   | 122      | 102    | 121
> 72   | 120      | 100    | 119
> 
> context switch per second
> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> 1    | 1058     | 1162   | 1188
> 24   | 1047     | 1715   | 1105
> 48   | 1109     | 1967   | 1105
> 72   | 1084     | 1908   | 1106
> 
> (2) 1.8TB SSD (set max_sectors_kb=256)
> 
> IOPS (bs=1MB)
> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> 1    | 1077     | 1075   | 1076
> 24   | 1079     | 1075   | 1076
> 48   | 1077     | 1076   | 1076
> 72   | 1077     | 1076   | 1077
> 
> context switch per second
> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> 1    | 1833     | 5123   | 5264
> 24   | 2143     | 15238  | 3859
> 48   | 2182     | 19015  | 3617
> 72   | 2268     | 19050  | 3662
> 
> (3) 1.5TB nvme (set max_sectors_kb=256)
> 
> 4 read queue, 72 CPU
> 
> IOPS (bs=1MB)
> jobs | 5.6.15 | 5.6.15-patched |
> 1    | 3018   | 3018
> 18   | 3015   | 3016
> 36   | 3001   | 3005
> 54   | 2993   | 2997
> 72   | 2984   | 2990
> 
> context switch per second
> jobs | 5.6.15 | 5.6.15-patched |
> 1    | 6292   | 6469
> 18   | 19428  | 4253
> 36   | 21290  | 3928
> 54   | 23060  | 3957
> 72   | 24221  | 4054
> 
> Signed-off-by: Hou Tao <houtao1@huawei.com>
> ---
> Hi,
> 
> We found the problems (excessive context switch and few performance
> degradation) during the performance comparison between blk-sq (4.18)
> and blk-mq (5.16) on HDD, but we can not find a better way to fix it.
> 
> It seems that in order to implement batched request allocation for
> single process, we need to use an atomic variable to track
> the number of busy bits. It's suitable for HDD or SDD, because the
> IO latency is greater than 1ms, but no sure whether or not it's OK
> for NVMe device.

Do you have benchmark on NVMe/SSD with 4k BS?

> 
> Suggestions and comments are welcome.
> 
> Regards,
> Tao
> ---
>  block/blk-mq-tag.c      |  4 ++++
>  include/linux/sbitmap.h |  7 ++++++
>  lib/sbitmap.c           | 49 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 60 insertions(+)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 586c9d6e904a..fd601fa6c684 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -180,6 +180,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>  	sbitmap_finish_wait(bt, ws, &wait);
>  
>  found_tag:
> +	if (READ_ONCE(bt->stash_ready) &&
> +	    !atomic_dec_if_positive(&bt->stashed_bits))
> +		WRITE_ONCE(bt->stash_ready, false);
> +

Is it doable to move the above code into sbitmap_queue_do_stash_and_wake_up(),
after wake_up(&ws->wait)?

Or should it at least be done only for a successful allocation after a
context switch?


Thanks, 
Ming
Hou Tao June 5, 2020, 2:21 p.m. UTC | #2
Hi Ming,

On 2020/6/4 18:01, Ming Lei wrote:
> Hi Hou Tao,
> 
> On Wed, Jun 03, 2020 at 03:39:31PM +0800, Hou Tao wrote:
>> When there are many free-bit waiters, current batch wakeup method will
>> wake up at most wake_batch processes when wake_batch bits are freed.
>> The perfect result is each process will get a free bit, however the
>> real result is that a waken-up process may being unable to get
>> a free bit and will call io_schedule() multiple times. That's because
>> other processes (e.g. wake-up before) in the same wake-up batch
>> may have already allocated multiple free bits.
>>
>> And the race leads to two problems. The first one is the unnecessary
>> context switch, because multiple processes are waken up and then
>> go to sleep afterwards. And the second one is the performance
>> degradation when there is spatial locality between requests from
>> one process (e.g. split IO for HDD), because one process can not
>> allocated requests continuously for the split IOs, and
>> the sequential IOs will be dispatched separatedly.
> 
> I guess this way is a bit worse for HDD since sequential IO may be
> interrupted by other context.
Yes.

>>
>> To fix the problem, we mimic the way how SQ handles this situation:
> 
> Do you mean the SQ way is the congestion control code in __get_request()?
> If not, could you provide more background of SQ's way for this issue?
> Cause it isn't easy for me to associate your approach with SQ's code.
> 
The congestion control is accomplished by both __get_request() and __freed_request().
In __get_request(), the max available requests is nr_requests * 1.5 when
multiple threads are trying to allocate requests, and __freed_request()
only starts to wake up waiters when the number of busy requests drops
below nr_requests, so roughly half of nr_requests worth of headroom is
available when a waiter is woken up.
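
Roughly, the legacy thresholds look like this (a condensed, non-literal
sketch of the old __get_request()/__freed_request() behaviour; the real
code also handles congestion marking and the sync/async split):

/* condensed legacy single-queue behaviour, not the literal code */
static bool sq_may_allocate(struct request_list *rl, struct request_queue *q,
			    bool is_sync)
{
	/* allocations may overshoot up to 1.5 * nr_requests */
	return rl->count[is_sync] + 1 < 3 * q->nr_requests / 2;
}

static void sq_freed_request(struct request_list *rl, struct request_queue *q,
			     bool is_sync)
{
	/*
	 * Waiters are only woken once busy requests drop below nr_requests,
	 * so a woken task sees roughly nr_requests / 2 of headroom before
	 * the hard limit is hit again.
	 */
	if (rl->count[is_sync] < q->nr_requests &&
	    waitqueue_active(&rl->wait[is_sync]))
		wake_up(&rl->wait[is_sync]);
}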

The approach in the patch is buggy, because it doesn't check whether
the number of busy bits is greater than the number of to-be-stashed
bits. So we just add an atomic counter (bit_busy) to struct
sbitmap_queue to track the number of busy bits and use it to decide
whether we should wake one process or not:

+#define SBQ_WS_ACTIVE_MIN 4
+
+/* return true when fallback to batched wake-up is needed */
+static bool sbitmap_do_stash_and_wakeup(struct sbitmap_queue *sbq)
+{
+       bool fall_back = false;
+       int ws_active;
+       struct sbq_wait_state *ws;
+       int max_busy;
+       int bit_busy;
+       int wake_seq;
+       int old;
+
+       ws_active = atomic_read(&sbq->ws_active);
+       if (!ws_active)
+               goto done;
+
+       if (ws_active < SBQ_WS_ACTIVE_MIN) {
+               fall_back = true;
+               goto done;
+       }
+
+       /* stash and make sure free bits >= depth / 4 */
+       max_busy = max_t(int, sbq->sb.depth * 3 / 4, 1);
+       bit_busy = atomic_read(&sbq->bit_busy);
+       if (bit_busy > max_busy)
+               goto done;
+
+retry:
+       ws = sbq_wake_ptr(sbq);
+       if (!ws)
+               goto done;
+
+       wake_seq = atomic_read(&ws->wake_seq);
+       old = atomic_cmpxchg(&ws->wake_seq, wake_seq, wake_seq + 1);
+       if (old == wake_seq) {
+               sbq_index_atomic_inc(&sbq->wake_index);
+               wake_up(&ws->wait);
+               goto done;
+       }
+
+       /* lost the cmpxchg race with another waker; try the next wait queue */
+       goto retry;
+
+done:
+       return fall_back;
+}
+
 static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 {
        struct sbq_wait_state *ws;
        unsigned int wake_batch;
        int wait_cnt;

+       if (sbq->flags & SBQ_FLAG_BATCH_BIT_ALLOC) {
+               if (!sbitmap_do_stash_and_wakeup(sbq))
+                       return false;
+       }
+
        ws = sbq_wake_ptr(sbq);
        if (!ws)
                return false;

>> 1) stash a bulk of free bits
>> 2) wake up a process when a new bit is freed
>> 3) woken-up process consumes the stashed free bits
>> 4) when stashed free bits are exhausted, goto step 1)
>>>> Because the tag allocation path or io submit path is much faster than
>> the tag free path, so when the race for free tags is intensive,
> 
> Indeed, I guess you mean bio_endio is slow.
> 
Yes, thanks for the correction.

>> we can ensure:
>> 1) only few processes will be waken up and will exhaust the stashed
>>    free bits quickly.
>> 2) these processes will be able to allocate multiple requests
>>    continuously.
>>
>> An alternative fix is to dynamically adjust the number of woken-up
>> process according to the number of waiters and busy bits, instead of
>> using wake_batch each time in __sbq_wake_up(). However it will need
>> to record the number of busy bits all the time, so use the
>> stash-wake-use method instead.
>>
>> The following is the result of a simple fio test:
>>
>> 1. fio (random read, 1MB, libaio, iodepth=1024)
>>
>> (1) 4TB HDD (max_sectors_kb=256)
>>
>> IOPS (bs=1MB)
>> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
>> 1    | 120      | 120    | 119
>> 24   | 120      | 105    | 121
>> 48   | 122      | 102    | 121
>> 72   | 120      | 100    | 119
>>
>> context switch per second
>> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
>> 1    | 1058     | 1162   | 1188
>> 24   | 1047     | 1715   | 1105
>> 48   | 1109     | 1967   | 1105
>> 72   | 1084     | 1908   | 1106
>>
>> (2) 1.8TB SSD (set max_sectors_kb=256)
>>
>> IOPS (bs=1MB)
>> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
>> 1    | 1077     | 1075   | 1076
>> 24   | 1079     | 1075   | 1076
>> 48   | 1077     | 1076   | 1076
>> 72   | 1077     | 1076   | 1077
>>
>> context switch per second
>> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
>> 1    | 1833     | 5123   | 5264
>> 24   | 2143     | 15238  | 3859
>> 48   | 2182     | 19015  | 3617
>> 72   | 2268     | 19050  | 3662
>>
>> (3) 1.5TB nvme (set max_sectors_kb=256)
>>
>> 4 read queue, 72 CPU
>>
>> IOPS (bs=1MB)
>> jobs | 5.6.15 | 5.6.15-patched |
>> 1    | 3018   | 3018
>> 18   | 3015   | 3016
>> 36   | 3001   | 3005
>> 54   | 2993   | 2997
>> 72   | 2984   | 2990
>>
>> context switch per second
>> jobs | 5.6.15 | 5.6.15-patched |
>> 1    | 6292   | 6469
>> 18   | 19428  | 4253
>> 36   | 21290  | 3928
>> 54   | 23060  | 3957
>> 72   | 24221  | 4054
>>
>> Signed-off-by: Hou Tao <houtao1@huawei.com>
>> ---
>> Hi,
>>
>> We found the problems (excessive context switch and few performance
>> degradation) during the performance comparison between blk-sq (4.18)
>> and blk-mq (5.16) on HDD, but we can not find a better way to fix it.
>>
>> It seems that in order to implement batched request allocation for
>> single process, we need to use an atomic variable to track
>> the number of busy bits. It's suitable for HDD or SDD, because the
>> IO latency is greater than 1ms, but no sure whether or not it's OK
>> for NVMe device.
>
> Do you have benchmark on NVMe/SSD with 4k BS?
> 
The following is the randread test on SSD and NVMe.

1. fio randread 4KB

(1) SSD 1.8TB (nr_tags=1024, nr_requests=256)

It seems that when there is no contention for tag allocation the performance
is the same, but when contention for tag allocation is intense the
performance gain is huge.

total iodepth=256, so when jobs=2, iodepth=256/2=128

jobs | 5.6   | 5.6 patched
1    | 193k  | 192k
2    | 197k  | 196k
4    | 198k  | 198k
8    | 197k  | 197k
16   | 197k  | 198k
32   | 198k  | 198k
64   | 195k  | 195k
128  | 193k  | 192k
256  | 198k  | 198k

total iodepth=512

jobs | 5.6   | 5.6 patched
1    | 193k  | 194k
2    | 197k  | 196k
4    | 198k  | 197k
8    | 197k  | 219k
16   | 197k  | 394k
32   | 198k  | 395k
64   | 196k  | 592k
128  | 199k  | 591k
256  | 196k  | 591k
512  | 198k  | 591k

total iodepth=1024

jobs | 5.6   | 5.6 patched
1    | 195k  | 192k
2    | 196k  | 197k
4    | 197k  | 197k
8    | 198k  | 197k
16   | 197k  | 198k
32   | 197k  | 243k
64   | 197k  | 393k
128  | 197k  | 986k
256  | 200k  | 976k
512  | 203k  | 984k
1024 | 202k  | 354k

(2) NVMe 1.5TB (nr_tags=1023)

It seems there is no performance impact on the NVMe device, but the
number of context switches is reduced.

total iodepth=256, so when jobs=2, iodepth=256/2=128

jobs | 5.6   | 5.6 patched
1    | 398k  | 394k
4    | 774k  | 775k
16   | 774k  | 774k
64   | 774k  | 775k
256  | 778k  | 784k

total iodepth=1024

jobs | 5.6   | 5.6 patched
1    | 406k  | 405k
4    | 774k  | 773k
16   | 774k  | 774k
64   | 777k  | 773k
256  | 783k  | 783k
1024 | 764k  | 755k

total iodepth=2048

jobs | 5.6   | 5.6 patched
1    | 369k  | 377k
4    | 774k  | 774k
16   | 774k  | 774k
64   | 767k  | 773k
256  | 784k  | 781k
1024 | 741k  | 1416k
2048 | 754k  | 753k

Regards,
Tao

>>
>> Suggestions and comments are welcome.
>>
>> Regards,
>> Tao
>> ---
>>  block/blk-mq-tag.c      |  4 ++++
>>  include/linux/sbitmap.h |  7 ++++++
>>  lib/sbitmap.c           | 49 +++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 60 insertions(+)
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 586c9d6e904a..fd601fa6c684 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -180,6 +180,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>>  	sbitmap_finish_wait(bt, ws, &wait);
>>  
>>  found_tag:
>> +	if (READ_ONCE(bt->stash_ready) &&
>> +	    !atomic_dec_if_positive(&bt->stashed_bits))
>> +		WRITE_ONCE(bt->stash_ready, false);
>> +
> 
> Is it doable to move the above code into sbitmap_queue_do_stash_and_wake_up(),
> after wake_up(&ws->wait)?
> 
> Or at least it should be done for successful allocation after context
> switch?
> 
> 
> Thanks, 
> Ming
> 
> .
>
Ming Lei June 6, 2020, 3:25 a.m. UTC | #3
On Fri, Jun 05, 2020 at 10:21:31PM +0800, Hou Tao wrote:
> Hi Ming,
> 
> On 2020/6/4 18:01, Ming Lei wrote:
> > Hi Hou Tao,
> > 
> > On Wed, Jun 03, 2020 at 03:39:31PM +0800, Hou Tao wrote:
> >> When there are many free-bit waiters, current batch wakeup method will
> >> wake up at most wake_batch processes when wake_batch bits are freed.
> >> The perfect result is each process will get a free bit, however the
> >> real result is that a waken-up process may being unable to get
> >> a free bit and will call io_schedule() multiple times. That's because
> >> other processes (e.g. wake-up before) in the same wake-up batch
> >> may have already allocated multiple free bits.
> >>
> >> And the race leads to two problems. The first one is the unnecessary
> >> context switch, because multiple processes are waken up and then
> >> go to sleep afterwards. And the second one is the performance
> >> degradation when there is spatial locality between requests from
> >> one process (e.g. split IO for HDD), because one process can not
> >> allocated requests continuously for the split IOs, and
> >> the sequential IOs will be dispatched separatedly.
> > 
> > I guess this way is a bit worse for HDD since sequential IO may be
> > interrupted by other context.
> Yes.
> 
> >>
> >> To fix the problem, we mimic the way how SQ handles this situation:
> > 
> > Do you mean the SQ way is the congestion control code in __get_request()?
> > If not, could you provide more background of SQ's way for this issue?
> > Cause it isn't easy for me to associate your approach with SQ's code.
> > 
> The congestion control is accomplished by both __get_request() and __freed_request().
> In __get_request(), the max available requests is  nr_requests * 1.5 when

Actually, the SQ code classified requests into sync and async, and for
each type the max allowed requests is nr_requests * 1.5; batching
allocation is triggered if rl->count[is_sync]+1 >= q->nr_requests or
when a task wakes up from a blocking allocation.
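
In other words, roughly (a condensed, non-literal sketch of the legacy
checks; ioc_set_batching() marked a task as a batcher when it first
found the queue full or when it woke up from a blocking allocation, and
the mark lasted for BLK_BATCH_REQ allocations):

/* condensed, non-literal sketch of the legacy batching checks */
static bool sq_may_get_request(struct request_queue *q,
			       struct request_list *rl,
			       struct io_context *ioc, bool is_sync)
{
	/* between nr_requests and 1.5 * nr_requests only batchers pass */
	if (rl->count[is_sync] + 1 >= q->nr_requests &&
	    !ioc_batching(q, ioc))
		return false;		/* caller sleeps and retries */

	/* absolute ceiling for everyone: 1.5 * nr_requests */
	if (rl->count[is_sync] + 1 >= 3 * q->nr_requests / 2)
		return false;

	return true;
}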

> there are multiple threads try to allocate requests, and in __free_requests()
> it only start to wake up waiter when the busy requests is less than nr_requests,
> so half of nr_request is free when the waiter is woken-up.

The SQ's batching allocation usually allows one active process to
complete one batch of requests while the others stay blocked. This is
really nice for sequential IO on HDD.

I did observe that some HDDs' writeback performance dropped a lot after
SQ's batching allocation was killed:

[1] https://lore.kernel.org/linux-scsi/Pine.LNX.4.44L0.1909181213141.1507-100000@iolanthe.rowland.org/
[2] https://lore.kernel.org/linux-scsi/20191226083706.GA17974@ming.t460p/

> 
> The approach in the patch is buggy, because it doesn't check whether
> the number of busy bits is greater than the number of to-be-stashed
> bits. So we just add an atomic (bit_busy) in struct sbitmap to track
> the number of busy bits and use the number to decide whether

Tracking busy bits is really expensive for SSD/NVMe, but it should be
fine for HDD. Maybe we can use one dedicated approach for HDD's request
allocation.

> we should wake one process or not:
> 
> +#define SBQ_WS_ACTIVE_MIN 4
> +
> +/* return true when fallback to batched wake-up is needed */
> +static bool sbitmap_do_stash_and_wakeup(struct sbitmap_queue *sbq)
> +{
> +       bool fall_back = false;
> +       int ws_active;
> +       struct sbq_wait_state *ws;
> +       int max_busy;
> +       int bit_busy;
> +       int wake_seq;
> +       int old;
> +
> +       ws_active = atomic_read(&sbq->ws_active);
> +       if (!ws_active)
> +               goto done;
> +
> +       if (ws_active < SBQ_WS_ACTIVE_MIN) {
> +               fall_back = true;
> +               goto done;
> +       }
> +
> +       /* stash and make sure free bits >= depth / 4 */
> +       max_busy = max_t(int, sbq->sb.depth * 3 / 4, 1);
> +       bit_busy = atomic_read(&sbq->bit_busy);
> +       if (bit_busy > max_busy)
> +               goto done;
> +
> +retry:
> +       ws = sbq_wake_ptr(sbq);
> +       if (!ws)
> +               goto done;
> +
> +       wake_seq = atomic_read(&ws->wake_seq);
> +       old = atomic_cmpxchg(&ws->wake_seq, wake_seq, wake_seq + 1);
> +       if (old == wake_seq) {
> +               sbq_index_atomic_inc(&sbq->wake_index);
> +               wake_up(&ws->wait);
> +               goto done;
> +       }
> +
> +done:
> +       return fall_back;
> +}
> +
>  static bool __sbq_wake_up(struct sbitmap_queue *sbq)
>  {
>         struct sbq_wait_state *ws;
>         unsigned int wake_batch;
>         int wait_cnt;
> 
> +       if (sbq->flags & SBQ_FLAG_BATCH_BIT_ALLOC) {
> +               if (!sbitmap_do_stash_and_wakeup(sbq))
> +                       return false;
> +       }
> +

I feel that it is a good direction to add one such flag only for HDD's
request tag allocation.
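
For example, something along these lines (a hypothetical sketch only,
reusing the SBQ_FLAG_BATCH_BIT_ALLOC flag and flags field from the
draft quoted above; not code from this thread):

/* hypothetical: enable the batching flag only for rotational (HDD)
 * queues, so SSD/NVMe never pay for the extra busy-bit accounting */
static void blk_mq_tags_maybe_batch_alloc(struct request_queue *q,
					  struct sbitmap_queue *bt)
{
	if (blk_queue_nonrot(q))
		return;		/* SSD/NVMe: keep the current behaviour */

	bt->flags |= SBQ_FLAG_BATCH_BIT_ALLOC;
}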

>         ws = sbq_wake_ptr(sbq);
>         if (!ws)
>                 return false;
> 
> >> 1) stash a bulk of free bits
> >> 2) wake up a process when a new bit is freed
> >> 3) woken-up process consumes the stashed free bits
> >> 4) when stashed free bits are exhausted, goto step 1)
> >>>> Because the tag allocation path or io submit path is much faster than
> >> the tag free path, so when the race for free tags is intensive,
> > 
> > Indeed, I guess you mean bio_endio is slow.
> > 
> Yes, thanks for the correction.
> 
> >> we can ensure:
> >> 1) only few processes will be waken up and will exhaust the stashed
> >>    free bits quickly.
> >> 2) these processes will be able to allocate multiple requests
> >>    continuously.
> >>
> >> An alternative fix is to dynamically adjust the number of woken-up
> >> process according to the number of waiters and busy bits, instead of
> >> using wake_batch each time in __sbq_wake_up(). However it will need
> >> to record the number of busy bits all the time, so use the
> >> stash-wake-use method instead.
> >>
> >> The following is the result of a simple fio test:
> >>
> >> 1. fio (random read, 1MB, libaio, iodepth=1024)
> >>
> >> (1) 4TB HDD (max_sectors_kb=256)
> >>
> >> IOPS (bs=1MB)
> >> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> >> 1    | 120      | 120    | 119
> >> 24   | 120      | 105    | 121
> >> 48   | 122      | 102    | 121
> >> 72   | 120      | 100    | 119
> >>
> >> context switch per second
> >> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> >> 1    | 1058     | 1162   | 1188
> >> 24   | 1047     | 1715   | 1105
> >> 48   | 1109     | 1967   | 1105
> >> 72   | 1084     | 1908   | 1106
> >>
> >> (2) 1.8TB SSD (set max_sectors_kb=256)
> >>
> >> IOPS (bs=1MB)
> >> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> >> 1    | 1077     | 1075   | 1076
> >> 24   | 1079     | 1075   | 1076
> >> 48   | 1077     | 1076   | 1076
> >> 72   | 1077     | 1076   | 1077
> >>
> >> context switch per second
> >> jobs | 4.18-sq  | 5.6.15 | 5.6.15-patched |
> >> 1    | 1833     | 5123   | 5264
> >> 24   | 2143     | 15238  | 3859
> >> 48   | 2182     | 19015  | 3617
> >> 72   | 2268     | 19050  | 3662
> >>
> >> (3) 1.5TB nvme (set max_sectors_kb=256)
> >>
> >> 4 read queue, 72 CPU
> >>
> >> IOPS (bs=1MB)
> >> jobs | 5.6.15 | 5.6.15-patched |
> >> 1    | 3018   | 3018
> >> 18   | 3015   | 3016
> >> 36   | 3001   | 3005
> >> 54   | 2993   | 2997
> >> 72   | 2984   | 2990
> >>
> >> context switch per second
> >> jobs | 5.6.15 | 5.6.15-patched |
> >> 1    | 6292   | 6469
> >> 18   | 19428  | 4253
> >> 36   | 21290  | 3928
> >> 54   | 23060  | 3957
> >> 72   | 24221  | 4054
> >>
> >> Signed-off-by: Hou Tao <houtao1@huawei.com>
> >> ---
> >> Hi,
> >>
> >> We found the problems (excessive context switch and few performance
> >> degradation) during the performance comparison between blk-sq (4.18)
> >> and blk-mq (5.16) on HDD, but we can not find a better way to fix it.
> >>
> >> It seems that in order to implement batched request allocation for
> >> single process, we need to use an atomic variable to track
> >> the number of busy bits. It's suitable for HDD or SDD, because the
> >> IO latency is greater than 1ms, but no sure whether or not it's OK
> >> for NVMe device.
> >
> > Do you have benchmark on NVMe/SSD with 4k BS?
> > 
> The following is the randread test on SSD and NVMe.
> 
> 1. fio randread 4KB
> 
> (1) SSD 1.8TB (nr_tags=1024, nr_requests=256)
> 
> It seems that when there is no race for tag allocation, the performance is the same,
> but when there are intensive race for tag allocation, the performance gain is huge.
> 
> total iodepth=256, so when jobs=2, iodepth=256/2=128
> 
> jobs | 5.6   | 5.6 patched
> 1    | 193k  | 192k
> 2    | 197k  | 196k
> 4    | 198k  | 198k
> 8    | 197k  | 197k
> 16   | 197k  | 198k
> 32   | 198k  | 198k
> 64   | 195k  | 195k
> 128  | 193k  | 192k
> 256  | 198k  | 198k
> 
> total iodepth=512
> 
> jobs | 5.6   | 5.6 patched
> 1    | 193k  | 194k
> 2    | 197k  | 196k
> 4    | 198k  | 197k
> 8    | 197k  | 219k
> 16   | 197k  | 394k
> 32   | 198k  | 395k
> 64   | 196k  | 592k
> 128  | 199k  | 591k
> 256  | 196k  | 591k
> 512  | 198k  | 591k
> 
> total iodepth=1024
> 
> jobs | 5.6   | 5.6 patched
> 1    | 195k  | 192k
> 2    | 196k  | 197k
> 4    | 197k  | 197k
> 8    | 198k  | 197k
> 16   | 197k  | 198k
> 32   | 197k  | 243k
> 64   | 197k  | 393k
> 128  | 197k  | 986k
> 256  | 200k  | 976k
> 512  | 203k  | 984k
> 1024 | 202k  | 354k
> 
> (2) NVMe 1.5TB (nr_tags=1023)
> 
> It seems there is no performance impact on NVMe device, but the
> the number of context switch will be reduced.
> 
> total iodepth=256, so when jobs=2, iodepth=256/2=128
> 
> jobs | 5.6   | 5.6 patched
> 1    | 398k  | 394k
> 4    | 774k  | 775k
> 16   | 774k  | 774k
> 64   | 774k  | 775k
> 256  | 778k  | 784k
> 
> total iodepth=1024
> 
> jobs | 5.6   | 5.6 patched
> 1    | 406k  | 405k
> 4    | 774k  | 773k
> 16   | 774k  | 774k
> 64   | 777k  | 773k
> 256  | 783k  | 783k
> 1024 | 764k  | 755k
> 
> total iodepth=2048
> 
> jobs | 5.6   | 5.6 patched
> 1    | 369k  | 377k
> 4    | 774k  | 774k
> 16   | 774k  | 774k
> 64   | 767k  | 773k
> 256  | 784k  | 781k
> 1024 | 741k  | 1416k
> 2048 | 754k  | 753k

Frankly speaking, I am more interested in the context switch & CPU
utilization changes on SSD/NVMe after applying your patch.

We may improve HDD, but meanwhile SSD/NVMe's performance must not be
hurt, in either latency or CPU utilization.


Thanks, 
Ming

Patch

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6e904a..fd601fa6c684 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -180,6 +180,10 @@  unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	sbitmap_finish_wait(bt, ws, &wait);
 
 found_tag:
+	if (READ_ONCE(bt->stash_ready) &&
+	    !atomic_dec_if_positive(&bt->stashed_bits))
+		WRITE_ONCE(bt->stash_ready, false);
+
 	return tag + tag_offset;
 }
 
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index e40d019c3d9d..8f51e8fca0b8 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -129,6 +129,13 @@  struct sbitmap_queue {
 	 */
 	atomic_t ws_active;
 
+	/**
+	 * @stash_ready: whether to use stashed free bit or not
+	 * @stashed_bits: the number of stashed free bits
+	 */
+	bool stash_ready;
+	atomic_t stashed_bits;
+
 	/**
 	 * @round_robin: Allocate bits in strict round-robin order.
 	 */
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index af88d1346dd7..0937e73754e7 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -374,6 +374,8 @@  int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 	sbq->wake_batch = sbq_calc_wake_batch(sbq, depth);
 	atomic_set(&sbq->wake_index, 0);
 	atomic_set(&sbq->ws_active, 0);
+	atomic_set(&sbq->stashed_bits, 0);
+	sbq->stash_ready = false;
 
 	sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node);
 	if (!sbq->ws) {
@@ -388,6 +390,7 @@  int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
 	}
 
 	sbq->round_robin = round_robin;
+
 	return 0;
 }
 EXPORT_SYMBOL_GPL(sbitmap_queue_init_node);
@@ -549,8 +552,52 @@  static bool __sbq_wake_up(struct sbitmap_queue *sbq)
 	return false;
 }
 
+#define SBQ_STASH_RATIO 4
+#define SBQ_MIN_STASH_CNT 16
+#define SBQ_WS_ACTIVE_MIN_BUSY_CNT 4
+
+/*
+ * In order to support batched request allocation for a single process,
+ * we first stash a batch of free bits, then wake up one process each
+ * time a new bit is freed. When all stashed bits have been consumed,
+ * a new stash-wake-use round is started.
+ */
+static bool sbitmap_queue_do_stash_and_wake_up(struct sbitmap_queue *sbq)
+{
+	unsigned int stash = sbq->sb.depth / SBQ_STASH_RATIO;
+	int ws_active;
+	struct sbq_wait_state *ws;
+
+	if (stash < SBQ_MIN_STASH_CNT)
+		return false;
+
+	ws_active = atomic_read(&sbq->ws_active);
+	if (ws_active < SBQ_WS_ACTIVE_MIN_BUSY_CNT)
+		return false;
+
+	if (!READ_ONCE(sbq->stash_ready)) {
+		/* TODO: need ensure the number of busy bits >= stash */
+		if (atomic_add_unless(&sbq->stashed_bits, 1, stash))
+			return true;
+
+		WRITE_ONCE(sbq->stash_ready, true);
+	}
+
+	ws = sbq_wake_ptr(sbq);
+	if (!ws)
+		return false;
+
+	sbq_index_atomic_inc(&sbq->wake_index);
+	wake_up(&ws->wait);
+
+	return true;
+}
+
 void sbitmap_queue_wake_up(struct sbitmap_queue *sbq)
 {
+	if (sbitmap_queue_do_stash_and_wake_up(sbq))
+		return;
+
 	while (__sbq_wake_up(sbq))
 		;
 }
@@ -624,6 +671,8 @@  void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m)
 	}
 	seq_puts(m, "}\n");
 
+	seq_printf(m, "stash_ready=%d\n", sbq->stash_ready);
+	seq_printf(m, "stashed_bits=%d\n", atomic_read(&sbq->stashed_bits));
 	seq_printf(m, "wake_batch=%u\n", sbq->wake_batch);
 	seq_printf(m, "wake_index=%d\n", atomic_read(&sbq->wake_index));
 	seq_printf(m, "ws_active=%d\n", atomic_read(&sbq->ws_active));