[v5,00/14] blk-mq: Reduce static requests memory footprint for shared sbitmap

Message ID 1633429419-228500-1-git-send-email-john.garry@huawei.com (mailing list archive)

Message

John Garry Oct. 5, 2021, 10:23 a.m. UTC
Currently a full set of static requests is allocated per hw queue per
tagset when shared sbitmap is used.

However, only tagset->queue_depth number of requests may be active at
any given time. As such, only tagset->queue_depth number of static
requests are required.

The same goes for using an IO scheduler, which allocates a full set of
static requests per hw queue per request queue.

This series changes shared sbitmap support to use shared tags per
tagset and request queue. Ming suggested something along those lines in
the v1 review. With shared tags, the static rqs also become shared,
reducing the number of sets of static rqs and so reducing memory usage.

Patch "blk-mq: Use shared tags for shared sbitmap support" is a bit big,
and could potentially be broken down. But then maintaining ability to
bisect becomes harder and each sub-patch would get more convoluted.

For the megaraid sas driver on my 128-CPU arm64 system with 1x SATA
disk, we save approx. 300MB(!) [370MB -> 60MB]
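
As a rough illustration (a standalone sketch; the queue count, depth and
per-request size below are assumed values, not figures from this series),
the old scheme scales with the hw queue count while the shared set does not:

#include <stdio.h>

int main(void)
{
	unsigned long nr_hw_queues = 128;  /* e.g. one hw queue per CPU (assumed) */
	unsigned long queue_depth = 2048;  /* tagset->queue_depth (assumed) */
	unsigned long rq_bytes = 1024;     /* request + driver payload per rq (assumed) */

	/* before: a full set of static rqs per hw queue */
	unsigned long per_hctx = nr_hw_queues * queue_depth * rq_bytes;
	/* after: one shared set of static rqs for the tagset */
	unsigned long shared = queue_depth * rq_bytes;

	printf("per-hw-queue static rqs: ~%lu MB\n", per_hctx >> 20); /* ~256 MB */
	printf("shared static rqs:       ~%lu MB\n", shared >> 20);   /* ~2 MB */
	return 0;
}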

Baseline is 1b2d1439fc25 (block/for-next) Merge branch 'for-5.16/io_uring'
into for-next

Changes since v4:
- Add Hannes' and Ming's RB tags (thanks!) - but please check 12/14 rework
- Add patch to change "shared sbitmap" naming to "shared tags"
- Rebase "block: Rename BLKDEV_MAX_RQ -> BLKDEV_DEFAULT_RQ"

Changes since v3:
- Fix transient error handling issue in 05/13
- Add Hannes' RB tags (thanks!)

Changes since v2:
- Make blk_mq_clear_rq_mapping() static again
- Various function renaming for conciseness and consistency
- Add/refactor alloc/free map and rqs function
- Drop the new blk_mq_ops init_request method in favour of passing an
  invalid HW queue index for shared sbitmap
- Add patch to not clear rq mapping for driver tags
- Remove blk_mq_init_bitmap_tags()
- Add some more RB tags (thanks!)

Changes since v1:
- Switch to use blk_mq_tags for shared sbitmap
- Add new blk_mq_ops init request callback
- Add some RB tags (thanks!)

John Garry (14):
  blk-mq: Change rqs check in blk_mq_free_rqs()
  block: Rename BLKDEV_MAX_RQ -> BLKDEV_DEFAULT_RQ
  blk-mq: Relocate shared sbitmap resize in blk_mq_update_nr_requests()
  blk-mq: Invert check in blk_mq_update_nr_requests()
  blk-mq-sched: Rename blk_mq_sched_alloc_{tags -> map_and_rqs}()
  blk-mq-sched: Rename blk_mq_sched_free_{requests -> rqs}()
  blk-mq: Pass driver tags to blk_mq_clear_rq_mapping()
  blk-mq: Don't clear driver tags own mapping
  blk-mq: Add blk_mq_tag_update_sched_shared_sbitmap()
  blk-mq: Add blk_mq_alloc_map_and_rqs()
  blk-mq: Refactor and rename blk_mq_free_map_and_{requests->rqs}()
  blk-mq: Use shared tags for shared sbitmap support
  blk-mq: Stop using pointers for blk_mq_tags bitmap tags
  blk-mq: Change shared sbitmap naming to shared tags

 block/bfq-iosched.c    |   4 +-
 block/blk-core.c       |   6 +-
 block/blk-mq-debugfs.c |   8 +-
 block/blk-mq-sched.c   | 118 +++++++++++-------------
 block/blk-mq-sched.h   |   4 +-
 block/blk-mq-tag.c     | 135 ++++++++++------------------
 block/blk-mq-tag.h     |  16 ++--
 block/blk-mq.c         | 198 +++++++++++++++++++++++------------------
 block/blk-mq.h         |  36 ++++----
 block/blk.h            |   2 +-
 block/elevator.c       |   2 +-
 block/kyber-iosched.c  |   4 +-
 block/mq-deadline.c    |   2 +-
 drivers/block/rbd.c    |   2 +-
 include/linux/blk-mq.h |  17 ++--
 include/linux/blkdev.h |   5 +-
 16 files changed, 262 insertions(+), 297 deletions(-)

Comments

Jens Axboe Oct. 5, 2021, 12:35 p.m. UTC | #1
On 10/5/21 4:23 AM, John Garry wrote:
> Currently a full set of static requests are allocated per hw queue per
> tagset when shared sbitmap is used.
> 
> However, only tagset->queue_depth number of requests may be active at
> any given time. As such, only tagset->queue_depth number of static
> requests are required.
> 
> The same goes for using an IO scheduler, which allocates a full set of
> static requests per hw queue per request queue.
> 
> This series changes shared sbitmap support by using a shared tags per
> tagset and request queue. Ming suggested something along those lines in
> v1 review. In using a shared tags, the static rqs also become shared,
> reducing the number of sets of static rqs, reducing memory usage.
> 
> Patch "blk-mq: Use shared tags for shared sbitmap support" is a bit big,
> and could potentially be broken down. But then maintaining ability to
> bisect becomes harder and each sub-patch would get more convoluted.
> 
> For megaraid sas driver on my 128-CPU arm64 system with 1x SATA disk, we
> save approx. 300MB(!) [370MB -> 60MB]
> 
> Baseline is 1b2d1439fc25 (block/for-next) Merge branch 'for-5.16/io_uring'
> into for-next

Let's get this queued up for testing, thanks John.
John Garry Oct. 5, 2021, 1:34 p.m. UTC | #2
On 05/10/2021 13:35, Jens Axboe wrote:
>> Baseline is 1b2d1439fc25 (block/for-next) Merge branch 'for-5.16/io_uring'
>> into for-next
> Let's get this queued up for testing, thanks John.

Cheers, appreciated

@Kashyap, You mentioned that when testing you saw a performance
regression from v5.11 -> v5.12 - any idea on that yet? Can you describe
the scenario, like which IO scheduler, how many disks and what type?
Does disabling host_tagset_enable restore performance?

From checking differences between those kernels, I don't see anything
directly relevant in sbitmap support or in the megaraid sas driver.

Thanks,
John
Kashyap Desai Oct. 5, 2021, 1:53 p.m. UTC | #3
> -----Original Message-----
> From: John Garry [mailto:john.garry@huawei.com]
> Sent: Tuesday, October 5, 2021 7:05 PM
> To: Jens Axboe <axboe@kernel.dk>; kashyap.desai@broadcom.com
> Cc: linux-block@vger.kernel.org; linux-kernel@vger.kernel.org;
> ming.lei@redhat.com; hare@suse.de; linux-scsi@vger.kernel.org
> Subject: Re: [PATCH v5 00/14] blk-mq: Reduce static requests memory
> footprint for shared sbitmap
>
> On 05/10/2021 13:35, Jens Axboe wrote:
> >> Baseline is 1b2d1439fc25 (block/for-next) Merge branch
> >> 'for-5.16/io_uring'
> >> into for-next
> > Let's get this queued up for testing, thanks John.
>
> Cheers, appreciated
>
> @Kashyap, You mentioned that when testing you saw a performance
> regression from v5.11 -> v5.12 - any idea on that yet? Can you describe
> the
> scenario, like IO scheduler and how many disks and the type? Does
> disabling
> host_tagset_enable restore performance?

John - I am still working on this. System was not available due to some
other debugging.

>
>  From checking differences between those kernels, I don't see anything
> directly relevant in sbitmap support or in the megaraid sas driver.
>
> Thanks,
> John
Jens Axboe Oct. 5, 2021, 4:23 p.m. UTC | #4
On 10/5/21 7:34 AM, John Garry wrote:
> On 05/10/2021 13:35, Jens Axboe wrote:
>>> Baseline is 1b2d1439fc25 (block/for-next) Merge branch 'for-5.16/io_uring'
>>> into for-next
>> Let's get this queued up for testing, thanks John.
> 
> Cheers, appreciated
> 
> @Kashyap, You mentioned that when testing you saw a performance 
> regression from v5.11 -> v5.12 - any idea on that yet? Can you describe 
> the scenario, like IO scheduler and how many disks and the type? Does 
> disabling host_tagset_enable restore performance?

FWIW, I ran my usual peak testing on this and didn't observe any regressions.
Caveat being that a) no scheduler is involved, and b) no shared tags. But
at least that fast case/path is fine.
Kashyap Desai Oct. 7, 2021, 8:31 p.m. UTC | #5
> > -----Original Message-----
> > From: John Garry [mailto:john.garry@huawei.com]
> > Sent: Tuesday, October 5, 2021 7:05 PM
> > To: Jens Axboe <axboe@kernel.dk>; kashyap.desai@broadcom.com
> > Cc: linux-block@vger.kernel.org; linux-kernel@vger.kernel.org;
> > ming.lei@redhat.com; hare@suse.de; linux-scsi@vger.kernel.org
> > Subject: Re: [PATCH v5 00/14] blk-mq: Reduce static requests memory
> > footprint for shared sbitmap
> >
> > On 05/10/2021 13:35, Jens Axboe wrote:
> > >> Baseline is 1b2d1439fc25 (block/for-next) Merge branch 'for-
> 5.16/io_uring'
> > >> into for-next
> > > Let's get this queued up for testing, thanks John.
> >
> > Cheers, appreciated
> >
> > @Kashyap, You mentioned that when testing you saw a performance
> > regression from v5.11 -> v5.12 - any idea on that yet? Can you
> > describe the scenario, like IO scheduler and how many disks and the
> > type? Does disabling host_tagset_enable restore performance?
>
> John - I am still working on this. System was not available due to some
> other
> debugging.

John -

I tested this patchset on 5.15-rc4 (master) -
https://github.com/torvalds/linux.git

#1 I noticed some performance regression with the mq-deadline scheduler
which is not related to this series. I will bisect and get more detail
about this issue separately.
#2 W.r.t. this patchset, I noticed one issue related to high CPU usage
in certain cases.

I ran the tests on the same setup using the same h/w: an Aero MegaRaid
controller.

Test #1: Total 24 SAS SSDs in JBOD mode.
(numactl -N 1 fio 24.fio --rw=randread --bs=4k --iodepth=256 --numjobs=1
--ioscheduler=none/mq-deadline)
No performance regression is noticed using this patchset. I can get 3.1M
IOPS (the max IOPS on this setup). I noticed a CPU hogging issue if the
iodepth from the application is high.

Cpu usage data from (top)
%Node1 :  6.4 us, 57.5 sy,  0.0 ni, 23.7 id,  0.0 wa,  0.0 hi, 12.4 si,  0.0 st

Perf top data -
     19.11%  [kernel]        [k] native_queued_spin_lock_slowpath
     4.72%  [megaraid_sas]  [k] complete_cmd_fusion
     3.70%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
     2.76%  [megaraid_sas]  [k] megasas_build_ldio_fusion
     2.16%  [kernel]        [k] syscall_return_via_sysret
     2.16%  [kernel]        [k] entry_SYSCALL_64
     1.87%  [megaraid_sas]  [k] megasas_queue_command
     1.58%  [kernel]        [k] io_submit_one
     1.53%  [kernel]        [k] llist_add_batch
     1.51%  [kernel]        [k] blk_mq_find_and_get_req
     1.43%  [kernel]        [k] llist_reverse_order
     1.42%  [kernel]        [k] scsi_complete
     1.18%  [kernel]        [k] blk_mq_rq_ctx_init.isra.51
     1.17%  [kernel]        [k] _raw_spin_lock_irqsave
     1.15%  [kernel]        [k] blk_mq_get_driver_tag
     1.09%  [kernel]        [k] read_tsc
     0.97%  [kernel]        [k] native_irq_return_iret
     0.91%  [kernel]        [k] scsi_queue_rq
     0.89%  [kernel]        [k] blk_complete_reqs

Perf top data indicates lock contention in the blk_mq_find_and_get_req() call.

1.31%     1.31%  kworker/57:1H-k  [kernel.vmlinux]
     native_queued_spin_lock_slowpath
     ret_from_fork
     kthread
     worker_thread
     process_one_work
     blk_mq_timeout_work
     blk_mq_queue_tag_busy_iter
     bt_iter
     blk_mq_find_and_get_req
     _raw_spin_lock_irqsave
     native_queued_spin_lock_slowpath


Kernel v5.14 Data -

%Node1 :  8.4 us, 31.2 sy,  0.0 ni, 43.7 id,  0.0 wa,  0.0 hi, 16.8 si,  0.0 st
     4.46%  [kernel]       [k] complete_cmd_fusion
     3.69%  [kernel]       [k] megasas_build_and_issue_cmd_fusion
     2.97%  [kernel]       [k] blk_mq_find_and_get_req
     2.81%  [kernel]       [k] megasas_build_ldio_fusion
     2.62%  [kernel]       [k] syscall_return_via_sysret
     2.17%  [kernel]       [k] __entry_text_start
     2.01%  [kernel]       [k] io_submit_one
     1.87%  [kernel]       [k] scsi_queue_rq
     1.77%  [kernel]       [k] native_queued_spin_lock_slowpath
     1.76%  [kernel]       [k] scsi_complete
     1.66%  [kernel]       [k] llist_reverse_order
     1.63%  [kernel]       [k] _raw_spin_lock_irqsave
     1.61%  [kernel]       [k] llist_add_batch
     1.39%  [kernel]       [k] aio_complete_rw
     1.37%  [kernel]       [k] read_tsc
     1.07%  [kernel]       [k] blk_complete_reqs
     1.07%  [kernel]       [k] native_irq_return_iret
     1.04%  [kernel]       [k] __x86_indirect_thunk_rax
     1.03%  fio            [.] __fio_gettime
     1.00%  [kernel]       [k] flush_smp_call_function_queue
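
For reference, the lock showing up under bt_iter in the call stack above
is the per-blk_mq_tags spinlock taken when the busy/timeout iterators pin
a request; a from-memory sketch of the helper in block/blk-mq-tag.c (not
code from this series):

static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
					       unsigned int bitnr)
{
	struct request *rq;
	unsigned long flags;

	/* serialises against the static rqs being freed/cleared */
	spin_lock_irqsave(&tags->lock, flags);
	rq = tags->rqs[bitnr];
	if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
		rq = NULL;
	spin_unlock_irqrestore(&tags->lock, flags);
	return rq;
}

Before this series the shared-sbitmap setup still had a separate struct
blk_mq_tags (and so a separate tags->lock) per hw queue; with a single
shared tags, all hw queues contend on one lock from paths such as
blk_mq_timeout_work(), which could explain the higher
native_queued_spin_lock_slowpath cost here compared with v5.14.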


Test #2: Three VDs (each VD consists of 8 SAS SSDs).
(numactl -N 1 fio 3vd.fio --rw=randread --bs=4k --iodepth=32 --numjobs=8
--ioscheduler=none/mq-deadline)

There is a performance regression but it is not due to this patch set.
Kernel v5.11 gives 2.1M IOPS on mq-deadline but 5.15 (without this
patchset) gives 1.8M IOPS.
In this test I did not notice the CPU issue mentioned in Test #1.

In general, I noticed host_busy is incorrect once I apply this patchset.
It should not be more than can_queue, but the sysfs host_busy value is
very high when IOs are running. This issue appears only after applying
this patchset.

Does this patch set only change the behavior of <shared_host_tag> enabled
drivers? Will there be any impact on the mpi3mr driver? I can test that
as well.

Kashyap

>
> >
> >  From checking differences between those kernels, I don't see anything
> > directly relevant in sbitmap support or in the megaraid sas driver.
> >
> > Thanks,
> > John
Bart Van Assche Oct. 8, 2021, 3:11 a.m. UTC | #6
On 10/7/21 13:31, Kashyap Desai wrote:
> I tested this patchset on 5.15-rc4 (master) -
> https://github.com/torvalds/linux.git
> 
> #1 I noticed some performance regression @mq-deadline scheduler which is not
> related to this series. I will bisect and get more detail about this issue
> separately.

Please test this patch series on top of Jens' for-next branch 
(git://git.kernel.dk/linux-block). The mq-deadline performance on Jens' 
for-next branch should match that of kernel v5.13.

Thanks,

Bart.
John Garry Oct. 8, 2021, 8:07 a.m. UTC | #7
On 08/10/2021 04:11, Bart Van Assche wrote:

+

> On 10/7/21 13:31, Kashyap Desai wrote:
>> I tested this patchset on 5.15-rc4 (master) -
>> https://github.com/torvalds/linux.git
>>
>> #1 I noticed some performance regression @mq-deadline scheduler which 
>> is not
>> related to this series. I will bisect and get more detail about this 
>> issue
>> separately.
> 
> Please test this patch series on top of Jens' for-next branch 
> (git://git.kernel.dk/linux-block). The mq-deadline performance on Jens' 
> for-next branch should match that of kernel v5.13.

Kashyap did also mention earlier that he saw the drop from v5.11 -> v5.12.

The only thing I saw possibly related there was Jan's work in 
https://lore.kernel.org/linux-block/20210111164717.21937-1-jack@suse.cz/
to fix a performance regression witnessed on megaraid sas for slower media.

But I am quite doubtful that this is the issue.

Thanks,
John
John Garry Oct. 8, 2021, 10:17 a.m. UTC | #8
On 07/10/2021 21:31, Kashyap Desai wrote:
> Perf top data indicates lock contention in "blk_mq_find_and_get_req" call.
> 
> 1.31%     1.31%  kworker/57:1H-k  [kernel.vmlinux]
>       native_queued_spin_lock_slowpath
>       ret_from_fork
>       kthread
>       worker_thread
>       process_one_work
>       blk_mq_timeout_work
>       blk_mq_queue_tag_busy_iter
>       bt_iter
>       blk_mq_find_and_get_req
>       _raw_spin_lock_irqsave
>       native_queued_spin_lock_slowpath
> 
> 
> Kernel v5.14 Data -
> 
> %Node1 :  8.4 us, 31.2 sy,  0.0 ni, 43.7 id,  0.0 wa,  0.0 hi, 16.8 si,  0.0
> st
>       4.46%  [kernel]       [k] complete_cmd_fusion
>       3.69%  [kernel]       [k] megasas_build_and_issue_cmd_fusion
>       2.97%  [kernel]       [k] blk_mq_find_and_get_req
>       2.81%  [kernel]       [k] megasas_build_ldio_fusion
>       2.62%  [kernel]       [k] syscall_return_via_sysret
>       2.17%  [kernel]       [k] __entry_text_start
>       2.01%  [kernel]       [k] io_submit_one
>       1.87%  [kernel]       [k] scsi_queue_rq
>       1.77%  [kernel]       [k] native_queued_spin_lock_slowpath
>       1.76%  [kernel]       [k] scsi_complete
>       1.66%  [kernel]       [k] llist_reverse_order
>       1.63%  [kernel]       [k] _raw_spin_lock_irqsave
>       1.61%  [kernel]       [k] llist_add_batch
>       1.39%  [kernel]       [k] aio_complete_rw
>       1.37%  [kernel]       [k] read_tsc
>       1.07%  [kernel]       [k] blk_complete_reqs
>       1.07%  [kernel]       [k] native_irq_return_iret
>       1.04%  [kernel]       [k] __x86_indirect_thunk_rax
>       1.03%  fio            [.] __fio_gettime
>       1.00%  [kernel]       [k] flush_smp_call_function_queue
> 
> 
> Test #2: Three VDs (each VD consist of 8 SAS SSDs).
> (numactl -N 1 fio
> 3vd.fio --rw=randread --bs=4k --iodepth=32 --numjobs=8
> --ioscheduler=none/mq-deadline)
> 
> There is a performance regression but it is not due to this patch set.
> Kernel v5.11 gives 2.1M IOPs on mq-deadline but 5.15 (without this patchset)
> gives 1.8M IOPs.
> In this test I did not noticed CPU issue as mentioned in Test-1.
> 
> In general, I noticed host_busy is incorrect once I apply this patchset. It
> should not be more than can_queue, but sysfs host_busy value is very high
> when IOs are running. This issue is only after applying this patchset.
> 
> Is this patch set only change the behavior of <shared_host_tag> enabled
> driver ? Will there be any impact on mpi3mr driver ? I can test that as
> well.

I can see where the high value of host_busy is coming from in this
series - we incorrectly iterate the same tags once per hw queue in
blk_mq_tagset_busy_iter() - d'oh.
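
For reference, the sysfs host_busy value comes from walking the tagset; a
from-memory sketch of scsi_host_busy() (drivers/scsi/hosts.c, not code
from this series) looks roughly like:

#include <linux/blk-mq.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>

static bool scsi_host_check_in_flight(struct request *rq, void *data,
				      bool reserved)
{
	int *count = data;
	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);

	/* count commands currently owned by the LLD */
	if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state))
		(*count)++;

	return true;
}

int scsi_host_busy(struct Scsi_Host *shost)
{
	int cnt = 0;

	/* visits tagset->tags[i] for every hw queue */
	blk_mq_tagset_busy_iter(&shost->tag_set, scsi_host_check_in_flight, &cnt);
	return cnt;
}

With shared tags every tagset->tags[i] points at the same struct
blk_mq_tags, so iterating once per hw queue counts each in-flight command
nr_hw_queues times - consistent with host_busy being reported well above
can_queue.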

Please try the patch below. I have looked at other places where we may
have similar problems in looping over the hw queue count for
tagset->tags[], and they look ok. But I will double-check. I think that
blk_mq_queue_tag_busy_iter() should be fine - Ming?

--->8----

From e6ecaa6d624ebb903fa773ca2a2035300b4c55c5 Mon Sep 17 00:00:00 2001
From: John Garry <john.garry@huawei.com>
Date: Fri, 8 Oct 2021 10:55:11 +0100
Subject: [PATCH] blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags

Since it is now possible for a tagset to share a single set of tags, the
iter function should not re-iter the tags for the count of hw queues in
that case. Rather it should just iter once.

Signed-off-by: John Garry <john.garry@huawei.com>

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 72a2724a4eee..ef888aab81b3 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -378,9 +378,15 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
 {
+	int nr_hw_queues;
 	int i;
 
-	for (i = 0; i < tagset->nr_hw_queues; i++) {
+	if (blk_mq_is_shared_tags(tagset->flags))
+		nr_hw_queues = 1;
+	else
+		nr_hw_queues = tagset->nr_hw_queues;
+
+	for (i = 0; i < nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
 					      BT_TAG_ITER_STARTED);

----8<----

Thanks,
john