Message ID: 1633429419-228500-1-git-send-email-john.garry@huawei.com
Series: blk-mq: Reduce static requests memory footprint for shared sbitmap
On 10/5/21 4:23 AM, John Garry wrote:
> Currently a full set of static requests are allocated per hw queue per
> tagset when shared sbitmap is used.
>
> However, only tagset->queue_depth number of requests may be active at
> any given time. As such, only tagset->queue_depth number of static
> requests are required.
>
> The same goes for using an IO scheduler, which allocates a full set of
> static requests per hw queue per request queue.
>
> This series changes shared sbitmap support by using a shared tags per
> tagset and request queue. Ming suggested something along those lines in
> v1 review. In using a shared tags, the static rqs also become shared,
> reducing the number of sets of static rqs, reducing memory usage.
>
> Patch "blk-mq: Use shared tags for shared sbitmap support" is a bit big,
> and could potentially be broken down. But then maintaining ability to
> bisect becomes harder and each sub-patch would get more convoluted.
>
> For megaraid sas driver on my 128-CPU arm64 system with 1x SATA disk, we
> save approx. 300MB(!) [370MB -> 60MB]
>
> Baseline is 1b2d1439fc25 (block/for-next) Merge branch 'for-5.16/io_uring'
> into for-next

Let's get this queued up for testing, thanks John.
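The saving quoted in the cover letter comes from collapsing the per-hw-queue copies of the static request array down to one shared copy per tag set. A minimal back-of-envelope sketch of that arithmetic follows; the 128 hw queues match the 128-CPU test system above, but the queue depth and per-request size are illustrative assumptions, not values taken from the thread.

```c
/*
 * Back-of-envelope model of the static request memory before and after the
 * series. NR_HW_QUEUES matches the 128-CPU test system; QUEUE_DEPTH and
 * RQ_SIZE are assumptions (the real values depend on the driver's
 * can_queue and cmd_size).
 */
#include <stdio.h>

#define NR_HW_QUEUES	128UL	/* one hw queue per CPU */
#define QUEUE_DEPTH	4096UL	/* tags in the shared sbitmap (assumed) */
#define RQ_SIZE		600UL	/* struct request + driver payload (assumed) */

int main(void)
{
	/* Before: every hw queue holds a full set of static requests. */
	unsigned long before = NR_HW_QUEUES * QUEUE_DEPTH * RQ_SIZE;

	/* After: one shared set of static requests per tag set. */
	unsigned long after = QUEUE_DEPTH * RQ_SIZE;

	printf("per-tagset static rqs: before ~%lu MB, after ~%lu MB\n",
	       before >> 20, after >> 20);
	return 0;
}
```

Under these assumed numbers the model gives roughly 300 MB versus 2 MB, which is in the same ballpark as the ~300 MB saving reported for the megaraid_sas setup.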
On 05/10/2021 13:35, Jens Axboe wrote:
>> Baseline is 1b2d1439fc25 (block/for-next) Merge branch 'for-5.16/io_uring'
>> into for-next
> Let's get this queued up for testing, thanks John.

Cheers, appreciated

@Kashyap, You mentioned that when testing you saw a performance
regression from v5.11 -> v5.12 - any idea on that yet? Can you describe
the scenario, like IO scheduler and how many disks and the type? Does
disabling host_tagset_enable restore performance?

From checking differences between those kernels, I don't see anything
directly relevant in sbitmap support or in the megaraid sas driver.

Thanks,
John
> -----Original Message-----
> From: John Garry [mailto:john.garry@huawei.com]
> Sent: Tuesday, October 5, 2021 7:05 PM
> To: Jens Axboe <axboe@kernel.dk>; kashyap.desai@broadcom.com
> Cc: linux-block@vger.kernel.org; linux-kernel@vger.kernel.org;
> ming.lei@redhat.com; hare@suse.de; linux-scsi@vger.kernel.org
> Subject: Re: [PATCH v5 00/14] blk-mq: Reduce static requests memory
> footprint for shared sbitmap
>
> On 05/10/2021 13:35, Jens Axboe wrote:
> >> Baseline is 1b2d1439fc25 (block/for-next) Merge branch 'for-5.16/io_uring'
> >> into for-next
> > Let's get this queued up for testing, thanks John.
>
> Cheers, appreciated
>
> @Kashyap, You mentioned that when testing you saw a performance
> regression from v5.11 -> v5.12 - any idea on that yet? Can you describe
> the scenario, like IO scheduler and how many disks and the type? Does
> disabling host_tagset_enable restore performance?

John - I am still working on this. System was not available due to some
other debugging.

>
> From checking differences between those kernels, I don't see anything
> directly relevant in sbitmap support or in the megaraid sas driver.
>
> Thanks,
> John
On 10/5/21 7:34 AM, John Garry wrote:
> On 05/10/2021 13:35, Jens Axboe wrote:
>>> Baseline is 1b2d1439fc25 (block/for-next) Merge branch 'for-5.16/io_uring'
>>> into for-next
>> Let's get this queued up for testing, thanks John.
>
> Cheers, appreciated
>
> @Kashyap, You mentioned that when testing you saw a performance
> regression from v5.11 -> v5.12 - any idea on that yet? Can you describe
> the scenario, like IO scheduler and how many disks and the type? Does
> disabling host_tagset_enable restore performance?

FWIW, I ran my usual peak testing on this and didn't observe any
regressions. Caveat being that a) no scheduler is involved, and b) no
shared tags. But at least that fast case/path is fine.
> > -----Original Message-----
> > From: John Garry [mailto:john.garry@huawei.com]
> > Sent: Tuesday, October 5, 2021 7:05 PM
> > To: Jens Axboe <axboe@kernel.dk>; kashyap.desai@broadcom.com
> > Cc: linux-block@vger.kernel.org; linux-kernel@vger.kernel.org;
> > ming.lei@redhat.com; hare@suse.de; linux-scsi@vger.kernel.org
> > Subject: Re: [PATCH v5 00/14] blk-mq: Reduce static requests memory
> > footprint for shared sbitmap
> >
> > On 05/10/2021 13:35, Jens Axboe wrote:
> > >> Baseline is 1b2d1439fc25 (block/for-next) Merge branch 'for-5.16/io_uring'
> > >> into for-next
> > > Let's get this queued up for testing, thanks John.
> >
> > Cheers, appreciated
> >
> > @Kashyap, You mentioned that when testing you saw a performance
> > regression from v5.11 -> v5.12 - any idea on that yet? Can you
> > describe the scenario, like IO scheduler and how many disks and the
> > type? Does disabling host_tagset_enable restore performance?
>
> John - I am still working on this. System was not available due to some
> other debugging.

John - I tested this patchset on 5.15-rc4 (master) -
https://github.com/torvalds/linux.git

#1 I noticed some performance regression with the mq-deadline scheduler
which is not related to this series. I will bisect and get more detail
about this issue separately.

#2 W.r.t. this patchset, I noticed one issue related to high CPU usage in
a certain case. I covered the test on the same setup using the same h/w.
I tested on an Aero MegaRaid controller.

Test #1: Total 24 SAS SSDs in JBOD mode.
(numactl -N 1 fio 24.fio --rw=randread --bs=4k --iodepth=256 --numjobs=1
--ioscheduler=none/mq-deadline)

No performance regression is noticed using this patchset. I can get 3.1M
IOPs (max IOPs on this setup). I noticed a CPU hogging issue if the iodepth
from the application is high.

CPU usage data from (top):

%Node1 : 6.4 us, 57.5 sy, 0.0 ni, 23.7 id, 0.0 wa, 0.0 hi, 12.4 si, 0.0 st

Perf top data -

    19.11%  [kernel]        [k] native_queued_spin_lock_slowpath
     4.72%  [megaraid_sas]  [k] complete_cmd_fusion
     3.70%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
     2.76%  [megaraid_sas]  [k] megasas_build_ldio_fusion
     2.16%  [kernel]        [k] syscall_return_via_sysret
     2.16%  [kernel]        [k] entry_SYSCALL_64
     1.87%  [megaraid_sas]  [k] megasas_queue_command
     1.58%  [kernel]        [k] io_submit_one
     1.53%  [kernel]        [k] llist_add_batch
     1.51%  [kernel]        [k] blk_mq_find_and_get_req
     1.43%  [kernel]        [k] llist_reverse_order
     1.42%  [kernel]        [k] scsi_complete
     1.18%  [kernel]        [k] blk_mq_rq_ctx_init.isra.51
     1.17%  [kernel]        [k] _raw_spin_lock_irqsave
     1.15%  [kernel]        [k] blk_mq_get_driver_tag
     1.09%  [kernel]        [k] read_tsc
     0.97%  [kernel]        [k] native_irq_return_iret
     0.91%  [kernel]        [k] scsi_queue_rq
     0.89%  [kernel]        [k] blk_complete_reqs

Perf top data indicates lock contention in the "blk_mq_find_and_get_req" call.
     1.31%  1.31%  kworker/57:1H-k  [kernel.vmlinux]
            native_queued_spin_lock_slowpath
            ret_from_fork
            kthread
            worker_thread
            process_one_work
            blk_mq_timeout_work
            blk_mq_queue_tag_busy_iter
            bt_iter
            blk_mq_find_and_get_req
            _raw_spin_lock_irqsave
            native_queued_spin_lock_slowpath

Kernel v5.14 data -

%Node1 : 8.4 us, 31.2 sy, 0.0 ni, 43.7 id, 0.0 wa, 0.0 hi, 16.8 si, 0.0 st

     4.46%  [kernel]  [k] complete_cmd_fusion
     3.69%  [kernel]  [k] megasas_build_and_issue_cmd_fusion
     2.97%  [kernel]  [k] blk_mq_find_and_get_req
     2.81%  [kernel]  [k] megasas_build_ldio_fusion
     2.62%  [kernel]  [k] syscall_return_via_sysret
     2.17%  [kernel]  [k] __entry_text_start
     2.01%  [kernel]  [k] io_submit_one
     1.87%  [kernel]  [k] scsi_queue_rq
     1.77%  [kernel]  [k] native_queued_spin_lock_slowpath
     1.76%  [kernel]  [k] scsi_complete
     1.66%  [kernel]  [k] llist_reverse_order
     1.63%  [kernel]  [k] _raw_spin_lock_irqsave
     1.61%  [kernel]  [k] llist_add_batch
     1.39%  [kernel]  [k] aio_complete_rw
     1.37%  [kernel]  [k] read_tsc
     1.07%  [kernel]  [k] blk_complete_reqs
     1.07%  [kernel]  [k] native_irq_return_iret
     1.04%  [kernel]  [k] __x86_indirect_thunk_rax
     1.03%  fio       [.] __fio_gettime
     1.00%  [kernel]  [k] flush_smp_call_function_queue

Test #2: Three VDs (each VD consists of 8 SAS SSDs).
(numactl -N 1 fio 3vd.fio --rw=randread --bs=4k --iodepth=32 --numjobs=8
--ioscheduler=none/mq-deadline)

There is a performance regression but it is not due to this patch set.
Kernel v5.11 gives 2.1M IOPs on mq-deadline but 5.15 (without this
patchset) gives 1.8M IOPs. In this test I did not notice the CPU issue
mentioned in Test #1.

In general, I noticed host_busy is incorrect once I apply this patchset.
It should not be more than can_queue, but the sysfs host_busy value is
very high when IOs are running. This issue appears only after applying
this patchset.

Does this patch set only change the behavior of <shared_host_tag> enabled
drivers? Will there be any impact on the mpi3mr driver? I can test that
as well.

Kashyap

>
> >
> > From checking differences between those kernels, I don't see anything
> > directly relevant in sbitmap support or in the megaraid sas driver.
> >
> > Thanks,
> > John
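The kworker call chain in the report above ends in blk_mq_find_and_get_req(), which bt_iter() uses to resolve a set tag bit back to a request. A rough sketch of that helper as it exists in the v5.15-era block layer is below; this is a paraphrase from memory, not a verbatim copy, so treat the exact checks as an assumption.

```c
/*
 * Rough sketch (not verbatim) of blk_mq_find_and_get_req() in
 * block/blk-mq-tag.c around v5.15: the timeout worker's bt_iter() calls
 * this for every in-flight tag it walks, and each call takes tags->lock.
 */
static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
					       unsigned int bitnr)
{
	struct request *rq;
	unsigned long flags;

	/* Guards against the request being freed/reused under the iterator. */
	spin_lock_irqsave(&tags->lock, flags);
	rq = tags->rqs[bitnr];
	if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
		rq = NULL;
	spin_unlock_irqrestore(&tags->lock, flags);
	return rq;
}
```

With a single shared tags structure there is only one such lock for the whole host, so every iterator pass serialises on it; that is one plausible reading of why native_queued_spin_lock_slowpath dominates the high-iodepth profile above.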
On 10/7/21 13:31, Kashyap Desai wrote:
> I tested this patchset on 5.15-rc4 (master) -
> https://github.com/torvalds/linux.git
>
> #1 I noticed some performance regression with the mq-deadline scheduler
> which is not related to this series. I will bisect and get more detail
> about this issue separately.

Please test this patch series on top of Jens' for-next branch
(git://git.kernel.dk/linux-block). The mq-deadline performance on Jens'
for-next branch should match that of kernel v5.13.

Thanks,

Bart.
On 08/10/2021 04:11, Bart Van Assche wrote:
> On 10/7/21 13:31, Kashyap Desai wrote:
>> I tested this patchset on 5.15-rc4 (master) -
>> https://github.com/torvalds/linux.git
>>
>> #1 I noticed some performance regression with the mq-deadline scheduler
>> which is not related to this series. I will bisect and get more detail
>> about this issue separately.
>
> Please test this patch series on top of Jens' for-next branch
> (git://git.kernel.dk/linux-block). The mq-deadline performance on Jens'
> for-next branch should match that of kernel v5.13.

Kashyap did also mention earlier that he saw the drop from v5.11 -> v5.12.
The only thing I saw possibly related there was Jan's work in
https://lore.kernel.org/linux-block/20210111164717.21937-1-jack@suse.cz/
to fix a performance regression witnessed on megaraid sas for slower media.
But I am quite doubtful that this is the issue.

Thanks,
John
On 07/10/2021 21:31, Kashyap Desai wrote:
> Perf top data indicates lock contention in the "blk_mq_find_and_get_req" call.
>
>      1.31%  1.31%  kworker/57:1H-k  [kernel.vmlinux]
>             native_queued_spin_lock_slowpath
>             ret_from_fork
>             kthread
>             worker_thread
>             process_one_work
>             blk_mq_timeout_work
>             blk_mq_queue_tag_busy_iter
>             bt_iter
>             blk_mq_find_and_get_req
>             _raw_spin_lock_irqsave
>             native_queued_spin_lock_slowpath
>
> Kernel v5.14 data -
>
> %Node1 : 8.4 us, 31.2 sy, 0.0 ni, 43.7 id, 0.0 wa, 0.0 hi, 16.8 si, 0.0 st
>
>      4.46%  [kernel]  [k] complete_cmd_fusion
>      3.69%  [kernel]  [k] megasas_build_and_issue_cmd_fusion
>      2.97%  [kernel]  [k] blk_mq_find_and_get_req
>      2.81%  [kernel]  [k] megasas_build_ldio_fusion
>      2.62%  [kernel]  [k] syscall_return_via_sysret
>      2.17%  [kernel]  [k] __entry_text_start
>      2.01%  [kernel]  [k] io_submit_one
>      1.87%  [kernel]  [k] scsi_queue_rq
>      1.77%  [kernel]  [k] native_queued_spin_lock_slowpath
>      1.76%  [kernel]  [k] scsi_complete
>      1.66%  [kernel]  [k] llist_reverse_order
>      1.63%  [kernel]  [k] _raw_spin_lock_irqsave
>      1.61%  [kernel]  [k] llist_add_batch
>      1.39%  [kernel]  [k] aio_complete_rw
>      1.37%  [kernel]  [k] read_tsc
>      1.07%  [kernel]  [k] blk_complete_reqs
>      1.07%  [kernel]  [k] native_irq_return_iret
>      1.04%  [kernel]  [k] __x86_indirect_thunk_rax
>      1.03%  fio       [.] __fio_gettime
>      1.00%  [kernel]  [k] flush_smp_call_function_queue
>
> Test #2: Three VDs (each VD consists of 8 SAS SSDs).
> (numactl -N 1 fio 3vd.fio --rw=randread --bs=4k --iodepth=32 --numjobs=8
> --ioscheduler=none/mq-deadline)
>
> There is a performance regression but it is not due to this patch set.
> Kernel v5.11 gives 2.1M IOPs on mq-deadline but 5.15 (without this
> patchset) gives 1.8M IOPs. In this test I did not notice the CPU issue
> mentioned in Test #1.
>
> In general, I noticed host_busy is incorrect once I apply this patchset.
> It should not be more than can_queue, but the sysfs host_busy value is
> very high when IOs are running. This issue appears only after applying
> this patchset.
>
> Does this patch set only change the behavior of <shared_host_tag> enabled
> drivers? Will there be any impact on the mpi3mr driver? I can test that
> as well.

I can see where the high value of host_busy is coming from in this series -
we incorrectly re-iter the tags #hw queues times in
blk_mq_tagset_busy_iter() - d'oh.

Please try the below patch. I have looked at other places where we may have
similar problems in looping the hw queue count for tagset->tags[], and they
look ok. But I will double-check.

I think that blk_mq_queue_tag_busy_iter() should be fine - Ming?

--->8----

From e6ecaa6d624ebb903fa773ca2a2035300b4c55c5 Mon Sep 17 00:00:00 2001
From: John Garry <john.garry@huawei.com>
Date: Fri, 8 Oct 2021 10:55:11 +0100
Subject: [PATCH] blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags

Since it is now possible for a tagset to share a single set of tags, the
iter function should not re-iter the tags for the count of hw queues in
that case. Rather it should just iter once.
Signed-off-by: John Garry <john.garry@huawei.com>

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 72a2724a4eee..ef888aab81b3 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -378,9 +378,15 @@ void blk_mq_all_tag_iter(struct blk_mq_tags *tags, busy_tag_iter_fn *fn,
 void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 		busy_tag_iter_fn *fn, void *priv)
 {
+	int nr_hw_queues;
 	int i;

-	for (i = 0; i < tagset->nr_hw_queues; i++) {
+	if (blk_mq_is_shared_tags(tagset->flags))
+		nr_hw_queues = 1;
+	else
+		nr_hw_queues = tagset->nr_hw_queues;
+
+	for (i = 0; i < nr_hw_queues; i++) {
 		if (tagset->tags && tagset->tags[i])
 			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
 					      BT_TAG_ITER_STARTED);

----8<----

Thanks,
john
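For context on the host_busy observation, the sysfs host_busy value is derived by walking the tag set, so if the iterator visits the same shared tags once per hw queue, each in-flight command is counted nr_hw_queues times. The sketch below is a rough paraphrase of the v5.15-era drivers/scsi/hosts.c code, not a verbatim copy:

```c
/*
 * Rough paraphrase (not verbatim) of how scsi_host_busy() computes the
 * sysfs host_busy value in this era of the kernel: every request that
 * blk_mq_tagset_busy_iter() visits bumps the count, so visiting one
 * shared tags array once per hw queue multiplies the result.
 */
static bool scsi_host_check_in_flight(struct request *rq, void *data,
				      bool reserved)
{
	int *count = data;
	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);

	if (test_bit(SCMD_STATE_INFLIGHT, &cmd->state))
		(*count)++;

	return true;
}

int scsi_host_busy(struct Scsi_Host *shost)
{
	int cnt = 0;

	blk_mq_tagset_busy_iter(&shost->tag_set, scsi_host_check_in_flight,
				&cnt);
	return cnt;
}
```

With the patch above making blk_mq_tagset_busy_iter() walk the shared tags only once, the reported count should drop back to at most can_queue.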