[v4,0/4] io_uring/rsrc: coalescing multi-hugepage registered buffers

Message ID 20240514075444.590910-1-cliang01.li@samsung.com (mailing list archive)

Message

Chenliang Li May 14, 2024, 7:54 a.m. UTC
Registered buffers are stored and processed as a bvec array. Each bvec
element typically points to a PAGE_SIZE page, but it can also point to a
hugepage. Specifically, a buffer consisting of a single hugepage is
coalesced into one hugepage bvec entry during registration. This
coalescing saves both space and DMA-mapping time.

However, coalescing currently doesn't work for multi-hugepage buffers.
A buffer made up of several 2M hugepages is still split into thousands
of 4K page bvec entries, when in fact a handful of hugepage bvecs
would do.

This patch series enables coalescing of registered buffers that span more
than one hugepage. It reduces DMA-mapping time and saves memory for this
kind of buffer.
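
To make the scale concrete, here is a minimal sketch of the entry-count
arithmetic. The helper name bvec_entries is hypothetical (it is not a kernel
function), and the sketch assumes each bvec entry covers one segment that is
aligned to its own size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of the kernel: number of bvec entries
 * needed to cover `len` bytes that start `off` bytes into the first
 * segment, when every entry covers one aligned segment of `seg_size` bytes.
 */
static size_t bvec_entries(size_t off, size_t len, size_t seg_size)
{
    assert(seg_size != 0);
    if (len == 0)
        return 0;
    /* Round up after accounting for the offset into the first segment. */
    return (off % seg_size + len + seg_size - 1) / seg_size;
}
```

Under those assumptions, bvec_entries(0, 8UL << 20, 4UL << 10) gives 2048
entries for the 8M test buffer, while bvec_entries(0, 8UL << 20, 2UL << 20)
gives 4.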

Testing:

The hugepage fixed buffer I/O can be tested using fio without
modification. The fio command used in the following test is given
in [1]. There's also a liburing testcase in [2]. Also, the system
should have enough hugepages available before testing.
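
One possible way to check hugepage availability up front is to look at the
HugePages_Free field of /proc/meminfo. The sketch below only parses text in
that format; the function names are hypothetical, and how many hugepages a
given fio job actually needs depends on its buffer layout, which is not
covered here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return the number that follows `key` (for example "HugePages_Free:") in
 * /proc/meminfo-style text, or -1 if the key is missing or malformed.
 */
static long meminfo_value(const char *text, const char *key)
{
    const char *p = strstr(text, key);
    long val;

    if (!p || sscanf(p + strlen(key), "%ld", &val) != 1)
        return -1;
    return val;
}

/* Hugepages needed to back `buf_len` bytes, rounding up. */
static long hugepages_needed(unsigned long buf_len, unsigned long hp_size)
{
    assert(hp_size != 0);
    return (long)((buf_len + hp_size - 1) / hp_size);
}

/* 1 if the free hugepage count covers the buffer, 0 otherwise. */
static int enough_hugepages(const char *meminfo, unsigned long buf_len,
                            unsigned long hp_size)
{
    long free_hp = meminfo_value(meminfo, "HugePages_Free:");

    return free_hp >= 0 && free_hp >= hugepages_needed(buf_len, hp_size);
}
```

A caller would read /proc/meminfo into a string and pass it in together with
the total buffer size and the 2M hugepage size.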

Perf diff of 8M (4 * 2M hugepages) fio randread test:

Before          After           Symbol
.....................................................
4.68%                           [k] __blk_rq_map_sg
3.31%                           [k] dma_direct_map_sg
2.64%                           [k] dma_pool_alloc
1.09%                           [k] sg_next
                +0.49%          [k] dma_map_page_attrs

Perf diff of 8M fio randwrite test:

Before          After           Symbol
......................................................
2.82%                           [k] __blk_rq_map_sg
2.05%                           [k] dma_direct_map_sg
1.75%                           [k] dma_pool_alloc
0.68%                           [k] sg_next
                +0.08%          [k] dma_map_page_attrs

The first three patches prepare for adding multi-hugepage coalescing
to buffer registration; the 4th patch enables the feature.

-----------------
Changes since v3:

- Delete unnecessary commit message
- Update test command and test results

v3 : https://lore.kernel.org/io-uring/20240514001614.566276-1-cliang01.li@samsung.com/T/#t

Changes since v2:

- Modify the loop iterator increment to make code cleaner
- Minor fix to the return procedure in coalesced buffer account
- Correct commit messages
- Add test cases in liburing

v2 : https://lore.kernel.org/io-uring/20240513020149.492727-1-cliang01.li@samsung.com/T/#t

Changes since v1:

- Split into 4 patches
- Fix code style issues
- Rearrange the change of code for cleaner look
- Add specialized pinned page accounting procedure for coalesced
  buffers
- Reorder the newly added fields in imu struct for better compaction

v1 : https://lore.kernel.org/io-uring/20240506075303.25630-1-cliang01.li@samsung.com/T/#u

[1]
fio -iodepth=64 -rw=randread(-rw=randwrite) -direct=1 -ioengine=io_uring \
-bs=8M -numjobs=1 -group_reporting -mem=shmhuge -fixedbufs -hugepage-size=2M \
-filename=/dev/nvme0n1 -runtime=10s -name=test1

[2]
https://lore.kernel.org/io-uring/20240514051343.582556-1-cliang01.li@samsung.com/T/#u

Chenliang Li (4):
  io_uring/rsrc: add hugepage buffer coalesce helpers
  io_uring/rsrc: store folio shift and mask into imu
  io_uring/rsrc: add init and account functions for coalesced imus
  io_uring/rsrc: enable multi-hugepage buffer coalescing

 io_uring/rsrc.c | 217 +++++++++++++++++++++++++++++++++++++++---------
 io_uring/rsrc.h |  12 +++
 2 files changed, 191 insertions(+), 38 deletions(-)


base-commit: 59b28a6e37e650c0d601ed87875b6217140cda5d

Comments

Anuj gupta May 16, 2024, 2:01 p.m. UTC | #1
I tested this series by registering multi-hugepage buffers. The coalescing
helps save DMA-mapping time. This is the gain observed on my setup while
running the fio workload shared here.

RandomRead:
Baseline        DeltaAbs        Symbol
.....................................................
3.89%            -3.62%            [k] blk_rq_map_sg
3.58%            -3.23%            [k] dma_direct_map_sg
2.25%            -2.23%            [k] sg_next

RandomWrite:
Baseline        DeltaAbs        Symbol
.....................................................
2.46%            -2.31%            [k] dma_direct_map_sg
2.06%            -2.05%            [k] sg_next
2.08%            -1.80%            [k] blk_rq_map_sg

The liburing test case shared works fine too on my setup.

Feel free to add:
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
--
Anuj Gupta
Jens Axboe May 16, 2024, 2:58 p.m. UTC | #2
On 5/16/24 8:01 AM, Anuj gupta wrote:
> 
> I tested this series by registering multi-hugepage buffers. The coalescing helps
> saving dma-mapping time. This is the gain observed on my setup, while running
> the fio workload shared here.
> 
> RandomRead:
> Baseline        DeltaAbs        Symbol
> .....................................................
> 3.89%            -3.62%            [k] blk_rq_map_sg
> 3.58%            -3.23%            [k] dma_direct_map_sg
> 2.25%            -2.23%            [k] sg_next
> 
> RandomWrite:
> Baseline        DeltaAbs        Symbol
> .....................................................
> 2.46%            -2.31%            [k] dma_direct_map_sg
> 2.06%            -2.05%            [k] sg_next
> 2.08%            -1.80%            [k] blk_rq_map_sg
> 
> The liburing test case shared works fine too on my setup.
> 
> Feel free to add:
> Tested-by: Anuj Gupta <anuj20.g@samsung.com>

It's even more dramatic here, excerpt from profiles:

    32.16%    -25.46%  [kernel.kallsyms]  [k] bio_split_rw
     8.92%     -8.38%  [kernel.kallsyms]  [k] iov_iter_is_aligned
     6.85%     -4.31%  [nvme]             [k] nvme_prep_rq.part.0
    14.71%             [kernel.kallsyms]  [k] __blk_rq_map_sg
     9.49%             [kernel.kallsyms]  [k] dma_direct_map_sg
     8.50%             [kernel.kallsyms]  [k] sg_next

some of it just shifted, but definitely a huge win. This is just using
a single drive, doing about 7GB/sec.

The change looks pretty reasonable to me. I'd love for the test cases to
try and hit corner cases, as it's really more of a functionality test
right now. We should include things like one-off huge pages, ensure we
don't coalesce where we should not, etc.

This is obviously too late for the 6.10 merge window, so there's plenty
of time to get this 100% sorted before the next kernel release.
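
As a rough illustration of the "don't coalesce where we should not" cases, the
following userspace sketch models one possible eligibility check. It is not
the code from this series: struct pinned_page and coalesced_segments are
hypothetical names, and the model assumes each folio is a naturally aligned
run of page frame numbers.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a pinned page: its page frame number and the
 * order of the folio it belongs to (0 = 4K page, 9 = 2M hugepage with 4K
 * base pages). Not a kernel structure.
 */
struct pinned_page {
    unsigned long pfn;
    unsigned int order;
};

/*
 * Return the number of hugepage-sized segments the pages collapse into,
 * or 0 if they have to stay as individual entries.
 */
static size_t coalesced_segments(const struct pinned_page *pages, size_t n)
{
    size_t segs = 1;
    unsigned long per_folio;

    assert(pages != NULL || n == 0);
    if (n == 0 || pages[0].order == 0)
        return 0;                   /* nothing to do, or plain 4K pages */
    per_folio = 1UL << pages[0].order;

    for (size_t i = 1; i < n; i++) {
        unsigned long prev = pages[i - 1].pfn, cur = pages[i].pfn;

        if (pages[i].order != pages[0].order)
            return 0;               /* mixed folio sizes */
        if (cur == prev + 1 && cur / per_folio == prev / per_folio)
            continue;               /* next page of the same folio */
        if ((prev + 1) % per_folio == 0 && cur % per_folio == 0) {
            segs++;                 /* clean hop to the start of a folio */
            continue;
        }
        return 0;                   /* hole, or a folio left part-way */
    }
    return segs;
}
```

In this model an 8M buffer backed by four 2M hugepages yields 4 segments,
a buffer that starts in the middle of its first hugepage is still accepted,
and a single stray 4K page or a gap inside a hugepage yields 0.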
Chenliang Li May 30, 2024, 5:10 a.m. UTC | #3
On Thu, 16 May 2024 08:58:03 -0600, Jens Axboe wrote:
> The change looks pretty reasonable to me. I'd love for the test cases to
> try and hit corner cases, as it's really more of a functionality test
> right now. We should include things like one-off huge pages, ensure we
> don't coalesce where we should not, etc.

Hi Jens, the testcases are updated here:
https://lore.kernel.org/io-uring/20240530031548.1401768-1-cliang01.li@samsung.com/T/#u
I added several corner cases this time, and they work fine. Please take a look.
Anuj gupta June 4, 2024, 1:33 p.m. UTC | #4
On Thu, May 30, 2024 at 10:41 AM Chenliang Li <cliang01.li@samsung.com> wrote:
>
> On Thu, 16 May 2024 08:58:03 -0600, Jens Axboe wrote:
> > The change looks pretty reasonable to me. I'd love for the test cases to
> > try and hit corner cases, as it's really more of a functionality test
> > right now. We should include things like one-off huge pages, ensure we
> > don't coalesce where we should not, etc.
>
> Hi Jens, the testcases are updated here:
> https://lore.kernel.org/io-uring/20240530031548.1401768-1-cliang01.li@samsung.com/T/#u
> Add several corner cases this time, works fine. Please take a look.

The additional test cases shared here [1] work fine on my setup too.
Tested-by: Anuj Gupta <anuj20.g@samsung.com>

[1] https://lore.kernel.org/io-uring/20240531052023.1446914-1-cliang01.li@samsung.com/

Thanks,
Anuj Gupta
Chenliang Li June 13, 2024, 2:49 a.m. UTC | #5
On Thu, 30 May 2024 13:10:44 +0800, Chenliang Li wrote:
> On Thu, 16 May 2024 08:58:03 -0600, Jens Axboe wrote:
>> The change looks pretty reasonable to me. I'd love for the test cases to
>> try and hit corner cases, as it's really more of a functionality test
>> right now. We should include things like one-off huge pages, ensure we
>> don't coalesce where we should not, etc.
>
> Hi Jens, the testcases are updated here:
> https://lore.kernel.org/io-uring/20240530031548.1401768-1-cliang01.li@samsung.com/T/#u
> Add several corner cases this time, works fine. Please take a look.

Hi, a gentle ping here.
The latest liburing testcase: https://lore.kernel.org/io-uring/20240531052023.1446914-1-cliang01.li@samsung.com/
Jens Axboe June 16, 2024, 2:54 a.m. UTC | #6
On 6/12/24 8:49 PM, Chenliang Li wrote:
> On Thu, 30 May 2024 13:10:44 +0800, Chenliang Li wrote:
>> On Thu, 16 May 2024 08:58:03 -0600, Jens Axboe wrote:
>>> The change looks pretty reasonable to me. I'd love for the test cases to
>>> try and hit corner cases, as it's really more of a functionality test
>>> right now. We should include things like one-off huge pages, ensure we
>>> don't coalesce where we should not, etc.
>>
>> Hi Jens, the testcases are updated here:
>> https://lore.kernel.org/io-uring/20240530031548.1401768-1-cliang01.li@samsung.com/T/#u
>> Add several corner cases this time, works fine. Please take a look.
> 
> Hi, a gentle ping here.
> The latest liburing testcase: https://lore.kernel.org/io-uring/20240531052023.1446914-1-cliang01.li@samsung.com/

I'll take a look on Monday, thanks.