[0/2] dmapool performance enhancements

Message ID: 20220428202714.17630-1-kbusch@kernel.org

Message

Keith Busch April 28, 2022, 8:27 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

Allocating and freeing blocks from the dmapool currently iterates a list of
all allocated pages. This series replaces that per-alloc/per-free list
traversal with a constant-time lookup.
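
As a rough illustration of the idea (a minimal sketch, not the actual
patch; the key scheme and field names here are only illustrative), patch 1
indexes each page by its DMA address so dma_pool_free() can find the owning
page with xa_load() instead of walking the page list:

    #include <linux/xarray.h>

    /* at pool creation */
    xa_init(&pool->pages);

    /* when a new page is added to the pool */
    if (xa_err(xa_store(&pool->pages, page->dma >> PAGE_SHIFT,
                        page, GFP_KERNEL)))
            return NULL;    /* xarray node allocation failed */

    /* in dma_pool_free(): constant-time lookup replaces the list walk */
    page = xa_load(&pool->pages, dma >> PAGE_SHIFT);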

Compared to the current kernel, perf record from running io_uring benchmarks
on nvme shows the dma_pool_alloc() cost cut roughly in half, from 0.81% to
0.41%.

Keith Busch (2):
  mm/dmapool: replace linked list with xarray
  mm/dmapool: link blocks across pages

 mm/dmapool.c | 107 +++++++++++++++++++++++++++------------------------
 1 file changed, 56 insertions(+), 51 deletions(-)

Comments

Tony Battersby May 27, 2022, 7:35 p.m. UTC | #1
I posted a similar patch series back in 2018:

https://lore.kernel.org/linux-mm/73ec1f52-d758-05df-fb6a-41d269e910d0@cybernetics.com/
https://lore.kernel.org/linux-mm/15ff502d-d840-1003-6c45-bc17f0d81262@cybernetics.com/
https://lore.kernel.org/linux-mm/1288e597-a67a-25b3-b7c6-db883ca67a25@cybernetics.com/


I initially used a red-black tree keyed by the DMA address, but then for
v2 of the patchset I put the dma pool info directly into struct page and
used virt_to_page() to get at it.  But it turned out that was a bad idea
because not all architectures have struct page backing
dma_alloc_coherent():

https://lore.kernel.org/linux-kernel/20181206013054.GI6707@atomide.com/
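
For reference, the red-black tree lookup I described amounts to something
like the following (a hedged sketch with illustrative field names, not the
code from the linked patches): each pool page embeds an rb_node keyed by
its base DMA address, and a free walks the tree to the page whose DMA range
contains the block:

    #include <linux/rbtree.h>

    struct dma_page {
            struct rb_node node;
            dma_addr_t dma;         /* base DMA address of the page */
            size_t allocation;      /* hypothetical: bytes covered */
    };

    static struct dma_page *pool_find_page(struct rb_root *root,
                                           dma_addr_t dma)
    {
            struct rb_node *n = root->rb_node;

            while (n) {
                    struct dma_page *page =
                            rb_entry(n, struct dma_page, node);

                    if (dma < page->dma)
                            n = n->rb_left;
                    else if (dma >= page->dma + page->allocation)
                            n = n->rb_right;
                    else
                            return page;
            }
            return NULL;
    }

That makes each free O(log n) rather than an O(n) list walk, which lines
up with the rmmod speedups below.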

I intended to go back and resubmit the red-black tree version, but I was
too busy at the time and forgot about it.  A few days ago I finally
decided to update the patches and submit them upstream.  I found your
recent dmapool xarray patches by searching the mailing list archive to
see if anyone else was working on something similar.

Using the following as a benchmark:

modprobe mpt3sas
  drivers/scsi/mpt3sas/mpt3sas_base.c:_base_allocate_chain_dma_pool()
  loop of dma_pool_alloc(ioc->chain_dma_pool)

rmmod mpt3sas
  drivers/scsi/mpt3sas/mpt3sas_base.c:_base_release_memory_pools()
  loop of dma_pool_free(ioc->chain_dma_pool)
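
In kernel terms, the timed portion boils down to something like this (a
hedged sketch of the measurement, not the actual mpt3sas code; num_chains,
vaddrs, and dmas are illustrative):

    #include <linux/dmapool.h>
    #include <linux/ktime.h>

    ktime_t start = ktime_get();

    for (i = 0; i < num_chains; i++)
            vaddrs[i] = dma_pool_alloc(ioc->chain_dma_pool, GFP_KERNEL,
                                       &dmas[i]);

    pr_info("alloc: %lld ns\n",
            ktime_to_ns(ktime_sub(ktime_get(), start)));

    start = ktime_get();

    for (i = 0; i < num_chains; i++)
            dma_pool_free(ioc->chain_dma_pool, vaddrs[i], dmas[i]);

    pr_info("free: %lld ns\n",
            ktime_to_ns(ktime_sub(ktime_get(), start)));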

Here are the benchmark results showing the speedup from the patchsets:

        modprobe  rmmod
orig          1x     1x
xarray      5.2x   186x
rbtree      9.3x   269x

It looks like my red-black tree version is faster than v1 of the xarray
patch, on this benchmark at least, although the mpt3sas usage of dmapool is
hardly typical. I will try to get some testing done on my patchset and post
it next week.

Tony Battersby
Cybernetics
Keith Busch May 27, 2022, 9:01 p.m. UTC | #2
On Fri, May 27, 2022 at 03:35:47PM -0400, Tony Battersby wrote:
> [...]
>
> Here are the benchmark results showing the speedup from the patchsets:
> 
>         modprobe  rmmod
> orig          1x     1x
> xarray      5.2x   186x
> rbtree      9.3x   269x
> 
> It looks like my red-black tree version is faster than the v1 of the
> xarray patch on this benchmark at least, although the mpt3sas usage of
> dmapool is hardly typical.  I will try to get some testing done on my
> patchset and post it next week.

Thanks for the info.

Just comparing with xarray, I actually found that the list was still faster
until you get >100 pages in the pool, after which xarray becomes the clear
winner.

But it turns out that many real use cases don't allocate nearly that many
pages, so I'm taking this in a different direction: replacing the lookup
structures with an intrusive stack. That is safe to do since pages are never
freed for the lifetime of the pool, and it's by far faster than anything
else. The downside is that I'd need to increase the size of the smallest
allowable pool block, but I think that's okay.
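
To make that concrete, here is a minimal sketch of an intrusive free stack
(my own illustration of the idea, not the eventual patch; the next_free
field and helpers are hypothetical): each free block's first bytes store a
pointer to the next free block, which is exactly why the smallest allowable
block size has to grow:

    struct dma_block {
            struct dma_block *next_free;    /* valid only while free */
            dma_addr_t dma;
    };

    /* O(1) alloc: pop the top of the stack, no page lookup at all */
    static struct dma_block *pool_block_pop(struct dma_pool *pool)
    {
            struct dma_block *block = pool->next_free;

            if (block)
                    pool->next_free = block->next_free;
            return block;
    }

    /* O(1) free: push the block back, again without a lookup */
    static void pool_block_push(struct dma_pool *pool,
                                struct dma_block *block)
    {
            block->next_free = pool->next_free;
            pool->next_free = block;
    }

The next_free pointers stay valid because the backing pages are never
returned to the system before the pool itself is destroyed.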

Anyway, I was planning to post this new idea soon. My reasons for wanting a
faster dma pool are still in the works, though, so I'm sorting out those
patches before returning to this one.