Message ID: 20220428202714.17630-1-kbusch@kernel.org (mailing list archive)
Series: dmapool performance enhancements
I posted a similar patch series back in 2018:

https://lore.kernel.org/linux-mm/73ec1f52-d758-05df-fb6a-41d269e910d0@cybernetics.com/
https://lore.kernel.org/linux-mm/15ff502d-d840-1003-6c45-bc17f0d81262@cybernetics.com/
https://lore.kernel.org/linux-mm/1288e597-a67a-25b3-b7c6-db883ca67a25@cybernetics.com/

I initially used a red-black tree keyed by the DMA address, but then for
v2 of the patchset I put the dma pool info directly into struct page and
used virt_to_page() to get at it.  But it turned out that was a bad idea
because not all architectures have struct page backing
dma_alloc_coherent():

https://lore.kernel.org/linux-kernel/20181206013054.GI6707@atomide.com/

I intended to go back and resubmit the red-black tree version, but I was
too busy at the time and forgot about it.  A few days ago I finally
decided to update the patches and submit them upstream.  I found your
recent dmapool xarray patches by searching the mailing list archive to
see if anyone else was working on something similar.

Using the following as a benchmark:

modprobe mpt3sas
  drivers/scsi/mpt3sas/mpt3sas_base.c
  _base_allocate_chain_dma_pool
    loop dma_pool_alloc(ioc->chain_dma_pool)

rmmod mpt3sas
  drivers/scsi/mpt3sas/mpt3sas_base.c
  _base_release_memory_pools()
    loop dma_pool_free(ioc->chain_dma_pool)

Here are the benchmark results showing the speedup from the patchsets:

         modprobe   rmmod
orig        1x        1x
xarray      5.2x    186x
rbtree      9.3x    269x

It looks like my red-black tree version is faster than the v1 of the
xarray patch on this benchmark at least, although the mpt3sas usage of
dmapool is hardly typical.  I will try to get some testing done on my
patchset and post it next week.

Tony Battersby
Cybernetics
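To make the red-black tree approach concrete, here is a minimal sketch of
per-pool pages kept in an rbtree keyed by DMA address, using the kernel's
generic rbtree API.  The struct and function names below are illustrative
assumptions, not Tony's actual patch:

#include <linux/rbtree.h>
#include <linux/types.h>

/* Hypothetical per-page bookkeeping; names are illustrative only. */
struct dma_page {
	struct rb_node	node;	/* keyed by the page's first DMA address */
	void		*vaddr;
	dma_addr_t	dma;	/* first DMA address in this page */
	size_t		size;	/* allocation size of the page */
};

/* Insert a page into a pool's rbtree, ordered by DMA address. */
static void pool_insert_page(struct rb_root *root, struct dma_page *page)
{
	struct rb_node **link = &root->rb_node, *parent = NULL;

	while (*link) {
		struct dma_page *this = rb_entry(*link, struct dma_page, node);

		parent = *link;
		if (page->dma < this->dma)
			link = &(*link)->rb_left;
		else
			link = &(*link)->rb_right;
	}
	rb_link_node(&page->node, parent, link);
	rb_insert_color(&page->node, root);
}

/* Find the page containing @dma in O(log n) instead of a list walk. */
static struct dma_page *pool_find_page(struct rb_root *root, dma_addr_t dma)
{
	struct rb_node *n = root->rb_node;

	while (n) {
		struct dma_page *page = rb_entry(n, struct dma_page, node);

		if (dma < page->dma)
			n = n->rb_left;
		else if (dma >= page->dma + page->size)
			n = n->rb_right;
		else
			return page;
	}
	return NULL;
}

Replacing the linear page-list scan with this O(log n) lookup is where the
modprobe/rmmod speedups in the table above would come from.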
On Fri, May 27, 2022 at 03:35:47PM -0400, Tony Battersby wrote:
> I posted a similar patch series back in 2018:
>
> https://lore.kernel.org/linux-mm/73ec1f52-d758-05df-fb6a-41d269e910d0@cybernetics.com/
> https://lore.kernel.org/linux-mm/15ff502d-d840-1003-6c45-bc17f0d81262@cybernetics.com/
> https://lore.kernel.org/linux-mm/1288e597-a67a-25b3-b7c6-db883ca67a25@cybernetics.com/
>
> I initially used a red-black tree keyed by the DMA address, but then for
> v2 of the patchset I put the dma pool info directly into struct page and
> used virt_to_page() to get at it.  But it turned out that was a bad idea
> because not all architectures have struct page backing
> dma_alloc_coherent():
>
> https://lore.kernel.org/linux-kernel/20181206013054.GI6707@atomide.com/
>
> I intended to go back and resubmit the red-black tree version, but I was
> too busy at the time and forgot about it.  A few days ago I finally
> decided to update the patches and submit them upstream.  I found your
> recent dmapool xarray patches by searching the mailing list archive to
> see if anyone else was working on something similar.
>
> Using the following as a benchmark:
>
> modprobe mpt3sas
>   drivers/scsi/mpt3sas/mpt3sas_base.c
>   _base_allocate_chain_dma_pool
>     loop dma_pool_alloc(ioc->chain_dma_pool)
>
> rmmod mpt3sas
>   drivers/scsi/mpt3sas/mpt3sas_base.c
>   _base_release_memory_pools()
>     loop dma_pool_free(ioc->chain_dma_pool)
>
> Here are the benchmark results showing the speedup from the patchsets:
>
>          modprobe   rmmod
> orig        1x        1x
> xarray      5.2x    186x
> rbtree      9.3x    269x
>
> It looks like my red-black tree version is faster than the v1 of the
> xarray patch on this benchmark at least, although the mpt3sas usage of
> dmapool is hardly typical.  I will try to get some testing done on my
> patchset and post it next week.

Thanks for the info. Just comparing with xarray, I actually found that the
list was still faster until you get >100 pages in the pool, after which
xarray becomes the clear winner. But it turns out I don't often see that
many pages allocated for a lot of real use cases, so I'm trying to take
this in a different direction by replacing the lookup structures with an
intrusive stack. That is safe to do since pages are never freed for the
lifetime of the pool, and it's by far faster than anything else. The
downside is that I'd need to increase the size of the smallest allowable
pool block, but I think that's okay.

Anyway, I was planning to post this new idea soon. My reasons for wanting a
faster dma pool are still in the works, though, so I'm just trying to sort
out those patches before returning to this one.
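A minimal sketch of the intrusive free-block stack idea described above,
with illustrative names and layout rather than the actual patch.  The key
point is that a free block's own memory holds the pointer to the next free
block, which is why the smallest allowable pool block has to grow:

#include <linux/types.h>

/*
 * Sketch only: each free block's first bytes store a pointer to the next
 * free block, so alloc/free become O(1) pointer pops and pushes with no
 * page lookup at all.  The smallest pool block must now be at least
 * sizeof(struct dma_block).
 */
struct dma_block {
	struct dma_block *next_free;	/* valid only while the block is free */
	dma_addr_t	  dma;		/* DMA address of this block */
};

struct dma_pool_sketch {
	struct dma_block *next_block;	/* top of the free-block stack */
	/* ... spinlock, block size, device, etc. ... */
};

/* Pop the next free block: constant time, no page list walk. */
static struct dma_block *pool_block_pop(struct dma_pool_sketch *pool)
{
	struct dma_block *block = pool->next_block;

	if (block)
		pool->next_block = block->next_free;
	return block;
}

/* Push a freed block back onto the stack. */
static void pool_block_push(struct dma_pool_sketch *pool,
			    struct dma_block *block, dma_addr_t dma)
{
	block->dma = dma;
	block->next_free = pool->next_block;
	pool->next_block = block;
}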
From: Keith Busch <kbusch@kernel.org>

Allocating and freeing blocks from the dmapool currently iterates a list of
all allocated pages. We can save time by replacing that per-alloc/free list
traversal with a constant-time lookup, so this series does that.

Compared to the current kernel, perf record from running io_uring benchmarks
on nvme shows the dma_pool_alloc() cost cut in half, from 0.81% to 0.41%.

Keith Busch (2):
  mm/dmapool: replace linked list with xarray
  mm/dmapool: link blocks across pages

 mm/dmapool.c | 107 +++++++++++++++++++++++++++------------------------
 1 file changed, 56 insertions(+), 51 deletions(-)
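For illustration, a sketch of the direction of the first patch: index pool
pages in an xarray so dma_pool_free() can find a block's owning page with a
single xa_load() instead of a list walk.  The struct fields and key choice
below are assumptions, not the exact code in the patch, and for simplicity
this assumes each pool page is a single PAGE_SIZE allocation:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/types.h>
#include <linux/xarray.h>

/* Illustrative-only per-page record. */
struct dma_page_sketch {
	void		*vaddr;		/* from dma_alloc_coherent() */
	dma_addr_t	dma;
};

/* Record a new pool page, keyed by its page-aligned virtual address. */
static int pool_page_index(struct xarray *pages, struct dma_page_sketch *page)
{
	unsigned long idx = (unsigned long)page->vaddr >> PAGE_SHIFT;

	return xa_err(xa_store(pages, idx, page, GFP_KERNEL));
}

/* Look up the page that owns a freed block: a radix lookup, no list walk. */
static struct dma_page_sketch *pool_page_lookup(struct xarray *pages,
						void *block_vaddr)
{
	unsigned long idx = ((unsigned long)block_vaddr & PAGE_MASK) >> PAGE_SHIFT;

	return xa_load(pages, idx);
}

The second patch, linking blocks across pages, corresponds to the intrusive
free-block stack sketched earlier in the thread.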