Message ID: 20221130220803.3657490-1-willy@infradead.org (mailing list archive)
Series: Split page pools from struct page
On 30/11/2022 23.07, Matthew Wilcox (Oracle) wrote:
> The MM subsystem is trying to reduce struct page to a single pointer.
> The first step towards that is splitting struct page by its individual
> users, as has already been done with folio and slab. This attempt chooses
> 'netmem' as a name, but I am not even slightly committed to that name,
> and will happily use another.

I've not been able to come up with a better name, so I'm okay with
'netmem'. Others are of course free to bikeshed this ;-)

> There are some relatively significant reductions in kernel text
> size from these changes. I'm not qualified to judge how they
> might affect performance, but every call to put_page() includes
> a call to compound_head(), which is now rather more complex
> than it once was (at least in a distro config which enables
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP).
>

I have a micro-benchmark [1][2] that I want to run on this patchset.
Reducing the asm code 'text' size is less likely to improve a
microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
run a packet benchmark that can show the (expected) performance improvement.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c

> I've only converted one user of the page_pool APIs to use the new netmem
> APIs, all the others continue to use the page based ones.
>

I guess we/netdev developers need to update the NIC drivers that use
page_pool.

> Uh, I see I left netmem_to_virt() as its own commit instead of squashing
> it into "netmem: Add utility functions". I'll fix that in the next
> version, because I'm sure you'll want some changes anyway.
>
> Happy to answer questions.
>
> Matthew Wilcox (Oracle) (24):
>   netmem: Create new type
>   netmem: Add utility functions
>   page_pool: Add netmem_set_dma_addr() and netmem_get_dma_addr()
>   page_pool: Convert page_pool_release_page() to page_pool_release_netmem()
>   page_pool: Start using netmem in allocation path.
>   page_pool: Convert page_pool_return_page() to page_pool_return_netmem()
>   page_pool: Convert __page_pool_put_page() to __page_pool_put_netmem()
>   page_pool: Convert pp_alloc_cache to contain netmem
>   page_pool: Convert page_pool_defrag_page() to page_pool_defrag_netmem()
>   page_pool: Convert page_pool_put_defragged_page() to netmem
>   page_pool: Convert page_pool_empty_ring() to use netmem
>   page_pool: Convert page_pool_alloc_pages() to page_pool_alloc_netmem()
>   page_pool: Convert page_pool_dma_sync_for_device() to take a netmem
>   page_pool: Convert page_pool_recycle_in_cache() to netmem
>   page_pool: Remove page_pool_defrag_page()
>   page_pool: Use netmem in page_pool_drain_frag()
>   page_pool: Convert page_pool_return_skb_page() to use netmem
>   page_pool: Convert frag_page to frag_nmem
>   xdp: Convert to netmem
>   mm: Remove page pool members from struct page
>   netmem_to_virt
>   page_pool: Pass a netmem to init_callback()
>   net: Add support for netmem in skb_frag
>   mvneta: Convert to netmem
>
>  drivers/net/ethernet/marvell/mvneta.c |  48 ++---
>  include/linux/mm_types.h              |  22 ---
>  include/linux/skbuff.h                |  11 ++
>  include/net/page_pool.h               | 181 ++++++++++++++---
>  include/trace/events/page_pool.h      |  28 +--
>  net/bpf/test_run.c                    |   4 +-
>  net/core/page_pool.c                  | 274 +++++++++++++------------
>  net/core/xdp.c                        |   7 +-
>  8 files changed, 344 insertions(+), 231 deletions(-)
>
> base-commit: 13ee7ef407cfcf63f4f047460ac5bb6ba5a3447d
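For readers who haven't followed the folio/slab splits, the general shape of the approach described in the cover letter is a dedicated type that (for now) overlays the page-pool fields in struct page, plus trivial cast helpers between the two views. The sketch below is only an illustration of that idea: the field layout and the page_netmem()/netmem_page() helper names are assumptions for the example, not code lifted from the series (only the netmem_* names in the patch titles above come from the series itself).

/*
 * Illustration only -- not the actual "netmem: Create new type" patch.
 * The idea: give page_pool its own descriptor type that aliases the
 * page-pool fields currently living in struct page, so those fields
 * can later be removed from struct page entirely.  Field names and
 * layout here are assumed for the sake of the example; presumably the
 * real patch asserts that each field sits at the same offset as its
 * struct page counterpart.
 */
#include <linux/mm_types.h>	/* struct page */
#include <linux/types.h>
#include <net/page_pool.h>	/* struct page_pool */

struct netmem {
	unsigned long flags;		/* aliases page->flags */
	unsigned long pp_magic;
	struct page_pool *pp;
	unsigned long _pp_mapping_pad;
	unsigned long dma_addr;
	atomic_long_t pp_frag_count;
};

/*
 * While netmem still overlays struct page, converting between the two
 * views is just a cast (the folio conversion works the same way).
 */
static inline struct netmem *page_netmem(struct page *page)
{
	return (struct netmem *)page;
}

static inline struct page *netmem_page(struct netmem *nmem)
{
	return (struct page *)nmem;
}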
Hi Jesper,

On Mon, Dec 05, 2022 at 04:34:10PM +0100, Jesper Dangaard Brouer wrote:
>
> On 30/11/2022 23.07, Matthew Wilcox (Oracle) wrote:
> > The MM subsystem is trying to reduce struct page to a single pointer.
> > The first step towards that is splitting struct page by its individual
> > users, as has already been done with folio and slab. This attempt chooses
> > 'netmem' as a name, but I am not even slightly committed to that name,
> > and will happily use another.
>
> I've not been able to come up with a better name, so I'm okay with
> 'netmem'. Others are of course free to bikeshed this ;-)

Same here. But if anyone has a better name please shout.

> > There are some relatively significant reductions in kernel text
> > size from these changes. I'm not qualified to judge how they
> > might affect performance, but every call to put_page() includes
> > a call to compound_head(), which is now rather more complex
> > than it once was (at least in a distro config which enables
> > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP).
> >
>
> I have a micro-benchmark [1][2] that I want to run on this patchset.
> Reducing the asm code 'text' size is less likely to improve a
> microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
> run a packet benchmark that can show the (expected) performance improvement.
>
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
>

If you could give it a spin it would be great. I did apply the patchset
and it was running fine on my Arm box. I was about to run these tests,
but then I remembered that this only works for x86. I don't have any
cards supported by page_pool around.

> > I've only converted one user of the page_pool APIs to use the new netmem
> > APIs, all the others continue to use the page based ones.
> >
>
> I guess we/netdev developers need to update the NIC drivers that use
> page_pool.
>
[...]

Regards
/Ilias
On Mon, Dec 05, 2022 at 04:34:10PM +0100, Jesper Dangaard Brouer wrote:
> I have a micro-benchmark [1][2] that I want to run on this patchset.
> Reducing the asm code 'text' size is less likely to improve a
> microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
> run a packet benchmark that can show the (expected) performance improvement.
>
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c

Appreciate it! I'm not expecting any performance change outside noise,
but things do surprise me. I'd appreciate it if you'd test with a
"distro" config, i.e. enabling CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, so
we show the most expensive case.

> > I've only converted one user of the page_pool APIs to use the new netmem
> > APIs, all the others continue to use the page based ones.
>
> I guess we/netdev developers need to update the NIC drivers that use
> page_pool.

Oh, it's not a huge amount of work, and I don't mind doing it. I only
did one in order to show the kinds of changes that are needed. I can do
the mlx5 conversion now ...
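To give a feel for how mechanical such a driver conversion is, here is a hypothetical before/after of a small rx-refill helper. The names page_pool_alloc_netmem(), netmem_get_dma_addr() and netmem_to_virt() are taken from the patch titles above, but their exact signatures, the GFP_ATOMIC argument, and the helper itself are assumptions made for illustration, not code from the series.

#include <net/page_pool.h>

/* Hypothetical rx-buffer refill helper, before the conversion. */
static void *rxq_refill_page(struct page_pool *pool, dma_addr_t *dma)
{
	struct page *page = page_pool_dev_alloc_pages(pool);

	if (!page)
		return NULL;
	*dma = page_pool_get_dma_addr(page);
	return page_address(page);
}

/*
 * The same helper after switching to the netmem API.  The netmem
 * signatures are guessed from the patch titles, not verified against
 * the series.
 */
static void *rxq_refill_netmem(struct page_pool *pool, dma_addr_t *dma)
{
	struct netmem *nmem = page_pool_alloc_netmem(pool, GFP_ATOMIC);

	if (!nmem)
		return NULL;
	*dma = netmem_get_dma_addr(nmem);
	return netmem_to_virt(nmem);
}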
On 05/12/2022 17.31, Matthew Wilcox wrote:
> On Mon, Dec 05, 2022 at 04:34:10PM +0100, Jesper Dangaard Brouer wrote:
>> I have a micro-benchmark [1][2] that I want to run on this patchset.
>> Reducing the asm code 'text' size is less likely to improve a
>> microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
>> run a packet benchmark that can show the (expected) performance improvement.
>>
>> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
>
> Appreciate it! I'm not expecting any performance change outside noise,
> but things do surprise me. I'd appreciate it if you'd test with a
> "distro" config, i.e. enabling CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, so
> we show the most expensive case.
>

I have CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y, BUT it isn't enabled at
runtime by default. Should I also set
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON, or enable it via
sysctl?

 $ grep -H . /proc/sys/vm/hugetlb_optimize_vmemmap
 /proc/sys/vm/hugetlb_optimize_vmemmap:0

--Jesper
On Tue, Dec 06, 2022 at 10:43:05AM +0100, Jesper Dangaard Brouer wrote:
>
> On 05/12/2022 17.31, Matthew Wilcox wrote:
> > On Mon, Dec 05, 2022 at 04:34:10PM +0100, Jesper Dangaard Brouer wrote:
> > > I have a micro-benchmark [1][2] that I want to run on this patchset.
> > > Reducing the asm code 'text' size is less likely to improve a
> > > microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
> > > run a packet benchmark that can show the (expected) performance improvement.
> > >
> > > [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
> > > [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
> >
> > Appreciate it! I'm not expecting any performance change outside noise,
> > but things do surprise me. I'd appreciate it if you'd test with a
> > "distro" config, i.e. enabling CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, so
> > we show the most expensive case.
>
> I have CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y, BUT it isn't enabled at
> runtime by default.

That's fine. I think the vast majority of machines won't actually have
it enabled. It's mostly useful for hosting setups where allocating 1GB
pages for VMs is common.

The mlx5 driver was straightforward, but showed some gaps in the API.
You'd already got the majority of the wins by using page_ref_inc()
instead of get_page(), but I did find one put_page() ;-)
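For context on why get_page() vs page_ref_inc() (and that stray put_page()) matters here: the generic get/put helpers must resolve a possible tail page via compound_head() before touching the refcount, and the cover letter notes that compound_head() is rather more complex once CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is enabled. A driver that knows it holds exactly the page the pool handed out can skip that lookup. A simplified illustration follows; this is not the kernel's actual get_page()/page_ref_inc() code.

#include <linux/mm.h>

/*
 * Roughly what get_page() has to do: the page might be a tail page,
 * so the head page is looked up first.  Per the cover letter, that
 * compound_head() lookup is no longer a simple mask-and-load when
 * CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP is enabled.
 */
static inline void grab_ref_generic(struct page *page)
{
	struct page *head = compound_head(page);

	atomic_inc(&head->_refcount);
}

/*
 * Roughly what page_ref_inc() does: the caller guarantees this is the
 * (head) page it was handed, so no lookup is needed.
 */
static inline void grab_ref_exact(struct page *page)
{
	atomic_inc(&page->_refcount);
}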
On 06/12/2022 17.08, Matthew Wilcox wrote:
> On Tue, Dec 06, 2022 at 10:43:05AM +0100, Jesper Dangaard Brouer wrote:
>>
>> On 05/12/2022 17.31, Matthew Wilcox wrote:
>>> On Mon, Dec 05, 2022 at 04:34:10PM +0100, Jesper Dangaard Brouer wrote:
>>>> I have a micro-benchmark [1][2] that I want to run on this patchset.
>>>> Reducing the asm code 'text' size is less likely to improve a
>>>> microbenchmark. The 100Gbit mlx5 driver uses page_pool, so perhaps I can
>>>> run a packet benchmark that can show the (expected) performance improvement.
>>>>
>>>> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c
>>>> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_cross_cpu.c
>>>
>>> Appreciate it! I'm not expecting any performance change outside noise,
>>> but things do surprise me. I'd appreciate it if you'd test with a
>>> "distro" config, i.e. enabling CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP, so
>>> we show the most expensive case.

I've tested with [1] and [2], and the performance numbers are the same.
Microbench [1] is easiest to compare, and the numbers below were
basically the same both with and without the patchset.

 Type:tasklet_page_pool01_fast_path Per elem:  16 cycles(tsc)  4.484 ns
 Type:tasklet_page_pool02_ptr_ring  Per elem:  47 cycles(tsc) 13.147 ns
 Type:tasklet_page_pool03_slow      Per elem: 173 cycles(tsc) 48.278 ns

The last line (173 cycles) is the case where pages are not recycled,
but instead returned to the system page allocator. To relate this to
something: allocating order-0 pages via the normal page allocator API
costs approx 282 cycles(tsc) / 78.385 ns on this system (with this
.config). I believe page_pool is faster because we leverage the bulk
page allocator.

--Jesper
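As a footnote on the "bulk page allocator" point: page_pool's slow path refills its allocation cache with a single bulk call rather than one alloc_pages() per packet, which is plausibly where much of the 173-vs-282 cycle gap comes from. Below is a condensed sketch of that pattern, heavily simplified from net/core/page_pool.c; the helper name and batch constant are made up, and NUMA placement, DMA mapping and partial-failure handling are omitted.

#include <linux/gfp.h>

#define REFILL_BATCH	64	/* page_pool uses a similar refill batch */

/*
 * Condensed sketch of the page_pool slow-path refill pattern: one trip
 * into the page allocator fills a whole batch of the alloc cache, and
 * subsequent allocations are served from that cache.  The cache array
 * is assumed to start out NULL-filled, since alloc_pages_bulk_array()
 * only fills empty slots.
 */
static struct page *pool_refill_sketch(struct page **cache,
				       unsigned int *count, gfp_t gfp)
{
	*count = alloc_pages_bulk_array(gfp, REFILL_BATCH, cache);
	if (!*count)
		return NULL;

	/* Hand out one page now; the rest stay cached for later calls. */
	return cache[--(*count)];
}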