Message ID | 20210312154331.32229-1-mgorman@techsingularity.net
Series | Introduce a bulk order-0 page allocator with two in-tree users
From: Mel Gorman <mgorman@techsingularity.net>
Date: Fri, 12 Mar 2021 15:43:24 +0000

Hi there,

> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2

I gave this series a go on my setup, it showed a bump of 10 Mbps on
UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.

(4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
allocations with MTU of 1508 bytes, linear frames via build_skb(),
GRO + TSO/USO)

I didn't have time to drill into the code, so for now can't provide
any additional details. You can request anything you need though and
I'll try to find a window to collect it.

> Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> vaguely aware that there are other prototype series on netdev that affect
> page_pool. The conflict should be obvious in linux-next.
>
> Changelog since v3
> o Rebase on top of Matthew's series consolidating the alloc_pages API
> o Rename alloced to allocated
> o Split out preparation patch for prepare_alloc_pages
> o Defensive check for bulk allocation or <= 0 pages
> o Call single page allocation path only if no pages were allocated
> o Minor cosmetic cleanups
> o Reorder patch dependencies by subsystem. As this is a cross-subsystem
>   series, the mm patches have to be merged before the sunrpc and net
>   users.
>
> Changelog since v2
> o Prep new pages with IRQs enabled
> o Minor documentation update
>
> Changelog since v1
> o Parenthesise binary and boolean comparisons
> o Add reviewed-bys
> o Rebase to 5.12-rc2
>
> This series introduces a bulk order-0 page allocator with sunrpc and
> the network page pool being the first users. The implementation is not
> particularly efficient and the intention is to iron out what semantics
> the API should have for users. Once the semantics are ironed out, it
> can be made more efficient. Despite that, this is performance-related
> for users that require multiple pages for an operation without multiple
> round-trips to the page allocator. Quoting the last patch for the
> high-speed networking use-case:
>
> For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> redirecting xdp_frame packets into a veth, that does XDP_PASS to
> create an SKB from the xdp_frame, which then cannot return the page
> to the page_pool. In this case, we saw[1] an improvement of 18.8%
> from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
>
> Both users in this series are corner cases (NFS and high-speed networks)
> so it is unlikely that most users will see any benefit in the short
> term. Potential other users are batch allocations for page cache
> readahead, fault around and SLUB allocations when high-order pages are
> unavailable. It's unknown how much benefit would be seen by converting
> multiple page allocation calls to a single batch or what difference it
> may make to headline performance. It's a chicken-and-egg problem given
> that the potential benefit cannot be investigated without an
> implementation to test against.
>
> Light testing passed, I'm relying on Chuck and Jesper to test the target
> users more aggressively but both report performance improvements with
> the initial RFC.
>
> Patch 1 moves GFP flag initialisation to prepare_alloc_pages
>
> Patch 2 renames a variable name that is particularly unpopular
>
> Patch 3 adds a bulk page allocator
>
> Patch 4 is a sunrpc cleanup that is a pre-requisite.
>
> Patch 5 is the sunrpc user. Chuck also has a patch which further caches
>         pages but is not included in this series. It's not directly
>         related to the bulk allocator and as it caches pages, it might
>         have other concerns (e.g. does it need a shrinker?)
>
> Patch 6 is a preparation patch only for the network user
>
> Patch 7 converts the net page pool to the bulk allocator for order-0 pages.
>
>  include/linux/gfp.h   |  12 ++++
>  mm/page_alloc.c       | 149 +++++++++++++++++++++++++++++++++++++-----
>  net/core/page_pool.c  | 101 +++++++++++++++++-----------
>  net/sunrpc/svc_xprt.c |  47 +++++++++----
>  4 files changed, 240 insertions(+), 69 deletions(-)
>
> --
> 2.26.2

Thanks,
Al
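For readers who want a feel for the API being discussed, the sketch below
shows roughly how a caller would use a bulk order-0 allocation. It is
hypothetical and not lifted from the series: it assumes a list-based helper
along the lines of alloc_pages_bulk(gfp, nr_pages, &list) that returns the
number of pages it managed to allocate and links them via page->lru; the
authoritative prototype is whatever patch 3 adds to include/linux/gfp.h.

/*
 * Hypothetical caller sketch (not from the series).  Assumes a
 * list-based helper, roughly:
 *     nr = alloc_pages_bulk(gfp, nr_pages, &list);
 * returning how many order-0 pages ended up on @list (possibly
 * fewer than requested), linked through page->lru.
 */
LIST_HEAD(page_list);
struct page *page, *tmp;
unsigned long want = 16, got;

got = alloc_pages_bulk(GFP_KERNEL, want, &page_list);
if (got < want) {
	/* Callers must handle partial success, e.g. fall back to
	 * single-page allocations or retry later.
	 */
}

list_for_each_entry_safe(page, tmp, &page_list, lru) {
	list_del(&page->lru);
	/* hand @page to the consumer, e.g. an RX ring or an svc_rqst */
}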
On Wed, 17 Mar 2021 16:31:07 +0000
Alexander Lobakin <alobakin@pm.me> wrote:

> From: Mel Gorman <mgorman@techsingularity.net>
> Date: Fri, 12 Mar 2021 15:43:24 +0000
>
> Hi there,
>
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
>
> I gave this series a go on my setup, it showed a bump of 10 Mbps on
> UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
>
> (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> allocations with MTU of 1508 bytes, linear frames via build_skb(),
> GRO + TSO/USO)

What NIC driver is this?

> I didn't have time to drill into the code, so for now can't provide
> any additional details. You can request anything you need though and
> I'll try to find a window to collect it.
>
> > Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> > want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> > directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> > vaguely aware that there are other prototype series on netdev that affect
> > page_pool. The conflict should be obvious in linux-next.
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 17 Mar 2021 17:38:44 +0100

> On Wed, 17 Mar 2021 16:31:07 +0000
> Alexander Lobakin <alobakin@pm.me> wrote:
>
> > From: Mel Gorman <mgorman@techsingularity.net>
> > Date: Fri, 12 Mar 2021 15:43:24 +0000
> >
> > Hi there,
> >
> > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > following git tree
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> >
> > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> >
> > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > GRO + TSO/USO)
>
> What NIC driver is this?

Ah, forgot to mention. It's a WIP driver, not yet mainlined.
The NIC itself is basically on-SoC 1G chip.

> > I didn't have time to drill into the code, so for now can't provide
> > any additional details. You can request anything you need though and
> > I'll try to find a window to collect it.
> >
> > > Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> > > want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> > > directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> > > vaguely aware that there are other prototype series on netdev that affect
> > > page_pool. The conflict should be obvious in linux-next.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Al
On Wed, 17 Mar 2021 16:52:32 +0000
Alexander Lobakin <alobakin@pm.me> wrote:

> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Wed, 17 Mar 2021 17:38:44 +0100
>
> > On Wed, 17 Mar 2021 16:31:07 +0000
> > Alexander Lobakin <alobakin@pm.me> wrote:
> >
> > > From: Mel Gorman <mgorman@techsingularity.net>
> > > Date: Fri, 12 Mar 2021 15:43:24 +0000
> > >
> > > Hi there,
> > >
> > > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > > following git tree
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> > >
> > > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> > >
> > > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > > GRO + TSO/USO)
> >
> > What NIC driver is this?
>
> Ah, forgot to mention. It's a WIP driver, not yet mainlined.
> The NIC itself is basically on-SoC 1G chip.

Hmm, then it is really hard to check if your driver is doing something
else that could cause this.

Well, can you try to lower the page_pool bulking size, to test the
theory from Wilcox that we should do smaller bulking to avoid pushing
cachelines into L2 when walking the LRU list. You might have to go as
low as bulk=8 (for N-way associative level of L1 cache).

In function: __page_pool_alloc_pages_slow() adjust variable:
 const int bulk = PP_ALLOC_CACHE_REFILL;
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 17 Mar 2021 18:19:43 +0100

> On Wed, 17 Mar 2021 16:52:32 +0000
> Alexander Lobakin <alobakin@pm.me> wrote:
>
> > From: Jesper Dangaard Brouer <brouer@redhat.com>
> > Date: Wed, 17 Mar 2021 17:38:44 +0100
> >
> > > On Wed, 17 Mar 2021 16:31:07 +0000
> > > Alexander Lobakin <alobakin@pm.me> wrote:
> > >
> > > > From: Mel Gorman <mgorman@techsingularity.net>
> > > > Date: Fri, 12 Mar 2021 15:43:24 +0000
> > > >
> > > > Hi there,
> > > >
> > > > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > > > following git tree
> > > > >
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> > > >
> > > > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > > > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> > > >
> > > > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > > > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > > > GRO + TSO/USO)
> > >
> > > What NIC driver is this?
> >
> > Ah, forgot to mention. It's a WIP driver, not yet mainlined.
> > The NIC itself is basically on-SoC 1G chip.
>
> Hmm, then it is really hard to check if your driver is doing something
> else that could cause this.
>
> Well, can you try to lower the page_pool bulking size, to test the
> theory from Wilcox that we should do smaller bulking to avoid pushing
> cachelines into L2 when walking the LRU list. You might have to go as
> low as bulk=8 (for N-way associative level of L1 cache).

Turned out it suffered from GCC's decisions. All of the following was
taken on GCC 10.2.0 with -O2 in dotconfig.

vmlinux differences between baseline and this series (I used your
followup instead of the last patch from the tree):

Function                        old     new   delta
__rmqueue_pcplist                 -    2024   +2024
__alloc_pages_bulk                -    1456   +1456
__page_pool_alloc_pages_slow    284     600    +316
page_pool_dma_map                 -     164    +164
get_page_from_freelist         5676    3760   -1916

The uninlining of __rmqueue_pcplist() hurts a lot. It slightly slows
down the "regular" page allocator, but makes __alloc_pages_bulk() much
slower than the per-page path (in my case at least), because the now
out-of-line function gets called from inside the allocation loop.

One possible solution is to mark __rmqueue_pcplist() and rmqueue_bulk()
as __always_inline. Only both, and only with __always_inline, or GCC
will emit rmqueue_bulk.constprop and make the numbers even poorer.
This nearly doubles the size of the bulk allocator, but eliminates all
performance hits.

Function                        old     new   delta
__alloc_pages_bulk             1456    3512   +2056
get_page_from_freelist         3760    5744   +1984
find_suitable_fallback.part       -     160    +160
min_free_kbytes_sysctl_handler   96     128     +32
find_suitable_fallback          164      28    -136
__rmqueue_pcplist              2024       -   -2024

Between baseline and this series with the __always_inline hints:

Function                        old     new   delta
__alloc_pages_bulk                -    3512   +3512
find_suitable_fallback.part       -     160    +160
get_page_from_freelist         5676    5744     +68
min_free_kbytes_sysctl_handler   96     128     +32
find_suitable_fallback          164      28    -136

Another suboptimal place I've found is two functions in the Page Pool
code which are marked as 'noinline'. Maybe there's a reason behind
this, but removing the annotations and additionally marking
page_pool_dma_map() as inline simplifies the object code and in fact
improves the performance (+15 Mbps on my setup):

add/remove: 0/3 grow/shrink: 1/0 up/down: 1024/-1096 (-72)
Function                        old     new   delta
page_pool_alloc_pages           100    1124   +1024
page_pool_dma_map               164       -    -164
page_pool_refill_alloc_cache    332       -    -332
__page_pool_alloc_pages_slow    600       -    -600

1124 is a normal size for a hotpath function. The fragmentation and the
jumps between page_pool_alloc_pages(), __page_pool_alloc_pages_slow()
and page_pool_refill_alloc_cache() are really excessive and unhealthy
for performance, as is page_pool_dma_map() being left uninlined by GCC.

So the best results I got so far were with these additional changes:
 - mark __rmqueue_pcplist() as __always_inline;
 - mark rmqueue_bulk() as __always_inline;
 - drop 'noinline' from page_pool_refill_alloc_cache();
 - drop 'noinline' from __page_pool_alloc_pages_slow();
 - mark page_pool_dma_map() as inline.

(inlines in C files aren't generally recommended, but well, GCC is far
from perfect)

> In function: __page_pool_alloc_pages_slow() adjust variable:
>  const int bulk = PP_ALLOC_CACHE_REFILL;

Regarding the bulk size, it makes no difference on my machine: I tried
{ 8, 16, 32, 64 } and the results differed by 1-2 Mbps at most, within
the standard deviation. Most bulk operations I've seen take the value
of 16 as a "golden ratio", though.

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Thanks,
Al
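For reference, here is a rough sketch of what the annotation changes listed
above would look like at the source level. Parameter lists are elided and
return types are approximate; the exact signatures are the ones already in
mm/page_alloc.c and net/core/page_pool.c of the tested tree, so treat this
purely as an illustration of where the hints go:

/* mm/page_alloc.c: force both helpers inline.  Per the analysis above,
 * annotating only one of them makes GCC emit rmqueue_bulk.constprop
 * and the numbers get worse.
 */
static __always_inline
struct page *__rmqueue_pcplist(/* zone, migratetype, alloc_flags, pcp, list */);

static __always_inline
int rmqueue_bulk(/* zone, order, count, list, migratetype, alloc_flags */);

/* net/core/page_pool.c: drop the explicit 'noinline' annotations so the
 * refill/slow paths can collapse into page_pool_alloc_pages(), and hint
 * page_pool_dma_map() as inline.
 */
static struct page *page_pool_refill_alloc_cache(/* pool */);      /* was noinline */
static struct page *__page_pool_alloc_pages_slow(/* pool, gfp */); /* was noinline */
static inline bool page_pool_dma_map(/* pool, page */);            /* was plain static */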