mbox series

[0/7,v4] Introduce a bulk order-0 page allocator with two in-tree users

Message ID 20210312154331.32229-1-mgorman@techsingularity.net (mailing list archive)
Headers show
Series Introduce a bulk order-0 page allocator with two in-tree users | expand

Message

Mel Gorman March 12, 2021, 3:43 p.m. UTC
This series is based on top of Matthew Wilcox's series "Rationalise
__alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
test and are not using Andrew's tree as a baseline, I suggest using the
following git tree

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2

Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
directly to the subsystem maintainers. While sunrpc is low-risk, I'm
vaguely aware that there are other prototype series on netdev that affect
page_pool. The conflict should be obvious in linux-next.

Changelog since v3
o Rebase on top of Matthew's series consolidating the alloc_pages API
o Rename alloced to allocated
o Split out preparation patch for prepare_alloc_pages
o Defensive check for bulk allocation or <= 0 pages
o Call single page allocation path only if no pages were allocated
o Minor cosmetic cleanups
o Reorder patch dependencies by subsystem. As this is a cross-subsystem
  series, the mm patches have to be merged before the sunrpc and net
  users.

Changelog since v2
o Prep new pages with IRQs enabled
o Minor documentation update

Changelog since v1
o Parenthesise binary and boolean comparisons
o Add reviewed-bys
o Rebase to 5.12-rc2

This series introduces a bulk order-0 page allocator with sunrpc and
the network page pool being the first users. The implementation is not
particularly efficient and the intention is to iron out what the semantics
of the API should have for users. Once the semantics are ironed out, it
can be made more efficient. Despite that, this is a performance-related
for users that require multiple pages for an operation without multiple
round-trips to the page allocator. Quoting the last patch for the
high-speed networking use-case.

    For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
    redirecting xdp_frame packets into a veth, that does XDP_PASS to
    create an SKB from the xdp_frame, which then cannot return the page
    to the page_pool. In this case, we saw[1] an improvement of 18.8%
    from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

Both users in this series are corner cases (NFS and high-speed networks)
so it is unlikely that most users will see any benefit in the short
term. Potential other users are batch allocations for page cache
readahead, fault around and SLUB allocations when high-order pages are
unavailable. It's unknown how much benefit would be seen by converting
multiple page allocation calls to a single batch or what difference it may
make to headline performance. It's a chicken and egg problem given that
the potential benefit cannot be investigated without an implementation
to test against.

Light testing passed, I'm relying on Chuck and Jesper to test the target
users more aggressively but both report performance improvements with
the initial RFC.

Patch 1 moves GFP flag initialision to prepare_alloc_pages

Patch 2 renames a variable name that is particularly unpopular

Patch 3 adds a bulk page allocator

Patch 4 is a sunrpc cleanup that is a pre-requisite.

Patch 5 is the sunrpc user. Chuck also has a patch which further caches
	pages but is not included in this series. It's not directly
	related to the bulk allocator and as it caches pages, it might
	have other concerns (e.g. does it need a shrinker?)

Patch 6 is a preparation patch only for the network user

Patch 7 converts the net page pool to the bulk allocator for order-0 pages.

 include/linux/gfp.h   |  12 ++++
 mm/page_alloc.c       | 149 +++++++++++++++++++++++++++++++++++++-----
 net/core/page_pool.c  | 101 +++++++++++++++++-----------
 net/sunrpc/svc_xprt.c |  47 +++++++++----
 4 files changed, 240 insertions(+), 69 deletions(-)

Comments

Alexander Lobakin March 17, 2021, 4:31 p.m. UTC | #1
From: Mel Gorman <mgorman@techsingularity.net>
Date: Fri, 12 Mar 2021 15:43:24 +0000

Hi there,

> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2

I gave this series a go on my setup, it showed a bump of 10 Mbps on
UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.

(4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
allocations with MTU of 1508 bytes, linear frames via build_skb(),
GRO + TSO/USO)

I didn't have time to drill into the code, so for now can't provide
any additional details. You can request anything you need though and
I'll try to find a window to collect it.

> Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> vaguely aware that there are other prototype series on netdev that affect
> page_pool. The conflict should be obvious in linux-next.
>
> Changelog since v3
> o Rebase on top of Matthew's series consolidating the alloc_pages API
> o Rename alloced to allocated
> o Split out preparation patch for prepare_alloc_pages
> o Defensive check for bulk allocation or <= 0 pages
> o Call single page allocation path only if no pages were allocated
> o Minor cosmetic cleanups
> o Reorder patch dependencies by subsystem. As this is a cross-subsystem
>   series, the mm patches have to be merged before the sunrpc and net
>   users.
>
> Changelog since v2
> o Prep new pages with IRQs enabled
> o Minor documentation update
>
> Changelog since v1
> o Parenthesise binary and boolean comparisons
> o Add reviewed-bys
> o Rebase to 5.12-rc2
>
> This series introduces a bulk order-0 page allocator with sunrpc and
> the network page pool being the first users. The implementation is not
> particularly efficient and the intention is to iron out what the semantics
> of the API should have for users. Once the semantics are ironed out, it
> can be made more efficient. Despite that, this is a performance-related
> for users that require multiple pages for an operation without multiple
> round-trips to the page allocator. Quoting the last patch for the
> high-speed networking use-case.
>
>     For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
>     redirecting xdp_frame packets into a veth, that does XDP_PASS to
>     create an SKB from the xdp_frame, which then cannot return the page
>     to the page_pool. In this case, we saw[1] an improvement of 18.8%
>     from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
>
> Both users in this series are corner cases (NFS and high-speed networks)
> so it is unlikely that most users will see any benefit in the short
> term. Potential other users are batch allocations for page cache
> readahead, fault around and SLUB allocations when high-order pages are
> unavailable. It's unknown how much benefit would be seen by converting
> multiple page allocation calls to a single batch or what difference it may
> make to headline performance. It's a chicken and egg problem given that
> the potential benefit cannot be investigated without an implementation
> to test against.
>
> Light testing passed, I'm relying on Chuck and Jesper to test the target
> users more aggressively but both report performance improvements with
> the initial RFC.
>
> Patch 1 moves GFP flag initialision to prepare_alloc_pages
>
> Patch 2 renames a variable name that is particularly unpopular
>
> Patch 3 adds a bulk page allocator
>
> Patch 4 is a sunrpc cleanup that is a pre-requisite.
>
> Patch 5 is the sunrpc user. Chuck also has a patch which further caches
> 	pages but is not included in this series. It's not directly
> 	related to the bulk allocator and as it caches pages, it might
> 	have other concerns (e.g. does it need a shrinker?)
>
> Patch 6 is a preparation patch only for the network user
>
> Patch 7 converts the net page pool to the bulk allocator for order-0 pages.
>
>  include/linux/gfp.h   |  12 ++++
>  mm/page_alloc.c       | 149 +++++++++++++++++++++++++++++++++++++-----
>  net/core/page_pool.c  | 101 +++++++++++++++++-----------
>  net/sunrpc/svc_xprt.c |  47 +++++++++----
>  4 files changed, 240 insertions(+), 69 deletions(-)
>
> --
> 2.26.2

Thanks,
Al
Jesper Dangaard Brouer March 17, 2021, 4:38 p.m. UTC | #2
On Wed, 17 Mar 2021 16:31:07 +0000
Alexander Lobakin <alobakin@pm.me> wrote:

> From: Mel Gorman <mgorman@techsingularity.net>
> Date: Fri, 12 Mar 2021 15:43:24 +0000
> 
> Hi there,
> 
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2  
> 
> I gave this series a go on my setup, it showed a bump of 10 Mbps on
> UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> 
> (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> allocations with MTU of 1508 bytes, linear frames via build_skb(),
> GRO + TSO/USO)

What NIC driver is this?

> I didn't have time to drill into the code, so for now can't provide
> any additional details. You can request anything you need though and
> I'll try to find a window to collect it.
> 
> > Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> > want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> > directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> > vaguely aware that there are other prototype series on netdev that affect
> > page_pool. The conflict should be obvious in linux-next.
Alexander Lobakin March 17, 2021, 4:52 p.m. UTC | #3
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 17 Mar 2021 17:38:44 +0100

> On Wed, 17 Mar 2021 16:31:07 +0000
> Alexander Lobakin <alobakin@pm.me> wrote:
>
> > From: Mel Gorman <mgorman@techsingularity.net>
> > Date: Fri, 12 Mar 2021 15:43:24 +0000
> >
> > Hi there,
> >
> > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > following git tree
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> >
> > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> >
> > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > GRO + TSO/USO)
>
> What NIC driver is this?

Ah, forgot to mention. It's a WIP driver, not yet mainlined.
The NIC itself is basically on-SoC 1G chip.

> > I didn't have time to drill into the code, so for now can't provide
> > any additional details. You can request anything you need though and
> > I'll try to find a window to collect it.
> >
> > > Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> > > want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> > > directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> > > vaguely aware that there are other prototype series on netdev that affect
> > > page_pool. The conflict should be obvious in linux-next.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Al
Jesper Dangaard Brouer March 17, 2021, 5:19 p.m. UTC | #4
On Wed, 17 Mar 2021 16:52:32 +0000
Alexander Lobakin <alobakin@pm.me> wrote:

> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Wed, 17 Mar 2021 17:38:44 +0100
> 
> > On Wed, 17 Mar 2021 16:31:07 +0000
> > Alexander Lobakin <alobakin@pm.me> wrote:
> >  
> > > From: Mel Gorman <mgorman@techsingularity.net>
> > > Date: Fri, 12 Mar 2021 15:43:24 +0000
> > >
> > > Hi there,
> > >  
> > > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > > following git tree
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2  
> > >
> > > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> > >
> > > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > > GRO + TSO/USO)  
> >
> > What NIC driver is this?  
> 
> Ah, forgot to mention. It's a WIP driver, not yet mainlined.
> The NIC itself is basically on-SoC 1G chip.

Hmm, then it is really hard to check if your driver is doing something
else that could cause this.

Well, can you try to lower the page_pool bulking size, to test the
theory from Wilcox that we should do smaller bulking to avoid pushing
cachelines into L2 when walking the LRU list.  You might have to go as
low as bulk=8 (for N-way associative level of L1 cache).

In function: __page_pool_alloc_pages_slow() adjust variable:
  const int bulk = PP_ALLOC_CACHE_REFILL;
Alexander Lobakin March 17, 2021, 10:25 p.m. UTC | #5
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 17 Mar 2021 18:19:43 +0100

> On Wed, 17 Mar 2021 16:52:32 +0000
> Alexander Lobakin <alobakin@pm.me> wrote:
>
> > From: Jesper Dangaard Brouer <brouer@redhat.com>
> > Date: Wed, 17 Mar 2021 17:38:44 +0100
> >
> > > On Wed, 17 Mar 2021 16:31:07 +0000
> > > Alexander Lobakin <alobakin@pm.me> wrote:
> > >
> > > > From: Mel Gorman <mgorman@techsingularity.net>
> > > > Date: Fri, 12 Mar 2021 15:43:24 +0000
> > > >
> > > > Hi there,
> > > >
> > > > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > > > following git tree
> > > > >
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> > > >
> > > > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > > > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> > > >
> > > > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > > > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > > > GRO + TSO/USO)
> > >
> > > What NIC driver is this?
> >
> > Ah, forgot to mention. It's a WIP driver, not yet mainlined.
> > The NIC itself is basically on-SoC 1G chip.
>
> Hmm, then it is really hard to check if your driver is doing something
> else that could cause this.
>
> Well, can you try to lower the page_pool bulking size, to test the
> theory from Wilcox that we should do smaller bulking to avoid pushing
> cachelines into L2 when walking the LRU list.  You might have to go as
> low as bulk=8 (for N-way associative level of L1 cache).

Turned out it suffered from GCC's decisions.
All of the following was taken on GCC 10.2.0 with -O2 in dotconfig.

vmlinux differences between baseline and this series:

(I used your followup instead of the last patch from the tree)

Function                                     old     new   delta
__rmqueue_pcplist                              -    2024   +2024
__alloc_pages_bulk                             -    1456   +1456
__page_pool_alloc_pages_slow                 284     600    +316
page_pool_dma_map                              -     164    +164
get_page_from_freelist                      5676    3760   -1916

The uninlining of __rmqueue_pcplist() hurts a lot. It slightly slows
down the "regular" page allocator, but makes __alloc_pages_bulk()
much slower than the per-page (in my case at least) due to calling
this function out from the loop.

One possible solution is to mark __rmqueue_pcplist() and
rmqueue_bulk() as __always_inline. Only both and only with
__always_inline, or GCC will emit rmqueue_bulk.constprop and
make the numbers even poorer.
This nearly doubles the size of bulk allocator, but eliminates
all performance hits.

Function                                     old     new   delta
__alloc_pages_bulk                          1456    3512   +2056
get_page_from_freelist                      3760    5744   +1984
find_suitable_fallback.part                    -     160    +160
min_free_kbytes_sysctl_handler                96     128     +32
find_suitable_fallback                       164      28    -136
__rmqueue_pcplist                           2024       -   -2024

Between baseline and this series with __always_inline hints:

Function                                     old     new   delta
__alloc_pages_bulk                             -    3512   +3512
find_suitable_fallback.part                    -     160    +160
get_page_from_freelist                      5676    5744     +68
min_free_kbytes_sysctl_handler                96     128     +32
find_suitable_fallback                       164      28    -136

Another suboptimal place I've found is two functions in Page Pool
code which are marked as 'noinline'.
Maybe there's a reason behind this, but removing the annotations
and additionally marking page_pool_dma_map() as inline simplifies
the object code and in fact improves the performance (+15 Mbps on
my setup):

add/remove: 0/3 grow/shrink: 1/0 up/down: 1024/-1096 (-72)
Function                                     old     new   delta
page_pool_alloc_pages                        100    1124   +1024
page_pool_dma_map                            164       -    -164
page_pool_refill_alloc_cache                 332       -    -332
__page_pool_alloc_pages_slow                 600       -    -600

1124 is a normal size for a hotpath function.
These fragmentation and jumps between page_pool_alloc_pages(),
__page_pool_alloc_pages_slow() and page_pool_refill_alloc_cache()
are really excessive and unhealthy for performance, as well as
page_pool_dma_map() uninlined by GCC.

So the best results I got so far were with these additional changes:
 - mark __rmqueue_pcplist() as __always_inline;
 - mark rmqueue_bulk() as __always_inline;
 - drop 'noinline' from page_pool_refill_alloc_cache();
 - drop 'noinline' from __page_pool_alloc_pages_slow();
 - mark page_pool_dma_map() as inline.

(inlines in C files aren't generally recommended, but well, GCC
 is far from perfect)

> In function: __page_pool_alloc_pages_slow() adjust variable:
>   const int bulk = PP_ALLOC_CACHE_REFILL;

Regarding bulk size, it makes no sense on my machine. I tried
{ 8, 16, 32, 64 } and they differed by 1-2 Mbps max / standard
deviation.
Most of the bulk operations I've seen usually take the value of
16 as a "golden ratio" though.

>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Thanks,
Al