[0/5] Introduce a bulk order-0 page allocator with two in-tree users

Message ID	20210310104618.22750-1-mgorman@techsingularity.net (mailing list archive)
Headers	show Return-Path: <netdev-owner@kernel.org> From: Mel Gorman <mgorman@techsingularity.net> To: Andrew Morton <akpm@linux-foundation.org> Cc: Chuck Lever <chuck.lever@oracle.com>, Jesper Dangaard Brouer <brouer@redhat.com>, Christoph Hellwig <hch@infradead.org>, LKML <linux-kernel@vger.kernel.org>, Linux-Net <netdev@vger.kernel.org>, Linux-MM <linux-mm@kvack.org>, Linux-NFS <linux-nfs@vger.kernel.org>, Mel Gorman <mgorman@techsingularity.net> Subject: [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree users Date: Wed, 10 Mar 2021 10:46:13 +0000 Message-Id: <20210310104618.22750-1-mgorman@techsingularity.net> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Introduce a bulk order-0 page allocator with two in-tree users \| expand [0/5] Introduce a bulk order-0 page allocator with two in-tree users [1/5] SUNRPC: Set rq_page_end differently [2/5] mm/page_alloc: Add a bulk page allocator [3/5] SUNRPC: Refresh rq_pages using a bulk page allocator [4/5] net: page_pool: refactor dma_map into own function page_pool_dma_map [5/5] net: page_pool: use alloc_pages_bulk in refill code path

Message ID

20210310104618.22750-1-mgorman@techsingularity.net (mailing list archive)

Headers

From: Mel Gorman <mgorman@techsingularity.net>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Chuck Lever <chuck.lever@oracle.com>,
        Jesper Dangaard Brouer <brouer@redhat.com>,
        Christoph Hellwig <hch@infradead.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Linux-Net <netdev@vger.kernel.org>,
        Linux-MM <linux-mm@kvack.org>,
        Linux-NFS <linux-nfs@vger.kernel.org>,
        Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH 0/5] Introduce a bulk order-0 page allocator with two in-tree
 users
Date: Wed, 10 Mar 2021 10:46:13 +0000
Message-Id: <20210310104618.22750-1-mgorman@techsingularity.net>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

Introduce a bulk order-0 page allocator with two in-tree users | expand

Message

Mel Gorman March 10, 2021, 10:46 a.m. UTC

Changelog since v1
o Parenthesise binary and boolean comparisons
o Add reviewed-bys
o Rebase to 5.12-rc2

This series introduces a bulk order-0 page allocator with sunrpc and
the network page pool being the first users. The implementation is not
particularly efficient and the intention is to iron out what the semantics
of the API should have for users. Once the semantics are ironed out, it can
be made more efficient.

Improving the implementation requires fairly deep surgery in numerous
places. The lock scope would need to be significantly reduced, particularly
as vmstat, per-cpu and the buddy allocator have different locking protocol
that overall -- e.g. all partially depend on irqs being disabled at
various points. Secondly, the core of the allocator deals with single
pages where as both the bulk allocator and per-cpu allocator operate in
batches. All of that has to be reconciled with all the existing users and
their constraints (memory offline, CMA and cpusets being the trickiest).

Light testing passed, I'm relying on Chuck and Jesper to test the target
users more aggressively but both report performance improvements with the
initial RFC.

Patch 1 of this series is a cleanup to sunrpc, it could be merged
	separately but is included here as a pre-requisite.

Patch 2 is the prototype bulk allocator

Patch 3 is the sunrpc user. Chuck also has a patch which further caches
	pages but is not included in this series. It's not directly
	related to the bulk allocator and as it caches pages, it might
	have other concerns (e.g. does it need a shrinker?)

Patch 4 is a preparation patch only for the network user

Patch 5 converts the net page pool to the bulk allocator for order-0 pages.

There is no obvious impact to the existing paths as only new users of the
API should notice a difference between multiple calls to the allocator
and a single bulk allocation.

 include/linux/gfp.h   |  13 +++++
 mm/page_alloc.c       | 113 +++++++++++++++++++++++++++++++++++++++++-
 net/core/page_pool.c  | 102 +++++++++++++++++++++++---------------
 net/sunrpc/svc_xprt.c |  47 ++++++++++++------
 4 files changed, 220 insertions(+), 55 deletions(-)

Comments

Andrew Morton March 10, 2021, 11:47 p.m. UTC | #1

On Wed, 10 Mar 2021 10:46:13 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> This series introduces a bulk order-0 page allocator with sunrpc and
> the network page pool being the first users.

<scratches head>

Right now, the [0/n] doesn't even tell us that it's a performance
patchset!

The whole point of this patchset appears to appear in the final paragraph
of the final patch's changelog.

: For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
: redirecting xdp_frame packets into a veth, that does XDP_PASS to create
: an SKB from the xdp_frame, which then cannot return the page to the
: page_pool.  In this case, we saw[1] an improvement of 18.8% from using
: the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

Much more detail on the overall objective and the observed results,
please?

Also, that workload looks awfully corner-casey.  How beneficial is this
work for more general and widely-used operations?

> The implementation is not
> particularly efficient and the intention is to iron out what the semantics
> of the API should have for users. Once the semantics are ironed out, it can
> be made more efficient.

And some guesstimates about how much benefit remains to be realized
would be helpful.

Mel Gorman March 11, 2021, 8:48 a.m. UTC | #2

On Wed, Mar 10, 2021 at 03:47:04PM -0800, Andrew Morton wrote:
> On Wed, 10 Mar 2021 10:46:13 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > This series introduces a bulk order-0 page allocator with sunrpc and
> > the network page pool being the first users.
> 
> <scratches head>
> 
> Right now, the [0/n] doesn't even tell us that it's a performance
> patchset!
> 

I'll add a note about this improving performance for users that operate
on batches of patches and want to avoid multiple round-trips to the
page allocator.

> The whole point of this patchset appears to appear in the final paragraph
> of the final patch's changelog.
> 

I'll copy&paste that note to the introduction. It's likely that high-speed
networking is the most relevant user in the short-term.

> : For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> : redirecting xdp_frame packets into a veth, that does XDP_PASS to create
> : an SKB from the xdp_frame, which then cannot return the page to the
> : page_pool.  In this case, we saw[1] an improvement of 18.8% from using
> : the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
> 
> Much more detail on the overall objective and the observed results,
> please?
> 

I cannot generate that data right now so I need Jesper to comment on
exactly why this is beneficial. For example, while I get that more data
can be processed in a microbenchmark, I do not have a good handle on how
much difference that makes to a practical application. About all I know
is that this problem has been knocking around for 3-4 years at least.

> Also, that workload looks awfully corner-casey.  How beneficial is this
> work for more general and widely-used operations?
> 

At this point, probably nothing for most users because batch page
allocation is not common. It's primarily why I avoided reworking the
whole allocator just to make this a bit tidier.

> > The implementation is not
> > particularly efficient and the intention is to iron out what the semantics
> > of the API should have for users. Once the semantics are ironed out, it can
> > be made more efficient.
> 
> And some guesstimates about how much benefit remains to be realized
> would be helpful.
> 

I don't have that information unfortunately. It's a chicken and egg
problem because without the API, there is no point creating new users.
For example, fault around or readahead could potentially batch pages
but whether it is actually noticable when page zeroing has to happen
is a completely different story. It's a similar story for SLUB, we know
lower order allocations hurt some microbenchmarks like hackbench-sockets
but have not quantified what happens if SLUB batch allocates pages when
high-order allocations fail.

Matthew Wilcox March 12, 2021, 3:10 p.m. UTC | #3

On Thu, Mar 11, 2021 at 08:48:27AM +0000, Mel Gorman wrote:
> I don't have that information unfortunately. It's a chicken and egg
> problem because without the API, there is no point creating new users.
> For example, fault around or readahead could potentially batch pages
> but whether it is actually noticable when page zeroing has to happen
> is a completely different story. It's a similar story for SLUB, we know
> lower order allocations hurt some microbenchmarks like hackbench-sockets
> but have not quantified what happens if SLUB batch allocates pages when
> high-order allocations fail.

I'm planning on reducing overhead in the readahead path by allocating
higher-order pages rather than by allocating a batch of order-0 pages.
With the old ->readpages interface, it would have made sense to allocate a
batch of pages, but with the new ->readahead interface, we put the pages
into the page cache for the filesystem, so it doesn't make as much sense
any more.

Right now, measuring performance in the readahead path is hard because
we end up contending against kswapd that's trying to evict all the clean
pages that we earlier readahead into this same file.  Could avoid that by
having N files, each about half the size of memory, but then we restart
the readahead algorithm for each file ...

I feel like the real solution for that is to do a GFP_NOWAIT allocation,
then try to evict earlier pages for the same file we're working on so
that kswapd doesn't get woken up if we're just streaming a read through
a gargantuan file.  But I should probably talk to Johannes first.