Message ID | 20210310104618.22750-1-mgorman@techsingularity.net (mailing list archive) |
---|---|
Series | Introduce a bulk order-0 page allocator with two in-tree users |

On Wed, 10 Mar 2021 10:46:13 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:

> This series introduces a bulk order-0 page allocator with sunrpc and
> the network page pool being the first users.

<scratches head>

Right now, the [0/n] doesn't even tell us that it's a performance
patchset!

The whole point of this patchset appears to appear in the final paragraph
of the final patch's changelog.

: For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
: redirecting xdp_frame packets into a veth, that does XDP_PASS to create
: an SKB from the xdp_frame, which then cannot return the page to the
: page_pool. In this case, we saw[1] an improvement of 18.8% from using
: the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

Much more detail on the overall objective and the observed results,
please?

Also, that workload looks awfully corner-casey. How beneficial is this
work for more general and widely-used operations?

> The implementation is not
> particularly efficient and the intention is to iron out what the semantics
> of the API should have for users. Once the semantics are ironed out, it can
> be made more efficient.

And some guesstimates about how much benefit remains to be realized
would be helpful.
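
The throughput figures quoted from the changelog can be sanity-checked with a little arithmetic. The sketch below is not from the thread; the helper names `gain_percent` and `ns_per_packet` are made up for illustration. It recomputes the relative gain and converts each packet rate into an average time budget per packet, assuming the two pps numbers are steady-state averages.

```c
#include <assert.h>

/* Relative throughput gain, in percent, going from before_pps to after_pps. */
static double gain_percent(double before_pps, double after_pps)
{
    assert(before_pps > 0.0);
    return (after_pps - before_pps) / before_pps * 100.0;
}

/* Average time spent per packet, in nanoseconds, at a given packet rate. */
static double ns_per_packet(double pps)
{
    assert(pps > 0.0);
    return 1e9 / pps;
}
```

Calling `gain_percent(3677958.0, 4368926.0)` gives a value that rounds to the 18.8% in the changelog, and the difference between the two `ns_per_packet` results is roughly 43 ns per packet. That number is an average over the whole packet path; the quoted text alone does not say how much of it comes from the allocator.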

On Wed, Mar 10, 2021 at 03:47:04PM -0800, Andrew Morton wrote:
> On Wed, 10 Mar 2021 10:46:13 +0000 Mel Gorman <mgorman@techsingularity.net> wrote:
>
> > This series introduces a bulk order-0 page allocator with sunrpc and
> > the network page pool being the first users.
>
> <scratches head>
>
> Right now, the [0/n] doesn't even tell us that it's a performance
> patchset!
>

I'll add a note about this improving performance for users that operate
on batches of pages and want to avoid multiple round-trips to the page
allocator.

> The whole point of this patchset appears to appear in the final paragraph
> of the final patch's changelog.
>

I'll copy&paste that note to the introduction. It's likely that high-speed
networking is the most relevant user in the short-term.

> : For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> : redirecting xdp_frame packets into a veth, that does XDP_PASS to create
> : an SKB from the xdp_frame, which then cannot return the page to the
> : page_pool. In this case, we saw[1] an improvement of 18.8% from using
> : the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
>
> Much more detail on the overall objective and the observed results,
> please?
>

I cannot generate that data right now so I need Jesper to comment on
exactly why this is beneficial. For example, while I get that more data
can be processed in a microbenchmark, I do not have a good handle on how
much difference that makes to a practical application. About all I know
is that this problem has been knocking around for 3-4 years at least.

> Also, that workload looks awfully corner-casey. How beneficial is this
> work for more general and widely-used operations?
>

At this point, probably nothing for most users because batch page
allocation is not common. It's primarily why I avoided reworking the
whole allocator just to make this a bit tidier.

> > The implementation is not
> > particularly efficient and the intention is to iron out what the semantics
> > of the API should have for users. Once the semantics are ironed out, it can
> > be made more efficient.
>
> And some guesstimates about how much benefit remains to be realized
> would be helpful.
>

I don't have that information unfortunately. It's a chicken and egg
problem because without the API, there is no point creating new users.
For example, fault around or readahead could potentially batch pages
but whether it is actually noticeable when page zeroing has to happen
is a completely different story. It's a similar story for SLUB, we know
lower order allocations hurt some microbenchmarks like hackbench-sockets
but have not quantified what happens if SLUB batch allocates pages when
high-order allocations fail.
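
To make the "multiple round-trips to the page allocator" point concrete, here is a small userspace sketch. It is a toy model, not kernel code and not the API proposed in the series: `struct toy_pool`, `toy_alloc_one` and `toy_alloc_bulk` are hypothetical names, and the `lock_trips` counter merely stands in for whatever per-call cost (locking, IRQ disabling, and so on) a real allocator pays on each entry.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_PAGES 64

/* Toy stand-in for an allocator free list (hypothetical, userspace only). */
struct toy_pool {
    int free_ids[POOL_PAGES];   /* stack of free "page" ids */
    size_t nr_free;             /* number of ids still on the stack */
    unsigned long lock_trips;   /* times the (imaginary) lock was taken */
};

static void toy_pool_init(struct toy_pool *pool)
{
    for (size_t i = 0; i < POOL_PAGES; i++)
        pool->free_ids[i] = (int)i;
    pool->nr_free = POOL_PAGES;
    pool->lock_trips = 0;
}

/* One round trip per page: returns a page id, or -1 if the pool is empty. */
static int toy_alloc_one(struct toy_pool *pool)
{
    pool->lock_trips++;         /* lock/unlock would bracket this call */
    if (pool->nr_free == 0)
        return -1;
    return pool->free_ids[--pool->nr_free];
}

/*
 * One round trip per batch: fills out[] and returns how many pages were
 * obtained, which may be fewer than nr_wanted if the pool runs dry.
 */
static size_t toy_alloc_bulk(struct toy_pool *pool, size_t nr_wanted, int *out)
{
    size_t got = 0;

    assert(out != NULL);
    pool->lock_trips++;         /* the per-call cost is paid once */
    while (got < nr_wanted && pool->nr_free > 0)
        out[got++] = pool->free_ids[--pool->nr_free];
    return got;
}
```

Allocating 16 pages through `toy_alloc_one` bumps `lock_trips` 16 times, while a single `toy_alloc_bulk` call for the same 16 pages bumps it once. Returning a count rather than all-or-nothing is one possible answer to the "what should the semantics be" question raised above; the thread itself leaves that open.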

On Thu, Mar 11, 2021 at 08:48:27AM +0000, Mel Gorman wrote:
> I don't have that information unfortunately. It's a chicken and egg
> problem because without the API, there is no point creating new users.
> For example, fault around or readahead could potentially batch pages
> but whether it is actually noticeable when page zeroing has to happen
> is a completely different story. It's a similar story for SLUB, we know
> lower order allocations hurt some microbenchmarks like hackbench-sockets
> but have not quantified what happens if SLUB batch allocates pages when
> high-order allocations fail.

I'm planning on reducing overhead in the readahead path by allocating
higher-order pages rather than by allocating a batch of order-0 pages.
With the old ->readpages interface, it would have made sense to allocate
a batch of pages, but with the new ->readahead interface, we put the
pages into the page cache for the filesystem, so it doesn't make as much
sense any more.

Right now, measuring performance in the readahead path is hard because
we end up contending against kswapd that's trying to evict all the clean
pages that we earlier read ahead into this same file. Could avoid that by
having N files, each about half the size of memory, but then we restart
the readahead algorithm for each file ...

I feel like the real solution for that is to do a GFP_NOWAIT allocation,
then try to evict earlier pages for the same file we're working on so
that kswapd doesn't get woken up if we're just streaming a read through
a gargantuan file. But I should probably talk to Johannes first.