mm, slab: avoid high-order slab pages when it does not reduce waste

Message ID	alpine.DEB.2.21.1810121424420.116562@chino.kir.corp.google.com (mailing list archive)
State	New, archived
Date: Fri, 12 Oct 2018 14:24:57 -0700 (PDT)
From: David Rientjes <rientjes@google.com>
To: Christoph Lameter <cl@linux.com>, Pekka Enberg <penberg@kernel.org>, Joonsoo Kim <iamjoonsoo.kim@lge.com>, Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [patch] mm, slab: avoid high-order slab pages when it does not reduce waste

David Rientjes Oct. 12, 2018, 9:24 p.m. UTC

The slab allocator has a heuristic that checks whether the internal
fragmentation is satisfactory and, if not, increases cachep->gfporder to
try to improve this.

If the amount of waste is the same at higher cachep->gfporder values,
there is no significant benefit to allocating higher order memory.  There
will be fewer calls to the page allocator, but each call will require
zone->lock and finding the page of best fit from the per-zone free areas.

Instead, it is better to allocate order-0 memory if possible so that pages
can be returned from the per-cpu pagesets (pcp).

There are two reasons to prefer this over allocating high order memory:

 - allocating from the pcp lists does not require a per-zone lock, and

 - this reduces stranding of MIGRATE_UNMOVABLE pageblocks on pcp lists
   that increases slab fragmentation across a zone.

We are particularly interested in the second point to eliminate cases
where all other pages on a pageblock are movable (or free) and fallback to
pageblocks of other migratetypes from the per-zone free areas causes
high-order slab memory to be allocated from them rather than from free
MIGRATE_UNMOVABLE pages on the pcp.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/slab.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Andrew Morton Oct. 12, 2018, 10:13 p.m. UTC | #1

On Fri, 12 Oct 2018 14:24:57 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> The slab allocator has a heuristic that checks whether the internal
> fragmentation is satisfactory and, if not, increases cachep->gfporder to
> try to improve this.
> 
> If the amount of waste is the same at higher cachep->gfporder values,
> there is no significant benefit to allocating higher order memory.  There
> will be fewer calls to the page allocator, but each call will require
> zone->lock and finding the page of best fit from the per-zone free areas.
> 
> Instead, it is better to allocate order-0 memory if possible so that pages
> can be returned from the per-cpu pagesets (pcp).
> 
> There are two reasons to prefer this over allocating high order memory:
> 
>  - allocating from the pcp lists does not require a per-zone lock, and
> 
>  - this reduces stranding of MIGRATE_UNMOVABLE pageblocks on pcp lists
>    that increases slab fragmentation across a zone.

Confused.  Higher-order slab pages never go through the pcp lists, do
they?  I'd have thought that by tending to increase the amount of
order-0 pages which are used by slab, such stranding would be
*increased*?

> We are particularly interested in the second point to eliminate cases
> where all other pages on a pageblock are movable (or free) and fallback to
> pageblocks of other migratetypes from the per-zone free areas causes
> high-order slab memory to be allocated from them rather than from free
> MIGRATE_UNMOVABLE pages on the pcp.
> 
>  mm/slab.c | 15 +++++++++++++++

Do slub and slob also suffer from this effect?

David Rientjes Oct. 12, 2018, 11:09 p.m. UTC | #2

On Fri, 12 Oct 2018, Andrew Morton wrote:

> > The slab allocator has a heuristic that checks whether the internal
> > fragmentation is satisfactory and, if not, increases cachep->gfporder to
> > try to improve this.
> > 
> > If the amount of waste is the same at higher cachep->gfporder values,
> > there is no significant benefit to allocating higher order memory.  There
> > will be fewer calls to the page allocator, but each call will require
> > zone->lock and finding the page of best fit from the per-zone free areas.
> > 
> > Instead, it is better to allocate order-0 memory if possible so that pages
> > can be returned from the per-cpu pagesets (pcp).
> > 
> > There are two reasons to prefer this over allocating high order memory:
> > 
> >  - allocating from the pcp lists does not require a per-zone lock, and
> > 
> >  - this reduces stranding of MIGRATE_UNMOVABLE pageblocks on pcp lists
> >    that increases slab fragmentation across a zone.
> 
> Confused.  Higher-order slab pages never go through the pcp lists, do
> they?

Nope.

> I'd have thought that by tending to increase the amount of
> order-0 pages which are used by slab, such stranding would be
> *increased*?
> 

These cpus have MIGRATE_UNMOVABLE pages on their pcp list.  But because
the allocations are order-1 instead of order-0, we take zone->lock and find
the smallest possible page in the zone's free area that is of sufficient
size.  In low-memory situations, there are no pages of MIGRATE_UNMOVABLE
migratetype at any order in the free area.  This calls
__rmqueue_fallback(), which steals pageblocks of other migratetypes,
MIGRATE_RECLAIMABLE and then MIGRATE_MOVABLE, and uses them as
MIGRATE_UNMOVABLE.

We rely on the pcp batch count to backfill MIGRATE_UNMOVABLE pages onto 
the pcp list so we don't need to take zone->lock, and as a result of these 
allocations being order-0 rather than order-1 we can then allocate from 
these pages when such slab caches are expanded rather than stranding them.

We noticed this when the amount of memory wasted per page for TCPv6 was the
same for both order-0 and order-1 allocations (the absolute order-1 waste
was two times the order-0 waste).  We had hundreds of cpus with pages on their
MIGRATE_UNMOVABLE pcp list, but while allocating order-1 memory it would 
prefer to happily steal other pageblocks before calling reclaim and 
draining pcp lists.
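
A rough worked example of "same waste per page, twice the absolute waste", as a user-space sketch rather than kernel code.  The 3072-byte object size is hypothetical (the thread does not give the TCPv6 object size), 4 KiB pages are assumed, and estimate_objects() is a simplified stand-in for cache_estimate() that ignores slab management overhead, alignment and colouring.

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/*
 * Simplified stand-in for cache_estimate(): returns how many objects of
 * 'size' bytes fit in a slab of the given order and stores the unused
 * bytes in *left_over.  Management overhead is deliberately ignored.
 */
static unsigned int estimate_objects(unsigned int order, size_t size,
				     size_t *left_over)
{
	size_t slab_size = PAGE_SIZE << order;

	*left_over = slab_size % size;
	return (unsigned int)(slab_size / size);
}

/* Unused bytes averaged over the pages that make up the slab. */
static size_t waste_per_page(unsigned int order, size_t left_over)
{
	return left_over >> order;
}
```

Under these assumptions a 3072-byte object leaves 1024 bytes unused in an order-0 slab and 2048 bytes in an order-1 slab, which is 1024 bytes per page either way.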

> > We are particularly interested in the second point to eliminate cases
> > where all other pages on a pageblock are movable (or free) and fallback to
> > pageblocks of other migratetypes from the per-zone free areas causes
> > high-order slab memory to be allocated from them rather than from free
> > MIGRATE_UNMOVABLE pages on the pcp.
> > 
> >  mm/slab.c | 15 +++++++++++++++
> 
> Do slub and slob also suffer from this effect?
> 

SLOB should not.  SLUB will typically increase the order to improve
performance of the cpu cache; there's a drawback to changing out the cpu
cache that SLAB does not have.  In the case that this patch is addressing,
there is no greater memory utilization from the allocated slab pages.

Christoph Lameter (Ampere) Oct. 15, 2018, 10:41 p.m. UTC | #3

On Fri, 12 Oct 2018, Andrew Morton wrote:

> > If the amount of waste is the same at higher cachep->gfporder values,
> > there is no significant benefit to allocating higher order memory.  There
> > will be fewer calls to the page allocator, but each call will require
> > zone->lock and finding the page of best fit from the per-zone free areas.

There is a benefit because the management overhead is halved.

> > Instead, it is better to allocate order-0 memory if possible so that pages
> > can be returned from the per-cpu pagesets (pcp).

Have a benchmark that shows this?

>
> > There are two reasons to prefer this over allocating high order memory:
> >
> >  - allocating from the pcp lists does not require a per-zone lock, and
> >
> >  - this reduces stranding of MIGRATE_UNMOVABLE pageblocks on pcp lists
> >    that increases slab fragmentation across a zone.

The slab allocators generally buffer pages from the page allocator to
avoid this effect given the slowness of page allocator operations anyways.


> Confused.  Higher-order slab pages never go through the pcp lists, do
> they?  I'd have thought that by tending to increase the amount of
> order-0 pages which are used by slab, such stranding would be
> *increased*?

Potentially.


> > We are particularly interested in the second point to eliminate cases
> > where all other pages on a pageblock are movable (or free) and fallback to
> > pageblocks of other migratetypes from the per-zone free areas causes
> > high-order slab memory to be allocated from them rather than from free
> > MIGRATE_UNMOVABLE pages on the pcp.

Well does this actually do some good?

Christoph Lameter (Ampere) Oct. 15, 2018, 10:42 p.m. UTC | #4

On Fri, 12 Oct 2018, David Rientjes wrote:

> @@ -1803,6 +1804,20 @@ static size_t calculate_slab_order(struct kmem_cache *cachep,
>  		 */
>  		if (left_over * 8 <= (PAGE_SIZE << gfporder))
>  			break;
> +
> +		/*
> +		 * If a higher gfporder would not reduce internal fragmentation,
> +		 * no need to continue.  The preference is to keep gfporder as
> +		 * small as possible so slab allocations can be served from
> +		 * MIGRATE_UNMOVABLE pcp lists to avoid stranding.
> +		 */

I think either go for order 0 (because then you can use the pcp lists) or
go as high as possible (then you can allocate larger memory areas with a
single pass through the page allocator).

But then I am not sure that the whole approach will do any good.

David Rientjes Oct. 16, 2018, 12:39 a.m. UTC | #5

On Mon, 15 Oct 2018, Christopher Lameter wrote:

> > > If the amount of waste is the same at higher cachep->gfporder values,
> > > there is no significant benefit to allocating higher order memory.  There
> > > will be fewer calls to the page allocator, but each call will require
> > > zone->lock and finding the page of best fit from the per-zone free areas.
> 
> There is a benefit because the management overhead is halved.
> 

It depends on (1) how difficult it is to allocate higher order memory and
(2) the long-term effects of preferring high order memory over order 0.

For (1), slab has no minimum order fallback like slub does so the 
allocation either succeeds at cachep->gfporder or it fails.  If memory 
fragmentation is such that order-1 memory is not possible, this is fixing 
an issue where the slab allocation would succeed but now fails 
unnecessarily.  If that order-1 memory is painful to allocate, we've 
reclaimed and compacted unnecessarily when order-0 pages are available 
from the pcp list.

For (2), high-order slab allocations increase fragmentation of the zone 
under memory pressure.  If the per-zone free area is void of 
MIGRATE_UNMOVABLE pageblocks such that it must fallback, which it is under 
memory pressure, these order-1 pages can be returned from pageblocks that 
are filled with movable memory, or otherwise free.  This ends up making 
hugepages difficult to allocate from (to the extent where 1.5GB of slab on 
a node is spread over 100GB of pageblocks).  This occurs even though there 
may be MIGRATE_UNMOVABLE pages available on pcp lists.  Using this patch, 
it is possible to backfill the pcp list up to the batchcount with 
MIGRATE_UNMOVABLE order-0 pages that we can subsequently allocate and 
free to, which turns out to be optimized for caches like TCPv6 that result 
in both faster page allocation and less slab fragmentation.

> > > Instead, it is better to allocate order-0 memory if possible so that pages
> > > can be returned from the per-cpu pagesets (pcp).
> 
> Have a benchmark that shows this?
> 

I'm not necessarily approaching this from a performance point of view, but 
rather as a means to reduce slab fragmentation when fallback to order-0 
memory, especially when completely legitimate, is prohibited.  From a 
performance standpoint, this will depend separately on fragmentation
and contention on zone->lock, neither of which exists for order-0 memory
until fallback is required and then the pcp are filled with up to 
batchcount pages.

> >
> > > There are two reasons to prefer this over allocating high order memory:
> > >
> > >  - allocating from the pcp lists does not require a per-zone lock, and
> > >
> > >  - this reduces stranding of MIGRATE_UNMOVABLE pageblocks on pcp lists
> > >    that increases slab fragmentation across a zone.
> 
> The slab allocators generally buffer pages from the page allocator to
> avoid this effect given the slowness of page allocator operations anyways.
> 

It is possible to buffer the same number of pages once they are allocated, 
absent memory pressure, and does not require high-order memory.  This 
seems like a separate issue.

> > > We are particularly interested in the second point to eliminate cases
> > > where all other pages on a pageblock are movable (or free) and fallback to
> > > pageblocks of other migratetypes from the per-zone free areas causes
> > > high-order slab memory to be allocated from them rather than from free
> > > MIGRATE_UNMOVABLE pages on the pcp.
> 
> Well does this actually do some good?
> 

Examining pageblocks via tools/vm/page-types under memory pressure that 
show all B (buddy) and UlAMab (anon mapped) pages and then a single 
order-1 S (slab) page would suggest that the pageblock would not be 
exempted from ever being allocated for a hugepage until the slab is 
completely freed (indeterminate amount of time) if there are any pages on 
the MIGRATE_UNMOVABLE pcp list.

This change is eliminating the exemption from allocating from unmovable 
pages that are readily available instead of preferring to expensively 
allocate order-1 with no reduction in waste.

For users of slab_max_order, which we are not for obvious reasons, I can 
change this to only consider when testing gfporder == 0 since that 
logically makes sense if you prefer.

Vlastimil Babka Oct. 16, 2018, 8:42 a.m. UTC | #6

On 10/12/18 11:24 PM, David Rientjes wrote:
> The slab allocator has a heuristic that checks whether the internal
> fragmentation is satisfactory and, if not, increases cachep->gfporder to
> try to improve this.
> 
> If the amount of waste is the same at higher cachep->gfporder values,
> there is no significant benefit to allocating higher order memory.  There
> will be fewer calls to the page allocator, but each call will require
> zone->lock and finding the page of best fit from the per-zone free areas.
> 
> Instead, it is better to allocate order-0 memory if possible so that pages
> can be returned from the per-cpu pagesets (pcp).
> 
> There are two reasons to prefer this over allocating high order memory:
> 
>  - allocating from the pcp lists does not require a per-zone lock, and
> 
>  - this reduces stranding of MIGRATE_UNMOVABLE pageblocks on pcp lists
>    that increases slab fragmentation across a zone.
> 
> We are particularly interested in the second point to eliminate cases
> where all other pages on a pageblock are movable (or free) and fallback to
> pageblocks of other migratetypes from the per-zone free areas causes
> high-order slab memory to be allocated from them rather than from free
> MIGRATE_UNMOVABLE pages on the pcp.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/slab.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1748,6 +1748,7 @@ static size_t calculate_slab_order(struct kmem_cache *cachep,
>  	for (gfporder = 0; gfporder <= KMALLOC_MAX_ORDER; gfporder++) {
>  		unsigned int num;
>  		size_t remainder;
> +		int order;
>  
>  		num = cache_estimate(gfporder, size, flags, &remainder);
>  		if (!num)
> @@ -1803,6 +1804,20 @@ static size_t calculate_slab_order(struct kmem_cache *cachep,
>  		 */
>  		if (left_over * 8 <= (PAGE_SIZE << gfporder))
>  			break;
> +
> +		/*
> +		 * If a higher gfporder would not reduce internal fragmentation,
> +		 * no need to continue.  The preference is to keep gfporder as
> +		 * small as possible so slab allocations can be served from
> +		 * MIGRATE_UNMOVABLE pcp lists to avoid stranding.
> +		 */
> +		for (order = gfporder + 1; order <= slab_max_order; order++) {
> +			cache_estimate(order, size, flags, &remainder);
> +			if (remainder < left_over)

I think this can be suboptimal when left_over is e.g. 500 for the lower
order and remainder is 800 for the higher order, so wasted memory per
page is lower, although the absolute value isn't. Can that happen?
Probably not for the order-0 vs order-1 case, but for higher orders? In that
case left_over should be shifted left by (order - gfporder) in the
comparison?
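
To make the two comparisons concrete, here is a minimal sketch in plain C.  The helper names are hypothetical and do not appear in the patch; the scaled variant assumes order > gfporder and expresses the lower-order waste in terms of the larger slab before comparing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Comparison as written in the patch: absolute bytes left over. */
static bool reduces_waste_absolute(size_t left_over, size_t remainder)
{
	return remainder < left_over;
}

/*
 * Scaled comparison: a slab of 'order' spans 2^(order - gfporder) slabs
 * of 'gfporder', so the lower-order waste is multiplied by that factor
 * before comparing.  Caller must ensure order > gfporder.
 */
static bool reduces_waste_scaled(size_t left_over, unsigned int gfporder,
				 size_t remainder, unsigned int order)
{
	return remainder < (left_over << (order - gfporder));
}
```

With left_over = 500 and remainder = 800 one order higher, the absolute test reports no improvement, while the scaled test compares 800 against 1000 and reports one.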

> +				break;
> +		}
> +		if (order > slab_max_order)
> +			break;
>  	}
>  	return left_over;
>  }
>
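
For readers who want to try the heuristic outside the kernel, the loop above can be approximated in user space.  This is only a sketch under stated assumptions: estimate() is a simplified stand-in for cache_estimate() that ignores management overhead, 4 KiB pages are assumed, a single hypothetical max_order bound replaces both KMALLOC_MAX_ORDER and slab_max_order, and the other checks in the real calculate_slab_order() are omitted.

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Simplified stand-in for cache_estimate(); ignores management overhead. */
static unsigned int estimate(unsigned int order, size_t size, size_t *left_over)
{
	size_t slab_size = PAGE_SIZE << order;

	*left_over = slab_size % size;
	return (unsigned int)(slab_size / size);
}

/*
 * Returns the order the simplified heuristic settles on and stores the
 * unused bytes for that order in *left_over.
 */
static unsigned int pick_order(size_t size, unsigned int max_order,
			       size_t *left_over)
{
	unsigned int chosen = 0;
	unsigned int gfporder;

	*left_over = 0;
	for (gfporder = 0; gfporder <= max_order; gfporder++) {
		size_t remainder;
		unsigned int order;

		if (!estimate(gfporder, size, &remainder))
			continue;	/* object does not fit at this order */

		chosen = gfporder;
		*left_over = remainder;

		/* Existing rule: accept waste of at most 1/8 of the slab. */
		if (*left_over * 8 <= (PAGE_SIZE << gfporder))
			break;

		/* Proposed rule: stop if no higher order leaves fewer bytes. */
		for (order = gfporder + 1; order <= max_order; order++) {
			estimate(order, size, &remainder);
			if (remainder < *left_over)
				break;
		}
		if (order > max_order)
			break;
	}
	return chosen;
}
```

In this model, pick_order(3072, 1, &left) stays at order 0 because order 1 only doubles the unused bytes, whereas pick_order(2560, 1, &left) moves to order 1 because the unused bytes drop from 1536 to 512.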

Christoph Lameter (Ampere) Oct. 16, 2018, 3:17 p.m. UTC | #7

On Mon, 15 Oct 2018, David Rientjes wrote:

> On Mon, 15 Oct 2018, Christopher Lameter wrote:
>
> > > > If the amount of waste is the same at higher cachep->gfporder values,
> > > > there is no significant benefit to allocating higher order memory.  There
> > > > will be fewer calls to the page allocator, but each call will require
> > > > zone->lock and finding the page of best fit from the per-zone free areas.
> >
> > There is a benefit because the management overhead is halved.
> >
>
> It depends on (1) how difficult it is to allocate higher order memory and
> (2) the long-term effects of preferring high order memory over order 0.

The overhead of the page allocator is orders of magnitude bigger than
slab allocation. Higher order may be faster because the pcp overhead is
not there. It all depends. Please come up with some benchmarking to
substantiate these ideas.

>
> For (1), slab has no minimum order fallback like slub does so the
> allocation either succeeds at cachep->gfporder or it fails.  If memory
> fragmentation is such that order-1 memory is not possible, this is fixing
> an issue where the slab allocation would succeed but now fails
> unnecessarily.  If that order-1 memory is painful to allocate, we've
> reclaimed and compacted unnecessarily when order-0 pages are available
> from the pcp list.
>

Ok that sounds good but the performance impact is still an issue. Also we
agreed that the page allocator will provide allocations up to
COSTLY_ORDER without too much fuss. Other system components may fail if
these smaller order pages are not available.

> > Have a benchmark that shows this?
> >
>
> I'm not necessarily approaching this from a performance point of view, but
> rather as a means to reduce slab fragmentation when fallback to order-0
> memory, especially when completely legitimate, is prohibited.  From a
> performance standpoint, this will depend separately on fragmentation
> and contention on zone->lock, neither of which exists for order-0 memory
> until fallback is required and then the pcp are filled with up to
> batchcount pages.

Fragmentation is a performance issue and causes degradation of Linux MM
performance over time.  There are pretty complex mechanisms that need to be
played against one another.

Come up with some metrics to get meaningful data that allows us to see the
impact.

I think what would be beneficial to have is a load that gradually
degrades as another process causes fragmentation. Any patch like the one
proposed should have an effect on the degree of fragmentation after a
certain time.

Having something like that could lead to a whole series of optimizations.
Ideally we would like to have a MM subsystem that does not degrade as
it does today.

Vlastimil Babka Oct. 17, 2018, 9:09 a.m. UTC | #8

On 10/16/18 5:17 PM, Christopher Lameter wrote:
>> I'm not necessarily approaching this from a performance point of view, but
>> rather as a means to reduce slab fragmentation when fallback to order-0
>> memory, especially when completely legitimate, is prohibited.  From a
>> performance standpoint, this will depend separately on fragmentation
>> and contention on zone->lock, neither of which exists for order-0 memory
>> until fallback is required and then the pcp are filled with up to
>> batchcount pages.
> Fragmentation is a performance issue and causes degradation of Linux MM
> performance over time.  There are pretty complex mechanisms that need to be
> played against one another.
> 
> Come up with some metrics to get meaningful data that allows us to see the
> impact.

I don't think the patch as it is needs any special evaluation. SLAB's
current design is to keep gfporder at the minimum that satisfies "Acceptable
internal fragmentation" of 1/8 of the allocated gfporder page (hm,
arguably that should also be considered relative to an order-0 page, as
I've argued for the comparison done in this patch as well).

In such a design it's simply an oversight that we increase the gfporder in
cases when it doesn't improve the internal fragmentation metric, and it
should be a straightforward decision to stop doing it.

I.e. the benefits vs drawbacks of higher order allocations for SLAB are
out of scope here. It would be nice if somebody evaluated them, but the
potential resulting change would be much larger than what concerns this
patch. But it would arguably also make SLAB more like SLUB, which you
already questioned at some point...

Vlastimil

Christoph Lameter (Ampere) Oct. 17, 2018, 3:38 p.m. UTC | #9

On Wed, 17 Oct 2018, Vlastimil Babka wrote:

> I.e. the benefits vs drawbacks of higher order allocations for SLAB are
> out of scope here. It would be nice if somebody evaluated them, but the
> potential resulting change would be much larger than what concerns this
> patch. But it would arguably also make SLAB more like SLUB, which you
> already questioned at some point...

Well if this leads to more code going into mm/slab_common.c then I would
certainly welcome that.
