[RFC,1/2] Protect larger order pages from breaking up

Message ID 20180216160121.519788537@linux.com (mailing list archive)
State RFC

Commit Message

Christoph Lameter (Ampere) Feb. 16, 2018, 4:01 p.m. UTC
Over time as the kernel is churning through memory it will break
up larger pages and as time progresses larger contiguous allocations
will no longer be possible. This is an approach to preserve these
large pages and prevent them from being broken up.

This is useful for example for the use of jumbo pages and can
satisfy various needs of subsystems and device drivers that require
large contiguous allocations to operate properly.

The idea is to reserve a pool of pages of the required order
so that the kernel is not allowed to use the pages for allocations
of a different order. This is a pool that is fully integrated
into the page allocator and therefore transparently usable.

Control over this feature is by writing to /proc/zoneinfo.

F.e. to ensure that 2000 16K pages stay available for jumbo
frames do

	echo "2=2000" >/proc/zoneinfo

or through the order=<page spec> on the kernel command line.
F.e.

	order=2=2000,4N2=500

These pages will be subject to reclaim etc as usual but will not
be broken up.
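
The spec syntax used in the two examples above, <order>[N<node>]=<number>
with comma separated entries, can be sketched in userspace as follows. This is
only an illustration and not the parser from the patch: struct order_spec,
parse_number() and parse_order_spec() are hypothetical names, and the range
checks the patch applies (order 0 and orders >= MAX_ORDER are rejected, the
node must be online) are left out.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical userspace type for one parsed entry; not part of the patch. */
struct order_spec {
	int order;
	int node;		/* stays 0 when no N<node> is given */
	unsigned int pages;
};

/* Parse a run of decimal digits at *p and advance past it (no overflow check). */
static unsigned int parse_number(const char **p)
{
	unsigned int v = 0;

	while (isdigit((unsigned char)**p)) {
		v = v * 10 + (unsigned int)(**p - '0');
		(*p)++;
	}
	return v;
}

/*
 * Parse "<order>[N<node>]=<number>{,<order>[N<node>]=<number>}" into out[].
 * Returns the number of entries parsed, or -1 on a syntax error or when
 * the string holds more than max entries.
 */
static int parse_order_spec(const char *s, struct order_spec *out, int max)
{
	int n = 0;

	assert(s != NULL && out != NULL);
	for (;;) {
		struct order_spec e = { 0, 0, 0 };

		if (n == max || !isdigit((unsigned char)*s))
			return -1;
		e.order = (int)parse_number(&s);

		/* Optional node specification */
		if (*s == 'N') {
			s++;
			if (!isdigit((unsigned char)*s))
				return -1;
			e.node = (int)parse_number(&s);
		}

		if (*s++ != '=' || !isdigit((unsigned char)*s))
			return -1;
		e.pages = parse_number(&s);

		out[n++] = e;
		if (*s == '\0')
			return n;
		if (*s++ != ',')
			return -1;
	}
}
```

Feeding it "2=2000,4N2=500" yields two entries: order 2 on node 0 with 2000
pages, and order 4 on node 2 with 500 pages.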

One can then also f.e. operate the slub allocator with
64k pages. Specify "slub_max_order=4 slub_min_order=4" on
the kernel command line and all slab allocator allocations
will occur in 64K page sizes.

Note that this will reduce the memory available to the application
in some cases. Reclaim may occur more often. If more than
the reserved number of higher order pages are being used then
allocations will still fail as normal.
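
To get a feel for how much memory such a reservation sets aside, the
arithmetic can be sketched as below. It assumes 4K base pages
(TOY_PAGE_SHIFT is a stand-in for the kernel's PAGE_SHIFT), and
order_bytes() and reserved_bytes() are hypothetical helper names used only
for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4K base pages */

/* Size in bytes of a single page of the given order. */
static uint64_t order_bytes(unsigned int order)
{
	assert(order < 32);
	return (uint64_t)1 << (TOY_PAGE_SHIFT + order);
}

/* Memory held back by reserving `count` pages of `order`. */
static uint64_t reserved_bytes(unsigned int order, uint64_t count)
{
	return count * order_bytes(order);
}
```

With these assumptions an order 2 page is 16K and an order 4 page is 64K, so
the "2=2000" example from above holds back 2000 * 16K, a little over 31MB.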

In order to make this work just right one needs to be able to
know the workload well enough to reserve the right amount
of pages. This is comparable to other reservation schemes.

Well that f.e brings up huge pages. You can of course
also use this to reserve those and can then be sure that
you can dynamically resize your huge page pools even after
a long time of system up time.

The idea for this patch came from Thomas Schoebel-Theuer whom I met
at the LCA and who described the approach to me promising
a patch that would do this. Sadly he has vanished somehow.
However, he has been using this approach to support a
production environment for numerous years.

So I redid his patch and this is the first draft of it.


Idea-by: Thomas Schoebel-Theuer <tst@schoebel-theuer.de>

First performance tests in a virtual environment show
a hackbench improvement of 6% just by increasing
the page size used by the page allocator to order 3.

Signed-off-by: Christopher Lameter <cl@linux.com>


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andi Kleen Feb. 16, 2018, 5:03 p.m. UTC | #1
> First performance tests in a virtual enviroment show
> a hackbench improvement by 6% just by increasing
> the page size used by the page allocator to order 3.

So why is hackbench improving? Is that just for kernel stacks?

-Andi
Randy Dunlap Feb. 16, 2018, 6:02 p.m. UTC | #2
On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> Control over this feature is by writing to /proc/zoneinfo.
> 
> F.e. to ensure that 2000 16K pages stay available for jumbo
> frames do
> 
> 	echo "2=2000" >/proc/zoneinfo
> 
> or through the order=<page spec> on the kernel command line.
> F.e.
> 
> 	order=2=2000,4N2=500


Please document the kernel command line option in
Documentation/admin-guide/kernel-parameters.txt.

I suppose that /proc/zoneinfo should be added somewhere in Documentation/vm/
but I'm not sure where that would be.

thanks,
Christoph Lameter (Ampere) Feb. 16, 2018, 6:25 p.m. UTC | #3
On Fri, 16 Feb 2018, Andi Kleen wrote:

> > First performance tests in a virtual enviroment show
> > a hackbench improvement by 6% just by increasing
> > the page size used by the page allocator to order 3.
>
> So why is hackbench improving? Is that just for kernel stacks?

Less stack overhead. The larger the page size, the less metadata needs to be
handled. The freelists get larger and the chance of hitting the per cpu
freelist increases.

Mike Kravetz Feb. 16, 2018, 6:59 p.m. UTC | #4
On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> Over time as the kernel is churning through memory it will break
> up larger pages and as time progresses larger contiguous allocations
> will no longer be possible. This is an approach to preserve these
> large pages and prevent them from being broken up.
> 
> This is useful for example for the use of jumbo pages and can
> satify various needs of subsystems and device drivers that require
> large contiguous allocation to operate properly.
> 
> The idea is to reserve a pool of pages of the required order
> so that the kernel is not allowed to use the pages for allocations
> of a different order. This is a pool that is fully integrated
> into the page allocator and therefore transparently usable.
> 
> Control over this feature is by writing to /proc/zoneinfo.
> 
> F.e. to ensure that 2000 16K pages stay available for jumbo
> frames do
> 
> 	echo "2=2000" >/proc/zoneinfo
> 
> or through the order=<page spec> on the kernel command line.
> F.e.
> 
> 	order=2=2000,4N2=500
> 
> These pages will be subject to reclaim etc as usual but will not
> be broken up.
> 
> One can then also f.e. operate the slub allocator with
> 64k pages. Specify "slub_max_order=4 slub_min_order=4" on
> the kernel command line and all slab allocator allocations
> will occur in 64K page sizes.
> 
> Note that this will reduce the memory available to the application
> in some cases. Reclaim may occur more often. If more than
> the reserved number of higher order pages are being used then
> allocations will still fail as normal.
> 
> In order to make this work just right one needs to be able to
> know the workload well enough to reserve the right amount
> of pages. This is comparable to other reservation schemes.

Yes.

I like the idea that this only comes into play as the result of explicit
user/sysadmin action.  It does remind me of hugetlbfs reservations.  So,
we hope that only people who really know their workload and know what
they are doing would use this feature.

> Well that f.e brings up huge pages. You can of course
> also use this to reserve those and can then be sure that
> you can dynamically resize your huge page pools even after
> a long time of system up time.

Yes, and no.  Doesn't that assume nobody else is doing allocations
of that size?  For example, I could imagine THP using huge page sized
reservations.  Then when it comes time to resize your hugetlbfs pool
there may not be enough.  Although, we may quickly split THP pages
in this case.  I am not sure.

IIRC, Guy Shattah's use case was for allocations greater than MAX_ORDER.
This would not directly address that.  A huge contiguous area (2GB) is
the 'sweet spot' for best performance in his case.  However, I think he
could still benefit from using a set of larger (such as 2MB) size
allocations which this scheme could help with.
Dave Hansen Feb. 16, 2018, 7:01 p.m. UTC | #5
On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> In order to make this work just right one needs to be able to
> know the workload well enough to reserve the right amount
> of pages. This is comparable to other reservation schemes.

Yes, but it's a reservation scheme that doesn't show up in MemFree, for
instance.  Even hugetlbfs-reserved memory subtracts from that.

This has the potential to be really confusing to apps.  If this memory
is now not available to normal apps, they might plow into the invisible
memory limits and get into nasty reclaim scenarios.

Shouldn't this subtract the memory for MemFree and friends?
Christoph Lameter (Ampere) Feb. 16, 2018, 8:13 p.m. UTC | #6
On Fri, 16 Feb 2018, Mike Kravetz wrote:

> > Well that f.e brings up huge pages. You can of course
> > also use this to reserve those and can then be sure that
> > you can dynamically resize your huge page pools even after
> > a long time of system up time.
>
> Yes, and no.  Doesn't that assume nobody else is doing allocations
> of that size?  For example, I could image THP using huge page sized
> reservations.  The when it comes time to resize your hugetlbfs pool
> there may not be enough.  Although, we may quickly split THP pages
> in this case.  I am not sure.

Yup it has a pool for everyone. Question is how to divide the loot ;-)

> IIRC, Guy Shattah's use case was for allocations greater than MAX_ORDER.
> This would not directly address that.  A huge contiguous area (2GB) is
> the sweet spot' for best performance in his case.  However, I think he
> could still benefit from using a set of larger (such as 2MB) size
> allocations which this scheme could help with.

MAX_ORDER can be increased to allow for larger allocations. IA64 has f.e.
a much larger MAX_ORDER size. So does powerpc. And then the reservation
scheme will work.


Christoph Lameter (Ampere) Feb. 16, 2018, 8:15 p.m. UTC | #7
On Fri, 16 Feb 2018, Dave Hansen wrote:

> On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> > In order to make this work just right one needs to be able to
> > know the workload well enough to reserve the right amount
> > of pages. This is comparable to other reservation schemes.
>
> Yes, but it's a reservation scheme that doesn't show up in MemFree, for
> instance.  Even hugetlbfs-reserved memory subtracts from that.

Ok. There is the question if we can get all these reservation schemes
under one hood instead of having page order specific ones in subsystems
like hugetlb.

> This has the potential to be really confusing to apps.  If this memory
> is now not available to normal apps, they might plow into the invisible
> memory limits and get into nasty reclaim scenarios.

> Shouldn't this subtract the memory for MemFree and friends?

Ok certainly we could do that. But on the other hand the memory is
available if those subsystems ask for the right order. It's not clear to me
what the right way of handling this is. Right now it adds the reserved
pages to the watermarks. But then under some circumstances the memory is
available. What is the best solution here?
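
As a rough illustration of "adds the reserved pages to the watermarks", the
value computed by the __setup_per_zone_wmarks() hunk before its is_highmem()
branch can be modelled in userspace like this. toy_wmark_min() and the TOY_*
constants are hypothetical stand-ins, with 4K pages and a MAX_ORDER of 11
assumed.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4K base pages */
#define TOY_MAX_ORDER 11	/* assumption: stand-in for MAX_ORDER */

/*
 * Model of the per-zone minimum computed by the hunk: the zone's
 * proportional share of min_free_kbytes plus every reserved higher
 * order page, counted in base pages.
 */
static uint64_t toy_wmark_min(uint64_t min_free_kbytes, uint64_t zone_pages,
			      uint64_t lowmem_pages,
			      const unsigned long min[TOY_MAX_ORDER])
{
	uint64_t pages_min = min_free_kbytes >> (TOY_PAGE_SHIFT - 10);
	uint64_t tmp;
	int order;

	assert(lowmem_pages > 0);
	tmp = pages_min * zone_pages / lowmem_pages;

	/* The addition made by the patch */
	for (order = 0; order < TOY_MAX_ORDER; order++)
		tmp += (uint64_t)min[order] << order;

	return tmp;
}
```

For a single zone with min_free_kbytes at 65536, the base value is 16384
pages; reserving 2000 order 2 pages raises it by 8000 pages to 24384.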

Dave Hansen Feb. 16, 2018, 9:08 p.m. UTC | #8
On 02/16/2018 12:15 PM, Christopher Lameter wrote:
>> This has the potential to be really confusing to apps.  If this memory
>> is now not available to normal apps, they might plow into the invisible
>> memory limits and get into nasty reclaim scenarios.
>> Shouldn't this subtract the memory for MemFree and friends?
> Ok certainly we could do that. But on the other hand the memory is
> available if those subsystems ask for the right order. Its not clear to me
> what the right way of handling this is. Right now it adds the reserved
> pages to the watermarks. But then under some circumstances the memory is
> available. What is the best solution here?

There's definitely no perfect solution.

But, in general, I think we should cater to the dumbest users.  Folks
doing higher-order allocations are not that.  I say we make the picture
the most clear for the traditional 4k users.
Matthew Wilcox Feb. 16, 2018, 9:43 p.m. UTC | #9
On Fri, Feb 16, 2018 at 01:08:11PM -0800, Dave Hansen wrote:
> On 02/16/2018 12:15 PM, Christopher Lameter wrote:
> >> This has the potential to be really confusing to apps.  If this memory
> >> is now not available to normal apps, they might plow into the invisible
> >> memory limits and get into nasty reclaim scenarios.
> >> Shouldn't this subtract the memory for MemFree and friends?
> > Ok certainly we could do that. But on the other hand the memory is
> > available if those subsystems ask for the right order. Its not clear to me
> > what the right way of handling this is. Right now it adds the reserved
> > pages to the watermarks. But then under some circumstances the memory is
> > available. What is the best solution here?
> 
> There's definitely no perfect solution.
> 
> But, in general, I think we should cater to the dumbest users.  Folks
> doing higher-order allocations are not that.  I say we make the picture
> the most clear for the traditional 4k users.

Your way might be confusing -- if there's a system which is under varying
amounts of jumboframe load and all the 16k pages get gobbled up by the
ethernet driver, MemFree won't change at all, for example.
Dave Hansen Feb. 16, 2018, 9:47 p.m. UTC | #10
On 02/16/2018 01:43 PM, Matthew Wilcox wrote:
>> There's definitely no perfect solution.
>>
>> But, in general, I think we should cater to the dumbest users.  Folks
>> doing higher-order allocations are not that.  I say we make the picture
>> the most clear for the traditional 4k users.
> Your way might be confusing -- if there's a system which is under varying
> amounts of jumboframe load and all the 16k pages get gobbled up by the
> ethernet driver, MemFree won't change at all, for example.

IOW, you agree that "there's definitely no perfect solution." :)
Mike Rapoport Feb. 17, 2018, 4:07 p.m. UTC | #11
On February 16, 2018 7:02:53 PM GMT+01:00, Randy Dunlap <rdunlap@infradead.org> wrote:
>On 02/16/2018 08:01 AM, Christoph Lameter wrote:
>> Control over this feature is by writing to /proc/zoneinfo.
>> 
>> F.e. to ensure that 2000 16K pages stay available for jumbo
>> frames do
>> 
>> 	echo "2=2000" >/proc/zoneinfo
>> 
>> or through the order=<page spec> on the kernel command line.
>> F.e.
>> 
>> 	order=2=2000,4N2=500
>
>
>Please document the the kernel command line option in
>Documentation/admin-guide/kernel-parameters.txt.
>
>I suppose that /proc/zoneinfo should be added somewhere in
>Documentation/vm/
>but I'm not sure where that would be.

It's in Documentation/sysctl/vm.txt and in 'man proc' [1]

[1] https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man5/proc.5

>thanks,
Guy Shattah Feb. 18, 2018, 9 a.m. UTC | #12
> 
> Yup it has a pool for everyone. Question is how to divide the loot ;-)
> 
> > IIRC, Guy Shattah's use case was for allocations greater than MAX_ORDER.
> > This would not directly address that.  A huge contiguous area (2GB) is
> > the sweet spot' for best performance in his case.  However, I think he
> > could still benefit from using a set of larger (such as 2MB) size
> > allocations which this scheme could help with.
> 
> MAX_ORDER can be increased to allow for larger allocations. IA64 has f.e.
> a much larger MAX_ORDER size. So does powerpc. And then the reservation
> scheme will work.
> 

MAX_ORDER can be increased only if the kernel is recompiled.
That won't work for code running in the general case / for the typical user.
Mel Gorman Feb. 19, 2018, 10:19 a.m. UTC | #13
My skynet.ie/csn.ul.ie address has been defunct for quite some time.
Mail sent to it is not guaranteed to get to me.

On Fri, Feb 16, 2018 at 10:01:11AM -0600, Christoph Lameter wrote:
> Over time as the kernel is churning through memory it will break
> up larger pages and as time progresses larger contiguous allocations
> will no longer be possible. This is an approach to preserve these
> large pages and prevent them from being broken up.
> 
> <SNIP>
> Idea-by: Thomas Schoebel-Theuer <tst@schoebel-theuer.de>
> 
> First performance tests in a virtual enviroment show
> a hackbench improvement by 6% just by increasing
> the page size used by the page allocator to order 3.
> 

The phrasing here is confusing. hackbench is not very intensive in terms of
memory, it's more fork intensive where I find it extremely unlikely that
it would hit problems with fragmentation unless memory was deliberately
fragmented first. Furthermore, the phrasing implies that the minimum order
used by the page allocator is order 3 which is not what the patch appears
to do.

> Signed-off-by: Christopher Lameter <cl@linux.com>
> 
> Index: linux/include/linux/mmzone.h
> ===================================================================
> --- linux.orig/include/linux/mmzone.h
> +++ linux/include/linux/mmzone.h
> @@ -96,6 +96,11 @@ extern int page_group_by_mobility_disabl
>  struct free_area {
>  	struct list_head	free_list[MIGRATE_TYPES];
>  	unsigned long		nr_free;
> +	/* We stop breaking up pages of this order if less than
> +	 * min are available. At that point the pages can only
> +	 * be used for allocations of that particular order.
> +	 */
> +	unsigned long		min;
>  };
>  
>  struct pglist_data;
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c
> +++ linux/mm/page_alloc.c
> @@ -1844,7 +1844,12 @@ struct page *__rmqueue_smallest(struct z
>  		area = &(zone->free_area[current_order]);
>  		page = list_first_entry_or_null(&area->free_list[migratetype],
>  							struct page, lru);
> -		if (!page)
> +		/*
> +		 * Continue if no page is found or if our freelist contains
> +		 * less than the minimum pages of that order. In that case
> +		 * we better look for a different order.
> +		 */
> +		if (!page || area->nr_free < area->min)
>  			continue;
>  		list_del(&page->lru);
>  		rmv_page_order(page);

This is surprising to say the least. Assuming reservations are at order-3,
this would refuse to split order-3 even if there were sufficient reserved
pages at higher orders for a reserve. This will cause splits of higher
orders unnecessarily which could cause other fragmentation-related issues
in the future.

This is similar to a memory pool except it's not. There is no concept of a
user of high-order reserves accounting for it. Hence, a user of high-order
pages could allocate the reserve multiple times for long-term purposes
while starving other allocation requests. This could easily happen for slub
with min_order set to the same order as the reserve causing potential OOM
issues. If a pool is to be created, it should be a real pool even if it's
transparently accessed through the page allocator. It should allocate the
requested number of pages and either decide to refill if possible or pass
requests through to the page allocator when the pool is depleted. Also,
as it stands, an OOM due to the reserve would be confusing as there is no
hint the failure may have been due to the reserve.

Access to the pool is unprotected so you might create a reserve for jumbo
frames only to have them consumed by something else entirely. It's not
clear if that is even fixable as GFP flags are too coarse.

It is not covered in the changelog why MIGRATE_HIGHATOMIC was not
sufficient for jumbo frames which are generally expected to be allocated
from atomic context. If there is a problem there then maybe
MIGRATE_HIGHATOMIC should be made more strict instead of a hack like
this. It'll be very difficult, if not impossible, for this to be tuned
properly.

Finally, while I accept that fragmentation over time is a problem for
unmovable allocations (fragmentation protection was originally designed
for THP/hugetlbfs), this is papering over the problem. If greater
protections are needed then the right approach is to be more strict about
fallbacks. Specifically, unmovable allocations should migrate all movable
pages out of migrate_unmovable pageblocks before falling back and that
can be controlled by policy due to the overhead of migration. For atomic
allocations, allow fallback but use kcompact or a workqueue to migrate
movable pages out of migrate_unmovable pageblocks to limit fallbacks in
the future.

I'm not a fan of this patch.
Michal Hocko Feb. 19, 2018, 2:42 p.m. UTC | #14
On Mon 19-02-18 10:19:35, Mel Gorman wrote:
[...]
> Access to the pool is unprotected so you might create a reserve for jumbo
> frames only to have them consumed by something else entirely. It's not
> clear if that is even fixable as GFP flags are too coarse.
> 
> It is not covered in the changelog why MIGRATE_HIGHATOMIC was not
> sufficient for jumbo frames which are generally expected to be allocated
> from atomic context. If there is a problem there then maybe
> MIGRATE_HIGHATOMIC should be made more strict instead of a hack like
> this. It'll be very difficult, if not impossible, for this to be tuned
> properly.
> 
> Finally, while I accept that fragmentation over time is a problem for
> unmovable allocations (fragmentation protection was originally designed
> for THP/hugetlbfs), this is papering over the problem. If greater
> protections are needed then the right approach is to be more strict about
> fallbacks. Specifically, unmovable allocations should migrate all movable
> pages out of migrate_unmovable pageblocks before falling back and that
> can be controlled by policy due to the overhead of migration. For atomic
> allocations, allow fallback but use kcompact or a workqueue to migrate
> movable pages out of migrate_unmovable pageblocks to limit fallbacks in
> the future.

Completely agreed!

> I'm not a fan of this patch.

Yes, I think the approach is just wrong. It will just hit all sorts of
weird corner cases and won't work reliably for those who care.
Christoph Lameter (Ampere) Feb. 19, 2018, 3:09 p.m. UTC | #15
On Mon, 19 Feb 2018, Mel Gorman wrote:

> The phrasing here is confusing. hackbench is not very intensive in terms of
> memory, it's more fork intensive where I find it extremely unlikely that
> it would hit problems with fragmentation unless memory was deliberately
> fragmented first. Furthermore, the phrasing implies that the minimum order
> used by the page allocator is order 3 which is not what the patch appears
> to do.

It was used to illustrate the performance gain.

> > -		if (!page)
> > +		/*
> > +		 * Continue if no page is found or if our freelist contains
> > +		 * less than the minimum pages of that order. In that case
> > +		 * we better look for a different order.
> > +		 */
> > +		if (!page || area->nr_free < area->min)
> >  			continue;
> >  		list_del(&page->lru);
> >  		rmv_page_order(page);
>
> This is surprising to say the least. Assuming reservations are at order-3,
> this would refuse to split order-3 even if there was sufficient reserved
> pages at higher orders for a reserve. This will cause splits of higher
> orders unnecessarily which could cause other fragmentation-related issues
> in the future.

Well that is intended. We want to preserve a number of pages at a certain
order. If there are higher order pages available then those can be split
and the allocation will succeed while preserving the minimum number of
pages at the reserved order.
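
The behaviour under discussion can be tried out with a toy model of the
modified check. This is a sketch only: struct toy_area and
toy_rmqueue_smallest() are hypothetical userspace stand-ins, free lists are
reduced to counters, migrate types are ignored, and the split is modelled as
one free buddy left behind on each order below the one taken.

```c
#include <assert.h>

#define TOY_MAX_ORDER 5		/* assumption: small value for the demo */

struct toy_area {
	unsigned long nr_free;	/* free pages of this order */
	unsigned long min;	/* reserve, 0 means no protection */
};

/*
 * Take a page for a request of `order`. An order is skipped when it has
 * no free page or holds fewer free pages than its reserve, mirroring
 * "!page || area->nr_free < area->min" from the hunk.
 * Returns the order the page was taken from, or -1 on failure.
 */
static int toy_rmqueue_smallest(struct toy_area area[TOY_MAX_ORDER], int order)
{
	int cur, o;

	assert(order >= 0 && order < TOY_MAX_ORDER);
	for (cur = order; cur < TOY_MAX_ORDER; cur++) {
		if (area[cur].nr_free == 0 ||
		    area[cur].nr_free < area[cur].min)
			continue;

		area[cur].nr_free--;
		/* Splitting leaves one free buddy on each lower order */
		for (o = cur - 1; o >= order; o--)
			area[o].nr_free++;
		return cur;
	}
	return -1;
}
```

With 3 free order 2 pages, a reserve of 4 at that order and a single free
order 3 page, an order 0 request skips order 2 and splits the order 3 page
instead. Without the order 3 page the same request fails even though order 2
pages are free.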

> This is similar to a memory pool except it's not. There is no concept of a
> user of high-order reserves accounting for it. Hence, a user of high-order
> pages could allocate the reserve multiple times for long-term purposes
> while starving other allocation requests. This could easily happen for slub
> with min_order set to the same order as the reserve causing potential OOM
> issues. If a pool is to be created, it should be a real pool even if it's
> transparently accessed through the page allocator. It should allocate the
> requested number of pages and either decide to refill is possible or pass
> requests through to the page allocator when the pool is depleted. Also,
> as it stands, an OOM due to the reserve would be confusing as there is no
> hint the failure may have been due to the reserve.

Ok we can add the ->min values to the OOM report.

This is a crude approach I agree and it does require knowledge of the load
and user patterns. However, what other approach is there to allow the
system to sustain higher order allocations if those are needed? This is an
issue for which no satisfactory solution is present. So a measure like
this would allow a limited use in some situations.

> Access to the pool is unprotected so you might create a reserve for jumbo
> frames only to have them consumed by something else entirely. It's not
> clear if that is even fixable as GFP flags are too coarse.

If it's consumed by something else then the parameters or the jumbo frame
setting may be adjusted. This feature is off by default so it's only used
for tuning purposes.

> It is not covered in the changelog why MIGRATE_HIGHATOMIC was not
> sufficient for jumbo frames which are generally expected to be allocated
> from atomic context. If there is a problem there then maybe
> MIGRATE_HIGHATOMIC should be made more strict instead of a hack like
> this. It'll be very difficult, if not impossible, for this to be tuned
> properly.

This approach has been in use for a decade or so as mentioned in the patch
description. So please be careful with impossibility claims. This enables
handling of larger contiguous blocks of memory that are required in some
circumstances and it has been doing that successfully (although with some
tuning effort).

> Finally, while I accept that fragmentation over time is a problem for
> unmovable allocations (fragmentation protection was originally designed
> for THP/hugetlbfs), this is papering over the problem. If greater
> protections are needed then the right approach is to be more strict about
> fallbacks. Specifically, unmovable allocations should migrate all movable
> pages out of migrate_unmovable pageblocks before falling back and that
> can be controlled by policy due to the overhead of migration. For atomic
> allocations, allow fallback but use kcompact or a workqueue to migrate
> movable pages out of migrate_unmovable pageblocks to limit fallbacks in
> the future.

This is also papering over more issues. While these measures may delay
fragmentation a bit more they will not result in a pool of large
pages being available for the system throughout the lifetime of it.

> I'm not a fan of this patch.

I am also not a fan of this patch but this is enabling something that we
wanted for a long time. Consistent ability in a limited way to allocate
large page orders.

Since we have failed to address this in another way this may be the best ad
hoc method to get there. What we have done to address fragmentation so far
are all these preventative measures that get more ineffective as time
progresses while memory sizes increase. Either we do this or we need to
actually do one of the other known measures to address fragmentation like
making inode/dentries movable.
Patch

Index: linux/include/linux/mmzone.h
===================================================================
--- linux.orig/include/linux/mmzone.h
+++ linux/include/linux/mmzone.h
@@ -96,6 +96,11 @@  extern int page_group_by_mobility_disabl
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
+	/* We stop breaking up pages of this order if less than
+	 * min are available. At that point the pages can only
+	 * be used for allocations of that particular order.
+	 */
+	unsigned long		min;
 };
 
 struct pglist_data;
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c
+++ linux/mm/page_alloc.c
@@ -1844,7 +1844,12 @@  struct page *__rmqueue_smallest(struct z
 		area = &(zone->free_area[current_order]);
 		page = list_first_entry_or_null(&area->free_list[migratetype],
 							struct page, lru);
-		if (!page)
+		/*
+		 * Continue if no page is found or if our freelist contains
+		 * less than the minimum pages of that order. In that case
+		 * we better look for a different order.
+		 */
+		if (!page || area->nr_free < area->min)
 			continue;
 		list_del(&page->lru);
 		rmv_page_order(page);
@@ -5190,6 +5195,57 @@  static void build_zonelists(pg_data_t *p
 
 #endif	/* CONFIG_NUMA */
 
+int set_page_order_min(int node, int order, unsigned min)
+{
+	int i, o;
+	long min_pages = 0;			/* Pages already reserved */
+	long managed_pages = 0;			/* Pages managed on the node */
+	struct zone *last;
+	unsigned remaining;
+
+	/*
+	 * Determine already reserved memory for orders
+	 * plus the total of the pages on the node
+	 */
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z = &NODE_DATA(node)->node_zones[i];
+		if (managed_zone(z)) {
+			for (o = 0; o < MAX_ORDER; o++) {
+				if (o != order)
+					min_pages += z->free_area[o].min << o;
+
+			}
+			managed_pages += z->managed_pages;
+		}
+	}
+
+	if (min_pages + (min << order) > managed_pages / 2)
+		return -ENOMEM;
+
+	/* Set the min values for all zones on the node */
+	remaining = min;
+	for (i = 0; i < MAX_NR_ZONES; i++) {
+		struct zone *z = &NODE_DATA(node)->node_zones[i];
+		if (managed_zone(z)) {
+			u64 tmp;
+
+			tmp = (u64)z->managed_pages * (min << order);
+			do_div(tmp, managed_pages);
+			tmp >>= order;
+			z->free_area[order].min = tmp;
+
+			last = z;
+			remaining -= tmp;
+		}
+	}
+
+	/* Deal with rounding errors */
+	if (remaining)
+		last->free_area[order].min += remaining;
+
+	return 0;
+}
+
 /*
  * Boot pageset table. One per cpu which is going to be used for all
  * zones and all nodes. The parameters will be set in such a way
@@ -5424,6 +5480,7 @@  static void __meminit zone_init_free_lis
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
+		zone->free_area[order].min = 0;
 	}
 }
 
@@ -6998,6 +7055,7 @@  static void __setup_per_zone_wmarks(void
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
 	struct zone *zone;
+	int order;
 	unsigned long flags;
 
 	/* Calculate total number of !ZONE_HIGHMEM pages */
@@ -7012,6 +7070,10 @@  static void __setup_per_zone_wmarks(void
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->managed_pages;
 		do_div(tmp, lowmem_pages);
+
+		for (order = 0; order < MAX_ORDER; order++)
+			tmp += zone->free_area[order].min << order;
+
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -27,6 +27,7 @@ 
 #include <linux/mm_inline.h>
 #include <linux/page_ext.h>
 #include <linux/page_owner.h>
+#include <linux/ctype.h>
 
 #include "internal.h"
 
@@ -1614,6 +1615,11 @@  static void zoneinfo_show_print(struct s
 				zone_numa_state_snapshot(zone, i));
 #endif
 
+	for (i = 0; i < MAX_ORDER; i++)
+		if (zone->free_area[i].min)
+			seq_printf(m, "\nPreserve %lu pages of order %d from breaking up.",
+				zone->free_area[i].min, i);
+
 	seq_printf(m, "\n  pagesets");
 	for_each_online_cpu(i) {
 		struct per_cpu_pageset *pageset;
@@ -1641,6 +1647,122 @@  static void zoneinfo_show_print(struct s
 	seq_putc(m, '\n');
 }
 
+static int __order_protect(char *p)
+{
+	char c;
+
+	do {
+		int order = 0;
+		int pages = 0;
+		int node = 0;
+		int rc;
+
+		/* Syntax <order>[N<node>]=number */
+		if (!isdigit(*p))
+			return -EFAULT;
+
+		while (true) {
+			c = *p++;
+
+			if (!isdigit(c))
+				break;
+
+			order = order * 10 + c - '0';
+		}
+
+		/* Check for optional node specification */
+		if (c == 'N') {
+			if (!isdigit(*p))
+				return -EFAULT;
+
+			while (true) {
+				c = *p++;
+				if (!isdigit(c))
+					break;
+				node = node * 10 + c - '0';
+			}
+		}
+
+		if (c != '=')
+			return -EINVAL;
+
+		if (!isdigit(*p))
+			return -EINVAL;
+
+		while (true) {
+			c = *p++;
+			if (!isdigit(c))
+				break;
+			pages = pages * 10 + c - '0';
+		}
+
+		if (order == 0 || order >= MAX_ORDER)
+		       return -EINVAL;
+
+		if (!node_online(node))
+			return -ENOSYS;
+
+		rc = set_page_order_min(node, order, pages);
+		if (rc)
+			return rc;
+
+	} while (c == ',');
+
+	if (c)
+		return -EINVAL;
+
+	setup_per_zone_wmarks();
+
+	return 0;
+}
+
+/*
+ * Writing to /proc/zoneinfo allows to setup the large page breakup
+ * protection.
+ *
+ * Syntax:
+ * 	<order>[N<node>]=<number>{,<order>[N<node>]=<number>}
+ *
+ * F.e. Protecting 500 pages of order 2 (16K on intel) and 300 of
+ * order 4 (64K) on node 1
+ *
+ * 	echo "2=500,4N1=300" >/proc/zoneinfo
+ *
+ */
+static ssize_t zoneinfo_write(struct file *file, const char __user *buffer,
+			size_t count, loff_t *ppos)
+{
+	char zinfo[200];
+	int rc;
+
+	if (!count || count > sizeof(zinfo))
+		return -EINVAL;
+
+	if (copy_from_user(zinfo, buffer, count))
+		return -EFAULT;
+
+	zinfo[count - 1] = 0;
+
+	rc = __order_protect(zinfo);
+
+	if (rc)
+		return rc;
+
+	return count;
+}
+
+static int order_protect(char *s)
+{
+	int rc;
+
+	rc = __order_protect(s);
+	if (rc)
+		printk("Invalid order=%s rc=%d\n",s, rc);
+
+	return 1;
+}
+__setup("order=", order_protect);
+
 /*
  * Output information about zones in @pgdat.  All zones are printed regardless
  * of whether they are populated or not: lowmem_reserve_ratio operates on the
@@ -1672,6 +1794,7 @@  static const struct file_operations zone
 	.read		= seq_read,
 	.llseek		= seq_lseek,
 	.release	= seq_release,
+	.write		= zoneinfo_write,
 };
 
 enum writeback_stat_item {
@@ -2016,7 +2139,7 @@  void __init init_mm_internals(void)
 	proc_create("buddyinfo", 0444, NULL, &buddyinfo_file_operations);
 	proc_create("pagetypeinfo", 0444, NULL, &pagetypeinfo_file_operations);
 	proc_create("vmstat", 0444, NULL, &vmstat_file_operations);
-	proc_create("zoneinfo", 0444, NULL, &zoneinfo_file_operations);
+	proc_create("zoneinfo", 0644, NULL, &zoneinfo_file_operations);
 #endif
 }
 
Index: linux/include/linux/gfp.h
===================================================================
--- linux.orig/include/linux/gfp.h
+++ linux/include/linux/gfp.h
@@ -543,6 +543,7 @@  void drain_all_pages(struct zone *zone);
 void drain_local_pages(struct zone *zone);
 
 void page_alloc_init_late(void);
+int set_page_order_min(int node, int order, unsigned min);
 
 /*
  * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
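
The limit check and the per-zone split done by set_page_order_min() in the
patch above can be sketched in userspace as follows. This is an illustration
under simplifying assumptions, not the kernel code: every zone is treated as
managed, and toy_reservation_fits() and toy_distribute_min() are hypothetical
names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mirror of the limit in the patch: all reservations together, counted
 * in base pages, may not exceed half of the managed pages of the node.
 */
static int toy_reservation_fits(uint64_t already_reserved, unsigned int order,
				uint64_t min, uint64_t managed_pages)
{
	return already_reserved + (min << order) <= managed_pages / 2;
}

/*
 * Split `min` pages of `order` across zones in proportion to zone size.
 * Whatever is lost to rounding down is added to the last zone.
 */
static void toy_distribute_min(const uint64_t *zone_pages,
			       unsigned int nr_zones, unsigned int order,
			       uint64_t min, uint64_t *zone_min)
{
	uint64_t total = 0, remaining = min;
	unsigned int i;

	assert(nr_zones > 0);
	for (i = 0; i < nr_zones; i++)
		total += zone_pages[i];
	assert(total > 0);

	for (i = 0; i < nr_zones; i++) {
		uint64_t share = zone_pages[i] * (min << order) / total;

		zone_min[i] = share >> order;
		remaining -= zone_min[i];
	}

	/* Deal with rounding errors */
	zone_min[nr_zones - 1] += remaining;
}
```

For two zones of 1000 and 3000 pages and a request for 10 order 2 pages, the
proportional shares round down to 2 and 7, and the remaining page goes to the
last zone, giving 2 and 8.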