Message ID | 20180223030357.048558407@linux.com (mailing list archive)
---|---
State | Not Applicable
[CC += linux-api@vger.kernel.org]

Since this is a kernel-user-space API change, could you please CC
linux-api@ (and also for any future iterations of this patch series).
The kernel source file Documentation/SubmitChecklist notes that all
Linux kernel patches that change userspace interfaces should be CCed to
linux-api@vger.kernel.org, so that the various parties who are
interested in API changes are informed. For further information, see
https://www.kernel.org/doc/man-pages/linux-api-ml.html

On 02/23/2018 04:03 AM, Christoph Lameter wrote:
> rfc->v1
> - Use Thomas' suggestion to change the test in __rmqueue_smallest
>
> Over time, as the kernel churns through memory, it will break up
> larger pages, and as time progresses larger contiguous allocations
> will no longer be possible. This is an approach to preserve these
> large pages and prevent them from being broken up.
>
> This is useful for example for the use of jumbo pages and can
> satisfy various needs of subsystems and device drivers that require
> large contiguous allocations to operate properly.
>
> The idea is to reserve a pool of pages of the required order
> so that the kernel is not allowed to use the pages for allocations
> of a different order. This pool is fully integrated
> into the page allocator and therefore transparently usable.
>
> Control over this feature is by writing to /proc/zoneinfo.
>
> F.e. to ensure that 2000 32K pages (order 3) stay available for jumbo
> frames, do
>
>	echo "3=2000" >/proc/zoneinfo

Huh, that's a rather weird interface to use. Writing to a general
statistics/info file for such specific functionality? Please no.

> or through the order=<page spec> option on the kernel command line.
> F.e.
>
>	order=3=2000,4N2=500
>
> These pages will be subject to reclaim etc. as usual but will not
> be broken up.
>
> One can then also f.e. operate the slub allocator with
> 32K pages.
> Specify "slub_max_order=3 slub_min_order=3" on
> the kernel command line and all slab allocator allocations
> will occur in 32K page sizes.
>
> Note that this will reduce the memory available to applications
> in some cases. Reclaim may occur more often. If more than
> the reserved number of higher-order pages is in use, then
> allocations will still fail as normal.
>
> In order to make this work just right one needs to know the
> workload well enough to reserve the right number of pages. This
> is comparable to other reservation schemes.
>
> Well, that f.e. brings up huge pages. You can of course
> also use this to reserve those, and can then be sure that
> you can dynamically resize your huge page pools even after
> a long system uptime.
>
> The idea for this patch came from Thomas Schoebel-Theuer, whom I met
> at the LCA and who described the approach to me, promising
> a patch that would do this. Sadly he has vanished somehow.
> However, he has been using this approach to support a
> production environment for numerous years.
>
> So I redid his patch and this is the first draft of it.
>
> Idea-by: Thomas Schoebel-Theuer <tst@schoebel-theuer.de>
>
> First performance tests in a virtual environment show
> a hackbench improvement of 6% just by increasing
> the page size used by the page allocator.

That's IMHO a rather weak justification for introducing a new userspace
API. What exactly has been set where? Could similar results be achieved
by tuning the highatomic reserves and/or min_free_kbytes? I especially
wonder how much of the effect comes from the associated watermark
adjustment (which can be affected by min_free_kbytes) and what is due
to the __rmqueue_smallest() changes. You changed the
__rmqueue_smallest() condition since the RFC per Thomas' suggestion,
but report the same results?
> Signed-off-by: Christopher Lameter <cl@linux.com>
>
> Index: linux/include/linux/mmzone.h
> ===================================================================
> --- linux.orig/include/linux/mmzone.h
> +++ linux/include/linux/mmzone.h
> @@ -96,6 +96,11 @@ extern int page_group_by_mobility_disabl
>  struct free_area {
>  	struct list_head	free_list[MIGRATE_TYPES];
>  	unsigned long		nr_free;
> +	/* We stop breaking up pages of this order if less than
> +	 * min are available. At that point the pages can only
> +	 * be used for allocations of that particular order.
> +	 */
> +	unsigned long		min;
>  };
>
>  struct pglist_data;
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c
> +++ linux/mm/page_alloc.c
> @@ -1848,8 +1848,15 @@ struct page *__rmqueue_smallest(struct z
>  		area = &(zone->free_area[current_order]);
>  		page = list_first_entry_or_null(&area->free_list[migratetype],
>  							struct page, lru);
> -		if (!page)
> +		/*
> +		 * Continue if no page is found or if our freelist contains
> +		 * less than the minimum pages of that order. In that case
> +		 * we better look for a different order.
> +		 */
> +		if (!page || (area->nr_free < area->min
> +				&& current_order > order))

For watermarks we have various situations when we let a critical
allocation bypass them to some extent, but this is a strict condition.
That's potential for regressions.

Well, also not a fan of this patch, TBH. It's rather ad-hoc and not
backed up with results. Aside from the above points, I agree with the
objections of others to the RFC posting. It's also rather awkward that
watermarks are increased per the reservations, but when the reservations
are "consumed" (nr_free < min && current_order == order), the increased
watermarks are untouched. IMHO this further enlarges the effects of the
purely adjusted watermarks by this patch.
Vlastimil

(leaving the rest of the quoted mail for linux-api readers)

>  			continue;
> +
>  		list_del(&page->lru);
>  		rmv_page_order(page);
>  		area->nr_free--;
> @@ -5194,6 +5201,57 @@ static void build_zonelists(pg_data_t *p
>
>  #endif	/* CONFIG_NUMA */
>
> +int set_page_order_min(int node, int order, unsigned min)
> +{
> +	int i, o;
> +	long min_pages = 0;	/* Pages already reserved */
> +	long managed_pages = 0;	/* Pages managed on the node */
> +	struct zone *last = NULL;
> +	unsigned remaining;
> +
> +	/*
> +	 * Determine already reserved memory for orders
> +	 * plus the total of the pages on the node
> +	 */
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +		struct zone *z = &NODE_DATA(node)->node_zones[i];
> +		if (managed_zone(z)) {
> +			for (o = 0; o < MAX_ORDER; o++) {
> +				if (o != order)
> +					min_pages += z->free_area[o].min << o;
> +			}
> +			managed_pages += z->managed_pages;
> +		}
> +	}
> +
> +	if (min_pages + (min << order) > managed_pages / 2)
> +		return -ENOMEM;
> +
> +	/* Set the min values for all zones on the node */
> +	remaining = min;
> +	for (i = 0; i < MAX_NR_ZONES; i++) {
> +		struct zone *z = &NODE_DATA(node)->node_zones[i];
> +		if (managed_zone(z)) {
> +			u64 tmp;
> +
> +			tmp = (u64)z->managed_pages * (min << order);
> +			do_div(tmp, managed_pages);
> +			tmp >>= order;
> +			z->free_area[order].min = tmp;
> +
> +			last = z;
> +			remaining -= tmp;
> +		}
> +	}
> +
> +	/* Deal with rounding errors */
> +	if (remaining && last)
> +		last->free_area[order].min += remaining;
> +
> +	return 0;
> +}
> +
>  /*
>   * Boot pageset table. One per cpu which is going to be used for all
>   * zones and all nodes.
>   * The parameters will be set in such a way
> @@ -5428,6 +5486,7 @@ static void __meminit zone_init_free_lis
>  	for_each_migratetype_order(order, t) {
>  		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
>  		zone->free_area[order].nr_free = 0;
> +		zone->free_area[order].min = 0;
>  	}
>  }
>
> @@ -7002,6 +7061,7 @@ static void __setup_per_zone_wmarks(void
>  	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
>  	unsigned long lowmem_pages = 0;
>  	struct zone *zone;
> +	int order;
>  	unsigned long flags;
>
>  	/* Calculate total number of !ZONE_HIGHMEM pages */
> @@ -7016,6 +7076,10 @@ static void __setup_per_zone_wmarks(void
>  		spin_lock_irqsave(&zone->lock, flags);
>  		tmp = (u64)pages_min * zone->managed_pages;
>  		do_div(tmp, lowmem_pages);
> +
> +		for (order = 0; order < MAX_ORDER; order++)
> +			tmp += zone->free_area[order].min << order;
> +
>  		if (is_highmem(zone)) {
>  			/*
>  			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
> Index: linux/mm/vmstat.c
> ===================================================================
> --- linux.orig/mm/vmstat.c
> +++ linux/mm/vmstat.c
> @@ -27,6 +27,7 @@
>  #include <linux/mm_inline.h>
>  #include <linux/page_ext.h>
>  #include <linux/page_owner.h>
> +#include <linux/ctype.h>
>
>  #include "internal.h"
>
> @@ -1614,6 +1615,11 @@ static void zoneinfo_show_print(struct s
>  				zone_numa_state_snapshot(zone, i));
>  #endif
>
> +	for (i = 0; i < MAX_ORDER; i++)
> +		if (zone->free_area[i].min)
> +			seq_printf(m, "\nPreserve %lu pages of order %d from breaking up.",
> +				zone->free_area[i].min, i);
> +
>  	seq_printf(m, "\n  pagesets");
>  	for_each_online_cpu(i) {
>  		struct per_cpu_pageset *pageset;
> @@ -1641,6 +1647,122 @@ static void zoneinfo_show_print(struct s
>  	seq_putc(m, '\n');
>  }
>
> +static int __order_protect(char *p)
> +{
> +	char c;
> +
> +	do {
> +		int order = 0;
> +		int pages = 0;
> +		int node = 0;
> +		int rc;
> +
> +		/* Syntax <order>[N<node>]=number */
> +		if (!isdigit(*p))
> +			return -EFAULT;
> +
> +		while (true) {
> +			c = *p++;
> +
> +			if (!isdigit(c))
> +				break;
> +
> +			order = order * 10 + c - '0';
> +		}
> +
> +		/* Check for optional node specification */
> +		if (c == 'N') {
> +			if (!isdigit(*p))
> +				return -EFAULT;
> +
> +			while (true) {
> +				c = *p++;
> +				if (!isdigit(c))
> +					break;
> +				node = node * 10 + c - '0';
> +			}
> +		}
> +
> +		if (c != '=')
> +			return -EINVAL;
> +
> +		if (!isdigit(*p))
> +			return -EINVAL;
> +
> +		while (true) {
> +			c = *p++;
> +			if (!isdigit(c))
> +				break;
> +			pages = pages * 10 + c - '0';
> +		}
> +
> +		if (order == 0 || order >= MAX_ORDER)
> +			return -EINVAL;
> +
> +		if (!node_online(node))
> +			return -ENOSYS;
> +
> +		rc = set_page_order_min(node, order, pages);
> +		if (rc)
> +			return rc;
> +
> +	} while (c == ',');
> +
> +	if (c)
> +		return -EINVAL;
> +
> +	setup_per_zone_wmarks();
> +
> +	return 0;
> +}
> +
> +/*
> + * Writing to /proc/zoneinfo allows setting up the large page breakup
> + * protection.
> + *
> + * Syntax:
> + *	<order>[N<node>]=<number>{,<order>[N<node>]=<number>}
> + *
> + * F.e. Protecting 500 pages of order 2 (16K on intel) and 300 of
> + * order 4 (64K) on node 1
> + *
> + *	echo "2=500,4N1=300" >/proc/zoneinfo
> + */
> +static ssize_t zoneinfo_write(struct file *file, const char __user *buffer,
> +			size_t count, loff_t *ppos)
> +{
> +	char zinfo[200];
> +	int rc;
> +
> +	if (count > sizeof(zinfo))
> +		return -EINVAL;
> +
> +	if (copy_from_user(zinfo, buffer, count))
> +		return -EFAULT;
> +
> +	zinfo[count - 1] = 0;
> +
> +	rc = __order_protect(zinfo);
> +	if (rc)
> +		return rc;
> +
> +	return count;
> +}
> +
> +static int order_protect(char *s)
> +{
> +	int rc;
> +
> +	rc = __order_protect(s);
> +	if (rc)
> +		printk("Invalid order=%s rc=%d\n", s, rc);
> +
> +	return 1;
> +}
> +__setup("order=", order_protect);
> +
>  /*
>   * Output information about zones in @pgdat.
>   * All zones are printed regardless
>   * of whether they are populated or not: lowmem_reserve_ratio operates on the
> @@ -1672,6 +1794,7 @@ static const struct file_operations zone
>  	.read		= seq_read,
>  	.llseek		= seq_lseek,
>  	.release	= seq_release,
> +	.write		= zoneinfo_write,
>  };
>
>  enum writeback_stat_item {
> @@ -2016,7 +2139,7 @@ void __init init_mm_internals(void)
>  	proc_create("buddyinfo", 0444, NULL, &buddyinfo_file_operations);
>  	proc_create("pagetypeinfo", 0444, NULL, &pagetypeinfo_file_operations);
>  	proc_create("vmstat", 0444, NULL, &vmstat_file_operations);
> -	proc_create("zoneinfo", 0444, NULL, &zoneinfo_file_operations);
> +	proc_create("zoneinfo", 0644, NULL, &zoneinfo_file_operations);
>  #endif
>  }
>
> Index: linux/include/linux/gfp.h
> ===================================================================
> --- linux.orig/include/linux/gfp.h
> +++ linux/include/linux/gfp.h
> @@ -543,6 +543,7 @@ void drain_all_pages(struct zone *zone);
>  void drain_local_pages(struct zone *zone);
>
>  void page_alloc_init_late(void);
> +int set_page_order_min(int node, int order, unsigned min);
>
>  /*
>   * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 26 Feb 2018, Vlastimil Babka wrote:

> > echo "3=2000" >/proc/zoneinfo
>
> Huh, that's a rather weird interface to use. Writing to a general
> statistics/info file for such specific functionality? Please no.

Ok, let's create /proc/sys/kernel/orders?
Or put it into /sys/devices/system/node/nodeX/orders?

> > First performance tests in a virtual environment show
> > a hackbench improvement by 6% just by increasing
> > the page size used by the page allocator.
>
> That's IMHO a rather weak justification for introducing a new userspace
> API. What exactly has been set where? Could similar results be achieved
> by tuning the highatomic reserves and/or min_free_kbytes? I especially
> wonder how much of the effect comes from the associated watermark
> adjustment (which can be affected by min_free_kbytes) and what is due
> to the __rmqueue_smallest() changes. You changed the
> __rmqueue_smallest() condition since the RFC per Thomas' suggestion,
> but report the same results?

The highatomic reserves are for temporary allocations like jumbo
frames. The allocations here could be for numerous purposes.

The test demonstrates a performance gain from the use of higher-order
pages. It does not demonstrate long-term fragmentation results. For
that, different benchmarks would have to be used. Maybe I can find
something in Mel's tests to get that tested. Such a test would have to
verify that the system holds up despite large-order allocations. It
would not demonstrate a performance benefit. However, what we want is
the performance benefit throughout the operation of the system. So both
tests are required.

> Well, also not a fan of this patch, TBH. It's rather ad-hoc and not
> backed up with results. Aside from the above points, I agree with the
> objections of others to the RFC posting. It's also rather awkward that
> watermarks are increased per the reservations, but when the reservations
> are "consumed" (nr_free < min && current_order == order), the increased
> watermarks are untouched.
> IMHO this further enlarges the effects of
> the purely adjusted watermarks by this patch.

This is an RFC to see where we can go with this. I am looking for ways
to address the various shortcomings of this approach. There are other
approaches that have similar effects and that may be more desirable but
require more work (such as making dentries/inodes defragmentable via
migration or targeted reclaim).