[RFC] mm: Proactive compaction

Message ID 20190816214413.15006-1-nigupta@nvidia.com
State New

Commit Message

Nitin Gupta Aug. 16, 2019, 9:43 p.m. UTC
For some applications we need to allocate almost all memory as
hugepages. However, on a running system, higher-order allocations can
fail if memory is fragmented. The Linux kernel currently does
on-demand compaction as we request more hugepages, but this style of
compaction incurs very high latency. Experiments with one-time full
memory compaction (followed by hugepage allocations) show that the
kernel is able to restore a highly fragmented memory state to a fairly
compacted state in under 1 second on a 32G system. Such data suggests
that more proactive compaction can help us allocate a large fraction
of memory as hugepages while keeping allocation latencies low.

For a more proactive compaction, the approach taken here is to define
per page-order external fragmentation thresholds and let kcompactd
threads act on these thresholds.

The low and high thresholds are defined per page-order and exposed
through sysfs:

  /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}

The per-node kcompactd thread is woken up every few seconds to check if
any zone on its node has extfrag above the extfrag_high threshold for
any order, in which case the thread starts compaction in the background
until all zones are below the extfrag_low level for all orders. By default
both these thresholds are set to 100 for all orders, which essentially
disables kcompactd.

To avoid wasting CPU cycles when compaction cannot help, such as when
memory is full, we check both extfrag > extfrag_high and
compaction_suitable(zone). This allows the kcompactd thread to stay
inactive when extfrag is above the extfrag_high threshold but
compaction would not help.

This patch is largely based on ideas from Michal Hocko posted here:
https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/

Testing done (on x86):
 - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
 respectively.
 - Use a test program to fragment memory: the program allocates all memory
 and then for each 2M aligned section, frees 3/4 of base pages using
 munmap.
 - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
 compaction until extfrag < extfrag_low for order-9.

The patch has plenty of rough edges, but I'm posting it early to see if
I'm going in the right direction and to get some early feedback.

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
---
 include/linux/compaction.h |  12 ++
 mm/compaction.c            | 250 ++++++++++++++++++++++++++++++-------
 mm/vmstat.c                |  12 ++
 3 files changed, 228 insertions(+), 46 deletions(-)

Comments

Vlastimil Babka Aug. 20, 2019, 8:46 a.m. UTC | #1
+CC Khalid Aziz who proposed a different approach:
https://lore.kernel.org/linux-mm/20190813014012.30232-1-khalid.aziz@oracle.com/T/#u

On 8/16/19 11:43 PM, Nitin Gupta wrote:
> For some applications we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if memory is fragmented. The Linux kernel currently does
> on-demand compaction as we request more hugepages, but this style of
> compaction incurs very high latency. Experiments with one-time full
> memory compaction (followed by hugepage allocations) show that the
> kernel is able to restore a highly fragmented memory state to a fairly
> compacted state in under 1 second on a 32G system. Such data suggests
> that more proactive compaction can help us allocate a large fraction
> of memory as hugepages while keeping allocation latencies low.
> 
> For a more proactive compaction, the approach taken here is to define
> per page-order external fragmentation thresholds and let kcompactd
> threads act on these thresholds.
> 
> The low and high thresholds are defined per page-order and exposed
> through sysfs:
> 
>   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> 
> The per-node kcompactd thread is woken up every few seconds to check if
> any zone on its node has extfrag above the extfrag_high threshold for
> any order, in which case the thread starts compaction in the background
> until all zones are below the extfrag_low level for all orders. By default
> both these thresholds are set to 100 for all orders, which essentially
> disables kcompactd.

Could you define what exactly extfrag is, in the changelog?

> To avoid wasting CPU cycles when compaction cannot help, such as when
> memory is full, we check both extfrag > extfrag_high and
> compaction_suitable(zone). This allows the kcompactd thread to stay
> inactive when extfrag is above the extfrag_high threshold but
> compaction would not help.

How does it translate to e.g. the number of free pages of order?

> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> 
> Testing done (on x86):
>  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
>  respectively.
>  - Use a test program to fragment memory: the program allocates all memory
>  and then for each 2M aligned section, frees 3/4 of base pages using
>  munmap.
>  - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
>  compaction till extfrag < extfrag_low for order-9.
> 
> The patch has plenty of rough edges but posting it early to see if I'm
> going in the right direction and to get some early feedback.

That's a lot of control knobs - how is an admin supposed to tune them to their
needs?

(keeping the rest for reference)

> Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
> ---
>  include/linux/compaction.h |  12 ++
>  mm/compaction.c            | 250 ++++++++++++++++++++++++++++++-------
>  mm/vmstat.c                |  12 ++
>  3 files changed, 228 insertions(+), 46 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 9569e7c786d3..26bfedbbc64b 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -60,6 +60,17 @@ enum compact_result {
>  
>  struct alloc_context; /* in mm/internal.h */
>  
> +// "order-%d"
> +#define COMPACTION_ORDER_STATE_NAME_LEN 16
> +// Per-order compaction state
> +struct compaction_order_state {
> +	unsigned int order;
> +	unsigned int extfrag_low;
> +	unsigned int extfrag_high;
> +	unsigned int extfrag_curr;
> +	char name[COMPACTION_ORDER_STATE_NAME_LEN];
> +};
> +
>  /*
>   * Number of free order-0 pages that should be available above given watermark
>   * to make sure compaction has reasonable chance of not running out of free
> @@ -90,6 +101,7 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
>  extern int sysctl_extfrag_threshold;
>  extern int sysctl_compact_unevictable_allowed;
>  
> +extern int extfrag_for_order(struct zone *zone, unsigned int order);
>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>  extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
>  		unsigned int order, unsigned int alloc_flags,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 952dc2fb24e5..21866b1ad249 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -25,6 +25,10 @@
>  #include <linux/psi.h>
>  #include "internal.h"
>  
> +#ifdef CONFIG_COMPACTION
> +struct compaction_order_state compaction_order_states[MAX_ORDER+1];
> +#endif
> +
>  #ifdef CONFIG_COMPACTION
>  static inline void count_compact_event(enum vm_event_item item)
>  {
> @@ -1846,6 +1850,49 @@ static inline bool is_via_compact_memory(int order)
>  	return order == -1;
>  }
>  
> +static int extfrag_wmark_high(struct zone *zone)
> +{
> +	int order;
> +
> +	for (order = 1; order <= MAX_ORDER; order++) {
> +		int extfrag = extfrag_for_order(zone, order);
> +		int threshold = compaction_order_states[order].extfrag_high;
> +
> +		if (extfrag > threshold)
> +			return order;
> +	}
> +	return 0;
> +}
> +
> +static bool node_should_compact(pg_data_t *pgdat)
> +{
> +	struct zone *zone;
> +
> +	for_each_populated_zone(zone) {
> +		int order = extfrag_wmark_high(zone);
> +
> +		if (order && compaction_suitable(zone, order,
> +				0, zone_idx(zone)) == COMPACT_CONTINUE) {
> +			return true;
> +		}
> +	}
> +	return false;
> +}
> +
> +static int extfrag_wmark_low(struct zone *zone)
> +{
> +	int order;
> +
> +	for (order = 1; order <= MAX_ORDER; order++) {
> +		int extfrag = extfrag_for_order(zone, order);
> +		int threshold = compaction_order_states[order].extfrag_low;
> +
> +		if (extfrag > threshold)
> +			return order;
> +	}
> +	return 0;
> +}
> +
>  static enum compact_result __compact_finished(struct compact_control *cc)
>  {
>  	unsigned int order;
> @@ -1872,7 +1919,7 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>  			return COMPACT_PARTIAL_SKIPPED;
>  	}
>  
> -	if (is_via_compact_memory(cc->order))
> +	if (extfrag_wmark_low(cc->zone))
>  		return COMPACT_CONTINUE;
>  
>  	/*
> @@ -1962,18 +2009,6 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
>  {
>  	unsigned long watermark;
>  
> -	if (is_via_compact_memory(order))
> -		return COMPACT_CONTINUE;
> -
> -	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
> -	/*
> -	 * If watermarks for high-order allocation are already met, there
> -	 * should be no need for compaction at all.
> -	 */
> -	if (zone_watermark_ok(zone, order, watermark, classzone_idx,
> -								alloc_flags))
> -		return COMPACT_SUCCESS;
> -
>  	/*
>  	 * Watermarks for order-0 must be met for compaction to be able to
>  	 * isolate free pages for migration targets. This means that the
> @@ -2003,31 +2038,9 @@ enum compact_result compaction_suitable(struct zone *zone, int order,
>  					int classzone_idx)
>  {
>  	enum compact_result ret;
> -	int fragindex;
>  
>  	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
>  				    zone_page_state(zone, NR_FREE_PAGES));
> -	/*
> -	 * fragmentation index determines if allocation failures are due to
> -	 * low memory or external fragmentation
> -	 *
> -	 * index of -1000 would imply allocations might succeed depending on
> -	 * watermarks, but we already failed the high-order watermark check
> -	 * index towards 0 implies failure is due to lack of memory
> -	 * index towards 1000 implies failure is due to fragmentation
> -	 *
> -	 * Only compact if a failure would be due to fragmentation. Also
> -	 * ignore fragindex for non-costly orders where the alternative to
> -	 * a successful reclaim/compaction is OOM. Fragindex and the
> -	 * vm.extfrag_threshold sysctl is meant as a heuristic to prevent
> -	 * excessive compaction for costly orders, but it should not be at the
> -	 * expense of system stability.
> -	 */
> -	if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
> -		fragindex = fragmentation_index(zone, order);
> -		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
> -			ret = COMPACT_NOT_SUITABLE_ZONE;
> -	}
>  
>  	trace_mm_compaction_suitable(zone, order, ret);
>  	if (ret == COMPACT_NOT_SUITABLE_ZONE)
> @@ -2416,7 +2429,6 @@ static void compact_node(int nid)
>  		.gfp_mask = GFP_KERNEL,
>  	};
>  
> -
>  	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
>  
>  		zone = &pgdat->node_zones[zoneid];
> @@ -2493,9 +2505,149 @@ void compaction_unregister_node(struct node *node)
>  }
>  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
>  
> +#ifdef CONFIG_SYSFS
> +
> +#define COMPACTION_ATTR_RO(_name) \
> +	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> +
> +#define COMPACTION_ATTR(_name) \
> +	static struct kobj_attribute _name##_attr = \
> +		__ATTR(_name, 0644, _name##_show, _name##_store)
> +
> +static struct kobject *compaction_kobj;
> +static struct kobject *compaction_order_kobjs[MAX_ORDER + 1];
> +
> +static struct compaction_order_state *kobj_to_compaction_order_state(
> +						struct kobject *kobj)
> +{
> +	int i;
> +
> +	for (i = 1; i <= MAX_ORDER; i++) {
> +		if (compaction_order_kobjs[i] == kobj)
> +			return &compaction_order_states[i];
> +	}
> +
> +	return NULL;
> +}
> +
> +static ssize_t extfrag_store_common(bool is_low, struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	int err;
> +	unsigned long input;
> +	struct compaction_order_state *c = kobj_to_compaction_order_state(kobj);
> +
> +	err = kstrtoul(buf, 10, &input);
> +	if (err)
> +		return err;
> +	if (input > 100)
> +		return -EINVAL;
> +
> +	if (is_low)
> +		c->extfrag_low = input;
> +	else
> +		c->extfrag_high = input;
> +
> +	return count;
> +}
> +
> +static ssize_t extfrag_low_show(struct kobject *kobj,
> +		struct kobj_attribute *attr, char *buf)
> +{
> +	struct compaction_order_state *c = kobj_to_compaction_order_state(kobj);
> +
> +	return sprintf(buf, "%u\n", c->extfrag_low);
> +}
> +
> +static ssize_t extfrag_low_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	return extfrag_store_common(true, kobj, attr, buf, count);
> +}
> +COMPACTION_ATTR(extfrag_low);
> +
> +static ssize_t extfrag_high_show(struct kobject *kobj,
> +					struct kobj_attribute *attr, char *buf)
> +{
> +	struct compaction_order_state *c = kobj_to_compaction_order_state(kobj);
> +
> +	return sprintf(buf, "%u\n", c->extfrag_high);
> +}
> +
> +static ssize_t extfrag_high_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	return extfrag_store_common(false, kobj, attr, buf, count);
> +}
> +COMPACTION_ATTR(extfrag_high);
> +
> +static struct attribute *compaction_order_attrs[] = {
> +	&extfrag_low_attr.attr,
> +	&extfrag_high_attr.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group compaction_order_attr_group = {
> +	.attrs = compaction_order_attrs,
> +};
> +
> +static int compaction_sysfs_add_order(struct compaction_order_state *c,
> +	struct kobject *parent, struct kobject **compaction_order_kobjs,
> +	const struct attribute_group *compaction_order_attr_group)
> +{
> +	int retval;
> +
> +	compaction_order_kobjs[c->order] =
> +			kobject_create_and_add(c->name, parent);
> +	if (!compaction_order_kobjs[c->order])
> +		return -ENOMEM;
> +
> +	retval = sysfs_create_group(compaction_order_kobjs[c->order],
> +				compaction_order_attr_group);
> +	if (retval)
> +		kobject_put(compaction_order_kobjs[c->order]);
> +
> +	return retval;
> +}
> +
> +static void __init compaction_sysfs_init(void)
> +{
> +	struct compaction_order_state *c;
> +	int i, err;
> +
> +	compaction_kobj = kobject_create_and_add("compaction", mm_kobj);
> +	if (!compaction_kobj)
> +		return;
> +
> +	for (i = 1; i <= MAX_ORDER; i++) {
> +		c = &compaction_order_states[i];
> +		err = compaction_sysfs_add_order(c, compaction_kobj,
> +					compaction_order_kobjs,
> +					&compaction_order_attr_group);
> +		if (err)
> +			pr_err("compaction: Unable to add state %s", c->name);
> +	}
> +}
> +
> +static void __init compaction_init_order_states(void)
> +{
> +	int i;
> +
> +	for (i = 0; i <= MAX_ORDER; i++) {
> +		struct compaction_order_state *c = &compaction_order_states[i];
> +
> +		c->order = i;
> +		c->extfrag_low = 100;
> +		c->extfrag_high = 100;
> +		snprintf(c->name, COMPACTION_ORDER_STATE_NAME_LEN,
> +						"order-%d", i);
> +	}
> +}
> +#endif
> +
>  static inline bool kcompactd_work_requested(pg_data_t *pgdat)
>  {
> -	return pgdat->kcompactd_max_order > 0 || kthread_should_stop();
> +	return kthread_should_stop() || node_should_compact(pgdat);
>  }
>  
>  static bool kcompactd_node_suitable(pg_data_t *pgdat)
> @@ -2527,15 +2679,16 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>  	int zoneid;
>  	struct zone *zone;
>  	struct compact_control cc = {
> -		.order = pgdat->kcompactd_max_order,
> -		.search_order = pgdat->kcompactd_max_order,
> +		.order = -1,
>  		.total_migrate_scanned = 0,
>  		.total_free_scanned = 0,
> -		.classzone_idx = pgdat->kcompactd_classzone_idx,
> -		.mode = MIGRATE_SYNC_LIGHT,
> -		.ignore_skip_hint = false,
> +		.mode = MIGRATE_SYNC,
> +		.ignore_skip_hint = true,
> +		.whole_zone = false,
>  		.gfp_mask = GFP_KERNEL,
> +		.classzone_idx = MAX_NR_ZONES - 1,
>  	};
> +
>  	trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
>  							cc.classzone_idx);
>  	count_compact_event(KCOMPACTD_WAKE);
> @@ -2565,7 +2718,6 @@ static void kcompactd_do_work(pg_data_t *pgdat)
>  		if (kthread_should_stop())
>  			return;
>  		status = compact_zone(&cc, NULL);
> -
>  		if (status == COMPACT_SUCCESS) {
>  			compaction_defer_reset(zone, cc.order, false);
>  		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
> @@ -2650,11 +2802,14 @@ static int kcompactd(void *p)
>  	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
>  
>  	while (!kthread_should_stop()) {
> -		unsigned long pflags;
> +		unsigned long ret, pflags;
>  
>  		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
> -		wait_event_freezable(pgdat->kcompactd_wait,
> -				kcompactd_work_requested(pgdat));
> +		ret = wait_event_freezable_timeout(pgdat->kcompactd_wait,
> +				kcompactd_work_requested(pgdat),
> +				msecs_to_jiffies(5000));
> +		if (!ret)
> +			continue;
>  
>  		psi_memstall_enter(&pflags);
>  		kcompactd_do_work(pgdat);
> @@ -2735,6 +2890,9 @@ static int __init kcompactd_init(void)
>  		return ret;
>  	}
>  
> +	compaction_init_order_states();
> +	compaction_sysfs_init();
> +
>  	for_each_node_state(nid, N_MEMORY)
>  		kcompactd_run(nid);
>  	return 0;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index fd7e16ca6996..e9090a5595d1 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1074,6 +1074,18 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in
>  	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
>  }
>  
> +int extfrag_for_order(struct zone *zone, unsigned int order)
> +{
> +	struct contig_page_info info;
> +
> +	fill_contig_page_info(zone, order, &info);
> +	if (info.free_pages == 0)
> +		return 0;
> +
> +	return (info.free_pages - (info.free_blocks_suitable << order)) * 100
> +							/ info.free_pages;
> +}
> +
>  /* Same as __fragmentation index but allocs contig_page_info on stack */
>  int fragmentation_index(struct zone *zone, unsigned int order)
>  {
>
Nitin Gupta Aug. 20, 2019, 9:35 p.m. UTC | #2
> -----Original Message-----
> From: Vlastimil Babka <vbabka@suse.cz>
> Sent: Tuesday, August 20, 2019 1:46 AM
> To: Nitin Gupta <nigupta@nvidia.com>; akpm@linux-foundation.org;
> mgorman@techsingularity.net; mhocko@suse.com;
> dan.j.williams@intel.com
> Cc: Yu Zhao <yuzhao@google.com>; Matthew Wilcox <willy@infradead.org>;
> Qian Cai <cai@lca.pw>; Andrey Ryabinin <aryabinin@virtuozzo.com>; Roman
> Gushchin <guro@fb.com>; Greg Kroah-Hartman
> <gregkh@linuxfoundation.org>; Kees Cook <keescook@chromium.org>; Jann
> Horn <jannh@google.com>; Johannes Weiner <hannes@cmpxchg.org>; Arun
> KS <arunks@codeaurora.org>; Janne Huttunen
> <janne.huttunen@nokia.com>; Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; Khalid Aziz <khalid.aziz@oracle.com>
> Subject: Re: [RFC] mm: Proactive compaction
> 
> +CC Khalid Aziz who proposed a different approach:
> https://lore.kernel.org/linux-mm/20190813014012.30232-1-khalid.aziz@oracle.com/T/#u
> 
> On 8/16/19 11:43 PM, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher-order allocations can
> > fail if memory is fragmented. The Linux kernel currently does
> > on-demand compaction as we request more hugepages, but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) show that the
> > kernel is able to restore a highly fragmented memory state to a fairly
> > compacted state in under 1 second on a 32G system. Such data suggests
> > that more proactive compaction can help us allocate a large fraction
> > of memory as hugepages while keeping allocation latencies low.
> >
> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
> > The per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the background
> > until all zones are below the extfrag_low level for all orders. By default
> > both these thresholds are set to 100 for all orders, which essentially
> > disables kcompactd.
> 
> Could you define what exactly extfrag is, in the changelog?
> 

extfrag for order-n = ((total free pages) - (free pages in blocks of order >= n)) * 100 / (total free pages)

I will add this to v2 changelog.


> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both extfrag > extfrag_high and
> > compaction_suitable(zone). This allows the kcompactd thread to stay
> > inactive when extfrag is above the extfrag_high threshold but
> > compaction would not help.
> 
> How does it translate to e.g. the number of free pages of order?
> 

Watermarks are checked as follows (see: __compaction_suitable)

	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
				low_wmark_pages(zone) : min_wmark_pages(zone);

If a zone does not satisfy this watermark, we don't start compaction.

> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> >
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >  respectively.
> >  - Use a test program to fragment memory: the program allocates all memory
> >  and then for each 2M aligned section, frees 3/4 of base pages using
> >  munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> >  compaction until extfrag < extfrag_low for order-9.
> >
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> 
> That's a lot of control knobs - how is an admin supposed to tune them to
> their needs?


I expect that a workload would typically care about just one particular page
order (say, order-9 on x86 for the default hugepage size). An admin can set
extfrag_{low,high} for just that order (say, low=25, high=30) and leave the
thresholds at their default values (low=100, high=100) for all other orders.

Thanks,
Nitin


> 
> (keeping the rest for reference)
> 
> > Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
> > ---
> >  include/linux/compaction.h |  12 ++
> >  mm/compaction.c            | 250 ++++++++++++++++++++++++++++++-------
> >  mm/vmstat.c                |  12 ++
> >  3 files changed, 228 insertions(+), 46 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 9569e7c786d3..26bfedbbc64b 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -60,6 +60,17 @@ enum compact_result {
> >
> >  struct alloc_context; /* in mm/internal.h */
> >
> > +// "order-%d"
> > +#define COMPACTION_ORDER_STATE_NAME_LEN 16 // Per-order
> compaction
> > +state struct compaction_order_state {
> > +	unsigned int order;
> > +	unsigned int extfrag_low;
> > +	unsigned int extfrag_high;
> > +	unsigned int extfrag_curr;
> > +	char name[COMPACTION_ORDER_STATE_NAME_LEN];
> > +};
> > +
> >  /*
> >   * Number of free order-0 pages that should be available above given
> watermark
> >   * to make sure compaction has reasonable chance of not running out
> > of free @@ -90,6 +101,7 @@ extern int sysctl_compaction_handler(struct
> > ctl_table *table, int write,  extern int sysctl_extfrag_threshold;
> > extern int sysctl_compact_unevictable_allowed;
> >
> > +extern int extfrag_for_order(struct zone *zone, unsigned int order);
> >  extern int fragmentation_index(struct zone *zone, unsigned int
> > order);  extern enum compact_result try_to_compact_pages(gfp_t
> gfp_mask,
> >  		unsigned int order, unsigned int alloc_flags, diff --git
> > a/mm/compaction.c b/mm/compaction.c index
> 952dc2fb24e5..21866b1ad249
> > 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -25,6 +25,10 @@
> >  #include <linux/psi.h>
> >  #include "internal.h"
> >
> > +#ifdef CONFIG_COMPACTION
> > +struct compaction_order_state
> compaction_order_states[MAX_ORDER+1];
> > +#endif
> > +
> >  #ifdef CONFIG_COMPACTION
> >  static inline void count_compact_event(enum vm_event_item item)  {
> @@
> > -1846,6 +1850,49 @@ static inline bool is_via_compact_memory(int order)
> >  	return order == -1;
> >  }
> >
> > +static int extfrag_wmark_high(struct zone *zone) {
> > +	int order;
> > +
> > +	for (order = 1; order <= MAX_ORDER; order++) {
> > +		int extfrag = extfrag_for_order(zone, order);
> > +		int threshold =
> compaction_order_states[order].extfrag_high;
> > +
> > +		if (extfrag > threshold)
> > +			return order;
> > +	}
> > +	return 0;
> > +}
> > +
> > +static bool node_should_compact(pg_data_t *pgdat) {
> > +	struct zone *zone;
> > +
> > +	for_each_populated_zone(zone) {
> > +		int order = extfrag_wmark_high(zone);
> > +
> > +		if (order && compaction_suitable(zone, order,
> > +				0, zone_idx(zone)) == COMPACT_CONTINUE)
> {
> > +			return true;
> > +		}
> > +	}
> > +	return false;
> > +}
> > +
> > +static int extfrag_wmark_low(struct zone *zone) {
> > +	int order;
> > +
> > +	for (order = 1; order <= MAX_ORDER; order++) {
> > +		int extfrag = extfrag_for_order(zone, order);
> > +		int threshold = compaction_order_states[order].extfrag_low;
> > +
> > +		if (extfrag > threshold)
> > +			return order;
> > +	}
> > +	return 0;
> > +}
> > +
> >  static enum compact_result __compact_finished(struct compact_control
> > *cc)  {
> >  	unsigned int order;
> > @@ -1872,7 +1919,7 @@ static enum compact_result
> __compact_finished(struct compact_control *cc)
> >  			return COMPACT_PARTIAL_SKIPPED;
> >  	}
> >
> > -	if (is_via_compact_memory(cc->order))
> > +	if (extfrag_wmark_low(cc->zone))
> >  		return COMPACT_CONTINUE;
> >
> >  	/*
> > @@ -1962,18 +2009,6 @@ static enum compact_result
> > __compaction_suitable(struct zone *zone, int order,  {
> >  	unsigned long watermark;
> >
> > -	if (is_via_compact_memory(order))
> > -		return COMPACT_CONTINUE;
> > -
> > -	watermark = wmark_pages(zone, alloc_flags &
> ALLOC_WMARK_MASK);
> > -	/*
> > -	 * If watermarks for high-order allocation are already met, there
> > -	 * should be no need for compaction at all.
> > -	 */
> > -	if (zone_watermark_ok(zone, order, watermark, classzone_idx,
> > -								alloc_flags))
> > -		return COMPACT_SUCCESS;
> > -
> >  	/*
> >  	 * Watermarks for order-0 must be met for compaction to be able to
> >  	 * isolate free pages for migration targets. This means that the @@
> > -2003,31 +2038,9 @@ enum compact_result compaction_suitable(struct
> zone *zone, int order,
> >  					int classzone_idx)
> >  {
> >  	enum compact_result ret;
> > -	int fragindex;
> >
> >  	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
> >  				    zone_page_state(zone, NR_FREE_PAGES));
> > -	/*
> > -	 * fragmentation index determines if allocation failures are due to
> > -	 * low memory or external fragmentation
> > -	 *
> > -	 * index of -1000 would imply allocations might succeed depending
> on
> > -	 * watermarks, but we already failed the high-order watermark check
> > -	 * index towards 0 implies failure is due to lack of memory
> > -	 * index towards 1000 implies failure is due to fragmentation
> > -	 *
> > -	 * Only compact if a failure would be due to fragmentation. Also
> > -	 * ignore fragindex for non-costly orders where the alternative to
> > -	 * a successful reclaim/compaction is OOM. Fragindex and the
> > -	 * vm.extfrag_threshold sysctl is meant as a heuristic to prevent
> > -	 * excessive compaction for costly orders, but it should not be at the
> > -	 * expense of system stability.
> > -	 */
> > -	if (ret == COMPACT_CONTINUE && (order >
> PAGE_ALLOC_COSTLY_ORDER)) {
> > -		fragindex = fragmentation_index(zone, order);
> > -		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
> > -			ret = COMPACT_NOT_SUITABLE_ZONE;
> > -	}
> >
> >  	trace_mm_compaction_suitable(zone, order, ret);
> >  	if (ret == COMPACT_NOT_SUITABLE_ZONE) @@ -2416,7 +2429,6
> @@ static
> > void compact_node(int nid)
> >  		.gfp_mask = GFP_KERNEL,
> >  	};
> >
> > -
> >  	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> >
> >  		zone = &pgdat->node_zones[zoneid];
> > @@ -2493,9 +2505,149 @@ void compaction_unregister_node(struct
> node
> > *node)  }  #endif /* CONFIG_SYSFS && CONFIG_NUMA */
> >
> > +#ifdef CONFIG_SYSFS
> > +
> > +#define COMPACTION_ATTR_RO(_name) \
> > +	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
> > +
> > +#define COMPACTION_ATTR(_name) \
> > +	static struct kobj_attribute _name##_attr = \
> > +		__ATTR(_name, 0644, _name##_show, _name##_store)
> > +
> > +static struct kobject *compaction_kobj; static struct kobject
> > +*compaction_order_kobjs[MAX_ORDER];
> > +
> > +static struct compaction_order_state *kobj_to_compaction_order_state(
> > +						struct kobject *kobj)
> > +{
> > +	int i;
> > +
> > +	for (i = 1; i <= MAX_ORDER; i++) {
> > +		if (compaction_order_kobjs[i] == kobj)
> > +			return &compaction_order_states[i];
> > +	}
> > +
> > +	return NULL;
> > +}
> > +
> > +static ssize_t extfrag_store_common(bool is_low, struct kobject *kobj,
> > +		struct kobj_attribute *attr, const char *buf, size_t count) {
> > +	int err;
> > +	unsigned long input;
> > +	struct compaction_order_state *c =
> > +kobj_to_compaction_order_state(kobj);
> > +
> > +	err = kstrtoul(buf, 10, &input);
> > +	if (err)
> > +		return err;
> > +	if (input > 100)
> > +		return -EINVAL;
> > +
> > +	if (is_low)
> > +		c->extfrag_low = input;
> > +	else
> > +		c->extfrag_high = input;
> > +
> > +	return count;
> > +}
> > +
> > +static ssize_t extfrag_low_show(struct kobject *kobj,
> > +		struct kobj_attribute *attr, char *buf) {
> > +	struct compaction_order_state *c =
> > +kobj_to_compaction_order_state(kobj);
> > +
> > +	return sprintf(buf, "%u\n", c->extfrag_low); }
> > +
> > +static ssize_t extfrag_low_store(struct kobject *kobj,
> > +		struct kobj_attribute *attr, const char *buf, size_t count) {
> > +	return extfrag_store_common(true, kobj, attr, buf, count); }
> > +COMPACTION_ATTR(extfrag_low);
> > +
> > +static ssize_t extfrag_high_show(struct kobject *kobj,
> > +					struct kobj_attribute *attr, char *buf)
> {
> > +	struct compaction_order_state *c =
> > +kobj_to_compaction_order_state(kobj);
> > +
> > +	return sprintf(buf, "%u\n", c->extfrag_high); }
> > +
> > +static ssize_t extfrag_high_store(struct kobject *kobj,
> > +		struct kobj_attribute *attr, const char *buf, size_t count) {
> > +	return extfrag_store_common(false, kobj, attr, buf, count); }
> > +COMPACTION_ATTR(extfrag_high);
> > +
> > +static struct attribute *compaction_order_attrs[] = {
> > +	&extfrag_low_attr.attr,
> > +	&extfrag_high_attr.attr,
> > +	NULL,
> > +};
> > +
> > +static const struct attribute_group compaction_order_attr_group = {
> > +	.attrs = compaction_order_attrs,
> > +};
> > +
> > +static int compaction_sysfs_add_order(struct compaction_order_state *c,
> > +	struct kobject *parent, struct kobject **compaction_order_kobjs,
> > +	const struct attribute_group *compaction_order_attr_group) {
> > +	int retval;
> > +
> > +	compaction_order_kobjs[c->order] =
> > +			kobject_create_and_add(c->name, parent);
> > +	if (!compaction_order_kobjs[c->order])
> > +		return -ENOMEM;
> > +
> > +	retval = sysfs_create_group(compaction_order_kobjs[c->order],
> > +				compaction_order_attr_group);
> > +	if (retval)
> > +		kobject_put(compaction_order_kobjs[c->order]);
> > +
> > +	return retval;
> > +}
> > +
> > +static void __init compaction_sysfs_init(void)
> > +{
> > +	struct compaction_order_state *c;
> > +	int i, err;
> > +
> > +	compaction_kobj = kobject_create_and_add("compaction", mm_kobj);
> > +	if (!compaction_kobj)
> > +		return;
> > +
> > +	for (i = 1; i <= MAX_ORDER; i++) {
> > +		c = &compaction_order_states[i];
> > +		err = compaction_sysfs_add_order(c, compaction_kobj,
> > +					compaction_order_kobjs,
> > +					&compaction_order_attr_group);
> > +		if (err)
> > +			pr_err("compaction: Unable to add state %s", c->name);
> > +	}
> > +}
> > +
> > +static void __init compaction_init_order_states(void)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i <= MAX_ORDER; i++) {
> > +		struct compaction_order_state *c = &compaction_order_states[i];
> > +
> > +		c->order = i;
> > +		c->extfrag_low = 100;
> > +		c->extfrag_high = 100;
> > +		snprintf(c->name, COMPACTION_ORDER_STATE_NAME_LEN,
> > +						"order-%d", i);
> > +	}
> > +}
> > +#endif
> > +
> >  static inline bool kcompactd_work_requested(pg_data_t *pgdat)
> >  {
> > -	return pgdat->kcompactd_max_order > 0 || kthread_should_stop();
> > +	return kthread_should_stop() || node_should_compact(pgdat);
> >  }
> >
> >  static bool kcompactd_node_suitable(pg_data_t *pgdat)
> > @@ -2527,15 +2679,16 @@ static void kcompactd_do_work(pg_data_t *pgdat)
> >  	int zoneid;
> >  	struct zone *zone;
> >  	struct compact_control cc = {
> > -		.order = pgdat->kcompactd_max_order,
> > -		.search_order = pgdat->kcompactd_max_order,
> > +		.order = -1,
> >  		.total_migrate_scanned = 0,
> >  		.total_free_scanned = 0,
> > -		.classzone_idx = pgdat->kcompactd_classzone_idx,
> > -		.mode = MIGRATE_SYNC_LIGHT,
> > -		.ignore_skip_hint = false,
> > +		.mode = MIGRATE_SYNC,
> > +		.ignore_skip_hint = true,
> > +		.whole_zone = false,
> >  		.gfp_mask = GFP_KERNEL,
> > +		.classzone_idx = MAX_NR_ZONES - 1,
> >  	};
> > +
> >  	trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
> >  							cc.classzone_idx);
> >  	count_compact_event(KCOMPACTD_WAKE);
> > @@ -2565,7 +2718,6 @@ static void kcompactd_do_work(pg_data_t *pgdat)
> >  		if (kthread_should_stop())
> >  			return;
> >  		status = compact_zone(&cc, NULL);
> > -
> >  		if (status == COMPACT_SUCCESS) {
> >  			compaction_defer_reset(zone, cc.order, false);
> >  		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
> > @@ -2650,11 +2802,14 @@ static int kcompactd(void *p)
> >  	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
> >
> >  	while (!kthread_should_stop()) {
> > -		unsigned long pflags;
> > +		unsigned long ret, pflags;
> >
> >  		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
> > -		wait_event_freezable(pgdat->kcompactd_wait,
> > -				kcompactd_work_requested(pgdat));
> > +		ret = wait_event_freezable_timeout(pgdat->kcompactd_wait,
> > +				kcompactd_work_requested(pgdat),
> > +				msecs_to_jiffies(5000));
> > +		if (!ret)
> > +			continue;
> >
> >  		psi_memstall_enter(&pflags);
> >  		kcompactd_do_work(pgdat);
> > @@ -2735,6 +2890,9 @@ static int __init kcompactd_init(void)
> >  		return ret;
> >  	}
> >
> > +	compaction_init_order_states();
> > +	compaction_sysfs_init();
> > +
> >  	for_each_node_state(nid, N_MEMORY)
> >  		kcompactd_run(nid);
> >  	return 0;
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index fd7e16ca6996..e9090a5595d1 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1074,6 +1074,18 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in
> >  	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
> >  }
> >
> > +int extfrag_for_order(struct zone *zone, unsigned int order)
> > +{
> > +	struct contig_page_info info;
> > +
> > +	fill_contig_page_info(zone, order, &info);
> > +	if (info.free_pages == 0)
> > +		return 0;
> > +
> > +	return (info.free_pages - (info.free_blocks_suitable << order)) * 100
> > +							/ info.free_pages;
> > +}
> > +
> >  /* Same as __fragmentation index but allocs contig_page_info on stack */
> >  int fragmentation_index(struct zone *zone, unsigned int order)
> >  {
> >
Matthew Wilcox Aug. 20, 2019, 10:20 p.m. UTC | #3
On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote:
> Testing done (on x86):
>  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
>  respectively.
>  - Use a test program to fragment memory: the program allocates all memory
>  and then for each 2M aligned section, frees 3/4 of base pages using
>  munmap.
>  - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
>  compaction till extfrag < extfrag_low for order-9.

Your test program is a good idea, but I worry it may produce
unrealistically optimistic outcomes.  Page cache is readily reclaimable,
so you're setting up a situation where 2MB pages can once again be
produced.

How about this:

One program which creates a file several times the size of memory (or
several files which total the same amount).  Then read the file(s).  Maybe
by mmap(), and just do nice easy sequential accesses.

A second program which causes slab allocations.  eg

#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	int i, n = 100;

	for (;;) {
		for (i = 0; i < n * 1000 * 1000; i++) {
			char fname[64];

			sprintf(fname, "/tmp/missing.%d", i);
			/* no O_CREAT: each failed lookup still leaves a negative dentry */
			open(fname, O_RDWR);
		}
	}
}

The first program should thrash the pagecache, causing pages to
continuously be allocated, reclaimed and freed.  The second will create
millions of dentries, causing the slab allocator to allocate a lot of
order-0 pages which are harder to free.  If you really want to make it
work hard, mix in opening some files which actually exist, preventing
the pages which contain those dentries from being evicted.

This feels like it's simulating a more normal workload than your test.
What do you think?
Nitin Gupta Aug. 21, 2019, 11:23 p.m. UTC | #4
> -----Original Message-----
> From: owner-linux-mm@kvack.org <owner-linux-mm@kvack.org> On Behalf Of Matthew Wilcox
> Sent: Tuesday, August 20, 2019 3:21 PM
> To: Nitin Gupta <nigupta@nvidia.com>
> Subject: Re: [RFC] mm: Proactive compaction
> 
> On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote:
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> > respectively.
> >  - Use a test program to fragment memory: the program allocates all
> > memory  and then for each 2M aligned section, frees 3/4 of base pages
> > using  munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> > starts  compaction till extfrag < extfrag_low for order-9.
> 
> Your test program is a good idea, but I worry it may produce unrealistically
> optimistic outcomes.  Page cache is readily reclaimable, so you're setting up
> a situation where 2MB pages can once again be produced.
> 
> How about this:
> 
> One program which creates a file several times the size of memory (or
> several files which total the same amount).  Then read the file(s).  Maybe by
> mmap(), and just do nice easy sequential accesses.
> 
> A second program which causes slab allocations.  eg
> 
> for (;;) {
> 	for (i = 0; i < n * 1000 * 1000; i++) {
> 		char fname[64];
> 
> 		sprintf(fname, "/tmp/missing.%d", i);
> 		open(fname, O_RDWR);
> 	}
> }
> 
> The first program should thrash the pagecache, causing pages to
> continuously be allocated, reclaimed and freed.  The second will create
> millions of dentries, causing the slab allocator to allocate a lot of
> order-0 pages which are harder to free.  If you really want to make it work
> hard, mix in opening some files which actually exist, preventing the pages
> which contain those dentries from being evicted.
> 
> This feels like it's simulating a more normal workload than your test.
> What do you think?

This combination of workloads for mixing movable and unmovable
pages sounds good.   I coded up these two and here's what I observed:

- kernel: 5.3.0-rc5 + this patch, x86_64, 32G RAM.
- Set extfrag_{low,high} = {25,30} for order-9
- Run pagecache and dentry thrash test programs as you described
    - for pagecache test: mmap and sequentially read 128G file on a 32G system.
    - for dentry test: set n=100. I created /tmp/missing.[0-10000] so these dentries stay allocated.
- Start linux kernel compile for further pagecache thrashing.

With the above workload, fragmentation for order-9 stayed at 80-90%, which
kept kcompactd0 working, but it couldn't make progress due to unmovable
pages from dentries.  As expected, we kept hitting compaction_deferred() as
compaction attempts failed.

After a manual `echo 3 > /proc/sys/vm/drop_caches` and stopping the dentry
thrasher, kcompactd succeeded in bringing extfrag below the set thresholds.


With unmovable pages spread across memory, there is little that compaction
can do. Maybe we should have a knob like 'compactness' (like swappiness) which
defines how aggressive compaction can be. For high values, maybe allow
freeing dentries too? This way hugepage-sensitive applications can trade
higher I/O latencies for better hugepage availability.

Thanks,
Nitin
Mel Gorman Aug. 22, 2019, 8:51 a.m. UTC | #5
On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote:
> For some applications we need to allocate almost all memory as
> hugepages. However, on a running system, higher order allocations can
> fail if the memory is fragmented. Linux kernel currently does
> on-demand compaction as we request more hugepages but this style of
> compaction incurs very high latency. Experiments with one-time full
> memory compaction (followed by hugepage allocations) shows that kernel
> is able to restore a highly fragmented memory state to a fairly
> compacted memory state within <1 sec for a 32G system. Such data
> suggests that a more proactive compaction can help us allocate a large
> fraction of memory as hugepages keeping allocation latencies low.
> 

Note that proactive compaction may reduce allocation latency but it is not
free either. Even though the scanning and migration may happen in a kernel
thread, tasks can incur faults while waiting for compaction to complete if
the task accesses data being migrated. This means that costs are incurred
by applications on a system that may never care about high-order allocation
latency -- particularly if the allocations typically happen at application
initialisation time.  I recognise that kcompactd makes a bit of effort to
compact memory out-of-band but it also is typically triggered in response
to reclaim that was triggered by a high-order allocation request. i.e. the
work done by the thread is triggered by an allocation request that hit
the slow paths and not a preemptive measure.

> For a more proactive compaction, the approach taken here is to define
> per page-order external fragmentation thresholds and let kcompactd
> threads act on these thresholds.
> 
> The low and high thresholds are defined per page-order and exposed
> through sysfs:
> 
>   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> 

These will be difficult for an admin to tune that is not extremely
familiar with how external fragmentation is defined. If an admin asked
"how much will stalls be reduced by setting this to a different value?",
the answer will always be "I don't know, maybe some, maybe not".

> Per-node kcompactd thread is woken up every few seconds to check if
> any zone on its node has extfrag above the extfrag_high threshold for
> any order, in which case the thread starts compaction in the backgrond
> till all zones are below extfrag_low level for all orders. By default
> both these thresolds are set to 100 for all orders which essentially
> disables kcompactd.
> 
> To avoid wasting CPU cycles when compaction cannot help, such as when
> memory is full, we check both, extfrag > extfrag_high and
> compaction_suitable(zone). This allows kcomapctd thread to stays inactive
> even if extfrag thresholds are not met.
> 

There is still a risk that if a system is completely fragmented that it
may consume CPU on pointless compaction cycles. This is why compaction
from kernel thread context makes no special effort and bails relatively
quickly and assumes that if an application really needs high-order pages
that it'll incur the cost at allocation time. 

> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> 
> Testing done (on x86):
>  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
>  respectively.
>  - Use a test program to fragment memory: the program allocates all memory
>  and then for each 2M aligned section, frees 3/4 of base pages using
>  munmap.
>  - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
>  compaction till extfrag < extfrag_low for order-9.
> 

This is a somewhat optimistic allocation scenario. The interesting
ones are when a system is fragmented in a manner that is not trivial to
resolve -- e.g. after a prolonged period of time with unmovable/reclaimable
allocations stealing pageblocks. It's also fairly difficult to analyse
if this is helping because you cannot measure after the fact how much
time was saved in allocation time due to the work done by kcompactd. It
is also hard to determine if the sum of the stalls incurred by proactive
compaction is lower than the time saved at allocation time.

I fear that the user-visible effect will be times when there are very
short but numerous stalls due to proactive compaction running in the
background that will be hard to detect while the benefits may be invisible.

> The patch has plenty of rough edges but posting it early to see if I'm
> going in the right direction and to get some early feedback.
> 

As unappealing as it sounds, I think it is better to try improve the
allocation latency itself instead of trying to hide the cost in a kernel
thread. It's far harder to implement as compaction is not easy but it
would be more obvious what the savings are by looking at a histogram of
allocation latencies -- there are other metrics that could be considered
but that's the obvious one.
Nitin Gupta Aug. 22, 2019, 9:57 p.m. UTC | #6
> -----Original Message-----
> From: owner-linux-mm@kvack.org <owner-linux-mm@kvack.org> On Behalf Of Mel Gorman
> Sent: Thursday, August 22, 2019 1:52 AM
> To: Nitin Gupta <nigupta@nvidia.com>
> Subject: Re: [RFC] mm: Proactive compaction
> 
> On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> >
> 
> Note that proactive compaction may reduce allocation latency but it is not
> free either. Even though the scanning and migration may happen in a kernel
> thread, tasks can incur faults while waiting for compaction to complete if the
> task accesses data being migrated. This means that costs are incurred by
> applications on a system that may never care about high-order allocation
> latency -- particularly if the allocations typically happen at application
> initialisation time.  I recognise that kcompactd makes a bit of effort to
> compact memory out-of-band but it also is typically triggered in response to
> reclaim that was triggered by a high-order allocation request. i.e. the work
> done by the thread is triggered by an allocation request that hit the slow
> paths and not a preemptive measure.
> 

Hitting the slow path for every higher-order allocation is a significant
performance/latency issue for applications that require a large number of
these allocations to succeed in bursts. To get some concrete numbers, I
made a small driver that allocates as many hugepages as possible and
measures allocation latency:

The driver first tries to allocate hugepage using GFP_TRANSHUGE_LIGHT
(referred to as "Light" in the table below) and if that fails, tries to
allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
"Fallback" in table below). We stop the allocation loop if both methods
fail.

Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All latencies
are in microsec.

| GFP/Stat |        Any |   Light |   Fallback |
|--------: | ---------: | ------: | ---------: |
|    count |       9908 |     788 |       9120 |
|      min |        0.0 |     0.0 |     1726.0 |
|      max |   135387.0 |   142.0 |   135387.0 |
|     mean |    5494.66 |    1.83 |    5969.26 |
|   stddev |   21624.04 |    7.58 |   22476.06 |

As you can see, the mean and stddev of allocation is extremely high with
the current approach of on-demand compaction.

The system was fragmented from a userspace program as I described in this
patch description. The workload is mainly anonymous userspace pages which
are easy to move around. I intentionally avoided unmovable pages in this
test to see how much latency we incur just by hitting the slow path for
a majority of allocations.


> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
> 
> These will be difficult for an admin to tune that is not extremely familiar with
> how external fragmentation is defined. If an admin asked "how much will
> stalls be reduced by setting this to a different value?", the answer will always
> be "I don't know, maybe some, maybe not".
>

Yes, this is my main worry. These values can be set to empirically
determined values on highly specialized systems like database appliances.
However, on a generic system, there is no real reasonable value.


Still, at the very least, I would like an interface that allows compacting
the system to a reasonable state. Something like:

    compact_extfrag(node, zone, order, high, low)

which starts compaction if extfrag > high and goes on till extfrag < low.

It's possible that there are too many unmovable pages mixed around for
compaction to succeed, still it's a reasonable interface to expose rather
than forced on-demand style of compaction (please see data below).

How (and if) to expose it to userspace (sysfs etc.) can be a separate
discussion.


> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the backgrond
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresolds are set to 100 for all orders which essentially
> > disables kcompactd.
> >
> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both, extfrag > extfrag_high and
> > compaction_suitable(zone). This allows kcomapctd thread to stays
> > inactive even if extfrag thresholds are not met.
> >
> 
> There is still a risk that if a system is completely fragmented that it may
> consume CPU on pointless compaction cycles. This is why compaction from
> kernel thread context makes no special effort and bails relatively quickly and
> assumes that if an application really needs high-order pages that it'll incur
> the cost at allocation time.
> 

As the data in Table-1 shows, on-demand compaction can add high latency to
every single allocation. I think it would be a significant improvement (see
Table-2) to at least expose an interface to allow proactive compaction
(like compact_extfrag), which a driver can itself run in the background. This
way, we need not add any tunables to the kernel itself and can leave the
compaction decision to specialized kernel/userspace monitors.


> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-
> mm/20161230131412.GI13301@dhcp22.suse.cz
> > /
> >
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> > respectively.
> >  - Use a test program to fragment memory: the program allocates all
> > memory  and then for each 2M aligned section, frees 3/4 of base pages
> > using  munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> > starts  compaction till extfrag < extfrag_low for order-9.
> >
> 
> This is a somewhat optimistic allocation scenario. The interesting ones are
> when a system is fragmented in a manner that is not trivial to resolve -- e.g.
> after a prolonged period of time with unmovable/reclaimable allocations
> stealing pageblocks. It's also fairly difficult to analyse if this is helping
> because you cannot measure after the fact how much time was saved in
> allocation time due to the work done by kcompactd. It is also hard to
> determine if the sum of the stalls incurred by proactive compaction is lower
> than the time saved at allocation time.
> 
> I fear that the user-visible effect will be times when there are very short but
> numerous stalls due to proactive compaction running in the background that
> will be hard to detect while the benefits may be invisible.
> 

Pro-active compaction can be done in a non-time-critical context, so to
estimate its benefits we can compare the data from Table-1 against the same
run, under a similar fragmentation state, but with this patch applied:


Table-2: hugepage allocation latencies with this patch applied on
5.3.0-rc5.

| GFP_Stat |        Any |     Light |   Fallback |
| --------:| ----------:| ---------:| ----------:|
|   count  |   12197.0  |  11167.0  |    1030.0  |
|     min  |       2.0  |      2.0  |       5.0  |
|     max  |  361727.0  |     26.0  |  361727.0  |
|    mean  |    366.05  |     4.48  |   4286.13  |
|   stddev |   4575.53  |     1.41  |  15209.63  |


We can see that mean latency dropped to 366us compared with 5494us before.

This is an optimistic scenario where there was only a small mix of unmovable
pages, but the data still shows that when compaction can succeed,
pro-active compaction can give a significant reduction in higher-order
allocation latencies.


> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> >
> 
> As unappealing as it sounds, I think it is better to try improve the allocation
> latency itself instead of trying to hide the cost in a kernel thread. It's far
> harder to implement as compaction is not easy but it would be more
> obvious what the savings are by looking at a histogram of allocation latencies
> -- there are other metrics that could be considered but that's the obvious
> one.
> 

Improving allocation latencies in itself would be a separate effort. In
case memory is full or fragmented, we have to deal with reclaim or
compaction to make allocations (esp. higher-order) succeed.  In particular,
forcing compaction to be done only on-demand is in my opinion not the right
approach. As I detailed above, at the very minimum, we need an interface
like `compact_extfrag` which leaves the decision on how pro-active
compaction should be to specific kernel/userspace drivers.
Khalid Aziz Aug. 24, 2019, 7:24 a.m. UTC | #7
On 8/20/19 2:46 AM, Vlastimil Babka wrote:
> +CC Khalid Aziz who proposed a different approach:
> https://lore.kernel.org/linux-mm/20190813014012.30232-1-khalid.aziz@oracle.com/T/#u
> 
> On 8/16/19 11:43 PM, Nitin Gupta wrote:
>> The patch has plenty of rough edges but posting it early to see if I'm
>> going in the right direction and to get some early feedback.
> 
> That's a lot of control knobs - how is an admin supposed to tune them to their
> needs?
> 

At a high level, this idea makes sense and is similar to the idea of
watermarks for free pages. My concern is the same. We now have more
knobs to tune and that increases complexity for sys admins as well as
the chances of a misconfigured system.

--
Khalid
Mel Gorman Aug. 26, 2019, 11:47 a.m. UTC | #8
On Thu, Aug 22, 2019 at 09:57:22PM +0000, Nitin Gupta wrote:
> > Note that proactive compaction may reduce allocation latency but it is not
> > free either. Even though the scanning and migration may happen in a kernel
> > thread, tasks can incur faults while waiting for compaction to complete if the
> > task accesses data being migrated. This means that costs are incurred by
> > applications on a system that may never care about high-order allocation
> > latency -- particularly if the allocations typically happen at application
> > initialisation time.  I recognise that kcompactd makes a bit of effort to
> > compact memory out-of-band but it also is typically triggered in response to
> > reclaim that was triggered by a high-order allocation request. i.e. the work
> > done by the thread is triggered by an allocation request that hit the slow
> > paths and not a preemptive measure.
> > 
> 
> > Hitting the slow path for every higher-order allocation is a significant
> performance/latency issue for applications that requires a large number of
> these allocations to succeed in bursts. To get some concrete numbers, I
> made a small driver that allocates as many hugepages as possible and
> measures allocation latency:
> 

Every higher-order allocation does not necessarily hit the slow path nor
does it incur equal latency.

> The driver first tries to allocate hugepage using GFP_TRANSHUGE_LIGHT
> (referred to as "Light" in the table below) and if that fails, tries to
> allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
> "Fallback" in table below). We stop the allocation loop if both methods
> fail.
> 
> Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All latencies
> are in microsec.
> 
> | GFP/Stat |        Any |   Light |   Fallback |
> |--------: | ---------: | ------: | ---------: |
> |    count |       9908 |     788 |       9120 |
> |      min |        0.0 |     0.0 |     1726.0 |
> |      max |   135387.0 |   142.0 |   135387.0 |
> |     mean |    5494.66 |    1.83 |    5969.26 |
> |   stddev |   21624.04 |    7.58 |   22476.06 |
> 

Given that it is expected that there would be significant tail latencies,
it would be better to analyse this in terms of percentiles. A very small
number of high latency allocations would skew the mean significantly
which is hinted by the stddev.

> As you can see, the mean and stddev of allocation is extremely high with
> the current approach of on-demand compaction.
> 
> The system was fragmented from a userspace program as I described in this
> patch description. The workload is mainly anonymous userspace pages which
> as easy to move around. I intentionally avoided unmovable pages in this
> test to see how much latency do we incur just by hitting the slow path for
> a majority of allocations.
> 

Even so, the penalty for proactive compaction is that applications
that may have no interest in higher-order pages may still stall while
their data is migrated if the data is hot. This is why I think the focus
should be on reducing the latency of compaction itself -- it benefits
applications that require higher-order allocations without increasing the
overhead for unrelated applications.

> 
> > > For a more proactive compaction, the approach taken here is to define
> > > per page-order external fragmentation thresholds and let kcompactd
> > > threads act on these thresholds.
> > >
> > > The low and high thresholds are defined per page-order and exposed
> > > through sysfs:
> > >
> > >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> > >
> > 
> > These will be difficult for an admin to tune that is not extremely familiar with
> > how external fragmentation is defined. If an admin asked "how much will
> > stalls be reduced by setting this to a different value?", the answer will always
> > be "I don't know, maybe some, maybe not".
> >
> 
> Yes, this is my main worry. These values can be set to empirically
> determined values on highly specialized systems like database appliances.
> However, on a generic system, there is no real reasonable value.
> 

Yep, which means the tunable will be vulnerable to cargo-cult tuning
recommendations. Or worse, the tuning recommendation will be a flat
"disable THP".

> Still, at the very least, I would like an interface that allows compacting
> system to a reasonable state. Something like:
> 
>     compact_extfrag(node, zone, order, high, low)
> 
> which start compaction if extfrag > high, and goes on till extfrag < low.
> 
> It's possible that there are too many unmovable pages mixed around for
> compaction to succeed, still it's a reasonable interface to expose rather
> than forced on-demand style of compaction (please see data below).
> 
> How (and if) to expose it to userspace (sysfs etc.) can be a separate
> discussion.
> 

That would be functionally similar to vm.compact_memory although it
would either need an extension or a separate tunable. With sysfs, there
could be a per-node file that takes a watermark and order tuple to
trigger the interface.

> 
> > > Per-node kcompactd thread is woken up every few seconds to check if
> > > any zone on its node has extfrag above the extfrag_high threshold for
> > > any order, in which case the thread starts compaction in the backgrond
> > > till all zones are below extfrag_low level for all orders. By default
> > > both these thresolds are set to 100 for all orders which essentially
> > > disables kcompactd.
> > >
> > > To avoid wasting CPU cycles when compaction cannot help, such as when
> > > memory is full, we check both, extfrag > extfrag_high and
> > > compaction_suitable(zone). This allows kcomapctd thread to stays
> > > inactive even if extfrag thresholds are not met.
> > >
> > 
> > There is still a risk that if a system is completely fragmented that it may
> > consume CPU on pointless compaction cycles. This is why compaction from
> > kernel thread context makes no special effort and bails relatively quickly and
> > assumes that if an application really needs high-order pages that it'll incur
> > the cost at allocation time.
> > 
> 
> As data in Table-1 shows, on-demand compaction can add high latency to
> every single allocation. I think it would be a significant improvement (see
> Table-2) to at least expose an interface to allow proactive compaction
> (like compaction_extfrag), which a driver can itself run in background. This
> way, we need not add any tunables to the kernel itself and leave compaction
> decision to specialized kernel/userspace monitors.
> 

I do not have any major objection -- again, it's not that dissimilar to
compact_memory (although that was intended as a debugging interface).

> 
> > > This patch is largely based on ideas from Michal Hocko posted here:
> > > https://lore.kernel.org/linux-
> > mm/20161230131412.GI13301@dhcp22.suse.cz
> > > /
> > >
> > > Testing done (on x86):
> > >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> > > respectively.
> > >  - Use a test program to fragment memory: the program allocates all
> > > memory  and then for each 2M aligned section, frees 3/4 of base pages
> > > using  munmap.
> > >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> > > starts  compaction till extfrag < extfrag_low for order-9.
> > >
> > 
> > This is a somewhat optimisitic allocation scenario. The interesting ones are
> > when a system is fragmenteed in a manner that is not trivial to resolve -- e.g.
> > after a prolonged period of time with unmovable/reclaimable allocations
> > stealing pageblocks. It's also fairly difficult to analyse if this is helping
> > because you cannot measure after the fact how much time was saved in
> > allocation time due to the work done by kcompactd. It is also hard to
> > determine if the sum of the stalls incurred by proactive compaction is lower
> > than the time saved at allocation time.
> > 
> > I fear that the user-visible effect will be times when there are very short but
> > numerous stalls due to proactive compaction running in the background that
> > will be hard to detect while the benefits may be invisible.
> > 
> 
> Pro-active compaction can be done in a non-time-critical context, so to
> estimate its benefits we can just compare data from Table-1 the same run,
> under a similar fragmentation state, but with this patch applied:
> 

How do you define what a non-time-critical context is? Once compaction
starts, an application's data becomes temporarily unavailable during
migration.

> 
> Table-2: hugepage allocation latencies with this patch applied on
> 5.3.0-rc5.
> 
> | GFP_Stat |        Any |     Light |   Fallback |
> | --------:| ----------:| ---------:| ----------:|
> |   count  |   12197.0  |  11167.0  |    1030.0  |
> |     min  |       2.0  |      2.0  |       5.0  |
> |     max  |  361727.0  |     26.0  |  361727.0  |
> |    mean  |    366.05  |     4.48  |   4286.13  |
> |   stddev |   4575.53  |     1.41  |  15209.63  |
> 
> 
> We can see that mean latency dropped to 366us compared with 5494us before.
> 
> This is an optimistic scenario where there was a little mix of unmovable
> pages but still the data shows that in case compaction can succeed,
> pro-active compaction can give signification reduction higher-order
> allocation latencies.
> 

Which still does not address the point that reducing compaction overhead
is generally beneficial without incurring additional overhead to
unrelated applications.

I'm not against the use of an interface because it requires an application
to make a deliberate choice and understand the downsides which can be
documented. An automatic proactive compaction may impact users that have
no idea the feature even exists.
Nitin Gupta Aug. 27, 2019, 8:36 p.m. UTC | #9
On Mon, 2019-08-26 at 12:47 +0100, Mel Gorman wrote:
> On Thu, Aug 22, 2019 at 09:57:22PM +0000, Nitin Gupta wrote:
> > > Note that proactive compaction may reduce allocation latency but
> > > it is not
> > > free either. Even though the scanning and migration may happen in
> > > a kernel
> > > thread, tasks can incur faults while waiting for compaction to
> > > complete if the
> > > task accesses data being migrated. This means that costs are
> > > incurred by
> > > applications on a system that may never care about high-order
> > > allocation
> > > latency -- particularly if the allocations typically happen at
> > > application
> > > initialisation time.  I recognise that kcompactd makes a bit of
> > > effort to
> > > compact memory out-of-band but it also is typically triggered in
> > > response to
> > > reclaim that was triggered by a high-order allocation request.
> > > i.e. the work
> > > done by the thread is triggered by an allocation request that hit
> > > the slow
> > > paths and not a preemptive measure.
> > > 
> > 
> > Hitting the slow path for every higher-order allocation is a
> > signification
> > performance/latency issue for applications that requires a large
> > number of
> > these allocations to succeed in bursts. To get some concrete
> > numbers, I
> > made a small driver that allocates as many hugepages as possible
> > and
> > measures allocation latency:
> > 
> 
> Every higher-order allocation does not necessarily hit the slow path
> nor
> does it incur equal latency.

I did not mean *every* hugepage allocation in a literal sense.
I meant to say: higher-order allocations *tend* to hit the slow path
with high probability under a reasonably fragmented memory state,
and when they do, they incur high latency.


> 
> > The driver first tries to allocate hugepage using
> > GFP_TRANSHUGE_LIGHT
> > (referred to as "Light" in the table below) and if that fails,
> > tries to
> > allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
> > "Fallback" in table below). We stop the allocation loop if both
> > methods
> > fail.
> > 
> > Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All
> > latencies
> > are in microsec.
> > 
> > > GFP/Stat |        Any |   Light |   Fallback |
> > > --------: | ---------: | ------: | ---------: |
> > >    count |       9908 |     788 |       9120 |
> > >      min |        0.0 |     0.0 |     1726.0 |
> > >      max |   135387.0 |   142.0 |   135387.0 |
> > >     mean |    5494.66 |    1.83 |    5969.26 |
> > >   stddev |   21624.04 |    7.58 |   22476.06 |
> 
> Given that it is expected that there would be significant tail
> latencies,
> it would be better to analyse this in terms of percentiles. A very
> small
> number of high latency allocations would skew the mean significantly
> which is hinted by the stddev.
> 

Here is the same data in terms of percentiles:

- with vanilla kernel 5.3.0-rc5:

percentile latency
–––––––––– –––––––
         5       1
        10    1790
        25    1829
        30    1838
        40    1854
        50    1871
        60    1890
        75    1924
        80    1945
        90    2206
        95    2302


- Now with kernel 5.3.0-rc5 + this patch:

percentile latency
–––––––––– –––––––
         5       3
        10       4
        25       4
        30       4
        40       4
        50       4
        60       4
        75       5
        80       5
        90       9
        95    1154


> > As you can see, the mean and stddev of allocation is extremely high
> > with
> > the current approach of on-demand compaction.
> > 
> > The system was fragmented from a userspace program as I described
> > in this
> > patch description. The workload is mainly anonymous userspace pages
> > which
> > as easy to move around. I intentionally avoided unmovable pages in
> > this
> > test to see how much latency do we incur just by hitting the slow
> > path for
> > a majority of allocations.
> > 
> 
> Even though, the penalty for proactive compaction is that
> applications
> that may have no interest in higher-order pages may still stall while
> their data is migrated if the data is hot. This is why I think the
> focus
> should be on reducing the latency of compaction -- it benefits
> applications that require higher-order latencies without increasing
> the
> overhead for unrelated applications.
> 

Sure, reducing compaction latency would help but there should still
be an option to proactively compact to hide latencies further.


> > > > For a more proactive compaction, the approach taken here is to
> > > > define
> > > > per page-order external fragmentation thresholds and let
> > > > kcompactd
> > > > threads act on these thresholds.
> > > > 
> > > > The low and high thresholds are defined per page-order and
> > > > exposed
> > > > through sysfs:
> > > > 
> > > >   /sys/kernel/mm/compaction/order-
> > > > [1..MAX_ORDER]/extfrag_{low,high}
> > > > 
> > > 
> > > These will be difficult for an admin to tune that is not
> > > extremely familiar with
> > > how external fragmentation is defined. If an admin asked "how
> > > much will
> > > stalls be reduced by setting this to a different value?", the
> > > answer will always
> > > be "I don't know, maybe some, maybe not".
> > > 
> > 
> > Yes, this is my main worry. These values can be set to emperically
> > determined values on highly specialized systems like database
> > appliances.
> > However, on a generic system, there is no real reasonable value.
> > 
> 
> Yep, which means the tunable will be vulnerable to cargo-cult tuning
> recommendations. Or worse, the tuning recommendation will be a flat
> "disable THP".
> 

I thought more on this and yes, exposing a system-wide per-order
extfrag threshold may not be the best approach. Instead, expose a
specific interface to compact a zone to a specified level and leave the
policy on when to trigger it (based on extfrag levels, system load etc.)
up to the user (a kernel driver or userspace daemon).


> > Still, at the very least, I would like an interface that allows
> > compacting
> > system to a reasonable state. Something like:
> > 
> >     compact_extfrag(node, zone, order, high, low)
> > 
> > which start compaction if extfrag > high, and goes on till extfrag
> > < low.
> > 
> > It's possible that there are too many unmovable pages mixed around
> > for
> > compaction to succeed, still it's a reasonable interface to expose
> > rather
> > than forced on-demand style of compaction (please see data below).
> > 
> > How (and if) to expose it to userspace (sysfs etc.) can be a
> > separate
> > discussion.
> > 
> 
> That would be functionally similar to vm.compact_memory although it
> would either need an extension or a separate tunable. With sysfs,
> there
> could be a per-node file that takes with a watermark and order tuple
> to
> trigger the interface.
> 

Something like:
    /sys/kernel/mm/node-n/compact
or, /sys/kernel/mm/compact-n
    where n in [0, NUM_NODES),

which takes a (watermark, order) tuple, should do?

I'm also okay with not adding any of these sysfs interfaces for now.

> > > > Per-node kcompactd thread is woken up every few seconds to
> > > > check if
> > > > any zone on its node has extfrag above the extfrag_high
> > > > threshold for
> > > > any order, in which case the thread starts compaction in the
> > > > backgrond
> > > > till all zones are below extfrag_low level for all orders. By
> > > > default
> > > > both these thresolds are set to 100 for all orders which
> > > > essentially
> > > > disables kcompactd.
> > > > 
> > > > To avoid wasting CPU cycles when compaction cannot help, such
> > > > as when
> > > > memory is full, we check both, extfrag > extfrag_high and
> > > > compaction_suitable(zone). This allows kcomapctd thread to
> > > > stays
> > > > inactive even if extfrag thresholds are not met.
> > > > 
> > > 
> > > There is still a risk that if a system is completely fragmented
> > > that it may
> > > consume CPU on pointless compaction cycles. This is why
> > > compaction from
> > > kernel thread context makes no special effort and bails
> > > relatively quickly and
> > > assumes that if an application really needs high-order pages that
> > > it'll incur
> > > the cost at allocation time.
> > > 
> > 
> > As data in Table-1 shows, on-demand compaction can add high latency
> > to
> > every single allocation. I think it would be a significant
> > improvement (see
> > Table-2) to at least expose an interface to allow proactive
> > compaction
> > (like compaction_extfrag), which a driver can itself run in
> > background. This
> > way, we need not add any tunables to the kernel itself and leave
> > compaction
> > decision to specialized kernel/userspace monitors.
> > 
> 
> I do not have any major objection -- again, it's not that dissimilar
> to
> compact_memory (although that was intended as a debugging interface).
> 

Yes, the only difference is I want compaction to continue till
we hit the given extfrag level.


> > > > This patch is largely based on ideas from Michal Hocko posted
> > > > here:
> > > > https://lore.kernel.org/linux-
> > > mm/20161230131412.GI13301@dhcp22.suse.cz
> > > > /
> > > > 
> > > > Testing done (on x86):
> > > >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} =
> > > > {25, 30}
> > > > respectively.
> > > >  - Use a test program to fragment memory: the program allocates
> > > > all
> > > > memory  and then for each 2M aligned section, frees 3/4 of base
> > > > pages
> > > > using  munmap.
> > > >  - kcompactd0 detects fragmentation for order-9 > extfrag_high
> > > > and
> > > > starts  compaction till extfrag < extfrag_low for order-9.
> > > > 
> > > 
> > > This is a somewhat optimisitic allocation scenario. The
> > > interesting ones are
> > > when a system is fragmenteed in a manner that is not trivial to
> > > resolve -- e.g.
> > > after a prolonged period of time with unmovable/reclaimable
> > > allocations
> > > stealing pageblocks. It's also fairly difficult to analyse if
> > > this is helping
> > > because you cannot measure after the fact how much time was saved
> > > in
> > > allocation time due to the work done by kcompactd. It is also
> > > hard to
> > > determine if the sum of the stalls incurred by proactive
> > > compaction is lower
> > > than the time saved at allocation time.
> > > 
> > > I fear that the user-visible effect will be times when there are
> > > very short but
> > > numerous stalls due to proactive compaction running in the
> > > background that
> > > will be hard to detect while the benefits may be invisible.
> > > 
> > 
> > Pro-active compaction can be done in a non-time-critical context,
> > so to
> > estimate its benefits we can just compare data from Table-1 the
> > same run,
> > under a similar fragmentation state, but with this patch applied:
> > 
> 
> How do you define what a non-time-critical context is? Once
> compaction
> starts, an applications data becomes temporarily unavailable during
> migration.


By time-critical context I roughly mean contexts where hugepage
allocations are triggered in response to a user action and any delay
here would be directly noticeable by the user. Compare this scenario
with a background thread doing compaction: this activity can appear
as random freezes to running applications. Whether this
effect on unrelated applications is acceptable or not can be left
to the user of this new compaction interface.

> 
> > Table-2: hugepage allocation latencies with this patch applied on
> > 5.3.0-rc5.
> > 
> > > GFP_Stat |        Any |     Light |   Fallback |
> > > --------:| ----------:| ---------:| ----------:|
> > >   count  |   12197.0  |  11167.0  |    1030.0  |
> > >     min  |       2.0  |      2.0  |       5.0  |
> > >     max  |  361727.0  |     26.0  |  361727.0  |
> > >    mean  |    366.05  |     4.48  |   4286.13  |
> > >   stddev |   4575.53  |     1.41  |  15209.63  |
> > 
> > We can see that mean latency dropped to 366us compared with 5494us
> > before.
> > 
> > This is an optimistic scenario where there was a little mix of
> > unmovable
> > pages but still the data shows that in case compaction can succeed,
> > pro-active compaction can give signification reduction higher-order
> > allocation latencies.
> > 
> 
> Which still does not address the point that reducing compaction
> overhead
> is generally beneficial without incurring additional overhead to
> unrelated applications.
> 

Yes, reducing compaction latency is always beneficial, especially if
it can be done in a way that avoids touching (hot) pages of unrelated
applications.
Even with good improvements in this area, proactive compaction would
still be good to have.


> I'm not against the use of an interface because it requires an
> application
> to make a deliberate choice and understand the downsides which can be
> documented. An automatic proactive compaction may impact users that
> have
> no idea the feature even exists.
> 

I'm now dropping the idea of exposing per-order extfrag thresholds and
would now focus on an interface to compact memory to reach a given
extfrag level instead.

Thanks,
Nitin
David Rientjes Sept. 16, 2019, 8:16 p.m. UTC | #10
On Fri, 16 Aug 2019, Nitin Gupta wrote:

> For some applications we need to allocate almost all memory as
> hugepages. However, on a running system, higher order allocations can
> fail if the memory is fragmented. Linux kernel currently does
> on-demand compaction as we request more hugepages but this style of
> compaction incurs very high latency. Experiments with one-time full
> memory compaction (followed by hugepage allocations) shows that kernel
> is able to restore a highly fragmented memory state to a fairly
> compacted memory state within <1 sec for a 32G system. Such data
> suggests that a more proactive compaction can help us allocate a large
> fraction of memory as hugepages keeping allocation latencies low.
> 
> For a more proactive compaction, the approach taken here is to define
> per page-order external fragmentation thresholds and let kcompactd
> threads act on these thresholds.
> 
> The low and high thresholds are defined per page-order and exposed
> through sysfs:
> 
>   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> 
> Per-node kcompactd thread is woken up every few seconds to check if
> any zone on its node has extfrag above the extfrag_high threshold for
> any order, in which case the thread starts compaction in the backgrond
> till all zones are below extfrag_low level for all orders. By default
> both these thresolds are set to 100 for all orders which essentially
> disables kcompactd.
> 
> To avoid wasting CPU cycles when compaction cannot help, such as when
> memory is full, we check both, extfrag > extfrag_high and
> compaction_suitable(zone). This allows kcomapctd thread to stays inactive
> even if extfrag thresholds are not met.
> 
> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> 
> Testing done (on x86):
>  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
>  respectively.
>  - Use a test program to fragment memory: the program allocates all memory
>  and then for each 2M aligned section, frees 3/4 of base pages using
>  munmap.
>  - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
>  compaction till extfrag < extfrag_low for order-9.
> 
> The patch has plenty of rough edges but posting it early to see if I'm
> going in the right direction and to get some early feedback.
> 

Is there an update to this proposal or non-RFC patch that has been posted 
for proactive compaction?

We've had good success with periodically compacting memory on a regular 
cadence on systems with hugepages enabled.  The cadence itself is defined 
by the admin but it causes khugepaged[*] to periodically wake up and invoke 
compaction in an attempt to keep zones as defragmented as possible 
(perhaps more "proactive" than what is proposed here in an attempt to keep 
all memory as unfragmented as possible regardless of extfrag thresholds).  
It also avoids corner-cases where kcompactd could become more expensive 
than what is anticipated because it is unsuccessful at compacting memory 
yet the extfrag threshold is still exceeded.

 [*] Khugepaged instead of kcompactd only because this is only enabled
     for systems where transparent hugepages are enabled, probably better
     off in kcompactd to avoid duplicating work between two kthreads if
     there is already a need for background compaction.
Nitin Gupta Sept. 16, 2019, 8:50 p.m. UTC | #11
On Mon, 2019-09-16 at 13:16 -0700, David Rientjes wrote:
> On Fri, 16 Aug 2019, Nitin Gupta wrote:
> 
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. Linux kernel currently does
> > on-demand compaction as we request more hugepages but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) shows that kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages keeping allocation latencies low.
> > 
> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> > 
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> > 
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> > 
> > Per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the backgrond
> > till all zones are below extfrag_low level for all orders. By default
> > both these thresolds are set to 100 for all orders which essentially
> > disables kcompactd.
> > 
> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both, extfrag > extfrag_high and
> > compaction_suitable(zone). This allows kcomapctd thread to stays inactive
> > even if extfrag thresholds are not met.
> > 
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> > 
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >  respectively.
> >  - Use a test program to fragment memory: the program allocates all memory
> >  and then for each 2M aligned section, frees 3/4 of base pages using
> >  munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> >  compaction till extfrag < extfrag_low for order-9.
> > 
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> > 
> 
> Is there an update to this proposal or non-RFC patch that has been posted 
> for proactive compaction?
> 
> We've had good success with periodically compacting memory on a regular 
> cadence on systems with hugepages enabled.  The cadence itself is defined 
> by the admin but it causes khugepaged[*] to periodically wakeup and invoke 
> compaction in an attempt to keep zones as defragmented as possible 
> (perhaps more "proactive" than what is proposed here in an attempt to keep 
> all memory as unfragmented as possible regardless of extfrag thresholds).  
> It also avoids corner-cases where kcompactd could become more expensive 
> than what is anticipated because it is unsuccessful at compacting memory 
> yet the extfrag threshold is still exceeded.
> 
>  [*] Khugepaged instead of kcompactd only because this is only enabled
>      for systems where transparent hugepages are enabled, probably better
>      off in kcompactd to avoid duplicating work between two kthreads if
>      there is already a need for background compaction.
> 


Discussion on this RFC patch revolved around the issue of exposing too
many tunables (per-node, per-order, [low-high] extfrag thresholds). It
was sort-of concluded that no admin will get these tunables right for
a variety of workloads.

To eliminate the need for tunables, I proposed another patch:

https://patchwork.kernel.org/patch/11140067/

which does not add any tunables but extends and exports an existing
function (compact_zone_order). In summary, this new patch adds a
callback function which allows any driver to implement ad-hoc
compaction policies. There is also a sample driver which makes use
of this interface to keep hugepage external fragmentation within
specified range (exposed through debugfs):

https://gitlab.com/nigupta/linux/snippets/1894161

-Nitin
John Hubbard Sept. 17, 2019, 7:46 p.m. UTC | #12
On 9/16/19 1:16 PM, David Rientjes wrote:
> On Fri, 16 Aug 2019, Nitin Gupta wrote:
...
> 
> We've had good success with periodically compacting memory on a regular 
> cadence on systems with hugepages enabled.  The cadence itself is defined 
> by the admin but it causes khugepaged[*] to periodically wakeup and invoke 
> compaction in an attempt to keep zones as defragmented as possible 

That's an important data point, thanks for reporting it. 

And given that we have at least one data point validating it, I think we
should feel fairly comfortable with this approach. The sys admin
probably knows when the best times are to steal CPU cycles and recover
some huge pages. Unlike the kernel, the sys admin can actually see the
future sometimes, because he/she may know what is going to be run.

It still sounds like we can expect excellent results from simply
defragmenting from user space, via a cron job and/or before running
important tests, rather than trying to have the kernel guess whether
it's a performance win to defragment at some particular time.

Are you using existing interfaces, or did you need to add something? How
exactly are you triggering compaction?

thanks,
David Rientjes Sept. 17, 2019, 8:26 p.m. UTC | #13
On Tue, 17 Sep 2019, John Hubbard wrote:

> > We've had good success with periodically compacting memory on a regular 
> > cadence on systems with hugepages enabled.  The cadence itself is defined 
> > by the admin but it causes khugepaged[*] to periodically wakeup and invoke 
> > compaction in an attempt to keep zones as defragmented as possible 
> 
> That's an important data point, thanks for reporting it. 
> 
> And given that we have at least one data point validating it, I think we
> should feel fairly comfortable with this approach. Because the sys admin 
> probably knows  when are the best times to steal cpu cycles and recover 
> some huge pages. Unlike the kernel, the sys admin can actually see the 
> future sometimes, because he/she may know what is going to be run.
> 
> It's still sounding like we can expect excellent results from simply 
> defragmenting from user space, via a chron job and/or before running
> important tests, rather than trying to have the kernel guess whether 
> it's a performance win to defragment at some particular time.
> 
> Are you using existing interfaces, or did you need to add something? How
> exactly are you triggering compaction?
> 

It's possible to do this through a cron job but there are a few reasons
why we preferred to do it through khugepaged:

 - we use a lighter variation of compaction, MIGRATE_SYNC_LIGHT, than what 
   the per-node trigger provides since compact_node() forces MIGRATE_SYNC
   and can stall for minutes and become disruptive under some
   circumstances,

 - we do not ignore the pageblock skip hint which compact_node() hardcodes 
   to ignore, and 

 - we didn't want to do this in process context so that the cpu time is
   not taxed to any user cgroup since it's on behalf of the system as a
   whole.

It seems much better to do this on a per-node basis rather than through 
the sysctl to do it for the whole system to partition the work.  Extending 
the per-node interface to do MIGRATE_SYNC_LIGHT and not ignore pageblock 
skip is possible but the work done would still be done in process context 
so if done from userspace this would need to be attached to a cgroup that 
does not tax that cgroup for usage done on behalf of the entire system.

Again, we're using khugepaged and allowing the period to be defined 
through /sys/kernel/mm/transparent_hugepage/khugepaged but that is because 
we only want to do this on systems where we want to dynamically allocate 
hugepages on a regular basis.
Nitin Gupta Sept. 19, 2019, 11:22 p.m. UTC | #14
On Thu, 2019-08-22 at 09:51 +0100, Mel Gorman wrote:
> As unappealing as it sounds, I think it is better to try improve the
> allocation latency itself instead of trying to hide the cost in a kernel
> thread. It's far harder to implement as compaction is not easy but it
> would be more obvious what the savings are by looking at a histogram of
> allocation latencies -- there are other metrics that could be considered
> but that's the obvious one.
> 

Do you mean reducing allocation latency, especially when it hits the direct
compaction path? Do you have any ideas in mind for this? I'm open to
working on them and reporting back latency numbers, while I think more about
less tunable-heavy background (proactive) compaction approaches.

-Nitin
Nitin Gupta Sept. 19, 2019, 11:37 p.m. UTC | #15
On Tue, 2019-08-20 at 10:46 +0200, Vlastimil Babka wrote:
> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> > 
> > Testing done (on x86):
> >   - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >   respectively.
> >   - Use a test program to fragment memory: the program allocates all
> > memory
> >   and then for each 2M aligned section, frees 3/4 of base pages using
> >   munmap.
> >   - kcompactd0 detects fragmentation for order-9 > extfrag_high and starts
> >   compaction till extfrag < extfrag_low for order-9.
> > 
> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> 
> That's a lot of control knobs - how is an admin supposed to tune them to
> their
> needs?


Yes, it's difficult for an admin to get so many tunables right unless
targeting a very specific workload.

How about a simpler solution where we exposed just one tunable per-node:
   /sys/.../node-x/compaction_effort
which accepts [0, 100]

This parallels /proc/sys/vm/swappiness but for compaction. With this
single number, we can estimate per-order [low, high] watermarks for external
fragmentation like this:
 - For now, map this range to [low, medium, high], which corresponds to specific
low, high thresholds for extfrag.
 - Apply more relaxed thresholds for higher-order than for lower orders.

With this single tunable we remove the burden of setting per-order explicit
[low, high] thresholds and it should be easier to experiment with.

-Nitin
Vlastimil Babka Sept. 24, 2019, 1:39 p.m. UTC | #16
On 9/20/19 1:37 AM, Nitin Gupta wrote:
> On Tue, 2019-08-20 at 10:46 +0200, Vlastimil Babka wrote:
>>
>> That's a lot of control knobs - how is an admin supposed to tune them to
>> their
>> needs?
> 
> 
> Yes, it's difficult for an admin to get so many tunable right unless
> targeting a very specific workload.
> 
> How about a simpler solution where we exposed just one tunable per-node:
>    /sys/.../node-x/compaction_effort
> which accepts [0, 100]
> 
> This parallels /proc/sys/vm/swappiness but for compaction. With this
> single number, we can estimate per-order [low, high] watermarks for external
> fragmentation like this:
>  - For now, map this range to [low, medium, high] which correponds to specific
> low, high thresholds for extfrag.
>  - Apply more relaxed thresholds for higher-order than for lower orders.
> 
> With this single tunable we remove the burden of setting per-order explicit
> [low, high] thresholds and it should be easier to experiment with.

What about instead autotuning by the number of allocations hitting
direct compaction recently? IIRC there were attempts in the past (myself
included) and recently Khalid's, which was quite elaborate.

> -Nitin
> 
> 
>
Khalid Aziz Sept. 24, 2019, 2:11 p.m. UTC | #17
On 9/24/19 7:39 AM, Vlastimil Babka wrote:
> On 9/20/19 1:37 AM, Nitin Gupta wrote:
>> On Tue, 2019-08-20 at 10:46 +0200, Vlastimil Babka wrote:
>>>
>>> That's a lot of control knobs - how is an admin supposed to tune them to
>>> their
>>> needs?
>>
>>
>> Yes, it's difficult for an admin to get so many tunables right unless
>> targeting a very specific workload.
>>
>> How about a simpler solution where we expose just one tunable per node:
>>    /sys/.../node-x/compaction_effort
>> which accepts a value in [0, 100].
>>
>> This parallels /proc/sys/vm/swappiness but for compaction. With this
>> single number, we can estimate per-order [low, high] watermarks for external
>> fragmentation like this:
>>  - For now, map this range to [low, medium, high], which corresponds to
>> specific low, high thresholds for extfrag.
>>  - Apply more relaxed thresholds for higher orders than for lower orders.
>>
>> With this single tunable we remove the burden of setting explicit per-order
>> [low, high] thresholds, and it should be easier to experiment with.
> 
> What about instead autotuning based on the number of allocations hitting
> direct compaction recently? IIRC there were attempts at this in the past
> (myself included), and recently Khalid's, which was quite elaborate.
> 

I do think the right way forward with this longstanding problem is to
take the burden of managing free memory away from the end user and let
the kernel autotune itself to the demands of the workload. We can start
with a simpler algorithm in the kernel that adapts to the workload and
refine it as we move forward. As long as the initial implementation
performs at least as well as the current free page management, we have a
workable path for improvements. I am moving the implementation I put
together in the kernel to a userspace daemon just to test it out on a
larger variety of workloads. Userspace is more limited, with restricted
access to the statistics the algorithm needs for trend analysis, so I
would rather be doing this in the kernel.

--
Khalid

Patch
diff mbox series

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9569e7c786d3..26bfedbbc64b 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -60,6 +60,17 @@  enum compact_result {
 
 struct alloc_context; /* in mm/internal.h */
 
+/* "order-%d" */
+#define COMPACTION_ORDER_STATE_NAME_LEN 16
+/* Per-order compaction state */
+struct compaction_order_state {
+	unsigned int order;
+	unsigned int extfrag_low;
+	unsigned int extfrag_high;
+	unsigned int extfrag_curr;
+	char name[COMPACTION_ORDER_STATE_NAME_LEN];
+};
+
 /*
  * Number of free order-0 pages that should be available above given watermark
  * to make sure compaction has reasonable chance of not running out of free
@@ -90,6 +101,7 @@  extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 extern int sysctl_extfrag_threshold;
 extern int sysctl_compact_unevictable_allowed;
 
+extern int extfrag_for_order(struct zone *zone, unsigned int order);
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 		unsigned int order, unsigned int alloc_flags,
diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..21866b1ad249 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -25,6 +25,10 @@ 
 #include <linux/psi.h>
 #include "internal.h"
 
+#ifdef CONFIG_COMPACTION
+struct compaction_order_state compaction_order_states[MAX_ORDER+1];
+#endif
+
 #ifdef CONFIG_COMPACTION
 static inline void count_compact_event(enum vm_event_item item)
 {
@@ -1846,6 +1850,49 @@  static inline bool is_via_compact_memory(int order)
 	return order == -1;
 }
 
+static int extfrag_wmark_high(struct zone *zone)
+{
+	int order;
+
+	for (order = 1; order <= MAX_ORDER; order++) {
+		int extfrag = extfrag_for_order(zone, order);
+		int threshold = compaction_order_states[order].extfrag_high;
+
+		if (extfrag > threshold)
+			return order;
+	}
+	return 0;
+}
+
+static bool node_should_compact(pg_data_t *pgdat)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone) {
+		int order = extfrag_wmark_high(zone);
+
+		if (order && compaction_suitable(zone, order,
+				0, zone_idx(zone)) == COMPACT_CONTINUE) {
+			return true;
+		}
+	}
+	return false;
+}
+
+static int extfrag_wmark_low(struct zone *zone)
+{
+	int order;
+
+	for (order = 1; order <= MAX_ORDER; order++) {
+		int extfrag = extfrag_for_order(zone, order);
+		int threshold = compaction_order_states[order].extfrag_low;
+
+		if (extfrag > threshold)
+			return order;
+	}
+	return 0;
+}
+
 static enum compact_result __compact_finished(struct compact_control *cc)
 {
 	unsigned int order;
@@ -1872,7 +1919,7 @@  static enum compact_result __compact_finished(struct compact_control *cc)
 			return COMPACT_PARTIAL_SKIPPED;
 	}
 
-	if (is_via_compact_memory(cc->order))
+	if (extfrag_wmark_low(cc->zone))
 		return COMPACT_CONTINUE;
 
 	/*
@@ -1962,18 +2009,6 @@  static enum compact_result __compaction_suitable(struct zone *zone, int order,
 {
 	unsigned long watermark;
 
-	if (is_via_compact_memory(order))
-		return COMPACT_CONTINUE;
-
-	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
-	/*
-	 * If watermarks for high-order allocation are already met, there
-	 * should be no need for compaction at all.
-	 */
-	if (zone_watermark_ok(zone, order, watermark, classzone_idx,
-								alloc_flags))
-		return COMPACT_SUCCESS;
-
 	/*
 	 * Watermarks for order-0 must be met for compaction to be able to
 	 * isolate free pages for migration targets. This means that the
@@ -2003,31 +2038,9 @@  enum compact_result compaction_suitable(struct zone *zone, int order,
 					int classzone_idx)
 {
 	enum compact_result ret;
-	int fragindex;
 
 	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
 				    zone_page_state(zone, NR_FREE_PAGES));
-	/*
-	 * fragmentation index determines if allocation failures are due to
-	 * low memory or external fragmentation
-	 *
-	 * index of -1000 would imply allocations might succeed depending on
-	 * watermarks, but we already failed the high-order watermark check
-	 * index towards 0 implies failure is due to lack of memory
-	 * index towards 1000 implies failure is due to fragmentation
-	 *
-	 * Only compact if a failure would be due to fragmentation. Also
-	 * ignore fragindex for non-costly orders where the alternative to
-	 * a successful reclaim/compaction is OOM. Fragindex and the
-	 * vm.extfrag_threshold sysctl is meant as a heuristic to prevent
-	 * excessive compaction for costly orders, but it should not be at the
-	 * expense of system stability.
-	 */
-	if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
-		fragindex = fragmentation_index(zone, order);
-		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
-			ret = COMPACT_NOT_SUITABLE_ZONE;
-	}
 
 	trace_mm_compaction_suitable(zone, order, ret);
 	if (ret == COMPACT_NOT_SUITABLE_ZONE)
@@ -2416,7 +2429,6 @@  static void compact_node(int nid)
 		.gfp_mask = GFP_KERNEL,
 	};
 
-
 	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 
 		zone = &pgdat->node_zones[zoneid];
@@ -2493,9 +2505,149 @@  void compaction_unregister_node(struct node *node)
 }
 #endif /* CONFIG_SYSFS && CONFIG_NUMA */
 
+#ifdef CONFIG_SYSFS
+
+#define COMPACTION_ATTR_RO(_name) \
+	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
+
+#define COMPACTION_ATTR(_name) \
+	static struct kobj_attribute _name##_attr = \
+		__ATTR(_name, 0644, _name##_show, _name##_store)
+
+static struct kobject *compaction_kobj;
+static struct kobject *compaction_order_kobjs[MAX_ORDER + 1];
+
+static struct compaction_order_state *kobj_to_compaction_order_state(
+						struct kobject *kobj)
+{
+	int i;
+
+	for (i = 1; i <= MAX_ORDER; i++) {
+		if (compaction_order_kobjs[i] == kobj)
+			return &compaction_order_states[i];
+	}
+
+	return NULL;
+}
+
+static ssize_t extfrag_store_common(bool is_low, struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	int err;
+	unsigned long input;
+	struct compaction_order_state *c = kobj_to_compaction_order_state(kobj);
+
+	err = kstrtoul(buf, 10, &input);
+	if (err)
+		return err;
+	if (input > 100)
+		return -EINVAL;
+
+	if (is_low)
+		c->extfrag_low = input;
+	else
+		c->extfrag_high = input;
+
+	return count;
+}
+
+static ssize_t extfrag_low_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	struct compaction_order_state *c = kobj_to_compaction_order_state(kobj);
+
+	return sprintf(buf, "%u\n", c->extfrag_low);
+}
+
+static ssize_t extfrag_low_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	return extfrag_store_common(true, kobj, attr, buf, count);
+}
+COMPACTION_ATTR(extfrag_low);
+
+static ssize_t extfrag_high_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	struct compaction_order_state *c = kobj_to_compaction_order_state(kobj);
+
+	return sprintf(buf, "%u\n", c->extfrag_high);
+}
+
+static ssize_t extfrag_high_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	return extfrag_store_common(false, kobj, attr, buf, count);
+}
+COMPACTION_ATTR(extfrag_high);
+
+static struct attribute *compaction_order_attrs[] = {
+	&extfrag_low_attr.attr,
+	&extfrag_high_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group compaction_order_attr_group = {
+	.attrs = compaction_order_attrs,
+};
+
+static int compaction_sysfs_add_order(struct compaction_order_state *c,
+	struct kobject *parent, struct kobject **compaction_order_kobjs,
+	const struct attribute_group *compaction_order_attr_group)
+{
+	int retval;
+
+	compaction_order_kobjs[c->order] =
+			kobject_create_and_add(c->name, parent);
+	if (!compaction_order_kobjs[c->order])
+		return -ENOMEM;
+
+	retval = sysfs_create_group(compaction_order_kobjs[c->order],
+				compaction_order_attr_group);
+	if (retval)
+		kobject_put(compaction_order_kobjs[c->order]);
+
+	return retval;
+}
+
+static void __init compaction_sysfs_init(void)
+{
+	struct compaction_order_state *c;
+	int i, err;
+
+	compaction_kobj = kobject_create_and_add("compaction", mm_kobj);
+	if (!compaction_kobj)
+		return;
+
+	for (i = 1; i <= MAX_ORDER; i++) {
+		c = &compaction_order_states[i];
+		err = compaction_sysfs_add_order(c, compaction_kobj,
+					compaction_order_kobjs,
+					&compaction_order_attr_group);
+		if (err)
+			pr_err("compaction: Unable to add state %s\n", c->name);
+	}
+}
+
+static void __init compaction_init_order_states(void)
+{
+	int i;
+
+	for (i = 0; i <= MAX_ORDER; i++) {
+		struct compaction_order_state *c = &compaction_order_states[i];
+
+		c->order = i;
+		c->extfrag_low = 100;
+		c->extfrag_high = 100;
+		snprintf(c->name, COMPACTION_ORDER_STATE_NAME_LEN,
+						"order-%d", i);
+	}
+}
+#endif
+
 static inline bool kcompactd_work_requested(pg_data_t *pgdat)
 {
-	return pgdat->kcompactd_max_order > 0 || kthread_should_stop();
+	return kthread_should_stop() || node_should_compact(pgdat);
 }
 
 static bool kcompactd_node_suitable(pg_data_t *pgdat)
@@ -2527,15 +2679,16 @@  static void kcompactd_do_work(pg_data_t *pgdat)
 	int zoneid;
 	struct zone *zone;
 	struct compact_control cc = {
-		.order = pgdat->kcompactd_max_order,
-		.search_order = pgdat->kcompactd_max_order,
+		.order = -1,
 		.total_migrate_scanned = 0,
 		.total_free_scanned = 0,
-		.classzone_idx = pgdat->kcompactd_classzone_idx,
-		.mode = MIGRATE_SYNC_LIGHT,
-		.ignore_skip_hint = false,
+		.mode = MIGRATE_SYNC,
+		.ignore_skip_hint = true,
+		.whole_zone = false,
 		.gfp_mask = GFP_KERNEL,
+		.classzone_idx = MAX_NR_ZONES - 1,
 	};
+
 	trace_mm_compaction_kcompactd_wake(pgdat->node_id, cc.order,
 							cc.classzone_idx);
 	count_compact_event(KCOMPACTD_WAKE);
@@ -2565,7 +2718,6 @@  static void kcompactd_do_work(pg_data_t *pgdat)
 		if (kthread_should_stop())
 			return;
 		status = compact_zone(&cc, NULL);
-
 		if (status == COMPACT_SUCCESS) {
 			compaction_defer_reset(zone, cc.order, false);
 		} else if (status == COMPACT_PARTIAL_SKIPPED || status == COMPACT_COMPLETE) {
@@ -2650,11 +2802,14 @@  static int kcompactd(void *p)
 	pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;
 
 	while (!kthread_should_stop()) {
-		unsigned long pflags;
+		unsigned long ret, pflags;
 
 		trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
-		wait_event_freezable(pgdat->kcompactd_wait,
-				kcompactd_work_requested(pgdat));
+		ret = wait_event_freezable_timeout(pgdat->kcompactd_wait,
+				kcompactd_work_requested(pgdat),
+				msecs_to_jiffies(5000));
+		if (!ret)
+			continue;
 
 		psi_memstall_enter(&pflags);
 		kcompactd_do_work(pgdat);
@@ -2735,6 +2890,9 @@  static int __init kcompactd_init(void)
 		return ret;
 	}
 
+	compaction_init_order_states();
+	compaction_sysfs_init();
+
 	for_each_node_state(nid, N_MEMORY)
 		kcompactd_run(nid);
 	return 0;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fd7e16ca6996..e9090a5595d1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1074,6 +1074,18 @@  static int __fragmentation_index(unsigned int order, struct contig_page_info *in
 	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
 }
 
+int extfrag_for_order(struct zone *zone, unsigned int order)
+{
+	struct contig_page_info info;
+
+	fill_contig_page_info(zone, order, &info);
+	if (info.free_pages == 0)
+		return 0;
+
+	return (info.free_pages - (info.free_blocks_suitable << order)) * 100
+							/ info.free_pages;
+}
+
 /* Same as __fragmentation index but allocs contig_page_info on stack */
 int fragmentation_index(struct zone *zone, unsigned int order)
 {