mm: Add callback for defining compaction completion

Message ID	20190910200756.7143-1-nigupta@nvidia.com (mailing list archive)
State	New, archived
Headers	From: Nitin Gupta <nigupta@nvidia.com>
	To: <akpm@linux-foundation.org>, <vbabka@suse.cz>, <mgorman@techsingularity.net>, <mhocko@suse.com>, <dan.j.williams@intel.com>, <khalid.aziz@oracle.com>
	CC: Nitin Gupta <nigupta@nvidia.com>, Matthew Wilcox <willy@infradead.org>, Yu Zhao <yuzhao@google.com>, Qian Cai <cai@lca.pw>, Andrey Ryabinin <aryabinin@virtuozzo.com>, Allison Randal <allison@lohutok.net>, Mike Rapoport <rppt@linux.vnet.ibm.com>, Thomas Gleixner <tglx@linutronix.de>, Arun KS <arunks@codeaurora.org>, Wei Yang <richard.weiyang@gmail.com>, <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>
	Subject: [PATCH] mm: Add callback for defining compaction completion
	Date: Tue, 10 Sep 2019 13:07:32 -0700
Series	mm: Add callback for defining compaction completion

Nitin Gupta Sept. 10, 2019, 8:07 p.m. UTC

For some applications we need to allocate almost all memory as hugepages.
However, on a running system, higher order allocations can fail if the
memory is fragmented. The Linux kernel currently does on-demand compaction
as we request more hugepages, but this style of compaction incurs very high
latency. Experiments with one-time full memory compaction (followed by
hugepage allocations) show that the kernel is able to restore a highly
fragmented memory state to a fairly compacted memory state in under 1 sec
for a 32G system. Such data suggests that more proactive compaction can
help us allocate a large fraction of memory as hugepages while keeping
allocation latencies low.

In general, compaction can introduce unexpected latencies for applications
that don't even have strong requirements for contiguous allocations. It is
also hard to efficiently determine if the current system state can be
easily compacted due to mixing of unmovable memory. For these reasons,
automatic background compaction by the kernel itself is hard to get right
in a way which does not hurt unsuspecting applications or waste CPU cycles.

Even with these caveats, pro-active compaction can still be very useful in
certain scenarios to reduce hugepage allocation latencies. This callback
interface allows drivers to drive compaction based on their own policies,
such as the current level of external fragmentation for a particular order,
system load, etc.

Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
---
 include/linux/compaction.h | 10 ++++++++++
 mm/compaction.c            | 20 ++++++++++++++------
 mm/internal.h              |  2 ++
 3 files changed, 26 insertions(+), 6 deletions(-)

Michal Hocko Sept. 10, 2019, 8:19 p.m. UTC | #1

On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> For some applications we need to allocate almost all memory as hugepages.
> However, on a running system, higher order allocations can fail if the
> memory is fragmented. Linux kernel currently does on-demand compaction as
> we request more hugepages but this style of compaction incurs very high
> latency. Experiments with one-time full memory compaction (followed by
> hugepage allocations) shows that kernel is able to restore a highly
> fragmented memory state to a fairly compacted memory state within <1 sec
> for a 32G system. Such data suggests that a more proactive compaction can
> help us allocate a large fraction of memory as hugepages keeping allocation
> latencies low.
> 
> In general, compaction can introduce unexpected latencies for applications
> that don't even have strong requirements for contiguous allocations. It is
> also hard to efficiently determine if the current system state can be
> easily compacted due to mixing of unmovable memory. Due to these reasons,
> automatic background compaction by the kernel itself is hard to get right
> in a way which does not hurt unsuspecting applications or waste CPU cycles.

We do trigger background compaction under high-order pressure from the
page allocator by waking up kcompactd. Why is that not sufficient?

> Even with these caveats, pro-active compaction can still be very useful in
> certain scenarios to reduce hugepage allocation latencies. This callback
> interface allows drivers to drive compaction based on their own policies
> like the current level of external fragmentation for a particular order,
> system load etc.

So we do not trust the core MM to make a reasonable decision while we
give a free ticket to modules. How does this make any sense at all? How
is a random module going to make a more informed decision when it has
less visibility into the overall MM situation?

If you need to control compaction from the userspace you have an
interface for that.  It is also completely unexplained why you need a
completion callback.

That being said, this looks like a terrible idea to me.

> Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
> ---
>  include/linux/compaction.h | 10 ++++++++++
>  mm/compaction.c            | 20 ++++++++++++++------
>  mm/internal.h              |  2 ++
>  3 files changed, 26 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 9569e7c786d3..1ea828450fa2 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -58,6 +58,16 @@ enum compact_result {
>  	COMPACT_SUCCESS,
>  };
>  
> +/* Callback function to determine if compaction is finished. */
> +typedef enum compact_result (*compact_finished_cb)(
> +	struct zone *zone, int order);
> +
> +enum compact_result compact_zone_order(struct zone *zone, int order,
> +		gfp_t gfp_mask, enum compact_priority prio,
> +		unsigned int alloc_flags, int classzone_idx,
> +		struct page **capture,
> +		compact_finished_cb compact_finished_cb);
> +
>  struct alloc_context; /* in mm/internal.h */
>  
>  /*
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 952dc2fb24e5..73e2e9246bc4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1872,6 +1872,9 @@ static enum compact_result __compact_finished(struct compact_control *cc)
>  			return COMPACT_PARTIAL_SKIPPED;
>  	}
>  
> +	if (cc->compact_finished_cb)
> +		return cc->compact_finished_cb(cc->zone, cc->order);
> +
>  	if (is_via_compact_memory(cc->order))
>  		return COMPACT_CONTINUE;
>  
> @@ -2274,10 +2277,11 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>  	return ret;
>  }
>  
> -static enum compact_result compact_zone_order(struct zone *zone, int order,
> +enum compact_result compact_zone_order(struct zone *zone, int order,
>  		gfp_t gfp_mask, enum compact_priority prio,
>  		unsigned int alloc_flags, int classzone_idx,
> -		struct page **capture)
> +		struct page **capture,
> +		compact_finished_cb compact_finished_cb)
>  {
>  	enum compact_result ret;
>  	struct compact_control cc = {
> @@ -2293,10 +2297,11 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
>  					MIGRATE_ASYNC :	MIGRATE_SYNC_LIGHT,
>  		.alloc_flags = alloc_flags,
>  		.classzone_idx = classzone_idx,
> -		.direct_compaction = true,
> +		.direct_compaction = !compact_finished_cb,
>  		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
>  		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> -		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
> +		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY),
> +		.compact_finished_cb = compact_finished_cb
>  	};
>  	struct capture_control capc = {
>  		.cc = &cc,
> @@ -2313,11 +2318,13 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
>  	VM_BUG_ON(!list_empty(&cc.freepages));
>  	VM_BUG_ON(!list_empty(&cc.migratepages));
>  
> -	*capture = capc.page;
> +	if (capture)
> +		*capture = capc.page;
>  	current->capture_control = NULL;
>  
>  	return ret;
>  }
> +EXPORT_SYMBOL(compact_zone_order);
>  
>  int sysctl_extfrag_threshold = 500;
>  
> @@ -2361,7 +2368,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>  		}
>  
>  		status = compact_zone_order(zone, order, gfp_mask, prio,
> -				alloc_flags, ac_classzone_idx(ac), capture);
> +				alloc_flags, ac_classzone_idx(ac), capture,
> +				NULL);
>  		rc = max(status, rc);
>  
>  		/* The allocation should succeed, stop compacting */
> diff --git a/mm/internal.h b/mm/internal.h
> index e32390802fd3..f873f7c2b9dc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -11,6 +11,7 @@
>  #include <linux/mm.h>
>  #include <linux/pagemap.h>
>  #include <linux/tracepoint-defs.h>
> +#include <linux/compaction.h>
>  
>  /*
>   * The set of flags that only affect watermark checking and reclaim
> @@ -203,6 +204,7 @@ struct compact_control {
>  	bool whole_zone;		/* Whole zone should/has been scanned */
>  	bool contended;			/* Signal lock or sched contention */
>  	bool rescan;			/* Rescanning the same pageblock */
> +	compact_finished_cb compact_finished_cb;
>  };
>  
>  /*
> -- 
> 2.21.0

Nitin Gupta Sept. 10, 2019, 10:27 p.m. UTC | #2

> -----Original Message-----
> From: owner-linux-mm@kvack.org <owner-linux-mm@kvack.org> On Behalf
> Of Michal Hocko
> Sent: Tuesday, September 10, 2019 1:19 PM
> To: Nitin Gupta <nigupta@nvidia.com>
> Cc: akpm@linux-foundation.org; vbabka@suse.cz;
> mgorman@techsingularity.net; dan.j.williams@intel.com;
> khalid.aziz@oracle.com; Matthew Wilcox <willy@infradead.org>; Yu Zhao
> <yuzhao@google.com>; Qian Cai <cai@lca.pw>; Andrey Ryabinin
> <aryabinin@virtuozzo.com>; Allison Randal <allison@lohutok.net>; Mike
> Rapoport <rppt@linux.vnet.ibm.com>; Thomas Gleixner
> <tglx@linutronix.de>; Arun KS <arunks@codeaurora.org>; Wei Yang
> <richard.weiyang@gmail.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org
> Subject: Re: [PATCH] mm: Add callback for defining compaction completion
> 
> On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as hugepages.
> > However, on a running system, higher order allocations can fail if the
> > memory is fragmented. Linux kernel currently does on-demand compaction as
> > we request more hugepages but this style of compaction incurs very high
> > latency. Experiments with one-time full memory compaction (followed by
> > hugepage allocations) shows that kernel is able to restore a highly
> > fragmented memory state to a fairly compacted memory state within <1 sec
> > for a 32G system. Such data suggests that a more proactive compaction can
> > help us allocate a large fraction of memory as hugepages keeping allocation
> > latencies low.
> >
> > In general, compaction can introduce unexpected latencies for applications
> > that don't even have strong requirements for contiguous allocations. It is
> > also hard to efficiently determine if the current system state can be
> > easily compacted due to mixing of unmovable memory. Due to these reasons,
> > automatic background compaction by the kernel itself is hard to get right
> > in a way which does not hurt unsuspecting applications or waste CPU cycles.
> 
> We do trigger background compaction on a high order pressure from the
> page allocator by waking up kcompactd. Why is that not sufficient?
> 

Whenever kcompactd is woken up, it does just enough work to create
one free page of the given order (compact_control.order) or higher.

Such a design causes very high latency for workloads where we want
to allocate lots of hugepages in a short period of time. With pro-active
compaction we can hide much of this latency. For some more background
discussion and data, please see this thread:

https://patchwork.kernel.org/patch/11098289/

> > Even with these caveats, pro-active compaction can still be very
> > useful in certain scenarios to reduce hugepage allocation latencies.
> > This callback interface allows drivers to drive compaction based on
> > their own policies like the current level of external fragmentation
> > for a particular order, system load etc.
> 
> So we do not trust the core MM to make a reasonable decision while we give
> a free ticket to modules. How does this make any sense at all? How is a
> random module going to make a more informed decision when it has less
> visibility on the overal MM situation.
>

Embedding any specific policy (like: keep external fragmentation for order-9
between 30-40%) within MM core looks like a bad idea. As a driver, we
can easily measure parameters like system load, current fragmentation level
for any order in any zone, etc. to make an informed decision.
See the thread I referred to above for more background discussion.

> If you need to control compaction from the userspace you have an interface
> for that.  It is also completely unexplained why you need a completion
> callback.
> 

/proc/sys/vm/compact_memory does whole system compaction which is
often too much as a pro-active compaction strategy. To get more control
over how much compaction work to do, I have added a compaction callback
which controls how much work is done in one compaction cycle.
 
For example, as a test for this patch, I have a small test driver which defines
[low, high] external fragmentation thresholds for the HPAGE_ORDER. Whenever
extfrag is within this range, I run compact_zone_order with a callback which
returns COMPACT_CONTINUE while extfrag > low threshold and returns
COMPACT_PARTIAL_SKIPPED when extfrag <= low.
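
A minimal userspace sketch of the threshold callback described above, for
illustration only. It is not the linked driver: `struct zone` here is a
hypothetical stand-in that only carries an extfrag percentage,
`mock_compact_zone()` and `maybe_compact()` are made-up helpers standing in
for compact_zone_order(), and the 30/40 thresholds are assumed values.

```c
#include <assert.h>
#include <stddef.h>

/* Two of the kernel's compact_result values, redeclared for illustration. */
enum compact_result {
	COMPACT_CONTINUE,
	COMPACT_PARTIAL_SKIPPED,
};

/* Hypothetical stand-in for struct zone: tracks only an extfrag percentage. */
struct zone {
	int extfrag;	/* external fragmentation for the target order, 0..100 */
};

/* Same shape as the callback type added by the patch. */
typedef enum compact_result (*compact_finished_cb)(struct zone *zone, int order);

/* Assumed policy thresholds, in percent. */
static const int extfrag_low = 30;
static const int extfrag_high = 40;

/* Keep compacting while extfrag is above the low threshold. */
static enum compact_result extfrag_finished(struct zone *zone, int order)
{
	(void)order;	/* a real policy would measure extfrag for this order */
	return zone->extfrag > extfrag_low ?
		COMPACT_CONTINUE : COMPACT_PARTIAL_SKIPPED;
}

/* Mock compaction loop: every step lowers extfrag by one point. */
static int mock_compact_zone(struct zone *zone, int order,
			     compact_finished_cb finished)
{
	int steps = 0;

	assert(finished != NULL);
	while (zone->extfrag > 0 && finished(zone, order) == COMPACT_CONTINUE) {
		zone->extfrag--;
		steps++;
	}
	return steps;
}

/* Start a cycle only when extfrag lies within [low, high]. */
static int maybe_compact(struct zone *zone, int order)
{
	if (zone->extfrag < extfrag_low || zone->extfrag > extfrag_high)
		return 0;
	return mock_compact_zone(zone, order, extfrag_finished);
}
```

With these assumed numbers, a zone that starts at extfrag 38 is compacted
down to 30 and then left alone, while a zone outside the [30, 40] window is
not touched at all.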

Here's the code for this sample driver:
https://gitlab.com/nigupta/memstress/snippets/1893847

Maybe this code can be added to Documentation/...

Thanks,
Nitin


Michal Hocko Sept. 11, 2019, 6:45 a.m. UTC | #3

On Tue 10-09-19 22:27:53, Nitin Gupta wrote:
[...]
> > On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > > For some applications we need to allocate almost all memory as
> > > hugepages.
> > > However, on a running system, higher order allocations can fail if the
> > > memory is fragmented. Linux kernel currently does on-demand
> > > compaction
> > > as we request more hugepages but this style of compaction incurs very
> > > high latency. Experiments with one-time full memory compaction
> > > (followed by hugepage allocations) shows that kernel is able to
> > > restore a highly fragmented memory state to a fairly compacted memory
> > > state within <1 sec for a 32G system. Such data suggests that a more
> > > proactive compaction can help us allocate a large fraction of memory
> > > as hugepages keeping allocation latencies low.
> > >
> > > In general, compaction can introduce unexpected latencies for
> > > applications that don't even have strong requirements for contiguous
> > > allocations.

Could you expand on this a bit please? Gfp flags allow you to express how
hard the allocator should try to compact for high-order allocations. Hugetlb
allocations tend to require retrying and heavy compaction to succeed and
the success rate tends to be pretty high from my experience.  Why is that
not the case for you?

> > > It is also hard to efficiently determine if the current
> > > system state can be easily compacted due to mixing of unmovable
> > > memory. Due to these reasons, automatic background compaction by the
> > > kernel itself is hard to get right in a way which does not hurt
> > > unsuspecting applications or waste CPU cycles.
> > 
> > We do trigger background compaction on a high order pressure from the
> > page allocator by waking up kcompactd. Why is that not sufficient?
> > 
> 
> Whenever kcompactd is woken up, it does just enough work to create
> one free page of the given order (compaction_control.order) or higher.

This is an implementation detail IMHO. I am pretty sure we can do a
better auto tuning when there is an indication of a constant flow of
high order requests. This is no different from the memory reclaim in
principle. Just because the kswapd autotuning does not fit your
particular workload, you wouldn't want to export direct reclaim
functionality and call it from a random module. That is just doomed to
fail because having different subsystems in control just leads to
decisions going against each other.

> Such a design causes very high latency for workloads where we want
> to allocate lots of hugepages in short period of time. With pro-active
> compaction we can hide much of this latency. For some more background
> discussion and data, please see this thread:
> 
> https://patchwork.kernel.org/patch/11098289/

I am aware of that thread. And there are two things. You claim the
allocation success rate is unnecessarily lower and that the direct
latency is high. You simply cannot assume both low latency and high
success rate. Compaction is not free. Somebody has to do the work.
Hiding it into the background means that you are eating a lot of cycles
from everybody else (think of a workload running in a restricted cpu
controller just doing a lot of work in an unaccounted context).

That being said, you really have to be prepared to pay a price for
a precious resource like high order pages.

On the other hand, I do understand that high latency is not really
desired for more optimistic allocation requests with a reasonable
fallback strategy. Those would benefit from kcompactd not giving up too
early.

> > > Even with these caveats, pro-active compaction can still be very
> > > useful in certain scenarios to reduce hugepage allocation latencies.
> > > This callback interface allows drivers to drive compaction based on
> > > their own policies like the current level of external fragmentation
> > > for a particular order, system load etc.
> > 
> > So we do not trust the core MM to make a reasonable decision while we give
> > a free ticket to modules. How does this make any sense at all? How is a
> > random module going to make a more informed decision when it has less
> > visibility on the overal MM situation.
> >
> 
> Embedding any specific policy (like: keep external fragmentation for order-9
> between 30-40%) within MM core looks like a bad idea.

Agreed

> As a driver, we
> can easily measure parameters like system load, current fragmentation level
> for any order in any zone etc. to make an informed decision.
> See the thread I refereed above for more background discussion.

Do that from the userspace then. If there is an insufficient interface
to do that then let's talk about what is missing.

> > If you need to control compaction from the userspace you have an interface
> > for that.  It is also completely unexplained why you need a completion
> > callback.
> > 
> 
> /proc/sys/vm/compact_memory does whole system compaction which is
> often too much as a pro-active compaction strategy. To get more control
> over how to compaction work to do, I have added a compaction callback
> which controls how much work is done in one compaction cycle.

Why is a more fine grained control really needed? Sure compacting
everything is heavy weight but how often do you have to do that? Your
changelog starts with a usecase when there is a high demand for large
pages at the startup. What prevents you from doing compaction at that
time? If the workload is longterm then the initial price should just pay
back, no?

Nitin Gupta Sept. 11, 2019, 10:33 p.m. UTC | #4

On Wed, 2019-09-11 at 08:45 +0200, Michal Hocko wrote:
> On Tue 10-09-19 22:27:53, Nitin Gupta wrote:
> [...]
> > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > > > For some applications we need to allocate almost all memory as
> > > > hugepages.
> > > > However, on a running system, higher order allocations can fail if the
> > > > memory is fragmented. Linux kernel currently does on-demand
> > > > compaction
> > > > as we request more hugepages but this style of compaction incurs very
> > > > high latency. Experiments with one-time full memory compaction
> > > > (followed by hugepage allocations) shows that kernel is able to
> > > > restore a highly fragmented memory state to a fairly compacted memory
> > > > state within <1 sec for a 32G system. Such data suggests that a more
> > > > proactive compaction can help us allocate a large fraction of memory
> > > > as hugepages keeping allocation latencies low.
> > > > 
> > > > In general, compaction can introduce unexpected latencies for
> > > > applications that don't even have strong requirements for contiguous
> > > > allocations.
> 
> Could you expand on this a bit please? Gfp flags allow to express how
> much the allocator try and compact for a high order allocations. Hugetlb
> allocations tend to require retrying and heavy compaction to succeed and
> the success rate tends to be pretty high from my experience.  Why that
> is not case in your case?
> 

Yes, I have the same observation: with `GFP_TRANSHUGE |
__GFP_RETRY_MAYFAIL` I get a very good success rate (~90% of free RAM
allocated as hugepages). However, what I'm trying to point out is that this
high success rate comes with high allocation latencies (90th percentile
latency of 2206us). On the same system, the same high-order allocations
which hit the fast path have latency <5us.

> > > > It is also hard to efficiently determine if the current
> > > > system state can be easily compacted due to mixing of unmovable
> > > > memory. Due to these reasons, automatic background compaction by the
> > > > kernel itself is hard to get right in a way which does not hurt
> > > > unsuspecting applications or waste CPU cycles.
> > > 
> > > We do trigger background compaction on a high order pressure from the
> > > page allocator by waking up kcompactd. Why is that not sufficient?
> > > 
> > 
> > Whenever kcompactd is woken up, it does just enough work to create
> > one free page of the given order (compaction_control.order) or higher.
> 
> This is an implementation detail IMHO. I am pretty sure we can do a
> better auto tuning when there is an indication of a constant flow of
> high order requests. This is no different from the memory reclaim in
> principle. Just because the kswapd autotuning not fitting with your
> particular workload you wouldn't want to export direct reclaim
> functionality and call it from a random module. That is just doomed to
> fail because different subsystems in control just leads to decisions
> going against each other.
> 

I don't want to go the route of adding any auto-tuning/prediction code to
control compaction in the kernel. I'm more inclined towards extending
existing interfaces to allow compaction behavior to be controlled either
from userspace or a kernel driver. Letting a random module control
compaction or a root process pumping new tunables from sysfs is the same in
principle.

This patch is in the spirit of a simple extension to the existing
compact_zone_order() which allows either a kernel driver or userspace
(through sysfs) to control compaction.

Also, we should avoid drawing hard parallels between reclaim and
compaction: the former is often necessary for forward progress while the
latter is often an optimization. Since contiguous allocations are mostly
optimizations, it's good to expose hooks from the kernel that let the user
(through a driver or userspace) control compaction using their own heuristics.

I thought hard about what's lacking in the current userspace interface (sysfs):
 - /proc/sys/vm/compact_memory: full system compaction is not viable as
   a pro-active compaction strategy.
 - possibly expose [low, high] threshold values for each node and let
   kcompactd act on them. This was the approach in the original patch I
   linked earlier. The problem here is that it introduces too many tunables.

Considering the above, I came up with this callback approach which makes it
trivial to introduce user-specific policies for compaction. It puts the
onus of system stability and responsiveness in the hands of the user without
burdening admins with more tunables or adding crystal balls to the kernel.

> > Such a design causes very high latency for workloads where we want
> > to allocate lots of hugepages in short period of time. With pro-active
> > compaction we can hide much of this latency. For some more background
> > discussion and data, please see this thread:
> > 
> > https://patchwork.kernel.org/patch/11098289/
> 
> I am aware of that thread. And there are two things. You claim the
> allocation success rate is unnecessarily lower and that the direct
> latency is high. You simply cannot assume both low latency and high
> success rate. Compaction is not free. Somebody has to do the work.
> Hiding it into the background means that you are eating a lot of cycles
> from everybody else (think of a workload running in a restricted cpu
> controller just doing a lot of work in an unaccounted context).
> 
> That being said you really have to be prepared to pay a price for
> precious resource like high order pages.
> 
> On the other hand I do understand that high latency is not really
> desired for a more optimistic allocation requests with a reasonable
> fallback strategy. Those would benefit from kcompactd not giving up too
> early.

Doing pro-active compaction in the background has merit in reducing
high-order alloc latency. It's true that it would end up burning cycles with
little benefit in some cases. It's up to the user of this new interface to
back off if it detects such a case.

>  
> > > > Even with these caveats, pro-active compaction can still be very
> > > > useful in certain scenarios to reduce hugepage allocation latencies.
> > > > This callback interface allows drivers to drive compaction based on
> > > > their own policies like the current level of external fragmentation
> > > > for a particular order, system load etc.
> > > 
> > > So we do not trust the core MM to make a reasonable decision while we
> > > give
> > > a free ticket to modules. How does this make any sense at all? How is a
> > > random module going to make a more informed decision when it has less
> > > visibility on the overal MM situation.
> > > 
> > 
> > Embedding any specific policy (like: keep external fragmentation for
> > order-9
> > between 30-40%) within MM core looks like a bad idea.
> 
> Agreed
> 
> > As a driver, we
> > can easily measure parameters like system load, current fragmentation
> > level
> > for any order in any zone etc. to make an informed decision.
> > See the thread I refereed above for more background discussion.
> 
> Do that from the userspace then. If there is an insufficient interface
> to do that then let's talk about what is missing.
> 

Currently we only have a proc interface to do full system compaction.
Here's what's missing from this interface: the ability to set per-node,
per-zone, per-order [low, high] extfrag thresholds. This is what I exposed in
my earlier patch titled 'proactive compaction'. Discussion there made me
realize these are too many tunables and any admin would always get them wrong.
Even if the intended user of these sysfs nodes is some monitoring daemon, it's
tempting to mess with them.

With a callback extension to compact_zone_order(), implementing any of the
per-node, per-zone, per-order limits is straightforward, and the driver can
expose debugfs/sysfs nodes if needed at all. (The nvcompact.c driver[1]
exposes these tunables as debugfs nodes, for example.)

[1] https://gitlab.com/nigupta/linux/snippets/1894161
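
As a rough illustration of how per-node, per-zone, per-order limits could be
kept on the driver side, here is a small userspace sketch. It makes no claim
about how nvcompact.c is written: `struct extfrag_limit`, `find_limit()`,
`should_compact()` and every value in the table are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy entry: one [low, high] window per (node, zone, order). */
struct extfrag_limit {
	int node;
	int zone_idx;
	int order;
	int low;	/* percent */
	int high;	/* percent */
};

/* Illustrative values only. */
static const struct extfrag_limit limits[] = {
	{ .node = 0, .zone_idx = 2, .order = 9, .low = 30, .high = 40 },
	{ .node = 1, .zone_idx = 2, .order = 9, .low = 20, .high = 35 },
};

/* Linear lookup; returns NULL when no policy covers the triple. */
static const struct extfrag_limit *find_limit(int node, int zone_idx, int order)
{
	size_t i;

	for (i = 0; i < sizeof(limits) / sizeof(limits[0]); i++) {
		const struct extfrag_limit *l = &limits[i];

		if (l->node == node && l->zone_idx == zone_idx &&
		    l->order == order)
			return l;
	}
	return NULL;
}

/* Returns 1 when the measured extfrag falls inside the configured window. */
static int should_compact(int node, int zone_idx, int order, int extfrag)
{
	const struct extfrag_limit *l = find_limit(node, zone_idx, order);

	assert(extfrag >= 0 && extfrag <= 100);
	return l && extfrag >= l->low && extfrag <= l->high;
}
```

A zone or order without an entry is simply never compacted by this policy,
which keeps the default behaviour unchanged for everything the driver does
not explicitly care about.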

> > > If you need to control compaction from the userspace you have an
> > > interface
> > > for that.  It is also completely unexplained why you need a completion
> > > callback.
> > > 
> > 
> > /proc/sys/vm/compact_memory does whole system compaction which is
> > often too much as a pro-active compaction strategy. To get more control
> > over how to compaction work to do, I have added a compaction callback
> > which controls how much work is done in one compaction cycle.
> 
> Why is a more fine grained control really needed? Sure compacting
> everything is heavy weight but how often do you have to do that. Your
> changelog starts with a usecase when there is a high demand for large
> pages at the startup. What prevents you do compaction at that time. If
> the workload is longterm then the initial price should just pay back,
> no?
> 

Compacting all NUMA nodes is not practical on large systems in response to,
say, launching a DB process on a certain node. Also, the frequency of
hugepage allocation bursts may be completely unpredictable. That's why
background compaction is useful: it can keep extfrag in check, say while the
system is lightly loaded (an ad hoc policy), keeping high-order allocation
latencies low whenever the burst shows up.
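
As a rough illustration of such an ad hoc policy, the sketch below decides
whether to run background compaction for the next interval. It is an
assumption-laden example, not code from the patch or the driver: struct
bg_policy, its fields and bg_should_compact() are made-up names, and "load"
and "extfrag" are taken to be percentages that the caller measures somehow.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy knobs; none of these names come from the kernel. */
struct bg_policy {
	int extfrag_low_pct;	/* stop once fragmentation falls to this level */
	int extfrag_high_pct;	/* start once fragmentation rises above this level */
	int max_load_pct;	/* only run while system load is at or below this */
};

/*
 * Decide whether background compaction should run for the next interval.
 * 'running' is the previous decision; it gives the low/high band hysteresis,
 * so compaction does not flap on and off around a single threshold.
 */
static bool bg_should_compact(const struct bg_policy *p, int extfrag_pct,
			      int load_pct, bool running)
{
	assert(p->extfrag_low_pct <= p->extfrag_high_pct);

	if (load_pct > p->max_load_pct)
		return false;		/* system is busy: stay out of the way */
	if (running)
		return extfrag_pct > p->extfrag_low_pct;
	return extfrag_pct > p->extfrag_high_pct;
}
```

The [low, high] pair acts as a hysteresis band: work starts only above the
high mark and, once started, continues down to the low mark, provided the
load stays under the limit.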

- Nitin

Michal Hocko Sept. 12, 2019, 11:27 a.m. UTC | #5

On Wed 11-09-19 22:33:39, Nitin Gupta wrote:
> On Wed, 2019-09-11 at 08:45 +0200, Michal Hocko wrote:
> > On Tue 10-09-19 22:27:53, Nitin Gupta wrote:
> > [...]
> > > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > > > > For some applications we need to allocate almost all memory as
> > > > > hugepages.
> > > > > However, on a running system, higher order allocations can fail if the
> > > > > memory is fragmented. Linux kernel currently does on-demand
> > > > > compaction
> > > > > as we request more hugepages but this style of compaction incurs very
> > > > > high latency. Experiments with one-time full memory compaction
> > > > > (followed by hugepage allocations) shows that kernel is able to
> > > > > restore a highly fragmented memory state to a fairly compacted memory
> > > > > state within <1 sec for a 32G system. Such data suggests that a more
> > > > > proactive compaction can help us allocate a large fraction of memory
> > > > > as hugepages keeping allocation latencies low.
> > > > > 
> > > > > In general, compaction can introduce unexpected latencies for
> > > > > applications that don't even have strong requirements for contiguous
> > > > > allocations.
> > 
> > Could you expand on this a bit please? Gfp flags allow to express how
> > much the allocator try and compact for a high order allocations. Hugetlb
> > allocations tend to require retrying and heavy compaction to succeed and
> > the success rate tends to be pretty high from my experience.  Why that
> > is not case in your case?
> > 
> 
> Yes, I have the same observation: with `GFP_TRANSHUGE |
> __GFP_RETRY_MAYFAIL` I get very good success rate (~90% of free RAM
> allocated as hugepages). However, what I'm trying to point out is that this
> high success rate comes with high allocation latencies (90th percentile
> latency of 2206us). On the same system, the same high-order allocations
> which hit the fast path have latency <5us.

Sure, that is no free cake. Unless the direct compaction can do
something fundamentally different than the background one, this is the
amount of work that has to be done for those situations no matter
what. Lower latency is certainly attractive but the other part of the
equation is _who_ is going to pay for that.

> > > > > It is also hard to efficiently determine if the current
> > > > > system state can be easily compacted due to mixing of unmovable
> > > > > memory. Due to these reasons, automatic background compaction by the
> > > > > kernel itself is hard to get right in a way which does not hurt
> > > > > unsuspecting
> > > > applications or waste CPU cycles.
> > > > 
> > > > We do trigger background compaction on a high order pressure from the
> > > > page allocator by waking up kcompactd. Why is that not sufficient?
> > > > 
> > > 
> > > Whenever kcompactd is woken up, it does just enough work to create
> > > one free page of the given order (compaction_control.order) or higher.
> > 
> > This is an implementation detail IMHO. I am pretty sure we can do a
> > better auto tuning when there is an indication of a constant flow of
> > high order requests. This is no different from the memory reclaim in
> > principle. Just because the kswapd autotuning not fitting with your
> > particular workload you wouldn't want to export direct reclaim
> > functionality and call it from a random module. That is just doomed to
> > fail because different subsystems in control just leads to decisions
> > going against each other.
> > 
> 
> I don't want to go the route of adding any auto-tuning/perdiction code to
> control compaction in the kernel. I'm more inclined towards extending
> existing interfaces to allow compaction behavior to be controlled either
> from userspace or a kernel driver. Letting a random module control
> compaction or a root process pumping new tunables from sysfs is the same in
> principle.

Then I would start by listing shortcomings of the existing interfaces
and examples of how they could be extended for specific usecases.

> This patch is in the spirit of simple extension to existing
> compaction_zone_order() which allows either a kernel driver or userspace
> (through sysfs) to control compaction.
> 
> Also, we should avoid driving hard parallels between reclaim and
> compaction: the former is often necessary for forward progress while the
> latter is often an optimization. Since contiguous allocations are mostly
> optimizations it's good to expose hooks from the kernel that let user
> (through a driver or userspace) control it using its own heuristics.

This really depends on the allocation failure fallback strategy. If your
specific case can gracefully fall back to smaller allocations then all
fine, this is just an optimization. But if you are an order-3 GFP_KERNEL
request then not making forward progress is a matter for the OOM killer.
So no, we are not only talking about optimization.
 
> I thought hard about whats lacking in current userspace interface (sysfs):
>  - /proc/sys/vm/compact_memory: full system compaction is not an option as
>    a viable pro-active compaction strategy.

Because...

>  - possibly expose [low, high] threshold values for each node and let
>    kcompactd act on them. This was my approach for my original patch I
>    linked earlier. Problem here is that it introduces too many tunables.

I was playing with a similar idea in the past as well. But this
is quite tricky. A watermark API makes sense if you can somehow enforce
the watermarks. What if the low watermark cannot be achieved due to excessive
fragmentation that cannot be handled? Should the background daemon try
endlessly, consuming an unbounded amount of cpu cycles? The reclaim can act
by triggering the OOM killer and freeing up some memory. There is nothing
actionable like that for the compaction.

> Considering the above, I came up with this callback approach which make it
> trivial to introduce user specific policies for compaction. It puts the
> onus of system stability, responsive in the hands of user without burdening
> admins with more tunables or adding crystal balls to kernel.

It might seem trivial to use but I am not really sure that the consequences
of using it are trivial to reason about.

[...]
> > Do that from the userspace then. If there is an insufficient interface
> > to do that then let's talk about what is missing.
> > 
> 
> Currently we only have a proc interface to do full system compaction.
> Here's what missing from this interface: ability to set per-node, per-zone,
> per-order, [low, high] extfrag thresholds.

I would agree about a per-node interface because we already do allow
per-node policies and that's why high order demand might differ. But I
would be really against any per-zone interfaces because zones are an
internal implementation detail of the page allocator. We've made
mistakes by exposing that to userspace in the past and we shouldn't
repeat them. Per-order is quite questionable without seeing explicit
usecases and data. E.g. are there usecases that save cycles (and how many)
by compacting only up to order-3 compared to full-fledged compaction?

Bharath Vedartham Sept. 12, 2019, 11:41 a.m. UTC | #6

Hi Nitin,
On Wed, Sep 11, 2019 at 10:33:39PM +0000, Nitin Gupta wrote:
> On Wed, 2019-09-11 at 08:45 +0200, Michal Hocko wrote:
> > On Tue 10-09-19 22:27:53, Nitin Gupta wrote:
> > [...]
> > > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > > > > For some applications we need to allocate almost all memory as
> > > > > hugepages.
> > > > > However, on a running system, higher order allocations can fail if the
> > > > > memory is fragmented. Linux kernel currently does on-demand
> > > > > compaction
> > > > > as we request more hugepages but this style of compaction incurs very
> > > > > high latency. Experiments with one-time full memory compaction
> > > > > (followed by hugepage allocations) shows that kernel is able to
> > > > > restore a highly fragmented memory state to a fairly compacted memory
> > > > > state within <1 sec for a 32G system. Such data suggests that a more
> > > > > proactive compaction can help us allocate a large fraction of memory
> > > > > as hugepages keeping allocation latencies low.
> > > > > 
> > > > > In general, compaction can introduce unexpected latencies for
> > > > > applications that don't even have strong requirements for contiguous
> > > > > allocations.
> > 
> > Could you expand on this a bit please? Gfp flags allow to express how
> > much the allocator try and compact for a high order allocations. Hugetlb
> > allocations tend to require retrying and heavy compaction to succeed and
> > the success rate tends to be pretty high from my experience.  Why that
> > is not case in your case?
> > 
The link to the driver you sent on gitlab is not working :(
> Yes, I have the same observation: with `GFP_TRANSHUGE |
> __GFP_RETRY_MAYFAIL` I get very good success rate (~90% of free RAM
> allocated as hugepages). However, what I'm trying to point out is that this
> high success rate comes with high allocation latencies (90th percentile
> latency of 2206us). On the same system, the same high-order allocations
> which hit the fast path have latency <5us.
> 
> > > > > It is also hard to efficiently determine if the current
> > > > > system state can be easily compacted due to mixing of unmovable
> > > > > memory. Due to these reasons, automatic background compaction by the
> > > > > kernel itself is hard to get right in a way which does not hurt
> > > > > unsuspecting
> > > > applications or waste CPU cycles.
> > > > 
> > > > We do trigger background compaction on a high order pressure from the
> > > > page allocator by waking up kcompactd. Why is that not sufficient?
> > > > 
> > > 
> > > Whenever kcompactd is woken up, it does just enough work to create
> > > one free page of the given order (compaction_control.order) or higher.
> > 
> > This is an implementation detail IMHO. I am pretty sure we can do a
> > better auto tuning when there is an indication of a constant flow of
> > high order requests. This is no different from the memory reclaim in
> > principle. Just because the kswapd autotuning not fitting with your
> > particular workload you wouldn't want to export direct reclaim
> > functionality and call it from a random module. That is just doomed to
> > fail because different subsystems in control just leads to decisions
> > going against each other.
> > 
> 
> I don't want to go the route of adding any auto-tuning/perdiction code to
> control compaction in the kernel. I'm more inclined towards extending
> existing interfaces to allow compaction behavior to be controlled either
> from userspace or a kernel driver. Letting a random module control
> compaction or a root process pumping new tunables from sysfs is the same in
> principle.
Do you think a kernel module and a root user process have the same
privileges? You can only export so much info through sysfs. Also,
wouldn't this introduce more tunables, per-driver tunables to be more
specific?
> This patch is in the spirit of simple extension to existing
> compaction_zone_order() which allows either a kernel driver or userspace
> (through sysfs) to control compaction.
> 
> Also, we should avoid driving hard parallels between reclaim and
> compaction: the former is often necessary for forward progress while the
> latter is often an optimization. Since contiguous allocations are mostly
> optimizations it's good to expose hooks from the kernel that let user
> (through a driver or userspace) control it using its own heuristics.
How is compaction an optimization? If I have a memory zone which has
more memory pages than zone_highwmark and higher order allocations are
failing because the memory is awfully fragmented, we need compaction
to make further progress here. I have seen workloads where kswapd won't help
in progressing further because memory is so awfully fragmented.
The workload I am quoting is the thpscale_workload from Mel Gorman's mmtests
workloads.
> 
> I thought hard about whats lacking in current userspace interface (sysfs):
>  - /proc/sys/vm/compact_memory: full system compaction is not an option as
>    a viable pro-active compaction strategy.
Don't we have a sysfs interface to compact memory per node through
/sys/devices/system/node/node<node_number>/compact? CONFIG_SYSFS and
CONFIG_NUMA are enabled on a lot of systems. Why are we not talking
about this?
I don't think kcompactd can go finer grained than per node. Per-zone is
an option but that would be overkill, I feel.
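
For reference, triggering that per-node interface from a userspace program
could look like the sketch below. The sysfs path is the one named above;
the helper names node_compact_path() and compact_node() are made up for this
example, and the write needs root and a kernel that provides the file.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Format the per-node compaction trigger path; returns snprintf()'s result. */
static int node_compact_path(int node, char *buf, size_t len)
{
	assert(node >= 0);
	return snprintf(buf, len, "/sys/devices/system/node/node%d/compact", node);
}

/*
 * Ask the kernel to compact one NUMA node by writing to its sysfs file.
 * Returns 0 on success, -1 on failure (for example: no such node, not
 * running as root, or the file is not provided by this kernel).
 */
static int compact_node(int node)
{
	char path[64];
	int n = node_compact_path(node, path, sizeof(path));
	int fd;
	ssize_t written;

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	written = write(fd, "1", 1);
	close(fd);
	return written == 1 ? 0 : -1;
}
```

Calling compact_node(0) as root would request compaction of node 0 only.
It still compacts the whole node, so it does not give any control over how
much work is done.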
>  - possibly expose [low, high] threshold values for each node and let
>    kcompactd act on them. This was my approach for my original patch I
>    linked earlier. Problem here is that it introduces too many tunables.
> 
> Considering the above, I came up with this callback approach which make it
> trivial to introduce user specific policies for compaction. It puts the
> onus of system stability, responsive in the hands of user without burdening
> admins with more tunables or adding crystal balls to kernel.
I have the same question as Michal: won't this cause conflicts
among different subsystems? If you did answer it in your previous
mails, could you point me to it, as I may have missed it :)
> > > Such a design causes very high latency for workloads where we want
> > > to allocate lots of hugepages in short period of time. With pro-active
> > > compaction we can hide much of this latency. For some more background
> > > discussion and data, please see this thread:
> > > 
> > > https://patchwork.kernel.org/patch/11098289/
> > 
> > I am aware of that thread. And there are two things. You claim the
> > allocation success rate is unnecessarily lower and that the direct
> > latency is high. You simply cannot assume both low latency and high
> > success rate. Compaction is not free. Somebody has to do the work.
> > Hiding it into the background means that you are eating a lot of cycles
> > from everybody else (think of a workload running in a restricted cpu
> > controller just doing a lot of work in an unaccounted context).
> > 
> > That being said you really have to be prepared to pay a price for
> > precious resource like high order pages.
> > 
> > On the other hand I do understand that high latency is not really
> > desired for a more optimistic allocation requests with a reasonable
> > fallback strategy. Those would benefit from kcompactd not giving up too
> > early.
> 
> Doing pro-active compaction in background has merits in reducing reducing
> high-order alloc latency. Its true that it would end up burning cycles with
> little benefit in some cases. Its upto the user of this new interface to
> back off if it detects such a case.
Are these cycles worth considering in the big picture of reducing high
order allocation latency? 
> >  
> > > > > Even with these caveats, pro-active compaction can still be very
> > > > > useful in certain scenarios to reduce hugepage allocation latencies.
> > > > > This callback interface allows drivers to drive compaction based on
> > > > > their own policies like the current level of external fragmentation
> > > > > for a particular order, system load etc.
> > > > 
> > > > So we do not trust the core MM to make a reasonable decision while we
> > > > give
> > > > a free ticket to modules. How does this make any sense at all? How is a
> > > > random module going to make a more informed decision when it has less
> > > > visibility on the overal MM situation.
> > > > 
> > > 
> > > Embedding any specific policy (like: keep external fragmentation for
> > > order-9
> > > between 30-40%) within MM core looks like a bad idea.
> > 
> > Agreed
> > 
> > > As a driver, we
> > > can easily measure parameters like system load, current fragmentation
> > > level
> > > for any order in any zone etc. to make an informed decision.
> > > See the thread I refereed above for more background discussion.
> > 
> > Do that from the userspace then. If there is an insufficient interface
> > to do that then let's talk about what is missing.
> > 
> 
> Currently we only have a proc interface to do full system compaction.
> Here's what missing from this interface: ability to set per-node, per-zone,
> per-order, [low, high] extfrag thresholds. This is what I exposed in my
> earlier patch titled 'proactive compaction'. Discussion there made me realize
> these are too many tunables and any admin would always get them wrong. Even
> if intended user of these sysfs node is some monitoring daemon, its
> tempting to mess with them.
> 
> With a callback extension to compact_zone_order() implementing any of the
> per-node, per-zone, per-order limits is straightforward and if needed the
> driver can expose debugfs/sysfs nodes if needed at all. (nvcompact.c
> driver[1] exposes these tunables as debugfs nodes, for example).
> 
> [1] https://gitlab.com/nigupta/linux/snippets/1894161
Now, you're proposing a system where we have interfaces from each driver.
That could be more confusing for a sys admin to configure, I feel.

But what you're proposing really made me think about what kind
of tunables we want. Rather than just talking about tunables from the
mm subsystem, can we introduce tunables that indicate the behaviour of
workloads? Using this information from the user, we can look to optimize
reclaim and compaction for that workload.
If we have a tunable which can indicate that the kernel is running in an
environment where the workload will be performing a lot of
higher order allocations, we can improve memory reclaim and compaction
considering these parameters. One optimization I can think of is extending
kcompactd to compact more memory when a higher order allocation fails.

One of the biggest issues with having a discussion on proactive
reclaim/compaction is that the workloads are really unpredictable. 
Rather than working on tunables from the mm subsystem which help us take
action on memory pressure, can we talk about interfaces to hint about
workloads so that we can make better informed decisions in the mm
subsystem rather than involving other drivers?
> 
> > > > If you need to control compaction from the userspace you have an
> > > > interface
> > > > for that.  It is also completely unexplained why you need a completion
> > > > callback.
> > > > 
> > > 
> > > /proc/sys/vm/compact_memory does whole system compaction which is
> > > often too much as a pro-active compaction strategy. To get more control
> > > over how to compaction work to do, I have added a compaction callback
> > > which controls how much work is done in one compaction cycle.
> > 
> > Why is a more fine grained control really needed? Sure compacting
> > everything is heavy weight but how often do you have to do that. Your
> > changelog starts with a usecase when there is a high demand for large
> > pages at the startup. What prevents you do compaction at that time. If
> > the workload is longterm then the initial price should just pay back,
> > no?
> > 
> 
> Compacting all NUMA nodes is not practical on large systems in response to,
> say, launching a DB process on a certain node. Also, the frequency of
> hugepage allocation burts may be completely unpredictable. That's why
> background compaction can keep extfrag in check, say while system is
> lightly loaded (adhoc policy), keeping high-order allocation latencies low
> whenever the burst shows up.
> 
> - Nitin
> 
Thank you
Bharath

Nitin Gupta Sept. 12, 2019, 5:17 p.m. UTC | #7

On Thu, 2019-09-12 at 17:11 +0530, Bharath Vedartham wrote:
> Hi Nitin,
> On Wed, Sep 11, 2019 at 10:33:39PM +0000, Nitin Gupta wrote:
> > On Wed, 2019-09-11 at 08:45 +0200, Michal Hocko wrote:
> > > On Tue 10-09-19 22:27:53, Nitin Gupta wrote:
> > > [...]
> > > > > On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > > > > > For some applications we need to allocate almost all memory as
> > > > > > hugepages.
> > > > > > However, on a running system, higher order allocations can fail if
> > > > > > the
> > > > > > memory is fragmented. Linux kernel currently does on-demand
> > > > > > compaction
> > > > > > as we request more hugepages but this style of compaction incurs
> > > > > > very
> > > > > > high latency. Experiments with one-time full memory compaction
> > > > > > (followed by hugepage allocations) shows that kernel is able to
> > > > > > restore a highly fragmented memory state to a fairly compacted
> > > > > > memory
> > > > > > state within <1 sec for a 32G system. Such data suggests that a
> > > > > > more
> > > > > > proactive compaction can help us allocate a large fraction of
> > > > > > memory
> > > > > > as hugepages keeping allocation latencies low.
> > > > > > 
> > > > > > In general, compaction can introduce unexpected latencies for
> > > > > > applications that don't even have strong requirements for
> > > > > > contiguous
> > > > > > allocations.
> > > 
> > > Could you expand on this a bit please? Gfp flags allow to express how
> > > much the allocator try and compact for a high order allocations. Hugetlb
> > > allocations tend to require retrying and heavy compaction to succeed and
> > > the success rate tends to be pretty high from my experience.  Why that
> > > is not case in your case?
> > > 
> The link to the driver you send on gitlab is not working :(

Sorry about that, here's the correct link:
https://gitlab.com/nigupta/linux/snippets/1894161

> > Yes, I have the same observation: with `GFP_TRANSHUGE |
> > __GFP_RETRY_MAYFAIL` I get very good success rate (~90% of free RAM
> > allocated as hugepages). However, what I'm trying to point out is that
> > this
> > high success rate comes with high allocation latencies (90th percentile
> > latency of 2206us). On the same system, the same high-order allocations
> > which hit the fast path have latency <5us.
> > 
> > > > > > It is also hard to efficiently determine if the current
> > > > > > system state can be easily compacted due to mixing of unmovable
> > > > > > memory. Due to these reasons, automatic background compaction by
> > > > > > the
> > > > > > kernel itself is hard to get right in a way which does not hurt
> > > > > > unsuspecting
> > > > > applications or waste CPU cycles.
> > > > > 
> > > > > We do trigger background compaction on a high order pressure from
> > > > > the
> > > > > page allocator by waking up kcompactd. Why is that not sufficient?
> > > > > 
> > > > 
> > > > Whenever kcompactd is woken up, it does just enough work to create
> > > > one free page of the given order (compaction_control.order) or higher.
> > > 
> > > This is an implementation detail IMHO. I am pretty sure we can do a
> > > better auto tuning when there is an indication of a constant flow of
> > > high order requests. This is no different from the memory reclaim in
> > > principle. Just because the kswapd autotuning not fitting with your
> > > particular workload you wouldn't want to export direct reclaim
> > > functionality and call it from a random module. That is just doomed to
> > > fail because different subsystems in control just leads to decisions
> > > going against each other.
> > > 
> > 
> > I don't want to go the route of adding any auto-tuning/perdiction code to
> > control compaction in the kernel. I'm more inclined towards extending
> > existing interfaces to allow compaction behavior to be controlled either
> > from userspace or a kernel driver. Letting a random module control
> > compaction or a root process pumping new tunables from sysfs is the same
> > in
> > principle.
> Do you think a kernel module and root user process have the same
> privileges? You can only export so much info to sysfs to use? Also
> wouldn't this introduce more tunables, per driver tunables to be more
> specific?

- sysfs is a narrow interface to kernel functions. It is not much different
from the narrow interface I'm exporting, to be used directly by drivers
which can themselves export sysfs/debugfs nodes.
- There are no per-driver tunables here.


> > This patch is in the spirit of simple extension to existing
> > compaction_zone_order() which allows either a kernel driver or userspace
> > (through sysfs) to control compaction.
> > 
> > Also, we should avoid driving hard parallels between reclaim and
> > compaction: the former is often necessary for forward progress while the
> > latter is often an optimization. Since contiguous allocations are mostly
> > optimizations it's good to expose hooks from the kernel that let user
> > (through a driver or userspace) control it using its own heuristics.
> How is compaction an optimization? If I have a memory zone which has
> memory pages more than zone_highwmark and if higher order allocations a
> re failing because the memory is awfully fragmented, We need compaction
> to furthur progress here. I have seen workloads where kswapd won't help
> in progressing furthur because memory is so awfully fragmented.
> The workload I am quoting is the thpscale_workload from Mel Gorman's mmtests
> workloads.

- You can usually (but not always) fall back to base pages in case a
higher-order alloc fails. Higher-order allocs are for reducing TLB pressure
and for devices that cannot handle non-contiguous physical regions.
- kswapd is for memory reclaim only and cannot help with fragmentation.
- THP itself is an optimization and can be turned off.

> > I thought hard about whats lacking in current userspace interface (sysfs):
> >  - /proc/sys/vm/compact_memory: full system compaction is not an option as
> >    a viable pro-active compaction strategy.
> Don't we have a sysfs interface to compact memory per node through 
> /sys/devices/system/node/node<node_number>/compact? CONFIG_SYSFS AND
> CONFIG_NUMA are enabled on a lot of systems? Why are we not talking
> about this?
> I don't think kcompactd can go finer grain than per node. per-zone is 
> an option but then that would be overkill I feel.

I want pro-active compaction to somewhat hide higher-order allocation
latencies. Even full node compaction is too coarse for this purpose.
The goal is to keep fragmentation in check, i.e., within certain thresholds.
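
To put a number on "fragmentation" for a given order, one simple measure is
the share of free memory that sits in blocks too small for that order. The
sketch below computes it from per-order free block counts such as those
listed per zone in /proc/buddyinfo. It is an illustration under that
assumption: extfrag_pct() is a made-up helper, and the thread does not
spell out the exact formula behind its "extfrag" thresholds.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Percentage of free memory that sits in blocks too small to serve an
 * allocation of 'order'. free_blocks[i] is the number of free blocks of
 * order i (2^i pages each). Returns 0..100, and 0 when nothing is free.
 */
static unsigned int extfrag_pct(const unsigned long *free_blocks,
				size_t nr_orders, unsigned int order)
{
	unsigned long long free_pages = 0, usable_pages = 0;
	size_t i;

	assert(free_blocks != NULL);

	for (i = 0; i < nr_orders; i++) {
		unsigned long long pages = (unsigned long long)free_blocks[i] << i;

		free_pages += pages;
		if (i >= order)
			usable_pages += pages;
	}
	if (free_pages == 0)
		return 0;
	return (unsigned int)((free_pages - usable_pages) * 100 / free_pages);
}
```

For example, with free counts {4, 2, 1, 0} for orders 0..3 there are 12 free
pages, of which only 4 are in an order-2 block, so the value for order 2 is
8 * 100 / 12 = 66 (integer division).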


> >  - possibly expose [low, high] threshold values for each node and let
> >    kcompactd act on them. This was my approach for my original patch I
> >    linked earlier. Problem here is that it introduces too many tunables.
> > 
> > Considering the above, I came up with this callback approach which make it
> > trivial to introduce user specific policies for compaction. It puts the
> > onus of system stability, responsive in the hands of user without
> > burdening
> > admins with more tunables or adding crystal balls to kernel.
> I have the same question as Michal, that is won't this cause conflicts
> among different subsystems? If you did answer it in your previous
> mails, could you point to as I may have missed it :)

There is no big harm if multiple drivers call compact_zone_order().
A reasonable driver would want to call this interface to compact memory
to a certain extent and under specific conditions. If another driver
calls it in parallel then the other driver would simply see a well compacted
state and back off. It's also not hard for a driver to see if compaction
is not helping much, in which case it can again back off.
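
A back-off check of this kind could be as small as the sketch below. It is
only an illustration: struct backoff_state and should_back_off() are
hypothetical names, and the thresholds are parameters that the caller would
have to choose. The caller feeds in the fragmentation level seen after each
compaction round.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-caller state for deciding when to stop trying. */
struct backoff_state {
	int last_extfrag_pct;	/* level seen after the previous round, -1 if none */
	int stalled_rounds;	/* consecutive rounds without enough improvement */
};

/*
 * Record the fragmentation level seen after one compaction round and
 * report whether the caller should back off. A round counts as stalled
 * when it improved extfrag by less than min_gain_pct; the caller backs
 * off when the target is met or after max_stalled stalled rounds in a row.
 */
static bool should_back_off(struct backoff_state *s, int extfrag_pct,
			    int target_pct, int min_gain_pct, int max_stalled)
{
	assert(max_stalled > 0);

	if (extfrag_pct <= target_pct)
		return true;	/* compact enough, possibly thanks to another caller */

	if (s->last_extfrag_pct >= 0 &&
	    s->last_extfrag_pct - extfrag_pct < min_gain_pct)
		s->stalled_rounds++;
	else
		s->stalled_rounds = 0;

	s->last_extfrag_pct = extfrag_pct;
	return s->stalled_rounds >= max_stalled;
}
```

The first branch covers the case of two callers working in parallel: the one
that arrives second sees a state that already meets its target and stops.
The stalled-round counter covers the case where compaction is not helping.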


> > > > Such a design causes very high latency for workloads where we want
> > > > to allocate lots of hugepages in short period of time. With pro-active
> > > > compaction we can hide much of this latency. For some more background
> > > > discussion and data, please see this thread:
> > > > 
> > > > https://patchwork.kernel.org/patch/11098289/
> > > 
> > > I am aware of that thread. And there are two things. You claim the
> > > allocation success rate is unnecessarily lower and that the direct
> > > latency is high. You simply cannot assume both low latency and high
> > > success rate. Compaction is not free. Somebody has to do the work.
> > > Hiding it into the background means that you are eating a lot of cycles
> > > from everybody else (think of a workload running in a restricted cpu
> > > controller just doing a lot of work in an unaccounted context).
> > > 
> > > That being said you really have to be prepared to pay a price for
> > > precious resource like high order pages.
> > > 
> > > On the other hand I do understand that high latency is not really
> > > desired for a more optimistic allocation requests with a reasonable
> > > fallback strategy. Those would benefit from kcompactd not giving up too
> > > early.
> > 
> > Doing pro-active compaction in background has merits in reducing reducing
> > high-order alloc latency. Its true that it would end up burning cycles
> > with
> > little benefit in some cases. Its upto the user of this new interface to
> > back off if it detects such a case.
> Are these cycles worth considering in the big picture of reducing high
> order allocation latency? 

Yes, I think it's worth it.


> > >  
> > > > > > Even with these caveats, pro-active compaction can still be very
> > > > > > useful in certain scenarios to reduce hugepage allocation
> > > > > > latencies.
> > > > > > This callback interface allows drivers to drive compaction based
> > > > > > on
> > > > > > their own policies like the current level of external
> > > > > > fragmentation
> > > > > > for a particular order, system load etc.
> > > > > 
> > > > > So we do not trust the core MM to make a reasonable decision while
> > > > > we
> > > > > give
> > > > > a free ticket to modules. How does this make any sense at all? How
> > > > > is a
> > > > > random module going to make a more informed decision when it has
> > > > > less
> > > > > visibility on the overal MM situation.
> > > > > 
> > > > 
> > > > Embedding any specific policy (like: keep external fragmentation for
> > > > order-9
> > > > between 30-40%) within MM core looks like a bad idea.
> > > 
> > > Agreed
> > > 
> > > > As a driver, we
> > > > can easily measure parameters like system load, current fragmentation
> > > > level
> > > > for any order in any zone etc. to make an informed decision.
> > > > See the thread I refereed above for more background discussion.
> > > 
> > > Do that from the userspace then. If there is an insufficient interface
> > > to do that then let's talk about what is missing.
> > > 
> > 
> > Currently we only have a proc interface to do full system compaction.
> > Here's what missing from this interface: ability to set per-node, per-
> > zone,
> > per-order, [low, high] extfrag thresholds. This is what I exposed in my
> > earlier patch titled 'proactive compaction'. Discussion there made me
> > realize
> > these are too many tunables and any admin would always get them wrong.
> > Even
> > if intended user of these sysfs node is some monitoring daemon, its
> > tempting to mess with them.
> > 
> > With a callback extension to compact_zone_order() implementing any of the
> > per-node, per-zone, per-order limits is straightforward and if needed the
> > driver can expose debugfs/sysfs nodes if needed at all. (nvcompact.c
> > driver[1] exposes these tunables as debugfs nodes, for example).
> > 
> > [1] https://gitlab.com/nigupta/linux/snippets/1894161
> Now, your proposing a system where we have interfaces from each driver.
> That could be more confusing for a sys admin to configure I feel?
> 
> But what your proposing really made me think about  what kind
> of tunables do we want? Rather than just talking about tunables from the
> mm subsystem, can we introduce tunables that indicate the behaviour of
> workloads? Using this information from the user, we can look to optimize 
> reclaim and compaction for that workload.
> If we have a tunable which can indicate that the kernel is running in an
> environment where the where the workload will be performing a lot of
> higher order allocations, we can improve memory reclaim and compaction
> considering these parameters. One optimization I can think of extending
> kcompactd to compact more memory when a higher order allocation fails.
> 
> One of the biggest issues with having a discussion on proactive
> reclaim/compaction is that the workloads are really unpredictable. 
> Rather than working on tunables from the mm subsystem which help us take
> action on memory pressure, can we talk about interfaces to hint about
> workloads so that we can make better informed decisions in the mm
> subsystem rather than involving other drivers?


I'm not adding any tunables, just exposing an interface.

-Nitin
