
[RFC,v4,2/5] mm/__free_one_page: skip merge for order-0 page unless compaction failed

Message ID 20181017063330.15384-3-aaron.lu@intel.com (mailing list archive)
State New, archived
Series Eliminate zone->lock contention for will-it-scale/page_fault1 and parallel free

Commit Message

Aaron Lu Oct. 17, 2018, 6:33 a.m. UTC
Running the will-it-scale/page_fault1 process mode workload on a
two-socket Intel Skylake server showed severe contention on zone->lock:
as much as about 80% of CPU cycles (42% on the allocation path and 35% on
the free path) are burnt spinning. According to perf, the most
time-consuming part inside that lock on the free path is cache misses on
page structures, mostly on the to-be-freed page's buddy, due to merging.

One way to avoid this overhead is to not do any merging at all for order-0
pages. With this approach, zone->lock contention on the free path dropped
to 1.1%, but the allocation side still had lock contention as high as 42%.
Meanwhile, the dropped lock contention on the free side doesn't translate
into a performance increase; instead, it's consumed by increased lock
contention on the per-node lru_lock (which rose from 5% to 37%), and final
performance dropped slightly, by about 1%.

Though performance dropped a little, this almost eliminated zone lock
contention on the free path, and it is the foundation for the next patch,
which eliminates zone lock contention on the allocation path.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 include/linux/mm_types.h |  9 +++-
 mm/compaction.c          | 13 +++++-
 mm/internal.h            | 27 ++++++++++++
 mm/page_alloc.c          | 88 ++++++++++++++++++++++++++++++++++------
 4 files changed, 121 insertions(+), 16 deletions(-)
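
The mm/page_alloc.c hunk itself is elided from this archive view. As a
rough sketch of the free-path change the changelog describes (illustrative
only, not the literal hunk; do_merge() stands in for the existing merge
loop in __free_one_page()):

static void free_one_page_lazy(struct zone *zone, struct page *page,
                               unsigned int order, int migratetype)
{
        if (can_skip_merge(zone, order)) {
                /* No buddy lookup: the buddy's struct page is never
                 * touched under zone->lock. Compaction merges later. */
                page->buddy_merge_skipped = true;
                set_page_order(page, 0);
                list_add(&page->lru,
                         &zone->free_area[0].free_list[migratetype]);
                zone->free_area[0].nr_free++;
                return;
        }

        page->buddy_merge_skipped = false;
        do_merge(zone, page, order, migratetype);  /* usual merge loop */
}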

Comments

Mel Gorman Oct. 17, 2018, 10:44 a.m. UTC | #1
On Wed, Oct 17, 2018 at 02:33:27PM +0800, Aaron Lu wrote:
> Running the will-it-scale/page_fault1 process mode workload on a
> two-socket Intel Skylake server showed severe contention on zone->lock:
> as much as about 80% of CPU cycles (42% on the allocation path and 35% on
> the free path) are burnt spinning. According to perf, the most
> time-consuming part inside that lock on the free path is cache misses on
> page structures, mostly on the to-be-freed page's buddy, due to merging.
> 

This confuses me slightly. The commit log for d8a759b57035 ("mm,
page_alloc: double zone's batchsize") indicates that the contention for
will-it-scale moved from the zone lock to the LRU lock. This appears to
contradict that, although the exact test case is different (page_fault1
vs page_fault2). Can you clarify why commit d8a759b57035 is
insufficient?

I'm wondering if this is really about reducing the number of dirtied cache
lines due to struct page updates and less about the actual zone lock.

> One way to avoid this overhead is to not do any merging at all for order-0
> pages. With this approach, zone->lock contention on the free path dropped
> to 1.1%, but the allocation side still had lock contention as high as 42%.
> Meanwhile, the dropped lock contention on the free side doesn't translate
> into a performance increase; instead, it's consumed by increased lock
> contention on the per-node lru_lock (which rose from 5% to 37%), and final
> performance dropped slightly, by about 1%.
> 

Although this implies it's really about contention.

> Though performance dropped a little, this almost eliminated zone lock
> contention on the free path, and it is the foundation for the next patch,
> which eliminates zone lock contention on the allocation path.
> 

Can you clarify whether THP was enabled or not? As this is order-0 focused,
it would imply the series should have minimal impact due to limited merging.

> Suggested-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> ---
>  include/linux/mm_types.h |  9 +++-
>  mm/compaction.c          | 13 +++++-
>  mm/internal.h            | 27 ++++++++++++
>  mm/page_alloc.c          | 88 ++++++++++++++++++++++++++++++++++------
>  4 files changed, 121 insertions(+), 16 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..aed93053ef6e 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -179,8 +179,13 @@ struct page {
>  		int units;			/* SLOB */
>  	};
>  
> -	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
> -	atomic_t _refcount;
> +	union {
> +		/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
> +		atomic_t _refcount;
> +
> +		/* For pages in Buddy: if skipped merging when added to Buddy */
> +		bool buddy_merge_skipped;
> +	};
>  

In some instances, bools within structs are frowned upon because of
differences in sizes across architectures. Because this is part of a
union, I don't think it's problematic, but bear it in mind in case
someone else spots it.

>  #ifdef CONFIG_MEMCG
>  	struct mem_cgroup *mem_cgroup;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index faca45ebe62d..0c9c7a30dde3 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -777,8 +777,19 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  		 * potential isolation targets.
>  		 */
>  		if (PageBuddy(page)) {
> -			unsigned long freepage_order = page_order_unsafe(page);
> +			unsigned long freepage_order;
>  
> +			/*
> +			 * If this is a merge_skipped page, do merge now
> +			 * since high-order pages are needed. zone lock
> +			 * isn't taken for the merge_skipped check so the
> +			 * check could be wrong but the worst case is we
> +			 * lose a merge opportunity.
> +			 */
> +			if (page_merge_was_skipped(page))
> +				try_to_merge_page(page);
> +
> +			freepage_order = page_order_unsafe(page);
>  			/*
>  			 * Without lock, we cannot be sure that what we got is
>  			 * a valid page order. Consider only values in the
> diff --git a/mm/internal.h b/mm/internal.h
> index 87256ae1bef8..c166735a559e 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -527,4 +527,31 @@ static inline bool is_migrate_highatomic_page(struct page *page)
>  
>  void setup_zone_pageset(struct zone *zone);
>  extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
> +
> +static inline bool page_merge_was_skipped(struct page *page)
> +{
> +	return page->buddy_merge_skipped;
> +}
> +
> +void try_to_merge_page(struct page *page);
> +
> +#ifdef CONFIG_COMPACTION
> +static inline bool can_skip_merge(struct zone *zone, int order)
> +{
> +	/* Compaction has failed in this zone, we shouldn't skip merging */
> +	if (zone->compact_considered)
> +		return false;
> +
> +	/* Only consider no_merge for order 0 pages */
> +	if (order)
> +		return false;
> +
> +	return true;
> +}
> +#else /* CONFIG_COMPACTION */
> +static inline bool can_skip_merge(struct zone *zone, int order)
> +{
> +	return false;
> +}
> +#endif  /* CONFIG_COMPACTION */
>  #endif	/* __MM_INTERNAL_H */

Strictly speaking, lazy buddy merging does not need to be linked to
compaction. Lazy merging doesn't say anything about the mobility of
buddy pages that are still allocated.

When lazy buddy merging was last examined years ago, a consequence was
that high-order allocation success rates were reduced. I see you do the
merging when compaction has been recently considered but I don't see how
that is sufficient. If a high-order allocation fails, there is no
guarantee that compaction will find those unmerged buddies. There is
also no guarantee that a page free will find them. So, in the event of a
high-order allocation failure, what finds all those unmerged buddies and
puts them together to see if the allocation would succeed without
reclaim/compaction/etc.?
Aaron Lu Oct. 17, 2018, 1:10 p.m. UTC | #2
On Wed, Oct 17, 2018 at 11:44:27AM +0100, Mel Gorman wrote:
> On Wed, Oct 17, 2018 at 02:33:27PM +0800, Aaron Lu wrote:
> > Running the will-it-scale/page_fault1 process mode workload on a
> > two-socket Intel Skylake server showed severe contention on zone->lock:
> > as much as about 80% of CPU cycles (42% on the allocation path and 35% on
> > the free path) are burnt spinning. According to perf, the most
> > time-consuming part inside that lock on the free path is cache misses on
> > page structures, mostly on the to-be-freed page's buddy, due to merging.
> > 
> 
> This confuses me slightly. The commit log for d8a759b57035 ("mm,
> page_alloc: double zone's batchsize") indicates that the contention for
> will-it-scale moved from the zone lock to the LRU lock. This appears to
> contradict that, although the exact test case is different (page_fault1
> vs page_fault2). Can you clarify why commit d8a759b57035 is
> insufficient?

commit d8a759b57035 helps zone lock scalability and while it reduced
zone lock scalability to some extent(but not entirely eliminated it),
the lock contention shifted to LRU lock in the meantime.

e.g. from commit d8a759b57035's changelog, with the same test case
will-it-scale/page_fault1:

4 sockets Skylake:
    batch   score     change   zone_contention   lru_contention   total_contention
     31   15345900    +0.00%       64%                 8%           72%
     63   17992886   +17.25%       24%                45%           69%

4 sockets Broadwell:
    batch   score     change   zone_contention   lru_contention   total_contention
     31   16703983    +0.00%       67%                 7%           74%
     63   18288885    +9.49%       38%                33%           71%

2 sockets Skylake:
    batch   score     change   zone_contention   lru_contention   total_contention
     31   9554867     +0.00%       66%                 3%           69%
     63   9980145     +4.45%       62%                 4%           66%

Please note that though zone lock contention for the 4-socket servers was
reduced a lot by commit d8a759b57035, the 2-socket Skylake still suffered
a lot from zone lock contention even after we doubled the batch size.

Also, the reduced zone lock contention will again get worse if the LRU
lock is optimized away by Daniel's work, or in cases where there is no
LRU in the picture, e.g. for an in-kernel user of the page allocator, as
Tariq Toukan demonstrated with netperf.

> I'm wondering if this is really about reducing the number of dirtied cache
> lines due to struct page updates and less about the actual zone lock.

Hmm...if we reduce the time it takes under the zone lock, aren't we
helping the zone lock? :-)

> 
> > One way to avoid this overhead is to not do any merging at all for order-0
> > pages. With this approach, zone->lock contention on the free path dropped
> > to 1.1%, but the allocation side still had lock contention as high as 42%.
> > Meanwhile, the dropped lock contention on the free side doesn't translate
> > into a performance increase; instead, it's consumed by increased lock
> > contention on the per-node lru_lock (which rose from 5% to 37%), and final
> > performance dropped slightly, by about 1%.
> > 
> 
> Although this implies it's really about contention.
> 
> > Though performance dropped a little, this almost eliminated zone lock
> > contention on the free path, and it is the foundation for the next patch,
> > which eliminates zone lock contention on the allocation path.
> > 
> 
> Can you clarify whether THP was enabled or not? As this is order-0 focused,
> it would imply the series should have minimal impact due to limited merging.

Sorry about this, I should have mentioned THP is not used here.

> 
> > Suggested-by: Dave Hansen <dave.hansen@intel.com>
> > Signed-off-by: Aaron Lu <aaron.lu@intel.com>
> > ---
> >  include/linux/mm_types.h |  9 +++-
> >  mm/compaction.c          | 13 +++++-
> >  mm/internal.h            | 27 ++++++++++++
> >  mm/page_alloc.c          | 88 ++++++++++++++++++++++++++++++++++------
> >  4 files changed, 121 insertions(+), 16 deletions(-)
> > 
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 5ed8f6292a53..aed93053ef6e 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -179,8 +179,13 @@ struct page {
> >  		int units;			/* SLOB */
> >  	};
> >  
> > -	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
> > -	atomic_t _refcount;
> > +	union {
> > +		/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
> > +		atomic_t _refcount;
> > +
> > +		/* For pages in Buddy: if skipped merging when added to Buddy */
> > +		bool buddy_merge_skipped;
> > +	};
> >  
> 
> In some instances, bools within structs are frowned upon because of
> differences in sizes across architectures. Because this is part of a
> union, I don't think it's problematic, but bear it in mind in case
> someone else spots it.

OK, thanks for the reminder.

> 
> >  #ifdef CONFIG_MEMCG
> >  	struct mem_cgroup *mem_cgroup;
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index faca45ebe62d..0c9c7a30dde3 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -777,8 +777,19 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >  		 * potential isolation targets.
> >  		 */
> >  		if (PageBuddy(page)) {
> > -			unsigned long freepage_order = page_order_unsafe(page);
> > +			unsigned long freepage_order;
> >  
> > +			/*
> > +			 * If this is a merge_skipped page, do merge now
> > +			 * since high-order pages are needed. zone lock
> > +			 * isn't taken for the merge_skipped check so the
> > +			 * check could be wrong but the worst case is we
> > +			 * lose a merge opportunity.
> > +			 */
> > +			if (page_merge_was_skipped(page))
> > +				try_to_merge_page(page);
> > +
> > +			freepage_order = page_order_unsafe(page);
> >  			/*
> >  			 * Without lock, we cannot be sure that what we got is
> >  			 * a valid page order. Consider only values in the
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 87256ae1bef8..c166735a559e 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -527,4 +527,31 @@ static inline bool is_migrate_highatomic_page(struct page *page)
> >  
> >  void setup_zone_pageset(struct zone *zone);
> >  extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
> > +
> > +static inline bool page_merge_was_skipped(struct page *page)
> > +{
> > +	return page->buddy_merge_skipped;
> > +}
> > +
> > +void try_to_merge_page(struct page *page);
> > +
> > +#ifdef CONFIG_COMPACTION
> > +static inline bool can_skip_merge(struct zone *zone, int order)
> > +{
> > +	/* Compaction has failed in this zone, we shouldn't skip merging */
> > +	if (zone->compact_considered)
> > +		return false;
> > +
> > +	/* Only consider no_merge for order 0 pages */
> > +	if (order)
> > +		return false;
> > +
> > +	return true;
> > +}
> > +#else /* CONFIG_COMPACTION */
> > +static inline bool can_skip_merge(struct zone *zone, int order)
> > +{
> > +	return false;
> > +}
> > +#endif  /* CONFIG_COMPACTION */
> >  #endif	/* __MM_INTERNAL_H */
> 
> Strictly speaking, lazy buddy merging does not need to be linked to
> compaction. Lazy merging doesn't say anything about the mobility of
> buddy pages that are still allocated.

True.
I was thinking that if compaction isn't enabled, we probably shouldn't
enable this lazy buddy merging feature, as it would make the high-order
allocation success rate drop a lot.

I probably should have mentioned clearly somewhere in the changelog that
the merging of those unmerged order-0 pages is embedded in the compaction
code, in isolate_migratepages_block(), when isolation candidates are
scanned.

> 
> When lazy buddy merging was last examined years ago, a consequence was
> that high-order allocation success rates were reduced. I see you do the

I tried mmtests/stress-highalloc on one desktop and didn't see the
high-order allocation success rate drop, as shown in patch0's changelog.
But it could be that I didn't test enough machines or use other test
cases? Any suggestions on how to uncover this problem?

> merging when compaction has been recently considered but I don't see how
> that is sufficient. If a high-order allocation fails, there is no
> guarantee that compaction will find those unmerged buddies. There is

Any unmerged buddies will have page->buddy_merge_skipped set, and during
compaction, when isolate_migratepages_block() iterates pages to find
isolation candidates, it will find these unmerged pages and do_merge()
them. Suppose an order-9 pageblock where every page is a merge_skipped
order-0 page: after isolate_migratepages_block() iterates them one by one
and calls do_merge() for each, higher-order pages are formed during this
process, and after the last unmerged order-0 page goes through do_merge(),
an order-9 buddy page will have been formed.
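
Roughly, the compaction-side hook looks like this sketch (illustrative
only, not the literal patch code; do_merge() is the usual merge loop):

void try_to_merge_page(struct page *page)
{
        struct zone *zone = page_zone(page);
        unsigned long flags;

        spin_lock_irqsave(&zone->lock, flags);
        /* Re-check under the lock: the lockless check may have raced. */
        if (PageBuddy(page) && page_order(page) == 0 &&
            page_merge_was_skipped(page)) {
                list_del(&page->lru);
                zone->free_area[0].nr_free--;
                /* Usual merge loop; repeated over a pageblock in PFN
                 * order, this can promote all the way up to order-9. */
                do_merge(zone, page, 0, get_pageblock_migratetype(page));
        }
        spin_unlock_irqrestore(&zone->lock, flags);
}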

> also no guarantee that a page free will find them. So, in the event of a
> high-order allocation failure, what finds all those unmerged buddies and
> puts them together to see if the allocation would succeed without
> reclaim/compaction/etc.?

Compaction is needed to form a high-order page after a high-order
allocation fails; I think this is also true for the vanilla kernel?
Mel Gorman Oct. 17, 2018, 1:58 p.m. UTC | #3
On Wed, Oct 17, 2018 at 09:10:59PM +0800, Aaron Lu wrote:
> On Wed, Oct 17, 2018 at 11:44:27AM +0100, Mel Gorman wrote:
> > On Wed, Oct 17, 2018 at 02:33:27PM +0800, Aaron Lu wrote:
> > > Running the will-it-scale/page_fault1 process mode workload on a
> > > two-socket Intel Skylake server showed severe contention on zone->lock:
> > > as much as about 80% of CPU cycles (42% on the allocation path and 35% on
> > > the free path) are burnt spinning. According to perf, the most
> > > time-consuming part inside that lock on the free path is cache misses on
> > > page structures, mostly on the to-be-freed page's buddy, due to merging.
> > > 
> > 
> > This confuses me slightly. The commit log for d8a759b57035 ("mm,
> > page_alloc: double zone's batchsize") indicates that the contention for
> > will-it-scale moved from the zone lock to the LRU lock. This appears to
> > contradict that, although the exact test case is different (page_fault1
> > vs page_fault2). Can you clarify why commit d8a759b57035 is
> > insufficient?
> 
> commit d8a759b57035 helps zone lock scalability and while it reduced
> zone lock scalability to some extent(but not entirely eliminated it),
> the lock contention shifted to LRU lock in the meantime.
> 

I assume you meant "zone lock contention" in the second case.

> e.g. from commit d8a759b57035's changelog, with the same test case
> will-it-scale/page_fault1:
> 
> 4 sockets Skylake:
>     batch   score     change   zone_contention   lru_contention   total_contention
>      31   15345900    +0.00%       64%                 8%           72%
>      63   17992886   +17.25%       24%                45%           69%
> 
> 4 sockets Broadwell:
>     batch   score     change   zone_contention   lru_contention   total_contention
>      31   16703983    +0.00%       67%                 7%           74%
>      63   18288885    +9.49%       38%                33%           71%
> 
> 2 sockets Skylake:
>     batch   score     change   zone_contention   lru_contention   total_contention
>      31   9554867     +0.00%       66%                 3%           69%
>      63   9980145     +4.45%       62%                 4%           66%
> 
> Please note that though zone lock contention for the 4-socket servers was
> reduced a lot by commit d8a759b57035, the 2-socket Skylake still suffered
> a lot from zone lock contention even after we doubled the batch size.
> 

Any particular reason why? I assume it's related to the number of zone
locks with the increased number of zones and the number of threads used
for the test.

> Also, the reduced zone lock contention will again get worse if the LRU
> lock is optimized away by Daniel's work, or in cases where there is no
> LRU in the picture, e.g. for an in-kernel user of the page allocator, as
> Tariq Toukan demonstrated with netperf.
> 

Vaguely understood, I never looked at the LRU lock patches.

> > I'm wondering if this is really about reducing the number of dirtied cache
> > lines due to struct page updates and less about the actual zone lock.
> 
> Hmm...if we reduce the time it takes under the zone lock, aren't we
> helping the zone lock? :-)
> 

Indirectly yes but reducing cache line dirtying is useful in itself so
they should be at least considered separately as independent
optimisations.

> > 
> > > One way to avoid this overhead is to not do any merging at all for order-0
> > > pages. With this approach, zone->lock contention on the free path dropped
> > > to 1.1%, but the allocation side still had lock contention as high as 42%.
> > > Meanwhile, the dropped lock contention on the free side doesn't translate
> > > into a performance increase; instead, it's consumed by increased lock
> > > contention on the per-node lru_lock (which rose from 5% to 37%), and final
> > > performance dropped slightly, by about 1%.
> > > 
> > 
> > Although this implies it's really about contention.
> > 
> > > Though performance dropped a little, this almost eliminated zone lock
> > > contention on the free path, and it is the foundation for the next patch,
> > > which eliminates zone lock contention on the allocation path.
> > > 
> > 
> > Can you clarify whether THP was enabled or not? As this is order-0 focused,
> > it would imply the series should have minimal impact due to limited merging.
> 
> Sorry about this, I should have mentioned THP is not used here.
> 

That's important to know. It does reduce the utility of the patch
somewhat but not all arches support THP and THP is not always enabled on
x86.

> > compaction. Lazy merging doesn't say anything about the mobility of
> > buddy pages that are still allocated.
> 
> True.
> I was thinking that if compaction isn't enabled, we probably shouldn't
> enable this lazy buddy merging feature, as it would make the high-order
> allocation success rate drop a lot.
> 

It probably is lower as reclaim is not that aggressive. Add a comment
with an explanation as to why it's compaction-specific.

> I probably should have mentioned clearly somewhere in the changelog that
> the merging of those unmerged order-0 pages is embedded in the compaction
> code, in isolate_migratepages_block(), when isolation candidates are
> scanned.
> 

Yes, but note that the concept is still problematic.
isolate_migratepages_block is not guaranteed to find a pageblock with
unmerged buddies in it. If there are pageblocks towards the end of the
zone with unmerged pages, they may never be found. This will be very hard
to detect at runtime because it's heavily dependent on the exact state
of the system.


> > 
> > When lazy buddy merging was last examined years ago, a consequence was
> > that high-order allocation success rates were reduced. I see you do the
> 
> I tried mmtests/stress-highalloc on one desktop and didn't see the
> high-order allocation success rate drop, as shown in patch0's changelog.
> But it could be that I didn't test enough machines or use other test
> cases? Any suggestions on how to uncover this problem?
> 

stress-highalloc is nowhere near as useful as it used to be,
unfortunately. It was built at a time when 4G machines were unusual.
config-global-dhp__workload_thpscale can sometimes be useful but it's
variable. There is not a good modern example of detecting allocation
success rates of highly fragmented systems at the moment, which is a
real pity.

> > merging when compaction has been recently considered but I don't see how
> > that is sufficient. If a high-order allocation fails, there is no
> > guarantee that compaction will find those unmerged buddies. There is
> 
> Any unmerged buddies will have page->buddy_merge_skipped set, and during
> compaction, when isolate_migratepages_block() iterates pages to find
> isolation candidates, it will find these unmerged pages and do_merge()
> them. Suppose an order-9 pageblock where every page is a merge_skipped
> order-0 page: after isolate_migratepages_block() iterates them one by one
> and calls do_merge() for each, higher-order pages are formed during this
> process, and after the last unmerged order-0 page goes through do_merge(),
> an order-9 buddy page will have been formed.
> 

Again, as compaction is not guaranteed to find the pageblocks, it would
be important to either a) consider whether that matters or b) find an
alternative way of keeping unmerged buddies on separate lists so they
can be quickly discovered when a high-order allocation fails.

> > also no guarantee that a page free will find them. So, in the event of a
> > high-order allocation failure, what finds all those unmerged buddies and
> > puts them together to see if the allocation would succeed without
> > reclaim/compaction/etc.?
> 
> Compaction is needed to form a high-order page after a high-order
> allocation fails; I think this is also true for the vanilla kernel?

It's needed to form them efficiently but excessive reclaim or writing 3
to drop_caches can also do it. Be careful of tying lazy buddy too
closely to compaction.
Aaron Lu Oct. 17, 2018, 2:59 p.m. UTC | #4
On Wed, Oct 17, 2018 at 02:58:07PM +0100, Mel Gorman wrote:
> On Wed, Oct 17, 2018 at 09:10:59PM +0800, Aaron Lu wrote:
> > On Wed, Oct 17, 2018 at 11:44:27AM +0100, Mel Gorman wrote:
> > > On Wed, Oct 17, 2018 at 02:33:27PM +0800, Aaron Lu wrote:
> > > > Running the will-it-scale/page_fault1 process mode workload on a
> > > > two-socket Intel Skylake server showed severe contention on zone->lock:
> > > > as much as about 80% of CPU cycles (42% on the allocation path and 35% on
> > > > the free path) are burnt spinning. According to perf, the most
> > > > time-consuming part inside that lock on the free path is cache misses on
> > > > page structures, mostly on the to-be-freed page's buddy, due to merging.
> > > > 
> > > 
> > > This confuses me slightly. The commit log for d8a759b57035 ("mm,
> > > page_alloc: double zone's batchsize") indicates that the contention for
> > > will-it-scale moved from the zone lock to the LRU lock. This appears to
> > > contradict that, although the exact test case is different (page_fault1
> > > vs page_fault2). Can you clarify why commit d8a759b57035 is
> > > insufficient?
> > 
> > commit d8a759b57035 helps zone lock scalability and while it reduced
> > zone lock scalability to some extent(but not entirely eliminated it),
> > the lock contention shifted to LRU lock in the meantime.
> > 
> 
> I assume you meant "zone lock contention" in the second case.

Yes, that's right.

> 
> > e.g. from commit d8a759b57035's changelog, with the same test case
> > will-it-scale/page_fault1:
> > 
> > 4 sockets Skylake:
> >     batch   score     change   zone_contention   lru_contention   total_contention
> >      31   15345900    +0.00%       64%                 8%           72%
> >      63   17992886   +17.25%       24%                45%           69%
> > 
> > 4 sockets Broadwell:
> >     batch   score     change   zone_contention   lru_contention   total_contention
> >      31   16703983    +0.00%       67%                 7%           74%
> >      63   18288885    +9.49%       38%                33%           71%
> > 
> > 2 sockets Skylake:
> >     batch   score     change   zone_contention   lru_contention   total_contention
> >      31   9554867     +0.00%       66%                 3%           69%
> >      63   9980145     +4.45%       62%                 4%           66%
> > 
> > Please note that though zone lock contention for the 4-socket servers was
> > reduced a lot by commit d8a759b57035, the 2-socket Skylake still suffered
> > a lot from zone lock contention even after we doubled the batch size.
> > 
> 
> Any particular reason why? I assume it's related to the number of zone
> locks with the increased number of zones and the number of threads used
> for the test.

I think so too.

The 4-socket server has 192 CPUs in total while the 2-socket server has
112 CPUs in total. Assuming only ZONE_NORMAL zones are used, the 4-socket
server would have 192/4=48 CPUs per zone while the 2-socket server has
112/2=56 CPUs per zone. The test is started with nr_task=nr_cpu, so the
2-socket server ends up having more CPUs consuming each zone.

> 
> > Also, the reduced zone lock contention will again get worse if the LRU
> > lock is optimized away by Daniel's work, or in cases where there is no
> > LRU in the picture, e.g. for an in-kernel user of the page allocator, as
> > Tariq Toukan demonstrated with netperf.
> > 
> 
> Vaguely understood, I never looked at the LRU lock patches.
> 
> > > I'm wondering if this is really about reducing the number of dirtied cache
> > > lines due to struct page updates and less about the actual zone lock.
> > 
> > Hmm...if we reduce the time it takes under the zone lock, aren't we
> > helping the zone lock? :-)
> > 
> 
> Indirectly yes but reducing cache line dirtying is useful in itself so
> they should be at least considered separately as independent
> optimisations.
> 
> > > 
> > > > One way to avoid this overhead is to not do any merging at all for order-0
> > > > pages. With this approach, zone->lock contention on the free path dropped
> > > > to 1.1%, but the allocation side still had lock contention as high as 42%.
> > > > Meanwhile, the dropped lock contention on the free side doesn't translate
> > > > into a performance increase; instead, it's consumed by increased lock
> > > > contention on the per-node lru_lock (which rose from 5% to 37%), and final
> > > > performance dropped slightly, by about 1%.
> > > > 
> > > 
> > > Although this implies it's really about contention.
> > > 
> > > > Though performance dropped a little, this almost eliminated zone lock
> > > > contention on the free path, and it is the foundation for the next patch,
> > > > which eliminates zone lock contention on the allocation path.
> > > > 
> > > 
> > > Can you clarify whether THP was enabled or not? As this is order-0 focused,
> > > it would imply the series should have minimal impact due to limited merging.
> > 
> > Sorry about this, I should have mentioned THP is not used here.
> > 
> 
> That's important to know. It does reduce the utility of the patch
> somewhat but not all arches support THP and THP is not always enabled on
> x86.

I have always wondered how systems are making use of THP.
After all, when a system has been running a while (days or months), file
cache should have consumed a lot of memory and high-order pages will
become more and more scarce. If order-9 pages can't be reliably
allocated, will workloads rely on them?
Just a thought.

THP is of course pretty neat in that it reduces TLB cost, needs fewer
page tables, etc. I just wonder whether people really rely on it, or use
it after their system has been up for a long time.

> > > compaction. Lazy merging doesn't say anything about the mobility of
> > > buddy pages that are still allocated.
> > 
> > True.
> > I was thinking that if compaction isn't enabled, we probably shouldn't
> > enable this lazy buddy merging feature, as it would make the high-order
> > allocation success rate drop a lot.
> > 
> 
> It probably is lower as reclaim is not that aggressive. Add a comment
> with an explanation as to why it's compaction-specific.
> 
> > I probably should have mentioned clearly somewhere in the changelog that
> > the merging of those unmerged order-0 pages is embedded in the compaction
> > code, in isolate_migratepages_block(), when isolation candidates are
> > scanned.
> > 
> 
> Yes, but note that the concept is still problematic.
> isolate_migratepages_block is not guaranteed to find a pageblock with
> unmerged buddies in it. If there are pageblocks towards the end of the
> zone with unmerged pages, they may never be found. This will be very hard
> to detect at runtime because it's heavily dependent on the exact state
> of the system.

Quite true.

The intent here, though, is not to have compaction merge back all
unmerged pages, but to do the merge for these unmerged pages in a
piggyback way, i.e. since isolate_migratepages_block() is doing the
scan anyway, why don't we let it handle these unmerged pages when it
meets them?

If for some reason isolate_migratepages_block() didn't meet a single
unmerged page before compaction succeeded, we probably do not need to
worry much yet, since compaction succeeded anyway.

> > > 
> > > When lazy buddy merging was last examined years ago, a consequence was
> > > that high-order allocation success rates were reduced. I see you do the
> > 
> > I tried mmtests/stress-highalloc on one desktop and didn't see the
> > high-order allocation success rate drop, as shown in patch0's changelog.
> > But it could be that I didn't test enough machines or use other test
> > cases? Any suggestions on how to uncover this problem?
> > 
> 
> stress-highalloc is nowhere near as useful as it used to be,
> unfortunately. It was built at a time when 4G machines were unusual.
> config-global-dhp__workload_thpscale can sometimes be useful but it's

Will take a look at this, thanks for the pointer.

> variable. There is not a good modern example of detecting allocation
> success rates of highly fragmented systems at the moment, which is a
> real pity.
> 
> > > merging when compaction has been recently considered but I don't see how
> > > that is sufficient. If a high-order allocation fails, there is no
> > > guarantee that compaction will find those unmerged buddies. There is
> > 
> > Any unmerged buddies will have page->buddy_merge_skipped set, and during
> > compaction, when isolate_migratepages_block() iterates pages to find
> > isolation candidates, it will find these unmerged pages and do_merge()
> > them. Suppose an order-9 pageblock where every page is a merge_skipped
> > order-0 page: after isolate_migratepages_block() iterates them one by one
> > and calls do_merge() for each, higher-order pages are formed during this
> > process, and after the last unmerged order-0 page goes through do_merge(),
> > an order-9 buddy page will have been formed.
> > 
> 
> Again, as compaction is not guaranteed to find the pageblocks, it would
> be important to either a) consider whether that matters or b) find an
> alternative way of keeping unmerged buddies on separate lists so they
> can be quickly discovered when a high-order allocation fails.

That's a good question.
I tend to think it doesn't matter whether we can find all unmerged pages.
Let compaction do its job as before; do_merge() for any unmerged
pages when it scans them is probably enough. But as you said, we don't
have a good enough test case for this yet.

> > > also no guarantee that a page free will find them. So, in the event of a
> > > high-order allocation failure, what finds all those unmerged buddies and
> > > puts them together to see if the allocation would succeed without
> > reclaim/compaction/etc.?
> > 
> Compaction is needed to form a high-order page after a high-order
> allocation fails; I think this is also true for the vanilla kernel?
> 
> It's needed to form them efficiently but excessive reclaim or writing 3
> to drop_caches can also do it. Be careful of tying lazy buddy too
> closely to compaction.

That's the current design of this patchset, do you see any immediate
problem of this? Is it that you are worried about high-order allocation
success rate using this design?
Vlastimil Babka Oct. 17, 2018, 5:03 p.m. UTC | #5
On 10/17/18 3:58 PM, Mel Gorman wrote:
> Again, as compaction is not guaranteed to find the pageblocks, it would
> be important to either a) consider whether that matters or b) find an
> alternative way of keeping unmerged buddies on separate lists so they
> can be quickly discovered when a high-order allocation fails.

Agree, unmerged buddies could be on a separate freelist from the regular
order-0 freelist. That list could also be preferred for allocations
before the regular one. Then one could e.g. try "direct merging" via
this list when compaction fails, prefer direct merging to compaction
for non-costly-order allocations, or do direct merging when the
allocation context doesn't even allow compaction (atomic etc.).
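
A minimal sketch of the direct merging part, with unmerged_free_list as
a hypothetical per-zone list (nothing like this is in the patchset;
do_merge() as referenced elsewhere in the thread):

static void direct_merge_unmerged(struct zone *zone)
{
        struct page *page;
        unsigned long flags;

        spin_lock_irqsave(&zone->lock, flags);
        while (!list_empty(&zone->unmerged_free_list)) {
                page = list_first_entry(&zone->unmerged_free_list,
                                        struct page, lru);
                list_del(&page->lru);
                /* do_merge() may coalesce with buddies that sit on the
                 * unmerged list themselves and unlink them too, hence
                 * the restart-from-head loop. */
                do_merge(zone, page, 0, get_pageblock_migratetype(page));
        }
        spin_unlock_irqrestore(&zone->lock, flags);
}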

Also I would definitely consider always merging pages freed to
non-MOVABLE pageblocks. We really don't want to increase the
fragmentation in those. However that means it probably won't help the
netperf case?

Vlastimil
Aaron Lu Oct. 18, 2018, 6:48 a.m. UTC | #6
On Wed, Oct 17, 2018 at 07:03:30PM +0200, Vlastimil Babka wrote:
> On 10/17/18 3:58 PM, Mel Gorman wrote:
> > Again, as compaction is not guaranteed to find the pageblocks, it would
> > be important to either a) consider whether that matters or b) find an
> > alternative way of keeping unmerged buddies on separate lists so they
> > can be quickly discovered when a high-order allocation fails.
> 
> Agree, unmerged buddies could be on a separate freelist from the regular
> order-0 freelist. That list could also be preferred for allocations
> before the regular one. Then one could e.g. try "direct merging" via
> this list when compaction fails, prefer direct merging to compaction
> for non-costly-order allocations, or do direct merging when the
> allocation context doesn't even allow compaction (atomic etc.).

One concern regarding "direct merging" of these unmerged pages via this
separate freelist (let's call it unmerged_free_list) is: adjacent
unmerged pages on the unmerged_free_list could be far away from each
other in terms of their physical positions, so during the process of
merging them, the needed high-order page may not be formed in a short
time. Actually, the time could be unbounded in a bad condition when:
1. unmerged pages adjacent on the unmerged_free_list happen to be far
   away from each other in terms of their physical positions; and
2. there are a lot of unmerged pages on the unmerged_free_list.

That's the reason I hooked the merging of unmerged pages in compaction
when isolate_migratepages_block() is scanning every page of a pageblock
in PFN order.

OTOH, if there is a kernel thread trying to reduce fragmentation by
doing merges for these unmerged pages, I think it's perfectly fine to let
it iterate all unmerged pages of that list and do_merge() for all of
them.

So what about this: if kcompactd is running, let it handle these
unmerged pages on the list and after that, do its usual job of
compaction. If direct compaction is running, do not handle unmerged
pages on that list but rely on isolate_migratepages_block() to do the
merging as is done in this patchset.
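
As a sketch of that split (kcompactd_do_work() is the existing kcompactd
entry point; the hook below is hypothetical and reuses the
direct_merge_unmerged() sketch from your mail):

static void kcompactd_merge_then_compact(pg_data_t *pgdat)
{
        int zoneid;

        /* Drain the deferred merges first... */
        for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
                struct zone *zone = &pgdat->node_zones[zoneid];

                if (!populated_zone(zone))
                        continue;
                direct_merge_unmerged(zone);
        }
        /* ...then do the usual compaction job. */
        kcompactd_do_work(pgdat);
}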

This of course has the effect of tying compaction with 'lazy merging'.
If it is not desirable, what about creating a new kernel thread to do
the merging of unmerged pages on the list while keeping the behaviour of
isolate_migratepages_block() in this patchset to improve compaction
success rate?

> Also I would definitely consider always merging pages freed to
> non-MOVABLE pageblocks. We really don't want to increase the
> fragmentation in those. However that means it probably won't help the
> netperf case?

Yes, that would be unfortunate for all in-kernel users of page
allocator...
Vlastimil Babka Oct. 18, 2018, 8:23 a.m. UTC | #7
On 10/18/18 8:48 AM, Aaron Lu wrote:
> On Wed, Oct 17, 2018 at 07:03:30PM +0200, Vlastimil Babka wrote:
>> On 10/17/18 3:58 PM, Mel Gorman wrote:
>>> Again, as compaction is not guaranteed to find the pageblocks, it would
>>> be important to either a) consider whether that matters or b) find an
>>> alternative way of keeping unmerged buddies on separate lists so they
>>> can be quickly discovered when a high-order allocation fails.
>>
>> Agree, unmerged buddies could be on a separate freelist from the regular
>> order-0 freelist. That list could also be preferred for allocations
>> before the regular one. Then one could e.g. try "direct merging" via
>> this list when compaction fails, prefer direct merging to compaction
>> for non-costly-order allocations, or do direct merging when the
>> allocation context doesn't even allow compaction (atomic etc.).
> 
> One concern regarding "direct merging" of these unmerged pages via this
> separate freelist (let's call it unmerged_free_list) is: adjacent
> unmerged pages on the unmerged_free_list could be far away from each
> other in terms of their physical positions, so during the process of
> merging them, the needed high-order page may not be formed in a short
> time. Actually, the time could be unbounded in a bad condition when:
> 1. unmerged pages adjacent on the unmerged_free_list happen to be far
>    away from each other in terms of their physical positions; and

I'm not sure I understand. Why should it matter for merging if pages are
adjacent on the unmerged_free_list? The buddy for merging is found the
usual way, no?

> 2. there are a lot of unmerged pages on the unmerged_free_list.

That will affect allocation latency, yeah. Still might be faster than
direct compaction. And possible to do in GFP_ATOMIC context, unlike
direct compaction.

> That's the reason I hooked the merging of unmerged pages in compaction
> when isolate_migratepages_block() is scanning every page of a pageblock
> in PFN order.
> 
> OTOH, if there is a kernel thread trying to reduce fragmentation by
> doing merges for these unmerged pages, I think it's perfectly fine to let
> it iterate all unmerged pages of that list and do_merge() for all of
> them.
> 
> So what about this: if kcompactd is running, let it handle these
> unmerged pages on the list and after that, do its usual job of
> compaction. If direct compaction is running, do not handle unmerged
> pages on that list but rely on isolate_migratepages_block() to do the
> merging as is done in this patchset.
> 
> This of course has the effect of tying compaction with 'lazy merging'.
> If it is not desirable, what about creating a new kernel thread to do
> the merging of unmerged pages on the list while keeping the behaviour of
> isolate_migratepages_block() in this patchset to improve compaction
> success rate?

Note that anything based on daemons will seem like reducing latency for
allocations, but if we delay merging and then later do it from a daemon,
the overall zone lock times will be essentially the same, right? The
reduced zone lock benefits only happen when the unmerged pages get
reallocated.

>> Also I would definitely consider always merging pages freed to
>> non-MOVABLE pageblocks. We really don't want to increase the
>> fragmentation in those. However that means it probably won't help the
>> netperf case?
> 
> Yes, that would be unfortunate for all in-kernel users of page
> allocator...

In that case there should definitely be a direct merging possibility
IMHO, even if only as a last resort before stealing from another pageblock.
Aaron Lu Oct. 18, 2018, 11:07 a.m. UTC | #8
On Thu, Oct 18, 2018 at 10:23:22AM +0200, Vlastimil Babka wrote:
> On 10/18/18 8:48 AM, Aaron Lu wrote:
> > On Wed, Oct 17, 2018 at 07:03:30PM +0200, Vlastimil Babka wrote:
> >> On 10/17/18 3:58 PM, Mel Gorman wrote:
> >>> Again, as compaction is not guaranteed to find the pageblocks, it would
> >>> be important to either a) consider whether that matters or b) find an
> >>> alternative way of keeping unmerged buddies on separate lists so they
> >>> can be quickly discovered when a high-order allocation fails.
> >>
> >> Agree, unmerged buddies could be on a separate freelist from the regular
> >> order-0 freelist. That list could also be preferred for allocations
> >> before the regular one. Then one could e.g. try "direct merging" via
> >> this list when compaction fails, prefer direct merging to compaction
> >> for non-costly-order allocations, or do direct merging when the
> >> allocation context doesn't even allow compaction (atomic etc.).
> > 
> > One concern regarding "direct merging" of these unmerged pages via this
> > separate freelist (let's call it unmerged_free_list) is: adjacent
> > unmerged pages on the unmerged_free_list could be far away from each
> > other in terms of their physical positions, so during the process of
> > merging them, the needed high-order page may not be formed in a short
> > time. Actually, the time could be unbounded in a bad condition when:
> > 1. unmerged pages adjacent on the unmerged_free_list happen to be far
> >    away from each other in terms of their physical positions; and
> 
> I'm not sure I understand. Why should it matter for merging if pages are
> adjacent on the unmerged_free_list? The buddy for merging is found the
> usual way, no?

Yes, it's found the usual way. I probably didn't state it clearly, let me
try again.

Consider a pageblock that is initially a free order-9 page. Let's assume
this order-9 page is expand()ed into 512 order-0 pages during different
allocation requests and they go to different applications running on
different CPUs. After some time, all of them are freed back, but each
of them is freed back at a different time, so they are not adjacent on
the unmerged_free_list (they could be far away from each other).

In the above scenario, merging pages on the unmerged_free_list one by one
may not be an efficient way to form a high-order page, but scanning a
pageblock PFN-wise could be.

Of course, the above scenario is imagined by me as a worst case; the
normal case could be much better.
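
The buddy math behind this worst case (buddy_pfn() below mirrors the
kernel's __find_buddy_pfn() relation; the rest is just illustration):

static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
{
        return pfn ^ (1UL << order);    /* flip one PFN bit */
}

/*
 * Rebuilding the order-9 page takes 511 merges done bottom-up: pfn 0
 * with pfn 1, then the order-1 page at pfn 0 with the one at pfn 2,
 * and so on. A PFN-order scan of the pageblock always feeds do_merge()
 * pages whose lower-order buddies were just merged, so the cascade
 * completes in one pass; taking pages in unmerged_free_list order
 * gives no such guarantee.
 */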

> 
> > 2. there are a lot of unmerged pages on the unmerged_free_list.
> 
> That will affect allocation latency, yeah. Still might be faster than
> direct compaction. And possible to do in GFP_ATOMIC context, unlike
> direct compaction.

I see, but I'm not sure if it is OK to do "direct merging" in GFP_ATOMIC
context - it is better for cases where failure to allocate the high-order
page is very bad, but it might not be a good idea if the caller has a
fallback mechanism, i.e. if the high-order page allocation fails, they
can work with order-0.

> 
> > That's the reason I hooked the merging of unmerged pages in compaction
> > when isolate_migratepages_block() is scanning every page of a pageblock
> > in PFN order.
> > 
> > OTOH, if there is a kernel thread trying to reduce fragmentation by
> > doing merges for these unmerged pages, I think it's perfectly fine to let
> > it iterate all unmerged pages of that list and do_merge() for all of
> > them.
> > 
> > So what about this: if kcompactd is running, let it handle these
> > unmerged pages on the list and after that, do its usual job of
> > compaction. If direct compaction is running, do not handle unmerged
> > pages on that list but rely on isolate_migratepages_block() to do the
> > merging as is done in this patchset.
> > 
> > This of course has the effect of tying compaction with 'lazy merging'.
> > If it is not desirable, what about creating a new kernel thread to do
> > the merging of unmerged pages on the list while keeping the behaviour of
> > isolate_migratepages_block() in this patchset to improve compaction
> > success rate?
> 
> Note that anything based on daemons will seem like reducing latency for
> allocations, but if we delay merging and then later do it from a daemon,
> the overall zone lock times will be essentially the same, right? The
> reduced zone lock benefits only happen when the unmerged pages get
> reallocated.

Agree.
 
> >> Also I would definitely consider always merging pages freed to
> >> non-MOVABLE pageblocks. We really don't want to increase the
> >> fragmentation in those. However that means it probably won't help the
> >> netperf case?
> > 
> > Yes, that would be unfortunate for all in-kernel users of page
> > allocator...
> 
> In that case there should definitely be a direct merging possibility
> IMHO, even if only as a last resort before stealing from another pageblock.
Mel Gorman Oct. 18, 2018, 11:16 a.m. UTC | #9
On Wed, Oct 17, 2018 at 10:59:04PM +0800, Aaron Lu wrote:
> > Any particular reason why? I assume it's related to the number of zone
> > locks with the increased number of zones and the number of threads used
> > for the test.
> 
> I think so too.
> 
> The 4-socket server has 192 CPUs in total while the 2-socket server has
> 112 CPUs in total. Assuming only ZONE_NORMAL zones are used, the 4-socket
> server would have 192/4=48 CPUs per zone while the 2-socket server has
> 112/2=56 CPUs per zone. The test is started with nr_task=nr_cpu, so the
> 2-socket server ends up having more CPUs consuming each zone.
> 

Nice that the prediction is accurate. It brings us to another option --
breaking up the zone lock by either hash or address space ranges. The
address space ranges would probably be easier to implement. Where it
gets hairy is that PFN walkers would need different zone locks. However,
overall it might be a better option because it's not order-0 specific.

It would be a lot of legwork because all uses of the zone lock would
have to be audited to see which ones protect the free lists and which
ones protect "something else".
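
Purely as an illustration of the shape such a split could take (every
name below is invented; the free lists would also have to be split per
range so that each lock protects only its own range's lists):

struct zone_range_locks {
        spinlock_t      *locks;         /* one lock per PFN range */
        unsigned long   start_pfn;      /* zone->zone_start_pfn */
        unsigned int    range_shift;    /* log2(PFNs per range) */
};

static inline spinlock_t *zone_pfn_lock(struct zone_range_locks *rl,
                                        unsigned long pfn)
{
        /* A PFN walker takes the lock covering the page it inspects. */
        return &rl->locks[(pfn - rl->start_pfn) >> rl->range_shift];
}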

> > That's important to know. It does reduce the utility of the patch
> > somewhat but not all arches support THP and THP is not always enabled on
> > x86.
> 
> I have always wondered how systems are making use of THP.
> After all, when a system has been running a while (days or months), file
> cache should have consumed a lot of memory and high-order pages will
> become more and more scarce. If order-9 pages can't be reliably
> allocated, will workloads rely on them?
> Just a thought.
> 

File cache can usually be trivially reclaimed and moved. It's a "how
long is a piece of string" question to determine at what point a system
can get fragmented and whether that can be prevented. It's somewhat
outside the scope of this patch, but anecdotally I'm looking at a machine
with 20 days uptime that still has 2390GB worth of THPs free after a
large amount of reclaim activity over the system lifetime, so
fragmentation avoidance does work in some cases.

> THP is of course pretty neat in that it reduces TLB cost, needs fewer
> page tables, etc. I just wonder whether people really rely on it, or use
> it after their system has been up for a long time.
> 

If people didn't rely on it then we might as well delete THP and declare
the whole tmpfs-backed-THP as worthless.

> > Yes, but note that the concept is still problematic.
> > isolate_migratepages_block is not guaranteed to find a pageblock with
> > unmerged buddies in it. If there are pageblocks towards the end of the
> > zone with unmerged pages, they may never be found. This will be very hard
> > to detect at runtime because it's heavily dependent on the exact state
> > of the system.
> 
> Quite true.
> 
> The intent here, though, is not to have compaction merge back all
> unmerged pages, but to do the merge for these unmerged pages in a
> piggyback way, i.e. since isolate_migratepages_block() is doing the
> scan anyway, why don't we let it handle these unmerged pages when it
> meets them?
>
> If for some reason isolate_migratepages_block() didn't meet a single
> unmerged page before compaction succeeded, we probably do not need to
> worry much yet, since compaction succeeded anyway.
> 

I don't think this is the right way of thinking about it because it's
possible to have the system split in such a way that the migration
scanner only encounters unmovable pages before it meets the free scanner
where unmerged buddies were in the higher portion of the address space.

You either need to keep unmerged buddies on a separate list or search
the order-0 free list for merge candidates prior to compaction.

> > It's needed to form them efficiently but excessive reclaim or writing 3
> > to drop_caches can also do it. Be careful of tying lazy buddy too
> > closely to compaction.
> 
> That's the current design of this patchset, do you see any immediate
> problem of this? Is it that you are worried about high-order allocation
> success rate using this design?

I've pointed out what I see as the design flaws, but yes, in general, I'm
worried about the high-order allocation success rate using this design,
the reliance on compaction, and the fact that the primary motivation is
when THP is disabled.
Aaron Lu Oct. 19, 2018, 5:57 a.m. UTC | #10
On Thu, Oct 18, 2018 at 12:16:32PM +0100, Mel Gorman wrote:
> On Wed, Oct 17, 2018 at 10:59:04PM +0800, Aaron Lu wrote:
> > > Any particular reason why? I assume it's related to the number of zone
> > > locks with the increased number of zones and the number of threads used
> > > for the test.
> > 
> > I think so too.
> > 
> > The 4-socket server has 192 CPUs in total while the 2-socket server has
> > 112 CPUs in total. Assuming only ZONE_NORMAL zones are used, the 4-socket
> > server would have 192/4=48 CPUs per zone while the 2-socket server has
> > 112/2=56 CPUs per zone. The test is started with nr_task=nr_cpu, so the
> > 2-socket server ends up having more CPUs consuming each zone.
> > 
> 
> Nice that the prediction is accurate. It brings us to another option --
> breaking up the zone lock by either hash or address space ranges. The
> address space ranges would probably be easier to implement. Where it
> gets hairy is that PFN walkers would need different zone locks. However,
> overall it might be a better option because it's not order-0 specific.

I think the 'address space range' lock is worth a try.

> It would be a lot of legwork because all uses of the zone lock would
> have to be audited to see which ones protect the free lists and which
> ones protect "something else".

Yes, a lot of details.

> > > That's important to know. It does reduce the utility of the patch
> > > somewhat but not all arches support THP and THP is not always enabled on
> > > x86.
> > 
> > I have always wondered how systems are making use of THP.
> > After all, when a system has been running a while (days or months), file
> > cache should have consumed a lot of memory and high-order pages will
> > become more and more scarce. If order-9 pages can't be reliably
> > allocated, will workloads rely on them?
> > Just a thought.
> > 
> 
> File cache can usually be trivially reclaimed and moved. It's a "how
> long is a piece of string" question to determine at what point a system
> can get fragmented and whether that can be prevented. It's somewhat
> outside the scope of this patch, but anecdotally I'm looking at a machine
> with 20 days uptime that still has 2390GB worth of THPs free after a
> large amount of reclaim activity over the system lifetime, so
> fragmentation avoidance does work in some cases.

Good to know, thanks.

> 
> > THP is of course pretty neat in that it reduces TLB cost, needs fewer
> > page tables, etc. I just wonder whether people really rely on it, or use
> > it after their system has been up for a long time.
> > 
> 
> If people didn't rely on it then we might as well delete THP and declare
> the whole tmpfs-backed-THP as worthless.
> 
> > > Yes, but note that the concept is still problematic.
> > > isolate_migratepages_block is not guaranteed to find a pageblock with
> > > unmerged buddies in it. If there are pageblocks towards the end of the
> > > zone with unmerged pages, they may never be found. This will be very hard
> > > to detect at runtime because it's heavily dependent on the exact state
> > > of the system.
> > 
> > Quite true.
> > 
> > The intent here, though, is not to have compaction merge back all
> > unmerged pages, but to do the merge for these unmerged pages in a
> > piggyback way, i.e. since isolate_migratepages_block() is doing the
> > scan anyway, why don't we let it handle these unmerged pages when it
> > meets them?
> >
> > If for some reason isolate_migratepages_block() didn't meet a single
> > unmerged page before compaction succeeded, we probably do not need to
> > worry much yet, since compaction succeeded anyway.
> > 
> 
> I don't think this is the right way of thinking about it because it's
> possible to have the system split in such a way that the migration
> scanner only encounters unmovable pages before it meets the free scanner
> where unmerged buddies were in the higher portion of the address space.

Yes, it is possible that unmerged pages are in the higher portion.

My understanding is that when the two scanners meet, all unmerged pages
will either have been used by the free scanner as migration targets or
sent to be merged by the migration scanner.

> 
> You either need to keep unmerged buddies on a separate list or search
> the order-0 free list for merge candidates prior to compaction.
> 
> > > It's needed to form them efficiently but excessive reclaim or writing 3
> > > to drop_caches can also do it. Be careful of tying lazy buddy too
> > > closely to compaction.
> > 
> > That's the current design of this patchset, do you see any immediate
> > problem of this? Is it that you are worried about high-order allocation
> > success rate using this design?
> 
> I've pointed out what I see as the design flaws, but yes, in general, I'm
> worried about the high-order allocation success rate using this design,
> the reliance on compaction, and the fact that the primary motivation is
> when THP is disabled.

When THP is in use, zone lock contention is pretty much nowhere :-)

I'll see what I can get with 'address space range' lock first and will
come back to 'lazy buddy' if it doesn't work out. Thank you and
Vlastimil for all the suggestions.
Mel Gorman Oct. 19, 2018, 8:54 a.m. UTC | #11
On Fri, Oct 19, 2018 at 01:57:03PM +0800, Aaron Lu wrote:
> > 
> > I don't think this is the right way of thinking about it because it's
> > possible to have the system split in such a way that the migration
> > scanner only encounters unmovable pages before it meets the free scanner
> > where unmerged buddies were in the higher portion of the address space.
> 
> Yes it is possible unmerged pages are in the higher portion.
> 
> My understanding is, when the two scanners meet, all unmerged pages will
> either be used by the free scanner as migrate targets or sent for merging
> by the migration scanner.
> 

It's not guaranteed if the lower portion of the address space consisted
entirely of pages that cannot migrate (because they are unmovable or because
migration failed due to pins). It's actually a fundamental limitation
of compaction that it can miss migration and compaction opportunities
due to how the scanners are implemented. It was designed that way to
avoid pageblocks being migrated unnecessarily back and forth but the
downside is missed opportunities.

> > You either need to keep unmerged buddies on a separate list or search
> > the order-0 free list for merge candidates prior to compaction.
> > 
> > > > It's needed to form them efficiently but excessive reclaim or writing 3
> > > > to drop_caches can also do it. Be careful of tying lazy buddy too
> > > > closely to compaction.
> > > 
> > > That's the current design of this patchset, do you see any immediate
> > > problem with this? Is it that you are worried about high-order allocation
> > > success rate using this design?
> > 
> > I've pointed out what I see are the design flaws but yes, in general, I'm
> > worried about the high order allocation success rate using this design,
> > the reliance on compaction and the fact that the primary motivation is
> > when THP is disabled.
> 
> When THP is in use, zone lock contention is pretty much nowhere :-)
> 
> I'll see what I can get with 'address space range' lock first and will
> come back to 'lazy buddy' if it doesn't work out. Thank you and
> Vlastimil for all the suggestions.

My pleasure.
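
To make the "keep unmerged buddies on a separate list" alternative above
concrete, here is an assumption-laden userspace model; the array-based list
and all helper names are invented, and nothing here is kernel code. Freed
order-0 pages are remembered on a side list, and a pass run before
compaction drains it, so unmerged buddies cannot be missed even when the
scanners never reach their pageblocks:

#include <stdio.h>

#define NR_PAGES 16

static int page_free[NR_PAGES];

/* The "separate list": pfns whose order-0 merge was skipped. */
static int unmerged[NR_PAGES];
static int nr_unmerged;

static void free_page_skip_merge(int pfn)
{
	page_free[pfn] = 1;
	unmerged[nr_unmerged++] = pfn;	/* remember it for a later pass */
}

/* Pre-compaction pass: drain the side list, merging where possible. */
static void merge_deferred(void)
{
	while (nr_unmerged > 0) {
		int pfn = unmerged[--nr_unmerged];
		int buddy = pfn ^ 1;	/* order-0 buddy */

		if (page_free[buddy])
			printf("can merge pfn %d with buddy pfn %d\n",
			       pfn, buddy);
	}
}

int main(void)
{
	free_page_skip_merge(4);
	free_page_skip_merge(5);
	merge_deferred();
	return 0;
}

In this toy version the pair is reported once from each side; a real
implementation would take the buddy off the list when the merge happens.
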
Daniel Jordan Oct. 19, 2018, 3 p.m. UTC | #12
On Fri, Oct 19, 2018 at 09:54:35AM +0100, Mel Gorman wrote:
> On Fri, Oct 19, 2018 at 01:57:03PM +0800, Aaron Lu wrote:
> > > 
> > > I don't think this is the right way of thinking about it because it's
> > > possible to have the system split in such a way that the migration
> > > scanner only encounters unmovable pages before it meets the free scanner
> > > where unmerged buddies were in the higher portion of the address space.
> > 
> > Yes it is possible unmerged pages are in the higher portion.
> > 
> > My understanding is, when the two scanners meet, all unmerged pages will
> > either be used by the free scanner as migrate targets or sent for merging
> > by the migration scanner.
> > 
> 
> It's not guaranteed if the lower portion of the address space consisted
> entirely of pages that cannot migrate (because they are unmovable or because
> migration failed due to pins). It's actually a fundamental limitation
> of compaction that it can miss migration and compaction opportunities
> due to how the scanners are implemented. It was designed that way to
> avoid pageblocks being migrated unnecessarily back and forth but the
> downside is missed opportunities.
> 
> > > You either need to keep unmerged buddies on a separate list or search
> > > the order-0 free list for merge candidates prior to compaction.
> > > 
> > > > > It's needed to form them efficiently but excessive reclaim or writing 3
> > > > > to drop_caches can also do it. Be careful of tying lazy buddy too
> > > > > closely to compaction.
> > > > 
> > > > That's the current design of this patchset, do you see any immediate
> > > > problem with this? Is it that you are worried about high-order allocation
> > > > success rate using this design?
> > > 
> > > I've pointed out what I see are the design flaws but yes, in general, I'm
> > > worried about the high order allocation success rate using this design,
> > > the reliance on compaction and the fact that the primary motivation is
> > > when THP is disabled.
> > 
> > When THP is in use, zone lock contention is pretty much nowhere :-)
> > 
> > I'll see what I can get with 'address space range' lock first and will
> > come back to 'lazy buddy' if it doesn't work out.

With the address space range idea, wouldn't the zone free_area require changes
too?  Otherwise I can't see how locking by address range could synchronize it
as it exists now, with per order/mt list heads.

One idea is to further subdivide the free area according to how the locking
works and find some reasonable way to handle having to search for pages of a
given order/mt in multiple places.
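
A rough sketch of the subdivision described above, under stated
assumptions: the range count and fields are invented, pthread mutexes
stand in for zone->lock, and migratetypes are omitted. The zone's single
free_area becomes one free area per address range, each under its own
lock, with the allocation path paying the cost of searching ranges for a
page of the wanted order:

#include <pthread.h>
#include <stdio.h>

#define NR_RANGES	4
#define MAX_ORDER	11

struct range_free_area {
	pthread_mutex_t	lock;			/* per-range stand-in for zone->lock */
	unsigned long	nr_free[MAX_ORDER];	/* per-order free counts */
};

static struct range_free_area ranges[NR_RANGES];

/* Find a range with a free page of 'order'; lock only that range. */
static int alloc_from_ranges(unsigned int order)
{
	for (int i = 0; i < NR_RANGES; i++) {
		pthread_mutex_lock(&ranges[i].lock);
		if (ranges[i].nr_free[order]) {
			ranges[i].nr_free[order]--;
			pthread_mutex_unlock(&ranges[i].lock);
			return i;
		}
		pthread_mutex_unlock(&ranges[i].lock);
	}
	return -1;	/* no range has a page of this order */
}

int main(void)
{
	for (int i = 0; i < NR_RANGES; i++)
		pthread_mutex_init(&ranges[i].lock, NULL);
	ranges[2].nr_free[0] = 1;
	printf("order-0 page found in range %d\n", alloc_from_ranges(0));
	return 0;
}

The linear walk is exactly the "search in multiple places" cost raised
above: frees contend only on their own range's lock, but allocation may
touch every range in the worst case.
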
Aaron Lu Oct. 20, 2018, 9 a.m. UTC | #13
On Fri, Oct 19, 2018 at 08:00:53AM -0700, Daniel Jordan wrote:
> On Fri, Oct 19, 2018 at 09:54:35AM +0100, Mel Gorman wrote:
> > On Fri, Oct 19, 2018 at 01:57:03PM +0800, Aaron Lu wrote:
> > > > 
> > > > I don't think this is the right way of thinking about it because it's
> > > > possible to have the system split in such a way that the migration
> > > > scanner only encounters unmovable pages before it meets the free scanner
> > > > where unmerged buddies were in the higher portion of the address space.
> > > 
> > > Yes it is possible unmerged pages are in the higher portion.
> > > 
> > > My understanding is, when the two scanners meet, all unmerged pages will
> > > either be used by the free scanner as migrate targets or sent for merging
> > > by the migration scanner.
> > > 
> > 
> > It's not guaranteed if the lower portion of the address space consisted
> > entirely of pages that cannot migrate (because they are unmovable or because
> > migration failed due to pins). It's actually a fundamental limitation
> > of compaction that it can miss migration and compaction opportunities
> > due to how the scanners are implemented. It was designed that way to
> > avoid pageblocks being migrated unnecessarily back and forth but the
> > downside is missed opportunities.
> > 
> > > > You either need to keep unmerged buddies on a separate list or search
> > > > the order-0 free list for merge candidates prior to compaction.
> > > > 
> > > > > > It's needed to form them efficiently but excessive reclaim or writing 3
> > > > > > to drop_caches can also do it. Be careful of tying lazy buddy too
> > > > > > closely to compaction.
> > > > > 
> > > > > That's the current design of this patchset, do you see any immediate
> > > > > problem with this? Is it that you are worried about high-order allocation
> > > > > success rate using this design?
> > > > 
> > > > I've pointed out what I see are the design flaws but yes, in general, I'm
> > > > worried about the high order allocation success rate using this design,
> > > > the reliance on compaction and the fact that the primary motivation is
> > > > when THP is disabled.
> > > 
> > > When THP is in use, zone lock contention is pretty much nowhere :-)
> > > 
> > > I'll see what I can get with 'address space range' lock first and will
> > > come back to 'lazy buddy' if it doesn't work out.
> 
> With the address space range idea, wouldn't the zone free_area require changes
> too?  Otherwise I can't see how locking by address range could synchronize it
> as it exists now, with per order/mt list heads.
> 
> One idea is to further subdivide the free area according to how the locking
> works and find some reasonable way to handle having to search for pages of a
> given order/mt in multiple places.

I plan to create one free_area per 'address space range'. The challenge
will be how to quickly locate a free_area that has the required free
page on the allocation path. Other details, like how big the address
space range should be etc., will need to be explored with testing.

I think this approach is worth a try because it wouldn't cause
fragmentation.
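
One purely illustrative way to attack the lookup problem mentioned above;
the bitmap scheme below is an assumption for illustration, not something
proposed in the thread. A per-order bitmap records which ranges still have
free pages, so the allocation path can find a candidate range with a single
find-first-set instead of walking every range:

#include <stdio.h>
#include <strings.h>	/* ffs() */

#define NR_RANGES	32	/* must fit in one unsigned int bitmap */
#define MAX_ORDER	11

/* Bit i of range_has_free[order] set: range i has free pages of 'order'. */
static unsigned int range_has_free[MAX_ORDER];

static void mark_range_free(int range, unsigned int order)
{
	range_has_free[order] |= 1U << range;
}

/* O(1) candidate lookup replacing a walk over all ranges. */
static int find_range(unsigned int order)
{
	int bit = ffs((int)range_has_free[order]);	/* 0 if no bit set */

	return bit ? bit - 1 : -1;
}

int main(void)
{
	mark_range_free(7, 0);
	printf("order-0 candidate range: %d\n", find_range(0));
	return 0;
}

Clearing bits on allocation, and keeping the bitmap consistent without
reintroducing a shared lock, would be the hard part; presumably that is
what the testing mentioned above would have to settle.
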
diff mbox series

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..aed93053ef6e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -179,8 +179,13 @@  struct page {
 		int units;			/* SLOB */
 	};
 
-	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
-	atomic_t _refcount;
+	union {
+		/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
+		atomic_t _refcount;
+
+		/* For pages in Buddy: if skipped merging when added to Buddy */
+		bool buddy_merge_skipped;
+	};
 
 #ifdef CONFIG_MEMCG
 	struct mem_cgroup *mem_cgroup;
diff --git a/mm/compaction.c b/mm/compaction.c
index faca45ebe62d..0c9c7a30dde3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -777,8 +777,19 @@  isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 * potential isolation targets.
 		 */
 		if (PageBuddy(page)) {
-			unsigned long freepage_order = page_order_unsafe(page);
+			unsigned long freepage_order;
 
+			/*
+			 * If this is a merge_skipped page, do merge now
+			 * since high-order pages are needed. zone lock
+			 * isn't taken for the merge_skipped check so the
+			 * check could be wrong but the worst case is we
+			 * lose a merge opportunity.
+			 */
+			if (page_merge_was_skipped(page))
+				try_to_merge_page(page);
+
+			freepage_order = page_order_unsafe(page);
 			/*
 			 * Without lock, we cannot be sure that what we got is
 			 * a valid page order. Consider only values in the
diff --git a/mm/internal.h b/mm/internal.h
index 87256ae1bef8..c166735a559e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -527,4 +527,31 @@  static inline bool is_migrate_highatomic_page(struct page *page)
 
 void setup_zone_pageset(struct zone *zone);
 extern struct page *alloc_new_node_page(struct page *page, unsigned long node);
+
+static inline bool page_merge_was_skipped(struct page *page)
+{
+	return page->buddy_merge_skipped;
+}
+
+void try_to_merge_page(struct page *page);
+
+#ifdef CONFIG_COMPACTION
+static inline bool can_skip_merge(struct zone *zone, int order)
+{
+	/* Compaction has failed in this zone, we shouldn't skip merging */
+	if (zone->compact_considered)
+		return false;
+
+	/* Only consider no_merge for order 0 pages */
+	if (order)
+		return false;
+
+	return true;
+}
+#else /* CONFIG_COMPACTION */
+static inline bool can_skip_merge(struct zone *zone, int order)
+{
+	return false;
+}
+#endif  /* CONFIG_COMPACTION */
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 14c20bb3a3da..76d471e0ab24 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -691,6 +691,16 @@  static inline void clear_page_guard(struct zone *zone, struct page *page,
 				unsigned int order, int migratetype) {}
 #endif
 
+static inline void set_page_merge_skipped(struct page *page)
+{
+	page->buddy_merge_skipped = true;
+}
+
+static inline void clear_page_merge_skipped(struct page *page)
+{
+	page->buddy_merge_skipped = false;
+}
+
 static inline void set_page_order(struct page *page, unsigned int order)
 {
 	set_page_private(page, order);
@@ -700,6 +710,7 @@  static inline void set_page_order(struct page *page, unsigned int order)
 static inline void add_to_buddy_common(struct page *page, struct zone *zone,
 					unsigned int order)
 {
+	clear_page_merge_skipped(page);
 	set_page_order(page, order);
 	zone->free_area[order].nr_free++;
 }
@@ -730,6 +741,7 @@  static inline void remove_from_buddy(struct page *page, struct zone *zone,
 	list_del(&page->lru);
 	zone->free_area[order].nr_free--;
 	rmv_page_order(page);
+	clear_page_merge_skipped(page);
 }
 
 /*
@@ -797,7 +809,7 @@  static inline int page_is_buddy(struct page *page, struct page *buddy,
  * -- nyc
  */
 
-static inline void __free_one_page(struct page *page,
+static inline void do_merge(struct page *page,
 		unsigned long pfn,
 		struct zone *zone, unsigned int order,
 		int migratetype)
@@ -809,16 +821,6 @@  static inline void __free_one_page(struct page *page,
 
 	max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
 
-	VM_BUG_ON(!zone_is_initialized(zone));
-	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
-
-	VM_BUG_ON(migratetype == -1);
-	if (likely(!is_migrate_isolate(migratetype)))
-		__mod_zone_freepage_state(zone, 1 << order, migratetype);
-
-	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
-	VM_BUG_ON_PAGE(bad_range(zone, page), page);
-
 continue_merging:
 	while (order < max_order - 1) {
 		buddy_pfn = __find_buddy_pfn(pfn, order);
@@ -891,6 +893,61 @@  static inline void __free_one_page(struct page *page,
 	add_to_buddy_head(page, zone, order, migratetype);
 }
 
+void try_to_merge_page(struct page *page)
+{
+	unsigned long pfn, buddy_pfn, flags;
+	struct page *buddy;
+	struct zone *zone;
+
+	/*
+	 * No need to do merging if buddy is not free.
+	 * zone lock isn't taken so this could be wrong but worst case
+	 * is we lose a merge opportunity.
+	 */
+	pfn = page_to_pfn(page);
+	buddy_pfn = __find_buddy_pfn(pfn, 0);
+	buddy = page + (buddy_pfn - pfn);
+	if (!PageBuddy(buddy))
+		return;
+
+	zone = page_zone(page);
+	spin_lock_irqsave(&zone->lock, flags);
+	/* Verify again after taking the lock */
+	if (likely(PageBuddy(page) && page_merge_was_skipped(page) &&
+		   PageBuddy(buddy))) {
+		int mt = get_pageblock_migratetype(page);
+
+		remove_from_buddy(page, zone, 0);
+		do_merge(page, pfn, zone, 0, mt);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+static inline void __free_one_page(struct page *page,
+		unsigned long pfn,
+		struct zone *zone, unsigned int order,
+		int migratetype)
+{
+	VM_BUG_ON(!zone_is_initialized(zone));
+	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
+
+	VM_BUG_ON(migratetype == -1);
+	if (likely(!is_migrate_isolate(migratetype)))
+		__mod_zone_freepage_state(zone, 1 << order, migratetype);
+
+	VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
+	VM_BUG_ON_PAGE(bad_range(zone, page), page);
+
+	if (can_skip_merge(zone, order)) {
+		add_to_buddy_head(page, zone, 0, migratetype);
+		set_page_merge_skipped(page);
+		return;
+	}
+
+	do_merge(page, pfn, zone, order, migratetype);
+}
+
+
 /*
  * A bad page could be due to a number of fields. Instead of multiple branches,
  * try and check multiple fields with one check. The caller must do a detailed
@@ -1148,9 +1205,14 @@  static void free_pcppages_bulk(struct zone *zone, int count,
 			 * can be offset by reduced memory latency later. To
 			 * avoid excessive prefetching due to large count, only
 			 * prefetch buddy for the first pcp->batch nr of pages.
+			 *
+			 * If merge can be skipped, no need to prefetch buddy.
 			 */
-			if (prefetch_nr++ < pcp->batch)
-				prefetch_buddy(page);
+			if (can_skip_merge(zone, 0) || prefetch_nr > pcp->batch)
+				continue;
+
+			prefetch_buddy(page);
+			prefetch_nr++;
 		} while (--count && --batch_free && !list_empty(list));
 	}