Patchwork mm, compaction: direct freepage allocation for async direct compaction

login
register
mail settings
Submitter Johannes Weiner
Date Nov. 22, 2017, 2:33 p.m.
Message ID <20171122143321.29501-1-hannes@cmpxchg.org>
Download mbox | patch
Permalink /patch/10070513/
State New
Headers show

Comments

Johannes Weiner - Nov. 22, 2017, 2:33 p.m.
From: Vlastimil Babka <vbabka@suse.cz>

The goal of direct compaction is to quickly make a high-order page available
for the pending allocation. The free page scanner can add significant latency
when searching for migration targets, although to succeed the compaction, the
only important limit on the target free pages is that they must not come from
the same order-aligned block as the migrated pages.

This patch therefore makes direct async compaction allocate freepages directly
from freelists. Pages that do come from the same block (which we cannot simply
exclude from the freelist allocation) are put on separate list and released
only after migration to allow them to merge.

In addition to reduced stall, another advantage is that we split larger free
pages for migration targets only when smaller pages are depleted, while the
free scanner can split pages up to (order - 1) as it encouters them. However,
this approach likely sacrifices some of the long-term anti-fragmentation
features of a thorough compaction, so we limit the direct allocation approach
to direct async compaction.

For observational purposes, the patch introduces two new counters to
/proc/vmstat. compact_free_direct_alloc counts how many pages were allocated
directly without scanning, and compact_free_direct_miss counts the subset of
these allocations that were from the wrong range and had to be held on the
separate list.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---

Hi. I'm resending this because we've been struggling with the cost of
compaction in our fleet, and this patch helps substantially.

On 128G+ machines, we have seen isolate_freepages_block() eat up 40%
of the CPU cycles and scanning up to a billion PFNs per minute. Not in
a spike, but continuously, to service higher-order allocations from
the network stack, fork (non-vmap stacks), THP, etc. during regular
operation.

I've been running this patch on a handful of less-affected but still
pretty bad machines for a week, and the results look pretty great:

	http://cmpxchg.org/compactdirectalloc/compactdirectalloc.png

Note the two different scales - otherwise the compact_free_direct
lines wouldn't be visible. The free scanner peaks close to 10M pages
checked per minute, whereas the direct allocations peak at under 180
per minute, direct misses at 50.

The work doesn't increase over this period, which is a good sign that
long-term we're not trending toward worse fragmentation.

There was an outstanding concern from Joonsoo regarding this patch -
https://marc.info/?l=linux-mm&m=146035962702122&w=2 - although that
didn't seem to affect us much in practice.

 include/linux/vm_event_item.h |  1 +
 mm/compaction.c               | 53 ++++++++++++++++++++++++++++++++++++++++++-
 mm/internal.h                 |  4 ++++
 mm/page_alloc.c               | 27 ++++++++++++++++++++++
 mm/vmstat.c                   |  2 ++
 5 files changed, 86 insertions(+), 1 deletion(-)
Mel Gorman - Nov. 23, 2017, 2:08 p.m.
On Wed, Nov 22, 2017 at 09:33:21AM -0500, Johannes Weiner wrote:
> From: Vlastimil Babka <vbabka@suse.cz>
> 
> The goal of direct compaction is to quickly make a high-order page available
> for the pending allocation. The free page scanner can add significant latency
> when searching for migration targets, although to succeed the compaction, the
> only important limit on the target free pages is that they must not come from
> the same order-aligned block as the migrated pages.
> 
> This patch therefore makes direct async compaction allocate freepages directly
> from freelists. Pages that do come from the same block (which we cannot simply
> exclude from the freelist allocation) are put on separate list and released
> only after migration to allow them to merge.
> 
> In addition to reduced stall, another advantage is that we split larger free
> pages for migration targets only when smaller pages are depleted, while the
> free scanner can split pages up to (order - 1) as it encouters them. However,
> this approach likely sacrifices some of the long-term anti-fragmentation
> features of a thorough compaction, so we limit the direct allocation approach
> to direct async compaction.
> 
> For observational purposes, the patch introduces two new counters to
> /proc/vmstat. compact_free_direct_alloc counts how many pages were allocated
> directly without scanning, and compact_free_direct_miss counts the subset of
> these allocations that were from the wrong range and had to be held on the
> separate list.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
> 
> Hi. I'm resending this because we've been struggling with the cost of
> compaction in our fleet, and this patch helps substantially.
>  

That particular problem only has been gettiing worse as memory sizes get
larger. So broadly speaking I'm happy to see something happen with it but
there were reasons why a linear scanner was settled on originally. They were
not insurmountable problems but not severe enough at the time to justify
the complexity (particularly as THP and high-order were still treated as
"no on cares" problems). Unfortunately, I believe the same problems are
still relevant today;

Lets look closely at the core function that really matters in this
patch, IMO at least.

> diff --git a/mm/compaction.c b/mm/compaction.c
> index 10cd757f1006..ccc9b157f716 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1160,6 +1160,41 @@ static void isolate_freepages(struct compact_control *cc)
>  	cc->free_pfn = isolate_start_pfn;
>  }
>  
> +static void isolate_freepages_direct(struct compact_control *cc)
> +{
> +	unsigned long nr_pages;
> +	unsigned long flags;
> +
> +	nr_pages = cc->nr_migratepages - cc->nr_freepages;
> +
> +	if (!compact_trylock_irqsave(&cc->zone->lock, &flags, cc))
> +		return;
> +
> +	while (nr_pages) {
> +		struct page *page;
> +		unsigned long pfn;
> +
> +		page = alloc_pages_zone(cc->zone, 0, MIGRATE_MOVABLE);
> +		if (!page)
> +			break;
> +		pfn = page_to_pfn(page);
> +
> +		count_compact_event(COMPACTFREE_DIRECT_ALLOC);
> +
> +		/* Is the free page in the block we are migrating from? */
> +		if (pfn >> cc->order ==	(cc->migrate_pfn - 1) >> cc->order) {
> +			list_add(&page->lru, &cc->freepages_held);
> +			count_compact_event(COMPACTFREE_DIRECT_MISS);
> +		} else {
> +			list_add(&page->lru, &cc->freepages);
> +			cc->nr_freepages++;
> +			nr_pages--;
> +		}
> +	}
> +
> +	spin_unlock_irqrestore(&cc->zone->lock, flags);
> +}
> +

1. This indirectly uses __rmqueue to allocate a MIGRATE_MOVABLE page but
   that is allowed to fallback to other pageblocks and potentially even
   steal them. I think it's very bad that an attempt to defragment can
   itself indirectly cause more fragmentation events by altering pageblocks.
   Please consider using __rmqueue_fallback (within alloc_pages_zone of
   course)

2. One of the reasons a linear scanner was used was because I wanted the
   possibility that MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE pageblocks
   would also be scanned and we would avoid future fragmentation events.
   This had a lot of overhead and was reduced since but it's still a
   relevant problem.  Granted, this patch is not the correct place to fix
   that issue and potential solutions have been discussed elsewhere. However,
   this patch potentially means that never happens. It doesn't necessarily
   kill the patch but the long-lived behaviour may be that no compaction
   occurs because all the MIGRATE_MOVABLE pageblocks are full and you'll
   either need to reclaim to fix it or we'll need kcompactd to migration
   MIGRATE_MOVABLE pages from UNMOVABLE and RECLAIMABLE pageblocks out
   of band.

   For THP, this point doesn't matter but if you need this patch for
   high-order allocations for network buffers then at some point, you
   really will have to clean out those pageblocks or it'll degrade.

3. Another reason a linear scanner was used was because we wanted to
   clear entire pageblocks we were migrating from and pack the target
   pageblocks as much as possible. This was to reduce the amount of
   migration required overall even though the scanning hurts. This patch
   takes MIGRATE_MOVABLE pages from anywhere that is "not this pageblock".
   Those potentially have to be moved again and again trying to randomly
   fill a MIGRATE_MOVABLE block. Have you considered using the freelists
   as a hint? i.e. take a page from the freelist, then isolate all free
   pages in the same pageblock as migration targets? That would preserve
   the "packing property" of the linear scanner.

   This would increase the amount of scanning but that *might* be offset by
   the number of migrations the workload does overall. Note that migrations
   potentially are minor faults so if we do too many migrations, your
   workload may suffer.

4. One problem the linear scanner avoids is that a migration target is
   subsequently used as a migration source and leads to a ping-pong effect.
   I don't know how bad this is in practice or even if it's a problem at
   all but it was a concern at the time

5. Consider two processes A and B compacting at the same time with A_s
   and A_t being the source pageblock and target pageblock that process
   A is using and B_s/B_t being B's pageblocks. Nothing prevents A_s ==
   B_t and B_s == A_t. Maybe it rarely happens in practice but it was one
   problem the linear scanner was meant to avoid.

I can't shake the feeling I had another concern when I started this
email but then forgot it before I got to the end so it can't be that
important :(.
Vlastimil Babka - Nov. 23, 2017, 9:15 p.m.
On 11/23/2017 03:08 PM, Mel Gorman wrote:
> 
> 1. This indirectly uses __rmqueue to allocate a MIGRATE_MOVABLE page but
>    that is allowed to fallback to other pageblocks and potentially even
>    steal them. I think it's very bad that an attempt to defragment can
>    itself indirectly cause more fragmentation events by altering pageblocks.
>    Please consider using __rmqueue_fallback (within alloc_pages_zone of
>    course)

Agree. That should be simpler to do in the new version of the patch and
its __rmqueue_compact(). It might happen though that we deplete all free
pages on movable lists. Then the only option is to fallback to others
(aborting compaction in that case makes little sense IMHO) but perhaps
without the usual fallback heuristics of trying to steal the largest
page, whole pageblock etc.

> 2. One of the reasons a linear scanner was used was because I wanted the
>    possibility that MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE pageblocks
>    would also be scanned and we would avoid future fragmentation events.

Hmm are you talking about the free scanner here, or the migration
scanner? The free scanner generally avoids these pageblocks, by the way
of suitable_migration_target() (and I think it used to be like this all
the time). Only recently an override of cc->ignore_block_suitable was added.

>    This had a lot of overhead and was reduced since but it's still a
>    relevant problem.  Granted, this patch is not the correct place to fix
>    that issue and potential solutions have been discussed elsewhere. However,
>    this patch potentially means that never happens. It doesn't necessarily
>    kill the patch but the long-lived behaviour may be that no compaction
>    occurs because all the MIGRATE_MOVABLE pageblocks are full and you'll
>    either need to reclaim to fix it or we'll need kcompactd to migration
>    MIGRATE_MOVABLE pages from UNMOVABLE and RECLAIMABLE pageblocks out
>    of band.
> 
>    For THP, this point doesn't matter but if you need this patch for
>    high-order allocations for network buffers then at some point, you
>    really will have to clean out those pageblocks or it'll degrade.

Hmm this really reads like about the migration scanner. That one is
unchanged by this patch, there is still a linear scanner. In fact, it
gets better, because now it can see the whole zone, not just the first
1/3 - 1/2 until it meets the free scanner (my past observations). And
some time ago the async direct compaction was adjusted so that it only
scans the migratetype matching the allocation (see
suitable_migration_source()). So to some extent, the cleaning already
happens.

> 3. Another reason a linear scanner was used was because we wanted to
>    clear entire pageblocks we were migrating from and pack the target
>    pageblocks as much as possible. This was to reduce the amount of
>    migration required overall even though the scanning hurts. This patch
>    takes MIGRATE_MOVABLE pages from anywhere that is "not this pageblock".
>    Those potentially have to be moved again and again trying to randomly
>    fill a MIGRATE_MOVABLE block. Have you considered using the freelists
>    as a hint? i.e. take a page from the freelist, then isolate all free
>    pages in the same pageblock as migration targets? That would preserve
>    the "packing property" of the linear scanner.
> 
>    This would increase the amount of scanning but that *might* be offset by
>    the number of migrations the workload does overall. Note that migrations
>    potentially are minor faults so if we do too many migrations, your
>    workload may suffer.

I have considered the "freelist as a hint", but I'm kinda sceptical
about it, because with increasing uptime reclaim should be freeing
rather random pages, so finding some free page in a pageblock doesn't
mean there would be more free pages there than in the other pageblocks?

Instead my plan is to make the migration scanner smarter by expanding
the "skip_on_failure" feature in isolate_migratepages_block(). The
scanner should not even start isolating if the block ahead contains a
page that's not free or lru-isolatable/PageMovable. The current
"look-ahead" is effectively limited by COMPACT_CLUSTER_MAX (32) isolated
pages followed by a migration, after which the scanner might immediately
find a non-migratable page, so if it was called for a THP, that work has
been wasted.

This should be straightforward for direct compaction, but will be
trickier for kcompactd becoming more proactive than the current
wake-up-for-order-X.

> 4. One problem the linear scanner avoids is that a migration target is
>    subsequently used as a migration source and leads to a ping-pong effect.
>    I don't know how bad this is in practice or even if it's a problem at
>    all but it was a concern at the time

I think this is "just" the worst case of the problem 3. It would likely
happen with manually triggered compaction from /proc which would create
a range of free pages near the zone start and then move it along to the
zone end, because there's no other termination criteria. The
"look-ahead" heuristics described above might be the trickiest here...

> 5. Consider two processes A and B compacting at the same time with A_s
>    and A_t being the source pageblock and target pageblock that process
>    A is using and B_s/B_t being B's pageblocks. Nothing prevents A_s ==
>    B_t and B_s == A_t. Maybe it rarely happens in practice but it was one
>    problem the linear scanner was meant to avoid.

I hope that ultimately this problem is not worse than the existing
problem where B would not be compacting, but simply allocating the pages
that A just created... Maybe if the "look-ahead" idea turns out to have
high enough success rate of really creating the high-order page where it
decides to isolate and migrate (which probably depends mostly on the
migration failure rate?) we could resurrect the old idea of doing a
pageblock isolation (MIGRATE_ISOLATE) beforehand. That would block all
interference.

> I can't shake the feeling I had another concern when I started this
> email but then forgot it before I got to the end so it can't be that
> important :(.

Thanks a lot for the feedback. I totally see how the approach of two
linear scanners makes many things simpler, but seems we are now really
paying too high a price for the free page scanning. So hopefully there
is a way out, although not a simple one.
Mel Gorman - Nov. 24, 2017, 10:57 a.m.
On Thu, Nov 23, 2017 at 10:15:17PM +0100, Vlastimil Babka wrote:
> On 11/23/2017 03:08 PM, Mel Gorman wrote:
> > 
> > 1. This indirectly uses __rmqueue to allocate a MIGRATE_MOVABLE page but
> >    that is allowed to fallback to other pageblocks and potentially even
> >    steal them. I think it's very bad that an attempt to defragment can
> >    itself indirectly cause more fragmentation events by altering pageblocks.
> >    Please consider using __rmqueue_fallback (within alloc_pages_zone of
> >    course)
> 
> Agree. That should be simpler to do in the new version of the patch and
> its __rmqueue_compact(). It might happen though that we deplete all free
> pages on movable lists. Then the only option is to fallback to others
> (aborting compaction in that case makes little sense IMHO) but perhaps
> without the usual fallback heuristics of trying to steal the largest
> page, whole pageblock etc.
> 

I also should have said __rmqueue_smallest. It was __rmqueue_fallback
that needed to be avoided :(

> > 2. One of the reasons a linear scanner was used was because I wanted the
> >    possibility that MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE pageblocks
> >    would also be scanned and we would avoid future fragmentation events.
> 
> Hmm are you talking about the free scanner here, or the migration
> scanner? The free scanner generally avoids these pageblocks, by the way
> of suitable_migration_target() (and I think it used to be like this all
> the time). Only recently an override of cc->ignore_block_suitable was added.
> 

Migration scanner.

> >    This had a lot of overhead and was reduced since but it's still a
> >    relevant problem.  Granted, this patch is not the correct place to fix
> >    that issue and potential solutions have been discussed elsewhere. However,
> >    this patch potentially means that never happens. It doesn't necessarily
> >    kill the patch but the long-lived behaviour may be that no compaction
> >    occurs because all the MIGRATE_MOVABLE pageblocks are full and you'll
> >    either need to reclaim to fix it or we'll need kcompactd to migration
> >    MIGRATE_MOVABLE pages from UNMOVABLE and RECLAIMABLE pageblocks out
> >    of band.
> > 
> >    For THP, this point doesn't matter but if you need this patch for
> >    high-order allocations for network buffers then at some point, you
> >    really will have to clean out those pageblocks or it'll degrade.
> 
> Hmm this really reads like about the migration scanner. That one is
> unchanged by this patch, there is still a linear scanner. In fact, it
> gets better, because now it can see the whole zone, not just the first
> 1/3 - 1/2 until it meets the free scanner (my past observations). And
> some time ago the async direct compaction was adjusted so that it only
> scans the migratetype matching the allocation (see
> suitable_migration_source()). So to some extent, the cleaning already
> happens.
> 

It is true that the migration scanner may see a subset of the zone but
it was important to avoid a previous migration source becoming a
migration target. The problem is completely different when using the
freelist as a hint.

> > 3. Another reason a linear scanner was used was because we wanted to
> >    clear entire pageblocks we were migrating from and pack the target
> >    pageblocks as much as possible. This was to reduce the amount of
> >    migration required overall even though the scanning hurts. This patch
> >    takes MIGRATE_MOVABLE pages from anywhere that is "not this pageblock".
> >    Those potentially have to be moved again and again trying to randomly
> >    fill a MIGRATE_MOVABLE block. Have you considered using the freelists
> >    as a hint? i.e. take a page from the freelist, then isolate all free
> >    pages in the same pageblock as migration targets? That would preserve
> >    the "packing property" of the linear scanner.
> > 
> >    This would increase the amount of scanning but that *might* be offset by
> >    the number of migrations the workload does overall. Note that migrations
> >    potentially are minor faults so if we do too many migrations, your
> >    workload may suffer.
> 
> I have considered the "freelist as a hint", but I'm kinda sceptical
> about it, because with increasing uptime reclaim should be freeing
> rather random pages, so finding some free page in a pageblock doesn't
> mean there would be more free pages there than in the other pageblocks?
> 

True, but randomly selecting pageblocks based on the contents of the
freelist is not better. If a pageblock has limited free pages then it'll
be filled quickly and not used as a hint in the future.

> Instead my plan is to make the migration scanner smarter by expanding
> the "skip_on_failure" feature in isolate_migratepages_block(). The
> scanner should not even start isolating if the block ahead contains a
> page that's not free or lru-isolatable/PageMovable. The current
> "look-ahead" is effectively limited by COMPACT_CLUSTER_MAX (32) isolated
> pages followed by a migration, after which the scanner might immediately
> find a non-migratable page, so if it was called for a THP, that work has
> been wasted.
> 

That's also not necessarily true because there is a benefit to moving
pages from unmovable blocks to avoid fragmentation later.

> > 5. Consider two processes A and B compacting at the same time with A_s
> >    and A_t being the source pageblock and target pageblock that process
> >    A is using and B_s/B_t being B's pageblocks. Nothing prevents A_s ==
> >    B_t and B_s == A_t. Maybe it rarely happens in practice but it was one
> >    problem the linear scanner was meant to avoid.
> 
> I hope that ultimately this problem is not worse than the existing
> problem where B would not be compacting, but simply allocating the pages
> that A just created... Maybe if the "look-ahead" idea turns out to have
> high enough success rate of really creating the high-order page where it
> decides to isolate and migrate (which probably depends mostly on the
> migration failure rate?) we could resurrect the old idea of doing a
> pageblock isolation (MIGRATE_ISOLATE) beforehand. That would block all
> interference.
> 

Pageblock bits similar to the skip bit could also be used to limit the
problem.

> > I can't shake the feeling I had another concern when I started this
> > email but then forgot it before I got to the end so it can't be that
> > important :(.
> 
> Thanks a lot for the feedback. I totally see how the approach of two
> linear scanners makes many things simpler, but seems we are now really
> paying too high a price for the free page scanning. So hopefully there
> is a way out, although not a simple one.


While the linear scanner solved some problems, I do agree that the overhead
is too high today. However, I think it can be fixed by using the freelist
as a hint, possibly combined with a pageblock bit to avoid hitting some
problems the linear scanner avoids. I do think there is a way out even
though I also think that the complexity would not have been justified
when compaction was first introduced -- partially because it was not clear
the time that the overhead was an issue but mostly because compaction was
initially a huge-page-only thing.
Vlastimil Babka - Nov. 24, 2017, 1:49 p.m.
On 11/24/2017 11:57 AM, Mel Gorman wrote:
> On Thu, Nov 23, 2017 at 10:15:17PM +0100, Vlastimil Babka wrote:
>> Hmm this really reads like about the migration scanner. That one is
>> unchanged by this patch, there is still a linear scanner. In fact, it
>> gets better, because now it can see the whole zone, not just the first
>> 1/3 - 1/2 until it meets the free scanner (my past observations). And
>> some time ago the async direct compaction was adjusted so that it only
>> scans the migratetype matching the allocation (see
>> suitable_migration_source()). So to some extent, the cleaning already
>> happens.
>>
> 
> It is true that the migration scanner may see a subset of the zone but
> it was important to avoid a previous migration source becoming a
> migration target. The problem is completely different when using the
> freelist as a hint.

I think fundamentally the problems are the same when using freelist
exclusively, or just as a hint, as there's no longer the natural
exclusivity where some pageblocks are used as migration source and
others as migration target, no?

>>> 3. Another reason a linear scanner was used was because we wanted to
>>>    clear entire pageblocks we were migrating from and pack the target
>>>    pageblocks as much as possible. This was to reduce the amount of
>>>    migration required overall even though the scanning hurts. This patch
>>>    takes MIGRATE_MOVABLE pages from anywhere that is "not this pageblock".
>>>    Those potentially have to be moved again and again trying to randomly
>>>    fill a MIGRATE_MOVABLE block. Have you considered using the freelists
>>>    as a hint? i.e. take a page from the freelist, then isolate all free
>>>    pages in the same pageblock as migration targets? That would preserve
>>>    the "packing property" of the linear scanner.
>>>
>>>    This would increase the amount of scanning but that *might* be offset by
>>>    the number of migrations the workload does overall. Note that migrations
>>>    potentially are minor faults so if we do too many migrations, your
>>>    workload may suffer.
>>
>> I have considered the "freelist as a hint", but I'm kinda sceptical
>> about it, because with increasing uptime reclaim should be freeing
>> rather random pages, so finding some free page in a pageblock doesn't
>> mean there would be more free pages there than in the other pageblocks?
>>
> 
> True, but randomly selecting pageblocks based on the contents of the
> freelist is not better.

One theoretical benefit (besides no scanning overhead) is that we prefer
the smallest blocks from the freelist, where in the hint approach we
might pick order-0 as a hint but then split larger free pages in the
same pageblock.

> If a pageblock has limited free pages then it'll
> be filled quickly and not used as a hint in the future.
> 
>> Instead my plan is to make the migration scanner smarter by expanding
>> the "skip_on_failure" feature in isolate_migratepages_block(). The
>> scanner should not even start isolating if the block ahead contains a
>> page that's not free or lru-isolatable/PageMovable. The current
>> "look-ahead" is effectively limited by COMPACT_CLUSTER_MAX (32) isolated
>> pages followed by a migration, after which the scanner might immediately
>> find a non-migratable page, so if it was called for a THP, that work has
>> been wasted.
>>
> 
> That's also not necessarily true because there is a benefit to moving
> pages from unmovable blocks to avoid fragmentation later.

Yeah, I didn't describe it fully, but for unmovable blocks, this would
not apply and we would clear them. Then, avoiding fallback to unmovable
blocks when allocating migration target would prevent the ping-pong.

>>> 5. Consider two processes A and B compacting at the same time with A_s
>>>    and A_t being the source pageblock and target pageblock that process
>>>    A is using and B_s/B_t being B's pageblocks. Nothing prevents A_s ==
>>>    B_t and B_s == A_t. Maybe it rarely happens in practice but it was one
>>>    problem the linear scanner was meant to avoid.
>>
>> I hope that ultimately this problem is not worse than the existing
>> problem where B would not be compacting, but simply allocating the pages
>> that A just created... Maybe if the "look-ahead" idea turns out to have
>> high enough success rate of really creating the high-order page where it
>> decides to isolate and migrate (which probably depends mostly on the
>> migration failure rate?) we could resurrect the old idea of doing a
>> pageblock isolation (MIGRATE_ISOLATE) beforehand. That would block all
>> interference.
>>
> 
> Pageblock bits similar to the skip bit could also be used to limit the
> problem.

Right, if we can afford changing the current 4 bits per pageblock to a
full byte.

>>> I can't shake the feeling I had another concern when I started this
>>> email but then forgot it before I got to the end so it can't be that
>>> important :(.
>>
>> Thanks a lot for the feedback. I totally see how the approach of two
>> linear scanners makes many things simpler, but seems we are now really
>> paying too high a price for the free page scanning. So hopefully there
>> is a way out, although not a simple one.
> 
> 
> While the linear scanner solved some problems, I do agree that the overhead
> is too high today. However, I think it can be fixed by using the freelist
> as a hint, possibly combined with a pageblock bit to avoid hitting some
> problems the linear scanner avoids. I do think there is a way out even
> though I also think that the complexity would not have been justified
> when compaction was first introduced -- partially because it was not clear
> the time that the overhead was an issue but mostly because compaction was
> initially a huge-page-only thing.
>
Mel Gorman - Nov. 24, 2017, 3:11 p.m.
On Fri, Nov 24, 2017 at 02:49:34PM +0100, Vlastimil Babka wrote:
> On 11/24/2017 11:57 AM, Mel Gorman wrote:
> > On Thu, Nov 23, 2017 at 10:15:17PM +0100, Vlastimil Babka wrote:
> >> Hmm this really reads like about the migration scanner. That one is
> >> unchanged by this patch, there is still a linear scanner. In fact, it
> >> gets better, because now it can see the whole zone, not just the first
> >> 1/3 - 1/2 until it meets the free scanner (my past observations). And
> >> some time ago the async direct compaction was adjusted so that it only
> >> scans the migratetype matching the allocation (see
> >> suitable_migration_source()). So to some extent, the cleaning already
> >> happens.
> >>
> > 
> > It is true that the migration scanner may see a subset of the zone but
> > it was important to avoid a previous migration source becoming a
> > migration target. The problem is completely different when using the
> > freelist as a hint.
> 
> I think fundamentally the problems are the same when using freelist
> exclusively, or just as a hint, as there's no longer the natural
> exclusivity where some pageblocks are used as migration source and
> others as migration target, no?
> 

They are similar only in that the linear scanner only suffers from the
same problem at that boundary where the migration and free scanner meet.
At the boundary where they meet, the problem occurs if that boundary moves
towards the end of the zone.

> >>> 3. Another reason a linear scanner was used was because we wanted to
> >>>    clear entire pageblocks we were migrating from and pack the target
> >>>    pageblocks as much as possible. This was to reduce the amount of
> >>>    migration required overall even though the scanning hurts. This patch
> >>>    takes MIGRATE_MOVABLE pages from anywhere that is "not this pageblock".
> >>>    Those potentially have to be moved again and again trying to randomly
> >>>    fill a MIGRATE_MOVABLE block. Have you considered using the freelists
> >>>    as a hint? i.e. take a page from the freelist, then isolate all free
> >>>    pages in the same pageblock as migration targets? That would preserve
> >>>    the "packing property" of the linear scanner.
> >>>
> >>>    This would increase the amount of scanning but that *might* be offset by
> >>>    the number of migrations the workload does overall. Note that migrations
> >>>    potentially are minor faults so if we do too many migrations, your
> >>>    workload may suffer.
> >>
> >> I have considered the "freelist as a hint", but I'm kinda sceptical
> >> about it, because with increasing uptime reclaim should be freeing
> >> rather random pages, so finding some free page in a pageblock doesn't
> >> mean there would be more free pages there than in the other pageblocks?
> >>
> > 
> > True, but randomly selecting pageblocks based on the contents of the
> > freelist is not better.
> 
> One theoretical benefit (besides no scanning overhead) is that we prefer
> the smallest blocks from the freelist, where in the hint approach we
> might pick order-0 as a hint but then split larger free pages in the
> same pageblock.
> 

While you're right, I think it's more than offset by the possibility
that we migrate the same page multiple times and the packing is worse.
The benefit also is not that great if you consider that we're migrating to
a MIGRATE_MOVABLE block because for movable allocations, we only care about
being able to free the entire pageblock for a hugepage allocation. Splitting
a large contiguous range within a movable block so that an unmovable
allocation can use the space is not a great outcome from a fragmentation
perspective.

> > If a pageblock has limited free pages then it'll
> > be filled quickly and not used as a hint in the future.
> > 
> >> Instead my plan is to make the migration scanner smarter by expanding
> >> the "skip_on_failure" feature in isolate_migratepages_block(). The
> >> scanner should not even start isolating if the block ahead contains a
> >> page that's not free or lru-isolatable/PageMovable. The current
> >> "look-ahead" is effectively limited by COMPACT_CLUSTER_MAX (32) isolated
> >> pages followed by a migration, after which the scanner might immediately
> >> find a non-migratable page, so if it was called for a THP, that work has
> >> been wasted.
> >>
> > 
> > That's also not necessarily true because there is a benefit to moving
> > pages from unmovable blocks to avoid fragmentation later.
> 
> Yeah, I didn't describe it fully, but for unmovable blocks, this would
> not apply and we would clear them. Then, avoiding fallback to unmovable
> blocks when allocating migration target would prevent the ping-pong.
> 

While it would be nice to clear them, I don't think it would be the
responsibility of this particular patch. I think it would be better to do
that clearing from kcompactd context.

> >>> 5. Consider two processes A and B compacting at the same time with A_s
> >>>    and A_t being the source pageblock and target pageblock that process
> >>>    A is using and B_s/B_t being B's pageblocks. Nothing prevents A_s ==
> >>>    B_t and B_s == A_t. Maybe it rarely happens in practice but it was one
> >>>    problem the linear scanner was meant to avoid.
> >>
> >> I hope that ultimately this problem is not worse than the existing
> >> problem where B would not be compacting, but simply allocating the pages
> >> that A just created... Maybe if the "look-ahead" idea turns out to have
> >> high enough success rate of really creating the high-order page where it
> >> decides to isolate and migrate (which probably depends mostly on the
> >> migration failure rate?) we could resurrect the old idea of doing a
> >> pageblock isolation (MIGRATE_ISOLATE) beforehand. That would block all
> >> interference.
> >>
> > 
> > Pageblock bits similar to the skip bit could also be used to limit the
> > problem.
> 
> Right, if we can afford changing the current 4 bits per pageblock to a
> full byte.
> 

Potentially yes although maybe initially just do nothing at all with this
problem and look into whether it occurs at all later. If nothing else,
the larger the machine, the less likely it is to occur. It was more of a
concern when compaction was first implemented as machines with less than
1G of memory were still common.
Joonsoo Kim - Nov. 29, 2017, 6:32 a.m.
On Thu, Nov 23, 2017 at 02:08:43PM +0000, Mel Gorman wrote:

> 3. Another reason a linear scanner was used was because we wanted to
>    clear entire pageblocks we were migrating from and pack the target
>    pageblocks as much as possible. This was to reduce the amount of
>    migration required overall even though the scanning hurts. This patch
>    takes MIGRATE_MOVABLE pages from anywhere that is "not this pageblock".
>    Those potentially have to be moved again and again trying to randomly
>    fill a MIGRATE_MOVABLE block. Have you considered using the freelists
>    as a hint? i.e. take a page from the freelist, then isolate all free
>    pages in the same pageblock as migration targets? That would preserve
>    the "packing property" of the linear scanner.
> 
>    This would increase the amount of scanning but that *might* be offset by
>    the number of migrations the workload does overall. Note that migrations
>    potentially are minor faults so if we do too many migrations, your
>    workload may suffer.
> 
> 4. One problem the linear scanner avoids is that a migration target is
>    subsequently used as a migration source and leads to a ping-pong effect.
>    I don't know how bad this is in practice or even if it's a problem at
>    all but it was a concern at the time

IIUC, this potential advantage for a linear scanner would not be the
actual advantage in the *running* system.

Consider about following worst case scenario for "direct freepage
allocation" that "moved again" happens.

__M1___F1_________________F2__F3___

M: migration source page (the number denotes the ID of the page)
F: freepage (the number denotes the sequence in the freelist of the buddy)
_: other pages
migration scanner: move from left to right.

If migration happens with "direct freepage allocation", memory state
will be changed to:

__F?___M1_________________F2__F3___

And then, please assume that there is an another movable allocation
before another migration. It's reasonable assumption since there are
really many movable allocations in the *running* system.

__F?___M1_________________M2__F3___

If migration happens again, memory state will be:

__F?___F?_________________M2__M1___

M1 is moved twice and overall number of migration is two.

Now, think about a linear scanner. With the same scenario,
memory state will be as following.

__M1___F1_________________F2__F3___
__F?___F1_________________F2__M1___
__F?___M2_________________F2__M1___
__F?___F?_________________M2__M1___

Although "move again" doesn't happen in a linear scanner, final memory
state is the same and the same number of migration happens.

So, I think that "direct freepage allocation" doesn't suffer from such
a ping-poing effect. Am I missing something?

Thanks.
Mel Gorman - Nov. 29, 2017, 2:13 p.m.
On Wed, Nov 29, 2017 at 03:32:08PM +0900, Joonsoo Kim wrote:
> On Thu, Nov 23, 2017 at 02:08:43PM +0000, Mel Gorman wrote:
> 
> > 3. Another reason a linear scanner was used was because we wanted to
> >    clear entire pageblocks we were migrating from and pack the target
> >    pageblocks as much as possible. This was to reduce the amount of
> >    migration required overall even though the scanning hurts. This patch
> >    takes MIGRATE_MOVABLE pages from anywhere that is "not this pageblock".
> >    Those potentially have to be moved again and again trying to randomly
> >    fill a MIGRATE_MOVABLE block. Have you considered using the freelists
> >    as a hint? i.e. take a page from the freelist, then isolate all free
> >    pages in the same pageblock as migration targets? That would preserve
> >    the "packing property" of the linear scanner.
> > 
> >    This would increase the amount of scanning but that *might* be offset by
> >    the number of migrations the workload does overall. Note that migrations
> >    potentially are minor faults so if we do too many migrations, your
> >    workload may suffer.
> > 
> > 4. One problem the linear scanner avoids is that a migration target is
> >    subsequently used as a migration source and leads to a ping-pong effect.
> >    I don't know how bad this is in practice or even if it's a problem at
> >    all but it was a concern at the time
> 
> IIUC, this potential advantage for a linear scanner would not be the
> actual advantage in the *running* system.
> 
> Consider about following worst case scenario for "direct freepage
> allocation" that "moved again" happens.
> 

The immediate case should be ok as long as the migration source and the
pageblock a freepage is taken from is not the same pageblock. That might
mean that more pages from the freelist would need to be examined until
another pageblock was found.

> 
> So, I think that "direct freepage allocation" doesn't suffer from such
> a ping-poing effect. Am I missing something?
> 

The ping-pong effect I'm concerned with is that a previous migration
target is used as a migration source in the future. It's hard for that
situation to occur with two linear scanners but care is needed when
using direct freepage allocation.

Patch

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 5c7f010676a7..81d07a97e8c9 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -55,6 +55,7 @@  enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 #endif
 #ifdef CONFIG_COMPACTION
 		COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
+		COMPACTFREE_DIRECT_ALLOC, COMPACTFREE_DIRECT_MISS,
 		COMPACTISOLATED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 		KCOMPACTD_WAKE,
diff --git a/mm/compaction.c b/mm/compaction.c
index 10cd757f1006..ccc9b157f716 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1160,6 +1160,41 @@  static void isolate_freepages(struct compact_control *cc)
 	cc->free_pfn = isolate_start_pfn;
 }
 
+static void isolate_freepages_direct(struct compact_control *cc)
+{
+	unsigned long nr_pages;
+	unsigned long flags;
+
+	nr_pages = cc->nr_migratepages - cc->nr_freepages;
+
+	if (!compact_trylock_irqsave(&cc->zone->lock, &flags, cc))
+		return;
+
+	while (nr_pages) {
+		struct page *page;
+		unsigned long pfn;
+
+		page = alloc_pages_zone(cc->zone, 0, MIGRATE_MOVABLE);
+		if (!page)
+			break;
+		pfn = page_to_pfn(page);
+
+		count_compact_event(COMPACTFREE_DIRECT_ALLOC);
+
+		/* Is the free page in the block we are migrating from? */
+		if (pfn >> cc->order ==	(cc->migrate_pfn - 1) >> cc->order) {
+			list_add(&page->lru, &cc->freepages_held);
+			count_compact_event(COMPACTFREE_DIRECT_MISS);
+		} else {
+			list_add(&page->lru, &cc->freepages);
+			cc->nr_freepages++;
+			nr_pages--;
+		}
+	}
+
+	spin_unlock_irqrestore(&cc->zone->lock, flags);
+}
+
 /*
  * This is a migrate-callback that "allocates" freepages by taking pages
  * from the isolated freelists in the block we are migrating to.
@@ -1176,7 +1211,12 @@  static struct page *compaction_alloc(struct page *migratepage,
 	 * contention.
 	 */
 	if (list_empty(&cc->freepages)) {
-		if (!cc->contended)
+		if (cc->contended)
+			return NULL;
+
+		if (cc->direct_compaction && (cc->mode == MIGRATE_ASYNC))
+			isolate_freepages_direct(cc);
+		else
 			isolate_freepages(cc);
 
 		if (list_empty(&cc->freepages))
@@ -1637,6 +1677,10 @@  static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 						(cc->mode == MIGRATE_ASYNC)) {
 				cc->migrate_pfn = block_end_pfn(
 						cc->migrate_pfn - 1, cc->order);
+
+				if (!list_empty(&cc->freepages_held))
+					release_freepages(&cc->freepages_held);
+
 				/* Draining pcplists is useless in this case */
 				cc->last_migrated_pfn = 0;
 
@@ -1657,6 +1701,8 @@  static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 				block_start_pfn(cc->migrate_pfn, cc->order);
 
 			if (cc->last_migrated_pfn < current_block_start) {
+				if (!list_empty(&cc->freepages_held))
+					release_freepages(&cc->freepages_held);
 				cpu = get_cpu();
 				lru_add_drain_cpu(cpu);
 				drain_local_pages(zone);
@@ -1687,6 +1733,8 @@  static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 		if (free_pfn > zone->compact_cached_free_pfn)
 			zone->compact_cached_free_pfn = free_pfn;
 	}
+	if (!list_empty(&cc->freepages_held))
+		release_freepages(&cc->freepages_held);
 
 	count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
 	count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
@@ -1721,6 +1769,7 @@  static enum compact_result compact_zone_order(struct zone *zone, int order,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
+	INIT_LIST_HEAD(&cc.freepages_held);
 
 	ret = compact_zone(zone, &cc);
 
@@ -1839,6 +1888,7 @@  static void compact_node(int nid)
 		cc.zone = zone;
 		INIT_LIST_HEAD(&cc.freepages);
 		INIT_LIST_HEAD(&cc.migratepages);
+		INIT_LIST_HEAD(&cc.freepages_held);
 
 		compact_zone(zone, &cc);
 
@@ -1979,6 +2029,7 @@  static void kcompactd_do_work(pg_data_t *pgdat)
 		cc.zone = zone;
 		INIT_LIST_HEAD(&cc.freepages);
 		INIT_LIST_HEAD(&cc.migratepages);
+		INIT_LIST_HEAD(&cc.freepages_held);
 
 		if (kthread_should_stop())
 			return;
diff --git a/mm/internal.h b/mm/internal.h
index e6bd35182dae..191da54dea16 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -161,6 +161,8 @@  static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
 }
 
 extern int __isolate_free_page(struct page *page, unsigned int order);
+extern struct page * alloc_pages_zone(struct zone *zone, unsigned int order,
+							int migratetype);
 extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
 					unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned int order);
@@ -183,6 +185,8 @@  extern int user_min_free_kbytes;
 struct compact_control {
 	struct list_head freepages;	/* List of free pages to migrate to */
 	struct list_head migratepages;	/* List of pages being migrated */
+	struct list_head freepages_held;/* List of free pages from the block
+					 * that's being migrated */
 	struct zone *zone;
 	unsigned long nr_freepages;	/* Number of isolated free pages */
 	unsigned long nr_migratepages;	/* Number of pages to migrate */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4096f4a5c1f..f26acf62b4c7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2851,6 +2851,33 @@  static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	return page;
 }
 
+/*
+ * Like split_free_page, but given the zone, it will grab a free page from
+ * the freelists.
+ */
+struct page *
+alloc_pages_zone(struct zone *zone, unsigned int order, int migratetype)
+{
+	struct page *page;
+	unsigned long watermark;
+
+	watermark = low_wmark_pages(zone) + (1 << order);
+	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+		return NULL;
+
+	page = __rmqueue(zone, order, migratetype);
+	if (!page)
+		return NULL;
+
+	__mod_zone_freepage_state(zone, -(1 << order),
+					  get_pcppage_migratetype(page));
+
+	set_page_owner(page, order, __GFP_MOVABLE);
+	set_page_refcounted(page);
+
+	return page;
+}
+
 /*
  * Allocate a page from the given zone. Use pcplists for order-0 allocations.
  */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 40b2db6db6b1..52187c5fbd1b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1223,6 +1223,8 @@  const char * const vmstat_text[] = {
 #ifdef CONFIG_COMPACTION
 	"compact_migrate_scanned",
 	"compact_free_scanned",
+	"compact_free_direct_alloc",
+	"compact_free_direct_miss",
 	"compact_isolated",
 	"compact_stall",
 	"compact_fail",