[3/4] mm: Reclaim small amounts of memory when an external fragmentation event occurs

Message ID 20181121101414.21301-4-mgorman@techsingularity.net (mailing list archive)
State New, archived
Series Fragmentation avoidance improvements v4

Commit Message

Mel Gorman Nov. 21, 2018, 10:14 a.m. UTC
An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if there
are enough sparsely populated pageblocks then the problem can still occur
as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a
zone watermark to be temporarily boosted when an external fragmentation
causing event occurs. The boosting will stall allocations that would
decrease free memory below the boosted low watermark and kswapd is woken
unconditionally to reclaim an amount of memory relative to the size
of the high watermark and the watermark_boost_factor until the boost
is cleared. When kswapd finishes, it wakes kcompactd at the pageblock
order to clean some of the pageblocks that may have been affected by the
fragmentation event. kswapd avoids any writeback or swap from reclaim
context during this operation to avoid excessive system disruption in
the name of fragmentation avoidance. Care is taken so that kswapd will
do normal reclaim work if the system is really low on memory.
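
As a rough illustration of the boosted watermark check described above,
the following userspace sketch models a zone whose watermarks are all
raised by a temporary boost. The struct, enum and function names are
hypothetical stand-ins and not kernel code; only the idea of adding the
boost to each watermark is taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a zone; not the kernel's struct zone. */
enum model_wmark {
	MODEL_WMARK_MIN,
	MODEL_WMARK_LOW,
	MODEL_WMARK_HIGH,
	MODEL_NR_WMARK
};

struct zone_model {
	unsigned long watermark[MODEL_NR_WMARK];	/* unboosted, in pages */
	unsigned long watermark_boost;			/* temporary boost, in pages */
	unsigned long free_pages;
};

/* Effective watermark: the base value plus whatever boost is active. */
static unsigned long model_wmark_pages(const struct zone_model *z,
				       enum model_wmark i)
{
	return z->watermark[i] + z->watermark_boost;
}

/*
 * True if allocating nr pages would leave free memory below the
 * boosted low watermark.
 */
static bool model_would_breach_low(const struct zone_model *z,
				   unsigned long nr)
{
	if (nr > z->free_pages)
		return true;
	return z->free_pages - nr < model_wmark_pages(z, MODEL_WMARK_LOW);
}
```

In this model a zone that comfortably passes the check with no boost can
start failing it as soon as a boost is applied, even though the amount
of free memory has not changed.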

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc1 extfrag events < order 9:  1023463
4.20-rc1+patch:                      358574 (65% reduction)
4.20-rc1+patch1-3:                    19274 (98% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Amean     fault-base-1      663.65 (   0.00%)      659.85 *   0.57%*
Amean     fault-huge-1        0.00 (   0.00%)      172.19 * -99.00%*

                              4.20.0-rc1             4.20.0-rc1
                            lowzone-v2r4             boost-v2r4
Percentage huge-1        0.00 (   0.00%)        1.68 ( 100.00%)

Note that external fragmentation causing events are massively reduced
by this patch whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  342549
4.20-rc1+patch:                     337890 ( 1% reduction)
4.20-rc1+patch1-3:                   12801 (96% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Amean     fault-base-1     1531.37 (   0.00%)     1578.91 (  -3.10%)
Amean     fault-huge-1     1160.95 (   0.00%)     1090.23 *   6.09%*

                              4.20.0-rc1             4.20.0-rc1
                            lowzone-v2r4             boost-v2r4
Percentage huge-1       78.97 (   0.00%)       82.59 (   4.58%)

As before, massive reduction in external fragmentation events, some jitter
on latencies and an increase in THP allocation success rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc1 extfrag events < order 9:  209820
4.20-rc1+patch:                     185923 (11% reduction)
4.20-rc1+patch1-3:                   11240 (95% reduction)

                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Amean     fault-base-5     1334.99 (   0.00%)     1395.28 (  -4.52%)
Amean     fault-huge-5     2428.43 (   0.00%)      539.69 (  77.78%)

                              4.20.0-rc1             4.20.0-rc1
                            lowzone-v2r4             boost-v2r4
Percentage huge-5        1.13 (   0.00%)        0.53 ( -52.94%)

This is an illustration of why latencies are not the primary metric.
There is a 95% reduction in fragmentation causing events but the
huge page latencies look fantastic until you account for the fact it
might be because the success rate was lower. Given how low it was
initially, this is partially down to luck.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc1 extfrag events < order 9: 167464
4.20-rc1+patch:                    130081 (22% reduction)
4.20-rc1+patch1-3:                  12057 (92% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc1             4.20.0-rc1
                                 lowzone-v2r4             boost-v2r4
Amean     fault-base-5     6652.67 (   0.00%)     8691.83 * -30.65%*
Amean     fault-huge-5     2486.89 (   0.00%)     2899.83 * -16.60%*

                              4.20.0-rc1             4.20.0-rc1
                            lowzone-v2r4             boost-v2r4
Percentage huge-5       94.49 (   0.00%)       95.55 (   1.13%)

There is a large reduction in fragmentation events with a very slightly
higher THP allocation success rate. The latencies look bad but a closer
look at the data seems to indicate the problem is at the tails. Given the
high THP allocation success rate, the system is under quite some pressure.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 Documentation/sysctl/vm.txt |  19 +++++++
 include/linux/mm.h          |   1 +
 include/linux/mmzone.h      |  11 ++--
 kernel/sysctl.c             |   8 +++
 mm/page_alloc.c             |  53 +++++++++++++++++--
 mm/vmscan.c                 | 123 ++++++++++++++++++++++++++++++++++++++++----
 6 files changed, 199 insertions(+), 16 deletions(-)

Comments

Vlastimil Babka Nov. 22, 2018, 1:53 p.m. UTC | #1
On 11/21/18 11:14 AM, Mel Gorman wrote:
> An external fragmentation event was previously described as
> 
>     When the page allocator fragments memory, it records the event using
>     the mm_page_alloc_extfrag event. If the fallback_order is smaller
>     than a pageblock order (order-9 on 64-bit x86) then it's considered
>     an event that will cause external fragmentation issues in the future.
> 
> The kernel reduces the probability of such events by increasing the
> watermark sizes by calling set_recommended_min_free_kbytes early in the
> lifetime of the system. This works reasonably well in general but if there
> are enough sparsely populated pageblocks then the problem can still occur
> as enough memory is free overall and kswapd stays asleep.
> 
> This patch introduces a watermark_boost_factor sysctl that allows a
> zone watermark to be temporarily boosted when an external fragmentation
> causing events occurs. The boosting will stall allocations that would
> decrease free memory below the boosted low watermark and kswapd is woken
> unconditionally to reclaim an amount of memory relative to the size
> of the high watermark and the watermark_boost_factor until the boost
> is cleared. When kswapd finishes, it wakes kcompactd at the pageblock
> order to clean some of the pageblocks that may have been affected by the
> fragmentation event. kswapd avoids any writeback or swap from reclaim
> context during this operation to avoid excessive system disruption in
> the name of fragmentation avoidance. Care is taken so that kswapd will
> do normal reclaim work if the system is really low on memory.
> 
> This was evaluated using the same workloads as "mm, page_alloc: Spread
> allocations across zones before introducing fragmentation".
> 
> 1-socket Skylake machine
> config-global-dhp__workload_thpfioscale XFS (no special madvise)
> 4 fio threads, 1 THP allocating thread
> --------------------------------------
> 
> 4.20-rc1 extfrag events < order 9:  1023463
> 4.20-rc1+patch:                      358574 (65% reduction)
> 4.20-rc1+patch1-3:                    19274 (98% reduction)

So the reason I was wondering about movable vs unmovable fallbacks here
is that movable fallbacks are ok as they can be migrated later, but the
unmovable/reclaimable ones cannot, which is bad if they fall back to a movable
pageblock. Movable fallbacks can however fill the unmovable pageblocks
and increase the chance of an unmovable fallback, but that would depend on
the workload. So hypothetically if the test workload was such that
movable fallbacks did not cause unmovable fallbacks, and a patch would
thus only decrease the movable fallbacks (at the cost of e.g. higher
reclaim, as this patch) with unmovable fallbacks unchanged, then it
would be useful to know that for better evaluation of the pros vs cons,
imho.

> +static inline void boost_watermark(struct zone *zone)
> +{
> +	unsigned long max_boost;
> +
> +	if (!watermark_boost_factor)
> +		return;
> +
> +	max_boost = mult_frac(wmark_pages(zone, WMARK_HIGH),
> +			watermark_boost_factor, 10000);

Hmm I assume you didn't use high_wmark_pages() because the calculation
should start with high watermark not including existing boost. But then,
wmark_pages() also includes existing boost, so the limit won't work and
each invocation of boost_watermark() will simply add pageblock_nr_pages?
I.e. this should use zone->_watermark[] instead of wmark_pages()?

> +	max_boost = max(pageblock_nr_pages, max_boost);
> +
> +	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
> +		max_boost);
> +}
> +
>  /*
>   * This function implements actual steal behaviour. If order is large enough,
>   * we can steal whole pageblock. If not, we first move freepages in this
> @@ -2160,6 +2176,14 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
>  		goto single_page;
>  	}
>  
> +	/*
> +	 * Boost watermarks to increase reclaim pressure to reduce the
> +	 * likelihood of future fallbacks. Wake kswapd now as the node
> +	 * may be balanced overall and kswapd will not wake naturally.
> +	 */
> +	boost_watermark(zone);
> +	wakeup_kswapd(zone, 0, 0, zone_idx(zone));
> +
>  	/* We are not allowed to try stealing from the whole block */
>  	if (!whole_block)
>  		goto single_page;
> @@ -3277,11 +3301,19 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>   * probably too small. It only makes sense to spread allocations to avoid
>   * fragmentation between the Normal and DMA32 zones.
>   */
> -static inline unsigned int alloc_flags_nofragment(struct zone *zone)
> +static inline unsigned int
> +alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
>  {
>  	if (zone_idx(zone) != ZONE_NORMAL)
>  		return 0;
>  
> +	/*
> +	 * A fragmenting fallback will try waking kswapd. ALLOC_NOFRAGMENT
> +	 * may break that so such callers can introduce fragmentation.
> +	 */

I think I don't understand this comment :( Do you want to avoid waking
up kswapd from steal_suitable_fallback() (introduced above) for
allocations without __GFP_KSWAPD_RECLAIM? But returning 0 here means
actually allowing the allocation to go through steal_suitable_fallback()?
So should it return ALLOC_NOFRAGMENT below, or was the intent different?

> +	if (!(gfp_mask & __GFP_KSWAPD_RECLAIM))
> +		return 0;
> +
>  	/*
>  	 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
>  	 * the pointer is within zone->zone_pgdat->node_zones[]

Mel Gorman Nov. 22, 2018, 3:04 p.m. UTC | #2
On Thu, Nov 22, 2018 at 02:53:08PM +0100, Vlastimil Babka wrote:
> > 1-socket Skylake machine
> > config-global-dhp__workload_thpfioscale XFS (no special madvise)
> > 4 fio threads, 1 THP allocating thread
> > --------------------------------------
> > 
> > 4.20-rc1 extfrag events < order 9:  1023463
> > 4.20-rc1+patch:                      358574 (65% reduction)
> > 4.20-rc1+patch1-3:                    19274 (98% reduction)
> 
> So the reason I was wondering about movable vs unmovable fallbacks here
> is that movable fallbacks are ok as they can be migrated later, but the
> unmovable/reclaimable not, which is bad if they fallback to movable
> pageblock. Movable fallbacks can however fill the unmovable pageblocks
> and increase change of the unmovable fallback, but that would depend on
> the workload. So hypothetically if the test workload was such that
> movable fallbacks did not cause unmovable fallbacks, and a patch would
> thus only decrease the movable fallbacks (at the cost of e.g. higher
> reclaim, as this patch) with unmovable fallbacks unchanged, then it
> would be useful to know that for better evaluation of the pros vs cons,
> imho.
> 

I can give the breakdown in the next changelog as it'll be similar for
each instance of the workload.

Movable fallbacks are ok in that they can fallback but not ok in that
they can fill an unmovable/reclaimable pageblock causing them to
fallback later. I think you understand this already. If there is a
movable pageblock, it is pretty much guaranteed to affect an
unmovable/reclaimable pageblock and while it might not be enough to
actually cause an unmovable/reclaimable fallback in the future, we cannot
know that in advance so the patch takes the only option available to it.

In terms of reclaim, what I've observed for a few workloads is that
reclaim is different but not necessarily worse. With the patch, reclaim
is roughly similar overall but at a smoother rate. The vanilla kernel
tends to spike with large amounts of reclaim at semi-regular intervals
whereas this patch has a small amount of reclaim done steadily over
time.

Now I did encounter a bug whereby fio reduced throughput because the
boosted reclaim also reclaimed slab which caused secondary issues that
were unrelated to the fragmentation pattern. This will be fixed in the
next version.

While it does leave open the possibility that a slab-intensive workload
occupying lots of memory will still cause fragmentation, that is a
different class of problem and would be approached differently. That
particular problem is not covered by this approach but it does not exclude
it either as the full solution would have to encompass both.

> > +static inline void boost_watermark(struct zone *zone)
> > +{
> > +	unsigned long max_boost;
> > +
> > +	if (!watermark_boost_factor)
> > +		return;
> > +
> > +	max_boost = mult_frac(wmark_pages(zone, WMARK_HIGH),
> > +			watermark_boost_factor, 10000);
> 
> Hmm I assume you didn't use high_wmark_pages() because the calculation
> should start with high watermark not including existing boost. But then,
> wmark_pages() also includes existing boost, so the limit won't work and
> each invocation of boost_watermark() will simply add pageblock_nr_pages?
> I.e. this should use zone->_watermark[] instead of wmark_pages()?
> 

You're right. The consequences are that it boosts by higher than
expected. I'll fix it.
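
To make the effect concrete, here is a small userspace model of the
capping logic. It is only a sketch: the struct, the helper names and the
fixed 512-page pageblock are assumptions for illustration, not kernel
code. The include_boost argument selects between computing the cap from
the already boosted high watermark (as posted) and from the raw high
watermark (as suggested in the review).

```c
/* Hypothetical userspace model; names and numbers are illustrative only. */
struct boost_model {
	unsigned long high_wmark;	/* unboosted high watermark, in pages */
	unsigned long watermark_boost;	/* current boost, in pages */
};

#define MODEL_PAGEBLOCK_NR_PAGES 512UL	/* order-9 pageblock, 4K pages */

/* x * numer / denom without overflowing the intermediate product. */
static unsigned long model_mult_frac(unsigned long x, unsigned long numer,
				     unsigned long denom)
{
	unsigned long quot = x / denom;
	unsigned long rem = x % denom;

	return quot * numer + (rem * numer) / denom;
}

/* One boost step, mirroring the structure of the posted helper. */
static void model_boost_watermark(struct boost_model *z, unsigned long factor,
				  int include_boost)
{
	unsigned long base = z->high_wmark;
	unsigned long max_boost;

	if (!factor)
		return;

	/* The posted version effectively takes this branch. */
	if (include_boost)
		base += z->watermark_boost;

	max_boost = model_mult_frac(base, factor, 10000);
	if (max_boost < MODEL_PAGEBLOCK_NR_PAGES)
		max_boost = MODEL_PAGEBLOCK_NR_PAGES;

	z->watermark_boost += MODEL_PAGEBLOCK_NR_PAGES;
	if (z->watermark_boost > max_boost)
		z->watermark_boost = max_boost;
}

/* Apply nr_events boost steps from a zero boost; return the final boost. */
static unsigned long model_boost_after(unsigned long high,
				       unsigned long factor,
				       int include_boost, int nr_events)
{
	struct boost_model z = { .high_wmark = high, .watermark_boost = 0 };
	int i;

	for (i = 0; i < nr_events; i++)
		model_boost_watermark(&z, factor, include_boost);

	return z.watermark_boost;
}
```

With a high watermark of 4096 pages and a factor of 15000, the raw
variant settles at 6144 pages (150% of the high watermark), while the
variant that includes the existing boost keeps growing by one pageblock
per event because the cap moves up along with the boost.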

> > +	max_boost = max(pageblock_nr_pages, max_boost);
> > +
> > +	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
> > +		max_boost);
> > +}
> > +
> >  /*
> >   * This function implements actual steal behaviour. If order is large enough,
> >   * we can steal whole pageblock. If not, we first move freepages in this
> > @@ -2160,6 +2176,14 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
> >  		goto single_page;
> >  	}
> >  
> > +	/*
> > +	 * Boost watermarks to increase reclaim pressure to reduce the
> > +	 * likelihood of future fallbacks. Wake kswapd now as the node
> > +	 * may be balanced overall and kswapd will not wake naturally.
> > +	 */
> > +	boost_watermark(zone);
> > +	wakeup_kswapd(zone, 0, 0, zone_idx(zone));
> > +
> >  	/* We are not allowed to try stealing from the whole block */
> >  	if (!whole_block)
> >  		goto single_page;
> > @@ -3277,11 +3301,19 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> >   * probably too small. It only makes sense to spread allocations to avoid
> >   * fragmentation between the Normal and DMA32 zones.
> >   */
> > -static inline unsigned int alloc_flags_nofragment(struct zone *zone)
> > +static inline unsigned int
> > +alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
> >  {
> >  	if (zone_idx(zone) != ZONE_NORMAL)
> >  		return 0;
> >  
> > +	/*
> > +	 * A fragmenting fallback will try waking kswapd. ALLOC_NOFRAGMENT
> > +	 * may break that so such callers can introduce fragmentation.
> > +	 */
> 
> I think I don't understand this comment :( Do you want to avoid waking
> up kswapd from steal_suitable_fallback() (introduced above) for
> allocations without __GFP_KSWAPD_RECLAIM? But returning 0 here means
> actually allowing the allocation go through steal_suitable_fallback()?
> So should it return ALLOC_NOFRAGMENT below, or was the intent different?
> 

I want to avoid waking kswapd in steal_suitable_fallback if waking
kswapd is not allowed. If the calling context does not allow it, it does
mean that fragmentation will be allowed to occur. I'm banking on it
being a relatively rare case but potentially it'll be problematic. The
main source of allocation requests that I expect to hit this are THP and
as they are already at pageblock_order, it has limited impact from a
fragmentation perspective -- particularly as pageblock_order stealing is
allowed even with ALLOC_NOFRAGMENT.
Vlastimil Babka Nov. 22, 2018, 3:35 p.m. UTC | #3
On 11/22/18 4:04 PM, Mel Gorman wrote:
> On Thu, Nov 22, 2018 at 02:53:08PM +0100, Vlastimil Babka wrote:
>>
>> So the reason I was wondering about movable vs unmovable fallbacks here
>> is that movable fallbacks are ok as they can be migrated later, but the
>> unmovable/reclaimable not, which is bad if they fallback to movable
>> pageblock. Movable fallbacks can however fill the unmovable pageblocks
>> and increase change of the unmovable fallback, but that would depend on
>> the workload. So hypothetically if the test workload was such that
>> movable fallbacks did not cause unmovable fallbacks, and a patch would
>> thus only decrease the movable fallbacks (at the cost of e.g. higher
>> reclaim, as this patch) with unmovable fallbacks unchanged, then it
>> would be useful to know that for better evaluation of the pros vs cons,
>> imho.
>>
> 
> I can give the breakdown in the next changelog as it'll be similar for
> each instance of the workload.
> 
> Movable fallbacks are ok in that they can fallback but not ok in that
> they can fill an unmovable/reclaimable pageblock causing them to
> fallback later. I think you understand this already.

Yes.

> If there is a
> movable pageblock, it is pretty much guaranteed to affect an
> unmovable/reclaimable pageblock and while it might not be enough to
> actually cause a unmovable/reclaimable fallback in the future, we cannot
> know that in advance so the patch takes the only option available to it.
> 
> In terms of reclaim, what I've observed for a few workloads is that
> reclaim is different but not necessarily worse. With the patch, reclaim
> is roughly similar overall but at a smoother rate. The vanilla kernel
> tends to spike with large amounts of reclaim at semi-regular intervals
> where as this path has a small amount of reclaim done steadily over
> time.
> 
> Now I did encounter a bug whereby fio reduced throughput because the
> boosted reclaim also reclaimed slab which caused secondary issues that
> were unrelated to the fragmentation pattern. This will be fixed in the
> next version.
> 
> While it does leave open the possibilty that a slab-intensive workload
> occupying lots of memory will still cause fragmentation, that is a
> different class of problem and would be approached differently. That
> particular problem is not covered by this approach but it does not exclude
> it either as the full solution would have to encompass both.

OK, thanks for explaining.

>>> +	max_boost = max(pageblock_nr_pages, max_boost);
>>> +
>>> +	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
>>> +		max_boost);
>>> +}
>>> +
>>>  /*
>>>   * This function implements actual steal behaviour. If order is large enough,
>>>   * we can steal whole pageblock. If not, we first move freepages in this
>>> @@ -2160,6 +2176,14 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
>>>  		goto single_page;
>>>  	}
>>>  
>>> +	/*
>>> +	 * Boost watermarks to increase reclaim pressure to reduce the
>>> +	 * likelihood of future fallbacks. Wake kswapd now as the node
>>> +	 * may be balanced overall and kswapd will not wake naturally.
>>> +	 */
>>> +	boost_watermark(zone);
>>> +	wakeup_kswapd(zone, 0, 0, zone_idx(zone));
>>> +
>>>  	/* We are not allowed to try stealing from the whole block */
>>>  	if (!whole_block)
>>>  		goto single_page;
>>> @@ -3277,11 +3301,19 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
>>>   * probably too small. It only makes sense to spread allocations to avoid
>>>   * fragmentation between the Normal and DMA32 zones.
>>>   */
>>> -static inline unsigned int alloc_flags_nofragment(struct zone *zone)
>>> +static inline unsigned int
>>> +alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
>>>  {
>>>  	if (zone_idx(zone) != ZONE_NORMAL)
>>>  		return 0;
>>>  
>>> +	/*
>>> +	 * A fragmenting fallback will try waking kswapd. ALLOC_NOFRAGMENT
>>> +	 * may break that so such callers can introduce fragmentation.
>>> +	 */
>>
>> I think I don't understand this comment :( Do you want to avoid waking
>> up kswapd from steal_suitable_fallback() (introduced above) for
>> allocations without __GFP_KSWAPD_RECLAIM? But returning 0 here means
>> actually allowing the allocation go through steal_suitable_fallback()?
>> So should it return ALLOC_NOFRAGMENT below, or was the intent different?
>>
> 
> I want to avoid waking kswapd in steal_suitable_fallback if waking
> kswapd is not allowed.

OK, but then this 'if' should return ALLOC_NOFRAGMENT, not 0?
But that will still not prevent waking kswapd for nodes where there's no
ZONE_DMA32, or any node when get_page_from_freelist() retries without
fallback.

> If the calling context does not allow it, it does
> mean that fragmentation will be allowed to occur. I'm banking on it
> being a relatively rare case but potentially it'll be problematic. The
> main source of allocation requests that I expect to hit this are THP and
> as they are already at pageblock_order, it has limited impact from a
> fragmentation perspective -- particularly as pageblock_order stealing is
> allowed even with ALLOC_NOFRAGMENT.

Yep, THP will skip the wakeup in steal_suitable_fallback() via 'goto
single_page' above it. For other users of ~__GFP_KSWAPD_RECLAIM (are
there any?) we could maybe just ignore and wakeup kswapd anyway, since
avoiding fragmentation is more important? Or if we wanted to avoid
wakeup reliably, then steal_suitable_fallback() would have to know and
check gfp_flags I'm afraid, and that doesn't seem worth the trouble.
Mel Gorman Nov. 22, 2018, 4:22 p.m. UTC | #4
On Thu, Nov 22, 2018 at 04:35:58PM +0100, Vlastimil Babka wrote:
> >> I think I don't understand this comment :( Do you want to avoid waking
> >> up kswapd from steal_suitable_fallback() (introduced above) for
> >> allocations without __GFP_KSWAPD_RECLAIM? But returning 0 here means
> >> actually allowing the allocation go through steal_suitable_fallback()?
> >> So should it return ALLOC_NOFRAGMENT below, or was the intent different?
> >>
> > 
> > I want to avoid waking kswapd in steal_suitable_fallback if waking
> > kswapd is not allowed.
> 
> OK, but then this 'if' should return ALLOC_NOFRAGMENT, not 0?
> But that will still not prevent waking kswapd for nodes where there's no
> ZONE_DMA32, or any node when get_page_from_freelist() retries without
> fallback.
> 
> > If the calling context does not allow it, it does
> > mean that fragmentation will be allowed to occur. I'm banking on it
> > being a relatively rare case but potentially it'll be problematic. The
> > main source of allocation requests that I expect to hit this are THP and
> > as they are already at pageblock_order, it has limited impact from a
> > fragmentation perspective -- particularly as pageblock_order stealing is
> > allowed even with ALLOC_NOFRAGMENT.
> 
> Yep, THP will skip the wakeup in steal_suitable_fallback() via 'goto
> single_page' above it. For other users of ~__GFP_KSWAPD_RECLAIM (are
> there any?) we could maybe just ignore and wakeup kswapd anyway, since
> avoiding fragmentation is more important? Or if we wanted to avoid
> wakeup reliably, then steal_suitable_fallback() would have to know and
> check gfp_flags I'm afraid, and that doesn't seem worth the trouble.

Indeed. While it works in some cases, it'll be full of holes and while
I could close them, it just turns into a subtle mess. I've prepared a
preparatory patch that encodes __GFP_KSWAPD_RECLAIM in alloc_flags and checks
based on that.  It's a lot cleaner overall, it's less of a mess than passing
gfp_flags all the way through for one test and there are fewer side-effects.

Thanks!
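
The approach described in the last reply, translating the gfp bit into
alloc_flags once and testing only alloc_flags later, might look roughly
like the sketch below. Every name and bit value here is hypothetical and
chosen only for illustration; the follow-up patch itself is not shown in
this thread.

```c
#include <stdbool.h>

/*
 * All names and bit values are hypothetical; they only mirror the
 * general shape of gfp flags and allocator-internal alloc flags.
 */
#define SKETCH_GFP_KSWAPD_RECLAIM	0x1u	/* caller permits waking kswapd */
#define SKETCH_ALLOC_KSWAPD		0x2u	/* internal: wakeups allowed */

/* Translate the gfp bit once, near the top of the allocation path. */
static unsigned int sketch_alloc_flags(unsigned int gfp_mask)
{
	unsigned int alloc_flags = 0;

	if (gfp_mask & SKETCH_GFP_KSWAPD_RECLAIM)
		alloc_flags |= SKETCH_ALLOC_KSWAPD;

	return alloc_flags;
}

/*
 * Deep in the fallback path only alloc_flags is consulted, so gfp_mask
 * does not have to be passed down for a single test.
 */
static bool sketch_may_wake_kswapd(unsigned int alloc_flags)
{
	return (alloc_flags & SKETCH_ALLOC_KSWAPD) != 0;
}
```

The design point is that the fallback code is already handed
alloc_flags, so no extra parameter has to be threaded through the
intermediate functions.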

Patch

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 7d73882e2c27..4dff1b75229b 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -63,6 +63,7 @@  files can be found in mm/swap.c.
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- watermark_boost_factor
 - watermark_scale_factor
 - zone_reclaim_mode
 
@@ -856,6 +857,24 @@  ten times more freeable objects than there are.
 
 =============================================================
 
+watermark_boost_factor:
+
+This factor controls the level of reclaim when memory is being fragmented.
+It defines the percentage of the high watermark of a zone that will be
+reclaimed if pages of different mobility are being mixed within pageblocks.
+The intent is so that compaction has less work to do and increase the
+success rate of future high-order allocations such as SLUB allocations,
+THP and hugetlbfs pages.
+
+To make it sensible with respect to the watermark_scale_factor parameter,
+the unit is in fractions of 10,000. The default value of 15000 means
+that 150% of the high watermark will be reclaimed in the event of a
+pageblock being mixed due to fragmentation. If this value is smaller
+than a pageblock then a pageblock's worth of pages will be reclaimed (e.g.
+2MB on 64-bit x86). A boost factor of 0 will disable the feature.
+
+=============================================================
+
 watermark_scale_factor:
 
 This factor controls the aggressiveness of kswapd. It defines the
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..2c4c69508413 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2202,6 +2202,7 @@  extern void zone_pcp_reset(struct zone *zone);
 
 /* page_alloc.c */
 extern int min_free_kbytes;
+extern int watermark_boost_factor;
 extern int watermark_scale_factor;
 
 /* nommu.c */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e43e8e79db99..d352c1dab486 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -269,10 +269,10 @@  enum zone_watermarks {
 	NR_WMARK
 };
 
-#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
-#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
-#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
-#define wmark_pages(z, i) (z->_watermark[i])
+#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
+#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
+#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
+#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
@@ -364,6 +364,7 @@  struct zone {
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long _watermark[NR_WMARK];
+	unsigned long watermark_boost;
 
 	unsigned long nr_reserved_highatomic;
 
@@ -885,6 +886,8 @@  static inline int is_highmem(struct zone *zone)
 struct ctl_table;
 int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int watermark_boost_factor_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5fc724e4e454..1825f712e73b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1462,6 +1462,14 @@  static struct ctl_table vm_table[] = {
 		.proc_handler	= min_free_kbytes_sysctl_handler,
 		.extra1		= &zero,
 	},
+	{
+		.procname	= "watermark_boost_factor",
+		.data		= &watermark_boost_factor,
+		.maxlen		= sizeof(watermark_boost_factor),
+		.mode		= 0644,
+		.proc_handler	= watermark_boost_factor_sysctl_handler,
+		.extra1		= &zero,
+	},
 	{
 		.procname	= "watermark_scale_factor",
 		.data		= &watermark_scale_factor,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9ea2d828d20c..04b29228e9f0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -263,6 +263,7 @@  compound_page_dtor * const compound_page_dtors[] = {
 
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
+int watermark_boost_factor __read_mostly = 15000;
 int watermark_scale_factor = 10;
 
 static unsigned long nr_kernel_pages __meminitdata;
@@ -2129,6 +2130,21 @@  static bool can_steal_fallback(unsigned int order, int start_mt)
 	return false;
 }
 
+static inline void boost_watermark(struct zone *zone)
+{
+	unsigned long max_boost;
+
+	if (!watermark_boost_factor)
+		return;
+
+	max_boost = mult_frac(wmark_pages(zone, WMARK_HIGH),
+			watermark_boost_factor, 10000);
+	max_boost = max(pageblock_nr_pages, max_boost);
+
+	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
+		max_boost);
+}
+
 /*
  * This function implements actual steal behaviour. If order is large enough,
  * we can steal whole pageblock. If not, we first move freepages in this
@@ -2160,6 +2176,14 @@  static void steal_suitable_fallback(struct zone *zone, struct page *page,
 		goto single_page;
 	}
 
+	/*
+	 * Boost watermarks to increase reclaim pressure to reduce the
+	 * likelihood of future fallbacks. Wake kswapd now as the node
+	 * may be balanced overall and kswapd will not wake naturally.
+	 */
+	boost_watermark(zone);
+	wakeup_kswapd(zone, 0, 0, zone_idx(zone));
+
 	/* We are not allowed to try stealing from the whole block */
 	if (!whole_block)
 		goto single_page;
@@ -3277,11 +3301,19 @@  static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
  * probably too small. It only makes sense to spread allocations to avoid
  * fragmentation between the Normal and DMA32 zones.
  */
-static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+static inline unsigned int
+alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 {
 	if (zone_idx(zone) != ZONE_NORMAL)
 		return 0;
 
+	/*
+	 * A fragmenting fallback will try waking kswapd. ALLOC_NOFRAGMENT
+	 * may break that so such callers can introduce fragmentation.
+	 */
+	if (!(gfp_mask & __GFP_KSWAPD_RECLAIM))
+		return 0;
+
 	/*
 	 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
 	 * the pointer is within zone->zone_pgdat->node_zones[].
@@ -3292,7 +3324,8 @@  static inline unsigned int alloc_flags_nofragment(struct zone *zone)
 	return ALLOC_NOFRAGMENT;
 }
 #else
-static inline unsigned int alloc_flags_nofragment(struct zone *zone)
+static inline unsigned int
+alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 {
 	return 0;
 }
@@ -4445,7 +4478,8 @@  __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 	 * Forbid the first pass from falling back to types that fragment
 	 * memory until all local zones are considered.
 	 */
-	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone);
+	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone,
+								gfp_mask);
 
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
@@ -7436,6 +7470,7 @@  static void __setup_per_zone_wmarks(void)
 
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+		zone->watermark_boost = 0;
 
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
@@ -7536,6 +7571,18 @@  int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
 int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 62ac0c488624..5ba76ec4f01e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3378,6 +3378,30 @@  static void age_active_anon(struct pglist_data *pgdat,
 	} while (memcg);
 }
 
+static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx)
+{
+	int i;
+	struct zone *zone;
+
+	/*
+	 * Check for watermark boosts top-down as the higher zones
+	 * are more likely to be boosted. Both watermarks and boosts
+	 * should not be checked at the same time as reclaim would
+	 * start prematurely when there is no boosting and a lower
+	 * zone is balanced.
+	 */
+	for (i = classzone_idx; i >= 0; i--) {
+		zone = pgdat->node_zones + i;
+		if (!managed_zone(zone))
+			continue;
+
+		if (zone->watermark_boost)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Returns true if there is an eligible zone balanced for the request order
  * and classzone_idx
@@ -3388,9 +3412,12 @@  static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long mark = -1;
 	struct zone *zone;
 
+	/*
+	 * Check watermarks bottom-up as lower zones are more likely to
+	 * meet watermarks.
+	 */
 	for (i = 0; i <= classzone_idx; i++) {
 		zone = pgdat->node_zones + i;
-
 		if (!managed_zone(zone))
 			continue;
 
@@ -3516,14 +3543,14 @@  static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	unsigned long pflags;
+	unsigned long nr_boost_reclaim;
+	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
+	bool boosted;
 	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
-		.priority = DEF_PRIORITY,
-		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
-		.may_swap = 1,
 	};
 
 	psi_memstall_enter(&pflags);
@@ -3531,9 +3558,28 @@  static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 	count_vm_event(PAGEOUTRUN);
 
+	/*
+	 * Account for the reclaim boost. Note that the zone boost is left in
+	 * place so that parallel allocations that are near the watermark will
+	 * stall or direct reclaim until kswapd is finished.
+	 */
+	nr_boost_reclaim = 0;
+	for (i = 0; i <= classzone_idx; i++) {
+		zone = pgdat->node_zones + i;
+		if (!managed_zone(zone))
+			continue;
+
+		nr_boost_reclaim += zone->watermark_boost;
+		zone_boosts[i] = zone->watermark_boost;
+	}
+	boosted = nr_boost_reclaim;
+
+restart:
+	sc.priority = DEF_PRIORITY;
 	do {
 		unsigned long nr_reclaimed = sc.nr_reclaimed;
 		bool raise_priority = true;
+		bool balanced;
 		bool ret;
 
 		sc.reclaim_idx = classzone_idx;
@@ -3560,13 +3606,39 @@  static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		}
 
 		/*
-		 * Only reclaim if there are no eligible zones. Note that
-		 * sc.reclaim_idx is not used as buffer_heads_over_limit may
-		 * have adjusted it.
+		 * If the pgdat is imbalanced then ignore boosting and preserve
+		 * the watermarks for a later time and restart. Note that the
+		 * zone watermarks will still be reset at the end of balancing
+		 * on the grounds that the normal reclaim should be enough to
+		 * re-evaluate if boosting is required when kswapd next wakes.
+		 */
+		balanced = pgdat_balanced(pgdat, sc.order, classzone_idx);
+		if (!balanced && nr_boost_reclaim) {
+			nr_boost_reclaim = 0;
+			goto restart;
+		}
+
+		/*
+		 * If boosting is not active then only reclaim if there are no
+		 * eligible zones. Note that sc.reclaim_idx is not used as
+		 * buffer_heads_over_limit may have adjusted it.
 		 */
-		if (pgdat_balanced(pgdat, sc.order, classzone_idx))
+		if (!nr_boost_reclaim && balanced)
 			goto out;
 
+		/* Limit the priority of boosting to avoid reclaim writeback */
+		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
+			raise_priority = false;
+
+		/*
+		 * Do not writeback or swap pages for boosted reclaim. The
+		 * intent is to relieve pressure not issue sub-optimal IO
+		 * from reclaim context. If no pages are reclaimed, the
+		 * reclaim will be aborted.
+		 */
+		sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
+		sc.may_swap = !nr_boost_reclaim;
+
 		/*
 		 * Do some background aging of the anon list, to give
 		 * pages a chance to be referenced before reclaiming. All
@@ -3618,6 +3690,16 @@  static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * progress in reclaiming pages
 		 */
 		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+		nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed);
+
+		/*
+		 * If reclaim made no progress for a boost, stop reclaim as
+		 * IO cannot be queued and it could be an infinite loop in
+		 * extreme circumstances.
+		 */
+		if (nr_boost_reclaim && !nr_reclaimed)
+			break;
+
 		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
@@ -3626,6 +3708,28 @@  static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		pgdat->kswapd_failures++;
 
 out:
+	/* If reclaim was boosted, account for the reclaim done in this pass */
+	if (boosted) {
+		unsigned long flags;
+
+		for (i = 0; i <= classzone_idx; i++) {
+			if (!zone_boosts[i])
+				continue;
+
+			/* Increments are under the zone lock */
+			zone = pgdat->node_zones + i;
+			spin_lock_irqsave(&zone->lock, flags);
+			zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
+			spin_unlock_irqrestore(&zone->lock, flags);
+		}
+
+		/*
+		 * As there is now likely space, wake up kcompactd to
+		 * defragment pageblocks.
+		 */
+		wakeup_kcompactd(pgdat, pageblock_order, classzone_idx);
+	}
+
 	snapshot_refaults(NULL, pgdat);
 	__fs_reclaim_release();
 	psi_memstall_leave(&pflags);
@@ -3854,7 +3958,8 @@  void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 
 	/* Hopeless node, leave it to direct reclaim if possible */
 	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
-	    pgdat_balanced(pgdat, order, classzone_idx)) {
+	    (pgdat_balanced(pgdat, order, classzone_idx) &&
+	     !pgdat_watermark_boosted(pgdat, classzone_idx))) {
 		/*
 		 * There may be plenty of free memory available, but it's too
 		 * fragmented for high-order allocations.  Wake up kcompactd