diff mbox series

[RFC,1/3] mm, thp: restore __GFP_NORETRY for madvised thp fault allocations

Message ID 20181211142941.20500-2-vbabka@suse.cz (mailing list archive)
State New, archived
Headers show
Series reduce THP fault thrashing | expand

Commit Message

Vlastimil Babka Dec. 11, 2018, 2:29 p.m. UTC
Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
madvised allocations") intended to make THP faults in MADV_HUGEPAGE areas more
successful for processes that indicate that they are willing to pay a higher
initial setup cost for long-term THP benefits. In the current page allocator
implementation this means that the allocations will try to use reclaim and the
more costly sync compaction mode, in case the initial direct async compaction
fails.

However, THP faults also include __GFP_THISNODE, which, combined with direct
reclaim, can result in a node-reclaim-like local node thrashing behavior, as
reported by Andrea [1].

While this patch is not a full fix, the first step is to restore __GFP_NORETRY
for madvised THP faults. The expected downside are potentially worse THP
fault success rates for the madvised areas, which will have to then rely more
on khugepaged. For khugepaged, __GFP_NORETRY is not restored, as its activity
should be limited enough by sleeping to cause noticeable thrashing.

Note that alloc_new_node_page() and new_page() is probably another candidate as
they handle the migrate_pages(2), resp. mbind(2) syscall, which can thus allow
unprivileged node-reclaim-like behavior.

The patch also updates the comments in alloc_hugepage_direct_gfpmask() because
elsewhere compaction during page fault is called direct compaction, and
'synchronous' refers to the migration mode, which is not used for THP faults.

[1] https://lkml.kernel.org/m/20180820032204.9591-1-aarcange@redhat.com

Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
---
 mm/huge_memory.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Comments

Mel Gorman Jan. 8, 2019, 11:16 a.m. UTC | #1
On Tue, Dec 11, 2018 at 03:29:39PM +0100, Vlastimil Babka wrote:
> Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
> madvised allocations") intended to make THP faults in MADV_HUGEPAGE areas more
> successful for processes that indicate that they are willing to pay a higher
> initial setup cost for long-term THP benefits. In the current page allocator
> implementation this means that the allocations will try to use reclaim and the
> more costly sync compaction mode, in case the initial direct async compaction
> fails.
> 

First off, I'm going to have trouble reasoning about this because there
is also my own series that reduces compaction failure rates. If that
series get approved, then it's far more likely that when compaction
fails that we really mean it and retries from the page allocator context
may be pointless unless the caller indicates it should really retry for
a long time.

The corner case I have in mind is a compaction failure on a very small
percentage of remaining pageblocks that are currently under writeback.

It also means that the merit of this series needs to account for whether
it's before or after the compaction series as the impact will be
different. FWIW, I had the same problem with evaluating the compaction
series on the context of __GFP_THISNODE vs !__GFP_THISNODE

> However, THP faults also include __GFP_THISNODE, which, combined with direct
> reclaim, can result in a node-reclaim-like local node thrashing behavior, as
> reported by Andrea [1].
> 
> While this patch is not a full fix, the first step is to restore __GFP_NORETRY
> for madvised THP faults. The expected downside are potentially worse THP
> fault success rates for the madvised areas, which will have to then rely more
> on khugepaged. For khugepaged, __GFP_NORETRY is not restored, as its activity
> should be limited enough by sleeping to cause noticeable thrashing.
> 
> Note that alloc_new_node_page() and new_page() is probably another candidate as
> they handle the migrate_pages(2), resp. mbind(2) syscall, which can thus allow
> unprivileged node-reclaim-like behavior.
> 
> The patch also updates the comments in alloc_hugepage_direct_gfpmask() because
> elsewhere compaction during page fault is called direct compaction, and
> 'synchronous' refers to the migration mode, which is not used for THP faults.
> 
> [1] https://lkml.kernel.org/m/20180820032204.9591-1-aarcange@redhat.com
> 
> Reported-by: Andrea Arcangeli <aarcange@redhat.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@kernel.org>
> ---
>  mm/huge_memory.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5da55b38b1b7..c442b12b060c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -633,24 +633,23 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>  
> -	/* Always do synchronous compaction */
> +	/* Always try direct compaction */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> +		return GFP_TRANSHUGE | __GFP_NORETRY;
>  

While I get that you want to reduce thrashing, the configuration item
really indicates the system (not just the caller, but everyone) is willing
to take a hit to get a THP.

>  	/* Kick kcompactd and fail quickly */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>  
> -	/* Synchronous compaction if madvised, otherwise kick kcompactd */
> +	/* Direct compaction if madvised, otherwise kick kcompactd */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>  		return GFP_TRANSHUGE_LIGHT |
> -			(vma_madvised ? __GFP_DIRECT_RECLAIM :
> +			(vma_madvised ? (__GFP_DIRECT_RECLAIM | __GFP_NORETRY):
>  					__GFP_KSWAPD_RECLAIM);
>  

Similar note.

> -	/* Only do synchronous compaction if madvised */
> +	/* Only do direct compaction if madvised */
>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
> -		return GFP_TRANSHUGE_LIGHT |
> -		       (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
> +		return vma_madvised ? (GFP_TRANSHUGE | __GFP_NORETRY) : GFP_TRANSHUGE_LIGHT;
>  

Similar note.

That said, if this series went in before the compaction series then I
would think it's ok and I would Ack it. However, *after* the compaction
series, I think it would make more sense to rip out or severely reduce the
complex should_compact_retry logic and move towards "if compaction returns,
the page allocator should take its word for it except in extreme cases".

Compaction previously was a bit too casual about partially scanning
backblocks and the setting/clearing of pageblock bits. The retry logic
sortof dealt with it by making reset/clear cycles very frequent but
after the series, they are relatively rare[1].

[1] Says he bravely before proven otherwise
Vlastimil Babka Jan. 10, 2019, 1:52 p.m. UTC | #2
On 1/8/19 12:16 PM, Mel Gorman wrote:
> On Tue, Dec 11, 2018 at 03:29:39PM +0100, Vlastimil Babka wrote:
>> Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
>> madvised allocations") intended to make THP faults in MADV_HUGEPAGE areas more
>> successful for processes that indicate that they are willing to pay a higher
>> initial setup cost for long-term THP benefits. In the current page allocator
>> implementation this means that the allocations will try to use reclaim and the
>> more costly sync compaction mode, in case the initial direct async compaction
>> fails.
>>
> 
> First off, I'm going to have trouble reasoning about this because there
> is also my own series that reduces compaction failure rates. If that
> series get approved, then it's far more likely that when compaction
> fails that we really mean it and retries from the page allocator context
> may be pointless unless the caller indicates it should really retry for
> a long time.

Understood.

> The corner case I have in mind is a compaction failure on a very small
> percentage of remaining pageblocks that are currently under writeback.
> 
> It also means that the merit of this series needs to account for whether
> it's before or after the compaction series as the impact will be
> different. FWIW, I had the same problem with evaluating the compaction
> series on the context of __GFP_THISNODE vs !__GFP_THISNODE

Right. In that case I think for mainline, making compaction better has
priority over trying to compensate for it. The question is if somebody
wants to fix stable/older distro kernels. Now that it wasn't possible to
remove the __GFP_THISNODE for THP's, I thought this might be an
alternative acceptable to anyone, provided that it works. Backporting
your compaction series would be much more difficult I guess. Of course
distro kernels can also divert from mainline and go with the
__GFP_THISNODE removal privately.

>> However, THP faults also include __GFP_THISNODE, which, combined with direct
>> reclaim, can result in a node-reclaim-like local node thrashing behavior, as
>> reported by Andrea [1].
>>
>> While this patch is not a full fix, the first step is to restore __GFP_NORETRY
>> for madvised THP faults. The expected downside are potentially worse THP
>> fault success rates for the madvised areas, which will have to then rely more
>> on khugepaged. For khugepaged, __GFP_NORETRY is not restored, as its activity
>> should be limited enough by sleeping to cause noticeable thrashing.
>>
>> Note that alloc_new_node_page() and new_page() is probably another candidate as
>> they handle the migrate_pages(2), resp. mbind(2) syscall, which can thus allow
>> unprivileged node-reclaim-like behavior.
>>
>> The patch also updates the comments in alloc_hugepage_direct_gfpmask() because
>> elsewhere compaction during page fault is called direct compaction, and
>> 'synchronous' refers to the migration mode, which is not used for THP faults.
>>
>> [1] https://lkml.kernel.org/m/20180820032204.9591-1-aarcange@redhat.com
>>
>> Reported-by: Andrea Arcangeli <aarcange@redhat.com>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>> Cc: David Rientjes <rientjes@google.com>
>> Cc: Mel Gorman <mgorman@techsingularity.net>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> ---
>>  mm/huge_memory.c | 13 ++++++-------
>>  1 file changed, 6 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 5da55b38b1b7..c442b12b060c 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -633,24 +633,23 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>  {
>>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>  
>> -	/* Always do synchronous compaction */
>> +	/* Always try direct compaction */
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
>> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>> +		return GFP_TRANSHUGE | __GFP_NORETRY;
>>  
> 
> While I get that you want to reduce thrashing, the configuration item
> really indicates the system (not just the caller, but everyone) is willing
> to take a hit to get a THP.

Yeah some hit in exchange for THP's, but probably not an overreclaim due
to __GFP_THISNODE implications.

>>  	/* Kick kcompactd and fail quickly */
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
>>  		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
>>  
>> -	/* Synchronous compaction if madvised, otherwise kick kcompactd */
>> +	/* Direct compaction if madvised, otherwise kick kcompactd */
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
>>  		return GFP_TRANSHUGE_LIGHT |
>> -			(vma_madvised ? __GFP_DIRECT_RECLAIM :
>> +			(vma_madvised ? (__GFP_DIRECT_RECLAIM | __GFP_NORETRY):
>>  					__GFP_KSWAPD_RECLAIM);
>>  
> 
> Similar note.
> 
>> -	/* Only do synchronous compaction if madvised */
>> +	/* Only do direct compaction if madvised */
>>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
>> -		return GFP_TRANSHUGE_LIGHT |
>> -		       (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
>> +		return vma_madvised ? (GFP_TRANSHUGE | __GFP_NORETRY) : GFP_TRANSHUGE_LIGHT;
>>  
> 
> Similar note.
> 
> That said, if this series went in before the compaction series then I
> would think it's ok and I would Ack it. However, *after* the compaction
> series, I think it would make more sense to rip out or severely reduce the
> complex should_compact_retry logic and move towards "if compaction returns,
> the page allocator should take its word for it except in extreme cases".

OK.

> Compaction previously was a bit too casual about partially scanning
> backblocks and the setting/clearing of pageblock bits. The retry logic
> sortof dealt with it by making reset/clear cycles very frequent but
> after the series, they are relatively rare[1].
> 
> [1] Says he bravely before proven otherwise
>
Mel Gorman Jan. 10, 2019, 2:55 p.m. UTC | #3
On Thu, Jan 10, 2019 at 02:52:32PM +0100, Vlastimil Babka wrote:
> > It also means that the merit of this series needs to account for whether
> > it's before or after the compaction series as the impact will be
> > different. FWIW, I had the same problem with evaluating the compaction
> > series on the context of __GFP_THISNODE vs !__GFP_THISNODE
> 
> Right. In that case I think for mainline, making compaction better has
> priority over trying to compensate for it.

Thanks, I agree.

> The question is if somebody
> wants to fix stable/older distro kernels. Now that it wasn't possible to
> remove the __GFP_THISNODE for THP's, I thought this might be an
> alternative acceptable to anyone, provided that it works. Backporting
> your compaction series would be much more difficult I guess. Of course
> distro kernels can also divert from mainline and go with the
> __GFP_THISNODE removal privately.
> 

That is a good point and hopefully Andrea can come back with some data from
his side. I can queue up something our side and see how it affects the
usemem case. As it's a backporting issue that I think would be rejected
by the stable rules, we can discuss the specifics offline and keep "did
it work or not" for here.

I agree that backporting the compaction series too far back would get
"interesting" as some of the pre-requisites are unexpected -- e.g. all
the data we have assumes the fragmentation avoidance stuff is in place and
that in turn has other dependencies such as when kcompactd gets woken up,
your patches on how fallbacks are managed etc.

> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 5da55b38b1b7..c442b12b060c 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -633,24 +633,23 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> >>  {
> >>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> >>  
> >> -	/* Always do synchronous compaction */
> >> +	/* Always try direct compaction */
> >>  	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> >> -		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
> >> +		return GFP_TRANSHUGE | __GFP_NORETRY;
> >>  
> > 
> > While I get that you want to reduce thrashing, the configuration item
> > really indicates the system (not just the caller, but everyone) is willing
> > to take a hit to get a THP.
> 
> Yeah some hit in exchange for THP's, but probably not an overreclaim due
> to __GFP_THISNODE implications.
> 

Fair point, we can get that data.
diff mbox series

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5da55b38b1b7..c442b12b060c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -633,24 +633,23 @@  static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
-	/* Always do synchronous compaction */
+	/* Always try direct compaction */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+		return GFP_TRANSHUGE | __GFP_NORETRY;
 
 	/* Kick kcompactd and fail quickly */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
 
-	/* Synchronous compaction if madvised, otherwise kick kcompactd */
+	/* Direct compaction if madvised, otherwise kick kcompactd */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
 		return GFP_TRANSHUGE_LIGHT |
-			(vma_madvised ? __GFP_DIRECT_RECLAIM :
+			(vma_madvised ? (__GFP_DIRECT_RECLAIM | __GFP_NORETRY):
 					__GFP_KSWAPD_RECLAIM);
 
-	/* Only do synchronous compaction if madvised */
+	/* Only do direct compaction if madvised */
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE_LIGHT |
-		       (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
+		return vma_madvised ? (GFP_TRANSHUGE | __GFP_NORETRY) : GFP_TRANSHUGE_LIGHT;
 
 	return GFP_TRANSHUGE_LIGHT;
 }