diff mbox series

mm,thp,shmem: limit shmem THP alloc gfp_mask

Message ID 20201021234846.5cc97e62@imladris.surriel.com (mailing list archive)
State New, archived
Headers show
Series mm,thp,shmem: limit shmem THP alloc gfp_mask | expand

Commit Message

Rik van Riel Oct. 22, 2020, 3:48 a.m. UTC
The allocation flags of anonymous transparent huge pages can be controlled
through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
help the system from getting bogged down in the page reclaim and compaction
code when many THPs are getting allocated simultaneously.

However, the gfp_mask for shmem THP allocations were not limited by those
configuration settings, and some workloads ended up with all CPUs stuck
on the LRU lock in the page reclaim code, trying to allocate dozens of
THPs simultaneously.

This patch applies the same configurated limitation of THPs to shmem
hugepage allocations, to prevent that from happening.

This way a THP defrag setting of "never" or "defer+madvise" will result
in quick allocation failures without direct reclaim when no 2MB free
pages are available.

Signed-off-by: Rik van Riel <riel@surriel.com>
---

Comments

Michal Hocko Oct. 22, 2020, 8:15 a.m. UTC | #1
On Wed 21-10-20 23:48:46, Rik van Riel wrote:
> The allocation flags of anonymous transparent huge pages can be controlled
> through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
> help the system from getting bogged down in the page reclaim and compaction
> code when many THPs are getting allocated simultaneously.
> 
> However, the gfp_mask for shmem THP allocations were not limited by those
> configuration settings, and some workloads ended up with all CPUs stuck
> on the LRU lock in the page reclaim code, trying to allocate dozens of
> THPs simultaneously.
> 
> This patch applies the same configurated limitation of THPs to shmem
> hugepage allocations, to prevent that from happening.
> 
> This way a THP defrag setting of "never" or "defer+madvise" will result
> in quick allocation failures without direct reclaim when no 2MB free
> pages are available.

I remmeber I wanted to unify this in the past as well. The patch got
reverted in the end. It was a long story and I do not have time to read
through lengthy discussions again. I do remember though that there were
some objections pointing out that shmem has its own behavior which is
controlled by the mount option - at least for the explicitly mounted
shmems. I might misremember.

[...]

> diff --git a/mm/shmem.c b/mm/shmem.c
> index 537c137698f8..d1290eb508e5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1545,8 +1545,11 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
>  		return NULL;
>  
>  	shmem_pseudo_vma_init(&pvma, info, hindex);
> -	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
> -			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
> +	/* Limit the gfp mask according to THP configuration. */
> +	gfp |= __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN;

What is the reason for these when alloc_hugepage_direct_gfpmask provides
the full mask?

> +	gfp &= alloc_hugepage_direct_gfpmask(&pvma);
> +	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
> +			       true);
>  	shmem_pseudo_vma_destroy(&pvma);
>  	if (page)
>  		prep_transhuge_page(page);
> 
> -- 
> All rights reversed.
Rik van Riel Oct. 22, 2020, 1:25 p.m. UTC | #2
On Thu, 2020-10-22 at 10:15 +0200, Michal Hocko wrote:
> On Wed 21-10-20 23:48:46, Rik van Riel wrote:
> > The allocation flags of anonymous transparent huge pages can be
> > controlled
> > through the files in /sys/kernel/mm/transparent_hugepage/defrag,
> > which can
> > help the system from getting bogged down in the page reclaim and
> > compaction
> > code when many THPs are getting allocated simultaneously.
> > 
> > However, the gfp_mask for shmem THP allocations were not limited by
> > those
> > configuration settings, and some workloads ended up with all CPUs
> > stuck
> > on the LRU lock in the page reclaim code, trying to allocate dozens
> > of
> > THPs simultaneously.
> > 
> > This patch applies the same configurated limitation of THPs to
> > shmem
> > hugepage allocations, to prevent that from happening.
> > 
> > This way a THP defrag setting of "never" or "defer+madvise" will
> > result
> > in quick allocation failures without direct reclaim when no 2MB
> > free
> > pages are available.
> 
> I remmeber I wanted to unify this in the past as well. The patch got
> reverted in the end. It was a long story and I do not have time to
> read
> through lengthy discussions again. I do remember though that there
> were
> some objections pointing out that shmem has its own behavior which is
> controlled by the mount option - at least for the explicitly mounted
> shmems. I might misremember.

That is not entirely true, though.

THP has two main sysfs knobs: "enabled" and "defrag"

The mount options
control the shmem equivalent options
for "enabled", but they do not do anything for the "defrag"
equivalent options.

This patch applies the "defrag" THP options to
shmem.

> [...]
> 
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 537c137698f8..d1290eb508e5 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1545,8 +1545,11 @@ static struct page
> > *shmem_alloc_hugepage(gfp_t gfp,
> >  		return NULL;
> >  
> >  	shmem_pseudo_vma_init(&pvma, info, hindex);
> > -	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY |
> > __GFP_NOWARN,
> > -			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
> > true);
> > +	/* Limit the gfp mask according to THP configuration. */
> > +	gfp |= __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN;
> 
> What is the reason for these when alloc_hugepage_direct_gfpmask
> provides
> the full mask?

The mapping_gfp_mask for the shmem file might have additional
restrictions above and beyond the gfp mask returned by
alloc_hugepage_direct_gfpmask, and I am not sure we should just
ignore the mapping_gfp_mask.

That is why this patch takes the union of both gfp masks.

However, digging into things more, it looks like shmem inodes
always have the mapping gfp mask set to GFP_HIGHUSER_MOVABLE,
and it is never changed, so simply using the output from
alloc_hugepage_direct_gfpmask should be fine.

I can do the patch either way. Just let me know what you prefer.

> > +	gfp &= alloc_hugepage_direct_gfpmask(&pvma);
> > +	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0,
> > numa_node_id(),
> > +			       true);
> >  	shmem_pseudo_vma_destroy(&pvma);
> >  	if (page)
> >  		prep_transhuge_page(page);
> > 
> > -- 
> > All rights reversed.
Vlastimil Babka Oct. 22, 2020, 2:51 p.m. UTC | #3
On 10/22/20 5:48 AM, Rik van Riel wrote:
> The allocation flags of anonymous transparent huge pages can be controlled
> through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
> help the system from getting bogged down in the page reclaim and compaction
> code when many THPs are getting allocated simultaneously.
> 
> However, the gfp_mask for shmem THP allocations were not limited by those
> configuration settings, and some workloads ended up with all CPUs stuck
> on the LRU lock in the page reclaim code, trying to allocate dozens of
> THPs simultaneously.
> 
> This patch applies the same configurated limitation of THPs to shmem
> hugepage allocations, to prevent that from happening.
> 
> This way a THP defrag setting of "never" or "defer+madvise" will result
> in quick allocation failures without direct reclaim when no 2MB free
> pages are available.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>

FTR, a patch to the same effect was sent by Xu Yu:

https://lore.kernel.org/r/11e1ead211eb7d141efa0eb75a46ee2096ee63f8.1603267572.git.xuyu@linux.alibaba.com

> ---
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index c603237e006c..0a5b164a26d9 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -614,6 +614,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
>   extern void pm_restrict_gfp_mask(void);
>   extern void pm_restore_gfp_mask(void);
>   
> +extern gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma);
> +
>   #ifdef CONFIG_PM_SLEEP
>   extern bool pm_suspended_storage(void);
>   #else
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9474dbc150ed..9b08ce5cc387 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -649,7 +649,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>    *	    available
>    * never: never stall for any thp allocation
>    */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>   {
>   	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>   
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 537c137698f8..d1290eb508e5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1545,8 +1545,11 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
>   		return NULL;
>   
>   	shmem_pseudo_vma_init(&pvma, info, hindex);
> -	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
> -			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
> +	/* Limit the gfp mask according to THP configuration. */
> +	gfp |= __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN;
> +	gfp &= alloc_hugepage_direct_gfpmask(&pvma);
> +	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
> +			       true);
>   	shmem_pseudo_vma_destroy(&pvma);
>   	if (page)
>   		prep_transhuge_page(page);
>
Vlastimil Babka Oct. 22, 2020, 2:52 p.m. UTC | #4
On 10/22/20 4:51 PM, Vlastimil Babka wrote:
> On 10/22/20 5:48 AM, Rik van Riel wrote:
>> The allocation flags of anonymous transparent huge pages can be controlled
>> through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
>> help the system from getting bogged down in the page reclaim and compaction
>> code when many THPs are getting allocated simultaneously.
>> 
>> However, the gfp_mask for shmem THP allocations were not limited by those
>> configuration settings, and some workloads ended up with all CPUs stuck
>> on the LRU lock in the page reclaim code, trying to allocate dozens of
>> THPs simultaneously.
>> 
>> This patch applies the same configurated limitation of THPs to shmem
>> hugepage allocations, to prevent that from happening.
>> 
>> This way a THP defrag setting of "never" or "defer+madvise" will result
>> in quick allocation failures without direct reclaim when no 2MB free
>> pages are available.
>> 
>> Signed-off-by: Rik van Riel <riel@surriel.com>
> 
> FTR, a patch to the same effect was sent by Xu Yu:

Hm thought I did CC, but TB ate it. sorry for the noise

> https://lore.kernel.org/r/11e1ead211eb7d141efa0eb75a46ee2096ee63f8.1603267572.git.xuyu@linux.alibaba.com
> 
>> ---
>> 
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index c603237e006c..0a5b164a26d9 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -614,6 +614,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
>>   extern void pm_restrict_gfp_mask(void);
>>   extern void pm_restore_gfp_mask(void);
>>   
>> +extern gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma);
>> +
>>   #ifdef CONFIG_PM_SLEEP
>>   extern bool pm_suspended_storage(void);
>>   #else
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 9474dbc150ed..9b08ce5cc387 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -649,7 +649,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>>    *	    available
>>    * never: never stall for any thp allocation
>>    */
>> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>> +gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>>   {
>>   	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>>   
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 537c137698f8..d1290eb508e5 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -1545,8 +1545,11 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
>>   		return NULL;
>>   
>>   	shmem_pseudo_vma_init(&pvma, info, hindex);
>> -	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
>> -			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
>> +	/* Limit the gfp mask according to THP configuration. */
>> +	gfp |= __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN;
>> +	gfp &= alloc_hugepage_direct_gfpmask(&pvma);
>> +	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
>> +			       true);
>>   	shmem_pseudo_vma_destroy(&pvma);
>>   	if (page)
>>   		prep_transhuge_page(page);
>> 
>
Michal Hocko Oct. 22, 2020, 3:50 p.m. UTC | #5
On Thu 22-10-20 09:25:21, Rik van Riel wrote:
> On Thu, 2020-10-22 at 10:15 +0200, Michal Hocko wrote:
> > On Wed 21-10-20 23:48:46, Rik van Riel wrote:
> > > The allocation flags of anonymous transparent huge pages can be
> > > controlled
> > > through the files in /sys/kernel/mm/transparent_hugepage/defrag,
> > > which can
> > > help the system from getting bogged down in the page reclaim and
> > > compaction
> > > code when many THPs are getting allocated simultaneously.
> > > 
> > > However, the gfp_mask for shmem THP allocations were not limited by
> > > those
> > > configuration settings, and some workloads ended up with all CPUs
> > > stuck
> > > on the LRU lock in the page reclaim code, trying to allocate dozens
> > > of
> > > THPs simultaneously.
> > > 
> > > This patch applies the same configurated limitation of THPs to
> > > shmem
> > > hugepage allocations, to prevent that from happening.
> > > 
> > > This way a THP defrag setting of "never" or "defer+madvise" will
> > > result
> > > in quick allocation failures without direct reclaim when no 2MB
> > > free
> > > pages are available.
> > 
> > I remmeber I wanted to unify this in the past as well. The patch got
> > reverted in the end. It was a long story and I do not have time to
> > read
> > through lengthy discussions again. I do remember though that there
> > were
> > some objections pointing out that shmem has its own behavior which is
> > controlled by the mount option - at least for the explicitly mounted
> > shmems. I might misremember.
> 
> That is not entirely true, though.
> 
> THP has two main sysfs knobs: "enabled" and "defrag"
> 
> The mount options
> control the shmem equivalent options
> for "enabled", but they do not do anything for the "defrag"
> equivalent options.

Yeah, the situation is quite messy :/

> This patch applies the "defrag" THP options to
> shmem.

I am not really objecting I just do remember some pushback. My previous
attempt was to unify everything inside alloc_pages_vma IIRC.

> > [...]
> > 
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 537c137698f8..d1290eb508e5 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -1545,8 +1545,11 @@ static struct page
> > > *shmem_alloc_hugepage(gfp_t gfp,
> > >  		return NULL;
> > >  
> > >  	shmem_pseudo_vma_init(&pvma, info, hindex);
> > > -	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY |
> > > __GFP_NOWARN,
> > > -			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
> > > true);
> > > +	/* Limit the gfp mask according to THP configuration. */
> > > +	gfp |= __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN;
> > 
> > What is the reason for these when alloc_hugepage_direct_gfpmask
> > provides
> > the full mask?
> 
> The mapping_gfp_mask for the shmem file might have additional
> restrictions above and beyond the gfp mask returned by
> alloc_hugepage_direct_gfpmask, and I am not sure we should just
> ignore the mapping_gfp_mask.

No, we shouldn't. But I do not see why you should be adding the above
set of flags on top.

> That is why this patch takes the union of both gfp masks.
> 
> However, digging into things more, it looks like shmem inodes
> always have the mapping gfp mask set to GFP_HIGHUSER_MOVABLE,
> and it is never changed, so simply using the output from
> alloc_hugepage_direct_gfpmask should be fine.
> 
> I can do the patch either way. Just let me know what you prefer.

I would just and the given gfp with alloc_hugepage_direct_gfpmask
Xu Yu Oct. 22, 2020, 4 p.m. UTC | #6
On 10/22/20 11:48 AM, Rik van Riel wrote:
> The allocation flags of anonymous transparent huge pages can be controlled
> through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
> help the system from getting bogged down in the page reclaim and compaction
> code when many THPs are getting allocated simultaneously.
> 
> However, the gfp_mask for shmem THP allocations were not limited by those
> configuration settings, and some workloads ended up with all CPUs stuck
> on the LRU lock in the page reclaim code, trying to allocate dozens of
> THPs simultaneously.
> 
> This patch applies the same configurated limitation of THPs to shmem
> hugepage allocations, to prevent that from happening.
> 
> This way a THP defrag setting of "never" or "defer+madvise" will result
> in quick allocation failures without direct reclaim when no 2MB free
> pages are available.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> ---
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index c603237e006c..0a5b164a26d9 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -614,6 +614,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
>   extern void pm_restrict_gfp_mask(void);
>   extern void pm_restore_gfp_mask(void);
>   
> +extern gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma);
> +
>   #ifdef CONFIG_PM_SLEEP
>   extern bool pm_suspended_storage(void);
>   #else
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9474dbc150ed..9b08ce5cc387 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -649,7 +649,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>    *	    available
>    * never: never stall for any thp allocation
>    */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>   {
>   	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>   
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 537c137698f8..d1290eb508e5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1545,8 +1545,11 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
>   		return NULL;
>   
>   	shmem_pseudo_vma_init(&pvma, info, hindex);
> -	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
> -			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
> +	/* Limit the gfp mask according to THP configuration. */
> +	gfp |= __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN;
> +	gfp &= alloc_hugepage_direct_gfpmask(&pvma);

It is fine to reuse `alloc_hugepage_direct_gfpmask`, but
`pvma.vm_flags & VM_HUGEPAGE` is always false here, thus,
`vma_madvised` in `alloc_hugepage_direct_gfpmask` will always
be false.

That is why I chose to do the gfp_mask fixup in `shmem_getpage_gfp`,
using `sgp_huge` to indicate `vma_madvised`, although with some silly
mistakes pointed out by you, in another mail thread.

It will be better if vma_madvised is well handled in your solution.

> +	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
> +			       true);
>   	shmem_pseudo_vma_destroy(&pvma);
>   	if (page)
>   		prep_transhuge_page(page);
>
Rik van Riel Oct. 22, 2020, 4:06 p.m. UTC | #7
On Thu, 2020-10-22 at 17:50 +0200, Michal Hocko wrote:
> On Thu 22-10-20 09:25:21, Rik van Riel wrote:
> > On Thu, 2020-10-22 at 10:15 +0200, Michal Hocko wrote:
> > > On Wed 21-10-20 23:48:46, Rik van Riel wrote:
> > > > 
> > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > index 537c137698f8..d1290eb508e5 100644
> > > > --- a/mm/shmem.c
> > > > +++ b/mm/shmem.c
> > > > @@ -1545,8 +1545,11 @@ static struct page
> > > > *shmem_alloc_hugepage(gfp_t gfp,
> > > >  		return NULL;
> > > >  
> > > >  	shmem_pseudo_vma_init(&pvma, info, hindex);
> > > > -	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY
> > > > |
> > > > __GFP_NOWARN,
> > > > -			HPAGE_PMD_ORDER, &pvma, 0,
> > > > numa_node_id(),
> > > > true);
> > > > +	/* Limit the gfp mask according to THP configuration.
> > > > */
> > > > +	gfp |= __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN;
> > > 
> > > What is the reason for these when alloc_hugepage_direct_gfpmask
> > > provides
> > > the full mask?
> > 
> > The mapping_gfp_mask for the shmem file might have additional
> > restrictions above and beyond the gfp mask returned by
> > alloc_hugepage_direct_gfpmask, and I am not sure we should just
> > ignore the mapping_gfp_mask.
> 
> No, we shouldn't. But I do not see why you should be adding the above
> set of flags on top.

Because THP allocations are higher order and optimistic,
and we want them to:
1) be annotated as compound allocations
2) fail (and fall back to 4kB allocations) when they cannot
   be easily satisfied, and
3) not create a spew of allocation failure backtraces on
   the (serial) console when these THP allocations fail

> > That is why this patch takes the union of both gfp masks.
> > 
> > However, digging into things more, it looks like shmem inodes
> > always have the mapping gfp mask set to GFP_HIGHUSER_MOVABLE,
> > and it is never changed, so simply using the output from
> > alloc_hugepage_direct_gfpmask should be fine.
> > 
> > I can do the patch either way. Just let me know what you prefer.
> 
> I would just and the given gfp with alloc_hugepage_direct_gfpmask

That would miss the three things we definitely want
from above.
Rik van Riel Oct. 22, 2020, 4:40 p.m. UTC | #8
On Fri, 2020-10-23 at 00:00 +0800, Yu Xu wrote:
> On 10/22/20 11:48 AM, Rik van Riel wrote:
> > The allocation flags of anonymous transparent huge pages can be
> > controlled
> > through the files in /sys/kernel/mm/transparent_hugepage/defrag,
> > which can
> > help the system from getting bogged down in the page reclaim and
> > compaction
> > code when many THPs are getting allocated simultaneously.
> > 
> > However, the gfp_mask for shmem THP allocations were not limited by
> > those
> > configuration settings, and some workloads ended up with all CPUs
> > stuck
> > on the LRU lock in the page reclaim code, trying to allocate dozens
> > of
> > THPs simultaneously.
> > 
> > This patch applies the same configurated limitation of THPs to
> > shmem
> > hugepage allocations, to prevent that from happening.
> > 
> > This way a THP defrag setting of "never" or "defer+madvise" will
> > result
> > in quick allocation failures without direct reclaim when no 2MB
> > free
> > pages are available.
> > 
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> > ---
> > 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index c603237e006c..0a5b164a26d9 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -614,6 +614,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
> >   extern void pm_restrict_gfp_mask(void);
> >   extern void pm_restore_gfp_mask(void);
> >   
> > +extern gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct
> > *vma);
> > +
> >   #ifdef CONFIG_PM_SLEEP
> >   extern bool pm_suspended_storage(void);
> >   #else
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 9474dbc150ed..9b08ce5cc387 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -649,7 +649,7 @@ static vm_fault_t
> > __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
> >    *	    available
> >    * never: never stall for any thp allocation
> >    */
> > -static inline gfp_t alloc_hugepage_direct_gfpmask(struct
> > vm_area_struct *vma)
> > +gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> >   {
> >   	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
> >   
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 537c137698f8..d1290eb508e5 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1545,8 +1545,11 @@ static struct page
> > *shmem_alloc_hugepage(gfp_t gfp,
> >   		return NULL;
> >   
> >   	shmem_pseudo_vma_init(&pvma, info, hindex);
> > -	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY |
> > __GFP_NOWARN,
> > -			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
> > true);
> > +	/* Limit the gfp mask according to THP configuration. */
> > +	gfp |= __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN;
> > +	gfp &= alloc_hugepage_direct_gfpmask(&pvma);
> 
> It is fine to reuse `alloc_hugepage_direct_gfpmask`, but
> `pvma.vm_flags & VM_HUGEPAGE` is always false here, thus,
> `vma_madvised` in `alloc_hugepage_direct_gfpmask` will always
> be false.
> 
> That is why I chose to do the gfp_mask fixup in `shmem_getpage_gfp`,
> using `sgp_huge` to indicate `vma_madvised`, although with some silly
> mistakes pointed out by you, in another mail thread.
> 
> It will be better if vma_madvised is well handled in your solution.

OK, let me send a v2 that does that!

By just passing a correct gfp_mask to shmem_alloc_and_acct_page
we can also avoid the gfp gymnastics in shmem_alloc_hugepage
that Michal rightfully objected to.
Michal Hocko Oct. 23, 2020, 6:47 a.m. UTC | #9
On Thu 22-10-20 12:06:01, Rik van Riel wrote:
> On Thu, 2020-10-22 at 17:50 +0200, Michal Hocko wrote:
> > On Thu 22-10-20 09:25:21, Rik van Riel wrote:
> > > On Thu, 2020-10-22 at 10:15 +0200, Michal Hocko wrote:
> > > > On Wed 21-10-20 23:48:46, Rik van Riel wrote:
> > > > > 
> > > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > > index 537c137698f8..d1290eb508e5 100644
> > > > > --- a/mm/shmem.c
> > > > > +++ b/mm/shmem.c
> > > > > @@ -1545,8 +1545,11 @@ static struct page
> > > > > *shmem_alloc_hugepage(gfp_t gfp,
> > > > >  		return NULL;
> > > > >  
> > > > >  	shmem_pseudo_vma_init(&pvma, info, hindex);
> > > > > -	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY
> > > > > |
> > > > > __GFP_NOWARN,
> > > > > -			HPAGE_PMD_ORDER, &pvma, 0,
> > > > > numa_node_id(),
> > > > > true);
> > > > > +	/* Limit the gfp mask according to THP configuration.
> > > > > */
> > > > > +	gfp |= __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN;
> > > > 
> > > > What is the reason for these when alloc_hugepage_direct_gfpmask
> > > > provides
> > > > the full mask?
> > > 
> > > The mapping_gfp_mask for the shmem file might have additional
> > > restrictions above and beyond the gfp mask returned by
> > > alloc_hugepage_direct_gfpmask, and I am not sure we should just
> > > ignore the mapping_gfp_mask.
> > 
> > No, we shouldn't. But I do not see why you should be adding the above
> > set of flags on top.
> 
> Because THP allocations are higher order and optimistic,
> and we want them to:
> 1) be annotated as compound allocations
> 2) fail (and fall back to 4kB allocations) when they cannot
>    be easily satisfied, and
> 3) not create a spew of allocation failure backtraces on
>    the (serial) console when these THP allocations fail

This all is already returned from alloc_hugepage_direct_gfpmask.
diff mbox series

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c603237e006c..0a5b164a26d9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -614,6 +614,8 @@  bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
 extern void pm_restrict_gfp_mask(void);
 extern void pm_restore_gfp_mask(void);
 
+extern gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma);
+
 #ifdef CONFIG_PM_SLEEP
 extern bool pm_suspended_storage(void);
 #else
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9474dbc150ed..9b08ce5cc387 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -649,7 +649,7 @@  static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 537c137698f8..d1290eb508e5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1545,8 +1545,11 @@  static struct page *shmem_alloc_hugepage(gfp_t gfp,
 		return NULL;
 
 	shmem_pseudo_vma_init(&pvma, info, hindex);
-	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
-			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
+	/* Limit the gfp mask according to THP configuration. */
+	gfp |= __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN;
+	gfp &= alloc_hugepage_direct_gfpmask(&pvma);
+	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
+			       true);
 	shmem_pseudo_vma_destroy(&pvma);
 	if (page)
 		prep_transhuge_page(page);