diff mbox series

[v2] mm,thp,shmem: limit shmem THP alloc gfp_mask

Message ID 20201022124511.72448a5f@imladris.surriel.com (mailing list archive)
State New, archived

Commit Message

Rik van Riel Oct. 22, 2020, 4:45 p.m. UTC
The allocation flags of anonymous transparent huge pages can be controlled
through /sys/kernel/mm/transparent_hugepage/defrag, which can help keep
the system from getting bogged down in the page reclaim and compaction
code when many THPs are getting allocated simultaneously.

However, the gfp_mask for shmem THP allocations was not limited by those
configuration settings, and some workloads ended up with all CPUs stuck
on the LRU lock in the page reclaim code, trying to allocate dozens of
THPs simultaneously.

This patch applies the same configured limitation of THPs to shmem
hugepage allocations, to prevent that from happening.

This way a THP defrag setting of "never" or "defer+madvise" will result
in quick allocation failures without direct reclaim when no 2MB free
pages are available.

Signed-off-by: Rik van Riel <riel@surriel.com>
--- 
v2: move gfp calculation to shmem_getpage_gfp as suggested by Yu Xu

Comments

Vlastimil Babka Oct. 22, 2020, 4:52 p.m. UTC | #1
On 10/22/20 6:45 PM, Rik van Riel wrote:
> The allocation flags of anonymous transparent huge pages can be controlled
> through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
> help the system from getting bogged down in the page reclaim and compaction
> code when many THPs are getting allocated simultaneously.
> 
> However, the gfp_mask for shmem THP allocations were not limited by those
> configuration settings, and some workloads ended up with all CPUs stuck
> on the LRU lock in the page reclaim code, trying to allocate dozens of
> THPs simultaneously.
> 
> This patch applies the same configurated limitation of THPs to shmem
> hugepage allocations, to prevent that from happening.
> 
> This way a THP defrag setting of "never" or "defer+madvise" will result
> in quick allocation failures without direct reclaim when no 2MB free
> pages are available.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
> v2: move gfp calculation to shmem_getpage_gfp as suggested by Yu Xu
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index c603237e006c..0a5b164a26d9 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -614,6 +614,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
>   extern void pm_restrict_gfp_mask(void);
>   extern void pm_restore_gfp_mask(void);
>   
> +extern gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma);
> +
>   #ifdef CONFIG_PM_SLEEP
>   extern bool pm_suspended_storage(void);
>   #else
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9474dbc150ed..9b08ce5cc387 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -649,7 +649,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>    *	    available
>    * never: never stall for any thp allocation
>    */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>   {
>   	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>   
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 537c137698f8..9710b9df91e9 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1545,8 +1545,8 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
>   		return NULL;
>   
>   	shmem_pseudo_vma_init(&pvma, info, hindex);
> -	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
> -			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
> +	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
> +			       true);
>   	shmem_pseudo_vma_destroy(&pvma);
>   	if (page)
>   		prep_transhuge_page(page);
> @@ -1802,6 +1802,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>   	struct page *page;
>   	enum sgp_type sgp_huge = sgp;
>   	pgoff_t hindex = index;
> +	gfp_t huge_gfp;
>   	int error;
>   	int once = 0;
>   	int alloced = 0;
> @@ -1887,7 +1888,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>   	}
>   
>   alloc_huge:
> -	page = shmem_alloc_and_acct_page(gfp, inode, index, true);
> +	huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> +	page = shmem_alloc_and_acct_page(huge_gfp, inode, index, true);
>   	if (IS_ERR(page)) {
>   alloc_nohuge:
>   		page = shmem_alloc_and_acct_page(gfp, inode,
>
Hugh Dickins Oct. 23, 2020, 2:54 a.m. UTC | #2
On Thu, 22 Oct 2020, Rik van Riel wrote:

> The allocation flags of anonymous transparent huge pages can be controlled
> through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
> help the system from getting bogged down in the page reclaim and compaction
> code when many THPs are getting allocated simultaneously.
> 
> However, the gfp_mask for shmem THP allocations were not limited by those
> configuration settings, and some workloads ended up with all CPUs stuck
> on the LRU lock in the page reclaim code, trying to allocate dozens of
> THPs simultaneously.
> 
> This patch applies the same configurated limitation of THPs to shmem
> hugepage allocations, to prevent that from happening.
> 
> This way a THP defrag setting of "never" or "defer+madvise" will result
> in quick allocation failures without direct reclaim when no 2MB free
> pages are available.
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>

NAK in its present untested form: see below.

I'm open to change here, particularly to Yu Xu's point (in other mail)
about direct reclaim - we avoid that here in Google too: though it's
not so much to avoid the direct reclaim, as to avoid the latencies of
direct compaction, which __GFP_DIRECT_RECLAIM allows as a side-effect.

> --- 
> v2: move gfp calculation to shmem_getpage_gfp as suggested by Yu Xu
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index c603237e006c..0a5b164a26d9 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -614,6 +614,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
>  extern void pm_restrict_gfp_mask(void);
>  extern void pm_restore_gfp_mask(void);
>  
> +extern gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma);
> +
>  #ifdef CONFIG_PM_SLEEP
>  extern bool pm_suspended_storage(void);
>  #else
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9474dbc150ed..9b08ce5cc387 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -649,7 +649,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>   *	    available
>   * never: never stall for any thp allocation
>   */
> -static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
> +gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>  {
>  	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>  
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 537c137698f8..9710b9df91e9 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1545,8 +1545,8 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
>  		return NULL;
>  
>  	shmem_pseudo_vma_init(&pvma, info, hindex);
> -	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
> -			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
> +	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
> +			       true);

Commendably neat so far.

>  	shmem_pseudo_vma_destroy(&pvma);
>  	if (page)
>  		prep_transhuge_page(page);
> @@ -1802,6 +1802,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  	struct page *page;
>  	enum sgp_type sgp_huge = sgp;
>  	pgoff_t hindex = index;
> +	gfp_t huge_gfp;
>  	int error;
>  	int once = 0;
>  	int alloced = 0;
> @@ -1887,7 +1888,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
>  	}
>  
>  alloc_huge:
> -	page = shmem_alloc_and_acct_page(gfp, inode, index, true);
> +	huge_gfp = alloc_hugepage_direct_gfpmask(vma);

Still looks nice: but what about the crash when vma is NULL?

It may work for shmem_fault() (though I'll probably disagree on the
details): but tmpfs is a filesystem, so most if not all of the system
calls which arrive here have no vma to offer.

Michal is right to remember pushback before, because tmpfs is a
filesystem, and "huge=" is a mount option: in using a huge=always
filesystem, the user has already declared a preference for huge pages.
Whereas the original anon THP had to deduce that preference from sys
tunables and vma madvice.

I certainly found it a lot easier to ignore all the shifting sandmaze
of the anon THP tunables, and I think Kirill followed me on that.

But it's likely that they have accumulated some defrag wisdom, which
tmpfs can take on board - but please accept that in using a huge mount,
the preference for huge has already been expressed, so I don't expect
anon THP alloc_hugepage_direct_gfpmask() choices will map one to one.

> +	page = shmem_alloc_and_acct_page(huge_gfp, inode, index, true);
>  	if (IS_ERR(page)) {
>  alloc_nohuge:
>  		page = shmem_alloc_and_acct_page(gfp, inode,
> 

Hugh
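
Editorial sketch of the NULL-vma hazard raised above, in user-space C.
The struct, the VM_HUGEPAGE value and the helper name vma_is_madvised()
are stand-ins invented for illustration, not kernel definitions, and this
is not necessarily the fix the patch author has in mind. It only shows one
way a caller without a vma (a tmpfs write() or fallocate(), say) could be
treated as "not madvised" rather than dereferencing NULL:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-ins only; these are NOT the kernel's definitions. */
#define VM_HUGEPAGE 0x1UL

struct vm_area_struct {
	unsigned long vm_flags;
};

/*
 * NULL-safe check for "was this mapping madvised for huge pages?".
 * A missing vma is reported as not madvised instead of being
 * dereferenced.
 */
bool vma_is_madvised(const struct vm_area_struct *vma)
{
	return vma != NULL && (vma->vm_flags & VM_HUGEPAGE) != 0;
}

/* Exercises the syscall path (no vma) and the fault path (with vma). */
void vma_is_madvised_selfcheck(void)
{
	struct vm_area_struct plain = { .vm_flags = 0 };
	struct vm_area_struct huge = { .vm_flags = VM_HUGEPAGE };

	assert(!vma_is_madvised(NULL));   /* e.g. write() on tmpfs */
	assert(!vma_is_madvised(&plain)); /* fault, no madvise */
	assert(vma_is_madvised(&huge));   /* fault, MADV_HUGEPAGE */
}
```

Whether "no vma" should really mean "not madvised" for a huge= mount is
exactly the policy question under discussion in this thread; the sketch
only removes the crash.
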
Rik van Riel Oct. 23, 2020, 3:40 a.m. UTC | #3
On Thu, 2020-10-22 at 19:54 -0700, Hugh Dickins wrote:
> On Thu, 22 Oct 2020, Rik van Riel wrote:
> 
> > The allocation flags of anonymous transparent huge pages can be controlled
> > through the files in /sys/kernel/mm/transparent_hugepage/defrag, which can
> > help the system from getting bogged down in the page reclaim and compaction
> > code when many THPs are getting allocated simultaneously.
> > 
> > However, the gfp_mask for shmem THP allocations were not limited by those
> > configuration settings, and some workloads ended up with all CPUs stuck
> > on the LRU lock in the page reclaim code, trying to allocate dozens of
> > THPs simultaneously.
> > 
> > This patch applies the same configurated limitation of THPs to shmem
> > hugepage allocations, to prevent that from happening.
> > 
> > This way a THP defrag setting of "never" or "defer+madvise" will result
> > in quick allocation failures without direct reclaim when no 2MB free
> > pages are available.
> > 
> > Signed-off-by: Rik van Riel <riel@surriel.com>
> 
> NAK in its present untested form: see below.

Oops. That issue is easy to fix, but indeed let's figure
out what the desired behavior is.

> I'm open to change here, particularly to Yu Xu's point (in other mail)
> about direct reclaim - we avoid that here in Google too: though it's
> not so much to avoid the direct reclaim, as to avoid the latencies of
> direct compaction, which __GFP_DIRECT_RECLAIM allows as a side-effect.
> 
> > @@ -1887,7 +1888,8 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> >       }
> >  
> >  alloc_huge:
> > -     page = shmem_alloc_and_acct_page(gfp, inode, index, true);
> > +     huge_gfp = alloc_hugepage_direct_gfpmask(vma);
> 
> Still looks nice: but what about the crash when vma is NULL?

That's a one line fix, but I suppose we should get the
discussion on what the code behavior should be out of
the way first :)

> Michal is right to remember pushback before, because tmpfs is a
> filesystem, and "huge=" is a mount option: in using a huge=always
> filesystem, the user has already declared a preference for huge pages.
> Whereas the original anon THP had to deduce that preference from sys
> tunables and vma madvice.

...

> But it's likely that they have accumulated some defrag wisdom, which
> tmpfs can take on board - but please accept that in using a huge mount,
> the preference for huge has already been expressed, so I don't expect
> anon THP alloc_hugepage_direct_gfpmask() choices will map one to one.

In my mind, the huge= mount options for tmpfs corresponded
to the "enabled" anon THP options, denoting a desired end
state, not necessarily how much we will stall allocations
to get there immediately.

The underlying allocation behavior has been changed repeatedly,
with changes to the direct reclaim code and the compaction
deferral code.

The shmem THP gfp_mask never tried really hard anyway,
with __GFP_NORETRY being the default, which matches what
is used for non-VM_HUGEPAGE anon VMAs.

Likewise, the direct reclaim done from the opportunistic
THP allocations done by the shmem code limited itself to
reclaiming 32 4kB pages per THP allocation.

In other words, mounting with huge=always has never behaved
the same as the more aggressive allocations done for
MADV_HUGEPAGE VMAs.

This patch would leave shmem THP allocations for non-MADV_HUGEPAGE
mapped files opportunistic like today, and make shmem THP
allocations for files mapped with MADV_HUGEPAGE more aggressive
than today.

However, I would like to know what people think the shmem
huge= mount options should do, and how things should behave
when memory gets low, before pushing in a patch just because
it makes the system run smoother "without changing current
behavior too much".

What do people want tmpfs THP allocations to do?
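
Editorial sketch of how the global defrag setting and the madvise state
could combine into reclaim behavior, as a user-space C model. The mode
names follow the values accepted by
/sys/kernel/mm/transparent_hugepage/defrag; the flag bit values, the enum
and the function name thp_gfp_mask() are made up for illustration, and the
per-mode choices are an assumption modeled on a reading of
alloc_hugepage_direct_gfpmask(), not a copy of it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values only; the kernel's real gfp bits differ. */
typedef unsigned int gfp_t;
#define GFP_DIRECT_RECLAIM 0x1u /* caller may stall in reclaim/compaction */
#define GFP_KSWAPD_RECLAIM 0x2u /* only wake background reclaim */
#define GFP_NORETRY        0x4u /* give up after one light attempt */

#define VM_HUGEPAGE 0x1UL

struct vm_area_struct {
	unsigned long vm_flags;
};

enum thp_defrag {
	DEFRAG_ALWAYS,
	DEFRAG_DEFER,
	DEFRAG_DEFER_MADVISE,
	DEFRAG_MADVISE,
	DEFRAG_NEVER,
};

/* Returns only the reclaim-related bits for one THP allocation attempt. */
gfp_t thp_gfp_mask(enum thp_defrag mode, const struct vm_area_struct *vma)
{
	/* No vma (syscall path) is treated as "not madvised" here. */
	const bool madvised = vma != NULL && (vma->vm_flags & VM_HUGEPAGE) != 0;

	switch (mode) {
	case DEFRAG_ALWAYS:
		/* Stall, but only try hard for madvised mappings. */
		return GFP_DIRECT_RECLAIM | (madvised ? 0 : GFP_NORETRY);
	case DEFRAG_DEFER:
		/* Wake background reclaim and fail quickly. */
		return GFP_KSWAPD_RECLAIM;
	case DEFRAG_DEFER_MADVISE:
		return madvised ? GFP_DIRECT_RECLAIM : GFP_KSWAPD_RECLAIM;
	case DEFRAG_MADVISE:
		return madvised ? GFP_DIRECT_RECLAIM : 0;
	case DEFRAG_NEVER:
		/* Never stall for a THP allocation. */
		return 0;
	}
	assert(0 && "unknown thp_defrag mode");
	return 0;
}
```

In this model thp_gfp_mask(DEFRAG_DEFER_MADVISE, NULL) carries no
direct-reclaim bit, which corresponds to the "quick allocation failure"
case in the commit message, while a VM_HUGEPAGE mapping under the same
setting is allowed to stall.
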
Michal Hocko Oct. 23, 2020, 8:49 a.m. UTC | #4
On Thu 22-10-20 23:40:53, Rik van Riel wrote:
> On Thu, 2020-10-22 at 19:54 -0700, Hugh Dickins wrote:
[...]
> > But it's likely that they have accumulated some defrag wisdom, which
> > tmpfs can take on board - but please accept that in using a huge mount,
> > the preference for huge has already been expressed, so I don't expect
> > anon THP alloc_hugepage_direct_gfpmask() choices will map one to one.
> 
> In my mind, the huge= mount options for tmpfs corresponded
> to the "enabled" anon THP options, denoting a desired end
> state, not necessarily how much we will stall allocations
> to get there immediately.

It is really unfortunate that our configuration space is so huge and
messy but we have to live with that now.

Anyway, I would tend to agree that in the absence of a per-mount defrag
configuration it makes sense to use the global one. Is anybody aware of
use cases where a mount-specific configuration would make sense?
Matthew Wilcox (Oracle) Oct. 23, 2020, 12:55 p.m. UTC | #5
On Thu, Oct 22, 2020 at 11:40:53PM -0400, Rik van Riel wrote:
> On Thu, 2020-10-22 at 19:54 -0700, Hugh Dickins wrote:
> > Michal is right to remember pushback before, because tmpfs is a
> > filesystem, and "huge=" is a mount option: in using a huge=always
> > filesystem, the user has already declared a preference for huge pages.
> > Whereas the original anon THP had to deduce that preference from sys
> > tunables and vma madvice.
> 
> ...
> 
> > But it's likely that they have accumulated some defrag wisdom, which
> > tmpfs can take on board - but please accept that in using a huge mount,
> > the preference for huge has already been expressed, so I don't expect
> > anon THP alloc_hugepage_direct_gfpmask() choices will map one to one.
> 
> In my mind, the huge= mount options for tmpfs corresponded
> to the "enabled" anon THP options, denoting a desired end
> state, not necessarily how much we will stall allocations
> to get there immediately.
> 
> The underlying allocation behavior has been changed repeatedly,
> with changes to the direct reclaim code and the compaction
> deferral code.
> 
> The shmem THP gfp_mask never tried really hard anyway,
> with __GFP_NORETRY being the default, which matches what
> is used for non-VM_HUGEPAGE anon VMAs.
> 
> Likewise, the direct reclaim done from the opportunistic
> THP allocations done by the shmem code limited itself to
> reclaiming 32 4kB pages per THP allocation.
> 
> In other words, mounting with huge=always has never behaved
> the same as the more aggressive allocations done for
> MADV_HUGEPAGE VMAs.
> 
> This patch would leave shmem THP allocations for non-MADV_HUGEPAGE
> mapped files opportunistic like today, and make shmem THP
> allocations for files mapped with MADV_HUGEPAGE more aggressive
> than today.
> 
> However, I would like to know what people think the shmem
> huge= mount options should do, and how things should behave
> when memory gets low, before pushing in a patch just because
> it makes the system run smoother "without changing current
> behavior too much".
> 
> What do people want tmpfs THP allocations to do?

I'm also interested in this for non-tmpfs THP allocations.  In my patchset,
THPs are no longer limited to being PMD sized, and allocating smaller pages
isn't such a tax on the VM.  So currently I'm doing:

        gfp_t gfp = readahead_gfp_mask(mapping);
...
        struct page *page = __page_cache_alloc_order(gfp, order);

which translates to:

        gfp_t gfp = mapping_gfp_mask(mapping) | __GFP_NORETRY | __GFP_NOWARN;
        gfp |= GFP_TRANSHUGE_LIGHT;
        gfp &= ~__GFP_DIRECT_RECLAIM;

Everything's very willing to fall back to order-0 pages, but I can see
that, eg, for a VM_HUGEPAGE vma, we should perhaps be less willing to
fall back to small pages.  I would prefer not to add a mount option to
every filesystem.  People will only get it wrong.
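
Editorial sketch of the mask composition described above, as a user-space
C model. The bit values and the helper name large_page_cache_gfp() are
invented for illustration, and GFP_COMP is only a stand-in for whatever
bits GFP_TRANSHUGE_LIGHT contributes. The point is the order of
operations: OR in the "fail quietly" flags, then clear direct reclaim so
the allocation falls back instead of stalling:

```c
#include <assert.h>

/* Illustrative bit values only; the kernel's real gfp bits differ. */
typedef unsigned int gfp_t;
#define GFP_DIRECT_RECLAIM 0x01u
#define GFP_KSWAPD_RECLAIM 0x02u
#define GFP_NORETRY        0x04u
#define GFP_NOWARN         0x08u
#define GFP_COMP           0x10u /* stand-in for the GFP_TRANSHUGE_LIGHT bits */

/* Hypothetical helper: derive a large-page mask from a mapping's mask. */
gfp_t large_page_cache_gfp(gfp_t mapping_gfp)
{
	/* Give up early and do not warn on failure. */
	gfp_t gfp = mapping_gfp | GFP_NORETRY | GFP_NOWARN;

	/* Add the huge-page specific bits. */
	gfp |= GFP_COMP;

	/* Never stall the caller in direct reclaim or compaction. */
	gfp &= ~GFP_DIRECT_RECLAIM;

	assert((gfp & GFP_DIRECT_RECLAIM) == 0);
	return gfp;
}
```

Background reclaim bits inherited from the mapping's own mask survive in
this model; only the direct-reclaim bit is stripped. A variant for
VM_HUGEPAGE mappings that is less willing to fall back would skip the
final clearing step.
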

Patch

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c603237e006c..0a5b164a26d9 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -614,6 +614,8 @@  bool gfp_pfmemalloc_allowed(gfp_t gfp_mask);
 extern void pm_restrict_gfp_mask(void);
 extern void pm_restore_gfp_mask(void);
 
+extern gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma);
+
 #ifdef CONFIG_PM_SLEEP
 extern bool pm_suspended_storage(void);
 #else
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9474dbc150ed..9b08ce5cc387 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -649,7 +649,7 @@  static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
+gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 537c137698f8..9710b9df91e9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1545,8 +1545,8 @@  static struct page *shmem_alloc_hugepage(gfp_t gfp,
 		return NULL;
 
 	shmem_pseudo_vma_init(&pvma, info, hindex);
-	page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
-			HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(), true);
+	page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
+			       true);
 	shmem_pseudo_vma_destroy(&pvma);
 	if (page)
 		prep_transhuge_page(page);
@@ -1802,6 +1802,7 @@  static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	struct page *page;
 	enum sgp_type sgp_huge = sgp;
 	pgoff_t hindex = index;
+	gfp_t huge_gfp;
 	int error;
 	int once = 0;
 	int alloced = 0;
@@ -1887,7 +1888,8 @@  static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	}
 
 alloc_huge:
-	page = shmem_alloc_and_acct_page(gfp, inode, index, true);
+	huge_gfp = alloc_hugepage_direct_gfpmask(vma);
+	page = shmem_alloc_and_acct_page(huge_gfp, inode, index, true);
 	if (IS_ERR(page)) {
 alloc_nohuge:
 		page = shmem_alloc_and_acct_page(gfp, inode,