diff mbox series

[2/2] mm: mmap: map MAP_STACK to VM_NOHUGEPAGE

Message ID 20231221065943.2803551-2-shy828301@gmail.com (mailing list archive)
State New
Headers show
Series [1/2] mm: mmap: no need to call khugepaged_enter_vma() for stack | expand

Commit Message

Yang Shi Dec. 21, 2023, 6:59 a.m. UTC
From: Yang Shi <yang@os.amperecomputing.com>

The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
boundaries") incured regression for stress-ng pthread benchmark [1].
It is because THP get allocated to pthread's stack area much more possible
than before.  Pthread's stack area is allocated by mmap without VM_GROWSDOWN
or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not.

The MAP_STACK flag is used to mark the stack area, but it is a no-op on
Linux.  Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating
THP for such stack area.

With this change the stack area looks like:

fffd18e10000-fffd19610000 rw-p 00000000 00:00 0
Size:               8192 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                  12 kB
Pss:                  12 kB
Pss_Dirty:            12 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        12 kB
Referenced:           12 kB
Anonymous:            12 kB
KSM:                   0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:           0
VmFlags: rd wr mr mw me ac nh

The "nh" flag is set.

[1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com/

Reported-by: kernel test robot <oliver.sang@intel.com>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christopher Lameter <cl@linux.com>
Cc: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
---
 include/linux/mman.h | 1 +
 1 file changed, 1 insertion(+)

Comments

Yin, Fengwei Jan. 10, 2024, 1:36 a.m. UTC | #1
On 2023/12/21 14:59, Yang Shi wrote:
> From: Yang Shi <yang@os.amperecomputing.com>
> 
> The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> boundaries") incured regression for stress-ng pthread benchmark [1].
> It is because THP get allocated to pthread's stack area much more possible
> than before.  Pthread's stack area is allocated by mmap without VM_GROWSDOWN
> or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not.
> 
> The MAP_STACK flag is used to mark the stack area, but it is a no-op on
> Linux.  Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating
> THP for such stack area.
> 
> With this change the stack area looks like:
> 
> fffd18e10000-fffd19610000 rw-p 00000000 00:00 0
> Size:               8192 kB
> KernelPageSize:        4 kB
> MMUPageSize:           4 kB
> Rss:                  12 kB
> Pss:                  12 kB
> Pss_Dirty:            12 kB
> Shared_Clean:          0 kB
> Shared_Dirty:          0 kB
> Private_Clean:         0 kB
> Private_Dirty:        12 kB
> Referenced:           12 kB
> Anonymous:            12 kB
> KSM:                   0 kB
> LazyFree:              0 kB
> AnonHugePages:         0 kB
> ShmemPmdMapped:        0 kB
> FilePmdMapped:         0 kB
> Shared_Hugetlb:        0 kB
> Private_Hugetlb:       0 kB
> Swap:                  0 kB
> SwapPss:               0 kB
> Locked:                0 kB
> THPeligible:           0
> VmFlags: rd wr mr mw me ac nh
> 
> The "nh" flag is set.
> 
> [1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com/
> 
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Tested-by: Oliver Sang <oliver.sang@intel.com>
> Cc: Yin Fengwei <fengwei.yin@intel.com>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christopher Lameter <cl@linux.com>
> Cc: Huang, Ying <ying.huang@intel.com>
> Signed-off-by: Yang Shi <yang@os.amperecomputing.com>

Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>

> ---
>   include/linux/mman.h | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/mman.h b/include/linux/mman.h
> index 40d94411d492..dc7048824be8 100644
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
>   	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>   	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>   	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
> +	       _calc_vm_trans(flags, MAP_STACK,	     VM_NOHUGEPAGE) |
>   	       arch_calc_vm_flag_bits(flags);
>   }
>
Zach O'Keefe Jan. 16, 2024, 7:22 p.m. UTC | #2
Thanks Yang,

Should this be marked for stable? Given how easily it is for pthreads
to allocate hugepages w/o this change, it can easily cause memory
bloat on larger systems and/or users with high thread counts. I don't
think that will be welcomed, and seems odd that just 6.7 should suffer
this.

Thanks,
Zach

On Tue, Jan 9, 2024 at 5:36 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 2023/12/21 14:59, Yang Shi wrote:
> > From: Yang Shi <yang@os.amperecomputing.com>
> >
> > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> > boundaries") incured regression for stress-ng pthread benchmark [1].
> > It is because THP get allocated to pthread's stack area much more possible
> > than before.  Pthread's stack area is allocated by mmap without VM_GROWSDOWN
> > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not.
> >
> > The MAP_STACK flag is used to mark the stack area, but it is a no-op on
> > Linux.  Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating
> > THP for such stack area.
> >
> > With this change the stack area looks like:
> >
> > fffd18e10000-fffd19610000 rw-p 00000000 00:00 0
> > Size:               8192 kB
> > KernelPageSize:        4 kB
> > MMUPageSize:           4 kB
> > Rss:                  12 kB
> > Pss:                  12 kB
> > Pss_Dirty:            12 kB
> > Shared_Clean:          0 kB
> > Shared_Dirty:          0 kB
> > Private_Clean:         0 kB
> > Private_Dirty:        12 kB
> > Referenced:           12 kB
> > Anonymous:            12 kB
> > KSM:                   0 kB
> > LazyFree:              0 kB
> > AnonHugePages:         0 kB
> > ShmemPmdMapped:        0 kB
> > FilePmdMapped:         0 kB
> > Shared_Hugetlb:        0 kB
> > Private_Hugetlb:       0 kB
> > Swap:                  0 kB
> > SwapPss:               0 kB
> > Locked:                0 kB
> > THPeligible:           0
> > VmFlags: rd wr mr mw me ac nh
> >
> > The "nh" flag is set.
> >
> > [1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com/
> >
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Tested-by: Oliver Sang <oliver.sang@intel.com>
> > Cc: Yin Fengwei <fengwei.yin@intel.com>
> > Cc: Rik van Riel <riel@surriel.com>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Christopher Lameter <cl@linux.com>
> > Cc: Huang, Ying <ying.huang@intel.com>
> > Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
>
> Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
>
> > ---
> >   include/linux/mman.h | 1 +
> >   1 file changed, 1 insertion(+)
> >
> > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > index 40d94411d492..dc7048824be8 100644
> > --- a/include/linux/mman.h
> > +++ b/include/linux/mman.h
> > @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
> >       return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
> >              _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> >              _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
> > +            _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
> >              arch_calc_vm_flag_bits(flags);
> >   }
> >
>
Yang Shi Jan. 16, 2024, 8:57 p.m. UTC | #3
On Tue, Jan 16, 2024 at 11:22 AM Zach O'Keefe <zokeefe@google.com> wrote:
>
> Thanks Yang,
>
> Should this be marked for stable? Given how easily it is for pthreads
> to allocate hugepages w/o this change, it can easily cause memory
> bloat on larger systems and/or users with high thread counts. I don't
> think that will be welcomed, and seems odd that just 6.7 should suffer
> this.

Thanks for the suggestion, fine to me.

>
> Thanks,
> Zach
>
> On Tue, Jan 9, 2024 at 5:36 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >
> >
> >
> > On 2023/12/21 14:59, Yang Shi wrote:
> > > From: Yang Shi <yang@os.amperecomputing.com>
> > >
> > > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> > > boundaries") incured regression for stress-ng pthread benchmark [1].
> > > It is because THP get allocated to pthread's stack area much more possible
> > > than before.  Pthread's stack area is allocated by mmap without VM_GROWSDOWN
> > > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not.
> > >
> > > The MAP_STACK flag is used to mark the stack area, but it is a no-op on
> > > Linux.  Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating
> > > THP for such stack area.
> > >
> > > With this change the stack area looks like:
> > >
> > > fffd18e10000-fffd19610000 rw-p 00000000 00:00 0
> > > Size:               8192 kB
> > > KernelPageSize:        4 kB
> > > MMUPageSize:           4 kB
> > > Rss:                  12 kB
> > > Pss:                  12 kB
> > > Pss_Dirty:            12 kB
> > > Shared_Clean:          0 kB
> > > Shared_Dirty:          0 kB
> > > Private_Clean:         0 kB
> > > Private_Dirty:        12 kB
> > > Referenced:           12 kB
> > > Anonymous:            12 kB
> > > KSM:                   0 kB
> > > LazyFree:              0 kB
> > > AnonHugePages:         0 kB
> > > ShmemPmdMapped:        0 kB
> > > FilePmdMapped:         0 kB
> > > Shared_Hugetlb:        0 kB
> > > Private_Hugetlb:       0 kB
> > > Swap:                  0 kB
> > > SwapPss:               0 kB
> > > Locked:                0 kB
> > > THPeligible:           0
> > > VmFlags: rd wr mr mw me ac nh
> > >
> > > The "nh" flag is set.
> > >
> > > [1] https://lore.kernel.org/linux-mm/202312192310.56367035-oliver.sang@intel.com/
> > >
> > > Reported-by: kernel test robot <oliver.sang@intel.com>
> > > Tested-by: Oliver Sang <oliver.sang@intel.com>
> > > Cc: Yin Fengwei <fengwei.yin@intel.com>
> > > Cc: Rik van Riel <riel@surriel.com>
> > > Cc: Matthew Wilcox <willy@infradead.org>
> > > Cc: Christopher Lameter <cl@linux.com>
> > > Cc: Huang, Ying <ying.huang@intel.com>
> > > Signed-off-by: Yang Shi <yang@os.amperecomputing.com>
> >
> > Reviewed-by: Yin Fengwei <fengwei.yin@intel.com>
> >
> > > ---
> > >   include/linux/mman.h | 1 +
> > >   1 file changed, 1 insertion(+)
> > >
> > > diff --git a/include/linux/mman.h b/include/linux/mman.h
> > > index 40d94411d492..dc7048824be8 100644
> > > --- a/include/linux/mman.h
> > > +++ b/include/linux/mman.h
> > > @@ -156,6 +156,7 @@ calc_vm_flag_bits(unsigned long flags)
> > >       return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
> > >              _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
> > >              _calc_vm_trans(flags, MAP_SYNC,       VM_SYNC      ) |
> > > +            _calc_vm_trans(flags, MAP_STACK,      VM_NOHUGEPAGE) |
> > >              arch_calc_vm_flag_bits(flags);
> > >   }
> > >
> >
Andrew Morton Jan. 16, 2024, 9:31 p.m. UTC | #4
On Tue, 16 Jan 2024 12:57:41 -0800 Yang Shi <shy828301@gmail.com> wrote:

> > Should this be marked for stable? Given how easily it is for pthreads
> > to allocate hugepages w/o this change, it can easily cause memory
> > bloat on larger systems and/or users with high thread counts. I don't
> > think that will be welcomed, and seems odd that just 6.7 should suffer
> > this.
> 
> Thanks for the suggestion, fine to me.
>

Thanks, added, along with

Fixes: efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries")
Florian Weimer Jan. 31, 2024, 7:53 a.m. UTC | #5
* Yang Shi:

> From: Yang Shi <yang@os.amperecomputing.com>
>
> The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> boundaries") incured regression for stress-ng pthread benchmark [1].
> It is because THP get allocated to pthread's stack area much more possible
> than before.  Pthread's stack area is allocated by mmap without VM_GROWSDOWN
> or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not.
>
> The MAP_STACK flag is used to mark the stack area, but it is a no-op on
> Linux.  Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating
> THP for such stack area.

Doesn't this introduce a regression in the other direction, where
workloads expect to use a hugepage TLB entry for the stack?

It's seems an odd approach to fixing the stress-ng regression.  Isn't it
very much coding to the benchmark?

Thanks,
Florian
Yang Shi Jan. 31, 2024, 6:46 p.m. UTC | #6
On Tue, Jan 30, 2024 at 11:53 PM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Yang Shi:
>
> > From: Yang Shi <yang@os.amperecomputing.com>
> >
> > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> > boundaries") incured regression for stress-ng pthread benchmark [1].
> > It is because THP get allocated to pthread's stack area much more possible
> > than before.  Pthread's stack area is allocated by mmap without VM_GROWSDOWN
> > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not.
> >
> > The MAP_STACK flag is used to mark the stack area, but it is a no-op on
> > Linux.  Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating
> > THP for such stack area.
>
> Doesn't this introduce a regression in the other direction, where
> workloads expect to use a hugepage TLB entry for the stack?

Maybe, it is theoretically possible. But AFAICT, the real life
workloads performance usually gets hurt if THP is used for stack.
Willy has an example:

https://lore.kernel.org/linux-mm/ZYPDwCcAjX+r+g6s@casper.infradead.org/#t

And avoiding THP on stack is not new, VM_GROWSDOWN | VM_GROWSUP areas
have been applied before, this patch just extends this to MAP_STACK.

>
> It's seems an odd approach to fixing the stress-ng regression.  Isn't it
> very much coding to the benchmark?
>
> Thanks,
> Florian
>
Florian Weimer Feb. 1, 2024, 3:34 p.m. UTC | #7
* Yang Shi:

> On Tue, Jan 30, 2024 at 11:53 PM Florian Weimer <fweimer@redhat.com> wrote:
>>
>> * Yang Shi:
>>
>> > From: Yang Shi <yang@os.amperecomputing.com>
>> >
>> > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
>> > boundaries") incured regression for stress-ng pthread benchmark [1].
>> > It is because THP get allocated to pthread's stack area much more possible
>> > than before.  Pthread's stack area is allocated by mmap without VM_GROWSDOWN
>> > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not.
>> >
>> > The MAP_STACK flag is used to mark the stack area, but it is a no-op on
>> > Linux.  Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating
>> > THP for such stack area.
>>
>> Doesn't this introduce a regression in the other direction, where
>> workloads expect to use a hugepage TLB entry for the stack?
>
> Maybe, it is theoretically possible. But AFAICT, the real life
> workloads performance usually gets hurt if THP is used for stack.
> Willy has an example:
>
> https://lore.kernel.org/linux-mm/ZYPDwCcAjX+r+g6s@casper.infradead.org/#t
>
> And avoiding THP on stack is not new, VM_GROWSDOWN | VM_GROWSUP areas
> have been applied before, this patch just extends this to MAP_STACK.

If it's *always* beneficial then we should help it along in glibc as
well.  We've started to offer a tunable in response to this observation
(also paper over in OpenJDK):

  Make thread stacks not use huge pages
  <https://bugs.openjdk.org/browse/JDK-8303215>

But this is specifically about RSS usage, and not directly about
reducing TLB misses etc.

Thanks,
Florian
Yang Shi Feb. 1, 2024, 7 p.m. UTC | #8
On Thu, Feb 1, 2024 at 7:34 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Yang Shi:
>
> > On Tue, Jan 30, 2024 at 11:53 PM Florian Weimer <fweimer@redhat.com> wrote:
> >>
> >> * Yang Shi:
> >>
> >> > From: Yang Shi <yang@os.amperecomputing.com>
> >> >
> >> > The commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP
> >> > boundaries") incured regression for stress-ng pthread benchmark [1].
> >> > It is because THP get allocated to pthread's stack area much more possible
> >> > than before.  Pthread's stack area is allocated by mmap without VM_GROWSDOWN
> >> > or VM_GROWSUP flag, so kernel can't tell whether it is a stack area or not.
> >> >
> >> > The MAP_STACK flag is used to mark the stack area, but it is a no-op on
> >> > Linux.  Mapping MAP_STACK to VM_NOHUGEPAGE to prevent from allocating
> >> > THP for such stack area.
> >>
> >> Doesn't this introduce a regression in the other direction, where
> >> workloads expect to use a hugepage TLB entry for the stack?
> >
> > Maybe, it is theoretically possible. But AFAICT, the real life
> > workloads performance usually gets hurt if THP is used for stack.
> > Willy has an example:
> >
> > https://lore.kernel.org/linux-mm/ZYPDwCcAjX+r+g6s@casper.infradead.org/#t
> >
> > And avoiding THP on stack is not new, VM_GROWSDOWN | VM_GROWSUP areas
> > have been applied before, this patch just extends this to MAP_STACK.
>
> If it's *always* beneficial then we should help it along in glibc as
> well.  We've started to offer a tunable in response to this observation
> (also paper over in OpenJDK):
>
>   Make thread stacks not use huge pages
>   <https://bugs.openjdk.org/browse/JDK-8303215>
>
> But this is specifically about RSS usage, and not directly about
> reducing TLB misses etc.

Thanks for the data point. Out of curiosity, what mmap flags are used
by JVM to indicate a stack? MAP_STACK? If so it should get
VM_NOHUGEPAGE due to this patch (of course, on older kernel
MADV_NOHUGEPAGE must be called by JVM).

Letting others, for example, glibc, call MADV_NOHUGEPAGE explicitly on
stack area is fine too, but it may take some time to get there...

>
> Thanks,
> Florian
>
diff mbox series

Patch

diff --git a/include/linux/mman.h b/include/linux/mman.h
index 40d94411d492..dc7048824be8 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -156,6 +156,7 @@  calc_vm_flag_bits(unsigned long flags)
 	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
 	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
 	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
+	       _calc_vm_trans(flags, MAP_STACK,	     VM_NOHUGEPAGE) |
 	       arch_calc_vm_flag_bits(flags);
 }