[RFC,00/16] 1GB THP support on x86_64

Message ID 20200902180628.4052244-1-zi.yan@sent.com (mailing list archive)


Zi Yan Sept. 2, 2020, 6:06 p.m. UTC
From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset adds support for 1GB THP on x86_64. It is on top of
v5.9-rc2-mmots-2020-08-25-21-13.

Compared to hugetlb, 1GB THP is more flexible: it reduces translation overhead
and increases the performance of applications with large memory footprints
without requiring application changes.

Design
=======

The 1GB THP implementation looks similar to the existing THP code, except for
some new designs needed for the additional page table level.

1. Page table deposit and withdraw using a new pagechain data structure:
   instead of one PTE page table page, a 1GB THP requires 513 page table pages
   (one PMD page table page and 512 PTE page table pages) to be deposited
   at page allocation time, so that the page can be split later. Currently,
   the page table deposit uses ->lru, so only one page can be deposited.
   A new pagechain data structure is added to enable multi-page deposit.

2. Triple-mapped 1GB THP: a 1GB THP can be mapped by a combination of PUD, PMD,
   and PTE entries. Mixing PUD and PTE mappings can be achieved with the
   existing PageDoubleMap mechanism. To add PMD mappings, PMDPageInPUD and
   sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
   page in a 1GB THP, and sub_compound_mapcount counts PMD mappings using
   page[N*512 + 3].compound_mapcount.

3. Using CMA allocaiton for 1GB THP: instead of bump MAX_ORDER, it is more sane
   to use something less intrusive. So all 1GB THPs are allocated from reserved
   CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
   THP is cleared as the resulting pages can be freed via normal page free path.
   We can fall back to alloc_contig_pages for 1GB THP if necessary.


Patch Organization
=======

Patch 01 adds the new pagechain data structure.

Patches 02 to 13 add 1GB THP support in various places.

Patch 14 tries to use alloc_contig_pages for 1GB THP allocation.

Patch 15 moves the hugetlb_cma reservation to cma.c and renames it to hugepage_cma.

Patch 16 uses the hugepage_cma reservation for 1GB THP allocation.


Any suggestions and comments are welcome.


Zi Yan (16):
  mm: add pagechain container for storing multiple pages.
  mm: thp: 1GB anonymous page implementation.
  mm: proc: add 1GB THP kpageflag.
  mm: thp: 1GB THP copy on write implementation.
  mm: thp: handling 1GB THP reference bit.
  mm: thp: add 1GB THP split_huge_pud_page() function.
  mm: stats: make smap stats understand PUD THPs.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: 1GB THP support in try_to_unmap().
  mm: thp: split 1GB THPs at page reclaim.
  mm: thp: 1GB THP follow_p*d_page() support.
  mm: thp: 1GB THP pagemap support.
  mm: thp: add a knob to enable/disable 1GB THPs.
  mm: page_alloc: >=MAX_ORDER pages allocation and deallocation.
  hugetlb: cma: move cma reserve function to cma.c.
  mm: thp: use cma reservation for pud thp allocation.

 .../admin-guide/kernel-parameters.txt         |   2 +-
 arch/arm64/mm/hugetlbpage.c                   |   2 +-
 arch/powerpc/mm/hugetlbpage.c                 |   2 +-
 arch/x86/include/asm/pgalloc.h                |  68 ++
 arch/x86/include/asm/pgtable.h                |  26 +
 arch/x86/kernel/setup.c                       |   8 +-
 arch/x86/mm/pgtable.c                         |  38 +
 drivers/base/node.c                           |   3 +
 fs/proc/meminfo.c                             |   2 +
 fs/proc/page.c                                |   2 +
 fs/proc/task_mmu.c                            | 122 ++-
 include/linux/cma.h                           |  18 +
 include/linux/huge_mm.h                       |  84 +-
 include/linux/hugetlb.h                       |  12 -
 include/linux/memcontrol.h                    |   5 +
 include/linux/mm.h                            |  29 +-
 include/linux/mm_types.h                      |   1 +
 include/linux/mmu_notifier.h                  |  13 +
 include/linux/mmzone.h                        |   1 +
 include/linux/page-flags.h                    |  47 +
 include/linux/pagechain.h                     |  73 ++
 include/linux/pgtable.h                       |  34 +
 include/linux/rmap.h                          |  10 +-
 include/linux/swap.h                          |   2 +
 include/linux/vm_event_item.h                 |   7 +
 include/uapi/linux/kernel-page-flags.h        |   2 +
 kernel/events/uprobes.c                       |   4 +-
 kernel/fork.c                                 |   5 +
 mm/cma.c                                      | 119 +++
 mm/gup.c                                      |  60 +-
 mm/huge_memory.c                              | 939 +++++++++++++++++-
 mm/hugetlb.c                                  | 114 +--
 mm/internal.h                                 |   2 +
 mm/khugepaged.c                               |   6 +-
 mm/ksm.c                                      |   4 +-
 mm/memcontrol.c                               |  13 +
 mm/memory.c                                   |  51 +-
 mm/mempolicy.c                                |  21 +-
 mm/migrate.c                                  |  12 +-
 mm/page_alloc.c                               |  57 +-
 mm/page_vma_mapped.c                          | 129 ++-
 mm/pgtable-generic.c                          |  56 ++
 mm/rmap.c                                     | 289 ++++--
 mm/swap.c                                     |  31 +
 mm/swap_slots.c                               |   2 +
 mm/swapfile.c                                 |   8 +-
 mm/userfaultfd.c                              |   2 +-
 mm/util.c                                     |  16 +-
 mm/vmscan.c                                   |  58 +-
 mm/vmstat.c                                   |   8 +
 50 files changed, 2270 insertions(+), 349 deletions(-)
 create mode 100644 include/linux/pagechain.h

--
2.28.0

Comments

Jason Gunthorpe Sept. 2, 2020, 6:40 p.m. UTC | #1
On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> [...]
>
> Zi Yan (16):
>   mm: add pagechain container for storing multiple pages.
>   mm: thp: 1GB anonymous page implementation.
>   mm: proc: add 1GB THP kpageflag.
>   mm: thp: 1GB THP copy on write implementation.
>   mm: thp: handling 1GB THP reference bit.
>   mm: thp: add 1GB THP split_huge_pud_page() function.
>   mm: stats: make smap stats understand PUD THPs.
>   mm: page_vma_walk: teach it about PMD-mapped PUD THP.
>   mm: thp: 1GB THP support in try_to_unmap().
>   mm: thp: split 1GB THPs at page reclaim.
>   mm: thp: 1GB THP follow_p*d_page() support.
>   mm: support 1GB THP pagemap support.
>   mm: thp: add a knob to enable/disable 1GB THPs.
>   mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
>   hugetlb: cma: move cma reserve function to cma.c.
>   mm: thp: use cma reservation for pud thp allocation.

Surprised this doesn't touch mm/pagewalk.c ?

Jason
Zi Yan Sept. 2, 2020, 6:45 p.m. UTC | #2
On 2 Sep 2020, at 14:40, Jason Gunthorpe wrote:

> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>> [...]
>
> Surprised this doesn't touch mm/pagewalk.c ?

1GB PUD page support is already present for DAX purposes, so the code is there
in mm/pagewalk.c. I only needed to supply ops->pud_entry when using
the functions in mm/pagewalk.c. :)

—
Best Regards,
Yan Zi
Jason Gunthorpe Sept. 2, 2020, 6:48 p.m. UTC | #3
On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:

> > Surprised this doesn't touch mm/pagewalk.c ?
> 
> 1GB PUD page support is present for DAX purpose, so the code is there
> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
> the functions in mm/pagewalk.c. :)

Yes, but doesn't this change what is possible under the mmap_sem
without the page table locks?

i.e. I would expect something like pmd_trans_unstable() to be required
as well for lockless walkers. (And I don't think the pmd code is 100%
right either.)

Jason
Zi Yan Sept. 2, 2020, 7:05 p.m. UTC | #4
On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:

> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
>
>>> Surprised this doesn't touch mm/pagewalk.c ?
>>
>> 1GB PUD page support is present for DAX purpose, so the code is there
>> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
>> the functions in mm/pagewalk.c. :)
>
> Yes, but doesn't this change what is possible under the mmap_sem
> without the page table locks?
>
> ie I would expect some thing like pmd_trans_unstable() to be required
> as well for lockless walkers. (and I don't think the pmd code is 100%
> right either)
>

Right. I missed that. Thanks for pointing it out.
The code would look like this, right?

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..4fe6ce4a92eb 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -152,10 +152,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
                    !(ops->pmd_entry || ops->pte_entry))
                        continue;

-               if (walk->vma)
+               if (walk->vma) {
                        split_huge_pud(walk->vma, pud, addr);
-               if (pud_none(*pud))
-                       goto again;
+                       if (pud_trans_unstable(pud))
+                               goto again;
+               }

                err = walk_pmd_range(pud, addr, next, walk);
                if (err)


—
Best Regards,
Yan Zi
Jason Gunthorpe Sept. 2, 2020, 7:57 p.m. UTC | #5
On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
> 
> > On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
> >
> >>> Surprised this doesn't touch mm/pagewalk.c ?
> >>
> >> 1GB PUD page support is present for DAX purpose, so the code is there
> >> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
> >> the functions in mm/pagewalk.c. :)
> >
> > Yes, but doesn't this change what is possible under the mmap_sem
> > without the page table locks?
> >
> > ie I would expect some thing like pmd_trans_unstable() to be required
> > as well for lockless walkers. (and I don't think the pmd code is 100%
> > right either)
> >
> 
> Right. I missed that. Thanks for pointing it out.
> The code like this, right?

Technically all those *pud's are racy too; the design here with the
_unstable function call always seemed weird. I strongly suspect it
should mirror how get_user_pages_fast() works for lockless walking.

Jason
Zi Yan Sept. 2, 2020, 8:29 p.m. UTC | #6
On 2 Sep 2020, at 15:57, Jason Gunthorpe wrote:

> On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
>> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
>>
>>> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
>>>
>>>>> Surprised this doesn't touch mm/pagewalk.c ?
>>>>
>>>> 1GB PUD page support is present for DAX purpose, so the code is there
>>>> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
>>>> the functions in mm/pagewalk.c. :)
>>>
>>> Yes, but doesn't this change what is possible under the mmap_sem
>>> without the page table locks?
>>>
>>> ie I would expect some thing like pmd_trans_unstable() to be required
>>> as well for lockless walkers. (and I don't think the pmd code is 100%
>>> right either)
>>>
>>
>> Right. I missed that. Thanks for pointing it out.
>> The code like this, right?
>
> Technically all those *pud's are racy too, the design here with the
> _unstable function call always seemed weird. I strongly suspect it
> should mirror how get_user_pages_fast works for lockless walking

You mean READ_ONCE on the page table entry pointer first, then use that value
for the rest of the loop? I am not very familiar with this racy-check
part of the code and would be happy to hear more about it.


—
Best Regards,
Yan Zi
Michal Hocko Sept. 3, 2020, 7:32 a.m. UTC | #7
On Wed 02-09-20 14:06:12, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset adds support for 1GB THP on x86_64. It is on top of
> v5.9-rc2-mmots-2020-08-25-21-13.
> 
> 1GB THP is more flexible for reducing translation overhead and increasing the
> performance of applications with large memory footprint without application
> changes compared to hugetlb.

Please be more specific about use cases. These had better be strong ones,
because the THP code is already complex enough to be building on top of it
solely for easing generic TLB pressure.

> Design
> =======
> 
> 1GB THP implementation looks similar to exiting THP code except some new designs
> for the additional page table level.
> 
> 1. Page table deposit and withdraw using a new pagechain data structure:
>    instead of one PTE page table page, 1GB THP requires 513 page table pages
>    (one PMD page table page and 512 PTE page table pages) to be deposited
>    at the page allocaiton time, so that we can split the page later. Currently,
>    the page table deposit is using ->lru, thus only one page can be deposited.
>    A new pagechain data structure is added to enable multi-page deposit.
> 
> 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
>    and PTE entries. Mixing PUD an PTE mapping can be achieved with existing
>    PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
>    sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
>    page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
>    page[N*512 + 3].compound_mapcount.
> 
> 3. Using CMA allocaiton for 1GB THP: instead of bump MAX_ORDER, it is more sane
>    to use something less intrusive. So all 1GB THPs are allocated from reserved
>    CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
>    THP is cleared as the resulting pages can be freed via normal page free path.
>    We can fall back to alloc_contig_pages for 1GB THP if necessary.

Do those pages get instantiated during the page fault or only via
khugepaged? This is an important design detail, because we then have to
think carefully about how automatic we want this to be. Memory
overhead can already be quite large with 2MB THPs. Also, what about the
allocation overhead? Do you have any numbers?

Maybe all these details are described in the patchset, but the cover
letter should contain all that information. It doesn't make much sense
to dig into the details of a patchset this large without having an idea of how
feasible this is.

Thanks.
 
> [...]
Kirill A. Shutemov Sept. 3, 2020, 2:23 p.m. UTC | #8
On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset adds support for 1GB THP on x86_64. It is on top of
> v5.9-rc2-mmots-2020-08-25-21-13.
> 
> 1GB THP is more flexible for reducing translation overhead and increasing the
> performance of applications with large memory footprint without application
> changes compared to hugetlb.

This statement needs a lot of justification. I don't see 1GB THP as viable
for any workload. Opportunistic 1GB allocation is a very questionable
strategy.

> Design
> =======
> 
> 1GB THP implementation looks similar to exiting THP code except some new designs
> for the additional page table level.
> 
> 1. Page table deposit and withdraw using a new pagechain data structure:
>    instead of one PTE page table page, 1GB THP requires 513 page table pages
>    (one PMD page table page and 512 PTE page table pages) to be deposited
>    at the page allocaiton time, so that we can split the page later. Currently,
>    the page table deposit is using ->lru, thus only one page can be deposited.

False. The current code can deposit an arbitrary number of page tables.

What may be a problem for you is that these page tables are tied to the
struct page of the PMD page table.

>    A new pagechain data structure is added to enable multi-page deposit.
> 
> 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
>    and PTE entries. Mixing PUD an PTE mapping can be achieved with existing
>    PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
>    sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
>    page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
>    page[N*512 + 3].compound_mapcount.

I had a hard time reasoning about DoubleMap vs. rmap. Good for you if you
get it right.

> 3. Using CMA allocaiton for 1GB THP: instead of bump MAX_ORDER, it is more sane
>    to use something less intrusive. So all 1GB THPs are allocated from reserved
>    CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
>    THP is cleared as the resulting pages can be freed via normal page free path.
>    We can fall back to alloc_contig_pages for 1GB THP if necessary.
>
Roman Gushchin Sept. 3, 2020, 4:25 p.m. UTC | #9
On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > From: Zi Yan <ziy@nvidia.com>
> > 
> > Hi all,
> > 
> > This patchset adds support for 1GB THP on x86_64. It is on top of
> > v5.9-rc2-mmots-2020-08-25-21-13.
> > 
> > 1GB THP is more flexible for reducing translation overhead and increasing the
> > performance of applications with large memory footprint without application
> > changes compared to hugetlb.
> 
> Please be more specific about usecases. This better have some strong
> ones because THP code is complex enough already to add on top solely
> based on a generic TLB pressure easing.

Hello, Michal!

We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
performance wins on some workloads.

Historically we allocated gigantic pages at boot time, but we recently moved
to a CMA-based dynamic approach. Still, the hugetlbfs interface requires more
management than we would like. 1 GB THP seems to be a better alternative, so I
definitely see it as a very useful feature.

Given the cost of an allocation, I'm slightly skeptical about an automatic
heuristics-based approach, but if an application can explicitly mark target areas
with madvise(), I don't see why it wouldn't work.

In our case we'd like to have a reliable way to get 1 GB THPs at some point
(usually at the start of an application), and transparently destroy them on
the application exit.

Once the patchset is in relatively good shape, I'll be happy to give
it a test in our environment and share the results.

Thanks!

> 
> [...]
> >  mm/swap.c                                     |  31 +
> >  mm/swap_slots.c                               |   2 +
> >  mm/swapfile.c                                 |   8 +-
> >  mm/userfaultfd.c                              |   2 +-
> >  mm/util.c                                     |  16 +-
> >  mm/vmscan.c                                   |  58 +-
> >  mm/vmstat.c                                   |   8 +
> >  50 files changed, 2270 insertions(+), 349 deletions(-)
> >  create mode 100644 include/linux/pagechain.h
> > 
> > --
> > 2.28.0
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs
Roman Gushchin Sept. 3, 2020, 4:30 p.m. UTC | #10
On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> > From: Zi Yan <ziy@nvidia.com>
> > 
> > Hi all,
> > 
> > This patchset adds support for 1GB THP on x86_64. It is on top of
> > v5.9-rc2-mmots-2020-08-25-21-13.
> > 
> > 1GB THP is more flexible for reducing translation overhead and increasing the
> > performance of applications with large memory footprint without application
> > changes compared to hugetlb.
> 
> This statement needs a lot of justification. I don't see 1GB THP as viable
> for any workload. Opportunistic 1GB allocation is very questionable
> strategy.

Hello, Kirill!

I share your skepticism about opportunistic 1 GB allocations; however, it might be useful
if backed by madvise() annotations from the userspace application. In this case,
1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
interface.

Thanks!

> 
> > Design
> > =======
> > 
> > 1GB THP implementation looks similar to existing THP code except some new designs
> > for the additional page table level.
> > 
> > 1. Page table deposit and withdraw using a new pagechain data structure:
> >    instead of one PTE page table page, 1GB THP requires 513 page table pages
> >    (one PMD page table page and 512 PTE page table pages) to be deposited
> >    at the page allocation time, so that we can split the page later. Currently,
> >    the page table deposit is using ->lru, thus only one page can be deposited.
> 
> False. Current code can deposit arbitrary number of page tables.
> 
> What can be problem to you is that these page tables tied to struct page
> of PMD page table.
> 
> >    A new pagechain data structure is added to enable multi-page deposit.
> > 
> > 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD,
> >    and PTE entries. Mixing PUD and PTE mappings can be achieved with the existing
> >    PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and
> >    sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
> >    page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using
> >    page[N*512 + 3].compound_mapcount.
> 
> I had hard time reasoning about DoubleMap vs. rmap. Good for you if you
> get it right.
> 
> > 3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is more sane
> >    to use something less intrusive. So all 1GB THPs are allocated from reserved
> >    CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
> >    THP is cleared as the resulting pages can be freed via normal page free path.
> >    We can fall back to alloc_contig_pages for 1GB THP if necessary.
> > 
> 
> -- 
>  Kirill A. Shutemov
Jason Gunthorpe Sept. 3, 2020, 4:40 p.m. UTC | #11
On Wed, Sep 02, 2020 at 04:29:46PM -0400, Zi Yan wrote:
> On 2 Sep 2020, at 15:57, Jason Gunthorpe wrote:
> 
> > On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
> >> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
> >>
> >>> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
> >>>
> >>>>> Surprised this doesn't touch mm/pagewalk.c ?
> >>>>
> >>>> 1GB PUD page support is present for DAX purpose, so the code is there
> >>>> in mm/pagewalk.c already. I only needed to supply ops->pud_entry when using
> >>>> the functions in mm/pagewalk.c. :)
> >>>
> >>> Yes, but doesn't this change what is possible under the mmap_sem
> >>> without the page table locks?
> >>>
> >>> ie I would expect some thing like pmd_trans_unstable() to be required
> >>> as well for lockless walkers. (and I don't think the pmd code is 100%
> >>> right either)
> >>>
> >>
> >> Right. I missed that. Thanks for pointing it out.
> >> The code like this, right?
> >
> > Technically all those *pud's are racy too, the design here with the
> > _unstable function call always seemed weird. I strongly suspect it
> > should mirror how get_user_pages_fast works for lockless walking
> 
> You mean READ_ONCE on the page table entry pointer first, then use the value
> for the rest of the loop? I am not quite familiar with this racy-check
> part of the code and am happy to hear more about it.

There are two main issues with THPs and lockless walks:

- The *pXX value can change at any time, as THPs can be split at any
  moment. However, once observed to be a sub page table pointer the
  value is fixed under the read side of the mmap_sem (I think; I never
  did find the code path supporting this, but everything is busted if
  it isn't true...)
 
- Reading the *pXX without load tearing is difficult on 32 bit arches

So doing READ_ONCE() addresses the first problem.

However, if sizeof(*pXX) is 8 on a 32 bit platform then load tearing
is a problem. At least the various pXX_*() test functions operate on a
single 32 bit word so they don't tear, but to convert the *pXX to a
lower-level page table pointer a coherent, untorn read is required.

So, looking again, I remember now: I could never quite figure out why
gup_pmd_range() was safe to do:

                pmd_t pmd = READ_ONCE(*pmdp);
[..]
                } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
[..]
        ptem = ptep = pte_offset_map(&pmd, addr);

As I don't see what prevents load tearing of a 64 bit pmd, e.g. there is no
pmd_trans_unstable() or equivalent here.

But we do see gup_get_pte() using an anti-load-tearing technique..

Jason
Jason Gunthorpe Sept. 3, 2020, 4:50 p.m. UTC | #12
On Thu, Sep 03, 2020 at 09:25:27AM -0700, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> > On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > > From: Zi Yan <ziy@nvidia.com>
> > > 
> > > Hi all,
> > > 
> > > This patchset adds support for 1GB THP on x86_64. It is on top of
> > > v5.9-rc2-mmots-2020-08-25-21-13.
> > > 
> > > 1GB THP is more flexible for reducing translation overhead and increasing the
> > > performance of applications with large memory footprint without application
> > > changes compared to hugetlb.
> > 
> > Please be more specific about usecases. This better have some strong
> > ones because THP code is complex enough already to add on top solely
> > based on a generic TLB pressure easing.
> 
> Hello, Michal!
> 
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.

At least from a RDMA NIC perspective I've heard from a lot of users
that higher order pages at the DMA level is giving big speed ups too.

It is basically the same dynamic as CPU TLB, except missing a 'TLB'
cache in a PCI-E device is dramatically more expensive to refill. With
200G and soon 400G networking these misses are a growing problem.

With HPC nodes now pushing 1TB of actual physical RAM and single
applications basically using all of it, there is definitely some
meaningful return - if pages can be reliably available.

At least for HPC where the node returns to an idle state after each
job and most of the 1TB memory becomes freed up again, it seems more
believable to me that a large cache of 1G pages could be available?

Even triggering some kind of cleaner between jobs to defragment could
be a reasonable approach..

Jason
Matthew Wilcox Sept. 3, 2020, 4:55 p.m. UTC | #13
On Thu, Sep 03, 2020 at 01:40:32PM -0300, Jason Gunthorpe wrote:
> However if the sizeof(*pXX) is 8 on a 32 bit platform then load
> tearing is a problem. At lest the various pXX_*() test functions
> operate on a single 32 bit word so don't tear, but to to convert the
> *pXX to a lower level page table pointer a coherent, untorn, read is
> required.
> 
> So, looking again, I remember now, I could never quite figure out why
> gup_pmd_range() was safe to do:
> 
>                 pmd_t pmd = READ_ONCE(*pmdp);
> [..]
>                 } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> [..]
>         ptem = ptep = pte_offset_map(&pmd, addr);
> 
> As I don't see what prevents load tearing a 64 bit pmd.. Eg no
> pmd_trans_unstable() or equivalent here.

I don't think there are any 32-bit page tables which support a PUD-sized
page.  Pretty sure x86 doesn't until you get to 4- or 5-level page tables
(which require you to be running in 64-bit mode).  There's not much utility
in having 1GB of your 3GB process address space taken up by a single page.

I'm OK if there are some oddball architectures which support it, but
Linux doesn't.
Matthew Wilcox Sept. 3, 2020, 5:01 p.m. UTC | #14
On Thu, Sep 03, 2020 at 01:50:51PM -0300, Jason Gunthorpe wrote:
> At least from a RDMA NIC perspective I've heard from a lot of users
> that higher order pages at the DMA level is giving big speed ups too.
> 
> It is basically the same dynamic as CPU TLB, except missing a 'TLB'
> cache in a PCI-E device is dramatically more expensive to refill. With
> 200G and soon 400G networking these misses are a growing problem.
> 
> With HPC nodes now pushing 1TB of actual physical RAM and single
> applications basically using all of it, there is definately some
> meaningful return - if pages can be reliably available.
> 
> At least for HPC where the node returns to an idle state after each
> job and most of the 1TB memory becomes freed up again, it seems more
> believable to me that a large cache of 1G pages could be available?

You may be interested in trying out my current THP patchset:

http://git.infradead.org/users/willy/pagecache.git

It doesn't allocate pages larger than PMD size, but it does allocate pages
*up to* PMD size for the page cache, which means that larger pages are
easier to create since memory isn't fragmented all over the system.

If someone wants to opportunistically allocate pages larger than PMD
size, I've put some preliminary support in for that, but I've never
tested any of it.  That's not my goal at the moment.

I'm not clear whether these HPC users primarily use page cache or
anonymous memory (with O_DIRECT).  Probably a mixture.
Jason Gunthorpe Sept. 3, 2020, 5:08 p.m. UTC | #15
On Thu, Sep 03, 2020 at 05:55:59PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 03, 2020 at 01:40:32PM -0300, Jason Gunthorpe wrote:
> > However if the sizeof(*pXX) is 8 on a 32 bit platform then load
> > tearing is a problem. At lest the various pXX_*() test functions
> > operate on a single 32 bit word so don't tear, but to to convert the
> > *pXX to a lower level page table pointer a coherent, untorn, read is
> > required.
> > 
> > So, looking again, I remember now, I could never quite figure out why
> > gup_pmd_range() was safe to do:
> > 
> >                 pmd_t pmd = READ_ONCE(*pmdp);
> > [..]
> >                 } else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> > [..]
> >         ptem = ptep = pte_offset_map(&pmd, addr);
> > 
> > As I don't see what prevents load tearing a 64 bit pmd.. Eg no
> > pmd_trans_unstable() or equivalent here.
> 
> I don't think there are any 32-bit page tables which support a PUD-sized
> page.  Pretty sure x86 doesn't until you get to 4- or 5- level page tables
> (which need you to be running in 64-bit mode).  There's not much utility
> in having 1GB of your 3GB process address space taken up by a single page.

Makes sense for PUD, but why is the above GUP code OK for PMD?
pmd_trans_unstable() exists specifically to close read-tearing races,
so it looks like a real problem?

> I'm OK if there are some oddball architectures which support it, but
> Linux doesn't.

So, based on that observation, I think something approximately like
this is needed for the page walker for PUD: (this has been on my
backlog to return to these patches..)

From 00a361ecb2d9e1226600d9e78e6e1803a886f2d6 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@mellanox.com>
Date: Fri, 13 Mar 2020 13:15:36 -0300
Subject: [RFC] mm/pagewalk: use READ_ONCE when reading the PUD entry
 unlocked

The pagewalker runs while only holding the mmap_sem for read. The pud can
be set asynchronously, while also holding the mmap_sem for read

eg from:

 handle_mm_fault()
  __handle_mm_fault()
   create_huge_pmd()
    dev_dax_huge_fault()
     __dev_dax_pud_fault()
      vmf_insert_pfn_pud()
       insert_pfn_pud()
        pud_lock()
        set_pud_at()

At least x86 sets the PUD using WRITE_ONCE(), so an unlocked read of
unstable data should be paired with READ_ONCE().

For the pagewalker to work locklessly the PUD must work similarly to the
PMD: once the PUD entry becomes a pointer to a PMD, it must be stable, and
safe to pass to pmd_offset()

Passing the value from READ_ONCE into the callbacks prevents the callers
from seeing inconsistencies after they re-read, such as seeing pud_none().

If a callback does obtain the pud_lock then it should trigger ACTION_AGAIN
if a data race caused the original value to change.

Use the same pattern as gup_pmd_range() and pass in the address of the
local READ_ONCE stack variable to pmd_offset() to avoid reading it again.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/pagewalk.h   |  2 +-
 mm/hmm.c                   | 16 +++++++---------
 mm/mapping_dirty_helpers.c |  6 ++----
 mm/pagewalk.c              | 28 ++++++++++++++++------------
 mm/ptdump.c                |  3 +--
 5 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b1cb6b753abb53..6caf28aadafbff 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -39,7 +39,7 @@ struct mm_walk_ops {
 			 unsigned long next, struct mm_walk *walk);
 	int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
-	int (*pud_entry)(pud_t *pud, unsigned long addr,
+	int (*pud_entry)(pud_t pud, pud_t *pudp, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
 	int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
diff --git a/mm/hmm.c b/mm/hmm.c
index 6d9da4b0f0a9f8..98ced96421b913 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -459,28 +459,26 @@ static inline uint64_t pud_to_hmm_pfn_flags(struct hmm_range *range, pud_t pud)
 				range->flags[HMM_PFN_VALID];
 }
 
-static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
-		struct mm_walk *walk)
+static int hmm_vma_walk_pud(pud_t pud, pud_t *pudp, unsigned long start,
+			    unsigned long end, struct mm_walk *walk)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	unsigned long addr = start;
-	pud_t pud;
 	int ret = 0;
 	spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);
 
 	if (!ptl)
 		return 0;
+	if (memcmp(pudp, &pud, sizeof(pud)) != 0) {
+		walk->action = ACTION_AGAIN;
+		spin_unlock(ptl);
+		return 0;
+	}
 
 	/* Normally we don't want to split the huge page */
 	walk->action = ACTION_CONTINUE;
 
-	pud = READ_ONCE(*pudp);
-	if (pud_none(pud)) {
-		spin_unlock(ptl);
-		return hmm_vma_walk_hole(start, end, -1, walk);
-	}
-
 	if (pud_huge(pud) && pud_devmap(pud)) {
 		unsigned long i, npages, pfn;
 		uint64_t *pfns, cpu_flags;
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 71070dda9643d4..8943c2509ec0f7 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -125,12 +125,10 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
 }
 
 /* wp_clean_pud_entry - The pagewalk pud callback. */
-static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
-			      struct mm_walk *walk)
+static int wp_clean_pud_entry(pud_t pudval, pud_t *pudp, unsigned long addr,
+			      unsigned long end, struct mm_walk *walk)
 {
 	/* Dirty-tracking should be handled on the pte level */
-	pud_t pudval = READ_ONCE(*pud);
-
 	if (pud_trans_huge(pudval) || pud_devmap(pudval))
 		WARN_ON(pud_write(pudval) || pud_dirty(pudval));
 
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 928df1638c30d1..cf99536cec23be 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	return err;
 }
 
-static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+static int walk_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
 	pmd_t *pmd;
@@ -67,7 +67,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	int err = 0;
 	int depth = real_depth(3);
 
-	pmd = pmd_offset(pud, addr);
+	pmd = pmd_offset(&pud, addr);
 	do {
 again:
 		next = pmd_addr_end(addr, end);
@@ -119,17 +119,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
-	pud_t *pud;
+	pud_t *pudp;
+	pud_t pud;
 	unsigned long next;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 	int depth = real_depth(2);
 
-	pud = pud_offset(p4d, addr);
+	pudp = pud_offset(p4d, addr);
 	do {
  again:
+		pud = READ_ONCE(*pudp);
 		next = pud_addr_end(addr, end);
-		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
+		if (pud_none(pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
@@ -140,27 +142,29 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		walk->action = ACTION_SUBTREE;
 
 		if (ops->pud_entry)
-			err = ops->pud_entry(pud, addr, next, walk);
+			err = ops->pud_entry(pud, pudp, addr, next, walk);
 		if (err)
 			break;
 
 		if (walk->action == ACTION_AGAIN)
 			goto again;
 
-		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
+		if ((!walk->vma && (pud_leaf(pud) || !pud_present(pud))) ||
 		    walk->action == ACTION_CONTINUE ||
 		    !(ops->pmd_entry || ops->pte_entry))
 			continue;
 
-		if (walk->vma)
-			split_huge_pud(walk->vma, pud, addr);
-		if (pud_none(*pud))
-			goto again;
+		if (walk->vma) {
+			split_huge_pud(walk->vma, pudp, addr);
+			pud = READ_ONCE(*pudp);
+			if (pud_none(pud))
+				goto again;
+		}
 
 		err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
-	} while (pud++, addr = next, addr != end);
+	} while (pudp++, addr = next, addr != end);
 
 	return err;
 }
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 26208d0d03b7a9..c5e1717671e36a 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -59,11 +59,10 @@ static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
 	return 0;
 }
 
-static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+static int ptdump_pud_entry(pud_t val, pud_t *pudp, unsigned long addr,
 			    unsigned long next, struct mm_walk *walk)
 {
 	struct ptdump_state *st = walk->private;
-	pud_t val = READ_ONCE(*pud);
 
 #if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_KASAN)
 	if (pud_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pmd)))
Jason Gunthorpe Sept. 3, 2020, 5:18 p.m. UTC | #16
On Thu, Sep 03, 2020 at 06:01:57PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 03, 2020 at 01:50:51PM -0300, Jason Gunthorpe wrote:
> > At least from a RDMA NIC perspective I've heard from a lot of users
> > that higher order pages at the DMA level is giving big speed ups too.
> > 
> > It is basically the same dynamic as CPU TLB, except missing a 'TLB'
> > cache in a PCI-E device is dramatically more expensive to refill. With
> > 200G and soon 400G networking these misses are a growing problem.
> > 
> > With HPC nodes now pushing 1TB of actual physical RAM and single
> > applications basically using all of it, there is definately some
> > meaningful return - if pages can be reliably available.
> > 
> > At least for HPC where the node returns to an idle state after each
> > job and most of the 1TB memory becomes freed up again, it seems more
> > believable to me that a large cache of 1G pages could be available?
> 
> You may be interested in trying out my current THP patchset:
> 
> http://git.infradead.org/users/willy/pagecache.git
> 
> It doesn't allocate pages larger than PMD size, but it does allocate pages
> *up to* PMD size for the page cache which means that larger pages are
> easier to create as larger pages aren't fragmented all over the system.

Yeah, I saw that, it looks like a great direction.

> If someone wants to opportunistically allocate pages larger than PMD
> size, I've put some preliminary support in for that, but I've never
> tested any of it.  That's not my goal at the moment.
> 
> I'm not clear whether these HPC users primarily use page cache or
> anonymous memory (with O_DIRECT).  Probably a mixture.

There are definitely HPC systems now that are filesystem-less - they
import data for computation from the network using things like blob
storage or some other kind of non-POSIX, userspace-based data storage
scheme.

Jason
Mike Kravetz Sept. 3, 2020, 8:57 p.m. UTC | #17
On 9/3/20 9:25 AM, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
>> On Wed 02-09-20 14:06:12, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> Hi all,
>>>
>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>
>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>> performance of applications with large memory footprint without application
>>> changes compared to hugetlb.
>>
>> Please be more specific about usecases. This better have some strong
>> ones because THP code is complex enough already to add on top solely
>> based on a generic TLB pressure easing.
> 
> Hello, Michal!
> 
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.
> 
> Historically we allocated gigantic pages at the boot time, but recently moved
> to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> see it as a very useful feature.
> 
> Given the cost of an allocation, I'm slightly skeptical about an automatic
> heuristics-based approach, but if an application can explicitly mark target areas
> with madvise(), I don't see why it wouldn't work.
> 
> In our case we'd like to have a reliable way to get 1 GB THPs at some point
> (usually at the start of an application), and transparently destroy them on
> the application exit.

Hi Roman,

In your current use case at Facebook, are you adding 1G hugetlb pages to
the hugetlb pool and then using them within applications?  Or, are you
dynamically allocating them at fault time (hugetlb overcommit/surplus)?

Latency for use of such pages includes:
- Putting together 1G of contiguous memory
- Clearing 1G of memory

In the 'allocation at fault time' mode you incur both costs at fault time.
If using pages from the pool, your only cost at fault time is clearing the
page.
Roman Gushchin Sept. 3, 2020, 9:06 p.m. UTC | #18
On Thu, Sep 03, 2020 at 01:57:54PM -0700, Mike Kravetz wrote:
> On 9/3/20 9:25 AM, Roman Gushchin wrote:
> > On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> >> On Wed 02-09-20 14:06:12, Zi Yan wrote:
> >>> From: Zi Yan <ziy@nvidia.com>
> >>>
> >>> Hi all,
> >>>
> >>> This patchset adds support for 1GB THP on x86_64. It is on top of
> >>> v5.9-rc2-mmots-2020-08-25-21-13.
> >>>
> >>> 1GB THP is more flexible for reducing translation overhead and increasing the
> >>> performance of applications with large memory footprint without application
> >>> changes compared to hugetlb.
> >>
> >> Please be more specific about usecases. This better have some strong
> >> ones because THP code is complex enough already to add on top solely
> >> based on a generic TLB pressure easing.
> > 
> > Hello, Michal!
> > 
> > We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> > performance wins on some workloads.
> > 
> > Historically we allocated gigantic pages at the boot time, but recently moved
> > to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> > than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> > see it as a very useful feature.
> > 
> > Given the cost of an allocation, I'm slightly skeptical about an automatic
> > heuristics-based approach, but if an application can explicitly mark target areas
> > with madvise(), I don't see why it wouldn't work.
> > 
> > In our case we'd like to have a reliable way to get 1 GB THPs at some point
> > (usually at the start of an application), and transparently destroy them on
> > the application exit.
> 
> Hi Roman,
> 
> In your current use case at Facebook, are you adding 1G hugetlb pages to
> the hugetlb pool and then using them within applications?  Or, are you
> dynamically allocating them at fault time (hugetlb overcommit/surplus)?
> 
> Latency time for use of such pages includes:
> - Putting together 1G contiguous
> - Clearing 1G memory
> 
> In the 'allocation at fault time' mode you incur both costs at fault time.
> If using pages from the pool, your only cost at fault time is clearing the
> page.

Hi Mike,

We're using a pool. By dynamic I mean that gigantic pages are not
allocated at boot time.

Thanks!
Michal Hocko Sept. 4, 2020, 7:42 a.m. UTC | #19
On Thu 03-09-20 09:25:27, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> > On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > > From: Zi Yan <ziy@nvidia.com>
> > > 
> > > Hi all,
> > > 
> > > This patchset adds support for 1GB THP on x86_64. It is on top of
> > > v5.9-rc2-mmots-2020-08-25-21-13.
> > > 
> > > 1GB THP is more flexible for reducing translation overhead and increasing the
> > > performance of applications with large memory footprint without application
> > > changes compared to hugetlb.
> > 
> > Please be more specific about usecases. This better have some strong
> > ones because THP code is complex enough already to add on top solely
> > based on a generic TLB pressure easing.
> 
> Hello, Michal!
> 
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.

Let me clarify. I am not questioning 1GB (or large) pages in general. I
believe it is quite clear that there are usecases which hugely benefit
from them. I am mostly asking about the transparent part of it, which
traditionally means that userspace mostly doesn't have to care to get
them. 2MB THPs have established certain expectations, mostly a really
aggressive pro-active instantiation. This has bitten us many times and
created a "you need to disable THP to fix your problem, whatever that is"
cargo cult. I hope we do not want to repeat that mistake here again.

> Historically we allocated gigantic pages at the boot time, but recently moved
> to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> see it as a very useful feature.
> 
> Given the cost of an allocation, I'm slightly skeptical about an automatic
> heuristics-based approach, but if an application can explicitly mark target areas
> with madvise(), I don't see why it wouldn't work.

An explicit opt-in sounds much more appropriate to me as well. If we go
with a specific API then I would not make it 1GB-page specific. Why
can't we have an explicit interface to "defragment" an address space
range into large pages, with the kernel using large pages where
appropriate? Or is the additional copying prohibitively expensive?
Roman Gushchin Sept. 4, 2020, 9:10 p.m. UTC | #20
On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> On Thu 03-09-20 09:25:27, Roman Gushchin wrote:
> > On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> > > On Wed 02-09-20 14:06:12, Zi Yan wrote:
> > > > From: Zi Yan <ziy@nvidia.com>
> > > > 
> > > > Hi all,
> > > > 
> > > > This patchset adds support for 1GB THP on x86_64. It is on top of
> > > > v5.9-rc2-mmots-2020-08-25-21-13.
> > > > 
> > > > 1GB THP is more flexible for reducing translation overhead and increasing the
> > > > performance of applications with large memory footprint without application
> > > > changes compared to hugetlb.
> > > 
> > > Please be more specific about usecases. This better have some strong
> > > ones because THP code is complex enough already to add on top solely
> > > based on a generic TLB pressure easing.
> > 
> > Hello, Michal!
> > 
> > We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> > performance wins on some workloads.
> 
> Let me clarify. I am not questioning 1GB (or large) pages in general. I
> believe it is quite clear that there are usecases which hugely benefit
> from them.  I am mostly asking for the transparent part of it which
> traditionally means that userspace mostly doesn't have to care and get
> them. 2MB THPs have established certain expectations mostly a really    
> aggressive pro-active instanciation. This has bitten us many times and
> create a "you need to disable THP to fix your problem whatever that is"
> cargo cult. I hope we do not want to repeat that mistake here again.

Absolutely, I agree with all of the above. 1 GB THPs have even fewer chances
of being allocated automatically without hurting overall performance.

I believe that historically the THP allocation success rate and cost were not good
enough for a strict interface; that's why the "best effort" approach was used.
Maybe I'm wrong here. Also, in some cases (e.g. desktop) an opportunistic approach
looks like "it's some perf boost for free". However, in the case of large distributed
systems it's important to get predictable and uniform performance across nodes,
so "maybe some hosts will perform better" does not buy us much.

> 
> > Historically we allocated gigantic pages at the boot time, but recently moved
> > to cma-based dynamic approach. Still, hugetlbfs interface requires more management
> > than we would like to do. 1 GB THP seems to be a better alternative. So I definitely
> > see it as a very useful feature.
> > 
> > Given the cost of an allocation, I'm slightly skeptical about an automatic
> > heuristics-based approach, but if an application can explicitly mark target areas
> > with madvise(), I don't see why it wouldn't work.
> 
> An explicit opt-in sounds much more appropriate to me as well. If we go
> with a specific API then I would not make it 1GB pages specific. Why
> cannot we have an explicit interface to "defragment" address space
> range into large pages and the kernel would use large pages where
> appropriate? Or is the additional copying prohibitively expensive?

Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
provides something similar to what you're describing, but there are a lot
of details here, so I'm probably missing something.

Thank you!
Michal Hocko Sept. 7, 2020, 7:20 a.m. UTC | #21
On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
[...]
> > An explicit opt-in sounds much more appropriate to me as well. If we go
> > with a specific API then I would not make it 1GB pages specific. Why
> > cannot we have an explicit interface to "defragment" address space
> > range into large pages and the kernel would use large pages where
> > appropriate? Or is the additional copying prohibitively expensive?
> 
> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
> provides something similar to what you're describing, but there are lot
> of details here, so I'm probably missing something.

MADV_HUGEPAGE is controlling a preference for THP to be used for a
particular address range. So it looks similar but the historical
behavior is to control page faults as well and the behavior depends on
the global setup.

I've had in mind something much simpler. Effectively an API to invoke
khugepaged (like) functionality synchronously from the calling context
on the specific address range. It could be more aggressive than the
regular khugepaged and create even 1G pages (or as large THPs as page
tables can handle on the particular arch for that matter).

As this would be an explicit call we do not have to be worried about
the resulting latency because it would be an explicit call by the
userspace.  The default khugepaged is in a harder position there because
it has no understanding of the target address space and cannot make any
cost/benefit evaluation, so it has to be more conservative.
David Hildenbrand Sept. 8, 2020, 11:57 a.m. UTC | #22
On 03.09.20 18:30, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>> Hi all,
>>>
>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>
>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>> performance of applications with large memory footprint without application
>>> changes compared to hugetlb.
>>
>> This statement needs a lot of justification. I don't see 1GB THP as viable
>> for any workload. Opportunistic 1GB allocation is very questionable
>> strategy.
> 
> Hello, Kirill!
> 
> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
> if backed by an madvise() annotations from userspace application. In this case,
> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
> interface.

I have concerns if we would silently use 1~GB THPs in most scenarios
where we would have used 2~MB THPs. I'd appreciate a trigger to
explicitly enable that - MADV_HUGEPAGE is not sufficient because some
applications relying on it assume that the THP size will be 2~MB
(especially, if you want sparse, large VMAs).

E.g., read via man page

"This  feature  is  primarily  aimed at applications that use large
mappings of data and access large regions of that memory at a time
(e.g., virtualization systems such as QEMU).  It  can  very  easily
waste  memory (e.g., a 2 MB mapping that only ever accesses 1 byte will
result in 2 MB of wired memory instead of one 4 KB page)."



Having said that, I consider 1~GB THP - similar to 512~MB THP on
arm64 with 64k base pages - useless in most setups, and I am not sure it
is worth the trouble. Just use hugetlbfs for the handful of applications
where it makes sense.
Zi Yan Sept. 8, 2020, 2:05 p.m. UTC | #23
On 8 Sep 2020, at 7:57, David Hildenbrand wrote:

> On 03.09.20 18:30, Roman Gushchin wrote:
>> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>> Hi all,
>>>>
>>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>>
>>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>>> performance of applications with large memory footprint without application
>>>> changes compared to hugetlb.
>>>
>>> This statement needs a lot of justification. I don't see 1GB THP as viable
>>> for any workload. Opportunistic 1GB allocation is very questionable
>>> strategy.
>>
>> Hello, Kirill!
>>
>> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
>> if backed by an madvise() annotations from userspace application. In this case,
>> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
>> interface.
>
> I have concerns if we would silently use 1~GB THPs in most scenarios
> where be would have used 2~MB THP. I'd appreciate a trigger to
> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> applications relying on that assume that the THP size will be 2~MB
> (especially, if you want sparse, large VMAs).

This patchset is not intended to silently use 1GB THP in place of 2MB THP.
First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
region (although I had alloc_contig_pages as a fallback, which can be removed
in the next version), so users need to add the hugepage_cma=nG kernel parameter
to enable 1GB THP allocation. If finer control is necessary, we can add
a new MADV_HUGEPAGE_1GB for 1GB THP.
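For readers trying this out, usage would presumably look like the following. The hugepage_cma= parameter and the enable_1GB knob path come from the description above; the knob's accepted values and the chosen CMA size are assumptions for illustration:

```shell
# Kernel command line: reserve a CMA area (shared with hugetlb) at boot.
# The size (16G here) is workload-dependent, not a required value.
#   ... hugepage_cma=16G ...

# After boot, explicitly enable 1GB THP via the new sysfs knob.
# (Assumed to accept the same values as the existing THP "enabled" knob.)
echo madvise > /sys/kernel/mm/transparent_hugepage/enable_1GB

# The existing 2MB THP policy stays under its own separate knob:
cat /sys/kernel/mm/transparent_hugepage/enabled
```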


—
Best Regards,
Yan Zi
David Hildenbrand Sept. 8, 2020, 2:22 p.m. UTC | #24
On 08.09.20 16:05, Zi Yan wrote:
> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
> 
>> On 03.09.20 18:30, Roman Gushchin wrote:
>>> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>>>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>
>>>>> Hi all,
>>>>>
>>>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>>>
>>>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>>>> performance of applications with large memory footprint without application
>>>>> changes compared to hugetlb.
>>>>
>>>> This statement needs a lot of justification. I don't see 1GB THP as viable
>>>> for any workload. Opportunistic 1GB allocation is very questionable
>>>> strategy.
>>>
>>> Hello, Kirill!
>>>
>>> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
>>> if backed by an madvise() annotations from userspace application. In this case,
>>> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
>>> interface.
>>
>> I have concerns if we would silently use 1~GB THPs in most scenarios
>> where be would have used 2~MB THP. I'd appreciate a trigger to
>> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
>> applications relying on that assume that the THP size will be 2~MB
>> (especially, if you want sparse, large VMAs).
> 
> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> region (although I had alloc_contig_pages as a fallback, which can be removed
> in next version), so users need to add hugepage_cma=nG kernel parameter to
> enable 1GB THP allocation. If a finer control is necessary, we can add
> a new MADV_HUGEPAGE_1GB for 1GB THP.

Thanks for the information - I would have loved to see important
information like that (esp. how to use) in the cover letter.

So what you propose is (excluding alloc_contig_pages()) really just
automatically using (previously reserved) 1GB huge pages as 1GB THP,
instead of explicitly using them in an application via hugetlbfs.
Still, I'm not convinced how helpful that actually is - most certainly
you really want a mechanism to control this per application (+ maybe make
the application indicate actual ranges where it makes sense - but then
you can directly modify the application to use hugetlbfs).

I guess the interesting thing about this approach is that we can
mix-and-match THPs of differing granularity within a single mapping -
whereas a hugetlbfs allocation would fail in case there aren't sufficient
1GB pages available. However, there are no guarantees for applications
anymore (thinking about RT KVM and similar, we really want gigantic
pages and cannot tolerate falling back to smaller granularity).

What are intended use cases/applications that could benefit? I doubt
databases and virtualization are really a good fit - they know how to
handle hugetlbfs just fine.
Matthew Wilcox Sept. 8, 2020, 2:27 p.m. UTC | #25
On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote:
> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
> > I have concerns if we would silently use 1~GB THPs in most scenarios
> > where be would have used 2~MB THP. I'd appreciate a trigger to
> > explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> > applications relying on that assume that the THP size will be 2~MB
> > (especially, if you want sparse, large VMAs).
> 
> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> region (although I had alloc_contig_pages as a fallback, which can be removed
> in next version), so users need to add hugepage_cma=nG kernel parameter to
> enable 1GB THP allocation. If a finer control is necessary, we can add
> a new MADV_HUGEPAGE_1GB for 1GB THP.

I think we do need that flag.  Machines don't run a single workload
(arguably with VMs, we're getting closer to going back to the single
workload per machine, but that's a different matter).  So if there's
one app that wants 2MB pages and one that wants 1GB pages, we need to
be able to distinguish them.

I could also see there being an app which benefits from 1GB for
one mapping and prefers 2MB for a different mapping, so I think the
per-mapping madvise flag is best.

I'm a little wary of encoding the size of an x86 PUD in the Linux API
though.  Probably best to follow the example set in
include/uapi/asm-generic/hugetlb_encode.h, but I don't love it.  I
don't have a better suggestion though.
Michal Hocko Sept. 8, 2020, 2:35 p.m. UTC | #26
On Tue 08-09-20 10:05:11, Zi Yan wrote:
> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
> 
> > On 03.09.20 18:30, Roman Gushchin wrote:
> >> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
> >>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> >>>> From: Zi Yan <ziy@nvidia.com>
> >>>>
> >>>> Hi all,
> >>>>
> >>>> This patchset adds support for 1GB THP on x86_64. It is on top of
> >>>> v5.9-rc2-mmots-2020-08-25-21-13.
> >>>>
> >>>> 1GB THP is more flexible for reducing translation overhead and increasing the
> >>>> performance of applications with large memory footprint without application
> >>>> changes compared to hugetlb.
> >>>
> >>> This statement needs a lot of justification. I don't see 1GB THP as viable
> >>> for any workload. Opportunistic 1GB allocation is very questionable
> >>> strategy.
> >>
> >> Hello, Kirill!
> >>
> >> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
> >> if backed by an madvise() annotations from userspace application. In this case,
> >> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
> >> interface.
> >
> > I have concerns if we would silently use 1~GB THPs in most scenarios
> > where be would have used 2~MB THP. I'd appreciate a trigger to
> > explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> > applications relying on that assume that the THP size will be 2~MB
> > (especially, if you want sparse, large VMAs).
> 
> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> region (although I had alloc_contig_pages as a fallback, which can be removed
> in next version), so users need to add hugepage_cma=nG kernel parameter to
> enable 1GB THP allocation. If a finer control is necessary, we can add
> a new MADV_HUGEPAGE_1GB for 1GB THP.

A global knob is insufficient. 1G pages will become a very precious
resource as it requires a pre-allocation (reservation). So it really has
to be an opt-in and the question is whether there is also some sort of
access control needed.
Rik van Riel Sept. 8, 2020, 2:41 p.m. UTC | #27
On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:

> A global knob is insufficient. 1G pages will become a very precious
> resource as it requires a pre-allocation (reservation). So it really
> has
> to be an opt-in and the question is whether there is also some sort
> of
> access control needed.

The 1GB pages do not require that much in the way of
pre-allocation. The memory can be obtained through CMA, which means it
can be used for movable 4kB and 2MB allocations when not being used for
1GB pages.

That makes it relatively easy to set aside some fraction of system
memory in every system for 1GB and movable allocations, and use it in
whatever way is needed depending on what workload(s) end up running on
a system.
David Hildenbrand Sept. 8, 2020, 3:02 p.m. UTC | #28
On 08.09.20 16:41, Rik van Riel wrote:
> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> 
>> A global knob is insufficient. 1G pages will become a very precious
>> resource as it requires a pre-allocation (reservation). So it really
>> has
>> to be an opt-in and the question is whether there is also some sort
>> of
>> access control needed.
> 
> The 1GB pages do not require that much in the way of
> pre-allocation. The memory can be obtained through CMA,
> which means it can be used for movable 4kB and 2MB
> allocations when not
> being used for 1GB pages.
> 
> That makes it relatively easy to set aside
> some fraction
> of system memory in every system for 1GB and movable
> allocations, and use it for whatever way it is needed
> depending on what workload(s) end up running on a system.
> 

Linking secretmem discussion

https://lkml.kernel.org/r/fdda6ba7-9418-2b52-eee8-ce5e9bfdb6ad@redhat.com
Zi Yan Sept. 8, 2020, 3:09 p.m. UTC | #29
On 7 Sep 2020, at 3:20, Michal Hocko wrote:

> On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
>> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> [...]
>>> An explicit opt-in sounds much more appropriate to me as well. If we go
>>> with a specific API then I would not make it 1GB pages specific. Why
>>> cannot we have an explicit interface to "defragment" address space
>>> range into large pages and the kernel would use large pages where
>>> appropriate? Or is the additional copying prohibitively expensive?
>>
>> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
>> provides something similar to what you're describing, but there are lot
>> of details here, so I'm probably missing something.
>
> MADV_HUGEPAGE is controlling a preference for THP to be used for a
> particular address range. So it looks similar but the historical
> behavior is to control page faults as well and the behavior depends on
> the global setup.
>
> I've had in mind something much simpler. Effectively an API to invoke
> khugepaged (like) functionality synchronously from the calling context
> on the specific address range. It could be more aggressive than the
> regular khugepaged and create even 1G pages (or as large THPs as page
> tables can handle on the particular arch for that matter).
>
> As this would be an explicit call we do not have to be worried about
> the resulting latency because it would be an explicit call by the
> userspace.  The default khugepaged has a harder position there because
> has no understanding of the target address space and cannot make any
> cost/benefit evaluation so it has to be more conservative.

Something like MADV_HUGEPAGE_SYNC? It would be useful, since users would
have better and clearer control over getting huge pages from the kernel
and would know when they will pay the cost of getting them.

I would think the suggestion is more that the huge page control options
currently provided by the kernel do not have a predictable performance
outcome, since MADV_HUGEPAGE is a best-effort option and does not tell
users whether the marked virtual address range is backed by huge pages
or not when the madvise returns. MADV_HUGEPAGE_SYNC would give users a
deterministic answer on whether the huge page(s) were formed or not.

—
Best Regards,
Yan Zi
Zi Yan Sept. 8, 2020, 3:36 p.m. UTC | #30
On 8 Sep 2020, at 10:22, David Hildenbrand wrote:

> On 08.09.20 16:05, Zi Yan wrote:
>> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
>>
>>> On 03.09.20 18:30, Roman Gushchin wrote:
>>>> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>>>>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This patchset adds support for 1GB THP on x86_64. It is on top of
>>>>>> v5.9-rc2-mmots-2020-08-25-21-13.
>>>>>>
>>>>>> 1GB THP is more flexible for reducing translation overhead and increasing the
>>>>>> performance of applications with large memory footprint without application
>>>>>> changes compared to hugetlb.
>>>>>
>>>>> This statement needs a lot of justification. I don't see 1GB THP as viable
>>>>> for any workload. Opportunistic 1GB allocation is very questionable
>>>>> strategy.
>>>>
>>>> Hello, Kirill!
>>>>
>>>> I share your skepticism about opportunistic 1 GB allocations, however it might be useful
>>>> if backed by an madvise() annotations from userspace application. In this case,
>>>> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient
>>>> interface.
>>>
>>> I have concerns if we would silently use 1~GB THPs in most scenarios
>>> where be would have used 2~MB THP. I'd appreciate a trigger to
>>> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
>>> applications relying on that assume that the THP size will be 2~MB
>>> (especially, if you want sparse, large VMAs).
>>
>> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
>> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
>> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
>> region (although I had alloc_contig_pages as a fallback, which can be removed
>> in next version), so users need to add hugepage_cma=nG kernel parameter to
>> enable 1GB THP allocation. If a finer control is necessary, we can add
>> a new MADV_HUGEPAGE_1GB for 1GB THP.
>
> Thanks for the information - I would have loved to see important
> information like that (esp. how to use) in the cover letter.
>
> So what you propose is (excluding alloc_contig_pages()) really just
> automatically using (previously reserved) 1GB huge pages as 1GB THP
> instead of explicitly using them in an application using hugetlbfs.
> Still, not convinced how helpful that actually is - most certainly you
> really want a mechanism to control this per application (+ maybe make
> the application indicate actual ranges where it makes sense - but then
> you can directly modify the application to use hugetlbfs).
>
> I guess the interesting thing of this approach is that we can
> mix-and-match THP of differing granularity within a single mapping -
> whereby a hugetlbfs allocation would fail in case there isn't sufficient
> 1GB pages available. However, there are no guarantees for applications
> anymore (thinking about RT KVM and similar, we really want gigantic
> pages and cannot tolerate falling back to smaller granularity).

I agree that currently THP allocation does not provide a strong guarantee
like hugetlbfs, which can pre-allocate pages at boot time. For users like
RT KVM and such, pre-allocated hugetlb might be the only choice, since
allocating huge pages from CMA (either hugetlb or 1GB THP) would fail
if some pinned pages are scattered in the CMA region, preventing huge
page allocation.

In other cases, if the user can tolerate fallbacks but does not like the
unpredictable huge page formation outcome, we could add an madvise()
option like Michal suggested [1], so the user will know whether he gets
huge pages or not and can act accordingly.


> What are intended use cases/applications that could benefit? I doubt
> databases and virtualization are really a good fit - they know how to
> handle hugetlbfs just fine.

Roman and Jason have provided some use cases [2,3].

[1]https://lore.kernel.org/linux-mm/20200907072014.GD30144@dhcp22.suse.cz/
[2]https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/
[3]https://lore.kernel.org/linux-mm/20200903165051.GN24045@ziepe.ca/

—
Best Regards,
Yan Zi
Zi Yan Sept. 8, 2020, 3:50 p.m. UTC | #31
On 8 Sep 2020, at 10:27, Matthew Wilcox wrote:

> On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote:
>> On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
>>> I have concerns if we would silently use 1~GB THPs in most scenarios
>>> where be would have used 2~MB THP. I'd appreciate a trigger to
>>> explicitly enable that - MADV_HUGEPAGE is not sufficient because some
>>> applications relying on that assume that the THP size will be 2~MB
>>> (especially, if you want sparse, large VMAs).
>>
>> This patchset is not intended to silently use 1GB THP in place of 2MB THP.
>> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
>> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
>> region (although I had alloc_contig_pages as a fallback, which can be removed
>> in next version), so users need to add hugepage_cma=nG kernel parameter to
>> enable 1GB THP allocation. If a finer control is necessary, we can add
>> a new MADV_HUGEPAGE_1GB for 1GB THP.
>
> I think we do need that flag.  Machines don't run a single workload
> (arguably with VMs, we're getting closer to going back to the single
> workload per machine, but that's a different matter).  So if there's
> one app that wants 2MB pages and one that wants 1GB pages, we need to
> be able to distinguish them.
>
> I could also see there being an app which benefits from 1GB for
> one mapping and prefers 2GB for a different mapping, so I think the
> per-mapping madvise flag is best.
>
> I'm a little wary of encoding the size of an x86 PUD in the Linux API
> though.  Probably best to follow the example set in
> include/uapi/asm-generic/hugetlb_encode.h, but I don't love it.  I
> don't have a better suggestion though.

Using hugetlb_encode.h makes sense to me. I will add it in the next version.

Thanks for the suggestion.


—
Best Regards,
Yan Zi
Roman Gushchin Sept. 8, 2020, 7:58 p.m. UTC | #32
On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote:
> On 7 Sep 2020, at 3:20, Michal Hocko wrote:
> 
> > On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
> >> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> > [...]
> >>> An explicit opt-in sounds much more appropriate to me as well. If we go
> >>> with a specific API then I would not make it 1GB pages specific. Why
> >>> cannot we have an explicit interface to "defragment" address space
> >>> range into large pages and the kernel would use large pages where
> >>> appropriate? Or is the additional copying prohibitively expensive?
> >>
> >> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
> >> provides something similar to what you're describing, but there are lot
> >> of details here, so I'm probably missing something.
> >
> > MADV_HUGEPAGE is controlling a preference for THP to be used for a
> > particular address range. So it looks similar but the historical
> > behavior is to control page faults as well and the behavior depends on
> > the global setup.
> >
> > I've had in mind something much simpler. Effectively an API to invoke
> > khugepaged (like) functionality synchronously from the calling context
> > on the specific address range. It could be more aggressive than the
> > regular khugepaged and create even 1G pages (or as large THPs as page
> > tables can handle on the particular arch for that matter).
> >
> > As this would be an explicit call we do not have to be worried about
> > the resulting latency because it would be an explicit call by the
> > userspace.  The default khugepaged has a harder position there because
> > has no understanding of the target address space and cannot make any
> > cost/benefit evaluation so it has to be more conservative.
> 
> Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
> better and clearer control of getting huge pages from the kernel and
> know when they will pay the cost of getting the huge pages.
> 
> I would think the suggestion is more about the huge page control options
> currently provided by the kernel do not have predictable performance
> outcome, since MADV_HUGEPAGE is a best-effort option and does not tell
> users whether the marked virtual address range is backed by huge pages
> or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a
> deterministic result to users on whether the huge page(s) are formed
> or not.

Yeah, I agree with Michal here, we need a more straightforward interface.

The hard question here is how hard the kernel should try to allocate
a gigantic page, and how fast it should give up and return an error.
I'd say it should try really hard if there is some chance to succeed,
so that if an error is returned, there is no reason to retry.
Any objections/better ideas here?

Given that we need to pass a page size, we probably need either to introduce
a new syscall (madvise2?) with an additional argument, or add a bunch
of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc.

Idk what is better long-term, but new madvise flags are probably slightly
easier to deal with in the development process.

Thanks!
John Hubbard Sept. 9, 2020, 4:01 a.m. UTC | #33
On 9/8/20 12:58 PM, Roman Gushchin wrote:
> On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote:
>> On 7 Sep 2020, at 3:20, Michal Hocko wrote:
>>> On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
>>>> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
>>> [...]
>> Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
>> better and clearer control of getting huge pages from the kernel and
>> know when they will pay the cost of getting the huge pages.
>>
>> I would think the suggestion is more about the huge page control options
>> currently provided by the kernel do not have predictable performance
>> outcome, since MADV_HUGEPAGE is a best-effort option and does not tell
>> users whether the marked virtual address range is backed by huge pages
>> or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a
>> deterministic result to users on whether the huge page(s) are formed
>> or not.
> 
> Yeah, I agree with Michal here, we need a more straightforward interface.
> 
> The hard question here is how hard the kernel should try to allocate
> a gigantic page and how fast it should give up and return an error?
> I'd say to try really hard if there are some chances to succeed,
> so that if an error is returned, there are no more reasons to retry.
> Any objections/better ideas here?

I agree, especially because this is starting to look a lot more like an
allocation call. And I think it would be appropriate for the kernel to
try approximately as hard to provide these 1GB pages as it would to
allocate normal memory to a process.

In fact, for a moment I thought, why not go all the way and make this
actually be a true allocation? However, given that we still have
operations that require page splitting, with no good way to call back
user space to notify it that its "allocated" huge pages are being split,
that fails. But it's still pretty close.


> 
> Given that we need to pass a page size, we probably need either to introduce
> a new syscall (madvise2?) with an additional argument, or add a bunch
> of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc.
> 
> Idk what is better long-term, but new madvise flags are probably slightly
> easier to deal with in the development process.
> 

Probably either an MADV_* flag or a new syscall would work fine. But
given that this seems like a pretty distinct new capability, one with
options, man page documentation, and possibly its own future flags, I'd
lean toward making it its own new syscall, maybe:

     compact_huge_pages(nbytes or npages, flags /* page size, etc */);

...thus leaving madvise() and its remaining flags still available, to
further refine things.


thanks,
Michal Hocko Sept. 9, 2020, 7:04 a.m. UTC | #34
On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> 
> > A global knob is insufficient. 1G pages will become a very precious
> > resource as it requires a pre-allocation (reservation). So it really
> > has
> > to be an opt-in and the question is whether there is also some sort
> > of
> > access control needed.
> 
> The 1GB pages do not require that much in the way of
> pre-allocation. The memory can be obtained through CMA,
> which means it can be used for movable 4kB and 2MB
> allocations when not
> being used for 1GB pages.

That CMA has to be pre-reserved, right? That requires a configuration.

> That makes it relatively easy to set aside
> some fraction
> of system memory in every system for 1GB and movable
> allocations, and use it for whatever way it is needed
> depending on what workload(s) end up running on a system.

I was not talking about how easy or hard it is. My main concern is that
this is effectively a pre-reserved pool, and a global knob is a very
suboptimal way to control access to it. I (rather) strongly believe this
should be an explicit opt-in, and ideally not 1GB specific but rather
something that allows large pages to be created wherever they fit. See
the other subthread for more details.
Michal Hocko Sept. 9, 2020, 7:15 a.m. UTC | #35
On Tue 08-09-20 12:58:59, Roman Gushchin wrote:
> On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote:
> > On 7 Sep 2020, at 3:20, Michal Hocko wrote:
> > 
> > > On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
> > >> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> > > [...]
> > >>> An explicit opt-in sounds much more appropriate to me as well. If we go
> > >>> with a specific API then I would not make it 1GB pages specific. Why
> > >>> cannot we have an explicit interface to "defragment" address space
> > >>> range into large pages and the kernel would use large pages where
> > >>> appropriate? Or is the additional copying prohibitively expensive?
> > >>
> > >> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
> > >> provides something similar to what you're describing, but there are lot
> > >> of details here, so I'm probably missing something.
> > >
> > > MADV_HUGEPAGE is controlling a preference for THP to be used for a
> > > particular address range. So it looks similar but the historical
> > > behavior is to control page faults as well and the behavior depends on
> > > the global setup.
> > >
> > > I've had in mind something much simpler. Effectively an API to invoke
> > > khugepaged (like) functionality synchronously from the calling context
> > > on the specific address range. It could be more aggressive than the
> > > regular khugepaged and create even 1G pages (or as large THPs as page
> > > tables can handle on the particular arch for that matter).
> > >
> > > As this would be an explicit call we do not have to be worried about
> > > the resulting latency because it would be an explicit call by the
> > > userspace.  The default khugepaged has a harder position there because
> > > has no understanding of the target address space and cannot make any
> > > cost/benefit evaluation so it has to be more conservative.
> > 
> > Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have
> > better and clearer control of getting huge pages from the kernel and
> > know when they will pay the cost of getting the huge pages.

The name is not really that important. The crucial design decisions are
- THP allocation time - #PF and/or madvise context
- lazy/sync instantiation
- huge page sizes controllable by the userspace?
- aggressiveness - how hard to try
- internal fragmentation - allow creating THPs over sparsely populated or
  unpopulated ranges
- do we need some sort of access control or privilege check, as some THPs
  would be a really scarce resource (like those that require pre-reservation)?
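
As a purely hypothetical illustration of the lazy/sync point above, a
userspace caller could probe for a synchronous-collapse advice and
degrade to the existing asynchronous hint. MADV_HUGEPAGE_SYNC and its
numeric value are invented for this sketch; current kernels reject it
with EINVAL, which is exactly what the fallback relies on:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14
#endif
/* Hypothetical: NOT a real madvise advice. Present kernels return
 * EINVAL for it, so the probe below falls through. */
#define MADV_HUGEPAGE_SYNC 0x59

/* Returns 1 if the range was collapsed synchronously (never happens
 * today), 0 if we fell back to the async MADV_HUGEPAGE hint, -1 on
 * error (e.g. THP disabled in the kernel). */
int request_huge_backing(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_HUGEPAGE_SYNC) == 0)
        return 1;
    if (errno == EINVAL)   /* advice unknown: fall back to the hint */
        return madvise(addr, len, MADV_HUGEPAGE) == 0 ? 0 : -1;
    return -1;
}
```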

> > I would think the suggestion is more about the huge page control options
> > currently provided by the kernel do not have predictable performance
> > outcome, since MADV_HUGEPAGE is a best-effort option and does not tell
> > users whether the marked virtual address range is backed by huge pages
> > or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a
> > deterministic result to users on whether the huge page(s) are formed
> > or not.
> 
> Yeah, I agree with Michal here, we need a more straightforward interface.
> 
> The hard question here is how hard the kernel should try to allocate
> a gigantic page and how fast it should give up and return an error?
> I'd say to try really hard if there are some chances to succeed,
> so that if an error is returned, there are no more reasons to retry.
> Any objections/better ideas here?

If this is going to be an explicit interface like madvise then I would
follow the same semantic as hugetlb pages allocation - aka try as hard
as feasible (whatever that means).

> Given that we need to pass a page size, we probably need either to introduce
> a new syscall (madvise2?) with an additional argument, or add a bunch
> of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc.

Do we really need to bother userspace with making decision about the
page size? I would expect that the userspace only cares to get huge
pages backed memory range. The larger the pages the better. It is up to
the kernel to make the resource control here. After all, THPs can be
split/reclaimed under a memory pressure so we do not want to make any
promises about pages backing any mapping.
Jason Gunthorpe Sept. 9, 2020, 12:11 p.m. UTC | #36
On Tue, Sep 08, 2020 at 03:27:58PM +0100, Matthew Wilcox wrote:
> On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote:
> > On 8 Sep 2020, at 7:57, David Hildenbrand wrote:
> > > I have concerns if we would silently use 1~GB THPs in most scenarios
> > > where be would have used 2~MB THP. I'd appreciate a trigger to
> > > explicitly enable that - MADV_HUGEPAGE is not sufficient because some
> > > applications relying on that assume that the THP size will be 2~MB
> > > (especially, if you want sparse, large VMAs).
> > 
> > This patchset is not intended to silently use 1GB THP in place of 2MB THP.
> > First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB
> > to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA
> > region (although I had alloc_contig_pages as a fallback, which can be removed
> > in next version), so users need to add hugepage_cma=nG kernel parameter to
> > enable 1GB THP allocation. If a finer control is necessary, we can add
> > a new MADV_HUGEPAGE_1GB for 1GB THP.
> 
> I think we do need that flag.  Machines don't run a single workload
> (arguably with VMs, we're getting closer to going back to the single
> workload per machine, but that's a different matter).  So if there's
> one app that wants 2MB pages and one that wants 1GB pages, we need to
> be able to distinguish them.
> 
> I could also see there being an app which benefits from 1GB for
> one mapping and prefers 2GB for a different mapping, so I think the
> per-mapping madvise flag is best.

I wonder if apps really care about the specific page size?
Particularly from a portability view?

The general app desire seems to be the need for 'efficient' memory (eg
because it is highly accessed) and I suspect comes with a desire to
populate the pages too.

Maybe doing something with MAP_POPULATE is an idea?

eg if I ask for 1GB of MAP_POPULATE it seems fairly natural the thing
that comes back should be a 1GB THP? If I ask for only .5GB then it
could be 2M pages, or whatever depending on arch support.
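
That heuristic - let the populated size pick the page size rather than
a new flag - can be sketched as a small helper; the thresholds and
multiple-of-size rule are illustrative assumptions, not anything the
kernel implements:

```c
#include <assert.h>
#include <stddef.h>

#define SZ_2M (2UL << 20)
#define SZ_1G (1UL << 30)

/* Illustrative: a populated request that is a whole multiple of a huge
 * page size gets that size; everything else gets base pages. */
size_t natural_thp_size(size_t request)
{
    if (request >= SZ_1G && (request % SZ_1G) == 0)
        return SZ_1G;
    if (request >= SZ_2M && (request % SZ_2M) == 0)
        return SZ_2M;
    return 4096;   /* base pages */
}
```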

Jason
Matthew Wilcox Sept. 9, 2020, 12:32 p.m. UTC | #37
On Wed, Sep 09, 2020 at 09:11:17AM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 08, 2020 at 03:27:58PM +0100, Matthew Wilcox wrote:
> > I could also see there being an app which benefits from 1GB for
> > one mapping and prefers 2GB for a different mapping, so I think the
> > per-mapping madvise flag is best.
> 
> I wonder if apps really care about the specific page size?
> Particularly from a portability view?

No, they don't.  They just want to run as fast as possible ;-)

> The general app desire seems to be the need for 'efficient' memory (eg
> because it is highly accessed) and I suspect comes with a desire to
> populate the pages too.

The problem with a MAP_GOES_FASTER flag is that everybody sets it.
Any flag name needs to convey its drawbacks as well as its advantages.
Maybe MAP_EXTREMELY_COARSE_WORKINGSET would do that -- the VM will work
in terms of 1GB pages for this mapping, so any swap-out is going to take
out an entire 1GB at once.

But here's the thing ... we already allow
	mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)
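
For reference, a hedged sketch of that existing hugetlb path, with a
fallback for systems where no gigantic pages were reserved (e.g. via
hugepagesz=1G hugepages=N on the kernel command line); the helper name
is ours, the flags are the real mmap(2) interface:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000
#endif
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

/* Try a populated, 1GB-page-backed hugetlb mapping; when no gigantic
 * pages are reserved (or len is not a 1GB multiple), mmap() fails and
 * we fall back to an ordinary populated base-page mapping. */
void *map_1g_populated(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE |
                   MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
    if (p != MAP_FAILED)
        return p;
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
}
```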

So if we're not doing THP, what's the point of this thread?
My understanding of THP is "Application doesn't need to change, kernel
makes a decision about what page size is best based on entire system
state and process's behaviour".

An madvise flag is a different beast; that's just letting the kernel
know what the app thinks its behaviour will be.  The kernel can pay
as much (or as little) attention to that hint as it sees fit.  And of
course, it can change over time (either by kernel release as we change
the algorithms, or simply from one minute to the next as more or less
memory comes available).
Jason Gunthorpe Sept. 9, 2020, 1:14 p.m. UTC | #38
On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote:

> But here's the thing ... we already allow
> 	mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)
> 
> So if we're not doing THP, what's the point of this thread?

I wondered that too..

> An madvise flag is a different beast; that's just letting the kernel
> know what the app thinks its behaviour will be.  The kernel can pay

But madvise is too late: the VMA already has an address, and if it is
not 1GB aligned it cannot become a 1GB THP.

Jason
Rik van Riel Sept. 9, 2020, 1:19 p.m. UTC | #39
On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> > On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> > 
> > > A global knob is insufficient. 1G pages will become a very
> > > precious
> > > resource as it requires a pre-allocation (reservation). So it
> > > really
> > > has
> > > to be an opt-in and the question is whether there is also some
> > > sort
> > > of
> > > access control needed.
> > 
> > The 1GB pages do not require that much in the way of
> > pre-allocation. The memory can be obtained through CMA,
> > which means it can be used for movable 4kB and 2MB
> > allocations when not
> > being used for 1GB pages.
> 
> That CMA has to be pre-reserved, right? That requires a
> configuration.

To some extent, yes.

However, because that pool can be used for movable 4kB and 2MB pages as
well as for 1GB pages, it would be easy to just set the size of that
pool to eg. 1/3 or even 1/2 of memory for every system.

It isn't like the pool needs to be the exact right size. We just need
to avoid the "highmem problem" of having too little memory for kernel
allocations.
David Hildenbrand Sept. 9, 2020, 1:27 p.m. UTC | #40
On 09.09.20 15:14, Jason Gunthorpe wrote:
> On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote:
> 
>> But here's the thing ... we already allow
>> 	mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)
>>
>> So if we're not doing THP, what's the point of this thread?
> 
> I wondered that too..
> 
>> An madvise flag is a different beast; that's just letting the kernel
>> know what the app thinks its behaviour will be.  The kernel can pay
> 
> But madvise is too late, the VMA already has an address, if it is not
> 1G aligned it cannot be 1G THP already.

That's why user space (like QEMU) is THP-aware and selects an address
that is aligned to the expected THP granularity (e.g., 2MB on x86_64).
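
A minimal sketch of that userspace trick - over-reserve by the
alignment, then trim so the region starts on a THP boundary; the helper
is illustrative, not QEMU's actual code:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Reserve len + align bytes, then unmap the unaligned prefix and the
 * surplus tail, leaving a len-byte region whose start is a multiple of
 * align (e.g. 2MB on x86_64), so THP can back it from the first byte. */
void *mmap_aligned(size_t len, size_t align)
{
    size_t total = len + align;
    char *raw = mmap(NULL, total, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t start = ((uintptr_t)raw + align - 1) & ~(uintptr_t)(align - 1);
    size_t head = start - (uintptr_t)raw;
    size_t tail = total - head - len;

    if (head)
        munmap(raw, head);                  /* unaligned prefix */
    if (tail)
        munmap((char *)start + len, tail);  /* surplus at the end */
    return (void *)start;
}
```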
David Hildenbrand Sept. 9, 2020, 1:43 p.m. UTC | #41
On 09.09.20 15:19, Rik van Riel wrote:
> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>>>
>>>> A global knob is insufficient. 1G pages will become a very
>>>> precious
>>>> resource as it requires a pre-allocation (reservation). So it
>>>> really
>>>> has
>>>> to be an opt-in and the question is whether there is also some
>>>> sort
>>>> of
>>>> access control needed.
>>>
>>> The 1GB pages do not require that much in the way of
>>> pre-allocation. The memory can be obtained through CMA,
>>> which means it can be used for movable 4kB and 2MB
>>> allocations when not
>>> being used for 1GB pages.
>>
>> That CMA has to be pre-reserved, right? That requires a
>> configuration.
> 
> To some extent, yes.
> 
> However, because that pool can be used for movable
> 4kB and 2MB
> pages as well as for 1GB pages, it would be easy to just set
> the size of that pool to eg. 1/3 or even 1/2 of memory for every
> system.
> 
> It isn't like the pool needs to be the exact right size. We
> just need to avoid the "highmem problem" of having too little
> memory for kernel allocations.
> 

I am not sure I like the trend towards CMA that we are seeing, reserving
huge buffers for specific users (and eventually even doing it
automatically).

What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
anybody who requires large, unmovable allocations can use it.

I once played with the idea of having ZONE_PREFER_MOVABLE, which
a) Is the primary choice for movable allocations
b) Is allowed to contain unmovable allocations (esp., gigantic pages)
c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
running out of memory

If someone messes up the zone ratio, issues known from zone imbalances
are avoided - large allocations simply become less likely to succeed. In
contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.
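
The proposed semantics can be modeled as two fallback orders; the enum
and the ordering below only paraphrase points a)-c) of this mail, they
are not existing kernel code:

```c
#include <assert.h>

enum zone_id { ZONE_NORMAL, ZONE_PREFER_MOVABLE };

/* a) movable allocations prefer the new zone */
static const enum zone_id movable_order[]   = { ZONE_PREFER_MOVABLE, ZONE_NORMAL };
/* b)+c) unmovable allocations start in NORMAL but may overflow into
 * PREFER_MOVABLE instead of running out of memory */
static const enum zone_id unmovable_order[] = { ZONE_NORMAL, ZONE_PREFER_MOVABLE };

const enum zone_id *fallback_order(int movable, int *n)
{
    *n = 2;
    return movable ? movable_order : unmovable_order;
}
```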
Rik van Riel Sept. 9, 2020, 1:49 p.m. UTC | #42
On Wed, 2020-09-09 at 15:43 +0200, David Hildenbrand wrote:
> On 09.09.20 15:19, Rik van Riel wrote:
> > On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> > 
> > > That CMA has to be pre-reserved, right? That requires a
> > > configuration.
> > 
> > To some extent, yes.
> > 
> > However, because that pool can be used for movable
> > 4kB and 2MB
> > pages as well as for 1GB pages, it would be easy to just set
> > the size of that pool to eg. 1/3 or even 1/2 of memory for every
> > system.
> > 
> > It isn't like the pool needs to be the exact right size. We
> > just need to avoid the "highmem problem" of having too little
> > memory for kernel allocations.
> > 
> 
> I am not sure I like the trend towards CMA that we are seeing,
> reserving
> huge buffers for specific users (and eventually even doing it
> automatically).
> 
> What we actually want is ZONE_MOVABLE with relaxed guarantees, such
> that
> anybody who requires large, unmovable allocations can use it.
> 
> I once played with the idea of having ZONE_PREFER_MOVABLE, which
> a) Is the primary choice for movable allocations
> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead
> of
> running out of memory
> 
> If someone messes up the zone ratio, issues known from zone
> imbalances
> are avoided - large allocations simply become less likely to succeed.
> In
> contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.

I really like that idea. This will be easier to deal with than
a "just the right size" CMA area, and seems like it would be
pretty forgiving in both directions.

Keeping unmovable allocations contained to one part of memory should
also make compaction within the ZONE_PREFER_MOVABLE area a lot easier
than compaction for higher order allocations is today.

I suspect your proposal solves a lot of issues at once.

For (c) from your proposal, we could even claim a whole
2MB or even 1GB area at once for unmovable allocations,
keeping those contained in a limited amount of physical
memory again, to make life easier on compaction.
David Hildenbrand Sept. 9, 2020, 1:54 p.m. UTC | #43
On 09.09.20 15:49, Rik van Riel wrote:
> On Wed, 2020-09-09 at 15:43 +0200, David Hildenbrand wrote:
>> On 09.09.20 15:19, Rik van Riel wrote:
>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>>>
>>>> That CMA has to be pre-reserved, right? That requires a
>>>> configuration.
>>>
>>> To some extent, yes.
>>>
>>> However, because that pool can be used for movable
>>> 4kB and 2MB
>>> pages as well as for 1GB pages, it would be easy to just set
>>> the size of that pool to eg. 1/3 or even 1/2 of memory for every
>>> system.
>>>
>>> It isn't like the pool needs to be the exact right size. We
>>> just need to avoid the "highmem problem" of having too little
>>> memory for kernel allocations.
>>>
>>
>> I am not sure I like the trend towards CMA that we are seeing,
>> reserving
>> huge buffers for specific users (and eventually even doing it
>> automatically).
>>
>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such
>> that
>> anybody who requires large, unmovable allocations can use it.
>>
>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>> a) Is the primary choice for movable allocations
>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead
>> of
>> running out of memory
>>
>> If someone messes up the zone ratio, issues known from zone
>> imbalances
>> are avoided - large allocations simply become less likely to succeed.
>> In
>> contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.
> 
> I really like that idea. This will be easier to deal with than
> a "just the right size" CMA area, and seems like it would be
> pretty forgiving in both directions.
> 

Yes, and can be extended using memory hotplug.

> Keeping unmovable allocations
> contained to one part of memory
> should also make compaction within the ZONE_PREFER_MOVABLE area
> a lot easier than compaction for higher order allocations is
> today.
> 
> I suspect your proposal solves a lot of issues at once.
> 
> For (c) from your proposal, we could even claim a whole
> 2MB or even 1GB area at once for unmovable allocations,
> keeping those contained in a limited amount of physical
> memory again, to make life easier on compaction.
> 

Exactly, locally limiting unmovable allocations to a sane minimum.

(with some smart extra work, we could even convert ZONE_PREFER_MOVABLE
to ZONE_NORMAL, one memory section/block at a time where needed, that
direction always works. But that's very tricky.)
Michal Hocko Sept. 9, 2020, 1:59 p.m. UTC | #44
On Wed 09-09-20 09:19:16, Rik van Riel wrote:
> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> > On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> > > On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> > > 
> > > > A global knob is insufficient. 1G pages will become a very
> > > > precious
> > > > resource as it requires a pre-allocation (reservation). So it
> > > > really
> > > > has
> > > > to be an opt-in and the question is whether there is also some
> > > > sort
> > > > of
> > > > access control needed.
> > > 
> > > The 1GB pages do not require that much in the way of
> > > pre-allocation. The memory can be obtained through CMA,
> > > which means it can be used for movable 4kB and 2MB
> > > allocations when not
> > > being used for 1GB pages.
> > 
> > That CMA has to be pre-reserved, right? That requires a
> > configuration.
> 
> To some extent, yes.
> 
> However, because that pool can be used for movable
> 4kB and 2MB
> pages as well as for 1GB pages, it would be easy to just set
> the size of that pool to eg. 1/3 or even 1/2 of memory for every
> system.
> 
> It isn't like the pool needs to be the exact right size. We
> just need to avoid the "highmem problem" of having too little
> memory for kernel allocations.

Which is why this is not really suitable for uneducated guesses. It is
really hard to guess the right amount of lowmem. Think of
heavy fs metadata workloads and their memory demand. Memory reclaim
usually struggles when zones are imbalanced from my experience.
Michal Hocko Sept. 10, 2020, 7:32 a.m. UTC | #45
[Cc Vlastimil and Mel - the whole email thread starts
 http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com
 but this particular subthread has diverged a bit and you might find it
 interesting]

On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
> On 09.09.20 15:19, Rik van Riel wrote:
> > On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
> >> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
> >>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
> >>>
> >>>> A global knob is insufficient. 1G pages will become a very
> >>>> precious
> >>>> resource as it requires a pre-allocation (reservation). So it
> >>>> really
> >>>> has
> >>>> to be an opt-in and the question is whether there is also some
> >>>> sort
> >>>> of
> >>>> access control needed.
> >>>
> >>> The 1GB pages do not require that much in the way of
> >>> pre-allocation. The memory can be obtained through CMA,
> >>> which means it can be used for movable 4kB and 2MB
> >>> allocations when not
> >>> being used for 1GB pages.
> >>
> >> That CMA has to be pre-reserved, right? That requires a
> >> configuration.
> > 
> > To some extent, yes.
> > 
> > However, because that pool can be used for movable
> > 4kB and 2MB
> > pages as well as for 1GB pages, it would be easy to just set
> > the size of that pool to eg. 1/3 or even 1/2 of memory for every
> > system.
> > 
> > It isn't like the pool needs to be the exact right size. We
> > just need to avoid the "highmem problem" of having too little
> > memory for kernel allocations.
> > 
> 
> I am not sure I like the trend towards CMA that we are seeing, reserving
> huge buffers for specific users (and eventually even doing it
> automatically).
> 
> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
> anybody who requires large, unmovable allocations can use it.
> 
> I once played with the idea of having ZONE_PREFER_MOVABLE, which
> a) Is the primary choice for movable allocations
> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
> running out of memory

I might be missing something but how can this work longterm? Or put in
other words, why would this work any better than existing fragmentation
avoidance techniques that page allocator implements already - movability
grouping etc. Please note that I am not deeply familiar with those but
my high level understanding is that we already try hard to not mix
movable and unmovable objects in same page blocks as much as we can.

My suspicion is that a separate zone would work in a similar fashion. As
long as there is a lot of free memory then zone will be effectively
MOVABLE. Similar applies to normal zone when unmovable allocations are
in minority. As long as the Normal zone gets full of unmovable objects
they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page
block stealing when unmovable objects start spreading over movable page
blocks.

Again, my level of expertise in the page allocator is quite low so all
the above might be simply wrong...

> If someone messes up the zone ratio, issues known from zone imbalances
> are avoided - large allocations simply become less likely to succeed. In
> contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.
David Hildenbrand Sept. 10, 2020, 8:27 a.m. UTC | #46
On 10.09.20 09:32, Michal Hocko wrote:
> [Cc Vlastimil and Mel - the whole email thread starts
>  http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com
>  but this particular subthread has diverged a bit and you might find it
>  interesting]
> 
> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>> On 09.09.20 15:19, Rik van Riel wrote:
>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>>>> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
>>>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>>>>>
>>>>>> A global knob is insufficient. 1G pages will become a very
>>>>>> precious
>>>>>> resource as it requires a pre-allocation (reservation). So it
>>>>>> really
>>>>>> has
>>>>>> to be an opt-in and the question is whether there is also some
>>>>>> sort
>>>>>> of
>>>>>> access control needed.
>>>>>
>>>>> The 1GB pages do not require that much in the way of
>>>>> pre-allocation. The memory can be obtained through CMA,
>>>>> which means it can be used for movable 4kB and 2MB
>>>>> allocations when not
>>>>> being used for 1GB pages.
>>>>
>>>> That CMA has to be pre-reserved, right? That requires a
>>>> configuration.
>>>
>>> To some extent, yes.
>>>
>>> However, because that pool can be used for movable
>>> 4kB and 2MB
>>> pages as well as for 1GB pages, it would be easy to just set
>>> the size of that pool to eg. 1/3 or even 1/2 of memory for every
>>> system.
>>>
>>> It isn't like the pool needs to be the exact right size. We
>>> just need to avoid the "highmem problem" of having too little
>>> memory for kernel allocations.
>>>
>>
>> I am not sure I like the trend towards CMA that we are seeing, reserving
>> huge buffers for specific users (and eventually even doing it
>> automatically).
>>
>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
>> anybody who requires large, unmovable allocations can use it.
>>
>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>> a) Is the primary choice for movable allocations
>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
>> running out of memory
> 
> I might be missing something but how can this work longterm? Or put in
> another words why would this work any better than existing fragmentation
> avoidance techniques that page allocator implements already - movability
> grouping etc. Please note that I am not deeply familiar with those but
> my high level understanding is that we already try hard to not mix
> movable and unmovable objects in same page blocks as much as we can.

Note that we group in pageblock granularity, which avoids fragmentation
on a pageblock level, not on anything bigger than that. Especially
MAX_ORDER - 1 pages (e.g., on x86-64) and gigantic pages.

So once you run for some time on a system (especially thinking about
page shuffling *within* a zone), trying to allocate a gigantic page will
simply always fail - even if you always had plenty of free memory in
your single zone.

> 
> My suspicion is that a separate zone would work in a similar fashion. As
> long as there is a lot of free memory then zone will be effectively
> MOVABLE. Similar applies to normal zone when unmovable allocations are

Note the difference to MOVABLE: if you really want, you *can* put
movable allocations into that zone. So you can happily allocate gigantic
pages from it. Or anything else you like. As the name suggests "prefer
movable allocations".

> in minority. As long as the Normal zone gets full of unmovable objects
> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page
> block stealing when unmovable objects start spreading over movable page
> blocks.

Right, the long-term goal would be
1. To limit the chance of that happening. (e.g., size it in a way that's
safe for 99.9% of all setups, resize dynamically on demand)
2. To limit the physical area where that is happening (e.g., find lowest
possible pageblock etc.). That's more tricky but I consider this a pure
optimization on top.

As long as we stay in safe zone boundaries you get a benefit in most
scenarios. As soon as we would have a (temporary) workload that would
require more unmovable allocations we would fallback to polluting some
pageblocks only.

> 
> Again, my level of expertise to page allocator is quite low so all the
> above might be simply wrong...

Same over here. I had this idea in my mind for quite a while but
obviously didn't get to figure out the details/implement yet - that's
why I decided to share the basic idea just now.
William Kucharski Sept. 10, 2020, 10:02 a.m. UTC | #47
> On Sep 9, 2020, at 7:27 AM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 09.09.20 15:14, Jason Gunthorpe wrote:
>> On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote:
>> 
>>> But here's the thing ... we already allow
>>> 	mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB)
>>> 
>>> So if we're not doing THP, what's the point of this thread?
>> 
>> I wondered that too..
>> 
>>> An madvise flag is a different beast; that's just letting the kernel
>>> know what the app thinks its behaviour will be.  The kernel can pay
>> 
>> But madvise is too late, the VMA already has an address, if it is not
>> 1G aligned it cannot be 1G THP already.
> 
> That's why user space (like QEMU) is THP-aware and selects an address
> that is aligned to the expected THP granularity (e.g., 2MB on x86_64).


To me it's always seemed like there are two major divisions among THP use
cases:

1) Applications that KNOW they would benefit from use of THPs, so they
call madvise() with an appropriate parameter and explicitly inform the
kernel of such

2) Applications that know nothing about THP but there may be an
advantage that comes from "automatic" THP mapping when possible.

The latter is the approach I am more familiar with; it comes down to:

    1) Is a VMA properly aligned for a (whatever size) THP?

    2) Is the mapping request for a length >= (whatever size) THP?

    3) Let's try allocating memory to map the space using (whatever size)
       THP, and:

        -- If we succeed, great, awesome, let's do it.
        -- If not, no big deal, map using as large a page as we CAN get.
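
Steps 1-3 above can be sketched with today's hugetlb interface standing
in for THP allocation; the helper and its try-big-then-degrade order
are illustrative assumptions:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000
#endif
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_2MB
#define MAP_HUGE_2MB (21 << MAP_HUGE_SHIFT)
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)
#endif

/* Try the biggest page size the request is a multiple of; on failure,
 * "no big deal, map using as large a page as we CAN get". */
void *map_as_large_as_possible(size_t len, size_t *got_pagesz)
{
    static const struct { size_t size; int flag; } sizes[] = {
        { 1UL << 30, MAP_HUGE_1GB },
        { 2UL << 20, MAP_HUGE_2MB },
    };
    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        if (len % sizes[i].size)
            continue;                  /* request too small or unaligned */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                       sizes[i].flag, -1, 0);
        if (p != MAP_FAILED) {
            *got_pagesz = sizes[i].size;
            return p;
        }
    }
    *got_pagesz = 4096;                /* fall back to base pages */
    return mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```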

There of course are myriad performance implications to this. Processes
that start early after boot have a better chance of getting a THP,
but that also means frequently mapped large memory spaces have a better
chance of being mapped in a shared manner via a THP, e.g. libc, X servers
or Firefox/Chrome. It also means that processes that would be mapped
using THPs early in boot may not be if they should crash and need to be
restarted.

There are all sorts of tunables that would likely need to be in place to make
the second approach more viable, but I think it's certainly worth investigating.

The address selection you suggest is the basis of one of the patches I wrote
for a previous iteration of THP support (and that is in Matthew's THP tree)
that will try to round VM addresses to the proper alignment if possible so a
THP can then be used to map the area.
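
The rounding such a patch performs is just an align-up of the candidate
address; a minimal illustrative helper:

```c
#include <assert.h>
#include <stdint.h>

/* Nudge an address up to the next thp_size boundary (thp_size must be
 * a power of two) so a mapping placed there can start huge-aligned. */
uintptr_t thp_round_hint(uintptr_t hint, uintptr_t thp_size)
{
    return (hint + thp_size - 1) & ~(thp_size - 1);
}
```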
Rik van Riel Sept. 10, 2020, 1:32 p.m. UTC | #48
On Thu, 2020-09-10 at 09:32 +0200, Michal Hocko wrote:
> [Cc Vlastimil and Mel - the whole email thread starts
>  http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com
>  but this particular subthread has diverged a bit and you might find
> it
>  interesting]
> 
> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
> > 
> > I am not sure I like the trend towards CMA that we are seeing,
> > reserving
> > huge buffers for specific users (and eventually even doing it
> > automatically).
> > 
> > What we actually want is ZONE_MOVABLE with relaxed guarantees, such
> > that
> > anybody who requires large, unmovable allocations can use it.
> > 
> > I once played with the idea of having ZONE_PREFER_MOVABLE, which
> > a) Is the primary choice for movable allocations
> > b) Is allowed to contain unmovable allocations (esp., gigantic
> > pages)
> > c) Is the fallback for ZONE_NORMAL for unmovable allocations,
> > instead of
> > running out of memory
> 
> I might be missing something but how can this work longterm? Or put
> in
> another words why would this work any better than existing
> fragmentation
> avoidance techniques that page allocator implements already - 

One big difference is reclaim. If ZONE_NORMAL runs low on
free memory, page reclaim would kick in and evict some
movable/reclaimable things, to free up more space for
unmovable allocations.

The current fragmentation avoidance techniques don't do
things like reclaim, or proactively migrating movable
pages out of unmovable page blocks to prevent unmovable
allocations in currently movable page blocks.

> My suspicion is that a separate zone would work in a similar fashion.
> As
> long as there is a lot of free memory then zone will be effectively
> MOVABLE. Similar applies to normal zone when unmovable allocations
> are
> in minority. As long as the Normal zone gets full of unmovable
> objects
> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble
> page
> block stealing when unmovable objects start spreading over movable
> page
> blocks.

You are right, with the difference being reclaim and/or
migration, which could make a real difference in limiting
the number of pageblocks that have unmovable allocations.
Zi Yan Sept. 10, 2020, 2:21 p.m. UTC | #49
On 10 Sep 2020, at 4:27, David Hildenbrand wrote:

> On 10.09.20 09:32, Michal Hocko wrote:
>> [Cc Vlastimil and Mel - the whole email thread starts
>> http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com
>>  but this particular subthread has diverged a bit and you might find it
>>  interesting]
>>
>> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>>> On 09.09.20 15:19, Rik van Riel wrote:
>>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote:
>>>>> On Tue 08-09-20 10:41:10, Rik van Riel wrote:
>>>>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote:
>>>>>>
>>>>>>> A global knob is insufficient. 1G pages will become a very precious
>>>>>>> resource as it requires a pre-allocation (reservation). So it really
>>>>>>> has to be an opt-in and the question is whether there is also some
>>>>>>> sort of access control needed.
>>>>>>
>>>>>> The 1GB pages do not require that much in the way of
>>>>>> pre-allocation. The memory can be obtained through CMA,
>>>>>> which means it can be used for movable 4kB and 2MB
>>>>>> allocations when not being used for 1GB pages.
>>>>>
>>>>> That CMA has to be pre-reserved, right? That requires a
>>>>> configuration.
>>>>
>>>> To some extent, yes.
>>>>
>>>> However, because that pool can be used for movable 4kB and 2MB
>>>> pages as well as for 1GB pages, it would be easy to just set
>>>> the size of that pool to eg. 1/3 or even 1/2 of memory for every
>>>> system.
>>>>
>>>> It isn't like the pool needs to be the exact right size. We
>>>> just need to avoid the "highmem problem" of having too little
>>>> memory for kernel allocations.
>>>>
>>>
>>> I am not sure I like the trend towards CMA that we are seeing, reserving
>>> huge buffers for specific users (and eventually even doing it
>>> automatically).
>>>
>>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that
>>> anybody who requires large, unmovable allocations can use it.
>>>
>>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>>> a) Is the primary choice for movable allocations
>>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of
>>> running out of memory
>>
>> I might be missing something but how can this work longterm? Or put in
>> another words why would this work any better than existing fragmentation
>> avoidance techniques that page allocator implements already - movability
>> grouping etc. Please note that I am not deeply familiar with those but
>> my high level understanding is that we already try hard to not mix
>> movable and unmovable objects in same page blocks as much as we can.
>
> Note that we group in pageblock granularity, which avoids fragmentation
> on a pageblock level, not on anything bigger than that. Especially
> MAX_ORDER - 1 pages (e.g., on x86-64) and gigantic pages.
>
> So once you run for some time on a system (especially thinking about
> page shuffling *within* a zone), trying to allocate a gigantic page will
> simply always fail - even if you always had plenty of free memory in
> your single zone.
>
>>
>> My suspicion is that a separate zone would work in a similar fashion. As
>> long as there is a lot of free memory then zone will be effectively
>> MOVABLE. Similar applies to normal zone when unmovable allocations are
>
> Note the difference to MOVABLE: if you really want, you *can* put
> movable allocations into that zone. So you can happily allocate gigantic
> pages from it. Or anything else you like. As the name suggests "prefer
> movable allocations".
>
>> in minority. As long as the Normal zone gets full of unmovable objects
>> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page
>> block stealing when unmovable objects start spreading over movable page
>> blocks.
>
> Right, the long-term goal would be
> 1. To limit the chance of that happening. (e.g., size it in a way that's
> safe for 99.9% of all setups, resize dynamically on demand)
> 2. To limit the physical area where that is happening (e.g., find lowest
> possible pageblock etc.). That's more tricky but I consider this a pure
> optimization on top.
>
> As long as we stay in safe zone boundaries you get a benefit in most
> scenarios. As soon as we would have a (temporary) workload that would
> require more unmovable allocations we would fallback to polluting some
> pageblocks only.

The idea would work well until unmovable pages begin to overflow into
ZONE_PREFER_MOVABLE, or we move the boundary of ZONE_PREFER_MOVABLE to
avoid unmovable page overflow. The issue comes from the lifetime of the
unmovable pages: since some long-lived ones can sit right at the boundary,
there is no guarantee that ZONE_PREFER_MOVABLE can grow back even after
the other unmovable pages are deallocated. Ultimately, ZONE_PREFER_MOVABLE
would shrink to a small size and we are back to the situation we have now.

OK. I have a stupid question here. Why not just grow the pageblock to a
larger size, like 1GB? The fragmentation of unmovable pages would then
happen at a larger granularity, and it becomes less likely that unmovable
pages are allocated from a movable pageblock, since the kernel gets a
whole 1GB pageblock for them after a single pageblock steal. If the other
kinds of pageblocks run out, movable and reclaimable pages can still fall
back to unmovable pageblocks.
What am I missing here?

Thanks.


—
Best Regards,
Yan Zi
Zi Yan Sept. 10, 2020, 2:30 p.m. UTC | #50
On 10 Sep 2020, at 9:32, Rik van Riel wrote:

> On Thu, 2020-09-10 at 09:32 +0200, Michal Hocko wrote:
>> [Cc Vlastimil and Mel - the whole email thread starts
>> http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com
>>  but this particular subthread has diverged a bit and you might find
>>  it interesting]
>>
>> On Wed 09-09-20 15:43:55, David Hildenbrand wrote:
>>>
>>> I am not sure I like the trend towards CMA that we are seeing,
>>> reserving huge buffers for specific users (and eventually even doing
>>> it automatically).
>>>
>>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such
>>> that anybody who requires large, unmovable allocations can use it.
>>>
>>> I once played with the idea of having ZONE_PREFER_MOVABLE, which
>>> a) Is the primary choice for movable allocations
>>> b) Is allowed to contain unmovable allocations (esp., gigantic pages)
>>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead
>>> of running out of memory
>>
>> I might be missing something but how can this work longterm? Or put in
>> another words why would this work any better than existing fragmentation
>> avoidance techniques that page allocator implements already -
>
> One big difference is reclaim. If ZONE_NORMAL runs low on
> free memory, page reclaim would kick in and evict some
> movable/reclaimable things, to free up more space for
> unmovable allocations.
>
> The current fragmentation avoidance techniques don't do
> things like reclaim, or proactively migrating movable
> pages out of unmovable page blocks to prevent unmovable
> allocations in currently movable page blocks.

Isn’t Mel Gorman’s watermark boost patch[1] (merged about a year ago)
doing what you are describing?


[1]https://lore.kernel.org/linux-mm/20181123114528.28802-1-mgorman@techsingularity.net/


—
Best Regards,
Yan Zi
David Hildenbrand Sept. 10, 2020, 2:34 p.m. UTC | #51
>> As long as we stay in safe zone boundaries you get a benefit in most
>> scenarios. As soon as we would have a (temporary) workload that would
>> require more unmovable allocations we would fallback to polluting some
>> pageblocks only.
> 
> The idea would work well until unmoveable pages begin to overflow into
> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
> avoid unmoveable page overflow. The issue comes from the lifetime of
> the unmoveable pages. Since some long-live ones can be around the boundary,
> there is no guarantee that ZONE_PREFER_MOVABLE cannot grow back
> even if other unmoveable pages are deallocated. Ultimately,
> ZONE_PREFER_MOVABLE would be shrink to a small size and the situation is
> back to what we have now.

As discussed, this would not happen in the usual case if we size it
reasonably. Of course, if you push it to the extreme (which was never
suggested!), you would create a mess. There is always a way to create a
mess if you abuse such a mechanism. Also see Rik's reply regarding reclaim.

> 
> OK. I have a stupid question here. Why not just grow pageblock to a larger
> size, like 1GB? So the fragmentation of unmoveable pages will be at larger
> granularity. But it is less likely unmoveable pages will be allocated at
> a movable pageblock, since the kernel has 1GB pageblock for them after
> a pageblock stealing. If other kinds of pageblocks run out, moveable and
> reclaimable pages can fall back to unmoveable pageblocks.
> What am I missing here?

Oh no. For example pageblocks have to completely fit into a single
section (that's where metadata is maintained). Please refrain from
suggesting to increase the section size ;)

There is plenty of code relying on pageblocks/MAX_ORDER - 1 to be
reasonable in size. Examples in VMs are free page reporting or virtio-mem.
Zi Yan Sept. 10, 2020, 2:41 p.m. UTC | #52
On 10 Sep 2020, at 10:34, David Hildenbrand wrote:

>>> As long as we stay in safe zone boundaries you get a benefit in most
>>> scenarios. As soon as we would have a (temporary) workload that would
>>> require more unmovable allocations we would fallback to polluting some
>>> pageblocks only.
>>
>> The idea would work well until unmoveable pages begin to overflow into
>> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
>> avoid unmoveable page overflow. The issue comes from the lifetime of
>> the unmoveable pages. Since some long-live ones can be around the boundary,
>> there is no guarantee that ZONE_PREFER_MOVABLE cannot grow back
>> even if other unmoveable pages are deallocated. Ultimately,
>> ZONE_PREFER_MOVABLE would be shrink to a small size and the situation is
>> back to what we have now.
>
> As discussed this would not happen in the usual case in case we size it
> reasonable. Of course, if you push it to the extreme (which was never
> suggested!), you would create mess. There is always a way to create a
> mess if you abuse such mechanism. Also see Rik's reply regarding reclaim.
>
>>
>> OK. I have a stupid question here. Why not just grow pageblock to a larger
>> size, like 1GB? So the fragmentation of unmoveable pages will be at larger
>> granularity. But it is less likely unmoveable pages will be allocated at
>> a movable pageblock, since the kernel has 1GB pageblock for them after
>> a pageblock stealing. If other kinds of pageblocks run out, moveable and
>> reclaimable pages can fall back to unmoveable pageblocks.
>> What am I missing here?
>
> Oh no. For example pageblocks have to completely fit into a single
> section (that's where metadata is maintained). Please refrain from
> suggesting to increase the section size ;)

Thank you for the explanation. I was not aware of the restrictions on
pageblocks and sections. Out of curiosity, what prevents the section size
from growing?

—
Best Regards,
Yan Zi
David Hildenbrand Sept. 10, 2020, 3:15 p.m. UTC | #53
On 10.09.20 16:41, Zi Yan wrote:
> On 10 Sep 2020, at 10:34, David Hildenbrand wrote:
> 
>>>> As long as we stay in safe zone boundaries you get a benefit in most
>>>> scenarios. As soon as we would have a (temporary) workload that would
>>>> require more unmovable allocations we would fallback to polluting some
>>>> pageblocks only.
>>>
>>> The idea would work well until unmoveable pages begin to overflow into
>>> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
>>> avoid unmoveable page overflow. The issue comes from the lifetime of
>>> the unmoveable pages. Since some long-live ones can be around the boundary,
>>> there is no guarantee that ZONE_PREFER_MOVABLE cannot grow back
>>> even if other unmoveable pages are deallocated. Ultimately,
>>> ZONE_PREFER_MOVABLE would be shrink to a small size and the situation is
>>> back to what we have now.
>>
>> As discussed this would not happen in the usual case in case we size it
>> reasonable. Of course, if you push it to the extreme (which was never
>> suggested!), you would create mess. There is always a way to create a
>> mess if you abuse such mechanism. Also see Rik's reply regarding reclaim.
>>
>>>
>>> OK. I have a stupid question here. Why not just grow pageblock to a larger
>>> size, like 1GB? So the fragmentation of unmoveable pages will be at larger
>>> granularity. But it is less likely unmoveable pages will be allocated at
>>> a movable pageblock, since the kernel has 1GB pageblock for them after
>>> a pageblock stealing. If other kinds of pageblocks run out, moveable and
>>> reclaimable pages can fall back to unmoveable pageblocks.
>>> What am I missing here?
>>
>> Oh no. For example pageblocks have to completely fit into a single
>> section (that's where metadata is maintained). Please refrain from
>> suggesting to increase the section size ;)
> 
> Thank you for the explanation. I have no idea about the restrictions on
> pageblock and section. Out of curiosity, what prevents the growth of
> the section size?

The section size (and based on that the Linux memory block size) defines
- the minimum size in which we can add_memory()
- the alignment requirement in which we can add_memory()

This is applicable
- in physical environments, where the BIOS will decide where to place
  DIMMs/NVDIMMs. The coarser the granularity, the less memory we might
  be able to make use of in corner cases.
- in virtualized environments, where we want to add memory in fairly
  small granularity. The coarser the granularity, the less flexibility
  we have.

arm64 has a section size of 1GB (and a THP/MAX_ORDER - 1 size of 512MB
with 64k base pages :/ ). That has already turned out to be a problem - see
[1] regarding thoughts on how to shrink the section size. I once read
about thoughts of switching to 2MB THP on arm64 with any base page size;
not sure if that will become real at some point (and whether we might then
be able to reduce the pageblock size there as well ... )

[1]
https://lkml.kernel.org/r/AM6PR08MB40690714A2E77A7128B2B2ADF7700@AM6PR08MB4069.eurprd08.prod.outlook.com

> 
> —
> Best Regards,
> Yan Zi
>