Message ID: 20200902180628.4052244-1-zi.yan@sent.com (mailing list archive)
Series: 1GB THP support on x86_64

On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
>
> Hi all,
>
> This patchset adds support for 1GB THP on x86_64. It is on top of
> v5.9-rc2-mmots-2020-08-25-21-13.
>
> Compared to hugetlb, 1GB THP is a more flexible way to reduce translation
> overhead and increase the performance of applications with large memory
> footprints, without requiring application changes.
>
> Design
> =======
>
> The 1GB THP implementation looks similar to the existing THP code, except for
> some new designs needed for the additional page table level.
>
> 1. Page table deposit and withdraw using a new pagechain data structure:
>    instead of one PTE page table page, a 1GB THP requires 513 page table pages
>    (one PMD page table page and 512 PTE page table pages) to be deposited
>    at page allocation time, so that we can split the page later. Currently,
>    the page table deposit uses ->lru, thus only one page can be deposited.
>    A new pagechain data structure is added to enable multi-page deposit.
>
> 2. Triple-mapped 1GB THP: a 1GB THP can be mapped by a combination of PUD, PMD,
>    and PTE entries. Mixing PUD and PTE mappings can be achieved with the existing
>    PageDoubleMap mechanism. To add PMD mappings, PMDPageInPUD and
>    sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
>    page in a 1GB THP, and sub_compound_mapcount counts the PMD mappings by using
>    page[N*512 + 3].compound_mapcount.
>
> 3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is more sane
>    to use something less intrusive. So all 1GB THPs are allocated from reserved
>    CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB
>    THP is cleared, as the resulting pages can be freed via the normal page free path.
>    We can fall back to alloc_contig_pages for 1GB THP if necessary.
>
>
> Patch Organization
> =======
>
> Patch 01 adds the new pagechain data structure.
>
> Patches 02 to 13 add 1GB THP support in various places.
>
> Patch 14 tries to use alloc_contig_pages for 1GB THP allocation.
>
> Patch 15 moves the hugetlb_cma reservation to cma.c and renames it to hugepage_cma.
>
> Patch 16 uses the hugepage_cma reservation for 1GB THP allocation.
>
>
> Any suggestions and comments are welcome.
>
>
> Zi Yan (16):
>   mm: add pagechain container for storing multiple pages.
>   mm: thp: 1GB anonymous page implementation.
>   mm: proc: add 1GB THP kpageflag.
>   mm: thp: 1GB THP copy on write implementation.
>   mm: thp: handling 1GB THP reference bit.
>   mm: thp: add 1GB THP split_huge_pud_page() function.
>   mm: stats: make smap stats understand PUD THPs.
>   mm: page_vma_walk: teach it about PMD-mapped PUD THP.
>   mm: thp: 1GB THP support in try_to_unmap().
>   mm: thp: split 1GB THPs at page reclaim.
>   mm: thp: 1GB THP follow_p*d_page() support.
>   mm: support 1GB THP pagemap support.
>   mm: thp: add a knob to enable/disable 1GB THPs.
>   mm: page_alloc: >=MAX_ORDER pages allocation and deallocation.
>   hugetlb: cma: move cma reserve function to cma.c.
>   mm: thp: use cma reservation for pud thp allocation.

Surprised this doesn't touch mm/pagewalk.c ?

Jason
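For orientation, design point 1 above only describes the multi-page deposit at a high level. A rough, hypothetical sketch of what such a container could look like follows; the names, the chain size, and the helper are invented here, and the actual include/linux/pagechain.h added by patch 01 may differ:

/*
 * Hypothetical sketch of a multi-page deposit container, based only on the
 * description in the cover letter; patch 01 may organize this differently.
 */
#include <linux/list.h>
#include <linux/types.h>
#include <linux/mm_types.h>

#define PAGECHAIN_SIZE	64	/* pages carried per chain element (assumed) */

struct pagechain {
	struct list_head list;			/* links chain elements together */
	unsigned int size;			/* number of pages currently stored */
	struct page *pages[PAGECHAIN_SIZE];	/* deposited page table pages */
};

/* Deposit one preallocated page table page for a later 1GB THP split. */
static inline bool pagechain_deposit(struct pagechain *chain, struct page *page)
{
	if (chain->size >= PAGECHAIN_SIZE)
		return false;	/* caller must allocate a new chain element */
	chain->pages[chain->size++] = page;
	return true;
}

A 1GB THP would need 513 such deposited pages (one PMD page table page plus 512 PTE page table pages), so several chain elements would be linked via the list head.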
On 2 Sep 2020, at 14:40, Jason Gunthorpe wrote:

> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi all,
>>
>> This patchset adds support for 1GB THP on x86_64. It is on top of
>> v5.9-rc2-mmots-2020-08-25-21-13.
>>
>> [...]
>
> Surprised this doesn't touch mm/pagewalk.c ?

1GB PUD page support is present for DAX purposes, so the code is already there
in mm/pagewalk.c. I only needed to supply ops->pud_entry when using the
functions in mm/pagewalk.c. :)

—
Best Regards,
Yan Zi
On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
>> Surprised this doesn't touch mm/pagewalk.c ?
>
> 1GB PUD page support is present for DAX purposes, so the code is already there
> in mm/pagewalk.c. I only needed to supply ops->pud_entry when using the
> functions in mm/pagewalk.c. :)

Yes, but doesn't this change what is possible under the mmap_sem
without the page table locks?

ie I would expect something like pmd_trans_unstable() to be required
as well for lockless walkers. (and I don't think the pmd code is 100%
right either)

Jason
On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:

> On Wed, Sep 02, 2020 at 02:45:37PM -0400, Zi Yan wrote:
>
>>> Surprised this doesn't touch mm/pagewalk.c ?
>>
>> 1GB PUD page support is present for DAX purposes, so the code is already there
>> in mm/pagewalk.c. I only needed to supply ops->pud_entry when using the
>> functions in mm/pagewalk.c. :)
>
> Yes, but doesn't this change what is possible under the mmap_sem
> without the page table locks?
>
> ie I would expect something like pmd_trans_unstable() to be required
> as well for lockless walkers. (and I don't think the pmd code is 100%
> right either)

Right. I missed that. Thanks for pointing it out.
Something like this, right?

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..4fe6ce4a92eb 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -152,10 +152,11 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		    !(ops->pmd_entry || ops->pte_entry))
 			continue;
 
-		if (walk->vma)
+		if (walk->vma) {
 			split_huge_pud(walk->vma, pud, addr);
-		if (pud_none(*pud))
-			goto again;
+			if (pud_trans_unstable(pud))
+				goto again;
+		}
 
 		err = walk_pmd_range(pud, addr, next, walk);
 		if (err)

—
Best Regards,
Yan Zi
On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
> On 2 Sep 2020, at 14:48, Jason Gunthorpe wrote:
>
>> ie I would expect something like pmd_trans_unstable() to be required
>> as well for lockless walkers. (and I don't think the pmd code is 100%
>> right either)
>
> Right. I missed that. Thanks for pointing it out.
> Something like this, right?

Technically all those *pud's are racy too; the design here with the
_unstable function call always seemed weird. I strongly suspect it
should mirror how get_user_pages_fast works for lockless walking

Jason
On 2 Sep 2020, at 15:57, Jason Gunthorpe wrote:

> On Wed, Sep 02, 2020 at 03:05:39PM -0400, Zi Yan wrote:
>> [...]
>
> Technically all those *pud's are racy too; the design here with the
> _unstable function call always seemed weird. I strongly suspect it
> should mirror how get_user_pages_fast works for lockless walking

You mean READ_ONCE on the page table entry pointer first, then use the value
for the rest of the loop? I am not quite familiar with this racy-check part
of the code and am happy to hear more about it.

—
Best Regards,
Yan Zi
On Wed 02-09-20 14:06:12, Zi Yan wrote: > From: Zi Yan <ziy@nvidia.com> > > Hi all, > > This patchset adds support for 1GB THP on x86_64. It is on top of > v5.9-rc2-mmots-2020-08-25-21-13. > > 1GB THP is more flexible for reducing translation overhead and increasing the > performance of applications with large memory footprint without application > changes compared to hugetlb. Please be more specific about usecases. This better have some strong ones because THP code is complex enough already to add on top solely based on a generic TLB pressure easing. > Design > ======= > > 1GB THP implementation looks similar to exiting THP code except some new designs > for the additional page table level. > > 1. Page table deposit and withdraw using a new pagechain data structure: > instead of one PTE page table page, 1GB THP requires 513 page table pages > (one PMD page table page and 512 PTE page table pages) to be deposited > at the page allocaiton time, so that we can split the page later. Currently, > the page table deposit is using ->lru, thus only one page can be deposited. > A new pagechain data structure is added to enable multi-page deposit. > > 2. Triple mapped 1GB THP : 1GB THP can be mapped by a combination of PUD, PMD, > and PTE entries. Mixing PUD an PTE mapping can be achieved with existing > PageDoubleMap mechanism. To add PMD mapping, PMDPageInPUD and > sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base > page in a 1GB THP and sub_compound_mapcount counts the PMD mapping by using > page[N*512 + 3].compound_mapcount. > > 3. Using CMA allocaiton for 1GB THP: instead of bump MAX_ORDER, it is more sane > to use something less intrusive. So all 1GB THPs are allocated from reserved > CMA areas shared with hugetlb. At page splitting time, the bitmap for the 1GB > THP is cleared as the resulting pages can be freed via normal page free path. > We can fall back to alloc_contig_pages for 1GB THP if necessary. Do those pages get instantiated during the page fault or only via khugepaged? This is an important design detail because then we have to think carefully about how much automatic we want this to be. Memory overhead can be quite large with 2MB THPs already. Also what about the allocation overhead? Do you have any numbers? Maybe all these details are described in the patcheset but the cover letter should contain all that information. It doesn't make much sense to dig into details in a patchset this large without having an idea how feasible this is. Thanks. > Patch Organization > ======= > > Patch 01 adds the new pagechain data structure. > > Patch 02 to 13 adds 1GB THP support in variable places. > > Patch 14 tries to use alloc_contig_pages for 1GB THP allocaiton. > > Patch 15 moves hugetlb_cma reservation to cma.c and rename it to hugepage_cma. > > Patch 16 use hugepage_cma reservation for 1GB THP allocation. > > > Any suggestions and comments are welcome. > > > Zi Yan (16): > mm: add pagechain container for storing multiple pages. > mm: thp: 1GB anonymous page implementation. > mm: proc: add 1GB THP kpageflag. > mm: thp: 1GB THP copy on write implementation. > mm: thp: handling 1GB THP reference bit. > mm: thp: add 1GB THP split_huge_pud_page() function. > mm: stats: make smap stats understand PUD THPs. > mm: page_vma_walk: teach it about PMD-mapped PUD THP. > mm: thp: 1GB THP support in try_to_unmap(). > mm: thp: split 1GB THPs at page reclaim. > mm: thp: 1GB THP follow_p*d_page() support. > mm: support 1GB THP pagemap support. 
> mm: thp: add a knob to enable/disable 1GB THPs. > mm: page_alloc: >=MAX_ORDER pages allocation an deallocation. > hugetlb: cma: move cma reserve function to cma.c. > mm: thp: use cma reservation for pud thp allocation. > > .../admin-guide/kernel-parameters.txt | 2 +- > arch/arm64/mm/hugetlbpage.c | 2 +- > arch/powerpc/mm/hugetlbpage.c | 2 +- > arch/x86/include/asm/pgalloc.h | 68 ++ > arch/x86/include/asm/pgtable.h | 26 + > arch/x86/kernel/setup.c | 8 +- > arch/x86/mm/pgtable.c | 38 + > drivers/base/node.c | 3 + > fs/proc/meminfo.c | 2 + > fs/proc/page.c | 2 + > fs/proc/task_mmu.c | 122 ++- > include/linux/cma.h | 18 + > include/linux/huge_mm.h | 84 +- > include/linux/hugetlb.h | 12 - > include/linux/memcontrol.h | 5 + > include/linux/mm.h | 29 +- > include/linux/mm_types.h | 1 + > include/linux/mmu_notifier.h | 13 + > include/linux/mmzone.h | 1 + > include/linux/page-flags.h | 47 + > include/linux/pagechain.h | 73 ++ > include/linux/pgtable.h | 34 + > include/linux/rmap.h | 10 +- > include/linux/swap.h | 2 + > include/linux/vm_event_item.h | 7 + > include/uapi/linux/kernel-page-flags.h | 2 + > kernel/events/uprobes.c | 4 +- > kernel/fork.c | 5 + > mm/cma.c | 119 +++ > mm/gup.c | 60 +- > mm/huge_memory.c | 939 +++++++++++++++++- > mm/hugetlb.c | 114 +-- > mm/internal.h | 2 + > mm/khugepaged.c | 6 +- > mm/ksm.c | 4 +- > mm/memcontrol.c | 13 + > mm/memory.c | 51 +- > mm/mempolicy.c | 21 +- > mm/migrate.c | 12 +- > mm/page_alloc.c | 57 +- > mm/page_vma_mapped.c | 129 ++- > mm/pgtable-generic.c | 56 ++ > mm/rmap.c | 289 ++++-- > mm/swap.c | 31 + > mm/swap_slots.c | 2 + > mm/swapfile.c | 8 +- > mm/userfaultfd.c | 2 +- > mm/util.c | 16 +- > mm/vmscan.c | 58 +- > mm/vmstat.c | 8 + > 50 files changed, 2270 insertions(+), 349 deletions(-) > create mode 100644 include/linux/pagechain.h > > -- > 2.28.0 >
On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
>
> This patchset adds support for 1GB THP on x86_64. It is on top of
> v5.9-rc2-mmots-2020-08-25-21-13.
>
> Compared to hugetlb, 1GB THP is a more flexible way to reduce translation
> overhead and increase the performance of applications with large memory
> footprints, without requiring application changes.

This statement needs a lot of justification. I don't see 1GB THP as viable
for any workload. Opportunistic 1GB allocation is a very questionable
strategy.

> 1. Page table deposit and withdraw using a new pagechain data structure:
>    instead of one PTE page table page, a 1GB THP requires 513 page table pages
>    (one PMD page table page and 512 PTE page table pages) to be deposited
>    at page allocation time, so that we can split the page later. Currently,
>    the page table deposit uses ->lru, thus only one page can be deposited.

False. The current code can deposit an arbitrary number of page tables.

What may be a problem for you is that these page tables are tied to the
struct page of the PMD page table.

> 2. Triple-mapped 1GB THP: a 1GB THP can be mapped by a combination of PUD, PMD,
>    and PTE entries. [...]

I had a hard time reasoning about DoubleMap vs. rmap. Good for you if you
get it right.

> 3. Using CMA allocation for 1GB THP: [...]
On Thu, Sep 03, 2020 at 09:32:54AM +0200, Michal Hocko wrote:
> On Wed 02-09-20 14:06:12, Zi Yan wrote:
>> Compared to hugetlb, 1GB THP is a more flexible way to reduce translation
>> overhead and increase the performance of applications with large memory
>> footprints, without requiring application changes.
>
> Please be more specific about usecases. This better have some strong
> ones because THP code is complex enough already to add on top solely
> based on a generic TLB pressure easing.

Hello, Michal!

We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
performance wins on some workloads.

Historically we allocated gigantic pages at boot time, but recently moved
to a CMA-based dynamic approach. Still, the hugetlbfs interface requires more
management than we would like to do. 1 GB THP seems to be a better
alternative, so I definitely see it as a very useful feature.

Given the cost of an allocation, I'm slightly skeptical about an automatic,
heuristics-based approach, but if an application can explicitly mark target
areas with madvise(), I don't see why it wouldn't work.

In our case we'd like to have a reliable way to get 1 GB THPs at some point
(usually at the start of an application), and transparently destroy them on
application exit.

Once the patchset is in relatively good shape, I'll be happy to give it a
test in our environment and share the results.

Thanks!
On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote:
>> Compared to hugetlb, 1GB THP is a more flexible way to reduce translation
>> overhead and increase the performance of applications with large memory
>> footprints, without requiring application changes.
>
> This statement needs a lot of justification. I don't see 1GB THP as viable
> for any workload. Opportunistic 1GB allocation is a very questionable
> strategy.

Hello, Kirill!

I share your skepticism about opportunistic 1 GB allocations, however it might
be useful if backed by madvise() annotations from the userspace application.
In this case, 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but
with a more convenient interface.

Thanks!
On Wed, Sep 02, 2020 at 04:29:46PM -0400, Zi Yan wrote:
> On 2 Sep 2020, at 15:57, Jason Gunthorpe wrote:
>
>> Technically all those *pud's are racy too; the design here with the
>> _unstable function call always seemed weird. I strongly suspect it
>> should mirror how get_user_pages_fast works for lockless walking
>
> You mean READ_ONCE on the page table entry pointer first, then use the value
> for the rest of the loop? I am not quite familiar with this racy-check part
> of the code and am happy to hear more about it.

There are two main issues with THPs and lockless walks:

 - The *pXX value can change at any time, as THPs can be split at any
   moment. However, once observed to be a sub page table pointer, the value
   is fixed under the read side of the mmap lock (I think; I never did find
   the code path supporting this, but everything is busted if it isn't
   true...)

 - Reading the *pXX without load tearing is difficult on 32-bit arches

So if you do READ_ONCE() it defeats the first problem.

However, if sizeof(*pXX) is 8 on a 32-bit platform, then load tearing is a
problem. At least the various pXX_*() test functions operate on a single
32-bit word so they don't tear, but to convert the *pXX to a lower level
page table pointer a coherent, untorn read is required.

So, looking again, I remember now, I could never quite figure out why
gup_pmd_range() was safe to do:

		pmd_t pmd = READ_ONCE(*pmdp);
[..]
		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
[..]
	ptem = ptep = pte_offset_map(&pmd, addr);

As I don't see what prevents load tearing a 64-bit pmd.. Eg no
pmd_trans_unstable() or equivalent here.

But we see gup_get_pte() using an anti-load tearing technique..

Jason
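For reference, the anti-load-tearing technique mentioned here is roughly the following pattern; this is a simplified sketch of what gup_get_pte() in mm/gup.c does on configurations where a pte_t is wider than the native word (e.g. 32-bit with 64-bit page table entries), and the in-tree version differs in details:

/*
 * Simplified sketch of a lockless, tear-free PTE read for GUP-fast.
 * Read the two halves separately and retry until the low word is stable,
 * so the combined value is coherent even with no page table lock held.
 */
static inline pte_t gup_get_pte(pte_t *ptep)
{
	pte_t pte;

	do {
		pte.pte_low = ptep->pte_low;
		smp_rmb();
		pte.pte_high = ptep->pte_high;
		smp_rmb();
	} while (unlikely(pte.pte_low != ptep->pte_low));

	return pte;
}

On 64-bit configurations, where a plain READ_ONCE() of the entry cannot tear, the equivalent is simply returning READ_ONCE(*ptep).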
On Thu, Sep 03, 2020 at 09:25:27AM -0700, Roman Gushchin wrote:
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.

At least from a RDMA NIC perspective I've heard from a lot of users
that higher order pages at the DMA level is giving big speed ups too.

It is basically the same dynamic as CPU TLB, except missing a 'TLB'
cache in a PCI-E device is dramatically more expensive to refill. With
200G and soon 400G networking these misses are a growing problem.

With HPC nodes now pushing 1TB of actual physical RAM and single
applications basically using all of it, there is definitely some
meaningful return - if pages can be reliably available.

At least for HPC where the node returns to an idle state after each
job and most of the 1TB memory becomes freed up again, it seems more
believable to me that a large cache of 1G pages could be available?

Even triggering some kind of cleaner between jobs to defragment could
be a reasonable approach..

Jason
On Thu, Sep 03, 2020 at 01:40:32PM -0300, Jason Gunthorpe wrote:
> However, if sizeof(*pXX) is 8 on a 32-bit platform, then load tearing is a
> problem. At least the various pXX_*() test functions operate on a single
> 32-bit word so they don't tear, but to convert the *pXX to a lower level
> page table pointer a coherent, untorn read is required.
>
> So, looking again, I remember now, I could never quite figure out why
> gup_pmd_range() was safe to do:
>
> 		pmd_t pmd = READ_ONCE(*pmdp);
> [..]
> 		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
> [..]
> 	ptem = ptep = pte_offset_map(&pmd, addr);
>
> As I don't see what prevents load tearing a 64-bit pmd.. Eg no
> pmd_trans_unstable() or equivalent here.

I don't think there are any 32-bit page tables which support a PUD-sized
page. Pretty sure x86 doesn't until you get to 4- or 5-level page tables
(which need you to be running in 64-bit mode). There's not much utility
in having 1GB of your 3GB process address space taken up by a single page.

I'm OK if there are some oddball architectures which support it, but
Linux doesn't.
On Thu, Sep 03, 2020 at 01:50:51PM -0300, Jason Gunthorpe wrote:
> At least from a RDMA NIC perspective I've heard from a lot of users
> that higher order pages at the DMA level is giving big speed ups too.
>
> [...]
>
> At least for HPC where the node returns to an idle state after each
> job and most of the 1TB memory becomes freed up again, it seems more
> believable to me that a large cache of 1G pages could be available?

You may be interested in trying out my current THP patchset:

http://git.infradead.org/users/willy/pagecache.git

It doesn't allocate pages larger than PMD size, but it does allocate pages
*up to* PMD size for the page cache, which means that larger pages are
easier to create as larger pages aren't fragmented all over the system.

If someone wants to opportunistically allocate pages larger than PMD
size, I've put some preliminary support in for that, but I've never
tested any of it. That's not my goal at the moment.

I'm not clear whether these HPC users primarily use page cache or
anonymous memory (with O_DIRECT). Probably a mixture.
On Thu, Sep 03, 2020 at 05:55:59PM +0100, Matthew Wilcox wrote:
> On Thu, Sep 03, 2020 at 01:40:32PM -0300, Jason Gunthorpe wrote:
>> However, if sizeof(*pXX) is 8 on a 32-bit platform, then load tearing is a
>> problem. At least the various pXX_*() test functions operate on a single
>> 32-bit word so they don't tear, but to convert the *pXX to a lower level
>> page table pointer a coherent, untorn read is required.
>>
>> So, looking again, I remember now, I could never quite figure out why
>> gup_pmd_range() was safe to do:
>>
>> 		pmd_t pmd = READ_ONCE(*pmdp);
>> [..]
>> 		} else if (!gup_pte_range(pmd, addr, next, flags, pages, nr))
>> [..]
>> 	ptem = ptep = pte_offset_map(&pmd, addr);
>>
>> As I don't see what prevents load tearing a 64-bit pmd.. Eg no
>> pmd_trans_unstable() or equivalent here.
>
> I don't think there are any 32-bit page tables which support a PUD-sized
> page. Pretty sure x86 doesn't until you get to 4- or 5-level page tables
> (which need you to be running in 64-bit mode). There's not much utility
> in having 1GB of your 3GB process address space taken up by a single page.

Makes sense for PUD, but why is the above GUP code OK for PMD?
pmd_trans_unstable() exists specifically to close read tearing races, so it
looks like a real problem?

> I'm OK if there are some oddball architectures which support it, but
> Linux doesn't.

So, based on that observation, I think something approximately like this is
needed for the page walker for PUD:

(this has been on my backlog to return to these patches..)

From 00a361ecb2d9e1226600d9e78e6e1803a886f2d6 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@mellanox.com>
Date: Fri, 13 Mar 2020 13:15:36 -0300
Subject: [RFC] mm/pagewalk: use READ_ONCE when reading the PUD entry unlocked

The pagewalker runs while only holding the mmap_sem for read. The pud can
be set asynchronously, while also holding the mmap_sem for read eg from:

 handle_mm_fault()
  __handle_mm_fault()
   create_huge_pmd()
    dev_dax_huge_fault()
     __dev_dax_pud_fault()
      vmf_insert_pfn_pud()
       insert_pfn_pud()
        pud_lock()
        set_pud_at()

At least x86 sets the PUD using WRITE_ONCE(), so an unlocked read of
unstable data should be paired to use READ_ONCE().

For the pagewalker to work locklessly the PUD must work similarly to the
PMD: once the PUD entry becomes a pointer to a PMD, it must be stable, and
safe to pass to pmd_offset()

Passing the value from READ_ONCE into the callbacks prevents the callers
from seeing inconsistencies after they re-read, such as seeing pud_none().

If a callback does obtain the pud_lock then it should trigger ACTION_AGAIN
if a data race caused the original value to change.

Use the same pattern as gup_pmd_range() and pass in the address of the
local READ_ONCE stack variable to pmd_offset() to avoid reading it again.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/pagewalk.h   |  2 +-
 mm/hmm.c                   | 16 +++++++---------
 mm/mapping_dirty_helpers.c |  6 ++----
 mm/pagewalk.c              | 28 ++++++++++++++++------------
 mm/ptdump.c                |  3 +--
 5 files changed, 27 insertions(+), 28 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b1cb6b753abb53..6caf28aadafbff 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -39,7 +39,7 @@ struct mm_walk_ops {
 			 unsigned long next, struct mm_walk *walk);
 	int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
-	int (*pud_entry)(pud_t *pud, unsigned long addr,
+	int (*pud_entry)(pud_t pud, pud_t *pudp, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
 	int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
 			 unsigned long next, struct mm_walk *walk);
diff --git a/mm/hmm.c b/mm/hmm.c
index 6d9da4b0f0a9f8..98ced96421b913 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -459,28 +459,26 @@ static inline uint64_t pud_to_hmm_pfn_flags(struct hmm_range *range, pud_t pud)
 		range->flags[HMM_PFN_VALID];
 }
 
-static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
-		struct mm_walk *walk)
+static int hmm_vma_walk_pud(pud_t pud, pud_t *pudp, unsigned long start,
+			    unsigned long end, struct mm_walk *walk)
 {
 	struct hmm_vma_walk *hmm_vma_walk = walk->private;
 	struct hmm_range *range = hmm_vma_walk->range;
 	unsigned long addr = start;
-	pud_t pud;
 	int ret = 0;
 	spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);
 
 	if (!ptl)
 		return 0;
+	if (memcmp(pudp, &pud, sizeof(pud)) != 0) {
+		walk->action = ACTION_AGAIN;
+		spin_unlock(ptl);
+		return 0;
+	}
 
 	/* Normally we don't want to split the huge page */
 	walk->action = ACTION_CONTINUE;
 
-	pud = READ_ONCE(*pudp);
-	if (pud_none(pud)) {
-		spin_unlock(ptl);
-		return hmm_vma_walk_hole(start, end, -1, walk);
-	}
-
 	if (pud_huge(pud) && pud_devmap(pud)) {
 		unsigned long i, npages, pfn;
 		uint64_t *pfns, cpu_flags;
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 71070dda9643d4..8943c2509ec0f7 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -125,12 +125,10 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
 }
 
 /*
  * wp_clean_pud_entry - The pagewalk pud callback.
  */
-static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
-			      struct mm_walk *walk)
+static int wp_clean_pud_entry(pud_t pudval, pud_t *pudp, unsigned long addr,
+			      unsigned long end, struct mm_walk *walk)
 {
 	/* Dirty-tracking should be handled on the pte level */
-	pud_t pudval = READ_ONCE(*pud);
-
 	if (pud_trans_huge(pudval) || pud_devmap(pudval))
 		WARN_ON(pud_write(pudval) || pud_dirty(pudval));
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 928df1638c30d1..cf99536cec23be 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	return err;
 }
 
-static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
+static int walk_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
 	pmd_t *pmd;
@@ -67,7 +67,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	int err = 0;
 	int depth = real_depth(3);
 
-	pmd = pmd_offset(pud, addr);
+	pmd = pmd_offset(&pud, addr);
 	do {
 again:
 		next = pmd_addr_end(addr, end);
@@ -119,17 +119,19 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 			  struct mm_walk *walk)
 {
-	pud_t *pud;
+	pud_t *pudp;
+	pud_t pud;
 	unsigned long next;
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 	int depth = real_depth(2);
 
-	pud = pud_offset(p4d, addr);
+	pudp = pud_offset(p4d, addr);
 	do {
  again:
+		pud = READ_ONCE(*pudp);
 		next = pud_addr_end(addr, end);
-		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
+		if (pud_none(pud) || (!walk->vma && !walk->no_vma)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
@@ -140,27 +142,29 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 		walk->action = ACTION_SUBTREE;
 
 		if (ops->pud_entry)
-			err = ops->pud_entry(pud, addr, next, walk);
+			err = ops->pud_entry(pud, pudp, addr, next, walk);
 		if (err)
 			break;
 
 		if (walk->action == ACTION_AGAIN)
 			goto again;
 
-		if ((!walk->vma && (pud_leaf(*pud) || !pud_present(*pud))) ||
+		if ((!walk->vma && (pud_leaf(pud) || !pud_present(pud))) ||
 		    walk->action == ACTION_CONTINUE ||
 		    !(ops->pmd_entry || ops->pte_entry))
 			continue;
 
-		if (walk->vma)
-			split_huge_pud(walk->vma, pud, addr);
-		if (pud_none(*pud))
-			goto again;
+		if (walk->vma) {
+			split_huge_pud(walk->vma, pudp, addr);
+			pud = READ_ONCE(*pudp);
+			if (pud_none(pud))
+				goto again;
+		}
 
 		err = walk_pmd_range(pud, addr, next, walk);
 		if (err)
 			break;
-	} while (pud++, addr = next, addr != end);
+	} while (pudp++, addr = next, addr != end);
 
 	return err;
 }
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 26208d0d03b7a9..c5e1717671e36a 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -59,11 +59,10 @@ static int ptdump_p4d_entry(p4d_t *p4d, unsigned long addr,
 	return 0;
 }
 
-static int ptdump_pud_entry(pud_t *pud, unsigned long addr,
+static int ptdump_pud_entry(pud_t val, pud_t *pudp, unsigned long addr,
 			    unsigned long next, struct mm_walk *walk)
 {
 	struct ptdump_state *st = walk->private;
-	pud_t val = READ_ONCE(*pud);
 
 #if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_KASAN)
 	if (pud_page(val) == virt_to_page(lm_alias(kasan_early_shadow_pmd)))
On Thu, Sep 03, 2020 at 06:01:57PM +0100, Matthew Wilcox wrote:
> You may be interested in trying out my current THP patchset:
>
> http://git.infradead.org/users/willy/pagecache.git
>
> It doesn't allocate pages larger than PMD size, but it does allocate pages
> *up to* PMD size for the page cache, which means that larger pages are
> easier to create as larger pages aren't fragmented all over the system.

Yeah, I saw that, it looks like a great direction.

> If someone wants to opportunistically allocate pages larger than PMD
> size, I've put some preliminary support in for that, but I've never
> tested any of it. That's not my goal at the moment.
>
> I'm not clear whether these HPC users primarily use page cache or
> anonymous memory (with O_DIRECT). Probably a mixture.

There are definitely HPC systems now that are filesystem-less - they import
data for computation from the network using things like blob storage or some
other kind of non-POSIX, userspace-based data storage scheme.

Jason
On 9/3/20 9:25 AM, Roman Gushchin wrote:
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.
>
> Historically we allocated gigantic pages at boot time, but recently moved
> to a CMA-based dynamic approach. Still, the hugetlbfs interface requires more
> management than we would like to do. 1 GB THP seems to be a better
> alternative, so I definitely see it as a very useful feature.
>
> Given the cost of an allocation, I'm slightly skeptical about an automatic,
> heuristics-based approach, but if an application can explicitly mark target
> areas with madvise(), I don't see why it wouldn't work.
>
> In our case we'd like to have a reliable way to get 1 GB THPs at some point
> (usually at the start of an application), and transparently destroy them on
> application exit.

Hi Roman,

In your current use case at Facebook, are you adding 1G hugetlb pages to
the hugetlb pool and then using them within applications? Or, are you
dynamically allocating them at fault time (hugetlb overcommit/surplus)?

Latency time for use of such pages includes:
- Putting together 1G contiguous
- Clearing 1G memory

In the 'allocation at fault time' mode you incur both costs at fault time.
If using pages from the pool, your only cost at fault time is clearing the
page.
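For a rough sense of scale (the numbers here are assumed, not taken from the thread): clearing 1 GB at a typical memset bandwidth of roughly 10 GB/s costs on the order of 100 ms per page, and that part stays on the fault path even when the page comes from a pre-assembled pool; only the cost of putting the 1 GB of contiguous memory together is moved off the fault path.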
On Thu, Sep 03, 2020 at 01:57:54PM -0700, Mike Kravetz wrote:
> In your current use case at Facebook, are you adding 1G hugetlb pages to
> the hugetlb pool and then using them within applications? Or, are you
> dynamically allocating them at fault time (hugetlb overcommit/surplus)?
>
> Latency time for use of such pages includes:
> - Putting together 1G contiguous
> - Clearing 1G memory
>
> In the 'allocation at fault time' mode you incur both costs at fault time.
> If using pages from the pool, your only cost at fault time is clearing the
> page.

Hi Mike,

We're using a pool. By dynamic I mean that gigantic pages are not allocated
at boot time.

Thanks!
On Thu 03-09-20 09:25:27, Roman Gushchin wrote:
> We at Facebook are using 1 GB hugetlbfs pages and are getting noticeable
> performance wins on some workloads.

Let me clarify. I am not questioning 1GB (or large) pages in general. I
believe it is quite clear that there are usecases which hugely benefit from
them. I am mostly asking about the transparent part of it, which
traditionally means that userspace mostly doesn't have to care in order to
get them. 2MB THPs have established certain expectations, mostly a really
aggressive, pro-active instantiation. This has bitten us many times and
created a "you need to disable THP to fix your problem, whatever that is"
cargo cult. I hope we do not want to repeat that mistake here again.

> Historically we allocated gigantic pages at boot time, but recently moved
> to a CMA-based dynamic approach. Still, the hugetlbfs interface requires more
> management than we would like to do. 1 GB THP seems to be a better
> alternative, so I definitely see it as a very useful feature.
>
> Given the cost of an allocation, I'm slightly skeptical about an automatic,
> heuristics-based approach, but if an application can explicitly mark target
> areas with madvise(), I don't see why it wouldn't work.

An explicit opt-in sounds much more appropriate to me as well. If we go with
a specific API then I would not make it 1GB-page specific. Why can't we have
an explicit interface to "defragment" an address space range into large
pages, where the kernel would use large pages where appropriate? Or is the
additional copying prohibitively expensive?
On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
> Let me clarify. I am not questioning 1GB (or large) pages in general. I
> believe it is quite clear that there are usecases which hugely benefit from
> them. I am mostly asking about the transparent part of it, which
> traditionally means that userspace mostly doesn't have to care in order to
> get them. 2MB THPs have established certain expectations, mostly a really
> aggressive, pro-active instantiation. This has bitten us many times and
> created a "you need to disable THP to fix your problem, whatever that is"
> cargo cult. I hope we do not want to repeat that mistake here again.

Absolutely, I agree with all of the above. 1 GB THPs have even fewer chances
of being allocated automatically without hurting overall performance.

I believe that historically the THP allocation success rate and cost were
not good enough to have a strict interface; that's why the "best effort"
approach was used. Maybe I'm wrong here.

Also, in some cases (e.g. desktop) an opportunistic approach looks like
"it's some perf boost for free". However, in the case of large distributed
systems it's important to get predictable and uniform performance across
nodes, so "maybe some hosts will perform better" is not giving much.

> An explicit opt-in sounds much more appropriate to me as well. If we go with
> a specific API then I would not make it 1GB-page specific. Why can't we have
> an explicit interface to "defragment" an address space range into large
> pages, where the kernel would use large pages where appropriate? Or is the
> additional copying prohibitively expensive?

Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
provides something similar to what you're describing, but there are a lot of
details here, so I'm probably missing something.

Thank you!
On Fri 04-09-20 14:10:45, Roman Gushchin wrote:
> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote:
[...]
>> An explicit opt-in sounds much more appropriate to me as well. If we go with
>> a specific API then I would not make it 1GB-page specific. Why can't we have
>> an explicit interface to "defragment" an address space range into large
>> pages, where the kernel would use large pages where appropriate? Or is the
>> additional copying prohibitively expensive?
>
> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE)
> provides something similar to what you're describing, but there are a lot of
> details here, so I'm probably missing something.

MADV_HUGEPAGE controls a preference for THP to be used for a particular
address range. So it looks similar, but the historical behavior is to
control page faults as well, and the behavior depends on the global setup.

I had in mind something much simpler. Effectively an API to invoke
khugepaged-like functionality synchronously, from the calling context, on a
specific address range. It could be more aggressive than the regular
khugepaged and create even 1G pages (or as large THPs as the page tables can
handle on the particular arch, for that matter). As this would be an
explicit call, we do not have to be worried about the resulting latency
because it would be an explicit call by userspace. The default khugepaged is
in a harder position there because it has no understanding of the target
address space and cannot make any cost/benefit evaluation, so it has to be
more conservative.
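To make the shape of such an interface concrete, here is a purely hypothetical userspace sketch. MADV_COLLAPSE_RANGE is an invented name and value for the synchronous "collapse this range now" call described above; no such flag exists in the kernel this thread targets, and a real interface could look quite different:

#include <sys/mman.h>
#include <stddef.h>

/* Invented flag value, purely for illustration; not a real madvise() flag. */
#define MADV_COLLAPSE_RANGE	26

static int make_range_huge(void *addr, size_t len)
{
	/* Existing hint: this range would like to be backed by THP. */
	if (madvise(addr, len, MADV_HUGEPAGE))
		return -1;

	/* Hypothetical synchronous call: collapse the range into the largest
	 * THPs the architecture supports right now, paying the allocation and
	 * copy cost in the caller's context instead of in khugepaged. */
	return madvise(addr, len, MADV_COLLAPSE_RANGE);
}

The design point is that latency stops being a policy problem: the application that asked for the collapse is the one that pays for it.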
On 03.09.20 18:30, Roman Gushchin wrote:
> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote:
>> This statement needs a lot of justification. I don't see 1GB THP as viable
>> for any workload. Opportunistic 1GB allocation is a very questionable
>> strategy.
>
> I share your skepticism about opportunistic 1 GB allocations, however it might
> be useful if backed by madvise() annotations from the userspace application.
> In this case, 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but
> with a more convenient interface.

I have concerns if we would silently use 1 GB THPs in most scenarios where
we would have used 2 MB THPs. I'd appreciate a trigger to explicitly enable
that - MADV_HUGEPAGE is not sufficient, because some applications relying on
it assume that the THP size will be 2 MB (especially if you want sparse,
large VMAs). E.g., from the man page:

"This feature is primarily aimed at applications that use large mappings of
data and access large regions of that memory at a time (e.g., virtualization
systems such as QEMU). It can very easily waste memory (e.g., a 2 MB mapping
that only ever accesses 1 byte will result in 2 MB of wired memory instead
of one 4 KB page)."

Having said that, I consider 1 GB THP - similar to 512 MB THP on arm64 -
useless in most setups, and I am not sure if it is worth the trouble. Just
use hugetlbfs for the handful of applications where it makes sense.
On 8 Sep 2020, at 7:57, David Hildenbrand wrote: > On 03.09.20 18:30, Roman Gushchin wrote: >> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote: >>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote: >>>> From: Zi Yan <ziy@nvidia.com> >>>> >>>> Hi all, >>>> >>>> This patchset adds support for 1GB THP on x86_64. It is on top of >>>> v5.9-rc2-mmots-2020-08-25-21-13. >>>> >>>> 1GB THP is more flexible for reducing translation overhead and increasing the >>>> performance of applications with large memory footprint without application >>>> changes compared to hugetlb. >>> >>> This statement needs a lot of justification. I don't see 1GB THP as viable >>> for any workload. Opportunistic 1GB allocation is very questionable >>> strategy. >> >> Hello, Kirill! >> >> I share your skepticism about opportunistic 1 GB allocations, however it might be useful >> if backed by an madvise() annotations from userspace application. In this case, >> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient >> interface. > > I have concerns if we would silently use 1~GB THPs in most scenarios > where be would have used 2~MB THP. I'd appreciate a trigger to > explicitly enable that - MADV_HUGEPAGE is not sufficient because some > applications relying on that assume that the THP size will be 2~MB > (especially, if you want sparse, large VMAs). This patchset is not intended to silently use 1GB THP in place of 2MB THP. First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA region (although I had alloc_contig_pages as a fallback, which can be removed in next version), so users need to add hugepage_cma=nG kernel parameter to enable 1GB THP allocation. If a finer control is necessary, we can add a new MADV_HUGEPAGE_1GB for 1GB THP. — Best Regards, Yan Zi
On 08.09.20 16:05, Zi Yan wrote: > On 8 Sep 2020, at 7:57, David Hildenbrand wrote: > >> On 03.09.20 18:30, Roman Gushchin wrote: >>> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote: >>>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote: >>>>> From: Zi Yan <ziy@nvidia.com> >>>>> >>>>> Hi all, >>>>> >>>>> This patchset adds support for 1GB THP on x86_64. It is on top of >>>>> v5.9-rc2-mmots-2020-08-25-21-13. >>>>> >>>>> 1GB THP is more flexible for reducing translation overhead and increasing the >>>>> performance of applications with large memory footprint without application >>>>> changes compared to hugetlb. >>>> >>>> This statement needs a lot of justification. I don't see 1GB THP as viable >>>> for any workload. Opportunistic 1GB allocation is very questionable >>>> strategy. >>> >>> Hello, Kirill! >>> >>> I share your skepticism about opportunistic 1 GB allocations, however it might be useful >>> if backed by an madvise() annotations from userspace application. In this case, >>> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient >>> interface. >> >> I have concerns if we would silently use 1~GB THPs in most scenarios >> where be would have used 2~MB THP. I'd appreciate a trigger to >> explicitly enable that - MADV_HUGEPAGE is not sufficient because some >> applications relying on that assume that the THP size will be 2~MB >> (especially, if you want sparse, large VMAs). > > This patchset is not intended to silently use 1GB THP in place of 2MB THP. > First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB > to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA > region (although I had alloc_contig_pages as a fallback, which can be removed > in next version), so users need to add hugepage_cma=nG kernel parameter to > enable 1GB THP allocation. If a finer control is necessary, we can add > a new MADV_HUGEPAGE_1GB for 1GB THP. Thanks for the information - I would have loved to see important information like that (esp. how to use) in the cover letter. So what you propose is (excluding alloc_contig_pages()) really just automatically using (previously reserved) 1GB huge pages as 1GB THP instead of explicitly using them in an application using hugetlbfs. Still, not convinced how helpful that actually is - most certainly you really want a mechanism to control this per application (+ maybe make the application indicate actual ranges where it makes sense - but then you can directly modify the application to use hugetlbfs). I guess the interesting thing of this approach is that we can mix-and-match THP of differing granularity within a single mapping - whereby a hugetlbfs allocation would fail in case there isn't sufficient 1GB pages available. However, there are no guarantees for applications anymore (thinking about RT KVM and similar, we really want gigantic pages and cannot tolerate falling back to smaller granularity). What are intended use cases/applications that could benefit? I doubt databases and virtualization are really a good fit - they know how to handle hugetlbfs just fine.
On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote: > On 8 Sep 2020, at 7:57, David Hildenbrand wrote: > > I have concerns if we would silently use 1~GB THPs in most scenarios > > where be would have used 2~MB THP. I'd appreciate a trigger to > > explicitly enable that - MADV_HUGEPAGE is not sufficient because some > > applications relying on that assume that the THP size will be 2~MB > > (especially, if you want sparse, large VMAs). > > This patchset is not intended to silently use 1GB THP in place of 2MB THP. > First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB > to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA > region (although I had alloc_contig_pages as a fallback, which can be removed > in next version), so users need to add hugepage_cma=nG kernel parameter to > enable 1GB THP allocation. If a finer control is necessary, we can add > a new MADV_HUGEPAGE_1GB for 1GB THP. I think we do need that flag. Machines don't run a single workload (arguably with VMs, we're getting closer to going back to the single workload per machine, but that's a different matter). So if there's one app that wants 2MB pages and one that wants 1GB pages, we need to be able to distinguish them. I could also see there being an app which benefits from 1GB for one mapping and prefers 2GB for a different mapping, so I think the per-mapping madvise flag is best. I'm a little wary of encoding the size of an x86 PUD in the Linux API though. Probably best to follow the example set in include/uapi/asm-generic/hugetlb_encode.h, but I don't love it. I don't have a better suggestion though.
On Tue 08-09-20 10:05:11, Zi Yan wrote: > On 8 Sep 2020, at 7:57, David Hildenbrand wrote: > > > On 03.09.20 18:30, Roman Gushchin wrote: > >> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote: > >>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote: > >>>> From: Zi Yan <ziy@nvidia.com> > >>>> > >>>> Hi all, > >>>> > >>>> This patchset adds support for 1GB THP on x86_64. It is on top of > >>>> v5.9-rc2-mmots-2020-08-25-21-13. > >>>> > >>>> 1GB THP is more flexible for reducing translation overhead and increasing the > >>>> performance of applications with large memory footprint without application > >>>> changes compared to hugetlb. > >>> > >>> This statement needs a lot of justification. I don't see 1GB THP as viable > >>> for any workload. Opportunistic 1GB allocation is very questionable > >>> strategy. > >> > >> Hello, Kirill! > >> > >> I share your skepticism about opportunistic 1 GB allocations, however it might be useful > >> if backed by an madvise() annotations from userspace application. In this case, > >> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient > >> interface. > > > > I have concerns if we would silently use 1~GB THPs in most scenarios > > where be would have used 2~MB THP. I'd appreciate a trigger to > > explicitly enable that - MADV_HUGEPAGE is not sufficient because some > > applications relying on that assume that the THP size will be 2~MB > > (especially, if you want sparse, large VMAs). > > This patchset is not intended to silently use 1GB THP in place of 2MB THP. > First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB > to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA > region (although I had alloc_contig_pages as a fallback, which can be removed > in next version), so users need to add hugepage_cma=nG kernel parameter to > enable 1GB THP allocation. If a finer control is necessary, we can add > a new MADV_HUGEPAGE_1GB for 1GB THP. A global knob is insufficient. 1G pages will become a very precious resource as it requires a pre-allocation (reservation). So it really has to be an opt-in and the question is whether there is also some sort of access control needed.
On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote: > A global knob is insufficient. 1G pages will become a very precious > resource as it requires a pre-allocation (reservation). So it really > has > to be an opt-in and the question is whether there is also some sort > of > access control needed. The 1GB pages do not require that much in the way of pre-allocation. The memory can be obtained through CMA, which means it can be used for movable 4kB and 2MB allocations when not being used for 1GB pages. That makes it relatively easy to set aside some fraction of system memory in every system for 1GB and movable allocations, and use it for whatever way it is needed depending on what workload(s) end up running on a system.
On 08.09.20 16:41, Rik van Riel wrote: > On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote: > >> A global knob is insufficient. 1G pages will become a very precious >> resource as it requires a pre-allocation (reservation). So it really >> has >> to be an opt-in and the question is whether there is also some sort >> of >> access control needed. > > The 1GB pages do not require that much in the way of > pre-allocation. The memory can be obtained through CMA, > which means it can be used for movable 4kB and 2MB > allocations when not > being used for 1GB pages. > > That makes it relatively easy to set aside > some fraction > of system memory in every system for 1GB and movable > allocations, and use it for whatever way it is needed > depending on what workload(s) end up running on a system. > Linking secretmem discussion https://lkml.kernel.org/r/fdda6ba7-9418-2b52-eee8-ce5e9bfdb6ad@redhat.com
On 7 Sep 2020, at 3:20, Michal Hocko wrote: > On Fri 04-09-20 14:10:45, Roman Gushchin wrote: >> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote: > [...] >>> An explicit opt-in sounds much more appropriate to me as well. If we go >>> with a specific API then I would not make it 1GB pages specific. Why >>> cannot we have an explicit interface to "defragment" address space >>> range into large pages and the kernel would use large pages where >>> appropriate? Or is the additional copying prohibitively expensive? >> >> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE) >> provides something similar to what you're describing, but there are lot >> of details here, so I'm probably missing something. > > MADV_HUGEPAGE is controlling a preference for THP to be used for a > particular address range. So it looks similar but the historical > behavior is to control page faults as well and the behavior depends on > the global setup. > > I've had in mind something much simpler. Effectively an API to invoke > khugepaged (like) functionality synchronously from the calling context > on the specific address range. It could be more aggressive than the > regular khugepaged and create even 1G pages (or as large THPs as page > tables can handle on the particular arch for that matter). > > As this would be an explicit call we do not have to be worried about > the resulting latency because it would be an explicit call by the > userspace. The default khugepaged has a harder position there because > has no understanding of the target address space and cannot make any > cost/benefit evaluation so it has to be more conservative. Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have better and clearer control of getting huge pages from the kernel and know when they will pay the cost of getting the huge pages. I would think the suggestion is more about the huge page control options currently provided by the kernel do not have predictable performance outcome, since MADV_HUGEPAGE is a best-effort option and does not tell users whether the marked virtual address range is backed by huge pages or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a deterministic result to users on whether the huge page(s) are formed or not. — Best Regards, Yan Zi
On 8 Sep 2020, at 10:22, David Hildenbrand wrote: > On 08.09.20 16:05, Zi Yan wrote: >> On 8 Sep 2020, at 7:57, David Hildenbrand wrote: >> >>> On 03.09.20 18:30, Roman Gushchin wrote: >>>> On Thu, Sep 03, 2020 at 05:23:00PM +0300, Kirill A. Shutemov wrote: >>>>> On Wed, Sep 02, 2020 at 02:06:12PM -0400, Zi Yan wrote: >>>>>> From: Zi Yan <ziy@nvidia.com> >>>>>> >>>>>> Hi all, >>>>>> >>>>>> This patchset adds support for 1GB THP on x86_64. It is on top of >>>>>> v5.9-rc2-mmots-2020-08-25-21-13. >>>>>> >>>>>> 1GB THP is more flexible for reducing translation overhead and increasing the >>>>>> performance of applications with large memory footprint without application >>>>>> changes compared to hugetlb. >>>>> >>>>> This statement needs a lot of justification. I don't see 1GB THP as viable >>>>> for any workload. Opportunistic 1GB allocation is very questionable >>>>> strategy. >>>> >>>> Hello, Kirill! >>>> >>>> I share your skepticism about opportunistic 1 GB allocations, however it might be useful >>>> if backed by an madvise() annotations from userspace application. In this case, >>>> 1 GB THPs might be an alternative to 1 GB hugetlbfs pages, but with a more convenient >>>> interface. >>> >>> I have concerns if we would silently use 1~GB THPs in most scenarios >>> where be would have used 2~MB THP. I'd appreciate a trigger to >>> explicitly enable that - MADV_HUGEPAGE is not sufficient because some >>> applications relying on that assume that the THP size will be 2~MB >>> (especially, if you want sparse, large VMAs). >> >> This patchset is not intended to silently use 1GB THP in place of 2MB THP. >> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB >> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA >> region (although I had alloc_contig_pages as a fallback, which can be removed >> in next version), so users need to add hugepage_cma=nG kernel parameter to >> enable 1GB THP allocation. If a finer control is necessary, we can add >> a new MADV_HUGEPAGE_1GB for 1GB THP. > > Thanks for the information - I would have loved to see important > information like that (esp. how to use) in the cover letter. > > So what you propose is (excluding alloc_contig_pages()) really just > automatically using (previously reserved) 1GB huge pages as 1GB THP > instead of explicitly using them in an application using hugetlbfs. > Still, not convinced how helpful that actually is - most certainly you > really want a mechanism to control this per application (+ maybe make > the application indicate actual ranges where it makes sense - but then > you can directly modify the application to use hugetlbfs). > > I guess the interesting thing of this approach is that we can > mix-and-match THP of differing granularity within a single mapping - > whereby a hugetlbfs allocation would fail in case there isn't sufficient > 1GB pages available. However, there are no guarantees for applications > anymore (thinking about RT KVM and similar, we really want gigantic > pages and cannot tolerate falling back to smaller granularity). I agree that currently THP allocation does not provide a strong guarantee like hugetlbfs, which can pre-allocate pages at boot time. For users like RT KVM and such, pre-allocated hugetlb might be the only choice, since allocating huge pages from CMA (either hugetlb or 1GB THP) would fail if some pages are pinned and scattered in the CMA that could prevent huge page allocation. 
In other cases, if the user can tolerate fallbacks but does not like the unpredictable huge page formation outcome, we could add an madvise() option like Michal suggested [1], so the user will know whether they get huge pages or not and can act accordingly. > What are intended use cases/applications that could benefit? I doubt > databases and virtualization are really a good fit - they know how to > handle hugetlbfs just fine. Roman and Jason have provided some use cases [2,3]. [1]https://lore.kernel.org/linux-mm/20200907072014.GD30144@dhcp22.suse.cz/ [2]https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/ [3]https://lore.kernel.org/linux-mm/20200903165051.GN24045@ziepe.ca/ — Best Regards, Yan Zi
On 8 Sep 2020, at 10:27, Matthew Wilcox wrote: > On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote: >> On 8 Sep 2020, at 7:57, David Hildenbrand wrote: >>> I have concerns if we would silently use 1~GB THPs in most scenarios >>> where be would have used 2~MB THP. I'd appreciate a trigger to >>> explicitly enable that - MADV_HUGEPAGE is not sufficient because some >>> applications relying on that assume that the THP size will be 2~MB >>> (especially, if you want sparse, large VMAs). >> >> This patchset is not intended to silently use 1GB THP in place of 2MB THP. >> First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB >> to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA >> region (although I had alloc_contig_pages as a fallback, which can be removed >> in next version), so users need to add hugepage_cma=nG kernel parameter to >> enable 1GB THP allocation. If a finer control is necessary, we can add >> a new MADV_HUGEPAGE_1GB for 1GB THP. > > I think we do need that flag. Machines don't run a single workload > (arguably with VMs, we're getting closer to going back to the single > workload per machine, but that's a different matter). So if there's > one app that wants 2MB pages and one that wants 1GB pages, we need to > be able to distinguish them. > > I could also see there being an app which benefits from 1GB for > one mapping and prefers 2GB for a different mapping, so I think the > per-mapping madvise flag is best. > > I'm a little wary of encoding the size of an x86 PUD in the Linux API > though. Probably best to follow the example set in > include/uapi/asm-generic/hugetlb_encode.h, but I don't love it. I > don't have a better suggestion though. Using hugeltb_encode.h makes sense to me. I will add it in the next version. Thanks for the suggestion. — Best Regards, Yan Zi
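(For reference, the encoding agreed on above lives in include/uapi/asm-generic/hugetlb_encode.h: each huge page size is carried as log2(size) in the upper bits of the flags word, and mmap()'s MAP_HUGE_2MB/MAP_HUGE_1GB are built from exactly these constants. The short C sketch below only demonstrates that existing encoding; the madvise() variant mentioned in the comment is hypothetical and not something defined by this patchset.)

    #include <stdio.h>
    #include <asm-generic/hugetlb_encode.h>	/* HUGETLB_FLAG_ENCODE_* */

    int main(void)
    {
    	/* Existing convention: the huge page size travels as log2(size)
    	 * in bits 26..31 of the flags word, so 1GB is simply 30 << 26. */
    	unsigned long flags = HUGETLB_FLAG_ENCODE_1GB;
    	unsigned long size  = 1UL << ((flags >> HUGETLB_FLAG_ENCODE_SHIFT)
    				      & HUGETLB_FLAG_ENCODE_MASK);

    	printf("decoded size: %lu bytes\n", size);	/* 1073741824 */

    	/* A size-carrying THP hint reusing this scheme is hypothetical at
    	 * this point; something along the lines of
    	 *     madvise(addr, len, MADV_HUGEPAGE | HUGETLB_FLAG_ENCODE_1GB);
    	 * only sketches the shape such a flag could take. */
    	return 0;
    }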
On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote: > On 7 Sep 2020, at 3:20, Michal Hocko wrote: > > > On Fri 04-09-20 14:10:45, Roman Gushchin wrote: > >> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote: > > [...] > >>> An explicit opt-in sounds much more appropriate to me as well. If we go > >>> with a specific API then I would not make it 1GB pages specific. Why > >>> cannot we have an explicit interface to "defragment" address space > >>> range into large pages and the kernel would use large pages where > >>> appropriate? Or is the additional copying prohibitively expensive? > >> > >> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE) > >> provides something similar to what you're describing, but there are lot > >> of details here, so I'm probably missing something. > > > > MADV_HUGEPAGE is controlling a preference for THP to be used for a > > particular address range. So it looks similar but the historical > > behavior is to control page faults as well and the behavior depends on > > the global setup. > > > > I've had in mind something much simpler. Effectively an API to invoke > > khugepaged (like) functionality synchronously from the calling context > > on the specific address range. It could be more aggressive than the > > regular khugepaged and create even 1G pages (or as large THPs as page > > tables can handle on the particular arch for that matter). > > > > As this would be an explicit call we do not have to be worried about > > the resulting latency because it would be an explicit call by the > > userspace. The default khugepaged has a harder position there because > > has no understanding of the target address space and cannot make any > > cost/benefit evaluation so it has to be more conservative. > > Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have > better and clearer control of getting huge pages from the kernel and > know when they will pay the cost of getting the huge pages. > > I would think the suggestion is more about the huge page control options > currently provided by the kernel do not have predictable performance > outcome, since MADV_HUGEPAGE is a best-effort option and does not tell > users whether the marked virtual address range is backed by huge pages > or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a > deterministic result to users on whether the huge page(s) are formed > or not. Yeah, I agree with Michal here, we need a more straightforward interface. The hard question here is how hard the kernel should try to allocate a gigantic page and how fast it should give up and return an error? I'd say to try really hard if there are some chances to succeed, so that if an error is returned, there are no more reasons to retry. Any objections/better ideas here? Given that we need to pass a page size, we probably need either to introduce a new syscall (madvise2?) with an additional argument, or add a bunch of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc. Idk what is better long-term, but new madvise flags are probably slightly easier to deal with in the development process. Thanks!
On 9/8/20 12:58 PM, Roman Gushchin wrote: > On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote: >> On 7 Sep 2020, at 3:20, Michal Hocko wrote: >>> On Fri 04-09-20 14:10:45, Roman Gushchin wrote: >>>> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote: >>> [...] >> Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have >> better and clearer control of getting huge pages from the kernel and >> know when they will pay the cost of getting the huge pages. >> >> I would think the suggestion is more about the huge page control options >> currently provided by the kernel do not have predictable performance >> outcome, since MADV_HUGEPAGE is a best-effort option and does not tell >> users whether the marked virtual address range is backed by huge pages >> or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a >> deterministic result to users on whether the huge page(s) are formed >> or not. > > Yeah, I agree with Michal here, we need a more straightforward interface. > > The hard question here is how hard the kernel should try to allocate > a gigantic page and how fast it should give up and return an error? > I'd say to try really hard if there are some chances to succeed, > so that if an error is returned, there are no more reasons to retry. > Any objections/better ideas here? I agree, especially because this is starting to look a lot more like an allocation call. And I think it would be appropriate for the kernel to try approximately as hard to provide these 1GB pages, as it would to allocate normal memory to a process. In fact, for a moment I thought, why not go all the way and make this actually be a true allocation? However, given that we still have operations that require page splitting, with no good way to call back user space to notify it that its "allocated" huge pages are being split, that fails. But it's still pretty close. > > Given that we need to pass a page size, we probably need either to introduce > a new syscall (madvise2?) with an additional argument, or add a bunch > of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc. > > Idk what is better long-term, but new madvise flags are probably slightly > easier to deal with in the development process. > Probably either an MADV_* flag or a new syscall would work fine. But given that this seems like a pretty distinct new capability, one with options and man page documentation and possibly future flags itself, I'd lean toward making it its own new syscall, maybe: compact_huge_pages(nbytes or npages, flags /* page size, etc */); ...thus leaving madvise() and it's remaining flags still available, to further refine things. thanks,
On Tue 08-09-20 10:41:10, Rik van Riel wrote: > On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote: > > > A global knob is insufficient. 1G pages will become a very precious > > resource as it requires a pre-allocation (reservation). So it really > > has > > to be an opt-in and the question is whether there is also some sort > > of > > access control needed. > > The 1GB pages do not require that much in the way of > pre-allocation. The memory can be obtained through CMA, > which means it can be used for movable 4kB and 2MB > allocations when not > being used for 1GB pages. That CMA has to be pre-reserved, right? That requires a configuration. > That makes it relatively easy to set aside > some fraction > of system memory in every system for 1GB and movable > allocations, and use it for whatever way it is needed > depending on what workload(s) end up running on a system. I was not talking about how easy or hard it is. My main concern is that this is effectively a pre-reserved pool and a global knob is a very suboptimal way to control access to it. I (rather) strongly believe this should be an explicit opt-in and ideally not 1GB specific but rather something to allow large pages to be created as there is a fit. See other subthread for more details.
On Tue 08-09-20 12:58:59, Roman Gushchin wrote: > On Tue, Sep 08, 2020 at 11:09:25AM -0400, Zi Yan wrote: > > On 7 Sep 2020, at 3:20, Michal Hocko wrote: > > > > > On Fri 04-09-20 14:10:45, Roman Gushchin wrote: > > >> On Fri, Sep 04, 2020 at 09:42:07AM +0200, Michal Hocko wrote: > > > [...] > > >>> An explicit opt-in sounds much more appropriate to me as well. If we go > > >>> with a specific API then I would not make it 1GB pages specific. Why > > >>> cannot we have an explicit interface to "defragment" address space > > >>> range into large pages and the kernel would use large pages where > > >>> appropriate? Or is the additional copying prohibitively expensive? > > >> > > >> Can you, please, elaborate a bit more here? It seems like madvise(MADV_HUGEPAGE) > > >> provides something similar to what you're describing, but there are lot > > >> of details here, so I'm probably missing something. > > > > > > MADV_HUGEPAGE is controlling a preference for THP to be used for a > > > particular address range. So it looks similar but the historical > > > behavior is to control page faults as well and the behavior depends on > > > the global setup. > > > > > > I've had in mind something much simpler. Effectively an API to invoke > > > khugepaged (like) functionality synchronously from the calling context > > > on the specific address range. It could be more aggressive than the > > > regular khugepaged and create even 1G pages (or as large THPs as page > > > tables can handle on the particular arch for that matter). > > > > > > As this would be an explicit call we do not have to be worried about > > > the resulting latency because it would be an explicit call by the > > > userspace. The default khugepaged has a harder position there because > > > has no understanding of the target address space and cannot make any > > > cost/benefit evaluation so it has to be more conservative. > > > > Something like MADV_HUGEPAGE_SYNC? It would be useful, since users have > > better and clearer control of getting huge pages from the kernel and > > know when they will pay the cost of getting the huge pages. The name is not really that important. The crucial design decisions are - THP allocation time - #PF and/or madvise context - lazy/sync instantiation - huge page sizes controllable by the userspace? - aggressiveness - how hard to try - internal fragmentation - allow to create THPs on sparsely or unpopulated ranges - do we need some sort of access control or privilege check as some THPs would be a really scarce (like those that require pre-reservation). > > I would think the suggestion is more about the huge page control options > > currently provided by the kernel do not have predictable performance > > outcome, since MADV_HUGEPAGE is a best-effort option and does not tell > > users whether the marked virtual address range is backed by huge pages > > or not when the madvise returns. MADV_HUGEPAGE_SYNC would provide a > > deterministic result to users on whether the huge page(s) are formed > > or not. > > Yeah, I agree with Michal here, we need a more straightforward interface. > > The hard question here is how hard the kernel should try to allocate > a gigantic page and how fast it should give up and return an error? > I'd say to try really hard if there are some chances to succeed, > so that if an error is returned, there are no more reasons to retry. > Any objections/better ideas here? 
If this is going to be an explicit interface like madvise then I would follow the same semantics as hugetlb page allocation - aka try as hard as feasible (whatever that means). > Given that we need to pass a page size, we probably need either to introduce > a new syscall (madvise2?) with an additional argument, or add a bunch > of new madvise flags, like MADV_HUGEPAGE_SYNC + encoded 2MB, 1GB etc. Do we really need to bother userspace with making a decision about the page size? I would expect that userspace only cares about getting a huge-page-backed memory range. The larger the pages the better. It is up to the kernel to handle the resource control here. After all, THPs can be split/reclaimed under memory pressure, so we do not want to make any promises about the pages backing any mapping.
On Tue, Sep 08, 2020 at 03:27:58PM +0100, Matthew Wilcox wrote: > On Tue, Sep 08, 2020 at 10:05:11AM -0400, Zi Yan wrote: > > On 8 Sep 2020, at 7:57, David Hildenbrand wrote: > > > I have concerns if we would silently use 1~GB THPs in most scenarios > > > where be would have used 2~MB THP. I'd appreciate a trigger to > > > explicitly enable that - MADV_HUGEPAGE is not sufficient because some > > > applications relying on that assume that the THP size will be 2~MB > > > (especially, if you want sparse, large VMAs). > > > > This patchset is not intended to silently use 1GB THP in place of 2MB THP. > > First of all, there is a knob /sys/kernel/mm/transparent_hugepage/enable_1GB > > to enable 1GB THP explicitly. Also, 1GB THP is allocated from a reserved CMA > > region (although I had alloc_contig_pages as a fallback, which can be removed > > in next version), so users need to add hugepage_cma=nG kernel parameter to > > enable 1GB THP allocation. If a finer control is necessary, we can add > > a new MADV_HUGEPAGE_1GB for 1GB THP. > > I think we do need that flag. Machines don't run a single workload > (arguably with VMs, we're getting closer to going back to the single > workload per machine, but that's a different matter). So if there's > one app that wants 2MB pages and one that wants 1GB pages, we need to > be able to distinguish them. > > I could also see there being an app which benefits from 1GB for > one mapping and prefers 2GB for a different mapping, so I think the > per-mapping madvise flag is best. I wonder if apps really care about the specific page size? Particularly from a portability view? The general app desire seems to be the need for 'efficient' memory (eg because it is highly accessed) and I suspect comes with a desire to populate the pages too. Maybe doing something with MAP_POPULATE is an idea? eg if I ask for 1GB of MAP_POPULATE it seems fairly natural the thing that comes back should be a 1GB THP? If I ask for only .5GB then it could be 2M pages, or whatever depending on arch support. Jason
On Wed, Sep 09, 2020 at 09:11:17AM -0300, Jason Gunthorpe wrote: > On Tue, Sep 08, 2020 at 03:27:58PM +0100, Matthew Wilcox wrote: > > I could also see there being an app which benefits from 1GB for > > one mapping and prefers 2GB for a different mapping, so I think the > > per-mapping madvise flag is best. > > I wonder if apps really care about the specific page size? > Particularly from a portability view? No, they don't. They just want to run as fast as possible ;-) > The general app desire seems to be the need for 'efficient' memory (eg > because it is highly accessed) and I suspect comes with a desire to > populate the pages too. The problem with a MAP_GOES_FASTER flag is that everybody sets it. Any flag name needs to convey its drawbacks as well as its advantages. Maybe MAP_EXTREMELY_COARSE_WORKINGSET would do that -- the VM will work in terms of 1GB pages for this mapping, so any swap-out is going to take out an entire 1GB at once. But here's the thing ... we already allow mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB) So if we're not doing THP, what's the point of this thread? My understanding of THP is "Application doesn't need to change, kernel makes a decision about what page size is best based on the entire system state and the process's behaviour". An madvise flag is a different beast; that's just letting the kernel know what the app thinks its behaviour will be. The kernel can pay as much (or as little) attention to that hint as it sees fit. And of course, it can change over time (either by kernel release as we change the algorithms, or simply from one minute to the next as more or less memory comes available).
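(For readers who want to see the pre-existing path mentioned above, here is a minimal sketch of the hugetlbfs-backed variant. It assumes 1GB pages were reserved beforehand, e.g. via hugepagesz=1G hugepages=N on the kernel command line, and it either succeeds with guaranteed 1GB pages or fails outright - exactly the contrast with the opportunistic THP behaviour being discussed. Error handling is kept minimal.)

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MAP_HUGE_SHIFT
    #define MAP_HUGE_SHIFT 26
    #endif
    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)	/* log2(1GB) in the high bits */
    #endif

    #define SZ_1G (1024UL * 1024 * 1024)

    int main(void)
    {
    	/* Explicit 1GB pages: reserved at mmap() time, prefaulted by
    	 * MAP_POPULATE, never transparently split by the kernel. */
    	void *p = mmap(NULL, SZ_1G, PROT_READ | PROT_WRITE,
    		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE |
    		       MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
    	if (p == MAP_FAILED) {
    		perror("mmap");		/* no free 1GB hugetlb page available */
    		return 1;
    	}
    	memset(p, 0, SZ_1G);
    	munmap(p, SZ_1G);
    	return 0;
    }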
On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote: > But here's the thing ... we already allow > mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB) > > So if we're not doing THP, what's the point of this thread? I wondered that too. > An madvise flag is a different beast; that's just letting the kernel > know what the app thinks its behaviour will be. The kernel can pay But madvise is too late: the VMA already has an address, and if it is not 1G aligned it can no longer be mapped with a 1G THP. Jason
On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote: > On Tue 08-09-20 10:41:10, Rik van Riel wrote: > > On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote: > > > > > A global knob is insufficient. 1G pages will become a very > > > precious > > > resource as it requires a pre-allocation (reservation). So it > > > really > > > has > > > to be an opt-in and the question is whether there is also some > > > sort > > > of > > > access control needed. > > > > The 1GB pages do not require that much in the way of > > pre-allocation. The memory can be obtained through CMA, > > which means it can be used for movable 4kB and 2MB > > allocations when not > > being used for 1GB pages. > > That CMA has to be pre-reserved, right? That requires a > configuration. To some extent, yes. However, because that pool can be used for movable 4kB and 2MB pages as well as for 1GB pages, it would be easy to just set the size of that pool to eg. 1/3 or even 1/2 of memory for every system. It isn't like the pool needs to be the exact right size. We just need to avoid the "highmem problem" of having too little memory for kernel allocations.
On 09.09.20 15:14, Jason Gunthorpe wrote: > On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote: > >> But here's the thing ... we already allow >> mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB) >> >> So if we're not doing THP, what's the point of this thread? > > I wondered that too.. > >> An madvise flag is a different beast; that's just letting the kernel >> know what the app thinks its behaviour will be. The kernel can pay > > But madvise is too late, the VMA already has an address, if it is not > 1G aligned it cannot be 1G THP already. That's why user space (like QEMU) is THP-aware and selects an address that is aligned to the expected THP granularity (e.g., 2MB on x86_64).
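(For readers unfamiliar with that trick: the usual approach is to over-allocate the anonymous mapping and trim it so the part that is kept starts on the desired boundary, which is what lets a later MADV_HUGEPAGE, or a hypothetical 1GB variant, actually be honoured. A minimal sketch, with a 1GB alignment chosen purely for illustration:)

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define ALIGN_1G (1024UL * 1024 * 1024)

    /* Map len bytes of anonymous memory whose start is 1GB aligned, by
     * reserving len + 1GB and unmapping the unaligned head and tail.
     * Sketch only: error handling kept to the bare minimum. */
    static void *mmap_aligned_1g(size_t len)
    {
    	size_t span = len + ALIGN_1G;
    	uint8_t *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
    			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    	if (raw == MAP_FAILED)
    		return NULL;

    	uint8_t *aligned = (uint8_t *)(((uintptr_t)raw + ALIGN_1G - 1)
    				       & ~(ALIGN_1G - 1));
    	if (aligned > raw)
    		munmap(raw, aligned - raw);			/* trim head */
    	if (aligned + len < raw + span)
    		munmap(aligned + len, (raw + span) - (aligned + len));	/* trim tail */

    	/* madvise(aligned, len, MADV_HUGEPAGE) now covers a range that can
    	 * actually be backed by naturally aligned huge mappings. */
    	return aligned;
    }

(QEMU aligns its guest RAM mappings in essentially this way, using its preferred huge page size as the alignment, which is why a hint applied afterwards can do something useful.)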
On 09.09.20 15:19, Rik van Riel wrote: > On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote: >> On Tue 08-09-20 10:41:10, Rik van Riel wrote: >>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote: >>> >>>> A global knob is insufficient. 1G pages will become a very >>>> precious >>>> resource as it requires a pre-allocation (reservation). So it >>>> really >>>> has >>>> to be an opt-in and the question is whether there is also some >>>> sort >>>> of >>>> access control needed. >>> >>> The 1GB pages do not require that much in the way of >>> pre-allocation. The memory can be obtained through CMA, >>> which means it can be used for movable 4kB and 2MB >>> allocations when not >>> being used for 1GB pages. >> >> That CMA has to be pre-reserved, right? That requires a >> configuration. > > To some extent, yes. > > However, because that pool can be used for movable > 4kB and 2MB > pages as well as for 1GB pages, it would be easy to just set > the size of that pool to eg. 1/3 or even 1/2 of memory for every > system. > > It isn't like the pool needs to be the exact right size. We > just need to avoid the "highmem problem" of having too little > memory for kernel allocations. > I am not sure I like the trend towards CMA that we are seeing, reserving huge buffers for specific users (and eventually even doing it automatically). What we actually want is ZONE_MOVABLE with relaxed guarantees, such that anybody who requires large, unmovable allocations can use it. I once played with the idea of having ZONE_PREFER_MOVABLE, which a) Is the primary choice for movable allocations b) Is allowed to contain unmovable allocations (esp., gigantic pages) c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of running out of memory If someone messes up the zone ratio, issues known from zone imbalances are avoided - large allocations simply become less likely to succeed. In contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.
On Wed, 2020-09-09 at 15:43 +0200, David Hildenbrand wrote: > On 09.09.20 15:19, Rik van Riel wrote: > > On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote: > > > > > That CMA has to be pre-reserved, right? That requires a > > > configuration. > > > > To some extent, yes. > > > > However, because that pool can be used for movable > > 4kB and 2MB > > pages as well as for 1GB pages, it would be easy to just set > > the size of that pool to eg. 1/3 or even 1/2 of memory for every > > system. > > > > It isn't like the pool needs to be the exact right size. We > > just need to avoid the "highmem problem" of having too little > > memory for kernel allocations. > > > > I am not sure I like the trend towards CMA that we are seeing, > reserving > huge buffers for specific users (and eventually even doing it > automatically). > > What we actually want is ZONE_MOVABLE with relaxed guarantees, such > that > anybody who requires large, unmovable allocations can use it. > > I once played with the idea of having ZONE_PREFER_MOVABLE, which > a) Is the primary choice for movable allocations > b) Is allowed to contain unmovable allocations (esp., gigantic pages) > c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead > of > running out of memory > > If someone messes up the zone ratio, issues known from zone > imbalances > are avoided - large allocations simply become less likely to succeed. > In > contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work. I really like that idea. This will be easier to deal with than a "just the right size" CMA area, and seems like it would be pretty forgiving in both directions. Keeping unmovable allocations contained to one part of memory should also make compaction within the ZONE_PREFER_MOVABLE area a lot easier than compaction for higher order allocations is today. I suspect your proposal solves a lot of issues at once. For (c) from your proposal, we could even claim a whole 2MB or even 1GB area at once for unmovable allocations, keeping those contained in a limited amount of physical memory again, to make life easier on compaction.
On 09.09.20 15:49, Rik van Riel wrote: > On Wed, 2020-09-09 at 15:43 +0200, David Hildenbrand wrote: >> On 09.09.20 15:19, Rik van Riel wrote: >>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote: >>> >>>> That CMA has to be pre-reserved, right? That requires a >>>> configuration. >>> >>> To some extent, yes. >>> >>> However, because that pool can be used for movable >>> 4kB and 2MB >>> pages as well as for 1GB pages, it would be easy to just set >>> the size of that pool to eg. 1/3 or even 1/2 of memory for every >>> system. >>> >>> It isn't like the pool needs to be the exact right size. We >>> just need to avoid the "highmem problem" of having too little >>> memory for kernel allocations. >>> >> >> I am not sure I like the trend towards CMA that we are seeing, >> reserving >> huge buffers for specific users (and eventually even doing it >> automatically). >> >> What we actually want is ZONE_MOVABLE with relaxed guarantees, such >> that >> anybody who requires large, unmovable allocations can use it. >> >> I once played with the idea of having ZONE_PREFER_MOVABLE, which >> a) Is the primary choice for movable allocations >> b) Is allowed to contain unmovable allocations (esp., gigantic pages) >> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead >> of >> running out of memory >> >> If someone messes up the zone ratio, issues known from zone >> imbalances >> are avoided - large allocations simply become less likely to succeed. >> In >> contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work. > > I really like that idea. This will be easier to deal with than > a "just the right size" CMA area, and seems like it would be > pretty forgiving in both directions. > Yes, and can be extended using memory hotplug. > Keeping unmovable allocations > contained to one part of memory > should also make compaction within the ZONE_PREFER_MOVABLE area > a lot easier than compaction for higher order allocations is > today. > > I suspect your proposal solves a lot of issues at once. > > For (c) from your proposal, we could even claim a whole > 2MB or even 1GB area at once for unmovable allocations, > keeping those contained in a limited amount of physical > memory again, to make life easier on compaction. > Exactly, locally limiting unmovable allocations to a sane minimum. (with some smart extra work, we could even convert ZONE_PREFER_MOVABLE to ZONE_NORMAL, one memory section/block at a time where needed, that direction always works. But that's very tricky.)
On Wed 09-09-20 09:19:16, Rik van Riel wrote: > On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote: > > On Tue 08-09-20 10:41:10, Rik van Riel wrote: > > > On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote: > > > > > > > A global knob is insufficient. 1G pages will become a very > > > > precious > > > > resource as it requires a pre-allocation (reservation). So it > > > > really > > > > has > > > > to be an opt-in and the question is whether there is also some > > > > sort > > > > of > > > > access control needed. > > > > > > The 1GB pages do not require that much in the way of > > > pre-allocation. The memory can be obtained through CMA, > > > which means it can be used for movable 4kB and 2MB > > > allocations when not > > > being used for 1GB pages. > > > > That CMA has to be pre-reserved, right? That requires a > > configuration. > > To some extent, yes. > > However, because that pool can be used for movable > 4kB and 2MB > pages as well as for 1GB pages, it would be easy to just set > the size of that pool to eg. 1/3 or even 1/2 of memory for every > system. > > It isn't like the pool needs to be the exact right size. We > just need to avoid the "highmem problem" of having too little > memory for kernel allocations. Which is exactly why this is not really suitable for uneducated guesses. It is really hard to guess the right amount of lowmem. Think of heavy fs metadata workloads and their memory demand. From my experience, memory reclaim usually struggles when zones are imbalanced.
[Cc Vlastimil and Mel - the whole email thread starts http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com but this particular subthread has diverged a bit and you might find it interesting] On Wed 09-09-20 15:43:55, David Hildenbrand wrote: > On 09.09.20 15:19, Rik van Riel wrote: > > On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote: > >> On Tue 08-09-20 10:41:10, Rik van Riel wrote: > >>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote: > >>> > >>>> A global knob is insufficient. 1G pages will become a very > >>>> precious > >>>> resource as it requires a pre-allocation (reservation). So it > >>>> really > >>>> has > >>>> to be an opt-in and the question is whether there is also some > >>>> sort > >>>> of > >>>> access control needed. > >>> > >>> The 1GB pages do not require that much in the way of > >>> pre-allocation. The memory can be obtained through CMA, > >>> which means it can be used for movable 4kB and 2MB > >>> allocations when not > >>> being used for 1GB pages. > >> > >> That CMA has to be pre-reserved, right? That requires a > >> configuration. > > > > To some extent, yes. > > > > However, because that pool can be used for movable > > 4kB and 2MB > > pages as well as for 1GB pages, it would be easy to just set > > the size of that pool to eg. 1/3 or even 1/2 of memory for every > > system. > > > > It isn't like the pool needs to be the exact right size. We > > just need to avoid the "highmem problem" of having too little > > memory for kernel allocations. > > > > I am not sure I like the trend towards CMA that we are seeing, reserving > huge buffers for specific users (and eventually even doing it > automatically). > > What we actually want is ZONE_MOVABLE with relaxed guarantees, such that > anybody who requires large, unmovable allocations can use it. > > I once played with the idea of having ZONE_PREFER_MOVABLE, which > a) Is the primary choice for movable allocations > b) Is allowed to contain unmovable allocations (esp., gigantic pages) > c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of > running out of memory I might be missing something but how can this work longterm? Or put in another words why would this work any better than existing fragmentation avoidance techniques that page allocator implements already - movability grouping etc. Please note that I am not deeply familiar with those but my high level understanding is that we already try hard to not mix movable and unmovable objects in same page blocks as much as we can. My suspicion is that a separate zone would work in a similar fashion. As long as there is a lot of free memory then zone will be effectively MOVABLE. Similar applies to normal zone when unmovable allocations are in minority. As long as the Normal zone gets full of unmovable objects they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page block stealing when unmovable objects start spreading over movable page blocks. Again, my level of expertise to page allocator is quite low so all the above might be simply wrong... > If someone messes up the zone ratio, issues known from zone imbalances > are avoided - large allocations simply become less likely to succeed. In > contrast to ZONE_MOVABLE, memory offlining is not guaranteed to work.
On 10.09.20 09:32, Michal Hocko wrote: > [Cc Vlastimil and Mel - the whole email thread starts > http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com > but this particular subthread has diverged a bit and you might find it > interesting] > > On Wed 09-09-20 15:43:55, David Hildenbrand wrote: >> On 09.09.20 15:19, Rik van Riel wrote: >>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote: >>>> On Tue 08-09-20 10:41:10, Rik van Riel wrote: >>>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote: >>>>> >>>>>> A global knob is insufficient. 1G pages will become a very >>>>>> precious >>>>>> resource as it requires a pre-allocation (reservation). So it >>>>>> really >>>>>> has >>>>>> to be an opt-in and the question is whether there is also some >>>>>> sort >>>>>> of >>>>>> access control needed. >>>>> >>>>> The 1GB pages do not require that much in the way of >>>>> pre-allocation. The memory can be obtained through CMA, >>>>> which means it can be used for movable 4kB and 2MB >>>>> allocations when not >>>>> being used for 1GB pages. >>>> >>>> That CMA has to be pre-reserved, right? That requires a >>>> configuration. >>> >>> To some extent, yes. >>> >>> However, because that pool can be used for movable >>> 4kB and 2MB >>> pages as well as for 1GB pages, it would be easy to just set >>> the size of that pool to eg. 1/3 or even 1/2 of memory for every >>> system. >>> >>> It isn't like the pool needs to be the exact right size. We >>> just need to avoid the "highmem problem" of having too little >>> memory for kernel allocations. >>> >> >> I am not sure I like the trend towards CMA that we are seeing, reserving >> huge buffers for specific users (and eventually even doing it >> automatically). >> >> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that >> anybody who requires large, unmovable allocations can use it. >> >> I once played with the idea of having ZONE_PREFER_MOVABLE, which >> a) Is the primary choice for movable allocations >> b) Is allowed to contain unmovable allocations (esp., gigantic pages) >> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of >> running out of memory > > I might be missing something but how can this work longterm? Or put in > another words why would this work any better than existing fragmentation > avoidance techniques that page allocator implements already - movability > grouping etc. Please note that I am not deeply familiar with those but > my high level understanding is that we already try hard to not mix > movable and unmovable objects in same page blocks as much as we can. Note that we group in pageblock granularity, which avoids fragmentation on a pageblock level, not on anything bigger than that. Especially MAX_ORDER - 1 pages (e.g., on x86-64) and gigantic pages. So once you run for some time on a system (especially thinking about page shuffling *within* a zone), trying to allocate a gigantic page will simply always fail - even if you always had plenty of free memory in your single zone. > > My suspicion is that a separate zone would work in a similar fashion. As > long as there is a lot of free memory then zone will be effectively > MOVABLE. Similar applies to normal zone when unmovable allocations are Note the difference to MOVABLE: if you really want, you *can* put movable allocations into that zone. So you can happily allocate gigantic pages from it. Or anything else you like. As the name suggests "prefer movable allocations". > in minority. 
As long as the Normal zone gets full of unmovable objects > they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page > block stealing when unmovable objects start spreading over movable page > blocks. Right, the long-term goal would be 1. To limit the chance of that happening. (e.g., size it in a way that's safe for 99.9% of all setups, resize dynamically on demand) 2. To limit the physical area where that is happening (e.g., find lowest possible pageblock etc.). That's more tricky but I consider this a pure optimization on top. As long as we stay in safe zone boundaries you get a benefit in most scenarios. As soon as we would have a (temporary) workload that would require more unmovable allocations we would fallback to polluting some pageblocks only. > > Again, my level of expertise to page allocator is quite low so all the > above might be simply wrong... Same over here. I had this idea in my mind for quite a while but obviously didn't get to figure out the details/implement yet - that's why I decided to share the basic idea just now.
> On Sep 9, 2020, at 7:27 AM, David Hildenbrand <david@redhat.com> wrote: > > On 09.09.20 15:14, Jason Gunthorpe wrote: >> On Wed, Sep 09, 2020 at 01:32:44PM +0100, Matthew Wilcox wrote: >> >>> But here's the thing ... we already allow >>> mmap(MAP_POPULATE | MAP_HUGETLB | MAP_HUGE_1GB) >>> >>> So if we're not doing THP, what's the point of this thread? >> >> I wondered that too.. >> >>> An madvise flag is a different beast; that's just letting the kernel >>> know what the app thinks its behaviour will be. The kernel can pay >> >> But madvise is too late, the VMA already has an address, if it is not >> 1G aligned it cannot be 1G THP already. > > That's why user space (like QEMU) is THP-aware and selects an address > that is aligned to the expected THP granularity (e.g., 2MB on x86_64). To me it's always seemed like there are two major divisions among THP use cases: 1) Applications that KNOW they would benefit from use of THPs, so they call madvise() with an appropriate parameter and explicitly inform the kernel of such 2) Applications that know nothing about THP but there may be an advantage that comes from "automatic" THP mapping when possible. This is an approach that I am more familiar with that comes down to: 1) Is a VMA properly aligned for a (whatever size) THP? 2) Is the mapping request for a length >= (whatever size) THP? 3) Let's try allocating memory to map the space using (whatever size) THP, and: -- If we succeed, great, awesome, let's do it. -- If not, no big deal, map using as large a page as we CAN get. There of course are myriad performance implications to this. Processes that start early after boot have a better chance of getting a THP, but that also means frequently mapped large memory spaces have a better chance of being mapped in a shared manner via a THP, e.g. libc, X servers or Firefox/Chrome. It also means that processes that would be mapped using THPs early in boot may not be if they should crash and need to be restarted. There are all sorts of tunables that would likely need to be in place to make the second approach more viable, but I think it's certainly worth investigating. The address selection you suggest is the basis of one of the patches I wrote for a previous iteration of THP support (and that is in Matthew's THP tree) that will try to round VM addresses to the proper alignment if possible so a THP can then be used to map the area.
On Thu, 2020-09-10 at 09:32 +0200, Michal Hocko wrote: > [Cc Vlastimil and Mel - the whole email thread starts > http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com > but this particular subthread has diverged a bit and you might find > it > interesting] > > On Wed 09-09-20 15:43:55, David Hildenbrand wrote: > > > > I am not sure I like the trend towards CMA that we are seeing, > > reserving > > huge buffers for specific users (and eventually even doing it > > automatically). > > > > What we actually want is ZONE_MOVABLE with relaxed guarantees, such > > that > > anybody who requires large, unmovable allocations can use it. > > > > I once played with the idea of having ZONE_PREFER_MOVABLE, which > > a) Is the primary choice for movable allocations > > b) Is allowed to contain unmovable allocations (esp., gigantic > > pages) > > c) Is the fallback for ZONE_NORMAL for unmovable allocations, > > instead of > > running out of memory > > I might be missing something but how can this work longterm? Or put > in > another words why would this work any better than existing > fragmentation > avoidance techniques that page allocator implements already - One big difference is reclaim. If ZONE_NORMAL runs low on free memory, page reclaim would kick in and evict some movable/reclaimable things, to free up more space for unmovable allocations. The current fragmentation avoidance techniques don't do things like reclaim, or proactively migrating movable pages out of unmovable page blocks to prevent unmovable allocations in currently movable page blocks. > My suspicion is that a separate zone would work in a similar fashion. > As > long as there is a lot of free memory then zone will be effectively > MOVABLE. Similar applies to normal zone when unmovable allocations > are > in minority. As long as the Normal zone gets full of unmovable > objects > they start overflowing to ZONE_PREFER_MOVABLE and it will resemble > page > block stealing when unmovable objects start spreading over movable > page > blocks. You are right, with the difference being reclaim and/or migration, which could make a real difference in limiting the number of pageblocks that have unmovable allocations.
On 10 Sep 2020, at 4:27, David Hildenbrand wrote: > On 10.09.20 09:32, Michal Hocko wrote: >> [Cc Vlastimil and Mel - the whole email thread starts >> http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com >> but this particular subthread has diverged a bit and you might find it >> interesting] >> >> On Wed 09-09-20 15:43:55, David Hildenbrand wrote: >>> On 09.09.20 15:19, Rik van Riel wrote: >>>> On Wed, 2020-09-09 at 09:04 +0200, Michal Hocko wrote: >>>>> On Tue 08-09-20 10:41:10, Rik van Riel wrote: >>>>>> On Tue, 2020-09-08 at 16:35 +0200, Michal Hocko wrote: >>>>>> >>>>>>> A global knob is insufficient. 1G pages will become a very >>>>>>> precious >>>>>>> resource as it requires a pre-allocation (reservation). So it >>>>>>> really >>>>>>> has >>>>>>> to be an opt-in and the question is whether there is also some >>>>>>> sort >>>>>>> of >>>>>>> access control needed. >>>>>> >>>>>> The 1GB pages do not require that much in the way of >>>>>> pre-allocation. The memory can be obtained through CMA, >>>>>> which means it can be used for movable 4kB and 2MB >>>>>> allocations when not >>>>>> being used for 1GB pages. >>>>> >>>>> That CMA has to be pre-reserved, right? That requires a >>>>> configuration. >>>> >>>> To some extent, yes. >>>> >>>> However, because that pool can be used for movable >>>> 4kB and 2MB >>>> pages as well as for 1GB pages, it would be easy to just set >>>> the size of that pool to eg. 1/3 or even 1/2 of memory for every >>>> system. >>>> >>>> It isn't like the pool needs to be the exact right size. We >>>> just need to avoid the "highmem problem" of having too little >>>> memory for kernel allocations. >>>> >>> >>> I am not sure I like the trend towards CMA that we are seeing, reserving >>> huge buffers for specific users (and eventually even doing it >>> automatically). >>> >>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such that >>> anybody who requires large, unmovable allocations can use it. >>> >>> I once played with the idea of having ZONE_PREFER_MOVABLE, which >>> a) Is the primary choice for movable allocations >>> b) Is allowed to contain unmovable allocations (esp., gigantic pages) >>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, instead of >>> running out of memory >> >> I might be missing something but how can this work longterm? Or put in >> another words why would this work any better than existing fragmentation >> avoidance techniques that page allocator implements already - movability >> grouping etc. Please note that I am not deeply familiar with those but >> my high level understanding is that we already try hard to not mix >> movable and unmovable objects in same page blocks as much as we can. > > Note that we group in pageblock granularity, which avoids fragmentation > on a pageblock level, not on anything bigger than that. Especially > MAX_ORDER - 1 pages (e.g., on x86-64) and gigantic pages. > > So once you run for some time on a system (especially thinking about > page shuffling *within* a zone), trying to allocate a gigantic page will > simply always fail - even if you always had plenty of free memory in > your single zone. > >> >> My suspicion is that a separate zone would work in a similar fashion. As >> long as there is a lot of free memory then zone will be effectively >> MOVABLE. Similar applies to normal zone when unmovable allocations are > > Note the difference to MOVABLE: if you really want, you *can* put > movable allocations into that zone. 
> So you can happily allocate gigantic pages from it. Or anything else
> you like. As the name suggests "prefer movable allocations".
>
>> in minority. As long as the Normal zone gets full of unmovable objects
>> they start overflowing to ZONE_PREFER_MOVABLE and it will resemble page
>> block stealing when unmovable objects start spreading over movable page
>> blocks.
>
> Right, the long-term goal would be
> 1. To limit the chance of that happening. (e.g., size it in a way that's
> safe for 99.9% of all setups, resize dynamically on demand)
> 2. To limit the physical area where that is happening (e.g., find lowest
> possible pageblock etc.). That's more tricky but I consider this a pure
> optimization on top.
>
> As long as we stay in safe zone boundaries you get a benefit in most
> scenarios. As soon as we would have a (temporary) workload that would
> require more unmovable allocations we would fallback to polluting some
> pageblocks only.

The idea would work well until unmovable pages begin to overflow into
ZONE_PREFER_MOVABLE, or until we move the boundary of ZONE_PREFER_MOVABLE to
avoid that overflow. The issue comes from the lifetime of the unmovable pages:
since some long-lived ones can sit right at the boundary, there is no
guarantee that ZONE_PREFER_MOVABLE can grow back even when other unmovable
pages are freed. Ultimately, ZONE_PREFER_MOVABLE would shrink to a small size
and the situation is back to what we have now.

OK, I have a stupid question here: why not just grow the pageblock to a larger
size, like 1GB? Then the fragmentation caused by unmovable pages happens at a
larger granularity, and it becomes less likely that unmovable pages are
allocated from a movable pageblock, since the kernel has a whole 1GB pageblock
for them after a single pageblock steal. If other kinds of pageblocks run out,
movable and reclaimable pages can still fall back to unmovable pageblocks.
What am I missing here?

Thanks.

—
Best Regards,
Yan Zi
On 10 Sep 2020, at 9:32, Rik van Riel wrote: > On Thu, 2020-09-10 at 09:32 +0200, Michal Hocko wrote: >> [Cc Vlastimil and Mel - the whole email thread starts >> http://lkml.kernel.org/r/20200902180628.4052244-1-zi.yan@sent.com >> but this particular subthread has diverged a bit and you might find >> it >> interesting] >> >> On Wed 09-09-20 15:43:55, David Hildenbrand wrote: >>> >>> I am not sure I like the trend towards CMA that we are seeing, >>> reserving >>> huge buffers for specific users (and eventually even doing it >>> automatically). >>> >>> What we actually want is ZONE_MOVABLE with relaxed guarantees, such >>> that >>> anybody who requires large, unmovable allocations can use it. >>> >>> I once played with the idea of having ZONE_PREFER_MOVABLE, which >>> a) Is the primary choice for movable allocations >>> b) Is allowed to contain unmovable allocations (esp., gigantic >>> pages) >>> c) Is the fallback for ZONE_NORMAL for unmovable allocations, >>> instead of >>> running out of memory >> >> I might be missing something but how can this work longterm? Or put >> in >> another words why would this work any better than existing >> fragmentation >> avoidance techniques that page allocator implements already - > > One big difference is reclaim. If ZONE_NORMAL runs low on > free memory, page reclaim would kick in and evict some > movable/reclaimable things, to free up more space for > unmovable allocations. > > The current fragmentation avoidance techniques don't do > things like reclaim, or proactively migrating movable > pages out of unmovable page blocks to prevent unmovable > allocations in currently movable page blocks. Isn’t Mel Gorman’s watermark boost patch[1] (merged about a year ago) doing what you are describing? [1]https://lore.kernel.org/linux-mm/20181123114528.28802-1-mgorman@techsingularity.net/ — Best Regards, Yan Zi
>> As long as we stay in safe zone boundaries you get a benefit in most
>> scenarios. As soon as we would have a (temporary) workload that would
>> require more unmovable allocations we would fallback to polluting some
>> pageblocks only.
>
> The idea would work well until unmoveable pages begin to overflow into
> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
> avoid unmoveable page overflow. The issue comes from the lifetime of
> the unmoveable pages. Since some long-live ones can be around the boundary,
> there is no guarantee that ZONE_PREFER_MOVABLE cannot grow back
> even if other unmoveable pages are deallocated. Ultimately,
> ZONE_PREFER_MOVABLE would be shrink to a small size and the situation is
> back to what we have now.

As discussed, this would not happen in the usual case if we size it
reasonably. Of course, if you push it to the extreme (which was never
suggested!), you would create a mess. There is always a way to create a mess
if you abuse such a mechanism. Also see Rik's reply regarding reclaim.

>
> OK. I have a stupid question here. Why not just grow pageblock to a larger
> size, like 1GB? So the fragmentation of unmoveable pages will be at larger
> granularity. But it is less likely unmoveable pages will be allocated at
> a movable pageblock, since the kernel has 1GB pageblock for them after
> a pageblock stealing. If other kinds of pageblocks run out, moveable and
> reclaimable pages can fall back to unmoveable pageblocks.
> What am I missing here?

Oh no. For example, pageblocks have to completely fit into a single section
(that's where the metadata is maintained). Please refrain from suggesting to
increase the section size ;) There is plenty of code relying on pageblocks and
MAX_ORDER - 1 pages being reasonable in size. Examples in VMs are free page
reporting and virtio-mem.
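To put rough numbers on "pageblocks have to completely fit into a single
section", here is a small stand-alone sketch of the arithmetic, assuming the
usual x86_64 defaults (4KB base pages, 128MB sparsemem sections, 2MB
pageblocks). The constants are illustrative, not taken from the patch set, and
should be checked against the actual kernel configuration.

```c
#include <stdio.h>

/* Assumed x86_64 defaults; verify against the real kernel config. */
#define PAGE_SHIFT          12	/* 4KiB base pages */
#define SECTION_SIZE_BITS   27	/* 128MiB sparsemem section */
#define PAGEBLOCK_ORDER      9	/* 2MiB pageblock, i.e. PMD-THP sized */

int main(void)
{
	unsigned long section  = 1UL << SECTION_SIZE_BITS;
	unsigned long block    = 1UL << (PAGEBLOCK_ORDER + PAGE_SHIFT);
	unsigned long gb_order = 30 - PAGE_SHIFT;	/* order of a 1GiB block */

	printf("section:   %lu MiB\n", section >> 20);
	printf("pageblock: %lu MiB (%lu pageblocks per section)\n",
	       block >> 20, section / block);

	/*
	 * A 1GiB pageblock would be an order-18 range spanning eight 128MiB
	 * sections, which is why the pageblock size cannot simply be bumped
	 * without also growing SECTION_SIZE_BITS (and with it the memory
	 * hotplug granularity discussed later in the thread).
	 */
	printf("1GiB block: order %lu vs. section order %d\n",
	       gb_order, SECTION_SIZE_BITS - PAGE_SHIFT);
	return 0;
}
```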
On 10 Sep 2020, at 10:34, David Hildenbrand wrote: >>> As long as we stay in safe zone boundaries you get a benefit in most >>> scenarios. As soon as we would have a (temporary) workload that would >>> require more unmovable allocations we would fallback to polluting some >>> pageblocks only. >> >> The idea would work well until unmoveable pages begin to overflow into >> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to >> avoid unmoveable page overflow. The issue comes from the lifetime of >> the unmoveable pages. Since some long-live ones can be around the boundary, >> there is no guarantee that ZONE_PREFER_MOVABLE cannot grow back >> even if other unmoveable pages are deallocated. Ultimately, >> ZONE_PREFER_MOVABLE would be shrink to a small size and the situation is >> back to what we have now. > > As discussed this would not happen in the usual case in case we size it > reasonable. Of course, if you push it to the extreme (which was never > suggested!), you would create mess. There is always a way to create a > mess if you abuse such mechanism. Also see Rik's reply regarding reclaim. > >> >> OK. I have a stupid question here. Why not just grow pageblock to a larger >> size, like 1GB? So the fragmentation of unmoveable pages will be at larger >> granularity. But it is less likely unmoveable pages will be allocated at >> a movable pageblock, since the kernel has 1GB pageblock for them after >> a pageblock stealing. If other kinds of pageblocks run out, moveable and >> reclaimable pages can fall back to unmoveable pageblocks. >> What am I missing here? > > Oh no. For example pageblocks have to completely fit into a single > section (that's where metadata is maintained). Please refrain from > suggesting to increase the section size ;) Thank you for the explanation. I have no idea about the restrictions on pageblock and section. Out of curiosity, what prevents the growth of the section size? — Best Regards, Yan Zi
On 10.09.20 16:41, Zi Yan wrote:
> On 10 Sep 2020, at 10:34, David Hildenbrand wrote:
>
>>>> As long as we stay in safe zone boundaries you get a benefit in most
>>>> scenarios. As soon as we would have a (temporary) workload that would
>>>> require more unmovable allocations we would fallback to polluting some
>>>> pageblocks only.
>>>
>>> The idea would work well until unmoveable pages begin to overflow into
>>> ZONE_PREFER_MOVABLE or we move the boundary of ZONE_PREFER_MOVABLE to
>>> avoid unmoveable page overflow. The issue comes from the lifetime of
>>> the unmoveable pages. Since some long-live ones can be around the boundary,
>>> there is no guarantee that ZONE_PREFER_MOVABLE cannot grow back
>>> even if other unmoveable pages are deallocated. Ultimately,
>>> ZONE_PREFER_MOVABLE would be shrink to a small size and the situation is
>>> back to what we have now.
>>
>> As discussed this would not happen in the usual case in case we size it
>> reasonable. Of course, if you push it to the extreme (which was never
>> suggested!), you would create mess. There is always a way to create a
>> mess if you abuse such mechanism. Also see Rik's reply regarding reclaim.
>>
>>>
>>> OK. I have a stupid question here. Why not just grow pageblock to a larger
>>> size, like 1GB? So the fragmentation of unmoveable pages will be at larger
>>> granularity. But it is less likely unmoveable pages will be allocated at
>>> a movable pageblock, since the kernel has 1GB pageblock for them after
>>> a pageblock stealing. If other kinds of pageblocks run out, moveable and
>>> reclaimable pages can fall back to unmoveable pageblocks.
>>> What am I missing here?
>>
>> Oh no. For example pageblocks have to completely fit into a single
>> section (that's where metadata is maintained). Please refrain from
>> suggesting to increase the section size ;)
>
> Thank you for the explanation. I have no idea about the restrictions on
> pageblock and section. Out of curiosity, what prevents the growth of
> the section size?

The section size (and, based on that, the Linux memory block size) defines
- the minimum size in which we can add_memory()
- the alignment requirement in which we can add_memory()

This is applicable
- in physical environments, where the BIOS will decide where to place
  DIMMs/NVDIMMs. The coarser the granularity, the less memory we might be
  able to make use of in corner cases.
- in virtualized environments, where we want to add memory in fairly small
  granularity. The coarser the granularity, the less flexibility we have.

arm64 has a section size of 1GB (and a THP/MAX_ORDER - 1 size of 512MB with
64k base pages :/ ). That already turned out to be a problem - see [1]
regarding thoughts on how to shrink the section size. I once read about
thoughts of switching to 2MB THP on arm64 with any base page size; not sure
if that will become real at some point (and we might be able to reduce the
pageblock size there as well ...)

[1] https://lkml.kernel.org/r/AM6PR08MB40690714A2E77A7128B2B2ADF7700@AM6PR08MB4069.eurprd08.prod.outlook.com

> —
> Best Regards,
> Yan Zi
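The memory block size mentioned above is visible from user space on kernels
built with memory hotplug support; a tiny, hypothetical helper to print it
(again, not part of the patch set) could look like this:

```c
#include <stdio.h>

int main(void)
{
	/*
	 * The memory block size (a multiple of the section size) is the
	 * granularity memory hot(un)plug operates in; the sysfs file holds
	 * a hex value, e.g. 8000000 for 128MiB on a typical x86_64 machine.
	 */
	FILE *f = fopen("/sys/devices/system/memory/block_size_bytes", "r");
	unsigned long long bytes;

	if (!f || fscanf(f, "%llx", &bytes) != 1) {
		perror("block_size_bytes");
		return 1;
	}
	fclose(f);

	printf("memory block size: %llu MiB\n", bytes >> 20);
	return 0;
}
```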
From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset adds support for 1GB THP on x86_64. It is on top of
v5.9-rc2-mmots-2020-08-25-21-13.

1GB THP is more flexible for reducing translation overhead and increasing the
performance of applications with large memory footprints, without requiring
application changes, compared to hugetlb.

Design
=======

The 1GB THP implementation looks similar to the existing THP code except for
some new designs for the additional page table level.

1. Page table deposit and withdraw using a new pagechain data structure:
   instead of one PTE page table page, a 1GB THP requires 513 page table pages
   (one PMD page table page and 512 PTE page table pages) to be deposited at
   page allocation time, so that we can split the page later. Currently, the
   page table deposit uses ->lru, thus only one page can be deposited. A new
   pagechain data structure is added to enable multi-page deposit.

2. Triple-mapped 1GB THP: a 1GB THP can be mapped by a combination of PUD,
   PMD, and PTE entries. Mixing PUD and PTE mappings can be achieved with the
   existing PageDoubleMap mechanism. To add PMD mappings, PMDPageInPUD and
   sub_compound_mapcount are introduced. PMDPageInPUD is the 512-aligned base
   page in a 1GB THP and sub_compound_mapcount counts the PMD mappings by
   using page[N*512 + 3].compound_mapcount.

3. Using CMA allocation for 1GB THP: instead of bumping MAX_ORDER, it is more
   sane to use something less intrusive. So all 1GB THPs are allocated from
   reserved CMA areas shared with hugetlb. At page splitting time, the bitmap
   for the 1GB THP is cleared as the resulting pages can be freed via the
   normal page free path. We can fall back to alloc_contig_pages for 1GB THP
   if necessary.

Patch Organization
=======

Patch 01 adds the new pagechain data structure.

Patch 02 to 13 add 1GB THP support in various places.

Patch 14 tries to use alloc_contig_pages for 1GB THP allocation.

Patch 15 moves the hugetlb_cma reservation to cma.c and renames it to
hugepage_cma.

Patch 16 uses the hugepage_cma reservation for 1GB THP allocation.

Any suggestions and comments are welcome.

Zi Yan (16):
  mm: add pagechain container for storing multiple pages.
  mm: thp: 1GB anonymous page implementation.
  mm: proc: add 1GB THP kpageflag.
  mm: thp: 1GB THP copy on write implementation.
  mm: thp: handling 1GB THP reference bit.
  mm: thp: add 1GB THP split_huge_pud_page() function.
  mm: stats: make smap stats understand PUD THPs.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: 1GB THP support in try_to_unmap().
  mm: thp: split 1GB THPs at page reclaim.
  mm: thp: 1GB THP follow_p*d_page() support.
  mm: support 1GB THP pagemap support.
  mm: thp: add a knob to enable/disable 1GB THPs.
  mm: page_alloc: >=MAX_ORDER pages allocation an deallocation.
  hugetlb: cma: move cma reserve function to cma.c.
  mm: thp: use cma reservation for pud thp allocation.
.../admin-guide/kernel-parameters.txt | 2 +- arch/arm64/mm/hugetlbpage.c | 2 +- arch/powerpc/mm/hugetlbpage.c | 2 +- arch/x86/include/asm/pgalloc.h | 68 ++ arch/x86/include/asm/pgtable.h | 26 + arch/x86/kernel/setup.c | 8 +- arch/x86/mm/pgtable.c | 38 + drivers/base/node.c | 3 + fs/proc/meminfo.c | 2 + fs/proc/page.c | 2 + fs/proc/task_mmu.c | 122 ++- include/linux/cma.h | 18 + include/linux/huge_mm.h | 84 +- include/linux/hugetlb.h | 12 - include/linux/memcontrol.h | 5 + include/linux/mm.h | 29 +- include/linux/mm_types.h | 1 + include/linux/mmu_notifier.h | 13 + include/linux/mmzone.h | 1 + include/linux/page-flags.h | 47 + include/linux/pagechain.h | 73 ++ include/linux/pgtable.h | 34 + include/linux/rmap.h | 10 +- include/linux/swap.h | 2 + include/linux/vm_event_item.h | 7 + include/uapi/linux/kernel-page-flags.h | 2 + kernel/events/uprobes.c | 4 +- kernel/fork.c | 5 + mm/cma.c | 119 +++ mm/gup.c | 60 +- mm/huge_memory.c | 939 +++++++++++++++++- mm/hugetlb.c | 114 +-- mm/internal.h | 2 + mm/khugepaged.c | 6 +- mm/ksm.c | 4 +- mm/memcontrol.c | 13 + mm/memory.c | 51 +- mm/mempolicy.c | 21 +- mm/migrate.c | 12 +- mm/page_alloc.c | 57 +- mm/page_vma_mapped.c | 129 ++- mm/pgtable-generic.c | 56 ++ mm/rmap.c | 289 ++++-- mm/swap.c | 31 + mm/swap_slots.c | 2 + mm/swapfile.c | 8 +- mm/userfaultfd.c | 2 +- mm/util.c | 16 +- mm/vmscan.c | 58 +- mm/vmstat.c | 8 + 50 files changed, 2270 insertions(+), 349 deletions(-) create mode 100644 include/linux/pagechain.h -- 2.28.0