Message ID: 20210714193542.21857-15-joao.m.martins@oracle.com (mailing list archive)
State:      New
Series:     mm, sparse-vmemmap: Introduce compound pagemaps
On Wed, Jul 14, 2021 at 12:36 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Currently, for compound PUD mappings, the implementation consumes 40MB
> per TB, but it can be optimized to 16MB per TB with the approach
> detailed below.
>
> Right now basepages are used to populate the PUD tail pages, and it
> picks the address of the previous page of the subsection that precedes
> the memmap being initialized. This is done when a given memmap
> address isn't aligned to the pgmap @geometry (which is safe to do
> because @ranges are guaranteed to be aligned to @geometry).
>
> For pagemaps with an alignment that spans several sections, this means
> that PMD pages are unnecessarily allocated for reusing the same tail
> pages. Effectively, on x86 a PUD can span 8 sections (depending on
> config), and a page is allocated for each PMD to reuse the tail vmemmap
> across the rest of the PTEs. In short, the PMDs covering the tail
> vmemmap areas all map the same PFNs. So instead, populate a new PMD on
> the second section of the compound page (the tail vmemmap PMD), and
> have the following sections reuse that previously populated PMD, which
> contains only tail pages.
>
> With this scheme, for a 1GB pagemap-aligned area, the first PMD
> (section) contains the head page and 32767 tail pages, while the
> second PMD contains the full 32768 tail pages. The latter's PMD page
> is reused across future section mappings of the same pagemap.
>
> Besides allocating fewer pagetable entries and keeping parity with
> hugepages in the directmap (as done by vmemmap_populate_hugepages()),
> this further increases savings per compound page. Rather than
> requiring 8 PMD page allocations, we only need 2 (plus two base pages
> allocated for the head and tail areas of the first PMD). 2M pages
> still require using base pages, though.

This looks good to me now, modulo the tail_page helper discussed
previously.
Thanks for the diagram, makes it clearer what's happening.

I don't see any red flags that would prevent a reviewed-by when you
send the next spin.

> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  Documentation/vm/vmemmap_dedup.rst | 109 +++++++++++++++++++++++++++++
>  include/linux/mm.h                 |   3 +-
>  mm/sparse-vmemmap.c                |  74 +++++++++++++++++---
>  3 files changed, 174 insertions(+), 12 deletions(-)
On 7/28/21 9:03 PM, Dan Williams wrote:
> This looks good to me now, modulo the tail_page helper discussed
> previously. Thanks for the diagram, makes it clearer what's happening.
>
> I don't see any red flags that would prevent a reviewed-by when you
> send the next spin.

Cool, thanks!
diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/vm/vmemmap_dedup.rst
index 42830a667c2a..96d9f5f0a497 100644
--- a/Documentation/vm/vmemmap_dedup.rst
+++ b/Documentation/vm/vmemmap_dedup.rst
@@ -189,3 +189,112 @@ at a later stage when we populate the sections.
 It only use 3 page structs for storing all information as opposed
 to 4 on HugeTLB pages. This does not affect memory savings between both.
 
+Additionally, it further extends the tail page deduplication to 1GB
+device-dax compound pages.
+
+E.g.: A 1G device-dax page on x86_64 consists of 4096 page frames, split
+across 8 PMD page frames, with the first PMD having 2 PTE page frames.
+In total this represents 40960 bytes per 1GB page.
+
+Here is how things look after the previously described tail page deduplication
+technique.
+
+ device-dax      page frames  struct pages(4096 pages)   page frame(2 pages)
+ +-----------+ -> +----------+ --> +-----------+ mapping to +-------------+
+ |           |    |    0     |     |     0     | ---------> |      0      |
+ |           |    +----------+     +-----------+            +-------------+
+ |           |                     |     1     | ---------> |      1      |
+ |           |                     +-----------+            +-------------+
+ |           |                     |     2     | ------------^ ^ ^ ^ ^ ^ ^
+ |           |                     +-----------+               | | | | | |
+ |           |                     |     3     | --------------+ | | | | |
+ |           |                     +-----------+                 | | | | |
+ |           |                     |     4     | ----------------+ | | | |
+ |   PMD 0   |                     +-----------+                   | | | |
+ |           |                     |     5     | ------------------+ | | |
+ |           |                     +-----------+                     | | |
+ |           |                     |    ..     | --------------------+ | |
+ |           |                     +-----------+                       | |
+ |           |                     |    511    | ----------------------+ |
+ |           |                     +-----------+                         |
+ |           |                                                           |
+ |           |                                                           |
+ |           |                                                           |
+ +-----------+                     page frames                           |
+ +-----------+ -> +----------+ --> +-----------+ mapping to              |
+ |           |    |  1 .. 7  |     |    512    | ------------------------+
+ |           |    +----------+     +-----------+                         |
+ |           |                     |    ..     | ------------------------+
+ |           |                     +-----------+                         |
+ |           |                     |    ..     | ------------------------+
+ |           |                     +-----------+                         |
+ |           |                     |    ..     | ------------------------+
+ |           |                     +-----------+                         |
+ |           |                     |    ..     | ------------------------+
+ |    PMD    |                     +-----------+                         |
+ |  1 .. 7   |                     |    ..     | ------------------------+
+ |           |                     +-----------+                         |
+ |           |                     |    ..     | ------------------------+
+ |           |                     +-----------+                         |
+ |           |                     |   4095    | ------------------------+
+ +-----------+                     +-----------+
+
+Page frames of PMD 1 through 7 are allocated and mapped to the same PTE page frame
+that stores the tail pages. As we can see in the diagram, PMDs 1 through 7
+all look the same. Therefore we can map PMDs 2 through 7 to PMD 1's page frame.
+This allows freeing 6 vmemmap pages per 1GB page, decreasing the overhead per
+1GB page from 40960 bytes to 16384 bytes.
+
+Here is how things look after PMD tail page deduplication.
+
+ device-dax      page frames  struct pages(4096 pages)   page frame(2 pages)
+ +-----------+ -> +----------+ --> +-----------+ mapping to +-------------+
+ |           |    |    0     |     |     0     | ---------> |      0      |
+ |           |    +----------+     +-----------+            +-------------+
+ |           |                     |     1     | ---------> |      1      |
+ |           |                     +-----------+            +-------------+
+ |           |                     |     2     | ------------^ ^ ^ ^ ^ ^ ^
+ |           |                     +-----------+               | | | | | |
+ |           |                     |     3     | --------------+ | | | | |
+ |           |                     +-----------+                 | | | | |
+ |           |                     |     4     | ----------------+ | | | |
+ |   PMD 0   |                     +-----------+                   | | | |
+ |           |                     |     5     | ------------------+ | | |
+ |           |                     +-----------+                     | | |
+ |           |                     |    ..     | --------------------+ | |
+ |           |                     +-----------+                       | |
+ |           |                     |    511    | ----------------------+ |
+ |           |                     +-----------+                         |
+ |           |                                                           |
+ |           |                                                           |
+ |           |                                                           |
+ +-----------+                     page frames                           |
+ +-----------+ -> +----------+ --> +-----------+ mapping to              |
+ |           |    |    1     |     |    512    | ------------------------+
+ |           |    +----------+     +-----------+                         |
+ |           |    ^ ^ ^ ^ ^ ^     |    ..     | ------------------------+
+ |           |    | | | | | |     +-----------+                         |
+ |           |    | | | | | |     |    ..     | ------------------------+
+ |           |    | | | | | |     +-----------+                         |
+ |           |    | | | | | |     |    ..     | ------------------------+
+ |           |    | | | | | |     +-----------+                         |
+ |           |    | | | | | |     |    ..     | ------------------------+
+ |   PMD 1   |    | | | | | |     +-----------+                         |
+ |           |    | | | | | |     |    ..     | ------------------------+
+ |           |    | | | | | |     +-----------+                         |
+ |           |    | | | | | |     |    ..     | ------------------------+
+ |           |    | | | | | |     +-----------+                         |
+ |           |    | | | | | |     |   4095    | ------------------------+
+ +-----------+    | | | | | |     +-----------+
+ |   PMD 2   | ---+ | | | | |
+ +-----------+      | | | | |
+ |   PMD 3   | -----+ | | | |
+ +-----------+        | | | |
+ |   PMD 4   | -------+ | | |
+ +-----------+          | | |
+ |   PMD 5   | ---------+ | |
+ +-----------+            | |
+ |   PMD 6   | -----------+ |
+ +-----------+              |
+ |   PMD 7   | -------------+
+ +-----------+
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5e3e153ddd3d..e9dc3e2de7be 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3088,7 +3088,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
-pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
+pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+			    struct page *block);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 			    struct vmem_altmap *altmap, struct page *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a8de6c472999..68041ca9a797 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -537,13 +537,22 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
 	return p;
 }
 
-pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
+pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+				       struct page *block)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
-		if (!p)
-			return NULL;
+		void *p;
+
+		if (!block) {
+			p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+			if (!p)
+				return NULL;
+		} else {
+			/* See comment in vmemmap_pte_populate(). */
+			get_page(block);
+			p = page_to_virt(block);
+		}
 		pmd_populate_kernel(&init_mm, pmd, p);
 	}
 	return pmd;
@@ -585,15 +594,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap,
-					      struct page *reuse, struct page **page)
+static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
+						  struct vmem_altmap *altmap,
+						  struct page *reuse, pmd_t **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = vmemmap_pgd_populate(addr, node);
 	if (!pgd)
@@ -604,9 +612,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pud = vmemmap_pud_populate(p4d, addr, node);
 	if (!pud)
 		return -ENOMEM;
-	pmd = vmemmap_pmd_populate(pud, addr, node);
+	pmd = vmemmap_pmd_populate(pud, addr, node, reuse);
 	if (!pmd)
 		return -ENOMEM;
+	if (ptr)
+		*ptr = pmd;
+	return 0;
+}
+
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap,
+					      struct page *reuse, struct page **page)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+
+	if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
+		return -ENOMEM;
+
 	pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
 	if (!pte)
 		return -ENOMEM;
@@ -650,6 +673,20 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
 	return vmemmap_populate_address(addr, node, NULL, NULL, page);
 }
 
+static int __meminit vmemmap_populate_pmd_range(unsigned long start,
+						unsigned long end,
+						int node, struct page *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PMD_SIZE) {
+		if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 						     unsigned long start,
 						     unsigned long end, int node,
@@ -670,6 +707,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
 	if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
 	    pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
+		pmd_t *pmdp;
 		pte_t *ptep;
 
 		addr = start - PAGE_SIZE;
@@ -681,11 +719,25 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 		 * the previous struct pages are mapped when trying to lookup
 		 * the last tail page.
 		 */
-		ptep = pte_offset_kernel(pmd_off_k(addr), addr);
-		if (!ptep)
+		pmdp = pmd_off_k(addr);
+		if (!pmdp)
+			return -ENOMEM;
+
+		/*
+		 * Reuse the tail pages vmemmap pmd page
+		 * See layout diagram in Documentation/vm/vmemmap_dedup.rst
+		 */
+		if (offset % pgmap_geometry(pgmap) > PFN_PHYS(PAGES_PER_SECTION))
+			return vmemmap_populate_pmd_range(start, end, node,
+							  pmd_page(*pmdp));
+
+		/* See comment above when pmd_off_k() is called. */
+		ptep = pte_offset_kernel(pmdp, addr);
+		if (pte_none(*ptep))
 			return -ENOMEM;
 
 		/*
+		 * Populate the tail pages vmemmap pmd page.
 		 * Reuse the page that was populated in the prior iteration
 		 * with just tail struct pages.
 		 */
Currently, for compound PUD mappings, the implementation consumes 40MB
per TB, but it can be optimized to 16MB per TB with the approach
detailed below.

Right now basepages are used to populate the PUD tail pages, and it
picks the address of the previous page of the subsection that precedes
the memmap being initialized. This is done when a given memmap
address isn't aligned to the pgmap @geometry (which is safe to do
because @ranges are guaranteed to be aligned to @geometry).

For pagemaps with an alignment that spans several sections, this means
that PMD pages are unnecessarily allocated for reusing the same tail
pages. Effectively, on x86 a PUD can span 8 sections (depending on
config), and a page is allocated for each PMD to reuse the tail vmemmap
across the rest of the PTEs. In short, the PMDs covering the tail
vmemmap areas all map the same PFNs. So instead, populate a new PMD on
the second section of the compound page (the tail vmemmap PMD), and
have the following sections reuse that previously populated PMD, which
contains only tail pages.

With this scheme, for a 1GB pagemap-aligned area, the first PMD
(section) contains the head page and 32767 tail pages, while the
second PMD contains the full 32768 tail pages. The latter's PMD page
is reused across future section mappings of the same pagemap.

Besides allocating fewer pagetable entries and keeping parity with
hugepages in the directmap (as done by vmemmap_populate_hugepages()),
this further increases savings per compound page. Rather than
requiring 8 PMD page allocations, we only need 2 (plus two base pages
allocated for the head and tail areas of the first PMD). 2M pages
still require using base pages, though.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 Documentation/vm/vmemmap_dedup.rst | 109 +++++++++++++++++++++++++++++
 include/linux/mm.h                 |   3 +-
 mm/sparse-vmemmap.c                |  74 +++++++++++++++++---
 3 files changed, 174 insertions(+), 12 deletions(-)