
[v1,08/11] mm/sparse-vmemmap: use hugepages for PUD compound pagemaps

Message ID 20210325230938.30752-9-joao.m.martins@oracle.com (mailing list archive)
State New, archived
Series mm, sparse-vmemmap: Introduce compound pagemaps

Commit Message

Joao Martins March 25, 2021, 11:09 p.m. UTC
Right now basepages are used to populate the PUD tail pages, and it
picks the address of the previous page of the subsection that preceeds
the memmap we are initializing.  This is done when a given memmap
address isn't aligned to the pgmap @align (which is safe to do because
@ranges are guaranteed to be aligned to @align).

For pagemaps with an align which spans various sections, this means
that PMD pages are unnecessarily allocated to reuse the same tail
pages.  Effectively, on x86 a PUD can span 8 sections (depending on
config), and a page is allocated for each PMD just to reuse the tail
vmemmap across the rest of the PTEs. In short, all the PTEs under the
PMDs covering the tail vmemmap areas contain the same PFN. So instead,
populate a new PMD on the second section of the compound page (the
tail vmemmap PMD), and then have the following sections reuse that
preceding PMD, which maps only tail pages.

With this scheme, for a 1GB pagemap-aligned area, the first PMD
(section) would contain the head page and 32767 tail pages, while the
second PMD contains the full 32768 tail pages.  The latter's page
table page then gets reused by the PMDs mapping the remaining sections
of the same pagemap.

Besides allocating fewer page table pages, and keeping parity with
hugepages in the directmap (as done by vmemmap_populate_hugepages()),
this further increases savings per compound page. For each PUD-aligned
pagemap we go from 40960 bytes down to 16384 bytes. Rather than
requiring 8 PMD page allocations we only need 2 (plus two base pages
allocated for the head and tail areas of the first PMD). 2M pages
still require using base pages, though.
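
A rough breakdown of where those numbers come from (assuming x86-64
defaults: 4K base pages, 2M PMDs, 128M sections and 64 bytes per
struct page, so the memmap of a 1G compound page spans 8 sections,
i.e. 8 PMD-sized vmemmap areas):

  before: 8 PTE table pages + 2 data pages (head, tail) = 10 * 4096 = 40960 bytes
  after:  2 PTE table pages + 2 data pages (head, tail) =  4 * 4096 = 16384 bytes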

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h  |  3 +-
 mm/sparse-vmemmap.c | 79 ++++++++++++++++++++++++++++++++++-----------
 2 files changed, 63 insertions(+), 19 deletions(-)

Comments

Dan Williams June 1, 2021, 7:30 p.m. UTC | #1
Sorry for the delay, and the sync up on IRC. I think this subject
needs to change to "optimize memory savings for compound pud memmap
geometry", and move this to the back of the series to make it clear
which patches are base functionality and which extend the idea
further. As far as I can see this patch can move to the end of the
series. Some additional changelog feedback below:


On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> Right now basepages are used to populate the PUD tail pages, and it
> picks the address of the previous page of the subsection that preceeds

s/preceeds/precedes/

> the memmap we are initializing.  This is done when a given memmap
> address isn't aligned to the pgmap @align (which is safe to do because
> @ranges are guaranteed to be aligned to @align).

You know what would help is if you could draw an ascii art picture of
the before and after of the head page vs tail page arrangement. I can
see how the words line up to the code, but it takes a while to get the
picture in my head and I think future work in this area will benefit
from having a place in Documentation that draws a picture of the
layout of the various geometries.

I've used asciiflow.com for these types of diagrams in the past.

>
> For pagemaps with an align which spans various sections, this means
> that PMD pages are unnecessarily allocated for reusing the same tail
> pages.  Effectively, on x86 a PUD can span 8 sections (depending on
> config), and a page is being  allocated a page for the PMD to reuse
> the tail vmemmap across the rest of the PTEs. In short effecitvely the
> PMD cover the tail vmemmap areas all contain the same PFN. So instead
> of doing this way, populate a new PMD on the second section of the
> compound page (tail vmemmap PMD), and then the following sections
> utilize the preceding PMD we previously populated which only contain
> tail pages).
>
> After this scheme for an 1GB pagemap aligned area, the first PMD
> (section) would contain head page and 32767 tail pages, where the
> second PMD contains the full 32768 tail pages.  The latter page gets
> its PMD reused across future section mapping of the same pagemap.
>
> Besides fewer pagetable entries allocated, keeping parity with
> hugepages in the directmap (as done by vmemmap_populate_hugepages()),
> this further increases savings per compound page. For each PUD-aligned
> pagemap we go from 40960 bytes down to 16384 bytes. Rather than
> requiring 8 PMD page allocations we only need 2 (plus two base pages
> allocated for head and tail areas for the first PMD). 2M pages still
> require using base pages, though.

I would front load the above savings to the top of this discussion and
say something like:

"Currently, for compound PUD mappings, the implementation consumes X
GB per TB; this can be optimized to Y GB per TB with the approach
detailed below."

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/mm.h  |  3 +-
>  mm/sparse-vmemmap.c | 79 ++++++++++++++++++++++++++++++++++-----------
>  2 files changed, 63 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 49d717ae40ae..9c1a676d6b95 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3038,7 +3038,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
>  pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
> -pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
> +pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
> +                           void *block);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>                             struct vmem_altmap *altmap, void *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index f57c5eada099..291a8a32480a 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -172,13 +172,20 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
>         return p;
>  }
>
> -pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
> +pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
> +                                      void *block)
>  {
>         pmd_t *pmd = pmd_offset(pud, addr);
>         if (pmd_none(*pmd)) {
> -               void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
> -               if (!p)
> -                       return NULL;
> +               void *p = block;
> +
> +               if (!block) {
> +                       p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
> +                       if (!p)
> +                               return NULL;
> +               } else {
> +                       get_page(virt_to_page(block));
> +               }
>                 pmd_populate_kernel(&init_mm, pmd, p);
>         }
>         return pmd;
> @@ -220,15 +227,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>         return pgd;
>  }
>
> -static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                             struct vmem_altmap *altmap,
> -                                             void *page, void **ptr)
> +static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
> +                                                 struct vmem_altmap *altmap,
> +                                                 void *page, pmd_t **ptr)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd;
> -       pte_t *pte;
>
>         pgd = vmemmap_pgd_populate(addr, node);
>         if (!pgd)
> @@ -239,9 +245,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>         pud = vmemmap_pud_populate(p4d, addr, node);
>         if (!pud)
>                 return -ENOMEM;
> -       pmd = vmemmap_pmd_populate(pud, addr, node);
> +       pmd = vmemmap_pmd_populate(pud, addr, node, page);
>         if (!pmd)
>                 return -ENOMEM;
> +       if (ptr)
> +               *ptr = pmd;
> +       return 0;
> +}
> +
> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> +                                             struct vmem_altmap *altmap,
> +                                             void *page, void **ptr)
> +{
> +       pmd_t *pmd;
> +       pte_t *pte;
> +
> +       if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
> +               return -ENOMEM;
> +
>         pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>         if (!pte)
>                 return -ENOMEM;
> @@ -285,13 +306,26 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>         return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
>  }
>
> -static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> +static int __meminit vmemmap_populate_pmd_range(unsigned long start,
> +                                               unsigned long end,
> +                                               int node, void *page)
> +{
> +       unsigned long addr = start;
> +
> +       for (; addr < end; addr += PMD_SIZE) {
> +               if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static pmd_t * __meminit vmemmap_lookup_address(unsigned long addr)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
>         pud_t *pud;
>         pmd_t *pmd;
> -       pte_t *pte;
>
>         pgd = pgd_offset_k(addr);
>         if (pgd_none(*pgd))
> @@ -309,11 +343,7 @@ static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>         if (pmd_none(*pmd))
>                 return NULL;
>
> -       pte = pte_offset_kernel(pmd, addr);
> -       if (pte_none(*pte))
> -               return NULL;
> -
> -       return pte;
> +       return pmd;
>  }
>
>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
> @@ -335,9 +365,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>         offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>         if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
>             pgmap_align(pgmap) > SUBSECTION_SIZE) {
> -               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
> +               pmd_t *pmdp;
> +               pte_t *ptep;
> +
> +               addr = start - PAGE_SIZE;
> +               pmdp = vmemmap_lookup_address(addr);
> +               if (!pmdp)
> +                       return -ENOMEM;
> +
> +               /* Reuse the tail pages vmemmap pmd page */

This comment really wants to be "See layout diagram in
Documentation/vm/compound_pagemaps.rst", because the reuse algorithm
is not immediately obvious.

Other than that the implementation looks ok to me, modulo previous
comments about @block type and the use of the "geometry" term.
Joao Martins June 7, 2021, 12:02 p.m. UTC | #2
On 6/1/21 8:30 PM, Dan Williams wrote:
> Sorry for the delay, and the sync up on IRC. I think this subject
> needs to change to "optimize memory savings for compound pud memmap
> geometry", and move this to the back of the series to make it clear
> which patches are base functionality and which extend the idea
> further. 

OK

> As far as I can see this patch can move to the end of the
> series. 

Maybe it's preferred that this patch could be deferred out of the series as
a followup improvement, leaving this series with the base feature set only?

> Some additional changelog feedback below:
> 
> 
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Right now basepages are used to populate the PUD tail pages, and it
>> picks the address of the previous page of the subsection that preceeds
> 
> s/preceeds/precedes/
> 
Yeap.

>> the memmap we are initializing.  This is done when a given memmap
>> address isn't aligned to the pgmap @align (which is safe to do because
>> @ranges are guaranteed to be aligned to @align).
> 
> You know what would help is if you could draw an ascii art picture of
> the before and after of the head page vs tail page arrangement. I can
> see how the words line up to the code, but it takes a while to get the
> picture in my head and I think future work in this area will benefit
> from having a place in Documentation that draws a picture of the
> layout of the various geometries.
> 
Makes sense, I will add docs. Mike K. and others had similar trouble following
the page structs arrangement, which ultimately led to this section of commentary
at the beginning of the new source file added here ...

https://www.ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-free-the-vmemmap-pages-associated-with-each-hugetlb-page.patch

> I've used asciiflow.com for these types of diagrams in the past.
>

... so perhaps I can borrow some of that and place it in
a common place like Documentation/vm/compound_pagemaps.rst

This patch specifically would need a new diagram added on top covering
the PMD page case.
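
Roughly something along these lines, I think (1G/PUD-sized compound
page on x86-64, i.e. 16M of memmap spanning 8 sections of 2M of
vmemmap each); I'd double check it against the code before writing
it up:

  section 0 PMD      --> PTE table #0 --> pte 0       --> vmemmap page with the head page
                                      --> ptes 1..511 --> shared tail vmemmap page

  section 1 PMD      --> PTE table #1 --> ptes 0..511 --> shared tail vmemmap page
  section 2..7 PMDs  --> PTE table #1 (reused, one get_page() per reusing PMD)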

>>
>> For pagemaps with an align which spans various sections, this means
>> that PMD pages are unnecessarily allocated for reusing the same tail
>> pages.  Effectively, on x86 a PUD can span 8 sections (depending on
>> config), and a page is being  allocated a page for the PMD to reuse
>> the tail vmemmap across the rest of the PTEs. In short effecitvely the
>> PMD cover the tail vmemmap areas all contain the same PFN. So instead
>> of doing this way, populate a new PMD on the second section of the
>> compound page (tail vmemmap PMD), and then the following sections
>> utilize the preceding PMD we previously populated which only contain
>> tail pages).
>>
>> After this scheme for an 1GB pagemap aligned area, the first PMD
>> (section) would contain head page and 32767 tail pages, where the
>> second PMD contains the full 32768 tail pages.  The latter page gets
>> its PMD reused across future section mapping of the same pagemap.
>>
>> Besides fewer pagetable entries allocated, keeping parity with
>> hugepages in the directmap (as done by vmemmap_populate_hugepages()),
>> this further increases savings per compound page. For each PUD-aligned
>> pagemap we go from 40960 bytes down to 16384 bytes. Rather than
>> requiring 8 PMD page allocations we only need 2 (plus two base pages
>> allocated for head and tail areas for the first PMD). 2M pages still
>> require using base pages, though.
> 
> I would front load the above savings to the top of this discussion and
> say something like:
> 
> "Currently, for compound PUD mappings, the implementation consumes X
> GB per TB; this can be optimized to Y GB per TB with the approach
> detailed below."
> 
Got it.
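
Plugging in the numbers (assuming 64 bytes per struct page and 4K base
pages), the framing would read roughly as:

  current scheme: 40960 bytes per 1G compound page  ->  ~40 MB per TB
  this patch:     16384 bytes per 1G compound page  ->  ~16 MB per TB
  (versus ~16 GB per TB with no vmemmap deduplication at all)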

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/mm.h  |  3 +-
>>  mm/sparse-vmemmap.c | 79 ++++++++++++++++++++++++++++++++++-----------
>>  2 files changed, 63 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 49d717ae40ae..9c1a676d6b95 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3038,7 +3038,8 @@ struct page * __populate_section_memmap(unsigned long pfn,
>>  pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
>>  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>> -pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>> +pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
>> +                           void *block);
>>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>>                             struct vmem_altmap *altmap, void *block);
>>  void *vmemmap_alloc_block(unsigned long size, int node);
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index f57c5eada099..291a8a32480a 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -172,13 +172,20 @@ static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
>>         return p;
>>  }
>>
>> -pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
>> +pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
>> +                                      void *block)
>>  {
>>         pmd_t *pmd = pmd_offset(pud, addr);
>>         if (pmd_none(*pmd)) {
>> -               void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>> -               if (!p)
>> -                       return NULL;
>> +               void *p = block;
>> +
>> +               if (!block) {
>> +                       p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
>> +                       if (!p)
>> +                               return NULL;
>> +               } else {
>> +                       get_page(virt_to_page(block));
>> +               }
>>                 pmd_populate_kernel(&init_mm, pmd, p);
>>         }
>>         return pmd;
>> @@ -220,15 +227,14 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>>         return pgd;
>>  }
>>
>> -static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> -                                             struct vmem_altmap *altmap,
>> -                                             void *page, void **ptr)
>> +static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
>> +                                                 struct vmem_altmap *altmap,
>> +                                                 void *page, pmd_t **ptr)
>>  {
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>>         pud_t *pud;
>>         pmd_t *pmd;
>> -       pte_t *pte;
>>
>>         pgd = vmemmap_pgd_populate(addr, node);
>>         if (!pgd)
>> @@ -239,9 +245,24 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>>         pud = vmemmap_pud_populate(p4d, addr, node);
>>         if (!pud)
>>                 return -ENOMEM;
>> -       pmd = vmemmap_pmd_populate(pud, addr, node);
>> +       pmd = vmemmap_pmd_populate(pud, addr, node, page);
>>         if (!pmd)
>>                 return -ENOMEM;
>> +       if (ptr)
>> +               *ptr = pmd;
>> +       return 0;
>> +}
>> +
>> +static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> +                                             struct vmem_altmap *altmap,
>> +                                             void *page, void **ptr)
>> +{
>> +       pmd_t *pmd;
>> +       pte_t *pte;
>> +
>> +       if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
>> +               return -ENOMEM;
>> +
>>         pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>>         if (!pte)
>>                 return -ENOMEM;
>> @@ -285,13 +306,26 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>>         return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
>>  }
>>
>> -static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>> +static int __meminit vmemmap_populate_pmd_range(unsigned long start,
>> +                                               unsigned long end,
>> +                                               int node, void *page)
>> +{
>> +       unsigned long addr = start;
>> +
>> +       for (; addr < end; addr += PMD_SIZE) {
>> +               if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
>> +                       return -ENOMEM;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static pmd_t * __meminit vmemmap_lookup_address(unsigned long addr)
>>  {
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>>         pud_t *pud;
>>         pmd_t *pmd;
>> -       pte_t *pte;
>>
>>         pgd = pgd_offset_k(addr);
>>         if (pgd_none(*pgd))
>> @@ -309,11 +343,7 @@ static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>>         if (pmd_none(*pmd))
>>                 return NULL;
>>
>> -       pte = pte_offset_kernel(pmd, addr);
>> -       if (pte_none(*pte))
>> -               return NULL;
>> -
>> -       return pte;
>> +       return pmd;
>>  }
>>
>>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> @@ -335,9 +365,22 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>>         offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>>         if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
>>             pgmap_align(pgmap) > SUBSECTION_SIZE) {
>> -               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
>> +               pmd_t *pmdp;
>> +               pte_t *ptep;
>> +
>> +               addr = start - PAGE_SIZE;
>> +               pmdp = vmemmap_lookup_address(addr);
>> +               if (!pmdp)
>> +                       return -ENOMEM;
>> +
>> +               /* Reuse the tail pages vmemmap pmd page */
> 
> This comment really wants to be "See layout diagram in
> Documentation/vm/compound_pagemaps.rst", because the reuse algorithm
> is not immediately obvious.
>
ACK

> Other than that the implementation looks ok to me, modulo previous
> comments about @block type and the use of the "geometry" term.
> 
OK.

Btw, speaking of geometry, could you have a look at this thread:

https://lore.kernel.org/linux-mm/8c922a58-c901-1ad9-5d19-1182bd6dea1e@oracle.com/

.. and let me know what you think?
Dan Williams June 7, 2021, 7:47 p.m. UTC | #3
On Mon, Jun 7, 2021 at 5:03 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
>
>
> On 6/1/21 8:30 PM, Dan Williams wrote:
> > Sorry for the delay, and the sync up on IRC. I think this subject
> > needs to change to "optimize memory savings for compound pud memmap
> > geometry", and move this to the back of the series to make it clear
> > which patches are base functionality and which extend the idea
> > further.
>
> OK
>
> > As far as I can see this patch can move to the end of the
> > series.
>
> Maybe it's preferred that this patch could be deferred out of the series as
> a followup improvement, leaving this series with the base feature set only?

My preference is to keep it in, just clarify that it's an optimization,
and make sure it does not come before any fundamental patches that
implement the base support.

> > Some additional changelog feedback below:
> >
> >
> > On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> Right now basepages are used to populate the PUD tail pages, and it
> >> picks the address of the previous page of the subsection that preceeds
> >
> > s/preceeds/precedes/
> >
> Yeap.
>
> >> the memmap we are initializing.  This is done when a given memmap
> >> address isn't aligned to the pgmap @align (which is safe to do because
> >> @ranges are guaranteed to be aligned to @align).
> >
> > You know what would help is if you could draw an ascii art picture of
> > the before and after of the head page vs tail page arrangement. I can
> > see how the words line up to the code, but it takes a while to get the
> > picture in my head and I think future work in this area will benefit
> > from having a place in Documentation that draws a picture of the
> > layout of the various geometries.
> >
> Makes sense, I will add docs. Mike K. and others had similar trouble following
> the page structs arrangement, which ultimately led to this section of commentary
> at the beginning of the new source file added here ...
>
> https://www.ozlabs.org/~akpm/mmotm/broken-out/mm-hugetlb-free-the-vmemmap-pages-associated-with-each-hugetlb-page.patch

Ah, looks good, but that really belongs in Documentation/ not a comment block.

>
> > I've used asciiflow.com for these types of diagrams in the past.
> >
>
> ... so perhaps I can borrow some of that and place it in
> a common place like Documentation/vm/compound_pagemaps.rst
>
> This patch specifically would need a new diagram added on top covering
> the PMD page case.

Sounds good.

[..]
> > Other than that the implementation looks ok to me, modulo previous
> > comments about @block type and the use of the "geometry" term.
> >
> OK.
>
> Btw, speaking of geometry, could you have a look at this thread:
>
> https://lore.kernel.org/linux-mm/8c922a58-c901-1ad9-5d19-1182bd6dea1e@oracle.com/
>
> .. and let me know what you think?

Ok, will take a look

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 49d717ae40ae..9c1a676d6b95 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3038,7 +3038,8 @@  struct page * __populate_section_memmap(unsigned long pfn,
 pgd_t *vmemmap_pgd_populate(unsigned long addr, int node);
 p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
-pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
+pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+			    void *block);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
 			    struct vmem_altmap *altmap, void *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index f57c5eada099..291a8a32480a 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -172,13 +172,20 @@  static void * __meminit vmemmap_alloc_block_zero(unsigned long size, int node)
 	return p;
 }
 
-pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
+pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node,
+				       void *block)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
-		if (!p)
-			return NULL;
+		void *p = block;
+
+		if (!block) {
+			p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
+			if (!p)
+				return NULL;
+		} else {
+			get_page(virt_to_page(block));
+		}
 		pmd_populate_kernel(&init_mm, pmd, p);
 	}
 	return pmd;
@@ -220,15 +227,14 @@  pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 	return pgd;
 }
 
-static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap,
-					      void *page, void **ptr)
+static int __meminit vmemmap_populate_pmd_address(unsigned long addr, int node,
+						  struct vmem_altmap *altmap,
+						  void *page, pmd_t **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = vmemmap_pgd_populate(addr, node);
 	if (!pgd)
@@ -239,9 +245,24 @@  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pud = vmemmap_pud_populate(p4d, addr, node);
 	if (!pud)
 		return -ENOMEM;
-	pmd = vmemmap_pmd_populate(pud, addr, node);
+	pmd = vmemmap_pmd_populate(pud, addr, node, page);
 	if (!pmd)
 		return -ENOMEM;
+	if (ptr)
+		*ptr = pmd;
+	return 0;
+}
+
+static int __meminit vmemmap_populate_address(unsigned long addr, int node,
+					      struct vmem_altmap *altmap,
+					      void *page, void **ptr)
+{
+	pmd_t *pmd;
+	pte_t *pte;
+
+	if (vmemmap_populate_pmd_address(addr, node, altmap, NULL, &pmd))
+		return -ENOMEM;
+
 	pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
 	if (!pte)
 		return -ENOMEM;
@@ -285,13 +306,26 @@  static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
 	return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
 }
 
-static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
+static int __meminit vmemmap_populate_pmd_range(unsigned long start,
+						unsigned long end,
+						int node, void *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PMD_SIZE) {
+		if (vmemmap_populate_pmd_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static pmd_t * __meminit vmemmap_lookup_address(unsigned long addr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
 
 	pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd))
@@ -309,11 +343,7 @@  static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
 	if (pmd_none(*pmd))
 		return NULL;
 
-	pte = pte_offset_kernel(pmd, addr);
-	if (pte_none(*pte))
-		return NULL;
-
-	return pte;
+	return pmd;
 }
 
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
@@ -335,9 +365,22 @@  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
 	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
 	if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
 	    pgmap_align(pgmap) > SUBSECTION_SIZE) {
-		pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+		pmd_t *pmdp;
+		pte_t *ptep;
+
+		addr = start - PAGE_SIZE;
+		pmdp = vmemmap_lookup_address(addr);
+		if (!pmdp)
+			return -ENOMEM;
+
+		/* Reuse the tail pages vmemmap pmd page */
+		if (offset % pgmap->align > PFN_PHYS(PAGES_PER_SECTION))
+			return vmemmap_populate_pmd_range(start, end, node,
+						page_to_virt(pmd_page(*pmdp)));
 
-		if (!ptep)
+		/* Populate the tail pages vmemmap pmd page */
+		ptep = pte_offset_kernel(pmdp, addr);
+		if (pte_none(*ptep))
 			return -ENOMEM;
 
 		return vmemmap_populate_range(start, end, node,