
[v1,07/11] mm/sparse-vmemmap: populate compound pagemaps

Message ID 20210325230938.30752-8-joao.m.martins@oracle.com (mailing list archive)
State New, archived
Series mm, sparse-vmemmap: Introduce compound pagemaps

Commit Message

Joao Martins March 25, 2021, 11:09 p.m. UTC
A compound pagemap is a dev_pagemap with @align > PAGE_SIZE, which
means that pages are mapped at a given huge page alignment and use
compound pages as opposed to order-0 pages.

To minimize struct page overhead we take advantage of the fact that
most tail pages look the same (except the first two). We allocate a
separate page for the vmemmap area which contains the head page, and a
separate one for the next 64 pages. The rest of the subsections then
reuse this tail vmemmap page to initialize the rest of the tail pages.

Sections are arch-dependent (e.g. on x86 they are 64M, 128M or 512M) and
when initializing a compound pagemap with a big enough @align (e.g. 1G
PUD) it will cross various sections. To be able to reuse tail pages
across sections belonging to the same gigantic page, we fetch the
@range being mapped (nr_ranges + 1). If the section being mapped is
not at offset 0 of the @align, then look up the PFN of the struct page
address that precedes it and use that to populate the entire
section.

On compound pagemaps with 2M align, this mechanism lets 6 pages be saved
out of the 8 PFNs necessary to map the subsection's 512 struct pages.
On a 1G compound pagemap it saves 4094 pages.
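
For reference, a minimal sketch of the page arithmetic behind these
numbers (illustrative only, assuming 4K base pages and a 64-byte
struct page):

	/*
	 * 2M @align => 512 struct pages => 512 * 64 = 32K of vmemmap
	 *           => 8 vmemmap PFNs per 2M area:
	 *                pfn 0:    head page + first 63 tail pages (allocated)
	 *                pfn 1:    next 64 tail pages              (allocated)
	 *                pfn 2..7: PTEs point back at pfn 1        (reused)
	 *           => 8 - 2 = 6 vmemmap pages saved per 2M.
	 *
	 * 1G @align => 262144 struct pages => 4096 vmemmap PFNs,
	 *              of which only 2 are allocated => 4094 saved.
	 */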

Altmap isn't supported yet, given various restrictions in the altmap pfn
allocator; thus fall back to the already-in-use vmemmap_populate().

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/linux/mm.h  |   2 +-
 mm/memremap.c       |   1 +
 mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 131 insertions(+), 11 deletions(-)

Comments

Dan Williams May 6, 2021, 1:18 a.m. UTC | #1
On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
> means that pages are mapped at a given huge page alignment and utilize
> uses compound pages as opposed to order-0 pages.
>
> To minimize struct page overhead we take advantage of the fact that

I'm not sure if Andrew is a member of the "we" police, but I am on
team "imperative tense".

"Take advantage of the fact that most tail pages look the same (except
the first two) to minimize struct page overhead."

...I think all the other "we"s below can be dropped without losing any meaning.

> most tail pages look the same (except the first two). We allocate a
> separate page for the vmemmap area which contains the head page and
> separate for the next 64 pages. The rest of the subsections then reuse
> this tail vmemmap page to initialize the rest of the tail pages.
>
> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
> when initializing compound pagemap with big enough @align (e.g. 1G
> PUD) it  will cross various sections. To be able to reuse tail pages
> across sections belonging to the same gigantic page we fetch the
> @range being mapped (nr_ranges + 1).  If the section being mapped is
> not offset 0 of the @align, then lookup the PFN of the struct page
> address that preceeds it and use that to populate the entire

s/preceeds/precedes/

> section.
>
> On compound pagemaps with 2M align, this lets mechanism saves 6 pages

s/lets mechanism saves 6 pages/this mechanism lets 6 pages be saved/

> out of the 8 necessary PFNs necessary to set the subsection's 512
> struct pages being mapped. On a 1G compound pagemap it saves
> 4094 pages.
>
> Altmap isn't supported yet, given various restrictions in altmap pfn
> allocator, thus fallback to the already in use vmemmap_populate().

Perhaps also note that altmap for devmap mappings was there to relieve
the pressure of inordinate amounts of memmap space to map terabytes of
PMEM. With compound pages the motivation for altmaps for PMEM is
reduced.

>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/linux/mm.h  |   2 +-
>  mm/memremap.c       |   1 +
>  mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 131 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 61474602c2b1..49d717ae40ae 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>  pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                           struct vmem_altmap *altmap);
> +                           struct vmem_altmap *altmap, void *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
>  struct vmem_altmap;
>  void *vmemmap_alloc_block_buf(unsigned long size, int node,
> diff --git a/mm/memremap.c b/mm/memremap.c
> index d160853670c4..2e6bc0b1ff00 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>  {
>         struct mhp_params params = {
>                 .altmap = pgmap_altmap(pgmap),
> +               .pgmap = pgmap,
>                 .pgprot = PAGE_KERNEL,
>         };
>         const int nr_range = pgmap->nr_range;
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 8814876edcfa..f57c5eada099 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>  }
>
>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                                      struct vmem_altmap *altmap)
> +                                      struct vmem_altmap *altmap, void *block)

For type-safety I would make this argument a 'struct page *' and drop
the virt_to_page().
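
A minimal sketch of what that could look like (illustrative only, not
from the patch; the parameter name is made up):

pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
				       struct vmem_altmap *altmap,
				       struct page *reuse)
{
	pte_t *pte = pte_offset_kernel(pmd, addr);
	if (pte_none(*pte)) {
		pte_t entry;
		void *p;

		if (!reuse) {
			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
			if (!p)
				return NULL;
		} else {
			/* one extra ref per additional PTE pointing at the page */
			if (!altmap)
				get_page(reuse);
			p = page_to_virt(reuse);
		}
		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
		set_pte_at(&init_mm, addr, pte, entry);
	}
	return pte;
}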

>  {
>         pte_t *pte = pte_offset_kernel(pmd, addr);
>         if (pte_none(*pte)) {
>                 pte_t entry;
> -               void *p;
> -
> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> -               if (!p)
> -                       return NULL;
> +               void *p = block;
> +
> +               if (!block) {
> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> +                       if (!p)
> +                               return NULL;
> +               } else if (!altmap) {
> +                       get_page(virt_to_page(block));

This is either super subtle, or there is a missing put_page(). I'm
assuming that in the shutdown path the same page will be freed
multiple times because the same page is mapped multiple times.

Have you validated that the accounting is correct and all allocated
pages get freed? I feel this needs more than a comment, it needs a
test to validate that the accounting continues to happen correctly as
future vmemmap changes land.

> +               }
>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>                 set_pte_at(&init_mm, addr, pte, entry);
>         }
> @@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>  }
>
>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                             struct vmem_altmap *altmap)
> +                                             struct vmem_altmap *altmap,
> +                                             void *page, void **ptr)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
> @@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>         pmd = vmemmap_pmd_populate(pud, addr, node);
>         if (!pmd)
>                 return -ENOMEM;
> -       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>         if (!pte)
>                 return -ENOMEM;
>         vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> +
> +       if (ptr)
> +               *ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
> +       return 0;
>  }
>
>  int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> @@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>         unsigned long addr = start;
>
>         for (; addr < end; addr += PAGE_SIZE) {
> -               if (vmemmap_populate_address(addr, node, altmap))
> +               if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static int __meminit vmemmap_populate_range(unsigned long start,
> +                                           unsigned long end,
> +                                           int node, void *page)
> +{
> +       unsigned long addr = start;
> +
> +       for (; addr < end; addr += PAGE_SIZE) {
> +               if (vmemmap_populate_address(addr, node, NULL, page, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
> +                                                 void **ptr)
> +{
> +       return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
> +}
> +
> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)

I think this can be replaced with a call to follow_pte(&init_mm...).

> +{
> +       pgd_t *pgd;
> +       p4d_t *p4d;
> +       pud_t *pud;
> +       pmd_t *pmd;
> +       pte_t *pte;
> +
> +       pgd = pgd_offset_k(addr);
> +       if (pgd_none(*pgd))
> +               return NULL;
> +
> +       p4d = p4d_offset(pgd, addr);
> +       if (p4d_none(*p4d))
> +               return NULL;
> +
> +       pud = pud_offset(p4d, addr);
> +       if (pud_none(*pud))
> +               return NULL;
> +
> +       pmd = pmd_offset(pud, addr);
> +       if (pmd_none(*pmd))
> +               return NULL;
> +
> +       pte = pte_offset_kernel(pmd, addr);
> +       if (pte_none(*pte))
> +               return NULL;
> +
> +       return pte;
> +}
> +
> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
> +                                                    unsigned long start,
> +                                                    unsigned long end, int node,
> +                                                    struct dev_pagemap *pgmap)
> +{
> +       unsigned long offset, size, addr;
> +
> +       /*
> +        * For compound pages bigger than section size (e.g. 1G) fill the rest

Oh, I had to read this a couple times because I thought the "e.g." was
referring to section size. How about:

s/(e.g. 1G)/(e.g. x86 1G compound pages with 2M subsection size)/

> +        * of sections as tail pages.
> +        *
> +        * Note that memremap_pages() resets @nr_range value and will increment
> +        * it after each range successful onlining. Thus the value or @nr_range
> +        * at section memmap populate corresponds to the in-progress range
> +        * being onlined that we care about.
> +        */
> +       offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
> +       if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
> +           pgmap_align(pgmap) > SUBSECTION_SIZE) {
> +               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
> +
> +               if (!ptep)
> +                       return -ENOMEM;

Side note: I had been resistant to adding 'struct page' for altmap
pages, but that would be a requirement to enable compound pages +
altmap.

> +

Perhaps a comment here to the effect "recall the page that was
allocated in a prior iteration and fill it in with tail pages".

> +               return vmemmap_populate_range(start, end, node,
> +                                             page_to_virt(pte_page(*ptep)));

If the @block parameter changes to a 'struct page *' it also saves
some typing here.

> +       }
> +
> +       size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
> +       for (addr = start; addr < end; addr += size) {
> +               unsigned long next = addr, last = addr + size;
> +               void *block;
> +
> +               /* Populate the head page vmemmap page */
> +               if (vmemmap_populate_page(addr, node, NULL))
> +                       return -ENOMEM;
> +
> +               /* Populate the tail pages vmemmap page */
> +               block = NULL;
> +               next = addr + PAGE_SIZE;
> +               if (vmemmap_populate_page(next, node, &block))
> +                       return -ENOMEM;
> +
> +               /* Reuse the previous page for the rest of tail pages */
> +               next += PAGE_SIZE;
> +               if (vmemmap_populate_range(next, last, node, block))
>                         return -ENOMEM;
>         }
>
> @@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>  {
>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>         unsigned long end = start + nr_pages * sizeof(struct page);
> +       unsigned int align = pgmap_align(pgmap);
> +       int r;
>
>         if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>                 !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>                 return NULL;
>
> -       if (vmemmap_populate(start, end, nid, altmap))
> +       if (align > PAGE_SIZE && !altmap)

I also think this code will read better with the devmap_geometry enum.

> +               r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
> +       else
> +               r = vmemmap_populate(start, end, nid, altmap);
> +
> +       if (r < 0)
>                 return NULL;
>
>         return pfn_to_page(pfn);
> --
> 2.17.1
>
Joao Martins May 6, 2021, 11:01 a.m. UTC | #2
On 5/6/21 2:18 AM, Dan Williams wrote:
> On Thu, Mar 25, 2021 at 4:10 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
>> means that pages are mapped at a given huge page alignment and utilize
>> uses compound pages as opposed to order-0 pages.
>>
>> To minimize struct page overhead we take advantage of the fact that
> 
> I'm not sure if Andrew is a member of the "we" police, but I am on
> team "imperative tense".
> 
That was my mistake once again. You did mention it over the RFC.
A few days after submitting the series I noticed[0] this patch and the
next one still had a few instances of misplaced usage of 'we', in addition
to a little unit mistake I made in the cover letter.

[0] https://lore.kernel.org/linux-mm/a2b7dc3a-3f1d-b8e1-4770-e261055f2b17@oracle.com/

> "Take advantage of the fact that most tail pages look the same (except
> the first two) to minimize struct page overhead."
> 
> ...I think all the other "we"s below can be dropped without losing any meaning.
> 
Indeed.

>> most tail pages look the same (except the first two). We allocate a
>> separate page for the vmemmap area which contains the head page and
>> separate for the next 64 pages. The rest of the subsections then reuse
>> this tail vmemmap page to initialize the rest of the tail pages.
>>
>> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
>> when initializing compound pagemap with big enough @align (e.g. 1G
>> PUD) it  will cross various sections. To be able to reuse tail pages
>> across sections belonging to the same gigantic page we fetch the
>> @range being mapped (nr_ranges + 1).  If the section being mapped is
>> not offset 0 of the @align, then lookup the PFN of the struct page
>> address that preceeds it and use that to populate the entire
> 
> s/preceeds/precedes/
> 
/me nods

>> section.
>>
>> On compound pagemaps with 2M align, this lets mechanism saves 6 pages
> 
> s/lets mechanism saves 6 pages/this mechanism lets 6 pages be saved/
> 
Will fix.

>> out of the 8 necessary PFNs necessary to set the subsection's 512
>> struct pages being mapped. On a 1G compound pagemap it saves
>> 4094 pages.
>>
>> Altmap isn't supported yet, given various restrictions in altmap pfn
>> allocator, thus fallback to the already in use vmemmap_populate().
> 
> Perhaps also note that altmap for devmap mappings was there to relieve
> the pressure of inordinate amounts of memmap space to map terabytes of
> PMEM. With compound pages the motivation for altmaps for PMEM is
> reduced.
> 
OK, I suppose it's worth highlighting it.

But how 'reduced' this motivation really is remains not 100% clear to me.

Like we discussed over the RFC, with the DEVMAP_PMD geometry we still take a
non-trivial hit of ~4G per TB. And altmap is a rather quick way of having
heavily disproportionate RAM-to-PMEM ratios working without paying the RAM
penalty, especially with the pinning numbers so heavily reduced.
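
For reference, the rough arithmetic (illustrative):

	/*
	 * order-0 memmap: 1T / 4K pages * 64 bytes    = 16G of memmap per TB
	 * 2M compound:    2 vmemmap pages kept per 2M = 8K per 2M ~= 4G per TB
	 * 1G compound:    2 vmemmap pages kept per 1G = 8K per 1G ~= 8M per TB
	 */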

>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/linux/mm.h  |   2 +-
>>  mm/memremap.c       |   1 +
>>  mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
>>  3 files changed, 131 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 61474602c2b1..49d717ae40ae 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>>  pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> -                           struct vmem_altmap *altmap);
>> +                           struct vmem_altmap *altmap, void *block);
>>  void *vmemmap_alloc_block(unsigned long size, int node);
>>  struct vmem_altmap;
>>  void *vmemmap_alloc_block_buf(unsigned long size, int node,
>> diff --git a/mm/memremap.c b/mm/memremap.c
>> index d160853670c4..2e6bc0b1ff00 100644
>> --- a/mm/memremap.c
>> +++ b/mm/memremap.c
>> @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>>  {
>>         struct mhp_params params = {
>>                 .altmap = pgmap_altmap(pgmap),
>> +               .pgmap = pgmap,
>>                 .pgprot = PAGE_KERNEL,
>>         };
>>         const int nr_range = pgmap->nr_range;
>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
>> index 8814876edcfa..f57c5eada099 100644
>> --- a/mm/sparse-vmemmap.c
>> +++ b/mm/sparse-vmemmap.c
>> @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>>  }
>>
>>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
>> -                                      struct vmem_altmap *altmap)
>> +                                      struct vmem_altmap *altmap, void *block)
> 
> For type-safety I would make this argument a 'struct page *' and drop
> the virt_to_page().
> 
Will do.

>>  {
>>         pte_t *pte = pte_offset_kernel(pmd, addr);
>>         if (pte_none(*pte)) {
>>                 pte_t entry;
>> -               void *p;
>> -
>> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> -               if (!p)
>> -                       return NULL;
>> +               void *p = block;
>> +
>> +               if (!block) {
>> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
>> +                       if (!p)
>> +                               return NULL;
>> +               } else if (!altmap) {
>> +                       get_page(virt_to_page(block));
> 
> This is either super subtle, or there is a missing put_page(). I'm
> assuming that in the shutdown path the same page will be freed
> multiple times because the same page is mapped multiple times.
> 

It's the former (subtlety):

1) When a PTE/PMD entry is freed from the init_mm there's a free_pages() call on
the page allocated above. free_pages() at the top does a put_page_testzero() prior to
freeing the actual page.

2) Given this codepath is solely used by pgmap infra, the page allocator is already
available (i.e. slab_is_available()), hence we will always go down the
alloc_pages_node() codepath.
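
For reference, a rough sketch of how the references then balance out on
the teardown side (paraphrasing the free path, not the exact code):

	/*
	 * Populate side: one get_page() per extra PTE that reuses the tail
	 * vmemmap page, on top of the initial reference from the allocation.
	 *
	 * Teardown side: each PTE teardown ends up doing roughly
	 *
	 *	free_pages(addr, 0)
	 *	  -> __free_pages(virt_to_page((void *)addr), 0)
	 *	       -> if (put_page_testzero(page))
	 *	                 free_the_page(page, 0);
	 *
	 * so the page is only handed back to the page allocator once the
	 * last PTE referencing it has been removed.
	 */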

> Have you validated that the accounting is correct and all allocated
> pages get freed? I feel this needs more than a comment, it needs a
> test to validate that the accounting continues to happen correctly as
> future vmemmap changes land.
> 

I can add a comment *at least*.

I checked the accounting, but let me be extra pedantic on this part and recheck
it once again, as I settled on it some time ago. I will follow up on this
chunk should it turn out to be broken in some way.

>> +               }
>>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>>                 set_pte_at(&init_mm, addr, pte, entry);
>>         }
>> @@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>>  }
>>
>>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>> -                                             struct vmem_altmap *altmap)
>> +                                             struct vmem_altmap *altmap,
>> +                                             void *page, void **ptr)
>>  {
>>         pgd_t *pgd;
>>         p4d_t *p4d;
>> @@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>>         pmd = vmemmap_pmd_populate(pud, addr, node);
>>         if (!pmd)
>>                 return -ENOMEM;
>> -       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
>> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>>         if (!pte)
>>                 return -ENOMEM;
>>         vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>> +
>> +       if (ptr)
>> +               *ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
>> +       return 0;
>>  }
>>
>>  int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>> @@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>>         unsigned long addr = start;
>>
>>         for (; addr < end; addr += PAGE_SIZE) {
>> -               if (vmemmap_populate_address(addr, node, altmap))
>> +               if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
>> +                       return -ENOMEM;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static int __meminit vmemmap_populate_range(unsigned long start,
>> +                                           unsigned long end,
>> +                                           int node, void *page)
>> +{
>> +       unsigned long addr = start;
>> +
>> +       for (; addr < end; addr += PAGE_SIZE) {
>> +               if (vmemmap_populate_address(addr, node, NULL, page, NULL))
>> +                       return -ENOMEM;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
>> +                                                 void **ptr)
>> +{
>> +       return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
>> +}
>> +
>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> 
> I think this can be replaced with a call to follow_pte(&init_mm...).
> 

Ah, of course! That would shorten things up too.

>> +{
>> +       pgd_t *pgd;
>> +       p4d_t *p4d;
>> +       pud_t *pud;
>> +       pmd_t *pmd;
>> +       pte_t *pte;
>> +
>> +       pgd = pgd_offset_k(addr);
>> +       if (pgd_none(*pgd))
>> +               return NULL;
>> +
>> +       p4d = p4d_offset(pgd, addr);
>> +       if (p4d_none(*p4d))
>> +               return NULL;
>> +
>> +       pud = pud_offset(p4d, addr);
>> +       if (pud_none(*pud))
>> +               return NULL;
>> +
>> +       pmd = pmd_offset(pud, addr);
>> +       if (pmd_none(*pmd))
>> +               return NULL;
>> +
>> +       pte = pte_offset_kernel(pmd, addr);
>> +       if (pte_none(*pte))
>> +               return NULL;
>> +
>> +       return pte;
>> +}
>> +
>> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>> +                                                    unsigned long start,
>> +                                                    unsigned long end, int node,
>> +                                                    struct dev_pagemap *pgmap)
>> +{
>> +       unsigned long offset, size, addr;
>> +
>> +       /*
>> +        * For compound pages bigger than section size (e.g. 1G) fill the rest
> 
> Oh, I had to read this a couple times because I thought the "e.g." was
> referring to section size. How about:
> 
> s/(e.g. 1G)/(e.g. x86 1G compound pages with 2M subsection size)/
> 
Will fix, but I'll change it to:

e.g. x86 1G compound pages with 128M section size

Because vmemmap_populate_compound_pages() is called per section, while being
subsection-aligned.
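
For reference, the arithmetic behind that (illustrative, assuming x86
with a 128M section size):

	/*
	 * 1G compound page / 128M section = 8 sections per 1G area.
	 * Only the section at offset 0 of the 1G range allocates head and
	 * tail vmemmap pages; the other 7 sections look up the PTE of the
	 * struct page that precedes them and reuse its tail vmemmap page.
	 */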

>> +        * of sections as tail pages.
>> +        *
>> +        * Note that memremap_pages() resets @nr_range value and will increment
>> +        * it after each range successful onlining. Thus the value or @nr_range
>> +        * at section memmap populate corresponds to the in-progress range
>> +        * being onlined that we care about.
>> +        */
>> +       offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>> +       if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
>> +           pgmap_align(pgmap) > SUBSECTION_SIZE) {
>> +               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
>> +
>> +               if (!ptep)
>> +                       return -ENOMEM;
> 
> Side note: I had been resistant to adding 'struct page' for altmap
> pages, but that would be a requirement to enable compound pages +
> altmap.
> 
Fair enough.

I was thinking about something a little simpler, like adding a refcount to
the altmap pfn allocator, to avoid arch changes and be a little tidier
on space (16 bytes rather than 64 bytes). But I suspect I can avoid those
arch changes even with a backing struct page.

>> +
> 
> Perhaps a comment here to the effect "recall the page that was
> allocated in a prior iteration and fill it in with tail pages".
> 
Will add.

>> +               return vmemmap_populate_range(start, end, node,
>> +                                             page_to_virt(pte_page(*ptep)));
> 
> If the @block parameter changes to a 'struct page *' it also saves
> some typing here.
> 
Indeed.

>> +       }
>> +
>> +       size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
>> +       for (addr = start; addr < end; addr += size) {
>> +               unsigned long next = addr, last = addr + size;
>> +               void *block;
>> +
>> +               /* Populate the head page vmemmap page */
>> +               if (vmemmap_populate_page(addr, node, NULL))
>> +                       return -ENOMEM;
>> +
>> +               /* Populate the tail pages vmemmap page */
>> +               block = NULL;
>> +               next = addr + PAGE_SIZE;
>> +               if (vmemmap_populate_page(next, node, &block))
>> +                       return -ENOMEM;
>> +
>> +               /* Reuse the previous page for the rest of tail pages */
>> +               next += PAGE_SIZE;
>> +               if (vmemmap_populate_range(next, last, node, block))
>>                         return -ENOMEM;
>>         }
>>
>> @@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>>  {
>>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>>         unsigned long end = start + nr_pages * sizeof(struct page);
>> +       unsigned int align = pgmap_align(pgmap);
>> +       int r;
>>
>>         if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>>                 !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>>                 return NULL;
>>
>> -       if (vmemmap_populate(start, end, nid, altmap))
>> +       if (align > PAGE_SIZE && !altmap)
> 
> I also think this code will read better the devmap_geometry enum.
> 
True.

I am starting to like the ring of @geometry rather than @align to represent
how the metadata is used. It should avoid confusion between device-dax @align
and pagemap @align.

>> +               r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
>> +       else
>> +               r = vmemmap_populate(start, end, nid, altmap);
>> +
>> +       if (r < 0)
>>                 return NULL;
>>
>>         return pfn_to_page(pfn);
>> --
>> 2.17.1
>>
Dan Williams May 10, 2021, 7:19 p.m. UTC | #3
On Thu, May 6, 2021 at 4:02 AM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> >> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> >
> > I think this can be replaced with a call to follow_pte(&init_mm...).
> >
>
> Ah, of course! That would shorten things up too.

Now that I look closely, I notice that function disclaims being
suitable as a general purpose pte lookup utility. If it works for you,
great, but if not I think it's past time to create such a helper. I
know Ira is looking for one, and I contributed to the proliferation
when I added dev_pagemap_mapping_shift().
Joao Martins May 13, 2021, 6:45 p.m. UTC | #4
On 5/10/21 8:19 PM, Dan Williams wrote:
> On Thu, May 6, 2021 at 4:02 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> [..]
>>>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>>>
>>> I think this can be replaced with a call to follow_pte(&init_mm...).
>>>
>>
>> Ah, of course! That would shorten things up too.
> 
> Now that I look closely, I notice that function disclaims being
> suitable as a general purpose pte lookup utility. 
> If it works for you,
> great, but if not I think it's past time to create such a helper. I
> know Ira is looking for one, and I contributed to the proliferation
> when I added dev_pagemap_mapping_shift().
> 
There are also x86 equivalents, like lookup_address() and lookup_address_in_pgd().
These two don't differ that much from vmemmap_lookup_address().

I can move this to an internal place, e.g. mm/internal.h.

The patch after this one changes vmemmap_lookup_address() to stop at the PMD (to reuse
that across the next sections).
Joao Martins June 16, 2021, 3:05 p.m. UTC | #5
On 5/13/21 7:45 PM, Joao Martins wrote:
> On 5/10/21 8:19 PM, Dan Williams wrote:
>> On Thu, May 6, 2021 at 4:02 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>> [..]
>>>>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
>>>>
>>>> I think this can be replaced with a call to follow_pte(&init_mm...).
>>>
>>> Ah, of course! That would shorten things up too.
>>
>> Now that I look closely, I notice that function disclaims being
>> suitable as a general purpose pte lookup utility. 
>> If it works for you,
>> great, but if not I think it's past time to create such a helper. I
>> know Ira is looking for one, and I contributed to the proliferation
>> when I added dev_pagemap_mapping_shift().
>>
> There's also x86 equivalents, like lookup_address() and lookup_address_in_pgd().
> These two don't differ that much from vmemmap_lookup_address().
> 
> I can move this to an internal place e.g. mm/internal.h
> 
> The patch after this one, changes vmemmap_lookup_address() to stop at the PMD (to reuse
> that across the next sections).
> 
Thinking again about this particular comment, and more about the actual need for the lookup
function: it is very specific to the 1G geometry case being spread over multiple sections.

Given that section population *needs* to succeed for the follow-up sections to continue to be
populated, perhaps we can simplify this a whole lot, e.g. see below, comparing this patch
before and after.

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 3d671e3e804d..c06796fcc77d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -554,37 +554,6 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
        return vmemmap_populate_address(addr, node, NULL, NULL, page);
 }

-static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
-{
-       pgd_t *pgd;
-       p4d_t *p4d;
-       pud_t *pud;
-       pmd_t *pmd;
-       pte_t *pte;
-
-       pgd = pgd_offset_k(addr);
-       if (pgd_none(*pgd))
-               return NULL;
-
-       p4d = p4d_offset(pgd, addr);
-       if (p4d_none(*p4d))
-               return NULL;
-
-       pud = pud_offset(p4d, addr);
-       if (pud_none(*pud))
-               return NULL;
-
-       pmd = pmd_offset(pud, addr);
-       if (pmd_none(*pmd))
-               return NULL;
-
-       pte = pte_offset_kernel(pmd, addr);
-       if (pte_none(*pte))
-               return NULL;
-
-       return pte;
-}
-
 static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
                                                     unsigned long start,
                                                     unsigned long end, int node,
@@ -605,8 +574,10 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
        offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
        if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
            pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
-               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+               pte_t *ptep;

+               addr = start - PAGE_SIZE;
+               ptep = pte_offset_kernel(pmd_off_k(addr), addr);
                if (!ptep)
                        return -ENOMEM;

The 'if (!ptep)' case cannot happen in practice AFAIU, so we could remove that as well.

Thoughts?
Dan Williams June 16, 2021, 11:35 p.m. UTC | #6
On Wed, Jun 16, 2021 at 8:06 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 5/13/21 7:45 PM, Joao Martins wrote:
> > On 5/10/21 8:19 PM, Dan Williams wrote:
> >> On Thu, May 6, 2021 at 4:02 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> [..]
> >>>>> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> >>>>
> >>>> I think this can be replaced with a call to follow_pte(&init_mm...).
> >>>
> >>> Ah, of course! That would shorten things up too.
> >>
> >> Now that I look closely, I notice that function disclaims being
> >> suitable as a general purpose pte lookup utility.
> >> If it works for you,
> >> great, but if not I think it's past time to create such a helper. I
> >> know Ira is looking for one, and I contributed to the proliferation
> >> when I added dev_pagemap_mapping_shift().
> >>
> > There's also x86 equivalents, like lookup_address() and lookup_address_in_pgd().
> > These two don't differ that much from vmemmap_lookup_address().
> >
> > I can move this to an internal place e.g. mm/internal.h
> >
> > The patch after this one, changes vmemmap_lookup_address() to stop at the PMD (to reuse
> > that across the next sections).
> >
> Thinking again on this particular comment, but more on the actual need for the lookup
> function. It is very specific to 1G geometry case being spread over multiple sections.
>
> Given section population *needs* to succeed for the followup sections to continue to be
> populated, perhaps we simplify this a whole lot e.g. below comparing this patch before and
> after.
>
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 3d671e3e804d..c06796fcc77d 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -554,37 +554,6 @@ static inline int __meminit vmemmap_populate_page(unsigned long addr,
> int node,
>         return vmemmap_populate_address(addr, node, NULL, NULL, page);
>  }
>
> -static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
> -{
> -       pgd_t *pgd;
> -       p4d_t *p4d;
> -       pud_t *pud;
> -       pmd_t *pmd;
> -       pte_t *pte;
> -
> -       pgd = pgd_offset_k(addr);
> -       if (pgd_none(*pgd))
> -               return NULL;
> -
> -       p4d = p4d_offset(pgd, addr);
> -       if (p4d_none(*p4d))
> -               return NULL;
> -
> -       pud = pud_offset(p4d, addr);
> -       if (pud_none(*pud))
> -               return NULL;
> -
> -       pmd = pmd_offset(pud, addr);
> -       if (pmd_none(*pmd))
> -               return NULL;
> -
> -       pte = pte_offset_kernel(pmd, addr);
> -       if (pte_none(*pte))
> -               return NULL;
> -
> -       return pte;
> -}
> -
>  static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
>                                                      unsigned long start,
>                                                      unsigned long end, int node,
> @@ -605,8 +574,10 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long
> start_pfn,
>         offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
>         if (!IS_ALIGNED(offset, pgmap_geometry(pgmap)) &&
>             pgmap_geometry(pgmap) > SUBSECTION_SIZE) {
> -               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
> +               pte_t *ptep;
>
> +               addr = start - PAGE_SIZE;
> +               ptep = pte_offset_kernel(pmd_off_k(addr), addr);

It deserves a comment about the guarantee, but looks good.

>                 if (!ptep)
>                         return -ENOMEM;
>
> The 'if (!ptep)' cannot happen in pratice AFAIU so could remove that as well.
>
> Thoughts?

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 61474602c2b1..49d717ae40ae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3040,7 +3040,7 @@  p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
 pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
 pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
 pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-			    struct vmem_altmap *altmap);
+			    struct vmem_altmap *altmap, void *block);
 void *vmemmap_alloc_block(unsigned long size, int node);
 struct vmem_altmap;
 void *vmemmap_alloc_block_buf(unsigned long size, int node,
diff --git a/mm/memremap.c b/mm/memremap.c
index d160853670c4..2e6bc0b1ff00 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -345,6 +345,7 @@  void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 {
 	struct mhp_params params = {
 		.altmap = pgmap_altmap(pgmap),
+		.pgmap = pgmap,
 		.pgprot = PAGE_KERNEL,
 	};
 	const int nr_range = pgmap->nr_range;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 8814876edcfa..f57c5eada099 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,16 +141,20 @@  void __meminit vmemmap_verify(pte_t *pte, int node,
 }
 
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
-				       struct vmem_altmap *altmap)
+				       struct vmem_altmap *altmap, void *block)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
 	if (pte_none(*pte)) {
 		pte_t entry;
-		void *p;
-
-		p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
-		if (!p)
-			return NULL;
+		void *p = block;
+
+		if (!block) {
+			p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
+			if (!p)
+				return NULL;
+		} else if (!altmap) {
+			get_page(virt_to_page(block));
+		}
 		entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
@@ -217,7 +221,8 @@  pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 }
 
 static int __meminit vmemmap_populate_address(unsigned long addr, int node,
-					      struct vmem_altmap *altmap)
+					      struct vmem_altmap *altmap,
+					      void *page, void **ptr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -237,10 +242,14 @@  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
 	pmd = vmemmap_pmd_populate(pud, addr, node);
 	if (!pmd)
 		return -ENOMEM;
-	pte = vmemmap_pte_populate(pmd, addr, node, altmap);
+	pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
 	if (!pte)
 		return -ENOMEM;
 	vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
+
+	if (ptr)
+		*ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
+	return 0;
 }
 
 int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
@@ -249,7 +258,110 @@  int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
 	unsigned long addr = start;
 
 	for (; addr < end; addr += PAGE_SIZE) {
-		if (vmemmap_populate_address(addr, node, altmap))
+		if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int __meminit vmemmap_populate_range(unsigned long start,
+					    unsigned long end,
+					    int node, void *page)
+{
+	unsigned long addr = start;
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		if (vmemmap_populate_address(addr, node, NULL, page, NULL))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
+						  void **ptr)
+{
+	return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
+}
+
+static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset_k(addr);
+	if (pgd_none(*pgd))
+		return NULL;
+
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d))
+		return NULL;
+
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud))
+		return NULL;
+
+	pmd = pmd_offset(pud, addr);
+	if (pmd_none(*pmd))
+		return NULL;
+
+	pte = pte_offset_kernel(pmd, addr);
+	if (pte_none(*pte))
+		return NULL;
+
+	return pte;
+}
+
+static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
+						     unsigned long start,
+						     unsigned long end, int node,
+						     struct dev_pagemap *pgmap)
+{
+	unsigned long offset, size, addr;
+
+	/*
+	 * For compound pages bigger than section size (e.g. 1G) fill the rest
+	 * of sections as tail pages.
+	 *
+	 * Note that memremap_pages() resets @nr_range value and will increment
+	 * it after each range successful onlining. Thus the value or @nr_range
+	 * at section memmap populate corresponds to the in-progress range
+	 * being onlined that we care about.
+	 */
+	offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
+	if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
+	    pgmap_align(pgmap) > SUBSECTION_SIZE) {
+		pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
+
+		if (!ptep)
+			return -ENOMEM;
+
+		return vmemmap_populate_range(start, end, node,
+					      page_to_virt(pte_page(*ptep)));
+	}
+
+	size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
+	for (addr = start; addr < end; addr += size) {
+		unsigned long next = addr, last = addr + size;
+		void *block;
+
+		/* Populate the head page vmemmap page */
+		if (vmemmap_populate_page(addr, node, NULL))
+			return -ENOMEM;
+
+		/* Populate the tail pages vmemmap page */
+		block = NULL;
+		next = addr + PAGE_SIZE;
+		if (vmemmap_populate_page(next, node, &block))
+			return -ENOMEM;
+
+		/* Reuse the previous page for the rest of tail pages */
+		next += PAGE_SIZE;
+		if (vmemmap_populate_range(next, last, node, block))
 			return -ENOMEM;
 	}
 
@@ -262,12 +374,19 @@  struct page * __meminit __populate_section_memmap(unsigned long pfn,
 {
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
+	unsigned int align = pgmap_align(pgmap);
+	int r;
 
 	if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
 		!IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
 		return NULL;
 
-	if (vmemmap_populate(start, end, nid, altmap))
+	if (align > PAGE_SIZE && !altmap)
+		r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
+	else
+		r = vmemmap_populate(start, end, nid, altmap);
+
+	if (r < 0)
 		return NULL;
 
 	return pfn_to_page(pfn);