
[v2] mm/hugetlb_vmemmap: remap head page to newly allocated page

Message ID 20221107153922.77094-1-joao.m.martins@oracle.com (mailing list archive)
State New

Commit Message

Joao Martins Nov. 7, 2022, 3:39 p.m. UTC
Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
back to the page allocator is handled as follows: for a 2M hugetlb page it
reuses the first 4K vmemmap page to remap the remaining 7 vmemmap pages,
and for a 1G hugetlb page it remaps the remaining 4095 vmemmap pages.
Essentially, this means it breaks the first 4K of a potentially contiguous
chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb
pages). For this reason the memory that is freed back to the page allocator
cannot be used by hugetlb to allocate huge pages of the same size, but only
huge pages of a smaller size:

Trying to assign a 64G node to hugetlb (on a 128G 2-node guest, each node
having 64G):

* Before allocation:
Free pages count per migrate type at order       0      1      2      3
4      5      6      7      8      9     10
...
Node    0, zone   Normal, type      Movable    340    100     32     15
1      2      0      0      0      1  15558

$ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
 31987

* After:

Node    0, zone   Normal, type      Movable  30893  32006  31515      7
0      0      0      0      0      0      0

Notice how the memory freed back is put into the 4K / 8K / 16K page
pools, and a total of 31974 pages (63948M) is allocated.

To fix this behaviour, rather than remapping onto one of the existing pages
(thus breaking the contiguous block of memory backing the struct pages),
repopulate the head vmemmap page with a newly allocated page: copy the data
from the currently mapped vmemmap page, and then remap it to this new page.
Additionally, change the remap_pte callback to look at the newly added
walk::head_page, which needs to be mapped r/w, as opposed to the tail page
vmemmap reuse which uses r/o.

The new head page is allocated by the caller of vmemmap_remap_free(), given
that on restore it should still use the same code path as before. Note
that, because one hugepage is remapped at a time, only one free 4K page at
a time is needed to remap the head page.  Should it fail to allocate said
new page, it reuses the one that's already mapped, just like before. As a
result, for every 64G of contiguous hugepages it can give back 1G more of
contiguous memory, while needing in total 128M of new 4K pages (for 2M
hugetlb) or 256K (for 1G hugetlb).

After the changes, try to assign a 64G node to hugetlb (on a 128G 2-node
guest, each node with 64G):

* Before allocation
Free pages count per migrate type at order       0      1      2      3
4      5      6      7      8      9     10
...
Node    0, zone   Normal, type      Movable      1      1      1      0
0      1      0      0      1      1  15564

$ echo 32768  > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
32394

* After:

Node    0, zone   Normal, type      Movable      0     50     97    108
96     81     70     46     18      0      0


In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M out
of the 32394 (64796M) allocated. So the memory freed back is indeed being
reused by hugetlb, and there is no massive accumulation of unused
order-0..order-2 pages.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
Changes since v1[0]:
* Drop rw argument and check walk::head_page directly when there's
  no reuse_page set (similar suggestion by Muchun Song to adjust
  inside the remap_pte callback)
* Adjust TLB flush to cover the head page vaddr too (Muchun Song)
* Simplify the remap of head page in vmemmap_pte_range()
* Check start is aligned to PAGE_SIZE in vmemmap_remap_free()

I've kept the same structure as in v1 compared to a chunk Muchun
pasted in the v1 thread[1] and thus I am not altering the calling
convention of vmemmap_remap_free()/vmemmap_restore_pte().
The remapped head page is not exactly a page that is reused, compared to
the r/o tail vmemmap pages remapping. So it is a tiny semantic change,
albeit with the same outcome in practice of changing the PTE and freeing
the page, with different permissions. It also made it simpler to gracefully
fail in case of page allocation failure, and the logic simpler to follow IMHO.

Let me know if I am following the wrong line of thinking.

[0] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
[1] https://lore.kernel.org/linux-mm/Yun1bJsnK%2F6MFc0b@FVFYT0MHHV2J/

---
 mm/hugetlb_vmemmap.c | 59 ++++++++++++++++++++++++++++++++++++++------
 1 file changed, 52 insertions(+), 7 deletions(-)

Comments

Muchun Song Nov. 8, 2022, 9:13 a.m. UTC | #1
> On Nov 7, 2022, at 23:39, Joao Martins <joao.m.martins@oracle.com> wrote:
> 
> Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
> back to page allocator is as following: for a 2M hugetlb page it will reuse
> the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a
> 1G hugetlb it will remap the remaining 4095 vmemmap pages. Essentially,
> that means that it breaks the first 4K of a potentially contiguous chunk of
> memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For
> this reason the memory that it's free back to page allocator cannot be used
> for hugetlb to allocate huge pages of the same size, but rather only of a
> smaller huge page size:
> 
> Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node
> having 64G):
> 
> * Before allocation:
> Free pages count per migrate type at order       0      1      2      3
> 4      5      6      7      8      9     10
> ...
> Node    0, zone   Normal, type      Movable    340    100     32     15
> 1      2      0      0      0      1  15558
> 
> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> 31987
> 
> * After:
> 
> Node    0, zone   Normal, type      Movable  30893  32006  31515      7
> 0      0      0      0      0      0      0
> 
> Notice how the memory freed back are put back into 4K / 8K / 16K page
> pools.  And it allocates a total of 31974 pages (63948M).
> 
> To fix this behaviour rather than remapping one page (thus breaking the
> contiguous block of memory backing the struct pages) repopulate with a new
> page for the head vmemmap page. It will copying the data from the currently
> mapped vmemmap page, and then remap it to this new page. Additionally,
> change the remap_pte callback to look at the newly added walk::head_page
> which needs to be mapped as r/w compared to the tail page vmemmap reuse
> that uses r/o.
> 
> The new head page is allocated by the caller of vmemmap_remap_free() given
> that on restore it should still be using the same code path as before. Note
> that, because right now one hugepage is remapped at a time, thus only one
> free 4K page at a time is needed to remap the head page.  Should it fail to
> allocate said new page, it reuses the one that's already mapped just like
> before. As a result, for every 64G of contiguous hugepages it can give back
> 1G more of contiguous memory per 64G, while needing in total 128M new 4K
> pages (for 2M hugetlb) or 256k (for 1G hugetlb).
> 
> After the changes, try to assign a 64G node to hugetlb (on a 128G 2node
> guest, each node with 64G):
> 
> * Before allocation
> Free pages count per migrate type at order       0      1      2      3
> 4      5      6      7      8      9     10
> ...
> Node    0, zone   Normal, type      Movable      1      1      1      0
> 0      1      0      0      1      1  15564
> 
> $ echo 32768  > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> 32394
> 
> * After:
> 
> Node    0, zone   Normal, type      Movable      0     50     97    108
> 96     81     70     46     18      0      0

Thanks for your work.

> 
> 
> In the example above, 407 more hugeltb 2M pages are allocated i.e. 814M out
> of the 32394 (64796M) allocated. So the memory freed back is indeed being
> used back in hugetlb and there's no massive order-0..order-2 pages
> accumulated unused.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
> Changes since v1[0]:
> * Drop rw argument and check walk::head_page directly when there's
>  no reuse_page set (similar suggestion by Muchun Song to adjust
>  inside the remap_pte callback)
> * Adjust TLB flush to cover the head page vaddr too (Muchun Song)
> * Simplify the remap of head page in vmemmap_pte_range()
> * Check start is aligned to PAGE_SIZE in vmemmap_remap_free()
> 
> I've kept the same structure as in v1 compared to a chunk Muchun
> pasted in the v1 thread[1] and thus I am not altering the calling
> convention of vmemmap_remap_free()/vmemmap_restore_pte().
> The remapping of head page is not exactly a page that is reused,
> compared to the r/o tail vmemmap pages remapping. So tiny semantic change,
> albeit same outcome in pratice of changing the PTE and freeing the page,
> with different permissions. It also made it simpler to gracefully fail
> in case of page allocation failure, and logic simpler to follow IMHO.
> 
> Let me know otherwise if I followed the wrong thinking.
> 
> [0] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
> [1] https://lore.kernel.org/linux-mm/Yun1bJsnK%2F6MFc0b@FVFYT0MHHV2J/
> 
> ---
> mm/hugetlb_vmemmap.c | 59 ++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 52 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 7898c2c75e35..4298c44578e3 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -22,6 +22,7 @@
>  *
>  * @remap_pte: called for each lowest-level entry (PTE).
>  * @nr_walked: the number of walked pte.
> + * @head_page: the page which replaces the head vmemmap page.
>  * @reuse_page: the page which is reused for the tail vmemmap pages.
>  * @reuse_addr: the virtual address of the @reuse_page page.
>  * @vmemmap_pages: the list head of the vmemmap pages that can be freed
> @@ -31,6 +32,7 @@ struct vmemmap_remap_walk {
> void (*remap_pte)(pte_t *pte, unsigned long addr,
>     struct vmemmap_remap_walk *walk);
> unsigned long nr_walked;
> + struct page *head_page;

This field is unnecessary. We can reuse ->reuse_page to implement the same
functionality. I'll explain the reason later.

> struct page *reuse_page;
> unsigned long reuse_addr;
> struct list_head *vmemmap_pages;
> @@ -105,10 +107,26 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> * remapping (which is calling @walk->remap_pte).
> */
> if (!walk->reuse_page) {
> - walk->reuse_page = pte_page(*pte);
> + struct page *page = pte_page(*pte);
> +
> + /*
> + * Copy the data from the original head, and remap to
> + * the newly allocated page.
> + */
> + if (walk->head_page) {
> + memcpy(page_address(walk->head_page),
> +       page_address(page), PAGE_SIZE);
> + walk->remap_pte(pte, addr, walk);
> + page = walk->head_page;
> + }
> +
> + walk->reuse_page = page;
> +
> /*
> - * Because the reuse address is part of the range that we are
> - * walking, skip the reuse address range.
> + * Because the reuse address is part of the range that
> + * we are walking or the head page was remapped to a
> + * new page, skip the reuse address range.
> + * .
> */
> addr += PAGE_SIZE;
> pte++;
> @@ -204,11 +222,11 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
> } while (pgd++, addr = next, addr != end);
> 
> /*
> - * We only change the mapping of the vmemmap virtual address range
> - * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
> + * We change the mapping of the vmemmap virtual address range
> + * [@start, end], so we only need to flush the TLB which
> * belongs to the range.
> */

This comment could go away. The reason I added it here is that this spot is
a bit special: I wanted to tell others why we don't flush the full range from
@start to @end. Now, I think it can go away.

> - flush_tlb_kernel_range(start + PAGE_SIZE, end);
> + flush_tlb_kernel_range(start, end);
> 
> return 0;
> }
> @@ -244,9 +262,21 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> * to the tail pages.
> */
> pgprot_t pgprot = PAGE_KERNEL_RO;
> - pte_t entry = mk_pte(walk->reuse_page, pgprot);
> + struct page *reuse = walk->reuse_page;
> struct page *page = pte_page(*pte);
> + pte_t entry;
> 
> + /*
> + * When there's no walk::reuse_page, it means we allocated a new head
> + * page (stored in walk::head_page) and copied from the old head page.
> + * In that case use the walk::head_page as the page to remap.
> + */
> + if (!reuse) {
> + pgprot = PAGE_KERNEL;
> + reuse = walk->head_page;
> + }
> +
> + entry = mk_pte(reuse, pgprot);
> list_add_tail(&page->lru, walk->vmemmap_pages);
> set_pte_at(&init_mm, addr, pte, entry);
> }
> @@ -315,6 +345,21 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> .reuse_addr = reuse,
> .vmemmap_pages = &vmemmap_pages,
> };
> + gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;

It is better to add __GFP_THISNODE here for performance. And replace
__GFP_RETRY_MAYFAIL with __GFP_NORETRY to keep consistent with
hugetlb_vmemmap_restore().

> + int nid = page_to_nid((struct page *)start);
> + struct page *page = NULL;
> +
> + /*
> + * Allocate a new head vmemmap page to avoid breaking a contiguous
> + * block of struct page memory when freeing it back to page allocator
> + * in free_vmemmap_page_list(). This will allow the likely contiguous
> + * struct page backing memory to be kept contiguous and allowing for
> + * more allocations of hugepages. Fallback to the currently
> + * mapped head page in case should it fail to allocate.
> + */
> + if (IS_ALIGNED((unsigned long)start, PAGE_SIZE))

I'm curious why we need this check. IIUC, this is unnecessary.

> + page = alloc_pages_node(nid, gfp_mask, 0);
> + walk.head_page = page;
> 
> /*
> * In order to make remapping routine most efficient for the huge pages,
> -- 
> 2.17.2
> 

I have implemented a version based on yours which does not introduce the
->head_page field (not yet tested whether it works). It seems to be simpler.

Thanks.

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index c98805d5b815..8ee94f6a6697 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -202,12 +202,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
                        return ret;
        } while (pgd++, addr = next, addr != end);

-       /*
-        * We only change the mapping of the vmemmap virtual address range
-        * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
-        * belongs to the range.
-        */
-       flush_tlb_kernel_range(start + PAGE_SIZE, end);
+       flush_tlb_kernel_range(start, end);

        return 0;
 }
@@ -246,6 +241,12 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
        pte_t entry = mk_pte(walk->reuse_page, pgprot);
        struct page *page = pte_page(*pte);

+       /* The case of remapping the head vmemmap page. */
+       if (unlikely(addr == walk->reuse_addr)) {
+               list_del(&walk->reuse_page->lru);
+               entry = mk_pte(walk->reuse_page, PAGE_KERNEL);
+       }
+
        list_add_tail(&page->lru, walk->vmemmap_pages);
        set_pte_at(&init_mm, addr, pte, entry);
 }
@@ -310,6 +311,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
                .reuse_addr     = reuse,
                .vmemmap_pages  = &vmemmap_pages,
        };
+       int nid = page_to_nid((struct page *)start);
+       gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN;

        /*
         * In order to make remapping routine most efficient for the huge pages,
@@ -326,6 +329,20 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
         */
        BUG_ON(start - reuse != PAGE_SIZE);

+       /*
+        * Allocate a new head vmemmap page to avoid breaking a contiguous
+        * block of struct page memory when freeing it back to page allocator
+        * in free_vmemmap_page_list(). This will allow the likely contiguous
+        * struct page backing memory to be kept contiguous and allowing for
+        * more allocations of hugepages. Fallback to the currently
+        * mapped head page in case should it fail to allocate.
+        */
+       walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
+       if (walk.reuse_page) {
+               copy_page(page_to_virt(walk.reuse_page), walk.reuse_addr);
+               list_add(&walk.reuse_page->lru, &vmemmap_pages);
+       }
+
        mmap_read_lock(&init_mm);
        ret = vmemmap_remap_range(reuse, end, &walk);
        if (ret && walk.nr_walked) {
Joao Martins Nov. 8, 2022, 10:38 a.m. UTC | #2
On 08/11/2022 09:13, Muchun Song wrote:
>> On Nov 7, 2022, at 23:39, Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
>> back to page allocator is as following: for a 2M hugetlb page it will reuse
>> the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a
>> 1G hugetlb it will remap the remaining 4095 vmemmap pages. Essentially,
>> that means that it breaks the first 4K of a potentially contiguous chunk of
>> memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For
>> this reason the memory that it's free back to page allocator cannot be used
>> for hugetlb to allocate huge pages of the same size, but rather only of a
>> smaller huge page size:
>>
>> Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node
>> having 64G):
>>
>> * Before allocation:
>> Free pages count per migrate type at order       0      1      2      3
>> 4      5      6      7      8      9     10
>> ...
>> Node    0, zone   Normal, type      Movable    340    100     32     15
>> 1      2      0      0      0      1  15558
>>
>> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>> 31987
>>
>> * After:
>>
>> Node    0, zone   Normal, type      Movable  30893  32006  31515      7
>> 0      0      0      0      0      0      0
>>
>> Notice how the memory freed back are put back into 4K / 8K / 16K page
>> pools.  And it allocates a total of 31974 pages (63948M).
>>
>> To fix this behaviour rather than remapping one page (thus breaking the
>> contiguous block of memory backing the struct pages) repopulate with a new
>> page for the head vmemmap page. It will copying the data from the currently
>> mapped vmemmap page, and then remap it to this new page. Additionally,
>> change the remap_pte callback to look at the newly added walk::head_page
>> which needs to be mapped as r/w compared to the tail page vmemmap reuse
>> that uses r/o.
>>
>> The new head page is allocated by the caller of vmemmap_remap_free() given
>> that on restore it should still be using the same code path as before. Note
>> that, because right now one hugepage is remapped at a time, thus only one
>> free 4K page at a time is needed to remap the head page.  Should it fail to
>> allocate said new page, it reuses the one that's already mapped just like
>> before. As a result, for every 64G of contiguous hugepages it can give back
>> 1G more of contiguous memory per 64G, while needing in total 128M new 4K
>> pages (for 2M hugetlb) or 256k (for 1G hugetlb).
>>
>> After the changes, try to assign a 64G node to hugetlb (on a 128G 2node
>> guest, each node with 64G):
>>
>> * Before allocation
>> Free pages count per migrate type at order       0      1      2      3
>> 4      5      6      7      8      9     10
>> ...
>> Node    0, zone   Normal, type      Movable      1      1      1      0
>> 0      1      0      0      1      1  15564
>>
>> $ echo 32768  > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>> 32394
>>
>> * After:
>>
>> Node    0, zone   Normal, type      Movable      0     50     97    108
>> 96     81     70     46     18      0      0
> 
> Thanks for your work.
> 
Thanks for the comments!

>>
>>
>> In the example above, 407 more hugeltb 2M pages are allocated i.e. 814M out
>> of the 32394 (64796M) allocated. So the memory freed back is indeed being
>> used back in hugetlb and there's no massive order-0..order-2 pages
>> accumulated unused.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> Changes since v1[0]:
>> * Drop rw argument and check walk::head_page directly when there's
>>  no reuse_page set (similar suggestion by Muchun Song to adjust
>>  inside the remap_pte callback)
>> * Adjust TLB flush to cover the head page vaddr too (Muchun Song)
>> * Simplify the remap of head page in vmemmap_pte_range()
>> * Check start is aligned to PAGE_SIZE in vmemmap_remap_free()
>>
>> I've kept the same structure as in v1 compared to a chunk Muchun
>> pasted in the v1 thread[1] and thus I am not altering the calling
>> convention of vmemmap_remap_free()/vmemmap_restore_pte().
>> The remapping of head page is not exactly a page that is reused,
>> compared to the r/o tail vmemmap pages remapping. So tiny semantic change,
>> albeit same outcome in pratice of changing the PTE and freeing the page,
>> with different permissions. It also made it simpler to gracefully fail
>> in case of page allocation failure, and logic simpler to follow IMHO.
>>
>> Let me know otherwise if I followed the wrong thinking.
>>
>> [0] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
>> [1] https://lore.kernel.org/linux-mm/Yun1bJsnK%2F6MFc0b@FVFYT0MHHV2J/
>>
>> ---
>> mm/hugetlb_vmemmap.c | 59 ++++++++++++++++++++++++++++++++++++++------
>> 1 file changed, 52 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index 7898c2c75e35..4298c44578e3 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -22,6 +22,7 @@
>>  *
>>  * @remap_pte: called for each lowest-level entry (PTE).
>>  * @nr_walked: the number of walked pte.
>> + * @head_page: the page which replaces the head vmemmap page.
>>  * @reuse_page: the page which is reused for the tail vmemmap pages.
>>  * @reuse_addr: the virtual address of the @reuse_page page.
>>  * @vmemmap_pages: the list head of the vmemmap pages that can be freed
>> @@ -31,6 +32,7 @@ struct vmemmap_remap_walk {
>> void (*remap_pte)(pte_t *pte, unsigned long addr,
>>     struct vmemmap_remap_walk *walk);
>> unsigned long nr_walked;
>> + struct page *head_page;
> 
> This field is unnecessary. We can reuse ->reuse_page to implement the same
> functionality. I'll explain the reason later.
> 

OK, I'll comment below

>> struct page *reuse_page;
>> unsigned long reuse_addr;
>> struct list_head *vmemmap_pages;
>> @@ -105,10 +107,26 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
>> * remapping (which is calling @walk->remap_pte).
>> */
>> if (!walk->reuse_page) {
>> - walk->reuse_page = pte_page(*pte);
>> + struct page *page = pte_page(*pte);
>> +
>> + /*
>> + * Copy the data from the original head, and remap to
>> + * the newly allocated page.
>> + */
>> + if (walk->head_page) {
>> + memcpy(page_address(walk->head_page),
>> +       page_address(page), PAGE_SIZE);
>> + walk->remap_pte(pte, addr, walk);
>> + page = walk->head_page;
>> + }
>> +
>> + walk->reuse_page = page;
>> +
>> /*
>> - * Because the reuse address is part of the range that we are
>> - * walking, skip the reuse address range.
>> + * Because the reuse address is part of the range that
>> + * we are walking or the head page was remapped to a
>> + * new page, skip the reuse address range.
>> + * .
>> */
>> addr += PAGE_SIZE;
>> pte++;
>> @@ -204,11 +222,11 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>> } while (pgd++, addr = next, addr != end);
>>
>> /*
>> - * We only change the mapping of the vmemmap virtual address range
>> - * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
>> + * We change the mapping of the vmemmap virtual address range
>> + * [@start, end], so we only need to flush the TLB which
>> * belongs to the range.
>> */
> 
> This comment could go away, the reason I added it here is because it is a bit special
> here. I want to tell others why we don't flush the full range from @start to @end. Now, I
> think it can go away.
> 
OK

>> - flush_tlb_kernel_range(start + PAGE_SIZE, end);
>> + flush_tlb_kernel_range(start, end);
>>
>> return 0;
>> }
>> @@ -244,9 +262,21 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>> * to the tail pages.
>> */
>> pgprot_t pgprot = PAGE_KERNEL_RO;
>> - pte_t entry = mk_pte(walk->reuse_page, pgprot);
>> + struct page *reuse = walk->reuse_page;
>> struct page *page = pte_page(*pte);
>> + pte_t entry;
>>
>> + /*
>> + * When there's no walk::reuse_page, it means we allocated a new head
>> + * page (stored in walk::head_page) and copied from the old head page.
>> + * In that case use the walk::head_page as the page to remap.
>> + */
>> + if (!reuse) {
>> + pgprot = PAGE_KERNEL;
>> + reuse = walk->head_page;
>> + }
>> +
>> + entry = mk_pte(reuse, pgprot);
>> list_add_tail(&page->lru, walk->vmemmap_pages);
>> set_pte_at(&init_mm, addr, pte, entry);
>> }
>> @@ -315,6 +345,21 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>> .reuse_addr = reuse,
>> .vmemmap_pages = &vmemmap_pages,
>> };
>> + gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
> 
> It is better to add __GFP_THISNODE here for performance. And replace
> __GFP_RETRY_MAYFAIL to __GFP_NORETRY to keep consistent	with
> hugetlb_vmemmap_restore().
> 
OK

>> + int nid = page_to_nid((struct page *)start);
>> + struct page *page = NULL;
>> +
>> + /*
>> + * Allocate a new head vmemmap page to avoid breaking a contiguous
>> + * block of struct page memory when freeing it back to page allocator
>> + * in free_vmemmap_page_list(). This will allow the likely contiguous
>> + * struct page backing memory to be kept contiguous and allowing for
>> + * more allocations of hugepages. Fallback to the currently
>> + * mapped head page in case should it fail to allocate.
>> + */
>> + if (IS_ALIGNED((unsigned long)start, PAGE_SIZE))
> 
> I'm curious why we need this check. IIUC, this is unnecessary.
> 

So if the start of the vmemmap range (the head page) we will remap isn't the
first struct page, then we would corrupt the other struct pages in
that vmemmap page that are unrelated to hugetlb? That was my thinking.

>> + page = alloc_pages_node(nid, gfp_mask, 0);
>> + walk.head_page = page;
>>
>> /*
>> * In order to make remapping routine most efficient for the huge pages,
>> -- 
>> 2.17.2
>>
> 
> I have implemented a version based on yours, which does not introduce
> ->head_page field (Not test if it works). Seems to be simple.
> 

Let me try out with the adjustment below

> Thanks.
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index c98805d5b815..8ee94f6a6697 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -202,12 +202,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>                         return ret;
>         } while (pgd++, addr = next, addr != end);
> 
> -       /*
> -        * We only change the mapping of the vmemmap virtual address range
> -        * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
> -        * belongs to the range.
> -        */
> -       flush_tlb_kernel_range(start + PAGE_SIZE, end);
> +       flush_tlb_kernel_range(start, end);
> 
>         return 0;
>  }
> @@ -246,6 +241,12 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>         pte_t entry = mk_pte(walk->reuse_page, pgprot);
>         struct page *page = pte_page(*pte);
> 
> +       /* The case of remapping the head vmemmap page. */
> +       if (unlikely(addr == walk->reuse_addr)) {

You replace the head_page with a check on reuse_addr, but that is set on
the tail page. So if we want to rely on reuse_addr, perhaps it is best
to do:

if (unlikely(addr == (walk->reuse_addr - PAGE_SIZE))) {
	...
}

> +               list_del(&walk->reuse_page->lru);
> +               entry = mk_pte(walk->reuse_page, PAGE_KERNEL);
> +       }
> +
>         list_add_tail(&page->lru, walk->vmemmap_pages);
>         set_pte_at(&init_mm, addr, pte, entry);
>  }
> @@ -310,6 +311,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>                 .reuse_addr     = reuse,
>                 .vmemmap_pages  = &vmemmap_pages,
>         };
> +       int nid = page_to_nid((struct page *)start);
> +       gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN;
> 
>         /*
>          * In order to make remapping routine most efficient for the huge pages,
> @@ -326,6 +329,20 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>          */
>         BUG_ON(start - reuse != PAGE_SIZE);
> 
> +       /*
> +        * Allocate a new head vmemmap page to avoid breaking a contiguous
> +        * block of struct page memory when freeing it back to page allocator
> +        * in free_vmemmap_page_list(). This will allow the likely contiguous
> +        * struct page backing memory to be kept contiguous and allowing for
> +        * more allocations of hugepages. Fallback to the currently
> +        * mapped head page in case should it fail to allocate.
> +        */
> +       walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
> +       if (walk.reuse_page) {
> +               copy_page(page_to_virt(walk.reuse_page), walk.reuse_addr);
> +               list_add(&walk.reuse_page->lru, &vmemmap_pages);
> +       }
> +
>         mmap_read_lock(&init_mm);
>         ret = vmemmap_remap_range(reuse, end, &walk);
>         if (ret && walk.nr_walked) {
>
Muchun Song Nov. 9, 2022, 2:42 a.m. UTC | #3
> On Nov 8, 2022, at 18:38, Joao Martins <joao.m.martins@oracle.com> wrote:
> 
> On 08/11/2022 09:13, Muchun Song wrote:
>>> On Nov 7, 2022, at 23:39, Joao Martins <joao.m.martins@oracle.com> wrote:
>>> 
>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
>>> back to page allocator is as following: for a 2M hugetlb page it will reuse
>>> the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a
>>> 1G hugetlb it will remap the remaining 4095 vmemmap pages. Essentially,
>>> that means that it breaks the first 4K of a potentially contiguous chunk of
>>> memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For
>>> this reason the memory that it's free back to page allocator cannot be used
>>> for hugetlb to allocate huge pages of the same size, but rather only of a
>>> smaller huge page size:
>>> 
>>> Trying to assign a 64G node to hugetlb (on a 128G 2node guest, each node
>>> having 64G):
>>> 
>>> * Before allocation:
>>> Free pages count per migrate type at order       0      1      2      3
>>> 4      5      6      7      8      9     10
>>> ...
>>> Node    0, zone   Normal, type      Movable    340    100     32     15
>>> 1      2      0      0      0      1  15558
>>> 
>>> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> 31987
>>> 
>>> * After:
>>> 
>>> Node    0, zone   Normal, type      Movable  30893  32006  31515      7
>>> 0      0      0      0      0      0      0
>>> 
>>> Notice how the memory freed back is put back into 4K / 8K / 16K page
>>> pools.  And it allocates a total of 31974 pages (63948M).
>>> 
>>> To fix this behaviour, rather than remapping one page (thus breaking the
>>> contiguous block of memory backing the struct pages), repopulate the head
>>> vmemmap page with a new page. It copies the data from the currently
>>> mapped vmemmap page, and then remaps it to this new page. Additionally,
>>> change the remap_pte callback to look at the newly added walk::head_page
>>> which needs to be mapped as r/w compared to the tail page vmemmap reuse
>>> that uses r/o.
>>> 
>>> The new head page is allocated by the caller of vmemmap_remap_free() given
>>> that on restore it should still be using the same code path as before. Note
>>> that, because right now one hugepage is remapped at a time, thus only one
>>> free 4K page at a time is needed to remap the head page.  Should it fail to
>>> allocate said new page, it reuses the one that's already mapped just like
>>> before. As a result, for every 64G of contiguous hugepages it can give back
>>> 1G more of contiguous memory, while needing in total 128M of new 4K
>>> pages (for 2M hugetlb) or 256K (for 1G hugetlb).
>>> 
>>> After the changes, try to assign a 64G node to hugetlb (on a 128G 2node
>>> guest, each node with 64G):
>>> 
>>> * Before allocation
>>> Free pages count per migrate type at order       0      1      2      3
>>> 4      5      6      7      8      9     10
>>> ...
>>> Node    0, zone   Normal, type      Movable      1      1      1      0
>>> 0      1      0      0      1      1  15564
>>> 
>>> $ echo 32768  > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> 32394
>>> 
>>> * After:
>>> 
>>> Node    0, zone   Normal, type      Movable      0     50     97    108
>>> 96     81     70     46     18      0      0
>> 
>> Thanks for your work.
>> 
> Thanks for the comments!
> 
>>> 
>>> 
>>> In the example above, 407 more hugetlb 2M pages are allocated i.e. 814M out
>>> of the 32394 (64796M) allocated. So the memory freed back is indeed being
>>> used back in hugetlb and there's no massive order-0..order-2 pages
>>> accumulated unused.
>>> 
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>> Changes since v1[0]:
>>> * Drop rw argument and check walk::head_page directly when there's
>>> no reuse_page set (similar suggestion by Muchun Song to adjust
>>> inside the remap_pte callback)
>>> * Adjust TLB flush to cover the head page vaddr too (Muchun Song)
>>> * Simplify the remap of head page in vmemmap_pte_range()
>>> * Check start is aligned to PAGE_SIZE in vmemmap_remap_free()
>>> 
>>> I've kept the same structure as in v1 compared to a chunk Muchun
>>> pasted in the v1 thread[1] and thus I am not altering the calling
>>> convention of vmemmap_remap_free()/vmemmap_restore_pte().
>>> The remapping of head page is not exactly a page that is reused,
>>> compared to the r/o tail vmemmap pages remapping. So tiny semantic change,
>>> albeit same outcome in practice of changing the PTE and freeing the page,
>>> with different permissions. It also made it simpler to gracefully fail
>>> in case of page allocation failure, and logic simpler to follow IMHO.
>>> 
>>> Let me know otherwise if I followed the wrong thinking.
>>> 
>>> [0] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
>>> [1] https://lore.kernel.org/linux-mm/Yun1bJsnK%2F6MFc0b@FVFYT0MHHV2J/
>>> 
>>> ---
>>> mm/hugetlb_vmemmap.c | 59 ++++++++++++++++++++++++++++++++++++++------
>>> 1 file changed, 52 insertions(+), 7 deletions(-)
>>> 
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index 7898c2c75e35..4298c44578e3 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -22,6 +22,7 @@
>>> *
>>> * @remap_pte: called for each lowest-level entry (PTE).
>>> * @nr_walked: the number of walked pte.
>>> + * @head_page: the page which replaces the head vmemmap page.
>>> * @reuse_page: the page which is reused for the tail vmemmap pages.
>>> * @reuse_addr: the virtual address of the @reuse_page page.
>>> * @vmemmap_pages: the list head of the vmemmap pages that can be freed
>>> @@ -31,6 +32,7 @@ struct vmemmap_remap_walk {
>>> void (*remap_pte)(pte_t *pte, unsigned long addr,
>>>    struct vmemmap_remap_walk *walk);
>>> unsigned long nr_walked;
>>> + struct page *head_page;
>> 
>> This field is unnecessary. We can reuse ->reuse_page to implement the same
>> functionality. I'll explain the reason later.
>> 
> 
> OK, I'll comment below
> 
>>> struct page *reuse_page;
>>> unsigned long reuse_addr;
>>> struct list_head *vmemmap_pages;
>>> @@ -105,10 +107,26 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
>>> * remapping (which is calling @walk->remap_pte).
>>> */
>>> if (!walk->reuse_page) {
>>> - walk->reuse_page = pte_page(*pte);
>>> + struct page *page = pte_page(*pte);
>>> +
>>> + /*
>>> + * Copy the data from the original head, and remap to
>>> + * the newly allocated page.
>>> + */
>>> + if (walk->head_page) {
>>> + memcpy(page_address(walk->head_page),
>>> +       page_address(page), PAGE_SIZE);
>>> + walk->remap_pte(pte, addr, walk);
>>> + page = walk->head_page;
>>> + }
>>> +
>>> + walk->reuse_page = page;
>>> +
>>> /*
>>> - * Because the reuse address is part of the range that we are
>>> - * walking, skip the reuse address range.
>>> + * Because the reuse address is part of the range that
>>> + * we are walking or the head page was remapped to a
>>> + * new page, skip the reuse address range.
>>> + * .
>>> */
>>> addr += PAGE_SIZE;
>>> pte++;
>>> @@ -204,11 +222,11 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>>> } while (pgd++, addr = next, addr != end);
>>> 
>>> /*
>>> - * We only change the mapping of the vmemmap virtual address range
>>> - * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
>>> + * We change the mapping of the vmemmap virtual address range
>>> + * [@start, end], so we only need to flush the TLB which
>>> * belongs to the range.
>>> */
>> 
>> This comment could go away, the reason I added it here is because it is a bit special
>> here. I want to tell others why we don't flush the full range from @start to @end. Now, I
>> think it can go away.
>> 
> OK
> 
>>> - flush_tlb_kernel_range(start + PAGE_SIZE, end);
>>> + flush_tlb_kernel_range(start, end);
>>> 
>>> return 0;
>>> }
>>> @@ -244,9 +262,21 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>> * to the tail pages.
>>> */
>>> pgprot_t pgprot = PAGE_KERNEL_RO;
>>> - pte_t entry = mk_pte(walk->reuse_page, pgprot);
>>> + struct page *reuse = walk->reuse_page;
>>> struct page *page = pte_page(*pte);
>>> + pte_t entry;
>>> 
>>> + /*
>>> + * When there's no walk::reuse_page, it means we allocated a new head
>>> + * page (stored in walk::head_page) and copied from the old head page.
>>> + * In that case use the walk::head_page as the page to remap.
>>> + */
>>> + if (!reuse) {
>>> + pgprot = PAGE_KERNEL;
>>> + reuse = walk->head_page;
>>> + }
>>> +
>>> + entry = mk_pte(reuse, pgprot);
>>> list_add_tail(&page->lru, walk->vmemmap_pages);
>>> set_pte_at(&init_mm, addr, pte, entry);
>>> }
>>> @@ -315,6 +345,21 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>>> .reuse_addr = reuse,
>>> .vmemmap_pages = &vmemmap_pages,
>>> };
>>> + gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
>> 
>> It is better to add __GFP_THISNODE here for performance. And replace
>> __GFP_RETRY_MAYFAIL to __GFP_NORETRY to keep consistent with
>> hugetlb_vmemmap_restore().
>> 
> OK
> 
>>> + int nid = page_to_nid((struct page *)start);
>>> + struct page *page = NULL;
>>> +
>>> + /*
>>> + * Allocate a new head vmemmap page to avoid breaking a contiguous
>>> + * block of struct page memory when freeing it back to page allocator
>>> + * in free_vmemmap_page_list(). This will allow the likely contiguous
>>> + * struct page backing memory to be kept contiguous and allowing for
>>> + * more allocations of hugepages. Fallback to the currently
>>> + * mapped head page in case should it fail to allocate.
>>> + */
>>> + if (IS_ALIGNED((unsigned long)start, PAGE_SIZE))
>> 
>> I'm curious why we need this check. IIUC, this is unnecessary.
>> 
> 
> So if the start of the vmemmap range (the head page) we will remap isn't the
> first struct page, then we would corrupt the other struct pages in
> that vmemmap page unrelated to hugetlb? That was my thinking

Actually, @start address should always be aligned with PAGE_SIZE. If not,
vmemmap_remap_range() will complain. So the check can be removed.

> 
>>> + page = alloc_pages_node(nid, gfp_mask, 0);
>>> + walk.head_page = page;
>>> 
>>> /*
>>> * In order to make remapping routine most efficient for the huge pages,
>>> -- 
>>> 2.17.2
>>> 
>> 
>> I have implemented a version based on yours, which does not introduce
>> ->head_page field (Not test if it works). Seems to be simple.
>> 
> 
> Let me try out with the adjustment below
> 
>> Thanks.
>> 
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index c98805d5b815..8ee94f6a6697 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -202,12 +202,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>>                        return ret;
>>        } while (pgd++, addr = next, addr != end);
>> 
>> -       /*
>> -        * We only change the mapping of the vmemmap virtual address range
>> -        * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
>> -        * belongs to the range.
>> -        */
>> -       flush_tlb_kernel_range(start + PAGE_SIZE, end);
>> +       flush_tlb_kernel_range(start, end);
>> 
>>        return 0;
>> }
>> @@ -246,6 +241,12 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>        pte_t entry = mk_pte(walk->reuse_page, pgprot);
>>        struct page *page = pte_page(*pte);
>> 
>> +       /* The case of remapping the head vmemmap page. */
>> +       if (unlikely(addr == walk->reuse_addr)) {
> 
> You replace the head_page with checking the reuse_addr , but that is
> set on the tail page. So if we want to rely on reuse_addr perhaps
> best if do:
> 
> if (unlikely(addr == (walk->reuse_addr - PAGE_SIZE))) {
> ...
> }

I don't think so. The @addr here should be equal to @walk->reuse_addr
when vmemmap_remap_pte() is first called, since @addr has not yet been
updated by vmemmap_pte_range(). Right?

Thanks.

> 
>> +               list_del(&walk->reuse_page->lru);
>> +               entry = mk_pte(walk->reuse_page, PAGE_KERNEL);
>> +       }
>> +
>>        list_add_tail(&page->lru, walk->vmemmap_pages);
>>        set_pte_at(&init_mm, addr, pte, entry);
>> }
>> @@ -310,6 +311,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>>                .reuse_addr     = reuse,
>>                .vmemmap_pages  = &vmemmap_pages,
>>        };
>> +       int nid = page_to_nid((struct page *)start);
>> +       gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN;
>> 
>>        /*
>>         * In order to make remapping routine most efficient for the huge pages,
>> @@ -326,6 +329,20 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>>         */
>>        BUG_ON(start - reuse != PAGE_SIZE);
>> 
>> +       /*
>> +        * Allocate a new head vmemmap page to avoid breaking a contiguous
>> +        * block of struct page memory when freeing it back to page allocator
>> +        * in free_vmemmap_page_list(). This will allow the likely contiguous
>> +        * struct page backing memory to be kept contiguous and allowing for
>> +        * more allocations of hugepages. Fallback to the currently
>> +        * mapped head page in case should it fail to allocate.
>> +        */
>> +       walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
>> +       if (walk.reuse_page) {
>> +               copy_page(page_to_virt(walk.reuse_page), walk.reuse_addr);
>> +               list_add(&walk.reuse_page->lru, &vmemmap_pages);
>> +       }
>> +
>>        mmap_read_lock(&init_mm);
>>        ret = vmemmap_remap_range(reuse, end, &walk);
>>        if (ret && walk.nr_walked) {
Joao Martins Nov. 9, 2022, 1:15 p.m. UTC | #4
On 09/11/2022 02:42, Muchun Song wrote:
>> On Nov 8, 2022, at 18:38, Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 08/11/2022 09:13, Muchun Song wrote:
>>>> On Nov 7, 2022, at 23:39, Joao Martins <joao.m.martins@oracle.com> wrote:
>>>> + int nid = page_to_nid((struct page *)start);
>>>> + struct page *page = NULL;
>>>> +
>>>> + /*
>>>> + * Allocate a new head vmemmap page to avoid breaking a contiguous
>>>> + * block of struct page memory when freeing it back to page allocator
>>>> + * in free_vmemmap_page_list(). This will allow the likely contiguous
>>>> + * struct page backing memory to be kept contiguous and allowing for
>>>> + * more allocations of hugepages. Fallback to the currently
>>>> + * mapped head page in case should it fail to allocate.
>>>> + */
>>>> + if (IS_ALIGNED((unsigned long)start, PAGE_SIZE))
>>>
>>> I'm curious why we need this check. IIUC, this is unnecessary.
>>>
>>
>> So if the start of the vmemmap range (the head page) we will remap isn't the
>> first struct page, then we would corrupt the other struct pages in
>> that vmemmap page unrelated to hugetlb? That was my thinking
> 
> Actually, @start address should always be aligned with PAGE_SIZE. If not,
> vmemmap_remap_range() will complain. So the check can be removed.
>

True, but it will be a VM_BUG_ON().

I can remove this check here, but I would suggest that, rather than
crashing, we detect the error in the caller (vmemmap_remap_free())
prior to proceeding.

>>
>>>> + page = alloc_pages_node(nid, gfp_mask, 0);
>>>> + walk.head_page = page;
>>>>
>>>> /*
>>>> * In order to make remapping routine most efficient for the huge pages,
>>>> -- 
>>>> 2.17.2
>>>>
>>>
>>> I have implemented a version based on yours, which does not introduce
>>> ->head_page field (Not test if it works). Seems to be simple.
>>>
>>
>> Let me try out with the adjustment below
>>
>>> Thanks.
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index c98805d5b815..8ee94f6a6697 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -202,12 +202,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>>>                        return ret;
>>>        } while (pgd++, addr = next, addr != end);
>>>
>>> -       /*
>>> -        * We only change the mapping of the vmemmap virtual address range
>>> -        * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
>>> -        * belongs to the range.
>>> -        */
>>> -       flush_tlb_kernel_range(start + PAGE_SIZE, end);
>>> +       flush_tlb_kernel_range(start, end);
>>>
>>>        return 0;
>>> }
>>> @@ -246,6 +241,12 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>>        pte_t entry = mk_pte(walk->reuse_page, pgprot);
>>>        struct page *page = pte_page(*pte);
>>>
>>> +       /* The case of remapping the head vmemmap page. */

I changed this comment to:

/* Remapping the head page requires r/w */

>>> +       if (unlikely(addr == walk->reuse_addr)) {
>>
>> You replace the head_page with checking the reuse_addr , but that is
>> set on the tail page. So if we want to rely on reuse_addr perhaps
>> best if do:
>>
>> if (unlikely(addr == (walk->reuse_addr - PAGE_SIZE))) {
>> ...
>> }
> 
> I don't think so. The @addr here should be equal to @walk->reuse_addr
> when vmemmap_remap_pte() is first called, since @addr has not yet been
> updated by vmemmap_pte_range(). Right?
> 

Ah, yes -- I misread it. I was confusing this with the case of reusing 2
vmemmap pages rather than 1 (the latter being the current behaviour).
And thus the reuse_addr points at the head page.

Ok, this should be much cleaner with these adjustments you proposed, and
it works just as well as far as I tested.

I'll respin v3.
Patch

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 7898c2c75e35..4298c44578e3 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -22,6 +22,7 @@ 
  *
  * @remap_pte:		called for each lowest-level entry (PTE).
  * @nr_walked:		the number of walked pte.
+ * @head_page:		the page which replaces the head vmemmap page.
  * @reuse_page:		the page which is reused for the tail vmemmap pages.
  * @reuse_addr:		the virtual address of the @reuse_page page.
  * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
@@ -31,6 +32,7 @@  struct vmemmap_remap_walk {
 	void			(*remap_pte)(pte_t *pte, unsigned long addr,
 					     struct vmemmap_remap_walk *walk);
 	unsigned long		nr_walked;
+	struct page		*head_page;
 	struct page		*reuse_page;
 	unsigned long		reuse_addr;
 	struct list_head	*vmemmap_pages;
@@ -105,10 +107,26 @@  static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
 	 * remapping (which is calling @walk->remap_pte).
 	 */
 	if (!walk->reuse_page) {
-		walk->reuse_page = pte_page(*pte);
+		struct page *page = pte_page(*pte);
+
+		/*
+		 * Copy the data from the original head, and remap to
+		 * the newly allocated page.
+		 */
+		if (walk->head_page) {
+			memcpy(page_address(walk->head_page),
+			       page_address(page), PAGE_SIZE);
+			walk->remap_pte(pte, addr, walk);
+			page = walk->head_page;
+		}
+
+		walk->reuse_page = page;
+
 		/*
-		 * Because the reuse address is part of the range that we are
-		 * walking, skip the reuse address range.
+		 * Because the reuse address is part of the range that
+		 * we are walking or the head page was remapped to a
+		 * new page, skip the reuse address range.
+		 * .
 		 */
 		addr += PAGE_SIZE;
 		pte++;
@@ -204,11 +222,11 @@  static int vmemmap_remap_range(unsigned long start, unsigned long end,
 	} while (pgd++, addr = next, addr != end);
 
 	/*
-	 * We only change the mapping of the vmemmap virtual address range
-	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
+	 * We change the mapping of the vmemmap virtual address range
+	 * [@start, end], so we only need to flush the TLB which
 	 * belongs to the range.
 	 */
-	flush_tlb_kernel_range(start + PAGE_SIZE, end);
+	flush_tlb_kernel_range(start, end);
 
 	return 0;
 }
@@ -244,9 +262,21 @@  static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 	 * to the tail pages.
 	 */
 	pgprot_t pgprot = PAGE_KERNEL_RO;
-	pte_t entry = mk_pte(walk->reuse_page, pgprot);
+	struct page *reuse = walk->reuse_page;
 	struct page *page = pte_page(*pte);
+	pte_t entry;
 
+	/*
+	 * When there's no walk::reuse_page, it means we allocated a new head
+	 * page (stored in walk::head_page) and copied from the old head page.
+	 * In that case use the walk::head_page as the page to remap.
+	 */
+	if (!reuse) {
+		pgprot = PAGE_KERNEL;
+		reuse = walk->head_page;
+	}
+
+	entry = mk_pte(reuse, pgprot);
 	list_add_tail(&page->lru, walk->vmemmap_pages);
 	set_pte_at(&init_mm, addr, pte, entry);
 }
@@ -315,6 +345,21 @@  static int vmemmap_remap_free(unsigned long start, unsigned long end,
 		.reuse_addr	= reuse,
 		.vmemmap_pages	= &vmemmap_pages,
 	};
+	gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
+	int nid = page_to_nid((struct page *)start);
+	struct page *page = NULL;
+
+	/*
+	 * Allocate a new head vmemmap page to avoid breaking a contiguous
+	 * block of struct page memory when freeing it back to page allocator
+	 * in free_vmemmap_page_list(). This will allow the likely contiguous
+	 * struct page backing memory to be kept contiguous and allowing for
+	 * more allocations of hugepages. Fallback to the currently
+	 * mapped head page in case should it fail to allocate.
+	 */
+	if (IS_ALIGNED((unsigned long)start, PAGE_SIZE))
+		page = alloc_pages_node(nid, gfp_mask, 0);
+	walk.head_page = page;
 
 	/*
 	 * In order to make remapping routine most efficient for the huge pages,