Message ID | 20220802180309.19340-1-joao.m.martins@oracle.com
---|---
State | New
Series | [v1] mm/hugetlb_vmemmap: remap head page to newly allocated page
On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> freed back to page allocator is as following: for a 2M hugetlb page it
> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap
> pages. Essentially, that means that it breaks the first 4K of a
> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> 16M (for 1G hugetlb pages). For this reason the memory that it's free
> back to page allocator cannot be used for hugetlb to allocate huge pages
> of the same size, but rather only of a smaller huge page size:
>

Hi Joao,

Thanks for your work on this. I admit you are right. The current mechanism
prevents the freed vmemmap pages from being merged into a potential
contiguous page. Allocating a new head page is a straightforward approach;
however, it is very dangerous at runtime after the system has booted up.
Why dangerous? Because you should first 1) copy the content from the head
vmemmap page to the targeted (newly allocated) page, and then 2) change
the PTE entry to the new page. However, the content (especially the
refcount) of the old head vmemmap page could be changed elsewhere (e.g. by
other modules) between steps 1) and 2). Eventually, the newly allocated
vmemmap page is corrupted. Unluckily, we don't have an easy approach to
prevent it.

I also thought of solving this issue, but I didn't find any easy way to
solve it after the system has booted up. However, it is possible at system
boot time, because at that very early initialization stage no one should
be accessing struct pages yet. I implemented it a month ago. I didn't send
it out since some additional preparation work is required. The main
preparation is to move the allocation of HugeTLB to a very early
initialization stage (I want to put it into pure_initcall). Because the
allocation of HugeTLB is parsed from the cmdline, whose logic is very
complex, I have sent a patch to clean it up [1]. After this cleanup, it
will be easy to move the allocation to an early stage. Sorry, I have been
busy lately and didn't have time to send a new version. But I will send it
ASAP.

[1] https://lore.kernel.org/all/20220616071827.3480-1-songmuchun@bytedance.com/

The following diff is the main work to remap the head vmemmap page to a
newly allocated page.

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 20f414c0379f..71f2d7335e6f 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -15,6 +15,7 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include "hugetlb_vmemmap.h"
+#include "internal.h"

 /**
  * struct vmemmap_remap_walk - walk vmemmap page table
@@ -227,14 +228,37 @@ static inline void free_vmemmap_page(struct page *page)
 }

 /* Free a list of the vmemmap pages */
-static void free_vmemmap_page_list(struct list_head *list)
+static void free_vmemmap_pages(struct list_head *list)
 {
 	struct page *page, *next;

+	list_for_each_entry_safe(page, next, list, lru)
+		free_vmemmap_page(page);
+}
+
+/*
+ * Free a list of vmemmap pages but skip per-cpu free list of buddy
+ * allocator.
+ */
+static void free_vmemmap_pages_nonpcp(struct list_head *list)
+{
+	struct zone *zone;
+	struct page *page, *next;
+
+	if (list_empty(list))
+		return;
+
+	zone = page_zone(list_first_entry(list, struct page, lru));
+	zone_pcp_disable(zone);
 	list_for_each_entry_safe(page, next, list, lru) {
-		list_del(&page->lru);
+		if (zone != page_zone(page)) {
+			zone_pcp_enable(zone);
+			zone = page_zone(page);
+			zone_pcp_disable(zone);
+		}
 		free_vmemmap_page(page);
 	}
+	zone_pcp_enable(zone);
 }

 static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
@@ -244,12 +268,28 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 	 * Remap the tail pages as read-only to catch illegal write operation
 	 * to the tail pages.
 	 */
-	pgprot_t pgprot = PAGE_KERNEL_RO;
+	pgprot_t pgprot = addr == walk->reuse_addr ? PAGE_KERNEL : PAGE_KERNEL_RO;
 	pte_t entry = mk_pte(walk->reuse_page, pgprot);
 	struct page *page = pte_page(*pte);

 	list_add_tail(&page->lru, walk->vmemmap_pages);
 	set_pte_at(&init_mm, addr, pte, entry);
+
+	if (unlikely(addr == walk->reuse_addr)) {
+		void *old = page_to_virt(page);
+
+		/* Remove it from the vmemmap_pages list to avoid being freed. */
+		list_del(&walk->reuse_page->lru);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+		/*
+		 * If we reach here meaning the system is in a very early
+		 * initialization stage, anyone should not access struct page.
+		 * However, if there is something unexpected, the head struct
+		 * page most likely to be written (usually ->_refcount). Using
+		 * BUG_ON() to catch this unexpected case.
+		 */
+		BUG_ON(memcmp(old, (void *)addr, sizeof(struct page)));
+	}
 }

 /*
@@ -298,7 +338,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
  *		to remap.
  * @end:	end address of the vmemmap virtual address range that we want to
  *		remap.
- * @reuse:	reuse address.
+ * @reuse:	reuse address. If @reuse is equal to @start, it means the page
+ *		frame which @reuse address is mapped to will be replaced with a
+ *		new page frame and the previous page frame will be freed, this
+ *		is to reduce memory fragment.
  *
  * Return: %0 on success, negative error code otherwise.
  */
@@ -319,14 +362,26 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 	 * (see more details from the vmemmap_pte_range()):
 	 *
 	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
-	 *   should be continuous.
+	 *   should be continuous or @start is equal to @reuse.
 	 * - The @reuse address is part of the range [@reuse, @end) that we are
 	 *   walking which is passed to vmemmap_remap_range().
 	 * - The @reuse address is the first in the complete range.
 	 *
 	 * So we need to make sure that @start and @reuse meet the above rules.
 	 */
-	BUG_ON(start - reuse != PAGE_SIZE);
+	BUG_ON(start - reuse != PAGE_SIZE && start != reuse);
+
+	if (unlikely(reuse == start)) {
+		int nid = page_to_nid((struct page *)start);
+		gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
+				 __GFP_NOWARN;
+
+		walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
+		if (walk.reuse_page) {
+			copy_page(page_to_virt(walk.reuse_page), (void *)reuse);
+			list_add(&walk.reuse_page->lru, &vmemmap_pages);
+		}
+	}

 	mmap_read_lock(&init_mm);
 	ret = vmemmap_remap_range(reuse, end, &walk);
@@ -348,7 +403,10 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 	}
 	mmap_read_unlock(&init_mm);

-	free_vmemmap_page_list(&vmemmap_pages);
+	if (unlikely(reuse == start))
+		free_vmemmap_pages_nonpcp(&vmemmap_pages);
+	else
+		free_vmemmap_pages(&vmemmap_pages);

 	return ret;
 }
@@ -512,6 +570,22 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
 	return true;
 }

+/*
+ * Control if the page frame which the address of the head vmemmap associated
+ * with a HugeTLB page is mapped to should be replaced with a new page. The
+ * vmemmap pages are usually mapped with huge PMD mapping, the head vmemmap
+ * page frames is best freed to the buddy allocator once at an initial stage
+ * of system booting to reduce memory fragment.
+ */
+static bool vmemmap_remap_head __ro_after_init = true;
+
+static int __init vmemmap_remap_head_init(void)
+{
+	vmemmap_remap_head = false;
+	return 0;
+}
+core_initcall(vmemmap_remap_head_init);
+
 /**
  * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
  * @h:		struct hstate.
@@ -537,6 +611,19 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
 	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;

 	/*
+	 * The vmemmap pages are usually mapped with huge PMD mapping. If the
+	 * head vmemmap page is not freed to the buddy allocator, then those
+	 * freed tail vmemmap pages cannot be merged into a big order chunk.
+	 * The head vmemmap page frame can be replaced with a new allocated
+	 * page and be freed to the buddy allocator, then those freed vmemmmap
+	 * pages have the opportunity to be merged into larger contiguous pages
+	 * to reduce memory fragment. vmemmap_remap_free() will do this if
+	 * @vmemmap_remap_free is equal to @vmemmap_reuse.
+	 */
+	if (unlikely(vmemmap_remap_head))
+		vmemmap_start = vmemmap_reuse;
+
+	/*
 	 * Remap the vmemmap virtual address range [@vmemmap_start, @vmemmap_end)
 	 * to the page which @vmemmap_reuse is mapped to, then free the pages
 	 * which the range [@vmemmap_start, @vmemmap_end] is mapped to.

Thanks.

> Try to assign a 64G node to hugetlb (on a 128G 2node guest, each node
> having 64G):
>
> * Before allocation:
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> ...
> Node    0, zone   Normal, type      Movable      0      0      1      0      1      0      0      0      0      0  15561
>
> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> 31974
>
> * After:
>
> Node    0, zone   Normal, type      Movable  32174  31999  31642    104     58     24     16      4      2      0      0
>
> Notice how the memory freed back are put back into 4K / 8K / 16K page pools.
> And it allocates a total of 31974 pages (63948M).
>
> To fix this behaviour rather than leaving one page (thus breaking the
> contiguous block of memory backing the struct pages) instead repopulate
> with a new page for the head vmemmap page. Followed by copying the data
> from the currently mapped vmemmap page there, to then remap it to this
> new page. Additionally, change the remap_pte callback to include a rw
> bool parameter given that the head page needs r/w perm compared to the
> tail page vmemmap remap that is r/o. vmemmap_pte_range() will set
> accordingly when calling remap_pte() when it first tries to set
> @reuse_page. The new head page is allocated by the caller of
> vmemmap_remap_free() given that on restore it should still be using the
> same code path as before. Note that, because right now one hugepage is
> remapped at a time, thus only one free 4K page at a time is needed to
> remap the head page. Should it fail to allocate said new page, it reuses
> the one that's already mapped just like before. As a result, for every 64G
> of contiguous hugepages it can give back 1G more of contiguous memory per 64G,
> while needing in total 128M new 4K pages (for 2M hugetlb) or 256k (for 1G
> hugetlb).
>
> After the changes, try to assign a 64G node to hugetlb (on a 128G 2node guest,
> each node with 64G):
>
> * Before allocation
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> ...
> Node    0, zone   Normal, type      Movable      0      1      1      0      0      0      0      1      1      1  15564
>
> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> 32390
>
> * After:
>
> Node    0, zone   Normal, type      Movable      0      1      0     70    106     91     78     48     17      0      0
>
> In the example above, 416 more hugeltb 2M pages are allocated i.e. 832M
> out of the 32390 (64780M) allocated. So the memory freed back is indeed
> being used back in hugetlb and there's no order-0..order-2 pages
> accumulated unused.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  mm/hugetlb_vmemmap.c | 47 ++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 39 insertions(+), 8 deletions(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 20f414c0379f..2b97df8115fe 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -28,9 +28,11 @@
>   */
>  struct vmemmap_remap_walk {
>  	void (*remap_pte)(pte_t *pte, unsigned long addr,
> -			  struct vmemmap_remap_walk *walk);
> +			  struct vmemmap_remap_walk *walk,
> +			  bool rw);
>  	unsigned long nr_walked;
>  	struct page *reuse_page;
> +	struct page *head_page;
>  	unsigned long reuse_addr;
>  	struct list_head *vmemmap_pages;
>  };
> @@ -104,10 +106,25 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
>  	 * remapping (which is calling @walk->remap_pte).
>  	 */
>  	if (!walk->reuse_page) {
> -		walk->reuse_page = pte_page(*pte);
> +		struct page *ptpage = pte_page(*pte);
> +		struct page *page = walk->head_page ? walk->head_page : ptpage;
> +
> +		walk->reuse_page = page;
> +
> +		/*
> +		 * Copy the data from the original head, and remap to
> +		 * the newly allocated page.
> +		 */
> +		if (page != ptpage) {
> +			memcpy(page_address(page), page_address(ptpage),
> +			       PAGE_SIZE);
> +			walk->remap_pte(pte, addr, walk, true);
> +		}
> +
>  		/*
> -		 * Because the reuse address is part of the range that we are
> -		 * walking, skip the reuse address range.
> +		 * Because the reuse address is part of the range that
> +		 * we are walking, skip the reuse address range. Or we
> +		 * already remapped the head page to a newly page.
>  		 */
>  		addr += PAGE_SIZE;
>  		pte++;
> @@ -115,7 +132,7 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
>  	}
>
>  	for (; addr != end; addr += PAGE_SIZE, pte++) {
> -		walk->remap_pte(pte, addr, walk);
> +		walk->remap_pte(pte, addr, walk, false);
>  		walk->nr_walked++;
>  	}
>  }
> @@ -238,13 +255,13 @@ static void free_vmemmap_page_list(struct list_head *list)
>  }
>
>  static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> -			      struct vmemmap_remap_walk *walk)
> +			      struct vmemmap_remap_walk *walk, bool rw)
>  {
>  	/*
>  	 * Remap the tail pages as read-only to catch illegal write operation
>  	 * to the tail pages.
>  	 */
> -	pgprot_t pgprot = PAGE_KERNEL_RO;
> +	pgprot_t pgprot = rw ? PAGE_KERNEL : PAGE_KERNEL_RO;
>  	pte_t entry = mk_pte(walk->reuse_page, pgprot);
>  	struct page *page = pte_page(*pte);
>
> @@ -273,7 +290,7 @@ static inline void reset_struct_pages(struct page *start)
>  }
>
>  static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> -				struct vmemmap_remap_walk *walk)
> +				struct vmemmap_remap_walk *walk, bool rw)
>  {
>  	pgprot_t pgprot = PAGE_KERNEL;
>  	struct page *page;
> @@ -312,6 +329,20 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>  		.reuse_addr	= reuse,
>  		.vmemmap_pages	= &vmemmap_pages,
>  	};
> +	gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
> +	int nid = page_to_nid((struct page *)start);
> +	struct page *page;
> +
> +	/*
> +	 * Allocate a new head vmemmap page to avoid breaking a contiguous
> +	 * block of struct page memory when freeing it back to page allocator
> +	 * in free_vmemmap_page_list(). This will allow the likely contiguous
> +	 * struct page backing memory to be kept contiguous and allowing for
> +	 * more allocations of hugepages. Fallback to the currently
> +	 * mapped head page in case should it fail to allocate.
> +	 */
> +	page = alloc_pages_node(nid, gfp_mask, 0);
> +	walk.head_page = page;
>
>  	/*
>  	 * In order to make remapping routine most efficient for the huge pages,
> --
> 2.17.2
>
>
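A back-of-the-envelope check of the numbers in the quoted commit message. This
standalone sketch assumes 4K base pages and a 64-byte struct page (as on
x86-64); none of this code is from either patch:

#include <stdio.h>

#define PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL	/* assumed sizeof(struct page) */

int main(void)
{
	unsigned long sizes[] = { 2UL << 20, 1UL << 30 };	/* 2M and 1G */

	for (int i = 0; i < 2; i++) {
		/* One struct page per 4K base page of the hugetlb page. */
		unsigned long nr_vmemmap = sizes[i] / PAGE_SIZE *
					   STRUCT_PAGE_SIZE / PAGE_SIZE;

		printf("%luM hugetlb page: %lu vmemmap pages, %lu freed as tails,\n"
		       "  keeping the head breaks a %luK contiguous vmemmap chunk\n",
		       sizes[i] >> 20, nr_vmemmap, nr_vmemmap - 1,
		       nr_vmemmap * PAGE_SIZE >> 10);
	}
	return 0;
}

For 2M pages this prints 8 vmemmap pages (7 freed, one 32K run broken); for
1G, 4096 pages (4095 freed, one 16M run broken), matching the figures above.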
On 8/3/22 05:11, Muchun Song wrote:
> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
>> freed back to page allocator is as following: for a 2M hugetlb page it
>> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
>> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap
>> pages. Essentially, that means that it breaks the first 4K of a
>> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
>> 16M (for 1G hugetlb pages). For this reason the memory that it's free
>> back to page allocator cannot be used for hugetlb to allocate huge pages
>> of the same size, but rather only of a smaller huge page size:
>>
>
> Hi Joao,
>
> Thanks for your work on this. I admit you are right. The current mechanism
> prevented the freed vmemmap pages from being mergerd into a potential
> contiguous page. Allocating a new head page is straightforward approach,
> however, it is very dangerous at runtime after system booting up. Why
> dangerous? Because you should first 1) copy the content from the head vmemmap
> page to the targeted (newly allocated) page, and then 2) change the PTE
> entry to the new page. However, the content (especially the refcount) of the
> old head vmemmmap page could be changed elsewhere (e.g. other modules)
> between the step 1) and 2). Eventually, the new allocated vmemmap page is
> corrupted. Unluckily, we don't have an easy approach to prevent it.
>

OK, I see what I missed. You mean the refcount (or any other data) on the
preceding struct pages to this head struct page, unrelated to the hugetlb
page being remapped. Meaning when the first struct page in the old vmemmap
page is *not* the head page we are trying to remap, right?

See further below in your patch, but I wonder if we could actually check this
from the hugetlb head VA being aligned to PAGE_SIZE. Meaning we would be
checking that this head page is the first struct page in the vmemmap page,
and hence that we are still safe to remap it? If so, the patch could be
simpler, more like mine, without the special freeing path you added below.

If I'm right, see at the end.

> I also thought of solving this issue, but I didn't find any easy way to
> solve it after system booting up. However, it is possible at system boot
> time. Because if the system is in a very early initialization stage,
> anyone should not access struct page. I have implemented it a mounth ago.
> I didn't send it out since some additional preparation work is required.
> The main preparation is to move the allocation of HugeTLB to a very
> early initialization stage (I want to put it into pure_initcall). Because
> the allocation of HugeTLB is parsed from cmdline whose logic is very
> complex, I have sent a patch to cleanup it [1]. After this cleanup, it
> will be easy to move the allocation to an early stage. Sorry for that
> I am busy lately, I didn't have time to send a new version. But I will
> send it ASAP.
>
> [1] https://lore.kernel.org/all/20220616071827.3480-1-songmuchun@bytedance.com/
>
> The following diff is the main work to remap the head vmemmap page to
> a new allocated page.
>

I like how you use walk::reuse_addr to tell it's a head page.
I didn't find my version as clean with passing a boolean to the remap_pte().
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c > index 20f414c0379f..71f2d7335e6f 100644 > --- a/mm/hugetlb_vmemmap.c > +++ b/mm/hugetlb_vmemmap.c > @@ -15,6 +15,7 @@ > #include <asm/pgalloc.h> > #include <asm/tlbflush.h> > #include "hugetlb_vmemmap.h" > +#include "internal.h" > > /** > * struct vmemmap_remap_walk - walk vmemmap page table > @@ -227,14 +228,37 @@ static inline void free_vmemmap_page(struct page *page) > } > > /* Free a list of the vmemmap pages */ > -static void free_vmemmap_page_list(struct list_head *list) > +static void free_vmemmap_pages(struct list_head *list) > { > struct page *page, *next; > > + list_for_each_entry_safe(page, next, list, lru) > + free_vmemmap_page(page); > +} > + > +/* > + * Free a list of vmemmap pages but skip per-cpu free list of buddy > + * allocator. > + */ > +static void free_vmemmap_pages_nonpcp(struct list_head *list) > +{ > + struct zone *zone; > + struct page *page, *next; > + > + if (list_empty(list)) > + return; > + > + zone = page_zone(list_first_entry(list, struct page, lru)); > + zone_pcp_disable(zone); > list_for_each_entry_safe(page, next, list, lru) { > - list_del(&page->lru); > + if (zone != page_zone(page)) { > + zone_pcp_enable(zone); > + zone = page_zone(page); > + zone_pcp_disable(zone); > + } > free_vmemmap_page(page); > } > + zone_pcp_enable(zone); > } > > static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, > @@ -244,12 +268,28 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, > * Remap the tail pages as read-only to catch illegal write operation > * to the tail pages. > */ > - pgprot_t pgprot = PAGE_KERNEL_RO; > + pgprot_t pgprot = addr == walk->reuse_addr ? PAGE_KERNEL : PAGE_KERNEL_RO; > pte_t entry = mk_pte(walk->reuse_page, pgprot); > struct page *page = pte_page(*pte); > > list_add_tail(&page->lru, walk->vmemmap_pages); > set_pte_at(&init_mm, addr, pte, entry); > + > + if (unlikely(addr == walk->reuse_addr)) { > + void *old = page_to_virt(page); > + > + /* Remove it from the vmemmap_pages list to avoid being freed. */ > + list_del(&walk->reuse_page->lru); > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > + /* > + * If we reach here meaning the system is in a very early > + * initialization stage, anyone should not access struct page. > + * However, if there is something unexpected, the head struct > + * page most likely to be written (usually ->_refcount). Using > + * BUG_ON() to catch this unexpected case. > + */ > + BUG_ON(memcmp(old, (void *)addr, sizeof(struct page))); > + } > } > > /* > @@ -298,7 +338,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr, > * to remap. > * @end: end address of the vmemmap virtual address range that we want to > * remap. > - * @reuse: reuse address. > + * @reuse: reuse address. If @reuse is equal to @start, it means the page > + * frame which @reuse address is mapped to will be replaced with a > + * new page frame and the previous page frame will be freed, this > + * is to reduce memory fragment. > * > * Return: %0 on success, negative error code otherwise. > */ > @@ -319,14 +362,26 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, > * (see more details from the vmemmap_pte_range()): > * > * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE) > - * should be continuous. > + * should be continuous or @start is equal to @reuse. > * - The @reuse address is part of the range [@reuse, @end) that we are > * walking which is passed to vmemmap_remap_range(). 
> 	 * - The @reuse address is the first in the complete range.
> 	 *
> 	 * So we need to make sure that @start and @reuse meet the above rules.
> 	 */
> -	BUG_ON(start - reuse != PAGE_SIZE);
> +	BUG_ON(start - reuse != PAGE_SIZE && start != reuse);
> +
> +	if (unlikely(reuse == start)) {
> +		int nid = page_to_nid((struct page *)start);
> +		gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
> +				 __GFP_NOWARN;
> +
> +		walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
> +		if (walk.reuse_page) {
> +			copy_page(page_to_virt(walk.reuse_page), (void *)reuse);
> +			list_add(&walk.reuse_page->lru, &vmemmap_pages);
> +		}
> +	}
>
> 	mmap_read_lock(&init_mm);
> 	ret = vmemmap_remap_range(reuse, end, &walk);
> @@ -348,7 +403,10 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> 	}
> 	mmap_read_unlock(&init_mm);
>
> -	free_vmemmap_page_list(&vmemmap_pages);
> +	if (unlikely(reuse == start))
> +		free_vmemmap_pages_nonpcp(&vmemmap_pages);
> +	else
> +		free_vmemmap_pages(&vmemmap_pages);
>
> 	return ret;
> }
> @@ -512,6 +570,22 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
> 	return true;
> }
>
> +/*
> + * Control if the page frame which the address of the head vmemmap associated
> + * with a HugeTLB page is mapped to should be replaced with a new page. The
> + * vmemmap pages are usually mapped with huge PMD mapping, the head vmemmap
> + * page frames is best freed to the buddy allocator once at an initial stage
> + * of system booting to reduce memory fragment.
> + */
> +static bool vmemmap_remap_head __ro_after_init = true;
> +
> +static int __init vmemmap_remap_head_init(void)
> +{
> +	vmemmap_remap_head = false;
> +	return 0;
> +}
> +core_initcall(vmemmap_remap_head_init);
> +
> /**
>  * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
>  * @h:		struct hstate.
> @@ -537,6 +611,19 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> 	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
>
> 	/*
> +	 * The vmemmap pages are usually mapped with huge PMD mapping. If the
> +	 * head vmemmap page is not freed to the buddy allocator, then those
> +	 * freed tail vmemmap pages cannot be merged into a big order chunk.
> +	 * The head vmemmap page frame can be replaced with a new allocated
> +	 * page and be freed to the buddy allocator, then those freed vmemmmap
> +	 * pages have the opportunity to be merged into larger contiguous pages
> +	 * to reduce memory fragment. vmemmap_remap_free() will do this if
> +	 * @vmemmap_remap_free is equal to @vmemmap_reuse.
> +	 */
> +	if (unlikely(vmemmap_remap_head))
> +		vmemmap_start = vmemmap_reuse;
> +

Maybe it doesn't need to be strictly early init vs. late init.

I wonder if we can still pull this trick at a late stage, when the head
struct page is aligned to PAGE_SIZE / sizeof(struct page): when it is, it
means we are safe to replace the head vmemmap page, provided we would only
be covering this hugetlb page's data. Unless I am missing something and the
check wouldn't be enough?
A quick check with this snip added:

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 2b97df8115fe..06e028734b1e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -331,7 +331,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 	};
 	gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
 	int nid = page_to_nid((struct page *)start);
-	struct page *page;
+	struct page *page = NULL;

 	/*
 	 * Allocate a new head vmemmap page to avoid breaking a contiguous
@@ -341,7 +341,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
 	 * more allocations of hugepages. Fallback to the currently
 	 * mapped head page in case should it fail to allocate.
 	 */
-	page = alloc_pages_node(nid, gfp_mask, 0);
+	if (IS_ALIGNED(start, PAGE_SIZE))
+		page = alloc_pages_node(nid, gfp_mask, 0);
 	walk.head_page = page;

 	/*

... and it looks like it can still be effective, just not as much, more or
less as expected?

# with 2M hugepages

Before:

Node    0, zone   Normal, type      Movable     76     28     10      4      1      0      0      0      1      1  15568

After (allocated 32400):

Node    0, zone   Normal, type      Movable    135    328    219    198    155    106     72     41     23      0      0

Compared to my original patch, where there weren't any pages left in
order-0..order-2:

Node    0, zone   Normal, type      Movable      0      1      0     70    106     91     78     48     17      0      0

But still much better than without any of this:

Node    0, zone   Normal, type      Movable  32174  31999  31642    104     58     24     16      4      2      0      0
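The alignment argument can be checked in isolation. A minimal sketch of the
invariant behind Joao's IS_ALIGNED() test, assuming 64 struct pages per 4K
vmemmap page; the helper name is made up for illustration:

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE	4096UL
#define STRUCT_PAGE	64UL	/* assumed sizeof(struct page) */
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

/*
 * If the head struct page starts at a page-aligned vmemmap address, the
 * whole 4K vmemmap frame holds only this hugetlb page's struct pages
 * (PAGE_SIZE / STRUCT_PAGE = 64 of them), so copying it cannot race with
 * writers to some unrelated page's metadata living in the same frame.
 */
static bool head_vmemmap_page_exclusive(unsigned long vmemmap_addr)
{
	return IS_ALIGNED(vmemmap_addr, PAGE_SIZE);
}

int main(void)
{
	/* Head struct page at byte offset 512 * 64 = 32768: aligned, safe. */
	printf("%d\n", head_vmemmap_page_exclusive(512 * STRUCT_PAGE));
	/* Head struct page at byte offset 8 * 64 = 512: it shares its
	 * vmemmap frame with the struct pages of 8 preceding pfns. */
	printf("%d\n", head_vmemmap_page_exclusive(8 * STRUCT_PAGE));
	return 0;
}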
On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> On 8/3/22 05:11, Muchun Song wrote:
> > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> >> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> >> freed back to page allocator is as following: for a 2M hugetlb page it
> >> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> >> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap
> >> pages. Essentially, that means that it breaks the first 4K of a
> >> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> >> 16M (for 1G hugetlb pages). For this reason the memory that it's free
> >> back to page allocator cannot be used for hugetlb to allocate huge pages
> >> of the same size, but rather only of a smaller huge page size:
> >>
> >
> > Hi Joao,
> >
> > Thanks for your work on this. I admit you are right. The current mechanism
> > prevented the freed vmemmap pages from being mergerd into a potential
> > contiguous page. Allocating a new head page is straightforward approach,
> > however, it is very dangerous at runtime after system booting up. Why
> > dangerous? Because you should first 1) copy the content from the head vmemmap
> > page to the targeted (newly allocated) page, and then 2) change the PTE
> > entry to the new page. However, the content (especially the refcount) of the
> > old head vmemmmap page could be changed elsewhere (e.g. other modules)
> > between the step 1) and 2). Eventually, the new allocated vmemmap page is
> > corrupted. Unluckily, we don't have an easy approach to prevent it.
> >
> OK, I see what I missed. You mean the refcount (or any other data) on the
> preceeding struct pages to this head struct page unrelated to the hugetlb
> page being remapped. Meaning when the first struct page in the old vmemmap
> page is *not* the head page we are trying to remap right?
>
> See further below in your patch but I wonder if we could actually check this
> from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be
> checking that this head page is the first of struct page in the vmemmap page
> that we are still safe to remap? If so, the patch could be simpler more
> like mine, without the special freeing path you added below.
>
> If I'm right, see at the end.

I am not sure we are on the same page (it seems that we are not, after I saw
your patch below). So let me clarify.

CPU0:                                          CPU1:

vmemmap_remap_free(start, end, reuse)
  // alloc a new page used to be the head vmemmap page
  page = alloc_pages_node();

  memcpy(page_address(page), reuse, PAGE_SIZE);
  // Now the @reuse address is mapped to the original
  // page frame. So the change will be reflected on the
  // original page frame.
                                               get_page(reuse);
  vmemmap_remap_pte();
  // remap to the above new allocated page
  set_pte_at();

  flush_tlb_kernel_range();
  // Now the @reuse address is mapped to the new allocated
  // page frame. So the change will be reflected on the
  // new page frame and it is corrupted.
                                               put_page(reuse);

So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0;
however, it is not easy.

Thanks.

>
> > I also thought of solving this issue, but I didn't find any easy way to
> > solve it after system booting up. However, it is possible at system boot
> > time. Because if the system is in a very early initialization stage,
> > anyone should not access struct page. I have implemented it a mounth ago.
> > I didn't send it out since some additional preparation work is required.
> > The main preparation is to move the allocation of HugeTLB to a very > > early initialization stage (I want to put it into pure_initcall). Because > > the allocation of HugeTLB is parsed from cmdline whose logic is very > > complex, I have sent a patch to cleanup it [1]. After this cleanup, it > > will be easy to move the allocation to an early stage. Sorry for that > > I am busy lately, I didn't have time to send a new version. But I will > > send it ASAP. > > > > [1] https://lore.kernel.org/all/20220616071827.3480-1-songmuchun@bytedance.com/ > > > > The following diff is the main work to remap the head vmemmap page to > > a new allocated page. > > > I like how you use walk::reuse_addr to tell it's an head page. > I didn't find my version as clean with passing a boolean to the remap_pte(). > > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c > > index 20f414c0379f..71f2d7335e6f 100644 > > --- a/mm/hugetlb_vmemmap.c > > +++ b/mm/hugetlb_vmemmap.c > > @@ -15,6 +15,7 @@ > > #include <asm/pgalloc.h> > > #include <asm/tlbflush.h> > > #include "hugetlb_vmemmap.h" > > +#include "internal.h" > > > > /** > > * struct vmemmap_remap_walk - walk vmemmap page table > > @@ -227,14 +228,37 @@ static inline void free_vmemmap_page(struct page *page) > > } > > > > /* Free a list of the vmemmap pages */ > > -static void free_vmemmap_page_list(struct list_head *list) > > +static void free_vmemmap_pages(struct list_head *list) > > { > > struct page *page, *next; > > > > + list_for_each_entry_safe(page, next, list, lru) > > + free_vmemmap_page(page); > > +} > > + > > +/* > > + * Free a list of vmemmap pages but skip per-cpu free list of buddy > > + * allocator. > > + */ > > +static void free_vmemmap_pages_nonpcp(struct list_head *list) > > +{ > > + struct zone *zone; > > + struct page *page, *next; > > + > > + if (list_empty(list)) > > + return; > > + > > + zone = page_zone(list_first_entry(list, struct page, lru)); > > + zone_pcp_disable(zone); > > list_for_each_entry_safe(page, next, list, lru) { > > - list_del(&page->lru); > > + if (zone != page_zone(page)) { > > + zone_pcp_enable(zone); > > + zone = page_zone(page); > > + zone_pcp_disable(zone); > > + } > > free_vmemmap_page(page); > > } > > + zone_pcp_enable(zone); > > } > > > > static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, > > @@ -244,12 +268,28 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, > > * Remap the tail pages as read-only to catch illegal write operation > > * to the tail pages. > > */ > > - pgprot_t pgprot = PAGE_KERNEL_RO; > > + pgprot_t pgprot = addr == walk->reuse_addr ? PAGE_KERNEL : PAGE_KERNEL_RO; > > pte_t entry = mk_pte(walk->reuse_page, pgprot); > > struct page *page = pte_page(*pte); > > > > list_add_tail(&page->lru, walk->vmemmap_pages); > > set_pte_at(&init_mm, addr, pte, entry); > > + > > + if (unlikely(addr == walk->reuse_addr)) { > > + void *old = page_to_virt(page); > > + > > + /* Remove it from the vmemmap_pages list to avoid being freed. */ > > + list_del(&walk->reuse_page->lru); > > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > > + /* > > + * If we reach here meaning the system is in a very early > > + * initialization stage, anyone should not access struct page. > > + * However, if there is something unexpected, the head struct > > + * page most likely to be written (usually ->_refcount). Using > > + * BUG_ON() to catch this unexpected case. 
> > + */ > > + BUG_ON(memcmp(old, (void *)addr, sizeof(struct page))); > > + } > > } > > > > /* > > @@ -298,7 +338,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr, > > * to remap. > > * @end: end address of the vmemmap virtual address range that we want to > > * remap. > > - * @reuse: reuse address. > > + * @reuse: reuse address. If @reuse is equal to @start, it means the page > > + * frame which @reuse address is mapped to will be replaced with a > > + * new page frame and the previous page frame will be freed, this > > + * is to reduce memory fragment. > > * > > * Return: %0 on success, negative error code otherwise. > > */ > > @@ -319,14 +362,26 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, > > * (see more details from the vmemmap_pte_range()): > > * > > * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE) > > - * should be continuous. > > + * should be continuous or @start is equal to @reuse. > > * - The @reuse address is part of the range [@reuse, @end) that we are > > * walking which is passed to vmemmap_remap_range(). > > * - The @reuse address is the first in the complete range. > > * > > * So we need to make sure that @start and @reuse meet the above rules. > > */ > > - BUG_ON(start - reuse != PAGE_SIZE); > > + BUG_ON(start - reuse != PAGE_SIZE && start != reuse); > > + > > + if (unlikely(reuse == start)) { > > + int nid = page_to_nid((struct page *)start); > > + gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | > > + __GFP_NOWARN; > > + > > + walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0); > > + if (walk.reuse_page) { > > + copy_page(page_to_virt(walk.reuse_page), (void *)reuse); > > + list_add(&walk.reuse_page->lru, &vmemmap_pages); > > + } > > + } > > > > mmap_read_lock(&init_mm); > > ret = vmemmap_remap_range(reuse, end, &walk); > > @@ -348,7 +403,10 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, > > } > > mmap_read_unlock(&init_mm); > > > > - free_vmemmap_page_list(&vmemmap_pages); > > + if (unlikely(reuse == start)) > > + free_vmemmap_pages_nonpcp(&vmemmap_pages); > > + else > > + free_vmemmap_pages(&vmemmap_pages); > > > > return ret; > > } > > @@ -512,6 +570,22 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h > > return true; > > } > > > > +/* > > + * Control if the page frame which the address of the head vmemmap associated > > + * with a HugeTLB page is mapped to should be replaced with a new page. The > > + * vmemmap pages are usually mapped with huge PMD mapping, the head vmemmap > > + * page frames is best freed to the buddy allocator once at an initial stage > > + * of system booting to reduce memory fragment. > > + */ > > +static bool vmemmap_remap_head __ro_after_init = true; > > + > > +static int __init vmemmap_remap_head_init(void) > > +{ > > + vmemmap_remap_head = false; > > + return 0; > > +} > > +core_initcall(vmemmap_remap_head_init); > > + > > /** > > * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages. > > * @h: struct hstate. > > @@ -537,6 +611,19 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head) > > vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE; > > > > /* > > + * The vmemmap pages are usually mapped with huge PMD mapping. If the > > + * head vmemmap page is not freed to the buddy allocator, then those > > + * freed tail vmemmap pages cannot be merged into a big order chunk. 
> > + * The head vmemmap page frame can be replaced with a new allocated > > + * page and be freed to the buddy allocator, then those freed vmemmmap > > + * pages have the opportunity to be merged into larger contiguous pages > > + * to reduce memory fragment. vmemmap_remap_free() will do this if > > + * @vmemmap_remap_free is equal to @vmemmap_reuse. > > + */ > > + if (unlikely(vmemmap_remap_head)) > > + vmemmap_start = vmemmap_reuse; > > + > > Maybe it doesn't need to strictly early init vs late init. > > I wonder if we can still make this trick late-stage but when the head struct page is > aligned to PAGE_SIZE / sizeof(struct page), and when it is it means that we are safe > to replace the head vmemmap page provided we would only be covering this hugetlb page > data? Unless I am missing something and the check wouldn't be enough? > > A quick check with this snip added: > > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c > index 2b97df8115fe..06e028734b1e 100644 > --- a/mm/hugetlb_vmemmap.c > +++ b/mm/hugetlb_vmemmap.c > @@ -331,7 +331,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, > }; > gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN; > int nid = page_to_nid((struct page *)start); > - struct page *page; > + struct page *page = NULL; > > /* > * Allocate a new head vmemmap page to avoid breaking a contiguous > @@ -341,7 +341,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, > * more allocations of hugepages. Fallback to the currently > * mapped head page in case should it fail to allocate. > */ > - page = alloc_pages_node(nid, gfp_mask, 0); > + if (IS_ALIGNED(start, PAGE_SIZE)) > + page = alloc_pages_node(nid, gfp_mask, 0); > walk.head_page = page; > > /* > > > ... and it looks that it can still be effective just not as much, more or less as expected? > > # with 2M hugepages > > Before: > > Node 0, zone Normal, type Movable 76 28 10 4 1 0 > 0 0 1 1 15568 > > After (allocated 32400): > > Node 0, zone Normal, type Movable 135 328 219 198 155 106 > 72 41 23 0 0 > > Compared to my original patch where there weren't any pages left in the order-0..order-2: > > Node 0, zone Normal, type Movable 0 1 0 70 > 106 91 78 48 17 0 0 > > But still much better that without any of this: > > Node 0, zone Normal, type Movable 32174 31999 31642 104 > 58 24 16 4 2 0 0 >
On 8/3/22 11:44, Muchun Song wrote:
> On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
>> On 8/3/22 05:11, Muchun Song wrote:
>>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
>>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
>>>> freed back to page allocator is as following: for a 2M hugetlb page it
>>>> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
>>>> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap
>>>> pages. Essentially, that means that it breaks the first 4K of a
>>>> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
>>>> 16M (for 1G hugetlb pages). For this reason the memory that it's free
>>>> back to page allocator cannot be used for hugetlb to allocate huge pages
>>>> of the same size, but rather only of a smaller huge page size:
>>>>
>>>
>>> Hi Joao,
>>>
>>> Thanks for your work on this. I admit you are right. The current mechanism
>>> prevented the freed vmemmap pages from being mergerd into a potential
>>> contiguous page. Allocating a new head page is straightforward approach,
>>> however, it is very dangerous at runtime after system booting up. Why
>>> dangerous? Because you should first 1) copy the content from the head vmemmap
>>> page to the targeted (newly allocated) page, and then 2) change the PTE
>>> entry to the new page. However, the content (especially the refcount) of the
>>> old head vmemmmap page could be changed elsewhere (e.g. other modules)
>>> between the step 1) and 2). Eventually, the new allocated vmemmap page is
>>> corrupted. Unluckily, we don't have an easy approach to prevent it.
>>>
>> OK, I see what I missed. You mean the refcount (or any other data) on the
>> preceeding struct pages to this head struct page unrelated to the hugetlb
>> page being remapped. Meaning when the first struct page in the old vmemmap
>> page is *not* the head page we are trying to remap right?
>>
>> See further below in your patch but I wonder if we could actually check this
>> from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be
>> checking that this head page is the first of struct page in the vmemmap page
>> that we are still safe to remap? If so, the patch could be simpler more
>> like mine, without the special freeing path you added below.
>>
>> If I'm right, see at the end.
>
> I am not sure we are on the same page (it seems that we are not after I saw your
> below patch).

Even though I misunderstood you, it might still be a possible scenario.

> So let me make it become more clarified.
>

Thanks

> CPU0:                                          CPU1:
>
> vmemmap_remap_free(start, end, reuse)
>   // alloc a new page used to be the head vmemmap page
>   page = alloc_pages_node();
>
>   memcpy(page_address(page), reuse, PAGE_SIZE);
>   // Now the @reuse address is mapped to the original
>   // page frame. So the change will be reflected on the
>   // original page frame.
>                                                get_page(reuse);
>   vmemmap_remap_pte();
>   // remap to the above new allocated page
>   set_pte_at();
>
>   flush_tlb_kernel_range();

note-to-self: totally missed to change the flush_tlb_kernel_range() to
include the full range.

>   // Now the @reuse address is mapped to the new allocated
>   // page frame. So the change will be reflected on the
>   // new page frame and it is corrupted.
>                                                put_page(reuse);
>
> So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0, however
> it is not easy.
>

OK, I understand what you mean now. However, I am trying to follow whether
this race is actually possible.

Note that, given your previous answer above, I am assuming in your race
scenario that the vmemmap page only ever stores metadata (struct page)
related to the hugetlb page currently being remapped. If this assumption is
wrong, then your race would be possible (but it wouldn't be from a
get_page() on the reuse_addr).

So, how would we get into doing a get_page() on the head page that we are
remapping (and its put_page() for that matter) from somewhere else ...
considering we are at prep_new_huge_page() when we call vmemmap_remap_free()
and hence ... we already got it from the page allocator ... but hugetlb
hasn't yet finished initializing the head page metadata. Hence it isn't yet
visible for someone to grab via e.g. demote/page-fault/migration/etc.?

On the hugetlb freeing path (vmemmap_remap_alloc) we just keep the already
mapped head page as is, and we will not be allocating a new one.

Perhaps I am missing something obvious.

> Thanks.
>
>>
>>> I also thought of solving this issue, but I didn't find any easy way to
>>> solve it after system booting up. However, it is possible at system boot
>>> time. Because if the system is in a very early initialization stage,
>>> anyone should not access struct page. I have implemented it a mounth ago.
>>> I didn't send it out since some additional preparation work is required.
>>> The main preparation is to move the allocation of HugeTLB to a very
>>> early initialization stage (I want to put it into pure_initcall). Because
>>> the allocation of HugeTLB is parsed from cmdline whose logic is very
>>> complex, I have sent a patch to cleanup it [1]. After this cleanup, it
>>> will be easy to move the allocation to an early stage. Sorry for that
>>> I am busy lately, I didn't have time to send a new version. But I will
>>> send it ASAP.
>>>
>>> [1] https://lore.kernel.org/all/20220616071827.3480-1-songmuchun@bytedance.com/
>>>
>>> The following diff is the main work to remap the head vmemmap page to
>>> a new allocated page.
>>>
>> I like how you use walk::reuse_addr to tell it's an head page.
>> I didn't find my version as clean with passing a boolean to the remap_pte().
>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index 20f414c0379f..71f2d7335e6f 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -15,6 +15,7 @@
>>>  #include <asm/pgalloc.h>
>>>  #include <asm/tlbflush.h>
>>>  #include "hugetlb_vmemmap.h"
>>> +#include "internal.h"
>>>
>>>  /**
>>>   * struct vmemmap_remap_walk - walk vmemmap page table
>>> @@ -227,14 +228,37 @@ static inline void free_vmemmap_page(struct page *page)
>>>  }
>>>
>>>  /* Free a list of the vmemmap pages */
>>> -static void free_vmemmap_page_list(struct list_head *list)
>>> +static void free_vmemmap_pages(struct list_head *list)
>>>  {
>>>  	struct page *page, *next;
>>>
>>> +	list_for_each_entry_safe(page, next, list, lru)
>>> +		free_vmemmap_page(page);
>>> +}
>>> +
>>> +/*
>>> + * Free a list of vmemmap pages but skip per-cpu free list of buddy
>>> + * allocator.
>>> + */ >>> +static void free_vmemmap_pages_nonpcp(struct list_head *list) >>> +{ >>> + struct zone *zone; >>> + struct page *page, *next; >>> + >>> + if (list_empty(list)) >>> + return; >>> + >>> + zone = page_zone(list_first_entry(list, struct page, lru)); >>> + zone_pcp_disable(zone); >>> list_for_each_entry_safe(page, next, list, lru) { >>> - list_del(&page->lru); >>> + if (zone != page_zone(page)) { >>> + zone_pcp_enable(zone); >>> + zone = page_zone(page); >>> + zone_pcp_disable(zone); >>> + } >>> free_vmemmap_page(page); >>> } >>> + zone_pcp_enable(zone); >>> } >>> >>> static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, >>> @@ -244,12 +268,28 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, >>> * Remap the tail pages as read-only to catch illegal write operation >>> * to the tail pages. >>> */ >>> - pgprot_t pgprot = PAGE_KERNEL_RO; >>> + pgprot_t pgprot = addr == walk->reuse_addr ? PAGE_KERNEL : PAGE_KERNEL_RO; >>> pte_t entry = mk_pte(walk->reuse_page, pgprot); >>> struct page *page = pte_page(*pte); >>> >>> list_add_tail(&page->lru, walk->vmemmap_pages); >>> set_pte_at(&init_mm, addr, pte, entry); >>> + >>> + if (unlikely(addr == walk->reuse_addr)) { >>> + void *old = page_to_virt(page); >>> + >>> + /* Remove it from the vmemmap_pages list to avoid being freed. */ >>> + list_del(&walk->reuse_page->lru); >>> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); >>> + /* >>> + * If we reach here meaning the system is in a very early >>> + * initialization stage, anyone should not access struct page. >>> + * However, if there is something unexpected, the head struct >>> + * page most likely to be written (usually ->_refcount). Using >>> + * BUG_ON() to catch this unexpected case. >>> + */ >>> + BUG_ON(memcmp(old, (void *)addr, sizeof(struct page))); >>> + } >>> } >>> >>> /* >>> @@ -298,7 +338,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr, >>> * to remap. >>> * @end: end address of the vmemmap virtual address range that we want to >>> * remap. >>> - * @reuse: reuse address. >>> + * @reuse: reuse address. If @reuse is equal to @start, it means the page >>> + * frame which @reuse address is mapped to will be replaced with a >>> + * new page frame and the previous page frame will be freed, this >>> + * is to reduce memory fragment. >>> * >>> * Return: %0 on success, negative error code otherwise. >>> */ >>> @@ -319,14 +362,26 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, >>> * (see more details from the vmemmap_pte_range()): >>> * >>> * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE) >>> - * should be continuous. >>> + * should be continuous or @start is equal to @reuse. >>> * - The @reuse address is part of the range [@reuse, @end) that we are >>> * walking which is passed to vmemmap_remap_range(). >>> * - The @reuse address is the first in the complete range. >>> * >>> * So we need to make sure that @start and @reuse meet the above rules. 
>>> */ >>> - BUG_ON(start - reuse != PAGE_SIZE); >>> + BUG_ON(start - reuse != PAGE_SIZE && start != reuse); >>> + >>> + if (unlikely(reuse == start)) { >>> + int nid = page_to_nid((struct page *)start); >>> + gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | >>> + __GFP_NOWARN; >>> + >>> + walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0); >>> + if (walk.reuse_page) { >>> + copy_page(page_to_virt(walk.reuse_page), (void *)reuse); >>> + list_add(&walk.reuse_page->lru, &vmemmap_pages); >>> + } >>> + } >>> >>> mmap_read_lock(&init_mm); >>> ret = vmemmap_remap_range(reuse, end, &walk); >>> @@ -348,7 +403,10 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, >>> } >>> mmap_read_unlock(&init_mm); >>> >>> - free_vmemmap_page_list(&vmemmap_pages); >>> + if (unlikely(reuse == start)) >>> + free_vmemmap_pages_nonpcp(&vmemmap_pages); >>> + else >>> + free_vmemmap_pages(&vmemmap_pages); >>> >>> return ret; >>> } >>> @@ -512,6 +570,22 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h >>> return true; >>> } >>> >>> +/* >>> + * Control if the page frame which the address of the head vmemmap associated >>> + * with a HugeTLB page is mapped to should be replaced with a new page. The >>> + * vmemmap pages are usually mapped with huge PMD mapping, the head vmemmap >>> + * page frames is best freed to the buddy allocator once at an initial stage >>> + * of system booting to reduce memory fragment. >>> + */ >>> +static bool vmemmap_remap_head __ro_after_init = true; >>> + >>> +static int __init vmemmap_remap_head_init(void) >>> +{ >>> + vmemmap_remap_head = false; >>> + return 0; >>> +} >>> +core_initcall(vmemmap_remap_head_init); >>> + >>> /** >>> * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages. >>> * @h: struct hstate. >>> @@ -537,6 +611,19 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head) >>> vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE; >>> >>> /* >>> + * The vmemmap pages are usually mapped with huge PMD mapping. If the >>> + * head vmemmap page is not freed to the buddy allocator, then those >>> + * freed tail vmemmap pages cannot be merged into a big order chunk. >>> + * The head vmemmap page frame can be replaced with a new allocated >>> + * page and be freed to the buddy allocator, then those freed vmemmmap >>> + * pages have the opportunity to be merged into larger contiguous pages >>> + * to reduce memory fragment. vmemmap_remap_free() will do this if >>> + * @vmemmap_remap_free is equal to @vmemmap_reuse. >>> + */ >>> + if (unlikely(vmemmap_remap_head)) >>> + vmemmap_start = vmemmap_reuse; >>> + >> >> Maybe it doesn't need to strictly early init vs late init. >> >> I wonder if we can still make this trick late-stage but when the head struct page is >> aligned to PAGE_SIZE / sizeof(struct page), and when it is it means that we are safe >> to replace the head vmemmap page provided we would only be covering this hugetlb page >> data? Unless I am missing something and the check wouldn't be enough? 
>> >> A quick check with this snip added: >> >> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c >> index 2b97df8115fe..06e028734b1e 100644 >> --- a/mm/hugetlb_vmemmap.c >> +++ b/mm/hugetlb_vmemmap.c >> @@ -331,7 +331,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, >> }; >> gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN; >> int nid = page_to_nid((struct page *)start); >> - struct page *page; >> + struct page *page = NULL; >> >> /* >> * Allocate a new head vmemmap page to avoid breaking a contiguous >> @@ -341,7 +341,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, >> * more allocations of hugepages. Fallback to the currently >> * mapped head page in case should it fail to allocate. >> */ >> - page = alloc_pages_node(nid, gfp_mask, 0); >> + if (IS_ALIGNED(start, PAGE_SIZE)) >> + page = alloc_pages_node(nid, gfp_mask, 0); >> walk.head_page = page; >> >> /* >> >> >> ... and it looks that it can still be effective just not as much, more or less as expected? >> >> # with 2M hugepages >> >> Before: >> >> Node 0, zone Normal, type Movable 76 28 10 4 1 0 >> 0 0 1 1 15568 >> >> After (allocated 32400): >> >> Node 0, zone Normal, type Movable 135 328 219 198 155 106 >> 72 41 23 0 0 >> >> Compared to my original patch where there weren't any pages left in the order-0..order-2: >> >> Node 0, zone Normal, type Movable 0 1 0 70 >> 106 91 78 48 17 0 0 >> >> But still much better that without any of this: >> >> Node 0, zone Normal, type Movable 32174 31999 31642 104 >> 58 24 16 4 2 0 0 >>
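For readers following along, the window Joao is reasoning about sits roughly
here in the allocation path (a reconstruction, simplified from kernels of
that era; the exact call chain varies by version):

/*
 * alloc_fresh_huge_page()
 *   alloc_buddy_huge_page()        // head refcount = 1, tails = 0
 *   prep_new_huge_page()
 *     hugetlb_vmemmap_optimize()   // -> vmemmap_remap_free()
 *   ...the page is not yet on the hugetlb free lists at this point.
 *
 * Joao's question: with the page not yet visible as a hugetlb page, who
 * can legitimately get_page() the head here?  Muchun answers later in
 * the thread: pfn-based speculative users (GUP, memory failure).
 */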
On 08/03/22 18:44, Muchun Song wrote:
> On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> > On 8/3/22 05:11, Muchun Song wrote:
> > > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> > >> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> > >> freed back to page allocator is as following: for a 2M hugetlb page it
> > >> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> > >> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap
> > >> pages. Essentially, that means that it breaks the first 4K of a
> > >> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> > >> 16M (for 1G hugetlb pages). For this reason the memory that it's free
> > >> back to page allocator cannot be used for hugetlb to allocate huge pages
> > >> of the same size, but rather only of a smaller huge page size:
> > >>
> > >
> > > Hi Joao,
> > >
> > > Thanks for your work on this. I admit you are right. The current mechanism
> > > prevented the freed vmemmap pages from being mergerd into a potential
> > > contiguous page. Allocating a new head page is straightforward approach,
> > > however, it is very dangerous at runtime after system booting up. Why
> > > dangerous? Because you should first 1) copy the content from the head vmemmap
> > > page to the targeted (newly allocated) page, and then 2) change the PTE
> > > entry to the new page. However, the content (especially the refcount) of the
> > > old head vmemmmap page could be changed elsewhere (e.g. other modules)
> > > between the step 1) and 2). Eventually, the new allocated vmemmap page is
> > > corrupted. Unluckily, we don't have an easy approach to prevent it.
> > >
> > OK, I see what I missed. You mean the refcount (or any other data) on the
> > preceeding struct pages to this head struct page unrelated to the hugetlb
> > page being remapped. Meaning when the first struct page in the old vmemmap
> > page is *not* the head page we are trying to remap right?
> >
> > See further below in your patch but I wonder if we could actually check this
> > from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be
> > checking that this head page is the first of struct page in the vmemmap page
> > that we are still safe to remap? If so, the patch could be simpler more
> > like mine, without the special freeing path you added below.
> >
> > If I'm right, see at the end.
>
> I am not sure we are on the same page (it seems that we are not after I saw your
> below patch). So let me make it become more clarified.

Thanks Muchun! I told Joao that you would be the expert in this area, and I
was correct.

> CPU0:                                          CPU1:
>
> vmemmap_remap_free(start, end, reuse)
>   // alloc a new page used to be the head vmemmap page
>   page = alloc_pages_node();
>
>   memcpy(page_address(page), reuse, PAGE_SIZE);
>   // Now the @reuse address is mapped to the original
>   // page frame. So the change will be reflected on the
>   // original page frame.
>                                                get_page(reuse);

Here is a thought. This code gets called early after allocating a new
hugetlb page. This new compound page has a ref count of 1 on the head page
and 0 on all the tail pages. If the ref count were 0 on the head page,
get_page() would not succeed.

I cannot think of a reason why we NEED to have a ref count of 1 on the head
page. It does make it more convenient to simply call put_page() on the newly
initialized hugetlb page and have the page be added to the hugetlb free
lists, but this could easily be modified.
Matthew Wilcox recently pointed me at this:

https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/T/#m98fb9f9bd476155b4951339da51a0887b2377476

That would only work for hugetlb pages allocated from buddy. For gigantic
pages, we manually 'freeze' (zero the ref count of) tail pages and check for
an unexpected increased ref count. We could do the same with the head page.

Would having a zero ref count on the head page eliminate this race?
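A rough sketch of what Mike is suggesting, in kernel-style pseudocode.
page_ref_freeze()/page_ref_unfreeze() are existing helpers, but whether every
racing path backs off from a zero refcount is exactly the open question here,
and this fragment is illustrative only, not a tested patch:

static int remap_head_frozen(struct page *head, struct page *new_page,
			     unsigned long addr, pte_t *pte)
{
	/* Freeze 1 -> 0; fails if anyone already holds an extra reference. */
	if (!page_ref_freeze(head, 1))
		return -EBUSY;	/* fall back to reusing the head in place */

	/* Speculative get_page_unless_zero() users now fail, closing the
	 * copy-vs-remap window described earlier in the thread. */
	copy_page(page_to_virt(new_page), page_to_virt(head));
	set_pte_at(&init_mm, addr, pte, mk_pte(new_page, PAGE_KERNEL));
	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

	/* The hugetlb page's reference now lives on the new frame. */
	page_ref_unfreeze(new_page, 1);
	return 0;
}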
On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote: > > > On 8/3/22 11:44, Muchun Song wrote: > > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote: > >> On 8/3/22 05:11, Muchun Song wrote: > >>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote: > >>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is > >>>> freed back to page allocator is as following: for a 2M hugetlb page it > >>>> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap > >>>> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap > >>>> pages. Essentially, that means that it breaks the first 4K of a > >>>> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or > >>>> 16M (for 1G hugetlb pages). For this reason the memory that it's free > >>>> back to page allocator cannot be used for hugetlb to allocate huge pages > >>>> of the same size, but rather only of a smaller huge page size: > >>>> > >>> > >>> Hi Joao, > >>> > >>> Thanks for your work on this. I admit you are right. The current mechanism > >>> prevented the freed vmemmap pages from being mergerd into a potential > >>> contiguous page. Allocating a new head page is straightforward approach, > >>> however, it is very dangerous at runtime after system booting up. Why > >>> dangerous? Because you should first 1) copy the content from the head vmemmap > >>> page to the targeted (newly allocated) page, and then 2) change the PTE > >>> entry to the new page. However, the content (especially the refcount) of the > >>> old head vmemmmap page could be changed elsewhere (e.g. other modules) > >>> between the step 1) and 2). Eventually, the new allocated vmemmap page is > >>> corrupted. Unluckily, we don't have an easy approach to prevent it. > >>> > >> OK, I see what I missed. You mean the refcount (or any other data) on the > >> preceeding struct pages to this head struct page unrelated to the hugetlb > >> page being remapped. Meaning when the first struct page in the old vmemmap > >> page is *not* the head page we are trying to remap right? > >> > >> See further below in your patch but I wonder if we could actually check this > >> from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be checking > >> that this head page is the first of struct page in the vmemmap page that > >> we are still safe to remap? If so, the patch could be simpler more > >> like mine, without the special freeing path you added below. > >> > >> If I'm right, see at the end. > > > > I am not sure we are on the same page (it seems that we are not after I saw your > > below patch). > > Even though I misunderstood you it might still look like a possible scenario. > > > So let me make it become more clarified. > > > Thanks > > > CPU0: CPU1: > > > > vmemmap_remap_free(start, end, reuse) > > // alloc a new page used to be the head vmemmap page > > page = alloc_pages_node(); > > > > memcpy(page_address(page), reuse, PAGE_SIZE); > > // Now the @reuse address is mapped to the original > > // page frame. So the change will be reflected on the > > // original page frame. > > get_page(reuse); > > vmemmap_remap_pte(); > > // remap to the above new allocated page > > set_pte_at(); > > > > flush_tlb_kernel_range(); > > note-to-self: totally missed to change the flush_tlb_kernel_range() to include the full range. > Right. I have noticed that as well. > > // Now the @reuse address is mapped to the new allocated > > // page frame. 
So the change will be reflected on the > > // new page frame and it is corrupted. > > put_page(reuse); > > > > So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0, however > > it is not easy. > > > OK, I understand what you mean now. However, I am trying to follow if this race is > possible? Note that given your previous answer above, I am assuming in your race scenario > that the vmemmap page only ever stores metadata (struct page) related to the hugetlb-page > currently being remapped. If this assumption is wrong, then your race would be possible > (but it wouldn't be from a get_page in the reuse_addr) > > So, how would we get into doing a get_page() on the head-page that we are remapping (and > its put_page() for that matter) from somewhere else ... considering we are at > prep_new_huge_page() when we call vmemmap_remap_free() and hence ... we already got it > from page allocator ... but hugetlb hasn't yet finished initializing the head page > metadata. Hence it isn't yet accounted for someone to grab either e.g. in > demote/page-fault/migration/etc? > As far as I know, at least there are two places which could get the refcount: 1) GUP and 2) memory failure. For 1), you can refer to commit 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849. For 2), I think you can refer to the function __get_unpoison_page(). Both places could grab an extra refcount on the struct page of the HugeTLB page being processed. Thanks. > On the hugetlb freeing path (vmemmap_remap_alloc) there we just keep the already mapped > head page as is and we will not be allocating a new one. > > Perhaps I am missing something obvious. > > > Thanks. > > > >> > >>> I also thought of solving this issue, but I didn't find any easy way to > >>> solve it after system booting up. However, it is possible at system boot > >>> time. Because if the system is in a very early initialization stage, > >>> anyone should not access struct page. I have implemented it a mounth ago. > >>> I didn't send it out since some additional preparation work is required. > >>> The main preparation is to move the allocation of HugeTLB to a very > >>> early initialization stage (I want to put it into pure_initcall). Because > >>> the allocation of HugeTLB is parsed from cmdline whose logic is very > >>> complex, I have sent a patch to cleanup it [1]. After this cleanup, it > >>> will be easy to move the allocation to an early stage. Sorry for that > >>> I am busy lately, I didn't have time to send a new version. But I will > >>> send it ASAP. > >>> > >>> [1] https://lore.kernel.org/all/20220616071827.3480-1-songmuchun@bytedance.com/ > >>> > >>> The following diff is the main work to remap the head vmemmap page to > >>> a new allocated page. > >>> > >> I like how you use walk::reuse_addr to tell it's an head page. > >> I didn't find my version as clean with passing a boolean to the remap_pte().
> >> > >>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c > >>> index 20f414c0379f..71f2d7335e6f 100644 > >>> --- a/mm/hugetlb_vmemmap.c > >>> +++ b/mm/hugetlb_vmemmap.c > >>> @@ -15,6 +15,7 @@ > >>> #include <asm/pgalloc.h> > >>> #include <asm/tlbflush.h> > >>> #include "hugetlb_vmemmap.h" > >>> +#include "internal.h" > >>> > >>> /** > >>> * struct vmemmap_remap_walk - walk vmemmap page table > >>> @@ -227,14 +228,37 @@ static inline void free_vmemmap_page(struct page *page) > >>> } > >>> > >>> /* Free a list of the vmemmap pages */ > >>> -static void free_vmemmap_page_list(struct list_head *list) > >>> +static void free_vmemmap_pages(struct list_head *list) > >>> { > >>> struct page *page, *next; > >>> > >>> + list_for_each_entry_safe(page, next, list, lru) > >>> + free_vmemmap_page(page); > >>> +} > >>> + > >>> +/* > >>> + * Free a list of vmemmap pages but skip per-cpu free list of buddy > >>> + * allocator. > >>> + */ > >>> +static void free_vmemmap_pages_nonpcp(struct list_head *list) > >>> +{ > >>> + struct zone *zone; > >>> + struct page *page, *next; > >>> + > >>> + if (list_empty(list)) > >>> + return; > >>> + > >>> + zone = page_zone(list_first_entry(list, struct page, lru)); > >>> + zone_pcp_disable(zone); > >>> list_for_each_entry_safe(page, next, list, lru) { > >>> - list_del(&page->lru); > >>> + if (zone != page_zone(page)) { > >>> + zone_pcp_enable(zone); > >>> + zone = page_zone(page); > >>> + zone_pcp_disable(zone); > >>> + } > >>> free_vmemmap_page(page); > >>> } > >>> + zone_pcp_enable(zone); > >>> } > >>> > >>> static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, > >>> @@ -244,12 +268,28 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr, > >>> * Remap the tail pages as read-only to catch illegal write operation > >>> * to the tail pages. > >>> */ > >>> - pgprot_t pgprot = PAGE_KERNEL_RO; > >>> + pgprot_t pgprot = addr == walk->reuse_addr ? PAGE_KERNEL : PAGE_KERNEL_RO; > >>> pte_t entry = mk_pte(walk->reuse_page, pgprot); > >>> struct page *page = pte_page(*pte); > >>> > >>> list_add_tail(&page->lru, walk->vmemmap_pages); > >>> set_pte_at(&init_mm, addr, pte, entry); > >>> + > >>> + if (unlikely(addr == walk->reuse_addr)) { > >>> + void *old = page_to_virt(page); > >>> + > >>> + /* Remove it from the vmemmap_pages list to avoid being freed. */ > >>> + list_del(&walk->reuse_page->lru); > >>> + flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > >>> + /* > >>> + * If we reach here meaning the system is in a very early > >>> + * initialization stage, anyone should not access struct page. > >>> + * However, if there is something unexpected, the head struct > >>> + * page most likely to be written (usually ->_refcount). Using > >>> + * BUG_ON() to catch this unexpected case. > >>> + */ > >>> + BUG_ON(memcmp(old, (void *)addr, sizeof(struct page))); > >>> + } > >>> } > >>> > >>> /* > >>> @@ -298,7 +338,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr, > >>> * to remap. > >>> * @end: end address of the vmemmap virtual address range that we want to > >>> * remap. > >>> - * @reuse: reuse address. > >>> + * @reuse: reuse address. If @reuse is equal to @start, it means the page > >>> + * frame which @reuse address is mapped to will be replaced with a > >>> + * new page frame and the previous page frame will be freed, this > >>> + * is to reduce memory fragment. > >>> * > >>> * Return: %0 on success, negative error code otherwise. 
> >>> */ > >>> @@ -319,14 +362,26 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, > >>> * (see more details from the vmemmap_pte_range()): > >>> * > >>> * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE) > >>> - * should be continuous. > >>> + * should be continuous or @start is equal to @reuse. > >>> * - The @reuse address is part of the range [@reuse, @end) that we are > >>> * walking which is passed to vmemmap_remap_range(). > >>> * - The @reuse address is the first in the complete range. > >>> * > >>> * So we need to make sure that @start and @reuse meet the above rules. > >>> */ > >>> - BUG_ON(start - reuse != PAGE_SIZE); > >>> + BUG_ON(start - reuse != PAGE_SIZE && start != reuse); > >>> + > >>> + if (unlikely(reuse == start)) { > >>> + int nid = page_to_nid((struct page *)start); > >>> + gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | > >>> + __GFP_NOWARN; > >>> + > >>> + walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0); > >>> + if (walk.reuse_page) { > >>> + copy_page(page_to_virt(walk.reuse_page), (void *)reuse); > >>> + list_add(&walk.reuse_page->lru, &vmemmap_pages); > >>> + } > >>> + } > >>> > >>> mmap_read_lock(&init_mm); > >>> ret = vmemmap_remap_range(reuse, end, &walk); > >>> @@ -348,7 +403,10 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, > >>> } > >>> mmap_read_unlock(&init_mm); > >>> > >>> - free_vmemmap_page_list(&vmemmap_pages); > >>> + if (unlikely(reuse == start)) > >>> + free_vmemmap_pages_nonpcp(&vmemmap_pages); > >>> + else > >>> + free_vmemmap_pages(&vmemmap_pages); > >>> > >>> return ret; > >>> } > >>> @@ -512,6 +570,22 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h > >>> return true; > >>> } > >>> > >>> +/* > >>> + * Control if the page frame which the address of the head vmemmap associated > >>> + * with a HugeTLB page is mapped to should be replaced with a new page. The > >>> + * vmemmap pages are usually mapped with huge PMD mapping, the head vmemmap > >>> + * page frames is best freed to the buddy allocator once at an initial stage > >>> + * of system booting to reduce memory fragment. > >>> + */ > >>> +static bool vmemmap_remap_head __ro_after_init = true; > >>> + > >>> +static int __init vmemmap_remap_head_init(void) > >>> +{ > >>> + vmemmap_remap_head = false; > >>> + return 0; > >>> +} > >>> +core_initcall(vmemmap_remap_head_init); > >>> + > >>> /** > >>> * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages. > >>> * @h: struct hstate. > >>> @@ -537,6 +611,19 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head) > >>> vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE; > >>> > >>> /* > >>> + * The vmemmap pages are usually mapped with huge PMD mapping. If the > >>> + * head vmemmap page is not freed to the buddy allocator, then those > >>> + * freed tail vmemmap pages cannot be merged into a big order chunk. > >>> + * The head vmemmap page frame can be replaced with a new allocated > >>> + * page and be freed to the buddy allocator, then those freed vmemmmap > >>> + * pages have the opportunity to be merged into larger contiguous pages > >>> + * to reduce memory fragment. vmemmap_remap_free() will do this if > >>> + * @vmemmap_remap_free is equal to @vmemmap_reuse. > >>> + */ > >>> + if (unlikely(vmemmap_remap_head)) > >>> + vmemmap_start = vmemmap_reuse; > >>> + > >> > >> Maybe it doesn't need to strictly early init vs late init. 
> >> > >> I wonder if we can still make this trick late-stage but when the head struct page is > >> aligned to PAGE_SIZE / sizeof(struct page), and when it is it means that we are safe > >> to replace the head vmemmap page provided we would only be covering this hugetlb page > >> data? Unless I am missing something and the check wouldn't be enough? > >> > >> A quick check with this snip added: > >> > >> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c > >> index 2b97df8115fe..06e028734b1e 100644 > >> --- a/mm/hugetlb_vmemmap.c > >> +++ b/mm/hugetlb_vmemmap.c > >> @@ -331,7 +331,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, > >> }; > >> gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN; > >> int nid = page_to_nid((struct page *)start); > >> - struct page *page; > >> + struct page *page = NULL; > >> > >> /* > >> * Allocate a new head vmemmap page to avoid breaking a contiguous > >> @@ -341,7 +341,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end, > >> * more allocations of hugepages. Fallback to the currently > >> * mapped head page in case should it fail to allocate. > >> */ > >> - page = alloc_pages_node(nid, gfp_mask, 0); > >> + if (IS_ALIGNED(start, PAGE_SIZE)) > >> + page = alloc_pages_node(nid, gfp_mask, 0); > >> walk.head_page = page; > >> > >> /* > >> > >> > >> ... and it looks that it can still be effective just not as much, more or less as expected? > >> > >> # with 2M hugepages > >> > >> Before: > >> > >> Node 0, zone Normal, type Movable 76 28 10 4 1 0 > >> 0 0 1 1 15568 > >> > >> After (allocated 32400): > >> > >> Node 0, zone Normal, type Movable 135 328 219 198 155 106 > >> 72 41 23 0 0 > >> > >> Compared to my original patch where there weren't any pages left in the order-0..order-2: > >> > >> Node 0, zone Normal, type Movable 0 1 0 70 > >> 106 91 78 48 17 0 0 > >> > >> But still much better that without any of this: > >> > >> Node 0, zone Normal, type Movable 32174 31999 31642 104 > >> 58 24 16 4 2 0 0 > >> >
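To spell out what the IS_ALIGNED() check in the snippet above relies on, here is a small sketch, assuming 4K base pages and a 64-byte struct page; the helper name is made up for illustration:

/* 4096 / 64 = 64 struct pages fit in one vmemmap page. */
#define STRUCT_PAGES_PER_VMEMMAP_PAGE	(PAGE_SIZE / sizeof(struct page))

/*
 * Sketch: only attempt the head-page replacement when the head struct
 * page starts its vmemmap page.  A 2M hugetlb page has 512 struct
 * pages, i.e. exactly 8 vmemmap pages, so an aligned head means the
 * frame being replaced backs only this hugetlb page's own metadata.
 */
static bool head_vmemmap_page_replaceable(unsigned long vmemmap_start)
{
	return IS_ALIGNED(vmemmap_start, PAGE_SIZE);
}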
On Wed, Aug 03, 2022 at 03:42:15PM -0700, Mike Kravetz wrote: > On 08/03/22 18:44, Muchun Song wrote: > > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote: > > > On 8/3/22 05:11, Muchun Song wrote: > > > > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote: > > > >> Today with `hugetlb_free_vmemmap=on` the struct page memory that is > > > >> freed back to page allocator is as following: for a 2M hugetlb page it > > > >> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap > > > >> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap > > > >> pages. Essentially, that means that it breaks the first 4K of a > > > >> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or > > > >> 16M (for 1G hugetlb pages). For this reason the memory that it's free > > > >> back to page allocator cannot be used for hugetlb to allocate huge pages > > > >> of the same size, but rather only of a smaller huge page size: > > > >> > > > > > > > > Hi Joao, > > > > > > > > Thanks for your work on this. I admit you are right. The current mechanism > > > > prevented the freed vmemmap pages from being mergerd into a potential > > > > contiguous page. Allocating a new head page is straightforward approach, > > > > however, it is very dangerous at runtime after system booting up. Why > > > > dangerous? Because you should first 1) copy the content from the head vmemmap > > > > page to the targeted (newly allocated) page, and then 2) change the PTE > > > > entry to the new page. However, the content (especially the refcount) of the > > > > old head vmemmmap page could be changed elsewhere (e.g. other modules) > > > > between the step 1) and 2). Eventually, the new allocated vmemmap page is > > > > corrupted. Unluckily, we don't have an easy approach to prevent it. > > > > > > > OK, I see what I missed. You mean the refcount (or any other data) on the > > > preceeding struct pages to this head struct page unrelated to the hugetlb > > > page being remapped. Meaning when the first struct page in the old vmemmap > > > page is *not* the head page we are trying to remap right? > > > > > > See further below in your patch but I wonder if we could actually check this > > > from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be checking > > > that this head page is the first of struct page in the vmemmap page that > > > we are still safe to remap? If so, the patch could be simpler more > > > like mine, without the special freeing path you added below. > > > > > > If I'm right, see at the end. > > > > > > I am not sure we are on the same page (it seems that we are not after I saw your > > below patch). So let me make it become more clarified. > > Thanks Muchun! > I told Joao that you would be the expert in this area and was correct. > Thanks for your trust. > > > > CPU0: CPU1: > > > > vmemmap_remap_free(start, end, reuse) > > // alloc a new page used to be the head vmemmap page > > page = alloc_pages_node(); > > > > memcpy(page_address(page), reuse, PAGE_SIZE); > > // Now the @reuse address is mapped to the original > > // page frame. So the change will be reflected on the > > // original page frame. > > get_page(reuse); > > Here is a thought. > > This code gets called early after allocating a new hugetlb page. This new > compound page has a ref count of 1 on the head page and 0 on all the tail > pages. If the ref count was 0 on the head page, get_page() would not succeed. 
> > I can not think of a reason why we NEED to have a ref count of 1 on the > head page. It does make it more convenient to simply call put_page() on > the newly initialized hugetlb page and have the page be added to the huegtlb > free lists, but this could easily be modified. > > Matthew Willcox recently pointed me at this: > https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/T/#m98fb9f9bd476155b4951339da51a0887b2377476 > > That would only work for hugetlb pages allocated from buddy. For gigantic > pages, we manually 'freeze' (zero ref count) of tail pages and check for > an unexpected increased ref count. We could do the same with the head page. > > Would having zero ref count on the head page eliminate this race? I think most races which try to grab an extra refcount could be avoided in this case. However, I can think of a race related to memory failure, that is, poisoning a page in memory_failure(), which sets PageHWPoison without checking whether the refcount is zero. If we can fix this easily, I think this patch is a good direction. Thanks Mike. > -- > Mike Kravetz > > > vmemmap_remap_pte(); > > // remap to the above new allocated page > > set_pte_at(); > > > > flush_tlb_kernel_range(); > > // Now the @reuse address is mapped to the new allocated > > // page frame. So the change will be reflected on the > > // new page frame and it is corrupted. > > put_page(reuse); > > > > So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0, however > > it is not easy. >
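A commented sketch of the memory_failure() interleaving described above, assuming the refcount-freeze approach; illustrative only, not code from any patch:

/*
 * CPU0 (prep new hugetlb page)          CPU1 (memory_failure())
 *
 * page_ref_freeze(head, 1)
 * memcpy(new, old_head, PAGE_SIZE)
 *                                       SetPageHWPoison(head)
 *                                       (no refcount check, so the
 *                                        freeze does not block it)
 * set_pte_at()   // switch to the copy
 * flush_tlb_kernel_range()
 *
 * The HWPoison bit was set on the old frame after the copy was taken,
 * so it is silently lost in the new head vmemmap page.
 */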
On 8/4/22 08:17, Muchun Song wrote: > On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote: >> >> >> On 8/3/22 11:44, Muchun Song wrote: >>> On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote: >>>> On 8/3/22 05:11, Muchun Song wrote: >>>>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote: >>>>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is >>>>>> freed back to page allocator is as following: for a 2M hugetlb page it >>>>>> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap >>>>>> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap >>>>>> pages. Essentially, that means that it breaks the first 4K of a >>>>>> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or >>>>>> 16M (for 1G hugetlb pages). For this reason the memory that it's free >>>>>> back to page allocator cannot be used for hugetlb to allocate huge pages >>>>>> of the same size, but rather only of a smaller huge page size: >>>>>> >>>>> >>>>> Hi Joao, >>>>> >>>>> Thanks for your work on this. I admit you are right. The current mechanism >>>>> prevented the freed vmemmap pages from being mergerd into a potential >>>>> contiguous page. Allocating a new head page is straightforward approach, >>>>> however, it is very dangerous at runtime after system booting up. Why >>>>> dangerous? Because you should first 1) copy the content from the head vmemmap >>>>> page to the targeted (newly allocated) page, and then 2) change the PTE >>>>> entry to the new page. However, the content (especially the refcount) of the >>>>> old head vmemmmap page could be changed elsewhere (e.g. other modules) >>>>> between the step 1) and 2). Eventually, the new allocated vmemmap page is >>>>> corrupted. Unluckily, we don't have an easy approach to prevent it. >>>>> >>>> OK, I see what I missed. You mean the refcount (or any other data) on the >>>> preceeding struct pages to this head struct page unrelated to the hugetlb >>>> page being remapped. Meaning when the first struct page in the old vmemmap >>>> page is *not* the head page we are trying to remap right? >>>> >>>> See further below in your patch but I wonder if we could actually check this >>>> from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be checking >>>> that this head page is the first of struct page in the vmemmap page that >>>> we are still safe to remap? If so, the patch could be simpler more >>>> like mine, without the special freeing path you added below. >>>> >>>> If I'm right, see at the end. >>> >>> I am not sure we are on the same page (it seems that we are not after I saw your >>> below patch). >> >> Even though I misunderstood you it might still look like a possible scenario. >> >>> So let me make it become more clarified. >>> >> Thanks >> >>> CPU0: CPU1: >>> >>> vmemmap_remap_free(start, end, reuse) >>> // alloc a new page used to be the head vmemmap page >>> page = alloc_pages_node(); >>> >>> memcpy(page_address(page), reuse, PAGE_SIZE); >>> // Now the @reuse address is mapped to the original >>> // page frame. So the change will be reflected on the >>> // original page frame. >>> get_page(reuse); >>> vmemmap_remap_pte(); >>> // remap to the above new allocated page >>> set_pte_at(); >>> >>> flush_tlb_kernel_range(); >> >> note-to-self: totally missed to change the flush_tlb_kernel_range() to include the full range. >> > > Right. I have noticed that as well. > >>> // Now the @reuse address is mapped to the new allocated >>> // page frame. 
So the change will be reflected on the >>> // new page frame and it is corrupted. >>> put_page(reuse); >>> >>> So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0, however >>> it is not easy. >>> >> OK, I understand what you mean now. However, I am trying to follow if this race is >> possible? Note that given your previous answer above, I am assuming in your race scenario >> that the vmemmap page only ever stores metadata (struct page) related to the hugetlb-page >> currently being remapped. If this assumption is wrong, then your race would be possible >> (but it wouldn't be from a get_page in the reuse_addr) >> >> So, how would we get into doing a get_page() on the head-page that we are remapping (and >> its put_page() for that matter) from somewhere else ... considering we are at >> prep_new_huge_page() when we call vmemmap_remap_free() and hence ... we already got it >> from page allocator ... but hugetlb hasn't yet finished initializing the head page >> metadata. Hence it isn't yet accounted for someone to grab either e.g. in >> demote/page-fault/migration/etc? >> > > As I know, at least there are two places which could get the refcount. > 1) GUP and 2) Memoey failure. > So I am aware of the means to grab the page refcount. It's how we would get into such a situation in the first place that I wasn't sure about: > For 1), you can refer to the commit 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849. Good pointer. This one sheds some light too: https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/ I wonder how we would get into a situation of doing a GUP on a user VA referring to a just-allocated *hugetlb* page from buddy (but not yet in the hugetlb free lists), even if temporarily. Unless the page was still sitting in page tables that were about to be torn down. Maybe it is specific to gigantic pages. Hmm > For 2), I think you can refer to the function of __get_unpoison_page(). > Ah, yes. Good point. > Both places could grab an extra refcount to the processing HugeTLB's struct page. > Thanks for the pointers ;)
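For context, a simplified sketch of the lockless GUP pattern being referenced, modeled loosely on the gup-fast walker; details such as READ_ONCE() and compound-page handling are elided:

/*
 * Sketch: lockless GUP takes the reference speculatively and only
 * afterwards revalidates the PTE.  A page freed to buddy and
 * immediately re-allocated (e.g. as a hugetlb head page) can thus
 * see a transient get/put pair from a racing walker of stale page
 * tables.
 */
static int gup_fast_one(pte_t *ptep, pte_t pte)
{
	struct page *page = pte_page(pte);

	if (!get_page_unless_zero(page))	/* speculative grab first */
		return 0;
	if (pte_val(pte) != pte_val(*ptep)) {	/* mapping changed under us? */
		put_page(page);			/* drop the transient ref */
		return 0;
	}
	return 1;
}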
On Thu, Aug 04, 2022 at 10:23:48AM +0100, Joao Martins wrote: > On 8/4/22 08:17, Muchun Song wrote: > > On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote: > >> > >> > >> On 8/3/22 11:44, Muchun Song wrote: > >>> On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote: > >>>> On 8/3/22 05:11, Muchun Song wrote: > >>>>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote: > >>>>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is > >>>>>> freed back to page allocator is as following: for a 2M hugetlb page it > >>>>>> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap > >>>>>> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap > >>>>>> pages. Essentially, that means that it breaks the first 4K of a > >>>>>> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or > >>>>>> 16M (for 1G hugetlb pages). For this reason the memory that it's free > >>>>>> back to page allocator cannot be used for hugetlb to allocate huge pages > >>>>>> of the same size, but rather only of a smaller huge page size: > >>>>>> > >>>>> > >>>>> Hi Joao, > >>>>> > >>>>> Thanks for your work on this. I admit you are right. The current mechanism > >>>>> prevented the freed vmemmap pages from being mergerd into a potential > >>>>> contiguous page. Allocating a new head page is straightforward approach, > >>>>> however, it is very dangerous at runtime after system booting up. Why > >>>>> dangerous? Because you should first 1) copy the content from the head vmemmap > >>>>> page to the targeted (newly allocated) page, and then 2) change the PTE > >>>>> entry to the new page. However, the content (especially the refcount) of the > >>>>> old head vmemmmap page could be changed elsewhere (e.g. other modules) > >>>>> between the step 1) and 2). Eventually, the new allocated vmemmap page is > >>>>> corrupted. Unluckily, we don't have an easy approach to prevent it. > >>>>> > >>>> OK, I see what I missed. You mean the refcount (or any other data) on the > >>>> preceeding struct pages to this head struct page unrelated to the hugetlb > >>>> page being remapped. Meaning when the first struct page in the old vmemmap > >>>> page is *not* the head page we are trying to remap right? > >>>> > >>>> See further below in your patch but I wonder if we could actually check this > >>>> from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be checking > >>>> that this head page is the first of struct page in the vmemmap page that > >>>> we are still safe to remap? If so, the patch could be simpler more > >>>> like mine, without the special freeing path you added below. > >>>> > >>>> If I'm right, see at the end. > >>> > >>> I am not sure we are on the same page (it seems that we are not after I saw your > >>> below patch). > >> > >> Even though I misunderstood you it might still look like a possible scenario. > >> > >>> So let me make it become more clarified. > >>> > >> Thanks > >> > >>> CPU0: CPU1: > >>> > >>> vmemmap_remap_free(start, end, reuse) > >>> // alloc a new page used to be the head vmemmap page > >>> page = alloc_pages_node(); > >>> > >>> memcpy(page_address(page), reuse, PAGE_SIZE); > >>> // Now the @reuse address is mapped to the original > >>> // page frame. So the change will be reflected on the > >>> // original page frame. 
> >>> get_page(reuse); > >>> vmemmap_remap_pte(); > >>> // remap to the above new allocated page > >>> set_pte_at(); > >>> > >>> flush_tlb_kernel_range(); > >> > >> note-to-self: totally missed to change the flush_tlb_kernel_range() to include the full range. > >> > > > > Right. I have noticed that as well. > > > >>> // Now the @reuse address is mapped to the new allocated > >>> // page frame. So the change will be reflected on the > >>> // new page frame and it is corrupted. > >>> put_page(reuse); > >>> > >>> So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0, however > >>> it is not easy. > >>> > >> OK, I understand what you mean now. However, I am trying to follow if this race is > >> possible? Note that given your previous answer above, I am assuming in your race scenario > >> that the vmemmap page only ever stores metadata (struct page) related to the hugetlb-page > >> currently being remapped. If this assumption is wrong, then your race would be possible > >> (but it wouldn't be from a get_page in the reuse_addr) > >> > >> So, how would we get into doing a get_page() on the head-page that we are remapping (and > >> its put_page() for that matter) from somewhere else ... considering we are at > >> prep_new_huge_page() when we call vmemmap_remap_free() and hence ... we already got it > >> from page allocator ... but hugetlb hasn't yet finished initializing the head page > >> metadata. Hence it isn't yet accounted for someone to grab either e.g. in > >> demote/page-fault/migration/etc? > >> > > > > As I know, at least there are two places which could get the refcount. > > 1) GUP and 2) Memoey failure. > > > So I am aware the means to grab the page refcount. It's how we get into such situation > that I wasn't sure how we get to in the first place: > > > For 1), you can refer to the commit 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849. > > Good pointer. This one sheds some light too: > > https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/ > > I wonder we get into a situation of doing a GUP on a user VA referring a > just-allocated *hugetlb* page from buddy (but not yet in hugetlb free lists) even if > temporarily. Unless the page was still siting on a page tables that were about to tear > down. Maybe it is specific to gigantic pages. Hmm > It is not specific to gigantic pages. I think the link above gives a good description of how this could happen to a non-gigantic page. Thanks. > > For 2), I think you can refer to the function of __get_unpoison_page(). > > > Ah, yes. Good point. > > > Both places could grab an extra refcount to the processing HugeTLB's struct page. > > > Thanks for the pointers ;) >
On 08/04/22 16:05, Muchun Song wrote: > On Wed, Aug 03, 2022 at 03:42:15PM -0700, Mike Kravetz wrote: > > On 08/03/22 18:44, Muchun Song wrote: > > > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote: > > > > On 8/3/22 05:11, Muchun Song wrote: > > > > > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote: > > > > I am not sure we are on the same page (it seems that we are not after I saw your > > > > below patch). So let me make it become more clarified. > > > > > > Thanks Muchun! > > > I told Joao that you would be the expert in this area and was correct. > > > > > > > Thanks for your trust. > > > > > > > > > > CPU0: CPU1: > > > > > > > > vmemmap_remap_free(start, end, reuse) > > > > // alloc a new page used to be the head vmemmap page > > > > page = alloc_pages_node(); > > > > > > > > memcpy(page_address(page), reuse, PAGE_SIZE); > > > > // Now the @reuse address is mapped to the original > > > > // page frame. So the change will be reflected on the > > > > // original page frame. > > > > get_page(reuse); > > > > > > Here is a thought. > > > > > > This code gets called early after allocating a new hugetlb page. This new > > > compound page has a ref count of 1 on the head page and 0 on all the tail > > > pages. If the ref count was 0 on the head page, get_page() would not succeed. > > > > > > I can not think of a reason why we NEED to have a ref count of 1 on the > > > head page. It does make it more convenient to simply call put_page() on > > > the newly initialized hugetlb page and have the page be added to the huegtlb > > > free lists, but this could easily be modified. > > > > > > Matthew Willcox recently pointed me at this: > > > https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/T/#m98fb9f9bd476155b4951339da51a0887b2377476 > > > > > > That would only work for hugetlb pages allocated from buddy. For gigantic > > > pages, we manually 'freeze' (zero ref count) of tail pages and check for > > > an unexpected increased ref count. We could do the same with the head page. > > > > > > Would having zero ref count on the head page eliminate this race? > > > > I think most races which try to grab an extra refcount could be avoided > > in this case. > > > > However, I can figure out a race related to memory failure, that is to poison > > a page in memory_failure(), which set PageHWPoison without checking if the > > refcount is zero. If we can fix this easily, I think this patch is a good > > direction. Adding Naoya. Yes, I recall that discussion in the thread, https://lore.kernel.org/linux-mm/3c542543-0965-ef60-4627-1a4116077a5b@huawei.com/ I don't have any ideas about how to avoid that issue. As Naoya notes in that thread, the issue is poisoning a generic compound page that may turn into a hugetlb page. There may be a way to work around this. However, IMO we may be trying too hard to cover ALL memory error handling races with hugetlb. We know that memory errors can happen at ANY time, and a page could be marked poisoned at any time. I suspect there are many paths where bad things could happen if a memory error happens at 'just the right time'. For example, I believe a page could be poisoned at the very end of the page allocation path and returned to the caller requesting the page. Certainly, not every caller of page allocation checks for and handles a poisoned page. In general, we know that memory error handling will not be 100% perfect and can not be handled in ALL code paths. In some cases if a memory error happens at 'just the right time', bad things will happen.
It would be almost impossible to handle a memory error at any time in all code paths. Someone please correct me if this is not an accepted/known situation. It seems there has been much effort lately to try and catch all hugetlb races with memory error handling. That is OK, but I think we need to accept that there may be races with code paths that are not worth the effort to try and cover. Just my opinion of course. For this specific proposal by Joao, I think we can handle most of the races if all sub-pages of the hugetlb (including head) have a ref count of zero. I actually like moving in this direction as it means we could remove some existing hugetlb code checking for increased ref counts. I can start working on code to move in this direction. We will need to wait for the alloc_frozen_pages() interface and will need to make changes to other allocators for gigantic pages. Until then we can manually zero the ref counts before calling the vmemmap freeing routines. With this in place, I would say that we do something like Joao proposes to free more contiguous vmemmap. Thoughts? In any case, I am going to move forward with setting the ref count of all 'hugetlb pages' to zero before they officially become hugetlb pages. As mentioned, this should allow for removal of some code checking for inflated ref counts.
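A rough sketch of that interim step; prep_new_hugetlb_frozen() is a hypothetical helper, page_ref_freeze()/page_ref_unfreeze() are the existing primitives, and hugetlb_vmemmap_optimize() is the entry point from the diff quoted earlier:

/*
 * Sketch: zero the head page's refcount before freeing/remapping the
 * vmemmap, so no speculative reference can be taken while the head
 * vmemmap page is copied and the PTE switched.  Tail pages of a fresh
 * compound page already start at refcount zero.
 */
static int prep_new_hugetlb_frozen(struct hstate *h, struct page *head)
{
	if (!page_ref_freeze(head, 1))
		return -EBUSY;		/* someone holds an extra ref */

	hugetlb_vmemmap_optimize(h, head);

	page_ref_unfreeze(head, 1);	/* hand back a normal ref of 1 */
	return 0;
}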
On Thu, Aug 04, 2022 at 09:56:39AM -0700, Mike Kravetz wrote: > On 08/04/22 16:05, Muchun Song wrote: > > On Wed, Aug 03, 2022 at 03:42:15PM -0700, Mike Kravetz wrote: > > > On 08/03/22 18:44, Muchun Song wrote: > > > > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote: > > > > > On 8/3/22 05:11, Muchun Song wrote: > > > > > > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote: > > > > I am not sure we are on the same page (it seems that we are not after I saw your > > > > below patch). So let me make it become more clarified. > > > > > > Thanks Muchun! > > > I told Joao that you would be the expert in this area and was correct. > > > > > > > Thanks for your trust. > > > > > > > > > > CPU0: CPU1: > > > > > > > > vmemmap_remap_free(start, end, reuse) > > > > // alloc a new page used to be the head vmemmap page > > > > page = alloc_pages_node(); > > > > > > > > memcpy(page_address(page), reuse, PAGE_SIZE); > > > > // Now the @reuse address is mapped to the original > > > > // page frame. So the change will be reflected on the > > > > // original page frame. > > > > get_page(reuse); > > > > > > Here is a thought. > > > > > > This code gets called early after allocating a new hugetlb page. This new > > > compound page has a ref count of 1 on the head page and 0 on all the tail > > > pages. If the ref count was 0 on the head page, get_page() would not succeed. > > > > > > I can not think of a reason why we NEED to have a ref count of 1 on the > > > head page. It does make it more convenient to simply call put_page() on > > > the newly initialized hugetlb page and have the page be added to the huegtlb > > > free lists, but this could easily be modified. > > > > > > Matthew Willcox recently pointed me at this: > > > https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/T/#m98fb9f9bd476155b4951339da51a0887b2377476 > > > > > > That would only work for hugetlb pages allocated from buddy. For gigantic > > > pages, we manually 'freeze' (zero ref count) of tail pages and check for > > > an unexpected increased ref count. We could do the same with the head page. > > > > > > Would having zero ref count on the head page eliminate this race? > > > > I think most races which try to grab an extra refcount could be avoided > > in this case. > > > > However, I can figure out a race related to memory failure, that is to poison > > a page in memory_failure(), which set PageHWPoison without checking if the > > refcount is zero. If we can fix this easily, I think this patch is a good > > direction. > > Adding Naoya. > > Yes, I recall that discussion in the thread, > https://lore.kernel.org/linux-mm/3c542543-0965-ef60-4627-1a4116077a5b@huawei.com/ > > I don't have any ideas about how to avoid that issue. > > Naoya notes in that thread, the issue is poisioning a generic compound page > that may turn into a hugetlb page. There may be a way to work around this. > > However, IMO we may be trying too hard to cover ALL memory error handling races > with hugetlb. We know that memory errors can happen at ANY time, and a page > could be marked poision at any time. I suspect there are many paths where bad > things could happen if a memory error happens at 'just the right time'. For > example, I believe a page could be poisioned at the very end of the page > allocation path and returned to the caller requesting the page. Certainly, > not every caller of page allocation checks for and handles a poisioned page. 
> In general, we know that memory error handling will not be 100% perfect > and can not be handled in ALL code paths. In some cases if a memory error > happens at 'just the right time', bad things will happen. It would be almost > impossible to handle a memory error at any time in all code paths. Someone > please correct me if this is not accepted/known situation. > > It seems there has been much effort lately to try and catch all hugetlb races > with memory error handling. That is OK, but I think we need to accept that > there may be races with code paths that are not worth the effort to try and > cover. Just my opinion of course. > > For this specific proposal by Joao, I think we can handle most of the races > if all sub-pages of the hugetlb (including head) have a ref count of zero. > I actually like moving in this direction as it means we could remove some > existing hugetlb code checking for increased ref counts. > Totally agree. > I can start working on code to move in this direction. We will need to > wait for the alloc_frozen_pages() interface and will need to make changes > to other allocators for gigantic pages. Until then we can manually zero the > ref counts before calling the vmemmap freeing routines. With this in place, > I would say that we do something like Joao proposes to free more contiguous > vmemmap. > > Thoughts? I think it is the right direction. I can provide code review once you finish this work. Thanks Mike. > > In any case, I am going to move forward with setting ref count to zero of > all 'hugetlb pages' before they officially become hugetlb pages. As mentioned, > this should allow for removal of some code checking for inflated ref counts. > -- > Mike Kravetz >
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 20f414c0379f..2b97df8115fe 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -28,9 +28,11 @@
  */
 struct vmemmap_remap_walk {
	void			(*remap_pte)(pte_t *pte, unsigned long addr,
-					     struct vmemmap_remap_walk *walk);
+					     struct vmemmap_remap_walk *walk,
+					     bool rw);
	unsigned long		nr_walked;
	struct page		*reuse_page;
+	struct page		*head_page;
	unsigned long		reuse_addr;
	struct list_head	*vmemmap_pages;
 };
@@ -104,10 +106,25 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
	 * remapping (which is calling @walk->remap_pte).
	 */
	if (!walk->reuse_page) {
-		walk->reuse_page = pte_page(*pte);
+		struct page *ptpage = pte_page(*pte);
+		struct page *page = walk->head_page ? walk->head_page : ptpage;
+
+		walk->reuse_page = page;
+
+		/*
+		 * Copy the data from the original head, and remap to
+		 * the newly allocated page.
+		 */
+		if (page != ptpage) {
+			memcpy(page_address(page), page_address(ptpage),
+			       PAGE_SIZE);
+			walk->remap_pte(pte, addr, walk, true);
+		}
+
		/*
-		 * Because the reuse address is part of the range that we are
-		 * walking, skip the reuse address range.
+		 * Because the reuse address is part of the range that
+		 * we are walking, skip the reuse address range. Or we
+		 * have already remapped the head to a newly allocated page.
		 */
		addr += PAGE_SIZE;
		pte++;
@@ -115,7 +132,7 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
	}

	for (; addr != end; addr += PAGE_SIZE, pte++) {
-		walk->remap_pte(pte, addr, walk);
+		walk->remap_pte(pte, addr, walk, false);
		walk->nr_walked++;
	}
 }
@@ -238,13 +255,13 @@ static void free_vmemmap_page_list(struct list_head *list)
 }

 static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
-			      struct vmemmap_remap_walk *walk)
+			      struct vmemmap_remap_walk *walk, bool rw)
 {
	/*
	 * Remap the tail pages as read-only to catch illegal write operation
	 * to the tail pages.
	 */
-	pgprot_t pgprot = PAGE_KERNEL_RO;
+	pgprot_t pgprot = rw ? PAGE_KERNEL : PAGE_KERNEL_RO;
	pte_t entry = mk_pte(walk->reuse_page, pgprot);
	struct page *page = pte_page(*pte);
@@ -273,7 +290,7 @@ static inline void reset_struct_pages(struct page *start)
 }

 static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
-				struct vmemmap_remap_walk *walk)
+				struct vmemmap_remap_walk *walk, bool rw)
 {
	pgprot_t pgprot = PAGE_KERNEL;
	struct page *page;
@@ -312,6 +329,20 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
		.reuse_addr	= reuse,
		.vmemmap_pages	= &vmemmap_pages,
	};
+	gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
+	int nid = page_to_nid((struct page *)start);
+	struct page *page;
+
+	/*
+	 * Allocate a new head vmemmap page to avoid breaking a contiguous
+	 * block of struct page memory when freeing it back to page allocator
+	 * in free_vmemmap_page_list(). This will allow the likely contiguous
+	 * struct page backing memory to be kept contiguous and allowing for
+	 * more allocations of hugepages. Fall back to the currently
+	 * mapped head page should the allocation fail.
+	 */
+	page = alloc_pages_node(nid, gfp_mask, 0);
+	walk.head_page = page;

	/*
	 * In order to make remapping routine most efficient for the huge pages,
Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed back to the page allocator is as follows: for a 2M hugetlb page it will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap pages. Essentially, that means that it breaks the first 4K of a potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For this reason the memory that is freed back to the page allocator cannot be used by hugetlb to allocate huge pages of the same size, but rather only of a smaller huge page size:

Try to assign a 64G node to hugetlb (on a 128G 2node guest, each node having 64G):

* Before allocation:

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
...
Node    0, zone   Normal, type      Movable      0      0      1      0      1      0      0      0      0      0  15561

$ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
31974

* After:

Node    0, zone   Normal, type      Movable  32174  31999  31642    104     58     24     16      4      2      0      0

Notice how the memory freed back is put back into the 4K / 8K / 16K page pools. And it allocates a total of 31974 pages (63948M).

To fix this behaviour, rather than leaving one vmemmap page in place (thus breaking the contiguous block of memory backing the struct pages), instead repopulate the head vmemmap page with a new page. This is followed by copying the data from the currently mapped vmemmap page there, to then remap it to this new page. Additionally, change the remap_pte callback to include a rw bool parameter, given that the head page needs r/w permission compared to the tail page vmemmap remap that is r/o. vmemmap_pte_range() will set it accordingly when calling remap_pte() when it first tries to set @reuse_page. The new head page is allocated by the caller of vmemmap_remap_free(), given that on restore it should still be using the same code path as before. Note that, because right now one hugepage is remapped at a time, only one free 4K page at a time is needed to remap the head page. Should it fail to allocate said new page, it reuses the one that's already mapped, just like before. As a result, for every 64G of contiguous hugepages it can give back 1G more of contiguous memory, while needing in total 128M of new 4K pages (for 2M hugetlb) or 256K (for 1G hugetlb).

After the changes, try to assign a 64G node to hugetlb (on a 128G 2node guest, each node with 64G):

* Before allocation:

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
...
Node    0, zone   Normal, type      Movable      0      1      1      0      0      0      0      1      1      1  15564

$ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
$ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
32390

* After:

Node    0, zone   Normal, type      Movable      0      1      0     70    106     91     78     48     17      0      0

In the example above, 416 more hugetlb 2M pages are allocated, i.e. 832M out of the 32390 (64780M) allocated. So the memory freed back is indeed being used back in hugetlb, and there are no order-0..order-2 pages accumulated unused.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/hugetlb_vmemmap.c | 47 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 39 insertions(+), 8 deletions(-)
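As a back-of-the-envelope check of the cost figures in the changelog (a sketch; constants assume x86-64 with 4K base pages):

/*
 * 2M hugetlb: 64G / 2M = 32768 huge pages; one replacement 4K head
 *             page each gives 32768 * 4K = 128M per 64G.
 * 1G hugetlb: 64G / 1G = 64 huge pages;       64 * 4K = 256K per 64G.
 */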