Message ID | 20220916214638.155744-1-mike.kravetz@oracle.com (mailing list archive)
---|---
State | New
Series | [v2] hugetlb: freeze allocated pages before creating hugetlb pages
On Fri, Sep 16, 2022 at 02:46:38PM -0700, Mike Kravetz wrote:
> When creating hugetlb pages, the hugetlb code must first allocate
> contiguous pages from a low level allocator such as buddy, cma or
> memblock. The pages returned from these low level allocators are
> ref counted. This creates potential issues with other code taking
> speculative references on these pages before they can be transformed to
> a hugetlb page. This issue has been addressed with methods and code
> such as that provided in [1].
>
> Recent discussions about vmemmap freeing [2] have indicated that it
> would be beneficial to freeze all sub pages, including the head page
> of pages returned from low level allocators before converting to a
> hugetlb page. This helps avoid races if we want to replace the page
> containing vmemmap for the head page.
>
> There have been proposals to change at least the buddy allocator to
> return frozen pages as described at [3]. If such a change is made, it
> can be employed by the hugetlb code. However, as mentioned above
> hugetlb uses several low level allocators so each would need to be
> modified to return frozen pages. For now, we can manually freeze the
> returned pages. This is done in two places:
> 1) alloc_buddy_huge_page, only the returned head page is ref counted.
>    We freeze the head page, retrying once in the VERY rare case where
>    there may be an inflated ref count.
> 2) prep_compound_gigantic_page, for gigantic pages the current code
>    freezes all pages except the head page. New code will simply freeze
>    the head page as well.
>
> In a few other places, code checks for inflated ref counts on newly
> allocated hugetlb pages. With the modifications to freeze after
> allocating, this code can be removed.
>
> After hugetlb pages are freshly allocated, they are often added to the
> hugetlb free lists. Since these pages were previously ref counted, this
> was done via put_page() which would end up calling the hugetlb
> destructor: free_huge_page. With changes to freeze pages, we simply
> call free_huge_page directly to add the pages to the free list.
>
> In a few other places, freshly allocated hugetlb pages were immediately
> put into use, and the expectation was they were already ref counted. In
> these cases, we must manually ref count the page.
>
> [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/
> [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
> [3] https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Hi Mike,

this looks great and simplifies the code a lot.
I have a question, though:

> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1787,9 +1787,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
>
>  	/* we rely on prep_new_huge_page to set the destructor */
>  	set_compound_order(page, order);
> -	__ClearPageReserved(page);
>  	__SetPageHead(page);
> -	for (i = 1; i < nr_pages; i++) {
> +	for (i = 0; i < nr_pages; i++) {
>  		p = nth_page(page, i);
>
>  		/*
> @@ -1830,17 +1829,19 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
>  		} else {
>  			VM_BUG_ON_PAGE(page_count(p), p);
>  		}
> -		set_compound_head(p, page);
> +		if (i != 0)
> +			set_compound_head(p, page);

Surely I am missing something here, but why do we only freeze the refcount
here when it is for demote? We seem to be doing it unconditionally in
alloc_buddy_huge_page.
On 09/20/22 06:02, Oscar Salvador wrote:
> On Fri, Sep 16, 2022 at 02:46:38PM -0700, Mike Kravetz wrote:
> > [...]
>
> Hi Mike,
>
> this looks great and simplifies the code a lot.
> I have a question, though:
>
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1787,9 +1787,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
> >
> >  	/* we rely on prep_new_huge_page to set the destructor */
> >  	set_compound_order(page, order);
> > -	__ClearPageReserved(page);
> >  	__SetPageHead(page);
> > -	for (i = 1; i < nr_pages; i++) {
> > +	for (i = 0; i < nr_pages; i++) {
> >  		p = nth_page(page, i);
> >
> >  		/*
> > @@ -1830,17 +1829,19 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
> >  		} else {
> >  			VM_BUG_ON_PAGE(page_count(p), p);
> >  		}
> > -		set_compound_head(p, page);
> > +		if (i != 0)
> > +			set_compound_head(p, page);
>
> Surely I am missing something here, but why do we only freeze the refcount
> here when it is for demote?

Hi Oscar, thanks for taking a look.

I think you may have missed something in the above comment. For gigantic
pages, we only freeze pages if NOT demote. In the demote case, pages are
already frozen.

> We seem to be doing it unconditionally in alloc_buddy_huge_page.
In alloc_buddy_huge_page, we are getting a fresh/new hugetlb page, so we are
certainly not in a demote path. That is why we freeze unconditionally there:
we want to make sure the page is frozen before it is put to use as a hugetlb
page.

In the demote path, we call two routines to 'prep' the set of free pages:

- prep_compound_gigantic_page_for_demote, which ends up in the above logic.
  We do not need to freeze, because the ref count of all the pages is already
  zero.
- prep_compound_page, which is not hugetlb specific but rather is used in the
  normal allocation path of compound pages. It serves the purpose of creating
  a compound page of the appropriate size, and only deals with getting the
  compound page structure correct; it does not change any ref counts. This is
  desired, as the ref count on all the pages is zero.

Hope that answers your question.
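The freeze semantics discussed above can be illustrated with a small
user-space model. This is only a sketch of what the kernel's
page_ref_freeze(page, count) does (atomically set the ref count to zero iff
it is exactly 'count'); the struct and names here are hypothetical, not the
kernel's.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page's _refcount field. */
struct fake_page {
	atomic_int refcount;
};

/*
 * Mirrors the semantics of the kernel's page_ref_freeze(page, count):
 * atomically replace the ref count with zero, but only if it is exactly
 * 'count'.  Returns false if someone holds an extra (speculative)
 * reference, in which case the count is left untouched.
 */
static bool fake_page_ref_freeze(struct fake_page *p, int count)
{
	int expected = count;

	return atomic_compare_exchange_strong(&p->refcount, &expected, 0);
}
```

This is why the demote path needs no freeze: the pages already sit at ref
count zero, so there is nothing to transition, while a freshly allocated
buddy page sits at one and must be frozen before use.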
On Fri, Sep 16, 2022 at 02:46:38PM -0700, Mike Kravetz wrote:
> [...]
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> v1 -> v2
> - Fixed up head page in error path of __prep_compound_gigantic_page as
>   discovered by Miaohe Lin.
> - Updated link to Matthew's Allocate and free frozen pages series.
> - Rebased on next-20220916
>
>  mm/hugetlb.c | 102 +++++++++++++++++++--------------------------------
>  1 file changed, 38 insertions(+), 64 deletions(-)

Hello Mike,

I accidentally found a NULL pointer dereference when testing the latest
mm-unstable, which seems to be caused (or exposed?) by this patch
(I confirmed that it disappeared by reverting this patch).

It can be reproduced by running something like `sysctl vm.nr_hugepages=1000000`
to allocate as many hugepages as possible.

Could you check whether this patch is related to the issue?
Thanks,
Naoya Horiguchi

---
[ 25.634476] BUG: kernel NULL pointer dereference, address: 0000000000000034
[ 25.635980] #PF: supervisor write access in kernel mode
[ 25.637283] #PF: error_code(0x0002) - not-present page
[ 25.638365] PGD 0 P4D 0
[ 25.638906] Oops: 0002 [#1] PREEMPT SMP PTI
[ 25.639779] CPU: 4 PID: 819 Comm: sysctl Tainted: G E N 6.0.0-rc3-v6.0-rc1-220920-1758-1398-g2b3f5+ #12
[ 25.641928] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[ 25.643727] RIP: 0010:alloc_buddy_huge_page.isra.0+0x8c/0x140
[ 25.645071] Code: fe ff 41 83 fc 01 0f 84 54 94 8b 00 41 bc 01 00 00 00 44 89 f7 4c 89 f9 44 89 ea 89 de e8 7c b9 fe ff 48 89 c7 b8 01 00 00 00 <f0> 0f b1 6f 34 66 90 83 f8 01 75 c5 48 85 ff 74 52 65 48 ff 05 03
[ 25.649006] RSP: 0018:ffffaa7181fffc18 EFLAGS: 00010286
[ 25.650215] RAX: 0000000000000001 RBX: 0000000000000009 RCX: 0000000000000009
[ 25.651672] RDX: ffffffffae3b6df0 RSI: ffffffffae8f7ce0 RDI: 0000000000000000
[ 25.653115] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000c01
[ 25.654579] R10: 0000000000000f90 R11: 0000000000000000 R12: 0000000000000002
[ 25.656176] R13: 0000000000000000 R14: 0000000000346cca R15: ffffffffae8f7ce0
[ 25.657637] FS: 00007f9252f2a740(0000) GS:ffff98cebbc00000(0000) knlGS:0000000000000000
[ 25.659292] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 25.660469] CR2: 0000000000000034 CR3: 000000014924c004 CR4: 0000000000170ee0
[ 25.661928] Call Trace:
[ 25.662469] <TASK>
[ 25.662927] alloc_fresh_huge_page+0x16f/0x1d0
[ 25.663859] alloc_pool_huge_page+0x6d/0xb0
[ 25.664734] __nr_hugepages_store_common+0x189/0x3e0
[ 25.665764] ? __do_proc_doulongvec_minmax+0x31f/0x340
[ 25.666832] hugetlb_sysctl_handler_common+0xbf/0xd0
[ 25.667861] ? hugetlb_register_node+0xe0/0xe0
[ 25.668786] proc_sys_call_handler+0x196/0x2b0
[ 25.669724] vfs_write+0x29b/0x3a0
[ 25.670454] ksys_write+0x4f/0xd0
[ 25.671153] do_syscall_64+0x3b/0x90
[ 25.671909] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 25.672958] RIP: 0033:0x7f9252d3e727
[ 25.673712] Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 25.677470] RSP: 002b:00007ffcdf9904a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 25.679002] RAX: ffffffffffffffda RBX: 000055c6ae683210 RCX: 00007f9252d3e727
[ 25.680456] RDX: 0000000000000006 RSI: 000055c6ae683250 RDI: 0000000000000003
[ 25.681910] RBP: 000055c6ae685380 R08: 0000000000000003 R09: 0000000000000077
[ 25.683373] R10: 000000000000006b R11: 0000000000000246 R12: 0000000000000006
[ 25.684824] R13: 0000000000000006 R14: 0000000000000006 R15: 00007f9252df59e0
[ 25.686293] </TASK>
On 09/21/22 01:48, HORIGUCHI NAOYA(堀口 直也) wrote:
> Hello Mike,
>
> I accidentally found a NULL pointer dereference when testing the latest
> mm-unstable, which seems to be caused (or exposed?) by this patch
> (I confirmed that it disappeared by reverting this patch).
>
> It can be reproduced by running something like `sysctl vm.nr_hugepages=1000000`
> to allocate as many hugepages as possible.
>
> Could you check whether this patch is related to the issue?
>
> [...]

Hello Naoya,

My bad for an obvious mistake!
Note this change in the patch,

> @@ -1951,7 +1953,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
>  		gfp_mask |= __GFP_RETRY_MAYFAIL;
>  	if (nid == NUMA_NO_NODE)
>  		nid = numa_mem_id();
> +retry:
>  	page = __alloc_pages(gfp_mask, order, nid, nmask);
> +
> +	/* Freeze head page */
> +	if (!page_ref_freeze(page, 1)) {
> +		__free_pages(page, order);
> +		if (retry) {	/* retry once */
> +			retry = false;
> +			goto retry;
> +		}
> +		/* WOW!  twice in a row. */
> +		pr_warn("HugeTLB head page unexpected inflated ref count\n");
> +		page = NULL;
> +	}
> +
>  	if (page)
>  		__count_vm_event(HTLB_BUDDY_PGALLOC);
>  	else

It does not check for a successful return from __alloc_pages before trying to
freeze 'page'. It should obviously check for page == NULL before trying to
freeze, with something like this:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index db90658db171..a092dd639687 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1961,7 +1961,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	page = __alloc_pages(gfp_mask, order, nid, nmask);
 
 	/* Freeze head page */
-	if (!page_ref_freeze(page, 1)) {
+	if (page && !page_ref_freeze(page, 1)) {
 		__free_pages(page, order);
 		if (retry) {	/* retry once */
 			retry = false;

I will send out a v3 tomorrow (my time).
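The corrected control flow can be modeled in user space. Below is only a
hedged sketch (all names are hypothetical stand-ins, not kernel APIs): it
takes the possibly-NULL result of the low-level allocator and shows why the
NULL check must precede the freeze attempt, since the kernel's
page_ref_freeze() dereferences its argument.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page's _refcount field. */
struct fake_page {
	atomic_int refcount;
};

/* Model of page_ref_freeze(): count -> 0 iff count matches exactly. */
static bool fake_page_ref_freeze(struct fake_page *p, int count)
{
	int expected = count;

	return atomic_compare_exchange_strong(&p->refcount, &expected, 0);
}

/*
 * One step of the corrected flow: take the (possibly NULL) result of the
 * low-level allocator and try to freeze it.  Returns the frozen page, or
 * NULL if allocation failed or the ref count was inflated.  '*freed'
 * reports whether the page was handed back (stands in for __free_pages(),
 * after which the caller would retry once, as in the patch).
 */
static struct fake_page *freeze_or_discard(struct fake_page *page, bool *freed)
{
	*freed = false;
	/* NULL check first: freezing a NULL page would dereference NULL */
	if (page && !fake_page_ref_freeze(page, 1)) {
		*freed = true;	/* caller frees and may retry once */
		page = NULL;
	}
	return page;
}
```

The original v2 code skipped the `page &&` test, so an allocation failure
(page == NULL) went straight into the freeze and faulted on the refcount
field, matching the NULL-plus-small-offset address in Naoya's oops.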
On Tue, Sep 20, 2022 at 10:10:29AM -0700, Mike Kravetz wrote:
> Hi Oscar, thanks for taking a look.
>
> I think you may have missed something in the above comment. For gigantic
> pages, we only freeze pages if NOT demote. In the demote case, pages are
> already frozen.

Yeah. What confused me is that the link you posted
(https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/)
looked like an earlier version of what went into upstream, and in there
there was "demotion" handling. Having checked the tree, it all looks sound.

I will wait for v3. Thanks, Mike!
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5ea0b1b0d1ab..9b8526d27c29 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1787,9 +1787,8 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
 
 	/* we rely on prep_new_huge_page to set the destructor */
 	set_compound_order(page, order);
-	__ClearPageReserved(page);
 	__SetPageHead(page);
-	for (i = 1; i < nr_pages; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		p = nth_page(page, i);
 
 		/*
@@ -1830,17 +1829,19 @@ static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
 		} else {
 			VM_BUG_ON_PAGE(page_count(p), p);
 		}
-		set_compound_head(p, page);
+		if (i != 0)
+			set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
 	atomic_set(compound_pincount_ptr(page), 0);
 	return true;
 
 out_error:
-	/* undo tail page modifications made above */
-	for (j = 1; j < i; j++) {
+	/* undo page modifications made above */
+	for (j = 0; j < i; j++) {
 		p = nth_page(page, j);
-		clear_compound_head(p);
+		if (j != 0)
+			clear_compound_head(p);
 		set_page_refcounted(p);
 	}
 	/* need to clear PG_reserved on remaining tail pages */
@@ -1936,6 +1937,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	int order = huge_page_order(h);
 	struct page *page;
 	bool alloc_try_hard = true;
+	bool retry = true;
 
 	/*
 	 * By default we always try hard to allocate the page with
@@ -1951,7 +1953,21 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 		gfp_mask |= __GFP_RETRY_MAYFAIL;
 	if (nid == NUMA_NO_NODE)
 		nid = numa_mem_id();
+retry:
 	page = __alloc_pages(gfp_mask, order, nid, nmask);
+
+	/* Freeze head page */
+	if (!page_ref_freeze(page, 1)) {
+		__free_pages(page, order);
+		if (retry) {	/* retry once */
+			retry = false;
+			goto retry;
+		}
+		/* WOW!  twice in a row. */
+		pr_warn("HugeTLB head page unexpected inflated ref count\n");
+		page = NULL;
+	}
+
 	if (page)
 		__count_vm_event(HTLB_BUDDY_PGALLOC);
 	else
@@ -1979,6 +1995,9 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 /*
  * Common helper to allocate a fresh hugetlb page. All specific allocators
  * should use this function to get new hugetlb pages
+ *
+ * Note that returned page is 'frozen':  ref count of head page and all tail
+ * pages is zero.
  */
 static struct page *alloc_fresh_huge_page(struct hstate *h,
 		gfp_t gfp_mask, int nid, nodemask_t *nmask,
@@ -2036,7 +2055,7 @@ static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
 	if (!page)
 		return 0;
 
-	put_page(page); /* free it into the hugepage allocator */
+	free_huge_page(page); /* free it into the hugepage allocator */
 
 	return 1;
 }
@@ -2193,10 +2212,9 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
  * Allocates a fresh surplus page from the page allocator.
  */
 static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
-		int nid, nodemask_t *nmask, bool zero_ref)
+		int nid, nodemask_t *nmask)
 {
 	struct page *page = NULL;
-	bool retry = false;
 
 	if (hstate_is_gigantic(h))
 		return NULL;
@@ -2206,7 +2224,6 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 		goto out_unlock;
 	spin_unlock_irq(&hugetlb_lock);
 
-retry:
 	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
@@ -2222,34 +2239,10 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 	if (h->surplus_huge_pages >= h->nr_overcommit_huge_pages) {
 		SetHPageTemporary(page);
 		spin_unlock_irq(&hugetlb_lock);
-		put_page(page);
+		free_huge_page(page);
 		return NULL;
 	}
 
-	if (zero_ref) {
-		/*
-		 * Caller requires a page with zero ref count.
-		 * We will drop ref count here.  If someone else is holding
-		 * a ref, the page will be freed when they drop it.  Abuse
-		 * temporary page flag to accomplish this.
-		 */
-		SetHPageTemporary(page);
-		if (!put_page_testzero(page)) {
-			/*
-			 * Unexpected inflated ref count on freshly allocated
-			 * huge.  Retry once.
-			 */
-			pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n");
-			spin_unlock_irq(&hugetlb_lock);
-			if (retry)
-				return NULL;
-
-			retry = true;
-			goto retry;
-		}
-		ClearHPageTemporary(page);
-	}
-
 	h->surplus_huge_pages++;
 	h->surplus_huge_pages_node[page_to_nid(page)]++;
 
@@ -2271,6 +2264,9 @@ static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
 	if (!page)
 		return NULL;
 
+	/* fresh huge pages are frozen */
+	set_page_refcounted(page);
+
 	/*
 	 * We do not account these pages as surplus because they are only
 	 * temporary and will be released properly on the last reference
@@ -2298,14 +2294,14 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
 		gfp_t gfp = gfp_mask | __GFP_NOWARN;
 
 		gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
-		page = alloc_surplus_huge_page(h, gfp, nid, nodemask, false);
+		page = alloc_surplus_huge_page(h, gfp, nid, nodemask);
 
 		/* Fallback to all nodes if page==NULL */
 		nodemask = NULL;
 	}
 
 	if (!page)
-		page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask, false);
+		page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
 	mpol_cond_put(mpol);
 	return page;
 }
@@ -2375,7 +2371,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	spin_unlock_irq(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
 		page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
-				NUMA_NO_NODE, NULL, true);
+				NUMA_NO_NODE, NULL);
 		if (!page) {
 			alloc_ok = false;
 			break;
@@ -2737,7 +2733,6 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 {
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 	int nid = page_to_nid(old_page);
-	bool alloc_retry = false;
 	struct page *new_page;
 	int ret = 0;
 
@@ -2748,30 +2743,9 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 	 * the pool.  This simplifies and let us do most of the processing
 	 * under the lock.
 	 */
-alloc_retry:
 	new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
 	if (!new_page)
 		return -ENOMEM;
-	/*
-	 * If all goes well, this page will be directly added to the free
-	 * list in the pool.  For this the ref count needs to be zero.
-	 * Attempt to drop now, and retry once if needed.  It is VERY
-	 * unlikely there is another ref on the page.
-	 *
-	 * If someone else has a reference to the page, it will be freed
-	 * when they drop their ref.  Abuse temporary page flag to accomplish
-	 * this.  Retry once if there is an inflated ref count.
-	 */
-	SetHPageTemporary(new_page);
-	if (!put_page_testzero(new_page)) {
-		if (alloc_retry)
-			return -EBUSY;
-
-		alloc_retry = true;
-		goto alloc_retry;
-	}
-	ClearHPageTemporary(new_page);
-
 	__prep_new_huge_page(h, new_page);
 
 retry:
@@ -2951,6 +2925,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 		}
 		spin_lock_irq(&hugetlb_lock);
 		list_add(&page->lru, &h->hugepage_activelist);
+		set_page_refcounted(page);
 		/* Fall through */
 	}
 	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
@@ -3055,7 +3030,7 @@ static void __init gather_bootmem_prealloc(void)
 		if (prep_compound_gigantic_page(page, huge_page_order(h))) {
 			WARN_ON(PageReserved(page));
 			prep_new_huge_page(h, page, page_to_nid(page));
-			put_page(page); /* add to the hugepage allocator */
+			free_huge_page(page); /* add to the hugepage allocator */
 		} else {
 			/* VERY unlikely inflated ref count on a tail page */
 			free_gigantic_page(page, huge_page_order(h));
@@ -3087,7 +3062,7 @@ static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 				&node_states[N_MEMORY], NULL);
 		if (!page)
 			break;
-		put_page(page); /* free it into the hugepage allocator */
+		free_huge_page(page); /* free it into the hugepage allocator */
 	}
 	cond_resched();
 }
@@ -3478,9 +3453,8 @@ static int demote_free_huge_page(struct hstate *h, struct page *page)
 		else
 			prep_compound_page(subpage, target_hstate->order);
 		set_page_private(subpage, 0);
-		set_page_refcounted(subpage);
 		prep_new_huge_page(target_hstate, subpage, nid);
-		put_page(subpage);
+		free_huge_page(subpage);
 	}
 	mutex_unlock(&target_hstate->resize_lock);
When creating hugetlb pages, the hugetlb code must first allocate
contiguous pages from a low level allocator such as buddy, cma or
memblock. The pages returned from these low level allocators are
ref counted. This creates potential issues with other code taking
speculative references on these pages before they can be transformed to
a hugetlb page. This issue has been addressed with methods and code
such as that provided in [1].

Recent discussions about vmemmap freeing [2] have indicated that it
would be beneficial to freeze all sub pages, including the head page
of pages returned from low level allocators before converting to a
hugetlb page. This helps avoid races if we want to replace the page
containing vmemmap for the head page.

There have been proposals to change at least the buddy allocator to
return frozen pages as described at [3]. If such a change is made, it
can be employed by the hugetlb code. However, as mentioned above
hugetlb uses several low level allocators so each would need to be
modified to return frozen pages. For now, we can manually freeze the
returned pages. This is done in two places:
1) alloc_buddy_huge_page, only the returned head page is ref counted.
   We freeze the head page, retrying once in the VERY rare case where
   there may be an inflated ref count.
2) prep_compound_gigantic_page, for gigantic pages the current code
   freezes all pages except the head page. New code will simply freeze
   the head page as well.

In a few other places, code checks for inflated ref counts on newly
allocated hugetlb pages. With the modifications to freeze after
allocating, this code can be removed.

After hugetlb pages are freshly allocated, they are often added to the
hugetlb free lists. Since these pages were previously ref counted, this
was done via put_page() which would end up calling the hugetlb
destructor: free_huge_page. With changes to freeze pages, we simply
call free_huge_page directly to add the pages to the free list.

In a few other places, freshly allocated hugetlb pages were immediately
put into use, and the expectation was they were already ref counted. In
these cases, we must manually ref count the page.

[1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/
[2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
[3] https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
v1 -> v2
- Fixed up head page in error path of __prep_compound_gigantic_page as
  discovered by Miaohe Lin.
- Updated link to Matthew's Allocate and free frozen pages series.
- Rebased on next-20220916

 mm/hugetlb.c | 102 +++++++++++++++++++--------------------------------
 1 file changed, 38 insertions(+), 64 deletions(-)
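The commit message distinguishes two hand-offs for a freshly allocated
(frozen, ref count zero) page: add it to the free list directly via
free_huge_page, or ref count it manually before putting it to use. A toy
user-space model of that lifecycle is sketched below; the struct and
function names are hypothetical stand-ins, not the kernel's.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a hugetlb page's state after this patch. */
struct fake_page {
	atomic_int refcount;
	bool on_free_list;
};

/*
 * Stand-in for set_page_refcounted(): a frozen page being put directly
 * to use gets its ref count raised to one by the hugetlb code itself.
 */
static void fake_set_page_refcounted(struct fake_page *p)
{
	atomic_store(&p->refcount, 1);
}

/*
 * Stand-in for calling free_huge_page() directly on a frozen page:
 * no put_page() round trip is needed because, with this patch, a fresh
 * page already has a ref count of zero when it is added to the pool.
 */
static void fake_free_huge_page(struct fake_page *p)
{
	assert(atomic_load(&p->refcount) == 0);	/* frozen, per the patch */
	p->on_free_list = true;
}
```

Before the patch, the same hand-off went through put_page() on a page with
ref count one, relying on the drop to zero to invoke the free_huge_page
destructor; freezing up front makes the direct call both possible and
simpler.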