| Message ID | 20220211065215.101767-1-aneesh.kumar@linux.ibm.com |
|---|---|
| State | New |
| Series | [v2] powerpc/mm: Update default hugetlb size early |
On 11.02.22 07:52, Aneesh Kumar K.V wrote:
> commit: d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
> introduced pageblock_order which will be used to group pages better.
> The kernel now groups pages based on the value of HPAGE_SHIFT. Hence HPAGE_SHIFT
> should be set before we call set_pageblock_order.
>
> set_pageblock_order happens early in the boot and the default hugetlb page size
> should be initialized before that to compute the right pageblock_order value.
>
> Currently, the default hugetlb page size is set via arch_initcalls which happens
> late in the boot, as shown via the below callstack:
>
> [c000000007383b10] [c000000001289328] hugetlbpage_init+0x2b8/0x2f8
> [c000000007383bc0] [c0000000012749e4] do_one_initcall+0x14c/0x320
> [c000000007383c90] [c00000000127505c] kernel_init_freeable+0x410/0x4e8
> [c000000007383da0] [c000000000012664] kernel_init+0x30/0x15c
> [c000000007383e10] [c00000000000cf14] ret_from_kernel_thread+0x5c/0x64
>
> and the pageblock_order initialization is done early during the boot:
>
> [c0000000018bfc80] [c0000000012ae120] set_pageblock_order+0x50/0x64
> [c0000000018bfca0] [c0000000012b3d94] sparse_init+0x188/0x268
> [c0000000018bfd60] [c000000001288bfc] initmem_init+0x28c/0x328
> [c0000000018bfe50] [c00000000127b370] setup_arch+0x410/0x480
> [c0000000018bfed0] [c00000000127401c] start_kernel+0xb8/0x934
> [c0000000018bff90] [c00000000000d984] start_here_common+0x1c/0x98
>
> Delaying default hugetlb page size initialization implies the kernel will
> initialize pageblock_order to (MAX_ORDER - 1), which is not an optimal
> value for mobility grouping. IIUC we always had this issue. But it was not
> a problem for hash translation mode because (MAX_ORDER - 1) is the same as
> HUGETLB_PAGE_ORDER (8) in the case of hash (16MB). With radix,
> HUGETLB_PAGE_ORDER will be 5 (2M size) and hence pageblock_order should be
> 5 instead of 8.

A related question: can we on ppc still have pageblock_order > MAX_ORDER - 1?
We have some code for that, and I am not so sure we really need it.
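For reference, the orders quoted above follow directly from the page-size
shifts. A minimal sketch of the arithmetic, assuming a 64K base page
(PAGE_SHIFT = 16) and the 16MB/2MB default huge page sizes named above:

	/*
	 * Illustrative only, not kernel code:
	 * HUGETLB_PAGE_ORDER = HPAGE_SHIFT - PAGE_SHIFT
	 */
	#include <stdio.h>

	int main(void)
	{
		const int page_shift = 16;  /* 64K base pages */

		/* hash:  16MB default huge page => HPAGE_SHIFT = 24 */
		printf("hash:  HUGETLB_PAGE_ORDER = %d\n", 24 - page_shift);  /* 8 */
		/* radix:  2MB default huge page => HPAGE_SHIFT = 21 */
		printf("radix: HUGETLB_PAGE_ORDER = %d\n", 21 - page_shift);  /* 5 */
		return 0;
	}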
On 2/11/22 14:00, David Hildenbrand wrote:
> On 11.02.22 07:52, Aneesh Kumar K.V wrote:
>> commit: d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
>> introduced pageblock_order which will be used to group pages better.
....
>
> A related question: can we on ppc still have pageblock_order > MAX_ORDER - 1?
> We have some code for that, and I am not so sure we really need it.

I also have been wondering about the same. On book3s64 I don't think we need
that support for either 64K or 4K page size, because with hash the hugetlb
page order is MAX_ORDER - 1 (16MB hugepage size).

I am not sure about the 256K page support. Christophe may be able to answer
that.

For gigantic hugepage support we depend on CMA-based allocation or firmware
reservation, so I am not sure why we ever considered the pageblock_order >
MAX_ORDER - 1 scenario. If you have pointers w.r.t. why that was ever needed,
I could double-check whether ppc64 still depends on it.

-aneesh
On 11.02.22 10:16, Aneesh Kumar K V wrote:
> On 2/11/22 14:00, David Hildenbrand wrote:
>> On 11.02.22 07:52, Aneesh Kumar K.V wrote:
>>> commit: d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
....
>
> For gigantic hugepage support we depend on CMA-based allocation or firmware
> reservation, so I am not sure why we ever considered the pageblock_order >
> MAX_ORDER - 1 scenario. If you have pointers w.r.t. why that was ever needed,
> I could double-check whether ppc64 still depends on it.

commit dc78327c0ea7da5186d8cbc1647bd6088c5c9fa5
Author: Michal Nazarewicz <mina86@mina86.com>
Date:   Wed Jul 2 15:22:35 2014 -0700

    mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER

indicates that at least arm64 used to have cases for that as well. However,
nowadays with ARM64_64K_PAGES we have FORCE_MAX_ZONEORDER=14 as the default,
corresponding to 512MiB.

So I'm not sure this is something worth supporting. If you want somewhat
reliable gigantic pages, use CMA or preallocate them during boot.
David Hildenbrand <david@redhat.com> writes:

> On 11.02.22 10:16, Aneesh Kumar K V wrote:
>> On 2/11/22 14:00, David Hildenbrand wrote:
>>> On 11.02.22 07:52, Aneesh Kumar K.V wrote:
>>>> commit: d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
....
>
> So I'm not sure this is something worth supporting. If you want somewhat
> reliable gigantic pages, use CMA or preallocate them during boot.
>
> --
> Thanks,
>
> David / dhildenb

I could build a kernel with FORCE_MAX_ZONEORDER=8 and pageblock_order = 8.
We need to disable THP for such a kernel to boot, because THP checks for
PMD_ORDER < MAX_ORDER. I was able to boot that kernel on a virtualized
platform, but gigantic_page_runtime_supported is not supported on such a
config with hash translation.

On a non-virtualized platform I am hitting crashes like the below during boot.

[   47.637865][   C42] =============================================================================
[   47.637907][   C42] BUG pgtable-2^11 (Not tainted): Object already free
[   47.637925][   C42] -----------------------------------------------------------------------------
[   47.637925][   C42]
[   47.637945][   C42] Allocated in __pud_alloc+0x84/0x2a0 age=278 cpu=40 pid=1409
[   47.637974][   C42]  __slab_alloc.isra.0+0x40/0x60
[   47.637995][   C42]  kmem_cache_alloc+0x1a8/0x510
[   47.638010][   C42]  __pud_alloc+0x84/0x2a0
[   47.638024][   C42]  copy_page_range+0x38c/0x1b90
[   47.638040][   C42]  dup_mm+0x548/0x880
[   47.638058][   C42]  copy_process+0xdc0/0x1e90
[   47.638076][   C42]  kernel_clone+0xd4/0x9d0
[   47.638094][   C42]  __do_sys_clone+0x88/0xe0
[   47.638112][   C42]  system_call_exception+0x368/0x3a0
[   47.638128][   C42]  system_call_common+0xec/0x250
[   47.638147][   C42] Freed in __tlb_remove_table+0x1d4/0x200 age=263 cpu=57 pid=326
[   47.638172][   C42]  kmem_cache_free+0x44c/0x680
[   47.638187][   C42]  __tlb_remove_table+0x1d4/0x200
[   47.638204][   C42]  tlb_remove_table_rcu+0x54/0xa0
[   47.638222][   C42]  rcu_core+0xdd4/0x15d0
[   47.638239][   C42]  __do_softirq+0x360/0x69c
[   47.638257][   C42]  run_ksoftirqd+0x54/0xc0
[   47.638273][   C42]  smpboot_thread_fn+0x28c/0x2f0
[   47.638290][   C42]  kthread+0x1a4/0x1b0
[   47.638305][   C42]  ret_from_kernel_thread+0x5c/0x64
[   47.638320][   C42] Slab 0xc00c00000000d600 objects=10 used=9 fp=0xc0000000035a8000 flags=0x7ffff000010201(locked|slab|head|node=0|zone=0|lastcpupid=0x7ffff)
[   47.638352][   C42] Object 0xc0000000035a8000 @offset=163840 fp=0x0000000000000000
[   47.638352][   C42]
[   47.638373][   C42] Redzone  c0000000035a4000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[   47.638394][   C42] Redzone  c0000000035a4010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[   47.638414][   C42] Redzone  c0000000035a4020: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[   47.638435][   C42] Redzone  c0000000035a4030: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[   47.638455][   C42] Redzone  c0000000035a4040: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[   47.638474][   C42] Redzone  c0000000035a4050: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[   47.638494][   C42] Redzone  c0000000035a4060: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[   47.638514][   C42] Redzone  c0000000035a4070: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[   47.638534][   C42] Redzone  c0000000035a4080: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
On 11.02.22 13:23, Aneesh Kumar K.V wrote:
> David Hildenbrand <david@redhat.com> writes:
>
>> On 11.02.22 10:16, Aneesh Kumar K V wrote:
>>> On 2/11/22 14:00, David Hildenbrand wrote:
>>>> On 11.02.22 07:52, Aneesh Kumar K.V wrote:
>>>>> commit: d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
....
>
> I could build a kernel with FORCE_MAX_ZONEORDER=8 and pageblock_order = 8.
> We need to disable THP for such a kernel to boot, because THP checks for
> PMD_ORDER < MAX_ORDER. I was able to boot that kernel on a virtualized
> platform, but gigantic_page_runtime_supported is not supported on such a
> config with hash translation.

I'm currently playing with the idea of the following (uncompiled, untested):

From 68e0a158a5060bc1a175d12b20e21794763a33df Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Fri, 11 Feb 2022 11:40:27 +0100
Subject: [PATCH] mm: enforce pageblock_order < MAX_ORDER

Some places in the kernel don't really expect pageblock_order >=
MAX_ORDER, and it looks like this is only possible in corner cases:

1) CONFIG_DEFERRED_STRUCT_PAGE_INIT we'll end up freeing pageblock_order
   pages via __free_pages_core(), which cannot possibly work.

2) mm/page_reporting.c won't be reporting any pages with default
   page_reporting_order == pageblock_order, as we'll be skipping the
   reporting loop inside page_reporting_process_zone().

3) __rmqueue_fallback() will never be able to steal with
   ALLOC_NOFRAGMENT.

4) find_zone_movable_pfns_for_nodes() will roundup the ZONE_MOVABLE
   start PFN to MAX_ORDER_NR_PAGES. Consequently with a bigger
   pageblock_order, we could have pageblocks partially managed by two
   zones.

pageblock_order >= MAX_ORDER is weird either way: it's a pure
optimization for making alloc_contig_range(), as used for allocation of
gigantic pages, a little more reliable to succeed. However, if there is
demand for somewhat reliable allocation of gigantic pages, affected
setups should be using CMA or boottime allocations instead.

So let's make sure that pageblock_order < MAX_ORDER and simplify.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/include/asm/fadump-internal.h |  4 +--
 drivers/of/of_reserved_mem.c               |  8 ++---
 include/linux/pageblock-flags.h            |  7 +++--
 kernel/dma/contiguous.c                    |  2 +-
 mm/Kconfig                                 |  3 ++
 mm/cma.c                                   |  6 ++--
 mm/page_alloc.c                            | 34 ++++++----------------
 mm/page_isolation.c                        |  6 ++--
 8 files changed, 26 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
index 52189928ec08..959c7df15baa 100644
--- a/arch/powerpc/include/asm/fadump-internal.h
+++ b/arch/powerpc/include/asm/fadump-internal.h
@@ -20,9 +20,7 @@
 #define memblock_num_regions(memblock_type)	(memblock.memblock_type.cnt)
 
 /* Alignment per CMA requirement. */
-#define FADUMP_CMA_ALIGNMENT	(PAGE_SIZE <<				\
-				 max_t(unsigned long, MAX_ORDER - 1,	\
-				 pageblock_order))
+#define FADUMP_CMA_ALIGNMENT	(PAGE_SIZE * MAX_ORDER_NR_PAGES)
 
 /* FAD commands */
 #define FADUMP_REGISTER		1
diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 9c0fb962c22b..dcbbffca0c57 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -116,12 +116,8 @@ static int __init __reserved_mem_alloc_size(unsigned long node,
 	if (IS_ENABLED(CONFIG_CMA)
 	    && of_flat_dt_is_compatible(node, "shared-dma-pool")
 	    && of_get_flat_dt_prop(node, "reusable", NULL)
-	    && !nomap) {
-		unsigned long order =
-			max_t(unsigned long, MAX_ORDER - 1, pageblock_order);
-
-		align = max(align, (phys_addr_t)PAGE_SIZE << order);
-	}
+	    && !nomap)
+		align = max_t(phys_addr_t, align, PAGE_SIZE * MAX_ORDER_NR_PAGES);
 
 	prop = of_get_flat_dt_prop(node, "alloc-ranges", &len);
 	if (prop) {
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 973fd731a520..83c7248053a1 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -37,8 +37,11 @@ extern unsigned int pageblock_order;
 
 #else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
-/* Huge pages are a constant size */
-#define pageblock_order		HUGETLB_PAGE_ORDER
+/*
+ * Huge pages are a constant size, but don't exceed the maximum allocation
+ * granularity.
+ */
+#define pageblock_order		min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER - 1)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 3d63d91cba5c..4333c05c14fc 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -399,7 +399,7 @@ static const struct reserved_mem_ops rmem_cma_ops = {
 
 static int __init rmem_cma_setup(struct reserved_mem *rmem)
 {
-	phys_addr_t align = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
+	phys_addr_t align = PAGE_SIZE * MAX_ORDER_NR_PAGES;
 	phys_addr_t mask = align - 1;
 	unsigned long node = rmem->fdt_node;
 	bool default_cma = of_get_flat_dt_prop(node, "linux,cma-default", NULL);
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..4c91b92e7537 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -262,6 +262,9 @@ config HUGETLB_PAGE_SIZE_VARIABLE
 	  HUGETLB_PAGE_ORDER when there are multiple HugeTLB page sizes available
 	  on a platform.
 
+	  Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be
+	  clamped down to MAX_ORDER - 1.
+
 config CONTIG_ALLOC
 	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
diff --git a/mm/cma.c b/mm/cma.c
index bc9ca8f3c487..418e214685da 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -180,8 +180,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
 		return -EINVAL;
 
 	/* ensure minimal alignment required by mm core */
-	alignment = PAGE_SIZE <<
-			max_t(unsigned long, MAX_ORDER - 1, pageblock_order);
+	alignment = PAGE_SIZE * MAX_ORDER_NR_PAGES;
 
 	/* alignment should be aligned with order_per_bit */
 	if (!IS_ALIGNED(alignment >> PAGE_SHIFT, 1 << order_per_bit))
@@ -268,8 +267,7 @@ int __init cma_declare_contiguous_nid(phys_addr_t base,
 	 * migratetype page by page allocator's buddy algorithm. In the case,
 	 * you couldn't get a contiguous memory, which is not what we want.
 	 */
-	alignment = max(alignment, (phys_addr_t)PAGE_SIZE <<
-			  max_t(unsigned long, MAX_ORDER - 1, pageblock_order));
+	alignment = max_t(phys_addr_t, alignment, PAGE_SIZE * MAX_ORDER_NR_PAGES);
 	if (fixed && base & (alignment - 1)) {
 		ret = -EINVAL;
 		pr_err("Region at %pa must be aligned to %pa bytes\n",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..36d9fc308a26 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1072,14 +1072,12 @@ static inline void __free_one_page(struct page *page,
 		int migratetype, fpi_t fpi_flags)
 {
 	struct capture_control *capc = task_capc(zone);
+	unsigned int max_order = pageblock_order;
 	unsigned long buddy_pfn;
 	unsigned long combined_pfn;
-	unsigned int max_order;
 	struct page *buddy;
 	bool to_tail;
 
-	max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
-
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
 
@@ -2260,19 +2258,8 @@ void __init init_cma_reserved_pageblock(struct page *page)
 	} while (++p, --i);
 
 	set_pageblock_migratetype(page, MIGRATE_CMA);
-
-	if (pageblock_order >= MAX_ORDER) {
-		i = pageblock_nr_pages;
-		p = page;
-		do {
-			set_page_refcounted(p);
-			__free_pages(p, MAX_ORDER - 1);
-			p += MAX_ORDER_NR_PAGES;
-		} while (i -= MAX_ORDER_NR_PAGES);
-	} else {
-		set_page_refcounted(page);
-		__free_pages(page, pageblock_order);
-	}
+	set_page_refcounted(page);
+	__free_pages(page, pageblock_order);
 
 	adjust_managed_page_count(page, pageblock_nr_pages);
 	page_zone(page)->cma_pages += pageblock_nr_pages;
@@ -7389,16 +7376,15 @@ static inline void setup_usemap(struct zone *zone) {}
 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
 void __init set_pageblock_order(void)
 {
-	unsigned int order;
+	unsigned int order = MAX_ORDER - 1;
 
 	/* Check that pageblock_nr_pages has not already been setup */
 	if (pageblock_order)
 		return;
 
-	if (HPAGE_SHIFT > PAGE_SHIFT)
+	/* Don't let pageblocks exceed the maximum allocation granularity. */
+	if (HPAGE_SHIFT > PAGE_SHIFT && HUGETLB_PAGE_ORDER < order)
 		order = HUGETLB_PAGE_ORDER;
-	else
-		order = MAX_ORDER - 1;
 
 	/*
 	 * Assume the largest contiguous order of interest is a huge page.
@@ -7593,7 +7579,7 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat)
 	if (!pgdat->node_spanned_pages)
 		return;
 
-	start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1);
+	start = ALIGN_DOWN(pgdat->node_start_pfn, MAX_ORDER_NR_PAGES);
 	offset = pgdat->node_start_pfn - start;
 	/* ia64 gets its own node_mem_map, before this, without bootmem */
 	if (!pgdat->node_mem_map) {
@@ -8986,14 +8972,12 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
 
 #ifdef CONFIG_CONTIG_ALLOC
 static unsigned long pfn_max_align_down(unsigned long pfn)
 {
-	return pfn & ~(max_t(unsigned long, MAX_ORDER_NR_PAGES,
-			     pageblock_nr_pages) - 1);
+	return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
 }
 
 static unsigned long pfn_max_align_up(unsigned long pfn)
 {
-	return ALIGN(pfn, max_t(unsigned long, MAX_ORDER_NR_PAGES,
-				pageblock_nr_pages));
+	return ALIGN(pfn, MAX_ORDER_NR_PAGES);
 }
 
 #if defined(CONFIG_DYNAMIC_DEBUG) || \
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index f67c4c70f17f..e679af6121e3 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -276,9 +276,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
 	int ret;
 
 	/*
-	 * Note: pageblock_nr_pages != MAX_ORDER. Then, chunks of free pages
-	 * are not aligned to pageblock_nr_pages.
-	 * Then we just check migratetype first.
+	 * Note: if pageblock_nr_pages < MAX_ORDER_NR_PAGES, then chunks of free
+	 * pages are not necessarily aligned to pageblock_nr_pages. Check the
+	 * migratetype first.
 	 */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:

> David Hildenbrand <david@redhat.com> writes:
>
>> On 11.02.22 10:16, Aneesh Kumar K V wrote:
>>> On 2/11/22 14:00, David Hildenbrand wrote:
>>>> On 11.02.22 07:52, Aneesh Kumar K.V wrote:
>>>>> commit: d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
....
....
> I could build a kernel with FORCE_MAX_ZONEORDER=8 and pageblock_order = 8.
> We need to disable THP for such a kernel to boot, because THP checks for
> PMD_ORDER < MAX_ORDER. I was able to boot that kernel on a virtualized
> platform, but gigantic_page_runtime_supported is not supported on such a
> config with hash translation.
>
> On a non-virtualized platform I am hitting crashes like the below during boot.
....

Ok, that turned out to be unrelated; I was using a wrong kernel. I can boot a
kernel with pageblock_order > MAX_ORDER and run hugetlb-related tests fine.

I do get the below warning, which you had already called out in your patch.

[    3.952124] WARNING: CPU: 16 PID: 719 at mm/vmstat.c:1103 __fragmentation_index+0x14/0x70
[    3.952136] Modules linked in:
[    3.952141] CPU: 16 PID: 719 Comm: kswapd0 Tainted: G    B   5.17.0-rc3-00044-g69052ffa0e08 #68
[    3.952149] NIP:  c000000000465264 LR: c000000000468544 CTR: 0000000000000000
[    3.952154] REGS: c000000014a4f7e0 TRAP: 0700   Tainted: G    B    (5.17.0-rc3-00044-g69052ffa0e08)
[    3.952161] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44042422  XER: 20000000
[    3.952174] CFAR: c000000000468540 IRQMASK: 0
               GPR00: c000000000468544 c000000014a4fa80 c000000001ea9500 0000000000000008
               GPR04: c000000014a4faa0 00000000001fd700 0000000000004003 00000000001fd92d
               GPR08: c000001fffd1c7a0 0000000000000008 0000000000000008 0000000000000000
               GPR12: 0000000000002200 c000001fffff2880 0000000000000000 c000000013cfd240
               GPR16: c000000011940600 c000001fffd21058 0000000000000d00 c000000001407d30
               GPR20: ffffffffffffffaf c000001fffd21098 0000000000000000 c000000002ab7328
               GPR24: c000000011940600 c000001fffd21300 0000000000000000 0000000000000008
               GPR28: c000001fffd1c280 0000000000000008 0000000000000000 0000000000000004
[    3.952231] NIP [c000000000465264] __fragmentation_index+0x14/0x70
[    3.952237] LR [c000000000468544] fragmentation_index+0xb4/0xe0
[    3.952244] Call Trace:
[    3.952247] [c000000014a4fa80] [c00000000023e248] lock_release+0x138/0x470 (unreliable)
[    3.952256] [c000000014a4fac0] [c00000000047cd84] compaction_suitable+0x94/0x270
[    3.952263] [c000000014a4fb10] [c0000000004802b8] wakeup_kcompactd+0xc8/0x2a0
[    3.952270] [c000000014a4fb60] [c000000000457568] balance_pgdat+0x798/0x8d0
[    3.952277] [c000000014a4fca0] [c000000000457d14] kswapd+0x674/0x7b0
[    3.952283] [c000000014a4fdc0] [c0000000001d7e84] kthread+0x144/0x150
[    3.952290] [c000000014a4fe10] [c00000000000cd74] ret_from_kernel_thread+0x5c/0x64
[    3.952297] Instruction dump:
[    3.952301] 7d2021ad 40c2fff4 e8ed0030 38a00000 7caa39ae 4e800020 60000000 7c0802a6
[    3.952311] 60000000 28030007 7c6a1b78 40810010 <0fe00000> 60000000 60000000 e9040008
[    3.952322] irq event stamp: 0
[    3.952325] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[    3.952331] hardirqs last disabled at (0): [<c000000000196030>] copy_process+0x970/0x1de0
[    3.952339] softirqs last  enabled at (0): [<c000000000196030>] copy_process+0x970/0x1de0
[    3.952345] softirqs last disabled at (0): [<0000000000000000>] 0x0

I am not sure whether there is any value in selecting MAX_ORDER = 8 on ppc64.
If not, we could do a patch like the below for ppc64.

commit 09ed79c4fda92418914546f36c2750670503d7a0
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Fri Feb 11 17:15:10 2022 +0530

    powerpc/mm: Disable MAX_ORDER value 8 on book3s64 with 64K pagesize

    With transparent hugepage support we expect HPAGE_PMD_ORDER < MAX_ORDER.
    Without this we BUG() during boot as below:

    cpu 0x6: Vector: 700 (Program Check) at [c000000012143880]
        pc: c000000001b4ddbc: hugepage_init+0x108/0x2c4
        lr: c000000001b4dd98: hugepage_init+0xe4/0x2c4
        sp: c000000012143b20
       msr: 8000000002029033
      current = 0xc0000000120d0f80
      paca    = 0xc00000001ec7e900   irqmask: 0x03   irq_happened: 0x01
        pid   = 1, comm = swapper/0
    kernel BUG at mm/huge_memory.c:413!

    [c000000012143b20] c0000000022c0468 blacklisted_initcalls+0x120/0x1c8 (unreliable)
    [c000000012143bb0] c000000000012104 do_one_initcall+0x94/0x520
    [c000000012143c90] c000000001b04da0 kernel_init_freeable+0x444/0x508
    [c000000012143da0] c000000000012d8c kernel_init+0x44/0x188
    [c000000012143e10] c00000000000cbf4 ret_from_kernel_thread+0x5c/0x64

    Hence a FORCE_MAX_ZONEORDER value < 9 doesn't make sense with THP enabled.
    We also cannot have a value > 9 because we are limited by SECTION_SIZE_BITS:

    #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
    #error Allocator MAX_ORDER exceeds SECTION_SIZE
    #endif

    We can select a MAX_ORDER value of 8 by disabling THP support, but that
    results in pageblock_order > MAX_ORDER - 1, which is not fully
    tested/supported.

    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b779603978e1..a050f5f46df3 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -807,7 +807,7 @@ config DATA_SHIFT
 
 config FORCE_MAX_ZONEORDER
 	int "Maximum zone order"
-	range 8 9 if PPC64 && PPC_64K_PAGES
+	range 9 9 if PPC64 && PPC_64K_PAGES
 	default "9" if PPC64 && PPC_64K_PAGES
 	range 13 13 if PPC64 && !PPC_64K_PAGES
 	default "13" if PPC64 && !PPC_64K_PAGES
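The `range 9 9` pin above follows from the SECTION_SIZE_BITS constraint quoted
in that commit message. A quick sanity check of the arithmetic, assuming
ppc64's SECTION_SIZE_BITS is 24 (an assumption; see
arch/powerpc/include/asm/sparsemem.h in the relevant tree):

	/* Illustrative only: MAX_ORDER - 1 + PAGE_SHIFT must not exceed SECTION_SIZE_BITS. */
	#include <assert.h>

	int main(void)
	{
		const int section_size_bits = 24;  /* assumed ppc64 value */

		/* 64K pages: PAGE_SHIFT = 16, FORCE_MAX_ZONEORDER = 9 */
		assert(9 - 1 + 16 <= section_size_bits);   /* 24 <= 24: exactly at the limit */
		/* 4K pages: PAGE_SHIFT = 12, FORCE_MAX_ZONEORDER = 13 */
		assert(13 - 1 + 12 <= section_size_bits);  /* 24 <= 24: exactly at the limit */
		return 0;
	}

Both configurations sit exactly at the limit, which is why neither range has
any room to move upward.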
On Fri, 11 Feb 2022 12:22:15 +0530, Aneesh Kumar K.V wrote:
> commit: d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
> introduced pageblock_order which will be used to group pages better.
> The kernel now groups pages based on the value of HPAGE_SHIFT. Hence HPAGE_SHIFT
> should be set before we call set_pageblock_order.
>
> set_pageblock_order happens early in the boot and default hugetlb page size
> should be initialized before that to compute the right pageblock_order value.
>
> [...]

Applied to powerpc/next.

[1/1] powerpc/mm: Update default hugetlb size early
      https://git.kernel.org/powerpc/c/2354ad252b66695be02f4acd18e37bf6264f0464

cheers
diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index 962708fa1017..6a1a1ac5743b 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -15,7 +15,7 @@
 
 extern bool hugetlb_disabled;
 
-void __init hugetlbpage_init_default(void);
+void __init hugetlbpage_init_defaultsize(void);
 
 int slice_is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
 			   unsigned long len);
@@ -76,6 +76,9 @@ static inline void __init gigantic_hugetlb_cma_reserve(void)
 {
 }
 
+static inline void __init hugetlbpage_init_defaultsize(void)
+{
+}
 #endif /* CONFIG_HUGETLB_PAGE */
 
 #endif /* _ASM_POWERPC_HUGETLB_H */
diff --git a/arch/powerpc/mm/book3s64/hugetlbpage.c b/arch/powerpc/mm/book3s64/hugetlbpage.c
index ea8f83afb0ae..3bc0eb21b2a0 100644
--- a/arch/powerpc/mm/book3s64/hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/hugetlbpage.c
@@ -150,7 +150,7 @@ void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr
 	set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
 }
 
-void __init hugetlbpage_init_default(void)
+void __init hugetlbpage_init_defaultsize(void)
 {
 	/* Set default large page size. Currently, we pick 16M or 1M
 	 * depending on what is available
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index ddead41e2194..b642a5a8668f 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -664,10 +664,7 @@ static int __init hugetlbpage_init(void)
 		configured = true;
 	}
 
-	if (configured) {
-		if (IS_ENABLED(CONFIG_HUGETLB_PAGE_SIZE_VARIABLE))
-			hugetlbpage_init_default();
-	} else
+	if (!configured)
 		pr_info("Failed to initialize. Disabling HugeTLB");
 
 	return 0;
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 35f46bf54281..83c0ee9fbf05 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -59,6 +59,7 @@
 #include <asm/sections.h>
 #include <asm/iommu.h>
 #include <asm/vdso.h>
+#include <asm/hugetlb.h>
 
 #include <mm/mmu_decl.h>
 
@@ -513,6 +514,9 @@ void __init mmu_early_init_devtree(void)
 	} else
 		hash__early_init_devtree();
 
+	if (IS_ENABLED(CONFIG_HUGETLB_PAGE_SIZE_VARIABLE))
+		hugetlbpage_init_defaultsize();
+
 	if (!(cur_cpu_spec->mmu_features & MMU_FTR_HPTE_TABLE) &&
 	    !(cur_cpu_spec->mmu_features & MMU_FTR_TYPE_RADIX))
 		panic("kernel does not support any MMU type offered by platform");
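With the call to hugetlbpage_init_defaultsize() moved into
mmu_early_init_devtree(), HPAGE_SHIFT is set before set_pageblock_order()
runs, so the resulting pageblock order can be checked at runtime via
/proc/pagetypeinfo, which reports it in its first two lines. Illustrative
output for a radix system with 64K pages (not captured from this thread; the
expected values follow from the orders discussed above):

	# head -2 /proc/pagetypeinfo
	Page block order: 5
	Pages per block:  32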
commit: d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
introduced pageblock_order which will be used to group pages better.
The kernel now groups pages based on the value of HPAGE_SHIFT. Hence HPAGE_SHIFT
should be set before we call set_pageblock_order.

set_pageblock_order happens early in the boot and the default hugetlb page size
should be initialized before that to compute the right pageblock_order value.

Currently, the default hugetlb page size is set via arch_initcalls which happens
late in the boot, as shown via the below callstack:

[c000000007383b10] [c000000001289328] hugetlbpage_init+0x2b8/0x2f8
[c000000007383bc0] [c0000000012749e4] do_one_initcall+0x14c/0x320
[c000000007383c90] [c00000000127505c] kernel_init_freeable+0x410/0x4e8
[c000000007383da0] [c000000000012664] kernel_init+0x30/0x15c
[c000000007383e10] [c00000000000cf14] ret_from_kernel_thread+0x5c/0x64

and the pageblock_order initialization is done early during the boot:

[c0000000018bfc80] [c0000000012ae120] set_pageblock_order+0x50/0x64
[c0000000018bfca0] [c0000000012b3d94] sparse_init+0x188/0x268
[c0000000018bfd60] [c000000001288bfc] initmem_init+0x28c/0x328
[c0000000018bfe50] [c00000000127b370] setup_arch+0x410/0x480
[c0000000018bfed0] [c00000000127401c] start_kernel+0xb8/0x934
[c0000000018bff90] [c00000000000d984] start_here_common+0x1c/0x98

Delaying default hugetlb page size initialization implies the kernel will
initialize pageblock_order to (MAX_ORDER - 1), which is not an optimal
value for mobility grouping. IIUC we always had this issue. But it was not
a problem for hash translation mode because (MAX_ORDER - 1) is the same as
HUGETLB_PAGE_ORDER (8) in the case of hash (16MB). With radix,
HUGETLB_PAGE_ORDER will be 5 (2M size) and hence pageblock_order should be
5 instead of 8.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/hugetlb.h     | 5 ++++-
 arch/powerpc/mm/book3s64/hugetlbpage.c | 2 +-
 arch/powerpc/mm/hugetlbpage.c          | 5 +----
 arch/powerpc/mm/init_64.c              | 4 ++++
 4 files changed, 10 insertions(+), 6 deletions(-)

Changes from v1:
* update commit message