
[v2] powerpc/mm: Update default hugetlb size early

Message ID 20220211065215.101767-1-aneesh.kumar@linux.ibm.com (mailing list archive)
State New
Series [v2] powerpc/mm: Update default hugetlb size early

Commit Message

Aneesh Kumar K.V Feb. 11, 2022, 6:52 a.m. UTC
Commit d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
introduced pageblock_order, which is used to group pages by mobility. The kernel
groups pages based on the value of HPAGE_SHIFT, so HPAGE_SHIFT must be set
before set_pageblock_order() is called.

set_pageblock_order() runs early in boot, and the default hugetlb page size
must be initialized before that point for the right pageblock_order value to
be computed.

Currently, the default hugetlb page size is set via an arch_initcall, which
happens late in boot, as shown by the callstack below:

[c000000007383b10] [c000000001289328] hugetlbpage_init+0x2b8/0x2f8
[c000000007383bc0] [c0000000012749e4] do_one_initcall+0x14c/0x320
[c000000007383c90] [c00000000127505c] kernel_init_freeable+0x410/0x4e8
[c000000007383da0] [c000000000012664] kernel_init+0x30/0x15c
[c000000007383e10] [c00000000000cf14] ret_from_kernel_thread+0x5c/0x64

whereas pageblock_order initialization is done early during boot:

[c0000000018bfc80] [c0000000012ae120] set_pageblock_order+0x50/0x64
[c0000000018bfca0] [c0000000012b3d94] sparse_init+0x188/0x268
[c0000000018bfd60] [c000000001288bfc] initmem_init+0x28c/0x328
[c0000000018bfe50] [c00000000127b370] setup_arch+0x410/0x480
[c0000000018bfed0] [c00000000127401c] start_kernel+0xb8/0x934
[c0000000018bff90] [c00000000000d984] start_here_common+0x1c/0x98

Delaying default hugetlb page size initialization means the kernel will
initialize pageblock_order to (MAX_ORDER - 1), which is not an optimal value
for mobility grouping. IIUC we always had this issue, but it was not a
problem for hash translation mode because (MAX_ORDER - 1) is the same as
HUGETLB_PAGE_ORDER (8) in the case of hash (16MB). With radix,
HUGETLB_PAGE_ORDER will be 5 (2M size), and hence pageblock_order should be
5 instead of 8.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 arch/powerpc/include/asm/hugetlb.h     | 5 ++++-
 arch/powerpc/mm/book3s64/hugetlbpage.c | 2 +-
 arch/powerpc/mm/hugetlbpage.c          | 5 +----
 arch/powerpc/mm/init_64.c              | 4 ++++
 4 files changed, 10 insertions(+), 6 deletions(-)

Changes from v1:
* update commit message
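
For reference, the 8-vs-5 orders above are plain shift arithmetic; here is a
quick standalone sketch. The shift values (16, 24, 21) are assumptions
inferred from the sizes named in the commit message (64K base pages, 16MB
hash and 2MB radix default huge pages):

#include <stdio.h>

/* HUGETLB_PAGE_ORDER = HPAGE_SHIFT - PAGE_SHIFT */
int main(void)
{
	unsigned int page_shift = 16;        /* 64K base pages (assumed) */
	unsigned int hash_hpage_shift = 24;  /* hash: 16MB default huge page */
	unsigned int radix_hpage_shift = 21; /* radix: 2MB default huge page */

	/* hash: 24 - 16 = 8, which equals MAX_ORDER - 1 on this config */
	printf("hash  HUGETLB_PAGE_ORDER = %u\n", hash_hpage_shift - page_shift);
	/* radix: 21 - 16 = 5, so pageblock_order should be 5, not the fallback 8 */
	printf("radix HUGETLB_PAGE_ORDER = %u\n", radix_hpage_shift - page_shift);
	return 0;
}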

Comments

David Hildenbrand Feb. 11, 2022, 8:30 a.m. UTC | #1
On 11.02.22 07:52, Aneesh Kumar K.V wrote:
> [...]


A related question: Can we on ppc still have pageblock_order > MAX_ORDER
- 1? We have some code for that and I am not so sure if we really need that.
Aneesh Kumar K.V Feb. 11, 2022, 9:16 a.m. UTC | #2
On 2/11/22 14:00, David Hildenbrand wrote:
> On 11.02.22 07:52, Aneesh Kumar K.V wrote:
>> [...]
> 
> 
> A related question: Can we on ppc still have pageblock_order > MAX_ORDER
> - 1? We have some code for that and I am not so sure if we really need that.
> 

I have also been wondering about the same. On book3s64 I don't think we
need that support for either 64K or 4K page size, because with hash the
hugetlb page order is MAX_ORDER - 1 (16MB hugepage size).

I am not sure about the 256K page support. Christophe may be able to
answer that.

For gigantic hugepage support we depend on CMA-based allocation or firmware
reservation, so I am not sure why we ever considered the pageblock_order >
MAX_ORDER - 1 scenario. If you have pointers w.r.t. why that was ever
needed, I could double-check whether ppc64 still depends on that.

-aneesh
David Hildenbrand Feb. 11, 2022, 10:05 a.m. UTC | #3
On 11.02.22 10:16, Aneesh Kumar K V wrote:
> On 2/11/22 14:00, David Hildenbrand wrote:
>> On 11.02.22 07:52, Aneesh Kumar K.V wrote:
>>> [...]
>>
>>
>> A related question: Can we on ppc still have pageblock_order > MAX_ORDER
>> - 1? We have some code for that and I am not so sure if we really need that.
>>
> 
> I have also been wondering about the same. On book3s64 I don't think we
> need that support for either 64K or 4K page size, because with hash the
> hugetlb page order is MAX_ORDER - 1 (16MB hugepage size).
> 
> I am not sure about the 256K page support. Christophe may be able to
> answer that.
> 
> For gigantic hugepage support we depend on CMA-based allocation or firmware
> reservation, so I am not sure why we ever considered the pageblock_order >
> MAX_ORDER - 1 scenario. If you have pointers w.r.t. why that was ever
> needed, I could double-check whether ppc64 still depends on that.

commit dc78327c0ea7da5186d8cbc1647bd6088c5c9fa5
Author: Michal Nazarewicz <mina86@mina86.com>
Date:   Wed Jul 2 15:22:35 2014 -0700

    mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER

indicates that at least arm64 used to have cases for that as well.

However, nowadays with ARM64_64K_PAGES we have FORCE_MAX_ZONEORDER=14 as
default, corresponding to 512MiB.

So I'm not sure if this is something worth supporting. If you want
somewhat reliable gigantic pages, use CMA or preallocate them during boot.
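
For example, boot-time reservation can be done on the kernel command line
(illustrative values; the parameters are documented in
Documentation/admin-guide/kernel-parameters.txt):

    # preallocate two 1G gigantic pages at boot
    default_hugepagesz=1G hugepagesz=1G hugepages=2

    # or reserve a CMA area that runtime gigantic page allocation can draw from
    hugetlb_cma=4G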
Aneesh Kumar K.V Feb. 11, 2022, 12:23 p.m. UTC | #4
David Hildenbrand <david@redhat.com> writes:

> On 11.02.22 10:16, Aneesh Kumar K V wrote:
>> [...]
>
> commit dc78327c0ea7da5186d8cbc1647bd6088c5c9fa5
> Author: Michal Nazarewicz <mina86@mina86.com>
> Date:   Wed Jul 2 15:22:35 2014 -0700
>
>     mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER
>
> indicates that at least arm64 used to have cases for that as well.
>
> However, nowadays with ARM64_64K_PAGES we have FORCE_MAX_ZONEORDER=14 as
> default, corresponding to 512MiB.
>
> So I'm not sure if this is something worth supporting. If you want
> somewhat reliable gigantic pages, use CMA or preallocate them during boot.
>

I could build a kernel with FORCE_MAX_ZONEORDER=8 and pageblock_order = 8.
We need to disable THP for such a kernel to boot, because THP checks that
PMD_ORDER < MAX_ORDER. I was able to boot that kernel on a virtualized
platform, but gigantic_page_runtime_supported() is then not supported in
such a config with hash translation.

On a non-virtualized platform I am hitting crashes like the one below during boot.

[   47.637865][   C42] =============================================================================                                                                                                                                                                                                              
[   47.637907][   C42] BUG pgtable-2^11 (Not tainted): Object already free                                                                                     
[   47.637925][   C42] -----------------------------------------------------------------------------                                                           
[   47.637925][   C42]                                                                                                                                         
[   47.637945][   C42] Allocated in __pud_alloc+0x84/0x2a0 age=278 cpu=40 pid=1409                                                                             
[   47.637974][   C42]  __slab_alloc.isra.0+0x40/0x60                                                                                                          
[   47.637995][   C42]  kmem_cache_alloc+0x1a8/0x510                                                                                                           
[   47.638010][   C42]  __pud_alloc+0x84/0x2a0                                                                                                                 
[   47.638024][   C42]  copy_page_range+0x38c/0x1b90                                                                                                           
[   47.638040][   C42]  dup_mm+0x548/0x880                                                                                                                     
[   47.638058][   C42]  copy_process+0xdc0/0x1e90                                                                                                              
[   47.638076][   C42]  kernel_clone+0xd4/0x9d0                                                                                                                
[   47.638094][   C42]  __do_sys_clone+0x88/0xe0                                                                                                               
[   47.638112][   C42]  system_call_exception+0x368/0x3a0                                                                                                      
[   47.638128][   C42]  system_call_common+0xec/0x250                                                                                                          
[   47.638147][   C42] Freed in __tlb_remove_table+0x1d4/0x200 age=263 cpu=57 pid=326                                                                          
[   47.638172][   C42]  kmem_cache_free+0x44c/0x680                                                                                                            
[   47.638187][   C42]  __tlb_remove_table+0x1d4/0x200                                                                                                         
[   47.638204][   C42]  tlb_remove_table_rcu+0x54/0xa0                                                                                                         
[   47.638222][   C42]  rcu_core+0xdd4/0x15d0                                                                                                                  
[   47.638239][   C42]  __do_softirq+0x360/0x69c                                                                                                               
[   47.638257][   C42]  run_ksoftirqd+0x54/0xc0                                                                                                                
[   47.638273][   C42]  smpboot_thread_fn+0x28c/0x2f0                                                                                                          
[   47.638290][   C42]  kthread+0x1a4/0x1b0                                                                                                                    
[   47.638305][   C42]  ret_from_kernel_thread+0x5c/0x64                                                                                                       
[   47.638320][   C42] Slab 0xc00c00000000d600 objects=10 used=9 fp=0xc0000000035a8000 flags=0x7ffff000010201(locked|slab|head|node=0|zone=0|lastcpupid=0x7ffff)                                                                                                                                                              
[   47.638352][   C42] Object 0xc0000000035a8000 @offset=163840 fp=0x0000000000000000                                                                          
[   47.638352][   C42]                                                                                                                                         
[   47.638373][   C42] Redzone  c0000000035a4000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................                                            
[   47.638394][   C42] Redzone  c0000000035a4010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................                                            
[   47.638414][   C42] Redzone  c0000000035a4020: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................                                            
[   47.638435][   C42] Redzone  c0000000035a4030: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................                                            
[   47.638455][   C42] Redzone  c0000000035a4040: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................                                            
[   47.638474][   C42] Redzone  c0000000035a4050: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................                                            
[   47.638494][   C42] Redzone  c0000000035a4060: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................                                            
[   47.638514][   C42] Redzone  c0000000035a4070: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................                                            
[   47.638534][   C42] Redzone  c0000000035a4080: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
David Hildenbrand Feb. 11, 2022, 12:29 p.m. UTC | #5
On 11.02.22 13:23, Aneesh Kumar K.V wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> [...]
> 
> I could build a kernel with FORCE_MAX_ZONEORDER=8 and pageblock_order = 8.
> We need to disable THP for such a kernel to boot, because THP checks that
> PMD_ORDER < MAX_ORDER. I was able to boot that kernel on a virtualized
> platform, but gigantic_page_runtime_supported() is then not supported in
> such a config with hash translation.
> 

I'm currently playing with the idea of the following (uncompiled, untested):

From 68e0a158a5060bc1a175d12b20e21794763a33df Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Fri, 11 Feb 2022 11:40:27 +0100
Subject: [PATCH] mm: enforce pageblock_order < MAX_ORDER

Some places in the kernel don't really expect pageblock_order >=
MAX_ORDER, and it looks like this is only possible in corner cases:

1) With CONFIG_DEFERRED_STRUCT_PAGE_INIT we'll end up freeing pageblock_order
   pages via __free_pages_core(), which cannot possibly work.

2) mm/page_reporting.c won't be reporting any pages with default
   page_reporting_order == pageblock_order, as we'll be skipping the
   reporting loop inside page_reporting_process_zone().

3) __rmqueue_fallback() will never be able to steal with
   ALLOC_NOFRAGMENT.

4) find_zone_movable_pfns_for_nodes() will round up the ZONE_MOVABLE
   start PFN to MAX_ORDER_NR_PAGES. Consequently with a bigger
   pageblock_order, we could have pageblocks partially managed by two
   zones.

pageblock_order >= MAX_ORDER is weird either way: it's a pure
optimization for making alloc_contig_range(), as used for allocation of
gigantic pages, a little more reliable to succeed. However, if there is
demand for somewhat reliable allocation of gigantic pages, affected setups
should be using CMA or boottime allocations instead.

So let's make sure that pageblock_order < MAX_ORDER and simplify.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/include/asm/fadump-internal.h |  4 +--
 drivers/of/of_reserved_mem.c               |  8 ++---
 include/linux/pageblock-flags.h            |  7 +++--
 kernel/dma/contiguous.c                    |  2 +-
 mm/Kconfig                                 |  3 ++
 mm/cma.c                                   |  6 ++--
 mm/page_alloc.c                            | 34 ++++++----------------
 mm/page_isolation.c                        |  6 ++--
 8 files changed, 26 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump-internal.h b/arch/powerpc/include/asm/fadump-internal.h
index 52189928ec08..959c7df15baa 100644
--- a/arch/powerpc/include/asm/fadump-internal.h
+++ b/arch/powerpc/include/asm/fadump-internal.h
@@ -20,9 +20,7 @@
 #define memblock_num_regions(memblock_type)	(memblock.memblock_type.cnt)
 
 /* Alignment per CMA requirement. */
-#define FADUMP_CMA_ALIGNMENT	(PAGE_SIZE <<				\
-				 max_t(unsigned long, MAX_ORDER - 1,	\
-				 pageblock_order))
+#define FADUMP_CMA_ALIGNMENT	(PAGE_SIZE * MAX_ORDER_NR_PAGES)
 
 /* FAD commands */
 #define FADUMP_REGISTER			1
diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 9c0fb962c22b..dcbbffca0c57 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -116,12 +116,8 @@ static int __init __reserved_mem_alloc_size(unsigned long node,
 	if (IS_ENABLED(CONFIG_CMA)
 	    && of_flat_dt_is_compatible(node, "shared-dma-pool")
 	    && of_get_flat_dt_prop(node, "reusable", NULL)
-	    && !nomap) {
-		unsigned long order =
-			max_t(unsigned long, MAX_ORDER - 1, pageblock_order);
-
-		align = max(align, (phys_addr_t)PAGE_SIZE << order);
-	}
+	    && !nomap)
+		align = max_t(phys_addr_t, align, PAGE_SIZE * MAX_ORDER_NR_PAGES);
 
 	prop = of_get_flat_dt_prop(node, "alloc-ranges", &len);
 	if (prop) {
diff --git a/include/linux/pageblock-flags.h b/include/linux/pageblock-flags.h
index 973fd731a520..83c7248053a1 100644
--- a/include/linux/pageblock-flags.h
+++ b/include/linux/pageblock-flags.h
@@ -37,8 +37,11 @@ extern unsigned int pageblock_order;
 
 #else /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
-/* Huge pages are a constant size */
-#define pageblock_order		HUGETLB_PAGE_ORDER
+/*
+ * Huge pages are a constant size, but don't exceed the maximum allocation
+ * granularity.
+ */
+#define pageblock_order		min_t(unsigned int, HUGETLB_PAGE_ORDER, MAX_ORDER - 1)
 
 #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
 
diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 3d63d91cba5c..4333c05c14fc 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -399,7 +399,7 @@ static const struct reserved_mem_ops rmem_cma_ops = {
 
 static int __init rmem_cma_setup(struct reserved_mem *rmem)
 {
-	phys_addr_t align = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
+	phys_addr_t align = PAGE_SIZE * MAX_ORDER_NR_PAGES;
 	phys_addr_t mask = align - 1;
 	unsigned long node = rmem->fdt_node;
 	bool default_cma = of_get_flat_dt_prop(node, "linux,cma-default", NULL);
diff --git a/mm/Kconfig b/mm/Kconfig
index 3326ee3903f3..4c91b92e7537 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -262,6 +262,9 @@ config HUGETLB_PAGE_SIZE_VARIABLE
 	  HUGETLB_PAGE_ORDER when there are multiple HugeTLB page sizes available
 	  on a platform.
 
+	  Note that the pageblock_order cannot exceed MAX_ORDER - 1 and will be
+	  clamped down to MAX_ORDER - 1.
+
 config CONTIG_ALLOC
 	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
diff --git a/mm/cma.c b/mm/cma.c
index bc9ca8f3c487..418e214685da 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -180,8 +180,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
 		return -EINVAL;
 
 	/* ensure minimal alignment required by mm core */
-	alignment = PAGE_SIZE <<
-			max_t(unsigned long, MAX_ORDER - 1, pageblock_order);
+	alignment = PAGE_SIZE * MAX_ORDER_NR_PAGES;
 
 	/* alignment should be aligned with order_per_bit */
 	if (!IS_ALIGNED(alignment >> PAGE_SHIFT, 1 << order_per_bit))
@@ -268,8 +267,7 @@ int __init cma_declare_contiguous_nid(phys_addr_t base,
 	 * migratetype page by page allocator's buddy algorithm. In the case,
 	 * you couldn't get a contiguous memory, which is not what we want.
 	 */
-	alignment = max(alignment,  (phys_addr_t)PAGE_SIZE <<
-			  max_t(unsigned long, MAX_ORDER - 1, pageblock_order));
+	alignment = max_t(phys_addr_t, alignment, PAGE_SIZE * MAX_ORDER_NR_PAGES);
 	if (fixed && base & (alignment - 1)) {
 		ret = -EINVAL;
 		pr_err("Region at %pa must be aligned to %pa bytes\n",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..36d9fc308a26 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1072,14 +1072,12 @@ static inline void __free_one_page(struct page *page,
 		int migratetype, fpi_t fpi_flags)
 {
 	struct capture_control *capc = task_capc(zone);
+	unsigned int max_order = pageblock_order;
 	unsigned long buddy_pfn;
 	unsigned long combined_pfn;
-	unsigned int max_order;
 	struct page *buddy;
 	bool to_tail;
 
-	max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
-
 	VM_BUG_ON(!zone_is_initialized(zone));
 	VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
 
@@ -2260,19 +2258,8 @@ void __init init_cma_reserved_pageblock(struct page *page)
 	} while (++p, --i);
 
 	set_pageblock_migratetype(page, MIGRATE_CMA);
-
-	if (pageblock_order >= MAX_ORDER) {
-		i = pageblock_nr_pages;
-		p = page;
-		do {
-			set_page_refcounted(p);
-			__free_pages(p, MAX_ORDER - 1);
-			p += MAX_ORDER_NR_PAGES;
-		} while (i -= MAX_ORDER_NR_PAGES);
-	} else {
-		set_page_refcounted(page);
-		__free_pages(page, pageblock_order);
-	}
+	set_page_refcounted(page);
+	__free_pages(page, pageblock_order);
 
 	adjust_managed_page_count(page, pageblock_nr_pages);
 	page_zone(page)->cma_pages += pageblock_nr_pages;
@@ -7389,16 +7376,15 @@ static inline void setup_usemap(struct zone *zone) {}
 /* Initialise the number of pages represented by NR_PAGEBLOCK_BITS */
 void __init set_pageblock_order(void)
 {
-	unsigned int order;
+	unsigned int order = MAX_ORDER - 1;
 
 	/* Check that pageblock_nr_pages has not already been setup */
 	if (pageblock_order)
 		return;
 
-	if (HPAGE_SHIFT > PAGE_SHIFT)
+	/* Don't let pageblocks exceed the maximum allocation granularity. */
+	if (HPAGE_SHIFT > PAGE_SHIFT && HUGETLB_PAGE_ORDER < order)
 		order = HUGETLB_PAGE_ORDER;
-	else
-		order = MAX_ORDER - 1;
 
 	/*
 	 * Assume the largest contiguous order of interest is a huge page.
@@ -7593,7 +7579,7 @@ static void __init alloc_node_mem_map(struct pglist_data *pgdat)
 	if (!pgdat->node_spanned_pages)
 		return;
 
-	start = pgdat->node_start_pfn & ~(MAX_ORDER_NR_PAGES - 1);
+	start = ALIGN_DOWN(pgdat->node_start_pfn, MAX_ORDER_NR_PAGES);
 	offset = pgdat->node_start_pfn - start;
 	/* ia64 gets its own node_mem_map, before this, without bootmem */
 	if (!pgdat->node_mem_map) {
@@ -8986,14 +8972,12 @@ struct page *has_unmovable_pages(struct zone *zone, struct page *page,
 #ifdef CONFIG_CONTIG_ALLOC
 static unsigned long pfn_max_align_down(unsigned long pfn)
 {
-	return pfn & ~(max_t(unsigned long, MAX_ORDER_NR_PAGES,
-			     pageblock_nr_pages) - 1);
+	return ALIGN_DOWN(pfn, MAX_ORDER_NR_PAGES);
 }
 
 static unsigned long pfn_max_align_up(unsigned long pfn)
 {
-	return ALIGN(pfn, max_t(unsigned long, MAX_ORDER_NR_PAGES,
-				pageblock_nr_pages));
+	return ALIGN(pfn, MAX_ORDER_NR_PAGES);
 }
 
 #if defined(CONFIG_DYNAMIC_DEBUG) || \
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index f67c4c70f17f..e679af6121e3 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -276,9 +276,9 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
 	int ret;
 
 	/*
-	 * Note: pageblock_nr_pages != MAX_ORDER. Then, chunks of free pages
-	 * are not aligned to pageblock_nr_pages.
-	 * Then we just check migratetype first.
+	 * Note: if pageblock_nr_pages < MAX_ORDER_NR_PAGES, then chunks of free
+	 * pages are not necessarily aligned to pageblock_nr_pages. Check the
+	 * migratetype first.
 	 */
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
Aneesh Kumar K.V Feb. 11, 2022, 2:40 p.m. UTC | #6
Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> writes:

> David Hildenbrand <david@redhat.com> writes:
>
>> On 11.02.22 10:16, Aneesh Kumar K V wrote:
>>> On 2/11/22 14:00, David Hildenbrand wrote:
>>>> On 11.02.22 07:52, Aneesh Kumar K.V wrote:
>>>>> commit: d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
....
....

> I could build a kernel with FORCE_MAX_ZONEORDER=8 and pageblock_order = 8.
> We need to disable THP for such a kernel to boot, because THP checks that
> PMD_ORDER < MAX_ORDER. I was able to boot that kernel on a virtualized
> platform, but gigantic_page_runtime_supported() is then not supported in
> such a config with hash translation.
>
> On a non-virtualized platform I am hitting crashes like the one below during boot.
>
> [...]

OK, that turned out to be unrelated; I was using the wrong kernel. I can
boot a kernel with pageblock_order > MAX_ORDER and run hugetlb-related
tests fine. I do get the warning below, which you had already called out
in your patch.

[    3.952124] WARNING: CPU: 16 PID: 719 at mm/vmstat.c:1103 __fragmentation_index+0x14/0x70                                                                   
[    3.952136] Modules linked in:                                                                                                                              
[    3.952141] CPU: 16 PID: 719 Comm: kswapd0 Tainted: G    B             5.17.0-rc3-00044-g69052ffa0e08 #68                                                   
[    3.952149] NIP:  c000000000465264 LR: c000000000468544 CTR: 0000000000000000                                                                               
[    3.952154] REGS: c000000014a4f7e0 TRAP: 0700   Tainted: G    B              (5.17.0-rc3-00044-g69052ffa0e08)
[    3.952161] MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44042422  XER: 20000000
[    3.952174] CFAR: c000000000468540 IRQMASK: 0                  
               GPR00: c000000000468544 c000000014a4fa80 c000000001ea9500 0000000000000008 
               GPR04: c000000014a4faa0 00000000001fd700 0000000000004003 00000000001fd92d 
               GPR08: c000001fffd1c7a0 0000000000000008 0000000000000008 0000000000000000 
               GPR12: 0000000000002200 c000001fffff2880 0000000000000000 c000000013cfd240                                                                      
               GPR16: c000000011940600 c000001fffd21058 0000000000000d00 c000000001407d30                                                                      
               GPR20: ffffffffffffffaf c000001fffd21098 0000000000000000 c000000002ab7328                                                                      
               GPR24: c000000011940600 c000001fffd21300 0000000000000000 0000000000000008 
               GPR28: c000001fffd1c280 0000000000000008 0000000000000000 0000000000000004                                                                      
[    3.952231] NIP [c000000000465264] __fragmentation_index+0x14/0x70                                                                                          
[    3.952237] LR [c000000000468544] fragmentation_index+0xb4/0xe0                                                                                             
[    3.952244] Call Trace:                                        
[    3.952247] [c000000014a4fa80] [c00000000023e248] lock_release+0x138/0x470 (unreliable)
[    3.952256] [c000000014a4fac0] [c00000000047cd84] compaction_suitable+0x94/0x270
[    3.952263] [c000000014a4fb10] [c0000000004802b8] wakeup_kcompactd+0xc8/0x2a0
[    3.952270] [c000000014a4fb60] [c000000000457568] balance_pgdat+0x798/0x8d0
[    3.952277] [c000000014a4fca0] [c000000000457d14] kswapd+0x674/0x7b0                                                                                        
[    3.952283] [c000000014a4fdc0] [c0000000001d7e84] kthread+0x144/0x150                                                                                       
[    3.952290] [c000000014a4fe10] [c00000000000cd74] ret_from_kernel_thread+0x5c/0x64
[    3.952297] Instruction dump:                                      
[    3.952301] 7d2021ad 40c2fff4 e8ed0030 38a00000 7caa39ae 4e800020 60000000 7c0802a6 
[    3.952311] 60000000 28030007 7c6a1b78 40810010 <0fe00000> 60000000 60000000 e9040008 
[    3.952322] irq event stamp: 0                                        
[    3.952325] hardirqs last  enabled at (0): [<0000000000000000>] 0x0                                                                                         
[    3.952331] hardirqs last disabled at (0): [<c000000000196030>] copy_process+0x970/0x1de0                                                                   
[    3.952339] softirqs last  enabled at (0): [<c000000000196030>] copy_process+0x970/0x1de0                                                                   
[    3.952345] softirqs last disabled at (0): [<0000000000000000>] 0x0                                                                                         

I am not sure whether there is any value in selecting MAX_ORDER = 8 on
ppc64. If not, we could do a patch like the below for ppc64.

commit 09ed79c4fda92418914546f36c2750670503d7a0
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Fri Feb 11 17:15:10 2022 +0530

    powerpc/mm: Disable MAX_ORDER value 8 on book3s64 with 64K pagesize
    
    With transparent hugepage support we expect HPAGE_PMD_ORDER < MAX_ORDER.
    Without this we BUG() during boot as below
    
    cpu 0x6: Vector: 700 (Program Check) at [c000000012143880]
        pc: c000000001b4ddbc: hugepage_init+0x108/0x2c4
        lr: c000000001b4dd98: hugepage_init+0xe4/0x2c4
        sp: c000000012143b20
       msr: 8000000002029033
      current = 0xc0000000120d0f80
      paca    = 0xc00000001ec7e900   irqmask: 0x03   irq_happened: 0x01
        pid   = 1, comm = swapper/0
    kernel BUG at mm/huge_memory.c:413!
    [c000000012143b20] c0000000022c0468 blacklisted_initcalls+0x120/0x1c8 (unreliable)
    [c000000012143bb0] c000000000012104 do_one_initcall+0x94/0x520
    [c000000012143c90] c000000001b04da0 kernel_init_freeable+0x444/0x508
    [c000000012143da0] c000000000012d8c kernel_init+0x44/0x188
    [c000000012143e10] c00000000000cbf4 ret_from_kernel_thread+0x5c/0x64
    
    Hence a FORCE_MAX_ZONEORDER value < 9 doesn't make sense with THP
    enabled. We also cannot have a value > 9 because we are limited by
    SECTION_SIZE_BITS:
    
     #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS
     #error Allocator MAX_ORDER exceeds SECTION_SIZE
     #endif
    
    We can select MAX_ORDER value 8 by disabling THP support, but then that
    results in pageblock_order > MAX_ORDER - 1, which is not fully tested/supported.
    
    Cc: David Hildenbrand <david@redhat.com>
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b779603978e1..a050f5f46df3 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -807,7 +807,7 @@ config DATA_SHIFT
 
 config FORCE_MAX_ZONEORDER
 	int "Maximum zone order"
-	range 8 9 if PPC64 && PPC_64K_PAGES
+	range 9 9 if PPC64 && PPC_64K_PAGES
 	default "9" if PPC64 && PPC_64K_PAGES
 	range 13 13 if PPC64 && !PPC_64K_PAGES
 	default "13" if PPC64 && !PPC_64K_PAGES
Michael Ellerman Feb. 16, 2022, 12:25 p.m. UTC | #7
On Fri, 11 Feb 2022 12:22:15 +0530, Aneesh Kumar K.V wrote:
> Commit d9c234005227 ("Do not depend on MAX_ORDER when grouping pages by mobility")
> introduced pageblock_order, which is used to group pages by mobility. The kernel
> groups pages based on the value of HPAGE_SHIFT, so HPAGE_SHIFT must be set
> before set_pageblock_order() is called.
> 
> set_pageblock_order() runs early in boot, and the default hugetlb page size
> must be initialized before that point for the right pageblock_order value to
> be computed.
> 
> [...]

Applied to powerpc/next.

[1/1] powerpc/mm: Update default hugetlb size early
      https://git.kernel.org/powerpc/c/2354ad252b66695be02f4acd18e37bf6264f0464

cheers

Patch

diff --git a/arch/powerpc/include/asm/hugetlb.h b/arch/powerpc/include/asm/hugetlb.h
index 962708fa1017..6a1a1ac5743b 100644
--- a/arch/powerpc/include/asm/hugetlb.h
+++ b/arch/powerpc/include/asm/hugetlb.h
@@ -15,7 +15,7 @@ 
 
 extern bool hugetlb_disabled;
 
-void __init hugetlbpage_init_default(void);
+void __init hugetlbpage_init_defaultsize(void);
 
 int slice_is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
 			   unsigned long len);
@@ -76,6 +76,9 @@  static inline void __init gigantic_hugetlb_cma_reserve(void)
 {
 }
 
+static inline void __init hugetlbpage_init_defaultsize(void)
+{
+}
 #endif /* CONFIG_HUGETLB_PAGE */
 
 #endif /* _ASM_POWERPC_HUGETLB_H */
diff --git a/arch/powerpc/mm/book3s64/hugetlbpage.c b/arch/powerpc/mm/book3s64/hugetlbpage.c
index ea8f83afb0ae..3bc0eb21b2a0 100644
--- a/arch/powerpc/mm/book3s64/hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/hugetlbpage.c
@@ -150,7 +150,7 @@  void huge_ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr
 	set_huge_pte_at(vma->vm_mm, addr, ptep, pte);
 }
 
-void __init hugetlbpage_init_default(void)
+void __init hugetlbpage_init_defaultsize(void)
 {
 	/* Set default large page size. Currently, we pick 16M or 1M
 	 * depending on what is available
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index ddead41e2194..b642a5a8668f 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -664,10 +664,7 @@  static int __init hugetlbpage_init(void)
 		configured = true;
 	}
 
-	if (configured) {
-		if (IS_ENABLED(CONFIG_HUGETLB_PAGE_SIZE_VARIABLE))
-			hugetlbpage_init_default();
-	} else
+	if (!configured)
 		pr_info("Failed to initialize. Disabling HugeTLB");
 
 	return 0;
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 35f46bf54281..83c0ee9fbf05 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -59,6 +59,7 @@ 
 #include <asm/sections.h>
 #include <asm/iommu.h>
 #include <asm/vdso.h>
+#include <asm/hugetlb.h>
 
 #include <mm/mmu_decl.h>
 
@@ -513,6 +514,9 @@  void __init mmu_early_init_devtree(void)
 	} else
 		hash__early_init_devtree();
 
+	if (IS_ENABLED(CONFIG_HUGETLB_PAGE_SIZE_VARIABLE))
+		hugetlbpage_init_defaultsize();
+
 	if (!(cur_cpu_spec->mmu_features & MMU_FTR_HPTE_TABLE) &&
 	    !(cur_cpu_spec->mmu_features & MMU_FTR_TYPE_RADIX))
 		panic("kernel does not support any MMU type offered by platform");