
[v6,2/8] hugetlb: restructure pool allocations

Message ID 20230925234837.86786-3-mike.kravetz@oracle.com (mailing list archive)
State New
Series Batch hugetlb vmemmap modification operations

Commit Message

Mike Kravetz Sept. 25, 2023, 11:48 p.m. UTC
Allocation of a hugetlb page for the hugetlb pool is done by the routine
alloc_pool_huge_page.  This routine allocates contiguous pages from a
low-level allocator, preps the pages for use as a hugetlb page, and then
adds the resulting hugetlb page to the pool.

In the 'prep' stage, optional vmemmap optimization is done.  For
performance reasons we want to perform vmemmap optimization on multiple
hugetlb pages at once.  To do this, restructure the hugetlb pool
allocation code such that vmemmap optimization can be isolated and later
batched.

The code to allocate hugetlb pages from bootmem was also modified to
allow batching.

No functional changes, only code restructure.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
---
 mm/hugetlb.c | 179 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 140 insertions(+), 39 deletions(-)
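
After this change, pool-allocation callers follow a two-phase pattern
along these lines (a condensed sketch built from the helpers added in
the patch below; error, signal and gigantic-page handling omitted):

	LIST_HEAD(folio_list);
	struct folio *folio;
	unsigned long i;

	/* Phase 1: allocate (and vmemmap-optimize) folios, hugetlb_lock not held */
	for (i = 0; i < h->max_huge_pages; i++) {
		folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
						node_alloc_noretry);
		if (!folio)
			break;
		list_add(&folio->lru, &folio_list);
		cond_resched();
	}

	/* Phase 2: account and add everything to the free lists in one lock cycle */
	prep_and_add_allocated_folios(h, &folio_list);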

Comments

Konrad Dybcio Sept. 27, 2023, 11:26 a.m. UTC | #1
On 26.09.2023 01:48, Mike Kravetz wrote:
> [..]
Hi, looks like this patch prevents today's next from booting
on at least one Qualcomm ARM64 platform. Reverting it makes
the device boot again.

Konrad
Mike Kravetz Sept. 29, 2023, 8:57 p.m. UTC | #2
On 09/27/23 13:26, Konrad Dybcio wrote:
> 
> 
> On 26.09.2023 01:48, Mike Kravetz wrote:
> > [..]
> Hi, looks like this patch prevents today's next from booting
> on at least one Qualcomm ARM64 platform. Reverting it makes
> the device boot again.

Can you share the config used and any other specific information, such
as the kernel command line?

I cannot reproduce this on the arm64 platforms I have.  I have been
trying various config combinations without success, although there are
lots of possibilities.  I am also taking a closer look at the changes;
so far, nothing is obvious.
Konrad Dybcio Oct. 2, 2023, 9:57 a.m. UTC | #3
On 9/29/23 22:57, Mike Kravetz wrote:
> On 09/27/23 13:26, Konrad Dybcio wrote:
>>
>>
>> On 26.09.2023 01:48, Mike Kravetz wrote:
>>> [..]
>> Hi, looks like this patch prevents today's next from booting
>> on at least one Qualcomm ARM64 platform. Reverting it makes
>> the device boot again.
> 
> Can you share the config used and any other specific information such as
> kernel command line.
Later this week.

Konrad
Mike Kravetz Oct. 6, 2023, 3:08 a.m. UTC | #4
On 10/02/23 11:57, Konrad Dybcio wrote:
> 
> 
> On 9/29/23 22:57, Mike Kravetz wrote:
> > On 09/27/23 13:26, Konrad Dybcio wrote:
> > > 
> > > 
> > > On 26.09.2023 01:48, Mike Kravetz wrote:
> > > > [..]
> > > Hi, looks like this patch prevents today's next from booting
> > > on at least one Qualcomm ARM64 platform. Reverting it makes
> > > the device boot again.
> > 
> > Can you share the config used and any other specific information such as
> > kernel command line.
> Later this week.

As mentioned, I have been unable to reproduce on arm64 platforms I can
access.  I have tried various config and boot options.  While doing so,
I came across one issue impacting kernels compiled without
CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined.  This is not something
that would prevent booting.

I will send out an updated version of the series in the hope that any
other issues may be discovered.
Konrad Dybcio Oct. 6, 2023, 9:39 p.m. UTC | #5
On 6.10.2023 05:08, Mike Kravetz wrote:
> On 10/02/23 11:57, Konrad Dybcio wrote:
>>
>>
>> On 9/29/23 22:57, Mike Kravetz wrote:
>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>
>>>>
>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
>>>>> [..]
>>>> Hi, looks like this patch prevents today's next from booting
>>>> on at least one Qualcomm ARM64 platform. Reverting it makes
>>>> the device boot again.
>>>
>>> Can you share the config used and any other specific information such as
>>> kernel command line.
>> Later this week.
> 
> As mentioned, I have been unable to reproduce on arm64 platforms I can
> access.  I have tried various config and boot options.  While doing so,
> I came across one issue impacting kernels compiled without
> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined.  This is not something
> that would prevent booting.
> 
> I will send out an updated version series in the hope that any other
> issues may be discovered.
I'm pushing the "later this week" by answering near the end of the
calendar day on Friday, but it seems like this patch in v7 still
prevents the device from booting.

You can find my defconfig at the link below.

https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e

Konrad
Mike Kravetz Oct. 6, 2023, 10:35 p.m. UTC | #6
On 10/06/23 23:39, Konrad Dybcio wrote:
> On 6.10.2023 05:08, Mike Kravetz wrote:
> > On 10/02/23 11:57, Konrad Dybcio wrote:
> >>
> >>
> >> On 9/29/23 22:57, Mike Kravetz wrote:
> >>> On 09/27/23 13:26, Konrad Dybcio wrote:
> >>>>
> >>>>
> >>>> On 26.09.2023 01:48, Mike Kravetz wrote:
> >>>>> [..]
> >>>> Hi, looks like this patch prevents today's next from booting
> >>>> on at least one Qualcomm ARM64 platform. Reverting it makes
> >>>> the device boot again.
> >>>
> >>> Can you share the config used and any other specific information such as
> >>> kernel command line.
> >> Later this week.
> > 
> > As mentioned, I have been unable to reproduce on arm64 platforms I can
> > access.  I have tried various config and boot options.  While doing so,
> > I came across one issue impacting kernels compiled without
> > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined.  This is not something
> > that would prevent booting.
> > 
> > I will send out an updated version series in the hope that any other
> > issues may be discovered.
> I'm pushing the "later this week" by answering near end of calendar
> day, Friday, but it seems like this patch in v7 still prevents the
> device from booting..
> 
> You can find my defconfig at the link below.
> 
> https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e
> 

Thanks!

I assume there is no further information, such as any console output?
Did any of your other arm64 platforms have this issue?

Just trying to get as much information as possible to get to root cause.
Jane Chu Oct. 7, 2023, 1:51 a.m. UTC | #7
Hi, Konrad,

Just wondering, is your arm64 system a VM instance, or a bare metal?

thanks!
-jane


On 10/6/2023 2:39 PM, Konrad Dybcio wrote:
> On 6.10.2023 05:08, Mike Kravetz wrote:
>> On 10/02/23 11:57, Konrad Dybcio wrote:
>>>
>>>
>>> On 9/29/23 22:57, Mike Kravetz wrote:
>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>
[..]
>> I will send out an updated version series in the hope that any other
>> issues may be discovered.
> I'm pushing the "later this week" by answering near end of calendar
> day, Friday, but it seems like this patch in v7 still prevents the
> device from booting..
> 
> You can find my defconfig at the link below.
> 
> https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e
> 
> Konrad
>
Mike Kravetz Oct. 9, 2023, 3:29 a.m. UTC | #8
On 10/06/23 15:35, Mike Kravetz wrote:
> On 10/06/23 23:39, Konrad Dybcio wrote:
> > On 6.10.2023 05:08, Mike Kravetz wrote:
> > > On 10/02/23 11:57, Konrad Dybcio wrote:
> > >>
> > >>
> > >> On 9/29/23 22:57, Mike Kravetz wrote:
> > >>> On 09/27/23 13:26, Konrad Dybcio wrote:
> > >>>>
> > >>>>
> > >>>> On 26.09.2023 01:48, Mike Kravetz wrote:
> > >>>>> [..]
> > >>>> Hi, looks like this patch prevents today's next from booting
> > >>>> on at least one Qualcomm ARM64 platform. Reverting it makes
> > >>>> the device boot again.
> > >>>
> > >>> Can you share the config used and any other specific information such as
> > >>> kernel command line.
> > >> Later this week.
> > > 
> > > As mentioned, I have been unable to reproduce on arm64 platforms I can
> > > access.  I have tried various config and boot options.  While doing so,
> > > I came across one issue impacting kernels compiled without
> > > CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined.  This is not something
> > > that would prevent booting.
> > > 
> > > I will send out an updated version series in the hope that any other
> > > issues may be discovered.
> > I'm pushing the "later this week" by answering near end of calendar
> > day, Friday, but it seems like this patch in v7 still prevents the
> > device from booting..
> > 
> > You can find my defconfig at the link below.
> > 
> > https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e
> > 
> 
> Thanks!
> 
> I assume there is no further information such as any console output?
> Did any of you other arm64 platforms have this issue?
> 
> Just trying to get as much information as possible to get to root cause.

I have not had success isolating the issue with your config file.

Since the only code changes in this patch deal with allocating hugetlb
pages, I assume that is what you are doing?  Can you let me know how you
are performing the allocations?  I assume it is on the kernel command
line, as those would be processed earliest in boot.
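
For reference, boot-time pool allocations are normally requested with
kernel command line parameters such as (illustrative values only, not
taken from your setup):

	hugepagesz=2M hugepages=256
	default_hugepagesz=1G hugepagesz=1G hugepages=2

These are what hugetlb_hstate_alloc_pages processes early in boot.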

If you are not allocating hugetlb pages, then I need to think of what
else may be happening.

Anshuman, any chance you (or someone else with access to arm64 platforms)
could throw this on any platforms you have access to for a quick test?
Konrad Dybcio Oct. 9, 2023, 10:11 a.m. UTC | #9
On 9.10.2023 05:29, Mike Kravetz wrote:
> On 10/06/23 15:35, Mike Kravetz wrote:
>> On 10/06/23 23:39, Konrad Dybcio wrote:
>>> On 6.10.2023 05:08, Mike Kravetz wrote:
>>>> On 10/02/23 11:57, Konrad Dybcio wrote:
>>>>>
>>>>>
>>>>> On 9/29/23 22:57, Mike Kravetz wrote:
>>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
>>>>>>>> [..]
>>>>>>> Hi, looks like this patch prevents today's next from booting
>>>>>>> on at least one Qualcomm ARM64 platform. Reverting it makes
>>>>>>> the device boot again.
>>>>>>
>>>>>> Can you share the config used and any other specific information such as
>>>>>> kernel command line.
>>>>> Later this week.
>>>>
>>>> As mentioned, I have been unable to reproduce on arm64 platforms I can
>>>> access.  I have tried various config and boot options.  While doing so,
>>>> I came across one issue impacting kernels compiled without
>>>> CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP defined.  This is not something
>>>> that would prevent booting.
>>>>
>>>> I will send out an updated version series in the hope that any other
>>>> issues may be discovered.
>>> I'm pushing the "later this week" by answering near end of calendar
>>> day, Friday, but it seems like this patch in v7 still prevents the
>>> device from booting..
>>>
>>> You can find my defconfig at the link below.
>>>
>>> https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e
>>>
>>
>> Thanks!
>>
>> I assume there is no further information such as any console output?
>> Did any of you other arm64 platforms have this issue?
>>
>> Just trying to get as much information as possible to get to root cause.
> 
> I have not had success isolating the issue with your config file.
> 
> Since the only code changes in this patch deal with allocating hugetlb
> pages, I assume this is what you are doing?  Can you let me know how you
> are performing the allocations?  I assume it is on the kernel command
> line as these would be processed earliest in boot.
> 
> If you are not allocating hugetlb pages, then I need to think of what
> else may be happening.
> 
> Anshuman, any chance you (or someone else with access to arm64 platforms)
> could throw this on any platforms you have access to for a quick test?
I managed to get a boot log:

https://pastebin.com/GwurpCw9

This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference

Konrad
Konrad Dybcio Oct. 9, 2023, 10:13 a.m. UTC | #10
On 7.10.2023 03:51, Jane Chu wrote:
> Hi, Konrad,
> 
> Just wondering, is your arm64 system a VM instance, or a bare metal?
That's a tricky question :)

Qualcomm platforms expose much of the hardware in a manner that's
similar to a VM; there's an extensive, irreplaceable hypervisor in
place, and the user can only boot Linux at EL1.

So, I guess the answer here is "somewhat bare metal" :/

Konrad
> 
> thanks!
> -jane
> 
> 
> On 10/6/2023 2:39 PM, Konrad Dybcio wrote:
>> On 6.10.2023 05:08, Mike Kravetz wrote:
>>> On 10/02/23 11:57, Konrad Dybcio wrote:
>>>>
>>>>
>>>> On 9/29/23 22:57, Mike Kravetz wrote:
>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>>
> [..]
>>> I will send out an updated version series in the hope that any other
>>> issues may be discovered.
>> I'm pushing the "later this week" by answering near end of calendar
>> day, Friday, but it seems like this patch in v7 still prevents the
>> device from booting..
>>
>> You can find my defconfig at the link below.
>>
>> https://gist.github.com/konradybcio/d865f8dc9b12a98ba3875ec5a9aac42e
>>
>> Konrad
>>
Mike Kravetz Oct. 9, 2023, 3:04 p.m. UTC | #11
On 10/09/23 12:11, Konrad Dybcio wrote:
> On 9.10.2023 05:29, Mike Kravetz wrote:
> > On 10/06/23 15:35, Mike Kravetz wrote:
> >> On 10/06/23 23:39, Konrad Dybcio wrote:
> >>> On 6.10.2023 05:08, Mike Kravetz wrote:
> >>>> On 10/02/23 11:57, Konrad Dybcio wrote:
> >>>>> On 9/29/23 22:57, Mike Kravetz wrote:
> >>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
> >>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
>
> I managed to get a boot log:
> 
> https://pastebin.com/GwurpCw9
> 
> This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference
> 

Early on in the boot log, before the panic, I see this:

[    0.000000] efi: UEFI not found.
[    0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader!

Isn't that misalignment pretty serious?  Or, is it possible to run with that?

There are no hugetlb pages allocated at boot time:

[    0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm 

So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a
noop.  The first thing it does is:
list_for_each_entry(m, &huge_boot_pages, list) {
...
}

However, huge_boot_pages should be empty as initialized here:
__initdata LIST_HEAD(huge_boot_pages);

At the end of the routine, we call prep_and_add_bootmem_folios to
process the local list created within the above loop:

LIST_HEAD(folio_list);

This should also be empty and a noop.

Is it possible that the misaligned kernel image could make these lists
appear as non-empty?
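
For reference, with an empty list the batched-add helper introduced by
this patch reduces to taking and dropping hugetlb_lock with zero loop
iterations (this is the v6 helper from the diff below with a comment
added; the bootmem variant should have the same shape):

static void prep_and_add_allocated_folios(struct hstate *h,
					struct list_head *folio_list)
{
	struct folio *folio, *tmp_f;

	/* Add all new pool pages to free lists in one lock cycle */
	spin_lock_irq(&hugetlb_lock);
	/* zero iterations when folio_list is empty */
	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
		__prep_account_new_huge_page(h, folio_nid(folio));
		enqueue_hugetlb_folio(h, folio);
	}
	spin_unlock_irq(&hugetlb_lock);
}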
Mike Kravetz Oct. 9, 2023, 3:15 p.m. UTC | #12
On 10/09/23 08:04, Mike Kravetz wrote:
> On 10/09/23 12:11, Konrad Dybcio wrote:
> > On 9.10.2023 05:29, Mike Kravetz wrote:
> > > On 10/06/23 15:35, Mike Kravetz wrote:
> > >> On 10/06/23 23:39, Konrad Dybcio wrote:
> > >>> On 6.10.2023 05:08, Mike Kravetz wrote:
> > >>>> On 10/02/23 11:57, Konrad Dybcio wrote:
> > >>>>> On 9/29/23 22:57, Mike Kravetz wrote:
> > >>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
> > >>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
> >
> > I managed to get a boot log:
> > 
> > https://pastebin.com/GwurpCw9
> > 
> > This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference
> > 
> 
> Early on in boot log before the panic, I see this in the log:
> 
> [    0.000000] efi: UEFI not found.
> [    0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader!
> 
> Isn't that misalignment pretty serious?  Or, is is possible to run with that?
> 
> There are no hugetlb pages allocated at boot time:
> 
> [    0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm 
> 
> So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a
> noop.  The first thing it does is:
> list_for_each_entry(m, &huge_boot_pages, list) {
> ...
> }
> 
> However, huge_boot_pages should be empty as initialized here:
> __initdata LIST_HEAD(huge_boot_pages);
> 
> At the end of the routine, we call prep_and_add_bootmem_folios to
> process the local list created withing that above loop:
> 
> LIST_HEAD(folio_list);
> 
> This should also be empty and a noop.
> 
> Is it possible that the misaligned kernel image could make these lists
> appear as non-empty?

Actually, just saw this:

https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/

Will take a look, although as mentioned above prep_and_add_bootmem_folios on
an empty list should be a noop.
Konrad Dybcio Oct. 9, 2023, 9:09 p.m. UTC | #13
On 10/9/23 17:04, Mike Kravetz wrote:
> On 10/09/23 12:11, Konrad Dybcio wrote:
>> On 9.10.2023 05:29, Mike Kravetz wrote:
>>> On 10/06/23 15:35, Mike Kravetz wrote:
>>>> On 10/06/23 23:39, Konrad Dybcio wrote:
>>>>> On 6.10.2023 05:08, Mike Kravetz wrote:
>>>>>> On 10/02/23 11:57, Konrad Dybcio wrote:
>>>>>>> On 9/29/23 22:57, Mike Kravetz wrote:
>>>>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
>>
>> I managed to get a boot log:
>>
>> https://pastebin.com/GwurpCw9
>>
>> This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference
>>
> 
> Early on in boot log before the panic, I see this in the log:
> 
> [    0.000000] efi: UEFI not found.
> [    0.000000] [Firmware Bug]: Kernel image misaligned at boot, please fix your bootloader!
> 
> Isn't that misalignment pretty serious?  Or, is is possible to run with that?
That has never caused any issues and sadly I can't do anything about it.

> 
> There are no hugetlb pages allocated at boot time:
> 
> [    0.000000] Kernel command line: PMOS_NO_OUTPUT_REDIRECT console=ttyMSM0 earlycon clk_ignore_unused pd_ignore_unused androidboot.bootdevice=1d84000.ufshc androidboot.fstab_suffix=default androidboot.boot_devices=soc/1d84000.ufshc androidboot.serialno=ab855d8d androidboot.baseband=msm
> 
> So, the routine where we are panic'ing (gather_bootmem_prealloc) should be a
> noop.  The first thing it does is:
> list_for_each_entry(m, &huge_boot_pages, list) {
> ...
> }
> 
> However, huge_boot_pages should be empty as initialized here:
> __initdata LIST_HEAD(huge_boot_pages);
> 
> At the end of the routine, we call prep_and_add_bootmem_folios to
> process the local list created withing that above loop:
> 
> LIST_HEAD(folio_list);
> 
> This should also be empty and a noop.
> 
> Is it possible that the misaligned kernel image could make these lists
> appear as non-empty?
I don't think I have an answer for this

Konrad
Konrad Dybcio Oct. 9, 2023, 9:09 p.m. UTC | #14
On 10/9/23 17:15, Mike Kravetz wrote:
> On 10/09/23 08:04, Mike Kravetz wrote:
>> On 10/09/23 12:11, Konrad Dybcio wrote:
>>> On 9.10.2023 05:29, Mike Kravetz wrote:
>>>> On 10/06/23 15:35, Mike Kravetz wrote:
>>>>> On 10/06/23 23:39, Konrad Dybcio wrote:
>>>>>> On 6.10.2023 05:08, Mike Kravetz wrote:
>>>>>>> On 10/02/23 11:57, Konrad Dybcio wrote:
>>>>>>>> On 9/29/23 22:57, Mike Kravetz wrote:
>>>>>>>>> On 09/27/23 13:26, Konrad Dybcio wrote:
>>>>>>>>>> On 26.09.2023 01:48, Mike Kravetz wrote:
>>>
>>> I managed to get a boot log:
>>>
>>> https://pastebin.com/GwurpCw9
>>>
>>> This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference
>>>
>>
>> [..]
> 
> Actually, just saw this:
> 
> https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
> 
> Will take a look, although as mentioned above prep_and_add_bootmem_folios on
> an empty list should be a noop.
I'll try it out atop the series tomorrow or so.

Konrad
Andrew Morton Oct. 10, 2023, 12:07 a.m. UTC | #15
On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> > This should also be empty and a noop.
> > 
> > Is it possible that the misaligned kernel image could make these lists
> > appear as non-empty?
> 
> Actually, just saw this:
> 
> https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
> 
> Will take a look, although as mentioned above prep_and_add_bootmem_folios on
> an empty list should be a noop.

Konrad, are you able to test Usama's patch?  Thanks.

From: Usama Arif <usama.arif@bytedance.com>
Subject: mm: hugetlb: only prep and add allocated folios for non-gigantic pages
Date: Mon, 9 Oct 2023 15:56:05 +0100

Calling prep_and_add_allocated_folios when allocating gigantic pages at
boot time causes the kernel to crash as folio_list is empty and iterating
it causes a NULL pointer dereference.  Call this only for non-gigantic
pages when folio_list has entries.

Link: https://lkml.kernel.org/r/20231009145605.2150897-1-usama.arif@bytedance.com
Fixes: bfb41d6b2fe148 ("hugetlb: restructure pool allocations")
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Barry Song <21cnbao@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Konrad Dybcio <konradybcio@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/hugetlb.c~hugetlb-restructure-pool-allocations-fix
+++ a/mm/hugetlb.c
@@ -3307,7 +3307,8 @@ static void __init hugetlb_hstate_alloc_
 	}
 
 	/* list will be empty if hstate_is_gigantic */
-	prep_and_add_allocated_folios(h, &folio_list);
+	if (!hstate_is_gigantic(h))
+		prep_and_add_allocated_folios(h, &folio_list);
 
 	if (i < h->max_huge_pages) {
 		char buf[32];
Mike Kravetz Oct. 10, 2023, 1:26 a.m. UTC | #16
On 10/09/23 23:09, Konrad Dybcio wrote:
> 
> 
> On 10/9/23 17:15, Mike Kravetz wrote:
> > On 10/09/23 08:04, Mike Kravetz wrote:
> > > On 10/09/23 12:11, Konrad Dybcio wrote:
> > > > On 9.10.2023 05:29, Mike Kravetz wrote:
> > > > > On 10/06/23 15:35, Mike Kravetz wrote:
> > > > > > On 10/06/23 23:39, Konrad Dybcio wrote:
> > > > > > > On 6.10.2023 05:08, Mike Kravetz wrote:
> > > > > > > > On 10/02/23 11:57, Konrad Dybcio wrote:
> > > > > > > > > On 9/29/23 22:57, Mike Kravetz wrote:
> > > > > > > > > > On 09/27/23 13:26, Konrad Dybcio wrote:
> > > > > > > > > > > On 26.09.2023 01:48, Mike Kravetz wrote:
> > > > 
> > > > I managed to get a boot log:
> > > > 
> > > > https://pastebin.com/GwurpCw9
> > > > 
> > > > This is using arch/arm64/boot/dts/qcom/sm8550-mtp.dts for reference
> > > > 
> > > 
> > > [..]
> > 
> > Actually, just saw this:
> > 
> > https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
> > 
> > Will take a look, although as mentioned above prep_and_add_bootmem_folios on
> > an empty list should be a noop.
> I'll try it out atop the series tomorrow or so.

I just replied to Usama's patch.  This may have more to do with IRQ enablement.
Konrad Dybcio Oct. 10, 2023, 9:30 p.m. UTC | #17
On 10/10/23 02:07, Andrew Morton wrote:
> On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>>> This should also be empty and a noop.
>>>
>>> Is it possible that the misaligned kernel image could make these lists
>>> appear as non-empty?
>>
>> Actually, just saw this:
>>
>> https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
>>
>> Will take a look, although as mentioned above prep_and_add_bootmem_folios on
>> an empty list should be a noop.
> 
> Konrad, are you able to test Usama's patch?  Thanks.
I legitimately spent a sad amount of time trying to regain access to the 
remote board farm. Previously I could hit the bug on SM8550, but now I 
can't do it on SM8450, SM8350 and SM8250 (previous gens), with the same 
config.  I have no idea when I'll be able to get access to SM8550 again.

I did test it on the QCM6490 Fairphone 5 that I initially reported this 
on, and neither booting next-20231010 (with your patchset applied) nor 
adding the below patch on top of it seems to work. I can't get serial 
output from this device though to find out what it's unhappy about :/

Konrad
Mike Kravetz Oct. 10, 2023, 9:45 p.m. UTC | #18
On 10/10/23 23:30, Konrad Dybcio wrote:
> 
> 
> On 10/10/23 02:07, Andrew Morton wrote:
> > On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > 
> > > > This should also be empty and a noop.
> > > > 
> > > > Is it possible that the misaligned kernel image could make these lists
> > > > appear as non-empty?
> > > 
> > > Actually, just saw this:
> > > 
> > > https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
> > > 
> > > Will take a look, although as mentioned above prep_and_add_bootmem_folios on
> > > an empty list should be a noop.
> > 
> > Konrad, are you able to test Usama's patch?  Thanks.
> I legitimately spent a sad amount of time trying to regain access to the
> remote board farm. Previously I could hit the bug on SM8550, but now I can't
> do it on SM8450, SM8350 and SM8250 (previous gens), with the same config.. I
> have no idea when I'll be able to get access to SM8550 again.
> 
> I did test it on the QCM6490 Fairphone 5 that I initially reported this on,
> and neither booting next-20231010 (with your patchset applied) nor adding
> the below patch on top of it seems to work. I can't get serial output from
> this device though to find out what it's unhappy about :/

Sorry for causing you to spend so much time on this.

As mentioned in the reply to Usama's patch, the root cause seems to be
the locking.  So, the real changes to test are the locking changes in
that thread, not Usama's patch.
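
Roughly, that means switching the batched-add helper to the
IRQ-state-preserving lock variants, so that a call made during early
boot (while interrupts are still disabled) does not re-enable them on
unlock.  A sketch along those lines (assumed shape, not necessarily the
exact patch in that thread):

static void prep_and_add_allocated_folios(struct hstate *h,
					struct list_head *folio_list)
{
	unsigned long flags;
	struct folio *folio, *tmp_f;

	/*
	 * spin_lock_irqsave()/spin_unlock_irqrestore() preserve the caller's
	 * IRQ state; spin_lock_irq()/spin_unlock_irq() would unconditionally
	 * enable interrupts on unlock, even if the caller still had them
	 * disabled (as in early boot).
	 */
	spin_lock_irqsave(&hugetlb_lock, flags);
	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
		__prep_account_new_huge_page(h, folio_nid(folio));
		enqueue_hugetlb_folio(h, folio);
	}
	spin_unlock_irqrestore(&hugetlb_lock, flags);
}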
Konrad Dybcio Oct. 11, 2023, 9:36 a.m. UTC | #19
On 10/10/23 23:45, Mike Kravetz wrote:
> On 10/10/23 23:30, Konrad Dybcio wrote:
>>
>>
>> On 10/10/23 02:07, Andrew Morton wrote:
>>> On Mon, 9 Oct 2023 08:15:13 -0700 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>>>> This should also be empty and a noop.
>>>>>
>>>>> Is it possible that the misaligned kernel image could make these lists
>>>>> appear as non-empty?
>>>>
>>>> Actually, just saw this:
>>>>
>>>> https://lore.kernel.org/linux-mm/20231009145605.2150897-1-usama.arif@bytedance.com/
>>>>
>>>> Will take a look, although as mentioned above prep_and_add_bootmem_folios on
>>>> an empty list should be a noop.
>>>
>>> Konrad, are you able to test Usama's patch?  Thanks.
>> I legitimately spent a sad amount of time trying to regain access to the
>> remote board farm. Previously I could hit the bug on SM8550, but now I can't
>> do it on SM8450, SM8350 and SM8250 (previous gens), with the same config.. I
>> have no idea when I'll be able to get access to SM8550 again.
>>
>> I did test it on the QCM6490 Fairphone 5 that I initially reported this on,
>> and neither booting next-20231010 (with your patchset applied) nor adding
>> the below patch on top of it seems to work. I can't get serial output from
>> this device though to find out what it's unhappy about :/
> 
> Sorry for causing you to spend so much time on this.
No worries, that was my explanation for why it took me so long to
respond again.

> 
> As mentioned in the reply to Usama's patch, the root cause seems to be
> the locking.  So, the real change to test is the locking changes in
> that thread; not Usama's patch.
Ack

Konrad

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 47159b9de633..64f50f3844fc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1970,16 +1970,21 @@  static void __prep_account_new_huge_page(struct hstate *h, int nid)
 	h->nr_huge_pages_node[nid]++;
 }
 
-static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
+static void init_new_hugetlb_folio(struct hstate *h, struct folio *folio)
 {
 	folio_set_hugetlb(folio);
-	hugetlb_vmemmap_optimize(h, &folio->page);
 	INIT_LIST_HEAD(&folio->lru);
 	hugetlb_set_folio_subpool(folio, NULL);
 	set_hugetlb_cgroup(folio, NULL);
 	set_hugetlb_cgroup_rsvd(folio, NULL);
 }
 
+static void __prep_new_hugetlb_folio(struct hstate *h, struct folio *folio)
+{
+	init_new_hugetlb_folio(h, folio);
+	hugetlb_vmemmap_optimize(h, &folio->page);
+}
+
 static void prep_new_hugetlb_folio(struct hstate *h, struct folio *folio, int nid)
 {
 	__prep_new_hugetlb_folio(h, folio);
@@ -2190,16 +2195,9 @@  static struct folio *alloc_buddy_hugetlb_folio(struct hstate *h,
 	return page_folio(page);
 }
 
-/*
- * Common helper to allocate a fresh hugetlb page. All specific allocators
- * should use this function to get new hugetlb pages
- *
- * Note that returned page is 'frozen':  ref count of head page and all tail
- * pages is zero.
- */
-static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask,
-		nodemask_t *node_alloc_noretry)
+static struct folio *__alloc_fresh_hugetlb_folio(struct hstate *h,
+				gfp_t gfp_mask, int nid, nodemask_t *nmask,
+				nodemask_t *node_alloc_noretry)
 {
 	struct folio *folio;
 	bool retry = false;
@@ -2212,6 +2210,7 @@  static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 				nid, nmask, node_alloc_noretry);
 	if (!folio)
 		return NULL;
+
 	if (hstate_is_gigantic(h)) {
 		if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) {
 			/*
@@ -2226,32 +2225,80 @@  static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
 			return NULL;
 		}
 	}
-	prep_new_hugetlb_folio(h, folio, folio_nid(folio));
 
 	return folio;
 }
 
+static struct folio *only_alloc_fresh_hugetlb_folio(struct hstate *h,
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
+{
+	struct folio *folio;
+
+	folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+						node_alloc_noretry);
+	if (folio)
+		init_new_hugetlb_folio(h, folio);
+	return folio;
+}
+
 /*
- * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
- * manner.
+ * Common helper to allocate a fresh hugetlb page. All specific allocators
+ * should use this function to get new hugetlb pages
+ *
+ * Note that returned page is 'frozen':  ref count of head page and all tail
+ * pages is zero.
  */
-static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
-				nodemask_t *node_alloc_noretry)
+static struct folio *alloc_fresh_hugetlb_folio(struct hstate *h,
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
 {
 	struct folio *folio;
-	int nr_nodes, node;
+
+	folio = __alloc_fresh_hugetlb_folio(h, gfp_mask, nid, nmask,
+						node_alloc_noretry);
+	if (!folio)
+		return NULL;
+
+	prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+	return folio;
+}
+
+static void prep_and_add_allocated_folios(struct hstate *h,
+					struct list_head *folio_list)
+{
+	struct folio *folio, *tmp_f;
+
+	/* Add all new pool pages to free lists in one lock cycle */
+	spin_lock_irq(&hugetlb_lock);
+	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
+		__prep_account_new_huge_page(h, folio_nid(folio));
+		enqueue_hugetlb_folio(h, folio);
+	}
+	spin_unlock_irq(&hugetlb_lock);
+}
+
+/*
+ * Allocates a fresh hugetlb page in a node interleaved manner.  The page
+ * will later be added to the appropriate hugetlb pool.
+ */
+static struct folio *alloc_pool_huge_folio(struct hstate *h,
+					nodemask_t *nodes_allowed,
+					nodemask_t *node_alloc_noretry)
+{
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+	int nr_nodes, node;
 
 	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
-		folio = alloc_fresh_hugetlb_folio(h, gfp_mask, node,
+		struct folio *folio;
+
+		folio = only_alloc_fresh_hugetlb_folio(h, gfp_mask, node,
 					nodes_allowed, node_alloc_noretry);
-		if (folio) {
-			free_huge_folio(folio); /* free it into the hugepage allocator */
-			return 1;
-		}
+		if (folio)
+			return folio;
 	}
 
-	return 0;
+	return NULL;
 }
 
 /*
@@ -3264,25 +3311,35 @@  static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
  */
 static void __init gather_bootmem_prealloc(void)
 {
+	LIST_HEAD(folio_list);
 	struct huge_bootmem_page *m;
+	struct hstate *h, *prev_h = NULL;
 
 	list_for_each_entry(m, &huge_boot_pages, list) {
 		struct page *page = virt_to_page(m);
 		struct folio *folio = (void *)page;
-		struct hstate *h = m->hstate;
+
+		h = m->hstate;
+		/*
+		 * It is possible to have multiple huge page sizes (hstates)
+		 * in this list.  If so, process each size separately.
+		 */
+		if (h != prev_h && prev_h != NULL)
+			prep_and_add_allocated_folios(prev_h, &folio_list);
+		prev_h = h;
 
 		VM_BUG_ON(!hstate_is_gigantic(h));
 		WARN_ON(folio_ref_count(folio) != 1);
 
 		hugetlb_folio_init_vmemmap(folio, h,
 					   HUGETLB_VMEMMAP_RESERVE_PAGES);
-		prep_new_hugetlb_folio(h, folio, folio_nid(folio));
+		__prep_new_hugetlb_folio(h, folio);
 		/* If HVO fails, initialize all tail struct pages */
 		if (!HPageVmemmapOptimized(&folio->page))
 			hugetlb_folio_init_tail_vmemmap(folio,
 						HUGETLB_VMEMMAP_RESERVE_PAGES,
 						pages_per_huge_page(h));
-		free_huge_folio(folio); /* add to the hugepage allocator */
+		list_add(&folio->lru, &folio_list);
 
 		/*
 		 * We need to restore the 'stolen' pages to totalram_pages
@@ -3292,6 +3349,8 @@  static void __init gather_bootmem_prealloc(void)
 		adjust_managed_page_count(page, pages_per_huge_page(h));
 		cond_resched();
 	}
+
+	prep_and_add_allocated_folios(h, &folio_list);
 }
 
 static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
@@ -3325,9 +3384,22 @@  static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
 	h->max_huge_pages_node[nid] = i;
 }
 
+/*
+ * NOTE: this routine is called in different contexts for gigantic and
+ * non-gigantic pages.
+ * - For gigantic pages, this is called early in the boot process and
+ *   pages are allocated from memblock allocated or something similar.
+ *   Gigantic pages are actually added to pools later with the routine
+ *   gather_bootmem_prealloc.
+ * - For non-gigantic pages, this is called later in the boot process after
+ *   all of mm is up and functional.  Pages are allocated from buddy and
+ *   then added to hugetlb pools.
+ */
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
+	struct folio *folio;
+	LIST_HEAD(folio_list);
 	nodemask_t *node_alloc_noretry;
 	bool node_specific_alloc = false;
 
@@ -3369,14 +3441,25 @@  static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
+			/*
+			 * gigantic pages not added to list as they are not
+			 * added to pools now.
+			 */
 			if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
 				break;
-		} else if (!alloc_pool_huge_page(h,
-					 &node_states[N_MEMORY],
-					 node_alloc_noretry))
-			break;
+		} else {
+			folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
+							node_alloc_noretry);
+			if (!folio)
+				break;
+			list_add(&folio->lru, &folio_list);
+		}
 		cond_resched();
 	}
+
+	/* list will be empty if hstate_is_gigantic */
+	prep_and_add_allocated_folios(h, &folio_list);
+
 	if (i < h->max_huge_pages) {
 		char buf[32];
 
@@ -3510,7 +3593,9 @@  static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed,
 static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
-	unsigned long min_count, ret;
+	unsigned long min_count;
+	unsigned long allocated;
+	struct folio *folio;
 	LIST_HEAD(page_list);
 	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);
 
@@ -3587,7 +3672,8 @@  static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			break;
 	}
 
-	while (count > persistent_huge_pages(h)) {
+	allocated = 0;
+	while (count > (persistent_huge_pages(h) + allocated)) {
 		/*
 		 * If this allocation races such that we no longer need the
 		 * page, free_huge_folio will handle it by freeing the page
@@ -3598,15 +3684,32 @@  static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		/* yield cpu to avoid soft lockup */
 		cond_resched();
 
-		ret = alloc_pool_huge_page(h, nodes_allowed,
+		folio = alloc_pool_huge_folio(h, nodes_allowed,
 						node_alloc_noretry);
-		spin_lock_irq(&hugetlb_lock);
-		if (!ret)
+		if (!folio) {
+			prep_and_add_allocated_folios(h, &page_list);
+			spin_lock_irq(&hugetlb_lock);
 			goto out;
+		}
+
+		list_add(&folio->lru, &page_list);
+		allocated++;
 
 		/* Bail for signals. Probably ctrl-c from user */
-		if (signal_pending(current))
+		if (signal_pending(current)) {
+			prep_and_add_allocated_folios(h, &page_list);
+			spin_lock_irq(&hugetlb_lock);
 			goto out;
+		}
+
+		spin_lock_irq(&hugetlb_lock);
+	}
+
+	/* Add allocated pages to the pool */
+	if (!list_empty(&page_list)) {
+		spin_unlock_irq(&hugetlb_lock);
+		prep_and_add_allocated_folios(h, &page_list);
+		spin_lock_irq(&hugetlb_lock);
 	}
 
 	/*
@@ -3632,8 +3735,6 @@  static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	 * Collect pages to be removed on list without dropping lock
 	 */
 	while (min_count < persistent_huge_pages(h)) {
-		struct folio *folio;
-
 		folio = remove_pool_hugetlb_folio(h, nodes_allowed, 0);
 		if (!folio)
 			break;