[RFC,00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.

Message ID 20210805190253.2795604-1-zi.yan@sent.com

Message

Zi Yan Aug. 5, 2021, 7:02 p.m. UTC
From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset adds support for a kernel boot time adjustable MAX_ORDER, so that
users can change the largest size of pages obtained from the buddy allocator.
It also removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so
that the buddy allocator can merge PFNs across memory sections when
SPARSEMEM_VMEMMAP is set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51.

Motivation
===

This enables the kernel to allocate 1GB pages and is necessary for my ongoing
work on adding support for 1GB PUD THP[1]. This is also the conclusion I came
to after some discussion with David Hildenbrand on what methods should be used
for allocating gigantic pages[2], since other approaches like using the CMA
allocator or alloc_contig_pages() are regarded as suboptimal.

This also avoids having to increase SECTION_SIZE_BITS when increasing
MAX_ORDER. Increasing SECTION_SIZE_BITS is undesirable, since it would increase
the memory hotadd/hotremove chunk size as well, making memory management harder
for VMs.

In addition, making MAX_ORDER a kernel boot time parameter lets users adjust
the buddy allocator to their needs without recompiling the kernel, so one can
still keep a small MAX_ORDER when there is no need to allocate gigantic pages
like 1GB PUD THPs.

Background
===

At the moment, the kernel imposes the restriction MAX_ORDER - 1 + PAGE_SHIFT <
SECTION_SIZE_BITS. This prevents the buddy allocator from merging pages across
memory sections, as PFNs might not be contiguous and code like page++ would
fail. But this is not an issue when SPARSEMEM_VMEMMAP is set, since all struct
pages are virtually contiguous. In addition, as long as the buddy allocator
checks PFN validity during buddy page merging (done in Patch 3), pages
allocated from the buddy allocator can be manipulated by code like page++.
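
To make the relation concrete, here is a standalone illustration (not code from
the patchset), using 4KB base pages and the x86_64 default SECTION_SIZE_BITS:

/*
 * Illustrative only: a buddy block covers 2^(MAX_ORDER - 1) base pages,
 * a memory section covers 2^(SECTION_SIZE_BITS - PAGE_SHIFT) base pages,
 * and the largest buddy block currently has to fit within one section.
 */
#define PAGE_SHIFT		12	/* 4KB base pages */
#define SECTION_SIZE_BITS	27	/* 128MB memory sections */
#define MAX_ORDER		11	/* largest block: 1024 pages = 4MB */

#define MAX_ORDER_NR_PAGES	(1UL << (MAX_ORDER - 1))
#define PAGES_PER_SECTION	(1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))

_Static_assert(MAX_ORDER - 1 + PAGE_SHIFT <= SECTION_SIZE_BITS,
	       "the largest buddy block must not span memory sections");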


Description
===

I tested the patchset on both x86_64 and ARM64 with 4KB, 16KB, and 64KB base
pages. The systems booted and the ltp mm test suite finished without issue.
Memory hotplug also worked on x86_64 in my testing. It definitely needs more
tests and reviews for other architectures.

In terms of the concerns on performance degradation if MAX_ORDER is increased,
I did some initial performance tests comparing MAX_ORDER=11 and MAX_ORDER=20 on
x86_64 machines and saw no performance difference[3].

Patch 1 excludes the MAX_ORDER check from 32bit vdso compilation. The check
uses the irrelevant 32bit SECTION_SIZE_BITS during 64bit kernel compilation.
The exclusion does not break the check for 32bit kernels, since the check is
still performed when other kernel components are compiled.

Patch 2 gives FORCE_MAX_ZONEORDER a better name.

Patch 3 restores the pfn_valid_within() check for when the buddy allocator can
merge pages across memory sections. The check was removed when ARM64 got rid of
holes in zones, but holes can appear in zones again after this patchset.
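
For flavor, a minimal sketch of the kind of guard this restores during buddy
merging (names follow the historical pfn_valid_within() helper and the
mm/page_alloc.c internals; the actual patch may differ):

/*
 * Sketch only, not the actual patch: once a buddy block can span
 * memory sections, the buddy's memmap may be missing, so check PFN
 * validity before touching the buddy page while merging.
 */
static inline bool can_merge_with_buddy(struct page *page,
					unsigned long buddy_pfn,
					struct page *buddy,
					unsigned int order)
{
	if (!pfn_valid_within(buddy_pfn))	/* hole: no struct page here */
		return false;

	return page_is_buddy(page, buddy, order);
}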

Patches 4-11 convert uses of MAX_ORDER to SECTION_SIZE_BITS or its derivative
constants, since these places use MAX_ORDER as a boundary check for physically
contiguous pages, where SECTION_SIZE_BITS should be used instead. After this
patchset, MAX_ORDER can go beyond SECTION_SIZE_BITS and such code would break.
I separated the changes into different patches for easy review and can merge
them into a single one if that works better.
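
To give a flavor of these conversions, here is a sketch based on the existing
mem_map_next() helper (illustrative; see the individual patches for the real
changes):

/*
 * Sketch only: helpers stepping through struct pages can rely on
 * virtual contiguity within one memory section, so the wrap-around
 * check is bounded by PAGES_PER_SECTION instead of MAX_ORDER_NR_PAGES
 * once MAX_ORDER may exceed the section size.
 */
static inline struct page *mem_map_next(struct page *iter,
					struct page *base, int offset)
{
	if (unlikely((offset & (PAGES_PER_SECTION - 1)) == 0)) {
		unsigned long pfn = page_to_pfn(base) + offset;

		if (!pfn_valid(pfn))
			return NULL;
		return pfn_to_page(pfn);
	}
	return iter + 1;
}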

Patch 12 adds a new Kconfig option SET_MAX_ORDER to allow specifying MAX_ORDER
when ARCH_FORCE_MAX_ORDER is not used by the arch, like x86_64.

Patch 13 converts statically allocated arrays of MAX_ORDER length to dynamic
ones where possible, preparing for making MAX_ORDER a boot time parameter.
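
As an illustration of the kind of conversion involved (the names and the
allocation site are made up, not taken verbatim from the patch):

/*
 * Sketch only: a MAX_ORDER-sized per-zone array becomes a pointer
 * sized by the boot-time value; "max_order" here stands for the value
 * parsed from the kernel command line.
 */

/* before: length fixed at compile time */
struct free_area	free_area[MAX_ORDER];

/* after: pointer, filled in during early boot once max_order is known */
struct free_area	*free_area;

zone->free_area = memblock_alloc(max_order * sizeof(struct free_area),
				 sizeof(struct free_area));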

Patch 14 adds a new MIN_MAX_ORDER constant to replace the soon-to-be-dynamic
MAX_ORDER in places where converting a static array to a dynamic one would be a
hassle and is not necessary, namely ARM64 hypervisor page allocation and SLAB.

Patch 15 finally changes MAX_ORDER to be a kernel boot time parameter.
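
Roughly, the parameter handling could look like the sketch below (the exact
parameter name, bounds, and variable names should be taken from the patch
itself):

/*
 * Sketch only: parse a boot parameter and clamp it so MAX_ORDER never
 * drops below the compile-time MIN_MAX_ORDER from Patch 14.  The
 * parameter name "buddy_alloc_max_order" is illustrative.
 */
unsigned int buddy_alloc_max_order = MIN_MAX_ORDER;

static int __init buddy_alloc_max_order_set(char *str)
{
	unsigned int order;

	if (!str || kstrtouint(str, 0, &order))
		return -EINVAL;

	buddy_alloc_max_order = max_t(unsigned int, order, MIN_MAX_ORDER);
	return 0;
}
early_param("buddy_alloc_max_order", buddy_alloc_max_order_set);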


Any suggestions and/or comments are welcome. Thanks.


TODO
===

1. Redo the performance comparison tests using this patchset to understand the
   performance implication of changing MAX_ORDER.


Zi Yan (15):
  arch: x86: remove MAX_ORDER exceeding SECTION_SIZE check for 32bit
    vdso.
  arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER
  mm: check pfn validity when buddy allocator can merge pages across mem
    sections.
  mm: prevent pageblock size being larger than section size.
  mm/memory_hotplug: online pages at section size.
  mm: use PAGES_PER_SECTION instead for mem_map_offset/next().
  mm: hugetlb: use PAGES_PER_SECTION to check mem_map discontiguity
  fs: proc: use PAGES_PER_SECTION for page offline checking period.
  virtio: virtio_mem: use PAGES_PER_SECTION instead of
    MAX_ORDER_NR_PAGES
  virtio: virtio_balloon: use PAGES_PER_SECTION instead of
    MAX_ORDER_NR_PAGES.
  mm/page_reporting: report pages at section size instead of MAX_ORDER.
  mm: Make MAX_ORDER of buddy allocator configurable via Kconfig
    SET_MAX_ORDER.
  mm: convert MAX_ORDER sized static arrays to dynamic ones.
  mm: introduce MIN_MAX_ORDER to replace MAX_ORDER as compile time
    constant.
  mm: make MAX_ORDER a kernel boot time parameter.

 .../admin-guide/kdump/vmcoreinfo.rst          |   2 +-
 .../admin-guide/kernel-parameters.txt         |   5 +
 arch/Kconfig                                  |   4 +
 arch/arc/Kconfig                              |   2 +-
 arch/arm/Kconfig                              |   2 +-
 arch/arm/configs/imx_v6_v7_defconfig          |   2 +-
 arch/arm/configs/milbeaut_m10v_defconfig      |   2 +-
 arch/arm/configs/oxnas_v6_defconfig           |   2 +-
 arch/arm/configs/sama7_defconfig              |   2 +-
 arch/arm64/Kconfig                            |   2 +-
 arch/arm64/kvm/hyp/include/nvhe/gfp.h         |   2 +-
 arch/arm64/kvm/hyp/nvhe/page_alloc.c          |   3 +-
 arch/csky/Kconfig                             |   2 +-
 arch/ia64/Kconfig                             |   2 +-
 arch/ia64/include/asm/sparsemem.h             |   6 +-
 arch/m68k/Kconfig.cpu                         |   2 +-
 arch/mips/Kconfig                             |   2 +-
 arch/nios2/Kconfig                            |   2 +-
 arch/powerpc/Kconfig                          |   2 +-
 arch/powerpc/configs/85xx/ge_imp3a_defconfig  |   2 +-
 arch/powerpc/configs/fsl-emb-nonhw.config     |   2 +-
 arch/sh/configs/ecovec24_defconfig            |   2 +-
 arch/sh/mm/Kconfig                            |   2 +-
 arch/sparc/Kconfig                            |   2 +-
 arch/x86/entry/vdso/Makefile                  |   1 +
 arch/xtensa/Kconfig                           |   2 +-
 drivers/gpu/drm/ttm/ttm_device.c              |   7 +-
 drivers/gpu/drm/ttm/ttm_pool.c                |  58 +++++++++-
 drivers/virtio/virtio_balloon.c               |   2 +-
 drivers/virtio/virtio_mem.c                   |  12 +-
 fs/proc/kcore.c                               |   2 +-
 include/drm/ttm/ttm_pool.h                    |   4 +-
 include/linux/memory_hotplug.h                |   1 +
 include/linux/mmzone.h                        |  56 ++++++++-
 include/linux/pageblock-flags.h               |   7 +-
 include/linux/slab.h                          |   8 +-
 mm/Kconfig                                    |  16 +++
 mm/compaction.c                               |  20 ++--
 mm/hugetlb.c                                  |   2 +-
 mm/internal.h                                 |   4 +-
 mm/memory_hotplug.c                           |  18 ++-
 mm/page_alloc.c                               | 108 ++++++++++++++++--
 mm/page_isolation.c                           |   7 +-
 mm/page_owner.c                               |  14 ++-
 mm/page_reporting.c                           |   3 +-
 mm/slab.c                                     |   2 +-
 mm/slub.c                                     |   7 +-
 mm/vmscan.c                                   |   1 -
 48 files changed, 334 insertions(+), 86 deletions(-)

Comments

Vlastimil Babka Aug. 6, 2021, 3:36 p.m. UTC | #1
On 8/5/21 9:02 PM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>

> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
> pages across memory sections. The check was removed when ARM64 gets rid of holes
> in zones, but holes can appear in zones again after this patchset.

To me that's most unwelcome resurrection. I kinda missed it was going away and
now I can't even rejoice? I assume the systems that will be bumping max_order
have a lot of memory. Are they going to have many holes? What if we just
sacrificed the memory that would have a hole and don't add it to buddy at all?
David Hildenbrand Aug. 6, 2021, 4:16 p.m. UTC | #2
On 06.08.21 17:36, Vlastimil Babka wrote:
> On 8/5/21 9:02 PM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
> 
>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
>> pages across memory sections. The check was removed when ARM64 gets rid of holes
>> in zones, but holes can appear in zones again after this patchset.
> 
> To me that's most unwelcome resurrection. I kinda missed it was going away and
> now I can't even rejoice? I assume the systems that will be bumping max_order
> have a lot of memory. Are they going to have many holes? What if we just
> sacrificed the memory that would have a hole and don't add it to buddy at all?

I think the old implementation was just horrible and the description we 
have here still suffers from that old crap: "but holes can appear in 
zones again". No, it's not related to holes in zones at all. We can have 
MAX_ORDER -1 pages that are partially a hole.

And to be precise, "hole" here means "there is no memmap" and not "there 
is a hole but it has a valid memmap".

But IIRC, we now have under SPARSEMEM always a complete memmap for
complete memory sections (when talking about system RAM; ZONE_DEVICE is
different, but we don't really care for now I think).

So instead of introducing what we had before, I think we should look 
into something that doesn't confuse each person that stumbles over it 
out there. What does pfn_valid_within() even mean in the new context? 
pfn_valid() is most probably no longer what we really want, as we're 
dealing with multiple sections that might be online or offline; in the 
old world, this was different, as a MAX_ORDER -1 page was completely 
contained in a memory section that was either online or offline.

I'd imagine something that expresses something different in the context 
of sparsemem:

"Some page orders, such as MAX_ORDER -1, might span multiple memory 
sections. Each memory section has a completely valid memmap if online. 
Memory sections might either be completely online or completely offline. 
pfn_to_online_page() might succeed on one part of a MAX_ORDER - 1 page, 
but not on another part. But it will certainly be consistent within one 
memory section."

Further, as we know that MAX_ORDER -1 and memory sections are a power of 
two, we can actually do a binary search to identify boundaries, instead 
of having to check each and every page in the range.

Is what I describe the actual reason why we introduce pfn_valid_within() 
? (and might better introduce something new, with a better fitting name?)
Vlastimil Babka Aug. 6, 2021, 4:32 p.m. UTC | #3
On 8/5/21 9:02 PM, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset add support for kernel boot time adjustable MAX_ORDER, so that
> user can change the largest size of pages obtained from buddy allocator. It also
> removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that
> buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is
> set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51.
> 
> Motivation
> ===
> 
> This enables kernel to allocate 1GB pages and is necessary for my ongoing work
> on adding support for 1GB PUD THP[1]. This is also the conclusion I came up with
> after some discussion with David Hildenbrand on what methods should be used for
> allocating gigantic pages[2], since other approaches like using CMA allocator or
> alloc_contig_pages() are regarded as suboptimal.

I see references [1] [2] etc but no list of links at the end, did you
forgot to include?
Vlastimil Babka Aug. 6, 2021, 4:54 p.m. UTC | #4
On 8/6/21 6:16 PM, David Hildenbrand wrote:
> On 06.08.21 17:36, Vlastimil Babka wrote:
>> On 8/5/21 9:02 PM, Zi Yan wrote:
>>> From: Zi Yan <ziy@nvidia.com>
>>
>>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
>>> pages across memory sections. The check was removed when ARM64 gets rid of holes
>>> in zones, but holes can appear in zones again after this patchset.
>>
>> To me that's most unwelcome resurrection. I kinda missed it was going away and
>> now I can't even rejoice? I assume the systems that will be bumping max_order
>> have a lot of memory. Are they going to have many holes? What if we just
>> sacrificed the memory that would have a hole and don't add it to buddy at all?
> 
> I think the old implementation was just horrible and the description we have
> here still suffers from that old crap: "but holes can appear in zones again".
> No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages
> that are partially a hole.
> 
> And to be precise, "hole" here means "there is no memmap" and not "there is a
> hole but it has a valid memmap".

Yes.

> But IIRC, we now have under SPARSEMEM always a complete memmap for a complete
> memory sections (when talking about system RAM, ZONE_DEVICE is different but we
> don't really care for now I think).
> 
> So instead of introducing what we had before, I think we should look into
> something that doesn't confuse each person that stumbles over it out there. What
> does pfn_valid_within() even mean in the new context? pfn_valid() is most
> probably no longer what we really want, as we're dealing with multiple sections
> that might be online or offline; in the old world, this was different, as a
> MAX_ORDER -1 page was completely contained in a memory section that was either
> online or offline.
> 
> I'd imagine something that expresses something different in the context of
> sparsemem:
> 
> "Some page orders, such as MAX_ORDER -1, might span multiple memory sections.
> Each memory section has a completely valid memmap if online. Memory sections
> might either be completely online or completely offline. pfn_to_online_page()
> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But
> it will certainly be consistent within one memory section."
> 
> Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we
> can actually do a binary search to identify boundaries, instead of having to
> check each and every page in the range.
> 
> Is what I describe the actual reason why we introduce pfn_valid_within() ? (and
> might better introduce something new, with a better fitting name?)

What I don't like is mainly the re-addition of pfn_valid_within() (or whatever
we'd call it) into __free_one_page() for performance reasons, and also to
various pfn scanners (compaction) for performance and "I must not forget to
check this, or do I?" confusion reasons. It would be really great if we could
keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to
achieve that:

1. we create memmap for MAX_ORDER blocks, pages in sections not online are
marked as reserved or some other state that allows us to do checks such as "is
there a buddy? no" without accessing a missing memmap
2. smaller blocks than MAX_ORDER are not released to buddy allocator

I think 1 would be more work, but less wasteful in the end?
David Hildenbrand Aug. 6, 2021, 5:08 p.m. UTC | #5
On 06.08.21 18:54, Vlastimil Babka wrote:
> On 8/6/21 6:16 PM, David Hildenbrand wrote:
>> On 06.08.21 17:36, Vlastimil Babka wrote:
>>> On 8/5/21 9:02 PM, Zi Yan wrote:
>>>> From: Zi Yan <ziy@nvidia.com>
>>>
>>>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
>>>> pages across memory sections. The check was removed when ARM64 gets rid of holes
>>>> in zones, but holes can appear in zones again after this patchset.
>>>
>>> To me that's most unwelcome resurrection. I kinda missed it was going away and
>>> now I can't even rejoice? I assume the systems that will be bumping max_order
>>> have a lot of memory. Are they going to have many holes? What if we just
>>> sacrificed the memory that would have a hole and don't add it to buddy at all?
>>
>> I think the old implementation was just horrible and the description we have
>> here still suffers from that old crap: "but holes can appear in zones again".
>> No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages
>> that are partially a hole.
>>
>> And to be precise, "hole" here means "there is no memmap" and not "there is a
>> hole but it has a valid memmap".
> 
> Yes.
> 
>> But IIRC, we now have under SPARSEMEM always a complete memmap for a complete
>> memory sections (when talking about system RAM, ZONE_DEVICE is different but we
>> don't really care for now I think).
>>
>> So instead of introducing what we had before, I think we should look into
>> something that doesn't confuse each person that stumbles over it out there. What
>> does pfn_valid_within() even mean in the new context? pfn_valid() is most
>> probably no longer what we really want, as we're dealing with multiple sections
>> that might be online or offline; in the old world, this was different, as a
>> MAX_ORDER -1 page was completely contained in a memory section that was either
>> online or offline.
>>
>> I'd imagine something that expresses something different in the context of
>> sparsemem:
>>
>> "Some page orders, such as MAX_ORDER -1, might span multiple memory sections.
>> Each memory section has a completely valid memmap if online. Memory sections
>> might either be completely online or completely offline. pfn_to_online_page()
>> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But
>> it will certainly be consistent within one memory section."
>>
>> Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we
>> can actually do a binary search to identify boundaries, instead of having to
>> check each and every page in the range.
>>
>> Is what I describe the actual reason why we introduce pfn_valid_within() ? (and
>> might better introduce something new, with a better fitting name?)
> 
> What I don't like is mainly the re-addition of pfn_valid_within() (or whatever
> we'd call it) into __free_one_page() for performance reasons, and also to
> various pfn scanners (compaction) for performance and "I must not forget to
> check this, or do I?" confusion reasons. It would be really great if we could
> keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to
> achieve that:
> 
> 1. we create memmap for MAX_ORDER blocks, pages in sections not online are
> marked as reserved or some other state that allows us to do checks such as "is
> there a buddy? no" without accessing a missing memmap
> 2. smaller blocks than MAX_ORDER are not released to buddy allocator
> 
> I think 1 would be more work, but less wasteful in the end?

It will end up seriously messing with memory hot(un)plug. It's not 
sufficient if there is a memmap (pfn_valid()), it has to be online 
(pfn_to_online_page()) to actually have a meaning.

So you'd have to  allocate a memmap for all such memory sections, 
initialize it to all PG_Reserved ("memory hole") and mark these memory 
sections online. Further, you need memory block devices that are 
initialized and online.

So far so good, although wasteful. What happens if someone hotplugs a 
memory block that doesn't span a complete MAX_ORDER -1 page? Broken.


The only "workaround" would be requiring that MAX_ORDER - 1 cannot be 
bigger than memory blocks (memory_block_size_bytes()). The memory block 
size determines our hot(un)plug granularity and can (on some archs 
already) be determined at runtime. As both (MAX_ORDER and 
memory_block_size_bytes) would be determined at runtime, for example, by 
an admin explicitly requesting it, this might be feasible.
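
Something like this (sketch only, names made up) wherever the boot parameter
gets validated:

/*
 * Sketch only: reject a boot-time MAX_ORDER whose largest buddy block
 * would exceed the hot(un)plug granularity reported by
 * memory_block_size_bytes().
 */
static bool __init max_order_fits_memory_block(unsigned int max_order)
{
	unsigned long max_block_bytes = PAGE_SIZE << (max_order - 1);

	return max_block_bytes <= memory_block_size_bytes();
}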


Memory hot(un)plug / onlining / offlining would most probably work 
naturally (although the hot(un)plug granularity is then limited to e.g., 
1GiB memory blocks). But if that's what an admin requests on the command 
line, so be it.

What might need some thought, though, is having overlapping 
sections/such memory blocks with devmem. Sub-section hotadd has to 
continue working unless we want to break some PMEM devices seriously.
Zi Yan Aug. 6, 2021, 5:19 p.m. UTC | #6
On 6 Aug 2021, at 12:32, Vlastimil Babka wrote:

> On 8/5/21 9:02 PM, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi all,
>>
>> This patchset add support for kernel boot time adjustable MAX_ORDER, so that
>> user can change the largest size of pages obtained from buddy allocator. It also
>> removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that
>> buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is
>> set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51.
>>
>> Motivation
>> ===
>>
>> This enables kernel to allocate 1GB pages and is necessary for my ongoing work
>> on adding support for 1GB PUD THP[1]. This is also the conclusion I came up with
>> after some discussion with David Hildenbrand on what methods should be used for
>> allocating gigantic pages[2], since other approaches like using CMA allocator or
>> alloc_contig_pages() are regarded as suboptimal.
>
> I see references [1] [2] etc but no list of links at the end, did you
> forgot to include?
Ah, I missed that. Sorry.

Here are the links:

[1] https://lore.kernel.org/linux-mm/20200928175428.4110504-1-zi.yan@sent.com/
[2] https://lore.kernel.org/linux-mm/e132fdd9-65af-1cad-8a6e-71844ebfe6a2@redhat.com/
[3] https://lore.kernel.org/linux-mm/289DA3C0-9AE5-4992-A35A-C13FCE4D8544@nvidia.com/

BTW, [3] is the performance test I did to compare MAX_ORDER=11 and MAX_ORDER=20,
which showed no performance impact of increasing MAX_ORDER.

In addition, I would like to share more detail on my plan for supporting 1GB PUD THP.
This patchset is the first step, enabling the kernel to allocate 1GB pages, so that
users can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
alloc_contig_pages() or the CMA allocator. The next step is to improve kernel memory
fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
is still limited by memory section size. As a result, I will explore solutions
like having additional larger pageblocks (up to MAX_ORDER) to counter memory
fragmentation. I will discover what else needs to be solved as I gradually improve
1GB PUD THP support.

—
Best Regards,
Yan, Zi
Zi Yan Aug. 6, 2021, 6:24 p.m. UTC | #7
On 6 Aug 2021, at 13:08, David Hildenbrand wrote:

> On 06.08.21 18:54, Vlastimil Babka wrote:
>> On 8/6/21 6:16 PM, David Hildenbrand wrote:
>>> On 06.08.21 17:36, Vlastimil Babka wrote:
>>>> On 8/5/21 9:02 PM, Zi Yan wrote:
>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>
>>>>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
>>>>> pages across memory sections. The check was removed when ARM64 gets rid of holes
>>>>> in zones, but holes can appear in zones again after this patchset.
>>>>
>>>> To me that's most unwelcome resurrection. I kinda missed it was going away and
>>>> now I can't even rejoice? I assume the systems that will be bumping max_order
>>>> have a lot of memory. Are they going to have many holes? What if we just
>>>> sacrificed the memory that would have a hole and don't add it to buddy at all?
>>>
>>> I think the old implementation was just horrible and the description we have
>>> here still suffers from that old crap: "but holes can appear in zones again".
>>> No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages
>>> that are partially a hole.
>>>
>>> And to be precise, "hole" here means "there is no memmap" and not "there is a
>>> hole but it has a valid memmap".
>>
>> Yes.
>>
>>> But IIRC, we now have under SPARSEMEM always a complete memmap for a complete
>>> memory sections (when talking about system RAM, ZONE_DEVICE is different but we
>>> don't really care for now I think).
>>>
>>> So instead of introducing what we had before, I think we should look into
>>> something that doesn't confuse each person that stumbles over it out there. What
>>> does pfn_valid_within() even mean in the new context? pfn_valid() is most
>>> probably no longer what we really want, as we're dealing with multiple sections
>>> that might be online or offline; in the old world, this was different, as a
>>> MAX_ORDER -1 page was completely contained in a memory section that was either
>>> online or offline.
>>>
>>> I'd imagine something that expresses something different in the context of
>>> sparsemem:
>>>
>>> "Some page orders, such as MAX_ORDER -1, might span multiple memory sections.
>>> Each memory section has a completely valid memmap if online. Memory sections
>>> might either be completely online or completely offline. pfn_to_online_page()
>>> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But
>>> it will certainly be consistent within one memory section."
>>>
>>> Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we
>>> can actually do a binary search to identify boundaries, instead of having to
>>> check each and every page in the range.
>>>
>>> Is what I describe the actual reason why we introduce pfn_valid_within() ? (and
>>> might better introduce something new, with a better fitting name?)
>>
>> What I don't like is mainly the re-addition of pfn_valid_within() (or whatever
>> we'd call it) into __free_one_page() for performance reasons, and also to
>> various pfn scanners (compaction) for performance and "I must not forget to
>> check this, or do I?" confusion reasons. It would be really great if we could
>> keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to
>> achieve that:
>>
>> 1. we create memmap for MAX_ORDER blocks, pages in sections not online are
>> marked as reserved or some other state that allows us to do checks such as "is
>> there a buddy? no" without accessing a missing memmap
>> 2. smaller blocks than MAX_ORDER are not released to buddy allocator
>>
>> I think 1 would be more work, but less wasteful in the end?
>
> It will end up seriously messing with memory hot(un)plug. It's not sufficient if there is a memmap (pfn_valid()), it has to be online (pfn_to_online_page()) to actually have a meaning.
>
> So you'd have to  allocate a memmap for all such memory sections, initialize it to all PG_Reserved ("memory hole") and mark these memory sections online. Further, you need memory block devices that are initialized and online.
>
> So far so good, although wasteful. What happens if someone hotplugs a memory block that doesn't span a complete MAX_ORDER -1 page? Broken.
>
>
> The only "workaround" would be requiring that MAX_ORDER - 1 cannot be bigger than memory blocks (memory_block_size_bytes()). The memory block size determines our hot(un)plug granularity and can (on some archs already) be determined at runtime. As both (MAX_ORDER and memory_block_size_bytes) would be determined at runtime, for example, by an admin explicitly requesting it, this might be feasible.
>
>
> Memory hot(un)plug / onlining / offlining would most probably work naturally (although the hot(un)plug granularity is then limited to e.g., 1GiB memory blocks). But if that's what an admin requests on the command line, so be it.
>
> What might need some thought, though, is having overlapping sections/such memory blocks with devmem. Sub-section hotadd has to continue working unless we want to break some PMEM devices seriously.

Thanks a lot for your valuable inputs!

Yes, this might work. But it seems to also defeat the purpose of sparse memory, which allows memmapping only present PFNs, right? Also, it requires a lot more intrusive changes, which might not be accepted easily.

I will look into the cost of the added pfn checks and try to optimize it. One thing I can think of is that these non-present PFNs should only appear at the beginning and at the end of a zone, since HOLES_IN_ZONE is gone. Maybe I just need to store and check the PFN range of a zone instead of checking memory section validity, and modify the zone PFN range during memory hot(un)plug. For offline pages in the middle of a zone, struct page still exists and PageBuddy() returns false, since PG_offline is set, right?
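
Something along these lines, maybe (just a sketch; zone_spans_pfn() already provides the range check):

/*
 * Sketch of the idea above: instead of checking memory section
 * validity on every merge, reject PFNs outside the zone span, which
 * is already kept up to date across memory hot(un)plug.
 */
static inline bool pfn_within_zone_span(struct zone *zone, unsigned long pfn)
{
	return pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone);
}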


—
Best Regards,
Yan, Zi
Hugh Dickins Aug. 6, 2021, 8:27 p.m. UTC | #8
On Fri, 6 Aug 2021, Zi Yan wrote:
> 
> In addition, I would like to share more detail on my plan on supporting 1GB PUD THP.
> This patchset is the first step, enabling kernel to allocate 1GB pages, so that
> user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
> alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory
> fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
> is still limited by memory section size. As a result, I will explore solutions
> like having additional larger pageblocks (up to MAX_ORDER) to counter memory
> fragmentation. I will discover what else needs to be solved as I gradually improve
> 1GB PUD THP support.

Sorry to be blunt, but let me state my opinion: 2MB THPs have given and
continue to give us more than enough trouble.  Complicating the kernel's
mm further, just to allow 1GB THPs, seems a very bad tradeoff to me.  I
understand that it's an appealing personal project; but for the sake of
all the rest of us, please leave 1GB huge pages to hugetlbfs (until
the day when we are all using 2MB base pages).

Hugh
Zi Yan Aug. 6, 2021, 9:26 p.m. UTC | #9
On 6 Aug 2021, at 16:27, Hugh Dickins wrote:

> On Fri, 6 Aug 2021, Zi Yan wrote:
>>
>> In addition, I would like to share more detail on my plan on supporting 1GB PUD THP.
>> This patchset is the first step, enabling kernel to allocate 1GB pages, so that
>> user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
>> alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory
>> fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
>> is still limited by memory section size. As a result, I will explore solutions
>> like having additional larger pageblocks (up to MAX_ORDER) to counter memory
>> fragmentation. I will discover what else needs to be solved as I gradually improve
>> 1GB PUD THP support.
>
> Sorry to be blunt, but let me state my opinion: 2MB THPs have given and
> continue to give us more than enough trouble.  Complicating the kernel's
> mm further, just to allow 1GB THPs, seems a very bad tradeoff to me.  I
> understand that it's an appealing personal project; but for the sake of
> of all the rest of us, please leave 1GB huge pages to hugetlbfs (until
> the day when we are all using 2MB base pages).

I do not agree with you. 2MB THP provides good performance, while letting us
keep using 4KB base pages. The 2MB THP implementation is the price we pay
to get the performance. This patchset removes the tie between MAX_ORDER
and section size to allow >2MB page allocation, which is useful in many
places. 1GB THP is one of the users. Gigantic pages also improve
device performance, like GPUs (e.g., AMD GPUs can use any power of two up to
1GB pages[1], which I just learnt). Also could you point out which part
of my patchset complicates kernel’s mm? I could try to simplify it if
possible.

In addition, I am not sure hugetlbfs is the way to go. THP is managed by
core mm, whereas hugetlbfs has its own code for memory management.
As hugetlbfs gets popular, more core mm functionalities have been
replicated and added to hugetlbfs codebase. It is not a good tradeoff
either. One of the reasons I work on 1GB THP is that Roman from Facebook
explicitly mentioned they want to use THP in place of hugetlbfs[2].

I think it might be more constructive to point out the existing issues
in THP so that we can improve the code together. BTW, I am also working
on simplifying THP code like generalizing THP split[3] and planning to
simplify page table manipulation code by reviving Kirill’s idea[4].

[1] https://lore.kernel.org/linux-mm/bdec12bd-9188-9f3e-c442-aa33e25303a6@amd.com/
[2] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/
[3] https://lwn.net/Articles/837928/
[4] https://lore.kernel.org/linux-mm/20180424154355.mfjgkf47kdp2by4e@black.fi.intel.com/

—
Best Regards,
Yan, Zi
Matthew Wilcox Aug. 7, 2021, 1:10 a.m. UTC | #10
On Fri, Aug 06, 2021 at 01:27:27PM -0700, Hugh Dickins wrote:
> On Fri, 6 Aug 2021, Zi Yan wrote:
> > 
> > In addition, I would like to share more detail on my plan on supporting 1GB PUD THP.
> > This patchset is the first step, enabling kernel to allocate 1GB pages, so that
> > user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
> > alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory
> > fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
> > is still limited by memory section size. As a result, I will explore solutions
> > like having additional larger pageblocks (up to MAX_ORDER) to counter memory
> > fragmentation. I will discover what else needs to be solved as I gradually improve
> > 1GB PUD THP support.
> 
> Sorry to be blunt, but let me state my opinion: 2MB THPs have given and
> continue to give us more than enough trouble.  Complicating the kernel's
> mm further, just to allow 1GB THPs, seems a very bad tradeoff to me.  I
> understand that it's an appealing personal project; but for the sake of
> of all the rest of us, please leave 1GB huge pages to hugetlbfs (until
> the day when we are all using 2MB base pages).

I respect your opinion, Hugh.  You, more than most of us, have spent an
inordinate amount of time debugging huge page related issues.  I also
share your misgivings about the potential performance improvements for
1GB pages.  They're too big for all but the most unusual of special cases.
This hasn't been helped by the scarce number of 1GB TLB entries in Intel
CPUs until very recently (and even those are hard to come by today).
I do not think they are of interest for the page cache (as I'm fond of
observing, if you have 7GB/s storage (eg the Samsung 980 Pro), you can
take seven page faults per second).

I am, however, of the opinion that 2MB pages give us so much trouble
because they're so very special.  Few people exercise those code paths and
it's easy to break them without noticing.  This is partly why I want to
do arbitrary-order pages.  If everybody is running with compound pages
all the time, we'll see the corner cases often, and people other than
Hugh, Kirill and Mike will be able to work on them.

Now, I'm not planning on working on arbitrary-order anonymous
pages myself.  I think I have enough to deal with in the page cache &
filesystems.  But I'm happy to help out when I can be useful.  I think
256kB pages are probably optimal at the moment for file-backed memory,
so I'm not planning on exploring the space above PMD_ORDER myself.
But there have already been some important areas of collaboration between
the 1GB effort and the folio effort.
Matthew Wilcox Aug. 7, 2021, 9:23 p.m. UTC | #11
On Sat, Aug 07, 2021 at 02:10:55AM +0100, Matthew Wilcox wrote:
> This hasn't been helped by the scarce number of 1GB TLB entries in Intel
> CPUs until very recently (and even those are hard to come by today).

Minor correction to this.  I just fixed x86info to work on my Core
i7-1165G7 (Tiger Lake, launched about a year ago) and discovered it has:

 L1 Store Only TLB: 1GB/4MB/2MB/4KB pages, fully associative, 16 entries
 L1 Load Only TLB: 1GB pages, fully associative, 8 entries
 L2 Unified TLB: 1GB/4KB pages, 8-way associative, 1024 entries

My prior laptop (i7-7500U, Kaby Lake, 2016) has only 4x 1GB TLB entries at
the L1 level, and no support for L2 1GB pages.  So this speaks to Intel
finally taking performance of 1GB TLB entries seriously.  Perhaps more
seriously than they need to for a laptop with 16GB of memory!  There are
Xeon-W versions of this CPU, so I imagine that there are versions which
can support 1TB or more memory.

I still think that 1GB pages are too big for anything but specialist
use cases, but those do exist.
Mike Rapoport Aug. 8, 2021, 7:41 a.m. UTC | #12
On Fri, Aug 06, 2021 at 06:54:14PM +0200, Vlastimil Babka wrote:
> On 8/6/21 6:16 PM, David Hildenbrand wrote:
> > On 06.08.21 17:36, Vlastimil Babka wrote:
> >> On 8/5/21 9:02 PM, Zi Yan wrote:
> >>> From: Zi Yan <ziy@nvidia.com>
> >>
> >>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
> >>> pages across memory sections. The check was removed when ARM64 gets rid of holes
> >>> in zones, but holes can appear in zones again after this patchset.
> >>
> >> To me that's most unwelcome resurrection. I kinda missed it was going away and
> >> now I can't even rejoice? I assume the systems that will be bumping max_order
> >> have a lot of memory. Are they going to have many holes? What if we just
> >> sacrificed the memory that would have a hole and don't add it to buddy at all?
> > 
> > I think the old implementation was just horrible and the description we have
> > here still suffers from that old crap: "but holes can appear in zones again".
> > No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages
> > that are partially a hole.
> > 
> > And to be precise, "hole" here means "there is no memmap" and not "there is a
> > hole but it has a valid memmap".
> 
> Yes.
> 
> > But IIRC, we now have under SPARSEMEM always a complete memmap for a complete
> > memory sections (when talking about system RAM, ZONE_DEVICE is different but we
> > don't really care for now I think).
> > 
> > So instead of introducing what we had before, I think we should look into
> > something that doesn't confuse each person that stumbles over it out there. What
> > does pfn_valid_within() even mean in the new context? pfn_valid() is most
> > probably no longer what we really want, as we're dealing with multiple sections
> > that might be online or offline; in the old world, this was different, as a
> > MAX_ORDER -1 page was completely contained in a memory section that was either
> > online or offline.
> > 
> > I'd imagine something that expresses something different in the context of
> > sparsemem:
> > 
> > "Some page orders, such as MAX_ORDER -1, might span multiple memory sections.
> > Each memory section has a completely valid memmap if online. Memory sections
> > might either be completely online or completely offline. pfn_to_online_page()
> > might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But
> > it will certainly be consistent within one memory section."
> > 
> > Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we
> > can actually do a binary search to identify boundaries, instead of having to
> > check each and every page in the range.
> > 
> > Is what I describe the actual reason why we introduce pfn_valid_within() ? (and
> > might better introduce something new, with a better fitting name?)
> 
> What I don't like is mainly the re-addition of pfn_valid_within() (or whatever
> we'd call it) into __free_one_page() for performance reasons, and also to
> various pfn scanners (compaction) for performance and "I must not forget to
> check this, or do I?" confusion reasons. It would be really great if we could
> keep a guarantee that memmap exists for MAX_ORDER blocks.

Maybe I'm missing something, but what if we use different constants to
define the maximal allocation order the buddy can handle and the maximal
size of a reasonable physically contiguous chunk?
I've skimmed through the uses of MAX_ORDER and it seems that many of those
could use, say, a pageblock_order of several megabytes rather than MAX_ORDER.

If we only use MAX_ORDER to denote the maximal allocation order and keep
PFN walkers using the smaller pageblock order, then we'll be able to have a
valid memory map in all the cases.

Then we can work out how to make migration/compaction etc. process 1G
pages.

> I see two ways to achieve that:
> 
> 1. we create memmap for MAX_ORDER blocks, pages in sections not online are
> marked as reserved or some other state that allows us to do checks such as "is
> there a buddy? no" without accessing a missing memmap
> 2. smaller blocks than MAX_ORDER are not released to buddy allocator
> 
> I think 1 would be more work, but less wasteful in the end?
Hugh Dickins Aug. 9, 2021, 4:04 a.m. UTC | #13
On Fri, 6 Aug 2021, Zi Yan wrote:
> On 6 Aug 2021, at 16:27, Hugh Dickins wrote:
> > On Fri, 6 Aug 2021, Zi Yan wrote:
> >>
> >> In addition, I would like to share more detail on my plan on supporting 1GB PUD THP.
> >> This patchset is the first step, enabling kernel to allocate 1GB pages, so that
> >> user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
> >> alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory
> >> fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
> >> is still limited by memory section size. As a result, I will explore solutions
> >> like having additional larger pageblocks (up to MAX_ORDER) to counter memory
> >> fragmentation. I will discover what else needs to be solved as I gradually improve
> >> 1GB PUD THP support.
> >
> > Sorry to be blunt, but let me state my opinion: 2MB THPs have given and
> > continue to give us more than enough trouble.  Complicating the kernel's
> > mm further, just to allow 1GB THPs, seems a very bad tradeoff to me.  I
> > understand that it's an appealing personal project; but for the sake of
> > of all the rest of us, please leave 1GB huge pages to hugetlbfs (until
> > the day when we are all using 2MB base pages).
> 
> I do not agree with you. 2MB THP provides good performance, while letting us
> keep using 4KB base pages. The 2MB THP implementation is the price we pay
> to get the performance. This patchset removes the tie between MAX_ORDER
> and section size to allow >2MB page allocation, which is useful in many
> places. 1GB THP is one of the users. Gigantic pages also improve
> device performance, like GPUs (e.g., AMD GPUs can use any power of two up to
> 1GB pages[1], which I just learnt). Also could you point out which part
> of my patchset complicates kernel’s mm? I could try to simplify it if
> possible.
> 
> In addition, I am not sure hugetlbfs is the way to go. THP is managed by
> core mm, whereas hugetlbfs has its own code for memory management.
> As hugetlbfs gets popular, more core mm functionalities have been
> replicated and added to hugetlbfs codebase. It is not a good tradeoff
> either. One of the reasons I work on 1GB THP is that Roman from Facebook
> explicitly mentioned they want to use THP in place of hugetlbfs[2].
> 
> I think it might be more constructive to point out the existing issues
> in THP so that we can improve the code together. BTW, I am also working
> on simplifying THP code like generalizing THP split[3] and planning to
> simplify page table manipulation code by reviving Kirill’s idea[4].

You may have good reasons for working on huge PUD entry support;
and perhaps we have different understandings of "THP".

Fragmentation: that's what horrifies me about 1GB THP.

The dark side of THP is compaction.  People have put in a lot of effort
to get compaction working as well as it currently does, but getting 512
adjacent 4k pages is not easy.  Getting 512*512 adjacent 4k pages is
very much harder.  Please put in the work on compaction before you
attempt to support 1GB THP.

Related fears: unexpected latencies; unacceptable variance between runs;
frequent rebooting of machines to get back to an unfragmented state;
page table code that most of us will never be in a position to test.

Sorry, no, I'm not reading your patches: that's not personal, it's
just that I've more than enough to do already, and must make choices.

Hugh

> 
> [1] https://lore.kernel.org/linux-mm/bdec12bd-9188-9f3e-c442-aa33e25303a6@amd.com/
> [2] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/
> [3] https://lwn.net/Articles/837928/
> [4] https://lore.kernel.org/linux-mm/20180424154355.mfjgkf47kdp2by4e@black.fi.intel.com/
> 
> —
> Best Regards,
> Yan, Zi
Hugh Dickins Aug. 9, 2021, 4:29 a.m. UTC | #14
On Sat, 7 Aug 2021, Matthew Wilcox wrote:
> 
> I am, however, of the opinion that 2MB pages give us so much trouble
> because they're so very special.  Few people exercise those code paths and
> it's easy to break them without noticing.  This is partly why I want to
> do arbitrary-order pages.  If everybody is running with compound pages
> all the time, we'll see the corner cases often, and people other than
> Hugh, Kirill and Mike will be able to work on them.

I don't entirely agree.  I'm all for your use of compound pages in page
cache, but don't think its problems are representative of the problems
in aiming for a PMD (or PUD) bar, with the weird page table transitions
we expect of "THP" there.

I haven't checked: is your use of compound pages in page cache still
limited to the HAVE_ARCH_TRANSPARENT_HUGEPAGE architectures?  When
the others could just as well use compound pages in page cache too.

Hugh
David Hildenbrand Aug. 9, 2021, 7:20 a.m. UTC | #15
On 06.08.21 20:24, Zi Yan wrote:
> On 6 Aug 2021, at 13:08, David Hildenbrand wrote:
> 
>> On 06.08.21 18:54, Vlastimil Babka wrote:
>>> On 8/6/21 6:16 PM, David Hildenbrand wrote:
>>>> On 06.08.21 17:36, Vlastimil Babka wrote:
>>>>> On 8/5/21 9:02 PM, Zi Yan wrote:
>>>>>> From: Zi Yan <ziy@nvidia.com>
>>>>>
>>>>>> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
>>>>>> pages across memory sections. The check was removed when ARM64 gets rid of holes
>>>>>> in zones, but holes can appear in zones again after this patchset.
>>>>>
>>>>> To me that's most unwelcome resurrection. I kinda missed it was going away and
>>>>> now I can't even rejoice? I assume the systems that will be bumping max_order
>>>>> have a lot of memory. Are they going to have many holes? What if we just
>>>>> sacrificed the memory that would have a hole and don't add it to buddy at all?
>>>>
>>>> I think the old implementation was just horrible and the description we have
>>>> here still suffers from that old crap: "but holes can appear in zones again".
>>>> No, it's not related to holes in zones at all. We can have MAX_ORDER -1 pages
>>>> that are partially a hole.
>>>>
>>>> And to be precise, "hole" here means "there is no memmap" and not "there is a
>>>> hole but it has a valid memmap".
>>>
>>> Yes.
>>>
>>>> But IIRC, we now have under SPARSEMEM always a complete memmap for a complete
>>>> memory sections (when talking about system RAM, ZONE_DEVICE is different but we
>>>> don't really care for now I think).
>>>>
>>>> So instead of introducing what we had before, I think we should look into
>>>> something that doesn't confuse each person that stumbles over it out there. What
>>>> does pfn_valid_within() even mean in the new context? pfn_valid() is most
>>>> probably no longer what we really want, as we're dealing with multiple sections
>>>> that might be online or offline; in the old world, this was different, as a
>>>> MAX_ORDER -1 page was completely contained in a memory section that was either
>>>> online or offline.
>>>>
>>>> I'd imagine something that expresses something different in the context of
>>>> sparsemem:
>>>>
>>>> "Some page orders, such as MAX_ORDER -1, might span multiple memory sections.
>>>> Each memory section has a completely valid memmap if online. Memory sections
>>>> might either be completely online or completely offline. pfn_to_online_page()
>>>> might succeed on one part of a MAX_ORDER - 1 page, but not on another part. But
>>>> it will certainly be consistent within one memory section."
>>>>
>>>> Further, as we know that MAX_ORDER -1 and memory sections are a power of two, we
>>>> can actually do a binary search to identify boundaries, instead of having to
>>>> check each and every page in the range.
>>>>
>>>> Is what I describe the actual reason why we introduce pfn_valid_within() ? (and
>>>> might better introduce something new, with a better fitting name?)
>>>
>>> What I don't like is mainly the re-addition of pfn_valid_within() (or whatever
>>> we'd call it) into __free_one_page() for performance reasons, and also to
>>> various pfn scanners (compaction) for performance and "I must not forget to
>>> check this, or do I?" confusion reasons. It would be really great if we could
>>> keep a guarantee that memmap exists for MAX_ORDER blocks. I see two ways to
>>> achieve that:
>>>
>>> 1. we create memmap for MAX_ORDER blocks, pages in sections not online are
>>> marked as reserved or some other state that allows us to do checks such as "is
>>> there a buddy? no" without accessing a missing memmap
>>> 2. smaller blocks than MAX_ORDER are not released to buddy allocator
>>>
>>> I think 1 would be more work, but less wasteful in the end?
>>
>> It will end up seriously messing with memory hot(un)plug. It's not sufficient if there is a memmap (pfn_valid()), it has to be online (pfn_to_online_page()) to actually have a meaning.
>>
>> So you'd have to  allocate a memmap for all such memory sections, initialize it to all PG_Reserved ("memory hole") and mark these memory sections online. Further, you need memory block devices that are initialized and online.
>>
>> So far so good, although wasteful. What happens if someone hotplugs a memory block that doesn't span a complete MAX_ORDER -1 page? Broken.
>>
>>
>> The only "workaround" would be requiring that MAX_ORDER - 1 cannot be bigger than memory blocks (memory_block_size_bytes()). The memory block size determines our hot(un)plug granularity and can (on some archs already) be determined at runtime. As both (MAX_ORDER and memory_block_size_bytes) would be determined at runtime, for example, by an admin explicitly requesting it, this might be feasible.
>>
>>
>> Memory hot(un)plug / onlining / offlining would most probably work naturally (although the hot(un)plug granularity is then limited to e.g., 1GiB memory blocks). But if that's what an admin requests on the command line, so be it.
>>
>> What might need some thought, though, is having overlapping sections/such memory blocks with devmem. Sub-section hotadd has to continue working unless we want to break some PMEM devices seriously.
> 
> Thanks a lot for your valuable inputs!
> 
> Yes, this might work. But it seems to also defeat the purpose of sparse memory, which allow only memmapping present PFNs, right?

Not really. It will only be suboptimal for corner cases.

Except for corner cases with devmem, we already always populate the memmap
for complete memory sections. Now, we would populate the memmap for all
memory sections spanning a MAX_ORDER - 1 page, if bigger than a section.

Will it matter in practice? I doubt it.

I consider 1 GiB allocations only relevant for really big machines. 
There, we don't really expect to have a lot of random memory holes. On a 
1 TiB machine, with 1 GiB memory blocks and 1 GiB max_order - 1, you 
don't expect to have a completely fragmented memory layout such that 
allocating additional memmap for some memory sections really makes a 
difference.

> Also it requires a lot more intrusive changes, which might not be accepted easily.

I guess it should require quite minimal changes in contrast to what you
propose. What we would have to do is

a) Check that the configured MAX_ORDER - 1 is effectively not bigger 
than the memory block size

b) Initialize all sections spanning a MAX_ORDER - 1 during boot, we 
won't even have to mess with memory blocks et al.

All that's required is parsing/processing early parameters in the right 
order.

That sounds much less intrusive compared to what you propose. Actually,
I think what you propose would be an optimization of that approach.


> I will look into the cost of the added pfn checks and try to optimize it. One thing I can think of is that these non-present PFNs should only appear at the beginning and at the end of a zone, since HOLES_IN_ZONE is gone, maybe I just need to store and check PFN range of a zone instead of checking memory section validity and modify the zone PFN range during memory hot(un)plug. For offline pages in the middle of a zone, struct page still exists and PageBuddy() returns false, since PG_offline is set, right?

I think we can have quite some crazy "sparse" layouts where you can have 
random holes within a zone, not only at the beginning/end.

Offline pages can be identified using pfn_to_online_page(). You must not 
touch their memmap, not even to check for PageBuddy(). PG_offline is a 
special case where pfn_to_online_page() returns true and the memmap is 
valid, however, the pages are logically offline and  might get logically 
onlined later -- primarily used in virtualized environments, for 
example, with memory ballooning.

You can treat PG_offline pages like they are online; they are just
accounted differently (!managed) and shouldn't be touched, but
otherwise they are just like any other allocated page.
David Hildenbrand Aug. 9, 2021, 7:41 a.m. UTC | #16
On 05.08.21 21:02, Zi Yan wrote:
> From: Zi Yan <ziy@nvidia.com>
> 
> Hi all,
> 
> This patchset add support for kernel boot time adjustable MAX_ORDER, so that
> user can change the largest size of pages obtained from buddy allocator. It also
> removes the restriction on MAX_ORDER based on SECTION_SIZE_BITS, so that
> buddy allocator can merge PFNs across memory sections when SPARSEMEM_VMEMMAP is
> set. It is on top of v5.14-rc4-mmotm-2021-08-02-18-51.
> 
> Motivation
> ===
> 
> This enables kernel to allocate 1GB pages and is necessary for my ongoing work
> on adding support for 1GB PUD THP[1]. This is also the conclusion I came up with
> after some discussion with David Hildenbrand on what methods should be used for
> allocating gigantic pages[2], since other approaches like using CMA allocator or
> alloc_contig_pages() are regarded as suboptimal.
> 
> This also prevents increasing SECTION_SIZE_BITS when increasing MAX_ORDER, since
> increasing SECTION_SIZE_BITS is not desirable as memory hotadd/hotremove chunk
> size will be increased as well, causing memory management difficulty for VMs.
> 
> In addition, make MAX_ORDER a kernel boot time parameter can enable user to
> adjust buddy allocator without recompiling the kernel for their own needs, so
> that one can still have a small MAX_ORDER if he/she does not need to allocate
> gigantic pages like 1GB PUD THPs.
> 
> Background
> ===
> 
> At the moment, kernel imposes MAX_ORDER - 1 + PAGE_SHFIT < SECTION_SIZE_BITS
> restriction. This prevents buddy allocator merging pages across memory sections,
> as PFNs might not be contiguous and code like page++ would fail. But this would
> not be an issue when SPARSEMEM_VMEMMAP is set, since all struct page are
> virtually contiguous. In addition, as long as buddy allocator checks the PFN
> validity during buddy page merging (done in Patch 3), pages allocated from
> buddy allocator can be manipulated by code like page++.
> 
> 
> Description
> ===
> 
> I tested the patchset on both x86_64 and ARM64 at 4KB, 16KB, and 64KB base
> pages. The systems boot and ltp mm test suite finished without issue. Also
> memory hotplug worked on x86_64 when I tested. It definitely needs more tests
> and reviews for other architectures.
> 
> In terms of the concerns on performance degradation if MAX_ORDER is increased,
> I did some initial performance tests comparing MAX_ORDER=11 and MAX_ORDER=20 on
> x86_64 machines and saw no performance difference[3].
> 
> Patch 1 excludes MAX_ORDER check from 32bit vdso compilation. The check uses
> irrelevant 32bit SECTION_SIZE_BITS during 64bit kernel compilation. The
> exclusion does not break the check in 32bit kernel, since the check will still
> be performed during other kernel component compilation.
> 
> Patch 2 gives FORCE_MAX_ZONEORDER a better name.
> 
> Patch 3 restores the pfn_valid_within() check when buddy allocator can merge
> pages across memory sections. The check was removed when ARM64 gets rid of holes
> in zones, but holes can appear in zones again after this patchset.
> 
> Patch 4-11 convert the use of MAX_ORDER to SECTION_SIZE_BITS or its derivative
> constants, since these code places use MAX_ORDER as boundary check for
> physically contiguous pages, where SECTION_SIZE_BITS should be used. After this
> patchset, MAX_ORDER can go beyond SECTION_SIZE_BITS, the code can break.
> I separate changes to different patches for easy review and can merge them into
> a single one if that works better.
> 
> Patch 12 adds a new Kconfig option SET_MAX_ORDER to allow specifying MAX_ORDER
> when ARCH_FORCE_MAX_ORDER is not used by the arch, like x86_64.
> 
> Patch 13 converts statically allocated arrays with MAX_ORDER length to dynamic
> ones if possible and prepares for making MAX_ORDER a boot time parameter.
> 
> Patch 14 adds a new MIN_MAX_ORDER constant to replace soon-to-be-dynamic
> MAX_ORDER for places where converting static array to dynamic one is causing
> hassle and not necessary, i.e., ARM64 hypervisor page allocation and SLAB.
> 
> Patch 15 finally changes MAX_ORDER to be a kernel boot time parameter.
> 
> 
> Any suggestion and/or comment is welcome. Thanks.
> 
> 
> TODO
> ===
> 
> 1. Redo the performance comparison tests using this patchset to understand the
>     performance implication of changing MAX_ORDER.


2. Make alloc_contig_range() cleanly deal with pageblock_order instead 
of MAX_ORDER - 1 to not force the minimal CMA area size/alignment to be 
e.g., 1 GiB instead of 4 MiB and to keep virtio-mem working as expected.

For virtio-mem, short term this would mean disallowing initialization when an
incompatible setup (MAX_ORDER_NR_PAGES > PAGES_PER_SECTION) is detected and
bailing out loudly so that the admin has to fix that on the command line.
I have optimizing alloc_contig_range() on my todo list, to get rid of
the MAX_ORDER - 1 dependency in virtio-mem; but I have no idea when I'll
have time to work on that.
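
The short-term check would be something like this in the virtio-mem probe
path (sketch only):

/*
 * Sketch only: refuse to initialize virtio-mem when a buddy block can
 * span more than one memory section, until alloc_contig_range() can
 * deal with that.
 */
if (MAX_ORDER_NR_PAGES > PAGES_PER_SECTION) {
	dev_err(&vdev->dev,
		"MAX_ORDER is too large for virtio-mem, please reduce it on the kernel command line\n");
	return -EINVAL;
}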
Matthew Wilcox Aug. 9, 2021, 11:22 a.m. UTC | #17
On Sun, Aug 08, 2021 at 09:29:29PM -0700, Hugh Dickins wrote:
> On Sat, 7 Aug 2021, Matthew Wilcox wrote:
> > 
> > I am, however, of the opinion that 2MB pages give us so much trouble
> > because they're so very special.  Few people exercise those code paths and
> > it's easy to break them without noticing.  This is partly why I want to
> > do arbitrary-order pages.  If everybody is running with compound pages
> > all the time, we'll see the corner cases often, and people other than
> > Hugh, Kirill and Mike will be able to work on them.
> 
> I don't entirely agree.  I'm all for your use of compound pages in page
> cache, but don't think its problems are representative of the problems
> in aiming for a PMD (or PUD) bar, with the weird page table transitions
> we expect of "THP" there.
> 
> I haven't checked: is your use of compound pages in page cache still
> limited to the HAVE_ARCH_TRANSPARENT_HUGEPAGE architectures?  When
> the others could just as well use compound pages in page cache too.

It is no longer gated by whether TRANSPARENT_HUGEPAGE is enabled or not.
That's a relatively recent change, and I can't say that I've tested it.
I'll give it a try today on a PA-RISC system I've booted recently.

One of the followup pieces of work that I hope somebody other than
myself will undertake is using 64KB PTEs on a 4KB PAGE_SIZE ARM/POWER
machine if the stars align.