[v3,0/5] Allocate memmap from hotadded memory

Message ID: 20190725160207.19579-1-osalvador@suse.de

Message

Oscar Salvador July 25, 2019, 4:02 p.m. UTC
Here we go with v3.

v2 -> v3:
        * Rewrite of the vmemmap pages handling.
          Prior to this version, I was (ab)using hugepage fields
          from struct page, while here I am officially adding a new
          sub-page type with the fields I need (see the sketch after
          this list).

        * Drop MHP_MEMMAP_{MEMBLOCK,DEVICE} in favor of MHP_MEMMAP_ON_MEMORY.
          While I am still not 100% sure this is the right decision, and
          while I still see some gain in having MHP_MEMMAP_{MEMBLOCK,DEVICE},
          having only one flag eases the code.
          If the user wants to allocate memmaps per memblock, it will
          have to call add_memory() variants with memory-block granularity.

          If a clearer use case for an MHP_MEMMAP_MEMBLOCK flag shows up
          in the future, so the user does not have to bother about the way
          it calls add_memory() variants but only passes a flag, we can add
          it then. I already have the code, so adding it later will be easy.

        * Granularity check when hot-removing memory.
          We just check that the hot-remove granularity matches the
          hot-add granularity.
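
For reference, new sub-page types are wired up via the page_type field in
include/linux/page-flags.h. A minimal sketch of what patch 2 adds, assuming
an unused PG_vmemmap bit (the actual value and helpers live in the patch
itself):

#define PG_vmemmap	0x00000800

/*
 * Generates PageVmemmap()/__SetPageVmemmap()/__ClearPageVmemmap(),
 * which test/set/clear the bit in page->page_type.
 */
PAGE_TYPE_OPS(Vmemmap, vmemmap)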

[Testing]

 - x86_64: small and large memblocks (128MB, 1G and 2G)

So far, only acpi memory hotplug uses the new flag.
The other callers can be changed depending on their needs.
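
For illustration, the ACPI path then ends up doing something like this (a
sketch based on drivers/acpi/acpi_memhotplug.c; the exact signature change
is in patch 1):

result = __add_memory(node, info->start_addr, info->length,
                      MHP_MEMMAP_ON_MEMORY);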

[Coverletter]

This is another step to make memory hotplug more usable. The primary
goal of this patchset is to reduce the memory overhead of hot-added
memory (at least for the SPARSEMEM_VMEMMAP memory model). The way we
currently populate the memmap (struct page array) has two main drawbacks:

a) it consumes additional memory until the hot-added memory itself is
   onlined and
b) the memmap might end up on a different NUMA node, which is especially
   true for the movable_node configuration.

a) is a problem especially for memory-hotplug-based memory "ballooning"
   solutions, where the delay between physical memory hotplug and
   onlining can lead to OOM; that led to the introduction of hacks like
   auto-onlining (see 31bc3858ea3e ("memory-hotplug: add automatic
   onlining policy for the newly added memory")).

b) can have performance drawbacks.

One way to mitigate all these issues is to simply allocate the memmap
array (which is the largest memory footprint of physical memory hotplug)
from the hot-added memory itself. The SPARSEMEM_VMEMMAP memory model
allows us to map any pfn range, so the memory does not need to be online
to be usable for the array. See patch 3 for more details.
This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.
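
To put a number on that memmap footprint (assuming x86_64 with 4KB base
pages and a 64-byte struct page, so the figures are illustrative):

  128MB memory block: 128MB / 4KB = 32768 pages
  memmap for it:      32768 * 64B = 2MB (512 4KB vmemmap pages, ~1.6%)
  1GB memory block:   16MB of memmap
  2GB memory block:   32MB of memmap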

[Overall design]:

Implementation-wise, we reuse the vmem_altmap infrastructure to override
the default allocator used by vmemmap_populate. Once the memmap is
allocated, we need a way to mark altmap pfns used for the allocation.
If the MHP_MEMMAP_ON_MEMORY flag was passed, we set up the layout of the
altmap structure at the beginning of __add_pages(), and then we call
mark_vmemmap_pages().
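
A minimal sketch of that altmap layout, following struct vmem_altmap from
include/linux/memremap.h (the exact setup is in patch 4; the pfn variables
are placeholders):

struct vmem_altmap altmap = {
	.base_pfn = phys_start_pfn,	/* first pfn of the hot-added range */
	.free	  = nr_pages,		/* pages vmemmap_populate() may consume */
};

/*
 * vmemmap_populate() now hands out pages from [base_pfn, base_pfn + free)
 * instead of allocating them, bumping altmap.alloc as it goes.
 */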

The MHP_MEMMAP_ON_MEMORY flag specifies that the memmaps are to be
allocated from the hot-added range.
If a caller wants memmaps to be allocated per memory block, it will
have to call add_memory() variants at memory-block granularity
spanning the whole range, while if it wants the memmap allocated for
the whole memory range, a single call will do.

E.g., to add 384MB (3 sections, 3 memory blocks):

add_memory(0x1000, size_memory_block);
add_memory(0x2000, size_memory_block);
add_memory(0x3000, size_memory_block);

or

add_memory(0x1000, size_memory_block * 3);

One thing worth mentioning is that vmemmap pages residing in movable
memory are not a show-stopper for that memory to be offlined/migrated.
Vmemmap pages are simply skipped in that case, and they stick around
until the sections referred to by those vmemmap pages are hot-removed.
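
In code, the skipping amounts to something like this in the isolation/
migration paths (a sketch; the real checks live in the page_isolation.c
and compaction.c hunks of this series):

for (pfn = start_pfn; pfn < end_pfn; pfn++) {
	struct page *page = pfn_to_page(pfn);

	/* self-hosted memmap: not migratable, but not an offline blocker */
	if (PageVmemmap(page))
		continue;

	/* ... regular isolation/migration handling ... */
}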

Oscar Salvador (5):
  mm,memory_hotplug: Introduce MHP_MEMMAP_ON_MEMORY
  mm: Introduce a new Vmemmap page-type
  mm,sparse: Add SECTION_USE_VMEMMAP flag
  mm,memory_hotplug: Allocate memmap from the added memory range for
    sparse-vmemmap
  mm,memory_hotplug: Allow userspace to enable/disable vmemmap

 arch/powerpc/mm/init_64.c      |   7 ++
 arch/s390/mm/init.c            |   6 ++
 arch/x86/mm/init_64.c          |  10 +++
 drivers/acpi/acpi_memhotplug.c |   3 +-
 drivers/base/memory.c          |  35 +++++++++-
 drivers/dax/kmem.c             |   2 +-
 drivers/hv/hv_balloon.c        |   2 +-
 drivers/s390/char/sclp_cmd.c   |   2 +-
 drivers/xen/balloon.c          |   2 +-
 include/linux/memory_hotplug.h |  37 ++++++++--
 include/linux/memremap.h       |   2 +-
 include/linux/mm.h             |  17 +++++
 include/linux/mm_types.h       |   5 ++
 include/linux/mmzone.h         |   8 ++-
 include/linux/page-flags.h     |  19 +++++
 mm/compaction.c                |   7 ++
 mm/memory_hotplug.c            | 153 +++++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c                |  26 ++++++-
 mm/page_isolation.c            |  14 +++-
 mm/sparse.c                    | 116 ++++++++++++++++++++++++++++++-
 20 files changed, 441 insertions(+), 32 deletions(-)

Comments

David Hildenbrand July 25, 2019, 4:56 p.m. UTC | #1
On 25.07.19 18:02, Oscar Salvador wrote:
> Here we go with v3.
> 
> v2 -> v3:
>         * Rewrite of the vmemmap pages handling.
>           Prior to this version, I was (ab)using hugepage fields
>           from struct page, while here I am officially adding a new
>           sub-page type with the fields I need.
> 
>         * Drop MHP_MEMMAP_{MEMBLOCK,DEVICE} in favor of MHP_MEMMAP_ON_MEMORY.
>           While I am still not 100% sure this is the right decision, and
>           while I still see some gain in having MHP_MEMMAP_{MEMBLOCK,DEVICE},
>           having only one flag eases the code.
>           If the user wants to allocate memmaps per memblock, it will
>           have to call add_memory() variants with memory-block granularity.
> 
>           If a clearer use case for an MHP_MEMMAP_MEMBLOCK flag shows up
>           in the future, so the user does not have to bother about the way
>           it calls add_memory() variants but only passes a flag, we can add
>           it then. I already have the code, so adding it later will be easy.

FWIW, for now I think this is the right thing to do. Whoever roots for
it will have to propose an interface for how it is going to be used.
Otherwise, this is untested, dead code. Nobody wants that :)

> 
>         * Granularity check when hot-removing memory.
>           We just check that the hot-remove granularity matches the
>           hot-add granularity.

This is for the powernv/memtrace.c case, right?

> 
> [Testing]
> 
>  - x86_64: small and large memblocks (128MB, 1G and 2G)
> 
> So far, only acpi memory hotplug uses the new flag.
> The other callers can be changed depending on their needs.
> 
> [Coverletter]
> 
> This is another step to make memory hotplug more usable. The primary
> goal of this patchset is to reduce the memory overhead of hot-added
> memory (at least for the SPARSEMEM_VMEMMAP memory model). The way we
> currently populate the memmap (struct page array) has two main drawbacks:
> 
> a) it consumes additional memory until the hot-added memory itself is
>    onlined and
> b) the memmap might end up on a different NUMA node, which is especially
>    true for the movable_node configuration.
> 
> a) is a problem especially for memory-hotplug-based memory "ballooning"
>    solutions, where the delay between physical memory hotplug and
>    onlining can lead to OOM; that led to the introduction of hacks like
>    auto-onlining (see 31bc3858ea3e ("memory-hotplug: add automatic
>    onlining policy for the newly added memory")).
> 
> b) can have performance drawbacks.

We now also consume less NORMAL memory when onlining DIMMs to
ZONE_MOVABLE, as the vmemmap no longer ends up in the NORMAL zone -
which is nice. (not perfect, but nice :) )

I'm curious about how/when you are initializing the vmemmap and setting
all vmemmap pages to the new page type. Right now, we initialize it when
onlining memory - I will have a look at how you sorted that out :)

> 
> One way to mitigate all these issues is to simply allocate the memmap
> array (which is the largest memory footprint of physical memory hotplug)
> from the hot-added memory itself. The SPARSEMEM_VMEMMAP memory model
> allows us to map any pfn range, so the memory does not need to be online
> to be usable for the array. See patch 3 for more details.
> This feature is only usable when CONFIG_SPARSEMEM_VMEMMAP is set.
> 
> [Overall design]:
> 
> Implementation-wise, we reuse the vmem_altmap infrastructure to override
> the default allocator used by vmemmap_populate. Once the memmap is
> allocated, we need a way to mark altmap pfns used for the allocation.
> If the MHP_MEMMAP_ON_MEMORY flag was passed, we set up the layout of the
> altmap structure at the beginning of __add_pages(), and then we call
> mark_vmemmap_pages().
> 
> The MHP_MEMMAP_ON_MEMORY flag specifies that the memmaps are to be
> allocated from the hot-added range.
> If a caller wants memmaps to be allocated per memory block, it will
> have to call add_memory() variants at memory-block granularity
> spanning the whole range, while if it wants the memmap allocated for
> the whole memory range, a single call will do.

I assume you played with all kinds of onlining/offlining of affected
memory blocks, and especially that the vmemmap pages remain set to the
new page type?
Oscar Salvador Aug. 1, 2019, 7:39 a.m. UTC | #2
On Thu, Jul 25, 2019 at 06:02:02PM +0200, Oscar Salvador wrote:
> [quote of the cover letter snipped]

Gentle ping :-)
David Hildenbrand Aug. 1, 2019, 8:17 a.m. UTC | #3
On 01.08.19 09:39, Oscar Salvador wrote:
> On Thu, Jul 25, 2019 at 06:02:02PM +0200, Oscar Salvador wrote:
>> [quote of the cover letter snipped]
> 
> Gentle ping :-)
> 

I am not yet sure about two things:


1. Checking uninitialized pages for PageVmemmap() when onlining. I
consider this very bad.

I wonder if it would be better to remember, for each memory block, the
pfn offset to be used when onlining/offlining.

I have some patches that convert online_pages() to
__online_memory_block(struct memory_block *mem) - which fits the
current users perfectly. So taking the offset and processing only these
pages when onlining would be easy. To do the same for offline_pages(), we
first have to rework the memtrace code. But when offlining, all memmaps
have already been initialized.
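
Something like this, roughly (a sketch of the idea, not code from this
series; the nr_vmemmap_pages field is made up):

static int __online_memory_block(struct memory_block *mem)
{
	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;

	/* skip the self-hosted vmemmap at the start of the block */
	return online_pages(start_pfn + mem->nr_vmemmap_pages,
			    nr_pages - mem->nr_vmemmap_pages,
			    mem->online_type);
}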


2. Setting the vmemmap pages to the zone of the online type. This would
mean we would have unmovable data on pages marked as belonging to the
movable zone. I would suggest always setting them to the NORMAL zone when
onlining - and initializing the vmemmap of the vmemmap pages directly
during add_memory() instead.
Oscar Salvador Aug. 1, 2019, 8:39 a.m. UTC | #4
On Thu, Aug 01, 2019 at 10:17:23AM +0200, David Hildenbrand wrote:
> I am not yet sure about two things:
> 
> 
> 1. Checking uninitialized pages for PageVmemmap() when onlining. I
> consider this very bad.
> 
> I wonder if it would be better to remember, for each memory block, the
> pfn offset to be used when onlining/offlining.
> 
> I have some patches that convert online_pages() to
> __online_memory_block(struct memory_block *mem) - which fits the
> current users perfectly. So taking the offset and processing only these
> pages when onlining would be easy. To do the same for offline_pages(), we
> first have to rework the memtrace code. But when offlining, all memmaps
> have already been initialized.

This is true, I did not really like that either, but it was one of the
things I came up with.
I already have some ideas on how to avoid checking the page; I will work
on it.

> 2. Setting the vmemmap pages to the zone of the online type. This would
> mean we would have unmovable data on pages marked as belonging to the
> movable zone. I would suggest always setting them to the NORMAL zone when
> onlining - and initializing the vmemmap of the vmemmap pages directly
> during add_memory() instead.

IMHO, having vmemmap pages in ZONE_MOVABLE does not matter that much.
They are not counted as managed_pages, and they are not a show-stopper
for moving all the other data around (migration); they are just skipped.
Conceptually, they are not pages we can deal with.

I thought they should lie wherever the range lies.
Having said that, I do not oppose placing them in ZONE_NORMAL, as they
might fit there better under the theory that ZONE_NORMAL holds memory
that might not be movable/migratable.

As for initializing them in add_memory(), we cannot do that.
The first problem is that we need sparse_mem_map_populate to create
the mapping first, and to take the pages from our altmap.

Only then can we access and initialize those pages.
So we cannot do that in add_memory(), because it happens before all that.

And I really think that it fits much better in __add_pages() than in
add_memory().
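
The ordering constraint, roughly (a simplified flow based on the
description above, not the literal call chain):

add_memory()
  -> arch_add_memory()
       -> __add_pages()              # altmap layout set up here
            -> sparse/vmemmap setup  # sparse_mem_map_populate() creates the
                                     # mapping, pulling pages from the altmap
            -> mark_vmemmap_pages()  # only now can those pages be initialized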

Having said that, I would appreciate some comments on patch#3 and
patch#4, especially patch#4.
I would like to collect some feedback on those before sending a new
version.

Thanks David
David Hildenbrand Aug. 1, 2019, 8:44 a.m. UTC | #5
On 01.08.19 10:39, Oscar Salvador wrote:
> On Thu, Aug 01, 2019 at 10:17:23AM +0200, David Hildenbrand wrote:
>> I am not yet sure about two things:
>>
>>
>> 1. Checking uninitialized pages for PageVmemmap() when onlining. I
>> consider this very bad.
>>
>> I wonder if it would be better to remember, for each memory block, the
>> pfn offset to be used when onlining/offlining.
>>
>> I have some patches that convert online_pages() to
>> __online_memory_block(struct memory_block *mem) - which fits the
>> current users perfectly. So taking the offset and processing only these
>> pages when onlining would be easy. To do the same for offline_pages(), we
>> first have to rework the memtrace code. But when offlining, all memmaps
>> have already been initialized.
> 
> This is true, I did not really like that either, but it was one of the
> things I came up with.
> I already have some ideas on how to avoid checking the page; I will work
> on it.

I think it would be best if we find some way to skip the vmemmap part
completely during onlining/offlining (e.g., as discussed, via an offset
in the memory block or similar).

> 
>> 2. Setting the vmemmap pages to the zone of the online type. This would
>> mean we would have unmovable data on pages marked as belonging to the
>> movable zone. I would suggest always setting them to the NORMAL zone when
>> onlining - and initializing the vmemmap of the vmemmap pages directly
>> during add_memory() instead.
> 
> IMHO, having vmemmap pages in ZONE_MOVABLE does not matter that much.
> They are not counted as managed_pages, and they are not a show-stopper
> for moving all the other data around (migration); they are just skipped.
> Conceptually, they are not pages we can deal with.

I am not sure yet about the implications of having these pages belong to
a zone they don't really fit into, hmm. Will the pages be PG_reserved?

> 
> I thought they should lie wherever the range lies.
> Having said that, I do not oppose placing them in ZONE_NORMAL, as they
> might fit there better under the theory that ZONE_NORMAL holds memory
> that might not be movable/migratable.
> 
> As for initializing them in add_memory(), we cannot do that.
> The first problem is that we need sparse_mem_map_populate to create
> the mapping first, and to take the pages from our altmap.
> 
> Only then can we access and initialize those pages.
> So we cannot do that in add_memory(), because it happens before all that.
> 
> And I really think that it fits much better in __add_pages() than in
> add_memory().

Sorry, I rather meant when adding memory, not when onlining. But you
seem to do that already. :)

> 
> Having said that, I would appreciate some comments on patch#3 and
> patch#4, especially patch#4.

Will have a look!

> I would like to collect some feedback on those before sending a new version.
> 
> Thanks David
>
David Hildenbrand Aug. 1, 2019, 6:46 p.m. UTC | #6
On 25.07.19 18:02, Oscar Salvador wrote:
> [quote of the cover letter snipped]

Some more thoughts:

1. It can happen that pfn_online() for a vmemmap page returns either
true or false, depending on the state of the section. It could be that
the memory block holding the vmemmap is offline while another memory
block making use of it is online.

I guess this isn't bad (I assume it is similar for the altmap). However,
it could be that makedumpfile will exclude the vmemmap from dumps (as it
will usually only dump pages in sections marked online, if I am not wrong
- maybe it special-cases vmemmaps already). Also, it could be that it is
not saved/restored during hibernation. We'll have to verify.


2. memmap access when adding/removing memory

The memmap is initialized when onlining memory. We still have to clean
up accessing the memmap in remove_memory(). You seem to introduce new
users - which is bad, especially when removing memory we never onlined.

When removing memory, you shouldn't have to worry about any orders -
nobody should touch the memmap. I am aware that we still query the zone
- are there other users that touch the memmap when removing memory?


3. isolation/compaction

I am not sure if simply unconditionally skipping over vmemmap pages is a
good idea. I would have guessed it is better to hinder callers from even
triggering this.

E.g., only online the pieces that don't contain the vmemmap. When
offlining a memory block, only actually try to offline the pieces that
were onlined - excluding the vmemmap.

That might require some smaller reworks, but it shouldn't be too hard as
far as I can tell.
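
In code, that would boil down to something like this (a sketch, reusing
the per-memory-block vmemmap offset idea from #3; nr_vmemmap_pages is
made up):

/* online/offline only the part of the block that is not vmemmap */
online_pages(start_pfn + nr_vmemmap_pages,
	     nr_pages - nr_vmemmap_pages, online_type);
offline_pages(start_pfn + nr_vmemmap_pages,
	      nr_pages - nr_vmemmap_pages);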


4. mhp_flags and altmap with __add_pages()

I had hoped that we could handle the specifics of MEMMAP_ON_MEMORY
completely in add_memory() - nobody else needs MEMMAP_ON_MEMORY (we have
the generic altmap concept already).

So, set up the struct vmem_altmap in add_memory() and pass it directly.
During arch_add_memory(), nobody should be touching the vmemmap either
way, as it is completely uninitialized.

When we return from arch_add_memory() in add_memory(), we could then
initialize the memmap for the vmemmap pages (e.g., set them to
PageVmemmap) - via mhp_mark_vmemmap_pages() or such.

What exactly speaks against this approach (moving the MEMMAP_ON_MEMORY
handling completely out of __add_pages())? Am I missing some access that
could be evil while the pages are not mapped?

(I'd love to see __add_pages() only eat an altmap again, and keep the
MEMMAP_ON_MEMORY thingy specific to add_memory().)
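
Sketched out, that proposal reads roughly like this (a sketch;
mhp_mark_vmemmap_pages() is the placeholder name from above, not existing
API; the mhp_restrictions plumbing matches what arch_add_memory() takes
today):

/* in add_memory(), before calling arch_add_memory() */
struct vmem_altmap altmap = {
	.base_pfn = PHYS_PFN(start),
	.free	  = PHYS_PFN(size),
};
struct mhp_restrictions restrictions = { .altmap = &altmap };

arch_add_memory(nid, start, size, &restrictions);

/* nobody touched the vmemmap during arch_add_memory(); initialize it now */
mhp_mark_vmemmap_pages(&altmap);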