
[RFC,0/4] mm, memory_hotplug: allocate memmap from hotadded memory

Message ID 20181116101222.16581-1-osalvador@suse.com (mailing list archive)

Message

Oscar Salvador Nov. 16, 2018, 10:12 a.m. UTC
Hi,

this patchset is based on Michal's patchset [1].
Patch#1, patch#2 and patch#4 are mostly the same.
They just needed small changes to adapt them to the current code base,
so it seemed fair to keep them.

---------
Original cover:

This is another step to make memory hotplug more usable. The primary
goal of this patchset is to reduce the memory overhead of the hot-added
memory (at least for the SPARSE_VMEMMAP memory model). Currently we use
kmalloc to populate the memmap (struct page array), which has two main
drawbacks: a) it consumes additional memory until the hot-added memory
itself is onlined and b) the memmap might end up on a different NUMA node,
which is especially true for the movable_node configuration.

a) is a problem especially for memory hotplug based memory "ballooning"
solutions, where the delay between physical memory hotplug and the
onlining can lead to OOM, and that led to the introduction of hacks like auto
onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
policy for the newly added memory")).
b) can have performance drawbacks.

One way to mitigate both issues is to simply allocate the memmap array
(which is the largest memory footprint of physical memory hotplug)
from the hot-added memory itself. The VMEMMAP memory model allows us to map
any pfn range, so the memory doesn't need to be online to be usable
for the array. See patch 3 for more details. In short, I am reusing the
existing vmem_altmap which wants to achieve the same thing for nvdimm
device memory.
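
For a rough feel of the mechanics, here is a simplified sketch of how such a
carve-out can be described with the existing struct vmem_altmap; the names
around the altmap are illustrative, the exact wiring is in the patches
themselves:

    /*
     * Simplified sketch, not the actual patch code: carve the memmap for a
     * hot-added range out of the beginning of that very range by handing an
     * altmap down add_memory_resource() -> arch_add_memory() -> __add_pages().
     */
    unsigned long start_pfn = PHYS_PFN(start);
    unsigned long nr_pages  = size >> PAGE_SHIFT;
    struct vmem_altmap altmap = {
            .base_pfn = start_pfn,  /* first pfn of the hot-added range      */
            .free     = nr_pages,   /* pfns the memmap may be allocated from */
    };

    /*
     * With the altmap in place, vmemmap_populate() takes the struct page
     * array from [start_pfn, start_pfn + altmap.alloc) instead of using
     * kmalloc, so nothing is consumed from already-online memory.
     */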

There is also one potential drawback, though. If somebody uses memory
hotplug for 1G (gigantic) hugetlb pages then this scheme will obviously not
work for them because each memory block will contain a reserved
area. Large x86 machines will use 2G memblocks so at least one 1G page
will be available but this is still not 2G...
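
To put a number on it (assuming 4K base pages and a 64-byte struct page):

    memmap(2G block) = (2G / 4K) * 64 bytes = 32M

so a 2G memory block loses its first 32M to the memmap, the first 1G-aligned
gigantic page in the block can no longer be formed, and only the second one
remains usable.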

I am not really sure somebody does that and how reliably it can work
in practice. Nevertheless, I _believe_ that onlining more memory into
virtual machines is a much more common usecase. Anyway, if there ever is a
strong demand for such a usecase, we basically have 3 options: a) enlarge
memory blocks even more, b) enhance the altmap allocation strategy and reuse
low memory sections to host memmaps of other sections on the same NUMA
node, or c) make the memmap allocation strategy configurable to fall back to
the current allocation.

---------

An older version of this patchset would blow up because we were clearing the
pmds while we still had to reference pages backed by that memory.
I picked another approach which does not force us to touch arch-specific code
in that regard.

Overall design:

With the preface of:

    1) Whenever you hot-add a range, this is the same range that will be hot-removed.
       This is just because you can't remove half of a DIMM, in the same way you can't
       remove half of a device in qemu.
       A device/DIMM is added/removed as a whole.

    2) Every add_memory()->add_memory_resource()->arch_add_memory()->__add_pages()
       will use a new altmap because it is a different hot-added range.

    3) When you hot-remove a range, the sections will be removed sequentially,
       starting from the first section of the range and ending with the last one.

    4) hot-remove operations are protected by the hotplug lock, so no parallel operations
       can take place.

    The current design is as follows:

    hot-remove operation)

    - __kfree_section_memmap will be called for every section to be removed.
    - We catch the first vmemmap_page and we pin it to a global variable.
    - Further calls to __kfree_section_memmap will decrease the refcount of
      the vmemmap page without calling vmemmap_free().
      We defer the call to vmemmap_free() until all sections are removed.
    - If the refcount drops to 0, we know that we hit the last section.
    - We clear the global variable.
    - We call vmemmap_free for [last_section, current_vmemmap_page)

    In case we are hot-removing a range that used an altmap, the call to
    vmemmap_free() must be done backwards, because the beginning of the memory
    is used for the pagetables.
    Doing it this way, we ensure that by the time we remove the pagetables,
    those pages no longer have to be referenced.
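
    A rough sketch of that deferred-free logic (the helper names, vmemmap_size
    and SECTION_MEMMAP_SIZE are illustrative bookkeeping, not the actual patch
    code):

        static struct page *vmemmap_head; /* first vmemmap page of the range */

        static void free_section_memmap(struct page *memmap, struct vmem_altmap *altmap)
        {
                unsigned long addr, head, end;

                if (!vmemmap_head)
                        vmemmap_head = memmap;    /* pin the first vmemmap page */

                /*
                 * The head page's refcount was set to the number of sections
                 * when the range was added; every removed section drops one.
                 */
                if (!page_ref_dec_and_test(vmemmap_head))
                        return;                   /* more sections pending */

                head = (unsigned long)vmemmap_head;
                end  = head + vmemmap_size;       /* size of the whole vmemmap range */
                vmemmap_head = NULL;

                /*
                 * Tear the range down backwards, section by section, so that
                 * the pages at the beginning of the range - which back the
                 * pagetables - are freed last (see the trace below).
                 */
                for (addr = end; addr > head; addr -= SECTION_MEMMAP_SIZE)
                        vmemmap_free(addr - SECTION_MEMMAP_SIZE, addr, altmap);
        }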

    An example:

    (qemu) object_add memory-backend-ram,id=ram0,size=10G
    (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1

    - This has added: ffffea0004000000 - ffffea000427ffc0 (refcount: 80)

    When the refcount of ffffea0004000000 drops to 0, vmemmap_free()
    will be called in this way:

    vmemmap_free: start/end: ffffea000de00000 - ffffea000e000000
    vmemmap_free: start/end: ffffea000dc00000 - ffffea000de00000
    vmemmap_free: start/end: ffffea000da00000 - ffffea000dc00000
    vmemmap_free: start/end: ffffea000d800000 - ffffea000da00000
    vmemmap_free: start/end: ffffea000d600000 - ffffea000d800000
    vmemmap_free: start/end: ffffea000d400000 - ffffea000d600000
    vmemmap_free: start/end: ffffea000d200000 - ffffea000d400000
    vmemmap_free: start/end: ffffea000d000000 - ffffea000d200000
    vmemmap_free: start/end: ffffea000ce00000 - ffffea000d000000
    vmemmap_free: start/end: ffffea000cc00000 - ffffea000ce00000
    vmemmap_free: start/end: ffffea000ca00000 - ffffea000cc00000
    vmemmap_free: start/end: ffffea000c800000 - ffffea000ca00000
    vmemmap_free: start/end: ffffea000c600000 - ffffea000c800000
    vmemmap_free: start/end: ffffea000c400000 - ffffea000c600000
    vmemmap_free: start/end: ffffea000c200000 - ffffea000c400000
    vmemmap_free: start/end: ffffea000c000000 - ffffea000c200000
    vmemmap_free: start/end: ffffea000be00000 - ffffea000c000000
    ...
    ...
    vmemmap_free: start/end: ffffea0004000000 - ffffea0004200000


    [Testing]

    - Tested only on x86_64
    - Several tests were carried out with memblocks of different sizes.
    - Tests were performed adding different memory-range sizes
      from 512M to 60GB.

    [Todo]
    - Look into hotplug gigantic pages case

Before investing more effort, I would like to hear some opinions/thoughts/ideas.

[1] https://lore.kernel.org/lkml/20170801124111.28881-1-mhocko@kernel.org/

Michal Hocko (3):
  mm, memory_hotplug: cleanup memory offline path
  mm, memory_hotplug: provide a more generic restrictions for memory
    hotplug
  mm, sparse: rename kmalloc_section_memmap, __kfree_section_memmap

Oscar Salvador (1):
  mm, memory_hotplug: allocate memmap from the added memory range for
    sparse-vmemmap

 arch/arm64/mm/mmu.c            |   5 +-
 arch/ia64/mm/init.c            |   5 +-
 arch/powerpc/mm/init_64.c      |   2 +
 arch/powerpc/mm/mem.c          |   6 +-
 arch/s390/mm/init.c            |  12 +++-
 arch/sh/mm/init.c              |   6 +-
 arch/x86/mm/init_32.c          |   6 +-
 arch/x86/mm/init_64.c          |  17 ++++--
 include/linux/memory_hotplug.h |  35 ++++++++---
 include/linux/memremap.h       |  65 +++++++++++++++++++-
 include/linux/page-flags.h     |  18 ++++++
 kernel/memremap.c              |  12 ++--
 mm/compaction.c                |   3 +
 mm/hmm.c                       |   6 +-
 mm/memory_hotplug.c            | 133 ++++++++++++++++++++++++++++-------------
 mm/page_alloc.c                |  33 ++++++++--
 mm/page_isolation.c            |  13 +++-
 mm/sparse.c                    |  62 ++++++++++++++++---
 18 files changed, 345 insertions(+), 94 deletions(-)

Comments

David Hildenbrand Nov. 22, 2018, 9:21 a.m. UTC | #1
On 16.11.18 11:12, Oscar Salvador wrote:
> Hi,
> 
> this patchset is based on Michal's patchset [1].
> Patch#1, patch#2 and patch#4 are mostly the same.
> They just needed small changes to adapt them to the current code base,
> so it seemed fair to keep them.
> 
> ---------
> Original cover:
> 
> This is another step to make memory hotplug more usable. The primary
> goal of this patchset is to reduce the memory overhead of the hot-added
> memory (at least for the SPARSE_VMEMMAP memory model). Currently we use
> kmalloc to populate the memmap (struct page array), which has two main
> drawbacks: a) it consumes additional memory until the hot-added memory
> itself is onlined and b) the memmap might end up on a different NUMA node,
> which is especially true for the movable_node configuration.

I haven't looked at the patches but have some questions.

1. How are we going to present such memory to the system statistics?

In my opinion, this vmemmap memory should
a) still account to total memory
b) show up as allocated

So just like before.


2. Is this optional, in other words, can a device driver decide not to
do it like that?

You mention ballooning. Now, both XEN and Hyper-V (the only balloon
drivers that add new memory as of now) usually add e.g. a 128MB segment
but only actually use some part of it (e.g. 64MB, but it could vary). Now, going
ahead and assuming that all memory of a section can be read/written is
wrong. A device driver will indicate which pages may actually be used
via set_online_page_callback() when new memory is added. But at that
point you have already happily accessed some memory for the vmemmap - which might
lead to crashes.

For now the rule was: Memory that was not onlined will not be
read/written, that's why it works for XEN and Hyper-V.

It *could* work for them if they could know and communicate to
add_memory() which part of a newly added memory block is definitely usable.

So, especially for the case of ballooning that you describe, things are
more tricky than a simple "let's just use some memory of the memory
block we're adding" unfortunately. For DIMMs it can work.

> 
> a) is a problem especially for memory hotplug based memory "ballooning"
> solutions, where the delay between physical memory hotplug and the
> onlining can lead to OOM, and that led to the introduction of hacks like auto
> onlining (see 31bc3858ea3e ("memory-hotplug: add automatic onlining
> policy for the newly added memory")).
> b) can have performance drawbacks.
> 
> One way to mitigate both issues is to simply allocate the memmap array
> (which is the largest memory footprint of physical memory hotplug)
> from the hot-added memory itself. The VMEMMAP memory model allows us to map
> any pfn range, so the memory doesn't need to be online to be usable
> for the array. See patch 3 for more details. In short, I am reusing the
> existing vmem_altmap which wants to achieve the same thing for nvdimm
> device memory.
> 
> There is also one potential drawback, though. If somebody uses memory
> hotplug for 1G (gigantic) hugetlb pages then this scheme will obviously not
> work for them because each memory block will contain a reserved
> area. Large x86 machines will use 2G memblocks so at least one 1G page
> will be available but this is still not 2G...

Yes, I think this is a possible use case. So it would have to be
configurable somewehere - opt-in most probably. But related to
ballooning, they will usually add the minimum possible granularity (e.g.
128MB) and that seems to work for these setups. DIMMs are probably
different.

Oscar Salvador Nov. 23, 2018, 11:55 a.m. UTC | #2
On Thu, Nov 22, 2018 at 10:21:24AM +0100, David Hildenbrand wrote:
> 1. How are we going to present such memory to the system statistics?
> 
> In my opinion, this vmemmap memory should
> a) still account to total memory
> b) show up as allocated
> 
> So just like before.

No, it shows up neither under total memory nor as allocated memory.
This memory is not used for anything but creating the pagetables
for the memmap array for the section(s).

It is not memory that the system can use.

I also guess that if there is a strong opinion on this, we could create
a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.
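
(For illustration only - such a counter would follow the usual vmstat pattern;
none of this is part of the series:)

    /* include/linux/mmzone.h: new item at the end of enum node_stat_item,
     * plus the matching vmstat_text[] entry */
    NR_VMEMMAP_PAGES,       /* memmap pages carved out of hot-added ranges */

    /* accounted when the memmap is carved out of the hot-added range */
    mod_node_page_state(NODE_DATA(nid), NR_VMEMMAP_PAGES, nr_vmemmap_pages);

    /* fs/proc/meminfo.c */
    show_val_kb(m, "VmemmapPages:   ", global_node_page_state(NR_VMEMMAP_PAGES));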

> 2. Is this optional, in other words, can a device driver decide to not
> to it like that?

Right now, it is a per-arch setup.
For example, x86_64/powerpc/arm64 will do it unconditionally.

If we want to make this a per-device-driver thing, I guess that we could
allow passing a flag to add_memory()->add_memory_resource(), and there
unset MHP_MEMMAP_FROM_RANGE in case that flag is enabled.
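
Something along these lines, for instance (MHP_NO_MEMMAP_FROM_RANGE is a
hypothetical opt-out flag; the flags/restrictions plumbing is the one that
patch#2 introduces, so treat the details as illustrative):

    /* in add_memory_resource(), before the range goes to arch_add_memory() */
    if (flags & MHP_NO_MEMMAP_FROM_RANGE)           /* caller opted out */
            restrictions.flags &= ~MHP_MEMMAP_FROM_RANGE;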

> You mention ballooning. Now, both XEN and Hyper-V (the only balloon
> drivers that add new memory as of now), usually add e.g. a 128MB segment
> to only actually some part of it (e.g. 64MB, but could vary). Now, going
> ahead and assuming that all memory of a section can be read/written is
> wrong. A device driver will indicate which pages may actually be used
> via set_online_page_callback() when new memory is added. But at that
> point you already happily accessed some memory for vmmap - which might
> lead to crashes.
> 
> For now the rule was: Memory that was not onlined will not be
> read/written, that's why it works for XEN and Hyper-V.

We do not write all the memory of the hot-added section, we just write the
first 2MB (first 512 pages); the other 126MB are left untouched.

Assuming that you add a section-aligned memory chunk (128MB), but you only present
the first 64MB or 32MB to the guest as onlined, we still need to allocate the memmap
for the whole section.
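
(That is where the 2MB / 512 pages come from, assuming 4K base pages and a
64-byte struct page: a 128MB section has 32768 pages, so its memmap is
32768 * 64 bytes = 2MB, i.e. the first 512 page frames of the section.)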

I do not really know the tricks behind Hyper-V/Xen, could you expand on that?

So far I only tested this with qemu simulating large machines, but I plan
to try the ballooning thing on Xen.

At this moment I am working on a second version of this patchset
to address Dave's feedback.

----
Oscar Salvador
SUSE L3
David Hildenbrand Nov. 23, 2018, 12:11 p.m. UTC | #3
On 23.11.18 12:55, Oscar Salvador wrote:
> On Thu, Nov 22, 2018 at 10:21:24AM +0100, David Hildenbrand wrote:
>> 1. How are we going to present such memory to the system statistics?
>>
>> In my opinion, this vmemmap memory should
>> a) still account to total memory
>> b) show up as allocated
>>
>> So just like before.
> 
> No, it shows up neither under total memory nor as allocated memory.
> This memory is not used for anything but creating the pagetables
> for the memmap array for the section(s).
> 
> It is not memory that the system can use.
> 
> I also guess that if there is a strong opinion on this, we could create
> a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.

It's a change if we "hide" such memory. E.g. in a cloud environment you
request to add XGB to your system. You will not see XGB, and that can be
"problematic" with some customers :) - "But I am paying for additional
XGB". (Showing XGB but YMB as allocated is easier to argue with - "your
OS is using it").

> 
>> 2. Is this optional, in other words, can a device driver decide not to
>> do it like that?
> 
>> Right now, it is a per-arch setup.
>> For example, x86_64/powerpc/arm64 will do it unconditionally.

That could indeed break Hyper-V/XEN (if the granularity in which you can
add memory can be smaller than 2MB), or if you have bigger memory blocks.

> 
> If we want to make this a per-device-driver thing, I guess that we could
> allow passing a flag to add_memory()->add_memory_resource(), and there
> unset MHP_MEMMAP_FROM_RANGE in case that flag is enabled.
> 
>> You mention ballooning. Now, both XEN and Hyper-V (the only balloon
>> drivers that add new memory as of now) usually add e.g. a 128MB segment
>> but only actually use some part of it (e.g. 64MB, but it could vary). Now, going
>> ahead and assuming that all memory of a section can be read/written is
>> wrong. A device driver will indicate which pages may actually be used
>> via set_online_page_callback() when new memory is added. But at that
>> point you have already happily accessed some memory for the vmemmap - which might
>> lead to crashes.
>>
>> For now the rule was: Memory that was not onlined will not be
>> read/written, that's why it works for XEN and Hyper-V.
> 
> We do not write all the memory of the hot-added section, we just write the
> first 2MB (first 512 pages); the other 126MB are left untouched.

Then that has to be made a rule and we have to make sure that all users
(Hyper-V/XEN) can cope with that.

But it is more problematic because we could have 2GB memory blocks. Then
the 2MB rule no longer holds. Other archs have other sizes (e.g.
256MB on s390x).

> 
> Assuming that you add a section-aligned memory chunk (128MB), but you only present
> the first 64MB or 32MB to the guest as onlined, we still need to allocate the memmap
> for the whole section.

Yes, that's the right thing to do. (the section will be online but some
parts "fake offline")

> 
> I do not really know the tricks behind Hyper-V/Xen, could you expand on that?

Let's say you want to add 64MB on Hyper-V. What Linux will do is add a
new section (128MB) but only actually online, say, the first 64MB (I have
no idea if it has to be the first 64MB actually!).

It will keep the other pages "fake-offline" and online them later on
when e.g. adding another 64MB.

See drivers/hv/hv_balloon.c:
- set_online_page_callback(&hv_online_page);
- hv_bring_pgs_online() -> hv_page_online_one() -> has_pfn_is_backed()

The other 64MB must not be written (otherwise GP!) but may eventually be
read for e.g. dumping (although that is also shaky and I am fixing that
right now to make it more reliable).

Long story short: It is better to allow device drivers to make use of
the old behavior until they eventually can make sure that the "altmap?"
can be read/written when adding memory.

It represents a major change to the add_memory() interface.

> 
> So far I only tested this with qemu simulating large machines, but I plan
> to try the ballooning thing on Xen.
> 
> At this moment I am working on a second version of this patchset
> to address Dave's feedback.

Cool, keep me tuned :)

Michal Hocko Nov. 23, 2018, 12:42 p.m. UTC | #4
On Fri 23-11-18 12:55:41, Oscar Salvador wrote:
> On Thu, Nov 22, 2018 at 10:21:24AM +0100, David Hildenbrand wrote:
> > 1. How are we going to present such memory to the system statistics?
> > 
> > In my opinion, this vmemmap memory should
> > a) still account to total memory
> > b) show up as allocated
> > 
> > So just like before.
> 
> No, it shows up neither under total memory nor as allocated memory.
> This memory is not used for anything but creating the pagetables
> for the memmap array for the section(s).

I haven't read through your patches yet but wanted to clarify a few
points here.

This should essentially follow the bootmem allocated memory pattern. So
it is present and accounted to spanned pages but it is not managed.
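
To illustrate, for a hot-added block of nr_pages base pages of which
nr_vmemmap_pages back its own memmap, the bootmem-like accounting would
roughly be:

    spanned/present pages          grow by  nr_pages
    managed pages (and MemTotal)   grow by  nr_pages - nr_vmemmap_pages

i.e. the vmemmap pages are visible as present in the zone but are never
released to the page allocator, so they never become managed.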

> It is not memory that the system can use.

same as bootmem ;)
 
> I also guess that if there is a strong opinion on this, we could create
> a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.

Do we really have to? Isn't the number quite obvious from the size of
the hotplugged memory?

> 
> > 2. Is this optional, in other words, can a device driver decide not to
> > do it like that?
> 
> Right now, it is a per-arch setup.
> For example, x86_64/powerpc/arm64 will do it unconditionally.
> 
> If we want to make this a per-device-driver thing, I guess that we could
> allow passing a flag to add_memory()->add_memory_resource(), and there
> unset MHP_MEMMAP_FROM_RANGE in case that flag is enabled.

I believe we will need to make this opt-in. There are some usecases
which hotplug expensive (per size) memory and it would be
too wasteful to use it for struct pages. I haven't bothered to address
that with my previous patches because I just wanted to make the damn
thing work first.
David Hildenbrand Nov. 23, 2018, 12:51 p.m. UTC | #5
On 23.11.18 13:42, Michal Hocko wrote:
> On Fri 23-11-18 12:55:41, Oscar Salvador wrote:
>> On Thu, Nov 22, 2018 at 10:21:24AM +0100, David Hildenbrand wrote:
>>> 1. How are we going to present such memory to the system statistics?
>>>
>>> In my opinion, this vmemmap memory should
>>> a) still account to total memory
>>> b) show up as allocated
>>>
>>> So just like before.
>>
>> No, it shows up neither under total memory nor as allocated memory.
>> This memory is not used for anything but creating the pagetables
>> for the memmap array for the section(s).
> 
> I haven't read through your patches yet but wanted to clarify a few
> points here.
> 
> This should essentially follow the bootmem allocated memory pattern. So
> it is present and accounted to spanned pages but it is not managed.
> 
>> It is not memory that the system can use.
> 
> same as bootmem ;)

Fair enough, just saying that it represents a change :)

(but people also already complained if their VM has XGB but they don't
see actual XGB as total memory e.g. due to the crash kernel size)

>  
>> I also guess that if there is a strong opinion on this, we could create
>> a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.
> 
> Do we really have to? Isn't the number quite obvious from the size of
> the hotplugged memory?

At least the size of the vmemmaps cannot reliably be calculated from "MemTotal".
But maybe based on something else. (there, it is indeed obvious)

> 
>>
>>> 2. Is this optional, in other words, can a device driver decide not to
>>> do it like that?
>>
>> Right now, it is a per-arch setup.
>> For example, x86_64/powerpc/arm64 will do it unconditionally.
>>
>> If we want to make this a per-device-driver thing, I guess that we could
>> allow passing a flag to add_memory()->add_memory_resource(), and there
>> unset MHP_MEMMAP_FROM_RANGE in case that flag is enabled.
> 
> I believe we will need to make this opt-in. There are some usecases
> which hotplug expensive (per size) memory and it would be
> too wasteful to use it for struct pages. I haven't bothered to address
> that with my previous patches because I just wanted to make the damn
> thing work first.
> 

Good point.
Michal Hocko Nov. 23, 2018, 1:12 p.m. UTC | #6
On Fri 23-11-18 13:51:57, David Hildenbrand wrote:
> On 23.11.18 13:42, Michal Hocko wrote:
> > On Fri 23-11-18 12:55:41, Oscar Salvador wrote:
[...]
> >> It is not memory that the system can use.
> > 
> > same as bootmem ;)
> 
> Fair enough, just saying that it represents a change :)
> 
> (but people also already complained if their VM has XGB but they don't
> see actual XGB as total memory e.g. due to the crash kernel size)

I can imagine. I have seen many "where's my memory dude" questions... We
have so many unaccounted usages that it is simply impossible to see the
full picture of where the memory is consumed. The current implementation
would account memmaps in unreclaimable slabs but you still do not know
how much was spent for it...
 
> >> I also guess that if there is a strong opinion on this, we could create
> >> a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.
> > 
> > Do we really have to? Isn't the number quite obvious from the size of
> > the hotpluged memory?
> 
> At least the size of the vmemmaps cannot reliably be calculated from "MemTotal".
> But maybe based on something else. (there, it is indeed obvious)

Everybody knows the struct page size obviously :p and the rest is a
simple exercise. But more seriously, I see what you are saying. We do
not have a good counter now and the patch doesn't improve that. But I
guess this is a separate discussion.
David Hildenbrand Nov. 26, 2018, 1:05 p.m. UTC | #7
On 23.11.18 12:55, Oscar Salvador wrote:
> On Thu, Nov 22, 2018 at 10:21:24AM +0100, David Hildenbrand wrote:
>> 1. How are we going to present such memory to the system statistics?
>>
>> In my opinion, this vmemmap memory should
>> a) still account to total memory
>> b) show up as allocated
>>
>> So just like before.
> 
> No, it shows up neither under total memory nor as allocated memory.
> This memory is not used for anything but creating the pagetables
> for the memmap array for the section(s).
> 
> It is not memory that the system can use.
> 
> I also guess that if there is a strong opinion on this, we could create
> a counter, something like NR_VMEMMAP_PAGES, and show it under /proc/meminfo.
> 
>> 2. Is this optional, in other words, can a device driver decide not to
>> do it like that?
> 
> Right now, it is a per-arch setup.
> For example, x86_64/powerpc/arm64 will do it unconditionally.
> 

Just FYI, another special case right now is s390x when it comes to adding
standby memory (see linux/drivers/s390/char/sclp_cmd.c):

There are two issues:

a) Storage keys

On s390x, storage keys have to be initialized before memory can be
used (think of it as a 7-bit page status/protection value for each 4k page,
managed and stored separately by the HW).

Storage keys are initialized in sclp_assign_storage(), when the memory
is going online (MEM_GOING_ONLINE).

b) Hypervisor making memory accessible

The memory is only actually made accessible in the hypervisor when it is
onlined (sclp_assign_storage()). Touching it before that is bad and
will fail.

You can think of standby memory on s390x like memory that is only
onlined on request by an administrator. Once onlined, the hypervisor
will allocate memory for it.
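
(Roughly the flow, paraphrasing sclp_mem_notifier() in
drivers/s390/char/sclp_cmd.c; sclp_assign_standby_range() is a made-up
wrapper and error handling is omitted:)

    static int standby_mem_notifier(struct notifier_block *nb,
                                    unsigned long action, void *data)
    {
            struct memory_notify *arg = data;

            switch (action) {
            case MEM_GOING_ONLINE:
                    /*
                     * Only here the standby memory gets assigned by the
                     * hypervisor and its storage keys initialized; any
                     * access before this point fails.
                     */
                    if (sclp_assign_standby_range(arg->start_pfn, arg->nr_pages))
                            return NOTIFY_BAD;
                    break;
            case MEM_CANCEL_ONLINE:
            case MEM_OFFLINE:
                    /* the memory may be unassigned again */
                    break;
            }
            return NOTIFY_OK;
    }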


However, once we have other ways of adding memory to an s390x guest (e.g.
virtio-mem), at least b) is not an issue anymore. a) would require manual
tweaking (e.g. initializing the storage keys of the vmemmap memory early).


So, in summary, as of now your approach will not work on s390x, but with
e.g. virtio-mem it could. We would need some interface to specify how to
add memory. (To somehow allow a driver to specify it - e.g. SCLP vs.
virtio-mem)

Cheers!