[0/8] implement "memmap on memory" feature on s390

Message ID 20231114180238.1522782-1-sumanthk@linux.ibm.com (mailing list archive)

Message

Sumanth Korikkar Nov. 14, 2023, 6:02 p.m. UTC
Hi All,

The patch series implements "memmap on memory" feature on s390 and
provides the necessary fixes for it.

Patch 1 addresses the locking order in memory hotplug operations,
ensuring that the mem_hotplug_lock is held during critical operations
like mhp_init_memmap_on_memory() and mhp_deinit_memmap_on_memory().

Patch 2 deals with error handling in add_memory_resource() and considers
the possibility of altmap support. This ensures proper deallocation of
struct pages, aligning with the allocation strategy.

Patch 3 relocates the vmem_altmap code to sparse-vmemmap.c, enabling the
utilization of vmem_altmap_free() and vmem_altmap_offset() without the
dependency on CONFIG_ZONE_DEVICE. Note: these functions are also used
on arm64. However, neither ZONE_DEVICE nor ARCH_HAS_ZONE_DEVICE appears
to be enabled on arm64.

Patch 4 introduces MEM_PHYS_ONLINE/OFFLINE memory notifiers. It
facilitates the emulation of dynamic ACPI event-triggered logic for
memory hotplug on platforms lacking such events. This sets the stage for
implementing the "memmap on memory" feature for s390 in subsequent
patches. All architectures/codepaths handle the default case in their
memory notifiers. Hence, introducing new memory notifiers will have no
functional impact.

Patch 5 allocates vmemmap pages from the self-contained memory range on
s390. The memory map (struct pages array) is allocated from the
hotplugged memory range itself, rather than from system memory, by
passing the altmap to the vmemmap functions.

Patch 6 implements MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers
on s390. It involves making the memory block physically accessible and
then calling __add_pages()/__remove_pages() with altmap parameter.

Patch 7 removes unhandled memory notifier types, which are currently
handled by the default case.

Patch 8 finally enables MHP_MEMMAP_ON_MEMORY on s390.

Thank you

Sumanth Korikkar (8):
  mm/memory_hotplug: fix memory hotplug locking order
  mm/memory_hotplug: fix error handling in add_memory_resource()
  mm: use vmem_altmap code without CONFIG_ZONE_DEVICE
  mm/memory_hotplug: introduce MEM_PHYS_ONLINE/OFFLINE memory notifiers
  s390/mm: allocate vmemmap pages from self-contained memory range
  s390/mm: implement MEM_PHYS_ONLINE MEM_PHYS_OFFLINE memory notifiers
  s390/sclp: remove unhandled memory notifier type
  s390: enable MHP_MEMMAP_ON_MEMORY

 arch/s390/Kconfig            |  1 +
 arch/s390/mm/init.c          | 19 ++++++++---
 arch/s390/mm/vmem.c          | 62 ++++++++++++++++++++----------------
 drivers/base/memory.c        | 28 ++++++++++++++--
 drivers/s390/char/sclp_cmd.c | 37 +++++++++++++++------
 include/linux/memory.h       |  2 ++
 include/linux/memremap.h     | 12 -------
 include/linux/mm.h           |  2 ++
 mm/memory_hotplug.c          | 15 ++++-----
 mm/memremap.c                | 14 +-------
 mm/sparse-vmemmap.c          | 13 ++++++++
 11 files changed, 129 insertions(+), 76 deletions(-)

Comments

David Hildenbrand Nov. 14, 2023, 6:22 p.m. UTC | #1
On 14.11.23 19:02, Sumanth Korikkar wrote:

The patch subject talks about "fixing locking order", but it's actually 
missing locking, no?

>  From Documentation/core-api/memory-hotplug.rst:
> When adding/removing/onlining/offlining memory or adding/removing
> heterogeneous/device memory, we should always hold the mem_hotplug_lock
> in write mode to serialise memory hotplug (e.g. access to global/zone
> variables).
> 
> mhp_(de)init_memmap_on_memory() functions can change zone stats and
> struct page content, but they are currently called w/o the
> mem_hotplug_lock.
> 
> When memory block is being offlined and when kmemleak goes through each
> populated zone, the following theoretical race conditions could occur:
> CPU 0:					     | CPU 1:
> memory_offline()			     |
> -> offline_pages()			     |
> 	-> mem_hotplug_begin()		     |
> 	   ...				     |
> 	-> mem_hotplug_done()		     |
> 					     | kmemleak_scan()
> 					     | -> get_online_mems()
> 					     |    ...
> -> mhp_deinit_memmap_on_memory()	     |
>    [not protected by mem_hotplug_begin/done()]|
>    Marks memory section as offline,	     |   Retrieves zone_start_pfn
>    poisons vmemmap struct pages and updates   |   and struct page members.
>    the zone related data			     |
>     					     |    ...
>     					     | -> put_online_mems()
> 
> Fix this by ensuring mem_hotplug_lock is taken before performing
> mhp_init_memmap_on_memory(). Also ensure that
> mhp_deinit_memmap_on_memory() holds the lock.

What speaks against grabbing that lock in these functions?
David Hildenbrand Nov. 14, 2023, 6:27 p.m. UTC | #2
On 14.11.23 19:02, Sumanth Korikkar wrote:
> Add new memory notifiers to mimic the dynamic ACPI event triggered logic
> for memory hotplug on platforms that do not generate such events. This
> will be used to implement "memmap on memory" feature for s390 in a later
> patch.
> 
> Platforms such as x86 can support physical memory hotplug via ACPI. When
> there is physical memory hotplug, ACPI event leads to the memory
> addition with the following callchain:
> acpi_memory_device_add()
>    -> acpi_memory_enable_device()
>       -> __add_memory()
> 
> After this, the hotplugged memory is physically accessible, and altmap
> support prepared, before the "memmap on memory" initialization in
> memory_block_online() is called.
> 
> On s390, memory hotplug works in a different way. The available hotplug
> memory has to be defined upfront in the hypervisor, but it is made
> physically accessible only when the user sets it online via sysfs,
> currently in the MEM_GOING_ONLINE notifier. This requires calling
> add_memory() during early memory detection, in order to get the sysfs
> representation, but we cannot use "memmap on memory" altmap support at
> this stage, w/o having it physically accessible.
> 
> Since no ACPI or similar events are generated, there is no way to set up
> altmap support, or even make the memory physically accessible at all,
> before the "memmap on memory" initialization in memory_block_online().
> 
> The new MEM_PHYS_ONLINE notifier allows to work around this, by
> providing a hook to make the memory physically accessible, and also call
> __add_pages() with altmap support, early in memory_block_online().
> Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory
> inaccessible and call __remove_pages(), at the end of
> memory_block_offline().
> 
> Calling __add/remove_pages() requires mem_hotplug_lock, so move
> mem_hotplug_begin/done() to include the new notifiers.
> 
> All architectures ignore unknown memory notifiers, so this patch should
> not introduce any functional changes.

Sorry to say, no. No hacks please, and this is a hack for memory that 
has already been added to the system.

If you want memory without an altmap to suddenly not have an altmap 
anymore, then look into removing and readding that memory, or some way 
to convert offline memory.

But certainly not on the online/offline path triggered by sysfs.
David Hildenbrand Nov. 14, 2023, 6:39 p.m. UTC | #3
On 14.11.23 19:02, Sumanth Korikkar wrote:
> Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390
> 
> Implementation of MEM_PHYS_ONLINE Memory Notifier:
> * Transition the memory block to an accessible/online state using the
>    sclp assign command.
> * Execute __add_pages() for the memory block, enabling a self-contained
>    memory map range. For boot-time memory, vmemmap mapping is carried out
>    through sparse_init().
> 
> Implementation of MEM_PHYS_OFFLINE Memory Notifier:
> * Execute __remove_pages() exclusively for the memory block (applicable
>    where a self-contained memory map was possible before).
> * Shift the memory block to an inaccessible/offline state using the sclp
>    unassign command.
> 
> Additional Implementation Considerations:
> * When MHP_MEMMAP_ON_MEMORY is disabled, the system retains the old
>    behavior. This means the memory map is allocated from default memory,
>    and struct vmemmap pages are populated during the standby memory
>    detection phase.
> * With MHP_MEMMAP_ON_MEMORY enabled (allowing self-contained memory
>    map), the memory map is allocated using the self-contained memory map
>    range. Struct vmemmap pages are populated during the memory hotplug
>    phase.
> * If MACHINE_HAS_EDAT1 is unavailable, MHP_MEMMAP_ON_MEMORY is
>    automatically disabled. This ensures that vmemmap pagetables do not
>    consume additional memory from the default memory allocator.
> * The MEM_GOING_ONLINE notifier has been modified to perform no
>    operation, as MEM_PHYS_ONLINE already executes the sclp assign
>    command.
> * The MEM_CANCEL_ONLINE notifier now performs no operation, as
>    MEM_PHYS_OFFLINE already executes the sclp unassign command.
> * The call to __add_pages() in arch_add_memory() with altmap support is
>    skipped. This operation is deferred and will be performed later in the
>    MEM_PHYS_ONLINE notifier.
> 
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
> ---
>   arch/s390/mm/init.c          | 16 +++++++++++++++-
>   drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
>   2 files changed, 45 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 8d9a60ccb777..db505ed590b2 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
>   	rc = vmem_add_mapping(start, size);
>   	if (rc)
>   		return rc;
> +	/*
> +	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
> +	 * onlining phase
> +	 */
> +	if (params->altmap)
> +		return 0;


So we'd have added memory blocks without a memmap? Sorry, but this seems 
to further hack into the s390x direction.

Maybe s390x should just provide a dedicated interface to add these memory 
blocks instead of adding them during boot and then relying on the old 
way of using online/offline to set them online/offline.
Sumanth Korikkar Nov. 15, 2023, 2:20 p.m. UTC | #4
On Tue, Nov 14, 2023 at 07:39:40PM +0100, David Hildenbrand wrote:
> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390
> > 
...
> >   arch/s390/mm/init.c          | 16 +++++++++++++++-
> >   drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
> >   2 files changed, 45 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> > index 8d9a60ccb777..db505ed590b2 100644
> > --- a/arch/s390/mm/init.c
> > +++ b/arch/s390/mm/init.c
> > @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >   	rc = vmem_add_mapping(start, size);
> >   	if (rc)
> >   		return rc;
> > +	/*
> > +	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
> > +	 * onlining phase
> > +	 */
> > +	if (params->altmap)
> > +		return 0;
> 
> 
> So we'd have added memory blocks without a memmap? Sorry, but this seems to
> further hack into the s390x direction.

This new approach has the advantage that we do not need to allocate any
additional memory during the online phase, neither for direct mapping
page tables nor for struct pages, so that memory hotplug can never fail.

The old approach (without altmap) is already a hack, because we add
the memmap / struct pages, but for memory that is not really accessible.
And with all the disadvantage of pre-allocating struct pages from system
memory.

The new approach allows to better integrate s390 to the existing
interface, and also make use of altmap support, which would eliminate
the major disadvantage of the old behaviour. So from s390 perspective,
this new mechanism would be preferred, provided that there is no
functional issue with the "added memory blocks without a memmap"
approach.

Do you see any functional issues, e.g. conflict with common
code?
> 
> Maybe s390x should just provide a dedicate interface to add these memory
> blocks instead of adding them during boot and then relying on the old way of
> using online/offline set them online/offline.

Existing behavior:
The current 'lsmem -a' command displays both online and standby memory.

interface changes:
If a new interface is introduced and standby memory is no longer listed,
the following consequences might occur:

1. Running 'lsmem -a' would only show online memory, potentially leading
   to user complaints.
2. standby memory addition would need:
   * echo "standby memory addr" > /sys/devices/system/memory/probe
   As far as I understand, this interface is already deprecated.

3. To remove standby memory, a new interface probe_remove is needed
   * echo "standby memory addr" > /sys/devices/system/memory/probe_remove

4. Users may express a need to identify standby memory addresses,
resulting in the creation of another interface to list these standby
memory ranges.

Hence, introducing new physical memory notifiers on platforms lacking
dynamic ACPI events would be highly advantageous, while maintaining the
existing user-friendly interface.

Thanks
Sumanth Korikkar Nov. 15, 2023, 2:23 p.m. UTC | #5
On Tue, Nov 14, 2023 at 07:27:35PM +0100, David Hildenbrand wrote:
> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Add new memory notifiers to mimic the dynamic ACPI event triggered logic
> > for memory hotplug on platforms that do not generate such events. This
> > will be used to implement "memmap on memory" feature for s390 in a later
> > patch.
> > 
> > Platforms such as x86 can support physical memory hotplug via ACPI. When
> > there is physical memory hotplug, ACPI event leads to the memory
> > addition with the following callchain:
> > acpi_memory_device_add()
> >    -> acpi_memory_enable_device()
> >       -> __add_memory()
> > 
> > After this, the hotplugged memory is physically accessible, and altmap
> > support prepared, before the "memmap on memory" initialization in
> > memory_block_online() is called.
> > 
> > On s390, memory hotplug works in a different way. The available hotplug
> > memory has to be defined upfront in the hypervisor, but it is made
> > physically accessible only when the user sets it online via sysfs,
> > currently in the MEM_GOING_ONLINE notifier. This requires calling
> > add_memory() during early memory detection, in order to get the sysfs
> > representation, but we cannot use "memmap on memory" altmap support at
> > this stage, w/o having it physically accessible.
> > 
> > Since no ACPI or similar events are generated, there is no way to set up
> > altmap support, or even make the memory physically accessible at all,
> > before the "memmap on memory" initialization in memory_block_online().
> > 
> > The new MEM_PHYS_ONLINE notifier allows to work around this, by
> > providing a hook to make the memory physically accessible, and also call
> > __add_pages() with altmap support, early in memory_block_online().
> > Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory
> > inaccessible and call __remove_pages(), at the end of
> > memory_block_offline().
> > 
> > Calling __add/remove_pages() requires mem_hotplug_lock, so move
> > mem_hotplug_begin/done() to include the new notifiers.
> > 
> > All architectures ignore unknown memory notifiers, so this patch should
> > not introduce any functional changes.
> 
> Sorry to say, no. No hacks please, and this is a hack for memory that has
> already been added to the system.
> 
> If you want memory without an altmap to suddenly not have an altmap anymore,
> then look into removing and readding that memory, or some way to convert
> offline memory.

Sorry, I couldn't get the context. Could you please give me more details?

Thanks
Gerald Schaefer Nov. 15, 2023, 3:03 p.m. UTC | #6
On Tue, 14 Nov 2023 19:27:35 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Add new memory notifiers to mimic the dynamic ACPI event triggered logic
> > for memory hotplug on platforms that do not generate such events. This
> > will be used to implement "memmap on memory" feature for s390 in a later
> > patch.
> > 
> > Platforms such as x86 can support physical memory hotplug via ACPI. When
> > there is physical memory hotplug, ACPI event leads to the memory
> > addition with the following callchain:
> > acpi_memory_device_add()  
> >    -> acpi_memory_enable_device()
> >       -> __add_memory()  
> > 
> > After this, the hotplugged memory is physically accessible, and altmap
> > support prepared, before the "memmap on memory" initialization in
> > memory_block_online() is called.
> > 
> > On s390, memory hotplug works in a different way. The available hotplug
> > memory has to be defined upfront in the hypervisor, but it is made
> > physically accessible only when the user sets it online via sysfs,
> > currently in the MEM_GOING_ONLINE notifier. This requires calling
> > add_memory() during early memory detection, in order to get the sysfs
> > representation, but we cannot use "memmap on memory" altmap support at
> > this stage, w/o having it physically accessible.
> > 
> > Since no ACPI or similar events are generated, there is no way to set up
> > altmap support, or even make the memory physically accessible at all,
> > before the "memmap on memory" initialization in memory_block_online().
> > 
> > The new MEM_PHYS_ONLINE notifier allows to work around this, by
> > providing a hook to make the memory physically accessible, and also call
> > __add_pages() with altmap support, early in memory_block_online().
> > Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory
> > inaccessible and call __remove_pages(), at the end of
> > memory_block_offline().
> > 
> > Calling __add/remove_pages() requires mem_hotplug_lock, so move
> > mem_hotplug_begin/done() to include the new notifiers.
> > 
> > All architectures ignore unknown memory notifiers, so this patch should
> > not introduce any functional changes.  
> 
> Sorry to say, no. No hacks please, and this is a hack for memory that 
> has already been added to the system.

IIUC, when we enter memory_block_online(), memory has always already
been added to the system, on all architectures. E.g. via ACPI events
on x86, or with the existing s390 hack, where we add it at boot time,
including memmap allocated from system memory. Without a preceding
add_memory() you cannot reach memory_block_online() via sysfs online.

The difference is that for s390, the memory is not yet physically
accessible, and therefore we cannot use the existing altmap support
in memory_block_online(), which requires that the memory is accessible
before it calls mhp_init_memmap_on_memory().

Currently, on s390 we make the memory accessible in the GOING_ONLINE
notifier, by sclp call to the hypervisor. That is too late for altmap
setup code in memory_block_online(), therefore we'd like to introduce
the new notifier, to have a hook where we can make it accessible
earlier, and after that there is no difference to how it works for
other architectures, and we can make use of the existing altmap support.

> 
> If you want memory without an altmap to suddenly not have an altmap 
> anymore, then look into removing and readding that memory, or some way 
> to convert offline memory.

We do not want memory to suddenly not have altmap support any more,
but simply want a hook so that we can prepare the memory to have
altmap support. This means making it physically accessible, and
calling __add_pages() for altmap support, which for other
architectures has already happened before this point.

Of course, it is a hack for s390 that we must skip __add_pages()
in the initial (arch_)add_memory() during boot time when we want
altmap support, because the memory simply is not accessible at that
time. But s390 memory hotplug support has always been a hack, and
had to be, because of how it is implemented by the architecture.

So we replace one hack with another one, that has the huge advantage
that we do not need to allocate struct pages upfront from system
memory any more, for the whole possible online memory range.

And the current approach comes without any change to existing
interfaces, and minimal change to common code, i.e. these new
notifiers, that should not have any impact on other architectures.

What exactly is your concern regarding the new notifiers? Is it the
useless no-op notifier calls on other archs (not sure if they would
get optimized out by the compiler)?
David Hildenbrand Nov. 16, 2023, 6:40 p.m. UTC | #7
On 15.11.23 14:41, Sumanth Korikkar wrote:
> On Tue, Nov 14, 2023 at 07:22:33PM +0100, David Hildenbrand wrote:
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>
>> The patch subject talks about "fixing locking order", but it's actually
>> missing locking, no?
>>
>>>   From Documentation/core-api/memory-hotplug.rst:
>>> When adding/removing/onlining/offlining memory or adding/removing
>>> heterogeneous/device memory, we should always hold the mem_hotplug_lock
>>> in write mode to serialise memory hotplug (e.g. access to global/zone
>>> variables).
>>>
>>> mhp_(de)init_memmap_on_memory() functions can change zone stats and
>>> struct page content, but they are currently called w/o the
>>> mem_hotplug_lock.
>>>
>>> When memory block is being offlined and when kmemleak goes through each
>>> populated zone, the following theoretical race conditions could occur:
>>> CPU 0:					     | CPU 1:
>>> memory_offline()			     |
>>> -> offline_pages()			     |
>>> 	-> mem_hotplug_begin()		     |
>>> 	   ...				     |
>>> 	-> mem_hotplug_done()		     |
>>> 					     | kmemleak_scan()
>>> 					     | -> get_online_mems()
>>> 					     |    ...
>>> -> mhp_deinit_memmap_on_memory()	     |
>>>     [not protected by mem_hotplug_begin/done()]|
>>>     Marks memory section as offline,	     |   Retrieves zone_start_pfn
>>>     poisons vmemmap struct pages and updates   |   and struct page members.
>>>     the zone related data			     |
>>>      					     |    ...
>>>      					     | -> put_online_mems()
>>>
>>> Fix this by ensuring mem_hotplug_lock is taken before performing
>>> mhp_init_memmap_on_memory(). Also ensure that
>>> mhp_deinit_memmap_on_memory() holds the lock.
>>
>> What speaks against grabbing that lock in these functions?
>>
> At present, the functions online_pages() and offline_pages() acquire the
> mem_hotplug_lock right at the start. However, given the necessity of
> locking in mhp_(de)init_memmap_on_memory(), it would be more efficient
> to consolidate the locking process by holding the mem_hotplug_lock once
> in memory_block_online() and memory_block_offline().

Good point; can you similarly add comments to these two functions that 
they need that lock in write mode?
David Hildenbrand Nov. 16, 2023, 7:02 p.m. UTC | #8
On 15.11.23 16:03, Gerald Schaefer wrote:
> On Tue, 14 Nov 2023 19:27:35 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Add new memory notifiers to mimic the dynamic ACPI event triggered logic
>>> for memory hotplug on platforms that do not generate such events. This
>>> will be used to implement "memmap on memory" feature for s390 in a later
>>> patch.
>>>
>>> Platforms such as x86 can support physical memory hotplug via ACPI. When
>>> there is physical memory hotplug, ACPI event leads to the memory
>>> addition with the following callchain:
>>> acpi_memory_device_add()
>>>     -> acpi_memory_enable_device()
>>>        -> __add_memory()
>>>
>>> After this, the hotplugged memory is physically accessible, and altmap
>>> support prepared, before the "memmap on memory" initialization in
>>> memory_block_online() is called.
>>>
>>> On s390, memory hotplug works in a different way. The available hotplug
>>> memory has to be defined upfront in the hypervisor, but it is made
>>> physically accessible only when the user sets it online via sysfs,
>>> currently in the MEM_GOING_ONLINE notifier. This requires calling
>>> add_memory() during early memory detection, in order to get the sysfs
>>> representation, but we cannot use "memmap on memory" altmap support at
>>> this stage, w/o having it physically accessible.
>>>
>>> Since no ACPI or similar events are generated, there is no way to set up
>>> altmap support, or even make the memory physically accessible at all,
>>> before the "memmap on memory" initialization in memory_block_online().
>>>
>>> The new MEM_PHYS_ONLINE notifier allows to work around this, by
>>> providing a hook to make the memory physically accessible, and also call
>>> __add_pages() with altmap support, early in memory_block_online().
>>> Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory
>>> inaccessible and call __remove_pages(), at the end of
>>> memory_block_offline().
>>>
>>> Calling __add/remove_pages() requires mem_hotplug_lock, so move
>>> mem_hotplug_begin/done() to include the new notifiers.
>>>
>>> All architectures ignore unknown memory notifiers, so this patch should
>>> not introduce any functional changes.
>>
>> Sorry to say, no. No hacks please, and this is a hack for memory that
>> has already been added to the system.
> 
> IIUC, when we enter memory_block_online(), memory has always already
> been added to the system, on all architectures. E.g. via ACPI events
> on x86, or with the existing s390 hack, where we add it at boot time,
> including memmap allocated from system memory. Without a preceding
> add_memory() you cannot reach memory_block_online() via sysfs online.

Adding that memory block at boot time is the legacy leftover that s390x 
is carrying along; and now we want to work around that by adding 
s390x-special handling to the online/offlining code, having memory 
blocks without any memmap, or configuring an altmap at the very last 
minute using an s390x-specific memory notifier.

Instead, if you want to support the altmap, the kernel should not add 
standby memory to the system (if configured for this new feature), but 
instead only remember the standby memory ranges so it knows what can 
later be added and what can't.

From there, users should have an interface where they can actually add 
memory to the system, and either online it manually or just let the 
kernel online it automatically.

s390x code will call add_memory() and properly prepare an altmap if 
requested and make that standby memory available. You can then even have 
an interface to remove that memory again once offline. That will work 
with an altmap or without an altmap.

This approach is aligned with any other code that hot(un)plugs memory 
and is compatible with things like variable-sized memory blocks people 
have been talking about quite a while already, and altmaps that span 
multiple memory blocks to make gigantic pages in such ranges usable.

Sure, you'll have a new interface and have to enable the new handling 
for the new kernel, but you're asking for supporting a new feature that 
cannot be supported cleanly just like any other architecture does. But 
it's a clean approach and probably should have been done that way right 
from the start (decades ago).

Note: We do have the same for other architectures without ACPI that add 
memory via the probe interface. But IIRC we cannot really do any checks 
there, because these architectures have no way of identifying what

> 
> The difference is that for s390, the memory is not yet physically
> accessible, and therefore we cannot use the existing altmap support
> in memory_block_online(), which requires that the memory is accessible
> before it calls mhp_init_memmap_on_memory().
> 
> Currently, on s390 we make the memory accessible in the GOING_ONLINE
> notifier, by sclp call to the hypervisor. That is too late for altmap
> setup code in memory_block_online(), therefore we'd like to introduce
> the new notifier, to have a hook where we can make it accessible
> earlier, and after that there is no difference to how it works for
> other architectures, and we can make use of the existing altmap support.
> 
>>
>> If you want memory without an altmap to suddenly not have an altmap
>> anymore, then look into removing and readding that memory, or some way
>> to convert offline memory.
> 
> We do not want to have memory suddenly not have an altmap support
> any more, but simply get a hook so that we can prepare the memory
> to have altmap support. This means making it physically accessible,
> and calling __add_pages() for altmap support, which for other
> architecture has already happened before.
> 
> Of course, it is a hack for s390, that we must skip __add_pages()
> in the initial (arch_)add_memory() during boot time, when we want
> altmap support, because the memory simply is not accessible at that
> time. But s390 memory hotplug support has always been a hack, and
> had to be, because of how it is implemented by the architecture.

I write above paragraph before reading this; and it's fully aligned with 
what I said above.

> 
> So we replace one hack with another one, that has the huge advantage
> that we do not need to allocate struct pages upfront from system
> memory any more, for the whole possible online memory range.
> 
> And the current approach comes without any change to existing
> interfaces, and minimal change to common code, i.e. these new
> notifiers, that should not have any impact on other architectures.
> 
> What exactly is your concern regarding the new notifiers? Is it
> useless no-op notifier calls on other archs (not sure if they
> would get optimized out by compiler)?

That it makes hotplug code more special because of s390x, instead of 
cleaning up that legacy code.
David Hildenbrand Nov. 16, 2023, 7:03 p.m. UTC | #9
On 15.11.23 15:23, Sumanth Korikkar wrote:
> On Tue, Nov 14, 2023 at 07:27:35PM +0100, David Hildenbrand wrote:
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Add new memory notifiers to mimic the dynamic ACPI event triggered logic
>>> for memory hotplug on platforms that do not generate such events. This
>>> will be used to implement "memmap on memory" feature for s390 in a later
>>> patch.
>>>
>>> Platforms such as x86 can support physical memory hotplug via ACPI. When
>>> there is physical memory hotplug, ACPI event leads to the memory
>>> addition with the following callchain:
>>> acpi_memory_device_add()
>>>     -> acpi_memory_enable_device()
>>>        -> __add_memory()
>>>
>>> After this, the hotplugged memory is physically accessible, and altmap
>>> support prepared, before the "memmap on memory" initialization in
>>> memory_block_online() is called.
>>>
>>> On s390, memory hotplug works in a different way. The available hotplug
>>> memory has to be defined upfront in the hypervisor, but it is made
>>> physically accessible only when the user sets it online via sysfs,
>>> currently in the MEM_GOING_ONLINE notifier. This requires calling
>>> add_memory() during early memory detection, in order to get the sysfs
>>> representation, but we cannot use "memmap on memory" altmap support at
>>> this stage, w/o having it physically accessible.
>>>
>>> Since no ACPI or similar events are generated, there is no way to set up
>>> altmap support, or even make the memory physically accessible at all,
>>> before the "memmap on memory" initialization in memory_block_online().
>>>
>>> The new MEM_PHYS_ONLINE notifier allows to work around this, by
>>> providing a hook to make the memory physically accessible, and also call
>>> __add_pages() with altmap support, early in memory_block_online().
>>> Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory
>>> inaccessible and call __remove_pages(), at the end of
>>> memory_block_offline().
>>>
>>> Calling __add/remove_pages() requires mem_hotplug_lock, so move
>>> mem_hotplug_begin/done() to include the new notifiers.
>>>
>>> All architectures ignore unknown memory notifiers, so this patch should
>>> not introduce any functional changes.
>>
>> Sorry to say, no. No hacks please, and this is a hack for memory that has
>> already been added to the system.
>>
>> If you want memory without an altmap to suddenly not have an altmap anymore,
>> then look into removing and readding that memory, or some way to convert
>> offline memory.
> 
> Sorry, I couldn't get the context. Could you please give me more details?

See my reply to Gerald.

In an ideal world, there would not be any new callbacks, we would get 
rid of them, and just let the architecture properly hotplug memory to 
the system when requested by the user.
David Hildenbrand Nov. 16, 2023, 7:16 p.m. UTC | #10
On 15.11.23 15:20, Sumanth Korikkar wrote:
> On Tue, Nov 14, 2023 at 07:39:40PM +0100, David Hildenbrand wrote:
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390
>>>
> ...
>>>    arch/s390/mm/init.c          | 16 +++++++++++++++-
>>>    drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
>>>    2 files changed, 45 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
>>> index 8d9a60ccb777..db505ed590b2 100644
>>> --- a/arch/s390/mm/init.c
>>> +++ b/arch/s390/mm/init.c
>>> @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>>    	rc = vmem_add_mapping(start, size);
>>>    	if (rc)
>>>    		return rc;
>>> +	/*
>>> +	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
>>> +	 * onlining phase
>>> +	 */
>>> +	if (params->altmap)
>>> +		return 0;
>>
>>
>> So we'd have added memory blocks without a memmap? Sorry, but this seems to
>> further hack into the s390x direction.
> 
> This new approach has the advantage that we do not need to allocate any
> additional memory during online phase, neither for direct mapping page
> tables nor struct pages, so that memory hotplug can never fail.

Right, just like any other architecture that (triggered by whatever 
mechanism) ends up calling add_memory() and friends.

> 
> The old approach (without altmap) is already a hack, because we add
> the memmap / struct pages, but for memory that is not really accessible.

Yes, it's disgusting. And you still allocate other things like memory 
block devices or the identity map.

> And with all the disadvantage of pre-allocating struct pages from system
> memory.

Jep. It never should have been done like that.

> 
> The new approach allows to better integrate s390 to the existing
> interface, and also make use of altmap support, which would eliminate
> the major disadvantage of the old behaviour. So from s390 perspective,
> this new mechanism would be preferred, provided that there is no
> functional issue with the "added memory blocks without a memmap"
> approach.

It achieves that by s390x specific hacks in common code :) Instead of 
everybody else that simply uses add_memory() and friends.

> 
> Do you see any functional issues, e.g. conflict with common
> code?

I don't see functional issues right now, just the way it is done to 
implement support for a new feature is a hack IMHO. Replacing hack #1 by 
hack #2 is not really something reasonable. Let's try to remove hacks.

>>
>> Maybe s390x should just provide a dedicate interface to add these memory
>> blocks instead of adding them during boot and then relying on the old way of
>> using online/offline set them online/offline.
> 
> Existing behavior:
> The current 'lsmem -a' command displays both online and standby memory.
> 
> interface changes:
> If a new interface is introduced and standby memory is no longer listed,
> the following consequences might occur:
> 
> 1. Running 'lsmem -a' would only show online memory, potentially leading
>     to user complaints.

That's why the new, clean way of doing it will require a world switch. 
If the admin wants the benefits of altmap/memmap allocation, it can be 
enabled.

> 2. standby memory addition would need:
>     * echo "standby memory addr" > /sys/devices/system/memory/probe
>     As far as I understand, this interface is already deprecated.

It should actually be an s390x specific interface where people are able 
to query the standby ranges, and request to add/remove them. There, 
s390x can perform checks and setup everything accordingly before calling 
add_memory() and have the memory onlined.

We do have something comparable with the dax/kmem infrastructure: users 
configure the available memory to hotplug, and then hotplug it. Tooling 
onlines that memory automatically.

Ideally they will add ranges, not memory blocks.

> 
> 3. To remove standby memory, a new interface probe_remove is needed
>     * echo "standby memory addr" > /sys/devices/system/memory/probe_remove
> 

Similarly, an s390x specific interface that performs checks and properly 
tears everything s390x-specific down -- for example, turning system RAM 
into standby RAM again.


> 4. Users may express a need to identify standby memory addresses,
> resulting in the creation of another interface to list these standby
> memory ranges.

Exactly. Memory that is not added to the system that does not consume 
any resources, but can be added on demand using an interface that is not 
the second stage (onlining/offlining) of memory hot(un)plug.
David Hildenbrand Nov. 16, 2023, 7:23 p.m. UTC | #11
>>>
>>> Maybe s390x should just provide a dedicate interface to add these memory
>>> blocks instead of adding them during boot and then relying on the old way of
>>> using online/offline set them online/offline.
>>
>> Existing behavior:
>> The current 'lsmem -a' command displays both online and standby memory.
>>
>> interface changes:
>> If a new interface is introduced and standby memory is no longer listed,
>> the following consequences might occur:
>>
>> 1. Running 'lsmem -a' would only show online memory, potentially leading
>>      to user complaints.
> 
> That's why the new, clean way of doing it will require a world switch.
> If the admin wants the benefits of altmap/memmap allocation, it can be
> enabled.

BTW, thinking about it, I guess one could teach lsmem (and maybe chmem) 
to consult additional interfaces on s390x to show standby memory that's 
not added to the system yet.
David Hildenbrand Nov. 16, 2023, 11:08 p.m. UTC | #12
On 14.11.23 19:02, Sumanth Korikkar wrote:
> Hi All,
> 
> The patch series implements "memmap on memory" feature on s390 and
> provides the necessary fixes for it.

Thinking about this, one thing that makes s390x different from all the 
other architectures in this series is the altmap handling.

I'm curious, why is that even required?

A memmap that is not marked as online in the section should not be 
touched by anybody (except memory onlining code :) ). And if we do, it's 
usually a BUG because that memmap might contain garbage/be poisoned or 
completely stale, so we might want to track that down and fix it in any 
case.

So what speaks against just leaving add_memory() populate the memmap 
from the altmap? Then, also the page tables for the memmap are already 
in place when onlining memory.


Then, adding two new notifier calls on start of memory_block_online() 
called something like MEM_PREPARE_ONLINE and one at the end of 
memory_block_offline() called something like MEM_FINISH_OFFLINE is still 
suboptimal, but that's where standby memory could be 
activated/deactivated, without messing with the altmap.

That way, the only s390x specific thing is that the memmap that should 
not be touched by anybody is actually inaccessible, and you'd 
activate/deactivate simply from the new notifier calls just the way we 
used to do.

It's still all worse than just adding/removing memory properly, using a 
proper interface -- where you could alloc/free an actual memmap when the 
altmap is not desired. But I know that people don't want to spend time 
just doing it cleanly from scratch.
Gerald Schaefer Nov. 17, 2023, 1 p.m. UTC | #13
On Fri, 17 Nov 2023 00:08:31 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Hi All,
> > 
> > The patch series implements "memmap on memory" feature on s390 and
> > provides the necessary fixes for it.  
> 
> Thinking about this, one thing that makes s390x different from all the 
> other architectures in this series is the altmap handling.
> 
> I'm curious, why is that even required?
> 
> A memmap that is not marked as online in the section should not be 
> touched by anybody (except memory onlining code :) ). And if we do, it's 
> usually a BUG because that memmap might contain garbage/be poisoned or 
> completely stale, so we might want to track that down and fix it in any 
> case.
> 
> So what speaks against just leaving add_memory() populate the memmap 
> from the altmap? Then, also the page tables for the memmap are already 
> in place when onlining memory.

Good question, I am not 100% sure if we ran into bugs, or simply assumed
that it is not OK to call __add_pages() when the memory for the altmap
is not accessible.

Maybe there is also already a common code bug with that, s390 might be
special but that is often also good for finding bugs in common code ...

> Then, adding two new notifier calls on start of memory_block_online() 
> called something like MEM_PREPARE_ONLINE and one at the end of 
> memory_block_offline() called something like MEM_FINISH_OFFLINE is still 
> suboptimal, but that's where standby memory could be 
> activated/deactivated, without messing with the altmap.
> 
> That way, the only s390x specific thing is that the memmap that should 
> not be touched by anybody is actually inaccessible, and you'd 
> activate/deactivate simply from the new notifier calls just the way we 
> used to do.
> 
> It's still all worse than just adding/removing memory properly, using a 
> proper interface -- where you could alloc/free an actual memmap when the 
> altmap is not desired. But I know that people don't want to spend time 
> just doing it cleanly from scratch.

Yes, sometimes they need to be forced to do that :-)

So, we'll look into defining a "proper interface", and treat patches 1-3
separately as bug fixes? Especially patch 3 might be interesting for arm:
if they do not have ZONE_DEVICE but still use the functions, they might
end up with the no-op version, not really freeing any memory.
Gerald Schaefer Nov. 17, 2023, 1 p.m. UTC | #14
On Thu, 16 Nov 2023 20:16:02 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 15.11.23 15:20, Sumanth Korikkar wrote:
> > On Tue, Nov 14, 2023 at 07:39:40PM +0100, David Hildenbrand wrote:  
> >> On 14.11.23 19:02, Sumanth Korikkar wrote:  
> >>> Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390
> >>>  
> > ...  
> >>>    arch/s390/mm/init.c          | 16 +++++++++++++++-
> >>>    drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
> >>>    2 files changed, 45 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> >>> index 8d9a60ccb777..db505ed590b2 100644
> >>> --- a/arch/s390/mm/init.c
> >>> +++ b/arch/s390/mm/init.c
> >>> @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >>>    	rc = vmem_add_mapping(start, size);
> >>>    	if (rc)
> >>>    		return rc;
> >>> +	/*
> >>> +	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
> >>> +	 * onlining phase
> >>> +	 */
> >>> +	if (params->altmap)
> >>> +		return 0;  
> >>
> >>
> >> So we'd have added memory blocks without a memmap? Sorry, but this seems to
> >> further hack into the s390x direction.  
> > 
> > This new approach has the advantage that we do not need to allocate any
> > additional memory during online phase, neither for direct mapping page
> > tables nor struct pages, so that memory hotplug can never fail.  
> 
> Right, just like any other architecture that (triggered by whatever 
> mechanism) ends up calling add_memory() and friends.

Just for better understanding, are page tables for identity and also
vmemmap mapping not allocated from system memory on other archs? I.e.
no altmap support for that, only for struct pages (so far)?

> 
> > 
> > The old approach (without altmap) is already a hack, because we add
> > the memmap / struct pages, but for memory that is not really accessible.  
> 
> Yes, it's disgusting. And you still allocate other things like memory 
> block devices or the identity map.

I would say it is special :-). And again, for understanding, all other
things apart from struct pages would still need to be allocated from
system memory on other archs?

Of course, struct pages would be by far the biggest part, so having
altmap support for that helps a lot. Doing the other allocations also
via altmap would feel natural, but it is not possible yet, or did we
miss something?

> 
> > And with all the disadvantage of pre-allocating struct pages from system
> > memory.  
> 
> Jep. It never should have been done like that.

At that time, it gave the benefit over all others, that we do not need
to allocate struct pages from system memory, at the time of memory online,
when memory pressure might be high and such allocations might fail.

I guess you can say that it should have been done "right" at that time,
e.g. by already adding something like altmap support, instead of our own
hack.

> 
> > 
> > The new approach allows to better integrate s390 to the existing
> > interface, and also make use of altmap support, which would eliminate
> > the major disadvantage of the old behaviour. So from s390 perspective,
> > this new mechanism would be preferred, provided that there is no
> > functional issue with the "added memory blocks without a memmap"
> > approach.  
> 
> It achieves that by s390x specific hacks in common code :) Instead of 
> everybody else that simply uses add_memory() and friends.
> 
> > 
> > Do you see any functional issues, e.g. conflict with common
> > code?  
> 
> I don't see functional issues right now, just the way it is done to 
> implement support for a new feature is a hack IMHO. Replacing hack #1 by 
> hack #2 is not really something reasonable. Let's try to remove hacks.

Ok, sounds reasonable, let's try that. Introducing some new s390-specific
interface also feels a bit hacky, or ugly, but we'll see if we find a
way so that it is only "special" :-)
Reminds me a bit of that "probe" attribute, that also was an arch-specific
hack initially, IIRC, and is now to be deprecated...
Sumanth Korikkar Nov. 17, 2023, 1:42 p.m. UTC | #15
On Thu, Nov 16, 2023 at 07:40:34PM +0100, David Hildenbrand wrote:
> Good point; can you similarly add comments to these two functions that they
> need that lock in write mode?

Ok, will add it.

Thanks
Sumanth Korikkar Nov. 17, 2023, 1:56 p.m. UTC | #16
On Fri, Nov 17, 2023 at 12:08:31AM +0100, David Hildenbrand wrote:
> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Hi All,
> > 
> > The patch series implements "memmap on memory" feature on s390 and
> > provides the necessary fixes for it.
> 
> Thinking about this, one thing that makes s390x different from all the other
> architectures in this series is the altmap handling.
> 
> I'm curious, why is that even required?
> 
> A memmap that is not marked as online in the section should not be touched
> by anybody (except memory onlining code :) ). And if we do, it's usually a
> BUG because that memmap might contain garbage/be poisoned or completely
> stale, so we might want to track that down and fix it in any case.
> 
> So what speaks against just leaving add_memory() populate the memmap from
> the altmap? Then, also the page tables for the memmap are already in place
> when onlining memory.
>

We do have page_init_poison() in sparse_add_section(), which would then
need to be handled later, not in add_pages().
> 
> Then, adding two new notifier calls on start of memory_block_online() called
> something like MEM_PREPARE_ONLINE and one at the end of memory_block_offline()
> called something like MEM_FINISH_OFFLINE is still suboptimal, but that's
> where standby memory could be activated/deactivated, without messing with
> the altmap.
> 
> That way, the only s390x specific thing is that the memmap that should not
> be touched by anybody is actually inaccessible, and you'd
> activate/deactivate simply from the new notifier calls just the way we used
> to do.
ok. 

Thanks
David Hildenbrand Nov. 17, 2023, 3:37 p.m. UTC | #17
On 17.11.23 14:00, Gerald Schaefer wrote:
> On Fri, 17 Nov 2023 00:08:31 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Hi All,
>>>
>>> The patch series implements "memmap on memory" feature on s390 and
>>> provides the necessary fixes for it.
>>
>> Thinking about this, one thing that makes s390x different from all the
>> other architectures in this series is the altmap handling.
>>
>> I'm curious, why is that even required?
>>
>> A memmap that is not marked as online in the section should not be
>> touched by anybody (except memory onlining code :) ). And if we do, it's
>> usually a BUG because that memmap might contain garbage/be poisoned or
>> completely stale, so we might want to track that down and fix it in any
>> case.
>>
>> So what speaks against just leaving add_memory() populate the memmap
>> from the altmap? Then, also the page tables for the memmap are already
>> in place when onlining memory.
> 
> Good question, I am not 100% sure if we ran into bugs, or simply assumed
> that it is not OK to call __add_pages() when the memory for the altmap
> is not accessible.

I mean, we create the direct map even though nobody should access that 
memory, so maybe we can simply map the altmap even though nobody should 
access that memory.

As I said, then, even the page tables for the altmap are allocated 
already and memory onlining likely doesn't need any allocation anymore 
(except, there is kasan or some other memory notifiers have special 
demands).

Certainly simpler :)

> 
> Maybe there is also already a common code bug with that, s390 might be
> special but that is often also good for finding bugs in common code ...

If it's only the page_init_poison() as noted by Sumanth, we could 
disable that on s390x with an altmap some way or the other; should be 
possible.

I mean, you effectively have your own poisoning if the altmap is 
effectively inaccessible and makes your CPU angry on access :)

Last but not least, support for an inaccessible altmap might come in 
handy for virtio-mem eventually, and make altmap support eventually 
simpler. So added bonus points.

> 
>> Then, adding two new notifier calls on start of memory_block_online()
>> called something like MEM_PREPARE_ONLINE and one at the end of
>> memory_block_offline() called something like MEM_FINISH_OFFLINE is still
>> suboptimal, but that's where standby memory could be
>> activated/deactivated, without messing with the altmap.
>>
>> That way, the only s390x specific thing is that the memmap that should
>> not be touched by anybody is actually inaccessible, and you'd
>> activate/deactivate simply from the new notifier calls just the way we
>> used to do.
>>
>> It's still all worse than just adding/removing memory properly, using a
>> proper interface -- where you could alloc/free an actual memmap when the
>> altmap is not desired. But I know that people don't want to spend time
>> just doing it cleanly from scratch.
> 
> Yes, sometimes they need to be forced to do that :-)

I certainly won't force you if we can just keep the __add_pages() calls 
as is; having an altmap that is inaccessible but fully prepared sounds 
reasonable to me.

I can see how this gives an immediate benefit to existing s390x 
installations without being too hacky and without taking a long time to 
settle.

But I'll strongly suggest to evaluate a new interface long-term.

> 
> So, we'll look into defining a "proper interface", and treat patches 1-3
> separately as bug fixes? Especially patch 3 might be interesting for arm,
> if they do not have ZONE_DEVICE, but still use the functions, they might
> end up with the no-op version, not really freeing any memory.

It might make sense to

1) Send the first 3 out separately
2) Look into a simple variant that leaves __add_pages() calls alone and
    only adds the new MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers --
    well, and deals with an inaccessible altmap, like the
    page_init_poison() when the altmap might be inaccessible.
3) Look into a proper interface to add/remove memory instead of relying
    on online/offline.

2) is certainly an improvement and might be desired in some cases. 3) is 
more powerful (e.g., where you don't want an altmap because of 
fragmentation) and future proof.

I suspect there will be installations where an altmap is undesired: it 
fragments your address space with unmovable (memmap) allocations. 
Currently, runtime allocations of gigantic pages are affected. Long-term 
other large allocations (if we ever see very large THP) will be affected.

For that reason, we want to either support variable-sized memory blocks 
long-term, or simulate that by "grouping" memory blocks that share the 
same altmap located on the first memory blocks in that group: but 
onlining one block forces onlining of the whole group.

On s390x that adds all memory ahead of time, it's hard to make a 
decision what the right granularity will be, and seeing sudden 
online/offline changed behavior might be quite "surprising" for users. 
The user can give better hints when adding/removing memory explicitly.
David Hildenbrand Nov. 17, 2023, 3:37 p.m. UTC | #18
On 17.11.23 14:56, Sumanth Korikkar wrote:
> On Fri, Nov 17, 2023 at 12:08:31AM +0100, David Hildenbrand wrote:
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Hi All,
>>>
>>> The patch series implements "memmap on memory" feature on s390 and
>>> provides the necessary fixes for it.
>>
>> Thinking about this, one thing that makes s390x different from all the other
>> architectures in this series is the altmap handling.
>>
>> I'm curious, why is that even required?
>>
>> A memmap that is not marked as online in the section should not be touched
>> by anybody (except memory onlining code :) ). And if we do, it's usually a
>> BUG because that memmap might contain garbage/be poisoned or completely
>> stale, so we might want to track that down and fix it in any case.
>>
>> So what speaks against just leaving add_memory() populate the memmap from
>> the altmap? Then, also the page tables for the memmap are already in place
>> when onlining memory.
>>
> 
> we do have page_init_poison() in sparse_add_section() which should be
> handled later then. not in add_pages()

Was that all, or did you stumble over other things?
Sumanth Korikkar Nov. 17, 2023, 7:46 p.m. UTC | #19
On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote:
> 
> It might make sense to
> 
> 1) Send the first 3 out separately

Ok sure, I will first send 3 patches as bug fixes with your feedback
applied.

> 2) Look into a simple variant that leaves __add_pages() calls alone and
>    only adds the new MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers --
>    well, and deals with an inaccessible altmap, like the
>    page_init_poison() when the altmap might be inaccessible.

Thanks for the valuable feedback.

I quickly tried disabling page_init_poison() and removing the hack in
arch_add_memory() and arch_remove_memory(). Also used new
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers. The current testing
result looks promising and seems to work and no issues found so far.

I will also double check if there are any other memmap accesses in
add_pages() phase.

We will go with this approach for now, i.e. the notifiers you suggested
plus the __add_pages() change.

Do you have any suggestions on how we could check for an inaccessible altmap?

> 3) Look into a proper interface to add/remove memory instead of relying
>    on online/offline.

agree for long term.
> 
> 2) is certainly an improvement and might be desired in some cases. 3) is
> more powerful (e.g., where you don't want an altmap because of
> fragmentation) and future proof.
> 
> I suspect there will be installations where an altmap is undesired: it
> fragments your address space with unmovable (memmap) allocations. Currently,
> runtime allocations of gigantic pages are affected. Long-term other large
> allocations (if we ever see very large THP) will be affected.
> 
> For that reason, we want to either support variable-sized memory blocks
> long-term, or simulate that by "grouping" memory blocks that share the same
> altmap located on the first memory blocks in that group: but onlining one
> block forces onlining of the whole group.
> 
> On s390x that adds all memory ahead of time, it's hard to make a decision
> what the right granularity will be, and seeing sudden online/offline changed
> behavior might be quite "surprising" for users. The user can give better
> hints when adding/removing memory explicitly.

Thanks for providing insights and details.
David Hildenbrand Nov. 20, 2023, 2:49 p.m. UTC | #20
[catching up on mails]

>>> This new approach has the advantage that we do not need to allocate any
>>> additional memory during online phase, neither for direct mapping page
>>> tables nor struct pages, so that memory hotplug can never fail.
>>
>> Right, just like any other architecture that (triggered by whatever
>> mechanism) ends up calling add_memory() and friends.
> 
> Just for better understanding, are page tables for identity and also
> vmemmap mapping not allocated from system memory on other archs? I.e.
> no altmap support for that, only for struct pages (so far)?

Yes, only the actual "memmap ("struct page")" comes from altmap space, 
everything else comes from the buddy during memory hotplug.

> 
>>
>>>
>>> The old approach (without altmap) is already a hack, because we add
>>> the memmap / struct pages, but for memory that is not really accessible.
>>
>> Yes, it's disgusting. And you still allocate other things like memory
>> block devices or the identity map.
> 
> I would say it is special :-). And again, for understanding, all other

:)

> things apart from struct pages, still would need to be allocated from
> system memory on other archs?

Yes!

> 
> Of course, struct pages would be by far the biggest part, so having
> altmap support for that helps a lot. Doing the other allocations also
> via altmap would feel natural, but it is not possible yet, or did we
> miss something?

The tricky part is making sure ahead of time that we set aside the 
required number of pageblocks, to properly control during memory 
onlining what to set aside and what to expose to the buddy.

See mhp_supports_memmap_on_memory() / 
memory_block_memmap_on_memory_pages() for the dirty details :)

> 
>>
>>> And with all the disadvantage of pre-allocating struct pages from system
>>> memory.
>>
>> Jep. It never should have been done like that.
> 
> At that time, it gave the benefit over all others, that we do not need
> to allocate struct pages from system memory, at the time of memory online,
> when memory pressure might be high and such allocations might fail.

Agreed. Having the memmap already around can be helpful. But not for all 
standby memory, that's just pure waste.

... but as memory onlining is triggered by user space, it's likely that 
that user space cannot even make progress (e.g., start a process to set 
memory online) to actually trigger memory onlining in serious low-memory 
situations.

> 
> I guess you can say that it should have been done "right" at that time,
> e.g. by already adding something like altmap support, instead of our own
> hack.

Probably yes. IMHO, relying on the existing memory block interface was 
the low hanging fruit. Now, s390x is just special.

> 
>>
>>>
>>> The new approach allows to better integrate s390 to the existing
>>> interface, and also make use of altmap support, which would eliminate
>>> the major disadvantage of the old behaviour. So from s390 perspective,
>>> this new mechanism would be preferred, provided that there is no
>>> functional issue with the "added memory blocks without a memmap"
>>> approach.
>>
>> It achieves that by s390x specific hacks in common code :) Instead of
>> everybody else that simply uses add_memory() and friends.
>>
>>>
>>> Do you see any functional issues, e.g. conflict with common
>>> code?
>>
>> I don't see functional issues right now, just the way it is done to
>> implement support for a new feature is a hack IMHO. Replacing hack #1 by
>> hack #2 is not really something reasonable. Let's try to remove hacks.
> 
> Ok, sounds reasonable, let's try that. Introducing some new s390-specific
> interface also feels a bit hacky, or ugly, but we'll see if we find a
> way so that it is only "special" :-)

As proposed in my other mail, I think there are ways to make s390x happy 
first and look into a cleaner approach long-term.

> Reminds me a bit of that "probe" attribute, that also was an arch-specific
> hack initially, IIRC, and is now to be deprecated...

Yeah, that was for interfaces where the kernel has absolutely no clue 
where/what/how memory gets hotplugged. ARM64 without ACPI.

s390x is completely different though: you know exactly which standby 
memory exists, where it resides, in which granularity it can be 
enabled/disabled, ... you don't have to play dangerous "I'm pretty sure 
there is memory out there although nobody can check so I crash the 
kernel" games.
Sumanth Korikkar Nov. 21, 2023, 1:13 p.m. UTC | #21
On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote:
> > 
> > Maybe there is also already a common code bug with that, s390 might be
> > special but that is often also good for finding bugs in common code ...
> 
> If it's only the page_init_poison() as noted by Sumanth, we could disable
> that on s390x with an altmap some way or the other; should be possible.
> 
> I mean, you effectively have your own poisoning if the altmap is effectively
> inaccessible and makes your CPU angry on access :)
> 
> Last but not least, support for an inaccessible altmap might come in handy
> for virtio-mem eventually, and make altmap support eventually simpler. So
> added bonus points.

We tried out two possibilities for dealing with vmemmap altmap inaccessibility.
Approach 1: Add MHP_ALTMAP_INACCESSIBLE flag and pass it in add_memory()

diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index 075094ca59b4..ab2dfcc7e9e4 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -358,6 +358,13 @@ static int sclp_mem_notifier(struct notifier_block *nb,
 		 * buddy allocator later.
 		 */
 		__arch_set_page_nodat((void *)__va(start), memory_block->altmap->free);
+		/*
+		 * Poison the struct pages after the memory block is accessible.
+		 * This is needed only for altmap. Without altmap, the struct
+		 * pages are poisoned in sparse_add_section().
+		 */
+		if (memory_block->altmap->inaccessible)
+			page_init_poison(pfn_to_page(arg->start_pfn), memory_block->altmap->free);
 		break;
 	case MEM_FINISH_OFFLINE:
 		sclp_mem_change_state(start, size, 0);
@@ -412,7 +419,7 @@ static void __init add_memory_merged(u16 rn)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
 		add_memory(0, addr, block_size,
-			   MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY : MHP_NONE);
+			   MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY|MHP_ALTMAP_INACCESSIBLE : MHP_NONE);
 skip_add:
 	first_rn = rn;
 	num = 1;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 7d2076583494..5c70707e706f 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -106,6 +106,11 @@ typedef int __bitwise mhp_t;
  * implies the node id (nid).
  */
 #define MHP_NID_IS_MGID		((__force mhp_t)BIT(2))
+/*
+ * Mark memmap on memory (struct pages array) as inaccessible during memory
+ * hotplug addition phase.
+ */
+#define MHP_ALTMAP_INACCESSIBLE	((__force mhp_t)BIT(3))
 
 /*
  * Extended parameters for memory hotplug:
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 744c830f4b13..9837f3e6fb95 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -25,6 +25,7 @@ struct vmem_altmap {
 	unsigned long free;
 	unsigned long align;
 	unsigned long alloc;
+	bool inaccessible;
 };
 
 /*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7a5fc89a8652..d8299853cdcc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1439,6 +1439,8 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
 		if (mhp_supports_memmap_on_memory(size)) {
 			mhp_altmap.free = memory_block_memmap_on_memory_pages();
+			if (mhp_flags & MHP_ALTMAP_INACCESSIBLE)
+				mhp_altmap.inaccessible = true;
 			params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL);
 			if (!params.altmap) {
 				ret = -ENOMEM;
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e565045..3991c717b769 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -907,7 +907,8 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	 * Poison uninitialized struct pages in order to catch invalid flags
 	 * combinations.
 	 */
-	page_init_poison(memmap, sizeof(struct page) * nr_pages);
+	if (!altmap || !altmap->inaccessible)
+		page_init_poison(memmap, sizeof(struct page) * nr_pages);
 
 	ms = __nr_to_section(section_nr);
 	set_section_nid(section_nr, nid);
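For illustration, the gating logic of this approach can be modelled in a small
userspace sketch (all `_sim` names are hypothetical stand-ins, not the kernel
functions, and the poison byte is arbitrary here): poisoning is skipped in
sparse_add_section() while the altmap is marked inaccessible, and done from the
notifier path once the block is accessible.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define POISON_BYTE 0xaa	/* arbitrary poison value for this sketch */

struct vmem_altmap_sim { bool inaccessible; };
struct page_sim { unsigned long flags; };

/* stand-in for page_init_poison(): fill the struct pages with poison */
static void page_init_poison_sim(struct page_sim *memmap, size_t size)
{
	memset(memmap, POISON_BYTE, size);
}

/* sparse_add_section(): skip poisoning while the memmap is inaccessible */
static void sparse_add_section_sim(struct page_sim *memmap, size_t nr_pages,
				   struct vmem_altmap_sim *altmap)
{
	if (!altmap || !altmap->inaccessible)
		page_init_poison_sim(memmap, sizeof(*memmap) * nr_pages);
}

/* notifier path: the block just became accessible, poison deferred memmap */
static void mem_notifier_online_sim(struct page_sim *memmap, size_t nr_pages,
				    struct vmem_altmap_sim *altmap)
{
	if (altmap && altmap->inaccessible)
		page_init_poison_sim(memmap, sizeof(*memmap) * nr_pages);
}
```

With an inaccessible altmap the memmap stays untouched after
sparse_add_section_sim() and is poisoned only by the notifier path; without an
altmap it is poisoned immediately, matching the current behaviour.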


Approach 2:
===========
Shouldn't the kasan zero shadow mapping be performed first, before
accessing/initializing the memmap via page_init_poison()? If that is
true, then it is a problem for all architectures and could be
fixed like:


diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7a5fc89a8652..eb3975740537 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
 	if (ret)
 		return ret;

+	page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
 	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);

 	for (i = 0; i < nr_pages; i++)
diff --git a/mm/sparse.c b/mm/sparse.c
index 77d91e565045..4ddf53f52075 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
 	/*
 	 * Poison uninitialized struct pages in order to catch invalid flags
 	 * combinations.
+	 * For altmap, do this later when onlining the memory, as it might
+	 * not be accessible at this point.
 	 */
-	page_init_poison(memmap, sizeof(struct page) * nr_pages);
+	if (!altmap)
+		page_init_poison(memmap, sizeof(struct page) * nr_pages);

 	ms = __nr_to_section(section_nr);
 	set_section_nid(section_nr, nid);



Also, if this approach is taken, should page_init_poison() be performed
with cond_resched() as mentioned in commit d33695b16a9f
("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?

Opinions?

Thank you
Sumanth Korikkar Nov. 21, 2023, 1:21 p.m. UTC | #22
On Tue, Nov 21, 2023 at 02:13:22PM +0100, Sumanth Korikkar wrote:
> Approach 2:
> ===========
> Shouldnt kasan zero shadow mapping performed first before
> accessing/initializing memmap via page_init_poisining()?  If that is
> true, then it is a problem for all architectures and should could be
> fixed like:
> 
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7a5fc89a8652..eb3975740537 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
>  	if (ret)
>  		return ret;
> 
> +	page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
>  	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
> 
>  	for (i = 0; i < nr_pages; i++)
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 77d91e565045..4ddf53f52075 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
>  	/*
>  	 * Poison uninitialized struct pages in order to catch invalid flags
>  	 * combinations.
> +	 * For altmap, do this later when onlining the memory, as it might
> +	 * not be accessible at this point.
>  	 */
> -	page_init_poison(memmap, sizeof(struct page) * nr_pages);
> +	if (!altmap)
> +		page_init_poison(memmap, sizeof(struct page) * nr_pages);
> 
>  	ms = __nr_to_section(section_nr);
>  	set_section_nid(section_nr, nid);
> 
> 
> 
> Also, if this approach is taken, should page_init_poison() be performed
> with cond_resched() as mentioned in commit d33695b16a9f
> ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?

Sorry, wrong commit id.

should page_init_poison() be performed with cond_resched() as mentioned in
commit b7e3debdd040 ("mm/memory_hotplug.c: fix false softlockup
during pfn range removal")?

Thanks
> 
> Opinions?
> 
> Thank you
David Hildenbrand Nov. 21, 2023, 2:42 p.m. UTC | #23
On 21.11.23 14:21, Sumanth Korikkar wrote:
> On Tue, Nov 21, 2023 at 02:13:22PM +0100, Sumanth Korikkar wrote:
>> Approach 2:
>> ===========
>> Shouldnt kasan zero shadow mapping performed first before
>> accessing/initializing memmap via page_init_poisining()?  If that is
>> true, then it is a problem for all architectures and should could be
>> fixed like:
>>
>>
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 7a5fc89a8652..eb3975740537 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
>>   	if (ret)
>>   		return ret;
>>
>> +	page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
>>   	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
>>
>>   	for (i = 0; i < nr_pages; i++)
>> diff --git a/mm/sparse.c b/mm/sparse.c
>> index 77d91e565045..4ddf53f52075 100644
>> --- a/mm/sparse.c
>> +++ b/mm/sparse.c
>> @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
>>   	/*
>>   	 * Poison uninitialized struct pages in order to catch invalid flags
>>   	 * combinations.
>> +	 * For altmap, do this later when onlining the memory, as it might
>> +	 * not be accessible at this point.
>>   	 */
>> -	page_init_poison(memmap, sizeof(struct page) * nr_pages);
>> +	if (!altmap)
>> +		page_init_poison(memmap, sizeof(struct page) * nr_pages);
>>
>>   	ms = __nr_to_section(section_nr);
>>   	set_section_nid(section_nr, nid);
>>
>>
>>
>> Also, if this approach is taken, should page_init_poison() be performed
>> with cond_resched() as mentioned in commit d33695b16a9f
>> ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?
> 
> Sorry, wrong commit id.
> 
> should page_init_poison() be performed with cond_resched() as mentioned in
> Commit b7e3debdd040 ("mm/memory_hotplug.c: fix false softlockup
> during pfn range removal") ?

I think people are currently looking into removing all that cond_resched():

https://lore.kernel.org/all/20231107230822.371443-29-ankur.a.arora@oracle.com/T/#mda52da685a142bec9607625386b0b660e5470abe
David Hildenbrand Nov. 21, 2023, 7:24 p.m. UTC | #24
On 21.11.23 14:13, Sumanth Korikkar wrote:
> On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote:
>>>
>>> Maybe there is also already a common code bug with that, s390 might be
>>> special but that is often also good for finding bugs in common code ...
>>
>> If it's only the page_init_poison() as noted by Sumanth, we could disable
>> that on s390x with an altmap some way or the other; should be possible.
>>
>> I mean, you effectively have your own poisoning if the altmap is effectively
>> inaccessible and makes your CPU angry on access :)
>>
>> Last but not least, support for an inaccessible altmap might come in handy
>> for virtio-mem eventually, and make altmap support eventually simpler. So
>> added bonus points.
> 
> We tried out two possibilities dealing with vmemmap altmap inaccessibilty.
> Approach 1: Add MHP_ALTMAP_INACCESSIBLE flag and pass it in add_memory()
> 
> diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
> index 075094ca59b4..ab2dfcc7e9e4 100644
> --- a/drivers/s390/char/sclp_cmd.c
> +++ b/drivers/s390/char/sclp_cmd.c
> @@ -358,6 +358,13 @@ static int sclp_mem_notifier(struct notifier_block *nb,
>   		 * buddy allocator later.
>   		 */
>   		__arch_set_page_nodat((void *)__va(start), memory_block->altmap->free);
> +		/*
> +		 * Poison the struct pages after memory block is accessible.
> +		 * This is needed for only altmap. Without altmap, the struct
> +		 * pages are poisoined in sparse_add_section().
> +		 */
> +		if (memory_block->altmap->inaccessible)
> +			page_init_poison(pfn_to_page(arg->start_pfn), memory_block->altmap->free);

See below; it's likely best if the core does that.

>   		break;
>   	case MEM_FINISH_OFFLINE:
>   		sclp_mem_change_state(start, size, 0);
> @@ -412,7 +419,7 @@ static void __init add_memory_merged(u16 rn)
>   		goto skip_add;
>   	for (addr = start; addr < start + size; addr += block_size)
>   		add_memory(0, addr, block_size,
> -			   MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY : MHP_NONE);
> +			   MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY|MHP_ALTMAP_INACCESSIBLE : MHP_NONE);
>   skip_add:
>   	first_rn = rn;
>   	num = 1;
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 7d2076583494..5c70707e706f 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -106,6 +106,11 @@ typedef int __bitwise mhp_t;
>    * implies the node id (nid).
>    */
>   #define MHP_NID_IS_MGID		((__force mhp_t)BIT(2))
> +/*
> + * Mark memmap on memory (struct pages array) as inaccessible during memory
> + * hotplug addition phase.
> + */
> +#define MHP_ALTMAP_INACCESSIBLE	((__force mhp_t)BIT(3))

If we go that path, maybe rather MHP_OFFLINE_INACCESSIBLE and document 
how this relates to/interacts with the memory notifier callbacks and the 
altmap.

Then, we can logically abstract this from altmap handling. Simply, the 
memory should not be read/written before the memory notifier was called.

In the core, you can do the poisoning for the altmap in that case after 
calling the notifier, probably in mhp_init_memmap_on_memory() as you do 
below.
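That suggested ordering could look roughly like the following userspace sketch
(hypothetical `_sim` names; the real mhp_init_memmap_on_memory() does
considerably more): the notifier runs first and lets the architecture make the
range accessible, then the core poisons the altmap-backed memmap, then the
pages are moved to a zone.

```c
#include <stdbool.h>

enum step_sim { STEP_NOTIFIER, STEP_POISON, STEP_MOVE_TO_ZONE };

static enum step_sim order_log[8];
static int order_len;

static void record_sim(enum step_sim s)
{
	order_log[order_len++] = s;
}

/* the new notifier from this series: arch makes the memmap accessible */
static void memory_notify_sim(void)
{
	record_sim(STEP_NOTIFIER);
}

/* page_init_poison() stand-in */
static void page_init_poison_sim(void)
{
	record_sim(STEP_POISON);
}

static void move_pfn_range_to_zone_sim(void)
{
	record_sim(STEP_MOVE_TO_ZONE);
}

/* sketch of mhp_init_memmap_on_memory(): poison only after the notifier */
static void mhp_init_memmap_on_memory_sim(bool memmap_inaccessible)
{
	memory_notify_sim();
	if (memmap_inaccessible)
		page_init_poison_sim();
	move_pfn_range_to_zone_sim();
}
```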

>   
>   /*
>    * Extended parameters for memory hotplug:
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 744c830f4b13..9837f3e6fb95 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -25,6 +25,7 @@ struct vmem_altmap {
>   	unsigned long free;
>   	unsigned long align;
>   	unsigned long alloc;
> +	bool inaccessible;

We should then likely remember that information for the memory block, 
not the altmap.

[...]

> 
> 
> Approach 2:
> ===========
> Shouldnt kasan zero shadow mapping performed first before
> accessing/initializing memmap via page_init_poisining()?  If that is

Likely the existing way. The first access to the poisoned memmap should 
be a write, not a read. But I didn't look into the details.

Can you try reproducing?

> true, then it is a problem for all architectures and should could be
> fixed like:
> 
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7a5fc89a8652..eb3975740537 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
>   	if (ret)
>   		return ret;
> 
> +	page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
>   	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
> 
>   	for (i = 0; i < nr_pages; i++)
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 77d91e565045..4ddf53f52075 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
>   	/*
>   	 * Poison uninitialized struct pages in order to catch invalid flags
>   	 * combinations.
> +	 * For altmap, do this later when onlining the memory, as it might
> +	 * not be accessible at this point.
>   	 */
> -	page_init_poison(memmap, sizeof(struct page) * nr_pages);
> +	if (!altmap)
> +		page_init_poison(memmap, sizeof(struct page) * nr_pages);
> 
>   	ms = __nr_to_section(section_nr);
>   	set_section_nid(section_nr, nid);
> 

That's too generic when it comes to other altmap users, especially DAX 
or when the altmap is accessible while memory is offlining, and we want 
to catch that.

> 
> 
> Also, if this approach is taken, should page_init_poison() be performed
> with cond_resched() as mentioned in commit d33695b16a9f
> ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ?

Likely as soon as possible after it is accessible :)
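The batched poisoning pattern under discussion (poisoning with cond_resched()
in between, as in the soft-lockup fix referenced elsewhere in the thread)
could be sketched as a userspace analogue like this; the `_sim` names, poison
byte, and batch size are all hypothetical, and cond_resched() is stubbed out:

```c
#include <stddef.h>
#include <string.h>

#define POISON_BYTE 0xaa	/* arbitrary poison value */
#define PAGES_PER_BATCH 1024	/* arbitrary batch size */

struct page_sim { unsigned long flags; };

static void cond_resched_sim(void)
{
	/* in the kernel this would yield the CPU if rescheduling is due */
}

/*
 * Poison a large memmap in batches, calling cond_resched() between them
 * so that a huge range cannot keep the CPU busy long enough to trigger
 * a soft-lockup warning.
 */
static void page_init_poison_batched_sim(struct page_sim *memmap,
					 size_t nr_pages)
{
	size_t done = 0;

	while (done < nr_pages) {
		size_t batch = nr_pages - done;

		if (batch > PAGES_PER_BATCH)
			batch = PAGES_PER_BATCH;
		memset(memmap + done, POISON_BYTE, batch * sizeof(*memmap));
		done += batch;
		cond_resched_sim();
	}
}
```

Note that, as pointed out earlier in the thread, explicit cond_resched()
calls may be removed tree-wide, which would make the batching unnecessary.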
Sumanth Korikkar Nov. 22, 2023, 11:44 a.m. UTC | #25
On Tue, Nov 21, 2023 at 08:24:48PM +0100, David Hildenbrand wrote:
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 7d2076583494..5c70707e706f 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -106,6 +106,11 @@ typedef int __bitwise mhp_t;
> >    * implies the node id (nid).
> >    */
> >   #define MHP_NID_IS_MGID		((__force mhp_t)BIT(2))
> > +/*
> > + * Mark memmap on memory (struct pages array) as inaccessible during memory
> > + * hotplug addition phase.
> > + */
> > +#define MHP_ALTMAP_INACCESSIBLE	((__force mhp_t)BIT(3))
> 
> If we go that path, maybe rather MHP_OFFLINE_INACCESSIBLE and document how
> this relates to/interacts with the memory notifier callbacks and the altmap.
> 
> Then, we can logically abstract this from altmap handling. Simply, the
> memory should not be read/written before the memory notifier was called.
> 
> In the core, you can do the poisioning for the altmap in that case after
> calling the notifier, probably in mhp_init_memmap_on_memory() as you do
> below.
ok, sure. Thanks.
> 
> >   /*
> >    * Extended parameters for memory hotplug:
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index 744c830f4b13..9837f3e6fb95 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -25,6 +25,7 @@ struct vmem_altmap {
> >   	unsigned long free;
> >   	unsigned long align;
> >   	unsigned long alloc;
> > +	bool inaccessible;
> 
> We should then likely remember that information for the memory block, not
> the altmap.

I tried using an inaccessible field in memory_block and observed that
the memory block is created only after arch_add_memory() succeeds.
Hence, when checking for inaccessible memory in sparse_add_section()
(for page poisoning), there is still a reliance on the mhp_flags
conveyed through add_memory(). The memory_block inaccessible state is
then set in create_memory_block_devices().

I hope this approach is fine.

On the other hand, relying on an inaccessible flag in vmem_altmap could
make the checks in arch_add_memory() and other functions easier, I
suppose.
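The ordering constraint described above can be condensed into a small sketch
(hypothetical `_sim` stand-ins for the real call chain): by the time
sparse_add_section() runs, create_memory_block_devices() has not been called
yet, so only the flag carried via mhp_flags/the altmap is available there.

```c
#include <stdbool.h>

static bool memory_block_exists_sim;
static bool had_to_use_altmap_flag_sim;

/* sparse_add_section(): the memory block device does not exist yet here */
static void sparse_add_section_sim(bool altmap_inaccessible)
{
	if (altmap_inaccessible)
		had_to_use_altmap_flag_sim = !memory_block_exists_sim;
}

static void arch_add_memory_sim(bool altmap_inaccessible)
{
	sparse_add_section_sim(altmap_inaccessible);
}

static void create_memory_block_devices_sim(void)
{
	memory_block_exists_sim = true;
}

/*
 * Sketch of add_memory_resource(): memory block devices are created only
 * after arch_add_memory() has succeeded.
 */
static void add_memory_resource_sim(bool altmap_inaccessible)
{
	arch_add_memory_sim(altmap_inaccessible);
	create_memory_block_devices_sim();
}
```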

> 
> [...]
> 
> > 
> > 
> > Approach 2:
> > ===========
> > Shouldnt kasan zero shadow mapping performed first before
> > accessing/initializing memmap via page_init_poisining()?  If that is
> 
> Likely the existing way. The first access to the poisoned memmap should be a
> write, not a read. But I didn't look into the details.
> 
> Can you try reproducing?
>

Executing page_init_poison() right before kasan_add_zero_shadow() in the
context of altmap usage did not result in a crash when I attempted to
reproduce it.

However, with altmap + page_init_poison() within sparse_add_section(), a
crash occurs on our architecture, as previously discussed in this thread.

Thank you