Message ID: 20231114180238.1522782-1-sumanthk@linux.ibm.com (mailing list archive)
Series: implement "memmap on memory" feature on s390
On 14.11.23 19:02, Sumanth Korikkar wrote:

The patch subject talks about "fixing locking order", but it's actually missing locking, no?

> From Documentation/core-api/memory-hotplug.rst:
> When adding/removing/onlining/offlining memory or adding/removing
> heterogeneous/device memory, we should always hold the mem_hotplug_lock
> in write mode to serialise memory hotplug (e.g. access to global/zone
> variables).
>
> mhp_(de)init_memmap_on_memory() functions can change zone stats and
> struct page content, but they are currently called w/o the
> mem_hotplug_lock.
>
> When memory block is being offlined and when kmemleak goes through each
> populated zone, the following theoretical race conditions could occur:
>
> CPU 0:                                      | CPU 1:
> memory_offline()                            |
>  -> offline_pages()                         |
>     -> mem_hotplug_begin()                  |
>        ...                                  |
>     -> mem_hotplug_done()                   |
>                                             | kmemleak_scan()
>                                             |  -> get_online_mems()
>                                             |     ...
>  -> mhp_deinit_memmap_on_memory()           |
>     [not protected by mem_hotplug_begin/done()]
>     Marks memory section as offline,        | Retrieves zone_start_pfn
>     poisons vmemmap struct pages and updates| and struct page members.
>     the zone related data                   |
>                                             |     ...
>                                             |  -> put_online_mems()
>
> Fix this by ensuring mem_hotplug_lock is taken before performing
> mhp_init_memmap_on_memory(). Also ensure that
> mhp_deinit_memmap_on_memory() holds the lock.

What speaks against grabbing that lock in these functions?
On 14.11.23 19:02, Sumanth Korikkar wrote: > Add new memory notifiers to mimic the dynamic ACPI event triggered logic > for memory hotplug on platforms that do not generate such events. This > will be used to implement "memmap on memory" feature for s390 in a later > patch. > > Platforms such as x86 can support physical memory hotplug via ACPI. When > there is physical memory hotplug, ACPI event leads to the memory > addition with the following callchain: > acpi_memory_device_add() > -> acpi_memory_enable_device() > -> __add_memory() > > After this, the hotplugged memory is physically accessible, and altmap > support prepared, before the "memmap on memory" initialization in > memory_block_online() is called. > > On s390, memory hotplug works in a different way. The available hotplug > memory has to be defined upfront in the hypervisor, but it is made > physically accessible only when the user sets it online via sysfs, > currently in the MEM_GOING_ONLINE notifier. This requires calling > add_memory() during early memory detection, in order to get the sysfs > representation, but we cannot use "memmap on memory" altmap support at > this stage, w/o having it physically accessible. > > Since no ACPI or similar events are generated, there is no way to set up > altmap support, or even make the memory physically accessible at all, > before the "memmap on memory" initialization in memory_block_online(). > > The new MEM_PHYS_ONLINE notifier allows to work around this, by > providing a hook to make the memory physically accessible, and also call > __add_pages() with altmap support, early in memory_block_online(). > Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory > inaccessible and call __remove_pages(), at the end of > memory_block_offline(). > > Calling __add/remove_pages() requires mem_hotplug_lock, so move > mem_hotplug_begin/done() to include the new notifiers. 
> > All architectures ignore unknown memory notifiers, so this patch should > not introduce any functional changes. Sorry to say, no. No hacks please, and this is a hack for memory that has already been added to the system. If you want memory without an altmap to suddenly not have an altmap anymore, then look into removing and readding that memory, or some way to convert offline memory. But certainly not on the online/offline path triggered by sysfs.
On 14.11.23 19:02, Sumanth Korikkar wrote: > Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390 > > Implementation of MEM_PHYS_ONLINE Memory Notifier: > * Transition the memory block to an accessible/online state using the > sclp assign command. > * Execute __add_pages() for the memory block, enabling a self-contained > memory map range. For boot-time memory, vmemmap mapping is carried out > through sparse_init(). > > Implementation of MEM_PHYS_OFFLINE Memory Notifier: > * Execute __remove_pages() exclusively for the memory block (applicable > where a self-contained memory map was possible before). > * Shift the memory block to an inaccessible/offline state using the sclp > unassign command. > > Additional Implementation Considerations: > * When MHP_MEMMAP_ON_MEMORY is disabled, the system retains the old > behavior. This means the memory map is allocated from default memory, > and struct vmemmap pages are populated during the standby memory > detection phase. > * With MHP_MEMMAP_ON_MEMORY enabled (allowing self-contained memory > map), the memory map is allocated using the self-contained memory map > range. Struct vmemmap pages are populated during the memory hotplug > phase. > * If MACHINE_HAS_EDAT1 is unavailable, MHP_MEMMAP_ON_MEMORY is > automatically disabled. This ensures that vmemmap pagetables do not > consume additional memory from the default memory allocator. > * The MEM_GOING_ONLINE notifier has been modified to perform no > operation, as MEM_PHYS_ONLINE already executes the sclp assign > command. > * The MEM_CANCEL_ONLINE notifier now performs no operation, as > MEM_PHYS_OFFLINE already executes the sclp unassign command. > * The call to __add_pages() in arch_add_memory() with altmap support is > skipped. This operation is deferred and will be performed later in the > MEM_PHYS_ONLINE notifier. 
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>
> ---
>  arch/s390/mm/init.c          | 16 +++++++++++++++-
>  drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
>  2 files changed, 45 insertions(+), 4 deletions(-)
>
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index 8d9a60ccb777..db505ed590b2 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
> 	rc = vmem_add_mapping(start, size);
> 	if (rc)
> 		return rc;
> +	/*
> +	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
> +	 * onlining phase
> +	 */
> +	if (params->altmap)
> +		return 0;

So we'd have added memory blocks without a memmap? Sorry, but this seems to further hack into the s390x direction.

Maybe s390x should just provide a dedicated interface to add these memory blocks, instead of adding them during boot and then relying on the old way of using online/offline to set them online/offline.
On Tue, Nov 14, 2023 at 07:39:40PM +0100, David Hildenbrand wrote:
> On 14.11.23 19:02, Sumanth Korikkar wrote:
> > Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390
> ...
> >  arch/s390/mm/init.c          | 16 +++++++++++++++-
> >  drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
> >  2 files changed, 45 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> > index 8d9a60ccb777..db505ed590b2 100644
> > --- a/arch/s390/mm/init.c
> > +++ b/arch/s390/mm/init.c
> > @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
> > 	rc = vmem_add_mapping(start, size);
> > 	if (rc)
> > 		return rc;
> > +	/*
> > +	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
> > +	 * onlining phase
> > +	 */
> > +	if (params->altmap)
> > +		return 0;
>
> So we'd have added memory blocks without a memmap? Sorry, but this seems to
> further hack into the s390x direction.

This new approach has the advantage that we do not need to allocate any additional memory during online phase, neither for direct mapping page tables nor struct pages, so that memory hotplug can never fail.

The old approach (without altmap) is already a hack, because we add the memmap / struct pages, but for memory that is not really accessible. And with all the disadvantage of pre-allocating struct pages from system memory.

The new approach allows to better integrate s390 to the existing interface, and also make use of altmap support, which would eliminate the major disadvantage of the old behaviour. So from s390 perspective, this new mechanism would be preferred, provided that there is no functional issue with the "added memory blocks without a memmap" approach.

Do you see any functional issues, e.g. conflict with common code?
> Maybe s390x should just provide a dedicate interface to add these memory
> blocks instead of adding them during boot and then relying on the old way of
> using online/offline set them online/offline.

Existing behavior:
The current 'lsmem -a' command displays both online and standby memory.

Interface changes:
If a new interface is introduced and standby memory is no longer listed, the following consequences might occur:

1. Running 'lsmem -a' would only show online memory, potentially leading
   to user complaints.
2. Standby memory addition would need:
   * echo "standby memory addr" > /sys/devices/system/memory/probe
   As far as I understand, this interface is already deprecated.
3. To remove standby memory, a new interface probe_remove is needed:
   * echo "standby memory addr" > /sys/devices/system/memory/probe_remove
4. Users may express a need to identify standby memory addresses,
   resulting in the creation of another interface to list these standby
   memory ranges.

Hence, introducing new physical memory notifiers to platforms lacking dynamic ACPI events would be highly advantageous, while maintaining the existing user-friendly interface.

Thanks
On Tue, Nov 14, 2023 at 07:27:35PM +0100, David Hildenbrand wrote: > On 14.11.23 19:02, Sumanth Korikkar wrote: > > Add new memory notifiers to mimic the dynamic ACPI event triggered logic > > for memory hotplug on platforms that do not generate such events. This > > will be used to implement "memmap on memory" feature for s390 in a later > > patch. > > > > Platforms such as x86 can support physical memory hotplug via ACPI. When > > there is physical memory hotplug, ACPI event leads to the memory > > addition with the following callchain: > > acpi_memory_device_add() > > -> acpi_memory_enable_device() > > -> __add_memory() > > > > After this, the hotplugged memory is physically accessible, and altmap > > support prepared, before the "memmap on memory" initialization in > > memory_block_online() is called. > > > > On s390, memory hotplug works in a different way. The available hotplug > > memory has to be defined upfront in the hypervisor, but it is made > > physically accessible only when the user sets it online via sysfs, > > currently in the MEM_GOING_ONLINE notifier. This requires calling > > add_memory() during early memory detection, in order to get the sysfs > > representation, but we cannot use "memmap on memory" altmap support at > > this stage, w/o having it physically accessible. > > > > Since no ACPI or similar events are generated, there is no way to set up > > altmap support, or even make the memory physically accessible at all, > > before the "memmap on memory" initialization in memory_block_online(). > > > > The new MEM_PHYS_ONLINE notifier allows to work around this, by > > providing a hook to make the memory physically accessible, and also call > > __add_pages() with altmap support, early in memory_block_online(). > > Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory > > inaccessible and call __remove_pages(), at the end of > > memory_block_offline(). 
> > Calling __add/remove_pages() requires mem_hotplug_lock, so move
> > mem_hotplug_begin/done() to include the new notifiers.
> >
> > All architectures ignore unknown memory notifiers, so this patch should
> > not introduce any functional changes.
>
> Sorry to say, no. No hacks please, and this is a hack for memory that has
> already been added to the system.
>
> If you want memory without an altmap to suddenly not have an altmap anymore,
> then look into removing and readding that memory, or some way to convert
> offline memory.

Sorry, I couldn't get the context. Could you please give me more details?

Thanks
On Tue, 14 Nov 2023 19:27:35 +0100 David Hildenbrand <david@redhat.com> wrote: > On 14.11.23 19:02, Sumanth Korikkar wrote: > > Add new memory notifiers to mimic the dynamic ACPI event triggered logic > > for memory hotplug on platforms that do not generate such events. This > > will be used to implement "memmap on memory" feature for s390 in a later > > patch. > > > > Platforms such as x86 can support physical memory hotplug via ACPI. When > > there is physical memory hotplug, ACPI event leads to the memory > > addition with the following callchain: > > acpi_memory_device_add() > > -> acpi_memory_enable_device() > > -> __add_memory() > > > > After this, the hotplugged memory is physically accessible, and altmap > > support prepared, before the "memmap on memory" initialization in > > memory_block_online() is called. > > > > On s390, memory hotplug works in a different way. The available hotplug > > memory has to be defined upfront in the hypervisor, but it is made > > physically accessible only when the user sets it online via sysfs, > > currently in the MEM_GOING_ONLINE notifier. This requires calling > > add_memory() during early memory detection, in order to get the sysfs > > representation, but we cannot use "memmap on memory" altmap support at > > this stage, w/o having it physically accessible. > > > > Since no ACPI or similar events are generated, there is no way to set up > > altmap support, or even make the memory physically accessible at all, > > before the "memmap on memory" initialization in memory_block_online(). > > > > The new MEM_PHYS_ONLINE notifier allows to work around this, by > > providing a hook to make the memory physically accessible, and also call > > __add_pages() with altmap support, early in memory_block_online(). > > Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory > > inaccessible and call __remove_pages(), at the end of > > memory_block_offline(). 
> > Calling __add/remove_pages() requires mem_hotplug_lock, so move
> > mem_hotplug_begin/done() to include the new notifiers.
> >
> > All architectures ignore unknown memory notifiers, so this patch should
> > not introduce any functional changes.
>
> Sorry to say, no. No hacks please, and this is a hack for memory that
> has already been added to the system.

IIUC, when we enter memory_block_online(), memory has always already been added to the system, on all architectures. E.g. via ACPI events on x86, or with the existing s390 hack, where we add it at boot time, including memmap allocated from system memory. Without a preceding add_memory() you cannot reach memory_block_online() via sysfs online.

The difference is that for s390, the memory is not yet physically accessible, and therefore we cannot use the existing altmap support in memory_block_online(), which requires that the memory is accessible before it calls mhp_init_memmap_on_memory().

Currently, on s390 we make the memory accessible in the GOING_ONLINE notifier, by sclp call to the hypervisor. That is too late for the altmap setup code in memory_block_online(), therefore we'd like to introduce the new notifier, to have a hook where we can make it accessible earlier, and after that there is no difference to how it works for other architectures, and we can make use of the existing altmap support.

> If you want memory without an altmap to suddenly not have an altmap
> anymore, then look into removing and readding that memory, or some way
> to convert offline memory.

We do not want to have memory suddenly not have altmap support any more, but simply get a hook so that we can prepare the memory to have altmap support. This means making it physically accessible, and calling __add_pages() for altmap support, which for other architectures has already happened before.
Of course, it is a hack for s390, that we must skip __add_pages() in the initial (arch_)add_memory() during boot time, when we want altmap support, because the memory simply is not accessible at that time. But s390 memory hotplug support has always been a hack, and had to be, because of how it is implemented by the architecture.

So we replace one hack with another one, that has the huge advantage that we do not need to allocate struct pages upfront from system memory any more, for the whole possible online memory range.

And the current approach comes without any change to existing interfaces, and minimal change to common code, i.e. these new notifiers, that should not have any impact on other architectures.

What exactly is your concern regarding the new notifiers? Is it useless no-op notifier calls on other archs (not sure if they would get optimized out by compiler)?
On 15.11.23 14:41, Sumanth Korikkar wrote: > On Tue, Nov 14, 2023 at 07:22:33PM +0100, David Hildenbrand wrote: >> On 14.11.23 19:02, Sumanth Korikkar wrote: >> >> The patch subject talks about "fixing locking order", but it's actually >> missing locking, no? >> >>> From Documentation/core-api/memory-hotplug.rst: >>> When adding/removing/onlining/offlining memory or adding/removing >>> heterogeneous/device memory, we should always hold the mem_hotplug_lock >>> in write mode to serialise memory hotplug (e.g. access to global/zone >>> variables). >>> >>> mhp_(de)init_memmap_on_memory() functions can change zone stats and >>> struct page content, but they are currently called w/o the >>> mem_hotplug_lock. >>> >>> When memory block is being offlined and when kmemleak goes through each >>> populated zone, the following theoretical race conditions could occur: >>> CPU 0: | CPU 1: >>> memory_offline() | >>> -> offline_pages() | >>> -> mem_hotplug_begin() | >>> ... | >>> -> mem_hotplug_done() | >>> | kmemleak_scan() >>> | -> get_online_mems() >>> | ... >>> -> mhp_deinit_memmap_on_memory() | >>> [not protected by mem_hotplug_begin/done()]| >>> Marks memory section as offline, | Retrieves zone_start_pfn >>> poisons vmemmap struct pages and updates | and struct page members. >>> the zone related data | >>> | ... >>> | -> put_online_mems() >>> >>> Fix this by ensuring mem_hotplug_lock is taken before performing >>> mhp_init_memmap_on_memory(). Also ensure that >>> mhp_deinit_memmap_on_memory() holds the lock. >> >> What speaks against grabbing that lock in these functions? >> > At present, the functions online_pages() and offline_pages() acquire the > mem_hotplug_lock right at the start. However, given the necessity of > locking in mhp_(de)init_memmap_on_memory(), it would be more efficient > to consolidate the locking process by holding the mem_hotplug_lock once > in memory_block_online() and memory_block_offline(). 
Good point; can you similarly add comments to these two functions that they need that lock in write mode?
On 15.11.23 16:03, Gerald Schaefer wrote: > On Tue, 14 Nov 2023 19:27:35 +0100 > David Hildenbrand <david@redhat.com> wrote: > >> On 14.11.23 19:02, Sumanth Korikkar wrote: >>> Add new memory notifiers to mimic the dynamic ACPI event triggered logic >>> for memory hotplug on platforms that do not generate such events. This >>> will be used to implement "memmap on memory" feature for s390 in a later >>> patch. >>> >>> Platforms such as x86 can support physical memory hotplug via ACPI. When >>> there is physical memory hotplug, ACPI event leads to the memory >>> addition with the following callchain: >>> acpi_memory_device_add() >>> -> acpi_memory_enable_device() >>> -> __add_memory() >>> >>> After this, the hotplugged memory is physically accessible, and altmap >>> support prepared, before the "memmap on memory" initialization in >>> memory_block_online() is called. >>> >>> On s390, memory hotplug works in a different way. The available hotplug >>> memory has to be defined upfront in the hypervisor, but it is made >>> physically accessible only when the user sets it online via sysfs, >>> currently in the MEM_GOING_ONLINE notifier. This requires calling >>> add_memory() during early memory detection, in order to get the sysfs >>> representation, but we cannot use "memmap on memory" altmap support at >>> this stage, w/o having it physically accessible. >>> >>> Since no ACPI or similar events are generated, there is no way to set up >>> altmap support, or even make the memory physically accessible at all, >>> before the "memmap on memory" initialization in memory_block_online(). >>> >>> The new MEM_PHYS_ONLINE notifier allows to work around this, by >>> providing a hook to make the memory physically accessible, and also call >>> __add_pages() with altmap support, early in memory_block_online(). >>> Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory >>> inaccessible and call __remove_pages(), at the end of >>> memory_block_offline(). 
>>> Calling __add/remove_pages() requires mem_hotplug_lock, so move
>>> mem_hotplug_begin/done() to include the new notifiers.
>>>
>>> All architectures ignore unknown memory notifiers, so this patch should
>>> not introduce any functional changes.
>>
>> Sorry to say, no. No hacks please, and this is a hack for memory that
>> has already been added to the system.
>
> IIUC, when we enter memory_block_online(), memory has always already
> been added to the system, on all architectures. E.g. via ACPI events
> on x86, or with the existing s390 hack, where we add it at boot time,
> including memmap allocated from system memory. Without a preceding
> add_memory() you cannot reach memory_block_online() via sysfs online.

Adding that memory block at boot time is the legacy leftover s390x is carrying along; and now we want to "work around" that by adding s390x special handling to the online/offline code and having memory blocks without any memmap, or configuring an altmap at the very last minute using an s390x-specific memory notifier.

Instead, if you want to support the altmap, the kernel should not add standby memory to the system (if configured for this new feature), but instead only remember the standby memory ranges so it knows what can later be added and what can't.

From there, users should have an interface where they can actually add memory to the system, and either online it manually or just let the kernel online it automatically. s390x code will call add_memory() and properly prepare an altmap if requested, and make that standby memory available. You can then even have an interface to remove that memory again once offline. That will work with an altmap or without an altmap.

This approach is aligned with any other code that hot(un)plugs memory, and is compatible with things like variable-sized memory blocks people have been talking about quite a while already, and altmaps that span multiple memory blocks to make gigantic pages in such ranges usable.
Sure, you'll have a new interface and have to enable the new handling for the new kernel, but you're asking for support of a new feature that cannot be supported cleanly the way any other architecture does it. But it's a clean approach and probably should have been done that way right from the start (decades ago).

Note: We do have the same for other architectures without ACPI that add memory via the probe interface. But IIRC we cannot really do any checks there, because these architectures have no way of identifying what can actually be added.

> The difference is that for s390, the memory is not yet physically
> accessible, and therefore we cannot use the existing altmap support
> in memory_block_online(), which requires that the memory is accessible
> before it calls mhp_init_memmap_on_memory().
>
> Currently, on s390 we make the memory accessible in the GOING_ONLINE
> notifier, by sclp call to the hypervisor. That is too late for altmap
> setup code in memory_block_online(), therefore we'd like to introduce
> the new notifier, to have a hook where we can make it accessible
> earlier, and after that there is no difference to how it works for
> other architectures, and we can make use of the existing altmap support.
>
>> If you want memory without an altmap to suddenly not have an altmap
>> anymore, then look into removing and readding that memory, or some way
>> to convert offline memory.
>
> We do not want to have memory suddenly not have an altmap support
> any more, but simply get a hook so that we can prepare the memory
> to have altmap support. This means making it physically accessible,
> and calling __add_pages() for altmap support, which for other
> architecture has already happened before.
>
> Of course, it is a hack for s390, that we must skip __add_pages()
> in the initial (arch_)add_memory() during boot time, when we want
> altmap support, because the memory simply is not accessible at that
> time.
> But s390 memory hotplug support has always been a hack, and
> had to be, because of how it is implemented by the architecture.

I wrote the above paragraph before reading this; and it's fully aligned with what I said above.

> So we replace one hack with another one, that has the huge advantage
> that we do not need to allocate struct pages upfront from system
> memory any more, for the whole possible online memory range.
>
> And the current approach comes without any change to existing
> interfaces, and minimal change to common code, i.e. these new
> notifiers, that should not have any impact on other architectures.
>
> What exactly is your concern regarding the new notifiers? Is it
> useless no-op notifier calls on other archs (not sure if they
> would get optimized out by compiler)?

That it makes hotplug code more special because of s390x, instead of cleaning up that legacy code.
On 15.11.23 15:23, Sumanth Korikkar wrote: > On Tue, Nov 14, 2023 at 07:27:35PM +0100, David Hildenbrand wrote: >> On 14.11.23 19:02, Sumanth Korikkar wrote: >>> Add new memory notifiers to mimic the dynamic ACPI event triggered logic >>> for memory hotplug on platforms that do not generate such events. This >>> will be used to implement "memmap on memory" feature for s390 in a later >>> patch. >>> >>> Platforms such as x86 can support physical memory hotplug via ACPI. When >>> there is physical memory hotplug, ACPI event leads to the memory >>> addition with the following callchain: >>> acpi_memory_device_add() >>> -> acpi_memory_enable_device() >>> -> __add_memory() >>> >>> After this, the hotplugged memory is physically accessible, and altmap >>> support prepared, before the "memmap on memory" initialization in >>> memory_block_online() is called. >>> >>> On s390, memory hotplug works in a different way. The available hotplug >>> memory has to be defined upfront in the hypervisor, but it is made >>> physically accessible only when the user sets it online via sysfs, >>> currently in the MEM_GOING_ONLINE notifier. This requires calling >>> add_memory() during early memory detection, in order to get the sysfs >>> representation, but we cannot use "memmap on memory" altmap support at >>> this stage, w/o having it physically accessible. >>> >>> Since no ACPI or similar events are generated, there is no way to set up >>> altmap support, or even make the memory physically accessible at all, >>> before the "memmap on memory" initialization in memory_block_online(). >>> >>> The new MEM_PHYS_ONLINE notifier allows to work around this, by >>> providing a hook to make the memory physically accessible, and also call >>> __add_pages() with altmap support, early in memory_block_online(). >>> Similarly, the MEM_PHYS_OFFLINE notifier allows to make the memory >>> inaccessible and call __remove_pages(), at the end of >>> memory_block_offline(). 
>>> >>> Calling __add/remove_pages() requires mem_hotplug_lock, so move >>> mem_hotplug_begin/done() to include the new notifiers. >>> >>> All architectures ignore unknown memory notifiers, so this patch should >>> not introduce any functional changes. >> >> Sorry to say, no. No hacks please, and this is a hack for memory that has >> already been added to the system. >> >> If you want memory without an altmap to suddenly not have an altmap anymore, >> then look into removing and readding that memory, or some way to convert >> offline memory. > > Sorry, I couldnt get the context. Could you please give me more details? See my reply to Gerald. In an ideal world, there would not be any new callbacks, we would get rid of them, and just let the architecture properly hotplug memory to the system when requested by the user.
On 15.11.23 15:20, Sumanth Korikkar wrote:
> On Tue, Nov 14, 2023 at 07:39:40PM +0100, David Hildenbrand wrote:
>> On 14.11.23 19:02, Sumanth Korikkar wrote:
>>> Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390
> ...
>>>   arch/s390/mm/init.c          | 16 +++++++++++++++-
>>>   drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++---
>>>   2 files changed, 45 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
>>> index 8d9a60ccb777..db505ed590b2 100644
>>> --- a/arch/s390/mm/init.c
>>> +++ b/arch/s390/mm/init.c
>>> @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
>>> 	rc = vmem_add_mapping(start, size);
>>> 	if (rc)
>>> 		return rc;
>>> +	/*
>>> +	 * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory
>>> +	 * onlining phase
>>> +	 */
>>> +	if (params->altmap)
>>> +		return 0;
>>
>> So we'd have added memory blocks without a memmap? Sorry, but this seems to
>> further hack into the s390x direction.
>
> This new approach has the advantage that we do not need to allocate any
> additional memory during online phase, neither for direct mapping page
> tables nor struct pages, so that memory hotplug can never fail.

Right, just like any other architecture that (triggered by whatever mechanism) ends up calling add_memory() and friends.

> The old approach (without altmap) is already a hack, because we add
> the memmap / struct pages, but for memory that is not really accessible.

Yes, it's disgusting. And you still allocate other things like memory block devices or the identity map.

> And with all the disadvantage of pre-allocating struct pages from system
> memory.

Jep. It never should have been done like that.

> The new approach allows to better integrate s390 to the existing
> interface, and also make use of altmap support, which would eliminate
> the major disadvantage of the old behaviour.
So from s390 perspective, > this new mechanism would be preferred, provided that there is no > functional issue with the "added memory blocks without a memmap" > approach. It achieves that by s390x specific hacks in common code :) Instead of everybody else that simply uses add_memory() and friends. > > Do you see any functional issues, e.g. conflict with common > code? I don't see functional issues right now, just the way it is done to implement support for a new feature is a hack IMHO. Replacing hack #1 by hack #2 is not really something reasonable. Let's try to remove hacks. >> >> Maybe s390x should just provide a dedicate interface to add these memory >> blocks instead of adding them during boot and then relying on the old way of >> using online/offline set them online/offline. > > Existing behavior: > The current 'lsmem -a' command displays both online and standby memory. > > interface changes: > If a new interface is introduced and standby memory is no longer listed, > the following consequences might occur: > > 1. Running 'lsmem -a' would only show online memory, potentially leading > to user complaints. That's why the new, clean way of doing it will require a world switch. If the admin wants the benefits of altmap/memmap allocation, it can be enabled. > 2. standby memory addition would need: > * echo "standby memory addr" > /sys/devices/system/memory/probe > As far as I understand, this interface is already deprecated. It should actually be an s390x specific interface where people are able to query the standby ranges, and request to add/remove them. There, s390x can perform checks and setup everything accordingly before calling add_memory() and have the memory onlined. We do have something comparable with the dax/kmem infrastructure: users configure the available memory to hotplug, and then hotplug it. Tooling onlines that memory automatically. Ideally they will add ranges, not memory blocks. > > 3. 
To remove standby memory, a new interface probe_remove is needed > * echo "standby memory addr" > /sys/devices/system/memory/probe_remove > Similarly, an s390x specific interface that performs checks and properly tears everything s390x-specific down -- for example, turning system RAM into standby RAM again. > 4. Users may express a need to identify standby memory addresses, > resulting in the creation of another interface to list these standby > memory ranges. Exactly. Memory that is not added to the system does not consume any resources, but can be added on demand using an interface that is not the second stage (onlining/offlining) of memory hot(un)plug.
>>> >>> Maybe s390x should just provide a dedicate interface to add these memory >>> blocks instead of adding them during boot and then relying on the old way of >>> using online/offline set them online/offline. >> >> Existing behavior: >> The current 'lsmem -a' command displays both online and standby memory. >> >> interface changes: >> If a new interface is introduced and standby memory is no longer listed, >> the following consequences might occur: >> >> 1. Running 'lsmem -a' would only show online memory, potentially leading >> to user complaints. > > That's why the new, clean way of doing it will require a world switch. > If the admin wants the benefits of altmap/memmap allocation, it can be > enabled. BTW, thinking about it, I guess one could teach lsmem (and maybe chmem) to consult additional interfaces on s390x to show standby memory that's not added to the system yet.
On 14.11.23 19:02, Sumanth Korikkar wrote: > Hi All, > > The patch series implements "memmap on memory" feature on s390 and > provides the necessary fixes for it. Thinking about this, one thing that makes s390x different from all the other architectures in this series is the altmap handling. I'm curious, why is that even required? A memmap that is not marked as online in the section should not be touched by anybody (except memory onlining code :) ). And if we do, it's usually a BUG because that memmap might contain garbage/be poisoned or completely stale, so we might want to track that down and fix it in any case. So what speaks against just leaving add_memory() populate the memmap from the altmap? Then, also the page tables for the memmap are already in place when onlining memory. Then, adding two new notifier calls at the start of memory_block_online() called something like MEM_PREPARE_ONLINE and at the end of memory_block_offline() called something like MEM_FINISH_OFFLINE is still suboptimal, but that's where standby memory could be activated/deactivated, without messing with the altmap. That way, the only s390x specific thing is that the memmap that should not be touched by anybody is actually inaccessible, and you'd activate/deactivate simply from the new notifier calls just the way we used to do. It's still all worse than just adding/removing memory properly, using a proper interface -- where you could alloc/free an actual memmap when the altmap is not desired. But I know that people don't want to spend time just doing it cleanly from scratch.
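The two-stage flow proposed here can be sketched with a small user-space model of the notifier ordering. Everything below is a simplified stand-in: MEM_PREPARE_ONLINE and MEM_FINISH_OFFLINE were only proposed names at this point in the thread, and this sclp_mem_notifier() only models the s390 behavior (making standby memory accessible before anything touches the memmap that lives on it), not the real implementation:

```c
#include <assert.h>

/* Proposed notifier events; names from the discussion, values arbitrary */
enum mem_event {
	MEM_PREPARE_ONLINE,	/* new: make standby memory accessible */
	MEM_GOING_ONLINE,
	MEM_ONLINE,
	MEM_GOING_OFFLINE,
	MEM_OFFLINE,
	MEM_FINISH_OFFLINE,	/* new: make standby memory inaccessible */
};

static int accessible;		/* models the standby range being backed */
static enum mem_event event_log[8];
static int log_idx;

/* simplified stand-in for the s390 sclp_mem_notifier() */
static int sclp_mem_notifier(enum mem_event ev)
{
	event_log[log_idx++] = ev;
	if (ev == MEM_PREPARE_ONLINE)
		accessible = 1;	/* models sclp_mem_change_state(start, size, 1) */
	else if (ev == MEM_FINISH_OFFLINE)
		accessible = 0;	/* models sclp_mem_change_state(start, size, 0) */
	return 0;
}

/* memory_block_online() with the proposed first-stage hook: the altmap
 * memmap, which lives on the new memory, may only be written after it */
static void memory_block_online_sketch(void)
{
	sclp_mem_notifier(MEM_PREPARE_ONLINE);
	assert(accessible);	/* only now is touching the memmap safe */
	sclp_mem_notifier(MEM_GOING_ONLINE);
	sclp_mem_notifier(MEM_ONLINE);
}

/* memory_block_offline() with the proposed last-stage hook */
static void memory_block_offline_sketch(void)
{
	sclp_mem_notifier(MEM_GOING_OFFLINE);
	sclp_mem_notifier(MEM_OFFLINE);
	sclp_mem_notifier(MEM_FINISH_OFFLINE);	/* memmap no longer needed */
}
```

The point of the ordering is visible in the asserts: the prepare hook runs strictly before the existing GOING_ONLINE stage, and the finish hook strictly after OFFLINE.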
On Fri, 17 Nov 2023 00:08:31 +0100 David Hildenbrand <david@redhat.com> wrote: > On 14.11.23 19:02, Sumanth Korikkar wrote: > > Hi All, > > > > The patch series implements "memmap on memory" feature on s390 and > > provides the necessary fixes for it. > > Thinking about this, one thing that makes s390x different from all the > other architectures in this series is the altmap handling. > > I'm curious, why is that even required? > > A memmep that is not marked as online in the section should not be > touched by anybody (except memory onlining code :) ). And if we do, it's > usually a BUG because that memmap might contain garbage/be poisoned or > completely stale, so we might want to track that down and fix it in any > case. > > So what speaks against just leaving add_memory() populate the memmap > from the altmap? Then, also the page tables for the memmap are already > in place when onlining memory. Good question, I am not 100% sure if we ran into bugs, or simply assumed that it is not OK to call __add_pages() when the memory for the altmap is not accessible. Maybe there is also already a common code bug with that, s390 might be special but that is often also good for finding bugs in common code ... > Then, adding two new notifier calls on start of memory_block_online() > called something like MEM_PREPARE_ONLINE and end the end of > memory_block_offline() called something like MEM_FINISH_OFFLINE is still > suboptimal, but that's where standby memory could be > activated/deactivated, without messing with the altmap. > > That way, the only s390x specific thing is that the memmap that should > not be touched by anybody is actually inaccessible, and you'd > activate/deactivate simply from the new notifier calls just the way we > used to do. > > It's still all worse than just adding/removing memory properly, using a > proper interface -- where you could alloc/free an actual memmap when the > altmap is not desired. 
But I know that people don't want to spend time > just doing it cleanly from scratch. Yes, sometimes they need to be forced to do that :-) So, we'll look into defining a "proper interface", and treat patches 1-3 separately as bug fixes? Especially patch 3 might be interesting for arm, if they do not have ZONE_DEVICE, but still use the functions, they might end up with the no-op version, not really freeing any memory.
On Thu, 16 Nov 2023 20:16:02 +0100 David Hildenbrand <david@redhat.com> wrote: > On 15.11.23 15:20, Sumanth Korikkar wrote: > > On Tue, Nov 14, 2023 at 07:39:40PM +0100, David Hildenbrand wrote: > >> On 14.11.23 19:02, Sumanth Korikkar wrote: > >>> Implement MEM_PHYS_ONLINE and MEM_PHYS_OFFLINE memory notifiers on s390 > >>> > > ... > >>> arch/s390/mm/init.c | 16 +++++++++++++++- > >>> drivers/s390/char/sclp_cmd.c | 33 ++++++++++++++++++++++++++++++--- > >>> 2 files changed, 45 insertions(+), 4 deletions(-) > >>> > >>> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c > >>> index 8d9a60ccb777..db505ed590b2 100644 > >>> --- a/arch/s390/mm/init.c > >>> +++ b/arch/s390/mm/init.c > >>> @@ -288,6 +288,12 @@ int arch_add_memory(int nid, u64 start, u64 size, > >>> rc = vmem_add_mapping(start, size); > >>> if (rc) > >>> return rc; > >>> + /* > >>> + * If MHP_MEMMAP_ON_MEMORY is enabled, perform __add_pages() during memory > >>> + * onlining phase > >>> + */ > >>> + if (params->altmap) > >>> + return 0; > >> > >> > >> So we'd have added memory blocks without a memmap? Sorry, but this seems to > >> further hack into the s390x direction. > > > > This new approach has the advantage that we do not need to allocate any > > additional memory during online phase, neither for direct mapping page > > tables nor struct pages, so that memory hotplug can never fail. > > Right, just like any other architecture that (triggered by whatever > mechanism) ends up calling add_memory() and friends. Just for better understanding, are page tables for identity and also vmemmap mapping not allocated from system memory on other archs? I.e. no altmap support for that, only for struct pages (so far)? > > > > > The old approach (without altmap) is already a hack, because we add > > the memmap / struct pages, but for memory that is not really accessible. > > Yes, it's disgusting. And you still allocate other things like memory > block devices or the identify map. I would say it is special :-). 
And again, for understanding, all other things apart from struct pages, still would need to be allocated from system memory on other archs? Of course, struct pages would be by far the biggest part, so having altmap support for that helps a lot. Doing the other allocations also via altmap would feel natural, but it is not possible yet, or did we miss something? > > > And with all the disadvantage of pre-allocating struct pages from system > > memory. > > Jep. It never should have been done like that. At that time, it gave the benefit over all others, that we do not need to allocate struct pages from system memory, at the time of memory online, when memory pressure might be high and such allocations might fail. I guess you can say that it should have been done "right" at that time, e.g. by already adding something like altmap support, instead of our own hack. > > > > > The new approach allows to better integrate s390 to the existing > > interface, and also make use of altmap support, which would eliminate > > the major disadvantage of the old behaviour. So from s390 perspective, > > this new mechanism would be preferred, provided that there is no > > functional issue with the "added memory blocks without a memmap" > > approach. > > It achieves that by s390x specific hacks in common code :) Instead of > everybody else that simply uses add_memory() and friends. > > > > > Do you see any functional issues, e.g. conflict with common > > code? > > I don't see functional issues right now, just the way it is done to > implement support for a new feature is a hack IMHO. Replacing hack #1 by > hack #2 is not really something reasonable. Let's try to remove hacks. Ok, sounds reasonable, let's try that. Introducing some new s390-specific interface also feels a bit hacky, or ugly, but we'll see if we find a way so that it is only "special" :-) Reminds me a bit of that "probe" attribute, that also was an arch-specific hack initially, IIRC, and is now to be deprecated...
On Thu, Nov 16, 2023 at 07:40:34PM +0100, David Hildenbrand wrote: > Good point; can you similarly add comments to these two functions that they > need that lock in write mode? Ok, will add it. Thanks
On Fri, Nov 17, 2023 at 12:08:31AM +0100, David Hildenbrand wrote: > On 14.11.23 19:02, Sumanth Korikkar wrote: > > Hi All, > > > > The patch series implements "memmap on memory" feature on s390 and > > provides the necessary fixes for it. > > Thinking about this, one thing that makes s390x different from all the other > architectures in this series is the altmap handling. > > I'm curious, why is that even required? > > A memmep that is not marked as online in the section should not be touched > by anybody (except memory onlining code :) ). And if we do, it's usually a > BUG because that memmap might contain garbage/be poisoned or completely > stale, so we might want to track that down and fix it in any case. > > So what speaks against just leaving add_memory() populate the memmap from > the altmap? Then, also the page tables for the memmap are already in place > when onlining memory. > We do have page_init_poison() in sparse_add_section(), which would then need to be handled later, not in add_pages(). > > Then, adding two new notifier calls on start of memory_block_online() called > something like MEM_PREPARE_ONLINE and end the end of memory_block_offline() > called something like MEM_FINISH_OFFLINE is still suboptimal, but that's > where standby memory could be activated/deactivated, without messing with > the altmap. > > That way, the only s390x specific thing is that the memmap that should not > be touched by anybody is actually inaccessible, and you'd > activate/deactivate simply from the new notifier calls just the way we used > to do. ok. Thanks
On 17.11.23 14:00, Gerald Schaefer wrote: > On Fri, 17 Nov 2023 00:08:31 +0100 > David Hildenbrand <david@redhat.com> wrote: > >> On 14.11.23 19:02, Sumanth Korikkar wrote: >>> Hi All, >>> >>> The patch series implements "memmap on memory" feature on s390 and >>> provides the necessary fixes for it. >> >> Thinking about this, one thing that makes s390x different from all the >> other architectures in this series is the altmap handling. >> >> I'm curious, why is that even required? >> >> A memmep that is not marked as online in the section should not be >> touched by anybody (except memory onlining code :) ). And if we do, it's >> usually a BUG because that memmap might contain garbage/be poisoned or >> completely stale, so we might want to track that down and fix it in any >> case. >> >> So what speaks against just leaving add_memory() populate the memmap >> from the altmap? Then, also the page tables for the memmap are already >> in place when onlining memory. > > Good question, I am not 100% sure if we ran into bugs, or simply assumed > that it is not OK to call __add_pages() when the memory for the altmap > is not accessible. I mean, we create the direct map even though nobody should access that memory, so maybe we can simply map the altmap even though nobody should should access that memory. As I said, then, even the page tables for the altmap are allocated already and memory onlining likely doesn't need any allocation anymore (except, there is kasan or some other memory notifiers have special demands). Certainly simpler :) > > Maybe there is also already a common code bug with that, s390 might be > special but that is often also good for finding bugs in common code ... If it's only the page_init_poison() as noted by Sumanth, we could disable that on s390x with an altmap some way or the other; should be possible. 
I mean, you effectively have your own poisoning if the altmap is effectively inaccessible and makes your CPU angry on access :) Last but not least, support for an inaccessible altmap might come in handy for virtio-mem eventually, and make altmap support eventually simpler. So added bonus points. > >> Then, adding two new notifier calls on start of memory_block_online() >> called something like MEM_PREPARE_ONLINE and end the end of >> memory_block_offline() called something like MEM_FINISH_OFFLINE is still >> suboptimal, but that's where standby memory could be >> activated/deactivated, without messing with the altmap. >> >> That way, the only s390x specific thing is that the memmap that should >> not be touched by anybody is actually inaccessible, and you'd >> activate/deactivate simply from the new notifier calls just the way we >> used to do. >> >> It's still all worse than just adding/removing memory properly, using a >> proper interface -- where you could alloc/free an actual memmap when the >> altmap is not desired. But I know that people don't want to spend time >> just doing it cleanly from scratch. > > Yes, sometimes they need to be forced to do that :-) I certainly won't force you if we can just keep the __add_pages() calls as is; having an altmap that is inaccessible but fully prepared sounds reasonable to me. I can see how this gives an immediate benefit to existing s390x installations without being too hacky and without taking a long time to settle. But I'll strongly suggest to evaluate a new interface long-term. > > So, we'll look into defining a "proper interface", and treat patches 1-3 > separately as bug fixes? Especially patch 3 might be interesting for arm, > if they do not have ZONE_DEVICE, but still use the functions, they might > end up with the no-op version, not really freeing any memory. 
It might make sense to 1) Send the first 3 out separately 2) Look into a simple variant that leaves __add_pages() calls alone and only adds the new MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers -- well, and deals with an inaccessible altmap, like the page_init_poison() when the altmap might be inaccessible. 3) Look into a proper interface to add/remove memory instead of relying on online/offline. 2) is certainly an improvement and might be desired in some cases. 3) is more powerful (e.g., where you don't want an altmap because of fragmentation) and future proof. I suspect there will be installations where an altmap is undesired: it fragments your address space with unmovable (memmap) allocations. Currently, runtime allocations of gigantic pages are affected. Long-term other large allocations (if we ever see very large THP) will be affected. For that reason, we want to either support variable-sized memory blocks long-term, or simulate that by "grouping" memory blocks that share a same altmap located on the first memory blocks in that group: but onlining one block forces onlining of the whole group. On s390x that adds all memory ahead of time, it's hard to make a decision what the right granularity will be, and seeing sudden online/offline changed behavior might be quite "surprising" for users. The user can give better hints when adding/removing memory explicitly.
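To put a number on the fragmentation cost raised above, the memmap set aside per memory block can be estimated roughly as follows. The page size and struct page size here are assumptions (4 KiB pages, a 64-byte struct page, both config-dependent), and the real memory_block_memmap_on_memory_pages() additionally aligns the result to pageblocks:

```c
#include <assert.h>

/* rough size, in pages, of the memmap carved out of one hotplugged block;
 * assumes 4 KiB pages and a 64-byte struct page (both config-dependent) */
static unsigned long memmap_pages(unsigned long block_bytes)
{
	const unsigned long page_size = 4096, struct_page_size = 64;
	unsigned long nr_pages = block_bytes / page_size;
	unsigned long memmap_bytes = nr_pages * struct_page_size;

	return (memmap_bytes + page_size - 1) / page_size;	/* round up */
}
```

Under these assumptions a 256 MiB block gives up 4 MiB (1024 pages) of itself to its own memmap, which is exactly the kind of unmovable allocation, scattered across the address space, that gets in the way of runtime gigantic page allocation.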
On 17.11.23 14:56, Sumanth Korikkar wrote: > On Fri, Nov 17, 2023 at 12:08:31AM +0100, David Hildenbrand wrote: >> On 14.11.23 19:02, Sumanth Korikkar wrote: >>> Hi All, >>> >>> The patch series implements "memmap on memory" feature on s390 and >>> provides the necessary fixes for it. >> >> Thinking about this, one thing that makes s390x different from all the other >> architectures in this series is the altmap handling. >> >> I'm curious, why is that even required? >> >> A memmep that is not marked as online in the section should not be touched >> by anybody (except memory onlining code :) ). And if we do, it's usually a >> BUG because that memmap might contain garbage/be poisoned or completely >> stale, so we might want to track that down and fix it in any case. >> >> So what speaks against just leaving add_memory() populate the memmap from >> the altmap? Then, also the page tables for the memmap are already in place >> when onlining memory. >> > > we do have page_init_poison() in sparse_add_section() which should be > handled later then. not in add_pages() Was that all, or did you stumble over other things?
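For context on what is being deferred: page_init_poison() simply fills the not-yet-initialized memmap with an all-ones pattern (PAGE_POISON_PATTERN), so that any premature read of struct page fields trips over obviously bogus values instead of looking valid. A user-space analogue, with struct page reduced to a toy layout:

```c
#include <assert.h>
#include <string.h>

/* toy stand-in for the kernel's struct page */
struct page {
	unsigned long flags;
	void *lru_next, *lru_prev;
};

/* analogue of page_init_poison(): memset the memmap with an all-ones
 * pattern so stale reads of flags etc. are caught instead of looking valid */
static void page_init_poison_sketch(struct page *memmap, size_t size)
{
	memset(memmap, 0xff, size);
}
```

With an inaccessible altmap, the hardware does this job for free on s390: touching the memmap before it is made accessible faults outright, which is why the software poisoning in sparse_add_section() has to be skipped or deferred there.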
On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote: > > It might make sense to > > 1) Send the first 3 out separately OK, sure, I will first send the 3 patches as bug fixes, with your feedback applied. > 2) Look into a simple variant that leaves __add_pages() calls alone and > only adds the new MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers -- > well, and deals with an inaccessible altmap, like the > page_init_poison() when the altmap might be inaccessible. Thanks for the valuable feedback. I quickly tried disabling page_init_poison() and removing the hack in arch_add_memory() and arch_remove_memory(). I also used the new MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers. The current testing results look promising; it seems to work, and no issues have been found so far. I will also double check if there are any other memmap accesses in the add_pages() phase. We will try to go with this approach for now, i.e. with the notifiers you suggested and the __add_pages() change. Do you have any suggestions on how we could check for an inaccessible altmap? > 3) Look into a proper interface to add/remove memory instead of relying > on online/offline. Agreed, for the long term. > > 2) is certainly an improvement and might be desired in some cases. 3) is > more powerful (e.g., where you don't want an altmap because of > fragmentation) and future proof. > > I suspect there will be installations where an altmap is undesired: it > fragments your address space with unmovable (memmap) allocations. Currently, > runtime allocations of gigantic pages are affected. Long-term other large > allocations (if we ever see very large THP) will be affected. > > For that reason, we want to either support variable-sized memory blocks > long-term, or simulate that by "grouping" memory blocks that share a same > altmap located on the first memory blocks in that group: but onlining one > block forces onlining of the whole group. 
> > On s390x that adds all memory ahead of time, it's hard to make a decision > what the right granularity will be, and seeing sudden online/offline changed > behavior might be quite "surprising" for users. The user can give better > hints when adding/removing memory explicitly. Thanks for providing insights and details.
[catching up on mails] >>> This new approach has the advantage that we do not need to allocate any >>> additional memory during online phase, neither for direct mapping page >>> tables nor struct pages, so that memory hotplug can never fail. >> >> Right, just like any other architecture that (triggered by whatever >> mechanism) ends up calling add_memory() and friends. > > Just for better understanding, are page tables for identity and also > vmemmap mapping not allocated from system memory on other archs? I.e. > no altmap support for that, only for struct pages (so far)? Yes, only the actual "memmap ("struct page")" comes from altmap space, everything else comes from the buddy during memory hotplug. > >> >>> >>> The old approach (without altmap) is already a hack, because we add >>> the memmap / struct pages, but for memory that is not really accessible. >> >> Yes, it's disgusting. And you still allocate other things like memory >> block devices or the identity map. > > I would say it is special :-). And again, for understanding, all other :) > things apart from struct pages, still would need to be allocated from > system memory on other archs? Yes! > > Of course, struct pages would be by far the biggest part, so having > altmap support for that helps a lot. Doing the other allocations also > via altmap would feel natural, but it is not possible yet, or did we > miss something? The tricky part is making sure ahead of time that we set aside the required number of pageblocks, to properly control during memory onlining what to set aside and what to expose to the buddy. See mhp_supports_memmap_on_memory() / memory_block_memmap_on_memory_pages() for the dirty details :) > >> >>> And with all the disadvantage of pre-allocating struct pages from system >>> memory. >> >> Jep. It never should have been done like that. 
> > At that time, it gave the benefit over all others, that we do not need > to allocate struct pages from system memory, at the time of memory online, > when memory pressure might be high and such allocations might fail. Agreed. Having the memmap already around can be helpful. But not for all standby memory, that's just pure waste. ... but as memory onlining is triggered by user space, it's likely that that user space cannot even make progress (e.g., start a process to set memory online) to actually trigger memory onlining in serious low-memory situations. > > I guess you can say that it should have been done "right" at that time, > e.g. by already adding something like altmap support, instead of our own > hack. Probably yes. IMHO, relying on the existing memory block interface was the low hanging fruit. Now, s390x is just special. > >> >>> >>> The new approach allows to better integrate s390 to the existing >>> interface, and also make use of altmap support, which would eliminate >>> the major disadvantage of the old behaviour. So from s390 perspective, >>> this new mechanism would be preferred, provided that there is no >>> functional issue with the "added memory blocks without a memmap" >>> approach. >> >> It achieves that by s390x specific hacks in common code :) Instead of >> everybody else that simply uses add_memory() and friends. >> >>> >>> Do you see any functional issues, e.g. conflict with common >>> code? >> >> I don't see functional issues right now, just the way it is done to >> implement support for a new feature is a hack IMHO. Replacing hack #1 by >> hack #2 is not really something reasonable. Let's try to remove hacks. > > Ok, sounds reasonable, let's try that. Introducing some new s390-specific > interface also feels a bit hacky, or ugly, but we'll see if we find a > way so that it is only "special" :-) As proposed in my other mail, I think there are ways to make s390x happy first and look into a cleaner approach long-term. 
> Reminds me a bit of that "probe" attribute, that also was an arch-specific > hack initially, IIRC, and is now to be deprecated... Yeah, that was for interfaces where the kernel has absolutely no clue where/what/how memory gets hotplugged. ARM64 without ACPI. s390x is completely different though: you know exactly which standby memory exists, where it resides, in which granularity in can be enabled/disabled, ... you don't have to play dangerous "I'm pretty sure there is memory out there although nobody can check so I crash the kernel" games.
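The altmap accounting mentioned in this exchange (only the memmap comes from altmap space, everything else from the buddy) boils down to reserving the first pages of the hotplugged range and handing them out for the memmap. A stripped-down model of that bookkeeping; the struct and function names here are simplified stand-ins, and the real allocator additionally tracks alignment:

```c
#include <assert.h>

/* stripped-down model of struct vmem_altmap: 'free' pages at the start
 * of the new range are set aside for the memmap */
struct vmem_altmap_sketch {
	unsigned long base_pfn;	/* first pfn of the hotplugged range */
	unsigned long free;	/* pages reserved for the memmap */
	unsigned long alloc;	/* pages handed out so far */
};

/* hand out nr memmap pages from the reserved area; returns the first pfn
 * of the allocation, or -1 when the reservation is exhausted */
static long altmap_alloc_sketch(struct vmem_altmap_sketch *a, unsigned long nr)
{
	if (a->alloc + nr > a->free)
		return -1;	/* memmap must fit in the set-aside pages */
	a->alloc += nr;
	return (long)(a->base_pfn + a->alloc - nr);
}
```

Because every allocation lands inside the new range itself, nothing is taken from the buddy allocator, which is why onlining cannot fail for lack of memmap memory, and also why the memmap is inaccessible on s390 until the standby range is actually backed.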
On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote: > > > > Maybe there is also already a common code bug with that, s390 might be > > special but that is often also good for finding bugs in common code ... > > If it's only the page_init_poison() as noted by Sumanth, we could disable > that on s390x with an altmap some way or the other; should be possible. > > I mean, you effectively have your own poisoning if the altmap is effectively > inaccessible and makes your CPU angry on access :) > > Last but not least, support for an inaccessible altmap might come in handy > for virtio-mem eventually, and make altmap support eventually simpler. So > added bonus points. We tried out two possibilities dealing with vmemmap altmap inaccessibility. Approach 1: Add MHP_ALTMAP_INACCESSIBLE flag and pass it in add_memory() diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c index 075094ca59b4..ab2dfcc7e9e4 100644 --- a/drivers/s390/char/sclp_cmd.c +++ b/drivers/s390/char/sclp_cmd.c @@ -358,6 +358,13 @@ static int sclp_mem_notifier(struct notifier_block *nb, * buddy allocator later. */ __arch_set_page_nodat((void *)__va(start), memory_block->altmap->free); + /* + * Poison the struct pages after memory block is accessible. + * This is needed only for altmap. Without altmap, the struct + * pages are poisoned in sparse_add_section(). + */ + if (memory_block->altmap->inaccessible) + page_init_poison(pfn_to_page(arg->start_pfn), memory_block->altmap->free); break; case MEM_FINISH_OFFLINE: sclp_mem_change_state(start, size, 0); @@ -412,7 +419,7 @@ static void __init add_memory_merged(u16 rn) goto skip_add; for (addr = start; addr < start + size; addr += block_size) add_memory(0, addr, block_size, - MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY : MHP_NONE); + MACHINE_HAS_EDAT1 ? 
MHP_MEMMAP_ON_MEMORY|MHP_ALTMAP_INACCESSIBLE : MHP_NONE); skip_add: first_rn = rn; num = 1; diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 7d2076583494..5c70707e706f 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -106,6 +106,11 @@ typedef int __bitwise mhp_t; * implies the node id (nid). */ #define MHP_NID_IS_MGID ((__force mhp_t)BIT(2)) +/* + * Mark memmap on memory (struct pages array) as inaccessible during memory + * hotplug addition phase. + */ +#define MHP_ALTMAP_INACCESSIBLE ((__force mhp_t)BIT(3)) /* * Extended parameters for memory hotplug: diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 744c830f4b13..9837f3e6fb95 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -25,6 +25,7 @@ struct vmem_altmap { unsigned long free; unsigned long align; unsigned long alloc; + bool inaccessible; }; /* diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 7a5fc89a8652..d8299853cdcc 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1439,6 +1439,8 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { if (mhp_supports_memmap_on_memory(size)) { mhp_altmap.free = memory_block_memmap_on_memory_pages(); + if (mhp_flags & MHP_ALTMAP_INACCESSIBLE) + mhp_altmap.inaccessible = true; params.altmap = kmalloc(sizeof(struct vmem_altmap), GFP_KERNEL); if (!params.altmap) { ret = -ENOMEM; diff --git a/mm/sparse.c b/mm/sparse.c index 77d91e565045..3991c717b769 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -907,7 +907,8 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn, * Poison uninitialized struct pages in order to catch invalid flags * combinations. 
*/ - page_init_poison(memmap, sizeof(struct page) * nr_pages); + if (!altmap || !altmap->inaccessible) + page_init_poison(memmap, sizeof(struct page) * nr_pages); ms = __nr_to_section(section_nr); set_section_nid(section_nr, nid); Approach 2: =========== Shouldn't kasan zero shadow mapping be performed first, before accessing/initializing the memmap via page_init_poison()? If that is true, then it is a problem for all architectures and could be fixed like: diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 7a5fc89a8652..eb3975740537 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, if (ret) return ret; + page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages); move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE); for (i = 0; i < nr_pages; i++) diff --git a/mm/sparse.c b/mm/sparse.c index 77d91e565045..4ddf53f52075 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn, /* * Poison uninitialized struct pages in order to catch invalid flags * combinations. + * For altmap, do this later when onlining the memory, as it might + * not be accessible at this point. */ - page_init_poison(memmap, sizeof(struct page) * nr_pages); + if (!altmap) + page_init_poison(memmap, sizeof(struct page) * nr_pages); ms = __nr_to_section(section_nr); set_section_nid(section_nr, nid); Also, if this approach is taken, should page_init_poison() be performed with cond_resched() as mentioned in commit d33695b16a9f ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ? Opinions? Thank you
On Tue, Nov 21, 2023 at 02:13:22PM +0100, Sumanth Korikkar wrote: > Approach 2: > =========== > Shouldnt kasan zero shadow mapping performed first before > accessing/initializing memmap via page_init_poisining()? If that is > true, then it is a problem for all architectures and should could be > fixed like: > > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 7a5fc89a8652..eb3975740537 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, > if (ret) > return ret; > > + page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages); > move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE); > > for (i = 0; i < nr_pages; i++) > diff --git a/mm/sparse.c b/mm/sparse.c > index 77d91e565045..4ddf53f52075 100644 > --- a/mm/sparse.c > +++ b/mm/sparse.c > @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn, > /* > * Poison uninitialized struct pages in order to catch invalid flags > * combinations. > + * For altmap, do this later when onlining the memory, as it might > + * not be accessible at this point. > */ > - page_init_poison(memmap, sizeof(struct page) * nr_pages); > + if (!altmap) > + page_init_poison(memmap, sizeof(struct page) * nr_pages); > > ms = __nr_to_section(section_nr); > set_section_nid(section_nr, nid); > > > > Also, if this approach is taken, should page_init_poison() be performed > with cond_resched() as mentioned in commit d33695b16a9f > ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ? Sorry, wrong commit id. should page_init_poison() be performed with cond_resched() as mentioned in Commit b7e3debdd040 ("mm/memory_hotplug.c: fix false softlockup during pfn range removal") ? Thanks > > Opinions? > > Thank you
On 21.11.23 14:21, Sumanth Korikkar wrote: > On Tue, Nov 21, 2023 at 02:13:22PM +0100, Sumanth Korikkar wrote: >> Approach 2: >> =========== >> Shouldnt kasan zero shadow mapping performed first before >> accessing/initializing memmap via page_init_poisining()? If that is >> true, then it is a problem for all architectures and should could be >> fixed like: >> >> >> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c >> index 7a5fc89a8652..eb3975740537 100644 >> --- a/mm/memory_hotplug.c >> +++ b/mm/memory_hotplug.c >> @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, >> if (ret) >> return ret; >> >> + page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages); >> move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE); >> >> for (i = 0; i < nr_pages; i++) >> diff --git a/mm/sparse.c b/mm/sparse.c >> index 77d91e565045..4ddf53f52075 100644 >> --- a/mm/sparse.c >> +++ b/mm/sparse.c >> @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn, >> /* >> * Poison uninitialized struct pages in order to catch invalid flags >> * combinations. >> + * For altmap, do this later when onlining the memory, as it might >> + * not be accessible at this point. >> */ >> - page_init_poison(memmap, sizeof(struct page) * nr_pages); >> + if (!altmap) >> + page_init_poison(memmap, sizeof(struct page) * nr_pages); >> >> ms = __nr_to_section(section_nr); >> set_section_nid(section_nr, nid); >> >> >> >> Also, if this approach is taken, should page_init_poison() be performed >> with cond_resched() as mentioned in commit d33695b16a9f >> ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()") ? > > Sorry, wrong commit id. > > should page_init_poison() be performed with cond_resched() as mentioned in > Commit b7e3debdd040 ("mm/memory_hotplug.c: fix false softlockup > during pfn range removal") ? 
I think people are currently looking into removing all that cond_resched(): https://lore.kernel.org/all/20231107230822.371443-29-ankur.a.arora@oracle.com/T/#mda52da685a142bec9607625386b0b660e5470abe
On 21.11.23 14:13, Sumanth Korikkar wrote:
> On Fri, Nov 17, 2023 at 04:37:29PM +0100, David Hildenbrand wrote:
>>>
>>> Maybe there is also already a common code bug with that, s390 might be
>>> special but that is often also good for finding bugs in common code ...
>>
>> If it's only the page_init_poison() as noted by Sumanth, we could disable
>> that on s390x with an altmap some way or the other; should be possible.
>>
>> I mean, you effectively have your own poisoning if the altmap is effectively
>> inaccessible and makes your CPU angry on access :)
>>
>> Last but not least, support for an inaccessible altmap might come in handy
>> for virtio-mem eventually, and make altmap support eventually simpler. So
>> added bonus points.
>
> We tried out two possibilities for dealing with vmemmap altmap
> inaccessibility.
>
> Approach 1: Add MHP_ALTMAP_INACCESSIBLE flag and pass it in add_memory()
>
> diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
> index 075094ca59b4..ab2dfcc7e9e4 100644
> --- a/drivers/s390/char/sclp_cmd.c
> +++ b/drivers/s390/char/sclp_cmd.c
> @@ -358,6 +358,13 @@ static int sclp_mem_notifier(struct notifier_block *nb,
> 	 * buddy allocator later.
> 	 */
> 	__arch_set_page_nodat((void *)__va(start), memory_block->altmap->free);
> +	/*
> +	 * Poison the struct pages after the memory block is accessible.
> +	 * This is needed only for altmap. Without altmap, the struct
> +	 * pages are poisoned in sparse_add_section().
> +	 */
> +	if (memory_block->altmap->inaccessible)
> +		page_init_poison(pfn_to_page(arg->start_pfn), memory_block->altmap->free);

See below; likely it's best if the core will do that.

> 	break;
> case MEM_FINISH_OFFLINE:
> 	sclp_mem_change_state(start, size, 0);
> @@ -412,7 +419,7 @@ static void __init add_memory_merged(u16 rn)
> 		goto skip_add;
> 	for (addr = start; addr < start + size; addr += block_size)
> 		add_memory(0, addr, block_size,
> -			   MACHINE_HAS_EDAT1 ? MHP_MEMMAP_ON_MEMORY : MHP_NONE);
> +			   MACHINE_HAS_EDAT1 ?
> 			   MHP_MEMMAP_ON_MEMORY | MHP_ALTMAP_INACCESSIBLE : MHP_NONE);
> skip_add:
> 	first_rn = rn;
> 	num = 1;
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 7d2076583494..5c70707e706f 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -106,6 +106,11 @@ typedef int __bitwise mhp_t;
>  * implies the node id (nid).
>  */
> #define MHP_NID_IS_MGID		((__force mhp_t)BIT(2))
> +/*
> + * Mark memmap on memory (struct pages array) as inaccessible during memory
> + * hotplug addition phase.
> + */
> +#define MHP_ALTMAP_INACCESSIBLE	((__force mhp_t)BIT(3))

If we go that path, maybe rather MHP_OFFLINE_INACCESSIBLE, and document how
this relates to/interacts with the memory notifier callbacks and the altmap.

Then we can logically abstract this from altmap handling. Simply put, the
memory should not be read/written before the memory notifier was called.

In the core, you can do the poisoning for the altmap in that case after
calling the notifier, probably in mhp_init_memmap_on_memory() as you do
below.

>
> /*
>  * Extended parameters for memory hotplug:
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 744c830f4b13..9837f3e6fb95 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -25,6 +25,7 @@ struct vmem_altmap {
> 	unsigned long free;
> 	unsigned long align;
> 	unsigned long alloc;
> +	bool inaccessible;

We should then likely remember that information for the memory block, not
the altmap.

[...]

>
>
> Approach 2:
> ===========
> Shouldn't kasan zero shadow mapping be performed first, before
> accessing/initializing the memmap via page_init_poison()? If that is

Likely the existing way. The first access to the poisoned memmap should be a
write, not a read. But I didn't look into the details. Can you try
reproducing?
> true, then it is a problem for all architectures and could be
> fixed like:
>
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7a5fc89a8652..eb3975740537 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1093,6 +1093,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
> 	if (ret)
> 		return ret;
>
> +	page_init_poison(pfn_to_page(pfn), sizeof(struct page) * nr_pages);
> 	move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE);
>
> 	for (i = 0; i < nr_pages; i++)
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 77d91e565045..4ddf53f52075 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -906,8 +906,11 @@ int __meminit sparse_add_section(int nid, unsigned long start_pfn,
> 	/*
> 	 * Poison uninitialized struct pages in order to catch invalid flags
> 	 * combinations.
> +	 * For altmap, do this later when onlining the memory, as it might
> +	 * not be accessible at this point.
> 	 */
> -	page_init_poison(memmap, sizeof(struct page) * nr_pages);
> +	if (!altmap)
> +		page_init_poison(memmap, sizeof(struct page) * nr_pages);
>
> 	ms = __nr_to_section(section_nr);
> 	set_section_nid(section_nr, nid);
>

That's too generic when it comes to other altmap users, especially DAX,
or when the altmap is accessible while memory is offlining, and we want
to catch that.

>
>
> Also, if this approach is taken, should page_init_poison() be performed
> with cond_resched() as mentioned in commit d33695b16a9f
> ("mm/memory_hotplug: poison memmap in remove_pfn_range_from_zone()")?

Likely as soon as possible after it is accessible :)
On Tue, Nov 21, 2023 at 08:24:48PM +0100, David Hildenbrand wrote:
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 7d2076583494..5c70707e706f 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -106,6 +106,11 @@ typedef int __bitwise mhp_t;
> >  * implies the node id (nid).
> >  */
> > #define MHP_NID_IS_MGID		((__force mhp_t)BIT(2))
> > +/*
> > + * Mark memmap on memory (struct pages array) as inaccessible during memory
> > + * hotplug addition phase.
> > + */
> > +#define MHP_ALTMAP_INACCESSIBLE	((__force mhp_t)BIT(3))
>
> If we go that path, maybe rather MHP_OFFLINE_INACCESSIBLE, and document how
> this relates to/interacts with the memory notifier callbacks and the altmap.
>
> Then we can logically abstract this from altmap handling. Simply put, the
> memory should not be read/written before the memory notifier was called.
>
> In the core, you can do the poisoning for the altmap in that case after
> calling the notifier, probably in mhp_init_memmap_on_memory() as you do
> below.

Ok, sure. Thanks.

> > /*
> >  * Extended parameters for memory hotplug:
> > diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> > index 744c830f4b13..9837f3e6fb95 100644
> > --- a/include/linux/memremap.h
> > +++ b/include/linux/memremap.h
> > @@ -25,6 +25,7 @@ struct vmem_altmap {
> > 	unsigned long free;
> > 	unsigned long align;
> > 	unsigned long alloc;
> > +	bool inaccessible;
>
> We should then likely remember that information for the memory block, not
> the altmap.

I tried using an inaccessible field in memory_block and observed that the
memory block is only created after arch_add_memory() succeeds. Hence, when
checking for inaccessible memory in sparse_add_section() (for page
poisoning), we still have to rely on the mhp_flags conveyed through
add_memory(). The memory block's inaccessible state is then set in
create_memory_block_devices().
I hope this approach is fine. On the other hand, relying on the
inaccessible flag in vmem_altmap could make the checks in
arch_add_memory() and other functions easier, I suppose.

>
> [...]
>
> >
> >
> > Approach 2:
> > ===========
> > Shouldn't kasan zero shadow mapping be performed first, before
> > accessing/initializing the memmap via page_init_poison()? If that is
>
> Likely the existing way. The first access to the poisoned memmap should be a
> write, not a read. But I didn't look into the details.
>
> Can you try reproducing?

Executing page_init_poison() right before kasan_add_zero_shadow(), in the
context of altmap usage, did not result in a crash when I attempted to
reproduce it. However, with altmap + page_init_poison() within
sparse_add_section(), a crash occurs on our arch, as previously discussed
in this thread.

Thank you