
[RFC,0/4] Support dynamic (de)configuration of memory

Message ID 20241202082732.3959803-1-sumanthk@linux.ibm.com (mailing list archive)

Message

Sumanth Korikkar Dec. 2, 2024, 8:27 a.m. UTC
This patchset provides a new interface for dynamic configuration and
deconfiguration of hotplug memory, allowing for mixed altmap and
non-altmap support. It is a follow-up on the discussion with David,
when introducing memmap_on_memory support for s390:
https://lore.kernel.org/all/ee492da8-74b4-4a97-8b24-73e07257f01d@redhat.com/

The following suggestions from that discussion are addressed:

* "Look into a proper interface to add/remove memory instead of relying
  on online/offline ... (e.g., where user don't want an altmap because
  of fragmentation)"

With the new interface, users can dynamically specify which memory
ranges should have altmap support, rather than having it statically
enabled or disabled for all hot-plugged memory.

It would also be possible to revert the
MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers, that were previously
added for s390, and move that part to the new interface handler. This is
not yet done in this series, and might also need further evaluation,
e.g. we would lose the memmap_on_memory feature for s390, when the new
interface config option is not selected.

* "Support variable-sized memory blocks long-term, or simulate that by
  "grouping" memory blocks that share a same altmap located on the first
  memory blocks in that group.  On s390x that adds all memory ahead of
  time, it's hard to make a decision what the right granularity will be.
  The user can give better hints when adding/removing memory
  explicitly."

With the new interface, the user could specify a memory range, including
multiple blocks, and whether they want altmap support for that range.
This could allow for the mentioned altmap block grouping, or even
variable-sized blocks, in the future.

When the new interface is enabled, s390 will not add all possible
hotplug memory in advance, like before, to make it visible in sysfs for
online/offline actions. Instead, a new "max_configurable" sysfs
attribute will give a hint on the presence of hotplug memory. Before it
can be set online, it has to be configured via a new interface in
/sys/bus/memory/devices/configure_memory, basically imitating what is
done in ACPI handlers for other archs.

Usage format for the new interface:
echo config_mode,memoryrange,altmap_mode > /sys/bus/memory/devices/configure_memory

E.g. to configure a range with altmap:
echo 1,0x200000000-0x20fffffff,1 > /sys/bus/memory/devices/configure_memory
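
A minimal sketch of how a tool might compose the request string shown
above. Only the "1,...,1" form appears in this cover letter, so the
following are assumptions: config_mode 1 means configure and 0 means
deconfigure, altmap_mode 1/0 selects altmap or no altmap, and the range
must cover whole memory blocks (128 MiB is assumed as the block size).
The function name is hypothetical.

```python
BLOCK_SIZE = 128 << 20  # assumed memory block size (128 MiB)


def format_configure_request(start, end, configure=True, altmap=False,
                             block_size=BLOCK_SIZE):
    """Build 'config_mode,memoryrange,altmap_mode' for an inclusive range."""
    if start > end:
        raise ValueError("start must not exceed end")
    # Assumed rule: the range must cover whole memory blocks.
    if start % block_size or (end + 1) % block_size:
        raise ValueError("range is not aligned to the memory block size")
    return f"{int(configure)},{start:#x}-{end:#x},{int(altmap)}"


request = format_configure_request(0x200000000, 0x20fffffff,
                                   configure=True, altmap=True)
print(request)  # 1,0x200000000-0x20fffffff,1
```

The resulting string is what would be written to the configure_memory
attribute; the kernel side performs its own validation in any case.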

lsmem/chmem tools can be adjusted to do all that transparently, so that
there would be no visible impact to the user, at least not when using
those tools. In addition, support for dynamic altmap configuration can
be added to those tools.

This could not only help to make s390 more flexible and similar to
others (wrt adding hotplug memory in advance). It might also be possible
to provide the dynamically configured altmap support for others. E.g.
instead of directly doing an add_memory() in the ACPI handler, with the
static altmap setting, you could instead defer that to the new interface
which allows dynamic altmap configuration.

Patch 1 provides necessary validation against user inputs for new
/sys/bus/memory/devices/configure_memory sysfs interface.

Patch 2 displays altmap support per memory block.
CONFIG_RUNTIME_MEMORY_CONFIGURATION enables dynamic addition of memory
with altmap/non-altmap support. Hence, providing concrete altmap
information would be beneficial.

Patch 3 adds the /sys/devices/system/memory/max_configurable sysfs show
interface to list the maximum number of possible memory blocks supported
by the architecture. This information would be beneficial for tools like
lsmem to distinguish between configured memory blocks and deconfigured
memory blocks.

Patch 4 provides support for both legacy boot-time standby memory
configuration and runtime configuration of standby memory. The patch also
validates user inputs against the /sys/bus/memory/devices/configure_memory
interface and overrides the /sys/devices/system/memory/max_configurable
value.

Thank you.


Sumanth Korikkar (4):
  mm/memory_hotplug: Add interface for runtime (de)configuration of
    memory
  mm/memory_hotplug: Add memory block altmap sysfs attribute
  mm/memory_hotplug: Add max_configurable sysfs read attribute
  s390/sclp: Add support for dynamic (de)configuration of memory

 drivers/base/memory.c        | 153 +++++++++++++++++++++++++++++++++++
 drivers/s390/char/sclp_cmd.c |  80 +++++++++++++++---
 include/linux/memory.h       |   6 ++
 mm/Kconfig                   |  16 ++++
 4 files changed, 246 insertions(+), 9 deletions(-)

Comments

David Hildenbrand Dec. 2, 2024, 4:55 p.m. UTC | #1
On 02.12.24 09:27, Sumanth Korikkar wrote:
> Provide a new interface for dynamic configuration and deconfiguration of
> hotplug memory, allowing for mixed altmap and non-altmap support.  It is
> a follow-up on the discussion with David:
> 
> https://lore.kernel.org/all/ee492da8-74b4-4a97-8b24-73e07257f01d@redhat.com/
> 
> As mentioned in the discussion, advantages of the new interface are:
> 
> * Users can dynamically specify which memory ranges should have altmap
>    support, rather than having it statically enabled or disabled for all
>    hot-plugged memory.
> 
> * In the long term,  user could specify a memory range, including
>    multiple blocks, and whether user wants altmap support for that range.
>    This could allow for the altmap block grouping, or even variable-sized
>    blocks, in the future. i.e. "grouping" memory blocks that share a same
>    altmap located on the first memory blocks in the group and reduce
>    fragementation due to altmap.
> 
> To leverage these advantages:
> Create a sysfs interface /sys/bus/memory/devices/configure_memory, which
> performs runtime (de)configuration of memory with altmap or non-altmap
> support. The interface validates the memory ranges against architecture
> specific memory configuration and performs add_memory()/remove_memory().
> Dynamic (de)configuration of memory is made configurable via config
> CONFIG_RUNTIME_MEMORY_CONFIGURATION.

Hi!

Not completely what I had in mind, especially not that we need something 
that generic without any indication of ranges :)

In general, the flow is as follows:

1) Driver detects memory and adds it
2) Something auto-onlines that memory (e.g., udev rule)

For dax/kmem, 1) can be controlled using devdax, and usually it also 
tries to take care of 2).

s390x standby storage really is the weird thing here, because it does 1) 
and doesn't want 2). It shouldn't do 1) until a user wants to make use 
of standby memory.


My thinking was that s390x would expose the standby memory ranges 
somewhere arch specific in sysfs. From there, one could simply trigger 
the adding (maybe specifying, e.g., memmap_on_memory) of selected ranges.


To disable standby memory, one would first offline the memory to then 
trigger removal using the arch specific interface. It is very similar to 
dax/kmem's way of handling offline+removal.

Now I wonder if dax/kmem could be (ab)used on s390x for standby storage. 
Likely a simple sysfs interface could be easier to implement.
Sumanth Korikkar Dec. 3, 2024, 2:33 p.m. UTC | #2
On Mon, Dec 02, 2024 at 05:55:19PM +0100, David Hildenbrand wrote:
> Hi!
> 
> Not completely what I had in mind, especially not that we need something
> that generic without any indication of ranges :)
> 
> In general, the flow is as follows:
> 
> 1) Driver detects memory and adds it
> 2) Something auto-onlines that memory (e.g., udev rule)
> 
> For dax/kmem, 1) can be controlled using devdax, and usually it also tries
> to take care of 2).
> 
> s390x standby storage really is the weird thing here, because it does 1) and
> doesn't want 2). It shouldn't do 1) until a user wants to make use of
> standby memory.

Hi David,

The current RFC design doesn't do 1) until the user initiates it.

The current RFC design relies on the fact that there cannot be memory
holes when standby memory is available (which holds true for both LPARs
and z/VM guests).

Using the combined count of online and standby memory ranges
(max_configurable), prototype lsmem/chmem could determine the memory
ranges which are not yet configured, i.e.
configurable_memory = max_configurable - online ranges from sysfs
(/sys/devices/system/memory/memory*).
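
A rough sketch of that calculation, under stated assumptions:
max_configurable is taken to be a count of memory blocks numbered from 0
without holes, and the set of block numbers present in sysfs is passed
in directly rather than read from /sys/devices/system/memory/memory*.
The function name is hypothetical.

```python
def deconfigured_ranges(max_configurable, present_blocks, block_size):
    """Return inclusive (start, end) address ranges of blocks not in sysfs."""
    missing = [b for b in range(max_configurable) if b not in present_blocks]
    ranges = []
    for block in missing:
        start = block * block_size
        end = start + block_size - 1
        if ranges and ranges[-1][1] + 1 == start:
            ranges[-1] = (ranges[-1][0], end)  # extend the contiguous run
        else:
            ranges.append((start, end))
    return ranges


# 128 possible blocks of 128 MiB, blocks 0-95 already present in sysfs.
present = set(range(96))
for start, end in deconfigured_ranges(128, present, 128 << 20):
    print(f"{start:#018x}-{end:#018x}")
# prints 0x0000000300000000-0x00000003ffffffff
```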

Example prototype implementation of lsmem/chmem looks like:
./lsmem -o RANGE,SIZE,STATE,BLOCK,ALTMAP
RANGE                                 SIZE        STATE  BLOCK ALTMAP
0x0000000000000000-0x00000002ffffffff  12G       online   0-95      0
0x0000000300000000-0x00000003ffffffff   4G deconfigured 96-127      -

# Configure range with altmap
./chmem -c 0x0000000300000000-0x00000003ffffffff -a
./lsmem -o RANGE,SIZE,STATE,BLOCK,ALTMAP
RANGE                                 SIZE   STATE  BLOCK ALTMAP
0x0000000000000000-0x00000002ffffffff  12G  online   0-95      0
0x0000000300000000-0x00000003ffffffff   4G offline 96-127      1


# Online range
./chmem -e 0x0000000300000000-0x00000003ffffffff &&
./lsmem -o RANGE,SIZE,STATE,BLOCK,ALTMAP
RANGE                                 SIZE  STATE  BLOCK ALTMAP
0x0000000000000000-0x00000002ffffffff  12G online   0-95      0
0x0000000300000000-0x00000003ffffffff   4G online 96-127      1

Memory block size:       128M
Total online memory:      16G
Total offline memory:      0B
Total deconfigured:        0B

# offline range
./chmem -d 0x0000000300000000-0x00000003ffffffff &&
./lsmem -o RANGE,SIZE,STATE,BLOCK,ALTMAP
RANGE                                 SIZE   STATE  BLOCK ALTMAP
0x0000000000000000-0x00000002ffffffff  12G  online   0-95      0
0x0000000300000000-0x00000003ffffffff   4G offline 96-127      1

Memory block size:       128M
Total online memory:      12G
Total offline memory:      4G
Total deconfigured:        0B

# Deconfigure range
./chmem -g 0x0000000300000000-0x00000003ffffffff &&
./lsmem -o RANGE,SIZE,STATE,BLOCK,ALTMAP
RANGE                                 SIZE        STATE  BLOCK ALTMAP
0x0000000000000000-0x00000002ffffffff  12G       online   0-95      0
0x0000000300000000-0x00000003ffffffff   4G deconfigured 96-127      -

Memory block size:       128M
Total online memory:      12G
Total offline memory:      0B
Total deconfigured:        4G
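
The sequence in the prototype output can be read as a small state
machine. The sketch below is only an illustration of that reading: the
action names are made up, the mapping to the prototype chmem options is
taken from the examples above, and rejecting a direct
online-to-deconfigured step is an assumption based on the "offline
first, then remove" flow.

```python
# Assumed mapping of the prototype chmem options to state transitions.
TRANSITIONS = {
    ("deconfigured", "configure"): "offline",    # chmem -c
    ("offline", "online"): "online",             # chmem -e
    ("online", "offline"): "offline",            # chmem -d
    ("offline", "deconfigure"): "deconfigured",  # chmem -g
}


def apply(state, action):
    """Return the next state, or raise if the action is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} memory that is {state}") from None


state = "deconfigured"
for action in ("configure", "online", "offline", "deconfigure"):
    state = apply(state, action)
    print(action, "->", state)
```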

The user can still determine the available memory ranges and make them
configurable using tools like lsmem or chmem with this approach, at
least on s390.

> My thinking was that s390x would expose the standby memory ranges somewhere
> arch specific in sysfs. From there, one could simply trigger the adding
> (maybe specifying e.g, memmap_on_memory) of selected ranges.

As far as I understand, the sysfs interface limits the size of the
buffer used in show() to 4 KB. When there is a huge number of standby
memory ranges, wouldn't it be an issue to display everything in one
attribute?

Or should sysfs binary attributes be used to overcome the limitation?

Please correct me if I am wrong.
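
To put a rough number on the buffer concern, here is a back-of-the-envelope
sketch. It assumes one range per line in the fixed-width
"0x...-0x..." form used by the lsmem output, and a 4096-byte show()
buffer; both the line format and the function name are assumptions for
illustration.

```python
PAGE_SIZE = 4096  # show() buffer size, taken from the 4 KB figure above


def ranges_per_show_buffer(page_size=PAGE_SIZE):
    """Count how many fixed-width range lines fit in one show() buffer."""
    sample = f"{0x300000000:#018x}-{0x3ffffffff:#018x}\n"
    return page_size // len(sample), len(sample)


count, line_len = ranges_per_show_buffer()
print(f"{line_len} bytes per line, {count} ranges per buffer")
# prints: 38 bytes per line, 107 ranges per buffer
```

So a single text attribute in this format would hold roughly a hundred
discontiguous ranges before it runs out of space.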

Questions:
1. If we go ahead with this sysfs interface approach to list all standby
memory ranges, could the list be made available via
/sys/devices/system/memory/configurable_memlist?  This could be helpful,
as /sys/devices/system/memory/configure_memory performs architecture
independent checks and could also be useful for other architectures in
the future.

2. Should the new interface also be compatible with lsmem/chmem?

3. Or can we have an s390-specific path (e.g.
/sys/firmware/memory/standy_range) to list all standby memory ranges
which are in the deconfigured state, and also use the current design
(max_configurable) to make it easier for the lsmem/chmem tools to detect
these standby memory ranges?

> To disable standby memory, one would first offline the memory to then
> trigger removal using the arch specific interface. It is very similar to
> dax/kmem's way of handling offline+removal.

ok

> Now I wonder if dax/kmem could be (ab)used on s390x for standby storage.
> Likely a simple sysfs interface could be easier to implement.

I haven't checked dax/kmem in detail yet. I will look into it.

Thank you
David Hildenbrand Dec. 20, 2024, 3:53 p.m. UTC | #3
On 03.12.24 15:33, Sumanth Korikkar wrote:
> On Mon, Dec 02, 2024 at 05:55:19PM +0100, David Hildenbrand wrote:
>> Hi!
>>
>> Not completely what I had in mind, especially not that we need something
>> that generic without any indication of ranges :)
>>
>> In general, the flow is as follows:
>>
>> 1) Driver detects memory and adds it
>> 2) Something auto-onlines that memory (e.g., udev rule)
>>
>> For dax/kmem, 1) can be controlled using devdax, and usually it also tries
>> to take care of 2).
>>
>> s390x standby storage really is the weird thing here, because it does 1) and
>> doesn't want 2). It shouldn't do 1) until a user wants to make use of
>> standby memory.
> 
> Hi David,

Hi,

sorry for the late reply. Cleaning up (some of) my inbox before 
Christmas, and I realized I skipped this mail.

> 
> The current rfc design doesnt do 1) until user initiates it.
> 
> The current rfc design considers the fact that there cannot be memory
> holes, when there is a availability of standby memory. (which holds true
> for both lpars and zvms)
> 
> With number of online and standby memory ranges count
> (max_configurable), prototype lsmem/chmem could determine memory ranges
> which are not yet configured
> i.e. (configurable_memory = max_configurable - online ranges from sysfs
> /sys/devices/system/memory/memory*).
> 
> Example prototype implementation of lsmem/chmem looks like:
> ./lsmem -o RANGE,SIZE,STATE,BLOCK,ALTMAP
> RANGE                                 SIZE        STATE  BLOCK ALTMAP
> 0x0000000000000000-0x00000002ffffffff  12G       online   0-95      0
> 0x0000000300000000-0x00000003ffffffff   4G deconfigured 96-127      -
> 
> # Configure range with altmap
> ./chmem -c 0x0000000300000000-0x00000003ffffffff -a
> ./lsmem -o RANGE,SIZE,STATE,BLOCK,ALTMAP
> RANGE                                 SIZE   STATE  BLOCK ALTMAP
> 0x0000000000000000-0x00000002ffffffff  12G  online   0-95      0
> 0x0000000300000000-0x00000003ffffffff   4G offline 96-127      1
> 
> 
> # Online range
> ./chmem -e 0x0000000300000000-0x00000003ffffffff &&
> ./lsmem -o RANGE,SIZE,STATE,BLOCK,ALTMAP
> RANGE                                 SIZE  STATE  BLOCK ALTMAP
> 0x0000000000000000-0x00000002ffffffff  12G online   0-95      0
> 0x0000000300000000-0x00000003ffffffff   4G online 96-127      1
> 
> Memory block size:       128M
> Total online memory:      16G
> Total offline memory:      0B
> Total deconfigured:        0B
> 
> # offline range
> ./chmem -d 0x0000000300000000-0x00000003ffffffff &&
> ./lsmem -o RANGE,SIZE,STATE,BLOCK,ALTMAP
> RANGE                                 SIZE   STATE  BLOCK ALTMAP
> 0x0000000000000000-0x00000002ffffffff  12G  online   0-95      0
> 0x0000000300000000-0x00000003ffffffff   4G offline 96-127      1
> 
> Memory block size:       128M
> Total online memory:      12G
> Total offline memory:      4G
> Total deconfigured:        0B
> 
> # Defconfigure range.
> ./chmem -g 0x0000000300000000-0x00000003ffffffff &&
> ./lsmem -o RANGE,SIZE,STATE,BLOCK,ALTMAP
> RANGE                                 SIZE        STATE  BLOCK ALTMAP
> 0x0000000000000000-0x00000002ffffffff  12G       online   0-95      0
> 0x0000000300000000-0x00000003ffffffff   4G deconfigured 96-127      -
> 
> Memory block size:       128M
> Total online memory:      12G
> Total offline memory:      0B
> Total deconfigured:        4G

Maybe "standby memory" might make it clearer. The concept is s390x 
specific, and it will likely stay s390x specific.

I like the idea (frontend/tool interface), all we need is a way for 
these commands to detect ranges and turn them from standby into usable 
memory.

> 
> The user can still determine the available memory ranges and make them
> configurable using tools like lsmem or chmem with this approach atleast
> on s390 with this approach.
> 
>> My thinking was that s390x would expose the standby memory ranges somewhere
>> arch specific in sysfs. From there, one could simply trigger the adding
>> (maybe specifying e.g, memmap_on_memory) of selected ranges.
> 
> As far as I understand, sysfs interface limits the size of the buffer
> used in show() to 4kb. 

sysfs usually wants "one value per file".

> When there are huge number of standby memory
> ranges, wouldnt it be an issue to display everything in one attribute?

I was rather wondering about a sysfs directory structure that exposes 
this information.

For example, in the granularity of storage increments we can enable/disable.

In general, it could be a structure similar to 
/sys/devices/system/memory/ (one directory = one standby storage 
increment we can enable/disable?), but residing in the s390x-specific 
sysfs area. Or any other way to express ranges that can be 
enabled/disabled as one unit.
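
A sketch of how a tool could walk such a "one directory per increment"
layout. Everything about the layout is hypothetical: the directory names
("increment96"), the "state" file, and its values are made up for
illustration, and the code runs against a throwaway mock tree rather
than a real sysfs path.

```python
import os
import tempfile


def read_increments(root):
    """Map each increment directory to the contents of its 'state' file."""
    states = {}
    for name in sorted(os.listdir(root)):
        state_file = os.path.join(root, name, "state")
        if os.path.isfile(state_file):
            with open(state_file) as f:
                states[name] = f.read().strip()
    return states


# Build a mock tree; names and values are purely illustrative.
with tempfile.TemporaryDirectory() as root:
    for name, state in (("increment96", "standby"), ("increment97", "enabled")):
        os.mkdir(os.path.join(root, name))
        with open(os.path.join(root, name, "state"), "w") as f:
            f.write(state + "\n")
    increments = read_increments(root)
print(increments)
```

Each file holds a single value, which keeps the layout in line with the
"one value per file" convention and avoids the single-buffer limit.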

I'm not sure if extending /sys/devices/system/memory/ itself would be a 
good idea, though. It all is very s390x specific.

> 
> Or use sysfs binary attributes to overcome the limitation?
> 
> Please correct me, If I am wrong.
> 
> Questions:
> 1. If we go ahead with this sysfs interface approach to list all standby
> memory ranges, could the list be made available via
> /sys/devices/system/memory/configurable_memlist?  This could be helpful,
> as /sys/devices/system/memory/configure_memory performs architecture
> independent checks and could also be useful for other architectures in
> the future.

See above, I think we want this s390x specific.

> 
> 2. Whether the new interface should also be compatible with lsmem/chmem?

Yes, likely we should allow them to query-configure this s390x specific 
thing.

> 
> 3. OR can we have a s390 specific path (eg:
> /sys/firmware/memory/standy_range) to list all standby memory range
> which are in deconfigured state and also use the current design
> (max_configurable) to make it easier for lsmem/chmem tool to detect
> these standby memory ranges?

Ah, there it is, yes!

> 
>> To disable standby memory, one would first offline the memory to then
>> trigger removal using the arch specific interface. It is very similar to
>> dax/kmem's way of handling offline+removal.
> 
> ok
> 
>> Now I wonder if dax/kmem could be (ab)used on s390x for standby storage.
>> Likely a simple sysfs interface could be easier to implement.
> 
> I havent checked dax/kmem in detail yet. I will look into it.

Probably it's not 100% what you want to achieve, just to give you an 
example how similar (but different) technologies have solved this problem.