[035/147] memory-hotplug.rst: complete admin-guide overhaul

Message ID	20210908025449.7rxiltYbJ%akpm@linux-foundation.org (mailing list archive)
State	New
Headers	Return-Path: <SRS0=DBvG=N6=kvack.org=owner-linux-mm@kernel.org>
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 9F75561100
Date: Tue, 07 Sep 2021 19:54:49 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, anshuman.khandual@arm.com, corbet@lwn.net, dave.hansen@linux.intel.com, david@redhat.com, linux-mm@kvack.org, mhocko@suse.com, mike.kravetz@oracle.com, mm-commits@vger.kernel.org, osalvador@suse.de, pasha.tatashin@soleen.com, rppt@linux.ibm.com, sfr@canb.auug.org.au, songmuchun@bytedance.com, torvalds@linux-foundation.org, willy@infradead.org
Subject: [patch 035/147] memory-hotplug.rst: complete admin-guide overhaul
Message-ID: <20210908025449.7rxiltYbJ%akpm@linux-foundation.org>
In-Reply-To: <20210907195226.14b1d22a07c085b22968b933@linux-foundation.org>
User-Agent: s-nail v14.8.16
Sender: owner-linux-mm@kvack.org
Precedence: bulk

--- a/Documentation/admin-guide/mm/memory-hotplug.rst~memory-hotplugrst-complete-admin-guide-overhaul
+++ a/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -1,427 +1,576 @@
 .. _admin_guide_memory_hotplug:
 
-==============
-Memory Hotplug
-==============
-
-:Created: Jul 28 2007
-:Updated: Add some details about locking internals: Aug 20 2018
-
-This document is about memory hotplug including how-to-use and current status.
-Because Memory Hotplug is still under development, contents of this text will
-be changed often.
+==================
+Memory Hot(Un)Plug
+==================
+
+This document describes generic Linux support for memory hot(un)plug with
+a focus on System RAM, including ZONE_MOVABLE support.
 
 .. contents:: :local:
 
-.. note::
+Introduction
+============
 
-    (1) x86_64's has special implementation for memory hotplug.
-        This text does not describe it.
-    (2) This text assumes that sysfs is mounted at ``/sys``.
+Memory hot(un)plug allows for increasing and decreasing the size of physical
+memory available to a machine at runtime. In the simplest case, it consists of
+physically plugging or unplugging a DIMM at runtime, coordinated with the
+operating system.
+
+Memory hot(un)plug is used for various purposes:
+
+- The physical memory available to a machine can be adjusted at runtime, up- or
+  downgrading the memory capacity. This dynamic memory resizing, sometimes
+  referred to as "capacity on demand", is frequently used with virtual machines
+  and logical partitions.
+
+- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
+  example is replacing failing memory modules.
+
+- Reducing energy consumption either by physically unplugging memory modules or
+  by logically unplugging (parts of) memory modules from Linux.
+
+Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
+used to expose persistent memory, other performance-differentiated memory and
+reserved memory regions as ordinary system RAM to Linux.
+
+Linux only supports memory hot(un)plug on selected 64-bit architectures, such as
+x86_64, arm64, ppc64, s390x and ia64.
+
+Memory Hot(Un)Plug Granularity
+------------------------------
+
+Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
+physical memory address space into chunks of the same size: memory sections. The
+size of a memory section is architecture dependent. For example, x86_64 uses
+128 MiB and ppc64 uses 16 MiB.
 
+Memory sections are combined into chunks referred to as "memory blocks". The
+size of a memory block is architecture dependent and corresponds to the smallest
+granularity that can be hot(un)plugged. The default size of a memory block is
+the same as memory section size, unless an architecture specifies otherwise.
 
-Introduction
-============
+All memory blocks have the same size.
 
-Purpose of memory hotplug
--------------------------
+Phases of Memory Hotplug
+------------------------
 
-Memory Hotplug allows users to increase/decrease the amount of memory.
-Generally, there are two purposes.
+Memory hotplug consists of two phases:
 
-(A) For changing the amount of memory.
-    This is to allow a feature like capacity on demand.
-(B) For installing/removing DIMMs or NUMA-nodes physically.
-    This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.
+(1) Adding the memory to Linux
+(2) Onlining memory blocks
 
-(A) is required by highly virtualized environments and (B) is required by
-hardware which supports memory power management.
+In the first phase, metadata, such as the memory map ("memmap") and page tables
+for the direct mapping, is allocated and initialized, and memory blocks are
+created; the latter also creates sysfs files for managing newly created memory
+blocks.
 
-Linux memory hotplug is designed for both purpose.
+In the second phase, added memory is exposed to the page allocator. After this
+phase, the memory is visible in memory statistics, such as free and total
+memory, of the system.
-Phases of memory hotplug
-------------------------
+Phases of Memory Hotunplug
+--------------------------
 
-There are 2 phases in Memory Hotplug:
+Memory hotunplug consists of two phases:
 
-  1) Physical Memory Hotplug phase
-  2) Logical Memory Hotplug phase.
+(1) Offlining memory blocks
+(2) Removing the memory from Linux
 
-The First phase is to communicate hardware/firmware and make/erase
-environment for hotplugged memory. Basically, this phase is necessary
-for the purpose (B), but this is good phase for communication between
-highly virtualized environments too.
-
-When memory is hotplugged, the kernel recognizes new memory, makes new memory
-management tables, and makes sysfs files for new memory's operation.
-
-If firmware supports notification of connection of new memory to OS,
-this phase is triggered automatically. ACPI can notify this event. If not,
-"probe" operation by system administration is used instead.
-(see :ref:`memory_hotplug_physical_mem`).
-
-Logical Memory Hotplug phase is to change memory state into
-available/unavailable for users. Amount of memory from user's view is
-changed by this phase. The kernel makes all memory in it as free pages
-when a memory range is available.
-
-In this document, this phase is described as online/offline.
-
-Logical Memory Hotplug phase is triggered by write of sysfs file by system
-administrator. For the hot-add case, it must be executed after Physical Hotplug
-phase by hand.
-(However, if you writes udev's hotplug scripts for memory hotplug, these
-phases can be execute in seamless way.)
-
-Unit of Memory online/offline operation
----------------------------------------
-
-Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
-into chunks of the same size. These chunks are called "sections". The size of
-a memory section is architecture dependent. For example, power uses 16MiB, ia64
-uses 1GiB.
+In the first phase, memory is "hidden" from the page allocator again, for
+example, by migrating busy memory to other memory locations and removing all
+relevant free pages from the page allocator. After this phase, the memory is no
+longer visible in memory statistics of the system.
 
-Memory sections are combined into chunks referred to as "memory blocks". The
-size of a memory block is architecture dependent and represents the logical
-unit upon which memory online/offline operations are to be performed. The
-default size of a memory block is the same as memory section size unless an
-architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.)
+In the second phase, the memory blocks are removed and metadata is freed.
 
-To determine the size (in bytes) of a memory block please read this file::
+Memory Hotplug Notifications
+============================
 
-    /sys/devices/system/memory/block_size_bytes
+There are various ways in which Linux is notified about memory hotplug events
+such that it can start adding hotplugged memory. This description is limited to
+systems that support ACPI; mechanisms specific to other firmware interfaces or
+virtual machines are not described.
 
-Kernel Configuration
-====================
+ACPI Notifications
+------------------
 
-To use memory hotplug feature, kernel must be compiled with following
-config options.
+Platforms that support ACPI, such as x86_64, can support memory hotplug
+notifications via ACPI.
 
-- For all memory hotplug:
-    - Memory model -> Sparse Memory (``CONFIG_SPARSEMEM``)
-    - Allow for memory hot-add (``CONFIG_MEMORY_HOTPLUG``)
+In general, a firmware supporting memory hotplug defines a memory class object
+_HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI
+driver will hotplug the memory to Linux.
+
-- To enable memory removal, the following are also necessary:
-    - Allow for memory hot remove (``CONFIG_MEMORY_HOTREMOVE``)
-    - Page Migration (``CONFIG_MIGRATION``)
+If the firmware supports hotplug of NUMA nodes, it defines an object _HID
+"ACPI0004", "PNP0A05", or "PNP0A06". When notified about a hotplug event, all
+assigned memory devices are added to Linux by the ACPI driver.
 
-- For ACPI memory hotplug, the following are also necessary:
-    - Memory hotplug (under ACPI Support menu) (``CONFIG_ACPI_HOTPLUG_MEMORY``)
-    - This option can be kernel module.
+Similarly, Linux can be notified about requests to hotunplug a memory device or
+a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
+blocks, and, if successful, hotunplug the memory from Linux.
 
-- As a related configuration, if your box has a feature of NUMA-node hotplug
-  via ACPI, then this option is necessary too.
+Manual Probing
+--------------
 
-    - ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
-      (``CONFIG_ACPI_CONTAINER``).
+On some architectures, the firmware may not be able to notify the operating
+system about a memory hotplug event. Instead, the memory has to be manually
+probed from user space.
 
-     This option can be kernel module too.
+The probe interface is located at::
 
+    /sys/devices/system/memory/probe
 
-.. _memory_hotplug_sysfs_files:
+Only complete memory blocks can be probed. Individual memory blocks are probed
+by providing the physical start address of the memory block::
 
-sysfs files for memory hotplug
-==============================
+    % echo addr > /sys/devices/system/memory/probe
 
-All memory blocks have their device information in sysfs. Each memory block
-is described under ``/sys/devices/system/memory`` as::
+Which results in a memory block for the range [addr, addr + memory_block_size)
+being created.
 
-    /sys/devices/system/memory/memoryXXX
+.. note::
 
-where XXX is the memory block id.
+
+  Using the probe interface is discouraged as it is easy to crash the kernel,
+  because Linux cannot validate user input; this interface might be removed in
+  the future.
+
+Onlining and Offlining Memory Blocks
+====================================
+
+After a memory block has been created, Linux has to be instructed to actually
+make use of that memory: the memory block has to be "onlined".
+
+Before a memory block can be removed, Linux has to stop using any memory that is
+part of the memory block: the memory block has to be "offlined".
+
+The Linux kernel can be configured to automatically online added memory
+blocks, and drivers may automatically trigger offlining of memory blocks
+when attempting hotunplug of memory.
+
+Memory blocks can only be removed once offlining has succeeded.
 
-For the memory block covered by the sysfs directory. It is expected that all
-memory sections in this range are present and no memory holes exist in the
-range. Currently there is no way to determine if there is a memory hole, but
-the existence of one should not affect the hotplug capabilities of the memory
-block.
+Onlining Memory Blocks Manually
+-------------------------------
 
-For example, assume 1GiB memory block size. A device for a memory starting at
-0x100000000 is ``/sys/device/system/memory/memory4``::
+If auto-onlining of memory blocks isn't enabled, user space has to manually
+trigger onlining of memory blocks. Often, udev rules are used to automate this
+task in user space.
 
-    (0x100000000 / 1Gib = 4)
+Onlining of a memory block can be triggered via::
 
-This device covers address range [0x100000000 ... 0x140000000)
+    % echo online > /sys/devices/system/memory/memoryXXX/state
 
-Under each memory block, you can see 5 files:
+Or alternatively::
 
-- ``/sys/devices/system/memory/memoryXXX/phys_index``
-- ``/sys/devices/system/memory/memoryXXX/phys_device``
-- ``/sys/devices/system/memory/memoryXXX/state``
-- ``/sys/devices/system/memory/memoryXXX/removable``
-- ``/sys/devices/system/memory/memoryXXX/valid_zones``
+    % echo 1 > /sys/devices/system/memory/memoryXXX/online
 
-=================== ============================================================
-``phys_index``      read-only and contains memory block id, same as XXX.
-``state``           read-write
+The kernel will select the target zone automatically, usually defaulting to
+``ZONE_NORMAL`` unless ``movable_node`` has been specified on the kernel
+command line or if the memory block would intersect the ZONE_MOVABLE already.
 
-                    - at read:  contains online/offline state of memory.
-                    - at write: user can specify "online_kernel",
+One can explicitly request to associate an offline memory block with
+ZONE_MOVABLE by::
 
-                    "online_movable", "online", "offline" command
-                    which will be performed on all sections in the block.
-``phys_device``     read-only: legacy interface only ever used on s390x to
-                    expose the covered storage increment.
-``removable``       read-only: legacy interface that indicated whether a memory
-                    block was likely to be offlineable or not.  Newer kernel
-                    versions return "1" if and only if the kernel supports
-                    memory offlining.
-``valid_zones``     read-only: designed to show by which zone memory provided by
-                    a memory block is managed, and to show by which zone memory
-                    provided by an offline memory block could be managed when
-                    onlining.
-
-                    The first column shows it`s default zone.
-
-                    "memory6/valid_zones: Normal Movable" shows this memoryblock
-                    can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE
-                    by online_movable.
-
-                    "memory7/valid_zones: Movable Normal" shows this memoryblock
-                    can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL
-                    by online_kernel.
-=================== ============================================================
+    % echo online_movable > /sys/devices/system/memory/memoryXXX/state
 
-.. note::
+Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::
 
-  These directories/files appear after physical memory hotplug phase.
+    % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
 
-If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed
-via symbolic links located in the ``/sys/devices/system/node/node*`` directories.
+In any case, if onlining succeeds, the state of the memory block is changed to
+be "online". If it fails, the state of the memory block will remain unchanged
+and the above commands will fail.
+
+Onlining Memory Blocks Automatically
+------------------------------------
+
+The kernel can be configured to try auto-onlining of newly added memory blocks.
+If this feature is disabled, the memory blocks will stay offline until
+explicitly onlined from user space.
 
-For example::
+The configured auto-online behavior can be observed via::
 
-    /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+    % cat /sys/devices/system/memory/auto_online_blocks
 
-A backlink will also be created::
+Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
+``online_movable`` to that file, like::
 
-    /sys/devices/system/memory/memory9/node0 -> ../../node/node0
+    % echo online > /sys/devices/system/memory/auto_online_blocks
 
-.. _memory_hotplug_physical_mem:
+Modifying the auto-online behavior will only affect subsequently added memory
+blocks.
 
-Physical memory hot-add phase
-=============================
+.. note::
 
-Hardware(Firmware) Support
---------------------------
+  In corner cases, auto-onlining can fail. The kernel won't retry. Note that
+  auto-onlining is not expected to fail in default configurations.
 
-On x86_64/ia64 platform, memory hotplug by ACPI is supported.
+.. note::
 
-In general, the firmware (ACPI) which supports memory hotplug defines
-memory class object of _HID "PNP0C80". When a notify is asserted to PNP0C80,
-Linux's ACPI handler does hot-add memory to the system and calls a hotplug udev
-script. This will be done automatically.
-
-But scripts for memory hotplug are not contained in generic udev package(now).
-You may have to write it by yourself or online/offline memory by hand.
-Please see :ref:`memory_hotplug_how_to_online_memory` and
-:ref:`memory_hotplug_how_to_offline_memory`.
-
-If firmware supports NUMA-node hotplug, and defines an object _HID "ACPI0004",
-"PNP0A05", or "PNP0A06", notification is asserted to it, and ACPI handler
-calls hotplug code for all of objects which are defined in it.
-If memory device is found, memory hotplug code will be called.
-
-Notify memory hot-add event by hand
------------------------------------
-
-On some architectures, the firmware may not notify the kernel of a memory
-hotplug event. Therefore, the memory "probe" interface is supported to
-explicitly notify the kernel. This interface depends on
-CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86
-if hotplug is supported, although for x86 this should be handled by ACPI
-notification.
+  DLPAR on ppc64 ignores the ``offline`` setting and will still online added
+  memory blocks; if onlining fails, memory blocks are removed again.
 
-Probe interface is located at::
+Offlining Memory Blocks
+-----------------------
 
-    /sys/devices/system/memory/probe
+In the current implementation, Linux's memory offlining will try migrating all
+movable pages off the affected memory block. As most kernel allocations, such as
+page tables, are unmovable, page migration can fail and, therefore, inhibit
+memory offlining from succeeding.
-You can tell the physical address of new memory to the kernel by::
+Having the memory provided by a memory block managed by ZONE_MOVABLE
+significantly increases memory offlining reliability; still, memory offlining
+can fail in some corner cases.
 
-    % echo start_address_of_new_memory > /sys/devices/system/memory/probe
+Further, memory offlining might retry for a long time (or even forever), until
+aborted by the user.
 
-Then, [start_address_of_new_memory, start_address_of_new_memory +
-memory_block_size] memory range is hot-added. In this case, hotplug script is
-not called (in current implementation). You'll have to online memory by
-yourself.  Please see :ref:`memory_hotplug_how_to_online_memory`.
+Offlining of a memory block can be triggered via::
 
-Logical Memory hot-add phase
-============================
+    % echo offline > /sys/devices/system/memory/memoryXXX/state
 
-State of memory
----------------
+Or alternatively::
 
-To see (online/offline) state of a memory block, read 'state' file::
+    % echo 0 > /sys/devices/system/memory/memoryXXX/online
+
+If offlining succeeds, the state of the memory block is changed to be "offline".
+If it fails, the state of the memory block will remain unchanged and the above
+commands will fail, for example, via::
+
+    bash: echo: write error: Device or resource busy
+
+or via::
+
+    bash: echo: write error: Invalid argument
+
+Observing the State of Memory Blocks
+------------------------------------
+
+The state (online/offline/going-offline) of a memory block can be observed
+either via::
 
-    % cat /sys/device/system/memory/memoryXXX/state
+    % cat /sys/devices/system/memory/memoryXXX/state
 
+Or alternatively (1/0) via::
 
-- If the memory block is online, you'll read "online".
-- If the memory block is offline, you'll read "offline".
+    % cat /sys/devices/system/memory/memoryXXX/online
 
+For an online memory block, the managing zone can be observed via::
 
-.. _memory_hotplug_how_to_online_memory:
+    % cat /sys/devices/system/memory/memoryXXX/valid_zones
 
-How to online memory
---------------------
+Configuring Memory Hot(Un)Plug
+==============================
 
-When the memory is hot-added, the kernel decides whether or not to "online"
-it according to the policy which can be read from "auto_online_blocks" file::
+There are various ways in which system administrators can configure memory
+hot(un)plug and interact with memory blocks, especially to online them.
 
-    % cat /sys/devices/system/memory/auto_online_blocks
+Memory Hot(Un)Plug Configuration via Sysfs
+------------------------------------------
 
-The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
-option. If it is disabled the default is "offline" which means the newly added
-memory is not in a ready-to-use state and you have to "online" the newly added
-memory blocks manually. Automatic onlining can be requested by writing "online"
-to "auto_online_blocks" file::
+Some memory hot(un)plug properties can be configured or inspected via sysfs in::
 
-    % echo online > /sys/devices/system/memory/auto_online_blocks
+    /sys/devices/system/memory/
 
-This sets a global policy and impacts all memory blocks that will subsequently
-be hotplugged. Currently offline blocks keep their state. It is possible, under
-certain circumstances, that some memory blocks will be added but will fail to
-online. User space tools can check their "state" files
-(``/sys/devices/system/memory/memoryXXX/state``) and try to online them manually.
-
-If the automatic onlining wasn't requested, failed, or some memory block was
-offlined it is possible to change the individual block's state by writing to the
-"state" file::
+The following files are currently defined:
 
-    % echo online > /sys/devices/system/memory/memoryXXX/state
+====================== =========================================================
+``auto_online_blocks`` read-write: set or get the default state of new memory
+                       blocks; configure auto-onlining.
 
-This onlining will not change the ZONE type of the target memory block,
-If the memory block doesn't belong to any zone an appropriate kernel zone
-(usually ZONE_NORMAL) will be used unless movable_node kernel command line
-option is specified when ZONE_MOVABLE will be used.
+                       The default value depends on the
+                       CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
+                       option.
 
-You can explicitly request to associate it with ZONE_MOVABLE by::
+                       See the ``state`` property of memory blocks for details.
+``block_size_bytes``   read-only: the size in bytes of a memory block.
+``probe``              write-only: add (probe) selected memory blocks manually
+                       from user space by supplying the physical start address.
 
-    % echo online_movable > /sys/devices/system/memory/memoryXXX/state
+                       Availability depends on the CONFIG_ARCH_MEMORY_PROBE
+                       kernel configuration option.
+``uevent``             read-write: generic udev file for device subsystems.
+====================== =========================================================
 
-.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE
+.. note::
 
-Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by::
+  When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
+  additional files ``hard_offline_page`` and ``soft_offline_page`` are available
+  to trigger hwpoisoning of pages, for example, for testing purposes. Note that
+  this functionality is not really related to memory hot(un)plug or actual
+  offlining of memory blocks.
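The sysfs files listed in the table above can be inspected from a small shell script. The following is only a sketch under stated assumptions, not part of the patch: the helper names ``block_size_mib`` and ``block_id_for_addr`` are hypothetical, and the script assumes that ``block_size_bytes`` prints the size as a hexadecimal number without a ``0x`` prefix. It only reads from sysfs and never writes to it.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not part of any kernel tool.

SYSFS_MEM=/sys/devices/system/memory

# Convert a memory block size to MiB.
# $1: block size as a hex string without "0x" prefix (assumed sysfs format).
block_size_mib() {
    echo $(( 0x$1 / 1048576 ))
}

# Compute the id of the memory block that covers a physical address,
# that is, the physical address divided by the memory block size.
# $1: physical address with "0x" prefix
# $2: block size as a hex string without "0x" prefix
block_id_for_addr() {
    echo $(( $1 / 0x$2 ))
}

# Only touch sysfs when the memory hot(un)plug interface is present.
if [ -r "$SYSFS_MEM/block_size_bytes" ]; then
    size_hex=$(cat "$SYSFS_MEM/block_size_bytes")
    echo "memory block size: $(block_size_mib "$size_hex") MiB"
    if [ -r "$SYSFS_MEM/auto_online_blocks" ]; then
        echo "auto-online policy: $(cat "$SYSFS_MEM/auto_online_blocks")"
    fi
fi
```

With a block size of 128 MiB, ``block_size_mib 8000000`` would report 128; with a block size of 1 GiB, ``block_id_for_addr 0x100000000 40000000`` would report 4. Keeping the arithmetic in functions separate from the sysfs reads lets the helpers be exercised on machines without memory hot(un)plug support.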
+
+Memory Block Configuration via Sysfs
+------------------------------------
+
+Each memory block is represented as a memory block device that can be
+onlined or offlined. All memory blocks have their device information located in
+sysfs. Each present memory block is listed under
+``/sys/devices/system/memory`` as::

-        % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+        /sys/devices/system/memory/memoryXXX

-.. note:: current limit: this memory block must be adjacent to ZONE_NORMAL
+where XXX is the memory block id; the number of digits is variable.

-An explicit zone onlining can fail (e.g. when the range is already within
-and existing and incompatible zone already).
+A present memory block indicates that some memory in the range is present;
+however, a memory block might span memory holes. A memory block spanning memory
+holes cannot be offlined.

-After this, memory block XXX's state will be 'online' and the amount of
-available memory will be increased.
+For example, assume a 1 GiB memory block size. The device for memory starting
+at 0x100000000 is ``/sys/devices/system/memory/memory4``::

-This may be changed in future.
+        (0x100000000 / 1GiB = 4)

-Logical memory remove
-=====================
+This device covers address range [0x100000000 ... 0x140000000)

-Memory offline and ZONE_MOVABLE
--------------------------------
+The following files are currently defined:

-Memory offlining is more complicated than memory online. Because memory offline
-has to make the whole memory block be unused, memory offline can fail if
-the memory block includes memory which cannot be freed.
-
-In general, memory offline can use 2 techniques.
-
-(1) reclaim and free all memory in the memory block.
-(2) migrate all pages in the memory block.
-
-In the current implementation, Linux's memory offline uses method (2), freeing
-all pages in the memory block by page migration. But not all pages are
-migratable. Under current Linux, migratable pages are anonymous pages and
-page caches. For offlining a memory block by migration, the kernel has to
-guarantee that the memory block contains only migratable pages.
-
-Now, a boot option for making a memory block which consists of migratable pages
-is supported. By specifying "kernelcore=" or "movablecore=" boot option, you can
-create ZONE_MOVABLE...a zone which is just used for movable pages.
-(See also Documentation/admin-guide/kernel-parameters.rst)
-
-Assume the system has "TOTAL" amount of memory at boot time, this boot option
-creates ZONE_MOVABLE as following.
-
-1) When kernelcore=YYYY boot option is used,
-   Size of memory not for movable pages (not for offline) is YYYY.
-   Size of memory for movable pages (for offline) is TOTAL-YYYY.
-
-2) When movablecore=ZZZZ boot option is used,
-   Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
-   Size of memory for movable pages (for offline) is ZZZZ.
+=================== ============================================================
+``online``          read-write: simplified interface to trigger onlining /
+                    offlining and to observe the state of a memory block.
+                    When onlining, the zone is selected automatically.
+``phys_device``     read-only: legacy interface only ever used on s390x to
+                    expose the covered storage increment.
+``phys_index``      read-only: the memory block id (XXX).
+``removable``       read-only: legacy interface that indicated whether a memory
+                    block was likely to be offlineable or not. Nowadays, the
+                    kernel returns ``1`` if and only if it supports memory
+                    offlining.
+``state``           read-write: advanced interface to trigger onlining /
+                    offlining and to observe the state of a memory block.
+
+                    When writing, ``online``, ``offline``, ``online_kernel`` and
+                    ``online_movable`` are supported.
+
+                    ``online_movable`` specifies onlining to ZONE_MOVABLE.
+                    ``online_kernel`` specifies onlining to the default kernel
+                    zone for the memory block, such as ZONE_NORMAL.
+                    ``online`` lets the kernel select the zone automatically.
+
+                    When reading, ``online``, ``offline`` and ``going-offline``
+                    may be returned.
+``uevent``          read-write: generic uevent file for devices.
+``valid_zones``     read-only: when a block is online, shows the zone it
+                    belongs to; when a block is offline, shows what zone will
+                    manage it once the block is onlined.
+
+                    For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
+                    ``Movable`` and ``none`` may be returned. ``none`` indicates
+                    that memory provided by a memory block is managed by
+                    multiple zones or spans multiple nodes; such memory blocks
+                    cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
+                    Other values indicate a kernel zone.
+
+                    For offline memory blocks, the first column shows the
+                    zone the kernel would select when onlining the memory block
+                    right now without further specifying a zone.
+
+                    Availability depends on the CONFIG_MEMORY_HOTREMOVE
+                    kernel configuration option.
+=================== ============================================================

 .. note::

-  Unfortunately, there is no information to show which memory block belongs
-  to ZONE_MOVABLE. This is TBD.
+  If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
+  directories can also be accessed via symbolic links located in the
+  ``/sys/devices/system/node/node*`` directories.
+
+  For example::
+
+        /sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+
+  A backlink will also be created::
+
+        /sys/devices/system/memory/memory9/node0 -> ../../node/node0
+
+Command Line Parameters
+-----------------------
+
+Some command line parameters affect memory hot(un)plug handling. The following
+command line parameters are relevant:
+
+======================== =======================================================
+``memhp_default_state``  configure auto-onlining by essentially setting
+                         ``/sys/devices/system/memory/auto_online_blocks``.
+``movable_node``         configure automatic zone selection of the kernel. When
+                         set, the kernel will default to ZONE_MOVABLE, unless
+                         other zones can be kept contiguous.
+======================== =======================================================
+
+Module Parameters
+------------------
+
+Instead of additional command line parameters or sysfs files, the
+``memory_hotplug`` subsystem now provides a dedicated namespace for module
+parameters. Module parameters can be set via the command line by prefixing
+them with ``memory_hotplug.`` such as::
+
+        memory_hotplug.memmap_on_memory=1
+
+and they can be observed (and some even modified at runtime) via::
+
+        /sys/module/memory_hotplug/parameters/
+
+The following module parameters are currently defined:
+
+======================== =======================================================
+``memmap_on_memory``     read-write: Allocate memory for the memmap from the
+                         added memory block itself. Even if enabled, actual
+                         support depends on various other system properties and
+                         should only be regarded as a hint whether the behavior
+                         would be desired.
+
+                         While allocating the memmap from the memory block
+                         itself makes memory hotplug less likely to fail and
+                         keeps the memmap on the same NUMA node in any case, it
+                         can fragment physical memory in a way that huge pages
+                         in bigger granularity cannot be formed on hotplugged
+                         memory.
+======================== =======================================================
+
+ZONE_MOVABLE
+============
+
+ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
+Further, having system RAM managed by ZONE_MOVABLE instead of one of the
+kernel zones can increase the number of possible transparent huge pages and
+dynamically allocated huge pages.
+
+Most kernel allocations are unmovable. Important examples include the memory
+map (usually 1/64th of memory), page tables, and kmalloc(). Such allocations
+can only be served from the kernel zones.
+
+Most user space pages, such as anonymous memory, and page cache pages are
+movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.
+
+Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
+allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
+absolutely no guarantee whether a memory block can be offlined successfully.
+
+Zone Imbalances
+---------------
+
+Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
+which can harm the system or degrade performance. As one example, the kernel
+might crash because it runs out of free memory for unmovable allocations,
+although there is still plenty of free memory left in ZONE_MOVABLE.

-   Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE
-   and the feature of freeing unused vmemmap pages associated with each hugetlb
-   page is enabled.
-
-   This can happen when we have plenty of ZONE_MOVABLE memory, but not enough
-   kernel memory to allocate vmemmmap pages. We may even be able to migrate
-   huge page contents, but will not be able to dissolve the source huge page.
-   This will prevent an offline operation and is unfortunate as memory offlining
-   is expected to succeed on movable zones. Users that depend on memory hotplug
-   to succeed for movable zones should carefully consider whether the memory
-   savings gained from this feature are worth the risk of possibly not being
-   able to offline memory in certain situations.
+Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
+are definitely impossible due to the overhead for the memory map.
+
+Actual safe zone ratios depend on the workload. Extreme cases, like excessive
+long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.

 .. note::

-  Techniques that rely on long-term pinnings of memory (especially, RDMA and
-  vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
-  hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
-  memory can still get hot removed - be aware that pinning can fail even if
-  there is plenty of free memory in ZONE_MOVABLE. In addition, using
-  ZONE_MOVABLE might make page pinning more expensive, because pages have to be
-  migrated off that zone first.

-.. _memory_hotplug_how_to_offline_memory:
+  CMA memory that is part of a kernel zone essentially behaves like memory in
+  ZONE_MOVABLE and similar considerations apply, especially when combining
+  CMA with ZONE_MOVABLE.

-How to offline memory
----------------------
+ZONE_MOVABLE Sizing Considerations
+----------------------------------

-You can offline a memory block by using the same sysfs interface that was used
-in memory onlining::
+We usually expect that a large portion of available system RAM will actually
+be consumed by user space, either directly or indirectly via the page cache. In
+the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.

-        % echo offline > /sys/devices/system/memory/memoryXXX/state
+With that in mind, it makes sense that we can have a big portion of system RAM
+managed by ZONE_MOVABLE. However, there are some things to consider when using
+ZONE_MOVABLE, especially when fine-tuning zone ratios:
+
+- Having a lot of offline memory blocks. Even offline memory blocks consume
+  memory for metadata and page tables in the direct map; having a lot of offline
+  memory blocks is not a typical case, though.
+
+- Memory ballooning without balloon compaction is incompatible with
+  ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
+  pseries CMM, fully support balloon compaction.
+
+  Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
+  disabled. In that case, balloon inflation will only perform unmovable
+  allocations and silently create a zone imbalance, usually triggered by
+  inflation requests from the hypervisor.
+
+- Gigantic pages are unmovable, resulting in user space consuming a
+  lot of unmovable memory.
+
+- Huge pages are unmovable when an architecture does not support huge
+  page migration, resulting in a similar issue as with gigantic pages.
+
+- Page tables are unmovable. Excessive swapping, mapping extremely large
+  files or ZONE_DEVICE memory can be problematic, although only really relevant
+  in corner cases. When we manage a lot of user space memory that has been
+  swapped out or is served from a file/persistent memory/... we still need a lot
+  of page tables to manage that memory once user space has accessed that memory.
+
+- In certain DAX configurations the memory map for the device memory will be
+  allocated from the kernel zones.
+
+- KASAN can have a significant memory overhead, for example, consuming 1/8th of
+  the total system memory size as (unmovable) tracking metadata.
+
+- Long-term pinning of pages. Techniques that rely on long-term pinnings
+  (especially, RDMA and vfio/mdev) are fundamentally problematic with
+  ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
+  on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
+  have to be migrated off that zone while pinning. Pinning a page can fail
+  even if there is plenty of free memory in ZONE_MOVABLE.
+
+  In addition, using ZONE_MOVABLE might make page pinning more expensive,
+  because of the page migration overhead.
+
+By default, all the memory configured at boot time is managed by the kernel
+zones and ZONE_MOVABLE is not used.
+
+To enable ZONE_MOVABLE to include the memory present at boot and to control the
+ratio between movable and kernel zones there are two command line options:
+``kernelcore=`` and ``movablecore=``. See
+Documentation/admin-guide/kernel-parameters.rst for their description.
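The ratio guidance and the two boot options can be tied together with a little arithmetic. This is a hedged sketch and not part of the patch: ``movablecore_for_ratio`` and ``memmap_mib`` are hypothetical helpers, sizes are in MiB, the 1/64 memory map share is the "usual" figure quoted in the ZONE_MOVABLE text, the ``M`` suffix assumes the size format described in kernel-parameters.rst, and the zone layout the kernel actually creates can differ from the requested size.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers for reasoning about zone ratios.

# movablecore_for_ratio TOTAL_MIB MOVABLE KERNEL
# Size (MiB) of ZONE_MOVABLE for a MOVABLE:KERNEL split of TOTAL_MIB.
movablecore_for_ratio() {
    echo $(( $1 * $2 / ($2 + $3) ))
}

# memmap_mib TOTAL_MIB
# Approximate (unmovable) memory map size, assuming 1/64 of memory.
memmap_mib() {
    echo $(( $1 / 64 ))
}

total=16384    # example value: 16 GiB of memory present at boot
movable=$(movablecore_for_ratio "$total" 3 1)
echo "movablecore=${movable}M"
echo "kernel zones: $(( total - movable )) MiB"
echo "memory map:   $(memmap_mib "$total") MiB"
```

For the 3:1 example this prints ``movablecore=12288M``, 4096 MiB of kernel zones and a 256 MiB memory map. Repeating the calculation with 63:1 leaves only 256 MiB for the kernel zones, which the memory map alone would consume; this illustrates why such a ratio cannot work.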
+
+Memory Offlining and ZONE_MOVABLE
+---------------------------------
+
+Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
+block might fail:
+
+- Memory blocks with memory holes; this applies to memory blocks present during
+  boot and can apply to memory blocks hotplugged via the XEN balloon and the
+  Hyper-V balloon.
+
+- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
+  offlining; this applies to memory blocks present during boot only.
+
+- Special memory blocks prevented by the system from getting offlined. Examples
+  include any memory available during boot on arm64 or memory blocks spanning
+  the crashkernel area on s390x; this usually applies to memory blocks present
+  during boot only.
+
+- Memory blocks overlapping with CMA areas cannot be offlined; this applies to
+  memory blocks present during boot only.
+
+- Concurrent activity that operates on the same physical memory area, such as
+  allocating gigantic pages, can result in temporary offlining failures.
+
+- Out of memory when dissolving huge pages, especially when freeing unused
+  vmemmap pages associated with each hugetlb page is enabled.
+
+  Offlining code may be able to migrate huge page contents, but may not be able
+  to dissolve the source huge page because it fails allocating (unmovable) pages
+  for the vmemmap, because the system might not have free memory in the kernel
+  zones left.
+
+  Users that depend on memory offlining to succeed for movable zones should
+  carefully consider whether the memory savings gained from this feature are
+  worth the risk of possibly not being able to offline memory in certain
+  situations.
+
+Further, when running into out of memory situations while migrating pages, or
+when still encountering permanently unmovable pages within ZONE_MOVABLE
+(-> BUG), memory offlining will keep retrying until it eventually succeeds.
+
+When offlining is triggered from user space, the offlining context can be
+terminated by sending a fatal signal. A timeout based offlining can easily be
+implemented via::

-If offline succeeds, the state of the memory block is changed to be "offline".
-If it fails, some error core (like -EBUSY) will be returned by the kernel.
-Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
-it. If it doesn't contain 'unmovable' memory, you'll get success.
-
-A memory block under ZONE_MOVABLE is considered to be able to be offlined
-easily. But under some busy state, it may return -EBUSY. Even if a memory
-block cannot be offlined due to -EBUSY, you can retry offlining it and may be
-able to offline it (or not). (For example, a page is referred to by some kernel
-internal call and released soon.)
-
-Consideration:
-  Memory hotplug's design direction is to make the possibility of memory
-  offlining higher and to guarantee unplugging memory under any situation. But
-  it needs more work. Returning -EBUSY under some situation may be good because
-  the user can decide to retry more or not by himself. Currently, memory
-  offlining code does some amount of retry with 120 seconds timeout.
-
-Physical memory remove
-======================
-
-Need more implementation yet....
- - Notification completion of remove works by OS to firmware.
- - Guard from remove if not yet.
-
-
-Future Work
-===========
-
-  - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
-    sysctl or new control file.
-  - showing memory block and physical device relationship.
-  - test and make it better memory offlining.
-  - support HugeTLB page migration and offlining.
-  - memmap removing at memory offline.
-  - physical remove memory.
+        % timeout $TIMEOUT offline_block | failure_handling
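The ``offline_block`` and ``failure_handling`` names in the last line of the patch are placeholders. One possible way to fill them in is sketched below, under stated assumptions: ``offline_block_with_timeout`` is a hypothetical helper, the state file path is passed in by the caller, and the sketch relies on ``timeout(1)`` from GNU coreutils, which sends SIGTERM by default and exits with status 124 when the time limit expires.

```shell
#!/bin/sh
# Sketch only: offline_block_with_timeout is a hypothetical helper.

# offline_block_with_timeout STATE_FILE SECONDS
# Write "offline" to a memory block's state file, giving up after SECONDS.
# The write runs in a child shell so that timeout(1) can signal it.
offline_block_with_timeout() {
    timeout "$2" sh -c 'echo offline > "$1"' sh "$1"
}

# Example failure handling (needs root and a real block id, so left as a comment):
# state=/sys/devices/system/memory/memoryXXX/state
# offline_block_with_timeout "$state" 60 ||
#     echo "offlining failed or timed out" >&2
```

Returning the exit status to the caller keeps the decision about retrying, or leaving the block online, outside the helper.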
