@@ -32,6 +32,7 @@ the Linux memory management.
idle_page_tracking
ksm
memory-hotplug
+ memory-tiering
nommu-mmap
numa_memory_policy
numaperf
new file mode 100644
@@ -0,0 +1,181 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _admin_guide_memory_tiering:
+
+============
+Memory tiers
+============
+
+This document describes explicit memory tiering support along with
+demotion based on memory tiers.
+
+Introduction
+============
+
+Many systems have multiple types of memory devices, e.g. GPU, DRAM and
+PMEM. Because these memory types differ in performance, the memory
+subsystem of such a system can be called a memory tiering system. Memory
+tiers are defined based on the hardware capabilities of memory nodes. Each
+memory tier is assigned a rank value that determines the tier's position
+in the demotion order.
+
+The memory tier assignment of each node is independent of the others.
+Moving a node from one tier to another does not affect the tier
+assignment of any other node.
+
+Memory tiers are used to build the demotion targets for nodes. A node
+can demote its pages to any node in any lower tier.
+
+Memory tier rank
+=================
+
+Memory nodes are divided into the following three types of memory tiers,
+each with its own rank value, based on their hardware characteristics:
+
+* MEMORY_RANK_HBM_GPU
+* MEMORY_RANK_DRAM
+* MEMORY_RANK_PMEM
+
+Memory tiers initialization and (re)assignments
+===============================================
+
+By default, all nodes are assigned to the memory tier with the default rank,
+DEFAULT_MEMORY_RANK, which is 1 (MEMORY_RANK_DRAM). The memory tier of a
+memory node can be modified either through sysfs or from a driver. On
+hotplug, a memory node is assigned to the memory tier with the default rank.
+
+Sysfs interfaces
+================
+
+The nodes belonging to a specific tier can be read from
+/sys/devices/system/memtier/memtierN/nodelist (read-only).
+
+Here N is in the range 0 to 2.
+
+Example 1:
+For a system where node 0 is a CPU + DRAM node, node 1 is an HBM node and
+node 2 is a PMEM node, an ideal tier layout would be::
+
+  $ cat /sys/devices/system/memtier/memtier0/nodelist
+  1
+  $ cat /sys/devices/system/memtier/memtier1/nodelist
+  0
+  $ cat /sys/devices/system/memtier/memtier2/nodelist
+  2
+
+Example 2:
+For a system where nodes 0 and 1 are CPU + DRAM nodes and nodes 2 and 3
+are PMEM nodes::
+
+  $ cat /sys/devices/system/memtier/memtier0/nodelist
+  cat: /sys/devices/system/memtier/memtier0/nodelist: No such file or
+  directory
+  $ cat /sys/devices/system/memtier/memtier1/nodelist
+  0-1
+  $ cat /sys/devices/system/memtier/memtier2/nodelist
+  2-3
+
+The default memory tier can be read from
+/sys/devices/system/memtier/default_tier (read-only).
+For example::
+
+  $ cat /sys/devices/system/memtier/default_tier
+  memtier1
+
+The maximum memory tier can be read from
+/sys/devices/system/memtier/max_tier (read-only).
+For example::
+
+  $ cat /sys/devices/system/memtier/max_tier
+  3
+
+An individual node's memory tier can be read or set using
+/sys/devices/system/node/nodeN/memtier (read-write).
+
+Here N is the node ID.
+
+When this file is written, the node is moved from its old memory tier to
+the new one and the demotion targets for all N_MEMORY nodes are rebuilt.
+
+For Example 1 above::
+
+  $ cat /sys/devices/system/node/node0/memtier
+  1
+  $ cat /sys/devices/system/node/node1/memtier
+  0
+  $ cat /sys/devices/system/node/node2/memtier
+  2
+
+Additional memory tiers can be created from userspace through
+/sys/devices/system/memtier/create_tier_from_rank (read-write).
+
+Writing a rank value to this file creates a new memory tier with the
+specified rank and an empty nodelist.
+
+Demotion
+========
+
+In a system with DRAM and persistent memory, once DRAM
+fills up, reclaim will start and some of the DRAM contents will be
+thrown out even if there is free space in persistent memory.
+Consequently, allocations will at some point start falling back to the
+slower persistent memory.
+
+That has two nasty properties. First, newer allocations can end up in
+the slower persistent memory. Second, reclaimed data in DRAM is simply
+discarded even if there is plenty of free space in persistent memory that
+could be used.
+
+Instead of being discarded during reclaim, a page can be moved to
+persistent memory. Allowing page migration during reclaim enables
+these systems to migrate pages from fast (higher) tiers to slow (lower)
+tiers when the fast (higher) tier is under pressure.
+
+
+Enable/Disable demotion
+-----------------------
+
+Demotion is disabled by default. It can be enabled or disabled by writing
+1/true or 0/false to the following sysfs file::
+
+  $ echo 1 > /sys/kernel/mm/numa/demotion_enabled
+
+Preferred and allowed demotion nodes
+------------------------------------
+
+The preferred nodes for a specific N_MEMORY node are the best nodes
+from the next possible lower memory tier. The allowed nodes for any
+node are all the nodes available in all possible lower memory
+tiers.
+
+Example:
+
+For a system where nodes 0 and 1 are CPU + DRAM nodes and nodes 2 and 3 are
+PMEM nodes::
+
+  node distances:
+  node   0   1   2   3
+     0  10  20  30  40
+     1  20  10  40  30
+     2  30  40  10  40
+     3  40  30  40  10
+
+  memory_tiers[0] = <empty>
+  memory_tiers[1] = 0-1
+  memory_tiers[2] = 2-3
+
+  node_demotion[0].preferred = 2
+  node_demotion[0].allowed = 2, 3
+  node_demotion[1].preferred = 3
+  node_demotion[1].allowed = 3, 2
+  node_demotion[2].preferred = <empty>
+  node_demotion[2].allowed = <empty>
+  node_demotion[3].preferred = <empty>
+  node_demotion[3].allowed = <empty>
+
+Memory allocation for demotion
+------------------------------
+
+If a page needs to be demoted from a node, the kernel first tries
+to allocate the new page from the node's preferred node and then falls
+back to the node's allowed targets in allocation fallback order.
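
The preferred/allowed selection described in the demotion section can be
modelled with a short Python sketch. This is only an illustration, not kernel
code: the function name build_demotion_targets and its data layout are
hypothetical, "best" is assumed to mean the smallest NUMA distance, and the
allowed list is assumed to be ordered nearest first. The input data is taken
from the four-node example in the document.

```python
def build_demotion_targets(tiers, distance):
    """Model of preferred/allowed demotion targets (illustrative, not kernel code).

    tiers:    dict mapping tier id -> list of node ids; a larger tier id is
              treated as a lower (slower) tier, as in the sysfs examples.
    distance: square matrix of NUMA distances, distance[src][dst].
    """
    # Empty tiers contribute no demotion targets, so skip them.
    populated = sorted(t for t, nodes in tiers.items() if nodes)
    targets = {}
    for idx, tier in enumerate(populated):
        lower = populated[idx + 1:]
        for node in tiers[tier]:
            # Allowed: every node in every lower tier, nearest first (assumed order).
            allowed = sorted(
                (n for t in lower for n in tiers[t]),
                key=lambda n: distance[node][n],
            )
            # Preferred: nearest node(s) in the next lower tier
            # (assumed meaning of "best").
            preferred = []
            if lower:
                candidates = tiers[lower[0]]
                best = min(distance[node][n] for n in candidates)
                preferred = [n for n in candidates if distance[node][n] == best]
            targets[node] = {"preferred": preferred, "allowed": allowed}
    return targets


# Data from the example: nodes 0-1 are DRAM (tier 1), nodes 2-3 are PMEM (tier 2).
distance = [
    [10, 20, 30, 40],
    [20, 10, 40, 30],
    [30, 40, 10, 40],
    [40, 30, 40, 10],
]
tiers = {0: [], 1: [0, 1], 2: [2, 3]}
targets = build_demotion_targets(tiers, distance)
for node in sorted(targets):
    print(node, targets[node])
```

Under these assumptions the sketch gives node 0 a preferred target of node 2
and node 1 a preferred target of node 3, while the PMEM nodes, being in the
lowest populated tier, have no demotion targets.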