[v2,0/1] mm: introduce MADV_DEMOTE/MADV_PROMOTE


Message ID

20240801075610.12351-1-zhang.renze@h3c.com (mailing list archive)

Headers

From: BiscuitOS Broiler <zhang.renze@h3c.com>
To: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
        <akpm@linux-foundation.org>
CC: <arnd@arndb.de>, <linux-arch@vger.kernel.org>, <chris@zankel.net>,
        <jcmvbkbc@gmail.com>, <James.Bottomley@HansenPartnership.com>,
        <deller@gmx.de>, <linux-parisc@vger.kernel.org>,
        <tsbogend@alpha.franken.de>, <rdunlap@infradead.org>,
        <bhelgaas@google.com>, <linux-mips@vger.kernel.org>,
        <richard.henderson@linaro.org>, <ink@jurassic.park.msu.ru>,
        <mattst88@gmail.com>, <linux-alpha@vger.kernel.org>,
        <jiaoxupo@h3c.com>, <zhou.haofan@h3c.com>, <zhang.renze@h3c.com>
Subject: [PATCH v2 0/1] mm: introduce MADV_DEMOTE/MADV_PROMOTE
Date: Thu, 1 Aug 2024 15:56:09 +0800
Message-ID: <20240801075610.12351-1-zhang.renze@h3c.com>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain

Series

mm: introduce MADV_DEMOTE/MADV_PROMOTE

Message

BiscuitOS Broiler Aug. 1, 2024, 7:56 a.m. UTC

Sure, here's the Scalable Tiered Memory Control (STMC)

**Background**

Artificial intelligence, big data analytics, and machine learning
have become mainstream research topics and application scenarios,
and with them the demand for high-capacity, high-bandwidth memory
keeps growing. CXL (Compute Express Link) makes high-capacity memory
possible. Although CXL Type 3 devices can provide large memory
capacities, their access latency is higher than that of traditional
DRAM due to hardware architecture limitations.

To enjoy the large capacity brought by CXL memory while minimizing
the impact of high latency, Linux has introduced the Tiered Memory
architecture. In the Tiered Memory architecture, CXL memory is
treated as an independent, slower NUMA NODE, while DRAM is
considered as a relatively faster NUMA NODE. Applications allocate
memory from the local node, and Tiered Memory, leveraging memory
reclamation and NUMA Balancing mechanisms, can transparently demote
physical pages not recently accessed by user processes to the slower
CXL NUMA NODE. However, when user processes re-access the demoted
memory, the Tiered Memory mechanism will, based on certain logic,
decide whether to promote the demoted physical pages back to the
fast NUMA NODE. If the promotion is successful, the memory accessed
by the user process will reside in DRAM; otherwise, it will reside in
the CXL NODE. Through the Tiered Memory mechanism, Linux balances
between large memory capacity and latency, striving to maintain an
equilibrium for applications.

**Problem**
Although Tiered Memory strives to balance between large capacity and
latency, specific scenarios can lead to the following issues:

  1. In scenarios requiring massive computations, if data is heavily
     stored in CXL slow memory and Tiered Memory cannot promptly
     promote this memory to fast DRAM, it will significantly impact
     program performance.
  2. Similar to the scenario described in point 1, Tiered Memory may
     decide to promote these physical pages to the fast DRAM NODE but
     be unable to do so due to limitations in the DRAM NODE promote
     ratio. Consequently, the program will keep running in slow
     memory.
  3. After an application finishes computing on a large block of fast
     memory, it may not immediately re-access it. Hence, this memory
     can only wait for the memory reclamation mechanism to demote it.
  4. Similar to the scenario described in point 3, if the demotion
     speed is slow, these cold pages will occupy the promotion
     resources, preventing some eligible slow pages from being
     immediately promoted, severely affecting application efficiency.

**Solution**
We propose the **Scalable Tiered Memory Control (STMC)** mechanism,
which delegates the authority of promoting and demoting memory to the
application. The principle is simple, as follows:

  1. When an application is preparing for computation, it can promote
     the memory it needs to use or ensure the memory resides on a fast
     NODE.
  2. When an application does not expect to use the memory again
     soon, it can immediately demote the memory to slow memory,
     freeing up valuable promotion resources.

The STMC mechanism is implemented through the madvise system call,
which gains two new advice options: MADV_DEMOTE and MADV_PROMOTE.
MADV_DEMOTE advises the kernel to demote the physical memory to the
node where slow memory resides; this advice only fails if there is no
free physical memory on the slow memory node. MADV_PROMOTE advises
the kernel to retain the physical memory in fast memory; this advice
only fails if there are no promotion slots available on the fast
memory node. Benefits brought by STMC include:

  1. The STMC mechanism is a variant of on-demand memory management
     designed to let applications enjoy fast memory as much as possible,
     while actively demoting to slow memory when not in use, thus
     freeing up promotion slots for the NODE and allowing it to run in
     an optimized Tiered Memory environment.
  2. The STMC mechanism better balances large capacity and latency.
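
The intended usage pattern described above (promote before a compute
phase, demote once it is done) might look roughly like the following
from user space. This is only a sketch under stated assumptions: the
numeric values of MADV_DEMOTE and MADV_PROMOTE are placeholders (the
real values come from the patched mman-common.h and are not given in
this cover letter), and stmc_advise() and stmc_compute_phase() are
hypothetical helper names. Because the calls are advice, failures are
recorded rather than treated as fatal.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * ASSUMPTION: placeholder advice numbers, for illustration only.
 * The real values are defined by the patch and are not stated here.
 */
#ifndef MADV_DEMOTE
#define MADV_DEMOTE  0x7f00
#endif
#ifndef MADV_PROMOTE
#define MADV_PROMOTE 0x7f01
#endif

/* Hypothetical wrapper: returns 0 on success, negative errno on failure. */
int stmc_advise(void *addr, size_t len, int advice)
{
    assert(addr != NULL && len > 0);
    return madvise(addr, len, advice) == 0 ? 0 : -errno;
}

/*
 * Promote a buffer before a compute phase and demote it afterwards.
 * A failed advice call (for example -EINVAL when the running kernel
 * does not recognise the advice value) is reported through
 * *promote_rc / *demote_rc but does not stop the computation.
 * Returns 0, or a negative errno if the mapping itself fails.
 */
int stmc_compute_phase(size_t len, int *promote_rc, int *demote_rc)
{
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -errno;

    memset(buf, 0, len);                 /* fault the pages in */
    *promote_rc = stmc_advise(buf, len, MADV_PROMOTE);
    memset(buf, 0xab, len);              /* stand-in for the real computation */
    *demote_rc = stmc_advise(buf, len, MADV_DEMOTE);

    munmap(buf, len);
    return 0;
}
```

A caller would invoke stmc_compute_phase(len, &p, &d) and inspect p and
d to see whether each hint was accepted.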

**Shortcomings of STMC**
The STMC mechanism requires the caller to manage memory demotion and
promotion. If the memory is not promptly demoted after a promotion,
it may cause issues similar to memory leaks, leading to short-term
promotion bottlenecks.

BiscuitOS Broiler (1):
  mm: introduce MADV_DEMOTE/MADV_PROMOTE

 arch/alpha/include/uapi/asm/mman.h           |   3 +
 arch/mips/include/uapi/asm/mman.h            |   3 +
 arch/parisc/include/uapi/asm/mman.h          |   3 +
 arch/xtensa/include/uapi/asm/mman.h          |   3 +
 include/uapi/asm-generic/mman-common.h       |   3 +
 mm/internal.h                                |   1 +
 mm/madvise.c                                 | 251 +++++++++++++++++++
 mm/vmscan.c                                  |  57 +++++
 tools/include/uapi/asm-generic/mman-common.h |   3 +
 9 files changed, 327 insertions(+)

Comments

David Hildenbrand Aug. 1, 2024, 8:06 a.m. UTC | #1

On 01.08.24 09:56, BiscuitOS Broiler wrote:
> **Shortcomings of STMC**
> The STMC mechanism requires the caller to manage memory demotion and
> promotion. If the memory is not promptly demoted after a promotion,
> it may cause issues similar to memory leaks
Ehm, that sounds scary. Can you elaborate what's happening here and why 
it is "similar to memory leaks"?


Can you also point out why migrate_pages() is not suitable? I would 
assume demote/promote is in essence simply migrating memory between nodes.
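
For context on the question above, existing syscalls can already move
a process's pages between NUMA nodes. The sketch below uses
move_pages(2), the per-page sibling of migrate_pages(2), through the
raw syscall interface so that it does not depend on libnuma. It is an
illustration, not part of the patch or of the reply: the helper name
migrate_one_page() is hypothetical, and which node number is the fast
(DRAM) or slow (CXL) tier is system-specific and assumed to be known
to the caller.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)  /* as in the kernel UAPI linux/mempolicy.h */
#endif

/*
 * Ask the kernel to migrate the page containing `page` (in the calling
 * process, pid 0) to `target_node`.
 *
 * Returns the per-page status reported by move_pages(2): the node the
 * page resides on after the call, or a negative errno for that page.
 * If the syscall itself fails, returns the negative errno of the call.
 */
long migrate_one_page(void *page, int target_node)
{
    void *pages[1]  = { page };
    int   nodes[1]  = { target_node };
    int   status[1] = { -1 };

    long rc = syscall(SYS_move_pages, 0, 1UL, pages, nodes, status,
                      MPOL_MF_MOVE);
    if (rc < 0)
        return -errno;
    return status[0];
}
```

Under this sketch, "demoting" a page would be migrate_one_page(addr,
slow_node) and "promoting" it migrate_one_page(addr, fast_node), where
slow_node and fast_node are the caller's assumed node IDs.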