From: Gregory Price
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
    akpm@linux-foundation.org, sthanneeru@micron.com,
    ying.huang@intel.com, gregory.price@memverge.com
Subject: [RFC PATCH v2 0/3] mm: mempolicy: Multi-tier weighted interleaving
Date: Mon, 9 Oct 2023 16:42:56 -0400
Message-Id: <20231009204259.875232-1-gregory.price@memverge.com>

v2: change memtier mutex to semaphore
    add source-node relative weighting
    add remaining mempolicy integration code

= v2 Notes

Developed in collaboration with the original authors to deconflict
similar efforts to extend mempolicy to take weights directly.

== Mutex to Semaphore change:

The memory tiering subsystem is extended in this patch set to have
externally available information (weights), and therefore additional
controls need to be added to ensure values are not changed (or tiers
changed/added/removed) during various calculations.

Since it is expected that many threads will be accessing this data
during allocations, a mutex is not appropriate.

Since write-updates (weight changes, hotplug events) are rare events,
a simple rw semaphore is sufficient.

== Source-node relative weighting:

Tiers can now be weighted differently based on the node requesting
the weight. For example, CPU nodes 0 and 1 may have different weights
for the same CXL memory tier, because topologically the number of
NUMA hops is greater from one node than from the other (or because of
any other physical topological difference resulting in different
effective latency or bandwidth values).

1. Set weights for DDR (tier4) and CXL (tier22) tiers.
   The write format is:

   echo source_node:weight > /path/to/interleave_weight

# Set tier4 weight from node 0 to 85
echo 0:85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
# Set tier4 weight from node 1 to 65
echo 1:65 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
# Set tier22 weight from node 0 to 15
echo 0:15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight
# Set tier22 weight from node 1 to 10
echo 1:10 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight

== Mempolicy integration

Two new functions have been added to memory-tiers.c
* memtier_get_node_weight
  - Get the effective weight for a given node
* memtier_get_total_weight
  - Get the "total effective weight" for a given nodemask.

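As a rough userspace model of the two lookups listed above (not the
kernel code), the sketch below keeps one weight per (tier, source node)
pair and parses the same "source_node:weight" strings as the sysfs
writes. The class name TierWeights, its method names, the node-to-tier
mapping, and the rule that a node simply inherits its tier's weight are
all assumptions made for illustration; the default of 1 for an unset
weight is borrowed from the original RFC and may not match v2.

```python
DEFAULT_WEIGHT = 1  # assumption: unset weights fall back to the RFC's default of 1


class TierWeights:
    """Hypothetical model of per-tier weights that depend on the source node."""

    def __init__(self, node_to_tier):
        # node id -> tier id, e.g. {0: 4, 1: 4, 2: 22}
        self.node_to_tier = dict(node_to_tier)
        # tier id -> {source node id -> weight}
        self.weights = {}

    def store(self, tier, text):
        """Parse a 'source_node:weight' string, like the sysfs write format."""
        source, weight = text.strip().split(":")
        self.weights.setdefault(tier, {})[int(source)] = int(weight)

    def node_weight(self, source, node):
        """Weight of `node` as seen from `source` (assumed: the tier's weight)."""
        tier = self.node_to_tier[node]
        return self.weights.get(tier, {}).get(source, DEFAULT_WEIGHT)

    def total_weight(self, source, nodemask):
        """Sum of the per-node weights over every node in `nodemask`."""
        return sum(self.node_weight(source, node) for node in nodemask)


# Hypothetical topology: nodes 0 and 1 are DDR (tier4), node 2 is CXL (tier22).
tw = TierWeights({0: 4, 1: 4, 2: 22})
for tier, text in [(4, "0:85"), (4, "1:65"), (22, "0:15"), (22, "1:10")]:
    tw.store(tier, text)

print(tw.node_weight(0, 2), tw.node_weight(1, 2))
print(tw.total_weight(0, [0, 2]), tw.total_weight(1, [1, 2]))
```

With the weights from the example above, the same CXL node looks
heavier from node 0 than from node 1, which is the point of making the
weight relative to the requesting node.
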
These functions are used by the following functions in mempolicy:
* interleave_nodes
* offset_il_nodes
* alloc_pages_bulk_array_interleave

The weight values are used to determine how many pages should be
allocated per-node as interleave rounds occur.

To avoid holding the memtier semaphore for long periods of time
(e.g. during the calls that actually allocate pages), there is
a small race condition during bulk allocation between calculating
the total weight of a node mask and fetching each individual
node weight - but this is managed by simply detecting the over/under
allocation conditions and handling them accordingly.

~Gregory

=== original RFC ====

From: Ravi Shankar

Hello,

The current interleave policy operates by interleaving page requests
among nodes defined in the memory policy. To accommodate the
introduction of memory tiers for various memory types (e.g., DDR, CXL,
HBM, PMEM, etc.), a mechanism is needed for interleaving page requests
across these memory types or tiers.

This can be achieved by implementing an interleaving method that
considers the tier weights.
The tier weight will determine the proportion of nodes to select from
those specified in the memory policy.
A tier weight can be assigned to each memory type within the system.

Hasan Al Maruf had put forth a proposal for interleaving between two
tiers, namely the top tier and the low tier. However, this patch was
not adopted due to constraints on the number of available tiers.

https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

New proposed changes:

1. Introduce a sysfs entry to allow setting the interleave weight for
   each memory tier.
2. Give each tier a default weight of 1, indicating a standard 1:1
   proportion.
3. Distribute the weight of that tier in a uniform manner across all
   nodes.
4. Modify the existing interleaving algorithm to support multi-tier
   interleaving based on tier weights.

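The effect of weights on interleave rounds can be modeled in a few
lines of userspace Python. This is only a sketch under stated
assumptions, not the algorithm in mm/mempolicy.c: it assumes each node
already has an integer weight and that a node receives that many
consecutive pages before the round moves on to the next node. The
names weighted_interleave and count_pages are invented for this
illustration.

```python
from itertools import islice


def weighted_interleave(node_weights):
    """Yield node ids forever; each node gets `weight` consecutive pages per round.

    `node_weights` maps node id -> positive integer weight (assumed model).
    """
    if not node_weights or any(w <= 0 for w in node_weights.values()):
        raise ValueError("every node needs a positive weight")
    while True:
        for node in sorted(node_weights):
            for _ in range(node_weights[node]):
                yield node


def count_pages(node_weights, nr_pages):
    """Count how many of `nr_pages` allocations land on each node."""
    counts = {node: 0 for node in node_weights}
    for node in islice(weighted_interleave(node_weights), nr_pages):
        counts[node] += 1
    return counts


# Two full rounds with an 85:15 split between node 0 and node 1.
print(count_pages({0: 85, 1: 15}, 200))
# Default weights of 1 reduce to plain 1:1 interleaving.
print(list(islice(weighted_interleave({0: 1, 1: 1}), 4)))
```

With all weights left at 1 the model degenerates to the traditional
1:1 round robin, which is why a default weight of 1 keeps existing
behavior unchanged.
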
This is in line with Huang, Ying's presentation at LPC22, slide 16 in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

Observed a significant increase (165%) in bandwidth utilization
with the newly proposed multi-tier interleaving compared to the
traditional 1:1 interleaving approach between DDR and CXL tier nodes,
where 85% of the bandwidth is allocated to the DDR tier and 15% to the
CXL tier, measured with the MLC -w2 option.

Usage Example:

1. Set weights for DDR (tier4) and CXL (tier22) tiers.

echo 85 > /sys/devices/virtual/memory_tiering/memory_tier4/interleave_weight
echo 15 > /sys/devices/virtual/memory_tiering/memory_tier22/interleave_weight

2. Interleave between DDR (tier4, node-0) and CXL (tier22, node-1) using numactl

numactl -i0,1 mlc --loaded_latency W2

Gregory Price (3):
  mm/memory-tiers: change mutex to rw semaphore
  mm/memory-tiers: Introduce sysfs for tier interleave weights
  mm/mempolicy: modify interleave mempolicy to use memtier weights

 include/linux/memory-tiers.h |  16 ++++
 include/linux/mempolicy.h    |   3 +
 mm/memory-tiers.c            | 179 +++++++++++++++++++++++++++++++----
 mm/mempolicy.c               | 148 +++++++++++++++++++++++------
 4 files changed, 297 insertions(+), 49 deletions(-)