From patchwork Wed Aug 18 06:31:07 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yu Zhao
X-Patchwork-Id: 12443399
Date: Wed, 18 Aug 2021 00:31:07 -0600
In-Reply-To: <20210818063107.2696454-1-yuzhao@google.com>
Message-Id: <20210818063107.2696454-12-yuzhao@google.com>
References: <20210818063107.2696454-1-yuzhao@google.com>
X-Mailer: git-send-email 2.33.0.rc1.237.g0d66db33f3-goog
Subject: [PATCH v4 11/11] mm: multigenerational lru: documentation
From: Yu Zhao
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Hillf Danton, page-reclaim@google.com, Yu Zhao, Konstantin Kharlamov

Add Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov
---
 Documentation/vm/index.rst        |   1 +
 Documentation/vm/multigen_lru.rst | 134 ++++++++++++++++++++++++++++++
 2 files changed, 135 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index eff5fbd492d0..c353b3f55924 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -17,6 +17,7 @@ various features of the Linux memory management
 
    swap_numa
    zswap
+   multigen_lru
 
 Kernel developers MM documentation
 ==================================
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..adedff5319d9
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,134 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigenerational LRU
+=====================
+
+Quick Start
+===========
+Build Configurations
+--------------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
+  default.
+
+Runtime Configurations
+----------------------
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
+  feature was not turned on by default.
+
+:Optional: Write ``N`` to ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to
+  protect the working set of ``N`` milliseconds. The OOM killer is
+  invoked if this working set cannot be kept in memory.
+
+:Optional: Read ``/sys/kernel/debug/lru_gen`` to confirm the feature
+  is turned on. This file has the following output:
+
+::
+
+  memcg memcg_id memcg_path
+    node node_id
+      min_gen birth_time anon_size file_size
+      ...
+      max_gen birth_time anon_size file_size
+
+``min_gen`` is the oldest generation number and ``max_gen`` is the
+youngest generation number. ``birth_time`` is in milliseconds.
+``anon_size`` and ``file_size`` are in pages.
+
+Phones/Laptops/Workstations
+---------------------------
+No additional configurations required.
+
+Servers/Data Centers
+--------------------
+:To support more generations: Change ``CONFIG_NR_LRU_GENS`` to a
+  larger number.
+
+:To support more tiers: Change ``CONFIG_TIERS_PER_GEN`` to a larger
+  number.
+
+:To support full stats: Set ``CONFIG_LRU_GEN_STATS=y``.
+
+:Working set estimation: Write ``+ memcg_id node_id max_gen
+  [swappiness]`` to ``/sys/kernel/debug/lru_gen`` to invoke the aging,
+  which scans PTEs for accessed pages and then creates the next
+  generation ``max_gen+1``. A swap file and a non-zero ``swappiness``,
+  which overrides ``vm.swappiness``, are required to scan PTEs mapping
+  anon pages.
+
+:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness]
+  [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
+  eviction, which evicts generations less than or equal to ``min_gen``.
+  ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
+  ``max_gen-1`` are not fully aged and therefore cannot be evicted.
+  ``nr_to_reclaim`` can be used to limit the number of pages to evict.
+  Multiple command lines are supported, as is concatenation with
+  delimiters ``,`` and ``;``.
+
+Framework
+=========
+For each ``lruvec``, evictable pages are divided into multiple
+generations. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[2]`` separately for anon and file types as clean
+file pages can be evicted regardless of swap and writeback
+constraints. These three variables are monotonically increasing.
+Generation numbers are truncated into
+``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into
+``page->flags``. The sliding window technique is used to prevent
+truncated generation numbers from overlapping. Each truncated
+generation number is an index to an array of per-type and per-zone
+lists ``lrugen->lists``.
+
+Each generation is then divided into multiple tiers. Tiers represent
+levels of usage from file descriptors only. Pages accessed ``N`` times
+via file descriptors belong to tier ``order_base_2(N)``. Each
+generation contains at most ``CONFIG_TIERS_PER_GEN`` tiers, and they
+require additional ``CONFIG_TIERS_PER_GEN-2`` bits in ``page->flags``.
+In contrast to moving across generations, which requires list
+operations, moving across tiers only involves operations on
+``page->flags`` and therefore has a negligible cost. A feedback loop
+modeled after the PID controller monitors refault rates of all tiers
+and decides when to protect pages from which tiers.
+
+The framework comprises two conceptually independent components: the
+aging and the eviction, which can be invoked separately from user
+space for the purpose of working set estimation and proactive reclaim.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, the aging
+traverses ``lruvec_memcg()->mm_list`` and calls ``walk_page_range()``
+to scan PTEs for accessed pages (an ``mm_struct`` list is maintained
+for each ``memcg``). Upon finding one, the aging updates its
+generation number to ``max_seq`` (modulo ``CONFIG_NR_LRU_GENS``).
+After each round of traversal, the aging increments ``max_seq``. The
+aging is due when both ``min_seq[2]`` have caught up with
+``max_seq-1``.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, the
+eviction scans pages on the per-zone lists indexed by anon and file
+``min_seq[2]`` (modulo ``CONFIG_NR_LRU_GENS``). It first tries to
+select a type based on the values of ``min_seq[2]``. If they are
+equal, it selects the type that has a lower refault rate. The eviction
+sorts a page according to its updated generation number if the aging
+has found this page accessed. It also moves a page to the next
+generation if this page is from an upper tier that has a higher
+refault rate than the base tier. The eviction increments
+``min_seq[2]`` of a selected type when it finds all the per-zone lists
+indexed by ``min_seq[2]`` of this selected type are empty.
+
+To-do List
+==========
+KVM Optimization
+----------------
+Support shadow page table walk.
+
+NUMA Optimization
+-----------------
+Optimize page table walk for NUMA.
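
The arithmetic and the debugfs command formats described in the new document can be sketched in userspace Python. This is only an illustration and not part of the patch: the helper names (``gen_bits``, ``tier_of``, ``gen_index``, ``aging_command``, ``eviction_command``) are hypothetical, the value ``7`` for ``CONFIG_NR_LRU_GENS`` is an assumed example, and ``order_base_2`` is re-implemented here for ``n >= 1`` rather than taken from the kernel. The sketch only builds command strings; it does not write to ``/sys/kernel/debug/lru_gen``.

```python
# Userspace sketch of the generation/tier arithmetic and the debugfs
# command strings described in multigen_lru.rst. All names are hypothetical.

NR_LRU_GENS = 7  # assumed example value for CONFIG_NR_LRU_GENS


def order_base_2(n):
    """Return ceil(log2(n)) for n >= 1 (userspace stand-in for the kernel macro)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return (n - 1).bit_length()


def gen_bits(nr_gens=NR_LRU_GENS):
    """Bits used in page->flags for a truncated generation number."""
    return order_base_2(nr_gens + 1)


def gen_index(seq, nr_gens=NR_LRU_GENS):
    """Truncate a monotonically increasing sequence number to a list index."""
    return seq % nr_gens


def tier_of(accesses):
    """Tier of a page accessed `accesses` times via file descriptors.

    The cap of CONFIG_TIERS_PER_GEN tiers per generation is not modeled.
    """
    return order_base_2(accesses)


def aging_command(memcg_id, node_id, max_gen, swappiness=None):
    """Build '+ memcg_id node_id max_gen [swappiness]'."""
    parts = ["+", memcg_id, node_id, max_gen]
    if swappiness is not None:
        parts.append(swappiness)
    return " ".join(str(p) for p in parts)


def eviction_command(memcg_id, node_id, min_gen, max_gen,
                     swappiness=None, nr_to_reclaim=None):
    """Build '- memcg_id node_id min_gen [swappiness] [nr_to_reclaim]'.

    max_gen is used only to check the documented constraint that
    min_gen should be less than max_gen-1.
    """
    if min_gen >= max_gen - 1:
        raise ValueError("min_gen should be less than max_gen-1")
    if nr_to_reclaim is not None and swappiness is None:
        # The optional fields are positional, so swappiness must come first.
        raise ValueError("nr_to_reclaim requires swappiness")
    parts = ["-", memcg_id, node_id, min_gen]
    if swappiness is not None:
        parts.append(swappiness)
    if nr_to_reclaim is not None:
        parts.append(nr_to_reclaim)
    return " ".join(str(p) for p in parts)


if __name__ == "__main__":
    print(gen_bits(), [tier_of(n) for n in range(1, 9)])
    print(aging_command(1, 0, 5, swappiness=50))
    print(eviction_command(1, 0, min_gen=2, max_gen=5,
                           swappiness=50, nr_to_reclaim=1000))
```

With the assumed value of 7 generations, ``gen_bits()`` gives 3, and pages accessed 1 through 8 times map to tiers 0, 1, 2, 2, 3, 3, 3, 3. The command builders reject a ``min_gen`` that is not below ``max_gen-1``, mirroring the constraint stated under "Proactive reclaim".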