From patchwork Thu Nov 11 04:15:10 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yu Zhao
X-Patchwork-Id: 12692256
Date: Wed, 10 Nov 2021 21:15:10 -0700
In-Reply-To: <20211111041510.402534-1-yuzhao@google.com>
Message-Id: <20211111041510.402534-11-yuzhao@google.com>
References: <20211111041510.402534-1-yuzhao@google.com>
Subject: [PATCH v5 10/10] mm: multigenerational lru: documentation
From: Yu Zhao
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 x86@kernel.org, page-reclaim@google.com, holger@applied-asynchrony.com,
 iam@valdikss.org.ru, Yu Zhao, Konstantin Kharlamov

Add
Documentation/vm/multigen_lru.rst.

Signed-off-by: Yu Zhao
Tested-by: Konstantin Kharlamov
---
 Documentation/vm/index.rst        |   1 +
 Documentation/vm/multigen_lru.rst | 132 ++++++++++++++++++++++++++++++
 2 files changed, 133 insertions(+)
 create mode 100644 Documentation/vm/multigen_lru.rst

diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index b51f0d8992f8..779772a025a0 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -17,6 +17,7 @@ various features of the Linux memory management
 
    swap_numa
    zswap
+   multigen_lru
 
 Kernel developers MM documentation
 ==================================
diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst
new file mode 100644
index 000000000000..7c064a378b85
--- /dev/null
+++ b/Documentation/vm/multigen_lru.rst
@@ -0,0 +1,132 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigenerational LRU
+=====================
+
+Quick Start
+===========
+Build Configurations
+--------------------
+:Required: Set ``CONFIG_LRU_GEN=y``.
+
+:Optional: Set ``CONFIG_LRU_GEN_ENABLED=y`` to turn the feature on by
+  default.
+
+Runtime Configurations
+----------------------
+:Required: Write ``1`` to ``/sys/kernel/mm/lru_gen/enable`` if the
+  feature was not turned on by default.
+
+:Optional: Write ``N`` to ``/sys/kernel/mm/lru_gen/min_ttl_ms`` to
+  protect the working set of ``N`` milliseconds. The OOM killer is
+  invoked if this working set cannot be kept in memory.
+
+:Optional: Read ``/sys/kernel/debug/lru_gen`` to confirm the feature
+  is turned on. This file has the following output:
+
+::
+
+  memcg memcg_id memcg_path
+    node node_id
+      min_gen birth_time anon_size file_size
+      ...
+      max_gen birth_time anon_size file_size
+
+``min_gen`` is the oldest generation number and ``max_gen`` is the
+youngest generation number. ``birth_time`` is in milliseconds.
+``anon_size`` and ``file_size`` are in pages.
+
+Phones/Laptops/Workstations
+---------------------------
+No additional configurations required.
+
+Servers/Data Centers
+--------------------
+:To support more generations: Change ``CONFIG_NR_LRU_GENS`` to a
+  larger number.
+
+:To support more tiers: Change ``CONFIG_TIERS_PER_GEN`` to a larger
+  number.
+
+:To support full stats: Set ``CONFIG_LRU_GEN_STATS=y``.
+
+:Working set estimation: Write ``+ memcg_id node_id max_gen
+  [swappiness] [use_bloom_filter]`` to ``/sys/kernel/debug/lru_gen`` to
+  invoke the aging, which scans PTEs for accessed pages and then
+  creates the next generation ``max_gen+1``. A swap file and a non-zero
+  ``swappiness``, which overrides ``vm.swappiness``, are required to
+  scan PTEs mapping anon pages. Set ``use_bloom_filter`` to 0 to
+  override the default behavior which only scans PTE tables found
+  populated.
+
+:Proactive reclaim: Write ``- memcg_id node_id min_gen [swappiness]
+  [nr_to_reclaim]`` to ``/sys/kernel/debug/lru_gen`` to invoke the
+  eviction, which evicts generations less than or equal to ``min_gen``.
+  ``min_gen`` should be less than ``max_gen-1`` as ``max_gen`` and
+  ``max_gen-1`` are not fully aged and therefore cannot be evicted.
+  Use ``nr_to_reclaim`` to limit the number of pages to evict. Multiple
+  command lines are supported, as is concatenation with delimiters
+  ``,`` and ``;``.
+
+Framework
+=========
+For each ``lruvec``, evictable pages are divided into multiple
+generations. The youngest generation number is stored in
+``lrugen->max_seq`` for both anon and file types as they are aged on
+an equal footing. The oldest generation numbers are stored in
+``lrugen->min_seq[]`` separately for anon and file types as clean
+file pages can be evicted regardless of swap and writeback
+constraints. These three variables are monotonically increasing.
+Generation numbers are truncated into
+``order_base_2(CONFIG_NR_LRU_GENS+1)`` bits in order to fit into
+``page->flags``. The sliding window technique is used to prevent
+truncated generation numbers from overlapping. Each truncated
+generation number is an index to an array of per-type and per-zone
+lists ``lrugen->lists``.
+
+Each generation is divided into multiple tiers. Tiers represent
+different ranges of numbers of accesses from file descriptors only.
+Pages accessed ``N`` times via file descriptors belong to tier
+``order_base_2(N)``. Each generation contains at most
+``CONFIG_TIERS_PER_GEN`` tiers, and they require additional
+``CONFIG_TIERS_PER_GEN-2`` bits in ``page->flags``. In contrast to
+moving between generations which requires list operations, moving
+between tiers only involves operations on ``page->flags`` and
+therefore has a negligible cost. A feedback loop modeled after the PID
+controller monitors refaulted % across all tiers and decides when to
+protect pages from which tiers.
+
+The framework comprises two conceptually independent components: the
+aging and the eviction, which can be invoked separately from user
+space for the purpose of working set estimation and proactive reclaim.
+
+Aging
+-----
+The aging produces young generations. Given an ``lruvec``, the aging
+traverses ``lruvec_memcg()->mm_list`` and calls ``walk_page_range()``
+to scan PTEs for accessed pages (a ``mm_struct`` list is maintained
+for each ``memcg``). Upon finding one, the aging updates its
+generation number to ``max_seq`` (modulo ``CONFIG_NR_LRU_GENS``).
+After each round of traversal, the aging increments ``max_seq``. The
+aging is due when ``min_seq[]`` reaches ``max_seq-1``.
+
+Eviction
+--------
+The eviction consumes old generations. Given an ``lruvec``, the
+eviction scans pages on the per-zone lists indexed by anon and file
+``min_seq[]`` (modulo ``CONFIG_NR_LRU_GENS``). It first tries to
+select a type based on the values of ``min_seq[]``. If they are
+equal, it selects the type that has a lower refaulted %. The eviction
+sorts a page according to its updated generation number if the aging
+has found this page accessed. It also moves a page to the next
+generation if this page is from an upper tier that has a higher
+refaulted % than the base tier. The eviction increments ``min_seq[]``
+of a selected type when it finds all the per-zone lists indexed by
+``min_seq[]`` of this selected type are empty.
+
+To-do List
+==========
+KVM Optimization
+----------------
+Support shadow page table walk.
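
The arithmetic in the Framework section of the new document can be modeled in a few lines. The sketch below is a user-space Python illustration, not kernel code. The values chosen for ``NR_LRU_GENS`` and ``TIERS_PER_GEN`` are assumptions, all function names are hypothetical, and two details are readings of the text rather than statements from it: clamping the tier at ``TIERS_PER_GEN-1``, and bounding the sliding window to ``NR_LRU_GENS`` generations.

```python
"""Illustrative model of the generation and tier arithmetic; not kernel code."""
from typing import List

NR_LRU_GENS = 4    # assumed value for CONFIG_NR_LRU_GENS
TIERS_PER_GEN = 4  # assumed value for CONFIG_TIERS_PER_GEN


def order_base_2(n: int) -> int:
    """Return ceil(log2(n)) for n > 1 and 0 otherwise, like the kernel macro."""
    return (n - 1).bit_length() if n > 1 else 0


def gen_bits() -> int:
    """Bits needed in page->flags to hold a truncated generation number."""
    return order_base_2(NR_LRU_GENS + 1)


def list_index(seq: int) -> int:
    """Truncate a monotonically increasing generation number into a list index."""
    return seq % NR_LRU_GENS


def tier_of(accesses: int) -> int:
    """Tier of a page accessed `accesses` times via file descriptors.

    The clamp to TIERS_PER_GEN - 1 is an assumption made for this sketch.
    """
    return min(order_base_2(accesses), TIERS_PER_GEN - 1)


def window_indices(min_seq: int, max_seq: int) -> List[int]:
    """List indices used by generations min_seq..max_seq inclusive.

    Assumes the sliding window never spans more than NR_LRU_GENS
    generations, which is what keeps truncated numbers from colliding.
    """
    if max_seq - min_seq >= NR_LRU_GENS:
        raise ValueError("window wider than NR_LRU_GENS; indices would collide")
    return [list_index(seq) for seq in range(min_seq, max_seq + 1)]
```

With these assumed values, a page read three times through a file descriptor lands in tier ``order_base_2(3) = 2``, and generations 5 through 8 occupy four distinct list slots even though the index wraps around.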
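
The two debugfs commands described under Servers/Data Centers are plain space-separated strings, so they are easy to assemble from a script. The following Python sketch builds them. The helper names are hypothetical, the IDs in the usage note are made up, and treating the bracketed fields as strictly positional (a later optional field needs the earlier one) is an assumption drawn from the command syntax shown in the document. Writing to the file needs root and a kernel built with ``CONFIG_LRU_GEN``, so ``write_cmds`` is shown but not exercised here.

```python
"""Build command strings for /sys/kernel/debug/lru_gen; helper names are hypothetical."""
from typing import Iterable, List, Optional

LRU_GEN_DEBUGFS = "/sys/kernel/debug/lru_gen"  # path as given in the documentation


def _build(op: str, required: Iterable[int], optional: Iterable[Optional[int]]) -> str:
    """Join the operator, required fields, and any leading optional fields."""
    fields: List[str] = [op] + [str(value) for value in required]
    seen_gap = False
    for value in optional:
        if value is None:
            seen_gap = True
            continue
        if seen_gap:
            # Assumption: optional fields are positional, so none may be skipped.
            raise ValueError("optional fields are positional; fill earlier ones first")
        fields.append(str(int(value)))
    return " ".join(fields)


def aging_cmd(memcg_id: int, node_id: int, max_gen: int,
              swappiness: Optional[int] = None,
              use_bloom_filter: Optional[bool] = None) -> str:
    """'+ memcg_id node_id max_gen [swappiness] [use_bloom_filter]'"""
    return _build("+", (memcg_id, node_id, max_gen), (swappiness, use_bloom_filter))


def evict_cmd(memcg_id: int, node_id: int, min_gen: int,
              swappiness: Optional[int] = None,
              nr_to_reclaim: Optional[int] = None) -> str:
    """'- memcg_id node_id min_gen [swappiness] [nr_to_reclaim]'"""
    return _build("-", (memcg_id, node_id, min_gen), (swappiness, nr_to_reclaim))


def join_cmds(cmds: Iterable[str], delimiter: str = ",") -> str:
    """Concatenate commands with one of the two documented delimiters."""
    if delimiter not in (",", ";"):
        raise ValueError("delimiter must be ',' or ';'")
    return delimiter.join(cmds)


def write_cmds(cmds: Iterable[str], path: str = LRU_GEN_DEBUGFS) -> None:
    """Write the commands to debugfs (requires root and a mounted debugfs)."""
    with open(path, "w") as f:
        f.write(join_cmds(cmds))
```

For example, ``aging_cmd(1, 0, 7, swappiness=50, use_bloom_filter=False)`` produces ``+ 1 0 7 50 0``, which asks for a full PTE scan of memcg 1 on node 0 without the Bloom filter shortcut. The memcg and generation numbers would normally be read from ``/sys/kernel/debug/lru_gen`` first.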