From patchwork Mon Aug 15 07:13:33 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12943201 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 504D3C25B0D for ; Mon, 15 Aug 2022 07:14:41 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 90BEB6B007B; Mon, 15 Aug 2022 03:14:35 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 8CC096B0089; Mon, 15 Aug 2022 03:14:35 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6BF5D8D0001; Mon, 15 Aug 2022 03:14:35 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 57A006B007B for ; Mon, 15 Aug 2022 03:14:35 -0400 (EDT) Received: from smtpin31.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 2E2B0140542 for ; Mon, 15 Aug 2022 07:14:35 +0000 (UTC) X-FDA: 79800964110.31.CA37FE8 Received: from mail-il1-f201.google.com (mail-il1-f201.google.com [209.85.166.201]) by imf12.hostedemail.com (Postfix) with ESMTP id C3A0440087 for ; Mon, 15 Aug 2022 07:14:34 +0000 (UTC) Received: by mail-il1-f201.google.com with SMTP id j5-20020a056e02218500b002de1cf2347bso4479519ila.2 for ; Mon, 15 Aug 2022 00:14:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:from:to:cc; bh=BAVTVhb7gZgl1iJYc7ztfS+MVBtiYqa44fDrm6wk918=; b=E/KUjz9qYl5rnMdSSLuhyWa6ezzUzs5F747sC2n3ZLk4NZ1K4I+2aqX8tyiv6Y2O34 niUFKfAw3PWXxuhjO6t4PtT8rhIc0c4LX6k+zTfHtLRplp/gVH7J9aZHIFkHQScdZYJD 9i1WEeqO5cHoWdL6vv0kHzonp3t9GjsDDoOBZoK53o9m6gvHprVfq9CjGSPyj95kq/Fa 8QSVAhDBMeMgn8if1R9Gjjcr4+4DCvgXuJaUoaoPR6OZfH6+wISYzRwhpryA0dy+7Yj8 mm5s6dORoIWXYBYsvf81TqsfRU0l388ncxBkC85oBu12oVxHx3dYMd08CHvJKOSh53uT 1EbA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:cc:to:from:subject:references :mime-version:message-id:in-reply-to:date:x-gm-message-state:from:to :cc; bh=BAVTVhb7gZgl1iJYc7ztfS+MVBtiYqa44fDrm6wk918=; b=w4II+2tYpZkZEgbPS3SJfUxe81rutcPeWV9kQBRRpO3HWt0V/lKGoBKfg9ST0S80sp gGnFai68BbVl6eejtgNPixxfKsKLyXVZNqROW9fMgCa2zUg+i3RbxDQf24BddkU8Irs2 LQ07rG9jBVJsjTD72Cd+WNIKW6VSaGjC179VKn9VSuCT7TUtAuasDKtwxSaXfVytO02U kIxAYt7izSVABaGG2hIb1agHjFdHvmPVLQ+VttEJz+7+IX0GBaF3GJ/wrDNu6alTJOw8 xXqxXmjTavh1uLlb+sHpitvCHdw+R5aRK+GdJxdlKFXnNt/aDizTWk0RVx0o/OBQYDLo AizA== X-Gm-Message-State: ACgBeo1aXZ/zCzyxyV6IIShtmZ9DoiXHBnO9phYzOW54AL7zPlkXPZTN MafH7VLxQSRyCsMh8LBJMLXw8uVOfN8= X-Google-Smtp-Source: AA6agR45YmGv57iGGSwbRHn5bhHV9M7sMkBfSANjxbeOHlLP3SVMZ8GvTTIVy18fahMG/UrA1b5pCZAIRtg= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:d91:5887:ac93:ddf0]) (user=yuzhao job=sendgmr) by 2002:a92:c544:0:b0:2e5:cfa1:286a with SMTP id a4-20020a92c544000000b002e5cfa1286amr848343ilj.296.1660547674147; Mon, 15 Aug 2022 00:14:34 -0700 (PDT) Date: Mon, 15 Aug 2022 01:13:33 -0600 In-Reply-To: <20220815071332.627393-1-yuzhao@google.com> Message-Id: <20220815071332.627393-15-yuzhao@google.com> Mime-Version: 1.0 References: <20220815071332.627393-1-yuzhao@google.com> X-Mailer: git-send-email 2.37.1.595.g718a3a8f04-goog Subject: [PATCH v14 14/14] mm: multi-gen LRU: design doc From: Yu Zhao To: Andrew Morton Cc: Andi Kleen , Aneesh Kumar , Catalin Marinas , Dave Hansen , Hillf Danton , Jens Axboe , Johannes Weiner , Jonathan Corbet , Linus Torvalds , Matthew Wilcox , Mel Gorman , Michael Larabel , Michal Hocko , Mike Rapoport , Peter Zijlstra , Tejun Heo , Vlastimil Babka , Will Deacon , linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, x86@kernel.org, page-reclaim@google.com, Yu Zhao , Brian Geffon , Jan Alexander Steffens , Oleksandr Natalenko , Steven Barrett , Suleiman Souhlal , Daniel Byrne , Donald Carr , " =?utf-8?q?Holger_Hoffst=C3=A4tte?= " , Konstantin Kharlamov , Shuang Zhai , Sofia Trinh , Vaibhav Jain ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1660547674; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=BAVTVhb7gZgl1iJYc7ztfS+MVBtiYqa44fDrm6wk918=; b=VKjBamhWBmMvvvOncuoZ06oINiLs7JuObMLaB1+3Q3GkhPZKZC0h0YHlLxrsvKqmJA/23L 2mHxF+2sCA1T3n5iEcd9anrNw6Z7eIwe2IFQAis3mkv6DuaePM9bU3qqizC0TYvcIRLzPz q17hWJar3mFW06RmNPBxWgeu4Dpb+j4= ARC-Authentication-Results: i=1; imf12.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b="E/KUjz9q"; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf12.hostedemail.com: domain of 3WvL5YgYKCEU516ohvnvvnsl.jvtspu14-ttr2hjr.vyn@flex--yuzhao.bounces.google.com designates 209.85.166.201 as permitted sender) smtp.mailfrom=3WvL5YgYKCEU516ohvnvvnsl.jvtspu14-ttr2hjr.vyn@flex--yuzhao.bounces.google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1660547674; a=rsa-sha256; cv=none; b=YjLWXl9SS3klSEdXy9NCee++zko5oNmUWJ0HbcBozRn9iGwVyNOgyQMva39iZpFVzYT9Lx FwsJmepEeFNTPdiorRDQFjMNy2aYEbp5+Z6cVd03zJWXQFZCz5TnDfnNyhOkd8ZKOidGQd S6WSB8E3ULz482BcKTE7eZ9JMNqXci4= X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: C3A0440087 X-Rspam-User: X-Stat-Signature: gqotsqhci4qn31i1mw4juo1ge553ahtm Authentication-Results: imf12.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b="E/KUjz9q"; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf12.hostedemail.com: domain of 3WvL5YgYKCEU516ohvnvvnsl.jvtspu14-ttr2hjr.vyn@flex--yuzhao.bounces.google.com designates 209.85.166.201 as permitted sender) smtp.mailfrom=3WvL5YgYKCEU516ohvnvvnsl.jvtspu14-ttr2hjr.vyn@flex--yuzhao.bounces.google.com X-HE-Tag: 1660547674-399705 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Add a design doc. Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain Reviewed-by: Bagas Sanjaya --- Documentation/mm/index.rst | 1 + Documentation/mm/multigen_lru.rst | 159 ++++++++++++++++++++++++++++++ 2 files changed, 160 insertions(+) create mode 100644 Documentation/mm/multigen_lru.rst diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst index 575ccd40e30c..4aa12b8be278 100644 --- a/Documentation/mm/index.rst +++ b/Documentation/mm/index.rst @@ -51,6 +51,7 @@ above structured documentation, or deleted if it has served its purpose. ksm memory-model mmu_notifier + multigen_lru numa overcommit-accounting page_migration diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst new file mode 100644 index 000000000000..d7062c6a8946 --- /dev/null +++ b/Documentation/mm/multigen_lru.rst @@ -0,0 +1,159 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. It directly impacts the kswapd CPU usage and RAM efficiency. + +Design overview +=============== +Objectives +---------- +The design objectives are: + +* Good representation of access recency +* Try to profit from spatial locality +* Fast paths to make obvious choices +* Simple self-correcting heuristics + +The representation of access recency is at the core of all LRU +implementations. In the multi-gen LRU, each generation represents a +group of pages with similar access recency. Generations establish a +(time-based) common frame of reference and therefore help make better +choices, e.g., between different memcgs on a computer or different +computers in a data center (for job scheduling). + +Exploiting spatial locality improves efficiency when gathering the +accessed bit. A rmap walk targets a single page and does not try to +profit from discovering a young PTE. A page table walk can sweep all +the young PTEs in an address space, but the address space can be too +sparse to make a profit. The key is to optimize both methods and use +them in combination. + +Fast paths reduce code complexity and runtime overhead. Unmapped pages +do not require TLB flushes; clean pages do not require writeback. +These facts are only helpful when other conditions, e.g., access +recency, are similar. With generations as a common frame of reference, +additional factors stand out. But obvious choices might not be good +choices; thus self-correction is necessary. + +The benefits of simple self-correcting heuristics are self-evident. +Again, with generations as a common frame of reference, this becomes +attainable. Specifically, pages in the same generation can be +categorized based on additional factors, and a feedback loop can +statistically compare the refault percentages across those categories +and infer which of them are better choices. + +Assumptions +----------- +The protection of hot pages and the selection of cold pages are based +on page access channels and patterns. There are two access channels: + +* Accesses through page tables +* Accesses through file descriptors + +The protection of the former channel is by design stronger because: + +1. The uncertainty in determining the access patterns of the former + channel is higher due to the approximation of the accessed bit. +2. The cost of evicting the former channel is higher due to the TLB + flushes required and the likelihood of encountering the dirty bit. +3. The penalty of underprotecting the former channel is higher because + applications usually do not prepare themselves for major page + faults like they do for blocked I/O. E.g., GUI applications + commonly use dedicated I/O threads to avoid blocking rendering + threads. + +There are also two access patterns: + +* Accesses exhibiting temporal locality +* Accesses not exhibiting temporal locality + +For the reasons listed above, the former channel is assumed to follow +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is +present, and the latter channel is assumed to follow the latter +pattern unless outlying refaults have been observed. + +Workflow overview +================= +Evictable pages are divided into multiple generations for each +``lruvec``. The youngest generation number is stored in +``lrugen->max_seq`` for both anon and file types as they are aged on +an equal footing. The oldest generation numbers are stored in +``lrugen->min_seq[]`` separately for anon and file types as clean file +pages can be evicted regardless of swap constraints. These three +variables are monotonically increasing. + +Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` +bits in order to fit into the gen counter in ``folio->flags``. Each +truncated generation number is an index to ``lrugen->lists[]``. The +sliding window technique is used to track at least ``MIN_NR_GENS`` and +at most ``MAX_NR_GENS`` generations. The gen counter stores a value +within ``[1, MAX_NR_GENS]`` while a page is on one of +``lrugen->lists[]``; otherwise it stores zero. + +Each generation is divided into multiple tiers. A page accessed ``N`` +times through file descriptors is in tier ``order_base_2(N)``. Unlike +generations, tiers do not have dedicated ``lrugen->lists[]``. In +contrast to moving across generations, which requires the LRU lock, +moving across tiers only involves atomic operations on +``folio->flags`` and therefore has a negligible cost. A feedback loop +modeled after the PID controller monitors refaults over all the tiers +from anon and file types and decides which tiers from which types to +evict or protect. + +There are two conceptually independent procedures: the aging and the +eviction. They form a closed-loop system, i.e., the page reclaim. + +Aging +----- +The aging produces young generations. Given an ``lruvec``, it +increments ``max_seq`` when ``max_seq-min_seq+1`` approaches +``MIN_NR_GENS``. The aging promotes hot pages to the youngest +generation when it finds them accessed through page tables; the +demotion of cold pages happens consequently when it increments +``max_seq``. The aging uses page table walks and rmap walks to find +young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` +and calls ``walk_page_range()`` with each ``mm_struct`` on this list +to scan PTEs, and after each iteration, it increments ``max_seq``. For +the latter, when the eviction walks the rmap and finds a young PTE, +the aging scans the adjacent PTEs. For both, on finding a young PTE, +the aging clears the accessed bit and updates the gen counter of the +page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. + +Eviction +-------- +The eviction consumes old generations. Given an ``lruvec``, it +increments ``min_seq`` when ``lrugen->lists[]`` indexed by +``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to +evict from, it first compares ``min_seq[]`` to select the older type. +If both types are equally old, it selects the one whose first tier has +a lower refault percentage. The first tier contains single-use +unmapped clean pages, which are the best bet. The eviction sorts a +page according to its gen counter if the aging has found this page +accessed through page tables and updated its gen counter. It also +moves a page to the next generation, i.e., ``min_seq+1``, if this page +was accessed multiple times through file descriptors and the feedback +loop has detected outlying refaults from the tier this page is in. To +this end, the feedback loop uses the first tier as the baseline, for +the reason stated earlier. + +Summary +------- +The multi-gen LRU can be disassembled into the following parts: + +* Generations +* Rmap walks +* Page table walks +* Bloom filters +* PID controller + +The aging and the eviction form a producer-consumer model; +specifically, the latter drives the former by the sliding window over +generations. Within the aging, rmap walks drive page table walks by +inserting hot densely populated page tables to the Bloom filters. +Within the eviction, the PID controller uses refaults as the feedback +to select types to evict and tiers to protect.