From patchwork Thu Apr 7 03:15:26 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Yu Zhao X-Patchwork-Id: 12804412 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 08F9AC433F5 for ; Thu, 7 Apr 2022 03:35:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:Cc:To:From:Subject:References: Mime-Version:Message-Id:In-Reply-To:Date:Reply-To:Content-ID: Content-Description:Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc :Resent-Message-ID:List-Owner; bh=80AH7pkfySYDMFOt6mq7XB1A8Vzb0jPcDmMNghawRHs=; b=l3050b+Zo2GYOITPpQxLgzk45F 39wwiNyb2CfMz7stRkKVllyGLi+gqINeAV7pvjkrbNo/snNoRGsYLIL6Rp/qq2LksIw1kIt3cKcgV iaqPjg2tcipooYH8dRPBte+abRXcXrnYUxxSFbl/Te2o4qZYOQZgZdxN7moq1PC7tZ8xFB1cMQu5r zdIzl0GPPFOopNNQecKogEyhSJKR+G/azyCyDHh/TQ5ocN73z7oQMb50wlXmawtlYFCgC4NqN20Oc 2yTdro1hkqV6XVwD70d52VUCh3QScwWPI1Xal6qnVX3KbaO0b0WtrwuaUGsrzta8PqKIVjGYaCZh9 s6DUHy9A==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.94.2 #2 (Red Hat Linux)) id 1ncItz-0097tt-QO; Thu, 07 Apr 2022 03:33:44 +0000 Received: from desiato.infradead.org ([90.155.92.199]) by bombadil.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ncIdF-0091ka-QI for linux-arm-kernel@bombadil.infradead.org; Thu, 07 Apr 2022 03:16:25 +0000 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Content-Transfer-Encoding:Content-Type :Cc:To:From:Subject:References:Mime-Version:Message-Id:In-Reply-To:Date: Sender:Reply-To:Content-ID:Content-Description; bh=INqxOeCwPOdr678cFak6a27+vB9QK/e2l4Y4r9T0uLg=; b=BP1rs2duW1zCgQmONAsSgkUBbH BXwJWJkNoedJPon+wp7t9ByFTLjQxjxIpMWFZE1fd9jVt0/sROt2psqLjP8yWtYGMotDAVeXfJGCa aQOaec2SeaAYueOkGh3nPSOvHQ7c5RWBS1EPK/lLCYY0kU1EwXJf1c3FPVWCd8XjXyTtyYEfOHUwx tLb0LIsdvWClQOpNbuiMfEn2ZyRDV0M0qVua4lyHTehQQQOS//INue5lWQ4Uod9MOUnMqDlabvYZ3 7jYXT3PyDbQhfaoJ4rxBIKeDgAf/52miOn5H9bEgSFCkzIUnrpdzN9GaJ89oqoLxmYEoZuRS72aJg IVE2eZAg==; Received: from mail-yb1-xb49.google.com ([2607:f8b0:4864:20::b49]) by desiato.infradead.org with esmtps (Exim 4.94.2 #2 (Red Hat Linux)) id 1ncIdB-002Ml0-7O for linux-arm-kernel@lists.infradead.org; Thu, 07 Apr 2022 03:16:23 +0000 Received: by mail-yb1-xb49.google.com with SMTP id i5-20020a258b05000000b006347131d40bso3242380ybl.17 for ; Wed, 06 Apr 2022 20:16:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=date:in-reply-to:message-id:mime-version:references:subject:from:to :cc:content-transfer-encoding; bh=INqxOeCwPOdr678cFak6a27+vB9QK/e2l4Y4r9T0uLg=; b=ODNHvhfbQNp1hGmnnOr6mN+IOozaZYCUSEL7jsCQwFfy9Y57LO+dF04JBreR5O/EAL 0EToC/6xhmDArCjDA2MVpEvc0q2SwZKkIj6H9qZnVGgfHQDsEI7Jkt+jOWDqY3J5opmp A8HKDyQg4nPmznGUuV0t/FjiZX9wE7KrVbicBgCxC2wXyt3TWmEcHsXzAYo5jLRh2IgE 5MNd0Ormke1FPOcYnaIqvJMqYnhoYd8EaewOhDMazEwaMUG2ECHvR5sDIPFWpdG0sQD4 K1nZ7iiWr2Q/rJ0VBcdKV8I1dEsKv+t0h0aVzotL5FeW9jtdw35b/AfhN2BgTuDNsaUp IuHQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:in-reply-to:message-id:mime-version :references:subject:from:to:cc:content-transfer-encoding; bh=INqxOeCwPOdr678cFak6a27+vB9QK/e2l4Y4r9T0uLg=; b=VP1dCruRUXlT/D8ENPsxyvVoUNZMdH9p5+t0OMT/rdN0quHT0vdh5bVwLh6wgq0A7A jfrQxMmiXoe0SRBxLOZKNROTuTWGmrBHt8tsdcBFp6ewOuwt27tcVlRlqfxy49t8Mh0f 2pr8pUlwK1QzgwbAmLSiYW1t1EifiO40v97VNz8u5Mvfeb2LUFsW2qIcXnAPnZn4QWbF 9dkv3+f1VVlLy741tPSZQZfqKW/WtBAld3jrmhzdqmf6CftwluAhomhLJtZFVxajUmkn +XWMp1vYP0C+CtnU3TQ+NAWwTV0sdPUb/Zh55eU5QEbF9ENqOC1/UChRREqwv36Av0d9 CT8Q== X-Gm-Message-State: AOAM532o8Dq67+ZLjJgrA4LVbcoqYvJVN1dK7chuBZo2OqhKvflfprkG tQkXRrmX/bh9m8oqKC6uDNiOwTrk7rg= X-Google-Smtp-Source: ABdhPJxlCSRM8gFV9U5TDCtD//P7GysdYKmbU5E8rnZZbWqCsJLalYpW9Af/O0T9cdRapsV/6ft/7YwhlGQ= X-Received: from yuzhao.bld.corp.google.com ([2620:15c:183:200:9ea2:c755:ae22:6862]) (user=yuzhao job=sendgmr) by 2002:a81:84:0:b0:2eb:4932:f34e with SMTP id 126-20020a810084000000b002eb4932f34emr9823178ywa.278.1649301377495; Wed, 06 Apr 2022 20:16:17 -0700 (PDT) Date: Wed, 6 Apr 2022 21:15:26 -0600 In-Reply-To: <20220407031525.2368067-1-yuzhao@google.com> Message-Id: <20220407031525.2368067-15-yuzhao@google.com> Mime-Version: 1.0 References: <20220407031525.2368067-1-yuzhao@google.com> X-Mailer: git-send-email 2.35.1.1094.g7c7d902a7c-goog Subject: [PATCH v10 14/14] mm: multi-gen LRU: design doc From: Yu Zhao To: Stephen Rothwell , linux-mm@kvack.org Cc: Andi Kleen , Andrew Morton , Aneesh Kumar , Barry Song <21cnbao@gmail.com>, Catalin Marinas , Dave Hansen , Hillf Danton , Jens Axboe , Jesse Barnes , Johannes Weiner , Jonathan Corbet , Linus Torvalds , Matthew Wilcox , Mel Gorman , Michael Larabel , Michal Hocko , Mike Rapoport , Rik van Riel , Vlastimil Babka , Will Deacon , Ying Huang , linux-arm-kernel@lists.infradead.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, page-reclaim@google.com, x86@kernel.org, Yu Zhao , Brian Geffon , Jan Alexander Steffens , Oleksandr Natalenko , Steven Barrett , Suleiman Souhlal , Daniel Byrne , Donald Carr , " =?utf-8?q?Holger_Hoffst=C3=A4tte?= " , Konstantin Kharlamov , Shuang Zhai , Sofia Trinh , Vaibhav Jain X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20220407_041621_635978_7F2A2E36 X-CRM114-Status: GOOD ( 23.37 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org Add a design doc. Signed-off-by: Yu Zhao Acked-by: Brian Geffon Acked-by: Jan Alexander Steffens (heftig) Acked-by: Oleksandr Natalenko Acked-by: Steven Barrett Acked-by: Suleiman Souhlal Tested-by: Daniel Byrne Tested-by: Donald Carr Tested-by: Holger Hoffstätte Tested-by: Konstantin Kharlamov Tested-by: Shuang Zhai Tested-by: Sofia Trinh Tested-by: Vaibhav Jain --- Documentation/vm/index.rst | 1 + Documentation/vm/multigen_lru.rst | 160 ++++++++++++++++++++++++++++++ 2 files changed, 161 insertions(+) create mode 100644 Documentation/vm/multigen_lru.rst diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst index 44365c4574a3..b48434300226 100644 --- a/Documentation/vm/index.rst +++ b/Documentation/vm/index.rst @@ -25,6 +25,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the ksm memory-model mmu_notifier + multigen_lru numa overcommit-accounting page_migration diff --git a/Documentation/vm/multigen_lru.rst b/Documentation/vm/multigen_lru.rst new file mode 100644 index 000000000000..9b29b87e1435 --- /dev/null +++ b/Documentation/vm/multigen_lru.rst @@ -0,0 +1,160 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. It directly impacts the kswapd CPU usage and RAM efficiency. + +Design overview +=============== +Objectives +---------- +The design objectives are: + +* Good representation of access recency +* Try to profit from spatial locality +* Fast paths to make obvious choices +* Simple self-correcting heuristics + +The representation of access recency is at the core of all LRU +implementations. In the multi-gen LRU, each generation represents a +group of pages with similar access recency. Generations establish a +common frame of reference and therefore help make better choices, +e.g., between different memcgs on a computer or different computers in +a data center (for job scheduling). + +Exploiting spatial locality improves efficiency when gathering the +accessed bit. A rmap walk targets a single page and does not try to +profit from discovering a young PTE. A page table walk can sweep all +the young PTEs in an address space, but the address space can be too +large to make a profit. The key is to optimize both methods and use +them in combination. + +Fast paths reduce code complexity and runtime overhead. Unmapped pages +do not require TLB flushes; clean pages do not require writeback. +These facts are only helpful when other conditions, e.g., access +recency, are similar. With generations as a common frame of reference, +additional factors stand out. But obvious choices might not be good +choices; thus self-correction is required. + +The benefits of simple self-correcting heuristics are self-evident. +Again, with generations as a common frame of reference, this becomes +attainable. Specifically, pages in the same generation can be +categorized based on additional factors, and a feedback loop can +statistically compare the refault percentages across those categories +and infer which of them are better choices. + +Assumptions +----------- +The protection of hot pages and the selection of cold pages are based +on page access channels and patterns. There are two access channels: + +* Accesses through page tables +* Accesses through file descriptors + +The protection of the former channel is by design stronger because: + +1. The uncertainty in determining the access patterns of the former + channel is higher due to the approximation of the accessed bit. +2. The cost of evicting the former channel is higher due to the TLB + flushes required and the likelihood of encountering the dirty bit. +3. The penalty of underprotecting the former channel is higher because + applications usually do not prepare themselves for major page + faults like they do for blocked I/O. E.g., GUI applications + commonly use dedicated I/O threads to avoid blocking the rendering + threads. + +There are also two access patterns: + +* Accesses exhibiting temporal locality +* Accesses not exhibiting temporal locality + +For the reasons listed above, the former channel is assumed to follow +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is +present, and the latter channel is assumed to follow the latter +pattern unless outlying refaults have been observed. + +Workflow overview +================= +Evictable pages are divided into multiple generations for each +``lruvec``. The youngest generation number is stored in +``lrugen->max_seq`` for both anon and file types as they are aged on +an equal footing. The oldest generation numbers are stored in +``lrugen->min_seq[]`` separately for anon and file types as clean file +pages can be evicted regardless of swap constraints. These three +variables are monotonically increasing. + +Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` +bits in order to fit into the gen counter in ``folio->flags``. Each +truncated generation number is an index to ``lrugen->lists[]``. The +sliding window technique is used to track at least ``MIN_NR_GENS`` and +at most ``MAX_NR_GENS`` generations. The gen counter stores a value +within ``[1, MAX_NR_GENS]`` while a page is on one of +``lrugen->lists[]``; otherwise it stores zero. + +Each generation is divided into multiple tiers. Tiers represent +different ranges of numbers of accesses through file descriptors. A +page accessed ``N`` times through file descriptors is in tier +``order_base_2(N)``. In contrast to moving across generations, which +requires the LRU lock, moving across tiers only requires operations on +``folio->flags`` and therefore has a negligible cost. A feedback loop +modeled after the PID controller monitors refaults over all the tiers +from anon and file types and decides which tiers from which types to +evict or protect. + +There are two conceptually independent procedures: the aging and the +eviction. They form a closed-loop system, i.e., the page reclaim. + +Aging +----- +The aging produces young generations. Given an ``lruvec``, it +increments ``max_seq`` when ``max_seq-min_seq+1`` approaches +``MIN_NR_GENS``. The aging promotes hot pages to the youngest +generation when it finds them accessed through page tables; the +demotion of cold pages happens consequently when it increments +``max_seq``. The aging uses page table walks and rmap walks to find +young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` +and calls ``walk_page_range()`` with each ``mm_struct`` on this list +to scan PTEs. On finding a young PTE, it clears the accessed bit and +updates the gen counter of the page mapped by this PTE to +``(max_seq%MAX_NR_GENS)+1``. After each iteration of this list, it +increments ``max_seq``. For the latter, when the eviction walks the +rmap and finds a young PTE, the aging scans the adjacent PTEs and +follows the same steps just described. + +Eviction +-------- +The eviction consumes old generations. Given an ``lruvec``, it +increments ``min_seq`` when ``lrugen->lists[]`` indexed by +``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to +evict from, it first compares ``min_seq[]`` to select the older type. +If both types are equally old, it selects the one whose first tier has +a lower refault percentage. The first tier contains single-use +unmapped clean pages, which are the best bet. The eviction sorts a +page according to the gen counter if the aging has found this page +accessed through page tables and updated the gen counter. It also +moves a page to the next generation, i.e., ``min_seq+1``, if this page +was accessed multiple times through file descriptors and the feedback +loop has detected outlying refaults from the tier this page is in. To +do this, the feedback loop uses the first tier as the baseline, for +the reason stated earlier. + +Summary +------- +The multi-gen LRU can be disassembled into the following parts: + +* Generations +* Page table walks +* Rmap walks +* Bloom filters +* The PID controller + +The aging and the eviction is a producer-consumer model; specifically, +the latter drives the former by the sliding window over generations. +Within the aging, rmap walks drive page table walks by inserting hot +densely populated page tables to the Bloom filters. Within the +eviction, the PID controller uses refaults as the feedback to select +types to evict and tiers to protect.