From patchwork Tue Sep 12 18:45:07 2023
X-Patchwork-Submitter: Kairui Song <ryncsn@gmail.com>
X-Patchwork-Id: 13382045
From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Yu Zhao, Roman Gushchin, Johannes Weiner,
 Michal Hocko, Hugh Dickins, Nhat Pham, Yuanchu Xie,
 Suren Baghdasaryan, "T.J. Mercier", linux-kernel@vger.kernel.org,
 Kairui Song
Subject: [RFC PATCH v2 1/5] workingset: simplify and use a more intuitive model
Date: Wed, 13 Sep 2023 02:45:07 +0800
Message-ID: <20230912184511.49333-2-ryncsn@gmail.com>
X-Mailer: git-send-email 2.41.0
In-Reply-To: <20230912184511.49333-1-ryncsn@gmail.com>
References: <20230912184511.49333-1-ryncsn@gmail.com>
Reply-To: Kairui Song <ryncsn@gmail.com>
MIME-Version: 1.0
From: Kairui Song <ryncsn@gmail.com>

This basically removes workingset_activation() and reduces the number
of calls to workingset_age_nonresident(). The idea behind this change
is a new way to calculate the refault distance, and it prepares for
adapting refault-distance-based re-activation for the multi-gen LRU.

Currently, refault distance re-activation is based on two assumptions:

1. Activation of an inactive page will left-shift LRU pages (considering
   the LRU starts from the right).
2. Eviction of an inactive page will left-shift LRU pages.

Assumption 2 is correct, but assumption 1 is not always true: an
activated page could be anywhere in the LRU list (through
mark_page_accessed()), so it only left-shifts the pages on its right.
Besides, one page can get activated/deactivated multiple times. The
multi-gen LRU also doesn't fit this model well: pages are aged and
activated constantly as the generation sliding window slides.

So instead we introduce a simpler idea here: just presume the evicted
pages are still in memory, and that each one has an eviction sequence
number like before. Let the `nonresident_age` still be NA and still
increase on each eviction, so we get a "shadow LRU" for one evicted
page:

  Let SP = ((NA's reading @ current) - (NA's reading @ eviction))

                              +-memory available to cache-+
                              |                           |
    +-------------------------+===============+===========+
    | *   shadows  O O  O     |   INACTIVE    |   ACTIVE  |
    +-+-----------------------+===============+===========+
      |                       |
      +-----------------------+
      |         SP
    fault page          O -> Hole left by previously faulted in pages
                        * -> The page corresponding to SP

It can easily be seen that SP stands for how far the current workflow
could push a page out of the available memory. Since every evicted page
was once the head of the INACTIVE list, the page could have such an
access distance:

  SP + NR_INACTIVE

It *may* get re-activated before getting evicted again if:

  SP + NR_INACTIVE < NR_INACTIVE + NR_ACTIVE

which can be simplified to:

  SP < NR_ACTIVE

Then the page is worth re-activating to start from the ACTIVE part,
since the access distance is shorter than the total memory needed to
make it stay.

The calculation is the same as before, only with assumption 1 above
dropped. And since this is merely an estimation based on several
hypotheses, and it could weaken the LRU's ability to distinguish a
workingset out of caches, throttle it by two factors:

1. Previously re-faulted pages may leave "holes" in the shadow part of
   the LRU. That part is left unhandled on purpose, to decrease the
   re-activation rate for pages that have a large SP value (the larger
   the SP value a page has, the more likely it will be affected by such
   holes).
2. When the ACTIVE part of the LRU is long enough, challenging ACTIVE
   pages by re-activating a previously INACTIVE page that faulted only
   once may not be a good idea, so throttle the re-activation when
   ACTIVE > INACTIVE by comparing against INACTIVE instead.

Combining all of the above: upon refault, mark the page as active if
either of the following conditions is met (a condensed sketch of the
rule follows after this list):

- If the ACTIVE LRU is low (NR_ACTIVE < NR_INACTIVE), check if:
  SP < NR_ACTIVE
- If the ACTIVE LRU is high (NR_ACTIVE >= NR_INACTIVE), check if:
  SP < NR_INACTIVE
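For illustration only, the whole rule condenses to a few lines of
standalone C. This is a sketch, not the patch's code: the helper name
should_reactivate() is made up here, and the real lru_refault() below
additionally splits the counts by anon/file and by swap availability:

  #include <assert.h>
  #include <stdbool.h>

  /* Condensed re-activation rule; all values are page counts. */
  static bool should_reactivate(unsigned long sp,
                                unsigned long nr_active,
                                unsigned long nr_inactive)
  {
          /* ACTIVE already large: challenge it less aggressively. */
          if (nr_active >= nr_inactive)
                  return sp < nr_inactive;
          /* ACTIVE small: activate anything that could have stayed. */
          return sp < nr_active;
  }

  int main(void)
  {
          assert(should_reactivate(150, 200, 1000));  /* SP < NR_ACTIVE */
          assert(!should_reactivate(500, 200, 1000)); /* pushed out too far */
          assert(!should_reactivate(700, 1000, 600)); /* throttled: SP >= NR_INACTIVE */
          return 0;
  }

For example, with NR_ACTIVE = 200 and NR_INACTIVE = 1000, a refault
with SP = 150 is activated (150 < 200) while one with SP = 500 is not;
once NR_ACTIVE grows past NR_INACTIVE, the bound switches to
NR_INACTIVE.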
The code is almost the same as before, but simpler, since it no longer
needs to do an lruvec statistics update when activating a page.

A few benchmarks showed similar or better results. When combined with
the multi-gen LRU (in later commits) it shows a measurable performance
gain for some workloads. The tests below use the memtier and fio tests
from commit ac35a4902374, scaled down to fit in my test environment,
plus some other tests:

memtier test (with 16G ramdisk as swap and 2G memcg limit on an i7-9700):

  memcached -u nobody -m 16384 -s /tmp/memcached.socket \
    -a 0700 -t 12 -B binary &
  memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys \
    --key-minimum=1 --key-maximum=24000000 --key-pattern=P:P -c 1 \
    -t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6

fio test 1 (with 16G ramdisk on 28G VM on an i7-9700):

  fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=random --norandommap \
    --time_based --ramp_time=5m --runtime=5m --group_reporting

fio test 2 (with 16G ramdisk on 28G VM on an i7-9700):

  fio -name=mglru --numjobs=10 --directory=/mnt --size=1536m \
    --buffered=1 --ioengine=io_uring --iodepth=128 \
    --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
    --rw=randread --random_distribution=zipf:1.2 --norandommap \
    --time_based --ramp_time=10m --runtime=5m --group_reporting

mysql (using oltp_read_only from sysbench, with 12G of buffer pool in a
10G memcg):

  sysbench /usr/share/sysbench/oltp_read_only.lua \
    --mysql-db=sb --tables=36 --table-size=2000000 --threads=12 --time=1800

Before (average of 6 test runs):
  fio:       IOPS=5213.7k
  fio2:      IOPS=7315.3k
  memcached: 49493.75 ops/s
  mysql:     6237.45 tps

After (average of 6 test runs):
  fio:       IOPS=5230.5k
  fio2:      IOPS=7349.3k
  memcached: 49912.79 ops/s
  mysql:     6240.62 tps

Signed-off-by: Kairui Song <ryncsn@gmail.com>
---
 include/linux/swap.h |   2 -
 mm/swap.c            |   1 -
 mm/vmscan.c          |   2 -
 mm/workingset.c      | 215 +++++++++++++++++++++----------------------
 4 files changed, 106 insertions(+), 114 deletions(-)
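One implementation note for reviewers: the refault distance computed in
lru_refault() below is an unsigned subtraction masked down to
EVICTION_BITS, so it stays accurate across nonresident_age wrap-around
in most cases. A minimal userspace sketch of that arithmetic (the 8-bit
width and the names here are made up for readability; the patch uses an
EVICTION_BITS-wide field):

  #include <assert.h>

  #define TS_BITS 8                       /* pretend timestamp width */
  #define TS_MASK ((1UL << TS_BITS) - 1)

  static unsigned long distance(unsigned long evict, unsigned long now)
  {
          /*
           * Unsigned subtraction wraps, so the result stays correct
           * even after `now` overflows the TS_BITS-wide field once.
           */
          return (now - evict) & TS_MASK;
  }

  int main(void)
  {
          assert(distance(10, 30) == 20);  /* no wrap */
          assert(distance(250, 4) == 10);  /* wrapped past 255 */
          return 0;
  }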
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 493487ed7c38..ca51d79842b7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -344,10 +344,8 @@ static inline swp_entry_t page_swap_entry(struct page *page)
 
 /* linux/mm/workingset.c */
 bool workingset_test_recent(void *shadow, bool file, bool *workingset);
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages);
 void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg);
 void workingset_refault(struct folio *folio, void *shadow);
-void workingset_activation(struct folio *folio);
 
 /* Only track the nodes of mappings with shadow entries */
 void workingset_update_node(struct xa_node *node);
diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..685b446fd4f9 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -482,7 +482,6 @@ void folio_mark_accessed(struct folio *folio)
 		else
 			__lru_cache_activate_folio(folio);
 		folio_clear_referenced(folio);
-		workingset_activation(folio);
 	}
 	if (folio_test_idle(folio))
 		folio_clear_idle(folio);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6f13394b112e..3f4de75e5186 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2539,8 +2539,6 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
 		lruvec_add_folio(lruvec, folio);
 		nr_pages = folio_nr_pages(folio);
 		nr_moved += nr_pages;
-		if (folio_test_active(folio))
-			workingset_age_nonresident(lruvec, nr_pages);
 	}
 
 	/*
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..babda11601ea 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -180,9 +180,10 @@
  */
 
 #define WORKINGSET_SHIFT 1
-#define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
+#define EVICTION_SHIFT	((BITS_PER_LONG - BITS_PER_XA_VALUE) +	\
 			 WORKINGSET_SHIFT + NODES_SHIFT + \
 			 MEM_CGROUP_ID_SHIFT)
+#define EVICTION_BITS	(BITS_PER_LONG - (EVICTION_SHIFT))
 #define EVICTION_MASK	(~0UL >> EVICTION_SHIFT)
 
 /*
@@ -226,8 +227,103 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
 	*workingsetp = workingset;
 }
 
-#ifdef CONFIG_LRU_GEN
+/*
+ * Get the distance reading at eviction time.
+ */
+static inline unsigned long lru_eviction(struct lruvec *lruvec,
+					 int bits, int bucket_order)
+{
+	unsigned long eviction = atomic_long_read(&lruvec->nonresident_age);
+
+	eviction >>= bucket_order;
+	eviction &= ~0UL >> (BITS_PER_LONG - bits);
+
+	return eviction;
+}
+
+/*
+ * Calculate and test refault distance
+ */
+static inline bool lru_refault(struct mem_cgroup *memcg,
+			       struct lruvec *lruvec,
+			       unsigned long eviction, bool file,
+			       int bits, int bucket_order)
+{
+	unsigned long refault, distance;
+	unsigned long workingset, active, inactive, inactive_file, inactive_anon = 0;
+
+	eviction <<= bucket_order;
+	refault = atomic_long_read(&lruvec->nonresident_age);
+
+	/*
+	 * The unsigned subtraction here gives an accurate distance
+	 * across nonresident_age overflows in most cases. There is a
+	 * special case: usually, shadow entries have a short lifetime
+	 * and are either refaulted or reclaimed along with the inode
+	 * before they get too old. But it is not impossible for the
+	 * nonresident_age to lap a shadow entry in the field, which
+	 * can then result in a false small refault distance, leading
+	 * to a false activation should this old entry actually
+	 * refault again. However, earlier kernels used to deactivate
+	 * unconditionally with *every* reclaim invocation for the
+	 * longest time, so the occasional inappropriate activation
+	 * leading to pressure on the active list is not a problem.
+	 */
+	distance = (refault - eviction) & (~0UL >> (BITS_PER_LONG - bits));
+
+	active = lruvec_page_state(lruvec, NR_ACTIVE_FILE);
+	inactive_file = lruvec_page_state(lruvec, NR_INACTIVE_FILE);
+
+	if (mem_cgroup_get_nr_swap_pages(memcg) > 0) {
+		active += lruvec_page_state(lruvec, NR_ACTIVE_ANON);
+		inactive_anon = lruvec_page_state(lruvec, NR_INACTIVE_ANON);
+	}
+
+	/*
+	 * Compare the distance to the existing workingset size. We
+	 * don't activate pages that couldn't stay resident even if
+	 * all the memory was available to the workingset. Whether
+	 * workingset competition needs to consider anon or not depends
+	 * on having free swap space.
+	 *
+	 * When there are already enough active pages, be less aggressive
+	 * on reactivating pages, challenge an already established set of
+	 * active pages with one time refaulted page may not be a good idea.
+	 */
+	if (active >= (inactive_anon + inactive_file))
+		return distance < inactive_anon + inactive_file;
+	else
+		return distance < active + (file ? inactive_anon : inactive_file);
+}
+
+/**
+ * workingset_age_nonresident - age non-resident entries as LRU ages
+ * @lruvec: the lruvec that was aged
+ * @nr_pages: the number of pages to count
+ *
+ * As in-memory pages are aged, non-resident pages need to be aged as
+ * well, in order for the refault distances later on to be comparable
+ * to the in-memory dimensions. This function allows reclaim and LRU
+ * operations to drive the non-resident aging along in parallel.
+ */
+static void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
+{
+	/*
+	 * Reclaiming a cgroup means reclaiming all its children in a
+	 * round-robin fashion. That means that each cgroup has an LRU
+	 * order that is composed of the LRU orders of its child
+	 * cgroups; and every page has an LRU position not just in the
+	 * cgroup that owns it, but in all of that group's ancestors.
+	 *
+	 * So when the physical inactive list of a leaf cgroup ages,
+	 * the virtual inactive lists of all its parents, including
+	 * the root cgroup's, age as well.
+	 */
+	do {
+		atomic_long_add(nr_pages, &lruvec->nonresident_age);
+	} while ((lruvec = parent_lruvec(lruvec)));
+}
+
+#ifdef CONFIG_LRU_GEN
 static void *lru_gen_eviction(struct folio *folio)
 {
 	int hist;
@@ -342,34 +438,6 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
 
 #endif /* CONFIG_LRU_GEN */
 
-/**
- * workingset_age_nonresident - age non-resident entries as LRU ages
- * @lruvec: the lruvec that was aged
- * @nr_pages: the number of pages to count
- *
- * As in-memory pages are aged, non-resident pages need to be aged as
- * well, in order for the refault distances later on to be comparable
- * to the in-memory dimensions. This function allows reclaim and LRU
- * operations to drive the non-resident aging along in parallel.
- */
-void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
-{
-	/*
-	 * Reclaiming a cgroup means reclaiming all its children in a
-	 * round-robin fashion. That means that each cgroup has an LRU
-	 * order that is composed of the LRU orders of its child
-	 * cgroups; and every page has an LRU position not just in the
-	 * cgroup that owns it, but in all of that group's ancestors.
-	 *
-	 * So when the physical inactive list of a leaf cgroup ages,
-	 * the virtual inactive lists of all its parents, including
-	 * the root cgroup's, age as well.
-	 */
-	do {
-		atomic_long_add(nr_pages, &lruvec->nonresident_age);
-	} while ((lruvec = parent_lruvec(lruvec)));
-}
-
 /**
  * workingset_eviction - note the eviction of a folio from memory
  * @target_memcg: the cgroup that is causing the reclaim
@@ -396,11 +464,11 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
 	lruvec = mem_cgroup_lruvec(target_memcg, pgdat);
 	/* XXX: target_memcg can be NULL, go through lruvec */
 	memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
-	eviction = atomic_long_read(&lruvec->nonresident_age);
-	eviction >>= bucket_order;
+
+	eviction = lru_eviction(lruvec, EVICTION_BITS, bucket_order);
 	workingset_age_nonresident(lruvec, folio_nr_pages(folio));
 	return pack_shadow(memcgid, pgdat, eviction,
-			folio_test_workingset(folio));
+			   folio_test_workingset(folio));
 }
 
 /**
@@ -418,9 +486,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 {
 	struct mem_cgroup *eviction_memcg;
 	struct lruvec *eviction_lruvec;
-	unsigned long refault_distance;
-	unsigned long workingset_size;
-	unsigned long refault;
 	int memcgid;
 	struct pglist_data *pgdat;
 	unsigned long eviction;
@@ -429,7 +494,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 		return lru_gen_test_recent(shadow, file, &eviction_lruvec,
 					   &eviction, workingset);
 
 	unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
-	eviction <<= bucket_order;
 
 	/*
 	 * Look up the memcg associated with the stored ID. It might
@@ -450,50 +514,10 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
 	eviction_memcg = mem_cgroup_from_id(memcgid);
 	if (!mem_cgroup_disabled() && !eviction_memcg)
 		return false;
-
 	eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
-	refault = atomic_long_read(&eviction_lruvec->nonresident_age);
-
-	/*
-	 * Calculate the refault distance
-	 *
-	 * The unsigned subtraction here gives an accurate distance
-	 * across nonresident_age overflows in most cases. There is a
-	 * special case: usually, shadow entries have a short lifetime
-	 * and are either refaulted or reclaimed along with the inode
-	 * before they get too old. But it is not impossible for the
-	 * nonresident_age to lap a shadow entry in the field, which
-	 * can then result in a false small refault distance, leading
-	 * to a false activation should this old entry actually
-	 * refault again. However, earlier kernels used to deactivate
-	 * unconditionally with *every* reclaim invocation for the
-	 * longest time, so the occasional inappropriate activation
-	 * leading to pressure on the active list is not a problem.
-	 */
-	refault_distance = (refault - eviction) & EVICTION_MASK;
-
-	/*
-	 * Compare the distance to the existing workingset size. We
-	 * don't activate pages that couldn't stay resident even if
-	 * all the memory was available to the workingset. Whether
-	 * workingset competition needs to consider anon or not depends
-	 * on having free swap space.
-	 */
-	workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE);
-	if (!file) {
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_INACTIVE_FILE);
-	}
-	if (mem_cgroup_get_nr_swap_pages(eviction_memcg) > 0) {
-		workingset_size += lruvec_page_state(eviction_lruvec,
-						     NR_ACTIVE_ANON);
-		if (file) {
-			workingset_size += lruvec_page_state(eviction_lruvec,
-							     NR_INACTIVE_ANON);
-		}
-	}
-
-	return refault_distance <= workingset_size;
+	return lru_refault(eviction_memcg, eviction_lruvec, eviction, file,
+			   EVICTION_BITS, bucket_order);
 }
 
 /**
@@ -543,7 +567,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 		goto out;
 
 	folio_set_active(folio);
-	workingset_age_nonresident(lruvec, nr);
 	mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file, nr);
 
 	/* Folio was active prior to eviction */
@@ -560,30 +583,6 @@ void workingset_refault(struct folio *folio, void *shadow)
 	rcu_read_unlock();
 }
 
-/**
- * workingset_activation - note a page activation
- * @folio: Folio that is being activated.
- */
-void workingset_activation(struct folio *folio)
-{
-	struct mem_cgroup *memcg;
-
-	rcu_read_lock();
-	/*
-	 * Filter non-memcg pages here, e.g. unmap can call
-	 * mark_page_accessed() on VDSO pages.
-	 *
-	 * XXX: See workingset_refault() - this should return
-	 * root_mem_cgroup even for !CONFIG_MEMCG.
-	 */
-	memcg = folio_memcg_rcu(folio);
-	if (!mem_cgroup_disabled() && !memcg)
-		goto out;
-	workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
-out:
-	rcu_read_unlock();
-}
-
 /*
  * Shadow entries reflect the share of the working set that does not
  * fit into memory, so their number depends on the access pattern of
@@ -778,7 +777,6 @@ static struct lock_class_key shadow_nodes_key;
 
 static int __init workingset_init(void)
 {
-	unsigned int timestamp_bits;
 	unsigned int max_order;
 	int ret;
 
@@ -790,12 +788,11 @@ static int __init workingset_init(void)
 	 * some more pages at runtime, so keep working with up to
 	 * double the initial memory by using totalram_pages as-is.
 	 */
-	timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
 	max_order = fls_long(totalram_pages() - 1);
-	if (max_order > timestamp_bits)
-		bucket_order = max_order - timestamp_bits;
+	if (max_order > EVICTION_BITS)
+		bucket_order = max_order - EVICTION_BITS;
 	pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
-		timestamp_bits, max_order, bucket_order);
+		EVICTION_BITS, max_order, bucket_order);
 
 	ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow");
 	if (ret)