From: Gregory Price <gourry@gourry.net>
To: linux-mm@kvack.org
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    kernel-team@meta.com, nehagholkar@meta.com, abhishekd@meta.com,
    david@redhat.com, nphamcs@gmail.com, gourry@gourry.net,
    akpm@linux-foundation.org, hannes@cmpxchg.org, kbusch@meta.com,
    ying.huang@linux.alibaba.com, feng.tang@intel.com,
    donettom@linux.ibm.com
Subject: [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
Date: Mon, 6 Jan 2025 19:03:40 -0500
Message-ID: <20250107000346.1338481-1-gourry@gourry.net>
Unmapped page cache pages can be demoted to low-tier memory, but at
present they can only be promoted under two conditions:

1) The page is fully swapped out and re-faulted
2) The page becomes mapped (and exposed to NUMA hint faults)

This RFC proposes promoting unmapped page cache pages by using
folio_mark_accessed as a hotness hint for unmapped pages. We show in a
microbenchmark that this mechanism can increase performance up to 23.5%
compared to leaving page cache on the low tier - when that page cache
becomes excessively hot. When disabled (NUMA tiering off), overhead in
folio_mark_accessed was limited to <1% in a worst-case scenario (all
work is file_read()).

There is an open question as to how to integrate this into MGLRU, as
the current design only applies to the traditional LRU.

Patches 1-3 allow NULL as valid input to migration prep interfaces for
vmf/vma - which are not present for unmapped folios.
Patch 4 adds NUMA_HINT_PAGE_CACHE to vmstat.
Patch 5 implements migrate_misplaced_folio_batch.
Patch 6 adds the promotion mechanism, along with a sysfs extension
which defaults the behavior to off:
  /sys/kernel/mm/numa/pagecache_promotion_enabled

v3 Notes
========
- Added a batch migration interface (migrate_misplaced_folio_batch).
- Dropped the timestamp check in promotion_candidate (tests showed it
  did not make a difference, and the work is duplicated during the
  migration process).
- Bug fix from Donet Tom regarding vmstat.
- Pulled the folio_isolated and sysfs switch checks out into
  folio_mark_accessed, because microbenchmark tests showed the function
  call overhead of promotion_candidate warranted a bit of manual
  optimization for the scenario where the majority of work is
  file_read(). This brought the standing overhead from ~7% down to <1%
  when everything is disabled.
- Limited the promotion work list to a number of folios that matches
  the existing promotion rate limit, as the microbenchmark demonstrated
  excessive overhead on a single system call when significant amounts
  of memory are read.
    Before: 128GB read went from 7 seconds to 40 seconds over ~2 rounds.
    Now:    128GB read went from 7 seconds to ~11 seconds over ~10 rounds.
- Switched from list_add to list_add_tail in promotion_candidate, as it
  was discovered that promoting in non-linear order caused fairly
  significant overheads (as high as running out of CXL) - likely due to
  poor TLB and prefetch behavior. Simply switching to list_add_tail all
  but confirmed this, as the additional ~20% overhead vanished. This is
  likely to only occur on systems with a large amount of contiguous
  physical memory available on the hot tier, since the allocators are
  more likely to provide better spatial locality.

Test
====
Environment:
  1.5-3.7GHz CPU, ~4000 BogoMIPS, 1TB machine with 768GB DRAM and
  256GB CXL. A 128GB file is linearly read by a single process.

Goal:
  Generate promotions and demonstrate the upper bound on performance
  overhead and gain/loss.

System settings:
  echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
  echo 2 > /proc/sys/kernel/numa_balancing

Test process:
In each test, we do a linear read of a 128GB file into a buffer in a
loop. To allocate the page cache into CXL, we use mbind prior to the
CXL test runs and read the file. We omit the overhead of allocating
the buffer and initializing the memory into CXL from the test runs.

1) file allocated in DRAM with mechanisms off
2) file allocated in DRAM with balancing on but promotion off
3) file allocated in DRAM with balancing and promotion on
   (promotion check is negative because all pages are top tier)
4) file allocated in CXL with mechanisms off
5) file allocated in CXL with mechanisms on

Each test was run with 50 read cycles and averaged (where relevant) to
account for system noise. This number of cycles gives the promotion
mechanism time to promote the vast majority of memory (usually <1MB
remaining in the worst case). Tests 2 and 3 test the upper bound on the
overhead of the new checks when there are no pages to migrate but work
is dominated by file_read().
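For reference, the per-cycle read loop described above can be sketched
roughly as follows. This is a hypothetical reconstruction - the actual
harness was not posted, so the use of dd and millisecond-granularity
timing are assumptions. Each pass re-reads the (cached) file, so every
resident folio passes through folio_mark_accessed and can be sampled as
a promotion candidate:

```shell
# Hypothetical sketch of the read-cycle loop (dd and the timing scheme
# are assumptions; the real benchmark harness was not posted).
read_cycles() {
    file=$1
    cycles=$2
    i=1
    while [ "$i" -le "$cycles" ]; do
        start=$(date +%s%N)
        # linear read of the whole file; cached folios hit
        # folio_mark_accessed() on each pass
        dd if="$file" of=/dev/null bs=1M 2>/dev/null
        end=$(date +%s%N)
        echo "cycle $i: $(( (end - start) / 1000000 )) ms"
        i=$((i + 1))
    done
}
```

A run matching the description above would then be something like
`read_cycles /path/to/128G.file 50`.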
|     1     |    2     |      3      |    4     |       5        |
| DRAM Base | Promo On | TopTier Chk | CXL Base | Post-Promotion |
|  7.5804   |  7.7586  |   7.9726    |   9.75   |     7.8941     |

Baseline DRAM vs baseline CXL shows ~28% overhead from simply allowing
the file to remain on CXL, while after promotion we see performance
trend back towards the overhead of the TopTier check time - a total
overhead reduction of ~84% (or ~5% overhead, down from ~23.5%).

During promotion, we do see overhead which eventually tapers off over
time. Here is a sample of the first 10 cycles, during which promotion
is the most aggressive; the overhead drops off dramatically as the
majority of memory is migrated to the top tier:

  12.79, 12.52, 12.33, 12.03, 11.81, 11.58, 11.36, 11.1, 8, 7.96

This could be further limited by capping the promotion rate via the
existing knob, or by implementing a new knob detached from the existing
promotion rate. There are merits to both approaches.

After promotion, turning the mechanism off via sysfs restored overall
performance to the DRAM baseline. The slight (~1%) difference between
post-migration performance and the baseline mechanism overhead check
appears to be general variance, as similar times were observed during
the baseline checks on subsequent runs.

The mechanism itself represents a ~2-5% overhead in a worst-case
scenario (all work is file_read() and pages are in DRAM).

Development History and Notes
=============================
During development, we explored the following proposals:

1) Directly promoting within folio_mark_accessed (FMA)
   Originally suggested by Johannes Weiner
   https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/

   This caused deadlocks due to the fact that the PTL was held in a
   variety of cases - in particular during task exit. It is also
   incredibly inflexible and causes promotion-on-fault. It was
   discussed that a deferral mechanism was preferred.
2) Promoting in filemap.c locations (callers of FMA)
   Originally proposed by Feng Tang and Ying Huang
   https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329

   First, we saw this as less problematic than directly hooking FMA,
   but we realized it has the potential to miss data in a variety of
   locations: swap.c, memory.c, gup.c, ksm.c, paddr.c - etc. Second,
   we discovered that the lock state of pages is very subtle, and that
   these locations in filemap.c can be called in an atomic context.
   Prototypes led to a variety of stalls and lockups.

3) A new LRU - originally proposed by Keith Busch
   https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7

   There are two issues with this approach: PG_promotable and reclaim.
   First, PG_promotable has generally been discouraged. Second,
   attaching this mechanism to an LRU is both backwards and
   counter-intuitive: a promotable list is better served by a MOST
   recently used list, and since LRUs are generally only shrunk when
   exposed to pressure, it would require implementing a new promotion
   list shrinker that runs separately from the existing reclaim logic.

4) Adding a separate kthread - suggested by many
   This is - to an extent - a more general version of the LRU
   proposal. We still have to track the folios, which likely requires
   the addition of a page flag. Additionally, this method would
   contend pretty heavily with LRU behavior - i.e. we'd want to
   throttle additions to the promotion candidate list in some
   scenarios.

5) Doing it in task work
   This seemed the most realistic after considering the above.

   We observe the following:
   - FMA is an ideal hook for this, and isolation is safe here
   - the new promotion_candidate function is an ideal hook for new
     filter logic (throttling, fairness, etc.)
   - isolated folios are either promoted or put back on task resume;
     there are no additional concurrency mechanics to worry about
   - the mechanism can be made optional via a sysfs hook to avoid
     overhead in degenerate scenarios (thrashing)

Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Keith Busch <kbusch@meta.com>
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>

Gregory Price (6):
  migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
  memory: move conditionally defined enums use inside ifdef tags
  memory: allow non-fault migration in numa_migrate_check path
  vmstat: add page-cache numa hints
  migrate: implement migrate_misplaced_folio_batch
  migrate,sysfs: add pagecache promotion

 .../ABI/testing/sysfs-kernel-mm-numa | 20 +++++
 include/linux/memory-tiers.h         |  2 +
 include/linux/migrate.h              | 10 +++
 include/linux/sched.h                |  4 +
 include/linux/sched/sysctl.h         |  1 +
 include/linux/vm_event_item.h        |  8 ++
 init/init_task.c                     |  2 +
 kernel/sched/fair.c                  | 24 ++++-
 mm/memcontrol.c                      |  1 +
 mm/memory-tiers.c                    | 27 ++++++
 mm/memory.c                          | 32 ++++---
 mm/mempolicy.c                       | 25 ++++--
 mm/migrate.c                         | 88 ++++++++++++++++++-
 mm/swap.c                            |  8 ++
 mm/vmstat.c                          |  2 +
 15 files changed, 230 insertions(+), 24 deletions(-)
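For anyone wanting to exercise the series, the system settings from the
test description can be combined with a CXL-placement step into a short
sequence like the following. This is only a sketch: node 1 as the CXL
node, the file path, and the use of numactl --membind as a stand-in for
the direct mbind() step are all assumptions about the test machine.

```shell
# Hypothetical end-to-end sequence (requires root and the patched
# kernel; node 1, the file path, and numactl usage are assumptions).
echo 2 > /proc/sys/kernel/numa_balancing                   # tiering mode
echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled   # new knob

# Populate the page cache from the CXL node, then re-read the file
# repeatedly and watch the hint/promotion counters move.
numactl --membind=1 dd if=/mnt/data/128G.file of=/dev/null bs=1M
grep -E 'numa_hint|pgpromote' /proc/vmstat
```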