From patchwork Wed Nov 27 08:21:57 2024
From: Gregory Price
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, nehagholkar@meta.com, abhishekd@meta.com, kernel-team@meta.com, david@redhat.com, ying.huang@intel.com, nphamcs@gmail.com, gourry@gourry.net, akpm@linux-foundation.org, hannes@cmpxchg.org, feng.tang@intel.com, kbusch@meta.com
Subject: [RFC PATCH 0/4] Promotion of Unmapped Page Cache Folios.
Date: Wed, 27 Nov 2024 03:21:57 -0500
Message-ID: <20241127082201.1276-1-gourry@gourry.net>

Unmapped page cache pages
can be demoted to low-tier memory, but they can presently only be
promoted in two conditions:
  1) The page is fully swapped out and re-faulted
  2) The page becomes mapped (and exposed to NUMA hint faults)

This RFC proposes promoting unmapped page cache pages by using
folio_mark_accessed as a hotness hint for unmapped pages.

Patches 1 & 2 allow NULL as valid input to migration prep interfaces
for vmf/vma - which is not present in unmapped folios.

Patch 3 adds NUMA_HINT_PAGE_CACHE to vmstat.

Patch 4 adds the promotion mechanism, along with a sysfs extension
which defaults the behavior to off:

  /sys/kernel/mm/numa/pagecache_promotion_enabled

Functional testing showed that we are able to reclaim some performance
in canned scenarios (a file gets demoted and becomes hot with
relatively little contention). See the test/overhead sections below.

Open Questions:
======

1) Should we also add a limit to how much can be forced onto a single
   task's promotion list at any one time? This might piggy-back on the
   existing TPP promotion limit (256MB?) and would simply add something
   like task->promo_count. Technically we are limited by the batch
   read-rate before a TASK_RESUME occurs.

2) Should we exempt certain forms of folios, or add additional
   knobs/levers to deal with things like large folios?

3) We added NUMA_HINT_PAGE_CACHE to differentiate hint faults so we
   could validate that the behavior works as intended. Should we just
   call this a NUMA_HINT_FAULT and not add a new hint?

4) Benchmark suggestions that can pressure 1TB of memory. This is not
   my typical wheelhouse, so if folks know of a useful benchmark that
   can pressure my 1TB (768GB DRAM / 256GB CXL) setup, I'd like to add
   additional measurements here.
Development Notes
=================

During development, we explored the following proposals:

1) Directly promoting within folio_mark_accessed (FMA)
   Originally suggested by Johannes Weiner
   https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/

   This caused deadlocks because the PTL was held in a variety of
   cases - in particular during task exit. It is also incredibly
   inflexible and causes promotion-on-fault. It was discussed that a
   deferral mechanism was preferred.

2) Promoting in filemap.c locations (callers of FMA)
   Originally proposed by Feng Tang and Ying Huang
   https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329

   First, while we saw this as less problematic than directly hooking
   FMA, we realized it has the potential to miss data in a variety of
   locations: swap.c, memory.c, gup.c, ksm.c, paddr.c - etc.

   Second, we discovered that the lock state of pages is very subtle,
   and that these locations in filemap.c can be called in an atomic
   context. Prototypes led to a variety of stalls and lockups.

3) A new LRU - originally proposed by Keith Busch
   https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7

   There are two issues with this approach: PG_promotable and reclaim.

   First, PG_promotable has generally been discouraged. Second,
   attaching this mechanism to an LRU is both backwards and
   counter-intuitive: a promotable list is better served by a MOST
   recently used list, and since LRUs are generally only shrunk when
   exposed to pressure, it would require implementing a new
   promotion-list shrinker that runs separately from the existing
   reclaim logic.

4) Adding a separate kthread - suggested by many
   This is - to an extent - a more general version of the LRU proposal.
   We still have to track the folios, which likely requires the
   addition of a page flag.
   Additionally, this method would contend fairly heavily with LRU
   behavior - i.e. we'd want to throttle additions to the promotion
   candidate list in some scenarios.

5) Doing it in task work
   This seemed the most realistic after considering the above.

   We observe the following:
   - FMA is an ideal hook for this, and isolation is safe here
   - the new promotion_candidate function is an ideal hook for new
     filter logic (throttling, fairness, etc.)
   - isolated folios are either promoted or put back on task resume;
     there are no additional concurrency mechanics to worry about
   - the mechanism can be made optional via a sysfs hook to avoid
     overhead in degenerate scenarios (thrashing)

   We also piggy-backed on the numa_hint_fault_latency timestamp to
   further throttle promotions, which helps avoid promoting a page
   that is only accessed once or twice.

Test:
======

Environment:
  1.5-3.7GHz CPU, ~4000 BogoMIPS
  1TB machine with 768GB DRAM and 256GB CXL
  A 64GB file being linearly read by 6-7 Python processes

Goal:
  Generate promotions. Demonstrate stability and measure overhead.

System Settings:
  echo 1 > /sys/kernel/mm/numa/demotion_enabled
  echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
  echo 2 > /proc/sys/kernel/numa_balancing

Each process took up ~128GB, with anonymous memory growing and
shrinking as Python filled and released buffers holding the 64GB of
data. This causes DRAM pressure to generate demotions, and file pages
to "become hot" - and therefore be selected for promotion.

First we ran with promotion disabled to show the consistent overhead
that results from forcing a file out to CXL memory. We first ran a
single reader to see uncontended performance, launched many readers to
force demotions, then dropped back to a single reader to observe.

Single-reader DRAM:                  ~16.0-16.4s
Single-reader CXL (after demotion):  ~16.8-17s

Next we turned promotion on with only a single reader running.
Before promotions:
  Node 0 MemFree:   636478112 kB
  Node 0 FilePages:  59009156 kB
  Node 1 MemFree:   250336004 kB
  Node 1 FilePages:  14979628 kB

After promotions:
  Node 0 MemFree:   632267268 kB
  Node 0 FilePages:  72204968 kB
  Node 1 MemFree:   262567056 kB
  Node 1 FilePages:   2918768 kB

Single-reader (after promotion): ~16.5s

Turning the promotion mechanism on when nothing had been demoted
produced no appreciable overhead (memory allocation noise overpowers
it).

Read time did not change after turning promotion off once promotions
had occurred, which implies that the additional overhead is not coming
from the promotion system itself - but likely from other pages still
trapped in the low tier.

Either way, this at least demonstrates that the mechanism is not
particularly harmful when there are no pages to promote - and that it
is valuable when a file actually is quite hot.

Notably, it takes some time for the average read loop to come back
down, and there still remain unpromoted file pages trapped in the page
cache. This isn't entirely unexpected: many files may have been
demoted, and they may not be very hot.

Overhead
======

When promotion was turned on, we saw the loop runtime increase
temporarily:

before: ~16.8s
during:
  17.606216192245483
  17.375206470489502
  17.722095489501953
  18.230552434921265
  18.20712447166443
  18.008254528045654
  17.008427381515503
  16.851454257965088
  16.715774059295654
stable: ~16.5s

We measured overhead with a separate patch that simply measured the
rdtsc value before/after calls in promotion_candidate and the task
work.
e.g.:

+	start = rdtsc();
	list_for_each_entry_safe(folio, tmp, promo_list, lru) {
		list_del_init(&folio->lru);
		migrate_misplaced_folio(folio, NULL, nid);
+		count++;
	}
+	atomic_long_add(rdtsc() - start, &promo_time);
+	atomic_long_add(count, &promo_count);

Average cycles per call (total cycles / call count):

numa_migrate_prep:                93 - time(3969867917)  count(42576860)
migrate_misplaced_folio_prepare: 491 - time(3433174319)  count(6985523)
migrate_misplaced_folio:        1635 - time(11426529980) count(6985523)

Thoughts on a good throttling heuristic would be appreciated here.

Suggested-by: Huang Ying
Suggested-by: Johannes Weiner
Suggested-by: Keith Busch
Suggested-by: Feng Tang
Signed-off-by: Gregory Price

Gregory Price (4):
  migrate: Allow migrate_misplaced_folio APIs without a VMA
  memory: allow non-fault migration in numa_migrate_check path
  vmstat: add page-cache numa hints
  migrate,sysfs: add pagecache promotion

 .../ABI/testing/sysfs-kernel-mm-numa |  20 +++++++
 include/linux/memory-tiers.h         |   2 +
 include/linux/migrate.h              |   5 ++
 include/linux/sched.h                |   3 +
 include/linux/sched/numa_balancing.h |   5 ++
 include/linux/vm_event_item.h        |   2 +
 init/init_task.c                     |   1 +
 kernel/sched/fair.c                  |  27 ++++++++-
 mm/memory-tiers.c                    |  27 +++++++++
 mm/memory.c                          |  41 ++++++-----
 mm/mempolicy.c                       |  25 +++++---
 mm/migrate.c                         |  59 ++++++++++++++++++-
 mm/swap.c                            |   3 +
 mm/vmstat.c                          |   2 +
 14 files changed, 196 insertions(+), 26 deletions(-)