From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
    Yosry Ahmed, "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner,
    Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v4 00/13] mm, swap: rework of swap allocator locks
Date: Tue, 14 Jan 2025 01:57:19 +0800
Message-ID: <20250113175732.48099-1-ryncsn@gmail.com>
Reply-To: Kairui Song
X-Patchwork-Submitter: Kairui Song
X-Patchwork-Id: 13937866

From: Kairui Song

This series greatly improves swap performance by reworking the locking
design and simplifying many code paths.
Tests showed up to a 400% vm-scalability improvement with pmem as SWAP,
and up to a 37% reduction in kernel compile real time with ZRAM as SWAP
(up to 60% improvement in system time).

This is part of the new swap allocator discussed during the "Swap
Abstraction" discussion at LSF/MM 2024, and the "mTHP and swap
allocator" discussion at LPC 2024.

This is a follow-up to the previous swap cluster allocator series:
https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/

It also enables further optimizations which will come later.

The previous series introduced a fully cluster-based allocator. This
series completely gets rid of the old allocator and makes the new
allocator avoid touching si->lock unless needed. This brings a huge
performance gain and gets rid of the slot cache for the freeing path.

Currently, swap locking is mainly composed of two locks: the cluster
lock (ci->lock) and the device lock (si->lock). The device lock is
widely used to protect many things, making it the main bottleneck for
SWAP.

The cluster lock is much more fine-grained, so it is best to use
ci->lock instead of si->lock as much as possible.

`perf lock` indicates this issue clearly. Doing a Linux kernel build
using tmpfs and ZRAM with limited memory (make -j64 with 1G memcg and 4k
pages), the result of "perf lock contention -ab sleep 3" shows:

 contended   total wait     max wait     avg wait         type   caller

     34948      53.63 s      7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
     16569      40.05 s      6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
     11191      28.41 s      7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
      4147      22.78 s    122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
      4595       7.17 s      6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
    406027       2.74 s      2.59 ms      6.74 us     spinlock   list_lru_add+0x39
  ...snip...

The top 5 callers are all users of si->lock; the total wait time sums to
several minutes in the 3-second time window.

Following the new allocator design, many operations don't need to touch
si->lock at all.
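
A rough userspace sketch of that idea, under stated assumptions: slot
allocation inside a cluster runs under the per-cluster lock alone, and
the device-wide lock is taken (nested inside the cluster lock) only for
the brief moment a cluster changes lists. Every name below (sketch_lock,
struct cluster, struct device, relist_cluster, cluster_alloc_slot) is a
hypothetical stand-in written for illustration; this is not the code
from mm/swapfile.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_CLUSTERS   4   /* hypothetical sizes, for illustration only */
#define CLUSTER_SLOTS 8

/* Tiny test-and-set spinlock standing in for the kernel's spinlock_t. */
struct sketch_lock { atomic_flag held; };

static void sketch_lock_init(struct sketch_lock *l)
{
    atomic_flag_clear(&l->held);
}

static void sketch_lock_acquire(struct sketch_lock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void sketch_lock_release(struct sketch_lock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

enum list_id { LIST_FREE, LIST_PARTIAL, LIST_FULL };

/* Stand-in for a swap cluster; its lock guards only its own slots. */
struct cluster {
    struct sketch_lock lock;            /* plays the role of ci->lock */
    unsigned char used[CLUSTER_SLOTS];
    unsigned int count;                 /* slots currently in use */
    enum list_id list;                  /* list this cluster is on */
};

/* Stand-in for a swap device; its lock guards list membership. */
struct device {
    struct sketch_lock lock;            /* plays the role of si->lock */
    unsigned int nr_on_list[3];         /* number of clusters per list */
    struct cluster clusters[NR_CLUSTERS];
};

void device_init(struct device *dev)
{
    sketch_lock_init(&dev->lock);
    dev->nr_on_list[LIST_FREE] = NR_CLUSTERS;
    dev->nr_on_list[LIST_PARTIAL] = 0;
    dev->nr_on_list[LIST_FULL] = 0;
    for (size_t i = 0; i < NR_CLUSTERS; i++) {
        struct cluster *ci = &dev->clusters[i];

        sketch_lock_init(&ci->lock);
        for (int s = 0; s < CLUSTER_SLOTS; s++)
            ci->used[s] = 0;
        ci->count = 0;
        ci->list = LIST_FREE;
    }
}

/* Caller holds ci->lock. The device lock is taken only on a list change. */
static void relist_cluster(struct device *dev, struct cluster *ci,
                           enum list_id to)
{
    if (ci->list == to)
        return;                         /* common case: device lock untouched */

    sketch_lock_acquire(&dev->lock);
    dev->nr_on_list[ci->list]--;
    dev->nr_on_list[to]++;
    ci->list = to;
    sketch_lock_release(&dev->lock);
}

/* Allocate one slot from cluster idx; returns its offset, or -1 if full. */
int cluster_alloc_slot(struct device *dev, size_t idx)
{
    assert(idx < NR_CLUSTERS);

    struct cluster *ci = &dev->clusters[idx];
    int slot = -1;

    sketch_lock_acquire(&ci->lock);
    for (int i = 0; i < CLUSTER_SLOTS; i++) {
        if (!ci->used[i]) {
            ci->used[i] = 1;
            ci->count++;
            slot = i;
            break;
        }
    }
    if (slot >= 0)
        relist_cluster(dev, ci, ci->count == CLUSTER_SLOTS ? LIST_FULL
                                                           : LIST_PARTIAL);
    sketch_lock_release(&ci->lock);
    return slot;
}
```

In this sketch, filling an 8-slot cluster takes the device lock twice
(free to partial, then partial to full), while the other six
allocations touch only the cluster lock.
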
We only need to take si->lock when doing operations across
multiple clusters (changing the cluster list). So ideally the allocator
should always take ci->lock first, then take si->lock only if needed.
But for historical reasons, ci->lock is used inside the si->lock
critical section, causing lock inversion if we simply try to acquire
si->lock after acquiring ci->lock.

This series audits all si->lock usage, cleans up legacy code, and
eliminates usage of si->lock as much as possible by introducing new
designs based on the new cluster allocator.

The old HDD allocation code is removed, and the cluster allocator is
adapted with small changes for HDD usage; testing looks OK.

This also removes the slot cache for the freeing path. Performance is
even better without it now, and this enables other cleanups and
optimizations as discussed before:
https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/

After this series, lock contention on si->lock is nearly unobservable
with `perf lock` with the same test above:

 contended   total wait     max wait     avg wait         type   caller
  ... snip ...
        52    127.12 us      3.82 us      2.44 us     spinlock   move_cluster+0x2c
        56    120.77 us     12.41 us      2.16 us     spinlock   move_cluster+0x2c
  ... snip ...
        10     21.96 us      2.78 us      2.20 us     spinlock   isolate_lock_cluster+0x20
  ... snip ...
         9     19.27 us      2.70 us      2.14 us     spinlock   move_cluster+0x2c
  ... snip ...
         5     11.07 us      2.70 us      2.21 us     spinlock   isolate_lock_cluster+0x20

`move_cluster` and `isolate_lock_cluster` (two newly introduced helpers)
are basically the only users of si->lock now. The performance gain is
huge, and LOC is reduced.

Test Results:

vm-scalability
==============

Running `usemem --init-time -O -y -x -R -31 1G` from vm-scalability in a
12G memory cgroup using simulated pmem as the SWAP backend (32G pmem, 32
CPUs).

Using 4K folios by default; 64k mTHP and sequential access (!-R) results
are also provided.
6 test runs for each case, Total Throughput:

Test             Before (KB/s) (stdev)   After (KB/s) (stdev)    Delta
---------------------------------------------------------------------------
Random (4K):       69937.11 (16449.77)    369816.17 (24476.68)   +428.78%
Random (64k):     123442.83 (13207.51)    216379.00 (25024.83)    +75.28%
Sequential (4K): 6313909.83 (148856.12)  6419860.66 (183563.38)    +1.7%

Sequential access puts less stress on the allocator, so the gain is
limited, but with random access (which is much closer to real workloads)
the performance gain is huge.

Build kernel with defconfig on tmpfs with ZRAM
==============================================

The results below show a test matrix using different memory cgroup
limits and job numbers, scaled up progressively for an intuitive
result.

Done on a 48c96t system, 6 test runs for each case. It can be seen
clearly that the performance gain grows as the number of concurrent jobs
goes higher, but even -j6 shows a slight improvement.

make -j           | System Time (seconds)    | Total Time (seconds)
(NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
With 4k pages only:
 6 / 192M / 3G    |  1533 / 1522 /  -0.7%    | 1420 / 1414 /  -0.3%
12 / 256M / 4G    |  2275 / 2226 /  -2.2%    |  758 /  742 /  -2.1%
24 / 384M / 5G    |  3596 / 3154 / -12.3%    |  476 /  422 / -11.3%
48 / 768M / 7G    |  8159 / 3605 / -55.8%    |  330 /  221 / -33.0%
96 / 1.5G / 10G   | 18541 / 6462 / -65.1%    |  283 /  180 / -36.4%
With 64k mTHP:
24 / 512M / 5G    |  3585 / 3469 /  -3.2%    |  293 /  290 /  -0.1%
48 / 1G / 7G      |  8173 / 3607 / -55.9%    |  251 /  158 / -37.0%
96 / 2G / 10G     | 16305 / 7791 / -52.2%    |  226 /  144 / -36.3%

The fragmentation is reduced too:

With: make -j96 / 1152M memcg, 64K mTHP:
(avg of 4 test runs)

Before:
hugepages-64kB/stats/swpout: 1696184
hugepages-64kB/stats/swpout_fallback: 414318

After: (-63.2% mTHP swapout failure)
hugepages-64kB/stats/swpout: 1866267
hugepages-64kB/stats/swpout_fallback: 158330

There is up to a 65.1% improvement in sys time for the kernel build
test, and a lower fragmentation
rate.

Build kernel with tinyconfig on tmpfs with HDD as swap:
=======================================================

This test is similar to the above, but the HDD test is very noisy and
slow and the deviation is huge, so tinyconfig is used instead and the
median result of 3 test runs is taken, which looks OK:

Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this series:
113.90user 23.81system 38:11.77elapsed 6%CPU
2548728inputs+0outputs (235471major+4238110minor)pagefaults

Single thread SWAP:
===================

Sequential SWAP should also be slightly faster as we removed a lot of
unnecessary parts. Test using a micro benchmark for swapping out/in 4G
of zero memory using ZRAM, 10 test runs:

Swapout Before (avg. 3359304):
3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776

Swapin Before (avg. 1928698):
1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155

Swapout After (avg. 3347511, -0.4%):
3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359

Swapin After (avg. 1922290, -0.3%):
1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913

The gain is limited to noise level but seems slightly better.

V3: https://lore.kernel.org/linux-mm/20241230174621.61185-1-ryncsn@gmail.com/T/
Updates since V3:
- Fix a few typos and improve some comments and commit messages [Baoquan He]
- Function naming change, and remove an unnecessary NULL check [Baoquan He]
- Fix a potential race between swapoff and hibernation found during
  local review: the get_swap_device_info call was dropped accidentally
  in get_swap_page_of_type.
- Collect Reviewed-by tags.

V2: https://lore.kernel.org/linux-mm/20241224143811.33462-1-ryncsn@gmail.com/
Updates since V2:
- Use atomic_long_try_cmpxchg instead of atomic_long_cmpxchg [Uros Bizjak]
- Fix build error after rebase.
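
For readers unfamiliar with the try_cmpxchg form mentioned in the V2
notes, here is a small userspace sketch of the idiom. It uses C11's
atomic_compare_exchange_strong() as a stand-in, since it has the same
shape as the kernel's atomic_long_try_cmpxchg(): it returns a bool and,
on failure, writes the observed value back through the "old" pointer.
The add_unless_over() helper and its counter are hypothetical and are
not taken from this series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the series): add delta to *v unless the
 * result would exceed limit. Returns true if the add happened.
 */
bool add_unless_over(atomic_long *v, long delta, long limit)
{
    assert(v != NULL);

    long old = atomic_load(v);

    for (;;) {
        long new_val = old + delta;

        if (new_val > limit)
            return false;
        /*
         * try_cmpxchg shape: returns a bool, and on failure stores the
         * value it actually observed into 'old', so no explicit re-read
         * is needed before retrying.
         */
        if (atomic_compare_exchange_strong(v, &old, new_val))
            return true;
    }
}
```

With a plain cmpxchg that returns the old value, the caller has to
compare the return value against its expected value and reassign it on
every retry; the try form folds both steps into the call.
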

V1: https://lore.kernel.org/linux-mm/20241022192451.38138-1-ryncsn@gmail.com/
Updates since V1:
- Retest some tests after rebasing on top of latest mm-unstable. The new
  cgroup lock removal increased the performance gain of this series
  too; some results are basically the same as before, so they are
  unchanged:
  https://lore.kernel.org/linux-mm/20241218114633.85196-1-ryncsn@gmail.com/
- Rework the off-list bit handling, make it easier to review and more
  robust, also reduce LOC [Chris Li].
- Code style improvements and minor code optimizations [Chris Li].
- Fix a potential swapoff race issue due to a missing SWP_WRITEOK check
  [Huang Ying].
- Add vm-scalability test with pmem [Huang Ying].

Kairui Song (13):
  mm, swap: minor clean up for swap entry allocation
  mm, swap: fold swap_info_get_cont in the only caller
  mm, swap: remove old allocation path for HDD
  mm, swap: use cluster lock for HDD
  mm, swap: clean up device availability check
  mm, swap: clean up plist removal and adding
  mm, swap: hold a reference during scan and cleanup flag usage
  mm, swap: use an enum to define all cluster flags and wrap flags changes
  mm, swap: reduce contention on device lock
  mm, swap: simplify percpu cluster updating
  mm, swap: introduce a helper for retrieving cluster from offset
  mm, swap: use a global swap cluster for non-rotation devices
  mm, swap_slots: remove slot cache for freeing path

 fs/btrfs/inode.c           |    1 -
 fs/f2fs/data.c             |    1 -
 fs/iomap/swapfile.c        |    1 -
 include/linux/swap.h       |   34 +-
 include/linux/swap_slots.h |    3 -
 mm/page_io.c               |    1 -
 mm/swap_slots.c            |   78 +--
 mm/swapfile.c              | 1258 ++++++++++++++++--------------------
 8 files changed, 602 insertions(+), 775 deletions(-)