From patchwork Tue Oct 22 19:24:38 2024
X-Patchwork-Submitter: Kairui Song
X-Patchwork-Id: 13846074
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Tim Chen, Nhat Pham, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH 00/13] mm, swap: rework of swap allocator locks
Date: Wed, 23 Oct 2024 03:24:38 +0800
Message-ID: <20241022192451.38138-1-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.0
Reply-To: Kairui Song
From: Kairui Song

This series greatly improves swap allocator performance by reworking the locking design and simplifying many code paths.
This is a follow-up to the previous swap cluster allocator series:
https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/

And this series is based on a follow-up fix for the swap cluster allocator:
https://lore.kernel.org/linux-mm/20241022175512.10398-1-ryncsn@gmail.com/

This is part of the new swap allocator work item discussed in Chris's "Swap Abstraction" discussion at LSF/MM 2024, and the "mTHP and swap allocator" discussion at LPC 2024. The previous series introduced a fully cluster-based allocation algorithm; this series completely removes the old allocation path and makes the allocator avoid grabbing si->lock unless needed. This brings a huge performance gain and removes the slot cache from the freeing path.

Currently, swap locking is mainly composed of two locks: the cluster lock (ci->lock) and the device lock (si->lock). The device lock is widely used to protect many things, making it the main bottleneck for SWAP. The cluster lock is much more fine-grained, so it is best to use ci->lock instead of si->lock as much as possible.

`perf lock` indicates this issue clearly. Doing a Linux kernel build using tmpfs and ZRAM with limited memory (make -j64 with 1G memcg and 4K pages), the result of "perf lock contention -ab sleep 3" is:

contended   total wait     max wait     avg wait         type   caller
    34948      53.63 s      7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
    16569      40.05 s      6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
    11191      28.41 s      7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
     4147      22.78 s    122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
     4595       7.17 s      6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
   406027       2.74 s      2.59 ms      6.74 us     spinlock   list_lru_add+0x39
...snip...

The top 5 callers are all users of si->lock; their total wait time sums up to several minutes in the 3-second time window.

Following the new allocator design, many operations don't need to touch si->lock at all. We only need to take si->lock when doing operations across multiple clusters (e.g. changing the cluster list); other operations only need to take ci->lock. So ideally the allocator should always take ci->lock first, and then, if needed, take si->lock. But due to historical reasons, ci->lock is used inside si->lock by design, causing lock inversion if we simply try to acquire si->lock after acquiring ci->lock.

This series audits all si->lock usage, simplifies legacy code, and eliminates usage of si->lock as much as possible by introducing new designs based on the new cluster allocator. The old HDD allocation code is removed, the cluster allocator is adapted for HDD usage with small changes, and tests look OK.

This also removes the slot cache for the freeing path. The performance is better without it, and this enables other cleanups and optimizations as discussed before:
https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/

After this series, lock contention on si->lock is nearly unobservable with `perf lock` in the same test as above:

contended   total wait     max wait     avg wait         type   caller
... snip ...
       91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
... snip ...
       47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
... snip ...
       23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
... snip ...
       17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
... snip ...

cluster_move and cluster_isolate_lock are basically the only users of si->lock now; the performance gain is huge, with reduced LOC.

Tests
===

Build kernel with defconfig on tmpfs with ZRAM as swap:
---

Running a test matrix which is scaled up progressively for an intuitive result. The tests are run on top of tmpfs, using a memory cgroup for memory limitation, on a 48c96t system, with 12 test runs for each case. It can be seen clearly that the performance gain grows as the concurrent job number goes higher, and performance is better even with low concurrency.
make -j           | System Time (seconds)    | Total Time (seconds)
(NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)

With 4k pages only:
6 / 192M / 3G     |  5258 /  5235 / -0.3%   | 1420 / 1414 / -0.3%
12 / 256M / 4G    |  5518 /  5337 / -3.3%   |  758 /  742 / -2.1%
24 / 384M / 5G    |  7091 /  5766 / -18.7%  |  476 /  422 / -11.3%
48 / 768M / 7G    | 11139 /  5831 / -47.7%  |  330 /  221 / -33.0%
96 / 1.5G / 10G   | 21303 / 11353 / -46.7%  |  283 /  180 / -36.4%

With 64k mTHP:
24 / 512M / 5G    |  5104 /  4641 / -18.7%  |  376 /  358 / -4.8%
48 / 1G / 7G      |  8693 /  4662 / -18.7%  |  257 /  176 / -31.5%
96 / 2G / 10G     | 17056 / 10263 / -39.8%  |  234 /  169 / -27.8%

With a more aggressive setup, it shows clearly that both performance and fragmentation are better:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C * 2:
(avg of 4 test runs)
Before:
Sys time: 73578.30, Real time: 864.05

time make -j96 / 1G memcg, 4K pages, 10G ZRAM:
After: (-54.7% sys time, -49.3% real time)
Sys time: 33314.76, Real time: 437.67

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C * 2:
(avg of 4 test runs)
Before:
Sys time: 74044.85, Real time: 846.51
hugepages-64kB/stats/swpout: 1735216
hugepages-64kB/stats/swpout_fallback: 430333
After: (-51.4% sys time, -47.7% real time, -63.2% mTHP failure)
Sys time: 35958.87, Real time: 442.69
hugepages-64kB/stats/swpout: 1866267
hugepages-64kB/stats/swpout_fallback: 158330

There is an up to 54.7% improvement for the kernel build test, and a lower fragmentation rate.
Performance improvement should be even larger for micro benchmarks.

Build kernel with tinyconfig on tmpfs with HDD as swap:
---

This test is similar to the above, but the HDD test is very noisy and slow and the deviation is huge, so just use tinyconfig instead and take the median result of 3 test runs, which looks OK:

Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this series:
113.90user 23.81system 38:11.77elapsed 6%CPU
2548728inputs+0outputs (235471major+4238110minor)pagefaults

Single thread SWAP:
---

Sequential SWAP should also be slightly faster, as we removed a lot of unnecessary parts. Test using a micro benchmark for swapout/in of 4G zero memory using ZRAM, 10 test runs:

Swapout Before (avg. 3359304):
3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776

Swapin Before (avg. 1928698):
1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155

Swapout After (avg. 3347511, -0.4%):
3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359

Swapin After (avg. 1922290, -0.3%):
1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913

Worth noticing: the patch "mm, swap: use a global swap cluster for non-rotation device" introduced minor overhead for certain tests (see the test results in its commit message), but the gain from later commits covers that; it can be further improved later.
Suggested-by: Chris Li
Signed-off-by: Kairui Song

Kairui Song (13):
  mm, swap: minor clean up for swap entry allocation
  mm, swap: fold swap_info_get_cont in the only caller
  mm, swap: remove old allocation path for HDD
  mm, swap: use cluster lock for HDD
  mm, swap: clean up device availability check
  mm, swap: clean up plist removal and adding
  mm, swap: hold a reference of si during scan and clean up flags
  mm, swap: use an enum to define all cluster flags and wrap flags changes
  mm, swap: reduce contention on device lock
  mm, swap: simplify percpu cluster updating
  mm, swap: introduce a helper for retrieving cluster from offset
  mm, swap: use a global swap cluster for non-rotation device
  mm, swap_slots: remove slot cache for freeing path

 fs/btrfs/inode.c           |    1 -
 fs/iomap/swapfile.c        |    1 -
 include/linux/swap.h       |   36 +-
 include/linux/swap_slots.h |    3 -
 mm/page_io.c               |    1 -
 mm/swap_slots.c            |   78 +--
 mm/swapfile.c              | 1198 ++++++++++++++++--------------------
 7 files changed, 558 insertions(+), 760 deletions(-)