From patchwork Tue Dec 24 14:37:58 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kairui Song
X-Patchwork-Id: 13920188
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
 Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner, Kalesh Singh,
 linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 00/13] mm, swap: rework of swap allocator locks
Date: Tue, 24 Dec 2024 22:37:58 +0800
Message-ID: <20241224143811.33462-1-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
Reply-To: Kairui Song
From: Kairui Song

This series greatly improves swap performance by reworking the locking
design and simplifying many code paths. Tests showed up to a 400%
vm-scalability improvement with pmem as SWAP, and up to a 37% reduction
in kernel compile real time with ZRAM as SWAP (up to a 60% improvement
in system time).

This is part of the new swap allocator discussed during the "Swap
Abstraction" discussion at LSF/MM 2024 and the "mTHP and swap allocator"
discussion at LPC 2024.

This is a follow-up to the previous swap cluster allocator series:
https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
It also enables further optimizations which will come later.

The previous series introduced a fully cluster-based allocator; this
series completely gets rid of the old allocator and makes the new
allocator avoid touching si->lock unless needed. This brings a huge
performance gain and removes the slot cache from the freeing path.

Currently, swap locking is mainly composed of two locks: the cluster
lock (ci->lock) and the device lock (si->lock). The device lock is
widely used to protect many things, making it the main bottleneck for
SWAP. The cluster lock is much more fine-grained, so it is best to use
ci->lock instead of si->lock as much as possible.

`perf lock` shows this issue clearly. Doing a Linux kernel build using
tmpfs and ZRAM with limited memory (make -j64 with a 1G memcg and 4K
pages), the result of "perf lock contention -ab sleep 3" shows:

  contended   total wait    max wait    avg wait        type   caller
      34948     53.63 s      7.11 ms     1.53 ms    spinlock   free_swap_and_cache_nr+0x350
      16569     40.05 s      6.45 ms     2.42 ms    spinlock   get_swap_pages+0x231
      11191     28.41 s      7.03 ms     2.54 ms    spinlock   swapcache_free_entries+0x59
       4147     22.78 s    122.66 ms     5.49 ms    spinlock   page_vma_mapped_walk+0x6f3
       4595      7.17 s      6.79 ms     1.56 ms    spinlock   swapcache_free_entries+0x59
     406027      2.74 s      2.59 ms     6.74 us    spinlock   list_lru_add+0x39
  ...snip...

The top 5 callers are all users of si->lock, and their total wait time
sums to several minutes within the 3-second time window.

Following the new allocator design, many operations don't need to touch
si->lock at all. We only need to take si->lock when doing operations
across multiple clusters (changing the cluster lists). So ideally the
allocator should always take ci->lock first, then take si->lock only if
needed. But for historical reasons, ci->lock is used inside si->lock
critical sections, causing lock inversion if we simply try to acquire
si->lock after acquiring ci->lock.

This series audits all si->lock usage, cleans up legacy code, and
eliminates usage of si->lock as much as possible by introducing new
designs based on the new cluster allocator. The old HDD allocation code
is removed, the cluster allocator is adapted with small changes for HDD
usage, and tests are looking OK.
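To make the intended ordering concrete, below is a minimal userspace C
sketch of the idea described above: the per-cluster lock covers the
common allocation path on its own, and the device-level lock is taken
only for the rare cross-cluster list operations, nested inside the
cluster lock. The names here (cluster, swap_device, alloc_slot,
move_cluster) are hypothetical stand-ins for illustration only, not the
actual kernel code in this series.

/*
 * Illustrative userspace sketch of the lock ordering described above,
 * NOT the kernel implementation. pthread mutexes stand in for
 * ci->lock and si->lock.
 */
#include <pthread.h>
#include <stdio.h>

struct cluster {
	pthread_mutex_t lock;   /* fine-grained, like ci->lock */
	int free_slots;
	int list_id;            /* which per-device list it sits on */
};

struct swap_device {
	pthread_mutex_t lock;   /* coarse, like si->lock; protects cluster lists */
};

/* Common case: allocate from a cluster without touching the device lock. */
static int alloc_slot(struct cluster *ci)
{
	int slot = -1;

	pthread_mutex_lock(&ci->lock);
	if (ci->free_slots > 0)
		slot = --ci->free_slots;
	pthread_mutex_unlock(&ci->lock);
	return slot;
}

/*
 * Rare case: the cluster must move to another list (e.g. it became full).
 * Only here is the device lock taken, and only after the cluster lock,
 * which is the ordering the series establishes to avoid inversion.
 */
static void move_cluster(struct swap_device *si, struct cluster *ci, int new_list)
{
	pthread_mutex_lock(&ci->lock);
	pthread_mutex_lock(&si->lock);
	ci->list_id = new_list;  /* relinking onto the new list would happen here */
	pthread_mutex_unlock(&si->lock);
	pthread_mutex_unlock(&ci->lock);
}

int main(void)
{
	struct swap_device si = { .lock = PTHREAD_MUTEX_INITIALIZER };
	struct cluster ci = { .lock = PTHREAD_MUTEX_INITIALIZER,
			      .free_slots = 512, .list_id = 0 };

	printf("allocated slot index %d\n", alloc_slot(&ci));
	move_cluster(&si, &ci, 1);
	printf("cluster moved to list %d\n", ci.list_id);
	return 0;
}

(Compile with "cc -pthread"; the point is only that the device lock is
off the common path and nests inside the cluster lock.)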
This series also removes the slot cache from the freeing path. The
performance is even better without it now, and this enables other
cleanups and optimizations as discussed before:
https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/

After this series, lock contention on si->lock is nearly unobservable
with `perf lock` in the same test as above:

  contended   total wait    max wait    avg wait        type   caller
  ... snip ...
         91    204.62 us     4.51 us     2.25 us    spinlock   cluster_move+0x2e
  ... snip ...
         47    125.62 us     4.47 us     2.67 us    spinlock   cluster_move+0x2e
  ... snip ...
         23     63.15 us     3.95 us     2.74 us    spinlock   cluster_move+0x2e
  ... snip ...
         17     41.26 us     4.58 us     2.43 us    spinlock   cluster_isolate_lock+0x1d
  ... snip ...

`cluster_move` and `cluster_isolate_lock` (two newly introduced helpers)
are basically the only users of si->lock now, the performance gain is
huge, and LOC is reduced.

Test Results:

vm-scalability
==============
Running `usemem --init-time -O -y -x -R -31 1G` from vm-scalability in a
12G memory cgroup using simulated pmem as the SWAP backend (32G pmem,
32 CPUs). 4K folios are used by default; 64K mTHP and sequential access
(!-R) results are also provided. 6 test runs for each case, total
throughput:

Test             Before (KB/s) (stdev)    After (KB/s) (stdev)     Delta
---------------------------------------------------------------------------
Random (4K):     69937.11 (16449.77)      369816.17 (24476.68)     +428.78%
Random (64K):    123442.83 (13207.51)     216379.00 (25024.83)     +75.28%
Sequential (4K): 6313909.83 (148856.12)   6419860.66 (183563.38)   +1.7%

Sequential access puts lower stress on the allocator, so the gain there
is limited, but with random access (which is much closer to real
workloads) the performance gain is huge.

Build kernel with defconfig on tmpfs with ZRAM
==============================================
The results below show a test matrix using different memory cgroup
limits and job numbers, scaled up progressively for an intuitive
picture. Done on a 48c96t system, 6 test runs for each case. It can be
seen clearly that the higher the concurrent job number, the higher the
performance gain, but even -j6 shows a slight improvement.

make -j            | System Time (seconds)    | Total Time (seconds)
(NR / Mem / ZRAM)  | (Before / After / Delta) | (Before / After / Delta)

With 4K pages only:
  6 / 192M / 3G    | 1533 / 1522 / -0.7%      | 1420 / 1414 / -0.3%
 12 / 256M / 4G    | 2275 / 2226 / -2.2%      |  758 /  742 / -2.1%
 24 / 384M / 5G    | 3596 / 3154 / -12.3%     |  476 /  422 / -11.3%
 48 / 768M / 7G    | 8159 / 3605 / -55.8%     |  330 /  221 / -33.0%
 96 / 1.5G / 10G   | 18541 / 6462 / -65.1%    |  283 /  180 / -36.4%

With 64K mTHP:
 24 / 512M / 5G    | 3585 / 3469 / -3.2%      |  293 /  290 / -0.1%
 48 / 1G / 7G      | 8173 / 3607 / -55.9%     |  251 /  158 / -37.0%
 96 / 2G / 10G     | 16305 / 7791 / -52.2%    |  226 /  144 / -36.3%

Fragmentation is reduced too.
With make -j96 / 1152M memcg, 64K mTHP (avg of 4 test runs):

Before:
  hugepages-64kB/stats/swpout: 1696184
  hugepages-64kB/stats/swpout_fallback: 414318
After: (-63.2% mTHP swapout failure)
  hugepages-64kB/stats/swpout: 1866267
  hugepages-64kB/stats/swpout_fallback: 158330

There is up to a 65.1% improvement in sys time for the kernel build
test, and a lower fragmentation rate.
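As a side note on reading those two counters: swpout_fallback roughly
counts mTHP swapout attempts that could not get contiguous swap slots
and had to fall back, so a fallback rate can be derived as
fallback / (fallback + swpout). A minimal sketch of that arithmetic on
the single set of numbers quoted above (the -63.2% figure is an average
over the 4 runs, so a direct computation here differs slightly):

/* Illustrative only: derive fallback rates from the counters above. */
#include <stdio.h>

int main(void)
{
	double swpout_before = 1696184, fallback_before = 414318;
	double swpout_after  = 1866267, fallback_after  = 158330;

	double rate_before = fallback_before / (swpout_before + fallback_before);
	double rate_after  = fallback_after  / (swpout_after  + fallback_after);

	printf("fallback rate before: %.1f%%\n", rate_before * 100);  /* ~19.6% */
	printf("fallback rate after:  %.1f%%\n", rate_after * 100);   /* ~7.8% */
	printf("fallback count change: %.1f%%\n",
	       (fallback_after / fallback_before - 1) * 100);         /* ~-61.8% */
	return 0;
}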
Build kernel with tinyconfig on tmpfs with HDD as swap:
=======================================================
This test is similar to the above, but HDD tests are very noisy and
slow and the deviation is huge, so tinyconfig is used instead and the
median result of 3 test runs is taken, which looks OK:

Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this series:
113.90user 23.81system 38:11.77elapsed 6%CPU
2548728inputs+0outputs (235471major+4238110minor)pagefaults

Single thread SWAP:
===================
Sequential SWAP should also be slightly faster, as a lot of unnecessary
parts were removed. Test using a micro benchmark swapping out/in 4G of
zero memory using ZRAM, 10 test runs:

Swapout Before (avg. 3359304):
3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776

Swapin Before (avg. 1928698):
1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155

Swapout After (avg. 3347511, -0.4%):
3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359

Swapin After (avg. 1922290, -0.3%):
1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913

The gain is within noise level but seems slightly better.

V1: https://lore.kernel.org/linux-mm/20241022192451.38138-1-ryncsn@gmail.com/

Updates:
- Retested some tests after rebasing on top of the latest mm-unstable;
  the new cgroup lock removal increased the performance gain of this
  series too. Some results are basically the same as before, so they
  are unchanged:
  https://lore.kernel.org/linux-mm/20241218114633.85196-1-ryncsn@gmail.com/
- Reworked the off-list bit handling to make it easier to review and
  more robust, also reducing LOC [Chris Li].
- Code style improvements and minor code optimizations [Chris Li].
- Fixed a potential swapoff race issue due to a missing SWP_WRITEOK
  check [Huang Ying].
- Added a vm-scalability test with pmem [Huang Ying].

Suggested-by: Chris Li
Signed-off-by: Kairui Song

Kairui Song (13):
  mm, swap: minor clean up for swap entry allocation
  mm, swap: fold swap_info_get_cont in the only caller
  mm, swap: remove old allocation path for HDD
  mm, swap: use cluster lock for HDD
  mm, swap: clean up device availability check
  mm, swap: clean up plist removal and adding
  mm, swap: hold a reference during scan and cleanup flag usage
  mm, swap: use an enum to define all cluster flags and wrap flags changes
  mm, swap: reduce contention on device lock
  mm, swap: simplify percpu cluster updating
  mm, swap: introduce a helper for retrieving cluster from offset
  mm, swap: use a global swap cluster for non-rotation devices
  mm, swap_slots: remove slot cache for freeing path

 fs/btrfs/inode.c           |    1 -
 fs/iomap/swapfile.c        |    1 -
 include/linux/swap.h       |   34 +-
 include/linux/swap_slots.h |    3 -
 mm/page_io.c               |    1 -
 mm/swap_slots.c            |   78 +--
 mm/swapfile.c              | 1246 ++++++++++++++++--------------------
 7 files changed, 591 insertions(+), 773 deletions(-)