From patchwork Mon Feb 24 18:02:10 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kairui Song
X-Patchwork-Id: 13988666
From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Hugh Dickins, Yosry Ahmed,
 "Huang, Ying", Baoquan He, Nhat Pham, Johannes Weiner, Baolin Wang,
 Kalesh Singh, Matthew Wilcox, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v2 5/7] mm, swap: use percpu cluster as allocation fast path
Date: Tue, 25 Feb 2025 02:02:10 +0800
Message-ID: <20250224180212.22802-6-ryncsn@gmail.com>
X-Mailer: git-send-email 2.48.1
In-Reply-To: <20250224180212.22802-1-ryncsn@gmail.com>
References: <20250224180212.22802-1-ryncsn@gmail.com>
Reply-To: Kairui Song

From: Kairui Song

The current allocation workflow first traverses the plist with a global
lock held, and after choosing a device, it uses the percpu cluster on
that swap device. This commit moves the percpu cluster variable out of
being tied to individual swap devices, making it a global percpu
variable that is used directly for allocation as a fast path.

The global percpu cluster variable will never point to an HDD device,
and allocations on HDD devices are still globally serialized. This
improves the allocator performance and prepares for removal of the slot
cache in later commits.

There shouldn't be much observable behavior change, except one thing:
this changes how swap device allocation rotation works. Currently, each
allocation rotates the plist, and because of the slot cache (one order 0
allocation usually returns 64 entries), swap devices of the same
priority are rotated for every 64 order 0 entries consumed. High order
allocations are different: they bypass the slot cache, so the swap
device is rotated for every 16K, 32K, or up to 2M allocated.

The rotation rule was never clearly defined or documented; it has been
changed several times without being mentioned. After this commit, and
once the slot cache is gone in later commits, swap device rotation will
happen for every consumed cluster. Ideally, non-HDD devices will be
rotated once 2M of space has been consumed for each order; fragmented
clusters will rotate the device faster, which seems acceptable. HDD
devices are rotated for every allocation regardless of the allocation
order, which should also be fine and is trivial.
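For reference, here is the core of the new fast path, consolidated from
the hunks below (excerpted from this patch; see the diff for full
context):

    /* A global percpu cluster replaces the per-device percpu_cluster */
    struct percpu_swap_cluster {
            struct swap_info_struct *si[SWAP_NR_ORDERS];
            unsigned long offset[SWAP_NR_ORDERS];
            local_lock_t lock;
    };

    /* get_swap_pages() now tries the current CPU's cached cluster first */
    local_lock(&percpu_swap_cluster.lock);
    n_ret = swap_alloc_fast(swp_entries, SWAP_HAS_CACHE, order, n_goal);
    if (n_ret == n_goal)
            goto out;
    /* ...otherwise rotate the plist, pick a device, and scan as before. */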
This commit also slightly changes allocation behaviour for the slot
cache: the newly added cluster allocation fast path may allocate entries
from a different device into the slot cache. This is not observable from
user space, impacts performance only very slightly, and the slot cache
will be gone entirely in the next commit, so it can be ignored.

Signed-off-by: Kairui Song
---
 include/linux/swap.h |  11 ++--
 mm/swapfile.c        | 136 +++++++++++++++++++++++++++++--------------
 2 files changed, 95 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe91c293636..374bffc87427 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -284,12 +284,10 @@ enum swap_cluster_flags {
 #endif
 
 /*
- * We assign a cluster to each CPU, so each CPU can allocate swap entry from
- * its own cluster and swapout sequentially. The purpose is to optimize swapout
- * throughput.
+ * We keep using the same cluster for rotational devices so IO will be
+ * sequential. The purpose is to optimize SWAP throughput on these devices.
  */
-struct percpu_cluster {
-	local_lock_t lock; /* Protect the percpu_cluster above */
+struct swap_sequential_cluster {
 	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
@@ -315,8 +313,7 @@ struct swap_info_struct {
 	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
-	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
-	struct percpu_cluster *global_cluster; /* Use one global cluster for rotating device */
+	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
 	spinlock_t global_cluster_lock;	/* Serialize usage of global cluster */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index db836670c334..7caaaea95408 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
 
 atomic_t nr_rotate_swap = ATOMIC_INIT(0);
 
+struct percpu_swap_cluster {
+	struct swap_info_struct *si[SWAP_NR_ORDERS];
+	unsigned long offset[SWAP_NR_ORDERS];
+	local_lock_t lock;
+};
+
+static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
+	.si = { NULL },
+	.offset = { SWAP_ENTRY_INVALID },
+	.lock = INIT_LOCAL_LOCK(),
+};
+
 static struct swap_info_struct *swap_type_to_swap_info(int type)
 {
 	if (type >= MAX_SWAPFILES)
@@ -539,7 +551,7 @@ static bool swap_do_scheduled_discard(struct swap_info_struct *si)
 		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
 		/*
 		 * Delete the cluster from list to prepare for discard, but keep
-		 * the CLUSTER_FLAG_DISCARD flag, there could be percpu_cluster
+		 * the CLUSTER_FLAG_DISCARD flag, percpu_swap_cluster could be
 		 * pointing to it, or ran into by relocate_cluster.
 		 */
 		list_del(&ci->list);
@@ -805,10 +817,12 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
-	if (si->flags & SWP_SOLIDSTATE)
-		__this_cpu_write(si->percpu_cluster->next[order], next);
-	else
+	if (si->flags & SWP_SOLIDSTATE) {
+		__this_cpu_write(percpu_swap_cluster.si[order], si);
+		__this_cpu_write(percpu_swap_cluster.offset[order], next);
+	} else {
 		si->global_cluster->next[order] = next;
+	}
 	return found;
 }
 
@@ -862,9 +876,8 @@ static void swap_reclaim_work(struct work_struct *work)
 }
 
 /*
- * Try to get swap entries with specified order from current cpu's swap entry
- * pool (a cluster). This might involve allocating a new cluster for current CPU
- * too.
+ * Try to allocate swap entries with specified order and try to set a new
+ * cluster for the current CPU too.
  */
 static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 					      unsigned char usage)
@@ -872,18 +885,12 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
 
-	if (si->flags & SWP_SOLIDSTATE) {
-		/* Fast path using per CPU cluster */
-		local_lock(&si->percpu_cluster->lock);
-		offset = __this_cpu_read(si->percpu_cluster->next[order]);
-	} else {
+	if (!(si->flags & SWP_SOLIDSTATE)) {
 		/* Serialize HDD SWAP allocation for each device. */
 		spin_lock(&si->global_cluster_lock);
 		offset = si->global_cluster->next[order];
-	}
-
-	if (offset) {
 		ci = lock_cluster(si, offset);
+		/* Cluster could have been used by another order */
 		if (cluster_is_usable(ci, order)) {
 			if (cluster_is_empty(ci))
@@ -973,9 +980,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		}
 	}
 done:
-	if (si->flags & SWP_SOLIDSTATE)
-		local_unlock(&si->percpu_cluster->lock);
-	else
+	if (!(si->flags & SWP_SOLIDSTATE))
 		spin_unlock(&si->global_cluster_lock);
 	return found;
 }
@@ -1196,6 +1201,49 @@ static bool get_swap_device_info(struct swap_info_struct *si)
 	return true;
 }
 
+/*
+ * Fast path: try to get swap entries with specified order from the
+ * current CPU's swap entry pool (a cluster).
+ */
+static int swap_alloc_fast(swp_entry_t entries[],
+			   unsigned char usage,
+			   int order, int n_goal)
+{
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *si;
+	unsigned int offset, found;
+	int n_ret = 0;
+
+	n_goal = min(n_goal, SWAP_BATCH);
+
+	/*
+	 * Once allocated, swap_info_struct will never be completely freed,
+	 * so checking its liveness by get_swap_device_info is enough.
+	 */
+	si = __this_cpu_read(percpu_swap_cluster.si[order]);
+	offset = __this_cpu_read(percpu_swap_cluster.offset[order]);
+	if (!si || !offset || !get_swap_device_info(si))
+		return 0;
+
+	while (offset) {
+		ci = lock_cluster(si, offset);
+		if (!cluster_is_usable(ci, order))
+			break;
+		if (cluster_is_empty(ci))
+			offset = cluster_offset(si, ci);
+		found = alloc_swap_scan_cluster(si, ci, offset, order, usage);
+		if (!found)
+			break;
+		entries[n_ret++] = swp_entry(si->type, found);
+		if (n_ret == n_goal)
+			break;
+		offset = __this_cpu_read(percpu_swap_cluster.offset[order]);
+	}
+
+	put_swap_device(si);
+	return n_ret;
+}
+
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 {
 	int order = swap_entry_order(entry_order);
@@ -1204,19 +1252,36 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 	int n_ret = 0;
 	int node;
 
+	/* Fast path using percpu cluster */
+	local_lock(&percpu_swap_cluster.lock);
+	n_ret = swap_alloc_fast(swp_entries,
+				SWAP_HAS_CACHE,
+				order, n_goal);
+	if (n_ret == n_goal)
+		goto out;
+
+	n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
+	/* Rotate the device and switch to a new cluster */
 	spin_lock(&swap_avail_lock);
 start_over:
 	node = numa_node_id();
 	plist_for_each_entry_safe(si, next, &swap_avail_heads[node], avail_lists[node]) {
-		/* requeue si to after same-priority siblings */
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
-			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-						    n_goal, swp_entries, order);
+			/*
+			 * For order 0 allocation, try best to fill the request
+			 * as it's used by slot cache.
+			 *
+			 * For mTHP allocation, it always has n_goal == 1,
+			 * and failing an mTHP swapin will just make the caller
+			 * fall back to order 0 allocation, so just bail out.
+			 */
+			n_ret += scan_swap_map_slots(si, SWAP_HAS_CACHE, n_goal,
+						     swp_entries + n_ret, order);
 			put_swap_device(si);
 			if (n_ret || size > 1)
-				goto check_out;
+				goto out;
 		}
 
 		spin_lock(&swap_avail_lock);
@@ -1234,12 +1299,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		if (plist_node_empty(&next->avail_lists[node]))
 			goto start_over;
 	}
-
 	spin_unlock(&swap_avail_lock);
-
-check_out:
+out:
+	local_unlock(&percpu_swap_cluster.lock);
 	atomic_long_sub(n_ret * size, &nr_swap_pages);
-
 	return n_ret;
 }
 
@@ -2725,8 +2788,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
-	free_percpu(p->percpu_cluster);
-	p->percpu_cluster = NULL;
 	kfree(p->global_cluster);
 	p->global_cluster = NULL;
 	vfree(swap_map);
@@ -3125,7 +3186,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
 	struct swap_cluster_info *cluster_info;
 	unsigned long i, j, idx;
-	int cpu, err = -ENOMEM;
+	int err = -ENOMEM;
 
 	cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
 	if (!cluster_info)
@@ -3134,20 +3195,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	if (si->flags & SWP_SOLIDSTATE) {
-		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
-		if (!si->percpu_cluster)
-			goto err_free;
-
-		for_each_possible_cpu(cpu) {
-			struct percpu_cluster *cluster;
-
-			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
-			for (i = 0; i < SWAP_NR_ORDERS; i++)
-				cluster->next[i] = SWAP_ENTRY_INVALID;
-			local_lock_init(&cluster->lock);
-		}
-	} else {
+	if (!(si->flags & SWP_SOLIDSTATE)) {
 		si->global_cluster = kmalloc(sizeof(*si->global_cluster),
 					     GFP_KERNEL);
 		if (!si->global_cluster)
@@ -3424,8 +3472,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
-	free_percpu(si->percpu_cluster);
-	si->percpu_cluster = NULL;
 	kfree(si->global_cluster);
 	si->global_cluster = NULL;
 	inode = NULL;