From patchwork Wed Oct 25 14:45:43 2023
X-Patchwork-Submitter: Ryan Roberts <ryan.roberts@arm.com>
X-Patchwork-Id: 13436306
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
    Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org
Subject: [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
Date: Wed, 25 Oct 2023 15:45:43 +0100
Message-Id: <20231025144546.577640-2-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20231025144546.577640-1-ryan.roberts@arm.com>
References: <20231025144546.577640-1-ryan.roberts@arm.com>

As preparation for supporting small-sized THP in the swap-out path,
without first needing to split to order-0, remove CLUSTER_FLAG_HUGE,
which, when present, always implies a PMD-sized THP, which is the same
size as the cluster.

The only use of the flag was to determine whether a swap entry refers
to a single page or a PMD-sized THP in swap_page_trans_huge_swapped().
Instead of relying on the flag, we now pass in nr_pages, which
originates from the folio's number of pages. This allows the logic to
work for folios of any order.

The one snag is that one of the swap_page_trans_huge_swapped() call
sites does not have the folio. But it was only being called there to
avoid bothering to call __try_to_reclaim_swap() in some cases.
__try_to_reclaim_swap() gets the folio and (via some other functions)
calls swap_page_trans_huge_swapped(), so I've removed the problematic
call site and believe the new logic should be equivalent.

Removing CLUSTER_FLAG_HUGE also means we can remove
split_swap_cluster(), which used to be called during folio splitting,
since split_swap_cluster()'s only job was to clear the flag.
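To make the shape of the change easier to see before reading the diff,
here is a small self-contained userspace model of the idea (the swap-map
array and function names below are simplified stand-ins, not the kernel's
own code): the caller derives nr_pages from the folio, and the check then
scans that many contiguous entries instead of keying off a per-cluster
flag.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for the swap map: one usage count per swap entry. */
static unsigned char swap_map[64];

/*
 * Model of the reworked check: with nr_pages == 1 only the entry itself is
 * examined; for a large folio, all nr_pages entries covering it are scanned.
 */
static bool any_entry_swapped(unsigned long offset, unsigned int nr_pages)
{
	/* Align down to the folio boundary, as the kernel helper does. */
	unsigned long start = offset & ~((unsigned long)nr_pages - 1);
	unsigned int i;

	if (nr_pages == 1)
		return swap_map[offset] != 0;

	for (i = 0; i < nr_pages; i++) {
		if (swap_map[start + i])
			return true;
	}
	return false;
}

int main(void)
{
	swap_map[17] = 1;	/* one entry of an order-2 (4-page) folio is still in use */

	printf("order-0 at 16: %d\n", any_entry_swapped(16, 1));	/* prints 0 */
	printf("order-2 at 16: %d\n", any_entry_swapped(16, 4));	/* prints 1 */
	return 0;
}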
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h | 10 ----------
 mm/huge_memory.c     |  3 ---
 mm/swapfile.c        | 47 ++++++++------------------------------------
 3 files changed, 8 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 19f30a29e1f1..a073366a227c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,6 @@ struct swap_cluster_info {
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
-#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
@@ -595,15 +594,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */
 
-#ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
-#else
-static inline int split_swap_cluster(swp_entry_t entry)
-{
-	return 0;
-}
-#endif
-
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f31f02472396..b411dd4f1612 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2598,9 +2598,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		shmem_uncharge(head->mapping->host, nr_dropped);
 	remap_page(folio, nr);
 
-	if (folio_test_swapcache(folio))
-		split_swap_cluster(folio->swap);
-
 	for (i = 0; i < nr; i++) {
 		struct page *subpage = head + i;
 		if (subpage == page)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e52f486834eb..b83ad77e04c0 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -342,18 +342,6 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }
 
-static inline bool cluster_is_huge(struct swap_cluster_info *info)
-{
-	if (IS_ENABLED(CONFIG_THP_SWAP))
-		return info->flags & CLUSTER_FLAG_HUGE;
-	return false;
-}
-
-static inline void cluster_clear_huge(struct swap_cluster_info *info)
-{
-	info->flags &= ~CLUSTER_FLAG_HUGE;
-}
-
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -1021,7 +1009,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 	offset = idx * SWAPFILE_CLUSTER;
 	ci = lock_cluster(si, offset);
 	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
+	cluster_set_count(ci, SWAPFILE_CLUSTER);
 
 	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
 	unlock_cluster(ci);
@@ -1354,7 +1342,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 
 	ci = lock_cluster_or_swap_info(si, offset);
 	if (size == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!cluster_is_huge(ci));
 		map = si->swap_map + offset;
 		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
 			val = map[i];
@@ -1362,7 +1349,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 			if (val == SWAP_HAS_CACHE)
 				free_entries++;
 		}
-		cluster_clear_huge(ci);
 		if (free_entries == SWAPFILE_CLUSTER) {
 			unlock_cluster_or_swap_info(si, ci);
 			spin_lock(&si->lock);
@@ -1384,23 +1370,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster_or_swap_info(si, ci);
 }
 
-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-
-	si = _swap_info_get(entry);
-	if (!si)
-		return -EBUSY;
-	ci = lock_cluster(si, offset);
-	cluster_clear_huge(ci);
-	unlock_cluster(ci);
-	return 0;
-}
-#endif
-
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
 
@@ -1508,22 +1477,23 @@ int swp_swapcount(swp_entry_t entry)
 }
 
 static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
-					 swp_entry_t entry)
+					 swp_entry_t entry,
+					 unsigned int nr_pages)
 {
 	struct swap_cluster_info *ci;
 	unsigned char *map = si->swap_map;
 	unsigned long roffset = swp_offset(entry);
-	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+	unsigned long offset = round_down(roffset, nr_pages);
 	int i;
 	bool ret = false;
 
 	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || !cluster_is_huge(ci)) {
+	if (!ci || nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
 	}
-	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		if (swap_count(map[offset + i])) {
 			ret = true;
 			break;
@@ -1545,7 +1515,7 @@ static bool folio_swapped(struct folio *folio)
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
 		return swap_swapcount(si, entry) != 0;
 
-	return swap_page_trans_huge_swapped(si, entry);
+	return swap_page_trans_huge_swapped(si, entry, folio_nr_pages(folio));
 }
 
 /**
@@ -1606,8 +1576,7 @@ int free_swap_and_cache(swp_entry_t entry)
 	p = _swap_info_get(entry);
 	if (p) {
 		count = __swap_entry_free(p, entry);
-		if (count == SWAP_HAS_CACHE &&
-		    !swap_page_trans_huge_swapped(p, entry))
+		if (count == SWAP_HAS_CACHE)
 			__try_to_reclaim_swap(p, swp_offset(entry),
 					      TTRS_UNMAPPED | TTRS_FULL);
 	}

From patchwork Wed Oct 25 14:45:44 2023
X-Patchwork-Submitter: Ryan Roberts <ryan.roberts@arm.com>
X-Patchwork-Id: 13436307
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
    Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org
Subject: [PATCH v3 2/4] mm: swap: Remove struct percpu_cluster
Date: Wed, 25 Oct 2023 15:45:44 +0100
Message-Id: <20231025144546.577640-3-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20231025144546.577640-1-ryan.roberts@arm.com>
References: <20231025144546.577640-1-ryan.roberts@arm.com>

struct percpu_cluster stores the index of the CPU's current cluster and
the offset of the next entry that will be allocated for that CPU. These
two pieces of information are redundant because the cluster index is
just (offset / SWAPFILE_CLUSTER). The only reason for explicitly
keeping the cluster index is that the structure used for it also has a
flag to indicate "no cluster". However, this data structure also
contains a spin lock, which is never used in this context; as a side
effect the code copies the spinlock_t structure, which is questionable
coding practice in my view.

So let's clean this up and store only the next offset, and use a
sentinel value (SWAP_NEXT_NULL) to indicate "no cluster". SWAP_NEXT_NULL
is chosen to be 0 because 0 will never be seen legitimately: the first
page in the swap file is the swap header, which is always marked bad to
prevent it from being allocated as an entry. This also prevents the
cluster to which it belongs being marked free, so it will never appear
on the free list.

This change saves 16 bytes per CPU. And given we are shortly going to
extend this mechanism to be per-cpu-AND-per-order, we will end up
saving 16 * 9 = 144 bytes per CPU, which adds up if you have 256 CPUs
in the system.
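The bookkeeping that replaces struct percpu_cluster can be seen in a
minimal, self-contained userspace sketch (the constant values and helper
names below are illustrative stand-ins, not the kernel's): only the next
offset is kept per CPU, the cluster index is recomputed on demand, and 0
doubles as the "no cluster" sentinel because offset 0 - the swap header -
can never be handed out.

#include <stdio.h>

#define SWAPFILE_CLUSTER	256	/* entries per cluster (illustrative) */
#define SWAP_NEXT_NULL		0	/* offset 0 is the swap header, never allocated */

/* Per-CPU state shrinks to a single offset. */
static unsigned int cpu_next = SWAP_NEXT_NULL;

static unsigned int cluster_of(unsigned int offset)
{
	return offset / SWAPFILE_CLUSTER;	/* index is derivable, no need to store it */
}

int main(void)
{
	if (cpu_next == SWAP_NEXT_NULL)
		cpu_next = 3 * SWAPFILE_CLUSTER;	/* pretend cluster 3 came off the free list */

	printf("allocate offset %u from cluster %u\n", cpu_next, cluster_of(cpu_next));

	cpu_next += 1;
	if (cpu_next % SWAPFILE_CLUSTER == 0)
		cpu_next = SWAP_NEXT_NULL;		/* cluster exhausted: back to the sentinel */

	printf("next offset is now %u\n", cpu_next);
	return 0;
}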
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 include/linux/swap.h | 21 +++++++++++++--------
 mm/swapfile.c        | 43 +++++++++++++++++++------------------------
 2 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index a073366a227c..0ca8aaa098ba 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -261,14 +261,12 @@ struct swap_cluster_info {
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
 
 /*
- * We assign a cluster to each CPU, so each CPU can allocate swap entry from
- * its own cluster and swapout sequentially. The purpose is to optimize swapout
- * throughput.
+ * The first page in the swap file is the swap header, which is always marked
+ * bad to prevent it from being allocated as an entry. This also prevents the
+ * cluster to which it belongs being marked free. Therefore 0 is safe to use as
+ * a sentinel to indicate cpu_next is not valid in swap_info_struct.
  */
-struct percpu_cluster {
-	struct swap_cluster_info index; /* Current cluster index */
-	unsigned int next; /* Likely next allocation offset */
-};
+#define SWAP_NEXT_NULL 0
 
 struct swap_cluster_list {
 	struct swap_cluster_info head;
@@ -295,7 +293,14 @@ struct swap_info_struct {
 	unsigned int cluster_next;	/* likely index for next allocation */
 	unsigned int cluster_nr;	/* countdown to next cluster search */
 	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
-	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
+	unsigned int __percpu *cpu_next;/*
+					 * Likely next allocation offset. We
+					 * assign a cluster to each CPU, so each
+					 * CPU can allocate swap entry from its
+					 * own cluster and swapout sequentially.
+					 * The purpose is to optimize swapout
+					 * throughput.
+					 */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
 	struct file *swap_file;		/* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b83ad77e04c0..617e34b8cdbe 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -591,7 +591,6 @@ static bool
 scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 	unsigned long offset)
 {
-	struct percpu_cluster *percpu_cluster;
 	bool conflict;
 
 	offset /= SWAPFILE_CLUSTER;
@@ -602,8 +601,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 	if (!conflict)
 		return false;
 
-	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
-	cluster_set_null(&percpu_cluster->index);
+	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
 	return true;
 }
 
@@ -614,16 +612,16 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	unsigned long *offset, unsigned long *scan_base)
 {
-	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
-	unsigned long tmp, max;
+	unsigned int tmp, max;
+	unsigned int *cpu_next;
 
 new_cluster:
-	cluster = this_cpu_ptr(si->percpu_cluster);
-	if (cluster_is_null(&cluster->index)) {
+	cpu_next = this_cpu_ptr(si->cpu_next);
+	tmp = *cpu_next;
+	if (tmp == SWAP_NEXT_NULL) {
 		if (!cluster_list_empty(&si->free_clusters)) {
-			cluster->index = si->free_clusters.head;
-			cluster->next = cluster_next(&cluster->index) *
+			tmp = cluster_next(&si->free_clusters.head) *
 					SWAPFILE_CLUSTER;
 		} else if (!cluster_list_empty(&si->discard_clusters)) {
 			/*
@@ -643,9 +641,8 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	 * Other CPUs can use our cluster if they can't find a free cluster,
 	 * check if there is still free entry in the cluster
 	 */
-	tmp = cluster->next;
 	max = min_t(unsigned long, si->max,
-		    (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER);
+		    ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER);
 	if (tmp < max) {
 		ci = lock_cluster(si, tmp);
 		while (tmp < max) {
@@ -656,12 +653,13 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 		unlock_cluster(ci);
 	}
 	if (tmp >= max) {
-		cluster_set_null(&cluster->index);
+		*cpu_next = SWAP_NEXT_NULL;
 		goto new_cluster;
 	}
-	cluster->next = tmp + 1;
 	*offset = tmp;
 	*scan_base = tmp;
+	tmp += 1;
+	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
 	return true;
 }
 
@@ -2488,8 +2486,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	arch_swap_invalidate_area(p->type);
 	zswap_swapoff(p->type);
 	mutex_unlock(&swapon_mutex);
-	free_percpu(p->percpu_cluster);
-	p->percpu_cluster = NULL;
+	free_percpu(p->cpu_next);
+	p->cpu_next = NULL;
 	free_percpu(p->cluster_next_cpu);
 	p->cluster_next_cpu = NULL;
 	vfree(swap_map);
@@ -3073,16 +3071,13 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		for (ci = 0; ci < nr_cluster; ci++)
 			spin_lock_init(&((cluster_info + ci)->lock));
 
-		p->percpu_cluster = alloc_percpu(struct percpu_cluster);
-		if (!p->percpu_cluster) {
+		p->cpu_next = alloc_percpu(unsigned int);
+		if (!p->cpu_next) {
 			error = -ENOMEM;
 			goto bad_swap_unlock_inode;
 		}
-		for_each_possible_cpu(cpu) {
-			struct percpu_cluster *cluster;
-			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
-			cluster_set_null(&cluster->index);
-		}
+		for_each_possible_cpu(cpu)
+			per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
 	} else {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
@@ -3171,8 +3166,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
-	free_percpu(p->percpu_cluster);
-	p->percpu_cluster = NULL;
+	free_percpu(p->cpu_next);
+	p->cpu_next = NULL;
 	free_percpu(p->cluster_next_cpu);
 	p->cluster_next_cpu = NULL;
 	if (inode && S_ISBLK(inode->i_mode) && p->bdev) {

From patchwork Wed Oct 25 14:45:45 2023
X-Patchwork-Submitter: Ryan Roberts <ryan.roberts@arm.com>
X-Patchwork-Id: 13436308
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
    Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org
Subject: [PATCH v3 3/4] mm: swap: Simplify ssd behavior when scanner steals entry
Date: Wed, 25 Oct 2023 15:45:45 +0100
Message-Id: <20231025144546.577640-4-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20231025144546.577640-1-ryan.roberts@arm.com>
References: <20231025144546.577640-1-ryan.roberts@arm.com>

When a CPU fails to reserve a cluster (due to free list exhaustion), we
revert to the scanner to find a free entry somewhere in the swap file.
This might cause an entry to be stolen from another CPU's reserved
cluster. Upon noticing this, the CPU with the stolen entry would
previously scan forward to the end of the cluster trying to find a free
entry to use. If there were none, it would try to reserve a new per-cpu
cluster and allocate from that.

This scanning behavior does not scale well to high-order allocations,
which will be introduced in a future patch, since it would need to scan
for a contiguous area that is naturally aligned. Given stealing is a
rare occurrence, let's remove the scanning behavior from the ssd
allocator and simply drop the cluster and try to allocate a new one.
Given the purpose of the per-cpu cluster is to ensure a given task's
pages are sequential on disk to aid readahead, allocating a new cluster
at this point makes the most sense.

Furthermore, si->max will always be greater than or equal to the end of
the last cluster, because any partial cluster will never be put on the
free cluster list. Therefore we can simplify this logic too.

These changes make it simpler to generalize
scan_swap_map_try_ssd_cluster() to handle any allocation order.
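For clarity, the policy change can be modeled in a few lines of
self-contained userspace C (the array, variable names and numbers below
are illustrative placeholders, not the kernel code): when the expected
entry has been stolen, the per-CPU state is simply reset so that the next
attempt reserves a fresh cluster, instead of scanning forward within the
old one.

#include <stdbool.h>
#include <stdio.h>

#define SWAP_NEXT_NULL	0

/* Illustrative stand-ins for the swap map and the per-CPU next offset. */
static unsigned char swap_map[1024];
static unsigned int cpu_next = 512;

/*
 * Model of the simplified policy: if the entry we expected to hand out has
 * been stolen by the scanner, give up on this cluster entirely (reset the
 * per-CPU state) rather than scanning forward for another free slot.
 */
static bool try_alloc_expected(unsigned int *offset)
{
	unsigned int tmp = cpu_next;

	if (tmp == SWAP_NEXT_NULL)
		return false;		/* caller would reserve a fresh cluster here */

	if (swap_map[tmp]) {		/* stolen: drop the cluster, no forward scan */
		cpu_next = SWAP_NEXT_NULL;
		return false;
	}

	swap_map[tmp] = 1;
	*offset = tmp;
	cpu_next = tmp + 1;
	return true;
}

int main(void)
{
	unsigned int offset;

	swap_map[512] = 1;		/* another CPU's scanner stole our expected entry */
	printf("first try: %s\n", try_alloc_expected(&offset) ? "ok" : "dropped cluster");
	printf("cpu_next is %u (sentinel)\n", cpu_next);
	return 0;
}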
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/swapfile.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 617e34b8cdbe..94f7cc225eb9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -639,27 +639,24 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 
 	/*
 	 * Other CPUs can use our cluster if they can't find a free cluster,
-	 * check if there is still free entry in the cluster
+	 * check if the expected entry is still free. If not, drop it and
+	 * reserve a new cluster.
 	 */
-	max = min_t(unsigned long, si->max,
-		    ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER);
-	if (tmp < max) {
-		ci = lock_cluster(si, tmp);
-		while (tmp < max) {
-			if (!si->swap_map[tmp])
-				break;
-			tmp++;
-		}
+	ci = lock_cluster(si, tmp);
+	if (si->swap_map[tmp]) {
 		unlock_cluster(ci);
-	}
-	if (tmp >= max) {
 		*cpu_next = SWAP_NEXT_NULL;
 		goto new_cluster;
 	}
+	unlock_cluster(ci);
+
 	*offset = tmp;
 	*scan_base = tmp;
+
+	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
 	tmp += 1;
 	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
+
 	return true;
 }

From patchwork Wed Oct 25 14:45:46 2023
X-Patchwork-Submitter: Ryan Roberts <ryan.roberts@arm.com>
X-Patchwork-Id: 13436309
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
    Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang
Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org
Subject: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
Date: Wed, 25 Oct 2023 15:45:46 +0100
Message-Id: <20231025144546.577640-5-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20231025144546.577640-1-ryan.roberts@arm.com>
References: <20231025144546.577640-1-ryan.roberts@arm.com>

The upcoming anonymous small-sized THP feature enables performance
improvements by allocating large folios for anonymous memory. However,
I've observed that on an arm64 system running a parallel workload (e.g.
kernel compilation) across many cores, under high memory pressure, the
speed regresses. This is due to bottlenecking on the increased number
of TLBIs added due to all the extra folio splitting.

Therefore, solve this regression by adding support for swapping out
small-sized THP without needing to split the folio, just as is already
done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP
is enabled, and when the swap backing store is a non-rotating block
device. These are the same constraints as for the existing PMD-sized
THP swap-out support. Note that no attempt is made to swap in THP here
- this is still done page-by-page, as for PMD-sized THP.

The main change here is to improve the swap entry allocator so that it
can allocate any power-of-2 number of contiguous entries between
[1, (1 << PMD_ORDER)]. This is done by allocating a cluster for each
distinct order and allocating sequentially from it until the cluster is
full. This ensures that we don't need to search the map, and we get no
fragmentation due to alignment padding for different orders in the
cluster. If there is no current cluster for a given order, we attempt
to allocate a free cluster from the list. If there are no free
clusters, we fail the allocation and the caller falls back to splitting
the folio and allocating individual entries (as per the existing
PMD-sized THP fallback).

The per-order current clusters are maintained per-cpu using the
existing infrastructure. This is done to avoid interleaving pages from
different tasks, which would prevent IO from being batched. This is
already done for the order-0 allocations, so we follow the same
pattern.

__scan_swap_map_try_ssd_cluster() is introduced to deal with arbitrary
orders and scan_swap_map_try_ssd_cluster() is refactored as a wrapper
for order-0.

As is done for order-0 per-cpu clusters, the scanner can now steal
order-0 entries from any per-cpu-per-order reserved cluster. This
ensures that when the swap file is getting full, space doesn't get tied
up in the per-cpu reserves.

I've run the tests on Ampere Altra (arm64), set up with a 35G block ram
device as the swap device, from inside a memcg limited to 40G memory.
I've then run `usemem` from vm-scalability with 70 processes (each has
its own core), each allocating and writing 1G of memory. I've repeated
everything 5 times and taken the mean:

Mean Performance Improvement vs 4K/baseline

| alloc size |            baseline |       + this series |
|            |  v6.6-rc4+anonfolio |                     |
|:-----------|--------------------:|--------------------:|
| 4K Page    |                0.0% |                4.9% |
| 64K THP    |              -44.1% |               10.7% |
| 2M THP     |               56.0% |               65.9% |

So with this change, the regression for 64K swap performance goes away,
and 4K and 2M swap improve slightly too.
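To make the per-order allocation strategy above concrete, here is a
minimal, self-contained userspace model (the cluster size, free-cluster
handling, and names are simplified placeholders; the real implementation
is the mm/swapfile.c diff below): each order draws sequentially from its
own per-CPU cluster, so 1 << order entries stay contiguous and naturally
aligned without any scanning.

#include <stdbool.h>
#include <stdio.h>

#define SWAPFILE_CLUSTER	512	/* entries per cluster (illustrative) */
#define PMD_ORDER		9	/* illustrative: 2M THP / 4K pages */
#define SWAP_NEXT_NULL		0

/* One "next offset" per order: the per-CPU state becomes a small array. */
static unsigned int cpu_next[PMD_ORDER + 1];
static unsigned int next_free_cluster = 1;	/* pretend clusters 1..N are free */

/*
 * Model of the per-order allocation: each order allocates sequentially from
 * its own cluster, so entries stay naturally aligned and no map search or
 * padding is needed. Writes the first offset of 1 << order entries.
 */
static bool alloc_entries(int order, unsigned int *offset)
{
	unsigned int nr = 1u << order;
	unsigned int tmp = cpu_next[order];

	if (tmp == SWAP_NEXT_NULL) {
		/* take a whole free cluster for this order (failure -> caller splits folio) */
		tmp = next_free_cluster++ * SWAPFILE_CLUSTER;
	}

	*offset = tmp;
	tmp += nr;
	cpu_next[order] = (tmp % SWAPFILE_CLUSTER) ? tmp : SWAP_NEXT_NULL;
	return true;
}

int main(void)
{
	unsigned int off;

	alloc_entries(4, &off);	/* 64K folio: 16 contiguous entries */
	printf("order-4 entries at offset %u\n", off);
	alloc_entries(4, &off);
	printf("next order-4 entries at offset %u\n", off);
	alloc_entries(0, &off);	/* order-0 comes from a different cluster */
	printf("order-0 entry at offset %u\n", off);
	return 0;
}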
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Barry Song
---
 include/linux/swap.h |  10 +--
 mm/swapfile.c        | 149 +++++++++++++++++++++++++++++++------------
 mm/vmscan.c          |  10 +--
 3 files changed, 119 insertions(+), 50 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ca8aaa098ba..ccbca5db851b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -295,11 +295,11 @@ struct swap_info_struct {
 	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
 	unsigned int __percpu *cpu_next;/*
 					 * Likely next allocation offset. We
-					 * assign a cluster to each CPU, so each
-					 * CPU can allocate swap entry from its
-					 * own cluster and swapout sequentially.
-					 * The purpose is to optimize swapout
-					 * throughput.
+					 * assign a cluster per-order to each
+					 * CPU, so each CPU can allocate swap
+					 * entry from its own cluster and
+					 * swapout sequentially. The purpose is
+					 * to optimize swapout throughput.
 					 */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 94f7cc225eb9..b50bce50bed9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -545,10 +545,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
 
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
- * removed from free cluster list and its usage counter will be increased.
+ * removed from free cluster list and its usage counter will be increased by
+ * count.
  */
-static void inc_cluster_info_page(struct swap_info_struct *p,
-	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void add_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr,
+	unsigned long count)
 {
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
 
@@ -557,9 +559,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 	if (cluster_is_free(&cluster_info[idx]))
 		alloc_cluster(p, idx);
 
-	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
+	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
-		cluster_count(&cluster_info[idx]) + 1);
+		cluster_count(&cluster_info[idx]) + count);
+}
+
+/*
+ * The cluster corresponding to page_nr will be used. The cluster will be
+ * removed from free cluster list and its usage counter will be increased.
+ */
+static void inc_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+{
+	add_cluster_info_page(p, cluster_info, page_nr, 1);
 }
 
 /*
@@ -588,8 +600,8 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
  * cluster list. Avoiding such abuse to avoid list corruption.
  */
 static bool
-scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
-	unsigned long offset)
+__scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
+	unsigned long offset, int order)
 {
 	bool conflict;
 
@@ -601,23 +613,36 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 	if (!conflict)
 		return false;
 
-	*this_cpu_ptr(si->cpu_next) = SWAP_NEXT_NULL;
+	this_cpu_ptr(si->cpu_next)[order] = SWAP_NEXT_NULL;
 	return true;
 }
 
 /*
- * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
- * might involve allocating a new cluster for current CPU too.
+ * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
+ * cluster list. Avoiding such abuse to avoid list corruption.
  */
-static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
-	unsigned long *offset, unsigned long *scan_base)
+static bool
+scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
+	unsigned long offset)
+{
+	return __scan_swap_map_ssd_cluster_conflict(si, offset, 0);
+}
+
+/*
+ * Try to get a swap entry (or size indicated by order) from current cpu's swap
+ * entry pool (a cluster). This might involve allocating a new cluster for
+ * current CPU too.
+ */
+static bool __scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
+	unsigned long *offset, unsigned long *scan_base, int order)
 {
 	struct swap_cluster_info *ci;
-	unsigned int tmp, max;
+	unsigned int tmp, max, i;
 	unsigned int *cpu_next;
+	unsigned int nr_pages = 1 << order;
 
 new_cluster:
-	cpu_next = this_cpu_ptr(si->cpu_next);
+	cpu_next = &this_cpu_ptr(si->cpu_next)[order];
 	tmp = *cpu_next;
 	if (tmp == SWAP_NEXT_NULL) {
 		if (!cluster_list_empty(&si->free_clusters)) {
@@ -643,10 +668,12 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	 * reserve a new cluster.
 	 */
 	ci = lock_cluster(si, tmp);
-	if (si->swap_map[tmp]) {
-		unlock_cluster(ci);
-		*cpu_next = SWAP_NEXT_NULL;
-		goto new_cluster;
+	for (i = 0; i < nr_pages; i++) {
+		if (si->swap_map[tmp + i]) {
+			unlock_cluster(ci);
+			*cpu_next = SWAP_NEXT_NULL;
+			goto new_cluster;
+		}
 	}
 	unlock_cluster(ci);
 
@@ -654,12 +681,22 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	*scan_base = tmp;
 
 	max = ALIGN_DOWN(tmp, SWAPFILE_CLUSTER) + SWAPFILE_CLUSTER;
-	tmp += 1;
+	tmp += nr_pages;
 	*cpu_next = tmp < max ? tmp : SWAP_NEXT_NULL;
 
 	return true;
 }
 
+/*
+ * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
+ * might involve allocating a new cluster for current CPU too.
+ */
+static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
+	unsigned long *offset, unsigned long *scan_base)
+{
+	return __scan_swap_map_try_ssd_cluster(si, offset, scan_base, 0);
+}
+
 static void __del_from_avail_list(struct swap_info_struct *p)
 {
 	int nid;
@@ -982,35 +1019,58 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }
 
-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
+static int swap_alloc_large(struct swap_info_struct *si, swp_entry_t *slot,
+			    unsigned int nr_pages)
 {
-	unsigned long idx;
 	struct swap_cluster_info *ci;
-	unsigned long offset;
+	unsigned long offset, scan_base;
+	int order = ilog2(nr_pages);
+	bool ret;
 
 	/*
-	 * Should not even be attempting cluster allocations when huge
+	 * Should not even be attempting large allocations when huge
 	 * page swap is disabled. Warn and fail the allocation.
 	 */
-	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
+	if (!IS_ENABLED(CONFIG_THP_SWAP) ||
+	    nr_pages < 2 || nr_pages > SWAPFILE_CLUSTER ||
+	    !is_power_of_2(nr_pages)) {
 		VM_WARN_ON_ONCE(1);
 		return 0;
 	}
 
-	if (cluster_list_empty(&si->free_clusters))
+	/*
+	 * Swapfile is not block device or not using clusters so unable to
+	 * allocate large entries.
+	 */
+	if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
 		return 0;
 
-	idx = cluster_list_first(&si->free_clusters);
-	offset = idx * SWAPFILE_CLUSTER;
-	ci = lock_cluster(si, offset);
-	alloc_cluster(si, idx);
-	cluster_set_count(ci, SWAPFILE_CLUSTER);
+again:
+	/*
+	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
+	 * so indicate that we are scanning to synchronise with swapoff.
+	 */
+	si->flags += SWP_SCANNING;
+	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
+	si->flags -= SWP_SCANNING;
+
+	/*
+	 * If we failed to allocate or if swapoff is waiting for us (due to lock
+	 * being dropped for discard above), return immediately.
+	 */
+	if (!ret || !(si->flags & SWP_WRITEOK))
+		return 0;
 
-	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
+	if (__scan_swap_map_ssd_cluster_conflict(si, offset, order))
+		goto again;
+
+	ci = lock_cluster(si, offset);
+	memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
+	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
 	unlock_cluster(ci);
-	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
-	*slot = swp_entry(si->type, offset);
+	swap_range_alloc(si, offset, nr_pages);
+	*slot = swp_entry(si->type, offset);
 
 	return 1;
 }
 
@@ -1036,7 +1096,7 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 	int node;
 
 	/* Only single cluster request supported */
-	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
+	WARN_ON_ONCE(n_goal > 1 && size > 1);
 
 	spin_lock(&swap_avail_lock);
 
@@ -1073,14 +1133,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		if (size == SWAPFILE_CLUSTER) {
-			if (si->flags & SWP_BLKDEV)
-				n_ret = swap_alloc_cluster(si, swp_entries);
+		if (size > 1) {
+			n_ret = swap_alloc_large(si, swp_entries, size);
 		} else
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 						    n_goal, swp_entries);
 		spin_unlock(&si->lock);
-		if (n_ret || size == SWAPFILE_CLUSTER)
+		if (n_ret || size > 1)
 			goto check_out;
 		cond_resched();
 
@@ -3041,6 +3100,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (p->bdev && bdev_nonrot(p->bdev)) {
 		int cpu;
 		unsigned long ci, nr_cluster;
+		int nr_order;
+		int i;
 
 		p->flags |= SWP_SOLIDSTATE;
 		p->cluster_next_cpu = alloc_percpu(unsigned int);
@@ -3068,13 +3129,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		for (ci = 0; ci < nr_cluster; ci++)
 			spin_lock_init(&((cluster_info + ci)->lock));
 
-		p->cpu_next = alloc_percpu(unsigned int);
+		nr_order = IS_ENABLED(CONFIG_THP_SWAP) ? PMD_ORDER + 1 : 1;
+		p->cpu_next = __alloc_percpu(sizeof(unsigned int) * nr_order,
+					     __alignof__(unsigned int));
 		if (!p->cpu_next) {
 			error = -ENOMEM;
 			goto bad_swap_unlock_inode;
 		}
-		for_each_possible_cpu(cpu)
-			per_cpu(*p->cpu_next, cpu) = SWAP_NEXT_NULL;
+		for_each_possible_cpu(cpu) {
+			unsigned int *cpu_next = per_cpu_ptr(p->cpu_next, cpu);
+
+			for (i = 0; i < nr_order; i++)
+				cpu_next[i] = SWAP_NEXT_NULL;
+		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2cc0cb41fb32..ea19710aa4cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					if (!can_split_folio(folio, NULL))
 						goto activate_locked;
 					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
+					 * Split PMD-mappable folios without a
+					 * PMD map right away. Chances are some
+					 * or all of the tail pages can be freed
+					 * without IO.
 					 */
-					if (!folio_entire_mapcount(folio) &&
+					if (folio_test_pmd_mappable(folio) &&
+					    !folio_entire_mapcount(folio) &&
 					    split_folio_to_list(folio,
 								folio_list))
 						goto activate_locked;