From patchwork Wed Jul 31 06:49:20 2024
X-Patchwork-Submitter: Chris Li
X-Patchwork-Id: 13748193
From: chrisl@kernel.org
Date: Tue, 30 Jul 2024 23:49:20 -0700
Subject: [PATCH v5 8/9] mm: swap: reclaim the cached parts that got scanned
Message-Id: <20240730-swap-allocator-v5-8-cb9c148b9297@kernel.org>
References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
In-Reply-To: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, "Huang, Ying", Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song

From: Kairui Song

This commit implements reclaim during scan for the cluster allocator.

Cluster scanning was unable to reuse SWAP_HAS_CACHE slots, which could
result in a low allocation success rate or early OOM.

To ensure the maximum allocation success rate, integrate reclaim with
scanning: if a range of suitable swap slots is found but is fragmented
by HAS_CACHE slots, try to reclaim those slots.
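[Editor's note: the following is a minimal user-space sketch of the
scan-then-reclaim idea described above, not the kernel implementation.
slot_state, try_reclaim_slot() and scan_and_reclaim_range() are
hypothetical stand-ins for si->swap_map, SWAP_HAS_CACHE,
__try_to_reclaim_swap() and cluster_scan_range()/cluster_reclaim_range();
the si->lock/ci->lock drop-and-retake of the real code is omitted.]

/*
 * Sketch only: scan a slot range, reclaim cache-only slots if that is
 * the only thing standing in the way, then recheck the whole range.
 */
#include <stdbool.h>
#include <stdio.h>

enum slot_state { SLOT_FREE = 0, SLOT_HAS_CACHE, SLOT_IN_USE };

#define NR_SLOTS 16

/* Pretend reclaim always succeeds for cache-only slots. */
static bool try_reclaim_slot(enum slot_state *map, int off)
{
	if (map[off] != SLOT_HAS_CACHE)
		return false;
	map[off] = SLOT_FREE;
	return true;
}

/*
 * Scan [start, start + nr): free slots are fine, cache-only slots are
 * noted and reclaimed in a second pass, anything else fails the range.
 * The final recheck mirrors the kernel rechecking after it dropped and
 * retook its locks around reclaim.
 */
static bool scan_and_reclaim_range(enum slot_state *map, int start, int nr)
{
	bool need_reclaim = false;
	int off;

	for (off = start; off < start + nr; off++) {
		if (map[off] == SLOT_FREE)
			continue;
		if (map[off] == SLOT_HAS_CACHE) {
			need_reclaim = true;
			continue;
		}
		return false;	/* slot genuinely in use, give up */
	}

	if (need_reclaim)
		for (off = start; off < start + nr; off++)
			if (map[off] == SLOT_HAS_CACHE &&
			    !try_reclaim_slot(map, off))
				return false;

	for (off = start; off < start + nr; off++)
		if (map[off] != SLOT_FREE)
			return false;
	return true;
}

int main(void)
{
	enum slot_state map[NR_SLOTS] = { 0 };

	map[2] = SLOT_HAS_CACHE;
	map[3] = SLOT_HAS_CACHE;
	map[8] = SLOT_IN_USE;

	/* Succeeds after reclaiming slots 2 and 3. */
	printf("range 0..3 usable: %d\n", scan_and_reclaim_range(map, 0, 4));
	/* Fails: slot 8 is genuinely in use. */
	printf("range 6..9 usable: %d\n", scan_and_reclaim_range(map, 6, 4));
	return 0;
}

[The real patch additionally rechecks the range under si->lock/ci->lock
because __try_to_reclaim_swap() may sleep and other users can free or
take slots while the locks are dropped.]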
Signed-off-by: Kairui Song
---
 include/linux/swap.h |   1 +
 mm/swapfile.c        | 140 +++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 110 insertions(+), 31 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5a14b6c65949..9eb740563d63 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -302,6 +302,7 @@ struct swap_info_struct {
 					/* list of cluster that contains at least one free slot */
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 					/* list of cluster that are fragmented or contented */
+	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int lowest_bit;	/* index of first free in swap_map */
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index eb3e387e86b2..50e7f600a9a1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -513,6 +513,10 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 	VM_BUG_ON(ci->count != 0);
 	lockdep_assert_held(&si->lock);
 	lockdep_assert_held(&ci->lock);
+
+	if (ci->flags & CLUSTER_FLAG_FRAG)
+		si->frag_cluster_nr[ci->order]--;
+
 	/*
 	 * If the swap is discardable, prepare discard the cluster
 	 * instead of free it immediately. The cluster will be freed
@@ -572,31 +576,84 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 
 	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
 		VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
-		if (ci->flags & CLUSTER_FLAG_FRAG)
+		if (ci->flags & CLUSTER_FLAG_FRAG) {
+			p->frag_cluster_nr[ci->order]--;
 			list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]);
-		else
+		} else {
 			list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+		}
 		ci->flags = CLUSTER_FLAG_NONFULL;
 	}
 }
 
-static inline bool cluster_scan_range(struct swap_info_struct *si, unsigned int start,
-				      unsigned int nr_pages)
+static bool cluster_reclaim_range(struct swap_info_struct *si,
+				  struct swap_cluster_info *ci,
+				  unsigned long start, unsigned long end)
 {
-	unsigned char *p = si->swap_map + start;
-	unsigned char *end = p + nr_pages;
+	unsigned char *map = si->swap_map;
+	unsigned long offset;
+
+	spin_unlock(&ci->lock);
+	spin_unlock(&si->lock);
+
+	for (offset = start; offset < end; offset++) {
+		switch (READ_ONCE(map[offset])) {
+		case 0:
+			continue;
+		case SWAP_HAS_CACHE:
+			if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT) > 0)
+				continue;
+			goto out;
+		default:
+			goto out;
+		}
+	}
+out:
+	spin_lock(&si->lock);
+	spin_lock(&ci->lock);
 
-	while (p < end)
-		if (*p++)
+	/*
+	 * Recheck the range no matter reclaim succeeded or not, the slot
+	 * could have been be freed while we are not holding the lock.
+	 */
+	for (offset = start; offset < end; offset++)
+		if (READ_ONCE(map[offset]))
 			return false;
 
 	return true;
 }
 
+static bool cluster_scan_range(struct swap_info_struct *si,
+			       struct swap_cluster_info *ci,
+			       unsigned long start, unsigned int nr_pages)
+{
+	unsigned long offset, end = start + nr_pages;
+	unsigned char *map = si->swap_map;
+	bool need_reclaim = false;
 
-static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
-				       unsigned int start, unsigned char usage,
-				       unsigned int order)
+	for (offset = start; offset < end; offset++) {
+		switch (READ_ONCE(map[offset])) {
+		case 0:
+			continue;
+		case SWAP_HAS_CACHE:
+			if (!vm_swap_full())
+				return false;
+			need_reclaim = true;
+			continue;
+		default:
+			return false;
+		}
+	}
+
+	if (need_reclaim)
+		return cluster_reclaim_range(si, ci, start, end);
+
+	return true;
+}
+
+static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
+				unsigned int start, unsigned char usage,
+				unsigned int order)
 {
 	unsigned int nr_pages = 1 << order;
 
@@ -615,6 +672,8 @@ static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
 	if (ci->count == SWAPFILE_CLUSTER) {
 		VM_BUG_ON(!(ci->flags &
 			    (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
+		if (ci->flags & CLUSTER_FLAG_FRAG)
+			si->frag_cluster_nr[ci->order]--;
 		list_del(&ci->list);
 		ci->flags = 0;
 	}
@@ -640,7 +699,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset,
 	}
 
 	while (offset <= end) {
-		if (cluster_scan_range(si, offset, nr_pages)) {
+		if (cluster_scan_range(si, ci, offset, nr_pages)) {
 			cluster_alloc_range(si, ci, offset, usage, order);
 			*foundp = offset;
 			if (ci->count == SWAPFILE_CLUSTER) {
@@ -668,9 +727,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 					      unsigned char usage)
 {
 	struct percpu_cluster *cluster;
-	struct swap_cluster_info *ci, *n;
+	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
-	LIST_HEAD(fraged);
 
 new_cluster:
 	lockdep_assert_held(&si->lock);
@@ -690,25 +748,42 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 	}
 
 	if (order < PMD_ORDER) {
-		list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) {
-			list_move_tail(&ci->list, &fraged);
+		unsigned int frags = 0;
+
+		while (!list_empty(&si->nonfull_clusters[order])) {
+			ci = list_first_entry(&si->nonfull_clusters[order],
+					      struct swap_cluster_info, list);
+			list_move_tail(&ci->list, &si->frag_clusters[order]);
 			ci->flags = CLUSTER_FLAG_FRAG;
+			si->frag_cluster_nr[order]++;
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found,
 							 order, usage);
+			frags++;
 			if (found)
 				break;
 		}
 
 		if (!found) {
-			list_for_each_entry_safe(ci, n, &si->frag_clusters[order], list) {
+			/*
+			 * Nonfull clusters are moved to frag tail if we reached
+			 * here, count them too, don't over scan the frag list.
+			 */
+			while (frags < si->frag_cluster_nr[order]) {
+				ci = list_first_entry(&si->frag_clusters[order],
						      struct swap_cluster_info, list);
+				/*
+				 * Rotate the frag list to iterate, they were all failing
+				 * high order allocation or moved here due to per-CPU usage,
+				 * this help keeping usable cluster ahead.
+				 */
+				list_move_tail(&ci->list, &si->frag_clusters[order]);
 				offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found,
 								 order, usage);
+				frags++;
 				if (found)
 					break;
 			}
 		}
-
-		list_splice_tail(&fraged, &si->frag_clusters[order]);
 	}
 
 	if (found)
@@ -729,25 +804,28 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 
 	/* Order 0 stealing from higher order */
 	for (int o = 1; o < PMD_ORDER; o++) {
-		if (!list_empty(&si->frag_clusters[o])) {
+		/*
+		 * Clusters here have at least one usable slots and can't fail order 0
+		 * allocation, but reclaim may drop si->lock and race with another user.
+		 */
+		while (!list_empty(&si->frag_clusters[o])) {
 			ci = list_first_entry(&si->frag_clusters[o],
 					      struct swap_cluster_info, list);
-			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found,
-							 0, usage);
-			VM_BUG_ON(!found);
-			goto done;
+			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+							 &found, 0, usage);
+			if (found)
+				goto done;
 		}
 
-		if (!list_empty(&si->nonfull_clusters[o])) {
-			ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info,
-					      list);
+		while (!list_empty(&si->nonfull_clusters[o])) {
+			ci = list_first_entry(&si->nonfull_clusters[o],
+					      struct swap_cluster_info, list);
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found,
 							 0, usage);
-			VM_BUG_ON(!found);
-			goto done;
+			if (found)
+				goto done;
 		}
 	}
-
 done:
 	cluster->next[order] = offset;
 	return found;
@@ -3053,6 +3131,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	for (i = 0; i < SWAP_NR_ORDERS; i++) {
 		INIT_LIST_HEAD(&p->nonfull_clusters[i]);
 		INIT_LIST_HEAD(&p->frag_clusters[i]);
+		p->frag_cluster_nr[i] = 0;
 	}
 
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
@@ -3096,7 +3175,6 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	if (!cluster_info)
 		return nr_extents;
 
-
 	/*
 	 * Reduce false cache line sharing between cluster_info and
 	 * sharing same address space.