From patchwork Wed Jul 31 06:49:13 2024
From: Chris Li
Date: Tue, 30 Jul 2024 23:49:13 -0700
Subject: [PATCH v5 1/9] mm: swap: swap cluster switch to double link list
Message-Id: <20240730-swap-allocator-v5-1-cb9c148b9297@kernel.org>
References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
In-Reply-To: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, "Huang, Ying", Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song

Previously, the swap cluster used a cluster index as a pointer to construct a custom single link list type "swap_cluster_list". The next-cluster pointer shares storage with cluster->count, which prevents putting a non-free cluster on a list.

Change the cluster to use the standard double link list instead. This allows tracking the nonfull clusters in a follow-up patch, making it faster to get to a nonfull cluster of a given order. Remove the cluster getter/setter helpers for accessing the cluster struct members. The list operations are protected by swap_info_struct->lock.

Change the cluster code to reference a cluster with "struct swap_cluster_info *" rather than by index. That is more consistent with the list manipulation, avoids repeatedly adding the index to cluster_info, and makes the code easier to understand.

Remove the "next pointer is NULL" cluster flag; the double link list handles an empty list just fine.

The "swap_cluster_info" struct is two pointers bigger, but because 512 swap entries share one swap_cluster_info struct, it has very little impact on the average memory usage per swap entry. For a 1TB swapfile, the swap cluster data structure increases from 8MB to 24MB.
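[Illustration added in review, not part of the patch: a rough sketch of the layout change described above. The field types are taken from the diff below; the struct names with _old/_new suffixes and the byte counts are assumptions for a typical 64-bit build without lock debugging, so the absolute sizes may differ by config.]

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/* Before: one 32-bit word doubles as "next free cluster" or usage count. */
struct swap_cluster_info_old {
	spinlock_t lock;
	unsigned int data:24;	/* next cluster index, or usage counter */
	unsigned int flags:8;	/* CLUSTER_FLAG_FREE / CLUSTER_FLAG_NEXT_NULL */
};

/* After: dedicated count/flags plus a standard list_head (two pointers). */
struct swap_cluster_info_new {
	spinlock_t lock;
	u16 count;		/* usage counter, 0..512 */
	u8 flags;		/* CLUSTER_FLAG_FREE */
	struct list_head list;	/* free/discard list linkage */
};

/*
 * A 1TB swapfile has 1TB / 4KB = 256M slots, i.e. 512K clusters of 512
 * slots each, so every extra byte in swap_cluster_info costs roughly
 * 512KB of metadata.
 */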
Other than the list conversion, there is no real function change in this patch. Signed-off-by: Chris Li --- include/linux/swap.h | 25 ++---- mm/swapfile.c | 226 ++++++++++++++------------------------------------- 2 files changed, 71 insertions(+), 180 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index e473fe6cfb7a..edafd52d7ac4 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -243,22 +243,20 @@ enum { * free clusters are organized into a list. We fetch an entry from the list to * get a free cluster. * - * The data field stores next cluster if the cluster is free or cluster usage - * counter otherwise. The flags field determines if a cluster is free. This is - * protected by swap_info_struct.lock. + * The flags field determines if a cluster is free. This is + * protected by cluster lock. */ struct swap_cluster_info { spinlock_t lock; /* * Protect swap_cluster_info fields - * and swap_info_struct->swap_map - * elements correspond to the swap - * cluster + * other than list, and swap_info_struct->swap_map + * elements corresponding to the swap cluster. */ - unsigned int data:24; - unsigned int flags:8; + u16 count; + u8 flags; + struct list_head list; }; #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */ /* * The first page in the swap file is the swap header, which is always marked @@ -283,11 +281,6 @@ struct percpu_cluster { unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */ }; -struct swap_cluster_list { - struct swap_cluster_info head; - struct swap_cluster_info tail; -}; - /* * The in-memory structure used to track swap areas. */ @@ -301,7 +294,7 @@ struct swap_info_struct { unsigned char *swap_map; /* vmalloc'ed array of usage counts */ unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */ struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */ - struct swap_cluster_list free_clusters; /* free clusters list */ + struct list_head free_clusters; /* free clusters list */ unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ @@ -332,7 +325,7 @@ struct swap_info_struct { * list. */ struct work_struct discard_work; /* discard worker */ - struct swap_cluster_list discard_clusters; /* discard clusters list */ + struct list_head discard_clusters; /* discard clusters list */ struct plist_node avail_lists[]; /* * entries in swap_avail_heads, one * entry per node. 
diff --git a/mm/swapfile.c b/mm/swapfile.c index f7224bc1320c..bceead7f9e3c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -290,62 +290,15 @@ static void discard_swap_cluster(struct swap_info_struct *si, #endif #define LATENCY_LIMIT 256 -static inline void cluster_set_flag(struct swap_cluster_info *info, - unsigned int flag) -{ - info->flags = flag; -} - -static inline unsigned int cluster_count(struct swap_cluster_info *info) -{ - return info->data; -} - -static inline void cluster_set_count(struct swap_cluster_info *info, - unsigned int c) -{ - info->data = c; -} - -static inline void cluster_set_count_flag(struct swap_cluster_info *info, - unsigned int c, unsigned int f) -{ - info->flags = f; - info->data = c; -} - -static inline unsigned int cluster_next(struct swap_cluster_info *info) -{ - return info->data; -} - -static inline void cluster_set_next(struct swap_cluster_info *info, - unsigned int n) -{ - info->data = n; -} - -static inline void cluster_set_next_flag(struct swap_cluster_info *info, - unsigned int n, unsigned int f) -{ - info->flags = f; - info->data = n; -} - static inline bool cluster_is_free(struct swap_cluster_info *info) { return info->flags & CLUSTER_FLAG_FREE; } -static inline bool cluster_is_null(struct swap_cluster_info *info) -{ - return info->flags & CLUSTER_FLAG_NEXT_NULL; -} - -static inline void cluster_set_null(struct swap_cluster_info *info) +static inline unsigned int cluster_index(struct swap_info_struct *si, + struct swap_cluster_info *ci) { - info->flags = CLUSTER_FLAG_NEXT_NULL; - info->data = 0; + return ci - si->cluster_info; } static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, @@ -394,65 +347,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si, spin_unlock(&si->lock); } -static inline bool cluster_list_empty(struct swap_cluster_list *list) -{ - return cluster_is_null(&list->head); -} - -static inline unsigned int cluster_list_first(struct swap_cluster_list *list) -{ - return cluster_next(&list->head); -} - -static void cluster_list_init(struct swap_cluster_list *list) -{ - cluster_set_null(&list->head); - cluster_set_null(&list->tail); -} - -static void cluster_list_add_tail(struct swap_cluster_list *list, - struct swap_cluster_info *ci, - unsigned int idx) -{ - if (cluster_list_empty(list)) { - cluster_set_next_flag(&list->head, idx, 0); - cluster_set_next_flag(&list->tail, idx, 0); - } else { - struct swap_cluster_info *ci_tail; - unsigned int tail = cluster_next(&list->tail); - - /* - * Nested cluster lock, but both cluster locks are - * only acquired when we held swap_info_struct->lock - */ - ci_tail = ci + tail; - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING); - cluster_set_next(ci_tail, idx); - spin_unlock(&ci_tail->lock); - cluster_set_next_flag(&list->tail, idx, 0); - } -} - -static unsigned int cluster_list_del_first(struct swap_cluster_list *list, - struct swap_cluster_info *ci) -{ - unsigned int idx; - - idx = cluster_next(&list->head); - if (cluster_next(&list->tail) == idx) { - cluster_set_null(&list->head); - cluster_set_null(&list->tail); - } else - cluster_set_next_flag(&list->head, - cluster_next(&ci[idx]), 0); - - return idx; -} - /* Add a cluster to discard list and schedule it to do discard */ static void swap_cluster_schedule_discard(struct swap_info_struct *si, - unsigned int idx) + struct swap_cluster_info *ci) { + unsigned int idx = cluster_index(si, ci); /* * If scan_swap_map_slots() can't find a free cluster, it will check * si->swap_map directly. 
To make sure the discarding cluster isn't @@ -462,17 +361,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, memset(si->swap_map + idx * SWAPFILE_CLUSTER, SWAP_MAP_BAD, SWAPFILE_CLUSTER); - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx); - + list_add_tail(&ci->list, &si->discard_clusters); schedule_work(&si->discard_work); } -static void __free_cluster(struct swap_info_struct *si, unsigned long idx) +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { - struct swap_cluster_info *ci = si->cluster_info; - - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE); - cluster_list_add_tail(&si->free_clusters, ci, idx); + ci->flags = CLUSTER_FLAG_FREE; + list_add_tail(&ci->list, &si->free_clusters); } /* @@ -481,24 +377,25 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx) */ static void swap_do_scheduled_discard(struct swap_info_struct *si) { - struct swap_cluster_info *info, *ci; + struct swap_cluster_info *ci; unsigned int idx; - info = si->cluster_info; - - while (!cluster_list_empty(&si->discard_clusters)) { - idx = cluster_list_del_first(&si->discard_clusters, info); + while (!list_empty(&si->discard_clusters)) { + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list); + list_del(&ci->list); + idx = cluster_index(si, ci); spin_unlock(&si->lock); discard_swap_cluster(si, idx * SWAPFILE_CLUSTER, SWAPFILE_CLUSTER); spin_lock(&si->lock); - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER); - __free_cluster(si, idx); + + spin_lock(&ci->lock); + __free_cluster(si, ci); memset(si->swap_map + idx * SWAPFILE_CLUSTER, 0, SWAPFILE_CLUSTER); - unlock_cluster(ci); + spin_unlock(&ci->lock); } } @@ -521,20 +418,21 @@ static void swap_users_ref_free(struct percpu_ref *ref) complete(&si->comp); } -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx) +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx) { - struct swap_cluster_info *ci = si->cluster_info; + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, + struct swap_cluster_info, list); - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx); - cluster_list_del_first(&si->free_clusters, ci); - cluster_set_count_flag(ci + idx, 0, 0); + VM_BUG_ON(cluster_index(si, ci) != idx); + list_del(&ci->list); + ci->count = 0; + ci->flags = 0; + return ci; } -static void free_cluster(struct swap_info_struct *si, unsigned long idx) +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { - struct swap_cluster_info *ci = si->cluster_info + idx; - - VM_BUG_ON(cluster_count(ci) != 0); + VM_BUG_ON(ci->count != 0); /* * If the swap is discardable, prepare discard the cluster * instead of free it immediately. 
The cluster will be freed @@ -542,11 +440,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx) */ if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) == (SWP_WRITEOK | SWP_PAGE_DISCARD)) { - swap_cluster_schedule_discard(si, idx); + swap_cluster_schedule_discard(si, ci); return; } - __free_cluster(si, idx); + __free_cluster(si, ci); } /* @@ -559,15 +457,15 @@ static void add_cluster_info_page(struct swap_info_struct *p, unsigned long count) { unsigned long idx = page_nr / SWAPFILE_CLUSTER; + struct swap_cluster_info *ci = cluster_info + idx; if (!cluster_info) return; - if (cluster_is_free(&cluster_info[idx])) + if (cluster_is_free(ci)) alloc_cluster(p, idx); - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER); - cluster_set_count(&cluster_info[idx], - cluster_count(&cluster_info[idx]) + count); + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER); + ci->count += count; } /* @@ -581,24 +479,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p, } /* - * The cluster corresponding to page_nr decreases one usage. If the usage - * counter becomes 0, which means no page in the cluster is in using, we can - * optionally discard the cluster and add it to free cluster list. + * The cluster ci decreases one usage. If the usage counter becomes 0, + * which means no page in the cluster is in use, we can optionally discard + * the cluster and add it to free cluster list. */ -static void dec_cluster_info_page(struct swap_info_struct *p, - struct swap_cluster_info *cluster_info, unsigned long page_nr) +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci) { - unsigned long idx = page_nr / SWAPFILE_CLUSTER; - - if (!cluster_info) + if (!p->cluster_info) return; - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0); - cluster_set_count(&cluster_info[idx], - cluster_count(&cluster_info[idx]) - 1); + VM_BUG_ON(ci->count == 0); + ci->count--; - if (cluster_count(&cluster_info[idx]) == 0) - free_cluster(p, idx); + if (!ci->count) + free_cluster(p, ci); } /* @@ -611,10 +505,12 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, { struct percpu_cluster *percpu_cluster; bool conflict; + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, + struct swap_cluster_info, list); offset /= SWAPFILE_CLUSTER; - conflict = !cluster_list_empty(&si->free_clusters) && - offset != cluster_list_first(&si->free_clusters) && + conflict = !list_empty(&si->free_clusters) && + offset != cluster_index(si, first) && cluster_is_free(&si->cluster_info[offset]); if (!conflict) @@ -655,10 +551,10 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, cluster = this_cpu_ptr(si->percpu_cluster); tmp = cluster->next[order]; if (tmp == SWAP_NEXT_INVALID) { - if (!cluster_list_empty(&si->free_clusters)) { - tmp = cluster_next(&si->free_clusters.head) * - SWAPFILE_CLUSTER; - } else if (!cluster_list_empty(&si->discard_clusters)) { + if (!list_empty(&si->free_clusters)) { + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list); + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER; + } else if (!list_empty(&si->discard_clusters)) { /* * we don't have free cluster but have some clusters in * discarding, do discard now and reclaim them, then @@ -1070,8 +966,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) ci = lock_cluster(si, offset); memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); - cluster_set_count_flag(ci, 0, 0); - free_cluster(si, idx); + 
ci->count = 0; + ci->flags = 0; + free_cluster(si, ci); unlock_cluster(ci); swap_range_free(si, offset, SWAPFILE_CLUSTER); } @@ -1344,7 +1241,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) count = p->swap_map[offset]; VM_BUG_ON(count != SWAP_HAS_CACHE); p->swap_map[offset] = 0; - dec_cluster_info_page(p, p->cluster_info, offset); + dec_cluster_info_page(p, ci); unlock_cluster(ci); mem_cgroup_uncharge_swap(entry, 1); @@ -3022,8 +2919,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, nr_good_pages = maxpages - 1; /* omit header page */ - cluster_list_init(&p->free_clusters); - cluster_list_init(&p->discard_clusters); + INIT_LIST_HEAD(&p->free_clusters); + INIT_LIST_HEAD(&p->discard_clusters); for (i = 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr = swap_header->info.badpages[i]; @@ -3074,14 +2971,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, for (k = 0; k < SWAP_CLUSTER_COLS; k++) { j = (k + col) % SWAP_CLUSTER_COLS; for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) { + struct swap_cluster_info *ci; idx = i * SWAP_CLUSTER_COLS + j; + ci = cluster_info + idx; if (idx >= nr_clusters) continue; - if (cluster_count(&cluster_info[idx])) + if (ci->count) continue; - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE); - cluster_list_add_tail(&p->free_clusters, cluster_info, - idx); + ci->flags = CLUSTER_FLAG_FREE; + list_add_tail(&ci->list, &p->free_clusters); } } return nr_extents; From patchwork Wed Jul 31 06:49:14 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Chris Li X-Patchwork-Id: 13748187 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3A02CC3DA7F for ; Wed, 31 Jul 2024 06:49:29 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id AD35C6B0085; Wed, 31 Jul 2024 02:49:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 9C8EC6B008A; Wed, 31 Jul 2024 02:49:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7BCB06B0089; Wed, 31 Jul 2024 02:49:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 5E5226B0085 for ; Wed, 31 Jul 2024 02:49:26 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id F1B8E120167 for ; Wed, 31 Jul 2024 06:49:25 +0000 (UTC) X-FDA: 82399121490.16.550AF28 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf20.hostedemail.com (Postfix) with ESMTP id 3EAFA1C0003 for ; Wed, 31 Jul 2024 06:49:24 +0000 (UTC) Authentication-Results: imf20.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=Fm0jMWWG; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf20.hostedemail.com: domain of chrisl@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=chrisl@kernel.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1722408519; a=rsa-sha256; cv=none; b=hwT/WAlO0lhsYNSfZPBMFILNjFBiC8lVOndN78gswDkcb8XeojJbixjyCytLVuTJFGbLVO VcLHjf+vYER9pN0AwiBMA95niG1CEo3xbTiZLC5ocEpUOLzbYAGg9e+yJL8MsLa2hO+aYr E69UmHVVZR5XEkzRrIB9nMLZvcc3McY= 
From: Chris Li
Date: Tue, 30 Jul 2024 23:49:14 -0700
Subject: [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list
Message-Id: <20240730-swap-allocator-v5-2-cb9c148b9297@kernel.org>
References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
In-Reply-To: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, "Huang, Ying", Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song

Track the nonfull clusters as well as the empty clusters on lists. Each order has one nonfull cluster list. A cluster remembers which order it was used for during new cluster allocation; when the cluster has a free entry, it is added to the nonfull[order] list. When the free cluster list is empty, also allocate from the nonfull list of that order.

This improves the mTHP swap allocation success rate. There are limitations if the distribution of the different mTHP orders changes a lot, e.g. many nonfull clusters get assigned to order A, while later there are many order B allocations but very few order A allocations. Currently, a cluster used by order A will not be reused by order B unless the cluster becomes 100% empty.

Signed-off-by: Chris Li --- include/linux/swap.h | 4 ++++ mm/swapfile.c | 38 +++++++++++++++++++++++++++++++++++--- 2 files changed, 39 insertions(+), 3 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index edafd52d7ac4..6716ef236766 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -254,9 +254,11 @@ struct swap_cluster_info { */ u16 count; u8 flags; + u8 order; struct list_head list; }; #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ /* * The first page in the swap file is the swap header, which is always marked @@ -295,6 +297,8 @@ struct swap_info_struct { unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */ struct swap_cluster_info *cluster_info; /* cluster info.
Only for SSD */ struct list_head free_clusters; /* free clusters list */ + struct list_head nonfull_clusters[SWAP_NR_ORDERS]; + /* list of cluster that contains at least one free slot */ unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ diff --git a/mm/swapfile.c b/mm/swapfile.c index bceead7f9e3c..dcf09eb549db 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -361,14 +361,22 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, memset(si->swap_map + idx * SWAPFILE_CLUSTER, SWAP_MAP_BAD, SWAPFILE_CLUSTER); - list_add_tail(&ci->list, &si->discard_clusters); + VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); + if (ci->flags & CLUSTER_FLAG_NONFULL) + list_move_tail(&ci->list, &si->discard_clusters); + else + list_add_tail(&ci->list, &si->discard_clusters); + ci->flags = 0; schedule_work(&si->discard_work); } static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { + if (ci->flags & CLUSTER_FLAG_NONFULL) + list_move_tail(&ci->list, &si->free_clusters); + else + list_add_tail(&ci->list, &si->free_clusters); ci->flags = CLUSTER_FLAG_FREE; - list_add_tail(&ci->list, &si->free_clusters); } /* @@ -491,8 +499,15 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste VM_BUG_ON(ci->count == 0); ci->count--; - if (!ci->count) + if (!ci->count) { free_cluster(p, ci); + return; + } + + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); + ci->flags |= CLUSTER_FLAG_NONFULL; + } } /* @@ -553,6 +568,19 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, if (tmp == SWAP_NEXT_INVALID) { if (!list_empty(&si->free_clusters)) { ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list); + list_del(&ci->list); + spin_lock(&ci->lock); + ci->order = order; + ci->flags = 0; + spin_unlock(&ci->lock); + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER; + } else if (!list_empty(&si->nonfull_clusters[order])) { + ci = list_first_entry(&si->nonfull_clusters[order], + struct swap_cluster_info, list); + list_del(&ci->list); + spin_lock(&ci->lock); + ci->flags = 0; + spin_unlock(&ci->lock); tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER; } else if (!list_empty(&si->discard_clusters)) { /* @@ -967,6 +995,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) ci = lock_cluster(si, offset); memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); ci->count = 0; + ci->order = 0; ci->flags = 0; free_cluster(si, ci); unlock_cluster(ci); @@ -2922,6 +2951,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, INIT_LIST_HEAD(&p->free_clusters); INIT_LIST_HEAD(&p->discard_clusters); + for (i = 0; i < SWAP_NR_ORDERS; i++) + INIT_LIST_HEAD(&p->nonfull_clusters[i]); + for (i = 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr = swap_header->info.badpages[i]; if (page_nr == 0 || page_nr > swap_header->info.last_page) From patchwork Wed Jul 31 06:49:15 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Li X-Patchwork-Id: 13748189 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C13CEC3DA7F for ; Wed, 31 Jul 2024 06:49:34 +0000 (UTC) Received: by 
From: Chris Li
Date: Tue, 30 Jul 2024 23:49:15 -0700
Subject: [PATCH v5 3/9] mm: swap: separate SSD allocation from scan_swap_map_slots()
Message-Id: <20240730-swap-allocator-v5-3-cb9c148b9297@kernel.org>
References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
In-Reply-To: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, "Huang, Ying", Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song

Previously, SSD and HDD shared the same swap_map scan loop in scan_swap_map_slots(). This function is complex and its execution flow is hard to follow.

scan_swap_map_try_ssd_cluster() can already do most of the heavy lifting to locate a candidate swap range in the cluster. However, it needs to go back to scan_swap_map_slots() to check for conflicts and then perform the allocation. When scan_swap_map_try_ssd_cluster() fails, it still depends on scan_swap_map_slots() to do a brute-force scan of the swap_map. When the swapfile is large and almost full, going through the swap_map array takes noticeable CPU time.

Get rid of the cluster allocation dependency on the swap_map scan loop in scan_swap_map_slots() and streamline the cluster allocation code path. No more conflict checks. For order-0 swap entries, when the free and nonfull lists run out, allocate from the higher-order nonfull cluster lists.

Users should see less CPU time spent searching for free swap slots when the swapfile is almost full.
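[Illustration added in review, not part of the patch: a self-contained toy model of the fallback order the reworked allocator follows, as described above. The toy_* names, types and helpers are made up for this sketch; only the ordering of the steps mirrors cluster_alloc_swap_entry() in the diff below, and locking, the per-CPU next offset and discard handling are omitted.]

#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_ORDERS 5		/* stand-in for SWAP_NR_ORDERS */
#define TOY_PMD_ORDER 4		/* illustrative value only */

struct toy_cluster { int free_slots; };
struct toy_list { struct toy_cluster *first; };

static struct toy_cluster *toy_first(struct toy_list *l)
{
	return l ? l->first : NULL;
}

/* Try to carve 1 << order slots out of a cluster. */
static bool toy_take(struct toy_cluster *c, int order)
{
	int need = 1 << order;

	if (!c || c->free_slots < need)
		return false;
	c->free_slots -= need;
	return true;
}

/* Allocation fallback order, per the description above. */
static bool toy_alloc(struct toy_list *free_clusters,
		      struct toy_list nonfull_clusters[TOY_NR_ORDERS],
		      int order)
{
	/* 1. Prefer a completely free cluster. */
	if (toy_take(toy_first(free_clusters), order))
		return true;

	/* 2. A nonfull cluster already dedicated to this order. */
	if (order < TOY_PMD_ORDER &&
	    toy_take(toy_first(&nonfull_clusters[order]), order))
		return true;

	/* (The real code runs scheduled discards here and retries.) */

	/* 3. Order-0 requests may steal from higher-order nonfull clusters. */
	if (order == 0) {
		for (int o = 1; o < TOY_PMD_ORDER; o++)
			if (toy_take(toy_first(&nonfull_clusters[o]), 0))
				return true;
	}

	return false;
}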
Signed-off-by: Chris Li Signed-off-by: Kairui Song --- mm/swapfile.c | 300 ++++++++++++++++++++++++++++++++-------------------------- 1 file changed, 168 insertions(+), 132 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index dcf09eb549db..8a72c8a9aafd 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -53,6 +53,8 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); +static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset, + unsigned int nr_entries); static DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; @@ -301,6 +303,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si, return ci - si->cluster_info; } +static inline unsigned int cluster_offset(struct swap_info_struct *si, + struct swap_cluster_info *ci) +{ + return cluster_index(si, ci) * SWAPFILE_CLUSTER; +} + static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si, unsigned long offset) { @@ -372,11 +380,15 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si, static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { + lockdep_assert_held(&si->lock); + lockdep_assert_held(&ci->lock); + if (ci->flags & CLUSTER_FLAG_NONFULL) list_move_tail(&ci->list, &si->free_clusters); else list_add_tail(&ci->list, &si->free_clusters); ci->flags = CLUSTER_FLAG_FREE; + ci->order = 0; } /* @@ -431,9 +443,11 @@ static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsi struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list); + lockdep_assert_held(&si->lock); + lockdep_assert_held(&ci->lock); VM_BUG_ON(cluster_index(si, ci) != idx); + VM_BUG_ON(ci->count); list_del(&ci->list); - ci->count = 0; ci->flags = 0; return ci; } @@ -441,6 +455,8 @@ static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsi static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { VM_BUG_ON(ci->count != 0); + lockdep_assert_held(&si->lock); + lockdep_assert_held(&ci->lock); /* * If the swap is discardable, prepare discard the cluster * instead of free it immediately. The cluster will be freed @@ -497,6 +513,9 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste return; VM_BUG_ON(ci->count == 0); + VM_BUG_ON(cluster_is_free(ci)); + lockdep_assert_held(&p->lock); + lockdep_assert_held(&ci->lock); ci->count--; if (!ci->count) { @@ -505,48 +524,88 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste } if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { + VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); - ci->flags |= CLUSTER_FLAG_NONFULL; + ci->flags = CLUSTER_FLAG_NONFULL; } } -/* - * It's possible scan_swap_map_slots() uses a free cluster in the middle of free - * cluster list. Avoiding such abuse to avoid list corruption. 
- */ -static bool -scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si, - unsigned long offset, int order) +static inline bool cluster_scan_range(struct swap_info_struct *si, unsigned int start, + unsigned int nr_pages) { - struct percpu_cluster *percpu_cluster; - bool conflict; - struct swap_cluster_info *first = list_first_entry(&si->free_clusters, - struct swap_cluster_info, list); - - offset /= SWAPFILE_CLUSTER; - conflict = !list_empty(&si->free_clusters) && - offset != cluster_index(si, first) && - cluster_is_free(&si->cluster_info[offset]); + unsigned char *p = si->swap_map + start; + unsigned char *end = p + nr_pages; - if (!conflict) - return false; + while (p < end) + if (*p++) + return false; - percpu_cluster = this_cpu_ptr(si->percpu_cluster); - percpu_cluster->next[order] = SWAP_NEXT_INVALID; return true; } -static inline bool swap_range_empty(char *swap_map, unsigned int start, - unsigned int nr_pages) + +static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci, + unsigned int start, unsigned char usage, + unsigned int order) { - unsigned int i; + unsigned int nr_pages = 1 << order; - for (i = 0; i < nr_pages; i++) { - if (swap_map[start + i]) - return false; + if (cluster_is_free(ci)) { + if (nr_pages < SWAPFILE_CLUSTER) { + list_move_tail(&ci->list, &si->nonfull_clusters[order]); + ci->flags = CLUSTER_FLAG_NONFULL; + } + ci->order = order; } - return true; + memset(si->swap_map + start, usage, nr_pages); + swap_range_alloc(si, start, nr_pages); + ci->count += nr_pages; + + if (ci->count == SWAPFILE_CLUSTER) { + VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL))); + list_del(&ci->list); + ci->flags = 0; + } +} + +static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset, + unsigned int *foundp, unsigned int order, + unsigned char usage) +{ + unsigned long start = offset & ~(SWAPFILE_CLUSTER - 1); + unsigned long end = min(start + SWAPFILE_CLUSTER, si->max); + unsigned int nr_pages = 1 << order; + struct swap_cluster_info *ci; + + if (end < nr_pages) + return SWAP_NEXT_INVALID; + end -= nr_pages; + + ci = lock_cluster(si, offset); + if (ci->count + nr_pages > SWAPFILE_CLUSTER) { + offset = SWAP_NEXT_INVALID; + goto done; + } + + while (offset <= end) { + if (cluster_scan_range(si, offset, nr_pages)) { + cluster_alloc_range(si, ci, offset, usage, order); + *foundp = offset; + if (ci->count == SWAPFILE_CLUSTER) { + offset = SWAP_NEXT_INVALID; + goto done; + } + offset += nr_pages; + break; + } + offset += nr_pages; + } + if (offset > end) + offset = SWAP_NEXT_INVALID; +done: + unlock_cluster(ci); + return offset; } /* @@ -554,72 +613,66 @@ static inline bool swap_range_empty(char *swap_map, unsigned int start, * pool (a cluster). This might involve allocating a new cluster for current CPU * too. 
*/ -static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si, - unsigned long *offset, unsigned long *scan_base, int order) +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order, + unsigned char usage) { - unsigned int nr_pages = 1 << order; struct percpu_cluster *cluster; - struct swap_cluster_info *ci; - unsigned int tmp, max; + struct swap_cluster_info *ci, *n; + unsigned int offset, found = 0; new_cluster: + lockdep_assert_held(&si->lock); cluster = this_cpu_ptr(si->percpu_cluster); - tmp = cluster->next[order]; - if (tmp == SWAP_NEXT_INVALID) { - if (!list_empty(&si->free_clusters)) { - ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list); - list_del(&ci->list); - spin_lock(&ci->lock); - ci->order = order; - ci->flags = 0; - spin_unlock(&ci->lock); - tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER; - } else if (!list_empty(&si->nonfull_clusters[order])) { - ci = list_first_entry(&si->nonfull_clusters[order], - struct swap_cluster_info, list); - list_del(&ci->list); - spin_lock(&ci->lock); - ci->flags = 0; - spin_unlock(&ci->lock); - tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER; - } else if (!list_empty(&si->discard_clusters)) { - /* - * we don't have free cluster but have some clusters in - * discarding, do discard now and reclaim them, then - * reread cluster_next_cpu since we dropped si->lock - */ - swap_do_scheduled_discard(si); - *scan_base = this_cpu_read(*si->cluster_next_cpu); - *offset = *scan_base; - goto new_cluster; - } else - return false; + offset = cluster->next[order]; + if (offset) { + offset = alloc_swap_scan_cluster(si, offset, &found, order, usage); + if (found) + goto done; } - /* - * Other CPUs can use our cluster if they can't find a free cluster, - * check if there is still free entry in the cluster, maintaining - * natural alignment. - */ - max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER)); - if (tmp < max) { - ci = lock_cluster(si, tmp); - while (tmp < max) { - if (swap_range_empty(si->swap_map, tmp, nr_pages)) - break; - tmp += nr_pages; + if (!list_empty(&si->free_clusters)) { + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list); + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); + VM_BUG_ON(!found); + goto done; + } + + if (order < PMD_ORDER) { + list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) { + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), + &found, order, usage); + if (found) + goto done; } - unlock_cluster(ci); } - if (tmp >= max) { - cluster->next[order] = SWAP_NEXT_INVALID; + + if (!list_empty(&si->discard_clusters)) { + /* + * we don't have free cluster but have some clusters in + * discarding, do discard now and reclaim them, then + * reread cluster_next_cpu since we dropped si->lock + */ + swap_do_scheduled_discard(si); goto new_cluster; } - *offset = tmp; - *scan_base = tmp; - tmp += nr_pages; - cluster->next[order] = tmp < max ? 
tmp : SWAP_NEXT_INVALID; - return true; + + if (order) + goto done; + + for (int o = 1; o < PMD_ORDER; o++) { + if (!list_empty(&si->nonfull_clusters[o])) { + ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info, + list); + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), + &found, 0, usage); + VM_BUG_ON(!found); + goto done; + } + } + +done: + cluster->next[order] = offset; + return found; } static void __del_from_avail_list(struct swap_info_struct *p) @@ -754,11 +807,29 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si, return false; } +static int cluster_alloc_swap(struct swap_info_struct *si, + unsigned char usage, int nr, + swp_entry_t slots[], int order) +{ + int n_ret = 0; + + VM_BUG_ON(!si->cluster_info); + + while (n_ret < nr) { + unsigned long offset = cluster_alloc_swap_entry(si, order, usage); + + if (!offset) + break; + slots[n_ret++] = swp_entry(si->type, offset); + } + + return n_ret; +} + static int scan_swap_map_slots(struct swap_info_struct *si, unsigned char usage, int nr, swp_entry_t slots[], int order) { - struct swap_cluster_info *ci; unsigned long offset; unsigned long scan_base; unsigned long last_in_cluster = 0; @@ -797,26 +868,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si, return 0; } + if (si->cluster_info) + return cluster_alloc_swap(si, usage, nr, slots, order); + si->flags += SWP_SCANNING; - /* - * Use percpu scan base for SSD to reduce lock contention on - * cluster and swap cache. For HDD, sequential access is more - * important. - */ - if (si->flags & SWP_SOLIDSTATE) - scan_base = this_cpu_read(*si->cluster_next_cpu); - else - scan_base = si->cluster_next; + + /* For HDD, sequential access is more important. */ + scan_base = si->cluster_next; offset = scan_base; - /* SSD algorithm */ - if (si->cluster_info) { - if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) { - if (order > 0) - goto no_page; - goto scan; - } - } else if (unlikely(!si->cluster_nr--)) { + if (unlikely(!si->cluster_nr--)) { if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) { si->cluster_nr = SWAPFILE_CLUSTER - 1; goto checks; @@ -827,8 +888,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si, /* * If seek is expensive, start searching for new cluster from * start of partition, to minimize the span of allocated swap. - * If seek is cheap, that is the SWP_SOLIDSTATE si->cluster_info - * case, just handled by scan_swap_map_try_ssd_cluster() above. */ scan_base = offset = si->lowest_bit; last_in_cluster = offset + SWAPFILE_CLUSTER - 1; @@ -856,19 +915,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si, } checks: - if (si->cluster_info) { - while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) { - /* take a break if we already got some slots */ - if (n_ret) - goto done; - if (!scan_swap_map_try_ssd_cluster(si, &offset, - &scan_base, order)) { - if (order > 0) - goto no_page; - goto scan; - } - } - } if (!(si->flags & SWP_WRITEOK)) goto no_page; if (!si->highest_bit) @@ -876,11 +922,9 @@ static int scan_swap_map_slots(struct swap_info_struct *si, if (offset > si->highest_bit) scan_base = offset = si->lowest_bit; - ci = lock_cluster(si, offset); /* reuse swap entry of cache-only swap if not busy. 
*/ if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) { int swap_was_freed; - unlock_cluster(ci); spin_unlock(&si->lock); swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); spin_lock(&si->lock); @@ -891,15 +935,12 @@ static int scan_swap_map_slots(struct swap_info_struct *si, } if (si->swap_map[offset]) { - unlock_cluster(ci); if (!n_ret) goto scan; else goto done; } memset(si->swap_map + offset, usage, nr_pages); - add_cluster_info_page(si, si->cluster_info, offset, nr_pages); - unlock_cluster(ci); swap_range_alloc(si, offset, nr_pages); slots[n_ret++] = swp_entry(si->type, offset); @@ -920,13 +961,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si, latency_ration = LATENCY_LIMIT; } - /* try to get more slots in cluster */ - if (si->cluster_info) { - if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) - goto checks; - if (order > 0) - goto done; - } else if (si->cluster_nr && !si->swap_map[++offset]) { + if (si->cluster_nr && !si->swap_map[++offset]) { /* non-ssd case, still more slots in cluster? */ --si->cluster_nr; goto checks; @@ -995,8 +1030,6 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) ci = lock_cluster(si, offset); memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); ci->count = 0; - ci->order = 0; - ci->flags = 0; free_cluster(si, ci); unlock_cluster(ci); swap_range_free(si, offset, SWAPFILE_CLUSTER); @@ -3008,8 +3041,11 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, ci = cluster_info + idx; if (idx >= nr_clusters) continue; - if (ci->count) + if (ci->count) { + ci->flags = CLUSTER_FLAG_NONFULL; + list_add_tail(&ci->list, &p->nonfull_clusters[0]); continue; + } ci->flags = CLUSTER_FLAG_FREE; list_add_tail(&ci->list, &p->free_clusters); } From patchwork Wed Jul 31 06:49:16 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Li X-Patchwork-Id: 13748188 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id EC702C3DA7F for ; Wed, 31 Jul 2024 06:49:31 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 21DE26B0095; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 1566B6B008A; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E00866B0095; Wed, 31 Jul 2024 02:49:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 95FEE6B0088 for ; Wed, 31 Jul 2024 02:49:26 -0400 (EDT) Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 48CE91A017F for ; Wed, 31 Jul 2024 06:49:26 +0000 (UTC) X-FDA: 82399121532.05.076C200 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf06.hostedemail.com (Postfix) with ESMTP id 8DF21180003 for ; Wed, 31 Jul 2024 06:49:23 +0000 (UTC) Authentication-Results: imf06.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=QI85PgWk; spf=pass (imf06.hostedemail.com: domain of chrisl@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=chrisl@kernel.org; dmarc=pass (policy=none) header.from=kernel.org 
From: chrisl@kernel.org
Date: Tue, 30 Jul 2024 23:49:16 -0700
Subject: [PATCH v5 4/9] mm: swap: clean up initialization helper
Message-Id: <20240730-swap-allocator-v5-4-cb9c148b9297@kernel.org>
References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
In-Reply-To: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, "Huang, Ying", Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song
From: Kairui Song At this point, alloc_cluster() is no longer called, and inc_cluster_info_page() is only called during initialization, so a lot of dead code can be dropped. Signed-off-by: Kairui Song --- mm/swapfile.c | 44 ++++++++++---------------------------------- 1 file changed, 10 insertions(+), 34 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 8a72c8a9aafd..34e6ea13e8e4 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -438,20 +438,6 @@ static void swap_users_ref_free(struct percpu_ref *ref) complete(&si->comp); } -static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx) -{ - struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, - struct swap_cluster_info, list); - - lockdep_assert_held(&si->lock); - lockdep_assert_held(&ci->lock); - VM_BUG_ON(cluster_index(si, ci) != idx); - VM_BUG_ON(ci->count); - list_del(&ci->list); - ci->flags = 0; - return ci; -} - static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci) { VM_BUG_ON(ci->count != 0); @@ -472,34 +458,24 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info * } /* - * The cluster corresponding to page_nr will be used. The cluster will be - * removed from free cluster list and its usage counter will be increased by - * count. + * The cluster corresponding to page_nr will be used. The cluster will not be + * added to free cluster list and its usage counter will be increased by 1. + * Only used for initialization.
*/ -static void add_cluster_info_page(struct swap_info_struct *p, - struct swap_cluster_info *cluster_info, unsigned long page_nr, - unsigned long count) +static void inc_cluster_info_page(struct swap_info_struct *p, + struct swap_cluster_info *cluster_info, unsigned long page_nr) { unsigned long idx = page_nr / SWAPFILE_CLUSTER; - struct swap_cluster_info *ci = cluster_info + idx; + struct swap_cluster_info *ci; if (!cluster_info) return; - if (cluster_is_free(ci)) - alloc_cluster(p, idx); - VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER); - ci->count += count; -} + ci = cluster_info + idx; + ci->count++; -/* - * The cluster corresponding to page_nr will be used. The cluster will be - * removed from free cluster list and its usage counter will be increased by 1. - */ -static void inc_cluster_info_page(struct swap_info_struct *p, - struct swap_cluster_info *cluster_info, unsigned long page_nr) -{ - add_cluster_info_page(p, cluster_info, page_nr, 1); + VM_BUG_ON(ci->count > SWAPFILE_CLUSTER); + VM_BUG_ON(ci->flags); } /* From patchwork Wed Jul 31 06:49:17 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Li X-Patchwork-Id: 13748190 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B3E95C3DA7F for ; Wed, 31 Jul 2024 06:49:38 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9381B6B0088; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 71DE86B0092; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 28E5F6B0088; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id C88DF6B0093 for ; Wed, 31 Jul 2024 02:49:26 -0400 (EDT) Received: from smtpin21.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 6B401A215F for ; Wed, 31 Jul 2024 06:49:26 +0000 (UTC) X-FDA: 82399121532.21.0AB7589 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf02.hostedemail.com (Postfix) with ESMTP id B61D080014 for ; Wed, 31 Jul 2024 06:49:24 +0000 (UTC) Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=aGUYTfmR; spf=pass (imf02.hostedemail.com: domain of chrisl@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=chrisl@kernel.org; dmarc=pass (policy=none) header.from=kernel.org ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1722408537; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=mZMH2Ro7MXV9ZIOqWjTQ/mCW01pKL21+YmWpGPfLWFo=; b=otu4PiYM1twGLvlWznoqBDjjTzu8nOHmeHyqTzo6O4AQhFSrGcJ0y/x5wHvtSZf0RroAaM PhXJq8B0XvJx5WEE7GYCOkdlL50KU0r2ooQJ5Opb65LIl5JexjJ3i81kvSJ+Qxv+BvBPYO vDcgBQSUB7BhNyfZbd5JIM6N2vKpRxw= ARC-Authentication-Results: i=1; imf02.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=aGUYTfmR; spf=pass (imf02.hostedemail.com: domain of 
From: chrisl@kernel.org Date: Tue, 30 Jul 2024 23:49:17 -0700 Subject: [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP MIME-Version: 1.0 Message-Id: <20240730-swap-allocator-v5-5-cb9c148b9297@kernel.org> References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org> In-Reply-To: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org> To: Andrew Morton Cc: Kairui Song , Hugh Dickins , Ryan Roberts , "Huang, Ying" , Kalesh Singh , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li , Barry Song X-Mailer: b4 0.13.0 From: Kairui Song Currently when we are freeing mTHP folios from swap cache, we free them one by one and put each entry into the swap slot cache.
Slot cache is designed to reduce the overhead by batching the freeing, but mTHP swap entries are already continuous so they can be batch freed without it already, it saves litle overhead, or even increase overhead for larger mTHP. What's more, mTHP entries could stay in swap cache for a while. Contiguous swap entry is an rather rare resource so releasing them directly can help improve mTHP allocation success rate when under pressure. Signed-off-by: Kairui Song Acked-by: Barry Song Signed-off-by: Barry Song --- mm/swapfile.c | 59 ++++++++++++++++++++++++++--------------------------------- 1 file changed, 26 insertions(+), 33 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 34e6ea13e8e4..9b63b2262cc2 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -479,20 +479,21 @@ static void inc_cluster_info_page(struct swap_info_struct *p, } /* - * The cluster ci decreases one usage. If the usage counter becomes 0, + * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0, * which means no page in the cluster is in use, we can optionally discard * the cluster and add it to free cluster list. */ -static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci) +static void dec_cluster_info_page(struct swap_info_struct *p, + struct swap_cluster_info *ci, int nr_pages) { if (!p->cluster_info) return; - VM_BUG_ON(ci->count == 0); + VM_BUG_ON(ci->count < nr_pages); VM_BUG_ON(cluster_is_free(ci)); lockdep_assert_held(&p->lock); lockdep_assert_held(&ci->lock); - ci->count--; + ci->count -= nr_pages; if (!ci->count) { free_cluster(p, ci); @@ -998,19 +999,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si, return n_ret; } -static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx) -{ - unsigned long offset = idx * SWAPFILE_CLUSTER; - struct swap_cluster_info *ci; - - ci = lock_cluster(si, offset); - memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER); - ci->count = 0; - free_cluster(si, ci); - unlock_cluster(ci); - swap_range_free(si, offset, SWAPFILE_CLUSTER); -} - int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) { int order = swap_entry_order(entry_order); @@ -1269,21 +1257,28 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p, return usage; } -static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) +/* + * Drop the last HAS_CACHE flag of swap entries, caller have to + * ensure all entries belong to the same cgroup. 
+ */ +static void swap_entry_range_free(struct swap_info_struct *p, swp_entry_t entry, + unsigned int nr_pages) { - struct swap_cluster_info *ci; unsigned long offset = swp_offset(entry); - unsigned char count; + unsigned char *map = p->swap_map + offset; + unsigned char *map_end = map + nr_pages; + struct swap_cluster_info *ci; ci = lock_cluster(p, offset); - count = p->swap_map[offset]; - VM_BUG_ON(count != SWAP_HAS_CACHE); - p->swap_map[offset] = 0; - dec_cluster_info_page(p, ci); + do { + VM_BUG_ON(*map != SWAP_HAS_CACHE); + *map = 0; + } while (++map < map_end); + dec_cluster_info_page(p, ci, nr_pages); unlock_cluster(ci); - mem_cgroup_uncharge_swap(entry, 1); - swap_range_free(p, offset, 1); + mem_cgroup_uncharge_swap(entry, nr_pages); + swap_range_free(p, offset, nr_pages); } static void cluster_swap_free_nr(struct swap_info_struct *sis, @@ -1343,7 +1338,6 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) void put_swap_folio(struct folio *folio, swp_entry_t entry) { unsigned long offset = swp_offset(entry); - unsigned long idx = offset / SWAPFILE_CLUSTER; struct swap_cluster_info *ci; struct swap_info_struct *si; unsigned char *map; @@ -1356,19 +1350,18 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) return; ci = lock_cluster_or_swap_info(si, offset); - if (size == SWAPFILE_CLUSTER) { + if (size > 1) { map = si->swap_map + offset; - for (i = 0; i < SWAPFILE_CLUSTER; i++) { + for (i = 0; i < size; i++) { val = map[i]; VM_BUG_ON(!(val & SWAP_HAS_CACHE)); if (val == SWAP_HAS_CACHE) free_entries++; } - if (free_entries == SWAPFILE_CLUSTER) { + if (free_entries == size) { unlock_cluster_or_swap_info(si, ci); spin_lock(&si->lock); - mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER); - swap_free_cluster(si, idx); + swap_entry_range_free(si, entry, size); spin_unlock(&si->lock); return; } @@ -1413,7 +1406,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n) for (i = 0; i < n; ++i) { p = swap_info_get_cont(entries[i], prev); if (p) - swap_entry_free(p, entries[i]); + swap_entry_range_free(p, entries[i], 1); prev = p; } if (p) From patchwork Wed Jul 31 06:49:18 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Li X-Patchwork-Id: 13748191 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 71096C3DA64 for ; Wed, 31 Jul 2024 06:49:40 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id BE2DF6B0092; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 881AD6B008C; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 609706B0089; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id BA0296B0092 for ; Wed, 31 Jul 2024 02:49:26 -0400 (EDT) Received: from smtpin30.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 5C8F7C0154 for ; Wed, 31 Jul 2024 06:49:26 +0000 (UTC) X-FDA: 82399121532.30.3A91D24 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf12.hostedemail.com (Postfix) with ESMTP id A1A7C40018 for ; Wed, 31 Jul 2024 06:49:24 +0000 (UTC) 
From: chrisl@kernel.org Date: Tue, 30 Jul 2024 23:49:18 -0700 Subject: [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache MIME-Version: 1.0 Message-Id: <20240730-swap-allocator-v5-6-cb9c148b9297@kernel.org> References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org> In-Reply-To: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org> To: Andrew Morton Cc: Kairui Song , Hugh Dickins , Ryan Roberts , "Huang, Ying" , Kalesh Singh , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li , Barry Song X-Mailer: b4 0.13.0
From: Kairui Song Currently we free the reclaimed slots through the slot cache even if the slot is required to be empty immediately. As a result the reclaim caller will see the slot still occupied even after a successful reclaim, and has to keep reclaiming until the slot cache gets flushed. This causes ineffective or over reclaim when SWAP is under stress. So introduce a new flag allowing the slot to be emptied while bypassing the slot cache.
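To illustrate the effect outside of the kernel sources, here is a minimal, self-contained C sketch of why a batching slot cache leaves a freed slot apparently occupied until the cache is flushed, while a direct free empties it immediately; the names (toy_free_slot, flush_slot_cache, the direct flag) are invented for this sketch and only model the behaviour the TTRS_DIRECT path below aims for.

/*
 * Toy model (not kernel code): a "swap map" of slots and a small
 * batching cache.  Freeing through the cache leaves the slot marked
 * in-use until the cache is flushed; freeing directly empties it at
 * once, which is the effect the direct reclaim path wants.
 */
#include <stdio.h>

#define NR_SLOTS   16
#define CACHE_SIZE 4

static unsigned char swap_map[NR_SLOTS];   /* non-zero = slot in use */
static int slot_cache[CACHE_SIZE];
static int cached;

static void flush_slot_cache(void)
{
	for (int i = 0; i < cached; i++)
		swap_map[slot_cache[i]] = 0;   /* only now the slots become free */
	cached = 0;
}

static void toy_free_slot(int slot, int direct)
{
	if (direct) {                          /* analogous to bypassing the slot cache */
		swap_map[slot] = 0;
		return;
	}
	if (cached == CACHE_SIZE)
		flush_slot_cache();
	slot_cache[cached++] = slot;           /* deferred: still looks occupied */
}

int main(void)
{
	swap_map[3] = swap_map[5] = 1;

	toy_free_slot(3, 0);
	printf("cached free: slot 3 in use? %d\n", swap_map[3]);  /* 1: still occupied */

	toy_free_slot(5, 1);
	printf("direct free: slot 5 in use? %d\n", swap_map[5]);  /* 0: empty immediately */
	return 0;
}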
Signed-off-by: Kairui Song --- mm/swapfile.c | 152 +++++++++++++++++++++++++++++++++++++++++----------------- 1 file changed, 109 insertions(+), 43 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 9b63b2262cc2..4c0fc0409d3c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -53,8 +53,15 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); +static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry, + unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset, unsigned int nr_entries); +static bool folio_swapcache_freeable(struct folio *folio); +static struct swap_cluster_info *lock_cluster_or_swap_info( + struct swap_info_struct *si, unsigned long offset); +static void unlock_cluster_or_swap_info(struct swap_info_struct *si, + struct swap_cluster_info *ci); static DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; @@ -129,8 +136,25 @@ static inline unsigned char swap_count(unsigned char ent) * corresponding page */ #define TTRS_UNMAPPED 0x2 -/* Reclaim the swap entry if swap is getting full*/ +/* Reclaim the swap entry if swap is getting full */ #define TTRS_FULL 0x4 +/* Reclaim directly, bypass the slot cache and don't touch device lock */ +#define TTRS_DIRECT 0x8 + +static bool swap_is_has_cache(struct swap_info_struct *si, + unsigned long offset, int nr_pages) +{ + unsigned char *map = si->swap_map + offset; + unsigned char *map_end = map + nr_pages; + + do { + VM_BUG_ON(!(*map & SWAP_HAS_CACHE)); + if (*map != SWAP_HAS_CACHE) + return false; + } while (++map < map_end); + + return true; +} /* * returns number of pages in the folio that backs the swap entry. If positive, @@ -141,12 +165,22 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, unsigned long offset, unsigned long flags) { swp_entry_t entry = swp_entry(si->type, offset); + struct address_space *address_space = swap_address_space(entry); + struct swap_cluster_info *ci; struct folio *folio; - int ret = 0; + int ret, nr_pages; + bool need_reclaim; - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry)); + folio = filemap_get_folio(address_space, swap_cache_index(entry)); if (IS_ERR(folio)) return 0; + + /* offset could point to the middle of a large folio */ + entry = folio->swap; + offset = swp_offset(entry); + nr_pages = folio_nr_pages(folio); + ret = -nr_pages; + /* * When this function is called from scan_swap_map_slots() and it's * called by vmscan.c at reclaiming folios. So we hold a folio lock @@ -154,14 +188,50 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, * case and you should use folio_free_swap() with explicit folio_lock() * in usual operations. */ - if (folio_trylock(folio)) { - if ((flags & TTRS_ANYWAY) || - ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) || - ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio))) - ret = folio_free_swap(folio); - folio_unlock(folio); + if (!folio_trylock(folio)) + goto out; + + need_reclaim = ((flags & TTRS_ANYWAY) || + ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) || + ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio))); + if (!need_reclaim || !folio_swapcache_freeable(folio)) + goto out_unlock; + + /* + * It's safe to delete the folio from swap cache only if the folio's + * swap_map is HAS_CACHE only, which means the slots have no page table + * reference or pending writeback, and can't be allocated to others. 
+ */ + ci = lock_cluster_or_swap_info(si, offset); + need_reclaim = swap_is_has_cache(si, offset, nr_pages); + unlock_cluster_or_swap_info(si, ci); + if (!need_reclaim) + goto out_unlock; + + if (!(flags & TTRS_DIRECT)) { + /* Free through slot cache */ + delete_from_swap_cache(folio); + folio_set_dirty(folio); + ret = nr_pages; + goto out_unlock; } - ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio); + + xa_lock_irq(&address_space->i_pages); + __delete_from_swap_cache(folio, entry, NULL); + xa_unlock_irq(&address_space->i_pages); + folio_ref_sub(folio, nr_pages); + folio_set_dirty(folio); + + spin_lock(&si->lock); + /* Only sinple page folio can be backed by zswap */ + if (!nr_pages) + zswap_invalidate(entry); + swap_entry_range_free(si, entry, nr_pages); + spin_unlock(&si->lock); + ret = nr_pages; +out_unlock: + folio_unlock(folio); +out: folio_put(folio); return ret; } @@ -903,7 +973,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si, if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) { int swap_was_freed; spin_unlock(&si->lock); - swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY); + swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT); spin_lock(&si->lock); /* entry was freed successfully, try to use this again */ if (swap_was_freed > 0) @@ -1340,9 +1410,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) unsigned long offset = swp_offset(entry); struct swap_cluster_info *ci; struct swap_info_struct *si; - unsigned char *map; - unsigned int i, free_entries = 0; - unsigned char val; int size = 1 << swap_entry_order(folio_order(folio)); si = _swap_info_get(entry); @@ -1350,23 +1417,14 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) return; ci = lock_cluster_or_swap_info(si, offset); - if (size > 1) { - map = si->swap_map + offset; - for (i = 0; i < size; i++) { - val = map[i]; - VM_BUG_ON(!(val & SWAP_HAS_CACHE)); - if (val == SWAP_HAS_CACHE) - free_entries++; - } - if (free_entries == size) { - unlock_cluster_or_swap_info(si, ci); - spin_lock(&si->lock); - swap_entry_range_free(si, entry, size); - spin_unlock(&si->lock); - return; - } + if (size > 1 && swap_is_has_cache(si, offset, size)) { + unlock_cluster_or_swap_info(si, ci); + spin_lock(&si->lock); + swap_entry_range_free(si, entry, size); + spin_unlock(&si->lock); + return; } - for (i = 0; i < size; i++, entry.val++) { + for (int i = 0; i < size; i++, entry.val++) { if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) { unlock_cluster_or_swap_info(si, ci); free_swap_slot(entry); @@ -1526,16 +1584,7 @@ static bool folio_swapped(struct folio *folio) return swap_page_trans_huge_swapped(si, entry, folio_order(folio)); } -/** - * folio_free_swap() - Free the swap space used for this folio. - * @folio: The folio to remove. - * - * If swap is getting full, or if there are no more mappings of this folio, - * then call folio_free_swap to free its swap space. - * - * Return: true if we were able to release the swap space. 
- */ -bool folio_free_swap(struct folio *folio) +static bool folio_swapcache_freeable(struct folio *folio) { VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); @@ -1543,8 +1592,6 @@ bool folio_free_swap(struct folio *folio) return false; if (folio_test_writeback(folio)) return false; - if (folio_swapped(folio)) - return false; /* * Once hibernation has begun to create its image of memory, @@ -1564,6 +1611,25 @@ bool folio_free_swap(struct folio *folio) if (pm_suspended_storage()) return false; + return true; +} + +/** + * folio_free_swap() - Free the swap space used for this folio. + * @folio: The folio to remove. + * + * If swap is getting full, or if there are no more mappings of this folio, + * then call folio_free_swap to free its swap space. + * + * Return: true if we were able to release the swap space. + */ +bool folio_free_swap(struct folio *folio) +{ + if (!folio_swapcache_freeable(folio)) + return false; + if (folio_swapped(folio)) + return false; + delete_from_swap_cache(folio); folio_set_dirty(folio); return true; @@ -1640,7 +1706,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) * to the next boundary. */ nr = __try_to_reclaim_swap(si, offset, - TTRS_UNMAPPED | TTRS_FULL); + TTRS_UNMAPPED | TTRS_FULL); if (nr == 0) nr = 1; else if (nr < 0) From patchwork Wed Jul 31 06:49:19 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Li X-Patchwork-Id: 13748192 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 13770C52D1D for ; Wed, 31 Jul 2024 06:49:43 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 17D636B0089; Wed, 31 Jul 2024 02:49:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 0DFA56B0093; Wed, 31 Jul 2024 02:49:28 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D86F06B0098; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 96C316B0089 for ; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) Received: from smtpin27.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 1CA1C8015F for ; Wed, 31 Jul 2024 06:49:27 +0000 (UTC) X-FDA: 82399121574.27.022A852 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf26.hostedemail.com (Postfix) with ESMTP id 5806E14000D for ; Wed, 31 Jul 2024 06:49:25 +0000 (UTC) Authentication-Results: imf26.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=itCPSWVG; spf=pass (imf26.hostedemail.com: domain of chrisl@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=chrisl@kernel.org; dmarc=pass (policy=none) header.from=kernel.org ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1722408537; a=rsa-sha256; cv=none; b=8n46e2IPpjjyhQPhdrUp1UI/b9YxH5xWaJJsa3f+DDvdIZN9MiR9H2cKUu88e6MUHAflbG rmA5Y5oqUacVFtQdZ0ngobQ+XUXxBBZd+uDiEFE0R5EIedbt/Hn6GyI+Ygp0LxWYw1b0Cy nPbVaLxjK+MGwIJf1wqKZqbiGkNcw/E= ARC-Authentication-Results: i=1; imf26.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=itCPSWVG; spf=pass (imf26.hostedemail.com: domain of chrisl@kernel.org designates 139.178.84.217 as permitted sender) 
From: chrisl@kernel.org Date: Tue, 30 Jul 2024 23:49:19 -0700 Subject: [PATCH v5 7/9] mm: swap: add a fragment cluster list MIME-Version: 1.0 Message-Id: <20240730-swap-allocator-v5-7-cb9c148b9297@kernel.org> References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org> In-Reply-To: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org> To: Andrew Morton Cc: Kairui Song , Hugh Dickins , Ryan Roberts , "Huang, Ying" , Kalesh Singh , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li , Barry Song X-Mailer: b4 0.13.0
From: Kairui Song Now the swap cluster allocator arranges the clusters in LRU style, so the "cold" clusters at the head of the nonfull lists are the ones that were used for allocation a long time ago and are still partially occupied. So if the allocator can't find enough contiguous slots to satisfy a high order allocation, it's unlikely slots will be freed on them to satisfy the allocation, at least within a short period. As a result, nonfull cluster scanning will waste time repeatedly scanning the unusable head of the list. Also, multiple CPUs could contend on the same head cluster of the nonfull list. Unlike free clusters, which are removed from the list when any CPU starts using them, a nonfull cluster stays on the head. So introduce a new list, the frag list; all scanned nonfull clusters will be moved to this list, both to avoid repeated scanning and to avoid contention. The frag list is still used as a fallback for allocations, so if one CPU failed to allocate one order of slots, it can still steal other CPUs' clusters. And order 0 will favor the fragmented clusters to better protect nonfull clusters. If any slots on a fragment cluster are freed, move the cluster back to the nonfull list, indicating it is worth another scan. Compared to scanning upon freeing a slot, this keeps the scanning lazy and saves some CPU if there are still other clusters to use. It may seem unnecessary to keep the fragmented clusters on a list at all if they can't be used for a specific order allocation. But this will start to make sense once reclaim during scanning is ready. Signed-off-by: Kairui Song --- include/linux/swap.h | 3 +++ mm/swapfile.c | 41 +++++++++++++++++++++++++++++++++++++---- 2 files changed, 40 insertions(+), 4 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 6716ef236766..5a14b6c65949 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -259,6 +259,7 @@ struct swap_cluster_info { }; #define CLUSTER_FLAG_FREE 1 /* This cluster is free */ #define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */ +#define CLUSTER_FLAG_FRAG 4 /* This cluster is on frag list */ /* * The first page in the swap file is the swap header, which is always marked @@ -299,6 +300,8 @@ struct swap_info_struct { struct list_head free_clusters; /* free clusters list */ struct list_head nonfull_clusters[SWAP_NR_ORDERS]; /* list of cluster that contains at least one free slot */ + struct list_head frag_clusters[SWAP_NR_ORDERS]; + /* list of cluster that are fragmented or contended */ unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ diff --git a/mm/swapfile.c b/mm/swapfile.c index 4c0fc0409d3c..eb3e387e86b2 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -572,7 +572,10 @@ static void dec_cluster_info_page(struct swap_info_struct *p, if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); - list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); + if (ci->flags & CLUSTER_FLAG_FRAG) + list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]); + else + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); ci->flags = CLUSTER_FLAG_NONFULL; } } @@ -610,7 +613,8 @@ static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_ ci->count += nr_pages; if (ci->count == SWAPFILE_CLUSTER) { - VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE
| CLUSTER_FLAG_NONFULL))); + VM_BUG_ON(!(ci->flags & + (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG))); list_del(&ci->list); ci->flags = 0; } @@ -666,6 +670,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o struct percpu_cluster *cluster; struct swap_cluster_info *ci, *n; unsigned int offset, found = 0; + LIST_HEAD(fraged); new_cluster: lockdep_assert_held(&si->lock); @@ -686,13 +691,29 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o if (order < PMD_ORDER) { list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) { + list_move_tail(&ci->list, &fraged); + ci->flags = CLUSTER_FLAG_FRAG; offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); if (found) - goto done; + break; } + + if (!found) { + list_for_each_entry_safe(ci, n, &si->frag_clusters[order], list) { + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), + &found, order, usage); + if (found) + break; + } + } + + list_splice_tail(&fraged, &si->frag_clusters[order]); } + if (found) + goto done; + if (!list_empty(&si->discard_clusters)) { /* * we don't have free cluster but have some clusters in @@ -706,7 +727,17 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o if (order) goto done; + /* Order 0 stealing from higher order */ for (int o = 1; o < PMD_ORDER; o++) { + if (!list_empty(&si->frag_clusters[o])) { + ci = list_first_entry(&si->frag_clusters[o], + struct swap_cluster_info, list); + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, + 0, usage); + VM_BUG_ON(!found); + goto done; + } + if (!list_empty(&si->nonfull_clusters[o])) { ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info, list); @@ -3019,8 +3050,10 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, INIT_LIST_HEAD(&p->free_clusters); INIT_LIST_HEAD(&p->discard_clusters); - for (i = 0; i < SWAP_NR_ORDERS; i++) + for (i = 0; i < SWAP_NR_ORDERS; i++) { INIT_LIST_HEAD(&p->nonfull_clusters[i]); + INIT_LIST_HEAD(&p->frag_clusters[i]); + } for (i = 0; i < swap_header->info.nr_badpages; i++) { unsigned int page_nr = swap_header->info.badpages[i]; From patchwork Wed Jul 31 06:49:20 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Li X-Patchwork-Id: 13748193 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id BA331C3DA7F for ; Wed, 31 Jul 2024 06:49:45 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 4D9F36B008C; Wed, 31 Jul 2024 02:49:28 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 3C6C16B0099; Wed, 31 Jul 2024 02:49:28 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id EEA566B008C; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 995956B0093 for ; Wed, 31 Jul 2024 02:49:27 -0400 (EDT) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 62D7AA0140 for ; Wed, 31 Jul 2024 06:49:27 +0000 (UTC) X-FDA: 82399121574.17.1DF5EDC Received: from dfw.source.kernel.org (dfw.source.kernel.org 
From: chrisl@kernel.org Date: Tue, 30 Jul 2024 23:49:20 -0700 Subject: [PATCH v5 8/9] mm: swap: reclaim the cached parts that got scanned MIME-Version: 1.0 Message-Id: <20240730-swap-allocator-v5-8-cb9c148b9297@kernel.org> References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org> In-Reply-To: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org> To: Andrew Morton Cc: Kairui Song , Hugh Dickins , Ryan Roberts , "Huang, Ying" , Kalesh Singh , linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li , Barry Song X-Mailer: b4 0.13.0
From: Kairui Song This commit implements reclaim during scan for the cluster allocator. Cluster scanning was unable to reuse SWAP_HAS_CACHE slots, which could result in a low allocation success rate or early OOM. So to ensure maximum allocation success rate, integrate reclaiming with scanning: if a range of suitable swap slots is found but is fragmented due to HAS_CACHE, just try to reclaim the slots. Signed-off-by: Kairui Song --- include/linux/swap.h | 1 + mm/swapfile.c | 140 +++++++++++++++++++++++++++++++++++++++------------ 2 files changed, 110 insertions(+), 31 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 5a14b6c65949..9eb740563d63 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -302,6 +302,7 @@ struct swap_info_struct { /* list of cluster that contains at least one free slot */ struct list_head frag_clusters[SWAP_NR_ORDERS]; /* list of cluster that are fragmented or contended */ + unsigned int frag_cluster_nr[SWAP_NR_ORDERS]; unsigned int lowest_bit; /* index of first free in swap_map */ unsigned int highest_bit; /* index of last free in swap_map */ unsigned int pages; /* total of usable pages of swap */ diff --git a/mm/swapfile.c b/mm/swapfile.c index eb3e387e86b2..50e7f600a9a1 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -513,6 +513,10 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info * VM_BUG_ON(ci->count != 0); lockdep_assert_held(&si->lock); lockdep_assert_held(&ci->lock); + + if (ci->flags & CLUSTER_FLAG_FRAG) + si->frag_cluster_nr[ci->order]--; + /* * If the swap is discardable, prepare discard the cluster * instead of free it immediately.
The cluster will be freed @@ -572,31 +576,84 @@ static void dec_cluster_info_page(struct swap_info_struct *p, if (!(ci->flags & CLUSTER_FLAG_NONFULL)) { VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE); - if (ci->flags & CLUSTER_FLAG_FRAG) + if (ci->flags & CLUSTER_FLAG_FRAG) { + p->frag_cluster_nr[ci->order]--; list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]); - else + } else { list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]); + } ci->flags = CLUSTER_FLAG_NONFULL; } } -static inline bool cluster_scan_range(struct swap_info_struct *si, unsigned int start, - unsigned int nr_pages) +static bool cluster_reclaim_range(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long start, unsigned long end) { - unsigned char *p = si->swap_map + start; - unsigned char *end = p + nr_pages; + unsigned char *map = si->swap_map; + unsigned long offset; + + spin_unlock(&ci->lock); + spin_unlock(&si->lock); + + for (offset = start; offset < end; offset++) { + switch (READ_ONCE(map[offset])) { + case 0: + continue; + case SWAP_HAS_CACHE: + if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT) > 0) + continue; + goto out; + default: + goto out; + } + } +out: + spin_lock(&si->lock); + spin_lock(&ci->lock); - while (p < end) - if (*p++) + /* + * Recheck the range no matter reclaim succeeded or not, the slot + * could have been be freed while we are not holding the lock. + */ + for (offset = start; offset < end; offset++) + if (READ_ONCE(map[offset])) return false; return true; } +static bool cluster_scan_range(struct swap_info_struct *si, + struct swap_cluster_info *ci, + unsigned long start, unsigned int nr_pages) +{ + unsigned long offset, end = start + nr_pages; + unsigned char *map = si->swap_map; + bool need_reclaim = false; -static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci, - unsigned int start, unsigned char usage, - unsigned int order) + for (offset = start; offset < end; offset++) { + switch (READ_ONCE(map[offset])) { + case 0: + continue; + case SWAP_HAS_CACHE: + if (!vm_swap_full()) + return false; + need_reclaim = true; + continue; + default: + return false; + } + } + + if (need_reclaim) + return cluster_reclaim_range(si, ci, start, end); + + return true; +} + +static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci, + unsigned int start, unsigned char usage, + unsigned int order) { unsigned int nr_pages = 1 << order; @@ -615,6 +672,8 @@ static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_ if (ci->count == SWAPFILE_CLUSTER) { VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG))); + if (ci->flags & CLUSTER_FLAG_FRAG) + si->frag_cluster_nr[ci->order]--; list_del(&ci->list); ci->flags = 0; } @@ -640,7 +699,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne } while (offset <= end) { - if (cluster_scan_range(si, offset, nr_pages)) { + if (cluster_scan_range(si, ci, offset, nr_pages)) { cluster_alloc_range(si, ci, offset, usage, order); *foundp = offset; if (ci->count == SWAPFILE_CLUSTER) { @@ -668,9 +727,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o unsigned char usage) { struct percpu_cluster *cluster; - struct swap_cluster_info *ci, *n; + struct swap_cluster_info *ci; unsigned int offset, found = 0; - LIST_HEAD(fraged); new_cluster: lockdep_assert_held(&si->lock); @@ -690,25 +748,42 @@ static unsigned long cluster_alloc_swap_entry(struct 
swap_info_struct *si, int o } if (order < PMD_ORDER) { - list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) { - list_move_tail(&ci->list, &fraged); + unsigned int frags = 0; + + while (!list_empty(&si->nonfull_clusters[order])) { + ci = list_first_entry(&si->nonfull_clusters[order], + struct swap_cluster_info, list); + list_move_tail(&ci->list, &si->frag_clusters[order]); ci->flags = CLUSTER_FLAG_FRAG; + si->frag_cluster_nr[order]++; offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); + frags++; if (found) break; } if (!found) { - list_for_each_entry_safe(ci, n, &si->frag_clusters[order], list) { + /* + * Nonfull clusters are moved to frag tail if we reached + * here, count them too, don't over scan the frag list. + */ + while (frags < si->frag_cluster_nr[order]) { + ci = list_first_entry(&si->frag_clusters[order], + struct swap_cluster_info, list); + /* + * Rotate the frag list to iterate, they were all failing + * high order allocation or moved here due to per-CPU usage, + * this help keeping usable cluster ahead. + */ + list_move_tail(&ci->list, &si->frag_clusters[order]); offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); + frags++; if (found) break; } } - - list_splice_tail(&fraged, &si->frag_clusters[order]); } if (found) @@ -729,25 +804,28 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o /* Order 0 stealing from higher order */ for (int o = 1; o < PMD_ORDER; o++) { - if (!list_empty(&si->frag_clusters[o])) { + /* + * Clusters here have at least one usable slots and can't fail order 0 + * allocation, but reclaim may drop si->lock and race with another user. + */ + while (!list_empty(&si->frag_clusters[o])) { ci = list_first_entry(&si->frag_clusters[o], struct swap_cluster_info, list); - offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, - 0, usage); - VM_BUG_ON(!found); - goto done; + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), + &found, 0, usage); + if (found) + goto done; } - if (!list_empty(&si->nonfull_clusters[o])) { - ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info, - list); + while (!list_empty(&si->nonfull_clusters[o])) { + ci = list_first_entry(&si->nonfull_clusters[o], + struct swap_cluster_info, list); offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, 0, usage); - VM_BUG_ON(!found); - goto done; + if (found) + goto done; } } - done: cluster->next[order] = offset; return found; @@ -3053,6 +3131,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, for (i = 0; i < SWAP_NR_ORDERS; i++) { INIT_LIST_HEAD(&p->nonfull_clusters[i]); INIT_LIST_HEAD(&p->frag_clusters[i]); + p->frag_cluster_nr[i] = 0; } for (i = 0; i < swap_header->info.nr_badpages; i++) { @@ -3096,7 +3175,6 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p, if (!cluster_info) return nr_extents; - /* * Reduce false cache line sharing between cluster_info and * sharing same address space. 
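As an aside, and only as an illustration of the patch's idea, here is a small, self-contained C sketch of the scan-plus-reclaim decision that cluster_scan_range()/cluster_reclaim_range() implement above: a slot value of 0 is usable, a cache-only slot may be reclaimed when swap is nearly full, and anything else fails the range. All names here (toy_scan_range, toy_reclaim, HAS_CACHE) are invented for this sketch.

/*
 * Toy model (not kernel code) of scan-with-reclaim: a range is usable if
 * every slot is free, or becomes free after reclaiming slots that are
 * only held by the swap cache (modelled by HAS_CACHE here).
 */
#include <stdbool.h>
#include <stdio.h>

#define HAS_CACHE 0x40

/* Pretend reclaim: drop the cache-only reference so the slot becomes free. */
static bool toy_reclaim(unsigned char *map, int slot)
{
	if (map[slot] == HAS_CACHE) {
		map[slot] = 0;
		return true;
	}
	return false;
}

static bool toy_scan_range(unsigned char *map, int start, int nr, bool swap_full)
{
	bool need_reclaim = false;

	for (int i = start; i < start + nr; i++) {
		if (map[i] == 0)
			continue;                /* free slot, fine */
		if (map[i] == HAS_CACHE && swap_full) {
			need_reclaim = true;     /* cache-only slot, may be reclaimable */
			continue;
		}
		return false;                    /* really in use, give up on this range */
	}

	if (need_reclaim)
		for (int i = start; i < start + nr; i++)
			if (map[i] && !toy_reclaim(map, i))
				return false;
	return true;
}

int main(void)
{
	unsigned char map[8] = { 0, HAS_CACHE, 0, 0, 1, 0, 0, 0 };

	printf("range 0..3 usable: %d\n", toy_scan_range(map, 0, 4, true));  /* 1 after reclaim */
	printf("range 4..7 usable: %d\n", toy_scan_range(map, 4, 4, true));  /* 0, slot 4 is in use */
	return 0;
}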
From patchwork Wed Jul 31 06:49:21 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Li X-Patchwork-Id: 13748194
From: chrisl@kernel.org
Date: Tue, 30 Jul 2024 23:49:21 -0700
Subject: [PATCH v5 9/9] mm: swap: add an adaptive full cluster cache reclaim
Message-Id: <20240730-swap-allocator-v5-9-cb9c148b9297@kernel.org>
References: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
In-Reply-To: <20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, "Huang, Ying", Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song

From: Kairui Song

Link all full clusters with one full list, and reclaim from it when the
allocation has run out of all usable clusters. There are many reasons a
folio can end up in the swap cache while holding no swap count reference,
so the best way to find such slots is still to iterate over the swap
clusters. With the list used as an LRU, iterating from the oldest cluster
and keeping the list rotating is a simple and clean way to free up
clusters that may no longer be in use.

On any allocation failure, try to reclaim and rotate only one cluster.
This is adaptive: high order allocations can tolerate a fallback, so one
cluster per failure avoids latency while still giving the full cluster
list a fair chance to be reclaimed. It relieves pressure on the fallback
order 0 allocation and on subsequent high order allocations.

If the swap device is getting very full, reclaim more aggressively to
ensure no OOM will happen. This keeps an order 0 heavy workload from
going OOM, since order 0 allocation won't fail as long as any cluster
still has free space.
Signed-off-by: Kairui Song
---
 include/linux/swap.h |  1 +
 mm/swapfile.c        | 68 +++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 55 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9eb740563d63..145e796dab84 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -298,6 +298,7 @@ struct swap_info_struct {
 	unsigned long *zeromap;	/* vmalloc'ed bitmap to track zero pages */
 	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
 	struct list_head free_clusters; /* free clusters list */
+	struct list_head full_clusters; /* full clusters list */
 	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
 					/* list of cluster that contains at least one free slot */
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 50e7f600a9a1..9872e0dbfc72 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -440,10 +440,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
 
 	VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
-	if (ci->flags & CLUSTER_FLAG_NONFULL)
-		list_move_tail(&ci->list, &si->discard_clusters);
-	else
-		list_add_tail(&ci->list, &si->discard_clusters);
+	list_move_tail(&ci->list, &si->discard_clusters);
 	ci->flags = 0;
 	schedule_work(&si->discard_work);
 }
@@ -453,10 +450,7 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
 	lockdep_assert_held(&si->lock);
 	lockdep_assert_held(&ci->lock);
 
-	if (ci->flags & CLUSTER_FLAG_NONFULL)
-		list_move_tail(&ci->list, &si->free_clusters);
-	else
-		list_add_tail(&ci->list, &si->free_clusters);
+	list_move_tail(&ci->list, &si->free_clusters);
 	ci->flags = CLUSTER_FLAG_FREE;
 	ci->order = 0;
 }
@@ -576,12 +570,9 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 
 	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
 		VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
-		if (ci->flags & CLUSTER_FLAG_FRAG) {
+		if (ci->flags & CLUSTER_FLAG_FRAG)
 			p->frag_cluster_nr[ci->order]--;
-			list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]);
-		} else {
-			list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
-		}
+		list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]);
 		ci->flags = CLUSTER_FLAG_NONFULL;
 	}
 }
@@ -674,7 +665,7 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 		       (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
 		if (ci->flags & CLUSTER_FLAG_FRAG)
 			si->frag_cluster_nr[ci->order]--;
-		list_del(&ci->list);
+		list_move_tail(&ci->list, &si->full_clusters);
 		ci->flags = 0;
 	}
 }
@@ -718,6 +709,46 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
 	return offset;
 }
 
+static void swap_reclaim_full_clusters(struct swap_info_struct *si)
+{
+	long to_scan = 1;
+	unsigned long offset, end;
+	struct swap_cluster_info *ci;
+	unsigned char *map = si->swap_map;
+	int nr_reclaim, total_reclaimed = 0;
+
+	if (atomic_long_read(&nr_swap_pages) <= SWAPFILE_CLUSTER)
+		to_scan = si->inuse_pages / SWAPFILE_CLUSTER;
+
+	while (!list_empty(&si->full_clusters)) {
+		ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list);
+		list_move_tail(&ci->list, &si->full_clusters);
+		offset = cluster_offset(si, ci);
+		end = min(si->max, offset + SWAPFILE_CLUSTER);
+		to_scan--;
+
+		while (offset < end) {
+			if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
+				spin_unlock(&si->lock);
+				nr_reclaim = __try_to_reclaim_swap(si, offset,
+								   TTRS_ANYWAY | TTRS_DIRECT);
+				spin_lock(&si->lock);
+				if (nr_reclaim > 0) {
+					offset += nr_reclaim;
+					total_reclaimed += nr_reclaim;
+					continue;
+				} else if (nr_reclaim < 0) {
+					offset += -nr_reclaim;
+					continue;
+				}
+			}
+			offset++;
+		}
+		if (to_scan <= 0 || total_reclaimed)
+			break;
+	}
+}
+
 /*
  * Try to get swap entries with specified order from current cpu's swap entry
  * pool (a cluster). This might involve allocating a new cluster for current CPU
@@ -826,7 +857,15 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			goto done;
 		}
 	}
+
 done:
+	/* Try reclaim from full clusters if device is nearfull */
+	if (vm_swap_full() && (!found || (si->pages - si->inuse_pages) < SWAPFILE_CLUSTER)) {
+		swap_reclaim_full_clusters(si);
+		if (!found && !order && si->pages != si->inuse_pages)
+			goto new_cluster;
+	}
+
 	cluster->next[order] = offset;
 	return found;
 }
@@ -3126,6 +3165,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
 	INIT_LIST_HEAD(&p->free_clusters);
+	INIT_LIST_HEAD(&p->full_clusters);
 	INIT_LIST_HEAD(&p->discard_clusters);
 
 	for (i = 0; i < SWAP_NR_ORDERS; i++) {
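To make the adaptive part of the changelog concrete, here is a small hedged sketch (plain userspace C, not the kernel code) of how a reclaim scan budget can scale once the device is nearly full. The helper name reclaim_scan_budget and the numbers are invented for illustration; the real patch additionally rotates the full cluster list and stops the pass early once anything was reclaimed.

/* Illustrative only: scan one full cluster per failed allocation,
 * but scan much harder when free space drops below one cluster. */
#include <stdio.h>

#define CLUSTER_SIZE 512	/* slots per cluster, like SWAPFILE_CLUSTER */

/* Decide how many full clusters one reclaim pass may scan. */
static long reclaim_scan_budget(long free_slots, long inuse_slots)
{
	long to_scan = 1;				/* default: be cheap, scan one */

	if (free_slots <= CLUSTER_SIZE)			/* device nearly full */
		to_scan = inuse_slots / CLUSTER_SIZE;	/* budget scales with usage */

	return to_scan > 0 ? to_scan : 1;
}

int main(void)
{
	/* Plenty of free space: one cluster per pass. */
	printf("budget (roomy device): %ld\n", reclaim_scan_budget(100000, 4096));
	/* Nearly full: budget grows with the number of in-use clusters. */
	printf("budget (nearly full):  %ld\n", reclaim_scan_budget(256, 4096));
	return 0;
}

Keeping the default budget at a single cluster keeps the common failure path cheap; only the nearly-full case pays for a deep scan, which matches the changelog's goal of avoiding OOM for order 0 allocations without adding latency elsewhere.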