From patchwork Wed Jun 19 09:20:29 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chris Li
X-Patchwork-Id: 13703550
From: Chris Li
Date: Wed, 19 Jun 2024 02:20:29 -0700
Subject: [PATCH v3 1/2] mm: swap: swap cluster switch to double link list
Message-Id: <20240619-swap-allocator-v3-1-e973a3102444@kernel.org>
References: <20240619-swap-allocator-v3-0-e973a3102444@kernel.org>
In-Reply-To: <20240619-swap-allocator-v3-0-e973a3102444@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Ryan Roberts, "Huang, Ying", Kairui Song, Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song
X-Mailer: b4 0.13.0

Previously, the swap cluster used a cluster index as a pointer to construct a
custom singly linked list type, "swap_cluster_list". The next-cluster pointer
shares storage with cluster->count, which prevents putting a non-free cluster
on a list.

Change the cluster to use the standard doubly linked list instead. This
allows tracking nonfull clusters in the follow-up patch, which makes it
faster to find a nonfull cluster of a given order.

Remove the cluster getters/setters for accessing the cluster struct members.

The list operations are protected by swap_info_struct->lock.

Change the cluster code to use "struct swap_cluster_info *" to reference a
cluster rather than an index. That is more consistent with the list
manipulation, avoids repeatedly adding the index to cluster_info, and makes
the code easier to understand.

Remove the "next pointer is NULL" cluster flag; the doubly linked list
handles the empty list well.

The "swap_cluster_info" struct is two pointers bigger. Because 512 swap
entries share one swap cluster struct, this has very little impact on the
average memory usage per swap entry. For a 1TB swapfile, the swap cluster
data structure increases from 8MB to 24MB.

Other than the list conversion, there is no real functional change in this
patch.

Signed-off-by: Chris Li
---
 include/linux/swap.h |  26 +++---
 mm/swapfile.c        | 227 ++++++++++++++-------------------------------------
 2 files changed, 70 insertions(+), 183 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3df75d62a835..690a04f06674 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -243,22 +243,21 @@ enum {
  * free clusters are organized into a list. We fetch an entry from the list to
  * get a free cluster.
  *
- * The data field stores next cluster if the cluster is free or cluster usage
- * counter otherwise. The flags field determines if a cluster is free. This is
- * protected by swap_info_struct.lock.
+ * The flags field determines if a cluster is free. This is
+ * protected by cluster lock.
  */
 struct swap_cluster_info {
 	spinlock_t lock;	/*
 				 * Protect swap_cluster_info fields
-				 * and swap_info_struct->swap_map
-				 * elements correspond to the swap
-				 * cluster
+				 * other than list, and swap_info_struct->swap_map
+				 * elements correspond to the swap cluster.
 				 */
-	unsigned int data:24;
-	unsigned int flags:8;
+	u16 count;
+	u8 flags;
+	struct list_head list;
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+
 
 /*
  * The first page in the swap file is the swap header, which is always marked
@@ -283,11 +282,6 @@ struct percpu_cluster {
 	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
-struct swap_cluster_list {
-	struct swap_cluster_info head;
-	struct swap_cluster_info tail;
-};
-
 /*
  * The in-memory structure used to track swap areas.
  */
@@ -300,7 +294,7 @@ struct swap_info_struct {
 	unsigned int max;		/* extent of the swap_map */
 	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
 	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
-	struct swap_cluster_list free_clusters; /* free clusters list */
+	struct list_head free_clusters; /* free clusters list */
 	unsigned int lowest_bit;	/* index of first free in swap_map */
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
@@ -331,7 +325,7 @@ struct swap_info_struct {
 					 * list.
 					 */
 	struct work_struct discard_work; /* discard worker */
-	struct swap_cluster_list discard_clusters; /* discard clusters list */
+	struct list_head discard_clusters; /* discard clusters list */
 	struct plist_node avail_lists[]; /*
 					   * entries in swap_avail_heads, one
 					   * entry per node.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9c6d8e557c0f..0b11c437f9cc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -290,64 +290,11 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 #endif
 #define LATENCY_LIMIT 256
 
-static inline void cluster_set_flag(struct swap_cluster_info *info,
-	unsigned int flag)
-{
-	info->flags = flag;
-}
-
-static inline unsigned int cluster_count(struct swap_cluster_info *info)
-{
-	return info->data;
-}
-
-static inline void cluster_set_count(struct swap_cluster_info *info,
-				     unsigned int c)
-{
-	info->data = c;
-}
-
-static inline void cluster_set_count_flag(struct swap_cluster_info *info,
-					 unsigned int c, unsigned int f)
-{
-	info->flags = f;
-	info->data = c;
-}
-
-static inline unsigned int cluster_next(struct swap_cluster_info *info)
-{
-	return info->data;
-}
-
-static inline void cluster_set_next(struct swap_cluster_info *info,
-				    unsigned int n)
-{
-	info->data = n;
-}
-
-static inline void cluster_set_next_flag(struct swap_cluster_info *info,
-					 unsigned int n, unsigned int f)
-{
-	info->flags = f;
-	info->data = n;
-}
-
 static inline bool cluster_is_free(struct swap_cluster_info *info)
 {
 	return info->flags & CLUSTER_FLAG_FREE;
 }
 
-static inline bool cluster_is_null(struct swap_cluster_info *info)
-{
-	return info->flags & CLUSTER_FLAG_NEXT_NULL;
-}
-
-static inline void cluster_set_null(struct swap_cluster_info *info)
-{
-	info->flags = CLUSTER_FLAG_NEXT_NULL;
-	info->data = 0;
-}
-
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -394,65 +341,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
 		spin_unlock(&si->lock);
 }
 
-static inline bool cluster_list_empty(struct swap_cluster_list *list)
-{
-	return cluster_is_null(&list->head);
-}
-
-static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
-{
-	return cluster_next(&list->head);
-}
-
-static void cluster_list_init(struct swap_cluster_list *list)
-{
-	cluster_set_null(&list->head);
-	cluster_set_null(&list->tail);
-}
-
-static void cluster_list_add_tail(struct swap_cluster_list *list,
-				  struct swap_cluster_info *ci,
-				  unsigned int idx)
-{
-	if (cluster_list_empty(list)) {
-		cluster_set_next_flag(&list->head, idx, 0);
-		cluster_set_next_flag(&list->tail, idx, 0);
-	} else {
-		struct swap_cluster_info *ci_tail;
-		unsigned int tail = cluster_next(&list->tail);
-
-		/*
-		 * Nested cluster lock, but both cluster locks are
-		 * only acquired when we held swap_info_struct->lock
-		 */
-		ci_tail = ci + tail;
-		spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
-		cluster_set_next(ci_tail, idx);
-		spin_unlock(&ci_tail->lock);
-		cluster_set_next_flag(&list->tail, idx, 0);
-	}
-}
-
-static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
-					   struct swap_cluster_info *ci)
-{
-	unsigned int idx;
-
-	idx = cluster_next(&list->head);
-	if (cluster_next(&list->tail) == idx) {
-		cluster_set_null(&list->head);
-		cluster_set_null(&list->tail);
-	} else
-		cluster_set_next_flag(&list->head,
-				      cluster_next(&ci[idx]), 0);
-
-	return idx;
-}
-
 /* Add a cluster to discard list and schedule it to do discard */
 static void swap_cluster_schedule_discard(struct swap_info_struct *si,
-		unsigned int idx)
+		struct swap_cluster_info *ci)
 {
+	unsigned int idx = ci - si->cluster_info;
 	/*
 	 * If scan_swap_map_slots() can't find a free cluster, it will check
 	 * si->swap_map directly. To make sure the discarding cluster isn't
@@ -462,17 +355,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
 
-	cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
-
+	list_add_tail(&ci->list, &si->discard_clusters);
 	schedule_work(&si->discard_work);
 }
 
-static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
-	struct swap_cluster_info *ci = si->cluster_info;
-
-	cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
-	cluster_list_add_tail(&si->free_clusters, ci, idx);
+	ci->flags = CLUSTER_FLAG_FREE;
+	list_add_tail(&ci->list, &si->free_clusters);
 }
 
 /*
@@ -481,21 +371,22 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
  */
 static void swap_do_scheduled_discard(struct swap_info_struct *si)
 {
-	struct swap_cluster_info *info, *ci;
+	struct swap_cluster_info *ci;
 	unsigned int idx;
 
-	info = si->cluster_info;
-
-	while (!cluster_list_empty(&si->discard_clusters)) {
-		idx = cluster_list_del_first(&si->discard_clusters, info);
+	while (!list_empty(&si->discard_clusters)) {
+		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
+		list_del(&ci->list);
+		idx = ci - si->cluster_info;
 		spin_unlock(&si->lock);
 
 		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
 				SWAPFILE_CLUSTER);
 
 		spin_lock(&si->lock);
-		ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
-		__free_cluster(si, idx);
+
+		spin_lock(&ci->lock);
+		__free_cluster(si, ci);
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
 		unlock_cluster(ci);
@@ -521,20 +412,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
 	complete(&si->comp);
 }
 
-static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
 {
-	struct swap_cluster_info *ci = si->cluster_info;
+	struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
 
-	VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
-	cluster_list_del_first(&si->free_clusters, ci);
-	cluster_set_count_flag(ci + idx, 0, 0);
+	VM_BUG_ON(ci - si->cluster_info != idx);
+	list_del(&ci->list);
+	ci->count = 0;
+	ci->flags = 0;
+	return ci;
 }
 
-static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
-	struct swap_cluster_info *ci = si->cluster_info + idx;
-
-	VM_BUG_ON(cluster_count(ci) != 0);
+	VM_BUG_ON(ci->count != 0);
 	/*
 	 * If the swap is discardable, prepare discard the cluster
 	 * instead of free it immediately. The cluster will be freed
@@ -542,11 +433,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
 	 */
 	if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
 	    (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
-		swap_cluster_schedule_discard(si, idx);
+		swap_cluster_schedule_discard(si, ci);
 		return;
 	}
 
-	__free_cluster(si, idx);
+	__free_cluster(si, ci);
 }
 
 /*
@@ -559,15 +450,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
 	unsigned long count)
 {
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+	struct swap_cluster_info *ci = cluster_info + idx;
 
 	if (!cluster_info)
 		return;
-	if (cluster_is_free(&cluster_info[idx]))
+	if (cluster_is_free(ci))
 		alloc_cluster(p, idx);
 
-	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
-	cluster_set_count(&cluster_info[idx],
-		cluster_count(&cluster_info[idx]) + count);
+	VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
+	ci->count += count;
 }
 
 /*
@@ -581,24 +472,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 }
 
 /*
- * The cluster corresponding to page_nr decreases one usage. If the usage
- * counter becomes 0, which means no page in the cluster is in using, we can
- * optionally discard the cluster and add it to free cluster list.
+ * The cluster ci decreases one usage. If the usage counter becomes 0,
+ * which means no page in the cluster is in using, we can optionally discard
+ * the cluster and add it to free cluster list.
  */
-static void dec_cluster_info_page(struct swap_info_struct *p,
-	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
 {
-	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
-
-	if (!cluster_info)
+	if (!p->cluster_info)
 		return;
 
-	VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
-	cluster_set_count(&cluster_info[idx],
-		cluster_count(&cluster_info[idx]) - 1);
+	VM_BUG_ON(ci->count == 0);
+	ci->count--;
 
-	if (cluster_count(&cluster_info[idx]) == 0)
-		free_cluster(p, idx);
+	if (!ci->count)
+		free_cluster(p, ci);
 }
 
 /*
@@ -611,10 +498,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 {
 	struct percpu_cluster *percpu_cluster;
 	bool conflict;
-
+	struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
 	offset /= SWAPFILE_CLUSTER;
-	conflict = !cluster_list_empty(&si->free_clusters) &&
-		offset != cluster_list_first(&si->free_clusters) &&
+	conflict = !list_empty(&si->free_clusters) &&
+		offset != first - si->cluster_info &&
 		cluster_is_free(&si->cluster_info[offset]);
 
 	if (!conflict)
@@ -655,10 +542,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	cluster = this_cpu_ptr(si->percpu_cluster);
 	tmp = cluster->next[order];
 	if (tmp == SWAP_NEXT_INVALID) {
-		if (!cluster_list_empty(&si->free_clusters)) {
-			tmp = cluster_next(&si->free_clusters.head) *
-					SWAPFILE_CLUSTER;
-		} else if (!cluster_list_empty(&si->discard_clusters)) {
+		if (!list_empty(&si->free_clusters)) {
+			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
+			list_del(&ci->list);
+			spin_lock(&ci->lock);
+			ci->flags = 0;
+			spin_unlock(&ci->lock);
+			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
+		} else if (!list_empty(&si->discard_clusters)) {
 			/*
 			 * we don't have free cluster but have some clusters in
 			 * discarding, do discard now and reclaim them, then
@@ -1062,8 +953,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
 
 	ci = lock_cluster(si, offset);
 	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
-	cluster_set_count_flag(ci, 0, 0);
-	free_cluster(si, idx);
+	ci->count = 0;
+	ci->flags = 0;
+	free_cluster(si, ci);
 	unlock_cluster(ci);
 	swap_range_free(si, offset, SWAPFILE_CLUSTER);
 }
@@ -1336,7 +1228,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
 	count = p->swap_map[offset];
 	VM_BUG_ON(count != SWAP_HAS_CACHE);
 	p->swap_map[offset] = 0;
-	dec_cluster_info_page(p, p->cluster_info, offset);
+	dec_cluster_info_page(p, ci);
 	unlock_cluster(ci);
 
 	mem_cgroup_uncharge_swap(entry, 1);
@@ -3003,8 +2895,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
-	cluster_list_init(&p->free_clusters);
-	cluster_list_init(&p->discard_clusters);
+	INIT_LIST_HEAD(&p->free_clusters);
+	INIT_LIST_HEAD(&p->discard_clusters);
 
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
@@ -3055,14 +2947,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
 		j = (k + col) % SWAP_CLUSTER_COLS;
 		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
+			struct swap_cluster_info *ci;
 			idx = i * SWAP_CLUSTER_COLS + j;
+			ci = cluster_info + idx;
 			if (idx >= nr_clusters)
 				continue;
-			if (cluster_count(&cluster_info[idx]))
+			if (ci->count)
 				continue;
-			cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
-			cluster_list_add_tail(&p->free_clusters, cluster_info,
-					      idx);
+			ci->flags = CLUSTER_FLAG_FREE;
+			list_add_tail(&ci->list, &p->free_clusters);
 		}
 	}
 	return nr_extents;

From patchwork Wed Jun 19 09:20:30 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Chris Li
X-Patchwork-Id: 13703552
From: Chris Li
Date: Wed, 19 Jun 2024 02:20:30 -0700
Subject: [PATCH v3 2/2] mm: swap: mTHP allocate swap entries from nonfull list
Message-Id: <20240619-swap-allocator-v3-2-e973a3102444@kernel.org>
References: <20240619-swap-allocator-v3-0-e973a3102444@kernel.org>
In-Reply-To: <20240619-swap-allocator-v3-0-e973a3102444@kernel.org>
To: Andrew Morton
Cc: Kairui Song, Ryan Roberts, "Huang, Ying", Kairui Song, Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Chris Li, Barry Song
X-Mailer: b4 0.13.0

Track the nonfull clusters as well as the empty clusters on lists. Each
order has one nonfull cluster list.

A cluster remembers which order it was used for when it was allocated as a
new cluster. When the cluster has a free entry, add it to the nonfull[order]
list.

When the free cluster list is empty, also allocate from the nonfull list of
that order.

This improves the mTHP swap allocation success rate.

There are limitations if the distribution of mTHP orders changes a lot. For
example, many nonfull clusters may be assigned to order A, while later there
are many order B allocations and very few order A allocations. Currently a
cluster used by order A will not be reused by order B unless the cluster is
100% empty.

Signed-off-by: Chris Li
---
 include/linux/swap.h |  4 ++++
 mm/swapfile.c        | 27 ++++++++++++++++++++++++---
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 690a04f06674..92613bb4a87b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,9 +254,11 @@ struct swap_cluster_info {
 				 */
 	u16 count;
 	u8 flags;
+	u8 order;
 	struct list_head list;
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
+#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
 
 
 /*
@@ -295,6 +297,8 @@ struct swap_info_struct {
 	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
 	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
 	struct list_head free_clusters; /* free clusters list */
+	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
+					/* list of cluster that contains at least one free slot */
 	unsigned int lowest_bit;	/* index of first free in swap_map */
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0b11c437f9cc..ba6676a4a8ef 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -361,8 +361,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 
 static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
+	if (ci->flags & CLUSTER_FLAG_NONFULL)
+		list_move_tail(&ci->list, &si->free_clusters);
+	else
+		list_add_tail(&ci->list, &si->free_clusters);
 	ci->flags = CLUSTER_FLAG_FREE;
-	list_add_tail(&ci->list, &si->free_clusters);
 }
 
 /*
@@ -485,7 +488,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
 	ci->count--;
 
 	if (!ci->count)
-		free_cluster(p, ci);
+		return free_cluster(p, ci);
+
+	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
+		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+		ci->flags |= CLUSTER_FLAG_NONFULL;
+	}
 }
 
 /*
@@ -542,10 +550,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	cluster = this_cpu_ptr(si->percpu_cluster);
 	tmp = cluster->next[order];
 	if (tmp == SWAP_NEXT_INVALID) {
-		if (!list_empty(&si->free_clusters)) {
+		if (!list_empty(&si->nonfull_clusters[order])) {
+			ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
+			list_del(&ci->list);
+			spin_lock(&ci->lock);
+			ci->flags = 0;
+			spin_unlock(&ci->lock);
+			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
+		} else if (!list_empty(&si->free_clusters)) {
 			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
 			list_del(&ci->list);
 			spin_lock(&ci->lock);
+			ci->order = order;
 			ci->flags = 0;
 			spin_unlock(&ci->lock);
 			tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
@@ -576,6 +592,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 				break;
 			tmp += nr_pages;
 		}
+		WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
 		unlock_cluster(ci);
 	}
 	if (tmp >= max) {
@@ -954,6 +971,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
 	ci = lock_cluster(si, offset);
 	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
 	ci->count = 0;
+	ci->order = 0;
 	ci->flags = 0;
 	free_cluster(si, ci);
 	unlock_cluster(ci);
@@ -2898,6 +2916,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
 	INIT_LIST_HEAD(&p->free_clusters);
 	INIT_LIST_HEAD(&p->discard_clusters);
 
+	for (i = 0; i < SWAP_NR_ORDERS; i++)
+		INIT_LIST_HEAD(&p->nonfull_clusters[i]);
+
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 		if (page_nr == 0 || page_nr > swap_header->info.last_page)
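
For readers following the list conversion in patch 1/2, the core idea can be sketched in userspace C: clusters live in one array, each embeds a list node, and the cluster index is recovered by pointer subtraction instead of being stored as a "next" index. This is only an illustration under assumptions. struct list_node, struct cluster, cluster_of() and pop_free_cluster() are hypothetical stand-ins for the kernel's struct list_head, struct swap_cluster_info and list_first_entry(), and all locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *prev, *next;
};

/* Simplified cluster record for illustration; the real struct also has a lock. */
struct cluster {
	unsigned short count;
	unsigned char flags;
	struct list_node list;
};

#define CLUSTER_FREE 1

static void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

static int list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

static void list_push_tail(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

static void list_unlink(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->prev = node->next = node;
}

/* Map an embedded list node back to the cluster that contains it. */
static struct cluster *cluster_of(struct list_node *node)
{
	return (struct cluster *)((char *)node - offsetof(struct cluster, list));
}

/* Pop the first free cluster and return its index, or -1 if none is free. */
static long pop_free_cluster(struct cluster *base, struct list_node *free_list)
{
	struct cluster *ci;

	if (list_is_empty(free_list))
		return -1;
	ci = cluster_of(free_list->next);
	assert(ci->flags & CLUSTER_FREE);
	list_unlink(&ci->list);
	ci->count = 0;
	ci->flags = 0;
	return ci - base;	/* index recovered by pointer subtraction */
}
```

Because an empty list is simply a head that points to itself, no separate "next is NULL" flag is needed, which mirrors the removal of CLUSTER_FLAG_NEXT_NULL in the patch.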