From patchwork Thu Aug 22 11:24:58 2024
X-Patchwork-Id: 13773234
From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton
Cc: "Huang, Ying", Hugh Dickins, linux-btrfs@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH] mm: swapfile: fix SSD detection with swapfile on btrfs
Date: Thu, 22 Aug 2024 13:24:58 +0200
Message-ID: <20240822112707.351844-1-hannes@cmpxchg.org>
We've been noticing a trend of significant lock contention in the swap
subsystem as core counts have been increasing in our fleet. It turns out
that our swapfiles on btrfs on flash were in fact using the old swap
code for rotational storage.

This is due to a detection issue in the swapon sequence: btrfs sets
si->bdev during swap activation, which currently happens *after*
swapon's SSD detection and cluster setup. Thus, neither the SSD
optimizations nor the cluster lock splitting are enabled for btrfs
swap.

Rearrange the swapon sequence so that filesystem activation happens
*before* determining swap behavior based on the backing device.

Afterwards, the nonrotational drive is detected correctly:

- Adding 2097148k swap on /mnt/swapfile. Priority:-3 extents:1 across:2097148k
+ Adding 2097148k swap on /mnt/swapfile. Priority:-3 extents:1 across:2097148k SS

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/swapfile.c | 165 ++++++++++++++++++++++++++------------------------
 1 file changed, 86 insertions(+), 79 deletions(-)

Changes since RFC:
o walk badpages[] instead of [0, maxpages] for faster swapon (thanks Ying!)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index c1638a009113..aff73a3d0ead 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3196,29 +3196,15 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 static int setup_swap_map_and_extents(struct swap_info_struct *si,
 					union swap_header *swap_header,
 					unsigned char *swap_map,
-					struct swap_cluster_info *cluster_info,
 					unsigned long maxpages,
 					sector_t *span)
 {
-	unsigned int j, k;
 	unsigned int nr_good_pages;
+	unsigned long i;
 	int nr_extents;
-	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-	unsigned long col = si->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
-	unsigned long i, idx;
 
 	nr_good_pages = maxpages - 1;	/* omit header page */
 
-	INIT_LIST_HEAD(&si->free_clusters);
-	INIT_LIST_HEAD(&si->full_clusters);
-	INIT_LIST_HEAD(&si->discard_clusters);
-
-	for (i = 0; i < SWAP_NR_ORDERS; i++) {
-		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
-		INIT_LIST_HEAD(&si->frag_clusters[i]);
-		si->frag_cluster_nr[i] = 0;
-	}
-
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 		if (page_nr == 0 || page_nr > swap_header->info.last_page)
@@ -3226,25 +3212,11 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si,
 		if (page_nr < maxpages) {
 			swap_map[page_nr] = SWAP_MAP_BAD;
 			nr_good_pages--;
-			/*
-			 * Haven't marked the cluster free yet, no list
-			 * operation involved
-			 */
-			inc_cluster_info_page(si, cluster_info, page_nr);
 		}
 	}
 
-	/* Haven't marked the cluster free yet, no list operation involved */
-	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
-		inc_cluster_info_page(si, cluster_info, i);
-
 	if (nr_good_pages) {
 		swap_map[0] = SWAP_MAP_BAD;
-		/*
-		 * Not mark the cluster free yet, no list
-		 * operation involved
-		 */
-		inc_cluster_info_page(si, cluster_info, 0);
 		si->max = maxpages;
 		si->pages = nr_good_pages;
 		nr_extents = setup_swap_extents(si, span);
@@ -3257,8 +3229,70 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si,
 		return -EINVAL;
 	}
 
+	return nr_extents;
+}
+
+static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
+						union swap_header *swap_header,
+						unsigned long maxpages)
+{
+	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
+	unsigned long col = si->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
+	struct swap_cluster_info *cluster_info;
+	unsigned long i, j, k, idx;
+	int cpu, err = -ENOMEM;
+
+	cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
 	if (!cluster_info)
-		return nr_extents;
+		goto err;
+
+	for (i = 0; i < nr_clusters; i++)
+		spin_lock_init(&cluster_info[i].lock);
+
+	si->cluster_next_cpu = alloc_percpu(unsigned int);
+	if (!si->cluster_next_cpu)
+		goto err_free;
+
+	/* Random start position to help with wear leveling */
+	for_each_possible_cpu(cpu)
+		per_cpu(*si->cluster_next_cpu, cpu) =
+			get_random_u32_inclusive(1, si->highest_bit);
+
+	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+	if (!si->percpu_cluster)
+		goto err_free;
+
+	for_each_possible_cpu(cpu) {
+		struct percpu_cluster *cluster;
+
+		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+		for (i = 0; i < SWAP_NR_ORDERS; i++)
+			cluster->next[i] = SWAP_NEXT_INVALID;
+	}
+
+	/*
+	 * Mark unusable pages as unavailable. The clusters aren't
+	 * marked free yet, so no list operations are involved yet.
+	 *
+	 * See setup_swap_map_and_extents(): header page, bad pages,
+	 * and the EOF part of the last cluster.
+	 */
+	inc_cluster_info_page(si, cluster_info, 0);
+	for (i = 0; i < swap_header->info.nr_badpages; i++)
+		inc_cluster_info_page(si, cluster_info,
+				      swap_header->info.badpages[i]);
+	for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
+		inc_cluster_info_page(si, cluster_info, i);
+
+	INIT_LIST_HEAD(&si->free_clusters);
+	INIT_LIST_HEAD(&si->full_clusters);
+	INIT_LIST_HEAD(&si->discard_clusters);
+
+	for (i = 0; i < SWAP_NR_ORDERS; i++) {
+		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
+		INIT_LIST_HEAD(&si->frag_clusters[i]);
+		si->frag_cluster_nr[i] = 0;
+	}
 
 	/*
 	 * Reduce false cache line sharing between cluster_info and
@@ -3281,7 +3315,13 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si,
 			list_add_tail(&ci->list, &si->free_clusters);
 		}
 	}
-	return nr_extents;
+
+	return cluster_info;
+
+err_free:
+	kvfree(cluster_info);
+err:
+	return ERR_PTR(err);
 }
 
 SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
@@ -3377,6 +3417,17 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto bad_swap_unlock_inode;
 	}
 
+	error = swap_cgroup_swapon(si->type, maxpages);
+	if (error)
+		goto bad_swap_unlock_inode;
+
+	nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
+						maxpages, &span);
+	if (unlikely(nr_extents < 0)) {
+		error = nr_extents;
+		goto bad_swap_unlock_inode;
+	}
+
 	if (si->bdev && bdev_stable_writes(si->bdev))
 		si->flags |= SWP_STABLE_WRITES;
 
@@ -3384,63 +3435,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		si->flags |= SWP_SYNCHRONOUS_IO;
 
 	if (si->bdev && bdev_nonrot(si->bdev)) {
-		int cpu, i;
-		unsigned long ci, nr_cluster;
-
 		si->flags |= SWP_SOLIDSTATE;
-		si->cluster_next_cpu = alloc_percpu(unsigned int);
-		if (!si->cluster_next_cpu) {
-			error = -ENOMEM;
-			goto bad_swap_unlock_inode;
-		}
-		/*
-		 * select a random position to start with to help wear leveling
-		 * SSD
-		 */
-		for_each_possible_cpu(cpu) {
-			per_cpu(*si->cluster_next_cpu, cpu) =
-				get_random_u32_inclusive(1, si->highest_bit);
-		}
-		nr_cluster = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-		cluster_info = kvcalloc(nr_cluster, sizeof(*cluster_info),
-					GFP_KERNEL);
-		if (!cluster_info) {
-			error = -ENOMEM;
+		cluster_info = setup_clusters(si, swap_header, maxpages);
+		if (IS_ERR(cluster_info)) {
+			error = PTR_ERR(cluster_info);
+			cluster_info = NULL;
 			goto bad_swap_unlock_inode;
 		}
-
-		for (ci = 0; ci < nr_cluster; ci++)
-			spin_lock_init(&((cluster_info + ci)->lock));
-
-		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
-		if (!si->percpu_cluster) {
-			error = -ENOMEM;
-			goto bad_swap_unlock_inode;
-		}
-		for_each_possible_cpu(cpu) {
-			struct percpu_cluster *cluster;
-
-			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
-			for (i = 0; i < SWAP_NR_ORDERS; i++)
-				cluster->next[i] = SWAP_NEXT_INVALID;
-		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
 	}
 
-	error = swap_cgroup_swapon(si->type, maxpages);
-	if (error)
-		goto bad_swap_unlock_inode;
-
-	nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
-						cluster_info, maxpages, &span);
-	if (unlikely(nr_extents < 0)) {
-		error = nr_extents;
-		goto bad_swap_unlock_inode;
-	}
-
 	if ((swap_flags & SWAP_FLAG_DISCARD) && si->bdev &&
 	    bdev_max_discard_sectors(si->bdev)) {
 		/*
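
A note on the bug mechanics, for readers unfamiliar with the swapon
flow: the sketch below is a minimal userspace model of the ordering
issue described in the changelog, not kernel code. The type and
helper names (swap_info, activate_swapfile(), nonrot()) are simplified
stand-ins for the kernel interfaces, and btrfs's behavior is reduced
to "the backing device only becomes known during filesystem swap
activation".

#include <stdbool.h>
#include <stdio.h>

struct block_device {
	bool nonrot;			/* solid-state? */
};

struct swap_info {
	struct block_device *bdev;	/* NULL until fs activation */
	bool solidstate;		/* stand-in for SWP_SOLIDSTATE */
};

/*
 * Stand-in for setup_swap_extents(): like btrfs's swap activation,
 * it assigns the backing device as a side effect.
 */
static void activate_swapfile(struct swap_info *si, struct block_device *bdev)
{
	si->bdev = bdev;
}

/* Stand-in for bdev_nonrot(): a NULL bdev reads as rotational. */
static bool nonrot(struct block_device *bdev)
{
	return bdev && bdev->nonrot;
}

int main(void)
{
	struct block_device ssd = { .nonrot = true };

	/*
	 * Old order: detect first, activate later. The check runs
	 * against a NULL bdev, so the SSD is treated as rotational.
	 */
	struct swap_info before = { 0 };
	before.solidstate = nonrot(before.bdev);
	activate_swapfile(&before, &ssd);

	/* New order: activate first, then detect. */
	struct swap_info after = { 0 };
	activate_swapfile(&after, &ssd);
	after.solidstate = nonrot(after.bdev);

	printf("old order: solidstate=%d\n", before.solidstate);	/* 0 */
	printf("new order: solidstate=%d\n", after.solidstate);		/* 1 */
	return 0;
}

This only models the ordering. In the kernel, the detection result
additionally gates the cluster allocator setup (now consolidated in
setup_clusters()), which is why the rotational fallback also meant
running without the split cluster locks.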