From patchwork Mon Mar 11 15:00:53 2024
From: Ryan Roberts
To: Andrew Morton , David Hildenbrand , Matthew Wilcox , Huang Ying , Gao Xiang , Yu Zhao , Yang Shi , Michal Hocko , Kefeng Wang , Barry Song <21cnbao@gmail.com>, Chris Li
Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v4 1/6] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
Date: Mon, 11 Mar 2024 15:00:53 +0000
Message-Id: <20240311150058.1122862-2-ryan.roberts@arm.com>
In-Reply-To: <20240311150058.1122862-1-ryan.roberts@arm.com>
References: <20240311150058.1122862-1-ryan.roberts@arm.com>
As preparation for supporting small-sized THP in the swap-out path, without first needing to split to order-0, remove CLUSTER_FLAG_HUGE, which, when present, always implies PMD-sized THP, which is the same as the cluster size.

The only use of the flag was to determine whether a swap entry refers to a single page or a PMD-sized THP in swap_page_trans_huge_swapped(). Instead of relying on the flag, we now pass in nr_pages, which originates from the folio's number of pages. This allows the logic to work for folios of any order.
The one snag is that one of the swap_page_trans_huge_swapped() call sites does not have the folio. But it was only being called there to shortcut a call to __try_to_reclaim_swap() in some cases. __try_to_reclaim_swap() gets the folio and (via some other functions) calls swap_page_trans_huge_swapped(). So I've removed the problematic call site and believe the new logic should be functionally equivalent.

That said, removing the fast path means that we will take a reference and trylock a large folio much more often, which we would like to avoid. The next patch will solve this.

Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster(), which used to be called during folio splitting, since split_swap_cluster()'s only job was to remove the flag.

Signed-off-by: Ryan Roberts
---
 include/linux/swap.h | 10 ----------
 mm/huge_memory.c     |  3 ---
 mm/swapfile.c        | 47 ++++++++------------------------------------
 3 files changed, 8 insertions(+), 52 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2955f7a78d8d..4a8b6c60793a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,7 +259,6 @@ struct swap_cluster_info {
 };
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
-#define CLUSTER_FLAG_HUGE 4 /* This cluster is backing a transparent huge page */
 
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
@@ -600,15 +599,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */
 
-#ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
-#else
-static inline int split_swap_cluster(swp_entry_t entry)
-{
-	return 0;
-}
-#endif
-
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 04fb994a7b0b..5298ba882d49 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2965,9 +2965,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		shmem_uncharge(folio->mapping->host, nr_dropped);
 	remap_page(folio, nr);
 
-	if (folio_test_swapcache(folio))
-		split_swap_cluster(folio->swap);
-
 	/*
 	 * set page to its compound_head when split to non order-0 pages, so
 	 * we can skip unlocking it below, since PG_locked is transferred to
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1155a6304119..df1de034f6d8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -343,18 +343,6 @@ static inline void cluster_set_null(struct swap_cluster_info *info)
 	info->data = 0;
 }
 
-static inline bool cluster_is_huge(struct swap_cluster_info *info)
-{
-	if (IS_ENABLED(CONFIG_THP_SWAP))
-		return info->flags & CLUSTER_FLAG_HUGE;
-	return false;
-}
-
-static inline void cluster_clear_huge(struct swap_cluster_info *info)
-{
-	info->flags &= ~CLUSTER_FLAG_HUGE;
-}
-
 static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 						     unsigned long offset)
 {
@@ -1027,7 +1015,7 @@ static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
 	offset = idx * SWAPFILE_CLUSTER;
 	ci = lock_cluster(si, offset);
 	alloc_cluster(si, idx);
-	cluster_set_count_flag(ci, SWAPFILE_CLUSTER, CLUSTER_FLAG_HUGE);
+	cluster_set_count(ci, SWAPFILE_CLUSTER);
 
 	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
 	unlock_cluster(ci);
@@ -1365,7 +1353,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 
 	ci = lock_cluster_or_swap_info(si, offset);
 	if (size == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!cluster_is_huge(ci));
 		map = si->swap_map + offset;
 		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
 			val = map[i];
@@ -1373,7 +1360,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 			if (val == SWAP_HAS_CACHE)
 				free_entries++;
 		}
-		cluster_clear_huge(ci);
 		if (free_entries == SWAPFILE_CLUSTER) {
 			unlock_cluster_or_swap_info(si, ci);
 			spin_lock(&si->lock);
@@ -1395,23 +1381,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster_or_swap_info(si, ci);
 }
-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-
-	si = _swap_info_get(entry);
-	if (!si)
-		return -EBUSY;
-	ci = lock_cluster(si, offset);
-	cluster_clear_huge(ci);
-	unlock_cluster(ci);
-	return 0;
-}
-#endif
-
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
@@ -1519,22 +1488,23 @@ int swp_swapcount(swp_entry_t entry)
 }
 
 static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
-					 swp_entry_t entry)
+					 swp_entry_t entry,
+					 unsigned int nr_pages)
 {
 	struct swap_cluster_info *ci;
 	unsigned char *map = si->swap_map;
 	unsigned long roffset = swp_offset(entry);
-	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+	unsigned long offset = round_down(roffset, nr_pages);
 	int i;
 	bool ret = false;
 
 	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || !cluster_is_huge(ci)) {
+	if (!ci || nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
 	}
-	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+	for (i = 0; i < nr_pages; i++) {
 		if (swap_count(map[offset + i])) {
 			ret = true;
 			break;
@@ -1556,7 +1526,7 @@ static bool folio_swapped(struct folio *folio)
 	if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
 		return swap_swapcount(si, entry) != 0;
 
-	return swap_page_trans_huge_swapped(si, entry);
+	return swap_page_trans_huge_swapped(si, entry, folio_nr_pages(folio));
 }
 
 /**
@@ -1622,8 +1592,7 @@ int free_swap_and_cache(swp_entry_t entry)
 	}
 
 	count = __swap_entry_free(p, entry);
-	if (count == SWAP_HAS_CACHE &&
-	    !swap_page_trans_huge_swapped(p, entry))
+	if (count == SWAP_HAS_CACHE)
 		__try_to_reclaim_swap(p, swp_offset(entry),
 				      TTRS_UNMAPPED | TTRS_FULL);
 
 	put_swap_device(p);
From patchwork Mon Mar 11 15:00:54 2024
From: Ryan Roberts
To: Andrew Morton , David Hildenbrand , Matthew Wilcox , Huang Ying , Gao Xiang , Yu Zhao , Yang Shi , Michal Hocko , Kefeng Wang , Barry Song <21cnbao@gmail.com>, Chris Li
Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v4 2/6] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
Date: Mon, 11 Mar 2024 15:00:54 +0000
Message-Id: <20240311150058.1122862-3-ryan.roberts@arm.com>
In-Reply-To: <20240311150058.1122862-1-ryan.roberts@arm.com>
References: <20240311150058.1122862-1-ryan.roberts@arm.com>
Now that we no longer have a convenient flag in the cluster to determine if a folio is large, free_swap_and_cache() will take a reference and lock a large folio much more often, which could lead to contention and (e.g.) failure to split large folios.

Let's solve that problem by batch freeing swap and cache with a new function, free_swap_and_cache_nr(), to free a contiguous range of swap entries together. This allows us to first drop a reference to each swap slot before we try to release the cache folio.
This means we only try to release the folio once, only taking the reference and lock once - much better than the previous 512 times for the 2M THP case.

Contiguous swap entries are gathered in zap_pte_range() and madvise_free_pte_range() in a similar way to how present ptes are already gathered in zap_pte_range().

While we are at it, let's simplify by converting the return type of both functions to void. The return value was used only by zap_pte_range() to print a bad pte, and was ignored by everyone else, so the extra reporting wasn't exactly guaranteed. We will still get the warning with most of the information from get_swap_device(). With the batch version, we wouldn't know which pte was bad anyway so could print the wrong one.

Signed-off-by: Ryan Roberts
---
 include/linux/pgtable.h | 28 +++++++++++++++
 include/linux/swap.h    | 12 +++++--
 mm/internal.h           | 48 +++++++++++++++++++++++++
 mm/madvise.c            | 12 ++++---
 mm/memory.c             | 13 +++----
 mm/swapfile.c           | 78 ++++++++++++++++++++++++++++++-----------
 6 files changed, 157 insertions(+), 34 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 85fc7554cd52..8cf1f2fe2c25 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -708,6 +708,34 @@ static inline void pte_clear_not_present_full(struct mm_struct *mm,
 }
 #endif
 
+#ifndef clear_not_present_full_ptes
+/**
+ * clear_not_present_full_ptes - Clear consecutive not present PTEs.
+ * @mm: Address space the ptes represent.
+ * @addr: Address of the first pte.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear.
+ * @full: Whether we are clearing a full mm.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over pte_clear_not_present_full().
+ *
+ * Context: The caller holds the page table lock. The PTEs are all not present.
+ * The PTEs are all in the same PMD.
+ */
+static inline void clear_not_present_full_ptes(struct mm_struct *mm,
+		unsigned long addr, pte_t *ptep, unsigned int nr, int full)
+{
+	for (;;) {
+		pte_clear_not_present_full(mm, addr, ptep, full);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+	}
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_CLEAR_FLUSH
 extern pte_t ptep_clear_flush(struct vm_area_struct *vma,
 			      unsigned long address,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4a8b6c60793a..f2b7f204b968 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -481,7 +481,7 @@ extern int swap_duplicate(swp_entry_t);
 extern int swapcache_prepare(swp_entry_t);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
-extern int free_swap_and_cache(swp_entry_t);
+extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
 extern unsigned int count_swap_pages(int, int);
@@ -530,8 +530,9 @@ static inline void put_swap_device(struct swap_info_struct *si)
 #define free_pages_and_swap_cache(pages, nr) \
 	release_pages((pages), (nr));
 
-/* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */
-#define free_swap_and_cache(e) is_pfn_swap_entry(e)
+static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
+{
+}
 
 static inline void free_swap_cache(struct folio *folio)
 {
@@ -599,6 +600,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 }
 #endif /* CONFIG_SWAP */
 
+static inline void free_swap_and_cache(swp_entry_t entry)
+{
+	free_swap_and_cache_nr(entry, 1);
+}
+
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/mm/internal.h b/mm/internal.h
index a3e19194079f..8dbb1335df88 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -11,6 +11,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 
 struct folio_batch;
@@ -174,6 +176,52 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 	return min(ptep - start_ptep, max_nr);
 }
 
+/**
+ * swap_pte_batch - detect a PTE batch for a set of contiguous swap entries
+ * @start_ptep: Page table pointer for the first entry.
+ * @max_nr: The maximum number of table entries to consider.
+ * @entry: Swap entry recovered from the first table entry.
+ *
+ * Detect a batch of contiguous swap entries: consecutive (non-present) PTEs
+ * containing swap entries all with consecutive offsets and targeting the same
+ * swap type.
+ *
+ * max_nr must be at least one and must be limited by the caller so scanning
+ * cannot exceed a single page table.
+ *
+ * Return: the number of table entries in the batch.
+ */
+static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, swp_entry_t entry)
+{
+	const pte_t *end_ptep = start_ptep + max_nr;
+	unsigned long expected_offset = swp_offset(entry) + 1;
+	unsigned int expected_type = swp_type(entry);
+	pte_t *ptep = start_ptep + 1;
+
+	VM_WARN_ON(max_nr < 1);
+	VM_WARN_ON(non_swap_entry(entry));
+
+	while (ptep < end_ptep) {
+		pte_t pte = ptep_get(ptep);
+
+		if (pte_none(pte) || pte_present(pte))
+			break;
+
+		entry = pte_to_swp_entry(pte);
+
+		if (non_swap_entry(entry) ||
+		    swp_type(entry) != expected_type ||
+		    swp_offset(entry) != expected_offset)
+			break;
+
+		expected_offset++;
+		ptep++;
+	}
+
+	return ptep - start_ptep;
+}
 #endif /* CONFIG_MMU */
 
 void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
diff --git a/mm/madvise.c b/mm/madvise.c
index 44a498c94158..547dcd1f7a39 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -628,6 +628,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	struct folio *folio;
 	int nr_swap = 0;
 	unsigned long next;
+	int nr, max_nr;
 
 	next = pmd_addr_end(addr, end);
 	if (pmd_trans_huge(*pmd))
@@ -640,7 +641,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		return 0;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
-	for (; addr != end; pte++, addr += PAGE_SIZE) {
+	for (; addr != end; pte += nr, addr += PAGE_SIZE * nr) {
+		nr = 1;
 		ptent = ptep_get(pte);
 
 		if (pte_none(ptent))
@@ -655,9 +657,11 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			entry = pte_to_swp_entry(ptent);
 			if (!non_swap_entry(entry)) {
-				nr_swap--;
-				free_swap_and_cache(entry);
-				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+				max_nr = (end - addr) / PAGE_SIZE;
+				nr = swap_pte_batch(pte, max_nr, entry);
+				nr_swap -= nr;
+				free_swap_and_cache_nr(entry, nr);
+				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
 			} else if (is_hwpoison_entry(entry) ||
 				   is_poisoned_swp_entry(entry)) {
 				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb8..25c0ef1c7ff3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1629,12 +1629,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			folio_remove_rmap_pte(folio, page, vma);
 			folio_put(folio);
 		} else if (!non_swap_entry(entry)) {
-			/* Genuine swap entry, hence a private anon page */
+			max_nr = (end - addr) / PAGE_SIZE;
+			nr = swap_pte_batch(pte, max_nr, entry);
+			/* Genuine swap entries, hence private anon pages */
 			if (!should_zap_cows(details))
 				continue;
-			rss[MM_SWAPENTS]--;
-			if (unlikely(!free_swap_and_cache(entry)))
-				print_bad_pte(vma, addr, ptent, NULL);
+			rss[MM_SWAPENTS] -= nr;
+			free_swap_and_cache_nr(entry, nr);
 		} else if (is_migration_entry(entry)) {
 			folio = pfn_swap_entry_folio(entry);
 			if (!should_zap_folio(details, folio))
@@ -1657,8 +1658,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			pr_alert("unrecognized swap entry 0x%lx\n", entry.val);
 			WARN_ON_ONCE(1);
 		}
-		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
-		zap_install_uffd_wp_if_needed(vma, addr, pte, 1, details, ptent);
+		clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
+		zap_install_uffd_wp_if_needed(vma, addr, pte, nr, details, ptent);
 	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
 	add_mm_rss_vec(mm, rss);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index df1de034f6d8..ee7e44cb40c5 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -130,7 +130,11 @@ static inline unsigned char swap_count(unsigned char ent)
 /* Reclaim the swap entry if swap is getting full*/
 #define TTRS_FULL		0x4
 
-/* returns 1 if swap entry is freed */
+/*
+ * returns number of pages in the folio that backs the swap entry. If positive,
+ * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
+ * folio was associated with the swap entry.
+ */
 static int __try_to_reclaim_swap(struct swap_info_struct *si,
 				 unsigned long offset, unsigned long flags)
 {
@@ -155,6 +159,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 		ret = folio_free_swap(folio);
 		folio_unlock(folio);
 	}
+	ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
 	folio_put(folio);
 	return ret;
 }
@@ -895,7 +900,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
 		spin_lock(&si->lock);
 		/* entry was freed successfully, try to use this again */
-		if (swap_was_freed)
+		if (swap_was_freed > 0)
 			goto checks;
 		goto scan; /* check next one */
 	}
@@ -1572,32 +1577,63 @@ bool folio_free_swap(struct folio *folio)
 	return true;
 }
 
-/*
- * Free the swap entry like above, but also try to
- * free the page cache entry if it is the last user.
- */
-int free_swap_and_cache(swp_entry_t entry)
+void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 {
-	struct swap_info_struct *p;
-	unsigned char count;
+	unsigned long end = swp_offset(entry) + nr;
+	unsigned int type = swp_type(entry);
+	struct swap_info_struct *si;
+	unsigned long offset;
 
 	if (non_swap_entry(entry))
-		return 1;
+		return;
 
-	p = get_swap_device(entry);
-	if (p) {
-		if (WARN_ON(data_race(!p->swap_map[swp_offset(entry)]))) {
-			put_swap_device(p);
-			return 0;
-		}
+	si = get_swap_device(entry);
+	if (!si)
+		return;
 
-		count = __swap_entry_free(p, entry);
-		if (count == SWAP_HAS_CACHE)
-			__try_to_reclaim_swap(p, swp_offset(entry),
+	if (WARN_ON(end > si->max))
+		goto out;
+
+	/*
+	 * First free all entries in the range.
+	 */
+	for (offset = swp_offset(entry); offset < end; offset++) {
+		if (!WARN_ON(data_race(!si->swap_map[offset])))
+			__swap_entry_free(si, swp_entry(type, offset));
+	}
+
+	/*
+	 * Now go back over the range trying to reclaim the swap cache. This is
+	 * more efficient for large folios because we will only try to reclaim
+	 * the swap once per folio in the common case. If we do
+	 * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
+	 * latter will get a reference and lock the folio for every individual
+	 * page but will only succeed once the swap slot for every subpage is
+	 * zero.
+	 */
+	for (offset = swp_offset(entry); offset < end; offset += nr) {
+		nr = 1;
+		if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
+			/*
+			 * Folios are always naturally aligned in swap so
+			 * advance forward to the next boundary. Zero means no
+			 * folio was found for the swap entry, so advance by 1
+			 * in this case. Negative value means folio was found
+			 * but could not be reclaimed. Here we can still advance
+			 * to the next boundary.
+			 */
+			nr = __try_to_reclaim_swap(si, offset,
 					      TTRS_UNMAPPED | TTRS_FULL);
-		put_swap_device(p);
+			if (nr == 0)
+				nr = 1;
+			else if (nr < 0)
+				nr = -nr;
+			nr = ALIGN(offset + 1, nr) - offset;
+		}
 	}
-	return p != NULL;
+
+out:
+	put_swap_device(si);
 }
 
 #ifdef CONFIG_HIBERNATION
From patchwork Mon Mar 11 15:00:55 2024
From: Ryan Roberts
To: Andrew Morton , David Hildenbrand , Matthew Wilcox , Huang Ying , Gao Xiang , Yu Zhao , Yang Shi , Michal Hocko , Kefeng Wang , Barry Song <21cnbao@gmail.com>, Chris Li
Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v4 3/6] mm: swap: Simplify struct percpu_cluster
Date: Mon, 11 Mar 2024 15:00:55 +0000
Message-Id: <20240311150058.1122862-4-ryan.roberts@arm.com>
In-Reply-To: <20240311150058.1122862-1-ryan.roberts@arm.com>
References:
<20240311150058.1122862-1-ryan.roberts@arm.com>

struct percpu_cluster stores the index of the cpu's current cluster and the
offset of the next entry that will be allocated for that cpu. These two
pieces of information are redundant, because the cluster index is just
(offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the
cluster index is that the structure used for it also has a flag to indicate
"no cluster".
However, this data structure also contains a spin lock, which is never used
in this context; as a side effect, the code copies the spinlock_t structure,
which is questionable coding practice in my view.

So let's clean this up and store only the next offset, and use a sentinel
value (SWAP_NEXT_INVALID) to indicate "no cluster". SWAP_NEXT_INVALID is
chosen to be 0, because 0 will never be seen legitimately; the first page in
the swap file is the swap header, which is always marked bad to prevent it
from being allocated as an entry. This also prevents the cluster to which it
belongs being marked free, so it will never appear on the free list.

This change saves 16 bytes per cpu. And given we are shortly going to extend
this mechanism to be per-cpu-AND-per-order, we will end up saving 16 * 9 =
144 bytes per cpu, which adds up if you have 256 cpus in the system.

Signed-off-by: Ryan Roberts
Reviewed-by: "Huang, Ying"
---
 include/linux/swap.h |  9 ++++++++-
 mm/swapfile.c        | 22 +++++++++++-----------
 2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f2b7f204b968..0cb082bee717 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,13 +260,20 @@ struct swap_cluster_info {
 #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
 #define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
 
+/*
+ * The first page in the swap file is the swap header, which is always marked
+ * bad to prevent it from being allocated as an entry. This also prevents the
+ * cluster to which it belongs being marked free. Therefore 0 is safe to use as
+ * a sentinel to indicate next is not valid in percpu_cluster.
+ */
+#define SWAP_NEXT_INVALID	0
+
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
  * its own cluster and swapout sequentially. The purpose is to optimize swapout
  * throughput.
  */
 struct percpu_cluster {
-	struct swap_cluster_info index; /* Current cluster index */
 	unsigned int next; /* Likely next allocation offset */
 };

diff --git a/mm/swapfile.c b/mm/swapfile.c
index ee7e44cb40c5..3828d81aa6b8 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -609,7 +609,7 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 		return false;
 
 	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
-	cluster_set_null(&percpu_cluster->index);
+	percpu_cluster->next = SWAP_NEXT_INVALID;
 	return true;
 }
 
@@ -622,14 +622,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 {
 	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
-	unsigned long tmp, max;
+	unsigned int tmp, max;
 
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
-	if (cluster_is_null(&cluster->index)) {
+	tmp = cluster->next;
+	if (tmp == SWAP_NEXT_INVALID) {
 		if (!cluster_list_empty(&si->free_clusters)) {
-			cluster->index = si->free_clusters.head;
-			cluster->next = cluster_next(&cluster->index) *
+			tmp = cluster_next(&si->free_clusters.head) *
 					SWAPFILE_CLUSTER;
 		} else if (!cluster_list_empty(&si->discard_clusters)) {
 			/*
@@ -649,9 +649,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	 * Other CPUs can use our cluster if they can't find a free cluster,
 	 * check if there is still free entry in the cluster
 	 */
-	tmp = cluster->next;
-	max = min_t(unsigned long, si->max,
-		    (cluster_next(&cluster->index) + 1) * SWAPFILE_CLUSTER);
+	max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
 	if (tmp < max) {
 		ci = lock_cluster(si, tmp);
 		while (tmp < max) {
@@ -662,12 +660,13 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 		unlock_cluster(ci);
 	}
 	if (tmp >= max) {
-		cluster_set_null(&cluster->index);
+		cluster->next = SWAP_NEXT_INVALID;
 		goto new_cluster;
 	}
-	cluster->next = tmp + 1;
 	*offset = tmp;
 	*scan_base = tmp;
+	tmp += 1;
+	cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
 	return true;
 }
 
@@ -3138,8 +3137,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		}
 		for_each_possible_cpu(cpu) {
 			struct percpu_cluster *cluster;
+
 			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
-			cluster_set_null(&cluster->index);
+			cluster->next = SWAP_NEXT_INVALID;
 		}
 	} else {
 		atomic_inc(&nr_rotate_swap);

From patchwork Mon Mar 11 15:00:56 2024
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 13588929
From: Ryan Roberts
To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song <21cnbao@gmail.com>, Chris Li
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v4 4/6] mm: swap: Allow storage of all mTHP orders
Date: Mon, 11 Mar 2024 15:00:56 +0000
Message-Id: <20240311150058.1122862-5-ryan.roberts@arm.com>
In-Reply-To:
<20240311150058.1122862-1-ryan.roberts@arm.com>
References: <20240311150058.1122862-1-ryan.roberts@arm.com>

Multi-size THP enables performance improvements by allocating large,
pte-mapped folios for anonymous memory. However, I've observed that on an
arm64 system running a parallel workload (e.g. kernel compilation) across
many cores, under high memory pressure, performance regresses.
This is due to bottlenecking on the increased number of TLBIs added due to
all the extra folio splitting when the large folios are swapped out.

Therefore, solve this regression by adding support for swapping out mTHP
without needing to split the folio, just like is already done for PMD-sized
THP. This change only applies when CONFIG_THP_SWAP is enabled, and when the
swap backing store is a non-rotating block device. These are the same
constraints as for the existing PMD-sized THP swap-out support.

Note that no attempt is made to swap-in (m)THP here - this is still done
page-by-page, like for PMD-sized THP. But swapping out mTHP is a prerequisite
for swapping in mTHP.

The main change here is to improve the swap entry allocator so that it can
allocate any power-of-2 number of contiguous entries between [1, (1 <<
PMD_ORDER)]. This is done by allocating a cluster for each distinct order and
allocating sequentially from it until the cluster is full. This ensures that
we don't need to search the map and we get no fragmentation due to alignment
padding for different orders in the cluster. If there is no current cluster
for a given order, we attempt to allocate a free cluster from the list. If
there are no free clusters, we fail the allocation and the caller can fall
back to splitting the folio and allocating individual entries (as per the
existing PMD-sized THP fallback).

The per-order current clusters are maintained per-cpu using the existing
infrastructure. This is done to avoid interleaving pages from different
tasks, which would prevent IO being batched. This is already done for the
order-0 allocations, so we follow the same pattern.

As is done for order-0 per-cpu clusters, the scanner can now steal order-0
entries from any per-cpu-per-order reserved cluster. This ensures that when
the swap file is getting full, space doesn't get tied up in the per-cpu
reserves.

This change only modifies swap to be able to accept any order mTHP.
It doesn't change the callers to elide doing the actual split. That will be
done in separate changes.

Signed-off-by: Ryan Roberts
---
 include/linux/swap.h |   8 ++-
 mm/swapfile.c        | 167 +++++++++++++++++++++++++------------------
 2 files changed, 103 insertions(+), 72 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0cb082bee717..39b5c18ccc6a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -268,13 +268,19 @@ struct swap_cluster_info {
  */
 #define SWAP_NEXT_INVALID	0
 
+#ifdef CONFIG_THP_SWAP
+#define SWAP_NR_ORDERS		(PMD_ORDER + 1)
+#else
+#define SWAP_NR_ORDERS		1
+#endif
+
 /*
  * We assign a cluster to each CPU, so each CPU can allocate swap entry from
  * its own cluster and swapout sequentially. The purpose is to optimize swapout
  * throughput.
  */
 struct percpu_cluster {
-	unsigned int next; /* Likely next allocation offset */
+	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
 struct swap_cluster_list {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3828d81aa6b8..61118a090796 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -551,10 +551,12 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
 
 /*
  * The cluster corresponding to page_nr will be used. The cluster will be
- * removed from free cluster list and its usage counter will be increased.
+ * removed from free cluster list and its usage counter will be increased by
+ * count.
  */
-static void inc_cluster_info_page(struct swap_info_struct *p,
-	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void add_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr,
+	unsigned long count)
 {
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
 
@@ -563,9 +565,19 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
 	if (cluster_is_free(&cluster_info[idx]))
 		alloc_cluster(p, idx);
 
-	VM_BUG_ON(cluster_count(&cluster_info[idx]) >= SWAPFILE_CLUSTER);
+	VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
 	cluster_set_count(&cluster_info[idx],
-		cluster_count(&cluster_info[idx]) + 1);
+		cluster_count(&cluster_info[idx]) + count);
+}
+
+/*
+ * The cluster corresponding to page_nr will be used. The cluster will be
+ * removed from free cluster list and its usage counter will be increased by 1.
+ */
+static void inc_cluster_info_page(struct swap_info_struct *p,
+	struct swap_cluster_info *cluster_info, unsigned long page_nr)
+{
+	add_cluster_info_page(p, cluster_info, page_nr, 1);
 }
 
 /*
@@ -595,7 +607,7 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
  */
 static bool
 scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
-	unsigned long offset)
+	unsigned long offset, int order)
 {
 	struct percpu_cluster *percpu_cluster;
 	bool conflict;
@@ -609,24 +621,39 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
 		return false;
 
 	percpu_cluster = this_cpu_ptr(si->percpu_cluster);
-	percpu_cluster->next = SWAP_NEXT_INVALID;
+	percpu_cluster->next[order] = SWAP_NEXT_INVALID;
+	return true;
+}
+
+static inline bool swap_range_empty(char *swap_map, unsigned int start,
+				    unsigned int nr_pages)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr_pages; i++) {
+		if (swap_map[start + i])
+			return false;
+	}
+
+	return true;
 }
 
 /*
- * Try to get a swap entry from current cpu's swap entry pool (a cluster). This
- * might involve allocating a new cluster for current CPU too.
+ * Try to get a swap entry (or size indicated by order) from current cpu's swap
+ * entry pool (a cluster). This might involve allocating a new cluster for
+ * current CPU too.
  */
 static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
-	unsigned long *offset, unsigned long *scan_base)
+	unsigned long *offset, unsigned long *scan_base, int order)
 {
+	unsigned int nr_pages = 1 << order;
 	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
 	unsigned int tmp, max;
 
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
-	tmp = cluster->next;
+	tmp = cluster->next[order];
 	if (tmp == SWAP_NEXT_INVALID) {
 		if (!cluster_list_empty(&si->free_clusters)) {
 			tmp = cluster_next(&si->free_clusters.head) *
@@ -647,26 +674,27 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	/*
 	 * Other CPUs can use our cluster if they can't find a free cluster,
-	 * check if there is still free entry in the cluster
+	 * check if there is still free entry in the cluster, maintaining
+	 * natural alignment.
 	 */
 	max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
 	if (tmp < max) {
 		ci = lock_cluster(si, tmp);
 		while (tmp < max) {
-			if (!si->swap_map[tmp])
+			if (swap_range_empty(si->swap_map, tmp, nr_pages))
 				break;
-			tmp++;
+			tmp += nr_pages;
 		}
 		unlock_cluster(ci);
 	}
 	if (tmp >= max) {
-		cluster->next = SWAP_NEXT_INVALID;
+		cluster->next[order] = SWAP_NEXT_INVALID;
 		goto new_cluster;
 	}
 	*offset = tmp;
 	*scan_base = tmp;
-	tmp += 1;
-	cluster->next = tmp < max ? tmp : SWAP_NEXT_INVALID;
+	tmp += nr_pages;
+	cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
 	return true;
 }
 
@@ -796,13 +824,14 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
 
 static int scan_swap_map_slots(struct swap_info_struct *si,
 			       unsigned char usage, int nr,
-			       swp_entry_t slots[])
+			       swp_entry_t slots[], unsigned int nr_pages)
 {
 	struct swap_cluster_info *ci;
 	unsigned long offset;
 	unsigned long scan_base;
 	unsigned long last_in_cluster = 0;
 	int latency_ration = LATENCY_LIMIT;
+	int order = ilog2(nr_pages);
 	int n_ret = 0;
 	bool scanned_many = false;
 
@@ -817,6 +846,26 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	 * And we let swap pages go all over an SSD partition. Hugh
 	 */
 
+	if (nr_pages > 1) {
+		/*
+		 * Should not even be attempting large allocations when huge
+		 * page swap is disabled. Warn and fail the allocation.
+		 */
+		if (!IS_ENABLED(CONFIG_THP_SWAP) ||
+		    nr_pages > SWAPFILE_CLUSTER ||
+		    !is_power_of_2(nr_pages)) {
+			VM_WARN_ON_ONCE(1);
+			return 0;
+		}
+
+		/*
+		 * Swapfile is not block device or not using clusters so unable
+		 * to allocate large entries.
+		 */
+		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
+			return 0;
+	}
+
 	si->flags += SWP_SCANNING;
 	/*
 	 * Use percpu scan base for SSD to reduce lock contention on
@@ -831,8 +880,11 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 
 	/* SSD algorithm */
 	if (si->cluster_info) {
-		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+		if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
+			if (order > 0)
+				goto no_page;
 			goto scan;
+		}
 	} else if (unlikely(!si->cluster_nr--)) {
 		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
 			si->cluster_nr = SWAPFILE_CLUSTER - 1;
@@ -874,26 +926,30 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 
 checks:
 	if (si->cluster_info) {
-		while (scan_swap_map_ssd_cluster_conflict(si, offset)) {
+		while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
 		/* take a break if we already got some slots */
 			if (n_ret)
 				goto done;
 			if (!scan_swap_map_try_ssd_cluster(si, &offset,
-							&scan_base))
+							&scan_base, order)) {
+				if (order > 0)
+					goto no_page;
 				goto scan;
+			}
 		}
 	}
 	if (!(si->flags & SWP_WRITEOK))
 		goto no_page;
 	if (!si->highest_bit)
 		goto no_page;
-	if (offset > si->highest_bit)
+	if (order == 0 && offset > si->highest_bit)
 		scan_base = offset = si->lowest_bit;
 
 	ci = lock_cluster(si, offset);
 	/* reuse swap entry of cache-only swap if not busy. */
 	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
 		int swap_was_freed;
+
+		VM_WARN_ON(order > 0);
 		unlock_cluster(ci);
 		spin_unlock(&si->lock);
 		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
@@ -905,17 +961,18 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	}
 
 	if (si->swap_map[offset]) {
+		VM_WARN_ON(order > 0);
 		unlock_cluster(ci);
 		if (!n_ret)
 			goto scan;
 		else
 			goto done;
 	}
-	WRITE_ONCE(si->swap_map[offset], usage);
-	inc_cluster_info_page(si, si->cluster_info, offset);
+	memset(si->swap_map + offset, usage, nr_pages);
+	add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
 	unlock_cluster(ci);
 
-	swap_range_alloc(si, offset, 1);
+	swap_range_alloc(si, offset, nr_pages);
 	slots[n_ret++] = swp_entry(si->type, offset);
 
 	/* got enough slots or reach max slots? */
@@ -936,8 +993,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 
 	/* try to get more slots in cluster */
 	if (si->cluster_info) {
-		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base))
+		if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
 			goto checks;
+		if (order > 0)
+			goto done;
 	} else if (si->cluster_nr && !si->swap_map[++offset]) {
 		/* non-ssd case, still more slots in cluster? */
 		--si->cluster_nr;
@@ -964,7 +1023,8 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	}
 
 done:
-	set_cluster_next(si, offset + 1);
+	if (order == 0)
+		set_cluster_next(si, offset + 1);
 	si->flags -= SWP_SCANNING;
 	return n_ret;
 
@@ -997,38 +1057,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return n_ret;
 }
 
-static int swap_alloc_cluster(struct swap_info_struct *si, swp_entry_t *slot)
-{
-	unsigned long idx;
-	struct swap_cluster_info *ci;
-	unsigned long offset;
-
-	/*
-	 * Should not even be attempting cluster allocations when huge
-	 * page swap is disabled. Warn and fail the allocation.
-	 */
-	if (!IS_ENABLED(CONFIG_THP_SWAP)) {
-		VM_WARN_ON_ONCE(1);
-		return 0;
-	}
-
-	if (cluster_list_empty(&si->free_clusters))
-		return 0;
-
-	idx = cluster_list_first(&si->free_clusters);
-	offset = idx * SWAPFILE_CLUSTER;
-	ci = lock_cluster(si, offset);
-	alloc_cluster(si, idx);
-	cluster_set_count(ci, SWAPFILE_CLUSTER);
-
-	memset(si->swap_map + offset, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
-	unlock_cluster(ci);
-	swap_range_alloc(si, offset, SWAPFILE_CLUSTER);
-	*slot = swp_entry(si->type, offset);
-
-	return 1;
-}
-
 static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
 {
 	unsigned long offset = idx * SWAPFILE_CLUSTER;
@@ -1050,8 +1078,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 	int n_ret = 0;
 	int node;
 
-	/* Only single cluster request supported */
-	WARN_ON_ONCE(n_goal > 1 && size == SWAPFILE_CLUSTER);
+	/* Only single THP request supported */
+	WARN_ON_ONCE(n_goal > 1 && size > 1);
 
 	spin_lock(&swap_avail_lock);
 
@@ -1088,14 +1116,10 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			spin_unlock(&si->lock);
 			goto nextsi;
 		}
-		if (size == SWAPFILE_CLUSTER) {
-			if (si->flags & SWP_BLKDEV)
-				n_ret = swap_alloc_cluster(si, swp_entries);
-		} else
-			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-						    n_goal, swp_entries);
+		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+					    n_goal, swp_entries, size);
 		spin_unlock(&si->lock);
-		if (n_ret || size == SWAPFILE_CLUSTER)
+		if (n_ret || size > 1)
 			goto check_out;
 		cond_resched();
 
@@ -1647,7 +1671,7 @@ swp_entry_t get_swap_page_of_type(int type)
 
 	/* This is called for allocating swap entry, not cache */
 	spin_lock(&si->lock);
-	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry))
+	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 1))
 		atomic_long_dec(&nr_swap_pages);
 	spin_unlock(&si->lock);
 fail:
@@ -3101,7 +3125,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		p->flags |= SWP_SYNCHRONOUS_IO;
 
 	if (p->bdev && bdev_nonrot(p->bdev)) {
-		int cpu;
+		int cpu, i;
 		unsigned long ci, nr_cluster;
 
 		p->flags |= SWP_SOLIDSTATE;
@@ -3139,7 +3163,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 			struct percpu_cluster *cluster;
 
 			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
-			cluster->next = SWAP_NEXT_INVALID;
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cluster->next[i] = SWAP_NEXT_INVALID;
 		}
 	} else {
 		atomic_inc(&nr_rotate_swap);

From patchwork Mon Mar 11 15:00:57 2024
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 13588931
From: Ryan Roberts
To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song <21cnbao@gmail.com>, Chris Li
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v4 5/6] mm: vmscan: Avoid split during shrink_folio_list()
Date: Mon, 11 Mar 2024 15:00:57 +0000
Message-Id: <20240311150058.1122862-6-ryan.roberts@arm.com>
In-Reply-To: <20240311150058.1122862-1-ryan.roberts@arm.com>
References: <20240311150058.1122862-1-ryan.roberts@arm.com>

Now that swap supports storing all mTHP sizes, avoid splitting large folios
before swap-out.
This benefits performance of the swap-out path by eliding
split_folio_to_list(), which is expensive, and also sets us up for swapping
in large folios in a future series.

If the folio is partially mapped, we continue to split it, since we want to
avoid the extra IO overhead and storage of writing out pages unnecessarily.

Signed-off-by: Ryan Roberts
Reviewed-by: Barry Song
Reviewed-by: David Hildenbrand
---
 mm/vmscan.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cf7d4cf47f1a..0ebec99e04c6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1222,11 +1222,12 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 					if (!can_split_folio(folio, NULL))
 						goto activate_locked;
 					/*
-					 * Split folios without a PMD map right
-					 * away. Chances are some or all of the
-					 * tail pages can be freed without IO.
+					 * Split partially mapped folios
+					 * right away. Chances are some or all
+					 * of the tail pages can be freed
+					 * without IO.
 					 */
-					if (!folio_entire_mapcount(folio) &&
+					if (!list_empty(&folio->_deferred_list) &&
 					    split_folio_to_list(folio, folio_list))
 						goto activate_locked;

From patchwork Mon Mar 11 15:00:58 2024
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 13588938
From: Ryan Roberts
To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying,
    Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang,
    Barry Song <21cnbao@gmail.com>, Chris Li
Cc: Ryan Roberts, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v4 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
Date: Mon, 11 Mar 2024 15:00:58 +0000
Message-Id: <20240311150058.1122862-7-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <20240311150058.1122862-1-ryan.roberts@arm.com>
References: <20240311150058.1122862-1-ryan.roberts@arm.com>
MIME-Version: 1.0
Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
folio that is fully and contiguously mapped in the pageout/cold vm
range. This change means that large folios will be maintained all the
way to swap storage. This both improves performance during swap-out, by
eliding the cost of splitting the folio, and sets us up nicely for
maintaining the large folio when it is swapped back in (to be covered
in a separate series).
Folios that are not fully mapped in the target range are still split,
but note that behavior is changed so that if the split fails for any
reason (folio locked, shared, etc.) we now leave it as is and move to
the next pte in the range and continue work on the remaining folios.
Previously any failure of this sort would cause the entire operation to
give up and no folios mapped at higher addresses were paged out or made
cold. Given large folios are becoming more common, this old behavior
would likely have led to wasted opportunities.

While we are at it, change the code that clears young from the ptes to
use ptep_test_and_clear_young(), which is more efficient than
get_and_clear/modify/set, especially for contpte mappings on arm64,
where the old approach would require unfolding/refolding and the new
approach can be done in place.

Signed-off-by: Ryan Roberts
Reviewed-by: Barry Song
---
 mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 51 insertions(+), 38 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 547dcd1f7a39..56c7ba7bd558 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	LIST_HEAD(folio_list);
 	bool pageout_anon_only_filter;
 	unsigned int batch_count = 0;
+	int nr;
 
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		return 0;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
-	for (; addr < end; pte++, addr += PAGE_SIZE) {
+	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
+		nr = 1;
 		ptent = ptep_get(pte);
 
 		if (++batch_count == SWAP_CLUSTER_MAX) {
@@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			continue;
 
 		/*
-		 * Creating a THP page is expensive so split it only if we
-		 * are sure it's worth. Split it if we are only owner.
+		 * If we encounter a large folio, only split it if it is not
+		 * fully mapped within the range we are operating on. Otherwise
+		 * leave it as is so that it can be swapped out whole. If we
+		 * fail to split a folio, leave it in place and advance to the
+		 * next pte in the range.
 		 */
 		if (folio_test_large(folio)) {
-			int err;
-
-			if (folio_estimated_sharers(folio) > 1)
-				break;
-			if (pageout_anon_only_filter && !folio_test_anon(folio))
-				break;
-			if (!folio_trylock(folio))
-				break;
-			folio_get(folio);
-			arch_leave_lazy_mmu_mode();
-			pte_unmap_unlock(start_pte, ptl);
-			start_pte = NULL;
-			err = split_folio(folio);
-			folio_unlock(folio);
-			folio_put(folio);
-			if (err)
-				break;
-			start_pte = pte =
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
-			if (!start_pte)
-				break;
-			arch_enter_lazy_mmu_mode();
-			pte--;
-			addr -= PAGE_SIZE;
-			continue;
+			const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
+						FPB_IGNORE_SOFT_DIRTY;
+			int max_nr = (end - addr) / PAGE_SIZE;
+
+			nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
+					     fpb_flags, NULL);
+
+			if (nr < folio_nr_pages(folio)) {
+				int err;
+
+				if (folio_estimated_sharers(folio) > 1)
+					continue;
+				if (pageout_anon_only_filter && !folio_test_anon(folio))
+					continue;
+				if (!folio_trylock(folio))
+					continue;
+				folio_get(folio);
+				arch_leave_lazy_mmu_mode();
+				pte_unmap_unlock(start_pte, ptl);
+				start_pte = NULL;
+				err = split_folio(folio);
+				folio_unlock(folio);
+				folio_put(folio);
+				if (err)
+					continue;
+				start_pte = pte =
+					pte_offset_map_lock(mm, pmd, addr, &ptl);
+				if (!start_pte)
+					break;
+				arch_enter_lazy_mmu_mode();
+				nr = 0;
+				continue;
+			}
 		}
 
 		/*
 		 * Do not interfere with other mappings of this folio and
-		 * non-LRU folio.
+		 * non-LRU folio. If we have a large folio at this point, we
+		 * know it is fully mapped so if its mapcount is the same as its
+		 * number of pages, it must be exclusive.
 		 */
-		if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
+		if (!folio_test_lru(folio) ||
+		    folio_mapcount(folio) != folio_nr_pages(folio))
 			continue;
 
 		if (pageout_anon_only_filter && !folio_test_anon(folio))
 			continue;
 
-		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
-		if (!pageout && pte_young(ptent)) {
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
-			ptent = pte_mkold(ptent);
-			set_pte_at(mm, addr, pte, ptent);
-			tlb_remove_tlb_entry(tlb, pte, addr);
+		if (!pageout) {
+			for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
+				if (ptep_test_and_clear_young(vma, addr, pte))
+					tlb_remove_tlb_entry(tlb, pte, addr);
+			}
 		}
 
 		/*