From patchwork Sat Jun 29 11:10:09 2024
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 13716890
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: chrisl@kernel.org, david@redhat.com, hannes@cmpxchg.org, kasong@tencent.com, linux-kernel@vger.kernel.org, mhocko@suse.com, nphamcs@gmail.com, ryan.roberts@arm.com, shy828301@gmail.com, surenb@google.com, kaleshsingh@google.com, hughd@google.com, v-songbaohua@oppo.com, willy@infradead.org, xiang@kernel.org, ying.huang@intel.com, yosryahmed@google.com, baolin.wang@linux.alibaba.com, shakeel.butt@linux.dev, senozhatsky@chromium.org, minchan@kernel.org
Subject: [PATCH RFC v4 1/2] mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for large folios swap-in
Date: Sat, 29 Jun 2024 23:10:09 +1200
Message-Id: <20240629111010.230484-2-21cnbao@gmail.com>
In-Reply-To: <20240629111010.230484-1-21cnbao@gmail.com>
References: <20240629111010.230484-1-21cnbao@gmail.com>
From: Barry Song

Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache") supports one entry only. To support large folio swap-in, we need to handle multiple swap entries.
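To illustrate the batching this patch adds in __swap_duplicate_nr(), here is a minimal userspace model (not the kernel code): a first pass validates every slot in the range, and only if all of them are usable does a second pass set SWAP_HAS_CACHE, so a failure in the middle leaves nothing half-prepared. The constants, the flat swap_map array and the plain -1 error value are simplifications invented for this sketch.

/* Userspace model of the two-pass idea behind __swap_duplicate_nr():
 * pass 1 validates the whole batch, pass 2 commits, so a mid-range
 * failure modifies nothing. The flag values and map layout here are
 * stand-ins for this demo, not the kernel's definitions. */
#include <stdio.h>

#define SWAP_HAS_CACHE	0x40		/* assumed flag bit for this model */
#define COUNT_MASK	0x3f

static unsigned char swap_map[64];	/* toy per-slot state; the kernel uses si->swap_map */

/* Return 0 on success, -1 if any slot already has a cache or is unused. */
static int swapcache_prepare_nr_model(unsigned long offset, int nr)
{
	int i;

	/* Pass 1: validate every slot before touching anything. */
	for (i = 0; i < nr; i++) {
		unsigned char count = swap_map[offset + i];

		if (count & SWAP_HAS_CACHE)	/* someone else added a cache */
			return -1;
		if (!(count & COUNT_MASK))	/* unused swap entry */
			return -1;
	}

	/* Pass 2: commit; no error paths remain, so it cannot be partial. */
	for (i = 0; i < nr; i++)
		swap_map[offset + i] |= SWAP_HAS_CACHE;

	return 0;
}

int main(void)
{
	swap_map[8] = swap_map[9] = swap_map[10] = swap_map[11] = 1;	/* 4 used slots */
	swap_map[10] |= SWAP_HAS_CACHE;		/* simulate a racing swapin */

	/* Fails, and slots 8..11 keep their original state. */
	printf("prepare(8, 4) = %d\n", swapcache_prepare_nr_model(8, 4));

	swap_map[10] &= ~SWAP_HAS_CACHE;
	printf("prepare(8, 4) = %d\n", swapcache_prepare_nr_model(8, 4));	/* succeeds */
	return 0;
}

This mirrors why the real function separates the checking loop from the committing loop: with the cluster or swap_info lock held, it must never publish SWAP_HAS_CACHE on some entries of the batch and then bail out on a later one.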
Signed-off-by: Barry Song --- include/linux/swap.h | 4 +- mm/swap.h | 4 +- mm/swapfile.c | 114 +++++++++++++++++++++++++------------------ 3 files changed, 70 insertions(+), 52 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index e473fe6cfb7a..c0f4f2073ca6 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -481,7 +481,7 @@ extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order); extern int add_swap_count_continuation(swp_entry_t, gfp_t); extern void swap_shmem_alloc(swp_entry_t); extern int swap_duplicate(swp_entry_t); -extern int swapcache_prepare(swp_entry_t); +extern int swapcache_prepare_nr(swp_entry_t entry, int nr); extern void swap_free_nr(swp_entry_t entry, int nr_pages); extern void swapcache_free_entries(swp_entry_t *entries, int n); extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); @@ -555,7 +555,7 @@ static inline int swap_duplicate(swp_entry_t swp) return 0; } -static inline int swapcache_prepare(swp_entry_t swp) +static inline int swapcache_prepare_nr(swp_entry_t swp, int nr) { return 0; } diff --git a/mm/swap.h b/mm/swap.h index baa1fa946b34..b96b1157441f 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -59,7 +59,7 @@ void __delete_from_swap_cache(struct folio *folio, void delete_from_swap_cache(struct folio *folio); void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end); -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry); +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr); struct folio *swap_cache_get_folio(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr); struct folio *filemap_get_incore_folio(struct address_space *mapping, @@ -120,7 +120,7 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc) return 0; } -static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +static inline void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr) { } diff --git a/mm/swapfile.c b/mm/swapfile.c index f7224bc1320c..8f60dd10fdef 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1352,7 +1352,8 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry) } static void cluster_swap_free_nr(struct swap_info_struct *sis, - unsigned long offset, int nr_pages) + unsigned long offset, int nr_pages, + unsigned char usage) { struct swap_cluster_info *ci; DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 }; @@ -1362,7 +1363,7 @@ static void cluster_swap_free_nr(struct swap_info_struct *sis, while (nr_pages) { nr = min(BITS_PER_LONG, nr_pages); for (i = 0; i < nr; i++) { - if (!__swap_entry_free_locked(sis, offset + i, 1)) + if (!__swap_entry_free_locked(sis, offset + i, usage)) bitmap_set(to_free, i, 1); } if (!bitmap_empty(to_free, BITS_PER_LONG)) { @@ -1396,7 +1397,7 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) while (nr_pages) { nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); - cluster_swap_free_nr(sis, offset, nr); + cluster_swap_free_nr(sis, offset, nr, 1); offset += nr; nr_pages -= nr; } @@ -3382,7 +3383,7 @@ void si_swapinfo(struct sysinfo *val) } /* - * Verify that a swap entry is valid and increment its swap map count. + * Verify that nr swap entries are valid and increment their swap map counts. * * Returns error code in following case. * - success -> 0 @@ -3392,66 +3393,88 @@ void si_swapinfo(struct sysinfo *val) * - swap-cache reference is requested but the entry is not used. 
-> ENOENT * - swap-mapped reference requested but needs continued swap count. -> ENOMEM */ -static int __swap_duplicate(swp_entry_t entry, unsigned char usage) +static int __swap_duplicate_nr(swp_entry_t entry, unsigned char usage, int nr) { struct swap_info_struct *p; struct swap_cluster_info *ci; unsigned long offset; unsigned char count; unsigned char has_cache; - int err; + int err, i; p = swp_swap_info(entry); offset = swp_offset(entry); + VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); ci = lock_cluster_or_swap_info(p, offset); - count = p->swap_map[offset]; + err = 0; + for (i = 0; i < nr; i++) { + count = p->swap_map[offset + i]; - /* - * swapin_readahead() doesn't check if a swap entry is valid, so the - * swap entry could be SWAP_MAP_BAD. Check here with lock held. - */ - if (unlikely(swap_count(count) == SWAP_MAP_BAD)) { - err = -ENOENT; - goto unlock_out; - } + /* + * swapin_readahead() doesn't check if a swap entry is valid, so the + * swap entry could be SWAP_MAP_BAD. Check here with lock held. + */ + if (unlikely(swap_count(count) == SWAP_MAP_BAD)) { + err = -ENOENT; + goto unlock_out; + } - has_cache = count & SWAP_HAS_CACHE; - count &= ~SWAP_HAS_CACHE; - err = 0; + has_cache = count & SWAP_HAS_CACHE; + count &= ~SWAP_HAS_CACHE; - if (usage == SWAP_HAS_CACHE) { + if (usage == SWAP_HAS_CACHE) { + /* set SWAP_HAS_CACHE if there is no cache and entry is used */ + if (!has_cache && count) + continue; + else if (has_cache) /* someone else added cache */ + err = -EEXIST; + else /* no users remaining */ + err = -ENOENT; - /* set SWAP_HAS_CACHE if there is no cache and entry is used */ - if (!has_cache && count) - has_cache = SWAP_HAS_CACHE; - else if (has_cache) /* someone else added cache */ - err = -EEXIST; - else /* no users remaining */ - err = -ENOENT; + } else if (count || has_cache) { - } else if (count || has_cache) { + if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) + continue; + else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) + err = -EINVAL; + else if (swap_count_continued(p, offset + i, count)) + continue; + else + err = -ENOMEM; + } else + err = -ENOENT; /* unused swap entry */ - if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) + if (err) + goto unlock_out; + } + + for (i = 0; i < nr; i++) { + count = p->swap_map[offset + i]; + has_cache = count & SWAP_HAS_CACHE; + count &= ~SWAP_HAS_CACHE; + + if (usage == SWAP_HAS_CACHE) + has_cache = SWAP_HAS_CACHE; + else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX) count += usage; - else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) - err = -EINVAL; - else if (swap_count_continued(p, offset, count)) - count = COUNT_CONTINUED; else - err = -ENOMEM; - } else - err = -ENOENT; /* unused swap entry */ + count = COUNT_CONTINUED; - if (!err) - WRITE_ONCE(p->swap_map[offset], count | has_cache); + WRITE_ONCE(p->swap_map[offset + i], count | has_cache); + } unlock_out: unlock_cluster_or_swap_info(p, ci); return err; } +static int __swap_duplicate(swp_entry_t entry, unsigned char usage) +{ + return __swap_duplicate_nr(entry, usage, 1); +} + /* * Help swapoff by noting that swap entry belongs to shmem/tmpfs * (in which case its reference count is never incremented). @@ -3485,22 +3508,17 @@ int swap_duplicate(swp_entry_t entry) * -EEXIST means there is a swap cache. * Note: return code is different from swap_duplicate(). 
*/ -int swapcache_prepare(swp_entry_t entry) +int swapcache_prepare_nr(swp_entry_t entry, int nr) { - return __swap_duplicate(entry, SWAP_HAS_CACHE); + return __swap_duplicate_nr(entry, SWAP_HAS_CACHE, nr); } -void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry) +void swapcache_clear_nr(struct swap_info_struct *si, swp_entry_t entry, int nr) { - struct swap_cluster_info *ci; - unsigned long offset = swp_offset(entry); - unsigned char usage; + pgoff_t offset = swp_offset(entry); - ci = lock_cluster_or_swap_info(si, offset); - usage = __swap_entry_free_locked(si, offset, SWAP_HAS_CACHE); - unlock_cluster_or_swap_info(si, ci); - if (!usage) - free_swap_slot(entry); + VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); + cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE); } struct swap_info_struct *swp_swap_info(swp_entry_t entry)

From patchwork Sat Jun 29 11:10:10 2024
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 13716891
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: chrisl@kernel.org, david@redhat.com, hannes@cmpxchg.org, kasong@tencent.com, linux-kernel@vger.kernel.org, mhocko@suse.com, nphamcs@gmail.com, ryan.roberts@arm.com, shy828301@gmail.com, surenb@google.com, kaleshsingh@google.com, hughd@google.com, v-songbaohua@oppo.com, willy@infradead.org, xiang@kernel.org, ying.huang@intel.com, yosryahmed@google.com, baolin.wang@linux.alibaba.com, shakeel.butt@linux.dev, senozhatsky@chromium.org, minchan@kernel.org, Chuanhua Han
Subject: [PATCH RFC v4 2/2] mm: support large folios swapin as a whole for zRAM-like swapfile
Date: Sat, 29 Jun 2024 23:10:10 +1200
Message-Id: <20240629111010.230484-3-21cnbao@gmail.com>
In-Reply-To: <20240629111010.230484-1-21cnbao@gmail.com>
References: <20240629111010.230484-1-21cnbao@gmail.com>
From: Chuanhua Han

In an embedded system like Android, more than half of anonymous memory is actually stored in swap devices such as zRAM. For instance, when an app is switched to the background, most of its memory might be swapped out.

Currently, we have mTHP features, but unfortunately, without support for large folio swap-in, once those large folios are swapped out, we lose them immediately because mTHP is a one-way ticket.

This patch introduces mTHP swap-in support. For now, we limit mTHP swap-ins to contiguous swaps that were likely swapped out from mTHP as a whole. Additionally, the current implementation only covers the SWAP_SYNCHRONOUS case. This is the simplest and most common use case, benefiting millions of Android phones and similar devices with minimal implementation cost. In this straightforward scenario, large folios are always exclusive, eliminating the need to handle complex rmap and swapcache issues.

It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after swap-out and swap-in.
2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT without fragmentation.
3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage and significantly improving compression ratios.

Having deployed this on millions of production devices, we haven't observed any noticeable increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64.
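To make the swap-in constraint above concrete, the snippet below is a small, self-contained userspace model of the contiguity and alignment check that decides whether a batch of PTEs can be refilled with a single large folio. The helper name and data layout are invented for this sketch; the real can_swapin_thp() in the patch additionally re-checks pte_same(), batches PTEs via swap_pte_batch(), and requires every slot to agree on SWAP_HAS_CACHE, all of which are omitted here.

/* Simplified model: the nr swap offsets backing an nr-aligned PTE range
 * must start on an nr-aligned offset and be contiguous. Only then can
 * the fault be served by one large folio of nr pages. */
#include <stdbool.h>
#include <stdio.h>

static bool can_swapin_batch(const unsigned long *pte_swap_offset,
			     unsigned long fault_idx, unsigned long nr)
{
	unsigned long base = fault_idx & ~(nr - 1);	/* align index down; nr is a power of two */
	unsigned long first = pte_swap_offset[base];
	unsigned long i;

	if (first % nr)					/* swap offsets must also be nr-aligned */
		return false;

	for (i = 1; i < nr; i++)			/* offsets must be contiguous */
		if (pte_swap_offset[base + i] != first + i)
			return false;

	return true;
}

int main(void)
{
	/* 16 PTEs whose swap offsets are contiguous and 16-aligned: usable for order-4. */
	unsigned long offs[16];
	for (unsigned long i = 0; i < 16; i++)
		offs[i] = 32 + i;

	printf("order-4 batch ok: %d\n", can_swapin_batch(offs, 5, 16));

	offs[7] = 100;					/* break contiguity */
	printf("order-4 batch ok: %d\n", can_swapin_batch(offs, 5, 16));
	return 0;
}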
Signed-off-by: Chuanhua Han Co-developed-by: Barry Song Signed-off-by: Barry Song --- include/linux/zswap.h | 2 +- mm/memory.c | 210 +++++++++++++++++++++++++++++++++++------- mm/swap_state.c | 2 +- 3 files changed, 181 insertions(+), 33 deletions(-) diff --git a/include/linux/zswap.h b/include/linux/zswap.h index bf83ae5e285d..6cecb4a4f68b 100644 --- a/include/linux/zswap.h +++ b/include/linux/zswap.h @@ -68,7 +68,7 @@ static inline bool zswap_is_enabled(void) static inline bool zswap_never_enabled(void) { - return false; + return true; } #endif diff --git a/mm/memory.c b/mm/memory.c index 0a769f34bbb2..41ec7b919c2e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3987,6 +3987,141 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf) return VM_FAULT_SIGBUS; } +/* + * check a range of PTEs are completely swap entries with + * contiguous swap offsets and the same SWAP_HAS_CACHE. + * ptep must be first one in the range + */ +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + struct swap_info_struct *si; + unsigned long addr; + swp_entry_t entry; + pgoff_t offset; + char has_cache; + int idx, i; + pte_t pte; + + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); + idx = (vmf->address - addr) / PAGE_SIZE; + pte = ptep_get(ptep); + + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx))) + return false; + entry = pte_to_swp_entry(pte); + offset = swp_offset(entry); + if (!IS_ALIGNED(offset, nr_pages)) + return false; + if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages) + return false; + + si = swp_swap_info(entry); + has_cache = si->swap_map[offset] & SWAP_HAS_CACHE; + for (i = 1; i < nr_pages; i++) { + /* + * while allocating a large folio and doing swap_read_folio for the + * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte + * doesn't have swapcache. We need to ensure all PTEs have no cache + * as well, otherwise, we might go to swap devices while the content + * is in swapcache + */ + if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache) + return false; + } + + return true; +} + +/* + * Get a list of all the (large) orders below PMD_ORDER that are enabled + * for this vma. Then filter out the orders that can't be allocated over + * the faulting address and still be fully contained in the vma. + */ +static inline unsigned long get_alloc_folio_orders(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + unsigned long orders; + + orders = thp_vma_allowable_orders(vma, vma->vm_flags, + TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); + orders = thp_vma_suitable_orders(vma, vmf->address, orders); + return orders; +} +#else +static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages) +{ + return false; +} +#endif + +static struct folio *alloc_swap_folio(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + unsigned long orders; + struct folio *folio; + unsigned long addr; + spinlock_t *ptl; + pte_t *pte; + gfp_t gfp; + int order; + + /* + * If uffd is active for the vma we need per-page fault fidelity to + * maintain the uffd semantics. 
+ */ + if (unlikely(userfaultfd_armed(vma))) + goto fallback; + + /* + * a large folio being swapped-in could be partially in + * zswap and partially in swap devices, zswap doesn't + * support large folios yet, we might get corrupted + * zero-filled data by reading all subpages from swap + * devices while some of them are actually in zswap + */ + if (!zswap_never_enabled()) + goto fallback; + + orders = get_alloc_folio_orders(vmf); + if (!orders) + goto fallback; + + pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl); + if (unlikely(!pte)) + goto fallback; + + /* + * For do_swap_page, find the highest order where the aligned range is + * completely swap entries with contiguous swap offsets. + */ + order = highest_order(orders); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order)) + break; + order = next_order(&orders, order); + } + + pte_unmap_unlock(pte, ptl); + + /* Try allocating the highest of the remaining orders. */ + gfp = vma_thp_gfp_mask(vma); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + folio = vma_alloc_folio(gfp, order, vma, addr, true); + if (folio) + return folio; + order = next_order(&orders, order); + } + +fallback: +#endif + return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false); +} + + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4075,35 +4210,38 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (!folio) { if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { - /* - * Prevent parallel swapin from proceeding with - * the cache flag. Otherwise, another thread may - * finish swapin first, free the entry, and swapout - * reusing the same entry. It's undetectable as - * pte_same() returns true due to entry reuse. - */ - if (swapcache_prepare(entry)) { - /* Relax a bit to prevent rapid repeated page faults */ - schedule_timeout_uninterruptible(1); - goto out; - } - need_clear_cache = true; - /* skip swapcache */ - folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, - vma, vmf->address, false); + folio = alloc_swap_folio(vmf); page = &folio->page; if (folio) { __folio_set_locked(folio); __folio_set_swapbacked(folio); + nr_pages = folio_nr_pages(folio); + if (folio_test_large(folio)) + entry.val = ALIGN_DOWN(entry.val, nr_pages); + /* + * Prevent parallel swapin from proceeding with + * the cache flag. Otherwise, another thread may + * finish swapin first, free the entry, and swapout + * reusing the same entry. It's undetectable as + * pte_same() returns true due to entry reuse. 
+ */ + if (swapcache_prepare_nr(entry, nr_pages)) { + /* Relax a bit to prevent rapid repeated page faults */ + schedule_timeout_uninterruptible(1); + goto out_page; + } + need_clear_cache = true; + if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, GFP_KERNEL, entry)) { ret = VM_FAULT_OOM; goto out_page; } - mem_cgroup_swapin_uncharge_swap(entry); + for (swp_entry_t e = entry; e.val < entry.val + nr_pages; e.val++) + mem_cgroup_swapin_uncharge_swap(e); shadow = get_shadow_from_swap_cache(entry); if (shadow) @@ -4210,6 +4348,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_nomap; } + /* allocated large folios for SWP_SYNCHRONOUS_IO */ + if (folio_test_large(folio) && !folio_test_swapcache(folio)) { + unsigned long nr = folio_nr_pages(folio); + unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE); + unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE; + pte_t *folio_ptep = vmf->pte - idx; + + if (!can_swapin_thp(vmf, folio_ptep, nr)) + goto out_nomap; + + page_idx = idx; + address = folio_start; + ptep = folio_ptep; + goto check_folio; + } + nr_pages = 1; page_idx = 0; address = vmf->address; @@ -4341,11 +4495,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_add_lru_vma(folio, vma); } else if (!folio_test_anon(folio)) { /* - * We currently only expect small !anon folios, which are either - * fully exclusive or fully shared. If we ever get large folios - * here, we have to be careful. + * We currently only expect small !anon folios which are either + * fully exclusive or fully shared, or new allocated large folios + * which are fully exclusive. If we ever get large folios within + * swapcache here, we have to be careful. */ - VM_WARN_ON_ONCE(folio_test_large(folio)); + VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio)); VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); folio_add_new_anon_rmap(folio, vma, address, rmap_flags); } else { @@ -4388,7 +4543,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) out: /* Clear the swap cache pin for direct swapin after PTL unlock */ if (need_clear_cache) - swapcache_clear(si, entry); + swapcache_clear_nr(si, entry, nr_pages); if (si) put_swap_device(si); return ret; @@ -4404,7 +4559,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) folio_put(swapcache); } if (need_clear_cache) - swapcache_clear(si, entry); + swapcache_clear_nr(si, entry, nr_pages); if (si) put_swap_device(si); return ret; @@ -4440,14 +4595,7 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf) if (unlikely(userfaultfd_armed(vma))) goto fallback; - /* - * Get a list of all the (large) orders below PMD_ORDER that are enabled - * for this vma. Then filter out the orders that can't be allocated over - * the faulting address and still be fully contained in the vma. - */ - orders = thp_vma_allowable_orders(vma, vma->vm_flags, - TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1); - orders = thp_vma_suitable_orders(vma, vmf->address, orders); + orders = get_alloc_folio_orders(vmf); if (!orders) goto fallback; diff --git a/mm/swap_state.c b/mm/swap_state.c index 994723cef821..7e20de975350 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -478,7 +478,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, /* * Swap entry may have been freed since our caller observed it. */ - err = swapcache_prepare(entry); + err = swapcache_prepare_nr(entry, 1); if (!err) break;
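As a side note on the allocation strategy used by alloc_swap_folio() in this patch: it walks the enabled mTHP orders from highest to lowest and falls back whenever the PTE check or the folio allocation fails, ultimately landing on the order-0 path. The demo below is a stand-alone illustration of that pattern; highest_order(), next_order() and try_alloc() are local reimplementations written just for this sketch, simplified from the kernel helpers of the same name.

/* Userspace sketch of the "try the highest enabled order, then fall back"
 * loop: a bitmask tracks which orders are still worth trying, and each
 * failure retires the current order before moving to the next lower one. */
#include <stdio.h>

static int highest_order(unsigned long orders)
{
	int order = -1;

	while (orders) {		/* index of the most significant set bit */
		order++;
		orders >>= 1;
	}
	return order;
}

static int next_order(unsigned long *orders, int order)
{
	*orders &= ~(1UL << order);	/* retire the order that just failed */
	return highest_order(*orders);
}

static int try_alloc(int order)
{
	return order <= 2;		/* pretend only order <= 2 can be allocated */
}

int main(void)
{
	unsigned long orders = (1UL << 4) | (1UL << 2) | (1UL << 0);	/* enabled: 16, 4, 1 pages */
	int order = highest_order(orders);

	while (orders) {
		if (try_alloc(order)) {
			printf("allocated order %d\n", order);
			return 0;
		}
		order = next_order(&orders, order);
	}
	printf("fell back to the order-0 path\n");
	return 0;
}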