From patchwork Fri Jul 26 09:46:17 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 13742539
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: ying.huang@intel.com, baolin.wang@linux.alibaba.com, chrisl@kernel.org,
	david@redhat.com, hannes@cmpxchg.org, hughd@google.com,
	kaleshsingh@google.com, kasong@tencent.com, linux-kernel@vger.kernel.org,
	mhocko@suse.com, minchan@kernel.org, nphamcs@gmail.com,
	ryan.roberts@arm.com, senozhatsky@chromium.org, shakeel.butt@linux.dev,
	shy828301@gmail.com, surenb@google.com, v-songbaohua@oppo.com,
	willy@infradead.org, xiang@kernel.org, yosryahmed@google.com,
	Chuanhua Han
Subject: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
Date: Fri, 26 Jul 2024 21:46:17 +1200
Message-Id: <20240726094618.401593-4-21cnbao@gmail.com>
In-Reply-To: <20240726094618.401593-1-21cnbao@gmail.com>
References: <20240726094618.401593-1-21cnbao@gmail.com>
From: Chuanhua Han

In an embedded system like Android, more than half of anonymous memory is
actually stored in swap devices such as zRAM. For instance, when an app is
switched to the background, most of its memory might be swapped out.

Currently, we have mTHP features, but unfortunately, without support for
large folio swap-ins, once those large folios are swapped out, we lose them
immediately because mTHP is a one-way ticket.

This patch introduces mTHP swap-in support. For now, we limit mTHP swap-ins
to contiguous swaps that were likely swapped out from an mTHP as a whole.

Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
case. This is the simplest and most common use case, benefiting millions of
Android phones and similar devices with minimal implementation cost. In this
straightforward scenario, large folios are always exclusive, eliminating the
need to handle complex rmap and swapcache issues.

It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
   swap-out and swap-in.
2. Eliminates fragmentation in swap slots and supports successful
   THP_SWPOUT without fragmentation.
3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU
   usage and enhancing compression ratios significantly.

Having deployed this on millions of actual products, we haven't observed any
noticeable increase in memory footprint for 64KiB mTHP based on CONT-PTE on
ARM64.
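To make the swap-slot alignment rule concrete: thp_swap_suitable_orders()
(added below) only keeps an order as a candidate when the faulting virtual
page number and the swap offset agree modulo nr = 1 << order, so the
naturally aligned block of nr PTEs can plausibly map to nr contiguous swap
slots. The following is a userspace sketch for illustration only (not kernel
code); the helper name, the 4KiB page size, and the example address/offset
are made up:

/*
 * Illustration only, not kernel code: the alignment filter applied
 * before a large swap-in is attempted. An order stays a candidate
 * only if the faulting virtual page number and the swap offset are
 * congruent modulo nr = 1 << order. Assumes 4KiB pages (PAGE_SHIFT
 * = 12) and made-up example values.
 */
#include <stdio.h>

#define PAGE_SHIFT	12

static int order_is_suitable(unsigned long addr, unsigned long swp_offset,
			     int order)
{
	unsigned long nr = 1UL << order;

	return ((addr >> PAGE_SHIFT) % nr) == (swp_offset % nr);
}

int main(void)
{
	/* hypothetical fault address and swap offset */
	unsigned long addr = 0x7f1234560000UL;
	unsigned long swp_offset = 0x128;
	int order;

	for (order = 1; order <= 6; order++)
		printf("order-%d (%lu pages): %s\n", order, 1UL << order,
		       order_is_suitable(addr, swp_offset, order) ?
		       "candidate" : "filtered out");
	return 0;
}

With these example values, orders 1-3 survive the filter while order 4 and
above are rejected before any PTE scanning or folio allocation is attempted.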
Signed-off-by: Chuanhua Han
Co-developed-by: Barry Song
Signed-off-by: Barry Song
---
 mm/memory.c | 211 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 188 insertions(+), 23 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 833d2cad6eb2..14048e9285d4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3986,6 +3986,152 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	return VM_FAULT_SIGBUS;
 }
 
+/*
+ * check a range of PTEs are completely swap entries with
+ * contiguous swap offsets and the same SWAP_HAS_CACHE.
+ * ptep must be first one in the range
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+	struct swap_info_struct *si;
+	unsigned long addr;
+	swp_entry_t entry;
+	pgoff_t offset;
+	char has_cache;
+	int idx, i;
+	pte_t pte;
+
+	addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+	idx = (vmf->address - addr) / PAGE_SIZE;
+	pte = ptep_get(ptep);
+
+	if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
+		return false;
+	entry = pte_to_swp_entry(pte);
+	offset = swp_offset(entry);
+	if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
+		return false;
+
+	si = swp_swap_info(entry);
+	has_cache = si->swap_map[offset] & SWAP_HAS_CACHE;
+	for (i = 1; i < nr_pages; i++) {
+		/*
+		 * while allocating a large folio and doing swap_read_folio for the
+		 * SWP_SYNCHRONOUS_IO path, which is the case the being faulted pte
+		 * doesn't have swapcache. We need to ensure all PTEs have no cache
+		 * as well, otherwise, we might go to swap devices while the content
+		 * is in swapcache
+		 */
+		if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache)
+			return false;
+	}
+
+	return true;
+}
+
+static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
+		unsigned long addr, unsigned long orders)
+{
+	int order, nr;
+
+	order = highest_order(orders);
+
+	/*
+	 * To swap-in a THP with nr pages, we require its first swap_offset
+	 * is aligned with nr. This can filter out most invalid entries.
+	 */
+	while (orders) {
+		nr = 1 << order;
+		if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
+			break;
+		order = next_order(&orders, order);
+	}
+
+	return orders;
+}
+#else
+static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+	return false;
+}
+#endif
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long orders;
+	struct folio *folio;
+	unsigned long addr;
+	swp_entry_t entry;
+	spinlock_t *ptl;
+	pte_t *pte;
+	gfp_t gfp;
+	int order;
+
+	/*
+	 * If uffd is active for the vma we need per-page fault fidelity to
+	 * maintain the uffd semantics.
+	 */
+	if (unlikely(userfaultfd_armed(vma)))
+		goto fallback;
+
+	/*
+	 * A large swapped out folio could be partially or fully in zswap. We
+	 * lack handling for such cases, so fallback to swapping in order-0
+	 * folio.
+	 */
+	if (!zswap_never_enabled())
+		goto fallback;
+
+	entry = pte_to_swp_entry(vmf->orig_pte);
+	/*
+	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
+	 * and suitable for swapping THP.
+	 */
+	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+	orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);
+
+	if (!orders)
+		goto fallback;
+
+	pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address & PMD_MASK, &ptl);
+	if (unlikely(!pte))
+		goto fallback;
+
+	/*
+	 * For do_swap_page, find the highest order where the aligned range is
+	 * completely swap entries with contiguous swap offsets.
+	 */
+	order = highest_order(orders);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
+			break;
+		order = next_order(&orders, order);
+	}
+
+	pte_unmap_unlock(pte, ptl);
+
+	/* Try allocating the highest of the remaining orders. */
+	gfp = vma_thp_gfp_mask(vma);
+	while (orders) {
+		addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+		folio = vma_alloc_folio(gfp, order, vma, addr, true);
+		if (folio)
+			return folio;
+		order = next_order(&orders, order);
+	}
+
+fallback:
+#endif
+	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, vmf->address, false);
+}
+
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -4074,35 +4220,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (!folio) {
 		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
 		    __swap_count(entry) == 1) {
-			/*
-			 * Prevent parallel swapin from proceeding with
-			 * the cache flag. Otherwise, another thread may
-			 * finish swapin first, free the entry, and swapout
-			 * reusing the same entry. It's undetectable as
-			 * pte_same() returns true due to entry reuse.
-			 */
-			if (swapcache_prepare(entry)) {
-				/* Relax a bit to prevent rapid repeated page faults */
-				schedule_timeout_uninterruptible(1);
-				goto out;
-			}
-			need_clear_cache = true;
-
 			/* skip swapcache */
-			folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
-						vma, vmf->address, false);
+			folio = alloc_swap_folio(vmf);
 			page = &folio->page;
 			if (folio) {
 				__folio_set_locked(folio);
 				__folio_set_swapbacked(folio);
 
+				nr_pages = folio_nr_pages(folio);
+				if (folio_test_large(folio))
+					entry.val = ALIGN_DOWN(entry.val, nr_pages);
+				/*
+				 * Prevent parallel swapin from proceeding with
+				 * the cache flag. Otherwise, another thread may
+				 * finish swapin first, free the entry, and swapout
+				 * reusing the same entry. It's undetectable as
+				 * pte_same() returns true due to entry reuse.
+				 */
+				if (swapcache_prepare_nr(entry, nr_pages)) {
+					/* Relax a bit to prevent rapid repeated page faults */
+					schedule_timeout_uninterruptible(1);
+					goto out_page;
+				}
+				need_clear_cache = true;
+
 				if (mem_cgroup_swapin_charge_folio(folio,
 							vma->vm_mm, GFP_KERNEL,
 							entry)) {
 					ret = VM_FAULT_OOM;
 					goto out_page;
 				}
-				mem_cgroup_swapin_uncharge_swap(entry);
+				mem_cgroup_swapin_uncharge_swap_nr(entry, nr_pages);
 
 				shadow = get_shadow_from_swap_cache(entry);
 				if (shadow)
@@ -4209,6 +4357,22 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_nomap;
 	}
 
+	/* allocated large folios for SWP_SYNCHRONOUS_IO */
+	if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
+		unsigned long nr = folio_nr_pages(folio);
+		unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+		unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
+		pte_t *folio_ptep = vmf->pte - idx;
+
+		if (!can_swapin_thp(vmf, folio_ptep, nr))
+			goto out_nomap;
+
+		page_idx = idx;
+		address = folio_start;
+		ptep = folio_ptep;
+		goto check_folio;
+	}
+
 	nr_pages = 1;
 	page_idx = 0;
 	address = vmf->address;
@@ -4340,11 +4504,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_add_lru_vma(folio, vma);
 	} else if (!folio_test_anon(folio)) {
 		/*
-		 * We currently only expect small !anon folios, which are either
-		 * fully exclusive or fully shared. If we ever get large folios
-		 * here, we have to be careful.
+		 * We currently only expect small !anon folios which are either
+		 * fully exclusive or fully shared, or new allocated large folios
+		 * which are fully exclusive. If we ever get large folios within
+		 * swapcache here, we have to be careful.
 		 */
-		VM_WARN_ON_ONCE(folio_test_large(folio));
+		VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
 		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
 	} else {
@@ -4387,7 +4552,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 out:
 	/* Clear the swap cache pin for direct swapin after PTL unlock */
 	if (need_clear_cache)
-		swapcache_clear(si, entry);
+		swapcache_clear_nr(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;
@@ -4403,7 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		folio_put(swapcache);
 	}
 	if (need_clear_cache)
-		swapcache_clear(si, entry);
+		swapcache_clear_nr(si, entry, nr_pages);
 	if (si)
 		put_swap_device(si);
 	return ret;