From patchwork Sat Aug 31 09:23:39 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Barry Song <21cnbao@gmail.com>
X-Patchwork-Id: 13785993
From: Barry Song <21cnbao@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Barry Song, Chuanhua Han, Baolin Wang,
 Ryan Roberts, Zi Yan, David Hildenbrand, Chris Li, Kairui Song,
 Kalesh Singh, Suren Baghdasaryan
Subject: [PATCH RFC] mm: entirely reuse the whole anon mTHP in do_wp_page
Date: Sat, 31 Aug 2024 21:23:39 +1200
Message-Id: <20240831092339.66085-1-21cnbao@gmail.com>
X-Mailer: git-send-email 2.39.3 (Apple Git-146)

From: Barry Song

On a physical phone, it's sometimes observed that deferred_split mTHPs
account for over 15% of the total mTHPs. Profiling by Chuanhua indicates
that the majority of these originate from the typical fork scenario: when
the child process either execs or exits, the parent process should ideally
be able to reuse the entire mTHP. However, the current kernel lacks this
capability; instead it places the mTHP on the deferred_split list and
performs CoW (Copy-on-Write) on just a single subpage of the mTHP.

 main()
 {
 #define SIZE 1024 * 1024UL
         void *p = malloc(SIZE);

         memset(p, 0x11, SIZE);
         if (fork() == 0)
                 exec(....);

         /*
          * this will trigger cow one subpage from
          * mTHP and put mTHP into split_deferred
          * list
          */
         *(int *)(p + 10) = 10;
         printf("done\n");
         while(1);
 }

This leads to two significant issues:

* Memory waste: before the mTHP is fully split by the shrinker, it wastes
  memory. In the extreme case of a 64KB mTHP, usage can be 64KB + 60KB
  until the last subpage is written, at which point the mTHP is freed.

* Fragmentation and performance loss: it destroys large folios (negating
  the performance benefits of CONT-PTE) and fragments memory.

To address this, we should aim to reuse the entire mTHP in such cases.

Hi David,
I've renamed wp_page_reuse() to wp_folio_reuse() and added an
entirely_reuse argument because I'm not sure whether there are still
cases where we reuse only a subpage of an mTHP. For now, I'm setting
entirely_reuse to true only for the newly supported case, while all
other cases still pass false. Please let me know if this is incorrect;
if we never reuse subpages, we can drop the argument.

Hi Ryan,
Ideally, I'd like ptep_set_access_flags_nr() to support setting write
permission for the entire mTHP. Since we don't currently have this
capability, I'm doing it in a rather inefficient way, setting the
permissions one PTE at a time, which involves redundant unfolding and
folding of CONT-PTEs. I wonder if we could collaborate on providing a
batched ptep_set_access_flags_nr().
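For reference, the sketch below pulls the per-PTE fallback used by
wp_folio_reuse() in the diff out into a standalone helper, to show what a
batched, arch-provided ptep_set_access_flags_nr() would replace. The helper
name wp_set_ptes_writable() is illustrative only and not part of the patch.

 /*
  * Illustrative only: the per-PTE loop wp_folio_reuse() open-codes in
  * this patch, shown as a standalone helper. A batched, arch-level
  * ptep_set_access_flags_nr() could do the same work while unfolding
  * and refolding a CONT-PTE range only once instead of per PTE.
  */
 static void wp_set_ptes_writable(struct vm_fault *vmf, unsigned long start,
                                  pte_t *ptep, int nr)
 {
         struct vm_area_struct *vma = vmf->vma;

         for (int i = 0; i < nr; i++) {
                 unsigned long addr = start + i * PAGE_SIZE;
                 pte_t entry = ptep_get(ptep + i);

                 entry = pte_mkyoung(entry);
                 entry = maybe_mkwrite(pte_mkdirty(entry), vma);
                 if (ptep_set_access_flags(vma, addr, ptep + i, entry, 1))
                         update_mmu_cache_range(vmf, vma, addr, ptep + i, 1);
         }
 }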
Cc: Chuanhua Han
Cc: Baolin Wang
Cc: Ryan Roberts
Cc: Zi Yan
Cc: David Hildenbrand
Cc: Chris Li
Cc: Kairui Song
Cc: Kalesh Singh
Cc: Suren Baghdasaryan
Signed-off-by: Barry Song
---
 mm/memory.c | 91 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 66 insertions(+), 25 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b95fce7d190f..c51980d14e41 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3205,18 +3205,26 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
 	return 0;
 }
 
-/*
+ /*
  * Handle write page faults for pages that can be reused in the current vma
  *
  * This can happen either due to the mapping being with the VM_SHARED flag,
  * or due to us being the last reference standing to the page. In either
  * case, all we need to do here is to mark the page as writable and update
  * any related book-keeping.
+ * If entirely_reuse is true, we are reusing the whole large folio; otherwise,
+ * we are reusing a subpage even though folio might be large one.
  */
-static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
+static inline void wp_folio_reuse(struct vm_fault *vmf, struct folio *folio,
+				  bool entirely_reuse)
 	__releases(vmf->ptl)
 {
+	unsigned long idx = entirely_reuse ? folio_page_idx(folio, vmf->page) : 0;
+	int nr = entirely_reuse ? folio_nr_pages(folio) : 1;
+	unsigned long start = vmf->address - idx * PAGE_SIZE;
+	unsigned long end = start + nr * PAGE_SIZE;
 	struct vm_area_struct *vma = vmf->vma;
+	pte_t *ptep = vmf->pte - idx;
 	pte_t entry;
 
 	VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE));
@@ -3233,11 +3241,15 @@ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
 		folio_xchg_last_cpupid(folio, (1 << LAST_CPUPID_SHIFT) - 1);
 	}
 
-	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
-	entry = pte_mkyoung(vmf->orig_pte);
-	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
-		update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+	flush_cache_range(vma, start, end);
+	for (int i = 0; i < nr; i++) {
+		entry = ptep_get(ptep + i);
+		entry = pte_mkyoung(entry);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (ptep_set_access_flags(vma, start + i * PAGE_SIZE,
+					  ptep + i, entry, 1))
+			update_mmu_cache_range(vmf, vma, start, ptep + i, 1);
+	}
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	count_vm_event(PGREUSE);
 }
@@ -3493,7 +3505,7 @@ static vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf, struct folio *folio
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return VM_FAULT_NOPAGE;
 	}
-	wp_page_reuse(vmf, folio);
+	wp_folio_reuse(vmf, folio, false);
 	return 0;
 }
 
@@ -3519,7 +3531,7 @@ static vm_fault_t wp_pfn_shared(struct vm_fault *vmf)
 			return ret;
 		return finish_mkwrite_fault(vmf, NULL);
 	}
-	wp_page_reuse(vmf, NULL);
+	wp_folio_reuse(vmf, NULL, false);
 	return 0;
 }
 
@@ -3554,7 +3566,7 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
 			return tmp;
 		}
 	} else {
-		wp_page_reuse(vmf, folio);
+		wp_folio_reuse(vmf, folio, false);
 		folio_lock(folio);
 	}
 	ret |= fault_dirty_shared_page(vmf);
@@ -3564,17 +3576,41 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
 }
 
 static bool wp_can_reuse_anon_folio(struct folio *folio,
-				    struct vm_area_struct *vma)
+				    struct vm_fault *vmf)
 {
+	struct vm_area_struct *vma = vmf->vma;
+	int nr = folio_nr_pages(folio);
+
 	/*
-	 * We could currently only reuse a subpage of a large folio if no
-	 * other subpages of the large folios are still mapped. However,
-	 * let's just consistently not reuse subpages even if we could
-	 * reuse in that scenario, and give back a large folio a bit
-	 * sooner.
+	 * reuse a large folio while it is entirely mapped and
+	 * exclusive (mapcount == folio_nr_pages)
 	 */
-	if (folio_test_large(folio))
-		return false;
+	if (folio_test_large(folio)) {
+		unsigned long folio_start, folio_end, idx;
+		unsigned long address = vmf->address;
+		pte_t *folio_ptep;
+		pte_t folio_pte;
+		if (folio_likely_mapped_shared(folio))
+			return false;
+
+		idx = folio_page_idx(folio, vmf->page);
+		folio_start = address - idx * PAGE_SIZE;
+		folio_end = folio_start + nr * PAGE_SIZE;
+
+		if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
+			return false;
+		if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
+			return false;
+		folio_ptep = vmf->pte - idx;
+		folio_pte = ptep_get(folio_ptep);
+		if (!pte_present(folio_pte) || pte_pfn(folio_pte) != folio_pfn(folio))
+			return false;
+		if (folio_pte_batch(folio, folio_start, folio_ptep, folio_pte, nr, 0,
+				    NULL, NULL, NULL) != nr)
+			return false;
+		if (folio_mapcount(folio) != nr)
+			return false;
+	}
 
 	/*
 	 * We have to verify under folio lock: these early checks are
@@ -3583,7 +3619,7 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
 	 *
 	 * KSM doesn't necessarily raise the folio refcount.
 	 */
-	if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
+	if (folio_test_ksm(folio) || folio_ref_count(folio) > 2 + nr)
 		return false;
 	if (!folio_test_lru(folio))
 		/*
@@ -3591,13 +3627,13 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
 		 * remote LRU caches or references to LRU folios.
 		 */
 		lru_add_drain();
-	if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
+	if (folio_ref_count(folio) > nr + folio_test_swapcache(folio))
 		return false;
 	if (!folio_trylock(folio))
 		return false;
 	if (folio_test_swapcache(folio))
 		folio_free_swap(folio);
-	if (folio_test_ksm(folio) || folio_ref_count(folio) != 1) {
+	if (folio_test_ksm(folio) || folio_ref_count(folio) != nr) {
 		folio_unlock(folio);
 		return false;
 	}
@@ -3639,6 +3675,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio = NULL;
+	int nr = 1;
 	pte_t pte;
 
 	if (likely(!unshare)) {
@@ -3702,14 +3739,18 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	 * the page without further checks.
 	 */
 	if (folio && folio_test_anon(folio) &&
-	    (PageAnonExclusive(vmf->page) || wp_can_reuse_anon_folio(folio, vma))) {
-		if (!PageAnonExclusive(vmf->page))
-			SetPageAnonExclusive(vmf->page);
+	    (PageAnonExclusive(vmf->page) || wp_can_reuse_anon_folio(folio, vmf))) {
+		/* this is the case we are going to reuse the entire folio */
+		if (!PageAnonExclusive(vmf->page)) {
+			nr = folio_nr_pages(folio);
+			for (int i = 0; i < nr; i++)
+				SetPageAnonExclusive(folio_page(folio, i));
+		}
 		if (unlikely(unshare)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return 0;
 		}
-		wp_page_reuse(vmf, folio);
+		wp_folio_reuse(vmf, folio, nr > 1);
 		return 0;
 	}
 	/*
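The reproducer in the commit message above is schematic (exec(....) is a
placeholder). Below is a self-contained user-space sketch of the scenario
this patch targets; the 64KB size and the MADV_HUGEPAGE hint are assumptions
about the test system's mTHP configuration, and whether an anon mTHP actually
backs the region depends on alignment and on the per-size settings under
/sys/kernel/mm/transparent_hugepage/.

 #include <stdio.h>
 #include <string.h>
 #include <unistd.h>
 #include <sys/mman.h>
 #include <sys/wait.h>

 #define SZ (64 * 1024UL)        /* assumes a 64KB mTHP size is enabled */

 int main(void)
 {
         /* anonymous private mapping; hint the kernel to use large folios */
         char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

         if (p == MAP_FAILED)
                 return 1;
         madvise(p, SZ, MADV_HUGEPAGE);
         memset(p, 0x11, SZ);            /* populate, ideally as one anon mTHP */

         if (fork() == 0)
                 _exit(0);               /* child exits: parent is sole user again */
         wait(NULL);

         /*
          * Without this patch, the write below CoWs a single subpage and the
          * mTHP is queued for deferred split; with it, the whole folio should
          * be reused in do_wp_page().
          */
         p[10] = 0x22;
         printf("done\n");
         return 0;
 }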