From patchwork Thu Nov 14 06:59:58 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 13874616
From: Qi Zheng
To: david@redhat.com, jannh@google.com, hughd@google.com, willy@infradead.org,
 muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org,
 peterx@redhat.com
Cc: mgorman@suse.de, catalin.marinas@arm.com, will@kernel.org,
 dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org,
 x86@kernel.org, lorenzo.stoakes@oracle.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, zokeefe@google.com, rientjes@google.com,
 Qi Zheng
Subject: [PATCH v3 7/9] mm: pgtable: try to reclaim empty PTE page in
 madvise(MADV_DONTNEED)
Date: Thu, 14 Nov 2024 14:59:58 +0800
Message-Id: <9e6f0cff7ae29cd8bd1812d3a0e3513de3f42f42.1731566457.git.zhengqi.arch@bytedance.com>
X-Mailer: git-send-email 2.24.3 (Apple Git-128)

Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory
allocators, such as jemalloc or tcmalloc. These memory allocators use
madvise(MADV_DONTNEED or MADV_FREE) to release physical memory, but
neither MADV_DONTNEED nor MADV_FREE releases page table memory, which
can lead to very large page table memory usage. The following is a
memory usage snapshot of one process, taken from one of our servers:

        VIRT:  55t
        RES:   590g
        VmPTE: 110g

In this case, most of the page table entries are empty. A PTE page in
which all entries are empty can be freed back to the system for others
to use.

As a first step, this commit synchronously frees empty PTE pages in the
madvise(MADV_DONTNEED) case. Empty PTE pages are detected and freed in
zap_pte_range(), and zap_details.reclaim_pt is added to exclude cases
other than madvise(MADV_DONTNEED).

Once an empty PTE page is detected, we first try to take the pmd lock
while still holding the pte lock. If that succeeds, we clear the pmd
entry directly (fast path). Otherwise, we wait until the pte lock is
released, then take the pmd lock and the pte lock again and loop
PTRS_PER_PTE times, checking pte_none() on each entry, to re-detect
whether the PTE page is empty and free it (slow path).

For other cases such as madvise(MADV_FREE), scanning and freeing empty
PTE pages asynchronously can be considered in the future.

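As a rough cross-check of the snapshot above, the amount of memory that
PTE pages can occupy follows from the size of the mapped address range.
The sketch below is not part of this patch. It assumes 4 KiB pages and
8-byte PTEs (512 entries per PTE page, as on x86-64), and the macro and
function names in it are hypothetical helpers introduced only for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions (not taken from the patch): 4 KiB pages, 8-byte PTEs. */
#define ASSUMED_PAGE_SIZE	4096ULL
#define ASSUMED_PTE_SIZE	8ULL
#define ASSUMED_PTRS_PER_PTE	(ASSUMED_PAGE_SIZE / ASSUMED_PTE_SIZE)

/* Address range covered by one PTE page: 512 * 4 KiB = 2 MiB. */
uint64_t bytes_mapped_per_pte_page(void)
{
	return ASSUMED_PTRS_PER_PTE * ASSUMED_PAGE_SIZE;
}

/*
 * Upper bound on the memory used by PTE pages when every PTE page
 * covering virt_bytes of address space stays allocated.
 */
uint64_t max_pte_page_bytes(uint64_t virt_bytes)
{
	uint64_t span = bytes_mapped_per_pte_page();
	uint64_t nr_pte_pages;

	assert(virt_bytes <= UINT64_MAX - (span - 1));	/* no overflow below */
	nr_pte_pages = (virt_bytes + span - 1) / span;	/* round up */

	return nr_pte_pages * ASSUMED_PAGE_SIZE;
}
```

Under these assumptions, and reading "55t" as 55 TiB,
max_pte_page_bytes(55ULL << 40) gives 110 GiB, which is in line with the
VmPTE value in the snapshot: nearly every PTE page in the range is still
allocated even though most of its entries are empty.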

The following code snippet can show the effect of optimization:

        mmap 50G
        while (1) {
                for (; i < 1024 * 25; i++) {
                        touch 2M memory
                        madvise MADV_DONTNEED 2M
                }
        }

As we can see, the memory usage of VmPTE is reduced:

                        before                          after
VIRT                   50.0 GB                        50.0 GB
RES                     3.1 MB                         3.1 MB
VmPTE                102640 KB                         240 KB

Signed-off-by: Qi Zheng
---
 include/linux/mm.h |  1 +
 mm/Kconfig         | 15 ++++++++++
 mm/Makefile        |  1 +
 mm/internal.h      | 19 +++++++++++++
 mm/madvise.c       |  7 ++++-
 mm/memory.c        | 45 ++++++++++++++++++++++++++++-
 mm/pt_reclaim.c    | 71 ++++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 157 insertions(+), 2 deletions(-)
 create mode 100644 mm/pt_reclaim.c

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca59d165f1f2e..1fcd4172d2c03 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2319,6 +2319,7 @@ extern void pagefault_out_of_memory(void);
 struct zap_details {
 	struct folio *single_folio;	/* Locked folio to be unmapped */
 	bool even_cows;			/* Zap COWed private pages too? */
+	bool reclaim_pt;		/* Need reclaim page tables? */
 	zap_flags_t zap_flags;		/* Extra flags for zapping */
 };
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 84000b0168086..7949ab121070f 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
 	  The architecture has hardware support for userspace shadow call
 	  stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
 
+config ARCH_SUPPORTS_PT_RECLAIM
+	def_bool n
+
+config PT_RECLAIM
+	bool "reclaim empty user page table pages"
+	default y
+	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
+	select MMU_GATHER_RCU_TABLE_FREE
+	help
+	  Try to reclaim empty user page table pages in paths other than munmap
+	  and exit_mmap path.
+
+	  Note: now only empty user PTE page table pages will be reclaimed.
+
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index dba52bb0da8ab..850386a67b3e0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -146,3 +146,4 @@ obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
 obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o
 obj-$(CONFIG_EXECMEM) += execmem.o
 obj-$(CONFIG_TMPFS_QUOTA) += shmem_quota.o
+obj-$(CONFIG_PT_RECLAIM) += pt_reclaim.o
diff --git a/mm/internal.h b/mm/internal.h
index 5a7302baeed7c..5b2aef61073f1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1530,4 +1530,23 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private);
 
+/* pt_reclaim.c */
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+	      pmd_t pmdval);
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		     struct mmu_gather *tlb);
+
+#ifdef CONFIG_PT_RECLAIM
+bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
+			   struct zap_details *details);
+#else
+static inline bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
+					 struct zap_details *details)
+{
+	return false;
+}
+#endif /* CONFIG_PT_RECLAIM */
+
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index 0ceae57da7dad..49f3a75046f63 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -851,7 +851,12 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
-	zap_page_range_single(vma, start, end - start, NULL);
+	struct zap_details details = {
+		.reclaim_pt = true,
+		.even_cows = true,
+	};
+
+	zap_page_range_single(vma, start, end - start, &details);
 	return 0;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 8b3348ff374ff..fe93b0648c430 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 static inline bool should_zap_cows(struct zap_details *details)
 {
 	/* By default, zap all pages */
-	if (!details)
+	if (!details || details->reclaim_pt)
 		return true;
 
 	/* Or, we zap COWed pages only if the caller wants to */
@@ -1698,6 +1698,30 @@ static inline int do_zap_pte_range(struct mmu_gather *tlb,
 				   details, rss);
 }
 
+static inline int count_pte_none(pte_t *pte, int nr)
+{
+	int none_nr = 0;
+
+	/*
+	 * If PTE_MARKER_UFFD_WP is enabled, the uffd-wp PTEs may be
+	 * re-installed, so we need to check pte_none() one by one.
+	 * Otherwise, checking a single PTE in a batch is sufficient.
+	 */
+#ifdef CONFIG_PTE_MARKER_UFFD_WP
+	for (;;) {
+		if (pte_none(ptep_get(pte)))
+			none_nr++;
+		if (--nr == 0)
+			break;
+		pte++;
+	}
+#else
+	if (pte_none(ptep_get(pte)))
+		none_nr = nr;
+#endif
+	return none_nr;
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
@@ -1709,6 +1733,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 	spinlock_t *ptl;
 	pte_t *start_pte;
 	pte_t *pte;
+	pmd_t pmdval;
+	unsigned long start = addr;
+	bool can_reclaim_pt = reclaim_pt_is_enabled(start, end, details);
+	bool direct_reclaim = false;
+	int none_nr = 0;
 	int nr;
 
 retry:
@@ -1726,6 +1755,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 		nr = skip_none_ptes(pte, addr, end);
 		if (nr) {
+			if (can_reclaim_pt)
+				none_nr += nr;
 			addr += PAGE_SIZE * nr;
 			if (addr == end)
 				break;
@@ -1734,12 +1765,17 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 		nr = do_zap_pte_range(tlb, vma, pte, addr, end, details,
 				      rss, &force_flush, &force_break);
+		if (can_reclaim_pt)
+			none_nr += count_pte_none(pte, nr);
 		if (unlikely(force_break)) {
 			addr += nr * PAGE_SIZE;
 			break;
 		}
 	} while (pte += nr, addr += PAGE_SIZE * nr, addr != end);
 
+	if (can_reclaim_pt && addr == end && (none_nr == PTRS_PER_PTE))
+		direct_reclaim = try_get_and_clear_pmd(mm, pmd, &pmdval);
+
 	add_mm_rss_vec(mm, rss);
 	arch_leave_lazy_mmu_mode();
 
@@ -1766,6 +1802,13 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		goto retry;
 	}
 
+	if (can_reclaim_pt) {
+		if (direct_reclaim)
+			free_pte(mm, start, tlb, pmdval);
+		else
+			try_to_free_pte(mm, pmd, start, tlb);
+	}
+
 	return addr;
 }
 
diff --git a/mm/pt_reclaim.c b/mm/pt_reclaim.c
new file mode 100644
index 0000000000000..6540a3115dde8
--- /dev/null
+++ b/mm/pt_reclaim.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+
+#include "internal.h"
+
+bool reclaim_pt_is_enabled(unsigned long start, unsigned long end,
+			   struct zap_details *details)
+{
+	return details && details->reclaim_pt && (end - start >= PMD_SIZE);
+}
+
+bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval)
+{
+	spinlock_t *pml = pmd_lockptr(mm, pmd);
+
+	if (!spin_trylock(pml))
+		return false;
+
+	*pmdval = pmdp_get_lockless(pmd);
+	pmd_clear(pmd);
+	spin_unlock(pml);
+
+	return true;
+}
+
+void free_pte(struct mm_struct *mm, unsigned long addr, struct mmu_gather *tlb,
+	      pmd_t pmdval)
+{
+	pte_free_tlb(tlb, pmd_pgtable(pmdval), addr);
+	mm_dec_nr_ptes(mm);
+}
+
+void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
+		     struct mmu_gather *tlb)
+{
+	pmd_t pmdval;
+	spinlock_t *pml, *ptl;
+	pte_t *start_pte, *pte;
+	int i;
+
+	pml = pmd_lock(mm, pmd);
+	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
+	if (!start_pte)
+		goto out_ptl;
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+
+	/* Check if it is empty PTE page */
+	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
+		if (!pte_none(ptep_get(pte)))
+			goto out_ptl;
+	}
+	pte_unmap(start_pte);
+
+	pmd_clear(pmd);
+
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
+
+	free_pte(mm, addr, tlb, pmdval);
+
+	return;
+out_ptl:
+	if (start_pte)
+		pte_unmap_unlock(start_pte, ptl);
+	if (ptl != pml)
+		spin_unlock(pml);
+}
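
The test loop outlined in the commit message can be sketched as a
user-space C helper. This is a minimal sketch, not the program that
produced the numbers in the commit message: touch_and_dontneed() and
CHUNK_SIZE are hypothetical names, it runs a single pass instead of
looping forever, and rounding the start of the range up to a 2M boundary
is an assumption made here so that each madvise() call covers a whole
PTE page on configurations where PMD_SIZE is 2M.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define CHUNK_SIZE	(2UL << 20)	/* 2M, as in the commit message */

/*
 * Map `total` bytes of anonymous memory, then touch and MADV_DONTNEED
 * it one 2M chunk at a time. Returns 0 on success, -1 on failure.
 */
int touch_and_dontneed(size_t total)
{
	size_t len = total + CHUNK_SIZE;	/* slack for 2M alignment */
	uintptr_t aligned;
	char *map, *base;
	size_t off;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (map == MAP_FAILED)
		return -1;

	/* Round the start up to a 2M boundary (assumption, see above). */
	aligned = ((uintptr_t)map + CHUNK_SIZE - 1) &
		  ~(uintptr_t)(CHUNK_SIZE - 1);
	base = (char *)aligned;
	assert(((uintptr_t)base & (CHUNK_SIZE - 1)) == 0);

	for (off = 0; off + CHUNK_SIZE <= total; off += CHUNK_SIZE) {
		memset(base + off, 1, CHUNK_SIZE);	/* touch 2M memory */
		if (madvise(base + off, CHUNK_SIZE, MADV_DONTNEED) != 0) {
			munmap(map, len);
			return -1;
		}
	}

	return munmap(map, len);
}
```

On a 64-bit system, touch_and_dontneed(50UL << 30) performs one pass of
the 1024 * 25 iterations from the commit message. VmPTE has to be sampled
from /proc/<pid>/status while the loop is still running, because the
final munmap() frees the page tables with or without this patch.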