From patchwork Tue Feb 7 03:51:26 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130954
From: Chih-En Lin <shiyn.lin@gmail.com>
Subject: [PATCH v4 01/14] mm: Allow user to control COW PTE via prctl
Date: Tue, 7 Feb 2023 11:51:26 +0800
Message-Id: <20230207035139.272707-2-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Add a new prctl, PR_SET_COW_PTE, to allow the user to enable COW PTE.
Since there is a time gap between using the prctl to enable COW PTE and
doing the fork, we use two states (MMF_COW_PTE_READY and MMF_COW_PTE) to
distinguish a task that wants to do COW PTE from one that is already
doing it.

The MMF_COW_PTE_READY flag marks the task to do COW PTE at the next
fork(). During fork(), if MMF_COW_PTE_READY is set, fork() clears it and
sets the MMF_COW_PTE flag. After that, fork() may share PTE tables
instead of duplicating them.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/sched/coredump.h | 12 +++++++++++-
 include/uapi/linux/prctl.h     |  6 ++++++
 kernel/sys.c                   | 11 +++++++++++
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 8270ad7ae14c..570d599ebc85 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -83,7 +83,17 @@ static inline int get_dumpable(struct mm_struct *mm)
 #define MMF_HAS_PINNED  27  /* FOLL_PIN has run, never cleared */
 #define MMF_DISABLE_THP_MASK  (1 << MMF_DISABLE_THP)
 
+/*
+ * MMF_COW_PTE_READY: Mark the task to do COW PTE at the next fork().
+ * During fork(), if MMF_COW_PTE_READY is set, fork() clears the flag and
+ * sets the MMF_COW_PTE flag. After that, fork() may share PTE tables
+ * rather than duplicating them.
+ */
+#define MMF_COW_PTE_READY  29 /* Share PTE tables at the next fork() */
+#define MMF_COW_PTE        30 /* PTE tables are shared between processes */
+#define MMF_COW_PTE_MASK   (1 << MMF_COW_PTE)
+
 #define MMF_INIT_MASK  (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
-                        MMF_DISABLE_THP_MASK)
+                        MMF_DISABLE_THP_MASK | MMF_COW_PTE_MASK)
 
 #endif /* _LINUX_SCHED_COREDUMP_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index a5e06dcbba13..664a3c023019 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -284,4 +284,10 @@ struct prctl_mm_map {
 #define PR_SET_VMA  0x53564d41
 # define PR_SET_VMA_ANON_NAME  0
 
+/*
+ * Set the prepare flag, MMF_COW_PTE_READY, to share (copy-on-write) the
+ * page table at the next fork.
+ */
+#define PR_SET_COW_PTE  65
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 88b31f096fb2..eeab3093026f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2350,6 +2350,14 @@ static int prctl_set_vma(unsigned long opt, unsigned long start,
 }
 #endif /* CONFIG_ANON_VMA_NAME */
 
+static int prctl_set_cow_pte(struct mm_struct *mm)
+{
+    if (test_bit(MMF_COW_PTE, &mm->flags))
+        return -EINVAL;
+    set_bit(MMF_COW_PTE_READY, &mm->flags);
+    return 0;
+}
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
         unsigned long, arg4, unsigned long, arg5)
 {
@@ -2628,6 +2636,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
     case PR_SET_VMA:
         error = prctl_set_vma(arg2, arg3, arg4, arg5);
         break;
+    case PR_SET_COW_PTE:
+        error = prctl_set_cow_pte(me->mm);
+        break;
     default:
         error = -EINVAL;
         break;
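
[For reference, a minimal userspace sketch of the intended flow: enable COW
PTE, then fork. The PR_SET_COW_PTE value (65) comes from the uapi change
above; the fallback #define is only for building against older headers.]

#include <stdio.h>
#include <sys/prctl.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PR_SET_COW_PTE
#define PR_SET_COW_PTE 65 /* from the uapi change in this patch */
#endif

int main(void)
{
    /* Mark this mm with MMF_COW_PTE_READY; takes effect at the next fork(). */
    if (prctl(PR_SET_COW_PTE, 0, 0, 0, 0) < 0) {
        perror("prctl(PR_SET_COW_PTE)");
        return 1;
    }

    pid_t pid = fork(); /* this fork may now share PTE tables */
    if (pid == 0)
        _exit(0);
    waitpid(pid, NULL, 0);
    return 0;
}
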
From patchwork Tue Feb 7 03:51:27 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130955
From: Chih-En Lin <shiyn.lin@gmail.com>
Subject: [PATCH v4 02/14] mm: Add Copy-On-Write PTE to fork()
Date: Tue, 7 Feb 2023 11:51:27 +0800
Message-Id: <20230207035139.272707-3-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Add copy_cow_pte_range() and recover_pte_range() for copy-on-write (COW)
PTE in the fork system call. During a COW PTE fork, when processing the
shared PTE, we traverse all the entries to determine whether each mapped
page can be shared between processes. If the PTE can be shared, we
account those mapped pages and then share the PTE. However, once we find
a mapped page that is unavailable, e.g., a pinned page, we have to copy
it via copy_present_page(), which means we fall back to the default path,
page table copying (copy_pte_range()). And, since we may have already
processed some COW-ed PTE entries, before starting the default path, we
have to recover those entries.

All the COW PTE behaviors are protected by the pte lock. The logic of how
we handle nonpresent/present pte entries and errors in
copy_cow_pte_range() is the same as copy_pte_range(). But to keep the
code clean (e.g., avoiding conditional locks), we introduce new functions
instead of modifying copy_pte_range().

To track the lifetime of a COW-ed PTE, introduce a refcount for the PTE
table. We reuse the _refcount in the struct page of the page table to
maintain the number of process references to the COW-ed PTE table. Doing
the fork with COW PTE will increase the refcount. And, when someone
writes to the COW-ed PTE, it will cause a write fault to break COW PTE.
If the refcount of the COW-ed PTE is one, the process that triggers the
fault will reuse the COW-ed PTE. Otherwise, the process will decrease the
refcount and duplicate it.

Since we share the PTE between the parent and child, the state of the
parent's pte entries differs between COW PTE and the normal fork.
COW PTE handles all the pte entries on the child side, which means it
will clear the dirty and accessed bits of the parent's pte entries.

And, since some architectures, e.g., s390 and powerpc32, don't support
the PMD entry and PTE table operations, add a new Kconfig option,
COW_PTE. The COW_PTE config depends on (HAVE_ARCH_TRANSPARENT_HUGEPAGE &&
!PREEMPT_RT), the same condition as the TRANSPARENT_HUGEPAGE config,
since most of the operations in COW PTE depend on it.

Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/mm.h |  20 +++
 mm/Kconfig         |   9 ++
 mm/memory.c        | 303 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 332 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f857163ac89..22e1e5804e96 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2482,6 +2482,23 @@ static inline bool ptlock_init(struct page *page) { return true; }
 static inline void ptlock_free(struct page *page) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
+#ifdef CONFIG_COW_PTE
+static inline int pmd_get_pte(pmd_t *pmd)
+{
+    return page_ref_inc_return(pmd_page(*pmd));
+}
+
+static inline bool pmd_put_pte(pmd_t *pmd)
+{
+    return page_ref_add_unless(pmd_page(*pmd), -1, 1);
+}
+
+static inline int cow_pte_count(pmd_t *pmd)
+{
+    return page_count(pmd_page(*pmd));
+}
+#endif
+
 static inline void pgtable_init(void)
 {
     ptlock_cache_init();
@@ -2494,6 +2511,9 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
         return false;
     __SetPageTable(page);
     inc_lruvec_page_state(page, NR_PAGETABLE);
+#ifdef CONFIG_COW_PTE
+    set_page_count(page, 1);
+#endif
     return true;
 }
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..7dcceeb4196b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -822,6 +822,15 @@ config READ_ONLY_THP_FOR_FS
 
 endif # TRANSPARENT_HUGEPAGE
 
+menuconfig COW_PTE
+	bool "Copy-on-write PTE table"
+	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
+	help
+	  Extend the copy-on-write (COW) mechanism to the PTE table
+	  (the bottom level of the page-table hierarchy). To enable this
+	  feature, a process must set prctl(PR_SET_COW_PTE) before the
+	  fork system call.
+
 #
 # UP and nommu archs use km based percpu allocator
 #
diff --git a/mm/memory.c b/mm/memory.c
index 3e836fecd035..7d2a1d24db56 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -739,11 +739,17 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
         pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *dst_vma,
         struct vm_area_struct *src_vma, unsigned long addr, int *rss)
 {
+    /* With COW PTE, dst_vma is src_vma. */
     unsigned long vm_flags = dst_vma->vm_flags;
     pte_t pte = *src_pte;
     struct page *page;
     swp_entry_t entry = pte_to_swp_entry(pte);
 
+    /*
+     * If it's COW PTE, the parent shares the PTE with the child, so the
+     * following modifications of the child will also affect the parent.
+     */
+
     if (likely(!non_swap_entry(entry))) {
         if (swap_duplicate(entry) < 0)
             return -EIO;
@@ -886,6 +892,8 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 /*
  * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
  * is required to copy this pte.
+ * However, if prealloc is NULL, it is COW PTE. We should return and fall back
+ * to copying the PTE table.
  */
 static inline int
 copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
@@ -909,6 +917,14 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
     if (unlikely(page_try_dup_anon_rmap(page, false, src_vma))) {
         /* Page maybe pinned, we have to copy. */
         put_page(page);
+        /*
+         * If prealloc is NULL, we are processing a shared page
+         * table (COW PTE, in copy_cow_pte_range()). We cannot
+         * call copy_present_page() right now; instead, we
+         * should fall back to copy_pte_range().
+         */
+        if (!prealloc)
+            return -EAGAIN;
         return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
                      addr, rss, prealloc, page);
     }
@@ -929,6 +945,11 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
     }
     VM_BUG_ON(page && PageAnon(page) && PageAnonExclusive(page));
 
+    /*
+     * If it's COW PTE, the parent shares the PTE with the child,
+     * which means the following will also affect the parent.
+     */
+
     /*
      * If it's a shared mapping, mark it clean in
      * the child
@@ -937,6 +958,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
         pte = pte_mkclean(pte);
     pte = pte_mkold(pte);
 
+    /* For COW PTE, dst_vma is still src_vma. */
     if (!userfaultfd_wp(dst_vma))
         pte = pte_clear_uffd_wp(pte);
 
@@ -963,6 +985,8 @@ page_copy_prealloc(struct mm_struct *src_mm, struct vm_area_struct *vma,
     return new_page;
 }
 
+
+/* copy_pte_range() will immediately allocate a new page table. */
 static int
 copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
            pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -1087,6 +1111,227 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
     return ret;
 }
 
+#ifdef CONFIG_COW_PTE
+/*
+ * copy_cow_pte_range() will try to share the page table with the child.
+ * The logic of non-present, present, and error handling is the same as
+ * copy_pte_range() but dst_vma and dst_pte are src_vma and src_pte.
+ *
+ * We cannot preserve soft-dirty information, because the PTE will be
+ * shared between multiple processes.
+ */
+static int
+copy_cow_pte_range(struct vm_area_struct *dst_vma,
+           struct vm_area_struct *src_vma,
+           pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+           unsigned long end, unsigned long *recover_end)
+{
+    struct mm_struct *dst_mm = dst_vma->vm_mm;
+    struct mm_struct *src_mm = src_vma->vm_mm;
+    struct vma_iterator vmi;
+    struct vm_area_struct *curr = src_vma;
+    pte_t *src_pte, *orig_src_pte;
+    spinlock_t *src_ptl;
+    int ret = 0;
+    int rss[NR_MM_COUNTERS];
+    swp_entry_t entry = (swp_entry_t){0};
+    unsigned long vm_end, orig_addr = addr;
+    pgtable_t pte_table = pmd_pgtable(*src_pmd);
+
+    end = (addr + PMD_SIZE) & PMD_MASK;
+    addr = addr & PMD_MASK;
+
+    /*
+     * Increase the refcount to prevent the parent's PTE from being
+     * dropped/reused. Only increase the refcount the first time it
+     * is attached.
+     */
+    src_ptl = pte_lockptr(src_mm, src_pmd);
+    spin_lock(src_ptl);
+    pmd_get_pte(src_pmd);
+    pmd_install(dst_mm, dst_pmd, &pte_table);
+    spin_unlock(src_ptl);
+
+    /*
+     * We should handle all of the entries in this PTE at this traversal,
+     * since we cannot promise that the next vma will not do the lazy fork.
+     * The lazy fork will skip the copying, which may cause an incomplete
+     * state of the COW-ed PTE.
+     */
+    vma_iter_init(&vmi, src_mm, addr);
+    for_each_vma_range(vmi, curr, end) {
+        vm_end = min(end, curr->vm_end);
+        addr = max(addr, curr->vm_start);
+
+        /* We don't share the PTE with VM_DONTCOPY. */
+        if (curr->vm_flags & VM_DONTCOPY) {
+            *recover_end = addr;
+            return -EAGAIN;
+        }
+again:
+        init_rss_vec(rss);
+        src_pte = pte_offset_map(src_pmd, addr);
+        src_ptl = pte_lockptr(src_mm, src_pmd);
+        orig_src_pte = src_pte;
+        spin_lock(src_ptl);
+        arch_enter_lazy_mmu_mode();
+
+        do {
+            if (pte_none(*src_pte))
+                continue;
+            if (unlikely(!pte_present(*src_pte))) {
+                /*
+                 * Although the parent's PTE is COW-ed, we
+                 * still need to handle all the swap cases.
+                 */
+                ret = copy_nonpresent_pte(dst_mm, src_mm,
+                              src_pte, src_pte,
+                              curr, curr,
+                              addr, rss);
+                if (ret == -EIO) {
+                    entry = pte_to_swp_entry(*src_pte);
+                    break;
+                } else if (ret == -EBUSY) {
+                    break;
+                } else if (!ret)
+                    continue;
+                /*
+                 * Device exclusive entry restored, continue by
+                 * copying the now present pte.
+                 */
+                WARN_ON_ONCE(ret != -ENOENT);
+            }
+            /*
+             * copy_present_pte() will determine whether the mapped
+             * page should be a COW mapping or not.
+             */
+            ret = copy_present_pte(curr, curr, src_pte, src_pte,
+                           addr, rss, NULL);
+            /*
+             * If we need a pre-allocated page for this pte,
+             * drop the lock, recover all the entries, fall
+             * back to copy_pte_range(), and try again.
+             */
+            if (unlikely(ret == -EAGAIN))
+                break;
+        } while (src_pte++, addr += PAGE_SIZE, addr != vm_end);
+
+        arch_leave_lazy_mmu_mode();
+        add_mm_rss_vec(dst_mm, rss);
+        spin_unlock(src_ptl);
+        pte_unmap(orig_src_pte);
+        cond_resched();
+
+        if (ret == -EIO) {
+            VM_WARN_ON_ONCE(!entry.val);
+            if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+                ret = -ENOMEM;
+                goto out;
+            }
+            entry.val = 0;
+        } else if (ret == -EBUSY) {
+            goto out;
+        } else if (ret == -EAGAIN) {
+            /*
+             * We have to allocate the page immediately, but first
+             * we should recover the processed entries and fall
+             * back to copy_pte_range().
+             */
+            *recover_end = addr;
+            return -EAGAIN;
+        } else if (ret) {
+            VM_WARN_ON_ONCE(1);
+        }
+
+        /* We've captured and resolved the error. Reset, try again. */
+        ret = 0;
+        if (addr != vm_end)
+            goto again;
+    }
+
+out:
+    /*
+     * All the pte entries are available for COW mapping.
+     * Now, we can share with the child (COW PTE).
+     */
+    pmdp_set_wrprotect(src_mm, orig_addr, src_pmd);
+    set_pmd_at(dst_mm, orig_addr, dst_pmd, pmd_wrprotect(*src_pmd));
+
+    return ret;
+}
+
+/* When recovering the pte entries, we should hold the locks entirely. */
+static int
+recover_pte_range(struct vm_area_struct *dst_vma,
+          struct vm_area_struct *src_vma,
+          pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long end)
+{
+    struct mm_struct *dst_mm = dst_vma->vm_mm;
+    struct mm_struct *src_mm = src_vma->vm_mm;
+    struct vma_iterator vmi;
+    struct vm_area_struct *curr = src_vma;
+    pte_t *orig_src_pte, *orig_dst_pte;
+    pte_t *src_pte, *dst_pte;
+    spinlock_t *src_ptl, *dst_ptl;
+    unsigned long vm_end, addr = end & PMD_MASK;
+    int ret = 0;
+
+    /* Before we allocate the new PTE, clear the entry. */
+    mm_dec_nr_ptes(dst_mm);
+    pmd_clear(dst_pmd);
+    if (pte_alloc(dst_mm, dst_pmd))
+        return -ENOMEM;
+
+    /*
+     * Traverse all the vmas that cover this PTE table until
+     * the end of the recover address (the unshareable page).
+     */
+    vma_iter_init(&vmi, src_mm, addr);
+    for_each_vma_range(vmi, curr, end) {
+        vm_end = min(end, curr->vm_end);
+        addr = max(addr, curr->vm_start);
+
+        orig_dst_pte = dst_pte = pte_offset_map(dst_pmd, addr);
+        dst_ptl = pte_lockptr(dst_mm, dst_pmd);
+        spin_lock(dst_ptl);
+
+        orig_src_pte = src_pte = pte_offset_map(src_pmd, addr);
+        src_ptl = pte_lockptr(src_mm, src_pmd);
+        spin_lock(src_ptl);
+        arch_enter_lazy_mmu_mode();
+
+        do {
+            if (pte_none(*src_pte))
+                continue;
+            /*
+             * The COW mapping details (e.g., PageAnonExclusive)
+             * should already be handled by copy_cow_pte_range().
+             * We can simply set the entry for the child.
+             */
+            set_pte_at(dst_mm, addr, dst_pte, *src_pte);
+        } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+
+        arch_leave_lazy_mmu_mode();
+        spin_unlock(src_ptl);
+        pte_unmap(orig_src_pte);
+
+        spin_unlock(dst_ptl);
+        pte_unmap(orig_dst_pte);
+    }
+    /*
+     * After recovering the entries, release the holding from the child.
+     * The parent may still share with others, so don't make it writeable.
+     */
+    spin_lock(src_ptl);
+    pmd_put_pte(src_pmd);
+    spin_unlock(src_ptl);
+
+    cond_resched();
+
+    return ret;
+}
+#endif /* CONFIG_COW_PTE */
+
 static inline int
 copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
            pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
@@ -1115,6 +1360,64 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
             continue;
             /* fall through */
         }
+
+#ifdef CONFIG_COW_PTE
+        /*
+         * If MMF_COW_PTE is set, copy_pte_range() will try to share
+         * the PTE page table first. In other words, it attempts to
+         * do COW on the PTE (and the mapped pages). However, if there
+         * is any unshareable page (e.g., a pinned page or a device
+         * private page), it will fall back to the default path, which
+         * copies the page table immediately.
+         * In such a case, it stores the address of the first
+         * unshareable page in recover_end, then goes back to the
+         * beginning of the PTE and recovers the COW-ed PTE entries
+         * until it meets the same unshareable page again. During the
+         * recovery, because the COW-ed PTE entries are logically the
+         * same as a COW mapping, it only needs to allocate the new
+         * PTE and set the COW-ed PTE entries in the new PTE (which
+         * will be the same as a COW mapping).
+         */
+        if (test_bit(MMF_COW_PTE, &src_mm->flags)) {
+            unsigned long recover_end = 0;
+            int ret;
+
+            /*
+             * Setting wrprotect on a pmd entry with a normal PTE
+             * will trigger pmd_bad(). Skip the bad check here.
+             */
+            if (pmd_none(*src_pmd))
+                continue;
+            /* Skip if the PTE already did COW PTE this time. */
+            if (!pmd_none(*dst_pmd) && !pmd_write(*dst_pmd))
+                continue;
+
+            ret = copy_cow_pte_range(dst_vma, src_vma,
+                         dst_pmd, src_pmd,
+                         addr, next, &recover_end);
+            if (!ret) {
+                /* COW PTE succeeded. */
+                continue;
+            } else if (ret == -EAGAIN) {
+                /* Fall back to the normal copy method. */
+                if (recover_pte_range(dst_vma, src_vma,
+                              dst_pmd, src_pmd,
+                              recover_end))
+                    return -ENOMEM;
+                /*
+                 * Since we processed all the entries of the PTE
+                 * table, recover_end may not be in the src_vma.
+                 * If we already handled the src_vma, skip it.
+                 */
+                if (!range_in_vma(src_vma, recover_end,
+                          recover_end + PAGE_SIZE))
+                    continue;
+                else
+                    addr = recover_end;
+                /* fall through */
+            } else if (ret)
+                return -ENOMEM;
+        }
+#endif /* CONFIG_COW_PTE */
         if (pmd_none_or_clear_bad(src_pmd))
             continue;
         if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
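
[To make the refcount lifecycle described in this patch easier to follow,
here is a small userspace toy model, not kernel code and not part of the
series: one integer stands in for the page-table page's _refcount,
cow_pte_fork() models fork() taking a reference, and break_cow_pte()
models the write-fault path that reuses the table at refcount 1 and
otherwise drops a reference and duplicates.]

#include <assert.h>
#include <stdio.h>

/* Toy stand-in for the PTE-table page's _refcount. */
struct pte_table { int refcount; };

/* fork() with COW PTE: one more mm references the shared table. */
static void cow_pte_fork(struct pte_table *t) { t->refcount++; }

/* Write fault breaking COW PTE: reuse if we are the last user,
 * otherwise drop our reference and duplicate. */
static const char *break_cow_pte(struct pte_table *t)
{
    if (t->refcount == 1)
        return "reuse COW-ed PTE";
    t->refcount--;
    return "duplicate into a new PTE";
}

int main(void)
{
    struct pte_table t = { .refcount = 1 }; /* set_page_count(page, 1) */

    cow_pte_fork(&t);                       /* parent forks a child */
    assert(t.refcount == 2);
    printf("child write fault: %s\n", break_cow_pte(&t));  /* duplicate */
    printf("parent write fault: %s\n", break_cow_pte(&t)); /* reuse */
    return 0;
}
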
From patchwork Tue Feb 7 03:51:28 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130956
From: Chih-En Lin <shiyn.lin@gmail.com>
Subject: [PATCH v4 03/14] mm: Add break COW PTE fault and helper functions
Date: Tue, 7 Feb 2023 11:51:28 +0800
Message-Id: <20230207035139.272707-4-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Add the function handle_cow_pte_fault() to break (unshare) a COW-ed PTE
on a page fault that will modify the PTE table or a mapped page residing
in the COW-ed PTE (i.e., a write, unshare, or file read fault).

When breaking COW PTE, it first checks the COW-ed PTE's refcount to try
to reuse it. If the COW-ed PTE cannot be reused, it allocates a new PTE
and duplicates all the pte entries in the COW-ed PTE. Moreover, it
flushes the TLB when we change the write protection of the PTE.

In addition, provide the helper functions break_cow_pte{,_range}() to
let other features (remap, THP, migration, swapfile, etc.) use them.
Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/mm.h      |  17 ++
 include/linux/pgtable.h |   6 +
 mm/memory.c             | 339 +++++++++++++++++++++++++++++++++++++++-
 mm/mmap.c               |   4 +
 mm/mremap.c             |   2 +
 mm/swapfile.c           |   2 +
 6 files changed, 363 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 22e1e5804e96..369355e13936 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2020,6 +2020,23 @@ void pagecache_isize_extended(struct inode *inode, loff_t from, loff_t to);
 void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
 int generic_error_remove_page(struct address_space *mapping, struct page *page);
 
+#ifdef CONFIG_COW_PTE
+int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr);
+int break_cow_pte_range(struct vm_area_struct *vma, unsigned long start,
+            unsigned long end);
+#else
+static inline int break_cow_pte(struct vm_area_struct *vma,
+                pmd_t *pmd, unsigned long addr)
+{
+    return 0;
+}
+static inline int break_cow_pte_range(struct vm_area_struct *vma,
+                      unsigned long start, unsigned long end)
+{
+    return 0;
+}
+#endif
+
 #ifdef CONFIG_MMU
 extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
                   unsigned long address, unsigned int flags,
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1159b25b0542..72ff2a1cee5e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1406,6 +1406,12 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
     if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
         (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
         return 1;
+    /*
+     * A COW-ed PTE has write protection, which can trigger pmd_bad().
+     * To avoid this, return here if the entry is write-protected.
+     */
+    if (!pmd_write(pmdval))
+        return 0;
     if (unlikely(pmd_bad(pmdval))) {
         pmd_clear_bad(pmd);
         return 1;
diff --git a/mm/memory.c b/mm/memory.c
index 7d2a1d24db56..465742c6efa2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -192,6 +192,36 @@ static inline void free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
     pmd = pmd_offset(pud, addr);
     do {
         next = pmd_addr_end(addr, end);
+#ifdef CONFIG_COW_PTE
+        /*
+         * For a COW-ed PTE, the pte entries still map to pages.
+         * However, we have already done the de-accounting for all
+         * of them. So, even if the refcount is not the same as
+         * when zapping, we can still fall back to a normal PTE and
+         * handle it without traversing the entries to do the
+         * de-accounting.
+         */
+        if (test_bit(MMF_COW_PTE, &tlb->mm->flags)) {
+            if (!pmd_none(*pmd) && !pmd_write(*pmd)) {
+                spinlock_t *ptl = pte_lockptr(tlb->mm, pmd);
+
+                spin_lock(ptl);
+                if (!pmd_put_pte(pmd)) {
+                    pmd_t new = pmd_mkwrite(*pmd);
+
+                    set_pmd_at(tlb->mm, addr, pmd, new);
+                    spin_unlock(ptl);
+                    free_pte_range(tlb, pmd, addr);
+                    continue;
+                }
+                spin_unlock(ptl);
+
+                pmd_clear(pmd);
+                mm_dec_nr_ptes(tlb->mm);
+                tlb_flush_pmd_range(tlb, addr, PAGE_SIZE);
+            } else
+                VM_WARN_ON(cow_pte_count(pmd) != 1);
+        }
+#endif
         if (pmd_none_or_clear_bad(pmd))
             continue;
         free_pte_range(tlb, pmd, addr);
@@ -1654,6 +1684,29 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
     pte_t *start_pte;
     pte_t *pte;
     swp_entry_t entry;
+    bool pte_is_shared = false;
+
+#ifdef CONFIG_COW_PTE
+    if (test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd)) {
+        if (!range_in_vma(vma, addr & PMD_MASK,
+                  (addr + PMD_SIZE) & PMD_MASK)) {
+            /*
+             * We cannot promise that this COW-ed PTE will also
+             * be zapped with the rest of the VMAs. So, break
+             * COW PTE here.
+             */
+            break_cow_pte(vma, pmd, addr);
+        } else {
+            start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+            if (cow_pte_count(pmd) == 1) {
+                /* Reuse the COW-ed PTE. */
+                pmd_t new = pmd_mkwrite(*pmd);
+                set_pmd_at(tlb->mm, addr, pmd, new);
+            } else
+                pte_is_shared = true;
+            pte_unmap_unlock(start_pte, ptl);
+        }
+    }
+#endif
 
     tlb_change_page_size(tlb, PAGE_SIZE);
 again:
@@ -1678,11 +1731,15 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
             page = vm_normal_page(vma, addr, ptent);
             if (unlikely(!should_zap_page(details, page)))
                 continue;
-            ptent = ptep_get_and_clear_full(mm, addr, pte,
-                            tlb->fullmm);
+            if (pte_is_shared)
+                ptent = *pte;
+            else
+                ptent = ptep_get_and_clear_full(mm, addr, pte,
+                                tlb->fullmm);
             tlb_remove_tlb_entry(tlb, pte, addr);
-            zap_install_uffd_wp_if_needed(vma, addr, pte, details,
-                              ptent);
+            if (!pte_is_shared)
+                zap_install_uffd_wp_if_needed(vma, addr, pte,
+                                  details, ptent);
             if (unlikely(!page))
                 continue;
@@ -1754,8 +1811,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
             /* We should have covered all the swap entry types */
             WARN_ON_ONCE(1);
         }
-        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
-        zap_install_uffd_wp_if_needed(vma, addr, pte, details, ptent);
+
+        if (!pte_is_shared) {
+            pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+            zap_install_uffd_wp_if_needed(vma, addr, pte,
+                              details, ptent);
+        }
     } while (pte++, addr += PAGE_SIZE, addr != end);
 
     add_mm_rss_vec(mm, rss);
@@ -2143,6 +2204,8 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
     if (retval)
         goto out;
     retval = -ENOMEM;
+    if (break_cow_pte(vma, NULL, addr))
+        goto out;
     pte = get_locked_pte(vma->vm_mm, addr, &ptl);
     if (!pte)
         goto out;
@@ -2402,6 +2465,9 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
     pte_t *pte, entry;
     spinlock_t *ptl;
 
+    if (break_cow_pte(vma, NULL, addr))
+        return VM_FAULT_OOM;
+
     pte = get_locked_pte(mm, addr, &ptl);
     if (!pte)
         return VM_FAULT_OOM;
@@ -2779,6 +2845,10 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
     BUG_ON(addr >= end);
     pfn -= addr >> PAGE_SHIFT;
     pgd = pgd_offset(mm, addr);
+
+    if (break_cow_pte_range(vma, addr, end))
+        return -ENOMEM;
+
     flush_cache_range(vma, addr, end);
     do {
         next = pgd_addr_end(addr, end);
@@ -5159,6 +5229,233 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
     return VM_FAULT_FALLBACK;
 }
 
+#ifdef CONFIG_COW_PTE
+/* Break (unshare) COW PTE */
+static vm_fault_t handle_cow_pte_fault(struct vm_fault *vmf)
+{
+    struct vm_area_struct *vma = vmf->vma;
+    struct mm_struct *mm = vma->vm_mm;
+    pmd_t *pmd = vmf->pmd;
+    unsigned long start, end, addr = vmf->address;
+    struct mmu_notifier_range range;
+    pmd_t cowed_entry;
+    pte_t *orig_dst_pte, *orig_src_pte;
+    pte_t *dst_pte, *src_pte;
+    spinlock_t *dst_ptl, *src_ptl;
+    int ret = 0;
+
+    /*
+     * Do nothing with a fault that doesn't have a PTE yet
+     * (from lazy fork).
+     */
+    if (pmd_none(*pmd) || pmd_write(*pmd))
+        return 0;
+    /* COW PTE doesn't handle huge pages. */
+    if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+        return 0;
+
+    mmap_assert_write_locked(mm);
+
+    start = addr & PMD_MASK;
+    end = (addr + PMD_SIZE) & PMD_MASK;
+    addr = start;
+
+    mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
+                0, vma, mm, start, end);
+    /*
+     * Because the address range covers the whole PTE, not only the
+     * faulted vma, there might be some mismatched situations since
+     * the mmu notifier will only register the faulted vma.
+     * Do we really need to care about this kind of mismatch?
+     */
+    mmu_notifier_invalidate_range_start(&range);
+    raw_write_seqcount_begin(&mm->write_protect_seq);
+
+    /*
+     * Fast path: if we are the only faulted task that references
+     * this COW-ed PTE, reuse it.
+     */
+    src_pte = pte_offset_map_lock(mm, pmd, addr, &src_ptl);
+    if (cow_pte_count(pmd) == 1) {
+        pmd_t new = pmd_mkwrite(*pmd);
+        set_pmd_at(mm, addr, pmd, new);
+        pte_unmap_unlock(src_pte, src_ptl);
+        goto flush_tlb;
+    }
+    /* We don't hold the lock when allocating the new PTE. */
+    pte_unmap_unlock(src_pte, src_ptl);
+
+    /*
+     * Slow path. Since we already did the accounting and are still
+     * sharing the mapped pages, we can just clone the PTE.
+     */
+
+    cowed_entry = READ_ONCE(*pmd);
+    /* Decrease the pgtable_bytes of the COW-ed PTE. */
+    mm_dec_nr_ptes(mm);
+    pmd_clear(pmd);
+    orig_dst_pte = dst_pte = pte_alloc_map_lock(mm, pmd, addr, &dst_ptl);
+    if (unlikely(!dst_pte)) {
+        /* If the allocation failed, restore the COW-ed PTE. */
+        set_pmd_at(mm, addr, pmd, cowed_entry);
+        ret = -ENOMEM;
+        goto out;
+    }
+
+    /*
+     * We should hold the lock of the COW-ed PTE until all the
+     * operations have been done, including duplicating and decreasing
+     * the refcount.
+     */
+    src_pte = pte_offset_map_lock(mm, &cowed_entry, addr, &src_ptl);
+    orig_src_pte = src_pte;
+    arch_enter_lazy_mmu_mode();
+
+    /*
+     * All the mapped pages in the COW-ed PTE are COW mappings. We can
+     * set the entries and leave the rest to handle_pte_fault().
+     */
+    do {
+        if (pte_none(*src_pte))
+            continue;
+        set_pte_at(mm, addr, dst_pte, *src_pte);
+    } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+
+    arch_leave_lazy_mmu_mode();
+    pte_unmap_unlock(orig_dst_pte, dst_ptl);
+
+    /* Decrease the refcount of the COW-ed PTE. */
+    if (!pmd_put_pte(&cowed_entry)) {
+        /*
+         * The COW-ed (old) PTE's refcount is 1. Now we have two PTEs
+         * with the same content. Free the new one and reuse the
+         * old one.
+         */
+        pgtable_t token = pmd_pgtable(*pmd);
+        /* Reuse the COW-ed PTE. */
+        pmd_t new = pmd_mkwrite(cowed_entry);
+
+        /* Clear all the entries of the new PTE. */
+        addr = start;
+        dst_pte = pte_offset_map_lock(mm, pmd, addr, &dst_ptl);
+        orig_dst_pte = dst_pte;
+        do {
+            if (pte_none(*dst_pte))
+                continue;
+            if (pte_present(*dst_pte))
+                page_table_check_pte_clear(mm, addr, *dst_pte);
+            pte_clear(mm, addr, dst_pte);
+        } while (dst_pte++, addr += PAGE_SIZE, addr != end);
+        pte_unmap_unlock(orig_dst_pte, dst_ptl);
+        /* Now, we can safely free the new PTE. */
+        pmd_clear(pmd);
+        pte_free(mm, token);
+        /* Reuse the COW-ed PTE. */
+        set_pmd_at(mm, start, pmd, new);
+    }
+
+    pte_unmap_unlock(orig_src_pte, src_ptl);
+
+flush_tlb:
+    /*
+     * If we change the protection, flush the TLB.
+     * flush_tlb_range() will only use vma to get the mm; we don't
+     * need to consider the mismatched address range with the vma
+     * problem here.
+     *
+     * Should we flush the TLB while holding the pte lock?
+     */
+    flush_tlb_range(vma, start, end);
+out:
+    raw_write_seqcount_end(&mm->write_protect_seq);
+    mmu_notifier_invalidate_range_end(&range);
+
+    return ret;
+}
+
+static inline int __break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd,
+                  unsigned long addr)
+{
+    struct vm_fault vmf = {
+        .vma = vma,
+        .address = addr & PAGE_MASK,
+        .pmd = pmd,
+    };
+
+    return handle_cow_pte_fault(&vmf);
+}
+
+/**
+ * break_cow_pte - duplicate/reuse a shared, write-protected (COW-ed) PTE
+ * @vma: target vma we want to break COW on
+ * @pmd: pmd index that maps to the shared PTE
+ * @addr: the address that triggered the break COW PTE
+ *
+ * Return: zero on success, < 0 otherwise.
+ *
+ * The address needs to be in the range of the shared and write-protected
+ * PTE that the pmd index maps. If pmd is NULL, it will get the pmd from
+ * the vma. Duplicate the COW-ed PTE when someone still maps to it.
+ * Otherwise, reuse the COW-ed PTE.
+ */
+int break_cow_pte(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr)
+{
+    struct mm_struct *mm;
+    pgd_t *pgd;
+    p4d_t *p4d;
+    pud_t *pud;
+
+    if (!vma)
+        return -EINVAL;
+    mm = vma->vm_mm;
+
+    if (!test_bit(MMF_COW_PTE, &mm->flags))
+        return 0;
+
+    if (!pmd) {
+        pgd = pgd_offset(mm, addr);
+        if (pgd_none_or_clear_bad(pgd))
+            return 0;
+        p4d = p4d_offset(pgd, addr);
+        if (p4d_none_or_clear_bad(p4d))
+            return 0;
+        pud = pud_offset(p4d, addr);
+        if (pud_none_or_clear_bad(pud))
+            return 0;
+        pmd = pmd_offset(pud, addr);
+    }
+
+    /* We will check the type of the pmd entry later. */
+
+    return __break_cow_pte(vma, pmd, addr);
+}
+
+/**
+ * break_cow_pte_range - duplicate/reuse COW-ed PTEs in a given range
+ * @vma: target vma we want to break COW on
+ * @start: the address where breaking starts
+ * @end: the address where breaking ends
+ *
+ * Return: zero on success, the number of failures otherwise.
+ */
+int break_cow_pte_range(struct vm_area_struct *vma, unsigned long start,
+            unsigned long end)
+{
+    unsigned long addr, next;
+    int nr_failed = 0;
+
+    if (!range_in_vma(vma, start, end))
+        return -EINVAL;
+
+    addr = start;
+    do {
+        next = pmd_addr_end(addr, end);
+        if (break_cow_pte(vma, NULL, addr))
+            nr_failed++;
+    } while (addr = next, addr != end);
+
+    return nr_failed;
+}
+#endif /* CONFIG_COW_PTE */
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -5234,8 +5531,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
         return do_fault(vmf);
     }
 
-    if (!pte_present(vmf->orig_pte))
+    if (!pte_present(vmf->orig_pte)) {
+#ifdef CONFIG_COW_PTE
+        if (test_bit(MMF_COW_PTE, &vmf->vma->vm_mm->flags))
+            handle_cow_pte_fault(vmf);
+#endif
         return do_swap_page(vmf);
+    }
 
     if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
         return do_numa_page(vmf);
@@ -5371,8 +5673,31 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
                 return 0;
             }
         }
+#ifdef CONFIG_COW_PTE
+        /*
+         * Duplicate the COW-ed PTE when a page fault will change the
+         * mapped pages (write or unshare fault) or the COW-ed PTE
+         * (file-mapped read fault, see do_read_fault()).
+         */
+        if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE) ||
+             vma->vm_ops) && test_bit(MMF_COW_PTE, &mm->flags)) {
+            ret = handle_cow_pte_fault(&vmf);
+            if (unlikely(ret == -ENOMEM))
+                return VM_FAULT_OOM;
+        }
+#endif
     }
 
+#ifdef CONFIG_COW_PTE
+    /*
+     * It will definitely break the kernel when the refcount of the PTE
+     * is higher than 1 and it is writeable in the PMD entry. But we
+     * want to see more information, so just warn here.
+     */
+    if (likely(!pmd_none(*vmf.pmd)))
+        VM_WARN_ON(cow_pte_count(vmf.pmd) > 1 && pmd_write(*vmf.pmd));
+#endif
+
     return handle_pte_fault(&vmf);
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index 425a9349e610..ca16d7abcdb6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2208,6 +2208,10 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
             return err;
     }
 
+    err = break_cow_pte(vma, NULL, addr);
+    if (err)
+        return err;
+
     new = vm_area_dup(vma);
     if (!new)
         return -ENOMEM;
diff --git a/mm/mremap.c b/mm/mremap.c
index 930f65c315c0..3fbc45e381cc 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -534,6 +534,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
         old_pmd = get_old_pmd(vma->vm_mm, old_addr);
         if (!old_pmd)
             continue;
+        /* Do we flush the TLB twice here? */
+        break_cow_pte(vma, old_pmd, old_addr);
         new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
         if (!new_pmd)
             break;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4fa440e87cd6..92e39a722100 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1911,6 +1911,8 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
         next = pmd_addr_end(addr, end);
         if (pmd_none_or_trans_huge_or_clear_bad(pmd))
             continue;
+        if (break_cow_pte(vma, pmd, addr))
+            return -ENOMEM;
         ret = unuse_pte_range(vma, pmd, addr, next, type);
         if (ret)
             return ret;
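
[As a usage sketch of the helpers this patch exports: every caller added
above (mmap, mremap, swapfile, and the fault paths) follows the same
pattern, unshare first, then modify. A hypothetical caller on a
CONFIG_COW_PTE tree, touch_pte_range() is an invented name, would look
like this; the error handling mirrors the unuse_pmd_range() hunk above.]

/* Hypothetical in-kernel caller, illustrative only. */
static int touch_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
               unsigned long addr)
{
    /* Unshare the PTE table first so writes only affect this mm. */
    if (break_cow_pte(vma, pmd, addr))
        return -ENOMEM;

    /* ... now it is safe to modify pte entries under this pmd ... */
    return 0;
}
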
From patchwork Tue Feb 7 03:51:29 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130977
From: Chih-En Lin <shiyn.lin@gmail.com>
Subject: [PATCH v4 04/14] mm/rmap: Break COW PTE in rmap walking
Date: Tue, 7 Feb 2023 11:51:29 +0800
Message-Id: <20230207035139.272707-5-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

Some features (unmap, migrate, device exclusive, mkclean, etc.) might
modify pte entries via rmap. Add a new page vma mapped walk flag,
PVMW_BREAK_COW_PTE, to tell the rmap walk to break COW PTE.
Signed-off-by: Chih-En Lin <shiyn.lin@gmail.com>
---
 include/linux/rmap.h | 2 ++
 mm/migrate.c         | 3 ++-
 mm/page_vma_mapped.c | 4 ++++
 mm/rmap.c            | 9 +++++----
 mm/vmscan.c          | 3 ++-
 5 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..d0f07e551973 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -368,6 +368,8 @@ int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
 #define PVMW_SYNC           (1 << 0)
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION      (1 << 1)
+/* Break COW-ed PTE during walking */
+#define PVMW_BREAK_COW_PTE  (1 << 2)
 
 struct page_vma_mapped_walk {
     unsigned long pfn;
diff --git a/mm/migrate.c b/mm/migrate.c
index a4d3fc65085f..04376ce05aa8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -183,7 +183,8 @@ void putback_movable_pages(struct list_head *l)
 static bool remove_migration_pte(struct folio *folio,
         struct vm_area_struct *vma, unsigned long addr, void *old)
 {
-    DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr, PVMW_SYNC | PVMW_MIGRATION);
+    DEFINE_FOLIO_VMA_WALK(pvmw, old, vma, addr,
+                  PVMW_SYNC | PVMW_MIGRATION | PVMW_BREAK_COW_PTE);
 
     while (page_vma_mapped_walk(&pvmw)) {
         rmap_t rmap_flags = RMAP_NONE;
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17d3c..7b35e85b9964 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -251,6 +251,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
             step_forward(pvmw, PMD_SIZE);
             continue;
         }
+        if (pvmw->flags & PVMW_BREAK_COW_PTE) {
+            if (break_cow_pte(vma, pvmw->pmd, pvmw->address))
+                return not_found(pvmw);
+        }
         if (!map_pte(pvmw))
             goto next_pte;
this_pte:
diff --git a/mm/rmap.c b/mm/rmap.c
index b616870a09be..bce97496b1f6 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1012,7 +1012,8 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
 static bool page_mkclean_one(struct folio *folio, struct vm_area_struct *vma,
                  unsigned long address, void *arg)
 {
-    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_SYNC);
+    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
+                  PVMW_SYNC | PVMW_BREAK_COW_PTE);
     int *cleaned = arg;
 
     *cleaned += page_vma_mkclean_one(&pvmw);
@@ -1463,7 +1464,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
              unsigned long address, void *arg)
 {
     struct mm_struct *mm = vma->vm_mm;
-    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE);
     pte_t pteval;
     struct page *subpage;
     bool anon_exclusive, ret = true;
@@ -1834,7 +1835,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
              unsigned long address, void *arg)
 {
     struct mm_struct *mm = vma->vm_mm;
-    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE);
     pte_t pteval;
     struct page *subpage;
     bool anon_exclusive, ret = true;
@@ -2187,7 +2188,7 @@ static bool page_make_device_exclusive_one(struct folio *folio,
         struct vm_area_struct *vma, unsigned long address, void *priv)
 {
     struct mm_struct *mm = vma->vm_mm;
-    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, PVMW_BREAK_COW_PTE);
     struct make_exclusive_args *args = priv;
     pte_t pteval;
     struct page *subpage;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf3eedf0209c..15eda32146fd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1882,7 +1882,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
         /*
          * The folio is mapped into the page tables of one or more
-         * processes. Try to unmap it here.
+         * processes. Try to unmap it here. Also, since it will write
+         * to the page tables, break COW PTE if they are COW-ed.
          */
         if (folio_mapped(folio)) {
             enum ttu_flags flags = TTU_BATCH_FLUSH;
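
[For illustration, a hypothetical rmap-walk user on a tree with this
series applied would opt in the same way the converted walkers above do;
fixup_one() is an invented name, but DEFINE_FOLIO_VMA_WALK and
page_vma_mapped_walk() are used exactly as in the hunks above.]

/* Hypothetical rmap-walk user, illustrative only: request that shared
 * PTE tables be unshared before the walk hands back pte entries. */
static bool fixup_one(struct folio *folio, struct vm_area_struct *vma,
              unsigned long address, void *arg)
{
    DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
                  PVMW_SYNC | PVMW_BREAK_COW_PTE);

    while (page_vma_mapped_walk(&pvmw)) {
        /*
         * pvmw.pte now points into a private (unshared) PTE table,
         * so modifying it cannot leak into another process.
         */
    }
    return true;
}
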
From patchwork Tue Feb 7 03:51:30 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130978
From: Chih-En Lin
Subject: [PATCH v4 05/14] mm/khugepaged: Break COW PTE before scanning pte
Date: Tue, 7 Feb 2023 11:51:30 +0800
Message-Id: <20230207035139.272707-6-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>

We should not allow THP to collapse a COW-ed PTE table. So, break COW
PTE before collapse_pte_mapped_thp() collapses it to a THP. Also, break
COW PTE before khugepaged_scan_pmd() scans the PTE table.
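A sketch of the prologue the scan paths gain, assuming break_cow_pte()
returns 0 on success and nonzero on failure as introduced earlier in
this series (the helper below is illustrative, not the patch itself):

/*
 * Illustrative only: de-share the PTE table before khugepaged reads
 * or frees it, and report why a collapse was skipped.
 */
static int scan_pmd_prologue(struct vm_area_struct *vma, pmd_t *pmd,
			     unsigned long address)
{
	if (break_cow_pte(vma, pmd, address))
		return SCAN_COW_PTE;	/* surfaces in the tracepoint as "cowed_pte" */
	return SCAN_SUCCEED;
}
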
Signed-off-by: Chih-En Lin
---
 include/trace/events/huge_memory.h |  1 +
 mm/khugepaged.c                    | 35 +++++++++++++++++++++++++++++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 3e6fb05852f9..5f2c39f61521 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -13,6 +13,7 @@
 	EM( SCAN_PMD_NULL,		"pmd_null")			\
 	EM( SCAN_PMD_NONE,		"pmd_none")			\
 	EM( SCAN_PMD_MAPPED,		"page_pmd_mapped")		\
+	EM( SCAN_COW_PTE,		"cowed_pte")			\
 	EM( SCAN_EXCEED_NONE_PTE,	"exceed_none_pte")		\
 	EM( SCAN_EXCEED_SWAP_PTE,	"exceed_swap_pte")		\
 	EM( SCAN_EXCEED_SHARED_PTE,	"exceed_shared_pte")	\
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 90acfea40c13..1cddc20318d5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -31,6 +31,7 @@ enum scan_result {
 	SCAN_PMD_NULL,
 	SCAN_PMD_NONE,
 	SCAN_PMD_MAPPED,
+	SCAN_COW_PTE,
 	SCAN_EXCEED_NONE_PTE,
 	SCAN_EXCEED_SWAP_PTE,
 	SCAN_EXCEED_SHARED_PTE,
@@ -875,7 +876,7 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 		return SCAN_PMD_MAPPED;
 	if (pmd_devmap(pmde))
 		return SCAN_PMD_NULL;
-	if (pmd_bad(pmde))
+	if (pmd_write(pmde) && pmd_bad(pmde))
 		return SCAN_PMD_NULL;
 	return SCAN_SUCCEED;
 }
@@ -926,6 +927,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 			pte_unmap(vmf.pte);
 			continue;
 		}
+		if (break_cow_pte(vma, pmd, address))
+			return SCAN_COW_PTE;
 		ret = do_swap_page(&vmf);
 
 		/*
@@ -1038,6 +1041,9 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
+	/* We should have already handled the COW-ed PTE. */
+	VM_WARN_ON(test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd));
+
 	anon_vma_lock_write(vma->anon_vma);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
@@ -1148,6 +1154,13 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+
+	/* Break COW PTE before we collapse the pages. */
+	if (break_cow_pte(vma, pmd, address)) {
+		result = SCAN_COW_PTE;
+		goto out;
+	}
+
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -1206,6 +1219,10 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 			goto out_unmap;
 		}
 
+		/*
+		 * If we only triggered the break COW PTE, the page is usually
+		 * still in the COW mapping, so it may still be shared.
+		 */
 		if (page_mapcount(page) > 1) {
 			++shared;
 			if (cc->is_khugepaged &&
@@ -1501,6 +1518,11 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		goto drop_hpage;
 	}
 
+	/* We shouldn't let a COW-ed PTE table collapse. */
+	if (break_cow_pte(vma, pmd, haddr))
+		goto drop_hpage;
+	VM_WARN_ON(test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd));
+
 	/*
 	 * We need to lock the mapping so that from here on, only GUP-fast and
 	 * hardware page walks can access the parts of the page tables that
@@ -1706,6 +1728,11 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
 			result = SCAN_PTE_UFFD_WP;
 			goto unlock_next;
 		}
+		if (test_bit(MMF_COW_PTE, &mm->flags) &&
+		    !pmd_write(*pmd)) {
+			result = SCAN_COW_PTE;
+			goto unlock_next;
+		}
 		collapse_and_free_pmd(mm, vma, addr, pmd);
 		if (!cc->is_khugepaged && is_target)
 			result = set_huge_pmd(vma, addr, pmd, hpage);
@@ -2143,6 +2170,11 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	swap = 0;
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+	if (break_cow_pte(find_vma(mm, addr), NULL, addr)) {
+		result = SCAN_COW_PTE;
+		goto out;
+	}
+
 	rcu_read_lock();
 	xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
 		if (xas_retry(&xas, page))
@@ -2213,6 +2245,7 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
 	}
 	rcu_read_unlock();
+out:
 	if (result == SCAN_SUCCEED) {
 		if (cc->is_khugepaged &&
 		    present < HPAGE_PMD_NR - khugepaged_max_ptes_none) {
From patchwork Tue Feb 7 03:51:31 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130979
From: Chih-En Lin
Subject: [PATCH v4 06/14] mm/ksm: Break COW PTE before modify shared PTE
Date: Tue, 7 Feb 2023 11:51:31 +0800
Message-Id: <20230207035139.272707-7-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>

Break COW PTE before merging a page that resides in a COW-ed PTE table.
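A sketch of the check KSM's replace path gains, assuming the series'
break_cow_pte() helper (the function below is illustrative; locking and
mmu-notifier setup are elided):

/*
 * Illustrative only: before KSM rewrites a PTE to point at the shared
 * stable page, the PTE table must be de-shared.
 */
static int ksm_replace_prologue(struct vm_area_struct *vma, pmd_t *pmd,
				unsigned long addr)
{
	pmd_t pmde = *pmd;

	barrier();
	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
		return -EFAULT;		/* nothing to replace here */
	/* De-share the COW-ed PTE table before rewriting one of its PTEs. */
	if (break_cow_pte(vma, pmd, addr))
		return -EFAULT;
	return 0;
}
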
Signed-off-by: Chih-En Lin
---
 mm/ksm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index dd02780c387f..ce3887d3b04c 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1045,7 +1045,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 			      pte_t *orig_pte)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_PAGE_VMA_WALK(pvmw, page, vma, 0, 0);
+	DEFINE_PAGE_VMA_WALK(pvmw, page, vma, 0, PVMW_BREAK_COW_PTE);
 	int swapped;
 	int err = -EFAULT;
 	struct mmu_notifier_range range;
@@ -1163,6 +1163,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	barrier();
 	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
 		goto out;
+	if (break_cow_pte(vma, pmd, addr))
+		goto out;
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
 				addr + PAGE_SIZE);
From patchwork Tue Feb 7 03:51:32 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130980
From: Chih-En Lin
Subject: [PATCH v4 07/14] mm/madvise: Handle COW-ed PTE with madvise()
Date: Tue, 7 Feb 2023 11:51:32 +0800
Message-Id: <20230207035139.272707-8-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>

Break COW PTE if madvise() modifies the PTE entries of a COW-ed PTE
table. The following flags need to break COW PTE; MADV_HUGEPAGE and
MADV_MERGEABLE, however, are handled separately in khugepaged and KSM.
The common prologue pattern is sketched after this list.

- MADV_DONTNEED: It calls zap_page_range(), which is already handled.
- MADV_FREE: It uses walk_page_range() with madvise_free_pte_range() to
  free the pages by itself, so add break_cow_pte().
- MADV_REMOVE: Same as MADV_FREE, it removes the pages by itself, so
  add break_cow_pte_range().
- MADV_COLD: Similar to MADV_FREE, break COW PTE before pageout.
- MADV_POPULATE: Let GUP deal with it.
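A sketch of that common prologue, assuming the series' break_cow_pte()
(the helper below is illustrative; returning 0 makes the page walker
skip the range, matching madvise's best-effort semantics):

/*
 * Illustrative only: the guard the madvise PTE-range walkers gain
 * before they zap, free, or age entries in a possibly COW-ed table.
 */
static int madvise_pte_range_prologue(pmd_t *pmd, unsigned long addr,
				      struct vm_area_struct *vma)
{
	if (pmd_trans_unstable(pmd))
		return 0;		/* racing THP, skip this range */
	/* De-share the PTE table before modifying its entries. */
	if (break_cow_pte(vma, pmd, addr))
		return 0;		/* could not break COW, skip range */
	return 1;			/* safe to map and modify the PTEs */
}
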
Signed-off-by: Chih-En Lin
---
 mm/madvise.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index b6ea204d4e23..8b815942f286 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -428,6 +428,9 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 #endif
+	if (break_cow_pte(vma, pmd, addr))
+		return 0;
+
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	flush_tlb_batched_pending(mm);
@@ -629,6 +632,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
+	/* We should only allocate PTE. */
+	if (break_cow_pte(vma, pmd, addr))
+		goto next;
+
 	tlb_change_page_size(tlb, PAGE_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	flush_tlb_batched_pending(mm);
@@ -989,6 +996,12 @@ static long madvise_remove(struct vm_area_struct *vma,
 	if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE))
 		return -EACCES;
 
+	error = break_cow_pte_range(vma, start, end);
+	if (error < 0)
+		return error;
+	else if (error > 0)
+		return -ENOMEM;
+
 	offset = (loff_t)(start - vma->vm_start)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
From patchwork Tue Feb 7 03:51:33 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130981
From: Chih-En Lin
Subject: [PATCH v4 08/14] mm/gup: Trigger break COW PTE before calling follow_pfn_pte()
Date: Tue, 7 Feb 2023 11:51:33 +0800
Message-Id: <20230207035139.272707-9-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>

In most cases, GUP does not modify the page table; follow_pfn_pte() is
the exception. To deal with COW PTE, trigger the break COW PTE fault
before calling follow_pfn_pte().
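A sketch of the bail-out shape, assuming the MMF_COW_PTE flag from this
series and its convention that a write-protected PMD entry marks a
COW-ed PTE table (the helper below is illustrative, not the patch):

/*
 * Illustrative only: slow-path GUP detects a still-COW-ed PTE table
 * and returns -EMLINK so the caller faults in (and thus de-shares)
 * the table before retrying.
 */
static struct page *check_cow_pte_before_pfn(struct mm_struct *mm, pmd_t *pmd)
{
	if (test_bit(MMF_COW_PTE, &mm->flags) && !pmd_write(*pmd))
		return ERR_PTR(-EMLINK);	/* force the break-COW fault */
	return NULL;				/* safe to continue */
}
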
Signed-off-by: Chih-En Lin
---
 mm/gup.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index f45a3a5be53a..e702c0800105 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -545,7 +545,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 	if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
 			 (FOLL_PIN | FOLL_GET)))
 		return ERR_PTR(-EINVAL);
-	if (unlikely(pmd_bad(*pmd)))
+	/* A COW-ed PTE table is write-protected, which can trigger pmd_bad(). */
+	if (unlikely(pmd_write(*pmd) && pmd_bad(*pmd)))
 		return no_page_table(vma, flags);
 
 	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -588,6 +589,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		if (is_zero_pfn(pte_pfn(pte))) {
 			page = pte_page(pte);
 		} else {
+			if (test_bit(MMF_COW_PTE, &mm->flags) &&
+			    !pmd_write(*pmd)) {
+				page = ERR_PTR(-EMLINK);
+				goto out;
+			}
 			ret = follow_pfn_pte(vma, address, ptep, flags);
 			page = ERR_PTR(ret);
 			goto out;
From patchwork Tue Feb 7 03:51:34 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130982
From: Chih-En Lin
Subject: [PATCH v4 09/14] mm/mprotect: Break COW PTE before changing protection
Date: Tue, 7 Feb 2023 11:51:34 +0800
Message-Id: <20230207035139.272707-10-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>

If the PTE table is COW-ed, break it before changing the protection.
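A sketch of the subtlety this patch works around, assuming the series'
convention that a write-protected PMD entry pointing at a PTE table
means "COW-ed" (the predicate below is illustrative, not the patch):

/*
 * Illustrative only: a COW-ed PTE-table entry is write-protected, so a
 * bare pmd_bad() check would mistake it for corruption and clear it.
 */
static inline int pmd_is_cowed_pte_table(pmd_t pmdval)
{
	/* Write-protected PTE-table entry: COW-ed, not corrupt. */
	return !pmd_none(pmdval) && !pmd_trans_huge(pmdval) &&
	       !pmd_write(pmdval);
}
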
Signed-off-by: Chih-En Lin
---
 mm/mprotect.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 61cf60015a8b..8b18cd0e5c5e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -103,6 +103,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
+	if (break_cow_pte(vma, pmd, addr))
+		return 0;
+
 	/*
 	 * The pmd points to a regular pte so the pmd can't change
 	 * from under us even if the mmap_lock is only hold for
@@ -314,6 +317,12 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 		return 1;
 	if (pmd_trans_huge(pmdval))
 		return 0;
+	/*
+	 * If the entry points to a COW-ed PTE table, its write protection
+	 * bit will cause pmd_bad().
+	 */
+	if (!pmd_write(pmdval))
+		return 0;
 	if (unlikely(pmd_bad(pmdval))) {
 		pmd_clear_bad(pmd);
 		return 1;
From patchwork Tue Feb 7 03:51:35 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130983
From: Chih-En Lin
Subject: [PATCH v4 10/14] mm/userfaultfd: Support COW PTE
Date: Tue, 7 Feb 2023 11:51:35 +0800
Message-Id: <20230207035139.272707-11-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>

If uffd fills the zero page or installs a PTE into a COW-ed PTE table,
break COW on the table first.
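A sketch of the guard both uffd install paths gain, assuming the
series' break_cow_pte() (the helper below is illustrative; -ENOMEM
matches what the callers already return when a page table cannot be
set up):

/*
 * Illustrative only: uffd is about to write a PTE, so the destination
 * PTE table must be private to this mm.
 */
static int uffd_install_prologue(struct vm_area_struct *dst_vma,
				 pmd_t *dst_pmd, unsigned long dst_addr)
{
	if (break_cow_pte(dst_vma, dst_pmd, dst_addr))
		return -ENOMEM;
	return 0;
}
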
Signed-off-by: Chih-En Lin
---
 mm/userfaultfd.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0499907b6f1a..3f66aa3eb54f 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -70,6 +70,9 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	struct inode *inode;
 	pgoff_t offset, max_off;
 
+	if (break_cow_pte(dst_vma, dst_pmd, dst_addr))
+		return -ENOMEM;
+
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
 	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
@@ -229,6 +232,9 @@ static int mfill_zeropage_pte(struct mm_struct *dst_mm,
 	pgoff_t offset, max_off;
 	struct inode *inode;
 
+	if (break_cow_pte(dst_vma, dst_pmd, dst_addr))
+		return -ENOMEM;
+
 	_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
 					 dst_vma->vm_page_prot));
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);

From patchwork Tue Feb 7 03:51:36 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130984
From: Chih-En Lin
Subject: [PATCH v4 11/14] mm/migrate_device: Support COW PTE
Date: Tue, 7 Feb 2023 11:51:36 +0800
Message-Id: <20230207035139.272707-12-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>

Break COW PTE before collecting the pages in a COW-ed PTE table.
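A sketch of the collector-side pattern, assuming the series'
break_cow_pte_range() (the helper below is illustrative; on failure
the collector simply skips the range, as it does for a bad PMD):

/*
 * Illustrative only: migrate_vma_collect_pmd() walks and rewrites
 * PTEs, so the range is de-shared up front.
 */
static int collect_prologue(struct vm_area_struct *vma, unsigned long start,
			    unsigned long end, struct mm_walk *walk)
{
	if (break_cow_pte_range(vma, start, end))
		return migrate_vma_collect_skip(start, end, walk);
	return 0;	/* proceed with the normal pmd_bad() checks */
}
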
Signed-off-by: Chih-En Lin
---
 mm/migrate_device.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 721b2365dbca..2930e591e8fc 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -106,6 +106,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		}
 	}
 
+	if (break_cow_pte_range(vma, start, end))
+		return migrate_vma_collect_skip(start, end, walk);
 	if (unlikely(pmd_bad(*pmdp)))
 		return migrate_vma_collect_skip(start, end, walk);
From patchwork Tue Feb 7 03:51:37 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130985
From: Chih-En Lin
Subject: [PATCH v4 12/14] fs/proc: Support COW PTE with clear_refs_write
Date: Tue, 7 Feb 2023 11:51:37 +0800
Message-Id: <20230207035139.272707-13-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>

Before clearing an entry in a COW-ed PTE table, break COW PTE first.
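A sketch of why only the soft-dirty case needs the break, assuming the
series' break_cow_pte() and the existing clear_refs_private type (the
helper below is illustrative, not the patch):

/*
 * Illustrative only: clear_refs writes to the PTEs only when clearing
 * soft-dirty bits, so that is the only case that must de-share the
 * PTE table.
 */
static int clear_refs_prologue(struct clear_refs_private *cp,
			       struct vm_area_struct *vma, pmd_t *pmd,
			       unsigned long addr)
{
	/* Only break COW when we will modify the soft-dirty bit. */
	if (cp->type == CLEAR_REFS_SOFT_DIRTY && break_cow_pte(vma, pmd, addr))
		return 0;	/* skip this range on failure */
	return 1;
}
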
Signed-off-by: Chih-En Lin
---
 fs/proc/task_mmu.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index af1c49ae11b1..94958422aede 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1196,6 +1196,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 	if (pmd_trans_unstable(pmd))
 		return 0;
 
+	/* Only break COW when we modify the soft-dirty bit. */
+	if (cp->type == CLEAR_REFS_SOFT_DIRTY &&
+	    break_cow_pte(vma, pmd, addr))
+		return 0;
+
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
From patchwork Tue Feb 7 03:51:38 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130986
From: Chih-En Lin
Subject: [PATCH v4 13/14] events/uprobes: Break COW PTE before replacing page
Date: Tue, 7 Feb 2023 11:51:38 +0800
Message-Id: <20230207035139.272707-14-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>

Break COW PTE if we want to replace a page that resides in a COW-ed
PTE table.
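A sketch of the pattern the uprobes page-replace path follows once the
walk flag is set (the helper below is illustrative only; the real
__replace_page() also swaps rmap accounting, flushes the TLB, and fires
mmu notifiers):

/*
 * Illustrative only: uprobes swaps the probed page by rewriting the
 * PTE, so its folio walk requests PVMW_BREAK_COW_PTE up front.
 */
static int replace_probed_page(struct vm_area_struct *vma,
			       struct folio *old_folio, unsigned long addr)
{
	/* The walk de-shares the PTE table before handing us the PTE. */
	DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, PVMW_BREAK_COW_PTE);

	if (!page_vma_mapped_walk(&pvmw))
		return -EAGAIN;
	/* ... set_pte_at() with the new page would go here ... */
	page_vma_mapped_walk_done(&pvmw);
	return 0;
}
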
Signed-off-by: Chih-En Lin
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index d9e357b7e17c..2956a53da01a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -157,7 +157,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	struct folio *old_folio = page_folio(old_page);
 	struct folio *new_folio;
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, old_folio, vma, addr, PVMW_BREAK_COW_PTE);
 	int err;
 	struct mmu_notifier_range range;

From patchwork Tue Feb 7 03:51:39 2023
X-Patchwork-Submitter: Chih-En Lin
X-Patchwork-Id: 13130987
From: Chih-En Lin
Subject: [PATCH v4 14/14] mm: fork: Enable COW PTE to fork system call
Date: Tue, 7 Feb 2023 11:51:39 +0800
Message-Id: <20230207035139.272707-15-shiyn.lin@gmail.com>
In-Reply-To: <20230207035139.272707-1-shiyn.lin@gmail.com>
References: <20230207035139.272707-1-shiyn.lin@gmail.com>

This patch enables the Copy-On-Write (COW) mechanism for the PTE table
in the fork system call. To let a process do a COW PTE fork, use
prctl(PR_SET_COW_PTE); this sets the MMF_COW_PTE_READY flag on the
process, enabling COW PTE during the next fork. The MMF_COW_PTE flag
then distinguishes a normal page table from a COW one. Moreover, it is
difficult to tell when all the page tables have left the COW state, so
the MMF_COW_PTE flag is never cleared once it is set.
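A minimal userspace sketch of opting in, assuming the PR_SET_COW_PTE
prctl defined earlier in this series (the constant's value below is an
assumption; use whatever the series' uapi header provides):

#include <stdio.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef PR_SET_COW_PTE
#define PR_SET_COW_PTE 65	/* assumed value, from this series' uapi patch */
#endif

int main(void)
{
	/* Arm COW PTE for the next fork of this process. */
	if (prctl(PR_SET_COW_PTE, 0, 0, 0, 0))
		perror("prctl(PR_SET_COW_PTE)");

	pid_t pid = fork();	/* PTE tables are now shared copy-on-write */
	if (pid == 0)
		_exit(0);
	waitpid(pid, NULL, 0);
	return 0;
}
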
Signed-off-by: Chih-En Lin
---
 kernel/fork.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9f7fe3541897..94c35c8b31b1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2678,6 +2678,13 @@ pid_t kernel_clone(struct kernel_clone_args *args)
 		trace = 0;
 	}
 
+#ifdef CONFIG_COW_PTE
+	if (current->mm && test_bit(MMF_COW_PTE_READY, &current->mm->flags)) {
+		clear_bit(MMF_COW_PTE_READY, &current->mm->flags);
+		set_bit(MMF_COW_PTE, &current->mm->flags);
+	}
+#endif
+
 	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
 	add_latent_entropy();