From patchwork Thu Aug 13 22:09:37 2020
X-Patchwork-Submitter: Peter Collingbourne
X-Patchwork-Id: 11713059
Date: Thu, 13 Aug 2020 15:09:37 -0700
Message-Id: <20200813220937.40973-1-pcc@google.com>
Subject: [PATCH v2] mm: introduce reference pages
From: Peter Collingbourne
To: John Hubbard, Matthew Wilcox, "Kirill A . Shutemov", Andrew Morton, Catalin Marinas, Evgenii Stepanov
Cc: Peter Collingbourne, Linux ARM, linux-mm@kvack.org

Introduce a new syscall, refpage_create, which returns a file descriptor
which may be mapped using mmap. Such a mapping is similar to an
anonymous mapping, but instead of clean pages being backed by the zero
page, they are instead backed by a so-called reference page, whose
contents are specified using an argument to refpage_create.
Loads from the mapping will load directly from the reference page, and
initial stores to the mapping will copy-on-write from the reference
page.

Reference pages are useful in circumstances where anonymous mappings
combined with manual stores to memory would impose undesirable costs,
either in terms of performance or RSS. Use cases are focused on heap
allocators and include:

- Pattern initialization for the heap. This is where malloc(3) gives
  you memory whose contents are filled with a non-zero pattern byte, in
  order to help detect and mitigate bugs involving use of uninitialized
  memory. Typically this is implemented by having the allocator memset
  the allocation with the pattern byte before returning it to the user,
  but for large allocations this can result in a significant increase
  in RSS, especially for allocations that are used sparsely. Even for
  dense allocations there is a needless impact to startup performance
  when it may be better to amortize it throughout the program. By
  creating allocations using a reference page filled with the pattern
  byte, we can avoid these costs.

- Pre-tagged heap memory. Memory tagging [1] is an upcoming ARMv8.5
  feature which allows for memory to be tagged in order to detect
  certain kinds of memory errors with low overhead. In order to set up
  an allocation to allow memory errors to be detected, the entire
  allocation needs to have the same tag. The issue here is similar to
  pattern initialization in the sense that large tagged allocations
  will be expensive if the tagging is done up front. The idea is that
  the allocator would create reference pages with each of the possible
  memory tags, and use those reference pages for the large allocations.

In order to measure the performance and RSS impact of reference pages,
a version of this patch backported to kernel version 4.14 was tested on
a Pixel 4 together with a modified [2] version of the Scudo allocator
that uses reference pages to implement pattern initialization.
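
For illustration only, here is a minimal userspace sketch of how an
allocator might obtain pattern-filled memory in the way described
above. It is not taken from the Scudo change: the helper name
map_pattern is hypothetical, and the refpage_create branch is only
compiled when the installed headers define __NR_refpage_create, which
is an assumption that holds only for a kernel carrying this patch.
Without it, the sketch falls back to the memset approach.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: return `len` bytes of private memory that read
 * back as `pattern`, or NULL on failure.
 */
static void *map_pattern(size_t len, unsigned char pattern)
{
	long page_size = sysconf(_SC_PAGESIZE);
	void *mem;

	assert(page_size > 0);

#ifdef __NR_refpage_create
	/*
	 * Assumption: headers from a kernel with this patch applied.
	 * The syscall takes a page-aligned page of content and flags == 0.
	 */
	void *content = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (content != MAP_FAILED) {
		memset(content, pattern, (size_t)page_size);
		int fd = (int)syscall(__NR_refpage_create, content, 0UL);
		munmap(content, (size_t)page_size);
		if (fd >= 0) {
			mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
				   MAP_PRIVATE, fd, 0);
			/* As with any mmap, the mapping outlives the fd. */
			close(fd);
			if (mem != MAP_FAILED)
				return mem;
		}
	}
#endif

	/* Fallback: anonymous memory filled eagerly ("memset" variant). */
	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return NULL;
	memset(mem, pattern, len);
	return mem;
}
```

The fallback touches every page up front, which is exactly the RSS and
startup cost that the reference page path is meant to avoid.
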
A PDFium test program was used to collect the measurements like so:

$ wget https://static.docs.arm.com/ddi0487/fb/DDI0487F_b_armv8_arm.pdf
$ /system/bin/time -v ./pdfium_test --pages=1-100 DDI0487F_b_armv8_arm.pdf

and the median of 100 runs was taken with three variants of the
allocator:

- "anon" is the baseline (no pattern init)
- "memset" is with pattern init of allocator pages implemented by
  initializing anonymous pages with memset
- "refpage" is with pattern init of allocator pages implemented by
  creating reference pages

All three variants are measured using the patch that I linked. "anon"
is without the patch, "refpage" is with the patch and "memset" is with
a previous version of the patch [3] with "#if 0" in place of "#if 1" in
linux.cpp. The measurements are as follows:

          Real time (s)    Max RSS (KiB)
anon        2.237081          107088
memset      2.252241          112180
refpage     2.243786          107128

We can see that RSS for refpage is almost the same as anon, and that
its real time overhead over anon (0.006705 s) is 44% that of memset
(0.015160 s).

As an alternative to introducing this syscall, I considered using
userfaultfd to implement reference pages. However, after having taken
a detailed look at the interface, it does not seem suitable to be used
in the context of a general purpose allocator. For example,
UFFD_FEATURE_EVENT_FORK support would be required in order to correctly
support fork(2) in a process that uses the allocator (although POSIX
does not guarantee support for allocating after fork, many allocators
including Scudo support it, and nothing stops the forked process from
page faulting pre-existing allocations after forking anyway), but
UFFD_FEATURE_EVENT_FORK has been restricted to processes with
CAP_SYS_PTRACE by commit 3c1c24d91ffd ("userfaultfd: require
CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORK"), making it unsuitable for
use in an allocator.
Furthermore, even if the interface issues are resolved, I suspect (but
have not measured) that the cost of the multiple context switches
between kernel and userspace would be too high to be used in an
allocator anyway.

[1] https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/enhancing-memory-safety
[2] https://github.com/pcc/llvm-project/commit/4871b739f86a631537d1725847a27ac148a392a0
[3] https://github.com/pcc/llvm-project/commit/a05f88aaebc7daf262d6885444d9845052026f4b

Signed-off-by: Peter Collingbourne
Reported-by: kernel test robot
---
v2:
- Switch to an approach of adding a new syscall instead of modifying
  mmap(2)
- Move ownership of the reference page to the struct file to avoid
  refcount overflows

 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd.h             |  2 +-
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/huge_mm.h                     |  7 +++
 include/linux/pgtable.h                     | 10 ++++
 include/linux/syscalls.h                    |  3 ++
 include/uapi/asm-generic/unistd.h           |  4 +-
 kernel/sys_ni.c                             |  1 +
 mm/Makefile                                 |  4 +-
 mm/gup.c                                    |  2 +-
 mm/memory.c                                 | 32 ++++++++----
 mm/migrate.c                                |  4 +-
 mm/refpage.c                                | 56 +++++++++++++++++++++
 28 files changed, 127 insertions(+), 16 deletions(-)
 create mode 100644 mm/refpage.c

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index a28fb211881d..efbdbceba085 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
 547	common	openat2	sys_openat2
 548	common	pidfd_getfd	sys_pidfd_getfd
 549	common	faccessat2	sys_faccessat2
+550	common	refpage_create	sys_refpage_create
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 7e8ee4adf269..68f0a0822ed6 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -453,3 +453,4 @@
 437	common	openat2	sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd
 439	common	faccessat2	sys_faccessat2
+440	common	refpage_create	sys_refpage_create
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 3b859596840d..b3b2019f8d16 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		440
+#define __NR_compat_syscalls		441
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 17e81bd9a2d3..18ff5382341c 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -887,6 +887,8 @@ __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_faccessat2 439
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_refpage_create 440
+__SYSCALL(__NR_refpage_create, sys_refpage_create)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index ced9c83e47c9..dd58ddc63d92 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -360,3 +360,4 @@
 437	common	openat2	sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd
 439	common	faccessat2	sys_faccessat2
+440	common	refpage_create	sys_refpage_create
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 1a4822de7292..fe9c2ffcbf63 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
 437	common	openat2	sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd
 439	common	faccessat2	sys_faccessat2
+440	common	refpage_create	sys_refpage_create
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index a3f4be8e7238..d8ef9318ac7f 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -445,3 +445,4 @@
 437	common	openat2	sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd
 439	common	faccessat2	sys_faccessat2
+440	common	refpage_create	sys_refpage_create
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 6b4ee92e3aed..8970f55475c4 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -378,3 +378,4 @@
 437	n32	openat2	sys_openat2
 438	n32	pidfd_getfd	sys_pidfd_getfd
 439	n32	faccessat2	sys_faccessat2
+440	n32	refpage_create	sys_refpage_create
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 391acbf425a0..894645fc00a2 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -354,3 +354,4 @@
 437	n64	openat2	sys_openat2
 438	n64	pidfd_getfd	sys_pidfd_getfd
 439	n64	faccessat2	sys_faccessat2
+440	n64	refpage_create	sys_refpage_create
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 5727c5187508..43957e224dbf 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -427,3 +427,4 @@
 437	o32	openat2	sys_openat2
 438	o32	pidfd_getfd	sys_pidfd_getfd
 439	o32	faccessat2	sys_faccessat2
+440	o32	refpage_create	sys_refpage_create
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 292baabefade..d6d8d7c5e60a 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -437,3 +437,4 @@
 437	common	openat2	sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd
 439	common	faccessat2	sys_faccessat2
+440	common	refpage_create	sys_refpage_create
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index be9f74546068..a73e79116f43 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -529,3 +529,4 @@
 437	common	openat2	sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd
 439	common	faccessat2	sys_faccessat2
+440	common	refpage_create	sys_refpage_create
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index f1fda4375526..956253c47c07 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
 437	common	openat2	sys_openat2	sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd	sys_pidfd_getfd
 439	common	faccessat2	sys_faccessat2	sys_faccessat2
+440	common	refpage_create	sys_refpage_create	sys_refpage_create
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 96848db9659e..5e3d7f569603 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -442,3 +442,4 @@
 437	common	openat2	sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd
 439	common	faccessat2	sys_faccessat2
+440	common	refpage_create	sys_refpage_create
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 46024e80ee86..8b21deb46ef5 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -485,3 +485,4 @@
 437	common	openat2	sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd
 439	common	faccessat2	sys_faccessat2
+440	common	refpage_create	sys_refpage_create
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index e31a75262c9c..c614da77e1a0 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -444,3 +444,4 @@
 437	i386	openat2	sys_openat2
 438	i386	pidfd_getfd	sys_pidfd_getfd
 439	i386	faccessat2	sys_faccessat2
+440	i386	refpage_create	sys_refpage_create
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 9d82078c949a..7f7ab6bab41e 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -361,6 +361,7 @@
 437	common	openat2	sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd
 439	common	faccessat2	sys_faccessat2
+440	common	refpage_create	sys_refpage_create
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index d216ccba42f7..a086512e8f06 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -410,3 +410,4 @@
 437	common	openat2	sys_openat2
 438	common	pidfd_getfd	sys_pidfd_getfd
 439	common	faccessat2	sys_faccessat2
+440	common	refpage_create	sys_refpage_create
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 467302056e17..a1dc07ff914a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -175,6 +175,13 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 
 	if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
 		return false;
+
+	/*
+	 * Transparent hugepages not currently supported for anonymous VMAs with
+	 * reference pages
+	 */
+	if (unlikely(vma->vm_private_data))
+		return false;
 	return true;
 }
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index a124c21e3204..1059dc75b1e3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1054,6 +1054,16 @@ static inline unsigned long my_zero_pfn(unsigned long addr)
 }
 #endif
 
+static inline int is_zero_or_refpage_pfn(struct vm_area_struct *vma,
+					 unsigned long pfn)
+{
+	if (is_zero_pfn(pfn))
+		return true;
+	if (unlikely(!vma->vm_ops && vma->vm_private_data))
+		return pfn == page_to_pfn((struct page *)vma->vm_private_data);
+	return false;
+}
+
 #ifdef CONFIG_MMU
 
 #ifndef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index dc2b827c81e5..7ee15611729e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -831,6 +831,9 @@ asmlinkage long sys_mremap(unsigned long addr, unsigned long old_len,
 			   unsigned long new_len, unsigned long flags,
 			   unsigned long new_addr);
 
+/* mm/refpage.c */
+asmlinkage long sys_refpage_create(const void __user *content, unsigned long flags);
+
 /* security/keys/keyctl.c */
 asmlinkage long sys_add_key(const char __user *_type,
 			    const char __user *_description,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 995b36c2ea7d..26d99bd30e1e 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@ __SYSCALL(__NR_openat2, sys_openat2)
 __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 #define __NR_faccessat2 439
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
+#define __NR_refpage_create 440
+__SYSCALL(__NR_refpage_create, sys_refpage_create)
 
 #undef __NR_syscalls
-#define __NR_syscalls 440
+#define __NR_syscalls 441
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 3b69a560a7ac..01af430d31da 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -291,6 +291,7 @@ COND_SYSCALL(migrate_pages);
 COND_SYSCALL_COMPAT(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL_COMPAT(move_pages);
+COND_SYSCALL(refpage_create);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..b2cc6f66d4e7 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -35,10 +35,10 @@ CFLAGS_init-mm.o += $(call cc-disable-warning, override-init)
 CFLAGS_init-mm.o += $(call cc-disable-warning, initializer-overrides)
 
 mmu-y			:= nommu.o
-mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
+mmu-$(CONFIG_MMU)	:= highmem.o ioremap.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o ioremap.o
+			   pgtable-generic.o refpage.o rmap.o vmalloc.o
 
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
diff --git a/mm/gup.c b/mm/gup.c
index 39e58df6925d..5b4c3e3c86b9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -463,7 +463,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		goto out;
 	}
 
-	if (is_zero_pfn(pte_pfn(pte))) {
+	if (is_zero_or_refpage_pfn(vma, pte_pfn(pte))) {
 		page = pte_page(pte);
 	} else {
 		ret = follow_pfn_pte(vma, address, ptep, flags);
diff --git a/mm/memory.c b/mm/memory.c
index 228efaca75d3..3289fceae9ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -602,7 +602,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
-		if (is_zero_pfn(pfn))
+		if (is_zero_or_refpage_pfn(vma, pfn))
 			return NULL;
 		if (pte_devmap(pte))
 			return NULL;
@@ -628,7 +628,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		}
 	}
 
-	if (is_zero_pfn(pfn))
+	if (is_zero_or_refpage_pfn(vma, pfn))
 		return NULL;
 
 check_pfn:
@@ -1880,7 +1880,7 @@ static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
 		return true;
 	if (pfn_t_special(pfn))
 		return true;
-	if (is_zero_pfn(pfn_t_to_pfn(pfn)))
+	if (is_zero_or_refpage_pfn(vma, pfn_t_to_pfn(pfn)))
 		return true;
 	return false;
 }
@@ -3322,6 +3322,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	struct page *refpage = vma->vm_private_data;
 	struct page *page;
 	vm_fault_t ret = 0;
 	pte_t entry;
@@ -3347,11 +3348,16 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	if (unlikely(pmd_trans_unstable(vmf->pmd)))
 		return 0;
 
-	/* Use the zero-page for reads */
+	/* Use the zero-page, or reference page if set, for reads */
 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
 			!mm_forbids_zeropage(vma->vm_mm)) {
-		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
-						vma->vm_page_prot));
+		unsigned long pfn;
+
+		if (unlikely(refpage))
+			pfn = page_to_pfn(refpage);
+		else
+			pfn = my_zero_pfn(vmf->address);
+		entry = pte_mkspecial(pfn_pte(pfn, vma->vm_page_prot));
 		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
 				vmf->address, &vmf->ptl);
 		if (!pte_none(*vmf->pte)) {
@@ -3372,9 +3378,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	/* Allocate our own private page. */
 	if (unlikely(anon_vma_prepare(vma)))
 		goto oom;
-	page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
-	if (!page)
-		goto oom;
+
+	if (unlikely(refpage)) {
+		page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
+		if (!page)
+			goto oom;
+		copy_user_highpage(page, refpage, vmf->address, vma);
+	} else {
+		page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
+		if (!page)
+			goto oom;
+	}
 
 	if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
 		goto oom_free_page;
diff --git a/mm/migrate.c b/mm/migrate.c
index 5053439be6ab..6e9246d09e95 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2841,8 +2841,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	pmd_t *pmdp;
 	pte_t *ptep;
 
-	/* Only allow populating anonymous memory */
-	if (!vma_is_anonymous(vma))
+	/* Only allow populating anonymous memory without a reference page */
+	if (!vma_is_anonymous(vma) || vma->vm_private_data)
 		goto abort;
 
 	pgdp = pgd_offset(mm, addr);
diff --git a/mm/refpage.c b/mm/refpage.c
new file mode 100644
index 000000000000..c5fc66a38a51
--- /dev/null
+++ b/mm/refpage.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include
+#include
+#include
+#include
+#include
+
+static int refpage_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	vma_set_anonymous(vma);
+	vma->vm_private_data = vma->vm_file->private_data;
+	return 0;
+}
+
+static int refpage_release(struct inode *inode, struct file *file)
+{
+	put_page(file->private_data);
+	return 0;
+}
+
+static const struct file_operations refpage_file_operations = {
+	.mmap = refpage_mmap,
+	.release = refpage_release,
+};
+
+SYSCALL_DEFINE2(refpage_create, const void __user *, content, unsigned long,
+		flags)
+{
+	unsigned long content_addr = (unsigned long)content;
+	struct page *userpage, *refpage;
+	int fd;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	refpage = alloc_page(GFP_KERNEL);
+	if (!refpage)
+		return -ENOMEM;
+
+	if ((content_addr & (PAGE_SIZE - 1)) != 0 ||
+	    get_user_pages(content_addr, 1, 0, &userpage, 0) != 1) {
+		put_page(refpage);
+		return -EFAULT;
+	}
+
+	copy_highpage(refpage, userpage);
+	put_page(userpage);
+
+	fd = anon_inode_getfd("[refpage]", &refpage_file_operations, refpage,
+			      O_RDONLY | O_CLOEXEC);
+	if (fd < 0)
+		put_page(refpage);
+
+	return fd;
+}
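
As a reading aid for the argument checks in refpage_create() above, the
following userspace sketch mirrors only the two cheap checks (flags and
alignment). The helper name refpage_args_errno is hypothetical and not
part of this patch. It cannot predict -ENOMEM from the page allocation,
which in the kernel code happens between the two checks, nor -EFAULT
from get_user_pages() on an unmapped address.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the errno value that the checks in
 * refpage_create() would produce for these arguments, or 0 if they
 * pass. Flags are checked first, then page alignment of the content
 * address, in the same order as the kernel code.
 */
static int refpage_args_errno(uintptr_t content_addr, unsigned long flags,
			      unsigned long page_size)
{
	/* page_size must be a non-zero power of two for the mask to work. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	if (flags != 0)
		return EINVAL;
	if ((content_addr & (page_size - 1)) != 0)
		return EFAULT;
	return 0;
}
```

A caller would typically pass sysconf(_SC_PAGESIZE) as page_size, since
the kernel check uses PAGE_SIZE.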