
[128/192] mm: improve mprotect(R|W) efficiency on pages referenced once

Message ID 20210629023959.4ZAFiI8oZ%akpm@linux-foundation.org (mailing list archive)
State New
Series [001/192] mm/gup: fix try_grab_compound_head() race with split_huge_page()

Commit Message

Andrew Morton June 29, 2021, 2:39 a.m. UTC
From: Peter Collingbourne <pcc@google.com>
Subject: mm: improve mprotect(R|W) efficiency on pages referenced once

In the Scudo memory allocator [1] we would like to be able to detect
use-after-free vulnerabilities involving large allocations by issuing
mprotect(PROT_NONE) on the memory region used for the allocation when it
is deallocated.  Later on, after the memory region has been "quarantined"
for a sufficient period of time we would like to be able to use it for
another allocation by issuing mprotect(PROT_READ|PROT_WRITE).

Before this patch, after removing the write protection, any writes to the
memory region would result in page faults and entering the copy-on-write
code path, even in the usual case where the pages are only referenced by a
single PTE, harming performance unnecessarily.  Make it so that any pages
in anonymous mappings that are only referenced by a single PTE are
immediately made writable during the mprotect so that we can avoid the
page faults.

This program shows the critical syscall sequence that we intend to use in
the allocator:

  #include <string.h>
  #include <sys/mman.h>

  enum { kSize = 131072 };

  int main(int argc, char **argv) {
    char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    for (int i = 0; i != 100000; ++i) {
      memset(addr, i, kSize);
      mprotect((void *)addr, kSize, PROT_NONE);
      mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
    }
  }

The effect of this patch on the above program was measured on a
DragonBoard 845c by taking the median real time execution time of 10 runs.

Before: 2.94s
After:  0.66s

The effect was also measured using one of the microbenchmarks that we
normally use to benchmark the allocator [2], after modifying it to make
the appropriate mprotect calls [3].  With an allocation size of 131072
bytes to trigger the allocator's "large allocation" code path the
per-iteration time was measured as follows:

Before: 27450ns
After:   6010ns

This patch means that we do more work during the mprotect call itself in
exchange for less work when the pages are accessed.  In the worst case,
the pages are not accessed at all.  The effect of this patch in such cases
was measured using the following program:

  #include <string.h>
  #include <sys/mman.h>

  enum { kSize = 131072 };

  int main(int argc, char **argv) {
    char *addr = (char *)mmap(0, kSize, PROT_READ | PROT_WRITE,
                              MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    memset(addr, 1, kSize);
    for (int i = 0; i != 100000; ++i) {
  #ifdef PAGE_FAULT
      memset(addr + (i * 4096) % kSize, i, 4096);
  #endif
      mprotect((void *)addr, kSize, PROT_NONE);
      mprotect((void *)addr, kSize, PROT_READ | PROT_WRITE);
    }
  }

With PAGE_FAULT undefined (0 pages touched after removing write
protection) the median real time execution time of 100 runs was measured
as follows:

Before: 0.330260s
After:  0.338836s

With PAGE_FAULT defined (1 page touched) the measurements were
as follows:

Before: 0.438048s
After:  0.355661s

So it seems that even with a single page fault the new approach is faster.

I saw similar results if I adjusted the programs to use a larger mapping
size.  With kSize = 1048576 I get these numbers with PAGE_FAULT undefined:

Before: 1.428988s
After:  1.512016s

i.e. the new approach is around 5.5% slower in this worst case.

And these with PAGE_FAULT defined:

Before: 1.518559s
After:  1.524417s

i.e. about the same.

What I think we may conclude from these results is that for smaller
mappings the advantage of the previous approach, although measurable, is
wiped out by a single page fault.  I think we may expect that there should
be at least one access resulting in a page fault (under the previous
approach) after making the pages writable, since the program presumably
made the pages writable for a reason.

For larger mappings we may guesstimate that the new approach wins if the
density of future page faults is > 0.4%.  But for the mappings that are
large enough for density to matter (not just the absolute number of page
faults) it doesn't seem like the increase in mprotect latency would be
very large relative to the total mprotect execution time.

[pcc@google.com: add comments, prohibit optimization for NUMA pages]
  Link: https://lkml.kernel.org/r/20210601185926.2623183-1-pcc@google.com
Link: https://lkml.kernel.org/r/20210527190453.1259020-1-pcc@google.com
Link: https://linux-review.googlesource.com/id/I98d75ef90e20330c578871c87494d64b1df3f1b8
Link: [1] https://source.android.com/devices/tech/debug/scudo
Link: [2] https://cs.android.com/android/platform/superproject/+/master:bionic/benchmarks/stdlib_benchmark.cpp;l=53;drc=e8693e78711e8f45ccd2b610e4dbe0b94d551cc9
Link: [3] https://github.com/pcc/llvm-project/commit/scudo-mprotect-secondary2
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Kostya Kortchinsky <kostyak@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mprotect.c |   52 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 46 insertions(+), 6 deletions(-)

Comments

Linus Torvalds June 29, 2021, 5:50 p.m. UTC | #1
On Mon, Jun 28, 2021 at 7:40 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
>
> -                       /* Avoid taking write faults for known dirty pages */
> -                       if (dirty_accountable && pte_dirty(ptent) &&
> -                                       (pte_soft_dirty(ptent) ||
> -                                        !(vma->vm_flags & VM_SOFTDIRTY))) {
> +                       if (may_avoid_write_fault(ptent, vma, cp_flags))
>                                 ptent = pte_mkwrite(ptent);
> -                       }

Hmm. I don't think this is correct.

As fat as I can tell, may_avoid_write_fault() doesn't even check if
the vma is writable!

Am I misreading it? Because I think you just made even a shared mmap
with "mprotect(PROT_READ)" turn the PTEs writable.

Which is a "slight" security issue.

Maybe the new code is fine, and I'm missing something. The old code
looks strange too, which makes me think that the MM_CP_DIRTY_ACCT test
ends up saving us and depends on VM_WRITE. But it's very much not
obvious.

And even if I _am_ missing something, I really would like a very
obvious and direct test for "this vma is writable", ie maybe a

        if (!(vma->vm_flags & VM_WRITE))
                return false;

at the very top of the function.

And no, "pte_dirty()" is not a reason to make something writable, it
might have started out as a writable mapping, and we dirtied the page,
and we made it read-only. The page stays dirty, but it shouldn't
become writable just because of that.

So please make me get the warm and fuzzies about this code. Because
as-is, it just looks scary.

              Linus
Peter Xu June 30, 2021, 12:12 a.m. UTC | #2
On Tue, Jun 29, 2021 at 10:50:12AM -0700, Linus Torvalds wrote:
> On Mon, Jun 28, 2021 at 7:40 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> >
> > -                       /* Avoid taking write faults for known dirty pages */
> > -                       if (dirty_accountable && pte_dirty(ptent) &&
> > -                                       (pte_soft_dirty(ptent) ||
> > -                                        !(vma->vm_flags & VM_SOFTDIRTY))) {
> > +                       if (may_avoid_write_fault(ptent, vma, cp_flags))
> >                                 ptent = pte_mkwrite(ptent);
> > -                       }
> 
> Hmm. I don't think this is correct.
> 
> As far as I can tell, may_avoid_write_fault() doesn't even check if
> the vma is writable!
> 
> Am I misreading it? Because I think you just made even a shared mmap
> with "mprotect(PROT_READ)" turn the pte's writable.
> 
> Which is a "slight" security issue.
> 
> Maybe the new code is fine, and I'm missing something. The old code
> looks strange too, which makes me think that the MM_CP_DIRTY_ACCT test
> ends up saving us and depend on VM_WRITE. But it's very much not
> obvious.

vma_wants_writenotify() first checks VM_WRITE|VM_SHARED; otherwise
MM_CP_DIRTY_ACCT will not be set.  Meanwhile, for anonymous vmas the newly
introduced may_avoid_write_fault() checks VM_WRITE explicitly.

Agreed that even though it's checked, it's not straightforward.  Maybe it would
be a bonus to have a comment above may_avoid_write_fault() about it in a
follow-up.

> 
> And even if I _am_ missing something, I really would like a very
> obvious and direct test for "this vma is writable", ie maybe a
> 
>         if (!(vma->vm_flags & VM_WRITE))
>                 return false;
> 
> at the very top of the function.

Yes, that looks okay too; I think using the MM_CP_DIRTY_ACCT flag has a slight
advantage in that it checks VM_WRITE only once before calling
change_protection(), rather than doing the check for every pte even when we
know it'll have the same result.  However, it indeed hides the facts deeper.

> 
> And no, "pte_dirty()" is not a reason to make something writable, it
> might have started out as a writable mapping, and we dirtied the page,
> and we made it read-only. The page stays dirty, but it shouldn't
> become writable just because of that.

I think the dirty bit checks are only to make sure we don't need those extra
write faults.  It should definitely be based on the fact that VM_WRITE is
already set.

Thanks,
Peter Xu June 30, 2021, 1:39 a.m. UTC | #3
On Tue, Jun 29, 2021 at 08:12:12PM -0400, Peter Xu wrote:
> On Tue, Jun 29, 2021 at 10:50:12AM -0700, Linus Torvalds wrote:
> > On Mon, Jun 28, 2021 at 7:40 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > >
> > > -                       /* Avoid taking write faults for known dirty pages */
> > > -                       if (dirty_accountable && pte_dirty(ptent) &&
> > > -                                       (pte_soft_dirty(ptent) ||
> > > -                                        !(vma->vm_flags & VM_SOFTDIRTY))) {
> > > +                       if (may_avoid_write_fault(ptent, vma, cp_flags))
> > >                                 ptent = pte_mkwrite(ptent);
> > > -                       }
> > 
> > Hmm. I don't think this is correct.
> > 
> > As far as I can tell, may_avoid_write_fault() doesn't even check if
> > the vma is writable!
> > 
> > Am I misreading it? Because I think you just made even a shared mmap
> > with "mprotect(PROT_READ)" turn the pte's writable.
> > 
> > Which is a "slight" security issue.
> > 
> > Maybe the new code is fine, and I'm missing something. The old code
> > looks strange too, which makes me think that the MM_CP_DIRTY_ACCT test
> > ends up saving us and depend on VM_WRITE. But it's very much not
> > obvious.
> 
> vma_wants_writenotify() checks first VM_WRITE|VM_SHARED, otherwise
> MM_CP_DIRTY_ACCT will not be set.  While for anonymous vmas the newly
> introduced may_avoid_write_fault() checks VM_WRITE explicitly.

Sorry, this statement is unclear.  It's not about anonymous or not; it's just
that a hidden check against VM_WRITE is there already.

Say, below chunk of the patch:

 	if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
 		/* Otherwise, we must have exclusive access to the page. */
 		if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
 			return false;
 
 		if (page_count(pte_page(pte)) != 1)
 			return false;
 	}

Should be the same as:

 	if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
 		if (!vma_is_anonymous(vma))
 			return false;

                if (!(vma->vm_flags & VM_WRITE))
                        return false;
 
 		if (page_count(pte_page(pte)) != 1)
 			return false;
 	}

And since MM_CP_DIRTY_ACCT implies "VM_WRITE|VM_SHARED" all set, above should
be a slightly faster version of below:

        /* This just never trigger if MM_CP_DIRTY_ACCT set */
        if (!(vma->vm_flags & VM_WRITE))
                return false;
 
 	if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
 		if (!vma_is_anonymous(vma))
 			return false;

 		if (page_count(pte_page(pte)) != 1)
 			return false;
 	}

It's just that we avoid checking "vma->vm_flags & VM_WRITE" when
MM_CP_DIRTY_ACCT set.

Again, I think in all cases some more comment should be good indeed..

> 
> Agreed even if it's checked it's not straightforward.  Maybe it'll be a bonus
> to have a comment above may_avoid_write_fault() about it in a follow up.
> 
> > 
> > And even if I _am_ missing something, I really would like a very
> > obvious and direct test for "this vma is writable", ie maybe a
> > 
> >         if (!(vma->vm_flags & VM_WRITE))
> >                 return false;
> > 
> > at the very top of the function.
> 
> Yes looks okay too; I think using MM_CP_DIRTY_ACCT flag has a slight advantage
> in that it checks VM_WRITE only once before calling change_protection(), rather
> than doing the check for every pte even if we know it'll have the same result.
> However it indeed hides the facts deeper..
> 
> > 
> > And no, "pte_dirty()" is not a reason to make something writable, it
> > might have started out as a writable mapping, and we dirtied the page,
> > and we made it read-only. The page stays dirty, but it shouldn't
> > become writable just because of that.
> 
> I think the dirty bit checks are only to make sure we don't need those extra
> write faults.  It should definitely be based on the fact that VM_WRITE being
> set already.
> 
> Thanks,
> 
> -- 
> Peter Xu
Linus Torvalds June 30, 2021, 2:25 a.m. UTC | #4
On Tue, Jun 29, 2021 at 6:39 PM Peter Xu <peterx@redhat.com> wrote:
>
> And since MM_CP_DIRTY_ACCT implies "VM_WRITE|VM_SHARED" all set, above should
> be a slightly faster version of below:

That's way too subtle, particularly since the MM_CP_DIRTY_ACCT logic
comes from another file entirely.

I don't think it's even faster, considering that presumably the
anonymous mapping case is the common one, and that's the one that
needs all the extra tests, it's likely better to _not_ test that very
subtle flag at all, and just doing the straightforward and obvious
tests that are understandable _locally_.

So I claim that it's

 (a) not an optimization at all

 (b) completely locally unintuitive and unreadable

> Again, I think in all cases some more comment should be good indeed..

I really want more than a comment. I want that MM_CP_DIRTY_ACCT bit
testing gone.

The only point where it makes sense to check MM_CP_DIRTY_ACCT is
within the context of "is the page already dirty".

So I think the logic should be something along the lines of

 - first:

         if (!(vma->vm_flags & VM_WRITE))
                return false;

   because that logic is set in stone, and true regardless of anything
else. If the vma isn't writable, we're not going to set the write bit.
End of story.

 - then, check the vma_is_anonymous() case:

        if (vma_is_anonymous(vma))
                return page_count(pte_page(pte)) == 1;

     because if it's a writable mapping, and anonymous, then we can
mark it writable if we're the exclusive owners of that page.

 - and THEN we can handle the "ok, shared mapping, now let's start
thinking about dirty accounting" cases.

Make it obvious and correct. This is not a sequence where you should
try to (incorrectly) optimize away individual instructions.

               Linus
Peter Xu June 30, 2021, 4:42 p.m. UTC | #5
On Tue, Jun 29, 2021 at 07:25:42PM -0700, Linus Torvalds wrote:
> On Tue, Jun 29, 2021 at 6:39 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > And since MM_CP_DIRTY_ACCT implies "VM_WRITE|VM_SHARED" all set, above should
> > be a slightly faster version of below:
> 
> That's way too subtle, particularly since the MM_CP_DIRTY_ACCT logic
> comes from another file entirely.
> 
> I don't think it's even faster, considering that presumably the
> anonymous mapping case is the common one, and that's the one that
> needs all the extra tests, it's likely better to _not_ test that very
> subtle flag at all, and just doing the straightforward and obvious
> tests that are understandable _locally_.
> 
> So I claim that it's
> 
>  (a) not an optimization at all
> 
>  (b) completely locally unintuitive and unreadable
> 
> > Again, I think in all cases some more comment should be good indeed..
> 
> I really want more than a comment. I want that MM_CP_DIRTY_ACCT bit
> testing gone.

My understanding is that MM_CP_DIRTY_ACCT contains all the check results from
vma_wants_writenotify(), so if we drop it we'd need something like that to be
checked within change_pte_range(), which is again slower (I have no idea how
slow checking vma->vm_flags & VM_WRITE is, but moving the whole
vma_wants_writenotify() here is definitely even slower).

> 
> The only point where it makes sense to check MM_CP_DIRTY_ACCT is
> within the context of "is the page already dirty".
> 
> So I think the logic should be something along the lines of
> 
>  - first:
> 
>          if (!(vma->vm_flags & VM_WRITE))
>                 return false;
> 
>    because that logic is set in stone, and true regardless of anything
> else. If the vma isn't writable, we're not going to set the write bit.
> End of story.
> 
>  - then, check the vma_is_anonumous() case:
> 
>         if (vma_is_anonymous(vma))
>                 return page_count(pte_page(pte)) == 1;
> 
>      because if it's a writable mapping, and anonymous, then we can
> mark it writable if we're the exclusive owners of that page.

Shouldn't we still at least checks [soft-]dirty bits and uffd-wp bits to make
sure it's either not dirty tracked or uffd wr-protected?  Say, IMHO it's
possible that soft-dirty tracking enabled on this anonymous vma range, then we
still depend on the write bit removed to set the soft-dirty later in the fault
handler.

> 
>  - and THEN we can handle the "ok, shared mapping, now let's start
> thinking about dirty accounting" cases.
> 
> Make it obvious and correct. This is not a sequence where you should
> try to (incorrectly) optimize away individual instructions.

Yes I still fully agree it's very un-obvious.  So far the best thing I can come
up with is something like below (patch attached too but not yet tested). I
moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the
checks so the final outcome looks like below:

static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
				  unsigned long cp_flags)
{
	/*
	 * It is unclear whether this optimization can be done safely for NUMA
	 * pages.
	 */
	if (cp_flags & MM_CP_PROT_NUMA)
		return false;

	/*
	 * Never apply write bit if VM_WRITE not set.  Note that this is
	 * actually checked for VM_SHARED when MM_CP_DIRTY_ACCT is set, so
	 * logically we only need to check it for !MM_CP_DIRTY_ACCT, but just
	 * make it even more obvious.
	 */
	if (!(vma->vm_flags & VM_WRITE))
		return false;

	/*
	 * Don't do this optimization for clean pages as we need to be notified
	 * of the transition from clean to dirty.
	 */
	if (!pte_dirty(pte))
		return false;

	/* Same for softdirty. */
	if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
		return false;

	/*
	 * For userfaultfd the user program needs to monitor write faults so we
	 * can't do this optimization.
	 */
	if (pte_uffd_wp(pte))
		return false;

	/*
	 * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
	 * regardless of the number of references.  Time to set the write bit.
	 */
	if (cp_flags & MM_CP_DIRTY_ACCT)
		return true;

	/*
	 * Otherwise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
	 * early if it's anonymous page and we exclusively own it.
	 */
	if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
		return true;

	/* Don't play any trick */
	return false;
}

The logic should be the same as before; it's just that we'll do an extra check
on VM_WRITE in the MM_CP_DIRTY_ACCT case, which I assume is ok.

Another side note is that I still think the VM_SOFTDIRTY check is wrong in
may_avoid_write_fault() and even in the old code (I mentioned it previously
when reviewing the patch), as !VM_SOFTDIRTY should mean soft dirty tracking
enabled while VM_SOFTDIRTY means disabled.  So I wonder whether it should be:

-       if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
+       if (!pte_soft_dirty(pte) && !(vma->vm_flags & VM_SOFTDIRTY))

However I didn't touch it up there as it may need more justification (I feel
it's okay in the old code, as vma_wants_writenotify() actually checks it too,
and in the right way; however after the anonymous fast path it seems prone to
error if it's anonymous; I'll check later).

Thanks,
Linus Torvalds June 30, 2021, 6:03 p.m. UTC | #6
On Wed, Jun 30, 2021 at 9:42 AM Peter Xu <peterx@redhat.com> wrote:
>
> Yes I still fully agree it's very un-obvious.  So far the best thing I can come
> up with is something like below (patch attached too but not yet tested). I
> moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the
> checks so the final outcome looks like below:
>
> static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
>                                   unsigned long cp_flags)
> {
>         /*
>          * It is unclear whether this optimization can be done safely for NUMA
>          * pages.
>          */
>         if (cp_flags & MM_CP_PROT_NUMA)
>                 return false;

Please just put that VM_WRITE test first. It's the one that really
*matters*. There's no "it's unclear if" about that part. Just handle
the obvious and important check first.

Yeah, yeah, they both return false, so order doesn't matter from a
semantic standpoint, but from a clarity standpoint just do the clear
and unambiguous and security-relevant test first.

The rest of the tests are implementation details, the VM_WRITE test is
fundamental behavior. It's the one that made me worry about this patch
in the first place.

>         /*
>          * Don't do this optimization for clean pages as we need to be notified
>          * of the transition from clean to dirty.
>          */
>         if (!pte_dirty(pte))
>                 return false;
>
>         /* Same for softdirty. */
>         if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
>                 return false;
>
>         /*
>          * For userfaultfd the user program needs to monitor write faults so we
>          * can't do this optimization.
>          */
>         if (pte_uffd_wp(pte))
>                 return false;

So all of these are a bit special.

Why? Because if I look at the actual page fault path, these are not
the tests there.

I'd really like to have some obvious situation where we keep this
"make it writable" in sync with what would actually happen on a write
fault when it's not writable.

And it's not at all obvious to me for these cases.

The do_wp_page() code doesn't even use pte_uffd_wp(). It uses
userfaultfd_pte_wp(vma, pte), and I don't even know why. Yes, I can
see the code (it additionally tests the VM_UFFD_WP flag in the vma),
but a number of other paths then only do that pte_uffd_wp() test.

I get the feeling that we really should try to match what the
do_wp_page() path does, though.

Which brings up another issue: the do_wp_page() path treats PageKsm()
pages differently. And it locks the page before double-checking the
page count.

Why does mprotect() not need to do the same thing? I think this has
come up before, and "change_protection()" can get called with the
mmap_sem held just for reading - see userfaultfd - so it has all the
same issues as a page fault does, afaik.

>         /*
>          * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
>          * regardless of the number of references.  Time to set the write bit.
>          */
>         if (cp_flags & MM_CP_DIRTY_ACCT)
>                 return true;
>
>         /*
>          * Otherwise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
>          * early if it's anonymous page and we exclusively own it.
>          */
>         if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
>                 return true;
>
>         /* Don't play any trick */
>         return false;
> }
>
> The logic should be the same as before, it's just that we'll do an extra check
> on VM_WRITE for MM_CP_DIRTY_ACCT but assuming it's ok.

See above. I don't think the logic before was all that clear either.

The one case that is clear is that if it's a shared mapping, and
MM_CP_DIRTY_ACCT is set, and it was already dirty (and softdirty),
then it's ok.

That's the old code.  I don't like how the old code was written
(because I think that MM_CP_DIRTY_ACCT bit was too subtle), but I
think the old code was at least correct.

The new code, it just worries me. It adds all those new cases for when
we can make the page writable early - that's the whole point of the
patch, after all - but my point here is that it's not at all obvious
that those new cases are actually correct.

MAYBE it's all correct. I'm not saying it's wrong. I'm just saying
it's not _obvious_ that it's correct.

What about that page_count() test, for example: it has a comment, it
looks obvious, but it's very different from what do_wp_page() does. So
what happens if we have a page-out at the same time that turns that
page into a swap cache page, and increments the page count? What about
that race? Do we end up with a writable page that is shared with a
swap cache entry? Is that ok? Why isn't it ok in the page fault case?

See why this patch worries me so much?

                       Linus
Peter Xu July 1, 2021, 1:27 a.m. UTC | #7
On Wed, Jun 30, 2021 at 11:03:25AM -0700, Linus Torvalds wrote:
> On Wed, Jun 30, 2021 at 9:42 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > Yes I still fully agree it's very un-obvious.  So far the best thing I can come
> > up with is something like below (patch attached too but not yet tested). I
> > moved VM_WRITE out so hopefully it'll be very clear; then I also rearranged the
> > checks so the final outcome looks like below:
> >
> > static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
> >                                   unsigned long cp_flags)
> > {
> >         /*
> >          * It is unclear whether this optimization can be done safely for NUMA
> >          * pages.
> >          */
> >         if (cp_flags & MM_CP_PROT_NUMA)
> >                 return false;
> 
> Please just put that VM_WRITE test first. It's the one that really
> *matters*. There's no "it's unclear if" about that part. Just handle
> the obvious and important check first.
> 
> Yeah, yeah, they both return false, so order doesn't matter from a
> semantic standpoint, but from a clarity standpoint just do the clear
> and unambiguous and security-relevant test first.
> 
> The rest of the tests are implementation details, the VM_WRITE test is
> fundamental behavior. It's the one that made me worry about this patch
> in the first place.

Sure.

> 
> >         /*
> >          * Don't do this optimization for clean pages as we need to be notified
> >          * of the transition from clean to dirty.
> >          */
> >         if (!pte_dirty(pte))
> >                 return false;
> >
> >         /* Same for softdirty. */
> >         if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
> >                 return false;
> >
> >         /*
> >          * For userfaultfd the user program needs to monitor write faults so we
> >          * can't do this optimization.
> >          */
> >         if (pte_uffd_wp(pte))
> >                 return false;
> 
> So all of these are a bit special.
> 
> Why? Because if I look at the actual page fault path, these are not
> the tests there.
> 
> I'd really like to have some obvious situation where we keep this
> "make it writable" in sync with what would actually happen on a write
> fault when it's not writable.
> 
> And it's not at all obvious to me for these cases.
> 
> The do_wp_page() code doesn't even use pte_uffd_wp(). It uses
> userfaultfd_pte_wp(vma, pte), and I don't even know why. Yes, I can
> see the code (it additionally tests the VM_UFFD_WP flag in the vma),
> but a number of other paths then only do that pte_uffd_wp() test.

The vma check is a safety net for cases where, e.g., the vma has already been
unregistered from uffd-wp while uffd-wp bits are left over.  Currently
UFFDIO_UNREGISTER is lazy about removing uffd-wp bits applied to ptes, so even
after a vma is unregistered there could still be ptes with pte_uffd_wp() true.
The vma check makes sure that when this happens the uffd-wp bit will be
auto-removed.

> 
> I get the feeling that we really should try to match what the
> do_wp_page() path does, though.

Makes sense.

> 
> Which brings up another issue: the do_wp_page() path treats PageKsm()
> pages differently. And it locks the page before double-checking the
> page count.
> 
> Why does mprotect() not need to do the same thing? I think this has
> come up before, and "change_protection()" can get called with the
> mmap_sem held just for reading - see userfaultfd - so it has all the
> same issues as a page fault does, afaik.

Good point.  I overlooked KSM when reviewing the patch, while I should really
have looked at do_wp_page() as you suggested (hmm.. the truth is I wasn't even
aware of this patch and never planned to review it, until it broke the
uffd-wp anonymous tests in its initial versions when I was testing mmots...).

Maybe something like this (to be squashed into the previously attached patch):

---8<---
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 3977bfd55f62..7aab30ac9c9f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -39,12 +39,8 @@
 static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
                                  unsigned long cp_flags)
 {
-       /*
-        * It is unclear whether this optimization can be done safely for NUMA
-        * pages.
-        */
-       if (cp_flags & MM_CP_PROT_NUMA)
-               return false;
+       struct page *page;
+       bool ret = false;

        /*
         * Never apply write bit if VM_WRITE not set.  Note that this is
@@ -55,6 +51,13 @@ static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
        if (!(vma->vm_flags & VM_WRITE))
                return false;

+       /*
+        * It is unclear whether this optimization can be done safely for NUMA
+        * pages.
+        */
+       if (cp_flags & MM_CP_PROT_NUMA)
+               return false;
+
        /*
         * Don't do this optimization for clean pages as we need to be notified
         * of the transition from clean to dirty.
@@ -80,15 +83,22 @@ static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
        if (cp_flags & MM_CP_DIRTY_ACCT)
                return true;

+       page = pte_page(pte);
+       /* Best effort to take page lock, don't play trick if failed */
+       if (!trylock_page(page))
+               return false;
+       /* KSM pages needs COW; leave them be */
+       if (PageKsm(page))
+               goto unlock_fail;
        /*
-        * Othewise it means !MM_CP_DIRTY_ACCT.  We can only apply write bit
-        * early if it's anonymous page and we exclusively own it.
+        * Othewise it means !MM_CP_DIRTY_ACCT and !KSM.  We can only apply
+        * write bit early if it's anonymous page and we exclusively own it.
         */
-       if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
-               return true;
-
-       /* Don't play any trick */
-       return false;
+       if (vma_is_anonymous(vma) && (page_count(page) == 1))
+               ret = true;
+unlock_fail:
+       unlock_page(page);
+       return ret;
 }

 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
---8<---

I hope I didn't overlook something else..

Today when I was looking at the ksm code, I got lost on why we say
"PageKsm() doesn't necessarily raise the page refcount", as in do_wp_page().  I
was looking at replace_page() where, afaict, we still do proper refcounting for
the stable nodes with "get_page(kpage)".

I know I must have missed something, but I can't quickly tell what.  In any
case, with the above PageKsm check I think it'll be safe, and it's definitely
clear that the page lock will stabilize PageKsm()'s return value, like in
do_wp_page().

> 
> >         /*
> >          * MM_CP_DIRTY_ACCT indicates that we can always make the page writable
> >          * regardless of the number of references.  Time to set the write bit.
> >          */
> >         if (cp_flags & MM_CP_DIRTY_ACCT)
> >                 return true;
> >
> >         /*
> >          * Otherwise it means !MM_CP_DIRTY_ACCT.  We can only apply the
> >          * write bit early if it's an anonymous page and we exclusively own it.
> >          */
> >         if (vma_is_anonymous(vma) && (page_count(pte_page(pte)) == 1))
> >                 return true;
> >
> >         /* Don't play any trick */
> >         return false;
> > }
> >
> > The logic should be the same as before, it's just that we'll do an extra check
> > on VM_WRITE for MM_CP_DIRTY_ACCT but assuming it's ok.
> 
> See above. I don't think the logic before was all that clear either.
> 
> The one case that is clear is that if it's a shared mapping, and
> MM_CP_DIRTY_ACCT is set, and it was already dirty (and softdirty),
> then it's ok.
> 
> That's the old code.  I don't like how the old code was written
> (because I think that the MM_CP_DIRTY_ACCT bit was too subtle), but I
> think the old code was at least correct.
> 
> The new code, it just worries me. It adds all those new cases for when
> we can make the page writable early - that's the whole point of the
> patch, after all - but my point here is that it's not at all obvious
> that those new cases are actually correct.

Yes, agreed, the MM_CP_DIRTY_ACCT bit is very subtle and not easy to grasp.
It's just that I don't have a good idea for making it better yet..

> 
> MAYBE it's all correct. I'm not saying it's wrong. I'm just saying
> it's not _obvious_ that it's correct.
> 
> What about that page_count() test, for example: it has a comment, it
> looks obvious, but it's very different from what do_wp_page() does. So
> what happens if we have a page-out at the same time that turns that
> page into a swap cache page, and increments the page count? What about
> that race? Do we end up with a writable page that is shared with a
> swap cache entry? Is that ok? Why isn't it ok in the page fault case?

That looks fine to me: when the race happens we must have checked page_count==1
first and granted the write bit, and add_to_swap_cache() happened after the
page_count==1 check (it takes another refcount, so the count would be >= 2
otherwise).  It also means the unmapping of the page can only happen after that
point.  If my above understanding is correct, our newly installed pte will be
zapped safely soon, but of course only after we release the pgtable lock in
change_pte_range().

Thanks,
Linus Torvalds July 1, 2021, 6:29 p.m. UTC | #8
On Wed, Jun 30, 2021 at 6:27 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jun 30, 2021 at 11:03:25AM -0700, Linus Torvalds wrote:
> >
> > What about that page_count() test, for example: it has a comment, it
> > looks obvious, but it's very different from what do_wp_page() does. So
> > what happens if we have a page-out at the same time that turns that
> > page into a swap cache page, and increments the page count? What about
> > that race? Do we end up with a writable page that is shared with a
> > swap cache entry? Is that ok? Why isn't it ok in the page fault case?
>
> That looks fine to me: when the race happens we must have checked page_count==1
> first and granted the write bit, and add_to_swap_cache() happened after the
> page_count==1 check (it takes another refcount, so the count would be >= 2
> otherwise).  It also means the unmapping of the page can only happen after
> that point.  If my above understanding is correct, our newly installed pte
> will be zapped safely soon, but of course after we release the pgtable lock
> in change_pte_range().

So if this is fine, then maybe we should just remove the page lock in
the do_wp_page() path (and remove the PageKsm() check while at it)?

If it's not required by mprotect() to say "I can make the page
writable directly", then it really shouldn't be required by the page
fault path either.

Which I'd love to do, and was really itching to do (it's a nasty
lock), but I worried about it..

I'd hate to have mprotect do one thing, and page faulting do another
thing, and not have some logic to why they have to be different.

                  Linus
Peter Xu July 6, 2021, 1:24 a.m. UTC | #9
(sorry for a very late reply)

On Thu, Jul 01, 2021 at 11:29:50AM -0700, Linus Torvalds wrote:
> On Wed, Jun 30, 2021 at 6:27 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Wed, Jun 30, 2021 at 11:03:25AM -0700, Linus Torvalds wrote:
> > >
> > > What about that page_count() test, for example: it has a comment, it
> > > looks obvious, but it's very different from what do_wp_page() does. So
> > > what happens if we have a page-out at the same time that turns that
> > > page into a swap cache page, and increments the page count? What about
> > > that race? Do we end up with a writable page that is shared with a
> > > swap cache entry? Is that ok? Why isn't it ok in the page fault case?
> >
> > That looks fine to me: when the race happens we must have checked page_count==1
> > first and granted the write bit, and add_to_swap_cache() happened after the
> > page_count==1 check (it takes another refcount, so the count would be >= 2
> > otherwise).  It also means the unmapping of the page can only happen after
> > that point.  If my above understanding is correct, our newly installed pte
> > will be zapped safely soon, but of course after we release the pgtable lock
> > in change_pte_range().
> 
> So if this is fine, then maybe we should just remove the page lock in
> the do_wp_page() path (and remove the PageKSM check while at it)?

I could be wrong, but I thought the page lock in do_wp_page() is more for the
PageKsm() race (e.g., to make sure we don't grant write to a page that is
becoming a ksm page in parallel).

> 
> If it's not required by mprotect() to say "I can make the page
> writable directly", then it really shouldn't be required by the page
> fault path either.
> 
> Which I'd love to do, and was really itching to do (it's a nasty
> lock), but I worried about it..
> 
> I'd hate to have mprotect do one thing, and page faulting do another
> thing, and not have some logic to why they have to be different.

Agreed; perhaps they need not be identical - I think the mprotect path can be
even stricter than the fault path, as it's a fast path only: it should never
apply the write bit when the page fault path wouldn't.  So I think the original
patch does need a justification for why it didn't handle KSM pages while
do_wp_page() does.

Thanks,

Patch

--- a/mm/mprotect.c~mm-improve-mprotectrw-efficiency-on-pages-referenced-once
+++ a/mm/mprotect.c
@@ -35,6 +35,51 @@ 
 
 #include "internal.h"
 
+/* Determine whether we can avoid taking write faults for known dirty pages. */
+static bool may_avoid_write_fault(pte_t pte, struct vm_area_struct *vma,
+				  unsigned long cp_flags)
+{
+	/*
+	 * The dirty accountable bit indicates that we can always make the page
+	 * writable regardless of the number of references.
+	 */
+	if (!(cp_flags & MM_CP_DIRTY_ACCT)) {
+		/* Otherwise, we must have exclusive access to the page. */
+		if (!(vma_is_anonymous(vma) && (vma->vm_flags & VM_WRITE)))
+			return false;
+
+		if (page_count(pte_page(pte)) != 1)
+			return false;
+	}
+
+	/*
+	 * Don't do this optimization for clean pages as we need to be notified
+	 * of the transition from clean to dirty.
+	 */
+	if (!pte_dirty(pte))
+		return false;
+
+	/* Same for softdirty. */
+	if (!pte_soft_dirty(pte) && (vma->vm_flags & VM_SOFTDIRTY))
+		return false;
+
+	/*
+	 * For userfaultfd the user program needs to monitor write faults so we
+	 * can't do this optimization.
+	 */
+	if (pte_uffd_wp(pte))
+		return false;
+
+	/*
+	 * It is unclear whether this optimization can be done safely for NUMA
+	 * pages.
+	 */
+	if (cp_flags & MM_CP_PROT_NUMA)
+		return false;
+
+	return true;
+}
+
 static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		unsigned long cp_flags)
@@ -43,7 +88,6 @@  static unsigned long change_pte_range(st
 	spinlock_t *ptl;
 	unsigned long pages = 0;
 	int target_node = NUMA_NO_NODE;
-	bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
 	bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
 	bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
@@ -131,12 +175,8 @@  static unsigned long change_pte_range(st
 				ptent = pte_clear_uffd_wp(ptent);
 			}
 
-			/* Avoid taking write faults for known dirty pages */
-			if (dirty_accountable && pte_dirty(ptent) &&
-					(pte_soft_dirty(ptent) ||
-					 !(vma->vm_flags & VM_SOFTDIRTY))) {
+			if (may_avoid_write_fault(ptent, vma, cp_flags))
 				ptent = pte_mkwrite(ptent);
-			}
 			ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
 			pages++;
 		} else if (is_swap_pte(oldpte)) {