[102/114] mm: streamline COW logic in do_swap_page()

Message ID	20220325011341.16096C340EE@smtp.kernel.org (mailing list archive)
State	New
Headers
Return-Path: <owner-linux-mm@kvack.org>
Date: Thu, 24 Mar 2022 18:13:40 -0700
To: zhangliang5@huawei.com,willy@infradead.org,vbabka@suse.cz,shy828301@gmail.com,shakeelb@google.com,rppt@linux.ibm.com,roman.gushchin@linux.dev,rientjes@google.com,riel@surriel.com,peterx@redhat.com,oleg@redhat.com,nadav.amit@gmail.com,mike.kravetz@oracle.com,mhocko@kernel.org,kirill.shutemov@linux.intel.com,jhubbard@nvidia.com,jgg@nvidia.com,jannh@google.com,jack@suse.cz,hughd@google.com,hch@lst.de,ddutile@redhat.com,aarcange@redhat.com,david@redhat.com,akpm@linux-foundation.org,patches@lists.linux.dev,linux-mm@kvack.org,mm-commits@vger.kernel.org,torvalds@linux-foundation.org,akpm@linux-foundation.org
From: Andrew Morton <akpm@linux-foundation.org>
In-Reply-To: <20220324180758.96b1ac7e17675d6bc474485e@linux-foundation.org>
Subject: [patch 102/114] mm: streamline COW logic in do_swap_page()
Message-Id: <20220325011341.16096C340EE@smtp.kernel.org>
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

Andrew Morton March 25, 2022, 1:13 a.m. UTC

From: David Hildenbrand <david@redhat.com>
Subject: mm: streamline COW logic in do_swap_page()

Currently, the COW logic differs depending on how the page is swapped in:
* a read fault swaps the page in first, and a write fault follows
  -> do_swap_page() + do_wp_page()
* a write fault swaps the page in
  -> do_swap_page() + do_wp_page() only if reuse fails in do_swap_page()

The COW logic in do_swap_page() differs from our reuse logic in
do_wp_page().  The COW logic in do_wp_page() -- page_count() == 1 --
currently makes sure that we don't have a remaining reference, e.g., via
GUP, on the target page we want to reuse: if there is any unexpected
reference, we have to copy to avoid information leaks.
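
The reuse rule can be illustrated with a small userspace model.  This is
only a sketch: struct model_page and the model_*() helpers are hypothetical
names invented for illustration, not kernel APIs, and the single counter
stands in for page_count().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count matters. */
struct model_page {
	int refcount;	/* models page_count() */
};

/* Models any additional reference being taken, e.g., a GUP pin. */
static inline void model_get_page(struct model_page *page)
{
	page->refcount++;
}

/* Models dropping a reference that was taken earlier. */
static inline void model_put_page(struct model_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/*
 * Models the rule described above: the faulting task may write in place
 * only if it holds the sole reference; otherwise it has to copy.
 */
static inline bool model_can_reuse(const struct model_page *page)
{
	return page->refcount == 1;
}
```

In this model, a page with one reference can be reused, while the same page
with an outstanding extra reference cannot until that reference is dropped.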

As do_swap_page() behaves differently, in environments with swap enabled
we can currently have an unintended information leak from the parent to
the child, similar to the one known from CVE-2020-29374:

	1. Parent writes to anonymous page
	-> Page is mapped writable and modified
	2. Page is swapped out
	-> Page is unmapped and replaced by swap entry
	3. fork()
	-> Swap entries are copied to child
	4. Child pins page R/O
	-> Page is mapped R/O into child
	5. Child unmaps page
	-> Child still holds GUP reference
	6. Parent writes to page
	-> Page is reused in do_swap_page()
	-> Child can observe changes

Swapping steps 2. and 3. should have the same effect.
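
The state at step 6 can be sketched in a userspace model.  Everything here
is an assumption made for illustration: the struct, the helper names and
the concrete counter values are hypothetical, the old check is approximated
as "mapcount plus swap count equals one", and the new check is reduced to
the refcount test.  It is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the page when the parent's write fault runs. */
struct model_state {
	int refcount;		/* models page_count() */
	int mapcount;		/* number of page table mappings */
	int swapcount;		/* references on the swap entry */
	bool in_swapcache;	/* the swapcache holds one reference */
};

/*
 * Assumed state at step 6: the child already unmapped the page, the
 * swapcache and the fault handler hold one reference each, and an
 * optional GUP pin from the child adds one more.
 */
static inline struct model_state model_parent_write_fault(bool child_holds_pin)
{
	struct model_state s = {
		.refcount = 2 + (child_holds_pin ? 1 : 0),
		.mapcount = 0,
		.swapcount = 1,
		.in_swapcache = true,
	};
	return s;
}

/* Approximation of the old check: it never looks at the refcount. */
static inline bool model_old_reuse(const struct model_state *s)
{
	return s->mapcount + s->swapcount == 1;
}

/*
 * Approximation of the new sequence: free the swap entry, drop the
 * swapcache reference if we are likely exclusive, then test the refcount.
 */
static inline bool model_new_reuse(struct model_state *s)
{
	s->swapcount--;
	if (s->in_swapcache && s->swapcount == 0 && s->refcount == 2) {
		s->in_swapcache = false;
		s->refcount--;
	}
	return s->refcount == 1;
}
```

With a pin outstanding, the old check in this model still allows reuse,
while the new sequence refuses it; without a pin, both allow reuse.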

Let's apply the same COW logic as in do_wp_page(), conditionally trying to
remove the page from the swapcache after freeing the swap entry, but
before actually mapping our page.  We can change the order now that we use
try_to_free_swap(), which doesn't care about the mapcount, instead of
reuse_swap_page().
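
Why the order matters can be sketched in a userspace model, assuming (as
the comment removed by this patch states) that try_to_free_swap() cannot
succeed before swap_free().  The struct and the model_*() helpers are
hypothetical and ignore details such as writeback.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page that sits in the swapcache. */
struct model_swap_page {
	int refcount;		/* models page_count() */
	int swapcount;		/* references on the swap entry */
	bool in_swapcache;	/* the swapcache holds one reference */
};

/* Models swap_free(): drop this task's reference on the swap entry. */
static inline void model_swap_free(struct model_swap_page *p)
{
	assert(p->swapcount > 0);
	p->swapcount--;
}

/*
 * Models try_to_free_swap(): it can only remove the page from the
 * swapcache, and drop the swapcache reference, once no swap entry
 * reference remains.
 */
static inline bool model_try_to_free_swap(struct model_swap_page *p)
{
	if (!p->in_swapcache || p->swapcount != 0)
		return false;
	p->in_swapcache = false;
	p->refcount--;
	return true;
}
```

In this model, calling model_try_to_free_swap() before model_swap_free()
leaves the refcount untouched, so a later refcount-based reuse check would
always see the swapcache reference.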

To handle references from the LRU pagevecs, conditionally drain the local
LRU pagevecs when required.  To keep it simple for now, don't consider
page_count() when deciding whether to drain.
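
The drain condition can be sketched in a userspace model.  The struct and
the model_*() helpers are hypothetical; the model assumes that a page
queued in a local pagevec holds one extra reference and is not yet marked
as being on the LRU, and that draining moves it to the LRU and drops that
reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page that may sit in a local LRU pagevec. */
struct model_lru_page {
	int refcount;		/* models page_count() */
	bool on_lru;		/* models PageLRU() */
	bool in_local_pagevec;	/* queued for the LRU with one extra reference */
};

/*
 * Mirrors the condition from the patch: drain only for write faults on a
 * non-KSM swapcache page that is not on the LRU yet.  As in the patch,
 * the refcount is deliberately not consulted.
 */
static inline bool model_should_drain(const struct model_lru_page *p,
				      bool write_fault, bool from_swapcache,
				      bool is_ksm)
{
	return write_fault && from_swapcache && !is_ksm && !p->on_lru;
}

/* Models lru_add_drain() for a single page. */
static inline void model_lru_add_drain(struct model_lru_page *p)
{
	if (!p->in_local_pagevec)
		return;
	p->in_local_pagevec = false;
	p->on_lru = true;
	p->refcount--;
}
```

After the drain, the extra pagevec reference is gone in this model, so it
no longer gets in the way of a later refcount-based reuse check.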

Link: https://lkml.kernel.org/r/20220131162940.210846-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Liang Zhang <zhangliang5@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |   55 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 43 insertions(+), 12 deletions(-)

--- a/mm/memory.c~mm-streamline-cow-logic-in-do_swap_page
+++ a/mm/memory.c
@@ -3489,6 +3489,25 @@  static vm_fault_t remove_device_exclusiv
 	return 0;
 }
 
+static inline bool should_try_to_free_swap(struct page *page,
+					   struct vm_area_struct *vma,
+					   unsigned int fault_flags)
+{
+	if (!PageSwapCache(page))
+		return false;
+	if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
+	    PageMlocked(page))
+		return true;
+	/*
+	 * If we want to map a page that's in the swapcache writable, we
+	 * have to detect via the refcount if we're really the exclusive
+	 * user. Try freeing the swapcache to get rid of the swapcache
+	 * reference only in case it's likely that we'll be the exclusive user.
+	 */
+	return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+		page_count(page) == 2;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3630,6 +3649,16 @@  vm_fault_t do_swap_page(struct vm_fault
 			page = swapcache;
 			goto out_page;
 		}
+
+		/*
+		 * If we want to map a page that's in the swapcache writable, we
+		 * have to detect via the refcount if we're really the exclusive
+		 * owner. Try removing the extra reference from the local LRU
+		 * pagevecs if required.
+		 */
+		if ((vmf->flags & FAULT_FLAG_WRITE) && page == swapcache &&
+		    !PageKsm(page) && !PageLRU(page))
+			lru_add_drain();
 	}
 
 	cgroup_throttle_swaprate(page, GFP_KERNEL);
@@ -3648,19 +3677,25 @@  vm_fault_t do_swap_page(struct vm_fault
 	}
 
 	/*
-	 * The page isn't present yet, go ahead with the fault.
-	 *
-	 * Be careful about the sequence of operations here.
-	 * To get its accounting right, reuse_swap_page() must be called
-	 * while the page is counted on swap but not yet in mapcount i.e.
-	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
-	 * must be called after the swap_free(), or it will never succeed.
+	 * Remove the swap entry and conditionally try to free up the swapcache.
+	 * We're already holding a reference on the page but haven't mapped it
+	 * yet.
 	 */
+	swap_free(entry);
+	if (should_try_to_free_swap(page, vma, vmf->flags))
+		try_to_free_swap(page);
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+
+	/*
+	 * Same logic as in do_wp_page(); however, optimize for fresh pages
+	 * that are certainly not shared because we just allocated them without
+	 * exposing them to the swapcache.
+	 */
+	if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+	    (page != swapcache || page_count(page) == 1)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
@@ -3686,10 +3721,6 @@  vm_fault_t do_swap_page(struct vm_fault
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 
-	swap_free(entry);
-	if (mem_cgroup_swap_full(page) ||
-	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
-		try_to_free_swap(page);
 	unlock_page(page);
 	if (page != swapcache && swapcache) {
 		/*
