[v3,4/9] mm: streamline COW logic in do_swap_page()

Message ID	20220131162940.210846-5-david@redhat.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: David Hildenbrand <david@redhat.com> To: linux-kernel@vger.kernel.org Cc: Andrew Morton <akpm@linux-foundation.org>, Hugh Dickins <hughd@google.com>, Linus Torvalds <torvalds@linux-foundation.org>, David Rientjes <rientjes@google.com>, Shakeel Butt <shakeelb@google.com>, John Hubbard <jhubbard@nvidia.com>, Jason Gunthorpe <jgg@nvidia.com>, Mike Kravetz <mike.kravetz@oracle.com>, Mike Rapoport <rppt@linux.ibm.com>, Yang Shi <shy828301@gmail.com>, "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>, Matthew Wilcox <willy@infradead.org>, Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>, Michal Hocko <mhocko@kernel.org>, Nadav Amit <namit@vmware.com>, Rik van Riel <riel@surriel.com>, Roman Gushchin <guro@fb.com>, Andrea Arcangeli <aarcange@redhat.com>, Peter Xu <peterx@redhat.com>, Donald Dutile <ddutile@redhat.com>, Christoph Hellwig <hch@lst.de>, Oleg Nesterov <oleg@redhat.com>, Jan Kara <jack@suse.cz>, Liang Zhang <zhangliang5@huawei.com>, linux-mm@kvack.org, David Hildenbrand <david@redhat.com> Subject: [PATCH v3 4/9] mm: streamline COW logic in do_swap_page() Date: Mon, 31 Jan 2022 17:29:34 +0100 Message-Id: <20220131162940.210846-5-david@redhat.com> In-Reply-To: <20220131162940.210846-1-david@redhat.com> References: <20220131162940.210846-1-david@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: COW fixes part 1: fix the COW security issue for THP and swap \| expand [v3,0/9] mm: COW fixes part 1: fix the COW security issue for THP and swap [v3,1/9] mm: optimize do_wp_page() for exclusive pages in the swapcache [v3,2/9] mm: optimize do_wp_page() for fresh pages in local LRU pagevecs [v3,3/9] mm: slightly clarify KSM logic in do_swap_page() [v3,4/9] mm: streamline COW logic in do_swap_page() [v3,5/9] mm/huge_memory: streamline COW logic in do_huge_pmd_wp_page() [v3,6/9] mm/khugepaged: remove reuse_swap_page() usage [v3,7/9] mm/swapfile: remove stale reuse_swap_page() [v3,8/9] mm/huge_memory: remove stale page_trans_huge_mapcount() [v3,9/9] mm/huge_memory: remove stale locking logic from __split_huge_pmd()

Message ID

20220131162940.210846-5-david@redhat.com (mailing list archive)

State

New

Headers

From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	David Rientjes <rientjes@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Yang Shi <shy828301@gmail.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Jann Horn <jannh@google.com>,
	Michal Hocko <mhocko@kernel.org>,
	Nadav Amit <namit@vmware.com>,
	Rik van Riel <riel@surriel.com>,
	Roman Gushchin <guro@fb.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Peter Xu <peterx@redhat.com>,
	Donald Dutile <ddutile@redhat.com>,
	Christoph Hellwig <hch@lst.de>,
	Oleg Nesterov <oleg@redhat.com>,
	Jan Kara <jack@suse.cz>,
	Liang Zhang <zhangliang5@huawei.com>,
	linux-mm@kvack.org,
	David Hildenbrand <david@redhat.com>
Subject: [PATCH v3 4/9] mm: streamline COW logic in do_swap_page()
Date: Mon, 31 Jan 2022 17:29:34 +0100
Message-Id: <20220131162940.210846-5-david@redhat.com>
In-Reply-To: <20220131162940.210846-1-david@redhat.com>
References: <20220131162940.210846-1-david@redhat.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

mm: COW fixes part 1: fix the COW security issue for THP and swap | expand

Commit Message

David Hildenbrand Jan. 31, 2022, 4:29 p.m. UTC

Currently we have a different COW logic when:
* triggering a read-fault to swapin first and then trigger a write-fault
  -> do_swap_page() + do_wp_page()
* triggering a write-fault to swapin
  -> do_swap_page() + do_wp_page() only if we fail reuse in do_swap_page()

The COW logic in do_swap_page() is different than our reuse logic in
do_wp_page(). The COW logic in do_wp_page() -- page_count() == 1 --  makes
currently sure that we certainly don't have a remaining reference, e.g.,
via GUP, on the target page we want to reuse: if there is any unexpected
reference, we have to copy to avoid information leaks.

As do_swap_page() behaves differently, in environments with swap enabled we
can currently have an unintended information leak from the parent to the
child, similar as known from CVE-2020-29374:

	1. Parent writes to anonymous page
	-> Page is mapped writable and modified
	2. Page is swapped out
	-> Page is unmapped and replaced by swap entry
	3. fork()
	-> Swap entries are copied to child
	4. Child pins page R/O
	-> Page is mapped R/O into child
	5. Child unmaps page
	-> Child still holds GUP reference
	6. Parent writes to page
	-> Page is reused in do_swap_page()
	-> Child can observe changes

Exchanging 2. and 3. should have the same effect.

Let's apply the same COW logic as in do_wp_page(), conditionally trying to
remove the page from the swapcache after freeing the swap entry, however,
before actually mapping our page. We can change the order now that
we use try_to_free_swap(), which doesn't care about the mapcount,
instead of reuse_swap_page().

To handle references from the LRU pagevecs, conditionally drain the local
LRU pagevecs when required, however, don't consider the page_count() when
deciding whether to drain to keep it simple for now.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 55 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 43 insertions(+), 12 deletions(-)

Comments

Vlastimil Babka March 10, 2022, 9:41 a.m. UTC | #1

On 1/31/22 17:29, David Hildenbrand wrote:
> Currently we have a different COW logic when:
> * triggering a read-fault to swapin first and then trigger a write-fault
>   -> do_swap_page() + do_wp_page()
> * triggering a write-fault to swapin
>   -> do_swap_page() + do_wp_page() only if we fail reuse in do_swap_page()
> 
> The COW logic in do_swap_page() is different than our reuse logic in
> do_wp_page(). The COW logic in do_wp_page() -- page_count() == 1 --  makes
> currently sure that we certainly don't have a remaining reference, e.g.,
> via GUP, on the target page we want to reuse: if there is any unexpected
> reference, we have to copy to avoid information leaks.
> 
> As do_swap_page() behaves differently, in environments with swap enabled we
> can currently have an unintended information leak from the parent to the
> child, similar as known from CVE-2020-29374:
> 
> 	1. Parent writes to anonymous page
> 	-> Page is mapped writable and modified
> 	2. Page is swapped out
> 	-> Page is unmapped and replaced by swap entry
> 	3. fork()
> 	-> Swap entries are copied to child
> 	4. Child pins page R/O
> 	-> Page is mapped R/O into child
> 	5. Child unmaps page
> 	-> Child still holds GUP reference
> 	6. Parent writes to page
> 	-> Page is reused in do_swap_page()
> 	-> Child can observe changes
> 
> Exchanging 2. and 3. should have the same effect.
> 
> Let's apply the same COW logic as in do_wp_page(), conditionally trying to
> remove the page from the swapcache after freeing the swap entry, however,
> before actually mapping our page. We can change the order now that
> we use try_to_free_swap(), which doesn't care about the mapcount,
> instead of reuse_swap_page().
> 
> To handle references from the LRU pagevecs, conditionally drain the local
> LRU pagevecs when required, however, don't consider the page_count() when
> deciding whether to drain to keep it simple for now.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

diff --git a/mm/memory.c b/mm/memory.c
index 3c91294cca98..c6177d897964 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3497,6 +3497,25 @@  static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 	return 0;
 }
 
+static inline bool should_try_to_free_swap(struct page *page,
+					   struct vm_area_struct *vma,
+					   unsigned int fault_flags)
+{
+	if (!PageSwapCache(page))
+		return false;
+	if (mem_cgroup_swap_full(page) || (vma->vm_flags & VM_LOCKED) ||
+	    PageMlocked(page))
+		return true;
+	/*
+	 * If we want to map a page that's in the swapcache writable, we
+	 * have to detect via the refcount if we're really the exclusive
+	 * user. Try freeing the swapcache to get rid of the swapcache
+	 * reference only in case it's likely that we'll be the exlusive user.
+	 */
+	return (fault_flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+		page_count(page) == 2;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3638,6 +3657,16 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 			page = swapcache;
 			goto out_page;
 		}
+
+		/*
+		 * If we want to map a page that's in the swapcache writable, we
+		 * have to detect via the refcount if we're really the exclusive
+		 * owner. Try removing the extra reference from the local LRU
+		 * pagevecs if required.
+		 */
+		if ((vmf->flags & FAULT_FLAG_WRITE) && page == swapcache &&
+		    !PageKsm(page) && !PageLRU(page))
+			lru_add_drain();
 	}
 
 	cgroup_throttle_swaprate(page, GFP_KERNEL);
@@ -3656,19 +3685,25 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	/*
-	 * The page isn't present yet, go ahead with the fault.
-	 *
-	 * Be careful about the sequence of operations here.
-	 * To get its accounting right, reuse_swap_page() must be called
-	 * while the page is counted on swap but not yet in mapcount i.e.
-	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
-	 * must be called after the swap_free(), or it will never succeed.
+	 * Remove the swap entry and conditionally try to free up the swapcache.
+	 * We're already holding a reference on the page but haven't mapped it
+	 * yet.
 	 */
+	swap_free(entry);
+	if (should_try_to_free_swap(page, vma, vmf->flags))
+		try_to_free_swap(page);
 
 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
-	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
+
+	/*
+	 * Same logic as in do_wp_page(); however, optimize for fresh pages
+	 * that are certainly not shared because we just allocated them without
+	 * exposing them to the swapcache.
+	 */
+	if ((vmf->flags & FAULT_FLAG_WRITE) && !PageKsm(page) &&
+	    (page != swapcache || page_count(page) == 1)) {
 		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
@@ -3694,10 +3729,6 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
 
-	swap_free(entry);
-	if (mem_cgroup_swap_full(page) ||
-	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
-		try_to_free_swap(page);
 	unlock_page(page);
 	if (page != swapcache && swapcache) {
 		/*

[v3,4/9] mm: streamline COW logic in do_swap_page()

Commit Message

Comments

Patch