From patchwork Tue Aug 11 18:39:50 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Peter Xu <peterx@redhat.com>
X-Patchwork-Id: 11709503
From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	peterx@redhat.com,
	Marty Mcfadden <mcfadden8@llnl.gov>,
	"Maya B . Gokhale" <gokhale2@llnl.gov>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Jann Horn <jannh@google.com>,
	Christoph Hellwig <hch@lst.de>,
	Oleg Nesterov <oleg@redhat.com>,
	Kirill Shutemov <kirill@shutemov.name>,
	Jan Kara <jack@suse.cz>
Subject: [PATCH v3] mm/gup: Allow real explicit breaking of COW
Date: Tue, 11 Aug 2020 14:39:50 -0400
Message-Id: <20200811183950.10603-1-peterx@redhat.com>
X-Mailer: git-send-email 2.26.2
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

Starting from commit 17839856fd58 ("gup: document and work around "COW can
break either way" issue", 2020-06-02), explicit copy-on-write behavior is
enforced for private gup pages even for read-only requests.  This is achieved
by always passing FOLL_WRITE to emulate a write.

That fixes the COW issue we were facing.  However, the above commit can also
break userfaultfd-wp and applications like umapsort [1,2].

A umap-like program typically works like this: a userspace library manages page
allocations and evicts the least recently used pages from memory to external
storage (e.g., a file system).  Below are the general steps to evict an
in-memory page in the uffd service thread when the page pool is full:

  (1) UFFDIO_WRITEPROTECT with mode=WP on some to-be-evicted page P, so that
      further writes to page P will block (keep page P clean)
  (2) Copy page P to external storage (e.g. file system)
  (3) MADV_DONTNEED to evict page P

Here step (1) makes sure that the page being dumped is always up to date, so
that the page snapshot in the file system is consistent with the one that was
in memory.  However, with commit 17839856fd58, step (2) can hang: if we use
write() on a file system fd to dump the page data, the file system driver
issues a read gup request to read the page content, and that read gup is then
turned into a write gup by the new enforced COW behavior.  This write gup
triggers handle_userfault() and hangs the uffd service thread itself.

I think the problem would also go away if the userspace library replaced the
write() to the file system with a memory write to an mmap()ed region, because
normal page faults do not enforce COW; only gup is affected.  However, we
cannot forbid users from using write() or any other form of kernel-level read
gup.
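
As a rough illustration of that userspace-side alternative, the sketch below
copies a page into the backing file through a MAP_SHARED mapping instead of
write().  Both function names are hypothetical, and the sketch assumes the
file already covers [off, off + len) and that off is page-aligned.  The source
page is read by a plain memcpy() in userspace, i.e. through normal loads
rather than through gup.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: store "len" bytes from "page" into the file at
 * offset "off" by writing through a shared mapping of the file.
 * Returns 0 on success, -1 on failure.
 */
int dump_page_via_mmap(int fd, off_t off, const void *page, size_t len)
{
	void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, off);

	if (dst == MAP_FAILED)
		return -1;

	memcpy(dst, page, len);			/* userspace read of the page */
	int ret = msync(dst, len, MS_SYNC);	/* push the data to the file */
	if (munmap(dst, len) == -1)
		ret = -1;
	return ret;
}

/* Usage example: round-trip one buffer through a temporary file. */
int dump_roundtrip_ok(void)
{
	char path[] = "/tmp/uffd-dump-XXXXXX";
	char page[4096], back[4096];
	int ok = 0;
	int fd = mkstemp(path);

	if (fd < 0)
		return 0;
	memset(page, 0x5a, sizeof(page));
	if (ftruncate(fd, sizeof(page)) == 0 &&
	    dump_page_via_mmap(fd, 0, page, sizeof(page)) == 0 &&
	    pread(fd, back, sizeof(back), 0) == (ssize_t)sizeof(back) &&
	    memcmp(page, back, sizeof(page)) == 0)
		ok = 1;
	close(fd);
	unlink(path);
	return ok;
}
```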

One solution is already mentioned in commit 17839856fd58: provide explicit
BREAK_COW semantics for enforced COW.  Then we can still use FAULT_FLAG_WRITE
to tell whether a fault is a "real write request" or an "enforced COW (read)
request".
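
The intended decision can be modelled in plain userspace C as below.  This is
only a simplified model of the checks this patch puts in handle_pte_fault()
and do_wp_page(), not kernel code: wp_fault_action() and enum wp_action are
made-up names, the PTE state is reduced to two booleans, and the value used
for FAULT_FLAG_WRITE is an assumption (FAULT_FLAG_BREAK_COW matches the value
added by this patch).

```c
#include <stdbool.h>

#define FAULT_FLAG_WRITE	0x01	/* assumed value, for the model only */
#define FAULT_FLAG_BREAK_COW	0x400	/* value added by this patch */

enum wp_action {
	WP_NONE,	/* no write-protect handling needed */
	WP_NOTIFY_UFFD,	/* real write to a uffd-wp page: notify userspace */
	WP_BREAK_COW,	/* resolve the COW (reuse or copy the page) */
};

/* Model: what should happen for a fault on a present PTE. */
enum wp_action wp_fault_action(unsigned int flags, bool pte_writable,
			       bool pte_uffd_wp)
{
	/* Plain reads never enter the write-protect path. */
	if (!(flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)))
		return WP_NONE;

	/* An already writable PTE has nothing left to break. */
	if (pte_writable)
		return WP_NONE;

	/* Only a real write is reported to userfaultfd-wp. */
	if ((flags & FAULT_FLAG_WRITE) && pte_uffd_wp)
		return WP_NOTIFY_UFFD;

	/* Enforced COW for read (or a write to a non-uffd-wp page). */
	return WP_BREAK_COW;
}
```

In this model a read gup on a uffd-wp protected page (FAULT_FLAG_BREAK_COW
only) breaks COW without notifying userspace, which is what keeps the uffd
service thread from blocking on itself.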

With the enforced COW, we also need to inherit the UFFD_WP bit during COW,
because COW can now happen with UFFD_WP enabled (previously it could not).
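
A small userspace model of that inheritance rule, mirroring the branch this
patch adds to wp_page_copy() and do_huge_pmd_wp_page(), could look like the
following.  struct model_pte and cow_new_entry() are hypothetical names used
only for this sketch; real PTEs are architecture-specific bit fields.

```c
#include <stdbool.h>

/* Toy stand-in for the few PTE bits that matter here. */
struct model_pte {
	bool write;
	bool dirty;
	bool uffd_wp;
};

/*
 * Model: build the entry installed for the new page after COW.
 * "vma_writable" stands in for what maybe_mkwrite() checks (VM_WRITE).
 */
struct model_pte cow_new_entry(struct model_pte orig, bool vma_writable)
{
	struct model_pte entry = { .dirty = true };

	if (orig.uffd_wp) {
		/* Keep uffd-wp across the copy and keep the write bit clear. */
		entry.uffd_wp = true;
		entry.write = false;
	} else {
		/* Previous behavior: writable if the VMA allows it. */
		entry.write = vma_writable;
	}
	return entry;
}
```

With this rule, a later real write to the copied page still faults and is
still reported through userfaultfd-wp.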

While at it, rename the variable in __handle_mm_fault() from "dirty" to "cow"
to better suit its functionality.

[1] https://github.com/LLNL/umap-apps/blob/develop/src/umapsort/umapsort.cpp
[2] https://github.com/LLNL/umap

CC: Marty Mcfadden <mcfadden8@llnl.gov>
CC: Maya B. Gokhale <gokhale2@llnl.gov>
CC: Andrea Arcangeli <aarcange@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Jann Horn <jannh@google.com>
CC: Christoph Hellwig <hch@lst.de>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Kirill Shutemov <kirill@shutemov.name>
CC: Jan Kara <jack@suse.cz>
Fixes: 17839856fd588f4ab6b789f482ed3ffd7c403e1f
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v3:
- inherit UFFD_WP bit for COW too
- take care of huge page cases
- more comments
v2:
- apply FAULT_FLAG_BREAK_COW correctly when FOLL_BREAK_COW [Christoph]
- removed comments above do_wp_page which seem redundant
---
 include/linux/mm.h |  3 +++
 mm/gup.c           |  6 ++++--
 mm/huge_memory.c   | 12 +++++++++++-
 mm/memory.c        | 39 +++++++++++++++++++++++++++++++--------
 4 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6a82f9bccd7..a1f5c92b44cb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -409,6 +409,7 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read).
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -439,6 +440,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE			0x80
 #define FAULT_FLAG_INSTRUCTION  		0x100
 #define FAULT_FLAG_INTERRUPTIBLE		0x200
+#define FAULT_FLAG_BREAK_COW			0x400
 
 /*
  * The default fault flags that should be used by most of the
@@ -2756,6 +2758,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_BREAK_COW  0x100000 /* request for explicit COW (even for read) */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index d8a33dd1430d..c33e84ab9c36 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -870,6 +870,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		return -ENOENT;
 	if (*flags & FOLL_WRITE)
 		fault_flags |= FAULT_FLAG_WRITE;
+	if (*flags & FOLL_BREAK_COW)
+		fault_flags |= FAULT_FLAG_BREAK_COW;
 	if (*flags & FOLL_REMOTE)
 		fault_flags |= FAULT_FLAG_REMOTE;
 	if (locked)
@@ -1076,7 +1078,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			}
 			if (is_vm_hugetlb_page(vma)) {
 				if (should_force_cow_break(vma, foll_flags))
-					foll_flags |= FOLL_WRITE;
+					foll_flags |= FOLL_BREAK_COW;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
 						foll_flags, locked);
@@ -1095,7 +1097,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		}
 
 		if (should_force_cow_break(vma, foll_flags))
-			foll_flags |= FOLL_WRITE;
+			foll_flags |= FOLL_BREAK_COW;
 
 retry:
 		/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 206f52b36ffb..c88f773d03af 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1296,7 +1296,17 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	if (reuse_swap_page(page, NULL)) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
-		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkdirty(entry);
+		if (pmd_uffd_wp(orig_pmd))
+			/*
+			 * This can happen when an uffd-wp protected page is
+			 * copied due to enforced COW.  When it happens, we
+			 * need to keep the uffd-wp bit even after COW, and
+			 * make sure write bit is kept cleared.
+			 */
+			entry = pmd_mkuffd_wp(pmd_wrprotect(entry));
+		else
+			entry = maybe_pmd_mkwrite(entry, vma);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
 			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		unlock_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index c39a13b09602..b27b555a9df8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2706,7 +2706,17 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = pte_sw_mkyoung(entry);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		entry = pte_mkdirty(entry);
+		if (pte_uffd_wp(vmf->orig_pte))
+			/*
+			 * This can happen when an uffd-wp protected page is
+			 * copied due to enforced COW.  When it happens, we
+			 * need to keep the uffd-wp bit even after COW, and
+			 * make sure write bit is kept cleared.
+			 */
+			entry = pte_mkuffd_wp(pte_wrprotect(entry));
+		else
+			entry = maybe_mkwrite(entry, vma);
 		/*
 		 * Clear the pte entry and flush it first, before updating the
 		 * pte with the new entry. This will avoid a race condition
@@ -2900,7 +2910,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+	/*
+	 * Userfaultfd-wp only cares about real writes.  E.g., enforced COW for
+	 * read does not count.  When that happens, we will do the COW with the
+	 * UFFD_WP bit inherited from the original PTE/PMD.
+	 */
+	if ((vmf->flags & FAULT_FLAG_WRITE) &&
+	    userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
@@ -3290,7 +3306,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		put_page(swapcache);
 	}
 
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		ret |= do_wp_page(vmf);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
@@ -4117,7 +4133,14 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	if (vma_is_anonymous(vmf->vma)) {
-		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
+		/*
+		 * Userfaultfd-wp only cares about real writes.  E.g., enforced
+		 * COW for read does not count.  When that happens, we will do
+		 * the COW with the UFFD_WP bit inherited from the original
+		 * PTE/PMD.
+		 */
+		if ((vmf->flags & FAULT_FLAG_WRITE) &&
+		    userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf, orig_pmd);
 	}
@@ -4241,7 +4264,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
 		goto unlock;
 	}
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
 		entry = pte_mkdirty(entry);
@@ -4281,7 +4304,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.pgoff = linear_page_index(vma, address),
 		.gfp_mask = __get_fault_gfp_mask(vma),
 	};
-	unsigned int dirty = flags & FAULT_FLAG_WRITE;
+	bool cow = flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW);
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4308,7 +4331,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
 			/* NUMA case for anonymous PUDs would go here */
 
-			if (dirty && !pud_write(orig_pud)) {
+			if (cow && !pud_write(orig_pud)) {
 				ret = wp_huge_pud(&vmf, orig_pud);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
@@ -4346,7 +4369,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);
 
-			if (dirty && !pmd_write(orig_pmd)) {
+			if (cow && !pmd_write(orig_pmd)) {
 				ret = wp_huge_pmd(&vmf, orig_pmd);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;