From patchwork Tue Oct 12 01:52:44 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Suren Baghdasaryan <surenb@google.com>
X-Patchwork-Id: 12551277
Return-Path: <SRS0=+gpB=PA=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 6199EC433EF
	for <linux-mm@archiver.kernel.org>; Tue, 12 Oct 2021 01:52:53 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id D72B760E8B
	for <linux-mm@archiver.kernel.org>; Tue, 12 Oct 2021 01:52:52 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org D72B760E8B
Authentication-Results: mail.kernel.org;
 dmarc=fail (p=reject dis=none) header.from=google.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org
Received: by kanga.kvack.org (Postfix)
	id 789F680007; Mon, 11 Oct 2021 21:52:52 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 711096B0071; Mon, 11 Oct 2021 21:52:52 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 5B26980007; Mon, 11 Oct 2021 21:52:52 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0065.hostedemail.com
 [216.40.44.65])
	by kanga.kvack.org (Postfix) with ESMTP id 4AC986B006C
	for <linux-mm@kvack.org>; Mon, 11 Oct 2021 21:52:52 -0400 (EDT)
Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com
 [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with ESMTP id 114CA2BFAF
	for <linux-mm@kvack.org>; Tue, 12 Oct 2021 01:52:52 +0000 (UTC)
X-FDA: 78686111784.08.6EDEB4C
Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com
 [209.85.219.201])
	by imf02.hostedemail.com (Postfix) with ESMTP id ACBF670021D5
	for <linux-mm@kvack.org>; Tue, 12 Oct 2021 01:52:51 +0000 (UTC)
Received: by mail-yb1-f201.google.com with SMTP id
 j193-20020a2523ca000000b005b789d71d9aso25367462ybj.21
        for <linux-mm@kvack.org>; Mon, 11 Oct 2021 18:52:51 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20210112;
        h=date:message-id:mime-version:subject:from:to:cc;
        bh=7RUSnk/hJiiQ5rf2bhoZtU/BOChr6CmVWTljFBSgn8o=;
        b=PmyypnvTXhEr1cVGPmkRt/J1+okx3AelmToWilrYE8y6IQrNtMpwzk0/bv8LYdfNHT
         Hy2RxKN+MAAum6p+D1sG9Kd30owoxh1hNvMw2O+McR+j+drg60lWKySbI3mbApOoms30
         9ELba9jT0E1cJjWxaW/S+J0BP/nQ39AVRM53bXUXVF0ahNyPMymaE1z2KsirgtFo1esO
         xytEY22zi/i1BwE1t22LLX8+FsV96xu0Ay9hSuXuiKxOqehrxsrFce2O4zMCu99qJqiS
         1VyDNKitdZ79L4LJC2mgBbNKo8qCJbqIyibY4cjYGJgp3knTYgoPWzvOon+KeZqggKSn
         +rUw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:date:message-id:mime-version:subject:from:to:cc;
        bh=7RUSnk/hJiiQ5rf2bhoZtU/BOChr6CmVWTljFBSgn8o=;
        b=4rHgfNIMeJn2s6h+NFPEQoBPTL+PUFVpZLn719QqNoCXJ9HT7RUTbG7OXBEo4vbL87
         oZdpo/Q/0lvEBl/gMcOVLe9eQEeSC5pMbkQbfke4wE5PGwNGyHH/ncFK4Qfh0xPsOJpc
         IUDMaaQ49HmraGvKKtP4hCVfHlvDfG4frrAJHQZip7B8i6LKJ9Wm964rvW7bFXndXruy
         gXDF/6HWi0fAKvaI74CdblY60dpcGIVLry4GP9DW7XCpmTsyVJMOR5CZcGvJvX8aNUNa
         Bofb5E/uw98BLwYOHA+K1eldboEY9bHk3H6+tR4ZTiKSenvKT7b0Gx0APtdENqLTjjlp
         B+Mw==
X-Gm-Message-State: AOAM532RwwfMHHRN7b3EhpBvJt627yVsAJYTr2oh6sQEXXoOst0xUjqL
	vOUuUi/ulrvOAAmn59UkgwPsNb2ynBU=
X-Google-Smtp-Source: 
 ABdhPJzqwfWQi0UmMSYcnYwr+lI98nlHmNKcjU5gG7t8EmKrABsSADbfl/0EBD5rHkmogz1isnIMLVKrwBE=
X-Received: from surenb-desktop.mtv.corp.google.com
 ([2620:15c:211:200:8f50:78d2:a82f:433])
 (user=surenb job=sendgmr) by 2002:a25:afcd:: with SMTP id
 d13mr25459798ybj.504.1634003570905;
 Mon, 11 Oct 2021 18:52:50 -0700 (PDT)
Date: Mon, 11 Oct 2021 18:52:44 -0700
Message-Id: <20211012015244.693594-1-surenb@google.com>
Mime-Version: 1.0
X-Mailer: git-send-email 2.33.0.882.g93a45727a2-goog
Subject: [PATCH 1/1] gup: document and work around "COW can break either way"
 issue
From: Suren Baghdasaryan <surenb@google.com>
To: stable@vger.kernel.org
Cc: gregkh@linuxfoundation.org, jannh@google.com,
	torvalds@linux-foundation.org, vbabka@suse.cz, peterx@redhat.com,
	aarcange@redhat.com, david@redhat.com, jgg@ziepe.ca, ktkhai@virtuozzo.com,
	shli@fb.com, namit@vmware.com, hch@lst.de, oleg@redhat.com,
	kirill@shutemov.name, jack@suse.cz, willy@infradead.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, kernel-team@android.com, surenb@google.com
Authentication-Results: imf02.hostedemail.com;
	dkim=pass header.d=google.com header.s=20210112 header.b=PmyypnvT;
	spf=pass (imf02.hostedemail.com: domain of
 3cupkYQYKCOsfheRaOTbbTYR.PbZYVahk-ZZXiNPX.beT@flex--surenb.bounces.google.com
 designates 209.85.219.201 as permitted sender)
 smtp.mailfrom=3cupkYQYKCOsfheRaOTbbTYR.PbZYVahk-ZZXiNPX.beT@flex--surenb.bounces.google.com;
	dmarc=pass (policy=reject) header.from=google.com
X-Rspamd-Server: rspam04
X-Rspamd-Queue-Id: ACBF670021D5
X-Stat-Signature: 7sb3edubaqwha8desyps7ueumgb89s68
X-HE-Tag: 1634003571-267915
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

From: Linus Torvalds <torvalds@linux-foundation.org>

From: Linus Torvalds <torvalds@linux-foundation.org>

commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.

Doing a "get_user_pages()" on a copy-on-write page for reading can be
ambiguous: the page can be COW'ed at any time afterwards, and the
direction of a COW event isn't defined.

Yes, whoever writes to it will generally do the COW, but if the thread
that did the get_user_pages() unmapped the page before the write (and
that could happen due to memory pressure in addition to any outright
action), the writer could also just take over the old page instead.

End result: the get_user_pages() call might result in a page pointer
that is no longer associated with the original VM, and is associated
with - and controlled by - another VM having taken it over instead.

So when doing a get_user_pages() on a COW mapping, the only really safe
thing to do would be to break the COW when getting the page, even when
only getting it for reading.

At the same time, some users simply don't even care.

For example, the perf code wants to look up the page not because it
cares about the page, but because the code simply wants to look up the
physical address of the access for informational purposes, and doesn't
really care about races when a page might be unmapped and remapped
elsewhere.

This adds logic to force a COW event by setting FOLL_WRITE on any
copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
pointer as a result.

The current semantics end up being:

 - __get_user_pages_fast(): no change. If you don't ask for a write,
   you won't break COW. You'd better know what you're doing.

 - get_user_pages_fast(): the fast-case "look it up in the page tables
   without anything getting mmap_sem" now refuses to follow a read-only
   page, since it might need COW breaking.  Which happens in the slow
   path - the fast path doesn't know if the memory might be COW or not.

 - get_user_pages() (including the slow-path fallback for gup_fast()):
   for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
   very similar semantics to FOLL_FORCE.

If it turns out that we want finer granularity (ie "only break COW when
it might actually matter" - things like the zero page are special and
don't need to be broken) we might need to push these semantics deeper
into the lookup fault path.  So if people care enough, it's possible
that we might end up adding a new internal FOLL_BREAK_COW flag to go
with the internal FOLL_COW flag we already have for tracking "I had a
COW".

Alternatively, if it turns out that different callers might want to
explicitly control the forced COW break behavior, we might even want to
make such a flag visible to the users of get_user_pages() instead of
using the above default semantics.

But for now, this is mostly commentary on the issue (this commit message
being a lot bigger than the patch, and that patch in turn is almost all
comments), with that minimal "enable COW breaking early" logic using the
existing FOLL_WRITE behavior.

[ It might be worth noting that we've always had this ambiguity, and it
  could arguably be seen as a user-space issue.

  You only get private COW mappings that could break either way in
  situations where user space is doing cooperative things (ie fork()
  before an execve() etc), but it _is_ surprising and very subtle, and
  fork() is supposed to give you independent address spaces.

  So let's treat this as a kernel issue and make the semantics of
  get_user_pages() easier to understand. Note that obviously a true
  shared mapping will still get a page that can change under us, so this
  does _not_ mean that get_user_pages() somehow returns any "stable"
  page ]

[surenb: backport notes
        Since gup_pgd_range does not exist, made appropriate changes on
        the the gup_huge_pgd, gup_huge_pd and gup_pud_range calls instead.
	Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_huge_pgd,
        gup_huge_pd and gup_pud_range.
	Removed FOLL_PIN usage in should_force_cow_break since it's missing in
	the earlier kernels.]

Reported-by: Jann Horn <jannh@google.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill Shutemov <kirill@shutemov.name>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[surenb: backport to 4.4 kernel]
Cc: stable@vger.kernel.org # 4.4.x
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 mm/gup.c         | 48 ++++++++++++++++++++++++++++++++++++++++--------
 mm/huge_memory.c |  7 +++----
 2 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 4c5857889e9d..c80cdc408228 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -59,13 +59,22 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pte's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pte's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
 {
-	return pte_write(pte) ||
-		((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
+	return pte_write(pte) || ((flags & FOLL_COW) && pte_dirty(pte));
+}
+
+/*
+ * A (separate) COW fault might break the page the other way and
+ * get_user_pages() would return the page from what is now the wrong
+ * VM. So we need to force a COW break at GUP time even for reads.
+ */
+static inline bool should_force_cow_break(struct vm_area_struct *vma, unsigned int flags)
+{
+	return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
 }
 
 static struct page *follow_page_pte(struct vm_area_struct *vma,
@@ -509,12 +518,18 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			if (!vma || check_vma_flags(vma, gup_flags))
 				return i ? : -EFAULT;
 			if (is_vm_hugetlb_page(vma)) {
+				if (should_force_cow_break(vma, foll_flags))
+					foll_flags |= FOLL_WRITE;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
-						gup_flags);
+						foll_flags);
 				continue;
 			}
 		}
+
+		if (should_force_cow_break(vma, foll_flags))
+			foll_flags |= FOLL_WRITE;
+
 retry:
 		/*
 		 * If we have a pending SIGKILL, don't keep faulting pages and
@@ -1346,6 +1361,10 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
 /*
  * Like get_user_pages_fast() except it's IRQ-safe in that it won't fall back to
  * the regular GUP. It will only return non-negative values.
+ *
+ * Careful, careful! COW breaking can go either way, so a non-write
+ * access can get ambiguous page results. If you call this function without
+ * 'write' set, you'd better be sure that you're ok with that ambiguity.
  */
 int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			  struct page **pages)
@@ -1375,6 +1394,12 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 	 *
 	 * We do not adopt an rcu_read_lock(.) here as we also want to
 	 * block IPIs that come from THPs splitting.
+	 *
+	 * NOTE! We allow read-only gup_fast() here, but you'd better be
+	 * careful about possible COW pages. You'll get _a_ COW page, but
+	 * not necessarily the one you intended to get depending on what
+	 * COW event happens after this. COW may break the page copy in a
+	 * random direction.
 	 */
 
 	local_irq_save(flags);
@@ -1385,15 +1410,22 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
 		next = pgd_addr_end(addr, end);
 		if (pgd_none(pgd))
 			break;
+		/*
+		 * The FAST_GUP case requires FOLL_WRITE even for pure reads,
+		 * because get_user_pages() may need to cause an early COW in
+		 * order to avoid confusing the normal COW routines. So only
+		 * targets that are already writable are safe to do by just
+		 * looking at the page tables.
+		 */
 		if (unlikely(pgd_huge(pgd))) {
-			if (!gup_huge_pgd(pgd, pgdp, addr, next, write,
+			if (!gup_huge_pgd(pgd, pgdp, addr, next, 1,
 					  pages, &nr))
 				break;
 		} else if (unlikely(is_hugepd(__hugepd(pgd_val(pgd))))) {
 			if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
-					 PGDIR_SHIFT, next, write, pages, &nr))
+					 PGDIR_SHIFT, next, 1, pages, &nr))
 				break;
-		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+		} else if (!gup_pud_range(pgd, addr, next, 1, pages, &nr))
 			break;
 	} while (pgdp++, addr = next, addr != end);
 	local_irq_restore(flags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6404e4fcb4ed..fae45c56e2ee 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1268,13 +1268,12 @@ out_unlock:
 }
 
 /*
- * FOLL_FORCE can write to even unwritable pmd's, but only
- * after we've gone through a COW cycle and they are dirty.
+ * FOLL_FORCE or a forced COW break can write even to unwritable pmd's,
+ * but only after we've gone through a COW cycle and they are dirty.
  */
 static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
 {
-	return pmd_write(pmd) ||
-	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+	return pmd_write(pmd) || ((flags & FOLL_COW) && pmd_dirty(pmd));
 }
 
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,