From patchwork Fri Dec 17 11:30:39 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 12684333 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5C792C433FE for ; Fri, 17 Dec 2021 11:32:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235415AbhLQLc4 (ORCPT ); Fri, 17 Dec 2021 06:32:56 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:43591 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232164AbhLQLc4 (ORCPT ); Fri, 17 Dec 2021 06:32:56 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639740775; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=IS+IADqtvjtFBPaIHjVHiY8W/A2A4twki8gtFJ6uWRE=; b=JBkJkSgAX+DODh0Uedw0PUlDo1VNrZBS7t5nhsbgp99lLWvO1TWDmcihugeHPDXIS32J82 ClBGeDaInKKiv652xv00g0Z8cHLb4RJO7pIQzosMU42NR4L25OgpcrFT7BkzdIGLc7aDUK +79ovj6y9hFlyQxff7fPgkQZV+03B18= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-624-ZvV_edXtN52r3-4io6P9Bg-1; Fri, 17 Dec 2021 06:32:52 -0500 X-MC-Unique: ZvV_edXtN52r3-4io6P9Bg-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id A280C185302A; Fri, 17 Dec 2021 11:32:48 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.193.204]) by smtp.corp.redhat.com (Postfix) with ESMTP id A4A518CB3E; Fri, 17 Dec 2021 11:32:02 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, David Hildenbrand , Peter Zijlstra , Ingo Molnar , Will Deacon , Waiman Long , Boqun Feng , Jonathan Corbet Subject: [PATCH v1 01/11] seqlock: provide lockdep-free raw_seqcount_t variant Date: Fri, 17 Dec 2021 12:30:39 +0100 Message-Id: <20211217113049.23850-2-david@redhat.com> In-Reply-To: <20211217113049.23850-1-david@redhat.com> References: <20211217113049.23850-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Sometimes it is required to have a seqcount implementation that uses a structure with a fixed and minimal size -- just a bare unsigned int -- independent of the kernel configuration. This is especially valuable, when the raw_ variants of the seqlock function will be used and the additional lockdep part of the seqcount_t structure remains essentially unused. 
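A typical user would look roughly like the following sketch (a hypothetical example for illustration only, not part of this patch): the counter is embedded in a space-constrained structure, writers are serialized by an external lock, and only the raw_ function variants are used, so the lockdep state of a full seqcount_t would stay unused. With CONFIG_DEBUG_LOCK_ALLOC, such a structure would otherwise grow by a lockdep_map per instance.

  #include <linux/seqlock.h>
  #include <linux/spinlock.h>

  /* Hypothetical example: a bare sequence counter in a small structure. */
  struct foo {
          raw_seqcount_t seq;     /* always just an unsigned int */
          unsigned long a, b;
          spinlock_t lock;        /* external writer serialization */
  };

  static void foo_init(struct foo *f)
  {
          raw_seqcount_init(&f->seq);
          spin_lock_init(&f->lock);
          f->a = f->b = 0;
  }

  static void foo_write(struct foo *f, unsigned long a, unsigned long b)
  {
          /* spin_lock() also disables preemption on !PREEMPT_RT */
          spin_lock(&f->lock);
          raw_write_seqcount_begin(&f->seq);
          f->a = a;
          f->b = b;
          raw_write_seqcount_end(&f->seq);
          spin_unlock(&f->lock);
  }

  static void foo_read(struct foo *f, unsigned long *a, unsigned long *b)
  {
          unsigned int seq;

          do {
                  seq = raw_read_seqcount_begin(&f->seq);
                  *a = f->a;
                  *b = f->b;
          } while (raw_read_seqcount_retry(&f->seq, seq));
  }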
Let's provide a lockdep-free raw_seqcount_t variant that can be used via the raw functions to have a basic seqlock. The target use case is embedding a raw_seqcount_t in the "struct page", where we really want a minimal size and cannot tolerate a sudden grow of the seqcount_t structure resulting in a significant "struct page" increase or even a layout change. Provide raw_read_seqcount_retry(), to make it easy to match to raw_read_seqcount_begin() in the code. Let's add a short documentation as well. Note: There might be other possible users for raw_seqcount_t where the lockdep part might be completely unused and just wastes memory -- essentially any users that only use the raw_ function variants. Cc: Peter Zijlstra Cc: Ingo Molnar Cc: Will Deacon Cc: Waiman Long Cc: Boqun Feng Cc: Jonathan Corbet Acked-by: Peter Xu Signed-off-by: David Hildenbrand --- Documentation/locking/seqlock.rst | 50 +++++++++++ include/linux/seqlock.h | 145 +++++++++++++++++++++++------- 2 files changed, 162 insertions(+), 33 deletions(-) diff --git a/Documentation/locking/seqlock.rst b/Documentation/locking/seqlock.rst index 64405e5da63e..6f66ae29cc07 100644 --- a/Documentation/locking/seqlock.rst +++ b/Documentation/locking/seqlock.rst @@ -87,6 +87,56 @@ Read path:: } while (read_seqcount_retry(&foo_seqcount, seq)); +Raw sequence counters (``raw_seqcount_t``) +========================================== + +This is the raw counting mechanism, which does not protect against multiple +writers and does not perform any lockdep tracking. Write side critical sections +must thus be serialized by an external lock. + +It is primary useful when a fixed, minimal sequence counter size is +required and the lockdep overhead cannot be tolerated or is unused. +Prefer using a :ref:`seqcount_t`, a :ref:`seqlock_t` or a +:ref:`seqcount_locktype_t` if possible. + +The raw sequence counter is very similar to the :ref:`seqcount_t`, however, +it can only be used with functions that don't perform any implicit lockdep +tracking: primarily the *raw* function variants. + +Initialization:: + + /* dynamic */ + raw_seqcount_t foo_seqcount; + raw_seqcount_init(&foo_seqcount); + + /* static */ + static raw_seqcount_t foo_seqcount = RAW_SEQCNT_ZERO(foo_seqcount); + + /* C99 struct init */ + struct { + .seq = RAW_SEQCNT_ZERO(foo.seq), + } foo; + +Write path:: + + /* Serialized context with disabled preemption */ + + raw_write_seqcount_begin(&foo_seqcount); + + /* ... [[write-side critical section]] ... */ + + raw_write_seqcount_end(&foo_seqcount); + +Read path:: + + do { + seq = raw_read_seqcount_begin(&foo_seqcount); + + /* ... [[read-side critical section]] ... */ + + } while (raw_read_seqcount_retry(&foo_seqcount, seq)); + + .. _seqcount_locktype_t: Sequence counters with associated locks (``seqcount_LOCKNAME_t``) diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h index 37ded6b8fee6..c61fba1f9893 100644 --- a/include/linux/seqlock.h +++ b/include/linux/seqlock.h @@ -60,15 +60,27 @@ * serialization and non-preemptibility requirements, use a sequential * lock (seqlock_t) instead. * + * If it's undesired to have lockdep, especially when a fixed, minimal, + * structure size is required, use raw_seqcount_t along with the raw + * function variants. 
+ * * See Documentation/locking/seqlock.rst */ + +typedef unsigned int raw_seqcount_t; + typedef struct seqcount { - unsigned sequence; + raw_seqcount_t sequence; #ifdef CONFIG_DEBUG_LOCK_ALLOC struct lockdep_map dep_map; #endif } seqcount_t; +static inline void __raw_seqcount_init(raw_seqcount_t *s) +{ + *s = 0; +} + static inline void __seqcount_init(seqcount_t *s, const char *name, struct lock_class_key *key) { @@ -76,9 +88,15 @@ static inline void __seqcount_init(seqcount_t *s, const char *name, * Make sure we are not reinitializing a held lock: */ lockdep_init_map(&s->dep_map, name, key, 0); - s->sequence = 0; + __raw_seqcount_init(&s->sequence); } +/** + * raw_seqcount_init() - runtime initializer for raw_seqcount_t + * @s: Pointer to the raw_seqcount_t instance + */ +# define raw_seqcount_init(s) __raw_seqcount_init(s) + #ifdef CONFIG_DEBUG_LOCK_ALLOC # define SEQCOUNT_DEP_MAP_INIT(lockname) \ @@ -111,11 +129,16 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s) # define seqcount_lockdep_reader_access(x) #endif +/** + * RAW_SEQCNT_ZERO() - static initializer for raw_seqcount_t + */ +#define RAW_SEQCNT_ZERO() 0 + /** * SEQCNT_ZERO() - static initializer for seqcount_t * @name: Name of the seqcount_t instance */ -#define SEQCNT_ZERO(name) { .sequence = 0, SEQCOUNT_DEP_MAP_INIT(name) } +#define SEQCNT_ZERO(name) { .sequence = RAW_SEQCNT_ZERO(), SEQCOUNT_DEP_MAP_INIT(name) } /* * Sequence counters with associated locks (seqcount_LOCKNAME_t) @@ -203,6 +226,12 @@ typedef struct seqcount_##lockname { \ __SEQ_LOCK(locktype *lock); \ } seqcount_##lockname##_t; \ \ +static __always_inline raw_seqcount_t * \ +__seqprop_##lockname##_raw_ptr(seqcount_##lockname##_t *s) \ +{ \ + return &s->seqcount.sequence; \ +} \ + \ static __always_inline seqcount_t * \ __seqprop_##lockname##_ptr(seqcount_##lockname##_t *s) \ { \ @@ -247,10 +276,45 @@ __seqprop_##lockname##_assert(const seqcount_##lockname##_t *s) \ __SEQ_LOCK(lockdep_assert_held(lockmember)); \ } +/* + * __raw_seqprop() for raw_seqcount_t + */ + +static inline raw_seqcount_t *__raw_seqprop_raw_ptr(raw_seqcount_t *s) +{ + return s; +} + +static inline seqcount_t *__raw_seqprop_ptr(raw_seqcount_t *s) +{ + BUILD_BUG(); + return NULL; +} + +static inline unsigned int __raw_seqprop_sequence(const raw_seqcount_t *s) +{ + return READ_ONCE(*s); +} + +static inline bool __raw_seqprop_preemptible(const raw_seqcount_t *s) +{ + return false; +} + +static inline void __raw_seqprop_assert(const raw_seqcount_t *s) +{ + lockdep_assert_preemption_disabled(); +} + /* * __seqprop() for seqcount_t */ +static inline raw_seqcount_t *__seqprop_raw_ptr(seqcount_t *s) +{ + return &s->sequence; +} + static inline seqcount_t *__seqprop_ptr(seqcount_t *s) { return s; @@ -300,6 +364,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu seqcount_##lockname##_t: __seqprop_##lockname##_##prop((void *)(s)) #define __seqprop(s, prop) _Generic(*(s), \ + raw_seqcount_t: __raw_seqprop_##prop((void *)(s)), \ seqcount_t: __seqprop_##prop((void *)(s)), \ __seqprop_case((s), raw_spinlock, prop), \ __seqprop_case((s), spinlock, prop), \ @@ -307,6 +372,7 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu __seqprop_case((s), mutex, prop), \ __seqprop_case((s), ww_mutex, prop)) +#define seqprop_raw_ptr(s) __seqprop(s, raw_ptr) #define seqprop_ptr(s) __seqprop(s, ptr) #define seqprop_sequence(s) __seqprop(s, sequence) #define seqprop_preemptible(s) __seqprop(s, preemptible) @@ -314,7 +380,8 @@ 
SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu /** * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * __read_seqcount_begin is like read_seqcount_begin, but has no smp_rmb() * barrier. Callers should ensure that smp_rmb() or equivalent ordering is @@ -339,7 +406,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu /** * raw_read_seqcount_begin() - begin a seqcount_t read section w/o lockdep - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the + * seqcount_LOCKNAME_t variants * * Return: count to be passed to read_seqcount_retry() */ @@ -365,7 +433,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu /** * raw_read_seqcount() - read the raw seqcount_t counter value - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * raw_read_seqcount opens a read critical section of the given * seqcount_t, without any lockdep checking, and without checking or @@ -386,7 +455,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu /** * raw_seqcount_begin() - begin a seqcount_t read critical section w/o * lockdep and w/o counter stabilization - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t, or any of the seqcount_LOCKNAME_t + * variants * * raw_seqcount_begin opens a read critical section of the given * seqcount_t. Unlike read_seqcount_begin(), this function will not wait @@ -411,7 +481,8 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu /** * __read_seqcount_retry() - end a seqcount_t read section w/o barrier - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * @start: count, from read_seqcount_begin() * * __read_seqcount_retry is like read_seqcount_retry, but has no smp_rmb() @@ -425,17 +496,19 @@ SEQCOUNT_LOCKNAME(ww_mutex, struct ww_mutex, true, &s->lock->base, ww_mu * Return: true if a read section retry is required, else false */ #define __read_seqcount_retry(s, start) \ - do___read_seqcount_retry(seqprop_ptr(s), start) + do___read_seqcount_retry(seqprop_raw_ptr(s), start) -static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start) +static inline int do___read_seqcount_retry(const raw_seqcount_t *s, + unsigned int start) { kcsan_atomic_next(0); - return unlikely(READ_ONCE(s->sequence) != start); + return unlikely(READ_ONCE(*s) != start); } /** * read_seqcount_retry() - end a seqcount_t read critical section - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * @start: count, from read_seqcount_begin() * * read_seqcount_retry closes the read critical section of given @@ -445,9 +518,11 @@ static inline int do___read_seqcount_retry(const seqcount_t *s, unsigned start) * Return: true if a read section retry is required, else false */ #define read_seqcount_retry(s, start) \ - do_read_seqcount_retry(seqprop_ptr(s), start) + do_read_seqcount_retry(seqprop_raw_ptr(s), start) +#define 
raw_read_seqcount_retry(s, start) read_seqcount_retry(s, start) -static inline int do_read_seqcount_retry(const seqcount_t *s, unsigned start) +static inline int do_read_seqcount_retry(const raw_seqcount_t *s, + unsigned int start) { smp_rmb(); return do___read_seqcount_retry(s, start); @@ -455,7 +530,8 @@ static inline int do_read_seqcount_retry(const seqcount_t *s, unsigned start) /** * raw_write_seqcount_begin() - start a seqcount_t write section w/o lockdep - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * Context: check write_seqcount_begin() */ @@ -464,34 +540,35 @@ do { \ if (seqprop_preemptible(s)) \ preempt_disable(); \ \ - do_raw_write_seqcount_begin(seqprop_ptr(s)); \ + do_raw_write_seqcount_begin(seqprop_raw_ptr(s)); \ } while (0) -static inline void do_raw_write_seqcount_begin(seqcount_t *s) +static inline void do_raw_write_seqcount_begin(raw_seqcount_t *s) { kcsan_nestable_atomic_begin(); - s->sequence++; + (*s)++; smp_wmb(); } /** * raw_write_seqcount_end() - end a seqcount_t write section w/o lockdep - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * Context: check write_seqcount_end() */ #define raw_write_seqcount_end(s) \ do { \ - do_raw_write_seqcount_end(seqprop_ptr(s)); \ + do_raw_write_seqcount_end(seqprop_raw_ptr(s)); \ \ if (seqprop_preemptible(s)) \ preempt_enable(); \ } while (0) -static inline void do_raw_write_seqcount_end(seqcount_t *s) +static inline void do_raw_write_seqcount_end(raw_seqcount_t *s) { smp_wmb(); - s->sequence++; + (*s)++; kcsan_nestable_atomic_end(); } @@ -516,7 +593,7 @@ do { \ static inline void do_write_seqcount_begin_nested(seqcount_t *s, int subclass) { - do_raw_write_seqcount_begin(s); + do_raw_write_seqcount_begin(&s->sequence); seqcount_acquire(&s->dep_map, subclass, 0, _RET_IP_); } @@ -563,12 +640,13 @@ do { \ static inline void do_write_seqcount_end(seqcount_t *s) { seqcount_release(&s->dep_map, _RET_IP_); - do_raw_write_seqcount_end(s); + do_raw_write_seqcount_end(&s->sequence); } /** * raw_write_seqcount_barrier() - do a seqcount_t write barrier - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * This can be used to provide an ordering guarantee instead of the usual * consistency guarantee. It is one wmb cheaper, because it can collapse @@ -608,33 +686,34 @@ static inline void do_write_seqcount_end(seqcount_t *s) * } */ #define raw_write_seqcount_barrier(s) \ - do_raw_write_seqcount_barrier(seqprop_ptr(s)) + do_raw_write_seqcount_barrier(seqprop_raw_ptr(s)) -static inline void do_raw_write_seqcount_barrier(seqcount_t *s) +static inline void do_raw_write_seqcount_barrier(raw_seqcount_t *s) { kcsan_nestable_atomic_begin(); - s->sequence++; + (*s)++; smp_wmb(); - s->sequence++; + (*s)++; kcsan_nestable_atomic_end(); } /** * write_seqcount_invalidate() - invalidate in-progress seqcount_t read * side operations - * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants + * @s: Pointer to seqcount_t, raw_seqcount_t or any of the seqcount_LOCKNAME_t + * variants * * After write_seqcount_invalidate, no seqcount_t read side operations * will complete successfully and see data older than this. 
*/ #define write_seqcount_invalidate(s) \ - do_write_seqcount_invalidate(seqprop_ptr(s)) + do_write_seqcount_invalidate(seqprop_raw_ptr(s)) -static inline void do_write_seqcount_invalidate(seqcount_t *s) +static inline void do_write_seqcount_invalidate(raw_seqcount_t *s) { smp_wmb(); kcsan_nestable_atomic_begin(); - s->sequence+=2; + (*s) += 2; kcsan_nestable_atomic_end(); } From patchwork Fri Dec 17 11:30:40 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 12684335 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A8610C433F5 for ; Fri, 17 Dec 2021 11:33:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235448AbhLQLdh (ORCPT ); Fri, 17 Dec 2021 06:33:37 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]:46316 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235447AbhLQLdh (ORCPT ); Fri, 17 Dec 2021 06:33:37 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639740816; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=UTfvPc+jpJw3w8ga8TZ2rx5ThiRLullRxNgQNYf86dk=; b=VJYyfW2leLo9iETHR2xo07vl/elK85RGaGrQWYA9gqhq2YmfZ0eIz3ihdWb0zSRB7uTL3p SmBeARnEAv4o9PGkY4mBEqfZnOwflrEVXSGYamBzuSwTLWsd6NtDGU5GqYB2Fs0I3O9gdg hcZUCw3qQbeeL0pec2tb7HY8Jk+SUL4= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-307-IqsbLDQXOCm4tMXYTw-Yaw-1; Fri, 17 Dec 2021 06:33:33 -0500 X-MC-Unique: IqsbLDQXOCm4tMXYTw-Yaw-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 439AB81CCC4; Fri, 17 Dec 2021 11:33:30 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.193.204]) by smtp.corp.redhat.com (Postfix) with ESMTP id 0D1B88CB3E; Fri, 17 Dec 2021 11:32:48 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . 
Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, David Hildenbrand Subject: [PATCH v1 02/11] mm: thp: consolidate mapcount logic on THP split Date: Fri, 17 Dec 2021 12:30:40 +0100 Message-Id: <20211217113049.23850-3-david@redhat.com> In-Reply-To: <20211217113049.23850-1-david@redhat.com> References: <20211217113049.23850-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Let's consolidate the mapcount logic to make it easier to understand and to prepare for further changes. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand Reviewed-by: Yang Shi Acked-by: Kirill A. Shutemov --- mm/huge_memory.c | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index e5483347291c..4751d03947da 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2101,21 +2101,25 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, pte = pte_offset_map(&_pmd, addr); BUG_ON(!pte_none(*pte)); set_pte_at(mm, addr, pte, entry); - if (!pmd_migration) - atomic_inc(&page[i]._mapcount); pte_unmap(pte); } if (!pmd_migration) { + /* Sub-page mapcount accounting for above small mappings. */ + int val = 1; + /* * Set PG_double_map before dropping compound_mapcount to avoid * false-negative page_mapped(). + * + * The first to set PageDoubleMap() has to increment all + * sub-page mapcounts by one. */ - if (compound_mapcount(page) > 1 && - !TestSetPageDoubleMap(page)) { - for (i = 0; i < HPAGE_PMD_NR; i++) - atomic_inc(&page[i]._mapcount); - } + if (compound_mapcount(page) > 1 && !TestSetPageDoubleMap(page)) + val++; + + for (i = 0; i < HPAGE_PMD_NR; i++) + atomic_add(val, &page[i]._mapcount); lock_page_memcg(page); if (atomic_add_negative(-1, compound_mapcount_ptr(page))) { From patchwork Fri Dec 17 11:30:41 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 12684337 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 11E00C433F5 for ; Fri, 17 Dec 2021 11:33:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235470AbhLQLdr (ORCPT ); Fri, 17 Dec 2021 06:33:47 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]:58401 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235464AbhLQLdq (ORCPT ); Fri, 17 Dec 2021 06:33:46 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639740825; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=5QROTR0DNwNeGtld/DWtRhC9lgRjF4ZhB5ZKs8M3xpY=; b=TzAmR81d4AuIcXfpQXzQVWMSvlizOjH/tRMCkXYB2jRPKX987QlQvEbgEgXKAVeicMq1fe EXh8qNGuc5Qrw9vJnjOIgXJNDDIP6T55muMakr7TK9J92vZ0PoUSMdAQEbomTjuplT6qBZ CTY1hK8NokqVrcXmtjYDtXCF/Qx6nRk= Received: from mimecast-mx01.redhat.com 
(mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-523-AfuQXayyNmuryw_8DqcZug-1; Fri, 17 Dec 2021 06:33:40 -0500 X-MC-Unique: AfuQXayyNmuryw_8DqcZug-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 65304801AAB; Fri, 17 Dec 2021 11:33:37 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.193.204]) by smtp.corp.redhat.com (Postfix) with ESMTP id A036E8D5AC; Fri, 17 Dec 2021 11:33:30 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, David Hildenbrand Subject: [PATCH v1 03/11] mm: simplify hugetlb and file-THP handling in __page_mapcount() Date: Fri, 17 Dec 2021 12:30:41 +0100 Message-Id: <20211217113049.23850-4-david@redhat.com> In-Reply-To: <20211217113049.23850-1-david@redhat.com> References: <20211217113049.23850-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Let's return early for hugetlb, which really only relies on the compound mapcount so far and does not support PageDoubleMap() yet. Use the chance to cleanup the file-THP case to make it easier to grasp. While at it, use head_compound_mapcount(). This is a preparation for further changes. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand Acked-by: Mike Kravetz Reviewed-by: Yang Shi Acked-by: Kirill A. Shutemov --- mm/util.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/mm/util.c b/mm/util.c index 741ba32a43ac..3239e75c148d 100644 --- a/mm/util.c +++ b/mm/util.c @@ -732,15 +732,18 @@ int __page_mapcount(struct page *page) { int ret; - ret = atomic_read(&page->_mapcount) + 1; + if (PageHuge(page)) + return compound_mapcount(page); /* * For file THP page->_mapcount contains total number of mapping * of the page: no need to look into compound_mapcount. 
*/ - if (!PageAnon(page) && !PageHuge(page)) - return ret; + if (!PageAnon(page)) + return atomic_read(&page->_mapcount) + 1; + + ret = atomic_read(&page->_mapcount) + 1; page = compound_head(page); - ret += atomic_read(compound_mapcount_ptr(page)) + 1; + ret += head_compound_mapcount(page); if (PageDoubleMap(page)) ret--; return ret; From patchwork Fri Dec 17 11:30:42 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 12684339 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E1E93C433F5 for ; Fri, 17 Dec 2021 11:33:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235466AbhLQLdy (ORCPT ); Fri, 17 Dec 2021 06:33:54 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:27311 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235464AbhLQLdx (ORCPT ); Fri, 17 Dec 2021 06:33:53 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639740832; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=UOrVV2NFx26u84MVGLWI5jLwDTFzF180mJVk+c6k5Rs=; b=b3h1DYf8FJmQyhXxN8DyCyrjxglmBWl0trfcWY8kagaD+rH/a2jk8v4dX0DfTc+sBR/EYd 27u3Yw9w6p23nR9+BCGHxWtw7n6a1ROvHItItVtnccGrmbJlNgDNSmkbk94rJHHtf1UhYM grqLjvyFPmXyPgf+dn5mW7LlaYniknU= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-272-rZlAI3i2NSemwQWONFDUoA-1; Fri, 17 Dec 2021 06:33:47 -0500 X-MC-Unique: rZlAI3i2NSemwQWONFDUoA-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 1E0C61018723; Fri, 17 Dec 2021 11:33:45 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.193.204]) by smtp.corp.redhat.com (Postfix) with ESMTP id C61F08ACF7; Fri, 17 Dec 2021 11:33:37 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, David Hildenbrand Subject: [PATCH v1 04/11] mm: thp: simlify total_mapcount() Date: Fri, 17 Dec 2021 12:30:42 +0100 Message-Id: <20211217113049.23850-5-david@redhat.com> In-Reply-To: <20211217113049.23850-1-david@redhat.com> References: <20211217113049.23850-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Let's simplify a bit, returning for PageHuge() early and using head_compound_page() as we are only getting called for HEAD pages. 
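For orientation, the function after this change reads roughly as follows (a sketch reassembled from the hunk below together with its context lines, not a verbatim copy of the resulting file):

  int total_mapcount(struct page *page)
  {
          int i, compound, nr, ret;

          VM_BUG_ON_PAGE(PageTail(page), page);

          if (likely(!PageCompound(page)))
                  return atomic_read(&page->_mapcount) + 1;
          /* hugetlb only uses the compound mapcount */
          if (PageHuge(page))
                  return head_compound_mapcount(page);

          nr = compound_nr(page);
          ret = compound = head_compound_mapcount(page);
          for (i = 0; i < nr; i++)
                  ret += atomic_read(&page[i]._mapcount) + 1;
          /* File pages have compound_mapcount included in _mapcount */
          if (!PageAnon(page))
                  return ret - compound * nr;
          if (PageDoubleMap(page))
                  ret -= nr;
          return ret;
  }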
Note the VM_BUG_ON_PAGE(PageTail(page), page) check at the beginning of total_mapcount(). This is a preparation for further changes. Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand Reviewed-by: Yang Shi Acked-by: Kirill A. Shutemov --- mm/huge_memory.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 4751d03947da..826cabcad11a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2506,12 +2506,11 @@ int total_mapcount(struct page *page) if (likely(!PageCompound(page))) return atomic_read(&page->_mapcount) + 1; + if (PageHuge(page)) + return head_compound_mapcount(page); - compound = compound_mapcount(page); nr = compound_nr(page); - if (PageHuge(page)) - return compound; - ret = compound; + ret = compound = head_compound_mapcount(page); for (i = 0; i < nr; i++) ret += atomic_read(&page[i]._mapcount) + 1; /* File pages has compound_mapcount included in _mapcount */ From patchwork Fri Dec 17 11:30:43 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 12684341 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id DD3BFC433EF for ; Fri, 17 Dec 2021 11:33:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235478AbhLQLd7 (ORCPT ); Fri, 17 Dec 2021 06:33:59 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]:30841 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235464AbhLQLd7 (ORCPT ); Fri, 17 Dec 2021 06:33:59 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639740838; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=H/7mUTo+iv3R8B8oSdwI4pdW6+nsiLEWXHYLzjN+rhg=; b=Vg8NjjABQtzo0/RSFWut12mEYfhQxExh0EhV1xhrqZjD5Rd/dv+C7gAT7zkgVRhMLuFOcY rU7ncKGG/EzlTHbPZHfnJj+NF9CJ1DHfhpoYOCxvZ2oNQOhcVIQUDCBCep9Fa1O5ir/dR3 6RjR3hWHvNAdicAh8TWp5INzbN3PJ1Y= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-139-2ztkeZRXPsGpaJ5xK1c3YQ-1; Fri, 17 Dec 2021 06:33:55 -0500 X-MC-Unique: 2ztkeZRXPsGpaJ5xK1c3YQ-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 1E2F91853026; Fri, 17 Dec 2021 11:33:52 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.193.204]) by smtp.corp.redhat.com (Postfix) with ESMTP id 81FA58ACF7; Fri, 17 Dec 2021 11:33:45 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . 
Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, David Hildenbrand , Sergey Senozhatsky Subject: [PATCH v1 05/11] mm: thp: allow for reading the THP mapcount atomically via a raw_seqlock_t Date: Fri, 17 Dec 2021 12:30:43 +0100 Message-Id: <20211217113049.23850-6-david@redhat.com> In-Reply-To: <20211217113049.23850-1-david@redhat.com> References: <20211217113049.23850-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Currently, we are not able to read the mapcount of a THP atomically without expensive locking, for example, if the THP is getting split concurrently. Also, we don't want mapcount readers to observe jitter on concurrent GUP and unmapping like: 2 -> 1 -> 2 -> 1 Instead, we want to avoid such jitter and want the mapcount of a THP to move into one direction only instead. The main challenge to avoid such jitter is PageDoubleMap. If the compound_mapcount and the tail mapcounts move in the same direction, there is no problem. However when the compound_mapcount is decreased and reaches zero, the reader will see initially a decrease in the THP mapcount that will then be followed by the PageDoubleMap being cleared and the mapcount getting increased again. The act of clearing PageDoubleMap will lead readers to overestimate the mapcount until all tail mapcounts (that the PageDoubleMap flag kept artificially elevated) are finally released. Introduce a raw_seqlock_t in the THP subpage at index 1 to allow reading the THP mapcount atomically without grabbing the page lock, avoiding racing with THP splitting or PageDoubleMap processing. For now, we only require the seqlock for anonymous THP. We use a PG_lock-based spinlock to synchronize the writer side. Note that the PG_lock is located on the THP subpage at index 1, which is unused so far. To make especially page_mapcount() safe to be called from IRQ context, as required by GUP via get_user_pages_fast_only() in the context of GUP-triggered unsharing of shared anonymous pages soon, make sure the reader side cannot deadlock if the writer side would be interrupted: disable local interrupts on the writer side. Note that they are already disabled during lock_page_memcg() in some configurations. Fortunately, we do have as of now (mm/Kconfig) config TRANSPARENT_HUGEPAGE bool "Transparent Hugepage Support" depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT so the disabling of interrupts in our case in particular has no effect on PREEMPT_RT, which is good. We don't need this type of locking on the THP freeing path: Once the compound_mapcount of an anonymous THP drops to 0, it won't suddenly increase again, so PageDoubleMap cannot be cleared concurrently and consequently the seqlock only needs to be taken if the PageDoubleMap flag is found set. Note: In the future, we could avoid disabling local interrupts on the writer side by providing alternative functions that can be called from IRQ context without deadlocking: These functions must not spin but instead have to signal that locking failed. OR maybe we'll find a way to just simplify that whole mapcount handling logic for anonymous THP, but for now none has been identified. Let's keep it simple for now. This commit is based on prototype patches by Andrea. 
Reported-by: Sergey Senozhatsky Reported-by: Hugh Dickins Fixes: c444eb564fb1 ("mm: thp: make the THP mapcount atomic against __split_huge_pmd_locked()") Co-developed-by: Andrea Arcangeli Signed-off-by: Andrea Arcangeli Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- include/linux/huge_mm.h | 65 ++++++++++++++++++++++++++++++++++++++++ include/linux/mm_types.h | 9 ++++++ mm/huge_memory.c | 56 +++++++++++++++++++++++----------- mm/rmap.c | 40 +++++++++++++++---------- mm/swapfile.c | 35 +++++++++++++--------- mm/util.c | 17 +++++++---- 6 files changed, 170 insertions(+), 52 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index f280f33ff223..44e02d47c65a 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -318,6 +318,49 @@ static inline struct list_head *page_deferred_list(struct page *page) return &page[2].deferred_list; } +static inline void thp_mapcount_seqcount_init(struct page *page) +{ + raw_seqcount_init(&page[1].mapcount_seqcount); +} + +static inline unsigned int thp_mapcount_read_begin(struct page *page) +{ + VM_BUG_ON_PAGE(PageTail(page), page); + return raw_read_seqcount_begin(&page[1].mapcount_seqcount); +} + +static inline bool thp_mapcount_read_retry(struct page *page, + unsigned int seqcount) +{ + VM_BUG_ON_PAGE(PageTail(page), page); + if (!raw_read_seqcount_retry(&page[1].mapcount_seqcount, seqcount)) + return false; + cpu_relax(); + return true; +} + +static inline void thp_mapcount_lock(struct page *page, + unsigned long *irq_flags) +{ + VM_BUG_ON_PAGE(PageTail(page), page); + /* + * Prevent deadlocks in thp_mapcount_read_begin() if it is called in IRQ + * context. + */ + local_irq_save(*irq_flags); + bit_spin_lock(PG_locked, &page[1].flags); + raw_write_seqcount_begin(&page[1].mapcount_seqcount); +} + +static inline void thp_mapcount_unlock(struct page *page, + unsigned long irq_flags) +{ + VM_BUG_ON_PAGE(PageTail(page), page); + raw_write_seqcount_end(&page[1].mapcount_seqcount); + bit_spin_unlock(PG_locked, &page[1].flags); + local_irq_restore(irq_flags); +} + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) @@ -467,6 +510,28 @@ static inline bool thp_migration_supported(void) { return false; } + +static inline unsigned int thp_mapcount_read_begin(struct page *page) +{ + return 0; +} + +static inline bool thp_mapcount_read_retry(struct page *page, + unsigned int seqcount) +{ + return false; +} + +static inline void thp_mapcount_lock(struct page *page, + unsigned long *irq_flags) +{ +} + +static inline void thp_mapcount_unlock(struct page *page, + unsigned long irq_flags) +{ +} + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ /** diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c3a6e6209600..a85a2a75d4ff 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -151,6 +151,15 @@ struct page { unsigned char compound_order; atomic_t compound_mapcount; unsigned int compound_nr; /* 1 << compound_order */ + /* + * THP only: allow for atomic reading of the mapcount, + * for example when we might be racing with a concurrent + * THP split. Initialized for all THP but locking is + * so far only required for anon THP where such races + * apply. Write access is serialized via the + * PG_locked-based spinlock in the first tail page. 
+ */ + raw_seqcount_t mapcount_seqcount; }; struct { /* Second tail page of compound page */ unsigned long _compound_pad_1; /* compound_head */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 826cabcad11a..1685821525e8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -527,6 +527,7 @@ void prep_transhuge_page(struct page *page) INIT_LIST_HEAD(page_deferred_list(page)); set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR); + thp_mapcount_seqcount_init(page); } bool is_transparent_hugepage(struct page *page) @@ -1959,11 +1960,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, unsigned long haddr, bool freeze) { struct mm_struct *mm = vma->vm_mm; + unsigned long addr, irq_flags; struct page *page; pgtable_t pgtable; pmd_t old_pmd, _pmd; bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false; - unsigned long addr; int i; VM_BUG_ON(haddr & ~HPAGE_PMD_MASK); @@ -2108,6 +2109,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, /* Sub-page mapcount accounting for above small mappings. */ int val = 1; + /* + * lock_page_memcg() is taken before thp_mapcount_lock() in + * page_remove_anon_compound_rmap(), respect the same locking + * order. + */ + lock_page_memcg(page); + thp_mapcount_lock(page, &irq_flags); /* * Set PG_double_map before dropping compound_mapcount to avoid * false-negative page_mapped(). @@ -2121,7 +2129,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, for (i = 0; i < HPAGE_PMD_NR; i++) atomic_add(val, &page[i]._mapcount); - lock_page_memcg(page); if (atomic_add_negative(-1, compound_mapcount_ptr(page))) { /* Last compound_mapcount is gone. */ __mod_lruvec_page_state(page, NR_ANON_THPS, @@ -2132,6 +2139,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, atomic_dec(&page[i]._mapcount); } } + thp_mapcount_unlock(page, irq_flags); unlock_page_memcg(page); } @@ -2501,6 +2509,8 @@ static void __split_huge_page(struct page *page, struct list_head *list, int total_mapcount(struct page *page) { int i, compound, nr, ret; + unsigned int seqcount; + bool double_map; VM_BUG_ON_PAGE(PageTail(page), page); @@ -2510,13 +2520,19 @@ int total_mapcount(struct page *page) return head_compound_mapcount(page); nr = compound_nr(page); - ret = compound = head_compound_mapcount(page); - for (i = 0; i < nr; i++) - ret += atomic_read(&page[i]._mapcount) + 1; + + do { + seqcount = thp_mapcount_read_begin(page); + ret = compound = head_compound_mapcount(page); + for (i = 0; i < nr; i++) + ret += atomic_read(&page[i]._mapcount) + 1; + double_map = PageDoubleMap(page); + } while (thp_mapcount_read_retry(page, seqcount)); + /* File pages has compound_mapcount included in _mapcount */ if (!PageAnon(page)) return ret - compound * nr; - if (PageDoubleMap(page)) + if (double_map) ret -= nr; return ret; } @@ -2548,6 +2564,7 @@ int total_mapcount(struct page *page) int page_trans_huge_mapcount(struct page *page, int *total_mapcount) { int i, ret, _total_mapcount, mapcount; + unsigned int seqcount; /* hugetlbfs shouldn't call it */ VM_BUG_ON_PAGE(PageHuge(page), page); @@ -2561,17 +2578,22 @@ int page_trans_huge_mapcount(struct page *page, int *total_mapcount) page = compound_head(page); - _total_mapcount = ret = 0; - for (i = 0; i < thp_nr_pages(page); i++) { - mapcount = atomic_read(&page[i]._mapcount) + 1; - ret = max(ret, mapcount); - _total_mapcount += mapcount; - } - if (PageDoubleMap(page)) { - ret -= 1; - _total_mapcount -= thp_nr_pages(page); - } - mapcount = 
compound_mapcount(page); + do { + _total_mapcount = ret = 0; + + seqcount = thp_mapcount_read_begin(page); + for (i = 0; i < thp_nr_pages(page); i++) { + mapcount = atomic_read(&page[i]._mapcount) + 1; + ret = max(ret, mapcount); + _total_mapcount += mapcount; + } + if (PageDoubleMap(page)) { + ret -= 1; + _total_mapcount -= thp_nr_pages(page); + } + mapcount = compound_mapcount(page); + } while (thp_mapcount_read_retry(page, seqcount)); + ret += mapcount; _total_mapcount += mapcount; if (total_mapcount) diff --git a/mm/rmap.c b/mm/rmap.c index 163ac4e6bcee..0218052586e7 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1294,6 +1294,7 @@ static void page_remove_file_rmap(struct page *page, bool compound) static void page_remove_anon_compound_rmap(struct page *page) { + unsigned long irq_flags; int i, nr; if (!atomic_add_negative(-1, compound_mapcount_ptr(page))) @@ -1308,23 +1309,30 @@ static void page_remove_anon_compound_rmap(struct page *page) __mod_lruvec_page_state(page, NR_ANON_THPS, -thp_nr_pages(page)); - if (TestClearPageDoubleMap(page)) { - /* - * Subpages can be mapped with PTEs too. Check how many of - * them are still mapped. - */ - for (i = 0, nr = 0; i < thp_nr_pages(page); i++) { - if (atomic_add_negative(-1, &page[i]._mapcount)) - nr++; - } + if (PageDoubleMap(page)) { + thp_mapcount_lock(page, &irq_flags); + if (TestClearPageDoubleMap(page)) { + /* + * Subpages can be mapped with PTEs too. Check how many + * of them are still mapped. + */ + for (i = 0, nr = 0; i < thp_nr_pages(page); i++) { + if (atomic_add_negative(-1, &page[i]._mapcount)) + nr++; + } + thp_mapcount_unlock(page, irq_flags); - /* - * Queue the page for deferred split if at least one small - * page of the compound page is unmapped, but at least one - * small page is still mapped. - */ - if (nr && nr < thp_nr_pages(page)) - deferred_split_huge_page(page); + /* + * Queue the page for deferred split if at least one + * small page of the compound page is unmapped, but at + * least one small page is still mapped. 
+ */ + if (nr && nr < thp_nr_pages(page)) + deferred_split_huge_page(page); + } else { + thp_mapcount_unlock(page, irq_flags); + nr = thp_nr_pages(page); + } } else { nr = thp_nr_pages(page); } diff --git a/mm/swapfile.c b/mm/swapfile.c index e59e08ef46e1..82aeb927a7ba 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1610,6 +1610,7 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount, struct swap_cluster_info *ci = NULL; unsigned char *map = NULL; int mapcount, swapcount = 0; + unsigned int seqcount; /* hugetlbfs shouldn't call it */ VM_BUG_ON_PAGE(PageHuge(page), page); @@ -1625,7 +1626,6 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount, page = compound_head(page); - _total_mapcount = _total_swapcount = map_swapcount = 0; if (PageSwapCache(page)) { swp_entry_t entry; @@ -1638,21 +1638,28 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount, } if (map) ci = lock_cluster(si, offset); - for (i = 0; i < HPAGE_PMD_NR; i++) { - mapcount = atomic_read(&page[i]._mapcount) + 1; - _total_mapcount += mapcount; - if (map) { - swapcount = swap_count(map[offset + i]); - _total_swapcount += swapcount; + + do { + _total_mapcount = _total_swapcount = map_swapcount = 0; + + seqcount = thp_mapcount_read_begin(page); + for (i = 0; i < HPAGE_PMD_NR; i++) { + mapcount = atomic_read(&page[i]._mapcount) + 1; + _total_mapcount += mapcount; + if (map) { + swapcount = swap_count(map[offset + i]); + _total_swapcount += swapcount; + } + map_swapcount = max(map_swapcount, mapcount + swapcount); } - map_swapcount = max(map_swapcount, mapcount + swapcount); - } + if (PageDoubleMap(page)) { + map_swapcount -= 1; + _total_mapcount -= HPAGE_PMD_NR; + } + mapcount = compound_mapcount(page); + } while (thp_mapcount_read_retry(page, seqcount)); + unlock_cluster(ci); - if (PageDoubleMap(page)) { - map_swapcount -= 1; - _total_mapcount -= HPAGE_PMD_NR; - } - mapcount = compound_mapcount(page); map_swapcount += mapcount; _total_mapcount += mapcount; if (total_mapcount) diff --git a/mm/util.c b/mm/util.c index 3239e75c148d..f4b81c794da1 100644 --- a/mm/util.c +++ b/mm/util.c @@ -730,6 +730,8 @@ EXPORT_SYMBOL(folio_mapping); /* Slow path of page_mapcount() for compound pages */ int __page_mapcount(struct page *page) { + struct page *head_page; + unsigned int seqcount; int ret; if (PageHuge(page)) @@ -741,11 +743,16 @@ int __page_mapcount(struct page *page) if (!PageAnon(page)) return atomic_read(&page->_mapcount) + 1; - ret = atomic_read(&page->_mapcount) + 1; - page = compound_head(page); - ret += head_compound_mapcount(page); - if (PageDoubleMap(page)) - ret--; + /* The mapcount_seqlock is so far only required for anonymous THP. 
*/ + head_page = compound_head(page); + do { + seqcount = thp_mapcount_read_begin(head_page); + ret = atomic_read(&page->_mapcount) + 1; + ret += head_compound_mapcount(head_page); + if (PageDoubleMap(head_page)) + ret--; + } while (thp_mapcount_read_retry(head_page, seqcount)); + return ret; } EXPORT_SYMBOL_GPL(__page_mapcount); From patchwork Fri Dec 17 11:30:44 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 12684343 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3F01BC433FE for ; Fri, 17 Dec 2021 11:34:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235507AbhLQLeG (ORCPT ); Fri, 17 Dec 2021 06:34:06 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]:24482 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235495AbhLQLeF (ORCPT ); Fri, 17 Dec 2021 06:34:05 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639740845; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=MstCAevuD1qvpTctpkyVpFQ9Km+HVMM2bhgtFFKdRTg=; b=KluuusnFWw+Z/Xq5jPOnT7RaRVL9rhjD1ckWWkKjoxEyqatG9W3nwg3cVeAFYgifaShvLk w0Ge+Nne2LrgoZPuoFnJjJoZXoEKwxH/GcIsK+J1Gf0aE8D7OA3Zuhn29b0mXHuvetmTji SX6FaqAYJSYqDbIGz7dG+XQv7BeH24A= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-631-marQ5KkgObO5qIOHzCa4PA-1; Fri, 17 Dec 2021 06:34:01 -0500 X-MC-Unique: marQ5KkgObO5qIOHzCa4PA-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id C784C101AFAB; Fri, 17 Dec 2021 11:33:58 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.193.204]) by smtp.corp.redhat.com (Postfix) with ESMTP id 735F48D5B6; Fri, 17 Dec 2021 11:33:52 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . 
Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, David Hildenbrand Subject: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb) Date: Fri, 17 Dec 2021 12:30:44 +0100 Message-Id: <20211217113049.23850-7-david@redhat.com> In-Reply-To: <20211217113049.23850-1-david@redhat.com> References: <20211217113049.23850-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org FAULT_FLAG_UNSHARE is a new type of page fault applicable to COW-able anonymous memory (including hugetlb but excluding KSM) and its purpose is to allow for unsharing of shared anonymous pages on selected GUP *read* access, in comparison to the traditional COW on *write* access. In contrast to a COW, GUP-triggered unsharing will still maintain the write protection. It will be triggered by GUP to properly prevent a child process from finding ways via GUP to observe memory modifications of anonymous memory of the parent process after fork(). Rename the relevant functions to make it clear whether we're dealing with unsharing, cow, or both. The hugetlb part will be added separately. This commit is based on prototype patches by Andrea. Co-developed-by: Andrea Arcangeli Signed-off-by: Andrea Arcangeli Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand Signed-off-by: Matthew Wilcox (Oracle) --- include/linux/mm.h | 4 ++ mm/memory.c | 136 ++++++++++++++++++++++++++++++++++++++------- 2 files changed, 119 insertions(+), 21 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index a7e4a9e7d807..37d1fb2f865e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -436,6 +436,9 @@ extern pgprot_t protection_map[16]; * @FAULT_FLAG_REMOTE: The fault is not for current task/mm. * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch. * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals. + * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare a + * shared anonymous page (-> mapped R/O). Does not apply + * to KSM. * * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify * whether we would allow page faults to retry by specifying these two @@ -467,6 +470,7 @@ enum fault_flag { FAULT_FLAG_REMOTE = 1 << 7, FAULT_FLAG_INSTRUCTION = 1 << 8, FAULT_FLAG_INTERRUPTIBLE = 1 << 9, + FAULT_FLAG_UNSHARE = 1 << 10, }; /* diff --git a/mm/memory.c b/mm/memory.c index 8f1de811a1dc..7253a2ad4320 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2707,8 +2707,9 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page_range); * read non-atomically. Before making any commitment, on those architectures * or configurations (e.g. i386 with PAE) which might give a mix of unmatched * parts, do_swap_page must check under lock before unmapping the pte and - * proceeding (but do_wp_page is only called after already making such a check; - * and do_anonymous_page can safely check later on). + * proceeding (but do_wp_page_cow/do_wp_page_unshare is only called after + * already making such a check; and do_anonymous_page can safely check later + * on). 
*/ static inline int pte_unmap_same(struct vm_fault *vmf) { @@ -2726,8 +2727,8 @@ static inline int pte_unmap_same(struct vm_fault *vmf) return same; } -static inline bool cow_user_page(struct page *dst, struct page *src, - struct vm_fault *vmf) +static inline bool __wp_page_copy_user(struct page *dst, struct page *src, + struct vm_fault *vmf) { bool ret; void *kaddr; @@ -2952,7 +2953,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf) } /* - * Handle the case of a page which we actually need to copy to a new page. + * Handle the case of a page which we actually need to copy to a new page, + * either due to COW or unsharing. * * Called with mmap_lock locked and the old page referenced, but * without the ptl held. @@ -2967,7 +2969,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf) * held to the old page, as well as updating the rmap. * - In any case, unlock the PTL and drop the reference we took to the old page. */ -static vm_fault_t wp_page_copy(struct vm_fault *vmf) +static vm_fault_t wp_page_copy(struct vm_fault *vmf, bool unshare) { struct vm_area_struct *vma = vmf->vma; struct mm_struct *mm = vma->vm_mm; @@ -2991,7 +2993,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) if (!new_page) goto oom; - if (!cow_user_page(new_page, old_page, vmf)) { + if (!__wp_page_copy_user(new_page, old_page, vmf)) { /* * COW failed, if the fault was solved by other, * it's fine. If not, userspace would re-fault on @@ -3033,7 +3035,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)); entry = mk_pte(new_page, vma->vm_page_prot); entry = pte_sw_mkyoung(entry); - entry = maybe_mkwrite(pte_mkdirty(entry), vma); + if (unlikely(unshare)) { + if (pte_soft_dirty(vmf->orig_pte)) + entry = pte_mksoft_dirty(entry); + if (pte_uffd_wp(vmf->orig_pte)) + entry = pte_mkuffd_wp(entry); + } else { + entry = maybe_mkwrite(pte_mkdirty(entry), vma); + } /* * Clear the pte entry and flush it first, before updating the @@ -3050,6 +3059,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) * mmu page tables (such as kvm shadow page tables), we want the * new page to be mapped directly into the secondary page table. */ + BUG_ON(unshare && pte_write(entry)); set_pte_at_notify(mm, vmf->address, vmf->pte, entry); update_mmu_cache(vma, vmf->address, vmf->pte); if (old_page) { @@ -3109,6 +3119,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) free_swap_cache(old_page); put_page(old_page); } + if (unlikely(unshare)) + return 0; return page_copied ? VM_FAULT_WRITE : 0; oom_free_new: put_page(new_page); @@ -3118,6 +3130,70 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) return VM_FAULT_OOM; } +static __always_inline vm_fault_t wp_page_cow(struct vm_fault *vmf) +{ + return wp_page_copy(vmf, false); +} + +static __always_inline vm_fault_t wp_page_unshare(struct vm_fault *vmf) +{ + return wp_page_copy(vmf, true); +} + +/* + * This routine handles present pages, when GUP tries to take a read-only + * pin on a shared anonymous page. It's similar to do_wp_page_cow(), except that + * it keeps the pages mapped read-only and doesn't apply to KSM pages. + * + * If a parent process forks a child process, we share anonymous pages between + * both processes with COW semantics. Both processes will map these now shared + * anonymous pages read-only, and any write access triggers unsharing via COW. 
+ * + * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not + * set) and then unmaps the target page, we have: + * + * * page has mapcount == 1 and refcount > 1 + * * page is mapped read-only into the parent + * * page is pinned by the child and can still be read + * + * For now, we rely on refcount > 1 to perform the COW and trigger unsharing. + * However, that leads to other hard-to fix issues. + * + * GUP-triggered unsharing provides a parallel approach to trigger unsharing + * early, still allowing for relying on mapcount > 1 in COW code instead of on + * imprecise refcount > 1. Note that when we don't actually take a reference + * on the target page but instead use memory notifiers to synchronize to changes + * in the process page tables, unsharing is not required. + * + * Note that in the above scenario, it's impossible to distinguish during the + * write fault between: + * + * a) The parent process performed the pin and the child no longer has access + * to the page. + * + * b) The child process performed the pin and the child still has access to the + * page. + * + * In case of a), if we're dealing with a long-term read-only pin, the COW + * in the parent will result the pinned page differing from the page actually + * mapped into the process page tables in the parent: loss of synchronicity. + * Therefore, we really want to perform the copy when the read-only pin happens. + */ +static vm_fault_t do_wp_page_unshare(struct vm_fault *vmf) + __releases(vmf->ptl) +{ + vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte); + if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) && + page_mapcount(vmf->page) > 1) { + get_page(vmf->page); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return wp_page_unshare(vmf); + } + vmf->page = NULL; + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; +} + /** * finish_mkwrite_fault - finish page fault for a shared mapping, making PTE * writeable once the page is prepared @@ -3226,7 +3302,7 @@ static vm_fault_t wp_page_shared(struct vm_fault *vmf) * but allow concurrent faults), with pte both mapped and locked. * We return with mmap_lock still held, but pte unmapped and unlocked. */ -static vm_fault_t do_wp_page(struct vm_fault *vmf) +static vm_fault_t do_wp_page_cow(struct vm_fault *vmf) __releases(vmf->ptl) { struct vm_area_struct *vma = vmf->vma; @@ -3258,7 +3334,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) return wp_pfn_shared(vmf); pte_unmap_unlock(vmf->pte, vmf->ptl); - return wp_page_copy(vmf); + return wp_page_cow(vmf); } /* @@ -3296,7 +3372,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) get_page(vmf->page); pte_unmap_unlock(vmf->pte, vmf->ptl); - return wp_page_copy(vmf); + return wp_page_cow(vmf); } static void unmap_mapping_range_vma(struct vm_area_struct *vma, @@ -3670,7 +3746,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) } if (vmf->flags & FAULT_FLAG_WRITE) { - ret |= do_wp_page(vmf); + ret |= do_wp_page_cow(vmf); if (ret & VM_FAULT_ERROR) ret &= VM_FAULT_ERROR; goto out; @@ -4428,6 +4504,16 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf) /* `inline' is required to avoid gcc 4.1.2 build error */ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) { + if (vmf->flags & FAULT_FLAG_UNSHARE) { + /* + * We'll simply split the THP and handle unsharing on the + * PTE level. Unsharing only applies to anon THPs and we + * shouldn't ever find them inside shared mappings. 
+ */ + if (WARN_ON_ONCE(vmf->vma->vm_flags & VM_SHARED)) + return 0; + goto split_fallback; + } if (vma_is_anonymous(vmf->vma)) { if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd)) return handle_userfault(vmf, VM_UFFD_WP); @@ -4440,7 +4526,8 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf) return ret; } - /* COW or write-notify handled on pte level: split pmd. */ +split_fallback: + /* COW, unsharing or write-notify handled on pte level: split pmd. */ __split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL); return VM_FAULT_FALLBACK; @@ -4551,8 +4638,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) return do_fault(vmf); } - if (!pte_present(vmf->orig_pte)) - return do_swap_page(vmf); + if (!pte_present(vmf->orig_pte)) { + if (likely(!(vmf->flags & FAULT_FLAG_UNSHARE))) + return do_swap_page(vmf); + return 0; + } if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) return do_numa_page(vmf); @@ -4564,9 +4654,13 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf) update_mmu_tlb(vmf->vma, vmf->address, vmf->pte); goto unlock; } - if (vmf->flags & FAULT_FLAG_WRITE) { - if (!pte_write(entry)) - return do_wp_page(vmf); + if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) { + if (!pte_write(entry)) { + if (vmf->flags & FAULT_FLAG_WRITE) + return do_wp_page_cow(vmf); + else + return do_wp_page_unshare(vmf); + } entry = pte_mkdirty(entry); } entry = pte_mkyoung(entry); @@ -4607,7 +4701,6 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, .pgoff = linear_page_index(vma, address), .gfp_mask = __get_fault_gfp_mask(vma), }; - unsigned int dirty = flags & FAULT_FLAG_WRITE; struct mm_struct *mm = vma->vm_mm; pgd_t *pgd; p4d_t *p4d; @@ -4634,7 +4727,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, /* NUMA case for anonymous PUDs would go here */ - if (dirty && !pud_write(orig_pud)) { + if ((flags & FAULT_FLAG_WRITE) && !pud_write(orig_pud)) { ret = wp_huge_pud(&vmf, orig_pud); if (!(ret & VM_FAULT_FALLBACK)) return ret; @@ -4672,7 +4765,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) return do_huge_pmd_numa_page(&vmf); - if (dirty && !pmd_write(vmf.orig_pmd)) { + if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) && + !pmd_write(vmf.orig_pmd)) { ret = wp_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; From patchwork Fri Dec 17 11:30:45 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 12684345 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D7DF3C433EF for ; Fri, 17 Dec 2021 11:34:18 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235532AbhLQLeS (ORCPT ); Fri, 17 Dec 2021 06:34:18 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:50542 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235517AbhLQLeL (ORCPT ); Fri, 17 Dec 2021 06:34:11 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639740851; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: 
in-reply-to:in-reply-to:references:references; bh=GcIVNk9FqOqzjRi5JqPcDJ4wYU12HMN6l6PoOYX4XeU=; b=SloD0UewO5ZzG3HL4trjWPnYfYz09MR+lAclOXDuhQmeQ9JJVt2eaZjxyP3uUjFojzaeug sm1V9XK6JLq+6Vz25daVoFkVZppARDympxcZSoy5is09VQxrv01yZzPeQeyjy4VPLJXx+K 5cLWf5lYJsaaF7knmu4N5vLgS9jWDow= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-540-oSDSQLLFP4mmgy0batSDew-1; Fri, 17 Dec 2021 06:34:07 -0500 X-MC-Unique: oSDSQLLFP4mmgy0batSDew-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 4FC52101AFAC; Fri, 17 Dec 2021 11:34:05 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.193.204]) by smtp.corp.redhat.com (Postfix) with ESMTP id 307788ACF7; Fri, 17 Dec 2021 11:33:59 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, David Hildenbrand Subject: [PATCH v1 07/11] mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (!hugetlb) Date: Fri, 17 Dec 2021 12:30:45 +0100 Message-Id: <20211217113049.23850-8-david@redhat.com> In-Reply-To: <20211217113049.23850-1-david@redhat.com> References: <20211217113049.23850-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org It is currently possible for a child process to observe modifications of anonymous pages by the parent process after fork() in some cases, which is not only a userspace visible violation of the POSIX semantics of MAP_PRIVATE, but more importantly a real security issue. This issue, including other related COW issues, has been summarized in [1]: " 1. Observing Memory Modifications of Private Pages From A Child Process Long story short: process-private memory might not be as private as you think once you fork(): successive modifications of private memory regions in the parent process can still be observed by the child process, for example, by smart use of vmsplice()+munmap(). The core problem is that pinning pages readable in a child process, such as done via the vmsplice system call, can result in a child process observing memory modifications done in the parent process the child is not supposed to observe. [1] contains an excellent summary and [2] contains further details. This issue was assigned CVE-2020-29374 [9]. For this to trigger, it's required to use a fork() without subsequent exec(), for example, as used under Android zygote. Without further details about an application that forks less-privileged child processes, one cannot really say what's actually affected and what's not -- see the details section the end of this mail for a short sshd/openssh analysis. 
While commit 17839856fd58 ("gup: document and work around "COW can break either way" issue") fixed this issue and resulted in other problems (e.g., ptrace on pmem), commit 09854ba94c6a ("mm: do_wp_page() simplification") re-introduced part of the problem unfortunately. The original reproducer can be modified quite easily to use THP [3] and make the issue appear again on upstream kernels. I modified it to use hugetlb [4] and it triggers as well. The problem is certainly less severe with hugetlb than with THP; it merely highlights that we still have plenty of open holes we should be closing/fixing. Regarding vmsplice(), the only known workaround is to disallow the vmsplice() system call ... or disable THP and hugetlb. But who knows what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in the end, it's a more generic issue. " This security issue / MAP_PRIVATE POSIX violation was first reported by Jann Horn on 27 May 2020 and it currently affects anonymous THP and hugetlb. Ordinary anonymous pages are currently not affected, because the COW logic was changed in commit 09854ba94c6a ("mm: do_wp_page() simplification") for them to COW on "page_count() != 1" instead of "mapcount > 1", which unfortunately results in other COW issues, some of them documented in [1] as well. To fix this COW issue once and for all, introduce GUP-triggered unsharing that can be conditionally triggered via FAULT_FLAG_UNSHARE. In contrast to traditional COW, unsharing will leave the copied page mapped write-protected in the page table, not having the semantics of a write fault. Logically, unsharing is triggered "early", as soon as GUP performs the action that could result in a COW getting missed later and the security issue triggering: however, unsharing is not triggered as before via a write fault with undesired side effects. GUP triggers unsharing if all of the following conditions are met: * The page is mapped R/O * We have an anonymous page, excluding KSM * We want to read (!FOLL_WRITE) * Unsharing is not disabled (!FOLL_NOUNSHARE) * We want to take a reference (FOLL_GET or FOLL_PIN) * The page is a shared anonymous page: mapcount > 1 As this patch introduces the same unsharing logic also for ordinary PTE-mapped anonymous pages, it also paves the way to fix the other known COW related issues documented in [1] without reintroducing the security issue or reintroducing other issues we observed in the past (e.g., broken ptrace on pmem). We better leave the follow_page() API alone: it's an internal API and its users don't actually allow for user space to read page content and they don't expect to get "NULL" for actually present pages -- because they usually don't trigger faults. Introduce and use FOLL_NOUNSHARE for that purpose. We could also think about using it for other corner cases, such as get_dump_page(). Note: GUP users that use memory notifiers to synchronize with the MM don't have to bother about unsharing: they don't actually take a reference on the pages and are properly synchronized against MM changes to never result in consistency issues. Add a TODO item that the mechanism should be extended to improve GUP long-term as a whole, avoiding the requirement for FOLL_WRITE|FOLL_FORCE. hugetlb case will be handled separately. This commit is based on prototype patches by Andrea. 
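Condensed into a sketch, the decision GUP makes before taking a read-only reference boils down to the conditions listed above. The snippet below is a simplified restatement of the gup_must_unshare() helper this patch adds, leaving out the hugetlb and THP head-page handling; it is illustrative only, not the exact code of the patch:

	/*
	 * Sketch only: mirrors the conditions above. The real
	 * gup_must_unshare() additionally special-cases hugetlb pages
	 * and THP head pages.
	 */
	static bool must_unshare_for_ro_pin(unsigned int foll_flags,
					    struct page *page)
	{
		/* Write access, or unsharing explicitly disabled: nothing to do. */
		if (foll_flags & (FOLL_WRITE | FOLL_NOUNSHARE))
			return false;
		/* Only relevant if a reference on the page will be taken. */
		if (!(foll_flags & (FOLL_GET | FOLL_PIN)))
			return false;
		/* Only COW-able anonymous memory; KSM pages are excluded. */
		if (!PageAnon(page) || PageKsm(page))
			return false;
		/* Shared anonymous page: still mapped more than once after fork(). */
		return page_mapcount(page) > 1;
	}

When this check fires during the page table walk, the lookup returns -EMLINK and __get_user_pages() retries the address via faultin_page() with FAULT_FLAG_UNSHARE set, analogous to how a write access to a write-protected page is retried with FAULT_FLAG_WRITE.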
[1] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com Co-developed-by: Andrea Arcangeli Signed-off-by: Andrea Arcangeli Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- include/linux/mm.h | 10 ++++++ mm/gup.c | 90 ++++++++++++++++++++++++++++++++++++++++++++-- mm/huge_memory.c | 7 ++++ 3 files changed, 104 insertions(+), 3 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 37d1fb2f865e..ebcdaed60701 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2975,6 +2975,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, #define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */ #define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */ #define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */ +#define FOLL_NOUNSHARE 0x100000 /* don't trigger unsharing on shared anon pages */ /* * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each @@ -3029,6 +3030,12 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, * releasing pages: get_user_pages*() pages must be released via put_page(), * while pin_user_pages*() pages must be released via unpin_user_page(). * + * FOLL_NOUNSHARE should be set when no unsharing should be triggered when + * eventually taking a read-only reference on a shared anonymous page, because + * we are sure that user space cannot use that reference for reading the page + * after eventually unmapping the page. FOLL_NOUNSHARE is implicitly set for the + * follow_page() API. + * * Please see Documentation/core-api/pin_user_pages.rst for more information. */ @@ -3043,6 +3050,9 @@ static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags) return 0; } +extern bool gup_must_unshare(unsigned int flags, struct page *page, + bool is_head); + typedef int (*pte_fn_t)(pte_t *pte, unsigned long addr, void *data); extern int apply_to_page_range(struct mm_struct *mm, unsigned long address, unsigned long size, pte_fn_t fn, void *data); diff --git a/mm/gup.c b/mm/gup.c index 2c51e9748a6a..2a83388c3fb4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -29,6 +29,53 @@ struct follow_page_context { unsigned int page_mask; }; +/* + * Indicates for which pages that are write-protected in the page table, + * whether GUP has to trigger unsharing via FAULT_FLAG_UNSHARE such that the + * GUP pin will remain consistent with the pages mapped into the page tables + * of the MM. + * + * This handling is required to guarantee that a child process that triggered + * a read-only GUP before unmapping the page of interest cannot observe + * modifications of shared anonymous pages with COW semantics in the parent + * after fork(). + * + * TODO: although the security issue described does no longer apply in any case, + * the full consistency between the pinned pages and the pages mapped into the + * page tables of the MM only apply to short-term pinnings only. For + * FOLL_LONGTERM, FOLL_WRITE|FOLL_FORCE is required for now, which can be + * inefficient and still result in some consistency issues. Extend this + * mechanism to also provide full synchronicity to FOLL_LONGTERM, avoiding + * FOLL_WRITE|FOLL_FORCE. + * + * This function is safe to be called in IRQ context. + */ +bool gup_must_unshare(unsigned int flags, struct page *page, bool is_head) +{ + /* We only care about read faults where unsharing is desired. 
*/ + if (flags & (FOLL_WRITE | FOLL_NOUNSHARE)) + return false; + /* + * We only care when the reference count of the page is to get + * increased. In particular, GUP users that rely on memory notifiers + * instead don't have to trigger unsharing. + */ + if (!(flags & (FOLL_GET|FOLL_PIN))) + return false; + if (!PageAnon(page)) + return false; + if (PageKsm(page)) + return false; + if (PageHuge(page)) + /* TODO: handle hugetlb as well. */ + return false; + if (is_head) { + VM_BUG_ON(!PageTransHuge(page)); + return page_trans_huge_mapcount(page, NULL) > 1; + } + return page_mapcount(page) > 1; +} + static void hpage_pincount_add(struct page *page, int refs) { VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); @@ -543,6 +590,14 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, } } + /* + * If unsharing is required, keep retrying to unshare until the + * page becomes exclusive. + */ + if (!pte_write(pte) && gup_must_unshare(flags, page, false)) { + page = ERR_PTR(-EMLINK); + goto out; + } /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */ if (unlikely(!try_grab_page(page, flags))) { page = ERR_PTR(-ENOMEM); @@ -790,6 +845,11 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma, * When getting pages from ZONE_DEVICE memory, the @ctx->pgmap caches * the device's dev_pagemap metadata to avoid repeating expensive lookups. * + * When getting an anonymous page and the caller has to trigger unsharing + * of a shared anonymous page first, -EMLINK is returned. The caller should + * trigger a fault with FAULT_FLAG_UNSHARE set. With FOLL_NOUNSHARE set, will + * never require unsharing and consequently not return -EMLINK. + * * On output, the @ctx->page_mask is set according to the size of the page. * * Return: the mapped (struct page *), %NULL if no mapping exists, or @@ -845,6 +905,12 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address, if (vma_is_secretmem(vma)) return NULL; + /* + * Don't require unsharing in case we stumble over a read-only mapped, + * shared anonymous page: this is an internal API only and callers don't + * actually use it for exposing page content to user space. + */ + foll_flags |= FOLL_NOUNSHARE; page = follow_page_mask(vma, address, foll_flags, &ctx); if (ctx.pgmap) put_dev_pagemap(ctx.pgmap); @@ -910,7 +976,8 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, * is, *@locked will be set to 0 and -EBUSY returned. 
*/ static int faultin_page(struct vm_area_struct *vma, - unsigned long address, unsigned int *flags, int *locked) + unsigned long address, unsigned int *flags, bool unshare, + int *locked) { unsigned int fault_flags = 0; vm_fault_t ret; @@ -935,6 +1002,12 @@ static int faultin_page(struct vm_area_struct *vma, */ fault_flags |= FAULT_FLAG_TRIED; } + if (unshare) { + VM_BUG_ON(unshare && *flags & FOLL_NOUNSHARE); + fault_flags |= FAULT_FLAG_UNSHARE; + /* FAULT_FLAG_WRITE and FAULT_FLAG_UNSHARE are incompatible */ + VM_BUG_ON(fault_flags & FAULT_FLAG_WRITE); + } ret = handle_mm_fault(vma, address, fault_flags, NULL); if (ret & VM_FAULT_ERROR) { @@ -1156,8 +1229,9 @@ static long __get_user_pages(struct mm_struct *mm, cond_resched(); page = follow_page_mask(vma, start, foll_flags, &ctx); - if (!page) { - ret = faultin_page(vma, start, &foll_flags, locked); + if (!page || PTR_ERR(page) == -EMLINK) { + ret = faultin_page(vma, start, &foll_flags, + PTR_ERR(page) == -EMLINK, locked); switch (ret) { case 0: goto retry; @@ -2311,6 +2385,11 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end, goto pte_unmap; } + if (!pte_write(pte) && gup_must_unshare(flags, page, false)) { + put_compound_head(head, 1, flags); + goto pte_unmap; + } + VM_BUG_ON_PAGE(compound_head(page) != head, page); /* @@ -2554,6 +2633,11 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, return 0; } + if (!pmd_write(orig) && gup_must_unshare(flags, head, true)) { + put_compound_head(head, refs, flags); + return 0; + } + *nr += refs; SetPageReferenced(head); return 1; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1685821525e8..57842e8b13d4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1375,6 +1375,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, page = pmd_page(*pmd); VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page); + /* + * If unsharing is required, keep retrying to unshare until the + * page becomes exclusive. 
+ */ + if (!pmd_write(*pmd) && gup_must_unshare(flags, page, true)) + return ERR_PTR(-EMLINK); + if (!try_grab_page(page, flags)) return ERR_PTR(-ENOMEM); From patchwork Fri Dec 17 11:30:46 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 12684347 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C4F93C43219 for ; Fri, 17 Dec 2021 11:34:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235504AbhLQLeU (ORCPT ); Fri, 17 Dec 2021 06:34:20 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:27689 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235514AbhLQLeS (ORCPT ); Fri, 17 Dec 2021 06:34:18 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639740857; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=SzUv6yJDwt+7H5HJzvGctMAG29y8XOFpvzAZtyudpbM=; b=HLbFVAnJaNtv3/8Vt2w6uUhQSesGyIpGpw9hh7FHwUudWFyQChlWxsESvpbu1/UbNj1vAs ww4kQN+649gs4gIzre+hLyKNjQofP03GqaewhFBWWURorQ3k2fvfHYDw6R58YHow3QSmHf 0Odre/Dv4RuUmDA3m8LC1/iSTQFHFU8= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-596-9MIwY5ONN1m72xO_4dQ4yg-1; Fri, 17 Dec 2021 06:34:14 -0500 X-MC-Unique: 9MIwY5ONN1m72xO_4dQ4yg-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id DEF7A1853020; Fri, 17 Dec 2021 11:34:11 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.193.204]) by smtp.corp.redhat.com (Postfix) with ESMTP id A8DB68ACF7; Fri, 17 Dec 2021 11:34:05 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, David Hildenbrand Subject: [PATCH v1 08/11] mm: hugetlb: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE Date: Fri, 17 Dec 2021 12:30:46 +0100 Message-Id: <20211217113049.23850-9-david@redhat.com> In-Reply-To: <20211217113049.23850-1-david@redhat.com> References: <20211217113049.23850-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Let's support FAULT_FLAG_UNSHARE to implement GUP-triggered unsharing, preparing for its use in the GUP paths when there is need to unshare a shared anonymous hugetlb page. We'll make use of it next by setting FAULT_FLAG_UNSHARE in case we detect that unsharing is necessary. 
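The end result in the hugetlb fault path can be sketched as follows (simplified; locking, reservation handling and the racing-update checks are omitted): a write fault on a write-protected PTE still takes the COW path, while FAULT_FLAG_UNSHARE takes the new unshare path and the copied page is installed write-protected.

	/* Sketch of the hugetlb_fault() dispatch after this patch. */
	if (flags & FAULT_FLAG_WRITE) {
		if (!huge_pte_write(entry))
			return wp_hugetlb_cow(mm, vma, address, ptep,
					      pagecache_page, ptl);
		entry = huge_pte_mkdirty(entry);
	} else if ((flags & FAULT_FLAG_UNSHARE) && !huge_pte_write(entry)) {
		/*
		 * Copies only if the page is a shared anonymous page, and
		 * keeps the new mapping read-only:
		 * make_huge_pte(vma, new_page, !unshare).
		 */
		return wp_hugetlb_unshare(mm, vma, address, ptep,
					  pagecache_page, ptl);
	}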
This commit is based on a prototype patch by Andrea. Co-developed-by: Andrea Arcangeli Signed-off-by: Andrea Arcangeli Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- mm/hugetlb.c | 86 ++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 63 insertions(+), 23 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a1baa198519a..5f2863b046ef 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5130,14 +5130,15 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma, } /* - * Hugetlb_cow() should be called with page lock of the original hugepage held. + * __wp_hugetlb() should be called with page lock of the original hugepage held. * Called with hugetlb_fault_mutex_table held and pte_page locked so we * cannot race with other handlers or page migration. * Keep the pte_same checks anyway to make transition from the mutex easier. */ -static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, - unsigned long address, pte_t *ptep, - struct page *pagecache_page, spinlock_t *ptl) +static __always_inline vm_fault_t +__wp_hugetlb(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, struct page *pagecache_page, + spinlock_t *ptl, bool unshare) { pte_t pte; struct hstate *h = hstate_vma(vma); @@ -5151,11 +5152,21 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, old_page = pte_page(pte); retry_avoidcopy: - /* If no-one else is actually using this page, avoid the copy - * and just make the page writable */ - if (page_mapcount(old_page) == 1 && PageAnon(old_page)) { - page_move_anon_rmap(old_page, vma); - set_huge_ptep_writable(vma, haddr, ptep); + if (!unshare) { + /* + * If no-one else is actually using this page, avoid the copy + * and just make the page writable. + */ + if (page_mapcount(old_page) == 1 && PageAnon(old_page)) { + page_move_anon_rmap(old_page, vma); + set_huge_ptep_writable(vma, haddr, ptep); + return 0; + } + } else if (!PageAnon(old_page) || page_mapcount(old_page) == 1) { + /* + * GUP-triggered unsharing only applies to shared anonymous + * pages. If that does no longer apply, there is nothing to do. 
+ */ return 0; } @@ -5256,11 +5267,11 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) { ClearHPageRestoreReserve(new_page); - /* Break COW */ + /* Break COW or unshare */ huge_ptep_clear_flush(vma, haddr, ptep); mmu_notifier_invalidate_range(mm, range.start, range.end); set_huge_pte_at(mm, haddr, ptep, - make_huge_pte(vma, new_page, 1)); + make_huge_pte(vma, new_page, !unshare)); page_remove_rmap(old_page, true); hugepage_add_new_anon_rmap(new_page, vma, haddr); SetHPageMigratable(new_page); @@ -5270,7 +5281,10 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, spin_unlock(ptl); mmu_notifier_invalidate_range_end(&range); out_release_all: - /* No restore in case of successful pagetable update (Break COW) */ + /* + * No restore in case of successful pagetable update (Break COW or + * unshare) + */ if (new_page != old_page) restore_reserve_on_error(h, vma, haddr, new_page); put_page(new_page); @@ -5281,6 +5295,23 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, return ret; } +static vm_fault_t +wp_hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, struct page *pagecache_page, + spinlock_t *ptl) +{ + return __wp_hugetlb(mm, vma, address, ptep, pagecache_page, ptl, + false); +} + +static vm_fault_t +wp_hugetlb_unshare(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long address, pte_t *ptep, + struct page *pagecache_page, spinlock_t *ptl) +{ + return __wp_hugetlb(mm, vma, address, ptep, pagecache_page, ptl, true); +} + /* Return the pagecache page at a given address within a VMA */ static struct page *hugetlbfs_pagecache_page(struct hstate *h, struct vm_area_struct *vma, unsigned long address) @@ -5393,7 +5424,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, /* * Currently, we are forced to kill the process in the event the * original mapper has unmapped pages from the child due to a failed - * COW. Warn that such a situation has occurred as it may not be obvious + * COW/unsharing. Warn that such a situation has occurred as it may not + * be obvious. */ if (is_vma_resv_set(vma, HPAGE_RESV_UNMAPPED)) { pr_warn_ratelimited("PID %d killed due to inadequate hugepage pool\n", @@ -5519,7 +5551,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, hugetlb_count_add(pages_per_huge_page(h), mm); if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) { /* Optimization, do the COW without a second fault */ - ret = hugetlb_cow(mm, vma, address, ptep, page, ptl); + ret = wp_hugetlb_cow(mm, vma, address, ptep, page, ptl); } spin_unlock(ptl); @@ -5649,14 +5681,15 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, goto out_mutex; /* - * If we are going to COW the mapping later, we examine the pending - * reservations for this page now. This will ensure that any + * If we are going to COW/unshare the mapping later, we examine the + * pending reservations for this page now. This will ensure that any * allocations necessary to record that reservation occur outside the * spinlock. For private mappings, we also lookup the pagecache * page now as it is used to determine if a reservation has been * consumed. 
*/ - if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) { + if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) && + !huge_pte_write(entry)) { if (vma_needs_reservation(h, vma, haddr) < 0) { ret = VM_FAULT_OOM; goto out_mutex; @@ -5671,14 +5704,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, ptl = huge_pte_lock(h, mm, ptep); - /* Check for a racing update before calling hugetlb_cow */ + /* + * Check for a racing update before calling wp_hugetlb_cow / + * wp_hugetlb_unshare + */ if (unlikely(!pte_same(entry, huge_ptep_get(ptep)))) goto out_ptl; /* - * hugetlb_cow() requires page locks of pte_page(entry) and - * pagecache_page, so here we need take the former one - * when page != pagecache_page or !pagecache_page. + * wp_hugetlb_cow()/wp_hugetlb_unshare() requires page locks of + * pte_page(entry) and pagecache_page, so here we need take the former + * one when page != pagecache_page or !pagecache_page. */ page = pte_page(entry); if (page != pagecache_page) @@ -5691,11 +5727,15 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (flags & FAULT_FLAG_WRITE) { if (!huge_pte_write(entry)) { - ret = hugetlb_cow(mm, vma, address, ptep, - pagecache_page, ptl); + ret = wp_hugetlb_cow(mm, vma, address, ptep, + pagecache_page, ptl); goto out_put_page; } entry = huge_pte_mkdirty(entry); + } else if (flags & FAULT_FLAG_UNSHARE && !huge_pte_write(entry)) { + ret = wp_hugetlb_unshare(mm, vma, address, ptep, pagecache_page, + ptl); + goto out_put_page; } entry = pte_mkyoung(entry); if (huge_ptep_set_access_flags(vma, haddr, ptep, entry, From patchwork Fri Dec 17 11:30:47 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 12684349 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9C4CFC433EF for ; Fri, 17 Dec 2021 11:34:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235535AbhLQLe1 (ORCPT ); Fri, 17 Dec 2021 06:34:27 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:27417 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235524AbhLQLeY (ORCPT ); Fri, 17 Dec 2021 06:34:24 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639740864; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=IwpaYH90QrdomBtkH4bbU2T/tiB0wo98uDxZc7uFD1s=; b=YvhhQZmbEWn1ltcPbgpeiXwsBSTZjGzKB5zDnwx1n5GxpZPQmChNZnolt4iJopps8Rq8tB 2LxDbz96Be2JBTqZF5hTU9YJIE3tGdNBBYVfYylP5fdmDDAbvb2e1yrC9Vy5pYZGAg7A9W M/1xLDQs0nDhv3GYknwn4TQeL2N4hY0= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-646-1Bp9m0YsN3OXQAACCFsDzA-1; Fri, 17 Dec 2021 06:34:21 -0500 X-MC-Unique: 1Bp9m0YsN3OXQAACCFsDzA-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 8C4C781CCB5; 
Fri, 17 Dec 2021 11:34:18 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.193.204]) by smtp.corp.redhat.com (Postfix) with ESMTP id 4B0BB8ACF7; Fri, 17 Dec 2021 11:34:12 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, David Hildenbrand Subject: [PATCH v1 09/11] mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (hugetlb) Date: Fri, 17 Dec 2021 12:30:47 +0100 Message-Id: <20211217113049.23850-10-david@redhat.com> In-Reply-To: <20211217113049.23850-1-david@redhat.com> References: <20211217113049.23850-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Similar to the !hugetlb variant, invoke unsharing for shared anonymous pages when required during GUP by setting FOLL_FAULT_UNSHARE in hugetlb code as well. FAULT_FLAG_UNSHARE will trigger unsharing of shared anonymous pages during GUP, resulting in a child process no longer being able to observe memory modifications performed by the parent after fork() to anonymous shared hugetlb pages. This commit is based on prototype patches by Andrea. Co-developed-by: Andrea Arcangeli Signed-off-by: Andrea Arcangeli Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- mm/gup.c | 3 +-- mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++++++++++---- 2 files changed, 40 insertions(+), 6 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index 2a83388c3fb4..35d1b28e3829 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -67,8 +67,7 @@ bool gup_must_unshare(unsigned int flags, struct page *page, bool is_head) if (PageKsm(page)) return false; if (PageHuge(page)) - /* TODO: handle hugetlb as well. */ - return false; + return __page_mapcount(page) > 1; if (is_head) { VM_BUG_ON(!PageTransHuge(page)); return page_trans_huge_mapcount(page, NULL) > 1; diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5f2863b046ef..dc42018ee1a6 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -5971,6 +5971,25 @@ static void record_subpages_vmas(struct page *page, struct vm_area_struct *vma, } } +static inline bool __follow_hugetlb_must_fault(unsigned int flags, pte_t *pte, + bool *unshare) +{ + pte_t pteval = huge_ptep_get(pte); + + *unshare = false; + if (is_swap_pte(pteval)) + return true; + if (huge_pte_write(pteval)) + return false; + if (flags & FOLL_WRITE) + return true; + if (gup_must_unshare(flags, pte_page(pteval), true)) { + *unshare = true; + return true; + } + return false; +} + long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, struct page **pages, struct vm_area_struct **vmas, unsigned long *position, unsigned long *nr_pages, @@ -5985,6 +6004,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, while (vaddr < vma->vm_end && remainder) { pte_t *pte; spinlock_t *ptl = NULL; + bool unshare; int absent; struct page *page; @@ -6035,9 +6055,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, * both cases, and because we can't follow correct pages * directly from any kind of swap entries. 
*/ - if (absent || is_swap_pte(huge_ptep_get(pte)) || - ((flags & FOLL_WRITE) && - !huge_pte_write(huge_ptep_get(pte)))) { + if (absent || + __follow_hugetlb_must_fault(flags, pte, &unshare)) { vm_fault_t ret; unsigned int fault_flags = 0; @@ -6045,6 +6064,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma, spin_unlock(ptl); if (flags & FOLL_WRITE) fault_flags |= FAULT_FLAG_WRITE; + else if (unshare) + fault_flags |= FAULT_FLAG_UNSHARE; if (locked) fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE; @@ -6734,7 +6755,21 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, goto out; pte = huge_ptep_get((pte_t *)pmd); if (pte_present(pte)) { - page = pmd_page(*pmd) + ((address & ~PMD_MASK) >> PAGE_SHIFT); + struct page *head_page = pmd_page(*pmd); + + /* + * follow_huge_pmd() is only called when coming via + * follow_page(), where we set FOLL_NOUNSHARE. Ordinary GUP + * goes via follow_hugetlb_page(), where we can properly unshare + * if required. + */ + if (WARN_ON_ONCE(!huge_pte_write(pte) && + gup_must_unshare(flags, head_page, true))) { + page = NULL; + goto out; + } + + page = head_page + ((address & ~PMD_MASK) >> PAGE_SHIFT); /* * try_grab_page() should always succeed here, because: a) we * hold the pmd (ptl) lock, and b) we've just checked that the From patchwork Fri Dec 17 11:30:48 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 12684351 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 92A4DC433EF for ; Fri, 17 Dec 2021 11:34:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235564AbhLQLed (ORCPT ); Fri, 17 Dec 2021 06:34:33 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.129.124]:38795 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235553AbhLQLec (ORCPT ); Fri, 17 Dec 2021 06:34:32 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639740871; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=KXmb+mz7gGLUO/1AsI3IaHmA9N+YeaV6BOIe6tD64xE=; b=fFhl5/EmL92Prt3kgkmqf/eF5v/UGeYHG7lfk5yHxNLp+8dWYKjo42FOqmWoGAO4mj8s4u q6gzvyEsyP2J9La9042LFYyvt561rLr+84fDeVE8echfGEwEyFcE0QRz794m+Iw1sFs1qC +bgzkfxVtEBHOAz8lqd8WrLoM3xM3qg= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-499-lmA_dabzOyaf-oSTM4FGww-1; Fri, 17 Dec 2021 06:34:28 -0500 X-MC-Unique: lmA_dabzOyaf-oSTM4FGww-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 1F8D436393; Fri, 17 Dec 2021 11:34:25 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.193.204]) by smtp.corp.redhat.com (Postfix) with ESMTP id ED7FB8ACF7; Fri, 17 Dec 2021 11:34:18 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus 
Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, David Hildenbrand Subject: [PATCH v1 10/11] mm: thp: introduce and use page_trans_huge_anon_shared() Date: Fri, 17 Dec 2021 12:30:48 +0100 Message-Id: <20211217113049.23850-11-david@redhat.com> In-Reply-To: <20211217113049.23850-1-david@redhat.com> References: <20211217113049.23850-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Let's add an optimized way to check "page_trans_huge_mapcount() > 1" that is allowed to break the loop early. This commit is based on a prototype patch by Andrea. Co-developed-by: Andrea Arcangeli Signed-off-by: Andrea Arcangeli Reviewed-by: Peter Xu Signed-off-by: David Hildenbrand --- include/linux/huge_mm.h | 7 +++++++ mm/gup.c | 2 +- mm/huge_memory.c | 34 ++++++++++++++++++++++++++++++++++ 3 files changed, 42 insertions(+), 1 deletion(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 44e02d47c65a..3a9d8cf64219 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -361,6 +361,8 @@ static inline void thp_mapcount_unlock(struct page *page, local_irq_restore(irq_flags); } +extern bool page_trans_huge_anon_shared(struct page *page); + #else /* CONFIG_TRANSPARENT_HUGEPAGE */ #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) @@ -532,6 +534,11 @@ static inline void thp_mapcount_unlock(struct page *page, { } +static inline bool page_trans_huge_anon_shared(struct page *page) +{ + return false; +} + #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ /** diff --git a/mm/gup.c b/mm/gup.c index 35d1b28e3829..496575ff9ac8 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -70,7 +70,7 @@ bool gup_must_unshare(unsigned int flags, struct page *page, bool is_head) return __page_mapcount(page) > 1; if (is_head) { VM_BUG_ON(!PageTransHuge(page)); - return page_trans_huge_mapcount(page, NULL) > 1; + return page_trans_huge_anon_shared(page); } return page_mapcount(page) > 1; } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 57842e8b13d4..dced82274f1d 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1281,6 +1281,40 @@ void huge_pmd_set_accessed(struct vm_fault *vmf) spin_unlock(vmf->ptl); } + +static bool __page_trans_huge_anon_shared(struct page *page) +{ + int i, mapcount; + + mapcount = head_compound_mapcount(page); + if (mapcount > 1) + return true; + if (PageDoubleMap(page)) + mapcount -= 1; + for (i = 0; i < thp_nr_pages(page); i++) { + if (atomic_read(&page[i]._mapcount) + mapcount + 1 > 1) + return true; + } + return false; +} + +/* A lightweight check corresponding to "page_trans_huge_mapcount() > 1". 
*/ +bool page_trans_huge_anon_shared(struct page *page) +{ + unsigned int seqcount; + bool shared; + + VM_BUG_ON_PAGE(PageHuge(page) || PageTail(page), page); + VM_BUG_ON_PAGE(!PageAnon(page) || !PageTransHuge(page), page); + + do { + seqcount = thp_mapcount_read_begin(page); + shared = __page_trans_huge_anon_shared(page); + } while (thp_mapcount_read_retry(page, seqcount)); + + return shared; +} + vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; From patchwork Fri Dec 17 11:30:49 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Hildenbrand X-Patchwork-Id: 12684353 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B288EC433FE for ; Fri, 17 Dec 2021 11:34:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233366AbhLQLen (ORCPT ); Fri, 17 Dec 2021 06:34:43 -0500 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]:34024 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233740AbhLQLem (ORCPT ); Fri, 17 Dec 2021 06:34:42 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639740882; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=DoeMhWrfTnNa/IdgPp0GUn3BjAQ1iIVspD+MIaFOgu0=; b=a0X4zYiBeyFDFD2aKdckdVQ1bZnFKWaRCZSlAz3y8j+60EgfiQStO65Qj3pBks/3JZD6tR WrjHekrFBoMmBaH8ooNX0oBZ6+ZV2QUMxAov+LNS8D3RAZ9VKxQ2svXZrpDGYQ7PHjdE3z UMxqL/zSbP2vvsp5OCuNjJRyv9goSYY= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-524-Xm9_H4w6OzugWwN2HMQujw-1; Fri, 17 Dec 2021 06:34:39 -0500 X-MC-Unique: Xm9_H4w6OzugWwN2HMQujw-1 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 4667036393; Fri, 17 Dec 2021 11:34:36 +0000 (UTC) Received: from t480s.redhat.com (unknown [10.39.193.204]) by smtp.corp.redhat.com (Postfix) with ESMTP id 813D88D5AC; Fri, 17 Dec 2021 11:34:25 +0000 (UTC) From: David Hildenbrand To: linux-kernel@vger.kernel.org Cc: Andrew Morton , Hugh Dickins , Linus Torvalds , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . 
Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , linux-mm@kvack.org, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, David Hildenbrand , Shuah Khan Subject: [PATCH v1 11/11] selftests/vm: add tests for the known COW security issues Date: Fri, 17 Dec 2021 12:30:49 +0100 Message-Id: <20211217113049.23850-12-david@redhat.com> In-Reply-To: <20211217113049.23850-1-david@redhat.com> References: <20211217113049.23850-1-david@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.13 Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org Let's make sure the security issue / MAP_PRIVATE violation of POSIX semantics doesn't reappear again using variations of the original vmsplice reproducer. Ideally, we'd also be test some more cases with R/O long-term pinnings -- but the existing mechanisms like RDMA or VFIO require rather complicated setups not suitable for simple selftests. In the future we might be able to add some O_DIRECT test and maybe extend the gup tests in the kernel accordingly. Using barrier() is a little clunky, but "volatile" seems to be in general frowned upon and makes checkpatch angry. Cc: Shuah Khan Signed-off-by: David Hildenbrand --- tools/testing/selftests/vm/Makefile | 1 + tools/testing/selftests/vm/gup_cow.c | 312 ++++++++++++++++++++++ tools/testing/selftests/vm/run_vmtests.sh | 16 ++ 3 files changed, 329 insertions(+) create mode 100644 tools/testing/selftests/vm/gup_cow.c diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 1607322a112c..dad6037d735f 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -27,6 +27,7 @@ CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS) LDLIBS = -lrt -lpthread TEST_GEN_FILES = compaction_test TEST_GEN_FILES += gup_test +TEST_GEN_FILES += gup_cow TEST_GEN_FILES += hmm-tests TEST_GEN_FILES += hugepage-mmap TEST_GEN_FILES += hugepage-mremap diff --git a/tools/testing/selftests/vm/gup_cow.c b/tools/testing/selftests/vm/gup_cow.c new file mode 100644 index 000000000000..9d44ed2ffdfc --- /dev/null +++ b/tools/testing/selftests/vm/gup_cow.c @@ -0,0 +1,312 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * GUP (Get User Pages) interaction with COW (Copy On Write) tests. + * + * Copyright 2021, Red Hat, Inc. 
+ * + * Author(s): David Hildenbrand + */ +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../kselftest.h" + +#define barrier() asm volatile("" ::: "memory") + +static size_t pagesize; +static size_t thpsize; +static size_t hugetlbsize; + +struct shared_mem { + bool parent_ready; + bool child_ready; +}; +struct shared_mem *shared; + +static size_t detect_thpsize(void) +{ + int fd = open("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", + O_RDONLY); + size_t size = 0; + char buf[15]; + int ret; + + if (fd < 0) + return 0; + + ret = pread(fd, buf, sizeof(buf), 0); + if (ret < 0 || ret == sizeof(buf)) + goto out; + buf[ret] = 0; + + size = strtoul(buf, NULL, 10); +out: + close(fd); + if (size < pagesize) + size = 0; + return size; +} + +static uint64_t pagemap_get_entry(int fd, void *addr) +{ + const unsigned long pfn = (unsigned long)addr / pagesize; + uint64_t entry; + int ret; + + ret = pread(fd, &entry, sizeof(entry), pfn * sizeof(entry)); + if (ret != sizeof(entry)) + ksft_exit_fail_msg("reading pagemap failed\n"); + return entry; +} + +static bool page_is_populated(void *addr) +{ + int fd = open("/proc/self/pagemap", O_RDONLY); + uint64_t entry; + bool ret; + + if (fd < 0) + ksft_exit_fail_msg("opening pagemap failed\n"); + + /* Present or swapped. */ + entry = pagemap_get_entry(fd, addr); + ret = !!(entry & 0xc000000000000000ull); + close(fd); + return ret; +} + +static int child_vmsplice_fn(unsigned char *mem, size_t size) +{ + struct iovec iov = { + .iov_base = mem, + .iov_len = size, + }; + size_t cur, total, transferred; + char *old, *new; + int fds[2]; + + old = malloc(size); + new = malloc(size); + + /* Backup the original content. */ + memcpy(old, mem, size); + + if (pipe(fds) < 0) + return -errno; + + /* Trigger a read-only pin. */ + transferred = vmsplice(fds[1], &iov, 1, 0); + if (transferred < 0) + return -errno; + if (transferred == 0) + return -EINVAL; + + /* Unmap it from our page tables. */ + if (munmap(mem, size) < 0) + return -errno; + + /* Wait until the parent modified it. */ + barrier(); + shared->child_ready = true; + barrier(); + while (!shared->parent_ready) + barrier(); + barrier(); + + /* See if we still read the old values. */ + total = 0; + while (total < transferred) { + cur = read(fds[0], new + total, transferred - total); + if (cur < 0) + return -errno; + total += cur; + } + + return memcmp(old, new, transferred); +} + +static void test_child_ro_gup(unsigned char *mem, size_t size) +{ + int ret; + + /* Populate the page. */ + memset(mem, 0, size); + + shared->parent_ready = false; + shared->child_ready = false; + barrier(); + + ret = fork(); + if (ret < 0) { + ksft_exit_fail_msg("fork failed\n"); + } else if (!ret) { + ret = child_vmsplice_fn(mem, size); + exit(ret); + } + + barrier(); + while (!shared->child_ready) + barrier(); + /* Modify the page. 
*/ + barrier(); + memset(mem, 0xff, size); + barrier(); + shared->parent_ready = true; + + wait(&ret); + if (WIFEXITED(ret)) + ret = WEXITSTATUS(ret); + else + ret = -EINVAL; + + ksft_test_result(!ret, "child has correct MAP_PRIVATE semantics\n"); +} + +static void test_anon_ro_gup_child(void) +{ + unsigned char *mem; + int ret; + + ksft_print_msg("[RUN] %s\n", __func__); + + mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (mem == MAP_FAILED) { + ksft_test_result_fail("mmap failed\n"); + return; + } + + ret = madvise(mem, pagesize, MADV_NOHUGEPAGE); + /* Ignore if not around on a kernel. */ + if (ret && ret != -EINVAL) { + ksft_test_result_fail("madvise failed\n"); + goto out; + } + + test_child_ro_gup(mem, pagesize); +out: + munmap(mem, pagesize); +} + +static void test_anon_thp_ro_gup_child(void) +{ + unsigned char *mem, *mmap_mem; + size_t mmap_size; + int ret; + + ksft_print_msg("[RUN] %s\n", __func__); + + if (!thpsize) + ksft_test_result_skip("THP size not detected\n"); + + mmap_size = 2 * thpsize; + mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (mmap_mem == MAP_FAILED) { + ksft_test_result_fail("mmap failed\n"); + return; + } + + mem = (unsigned char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1)); + + ret = madvise(mem, thpsize, MADV_HUGEPAGE); + if (ret) { + ksft_test_result_fail("madvise(MADV_HUGEPAGE) failed\n"); + goto out; + } + + /* + * Touch the first sub-page and test of we get another sub-page + * populated. + */ + mem[0] = 0; + if (!page_is_populated(mem + pagesize)) { + ksft_test_result_skip("Did not get a THP populated\n"); + goto out; + } + + test_child_ro_gup(mem, thpsize); +out: + munmap(mmap_mem, mmap_size); +} + +static void test_anon_hugetlb_ro_gup_child(void) +{ + unsigned char *mem, *dummy; + + ksft_print_msg("[RUN] %s\n", __func__); + + if (!hugetlbsize) + ksft_test_result_skip("hugetlb size not detected\n"); + + ksft_print_msg("[INFO] Assuming hugetlb size of %zd bytes\n", + hugetlbsize); + + mem = mmap(NULL, hugetlbsize, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0); + if (mem == MAP_FAILED) { + ksft_test_result_skip("need more free huge pages\n"); + return; + } + + /* + * We need a total of two hugetlb pages to handle COW/unsharing + * properly. + */ + dummy = mmap(NULL, hugetlbsize, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0); + if (dummy == MAP_FAILED) { + ksft_test_result_skip("need more free huge pages\n"); + goto out; + } + munmap(dummy, hugetlbsize); + + test_child_ro_gup(mem, hugetlbsize); +out: + munmap(mem, hugetlbsize); +} + +int main(int argc, char **argv) +{ + int err; + + pagesize = getpagesize(); + thpsize = detect_thpsize(); + /* For simplicity, we'll rely on the thp size. */ + hugetlbsize = thpsize; + + ksft_print_header(); + ksft_set_plan(3); + + /* We need an easy way to talk to our child. */ + shared = mmap(NULL, pagesize, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); + if (shared == MAP_FAILED) + ksft_exit_fail_msg("mmap(MAP_SHARED)\n"); + + /* + * Tests for the security issue reported by Jann Horn that originally + * resulted in CVE-2020-29374. More generally, it's a violation of + * POSIX MAP_PRIVATE semantics, because some other process can modify + * pages that are supposed to be private to one process. + * + * So let's test that process-private pages stay private using the + * known vmsplice reproducer. 
+ */ + test_anon_ro_gup_child(); + test_anon_thp_ro_gup_child(); + test_anon_hugetlb_ro_gup_child(); + + err = ksft_get_fail_cnt(); + if (err) + ksft_exit_fail_msg("%d out of %d tests failed\n", + err, ksft_test_num()); + return ksft_exit_pass(); +} diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh index a24d30af3094..80e441e0ae45 100755 --- a/tools/testing/selftests/vm/run_vmtests.sh +++ b/tools/testing/selftests/vm/run_vmtests.sh @@ -168,6 +168,22 @@ else echo "[PASS]" fi +echo "--------------------------------------------------------" +echo "running GUP interaction with COW tests" +echo "--------------------------------------------------------" +./gup_cow +ret_val=$? + +if [ $ret_val -eq 0 ]; then + echo "[PASS]" +elif [ $ret_val -eq $ksft_skip ]; then + echo "[SKIP]" + exitcode=$ksft_skip +else + echo "[FAIL]" + exitcode=1 +fi + echo "-------------------" echo "running userfaultfd" echo "-------------------"