From patchwork Sat Jan 18 23:15:47 2025
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13944257
Date: Sat, 18 Jan 2025 23:15:47 +0000
In-Reply-To: <20250118231549.1652825-1-jiaqiyan@google.com>
References: <20250118231549.1652825-1-jiaqiyan@google.com>
Message-ID: <20250118231549.1652825-2-jiaqiyan@google.com>
Subject: [RFC PATCH v1 1/3] mm: memfd/hugetlb: introduce userspace memory failure recovery policy
From: Jiaqi Yan <jiaqiyan@google.com>
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com, willy@infradead.org,
 jane.chu@oracle.com, akpm@linux-foundation.org, osalvador@suse.de,
 rientjes@google.com, duenwen@google.com, jthoughton@google.com,
 jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com,
 sidhartha.kumar@oracle.com, david@redhat.com, dave.hansen@linux.intel.com,
 muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Jiaqi Yan <jiaqiyan@google.com>

Sometimes immediately hard offlining a memory page that has uncorrected
memory errors (UEs) is not the best option, for capacity and/or
performance reasons. "Sometimes" even becomes "often" in cloud
scenarios; see the cover letter for descriptions of two such scenarios.
Therefore, whether a large chunk of contiguous memory mapped to
userspace (particularly memory serving guest RAM) is kept or discarded
after a recoverable UE should be controllable by the userspace process,
e.g. the VMM in a cloud environment.

Given the relevance of HugeTLB's non-ideal memory failure recovery
behavior, this commit uses HugeTLB as the "testbed" to demonstrate the
idea of a memfd-based userspace memory failure policy.

MFD_MF_KEEP_UE_MAPPED is added to the possible values for flags in the
memfd_create syscall. It is intended to be generic for any memfd, not
just HugeTLB, but the current implementation only covers HugeTLB. When
MFD_MF_KEEP_UE_MAPPED is set in flags, memory failure recovery in the
kernel does not hard offline memory due to a UE until the created memfd
is released or the affected memory region is truncated by userspace.
IOW, the HWPoison-ed memory remains accessible via the returned memfd
or the memory mapping created with that memfd. However, the affected
memory will be immediately protected and isolated from future use by
both kernel and userspace once the owning memfd is gone or the memory
is truncated. By default MFD_MF_KEEP_UE_MAPPED is not set, and the
kernel hard offlines memory having UEs.

Tested with the selftest in the follow-up patch.

This commit should probably be split into smaller pieces, but for now I
will defer that until this RFC becomes PATCH.
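For illustration, a minimal sketch of the intended userspace flow (not
part of this patch; it assumes the MFD_MF_KEEP_UE_MAPPED definition this
series adds to <linux/memfd.h>, and elides real VMM plumbing):

#define _GNU_SOURCE
#include <linux/memfd.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Back guest memory with 1G HugeTLB pages, opting in to keep
 * HWPoison-ed pages mapped until the memfd is released or truncated.
 */
static int create_guest_memory(size_t len, void **mem)
{
	int fd = memfd_create("guest_mem",
			      MFD_HUGETLB | MFD_HUGE_1GB | MFD_MF_KEEP_UE_MAPPED);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, len) < 0) {
		close(fd);
		return -1;
	}
	*mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (*mem == MAP_FAILED) {
		close(fd);
		return -1;
	}
	return fd;
}

The returned fd behaves like any other HugeTLB memfd; the only
difference is that a UE in the region no longer forces an immediate
hard offline.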
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 fs/hugetlbfs/inode.c       |  16 +++++
 include/linux/hugetlb.h    |   7 +++
 include/linux/pagemap.h    |  43 ++++++++++++++
 include/uapi/linux/memfd.h |   1 +
 mm/filemap.c               |  78 ++++++++++++++++++++++++
 mm/hugetlb.c               |  20 ++++++-
 mm/memfd.c                 |  15 ++++-
 mm/memory-failure.c        | 119 +++++++++++++++++++++++++++++++++----
 8 files changed, 282 insertions(+), 17 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 0fc179a598300..3c7812898717b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -576,6 +576,10 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 	pgoff_t next, index;
 	int i, freed = 0;
 	bool truncate_op = (lend == LLONG_MAX);
+	LIST_HEAD(hwp_folios);
+
+	/* Needs to be done before removing folios from the filemap. */
+	populate_memfd_hwp_folios(mapping, lstart >> PAGE_SHIFT, end, &hwp_folios);
 
 	folio_batch_init(&fbatch);
 	next = lstart >> PAGE_SHIFT;
@@ -605,6 +609,18 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 		(void)hugetlb_unreserve_pages(inode,
 				lstart >> huge_page_shift(h),
 				LONG_MAX, freed);
+	/*
+	 * hugetlbfs_error_remove_folio keeps the HWPoison-ed pages in the
+	 * page cache until mm wants to drop the folio at the end of the
+	 * filemap's lifetime. At this point, if memory failure was delayed
+	 * by AS_MF_KEEP_UE_MAPPED in the past, we can now deal with it.
+	 *
+	 * TODO: in V2 we can probably get rid of populate_memfd_hwp_folios
+	 * and hwp_folios, by inserting filemap_offline_hwpoison_folio
+	 * into somewhere in folio_batch_release, or into per file system's
+	 * free_folio handler.
+	 */
+	offline_memfd_hwp_folios(mapping, &hwp_folios);
 }
 
 static void hugetlbfs_evict_inode(struct inode *inode)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ec8c0ccc8f959..07d2a31146728 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -836,10 +836,17 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
 
 #ifdef CONFIG_MEMORY_FAILURE
 extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
+extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+						struct address_space *mapping);
 #else
 static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
 {
 }
+static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+						struct address_space *mapping)
+{
+	return false;
+}
 #endif
 
 #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index fc2e1319c7bb5..fad7093d232a9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -210,6 +210,12 @@ enum mapping_flags {
 	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying folio contents */
 	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
+	/*
+	 * Keeps folios belonging to the mapping mapped even if uncorrectable
+	 * memory errors (UE) caused memory failure (MF) within the folio. Only
+	 * at the end of the mapping's lifetime are its HWPoison-ed folios dealt with.
+	 */
+	AS_MF_KEEP_UE_MAPPED = 9,
 	/* Bits 16-25 are used for FOLIO_ORDER */
 	AS_FOLIO_ORDER_BITS = 5,
 	AS_FOLIO_ORDER_MIN = 16,
@@ -335,6 +341,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
 	return test_bit(AS_INACCESSIBLE, &mapping->flags);
 }
 
+static inline bool mapping_mf_keep_ue_mapped(struct address_space *mapping)
+{
+	return test_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
+}
+
+static inline void mapping_set_mf_keep_ue_mapped(struct address_space *mapping)
+{
+	set_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return mapping->gfp_mask;
@@ -1298,6 +1314,33 @@ void replace_page_cache_folio(struct folio *old, struct folio *new);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct folio_batch *fbatch);
 bool filemap_release_folio(struct folio *folio, gfp_t gfp);
+#ifdef CONFIG_MEMORY_FAILURE
+void populate_memfd_hwp_folios(struct address_space *mapping,
+			       pgoff_t lstart, pgoff_t lend,
+			       struct list_head *list);
+void offline_memfd_hwp_folios(struct address_space *mapping,
+			      struct list_head *list);
+/*
+ * Provided by memory failure to offline HWPoison-ed folios for various memory
+ * management systems (hugetlb, THP etc).
+ */
+void filemap_offline_hwpoison_folio(struct address_space *mapping,
+				    struct folio *folio);
+#else
+static inline void populate_memfd_hwp_folios(struct address_space *mapping,
+					     pgoff_t lstart, pgoff_t lend,
+					     struct list_head *list)
+{
+}
+static inline void offline_memfd_hwp_folios(struct address_space *mapping,
+					    struct list_head *list)
+{
+}
+static inline void filemap_offline_hwpoison_folio(struct address_space *mapping,
+						  struct folio *folio)
+{
+}
+#endif
 loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
 		int whence);
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 273a4e15dfcff..eb7a4ffcae6b9 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -12,6 +12,7 @@
 #define MFD_NOEXEC_SEAL		0x0008U
 /* executable */
 #define MFD_EXEC		0x0010U
+#define MFD_MF_KEEP_UE_MAPPED	0x0020U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/filemap.c b/mm/filemap.c
index b6494d2d3bc2a..5216889d12ecf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4427,3 +4427,81 @@ SYSCALL_DEFINE4(cachestat, unsigned int, fd,
 	return 0;
 }
 #endif /* CONFIG_CACHESTAT_SYSCALL */
+
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * To remember the HWPoison-ed folios within a mapping before removing every
+ * folio, a utility struct links them in a list.
+ */
+struct memfd_hwp_folio {
+	struct list_head node;
+	struct folio *folio;
+};
+/**
+ * populate_memfd_hwp_folios - populates HWPoison-ed folios.
+ * @mapping: The address_space of a memfd the kernel is trying to remove or truncate.
+ * @start: The starting page index.
+ * @end: The final page index (exclusive).
+ * @list: Where the HWPoison-ed folios will be stored.
+ *
+ * There may be pending HWPoison-ed folios when a memfd is being removed or
+ * part of it is being truncated. Store them in a linked list to be offlined
+ * after the file system removes them.
+ */
+void populate_memfd_hwp_folios(struct address_space *mapping,
+			       pgoff_t start, pgoff_t end,
+			       struct list_head *list)
+{
+	int i;
+	struct folio *folio;
+	struct memfd_hwp_folio *to_add;
+	struct folio_batch fbatch;
+	pgoff_t next = start;
+
+	if (!mapping_mf_keep_ue_mapped(mapping))
+		return;
+
+	folio_batch_init(&fbatch);
+	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
+		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+			folio = fbatch.folios[i];
+			if (!folio_test_hwpoison(folio))
+				continue;
+
+			to_add = kmalloc(sizeof(*to_add), GFP_KERNEL);
+			if (!to_add)
+				continue;
+
+			to_add->folio = folio;
+			list_add_tail(&to_add->node, list);
+		}
+		folio_batch_release(&fbatch);
+	}
+}
+EXPORT_SYMBOL_GPL(populate_memfd_hwp_folios);
+
+/**
+ * offline_memfd_hwp_folios - hard offline HWPoison-ed folios.
+ * @mapping: The address_space of a memfd the kernel is trying to remove or truncate.
+ * @list: Where the HWPoison-ed folios are stored. It will become empty when
+ *        offline_memfd_hwp_folios returns.
+ *
+ * After the file system has removed all the folios belonging to a memfd, the
+ * kernel can now hard offline all HWPoison-ed folios that were previously
+ * pending. The caller needs to exclusively own @list as no locking is
+ * provided here, and @list is entirely consumed here.
+ */
+void offline_memfd_hwp_folios(struct address_space *mapping,
+			      struct list_head *list)
+{
+	struct memfd_hwp_folio *curr, *temp;
+
+	list_for_each_entry_safe(curr, temp, list, node) {
+		filemap_offline_hwpoison_folio(mapping, curr->folio);
+		list_del(&curr->node);
+		kfree(curr);
+	}
+}
+EXPORT_SYMBOL_GPL(offline_memfd_hwp_folios);
+
+#endif /* CONFIG_MEMORY_FAILURE */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87761b042ed04..35e88d7fc2793 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6091,6 +6091,18 @@ static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, unsigned
 	return same;
 }
 
+bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+					 struct address_space *mapping)
+{
+	if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
+		return false;
+
+	if (!mapping)
+		return false;
+
+	return mapping_mf_keep_ue_mapped(mapping);
+}
+
 static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 		struct vm_fault *vmf)
 {
@@ -6214,9 +6226,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	 * So we need to block hugepage fault by PG_hwpoison bit check.
 	 */
 	if (unlikely(folio_test_hwpoison(folio))) {
-		ret = VM_FAULT_HWPOISON_LARGE |
-		      VM_FAULT_SET_HINDEX(hstate_index(h));
-		goto backout_unlocked;
+		if (!mapping_mf_keep_ue_mapped(mapping)) {
+			ret = VM_FAULT_HWPOISON_LARGE |
+			      VM_FAULT_SET_HINDEX(hstate_index(h));
+			goto backout_unlocked;
+		}
 	}
 
 	/* Check for page in userfault range. */
diff --git a/mm/memfd.c b/mm/memfd.c
index 37f7be57c2f50..ddb9e988396c7 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -302,7 +302,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_NOEXEC_SEAL | MFD_EXEC | MFD_MF_KEEP_UE_MAPPED)
 
 static int check_sysctl_memfd_noexec(unsigned int *flags)
 {
@@ -376,6 +377,8 @@ static int sanitize_flags(unsigned int *flags_ptr)
 	if (!(flags & MFD_HUGETLB)) {
 		if (flags & ~(unsigned int)MFD_ALL_FLAGS)
 			return -EINVAL;
+		if (flags & MFD_MF_KEEP_UE_MAPPED)
+			return -EINVAL;
 	} else {
 		/* Allow huge page size encoding in flags. */
 		if (flags & ~(unsigned int)(MFD_ALL_FLAGS |
@@ -436,6 +439,16 @@ static struct file *alloc_file(const char *name, unsigned int flags)
 	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
 	file->f_flags |= O_LARGEFILE;
 
+	/*
+	 * MFD_MF_KEEP_UE_MAPPED can only be specified at memfd_create time;
+	 * there is no API to update it once the memfd is created. It is not
+	 * seal-able.
+	 *
+	 * TODO: MFD_MF_KEEP_UE_MAPPED is not supported by all file systems yet.
+	 */
+	if (flags & MFD_MF_KEEP_UE_MAPPED)
+		mapping_set_mf_keep_ue_mapped(file->f_mapping);
+
 	if (flags & MFD_NOEXEC_SEAL) {
 		struct inode *inode = file_inode(file);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a7b8ccd29b6f5..f43607fb4310e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -445,11 +445,13 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
  */
-static void __add_to_kill(struct task_struct *tsk, const struct page *p,
+static void __add_to_kill(struct task_struct *tsk, struct page *p,
 			  struct vm_area_struct *vma, struct list_head *to_kill,
 			  unsigned long addr)
 {
 	struct to_kill *tk;
+	struct folio *folio;
+	struct address_space *mapping;
 
 	tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
 	if (!tk) {
@@ -460,8 +462,20 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
 	tk->addr = addr;
 	if (is_zone_device_page(p))
 		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
-	else
-		tk->size_shift = folio_shift(page_folio(p));
+	else {
+		folio = page_folio(p);
+		mapping = folio_mapping(folio);
+		if (mapping && mapping_mf_keep_ue_mapped(mapping))
+			/*
+			 * Let userspace know the blast radius of the hardware
+			 * poison is a single raw page; as long as it avoids
+			 * accessing that page, other pages inside the folio
+			 * are still safe to access.
+			 */
+			tk->size_shift = PAGE_SHIFT;
+		else
+			tk->size_shift = folio_shift(folio);
+	}
 
 	/*
 	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
@@ -486,7 +500,7 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
 	list_add_tail(&tk->nd, to_kill);
 }
 
-static void add_to_kill_anon_file(struct task_struct *tsk, const struct page *p,
+static void add_to_kill_anon_file(struct task_struct *tsk, struct page *p,
 				  struct vm_area_struct *vma, struct list_head *to_kill,
 				  unsigned long addr)
 {
@@ -607,7 +621,7 @@ struct task_struct *task_early_kill(struct task_struct *tsk, int force_early)
  * Collect processes when the error hit an anonymous page.
 */
 static void collect_procs_anon(const struct folio *folio,
-		const struct page *page, struct list_head *to_kill,
+		struct page *page, struct list_head *to_kill,
 		int force_early)
 {
 	struct task_struct *tsk;
@@ -645,7 +659,7 @@ static void collect_procs_anon(const struct folio *folio,
  * Collect processes when the error hit a file mapped page.
  */
 static void collect_procs_file(const struct folio *folio,
-		const struct page *page, struct list_head *to_kill,
+		struct page *page, struct list_head *to_kill,
 		int force_early)
 {
 	struct vm_area_struct *vma;
@@ -727,7 +741,7 @@ static void collect_procs_fsdax(const struct page *page,
 /*
  * Collect the processes who have the corrupted page mapped to kill.
  */
-static void collect_procs(const struct folio *folio, const struct page *page,
+static void collect_procs(const struct folio *folio, struct page *page,
 		struct list_head *tokill, int force_early)
 {
 	if (!folio->mapping)
@@ -1226,6 +1240,13 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		}
 	}
 
+	/*
+	 * MF still needs to hold a refcount for the deferred actions in
+	 * filemap_offline_hwpoison_folio.
+	 */
+	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+		return res;
+
 	if (has_extra_refcount(ps, p, extra_pins))
 		res = MF_FAILED;
 
@@ -1593,6 +1614,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	bool unmap_success;
+	bool keep_mapped;
 	int forcekill;
 	bool mlocked = folio_test_mlocked(folio);
 
@@ -1643,10 +1665,12 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
 	 */
 	collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
 
-	unmap_poisoned_folio(folio, ttu);
+	keep_mapped = hugetlb_should_keep_hwpoison_mapped(folio, mapping);
+	if (!keep_mapped)
+		unmap_poisoned_folio(folio, ttu);
 
 	unmap_success = !folio_mapped(folio);
-	if (!unmap_success)
+	if (!unmap_success && !keep_mapped)
 		pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
 		       pfn, folio_mapcount(folio));
 
@@ -1671,7 +1695,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
 		       !unmap_success;
 	kill_procs(&tokill, forcekill, pfn, flags);
 
-	return unmap_success;
+	return unmap_success || keep_mapped;
 }
 
 static int identify_page_state(unsigned long pfn, struct page *p,
@@ -1911,6 +1935,9 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
 	unsigned long count = 0;
 
 	head = llist_del_all(raw_hwp_list_head(folio));
+	if (head == NULL)
+		return 0;
+
 	llist_for_each_entry_safe(p, next, head, node) {
 		if (move_flag)
 			SetPageHWPoison(p->page);
@@ -1927,7 +1954,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
 	struct llist_head *head;
 	struct raw_hwp_page *raw_hwp;
 	struct raw_hwp_page *p;
-	int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
+	struct address_space *mapping = folio->mapping;
+	bool has_hwpoison = folio_test_set_hwpoison(folio);
 
 	/*
 	 * Once the hwpoison hugepage has lost reliable raw error info,
 	 * there is little meaning to keep additional hwpoison bits for the
@@ -1946,8 +1974,15 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
 	if (raw_hwp) {
 		raw_hwp->page = page;
 		llist_add(&raw_hwp->node, head);
+		if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+			/*
+			 * A new raw HWPoison page. Don't return HWPOISON.
+			 * The error event will be counted in action_result().
+			 */
+			return 0;
+
 		/* the first error event will be counted in action_result(). */
-		if (ret)
+		if (has_hwpoison)
 			num_poisoned_pages_inc(page_to_pfn(page));
 	} else {
 		/*
@@ -1962,7 +1997,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
 		 */
 		__folio_free_raw_hwp(folio, false);
 	}
-	return ret;
+
+	return has_hwpoison ? -EHWPOISON : 0;
 }
 
 static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
@@ -2051,6 +2087,63 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
 	return ret;
 }
 
+static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
+{
+	int ret;
+	struct llist_node *head;
+	struct raw_hwp_page *curr, *next;
+	struct page *page;
+	unsigned long pfn;
+
+	head = llist_del_all(raw_hwp_list_head(folio));
+
+	/*
+	 * Release the references held by try_memory_failure_hugetlb, one per
+	 * HWPoison-ed page in the raw hwp list. This folio's refcount is
+	 * expected to drop to zero after the for-each loop below.
+	 */
+	llist_for_each_entry(curr, head, node)
+		folio_put(folio);
+
+	ret = dissolve_free_hugetlb_folio(folio);
+	if (ret) {
+		pr_err("failed to dissolve hugetlb folio: %d\n", ret);
+		llist_for_each_entry_safe(curr, next, head, node) {
+			page = curr->page;
+			pfn = page_to_pfn(page);
+			/*
+			 * TODO: roll back the count incremented during online
+			 * handling, i.e. whatever me_huge_page returns.
+			 */
+			update_per_node_mf_stats(pfn, MF_FAILED);
+			kfree(curr);
+		}
+		return;
+	}
+
+	llist_for_each_entry_safe(curr, next, head, node) {
+		page = curr->page;
+		pfn = page_to_pfn(page);
+		drain_all_pages(page_zone(page));
+		if (PageBuddy(page) && !take_page_off_buddy(page))
+			pr_warn("%#lx: unable to take off buddy allocator\n", pfn);
+
+		SetPageHWPoison(page);
+		page_ref_inc(page);
+		kfree(curr);
+		pr_info("%#lx: pending hard offline completed\n", pfn);
+	}
+}
+
+void filemap_offline_hwpoison_folio(struct address_space *mapping,
+				    struct folio *folio)
+{
+	WARN_ON_ONCE(!mapping);
+
+	/* Pending MFR currently only exists for hugetlb. */
+	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+		filemap_offline_hwpoison_folio_hugetlb(folio);
+}
+
 /*
  * Taking refcount of hugetlb pages needs extra care about race conditions
  * with basic operations like hugepage allocation/free/demotion.
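For illustration, a sketch of how a process might consume the
PAGE_SHIFT-sized blast radius advertised above via SIGBUS
(BUS_MCEERR_AR/BUS_MCEERR_AO and si_addr_lsb are existing siginfo
fields; mark_gpa_range_poisoned is a hypothetical VMM helper):

#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical VMM helper: record the lost range and stop using it. */
extern void mark_gpa_range_poisoned(uintptr_t start, size_t len);

/* Install with sigaction(SIGBUS, ...) and SA_SIGINFO set. */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	if (info->si_code != BUS_MCEERR_AR && info->si_code != BUS_MCEERR_AO)
		return;

	/*
	 * With MFD_MF_KEEP_UE_MAPPED, si_addr_lsb is PAGE_SHIFT: only one
	 * raw page is poisoned, the rest of the hugepage stays usable.
	 */
	size_t bad_len = (size_t)1 << info->si_addr_lsb;
	uintptr_t bad_start = (uintptr_t)info->si_addr & ~(uintptr_t)(bad_len - 1);

	mark_gpa_range_poisoned(bad_start, bad_len);
}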
From patchwork Sat Jan 18 23:15:48 2025
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13944258
Date: Sat, 18 Jan 2025 23:15:48 +0000
In-Reply-To: <20250118231549.1652825-1-jiaqiyan@google.com>
References: <20250118231549.1652825-1-jiaqiyan@google.com>
Message-ID: <20250118231549.1652825-3-jiaqiyan@google.com>
Subject: [RFC PATCH v1 2/3] selftests/mm: test userspace MFR for HugeTLB 1G hugepage
From: Jiaqi Yan <jiaqiyan@google.com>
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com, willy@infradead.org,
 jane.chu@oracle.com, akpm@linux-foundation.org, osalvador@suse.de,
 rientjes@google.com, duenwen@google.com, jthoughton@google.com,
 jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com,
 sidhartha.kumar@oracle.com, david@redhat.com, dave.hansen@linux.intel.com,
 muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Jiaqi Yan <jiaqiyan@google.com>

Tests the userspace memory failure recovery (MFR) policy for the
HugeTLB 1G hugepage case:
1. Creates a memfd backed by 1G HugeTLB with MFD_MF_KEEP_UE_MAPPED set.
2. Allocates and maps a 1G hugepage into the process.
3. Creates sub-threads to MADV_HWPOISON inner addresses of the hugepage.
4. Checks that the process gets a correct SIGBUS for each poisoned raw
   page.
5. Checks that all memory is still accessible and its content valid.
6. Checks that the poisoned 1G hugepage is dealt with after the memfd is
   released.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 tools/testing/selftests/mm/.gitignore    |   1 +
 tools/testing/selftests/mm/Makefile      |   1 +
 tools/testing/selftests/mm/hugetlb-mfr.c | 267 +++++++++++++++++++++++
 3 files changed, 269 insertions(+)
 create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index 121000c28c105..e65a1fa43f868 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -5,6 +5,7 @@ hugepage-mremap
 hugepage-shm
 hugepage-vmemmap
 hugetlb-madvise
+hugetlb-mfr
 hugetlb-read-hwpoison
 hugetlb-soft-offline
 khugepaged
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 63ce39d024bb5..171a9e65ed357 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -62,6 +62,7 @@ TEST_GEN_FILES += hmm-tests
 TEST_GEN_FILES += hugetlb-madvise
 TEST_GEN_FILES += hugetlb-read-hwpoison
 TEST_GEN_FILES += hugetlb-soft-offline
+TEST_GEN_FILES += hugetlb-mfr
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-mremap
 TEST_GEN_FILES += hugepage-shm
diff --git a/tools/testing/selftests/mm/hugetlb-mfr.c b/tools/testing/selftests/mm/hugetlb-mfr.c
new file mode 100644
index 0000000000000..cb20b81984f5e
--- /dev/null
+++ b/tools/testing/selftests/mm/hugetlb-mfr.c
@@ -0,0 +1,267 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Tests the userspace memory failure recovery (MFR) policy for the HugeTLB
+ * 1G hugepage case:
+ * 1. Creates a memfd backed by 1G HugeTLB with the MFD_MF_KEEP_UE_MAPPED
+ *    bit set.
+ * 2. Allocates and maps a 1G hugepage.
+ * 3. Creates sub-threads to MADV_HWPOISON inner addresses of the hugepage.
+ * 4. Checks that each sub-thread gets a correct SIGBUS for each poisoned
+ *    raw page.
+ * 5. Checks that all memory is still accessible and its content still valid.
+ * 6. Checks that the poisoned 1G hugepage is dealt with after the memfd is
+ *    released.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <linux/magic.h>
+#include <linux/memfd.h>
+#include <linux/mman.h>
+#include <sys/mman.h>
+#include <sys/statfs.h>
+#include <sys/types.h>
+
+#include "../kselftest.h"
+#include "vm_util.h"
+
+#define EPREFIX " !!! "
+#define BYTE_LENGTH_IN_1G 0x40000000
+#define HUGETLB_FILL 0xab
+
+static void *sigbus_addr;
+static int sigbus_addr_lsb;
+static bool expecting_sigbus;
+static bool got_sigbus;
+static bool was_mceerr;
+
+static int create_hugetlbfs_file(struct statfs *file_stat)
+{
+	int fd;
+	int flags = MFD_HUGETLB | MFD_HUGE_1GB | MFD_MF_KEEP_UE_MAPPED;
+
+	fd = memfd_create("hugetlb_tmp", flags);
+	if (fd < 0)
+		ksft_exit_fail_perror("Failed to memfd_create");
+
+	memset(file_stat, 0, sizeof(*file_stat));
+	if (fstatfs(fd, file_stat)) {
+		close(fd);
+		ksft_exit_fail_perror("Failed to fstatfs");
+	}
+	if (file_stat->f_type != HUGETLBFS_MAGIC) {
+		close(fd);
+		ksft_exit_fail_msg("Not a hugetlbfs file\n");
+	}
+
+	ksft_print_msg("Created hugetlb_tmp file\n");
+	ksft_print_msg("hugepagesize=%#lx\n", file_stat->f_bsize);
+	if (file_stat->f_bsize != BYTE_LENGTH_IN_1G)
+		ksft_exit_fail_msg("Hugepage size is not 1G\n");
+
+	return fd;
+}
+
+/*
+ * SIGBUS handler for the "do_hwpoison" thread that MADV_HWPOISONs the
+ * mapped hugepage.
+ */
+static void sigbus_handler(int signo, siginfo_t *info, void *context)
+{
+	if (!expecting_sigbus)
+		ksft_exit_fail_msg("unexpected sigbus with addr=%p",
+				   info->si_addr);
+
+	got_sigbus = true;
+	was_mceerr = (info->si_code == BUS_MCEERR_AO ||
+		      info->si_code == BUS_MCEERR_AR);
+	sigbus_addr = info->si_addr;
+	sigbus_addr_lsb = info->si_addr_lsb;
+}
+
+static void *do_hwpoison(void *hwpoison_addr)
+{
+	int hwpoison_size = getpagesize();
+
+	ksft_print_msg("MADV_HWPOISON hwpoison_addr=%p, len=%d\n",
+		       hwpoison_addr, hwpoison_size);
+	if (madvise(hwpoison_addr, hwpoison_size, MADV_HWPOISON) < 0)
+		ksft_exit_fail_perror("Failed to MADV_HWPOISON");
+
+	pthread_exit(NULL);
+}
+
+static void test_hwpoison_multiple_pages(unsigned char *start_addr)
+{
+	pthread_t pthread;
+	int ret;
+	unsigned char *hwpoison_addr;
+	unsigned long offsets[] = {0x200000, 0x400000, 0x800000};
+
+	for (size_t i = 0; i < ARRAY_SIZE(offsets); ++i) {
+		sigbus_addr = (void *)0xBADBADBAD;
+		sigbus_addr_lsb = 0;
+		was_mceerr = false;
+		got_sigbus = false;
+		expecting_sigbus = true;
+		hwpoison_addr = start_addr + offsets[i];
+
+		ret = pthread_create(&pthread, NULL, &do_hwpoison, hwpoison_addr);
+		if (ret)
+			ksft_exit_fail_perror("Failed to create hwpoison thread");
+
+		ksft_print_msg("Created thread to hwpoison and access hwpoison_addr=%p\n",
+			       hwpoison_addr);
+
+		pthread_join(pthread, NULL);
+
+		if (!got_sigbus)
+			ksft_test_result_fail("Didn't get a SIGBUS\n");
+		if (!was_mceerr)
+			ksft_test_result_fail("Didn't get a BUS_MCEERR_A(R|O)\n");
+		if (sigbus_addr != hwpoison_addr)
+			ksft_test_result_fail("Incorrect address: got=%p, expected=%p\n",
+					      sigbus_addr, hwpoison_addr);
+		if (sigbus_addr_lsb != pshift())
+			ksft_test_result_fail("Incorrect address LSB: got=%d, expected=%d\n",
+					      sigbus_addr_lsb, pshift());
+
+		ksft_print_msg("Received expected and correct SIGBUS\n");
+	}
+}
+
+static int read_nr_hugepages(unsigned long hugepage_size,
+			     unsigned long *nr_hugepages)
+{
+	char buffer[256] = {0};
+	char cmd[256] = {0};
+
+	sprintf(cmd, "cat /sys/kernel/mm/hugepages/hugepages-%ldkB/nr_hugepages",
+		hugepage_size);
+	FILE *cmdfile = popen(cmd, "r");
+
+	if (cmdfile == NULL) {
+		ksft_perror(EPREFIX "failed to popen nr_hugepages");
+		return -1;
+	}
+
+	if (!fgets(buffer, sizeof(buffer), cmdfile)) {
+		ksft_perror(EPREFIX "failed to read nr_hugepages");
+		pclose(cmdfile);
+		return -1;
+	}
+
+	*nr_hugepages = atoll(buffer);
+	pclose(cmdfile);
+	return 0;
+}
+
+/*
+ * Main thread that drives the test.
+ */
+static void test_main(int fd, size_t len)
+{
+	unsigned char *map, *iter;
+	struct sigaction new, old;
+	const unsigned long hugepagesize_kb = BYTE_LENGTH_IN_1G / 1024;
+	unsigned long nr_hugepages_before = 0;
+	unsigned long nr_hugepages_after = 0;
+
+	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_before) != 0) {
+		close(fd);
+		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+	}
+	ksft_print_msg("NR hugepages before MADV_HWPOISON is %ld\n", nr_hugepages_before);
+
+	if (ftruncate(fd, len) < 0)
+		ksft_exit_fail_perror("Failed to ftruncate");
+
+	ksft_print_msg("Allocated %#lx bytes to HugeTLB file\n", len);
+
+	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (map == MAP_FAILED)
+		ksft_exit_fail_msg("Failed to mmap\n");
+
+	ksft_print_msg("Created HugeTLB mapping: %p\n", map);
+
+	memset(map, HUGETLB_FILL, len);
+	ksft_print_msg("Memset every byte to 0xab\n");
+
+	new.sa_sigaction = &sigbus_handler;
+	new.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGBUS, &new, &old) < 0)
+		ksft_exit_fail_msg("Failed to setup SIGBUS handler\n");
+
+	ksft_print_msg("Setup SIGBUS handler successfully\n");
+
+	test_hwpoison_multiple_pages(map);
+
+	/*
+	 * Since MADV_HWPOISON doesn't corrupt the memory in hardware, and
+	 * MFD_MF_KEEP_UE_MAPPED keeps the hugepage mapped, every byte should
+	 * remain accessible and hold the original data.
+	 */
+	expecting_sigbus = false;
+	for (iter = map; iter < map + len; ++iter) {
+		if (*iter != HUGETLB_FILL) {
+			ksft_print_msg("At addr=%p: got=%#x, expected=%#x\n",
+				       iter, *iter, HUGETLB_FILL);
+			ksft_test_result_fail("Memory content corrupted\n");
+			break;
+		}
+	}
+	ksft_print_msg("Memory content all valid\n");
+
+	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
+		close(fd);
+		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+	}
+
+	/*
+	 * After MADV_HWPOISON, the hugepage should still be in the HugeTLB
+	 * pool.
+	 */
+	ksft_print_msg("NR hugepages after MADV_HWPOISON is %ld\n", nr_hugepages_after);
+	if (nr_hugepages_before != nr_hugepages_after)
+		ksft_test_result_fail("NR hugepages reduced by %ld after MADV_HWPOISON\n",
+				      nr_hugepages_before - nr_hugepages_after);
+
+	/* End of the lifetime of the created HugeTLB memfd. */
+	if (ftruncate(fd, 0) < 0)
+		ksft_exit_fail_perror("Failed to ftruncate to 0");
+	munmap(map, len);
+	close(fd);
+
+	/*
+	 * After being freed by userspace, the MADV_HWPOISON-ed hugepage should
+	 * be dissolved into raw pages and removed from the HugeTLB pool.
+	 */
+	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0)
+		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+
+	ksft_print_msg("NR hugepages after closure is %ld\n", nr_hugepages_after);
+	if (nr_hugepages_before != nr_hugepages_after + 1)
+		ksft_test_result_fail("NR hugepages is not reduced after memfd closure\n");
+
+	ksft_test_result_pass("All done\n");
+}
+
+int main(int argc, char **argv)
+{
+	int fd;
+	struct statfs file_stat;
+	size_t len = BYTE_LENGTH_IN_1G;
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	fd = create_hugetlbfs_file(&file_stat);
+	test_main(fd, len);
+
+	ksft_finished();
+}

From patchwork Sat Jan 18 23:15:49 2025
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13944259
Date: Sat, 18 Jan 2025 23:15:49 +0000
In-Reply-To: <20250118231549.1652825-1-jiaqiyan@google.com>
References: <20250118231549.1652825-1-jiaqiyan@google.com>
Message-ID: <20250118231549.1652825-4-jiaqiyan@google.com>
Subject: [RFC PATCH v1 3/3] Documentation: add userspace MF recovery policy via memfd
From: Jiaqi Yan <jiaqiyan@google.com>
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com, willy@infradead.org,
 jane.chu@oracle.com, akpm@linux-foundation.org, osalvador@suse.de,
 rientjes@google.com, duenwen@google.com, jthoughton@google.com,
 jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com,
 sidhartha.kumar@oracle.com, david@redhat.com, dave.hansen@linux.intel.com,
 muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Jiaqi Yan <jiaqiyan@google.com>

Document the motivation and userspace API of the memfd-based memory
failure recovery policy.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 Documentation/userspace-api/index.rst          |  1 +
 .../userspace-api/mfd_mfr_policy.rst           | 55 +++++++++++++++++++
 2 files changed, 56 insertions(+)
 create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 274cc7546efc2..0f9783b8807ea 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -63,6 +63,7 @@ Everything else
    vduse
    futex2
    perf_ring_buffer
+   mfd_mfr_policy
 
 .. only:: subproject and html
 
diff --git a/Documentation/userspace-api/mfd_mfr_policy.rst b/Documentation/userspace-api/mfd_mfr_policy.rst
new file mode 100644
index 0000000000000..d4557693c2c40
--- /dev/null
+++ b/Documentation/userspace-api/mfd_mfr_policy.rst
@@ -0,0 +1,55 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Userspace Memory Failure Recovery Policy via memfd
+==================================================
+
+:Author:
+   Jiaqi Yan <jiaqiyan@google.com>
+
+
+Motivation
+==========
+
+When a userspace process is able to recover from memory failures (MF)
+caused by uncorrected memory errors (UE) in the DIMM, especially when it
+is able to avoid consuming known UEs, keeping the memory page mapped and
+accessible may be beneficial to the owning process for a couple of
+reasons:
+
+- The (huge)page affected by the UE can be large, for example a 1G
+  hugepage, while the actually corrupted part of the page is only several
+  cachelines. Losing the entire hugepage of data is unacceptable to the
+  application.
+- In addition to keeping the data accessible, the application still wants
+  to access the memory with as large a page size as possible, for the
+  fastest virtual-to-physical translations.
+
+Memory failure recovery for 1G or larger HugeTLB pages is a good example.
+With memfd, a userspace process can control whether the kernel hard
+offlines the memory (huge)pages that back the in-RAM file created by
+memfd.
+
+
+User API
+========
+
+``int memfd_create(const char *name, unsigned int flags)``
+
+``MFD_MF_KEEP_UE_MAPPED``
+  When the ``MFD_MF_KEEP_UE_MAPPED`` bit is set in ``flags``, MF recovery
+  in the kernel does not hard offline memory due to a UE until the
+  returned ``memfd`` is released. IOW, the HWPoison-ed memory remains
+  accessible via the returned ``memfd`` or the memory mapping created
+  with that ``memfd``. Note the affected memory will be immediately
+  protected and isolated from future use (by both kernel and userspace)
+  once the owning ``memfd`` is gone. By default ``MFD_MF_KEEP_UE_MAPPED``
+  is not set, and the kernel hard offlines memory having UEs.
+
+Notes about the behavior and limitations:
+
+- Even if the page affected by the UE is kept mapped, a portion of the
+  (huge)page is already lost due to hardware corruption; the size of that
+  portion is the smallest page size the kernel uses to manage memory on
+  the architecture, i.e. PAGESIZE. Accessing a virtual address within any
+  of these portions results in a SIGBUS; accessing virtual addresses
+  outside these portions is fine until new memory errors corrupt them.
+- ``MFD_MF_KEEP_UE_MAPPED`` currently only works for HugeTLB, so
+  ``MFD_HUGETLB`` must also be set when setting ``MFD_MF_KEEP_UE_MAPPED``.
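
For illustration beyond the patch itself, an end-to-end usage sketch
(not a complete program; error handling omitted, and SIZE_1G is a
placeholder for the mapping length):

    int fd = memfd_create("guest_mem",
                          MFD_HUGETLB | MFD_HUGE_1GB | MFD_MF_KEEP_UE_MAPPED);
    ftruncate(fd, SIZE_1G);
    void *mem = mmap(NULL, SIZE_1G, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /*
     * On a UE, the process receives SIGBUS with si_addr_lsb == PAGE_SHIFT;
     * the rest of the hugepage remains mapped and accessible.
     */

    /* Releasing the memfd lets the kernel hard offline the pending
     * HWPoison-ed pages. */
    munmap(mem, SIZE_1G);
    close(fd);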