From patchwork Sat Jan 18 23:15:47 2025
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13944261
Date: Sat, 18 Jan 2025 23:15:47 +0000
In-Reply-To: <20250118231549.1652825-1-jiaqiyan@google.com>
References: <20250118231549.1652825-1-jiaqiyan@google.com>
Message-ID: <20250118231549.1652825-2-jiaqiyan@google.com>
Subject: [RFC PATCH v1 1/3] mm: memfd/hugetlb: introduce userspace memory failure recovery policy
From: Jiaqi Yan
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com, willy@infradead.org,
 jane.chu@oracle.com, akpm@linux-foundation.org, osalvador@suse.de,
 rientjes@google.com, duenwen@google.com, jthoughton@google.com,
 jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com,
 sidhartha.kumar@oracle.com, david@redhat.com, dave.hansen@linux.intel.com,
 muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Jiaqi Yan
Sometimes, immediately hard offlining a memory page that has uncorrected
memory errors (UEs) is not the best option, for capacity and/or performance
reasons. In Cloud scenarios, "sometimes" even becomes "often". See the cover
letter for descriptions of two such scenarios.

Therefore, whether to keep or discard a large chunk of contiguous memory
mapped to userspace (particularly memory serving guest RAM) due to a UE
(recoverable is implied) should be controllable by the userspace process,
e.g. the VMM in a Cloud environment. Given the relevance of HugeTLB's
non-ideal memory failure recovery behavior, this commit uses HugeTLB as the
"testbed" to demonstrate the idea of a memfd-based userspace memory failure
policy.

MFD_MF_KEEP_UE_MAPPED is added to the possible values for flags in the
memfd_create syscall. It is intended to be generic for any memfd, not just
HugeTLB, but the current implementation only covers HugeTLB. When
MFD_MF_KEEP_UE_MAPPED is set in flags, memory failure recovery in the kernel
does not hard offline memory due to a UE until the created memfd is released
or the affected memory region is truncated by userspace.
In other words, the HWPoison-ed memory remains accessible via the returned
memfd or any memory mapping created from it. However, the affected memory is
immediately protected and isolated from future use by both kernel and
userspace once the owning memfd is gone or the memory is truncated. By
default MFD_MF_KEEP_UE_MAPPED is not set, and the kernel hard offlines
memory containing UEs.

Tested with the selftest in the follow-up patch.

This commit should probably be split into smaller pieces, but for now I
will defer that until this RFC becomes PATCH.

Signed-off-by: Jiaqi Yan
---
 fs/hugetlbfs/inode.c       |  16 +++++
 include/linux/hugetlb.h    |   7 +++
 include/linux/pagemap.h    |  43 ++++++++++++++
 include/uapi/linux/memfd.h |   1 +
 mm/filemap.c               |  78 ++++++++++++++++++++++++
 mm/hugetlb.c               |  20 ++++++-
 mm/memfd.c                 |  15 ++++-
 mm/memory-failure.c        | 119 +++++++++++++++++++++++++++++++++----
 8 files changed, 282 insertions(+), 17 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 0fc179a598300..3c7812898717b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -576,6 +576,10 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 	pgoff_t next, index;
 	int i, freed = 0;
 	bool truncate_op = (lend == LLONG_MAX);
+	LIST_HEAD(hwp_folios);
+
+	/* Needs to be done before removing folios from filemap. */
+	populate_memfd_hwp_folios(mapping, lstart >> PAGE_SHIFT, end, &hwp_folios);
 
 	folio_batch_init(&fbatch);
 	next = lstart >> PAGE_SHIFT;
@@ -605,6 +609,18 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 		(void)hugetlb_unreserve_pages(inode,
 				lstart >> huge_page_shift(h),
 				LONG_MAX, freed);
+	/*
+	 * hugetlbfs_error_remove_folio keeps the HWPoison-ed pages in the
+	 * page cache until mm wants to drop the folio at the end of the
+	 * filemap. At this point, if memory failure was delayed by
+	 * AS_MF_KEEP_UE_MAPPED in the past, we can now deal with it.
+	 *
+	 * TODO: in V2 we can probably get rid of populate_memfd_hwp_folios
+	 * and hwp_folios, by inserting filemap_offline_hwpoison_folio
+	 * somewhere in folio_batch_release, or into each file system's
+	 * free_folio handler.
+	 */
+	offline_memfd_hwp_folios(mapping, &hwp_folios);
 }
 
 static void hugetlbfs_evict_inode(struct inode *inode)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ec8c0ccc8f959..07d2a31146728 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -836,10 +836,17 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
 
 #ifdef CONFIG_MEMORY_FAILURE
 extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
+extern bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+						struct address_space *mapping);
 #else
 static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
 {
 }
+static inline bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+						       struct address_space *mapping)
+{
+	return false;
+}
 #endif
 
 #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index fc2e1319c7bb5..fad7093d232a9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -210,6 +210,12 @@ enum mapping_flags {
 	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying folio contents */
 	AS_INACCESSIBLE = 8,	/* Do not attempt direct R/W access to the mapping */
+	/*
+	 * Keep folios belonging to the mapping mapped even if uncorrectable memory
+	 * errors (UE) caused memory failure (MF) within the folio. Only at the end
+	 * of the mapping's lifetime will its HWPoison-ed folios be dealt with.
+	 */
+	AS_MF_KEEP_UE_MAPPED = 9,
 	/* Bits 16-25 are used for FOLIO_ORDER */
 	AS_FOLIO_ORDER_BITS = 5,
 	AS_FOLIO_ORDER_MIN = 16,
@@ -335,6 +341,16 @@ static inline bool mapping_inaccessible(struct address_space *mapping)
 	return test_bit(AS_INACCESSIBLE, &mapping->flags);
 }
 
+static inline bool mapping_mf_keep_ue_mapped(struct address_space *mapping)
+{
+	return test_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
+}
+
+static inline void mapping_set_mf_keep_ue_mapped(struct address_space *mapping)
+{
+	set_bit(AS_MF_KEEP_UE_MAPPED, &mapping->flags);
+}
+
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 {
 	return mapping->gfp_mask;
@@ -1298,6 +1314,33 @@ void replace_page_cache_folio(struct folio *old, struct folio *new);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct folio_batch *fbatch);
 bool filemap_release_folio(struct folio *folio, gfp_t gfp);
+#ifdef CONFIG_MEMORY_FAILURE
+void populate_memfd_hwp_folios(struct address_space *mapping,
+			       pgoff_t lstart, pgoff_t lend,
+			       struct list_head *list);
+void offline_memfd_hwp_folios(struct address_space *mapping,
+			      struct list_head *list);
+/*
+ * Provided by memory failure to offline HWPoison-ed folio for various memory
+ * management systems (hugetlb, THP etc).
+ */
+void filemap_offline_hwpoison_folio(struct address_space *mapping,
+				    struct folio *folio);
+#else
+static inline void populate_memfd_hwp_folios(struct address_space *mapping,
+					     pgoff_t lstart, pgoff_t lend,
+					     struct list_head *list)
+{
+}
+static inline void offline_memfd_hwp_folios(struct address_space *mapping,
+					    struct list_head *list)
+{
+}
+static inline void filemap_offline_hwpoison_folio(struct address_space *mapping,
+						  struct folio *folio)
+{
+}
+#endif
 loff_t mapping_seek_hole_data(struct address_space *, loff_t start, loff_t end,
 		int whence);
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 273a4e15dfcff..eb7a4ffcae6b9 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -12,6 +12,7 @@
 #define MFD_NOEXEC_SEAL		0x0008U
 /* executable */
 #define MFD_EXEC		0x0010U
+#define MFD_MF_KEEP_UE_MAPPED	0x0020U
 
 /*
  * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/filemap.c b/mm/filemap.c
index b6494d2d3bc2a..5216889d12ecf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4427,3 +4427,81 @@ SYSCALL_DEFINE4(cachestat, unsigned int, fd,
 	return 0;
 }
 #endif	/* CONFIG_CACHESTAT_SYSCALL */
+
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * To remember the HWPoison-ed folios within a mapping before removing every
+ * folio, a utility struct to link them into a list.
+ */
+struct memfd_hwp_folio {
+	struct list_head node;
+	struct folio *folio;
+};
+/**
+ * populate_memfd_hwp_folios - populates HWPoison-ed folios.
+ * @mapping: The address_space of a memfd the kernel is trying to remove or truncate.
+ * @start: The starting page index.
+ * @end: The final page index (inclusive).
+ * @list: Where the HWPoison-ed folios will be stored.
+ *
+ * There may be pending HWPoison-ed folios when a memfd is being removed or
+ * part of it is being truncated. Store them into a linked list to offline
+ * them after the file system removes them.
+ */
+void populate_memfd_hwp_folios(struct address_space *mapping,
+			       pgoff_t start, pgoff_t end,
+			       struct list_head *list)
+{
+	int i;
+	struct folio *folio;
+	struct memfd_hwp_folio *to_add;
+	struct folio_batch fbatch;
+	pgoff_t next = start;
+
+	if (!mapping_mf_keep_ue_mapped(mapping))
+		return;
+
+	folio_batch_init(&fbatch);
+	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
+		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+			folio = fbatch.folios[i];
+			if (!folio_test_hwpoison(folio))
+				continue;
+
+			to_add = kmalloc(sizeof(*to_add), GFP_KERNEL);
+			if (!to_add)
+				continue;
+
+			to_add->folio = folio;
+			list_add_tail(&to_add->node, list);
+		}
+		folio_batch_release(&fbatch);
+	}
+}
+EXPORT_SYMBOL_GPL(populate_memfd_hwp_folios);
+
+/**
+ * offline_memfd_hwp_folios - hard offline HWPoison-ed folios.
+ * @mapping: The address_space of a memfd the kernel is trying to remove or truncate.
+ * @list: Where the HWPoison-ed folios are stored. It will become empty when
+ *        offline_memfd_hwp_folios returns.
+ *
+ * After the file system has removed all the folios belonging to a memfd, the
+ * kernel can now hard offline all HWPoison-ed folios that were previously
+ * pending. The caller needs to exclusively own @list as no locking is
+ * provided here, and @list is entirely consumed here.
+ */
+void offline_memfd_hwp_folios(struct address_space *mapping,
+			      struct list_head *list)
+{
+	struct memfd_hwp_folio *curr, *temp;
+
+	list_for_each_entry_safe(curr, temp, list, node) {
+		filemap_offline_hwpoison_folio(mapping, curr->folio);
+		list_del(&curr->node);
+		kfree(curr);
+	}
+}
+EXPORT_SYMBOL_GPL(offline_memfd_hwp_folios);
+
+#endif	/* CONFIG_MEMORY_FAILURE */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 87761b042ed04..35e88d7fc2793 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6091,6 +6091,18 @@ static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm, unsigned
 	return same;
 }
 
+bool hugetlb_should_keep_hwpoison_mapped(struct folio *folio,
+					 struct address_space *mapping)
+{
+	if (WARN_ON_ONCE(!folio_test_hugetlb(folio)))
+		return false;
+
+	if (!mapping)
+		return false;
+
+	return mapping_mf_keep_ue_mapped(mapping);
+}
+
 static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 			struct vm_fault *vmf)
 {
@@ -6214,9 +6226,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	 * So we need to block hugepage fault by PG_hwpoison bit check.
 	 */
 	if (unlikely(folio_test_hwpoison(folio))) {
-		ret = VM_FAULT_HWPOISON_LARGE |
-		      VM_FAULT_SET_HINDEX(hstate_index(h));
-		goto backout_unlocked;
+		if (!mapping_mf_keep_ue_mapped(mapping)) {
+			ret = VM_FAULT_HWPOISON_LARGE |
+			      VM_FAULT_SET_HINDEX(hstate_index(h));
+			goto backout_unlocked;
+		}
 	}
 
 	/* Check for page in userfault range.
 */
diff --git a/mm/memfd.c b/mm/memfd.c
index 37f7be57c2f50..ddb9e988396c7 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -302,7 +302,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
 
-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+		       MFD_NOEXEC_SEAL | MFD_EXEC | MFD_MF_KEEP_UE_MAPPED)
 
 static int check_sysctl_memfd_noexec(unsigned int *flags)
 {
@@ -376,6 +377,8 @@ static int sanitize_flags(unsigned int *flags_ptr)
 	if (!(flags & MFD_HUGETLB)) {
 		if (flags & ~(unsigned int)MFD_ALL_FLAGS)
 			return -EINVAL;
+		if (flags & MFD_MF_KEEP_UE_MAPPED)
+			return -EINVAL;
 	} else {
 		/* Allow huge page size encoding in flags. */
 		if (flags & ~(unsigned int)(MFD_ALL_FLAGS |
@@ -436,6 +439,16 @@ static struct file *alloc_file(const char *name, unsigned int flags)
 	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
 	file->f_flags |= O_LARGEFILE;
 
+	/*
+	 * MFD_MF_KEEP_UE_MAPPED can only be specified at memfd_create time;
+	 * there is no API to update it once the memfd is created, and it is
+	 * not seal-able.
+	 *
+	 * TODO: MFD_MF_KEEP_UE_MAPPED is not supported by all file systems yet.
+	 */
+	if ((flags & MFD_HUGETLB) && (flags & MFD_MF_KEEP_UE_MAPPED))
+		mapping_set_mf_keep_ue_mapped(file->f_mapping);
+
 	if (flags & MFD_NOEXEC_SEAL) {
 		struct inode *inode = file_inode(file);
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a7b8ccd29b6f5..f43607fb4310e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -445,11 +445,13 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
 */
-static void __add_to_kill(struct task_struct *tsk, const struct page *p,
+static void __add_to_kill(struct task_struct *tsk, struct page *p,
 			  struct vm_area_struct *vma, struct list_head *to_kill,
 			  unsigned long addr)
 {
 	struct to_kill *tk;
+	struct folio *folio;
+	struct address_space *mapping;
 
 	tk = kmalloc(sizeof(struct to_kill), GFP_ATOMIC);
 	if (!tk) {
@@ -460,8 +462,20 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
 	tk->addr = addr;
 	if (is_zone_device_page(p))
 		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
-	else
-		tk->size_shift = folio_shift(page_folio(p));
+	else {
+		folio = page_folio(p);
+		mapping = folio_mapping(folio);
+		if (mapping && mapping_mf_keep_ue_mapped(mapping))
+			/*
+			 * Let userspace know the radius of the hardware poison
+			 * is a single raw page. As long as it avoids touching
+			 * that range, other pages inside the folio are still
+			 * safe to access.
+			 */
+			tk->size_shift = PAGE_SHIFT;
+		else
+			tk->size_shift = folio_shift(folio);
+	}
 
 	/*
 	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
@@ -486,7 +500,7 @@ static void __add_to_kill(struct task_struct *tsk, const struct page *p,
 	list_add_tail(&tk->nd, to_kill);
 }
 
-static void add_to_kill_anon_file(struct task_struct *tsk, const struct page *p,
+static void add_to_kill_anon_file(struct task_struct *tsk, struct page *p,
 				  struct vm_area_struct *vma,
 				  struct list_head *to_kill, unsigned long addr)
 {
@@ -607,7 +621,7 @@ struct task_struct *task_early_kill(struct task_struct *tsk, int force_early)
  * Collect processes when the error hit an anonymous page.
  */
 static void collect_procs_anon(const struct folio *folio,
-		const struct page *page, struct list_head *to_kill,
+		struct page *page, struct list_head *to_kill,
 		int force_early)
 {
 	struct task_struct *tsk;
@@ -645,7 +659,7 @@ static void collect_procs_anon(const struct folio *folio,
  * Collect processes when the error hit a file mapped page.
 */
 static void collect_procs_file(const struct folio *folio,
-		const struct page *page, struct list_head *to_kill,
+		struct page *page, struct list_head *to_kill,
 		int force_early)
 {
 	struct vm_area_struct *vma;
@@ -727,7 +741,7 @@ static void collect_procs_fsdax(const struct page *page,
 /*
  * Collect the processes who have the corrupted page mapped to kill.
  */
-static void collect_procs(const struct folio *folio, const struct page *page,
+static void collect_procs(const struct folio *folio, struct page *page,
 		struct list_head *tokill, int force_early)
 {
 	if (!folio->mapping)
@@ -1226,6 +1240,13 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		}
 	}
 
+	/*
+	 * MF still needs to hold a refcount for the deferred actions in
+	 * filemap_offline_hwpoison_folio.
+	 */
+	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+		return res;
+
 	if (has_extra_refcount(ps, p, extra_pins))
 		res = MF_FAILED;
 
@@ -1593,6 +1614,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
 	struct address_space *mapping;
 	LIST_HEAD(tokill);
 	bool unmap_success;
+	bool keep_mapped;
 	int forcekill;
 	bool mlocked = folio_test_mlocked(folio);
 
@@ -1643,10 +1665,12 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
 	 */
 	collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
 
-	unmap_poisoned_folio(folio, ttu);
+	keep_mapped = hugetlb_should_keep_hwpoison_mapped(folio, mapping);
+	if (!keep_mapped)
+		unmap_poisoned_folio(folio, ttu);
 
 	unmap_success = !folio_mapped(folio);
-	if (!unmap_success)
+	if (!unmap_success && !keep_mapped)
 		pr_err("%#lx: failed to unmap page (folio mapcount=%d)\n",
 		       pfn, folio_mapcount(folio));
 
@@ -1671,7 +1695,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
 		      !unmap_success;
 	kill_procs(&tokill, forcekill, pfn, flags);
 
-	return unmap_success;
+	return unmap_success || keep_mapped;
 }
 
 static int identify_page_state(unsigned long pfn, struct page *p,
@@ -1911,6 +1935,9 @@ static unsigned long
 __folio_free_raw_hwp(struct folio *folio, bool move_flag)
 	unsigned long count = 0;
 
 	head = llist_del_all(raw_hwp_list_head(folio));
+	if (head == NULL)
+		return 0;
+
 	llist_for_each_entry_safe(p, next, head, node) {
 		if (move_flag)
 			SetPageHWPoison(p->page);
@@ -1927,7 +1954,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
 	struct llist_head *head;
 	struct raw_hwp_page *raw_hwp;
 	struct raw_hwp_page *p;
-	int ret = folio_test_set_hwpoison(folio) ? -EHWPOISON : 0;
+	struct address_space *mapping = folio->mapping;
+	bool has_hwpoison = folio_test_set_hwpoison(folio);
 
 	/*
 	 * Once the hwpoison hugepage has lost reliable raw error info,
@@ -1946,8 +1974,15 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
 	if (raw_hwp) {
 		raw_hwp->page = page;
 		llist_add(&raw_hwp->node, head);
+		if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+			/*
+			 * A new raw HWPoison page. Don't return HWPOISON.
+			 * The error event will be counted in action_result().
+			 */
+			return 0;
+
 		/* the first error event will be counted in action_result(). */
-		if (ret)
+		if (has_hwpoison)
 			num_poisoned_pages_inc(page_to_pfn(page));
 	} else {
 		/*
@@ -1962,7 +1997,8 @@ static int folio_set_hugetlb_hwpoison(struct folio *folio, struct page *page)
 		 */
 		__folio_free_raw_hwp(folio, false);
 	}
-	return ret;
+
+	return has_hwpoison ? -EHWPOISON : 0;
 }
 
 static unsigned long folio_free_raw_hwp(struct folio *folio, bool move_flag)
@@ -2051,6 +2087,63 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
 	return ret;
 }
 
+static void filemap_offline_hwpoison_folio_hugetlb(struct folio *folio)
+{
+	int ret;
+	struct llist_node *head;
+	struct raw_hwp_page *curr, *next;
+	struct page *page;
+	unsigned long pfn;
+
+	head = llist_del_all(raw_hwp_list_head(folio));
+
+	/*
+	 * Release the references held by try_memory_failure_hugetlb, one per
+	 * HWPoison-ed page in the raw hwp list. This folio's refcount is
+	 * expected to drop to zero after the for-each loop below.
+	 */
+	llist_for_each_entry(curr, head, node)
+		folio_put(folio);
+
+	ret = dissolve_free_hugetlb_folio(folio);
+	if (ret) {
+		pr_err("failed to dissolve hugetlb folio: %d\n", ret);
+		llist_for_each_entry(curr, head, node) {
+			page = curr->page;
+			pfn = page_to_pfn(page);
+			/*
+			 * TODO: roll back the count incremented during online
+			 * handling, i.e. whatever me_huge_page returns.
+			 */
+			update_per_node_mf_stats(pfn, MF_FAILED);
+		}
+		return;
+	}
+
+	llist_for_each_entry_safe(curr, next, head, node) {
+		page = curr->page;
+		pfn = page_to_pfn(page);
+		drain_all_pages(page_zone(page));
+		if (PageBuddy(page) && !take_page_off_buddy(page))
+			pr_warn("%#lx: unable to take off buddy allocator\n", pfn);
+
+		SetPageHWPoison(page);
+		page_ref_inc(page);
+		kfree(curr);
+		pr_info("%#lx: pending hard offline completed\n", pfn);
+	}
+}
+
+void filemap_offline_hwpoison_folio(struct address_space *mapping,
+				    struct folio *folio)
+{
+	WARN_ON_ONCE(!mapping);
+
+	/* Pending MFR currently only exists for hugetlb. */
+	if (hugetlb_should_keep_hwpoison_mapped(folio, mapping))
+		filemap_offline_hwpoison_folio_hugetlb(folio);
+}
+
 /*
  * Taking refcount of hugetlb pages needs extra care about race conditions
  * with basic operations like hugepage allocation/free/demotion.
From patchwork Sat Jan 18 23:15:48 2025
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13944262
Date: Sat, 18 Jan 2025 23:15:48 +0000
In-Reply-To: <20250118231549.1652825-1-jiaqiyan@google.com>
References: <20250118231549.1652825-1-jiaqiyan@google.com>
Message-ID: <20250118231549.1652825-3-jiaqiyan@google.com>
Subject: [RFC PATCH v1 2/3] selftests/mm: test userspace MFR for HugeTLB 1G hugepage
From: Jiaqi Yan
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com, willy@infradead.org,
 jane.chu@oracle.com, akpm@linux-foundation.org, osalvador@suse.de,
 rientjes@google.com, duenwen@google.com, jthoughton@google.com,
 jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com,
 sidhartha.kumar@oracle.com, david@redhat.com, dave.hansen@linux.intel.com,
 muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Jiaqi Yan
Tests the userspace memory failure recovery (MFR) policy for the HugeTLB 1G
hugepage case:
1. Creates a memfd backed by 1G HugeTLB with MFD_MF_KEEP_UE_MAPPED set.
2. Allocates and maps a 1G hugepage into the process.
3. Creates sub-threads to MADV_HWPOISON inner addresses of the hugepage.
4. Checks if the process gets a correct SIGBUS for each poisoned raw page.
5. Checks if all memory is still accessible and its content valid.
6. Checks if the poisoned 1G hugepage is dealt with after the memfd is
   released.

Signed-off-by: Jiaqi Yan
---
 tools/testing/selftests/mm/.gitignore    |   1 +
 tools/testing/selftests/mm/Makefile      |   1 +
 tools/testing/selftests/mm/hugetlb-mfr.c | 267 +++++++++++++++++++++++
 3 files changed, 269 insertions(+)
 create mode 100644 tools/testing/selftests/mm/hugetlb-mfr.c

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index 121000c28c105..e65a1fa43f868 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -5,6 +5,7 @@ hugepage-mremap
 hugepage-shm
 hugepage-vmemmap
 hugetlb-madvise
+hugetlb-mfr
 hugetlb-read-hwpoison
 hugetlb-soft-offline
 khugepaged
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 63ce39d024bb5..171a9e65ed357 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -62,6 +62,7 @@ TEST_GEN_FILES += hmm-tests
 TEST_GEN_FILES += hugetlb-madvise
 TEST_GEN_FILES += hugetlb-read-hwpoison
 TEST_GEN_FILES += hugetlb-soft-offline
+TEST_GEN_FILES += hugetlb-mfr
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-mremap
 TEST_GEN_FILES += hugepage-shm
diff --git a/tools/testing/selftests/mm/hugetlb-mfr.c b/tools/testing/selftests/mm/hugetlb-mfr.c
new file mode 100644
index 0000000000000..cb20b81984f5e
--- /dev/null
+++ b/tools/testing/selftests/mm/hugetlb-mfr.c
@@ -0,0 +1,267 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Tests the userspace memory failure recovery (MFR) policy for the HugeTLB
+ * 1G hugepage case:
+ * 1. Creates a memfd backed by 1G HugeTLB with the MFD_MF_KEEP_UE_MAPPED
+ *    bit set.
+ * 2. Allocates and maps a 1G hugepage.
+ * 3. Creates sub-threads to MADV_HWPOISON inner addresses of the hugepage.
+ * 4. Checks if the sub-threads get a correct SIGBUS for each poisoned raw
+ *    page.
+ * 5. Checks if all memory is still accessible and its content still valid.
+ * 6. Checks if the poisoned 1G hugepage is dealt with after the memfd is
+ *    released.
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <pthread.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <linux/magic.h>
+#include <linux/memfd.h>
+#include <linux/mman.h>
+#include <sys/mman.h>
+#include <sys/statfs.h>
+#include <sys/types.h>
+
+#include "../kselftest.h"
+#include "vm_util.h"
+
+#define EPREFIX " !!! "
+#define BYTE_LENTH_IN_1G 0x40000000
+#define HUGETLB_FILL 0xab
+
+static void *sigbus_addr;
+static int sigbus_addr_lsb;
+static bool expecting_sigbus;
+static bool got_sigbus;
+static bool was_mceerr;
+
+static int create_hugetlbfs_file(struct statfs *file_stat)
+{
+	int fd;
+	int flags = MFD_HUGETLB | MFD_HUGE_1GB | MFD_MF_KEEP_UE_MAPPED;
+
+	fd = memfd_create("hugetlb_tmp", flags);
+	if (fd < 0)
+		ksft_exit_fail_perror("Failed to memfd_create");
+
+	memset(file_stat, 0, sizeof(*file_stat));
+	if (fstatfs(fd, file_stat)) {
+		close(fd);
+		ksft_exit_fail_perror("Failed to fstatfs");
+	}
+	if (file_stat->f_type != HUGETLBFS_MAGIC) {
+		close(fd);
+		ksft_exit_fail_msg("Not hugetlbfs file");
+	}
+
+	ksft_print_msg("Created hugetlb_tmp file\n");
+	ksft_print_msg("hugepagesize=%#lx\n", file_stat->f_bsize);
+	if (file_stat->f_bsize != BYTE_LENTH_IN_1G)
+		ksft_exit_fail_msg("Hugepage size is not 1G");
+
+	return fd;
+}
+
+/*
+ * SIGBUS handler for the "do_hwpoison" thread that MADV_HWPOISONs the
+ * mapping.
+ */
+static void sigbus_handler(int signo, siginfo_t *info, void *context)
+{
+	if (!expecting_sigbus)
+		ksft_exit_fail_msg("unexpected sigbus with addr=%p",
+				   info->si_addr);
+
+	got_sigbus = true;
+	was_mceerr = (info->si_code == BUS_MCEERR_AO ||
+		      info->si_code == BUS_MCEERR_AR);
+	sigbus_addr = info->si_addr;
+	sigbus_addr_lsb = info->si_addr_lsb;
+}
+
+static void *do_hwpoison(void *hwpoison_addr)
+{
+	int hwpoison_size = getpagesize();
+
+	ksft_print_msg("MADV_HWPOISON hwpoison_addr=%p, len=%d\n",
+		       hwpoison_addr, hwpoison_size);
+	if (madvise(hwpoison_addr, hwpoison_size, MADV_HWPOISON) < 0)
+		ksft_exit_fail_perror("Failed to MADV_HWPOISON");
+
+	pthread_exit(NULL);
+}
+
+static void test_hwpoison_multiple_pages(unsigned char *start_addr)
+{
+	pthread_t pthread;
+	int ret;
+	unsigned char *hwpoison_addr;
+	unsigned long offsets[] = {0x200000, 0x400000, 0x800000};
+
+	for (size_t i = 0; i < ARRAY_SIZE(offsets); ++i) {
+		sigbus_addr = (void *)0xBADBADBAD;
+		sigbus_addr_lsb = 0;
+		was_mceerr = false;
+		got_sigbus = false;
+		expecting_sigbus = true;
+		hwpoison_addr = start_addr + offsets[i];
+
+		ret = pthread_create(&pthread, NULL, &do_hwpoison, hwpoison_addr);
+		if (ret)
+			ksft_exit_fail_perror("Failed to create hwpoison thread");
+
+		ksft_print_msg("Created thread to hwpoison and access hwpoison_addr=%p\n",
+			       hwpoison_addr);
+
+		pthread_join(pthread, NULL);
+
+		if (!got_sigbus)
+			ksft_test_result_fail("Didn't get a SIGBUS\n");
+		if (!was_mceerr)
+			ksft_test_result_fail("Didn't get a BUS_MCEERR_A(R|O)\n");
+		if (sigbus_addr != hwpoison_addr)
+			ksft_test_result_fail("Incorrect address: got=%p, expected=%p\n",
+					      sigbus_addr, hwpoison_addr);
+		if (sigbus_addr_lsb != pshift())
+			ksft_test_result_fail("Incorrect address LSB: got=%d, expected=%d\n",
+					      sigbus_addr_lsb, pshift());
+
+		ksft_print_msg("Received expected and correct SIGBUS\n");
+	}
+}
+
+static int read_nr_hugepages(unsigned long hugepage_size,
+			     unsigned long *nr_hugepages)
+{
+	char buffer[256] = {0};
+	char cmd[256] = {0};
+
+	sprintf(cmd, "cat /sys/kernel/mm/hugepages/hugepages-%ldkB/nr_hugepages",
+		hugepage_size);
+	FILE *cmdfile = popen(cmd, "r");
+
+	if (cmdfile == NULL) {
+		ksft_perror(EPREFIX "failed to popen nr_hugepages");
+		return -1;
+	}
+
+	if (!fgets(buffer, sizeof(buffer), cmdfile)) {
+		ksft_perror(EPREFIX "failed to read nr_hugepages");
+		pclose(cmdfile);
+		return -1;
+	}
+
+	*nr_hugepages = atoll(buffer);
+	pclose(cmdfile);
+	return 0;
+}
+
+/*
+ * Main thread that drives the test.
+ */
+static void test_main(int fd, size_t len)
+{
+	unsigned char *map, *iter;
+	struct sigaction new, old;
+	const unsigned long hugepagesize_kb = BYTE_LENTH_IN_1G / 1024;
+	unsigned long nr_hugepages_before = 0;
+	unsigned long nr_hugepages_after = 0;
+
+	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_before) != 0) {
+		close(fd);
+		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+	}
+	ksft_print_msg("NR hugepages before MADV_HWPOISON is %ld\n", nr_hugepages_before);
+
+	if (ftruncate(fd, len) < 0)
+		ksft_exit_fail_perror("Failed to ftruncate");
+
+	ksft_print_msg("Allocated %#lx bytes to HugeTLB file\n", len);
+
+	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	if (map == MAP_FAILED)
+		ksft_exit_fail_msg("Failed to mmap");
+
+	ksft_print_msg("Created HugeTLB mapping: %p\n", map);
+
+	memset(map, HUGETLB_FILL, len);
+	ksft_print_msg("Memset every byte to 0xab\n");
+
+	new.sa_sigaction = &sigbus_handler;
+	new.sa_flags = SA_SIGINFO;
+	if (sigaction(SIGBUS, &new, &old) < 0)
+		ksft_exit_fail_msg("Failed to setup SIGBUS handler");
+
+	ksft_print_msg("Setup SIGBUS handler successfully\n");
+
+	test_hwpoison_multiple_pages(map);
+
+	/*
+	 * Since MADV_HWPOISON doesn't corrupt the memory in hardware, and
+	 * MFD_MF_KEEP_UE_MAPPED keeps the hugepage mapped, every byte should
+	 * remain accessible and hold original data.
+	 */
+	expecting_sigbus = false;
+	for (iter = map; iter < map + len; ++iter) {
+		if (*iter != HUGETLB_FILL) {
+			ksft_print_msg("At addr=%p: got=%#x, expected=%#x\n",
+				       iter, *iter, HUGETLB_FILL);
+			ksft_test_result_fail("Memory content corrupted\n");
+			break;
+		}
+	}
+	ksft_print_msg("Memory content all valid\n");
+
+	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0) {
+		close(fd);
+		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+	}
+
+	/*
+	 * After MADV_HWPOISON, the hugepage should still be in the HugeTLB
+	 * pool.
+	 */
+	ksft_print_msg("NR hugepages after MADV_HWPOISON is %ld\n", nr_hugepages_after);
+	if (nr_hugepages_before != nr_hugepages_after)
+		ksft_test_result_fail("NR hugepages reduced by %ld after MADV_HWPOISON\n",
+				      nr_hugepages_before - nr_hugepages_after);
+
+	/* End of the lifetime of the created HugeTLB memfd. */
+	if (ftruncate(fd, 0) < 0)
+		ksft_exit_fail_perror("Failed to ftruncate to 0");
+	munmap(map, len);
+	close(fd);
+
+	/*
+	 * After freed by userspace, the MADV_HWPOISON-ed hugepage should be
+	 * dissolved into raw pages and removed from the HugeTLB pool.
+	 */
+	if (read_nr_hugepages(hugepagesize_kb, &nr_hugepages_after) != 0)
+		ksft_exit_fail_msg("Failed to read nr_hugepages\n");
+	ksft_print_msg("NR hugepages after closure is %ld\n", nr_hugepages_after);
+	if (nr_hugepages_before != nr_hugepages_after + 1)
+		ksft_test_result_fail("NR hugepages is not reduced after memfd closure\n");
+
+	ksft_test_result_pass("All done\n");
+}
+
+int main(int argc, char **argv)
+{
+	int fd;
+	struct statfs file_stat;
+	size_t len = BYTE_LENTH_IN_1G;
+
+	ksft_print_header();
+	ksft_set_plan(1);
+
+	fd = create_hugetlbfs_file(&file_stat);
+	test_main(fd, len);
+
+	ksft_finished();
+}

From patchwork Sat Jan 18 23:15:49 2025
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13944263
Date: Sat, 18 Jan 2025 23:15:49 +0000
In-Reply-To: <20250118231549.1652825-1-jiaqiyan@google.com>
Mime-Version: 1.0
References: <20250118231549.1652825-1-jiaqiyan@google.com>
X-Mailer: git-send-email 2.48.0.rc2.279.g1de40edade-goog
Message-ID: <20250118231549.1652825-4-jiaqiyan@google.com>
Subject: [RFC PATCH v1 3/3] Documentation: add userspace MF recovery policy via memfd
From: Jiaqi Yan
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com, willy@infradead.org,
 jane.chu@oracle.com, akpm@linux-foundation.org, osalvador@suse.de,
 rientjes@google.com, duenwen@google.com, jthoughton@google.com,
 jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com,
 sidhartha.kumar@oracle.com, david@redhat.com, dave.hansen@linux.intel.com,
 muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Jiaqi Yan
Document its motivation and userspace API.

Signed-off-by: Jiaqi Yan
---
 Documentation/userspace-api/index.rst          |  1 +
 .../userspace-api/mfd_mfr_policy.rst           | 55 +++++++++++++++++++
 2 files changed, 56 insertions(+)
 create mode 100644 Documentation/userspace-api/mfd_mfr_policy.rst

diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 274cc7546efc2..0f9783b8807ea 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -63,6 +63,7 @@ Everything else
    vduse
    futex2
    perf_ring_buffer
+   mfd_mfr_policy
 
 .. only:: subproject and html
 
diff --git a/Documentation/userspace-api/mfd_mfr_policy.rst b/Documentation/userspace-api/mfd_mfr_policy.rst
new file mode 100644
index 0000000000000..d4557693c2c40
--- /dev/null
+++ b/Documentation/userspace-api/mfd_mfr_policy.rst
@@ -0,0 +1,55 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================================
+Userspace Memory Failure Recovery Policy via memfd
+==================================================
+
+:Author:
+   Jiaqi Yan
+
+
+Motivation
+==========
+
+When a userspace process is able to recover from memory failures (MF)
+caused by uncorrected memory errors (UE) in the DIMM, especially when it is
+able to avoid consuming known UEs, keeping the memory page mapped and
+accessible may be beneficial to the owning process for a couple of reasons:
+
+- The smallest granularity at which the affected memory can be offlined may
+  be large, for example a 1G hugepage, while the actually corrupted part of
+  the page is only several cachelines. Losing the entire hugepage of data
+  is unacceptable to the application.
+- In addition to keeping the data accessible, the application still wants
+  to access the memory with as large a page size as possible for the
+  fastest virtual-to-physical translations.
+
+Memory failure recovery for 1G or larger HugeTLB is a good example. With
+memfd, a userspace process can control whether the kernel hard offlines the
+memory (huge)pages that back the in-RAM file created by memfd.
+
+
+User API
+========
+
+``int memfd_create(const char *name, unsigned int flags)``
+
+``MFD_MF_KEEP_UE_MAPPED``
+    When the ``MFD_MF_KEEP_UE_MAPPED`` bit is set in ``flags``, MF recovery
+    in the kernel does not hard offline memory due to UE until the
+    returned ``memfd`` is released. IOW, the HWPoison-ed memory remains
+    accessible via the returned ``memfd`` or the memory mapping created
+    with the returned ``memfd``. Note the affected memory will be
+    immediately protected and isolated from future use (by both kernel
+    and userspace) once the owning process is gone. By default
+    ``MFD_MF_KEEP_UE_MAPPED`` is not set, and the kernel hard offlines
+    memory having UEs.
+
+Notes about the behavior and limitations:
+
+- Even if the page affected by UE is kept, a portion of the (huge)page is
+  already lost due to hardware corruption, and the size of the portion
+  is the smallest page size that the kernel uses to manage memory on the
+  architecture, i.e. PAGESIZE. Accessing a virtual address within any of
+  these parts results in a SIGBUS; accessing virtual addresses outside
+  these parts is fine until they are corrupted by a new memory error.
+- ``MFD_MF_KEEP_UE_MAPPED`` currently only works for HugeTLB, so
+  ``MFD_HUGETLB`` must also be set when setting ``MFD_MF_KEEP_UE_MAPPED``.