From patchwork Tue Sep 24 04:39:19 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13810097
Date: Tue, 24 Sep 2024 04:39:19 +0000
In-Reply-To: <20240924043924.3562257-1-jiaqiyan@google.com>
References: <20240924043924.3562257-1-jiaqiyan@google.com>
X-Mailer: git-send-email 2.46.0.792.g87dc391469-goog
Message-ID: <20240924043924.3562257-2-jiaqiyan@google.com>
Subject: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
From: Jiaqi Yan
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com, jane.chu@oracle.com,
 akpm@linux-foundation.org, osalvador@suse.de, rientjes@google.com,
 duenwen@google.com, jthoughton@google.com, jgg@nvidia.com,
 ankita@nvidia.com, peterx@redhat.com, linux-mm@kvack.org, Jiaqi Yan

Give userspace control over whether the kernel hard offlines
(HARD_OFFLINE) an error folio (either a raw page or a hugepage). By
default, HARD_OFFLINE is enabled, consistent with the existing
memory_failure behavior.

Userspace should be able to control whether to keep or discard a large
chunk of memory in the event of uncorrectable memory errors. There are
two major use cases in cloud environments.

The 1st case is a 1G HugeTLB-backed database workload. If the kernel
simply leaves the 1G hugepage mapped, instead of discarding it when only
a single PFN is impacted by an uncorrectable memory error, access to the
majority of clean PFNs within the poisoned 1G region still works well
for the VM and the workload.

The 2nd case is MMIO device memory or EGM [1] mapped to userspace via
huge VM_PFNMAP [2]. If the kernel does not zap the PUD or PMD, there is
no need for the VFIO driver that manages the memory to intercept page
faults for clean PFNs and to reinstall PTEs.

In addition, in both cases there is no EPT or stage-2 (S2) violation, so
there is no performance cost for accessing clean guest pages already
mapped in EPT or S2.

See the cover letter for more details on why userspace needs such
control, and on the implications when userspace chooses to disable
HARD_OFFLINE.

If this RFC receives generally positive feedback, I will add a selftest
in v2.
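
To put the 1G HugeTLB case in numbers, the sketch below counts how many clean PFNs are discarded together with the poisoned ones when a whole hugepage is hard offlined. It is only an illustration: the 4 KiB base page size and the helper name `clean_pfns_lost` are assumptions, not part of this patch.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: 1 GiB hugepage and 4 KiB base pages (typical on x86-64). */
#define HUGEPAGE_SIZE	(UINT64_C(1) << 30)
#define BASE_PAGE_SIZE	(UINT64_C(1) << 12)

/* Number of base pages (PFNs) covered by one hugepage. */
uint64_t pfns_per_hugepage(void)
{
	return HUGEPAGE_SIZE / BASE_PAGE_SIZE;
}

/*
 * Hypothetical helper: clean PFNs that are thrown away with the hugepage
 * when poisoned_pfns of its base pages have uncorrectable errors.
 */
uint64_t clean_pfns_lost(uint64_t poisoned_pfns)
{
	uint64_t total = pfns_per_hugepage();

	/* A hugepage cannot contain more poisoned PFNs than it has PFNs. */
	assert(poisoned_pfns <= total);
	return total - poisoned_pfns;
}
```

With these assumed sizes a hugepage spans 262144 PFNs, so a single poisoned PFN causes 262143 clean PFNs to be discarded with it.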

[1] https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory
[2] https://lore.kernel.org/linux-mm/20240828234958.GE3773488@nvidia.com/T/#m413a61acaf1fc60e65ee7968ab0ae3093f7b1ea3

Signed-off-by: Jiaqi Yan
---
 mm/memory-failure.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7066fc84f351..a7b85b98d61e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -70,6 +70,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
 
 static int sysctl_enable_soft_offline __read_mostly = 1;
 
+static int sysctl_enable_hard_offline __read_mostly = 1;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -151,6 +153,15 @@ static struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "enable_hard_offline",
+		.data		= &sysctl_enable_hard_offline,
+		.maxlen		= sizeof(sysctl_enable_hard_offline),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };
 
@@ -2223,6 +2234,14 @@ int memory_failure(unsigned long pfn, int flags)
 
 	p = pfn_to_online_page(pfn);
 	if (!p) {
+		/*
+		 * For ZONE_DEVICE memory and memory on special architectures,
+		 * assume they have opted out of the core kernel's MFR. Since
+		 * such memory can still be mapped to userspace, let userspace
+		 * know that MFR doesn't apply.
+		 */
+		pr_info_once("%#lx: can't apply global MFR policy\n", pfn);
+
 		res = arch_memory_failure(pfn, flags);
 		if (res == 0)
 			goto unlock_mutex;
@@ -2241,6 +2260,20 @@ int memory_failure(unsigned long pfn, int flags)
 		goto unlock_mutex;
 	}
 
+	/*
+	 * On ARM64, if APEI fails to claim the SEA (e.g. GHES driver doesn't
+	 * register for SEA notifications from firmware), memory_failure will
+	 * never be synchronous to the error consumption thread. Notifying
+	 * it via SIGBUS synchronously has to be done by either core kernel in
+	 * do_mem_abort, or KVM in kvm_handle_guest_abort.
+	 */
+	if (!sysctl_enable_hard_offline) {
+		pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
+		kill_procs_now(p, pfn, flags, page_folio(p));
+		res = -EOPNOTSUPP;
+		goto unlock_mutex;
+	}
+
 try_again:
 	res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
 	if (hugetlb)

From patchwork Tue Sep 24 04:39:20 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13810098
Date: Tue, 24 Sep 2024 04:39:20 +0000
In-Reply-To: <20240924043924.3562257-1-jiaqiyan@google.com>
References: <20240924043924.3562257-1-jiaqiyan@google.com>
X-Mailer: git-send-email 2.46.0.792.g87dc391469-goog
Message-ID: <20240924043924.3562257-3-jiaqiyan@google.com>
Subject: [RFC PATCH v1 2/2] docs: mm: add enable_hard_offline sysctl
From: Jiaqi Yan
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com, jane.chu@oracle.com,
 akpm@linux-foundation.org, osalvador@suse.de, rientjes@google.com,
 duenwen@google.com, jthoughton@google.com, jgg@nvidia.com,
 ankita@nvidia.com, peterx@redhat.com, linux-mm@kvack.org, Jiaqi Yan

Add documentation for the userspace control over hard offlining memory
that has uncorrectable memory errors: where it is useful and what its
global implications are.

Signed-off-by: Jiaqi Yan
---
 Documentation/admin-guide/sysctl/vm.rst | 92 +++++++++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f48eaa98d22d..a55a1d496b34 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -37,6 +37,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_writeback_centisecs
 - drop_caches
 - enable_soft_offline
+- enable_hard_offline
 - extfrag_threshold
 - highmem_is_dirtyable
 - hugetlb_shm_group
@@ -306,6 +307,97 @@ following requests to soft offline pages will not be performed:
 
 - On PARISC, the request to soft offline pages from Page Deallocation Table.
 
+
+enable_hard_offline
+===================
+
+This parameter gives userspace control over whether the kernel should hard
+offline memory that has uncorrectable memory errors. When set to 1, the kernel
+attempts to hard offline the error folio whenever it deems necessary. When set
+to 0, the kernel returns EOPNOTSUPP for requests to hard offline pages.
+Its default value is 1.
+
+Where will `enable_hard_offline = 0` be useful?
+-----------------------------------------------
+
+There are two major use cases from the cloud provider's perspective.
+
+The first use case is 1G HugeTLB, which provides a critical optimization for
+Virtual Machines (VMs) where database-centric and data-intensive workloads
+need both a large memory size (hundreds of GB or several TB) and
+high-performance address mapping. These VMs usually also require high
+availability, so tolerating and recovering from inevitable uncorrectable
+memory errors is usually provided by host RAS features for long VM uptime
+(SLA is 99.95% Monthly Uptime). Due to the 1G granularity, once a byte
+of memory in a hugepage is hardware corrupted, the kernel discards the whole
+1G hugepage, not only the corrupted bytes but also the healthy portion, from
+the HugeTLB system. In a cloud environment this is a great loss of memory to
+the VM, putting it in a dilemma: although the host is able to keep serving
+the VM, the VM itself struggles to continue its data-intensive workload with
+the unnecessary loss of ~1G of data. On the other hand, simply terminating
+the VM greatly reduces its uptime, given how frequently uncorrectable memory
+errors occur.
+
+The second use case comes from the discussion of MFR for huge VM_PFNMAP [6],
+which aims to greatly improve TLB hits for PCI MMIO BARs and kernel-unmanaged
+host primary memory. They are most relevant for VMs that run Machine Learning
+(ML) workloads, which also require reliable VM uptime. The MFR behavior for
+huge VM_PFNMAP is: if the driver originally VM_PFNMAP-ed with a PUD, it must
+first zap the PUD, then intercept future page faults to either install PTE/PMD
+for clean PFNs, or return VM_FAULT_HWPOISON for poisoned PFNs. Zapping the PUD
+means there will be a huge hole in the EPT or stage-2 (S2) page table,
+causing a lot of EPT or S2 violations that need to be fixed up by the device
+driver. There will be noticeable VM performance degradation, not only while
+refilling EPT or S2, but also after the hole is refilled, as EPT or S2 is
+already fragmented.
+
+For both use cases, HARD_OFFLINE behavior in MFR arguably does more harm than
+good to the VM. For the 1st case, if we simply leave the 1G HugeTLB hugepage
+mapped, VM access to the clean PFNs within the poisoned 1G region still works
+well; we only need to keep sending SIGBUS to userspace when poisoned PFNs are
+re-accessed, to stop corrupted data from propagating. For the 2nd case, if
+the PUD is not zapped, there is no need for the driver to intercept page
+faults to clean memory on HBM or EGM. In addition, in both cases, there is no
+EPT or S2 violation, so there is no performance cost for accessing clean
+guest pages already mapped in the EPT or S2.
+
+It is Global
+------------
+
+This applies to the system **globally** in the sense that:
+1. It covers all *system-level memory managed by the kernel*, regardless
+   of the underlying memory type.
+2. It applies to *all userspace threads*, regardless of whether the physical
+   memory currently backs any VMA (it may be free) or which VMAs it backs.
+3. It applies to *PCI(e) device memory* (e.g. HBM on GPU) as well, on the
+   condition that its device driver deliberately chooses to follow the
+   kernel’s memory failure recovery instead of handling memory failures
+   entirely by itself (e.g. drivers/nvdimm/pmem.c).
+
+Implications
+------------
+
+There is one important thing to point out when `enable_hard_offline` = 0.
+The kernel does NOT set the HWPoison flag in the struct page or struct folio.
+This has implications, because the kernel no longer enforces isolation of the
+poisoned page to prevent userspace and the kernel from consuming the memory
+error and causing a hardware fault again (enforcement that used to be done by
+setting the HWPoison flag):
+
+- Userspace already has sufficient capability to prevent itself from
+  consuming the memory error and causing a hardware fault: with the poisoned
+  virtual address sent in SIGBUS, it can ask the kernel to remap the poisoned
+  page with data loss, or simply abort the memory load operation. That being
+  said, there is a risk that a userspace thread keeps ignoring SIGBUS and
+  generates hardware faults repeatedly.
+
+- The kernel won't be able to forbid the reuse of free error pages in future
+  memory allocations. If an error page is allocated to the kernel, a kernel
+  panic is most likely to happen when the kernel consumes the allocated error
+  page. For userspace, it is no longer guaranteed that newly allocated memory
+  is free of memory errors.
+
+
 extfrag_threshold
 =================
 
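
The first implication above says that userspace can protect itself by using the poisoned virtual address delivered with SIGBUS. A minimal sketch of that idea follows; it is not part of this series. The helper names (`poison_range_start`, `remap_poisoned_range`, `install_sigbus_handler`) are hypothetical, and the sketch assumes the affected range is a private anonymous read/write mapping whose contents may be dropped. It relies on the existing Linux `siginfo_t` fields `si_code` (`BUS_MCEERR_AR` / `BUS_MCEERR_AO`), `si_addr` and `si_addr_lsb`.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: round the reported address down to the start of the
 * affected range. lsb is si_addr_lsb, i.e. log2 of the range size.
 */
uintptr_t poison_range_start(uintptr_t addr, unsigned int lsb)
{
	return addr & ~(((uintptr_t)1 << lsb) - 1);
}

/*
 * Hypothetical helper: replace the reported range with fresh zero-filled
 * memory. ASSUMPTION: the range was a private anonymous read/write mapping;
 * its previous contents are lost.
 */
int remap_poisoned_range(const siginfo_t *si)
{
	unsigned int lsb = (unsigned int)si->si_addr_lsb;
	size_t len = (size_t)1 << lsb;
	void *start = (void *)poison_range_start((uintptr_t)si->si_addr, lsb);
	void *p = mmap(start, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	return p == MAP_FAILED ? -1 : 0;
}

/* SIGBUS handler: only hardware memory errors are recovered here. */
void sigbus_handler(int sig, siginfo_t *si, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
		_exit(1);	/* some other bus error: give up */
	if (remap_poisoned_range(si) != 0)
		_exit(1);	/* could not recover: do not retry the load */
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A program would call `install_sigbus_handler()` once at startup. Note that POSIX does not list `mmap` as async-signal-safe, so a production handler may prefer to record the address and defer the remap. A handler that simply returns without remapping or aborting is the "keeps ignoring SIGBUS" case described above.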