From patchwork Tue Sep 24 04:39:19 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jiaqi Yan X-Patchwork-Id: 13810097 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id ABB4BCF9C72 for ; Tue, 24 Sep 2024 04:39:35 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3C9CE6B0083; Tue, 24 Sep 2024 00:39:35 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 32DBD6B0085; Tue, 24 Sep 2024 00:39:35 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 1A6536B0088; Tue, 24 Sep 2024 00:39:35 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id EDD176B0083 for ; Tue, 24 Sep 2024 00:39:34 -0400 (EDT) Received: from smtpin02.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id A0ECDC17A0 for ; Tue, 24 Sep 2024 04:39:34 +0000 (UTC) X-FDA: 82598378268.02.AEE894B Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) by imf04.hostedemail.com (Postfix) with ESMTP id D6C1D40009 for ; Tue, 24 Sep 2024 04:39:32 +0000 (UTC) Authentication-Results: imf04.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=ak+Dze8T; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf04.hostedemail.com: domain of 3g0LyZggKCBAzyq6yEq3w44w1u.s421y3AD-220Bqs0.47w@flex--jiaqiyan.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3g0LyZggKCBAzyq6yEq3w44w1u.s421y3AD-220Bqs0.47w@flex--jiaqiyan.bounces.google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1727152737; a=rsa-sha256; cv=none; 
b=Q2Bv0BBQb1HMw/OfQvqkTdAm46494lxodblilorA3TQgsgqz6b+hcpEtOkgf/ObDIFOekq 1tHAVNtekuYQYFiR5wJ2BBpYDSeqRdVbH7O7JPWJUn91zmJob1IwwYDWWEVPZB4o3iHN+V P6gdFgaBQyAgns/h1RgHN5gniE7Q+Bg= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=ak+Dze8T; dmarc=pass (policy=reject) header.from=google.com; spf=pass (imf04.hostedemail.com: domain of 3g0LyZggKCBAzyq6yEq3w44w1u.s421y3AD-220Bqs0.47w@flex--jiaqiyan.bounces.google.com designates 209.85.219.202 as permitted sender) smtp.mailfrom=3g0LyZggKCBAzyq6yEq3w44w1u.s421y3AD-220Bqs0.47w@flex--jiaqiyan.bounces.google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1727152737; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=T9QYBokEPkcwmggku9rIrnocfmaIGptmI0axJesX/JY=; b=nSvbI+8ULmUprCJpBi6wbUxqUB1glCQ+SWQvFkU+CzXuDp1rodcwD+SJi++f9DifY7qeLF stU+JNdF1YEGZnD3iccvI/uVH+F6+zSeV7nzgo7KPVN4HaVjn5OvMTUW+SxW6uW7qJqA7s ndAXFoE36eeetNQNmY3WWQHlOshnrgI= Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e1aa529e30eso8203265276.1 for ; Mon, 23 Sep 2024 21:39:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1727152772; x=1727757572; darn=kvack.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=T9QYBokEPkcwmggku9rIrnocfmaIGptmI0axJesX/JY=; b=ak+Dze8TKO5I4GdO+FDy4hqjkttX65WHNLgsXtVangmIUIy5ZWK5FTfSXMRjL0tRax 73rGKN8/xi4nR8nBGrsd5r7J6QfBBa4jEhlXb5mMnXEiYp2f7Y6t9SA+dlnZqqnUgL+x 8ZQRbEM8VNUi4/mlSywmVdiN4m041ImiA7wOgEuD24slP5IzaVI1ma00o+CG5GqkUMmt 7KdcBJxa8Fg6oj3kLBR8OF1fcrgPYYvsyE9ggI3zNd6MbzykjRJbcRnZktlq1PYkiCqe bKzoD6+b33JwhA+Jyx9JDCMtcMU3240BUUP1ZOluBbANrQpNTaz1qAhiyW6WL2yM3HXf rT9w== X-Google-DKIM-Signature: v=1; 
a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1727152772; x=1727757572; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=T9QYBokEPkcwmggku9rIrnocfmaIGptmI0axJesX/JY=; b=HQlTNdNnrN1EFkKbkyt4Q2nIdYJgIpeEKCj8Ik1Wsv4dEwerkwPxmlAzKZCB/XOHdH Eup0JX4xjf2ttLr36e0riwos8CxyefQclmBHuoP+0JSR26idwfdlBsY+fTO3U/42jxHq PRW0Vrg5tJIAmn0atM3I8n0e6F4At4lRkylYb3axi2+2WNASN7ncUDiK2ahl94M09nWd dinnsamxyQojtfa9istu7b5uBbnFaiVJacoq8BzPKvtAgIZeY3+B4oXGVWmzRwno5Xd2 va7oDNU7S6a9SmhNFlMIeSf3R9wVWaIyPZcRN2MdVB6ghZDdEoWk/KKYsvRxg3yFcMBK F8Dg== X-Forwarded-Encrypted: i=1; AJvYcCXA9lYUZafY8qtFII6Ju3Ec6bfsw9tVk/8wYmkYEPDIE71VZj4pfR3ykOqLrN55B7p7Oa4BLGVXdA==@kvack.org X-Gm-Message-State: AOJu0YzO9vI5l5VI90skOZ2/Us1d/UitEvPIDarVwpwH7GpBujDyFUG4 2FHLMyzPlOI9D5+l3IL1n98rkTt0+shK3rD5fTKfQhNbzbp0d+dkg4X3DlpRPSW/XKI7dzP8W2s el4oPxWzHuA== X-Google-Smtp-Source: AGHT+IFqJLZv35vd4olBKuukrqKCNP+8cdSTLQTyDhm7fIZv528n0EJcJkxCvmr6Easi7Uj4esGRiguUUgy2Sw== X-Received: from yjq3.c.googlers.com ([fda3:e722:ac3:cc00:131:4782:ac13:917f]) (user=jiaqiyan job=sendgmr) by 2002:a25:8045:0:b0:e20:2b5b:c6e6 with SMTP id 3f1490d57ef6-e2250cc2238mr27796276.9.1727152771578; Mon, 23 Sep 2024 21:39:31 -0700 (PDT) Date: Tue, 24 Sep 2024 04:39:19 +0000 In-Reply-To: <20240924043924.3562257-1-jiaqiyan@google.com> Mime-Version: 1.0 References: <20240924043924.3562257-1-jiaqiyan@google.com> X-Mailer: git-send-email 2.46.0.792.g87dc391469-goog Message-ID: <20240924043924.3562257-2-jiaqiyan@google.com> Subject: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy From: Jiaqi Yan To: nao.horiguchi@gmail.com, linmiaohe@huawei.com Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com, jane.chu@oracle.com, akpm@linux-foundation.org, osalvador@suse.de, rientjes@google.com, duenwen@google.com, jthoughton@google.com, jgg@nvidia.com, ankita@nvidia.com, peterx@redhat.com, linux-mm@kvack.org, Jiaqi Yan X-Rspam-User: 
X-Rspamd-Queue-Id: D6C1D40009 X-Rspamd-Server: rspam01 X-Stat-Signature: wwm95he7y3zjkh1jgn7rwhy3mpdnixp3 X-HE-Tag: 1727152772-508400 X-HE-Meta: U2FsdGVkX1+3Mhrrp2wW90G6dms0/3oZeoS8Mnd7tGRBCY5wd4Anwo69EXtm+LeEr+09WGaYkv07EFtFns1leaMdeHLTQDGsaJk+Zy367477I2WzuRSeh1rDJGxtmAAW0nG2f4dTgMbgeScCAn/fPKSXm8nypzVfL6bHTuu/U1VeyOyKFuF31OwJxkvyM6Yc/YeINrovSqtZEKBHEkxsOUn8BL2kjff1EticXSmqfaX3TpHqgSuf8+Jec4uxNFaz9/fGRVtYBytPMgR9iR61yQr1llqpxYtdyKc6HijIGoxDc8jNRKzaZs8O6KsTxga3Bw6wFiIOOhpWg/r3IVsYEnjnEihzUvj7y853BjFtQ5ONEZk5Ktq0amN7zYGTk/VTiVpGxOK6Zyxc+YnTfrQEXhhHT9VLaqCgpIC847VTLqig3J+2KVetEbFgLAc/0AlIVj0LFHaZdRWwcDgus7/6oyih4S7Rx98+0TL8tXgZcsR1O4apbjC+Lg/ER4CoIGwM2dZ8HKQgbKANUyinPCRHotGnih4K4+J+BPLq+thHVqBGHjZz7UV4FC4OnA2SlggwwQyxxZzOzarqQMdkydNWRV1tUUrFNZBADNSm9+oeou0sGe5GF00f9zNKvRBu9O2B78vy0XcuMmAhfRWFgFq0wPhalBWq+TRJZ4RLv5ntRwVkE4msvrr43cGOCxcn6o1j5oZm1NSYjBeU+hcnAu5e1P7/dYFJbbCsX2jz4zhsv0OlacO9Tr1UzpAKFT5wwtZ3WDiQaHIJ9kqWgn++cRhb6S6i8ohLY21880yvNh54Gry3mNuNeQlTSHoW3OF2pDvi6DDhb5GE1nAib21H26b8d+ebhAwNkZZQgsc8jSFo/PjyPY7pHrgujR09gaIu/BJItMNzon1xy4ZBkU1ARNwHV3HRQW9aLgpoV0YtVSx6ks9WwPvWyQFfKBahqJ+pjK3TTzrcn515UZOgzbLlNxV wYacJ8Q/ vj39R9wmzF10X7x/t95yMcoQxX6eLkIle3GYwc8IofR0mAvuacm2M2Ws0SkcOkeT/m9V8Hw16qW/A6K+JFpE7R9zq7+v6KpT7jd3XtCI+tnThVzyv/oIPYU785LK6h7ouN0vPcN5Si1lF2df0xoBsuKUnvI1vdq6N0/L6TNCUsHnAKtLXB+StdmQw2l+dW6tkF+uCquPO4D2sEwKEL1VRYTInHWNUPqBvJ3z3Fx9J5WHku0DB3BwtoNYeDOkTDdGmFMRgWLBWVof71dAkOyPlCj++7/H4fG0sychk+fu1bNNm3It9hwhw9EWI9sjm4RBzG/Y0HxUEvGIRiQ1qBAwDA3kNyqc3isElKDUGQ7GMt3z0Vdk94CiwGzgDCt0Z9iEiPY2IlmVcNsqqVXN50qhu/n15n/BUiD410Y6tRXfcJkYUIG5SCIvf4tyen5LsfmqzfhfuOQfN4ugcK/ZSkkW/xt9KwfJ6GhPCKfcjMvFn/Y2zMJwLuUDgN2eqzdgQxqTzqu0TMZy6b48U0AOvsVHI8IEW7JCl7ABplX9shlokH2CzY6N9prlbu4iJFpyCNBVRHiTls5hC1hkNkBFGZq1+RYJBA+SMb7V2giGEamEH8FNowk6aL2nYoedioJyX3Yn2rVliqF1InA9O1sD8fwdrl2ChkFLr9NgN2u1AncIDbExDucI= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: 
List-Subscribe: List-Unsubscribe:

Give userspace control over whether the kernel hard offlines
(HARD_OFFLINE) an error folio, which is either a raw page or a hugepage,
via a new /proc/sys/vm/enable_hard_offline sysctl. By default,
HARD_OFFLINE is enabled, to be consistent with the existing
memory_failure behavior.

Userspace should be able to control whether to keep or discard a large
chunk of memory in the event of uncorrectable memory errors. There are
two major use cases in cloud environments.

The first case is a 1G HugeTLB-backed database workload. Compared with
discarding the whole hugepage when only a single PFN is affected by an
uncorrectable memory error, if the kernel simply leaves the 1G hugepage
mapped, accesses to the majority of clean PFNs within the poisoned 1G
region still work for the VM and the workload.

The second case is MMIO device memory or EGM [1] mapped to userspace via
huge VM_PFNMAP [2]. If the kernel does not zap the PUD or PMD, the VFIO
drivers that manage the memory do not need to intercept page faults for
clean PFNs and reinstall PTEs.

In addition, in both cases there is no EPT or stage-2 (S2) violation, so
there is no performance cost for accessing clean guest pages already
mapped in EPT or S2.

See the cover letter for more details on why userspace needs such
control, and on the implications when userspace chooses to disable
HARD_OFFLINE.

If this RFC receives generally positive feedback, I will add a selftest
in v2.
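As an illustration only, userspace could toggle the knob roughly as in
the sketch below. The helper names read_knob() and write_knob() are
hypothetical and not part of this patch, and the sysctl file is assumed
to exist only on a kernel carrying this RFC. Writing it is expected to
require root, given the 0644 mode.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Path of the knob added by this RFC. Assumption: the file is present
 * only on kernels that carry this patch.
 */
#define HARD_OFFLINE_SYSCTL "/proc/sys/vm/enable_hard_offline"

/* Hypothetical helper: read an integer knob; returns -1 on error. */
static int read_knob(const char *path)
{
	int val;
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/* Hypothetical helper: write 0 or 1; returns 0 on success, -1 on error. */
static int write_knob(const char *path, int enable)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", enable ? 1 : 0) > 0;
	/* Buffered write errors surface at fclose(), so check it as well. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A VMM that prefers to keep a poisoned 1G hugepage mapped might call
write_knob(HARD_OFFLINE_SYSCTL, 0) at startup and check the result with
read_knob(HARD_OFFLINE_SYSCTL). Both helpers take the path as a
parameter so they can be exercised against an ordinary file.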
[1] https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory
[2] https://lore.kernel.org/linux-mm/20240828234958.GE3773488@nvidia.com/T/#m413a61acaf1fc60e65ee7968ab0ae3093f7b1ea3

Signed-off-by: Jiaqi Yan
---
 mm/memory-failure.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7066fc84f351..a7b85b98d61e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -70,6 +70,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;

 static int sysctl_enable_soft_offline __read_mostly = 1;

+static int sysctl_enable_hard_offline __read_mostly = 1;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);

 static bool hw_memory_failure __read_mostly = false;
@@ -151,6 +153,15 @@ static struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "enable_hard_offline",
+		.data		= &sysctl_enable_hard_offline,
+		.maxlen		= sizeof(sysctl_enable_hard_offline),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };

@@ -2223,6 +2234,14 @@ int memory_failure(unsigned long pfn, int flags)

 	p = pfn_to_online_page(pfn);
 	if (!p) {
+		/*
+		 * For ZONE_DEVICE memory and memory on special architectures,
+		 * assume they have opted out of the core kernel's MFR. Since
+		 * such memory can still be mapped to userspace, let userspace
+		 * know that the MFR policy doesn't apply.
+		 */
+		pr_info_once("%#lx: can't apply global MFR policy\n", pfn);
+
 		res = arch_memory_failure(pfn, flags);
 		if (res == 0)
 			goto unlock_mutex;
@@ -2241,6 +2260,20 @@ int memory_failure(unsigned long pfn, int flags)
 		goto unlock_mutex;
 	}

+	/*
+	 * On ARM64, if APEI fails to claim the SEA (e.g. the GHES driver does
+	 * not register for SEA notifications from firmware), memory_failure
+	 * will never be synchronous to the error consumption thread.
+	 * Notifying it synchronously via SIGBUS has to be done by either the
+	 * core kernel in do_mem_abort, or KVM in kvm_handle_guest_abort.
+	 */
+	if (!sysctl_enable_hard_offline) {
+		pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
+		kill_procs_now(p, pfn, flags, page_folio(p));
+		res = -EOPNOTSUPP;
+		goto unlock_mutex;
+	}
+
 try_again:
 	res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
 	if (hugetlb)