From patchwork Fri Jun 28 20:59:56 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13716603
Date: Fri, 28 Jun 2024 20:59:56 +0000
In-Reply-To: <20240628205958.2845610-1-jiaqiyan@google.com>
References: <20240628205958.2845610-1-jiaqiyan@google.com>
X-Mailer: git-send-email 2.45.2.803.g4e1b14247a-goog
Message-ID: <20240628205958.2845610-3-jiaqiyan@google.com>
Subject: [PATCH v7 2/4] mm/memory-failure: userspace controls soft-offlining pages
From: Jiaqi Yan
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: jane.chu@oracle.com, ioworker0@gmail.com, muchun.song@linux.dev,
 akpm@linux-foundation.org, shuah@kernel.org, rdunlap@infradead.org,
 corbet@lwn.net, osalvador@suse.de, rientjes@google.com,
 duenwen@google.com, fvdl@google.com, ak@linux.intel.com,
 linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
 linux-doc@vger.kernel.org, Jiaqi Yan
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Correctable memory errors are very common on servers with large
amounts of memory, and are corrected by ECC. Soft offline is the
kernel's additional recovery handling for memory pages having
(excessive) corrected memory errors. The impacted page is migrated to
a healthy page if it is in use; the original page is discarded from
any future use.

The actual policy on whether (and when) to soft offline should be
maintained by userspace, especially in the case of a 1G HugeTLB page.
Soft-offline dissolves the HugeTLB page, either in-use or free, into
chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage. If
userspace has not acknowledged such behavior, it may be surprised
when it later fails to mmap hugepages due to a lack of hugepages. In
the case of a transparent hugepage, it will be split into 4K pages as
well; userspace will stop enjoying the transparent performance.

In addition, discarding the entire 1G HugeTLB page only because of
corrected memory errors is very costly, and the kernel had better not
do it under the hood. But today there are at least 2 cases doing so:
1. when the GHES driver sees both GHES_SEV_CORRECTED and
   CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
2. when the RAS Correctable Errors Collector counts correctable
   errors per PFN and the counter for a PFN reaches the threshold.

In both cases, userspace has no control over the soft offline
performed by the kernel's memory failure recovery.

This commit gives userspace control over soft-offlining any page: the
kernel only soft offlines a raw page / transparent hugepage / HugeTLB
hugepage if userspace has agreed to it. The interface to userspace is
a new sysctl at /proc/sys/vm/enable_soft_offline. By default its
value is set to 1 to preserve the existing kernel behavior. When set
to 0, soft-offline (e.g. MADV_SOFT_OFFLINE) will fail with
EOPNOTSUPP.

Acked-by: Miaohe Lin
Acked-by: David Rientjes
Signed-off-by: Jiaqi Yan
---
 mm/memory-failure.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 685ab9a77966..d55fdeed0cfc 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -68,6 +68,8 @@ static int sysctl_memory_failure_early_kill __read_mostly;
 
 static int sysctl_memory_failure_recovery __read_mostly = 1;
 
+static int sysctl_enable_soft_offline __read_mostly = 1;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -141,6 +143,15 @@ static struct ctl_table memory_failure_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
 	},
+	{
+		.procname	= "enable_soft_offline",
+		.data		= &sysctl_enable_soft_offline,
+		.maxlen		= sizeof(sysctl_enable_soft_offline),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	}
 };
 
 /*
@@ -2758,8 +2769,9 @@ static int soft_offline_in_use_page(struct page *page)
  * @pfn: pfn to soft-offline
  * @flags: flags. Same as memory_failure().
  *
- * Returns 0 on success
- *         -EOPNOTSUPP for hwpoison_filter() filtered the error event
+ * Returns 0 on success,
+ *         -EOPNOTSUPP for hwpoison_filter() filtered the error event, or
+ *         disabled by /proc/sys/vm/enable_soft_offline,
  *         < 0 otherwise negated errno.
  *
  * Soft offline a page, by migration or invalidation,
@@ -2795,6 +2807,12 @@ int soft_offline_page(unsigned long pfn, int flags)
 		return -EIO;
 	}
 
+	if (!sysctl_enable_soft_offline) {
+		pr_info_once("disabled by /proc/sys/vm/enable_soft_offline\n");
+		put_ref_page(pfn, flags);
+		return -EOPNOTSUPP;
+	}
+
 	mutex_lock(&mf_mutex);
 
 	if (PageHWPoison(page)) {
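
The commit message states that with /proc/sys/vm/enable_soft_offline set
to 0, MADV_SOFT_OFFLINE fails with EOPNOTSUPP. A rough userspace sketch
of how that could be observed follows. It is an illustration only and
not part of this patch: the helper names (read_int_file, write_int_file,
soft_offline_one_page, explain_soft_offline) are made up here, and it
assumes a kernel built with CONFIG_MEMORY_FAILURE and a caller holding
CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Value from the Linux uapi headers, in case libc headers lack it. */
#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101
#endif

#define SOFT_OFFLINE_SYSCTL "/proc/sys/vm/enable_soft_offline"

/* Read one integer from a sysctl-style file; 0 on success, -errno on error. */
static int read_int_file(const char *path, int *value)
{
	FILE *f;
	int ret = 0;

	assert(path && value);
	f = fopen(path, "r");
	if (!f)
		return -errno;
	if (fscanf(f, "%d", value) != 1)
		ret = -EINVAL;
	fclose(f);
	return ret;
}

/* Write one integer to a sysctl-style file; 0 on success, -errno on error. */
static int write_int_file(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ret = 0;

	if (!f)
		return -errno;
	if (fprintf(f, "%d\n", value) < 0)
		ret = -EIO;
	/* procfs write errors may only surface when the stream is flushed */
	if (fclose(f) != 0 && !ret)
		ret = -errno;
	return ret;
}

/*
 * Hypothetical helper: soft offline one freshly faulted anonymous page.
 * Returns 0 on success, or the (positive) errno set by mmap()/madvise().
 */
static int soft_offline_one_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	int err = 0;
	char *p;

	p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return errno;
	memset(p, 0xab, page_size);	/* fault the page in */
	if (madvise(p, page_size, MADV_SOFT_OFFLINE))
		err = errno;
	munmap(p, page_size);
	return err;
}

/* Map the errno from soft_offline_one_page() to a short explanation. */
static const char *explain_soft_offline(int err)
{
	switch (err) {
	case 0:
		return "page soft-offlined";
	case EOPNOTSUPP:
		return "disabled by enable_soft_offline or filtered by hwpoison_filter()";
	case EPERM:
		return "CAP_SYS_ADMIN required";
	default:
		return strerror(err);
	}
}
```

A privileged caller could run write_int_file(SOFT_OFFLINE_SYSCTL, 0) and
then soft_offline_one_page(); on a kernel with this patch the second call
would be expected to return EOPNOTSUPP. Writing 1 back restores the
default behavior. Note that EOPNOTSUPP alone does not tell the two causes
apart, since hwpoison_filter() reports the same errno.
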