From patchwork Tue Sep 24 04:39:20 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13810098
Date: Tue, 24 Sep 2024 04:39:20 +0000
In-Reply-To: <20240924043924.3562257-1-jiaqiyan@google.com>
References: <20240924043924.3562257-1-jiaqiyan@google.com>
X-Mailer: git-send-email 2.46.0.792.g87dc391469-goog
Message-ID: <20240924043924.3562257-3-jiaqiyan@google.com>
Subject: [RFC PATCH v1 2/2] docs: mm: add enable_hard_offline sysctl
From: Jiaqi Yan
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com, jane.chu@oracle.com,
 akpm@linux-foundation.org, osalvador@suse.de, rientjes@google.com,
 duenwen@google.com, jthoughton@google.com, jgg@nvidia.com, ankita@nvidia.com,
 peterx@redhat.com, linux-mm@kvack.org, Jiaqi Yan

Add the documentation for the userspace control of hard offlining memory
that has uncorrectable memory errors: where it will be useful and its
global implications.

Signed-off-by: Jiaqi Yan
---
 Documentation/admin-guide/sysctl/vm.rst | 92 +++++++++++++++++++++++++
 1 file changed, 92 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index f48eaa98d22d..a55a1d496b34 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -37,6 +37,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_writeback_centisecs
 - drop_caches
 - enable_soft_offline
+- enable_hard_offline
 - extfrag_threshold
 - highmem_is_dirtyable
 - hugetlb_shm_group
@@ -306,6 +307,97 @@ following requests to soft offline pages will not be performed:
 
 - On PARISC, the request to soft offline pages from Page Deallocation Table.
 
+
+enable_hard_offline
+===================
+
+This parameter gives userspace control over whether the kernel should hard
+offline memory that has uncorrectable memory errors. When set to 1, the kernel
+attempts to hard offline the error folio whenever it deems it necessary. When
+set to 0, the kernel returns EOPNOTSUPP to requests to hard offline pages.
+The default value is 1.
+
+Where will `enable_hard_offline = 0` be useful?
+-----------------------------------------------
+
+There are two major use cases from the cloud provider's perspective.
+
+The first use case is 1G HugeTLB, which provides a critical optimization for
+Virtual Machines (VMs) whose database-centric and data-intensive workloads
+require both large memory size (hundreds of GB or several TB) and
+high-performance address mapping.
+These VMs usually also require high availability, so host RAS features are
+relied on to tolerate and recover from inevitable uncorrectable memory errors
+and keep VM uptime long (the SLA is 99.95% Monthly Uptime). Because of the 1G
+granularity, once a single byte in a hugepage is corrupted by hardware, the
+kernel discards the whole 1G hugepage from the HugeTLB system: not only the
+corrupted bytes but also the healthy portion. In a cloud environment this is
+a great loss of memory for the VM and puts it in a dilemma: although the host
+is able to keep serving the VM, the VM itself struggles to continue its
+data-intensive workload after the unnecessary loss of ~1G of data. On the
+other hand, simply terminating the VM greatly reduces its uptime, given how
+frequently uncorrectable memory errors occur.
+
+The second use case comes from the discussion of memory failure recovery
+(MFR) for huge VM_PFNMAP [6], which is meant to greatly improve TLB hits for
+PCI MMIO BARs and kernel-unmanaged host primary memory. These mappings are
+most relevant for VMs that run Machine Learning (ML) workloads, which also
+require reliable VM uptime. The MFR behavior for huge VM_PFNMAP is: if the
+driver originally VM_PFNMAP-ed with a PUD, it must first zap the PUD, then
+intercept future page faults to either install PTEs/PMDs for clean PFNs or
+return VM_FAULT_HWPOISON for poisoned PFNs. Zapping the PUD leaves a huge
+hole in the EPT or stage-2 (S2) page table, causing many EPT or S2 violations
+that need to be fixed up by the device driver. VM performance degrades
+noticeably, not only while the EPT or S2 is being refilled, but also after
+the hole is refilled, because the EPT or S2 is by then fragmented.
+
+For both use cases, HARD_OFFLINE behavior in MFR arguably does more harm than
+good to the VM.
+For the first case, if the 1G HugeTLB hugepage simply stays mapped, VM access
+to the clean PFNs within the poisoned 1G region still works; the kernel only
+needs to keep sending SIGBUS to userspace when poisoned PFNs are re-accessed,
+to stop corrupted data from propagating. For the second case, if the PUD is
+not zapped, the driver need not intercept page faults to clean memory on HBM
+or EGM. Neither case triggers EPT or S2 violations, so accessing clean guest
+pages already mapped in the EPT or S2 has no performance cost.
+
+It is Global
+------------
+
+This applies to the system **globally** in the sense that:
+
+1. It covers all *system-level memory managed by the kernel*, regardless
+   of the underlying memory type.
+2. It applies to *all userspace threads*, regardless of whether the physical
+   memory currently backs any VMA (it may be free) or which VMAs it backs.
+3. It applies to *PCI(e) device memory* (e.g. HBM on a GPU) as well, provided
+   the device driver deliberately follows the kernel's memory failure recovery
+   instead of handling failures entirely itself (e.g. drivers/nvdimm/pmem.c).
+
+Implications
+------------
+
+There is one important thing to point out when `enable_hard_offline` = 0: the
+kernel does NOT set the HWPoison flag in the struct page or struct folio.
+Setting that flag used to be how the kernel isolated a poisoned page and
+prevented both userspace and the kernel from consuming the memory error and
+causing a hardware fault again. With no such enforcement in place, the
+implications are:
+
+- Userspace already has sufficient capability to prevent itself from
+  consuming the memory error and causing a hardware fault: with the poisoned
+  virtual address sent in SIGBUS, it can ask the kernel to remap the poisoned
+  page with data loss, or simply abort the memory load operation.
+  That being said, there is a risk that a userspace thread keeps ignoring
+  SIGBUS and generates hardware faults repeatedly.
+
+- The kernel can no longer prevent free error pages from being reused in
+  future memory allocations. If an error page is allocated to the kernel, a
+  kernel panic is most likely to happen when the kernel consumes it. For
+  userspace, newly allocated memory is no longer guaranteed to be free of
+  memory errors.
+
+
 extfrag_threshold
 =================
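
For readers who want to try the knob, here is a minimal Python sketch of
reading and toggling it from userspace. It is an illustration only, not part of
the patch. It assumes the file lives at /proc/sys/vm/enable_hard_offline, as the
vm.rst file list touched by this patch suggests, and that the running kernel
carries this RFC. The helper names read_hard_offline and write_hard_offline are
hypothetical.

```python
from pathlib import Path
from typing import Optional

# Assumed location: the patch adds enable_hard_offline to the list of files
# under /proc/sys/vm. The file only exists on kernels carrying this RFC.
SYSCTL_PATH = Path("/proc/sys/vm/enable_hard_offline")


def read_hard_offline(path: Path = SYSCTL_PATH) -> Optional[int]:
    """Return the current value as an int, or None if the file is absent."""
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None


def write_hard_offline(enabled: bool, path: Path = SYSCTL_PATH) -> None:
    """Write 1 (hard offline enabled) or 0 (disabled) to the sysctl file.

    On a real system this needs root, and it raises an OSError subclass if
    the kernel does not provide the file or the write is not permitted.
    """
    path.write_text("1\n" if enabled else "0\n")
```

Calling read_hard_offline() with no arguments returns None on a kernel without
the sysctl rather than raising. Both helpers take the path as a parameter so
that they can be exercised against an ordinary file without touching /proc.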