From patchwork Fri May 31 21:34:37 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jiaqi Yan
X-Patchwork-Id: 13682199
Date: Fri, 31 May 2024 21:34:37 +0000
In-Reply-To: <20240531213439.2958891-1-jiaqiyan@google.com>
References: <20240531213439.2958891-1-jiaqiyan@google.com>
X-Mailer: git-send-email 2.45.1.288.g0e0cd299f1-goog
Message-ID: <20240531213439.2958891-2-jiaqiyan@google.com>
Subject: [PATCH v1 1/3] mm/memory-failure: userspace controls soft-offlining hugetlb pages
From: Jiaqi Yan
To: naoya.horiguchi@nec.com, muchun.song@linux.dev, linmiaohe@huawei.com
Cc: akpm@linux-foundation.org, mike.kravetz@oracle.com, shuah@kernel.org,
 corbet@lwn.net, osalvador@suse.de, rientjes@google.com, duenwen@google.com,
 fvdl@google.com, linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
 linux-doc@vger.kernel.org, Jiaqi Yan
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Correctable
memory errors are very common on servers with large amounts of memory,
and are corrected by ECC. Soft offline is the kernel's additional
recovery handling for memory pages having (excessive) corrected memory
errors. The impacted page is migrated to a healthy page if it is
mapped/in use; the original page is discarded from any future use.

The actual policy on whether (and when) to soft offline should be
maintained by userspace, especially in the case of HugeTLB hugepages.
Soft offline dissolves a hugepage, either in-use or free, into chunks
of 4K pages, reducing HugeTLB pool capacity by 1 hugepage. If userspace
has not acknowledged such behavior, it may be surprised when a later
mmap() of hugepages returns MAP_FAILED due to lack of hugepages. In
addition, discarding an entire 1G memory page only because of corrected
memory errors is very costly, and the kernel had better not do it under
the hood. But today there are at least 2 such cases:

1. The GHES driver sees both GHES_SEV_CORRECTED and
   CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
2. The RAS Correctable Errors Collector counts correctable errors per
   PFN and soft offlines the page when the counter for a PFN reaches
   the threshold.

In both cases, userspace has no control over the soft offline performed
by the kernel's memory failure recovery.

This commit gives userspace control over soft-offlining HugeTLB pages:
the kernel only soft offlines a hugepage if userspace has opted in for
that specific hugepage size. The interface to userspace is a new sysfs
entry called softoffline_corrected_errors under the
/sys/kernel/mm/hugepages/hugepages-${size}kB directory:

* When softoffline_corrected_errors=0, skip soft offlining for all
  hugepages of size ${size}kB.
* When softoffline_corrected_errors=1, soft offline as before this
  patch series.

So the granularity of the control is per hugepage size, and the setting
is kept in the corresponding hstate. By default
softoffline_corrected_errors is 1 to preserve existing kernel behavior.
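
For illustration only (not part of this patch): a minimal userspace
sketch of how an agent might opt a hugepage size out of soft offlining
through the new sysfs entry. The helper names softoffline_knob_path()
and softoffline_knob_write() are hypothetical; only the sysfs path
layout and the 0/1 values come from the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the sysfs path of the knob for one hugepage size, given in kB.
 * Hypothetical helper. Returns 0 on success, -ENAMETOOLONG if buf is
 * too small to hold the path.
 */
static int softoffline_knob_path(char *buf, size_t len, unsigned long size_kb)
{
	int n;

	assert(buf != NULL);
	n = snprintf(buf, len,
		     "/sys/kernel/mm/hugepages/hugepages-%lukB/"
		     "softoffline_corrected_errors", size_kb);
	if (n < 0 || (size_t)n >= len)
		return -ENAMETOOLONG;
	return 0;
}

/*
 * Write 0 or 1 to the knob at path. Hypothetical helper.
 * Returns 0 on success or a negative errno value.
 */
static int softoffline_knob_write(const char *path, int enable)
{
	FILE *f;
	int ret = 0;

	/* Same range as the sysfs store handler accepts: 0 or 1 only. */
	if (enable != 0 && enable != 1)
		return -EINVAL;

	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fprintf(f, "%d\n", enable) < 0)
		ret = -EIO;
	/* A rejected write may only surface when the stream is flushed. */
	if (fclose(f) != 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

A caller would build the path for, say, size_kb = 1048576 (1G
hugepages) and write 0 to keep the kernel from dissolving those pages
on corrected errors. Writing the file normally requires root.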
Signed-off-by: Jiaqi Yan
---
 include/linux/hugetlb.h | 17 +++++++++++++++++
 mm/hugetlb.c            | 34 ++++++++++++++++++++++++++++++++++
 mm/memory-failure.c     |  7 +++++++
 3 files changed, 58 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2b3c3a404769..55f9e9593cce 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -685,6 +685,7 @@ struct hstate {
 	int next_nid_to_free;
 	unsigned int order;
 	unsigned int demote_order;
+	unsigned int softoffline_corrected_errors;
 	unsigned long mask;
 	unsigned long max_huge_pages;
 	unsigned long nr_huge_pages;
@@ -1029,6 +1030,16 @@ void hugetlb_unregister_node(struct node *node);
  */
 bool is_raw_hwpoison_page_in_hugepage(struct page *page);
 
+/*
+ * For certain hugepage size, when a hugepage has corrected memory error(s):
+ * - Return 0 if userspace wants to disable soft offlining the hugepage.
+ * - Return > 0 if userspace allows soft offlining the hugepage.
+ */
+static inline int hugetlb_softoffline_corrected_errors(struct folio *folio)
+{
+	return folio_hstate(folio)->softoffline_corrected_errors;
+}
+
 #else /* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
@@ -1226,6 +1237,12 @@ static inline bool hugetlbfs_pagecache_present(
 {
 	return false;
 }
+
+static inline int hugetlb_softoffline_corrected_errors(struct folio *folio)
+{
+	return 1;
+}
+
 #endif /* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6be78e7d4f6e..a184e28ce592 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4325,6 +4325,38 @@ static ssize_t demote_size_store(struct kobject *kobj,
 }
 HSTATE_ATTR(demote_size);
 
+static ssize_t softoffline_corrected_errors_show(struct kobject *kobj,
+						 struct kobj_attribute *attr,
+						 char *buf)
+{
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
+
+	return sysfs_emit(buf, "%d\n", h->softoffline_corrected_errors);
+}
+
+static ssize_t softoffline_corrected_errors_store(struct kobject *kobj,
+						  struct kobj_attribute *attr,
+						  const char *buf,
+						  size_t count)
+{
+	int err;
+	unsigned long input;
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
+
+	err = kstrtoul(buf, 10, &input);
+	if (err)
+		return err;
+
+	/* softoffline_corrected_errors is either 0 or 1. */
+	if (input > 1)
+		return -EINVAL;
+
+	h->softoffline_corrected_errors = input;
+
+	return count;
+}
+HSTATE_ATTR(softoffline_corrected_errors);
+
 static struct attribute *hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&nr_overcommit_hugepages_attr.attr,
@@ -4334,6 +4366,7 @@ static struct attribute *hstate_attrs[] = {
 #ifdef CONFIG_NUMA
 	&nr_hugepages_mempolicy_attr.attr,
 #endif
+	&softoffline_corrected_errors_attr.attr,
 	NULL,
 };
 
@@ -4655,6 +4688,7 @@ void __init hugetlb_add_hstate(unsigned int order)
 	h = &hstates[hugetlb_max_hstate++];
 	mutex_init(&h->resize_lock);
 	h->order = order;
+	h->softoffline_corrected_errors = 1;
 	h->mask = ~(huge_page_size(h) - 1);
 	for (i = 0; i < MAX_NUMNODES; ++i)
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 16ada4fb02b7..7094fc4c62e2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2776,6 +2776,13 @@ int soft_offline_page(unsigned long pfn, int flags)
 		return -EIO;
 	}
 
+	if (PageHuge(page) &&
+	    !hugetlb_softoffline_corrected_errors(page_folio(page))) {
+		pr_info("soft offline: %#lx: hugetlb page is ignored\n", pfn);
+		put_ref_page(pfn, flags);
+		return -EINVAL;
+	}
+
 	mutex_lock(&mf_mutex);
 
 	if (PageHWPoison(page)) {
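
For illustration only (not part of this patch): a toy userspace model
of the two checks added above, namely the 0/1 validation in the sysfs
store handler and the early return in soft_offline_page(). The names
struct toy_hstate, toy_store() and toy_soft_offline_gate() are
hypothetical, and strtoul() only approximates kstrtoul(), so edge cases
may differ from the kernel.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Toy stand-in for the one field this patch adds to struct hstate. */
struct toy_hstate {
	unsigned int softoffline_corrected_errors;
};

/*
 * Model of the store handler: parse a base-10 number, allow one
 * trailing newline, and accept only 0 or 1. Returns count on success
 * or a negative errno value, leaving the setting unchanged on error.
 */
static long toy_store(struct toy_hstate *h, const char *buf, size_t count)
{
	char *end;
	unsigned long input;

	assert(h != NULL && buf != NULL);

	/* strtoul() would skip whitespace and accept '-'; reject both. */
	if (!isdigit((unsigned char)buf[0]) && buf[0] != '+')
		return -EINVAL;

	errno = 0;
	input = strtoul(buf, &end, 10);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == buf)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	/* softoffline_corrected_errors is either 0 or 1. */
	if (input > 1)
		return -EINVAL;

	h->softoffline_corrected_errors = (unsigned int)input;
	return (long)count;
}

/*
 * Model of the new check in soft_offline_page(): a hugetlb page is
 * skipped with -EINVAL when its hstate has the knob set to 0.
 * Returns 0 when soft offline would proceed past the check.
 */
static int toy_soft_offline_gate(const struct toy_hstate *h, int is_hugetlb)
{
	if (is_hugetlb && !h->softoffline_corrected_errors)
		return -EINVAL;
	return 0;
}
```

In this model, storing "0" makes the gate return -EINVAL for hugetlb
pages only; non-hugetlb pages are unaffected by the setting.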