From patchwork Thu Jun 27 22:27:05 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yu Zhao
X-Patchwork-Id: 13715090
Date: Thu, 27 Jun 2024 16:27:05 -0600
Mime-Version: 1.0
X-Mailer: git-send-email 2.45.2.803.g4e1b14247a-goog
Message-ID: <20240627222705.2974207-1-yuzhao@google.com>
Subject: [PATCH mm-unstable v2] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
From: Yu Zhao
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Frank van der Linden, "Matthew Wilcox (Oracle)",
    Peter Xu, Yang Shi, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Yu Zhao
While investigating HVO for THPs [1], it turned out that speculative
PFN walkers like compaction can race with vmemmap modifications, e.g.:

  CPU 1 (vmemmap modifier)            CPU 2 (speculative PFN walker)
  -------------------------------     ------------------------------
  Allocates an LRU folio page1
                                      Sees page1
  Frees page1

  Allocates a hugeTLB folio page2
  (page1 being a tail of page2)

  Updates vmemmap mapping page1
                                      get_page_unless_zero(page1)

Even though page1->_refcount is zero after HVO, get_page_unless_zero()
can still try to modify this read-only field, resulting in a crash.

An independent report [2] confirmed this race.

There are two discussed approaches to fix this race:
1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
   triggering a PF.
2. Use RCU to make sure get_page_unless_zero() either sees zero
   page->_refcount through the old vmemmap or non-zero page->_refcount
   through the new one.

The second approach is preferred here because:
1. It can prevent illegal modifications to struct page[] that has been
   HVO'ed;
2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
   similar races in other places, e.g., arch_remove_memory() on x86
   [3], which frees the vmemmap mapping offlined struct page[].

While adding synchronize_rcu(), the goal is to be surgical rather than
optimized. Specifically, the calls to synchronize_rcu() on the error
handling paths could be coalesced, but that is not done for the sake of
simplicity: notably, this fix removes ~50% more lines than it adds.

According to the hugetlb_optimize_vmemmap section in
Documentation/admin-guide/sysctl/vm.rst, enabling HVO already makes
allocating or freeing hugeTLB pages "~2x slower than before". Adding
synchronize_rcu() on top makes those operations even slower, and this
also affects the user interface /proc/sys/vm/nr_overcommit_hugepages.
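For reviewers who want the two sides of the fix next to each other before
reading the diff, here is a condensed, kernel-style sketch of the pattern
this patch applies. It is not a compilable unit on its own; the helper
names walker_try_get_ref() and hvo_remap_folio_vmemmap() are invented for
illustration only, while the real changes live in page_ref_add_unless()
and the hugetlb_vmemmap helpers below.

/*
 * Reader side (speculative PFN walker), cf. page_ref_add_unless():
 * the check and the increment run inside one RCU read-side critical
 * section, so the walker either sees a zero _refcount through the old
 * vmemmap and bails out, or operates entirely on the new vmemmap.
 */
static bool walker_try_get_ref(struct page *page, int nr, int u)
{
        bool ret = false;

        rcu_read_lock();
        /* never write to a read-only, HVO'ed struct page */
        if (!page_is_fake_head(page) && page_ref_count(page) != u)
                ret = atomic_add_unless(&page->_refcount, nr, u);
        rcu_read_unlock();

        return ret;
}

/*
 * Writer side (vmemmap modifier), cf. hugetlb_vmemmap_optimize_folio()
 * and hugetlb_vmemmap_restore_folio(): wait for every reader that might
 * still be looking at the old mapping before folding or unfolding the
 * vmemmap of a folio whose _refcount is frozen at zero.
 */
static void hvo_remap_folio_vmemmap(const struct hstate *h, struct folio *folio)
{
        synchronize_rcu();

        /* ... fold or unfold the vmemmap backing struct page[] of @folio ... */
}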
[1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
[2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev/
[3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@redhat.com/

Signed-off-by: Yu Zhao
Acked-by: Muchun Song
---
 include/linux/page_ref.h |  8 +++++-
 mm/hugetlb.c             | 53 ++++++----------------------------------
 mm/hugetlb_vmemmap.c     | 16 ++++++++++++
 3 files changed, 30 insertions(+), 47 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 490d0ad6e56d..8c236c651d1d 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -230,7 +230,13 @@ static inline int folio_ref_dec_return(struct folio *folio)
 
 static inline bool page_ref_add_unless(struct page *page, int nr, int u)
 {
-        bool ret = atomic_add_unless(&page->_refcount, nr, u);
+        bool ret = false;
+
+        rcu_read_lock();
+        /* avoid writing to the vmemmap area being remapped */
+        if (!page_is_fake_head(page) && page_ref_count(page) != u)
+                ret = atomic_add_unless(&page->_refcount, nr, u);
+        rcu_read_unlock();
 
         if (page_ref_tracepoint_active(page_ref_mod_unless))
                 __page_ref_mod_unless(page, nr, ret);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9691624fcb79..0a69e194b517 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1629,13 +1629,10 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
  * folio appears as just a compound page.  Otherwise, wait until after
  * allocating vmemmap to clear the flag.
  *
- * A reference is held on the folio, except in the case of demote.
- *
  * Must be called with hugetlb lock held.
  */
-static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
-                                                        bool adjust_surplus,
-                                                        bool demote)
+static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
+                                                        bool adjust_surplus)
 {
         int nid = folio_nid(folio);
 
@@ -1649,6 +1646,7 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
         list_del(&folio->lru);
 
         if (folio_test_hugetlb_freed(folio)) {
+                folio_clear_hugetlb_freed(folio);
                 h->free_huge_pages--;
                 h->free_huge_pages_node[nid]--;
         }
@@ -1665,33 +1663,13 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
         if (!folio_test_hugetlb_vmemmap_optimized(folio))
                 __folio_clear_hugetlb(folio);
 
-        /*
-         * In the case of demote we do not ref count the page as it will soon
-         * be turned into a page of smaller size.
-         */
-        if (!demote)
-                folio_ref_unfreeze(folio, 1);
-
         h->nr_huge_pages--;
         h->nr_huge_pages_node[nid]--;
 }
 
-static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
-                                                        bool adjust_surplus)
-{
-        __remove_hugetlb_folio(h, folio, adjust_surplus, false);
-}
-
-static void remove_hugetlb_folio_for_demote(struct hstate *h, struct folio *folio,
-                                                        bool adjust_surplus)
-{
-        __remove_hugetlb_folio(h, folio, adjust_surplus, true);
-}
-
 static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
                               bool adjust_surplus)
 {
-        int zeroed;
         int nid = folio_nid(folio);
 
         VM_BUG_ON_FOLIO(!folio_test_hugetlb_vmemmap_optimized(folio), folio);
@@ -1715,21 +1693,6 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
          */
         folio_set_hugetlb_vmemmap_optimized(folio);
 
-        /*
-         * This folio is about to be managed by the hugetlb allocator and
-         * should have no users.  Drop our reference, and check for others
-         * just in case.
-         */
-        zeroed = folio_put_testzero(folio);
-        if (unlikely(!zeroed))
-                /*
-                 * It is VERY unlikely soneone else has taken a ref
-                 * on the folio.  In this case, we simply return as
-                 * free_huge_folio() will be called when this other ref
-                 * is dropped.
-                 */
-                return;
-
         arch_clear_hugetlb_flags(folio);
         enqueue_hugetlb_folio(h, folio);
 }
@@ -1783,6 +1746,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
                 spin_unlock_irq(&hugetlb_lock);
         }
 
+        folio_ref_unfreeze(folio, 1);
+
         /*
          * Non-gigantic pages demoted from CMA allocated gigantic pages
          * need to be given back to CMA in free_gigantic_folio.
@@ -3106,11 +3071,8 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
 free_new:
         spin_unlock_irq(&hugetlb_lock);
 
-        if (new_folio) {
-                /* Folio has a zero ref count, but needs a ref to be freed */
-                folio_ref_unfreeze(new_folio, 1);
+        if (new_folio)
                 update_and_free_hugetlb_folio(h, new_folio, false);
-        }
 
         return ret;
 }
@@ -3965,7 +3927,7 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
 
         target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
 
-        remove_hugetlb_folio_for_demote(h, folio, false);
+        remove_hugetlb_folio(h, folio, false);
         spin_unlock_irq(&hugetlb_lock);
 
         /*
@@ -3979,7 +3941,6 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
         if (rc) {
                 /* Allocation of vmemmmap failed, we can not demote folio */
                 spin_lock_irq(&hugetlb_lock);
-                folio_ref_unfreeze(folio, 1);
                 add_hugetlb_folio(h, folio, false);
                 return rc;
         }
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index fa00d61b6c5a..829112b0a914 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -455,6 +455,8 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
         unsigned long vmemmap_reuse;
 
         VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
+        VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
+
         if (!folio_test_hugetlb_vmemmap_optimized(folio))
                 return 0;
 
@@ -490,6 +492,9 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
  */
 int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
 {
+        /* avoid writes from page_ref_add_unless() while unfolding vmemmap */
+        synchronize_rcu();
+
         return __hugetlb_vmemmap_restore_folio(h, folio, 0);
 }
 
@@ -514,6 +519,9 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
         long restored = 0;
         long ret = 0;
 
+        /* avoid writes from page_ref_add_unless() while unfolding vmemmap */
+        synchronize_rcu();
+
         list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
                 if (folio_test_hugetlb_vmemmap_optimized(folio)) {
                         ret = __hugetlb_vmemmap_restore_folio(h, folio,
@@ -559,6 +567,8 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
         unsigned long vmemmap_reuse;
 
         VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
+        VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
+
         if (!vmemmap_should_optimize_folio(h, folio))
                 return ret;
 
@@ -610,6 +620,9 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
 {
         LIST_HEAD(vmemmap_pages);
 
+        /* avoid writes from page_ref_add_unless() while folding vmemmap */
+        synchronize_rcu();
+
         __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
         free_vmemmap_page_list(&vmemmap_pages);
 }
@@ -653,6 +666,9 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
 
         flush_tlb_all();
 
+        /* avoid writes from page_ref_add_unless() while folding vmemmap */
+        synchronize_rcu();
+
         list_for_each_entry(folio, folio_list, lru) {
                 int ret;