From patchwork Thu Jun 27 04:43:34 2024
X-Patchwork-Submitter: Yu Zhao
X-Patchwork-Id: 13713816
Date: Wed, 26 Jun 2024 22:43:34 -0600
X-Mailer: git-send-email 2.45.2.803.g4e1b14247a-goog
Message-ID: <20240627044335.2698578-1-yuzhao@google.com>
Subject: [PATCH mm-unstable v1] mm/hugetlb_vmemmap: fix race with speculative PFN walkers
From: Yu Zhao
To: Andrew Morton, Muchun Song
Cc: David Hildenbrand, Frank van der Linden, "Matthew Wilcox (Oracle)",
 Peter Xu, Yang Shi, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Yu Zhao

While investigating HVO for THPs [1], it turns out that speculative
PFN walkers like compaction can race with vmemmap modifications, e.g.,

  CPU 1 (vmemmap modifier)         CPU 2 (speculative PFN walker)
  -------------------------------  ------------------------------
  Allocates an LRU folio page1
                                   Sees page1
  Frees page1

  Allocates a hugeTLB folio page2
  (page1 being a tail of page2)

  Updates vmemmap mapping page1
                                   get_page_unless_zero(page1)

Even though page1->_refcount is zero after HVO, get_page_unless_zero()
can still try to modify this read-only field, resulting in a crash.

An independent report [2] confirmed this race.

Two approaches were discussed to fix this race:

1. Make RO vmemmap RW so that get_page_unless_zero() can fail without
   triggering a page fault.
2. Use RCU to make sure get_page_unless_zero() either sees zero
   page->_refcount through the old vmemmap or non-zero page->_refcount
   through the new one.

The second approach is preferred here because:

1. It can prevent illegal modifications to struct page[] that has been
   HVO'ed;
2. It can be generalized, in a way similar to ZERO_PAGE(), to fix
   similar races in other places, e.g., arch_remove_memory() on x86
   [3], which frees the vmemmap mapping offlined struct page[].

While adding synchronize_rcu(), the goal is to be surgical, rather than
optimized. Specifically, calls to synchronize_rcu() on the error-handling
paths could be coalesced, but that is not done here for the sake of
simplicity: notably, this fix removes ~50% more lines than it adds.
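To make the ordering concrete, the two sides of the synchronization pair
up roughly as sketched below. This is an illustrative sketch only, not
part of the patch: walker_side() and modifier_side() are hypothetical
wrappers, while rcu_read_lock()/rcu_read_unlock(), synchronize_rcu(),
page_is_fake_head(), page_ref_count() and atomic_add_unless() are the
primitives the diff below actually uses.

  /* CPU 2: speculative PFN walker, e.g. via get_page_unless_zero() */
  static bool walker_side(struct page *page)
  {
          bool ret = false;

          rcu_read_lock();
          /* never write to a struct page that HVO made read-only */
          if (!page_is_fake_head(page) && page_ref_count(page) != 0)
                  ret = atomic_add_unless(&page->_refcount, 1, 0);
          rcu_read_unlock();

          return ret;
  }

  /* CPU 1: vmemmap modifier, e.g. hugetlb_vmemmap_optimize_folio() */
  static void modifier_side(struct folio *folio)
  {
          /*
           * Wait for walkers that might still be inside the read-side
           * section above.  Per the description above, a later
           * get_page_unless_zero() then either sees a zero _refcount
           * through the old vmemmap or bails out on the new one
           * without writing to it.
           */
          synchronize_rcu();

          /* ... remap the tail struct pages read-only ... */
  }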
[1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@google.com/
[2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@linux.dev/
[3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@redhat.com/

Signed-off-by: Yu Zhao
Acked-by: Muchun Song
---
 include/linux/page_ref.h |  8 +++++-
 mm/hugetlb.c             | 53 ++++++----------------------------------
 mm/hugetlb_vmemmap.c     | 16 ++++++++++++
 3 files changed, 30 insertions(+), 47 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 490d0ad6e56d..8c236c651d1d 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -230,7 +230,13 @@ static inline int folio_ref_dec_return(struct folio *folio)
 
 static inline bool page_ref_add_unless(struct page *page, int nr, int u)
 {
-        bool ret = atomic_add_unless(&page->_refcount, nr, u);
+        bool ret = false;
+
+        rcu_read_lock();
+        /* avoid writing to the vmemmap area being remapped */
+        if (!page_is_fake_head(page) && page_ref_count(page) != u)
+                ret = atomic_add_unless(&page->_refcount, nr, u);
+        rcu_read_unlock();
 
         if (page_ref_tracepoint_active(page_ref_mod_unless))
                 __page_ref_mod_unless(page, nr, ret);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9691624fcb79..1ddaf25737da 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1629,13 +1629,10 @@ static inline void destroy_compound_gigantic_folio(struct folio *folio,
  * folio appears as just a compound page.  Otherwise, wait until after
  * allocating vmemmap to clear the flag.
  *
- * A reference is held on the folio, except in the case of demote.
- *
  * Must be called with hugetlb lock held.
  */
-static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
-                                                        bool adjust_surplus,
-                                                        bool demote)
+static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
+                                                        bool adjust_surplus)
 {
         int nid = folio_nid(folio);
 
@@ -1657,6 +1654,7 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
                 h->surplus_huge_pages_node[nid]--;
         }
 
+        folio_clear_hugetlb_freed(folio);
         /*
          * We can only clear the hugetlb flag after allocating vmemmap
          * pages.  Otherwise, someone (memory error handling) may try to write
@@ -1665,33 +1663,13 @@ static void __remove_hugetlb_folio(struct hstate *h, struct folio *folio,
         if (!folio_test_hugetlb_vmemmap_optimized(folio))
                 __folio_clear_hugetlb(folio);
 
-        /*
-         * In the case of demote we do not ref count the page as it will soon
-         * be turned into a page of smaller size.
-         */
-        if (!demote)
-                folio_ref_unfreeze(folio, 1);
-
         h->nr_huge_pages--;
         h->nr_huge_pages_node[nid]--;
 }
 
-static void remove_hugetlb_folio(struct hstate *h, struct folio *folio,
-                                                        bool adjust_surplus)
-{
-        __remove_hugetlb_folio(h, folio, adjust_surplus, false);
-}
-
-static void remove_hugetlb_folio_for_demote(struct hstate *h, struct folio *folio,
-                                                        bool adjust_surplus)
-{
-        __remove_hugetlb_folio(h, folio, adjust_surplus, true);
-}
-
 static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
                              bool adjust_surplus)
 {
-        int zeroed;
         int nid = folio_nid(folio);
 
         VM_BUG_ON_FOLIO(!folio_test_hugetlb_vmemmap_optimized(folio), folio);
@@ -1715,21 +1693,6 @@ static void add_hugetlb_folio(struct hstate *h, struct folio *folio,
          */
         folio_set_hugetlb_vmemmap_optimized(folio);
 
-        /*
-         * This folio is about to be managed by the hugetlb allocator and
-         * should have no users.  Drop our reference, and check for others
-         * just in case.
-         */
-        zeroed = folio_put_testzero(folio);
-        if (unlikely(!zeroed))
-                /*
-                 * It is VERY unlikely soneone else has taken a ref
-                 * on the folio.  In this case, we simply return as
-                 * free_huge_folio() will be called when this other ref
-                 * is dropped.
-                 */
-                return;
-
         arch_clear_hugetlb_flags(folio);
         enqueue_hugetlb_folio(h, folio);
 }
@@ -1783,6 +1746,8 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
                 spin_unlock_irq(&hugetlb_lock);
         }
 
+        folio_ref_unfreeze(folio, 1);
+
         /*
          * Non-gigantic pages demoted from CMA allocated gigantic pages
          * need to be given back to CMA in free_gigantic_folio.
@@ -3106,11 +3071,8 @@ static int alloc_and_dissolve_hugetlb_folio(struct hstate *h,
 
 free_new:
         spin_unlock_irq(&hugetlb_lock);
-        if (new_folio) {
-                /* Folio has a zero ref count, but needs a ref to be freed */
-                folio_ref_unfreeze(new_folio, 1);
+        if (new_folio)
                 update_and_free_hugetlb_folio(h, new_folio, false);
-        }
 
         return ret;
 }
@@ -3965,7 +3927,7 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
 
         target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
 
-        remove_hugetlb_folio_for_demote(h, folio, false);
+        remove_hugetlb_folio(h, folio, false);
         spin_unlock_irq(&hugetlb_lock);
 
         /*
@@ -3979,7 +3941,6 @@ static int demote_free_hugetlb_folio(struct hstate *h, struct folio *folio)
         if (rc) {
                 /* Allocation of vmemmmap failed, we can not demote folio */
                 spin_lock_irq(&hugetlb_lock);
-                folio_ref_unfreeze(folio, 1);
                 add_hugetlb_folio(h, folio, false);
                 return rc;
         }
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index fa00d61b6c5a..829112b0a914 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -455,6 +455,8 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
         unsigned long vmemmap_reuse;
 
         VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
+        VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
+
         if (!folio_test_hugetlb_vmemmap_optimized(folio))
                 return 0;
 
@@ -490,6 +492,9 @@ static int __hugetlb_vmemmap_restore_folio(const struct hstate *h,
  */
 int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
 {
+        /* avoid writes from page_ref_add_unless() while unfolding vmemmap */
+        synchronize_rcu();
+
         return __hugetlb_vmemmap_restore_folio(h, folio, 0);
 }
 
@@ -514,6 +519,9 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
         long restored = 0;
         long ret = 0;
 
+        /* avoid writes from page_ref_add_unless() while unfolding vmemmap */
+        synchronize_rcu();
+
         list_for_each_entry_safe(folio, t_folio, folio_list, lru) {
                 if (folio_test_hugetlb_vmemmap_optimized(folio)) {
                         ret = __hugetlb_vmemmap_restore_folio(h, folio,
@@ -559,6 +567,8 @@ static int __hugetlb_vmemmap_optimize_folio(const struct hstate *h,
         unsigned long vmemmap_reuse;
 
         VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(folio), folio);
+        VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio), folio);
+
         if (!vmemmap_should_optimize_folio(h, folio))
                 return ret;
 
@@ -610,6 +620,9 @@ void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
 {
         LIST_HEAD(vmemmap_pages);
 
+        /* avoid writes from page_ref_add_unless() while folding vmemmap */
+        synchronize_rcu();
+
         __hugetlb_vmemmap_optimize_folio(h, folio, &vmemmap_pages, 0);
         free_vmemmap_page_list(&vmemmap_pages);
 }
@@ -653,6 +666,9 @@ void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_l
 
         flush_tlb_all();
 
+        /* avoid writes from page_ref_add_unless() while folding vmemmap */
+        synchronize_rcu();
+
         list_for_each_entry(folio, folio_list, lru) {
                 int ret;