From patchwork Sun Dec 1 21:22:34 2024
X-Patchwork-Submitter: Peter Xu
X-Patchwork-Id: 13889651
From: Peter Xu
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Rik van Riel, Breno Leitao, Andrew Morton, peterx@redhat.com,
    Muchun Song, Oscar Salvador, Roman Gushchin, Naoya Horiguchi,
    Ackerley Tng, linux-stable
Subject: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
Date: Sun, 1 Dec 2024 16:22:34 -0500
Message-ID: <20241201212240.533824-2-peterx@redhat.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <20241201212240.533824-1-peterx@redhat.com>
References: <20241201212240.533824-1-peterx@redhat.com>

Commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a process
that called mmap(MAP_PRIVATE) on hugetlbfs will succeed") introduced
avoid_reserve for a special case of CoW on hugetlb private mappings: it
is set only when the owner VMA is allocating yet another hugetlb folio
that is not reserved within the private vma reserved map.

Later, commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas
hole punched by fallocate") made alloc_huge_page() refuse to consume any
global reservation as long as avoid_reserve=true.
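For context, the avoid_reserve special case in question, as it stands in
alloc_hugetlb_folio() before this patch, is the block that the diff below
removes (error handling elided here):

	gbl_chg = hugepage_subpool_get_pages(spool, 1);
	...
	/*
	 * ... However, if avoid_reserve is specified we still avoid even
	 * the subpool reservations.
	 */
	if (avoid_reserve)
		gbl_chg = 1;

Per the comments in the code, a zero return from
hugepage_subpool_get_pages() means a subpool reservation was consumed;
forcing gbl_chg to 1 afterwards makes the rest of the allocation path
behave as if no reservation existed at all.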
This does not look correct: even though it keeps the allocation from
consuming the global reservation directly, the allocation will still try
to take one reservation from the subpool (if a subpool exists).  Since
subpool reserved pages are themselves taken from the global reservation,
it will take one reservation globally anyway, so the global reservation
count can be thrown off.

I wrote the reproducer below to trigger this special path; every run of
the program causes the global reservation count to increment by one (a
quick way to watch this is sketched near the end of this message), until
it hits the number of free pages:

#define _GNU_SOURCE    /* See feature_test_macros(7) */
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MSIZE  (2UL << 20)

int main(int argc, char *argv[])
{
	const char *path;
	int *buf;
	int fd, ret;
	pid_t child;

	if (argc < 2) {
		printf("usage: %s <hugetlb_file>\n", argv[0]);
		return -1;
	}
	path = argv[1];

	fd = open(path, O_RDWR | O_CREAT, 0666);
	if (fd < 0) {
		perror("open failed");
		return -1;
	}
	ret = fallocate(fd, 0, 0, MSIZE);
	if (ret != 0) {
		perror("fallocate");
		return -1;
	}

	buf = mmap(NULL, MSIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap() failed");
		return -1;
	}

	/* Allocate a page */
	*buf = 1;

	child = fork();
	if (child == 0) {
		/* child doesn't need to do anything */
		exit(0);
	}

	/* Trigger CoW from owner */
	*buf = 2;

	munmap(buf, MSIZE);
	close(fd);
	unlink(path);
	return 0;
}

It only reproduces on a sub-mount, when there are reserved pages in the
subpool, e.g.:

  # sysctl vm.nr_hugepages=128
  # mkdir ./hugetlb-pool
  # mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool

Then run the reproducer on the mountpoint:

  # ./reproducer ./hugetlb-pool/test

Fix it by taking the reservation from the subpool when it is available.
In general, avoid_reserve is IMHO more about "avoid the vma resv map"
than about the subpool's reservations.

I copied stable; however, I have no intention of backporting this if it
is not a clean cherry-pick, because a private hugetlb mapping with a
fork() on top of it is too rare a case to hit.
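[A suggested check, not part of the original report: the global
reservation count discussed above is visible as HugePages_Rsvd in
/proc/meminfo, so the drift can be watched around each reproducer run;
the exact values depend on the system.]

  # grep HugePages_Rsvd /proc/meminfo
  # ./reproducer ./hugetlb-pool/test
  # grep HugePages_Rsvd /proc/meminfo

Without the fix, the second read should show a value one higher than the
first; with the fix applied, the two reads should match.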
Cc: linux-stable
Fixes: d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate")
Signed-off-by: Peter Xu
---
 mm/hugetlb.c | 22 +++-------------------
 1 file changed, 3 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cec4b121193f..9ce69fd22a01 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1394,8 +1394,7 @@ static unsigned long available_huge_pages(struct hstate *h)
 
 static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 				struct vm_area_struct *vma,
-				unsigned long address, int avoid_reserve,
-				long chg)
+				unsigned long address, long chg)
 {
 	struct folio *folio = NULL;
 	struct mempolicy *mpol;
@@ -1411,10 +1410,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 	if (!vma_has_reserves(vma, chg) && !available_huge_pages(h))
 		goto err;
 
-	/* If reserves cannot be used, ensure enough pages are in the pool */
-	if (avoid_reserve && !available_huge_pages(h))
-		goto err;
-
 	gfp_mask = htlb_alloc_mask(h);
 	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
 
@@ -1430,7 +1425,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 
 	folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
 					nid, nodemask);
-	if (folio && !avoid_reserve && vma_has_reserves(vma, chg)) {
+	if (folio && vma_has_reserves(vma, chg)) {
 		folio_set_hugetlb_restore_reserve(folio);
 		h->resv_huge_pages--;
 	}
@@ -3007,17 +3002,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		gbl_chg = hugepage_subpool_get_pages(spool, 1);
 		if (gbl_chg < 0)
 			goto out_end_reservation;
-
-		/*
-		 * Even though there was no reservation in the region/reserve
-		 * map, there could be reservations associated with the
-		 * subpool that can be used. This would be indicated if the
-		 * return value of hugepage_subpool_get_pages() is zero.
-		 * However, if avoid_reserve is specified we still avoid even
-		 * the subpool reservations.
-		 */
-		if (avoid_reserve)
-			gbl_chg = 1;
 	}
 
 	/* If this allocation is not consuming a reservation, charge it now.
@@ -3040,7 +3024,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	 * from the global free pool (global change). gbl_chg == 0 indicates
 	 * a reservation exists for the allocation.
 	 */
-	folio = dequeue_hugetlb_folio_vma(h, vma, addr, avoid_reserve, gbl_chg);
+	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
 		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);