From patchwork Sun Dec  1 21:22:37 2024
X-Patchwork-Submitter: Peter Xu <peterx@redhat.com>
X-Patchwork-Id: 13889654
From: Peter Xu <peterx@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Rik van Riel, Breno Leitao, Andrew Morton, peterx@redhat.com,
    Muchun Song, Oscar Salvador, Roman Gushchin, Naoya Horiguchi,
    Ackerley Tng
Subject: [PATCH 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
Date: Sun, 1 Dec 2024 16:22:37 -0500
Message-ID: <20241201212240.533824-5-peterx@redhat.com>
X-Mailer: git-send-email 2.47.0
In-Reply-To: <20241201212240.533824-1-peterx@redhat.com>
References: <20241201212240.533824-1-peterx@redhat.com>
MIME-Version: 1.0

alloc_hugetlb_folio() is not an easy function to read, especially its
reservation accounting, both per-VMA and global (mostly, the subpool).  The
first complexity lies in the special private CoW path, i.e. the
cow_from_owner=true case.  The second is the confusing updates of gbl_chg
after it is first set, which make it look as if the value can change at any
time.

Logically, cow_from_owner is only about the vma reservation.  We can
decouple the flag and consolidate it into the map charge flag very early,
so that we no longer need to keep checking the CoW special case everywhere.

This patch does that by making map_chg a tri-state flag.  Needing a
tri-state is unfortunate; it is because vma_needs_reservation() currently
has an internal side effect, so it must be followed by either an end() or a
commit().
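As a rough sketch of the result (illustration only, not part of the patch;
error handling elided, names as in the diff below), the decision flow
becomes:

	/* Decide map_chg: can a per-vma resv count be consumed? */
	if (cow_from_owner)
		map_chg = MAP_CHG_ENFORCED;	/* private CoW: never use vma resv */
	else
		map_chg = vma_needs_reservation(h, vma, addr) ?
			MAP_CHG_NEEDED : MAP_CHG_REUSE;

	/* Decide gbl_chg: same question, but against the subpool */
	if (map_chg)
		gbl_chg = hugepage_subpool_get_pages(spool, 1);
	else
		gbl_chg = 0;	/* vma resv ready, no global resv needed */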
We keep one semantic unchanged from before: "if (map_chg)" still means we
need a separate per-vma resv count, which keeps most of the old code
untouched apart from the new enum.  After this patch, the variables are
decided in these steps, hopefully slightly easier to follow:

  - First, decide map_chg.  This takes cow_from_owner into account, once
    and for all.  It is about whether we can take a resv count from the
    vma, no matter whether the mapping is shared, private, etc.

  - Then, decide gbl_chg.  Compared to map_chg, the only difference here
    is the subpool.

Now each flag is updated exactly once, instead of being flipped around in
ways that are very hard to follow.

With cow_from_owner merged into map_chg, quite a few such checks can be
removed all over the place.  A side benefit is that we can get rid of one
more confusing flag, deferred_reserve.

Clean up the comments a bit too.  E.g., MAP_NORESERVE may not need to be
checked against the subpool limit, AFAIU, if it is on a shared mapping and
the page cache folio has its inode's resv map available (in which case
map_chg would have been set to zero, hence the code was already correct,
only the comment was not).

There is one trivial detail this patch touches that needs attention, the
check right after vma_commit_reservation():

  if (map_chg > map_commit)

It changes to:

  if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0))

It should behave the same as before, because previously the only way to
make "map_chg > map_commit" true was map_chg=1 && map_commit=0, which is
exactly what the rewritten line checks.  Meanwhile, both commit() and end()
need to be skipped in the ENFORCED case to keep the old behavior.
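To spell the equivalence out (illustration only): with the old code both
values were 0 or 1 at this point, and retval now plays the role of
map_commit, so:

	map_chg  map_commit/retval  old: map_chg > map_commit  new: NEEDED && retval == 0
	   0            0                  false                     false (REUSE)
	   0            1                  false                     false (REUSE)
	   1            0                  true                      true
	   1            1                  false                     false

Both conditions fire in exactly one case: a reservation was needed but the
commit found it already added, i.e. the race described in the code comment.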
"cow_from_owner" represents a very hacky usage only used in CoW * faults of hugetlb private mappings on top of a non-page-cache folio (in @@ -2969,12 +2988,11 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, struct hugepage_subpool *spool = subpool_vma(vma); struct hstate *h = hstate_vma(vma); struct folio *folio; - long map_chg, map_commit, nr_pages = pages_per_huge_page(h); - long gbl_chg; + long retval, gbl_chg, nr_pages = pages_per_huge_page(h); + map_chg_state map_chg; int memcg_charge_ret, ret, idx; struct hugetlb_cgroup *h_cg = NULL; struct mem_cgroup *memcg; - bool deferred_reserve; gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL; memcg = get_mem_cgroup_from_current(); @@ -2985,36 +3003,56 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, } idx = hstate_index(h); - /* - * Examine the region/reserve map to determine if the process - * has a reservation for the page to be allocated. A return - * code of zero indicates a reservation exists (no change). - */ - map_chg = gbl_chg = vma_needs_reservation(h, vma, addr); - if (map_chg < 0) { - if (!memcg_charge_ret) - mem_cgroup_cancel_charge(memcg, nr_pages); - mem_cgroup_put(memcg); - return ERR_PTR(-ENOMEM); + + /* Whether we need a separate per-vma reservation? */ + if (cow_from_owner) { + /* + * Special case! Since it's a CoW on top of a reserved + * page, the private resv map doesn't count. So it cannot + * consume the per-vma resv map even if it's reserved. + */ + map_chg = MAP_CHG_ENFORCED; + } else { + /* + * Examine the region/reserve map to determine if the process + * has a reservation for the page to be allocated. A return + * code of zero indicates a reservation exists (no change). + */ + retval = vma_needs_reservation(h, vma, addr); + if (retval < 0) { + if (!memcg_charge_ret) + mem_cgroup_cancel_charge(memcg, nr_pages); + mem_cgroup_put(memcg); + return ERR_PTR(-ENOMEM); + } + map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE; } /* + * Whether we need a separate global reservation? + * * Processes that did not create the mapping will have no * reserves as indicated by the region/reserve map. Check * that the allocation will not exceed the subpool limit. - * Allocations for MAP_NORESERVE mappings also need to be - * checked against any subpool limit. + * Or if it can get one from the pool reservation directly. */ - if (map_chg || cow_from_owner) { + if (map_chg) { gbl_chg = hugepage_subpool_get_pages(spool, 1); if (gbl_chg < 0) goto out_end_reservation; + } else { + /* + * If we have the vma reservation ready, no need for extra + * global reservation. + */ + gbl_chg = 0; } - /* If this allocation is not consuming a reservation, charge it now. + /* + * If this allocation is not consuming a per-vma reservation, + * charge the hugetlb cgroup now. */ - deferred_reserve = map_chg || cow_from_owner; - if (deferred_reserve) { + if (map_chg) { ret = hugetlb_cgroup_charge_cgroup_rsvd( idx, pages_per_huge_page(h), &h_cg); if (ret) @@ -3038,7 +3076,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, if (!folio) goto out_uncharge_cgroup; spin_lock_irq(&hugetlb_lock); - if (!cow_from_owner && vma_has_reserves(vma, gbl_chg)) { + if (vma_has_reserves(vma, gbl_chg)) { folio_set_hugetlb_restore_reserve(folio); h->resv_huge_pages--; } @@ -3051,7 +3089,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma, /* If allocation is not consuming a reservation, also store the * hugetlb_cgroup pointer on the page. 
 	 */
-	if (deferred_reserve) {
+	if (map_chg) {
 		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
 						  h_cg, folio);
 	}
@@ -3060,26 +3098,31 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	hugetlb_set_folio_subpool(folio, spool);
 
-	map_commit = vma_commit_reservation(h, vma, addr);
-	if (unlikely(map_chg > map_commit)) {
+	if (map_chg != MAP_CHG_ENFORCED) {
+		/* commit() is only needed if the map_chg is not enforced */
+		retval = vma_commit_reservation(h, vma, addr);
 		/*
+		 * Check for possible race conditions.  When it happens:
 		 * The page was added to the reservation map between
 		 * vma_needs_reservation and vma_commit_reservation.
 		 * This indicates a race with hugetlb_reserve_pages.
 		 * Adjust for the subpool count incremented above AND
-		 * in hugetlb_reserve_pages for the same page. Also,
+		 * in hugetlb_reserve_pages for the same page.  Also,
 		 * the reservation count added in hugetlb_reserve_pages
 		 * no longer applies.
 		 */
-		long rsv_adjust;
+		if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
+			long rsv_adjust;
 
-		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
-		hugetlb_acct_memory(h, -rsv_adjust);
-		if (deferred_reserve) {
-			spin_lock_irq(&hugetlb_lock);
-			hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
-					pages_per_huge_page(h), folio);
-			spin_unlock_irq(&hugetlb_lock);
+			rsv_adjust = hugepage_subpool_put_pages(spool, 1);
+			hugetlb_acct_memory(h, -rsv_adjust);
+			if (map_chg) {
+				spin_lock_irq(&hugetlb_lock);
+				hugetlb_cgroup_uncharge_folio_rsvd(
+					hstate_index(h), pages_per_huge_page(h),
+					folio);
+				spin_unlock_irq(&hugetlb_lock);
+			}
 		}
 	}
 
@@ -3093,14 +3136,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
 out_uncharge_cgroup_reservation:
-	if (deferred_reserve)
+	if (map_chg)
 		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
 						    h_cg);
 out_subpool_put:
-	if (map_chg || cow_from_owner)
+	if (map_chg)
 		hugepage_subpool_put_pages(spool, 1);
 out_end_reservation:
-	vma_end_reservation(h, vma, addr);
+	if (map_chg != MAP_CHG_ENFORCED)
+		vma_end_reservation(h, vma, addr);
 	if (!memcg_charge_ret)
 		mem_cgroup_cancel_charge(memcg, nr_pages);
 	mem_cgroup_put(memcg);