From patchwork Tue Jan 7 20:39:59 2025
X-Patchwork-Submitter: Peter Xu
X-Patchwork-Id: 13929594
From: Peter Xu
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Breno Leitao, Rik van Riel, Muchun Song, Naoya Horiguchi, Roman Gushchin, Ackerley Tng, Andrew Morton, peterx@redhat.com, Oscar Salvador
Subject: [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
Date: Tue, 7 Jan 2025 15:39:59 -0500
Message-ID: <20250107204002.2683356-5-peterx@redhat.com>
In-Reply-To: <20250107204002.2683356-1-peterx@redhat.com>
References: <20250107204002.2683356-1-peterx@redhat.com>
alloc_hugetlb_folio() is not an easy function to read, especially its
reservation accounting, for either the VMA or globally (mostly, the
subpool).  The first complexity lies in the special private CoW path,
i.e. the cow_from_owner=true case.  The second is the confusing updates
of gbl_chg after it is set once, which make it look like the flags can
change at any time on the fly.
Logically, cow_from_owner only concerns the vma reservation.  We can
decouple that flag and fold it into the map charge state very early, so
the code no longer needs to re-check the CoW special case at every step.

This patch does so by making map_chg a tri-state flag.  Needing three
states is unfortunate; it is because vma_needs_reservation() currently
has an internal side effect, so it must be followed by either end() or
commit().

One semantic is kept exactly as before: "if (map_chg)" still means we
need a separate per-vma resv count, which leaves most of the old code
untouched under the new enum.

After this patch, the variables are decided in two steps, hopefully
slightly easier to follow:

  - First, decide map_chg.  This takes cow_from_owner into account, once
    and for all.  It is about whether we can consume a resv count from
    the vma, no matter whether the mapping is shared, private, etc.

  - Then, decide gbl_chg.  Compared to map_chg, the only difference here
    is the subpool.

Each flag is now set once and for all, instead of being flipped on the
fly, which was very hard to follow.  With cow_from_owner merged into
map_chg, quite a few such checks can be removed.  A side benefit is that
the confusing deferred_reserve flag goes away too.

The comments are cleaned up a bit as well.  E.g., MAP_NORESERVE may not
need to be checked against the spool limit, AFAIU, if it is on a shared
mapping and the page cache folio has its inode's resv map available (in
which case map_chg would have been set to zero, so the code should be
correct and only the comment was wrong).

One trivial detail that this patch touches deserves attention: the check
right after vma_commit_reservation(),

  if (map_chg > map_commit)

changes to:

  if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0))

It should behave the same as before, because previously the only way to
make "map_chg > map_commit" true was map_chg=1 && map_commit=0, which is
exactly what the rewritten line tests.
Meanwhile, either commit() or end() needs to be skipped in the
MAP_CHG_ENFORCED case, to keep the old behavior.

Even though the diff looks large, no functional change is expected.

Signed-off-by: Peter Xu
---
 mm/hugetlb.c | 110 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 77 insertions(+), 33 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cdbc8914a9f7..b8a849fe1531 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2970,6 +2970,25 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 	return ret;
 }
 
+typedef enum {
+	/*
+	 * For either 0 or 1: we checked the per-vma resv map, and one resv
+	 * count either can be reused (0), or an extra is needed (1).
+	 */
+	MAP_CHG_REUSE = 0,
+	MAP_CHG_NEEDED = 1,
+	/*
+	 * The per-vma resv count cannot be used, hence a new resv
+	 * count is enforced.
+	 *
+	 * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
+	 * that currently vma_needs_reservation() has an unwanted side
+	 * effect of requiring either end() or commit() to complete the
+	 * transaction.  Hence it needs to be differentiated from NEEDED.
+	 */
+	MAP_CHG_ENFORCED = 2,
+} map_chg_state;
+
 /*
  * replace_free_hugepage_folios - Replace free hugepage folios in a given pfn
  * range with new folios.
@@ -3021,40 +3040,59 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct folio *folio;
-	long map_chg, map_commit;
-	long gbl_chg;
+	long retval, gbl_chg;
+	map_chg_state map_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
-	bool deferred_reserve;
 	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
 
 	idx = hstate_index(h);
-	/*
-	 * Examine the region/reserve map to determine if the process
-	 * has a reservation for the page to be allocated.  A return
-	 * code of zero indicates a reservation exists (no change).
-	 */
-	map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
-	if (map_chg < 0)
-		return ERR_PTR(-ENOMEM);
+
+	/* Whether we need a separate per-vma reservation? */
+	if (cow_from_owner) {
+		/*
+		 * Special case!  Since it's a CoW on top of a reserved
+		 * page, the private resv map doesn't count.  So it cannot
+		 * consume the per-vma resv map even if it's reserved.
+		 */
+		map_chg = MAP_CHG_ENFORCED;
+	} else {
+		/*
+		 * Examine the region/reserve map to determine if the process
+		 * has a reservation for the page to be allocated.  A return
+		 * code of zero indicates a reservation exists (no change).
+		 */
+		retval = vma_needs_reservation(h, vma, addr);
+		if (retval < 0)
+			return ERR_PTR(-ENOMEM);
+		map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
+	}
 
 	/*
+	 * Whether we need a separate global reservation?
+	 *
 	 * Processes that did not create the mapping will have no
 	 * reserves as indicated by the region/reserve map. Check
 	 * that the allocation will not exceed the subpool limit.
-	 * Allocations for MAP_NORESERVE mappings also need to be
-	 * checked against any subpool limit.
+	 * Or if it can get one from the pool reservation directly.
 	 */
-	if (map_chg || cow_from_owner) {
+	if (map_chg) {
 		gbl_chg = hugepage_subpool_get_pages(spool, 1);
 		if (gbl_chg < 0)
 			goto out_end_reservation;
+	} else {
+		/*
+		 * If we have the vma reservation ready, no need for extra
+		 * global reservation.
+		 */
+		gbl_chg = 0;
 	}
 
-	/* If this allocation is not consuming a reservation, charge it now.
+	/*
+	 * If this allocation is not consuming a per-vma reservation,
+	 * charge the hugetlb cgroup now.
 	 */
-	deferred_reserve = map_chg || cow_from_owner;
-	if (deferred_reserve) {
+	if (map_chg) {
 		ret = hugetlb_cgroup_charge_cgroup_rsvd(
 			idx, pages_per_huge_page(h), &h_cg);
 		if (ret)
@@ -3078,7 +3116,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	if (!folio)
 		goto out_uncharge_cgroup;
 	spin_lock_irq(&hugetlb_lock);
-	if (!cow_from_owner && vma_has_reserves(vma, gbl_chg)) {
+	if (vma_has_reserves(vma, gbl_chg)) {
 		folio_set_hugetlb_restore_reserve(folio);
 		h->resv_huge_pages--;
 	}
@@ -3091,7 +3129,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	/* If allocation is not consuming a reservation, also store the
 	 * hugetlb_cgroup pointer on the page.
 	 */
-	if (deferred_reserve) {
+	if (map_chg) {
 		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
 						  h_cg, folio);
 	}
@@ -3100,26 +3138,31 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	hugetlb_set_folio_subpool(folio, spool);
 
-	map_commit = vma_commit_reservation(h, vma, addr);
-	if (unlikely(map_chg > map_commit)) {
+	if (map_chg != MAP_CHG_ENFORCED) {
+		/* commit() is only needed if the map_chg is not enforced */
+		retval = vma_commit_reservation(h, vma, addr);
 		/*
+		 * Check for possible race conditions.  When it happens..
 		 * The page was added to the reservation map between
 		 * vma_needs_reservation and vma_commit_reservation.
 		 * This indicates a race with hugetlb_reserve_pages.
 		 * Adjust for the subpool count incremented above AND
-		 * in hugetlb_reserve_pages for the same page. Also,
+		 * in hugetlb_reserve_pages for the same page.  Also,
 		 * the reservation count added in hugetlb_reserve_pages
 		 * no longer applies.
 		 */
-		long rsv_adjust;
+		if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
+			long rsv_adjust;
 
-		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
-		hugetlb_acct_memory(h, -rsv_adjust);
-		if (deferred_reserve) {
-			spin_lock_irq(&hugetlb_lock);
-			hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
-					pages_per_huge_page(h), folio);
-			spin_unlock_irq(&hugetlb_lock);
+			rsv_adjust = hugepage_subpool_put_pages(spool, 1);
+			hugetlb_acct_memory(h, -rsv_adjust);
+			if (map_chg) {
+				spin_lock_irq(&hugetlb_lock);
+				hugetlb_cgroup_uncharge_folio_rsvd(
+					hstate_index(h), pages_per_huge_page(h),
+					folio);
+				spin_unlock_irq(&hugetlb_lock);
+			}
 		}
 	}
 
@@ -3141,14 +3184,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
 out_uncharge_cgroup_reservation:
-	if (deferred_reserve)
+	if (map_chg)
 		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
 						    h_cg);
 out_subpool_put:
-	if (map_chg || cow_from_owner)
+	if (map_chg)
 		hugepage_subpool_put_pages(spool, 1);
 out_end_reservation:
-	vma_end_reservation(h, vma, addr);
+	if (map_chg != MAP_CHG_ENFORCED)
+		vma_end_reservation(h, vma, addr);
 	return ERR_PTR(-ENOSPC);
 }