From patchwork Mon Apr 14 22:05:52 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nico Pache
X-Patchwork-Id: 14051084
From: Nico Pache
To: linux-mm@kvack.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Cc: akpm@linux-foundation.org, corbet@lwn.net, rostedt@goodmis.org,
 mhiramat@kernel.org, mathieu.desnoyers@efficios.com, david@redhat.com,
 baohua@kernel.org, baolin.wang@linux.alibaba.com, ryan.roberts@arm.com,
 willy@infradead.org, peterx@redhat.com, ziy@nvidia.com,
 wangkefeng.wang@huawei.com, usamaarif642@gmail.com, sunnanyong@huawei.com,
 vishal.moola@gmail.com, thomas.hellstrom@linux.intel.com,
 yang@os.amperecomputing.com, kirill.shutemov@linux.intel.com,
 aarcange@redhat.com, raquini@redhat.com, dev.jain@arm.com,
 anshuman.khandual@arm.com, catalin.marinas@arm.com, tiwai@suse.de,
 will@kernel.org, dave.hansen@linux.intel.com, jack@suse.cz, cl@gentwo.org,
 jglisse@google.com, surenb@google.com, zokeefe@google.com,
 hannes@cmpxchg.org, rientjes@google.com, mhocko@suse.com
Subject: [PATCH v3 07/12] khugepaged: add mTHP support
Date: Mon, 14 Apr 2025 16:05:52 -0600
Message-ID: <20250414220557.35388-8-npache@redhat.com>
In-Reply-To: <20250414220557.35388-1-npache@redhat.com>
References: <20250414220557.35388-1-npache@redhat.com>

Introduce the ability for khugepaged to collapse to different mTHP sizes.
While scanning PMD ranges for potential collapse candidates, keep track
of pages in MIN_MTHP_ORDER chunks via a bitmap. Each bit represents a
utilized region of order MIN_MTHP_ORDER ptes.
If mTHPs are enabled, we remove the restriction of max_ptes_none during
the scan phase so we don't bail out early and miss potential mTHP
candidates.

After the scan is complete, we will perform binary recursion on the
bitmap to determine which mTHP size would be most efficient to collapse
to. max_ptes_none will be scaled by the attempted collapse order to
determine how full a THP must be to be eligible.

If an mTHP collapse is attempted but the range contains swapped-out or
shared pages, we don't perform the collapse.

Signed-off-by: Nico Pache
---
 include/linux/khugepaged.h |   2 +-
 mm/khugepaged.c            | 122 ++++++++++++++++++++++++++-----------
 2 files changed, 89 insertions(+), 35 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 60d41215bc1a..18fe6eb5051d 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -1,7 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_KHUGEPAGED_H
 #define _LINUX_KHUGEPAGED_H
-#define KHUGEPAGED_MIN_MTHP_ORDER	3
+#define KHUGEPAGED_MIN_MTHP_ORDER	2
 #define KHUGEPAGED_MIN_MTHP_NR	(1<anon_vma);
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
-				address + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
+				_address + (PAGE_SIZE << order));
 	mmu_notifier_invalidate_range_start(&range);
 
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
+
 	/*
 	 * This removes any huge TLB entry from the CPU so we won't allow
 	 * huge and small TLB entries for the same virtual address to
@@ -1226,10 +1230,10 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_remove_table_sync_one();
 
-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
 	if (pte) {
-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-					&compound_pagelist, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_isolate(vma, _address, pte, cc,
+					&compound_pagelist, order);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;
@@ -1258,8 +1262,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	anon_vma_unlock_write(vma->anon_vma);
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
-					   vma, address, pte_ptl,
-					   &compound_pagelist, HPAGE_PMD_ORDER);
+					   vma, _address, pte_ptl,
+					   &compound_pagelist, order);
 	pte_unmap(pte);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
@@ -1270,20 +1274,35 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * write.
 	 */
 	__folio_mark_uptodate(folio);
-	pgtable = pmd_pgtable(_pmd);
-
-	_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
-	_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
-	spin_lock(pmd_ptl);
-	BUG_ON(!pmd_none(*pmd));
-	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
-	folio_add_lru_vma(folio, vma);
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	set_pmd_at(mm, address, pmd, _pmd);
-	update_mmu_cache_pmd(vma, address, pmd);
-	deferred_split_folio(folio, false);
-	spin_unlock(pmd_ptl);
+	if (order == HPAGE_PMD_ORDER) {
+		pgtable = pmd_pgtable(_pmd);
+		_pmd = mk_huge_pmd(&folio->page, vma->vm_page_prot);
+		_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+		spin_lock(pmd_ptl);
+		BUG_ON(!pmd_none(*pmd));
+		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		set_pmd_at(mm, address, pmd, _pmd);
+		update_mmu_cache_pmd(vma, address, pmd);
+		deferred_split_folio(folio, false);
+		spin_unlock(pmd_ptl);
+	} else { //mTHP
+		mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
+		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+
+		spin_lock(pmd_ptl);
+		folio_ref_add(folio, (1 << order) - 1);
+		folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
+		update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
+
+		smp_wmb(); /* make pte visible before pmd */
+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+		spin_unlock(pmd_ptl);
+	}
 
 	folio = NULL;
 
@@ -1364,31 +1383,58 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
+	int i;
 	int result = SCAN_FAIL, referenced = 0;
 	int none_or_zero = 0, shared = 0;
 	struct page *page = NULL;
 	struct folio *folio = NULL;
 	unsigned long _address;
+	unsigned long enabled_orders;
 	spinlock_t *ptl;
 	int node = NUMA_NO_NODE, unmapped = 0;
+	bool is_pmd_only;
 	bool writable = false;
-
+	int chunk_none_count = 0;
+	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
+	unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
 	result = find_pmd_or_thp_or_none(mm, address, &pmd);
 	if (result != SCAN_SUCCEED)
 		goto out;
 
+	bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+	bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
 	memset(cc->node_load, 0, sizeof(cc->node_load));
 	nodes_clear(cc->alloc_nmask);
+
+	enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+		tva_flags, THP_ORDERS_ALL_ANON);
+
+	is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
+
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (!pte) {
 		result = SCAN_PMD_NULL;
 		goto out;
 	}
 
-	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
-	     _pte++, _address += PAGE_SIZE) {
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		/*
+		 * we are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. if
+		 * there are pages in this chunk keep track of it in the bitmap
+		 * for mTHP collapsing.
+		 */
+		if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
+			if (chunk_none_count <= scaled_none)
+				bitmap_set(cc->mthp_bitmap,
+					   i / KHUGEPAGED_MIN_MTHP_NR, 1);
+
+			chunk_none_count = 0;
+		}
+
+		_pte = pte + i;
+		_address = address + i * PAGE_SIZE;
 		pte_t pteval = ptep_get(_pte);
 		if (is_swap_pte(pteval)) {
 			++unmapped;
@@ -1411,10 +1457,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			}
 		}
 		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			++chunk_none_count;
 			++none_or_zero;
 			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
+			    (!cc->is_khugepaged || !is_pmd_only ||
+			     none_or_zero <= khugepaged_max_ptes_none)) {
 				continue;
 			} else {
 				result = SCAN_EXCEED_NONE_PTE;
@@ -1510,6 +1557,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 								     address)))
 			referenced++;
 	}
+
 	if (!writable) {
 		result = SCAN_PAGE_RO;
 	} else if (cc->is_khugepaged &&
@@ -1522,8 +1570,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
-		result = collapse_huge_page(mm, address, referenced,
-					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
+		result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
+						mmap_locked, enabled_orders);
+		if (result > 0)
+			result = SCAN_SUCCEED;
+		else
+			result = SCAN_FAIL;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
@@ -2479,11 +2531,13 @@ static int khugepaged_collapse_single_pmd(unsigned long addr,
 			fput(file);
 			if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
 				mmap_read_lock(mm);
+				*mmap_locked = true;
 				if (khugepaged_test_exit_or_disable(mm))
 					goto end;
 				result = collapse_pte_mapped_thp(mm, addr,
 								 !cc->is_khugepaged);
 				mmap_read_unlock(mm);
+				*mmap_locked = false;
 			}
 		} else {
 			result = khugepaged_scan_pmd(mm, vma, addr,
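
Illustrative note (not part of the patch): the scan-then-select flow
described in the commit message can be sketched in standalone userspace C.
This is a minimal sketch under stated assumptions, not the kernel
implementation. It assumes 4 KiB pages with an order-9 PMD, and every
sketch_* and SKETCH_* name is hypothetical. The eligibility rule in
sketch_pick_orders() (empty chunks allowed = scaled max_ptes_none divided
by the chunk size) is an assumption made for illustration, and each chunk
is judged after all of its own PTEs have been counted, which is a
simplification of the loop in the diff.

```c
#include <stdbool.h>

/* Assumed geometry: 4 KiB pages, 2 MiB PMD (order 9). */
#define SKETCH_PMD_ORDER	9
#define SKETCH_PMD_NR		(1 << SKETCH_PMD_ORDER)		/* 512 PTEs */
#define SKETCH_MIN_ORDER	2	/* mirrors KHUGEPAGED_MIN_MTHP_ORDER */
#define SKETCH_MIN_NR		(1 << SKETCH_MIN_ORDER)		/* 4 PTEs per chunk */
#define SKETCH_NR_CHUNKS	(SKETCH_PMD_NR / SKETCH_MIN_NR)	/* 128 chunks */

/* Scale a PMD-sized max_ptes_none limit down to a smaller order. */
static int sketch_scale_max_none(int max_ptes_none, int order)
{
	return max_ptes_none >> (SKETCH_PMD_ORDER - order);
}

/*
 * Scan phase: present[i] is true when PTE i maps a page. A chunk is
 * marked utilized when its number of none PTEs is within the limit
 * scaled to the minimum order.
 */
static void sketch_build_bitmap(const bool *present, int max_ptes_none,
				bool *chunk_used)
{
	int scaled_none = sketch_scale_max_none(max_ptes_none,
						SKETCH_MIN_ORDER);

	for (int chunk = 0; chunk < SKETCH_NR_CHUNKS; chunk++) {
		int none = 0;

		for (int j = 0; j < SKETCH_MIN_NR; j++)
			if (!present[chunk * SKETCH_MIN_NR + j])
				none++;
		chunk_used[chunk] = (none <= scaled_none);
	}
}

/*
 * Selection phase: binary recursion over the chunk bitmap. A region is
 * taken at the current order when it has at least one utilized chunk
 * and no more empty chunks than the (assumed) scaled limit allows;
 * otherwise it is split in half and each half is retried one order
 * lower. Returns the number of PTEs covered by the chosen regions.
 */
static int sketch_pick_orders(const bool *chunk_used, int first_chunk,
			      int order, int max_ptes_none)
{
	int nr_chunks = 1 << (order - SKETCH_MIN_ORDER);
	int allowed_empty = sketch_scale_max_none(max_ptes_none, order) /
			    SKETCH_MIN_NR;
	int used = 0;

	for (int i = 0; i < nr_chunks; i++)
		if (chunk_used[first_chunk + i])
			used++;

	if (used > 0 && nr_chunks - used <= allowed_empty)
		return 1 << order;
	if (order == SKETCH_MIN_ORDER)
		return 0;

	return sketch_pick_orders(chunk_used, first_chunk, order - 1,
				  max_ptes_none) +
	       sketch_pick_orders(chunk_used, first_chunk + nr_chunks / 2,
				  order - 1, max_ptes_none);
}
```

With max_ptes_none = 0, a fully populated range is taken whole at order 9
(512 PTEs), while a range with only its first 16 PTEs present is taken as
a single order-4 region (16 PTEs). With max_ptes_none = 511, the same
sparse range is still taken whole at order 9, because the scaled limit
tolerates the empty chunks.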