From patchwork Fri Aug 30 10:03:39 2024
X-Patchwork-Submitter: Usama Arif
X-Patchwork-Id: 13784871
From: Usama Arif
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev, roman.gushchin@linux.dev, yuzhao@google.com, david@redhat.com, npache@redhat.com, baohua@kernel.org, ryan.roberts@arm.com, rppt@kernel.org, willy@infradead.org, cerasuolodomenico@gmail.com, ryncsn@gmail.com, corbet@lwn.net, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kernel-team@meta.com, Usama Arif
Subject: [PATCH v5 5/6] mm: split underused THPs
Date: Fri, 30 Aug 2024 11:03:39 +0100
Message-ID: <20240830100438.3623486-6-usamaarif642@gmail.com>
X-Mailer: git-send-email 2.43.5
In-Reply-To: <20240830100438.3623486-1-usamaarif642@gmail.com>
References: <20240830100438.3623486-1-usamaarif642@gmail.com>
MIME-Version: 1.0
This is an attempt to mitigate the issue of running out of memory when THP is always enabled. During runtime, whenever a THP is being faulted in (__do_huge_pmd_anonymous_page) or collapsed by khugepaged (collapse_huge_page), the THP is added to _deferred_list.
Whenever memory reclaim happens in Linux, the kernel runs the deferred_split shrinker, which goes through the _deferred_list. If the folio was partially mapped, the shrinker attempts to split it. If the folio is not partially mapped, the shrinker checks whether the THP was underused, i.e. how many of the base 4K pages of the entire THP were zero-filled. If this number goes above a certain threshold (decided by /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the shrinker will attempt to split that THP. Then at remap time, the pages that were zero-filled are mapped to the shared zeropage, hence saving memory.

Suggested-by: Rik van Riel
Co-authored-by: Johannes Weiner
Signed-off-by: Usama Arif
---
 Documentation/admin-guide/mm/transhuge.rst |  6 +++
 include/linux/khugepaged.h                 |  1 +
 include/linux/vm_event_item.h              |  1 +
 mm/huge_memory.c                           | 60 +++++++++++++++++++++-
 mm/khugepaged.c                            |  3 +-
 mm/vmstat.c                                |  1 +
 6 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 56a086900651..aca0cff852b8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -471,6 +471,12 @@ thp_deferred_split_page
 	splitting it would free up some memory. Pages on split queue are
 	going to be split under memory pressure.
 
+thp_underused_split_page
+	is incremented when a huge page on the split queue was split
+	because it was underused. A THP is underused if the number of
+	zero pages in the THP is above a certain threshold
+	(/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none).
+
 thp_split_pmd
 	is incremented every time a PMD split into table of PTEs.
 	This can happen, for instance, when application calls mprotect() or
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index f68865e19b0b..30baae91b225 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,6 +4,7 @@
 
 #include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
 
+extern unsigned int khugepaged_max_ptes_none __read_mostly;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern struct attribute_group khugepaged_attr_group;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index aae5c7c5cfb4..aed952d04132 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -105,6 +105,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_SPLIT_PAGE,
 		THP_SPLIT_PAGE_FAILED,
 		THP_DEFERRED_SPLIT_PAGE,
+		THP_UNDERUSED_SPLIT_PAGE,
 		THP_SPLIT_PMD,
 		THP_SCAN_EXCEED_NONE_PTE,
 		THP_SCAN_EXCEED_SWAP_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 166f8810f3c6..a97aeffc55d6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1187,6 +1187,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
 		update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(vma->vm_mm);
+		deferred_split_folio(folio, false);
 		spin_unlock(vmf->ptl);
 		count_vm_event(THP_FAULT_ALLOC);
 		count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
@@ -3652,6 +3653,39 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 	return READ_ONCE(ds_queue->split_queue_len);
 }
 
+static bool thp_underused(struct folio *folio)
+{
+	int num_zero_pages = 0, num_filled_pages = 0;
+	void *kaddr;
+	int i;
+
+	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
+		return false;
+
+	for (i = 0; i < folio_nr_pages(folio); i++) {
+		kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
+		if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
+			num_zero_pages++;
+			if (num_zero_pages > khugepaged_max_ptes_none) {
+				kunmap_local(kaddr);
+				return true;
+			}
+		} else {
+			/*
+			 * Another path for early exit once the number
+			 * of non-zero filled pages exceeds threshold.
+			 */
+			num_filled_pages++;
+			if (num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+				kunmap_local(kaddr);
+				return false;
+			}
+		}
+		kunmap_local(kaddr);
+	}
+	return false;
+}
+
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 					 struct shrink_control *sc)
 {
@@ -3689,13 +3723,35 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
 	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+		bool did_split = false;
+		bool underused = false;
+
+		if (!folio_test_partially_mapped(folio)) {
+			underused = thp_underused(folio);
+			if (!underused)
+				goto next;
+		}
 		if (!folio_trylock(folio))
 			goto next;
-		/* split_huge_page() removes page from list on success */
-		if (!split_folio(folio))
+		if (!split_folio(folio)) {
+			did_split = true;
+			if (underused)
+				count_vm_event(THP_UNDERUSED_SPLIT_PAGE);
 			split++;
+		}
 		folio_unlock(folio);
 next:
+		/*
+		 * split_folio() removes folio from list on success.
+		 * Only add back to the queue if folio is partially mapped.
+		 * If thp_underused returns false, or if split_folio fails
+		 * in the case it was underused, then consider it used and
+		 * don't add it back to split_queue.
+		 */
+		if (!did_split && !folio_test_partially_mapped(folio)) {
+			list_del_init(&folio->_deferred_list);
+			ds_queue->split_queue_len--;
+		}
 		folio_put(folio);
 	}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5bfb5594c604..bf1734e8e665 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -85,7 +85,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
  *
  * Note that these are only respected if collapse was initiated by khugepaged.
  */
-static unsigned int khugepaged_max_ptes_none __read_mostly;
+unsigned int khugepaged_max_ptes_none __read_mostly;
 static unsigned int khugepaged_max_ptes_swap __read_mostly;
 static unsigned int khugepaged_max_ptes_shared __read_mostly;
 
@@ -1237,6 +1237,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		set_pmd_at(mm, address, pmd, _pmd);
 		update_mmu_cache_pmd(vma, address, pmd);
+		deferred_split_folio(folio, false);
 		spin_unlock(pmd_ptl);
 
 		folio = NULL;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f41984dc856f..bb081ae4d0ae 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1385,6 +1385,7 @@ const char * const vmstat_text[] = {
 	"thp_split_page",
 	"thp_split_page_failed",
 	"thp_deferred_split_page",
+	"thp_underused_split_page",
 	"thp_split_pmd",
 	"thp_scan_exceed_none_pte",
 	"thp_scan_exceed_swap_pte",
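For readers who want to experiment with the underused heuristic outside the kernel, the thp_underused() check in the patch can be mirrored in plain userspace C. This is a sketch, not kernel code: it assumes x86-64 geometry (PAGE_SIZE of 4096, HPAGE_PMD_NR of 512), takes the threshold as a parameter instead of reading khugepaged_max_ptes_none, and replaces memchr_inv() (which has no libc equivalent) with a manual byte scan. The function name is reused from the patch purely for clarity.

```c
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE    4096
#define HPAGE_PMD_NR 512	/* base 4K pages per 2 MiB THP on x86-64 */

/*
 * Userspace mirror of the patch's thp_underused(): walk the huge page in
 * 4K chunks, counting zero-filled and non-zero pages, and bail out early
 * in whichever direction crosses its threshold first.
 */
static bool thp_underused(const unsigned char *thp, unsigned int max_ptes_none)
{
	unsigned int num_zero_pages = 0, num_filled_pages = 0;

	/* max_ptes_none == HPAGE_PMD_NR - 1 effectively disables the check */
	if (max_ptes_none == HPAGE_PMD_NR - 1)
		return false;

	for (int i = 0; i < HPAGE_PMD_NR; i++) {
		const unsigned char *page = thp + (size_t)i * PAGE_SIZE;
		size_t j;

		/* stand-in for the kernel's memchr_inv(kaddr, 0, PAGE_SIZE) */
		for (j = 0; j < PAGE_SIZE && page[j] == 0; j++)
			;
		if (j == PAGE_SIZE) {
			if (++num_zero_pages > max_ptes_none)
				return true;	/* underused: worth splitting */
		} else if (++num_filled_pages >= HPAGE_PMD_NR - max_ptes_none) {
			/* too many non-zero pages: the THP is being used */
			return false;
		}
	}
	return false;
}
```

On a kernel with this series applied, the same logic runs in the deferred_split shrinker; its effect shows up as thp_underused_split_page in /proc/vmstat, with the threshold tunable via /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none.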