From patchwork Tue Aug 13 12:02:48 2024
X-Patchwork-Submitter: Usama Arif
X-Patchwork-Id: 13761889
From: Usama Arif <usamaarif642@gmail.com>
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
        roman.gushchin@linux.dev, yuzhao@google.com, david@redhat.com,
        baohua@kernel.org, ryan.roberts@arm.com, rppt@kernel.org,
        willy@infradead.org, cerasuolodomenico@gmail.com, corbet@lwn.net,
        linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
        kernel-team@meta.com, Usama Arif <usamaarif642@gmail.com>
Subject: [PATCH v3 5/6] mm: split underutilized THPs
Date: Tue, 13 Aug 2024 13:02:48 +0100
Message-ID: <20240813120328.1275952-6-usamaarif642@gmail.com>
X-Mailer: git-send-email 2.43.5
In-Reply-To: <20240813120328.1275952-1-usamaarif642@gmail.com>
References: <20240813120328.1275952-1-usamaarif642@gmail.com>

This is an attempt to mitigate the issue of running out of memory when
THP is always enabled. At runtime, whenever a THP is faulted in
(__do_huge_pmd_anonymous_page) or collapsed by khugepaged
(collapse_huge_page), the THP is added to _deferred_list. Whenever
memory reclaim happens, the kernel runs the deferred_split shrinker,
which walks the _deferred_list. If a folio on the list is partially
mapped, the shrinker attempts to split it. If the folio is not
partially mapped, the shrinker checks whether the THP is underutilized,
i.e. how many of the base 4K pages of the THP are zero-filled. If this
number is above a certain threshold (decided by
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the
shrinker attempts to split that THP as well. At remap time, the pages
that were zero-filled are then mapped to the shared zeropage, saving
memory.

Suggested-by: Rik van Riel
Co-authored-by: Johannes Weiner
Signed-off-by: Usama Arif
---
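
A worked example of the threshold, with illustrative numbers that are
not part of the patch: on x86_64 with 2MB THPs, HPAGE_PMD_NR is 512 and
max_ptes_none defaults to 511 (HPAGE_PMD_NR - 1). At that default,
thp_underutilized() below returns false immediately, so the shrinker
keeps splitting only partially mapped THPs, as before. Lowering the
threshold enables the new behaviour: with max_ptes_none set to, say,
409, a fully mapped THP is split once more than 409 of its 512 base
pages are found to be zero-filled.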

 Documentation/admin-guide/mm/transhuge.rst |  6 ++
 include/linux/khugepaged.h                 |  1 +
 include/linux/vm_event_item.h              |  1 +
 mm/huge_memory.c                           | 76 ++++++++++++++++++++--
 mm/khugepaged.c                            |  3 +-
 mm/vmstat.c                                |  1 +
 6 files changed, 80 insertions(+), 8 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 058485daf186..60522f49178b 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -447,6 +447,12 @@ thp_deferred_split_page
         splitting it would free up some memory. Pages on split queue are
         going to be split under memory pressure.
 
+thp_underutilized_split_page
+        is incremented when a huge page on the split queue was split
+        because it was underutilized. A THP is underutilized if the
+        number of zero pages in the THP is above a certain threshold
+        (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none).
+
 thp_split_pmd
         is incremented every time a PMD split into table of PTEs.
         This can happen, for instance, when application calls mprotect() or
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index f68865e19b0b..30baae91b225 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,6 +4,7 @@
 
 #include <linux/sched/coredump.h> /* MMF_VM_HUGEPAGE */
 
+extern unsigned int khugepaged_max_ptes_none __read_mostly;
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern struct attribute_group khugepaged_attr_group;
 
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index aae5c7c5cfb4..bf1470a7a737 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -105,6 +105,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
                 THP_SPLIT_PAGE,
                 THP_SPLIT_PAGE_FAILED,
                 THP_DEFERRED_SPLIT_PAGE,
+                THP_UNDERUTILIZED_SPLIT_PAGE,
                 THP_SPLIT_PMD,
                 THP_SCAN_EXCEED_NONE_PTE,
                 THP_SCAN_EXCEED_SWAP_PTE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c024ab0f745c..6b32b2d4ab1e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1087,6 +1087,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
                 update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
                 add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
                 mm_inc_nr_ptes(vma->vm_mm);
+                deferred_split_folio(folio, false);
                 spin_unlock(vmf->ptl);
                 count_vm_event(THP_FAULT_ALLOC);
                 count_mthp_stat(HPAGE_PMD_ORDER, MTHP_STAT_ANON_FAULT_ALLOC);
@@ -3522,6 +3523,39 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
         return READ_ONCE(ds_queue->split_queue_len);
 }
 
+static bool thp_underutilized(struct folio *folio)
+{
+        int num_zero_pages = 0, num_filled_pages = 0;
+        void *kaddr;
+        int i;
+
+        if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
+                return false;
+
+        for (i = 0; i < folio_nr_pages(folio); i++) {
+                kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
+                if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
+                        num_zero_pages++;
+                        if (num_zero_pages > khugepaged_max_ptes_none) {
+                                kunmap_local(kaddr);
+                                return true;
+                        }
+                } else {
+                        /*
+                         * Another path for early exit once the number
+                         * of non-zero filled pages exceeds threshold.
+                         */
+                        num_filled_pages++;
+                        if (num_filled_pages >= HPAGE_PMD_NR - khugepaged_max_ptes_none) {
+                                kunmap_local(kaddr);
+                                return false;
+                        }
+                }
+                kunmap_local(kaddr);
+        }
+        return false;
+}
+
 static unsigned long deferred_split_scan(struct shrinker *shrink,
                                          struct shrink_control *sc)
 {
@@ -3555,17 +3589,45 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
         spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
         list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+                bool did_split = false;
+                bool underutilized = false;
+
+                if (folio_test_partially_mapped(folio))
+                        goto split;
+                underutilized = thp_underutilized(folio);
+                if (underutilized)
+                        goto split;
+                continue;
+split:
                 if (!folio_trylock(folio))
-                        goto next;
-                /* split_huge_page() removes page from list on success */
-                if (!split_folio(folio))
-                        split++;
+                        continue;
+                did_split = !split_folio(folio);
                 folio_unlock(folio);
-next:
-                folio_put(folio);
+                if (did_split) {
+                        /* Splitting removed folio from the list, drop reference here */
+                        folio_put(folio);
+                        if (underutilized)
+                                count_vm_event(THP_UNDERUTILIZED_SPLIT_PAGE);
+                        split++;
+                }
         }
+
         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-        list_splice_tail(&list, &ds_queue->split_queue);
+        /*
+         * Only add back to the queue if folio is partially mapped.
+         * If thp_underutilized returns false, or if split_folio fails in
+         * the case it was underutilized, then consider it used and don't
+         * add it back to split_queue.
+         */
+        list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+                if (folio_test_partially_mapped(folio))
+                        list_move(&folio->_deferred_list, &ds_queue->split_queue);
+                else {
+                        list_del_init(&folio->_deferred_list);
+                        ds_queue->split_queue_len--;
+                }
+                folio_put(folio);
+        }
         spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
         /*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cdd1d8655a76..02e1463e1a79 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -85,7 +85,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
  *
  * Note that these are only respected if collapse was initiated by khugepaged.
  */
-static unsigned int khugepaged_max_ptes_none __read_mostly;
+unsigned int khugepaged_max_ptes_none __read_mostly;
 static unsigned int khugepaged_max_ptes_swap __read_mostly;
 static unsigned int khugepaged_max_ptes_shared __read_mostly;
 
@@ -1235,6 +1235,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
         pgtable_trans_huge_deposit(mm, pmd, pgtable);
         set_pmd_at(mm, address, pmd, _pmd);
         update_mmu_cache_pmd(vma, address, pmd);
+        deferred_split_folio(folio, false);
         spin_unlock(pmd_ptl);
 
         folio = NULL;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c3a402ea91f0..91cd7d4d482b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1384,6 +1384,7 @@ const char * const vmstat_text[] = {
         "thp_split_page",
         "thp_split_page_failed",
         "thp_deferred_split_page",
+        "thp_underutilized_split_page",
         "thp_split_pmd",
         "thp_scan_exceed_none_pte",
         "thp_scan_exceed_swap_pte",
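
As a reading aid only, not part of the patch: below is a minimal,
hypothetical userspace sketch of the kind of THP this shrinker targets.
It maps an anonymous region, asks for THP backing, and dirties only a
handful of base pages so the rest of the huge page stays zero-filled.
The 2MB THP size, the choice of 4 touched pages, and the assumption
that the kernel actually backs the aligned region with a PMD-mapped THP
are all illustrative; max_ptes_none also has to be lowered from its
default (511) before the shrinker considers such a folio underutilized.

/* underutilized_thp_demo.c - hypothetical example, not part of this patch. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE  (2UL << 20)   /* assumed PMD-sized THP: 2MB */
#define BASE_PAGE 4096UL

int main(void)
{
        /* Over-allocate so a 2MB-aligned start can be picked inside the mapping. */
        size_t len = 2 * THP_SIZE;
        char *raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        char *thp = (char *)(((unsigned long)raw + THP_SIZE - 1) & ~(THP_SIZE - 1));

        /* Hint that a THP is wanted here; with THP=always this is redundant. */
        madvise(thp, THP_SIZE, MADV_HUGEPAGE);

        /* Touch only 4 of the 512 base pages; the remaining 508 stay zero-filled. */
        for (int i = 0; i < 4; i++)
                memset(thp + i * BASE_PAGE, 0xaa, BASE_PAGE);

        /*
         * Keep the process alive. If the region was faulted in as a THP and
         * max_ptes_none is below 508, memory pressure should now let the
         * deferred_split shrinker split it and bump
         * thp_underutilized_split_page in /proc/vmstat.
         */
        printf("touched 4/512 base pages at %p, press enter to exit\n", (void *)thp);
        getchar();
        munmap(raw, len);
        return 0;
}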