From: Usama Arif
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
 roman.gushchin@linux.dev, yuzhao@google.com, david@redhat.com,
 baohua@kernel.org, ryan.roberts@arm.com, rppt@kernel.org,
 willy@infradead.org, cerasuolodomenico@gmail.com, corbet@lwn.net,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 kernel-team@meta.com, Usama Arif
Subject: [PATCH 0/6] mm: split underutilized THPs
Date: Tue, 30 Jul 2024 13:45:57 +0100
Message-ID: <20240730125346.1580150-1-usamaarif642@gmail.com>
X-Patchwork-Submitter: Usama Arif
X-Patchwork-Id: 13747378

The current upstream default policy for THP is always. However, Meta
uses madvise in production as the current THP=always policy vastly
overprovisions THPs in sparsely accessed memory areas, resulting in
excessive memory pressure and premature OOM killing.
Using madvise + relying on khugepaged has certain drawbacks over
THP=always.
Using madvise hints means THPs aren't "transparent" and require
userspace changes. Waiting for khugepaged to scan memory and collapse
pages into THP can be slow and unpredictable in terms of performance
(i.e. you don't know when the collapse will happen), while production
environments require predictable performance. If there is enough memory
available, it's better for both performance and predictability to have
a THP from fault time, i.e. THP=always, rather than wait for khugepaged
to collapse it, and to deal with sparsely populated THPs when the
system is running out of memory.

This patch-series is an attempt to mitigate the issue of running out of
memory when THP is always enabled. During runtime, whenever a THP is
faulted in or collapsed by khugepaged, the THP is added to a list.
Whenever memory reclaim happens, the kernel runs the deferred_split
shrinker, which goes through the list and checks if the THP was
underutilized, i.e. how many of the base 4K pages of the entire THP
were zero-filled. If this number goes above a certain threshold, the
shrinker will attempt to split that THP. Then, at remap time, the pages
that were zero-filled are not remapped, hence saving memory. This
method avoids the downside of wasting memory in areas where THP is
sparsely filled when THP is always enabled, while still providing the
upside of THPs, like reduced TLB misses, without having to use madvise.

Meta production workloads that were CPU bound (>99% CPU utilization)
were tested with THP shrinker.
The results after 2 hours are as follows:

                            | THP=madvise |    THP=always   | THP=always
                            |             |                 | + shrinker series
                            |             |                 | + max_ptes_none=409
-----------------------------------------------------------------------------
Performance improvement     |      -      |    +1.8%        |     +1.7%
(over THP=madvise)          |             |                 |
-----------------------------------------------------------------------------
Memory usage                |    54.6G    | 58.8G (+7.7%)   | 55.9G (+2.4%)
-----------------------------------------------------------------------------

max_ptes_none=409 means that any THP that has more than 409 out of 512
(80%) zero-filled pages will be split.

To test out the patches, run the commands below. Without the shrinker
they invoke the OOM killer immediately and kill stress; with the
shrinker they do not fail:

echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
echo 20M > /sys/fs/cgroup/test/memory.max
echo 0 > /sys/fs/cgroup/test/memory.swap.max
# allocate twice memory.max for each stress worker and touch 40/512 of
# each THP, i.e. vm-stride 50K.
# With the shrinker, max_ptes_none of 470 and below won't invoke OOM
# killer.
# Without the shrinker, OOM killer is invoked immediately irrespective
# of max_ptes_none value and kills stress.
stress --vm 1 --vm-bytes 40M --vm-stride 50K

Patches 1-2 add back helper functions that were previously removed to
operate on page lists (needed by patch 3).
Patch 3 is an optimization to free zapped tail pages rather than
waiting for page reclaim or migration.
Patch 4 is a prerequisite for THP shrinker to not remap zero-filled
subpages when splitting THP.
Patch 6 adds support for THP shrinker.

(This patch-series restarts the work on having a THP shrinker in kernel
originally done in
https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@fb.com/.
The THP shrinker in this series is significantly different from the
original one, hence it's labelled v1 (although the prerequisite to not
remap clean subpages is the same).)

Alexander Zhu (1):
  mm: add selftests to split_huge_page() to verify unmap/zap of zero pages

Usama Arif (3):
  Revert "memcg: remove mem_cgroup_uncharge_list()"
  Revert "mm: remove free_unref_page_list()"
  mm: split underutilized THPs

Yu Zhao (2):
  mm: free zapped tail pages when splitting isolated thp
  mm: don't remap unused subpages when splitting isolated thp

 Documentation/admin-guide/mm/transhuge.rst |   6 +
 include/linux/huge_mm.h                    |   4 +-
 include/linux/khugepaged.h                 |   1 +
 include/linux/memcontrol.h                 |  12 ++
 include/linux/mm_types.h                   |   2 +
 include/linux/rmap.h                       |   2 +-
 include/linux/vm_event_item.h              |   1 +
 mm/huge_memory.c                           | 152 +++++++++++++++---
 mm/hugetlb.c                               |   1 +
 mm/internal.h                              |   5 +-
 mm/khugepaged.c                            |   3 +-
 mm/memcontrol.c                            |  22 ++-
 mm/migrate.c                               |  76 +++++++--
 mm/migrate_device.c                        |   4 +-
 mm/page_alloc.c                            |  18 +++
 mm/rmap.c                                  |   2 +-
 mm/vmscan.c                                |   3 +-
 mm/vmstat.c                                |   1 +
 .../selftests/mm/split_huge_page_test.c    | 113 +++++++++++++
 tools/testing/selftests/mm/vm_util.c       |  22 +++
 tools/testing/selftests/mm/vm_util.h       |   1 +
 21 files changed, 414 insertions(+), 37 deletions(-)
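
For illustration only, the underutilization check described in this
cover letter (count the zero-filled 4K base pages of a THP and compare
against max_ptes_none) can be sketched in userspace C. This is not the
code from the series: the function names, the 4K/2M sizes and the
plain-buffer model of a THP are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define BASE_PAGE_SIZE 4096u   /* assumption: 4K base pages */
#define THP_NR_PAGES   512u    /* assumption: 2M THP = 512 base pages */

/* Hypothetical helper: true if one base page holds only zero bytes. */
bool page_is_zero_filled(const unsigned char *page)
{
	for (size_t i = 0; i < BASE_PAGE_SIZE; i++)
		if (page[i] != 0)
			return false;
	return true;
}

/* Count how many of the 512 base pages in the region are zero-filled. */
size_t count_zero_filled_pages(const unsigned char *thp)
{
	size_t nr_zero = 0;

	for (size_t i = 0; i < THP_NR_PAGES; i++)
		if (page_is_zero_filled(thp + i * BASE_PAGE_SIZE))
			nr_zero++;
	return nr_zero;
}

/*
 * Hypothetical policy check: the region counts as underutilized when it
 * has more than max_ptes_none zero-filled base pages.
 */
bool thp_is_underutilized(const unsigned char *thp, size_t max_ptes_none)
{
	assert(max_ptes_none <= THP_NR_PAGES);
	return count_zero_filled_pages(thp) > max_ptes_none;
}

/*
 * Demo helper (not part of any kernel API): allocate a zeroed 2M buffer
 * and write one byte into each of the first 'touched' base pages.
 * The caller frees the buffer with free().
 */
unsigned char *make_sparse_thp(size_t touched)
{
	unsigned char *thp = calloc(THP_NR_PAGES, BASE_PAGE_SIZE);

	if (!thp)
		return NULL;
	for (size_t i = 0; i < touched && i < THP_NR_PAGES; i++)
		thp[i * BASE_PAGE_SIZE] = 1;
	return thp;
}
```

With 40 touched pages, as in the stress example, 472 of the 512 base
pages remain zero-filled, so a threshold of 409 would mark the region
for splitting under this model.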