From patchwork Tue Aug 13 12:02:43 2024
X-Patchwork-Submitter: Usama Arif
X-Patchwork-Id: 13761884
From: Usama Arif
To: akpm@linux-foundation.org, linux-mm@kvack.org
Cc: hannes@cmpxchg.org, riel@surriel.com, shakeel.butt@linux.dev,
 roman.gushchin@linux.dev, yuzhao@google.com, david@redhat.com,
 baohua@kernel.org, ryan.roberts@arm.com, rppt@kernel.org,
 willy@infradead.org, cerasuolodomenico@gmail.com, corbet@lwn.net,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
 kernel-team@meta.com, Usama Arif
Subject: [PATCH v3 0/6] mm: split underutilized THPs
Date: Tue, 13 Aug 2024 13:02:43 +0100
Message-ID: <20240813120328.1275952-1-usamaarif642@gmail.com>

The current upstream default policy for THP is always. However, Meta
uses madvise in production as the current THP=always policy vastly
overprovisions THPs in sparsely accessed memory areas, resulting in
excessive memory pressure and premature OOM killing.
Using madvise + relying on khugepaged has certain drawbacks over
THP=always. Using madvise hints means THPs aren't "transparent" and
require userspace changes.
Waiting for khugepaged to scan memory and collapse pages into THPs can
be slow and unpredictable in terms of performance (i.e. you don't know
when the collapse will happen), while production environments require
predictable performance. If there is enough memory available, it's
better for both performance and predictability to have a THP from fault
time, i.e. THP=always, rather than wait for khugepaged to collapse it,
and to deal with sparsely populated THPs when the system is running out
of memory.

This patch series is an attempt to mitigate the issue of running out of
memory when THP is always enabled. During runtime, whenever a THP is
faulted in or collapsed by khugepaged, the THP is added to a list.
Whenever memory reclaim happens, the kernel runs the deferred_split
shrinker, which goes through the list and checks whether each THP is
underutilized, i.e. how many of the base 4K pages of the entire THP are
zero-filled. If this number goes above a certain threshold, the shrinker
will attempt to split that THP. Then, at remap time, the pages that were
zero-filled are mapped to the shared zeropage, hence saving memory. This
method avoids the downside of wasting memory in areas where THP is
sparsely filled when THP is always enabled, while still providing the
upsides of THPs, such as reduced TLB misses, without having to use
madvise.

Meta production workloads that were CPU bound (>99% CPU utilization)
were tested with the THP shrinker.
The results after 2 hours are as follows:

                            | THP=madvise |  THP=always   | THP=always
                            |             |               | + shrinker series
                            |             |               | + max_ptes_none=409
-----------------------------------------------------------------------------
Performance improvement     |      -      |    +1.8%      |     +1.7%
(over THP=madvise)          |             |               |
-----------------------------------------------------------------------------
Memory usage                |    54.6G    | 58.8G (+7.7%) |   55.9G (+2.4%)
-----------------------------------------------------------------------------

max_ptes_none=409 means that any THP that has more than 409 out of 512
(80%) zero-filled pages will be split.

To test the patches, run the commands below. Without the shrinker they
invoke the OOM killer immediately and kill stress; with the shrinker
they do not fail:

echo 450 > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
mkdir /sys/fs/cgroup/test
echo $$ > /sys/fs/cgroup/test/cgroup.procs
echo 20M > /sys/fs/cgroup/test/memory.max
echo 0 > /sys/fs/cgroup/test/memory.swap.max
# allocate twice memory.max for each stress worker and touch 40/512 of
# each THP, i.e. vm-stride 50K.
# With the shrinker, max_ptes_none of 470 and below won't invoke OOM
# killer.
# Without the shrinker, OOM killer is invoked immediately irrespective
# of max_ptes_none value and kills stress.
stress --vm 1 --vm-bytes 40M --vm-stride 50K

v2 -> v3:
- Use my_zero_pfn instead of page_to_pfn(ZERO_PAGE(..)) (Johannes)
- Use flags argument instead of bools in remove_migration_ptes (Johannes)
- Use a new flag in folio->_flags_1 instead of folio->_partially_mapped
  (David Hildenbrand).
- Split out the last patch of v2 into 3, one for introducing the flag,
  one for splitting underutilized THPs on _deferred_list and one for
  adding sysfs entry to disable splitting (David Hildenbrand).

v1 -> v2:
- Turn page checks and operations to folio versions in __split_huge_page.
  This means patches 1 and 2 from v1 are no longer needed.
  (David Hildenbrand)
- Map to shared zeropage in all cases if the base page is zero-filled.
  The uffd selftest was removed.
  (David Hildenbrand).
- Rename 'dirty' to 'contains_data' in try_to_map_unused_to_zeropage
  (Rik van Riel).
- Use unsigned long instead of uint64_t (kernel test robot).

Alexander Zhu (1):
  mm: selftest to verify zero-filled pages are mapped to zeropage

Usama Arif (3):
  mm: Introduce a pageflag for partially mapped folios
  mm: split underutilized THPs
  mm: add sysfs entry to disable splitting underutilized THPs

Yu Zhao (2):
  mm: free zapped tail pages when splitting isolated thp
  mm: remap unused subpages to shared zeropage when splitting isolated
    thp

 Documentation/admin-guide/mm/transhuge.rst |   6 +
 include/linux/huge_mm.h                    |   4 +-
 include/linux/khugepaged.h                 |   1 +
 include/linux/page-flags.h                 |   3 +
 include/linux/rmap.h                       |   7 +-
 include/linux/vm_event_item.h              |   1 +
 mm/huge_memory.c                           | 156 ++++++++++++++++--
 mm/hugetlb.c                               |   1 +
 mm/internal.h                              |   4 +-
 mm/khugepaged.c                            |   3 +-
 mm/memcontrol.c                            |   3 +-
 mm/migrate.c                               |  74 +++++++--
 mm/migrate_device.c                        |   4 +-
 mm/page_alloc.c                            |   5 +-
 mm/rmap.c                                  |   3 +-
 mm/vmscan.c                                |   3 +-
 mm/vmstat.c                                |   1 +
 .../selftests/mm/split_huge_page_test.c    |  71 ++++++++
 tools/testing/selftests/mm/vm_util.c       |  22 +++
 tools/testing/selftests/mm/vm_util.h       |   1 +
 20 files changed, 333 insertions(+), 40 deletions(-)