From patchwork Mon Jul 29 02:35:32 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yafang Shao
X-Patchwork-Id: 13744206
From: Yafang Shao
To: akpm@linux-foundation.org
Cc: ying.huang@intel.com, mgorman@techsingularity.net, linux-mm@kvack.org,
 Yafang Shao, Matthew Wilcox, David Rientjes
Subject: [PATCH v2 3/3] mm/page_alloc: Introduce a new sysctl knob vm.pcp_batch_scale_max
Date: Mon, 29 Jul 2024 10:35:32 +0800
Message-Id: <20240729023532.1555-4-laoar.shao@gmail.com>
In-Reply-To: <20240729023532.1555-1-laoar.shao@gmail.com>
References: <20240729023532.1555-1-laoar.shao@gmail.com>

During my recent work to
resolve latency spikes caused by zone->lock contention[0], I found that
CONFIG_PCP_BATCH_SCALE_MAX is difficult to use in practice.

To demonstrate this, I wrote a Python script:

  import mmap

  size = 6 * 1024**3

  while True:
      mm = mmap.mmap(-1, size)
      mm[:] = b'\xff' * size
      mm.close()

Run this script 10 times in parallel and measure the allocation latency by
measuring the duration of rmqueue_bulk() with the BCC tools
funclatency[1]:

  funclatency -T -i 600 rmqueue_bulk

Here are the results for both AMD and Intel CPUs.

AMD EPYC 7W83 64-Core Processor, single NUMA node, KVM virtual server
=====================================================================

- Default value of 5

         nsecs           : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 12       |                                        |
      1024 -> 2047       : 9116     |                                        |
      2048 -> 4095       : 2004     |                                        |
      4096 -> 8191       : 2497     |                                        |
      8192 -> 16383      : 2127     |                                        |
     16384 -> 32767      : 2483     |                                        |
     32768 -> 65535      : 10102    |                                        |
     65536 -> 131071     : 212730   |*******************                     |
    131072 -> 262143     : 314692   |*****************************           |
    262144 -> 524287     : 430058   |****************************************|
    524288 -> 1048575    : 224032   |********************                    |
   1048576 -> 2097151    : 73567    |******                                  |
   2097152 -> 4194303    : 17079    |*                                       |
   4194304 -> 8388607    : 3900     |                                        |
   8388608 -> 16777215   : 750      |                                        |
  16777216 -> 33554431   : 88       |                                        |
  33554432 -> 67108863   : 2        |                                        |

avg = 449775 nsecs, total: 587066511229 nsecs, count: 1305242

The avg alloc latency can be 449us, and the max latency can be higher than
30ms.

- Value set to 0

         nsecs           : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 92       |                                        |
      1024 -> 2047       : 8594     |                                        |
      2048 -> 4095       : 2042818  |******                                  |
      4096 -> 8191       : 8737624  |**************************              |
      8192 -> 16383      : 13147872 |****************************************|
     16384 -> 32767      : 8799951  |**************************              |
     32768 -> 65535      : 2879715  |********                                |
     65536 -> 131071     : 659600   |**                                      |
    131072 -> 262143     : 204004   |                                        |
    262144 -> 524287     : 78246    |                                        |
    524288 -> 1048575    : 30800    |                                        |
   1048576 -> 2097151    : 12251    |                                        |
   2097152 -> 4194303    : 2950     |                                        |
   4194304 -> 8388607    : 78       |                                        |

avg = 19359 nsecs, total: 708638369918 nsecs, count: 36604636

The avg was reduced significantly to 19us, and the max latency is reduced
to less than 8ms.

- Conclusion

On this AMD CPU, reducing vm.pcp_batch_scale_max significantly helps reduce
latency. Latency-sensitive applications will benefit from this tuning.

However, I don't have access to other types of AMD CPUs, so I was unable to
test it on different AMD models.

Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, two NUMA nodes
============================================================

- Default value of 5

         nsecs           : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 2419     |                                        |
      1024 -> 2047       : 34499    |*                                       |
      2048 -> 4095       : 4272     |                                        |
      4096 -> 8191       : 9035     |                                        |
      8192 -> 16383      : 4374     |                                        |
     16384 -> 32767      : 2963     |                                        |
     32768 -> 65535      : 6407     |                                        |
     65536 -> 131071     : 884806   |****************************************|
    131072 -> 262143     : 145931   |******                                  |
    262144 -> 524287     : 13406    |                                        |
    524288 -> 1048575    : 1874     |                                        |
   1048576 -> 2097151    : 249      |                                        |
   2097152 -> 4194303    : 28       |                                        |

avg = 96173 nsecs, total: 106778157925 nsecs, count: 1110263

- Conclusion

This Intel CPU works fine with the default setting.

Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, single NUMA node
==============================================================

Using the cpuset cgroup, we can restrict the test script to run on NUMA
node 0 only.

- Default value of 5

         nsecs           : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 46       |                                        |
       512 -> 1023       : 695      |                                        |
      1024 -> 2047       : 19950    |*                                       |
      2048 -> 4095       : 1788     |                                        |
      4096 -> 8191       : 3392     |                                        |
      8192 -> 16383      : 2569     |                                        |
     16384 -> 32767      : 2619     |                                        |
     32768 -> 65535      : 3809     |                                        |
     65536 -> 131071     : 616182   |****************************************|
    131072 -> 262143     : 295587   |*******************                     |
    262144 -> 524287     : 75357    |****                                    |
    524288 -> 1048575    : 15471    |*                                       |
   1048576 -> 2097151    : 2939     |                                        |
   2097152 -> 4194303    : 243      |                                        |
   4194304 -> 8388607    : 3        |                                        |

avg = 144410 nsecs, total: 150281196195 nsecs, count: 1040651

The zone->lock contention becomes severe when there is only a single NUMA
node. The average latency is approximately 144us, with the maximum
latency exceeding 4ms.

- Value set to 0

         nsecs           : count    distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 24       |                                        |
       512 -> 1023       : 2686     |                                        |
      1024 -> 2047       : 10246    |                                        |
      2048 -> 4095       : 4061529  |*********                               |
      4096 -> 8191       : 16894971 |****************************************|
      8192 -> 16383      : 6279310  |**************                          |
     16384 -> 32767      : 1658240  |***                                     |
     32768 -> 65535      : 445760   |*                                       |
     65536 -> 131071     : 110817   |                                        |
    131072 -> 262143     : 20279    |                                        |
    262144 -> 524287     : 4176     |                                        |
    524288 -> 1048575    : 436      |                                        |
   1048576 -> 2097151    : 8        |                                        |
   2097152 -> 4194303    : 2        |                                        |

avg = 8401 nsecs, total: 247739809022 nsecs, count: 29488508

After setting it to 0, the avg latency is reduced to around 8us, and the
max latency is less than 4ms.

- Conclusion

On this Intel CPU, this tuning doesn't help much. Latency-sensitive
applications work well with the default setting.
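
As a sanity check on the summary lines quoted above, the reported averages
can be recomputed from each run's total and count. This is only a minimal
sketch: the dictionary keys and the helper name avg_nsecs are illustrative
labels rather than funclatency output, and the "avg" field is assumed to be
the integer (truncated) quotient.

```python
# (total nsecs, call count) copied from the funclatency summary lines above.
RUNS = {
    "amd_scale5": (587066511229, 1305242),
    "amd_scale0": (708638369918, 36604636),
    "intel_2node_scale5": (106778157925, 1110263),
    "intel_1node_scale5": (150281196195, 1040651),
    "intel_1node_scale0": (247739809022, 29488508),
}


def avg_nsecs(total: int, count: int) -> int:
    """Integer average; assumed to match how the 'avg' field is truncated."""
    return total // count


# Recomputed averages, keyed by the illustrative run labels above.
AVERAGES = {name: avg_nsecs(total, count) for name, (total, count) in RUNS.items()}

if __name__ == "__main__":
    for name, avg in AVERAGES.items():
        print(f"{name}: avg = {avg} nsecs (~{avg / 1000:.0f} us)")
```

Note that the call count differs widely between runs (about 1.3M calls at
the default versus about 36.6M at 0 on the AMD host), so the averages
describe per-call latency rather than total time spent in rmqueue_bulk().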

It is worth noting that all the above data were collected using the
upstream kernel.

Why introduce a sysctl knob?
============================

From the above data, it's clear that different CPU types have varying
allocation latencies concerning zone->lock contention. Typically, people
don't release individual kernel packages for each type of x86_64 CPU.

Furthermore, for latency-insensitive applications, we can keep the default
setting for better throughput. In our production environment, we set this
value to 0 for applications running on Kubernetes servers while keeping it
at the default value of 5 for other applications like big data. It's not
common to release individual kernel packages for each application.

Future work
===========

To ultimately mitigate the zone->lock contention issue, several suggestions
have been proposed. One approach involves dividing large zones into multiple
smaller zones, as suggested by Matthew[2], while another entails splitting
the zone->lock using a mechanism similar to memory arenas and shifting away
from relying solely on zone_id to identify the range of free lists a
particular page belongs to, as suggested by Mel[3]. However, implementing
these solutions is likely to necessitate a more extended development
effort.
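
To make the effect of the knob concrete: per the documentation added by
this patch, a value N caps each PCP refill or drain at (batch << N) pages.
The sketch below tabulates that cap and shows one way the knob might be set
at runtime. The batch size of 63 is only an assumed example (the kernel
computes the real value per zone), the helper names are hypothetical, and
/proc/sys/vm/pcp_batch_scale_max exists only on a kernel carrying this
patch.

```python
from pathlib import Path

# Present only on a kernel with this patch applied.
KNOB = Path("/proc/sys/vm/pcp_batch_scale_max")
# Range enforced by the sysctl table entry (extra1/extra2).
SCALE_MIN, SCALE_MAX = 0, 6


def max_pages_per_batch(batch: int, scale: int) -> int:
    """Upper bound on pages moved per refill/drain: batch << scale."""
    if not SCALE_MIN <= scale <= SCALE_MAX:
        raise ValueError(f"scale must be in [{SCALE_MIN}, {SCALE_MAX}]")
    return batch << scale


def set_scale(scale: int, knob: Path = KNOB) -> None:
    """Write the knob (needs root); hypothetical helper for illustration."""
    max_pages_per_batch(1, scale)  # reuse the range check before writing
    knob.write_text(f"{scale}\n")


if __name__ == "__main__":
    batch = 63  # assumed example value, not taken from the patch
    for n in range(SCALE_MIN, SCALE_MAX + 1):
        print(f"N={n}: at most {max_pages_per_batch(batch, n)} pages")
```

With the assumed batch of 63, the default of 5 allows up to 2016 pages per
refill or drain, while 0 limits it to 63.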

Link: https://lwn.net/Articles/981069/ [0]
Link: https://github.com/iovisor/bcc/blob/master/tools/funclatency.py [1]
Link: https://lore.kernel.org/linux-mm/ZnTrZ9mcAIRodnjx@casper.infradead.org/ [2]
Link: https://lore.kernel.org/linux-mm/20240705130943.htsyhhhzbcptnkcu@techsingularity.net/ [3]
Signed-off-by: Yafang Shao
Cc: "Huang, Ying"
Cc: Mel Gorman
Cc: Matthew Wilcox
Cc: David Rientjes
---
 Documentation/admin-guide/sysctl/vm.rst | 17 +++++++++++++++++
 mm/Kconfig                              | 11 -----------
 mm/page_alloc.c                         | 23 +++++++++++++++++------
 3 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e86c968a7a0e..aa29f2fdad7c 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -65,6 +65,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - page_lock_unfairness
 - panic_on_oom
+- pcp_batch_scale_max
 - percpu_pagelist_high_fraction
 - stat_interval
 - stat_refresh
@@ -845,6 +846,22 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
 why oom happens. You can get snapshot.
 
 
+pcp_batch_scale_max
+===================
+
+In page allocator, PCP (Per-CPU pageset) is refilled and drained in
+batches.  The batch number is scaled automatically to improve page
+allocation/free throughput.  But too large scale factor may hurt
+latency.  This option sets the upper limit of scale factor to limit
+the maximum latency.
+
+The range for this parameter spans from 0 to 6, with a default value of 5.
+The value assigned to 'N' signifies that during each refilling or draining
+process, a maximum of (batch << N) pages will be involved, where "batch"
+represents the default batch size automatically computed by the kernel for
+each zone.
+
+
 percpu_pagelist_high_fraction
 =============================
 
diff --git a/mm/Kconfig b/mm/Kconfig
index b4cb45255a54..41fe4c13b7ac 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -663,17 +663,6 @@ config HUGETLB_PAGE_SIZE_VARIABLE
 config CONTIG_ALLOC
 	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
-config PCP_BATCH_SCALE_MAX
-	int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free"
-	default 5
-	range 0 6
-	help
-	  In page allocator, PCP (Per-CPU pageset) is refilled and drained in
-	  batches.  The batch number is scaled automatically to improve page
-	  allocation/free throughput.  But too large scale factor may hurt
-	  latency.  This option sets the upper limit of scale factor to limit
-	  the maximum latency.
-
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bfd44b65777c..8d6f9dc99387 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -273,6 +273,8 @@ int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
 static int watermark_boost_factor __read_mostly = 15000;
 static int watermark_scale_factor = 10;
+static int pcp_batch_scale_max = 5;
+static int sysctl_6 = 6;
 
 /* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
 int movable_zone;
@@ -2334,7 +2336,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
 	int count = READ_ONCE(pcp->count);
 
 	while (count) {
-		int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX);
+		int to_drain = min(count, pcp->batch << pcp_batch_scale_max);
 		count -= to_drain;
 
 		spin_lock(&pcp->lock);
@@ -2462,7 +2464,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
 
 	/* Free as much as possible if batch freeing high-order pages. */
 	if (unlikely(free_high))
-		return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX);
+		return min(pcp->count, batch << pcp_batch_scale_max);
 
 	/* Check for PCP disabled or boot pageset */
 	if (unlikely(high < batch))
@@ -2494,7 +2496,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
 		return 0;
 
 	if (unlikely(free_high)) {
-		pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
+		pcp->high = max(high - (batch << pcp_batch_scale_max),
 				high_min);
 		return 0;
 	}
@@ -2564,9 +2566,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
 		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
 	}
-	if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX))
+	if (pcp->free_count < (batch << pcp_batch_scale_max))
 		pcp->free_count = min(pcp->free_count + (1 << order),
-				      batch << CONFIG_PCP_BATCH_SCALE_MAX);
+				      batch << pcp_batch_scale_max);
 	high = nr_pcp_high(pcp, zone, batch, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2908,7 +2910,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
 		 * subsequent allocation of order-0 pages without any freeing.
 		 */
 		if (batch <= max_nr_alloc &&
-		    pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
+		    pcp->alloc_factor < pcp_batch_scale_max)
 			pcp->alloc_factor++;
 		batch = min(batch, max_nr_alloc);
 	}
@@ -6275,6 +6277,15 @@ static struct ctl_table page_alloc_sysctl_table[] = {
 		.proc_handler	= percpu_pagelist_high_fraction_sysctl_handler,
 		.extra1		= SYSCTL_ZERO,
 	},
+	{
+		.procname	= "pcp_batch_scale_max",
+		.data		= &pcp_batch_scale_max,
+		.maxlen		= sizeof(pcp_batch_scale_max),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &sysctl_6,
+	},
 	{
 		.procname	= "lowmem_reserve_ratio",
 		.data		= &sysctl_lowmem_reserve_ratio,