From patchwork Tue Sep 5 14:13:46 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Feng Tang
X-Patchwork-Id: 13374491
From: Feng Tang
To: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
 David Rientjes, Joonsoo Kim, Roman Gushchin,
 Hyeonggon Yoo <42.hyeyoo@gmail.com>, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Cc: Feng Tang
Subject: [RFC Patch 1/3] mm/slub: increase the maximum slab order to 4 for big systems
Date: Tue, 5 Sep 2023 22:13:46 +0800
Message-Id: <20230905141348.32946-2-feng.tang@intel.com>
In-Reply-To: <20230905141348.32946-1-feng.tang@intel.com>
References: <20230905141348.32946-1-feng.tang@intel.com>

There are reports of severe lock contention on slub's per-node
'list_lock' in the 'hackbench' test [1][2] on server systems. Similar
contention is also seen when running the 'mmap1' case of will-it-scale
on big systems. As the trend is for one processor (socket) to have more
and more CPUs (100+, 200+), the contention could get much more severe
and become a scalability issue.

One way to help reduce the contention is to increase the maximum slab
order from 3 to 4 for big systems.

Unconditionally increasing the order could cause trouble for client
devices with very limited memory, which may care more about memory
footprint. Allocating an order-4 page could also be harder under memory
pressure. So the increase is only done for big systems like servers,
which are usually equipped with plenty of memory and hit lock
contention issues more easily.

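As a rough illustration of why a higher maximum order can help, the
sketch below estimates how many objects fit in one slab at a given
order. It is not kernel code: the helper name and the 4 KiB page size
are assumptions made for this example, and per-slab metadata is
ignored. The allocator also only goes up to the maximum order when the
object size and min_objects call for it.

```c
#include <assert.h>

/* Assumption for illustration only: 4 KiB base pages. */
#define EXAMPLE_PAGE_SIZE 4096u

/*
 * Hypothetical helper: number of objects of 'size' bytes that fit in
 * a slab made of 2^order contiguous pages, ignoring any metadata.
 */
static unsigned int example_order_objects(unsigned int order, unsigned int size)
{
	assert(size > 0);
	return (EXAMPLE_PAGE_SIZE << order) / size;
}
```

With 256-byte objects, an order-3 slab holds 128 objects and an order-4
slab holds 256, so the same number of allocations needs roughly half as
many slabs to be taken from or returned to the per-node lists.
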
Following is some performance data:

will-it-scale/mmap1
-------------------

Run the will-it-scale benchmark's 'mmap1' test case on a 2-socket
Sapphire Rapids server (112 cores / 224 threads) with 256 GB DRAM,
using 3 configurations with parallel test threads at 25%, 50% and 100%
of the number of CPUs. The data is (base is the vanilla v6.5 kernel):

                     base                base+patch
wis-mmap1-25%       223670    +33.3%     298205    per_process_ops
wis-mmap1-50%       186020    +51.8%     282383    per_process_ops
wis-mmap1-100%       89200    +65.0%     147139    per_process_ops

Taking the perf-profile comparison of the 50% test case, the lock
contention is greatly reduced:

     43.80           -30.8       13.04        pp.self.native_queued_spin_lock_slowpath
      0.85            -0.2        0.65        pp.self.___slab_alloc
      0.41            -0.1        0.27        pp.self.__unfreeze_partials
      0.20 ±  2%      -0.1        0.12 ±  4%  pp.self.get_any_partial

hackbench
---------

Run the same hackbench test case mentioned in [1], using the same HW/SW
as for will-it-scale:

                     base                base+patch
hackbench           759951    +10.5%     839601    hackbench.throughput

perf-profile diff:

     22.20 ±  3%     -15.2        7.05        pp.self.native_queued_spin_lock_slowpath
      0.82            -0.2        0.59        pp.self.___slab_alloc
      0.33            -0.2        0.13        pp.self.__unfreeze_partials

[1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
[2]. https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/

Signed-off-by: Feng Tang
---
 mm/slub.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index f7940048138c..09ae1ed642b7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4081,7 +4081,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk);
  */
 static unsigned int slub_min_order;
 static unsigned int slub_max_order =
-	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : PAGE_ALLOC_COSTLY_ORDER;
+	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 4;
 static unsigned int slub_min_objects;
 
 /*
@@ -4134,6 +4134,26 @@ static inline unsigned int calc_slab_order(unsigned int size,
 	return order;
 }
 
+static inline int num_cpus(void)
+{
+	int nr_cpus;
+
+	/*
+	 * Some architectures will only update present cpus when
+	 * onlining them, so don't trust the number if it's just 1. But
+	 * we also don't want to use nr_cpu_ids always, as on some other
+	 * architectures, there can be many possible cpus, but never
+	 * onlined. Here we compromise between trying to avoid too high
+	 * order on systems that appear larger than they are, and too
+	 * low order on systems that appear smaller than they are.
+	 */
+	nr_cpus = num_present_cpus();
+	if (nr_cpus <= 1)
+		nr_cpus = nr_cpu_ids;
+
+	return nr_cpus;
+}
+
 static inline int calculate_order(unsigned int size)
 {
 	unsigned int order;
@@ -4151,19 +4171,17 @@ static inline int calculate_order(unsigned int size)
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects) {
-		/*
-		 * Some architectures will only update present cpus when
-		 * onlining them, so don't trust the number if it's just 1. But
-		 * we also don't want to use nr_cpu_ids always, as on some other
-		 * architectures, there can be many possible cpus, but never
-		 * onlined. Here we compromise between trying to avoid too high
-		 * order on systems that appear larger than they are, and too
-		 * low order on systems that appear smaller than they are.
-		 */
-		nr_cpus = num_present_cpus();
-		if (nr_cpus <= 1)
-			nr_cpus = nr_cpu_ids;
+		nr_cpus = num_cpus();
 		min_objects = 4 * (fls(nr_cpus) + 1);
+
+		/*
+		 * If nr_cpus >= 32, the platform is likely to be a server
+		 * which usually has much more memory, and is easier to be
+		 * hurt by scalability issue, so enlarge it to reduce the
+		 * possible contention of the per-node 'list_lock'.
+		 */
+		if (nr_cpus >= 32)
+			min_objects *= 2;
 	}
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);
@@ -4361,6 +4379,13 @@ static void set_cpu_partial(struct kmem_cache *s)
 	else
 		nr_objects = 120;
 
+	/*
+	 * Give larger system more buffer to reduce scalability issue, like
+	 * the handling in calculate_order().
+	 */
+	if (num_cpus() >= 32)
+		nr_objects *= 2;
+
 	slub_set_cpu_partial(s, nr_objects);
 #endif
 }
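
To make the arithmetic in calculate_order() and set_cpu_partial()
easier to follow, here is a small userspace sketch of the two
calculations as changed by this patch. It is only an illustration: the
example_* names are hypothetical, example_fls() is a plain C stand-in
for the kernel's fls(), and the later clamping of min_objects against
max_objects is left out.

```c
#include <assert.h>

/*
 * Stand-in for the kernel's fls(): 1-based position of the highest
 * set bit, or 0 when x is 0.
 */
static int example_fls(unsigned int x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/*
 * min_objects as computed in the patched calculate_order() when
 * slub_min_objects is not set: 4 * (fls(nr_cpus) + 1), doubled for
 * systems with 32 or more CPUs.
 */
static unsigned int example_min_objects(unsigned int nr_cpus)
{
	unsigned int min_objects;

	assert(nr_cpus > 0);
	min_objects = 4 * (example_fls(nr_cpus) + 1);
	if (nr_cpus >= 32)
		min_objects *= 2;
	return min_objects;
}

/*
 * The per-cpu partial budget from the patched set_cpu_partial(): the
 * size-based nr_objects value is doubled on systems with 32+ CPUs.
 */
static unsigned int example_cpu_partial(unsigned int nr_objects,
					unsigned int nr_cpus)
{
	return nr_cpus >= 32 ? nr_objects * 2 : nr_objects;
}
```

For the 224-thread test machine above, fls(224) is 8, so min_objects is
4 * (8 + 1) = 36, doubled to 72. A 16-CPU system stays at
4 * (5 + 1) = 24. An nr_objects value of 120 becomes 240 on the
224-thread machine.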