From patchwork Tue Sep 5 14:13:45 2023
X-Patchwork-Submitter: Feng Tang <feng.tang@intel.com>
X-Patchwork-Id: 13374494
From: Feng Tang <feng.tang@intel.com>
To: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
    David Rientjes, Joonsoo Kim, Roman Gushchin,
    Hyeonggon Yoo <42.hyeyoo@gmail.com>,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Feng Tang
Subject: [RFC Patch 0/3] mm/slub: reduce contention for per-node list_lock
 for large systems
Date: Tue, 5 Sep 2023 22:13:45 +0800
Message-Id: <20230905141348.32946-1-feng.tang@intel.com>
Hi All,

Please help to review the ideas and patches, thanks!

Problem
-------

The 0Day bot found a performance regression of 'hackbench', related to
slub's per-node 'list_lock' contention [1]. The same lock contention is
also seen when running the will-it-scale/mmap1 benchmark on a rather big
system of 2 sockets with 224 CPUs, where the lock contention can take up
to 76% of CPU cycles. As the trend is for one processor (socket) to have
more and more CPU cores, the contention will only get more severe, and
we need to tackle it sooner or later.

Possible mitigations
--------------------

There are 3 directions we can try. They have no dependency on each
other and can be taken separately or combined:

1) increase the order of each slab (including raising the max slub
   order from 3 to 4)
2) increase the number of per-cpu partial slabs
3) increase MIN_PARTIAL and MAX_PARTIAL to let each node keep more (64)
   partial slabs at maximum

Regarding reducing the lock contention and improving performance, #1 is
the most efficient way, with #2 second.

Please note that the 3 patches just show each idea separately, to
gather review and comments first, and are NOT targeted for merge.
Patch 2 does not even apply on top of patch 1.

A similar regression related to 'list_lock' contention was found when
testing 'hackbench' with the new 'eevdf' scheduler patchset, and a
rough combination of these patches cures the performance drop [2].
Performance data
----------------

We have shown some rough performance data in a previous discussion:
https://lore.kernel.org/all/ZO2smdi83wWwZBsm@feng-clx/

Following is performance data for the 'mmap1' case of 'will-it-scale'
and the 'hackbench' run mentioned in [1]. For the 'mmap1' case, we run
3 configurations with parallel test threads at 25%, 50% and 100% of
the number of CPUs. The test HW is a 2-socket Sapphire Rapids server
(112 cores / 224 threads) + 256 GB DRAM; the base kernel is vanilla
v6.5.

1) order increasing patch

* will-it-scale/mmap1:

                    base               base+patch
  wis-mmap1-25%    223670    +33.3%    298205    per_process_ops
  wis-mmap1-50%    186020    +51.8%    282383    per_process_ops
  wis-mmap1-100%    89200    +65.0%    147139    per_process_ops

Taking the perf-profile comparison of the 50% test case, the lock
contention is greatly reduced:

     43.80           -30.8      13.04         pp.self.native_queued_spin_lock_slowpath
      0.85           -0.2        0.65         pp.self.___slab_alloc
      0.41           -0.1        0.27         pp.self.__unfreeze_partials
      0.20 ± 2%      -0.1        0.12 ± 4%    pp.self.get_any_partial

* hackbench:

                    base               base+patch
  hackbench        759951    +10.5%    839601    hackbench.throughput

perf-profile diff:

     22.20 ± 3%     -15.2       7.05         pp.self.native_queued_spin_lock_slowpath
      0.82          -0.2        0.59         pp.self.___slab_alloc
      0.33          -0.2        0.13         pp.self.__unfreeze_partials

2) increasing per-cpu partial patch

The patch itself only doubles the per-cpu partial number; for better
analysis, the 4X case is also profiled.

* will-it-scale/mmap1:

                    base           base + 2X patch        base + 4X patch
  wis-mmap1-25     223670    +12.7%    251999    +34.9%    301749    per_process_ops
  wis-mmap1-50     186020    +28.0%    238067    +55.6%    289521    per_process_ops
  wis-mmap1-100     89200    +40.7%    125478    +62.4%    144858    per_process_ops

Taking the perf-profile comparison of the 50% test case, the lock
contention is greatly reduced:

     43.80    -11.5    32.27    -27.9    15.91    pp.self.native_queued_spin_lock_slowpath

* hackbench (no obvious improvement):

                    base           base + 2X patch        base + 4X patch
  hackbench        759951    +0.2%     761506    +0.5%     763972    hackbench.throughput

3) increasing per-node partial patch

The patch
effectively changes MIN_PARTIAL/MAX_PARTIAL from 5/10 to 64/128.

* will-it-scale/mmap1:

                    base               base+patch
  wis-mmap1-25%    223670     +0.2%    224035    per_process_ops
  wis-mmap1-50%    186020    +13.0%    210248    per_process_ops
  wis-mmap1-100%    89200    +11.3%     99308    per_process_ops

4) combination patches

                    base          base+patch-3         base+patch-3,1       base+patch-3,1,2
  wis-mmap1-25%    223670    -0.0%    223641    +24.2%    277734    +37.7%    307991    per_process_ops
  wis-mmap1-50%    186172   +12.9%    210108    +42.4%    265028    +59.8%    297495    per_process_ops
  wis-mmap1-100%    89289   +11.3%     99363    +47.4%    131571    +78.1%    158991    per_process_ops

Make the patches only affect large systems
------------------------------------------

In the real world there are different kinds of platforms with different
usage cases: large systems with huge numbers of CPUs usually come with
huge memory, while there are also small devices with limited memory
which may care more about memory footprint. So the idea is to treat
them separately: keep the current order/partial settings for systems
with a small number of CPUs, and scale those settings according to the
CPU number (there is similar handling in slub code already). A more
aggressive idea is to bump them all together.

[1] https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
[2] https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/

Thanks,
Feng

Feng Tang (3):
  mm/slub: increase the maximum slab order to 4 for big systems
  mm/slub: set up maximum per-node partials according to cpu numbers
  mm/slub: double the per-cpu partial number for large systems

 mm/slub.c | 7 +++++++
 1 file changed, 7 insertions(+)