From patchwork Tue Oct 17 15:44:34 2023
From: chengming.zhou@linux.dev
To: cl@linux.com, penberg@kernel.org
Cc: rientjes@google.com, iamjoonsoo.kim@lge.com, akpm@linux-foundation.org,
    vbabka@suse.cz, roman.gushchin@linux.dev, 42.hyeyoo@gmail.com,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org, chengming.zhou@linux.dev
Subject: [RFC PATCH 0/5] slub: Delay freezing of CPU partial slabs
Date: Tue, 17 Oct 2023 15:44:34 +0000
Message-Id: <20231017154439.3036608-1-chengming.zhou@linux.dev>
From: Chengming Zhou

1. Problem
==========

Currently we have to freeze a slab when taking it from the node partial
list, and unfreeze it when putting it back, because we rely on the node
list_lock to synchronize changes to the "frozen" bit. This implementation
has some drawbacks:

 - Alloc path: twice cmpxchg_double.
   When the allocator has used up its CPU partial slabs, it has to get
   some partial slabs from the node: it freezes each slab (one
   cmpxchg_double) with the node list_lock held, then puts those frozen
   slabs on its CPU partial list. Later, ___slab_alloc() will run a
   cmpxchg_double try-loop again if that slab is picked for use.
 - Alloc path: amplified contention on the node list_lock.
   Since changes to the "frozen" bit must be synchronized under the node
   list_lock, contention on a slab (struct page) can be transferred to the
   node list_lock. On a machine with many CPUs in one node, the list_lock
   contention is amplified by every CPU's alloc path. The current code
   works around this by avoiding the cmpxchg_double try-loop: it just
   breaks out and returns when contention on the page is encountered and
   the first cmpxchg_double fails. But this workaround has problems of
   its own.

 - Free path: redundant unfreeze.
   __slab_free() will freeze and cache some slabs on its partial list, and
   flush them to the node partial list when the limit is exceeded, which
   requires unfreezing those slabs again under the node list_lock.
   Actually we don't need to freeze slabs on the CPU partial list at all,
   in which case we can save the unfreeze cmpxchg_double operations in the
   flush path.

2. Solution
===========

We solve these problems by leaving slabs unfrozen when they move off the
node partial list and sit on the CPU partial list, so the "frozen" bit
stays 0. These partial slabs won't be manipulated concurrently by the
alloc path; the only racer is the free path, which may manipulate a
slab's list when !inuse. So we need another way to synchronize against
that: we use a bit in slab->flags to indicate whether the slab is on the
node partial list, and only in that case may the slab's list be
manipulated.

Freezing of a slab is delayed until it is picked for active use by the
CPU, at which point it becomes full at the same time; in that case we
still rely on the "frozen" bit to avoid manipulating its list. So a slab
is frozen only on activation and unfrozen only on deactivation.

3. Patches
==========

Patch-1 introduces the new slab->flags bit to indicate whether the slab
is on the node partial list, protected by the node list_lock.

Patch-2 changes the free path to check whether the slab is on the node
partial list, since only in that case may it manipulate the slab's list.
Then we can keep unfrozen partial slabs off the node partial list, since
the free path won't manipulate them concurrently.

Patch-3 optimizes the deactivate path: we can directly unfreeze the slab
(since the node list_lock is no longer needed to synchronize the "frozen"
bit), then grab the node list_lock only if the slab needs to be put on
the node partial list.

Patch-4 changes the code to not freeze slabs when moving them off the
node partial list or putting them on the CPU partial list, so these slabs
no longer need to be unfrozen when they are put back on the node partial
list from the CPU partial list.

Patch-5 changes the alloc path to freeze the CPU partial slab when it is
picked for use.

4. Testing
==========

For now we have only done some simple performance testing on a server
with 128 CPUs (2 nodes):

 - perf bench sched messaging -g 5 -t -l 100000

     baseline    RFC
     7.042s      6.966s
     7.022s      7.045s
     7.054s      6.985s

 - stress-ng --rawpkt 128 --rawpkt-ops 100000000

     baseline    RFC
     2.42s       2.15s
     2.45s       2.16s
     2.44s       2.17s

The numbers above show about a 10% improvement on the stress-ng rawpkt
testcase, though not much improvement on the perf sched bench testcase.

Thanks for any comments and code review!

Chengming Zhou (5):
  slub: Introduce on_partial()
  slub: Don't manipulate slab list when used by cpu
  slub: Optimize deactivate_slab()
  slub: Don't freeze slabs for cpu partial
  slub: Introduce get_cpu_partial()

 mm/slab.h |   2 +-
 mm/slub.c | 257 +++++++++++++++++++++++++++++++-----------------------
 2 files changed, 150 insertions(+), 109 deletions(-)