From patchwork Mon Oct 16 05:29:54 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13422488
E=McAfee;i="6600,9927,10863"; a="389307942" X-IronPort-AV: E=Sophos;i="6.03,228,1694761200"; d="scan'208";a="389307942" Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Oct 2023 22:30:23 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10863"; a="899356632" X-IronPort-AV: E=Sophos;i="6.03,228,1694761200"; d="scan'208";a="899356632" Received: from yhuang6-mobl2.sh.intel.com ([10.238.6.133]) by fmsmga001-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Oct 2023 22:28:22 -0700 From: Huang Ying To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven , Huang Ying , Mel Gorman , Vlastimil Babka , David Hildenbrand , Johannes Weiner , Dave Hansen , Michal Hocko , Pavel Tatashin , Matthew Wilcox , Christoph Lameter Subject: [PATCH -V3 1/9] mm, pcp: avoid to drain PCP when process exit Date: Mon, 16 Oct 2023 13:29:54 +0800 Message-Id: <20231016053002.756205-2-ying.huang@intel.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: <20231016053002.756205-1-ying.huang@intel.com> References: <20231016053002.756205-1-ying.huang@intel.com> MIME-Version: 1.0 X-Rspamd-Queue-Id: 555BE20004 X-Rspam-User: X-Rspamd-Server: rspam04 X-Stat-Signature: hms3oneq5aizujdyqbnfwi3p69yp3gnq X-HE-Tag: 1697434224-874532 X-HE-Meta: U2FsdGVkX18lQUNYEydqEy2HSiGYTsHoBSaKbhxykHtKdxoK/juX9HTvNflxF2bWwBYOf+vkZN6aFWvLjtFO+3kJLAhd05G3/RjIJ2x6+wUfYMr1V+gqT/MyUl5ULl7GVMXd0KZxrNiSIY+rt56oz4IQJlDMbB+dZ+1uH6uxbcxkgbMrwd4YgXh7O+JrRluDN7us0hEShpLjvEia+0UgMYOxjNDXMU4bAWPK+CEvw8GbhnAoKNHpnkPchtL15t8owhLJLbYi1e4fRcq5DAU3xoYkjAvFPZ9/bdWoDSPzF4y/mtQd6cmQ81PF03SEBhBsFGfpxAfEfdZfEmHaCifxbPKZcgEH7EIwOrlGIlzD79h3kVjNA/8ppCVC6QMxTb9nwbkN7h7CeII6SJKNp4veU5+vNGYAsTD5jNzm77Yzdx5vJziYbBXfwSCJiqUkg1x+Q8+wmS8xjCBMEdsFwMGaAyTSiyPfqBz4ojFgPOKvgbldqArm3aNUGDpwSxeJzKKxnBrWXXf2TcwbrYnRIXvaWBZH43lQ07WuZH7rExEXHNWj2gGIfW4sPrx/MFraxDIGeqnF0QOGLK73qnUC4mVcFngnP2Pcz7dVtuebUaas8aaKV725K7vnCdJUcEMff2oFm4dhYOqAnC8Ou6N88INoz2BW1qkpSh1nQEmdbDZNL2CcAaKImwJUhjnBfWXaIGhbu3Yq6c6qWARCBKD9Omc5CkcG4mcZ9pEsJDnT+z4YyXnRoUwzPnlfHUElOI/pEOt5JcxyvUPyg6DQp6JRBUVgzOLlZrL49fY1lASd43+EUZMRpcY5oouKNyPslRgAfz0utjmrT/oAjID3ix4tmAXE75XBihuLGPV7L5YEqDoaudtBaICBu/wohXnMhKe7jDMOoHDkWhfRP6nB/QZYNG1xsUnsp6J16zf1ye+C/TuX8HnvKQZA3WaBSKzeSEY4zgq4GZogpa8ShIMDWiQzjb2 MpN3AddI bmd1PeLJTunuNzocyMbWfIva7XRUDCvvziVgTjAI/9/P5yc+JMPC07NewJxySy07pKd+1Eodupkzo8qKzzqNG0ZiS2OGXsZvJraauLFBVq6UzLKBHZNJUM4xC9G2IUq0qsF2xGiAInJvs0ga7cGn5C5jGVK6gSg6a1xFA9kBBltRStXZPXZKG9tc7PD46oAQJ71IO4nSZZRr3DSsRD8Iwt0drpeubUvppXHZVF0MrGhk26s7dOBnsuyqO2CRInwLO6iC51f4lyaBEUM5BXlZ4Tzvtrw/K+9ImyVTtSCKrKmcDHLUvy/gHALbGjfzkDICo8FFEgKnLHrIUCoct1XQYZkVV9FDlhXs8r+teOGRBpwY7NKRWuKilUvTez3bC9tr8hE0TbvwKb1NTyuWyuYbcnzFSPWpmqfWogijZr6UIPNfcX24ZJbes6OIachP3LCIHbXz0xYNo/foveOHYumtG3Lv6HId81GyJ1VAdj549bUEbwxrgF4k+lYsLKqAdbsL/HsMijfiOVZ5LnboV+gauxWmniVjH/ML/K+Oo X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order pages on PCP during bulk free"), the PCP (Per-CPU Pageset) will be drained when PCP is mostly used for high-order pages freeing to improve the cache-hot pages reusing between page allocation and freeing CPUs. But, the PCP draining mechanism may be triggered unexpectedly when process exits. 
With some customized trace points, it was found that PCP draining
(free_high == true) was triggered by order-1 page freeing with the
following call stack,

 => free_unref_page_commit
 => free_unref_page
 => __mmdrop
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

Checking the source code, this is the page table PGD freeing
(mm_free_pgd()).  It is an order-1 page freeing if
CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for
security.

Just before that, page freeing with the following call stack was
found,

 => free_unref_page_commit
 => free_unref_page_list
 => release_pages
 => tlb_batch_pages_flush
 => tlb_finish_mmu
 => exit_mmap
 => __mmput
 => exit_mm
 => do_exit
 => do_group_exit
 => __x64_sys_exit_group
 => do_syscall_64

So, when a process exits,

- a large number of user pages of the process will be freed without
  page allocation, so it is highly possible that pcp->free_factor
  becomes > 0.  In fact, this is the expected behavior to improve
  process exit performance.

- after freeing all user pages, the PGD will be freed, which is an
  order-1 page freeing, so the PCP will be drained.

All in all, when a process exits, it is highly possible that the PCP
will be drained.  This is an unexpected behavior.  To avoid this, in
this patch the PCP draining is only triggered for 2 consecutive
high-order page freeings.

On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild server that is used by the 0-Day kbuild service.
With the patch, the cycles% of the spinlock contention (mostly for
zone lock) decreases from 14.0% to 12.8% (with PCP size == 367).  The
number of PCP drainings for high-order page freeing (free_high)
decreases by 80.5%.

This helps network workloads too, via reduced zone lock contention.
On a 2-socket Intel server with 128 logical CPUs, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
suite with 16-pair processes increases by 16.8%.  The cycles% of the
spinlock contention (mostly for zone lock) decreases from 51.4% to
46.1%.  The number of PCP drainings for high-order page freeing
(free_high) decreases by 30.5%.  The cache miss rate stays at 0.2%.

Signed-off-by: "Huang, Ying"
Acked-by: Mel Gorman
Cc: Andrew Morton
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 include/linux/mmzone.h | 12 +++++++++++-
 mm/page_alloc.c        | 11 ++++++++---
 2 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..19c40a6f7e45 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -676,12 +676,22 @@ enum zone_watermarks {
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
+/*
+ * Flags used in pcp->flags field.
+ *
+ * PCPF_PREV_FREE_HIGH_ORDER: a high-order page is freed in the
+ * previous page freeing.  To avoid to drain PCP for an accident
+ * high-order page freeing.
+ */
+#define PCPF_PREV_FREE_HIGH_ORDER	BIT(0)
+
 struct per_cpu_pages {
 	spinlock_t lock;	/* Protects lists field */
 	int count;		/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	short free_factor;	/* batch scaling factor during free */
+	u8 flags;		/* protected by pcp->lock */
+	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
 	short expire;		/* When 0, remote pagesets are drained */
 #endif

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 95546f376302..295e61f0c49d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2370,7 +2370,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 {
 	int high;
 	int pindex;
-	bool free_high;
+	bool free_high = false;
 
 	__count_vm_events(PGFREE, 1 << order);
 	pindex = order_to_pindex(migratetype, order);
@@ -2383,8 +2383,13 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * freeing without allocation. The remainder after bulk freeing
 	 * stops will be drained from vmstat refresh context.
 	 */
-	free_high = (pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER);
-
+	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+		free_high = (pcp->free_factor &&
+			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER));
+		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
+	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
+		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
+	}
 	high = nr_pcp_high(pcp, zone, free_high);
 	if (pcp->count >= high) {
 		free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high),
 				   pcp, pindex);
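
To make the heuristic concrete, below is a minimal user-space sketch
of the decision logic in free_unref_page_commit() above.  The names
pcp_model and model_free() are illustrative only (not kernel API),
and free_factor is simply assumed to be non-zero after the bulk free
of user pages at process exit.

#include <stdbool.h>
#include <stdio.h>

#define PCPF_PREV_FREE_HIGH_ORDER	(1 << 0)
#define PAGE_ALLOC_COSTLY_ORDER		3

struct pcp_model {
	unsigned char flags;
	unsigned char free_factor;
};

/* Decide whether this free should drain the PCP (free_high). */
static bool model_free(struct pcp_model *pcp, unsigned int order)
{
	bool free_high = false;

	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
		/* Drain only if the previous free was high-order too. */
		free_high = pcp->free_factor &&
			    (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER);
		pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
	} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
		pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
	}
	return free_high;
}

int main(void)
{
	/* free_factor > 0 after the bulk free of user pages at exit. */
	struct pcp_model pcp = { .flags = 0, .free_factor = 1 };

	printf("order-0: drain=%d\n", model_free(&pcp, 0)); /* 0 */
	printf("order-1: drain=%d\n", model_free(&pcp, 1)); /* 0: first high-order free, e.g. the PGD */
	printf("order-1: drain=%d\n", model_free(&pcp, 1)); /* 1: second consecutive high-order free */
	return 0;
}

With this heuristic, the single order-1 PGD free at process exit no
longer drains the PCP, while a sustained run of high-order freeing
still does.
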
From patchwork Mon Oct 16 05:29:55 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13422489
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Sudeep Holla, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V3 2/9] cacheinfo: calculate size of per-CPU data cache slice
Date: Mon, 16 Oct 2023 13:29:55 +0800
Message-Id: <20231016053002.756205-3-ying.huang@intel.com>
In-Reply-To: <20231016053002.756205-1-ying.huang@intel.com>
References: <20231016053002.756205-1-ying.huang@intel.com>

This can be used to estimate the size of the data cache slice that
one CPU can use under ideal circumstances.  Both DATA caches and
UNIFIED caches are used in the calculation, so users need to consider
the impact of code cache usage.

Because cache inclusive/non-inclusive information isn't available
now, we just use the size of the per-CPU slice of the LLC to make the
result more predictable across architectures.  This may be improved
when more cache information is available in the future.

A brute-force algorithm that iterates over all online CPUs is used to
avoid allocating an extra cpumask, especially in the offline
callback.

Signed-off-by: "Huang, Ying"
Cc: Sudeep Holla
Cc: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
Acked-by: Mel Gorman
---
 drivers/base/cacheinfo.c  | 49 ++++++++++++++++++++++++++++++++++++++-
 include/linux/cacheinfo.h |  1 +
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index cbae8be1fe52..585c66fce9d9 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -898,6 +898,48 @@ static int cache_add_dev(unsigned int cpu)
 	return rc;
 }
 
+/*
+ * Calculate the size of the per-CPU data cache slice.  This can be
+ * used to estimate the size of the data cache slice that can be used
+ * by one CPU under ideal circumstances.  UNIFIED caches are counted
+ * in addition to DATA caches.  So, please consider code cache usage
+ * when use the result.
+ *
+ * Because the cache inclusive/non-inclusive information isn't
+ * available, we just use the size of the per-CPU slice of LLC to make
+ * the result more predictable across architectures.
+ */
+static void update_per_cpu_data_slice_size_cpu(unsigned int cpu)
+{
+	struct cpu_cacheinfo *ci;
+	struct cacheinfo *llc;
+	unsigned int nr_shared;
+
+	if (!last_level_cache_is_valid(cpu))
+		return;
+
+	ci = ci_cacheinfo(cpu);
+	llc = per_cpu_cacheinfo_idx(cpu, cache_leaves(cpu) - 1);
+
+	if (llc->type != CACHE_TYPE_DATA && llc->type != CACHE_TYPE_UNIFIED)
+		return;
+
+	nr_shared = cpumask_weight(&llc->shared_cpu_map);
+	if (nr_shared)
+		ci->per_cpu_data_slice_size = llc->size / nr_shared;
+}
+
+static void update_per_cpu_data_slice_size(bool cpu_online, unsigned int cpu)
+{
+	unsigned int icpu;
+
+	for_each_online_cpu(icpu) {
+		if (!cpu_online && icpu == cpu)
+			continue;
+		update_per_cpu_data_slice_size_cpu(icpu);
+	}
+}
+
 static int cacheinfo_cpu_online(unsigned int cpu)
 {
 	int rc = detect_cache_attributes(cpu);
@@ -906,7 +948,11 @@ static int cacheinfo_cpu_online(unsigned int cpu)
 		return rc;
 	rc = cache_add_dev(cpu);
 	if (rc)
-		free_cache_attributes(cpu);
+		goto err;
+	update_per_cpu_data_slice_size(true, cpu);
+	return 0;
+err:
+	free_cache_attributes(cpu);
 	return rc;
 }
 
@@ -916,6 +962,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
 
 	cpu_cache_sysfs_exit(cpu);
 	free_cache_attributes(cpu);
+	update_per_cpu_data_slice_size(false, cpu);
 	return 0;
 }
 
diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index a5cfd44fab45..d504eb4b49ab 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -73,6 +73,7 @@ struct cacheinfo {
 
 struct cpu_cacheinfo {
 	struct cacheinfo *info_list;
+	unsigned int per_cpu_data_slice_size;
 	unsigned int num_levels;
 	unsigned int num_leaves;
 	bool cpu_map_populated;
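
As a worked example of the division in
update_per_cpu_data_slice_size_cpu(): a hypothetical last-level cache
of 60 MB shared by 30 CPUs yields a 2 MB per-CPU data slice.  A
stand-alone sketch with made-up numbers (not taken from any specific
CPU model):

#include <stdio.h>

int main(void)
{
	/* Hypothetical LLC: 60 MB, shared_cpu_map covering 30 CPUs. */
	unsigned int llc_size = 60u << 20;	/* bytes */
	unsigned int nr_shared = 30;

	/* Mirrors the llc->size / nr_shared division above. */
	if (nr_shared)
		printf("per_cpu_data_slice_size = %u KB\n",
		       (llc_size / nr_shared) >> 10);	/* 2048 KB */
	return 0;
}
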
From patchwork Mon Oct 16 05:29:56 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13422490
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Sudeep Holla, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V3 3/9] mm, pcp: reduce lock contention for draining high-order pages
Date: Mon, 16 Oct 2023 13:29:56 +0800
Message-Id: <20231016053002.756205-4-ying.huang@intel.com>
In-Reply-To: <20231016053002.756205-1-ying.huang@intel.com>
References: <20231016053002.756205-1-ying.huang@intel.com>

In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
pages on PCP during bulk free"), the PCP (Per-CPU Pageset) is drained
when it is mostly being used for freeing high-order pages, to improve
cache-hot page reuse between the page-allocating and page-freeing
CPUs.

On a system with a small per-CPU data cache slice, pages shouldn't be
cached before draining, to guarantee that they stay cache-hot.  But
on a system with a large per-CPU data cache slice, some pages can be
cached before draining to reduce zone lock contention.

So, in this patch, instead of draining without any caching,
"pcp->batch" pages will be cached in the PCP before draining if the
size of the per-CPU data cache slice is more than "3 * batch".

In theory, if the size of the per-CPU data cache slice is more than
"2 * batch", we can reuse cache-hot pages between CPUs.  But
considering the other usage of the cache (code, other data accesses,
etc.), "3 * batch" is used.

Note: "3 * batch" is chosen to make sure the optimization works on
recent x86_64 server CPUs.  If you want to increase it, please check
whether it breaks the optimization.

On a 2-socket Intel server with 128 logical CPUs, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
suite with 16-pair processes increases by 70.5%.  The cycles% of the
spinlock contention (mostly for zone lock) decreases from 46.1% to
21.3%.  The number of PCP drainings for high-order page freeing
(free_high) decreases by 89.9%.  The cache miss rate stays at 0.2%.
Signed-off-by: "Huang, Ying" Acked-by: Mel Gorman Cc: Andrew Morton Cc: Sudeep Holla Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- drivers/base/cacheinfo.c | 2 ++ include/linux/gfp.h | 1 + include/linux/mmzone.h | 6 ++++++ mm/page_alloc.c | 38 +++++++++++++++++++++++++++++++++++++- 4 files changed, 46 insertions(+), 1 deletion(-) diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c index 585c66fce9d9..f1e79263fe61 100644 --- a/drivers/base/cacheinfo.c +++ b/drivers/base/cacheinfo.c @@ -950,6 +950,7 @@ static int cacheinfo_cpu_online(unsigned int cpu) if (rc) goto err; update_per_cpu_data_slice_size(true, cpu); + setup_pcp_cacheinfo(); return 0; err: free_cache_attributes(cpu); @@ -963,6 +964,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu) free_cache_attributes(cpu); update_per_cpu_data_slice_size(false, cpu); + setup_pcp_cacheinfo(); return 0; } diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 665f06675c83..665edc11fb9f 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -325,6 +325,7 @@ void drain_all_pages(struct zone *zone); void drain_local_pages(struct zone *zone); void page_alloc_init_late(void); +void setup_pcp_cacheinfo(void); /* * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 19c40a6f7e45..cdff247e8c6f 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -682,8 +682,14 @@ enum zone_watermarks { * PCPF_PREV_FREE_HIGH_ORDER: a high-order page is freed in the * previous page freeing. To avoid to drain PCP for an accident * high-order page freeing. + * + * PCPF_FREE_HIGH_BATCH: preserve "pcp->batch" pages in PCP before + * draining PCP for consecutive high-order pages freeing without + * allocation if data cache slice of CPU is large enough. To reduce + * zone lock contention and keep cache-hot pages reusing. */ #define PCPF_PREV_FREE_HIGH_ORDER BIT(0) +#define PCPF_FREE_HIGH_BATCH BIT(1) struct per_cpu_pages { spinlock_t lock; /* Protects lists field */ diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 295e61f0c49d..ba2d8f06523e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -52,6 +52,7 @@ #include #include #include +#include #include #include "internal.h" #include "shuffle.h" @@ -2385,7 +2386,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, */ if (order && order <= PAGE_ALLOC_COSTLY_ORDER) { free_high = (pcp->free_factor && - (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER)); + (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) && + (!(pcp->flags & PCPF_FREE_HIGH_BATCH) || + pcp->count >= READ_ONCE(pcp->batch))); pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER; } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; @@ -5418,6 +5421,39 @@ static void zone_pcp_update(struct zone *zone, int cpu_online) mutex_unlock(&pcp_batch_high_lock); } +static void zone_pcp_update_cacheinfo(struct zone *zone) +{ + int cpu; + struct per_cpu_pages *pcp; + struct cpu_cacheinfo *cci; + + for_each_online_cpu(cpu) { + pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); + cci = get_cpu_cacheinfo(cpu); + /* + * If data cache slice of CPU is large enough, "pcp->batch" + * pages can be preserved in PCP before draining PCP for + * consecutive high-order pages freeing without allocation. + * This can reduce zone lock contention without hurting + * cache-hot pages sharing. 
+		 */
+		spin_lock(&pcp->lock);
+		if ((cci->per_cpu_data_slice_size >> PAGE_SHIFT) > 3 * pcp->batch)
+			pcp->flags |= PCPF_FREE_HIGH_BATCH;
+		else
+			pcp->flags &= ~PCPF_FREE_HIGH_BATCH;
+		spin_unlock(&pcp->lock);
+	}
+}
+
+void setup_pcp_cacheinfo(void)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone)
+		zone_pcp_update_cacheinfo(zone);
+}
+
 /*
  * Allocate per cpu pagesets and initialize them.
  * Before this call only boot pagesets were available.
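
The threshold test above can be checked with concrete numbers.  A
stand-alone sketch, assuming 4 KB pages and the default batch of 63,
so the cutoff is 3 * 63 = 189 pages (about 756 KB of data cache
slice):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT	12	/* assume 4 KB pages */

/* Mirrors the PCPF_FREE_HIGH_BATCH condition above. */
static bool free_high_batch(unsigned int slice_bytes, unsigned int batch)
{
	return (slice_bytes >> PAGE_SHIFT) > 3 * batch;
}

int main(void)
{
	unsigned int batch = 63;	/* default for zones > 1GB */

	/* 2 MB slice = 512 pages > 189: keep pcp->batch pages cached. */
	printf("2 MB slice:   %d\n", free_high_batch(2u << 20, batch));
	/* 512 KB slice = 128 pages <= 189: drain without caching. */
	printf("512 KB slice: %d\n", free_high_batch(512u << 10, batch));
	return 0;
}
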
From patchwork Mon Oct 16 05:29:57 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13422491
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V3 4/9] mm: restrict the pcp batch scale factor to avoid too long latency
Date: Mon, 16 Oct 2023 13:29:57 +0800
Message-Id: <20231016053002.756205-5-ying.huang@intel.com>
In-Reply-To: <20231016053002.756205-1-ying.huang@intel.com>
References: <20231016053002.756205-1-ying.huang@intel.com>

In the page allocator, PCP (Per-CPU Pageset) is refilled and drained
in batches to increase page allocation throughput, reduce page
allocation/freeing latency per page, and reduce zone lock contention.
But a too large batch size will cause a too long maximal
allocation/freeing latency, which may punish arbitrary users.  So the
default batch size is chosen carefully (in zone_batchsize(), the
value is 63 for zones > 1GB) to avoid that.

In commit 3b12e7e97938 ("mm/page_alloc: scale the number of pages
that are batch freed"), the batch size is scaled for large numbers of
page freeings to improve page freeing performance and reduce zone
lock contention.  A similar optimization can be used for large
numbers of page allocations too.

To find out a suitable max batch scale factor (that is, the max
effective batch size), some tests and measurements were done on
several machines as follows.

A set of debug patches was implemented as follows,

- Set PCP high to be 2 * batch to reduce the effect of PCP high
- Disable free batch size scaling to get the raw performance.
- The code with zone lock held is extracted from rmqueue_bulk() and
  free_pcppages_bulk() to 2 separate functions to make it easy to
  measure the function run time with the ftrace function_graph
  tracer.
- The batch size is hard coded to be 63 (default), 127, 255, 511,
  1023, 2047, 4095.

Then, will-it-scale/page_fault1 is used to generate the page
allocation/freeing workload.  The page allocation/freeing throughput
(page/s) is measured via will-it-scale.  The page allocation/freeing
average latency (alloc/free latency avg, in us) and the
allocation/freeing latency at the 99th percentile (alloc/free latency
99%, in us) are measured with the ftrace function_graph tracer.

The test results are as follows,

Sapphire Rapids Server
======================
Batch   throughput   free latency   free latency   alloc latency   alloc latency
        page/s       avg / us       99% / us       avg / us        99% / us
-----   ----------   ------------   ------------   -------------   -------------
   63   513633.4     2.33           3.57           2.67            6.83
  127   517616.7     4.35           6.65           4.22            13.03
  255   520822.8     8.29           13.32          7.52            25.24
  511   524122.0     15.79          23.42          14.02           49.35
 1023   525980.5     30.25          44.19          25.36           94.88
 2047   526793.6     59.39          84.50          45.22           140.81

Ice Lake Server
===============
Batch   throughput   free latency   free latency   alloc latency   alloc latency
        page/s       avg / us       99% / us       avg / us        99% / us
-----   ----------   ------------   ------------   -------------   -------------
   63   620210.3     2.21           3.68           2.02            4.35
  127   627003.0     4.09           6.86           3.51            8.28
  255   630777.5     7.70           13.50          6.17            15.97
  511   633651.5     14.85          22.62          11.66           31.08
 1023   637071.1     28.55          42.02          20.81           54.36
 2047   638089.7     56.54          84.06          39.28           91.68

Cascade Lake Server
===================
Batch   throughput   free latency   free latency   alloc latency   alloc latency
        page/s       avg / us       99% / us       avg / us        99% / us
-----   ----------   ------------   ------------   -------------   -------------
   63   404706.7     3.29           5.03           3.53            4.75
  127   422475.2     6.12           9.09           6.36            8.76
  255   411522.2     11.68          16.97          10.90           16.39
  511   428124.1     22.54          31.28          19.86           32.25
 1023   414718.4     43.39          62.52          40.00           66.33
 2047   429848.7     86.64          120.34         71.14           106.08

Comet Lake Desktop
==================
Batch   throughput   free latency   free latency   alloc latency   alloc latency
        page/s       avg / us       99% / us       avg / us        99% / us
-----   ----------   ------------   ------------   -------------   -------------
   63   795183.13    2.18           3.55           2.03            3.05
  127   803067.85    3.91           6.56           3.85            5.52
  255   812771.10    7.35           10.80          7.14            10.20
  511   817723.48    14.17          27.54          13.43           30.31
 1023   818870.19    27.72          40.10          27.89           46.28

Coffee Lake Desktop
===================
Batch   throughput   free latency   free latency   alloc latency   alloc latency
        page/s       avg / us       99% / us       avg / us        99% / us
-----   ----------   ------------   ------------   -------------   -------------
   63   510542.8     3.13           4.40           2.48            3.43
  127   514288.6     5.97           7.89           4.65            6.04
  255   516889.7     11.86          15.58          8.96            12.55
  511   519802.4     23.10          28.81          16.95           26.19
 1023   520802.7     45.30          52.51          33.19           45.95
 2047   519997.1     90.63          104.00         65.26           81.74

From the above data, to restrict the allocation/freeing latency to
less than 100 us most of the time, the max batch scale factor needs
to be less than or equal to 5.

Although it is reasonable to use 5 as the max batch scale factor for
the systems tested, there are also slower systems, where a smaller
value should be used to constrain the page allocation/freeing
latency.  So, in this patch, a new kconfig option
(PCP_BATCH_SCALE_MAX) is added to set the max batch scale factor.
Its default value is 5, and users can reduce it when necessary.

Signed-off-by: "Huang, Ying"
Acked-by: Andrew Morton
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
Acked-by: Mel Gorman
---
 mm/Kconfig      | 11 +++++++++++
 mm/page_alloc.c |  2 +-
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5..ece4f2847e2b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -704,6 +704,17 @@ config HUGETLB_PAGE_SIZE_VARIABLE
 config CONTIG_ALLOC
 	def_bool (MEMORY_ISOLATION && COMPACTION) || CMA
 
+config PCP_BATCH_SCALE_MAX
+	int "Maximum scale factor of PCP (Per-CPU pageset) batch allocate/free"
+	default 5
+	range 0 6
+	help
+	  In page allocator, PCP (Per-CPU pageset) is refilled and drained in
+	  batches.  The batch number is scaled automatically to improve page
+	  allocation/free throughput.  But too large scale factor may hurt
+	  latency.  This option sets the upper limit of scale factor to limit
+	  the maximum latency.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ba2d8f06523e..a5a5a4c3cd2b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2340,7 +2340,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
 	 * freeing of pages without any allocation.
 	 */
 	batch <<= pcp->free_factor;
-	if (batch < max_nr_free)
+	if (batch < max_nr_free && pcp->free_factor < CONFIG_PCP_BATCH_SCALE_MAX)
 		pcp->free_factor++;
 	batch = clamp(batch, min_nr_free, max_nr_free);
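
To connect CONFIG_PCP_BATCH_SCALE_MAX with the tables above: with the
default batch of 63 and the default maximum scale factor of 5, the
largest effective free batch is 63 << 5 = 2016 pages, which
corresponds to the 2047 rows of the tables, where the 99th-percentile
latency is around (and on some of the tested machines somewhat above)
100 us.  A stand-alone sketch of the scaling:

#include <stdio.h>

#define PCP_BATCH_SCALE_MAX	5	/* default CONFIG_PCP_BATCH_SCALE_MAX */

int main(void)
{
	unsigned int batch = 63;	/* zone_batchsize() default for zones > 1GB */
	unsigned int factor;

	/* Prints 63, 126, 252, 504, 1008, 2016. */
	for (factor = 0; factor <= PCP_BATCH_SCALE_MAX; factor++)
		printf("free_factor %u -> effective batch %u pages\n",
		       factor, batch << factor);
	return 0;
}
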
From patchwork Mon Oct 16 05:29:58 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13422492
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V3 5/9] mm, page_alloc: scale the number of pages that are batch allocated
Date: Mon, 16 Oct 2023 13:29:58 +0800
Message-Id: <20231016053002.756205-6-ying.huang@intel.com>
In-Reply-To: <20231016053002.756205-1-ying.huang@intel.com>
References: <20231016053002.756205-1-ying.huang@intel.com>

When a task is allocating a large number of order-0 pages, it may
acquire the zone->lock multiple times, allocating pages in batches.
This may unnecessarily contend on the zone lock when allocating a
very large number of pages.  This patch adapts the batch size based
on the recent allocation pattern, scaling it up for subsequent
allocations.

On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroups.  This
simulates the kbuild server that is used by the 0-Day kbuild service.
With the patch, the cycles% of the spinlock contention (mostly for
zone lock) decreases from 12.6% to 11.0% (with PCP size == 367).

Signed-off-by: "Huang, Ying"
Suggested-by: Mel Gorman
Acked-by: Mel Gorman
Cc: Andrew Morton
Cc: Vlastimil Babka
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: Dave Hansen
Cc: Michal Hocko
Cc: Pavel Tatashin
Cc: Matthew Wilcox
Cc: Christoph Lameter
---
 include/linux/mmzone.h |  3 ++-
 mm/page_alloc.c        | 53 ++++++++++++++++++++++++++++++++++--------
 2 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cdff247e8c6f..ba548ae20686 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -697,9 +697,10 @@ struct per_cpu_pages {
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
 	u8 flags;		/* protected by pcp->lock */
+	u8 alloc_factor;	/* batch scaling factor during allocate */
 	u8 free_factor;		/* batch scaling factor during free */
 #ifdef CONFIG_NUMA
-	short expire;		/* When 0, remote pagesets are drained */
+	u8 expire;		/* When 0, remote pagesets are drained */
 #endif
 
 	/* Lists of pages, one per migrate type stored on the pcp-lists */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a5a5a4c3cd2b..eeef0ead1c2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2373,6 +2373,12 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	int pindex;
 	bool free_high = false;
 
+	/*
+	 * On freeing, reduce the number of pages that are batch allocated.
+	 * See nr_pcp_alloc() where alloc_factor is increased for subsequent
+	 * allocations.
+	 */
+	pcp->alloc_factor >>= 1;
 	__count_vm_events(PGFREE, 1 << order);
 	pindex = order_to_pindex(migratetype, order);
 	list_add(&page->pcp_list, &pcp->lists[pindex]);
@@ -2679,6 +2685,42 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
 	return page;
 }
 
+static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
+{
+	int high, batch, max_nr_alloc;
+
+	high = READ_ONCE(pcp->high);
+	batch = READ_ONCE(pcp->batch);
+
+	/* Check for PCP disabled or boot pageset */
+	if (unlikely(high < batch))
+		return 1;
+
+	/*
+	 * Double the number of pages allocated each time there is subsequent
+	 * allocation of order-0 pages without any freeing.
+	 */
+	if (!order) {
+		max_nr_alloc = max(high - pcp->count - batch, batch);
+		batch <<= pcp->alloc_factor;
+		if (batch <= max_nr_alloc &&
+		    pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX)
+			pcp->alloc_factor++;
+		batch = min(batch, max_nr_alloc);
+	}
+
+	/*
+	 * Scale batch relative to order if batch implies free pages
+	 * can be stored on the PCP.  Batch can be 1 for small zones or
+	 * for boot pagesets which should never store free pages as
+	 * the pages may belong to arbitrary zones.
+	 */
+	if (batch > 1)
+		batch = max(batch >> order, 2);
+
+	return batch;
+}
+
 /* Remove page from the per-cpu list, caller must protect the list */
 static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
@@ -2691,18 +2733,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
 
 	do {
 		if (list_empty(list)) {
-			int batch = READ_ONCE(pcp->batch);
+			int batch = nr_pcp_alloc(pcp, order);
 			int alloced;
 
-			/*
-			 * Scale batch relative to order if batch implies
-			 * free pages can be stored on the PCP. Batch can
-			 * be 1 for small zones or for boot pagesets which
-			 * should never store free pages as the pages may
-			 * belong to arbitrary zones.
-			 */
-			if (batch > 1)
-				batch = max(batch >> order, 2);
 			alloced = rmqueue_bulk(zone, order, batch,
 					       list, migratetype, alloc_flags);
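
The order-0 branch of nr_pcp_alloc() can be modeled in user space as
below.  This is a simplified sketch (only order-0 is handled, and
pcp->count is left at 0 rather than updated after each refill);
pcp_model and model_nr_alloc() are illustrative names:

#include <stdio.h>

#define PCP_BATCH_SCALE_MAX	5

struct pcp_model {
	int high, batch, count;
	unsigned char alloc_factor;
};

/* Order-0 refill size, mirroring the nr_pcp_alloc() logic above. */
static int model_nr_alloc(struct pcp_model *pcp)
{
	int max_nr_alloc, batch = pcp->batch;

	if (pcp->high < pcp->batch)
		return 1;	/* PCP disabled or boot pageset */

	max_nr_alloc = pcp->high - pcp->count - pcp->batch;
	if (max_nr_alloc < pcp->batch)
		max_nr_alloc = pcp->batch;
	batch <<= pcp->alloc_factor;
	if (batch <= max_nr_alloc && pcp->alloc_factor < PCP_BATCH_SCALE_MAX)
		pcp->alloc_factor++;
	return batch < max_nr_alloc ? batch : max_nr_alloc;
}

int main(void)
{
	struct pcp_model pcp = { .high = 2048, .batch = 63, .count = 0 };
	int i;

	/* Consecutive order-0 refills without freeing: 63, 126, 252, ... */
	for (i = 0; i < 5; i++)
		printf("refill %d: %d pages\n", i, model_nr_alloc(&pcp));
	pcp.alloc_factor >>= 1;	/* a free halves the scaling, as in the patch */
	return 0;
}
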
b=eatv+0+3YQAOpYVIcE9eLiTWyD9549ZaFJdgrRMv5kI6OI5DJw1XIQO3i8dxFc/7Y8V+1f xm2Bi21PS394KdxY04B9zAG7Va2N5j1mox/fb+p5H0QBRT5xgL2txxBc5iMn8iWf6iVKRa AhTmNGrSnOYxz+AhewKOI/9foR5fJ4M= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1697434242; x=1728970242; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=DB51T+n1GD7jCWJPVbTjdF4dWPaoZxyz7GoPTpVRQSU=; b=lvl+XjhyJ/oIOPe556a6OwexJ55TAWy1cxw+x3ATTqk1OJ3ASdDSFIqa Kv8PxJkAEPmUrlkbW1pqeTMcV8Diq+nlR44fBJQlmBJIVr64rKimxiHys ZUGVKwUPIug1TfoKf5rev5tKGhrno+VZXIU6F1pgVp611KXvYVDl7e2uC /6UWQIYKRlaUB7V1R4+WJOnaoiuO1eppG/bB5W/QuS38YeR0sciNbN1WT Ftq/BeH4qlpPcn1RzvDN9c+6wqUMKMcFjpiOhgG1rfALfkN3S3EFuuzY/ DinFpB8VU1b5ULW0dqluZ9d8zIY59SRZ1NIOvu3ooyKIFl0fU+uSFU92S g==; X-IronPort-AV: E=McAfee;i="6600,9927,10863"; a="389308060" X-IronPort-AV: E=Sophos;i="6.03,228,1694761200"; d="scan'208";a="389308060" Received: from fmsmga001.fm.intel.com ([10.253.24.23]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Oct 2023 22:30:41 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10863"; a="899356724" X-IronPort-AV: E=Sophos;i="6.03,228,1694761200"; d="scan'208";a="899356724" Received: from yhuang6-mobl2.sh.intel.com ([10.238.6.133]) by fmsmga001-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 15 Oct 2023 22:28:40 -0700 From: Huang Ying To: Andrew Morton Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven , Huang Ying , Mel Gorman , Vlastimil Babka , David Hildenbrand , Johannes Weiner , Dave Hansen , Michal Hocko , Pavel Tatashin , Matthew Wilcox , Christoph Lameter Subject: [PATCH -V3 6/9] mm: add framework for PCP high auto-tuning Date: Mon, 16 Oct 2023 13:29:59 +0800 Message-Id: <20231016053002.756205-7-ying.huang@intel.com> X-Mailer: git-send-email 2.39.2 In-Reply-To: <20231016053002.756205-1-ying.huang@intel.com> References: <20231016053002.756205-1-ying.huang@intel.com> MIME-Version: 1.0 X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 289D9C0004 X-Stat-Signature: ffqfj5nrkk4hwbyyutgbgywin755onz4 X-Rspam-User: X-HE-Tag: 1697434241-880005 X-HE-Meta: U2FsdGVkX18//k54iDUY1OEyYL0csvbJRd1KBVyr4ex9hKb1N4Y2X/rxDbZR6N04FyRbiKQgrdnh1sONtqzRir/JyU3uxlux2qCPgXJHIL+ZfrmpiR+dBaHTSSkjEt5w/juIQFtx8aMNuUpDyhBKdxE0gUasDXQfNJ/uLTxyuQTNbeGh915PqmoJkv/zgxRSdsJ6YrFRrRTX0GPmkZHvxJ8X74GQPFRt4TNvXMbHeeqiUEZtzgaHbneZD7f/RnZ8oDbMFVFLPzv3VrW3jVG3MvYkVGOLIfXjdr4/B+smxs62tqR4RAnSrWsySyf4CwF319nTR8e8J6677BJtAufVbYUub6NMwSLNCiJeZziL6KjVVUZ/X/7UdRkA/n/Hob2j7R8CZMdygBiBMimOu4fYU0p+4uobpjUVPiYWVPwaxLLmhfkP8kQ2ibqY1XnV+dZJrYQ2//3ALsBmeSP/wb4cFcOUQ/TMNP0YCInBX8r/hxPxux3A61L3BbYxS+LL4jYbooezvTwGqObHYy6AsrHismNHKZ+NHRTrzpgFXuz1Ua8PQMRU5h1Hk4L+cPBiSxM5XGGPmxp1hn0aMC0mIO6Npsg6zilMZJcDy5hkudTEOeCKnXDWFR8ZEDMhFjHtVRF2+t0+WuURBrdTUcTQ+t41pwfbJkwwkTEZIkSOOVQRNB0+/QwSJa4IDLxI6fp0jolE8HnEn0PomLTk/TSgEA1UsEkbSxM9ybyQcvzg9qf9Ibs/5XcAwnbnvZycrMi4ujWsVFZWCEHpy6lGzXwPbCJ5Wg5NwiImLBC1Zk5yb3Y42mq4dMN1KGTfdTflEmC6YKzVq/K4z80PwhHagoR7GB/+Hq/lHn+h0XhtB0jYr1qjAYJrbPLaifeCUBvRa9oKeEBW53al0646x51Yg6NG+ZOjGJS68OnhiO8eTGLIDDRzyiZdbJN756xT6zWF9dRS71fbKTduJ4XQaj8RFdUAHU5 1OtY8ueR 
The page allocation performance requirements of different workloads are usually different. So, we need to tune PCP (per-CPU pageset) high to optimize the workload page allocation performance. We already have a system-wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by hand, but it's hard to find the best value that way, and one global configuration may not work best for the different workloads that run on the same system. One solution to these issues is to tune the PCP high of each CPU automatically.

This patch adds the framework for PCP high auto-tuning. With it, pcp->high of each CPU will be changed automatically by the tuning algorithm at runtime. The minimal high (pcp->high_min) is the original PCP high value calculated based on the low watermark pages, while the maximal high (pcp->high_max) is the PCP high value when the percpu_pagelist_high_fraction sysctl knob is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION, that is, the maximal pcp->high that can be set via the sysctl knob by hand.

It's possible that PCP high auto-tuning doesn't work well for some workloads. So, when PCP high is tuned by hand via the sysctl knob, auto-tuning will be disabled, and the PCP high set by hand will be used instead.

This patch only adds the framework, so pcp->high will always be set to pcp->high_min (the original default). The actual auto-tuning algorithm will be added in the following patches in the series.
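The contract of the framework fits in a small user-space sketch (illustrative only; clamp_high() is a hypothetical helper, not a kernel function, and the numbers are arbitrary):

#include <stdio.h>

/*
 * Keep a tuned value inside [high_min, high_max]. When the sysctl
 * pins high_min == high_max, any tuned value collapses back to the
 * manual setting, so auto-tuning is effectively disabled without a
 * separate enable flag.
 */
static int clamp_high(int high, int high_min, int high_max)
{
	if (high < high_min)
		return high_min;
	if (high > high_max)
		return high_max;
	return high;
}

int main(void)
{
	/* Auto-tuning enabled: the tuned value may move within the range. */
	printf("%d\n", clamp_high(9000, 2048, 12288));	/* prints 9000 */

	/* Manually tuned: high_min == high_max pins the value. */
	printf("%d\n", clamp_high(9000, 4096, 4096));	/* prints 4096 */
	return 0;
}

Encoding "manual tuning disables auto-tuning" as an empty tuning range keeps every later consumer of pcp->high oblivious to how the value was chosen.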
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Mel Gorman Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter Acked-by: Mel Gorman --- include/linux/mmzone.h | 5 ++- mm/page_alloc.c | 71 +++++++++++++++++++++++++++--------------- 2 files changed, 50 insertions(+), 26 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index ba548ae20686..ec3f7daedcc7 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -695,6 +695,8 @@ struct per_cpu_pages { spinlock_t lock; /* Protects lists field */ int count; /* number of pages in the list */ int high; /* high watermark, emptying needed */ + int high_min; /* min high watermark */ + int high_max; /* max high watermark */ int batch; /* chunk size for buddy add/remove */ u8 flags; /* protected by pcp->lock */ u8 alloc_factor; /* batch scaling factor during allocate */ @@ -854,7 +856,8 @@ struct zone { * the high and batch values are copied to individual pagesets for * faster access */ - int pageset_high; + int pageset_high_min; + int pageset_high_max; int pageset_batch; #ifndef CONFIG_SPARSEMEM diff --git a/mm/page_alloc.c b/mm/page_alloc.c index eeef0ead1c2a..1fb2c6ebde9c 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2350,7 +2350,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high) static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, bool free_high) { - int high = READ_ONCE(pcp->high); + int high = READ_ONCE(pcp->high_min); if (unlikely(!high || free_high)) return 0; @@ -2689,7 +2689,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order) { int high, batch, max_nr_alloc; - high = READ_ONCE(pcp->high); + high = READ_ONCE(pcp->high_min); batch = READ_ONCE(pcp->batch); /* Check for PCP disabled or boot pageset */ @@ -5296,14 +5296,15 @@ static int zone_batchsize(struct zone *zone) } static int percpu_pagelist_high_fraction; -static int zone_highsize(struct zone *zone, int batch, int cpu_online) +static int zone_highsize(struct zone *zone, int batch, int cpu_online, + int high_fraction) { #ifdef CONFIG_MMU int high; int nr_split_cpus; unsigned long total_pages; - if (!percpu_pagelist_high_fraction) { + if (!high_fraction) { /* * By default, the high value of the pcp is based on the zone * low watermark so that if they are full then background @@ -5316,15 +5317,15 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online) * value is based on a fraction of the managed pages in the * zone. */ - total_pages = zone_managed_pages(zone) / percpu_pagelist_high_fraction; + total_pages = zone_managed_pages(zone) / high_fraction; } /* * Split the high value across all online CPUs local to the zone. Note * that early in boot that CPUs may not be online yet and that during * CPU hotplug that the cpumask is not yet updated when a CPU is being - * onlined. For memory nodes that have no CPUs, split pcp->high across - * all online CPUs to mitigate the risk that reclaim is triggered + * onlined. For memory nodes that have no CPUs, split the high value + * across all online CPUs to mitigate the risk that reclaim is triggered * prematurely due to pages stored on pcp lists. */ nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online; @@ -5352,19 +5353,21 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online) * However, guaranteeing these relations at all times would require e.g. 
write * barriers here but also careful usage of read barriers at the read side, and * thus be prone to error and bad for performance. Thus the update only prevents - * store tearing. Any new users of pcp->batch and pcp->high should ensure they - * can cope with those fields changing asynchronously, and fully trust only the - * pcp->count field on the local CPU with interrupts disabled. + * store tearing. Any new users of pcp->batch, pcp->high_min and pcp->high_max + * should ensure they can cope with those fields changing asynchronously, and + * fully trust only the pcp->count field on the local CPU with interrupts + * disabled. * * mutex_is_locked(&pcp_batch_high_lock) required when calling this function * outside of boot time (or some other assurance that no concurrent updaters * exist). */ -static void pageset_update(struct per_cpu_pages *pcp, unsigned long high, - unsigned long batch) +static void pageset_update(struct per_cpu_pages *pcp, unsigned long high_min, + unsigned long high_max, unsigned long batch) { WRITE_ONCE(pcp->batch, batch); - WRITE_ONCE(pcp->high, high); + WRITE_ONCE(pcp->high_min, high_min); + WRITE_ONCE(pcp->high_max, high_max); } static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats) @@ -5384,20 +5387,21 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta * need to be as careful as pageset_update() as nobody can access the * pageset yet. */ - pcp->high = BOOT_PAGESET_HIGH; + pcp->high_min = BOOT_PAGESET_HIGH; + pcp->high_max = BOOT_PAGESET_HIGH; pcp->batch = BOOT_PAGESET_BATCH; pcp->free_factor = 0; } -static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high, - unsigned long batch) +static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min, + unsigned long high_max, unsigned long batch) { struct per_cpu_pages *pcp; int cpu; for_each_possible_cpu(cpu) { pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); - pageset_update(pcp, high, batch); + pageset_update(pcp, high_min, high_max, batch); } } @@ -5407,19 +5411,34 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h */ static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online) { - int new_high, new_batch; + int new_high_min, new_high_max, new_batch; new_batch = max(1, zone_batchsize(zone)); - new_high = zone_highsize(zone, new_batch, cpu_online); + if (percpu_pagelist_high_fraction) { + new_high_min = zone_highsize(zone, new_batch, cpu_online, + percpu_pagelist_high_fraction); + /* + * PCP high is tuned manually, disable auto-tuning via + * setting high_min and high_max to the manual value. 
+ */ + new_high_max = new_high_min; + } else { + new_high_min = zone_highsize(zone, new_batch, cpu_online, 0); + new_high_max = zone_highsize(zone, new_batch, cpu_online, + MIN_PERCPU_PAGELIST_HIGH_FRACTION); + } - if (zone->pageset_high == new_high && + if (zone->pageset_high_min == new_high_min && + zone->pageset_high_max == new_high_max && zone->pageset_batch == new_batch) return; - zone->pageset_high = new_high; + zone->pageset_high_min = new_high_min; + zone->pageset_high_max = new_high_max; zone->pageset_batch = new_batch; - __zone_set_pageset_high_and_batch(zone, new_high, new_batch); + __zone_set_pageset_high_and_batch(zone, new_high_min, new_high_max, + new_batch); } void __meminit setup_zone_pageset(struct zone *zone) @@ -5528,7 +5547,8 @@ __meminit void zone_pcp_init(struct zone *zone) */ zone->per_cpu_pageset = &boot_pageset; zone->per_cpu_zonestats = &boot_zonestats; - zone->pageset_high = BOOT_PAGESET_HIGH; + zone->pageset_high_min = BOOT_PAGESET_HIGH; + zone->pageset_high_max = BOOT_PAGESET_HIGH; zone->pageset_batch = BOOT_PAGESET_BATCH; if (populated_zone(zone)) @@ -6430,13 +6450,14 @@ EXPORT_SYMBOL(free_contig_range); void zone_pcp_disable(struct zone *zone) { mutex_lock(&pcp_batch_high_lock); - __zone_set_pageset_high_and_batch(zone, 0, 1); + __zone_set_pageset_high_and_batch(zone, 0, 0, 1); __drain_all_pages(zone, true); } void zone_pcp_enable(struct zone *zone) { - __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch); + __zone_set_pageset_high_and_batch(zone, zone->pageset_high_min, + zone->pageset_high_max, zone->pageset_batch); mutex_unlock(&pcp_batch_high_lock); }

From patchwork Mon Oct 16 05:30:00 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13422494
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Michal Hocko, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V3 7/9] mm: tune PCP high automatically
Date: Mon, 16 Oct 2023 13:30:00 +0800
Message-Id: <20231016053002.756205-8-ying.huang@intel.com>
In-Reply-To: <20231016053002.756205-1-ying.huang@intel.com>
References: <20231016053002.756205-1-ying.huang@intel.com>
The targets of tuning PCP high automatically are as follows:

- Minimize allocation/freeing from/to the shared zone
- Minimize idle pages in the PCP
- Minimize pages in the PCP if the number of free pages in the system is too low

To reach these targets, the following tuning algorithm is designed:

- When we refill the PCP via allocating from the zone, increase PCP high, because with a larger PCP we could avoid allocating from the zone.
- In the periodic vmstat updating kworker (via refresh_cpu_vm_stats()), decrease PCP high to try to free possible idle PCP pages.
- When page reclaim is active for the zone, stop increasing PCP high in the allocating path, and decrease PCP high and free some pages in the freeing path.

So, the PCP high can eventually be tuned to the page allocating/freeing depth of the workload. One issue with the algorithm is that if the number of pages allocated on a CPU is much larger than the number of pages freed, PCP high may reach the maximal value even if the allocating/freeing depth is small. But this isn't a severe issue, because there are no idle pages in that case.

One alternative choice is to increase PCP high when we drain the PCP via trying to free pages to the zone, but not during PCP refilling. This would avoid the issue above, but if the number of pages allocated on a CPU is much smaller than the number of pages freed, there will be many idle pages in the PCP and it is hard to free them.

1/8 (>> 3) of PCP high is decreased periodically. The value 1/8 is somewhat arbitrary; it just makes sure that idle PCP pages will be freed eventually.
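A rough user-space model of the periodic decay follows (illustrative only, not the kernel code; the patch's decay_pcp_high() additionally caps how many pages one round may drain, via batch << CONFIG_PCP_BATCH_SCALE_MAX, to control latency, and that cap is omitted here):

#include <stdio.h>

struct pcp_sim {
	int high;	/* auto-tuned high watermark */
	int high_min;	/* floor: the original default */
	int count;	/* pages currently held by the PCP */
};

/*
 * One periodic decay round: trim high by 1/8 toward high_min, then
 * report how many now-excess pages should be freed back to the zone.
 */
static int sim_decay_high(struct pcp_sim *pcp)
{
	int to_drain = 0;

	if (pcp->high > pcp->high_min) {
		pcp->high -= pcp->high >> 3;
		if (pcp->high < pcp->high_min)
			pcp->high = pcp->high_min;
	}
	if (pcp->count > pcp->high) {
		to_drain = pcp->count - pcp->high;
		pcp->count = pcp->high;
	}
	return to_drain;
}

int main(void)
{
	struct pcp_sim pcp = { .high = 8192, .high_min = 1024, .count = 8192 };

	/* An idle CPU: no allocations or frees, high decays round by round. */
	for (int i = 0; i < 5; i++) {
		int freed = sim_decay_high(&pcp);

		printf("round %d: high=%d freed=%d\n", i, pcp.high, freed);
	}
	return 0;
}

The geometric decay means a CPU that has gone quiet returns its pages within a bounded number of vmstat refresh rounds, without a sharp one-shot drain.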
On a 2-socket Intel server with 224 logical CPUs, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroups. This simulates the kbuild servers that are used by the 0-Day kbuild service. With the patch, the build time decreases by 3.5%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 11.0% to 0.5%. The number of PCP drainings for high-order page freeing (free_high) decreases by 65.6%. The number of pages allocated from the zone (instead of from the PCP) decreases by 83.9%.

Signed-off-by: "Huang, Ying" Suggested-by: Mel Gorman Suggested-by: Michal Hocko Cc: Andrew Morton Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- include/linux/gfp.h | 1 + mm/page_alloc.c | 119 ++++++++++++++++++++++++++++++++++---------- mm/vmstat.c | 8 +-- 3 files changed, 99 insertions(+), 29 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 665edc11fb9f..5b917e5b9350 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -320,6 +320,7 @@ extern void page_frag_free(void *addr); #define free_page(addr) free_pages((addr), 0) void page_alloc_init_cpuhp(void); +int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp); void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp); void drain_all_pages(struct zone *zone); void drain_local_pages(struct zone *zone); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1fb2c6ebde9c..8382ad2cdfd4 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2157,6 +2157,40 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, return i; } +/* + * Called from the vmstat counter updater to decay the PCP high. + * Return whether there is additional work to do. + */ +int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp) +{ + int high_min, to_drain, batch; + int todo = 0; + + high_min = READ_ONCE(pcp->high_min); + batch = READ_ONCE(pcp->batch); + /* + * Decrease pcp->high periodically to try to free possible + * idle PCP pages. And, avoid freeing too many pages to + * control latency. This caps pcp->high decrement too. + */ + if (pcp->high > high_min) { + pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX), + pcp->high - (pcp->high >> 3), high_min); + if (pcp->high > high_min) + todo++; + } + + to_drain = pcp->count - pcp->high; + if (to_drain > 0) { + spin_lock(&pcp->lock); + free_pcppages_bulk(zone, to_drain, pcp, 0); + spin_unlock(&pcp->lock); + todo++; + } + + return todo; +} + #ifdef CONFIG_NUMA /* * Called from the vmstat counter updater to drain pagesets of this @@ -2318,14 +2352,13 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn, return true; } -static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high) +static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free_high) { int min_nr_free, max_nr_free; - int batch = READ_ONCE(pcp->batch); - /* Free everything if batch freeing high-order pages. */ + /* Free as much as possible if batch freeing high-order pages. */ if (unlikely(free_high)) - return pcp->count; + return min(pcp->count, batch << CONFIG_PCP_BATCH_SCALE_MAX); /* Check for PCP disabled or boot pageset */ if (unlikely(high < batch)) @@ -2340,7 +2373,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high) * freeing of pages without any allocation.
*/ batch <<= pcp->free_factor; - if (batch < max_nr_free && pcp->free_factor < CONFIG_PCP_BATCH_SCALE_MAX) + if (batch <= max_nr_free && pcp->free_factor < CONFIG_PCP_BATCH_SCALE_MAX) pcp->free_factor++; batch = clamp(batch, min_nr_free, max_nr_free); @@ -2348,28 +2381,48 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high) } static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, - bool free_high) + int batch, bool free_high) { - int high = READ_ONCE(pcp->high_min); + int high, high_min, high_max; - if (unlikely(!high || free_high)) + high_min = READ_ONCE(pcp->high_min); + high_max = READ_ONCE(pcp->high_max); + high = pcp->high = clamp(pcp->high, high_min, high_max); + + if (unlikely(!high)) return 0; - if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) - return high; + if (unlikely(free_high)) { + pcp->high = max(high - (batch << CONFIG_PCP_BATCH_SCALE_MAX), + high_min); + return 0; + } /* * If reclaim is active, limit the number of pages that can be * stored on pcp lists */ - return min(READ_ONCE(pcp->batch) << 2, high); + if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) { + pcp->high = max(high - (batch << pcp->free_factor), high_min); + return min(batch << 2, pcp->high); + } + + if (pcp->count >= high && high_min != high_max) { + int need_high = (batch << pcp->free_factor) + batch; + + /* pcp->high should be large enough to hold batch freed pages */ + if (pcp->high < need_high) + pcp->high = clamp(need_high, high_min, high_max); + } + + return high; } static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, struct page *page, int migratetype, unsigned int order) { - int high; + int high, batch; int pindex; bool free_high = false; @@ -2384,6 +2437,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, list_add(&page->pcp_list, &pcp->lists[pindex]); pcp->count += 1 << order; + batch = READ_ONCE(pcp->batch); /* * As high-order pages other than THP's stored on PCP can contribute * to fragmentation, limit the number stored when PCP is heavily @@ -2394,14 +2448,15 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, free_high = (pcp->free_factor && (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) && (!(pcp->flags & PCPF_FREE_HIGH_BATCH) || - pcp->count >= READ_ONCE(pcp->batch))); + pcp->count >= READ_ONCE(batch))); pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER; } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; } - high = nr_pcp_high(pcp, zone, free_high); + high = nr_pcp_high(pcp, zone, batch, free_high); if (pcp->count >= high) { - free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex); + free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high), + pcp, pindex); } } @@ -2685,24 +2740,38 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, return page; } -static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order) +static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order) { - int high, batch, max_nr_alloc; + int high, base_batch, batch, max_nr_alloc; + int high_max, high_min; - high = READ_ONCE(pcp->high_min); - batch = READ_ONCE(pcp->batch); + base_batch = READ_ONCE(pcp->batch); + high_min = READ_ONCE(pcp->high_min); + high_max = READ_ONCE(pcp->high_max); + high = pcp->high = clamp(pcp->high, high_min, high_max); /* Check for PCP disabled or boot pageset */ - if (unlikely(high < batch)) + if (unlikely(high < base_batch)) return 1; + if (order) + batch = 
base_batch; + else + batch = (base_batch << pcp->alloc_factor); + /* - * Double the number of pages allocated each time there is subsequent - * allocation of order-0 pages without any freeing. + * If we had a larger pcp->high, we could avoid allocating from + * the zone. */ + if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) + high = pcp->high = min(high + batch, high_max); + if (!order) { - max_nr_alloc = max(high - pcp->count - batch, batch); - batch <<= pcp->alloc_factor; + max_nr_alloc = max(high - pcp->count - base_batch, base_batch); + /* + * Double the number of pages allocated each time there is + * subsequent allocation of order-0 pages without any freeing. + */ if (batch <= max_nr_alloc && pcp->alloc_factor < CONFIG_PCP_BATCH_SCALE_MAX) pcp->alloc_factor++; @@ -2733,7 +2802,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order, do { if (list_empty(list)) { - int batch = nr_pcp_alloc(pcp, order); + int batch = nr_pcp_alloc(pcp, zone, order); int alloced; alloced = rmqueue_bulk(zone, order, diff --git a/mm/vmstat.c b/mm/vmstat.c index 00e81e99c6ee..2f716ad14168 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -814,9 +814,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets) for_each_populated_zone(zone) { struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats; -#ifdef CONFIG_NUMA struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset; -#endif for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) { int v; @@ -832,10 +830,12 @@ static int refresh_cpu_vm_stats(bool do_pagesets) #endif } } -#ifdef CONFIG_NUMA if (do_pagesets) { cond_resched(); + + changes += decay_pcp_high(zone, this_cpu_ptr(pcp)); +#ifdef CONFIG_NUMA /* * Deal with draining the remote pageset of this * processor @@ -862,8 +862,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets) drain_zone_pages(zone, this_cpu_ptr(pcp)); changes++; } - } #endif + } } for_each_online_pgdat(pgdat) {

From patchwork Mon Oct 16 05:30:01 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13422495
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V3 8/9] mm, pcp: decrease PCP high if free pages < high watermark
Date: Mon, 16 Oct 2023 13:30:01 +0800
Message-Id: <20231016053002.756205-9-ying.huang@intel.com>
In-Reply-To: <20231016053002.756205-1-ying.huang@intel.com>
References: <20231016053002.756205-1-ying.huang@intel.com>
One target of the PCP is to minimize pages in the PCP if the number of free pages in the system is too low. To reach that target, when page reclaim is active for the zone (ZONE_RECLAIM_ACTIVE), we stop increasing PCP high in the allocating path, and decrease PCP high and free some pages in the freeing path. But this may be too late, because the background page reclaim may introduce latency for some workloads.

So, in this patch, during page allocation we detect whether the number of free pages of the zone is below the high watermark. If so, we stop increasing PCP high in the allocating path, and decrease PCP high and free some pages in the freeing path. With this, we can reduce the possibility of premature background page reclaim caused by a too-large PCP. The high watermark check is done in the allocating path to reduce the overhead in the hotter freeing path.
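The split of work between the two paths can be modelled with a small user-space sketch (illustrative only: zone_sim, alloc_path_check() and free_path_next_high() are made-up stand-ins, and clearing the flag once the watermark is restored, which the patch does in the freeing path, is omitted here):

#include <stdbool.h>
#include <stdio.h>

struct zone_sim {
	long free_pages;
	long high_wmark;
	bool below_high;	/* stand-in for the ZONE_BELOW_HIGH zone flag */
};

/*
 * Allocation path: the relatively colder place where the watermark
 * is actually checked. Sets the flag once free pages dip below it.
 */
static void alloc_path_check(struct zone_sim *zone)
{
	if (zone->free_pages < zone->high_wmark)
		zone->below_high = true;
}

/*
 * Freeing path: no watermark check here; it only consults the flag
 * and backs the PCP high off toward its minimum while the flag is set.
 */
static int free_path_next_high(const struct zone_sim *zone,
			       int high, int high_min, int step)
{
	if (!zone->below_high)
		return high;
	return (high - step > high_min) ? high - step : high_min;
}

int main(void)
{
	struct zone_sim zone = { .free_pages = 900, .high_wmark = 1000 };
	int high = 4096;

	alloc_path_check(&zone);	/* allocation notices the shortage */
	high = free_path_next_high(&zone, high, 512, 1024);
	printf("below_high=%d new high=%d\n", zone.below_high, high);
	return 0;
}

Caching the comparison result in a flag is what keeps the per-free cost at a single bit test rather than a full watermark computation.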
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Mel Gorman Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- include/linux/mmzone.h | 1 + mm/page_alloc.c | 33 +++++++++++++++++++++++++++++++-- 2 files changed, 32 insertions(+), 2 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index ec3f7daedcc7..c88770381aaf 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -1018,6 +1018,7 @@ enum zone_flags { * Cleared when kswapd is woken. */ ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */ + ZONE_BELOW_HIGH, /* zone is below high watermark. */ }; static inline unsigned long zone_managed_pages(struct zone *zone) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 8382ad2cdfd4..253fc7d0498e 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2407,7 +2407,13 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, return min(batch << 2, pcp->high); } - if (pcp->count >= high && high_min != high_max) { + if (high_min == high_max) + return high; + + if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) { + pcp->high = max(high - (batch << pcp->free_factor), high_min); + high = max(pcp->count, high_min); + } else if (pcp->count >= high) { int need_high = (batch << pcp->free_factor) + batch; /* pcp->high should be large enough to hold batch freed pages */ if (pcp->high < need_high) @@ -2457,6 +2463,10 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, if (pcp->count >= high) { free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high), pcp, pindex); + if (test_bit(ZONE_BELOW_HIGH, &zone->flags) && + zone_watermark_ok(zone, 0, high_wmark_pages(zone), + ZONE_MOVABLE, 0)) + clear_bit(ZONE_BELOW_HIGH, &zone->flags); } } @@ -2763,7 +2773,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order) * If we had a larger pcp->high, we could avoid allocating from * the zone. */ - if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) + if (high_min != high_max && !test_bit(ZONE_BELOW_HIGH, &zone->flags)) high = pcp->high = min(high + batch, high_max); if (!order) { @@ -3225,6 +3235,25 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags, } } + /* + * Detect whether the number of free pages is below high + * watermark. If so, we will decrease pcp->high and free + * PCP pages in free path to reduce the possibility of + * premature page reclaiming. Detection is done here to + * avoid doing that in the hotter free path.
+ */ + if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) + goto check_alloc_wmark; + + mark = high_wmark_pages(zone); + if (zone_watermark_fast(zone, order, mark, + ac->highest_zoneidx, alloc_flags, + gfp_mask)) + goto try_this_zone; + else + set_bit(ZONE_BELOW_HIGH, &zone->flags); + +check_alloc_wmark: mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK); if (!zone_watermark_fast(zone, order, mark, ac->highest_zoneidx, alloc_flags,

From patchwork Mon Oct 16 05:30:02 2023
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 13422496
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Arjan Van De Ven, Huang Ying, Mel Gorman, Vlastimil Babka, David Hildenbrand, Johannes Weiner, Dave Hansen, Michal Hocko, Pavel Tatashin, Matthew Wilcox, Christoph Lameter
Subject: [PATCH -V3 9/9] mm, pcp: reduce detecting time of consecutive high order page freeing
Date: Mon, 16 Oct 2023 13:30:02 +0800
Message-Id: <20231016053002.756205-10-ying.huang@intel.com>
In-Reply-To: <20231016053002.756205-1-ying.huang@intel.com>
References: <20231016053002.756205-1-ying.huang@intel.com>
In the current PCP auto-tuning design, if the number of pages allocated on a CPU is much larger than the number of pages freed, PCP high may reach the maximal value even if the allocating/freeing depth is small, for example, in the sender of network workloads. If a CPU that was originally used as a sender is used as a receiver after context switching, we need to fill the whole PCP with the maximal high before triggering PCP draining for consecutive high-order freeing. This will hurt the performance of some network workloads.

To solve the issue, in this patch we track consecutive page freeing with a counter instead of relying on PCP draining, so we can detect consecutive page freeing much earlier.

On a 2-socket Intel server with 128 logical CPUs, we tested the SCTP_STREAM_MANY test case of the netperf test suite with 64 pairs of processes. With the patch, the network bandwidth improves by 5.0%. This restores the performance drop caused by PCP auto-tuning.
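The counter behaviour can be modelled in a few lines of user-space C (illustrative only; pcp_sim and PCP_BATCH_SCALE_MAX are stand-ins, the latter for CONFIG_PCP_BATCH_SCALE_MAX):

#include <stdbool.h>
#include <stdio.h>

#define PCP_BATCH_SCALE_MAX 5	/* stand-in for CONFIG_PCP_BATCH_SCALE_MAX */

struct pcp_sim {
	int batch;	/* base batch size */
	int free_count;	/* consecutive free count */
};

/* Freeing (1 << order) pages grows the counter, up to a fixed cap. */
static void sim_count_free(struct pcp_sim *pcp, unsigned int order)
{
	if (pcp->free_count < (pcp->batch << PCP_BATCH_SCALE_MAX))
		pcp->free_count += 1 << order;
}

/* An allocation halves the counter rather than resetting it. */
static void sim_count_alloc(struct pcp_sim *pcp)
{
	pcp->free_count >>= 1;
}

/* Consecutive-freeing heuristics can fire once a full batch was freed. */
static bool sim_freeing_streak(const struct pcp_sim *pcp)
{
	return pcp->free_count >= pcp->batch;
}

int main(void)
{
	struct pcp_sim pcp = { .batch = 64 };

	for (int i = 0; i < 16; i++)	/* 16 order-2 frees = 64 pages */
		sim_count_free(&pcp, 2);
	printf("free_count=%d streak=%d\n", pcp.free_count,
	       sim_freeing_streak(&pcp));

	sim_count_alloc(&pcp);		/* one allocation weakens the streak */
	printf("free_count=%d streak=%d\n", pcp.free_count,
	       sim_freeing_streak(&pcp));
	return 0;
}

Because the counter accumulates page by page, a freeing streak is detected after roughly one batch worth of frees, instead of only after the PCP has been filled to its (possibly maximal) high.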
Signed-off-by: "Huang, Ying" Cc: Andrew Morton Cc: Mel Gorman Cc: Vlastimil Babka Cc: David Hildenbrand Cc: Johannes Weiner Cc: Dave Hansen Cc: Michal Hocko Cc: Pavel Tatashin Cc: Matthew Wilcox Cc: Christoph Lameter --- include/linux/mmzone.h | 2 +- mm/page_alloc.c | 27 +++++++++++++++------------ 2 files changed, 16 insertions(+), 13 deletions(-) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index c88770381aaf..57086c57b8e4 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -700,10 +700,10 @@ struct per_cpu_pages { int batch; /* chunk size for buddy add/remove */ u8 flags; /* protected by pcp->lock */ u8 alloc_factor; /* batch scaling factor during allocate */ - u8 free_factor; /* batch scaling factor during free */ #ifdef CONFIG_NUMA u8 expire; /* When 0, remote pagesets are drained */ #endif + short free_count; /* consecutive free count */ /* Lists of pages, one per migrate type stored on the pcp-lists */ struct list_head lists[NR_PCP_LISTS]; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 253fc7d0498e..28088dd7a968 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2369,13 +2369,10 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free max_nr_free = high - batch; /* - * Double the number of pages freed each time there is subsequent - * freeing of pages without any allocation. + * Increase the batch number to the number of the consecutive + * freed pages to reduce zone lock contention. */ - batch <<= pcp->free_factor; - if (batch <= max_nr_free && pcp->free_factor < CONFIG_PCP_BATCH_SCALE_MAX) - pcp->free_factor++; - batch = clamp(batch, min_nr_free, max_nr_free); + batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free); return batch; } @@ -2403,7 +2400,9 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, * stored on pcp lists */ if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) { - pcp->high = max(high - (batch << pcp->free_factor), high_min); + int free_count = max_t(int, pcp->free_count, batch); + + pcp->high = max(high - free_count, high_min); return min(batch << 2, pcp->high); } @@ -2411,10 +2410,12 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone, return high; if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) { - pcp->high = max(high - (batch << pcp->free_factor), high_min); + int free_count = max_t(int, pcp->free_count, batch); + + pcp->high = max(high - free_count, high_min); high = max(pcp->count, high_min); } else if (pcp->count >= high) { - int need_high = (batch << pcp->free_factor) + batch; + int need_high = pcp->free_count + batch; /* pcp->high should be large enough to hold batch freed pages */ if (pcp->high < need_high) @@ -2451,7 +2452,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, * stops will be drained from vmstat refresh context. */ if (order && order <= PAGE_ALLOC_COSTLY_ORDER) { - free_high = (pcp->free_factor && + free_high = (pcp->free_count >= batch && (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) && (!(pcp->flags & PCPF_FREE_HIGH_BATCH) || pcp->count >= READ_ONCE(batch))); @@ -2459,6 +2460,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp, } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) { pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER; } + if (pcp->free_count < (batch << CONFIG_PCP_BATCH_SCALE_MAX)) + pcp->free_count += (1 << order); high = nr_pcp_high(pcp, zone, batch, free_high); if (pcp->count >= high) { free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high), @@ -2855,7 +2858,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone, * See nr_pcp_free() where free_factor is increased for subsequent * frees. */ - pcp->free_factor >>= 1; + pcp->free_count >>= 1; list = &pcp->lists[order_to_pindex(migratetype, order)]; page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list); pcp_spin_unlock(pcp); @@ -5488,7 +5491,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta pcp->high_min = BOOT_PAGESET_HIGH; pcp->high_max = BOOT_PAGESET_HIGH; pcp->batch = BOOT_PAGESET_BATCH; - pcp->free_factor = 0; + pcp->free_count = 0; } static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,