From patchwork Thu Apr 18 14:20:08 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Peng Zhang
X-Patchwork-Id: 13634847
From: Peng Zhang
Subject: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
Date: Thu, 18 Apr 2024 22:20:08 +0800
Message-ID: <20240418142008.2775308-3-zhangpeng362@huawei.com>
In-Reply-To: <20240418142008.2775308-1-zhangpeng362@huawei.com>
References: <20240418142008.2775308-1-zhangpeng362@huawei.com>
Sender: owner-linux-mm@kvack.org

From: ZhangPeng

Since commit f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter"), the rss_stats have been converted into percpu_counter,
which changes the error margin from (nr_threads * 64) to approximately
(nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
performance regression on fork/exec/shell.
Even after commit 14ef95be6f55 ("kernel/fork: group allocation/free of
per-cpu counters for mm struct"), the performance of fork/exec/shell is
still poor compared to previous kernel versions.

To mitigate the performance regression, we delay the allocation of percpu
memory for rss_stats. Therefore, we convert mm's rss stats to use
percpu_counter atomic mode. For single-thread processes, rss_stat is in
atomic mode, which reduces the memory consumption and performance
regression caused by using percpu. For multiple-thread processes,
rss_stat is switched to the percpu mode to reduce the error margin.
We convert rss_stats from atomic mode to percpu mode only when the
second thread is created.

With lmbench, we get a 2% ~ 4% performance improvement for
fork_proc/exec_proc/shell_proc and a 6.7% performance improvement for
page_fault (before batch mode[1]).

The test results are as follows:

             base           base+revert        base+this patch

fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms  (4.2%)
exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)

[1] https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/

Suggested-by: Jan Kara
Signed-off-by: ZhangPeng
Signed-off-by: Kefeng Wang
---
 include/linux/mm.h          | 50 +++++++++++++++++++++++++++++++------
 include/trace/events/kmem.h |  4 +--
 kernel/fork.c               | 18 +++++++------
 3 files changed, 56 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d261e45bb29b..8f1bfbd54697 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2631,30 +2631,66 @@ static inline bool get_user_page_fast_only(unsigned long addr,
  */
 static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 {
-	return percpu_counter_read_positive(&mm->rss_stat[member]);
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		return percpu_counter_read_positive(fbc);
+
+	return percpu_counter_atomic_read(fbc);
 }
 
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
 {
-	percpu_counter_add(&mm->rss_stat[member], value);
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		percpu_counter_add(fbc, value);
+	else
+		percpu_counter_atomic_add(fbc, value);
 
 	mm_trace_rss_stat(mm, member);
 }
 
 static inline void inc_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_inc(&mm->rss_stat[member]);
-
-	mm_trace_rss_stat(mm, member);
+	add_mm_counter(mm, member, 1);
 }
 
 static inline void dec_mm_counter(struct mm_struct *mm, int member)
 {
-	percpu_counter_dec(&mm->rss_stat[member]);
+	add_mm_counter(mm, member, -1);
+}
 
-	mm_trace_rss_stat(mm, member);
+static inline s64 mm_counter_sum(struct mm_struct *mm, int member)
+{
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		return percpu_counter_sum(fbc);
+
+	return percpu_counter_atomic_read(fbc);
+}
+
+static inline s64 mm_counter_sum_positive(struct mm_struct *mm, int member)
+{
+	struct percpu_counter *fbc = &mm->rss_stat[member];
+
+	if (percpu_counter_initialized(fbc))
+		return percpu_counter_sum_positive(fbc);
+
+	return percpu_counter_atomic_read(fbc);
+}
+
+static inline int mm_counter_switch_to_pcpu_many(struct mm_struct *mm)
+{
+	return percpu_counter_switch_to_pcpu_many(mm->rss_stat, NR_MM_COUNTERS);
+}
+
+static inline void mm_counter_destroy_many(struct mm_struct *mm)
+{
+	percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
 }
 
 /* Optimized variant when folio is already known not to be anon */
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 6e62cc64cd92..a4e40ae6a8c8 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -399,8 +399,8 @@ TRACE_EVENT(rss_stat,
 		__entry->mm_id = mm_ptr_to_hash(mm);
 		__entry->curr = !!(current->mm == mm);
 		__entry->member = member;
-		__entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
-							    << PAGE_SHIFT);
+		__entry->size = (mm_counter_sum_positive(mm, member)
+							<< PAGE_SHIFT);
 	),
 
 	TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
diff --git a/kernel/fork.c b/kernel/fork.c
index 99076dbe27d8..0214273798c5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -823,7 +823,7 @@ static void check_mm(struct mm_struct *mm)
 			 "Please make sure 'struct resident_page_types[]' is updated as well");
 
 	for (i = 0; i < NR_MM_COUNTERS; i++) {
-		long x = percpu_counter_sum(&mm->rss_stat[i]);
+		long x = mm_counter_sum(mm, i);
 
 		if (unlikely(x))
 			pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
@@ -1301,16 +1301,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	if (mm_alloc_cid(mm))
 		goto fail_cid;
 
-	if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
-				     NR_MM_COUNTERS))
-		goto fail_pcpu;
-
 	mm->user_ns = get_user_ns(user_ns);
 	lru_gen_init_mm(mm);
 	return mm;
 
-fail_pcpu:
-	mm_destroy_cid(mm);
 fail_cid:
 	destroy_context(mm);
 fail_nocontext:
@@ -1730,6 +1724,16 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	if (!oldmm)
 		return 0;
 
+	/*
+	 * For single-thread processes, rss_stat is in atomic mode, which
+	 * reduces the memory consumption and performance regression caused by
+	 * using percpu. For multiple-thread processes, rss_stat is switched to
+	 * the percpu mode to reduce the error margin.
+	 */
+	if (clone_flags & CLONE_THREAD)
+		if (mm_counter_switch_to_pcpu_many(oldmm))
+			return -ENOMEM;
+
 	if (clone_flags & CLONE_VM) {
 		mmget(oldmm);
 		mm = oldmm;
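
Illustration only, not part of the patch: the dual-mode idea described in
the changelog can be modelled in user space roughly as below. This is a
minimal sketch under stated assumptions. The names struct dual_counter,
dc_*() and NR_SLOTS are hypothetical and are not the kernel's
percpu_counter API; a fixed array of slots stands in for per-CPU storage,
and races between the mode switch and concurrent updates are ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NR_SLOTS 4	/* hypothetical stand-in for the number of CPUs */

/* Hypothetical model of a counter that starts in atomic mode. */
struct dual_counter {
	atomic_long count;	/* exact value while in atomic mode */
	long *slots;		/* NULL until switched to "percpu" mode */
};

static void dc_init(struct dual_counter *c)
{
	atomic_init(&c->count, 0);
	c->slots = NULL;	/* no allocation at init time */
}

static int dc_initialized(const struct dual_counter *c)
{
	return c->slots != NULL;
}

/* Update: one shared atomic in atomic mode, a private slot otherwise. */
static void dc_add(struct dual_counter *c, int slot, long v)
{
	assert(slot >= 0 && slot < NR_SLOTS);
	if (dc_initialized(c))
		c->slots[slot] += v;
	else
		atomic_fetch_add(&c->count, v);
}

/* Deferred allocation: done once, when a second thread would appear. */
static int dc_switch_to_pcpu(struct dual_counter *c)
{
	if (dc_initialized(c))
		return 0;
	c->slots = calloc(NR_SLOTS, sizeof(*c->slots));
	return c->slots ? 0 : -1;
}

/* Cheap read: exact in atomic mode, may lag behind in percpu mode. */
static long dc_read(struct dual_counter *c)
{
	return atomic_load(&c->count);
}

/* Precise read: folds every slot into the shared count. */
static long dc_sum(struct dual_counter *c)
{
	long sum = atomic_load(&c->count);

	if (dc_initialized(c))
		for (int i = 0; i < NR_SLOTS; i++)
			sum += c->slots[i];
	return sum;
}

static void dc_destroy(struct dual_counter *c)
{
	free(c->slots);
	c->slots = NULL;
}
```

In this model, dc_switch_to_pcpu() plays the role that the CLONE_THREAD
branch in copy_mm() plays in the patch: a process that never creates a
second thread never pays for the allocation, and its cheap read stays
exact. After the switch, dc_read() can differ from dc_sum(), which is the
error margin the changelog refers to.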