From patchwork Wed Jan 24 10:00:22 2024
X-Patchwork-Submitter: Yosry Ahmed
X-Patchwork-Id: 13528875
Date: Wed, 24 Jan 2024 10:00:22 +0000
Message-ID: <20240124100023.660032-1-yosryahmed@google.com>
Subject: [PATCH] mm: memcg: optimize parent iteration in memcg_rstat_updated()
From: Yosry Ahmed
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
    Muchun Song, cgroups@vger.kernel.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Yosry Ahmed, kernel test robot

In memcg_rstat_updated(), we iterate the memcg being updated and its
parents to update memcg->vmstats_percpu->stats_updates in the fast path
(i.e. no atomic updates). According to my math, this is 3 memory loads
(and potentially 3 cache misses) per memcg:
- Load the address of memcg->vmstats_percpu.
- Load vmstats_percpu->stats_updates (based on some percpu calculation).
- Load the address of the parent memcg.

Avoid most of the cache misses by caching a pointer from each struct
memcg_vmstats_percpu to its parent on the corresponding CPU. In this
case, for the first memcg we have 2 memory loads (same as above):
- Load the address of memcg->vmstats_percpu.
- Load vmstats_percpu->stats_updates (based on some percpu calculation).

Then for each additional memcg, we need a single load to get the
parent's stats_updates directly. This reduces the number of loads from
O(3N) to O(2+N) -- where N is the number of memcgs we need to iterate.

Additionally, stash a pointer to memcg->vmstats in each struct
memcg_vmstats_percpu such that we can access the atomic counter that
all CPUs fold into, memcg->vmstats->stats_updates.
memcg_should_flush_stats() is changed to memcg_vmstats_needs_flush() to
accept a struct memcg_vmstats pointer accordingly.

In struct memcg_vmstats_percpu, make sure both pointers together with
stats_updates live on the same cacheline.

Finally, update mem_cgroup_alloc() to take in a parent pointer and
initialize the new cache pointers on each CPU. The percpu loop in
mem_cgroup_alloc() may look concerning, but there are multiple similar
loops in the cgroup creation path (e.g. cgroup_rstat_init()), most of
which are hidden within alloc_percpu().

According to Oliver's testing [1], this fixes multiple 30-38%
regressions in vm-scalability, will-it-scale-tlb_flush2, and
will-it-scale-fallocate1. This comes at a cost of 2 more pointers per
CPU (<2KB on a machine with 128 CPUs).

[1] https://lore.kernel.org/lkml/ZbDJsfsZt2ITyo61@xsang-OptiPlex-9020/

Fixes: 8d59d2214c23 ("mm: memcg: make stats flushing threshold per-memcg")
Tested-by: kernel test robot
Reported-by: kernel test robot
Closes: https://lore.kernel.org/oe-lkp/202401221624.cb53a8ca-oliver.sang@intel.com
Signed-off-by: Yosry Ahmed
Acked-by: Shakeel Butt
Acked-by: Johannes Weiner
---
The only noticeable change I made after Oliver's testing is adding the
cacheline padding in struct memcg_vmstats_percpu. In my config, the
added pointers happen to be on the same cacheline as stats_updates, and
I assume the same holds for the config in Oliver's testing, given the
results. However, this can change on different configs and as new stats
are added, so I added the cacheline padding to make sure they are
always on the same cacheline.
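To make the load-count argument above concrete, here is a simplified
userspace sketch of the before/after iteration patterns. The types and
names below are illustrative stand-ins, not the kernel structures; the
percpu indirection is modeled as a plain pointer:

/*
 * Before: each level of the hierarchy costs ~3 dependent loads
 * (percpu base, counter, parent memcg). After: 2 loads for the first
 * level, then 1 load of a cached parent pointer per additional level.
 */
struct vmstats_percpu {
        unsigned int stats_updates;
        struct vmstats_percpu *parent;  /* cached parent pointer (new) */
};

struct memcg {
        struct memcg *parent;
        struct vmstats_percpu *percpu;  /* stands in for the percpu area */
};

static void update_before(struct memcg *memcg, unsigned int val)
{
        /* Walks memcg->parent, touching two cachelines per memcg. */
        for (; memcg; memcg = memcg->parent)
                memcg->percpu->stats_updates += val;
}

static void update_after(struct memcg *memcg, unsigned int val)
{
        struct vmstats_percpu *statc;

        /* Walks the cached per-CPU parent chain directly. */
        for (statc = memcg->percpu; statc; statc = statc->parent)
                statc->stats_updates += val;
}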
---
 mm/memcontrol.c | 49 ++++++++++++++++++++++++++++++++-----------------
 1 file changed, 32 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e4c8735e7c85c..868da91cceb48 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -633,8 +633,15 @@ struct memcg_vmstats_percpu {
 	unsigned long		nr_page_events;
 	unsigned long		targets[MEM_CGROUP_NTARGETS];
 
+	/* Fit members below in a single cacheline for memcg_rstat_updated() */
+	CACHELINE_PADDING(_pad1_);
+
 	/* Stats updates since the last flush */
 	unsigned int		stats_updates;
+
+	/* Cached pointers for fast iteration in memcg_rstat_updated() */
+	struct memcg_vmstats_percpu	*parent;
+	struct memcg_vmstats		*vmstats;
 };
 
 struct memcg_vmstats {
@@ -698,36 +705,35 @@ static void memcg_stats_unlock(void)
 }
 
-static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
+static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
 {
-	return atomic64_read(&memcg->vmstats->stats_updates) >
+	return atomic64_read(&vmstats->stats_updates) >
 		MEMCG_CHARGE_BATCH * num_online_cpus();
 }
 
 static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 {
+	struct memcg_vmstats_percpu *statc;
 	int cpu = smp_processor_id();
-	unsigned int x;
 
 	if (!val)
 		return;
 
 	cgroup_rstat_updated(memcg->css.cgroup, cpu);
-
-	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
-		x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
-					  abs(val));
-
-		if (x < MEMCG_CHARGE_BATCH)
+	statc = this_cpu_ptr(memcg->vmstats_percpu);
+	for (; statc; statc = statc->parent) {
+		statc->stats_updates += abs(val);
+		if (statc->stats_updates < MEMCG_CHARGE_BATCH)
 			continue;
 
 		/*
 		 * If @memcg is already flush-able, increasing stats_updates is
 		 * redundant. Avoid the overhead of the atomic update.
 		 */
-		if (!memcg_should_flush_stats(memcg))
-			atomic64_add(x, &memcg->vmstats->stats_updates);
-		__this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
+		if (!memcg_vmstats_needs_flush(statc->vmstats))
+			atomic64_add(statc->stats_updates,
+				     &statc->vmstats->stats_updates);
+		statc->stats_updates = 0;
 	}
 }
 
@@ -756,7 +762,7 @@ void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
 	if (!memcg)
 		memcg = root_mem_cgroup;
 
-	if (memcg_should_flush_stats(memcg))
+	if (memcg_vmstats_needs_flush(memcg->vmstats))
 		do_flush_stats(memcg);
 }
 
@@ -770,7 +776,7 @@ void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
 static void flush_memcg_stats_dwork(struct work_struct *w)
 {
 	/*
-	 * Deliberately ignore memcg_should_flush_stats() here so that flushing
+	 * Deliberately ignore memcg_vmstats_needs_flush() here so that flushing
 	 * in latency-sensitive paths is as cheap as possible.
 	 */
 	do_flush_stats(root_mem_cgroup);
@@ -5456,10 +5462,11 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 	__mem_cgroup_free(memcg);
 }
 
-static struct mem_cgroup *mem_cgroup_alloc(void)
+static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 {
+	struct memcg_vmstats_percpu *statc, *pstatc;
 	struct mem_cgroup *memcg;
-	int node;
+	int node, cpu;
 	int __maybe_unused i;
 	long error = -ENOMEM;
 
@@ -5483,6 +5490,14 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 	if (!memcg->vmstats_percpu)
 		goto fail;
 
+	for_each_possible_cpu(cpu) {
+		if (parent)
+			pstatc = per_cpu_ptr(parent->vmstats_percpu, cpu);
+		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+		statc->parent = parent ? pstatc : NULL;
+		statc->vmstats = memcg->vmstats;
+	}
+
 	for_each_node(node)
 		if (alloc_mem_cgroup_per_node_info(memcg, node))
 			goto fail;
@@ -5528,7 +5543,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 	struct mem_cgroup *memcg, *old_memcg;
 
 	old_memcg = set_active_memcg(parent);
-	memcg = mem_cgroup_alloc();
+	memcg = mem_cgroup_alloc(parent);
 	set_active_memcg(old_memcg);
 	if (IS_ERR(memcg))
 		return ERR_CAST(memcg);
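
As a postscript for readers following the flushing logic: the two-level
counter that memcg_vmstats_needs_flush() guards can be modeled in plain
C11 as below. BATCH, the struct layouts, and the function names are
stand-ins for MEMCG_CHARGE_BATCH and the kernel types; C11 atomics
stand in for atomic64_t, and the real code runs in non-preemptible
context, which this sketch ignores:

#include <stdatomic.h>
#include <stdbool.h>

#define BATCH 64			/* stand-in for MEMCG_CHARGE_BATCH */

struct vmstats {
        _Atomic long stats_updates;	/* folded from all CPUs */
};

struct vmstats_percpu {
        unsigned int stats_updates;	/* per-CPU pending updates */
        struct vmstats *vmstats;	/* cached pointer (new) */
};

static bool needs_flush(struct vmstats *vmstats, int nr_cpus)
{
        return atomic_load(&vmstats->stats_updates) > (long)BATCH * nr_cpus;
}

static void record_update(struct vmstats_percpu *statc, unsigned int val,
                          int nr_cpus)
{
        statc->stats_updates += val;
        if (statc->stats_updates < BATCH)
                return;
        /* Once a flush is already warranted, skip the redundant atomic. */
        if (!needs_flush(statc->vmstats, nr_cpus))
                atomic_fetch_add(&statc->vmstats->stats_updates,
                                 statc->stats_updates);
        statc->stats_updates = 0;
}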