From patchwork Fri Apr 30 05:56:11 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Andrew Morton X-Patchwork-Id: 12232399 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 82E31C43460 for ; Fri, 30 Apr 2021 05:56:14 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 3499261459 for ; Fri, 30 Apr 2021 05:56:14 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3499261459 Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=linux-foundation.org Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id C8CBC6B00BD; Fri, 30 Apr 2021 01:56:13 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C63676B00BE; Fri, 30 Apr 2021 01:56:13 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B04876B00BF; Fri, 30 Apr 2021 01:56:13 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0140.hostedemail.com [216.40.44.140]) by kanga.kvack.org (Postfix) with ESMTP id 90FE46B00BD for ; Fri, 30 Apr 2021 01:56:13 -0400 (EDT) Received: from smtpin20.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 54FDA2496 for ; Fri, 30 Apr 2021 05:56:13 +0000 (UTC) X-FDA: 78087973026.20.FE88F26 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by imf15.hostedemail.com (Postfix) with ESMTP id 5745BA0003AF for ; Fri, 30 Apr 2021 05:56:08 +0000 (UTC) Received: by mail.kernel.org (Postfix) with ESMTPSA id DBF5961463; Fri, 30 Apr 2021 05:56:11 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linux-foundation.org; s=korg; t=1619762172; bh=lmRFmURVQja24WFu8SJM8MS7nVJCeiAbbNI2Y4mPcxM=; h=Date:From:To:Subject:In-Reply-To:From; b=TXo+zVKpmWL1fInnd2GiWahL+bvdMyvcY2dmq8YoNBg50C3ySDDGNteok8BPfaxDg CjTWTcd3/lsuggjcOnEB3avGlN0cxmI2oUMo4RrRNDKBbfwnFFuE3AKaIjHt6/l1+m sCA0pqhgyA8P5JsGYC2iSLI1HeGe5b/sSTwHoTtQ= Date: Thu, 29 Apr 2021 22:56:11 -0700 From: Andrew Morton To: akpm@linux-foundation.org, guro@fb.com, hannes@cmpxchg.org, linux-mm@kvack.org, mhocko@suse.com, mkoutny@suse.com, mm-commits@vger.kernel.org, shakeelb@google.com, tj@kernel.org, torvalds@linux-foundation.org Subject: [patch 059/178] mm: memcontrol: fix cpuhotplug statistics flushing Message-ID: <20210430055611.0kjSHEmCh%akpm@linux-foundation.org> In-Reply-To: <20210429225251.02b6386d21b69255b4f6c163@linux-foundation.org> User-Agent: s-nail v14.8.16 MIME-Version: 1.0 Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=linux-foundation.org header.s=korg header.b=TXo+zVKp; dmarc=none; spf=pass (imf15.hostedemail.com: domain of akpm@linux-foundation.org designates 198.145.29.99 as permitted sender) smtp.mailfrom=akpm@linux-foundation.org X-Rspamd-Server: rspam03 X-Stat-Signature: 1itkxsqfygymo3ikyr38d84xszmofdeq X-Rspamd-Queue-Id: 5745BA0003AF Received-SPF: none (linux-foundation.org>: No applicable sender policy available) receiver=imf15; identity=mailfrom; envelope-from=""; helo=mail.kernel.org; client-ip=198.145.29.99 X-HE-DKIM-Result: pass/pass X-HE-Tag: 1619762168-773134 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Johannes Weiner Subject: mm: memcontrol: fix cpuhotplug statistics flushing Patch series "mm: memcontrol: switch to rstat", v3. This series converts memcg stats tracking to the streamlined rstat infrastructure provided by the cgroup core code. rstat is already used by the CPU controller and the IO controller. This change is motivated by recent accuracy problems in memcg's custom stats code, as well as the benefits of sharing common infra with other controllers. The current memcg implementation does batched tree aggregation on the write side: local stat changes are cached in per-cpu counters, which are then propagated upward in batches when a threshold (32 pages) is exceeded. This is cheap, but the error introduced by the lazy upward propagation adds up: 32 pages times CPUs times cgroups in the subtree. We've had complaints from service owners that the stats do not reliably track and react to allocation behavior as expected, sometimes swallowing the results of entire test applications. The original memcg stat implementation used to do tree aggregation exclusively on the read side: local stats would only ever be tracked in per-cpu counters, and a memory.stat read would iterate the entire subtree and sum those counters up. This didn't keep up with the times: - Cgroup trees are much bigger now. We switched to lazily-freed cgroups, where deleted groups would hang around until their remaining page cache has been reclaimed. This can result in large subtrees that are expensive to walk, while most of the groups are idle and their statistics don't change much anymore. - Automated monitoring increased. With the proliferation of userspace oom killing, proactive reclaim, and higher-resolution logging of workload trends in general, top-level stat files are polled at least once a second in many deployments. - The lifetime of cgroups got shorter. Where most cgroup setups in the past would have a few large policy-oriented cgroups for everything running on the system, newer cgroup deployments tend to create one group per application - which gets deleted again as the processes exit. An aggregation scheme that doesn't retain child data inside the parents loses event history of the subtree. Rstat addresses all three of those concerns through intelligent, persistent read-side aggregation. As statistics change at the local level, rstat tracks - on a per-cpu basis - only those parts of a subtree that have changes pending and require aggregation. The actual aggregation occurs on the colder read side - which can now skip over (potentially large) numbers of recently idle cgroups. === The test_kmem cgroup selftest is currently failing due to excessive cumulative vmstat drift from 100 subgroups: ok 1 test_kmem_basic memory.current = 8810496 slab + anon + file + kernel_stack = 17074568 slab = 6101384 anon = 946176 file = 0 kernel_stack = 10027008 not ok 2 test_kmem_memcg_deletion ok 3 test_kmem_proc_kpagecgroup ok 4 test_kmem_kernel_stacks ok 5 test_kmem_dead_cgroups ok 6 test_percpu_basic As you can see, memory.stat items far exceed memory.current. The kernel stack alone is bigger than all of charged memory. That's because the memory of the test has been uncharged from memory.current, but the negative vmstat deltas are still sitting in the percpu caches. The test at this time isn't even counting percpu, pagetables etc. yet, which would further contribute to the error. The last patch in the series updates the test to include them - as well as reduces the vmstat tolerances in general to only expect page_counter batching. With all patches applied, the (now more stringent) test succeeds: ok 1 test_kmem_basic ok 2 test_kmem_memcg_deletion ok 3 test_kmem_proc_kpagecgroup ok 4 test_kmem_kernel_stacks ok 5 test_kmem_dead_cgroups ok 6 test_percpu_basic === A kernel build test confirms that overhead is comparable. Two kernels are built simultaneously in a nested tree with several idle siblings: root - kernelbuild - one - two - three - four - build-a (defconfig, make -j16) `- build-b (defconfig, make -j16) `- idle-1 `- ... `- idle-9 During the builds, kernelbuild/memory.stat is read once a second. A perf diff shows that the changes in cycle distribution is minimal. Top 10 kernel symbols: 0.09% +0.08% [kernel.kallsyms] [k] __mod_memcg_lruvec_state 0.00% +0.06% [kernel.kallsyms] [k] cgroup_rstat_updated 0.08% -0.05% [kernel.kallsyms] [k] __mod_memcg_state.part.0 0.16% -0.04% [kernel.kallsyms] [k] release_pages 0.00% +0.03% [kernel.kallsyms] [k] __count_memcg_events 0.01% +0.03% [kernel.kallsyms] [k] mem_cgroup_charge_statistics.constprop.0 0.10% -0.02% [kernel.kallsyms] [k] get_mem_cgroup_from_mm 0.05% -0.02% [kernel.kallsyms] [k] mem_cgroup_update_lru_size 0.57% +0.01% [kernel.kallsyms] [k] asm_exc_page_fault === The on-demand aggregated stats are now fully accurate: $ grep -e nr_inactive_file /proc/vmstat | awk '{print($1,$2*4096)}'; \ grep -e inactive_file /sys/fs/cgroup/memory.stat vanilla: patched: nr_inactive_file 1574105088 nr_inactive_file 1027801088 inactive_file 1577410560 inactive_file 1027801088 === This patch (of 8): The memcg hotunplug callback erroneously flushes counts on the local CPU, not the counts of the CPU going away; those counts will be lost. Flush the CPU that is actually going away. Also simplify the code a bit by using mod_memcg_state() and count_memcg_events() instead of open-coding the upward flush - this is comparable to how vmstat.c handles hotunplug flushing. Link: https://lkml.kernel.org/r/20210209163304.77088-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20210209163304.77088-2-hannes@cmpxchg.org Fixes: a983b5ebee572 ("mm: memcontrol: fix excessive complexity in memory.stat reporting") Signed-off-by: Johannes Weiner Reviewed-by: Shakeel Butt Reviewed-by: Roman Gushchin Reviewed-by: Michal Koutný Acked-by: Michal Hocko Cc: Tejun Heo Cc: Roman Gushchin Signed-off-by: Andrew Morton --- mm/memcontrol.c | 35 +++++++++++++++++++++-------------- 1 file changed, 21 insertions(+), 14 deletions(-) --- a/mm/memcontrol.c~mm-memcontrol-fix-cpuhotplug-statistics-flushing +++ a/mm/memcontrol.c @@ -2370,45 +2370,52 @@ static void drain_all_stock(struct mem_c static int memcg_hotplug_cpu_dead(unsigned int cpu) { struct memcg_stock_pcp *stock; - struct mem_cgroup *memcg, *mi; + struct mem_cgroup *memcg; stock = &per_cpu(memcg_stock, cpu); drain_stock(stock); for_each_mem_cgroup(memcg) { + struct memcg_vmstats_percpu *statc; int i; + statc = per_cpu_ptr(memcg->vmstats_percpu, cpu); + for (i = 0; i < MEMCG_NR_STAT; i++) { int nid; - long x; - x = this_cpu_xchg(memcg->vmstats_percpu->stat[i], 0); - if (x) - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) - atomic_long_add(x, &memcg->vmstats[i]); + if (statc->stat[i]) { + mod_memcg_state(memcg, i, statc->stat[i]); + statc->stat[i] = 0; + } if (i >= NR_VM_NODE_STAT_ITEMS) continue; for_each_node(nid) { + struct batched_lruvec_stat *lstatc; struct mem_cgroup_per_node *pn; + long x; pn = mem_cgroup_nodeinfo(memcg, nid); - x = this_cpu_xchg(pn->lruvec_stat_cpu->count[i], 0); - if (x) + lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu); + + x = lstatc->count[i]; + lstatc->count[i] = 0; + + if (x) { do { atomic_long_add(x, &pn->lruvec_stat[i]); } while ((pn = parent_nodeinfo(pn, nid))); + } } } for (i = 0; i < NR_VM_EVENT_ITEMS; i++) { - long x; - - x = this_cpu_xchg(memcg->vmstats_percpu->events[i], 0); - if (x) - for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) - atomic_long_add(x, &memcg->vmevents[i]); + if (statc->events[i]) { + count_memcg_events(memcg, i, statc->events[i]); + statc->events[i] = 0; + } } }