[064/178] mm: memcontrol: switch to rstat

Message ID	20210430055626.WZK5r2afb%akpm@linux-foundation.org (mailing list archive)
State	New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org AD7F661477
Date: Thu, 29 Apr 2021 22:56:26 -0700
From: Andrew Morton <akpm@linux-foundation.org>
To: akpm@linux-foundation.org, bsingharora@gmail.com, guro@fb.com,
 hannes@cmpxchg.org, linux-mm@kvack.org, mhocko@suse.com,
 mkoutny@suse.com, mm-commits@vger.kernel.org, shakeelb@google.com,
 tj@kernel.org, torvalds@linux-foundation.org
Subject: [patch 064/178] mm: memcontrol: switch to rstat
Message-ID: <20210430055626.WZK5r2afb%akpm@linux-foundation.org>
In-Reply-To: <20210429225251.02b6386d21b69255b4f6c163@linux-foundation.org>
User-Agent: s-nail v14.8.16
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Received-SPF: none (linux-foundation.org>: No applicable sender policy
 available) receiver=imf06; identity=mailfrom;
 envelope-from="<akpm@linux-foundation.org>"; helo=mail.kernel.org;
 client-ip=198.145.29.99
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Commit Message

Andrew Morton April 30, 2021, 5:56 a.m. UTC

From: Johannes Weiner <hannes@cmpxchg.org>
Subject: mm: memcontrol: switch to rstat

Replace the memory controller's custom hierarchical stats code with the
generic rstat infrastructure provided by the cgroup core.

The current implementation does batched upward propagation from the write
side (i.e. as stats change).  The per-cpu batches introduce an error,
which is multiplied by the number of subgroups in a tree.  In systems with
many CPUs and sizable cgroup trees, the error can be large enough to
confuse users (e.g. with 4K pages, 32 batch pages * 32 CPUs * 32 subgroups
results in an error of up to 128M per stat item).  This can entirely
swallow allocation bursts inside a workload that the user is expecting to
see reflected in the statistics.
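
As a quick check of the worst-case figure quoted above, the arithmetic can
be sketched in a few lines of userspace C.  This is only an illustration:
the helper name worst_case_error_bytes() and the SKETCH_PAGE_SIZE constant
are hypothetical, and a 4K page size is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages; the real page size depends on the architecture. */
#define SKETCH_PAGE_SIZE 4096ULL

/*
 * Hypothetical helper: upper bound on the bytes that can sit unreported in
 * per-cpu batches, if every CPU of every subgroup holds a full batch.
 */
static uint64_t worst_case_error_bytes(uint64_t batch_pages, uint64_t cpus,
				       uint64_t subgroups)
{
	uint64_t pages = batch_pages * cpus * subgroups;

	/* Keep the illustrative multiplication within 64 bits. */
	assert(pages <= UINT64_MAX / SKETCH_PAGE_SIZE);

	return pages * SKETCH_PAGE_SIZE;
}
```

Calling worst_case_error_bytes(32, 32, 32) yields 32768 pages, i.e. 128M
under the 4K assumption.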

In the past, we've done read-side aggregation, where a memory.stat read
would have to walk the entire subtree and add up per-cpu counts.  This
became problematic with lazily-freed cgroups: we could have large subtrees
where most cgroups were entirely idle.  Hence the switch to change-driven
upward propagation.  Unfortunately, it needed to trade accuracy for speed
due to the write side being so hot.

Rstat combines the best of both worlds: from the write side, it cheaply
maintains a queue of cgroups that have pending changes, so that the read
side can do selective tree aggregation.  This way the reported stats will
always be as precise and recent as can be, while the aggregation can skip
over potentially large numbers of idle cgroups.

The way rstat works is that it implements a tree for tracking cgroups with
pending local changes, as well as a flush function that walks the tree
upwards.  The controller then drives this by 1) telling rstat when a local
cgroup stat changes (e.g. mod_memcg_state) and 2) requesting a flush when
up-to-date hierarchy stats are needed for a given subtree (e.g. when
memory.stat is read).  The controller also provides a flush callback that
is called during the rstat flush walk for each cgroup; it aggregates the
cgroup's local per-cpu counters and propagates them upwards.
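
The delta-and-pending scheme described above can be sketched as a small
userspace model.  This is a simplification under stated assumptions, not
the kernel code: all model_* names are hypothetical, there is a single stat
item, no locking, and the caller passes every group in child-before-parent
order instead of rstat tracking which groups actually have pending updates.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 2	/* assumption: tiny fixed CPU count for the model */

struct model_group {
	struct model_group *parent;
	long percpu_state[MODEL_NR_CPUS];	/* written on the hot path */
	long percpu_state_prev[MODEL_NR_CPUS];	/* value seen at last flush */
	long state;				/* aggregated subtree value */
	long state_pending;			/* deltas handed up by children */
};

/* Write side: a plain per-cpu add, with no upward propagation. */
static void model_mod_state(struct model_group *g, int cpu, long val)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	g->percpu_state[cpu] += val;
}

/* Flush one group for one CPU: collect child deltas plus local changes. */
static void model_flush_one(struct model_group *g, int cpu)
{
	long delta = g->state_pending;
	long v = g->percpu_state[cpu];

	g->state_pending = 0;

	/* Add this CPU's changes since the previous flush. */
	delta += v - g->percpu_state_prev[cpu];
	g->percpu_state_prev[cpu] = v;

	if (!delta)
		return;

	/* Aggregate on this level and hand the delta to the parent. */
	g->state += delta;
	if (g->parent)
		g->parent->state_pending += delta;
}

/* Read side: for each CPU, flush the groups children first. */
static void model_flush(struct model_group **bottom_up, size_t n)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		for (size_t i = 0; i < n; i++)
			model_flush_one(bottom_up[i], cpu);
}
```

In this model a reader would call model_flush() and then read ->state
directly; a second flush with no intervening writes leaves every counter
unchanged, because each per-cpu value already matches its saved previous
value.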

This adds a second vmstats to struct mem_cgroup (MEMCG_NR_STAT +
NR_VM_EVENT_ITEMS) to track pending subtree deltas during upward
aggregation.  It removes 3 words from the per-cpu data.  It eliminates
memcg_exact_page_state(), since memcg_page_state() is now exact.

[akpm@linux-foundation.org: merge fix]
[hannes@cmpxchg.org: fix a sleep in atomic section problem]
  Link: https://lkml.kernel.org/r/20210315234100.64307-1-hannes@cmpxchg.org
Link: https://lkml.kernel.org/r/20210209163304.77088-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |   67 ++++++----
 mm/memcontrol.c            |  218 +++++++++++++----------------------
 2 files changed, 127 insertions(+), 158 deletions(-)

--- a/include/linux/memcontrol.h~mm-memcontrol-switch-to-rstat
+++ a/include/linux/memcontrol.h
@@ -76,10 +76,27 @@  enum mem_cgroup_events_target {
 };
 
 struct memcg_vmstats_percpu {
-	long stat[MEMCG_NR_STAT];
-	unsigned long events[NR_VM_EVENT_ITEMS];
-	unsigned long nr_page_events;
-	unsigned long targets[MEM_CGROUP_NTARGETS];
+	/* Local (CPU and cgroup) page state & events */
+	long			state[MEMCG_NR_STAT];
+	unsigned long		events[NR_VM_EVENT_ITEMS];
+
+	/* Delta calculation for lockless upward propagation */
+	long			state_prev[MEMCG_NR_STAT];
+	unsigned long		events_prev[NR_VM_EVENT_ITEMS];
+
+	/* Cgroup1: threshold notifications & softlimit tree updates */
+	unsigned long		nr_page_events;
+	unsigned long		targets[MEM_CGROUP_NTARGETS];
+};
+
+struct memcg_vmstats {
+	/* Aggregated (CPU and subtree) page state & events */
+	long			state[MEMCG_NR_STAT];
+	unsigned long		events[NR_VM_EVENT_ITEMS];
+
+	/* Pending child counts during tree propagation */
+	long			state_pending[MEMCG_NR_STAT];
+	unsigned long		events_pending[NR_VM_EVENT_ITEMS];
 };
 
 struct mem_cgroup_reclaim_iter {
@@ -287,8 +304,8 @@  struct mem_cgroup {
 
 	MEMCG_PADDING(_pad1_);
 
-	atomic_long_t		vmstats[MEMCG_NR_STAT];
-	atomic_long_t		vmevents[NR_VM_EVENT_ITEMS];
+	/* memory.stat */
+	struct memcg_vmstats	vmstats;
 
 	/* memory.events */
 	atomic_long_t		memory_events[MEMCG_NR_MEMORY_EVENTS];
@@ -315,10 +332,6 @@  struct mem_cgroup {
 	atomic_t		moving_account;
 	struct task_struct	*move_lock_task;
 
-	/* Legacy local VM stats and events */
-	struct memcg_vmstats_percpu __percpu *vmstats_local;
-
-	/* Subtree VM stats and events (batched updates) */
 	struct memcg_vmstats_percpu __percpu *vmstats_percpu;
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -939,10 +952,6 @@  static inline void mod_memcg_lruvec_stat
 	local_irq_restore(flags);
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
-						gfp_t gfp_mask,
-						unsigned long *total_scanned);
-
 void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 			  unsigned long count);
 
@@ -1023,6 +1032,10 @@  static inline void memcg_memory_event_mm
 
 void split_page_memcg(struct page *head, unsigned int nr);
 
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
+						gfp_t gfp_mask,
+						unsigned long *total_scanned);
+
 #else /* CONFIG_MEMCG */
 
 #define MEM_CGROUP_ID_SHIFT	0
@@ -1131,6 +1144,10 @@  static inline bool lruvec_holds_page_lru
 	return lruvec == &pgdat->__lruvec;
 }
 
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
+
 static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
 {
 	return NULL;
@@ -1334,18 +1351,6 @@  static inline void mod_lruvec_kmem_state
 	mod_node_page_state(page_pgdat(page), idx, val);
 }
 
-static inline
-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
-{
-	return 0;
-}
-
-static inline void split_page_memcg(struct page *head, unsigned int nr)
-{
-}
-
 static inline void count_memcg_events(struct mem_cgroup *memcg,
 				      enum vm_event_item idx,
 				      unsigned long count)
@@ -1368,9 +1373,17 @@  void count_memcg_event_mm(struct mm_stru
 {
 }
 
-static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+static inline void split_page_memcg(struct page *head, unsigned int nr)
 {
 }
+
+static inline
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
+					    gfp_t gfp_mask,
+					    unsigned long *total_scanned)
+{
+	return 0;
+}
 #endif /* CONFIG_MEMCG */
 
 static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
--- a/mm/memcontrol.c~mm-memcontrol-switch-to-rstat
+++ a/mm/memcontrol.c
@@ -765,37 +765,17 @@  mem_cgroup_largest_soft_limit_node(struc
  */
 void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 {
-	long x, threshold = MEMCG_CHARGE_BATCH;
-
 	if (mem_cgroup_disabled())
 		return;
 
-	if (memcg_stat_item_in_bytes(idx))
-		threshold <<= PAGE_SHIFT;
-
-	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
-	if (unlikely(abs(x) > threshold)) {
-		struct mem_cgroup *mi;
-
-		/*
-		 * Batch local counters to keep them in sync with
-		 * the hierarchical ones.
-		 */
-		__this_cpu_add(memcg->vmstats_local->stat[idx], x);
-		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-			atomic_long_add(x, &mi->vmstats[idx]);
-		x = 0;
-	}
-	__this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
+	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
+	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
 }
 
-/*
- * idx can be of type enum memcg_stat_item or node_stat_item.
- * Keep in sync with memcg_exact_page_state().
- */
+/* idx can be of type enum memcg_stat_item or node_stat_item. */
 static unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
 {
-	long x = atomic_long_read(&memcg->vmstats[idx]);
+	long x = READ_ONCE(memcg->vmstats.state[idx]);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -803,17 +783,14 @@  static unsigned long memcg_page_state(st
 	return x;
 }
 
-/*
- * idx can be of type enum memcg_stat_item or node_stat_item.
- * Keep in sync with memcg_exact_page_state().
- */
+/* idx can be of type enum memcg_stat_item or node_stat_item. */
 static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 {
 	long x = 0;
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_local->stat[idx], cpu);
+		x += per_cpu(memcg->vmstats_percpu->state[idx], cpu);
 #ifdef CONFIG_SMP
 	if (x < 0)
 		x = 0;
@@ -936,30 +913,16 @@  void __mod_lruvec_kmem_state(void *p, en
 void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 			  unsigned long count)
 {
-	unsigned long x;
-
 	if (mem_cgroup_disabled())
 		return;
 
-	x = count + __this_cpu_read(memcg->vmstats_percpu->events[idx]);
-	if (unlikely(x > MEMCG_CHARGE_BATCH)) {
-		struct mem_cgroup *mi;
-
-		/*
-		 * Batch local counters to keep them in sync with
-		 * the hierarchical ones.
-		 */
-		__this_cpu_add(memcg->vmstats_local->events[idx], x);
-		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-			atomic_long_add(x, &mi->vmevents[idx]);
-		x = 0;
-	}
-	__this_cpu_write(memcg->vmstats_percpu->events[idx], x);
+	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
+	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
 {
-	return atomic_long_read(&memcg->vmevents[event]);
+	return READ_ONCE(memcg->vmstats.events[event]);
 }
 
 static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
@@ -968,7 +931,7 @@  static unsigned long memcg_events_local(
 	int cpu;
 
 	for_each_possible_cpu(cpu)
-		x += per_cpu(memcg->vmstats_local->events[event], cpu);
+		x += per_cpu(memcg->vmstats_percpu->events[event], cpu);
 	return x;
 }
 
@@ -1604,6 +1567,7 @@  static char *memory_stat_format(struct m
 	 *
 	 * Current memory state:
 	 */
+	cgroup_rstat_flush(memcg->css.cgroup);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -2409,22 +2373,11 @@  static int memcg_hotplug_cpu_dead(unsign
 	drain_stock(stock);
 
 	for_each_mem_cgroup(memcg) {
-		struct memcg_vmstats_percpu *statc;
 		int i;
 
-		statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
-
-		for (i = 0; i < MEMCG_NR_STAT; i++) {
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
 			int nid;
 
-			if (statc->stat[i]) {
-				mod_memcg_state(memcg, i, statc->stat[i]);
-				statc->stat[i] = 0;
-			}
-
-			if (i >= NR_VM_NODE_STAT_ITEMS)
-				continue;
-
 			for_each_node(nid) {
 				struct batched_lruvec_stat *lstatc;
 				struct mem_cgroup_per_node *pn;
@@ -2443,13 +2396,6 @@  static int memcg_hotplug_cpu_dead(unsign
 				}
 			}
 		}
-
-		for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
-			if (statc->events[i]) {
-				count_memcg_events(memcg, i, statc->events[i]);
-				statc->events[i] = 0;
-			}
-		}
 	}
 
 	return 0;
@@ -3572,6 +3518,7 @@  static unsigned long mem_cgroup_usage(st
 	unsigned long val;
 
 	if (mem_cgroup_is_root(memcg)) {
+		cgroup_rstat_flush(memcg->css.cgroup);
 		val = memcg_page_state(memcg, NR_FILE_PAGES) +
 			memcg_page_state(memcg, NR_ANON_MAPPED);
 		if (swap)
@@ -3636,26 +3583,15 @@  static u64 mem_cgroup_read_u64(struct cg
 	}
 }
 
-static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
+static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg)
 {
-	unsigned long stat[MEMCG_NR_STAT] = {0};
-	struct mem_cgroup *mi;
-	int node, cpu, i;
-
-	for_each_online_cpu(cpu)
-		for (i = 0; i < MEMCG_NR_STAT; i++)
-			stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);
-
-	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-		for (i = 0; i < MEMCG_NR_STAT; i++)
-			atomic_long_add(stat[i], &mi->vmstats[i]);
+	int node;
 
 	for_each_node(node) {
 		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
+		unsigned long stat[NR_VM_NODE_STAT_ITEMS] = { 0 };
 		struct mem_cgroup_per_node *pi;
-
-		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
-			stat[i] = 0;
+		int cpu, i;
 
 		for_each_online_cpu(cpu)
 			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
@@ -3668,25 +3604,6 @@  static void memcg_flush_percpu_vmstats(s
 	}
 }
 
-static void memcg_flush_percpu_vmevents(struct mem_cgroup *memcg)
-{
-	unsigned long events[NR_VM_EVENT_ITEMS];
-	struct mem_cgroup *mi;
-	int cpu, i;
-
-	for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
-		events[i] = 0;
-
-	for_each_online_cpu(cpu)
-		for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
-			events[i] += per_cpu(memcg->vmstats_percpu->events[i],
-					     cpu);
-
-	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-		for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
-			atomic_long_add(events[i], &mi->vmevents[i]);
-}
-
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
@@ -4003,6 +3920,8 @@  static int memcg_numa_stat_show(struct s
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
+	cgroup_rstat_flush(memcg->css.cgroup);
+
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
 			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
@@ -4073,6 +3992,8 @@  static int memcg_stat_show(struct seq_fi
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
+	cgroup_rstat_flush(memcg->css.cgroup);
+
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
 
@@ -4549,22 +4470,6 @@  struct wb_domain *mem_cgroup_wb_domain(s
 	return &memcg->cgwb_domain;
 }
 
-/*
- * idx can be of type enum memcg_stat_item or node_stat_item.
- * Keep in sync with memcg_exact_page().
- */
-static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
-{
-	long x = atomic_long_read(&memcg->vmstats[idx]);
-	int cpu;
-
-	for_each_online_cpu(cpu)
-		x += per_cpu_ptr(memcg->vmstats_percpu, cpu)->stat[idx];
-	if (x < 0)
-		x = 0;
-	return x;
-}
-
 /**
  * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
  * @wb: bdi_writeback in question
@@ -4590,13 +4495,14 @@  void mem_cgroup_wb_stats(struct bdi_writ
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
+	cgroup_rstat_flush_irqsafe(memcg->css.cgroup);
 
-	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
-	*pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
-			memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
-	*pheadroom = PAGE_COUNTER_MAX;
+	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
+	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
+	*pfilepages = memcg_page_state(memcg, NR_INACTIVE_FILE) +
+			memcg_page_state(memcg, NR_ACTIVE_FILE);
 
+	*pheadroom = PAGE_COUNTER_MAX;
 	while ((parent = parent_mem_cgroup(memcg))) {
 		unsigned long ceiling = min(READ_ONCE(memcg->memory.max),
 					    READ_ONCE(memcg->memory.high));
@@ -5228,7 +5134,6 @@  static void __mem_cgroup_free(struct mem
 	for_each_node(node)
 		free_mem_cgroup_per_node_info(memcg, node);
 	free_percpu(memcg->vmstats_percpu);
-	free_percpu(memcg->vmstats_local);
 	kfree(memcg);
 }
 
@@ -5236,11 +5141,10 @@  static void mem_cgroup_free(struct mem_c
 {
 	memcg_wb_domain_exit(memcg);
 	/*
-	 * Flush percpu vmstats and vmevents to guarantee the value correctness
-	 * on parent's and all ancestor levels.
+	 * Flush percpu lruvec stats to guarantee the value
+	 * correctness on parent's and all ancestor levels.
 	 */
-	memcg_flush_percpu_vmstats(memcg);
-	memcg_flush_percpu_vmevents(memcg);
+	memcg_flush_lruvec_page_state(memcg);
 	__mem_cgroup_free(memcg);
 }
 
@@ -5267,11 +5171,6 @@  static struct mem_cgroup *mem_cgroup_all
 		goto fail;
 	}
 
-	memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu,
-						GFP_KERNEL_ACCOUNT);
-	if (!memcg->vmstats_local)
-		goto fail;
-
 	memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu,
 						 GFP_KERNEL_ACCOUNT);
 	if (!memcg->vmstats_percpu)
@@ -5471,6 +5370,62 @@  static void mem_cgroup_css_reset(struct
 	memcg_wb_domain_size_changed(memcg);
 }
 
+static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+	struct memcg_vmstats_percpu *statc;
+	long delta, v;
+	int i;
+
+	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+
+	for (i = 0; i < MEMCG_NR_STAT; i++) {
+		/*
+		 * Collect the aggregated propagation counts of groups
+		 * below us. We're in a per-cpu loop here and this is
+		 * a global counter, so the first cycle will get them.
+		 */
+		delta = memcg->vmstats.state_pending[i];
+		if (delta)
+			memcg->vmstats.state_pending[i] = 0;
+
+		/* Add CPU changes on this level since the last flush */
+		v = READ_ONCE(statc->state[i]);
+		if (v != statc->state_prev[i]) {
+			delta += v - statc->state_prev[i];
+			statc->state_prev[i] = v;
+		}
+
+		if (!delta)
+			continue;
+
+		/* Aggregate counts on this level and propagate upwards */
+		memcg->vmstats.state[i] += delta;
+		if (parent)
+			parent->vmstats.state_pending[i] += delta;
+	}
+
+	for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
+		delta = memcg->vmstats.events_pending[i];
+		if (delta)
+			memcg->vmstats.events_pending[i] = 0;
+
+		v = READ_ONCE(statc->events[i]);
+		if (v != statc->events_prev[i]) {
+			delta += v - statc->events_prev[i];
+			statc->events_prev[i] = v;
+		}
+
+		if (!delta)
+			continue;
+
+		memcg->vmstats.events[i] += delta;
+		if (parent)
+			parent->vmstats.events_pending[i] += delta;
+	}
+}
+
 #ifdef CONFIG_MMU
 /* Handlers for move charge at task migration. */
 static int mem_cgroup_do_precharge(unsigned long count)
@@ -6524,6 +6479,7 @@  struct cgroup_subsys memory_cgrp_subsys
 	.css_released = mem_cgroup_css_released,
 	.css_free = mem_cgroup_css_free,
 	.css_reset = mem_cgroup_css_reset,
+	.css_rstat_flush = mem_cgroup_css_rstat_flush,
 	.can_attach = mem_cgroup_can_attach,
 	.cancel_attach = mem_cgroup_cancel_attach,
 	.post_attach = mem_cgroup_move_task,
