
[0/4] mm: memcontrol: memory.stat cost & correctness

Message ID 20190412151507.2769-1-hannes@cmpxchg.org (mailing list archive)

Message

Johannes Weiner April 12, 2019, 3:15 p.m. UTC
The cgroup memory.stat file holds recursive statistics for the entire
subtree. The current implementation does this tree walk on-demand
whenever the file is read. This is giving us problems in production.
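
For context, the on-demand aggregation is essentially a full-subtree walk
on every read of memory.stat. A simplified sketch of that scheme (names
and structure are illustrative, not the exact kernel code):

/*
 * Sketch of the on-demand scheme: every read of memory.stat walks the
 * entire subtree and sums up each descendant's counters. The cost is
 * O(subtree size) per read, paid even for idle or lazily-dying cgroups
 * whose counters have not changed since the last read.
 */
static unsigned long memcg_recursive_stat(struct mem_cgroup *memcg, int idx)
{
	struct mem_cgroup *iter;
	unsigned long val = 0;

	for_each_mem_cgroup_tree(iter, memcg)
		val += memcg_page_state(iter, idx);

	return val;
}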

1. The cost of aggregating the statistics on-demand is high. A lot of
system service cgroups are mostly idle and their stats don't change
between reads, yet we always have to check them. There are also always
some lazily-dying cgroups sitting around that are pinned by a handful
of remaining page cache; the same applies to them.

In an application that periodically monitors memory.stat in our fleet,
we have seen the aggregation consume up to 5% CPU time.

2. When cgroups die and disappear from the cgroup tree, so do their
accumulated vm events. The result is that the event counters at
higher-level cgroups can go backwards and confuse some of our
automation, let alone people looking at the graphs over time.
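
For example (numbers purely illustrative): if a parent's recursive
pgfault count reads 1000, and a child that contributed 400 of those
events is then deleted, the next read reports 600 - the counter appears
to have moved backwards.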

To address both issues, this patch series changes the stat
implementation to spill counts upwards when the counters change.

The upward spilling is batched using the existing per-cpu cache. In a
sparse file stress test with 5-level cgroup nesting, the additional
cost of the flushing was negligible (a little under 1% of CPU at 100%
CPU utilization, compared with the 5% spent reading memory.stat during
regular operation).
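
Roughly, the write side then looks like the following (a simplified
sketch along the lines of the series; field names are approximate, and
the batch size is the one already used for charging):

/*
 * Stat updates accumulate in the per-cpu cache; only once a CPU's
 * delta exceeds the charge batch is it spilled up the hierarchy, with
 * one atomic add per ancestor. Readers then just report the local
 * atomic counters instead of walking the subtree.
 */
static void mod_memcg_state_sketch(struct mem_cgroup *memcg, int idx, int val)
{
	long x;

	x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]);
	if (unlikely(abs(x) > MEMCG_CHARGE_BATCH)) {
		struct mem_cgroup *mi;

		/* Spill the batched delta into this cgroup and every ancestor. */
		for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
			atomic_long_add(x, &mi->vmstats[idx]);
		x = 0;
	}
	__this_cpu_write(memcg->vmstats_percpu->stat[idx], x);
}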

 include/linux/memcontrol.h |  96 +++++++-------
 mm/memcontrol.c            | 290 +++++++++++++++++++++++++++----------------
 mm/vmscan.c                |   4 +-
 mm/workingset.c            |   7 +-
 4 files changed, 234 insertions(+), 163 deletions(-)

Comments

Shakeel Butt April 12, 2019, 8:07 p.m. UTC | #1
On Fri, Apr 12, 2019 at 8:15 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> The cgroup memory.stat file holds recursive statistics for the entire
> subtree. The current implementation does this tree walk on-demand
> whenever the file is read. This is giving us problems in production.
>
> 1. The cost of aggregating the statistics on-demand is high. A lot of
> system service cgroups are mostly idle and their stats don't change
> between reads, yet we always have to check them. There are also always
> some lazily-dying cgroups sitting around that are pinned by a handful
> of remaining page cache; the same applies to them.
>
> In an application that periodically monitors memory.stat in our fleet,
> we have seen the aggregation consume up to 5% CPU time.
>
> 2. When cgroups die and disappear from the cgroup tree, so do their
> accumulated vm events. The result is that the event counters at
> higher-level cgroups can go backwards and confuse some of our
> automation, let alone people looking at the graphs over time.
>
> To address both issues, this patch series changes the stat
> implementation to spill counts upwards when the counters change.
>
> The upward spilling is batched using the existing per-cpu cache. In a
> sparse file stress test with 5 level cgroup nesting, the additional
> cost of the flushing was negligible (a little under 1% of CPU at 100%
> CPU utilization, compared to the 5% of reading memory.stat during
> regular operation).

For whole series:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

>
>  include/linux/memcontrol.h |  96 +++++++-------
>  mm/memcontrol.c            | 290 +++++++++++++++++++++++++++----------------
>  mm/vmscan.c                |   4 +-
>  mm/workingset.c            |   7 +-
>  4 files changed, 234 insertions(+), 163 deletions(-)
>
>
Roman Gushchin April 12, 2019, 10:04 p.m. UTC | #2
On Fri, Apr 12, 2019 at 11:15:03AM -0400, Johannes Weiner wrote:
> The cgroup memory.stat file holds recursive statistics for the entire
> subtree. The current implementation does this tree walk on-demand
> whenever the file is read. This is giving us problems in production.
> 
> 1. The cost of aggregating the statistics on-demand is high. A lot of
> system service cgroups are mostly idle and their stats don't change
> between reads, yet we always have to check them. There are also always
> some lazily-dying cgroups sitting around that are pinned by a handful
> of remaining page cache; the same applies to them.
> 
> In an application that periodically monitors memory.stat in our fleet,
> we have seen the aggregation consume up to 5% CPU time.
> 
> 2. When cgroups die and disappear from the cgroup tree, so do their
> accumulated vm events. The result is that the event counters at
> higher-level cgroups can go backwards and confuse some of our
> automation, let alone people looking at the graphs over time.
> 
> To address both issues, this patch series changes the stat
> implementation to spill counts upwards when the counters change.
> 
> The upward spilling is batched using the existing per-cpu cache. In a
> sparse file stress test with 5 level cgroup nesting, the additional
> cost of the flushing was negligible (a little under 1% of CPU at 100%
> CPU utilization, compared to the 5% of reading memory.stat during
> regular operation).
> 
>  include/linux/memcontrol.h |  96 +++++++-------
>  mm/memcontrol.c            | 290 +++++++++++++++++++++++++++----------------
>  mm/vmscan.c                |   4 +-
>  mm/workingset.c            |   7 +-
>  4 files changed, 234 insertions(+), 163 deletions(-)
> 
> 

For the series:
Reviewed-by: Roman Gushchin <guro@fb.com>

Thanks!