From patchwork Tue Apr 20 19:29:03 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Waiman Long <longman@redhat.com>
X-Patchwork-Id: 12215023
Return-Path: <SRS0=ZI7U=JR=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-10.8 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH,
	DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=ham
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id AEE48C43460
	for <linux-mm@archiver.kernel.org>; Tue, 20 Apr 2021 19:30:23 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id 3D965613E3
	for <linux-mm@archiver.kernel.org>; Tue, 20 Apr 2021 19:30:23 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 3D965613E3
Authentication-Results: mail.kernel.org;
 dmarc=fail (p=none dis=none) header.from=redhat.com
Authentication-Results: mail.kernel.org;
 spf=pass smtp.mailfrom=owner-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix)
	id A78DB6B0070; Tue, 20 Apr 2021 15:30:22 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 9DD5B6B0071; Tue, 20 Apr 2021 15:30:22 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 805C26B0072; Tue, 20 Apr 2021 15:30:22 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0169.hostedemail.com
 [216.40.44.169])
	by kanga.kvack.org (Postfix) with ESMTP id 599B56B0070
	for <linux-mm@kvack.org>; Tue, 20 Apr 2021 15:30:22 -0400 (EDT)
Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com
 [10.5.19.251])
	by forelay01.hostedemail.com (Postfix) with ESMTP id 05BC6180ACEEC
	for <linux-mm@kvack.org>; Tue, 20 Apr 2021 19:30:22 +0000 (UTC)
X-FDA: 78053736684.13.6D9A571
Received: from us-smtp-delivery-124.mimecast.com
 (us-smtp-delivery-124.mimecast.com [216.205.24.124])
	by imf27.hostedemail.com (Postfix) with ESMTP id 9F3C380192F6
	for <linux-mm@kvack.org>; Tue, 20 Apr 2021 19:30:05 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com;
	s=mimecast20190719; t=1618947021;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc; bh=MbIxrX0aCC5TwYOgCR6sx7fFecrjfg4kmqWU4k6ggU4=;
	b=LmUqrxTce8sL9P9+8oW3glVUZ40DH0xRkcWH5parxM6369QRGbwW4cjR1dWvGoRlcFxFsO
	Qtv/YjnLx/xWVKvqUBM1UED/WhM739mIYowKRJXLAkKQpSMI0QKb7YWGX7qGRGxbFmLIy8
	3Kq9OJ5xhSgfjGSYM8OxfYcvE5jS+Pk=
Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com
 [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id
 us-mta-341-J3Ys88hZNAGdZqs6KmmaXQ-1; Tue, 20 Apr 2021 15:29:42 -0400
X-MC-Unique: J3Ys88hZNAGdZqs6KmmaXQ-1
Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com
 [10.5.11.12])
	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 785751006C94;
	Tue, 20 Apr 2021 19:29:38 +0000 (UTC)
Received: from llong.com (ovpn-114-123.rdu2.redhat.com [10.10.114.123])
	by smtp.corp.redhat.com (Postfix) with ESMTP id 97EFB610F3;
	Tue, 20 Apr 2021 19:29:30 +0000 (UTC)
From: Waiman Long <longman@redhat.com>
To: Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Tejun Heo <tj@kernel.org>,
	Christoph Lameter <cl@linux.com>,
	Pekka Enberg <penberg@kernel.org>,
	David Rientjes <rientjes@google.com>,
	Joonsoo Kim <iamjoonsoo.kim@lge.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Roman Gushchin <guro@fb.com>
Cc: linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org,
	linux-mm@kvack.org,
	Shakeel Butt <shakeelb@google.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Alex Shi <alex.shi@linux.alibaba.com>,
	Chris Down <chris@chrisdown.name>,
	Yafang Shao <laoar.shao@gmail.com>,
	Wei Yang <richard.weiyang@gmail.com>,
	Masayoshi Mizuma <msys.mizuma@gmail.com>,
	Xing Zhengjun <zhengjun.xing@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Waiman Long <longman@redhat.com>
Subject: [PATCH-next v5 0/4] mm/memcg: Reduce kmemcache memory accounting
 overhead
Date: Tue, 20 Apr 2021 15:29:03 -0400
Message-Id: <20210420192907.30880-1-longman@redhat.com>
X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12
X-Rspamd-Queue-Id: 9F3C380192F6
X-Stat-Signature: 7peu5czbb4m366ud7b4q4d7tbs75zw8z
X-Rspamd-Server: rspam02
Received-SPF: none (redhat.com>: No applicable sender policy available)
 receiver=imf27; identity=mailfrom; envelope-from="<longman@redhat.com>";
 helo=us-smtp-delivery-124.mimecast.com; client-ip=216.205.24.124
X-HE-DKIM-Result: pass/pass
X-HE-Tag: 1618947005-483128
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

v5:
  - Combine v4 patches 2 & 4 into a single patch with minor twisting.
  - Move the user context stock access optimization patch to the end
    of the series.
  - Incorporate some changes suggested by Johannes.

 v4:
  - Drop v3 patch 1 as well as modification to mm/percpu.c as the percpu
    vmstat update isn't frequent enough to worth caching it.
  - Add a new patch 1 to move Move mod_objcg_state() to memcontrol.c instead.
  - Combine v3 patches 4 & 5 into a single patch (patch 3).
  - Add a new patch 4 to cache both reclaimable & unreclaimable vmstat updates.
  - Add a new patch 5 to improve refill_obj_stock() performance.

With the recent introduction of the new slab memory controller, we
eliminate the need for having separate kmemcaches for each memory
cgroup and reduce overall kernel memory usage. However, we also add
additional memory accounting overhead to each call of kmem_cache_alloc()
and kmem_cache_free().

For workloads that require a lot of kmemcache allocations and
de-allocations, they may experience performance regression as illustrated
in [1] and [2].

A simple kernel module that performs repeated loop of 100,000,000
kmem_cache_alloc() and kmem_cache_free() of either a small 32-byte
object or a big 4k object at module init time with a batch size of 4
(4 kmalloc's followed by 4 kfree's) is used for benchmarking. The
benchmarking tool was run on a kernel based on linux-next-20210419.
The test was run on a CascadeLake server with turbo-boosting disable
to reduce run-to-run variation.

The small object test exercises mainly the object stock charging and
vmstat update code paths. The large object test also exercises the
refill_obj_stock() and  __memcg_kmem_charge()/__memcg_kmem_uncharge()
code paths.

With memory accounting disabled, the run time was 3.130s with both small
object big object tests.

With memory accounting enabled, both cgroup v1 and v2 showed similar
results in the small object test.  The performance results of the large
object test, however, differed between cgroup v1 and v2.

The execution times with the application of various patches in the
patchset were:

  Applied patches   Run time   Accounting overhead   %age 1   %age 2
  ---------------   --------   -------------------   ------   ------

  Small 32-byte object:
       None          11.634s         8.504s          100.0%   271.7%
        1-2           9.425s         6.295s           74.0%   201.1%
        1-3           9.708s         6.578s           77.4%   210.2%
        1-4           8.062s         4.932s           58.0%   157.6%

  Large 4k object (v2):
       None          22.107s        18.977s          100.0%   606.3%
        1-2          20.960s        17.830s           94.0%   569.6%
        1-3          14.238s        11.108s           58.5%   354.9%
        1-4          11.329s         8.199s           43.2%   261.9%

  Large 4k object (v1):
       None          36.807s        33.677s          100.0%  1075.9%
        1-2          36.648s        33.518s           99.5%  1070.9%
        1-3          22.345s        19.215s           57.1%   613.9%
        1-4          18.662s        15.532s           46.1%   496.2%

  N.B. %age 1 = overhead/unpatched overhead
       %age 2 = overhead/accounting disabled time

Patch 2 (vmstat data stock caching) helps in both the small object test
and the large v2 object test. It doesn't help much in v1 big object test.

Patch 3 (refill_obj_stock improvement) does help the small object test
but offer significant performance improvement for the large object test
(both v1 and v2).

Patch 4 (eliminating irq disable/enable) helps in all test cases.

To test for the extreme case, a multi-threaded kmalloc/kfree
microbenchmark was run on the 2-socket 48-core 96-thread system with
96 testing threads in the same memcg doing kmalloc+kfree of a 4k object
with accounting enabled for 10s. The total number of kmalloc+kfree done
in kilo operations per second (kops/s) were as follows:

  Applied patches   v1 kops/s   v1 change   v2 kops/s   v2 change
  ---------------   ---------   ---------   ---------   ---------
       None           3,520        1.00X      6,242        1.00X
        1-2           4,304        1.22X      8,478        1.36X
        1-3           4,731        1.34X    418,142       66.99X
        1-4           4,587        1.30X    438,838       70.30X

With memory accounting disabled, the kmalloc/kfree rate was 1,481,291
kop/s. This test shows how significant the memory accouting overhead
can be in some extreme situations.

For this multithreaded test, the improvement from patch 2 mainly
comes from the conditional atomic xchg of objcg->nr_charged_bytes in
mod_objcg_state(). By using an unconditional xchg, the operation rates
were similar to the unpatched kernel.

Patch 3 elminates the single highly contended cacheline of
objcg->nr_charged_bytes for cgroup v2 leading to a huge performance
improvement. Cgroup v1, however, still has another highly contended
cacheline in the shared page counter &memcg->kmem. So the improvement
is only modest.

Patch 4 helps in cgroup v2, but performs worse in cgroup v1 as
eliminating the irq_disable/irq_enable overhead seems to aggravate the
cacheline contention.

[1] https://lore.kernel.org/linux-mm/20210408193948.vfktg3azh2wrt56t@gabell/T/#u
[2] https://lore.kernel.org/lkml/20210114025151.GA22932@xsang-OptiPlex-9020/


Waiman Long (4):
  mm/memcg: Move mod_objcg_state() to memcontrol.c
  mm/memcg: Cache vmstat data in percpu memcg_stock_pcp
  mm/memcg: Improve refill_obj_stock() performance
  mm/memcg: Optimize user context object stock access

 mm/memcontrol.c | 195 +++++++++++++++++++++++++++++++++++++++++-------
 mm/slab.h       |  16 +---
 2 files changed, 171 insertions(+), 40 deletions(-)