From patchwork Wed Sep 9 21:57:52 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shakeel Butt
X-Patchwork-Id: 11766295
Date: Wed, 9 Sep 2020 14:57:52 -0700
Message-Id: <20200909215752.1725525-1-shakeelb@google.com>
X-Mailer: git-send-email 2.28.0.526.ge36021eeef-goog
Subject: [PATCH] memcg: introduce per-memcg reclaim interface
From: Shakeel Butt
To: Johannes Weiner, Roman Gushchin, Michal Hocko, Yang Shi,
 Greg Thelen, David Rientjes, Michal Koutný
Cc: Andrew Morton, linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org, Shakeel Butt

Introduce a memcg interface to trigger memory reclaim on a memory cgroup.

Use cases:
----------

1) Per-memcg uswapd:

Applications usually consist of a combination of latency sensitive and
latency tolerant tasks. For example, tasks serving user requests vs
tasks doing data backup for a database application. At the moment the
kernel does not differentiate between such tasks when the application
hits the memcg limits. So, potentially a latency sensitive user facing
task can get stuck in high reclaim and be throttled by the kernel.

Similarly there are cases of single process applications having two
sets of thread pools where threads from one pool have high scheduling
priority and low latency requirements. One concrete example from our
production is the VMM, which has a high priority, low latency thread
pool for the VCPUs and a separate thread pool for stats reporting, I/O
emulation, health checks and other managerial operations. The kernel
memory reclaim does not differentiate between a VCPU thread and a
non-latency-sensitive thread, so a VCPU thread can get stuck in high
reclaim.

One way to resolve this issue is to preemptively trigger the memory
reclaim from a latency tolerant task (uswapd) when the application is
near the limits. Detecting the 'near the limits' situation is an
orthogonal problem.

2) Proactive reclaim:

This is similar to the previous use case. The difference is that,
instead of waiting for the application to be near its limit to trigger
memory reclaim, the memcg is continuously pressured to reclaim a small
amount of memory. This gives a more accurate and up-to-date workingset
estimation as the LRUs are continuously sorted, and it can potentially
provide more deterministic memory overcommit behavior. The memory
overcommit controller can provide a more proactive response to the
changing behavior of the running applications instead of being
reactive.

Benefits of a user space solution:
----------------------------------

1) More flexible on who should be charged for the cpu of the memory
reclaim. For proactive reclaim, it makes more sense to centralize the
overhead, while for uswapd it makes more sense for the application to
pay for the cpu of the memory reclaim.

2) More flexible on dedicating the resources (like cpu). The memory
overcommit controller can balance the cost between the cpu usage and
the memory reclaimed.

3) Provides a way for applications to keep their LRUs sorted, so that
under memory pressure better reclaim candidates are selected. This also
gives a more accurate and up-to-date notion of the working set for an
application.

Questions:
----------

1) Why is memory.high not enough?

memory.high can be used to trigger reclaim in a memcg and can
potentially be used for proactive reclaim as well as uswapd use cases.
However there is a big negative in using memory.high. It can
potentially introduce high reclaim stalls in the target application as
the allocations from the processes or the threads of the application
can hit the temporary memory.high limit.

Another issue with memory.high is that it is not delegatable. To
actually use this interface for uswapd, the application has to
introduce another layer of cgroup on whose memory.high it has write
access.

2) Why is uswapd safe from self induced reclaim?

This is very similar to the scenario of oomd under global memory
pressure. We can use similar mechanisms to protect uswapd from self
induced reclaim, i.e. memory.min and mlock.

Interface options:
------------------

Introducing a very simple memcg interface 'echo 10M > memory.reclaim'
to trigger reclaim in the target memory cgroup.

In future we might want to reclaim a specific type of memory from a
memcg, so this interface can be extended to allow that, e.g.

$ echo 10M [all|anon|file|kmem] > memory.reclaim

However, that should wait until we have concrete use cases for such
functionality. Keep things simple for now.

Signed-off-by: Shakeel Butt
Reviewed-by: SeongJae Park
---
 Documentation/admin-guide/cgroup-v2.rst |  9 ++++++
 mm/memcontrol.c                         | 37 +++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 6be43781ec7f..58d70b5989d7 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1181,6 +1181,15 @@ PAGE_SIZE multiple when read back.
 	high limit is used and monitored properly, this limit's
 	utility is limited to providing the final safety net.
 
+  memory.reclaim
+	A write-only file which exists on non-root cgroups.
+
+	This is a simple interface to trigger memory reclaim in the
+	target cgroup. Write the number of bytes to reclaim to this
+	file and the kernel will try to reclaim that much memory.
+	Please note that the kernel can over or under reclaim from
+	the target cgroup.
+
   memory.oom.group
 	A read-write single value file which exists on non-root
 	cgroups.  The default value is "0".
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75cd1a1e66c8..2d006c36d7f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6456,6 +6456,38 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
+			      size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
+	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "", &nr_to_reclaim);
+	if (err)
+		return err;
+
+	while (nr_reclaimed < nr_to_reclaim) {
+		unsigned long reclaimed;
+
+		if (signal_pending(current))
+			break;
+
+		reclaimed = try_to_free_mem_cgroup_pages(memcg,
+						nr_to_reclaim - nr_reclaimed,
+						GFP_KERNEL, true);
+
+		if (!reclaimed && !nr_retries--)
+			break;
+
+		nr_reclaimed += reclaimed;
+	}
+
+	return nbytes;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -6508,6 +6540,11 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_oom_group_show,
 		.write = memory_oom_group_write,
 	},
+	{
+		.name = "reclaim",
+		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
+		.write = memory_reclaim,
+	},
 	{ }	/* terminate */
 };
 
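
For the uswapd use case described in the changelog, a latency tolerant
helper task could drive the new file from userspace roughly as sketched
below. This is only a minimal sketch under stated assumptions: the
helper name request_reclaim() is hypothetical and not part of this
patch, and the only behavior taken from the patch is that a size string
such as "10M" is written to memory.reclaim in a single write.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical userspace helper (not part of the patch): write a size
 * string such as "10M" to the given memory.reclaim file.
 *
 * Returns 0 on success or a negative errno value on failure.
 */
static int request_reclaim(const char *reclaim_path, const char *amount)
{
	size_t len = strlen(amount);
	ssize_t written;
	int fd, ret = 0;

	fd = open(reclaim_path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* Send the whole size string in one write() call. */
	written = write(fd, amount, len);
	if (written < 0)
		ret = -errno;
	else if ((size_t)written != len)
		ret = -EIO;

	close(fd);
	return ret;
}
```

A uswapd thread could call request_reclaim() on its own cgroup's
memory.reclaim file whenever it decides the application is near its
limit. Since the documentation added by the patch says the kernel can
over or under reclaim, a successful write should not be read as "exactly
this much memory was reclaimed".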