From patchwork Mon Aug 28 23:33:18 2023
X-Patchwork-Submitter: Yosry Ahmed
X-Patchwork-Id: 13368411
Date: Mon, 28 Aug 2023 23:33:18 +0000
In-Reply-To: <20230828233319.340712-1-yosryahmed@google.com>
References: <20230828233319.340712-1-yosryahmed@google.com>
Message-ID: <20230828233319.340712-5-yosryahmed@google.com>
Subject: [PATCH v2 4/4] mm: memcg: use non-unified stats flushing for userspace reads
From: Yosry Ahmed
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long,
 linux-mm@kvack.org, cgroups@vger.kernel.org,
 linux-kernel@vger.kernel.org, Yosry Ahmed

Unified flushing allows for great concurrency for paths that attempt to
flush the stats, at the expense of
potential staleness and a single flusher paying the extra cost of
flushing the full tree. This tradeoff makes sense for in-kernel flushers
that may observe high concurrency (e.g. reclaim, refault). For userspace
readers, stale stats may be unexpected and problematic, especially when
such stats are used for critical paths such as userspace OOM handling.
Additionally, a userspace reader will occasionally pay the cost of
flushing the entire hierarchy, which also causes problems in some cases
[1].

Opt userspace reads out of unified flushing. This makes both the cost of
reading the stats (proportional to the size of the subtree) and the
freshness of the stats more predictable. Since userspace readers are not
expected to have similar concurrency to in-kernel flushers, serializing
them among themselves and with in-kernel flushers should be okay.

Note that this may make the worst case latency for reading stats worse.
Flushers may give up cgroup_rstat_lock and sleep, making the number of
waiters for the spinlock theoretically unbounded. A reader may grab the
lock and do some work, give it up, sleep, and then wait for a long time
before acquiring it again and continuing. This is only possible if there
is high concurrency among processes reading stats of different parts of
the hierarchy, such that they are not helping each other out.

Other stats interfaces such as cpu.stat have the same theoretical
problem, so this is unlikely to be a problem in practice. If it is, we
can introduce a mutex in the stats reading path to guard against
concurrent readers competing for the lock. We have similar protection
for unified flushing, except that concurrent flushers skip instead of
waiting.

An alternative is to remove flushing from the stats reading path
completely, and rely on the periodic flusher. This should be accompanied
by making the periodic flushing period tunable, and providing an
interface for userspace to force a flush, following a similar model to
/proc/vmstat.
However, such a change will be hard to reverse if the implementation
needs to be changed because:
- The cost of reading stats will be very cheap and we won't be able to
  take that back easily.
- There are user-visible interfaces involved.

Hence, let's go with the change that's most reversible first. If
problems arise, we can add a mutex in the stats reading path as
described above, or follow the more user-visible approach.

This was tested on a machine with 256 cpus by running a synthetic test
script that creates 50 top-level cgroups, each with 5 children (250
leaf cgroups). Each leaf cgroup has 10 processes running that allocate
memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
unified flusher). Concurrently, one thread is spawned per-cgroup to read
the stats every second (including root, top-level, and leaf cgroups --
so total 251 threads). No regressions were observed in the total running
time, which means that non-unified userspace readers are not slowing
down in-kernel unified flushers:

Base (mm-unstable):

real	0m18.228s
user	0m9.463s
sys	60m15.879s

real	0m20.828s
user	0m8.535s
sys	70m12.364s

real	0m19.789s
user	0m9.177s
sys	66m10.798s

With this patch:

real	0m19.632s
user	0m8.608s
sys	64m23.483s

real	0m18.463s
user	0m7.465s
sys	60m34.089s

real	0m20.309s
user	0m7.754s
sys	68m2.392s

Additionally, the average latency for reading stats went down by up to
8 times when reading stats of leaf cgroups in the script, as we only
have to flush the cgroup(s) being read.
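
To illustrate the difference in flushing cost and freshness described
above, here is a minimal single-threaded userspace sketch. It is a toy
model, not kernel code: struct node, flush_subtree(), unified_try_flush(),
non_unified_flush() and flush_in_progress are hypothetical names made up
for this illustration, and the real code additionally serializes flushers
on cgroup_rstat_lock, which the model leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 8

/* Toy stand-in for a cgroup; not the kernel's struct mem_cgroup. */
struct node {
	struct node *parent;
	struct node *children[MAX_CHILDREN];
	size_t nr_children;
	long pending;	/* updates not yet folded into total */
	long total;	/* hierarchical value visible to readers */
};

/* Stand-in for the "a unified flush is already running" state. */
static bool flush_in_progress;

static void link_child(struct node *parent, struct node *child)
{
	assert(parent->nr_children < MAX_CHILDREN);
	parent->children[parent->nr_children++] = child;
	child->parent = parent;
}

/*
 * Fold pending updates of the subtree rooted at n into the totals, and
 * hand the subtree's delta to n's parent as pending. Returns the number
 * of nodes visited, i.e. the cost of the flush.
 */
static int flush_subtree(struct node *n)
{
	int visited = 1;
	long delta;
	size_t i;

	for (i = 0; i < n->nr_children; i++)
		visited += flush_subtree(n->children[i]);

	delta = n->pending;
	n->pending = 0;
	n->total += delta;
	if (n->parent)
		n->parent->pending += delta;
	return visited;
}

/* Unified: always flush from the root, skip if someone else is flushing. */
static int unified_try_flush(struct node *root)
{
	int visited;

	if (flush_in_progress)
		return 0;	/* caller proceeds with possibly stale totals */
	flush_in_progress = true;
	visited = flush_subtree(root);
	flush_in_progress = false;
	return visited;
}

/* Non-unified: never skip, and only flush the subtree being read. */
static int non_unified_flush(struct node *n)
{
	return flush_subtree(n);
}
```

In this model a reader of a leaf pays for one node with
non_unified_flush(), while unified_try_flush() either walks the whole
tree or returns 0 and leaves the reader with whatever totals were last
flushed.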

[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

Signed-off-by: Yosry Ahmed
---
 mm/memcontrol.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f3716478bf4e..8bfb0e3395ce 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1607,7 +1607,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4049,7 +4049,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4124,7 +4124,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4626,7 +4626,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6641,7 +6641,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	do_stats_flush(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;