From patchwork Wed Aug 9 04:58:10 2023

Date: Wed, 9 Aug 2023 04:58:10 +0000
Message-ID: <20230809045810.1659356-1-yosryahmed@google.com>
Subject: [PATCH] mm: memcg: provide accurate stats for userspace reads
From: Yosry Ahmed
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton
Cc: Muchun Song, cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Yosry Ahmed

Over time, the memcg code added multiple optimizations to the stats
flushing path that introduce a tradeoff between accuracy and
performance. In some contexts (e.g. dirty throttling, refaults, etc.), a
full rstat flush of the stats in the tree can be too expensive. Such
optimizations include [1]:

(a) Introducing a periodic background flusher to keep the size of the
    update tree from growing unbounded.

(b) Allowing only one thread to flush at a time, and having other
    concurrent flushers just skip the flush. This avoids a thundering
    herd problem when multiple reclaim/refault threads attempt to flush
    the stats at once.

(c) Only executing a flush if the magnitude of the stats updates exceeds
    a certain threshold.

These optimizations were necessary to make flushing feasible in
performance-critical paths, and they come at the cost of some accuracy
that we choose to live without.

On the other hand, for flushes invoked when userspace is reading the
stats, the tradeoff is less appealing. This code path is not
performance-critical, and the inaccuracies can affect userspace
behavior. For example, skipping a flush when there is another ongoing
flush is essentially a coin flip: we don't know whether the ongoing
flush covers the subtree of interest or not. If userspace asks for
stats, let's give it accurate stats. Without this patch, we see
regressions in userspace workloads due to stats inaccuracy in some
cases.

Rework the do_flush_stats() helper to accept a "full" boolean argument.
For a "full" flush, if there is an ongoing flush, do not skip; instead,
wait for the flush to complete. Introduce a new
mem_cgroup_flush_stats_full() interface that uses this full flush, and
also does not check whether the magnitude of the updates exceeds the
threshold. Use mem_cgroup_flush_stats_full() in code paths where stats
are flushed due to a userspace read. This essentially undoes
optimizations (b) and (c) above for flushes triggered by userspace
reads.
[1] https://lore.kernel.org/lkml/20210716212137.1391164-2-shakeelb@google.com/

Signed-off-by: Yosry Ahmed
---
I want to argue that this is what we should be doing for all flushing
contexts, not just userspace reads (i.e. all flushes should be "full").
Skipping if a flush is ongoing is too brittle; there is a significant
chance that the stats of the cgroup we care about are not fully flushed.
Waiting for an ongoing flush to finish ensures correctness while still
avoiding the thundering herd problem on the rstat flush lock.

Having said that, there is a higher chance of regression if we add the
wait in more critical paths (e.g. reclaim, refaults), so I opted to do
this for userspace reads for now. We have complaints about inaccuracy
in userspace reads, but no complaints about inaccuracy in other paths
so far (although it would be really difficult to tie a reclaim/refault
problem to a partial stats flush anyway).
---
 mm/memcontrol.c | 42 +++++++++++++++++++++++++++---------------
 1 file changed, 27 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e041ba827e59..38e227f7127d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 		/*
 		 * If stats_flush_threshold exceeds the threshold
 		 * (>num_online_cpus()), cgroup stats update will be triggered
-		 * in __mem_cgroup_flush_stats(). Increasing this var further
+		 * in mem_cgroup_flush_stats(). Increasing this var further
 		 * is redundant and simply adds overhead in atomic update.
 		 */
 		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
@@ -639,17 +639,24 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(void)
+static void do_flush_stats(bool full)
 {
+	if (!atomic_read(&stats_flush_ongoing) &&
+	    !atomic_xchg(&stats_flush_ongoing, 1))
+		goto flush;
+
 	/*
-	 * We always flush the entire tree, so concurrent flushers can just
-	 * skip. This avoids a thundering herd problem on the rstat global lock
-	 * from memcg flushers (e.g. reclaim, refault, etc).
+	 * We always flush the entire tree, so concurrent flushers can choose to
+	 * skip if accuracy is not critical. Otherwise, wait for the ongoing
+	 * flush to complete. This avoids a thundering herd problem on the rstat
+	 * global lock from memcg flushers (e.g. reclaim, refault, etc).
 	 */
-	if (atomic_read(&stats_flush_ongoing) ||
-	    atomic_xchg(&stats_flush_ongoing, 1))
-		return;
-
+	while (full && atomic_read(&stats_flush_ongoing) == 1) {
+		if (!cond_resched())
+			cpu_relax();
+	}
+	return;
+flush:
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
 
@@ -661,7 +668,12 @@ static void do_flush_stats(void)
 void mem_cgroup_flush_stats(void)
 {
 	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
-		do_flush_stats();
+		do_flush_stats(false);
+}
+
+static void mem_cgroup_flush_stats_full(void)
+{
+	do_flush_stats(true);
 }
 
 void mem_cgroup_flush_stats_ratelimited(void)
@@ -676,7 +688,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Always flush here so that flushing in latency-sensitive paths is
 	 * as cheap as possible.
 	 */
-	do_flush_stats();
+	do_flush_stats(false);
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -1576,7 +1588,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_full();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4018,7 +4030,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_full();
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4093,7 +4105,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_full();
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -6610,7 +6622,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_flush_stats_full();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;