From patchwork Tue Aug 13 21:53:58 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shakeel Butt
X-Patchwork-Id: 13762612
From: Shakeel Butt
To: Andrew Morton
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
 Yosry Ahmed, Jesper Dangaard Brouer, Yu Zhao, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Meta kernel team, cgroups@vger.kernel.org
Subject: [PATCH v2] memcg: use ratelimited stats flush in the reclaim
Date: Tue, 13 Aug 2024 14:53:58 -0700
Message-ID: <20240813215358.2259750-1-shakeel.butt@linux.dev>

Meta production is seeing a large number of stalls in the memcg stats
flush from the memcg reclaim code path. At the moment, this specific
callsite does a synchronous memcg stats flush. The rstat flush is an
expensive and time-consuming operation, so concurrent reclaimers will
busy-wait on the lock, potentially for a long time. This issue is not
unique to Meta and has been observed by Cloudflare [1] as well.
For the Cloudflare case, the stalls were due to contention between
kswapd threads running on their 8 NUMA node machines, which does not
make sense as the rstat flush is global and a flush from one kswapd
thread should be sufficient for all.

Simply replace the synchronous flush with the ratelimited one. One may
raise a concern about potentially using 2 sec stale (at worst) stats for
heuristics like the desirable inactive:active ratio and preferring
inactive file pages over anon pages, but these specific heuristics do
not require very precise stats and are also ignored under severe memory
pressure.

More specifically for this code path, the stats are needed for two
specific heuristics:

1. Deactivate LRUs
2. Cache trim mode

The deactivate LRUs heuristic is to maintain a desirable inactive:active
ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE*
and the hierarchical LRU size. The WORKINGSET_ACTIVATE* is needed to
check if there is a refault since the last snapshot, and the LRU sizes
are needed for the desirable ratio between inactive and active LRUs. See
the table below on how the desirable ratio is calculated.

/* total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB
 */

The desirable ratio only changes at the boundary of 1 GiB, 10 GiB,
100 GiB, 1 TiB and 10 TiB. There is no need for precise and accurate
LRU size information to calculate this ratio. In addition, if
deactivation is skipped for some LRU, the kernel will force deactivation
under severe memory pressure.

For the cache trim mode, the inactive file LRU size is read and the
kernel scales it down based on the reclaim iteration
(file >> sc->priority) and only checks if it is zero or not. Again,
precise information is not needed.

This patch has been running on the Meta fleet for several months and we
have not observed any issues.

Please note that MGLRU is not impacted by this issue at all as it avoids
rstat flushing completely.

Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
Signed-off-by: Shakeel Butt
---
Changes since v1:
- Updated the commit message.

 mm/vmscan.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 008b62abf104..82318464cd5e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2282,10 +2282,11 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
 
 	/*
-	 * Flush the memory cgroup stats, so that we read accurate per-memcg
-	 * lruvec stats for heuristics.
+	 * Flush the memory cgroup stats in rate-limited way as we don't need
+	 * most accurate stats here. We may switch to regular stats flushing
+	 * in the future once it is cheap enough.
 	 */
-	mem_cgroup_flush_stats(sc->target_mem_cgroup);
+	mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
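
For illustration only, and not part of the patch: a minimal userspace C
sketch of the two heuristics discussed in the commit message. It assumes
4 KiB pages and that the target ratio is the integer square root of
10 times the total LRU size in whole GiB, which reproduces the ratio
table in the commit message. The names isqrt_u64(), inactive_ratio() and
file_nonzero_at_priority() are hypothetical and are not the kernel's.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Floor of the square root of x, computed bit by bit. */
uint64_t isqrt_u64(uint64_t x)
{
	uint64_t r = 0;
	uint64_t bit = 1ULL << 62;	/* highest power of four in 64 bits */

	while (bit > x)
		bit >>= 2;
	while (bit) {
		if (x >= r + bit) {
			x -= r + bit;
			r = (r >> 1) + bit;
		} else {
			r >>= 1;
		}
		bit >>= 2;
	}
	return r;
}

/*
 * Target inactive:active ratio for an LRU pair holding total_pages pages.
 * The size is truncated to whole GiB first, so small errors in the page
 * count (for example from stale stats) usually do not change the result.
 */
uint64_t inactive_ratio(uint64_t total_pages)
{
	uint64_t gb = total_pages >> (30 - PAGE_SHIFT);

	return gb ? isqrt_u64(10 * gb) : 1;
}

/*
 * Cache trim check from the commit message: the inactive file LRU size
 * is scaled down by the reclaim priority and only tested against zero.
 */
int file_nonzero_at_priority(uint64_t inactive_file_pages, unsigned int priority)
{
	assert(priority < 64);	/* shifting by >= 64 bits is undefined */
	return (inactive_file_pages >> priority) != 0;
}
```

Under these assumptions, inactive_ratio(1ULL << 18) (1 GiB worth of
4 KiB pages) gives 3, and adding a few thousand pages of error to the
LRU size leaves the result unchanged, because the size is truncated to
whole GiB before the square root is taken.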