From patchwork Sat Aug 7 08:28:35 2021
From: Miaohe Lin <linmiaohe@huawei.com>
Subject: [PATCH v2 3/3] mm, memcg: get rid of percpu_charge_mutex lock
Date: Sat, 7 Aug 2021 16:28:35 +0800
Message-ID: <20210807082835.61281-4-linmiaohe@huawei.com>
In-Reply-To: <20210807082835.61281-1-linmiaohe@huawei.com>
References: <20210807082835.61281-1-linmiaohe@huawei.com>
X-Patchwork-Id: 12424225
We should get rid of the percpu_charge_mutex lock, as Johannes Weiner
said:

""
It doesn't seem like we need the lock at all. The comment says it's so
we don't spawn more workers when flushing is already underway. But a
work cannot be queued more than once - if it were just about that, we'd
needlessly duplicate the test_and_set_bit(WORK_STRUCT_PENDING_BIT) in
queue_work_on().

git history shows we tried to remove it once:

Commit 8521fc50d433 ("memcg: get rid of percpu_charge_mutex lock")
tried to do it, but it turned out that the lock did in fact protect a
data structure: the stock itself. Specifically stock->cached:

Commit 9f50fad65b87 ("Revert "memcg: get rid of percpu_charge_mutex
lock"") reverted the above removal and explained:

    The patch incorrectly assumes that using atomic FLUSHING_CACHED_CHARGE
    bit operations is sufficient but that is not true. Johannes Weiner has
    reported a crash during parallel memory cgroup removal:

      BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      IP: [] css_is_ancestor+0x20/0x70
      Oops: 0000 [#1] PREEMPT SMP
      Pid: 19677, comm: rmdir Tainted: G W 3.0.0-mm1-00188-gf38d32b #35 ECS MCP61M-M3/MCP61M-M3
      RIP: 0010:[] css_is_ancestor+0x20/0x70
      RSP: 0018:ffff880077b09c88 EFLAGS: 00010202
      Process rmdir (pid: 19677, threadinfo ffff880077b08000, task ffff8800781bb310)
      Call Trace:
        [] mem_cgroup_same_or_subtree+0x33/0x40
        [] drain_all_stock+0x11f/0x170
        [] mem_cgroup_force_empty+0x231/0x6d0
        [] mem_cgroup_pre_destroy+0x14/0x20
        [] cgroup_rmdir+0xb9/0x500
        [] vfs_rmdir+0x86/0xe0
        [] do_rmdir+0xfb/0x110
        [] sys_rmdir+0x16/0x20
        [] system_call_fastpath+0x16/0x1b

    We are crashing because we try to dereference cached memcg when we are
    checking whether we should wait for draining on the cache. The cache is
    already cleaned up, though.

    There is also a theoretical chance that the cached memcg gets freed
    between we test for the FLUSHING_CACHED_CHARGE and dereference it in
    mem_cgroup_same_or_subtree:

      CPU0                    CPU1                     CPU2
      mem=stock->cached
                              stock->cached=NULL
                                                       clear_bit
                              test_and_set_bit
      test_bit() ...
                              mem_cgroup_destroy
      use after free

    The percpu_charge_mutex protected from this race because sync draining
    is exclusive. It is safer to revert now and come up with a more
    parallel implementation later.

I didn't remember this one at all! However, when you look at the
codebase from back then, there was no rcu-protection for memcg
lifetime, and drain_stock() didn't double check stock->cached inside
the work. Hence the crash during a race.

The drain code is different now: drain_local_stock() disables IRQs,
which holds up rcu, and then calls drain_stock() and drain_obj_stock(),
which both check stock->cached one more time before the deref.

With the workqueue managing concurrency, and rcu ensuring memcg
lifetime during the drain, this lock indeed seems unnecessary now.
Unless I'm missing something, it should just be removed instead.
""

The quote is slightly modified to pass checkpatch. Please see
https://lore.kernel.org/linux-mm/YQlPiLY0ieRb704V@cmpxchg.org/
for the unmodified version.
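For reference, the double-queuing protection relied on above is the
pending-bit test in queue_work_on(), abridged here from
kernel/workqueue.c of roughly this era (the exact body may differ
slightly between releases):

  bool queue_work_on(int cpu, struct workqueue_struct *wq,
                     struct work_struct *work)
  {
          bool ret = false;
          unsigned long flags;

          local_irq_save(flags);

          /*
           * Each work item carries its own PENDING bit; a caller that
           * races with an already-queued work simply gets false back
           * instead of queueing a duplicate.
           */
          if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
                                work_data_bits(work))) {
                  __queue_work(cpu, wq, work);
                  ret = true;
          }

          local_irq_restore(flags);
          return ret;
  }

So taking percpu_charge_mutex just to avoid spawning extra workers
duplicates a test the workqueue core already performs.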
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memcontrol.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7fe242d92802..711f1f60faa2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2050,7 +2050,6 @@ struct memcg_stock_pcp {
 #define FLUSHING_CACHED_CHARGE	0
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
-static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
 static void drain_obj_stock(struct obj_stock *stock);
@@ -2211,9 +2210,6 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 {
 	int cpu, curcpu;
 
-	/* If someone's already draining, avoid adding running more workers. */
-	if (!mutex_trylock(&percpu_charge_mutex))
-		return;
 	/*
 	 * Notify other cpus that system-wide "drain" is running
 	 * We do not care about races with the cpu hotplug because cpu down
@@ -2244,7 +2240,6 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		}
 	}
 	put_cpu();
-	mutex_unlock(&percpu_charge_mutex);
 }
 
 static int memcg_hotplug_cpu_dead(unsigned int cpu)
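For completeness, here is a simplified sketch of the drain path that
now provides the protection the mutex used to - not the verbatim
mm/memcontrol.c source (the obj_stock handling in particular is
condensed), just an illustration of the re-check described in the
quote above:

  static void drain_stock(struct memcg_stock_pcp *stock)
  {
          struct mem_cgroup *old = stock->cached;

          /* Re-check inside the worker: the stock may already be drained. */
          if (!old)
                  return;

          if (stock->nr_pages) {
                  page_counter_uncharge(&old->memory, stock->nr_pages);
                  if (do_memsw_account())
                          page_counter_uncharge(&old->memsw, stock->nr_pages);
                  stock->nr_pages = 0;
          }

          css_put(&old->css);
          stock->cached = NULL;
  }

  static void drain_local_stock(struct work_struct *dummy)
  {
          struct memcg_stock_pcp *stock;
          unsigned long flags;

          /*
           * IRQs off: as the quote above notes, this also holds up RCU,
           * so the cached memcg cannot be freed while we drain.
           */
          local_irq_save(flags);

          stock = this_cpu_ptr(&memcg_stock);
          /* Condensed: the real code drains each per-cpu obj stock,
           * and drain_obj_stock() re-checks its cached objcg too. */
          drain_obj_stock(&stock->irq_obj);
          drain_stock(stock);
          clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);

          local_irq_restore(flags);
  }

The workqueue pending bit plus this in-worker re-check is what makes
the mutex redundant.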