From patchwork Wed Dec 22 11:41:09 2021
From: Sebastian Andrzej Siewior
To: cgroups@vger.kernel.org, linux-mm@kvack.org
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
    Thomas Gleixner, Waiman Long, Peter Zijlstra,
    Sebastian Andrzej Siewior
Subject: [RFC PATCH 1/3] mm/memcg: Protect per-CPU counter by disabling preemption on PREEMPT_RT
Date: Wed, 22 Dec 2021 12:41:09 +0100
Message-Id: <20211222114111.2206248-2-bigeasy@linutronix.de>
In-Reply-To: <20211222114111.2206248-1-bigeasy@linutronix.de>
References: <20211222114111.2206248-1-bigeasy@linutronix.de>

The per-CPU counters are modified with the non-atomic modifier. Their
consistency is ensured by disabling interrupts for the update. This breaks
on PREEMPT_RT because some sections additionally acquire a spinlock_t lock
(which becomes sleeping and must not be acquired with interrupts disabled).
Another problem is that mem_cgroup_swapout() expects to be invoked with
interrupts disabled because the caller has to acquire a spinlock_t, which is
acquired with interrupts disabled. Since spinlock_t never disables
interrupts on PREEMPT_RT, interrupts are never disabled at this point.

The code is never called from in_irq() context on PREEMPT_RT, therefore
disabling preemption during the update is sufficient on PREEMPT_RT. The
sections with preemption disabled must exclude memcg_check_events() so that
spinlock_t locks can still be acquired (for instance in eventfd_signal()).

Don't disable interrupts during updates of the per-CPU variables; instead
use shorter sections which disable preemption.

Signed-off-by: Sebastian Andrzej Siewior
---
 mm/memcontrol.c | 74 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 60 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2ed5f2a0879d3..d2687d5ed544b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -671,8 +671,14 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val)
 	if (mem_cgroup_disabled())
 		return;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 	memcg_rstat_updated(memcg);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
@@ -699,6 +705,9 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
 	memcg = pn->memcg;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	/* Update memcg */
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 
@@ -706,6 +715,9 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
 
 	memcg_rstat_updated(memcg);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 /**
@@ -788,8 +800,13 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 	if (mem_cgroup_disabled())
 		return;
 
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
 	memcg_rstat_updated(memcg);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
@@ -810,6 +827,9 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 					 int nr_pages)
 {
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
+
 	/* pagein of a big page is an event. So, ignore page size */
 	if (nr_pages > 0)
 		__count_memcg_events(memcg, PGPGIN, 1);
@@ -819,12 +839,19 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
 	}
 
 	__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
 }
 
 static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 				       enum mem_cgroup_events_target target)
 {
 	unsigned long val, next;
+	bool ret = false;
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_disable();
 
 	val = __this_cpu_read(memcg->vmstats_percpu->nr_page_events);
 	next = __this_cpu_read(memcg->vmstats_percpu->targets[target]);
@@ -841,9 +868,11 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 			break;
 		}
 		__this_cpu_write(memcg->vmstats_percpu->targets[target], next);
-		return true;
+		ret = true;
 	}
-	return false;
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		preempt_enable();
+	return ret;
 }
 
 /*
@@ -5645,12 +5674,14 @@ static int mem_cgroup_move_account(struct page *page,
 	ret = 0;
 	nid = folio_nid(folio);
 
-	local_irq_disable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_disable();
 	mem_cgroup_charge_statistics(to, nr_pages);
 	memcg_check_events(to, nid);
 	mem_cgroup_charge_statistics(from, -nr_pages);
 	memcg_check_events(from, nid);
-	local_irq_enable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_enable();
 out_unlock:
 	folio_unlock(folio);
 out:
@@ -6670,10 +6701,12 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 	css_get(&memcg->css);
 	commit_charge(folio, memcg);
 
-	local_irq_disable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_disable();
 	mem_cgroup_charge_statistics(memcg, nr_pages);
 	memcg_check_events(memcg, folio_nid(folio));
-	local_irq_enable();
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_enable();
 out:
 	return ret;
 }
@@ -6785,11 +6818,20 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 			memcg_oom_recover(ug->memcg);
 	}
 
-	local_irq_save(flags);
-	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
-	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
-	memcg_check_events(ug->memcg, ug->nid);
-	local_irq_restore(flags);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
+		local_irq_save(flags);
+		__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
+		__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+		memcg_check_events(ug->memcg, ug->nid);
+		local_irq_restore(flags);
+	} else {
+		preempt_disable();
+		__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
+		__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+		preempt_enable();
+
+		memcg_check_events(ug->memcg, ug->nid);
+	}
 
 	/* drop reference from uncharge_folio */
 	css_put(&ug->memcg->css);
@@ -6930,10 +6972,12 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
 	css_get(&memcg->css);
 	commit_charge(new, memcg);
 
-	local_irq_save(flags);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_save(flags);
 	mem_cgroup_charge_statistics(memcg, nr_pages);
 	memcg_check_events(memcg, folio_nid(new));
-	local_irq_restore(flags);
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_restore(flags);
 }
 
 DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
@@ -7157,8 +7201,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 	 * i_pages lock which is taken with interrupts-off. It is
 	 * important here to have the interrupts disabled because it is the
 	 * only synchronisation we have for updating the per-CPU variables.
+	 * On PREEMPT_RT interrupts are never disabled and the updates to per-CPU
+	 * variables are synchronised by keeping preemption disabled.
 	 */
-	VM_BUG_ON(!irqs_disabled());
+	VM_BUG_ON(!IS_ENABLED(CONFIG_PREEMPT_RT) && !irqs_disabled());
 
 	mem_cgroup_charge_statistics(memcg, -nr_entries);
 	memcg_check_events(memcg, page_to_nid(page));

From patchwork Wed Dec 22 11:41:10 2021
From: Sebastian Andrzej Siewior
To: cgroups@vger.kernel.org, linux-mm@kvack.org
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
    Thomas Gleixner, Waiman Long, Peter Zijlstra,
    Sebastian Andrzej Siewior
Subject: [RFC PATCH 2/3] mm/memcg: Add a local_lock_t for IRQ and TASK object.
Date: Wed, 22 Dec 2021 12:41:10 +0100
Message-Id: <20211222114111.2206248-3-bigeasy@linutronix.de>
In-Reply-To: <20211222114111.2206248-1-bigeasy@linutronix.de>
References: <20211222114111.2206248-1-bigeasy@linutronix.de>

The members of the per-CPU structure memcg_stock_pcp are protected either
by disabling interrupts or by disabling preemption if the invocation
occurred in process context. Disabling interrupts protects most of the
structure excluding task_obj, while disabling preemption protects only
task_obj. This scheme is incompatible with PREEMPT_RT because it creates
atomic context in which actions are performed which require preemptible
context. One example is obj_cgroup_release().

The IRQ-disable and preempt-disable sections can be replaced with
local_lock_t, which preserves the explicit disabling of interrupts while
keeping the code preemptible on PREEMPT_RT.

The task_obj has been added for performance reasons on non-preemptible
kernels where preempt_disable() is a NOP. With the PREEMPT_RT preemption
model, preempt_disable() is always implemented. Also, there are no memory
allocations from in_irq() context and softirqs are processed in
(preemptible) process context. Therefore it makes sense to avoid using
task_obj there.

Don't use task_obj on PREEMPT_RT and replace the manual disabling of
interrupts with a local_lock_t. This change requires some refactoring:

- drain_obj_stock() drops a reference on obj_cgroup which leads to an
  invocation of obj_cgroup_release() if it is the last object. This in
  turn leads to recursive locking of the local_lock_t. To avoid this,
  obj_cgroup_release() is invoked outside of the locked section.

- drain_obj_stock() gets a memcg_stock_pcp passed if the stock_lock has
  been acquired (instead of the task_obj_lock) to avoid recursive locking
  later in refill_stock().

- drain_all_stock() disables preemption via get_cpu() and then invokes
  drain_local_stock() if it is the local CPU, to avoid scheduling a worker
  (which invokes the same function). Disabling preemption here is
  problematic due to the sleeping locks in drain_local_stock(). This can
  be avoided by always scheduling a worker, even for the local CPU. Using
  cpus_read_lock() stabilizes cpu_online_mask, which ensures that no
  worker is scheduled for an offline CPU. Since there is no flush_work(),
  it is still possible that a worker runs on the wrong CPU, but that is
  okay since it always operates on the local-CPU data.

- drain_local_stock() is always invoked as a worker, so it can be
  optimized by removing the in_task() check (it is always true) and
  avoiding the "irq_save" variant because interrupts are always enabled
  here. Operating on task_obj first allows acquiring the local_lock_t
  without lockdep complaints.
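
For reference, the conversion follows the generic local_lock_t pattern;
the snippet below is only an illustrative sketch with made-up names
(example_pcp, example_add), not the memcg code itself:

#include <linux/local_lock.h>
#include <linux/percpu.h>

struct example_pcp {
	local_lock_t lock;	/* protects 'counter' on this CPU */
	unsigned long counter;
};

static DEFINE_PER_CPU(struct example_pcp, example_pcp) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

static void example_add(unsigned long val)
{
	unsigned long flags;

	/*
	 * On !PREEMPT_RT this still disables interrupts as before; on
	 * PREEMPT_RT it becomes a per-CPU sleeping lock and the section
	 * stays preemptible.
	 */
	local_lock_irqsave(&example_pcp.lock, flags);
	this_cpu_add(example_pcp.counter, val);
	local_unlock_irqrestore(&example_pcp.lock, flags);
}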
Signed-off-by: Sebastian Andrzej Siewior
---
 mm/memcontrol.c | 171 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 112 insertions(+), 59 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d2687d5ed544b..1e76f26be2c15 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -261,8 +261,10 @@ bool mem_cgroup_kmem_disabled(void)
 	return cgroup_memory_nokmem;
 }
 
+struct memcg_stock_pcp;
 static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
-				      unsigned int nr_pages);
+				      unsigned int nr_pages,
+				      struct memcg_stock_pcp *stock_pcp);
 
 static void obj_cgroup_release(struct percpu_ref *ref)
 {
@@ -296,7 +298,7 @@ static void obj_cgroup_release(struct percpu_ref *ref)
 	nr_pages = nr_bytes >> PAGE_SHIFT;
 
 	if (nr_pages)
-		obj_cgroup_uncharge_pages(objcg, nr_pages);
+		obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
 
 	spin_lock_irqsave(&css_set_lock, flags);
 	list_del(&objcg->list);
@@ -2120,26 +2122,40 @@ struct obj_stock {
 };
 
 struct memcg_stock_pcp {
+	/* Protects memcg_stock_pcp */
+	local_lock_t stock_lock;
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
+#ifndef CONFIG_PREEMPT_RT
+	/* Protects only task_obj */
+	local_lock_t task_obj_lock;
 	struct obj_stock task_obj;
+#endif
 	struct obj_stock irq_obj;
 
 	struct work_struct work;
 	unsigned long flags;
 #define FLUSHING_CACHED_CHARGE	0
 };
-static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock);
+static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
+	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
+#ifndef CONFIG_PREEMPT_RT
+	.task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
+#endif
+};
 static DEFINE_MUTEX(percpu_charge_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
-static void drain_obj_stock(struct obj_stock *stock);
+static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+					  struct memcg_stock_pcp *stock_pcp);
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg);
 
 #else
-static inline void drain_obj_stock(struct obj_stock *stock)
+static inline struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+						 struct memcg_stock_pcp *stock_pcp)
 {
+	return NULL;
 }
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 				     struct mem_cgroup *root_memcg)
@@ -2168,7 +2184,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 	if (nr_pages > MEMCG_CHARGE_BATCH)
 		return ret;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
 
 	stock = this_cpu_ptr(&memcg_stock);
 	if (memcg == stock->cached && stock->nr_pages >= nr_pages) {
@@ -2176,7 +2192,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 		ret = true;
 	}
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 
 	return ret;
 }
@@ -2204,38 +2220,43 @@ static void drain_stock(struct memcg_stock_pcp *stock)
 
 static void drain_local_stock(struct work_struct *dummy)
 {
-	struct memcg_stock_pcp *stock;
-	unsigned long flags;
+	struct memcg_stock_pcp *stock_pcp;
+	struct obj_cgroup *old;
 
 	/*
 	 * The only protection from cpu hotplug (memcg_hotplug_cpu_dead) vs.
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-	local_irq_save(flags);
+#ifndef CONFIG_PREEMPT_RT
+	local_lock(&memcg_stock.task_obj_lock);
+	old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
+	local_unlock(&memcg_stock.task_obj_lock);
+	if (old)
+		obj_cgroup_put(old);
+#endif
 
-	stock = this_cpu_ptr(&memcg_stock);
-	drain_obj_stock(&stock->irq_obj);
-	if (in_task())
-		drain_obj_stock(&stock->task_obj);
-	drain_stock(stock);
-	clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags);
+	local_lock_irq(&memcg_stock.stock_lock);
+	stock_pcp = this_cpu_ptr(&memcg_stock);
+	old = drain_obj_stock(&stock_pcp->irq_obj, stock_pcp);
 
-	local_irq_restore(flags);
+	drain_stock(stock_pcp);
+	clear_bit(FLUSHING_CACHED_CHARGE, &stock_pcp->flags);
+
+	local_unlock_irq(&memcg_stock.stock_lock);
+	if (old)
+		obj_cgroup_put(old);
 }
 
 /*
  * Cache charges(val) to local per_cpu area.
  * This will be consumed by consume_stock() function, later.
  */
-static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
+static void __refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+			   struct memcg_stock_pcp *stock)
 {
-	struct memcg_stock_pcp *stock;
-	unsigned long flags;
+	lockdep_assert_held(&stock->stock_lock);
 
-	local_irq_save(flags);
-
-	stock = this_cpu_ptr(&memcg_stock);
 	if (stock->cached != memcg) { /* reset if necessary */
 		drain_stock(stock);
 		css_get(&memcg->css);
@@ -2245,8 +2266,20 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
 
 	if (stock->nr_pages > MEMCG_CHARGE_BATCH)
 		drain_stock(stock);
+}
 
-	local_irq_restore(flags);
+static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages,
+			 struct memcg_stock_pcp *stock_pcp)
+{
+	unsigned long flags;
+
+	if (stock_pcp) {
+		__refill_stock(memcg, nr_pages, stock_pcp);
+		return;
+	}
+	local_lock_irqsave(&memcg_stock.stock_lock, flags);
+	__refill_stock(memcg, nr_pages, this_cpu_ptr(&memcg_stock));
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
 
 /*
@@ -2255,7 +2288,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  */
 static void drain_all_stock(struct mem_cgroup *root_memcg)
 {
-	int cpu, curcpu;
+	int cpu;
 
 	/* If someone's already draining, avoid adding running more workers. */
 	if (!mutex_trylock(&percpu_charge_mutex))
@@ -2266,7 +2299,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 	 * as well as workers from this path always operate on the local
 	 * per-cpu data. CPU up doesn't touch memcg_stock at all.
 	 */
-	curcpu = get_cpu();
+	cpus_read_lock();
 	for_each_online_cpu(cpu) {
 		struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
 		struct mem_cgroup *memcg;
@@ -2282,14 +2315,10 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
 		rcu_read_unlock();
 
 		if (flush &&
-		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
-			if (cpu == curcpu)
-				drain_local_stock(&stock->work);
-			else
-				schedule_work_on(cpu, &stock->work);
-		}
+		    !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags))
+			schedule_work_on(cpu, &stock->work);
 	}
-	put_cpu();
+	cpus_read_unlock();
 	mutex_unlock(&percpu_charge_mutex);
 }
 
@@ -2690,7 +2719,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 done_restock:
 	if (batch > nr_pages)
-		refill_stock(memcg, batch - nr_pages);
+		refill_stock(memcg, batch - nr_pages, NULL);
 
 	/*
 	 * If the hierarchy is above the normal consumption range, schedule
@@ -2803,28 +2832,35 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
  * can only be accessed after disabling interrupt. User context code can
  * access interrupt object stock, but not vice versa.
  */
-static inline struct obj_stock *get_obj_stock(unsigned long *pflags)
+static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
+					      struct memcg_stock_pcp **stock_pcp)
 {
 	struct memcg_stock_pcp *stock;
 
+#ifndef CONFIG_PREEMPT_RT
 	if (likely(in_task())) {
 		*pflags = 0UL;
-		preempt_disable();
+		*stock_pcp = NULL;
+		local_lock(&memcg_stock.task_obj_lock);
 		stock = this_cpu_ptr(&memcg_stock);
 		return &stock->task_obj;
 	}
-
-	local_irq_save(*pflags);
+#endif
+	local_lock_irqsave(&memcg_stock.stock_lock, *pflags);
 	stock = this_cpu_ptr(&memcg_stock);
+	*stock_pcp = stock;
 	return &stock->irq_obj;
 }
 
-static inline void put_obj_stock(unsigned long flags)
+static inline void put_obj_stock(unsigned long flags,
+				 struct memcg_stock_pcp *stock_pcp)
 {
-	if (likely(in_task()))
-		preempt_enable();
+#ifndef CONFIG_PREEMPT_RT
+	if (likely(!stock_pcp))
+		local_unlock(&memcg_stock.task_obj_lock);
 	else
-		local_irq_restore(flags);
+#endif
+	local_unlock_irqrestore(&memcg_stock.stock_lock, flags);
 }
 
 /*
@@ -3002,7 +3038,8 @@ static void memcg_free_cache_id(int id)
  * @nr_pages: number of pages to uncharge
  */
 static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
-				      unsigned int nr_pages)
+				      unsigned int nr_pages,
+				      struct memcg_stock_pcp *stock_pcp)
 {
 	struct mem_cgroup *memcg;
 
@@ -3010,7 +3047,7 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
 		page_counter_uncharge(&memcg->kmem, nr_pages);
 
-	refill_stock(memcg, nr_pages);
+	refill_stock(memcg, nr_pages, stock_pcp);
 
 	css_put(&memcg->css);
 }
@@ -3084,7 +3121,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 		return;
 
 	objcg = __folio_objcg(folio);
-	obj_cgroup_uncharge_pages(objcg, nr_pages);
+	obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
 	folio->memcg_data = 0;
 	obj_cgroup_put(objcg);
 }
@@ -3092,17 +3129,21 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
 void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 		     enum node_stat_item idx, int nr)
 {
+	struct memcg_stock_pcp *stock_pcp;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
+	struct obj_cgroup *old = NULL;
+	struct obj_stock *stock;
 	int *bytes;
 
+	stock = get_obj_stock(&flags, &stock_pcp);
 	/*
 	 * Save vmstat data in stock and skip vmstat array update unless
 	 * accumulating over a page of vmstat data or when pgdat or idx
 	 * changes.
 	 */
 	if (stock->cached_objcg != objcg) {
-		drain_obj_stock(stock);
+		old = drain_obj_stock(stock, stock_pcp);
+
 		obj_cgroup_get(objcg);
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
 				? atomic_xchg(&objcg->nr_charged_bytes, 0) : 0;
@@ -3146,38 +3187,43 @@ void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat,
 	if (nr)
 		mod_objcg_mlstate(objcg, pgdat, idx, nr);
 
-	put_obj_stock(flags);
+	put_obj_stock(flags, stock_pcp);
+	if (old)
+		obj_cgroup_put(old);
 }
 
 static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes)
 {
+	struct memcg_stock_pcp *stock_pcp;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
+	struct obj_stock *stock;
 	bool ret = false;
 
+	stock = get_obj_stock(&flags, &stock_pcp);
 	if (objcg == stock->cached_objcg && stock->nr_bytes >= nr_bytes) {
 		stock->nr_bytes -= nr_bytes;
 		ret = true;
 	}
 
-	put_obj_stock(flags);
+	put_obj_stock(flags, stock_pcp);
 
 	return ret;
 }
 
-static void drain_obj_stock(struct obj_stock *stock)
+static struct obj_cgroup *drain_obj_stock(struct obj_stock *stock,
+					  struct memcg_stock_pcp *stock_pcp)
 {
 	struct obj_cgroup *old = stock->cached_objcg;
 
 	if (!old)
-		return;
+		return NULL;
 
 	if (stock->nr_bytes) {
 		unsigned int nr_pages = stock->nr_bytes >> PAGE_SHIFT;
 		unsigned int nr_bytes = stock->nr_bytes & (PAGE_SIZE - 1);
 
 		if (nr_pages)
-			obj_cgroup_uncharge_pages(old, nr_pages);
+			obj_cgroup_uncharge_pages(old, nr_pages, stock_pcp);
 
 		/*
 		 * The leftover is flushed to the centralized per-memcg value.
@@ -3212,8 +3258,8 @@ static void drain_obj_stock(struct obj_stock *stock)
 		stock->cached_pgdat = NULL;
 	}
 
-	obj_cgroup_put(old);
 	stock->cached_objcg = NULL;
+	return old;
 }
 
 static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
@@ -3221,11 +3267,13 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
+#ifndef CONFIG_PREEMPT_RT
 	if (in_task() && stock->task_obj.cached_objcg) {
 		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
 			return true;
 	}
+#endif
 	if (stock->irq_obj.cached_objcg) {
 		memcg = obj_cgroup_memcg(stock->irq_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
@@ -3238,12 +3286,15 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 			     bool allow_uncharge)
 {
+	struct memcg_stock_pcp *stock_pcp;
 	unsigned long flags;
-	struct obj_stock *stock = get_obj_stock(&flags);
+	struct obj_stock *stock;
 	unsigned int nr_pages = 0;
+	struct obj_cgroup *old = NULL;
 
+	stock = get_obj_stock(&flags, &stock_pcp);
 	if (stock->cached_objcg != objcg) { /* reset if necessary */
-		drain_obj_stock(stock);
+		old = drain_obj_stock(stock, stock_pcp);
 		obj_cgroup_get(objcg);
 		stock->cached_objcg = objcg;
 		stock->nr_bytes = atomic_read(&objcg->nr_charged_bytes)
@@ -3257,10 +3308,12 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes,
 		stock->nr_bytes &= (PAGE_SIZE - 1);
 	}
 
-	put_obj_stock(flags);
+	put_obj_stock(flags, stock_pcp);
+	if (old)
+		obj_cgroup_put(old);
 
 	if (nr_pages)
-		obj_cgroup_uncharge_pages(objcg, nr_pages);
+		obj_cgroup_uncharge_pages(objcg, nr_pages, NULL);
 }
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
@@ -7061,7 +7114,7 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup *memcg, unsigned int nr_pages)
 
 	mod_memcg_state(memcg, MEMCG_SOCK, -nr_pages);
 
-	refill_stock(memcg, nr_pages);
+	refill_stock(memcg, nr_pages, NULL);
 }
 
 static int __init cgroup_memory(char *s)

From patchwork Wed Dec 22 11:41:11 2021
From: Sebastian Andrzej Siewior
To: cgroups@vger.kernel.org, linux-mm@kvack.org
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton,
    Thomas Gleixner, Waiman Long, Peter Zijlstra,
    Sebastian Andrzej Siewior
Subject: [RFC PATCH 3/3] mm/memcg: Allow the task_obj optimization only on non-PREEMPTIBLE kernels.
Date: Wed, 22 Dec 2021 12:41:11 +0100
Message-Id: <20211222114111.2206248-4-bigeasy@linutronix.de>
In-Reply-To: <20211222114111.2206248-1-bigeasy@linutronix.de>
References: <20211222114111.2206248-1-bigeasy@linutronix.de>

Based on my understanding, the optimisation with task_obj for in_task()
makes sense on non-PREEMPTIBLE kernels because preempt_disable()/enable()
is optimized away. It could then be restricted to !CONFIG_PREEMPTION
kernels instead of only PREEMPT_RT. With CONFIG_PREEMPT_DYNAMIC a
non-PREEMPTIBLE kernel can also be configured, but those kernels always
have preempt_disable()/enable() present, so the optimisation probably
makes no sense there either.

Restrict the optimisation to !CONFIG_PREEMPTION kernels.

Signed-off-by: Sebastian Andrzej Siewior
---
 mm/memcontrol.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1e76f26be2c15..92180f1aa9edc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2126,7 +2126,7 @@ struct memcg_stock_pcp {
 	local_lock_t stock_lock;
 	struct mem_cgroup *cached; /* this never be root cgroup */
 	unsigned int nr_pages;
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	/* Protects only task_obj */
 	local_lock_t task_obj_lock;
 	struct obj_stock task_obj;
@@ -2139,7 +2139,7 @@ struct memcg_stock_pcp {
 };
 static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = {
 	.stock_lock = INIT_LOCAL_LOCK(stock_lock),
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	.task_obj_lock = INIT_LOCAL_LOCK(task_obj_lock),
 #endif
 };
@@ -2228,7 +2228,7 @@ static void drain_local_stock(struct work_struct *dummy)
 	 * drain_stock races is that we always operate on local CPU stock
 	 * here with IRQ disabled
 	 */
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	local_lock(&memcg_stock.task_obj_lock);
 	old = drain_obj_stock(&this_cpu_ptr(&memcg_stock)->task_obj, NULL);
 	local_unlock(&memcg_stock.task_obj_lock);
@@ -2837,7 +2837,7 @@ static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
 {
 	struct memcg_stock_pcp *stock;
 
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (likely(in_task())) {
 		*pflags = 0UL;
 		*stock_pcp = NULL;
@@ -2855,7 +2855,7 @@ static inline struct obj_stock *get_obj_stock(unsigned long *pflags,
 static inline void put_obj_stock(unsigned long flags,
 				 struct memcg_stock_pcp *stock_pcp)
 {
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (likely(!stock_pcp))
 		local_unlock(&memcg_stock.task_obj_lock);
 	else
@@ -3267,7 +3267,7 @@ static bool obj_stock_flush_required(struct memcg_stock_pcp *stock,
 {
 	struct mem_cgroup *memcg;
 
-#ifndef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPTION
 	if (in_task() && stock->task_obj.cached_objcg) {
 		memcg = obj_cgroup_memcg(stock->task_obj.cached_objcg);
 		if (memcg && mem_cgroup_is_descendant(memcg, root_memcg))
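
As context for the !CONFIG_PREEMPTION restriction above, the trade-off can
be sketched as follows; this is an illustration only, with a hypothetical
helper name, not code from the patch:

/*
 * Illustration: why a preempt_disable()-based task-context fast path is
 * essentially free on !CONFIG_PREEMPTION kernels but not on preemptible
 * (including PREEMPT_DYNAMIC) kernels.
 */
static void example_task_context_update(void)
{
#ifndef CONFIG_PREEMPTION
	/*
	 * Without CONFIG_PREEMPTION (and no other preempt-count users),
	 * preempt_disable()/preempt_enable() typically reduce to compiler
	 * barriers, so guarding a task-context-only per-CPU stock this way
	 * costs nothing.
	 */
	preempt_disable();
	/* ... operate on the task-context per-CPU stock here ... */
	preempt_enable();
#else
	/*
	 * With CONFIG_PREEMPTION the preempt counter is always maintained,
	 * so the separate task-context fast path no longer saves anything
	 * over taking the regular per-CPU lock.
	 */
#endif
}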