From patchwork Sat Jan 8 16:43:46 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707532 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A6D7DC433F5 for ; Sat, 8 Jan 2022 16:44:25 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 0A2416B0071; Sat, 8 Jan 2022 11:44:22 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id DC35F6B007B; Sat, 8 Jan 2022 11:44:21 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C649D6B0073; Sat, 8 Jan 2022 11:44:21 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0060.hostedemail.com [216.40.44.60]) by kanga.kvack.org (Postfix) with ESMTP id 876446B0073 for ; Sat, 8 Jan 2022 11:44:21 -0500 (EST) Received: from smtpin25.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 354F0181B048C for ; Sat, 8 Jan 2022 16:44:21 +0000 (UTC) X-FDA: 79007692722.25.1EDDE4E Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by imf22.hostedemail.com (Postfix) with ESMTP id B8D5FC0016 for ; Sat, 8 Jan 2022 16:44:20 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id A0100B80B3E; Sat, 8 Jan 2022 16:44:19 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3E7CDC36AF2; Sat, 8 Jan 2022 16:44:18 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660258; bh=UNKWT50c2f68zkUzaFiQUCdKH55gB0N7A0rbX5LXlDg=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=ZczwHgJCsbxYQ+miDqJBqZB+PJZL285KFnyINpOxf6YtIR3q7IX8VHVvHlx4SJmza C2Q5btFR7YvhIRGjiHskhEF48wywxtqHFz0bs4PHvhD8cq2ECMxLPnqL0ityfYR6DL R3LzaGkC36Ezj/uRUM8j4zKQEDxSZlQA6315KPBQ8ndjiBO/M/DSR5yQaAh5hEHOWS CgKrhoM/iC3dtTzRrsZp6iMc4CT+q4je+3HY4bZZKOE3JsWcz6ObJ/NwT789Fn5bO3 rKeDuJLuVQtkLsg50jMuXUqIEG0pWzTsl5i8ZTUDG3SwKJKP3OisChDiA2bDWcRhkn BukE1CH6jJjVg== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 01/23] membarrier: Document why membarrier() works Date: Sat, 8 Jan 2022 08:43:46 -0800 Message-Id: X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Queue-Id: B8D5FC0016 X-Stat-Signature: 3a69naw9bonp99k38cbefxgf73jnncek Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=ZczwHgJC; spf=pass (imf22.hostedemail.com: domain of luto@kernel.org designates 145.40.68.75 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-Rspamd-Server: rspam10 X-HE-Tag: 1641660260-663569 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: We had a nice comment at the top of membarrier.c explaining why membarrier worked in a handful of scenarios, but that 
consisted more of a list of things not to forget than an actual description of the algorithm and why it should be expected to work. Add a comment explaining my understanding of the algorithm. This exposes a couple of implementation issues that I will hopefully fix up in subsequent patches. Cc: Mathieu Desnoyers Cc: Nicholas Piggin Cc: Peter Zijlstra Signed-off-by: Andy Lutomirski Reviewed-by: Mathieu Desnoyers --- kernel/sched/membarrier.c | 60 +++++++++++++++++++++++++++++++++++++-- 1 file changed, 58 insertions(+), 2 deletions(-) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index b5add64d9698..30e964b9689d 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -7,8 +7,64 @@ #include "sched.h" /* - * For documentation purposes, here are some membarrier ordering - * scenarios to keep in mind: + * The basic principle behind the regular memory barrier mode of + * membarrier() is as follows. membarrier() is called in one thread. It + * iterates over all CPUs, and, for each CPU, it either sends an IPI to + * that CPU or it does not. If it sends an IPI, then we have the + * following sequence of events: + * + * 1. membarrier() does smp_mb(). + * 2. membarrier() does a store (the IPI request payload) that is observed by + * the target CPU. + * 3. The target CPU does smp_mb(). + * 4. The target CPU does a store (the completion indication) that is observed + * by membarrier()'s wait-for-IPIs-to-finish request. + * 5. membarrier() does smp_mb(). + * + * So all pre-membarrier() local accesses are visible after the IPI on the + * target CPU and all pre-IPI remote accesses are visible after + * membarrier(). IOW membarrier() has synchronized both ways with the target + * CPU. + * + * (This has the caveat that membarrier() does not interrupt the CPU that it's + * running on at the time it sends the IPIs. However, if that is the CPU on + * which membarrier() starts and/or finishes, membarrier() does smp_mb() and, + * if not, then the scheduler's migration of membarrier() is a full barrier.) + * + * membarrier() skips sending an IPI only if membarrier() sees + * cpu_rq(cpu)->curr->mm != target mm. The sequence of events is: + * + * membarrier() | target CPU + * --------------------------------------------------------------------- + * | 1. smp_mb() + * | 2. set rq->curr->mm = other_mm + * | (by writing to ->curr or to ->mm) + * 3. smp_mb() | + * 4. read rq->curr->mm == other_mm | + * 5. smp_mb() | + * | 6. rq->curr->mm = target_mm + * | (by writing to ->curr or to ->mm) + * | 7. smp_mb() + * | + * + * All memory accesses on the target CPU prior to scheduling are visible + * to membarrier()'s caller after membarrier() returns due to steps 1, 2, 4 + * and 5. + * + * All memory accesses by membarrier()'s caller prior to membarrier() are + * visible to the target CPU after scheduling due to steps 3, 4, 6, and 7. + * + * Note that tasks can change their ->mm, e.g. via kthread_use_mm(). So + * tasks that switch their ->mm must follow the same rules as the scheduler + * changing rq->curr, and the membarrier() code needs to do both dereferences + * carefully. + * + * GLOBAL_EXPEDITED support works the same way except that all references + * to rq->curr->mm are replaced with references to rq->membarrier_state.
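/*
 * [Editorial illustration, not part of the patch: a deliberately simplified
 * sketch of the scan-and-IPI loop described above. The real implementation
 * is membarrier_private_expedited() later in this file; the helper names
 * below (membarrier_expedited_sketch, ipi_mb_sketch) are placeholders, and
 * cpumask batching, error handling, and the GLOBAL_EXPEDITED variant are
 * omitted.]
 */
static void ipi_mb_sketch(void *info)
{
	smp_mb();	/* the target-CPU barrier (step 3 of the IPI case) */
}

static void membarrier_expedited_sketch(struct mm_struct *target_mm)
{
	int cpu;

	smp_mb();	/* step 1: order the caller's prior accesses */
	cpus_read_lock();
	for_each_online_cpu(cpu) {
		struct task_struct *p;

		rcu_read_lock();
		p = rcu_dereference(cpu_rq(cpu)->curr);
		if (p && READ_ONCE(p->mm) == target_mm)
			/* wait=1 provides the wait-for-IPIs-to-finish step */
			smp_call_function_single(cpu, ipi_mb_sketch, NULL, 1);
		rcu_read_unlock();
	}
	cpus_read_unlock();
	smp_mb();	/* step 5: order against the caller's later accesses */
}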
+ * + * + * Specific examples of how this produces the documented properties of + * membarrier(): * * A) Userspace thread execution after IPI vs membarrier's memory * barrier before sending the IPI From patchwork Sat Jan 8 16:43:47 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707530 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3B976C433FE for ; Sat, 8 Jan 2022 16:44:22 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 88DFC6B0074; Sat, 8 Jan 2022 11:44:21 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 86DB96B0075; Sat, 8 Jan 2022 11:44:21 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 70CEA6B0074; Sat, 8 Jan 2022 11:44:21 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0030.hostedemail.com [216.40.44.30]) by kanga.kvack.org (Postfix) with ESMTP id 5E2E66B0071 for ; Sat, 8 Jan 2022 11:44:21 -0500 (EST) Received: from smtpin29.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 1603280E726A for ; Sat, 8 Jan 2022 16:44:21 +0000 (UTC) X-FDA: 79007692722.29.A316F3B Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf02.hostedemail.com (Postfix) with ESMTP id 9CF0B80015 for ; Sat, 8 Jan 2022 16:44:20 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 9BFE460C91; Sat, 8 Jan 2022 16:44:19 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 49C0DC36AE3; Sat, 8 Jan 2022 16:44:19 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660259; bh=luXaTkuUOzn73UJhbrp8yKhm9hf+Uj848KliPU1OO9w=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=EwQbQtbbDo6gJrDXBM/q8zNDQ2YuHk3PRm9ab7IdxAnUz/hEwb2Us3d2xm0a6F4Sz aioxqBbMIshXmvZgHIcQ5+kdcUDPMN33Pw82ZRtwLpa116FbE5JypHK8bB63PsGnPX YK9FiNuWAsjsjsYw0/F4GOVYRVq/0uti65vMEAtecOtqdHsDMAMPq+g8OGJUL1KBz0 azxwUcdGqkMeVBjinm9Boluur8OzqpzRqulo+7L762mn12i8B3YwRN7DPMEnPfpSBP P0FrOCqE8InKncA7ywZJo0mHp8IA/PtTNbcgRDUqfPqxwN/89Lpf2R52697O2MDXtk NGBqpnUybgVbA== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 02/23] x86/mm: Handle unlazying membarrier core sync in the arch code Date: Sat, 8 Jan 2022 08:43:47 -0800 Message-Id: X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 Authentication-Results: imf02.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=EwQbQtbb; spf=pass (imf02.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 9CF0B80015 X-Stat-Signature: gho9kom6q9g6k7f6jtf9yqsjh8u7ssqw X-HE-Tag: 1641660260-245538 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, 
version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The core scheduler isn't a great place for membarrier_mm_sync_core_before_usermode() -- the core scheduler doesn't actually know whether we are lazy. With the old code, if a CPU is running a membarrier-registered task, goes idle, gets unlazied via a TLB shootdown IPI, and switches back to the membarrier-registered task, it will do an unnecessary core sync. Conveniently, x86 is the only architecture that does anything in this sync_core_before_usermode(), so membarrier_mm_sync_core_before_usermode() is a no-op on all other architectures and we can just move the code. (I am not claiming that the SYNC_CORE code was correct before or after this change on any non-x86 architecture. I merely claim that this change improves readability, is correct on x86, and makes no change on any other architecture.) Cc: Mathieu Desnoyers Cc: Nicholas Piggin Cc: Peter Zijlstra Signed-off-by: Andy Lutomirski Reviewed-by: Mathieu Desnoyers --- arch/x86/mm/tlb.c | 58 +++++++++++++++++++++++++++++++--------- include/linux/sched/mm.h | 13 --------- kernel/sched/core.c | 14 +++++----- 3 files changed, 53 insertions(+), 32 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 59ba2968af1b..1ae15172885e 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include @@ -485,6 +486,15 @@ void cr4_update_pce(void *ignored) static inline void cr4_update_pce_mm(struct mm_struct *mm) { } #endif +static void sync_core_if_membarrier_enabled(struct mm_struct *next) +{ +#ifdef CONFIG_MEMBARRIER + if (unlikely(atomic_read(&next->membarrier_state) & + MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)) + sync_core_before_usermode(); +#endif +} + void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk) { @@ -539,16 +549,24 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, this_cpu_write(cpu_tlbstate_shared.is_lazy, false); /* - * The membarrier system call requires a full memory barrier and - * core serialization before returning to user-space, after - * storing to rq->curr, when changing mm. This is because - * membarrier() sends IPIs to all CPUs that are in the target mm - * to make them issue memory barriers. However, if another CPU - * switches to/from the target mm concurrently with - * membarrier(), it can cause that CPU not to receive an IPI - * when it really should issue a memory barrier. Writing to CR3 - * provides that full memory barrier and core serializing - * instruction. + * membarrier() support requires that, when we change rq->curr->mm: + * + * - If next->mm has membarrier registered, a full memory barrier + * after writing rq->curr (or rq->curr->mm if we switched the mm + * without switching tasks) and before returning to user mode. + * + * - If next->mm has SYNC_CORE registered, then we sync core before + * returning to user mode. + * + * In the case where prev->mm == next->mm, membarrier() uses an IPI + * instead, and no particular barriers are needed while context + * switching. + * + * x86 gets all of this as a side-effect of writing to CR3 except + * in the case where we unlazy without flushing. + * + * All other architectures are civilized and do all of this implicitly + * when transitioning from kernel to user mode. 
*/ if (real_prev == next) { VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) != @@ -566,7 +584,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, /* * If the CPU is not in lazy TLB mode, we are just switching * from one thread in a process to another thread in the same - * process. No TLB flush required. + * process. No TLB flush or membarrier() synchronization + * is required. */ if (!was_lazy) return; @@ -576,16 +595,31 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, * If the TLB is up to date, just use it. * The barrier synchronizes with the tlb_gen increment in * the TLB shootdown code. + * + * As a future optimization opportunity, it's plausible + * that the x86 memory model is strong enough that this + * smp_mb() isn't needed. */ smp_mb(); next_tlb_gen = atomic64_read(&next->context.tlb_gen); if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) == - next_tlb_gen) + next_tlb_gen) { + /* + * We switched logical mm but we're not going to + * write to CR3. We already did smp_mb() above, + * but membarrier() might require a sync_core() + * as well. + */ + sync_core_if_membarrier_enabled(next); + return; + } /* * TLB contents went out of date while we were in lazy * mode. Fall through to the TLB switching code below. + * No need for an explicit membarrier invocation -- the CR3 + * write will serialize. */ new_asid = prev_asid; need_flush = true; diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 5561486fddef..c256a7fc0423 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -345,16 +345,6 @@ enum { #include #endif -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) -{ - if (current->mm != mm) - return; - if (likely(!(atomic_read(&mm->membarrier_state) & - MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE))) - return; - sync_core_before_usermode(); -} - extern void membarrier_exec_mmap(struct mm_struct *mm); extern void membarrier_update_current_mm(struct mm_struct *next_mm); @@ -370,9 +360,6 @@ static inline void membarrier_arch_switch_mm(struct mm_struct *prev, static inline void membarrier_exec_mmap(struct mm_struct *mm) { } -static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) -{ -} static inline void membarrier_update_current_mm(struct mm_struct *next_mm) { } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f21714ea3db8..6a1db8264c7b 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4822,22 +4822,22 @@ static struct rq *finish_task_switch(struct task_struct *prev) kmap_local_sched_in(); fire_sched_in_preempt_notifiers(current); + /* * When switching through a kernel thread, the loop in * membarrier_{private,global}_expedited() may have observed that * kernel thread and not issued an IPI. It is therefore possible to * schedule between user->kernel->user threads without passing though * switch_mm(). Membarrier requires a barrier after storing to - * rq->curr, before returning to userspace, so provide them here: + * rq->curr, before returning to userspace, and mmdrop() provides + * this barrier. * - * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly - * provided by mmdrop(), - * - a sync_core for SYNC_CORE. + * If an architecture needs to take a specific action for + * SYNC_CORE, it can do so in switch_mm_irqs_off(). 
*/ - if (mm) { - membarrier_mm_sync_core_before_usermode(mm); + if (mm) mmdrop(mm); - } + if (unlikely(prev_state == TASK_DEAD)) { if (prev->sched_class->task_dead) prev->sched_class->task_dead(prev); From patchwork Sat Jan 8 16:43:48 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707533 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 258DEC4321E for ; Sat, 8 Jan 2022 16:44:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 4C5E06B0073; Sat, 8 Jan 2022 11:44:22 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 474606B0078; Sat, 8 Jan 2022 11:44:22 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 33C826B007B; Sat, 8 Jan 2022 11:44:22 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0219.hostedemail.com [216.40.44.219]) by kanga.kvack.org (Postfix) with ESMTP id 1786A6B0073 for ; Sat, 8 Jan 2022 11:44:22 -0500 (EST) Received: from smtpin17.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id C3AC598C2D for ; Sat, 8 Jan 2022 16:44:21 +0000 (UTC) X-FDA: 79007692722.17.A2F9310 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf19.hostedemail.com (Postfix) with ESMTP id 6304F1A0004 for ; Sat, 8 Jan 2022 16:44:21 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id A222E60C33; Sat, 8 Jan 2022 16:44:20 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4E76AC36AE3; Sat, 8 Jan 2022 16:44:20 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660260; bh=bUrRFZAr479LQP/6WPwNI+9QSQKTdrq6o7ocsecDxbU=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=HZ333cMtnVBA70AW2odanHwWzISHcbq2v8sHljXikH2mkdQZhlc9Ej6nG9o6YiAGU JAMax/oSICJXzkPkJkuLtc+lJpSFR/2IJ8d2CxOZ5yDhFB4iL16bCvdurC3VmLS8p0 XdRjpXEIJcrn7YP5BfOvRkEIK+ZwGDlLSRR8mPRAQUjlWfwMfBhz+/6dl7q6A2supO teM2enjF/rxtXkArCAH64SyMEzGfoqitT7V3KfrpNbbcjQz3B/hvJX+vBCmUW3zdvT uNVP5rvUGXLYhJaRgFM1pEHYiruMtHIx7LEotVnN3xVRE/aq/dSs1skvVkCzrJ3OwX tjaFoWDe+Q3Hg== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 03/23] membarrier: Remove membarrier_arch_switch_mm() prototype in core code Date: Sat, 8 Jan 2022 08:43:48 -0800 Message-Id: <651efcf54f1d16467b12077b5366dfce587191d3.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Queue-Id: 6304F1A0004 X-Stat-Signature: 4z4wj5htomz8cx9erd4jxscubf4om4g6 Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=HZ333cMt; spf=pass (imf19.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-Rspamd-Server: rspam10 
X-HE-Tag: 1641660261-623652 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: membarrier_arch_switch_mm()'s sole implementation and caller are in arch/powerpc. Having a fallback implementation in include/linux is confusing -- remove it. It's still mentioned in a comment, but a subsequent patch will remove it. Cc: Mathieu Desnoyers Cc: Nicholas Piggin Cc: Peter Zijlstra Acked-by: Nicholas Piggin Acked-by: Mathieu Desnoyers Signed-off-by: Andy Lutomirski --- include/linux/sched/mm.h | 7 ------- 1 file changed, 7 deletions(-) diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index c256a7fc0423..0df706c099e5 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -350,13 +350,6 @@ extern void membarrier_exec_mmap(struct mm_struct *mm); extern void membarrier_update_current_mm(struct mm_struct *next_mm); #else -#ifdef CONFIG_ARCH_HAS_MEMBARRIER_CALLBACKS -static inline void membarrier_arch_switch_mm(struct mm_struct *prev, - struct mm_struct *next, - struct task_struct *tsk) -{ -} -#endif static inline void membarrier_exec_mmap(struct mm_struct *mm) { } From patchwork Sat Jan 8 16:43:49 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707534 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B619AC433F5 for ; Sat, 8 Jan 2022 16:44:28 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8E3856B0078; Sat, 8 Jan 2022 11:44:23 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 8947B6B007B; Sat, 8 Jan 2022 11:44:23 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6E8346B007D; Sat, 8 Jan 2022 11:44:23 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0067.hostedemail.com [216.40.44.67]) by kanga.kvack.org (Postfix) with ESMTP id 560A76B0078 for ; Sat, 8 Jan 2022 11:44:23 -0500 (EST) Received: from smtpin11.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 12087181C9908 for ; Sat, 8 Jan 2022 16:44:23 +0000 (UTC) X-FDA: 79007692806.11.C5EED2E Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf20.hostedemail.com (Postfix) with ESMTP id 90D3B1C0003 for ; Sat, 8 Jan 2022 16:44:22 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id C747260DDD; Sat, 8 Jan 2022 16:44:21 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 709F9C36AF3; Sat, 8 Jan 2022 16:44:21 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660261; bh=LjdL825yge/umM2UgJRwPTnpW901vptpiAzBo0Abgdo=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=AlCM/878C1rYxAAe9UufUR7gcwXASA7e1xrfZVepu0whzF7ICFNMg6oO7ECN2Ok7t c+3+cWaPs8pEKoRin9U7PRYpzUzk+Ssq9kd5K1U7Vr4tyNrh4CPbWNiIhNlmLDebGz fKhJn5wiQIwLcFTRaVW3pvTEGODx2s2uMyLLtQy9RgyZfrM28FjRSGf+PbILr0+G+I onXa/0idW/YXqnEAJuZ4CIvCGvkvmHcA6YtSEEbeDoEsdQzYuO256oa4QmvY7M5cvq 
1/ZCZUBnGMFpwA9MNXtGN+S8jXWz3in7j0PTX1/YE7RUrwsR2GNvvmqGZsN7qKHa6j 5RiYeD2VlBOOA== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 04/23] membarrier: Make the post-switch-mm barrier explicit Date: Sat, 8 Jan 2022 08:43:49 -0800 Message-Id: X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Queue-Id: 90D3B1C0003 X-Stat-Signature: k8cfe4f7i5tqfxmyw9hgo7yffbwswdnf Authentication-Results: imf20.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b="AlCM/878"; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf20.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org X-Rspamd-Server: rspam02 X-HE-Tag: 1641660262-470138 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: membarrier() needs a barrier after any CPU changes mm. There is currently a comment explaining why this barrier probably exists in all cases. The logic is based on ensuring that the barrier exists on every control flow path through the scheduler. It also relies on mmgrab() and mmdrop() being full barriers. mmgrab() and mmdrop() would be better if they were not full barriers. As a trivial optimization, mmgrab() could use a relaxed atomic and mmdrop() could use a release on architectures that have these operations. Larger optimizations are also in the works. Doing any of these optimizations while preserving an unnecessary barrier will complicate the code and penalize non-membarrier-using tasks. Simplify the logic by adding an explicit barrier, and allow architectures to override it as an optimization if they want to. One of the deleted comments in this patch said "It is therefore possible to schedule between user->kernel->user threads without passing through switch_mm()". It is possible to do this without, say, writing to CR3 on x86, but the core scheduler indeed calls switch_mm_irqs_off() to tell the arch code to go back from lazy mode to no-lazy mode. The membarrier_finish_switch_mm() call in exec_mmap() is a no-op so long as there is no way for a newly execed program to register for membarrier prior to running user code. Subsequent patches will merge the exec_mmap() code with the kthread_use_mm() code, though, and keeping the paths consistent will make the result more comprehensible. 
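For orientation, the ordering that this patch makes explicit can be condensed as follows. This is an editorial sketch paraphrasing the kernel/sched/core.c hunk below, not additional kernel code:

	/* Condensed user-to-user context_switch() path after this patch: */
	membarrier_switch_mm(rq, prev->active_mm, next->mm);
	switch_mm_irqs_off(prev->active_mm, next->mm, next);
	/*
	 * Explicit full barrier between the rq->curr / mm change and the
	 * return to userspace; membarrier_finish_switch_mm() only pays the
	 * smp_mb() cost for membarrier-registered mms, and architectures
	 * with an implicit barrier can override it.
	 */
	membarrier_finish_switch_mm(next->mm);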
Cc: Mathieu Desnoyers Cc: Nicholas Piggin Cc: Peter Zijlstra Signed-off-by: Andy Lutomirski --- fs/exec.c | 1 + include/linux/sched/mm.h | 18 ++++++++++++++++++ kernel/kthread.c | 12 +----------- kernel/sched/core.c | 34 +++++++++------------------------- 4 files changed, 29 insertions(+), 36 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index a098c133d8d7..3abbd0294e73 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1019,6 +1019,7 @@ static int exec_mmap(struct mm_struct *mm) activate_mm(active_mm, mm); if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) local_irq_enable(); + membarrier_finish_switch_mm(mm); tsk->mm->vmacache_seqnum = 0; vmacache_flush(tsk); task_unlock(tsk); diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 0df706c099e5..e8919995d8dd 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -349,6 +349,20 @@ extern void membarrier_exec_mmap(struct mm_struct *mm); extern void membarrier_update_current_mm(struct mm_struct *next_mm); +/* + * Called by the core scheduler after calling switch_mm_irqs_off(). + * Architectures that have implicit barriers when switching mms can + * override this as an optimization. + */ +#ifndef membarrier_finish_switch_mm +static inline void membarrier_finish_switch_mm(struct mm_struct *mm) +{ + if (atomic_read(&mm->membarrier_state) & + (MEMBARRIER_STATE_GLOBAL_EXPEDITED | MEMBARRIER_STATE_PRIVATE_EXPEDITED)) + smp_mb(); +} +#endif + #else static inline void membarrier_exec_mmap(struct mm_struct *mm) { @@ -356,6 +370,10 @@ static inline void membarrier_exec_mmap(struct mm_struct *mm) static inline void membarrier_update_current_mm(struct mm_struct *next_mm) { } +static inline void membarrier_finish_switch_mm(struct mm_struct *mm) +{ +} + #endif #endif /* _LINUX_SCHED_MM_H */ diff --git a/kernel/kthread.c b/kernel/kthread.c index 5b37a8567168..396ae78a1a34 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1361,25 +1361,15 @@ void kthread_use_mm(struct mm_struct *mm) tsk->mm = mm; membarrier_update_current_mm(mm); switch_mm_irqs_off(active_mm, mm, tsk); + membarrier_finish_switch_mm(mm); local_irq_enable(); task_unlock(tsk); #ifdef finish_arch_post_lock_switch finish_arch_post_lock_switch(); #endif - /* - * When a kthread starts operating on an address space, the loop - * in membarrier_{private,global}_expedited() may not observe - * that tsk->mm, and not issue an IPI. Membarrier requires a - * memory barrier after storing to tsk->mm, before accessing - * user-space memory. A full memory barrier for membarrier - * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by - * mmdrop(), or explicitly with smp_mb(). - */ if (active_mm != mm) mmdrop(active_mm); - else - smp_mb(); to_kthread(tsk)->oldfs = force_uaccess_begin(); } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 6a1db8264c7b..917068b0a145 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4824,14 +4824,6 @@ static struct rq *finish_task_switch(struct task_struct *prev) fire_sched_in_preempt_notifiers(current); /* - * When switching through a kernel thread, the loop in - * membarrier_{private,global}_expedited() may have observed that - * kernel thread and not issued an IPI. It is therefore possible to - * schedule between user->kernel->user threads without passing though - * switch_mm(). Membarrier requires a barrier after storing to - * rq->curr, before returning to userspace, and mmdrop() provides - * this barrier. 
- * * If an architecture needs to take a specific action for * SYNC_CORE, it can do so in switch_mm_irqs_off(). */ @@ -4915,15 +4907,14 @@ context_switch(struct rq *rq, struct task_struct *prev, prev->active_mm = NULL; } else { // to user membarrier_switch_mm(rq, prev->active_mm, next->mm); + switch_mm_irqs_off(prev->active_mm, next->mm, next); + /* * sys_membarrier() requires an smp_mb() between setting - * rq->curr / membarrier_switch_mm() and returning to userspace. - * - * The below provides this either through switch_mm(), or in - * case 'prev->active_mm == next->mm' through - * finish_task_switch()'s mmdrop(). + * rq->curr->mm to a membarrier-enabled mm and returning + * to userspace. */ - switch_mm_irqs_off(prev->active_mm, next->mm, next); + membarrier_finish_switch_mm(next->mm); if (!prev->mm) { // from kernel /* will mmdrop() in finish_task_switch(). */ @@ -6264,17 +6255,10 @@ static void __sched notrace __schedule(unsigned int sched_mode) RCU_INIT_POINTER(rq->curr, next); /* * The membarrier system call requires each architecture - * to have a full memory barrier after updating - * rq->curr, before returning to user-space. - * - * Here are the schemes providing that barrier on the - * various architectures: - * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC. - * switch_mm() rely on membarrier_arch_switch_mm() on PowerPC. - * - finish_lock_switch() for weakly-ordered - * architectures where spin_unlock is a full barrier, - * - switch_to() for arm64 (weakly-ordered, spin_unlock - * is a RELEASE barrier), + * to have a full memory barrier before and after updating + * rq->curr->mm, before returning to userspace. This + * is provided by membarrier_finish_switch_mm(). Architectures + * that want to optimize this can override that function. 
*/ ++*switch_count; From patchwork Sat Jan 8 16:43:51 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707535 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id AC1C0C433EF for ; Sat, 8 Jan 2022 16:44:30 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 20A6F6B007B; Sat, 8 Jan 2022 11:44:27 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 167EA6B007D; Sat, 8 Jan 2022 11:44:27 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id ED6786B007E; Sat, 8 Jan 2022 11:44:26 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0165.hostedemail.com [216.40.44.165]) by kanga.kvack.org (Postfix) with ESMTP id D597A6B007B for ; Sat, 8 Jan 2022 11:44:26 -0500 (EST) Received: from smtpin28.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 8D6FE180286D2 for ; Sat, 8 Jan 2022 16:44:26 +0000 (UTC) X-FDA: 79007692932.28.05134E7 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by imf08.hostedemail.com (Postfix) with ESMTP id 15B86160010 for ; Sat, 8 Jan 2022 16:44:25 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 094A4B80B3C; Sat, 8 Jan 2022 16:44:25 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 9D4C8C36AED; Sat, 8 Jan 2022 16:44:23 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660263; bh=Ix0zIvWH9riMicQK8q6DeOk5TUbR/qLOVVFcIcVQ7C8=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=auhasTA/47YoAeVi8R3e+QTXB01K8YsMLaA59SJ4dOZ+OaBIuznl4w5vUDhWLgVgd nsbcRscEmwFJ79KqOn6yfwoCuOHEZlzFhXChKiZeBbe8MmiTX9B1oE3pm9v1YxtY3+ 9z+fr8Q/bL6dMsZuyYWCd4b4Ra+QLFuWY60jotzHt7SgsS2Rfogbp7ysJM1fZ+zgDs QYO8jxwQtH0zVozrfDwwMie+4OEFtg02UUwn8wCAyY9dyICu63kPwCBKqbQHkj3Q9K DS4jfc3F0BqNZecvV3KoEVKhGNwNyFtsjMr7v27SdHqXoW5tXauL6Dy/JQdJt+axRX EW0Pjg36QKZAQ== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski , Michael Ellerman , Paul Mackerras , linuxppc-dev@lists.ozlabs.org Subject: [PATCH 06/23] powerpc/membarrier: Remove special barrier on mm switch Date: Sat, 8 Jan 2022 08:43:51 -0800 Message-Id: X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Stat-Signature: fdtb5e39c4ma1rb3ixbkcg5ifqw49s8y X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 15B86160010 Authentication-Results: imf08.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b="auhasTA/"; spf=pass (imf08.hostedemail.com: domain of luto@kernel.org designates 145.40.68.75 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-HE-Tag: 1641660265-499857 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: powerpc did 
the following on some, but not all, paths through switch_mm_irqs_off(): /* * Only need the full barrier when switching between processes. * Barrier when switching from kernel to userspace is not * required here, given that it is implied by mmdrop(). Barrier * when switching from userspace to kernel is not needed after * store to rq->curr. */ if (likely(!(atomic_read(&next->membarrier_state) & (MEMBARRIER_STATE_PRIVATE_EXPEDITED | MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev)) return; This is puzzling: if !prev, then one might expect that we are switching from kernel to user, not user to kernel, which is inconsistent with the comment. But this is all nonsense, because the one and only caller would never have prev == NULL and would, in fact, OOPS if prev == NULL. In any event, this code is unnecessary, since the new generic membarrier_finish_switch_mm() provides the same barrier without arch help. arch/powerpc/include/asm/membarrier.h remains as an empty header, because a later patch in this series will add code to it. Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org Cc: Nicholas Piggin Cc: Mathieu Desnoyers Cc: Peter Zijlstra Signed-off-by: Andy Lutomirski --- arch/powerpc/include/asm/membarrier.h | 24 ------------------------ arch/powerpc/mm/mmu_context.c | 1 - 2 files changed, 25 deletions(-) diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h index de7f79157918..b90766e95bd1 100644 --- a/arch/powerpc/include/asm/membarrier.h +++ b/arch/powerpc/include/asm/membarrier.h @@ -1,28 +1,4 @@ #ifndef _ASM_POWERPC_MEMBARRIER_H #define _ASM_POWERPC_MEMBARRIER_H -static inline void membarrier_arch_switch_mm(struct mm_struct *prev, - struct mm_struct *next, - struct task_struct *tsk) -{ - /* - * Only need the full barrier when switching between processes. - * Barrier when switching from kernel to userspace is not - * required here, given that it is implied by mmdrop(). Barrier - * when switching from userspace to kernel is not needed after - * store to rq->curr. - */ - if (IS_ENABLED(CONFIG_SMP) && - likely(!(atomic_read(&next->membarrier_state) & - (MEMBARRIER_STATE_PRIVATE_EXPEDITED | - MEMBARRIER_STATE_GLOBAL_EXPEDITED)) || !prev)) - return; - - /* - * The membarrier system call requires a full memory barrier - * after storing to rq->curr, before going back to user-space. 
- */ - smp_mb(); -} - #endif /* _ASM_POWERPC_MEMBARRIER_H */ diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c index 74246536b832..5f2daa6b0497 100644 --- a/arch/powerpc/mm/mmu_context.c +++ b/arch/powerpc/mm/mmu_context.c @@ -84,7 +84,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, asm volatile ("dssall"); if (!new_on_cpu) - membarrier_arch_switch_mm(prev, next, tsk); /* * The actual HW switching method differs between the various From patchwork Sat Jan 8 16:43:52 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707537 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C54FBC433EF for ; Sat, 8 Jan 2022 16:44:33 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D79046B0081; Sat, 8 Jan 2022 11:44:28 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id D01446B0080; Sat, 8 Jan 2022 11:44:28 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id B06406B0081; Sat, 8 Jan 2022 11:44:28 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0077.hostedemail.com [216.40.44.77]) by kanga.kvack.org (Postfix) with ESMTP id 8CA4C6B007E for ; Sat, 8 Jan 2022 11:44:28 -0500 (EST) Received: from smtpin16.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 4B06C98C2D for ; Sat, 8 Jan 2022 16:44:28 +0000 (UTC) X-FDA: 79007693016.16.AB0B815 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by imf14.hostedemail.com (Postfix) with ESMTP id D047A10000B for ; Sat, 8 Jan 2022 16:44:27 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 4631AB80B47; Sat, 8 Jan 2022 16:44:26 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id D8EB3C36AEF; Sat, 8 Jan 2022 16:44:24 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660265; bh=nCJ6JL8IVos0CgiyyAGe15m3ZdsuNKsEv9GJVjhwIOI=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=WtHuJMx7z0eKUfRVr+x8ybkHoQH5R7jphHS4lUHOVZi4MG9PdHrBL9P0lyvCaSaIH 1EkYq/MwLrm81oeFXOYQblb7oQVYh1/gidhXADq0+Wyq1LReeiV2H7p5p7yI+7c3o1 0Vr1Lq+hGO6rgLqtctcq8b/Fk1I/9k/ybOzc2qkMjkVsNu5T9gzTBfGhkpIhD8fVPW Y4PB3RuJZwBSuUxn1RzK/aXpei0NaV9vsRwFI+hnxmcCYTrh7c7pAFUa8EvpmB49sd dVNQNxJv1gZjDjYnff2yjZK2OTJwPVe9QtRy6WJUYBkSjz8CjlxtdivNGFs6Vsgpqi RIi+11VU5yWcA== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski , Michael Ellerman , Paul Mackerras , linuxppc-dev@lists.ozlabs.org, Catalin Marinas , Will Deacon , linux-arm-kernel@lists.infradead.org, stable@vger.kernel.org Subject: [PATCH 07/23] membarrier: Rewrite sync_core_before_usermode() and improve documentation Date: Sat, 8 Jan 2022 08:43:52 -0800 Message-Id: X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Server: rspam09 
X-Rspamd-Queue-Id: D047A10000B X-Stat-Signature: 4k7atjax3yhm1tddyt7k8kijhgxs8peu Authentication-Results: imf14.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=WtHuJMx7; spf=pass (imf14.hostedemail.com: domain of luto@kernel.org designates 145.40.68.75 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-HE-Tag: 1641660267-658261 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The old sync_core_before_usermode() comments suggested that a non-icache-syncing return-to-usermode instruction is x86-specific and that all other architectures automatically notice cross-modified code on return to userspace. This is misleading. The incantation needed to modify code from one CPU and execute it on another CPU is highly architecture dependent. On x86, according to the SDM, one must modify the code, issue SFENCE if the modification was WC or nontemporal, and then issue a "serializing instruction" on the CPU that will execute the code. membarrier() can do the latter. On arm, arm64 and powerpc, one must flush the icache and then flush the pipeline on the target CPU, although the CPU manuals don't necessarily use this language. So let's drop any pretense that we can have a generic way to define or implement membarrier's SYNC_CORE operation and instead require all architectures to define the helper and supply their own documentation as to how to use it. This means x86, arm64, and powerpc for now. Let's also rename the function from sync_core_before_usermode() to membarrier_sync_core_before_usermode() because the precise flushing details may very well be specific to membarrier, and even the concept of "sync_core" in the kernel is mostly an x86-ism. (It may well be the case that, on real x86 processors, synchronizing the icache (which requires no action at all) and "flushing the pipeline" is sufficient, but trying to use this language would be confusing at best. LFENCE does something awfully like "flushing the pipeline", but the SDM does not permit LFENCE as an alternative to a "serializing instruction" for this purpose.) 
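To make the documented usage concrete, here is a minimal userspace sketch of the JIT pattern this patch documents (editorial example, not part of the patch; it assumes x86-64 and a kernel with SYNC_CORE support, and omits the error handling and W^X mapping hygiene a real JIT would use):

	#include <linux/membarrier.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static long membarrier(int cmd, unsigned int flags)
	{
		return syscall(__NR_membarrier, cmd, flags);
	}

	int main(void)
	{
		/* Register once per process before using the expedited command. */
		membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);

		unsigned char *buf = mmap(NULL, 4096,
					  PROT_READ | PROT_WRITE | PROT_EXEC,
					  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		/* "JIT" a trivial function: xor %eax,%eax; ret */
		static const unsigned char code[] = { 0x31, 0xc0, 0xc3 };
		memcpy(buf, code, sizeof(code));

		/*
		 * Force a core-serializing event on every CPU that might run
		 * this process before any thread executes the new code.
		 */
		membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);

		return ((int (*)(void))buf)();
	}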
Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: linuxppc-dev@lists.ozlabs.org Cc: Nicholas Piggin Cc: Catalin Marinas Cc: Will Deacon Cc: linux-arm-kernel@lists.infradead.org Cc: Mathieu Desnoyers Cc: Nicholas Piggin Cc: Peter Zijlstra Cc: x86@kernel.org Cc: stable@vger.kernel.org Acked-by: Will Deacon # for arm64 Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE") Signed-off-by: Andy Lutomirski --- .../membarrier-sync-core/arch-support.txt | 69 ++++++------------- arch/arm/include/asm/membarrier.h | 21 ++++++ arch/arm64/include/asm/membarrier.h | 19 +++++ arch/powerpc/include/asm/membarrier.h | 10 +++ arch/x86/Kconfig | 1 - arch/x86/include/asm/membarrier.h | 25 +++++++ arch/x86/include/asm/sync_core.h | 20 ------ arch/x86/kernel/alternative.c | 2 +- arch/x86/kernel/cpu/mce/core.c | 2 +- arch/x86/mm/tlb.c | 3 +- drivers/misc/sgi-gru/grufault.c | 2 +- drivers/misc/sgi-gru/gruhandles.c | 2 +- drivers/misc/sgi-gru/grukservices.c | 2 +- include/linux/sched/mm.h | 1 - include/linux/sync_core.h | 21 ------ init/Kconfig | 3 - kernel/sched/membarrier.c | 14 +++- 17 files changed, 115 insertions(+), 102 deletions(-) create mode 100644 arch/arm/include/asm/membarrier.h create mode 100644 arch/arm64/include/asm/membarrier.h create mode 100644 arch/x86/include/asm/membarrier.h delete mode 100644 include/linux/sync_core.h diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt b/Documentation/features/sched/membarrier-sync-core/arch-support.txt index 883d33b265d6..4009b26bf5c3 100644 --- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt +++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt @@ -5,51 +5,26 @@ # # Architecture requirements # -# * arm/arm64/powerpc # -# Rely on implicit context synchronization as a result of exception return -# when returning from IPI handler, and when returning to user-space. -# -# * x86 -# -# x86-32 uses IRET as return from interrupt, which takes care of the IPI. -# However, it uses both IRET and SYSEXIT to go back to user-space. The IRET -# instruction is core serializing, but not SYSEXIT. -# -# x86-64 uses IRET as return from interrupt, which takes care of the IPI. -# However, it can return to user-space through either SYSRETL (compat code), -# SYSRETQ, or IRET. -# -# Given that neither SYSRET{L,Q}, nor SYSEXIT, are core serializing, we rely -# instead on write_cr3() performed by switch_mm() to provide core serialization -# after changing the current mm, and deal with the special case of kthread -> -# uthread (temporarily keeping current mm into active_mm) by issuing a -# sync_core_before_usermode() in that specific case. -# - ----------------------- - | arch |status| - ----------------------- - | alpha: | TODO | - | arc: | TODO | - | arm: | ok | - | arm64: | ok | - | csky: | TODO | - | h8300: | TODO | - | hexagon: | TODO | - | ia64: | TODO | - | m68k: | TODO | - | microblaze: | TODO | - | mips: | TODO | - | nds32: | TODO | - | nios2: | TODO | - | openrisc: | TODO | - | parisc: | TODO | - | powerpc: | ok | - | riscv: | TODO | - | s390: | TODO | - | sh: | TODO | - | sparc: | TODO | - | um: | TODO | - | x86: | ok | - | xtensa: | TODO | - ----------------------- +# An architecture that wants to support +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE needs to define precisely what it +# is supposed to do and implement membarrier_sync_core_before_usermode() to +# make it do that. 
Then it can select ARCH_HAS_MEMBARRIER_SYNC_CORE via +# Kconfig and document what SYNC_CORE does on that architecture in this +# list. +# +# On x86, a program can safely modify code, issue +# MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, and then execute that code, via +# the modified address or an alias, from any thread in the calling process. +# +# On arm and arm64, a program can modify code, flush the icache as needed, +# and issue MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE to force a "context +# synchronizing event", aka pipeline flush on all CPUs that might run the +# calling process. Then the program can execute the modified code as long +# as it is executed from an address consistent with the icache flush and +# the CPU's cache type. On arm, cacheflush(2) can be used for the icache +# flushing operation. +# +# On powerpc, a program can use MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE +# similarly to arm64. It would be nice if the powerpc maintainers could +# add a clearer explanation. diff --git a/arch/arm/include/asm/membarrier.h b/arch/arm/include/asm/membarrier.h new file mode 100644 index 000000000000..c162a0758657 --- /dev/null +++ b/arch/arm/include/asm/membarrier.h @@ -0,0 +1,21 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_ARM_MEMBARRIER_H +#define _ASM_ARM_MEMBARRIER_H + +#include + +/* + * On arm, anyone trying to use membarrier() to handle JIT code is required + * to first flush the icache (most likely by using cacheflush(2)) and then + * do SYNC_CORE. All that's needed after the icache flush is to execute a + * "context synchronization event". + * + * Returning to user mode is a context synchronization event, so no + * specific action by the kernel is needed other than ensuring that the + * kernel is entered. + */ +static inline void membarrier_sync_core_before_usermode(void) +{ +} + +#endif /* _ASM_ARM_MEMBARRIER_H */ diff --git a/arch/arm64/include/asm/membarrier.h b/arch/arm64/include/asm/membarrier.h new file mode 100644 index 000000000000..db8e0ea57253 --- /dev/null +++ b/arch/arm64/include/asm/membarrier.h @@ -0,0 +1,19 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _ASM_ARM64_MEMBARRIER_H +#define _ASM_ARM64_MEMBARRIER_H + +#include + +/* + * On arm64, anyone trying to use membarrier() to handle JIT code is + * required to first flush the icache and then do SYNC_CORE. All that's + * needed after the icache flush is to execute a "context synchronization + * event". Right now, ERET does this, and we are guaranteed to ERET before + * any user code runs. If Linux ever programs the CPU to make ERET stop + * being a context synchronizing event, then this will need to be adjusted. + */ +static inline void membarrier_sync_core_before_usermode(void) +{ +} + +#endif /* _ASM_ARM64_MEMBARRIER_H */ diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h index b90766e95bd1..466abe6fdcea 100644 --- a/arch/powerpc/include/asm/membarrier.h +++ b/arch/powerpc/include/asm/membarrier.h @@ -1,4 +1,14 @@ #ifndef _ASM_POWERPC_MEMBARRIER_H #define _ASM_POWERPC_MEMBARRIER_H +#include + +/* + * The RFI family of instructions are context synchronising, and + * that is how we return to userspace, so nothing is required here.
+ */ +static inline void membarrier_sync_core_before_usermode(void) +{ +} + #endif /* _ASM_POWERPC_MEMBARRIER_H */ diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index d9830e7e1060..5060c38bf560 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -90,7 +90,6 @@ config X86 select ARCH_HAS_SET_DIRECT_MAP select ARCH_HAS_STRICT_KERNEL_RWX select ARCH_HAS_STRICT_MODULE_RWX - select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE select ARCH_HAS_SYSCALL_WRAPPER select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAS_DEBUG_WX diff --git a/arch/x86/include/asm/membarrier.h b/arch/x86/include/asm/membarrier.h new file mode 100644 index 000000000000..9b72a1b49359 --- /dev/null +++ b/arch/x86/include/asm/membarrier.h @@ -0,0 +1,25 @@ +#ifndef _ASM_X86_MEMBARRIER_H +#define _ASM_X86_MEMBARRIER_H + +#include + +/* + * Ensure that the CPU notices any instruction changes before the next time + * it returns to usermode. + */ +static inline void membarrier_sync_core_before_usermode(void) +{ + /* With PTI, we unconditionally serialize before running user code. */ + if (static_cpu_has(X86_FEATURE_PTI)) + return; + + /* + * Even if we're in an interrupt, we might reschedule before returning, + * in which case we could switch to a different thread in the same mm + * and return using SYSRET or SYSEXIT. Instead of trying to keep + * track of our need to sync the core, just sync right away. + */ + sync_core(); +} + +#endif /* _ASM_X86_MEMBARRIER_H */ diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h index ab7382f92aff..bfe4ac4e6be2 100644 --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -88,24 +88,4 @@ static inline void sync_core(void) iret_to_self(); } -/* - * Ensure that a core serializing instruction is issued before returning - * to user-mode. x86 implements return to user-space through sysexit, - * sysrel, and sysretq, which are not core serializing. - */ -static inline void sync_core_before_usermode(void) -{ - /* With PTI, we unconditionally serialize before running user code. */ - if (static_cpu_has(X86_FEATURE_PTI)) - return; - - /* - * Even if we're in an interrupt, we might reschedule before returning, - * in which case we could switch to a different thread in the same mm - * and return using SYSRET or SYSEXIT. Instead of trying to keep - * track of our need to sync the core, just sync right away. 
- */ - sync_core(); -} - #endif /* _ASM_X86_SYNC_CORE_H */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index e9da3dc71254..b47cd22b2eb1 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -17,7 +17,7 @@ #include #include #include -#include +#include #include #include #include diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 193204aee880..a2529e09f620 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -41,12 +41,12 @@ #include #include #include -#include #include #include #include #include +#include #include #include #include diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 1ae15172885e..74b7a615bc15 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -12,6 +12,7 @@ #include #include +#include #include #include #include @@ -491,7 +492,7 @@ static void sync_core_if_membarrier_enabled(struct mm_struct *next) #ifdef CONFIG_MEMBARRIER if (unlikely(atomic_read(&next->membarrier_state) & MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)) - sync_core_before_usermode(); + membarrier_sync_core_before_usermode(); #endif } diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c index d7ef61e602ed..462c667bd6c4 100644 --- a/drivers/misc/sgi-gru/grufault.c +++ b/drivers/misc/sgi-gru/grufault.c @@ -20,8 +20,8 @@ #include #include #include -#include #include +#include #include "gru.h" #include "grutables.h" #include "grulib.h" diff --git a/drivers/misc/sgi-gru/gruhandles.c b/drivers/misc/sgi-gru/gruhandles.c index 1d75d5e540bc..c8cba1c1b00f 100644 --- a/drivers/misc/sgi-gru/gruhandles.c +++ b/drivers/misc/sgi-gru/gruhandles.c @@ -16,7 +16,7 @@ #define GRU_OPERATION_TIMEOUT (((cycles_t) local_cpu_data->itc_freq)*10) #define CLKS2NSEC(c) ((c) *1000000000 / local_cpu_data->itc_freq) #else -#include +#include #include #define GRU_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000) #define CLKS2NSEC(c) ((c) * 1000000 / tsc_khz) diff --git a/drivers/misc/sgi-gru/grukservices.c b/drivers/misc/sgi-gru/grukservices.c index 0ea923fe6371..ce03ff3f7c3a 100644 --- a/drivers/misc/sgi-gru/grukservices.c +++ b/drivers/misc/sgi-gru/grukservices.c @@ -16,10 +16,10 @@ #include #include #include -#include #include #include #include +#include #include #include "gru.h" #include "grulib.h" diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index e8919995d8dd..e107f292fc42 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -7,7 +7,6 @@ #include #include #include -#include /* * Routines for handling mm_structs diff --git a/include/linux/sync_core.h b/include/linux/sync_core.h deleted file mode 100644 index 013da4b8b327..000000000000 --- a/include/linux/sync_core.h +++ /dev/null @@ -1,21 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -#ifndef _LINUX_SYNC_CORE_H -#define _LINUX_SYNC_CORE_H - -#ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE -#include -#else -/* - * This is a dummy sync_core_before_usermode() implementation that can be used - * on all architectures which return to user-space through core serializing - * instructions. - * If your architecture returns to user-space through non-core-serializing - * instructions, you need to write your own functions. 
- */ -static inline void sync_core_before_usermode(void) -{ -} -#endif - -#endif /* _LINUX_SYNC_CORE_H */ - diff --git a/init/Kconfig b/init/Kconfig index 11f8a845f259..bbaf93f9438b 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -2364,9 +2364,6 @@ source "kernel/Kconfig.locks" config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE bool -config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE - bool - # It may be useful for an architecture to override the definitions of the # SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in # and the COMPAT_ variants in , in particular to use a diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 327830f89c37..eb73eeaedc7d 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -5,6 +5,14 @@ * membarrier system call */ #include "sched.h" +#ifdef CONFIG_ARCH_HAS_MEMBARRIER_SYNC_CORE +#include +#else +static inline void membarrier_sync_core_before_usermode(void) +{ + compiletime_assert(0, "architecture does not implement membarrier_sync_core_before_usermode"); +} +#endif /* * The basic principle behind the regular memory barrier mode of @@ -231,12 +239,12 @@ static void ipi_sync_core(void *info) * the big comment at the top of this file. * * A sync_core() would provide this guarantee, but - * sync_core_before_usermode() might end up being deferred until - * after membarrier()'s smp_mb(). + * membarrier_sync_core_before_usermode() might end up being deferred + * until after membarrier()'s smp_mb(). */ smp_mb(); /* IPIs should be serializing but paranoid. */ - sync_core_before_usermode(); + membarrier_sync_core_before_usermode(); } static void ipi_rseq(void *info) From patchwork Sat Jan 8 16:43:53 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707536 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4953DC433F5 for ; Sat, 8 Jan 2022 16:44:32 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1E7CA6B007D; Sat, 8 Jan 2022 11:44:28 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 057896B007E; Sat, 8 Jan 2022 11:44:27 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E13E96B0080; Sat, 8 Jan 2022 11:44:27 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0125.hostedemail.com [216.40.44.125]) by kanga.kvack.org (Postfix) with ESMTP id CB7CC6B007D for ; Sat, 8 Jan 2022 11:44:27 -0500 (EST) Received: from smtpin29.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 96ECD180CB07B for ; Sat, 8 Jan 2022 16:44:27 +0000 (UTC) X-FDA: 79007692974.29.E6A0BED Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf09.hostedemail.com (Postfix) with ESMTP id 39810140008 for ; Sat, 8 Jan 2022 16:44:27 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 712FA60DE6; Sat, 8 Jan 2022 16:44:26 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 19D23C36AE3; Sat, 8 Jan 2022 16:44:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; 
t=1641660266; bh=kZ6ewOsIspZ5UP0WmGdsulngoycAuaq8Cf8dYc4KorQ=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=Nf4/sPAbWhWP+hg5cbRtORiMTUNtDoPJgpo+XlbzmdxaiI4aDAh+gRDd3LBs3jSdB E7MdZorBg0nGhL7W8XdLIzZAy2wFfunrxVnHkSRweDSFloVsl1ScCKPwqR5DcJQKD+ gg5zQ2t7i+CemQ7FNrsNzXCXMct/9u1Ds6Jg8las/6ldMd4yzif/Mtx+bzQ1uKQXT2 cCaHUMqAAt+gmHsMOE5N2MTkHaJiWbqJC9Ck/E86LuMNDiCawfA6TK5+s1mp3qycKr wW/yniY7NUlNbd0jnTbgpeenJ5MSCF5ZONgKQxXWuGGX+HBjPiU6Nj3+t5XVDdQeif nnJfyKWsI4uGA== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 08/23] membarrier: Remove redundant clear of mm->membarrier_state in exec_mmap() Date: Sat, 8 Jan 2022 08:43:53 -0800 Message-Id: X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Queue-Id: 39810140008 X-Stat-Signature: g47fw66dcmj4ht7p3gr1hd59dxa1ika3 Authentication-Results: imf09.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b="Nf4/sPAb"; spf=pass (imf09.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-Rspamd-Server: rspam10 X-HE-Tag: 1641660267-374301 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: exec_mmap() supplies a brand-new mm from mm_alloc(), and membarrier_state is already 0. There's no need to clear it again. Signed-off-by: Andy Lutomirski --- kernel/sched/membarrier.c | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index eb73eeaedc7d..c38014c2ed66 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -285,7 +285,6 @@ void membarrier_exec_mmap(struct mm_struct *mm) * clearing this state. */ smp_mb(); - atomic_set(&mm->membarrier_state, 0); /* * Keep the runqueue membarrier_state in sync with this mm * membarrier_state. 
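Editorial note on the patch above: its argument rests on the fact that mm_alloc() hands exec_mmap() a zero-initialized mm_struct, so membarrier_state can only already be 0 when the deleted atomic_set() would have run. The following stand-alone sketch is illustrative only, not kernel code; fake_mm and fake_mm_alloc are hypothetical names that merely mirror the property of mm_alloc() relied on here, namely that the allocator zeroes the whole object, making a later explicit clear of one field a no-op.

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for mm_struct; only the field under discussion. */
struct fake_mm {
	int membarrier_state;
};

/* Mimics the relevant property of mm_alloc(): the whole object is zeroed. */
static struct fake_mm *fake_mm_alloc(void)
{
	struct fake_mm *mm = malloc(sizeof(*mm));

	if (!mm)
		return NULL;
	memset(mm, 0, sizeof(*mm));	/* every field, including membarrier_state, starts at 0 */
	return mm;
}

int main(void)
{
	struct fake_mm *mm = fake_mm_alloc();

	assert(mm && mm->membarrier_state == 0);	/* a later clear-to-zero would change nothing */
	free(mm);
	return 0;
}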
From patchwork Sat Jan 8 16:43:54 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707538 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A8FB1C43219 for ; Sat, 8 Jan 2022 16:44:35 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3DDC26B007E; Sat, 8 Jan 2022 11:44:29 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 33E966B0080; Sat, 8 Jan 2022 11:44:29 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 191D56B0082; Sat, 8 Jan 2022 11:44:29 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0193.hostedemail.com [216.40.44.193]) by kanga.kvack.org (Postfix) with ESMTP id D74E96B007E for ; Sat, 8 Jan 2022 11:44:28 -0500 (EST) Received: from smtpin30.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 8FD6080EBFBE for ; Sat, 8 Jan 2022 16:44:28 +0000 (UTC) X-FDA: 79007693016.30.6C562C6 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf27.hostedemail.com (Postfix) with ESMTP id 21D2B40003 for ; Sat, 8 Jan 2022 16:44:27 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 7843A60DE4; Sat, 8 Jan 2022 16:44:27 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 2841BC36AF8; Sat, 8 Jan 2022 16:44:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660267; bh=EnLYGiRErjhDokS7ynPZj/PCAxWfuRJzmK2u9DRIXD0=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=oD+cLyvFR3z8/hdUI0dQNvSw6VCygHK0d7zmsAL5mRiRPEs1GneFq+K8RC051+fvE wnOYH8GQ4M/EU8yekVbdZRo9BoZvgtAQM+Mo8CwqLL75oBw+kFhCNjQ4Kw95R9/L+Q 2cMxMrBqDkkvQZ6p44oEJ6af2VfwomQlglAyCUgUXhn+ApfijDyEI0gdR8IO8rxc6s iN8tBLBhZmiTVW2+0LeID3inM9BWIx885T3ZTIGERPbA1tqeixjxhGT0Dnfv7RKPOQ fF2aZW0kl/addSxnjEw/3QUARWFQgGaimwmT7puia2X79uQP74nTTefxCs96QkJ8GJ /7URAbBq0i1BA== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 09/23] membarrier: Fix incorrect barrier positions during exec and kthread_use_mm() Date: Sat, 8 Jan 2022 08:43:54 -0800 Message-Id: <21273aa5349827de22507ef445fbde1a12ac2f8f.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 21D2B40003 X-Stat-Signature: 7ibrmhp716rokqpujpns9pxrzsgffoui Authentication-Results: imf27.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=oD+cLyvF; spf=pass (imf27.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-HE-Tag: 1641660267-651037 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: 
membarrier() requires a barrier before changes to rq->curr->mm, not just before writes to rq->membarrier_state. Move the barrier in exec_mmap() to the right place. Add the barrier in kthread_use_mm() -- it was entirely missing before. This patch makes exec_mmap() and kthread_use_mm() use the same membarrier hooks, which results in some code deletion. As an added bonus, this will eliminate a redundant barrier in execve() on arches for which spinlock acquisition is a barrier. Signed-off-by: Andy Lutomirski --- fs/exec.c | 6 +++++- include/linux/sched/mm.h | 2 -- kernel/kthread.c | 5 +++++ kernel/sched/membarrier.c | 15 --------------- 4 files changed, 10 insertions(+), 18 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 38b05e01c5bd..325dab98bc51 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1001,12 +1001,16 @@ static int exec_mmap(struct mm_struct *mm) } task_lock(tsk); - membarrier_exec_mmap(mm); + /* + * membarrier() requires a full barrier before switching mm. + */ + smp_mb__after_spinlock(); local_irq_disable(); active_mm = tsk->active_mm; tsk->active_mm = mm; WRITE_ONCE(tsk->mm, mm); /* membarrier reads this without locks */ + membarrier_update_current_mm(mm); /* * This prevents preemption while active_mm is being loaded and * it and mm are being updated, which could cause problems for diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index e107f292fc42..f1d2beac464c 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -344,8 +344,6 @@ enum { #include #endif -extern void membarrier_exec_mmap(struct mm_struct *mm); - extern void membarrier_update_current_mm(struct mm_struct *next_mm); /* diff --git a/kernel/kthread.c b/kernel/kthread.c index 3b18329f885c..18b0a2e0e3b2 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1351,6 +1351,11 @@ void kthread_use_mm(struct mm_struct *mm) WARN_ON_ONCE(tsk->mm); task_lock(tsk); + /* + * membarrier() requires a full barrier before switching mm. + */ + smp_mb__after_spinlock(); + /* Hold off tlb flush IPIs while switching mm's */ local_irq_disable(); active_mm = tsk->active_mm; diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index c38014c2ed66..44fafa6e1efd 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -277,21 +277,6 @@ static void ipi_sync_rq_state(void *info) smp_mb(); } -void membarrier_exec_mmap(struct mm_struct *mm) -{ - /* - * Issue a memory barrier before clearing membarrier_state to - * guarantee that no memory access prior to exec is reordered after - * clearing this state. - */ - smp_mb(); - /* - * Keep the runqueue membarrier_state in sync with this mm - * membarrier_state. 
- */ - this_cpu_write(runqueues.membarrier_state, 0); -} - void membarrier_update_current_mm(struct mm_struct *next_mm) { struct rq *rq = this_rq(); From patchwork Sat Jan 8 16:43:55 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707539 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id CC8E6C433F5 for ; Sat, 8 Jan 2022 16:44:37 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 90BA86B0080; Sat, 8 Jan 2022 11:44:30 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 7F8D86B0082; Sat, 8 Jan 2022 11:44:30 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 698756B0083; Sat, 8 Jan 2022 11:44:30 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0220.hostedemail.com [216.40.44.220]) by kanga.kvack.org (Postfix) with ESMTP id 4F3156B0080 for ; Sat, 8 Jan 2022 11:44:30 -0500 (EST) Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 032F098C3A for ; Sat, 8 Jan 2022 16:44:30 +0000 (UTC) X-FDA: 79007693100.13.A00FDBA Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf05.hostedemail.com (Postfix) with ESMTP id 7D87510000D for ; Sat, 8 Jan 2022 16:44:29 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id D819160DEB; Sat, 8 Jan 2022 16:44:28 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 2958AC36AED; Sat, 8 Jan 2022 16:44:28 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660268; bh=VqXWDLIfda9ix4PWEeLYbN3dci4eLEKG08Nxx9ltxEQ=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=V06V1fGQjBCtMex/IKqEz4AHKdynLFyK3aQUjFL0aKgqnf1YsnJ1zg2qxfMeCD8ug 0JPMCzf64J/EySSFm6y+aFNu7gDoWMJchLOwmCYqRoNo3aHS0KvE4lp3+kCLKvScl7 9bH6pg+3W/7DM458zXR4DeDoYsnMrmy5BqAxl5Js80ikuPr52CTsxL9522fwgpupws 0f2kEsr8yvkLXVFi0xkE2KF+X7ASPKnrk04DnPaZTObMEI8pzPa3b87vvn93oosNSI /evFyjTSD3kaPNgHHtReWbZze8AejSQJ0m7GmeOnmIT/LwfvwYQ0DeGbS6Iwbt34A4 dhWglnickt2dw== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski , Joerg Roedel , Masami Hiramatsu Subject: [PATCH 10/23] x86/events, x86/insn-eval: Remove incorrect active_mm references Date: Sat, 8 Jan 2022 08:43:55 -0800 Message-Id: X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Queue-Id: 7D87510000D X-Stat-Signature: j8gjj46pfuiacqrwbgkfyr3eida9z9dg Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=V06V1fGQ; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf05.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org X-Rspamd-Server: rspam11 X-HE-Tag: 1641660269-108754 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 
Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: When decoding an instruction or handling a perf event that references an LDT segment, if we don't have a valid user context, trying to access the LDT by any means other than SLDT is racy. Certainly, using current->active_mm is wrong, as active_mm can point to a real user mm when CR3 and LDTR no longer reference that mm. Clean up the code. If nmi_uaccess_okay() says we don't have a valid context, just fail. Otherwise use current->mm. Cc: Joerg Roedel Cc: Masami Hiramatsu Signed-off-by: Andy Lutomirski --- arch/x86/events/core.c | 9 ++++++++- arch/x86/lib/insn-eval.c | 13 ++++++++++--- 2 files changed, 18 insertions(+), 4 deletions(-) diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c index 6dfa8ddaa60f..930082f0eba5 100644 --- a/arch/x86/events/core.c +++ b/arch/x86/events/core.c @@ -2800,8 +2800,15 @@ static unsigned long get_segment_base(unsigned int segment) #ifdef CONFIG_MODIFY_LDT_SYSCALL struct ldt_struct *ldt; + /* + * If we're not in a valid context with a real (not just lazy) + * user mm, then don't even try. + */ + if (!nmi_uaccess_okay()) + return 0; + /* IRQs are off, so this synchronizes with smp_store_release */ - ldt = READ_ONCE(current->active_mm->context.ldt); + ldt = smp_load_acquire(&current->mm->context.ldt); if (!ldt || idx >= ldt->nr_entries) return 0; diff --git a/arch/x86/lib/insn-eval.c b/arch/x86/lib/insn-eval.c index a1d24fdc07cf..87a85a9dcdc4 100644 --- a/arch/x86/lib/insn-eval.c +++ b/arch/x86/lib/insn-eval.c @@ -609,14 +609,21 @@ static bool get_desc(struct desc_struct *out, unsigned short sel) /* Bits [15:3] contain the index of the desired entry. */ sel >>= 3; - mutex_lock(&current->active_mm->context.lock); - ldt = current->active_mm->context.ldt; + /* + * If we're not in a valid context with a real (not just lazy) + * user mm, then don't even try.
+ */ + if (!nmi_uaccess_okay()) + return false; + + mutex_lock(&current->mm->context.lock); + ldt = current->mm->context.ldt; if (ldt && sel < ldt->nr_entries) { *out = ldt->entries[sel]; success = true; } - mutex_unlock(&current->active_mm->context.lock); + mutex_unlock(&current->mm->context.lock); return success; } From patchwork Sat Jan 8 16:43:56 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707540 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 73300C433EF for ; Sat, 8 Jan 2022 16:44:39 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 478FE6B0082; Sat, 8 Jan 2022 11:44:31 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 427CD6B0083; Sat, 8 Jan 2022 11:44:31 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 119B76B0085; Sat, 8 Jan 2022 11:44:31 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0211.hostedemail.com [216.40.44.211]) by kanga.kvack.org (Postfix) with ESMTP id E4DA06B0082 for ; Sat, 8 Jan 2022 11:44:30 -0500 (EST) Received: from smtpin30.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id AD34180FD981 for ; Sat, 8 Jan 2022 16:44:30 +0000 (UTC) X-FDA: 79007693100.30.364CF2A Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf18.hostedemail.com (Postfix) with ESMTP id 3DE771C000B for ; Sat, 8 Jan 2022 16:44:30 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 8836360DD0; Sat, 8 Jan 2022 16:44:29 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3914AC36AED; Sat, 8 Jan 2022 16:44:29 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660269; bh=6qfndroNSvS0PIjgZS5xp09UQvoV+Yf5NN9ZzmIhrIQ=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=LMeO/k5pF7lBmYkpUKME1r7Y5oNUYFq/iwsij2I8eVxJi2tdPS2zI9IGMIU0GE0oU Kp7yLce8Y8unoowsw8vYLycwt5QPkQbUsU/pJC2zVt7G0e7/0ej+MjGXXyAA+As8T2 /eQIgeEcTkClLIINypGIvv3g8UyKQOGSbIVtosrPGbq66FUsx1pfb6vmQD03IdCc+H e8eGvUz+8Eah4ApSmNIO4TYL3M08jd8NBw0r7xP3/knCaFOnTVadFP93khGD+GhXaD 8qDMF4ZFAz2/FK7pl7iB1h6sJ9g8HqrJztV0RgSKf/WcwAhbRfzDcKyCslUESX9xvl wDEj07Lyuq/OQ== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski , Woody Lin , Valentin Schneider , Sami Tolvanen Subject: [PATCH 11/23] sched/scs: Initialize shadow stack on idle thread bringup, not shutdown Date: Sat, 8 Jan 2022 08:43:56 -0800 Message-Id: <233d81a0a1e7b8eca1907998152ee848159b8774.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Queue-Id: 3DE771C000B X-Stat-Signature: emxo5614g8nagxwci7iqdd3i79maupc7 Authentication-Results: imf18.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b="LMeO/k5p"; spf=pass (imf18.hostedemail.com: domain of luto@kernel.org
designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-Rspamd-Server: rspam10 X-HE-Tag: 1641660270-850782 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Starting with commit 63acd42c0d49 ("sched/scs: Reset the shadow stack when idle_task_exit"), the idle thread's shadow stack was reset from the idle task's context during CPU hot-unplug. This was fragile: between resetting the shadow stack and actually stopping the idle task, the shadow stack did not match the actual call stack. Clean this up by resetting the idle task's SCS in bringup_cpu(). init_idle() still does scs_task_reset() -- see the comments there. I leave this to an SCS maintainer to untangle further. Cc: Woody Lin Cc: Valentin Schneider Cc: Sami Tolvanen Signed-off-by: Andy Lutomirski --- kernel/cpu.c | 3 +++ kernel/sched/core.c | 9 ++++++++- 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/kernel/cpu.c b/kernel/cpu.c index 192e43a87407..be16816bb87c 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -33,6 +33,7 @@ #include #include #include +#include #include #define CREATE_TRACE_POINTS @@ -587,6 +588,8 @@ static int bringup_cpu(unsigned int cpu) struct task_struct *idle = idle_thread_get(cpu); int ret; + scs_task_reset(idle); + /* * Some architectures have to walk the irq descriptors to * setup the vector space for the cpu which comes online. diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 917068b0a145..acd52a7d1349 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8621,7 +8621,15 @@ void __init init_idle(struct task_struct *idle, int cpu) idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY; kthread_set_per_cpu(idle, cpu); + /* + * NB: This is called from sched_init() on the *current* idle thread. + * This seems fragile if not actively incorrect. + * + * Initializing SCS for about-to-be-brought-up CPU idle threads + * is in bringup_cpu(), but that does not cover the boot CPU. 
+ */ scs_task_reset(idle); + kasan_unpoison_task_stack(idle); #ifdef CONFIG_SMP @@ -8779,7 +8787,6 @@ void idle_task_exit(void) finish_arch_post_lock_switch(); } - scs_task_reset(current); /* finish_cpu(), as ran on the BP, will clean up the active_mm state */ } From patchwork Sat Jan 8 16:43:57 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707541 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id DECE9C433FE for ; Sat, 8 Jan 2022 16:44:40 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 4A03E6B0083; Sat, 8 Jan 2022 11:44:32 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 401786B0085; Sat, 8 Jan 2022 11:44:32 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2795E6B0087; Sat, 8 Jan 2022 11:44:32 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0161.hostedemail.com [216.40.44.161]) by kanga.kvack.org (Postfix) with ESMTP id 129B56B0083 for ; Sat, 8 Jan 2022 11:44:32 -0500 (EST) Received: from smtpin12.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id C8DB6181C98F6 for ; Sat, 8 Jan 2022 16:44:31 +0000 (UTC) X-FDA: 79007693142.12.110031C Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf15.hostedemail.com (Postfix) with ESMTP id 53A29A0014 for ; Sat, 8 Jan 2022 16:44:31 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 9DB1560DDB; Sat, 8 Jan 2022 16:44:30 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4DB21C36AE0; Sat, 8 Jan 2022 16:44:30 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660270; bh=pgtinwacUtnBeJrky/D8oYhqQ9FWRPsgYNyqIlGtAfk=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=c6CfzbD6d8c3+LBQbBZ9SOvugdwriXJbrBmrj+CT88nV8D2tiCbMznJTXpviUKJ3o eKvuobEEx8o/V4L+TppMNaNt1pQW6X+LgeZZIsvgUaKb7xVQdOK5/niGFP4AXBtCga cfc6TkgRb3efnluBFPjcEXZzjSvLEJKBCTAFYwNRpmkf3VR2+ew9sdZhaqszswH1ZC wQB3zJbsLNS/NCVmnllHpBLH4xpKNsVB2P16kvJMzXDUUeJUt68ZgpGN+nZ1Cbhyrj Yl8w1ol3618Gwx+MTJVTZxodFBavlpjI7uNZpw61vp1LzzFv/oa6ZWHq4HqzOQ3rCk 8OMKx8WOaVHaQ== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 12/23] Rework "sched/core: Fix illegal RCU from offline CPUs" Date: Sat, 8 Jan 2022 08:43:57 -0800 Message-Id: X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=c6CfzbD6; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf15.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 53A29A0014 X-Stat-Signature: 88ge89mq6jaegs6rienoroahagwfyjjo X-HE-Tag: 1641660271-459922 
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This reworks commit bf2c59fce4074e55d622089b34be3a6bc95484fb. The problem solved by that commit was that mmdrop() after cpuhp_report_idle_dead() is an illegal use of RCU, so, with that commit applied, mmdrop() of the last lazy mm on an offlined CPU was done by the BSP. With the upcoming reworking of lazy mm references, retaining that design would involve the cpu hotplug code poking into internal scheduler details. Rework the fix. Add a helper unlazy_mm_irqs_off() to fully switch a CPU to init_mm, releasing any previous lazy active_mm, and do this before cpuhp_report_idle_dead(). Note that the actual refcounting of init_mm is inconsistent both before and after this patch. Most (all?) arches mmgrab(&init_mm) when booting an AP and set current->active_mm = &init_mm on that AP. This is consistent with the current ->active_mm refcounting rules, but architectures don't do a corresponding mmdrop() when a CPU goes offline. The result is that each offline/online cycle leaks one init_mm reference. This seems fairly harmless. Signed-off-by: Andy Lutomirski --- arch/arm/kernel/smp.c | 2 - arch/arm64/kernel/smp.c | 2 - arch/csky/kernel/smp.c | 2 - arch/ia64/kernel/process.c | 1 - arch/mips/cavium-octeon/smp.c | 1 - arch/mips/kernel/smp-bmips.c | 2 - arch/mips/kernel/smp-cps.c | 1 - arch/mips/loongson64/smp.c | 2 - arch/powerpc/platforms/85xx/smp.c | 2 - arch/powerpc/platforms/powermac/smp.c | 2 - arch/powerpc/platforms/powernv/smp.c | 1 - arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 - arch/powerpc/platforms/pseries/pmem.c | 1 - arch/riscv/kernel/cpu-hotplug.c | 2 - arch/s390/kernel/smp.c | 1 - arch/sh/kernel/smp.c | 1 - arch/sparc/kernel/smp_64.c | 2 - arch/x86/kernel/smpboot.c | 2 - arch/xtensa/kernel/smp.c | 1 - include/linux/sched/hotplug.h | 6 --- kernel/cpu.c | 18 +------- kernel/sched/core.c | 43 +++++++++++--------- kernel/sched/idle.c | 1 + kernel/sched/sched.h | 1 + 24 files changed, 27 insertions(+), 72 deletions(-) diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c index 842427ff2b3c..19863ad2f852 100644 --- a/arch/arm/kernel/smp.c +++ b/arch/arm/kernel/smp.c @@ -323,8 +323,6 @@ void arch_cpu_idle_dead(void) { unsigned int cpu = smp_processor_id(); - idle_task_exit(); - local_irq_disable(); /* diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index 6f6ff072acbd..4b38fb42543f 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -366,8 +366,6 @@ void cpu_die(void) unsigned int cpu = smp_processor_id(); const struct cpu_operations *ops = get_cpu_ops(cpu); - idle_task_exit(); - local_daif_mask(); /* Tell __cpu_die() that this CPU is now safe to dispose of */ diff --git a/arch/csky/kernel/smp.c b/arch/csky/kernel/smp.c index e2993539af8e..4b17c3b8fcba 100644 --- a/arch/csky/kernel/smp.c +++ b/arch/csky/kernel/smp.c @@ -309,8 +309,6 @@ void __cpu_die(unsigned int cpu) void arch_cpu_idle_dead(void) { - idle_task_exit(); - cpu_report_death(); while (!secondary_stack) diff --git a/arch/ia64/kernel/process.c b/arch/ia64/kernel/process.c index e56d63f4abf9..ddb13db7ff7e 100644 --- a/arch/ia64/kernel/process.c +++ b/arch/ia64/kernel/process.c @@ -209,7 +209,6 @@ static inline void play_dead(void) max_xtp(); local_irq_disable(); - idle_task_exit(); ia64_jump_to_sal(&sal_boot_rendez_state[this_cpu]); /* * The above is a point of no-return, the processor is diff --git a/arch/mips/cavium-octeon/smp.c
b/arch/mips/cavium-octeon/smp.c index 89954f5f87fb..7130ec7e9b61 100644 --- a/arch/mips/cavium-octeon/smp.c +++ b/arch/mips/cavium-octeon/smp.c @@ -343,7 +343,6 @@ void play_dead(void) { int cpu = cpu_number_map(cvmx_get_core_num()); - idle_task_exit(); octeon_processor_boot = 0xff; per_cpu(cpu_state, cpu) = CPU_DEAD; diff --git a/arch/mips/kernel/smp-bmips.c b/arch/mips/kernel/smp-bmips.c index b6ef5f7312cf..bd1e650dd176 100644 --- a/arch/mips/kernel/smp-bmips.c +++ b/arch/mips/kernel/smp-bmips.c @@ -388,8 +388,6 @@ static void bmips_cpu_die(unsigned int cpu) void __ref play_dead(void) { - idle_task_exit(); - /* flush data cache */ _dma_cache_wback_inv(0, ~0); diff --git a/arch/mips/kernel/smp-cps.c b/arch/mips/kernel/smp-cps.c index bcd6a944b839..23221fcee423 100644 --- a/arch/mips/kernel/smp-cps.c +++ b/arch/mips/kernel/smp-cps.c @@ -472,7 +472,6 @@ void play_dead(void) unsigned int cpu; local_irq_disable(); - idle_task_exit(); cpu = smp_processor_id(); cpu_death = CPU_DEATH_POWER; diff --git a/arch/mips/loongson64/smp.c b/arch/mips/loongson64/smp.c index 09ebe84a17fe..a1fe59f354d1 100644 --- a/arch/mips/loongson64/smp.c +++ b/arch/mips/loongson64/smp.c @@ -788,8 +788,6 @@ void play_dead(void) unsigned int cpu = smp_processor_id(); void (*play_dead_at_ckseg1)(int *); - idle_task_exit(); - prid_imp = read_c0_prid() & PRID_IMP_MASK; prid_rev = read_c0_prid() & PRID_REV_MASK; diff --git a/arch/powerpc/platforms/85xx/smp.c b/arch/powerpc/platforms/85xx/smp.c index c6df294054fe..9de9e1fcc87a 100644 --- a/arch/powerpc/platforms/85xx/smp.c +++ b/arch/powerpc/platforms/85xx/smp.c @@ -121,8 +121,6 @@ static void smp_85xx_cpu_offline_self(void) /* mask all irqs to prevent cpu wakeup */ qoriq_pm_ops->irq_mask(cpu); - idle_task_exit(); - mtspr(SPRN_TCR, 0); mtspr(SPRN_TSR, mfspr(SPRN_TSR)); diff --git a/arch/powerpc/platforms/powermac/smp.c b/arch/powerpc/platforms/powermac/smp.c index 3256a316e884..69d2bdd8246d 100644 --- a/arch/powerpc/platforms/powermac/smp.c +++ b/arch/powerpc/platforms/powermac/smp.c @@ -924,7 +924,6 @@ static void pmac_cpu_offline_self(void) int cpu = smp_processor_id(); local_irq_disable(); - idle_task_exit(); pr_debug("CPU%d offline\n", cpu); generic_set_cpu_dead(cpu); smp_wmb(); @@ -939,7 +938,6 @@ static void pmac_cpu_offline_self(void) int cpu = smp_processor_id(); local_irq_disable(); - idle_task_exit(); /* * turn off as much as possible, we'll be diff --git a/arch/powerpc/platforms/powernv/smp.c b/arch/powerpc/platforms/powernv/smp.c index cbb67813cd5d..cba21d053dae 100644 --- a/arch/powerpc/platforms/powernv/smp.c +++ b/arch/powerpc/platforms/powernv/smp.c @@ -169,7 +169,6 @@ static void pnv_cpu_offline_self(void) /* Standard hot unplug procedure */ - idle_task_exit(); cpu = smp_processor_id(); DBG("CPU%d offline\n", cpu); generic_set_cpu_dead(cpu); diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index d646c22e94ab..c11ccd038866 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -19,7 +19,6 @@ #include #include #include -#include /* for idle_task_exit */ #include #include #include @@ -63,7 +62,6 @@ static void pseries_cpu_offline_self(void) unsigned int hwcpu = hard_smp_processor_id(); local_irq_disable(); - idle_task_exit(); if (xive_enabled()) xive_teardown_cpu(); else diff --git a/arch/powerpc/platforms/pseries/pmem.c b/arch/powerpc/platforms/pseries/pmem.c index 439ac72c2470..5280fcd5b37d 100644 --- a/arch/powerpc/platforms/pseries/pmem.c +++ 
b/arch/powerpc/platforms/pseries/pmem.c @@ -9,7 +9,6 @@ #include #include #include -#include /* for idle_task_exit */ #include #include #include diff --git a/arch/riscv/kernel/cpu-hotplug.c b/arch/riscv/kernel/cpu-hotplug.c index df84e0c13db1..6cced2d79f07 100644 --- a/arch/riscv/kernel/cpu-hotplug.c +++ b/arch/riscv/kernel/cpu-hotplug.c @@ -77,8 +77,6 @@ void __cpu_die(unsigned int cpu) */ void cpu_stop(void) { - idle_task_exit(); - (void)cpu_report_death(); cpu_ops[smp_processor_id()]->cpu_stop(); diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c index 1a04e5bdf655..328930549803 100644 --- a/arch/s390/kernel/smp.c +++ b/arch/s390/kernel/smp.c @@ -987,7 +987,6 @@ void __cpu_die(unsigned int cpu) void __noreturn cpu_die(void) { - idle_task_exit(); __bpon(); pcpu_sigp_retry(pcpu_devices + smp_processor_id(), SIGP_STOP, 0); for (;;) ; diff --git a/arch/sh/kernel/smp.c b/arch/sh/kernel/smp.c index 65924d9ec245..cbd14604a736 100644 --- a/arch/sh/kernel/smp.c +++ b/arch/sh/kernel/smp.c @@ -106,7 +106,6 @@ int native_cpu_disable(unsigned int cpu) void play_dead_common(void) { - idle_task_exit(); irq_ctx_exit(raw_smp_processor_id()); mb(); diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c index 0224d8f19ed6..450dc9513ff0 100644 --- a/arch/sparc/kernel/smp_64.c +++ b/arch/sparc/kernel/smp_64.c @@ -1301,8 +1301,6 @@ void cpu_play_dead(void) int cpu = smp_processor_id(); unsigned long pstate; - idle_task_exit(); - if (tlb_type == hypervisor) { struct trap_per_cpu *tb = &trap_block[cpu]; diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 85f6e242b6b4..a57a709f2c35 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1656,8 +1656,6 @@ void native_cpu_die(unsigned int cpu) void play_dead_common(void) { - idle_task_exit(); - /* Ack it */ (void)cpu_report_death(); diff --git a/arch/xtensa/kernel/smp.c b/arch/xtensa/kernel/smp.c index 1254da07ead1..fb011807d041 100644 --- a/arch/xtensa/kernel/smp.c +++ b/arch/xtensa/kernel/smp.c @@ -329,7 +329,6 @@ void arch_cpu_idle_dead(void) */ void __ref cpu_die(void) { - idle_task_exit(); local_irq_disable(); __asm__ __volatile__( " movi a2, cpu_restart\n" diff --git a/include/linux/sched/hotplug.h b/include/linux/sched/hotplug.h index 412cdaba33eb..18fa3e63123e 100644 --- a/include/linux/sched/hotplug.h +++ b/include/linux/sched/hotplug.h @@ -18,10 +18,4 @@ extern int sched_cpu_dying(unsigned int cpu); # define sched_cpu_dying NULL #endif -#ifdef CONFIG_HOTPLUG_CPU -extern void idle_task_exit(void); -#else -static inline void idle_task_exit(void) {} -#endif - #endif /* _LINUX_SCHED_HOTPLUG_H */ diff --git a/kernel/cpu.c b/kernel/cpu.c index be16816bb87c..709e2a7583ad 100644 --- a/kernel/cpu.c +++ b/kernel/cpu.c @@ -3,7 +3,6 @@ * * This code is licenced under the GPL. */ -#include #include #include #include @@ -605,21 +604,6 @@ static int bringup_cpu(unsigned int cpu) return bringup_wait_for_ap(cpu); } -static int finish_cpu(unsigned int cpu) -{ - struct task_struct *idle = idle_thread_get(cpu); - struct mm_struct *mm = idle->active_mm; - - /* - * idle_task_exit() will have switched to &init_mm, now - * clean up any remaining active_mm state. 
- */ - if (mm != &init_mm) - idle->active_mm = &init_mm; - mmdrop(mm); - return 0; -} - /* * Hotplug state machine related functions */ @@ -1699,7 +1683,7 @@ static struct cpuhp_step cpuhp_hp_states[] = { [CPUHP_BRINGUP_CPU] = { .name = "cpu:bringup", .startup.single = bringup_cpu, - .teardown.single = finish_cpu, + .teardown.single = NULL, .cant_stop = true, }, /* Final state before CPU kills itself */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index acd52a7d1349..32275b4ff141 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -8678,6 +8678,30 @@ void __init init_idle(struct task_struct *idle, int cpu) #endif } +/* + * Drops current->active_mm and switches current->active_mm to &init_mm. + * Caller must have IRQs off and must have current->mm == NULL (i.e. must + * be in a kernel thread). + */ +void unlazy_mm_irqs_off(void) +{ + struct mm_struct *mm = current->active_mm; + + lockdep_assert_irqs_disabled(); + + if (WARN_ON_ONCE(current->mm)) + return; + + if (mm == &init_mm) + return; + + switch_mm_irqs_off(mm, &init_mm, current); + mmgrab(&init_mm); + current->active_mm = &init_mm; + finish_arch_post_lock_switch(); + mmdrop(mm); +} + #ifdef CONFIG_SMP int cpuset_cpumask_can_shrink(const struct cpumask *cur, @@ -8771,25 +8795,6 @@ void sched_setnuma(struct task_struct *p, int nid) #endif /* CONFIG_NUMA_BALANCING */ #ifdef CONFIG_HOTPLUG_CPU -/* - * Ensure that the idle task is using init_mm right before its CPU goes - * offline. - */ -void idle_task_exit(void) -{ - struct mm_struct *mm = current->active_mm; - - BUG_ON(cpu_online(smp_processor_id())); - BUG_ON(current != this_rq()->idle); - - if (mm != &init_mm) { - switch_mm(mm, &init_mm, current); - finish_arch_post_lock_switch(); - } - - /* finish_cpu(), as ran on the BP, will clean up the active_mm state */ -} - static int __balance_push_cpu_stop(void *arg) { struct task_struct *p = arg; diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index d17b0a5ce6ac..af6a98e7a8d1 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -285,6 +285,7 @@ static void do_idle(void) local_irq_disable(); if (cpu_is_offline(cpu)) { + unlazy_mm_irqs_off(); tick_nohz_idle_stop_tick(); cpuhp_report_idle_dead(); arch_cpu_idle_dead(); diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 3d3e5793e117..b496a9ee9aec 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3064,3 +3064,4 @@ extern int sched_dynamic_mode(const char *str); extern void sched_dynamic_update(int mode); #endif +extern void unlazy_mm_irqs_off(void); From patchwork Sat Jan 8 16:43:58 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707543 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C7F4FC433F5 for ; Sat, 8 Jan 2022 16:44:44 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 143AD6B0087; Sat, 8 Jan 2022 11:44:35 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 07DBE6B0088; Sat, 8 Jan 2022 11:44:35 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id E3C736B0089; Sat, 8 Jan 2022 11:44:34 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0188.hostedemail.com [216.40.44.188]) by kanga.kvack.org (Postfix) with ESMTP id 
BFC6B6B0088 for ; Sat, 8 Jan 2022 11:44:34 -0500 (EST) Received: from smtpin07.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 6872297906 for ; Sat, 8 Jan 2022 16:44:34 +0000 (UTC) X-FDA: 79007693268.07.5E8B834 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by imf15.hostedemail.com (Postfix) with ESMTP id 09481A0018 for ; Sat, 8 Jan 2022 16:44:33 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id EE672B80B44; Sat, 8 Jan 2022 16:44:32 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 85F0DC36AED; Sat, 8 Jan 2022 16:44:31 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660271; bh=rqABZliQOnPw5+mRwJyiDUUR6g4Y/aDsdVsBAqWYhN8=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=fNMg1GzMBirsWhPfcJt1VS301owi+7JZLVLHidbemWtEDnzBUeY3M57pY4xDlXxvJ jAZyEHz6Nfysn9TMOIt9eFhF5NissZryhOyD2fYqHo5bJPiFOXFH0+yBIWFKVD7+qU sIM2nEPPYJgDNxeIYejHjq4i+TacBOY563IE6FYmIGA7kYDAwcmtIrYNCiIUS+IyEh 3oU7c1dDNyrd1gHCnyl6y9k5ASLzmcLy1HQIDCW4NVDlQGfcq+pdDyMkU3UY/haVLp AN77MSHH/JA4rLDvCP3U51/Vecf7iWcX32o+xIfYXJ1gMyw13FlZFBTQdRaxVV7NSn CYdwqtRUYX/eQ== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 13/23] exec: Remove unnecessary vmacache_seqnum clear in exec_mmap() Date: Sat, 8 Jan 2022 08:43:58 -0800 Message-Id: <28ff7966573c7830f6bc296c1a1fc9a4c7072dc0.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 09481A0018 X-Stat-Signature: pykgauji9iygje6dkjkjxnnw3y7eqkwk Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=fNMg1GzM; spf=pass (imf15.hostedemail.com: domain of luto@kernel.org designates 145.40.68.75 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-HE-Tag: 1641660273-676469 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: exec_mmap() activates a brand new mm, so vmacache_seqnum is already 0. Stop zeroing it. 
Signed-off-by: Andy Lutomirski --- fs/exec.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/exec.c b/fs/exec.c index 325dab98bc51..2afa7b0c75f2 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1024,7 +1024,6 @@ static int exec_mmap(struct mm_struct *mm) if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) local_irq_enable(); membarrier_finish_switch_mm(mm); - tsk->mm->vmacache_seqnum = 0; vmacache_flush(tsk); task_unlock(tsk); if (old_mm) { From patchwork Sat Jan 8 16:43:59 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707542 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B41BAC433F5 for ; Sat, 8 Jan 2022 16:44:42 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B07F46B0085; Sat, 8 Jan 2022 11:44:34 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id A6ABC6B0087; Sat, 8 Jan 2022 11:44:34 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8BB9C6B0088; Sat, 8 Jan 2022 11:44:34 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0173.hostedemail.com [216.40.44.173]) by kanga.kvack.org (Postfix) with ESMTP id 74BDC6B0085 for ; Sat, 8 Jan 2022 11:44:34 -0500 (EST) Received: from smtpin22.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 364E7181953EE for ; Sat, 8 Jan 2022 16:44:34 +0000 (UTC) X-FDA: 79007693268.22.3C44A2E Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf29.hostedemail.com (Postfix) with ESMTP id A8F4112000B for ; Sat, 8 Jan 2022 16:44:33 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 0376A60DDF; Sat, 8 Jan 2022 16:44:33 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id A74EBC36AE0; Sat, 8 Jan 2022 16:44:32 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660272; bh=8B16LCC4JGxAUtyfSb6J025YSsjZ/ID9qTkdK302ZaE=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=q2ZWfShUvoLzpVsRhPVgmFaKeZMadx6Ptwbsmg5lcw5VSiQkSWPasXkyTsMqD/XfF 49hkcmon7N6XWdaYppI3NPLf4mFrCNY8ND+aea/AyOjj1fLUIlnyKFAzu80HZjYHEb Sv8FBu159LhVRoRRFS2ESiNTjpaoinj12P0sGeGVU08jllcUyMUhFHD5Umsy8ddwGg wbzXjftFLjNH5Ubpv9vB1GoP3zhgyd86usbSN9o//WonrCNiMbLkwbkd2IMwNzzvAW LzCjIdawbMRc8i0dJ+sHJw1Yr8K3eZzyOvYWstCT++xy4a6bE5s5mRfNEldhAOBKYP g5pEdTcr8nhBQ== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 14/23] sched, exec: Factor current mm changes out from exec Date: Sat, 8 Jan 2022 08:43:59 -0800 Message-Id: <60eb8a98061100f95e53e7868841fbb9a68237c8.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=q2ZWfShU; spf=pass (imf29.hostedemail.com: domain of luto@kernel.org 
designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: A8F4112000B X-Stat-Signature: bdcuhzqyow8ib13ka7yq3zzksqqunrk4 X-HE-Tag: 1641660273-193072 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Currently, exec_mmap() open-codes an mm change. Create new core __change_current_mm() and __change_current_mm_to_kernel() helpers and use the former from exec_mmap(). This moves the nasty scheduler details out of exec.c and prepares for reusing this code elsewhere. Signed-off-by: Andy Lutomirski --- fs/exec.c | 32 +---------- include/linux/sched/mm.h | 20 +++++++ kernel/sched/core.c | 119 +++++++++++++++++++++++++++++++++++++++ 3 files changed, 141 insertions(+), 30 deletions(-) diff --git a/fs/exec.c b/fs/exec.c index 2afa7b0c75f2..9e1c2ee7c986 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -971,15 +971,13 @@ EXPORT_SYMBOL(read_code); static int exec_mmap(struct mm_struct *mm) { struct task_struct *tsk; - struct mm_struct *old_mm, *active_mm; + struct mm_struct *old_mm; int ret; /* Notify parent that we're no longer interested in the old VM */ tsk = current; old_mm = current->mm; exec_mm_release(tsk, old_mm); - if (old_mm) - sync_mm_rss(old_mm); ret = down_write_killable(&tsk->signal->exec_update_lock); if (ret) @@ -1000,41 +998,15 @@ static int exec_mmap(struct mm_struct *mm) } } - task_lock(tsk); - /* - * membarrier() requires a full barrier before switching mm. - */ - smp_mb__after_spinlock(); + __change_current_mm(mm, true); - local_irq_disable(); - active_mm = tsk->active_mm; - tsk->active_mm = mm; - WRITE_ONCE(tsk->mm, mm); /* membarrier reads this without locks */ - membarrier_update_current_mm(mm); - /* - * This prevents preemption while active_mm is being loaded and - * it and mm are being updated, which could cause problems for - * lazy tlb mm refcounting when these are updated by context - * switches. Not all architectures can handle irqs off over - * activate_mm yet. - */ - if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) - local_irq_enable(); - activate_mm(active_mm, mm); - if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) - local_irq_enable(); - membarrier_finish_switch_mm(mm); - vmacache_flush(tsk); - task_unlock(tsk); if (old_mm) { mmap_read_unlock(old_mm); - BUG_ON(active_mm != old_mm); setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm); mm_update_next_owner(old_mm); mmput(old_mm); return 0; } - mmdrop(active_mm); return 0; } diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index f1d2beac464c..7509b2b2e99d 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -83,6 +83,26 @@ extern void mmput(struct mm_struct *); void mmput_async(struct mm_struct *); #endif +/* + * Switch the mm for current. This does not mmget() mm, nor does it mmput() + * the previous mm, if any. The caller is responsible for reference counting, + * although __change_current_mm() handles all details related to lazy mm + * refcounting. + * + * If the caller is a user task, the caller must call mm_update_next_owner(). + */ +void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new); + +/* + * Switch the mm for current to the kernel mm. This does not mmdrop() + * -- the caller is responsible for reference counting, although + * __change_current_mm_to_kernel() handles all details related to lazy + * mm refcounting. 
+ * + * If the caller is a user task, the caller must call mm_update_next_owner(). + */ +void __change_current_mm_to_kernel(void); + /* Grab a reference to a task's mm, if it is not already going away */ extern struct mm_struct *get_task_mm(struct task_struct *task); /* diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 32275b4ff141..95eb0e78f74c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -14,6 +14,7 @@ #include +#include #include #include @@ -4934,6 +4935,124 @@ context_switch(struct rq *rq, struct task_struct *prev, return finish_task_switch(prev); } +void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new) +{ + struct task_struct *tsk = current; + struct mm_struct *old_active_mm, *mm_to_drop = NULL; + + BUG_ON(!mm); /* likely to cause corruption if we continue */ + + /* + * We do not want to schedule, nor should procfs peek at current->mm + * while we're modifying it. task_lock() disables preemption and + * locks against procfs. + */ + task_lock(tsk); + /* + * membarrier() requires a full barrier before switching mm. + */ + smp_mb__after_spinlock(); + + local_irq_disable(); + + if (tsk->mm) { + /* We're detaching from an old mm. Sync stats. */ + sync_mm_rss(tsk->mm); + } else { + /* + * Switching from kernel mm to user. Drop the old lazy + * mm reference. + */ + mm_to_drop = tsk->active_mm; + } + + old_active_mm = tsk->active_mm; + tsk->active_mm = mm; + WRITE_ONCE(tsk->mm, mm); /* membarrier reads this without locks */ + membarrier_update_current_mm(mm); + + if (mm_is_brand_new) { + /* + * For historical reasons, some architectures want IRQs on + * when activate_mm() is called. If we're going to call + * activate_mm(), turn on IRQs but leave preemption + * disabled. + */ + if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) + local_irq_enable(); + activate_mm(old_active_mm, mm); + if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM)) + local_irq_enable(); + } else { + switch_mm_irqs_off(old_active_mm, mm, tsk); + local_irq_enable(); + } + + /* IRQs are on now. Preemption is still disabled by task_lock(). */ + + membarrier_finish_switch_mm(mm); + vmacache_flush(tsk); + task_unlock(tsk); + +#ifdef finish_arch_post_lock_switch + if (!mm_is_brand_new) { + /* + * Some architectures want a callback after + * switch_mm_irqs_off() once locks are dropped. Callers of + * activate_mm() historically did not do this, so skip it if + * we did activate_mm(). On arm, this is because + * activate_mm() switches mm with IRQs on, which uses a + * different code path. + * + * Yes, this is extremely fragile and should be cleaned up. + */ + finish_arch_post_lock_switch(); + } +#endif + + if (mm_to_drop) + mmdrop(mm_to_drop); +} + +void __change_current_mm_to_kernel(void) +{ + struct task_struct *tsk = current; + struct mm_struct *old_mm = tsk->mm; + + if (!old_mm) + return; /* nothing to do */ + + /* + * We do not want to schedule, nor should procfs peek at current->mm + * while we're modifying it. task_lock() disables preemption and + * locks against procfs. + */ + task_lock(tsk); + /* + * membarrier() requires a full barrier before switching mm.
+ */ + smp_mb__after_spinlock(); + + /* current has a real mm, so it must be active */ + WARN_ON_ONCE(tsk->active_mm != tsk->mm); + + local_irq_disable(); + + sync_mm_rss(old_mm); + + WRITE_ONCE(tsk->mm, NULL); /* membarrier reads this without locks */ + membarrier_update_current_mm(NULL); + vmacache_flush(tsk); + + /* active_mm is still 'old_mm' */ + mmgrab(old_mm); + enter_lazy_tlb(old_mm, tsk); + + local_irq_enable(); + + task_unlock(tsk); +} + /* * nr_running and nr_context_switches: * From patchwork Sat Jan 8 16:44:00 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707544 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 33D74C4167D for ; Sat, 8 Jan 2022 16:44:46 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1DEF96B0088; Sat, 8 Jan 2022 11:44:36 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 1422D6B0089; Sat, 8 Jan 2022 11:44:36 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id F24C36B008A; Sat, 8 Jan 2022 11:44:35 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0166.hostedemail.com [216.40.44.166]) by kanga.kvack.org (Postfix) with ESMTP id D3F7B6B0088 for ; Sat, 8 Jan 2022 11:44:35 -0500 (EST) Received: from smtpin31.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id 9454C181953EE for ; Sat, 8 Jan 2022 16:44:35 +0000 (UTC) X-FDA: 79007693310.31.F7570C6 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf29.hostedemail.com (Postfix) with ESMTP id CB10612000B for ; Sat, 8 Jan 2022 16:44:34 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 2D33560DE1; Sat, 8 Jan 2022 16:44:34 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id CB396C36AE0; Sat, 8 Jan 2022 16:44:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660274; bh=+HQX3q3DuPuBg03W9xoO54/1KFTovvzwFnkvqR9M6l0=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=mzYTT9KWMZuhFsogwISh7v+tUggMo6MfWKPm8Rwm7S0a7cbwPGE92vse/UIuzStJ2 MVALxrxuLKwu0wJdECMuOxgPw9eKUTIyaCQwr6qIFoq1xv1eCbqmGvs9hAxB6QF/aA bRytq3QvCl3/EhICa71pyihO7MzbsdO6/GXB7N+7J6P9yC0sli2ujMXtIgj5djFUNU AaxAC7/TzuFw6dqDvMgjaYhzsJSfCzedSXr6PeAIlH+qHD2xHylLkqUt7VpI8Jyr6x jCnSuWKaPqu7zaho/khcIwn27CSSjVJbeoAzD4bj6azmEnqgHxo58KoNv76DJl6uVQ JnUuCUBTga9Ew== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 15/23] kthread: Switch to __change_current_mm() Date: Sat, 8 Jan 2022 08:44:00 -0800 Message-Id: <521a37c3b488e902b7f2f79f10ba2e2d729ff553.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=mzYTT9KW; spf=pass 
(imf29.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: CB10612000B X-Stat-Signature: iy53si1cgg7t88uq6k471q4gswpinga1 X-HE-Tag: 1641660274-510788 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Remove the open-coded mm switching in kthread_use_mm() and kthread_unuse_mm(). This has one internally-visible effect: the old code active_mm refcounting was inconsistent with everything else and mmgrabbed the mm in kthread_use_mm(). The new code refcounts following the same rules as normal user threads, so kthreads that are currently using a user mm will not hold an mm_count reference. Signed-off-by: Andy Lutomirski --- kernel/kthread.c | 45 ++------------------------------------------- 1 file changed, 2 insertions(+), 43 deletions(-) diff --git a/kernel/kthread.c b/kernel/kthread.c index 18b0a2e0e3b2..77586f5b14e5 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1344,37 +1344,12 @@ EXPORT_SYMBOL(kthread_destroy_worker); */ void kthread_use_mm(struct mm_struct *mm) { - struct mm_struct *active_mm; struct task_struct *tsk = current; WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD)); WARN_ON_ONCE(tsk->mm); - task_lock(tsk); - /* - * membarrier() requires a full barrier before switching mm. - */ - smp_mb__after_spinlock(); - - /* Hold off tlb flush IPIs while switching mm's */ - local_irq_disable(); - active_mm = tsk->active_mm; - if (active_mm != mm) { - mmgrab(mm); - tsk->active_mm = mm; - } - WRITE_ONCE(tsk->mm, mm); /* membarrier reads this without locks */ - membarrier_update_current_mm(mm); - switch_mm_irqs_off(active_mm, mm, tsk); - membarrier_finish_switch_mm(mm); - local_irq_enable(); - task_unlock(tsk); -#ifdef finish_arch_post_lock_switch - finish_arch_post_lock_switch(); -#endif - - if (active_mm != mm) - mmdrop(active_mm); + __change_current_mm(mm, false); to_kthread(tsk)->oldfs = force_uaccess_begin(); } @@ -1393,23 +1368,7 @@ void kthread_unuse_mm(struct mm_struct *mm) force_uaccess_end(to_kthread(tsk)->oldfs); - task_lock(tsk); - /* - * When a kthread stops operating on an address space, the loop - * in membarrier_{private,global}_expedited() may not observe - * that tsk->mm, and not issue an IPI. Membarrier requires a - * memory barrier after accessing user-space memory, before - * clearing tsk->mm. 
- */ - smp_mb__after_spinlock(); - sync_mm_rss(mm); - local_irq_disable(); - WRITE_ONCE(tsk->mm, NULL); /* membarrier reads this without locks */ - membarrier_update_current_mm(NULL); - /* active_mm is still 'mm' */ - enter_lazy_tlb(mm, tsk); - local_irq_enable(); - task_unlock(tsk); + __change_current_mm_to_kernel(); } EXPORT_SYMBOL_GPL(kthread_unuse_mm); From patchwork Sat Jan 8 16:44:01 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707545 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B2671C4332F for ; Sat, 8 Jan 2022 16:44:47 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 539C56B0089; Sat, 8 Jan 2022 11:44:37 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 4C57E6B008A; Sat, 8 Jan 2022 11:44:37 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2EF376B008C; Sat, 8 Jan 2022 11:44:37 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0248.hostedemail.com [216.40.44.248]) by kanga.kvack.org (Postfix) with ESMTP id 17DAC6B0089 for ; Sat, 8 Jan 2022 11:44:37 -0500 (EST) Received: from smtpin14.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id C97E397906 for ; Sat, 8 Jan 2022 16:44:36 +0000 (UTC) X-FDA: 79007693352.14.CCE9C3B Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf31.hostedemail.com (Postfix) with ESMTP id 66B262000B for ; Sat, 8 Jan 2022 16:44:36 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id AD36360DF2; Sat, 8 Jan 2022 16:44:35 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id E69B3C36AEF; Sat, 8 Jan 2022 16:44:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660275; bh=yDoOe4UwcHpvI4lvTKMEryFWVpWlenvaZDNNNclnJoE=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=r3IkvhwWWHhRcORWc95cYxl3JBHz9QHqFg3GicxqeO4scRWEn5+GAU7+QXvzaQiJu /t6TFUZP4ot00mZAAozBixOfCTjbJ0M2wWN0Nck3kravEBBMEZjMhI17BeOQ4QtAgB TAAeAIT0X8HOYseASDVhjZz4+JPBgThT+RsGEkN7JweaLGMQQEmsFfRpGdUT7GrFiI R1fkR/UrIud6v2FczSgTzelxa5hlD7NLt6WOQpC4QK1c3hFKfxW6qtfFDgj/lgIdMU YcvRH6xgb1NkCB3TVSaJYSF95qiUH/AI+c/xHGdI20dqWnZEXzHJ4LkXLoZOpx6Uv0 nKfAKa94/vq/w== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski , Linus Torvalds Subject: [PATCH 16/23] sched: Use lightweight hazard pointers to grab lazy mms Date: Sat, 8 Jan 2022 08:44:01 -0800 Message-Id: <7c9c388c388df8e88bb5d14828053ac0cb11cf69.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 Authentication-Results: imf31.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=r3IkvhwW; spf=pass (imf31.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org; 
dmarc=pass (policy=none) header.from=kernel.org X-Rspamd-Queue-Id: 66B262000B X-Stat-Signature: owb6pfpzeunaux6ouykx846ubfthnbqf X-Rspamd-Server: rspam04 X-HE-Tag: 1641660276-566379 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Currently, switching between a real user mm and a kernel context (including idle) performs an atomic operation on a per-mm counter via mmgrab() and mmdrop(). For a single-threaded program, this isn't a big problem: a pair of atomic operations when entering and returning from idle isn't free, but it's not very expensive in the grand scheme of things. For a heavily multithreaded program on a large system, however, the overhead can be very large -- all CPUs can end up hammering the same cacheline with atomic operations, and scalability suffers. The purpose of mmgrab() and mmdrop() is to make "lazy tlb" mode safe. When Linux switches from user to kernel mm context, instead of immediately reprogramming the MMU to use init_mm, the kernel continues to use the most recent set of user page tables. This is safe as long as those page tables aren't freed. RCU can't be used to keep the pagetables alive, since RCU read locks can't be held when idle. To improve scalability, this patch adds a percpu hazard pointer scheme to keep lazily-used mms alive. Each CPU has a single pointer to an mm that must not be freed, and __mmput() checks the pointers belonging to all CPUs that might be lazily using the mm in question. By default, this means walking all online CPUs, but arch code can override the set of CPUs to check if they can do something more clever. For architectures that have accurate mm_cpumask(), mm_cpumask() is a reasonable choice. For architectures that can guarantee that *no* remote CPUs are lazily using an mm after the user portion of the pagetables is torn down (any architecture that uses IPI shootdowns in exit_mmap() and unlazies the MMU in the IPI handler, e.g. x86 on bare metal), the set of CPUs to check could be empty. XXX: I *think* this is correct when hot-unplugging a CPU, but this needs double-checking and maybe even a WARN to make sure the ordering is correct. Cc: Andrew Morton Cc: Linus Torvalds Cc: Nicholas Piggin Cc: Peter Zijlstra Cc: Rik van Riel Cc: Anton Blanchard Cc: Benjamin Herrenschmidt Cc: Linux-MM Cc: Paul Mackerras Cc: Randy Dunlap Signed-off-by: Andy Lutomirski --- include/linux/sched/mm.h | 3 + kernel/fork.c | 11 ++ kernel/sched/core.c | 230 +++++++++++++++++++++++++++++++++------ kernel/sched/sched.h | 10 +- 4 files changed, 221 insertions(+), 33 deletions(-) diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 7509b2b2e99d..3ceba11c049c 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -76,6 +76,9 @@ static inline bool mmget_not_zero(struct mm_struct *mm) /* mmput gets rid of the mappings and all user-space */ extern void mmput(struct mm_struct *); + +extern void mm_unlazy_mm_count(struct mm_struct *mm); + #ifdef CONFIG_MMU /* same as above but performs the slow path from the async context. Can * be called from the atomic context as well diff --git a/kernel/fork.c b/kernel/fork.c index 38681ad44c76..2df72cf3c0d2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1122,6 +1122,17 @@ static inline void __mmput(struct mm_struct *mm) } if (mm->binfmt) module_put(mm->binfmt->module); + + /* + * We hold one mm_count reference.
Convert all remaining lazy_mm + * references to mm_count references so that the mm will be genuinely + * unused when mm_count goes to zero. Do this after exit_mmap() so + * that, if the architecture shoots down remote TLB entries via IPI in + * exit_mmap() and calls unlazy_mm_irqs_off() when doing so, most or + * all lazy_mm references can be removed without + * mm_unlazy_mm_count()'s help. + */ + mm_unlazy_mm_count(mm); mmdrop(mm); } diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 95eb0e78f74c..64e4058b3c61 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -20,6 +20,7 @@ #include #include +#include #include "../workqueue_internal.h" #include "../../fs/io-wq.h" @@ -4750,6 +4751,144 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev, prepare_arch_switch(next); } +/* + * Called after each context switch. + * + * Strictly speaking, no action at all is required here. This rq + * can hold an extra reference to at most one mm, so the memory + * wasted by deferring the mmdrop() forever is bounded. That being + * said, it's straightforward to safely drop spare references + * in the common case. + */ +static void mmdrop_lazy(struct rq *rq) +{ + struct mm_struct *old_mm; + + old_mm = READ_ONCE(rq->drop_mm); + + do { + /* + * If there is nothing to drop or if we are still using old_mm, + * then don't call mmdrop(). + */ + if (likely(!old_mm || old_mm == rq->lazy_mm)) + return; + } while (!try_cmpxchg_relaxed(&rq->drop_mm, &old_mm, NULL)); + + mmdrop(old_mm); +} + +#ifndef for_each_possible_lazymm_cpu +#define for_each_possible_lazymm_cpu(cpu, mm) for_each_online_cpu((cpu)) +#endif + +static bool __try_mm_drop_rq_ref(struct rq *rq, struct mm_struct *mm) +{ + struct mm_struct *old_drop_mm = smp_load_acquire(&rq->drop_mm); + + /* + * We know that old_mm != mm: this is the only function that + * might set drop_mm to mm, and we haven't set it yet. + */ + WARN_ON_ONCE(old_drop_mm == mm); + + if (!old_drop_mm) { + /* + * Just set rq->drop_mm to mm and our reference will + * get dropped eventually after rq is done with it. + */ + return try_cmpxchg(&rq->drop_mm, &old_drop_mm, mm); + } + + /* + * The target cpu could still be using old_drop_mm. We know that, if + * old_drop_mm still exists, then old_drop_mm->mm_users == 0. Can we + * drop it? + * + * NB: it is critical that we load rq->lazy_mm again after loading + * drop_mm. If we looked at a prior value of lazy_mm (which we + * already know to be mm), then we would be subject to a race: + * + * Us: + * Load rq->lazy_mm. + * Remote CPU: + * Switch to old_drop_mm (with mm_users > 0) + * Become lazy and set rq->lazy_mm = old_drop_mm + * Third CPU: + * Set old_drop_mm->mm_users to 0. + * Set rq->drop_mm = old_drop_mm + * Us: + * Incorrectly believe that old_drop_mm is unused + * because rq->lazy_mm != old_drop_mm + * + * In other words, to verify that rq->lazy_mm is not keeping a given + * mm alive, we must load rq->lazy_mm _after_ we know that mm_users == + * 0 and therefore that rq will not switch to that mm. + */ + if (smp_load_acquire(&rq->lazy_mm) != mm) { + /* + * We got lucky! rq _was_ using mm, but it stopped. + * Just drop our reference. + */ + mmdrop(mm); + return true; + } + + /* + * If we got here, rq->lazy_mm != old_drop_mm, and we ruled + * out the race described above. rq is done with old_drop_mm, + * so we can steal the reference held by rq and replace it with + * our reference to mm. 
+ */ + if (cmpxchg(&rq->drop_mm, old_drop_mm, mm) != old_drop_mm) + return false; + + mmdrop(old_drop_mm); + return true; +} + +/* + * This converts all lazy_mm references to mm to mm_count refcounts. Our + * caller holds an mm_count reference, so we don't need to worry about mm + * being freed out from under us. + */ +void mm_unlazy_mm_count(struct mm_struct *mm) +{ + unsigned int drop_count = 0; + int cpu; + + /* + * mm_users is zero, so no cpu will set its rq->lazy_mm to mm. + */ + WARN_ON_ONCE(atomic_read(&mm->mm_users) != 0); + + for_each_possible_lazymm_cpu(cpu, mm) { + struct rq *rq = cpu_rq(cpu); + + if (smp_load_acquire(&rq->lazy_mm) != mm) + continue; + + /* + * Grab one reference. Do it as a batch so we do a maximum + * of two atomic operations instead of one per lazy reference. + */ + if (!drop_count) { + /* + * Collect lots of references. We'll drop the ones we + * don't use. + */ + drop_count = num_possible_cpus(); + atomic_add(drop_count, &mm->mm_count); + } + drop_count--; + + while (!__try_mm_drop_rq_ref(rq, mm)) + ; + } + + atomic_sub(drop_count, &mm->mm_count); +} + /** * finish_task_switch - clean up after a task-switch * @prev: the thread we just switched away from. @@ -4773,7 +4912,6 @@ static struct rq *finish_task_switch(struct task_struct *prev) __releases(rq->lock) { struct rq *rq = this_rq(); - struct mm_struct *mm = rq->prev_mm; long prev_state; /* @@ -4792,8 +4930,6 @@ static struct rq *finish_task_switch(struct task_struct *prev) current->comm, current->pid, preempt_count())) preempt_count_set(FORK_PREEMPT_COUNT); - rq->prev_mm = NULL; - /* * A task struct has one reference for the use as "current". * If a task dies, then it sets TASK_DEAD in tsk->state and calls @@ -4824,12 +4960,7 @@ static struct rq *finish_task_switch(struct task_struct *prev) fire_sched_in_preempt_notifiers(current); - /* - * If an architecture needs to take a specific action for - * SYNC_CORE, it can do so in switch_mm_irqs_off(). - */ - if (mm) - mmdrop(mm); + mmdrop_lazy(rq); if (unlikely(prev_state == TASK_DEAD)) { if (prev->sched_class->task_dead) @@ -4891,36 +5022,55 @@ context_switch(struct rq *rq, struct task_struct *prev, */ arch_start_context_switch(prev); + /* + * Sanity check: if something went wrong and the previous mm was + * freed while we were still using it, KASAN might not notice + * without help. + */ + kasan_check_byte(prev->active_mm); + /* * kernel -> kernel lazy + transfer active - * user -> kernel lazy + mmgrab() active + * user -> kernel lazy + lazy_mm grab active * - * kernel -> user switch + mmdrop() active + * kernel -> user switch + lazy_mm release active * user -> user switch */ if (!next->mm) { // to kernel enter_lazy_tlb(prev->active_mm, next); next->active_mm = prev->active_mm; - if (prev->mm) // from user - mmgrab(prev->active_mm); - else + if (prev->mm) { // from user + SCHED_WARN_ON(rq->lazy_mm); + + /* + * Acquire a lazy_mm reference to the active + * (lazy) mm. No explicit barrier needed: we still + * hold an explicit (mm_users) reference. __mmput() + * can't be called until we call mmput() to drop + * our reference, and __mmput() is a release barrier. + */ + WRITE_ONCE(rq->lazy_mm, next->active_mm); + } else { prev->active_mm = NULL; + } } else { // to user membarrier_switch_mm(rq, prev->active_mm, next->mm); switch_mm_irqs_off(prev->active_mm, next->mm, next); /* - * sys_membarrier() requires an smp_mb() between setting - * rq->curr->mm to a membarrier-enabled mm and returning - * to userspace.
+ * An arch implementation of for_each_possible_lazymm_cpu() + * may skip this CPU now that we have switched away from + * prev->active_mm, so we must not reference it again. */ + membarrier_finish_switch_mm(next->mm); if (!prev->mm) { // from kernel - /* will mmdrop() in finish_task_switch(). */ - rq->prev_mm = prev->active_mm; prev->active_mm = NULL; + + /* Drop our lazy_mm reference to the old lazy mm. */ + smp_store_release(&rq->lazy_mm, NULL); } } @@ -4938,7 +5088,8 @@ context_switch(struct rq *rq, struct task_struct *prev, void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new) { struct task_struct *tsk = current; - struct mm_struct *old_active_mm, *mm_to_drop = NULL; + struct mm_struct *old_active_mm; + bool was_kernel; BUG_ON(!mm); /* likely to cause corruption if we continue */ @@ -4958,12 +5109,9 @@ void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new) if (tsk->mm) { /* We're detaching from an old mm. Sync stats. */ sync_mm_rss(tsk->mm); + was_kernel = false; } else { - /* - * Switching from kernel mm to user. Drop the old lazy - * mm reference. - */ - mm_to_drop = tsk->active_mm; + was_kernel = true; } old_active_mm = tsk->active_mm; @@ -4992,6 +5140,10 @@ void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new) membarrier_finish_switch_mm(mm); vmacache_flush(tsk); + + if (was_kernel) + smp_store_release(&this_rq()->lazy_mm, NULL); + task_unlock(tsk); #ifdef finish_arch_post_lock_switch @@ -5009,9 +5161,6 @@ void __change_current_mm(struct mm_struct *mm, bool mm_is_brand_new) finish_arch_post_lock_switch(); } #endif - - if (mm_to_drop) - mmdrop(mm_to_drop); } void __change_current_mm_to_kernel(void) @@ -5044,8 +5193,17 @@ void __change_current_mm_to_kernel(void) membarrier_update_current_mm(NULL); vmacache_flush(tsk); - /* active_mm is still 'old_mm' */ - mmgrab(old_mm); + /* + * active_mm is still 'old_mm' + * + * Acquire a lazy_mm reference to the active (lazy) mm. As in + * context_switch(), no explicit barrier needed: we still hold an + * explicit (mm_users) reference. __mmput() can't be called until we + * call mmput() to drop our reference, and __mmput() is a release + * barrier. + */ + WRITE_ONCE(this_rq()->lazy_mm, old_mm); + enter_lazy_tlb(old_mm, tsk); local_irq_enable(); @@ -8805,6 +8963,7 @@ void __init init_idle(struct task_struct *idle, int cpu) void unlazy_mm_irqs_off(void) { struct mm_struct *mm = current->active_mm; + struct rq *rq = this_rq(); lockdep_assert_irqs_disabled(); @@ -8815,10 +8974,17 @@ void unlazy_mm_irqs_off(void) return; switch_mm_irqs_off(mm, &init_mm, current); - mmgrab(&init_mm); current->active_mm = &init_mm; + + /* + * We don't need a lazy reference to init_mm -- it's not about + * to go away. + */ + smp_store_release(&rq->lazy_mm, NULL); + finish_arch_post_lock_switch(); - mmdrop(mm); + + mmdrop_lazy(rq); } #ifdef CONFIG_SMP diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b496a9ee9aec..1010e63962d9 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -977,7 +977,15 @@ struct rq { struct task_struct *idle; struct task_struct *stop; unsigned long next_balance; - struct mm_struct *prev_mm; + + /* + * Fast refcounting scheme for lazy mm. lazy_mm is a hazard pointer: + * setting it to point to a lazily used mm keeps that mm from being + * freed. drop_mm points to an mm that needs an mmdrop() call + * after the CPU owning the rq is done with it.
+ */ + struct mm_struct *lazy_mm; + struct mm_struct *drop_mm; unsigned int clock_update_flags; u64 clock; From patchwork Sat Jan 8 16:44:02 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707546 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 87EBFC433EF for ; Sat, 8 Jan 2022 16:44:49 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 8E9366B008C; Sat, 8 Jan 2022 11:44:38 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 846646B0092; Sat, 8 Jan 2022 11:44:38 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 6725B6B0093; Sat, 8 Jan 2022 11:44:38 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0021.hostedemail.com [216.40.44.21]) by kanga.kvack.org (Postfix) with ESMTP id 4B4566B008C for ; Sat, 8 Jan 2022 11:44:38 -0500 (EST) Received: from smtpin06.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 00587181C990F for ; Sat, 8 Jan 2022 16:44:37 +0000 (UTC) X-FDA: 79007693394.06.2557C43 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf11.hostedemail.com (Postfix) with ESMTP id 8446640007 for ; Sat, 8 Jan 2022 16:44:37 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id B799260DFB; Sat, 8 Jan 2022 16:44:36 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 057E2C36AF5; Sat, 8 Jan 2022 16:44:35 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660276; bh=4O5nnTjv81SB19Td4mAJDL0i8hlOmRdkCRWMKeHBKAk=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=gnQbS/8LDy7d4rBBPC1BeLgidoaa4RtK6uxKnx4yRs+qg7nT673dGWG923bTUnbgu CKLbe0NsOO4c7xL1BQs2xd7jQmQaBsQVaH1Rn0gkGZZksdInj7FKmM2GnZA5IkKASn KZWlFjYI+xzq9mzvwbAvpRy4OBAv+7xXSB988sz6TRqW25dcR0O5L6wCw15mQM8EBn KRPduVbOwWJ9KilaMRZ3jeMkvsdFquIkcOvIupgtz2QzMYFleWwpQ0ImjkV6IIrJB2 44XNQ8quVWQdliW7Y2Ld9o9sD2bALVt1bDziiFN7x8gsf9r0KdWrTqesM/M67AnVOm rIomDSz8xJKZg== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 17/23] x86/mm: Make use/unuse_temporary_mm() non-static Date: Sat, 8 Jan 2022 08:44:02 -0800 Message-Id: X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: 8446640007 X-Stat-Signature: c7y9fissxerd9ybqukzjdszz88ttmytu Authentication-Results: imf11.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b="gnQbS/8L"; spf=pass (imf11.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-HE-Tag: 1641660277-323148 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: 
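As an illustration of the per-CPU hazard pointer scheme described in the changelog of the lazy-mm patch above, here is a minimal, self-contained user-space model; it is only a sketch. The names struct obj, lazy_slot, NSLOTS and unlazy_refcount() are invented for the example, C11 atomics stand in for the kernel's atomic_t and READ_ONCE()/WRITE_ONCE(), and the rq->drop_mm hand-off is omitted. It shows only the core idea: a lazy user publishes a pointer instead of taking a refcount, and the teardown path converts any still-published pointers into real references with a single batched atomic add.

#include <stdatomic.h>
#include <stdio.h>

struct obj {
	atomic_int refcount;			/* plays the role of mm_count */
};

#define NSLOTS 4				/* stand-in for the number of CPUs */

/* One hazard pointer per "CPU": a published lazy reference. */
static _Atomic(struct obj *) lazy_slot[NSLOTS];

/* Start using o lazily: publish a pointer, take no refcount. */
static void lazy_use(int slot, struct obj *o)
{
	atomic_store_explicit(&lazy_slot[slot], o, memory_order_release);
}

/* Stop using o lazily. */
static void lazy_unuse(int slot)
{
	atomic_store_explicit(&lazy_slot[slot], NULL, memory_order_release);
}

/*
 * Before o can be freed, convert every remaining published (hazard)
 * reference into a real refcount, mirroring mm_unlazy_mm_count():
 * grab a batch of references up front, hand one to each slot that is
 * still publishing o, and return the unused ones at the end.
 */
static void unlazy_refcount(struct obj *o)
{
	int spare = 0;

	for (int slot = 0; slot < NSLOTS; slot++) {
		if (atomic_load_explicit(&lazy_slot[slot],
					 memory_order_acquire) != o)
			continue;
		if (!spare) {
			spare = NSLOTS;
			atomic_fetch_add(&o->refcount, spare);
		}
		spare--;	/* this slot now owns one real reference */
	}
	atomic_fetch_sub(&o->refcount, spare);
}

int main(void)
{
	struct obj o = { .refcount = 1 };

	lazy_use(2, &o);
	unlazy_refcount(&o);
	printf("refcount = %d\n", atomic_load(&o.refcount));	/* prints 2 */
	lazy_unuse(2);
	return 0;
}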
This prepares them for use outside of the alternative machinery. The code is unchanged. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu_context.h | 7 ++++ arch/x86/kernel/alternative.c | 65 +----------------------------- arch/x86/mm/tlb.c | 60 +++++++++++++++++++++++++++ 3 files changed, 68 insertions(+), 64 deletions(-) diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 27516046117a..2ca4fc4a8a0a 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -220,4 +220,11 @@ unsigned long __get_current_cr3_fast(void); #include +typedef struct { + struct mm_struct *mm; +} temp_mm_state_t; + +extern temp_mm_state_t use_temporary_mm(struct mm_struct *mm); +extern void unuse_temporary_mm(temp_mm_state_t prev_state); + #endif /* _ASM_X86_MMU_CONTEXT_H */ diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c index b47cd22b2eb1..af4c37e177ff 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -29,6 +29,7 @@ #include #include #include +#include int __read_mostly alternatives_patched; @@ -706,70 +707,6 @@ void __init_or_module text_poke_early(void *addr, const void *opcode, } } -typedef struct { - struct mm_struct *mm; -} temp_mm_state_t; - -/* - * Using a temporary mm allows to set temporary mappings that are not accessible - * by other CPUs. Such mappings are needed to perform sensitive memory writes - * that override the kernel memory protections (e.g., W^X), without exposing the - * temporary page-table mappings that are required for these write operations to - * other CPUs. Using a temporary mm also allows to avoid TLB shootdowns when the - * mapping is torn down. - * - * Context: The temporary mm needs to be used exclusively by a single core. To - * harden security IRQs must be disabled while the temporary mm is - * loaded, thereby preventing interrupt handler bugs from overriding - * the kernel memory protection. - */ -static inline temp_mm_state_t use_temporary_mm(struct mm_struct *mm) -{ - temp_mm_state_t temp_state; - - lockdep_assert_irqs_disabled(); - - /* - * Make sure not to be in TLB lazy mode, as otherwise we'll end up - * with a stale address space WITHOUT being in lazy mode after - * restoring the previous mm. - */ - if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) - leave_mm(smp_processor_id()); - - temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm); - switch_mm_irqs_off(NULL, mm, current); - - /* - * If breakpoints are enabled, disable them while the temporary mm is - * used. Userspace might set up watchpoints on addresses that are used - * in the temporary mm, which would lead to wrong signals being sent or - * crashes. - * - * Note that breakpoints are not disabled selectively, which also causes - * kernel breakpoints (e.g., perf's) to be disabled. This might be - * undesirable, but still seems reasonable as the code that runs in the - * temporary mm should be short. - */ - if (hw_breakpoint_active()) - hw_breakpoint_disable(); - - return temp_state; -} - -static inline void unuse_temporary_mm(temp_mm_state_t prev_state) -{ - lockdep_assert_irqs_disabled(); - switch_mm_irqs_off(NULL, prev_state.mm, current); - - /* - * Restore the breakpoints if they were disabled before the temporary mm - * was loaded. 
- */ - if (hw_breakpoint_active()) - hw_breakpoint_restore(); -} - __ro_after_init struct mm_struct *poking_mm; __ro_after_init unsigned long poking_addr; diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 74b7a615bc15..4e371f30e2ab 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -702,6 +702,66 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) this_cpu_write(cpu_tlbstate_shared.is_lazy, true); } +/* + * Using a temporary mm allows to set temporary mappings that are not accessible + * by other CPUs. Such mappings are needed to perform sensitive memory writes + * that override the kernel memory protections (e.g., W^X), without exposing the + * temporary page-table mappings that are required for these write operations to + * other CPUs. Using a temporary mm also allows to avoid TLB shootdowns when the + * mapping is torn down. + * + * Context: The temporary mm needs to be used exclusively by a single core. To + * harden security IRQs must be disabled while the temporary mm is + * loaded, thereby preventing interrupt handler bugs from overriding + * the kernel memory protection. + */ +temp_mm_state_t use_temporary_mm(struct mm_struct *mm) +{ + temp_mm_state_t temp_state; + + lockdep_assert_irqs_disabled(); + + /* + * Make sure not to be in TLB lazy mode, as otherwise we'll end up + * with a stale address space WITHOUT being in lazy mode after + * restoring the previous mm. + */ + if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) + leave_mm(smp_processor_id()); + + temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm); + switch_mm_irqs_off(NULL, mm, current); + + /* + * If breakpoints are enabled, disable them while the temporary mm is + * used. Userspace might set up watchpoints on addresses that are used + * in the temporary mm, which would lead to wrong signals being sent or + * crashes. + * + * Note that breakpoints are not disabled selectively, which also causes + * kernel breakpoints (e.g., perf's) to be disabled. This might be + * undesirable, but still seems reasonable as the code that runs in the + * temporary mm should be short. + */ + if (hw_breakpoint_active()) + hw_breakpoint_disable(); + + return temp_state; +} + +void unuse_temporary_mm(temp_mm_state_t prev_state) +{ + lockdep_assert_irqs_disabled(); + switch_mm_irqs_off(NULL, prev_state.mm, current); + + /* + * Restore the breakpoints if they were disabled before the temporary mm + * was loaded. + */ + if (hw_breakpoint_active()) + hw_breakpoint_restore(); +} + /* * Call this when reinitializing a CPU. 
It fixes the following potential * problems: From patchwork Sat Jan 8 16:44:03 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707547 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 67BCEC433FE for ; Sat, 8 Jan 2022 16:44:51 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 16C846B0092; Sat, 8 Jan 2022 11:44:39 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 0A5A16B0093; Sat, 8 Jan 2022 11:44:38 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DEA3B6B0095; Sat, 8 Jan 2022 11:44:38 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0101.hostedemail.com [216.40.44.101]) by kanga.kvack.org (Postfix) with ESMTP id CF9436B0092 for ; Sat, 8 Jan 2022 11:44:38 -0500 (EST) Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 9A08A81050AA for ; Sat, 8 Jan 2022 16:44:38 +0000 (UTC) X-FDA: 79007693436.08.9B3FBCE Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf07.hostedemail.com (Postfix) with ESMTP id 335004000D for ; Sat, 8 Jan 2022 16:44:37 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 6071260DE1; Sat, 8 Jan 2022 16:44:37 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 0E5CDC36AE0; Sat, 8 Jan 2022 16:44:37 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660277; bh=4/phD45ccBwKpoEpuFHVuaC4/6+F9ASvK+hhO/SU8Fo=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=fXxmmXTBkFoFc6QR/NgokKjHl/kD2AA7EmiuQeyFtV6de+D5/klMZnZR18SFCwNat nlYbn+f/Tj+XDTl9Kc/6FfiMbDc0dBF52mcyPFq0Y+WWUDSHsabaUQAlKtbjXoc7yK xsNL10IiLydVRUmGKmf7kx6c7EOS7Sgrenl5SStfYcVnvAxxBUcI91U0xxjRgFJitT DsBmXeKzpJVHu/fSNV+EV4FM2LYmnjvFRJDhm+ly7uXb5284aUJdaXoalZOMVpE8ER jSwt8bkjFo1QGts8nHMiMNpb3hb4t3eOkaCRIAFVqVEColWw53avbh4E7vzularl3k hGh9mVafpDgIA== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 18/23] x86/mm: Allow temporary mms when IRQs are on Date: Sat, 8 Jan 2022 08:44:03 -0800 Message-Id: X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 335004000D X-Stat-Signature: fm3kxymzs4oh1tafdn4wajx1me6uao78 Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=fXxmmXTB; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf07.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org X-HE-Tag: 1641660277-159901 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: EFI runtime services should use temporary mms, but EFI runtime 
services want IRQs on. Preemption must still be disabled in a temporary mm context. At some point, the entire temporary mm mechanism should be moved out of arch code. Signed-off-by: Andy Lutomirski --- arch/x86/mm/tlb.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 4e371f30e2ab..36ce9dffb963 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -708,18 +708,23 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) * that override the kernel memory protections (e.g., W^X), without exposing the * temporary page-table mappings that are required for these write operations to * other CPUs. Using a temporary mm also allows to avoid TLB shootdowns when the - * mapping is torn down. + * mapping is torn down. Temporary mms can also be used for EFI runtime service + * calls or similar functionality. * - * Context: The temporary mm needs to be used exclusively by a single core. To - * harden security IRQs must be disabled while the temporary mm is - * loaded, thereby preventing interrupt handler bugs from overriding - * the kernel memory protection. + * It is illegal to schedule while using a temporary mm -- the context switch + * code is unaware of the temporary mm and does not know how to context switch. + * Use a real (non-temporary) mm in a kernel thread if you need to sleep. + * + * Note: For sensitive memory writes, the temporary mm needs to be used + * exclusively by a single core, and IRQs should be disabled while the + * temporary mm is loaded, thereby preventing interrupt handler bugs from + * overriding the kernel memory protection. */ temp_mm_state_t use_temporary_mm(struct mm_struct *mm) { temp_mm_state_t temp_state; - lockdep_assert_irqs_disabled(); + lockdep_assert_preemption_disabled(); /* * Make sure not to be in TLB lazy mode, as otherwise we'll end up @@ -751,7 +756,7 @@ temp_mm_state_t use_temporary_mm(struct mm_struct *mm) void unuse_temporary_mm(temp_mm_state_t prev_state) { - lockdep_assert_irqs_disabled(); + lockdep_assert_preemption_disabled(); switch_mm_irqs_off(NULL, prev_state.mm, current); /* From patchwork Sat Jan 8 16:44:04 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707548 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E116FC4332F for ; Sat, 8 Jan 2022 16:44:52 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id BCC636B0096; Sat, 8 Jan 2022 11:44:41 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id B53AC6B0099; Sat, 8 Jan 2022 11:44:41 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9A7276B0098; Sat, 8 Jan 2022 11:44:41 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0129.hostedemail.com [216.40.44.129]) by kanga.kvack.org (Postfix) with ESMTP id 854FB6B0095 for ; Sat, 8 Jan 2022 11:44:41 -0500 (EST) Received: from smtpin04.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 496BB972E4 for ; Sat, 8 Jan 2022 16:44:41 +0000 (UTC) X-FDA: 79007693562.04.45166D0 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by imf05.hostedemail.com (Postfix) with ESMTP id
861A2100003 for ; Sat, 8 Jan 2022 16:44:40 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 8F85DB80B48; Sat, 8 Jan 2022 16:44:39 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 2C9E7C36AE3; Sat, 8 Jan 2022 16:44:38 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660278; bh=9J/1QKkNyuJOe05xU7vGzx3mhn6KDpei2w55Hnz/oT0=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=UuXs+Y6u3LWdS8m9HaS3ErgC6rn96YsFIWADrWq0raX8asAen+3bktbjqpqKONCiR B7W82gDbT0uZ090lAVwVhIVT9464iiazwOXTfAvs0tiIvL4QYr5G7FgwEQwCO03kDa cmHRsIlwrqIH5GaNrp6J3sX6gjKiU3mvA21NKNPW6KELcsdtJb/C+1Zer4ZDFtuB33 H5gx1/gRGV+FQdnaVoWdjbMEjcD0/MN4sdcznzXTAVwK11055fWS0b2Yv3YUPbS2lO 5+u0OWrOhBnfT1axklI1TvLdlxrjXkZ2Mus4954aj0Pcm/yG7bm8bnSyHdhk8I9ovJ CuFDP2/cXQ7jg== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski , Ard Biesheuvel Subject: [PATCH 19/23] x86/efi: Make efi_enter/leave_mm use the temporary_mm machinery Date: Sat, 8 Jan 2022 08:44:04 -0800 Message-Id: <3efc4cfd1d7c45a32752ced389d6666be15cde56.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Server: rspam03 X-Rspamd-Queue-Id: 861A2100003 X-Stat-Signature: ryx6jcauycz7ipzr13w3jws1sxb966mh Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=UuXs+Y6u; dmarc=pass (policy=none) header.from=kernel.org; spf=pass (imf05.hostedemail.com: domain of luto@kernel.org designates 145.40.68.75 as permitted sender) smtp.mailfrom=luto@kernel.org X-HE-Tag: 1641660280-743891 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: This should be considerably more robust. It's also necessary for optimized for_each_possible_lazymm_cpu() on x86 -- without this patch, EFI calls in lazy context would remove the lazy mm from mm_cpumask(). Cc: Ard Biesheuvel Signed-off-by: Andy Lutomirski Acked-by: Ard Biesheuvel --- arch/x86/platform/efi/efi_64.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c index 7515e78ef898..b9a571904363 100644 --- a/arch/x86/platform/efi/efi_64.c +++ b/arch/x86/platform/efi/efi_64.c @@ -54,7 +54,7 @@ * 0xffff_ffff_0000_0000 and limit EFI VA mapping space to 64G. 
*/ static u64 efi_va = EFI_VA_START; -static struct mm_struct *efi_prev_mm; +static temp_mm_state_t efi_temp_mm_state; /* * We need our own copy of the higher levels of the page tables @@ -461,15 +461,12 @@ void __init efi_dump_pagetable(void) */ void efi_enter_mm(void) { - efi_prev_mm = current->active_mm; - current->active_mm = &efi_mm; - switch_mm(efi_prev_mm, &efi_mm, NULL); + efi_temp_mm_state = use_temporary_mm(&efi_mm); } void efi_leave_mm(void) { - current->active_mm = efi_prev_mm; - switch_mm(&efi_mm, efi_prev_mm, NULL); + unuse_temporary_mm(efi_temp_mm_state); } static DEFINE_SPINLOCK(efi_runtime_lock); From patchwork Sat Jan 8 16:44:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707549 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 31E03C433FE for ; Sat, 8 Jan 2022 16:44:54 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1E86B6B009B; Sat, 8 Jan 2022 11:44:42 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 120906B009A; Sat, 8 Jan 2022 11:44:41 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DEDE66B0099; Sat, 8 Jan 2022 11:44:41 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0230.hostedemail.com [216.40.44.230]) by kanga.kvack.org (Postfix) with ESMTP id A6EDE6B0095 for ; Sat, 8 Jan 2022 11:44:41 -0500 (EST) Received: from smtpin24.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 72E9A987B6 for ; Sat, 8 Jan 2022 16:44:41 +0000 (UTC) X-FDA: 79007693562.24.4841DAE Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf15.hostedemail.com (Postfix) with ESMTP id 0BC18A0014 for ; Sat, 8 Jan 2022 16:44:40 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 411FC60DF7; Sat, 8 Jan 2022 16:44:40 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 698EFC36AEF; Sat, 8 Jan 2022 16:44:39 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660279; bh=8QnnEAQzxd3ws7BHDkrTxXyGch3lQVCI4n1OHmf5M5U=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=rF8QFsbIMcGBr1U5y8/pXL+EfJa32pzTSdYzFgYmToqPlFUXatWy1M9uUcpbDrtrp rVQLt/jBPIa6MBvvg6PqoYgoAQyx0Rf+oZj46edL8T8Pr3Ym4Whr3qbwUqe4TJfhqj AVwUVtWniDWr3OBWg0m79jM/ltpdvvkquHwe3kfGCrZH3DTd9CMqRSSpD5xcjWvZ5X SA333cy9bQigcr1rl2yd9Awz22kKRUE8zAH2oCCEo2KfrvUHKP/xlisMXdAiG7aNPx /KFPTgWlrDy5wmerDdjlXpTHXbguCa+hqj6xGFTiwVoxZy/BS3SAfbW3oVQqFmDvmc FmIeHtX3TurPw== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 20/23] x86/mm: Remove leave_mm() in favor of unlazy_mm_irqs_off() Date: Sat, 8 Jan 2022 08:44:05 -0800 Message-Id: <5e80aa6deb3f0a7bdcba5e9f20c48df50b752fd3.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 
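For context, a minimal sketch of how a caller might use the temporary-mm helpers that the three patches above export from arch/x86/mm/tlb.c. The function with_temporary_mm() is invented for the example and kernel context is assumed; the point is the pattern the IRQs-on patch permits: interrupts may stay enabled, but preemption must be disabled and the code must not sleep between use_temporary_mm() and unuse_temporary_mm().

#include <linux/preempt.h>
#include <linux/mm_types.h>
#include <asm/mmu_context.h>

/* Hypothetical caller; "mm" is a dedicated mm such as efi_mm or poking_mm. */
static void with_temporary_mm(struct mm_struct *mm)
{
	temp_mm_state_t prev;

	preempt_disable();
	prev = use_temporary_mm(mm);

	/*
	 * Mappings that exist only in "mm" are now visible on this CPU,
	 * and on this CPU only.  Do the privileged access here; do not
	 * sleep or re-enable preemption.
	 */

	unuse_temporary_mm(prev);
	preempt_enable();
}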
X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 0BC18A0014 X-Stat-Signature: qsatq4ajuhg4icwc3ey6m9ay1ac5ou91 Authentication-Results: imf15.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=rF8QFsbI; spf=pass (imf15.hostedemail.com: domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-HE-Tag: 1641660280-574555 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: x86's mm_cpumask() precisely tracks every CPU using an mm, with one major caveat: x86 internally switches back to init_mm more aggressively than the core code. This means that it's possible for x86 to point CR3 to init_mm and drop current->active_mm from mm_cpumask(). The core scheduler doesn't know when this happens, which is currently fine. But if we want to use mm_cpumask() to optimize for_each_possible_lazymm_cpu(), we need to keep mm_cpumask() in sync with the core scheduler. This patch removes x86's bespoke leave_mm() and uses the core scheduler's unlazy_mm_irqs_off() so that a lazy mm can be dropped and ->active_mm cleaned up together. This allows for_each_possible_lazymm_cpu() to be wired up on x86. As a side effect, non-x86 architectures that use ACPI C3 will now leave lazy mm mode before entering C3. This can only possibly affect ia64, because only x86 and ia64 enable CONFIG_ACPI_PROCESSOR_CSTATE. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu.h | 2 -- arch/x86/mm/tlb.c | 29 +++-------------------------- arch/x86/xen/mmu_pv.c | 2 +- drivers/cpuidle/cpuidle.c | 2 +- drivers/idle/intel_idle.c | 4 ++-- include/linux/mmu_context.h | 4 +--- kernel/sched/sched.h | 2 -- 7 files changed, 8 insertions(+), 37 deletions(-) diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h index 5d7494631ea9..03ba71420ff3 100644 --- a/arch/x86/include/asm/mmu.h +++ b/arch/x86/include/asm/mmu.h @@ -63,7 +63,5 @@ typedef struct { .lock = __MUTEX_INITIALIZER(mm.context.lock), \ } -void leave_mm(int cpu); -#define leave_mm leave_mm #endif /* _ASM_X86_MMU_H */ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 36ce9dffb963..e502565176b9 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include @@ -294,28 +295,6 @@ static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush) write_cr3(new_mm_cr3); } -void leave_mm(int cpu) -{ - struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm); - - /* - * It's plausible that we're in lazy TLB mode while our mm is init_mm. - * If so, our callers still expect us to flush the TLB, but there - * aren't any user TLB entries in init_mm to worry about. - * - * This needs to happen before any other sanity checks due to - * intel_idle's shenanigans. - */ - if (loaded_mm == &init_mm) - return; - - /* Warn if we're not lazy. */ - WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy)); - - switch_mm(NULL, &init_mm, NULL); -} -EXPORT_SYMBOL_GPL(leave_mm); - void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk) { @@ -512,8 +491,6 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, * from lazy TLB mode to normal mode if active_mm isn't changing. * When this happens, we don't assume that CR3 (and hence * cpu_tlbstate.loaded_mm) matches next. - * - * NB: leave_mm() calls us with prev == NULL and tsk == NULL. 
*/ /* We don't want flush_tlb_func() to run concurrently with us. */ @@ -523,7 +500,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, /* * Verify that CR3 is what we think it is. This will catch * hypothetical buggy code that directly switches to swapper_pg_dir - * without going through leave_mm() / switch_mm_irqs_off() or that + * without going through switch_mm_irqs_off() or that * does something like write_cr3(read_cr3_pa()). * * Only do this check if CONFIG_DEBUG_VM=y because __read_cr3() @@ -732,7 +709,7 @@ temp_mm_state_t use_temporary_mm(struct mm_struct *mm) * restoring the previous mm. */ if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) - leave_mm(smp_processor_id()); + unlazy_mm_irqs_off(); temp_state.mm = this_cpu_read(cpu_tlbstate.loaded_mm); switch_mm_irqs_off(NULL, mm, current); diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 3359c23573c5..ba849185810a 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -898,7 +898,7 @@ static void drop_mm_ref_this_cpu(void *info) struct mm_struct *mm = info; if (this_cpu_read(cpu_tlbstate.loaded_mm) == mm) - leave_mm(smp_processor_id()); + unlazy_mm_irqs_off(); /* * If this cpu still has a stale cr3 reference, then make sure diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c index ef2ea1b12cd8..b865822a6278 100644 --- a/drivers/cpuidle/cpuidle.c +++ b/drivers/cpuidle/cpuidle.c @@ -223,7 +223,7 @@ int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, } if (target_state->flags & CPUIDLE_FLAG_TLB_FLUSHED) - leave_mm(dev->cpu); + unlazy_mm_irqs_off(); /* Take note of the planned idle state. */ sched_idle_set_state(target_state); diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index e6c543b5ee1d..bb5d3b3e28df 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -115,8 +115,8 @@ static unsigned int mwait_substates __initdata; * If the local APIC timer is not known to be reliable in the target idle state, * enable one-shot tick broadcasting for the target CPU before executing MWAIT. * - * Optionally call leave_mm() for the target CPU upfront to avoid wakeups due to - * flushing user TLBs. + * Optionally call unlazy_mm_irqs_off() for the target CPU upfront to avoid + * wakeups due to flushing user TLBs. * * Must be called under local_irq_disable(). */ diff --git a/include/linux/mmu_context.h b/include/linux/mmu_context.h index b9b970f7ab45..035e8e42eb78 100644 --- a/include/linux/mmu_context.h +++ b/include/linux/mmu_context.h @@ -10,9 +10,7 @@ # define switch_mm_irqs_off switch_mm #endif -#ifndef leave_mm -static inline void leave_mm(int cpu) { } -#endif +extern void unlazy_mm_irqs_off(void); /* * CPUs that are capable of running user task @p. 
Must contain at least one diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index 1010e63962d9..e57121bc84d5 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -3071,5 +3071,3 @@ extern int preempt_dynamic_mode; extern int sched_dynamic_mode(const char *str); extern void sched_dynamic_update(int mode); #endif - -extern void unlazy_mm_irqs_off(void); From patchwork Sat Jan 8 16:44:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707550 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A9DAAC433F5 for ; Sat, 8 Jan 2022 16:44:55 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5FD6B6B0095; Sat, 8 Jan 2022 11:44:42 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 586FA6B0099; Sat, 8 Jan 2022 11:44:42 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 1DF966B0095; Sat, 8 Jan 2022 11:44:42 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0093.hostedemail.com [216.40.44.93]) by kanga.kvack.org (Postfix) with ESMTP id 06F696B0098 for ; Sat, 8 Jan 2022 11:44:42 -0500 (EST) Received: from smtpin10.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id C3B0C81050AA for ; Sat, 8 Jan 2022 16:44:41 +0000 (UTC) X-FDA: 79007693562.10.1F0CC24 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by imf04.hostedemail.com (Postfix) with ESMTP id 7F5AA40011 for ; Sat, 8 Jan 2022 16:44:41 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id D9E2660DE1; Sat, 8 Jan 2022 16:44:40 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 8ADC3C36AE3; Sat, 8 Jan 2022 16:44:40 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660280; bh=x2bPhjVOBxVWcMrIjPF084KnyexnW3t58B9hfB9b+7k=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=If/YBksUFp4siLcDN32aGTZYhYp8jKQmFbx/sACXi1qXOPrHk1oe2qmd5xmmJpaXc 9iYEIUy4PgG5ABTeNST4TSNCFLtHSJHo1hTtD5dK75jnum1hKbCu2SZo4HbR1XT5AP qeV7HTWOS42HaFmKroGlhmqqLd3HSM8bSNynBCfSnE9yXm4LA6BsH7STIOehwMJSQw AfmRI6Iq40t0b7fHdld8UPyC7akRCFoBzSV11sSKYd5uTQaXoC8Juz9paQ3R8m1L3K YBFNZ/pl2PPzCgdazBePXcOqonqdPRkkZzCJHoGQq+mkzYpQ40aHoi38cDX90Vh8+9 mq+2O2hCuwyyA== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 21/23] x86/mm: Use unlazy_mm_irqs_off() in TLB flush IPIs Date: Sat, 8 Jan 2022 08:44:06 -0800 Message-Id: <3f3daaf3df26e963a38f4d7d05069e866cb6e3e7.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Server: rspam09 X-Rspamd-Queue-Id: 7F5AA40011 X-Stat-Signature: tkcpgf1x7qeusysi5biro6rk7ebwt63b Authentication-Results: imf04.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b="If/YBksU"; spf=pass (imf04.hostedemail.com: 
domain of luto@kernel.org designates 139.178.84.217 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-HE-Tag: 1641660281-279134 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: When IPI-flushing a lazy mm, we switch away from the lazy mm. Use unlazy_mm_irqs_off() so the scheduler knows we did this. Signed-off-by: Andy Lutomirski --- arch/x86/mm/tlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index e502565176b9..225b407812c7 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -843,7 +843,7 @@ static void flush_tlb_func(void *info) * This should be rare, with native_flush_tlb_multi() skipping * IPIs to lazy TLB mode CPUs. */ - switch_mm_irqs_off(NULL, &init_mm, NULL); + unlazy_mm_irqs_off(); return; } From patchwork Sat Jan 8 16:44:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707551 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0AC5DC433FE for ; Sat, 8 Jan 2022 16:44:57 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 0D0D36B0099; Sat, 8 Jan 2022 11:44:45 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id EAF8C6B009A; Sat, 8 Jan 2022 11:44:44 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D4EF36B009C; Sat, 8 Jan 2022 11:44:44 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0165.hostedemail.com [216.40.44.165]) by kanga.kvack.org (Postfix) with ESMTP id BA3BA6B0099 for ; Sat, 8 Jan 2022 11:44:44 -0500 (EST) Received: from smtpin13.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 879798CE33 for ; Sat, 8 Jan 2022 16:44:44 +0000 (UTC) X-FDA: 79007693688.13.EA33C15 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by imf01.hostedemail.com (Postfix) with ESMTP id 1F24B40007 for ; Sat, 8 Jan 2022 16:44:43 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 25380B80B44; Sat, 8 Jan 2022 16:44:43 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id C3E3FC36AED; Sat, 8 Jan 2022 16:44:41 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660282; bh=vHQAtfJ1+OgjeriZePikcaC9D/iAK/oYRR/naawsr2g=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=aACd1MXlpkQJQQcnnNNlV12g06l4RWnDcLP8nKrsMAbq33gFkGpGx/qLyCU2Y+301 4mLpwCwOq8IV9o3P+SqaG8R306edy1pe9BXs8dlAvDoLvXh81DM9sIyHyFp6OZMLVO 3M2F8QA3Ez+S15ZocfYALEDWw8vlKNsQesnBHqGK+6yfSuhT/CWtgWWMMF6mZECaaU TrzQBxV1v0XCTMhj5bAZhXSohc12CCoo7ntkh0gYVqZTBFtMqEdyo600w4JT1sP1el aje6L7XWI/jygOh1JR92JrtYEqdH1V2FxPXaLMj5jM0kPBn3c87iH6UY4R33oe/a9H CFkWflxotEV6g== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , 
Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 22/23] x86/mm: Optimize for_each_possible_lazymm_cpu() Date: Sat, 8 Jan 2022 08:44:07 -0800 Message-Id: <13849aa0218e0f32ac16b82950c682395a8fb5c7.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 1F24B40007 X-Stat-Signature: cco44js8yt63yaib74pfawjr1pmtb3bu Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=aACd1MXl; spf=pass (imf01.hostedemail.com: domain of luto@kernel.org designates 145.40.68.75 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-HE-Tag: 1641660283-138436 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Now that x86 does not switch away from a lazy mm behind the scheduler's back and thus clear a CPU from mm_cpumask() that the scheduler thinks is lazy, x86 can use mm_cpumask() to optimize for_each_possible_lazymm_cpu(). Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu.h | 4 ++++ arch/x86/mm/tlb.c | 4 +++- 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h index 03ba71420ff3..da55f768e68c 100644 --- a/arch/x86/include/asm/mmu.h +++ b/arch/x86/include/asm/mmu.h @@ -63,5 +63,9 @@ typedef struct { .lock = __MUTEX_INITIALIZER(mm.context.lock), \ } +/* On x86, mm_cpumask(mm) contains all CPUs that might be lazily using mm */ +#define for_each_possible_lazymm_cpu(cpu, mm) \ + for_each_cpu((cpu), mm_cpumask((mm))) + #endif /* _ASM_X86_MMU_H */ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 225b407812c7..04eb43e96e23 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -706,7 +706,9 @@ temp_mm_state_t use_temporary_mm(struct mm_struct *mm) /* * Make sure not to be in TLB lazy mode, as otherwise we'll end up * with a stale address space WITHOUT being in lazy mode after - * restoring the previous mm. + * restoring the previous mm. Additionally, once we switch mms, + * for_each_possible_lazymm_cpu() will no longer report this CPU, + * so a lazymm pin wouldn't work. 
*/ if (this_cpu_read(cpu_tlbstate_shared.is_lazy)) unlazy_mm_irqs_off(); From patchwork Sat Jan 8 16:44:08 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 12707552 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 70FCCC433F5 for ; Sat, 8 Jan 2022 16:44:58 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 191706B009A; Sat, 8 Jan 2022 11:44:46 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 11A436B009D; Sat, 8 Jan 2022 11:44:46 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id ED4D66B009E; Sat, 8 Jan 2022 11:44:45 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0065.hostedemail.com [216.40.44.65]) by kanga.kvack.org (Postfix) with ESMTP id DD8CE6B009A for ; Sat, 8 Jan 2022 11:44:45 -0500 (EST) Received: from smtpin29.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id A6E6881050AA for ; Sat, 8 Jan 2022 16:44:45 +0000 (UTC) X-FDA: 79007693730.29.535C6DD Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by imf25.hostedemail.com (Postfix) with ESMTP id 4E374A0009 for ; Sat, 8 Jan 2022 16:44:45 +0000 (UTC) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 4E71FB80B3D; Sat, 8 Jan 2022 16:44:44 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id E1D57C36AEF; Sat, 8 Jan 2022 16:44:42 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1641660283; bh=rnfvKmSwxDEuaHgQLviIQQxu942uDTnNhvFW9UzBS8M=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=PGG3iDU+MfqwGuvm1AXjkHOjyllQSLOpyeMjNdbpiDjnW2l8GwkOIGmkf6T29512x XAw0F9yMYlhgkh+Xp0WhfbduiIxBlZoScpEG8jLUthHHszmoB/BRMVQOKTfu/XAlJ/ jgI0Rxu3sv+cXUUHg3jPUd3OFykmRsXTXuWMCbQI7h3DiqN892JePsXipLMjMLRPyC RKFjA6Qj5XGfOlbro7ioOg+TOsHoVsXM4YNWWGHPtlA6RhSybgbMowqXkI1RAGx5k/ QJk/OW3LSpV+6FsXsqvf4KcwRuaxy6rA7DW4pDIn60kg2YEOj+I9VEQ+bEOF+5ooOT u7+Rd30vzj9rQ== From: Andy Lutomirski To: Andrew Morton , Linux-MM Cc: Nicholas Piggin , Anton Blanchard , Benjamin Herrenschmidt , Paul Mackerras , Randy Dunlap , linux-arch , x86@kernel.org, Rik van Riel , Dave Hansen , Peter Zijlstra , Nadav Amit , Mathieu Desnoyers , Andy Lutomirski Subject: [PATCH 23/23] x86/mm: Opt in to IRQs-off activate_mm() Date: Sat, 8 Jan 2022 08:44:08 -0800 Message-Id: <69c7d711f240cfec23e6024e940d31af2990db36.1641659630.git.luto@kernel.org> X-Mailer: git-send-email 2.33.1 In-Reply-To: References: MIME-Version: 1.0 X-Stat-Signature: yp4i6ufqebqoyuhokmwut7dqgjkfqrid X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 4E374A0009 Authentication-Results: imf25.hostedemail.com; dkim=pass header.d=kernel.org header.s=k20201202 header.b=PGG3iDU+; spf=pass (imf25.hostedemail.com: domain of luto@kernel.org designates 145.40.68.75 as permitted sender) smtp.mailfrom=luto@kernel.org; dmarc=pass (policy=none) header.from=kernel.org X-HE-Tag: 1641660285-291652 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: 
owner-majordomo@kvack.org List-ID: We gain nothing by having the core code enable IRQs right before calling activate_mm() only for us to turn them right back off again in switch_mm(). This will save a few cycles, so execve() should be blazingly fast with this patch applied! Signed-off-by: Andy Lutomirski --- arch/x86/Kconfig | 1 + arch/x86/include/asm/mmu_context.h | 8 ++++---- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 5060c38bf560..908a596619f2 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -119,6 +119,7 @@ config X86 select ARCH_WANT_LD_ORPHAN_WARN select ARCH_WANTS_THP_SWAP if X86_64 select ARCH_HAS_PARANOID_L1D_FLUSH + select ARCH_WANT_IRQS_OFF_ACTIVATE_MM select BUILDTIME_TABLE_SORT select CLKEVT_I8253 select CLOCKSOURCE_VALIDATE_LAST_CYCLE diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 2ca4fc4a8a0a..f028f1b68bc0 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -132,10 +132,10 @@ extern void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk); #define switch_mm_irqs_off switch_mm_irqs_off -#define activate_mm(prev, next) \ -do { \ - paravirt_activate_mm((prev), (next)); \ - switch_mm((prev), (next), NULL); \ +#define activate_mm(prev, next) \ +do { \ + paravirt_activate_mm((prev), (next)); \ + switch_mm_irqs_off((prev), (next), NULL); \ } while (0); #ifdef CONFIG_X86_32