From patchwork Thu Jul 20 16:30:54 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Valentin Schneider
X-Patchwork-Id: 13320802
From: Valentin Schneider
To: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, kvm@vger.kernel.org, linux-mm@kvack.org,
 bpf@vger.kernel.org, x86@kernel.org, rcu@vger.kernel.org,
 linux-kselftest@vger.kernel.org
Cc: Peter Zijlstra, Nicolas Saenz Julienne, Steven Rostedt,
 Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
 Borislav Petkov, Dave Hansen, "H. Peter Anvin", Paolo Bonzini,
 Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski, Frederic Weisbecker,
 "Paul E. McKenney", Neeraj Upadhyay, Joel Fernandes, Josh Triplett,
 Boqun Feng, Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
 Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf,
 Jason Baron, Kees Cook, Sami Tolvanen, Ard Biesheuvel, Nicholas Piggin,
 Juerg Haefliger, Nicolas Saenz Julienne, "Kirill A. Shutemov",
 Nadav Amit, Dan Carpenter, Chuang Wang, Yang Jihong, Petr Mladek,
 "Jason A. Donenfeld", Song Liu, Julian Pidancet, Tom Lendacky,
 Dionna Glaze, Thomas Weißschuh, Juri Lelli, Daniel Bristot de Oliveira,
 Marcelo Tosatti, Yair Podemsky
Subject: [RFC PATCH v2 18/20] context_tracking,x86: Defer kernel text patching IPIs
Date: Thu, 20 Jul 2023 17:30:54 +0100
Message-Id: <20230720163056.2564824-19-vschneid@redhat.com>
In-Reply-To: <20230720163056.2564824-1-vschneid@redhat.com>
References: <20230720163056.2564824-1-vschneid@redhat.com>

text_poke_bp_batch() sends IPIs to all online CPUs to synchronize
them vs the newly patched instruction. CPUs that are executing in userspace
do not need this synchronization to happen immediately, and this is
actually harmful interference for NOHZ_FULL CPUs.

As the synchronization IPIs are sent using a blocking call, returning from
text_poke_bp_batch() implies all CPUs will observe the patched
instruction(s), and this should be preserved even if the IPI is deferred.
In other words, to safely defer this synchronization, any kernel
instruction leading to the execution of the deferred instruction
sync (ct_work_flush()) must *not* be mutable (patchable) at runtime.

This means we must pay attention to mutable instructions in the early entry
code:
- alternatives
- static keys
- all sorts of probes (kprobes/ftrace/bpf/???)

The early entry code leading to ct_work_flush() is noinstr, which gets rid
of the probes.

Alternatives are safe, because it's boot-time patching (before SMP is
even brought up) which is before any IPI deferral can happen.

This leaves us with static keys. Any static key used in early entry code
should be only forever-enabled at boot time, IOW __ro_after_init (pretty
much like alternatives). Objtool is now able to point at static keys that
don't respect this, and all static keys used in early entry code have now
been verified as behaving like so.

Leverage the new context_tracking infrastructure to defer sync_core() IPIs
to a target CPU's next kernel entry.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Nicolas Saenz Julienne
Signed-off-by: Valentin Schneider
---
 arch/x86/include/asm/context_tracking_work.h |  6 +++--
 arch/x86/include/asm/text-patching.h         |  1 +
 arch/x86/kernel/alternative.c                | 24 ++++++++++++++++----
 arch/x86/kernel/kprobes/core.c               |  4 ++--
 arch/x86/kernel/kprobes/opt.c                |  4 ++--
 arch/x86/kernel/module.c                     |  2 +-
 include/linux/context_tracking_work.h        |  4 ++--
 7 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 5bc29e6b2ed38..2c66687ce00e2 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -2,11 +2,13 @@
 #ifndef _ASM_X86_CONTEXT_TRACKING_WORK_H
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
+#include
+
 static __always_inline void arch_context_tracking_work(int work)
 {
 	switch (work) {
-	case CONTEXT_WORK_n:
-		// Do work...
+	case CONTEXT_WORK_SYNC:
+		sync_core();
 		break;
 	}
 }
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index 29832c338cdc5..b6939e965e69d 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -43,6 +43,7 @@ extern void text_poke_early(void *addr, const void *opcode, size_t len);
  */
 extern void *text_poke(void *addr, const void *opcode, size_t len);
 extern void text_poke_sync(void);
+extern void text_poke_sync_deferrable(void);
 extern void *text_poke_kgdb(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy(void *addr, const void *opcode, size_t len);
 extern void *text_poke_copy_locked(void *addr, const void *opcode, size_t len, bool core_ok);
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 72646d75b6ffe..fcce480e1919e 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1933,9 +1934,24 @@ static void do_sync_core(void *info)
 	sync_core();
 }
 
+static bool do_sync_core_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CONTEXT_WORK_SYNC);
+}
+
+static void __text_poke_sync(smp_cond_func_t cond_func)
+{
+	on_each_cpu_cond(cond_func, do_sync_core, NULL, 1);
+}
+
 void text_poke_sync(void)
 {
-	on_each_cpu(do_sync_core, NULL, 1);
+	__text_poke_sync(NULL);
+}
+
+void text_poke_sync_deferrable(void)
+{
+	__text_poke_sync(do_sync_core_defer_cond);
 }
 
 /*
@@ -2145,7 +2161,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 		text_poke(text_poke_addr(&tp[i]), &int3, INT3_INSN_SIZE);
 	}
 
-	text_poke_sync();
+	text_poke_sync_deferrable();
 
 	/*
 	 * Second step: update all but the first byte of the patched range.
@@ -2207,7 +2223,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 		 * not necessary and we'd be safe even without it. But
 		 * better safe than sorry (plus there's not only Intel).
 		 */
-		text_poke_sync();
+		text_poke_sync_deferrable();
 	}
 
 	/*
@@ -2228,7 +2244,7 @@ static void text_poke_bp_batch(struct text_poke_loc *tp, unsigned int nr_entries
 	}
 
 	if (do_sync)
-		text_poke_sync();
+		text_poke_sync_deferrable();
 
 	/*
 	 * Remove and wait for refs to be zero.
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index f7f6042eb7e6c..a38c914753397 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -735,7 +735,7 @@ void arch_arm_kprobe(struct kprobe *p)
 	u8 int3 = INT3_INSN_OPCODE;
 
 	text_poke(p->addr, &int3, 1);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 	perf_event_text_poke(p->addr, &p->opcode, 1, &int3, 1);
 }
 
@@ -745,7 +745,7 @@ void arch_disarm_kprobe(struct kprobe *p)
 
 	perf_event_text_poke(p->addr, &int3, 1, &p->opcode, 1);
 	text_poke(p->addr, &p->opcode, 1);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 }
 
 void arch_remove_kprobe(struct kprobe *p)
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index 57b0037d0a996..88451a744ceda 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -521,11 +521,11 @@ void arch_unoptimize_kprobe(struct optimized_kprobe *op)
 	       JMP32_INSN_SIZE - INT3_INSN_SIZE);
 
 	text_poke(addr, new, INT3_INSN_SIZE);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 	text_poke(addr + INT3_INSN_SIZE,
 		  new + INT3_INSN_SIZE,
 		  JMP32_INSN_SIZE - INT3_INSN_SIZE);
-	text_poke_sync();
+	text_poke_sync_deferrable();
 
 	perf_event_text_poke(op->kp.addr, old, JMP32_INSN_SIZE, new, JMP32_INSN_SIZE);
 }
diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index b05f62ee2344b..8b4542dc51b6d 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -242,7 +242,7 @@ static int write_relocate_add(Elf64_Shdr *sechdrs,
 				   write, apply);
 
 	if (!early) {
-		text_poke_sync();
+		text_poke_sync_deferrable();
 		mutex_unlock(&text_mutex);
 	}
 
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index fb74db8876dd2..13fc97b395030 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,12 @@
 #include
 
 enum {
-	CONTEXT_WORK_n_OFFSET,
+	CONTEXT_WORK_SYNC_OFFSET,
 	CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
-	CONTEXT_WORK_n = BIT(CONTEXT_WORK_n_OFFSET),
+	CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
 	CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
 };
 
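
For readers who want to reason about the deferral condition outside the kernel, the following is a minimal user-space sketch of the idea behind do_sync_core_defer_cond() and text_poke_sync_deferrable(). It is not the kernel implementation: every model_* function, the IN_USER bit, the WORK_SYNC value and the per-CPU state layout are assumptions made up for illustration, and the "IPI" is simply a counter increment. It only shows the intended contract: a CPU in userspace gets the sync queued and runs it on its next kernel entry, while a CPU already in the kernel is synchronized immediately.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS   4
#define WORK_SYNC 0x1u    /* stands in for CONTEXT_WORK_SYNC */
#define IN_USER   0x100u  /* hypothetical "CPU is in userspace" bit */

/* Per-CPU word holding the context bit plus any pending work bits. */
static _Atomic unsigned int cpu_state[NR_CPUS];
/* Counts how many times each modelled CPU ran its "sync_core()". */
static int syncs_done[NR_CPUS];

/* Modelled CPU returns to userspace with no work pending. */
void model_user_entry(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	atomic_store(&cpu_state[cpu], IN_USER);
}

/* Modelled kernel entry: leave userspace and flush deferred work first. */
void model_kernel_entry(int cpu)
{
	unsigned int old;

	assert(cpu >= 0 && cpu < NR_CPUS);
	old = atomic_exchange(&cpu_state[cpu], 0);
	if (old & WORK_SYNC)
		syncs_done[cpu]++;	/* the deferred sync runs here */
}

/* Queue work only while the CPU is in userspace; true if it was queued. */
bool model_set_cpu_work(int cpu, unsigned int work)
{
	unsigned int old = atomic_load(&cpu_state[cpu]);

	do {
		if (!(old & IN_USER))
			return false;	/* in the kernel: caller must interrupt it */
	} while (!atomic_compare_exchange_weak(&cpu_state[cpu], &old, old | work));

	return true;
}

/* Mirrors the shape of do_sync_core_defer_cond(): true means "sync now". */
bool model_defer_cond(int cpu)
{
	return !model_set_cpu_work(cpu, WORK_SYNC);
}

/* Walk all CPUs; returns how many needed an immediate (modelled) IPI. */
int model_text_poke_sync_deferrable(void)
{
	int ipis = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (model_defer_cond(cpu)) {
			syncs_done[cpu]++;	/* the IPI handler runs at once */
			ipis++;
		}
	}
	return ipis;
}
```

With CPUs 2 and 3 placed in userspace via model_user_entry(), a call to model_text_poke_sync_deferrable() interrupts only CPUs 0 and 1. CPU 2 performs its sync later, when model_kernel_entry(2) flushes the queued bit. The compare-and-exchange loop is what keeps the check ("is the CPU in userspace?") and the queuing atomic with respect to a concurrent kernel entry in this model.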