From patchwork Tue Nov 19 15:35:00 2024
X-Patchwork-Submitter: Valentin Schneider
X-Patchwork-Id: 13880189
From: Valentin Schneider <vschneid@redhat.com>
To: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
    kvm@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org,
    x86@kernel.org, rcu@vger.kernel.org, linux-kselftest@vger.kernel.org
Cc: Steven Rostedt, Masami Hiramatsu, Jonathan Corbet, Thomas Gleixner,
    Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin",
    Paolo Bonzini, Wanpeng Li, Vitaly Kuznetsov, Andy Lutomirski,
    Peter Zijlstra, Frederic Weisbecker, "Paul E. McKenney",
    Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Boqun Feng,
    Mathieu Desnoyers, Lai Jiangshan, Zqiang, Andrew Morton,
    Uladzislau Rezki, Christoph Hellwig, Lorenzo Stoakes, Josh Poimboeuf,
    Jason Baron, Kees Cook, Sami Tolvanen, Ard Biesheuvel,
    Nicholas Piggin, Juerg Haefliger, Nicolas Saenz Julienne,
    "Kirill A. Shutemov", Nadav Amit, Dan Carpenter, Chuang Wang,
    Yang Jihong, Petr Mladek,
    "Jason A. Donenfeld", Song Liu, Julian Pidancet, Tom Lendacky,
    Dionna Glaze, Thomas Weißschuh, Juri Lelli, Marcelo Tosatti,
    Yair Podemsky, Daniel Wagner, Petr Tesarik
Subject: [RFC PATCH v3 13/15] context_tracking,x86: Add infrastructure to defer kernel TLBI
Date: Tue, 19 Nov 2024 16:35:00 +0100
Message-ID: <20241119153502.41361-14-vschneid@redhat.com>
In-Reply-To: <20241119153502.41361-1-vschneid@redhat.com>
References: <20241119153502.41361-1-vschneid@redhat.com>
MIME-Version: 1.0

Kernel TLB invalidation IPIs are a common source of interference on
NOHZ_FULL CPUs. Given that NOHZ_FULL CPUs executing in userspace are not
accessing any kernel addresses, these invalidations do not need to happen
immediately and can be deferred until the next user->kernel transition.

Add a minimal, noinstr-compliant variant of __flush_tlb_all() that doesn't
try to leverage INVPCID. To achieve this:

o Add a noinstr variant of native_write_cr4(), keeping native_write_cr4()
  as "only" __no_profile.
o Make invalidate_user_asid() __always_inline

XXX: what about paravirt? Should these instead be new operations made
available through pv_ops.mmu.*? x86_64_start_kernel() uses
__native_tlb_flush_global() regardless of paravirt, so I'm thinking the
paravirt variants are more optimizations than hard requirements?
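A note for reviewers unfamiliar with the deferral machinery this series
builds on: the snippet below is a standalone userspace model of the idea,
not kernel code. Names such as ct_work_pending, defer_work() and
ct_work_flush() are invented for illustration. Instead of sending an IPI,
a remote CPU ORs a work bit into a pending mask; the target CPU consumes
all pending bits on its next user->kernel transition. The series only
defers work for CPUs actually executing in userspace; this model skips
that check.

/* Standalone model of deferred context-tracking work; NOT kernel code. */
#include <stdatomic.h>
#include <stdio.h>

#define CONTEXT_WORK_SYNC  (1U << 0)
#define CONTEXT_WORK_TLBI  (1U << 1)

/* One word per CPU in the real thing; a single "CPU" suffices here. */
static atomic_uint ct_work_pending = 0;

/* Remote side: OR in a work bit instead of sending an IPI. */
static void defer_work(unsigned int work)
{
	atomic_fetch_or_explicit(&ct_work_pending, work,
				 memory_order_release);
}

/* Target side: runs on every user->kernel transition. */
static void ct_work_flush(void)
{
	unsigned int work = atomic_exchange_explicit(&ct_work_pending, 0,
						     memory_order_acquire);
	if (work & CONTEXT_WORK_SYNC)
		printf("sync_core()\n");		/* stand-in */
	if (work & CONTEXT_WORK_TLBI)
		printf("__flush_tlb_all_noinstr()\n");	/* stand-in */
}

int main(void)
{
	defer_work(CONTEXT_WORK_TLBI);	/* a kernel TLBI was deferred */
	ct_work_flush();		/* next kernel entry pays the cost */
	return 0;
}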
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 +++
 arch/x86/include/asm/special_insns.h         |  1 +
 arch/x86/include/asm/tlbflush.h              | 16 ++++++++++--
 arch/x86/kernel/cpu/common.c                 |  6 ++++-
 arch/x86/mm/tlb.c                            | 26 ++++++++++++++++++--
 include/linux/context_tracking_work.h        |  2 ++
 6 files changed, 50 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 2c66687ce00e2..9d4f021b5a45b 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include <asm/sync_core.h>
+#include <asm/tlbflush.h>
 
 static __always_inline void arch_context_tracking_work(int work)
 {
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(int work)
 	case CONTEXT_WORK_SYNC:
 		sync_core();
 		break;
+	case CONTEXT_WORK_TLBI:
+		__flush_tlb_all_noinstr();
+		break;
 	}
 }
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index aec6e2d3aa1d5..b97157a69d48e 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -74,6 +74,7 @@ static inline unsigned long native_read_cr4(void)
 	return val;
 }
 
+void native_write_cr4_noinstr(unsigned long val);
 void native_write_cr4(unsigned long val);
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b80..a653b5f47f0e6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -18,6 +18,7 @@ DECLARE_PER_CPU(u64, tlbstate_untag_mask);
 
 void __flush_tlb_all(void);
+void noinstr __flush_tlb_all_noinstr(void);
 
 #define TLB_FLUSH_ALL			-1UL
 #define TLB_GENERATION_INVALID		0
@@ -418,9 +419,20 @@ static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask)
 #endif
 #endif /* !MODULE */
 
+#define __NATIVE_TLB_FLUSH_GLOBAL(suffix, cr4)		\
+	native_write_cr4##suffix(cr4 ^ X86_CR4_PGE);	\
+	native_write_cr4##suffix(cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL(cr4)		__NATIVE_TLB_FLUSH_GLOBAL(, cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4)	__NATIVE_TLB_FLUSH_GLOBAL(_noinstr, cr4)
+
 static inline void __native_tlb_flush_global(unsigned long cr4)
 {
-	native_write_cr4(cr4 ^ X86_CR4_PGE);
-	native_write_cr4(cr4);
+	NATIVE_TLB_FLUSH_GLOBAL(cr4);
 }
+
+static inline void __native_tlb_flush_global_noinstr(unsigned long cr4)
+{
+	NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4);
+}
+
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a5f221ea56888..a84bb8511650b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -424,7 +424,7 @@ void native_write_cr0(unsigned long val)
 }
 EXPORT_SYMBOL(native_write_cr0);
 
-void __no_profile native_write_cr4(unsigned long val)
+noinstr void native_write_cr4_noinstr(unsigned long val)
 {
 	unsigned long bits_changed = 0;
 
@@ -442,6 +442,10 @@ void __no_profile native_write_cr4(unsigned long val)
 			  bits_changed);
 	}
 }
+void native_write_cr4(unsigned long val)
+{
+	native_write_cr4_noinstr(val);
+}
 #if IS_MODULE(CONFIG_LKDTM)
 EXPORT_SYMBOL_GPL(native_write_cr4);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 86593d1b787d8..973a4ab3f53b3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -256,7 +256,7 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
  *
  * See SWITCH_TO_USER_CR3.
  */
-static inline void invalidate_user_asid(u16 asid)
+static __always_inline void invalidate_user_asid(u16 asid)
 {
 	/* There is no user ASID if address space separation is off */
 	if (!IS_ENABLED(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION))
@@ -1198,7 +1198,7 @@ STATIC_NOPV void native_flush_tlb_global(void)
 /*
  * Flush the entire current user mapping
  */
-STATIC_NOPV void native_flush_tlb_local(void)
+static noinstr void native_flush_tlb_local_noinstr(void)
 {
 	/*
 	 * Preemption or interrupts must be disabled to protect the access
@@ -1213,6 +1213,11 @@ STATIC_NOPV void native_flush_tlb_local(void)
 	native_write_cr3(__native_read_cr3());
 }
 
+STATIC_NOPV void native_flush_tlb_local(void)
+{
+	native_flush_tlb_local_noinstr();
+}
+
 void flush_tlb_local(void)
 {
 	__flush_tlb_local();
@@ -1240,6 +1245,23 @@ void __flush_tlb_all(void)
 }
 EXPORT_SYMBOL_GPL(__flush_tlb_all);
 
+void noinstr __flush_tlb_all_noinstr(void)
+{
+	/*
+	 * This is for invocation in early entry code that cannot be
+	 * instrumented. A RMW to CR4 works for most cases, but relies on
+	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
+	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
+	 * do the CR3 write.
+	 *
+	 * XXX: this gives paravirt the finger.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_PGE))
+		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
+	else
+		native_flush_tlb_local_noinstr();
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	struct flush_tlb_info *info;
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 13fc97b395030..47d5ced39a43a 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,11 +6,13 @@
 
 enum {
 	CONTEXT_WORK_SYNC_OFFSET,
+	CONTEXT_WORK_TLBI_OFFSET,
 	CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
 	CONTEXT_WORK_SYNC	= BIT(CONTEXT_WORK_SYNC_OFFSET),
+	CONTEXT_WORK_TLBI	= BIT(CONTEXT_WORK_TLBI_OFFSET),
 	CONTEXT_WORK_MAX	= BIT(CONTEXT_WORK_MAX_OFFSET)
 };
 
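As an aside on the NATIVE_TLB_FLUSH_GLOBAL() macros above: toggling
CR4.PGE off and back on is the architectural way to flush all TLB entries,
including global ones, since any CR4 write that changes PGE triggers a
full flush. The snippet below is a userspace model with a simulated
register; fake_cr4, fake_write_cr4() and model_tlb_flush_global() are
invented names, and real CR4 writes require ring 0. It shows the same
two-write sequence the macro expands to.

/* Userspace model of the CR4.PGE toggle; cannot touch the real CR4. */
#include <stdio.h>

#define X86_CR4_PGE (1UL << 7)	/* Page Global Enable, per the Intel SDM */

static unsigned long fake_cr4 = X86_CR4_PGE;	/* stand-in register */

static void fake_write_cr4(unsigned long val)
{
	fake_cr4 = val;
	printf("cr4 <- %#lx (PGE=%d)\n", val,
	       (val & X86_CR4_PGE) ? 1 : 0);
}

/*
 * Mirrors __native_tlb_flush_global(): the first write changes PGE
 * (flushing all TLB entries, globals included), the second restores
 * the original value (flushing again, and leaving CR4 unchanged).
 */
static void model_tlb_flush_global(unsigned long cr4)
{
	fake_write_cr4(cr4 ^ X86_CR4_PGE);	/* flip PGE: full flush */
	fake_write_cr4(cr4);			/* restore original CR4 */
}

int main(void)
{
	model_tlb_flush_global(fake_cr4);
	return 0;
}

Because the macro XORs rather than clearing and setting, the same two
stores change PGE regardless of its initial value, so the flush is
triggered either way and CR4 ends up exactly as it started.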