Message ID | 20241119153502.41361-14-vschneid@redhat.com (mailing list archive)
---|---
State | New
Series | context_tracking,x86: Defer some IPIs until a user->kernel transition
On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:

> +void noinstr __flush_tlb_all_noinstr(void)
> +{
> +	/*
> +	 * This is for invocation in early entry code that cannot be
> +	 * instrumented. A RMW to CR4 works for most cases, but relies on
> +	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
> +	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
> +	 * do the CR3 write.
> +	 *
> +	 * XXX: this gives paravirt the finger.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_PGE))
> +		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
> +	else
> +		native_flush_tlb_local_noinstr();
> +}

Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.

Why not always just do the CR3 write and call it a day? That should also
work for paravirt, no? Just make the whole write_cr3 thing noinstr and
voila.
On Wed, Nov 20, 2024 at 04:22:16PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:
> 
> > +void noinstr __flush_tlb_all_noinstr(void)
> > +{
> > +	/*
> > +	 * This is for invocation in early entry code that cannot be
> > +	 * instrumented. A RMW to CR4 works for most cases, but relies on
> > +	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
> > +	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
> > +	 * do the CR3 write.
> > +	 *
> > +	 * XXX: this gives paravirt the finger.
> > +	 */
> > +	if (cpu_feature_enabled(X86_FEATURE_PGE))
> > +		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
> > +	else
> > +		native_flush_tlb_local_noinstr();
> > +}
> 
> Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.
> 
> Why not always just do the CR3 write and call it a day? That should also
> work for paravirt, no? Just make the whole write_cr3 thing noinstr and
> voila.

Oh gawd, just having looked at xen_write_cr3() this might not be
entirely trivial to mark noinstr :/
On 20/11/24 16:32, Peter Zijlstra wrote:
> On Wed, Nov 20, 2024 at 04:22:16PM +0100, Peter Zijlstra wrote:
>> On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:
>>
>> > +void noinstr __flush_tlb_all_noinstr(void)
>> > +{
>> > +	/*
>> > +	 * This is for invocation in early entry code that cannot be
>> > +	 * instrumented. A RMW to CR4 works for most cases, but relies on
>> > +	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
>> > +	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
>> > +	 * do the CR3 write.
>> > +	 *
>> > +	 * XXX: this gives paravirt the finger.
>> > +	 */
>> > +	if (cpu_feature_enabled(X86_FEATURE_PGE))
>> > +		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
>> > +	else
>> > +		native_flush_tlb_local_noinstr();
>> > +}
>>
>> Urgh, so that's a lot of ugleh, and cr4 has that pinning stuff and gah.
>>
>> Why not always just do the CR3 write and call it a day? That should also
>> work for paravirt, no? Just make the whole write_cr3 thing noinstr and
>> voila.
>
> Oh gawd, just having looked at xen_write_cr3() this might not be
> entirely trivial to mark noinstr :/

... I hadn't even seen that.

AIUI the CR3 RMW is not "enough" if we have PGE enabled, because then
global pages aren't flushed.

The question becomes: what is held in global pages and do we care about
that when it comes to vmalloc()? I'm starting to think no, but this is x86,
I don't know what surprises are waiting for me.

I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
and it correctly uses the non-deferrable flush_tlb_kernel_range().
On Wed, Nov 20, 2024 at 06:24:56PM +0100, Valentin Schneider wrote:
> > Oh gawd, just having looked at xen_write_cr3() this might not be
> > entirely trivial to mark noinstr :/
> 
> ... I hadn't even seen that.
> 
> AIUI the CR3 RMW is not "enough" if we have PGE enabled, because then
> global pages aren't flushed.
> 
> The question becomes: what is held in global pages and do we care about
> that when it comes to vmalloc()? I'm starting to think no, but this is x86,
> I don't know what surprises are waiting for me.
> 
> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
> and it correctly uses the non-deferrable flush_tlb_kernel_range().

I always forget what we use global pages for, dhansen might know, but
let me try and have a look.

I *think* we only have GLOBAL on kernel text, and that only sometimes.
On 11/21/24 03:12, Peter Zijlstra wrote:
>> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
>> and it correctly uses the non-deferrable flush_tlb_kernel_range().
> 
> I always forget what we use global pages for, dhansen might know, but
> let me try and have a look.
> 
> I *think* we only have GLOBAL on kernel text, and that only sometimes.

I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.

Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings.
Before PCIDs, global mappings let the kernel TLB entries live across CR3
writes. When PCIDs are in play, global mappings let two different ASIDs
share TLB entries.

When KPTI is around, the kernel writes CR3 at user/kernel switches to make
sure secrets are unmapped and can't be leaked by Meltdown. But unmapping
those secrets doesn't do squat if they were mapped globally since they'll
still be in the TLB and quite usable. There, we're more judicious and only
mark performance-sensitive things that are not secret to be global, like
kernel text.
On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
> On 11/21/24 03:12, Peter Zijlstra wrote:
> >> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
> >> and it correctly uses the non-deferrable flush_tlb_kernel_range().
> > 
> > I always forget what we use global pages for, dhansen might know, but
> > let me try and have a look.
> > 
> > I *think* we only have GLOBAL on kernel text, and that only sometimes.
> 
> I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.

Yah, I suppose I am. That was the last time I had a good look at this
stuff :-)

> Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings.
> Before PCIDs, global mappings let the kernel TLB entries live across CR3
> writes. When PCIDs are in play, global mappings let two different ASIDs
> share TLB entries.

Hurmph.. bah. That means we do need that horrible CR4 dance :/
On Tue, Nov 19, 2024 at 04:35:00PM +0100, Valentin Schneider wrote:

> @@ -418,9 +419,20 @@ static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask)
>  #endif
>  #endif /* !MODULE */
> 
> +#define __NATIVE_TLB_FLUSH_GLOBAL(suffix, cr4)		\
> +	native_write_cr4##suffix(cr4 ^ X86_CR4_PGE);	\
> +	native_write_cr4##suffix(cr4)
> +#define NATIVE_TLB_FLUSH_GLOBAL(cr4)		__NATIVE_TLB_FLUSH_GLOBAL(, cr4)
> +#define NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4)	__NATIVE_TLB_FLUSH_GLOBAL(_noinstr, cr4)
> +
>  static inline void __native_tlb_flush_global(unsigned long cr4)
>  {
> -	native_write_cr4(cr4 ^ X86_CR4_PGE);
> -	native_write_cr4(cr4);
> +	NATIVE_TLB_FLUSH_GLOBAL(cr4);
>  }
> +
> +static inline void __native_tlb_flush_global_noinstr(unsigned long cr4)
> +{
> +	NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4);
> +}

How about something like this instead? I've only compile tested the tlb.c
bit, but it should get __flush_tlb_global() to be noinstr I think,
including the Xen bit (unless I missed something but then objtool should
complain).

---
diff --git a/arch/x86/include/asm/invpcid.h b/arch/x86/include/asm/invpcid.h
index 734482afbf81..ff26136fcd9c 100644
--- a/arch/x86/include/asm/invpcid.h
+++ b/arch/x86/include/asm/invpcid.h
@@ -2,7 +2,7 @@
 #ifndef _ASM_X86_INVPCID
 #define _ASM_X86_INVPCID
 
-static inline void __invpcid(unsigned long pcid, unsigned long addr,
+static __always_inline void __invpcid(unsigned long pcid, unsigned long addr,
 			     unsigned long type)
 {
 	struct { u64 d[2]; } desc = { { pcid, addr } };
@@ -13,7 +13,7 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 	 * mappings, we don't want the compiler to reorder any subsequent
 	 * memory accesses before the TLB flush.
 	 */
-	asm volatile("invpcid %[desc], %[type]"
+	asm_inline volatile("invpcid %[desc], %[type]"
 		     :: [desc] "m" (desc), [type] "r" (type) : "memory");
 }
 
@@ -23,26 +23,25 @@ static inline void __invpcid(unsigned long pcid, unsigned long addr,
 #define INVPCID_TYPE_ALL_NON_GLOBAL	3
 
 /* Flush all mappings for a given pcid and addr, not including globals. */
-static inline void invpcid_flush_one(unsigned long pcid,
-				     unsigned long addr)
+static __always_inline void invpcid_flush_one(unsigned long pcid, unsigned long addr)
 {
 	__invpcid(pcid, addr, INVPCID_TYPE_INDIV_ADDR);
 }
 
 /* Flush all mappings for a given PCID, not including globals. */
-static inline void invpcid_flush_single_context(unsigned long pcid)
+static __always_inline void invpcid_flush_single_context(unsigned long pcid)
 {
 	__invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT);
 }
 
 /* Flush all mappings, including globals, for all PCIDs. */
-static inline void invpcid_flush_all(void)
+static __always_inline void invpcid_flush_all(void)
 {
 	__invpcid(0, 0, INVPCID_TYPE_ALL_INCL_GLOBAL);
 }
 
 /* Flush all mappings for all PCIDs except globals. */
-static inline void invpcid_flush_all_nonglobals(void)
+static __always_inline void invpcid_flush_all_nonglobals(void)
 {
 	__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d4eb9e1d61b8..b3daee3d4667 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -75,7 +75,7 @@ static inline void __flush_tlb_local(void)
 	PVOP_VCALL0(mmu.flush_tlb_user);
 }
 
-static inline void __flush_tlb_global(void)
+static __always_inline void __flush_tlb_global(void)
 {
 	PVOP_VCALL0(mmu.flush_tlb_kernel);
 }
diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
index a2dd24947eb8..b4c635b20538 100644
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -357,8 +357,8 @@ MULTI_mmu_update(struct multicall_entry *mcl, struct mmu_update *req,
 	trace_xen_mc_entry(mcl, 4);
 }
 
-static inline void
-MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
+static __always_inline void
+__MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
 		int *success_count, domid_t domid)
 {
 	mcl->op = __HYPERVISOR_mmuext_op;
@@ -366,6 +366,13 @@ MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
 	mcl->args[1] = count;
 	mcl->args[2] = (unsigned long)success_count;
 	mcl->args[3] = domid;
+}
+
+static inline void
+MULTI_mmuext_op(struct multicall_entry *mcl, struct mmuext_op *op, int count,
+		int *success_count, domid_t domid)
+{
+	__MULTI_mmuext_op(mcl, op, count, success_count, domid);
 
 	trace_xen_mc_entry(mcl, 4);
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index b0d5a644fc84..0cfc00a34b7e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1168,9 +1168,10 @@ void flush_tlb_one_user(unsigned long addr)
 /*
  * Flush everything
  */
-STATIC_NOPV void native_flush_tlb_global(void)
+STATIC_NOPV noinstr void native_flush_tlb_global(void)
 {
 	unsigned long flags;
+	unsigned long cr4;
 
 	if (static_cpu_has(X86_FEATURE_INVPCID)) {
 		/*
@@ -1189,9 +1190,15 @@ STATIC_NOPV void native_flush_tlb_global(void)
 	 * be called from deep inside debugging code.)
 	 */
 	raw_local_irq_save(flags);
-
-	__native_tlb_flush_global(this_cpu_read(cpu_tlbstate.cr4));
-
+	cr4 = this_cpu_read(cpu_tlbstate.cr4);
+	asm volatile("mov %0,%%cr4": : "r" (cr4 ^ X86_CR4_PGE) : "memory");
+	asm volatile("mov %0,%%cr4": : "r" (cr4) : "memory");
+	/*
+	 * In lieu of not having the pinning crap, hard fail if CR4 doesn't
+	 * match the expected value. This ensures that anybody doing dodgy gets
+	 * the fallthrough check.
+	 */
+	BUG_ON(cr4 != this_cpu_read(cpu_tlbstate.cr4));
 	raw_local_irq_restore(flags);
 }
 
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 55a4996d0c04..4eb265eb867a 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1231,22 +1231,22 @@ static noinstr void xen_write_cr2(unsigned long cr2)
 	this_cpu_read(xen_vcpu)->arch.cr2 = cr2;
 }
 
-static noinline void xen_flush_tlb(void)
+static noinline noinstr void xen_flush_tlb(void)
 {
 	struct mmuext_op *op;
 	struct multicall_space mcs;
 
-	preempt_disable();
+	preempt_disable_notrace();
 
 	mcs = xen_mc_entry(sizeof(*op));
 
 	op = mcs.args;
 	op->cmd = MMUEXT_TLB_FLUSH_LOCAL;
-	MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+	__MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
 
-	xen_mc_issue(XEN_LAZY_MMU);
+	__xen_mc_issue(XEN_LAZY_MMU);
 
-	preempt_enable();
+	preempt_enable_notrace();
 }
 
 static void xen_flush_tlb_one_user(unsigned long addr)
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
index e1b782e823e6..31eddca45c27 100644
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -235,15 +235,19 @@ static inline struct multicall_space xen_mc_entry(size_t args)
 void xen_mc_flush(void);
 
 /* Issue a multicall if we're not in a lazy mode */
-static inline void xen_mc_issue(unsigned mode)
+static __always_inline void __xen_mc_issue(unsigned mode)
 {
-	trace_xen_mc_issue(mode);
-
 	if ((xen_get_lazy_mode() & mode) == 0)
 		xen_mc_flush();
 
 	/* restore flags saved in xen_mc_batch */
-	local_irq_restore(this_cpu_read(xen_mc_irq_flags));
+	raw_local_irq_restore(this_cpu_read(xen_mc_irq_flags));
+}
+
+static inline void xen_mc_issue(unsigned mode)
+{
+	trace_xen_mc_issue(mode);
+	__xen_mc_issue(mode);
 }
 
 /* Set up a callback to be called when the current batch is flushed */
On Thu, 21 Nov 2024 16:30:16 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
> > On 11/21/24 03:12, Peter Zijlstra wrote:
> > >> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
> > >> and it correctly uses the non-deferrable flush_tlb_kernel_range().
> > > 
> > > I always forget what we use global pages for, dhansen might know, but
> > > let me try and have a look.
> > > 
> > > I *think* we only have GLOBAL on kernel text, and that only sometimes.
> > 
> > I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.
> 
> Yah, I suppose I am. That was the last time I had a good look at this
> stuff :-)
> 
> > Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings.
> > Before PCIDs, global mappings let the kernel TLB entries live across CR3
> > writes. When PCIDs are in play, global mappings let two different ASIDs
> > share TLB entries.
> 
> Hurmph.. bah. That means we do need that horrible CR4 dance :/

In general, yes.

But I wonder what exactly was the original scenario encountered by
Valentin. I mean, if TLB entry invalidations were necessary to sync
changes to kernel text after flipping a static branch, then it might be
less overhead to make a list of affected pages and call INVLPG on them.

AFAIK there is currently no such IPI function for doing that, but if we
could add one. If the list of invalidated global pages is reasonably
short, of course.

Valentin, do you happen to know?

Petr T
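For illustration, a minimal sketch of the list-based INVLPG handler Petr is
suggesting; the struct and function names here are hypothetical, and only
the invlpg pattern and on_each_cpu() mirror what arch/x86 already does for
single-page flushes:

/*
 * Hypothetical IPI callback: walk a short list of kernel addresses and
 * invalidate them one by one. INVLPG also drops global (_PAGE_GLOBAL)
 * translations, so no CR4.PGE toggle would be needed.
 */
struct invlpg_list_info {
	unsigned long	*addrs;		/* page-aligned kernel addresses */
	unsigned int	nr;
};

static void flush_kernel_pages_ipi(void *data)
{
	struct invlpg_list_info *info = data;
	unsigned int i;

	for (i = 0; i < info->nr; i++)
		asm volatile("invlpg (%0)" :: "r" (info->addrs[i]) : "memory");
}

/*
 * Sender side would be along the lines of:
 *	on_each_cpu(flush_kernel_pages_ipi, &info, 1);
 */

As the rest of the thread works out, this only pays off when the list of
invalidated pages stays short.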
On 05/12/24 18:31, Petr Tesarik wrote:
> On Thu, 21 Nov 2024 16:30:16 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
>> > On 11/21/24 03:12, Peter Zijlstra wrote:
>> > >> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
>> > >> and it correctly uses the non-deferrable flush_tlb_kernel_range().
>> > >
>> > > I always forget what we use global pages for, dhansen might know, but
>> > > let me try and have a look.
>> > >
>> > > I *think* we only have GLOBAL on kernel text, and that only sometimes.
>> >
>> > I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.
>>
>> Yah, I suppose I am. That was the last time I had a good look at this
>> stuff :-)
>>
>> > Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings.
>> > Before PCIDs, global mappings let the kernel TLB entries live across CR3
>> > writes. When PCIDs are in play, global mappings let two different ASIDs
>> > share TLB entries.
>>
>> Hurmph.. bah. That means we do need that horrible CR4 dance :/
>
> In general, yes.
>
> But I wonder what exactly was the original scenario encountered by
> Valentin. I mean, if TLB entry invalidations were necessary to sync
> changes to kernel text after flipping a static branch, then it might be
> less overhead to make a list of affected pages and call INVLPG on them.
>
> AFAIK there is currently no such IPI function for doing that, but if we
> could add one. If the list of invalidated global pages is reasonably
> short, of course.
>
> Valentin, do you happen to know?
>

So from my experimentation (hackbench + kernel compilation on housekeeping
CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only
occurred from vunmap() - mainly from all the hackbench threads coming and
going.

Static branch updates only seem to trigger the sync_core() IPI, at least on
x86.

> Petr T
On Mon, Dec 09, 2024 at 01:04:43PM +0100, Valentin Schneider wrote:

> > But I wonder what exactly was the original scenario encountered by
> > Valentin. I mean, if TLB entry invalidations were necessary to sync
> > changes to kernel text after flipping a static branch, then it might be
> > less overhead to make a list of affected pages and call INVLPG on them.

No; TLB is not involved with text patching (on x86).

> > Valentin, do you happen to know?
> 
> So from my experimentation (hackbench + kernel compilation on housekeeping
> CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only
> occurred from vunmap() - mainly from all the hackbench threads coming and
> going.

Right, we have virtually mapped stacks.
On Mon, 09 Dec 2024 13:04:43 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

> On 05/12/24 18:31, Petr Tesarik wrote:
> > On Thu, 21 Nov 2024 16:30:16 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> On Thu, Nov 21, 2024 at 07:07:44AM -0800, Dave Hansen wrote:
> >> > On 11/21/24 03:12, Peter Zijlstra wrote:
> >> > >> I see e.g. ds_clear_cea() clears PTEs that can have the _PAGE_GLOBAL flag,
> >> > >> and it correctly uses the non-deferrable flush_tlb_kernel_range().
> >> > >
> >> > > I always forget what we use global pages for, dhansen might know, but
> >> > > let me try and have a look.
> >> > >
> >> > > I *think* we only have GLOBAL on kernel text, and that only sometimes.
> >> >
> >> > I think you're remembering how _PAGE_GLOBAL gets used when KPTI is in play.
> >>
> >> Yah, I suppose I am. That was the last time I had a good look at this
> >> stuff :-)
> >>
> >> > Ignoring KPTI for a sec... We use _PAGE_GLOBAL for all kernel mappings.
> >> > Before PCIDs, global mappings let the kernel TLB entries live across CR3
> >> > writes. When PCIDs are in play, global mappings let two different ASIDs
> >> > share TLB entries.
> >>
> >> Hurmph.. bah. That means we do need that horrible CR4 dance :/
> >
> > In general, yes.
> >
> > But I wonder what exactly was the original scenario encountered by
> > Valentin. I mean, if TLB entry invalidations were necessary to sync
> > changes to kernel text after flipping a static branch, then it might be
> > less overhead to make a list of affected pages and call INVLPG on them.
> >
> > AFAIK there is currently no such IPI function for doing that, but if we
> > could add one. If the list of invalidated global pages is reasonably
> > short, of course.
> >
> > Valentin, do you happen to know?
> >
> 
> So from my experimentation (hackbench + kernel compilation on housekeeping
> CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only
> occurred from vunmap() - mainly from all the hackbench threads coming and
> going.
> 
> Static branch updates only seem to trigger the sync_core() IPI, at least on
> x86.

Thank you, this is helpful.

So, these allocations span more than tlb_single_page_flush_ceiling pages
(default 33). Is THP enabled? If yes, we could possibly get below that
threshold by improving flushing of huge pages (cf. footnote [1] in
Documentation/arch/x86/tlb.rst).

OTOH even though a series of INVLPG may reduce subsequent TLB misses, it
will not exactly improve latency, so it would go against the main goal of
this whole patch series.

Hmmm... I see, the CR4 dance is the best solution after all. :-|

Petr T
On Mon, 9 Dec 2024 13:12:49 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Dec 09, 2024 at 01:04:43PM +0100, Valentin Schneider wrote:
> 
> > > But I wonder what exactly was the original scenario encountered by
> > > Valentin. I mean, if TLB entry invalidations were necessary to sync
> > > changes to kernel text after flipping a static branch, then it might be
> > > less overhead to make a list of affected pages and call INVLPG on them.
> 
> No; TLB is not involved with text patching (on x86).
> 
> > > Valentin, do you happen to know?
> > 
> > So from my experimentation (hackbench + kernel compilation on housekeeping
> > CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only
> > occurred from vunmap() - mainly from all the hackbench threads coming and
> > going.
> 
> Right, we have virtually mapped stacks.

Wait... Are you talking about the kernel stac? But that's only 4 pages
(or 8 pages with KASAN), so that should be easily handled with INVLPG.
No CR4 dances are needed for that.

What am I missing?

Petr T
On 09/12/24 15:42, Petr Tesarik wrote:
> On Mon, 9 Dec 2024 13:12:49 +0100
> Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Mon, Dec 09, 2024 at 01:04:43PM +0100, Valentin Schneider wrote:
>>
>> > > But I wonder what exactly was the original scenario encountered by
>> > > Valentin. I mean, if TLB entry invalidations were necessary to sync
>> > > changes to kernel text after flipping a static branch, then it might be
>> > > less overhead to make a list of affected pages and call INVLPG on them.
>>
>> No; TLB is not involved with text patching (on x86).
>>
>> > > Valentin, do you happen to know?
>> >
>> > So from my experimentation (hackbench + kernel compilation on housekeeping
>> > CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only
>> > occurred from vunmap() - mainly from all the hackbench threads coming and
>> > going.
>>
>> Right, we have virtually mapped stacks.
>
> Wait... Are you talking about the kernel stac? But that's only 4 pages
> (or 8 pages with KASAN), so that should be easily handled with INVLPG.
> No CR4 dances are needed for that.
>
> What am I missing?
>

So the gist of the IPI deferral thing is to coalesce IPI callbacks into a
single flag value that is read & acted on upon kernel entry. Freeing a
task's kernel stack is not the only thing that can issue a vunmap(), so
instead of tracking all the pages affected by the unmap (which is
potentially an ever-growing memory leak as long as no kernel entry happens
on the isolated CPUs), we just flush everything.

Quick tracing with my dummy benchmark mostly shows vfree_atomic() ->
drain_vmap_work() but pretty much any vfree() / kvfree_rcu() from the
housekeeping CPUs can get us that IPI.
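To make the mechanism Valentin describes concrete, here is a much-simplified
sketch (not the series' actual context_tracking code; the DEFER_WORK_* and
defer_work_on()/do_deferred_work() names are invented): the sender records
work bits in a per-CPU mask instead of raising an IPI, and the target CPU
consumes the mask early on its next user->kernel transition, the way
arch_context_tracking_work() handles CONTEXT_WORK_TLBI in the patch further
down.

enum {
	DEFER_WORK_SYNC		= BIT(0),	/* sync_core() */
	DEFER_WORK_TLB_FLUSH	= BIT(1),	/* __flush_tlb_all_noinstr() */
};

static atomic_t deferred_work[NR_CPUS];

/* Sender: instead of an IPI, just record the work for @cpu. */
static void defer_work_on(int cpu, int work)
{
	atomic_fetch_or(work, &deferred_work[cpu]);
}

/* Target: run early in the next user->kernel transition. */
static void do_deferred_work(int cpu)
{
	int work = atomic_xchg(&deferred_work[cpu], 0);

	if (work & DEFER_WORK_TLB_FLUSH)
		__flush_tlb_all_noinstr();
	if (work & DEFER_WORK_SYNC)
		sync_core();
}

The series itself is more careful than this sketch: work is only deferred
for CPUs that context tracking reports as running in userspace, and
anything else still gets a regular IPI.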
On Tue, 10 Dec 2024 14:53:36 +0100
Valentin Schneider <vschneid@redhat.com> wrote:

> On 09/12/24 15:42, Petr Tesarik wrote:
> > On Mon, 9 Dec 2024 13:12:49 +0100
> > Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> On Mon, Dec 09, 2024 at 01:04:43PM +0100, Valentin Schneider wrote:
> >>
> >> > > But I wonder what exactly was the original scenario encountered by
> >> > > Valentin. I mean, if TLB entry invalidations were necessary to sync
> >> > > changes to kernel text after flipping a static branch, then it might be
> >> > > less overhead to make a list of affected pages and call INVLPG on them.
> >>
> >> No; TLB is not involved with text patching (on x86).
> >>
> >> > > Valentin, do you happen to know?
> >> >
> >> > So from my experimentation (hackbench + kernel compilation on housekeeping
> >> > CPUs, dummy while(1) userspace loop on isolated CPUs), the TLB flushes only
> >> > occurred from vunmap() - mainly from all the hackbench threads coming and
> >> > going.
> >>
> >> Right, we have virtually mapped stacks.
> >
> > Wait... Are you talking about the kernel stac? But that's only 4 pages
> > (or 8 pages with KASAN), so that should be easily handled with INVLPG.
> > No CR4 dances are needed for that.
> >
> > What am I missing?
> >
> 
> So the gist of the IPI deferral thing is to coalesce IPI callbacks into a
> single flag value that is read & acted on upon kernel entry. Freeing a
> task's kernel stack is not the only thing that can issue a vunmap(), so

Thank you for confirming it's not the kernel stack. Peter's remark left
me a little confused.

> instead of tracking all the pages affected by the unmap (which is
> potentially an ever-growing memory leak as long as no kernel entry happens
> on the isolated CPUs), we just flush everything.

Yes, this makes some sense. Of course, there is no way to avoid the cost;
we can only defer it to a "more suitable" point in time, and current
low-latency requirements make kernel entry better than IPI. It is at least
more predictable (as long as device interrupts are routed to other CPUs).

I have looked into ways to reduce the number of page faults _after_
flushing the TLB. FWIW if we decide to track to-be-flushed pages, we only
need an array of tlb_single_page_flush_ceiling pages. If there are more,
flushing the entire TLB is believed to be cheaper. That is, I merely
suggest to use the same logic which is already implemented by
flush_tlb_kernel_range().

Anyway, since there is no easy trick, let's leave the discussion for a
later optimization. I definitely do not want to block progress on this
patch series.

Thanks for all your input!

Petr T
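For reference, a rough sketch of the ceiling logic Petr is pointing at,
mirroring the decision flush_tlb_kernel_range() already makes; the
deferred_kernel_addrs/deferred_nr tracking is hypothetical and stands in
for whatever bookkeeping a deferring scheme would maintain:

#define DEFER_FLUSH_CEILING	33	/* default tlb_single_page_flush_ceiling */

static unsigned long deferred_kernel_addrs[DEFER_FLUSH_CEILING];
static unsigned int deferred_nr;	/* > DEFER_FLUSH_CEILING means overflowed */

static void flush_deferred_kernel_pages(void)
{
	unsigned int i;

	if (deferred_nr > DEFER_FLUSH_CEILING) {
		/* Too many pages tracked: one global flush is cheaper. */
		__flush_tlb_all_noinstr();
	} else {
		/* Short list: invalidate each page individually. */
		for (i = 0; i < deferred_nr; i++)
			flush_tlb_one_kernel(deferred_kernel_addrs[i]);
	}
	deferred_nr = 0;
}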
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 2c66687ce00e2..9d4f021b5a45b 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include <asm/sync_core.h>
+#include <asm/tlbflush.h>
 
 static __always_inline void arch_context_tracking_work(int work)
 {
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(int work)
 	case CONTEXT_WORK_SYNC:
 		sync_core();
 		break;
+	case CONTEXT_WORK_TLBI:
+		__flush_tlb_all_noinstr();
+		break;
 	}
 }
 
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index aec6e2d3aa1d5..b97157a69d48e 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -74,6 +74,7 @@ static inline unsigned long native_read_cr4(void)
 	return val;
 }
 
+void native_write_cr4_noinstr(unsigned long val);
 void native_write_cr4(unsigned long val);
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 69e79fff41b80..a653b5f47f0e6 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -18,6 +18,7 @@ DECLARE_PER_CPU(u64, tlbstate_untag_mask);
 
 void __flush_tlb_all(void);
+void noinstr __flush_tlb_all_noinstr(void);
 
 #define TLB_FLUSH_ALL			-1UL
 #define TLB_GENERATION_INVALID		0
 
@@ -418,9 +419,20 @@ static inline void cpu_tlbstate_update_lam(unsigned long lam, u64 untag_mask)
 #endif
 #endif /* !MODULE */
 
+#define __NATIVE_TLB_FLUSH_GLOBAL(suffix, cr4)		\
+	native_write_cr4##suffix(cr4 ^ X86_CR4_PGE);	\
+	native_write_cr4##suffix(cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL(cr4)		__NATIVE_TLB_FLUSH_GLOBAL(, cr4)
+#define NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4)	__NATIVE_TLB_FLUSH_GLOBAL(_noinstr, cr4)
+
 static inline void __native_tlb_flush_global(unsigned long cr4)
 {
-	native_write_cr4(cr4 ^ X86_CR4_PGE);
-	native_write_cr4(cr4);
+	NATIVE_TLB_FLUSH_GLOBAL(cr4);
 }
+
+static inline void __native_tlb_flush_global_noinstr(unsigned long cr4)
+{
+	NATIVE_TLB_FLUSH_GLOBAL_NOINSTR(cr4);
+}
+
 #endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a5f221ea56888..a84bb8511650b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -424,7 +424,7 @@ void native_write_cr0(unsigned long val)
 }
 EXPORT_SYMBOL(native_write_cr0);
 
-void __no_profile native_write_cr4(unsigned long val)
+noinstr void native_write_cr4_noinstr(unsigned long val)
 {
 	unsigned long bits_changed = 0;
 
@@ -442,6 +442,10 @@ void __no_profile native_write_cr4(unsigned long val)
 			  bits_changed);
 	}
 }
+void native_write_cr4(unsigned long val)
+{
+	native_write_cr4_noinstr(val);
+}
 #if IS_MODULE(CONFIG_LKDTM)
 EXPORT_SYMBOL_GPL(native_write_cr4);
 #endif
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 86593d1b787d8..973a4ab3f53b3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -256,7 +256,7 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
  *
  * See SWITCH_TO_USER_CR3.
  */
-static inline void invalidate_user_asid(u16 asid)
+static __always_inline void invalidate_user_asid(u16 asid)
 {
 	/* There is no user ASID if address space separation is off */
 	if (!IS_ENABLED(CONFIG_MITIGATION_PAGE_TABLE_ISOLATION))
@@ -1198,7 +1198,7 @@ STATIC_NOPV void native_flush_tlb_global(void)
 /*
  * Flush the entire current user mapping
  */
-STATIC_NOPV void native_flush_tlb_local(void)
+static noinstr void native_flush_tlb_local_noinstr(void)
 {
 	/*
 	 * Preemption or interrupts must be disabled to protect the access
@@ -1213,6 +1213,11 @@ STATIC_NOPV void native_flush_tlb_local(void)
 	native_write_cr3(__native_read_cr3());
 }
 
+STATIC_NOPV void native_flush_tlb_local(void)
+{
+	native_flush_tlb_local_noinstr();
+}
+
 void flush_tlb_local(void)
 {
 	__flush_tlb_local();
@@ -1240,6 +1245,23 @@ void __flush_tlb_all(void)
 }
 EXPORT_SYMBOL_GPL(__flush_tlb_all);
 
+void noinstr __flush_tlb_all_noinstr(void)
+{
+	/*
+	 * This is for invocation in early entry code that cannot be
+	 * instrumented. A RMW to CR4 works for most cases, but relies on
+	 * being able to flip either of the PGE or PCIDE bits. Flipping CR4.PCID
+	 * would require also resetting CR3.PCID, so just try with CR4.PGE, else
+	 * do the CR3 write.
+	 *
+	 * XXX: this gives paravirt the finger.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_PGE))
+		__native_tlb_flush_global_noinstr(this_cpu_read(cpu_tlbstate.cr4));
+	else
+		native_flush_tlb_local_noinstr();
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
 	struct flush_tlb_info *info;
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 13fc97b395030..47d5ced39a43a 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -6,11 +6,13 @@
 
 enum {
 	CONTEXT_WORK_SYNC_OFFSET,
+	CONTEXT_WORK_TLBI_OFFSET,
 	CONTEXT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
 	CONTEXT_WORK_SYNC = BIT(CONTEXT_WORK_SYNC_OFFSET),
+	CONTEXT_WORK_TLBI = BIT(CONTEXT_WORK_TLBI_OFFSET),
 	CONTEXT_WORK_MAX = BIT(CONTEXT_WORK_MAX_OFFSET)
 };
 
Kernel TLB invalidation IPIs are a common source of interference on
NOHZ_FULL CPUs. Given NOHZ_FULL CPUs executing in userspace are not
accessing any kernel addresses, these invalidations do not need to happen
immediately, and can be deferred until the next user->kernel transition.

Add a minimal, noinstr-compliant variant of __flush_tlb_all() that doesn't
try to leverage INVPCID. To achieve this:

o Add a noinstr variant of native_write_cr4(), keeping native_write_cr4()
  as "only" __no_profile.
o Make invalidate_user_asid() __always_inline

XXX: what about paravirt? Should these be instead new operations made
available through pv_ops.mmu.*? x86_64_start_kernel() uses
__native_tlb_flush_global() regardless of paravirt, so I'm thinking the
paravirt variants are more optimizations than hard requirements?

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 +++
 arch/x86/include/asm/special_insns.h         |  1 +
 arch/x86/include/asm/tlbflush.h              | 16 ++++++++++--
 arch/x86/kernel/cpu/common.c                 |  6 ++++-
 arch/x86/mm/tlb.c                            | 26 ++++++++++++++++++--
 include/linux/context_tracking_work.h        |  2 ++
 6 files changed, 50 insertions(+), 5 deletions(-)