
[RFC,v2,20/20] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

Message ID 20230720163056.2564824-21-vschneid@redhat.com (mailing list archive)
State New
Series context_tracking,x86: Defer some IPIs until a user->kernel transition

Commit Message

Valentin Schneider July 20, 2023, 4:30 p.m. UTC
vunmap()'s issued from housekeeping CPUs are a relatively common source of
interference for isolated NOHZ_FULL CPUs, as they are hit by the
flush_tlb_kernel_range() IPIs.

Given that CPUs executing in userspace do not access data in the vmalloc
range, these IPIs could be deferred until their next kernel entry.

This does require a guarantee that nothing in the vmalloc range can be
accessed in early entry code. vmalloc'd kernel stacks (VMAP_STACK) are
AFAICT a safe exception, as a task running in userspace needs to enter
kernelspace to execute do_exit() before its stack can be vfree'd.

XXX: Validation that nothing in the vmalloc range is accessed in .noinstr or
  somesuch?

Blindly deferring any and all flush of the kernel mappings is a risky move,
so introduce a variant of flush_tlb_kernel_range() that explicitly allows
deferral. Use it for vunmap flushes.

Note that while flush_tlb_kernel_range() may end up issuing a full
flush (including user mappings), this only happens when reaching a
invalidation range threshold where it is cheaper to do a full flush than to
individually invalidate each page in the range via INVLPG. IOW, it doesn't
*require* invalidating user mappings, and thus remains safe to defer until
a later kernel entry.
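
For context, the entry-side counterpart lives earlier in this series rather
than in this patch; conceptually it boils down to the sketch below, run on an
isolated CPU's next user->kernel transition before any vmalloc'd data can be
dereferenced. The handler name is assumed from the CONTEXT_WORK infrastructure
patches and is not defined by this diff:

	/* Sketch only -- handler name assumed from the earlier CONTEXT_WORK patches. */
	static void ct_work_flush(unsigned long work)
	{
		/*
		 * A CPU that had CONTEXT_WORK_TLBI set instead of receiving the
		 * IPI does its (local) kernel TLB flush here, before it can
		 * touch anything in the vmalloc range.
		 */
		if (work & CONTEXT_WORK_TLBI)
			__flush_tlb_all();
	}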

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c               | 23 ++++++++++++++++++++---
 mm/vmalloc.c                    | 19 ++++++++++++++-----
 3 files changed, 35 insertions(+), 8 deletions(-)

Comments

Nadav Amit July 21, 2023, 6:15 p.m. UTC | #1
> On Jul 20, 2023, at 9:30 AM, Valentin Schneider <vschneid@redhat.com> wrote:
> 
> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> flush_tlb_kernel_range() IPIs.
> 
> Given that CPUs executing in userspace do not access data in the vmalloc
> range, these IPIs could be deferred until their next kernel entry.

So I think there are a few assumptions here that it seems suitable to confirm,
and to acknowledge the major one in the commit log (assuming they hold).

There is an assumption that VMAP page-tables are not freed. I actually
never paid attention to that, but skimming the code it does seem so. To
clarify the issue: if page-tables were freed and their pages were reused,
there would be a problem that page-walk caches for instance would be used
and “junk” entries from the reused pages would be used. See [1].

I would also assume that memory hot-unplug of some sort is not an issue
(i.e., you cannot have a stale TLB entry pointing to memory that was
unplugged).

I also think that there might be speculative code execution using stale
TLB entries that would point to memory that has been reused and is perhaps
controllable by the user. If somehow the CPU/OS is tricked into using the
stale executable TLB entries early enough on kernel entry, that might be
an issue. I guess it is probably a theoretical issue, but it would be helpful
to confirm.

In general, deferring TLB flushes can be done safely. This patch, I think,
takes it one step forward and allows the reuse of the memory before the TLB
flush is actually done. This is more dangerous.

[1] https://lore.kernel.org/lkml/tip-b956575bed91ecfb136a8300742ecbbf451471ab@git.kernel.org/
Valentin Schneider July 24, 2023, 11:32 a.m. UTC | #2
On 21/07/23 18:15, Nadav Amit wrote:
>> On Jul 20, 2023, at 9:30 AM, Valentin Schneider <vschneid@redhat.com> wrote:
>>
>> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> flush_tlb_kernel_range() IPIs.
>>
>> Given that CPUs executing in userspace do not access data in the vmalloc
>> range, these IPIs could be deferred until their next kernel entry.
>
> So I think there are a few assumptions here that it seems suitable to confirm,
> and to acknowledge the major one in the commit log (assuming they hold).
>
> There is an assumption that VMAP page-tables are not freed. I actually
> never paid attention to that, but skimming the code it does seem so. To
> clarify the issue: if page-tables were freed and their pages were reused,
> there would be a problem that page-walk caches for instance would be used
> and “junk” entries from the reused pages would be used. See [1].
>

Thanks for looking into this and sharing context. This is an area I don't
have much experience with, so help is much appreciated!

Indeed, accessing addresses that should be impacted by a TLB flush *before*
executing the deferred flush is an issue. Deferring sync_core() for
instruction patching is a similar problem - it's all in the shape of
"access @addr impacted by @operation during kernel entry, before actually
executing @operation".

AFAICT the only reasonable way to go about the deferral is to prove that no
such access happens before the deferred @operation is done. We got to prove
that for sync_core() deferral, cf. PATCH 18.

I'd like to reason about it for deferring vunmap TLB flushes:

What addresses in VMAP range, other than the stack, can early entry code
access? Yes, the ranges can be checked at runtime, but is there any chance
of figuring this out e.g. at build-time?
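
(On the runtime side, a check against the vmalloc range itself is at least
cheap -- a minimal sketch, using the existing is_vmalloc_addr() helper; the
wrapper below is purely hypothetical:)

	#include <linux/mm.h>

	/* Hypothetical debug helper, not part of this series. */
	static inline void assert_no_vmalloc_access(const void *addr)
	{
		/* Catch early-entry dereferences that could race with a deferred flush. */
		WARN_ON_ONCE(is_vmalloc_addr(addr));
	}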

> I would also assume that memory hot-unplug of some sort is not an issue
> (i.e., you cannot have a stale TLB entry pointing to memory that was
> unplugged).
>
> I also think that there might be speculative code execution using stale
> TLB entries that would point to memory that has been reused and is perhaps
> controllable by the user. If somehow the CPU/OS is tricked into using the
> stale executable TLB entries early enough on kernel entry, that might be
> an issue. I guess it is probably a theoretical issue, but it would be helpful
> to confirm.
>
> In general, deferring TLB flushes can be done safely. This patch, I think,
> takes it one step forward and allows the reuse of the memory before the TLB
> flush is actually done. This is more dangerous.
>
> [1] https://lore.kernel.org/lkml/tip-b956575bed91ecfb136a8300742ecbbf451471ab@git.kernel.org/
Dave Hansen July 24, 2023, 5:40 p.m. UTC | #3
On 7/24/23 04:32, Valentin Schneider wrote:
> AFAICT the only reasonable way to go about the deferral is to prove that no
> such access happens before the deferred @operation is done. We got to prove
> that for sync_core() deferral, cf. PATCH 18.
> 
> I'd like to reason about it for deferring vunmap TLB flushes:
> 
> What addresses in VMAP range, other than the stack, can early entry code
> access? Yes, the ranges can be checked at runtime, but is there any chance
> of figuring this out e.g. at build-time?

Nadav was touching on a very important point: TLB flushes for addresses
are relatively easy to defer.  You just need to ensure that the CPU
deferring the flush does an actual flush before it might architecturally
consume the contents of the flushed entry.

TLB flushes for freed page tables are another game entirely.  The CPU is
free to cache any part of the paging hierarchy it wants at any time.
It's also free to set accessed and dirty bits at any time, even for
instructions that may never execute architecturally.

That basically means that if you have *ANY* freed page table page
*ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
screwed.

There's no reasoning about accesses or ordering.  As soon as the CPU
does *anything*, it's out to get you.

You're going to need to do something a lot more radical to deal with
free page table pages.
Peter Zijlstra July 25, 2023, 1:21 p.m. UTC | #4
On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:

> TLB flushes for freed page tables are another game entirely.  The CPU is
> free to cache any part of the paging hierarchy it wants at any time.
> It's also free to set accessed and dirty bits at any time, even for
> instructions that may never execute architecturally.
> 
> That basically means that if you have *ANY* freed page table page
> *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
> screwed.
> 
> There's no reasoning about accesses or ordering.  As soon as the CPU
> does *anything*, it's out to get you.
> 
> You're going to need to do something a lot more radical to deal with
> free page table pages.

Ha! IIRC the only thing we can reasonably do there is to have strict
per-cpu page-tables such that NOHZ_FULL CPUs can be isolated. That is,
as long as the per-cpu tables do not contain -- and have never contained
-- a particular table page, we can avoid flushing it. Because if it
never was there, it also couldn't have speculatively loaded it.

Now, x86 doesn't really do per-cpu page tables easily (otherwise we'd
have done them ages ago) and doing them is going to be *major* surgery
and pain.

Other than that, we must take the TLBI-IPI when freeing
page-table-pages.


But yeah, I think Nadav is right, vmalloc.c never frees page-tables (or
at least, I couldn't find it in a hurry either), but if we're going to
be doing this, then that file must include a very prominent comment
explaining it must never actually do so either.

Not being able to free page-tables might be a 'problem' if we're going
to be doing more of HUGE_VMALLOC, because that means it becomes rather
hard to swizzle from small to large pages.
Valentin Schneider July 25, 2023, 2:03 p.m. UTC | #5
Sorry, I missed Dave's email earlier, so now I'm taking my time to page (hah!)
all of this in.

On 25/07/23 15:21, Peter Zijlstra wrote:
> On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:
>
>> TLB flushes for freed page tables are another game entirely.  The CPU is
>> free to cache any part of the paging hierarchy it wants at any time.
>> It's also free to set accessed and dirty bits at any time, even for
>> instructions that may never execute architecturally.
>>
>> That basically means that if you have *ANY* freed page table page
>> *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
>> screwed.
>>
>> There's no reasoning about accesses or ordering.  As soon as the CPU
>> does *anything*, it's out to get you.
>>

OK, I feel like I need to go back and do some more reading now, but I think I
get the difference. Thanks for spelling it out.

>> You're going to need to do something a lot more radical to deal with
>> free page table pages.
>
> Ha! IIRC the only thing we can reasonably do there is to have strict
> per-cpu page-tables such that NOHZ_FULL CPUs can be isolated. That is,
> as long as the per-cpu tables do not contain -- and have never contained
> -- a particular table page, we can avoid flushing it. Because if it
> never was there, it also couldn't have speculatively loaded it.
>
> Now, x86 doesn't really do per-cpu page tables easily (otherwise we'd
> have done them ages ago) and doing them is going to be *major* surgery
> and pain.
>
> Other than that, we must take the TLBI-IPI when freeing
> page-table-pages.
>
>
> But yeah, I think Nadav is right, vmalloc.c never frees page-tables (or
> at least, I couldn't find it in a hurry either), but if we're going to
> be doing this, then that file must include a very prominent comment
> explaining it must never actually do so either.
>

I also couldn't find any freeing of the page-table-pages; I'll do another
pass and sharpen my quill for a big fat comment.
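
Roughly along these lines, I think (wording mine, to be refined):

	/*
	 * mm/vmalloc.c relies on never freeing kernel page-table pages:
	 * deferring the TLB flush for vunmap()'d ranges on NOHZ_FULL CPUs is
	 * only safe for stale *leaf* entries. Paging-structure caches for a
	 * freed page-table page could be filled by a remote CPU at any time,
	 * so if this file ever starts freeing page tables, the deferral has
	 * to go.
	 */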

> Not being able to free page-tables might be a 'problem' if we're going
> to be doing more of HUGE_VMALLOC, because that means it becomes rather
> hard to swizzle from small to large pages.
Marcelo Tosatti July 25, 2023, 4:37 p.m. UTC | #6
On Mon, Jul 24, 2023 at 10:40:04AM -0700, Dave Hansen wrote:
> On 7/24/23 04:32, Valentin Schneider wrote:
> > AFAICT the only reasonable way to go about the deferral is to prove that no
> > such access happens before the deferred @operation is done. We got to prove
> > that for sync_core() deferral, cf. PATCH 18.
> > 
> > I'd like to reason about it for deferring vunmap TLB flushes:
> > 
> > What addresses in VMAP range, other than the stack, can early entry code
> > access? Yes, the ranges can be checked at runtime, but is there any chance
> > of figuring this out e.g. at build-time?
> 
> Nadav was touching on a very important point: TLB flushes for addresses
> are relatively easy to defer.  You just need to ensure that the CPU
> deferring the flush does an actual flush before it might architecturally
> consume the contents of the flushed entry.
> 
> TLB flushes for freed page tables are another game entirely.  The CPU is
> free to cache any part of the paging hierarchy it wants at any time.

Could this depend on CONFIG_PAGE_TABLE_ISOLATION=y, which flushes the TLB (and page
table caches) on user->kernel and kernel->user context switches?

So freeing a kernel pagetable page does not require interrupting a CPU 
which is in userspace (therefore does not have visibility into kernel
pagetables).

> It's also free to set accessed and dirty bits at any time, even for
> instructions that may never execute architecturally.
> 
> That basically means that if you have *ANY* freed page table page
> *ANYWHERE* in the page table hierarchy of any CPU at any time ... you're
> screwed.
> 
> There's no reasoning about accesses or ordering.  As soon as the CPU
> does *anything*, it's out to get you.
> 
> You're going to need to do something a lot more radical to deal with
> free page table pages.
Dave Hansen July 25, 2023, 5:12 p.m. UTC | #7
On 7/25/23 09:37, Marcelo Tosatti wrote:
>> TLB flushes for freed page tables are another game entirely.  The CPU is
>> free to cache any part of the paging hierarchy it wants at any time.
> Could this depend on CONFIG_PAGE_TABLE_ISOLATION=y, which flushes the TLB (and page
> table caches) on user->kernel and kernel->user context switches?

Well, first of all, CONFIG_PAGE_TABLE_ISOLATION doesn't flush the TLB at
all on user<->kernel switches when PCIDs are enabled.

Second, even if it did, the CPU is still free to cache any portion of
the paging hierarchy at any time.  Without LASS[1], userspace can even
_compel_ walks of the kernel portion of the address space, and we don't
have any infrastructure to tell if a freed kernel page is exposed in the
user copy of the page tables with PTI.

Third, (also ignoring PCIDs) there are plenty of instructions between
kernel entry and the MOV-to-CR3 that can flush the TLB.  All those
instructions are architecturally permitted to speculatively set Accessed or
Dirty bits in any part of the address space.  If they run into a free
page table page, things get ugly.

These accesses are not _likely_.  There probably isn't a predictor out
there that's going to see a:

	movq    %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)

and go off trying to dirty memory in the vmalloc() area.  But we'd need
some backward *and* forward-looking guarantees from our intrepid CPU
designers to promise that this kind of thing is safe yesterday, today
and tomorrow.  I suspect such a guarantee is going to be hard to obtain.

1. https://lkml.kernel.org/r/20230110055204.3227669-1-yian.chen@intel.com

Patch

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 323b971987af7..0b9b1f040c476 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -248,6 +248,7 @@  extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 631df9189ded4..bb18b35e61b4a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -10,6 +10,7 @@ 
 #include <linux/debugfs.h>
 #include <linux/sched/smt.h>
 #include <linux/task_work.h>
+#include <linux/context_tracking.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1045,6 +1046,11 @@  static void do_flush_tlb_all(void *info)
 	__flush_tlb_all();
 }
 
+static bool do_kernel_flush_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CONTEXT_WORK_TLBI);
+}
+
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1061,12 +1067,13 @@  static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end)
 {
 	/* Balance as user space task's flush, a bit conservative */
 	if (end == TLB_FLUSH_ALL ||
 	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info *info;
 
@@ -1074,13 +1081,23 @@  void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		info = get_flush_tlb_info(NULL, start, end, 0, false,
 					  TLB_GENERATION_INVALID);
 
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);
 
 		put_flush_tlb_info();
 		preempt_enable();
 	}
 }
 
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end);
+}
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 93cf99aba335b..e08b6c7d22fb6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -439,6 +439,15 @@  void vunmap_range_noflush(unsigned long start, unsigned long end)
 	__vunmap_range_noflush(start, end);
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#else
+#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -452,7 +461,7 @@  void vunmap_range(unsigned long addr, unsigned long end)
 {
 	flush_cache_vunmap(addr, end);
 	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -1746,7 +1755,7 @@  static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 		list_last_entry(&local_purge_list,
 			struct vmap_area, list)->va_end);
 
-	flush_tlb_kernel_range(start, end);
+	flush_tlb_kernel_range_deferrable(start, end);
 	resched_threshold = lazy_max_pages() << 1;
 
 	spin_lock(&free_vmap_area_lock);
@@ -1849,7 +1858,7 @@  static void free_unmap_vmap_area(struct vmap_area *va)
 	flush_cache_vunmap(va->va_start, va->va_end);
 	vunmap_range_noflush(va->va_start, va->va_end);
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
 
 	free_vmap_area_noflush(va);
 }
@@ -2239,7 +2248,7 @@  static void vb_free(unsigned long addr, unsigned long size)
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);
 
 	spin_lock(&vb->lock);
 
@@ -2304,7 +2313,7 @@  static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	free_purged_blocks(&purge_list);
 
 	if (!__purge_vmap_area_lazy(start, end) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }