[v4,00/30] context_tracking,x86: Defer some IPIs until a user->kernel transition

Message ID 20250114175143.81438-1-vschneid@redhat.com

Valentin Schneider Jan. 14, 2025, 5:51 p.m. UTC
Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

  64359.052209596    NetworkManager       0    1405     smp_call_function_many_cond (cpu=0, func=do_kernel_range_flush)
    smp_call_function_many_cond+0x1
    smp_call_function+0x39
    on_each_cpu+0x2a
    flush_tlb_kernel_range+0x7b
    __purge_vmap_area_lazy+0x70
    _vm_unmap_aliases.part.42+0xdf
    change_page_attr_set_clr+0x16a
    set_memory_ro+0x26
    bpf_int_jit_compile+0x2f9
    bpf_prog_select_runtime+0xc6
    bpf_prepare_filter+0x523
    sk_attach_filter+0x13
    sock_setsockopt+0x92c
    __sys_setsockopt+0x16a
    __x64_sys_setsockopt+0x20
    do_syscall_64+0x87
    entry_SYSCALL_64_after_hwframe+0x65

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.

The original implementation is from Peter [1]. Nicolas then added kernel TLB
invalidation deferral to that [2], and I picked it up from there.

Deferral approach
=================

Storing each and every callback, as in a secondary call_single_queue, turned
out to be a no-go: the whole point of deferral is to keep NOHZ_FULL CPUs in
userspace for as long as possible - no signal of any form would be sent when
deferring an IPI. This means that any form of queuing for deferred callbacks
would end up as a convoluted memory leak.

Deferred IPIs must thus be coalesced, which this series achieves by assigning
IPIs a "type" and having a mapping of IPI type to callback, leveraged upon
kernel entry.

What about IPIs whose callbacks take a parameter, you may ask?

Peter suggested during OSPM23 [3] that since on_each_cpu() targets
housekeeping CPUs *and* isolated CPUs, isolated CPUs can access either global or
housekeeping-CPU-local state to "reconstruct" the data that would have been sent
via the IPI.

This series does not affect any IPI callback that requires an argument, but the
approach would remain the same (one coalescable callback executed on kernel
entry).

Kernel entry vs execution of the deferred operation
===================================================

This is what I've referred to as the "Danger Zone" during my LPC24 talk [4].

There is a non-zero amount of code that is executed upon kernel entry before
the deferred operation itself can be executed (i.e. before we get into
context_tracking.c proper):

  idtentry_func_foo()                <--- we're in the kernel
    irqentry_enter()
      enter_from_user_mode()
	__ct_user_exit()
	    ct_kernel_enter_state()
	      ct_work_flush()        <--- deferred operation is executed here

This means one must take extra care about what can happen in the early entry
code, and ensure that <bad things> cannot happen. For instance, we really don't
want to hit
instructions that have been modified by a remote text_poke() while we're on our
way to execute a deferred sync_core(). Patches doing the actual deferral have
more detail on this.

Patches
=======

o Patches 1-2 are standalone objtool cleanups.

o Patches 3-4 add an RCU testing feature.

o Patches 5-6 add infrastructure for annotating static keys and static calls
  that may be used in noinstr code (courtesy of Josh).
o Patches 7-19 use said annotations on relevant keys / calls.
o Patch 20 enforces proper usage of said annotations (courtesy of Josh).

o Patches 21-23 fiddle with CT_STATE* within context tracking

o Patches 24-29 add the actual IPI deferral faff

o Patch 30 adds a freebie: deferring IPIs for NOHZ_IDLE. Not tested that much!
  If you care about battery-powered devices and energy consumption, go give it
  a try!

Patches are also available at:
https://gitlab.com/vschneid/linux.git -b redhat/isolirq/defer/v4

Stuff I'd like eyes and neurons on
==================================

Context-tracking vs idle. Patch 22 "improves" the situation by adding an
IDLE->KERNEL transition when getting an IRQ while idle, but it leaves the
following window:

  ~> IRQ
  ct_nmi_enter()
    state = state + CT_STATE_KERNEL - CT_STATE_IDLE

  [...]

  ct_nmi_exit()
    state = state - CT_STATE_KERNEL + CT_STATE_IDLE

  [...] /!\ CT_STATE_IDLE here while we're really in kernelspace! /!\

  ct_cpuidle_exit()
    state = state + CT_STATE_KERNEL - CT_STATE_IDLE

Said window is contained within cpu_idle_poll() and the cpuidle call within
cpuidle_enter_state(), both being noinstr (the former is __cpuidle which is
noinstr itself). Thus objtool will consider it as early entry and will warn
accordingly of any static key / call misuse, so the damage is somewhat
contained, but it's not ideal.

I tried fiddling with this but idle polling likes being annoying, as it is
shaped like so:

	ct_cpuidle_enter();

	raw_local_irq_enable();
	while (!tif_need_resched() &&
	       (cpu_idle_force_poll || tick_check_broadcast_expired()))
		cpu_relax();
	raw_local_irq_disable();

	ct_cpuidle_exit();

IOW, getting an IRQ that doesn't end up setting NEED_RESCHED while idle-polling
doesn't come near ct_cpuidle_exit(), which prevents me from having the outermost
ct_nmi_exit() leave the state as CT_STATE_KERNEL (rather than CT_STATE_IDLE).

Testing
=======

Xeon E5-2699 system with SMToff, NOHZ_FULL, isolated CPUs.
RHEL9 userspace.

Workload is using rteval (kernel compilation + hackbench) on housekeeping CPUs
and a dummy stay-in-userspace loop on the isolated CPUs. The main invocation is:

$ trace-cmd record -e "csd_queue_cpu" -f "cpu & CPUS{$ISOL_CPUS}" \
 	           -e "ipi_send_cpumask" -f "cpumask & CPUS{$ISOL_CPUS}" \
	           -e "ipi_send_cpu"     -f "cpu & CPUS{$ISOL_CPUS}" \
		   rteval --onlyload --loads-cpulist=$HK_CPUS \
		   --hackbench-runlowmem=True --duration=$DURATION

This only records IPIs sent to isolated CPUs, so any event there is interference
(with a bit of fuzz at the start/end of the workload when spawning the
processes). All tests were done with a duration of 6 hours.

v6.13-rc6
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
    531 callback=generic_smp_call_function_single_interrupt+0x0

# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
  12818 func=do_flush_tlb_all
    910 func=do_kernel_range_flush
     78 func=do_sync_core

v6.13-rc6 + patches:
# This is the actual IPI count
$ trace-cmd report | grep callback | awk '{ print $(NF) }' | sort | uniq -c | sort -nr
# Zilch!

# These are the different CSDs that caused IPIs
$ trace-cmd report | grep csd_queue | awk '{ print $(NF-1) }' | sort | uniq -c | sort -nr
# Nada!

Note that tlb_remove_table_smp_sync() showed up during testing of v3, and has
gone as mysteriously as it showed up. Yair had a series addressing this [5] which
would be worth revisiting.

Acknowledgements
================

Special thanks to:
o Clark Williams for listening to my ramblings about this and throwing ideas my way
o Josh Poimboeuf for all his help with everything objtool-related
o All of the folks who attended various (too many?) talks about this and
  provided precious feedback.  

Links
=====

[1]: https://lore.kernel.org/all/20210929151723.162004989@infradead.org/
[2]: https://github.com/vianpl/linux.git -b ct-work-defer-wip
[3]: https://youtu.be/0vjE6fjoVVE
[4]: https://lpc.events/event/18/contributions/1889/
[5]: https://lore.kernel.org/lkml/20230620144618.125703-1-ypodemsk@redhat.com/

Revisions
=========

RFCv3 -> v4
++++++++++++++

o Rebased onto v6.13-rc6

o New objtool patches from Josh
o More .noinstr static key/call patches
o Static calls now handled as well (again thanks to Josh)

o Fixed clearing the work bits on kernel exit
o Messed with IRQ hitting an idle CPU vs context tracking
o Various comment and naming cleanups

o Made RCU_DYNTICKS_TORTURE depend on !COMPILE_TEST (PeterZ)
o Fixed the CT_STATE_KERNEL check when setting a deferred work (Frederic)
o Cleaned up the __flush_tlb_all() mess thanks to PeterZ

RFCv2 -> RFCv3
++++++++++++++

o Rebased onto v6.12-rc6

o Added objtool documentation for the new warning (Josh)
o Added low-size RCU watching counter to TREE04 torture scenario (Paul)
o Added FORCEFUL jump label and static key types
o Added noinstr-compliant helpers for tlb flush deferral


RFCv1 -> RFCv2
++++++++++++++

o Rebased onto v6.5-rc1

o Updated the trace filter patches (Steven)

o Fixed __ro_after_init keys used in modules (Peter)
o Dropped the extra context_tracking atomic, squashed the new bits in the
  existing .state field (Peter, Frederic)
  
o Added an RCU_EXPERT config for the RCU dynticks counter size, and added an
  rcutorture case for a low-size counter (Paul) 

o Fixed flush_tlb_kernel_range_deferrable() definition

Josh Poimboeuf (3):
  jump_label: Add annotations for validating noinstr usage
  static_call: Add read-only-after-init static calls
  objtool: Add noinstr validation for static branches/calls

Peter Zijlstra (1):
  x86,tlb: Make __flush_tlb_global() noinstr-compliant

Valentin Schneider (26):
  objtool: Make validate_call() recognize indirect calls to pv_ops[]
  objtool: Flesh out warning related to pv_ops[] calls
  rcu: Add a small-width RCU watching counter debug option
  rcutorture: Make TREE04 use CONFIG_RCU_DYNTICKS_TORTURE
  x86/paravirt: Mark pv_sched_clock static call as __ro_after_init
  x86/idle: Mark x86_idle static call as __ro_after_init
  x86/paravirt: Mark pv_steal_clock static call as __ro_after_init
  riscv/paravirt: Mark pv_steal_clock static call as __ro_after_init
  loongarch/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm64/paravirt: Mark pv_steal_clock static call as __ro_after_init
  arm/paravirt: Mark pv_steal_clock static call as __ro_after_init
  perf/x86/amd: Mark perf_lopwr_cb static call as __ro_after_init
  sched/clock: Mark sched_clock_running key as __ro_after_init
  x86/speculation/mds: Mark mds_idle_clear key as allowed in .noinstr
  sched/clock, x86: Mark __sched_clock_stable key as allowed in .noinstr
  x86/kvm/vmx: Mark vmx_l1d_should_flush and vmx_l1d_flush_cond keys as
    allowed in .noinstr
  stackleack: Mark stack_erasing_bypass key as allowed in .noinstr
  context_tracking: Explicitely use CT_STATE_KERNEL where it is missing
  context_tracking: Exit CT_STATE_IDLE upon irq/nmi entry
  context_tracking: Turn CT_STATE_* into bits
  context-tracking: Introduce work deferral infrastructure
  context_tracking,x86: Defer kernel text patching IPIs
  x86/tlb: Make __flush_tlb_local() noinstr-compliant
  x86/tlb: Make __flush_tlb_all() noinstr
  x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL
    CPUs
  context-tracking: Add a Kconfig to enable IPI deferral for NO_HZ_IDLE

 arch/Kconfig                                  |   9 ++
 arch/arm/kernel/paravirt.c                    |   2 +-
 arch/arm64/kernel/paravirt.c                  |   2 +-
 arch/loongarch/kernel/paravirt.c              |   2 +-
 arch/riscv/kernel/paravirt.c                  |   2 +-
 arch/x86/Kconfig                              |   1 +
 arch/x86/events/amd/brs.c                     |   2 +-
 arch/x86/include/asm/context_tracking_work.h  |  22 ++++
 arch/x86/include/asm/invpcid.h                |  13 +--
 arch/x86/include/asm/paravirt.h               |   4 +-
 arch/x86/include/asm/text-patching.h          |   1 +
 arch/x86/include/asm/tlbflush.h               |   3 +-
 arch/x86/include/asm/xen/hypercall.h          |  11 +-
 arch/x86/kernel/alternative.c                 |  38 ++++++-
 arch/x86/kernel/cpu/bugs.c                    |   9 +-
 arch/x86/kernel/kprobes/core.c                |   4 +-
 arch/x86/kernel/kprobes/opt.c                 |   4 +-
 arch/x86/kernel/module.c                      |   2 +-
 arch/x86/kernel/paravirt.c                    |   4 +-
 arch/x86/kernel/process.c                     |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  11 +-
 arch/x86/mm/tlb.c                             |  46 ++++++--
 arch/x86/xen/mmu_pv.c                         |  10 +-
 arch/x86/xen/xen-ops.h                        |  12 +-
 include/asm-generic/sections.h                |  15 +++
 include/linux/context_tracking.h              |  21 ++++
 include/linux/context_tracking_state.h        |  64 +++++++++--
 include/linux/context_tracking_work.h         |  28 +++++
 include/linux/jump_label.h                    |  30 ++++-
 include/linux/objtool.h                       |   7 ++
 include/linux/static_call.h                   |  19 ++++
 kernel/context_tracking.c                     |  98 ++++++++++++++--
 kernel/rcu/Kconfig.debug                      |  15 +++
 kernel/sched/clock.c                          |   7 +-
 kernel/stackleak.c                            |   6 +-
 kernel/time/Kconfig                           |  19 ++++
 mm/vmalloc.c                                  |  35 +++++-
 tools/objtool/Documentation/objtool.txt       |  34 ++++++
 tools/objtool/check.c                         | 106 +++++++++++++++---
 tools/objtool/include/objtool/check.h         |   1 +
 tools/objtool/include/objtool/elf.h           |   1 +
 tools/objtool/include/objtool/special.h       |   1 +
 tools/objtool/special.c                       |  18 ++-
 .../selftests/rcutorture/configs/rcu/TREE04   |   1 +
 44 files changed, 635 insertions(+), 107 deletions(-)
 create mode 100644 arch/x86/include/asm/context_tracking_work.h
 create mode 100644 include/linux/context_tracking_work.h

--
2.43.0

Comments

Jann Horn Jan. 14, 2025, 6:16 p.m. UTC | #1
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> flush_tlb_kernel_range() IPIs.
>
> Given that CPUs executing in userspace do not access data in the vmalloc
> range, these IPIs could be deferred until their next kernel entry.
>
> Deferral vs early entry danger zone
> ===================================
>
> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> and then accessed in early entry code.

In other words, it needs a guarantee that no vmalloc allocations that
have been created in the vmalloc region while the CPU was idle can
then be accessed during early entry, right?
Valentin Schneider Jan. 17, 2025, 3:25 p.m. UTC | #2
On 14/01/25 19:16, Jann Horn wrote:
> On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> flush_tlb_kernel_range() IPIs.
>>
>> Given that CPUs executing in userspace do not access data in the vmalloc
>> range, these IPIs could be deferred until their next kernel entry.
>>
>> Deferral vs early entry danger zone
>> ===================================
>>
>> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> and then accessed in early entry code.
>
> In other words, it needs a guarantee that no vmalloc allocations that
> have been created in the vmalloc region while the CPU was idle can
> then be accessed during early entry, right?

I'm not sure if that would be a problem (not an mm expert, please do
correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
deferred anyway.

So after vmapping something, I wouldn't expect isolated CPUs to have
invalid TLB entries for the newly vmapped page.

However, upon vunmap'ing something, the TLB flush is deferred, and thus
stale TLB entries can and will remain on isolated CPUs, up until they
execute the deferred flush themselves (IOW for the entire duration of the
"danger zone").

Does that make sense?
Jann Horn Jan. 17, 2025, 3:52 p.m. UTC | #3
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
> On 14/01/25 19:16, Jann Horn wrote:
> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> flush_tlb_kernel_range() IPIs.
> >>
> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> range, these IPIs could be deferred until their next kernel entry.
> >>
> >> Deferral vs early entry danger zone
> >> ===================================
> >>
> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> and then accessed in early entry code.
> >
> > In other words, it needs a guarantee that no vmalloc allocations that
> > have been created in the vmalloc region while the CPU was idle can
> > then be accessed during early entry, right?
>
> I'm not sure if that would be a problem (not an mm expert, please do
> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> deferred anyway.

flush_cache_vmap() is about stuff like flushing data caches on
architectures with virtually indexed caches; that doesn't do TLB
maintenance. When you look for its definition on x86 or arm64, you'll
see that they use the generic implementation which is simply an empty
inline function.

> So after vmapping something, I wouldn't expect isolated CPUs to have
> invalid TLB entries for the newly vmapped page.
>
> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> stale TLB entries can and will remain on isolated CPUs, up until they
> execute the deferred flush themselves (IOW for the entire duration of the
> "danger zone").
>
> Does that make sense?

The design idea wrt TLB flushes in the vmap code is that you don't do
TLB flushes when you unmap stuff or when you map stuff, because doing
TLB flushes across the entire system on every vmap/vunmap would be a
bit costly; instead you just do batched TLB flushes in between, in
__purge_vmap_area_lazy().

In other words, the basic idea is that you can keep calling vmap() and
vunmap() a bunch of times without ever doing TLB flushes until you run
out of virtual memory in the vmap region; then you do one big TLB
flush, and afterwards you can reuse the free virtual address space for
new allocations again.

So if you "defer" that batched TLB flush for CPUs that are not
currently running in the kernel, I think the consequence is that those
CPUs may end up with incoherent TLB state after a reallocation of the
virtual address space.

Actually, I think this would mean that your optimization is disallowed
at least on arm64 - I'm not sure about the exact wording, but arm64
has a "break before make" rule that forbids conflicting writable
address translations or something like that.

(I said "until you run out of virtual memory in the vmap region", but
that's not actually true - see the comment above lazy_max_pages() for
an explanation of the actual heuristic. You might be able to tune that
a bit if you'd be significantly happier with less frequent
interruptions, or something along those lines.)
Uladzislau Rezki Jan. 17, 2025, 4:11 p.m. UTC | #4
On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
> On 14/01/25 19:16, Jann Horn wrote:
> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> flush_tlb_kernel_range() IPIs.
> >>
> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> range, these IPIs could be deferred until their next kernel entry.
> >>
> >> Deferral vs early entry danger zone
> >> ===================================
> >>
> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> and then accessed in early entry code.
> >
> > In other words, it needs a guarantee that no vmalloc allocations that
> > have been created in the vmalloc region while the CPU was idle can
> > then be accessed during early entry, right?
> 
> I'm not sure if that would be a problem (not an mm expert, please do
> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> deferred anyway.
> 
> So after vmapping something, I wouldn't expect isolated CPUs to have
> invalid TLB entries for the newly vmapped page.
> 
> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> stale TLB entries can and will remain on isolated CPUs, up until they
> execute the deferred flush themselves (IOW for the entire duration of the
> "danger zone").
> 
> Does that make sense?
> 
Probably I am missing something and need to have a look at your patches,
but how do you guarantee that no one maps the same area that you defer for
TLB flushing?

As noted by Jann, we already defer TLB flushing by accumulating freed areas
until a certain threshold, and just after we cross it we do a flush.

--
Uladzislau Rezki
Valentin Schneider Jan. 17, 2025, 4:53 p.m. UTC | #5
On 17/01/25 16:52, Jann Horn wrote:
> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>
> flush_cache_vmap() is about stuff like flushing data caches on
> architectures with virtually indexed caches; that doesn't do TLB
> maintenance. When you look for its definition on x86 or arm64, you'll
> see that they use the generic implementation which is simply an empty
> inline function.
>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>
> The design idea wrt TLB flushes in the vmap code is that you don't do
> TLB flushes when you unmap stuff or when you map stuff, because doing
> TLB flushes across the entire system on every vmap/vunmap would be a
> bit costly; instead you just do batched TLB flushes in between, in
> __purge_vmap_area_lazy().
>
> In other words, the basic idea is that you can keep calling vmap() and
> vunmap() a bunch of times without ever doing TLB flushes until you run
> out of virtual memory in the vmap region; then you do one big TLB
> flush, and afterwards you can reuse the free virtual address space for
> new allocations again.
>
> So if you "defer" that batched TLB flush for CPUs that are not
> currently running in the kernel, I think the consequence is that those
> CPUs may end up with incoherent TLB state after a reallocation of the
> virtual address space.
>

Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
CPU accesses it during early entry.

> Actually, I think this would mean that your optimization is disallowed
> at least on arm64 - I'm not sure about the exact wording, but arm64
> has a "break before make" rule that forbids conflicting writable
> address translations or something like that.
>

On the bright side of things, arm64 is not as bad as x86 when it comes to
IPI'ing isolated CPUs :-) I'll add that to my notes, thanks!

> (I said "until you run out of virtual memory in the vmap region", but
> that's not actually true - see the comment above lazy_max_pages() for
> an explanation of the actual heuristic. You might be able to tune that
> a bit if you'd be significantly happier with less frequent
> interruptions, or something along those lines.)
Valentin Schneider Jan. 17, 2025, 5 p.m. UTC | #6
On 17/01/25 17:11, Uladzislau Rezki wrote:
> On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>>
> Probably I am missing something and need to have a look at your patches,
> but how do you guarantee that no one maps the same area that you defer for
> TLB flushing?
>

That's the cool part: I don't :')

For deferring instruction patching IPIs, I (well Josh really) managed to
get instrumentation to back me up and catch any problematic area.

I looked into getting something similar for vmalloc region access in
.noinstr code, but I didn't get anywhere. I even tried using emulated
watchpoints on QEMU to watch the whole vmalloc range, but that went about
as well as you could expect.

That left me with staring at code. AFAICT the only vmap'd thing that is
accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
itself cannot be freed until the task exits - thus can't be subject to
invalidation when a task is entering kernelspace.

If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).

> As noted by Jann, we already defer TLB flushing by accumulating freed areas
> until a certain threshold, and just after we cross it we do a flush.
>
> --
> Uladzislau Rezki
Uladzislau Rezki Jan. 20, 2025, 11:15 a.m. UTC | #7
On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
> On 17/01/25 17:11, Uladzislau Rezki wrote:
> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
> >> On 14/01/25 19:16, Jann Horn wrote:
> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> >> flush_tlb_kernel_range() IPIs.
> >> >>
> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> >> range, these IPIs could be deferred until their next kernel entry.
> >> >>
> >> >> Deferral vs early entry danger zone
> >> >> ===================================
> >> >>
> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> >> and then accessed in early entry code.
> >> >
> >> > In other words, it needs a guarantee that no vmalloc allocations that
> >> > have been created in the vmalloc region while the CPU was idle can
> >> > then be accessed during early entry, right?
> >>
> >> I'm not sure if that would be a problem (not an mm expert, please do
> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> >> deferred anyway.
> >>
> >> So after vmapping something, I wouldn't expect isolated CPUs to have
> >> invalid TLB entries for the newly vmapped page.
> >>
> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> >> stale TLB entries can and will remain on isolated CPUs, up until they
> >> execute the deferred flush themselves (IOW for the entire duration of the
> >> "danger zone").
> >>
> >> Does that make sense?
> >>
> > Probably I am missing something and need to have a look at your patches,
> > but how do you guarantee that no one maps the same area that you defer for
> > TLB flushing?
> >
> 
> That's the cool part: I don't :')
> 
Indeed, sounds unsafe :) Then we just do not need to free areas.

> For deferring instruction patching IPIs, I (well Josh really) managed to
> get instrumentation to back me up and catch any problematic area.
> 
> I looked into getting something similar for vmalloc region access in
> .noinstr code, but I didn't get anywhere. I even tried using emulated
> watchpoints on QEMU to watch the whole vmalloc range, but that went about
> as well as you could expect.
> 
> That left me with staring at code. AFAICT the only vmap'd thing that is
> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
> itself cannot be freed until the task exits - thus can't be subject to
> invalidation when a task is entering kernelspace.
> 
> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
> 
As noted before, we defer flushing for vmalloc. We have a lazy-threshold
which can be exposed(if you need it) over sysfs for tuning. So, we can add it.

--
Uladzislau Rezki
Valentin Schneider Jan. 20, 2025, 4:09 p.m. UTC | #8
On 20/01/25 12:15, Uladzislau Rezki wrote:
> On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
>> On 17/01/25 17:11, Uladzislau Rezki wrote:
>> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
>> >> On 14/01/25 19:16, Jann Horn wrote:
>> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> >> flush_tlb_kernel_range() IPIs.
>> >> >>
>> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> >> range, these IPIs could be deferred until their next kernel entry.
>> >> >>
>> >> >> Deferral vs early entry danger zone
>> >> >> ===================================
>> >> >>
>> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> >> and then accessed in early entry code.
>> >> >
>> >> > In other words, it needs a guarantee that no vmalloc allocations that
>> >> > have been created in the vmalloc region while the CPU was idle can
>> >> > then be accessed during early entry, right?
>> >>
>> >> I'm not sure if that would be a problem (not an mm expert, please do
>> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> >> deferred anyway.
>> >>
>> >> So after vmapping something, I wouldn't expect isolated CPUs to have
>> >> invalid TLB entries for the newly vmapped page.
>> >>
>> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> >> stale TLB entries can and will remain on isolated CPUs, up until they
>> >> execute the deferred flush themselves (IOW for the entire duration of the
>> >> "danger zone").
>> >>
>> >> Does that make sense?
>> >>
>> > Probably I am missing something and need to have a look at your patches,
>> > but how do you guarantee that no one maps the same area that you defer for
>> > TLB flushing?
>> >
>>
>> That's the cool part: I don't :')
>>
> Indeed, sounds unsafe :) Then we just do not need to free areas.
>
>> For deferring instruction patching IPIs, I (well Josh really) managed to
>> get instrumentation to back me up and catch any problematic area.
>>
>> I looked into getting something similar for vmalloc region access in
>> .noinstr code, but I didn't get anywhere. I even tried using emulated
>> watchpoints on QEMU to watch the whole vmalloc range, but that went about
>> as well as you could expect.
>>
>> That left me with staring at code. AFAICT the only vmap'd thing that is
>> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
>> itself cannot be freed until the task exits - thus can't be subject to
>> invalidation when a task is entering kernelspace.
>>
>> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
>>
> As noted before, we defer flushing for vmalloc. We have a lazy-threshold
> which can be exposed(if you need it) over sysfs for tuning. So, we can add it.
>

In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
single userspace application that will never enter the kernel, unless
forced to by some interference (e.g. IPI sent from a housekeeping CPU).

Increasing the lazy threshold would unfortunately only delay the
interference - housekeeping CPUs are free to run whatever, and so they will
eventually cause the lazy threshold to be hit and IPI all the CPUs,
including the isolated/NOHZ_FULL ones.

I was thinking maybe we could subdivide the vmap space into two regions
with their own thresholds, but a task may allocate/vmap stuff while on a HK
CPU and be moved to an isolated CPU afterwards, and also I still don't have
any strong guarantee about what accesses an isolated CPU can do in its
early entry code :(

> --
> Uladzislau Rezki
Uladzislau Rezki Jan. 21, 2025, 5 p.m. UTC | #9
> >
> > As noted before, we defer flushing for vmalloc. We have a lazy-threshold
> > which can be exposed(if you need it) over sysfs for tuning. So, we can add it.
> >
> 
> In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
> single userspace application that will never enter the kernel, unless
> forced to by some interference (e.g. IPI sent from a housekeeping CPU).
> 
> Increasing the lazy threshold would unfortunately only delay the
> interference - housekeeping CPUs are free to run whatever, and so they will
> eventually cause the lazy threshold to be hit and IPI all the CPUs,
> including the isolated/NOHZ_FULL ones.
> 
Do you have any testing results for your workload? I mean how much
potentially we can allocate. Again, maybe it is just enough to back
and once per-hour offload it.

Apart of that how critical IPIing CPUs affect your workloads?

> I was thinking maybe we could subdivide the vmap space into two regions
> with their own thresholds, but a task may allocate/vmap stuff while on a HK
> CPU and be moved to an isolated CPU afterwards, and also I still don't have
> any strong guarantee about what accesses an isolated CPU can do in its
> early entry code :(
> 
I agree it is not worth playing with the vmap space in terms of splitting it.

Sorry for the late answer, and thank you!

--
Uladzislau Rezki
Valentin Schneider Jan. 24, 2025, 3:22 p.m. UTC | #10
On 21/01/25 18:00, Uladzislau Rezki wrote:
>> >
>> > As noted before, we defer flushing for vmalloc. We have a lazy-threshold
>> > which can be exposed(if you need it) over sysfs for tuning. So, we can add it.
>> >
>>
>> In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
>> single userspace application that will never enter the kernel, unless
>> forced to by some interference (e.g. IPI sent from a housekeeping CPU).
>>
>> Increasing the lazy threshold would unfortunately only delay the
>> interference - housekeeping CPUs are free to run whatever, and so they will
>> eventually cause the lazy threshold to be hit and IPI all the CPUs,
>> including the isolated/NOHZ_FULL ones.
>>
> Do you have any testing results for your workload? I mean how much
> potentially we can allocate. Again, maybe it is just enough to back
> and once per-hour offload it.
>

Potentially as much as you want... In our Openshift environments, you can
get any sort of container executing on the housekeeping CPUs and they're
free to do pretty much whatever they want. Per CPU isolation they're not
allowed/meant to disturb isolated CPUs, however.

> Apart of that how critical IPIing CPUs affect your workloads?
>

If I'm being pedantic, a single IPI to an isolated CPU breaks the
isolation. If we can't quiesce IPIs to isolated CPUs, then we can't
guarantee that whatever is running on the isolated CPUs is actually
isolated / shielded from third party interference.
Uladzislau Rezki Jan. 27, 2025, 10:36 a.m. UTC | #11
On Fri, Jan 24, 2025 at 04:22:19PM +0100, Valentin Schneider wrote:
> On 21/01/25 18:00, Uladzislau Rezki wrote:
> >> >
> >> > As noted before, we defer flushing for vmalloc. We have a lazy-threshold
> >> > which can be exposed(if you need it) over sysfs for tuning. So, we can add it.
> >> >
> >>
> >> In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
> >> single userspace application that will never enter the kernel, unless
> >> forced to by some interference (e.g. IPI sent from a housekeeping CPU).
> >>
> >> Increasing the lazy threshold would unfortunately only delay the
> >> interference - housekeeping CPUs are free to run whatever, and so they will
> >> eventually cause the lazy threshold to be hit and IPI all the CPUs,
> >> including the isolated/NOHZ_FULL ones.
> >>
> > Do you have any testing results for your workload? I mean how much
> > potentially we can allocate. Again, maybe it is just enough to back
> > and once per-hour offload it.
> >
> 
> Potentially as much as you want... In our Openshift environments, you can
> get any sort of container executing on the housekeeping CPUs and they're
> free to do pretty much whatever they want. Per CPU isolation they're not
> allowed/meant to disturb isolated CPUs, however.
> 
> > Apart of that how critical IPIing CPUs affect your workloads?
> >
> 
> If I'm being pedantic, a single IPI to an isolated CPU breaks the
> isolation. If we can't quiesce IPIs to isolated CPUs, then we can't
> guarantee that whatever is running on the isolated CPUs is actually
> isolated / shielded from third party interference.
> 
I see. I thought you were fixing some issue. I do not see a straightforward
way to remove such "distortion". Probably we can block the range which we
defer for flushing, but that can also be problematic because of other
constraints.

Thanks!

--
Uladzislau Rezki
Will Deacon Jan. 27, 2025, 3:51 p.m. UTC | #12
On Fri, Jan 17, 2025 at 04:52:19PM +0100, Jann Horn wrote:
> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
> > On 14/01/25 19:16, Jann Horn wrote:
> > > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> > >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> > >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> > >> flush_tlb_kernel_range() IPIs.
> > >>
> > >> Given that CPUs executing in userspace do not access data in the vmalloc
> > >> range, these IPIs could be deferred until their next kernel entry.
> > >>
> > >> Deferral vs early entry danger zone
> > >> ===================================
> > >>
> > >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> > >> and then accessed in early entry code.
> > >
> > > In other words, it needs a guarantee that no vmalloc allocations that
> > > have been created in the vmalloc region while the CPU was idle can
> > > then be accessed during early entry, right?
> >
> > I'm not sure if that would be a problem (not an mm expert, please do
> > correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> > deferred anyway.
> 
> flush_cache_vmap() is about stuff like flushing data caches on
> architectures with virtually indexed caches; that doesn't do TLB
> maintenance. When you look for its definition on x86 or arm64, you'll
> see that they use the generic implementation which is simply an empty
> inline function.
> 
> > So after vmapping something, I wouldn't expect isolated CPUs to have
> > invalid TLB entries for the newly vmapped page.
> >
> > However, upon vunmap'ing something, the TLB flush is deferred, and thus
> > stale TLB entries can and will remain on isolated CPUs, up until they
> > execute the deferred flush themselves (IOW for the entire duration of the
> > "danger zone").
> >
> > Does that make sense?
> 
> The design idea wrt TLB flushes in the vmap code is that you don't do
> TLB flushes when you unmap stuff or when you map stuff, because doing
> TLB flushes across the entire system on every vmap/vunmap would be a
> bit costly; instead you just do batched TLB flushes in between, in
> __purge_vmap_area_lazy().
> 
> In other words, the basic idea is that you can keep calling vmap() and
> vunmap() a bunch of times without ever doing TLB flushes until you run
> out of virtual memory in the vmap region; then you do one big TLB
> flush, and afterwards you can reuse the free virtual address space for
> new allocations again.
> 
> So if you "defer" that batched TLB flush for CPUs that are not
> currently running in the kernel, I think the consequence is that those
> CPUs may end up with incoherent TLB state after a reallocation of the
> virtual address space.
> 
> Actually, I think this would mean that your optimization is disallowed
> at least on arm64 - I'm not sure about the exact wording, but arm64
> has a "break before make" rule that forbids conflicting writable
> address translations or something like that.

Yes, that would definitely be a problem. There's also the more obvious
issue that the CnP ("Common not Private") feature of some Arm CPUs means
that TLB entries can be shared between cores, so the whole idea of using
a CPU's exception level to predicate invalidation is flawed on such a
system.

Will
Valentin Schneider Feb. 10, 2025, 6:36 p.m. UTC | #13
On 17/01/25 16:52, Jann Horn wrote:
> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>
> flush_cache_vmap() is about stuff like flushing data caches on
> architectures with virtually indexed caches; that doesn't do TLB
> maintenance. When you look for its definition on x86 or arm64, you'll
> see that they use the generic implementation which is simply an empty
> inline function.
>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>
> The design idea wrt TLB flushes in the vmap code is that you don't do
> TLB flushes when you unmap stuff or when you map stuff, because doing
> TLB flushes across the entire system on every vmap/vunmap would be a
> bit costly; instead you just do batched TLB flushes in between, in
> __purge_vmap_area_lazy().
>
> In other words, the basic idea is that you can keep calling vmap() and
> vunmap() a bunch of times without ever doing TLB flushes until you run
> out of virtual memory in the vmap region; then you do one big TLB
> flush, and afterwards you can reuse the free virtual address space for
> new allocations again.
>
> So if you "defer" that batched TLB flush for CPUs that are not
> currently running in the kernel, I think the consequence is that those
> CPUs may end up with incoherent TLB state after a reallocation of the
> virtual address space.
>
> Actually, I think this would mean that your optimization is disallowed
> at least on arm64 - I'm not sure about the exact wording, but arm64
> has a "break before make" rule that forbids conflicting writable
> address translations or something like that.
>
> (I said "until you run out of virtual memory in the vmap region", but
> that's not actually true - see the comment above lazy_max_pages() for
> an explanation of the actual heuristic. You might be able to tune that
> a bit if you'd be significantly happier with less frequent
> interruptions, or something along those lines.)

I've been thinking some more (this is your cue to grab a brown paper bag)...

Experimentation (unmap the whole VMALLOC range upon return to userspace,
see what explodes upon entry into the kernel) suggests that the early entry
"danger zone" should only access the vmaped stack, which itself isn't an
issue.

That is obviously just a test on one system configuration, and the problem
I'm facing is trying to put in place /some/ form of instrumentation that would
at the very least cause a warning for any future patch that would introduce
a vmap'd access in early entry code. That, or a complete mitigation that
prevents those accesses altogether.

What if isolated CPUs unconditionally did a TLBi as late as possible in
the stack right before returning to userspace? This would mean that upon
re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel
range translation - with the exception of whatever lies between the
last-minute flush and the actual userspace entry, which should be feasible
to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI
could be entirely silenced if it targets an isolated CPU.
Jann Horn Feb. 10, 2025, 10:08 p.m. UTC | #14
On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider <vschneid@redhat.com> wrote:
> What if isolated CPUs unconditionally did a TLBi as late as possible in
> the stack right before returning to userspace? This would mean that upon
> re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel
> range translation - with the exception of whatever lies between the
> last-minute flush and the actual userspace entry, which should be feasible
> to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI
> could be entirely silenced if it targets an isolated CPU.

Two issues with that:

1. I think the "Common not Private" feature Will Deacon referred to is
incompatible with this idea:
<https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private>
says "When the CnP bit is set, the software promises to use the ASIDs
and VMIDs in the same way on all processors, which allows the TLB
entries that are created by one processor to be used by another"

2. It's wrong to assume that TLB entries are only populated for
addresses you access - thanks to speculative execution, you have to
assume that the CPU might be populating random TLB entries all over
the place.
Valentin Schneider Feb. 11, 2025, 1:33 p.m. UTC | #15
On 10/02/25 23:08, Jann Horn wrote:
> On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> What if isolated CPUs unconditionally did a TLBi as late as possible in
>> the stack right before returning to userspace? This would mean that upon
>> re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel
>> range translation - with the exception of whatever lies between the
>> last-minute flush and the actual userspace entry, which should be feasible
>> to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI
>> could be entirely silenced if it targets an isolated CPU.
>
> Two issues with that:
>

Firstly, thank you for entertaining the idea :-)

> 1. I think the "Common not Private" feature Will Deacon referred to is
> incompatible with this idea:
> <https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private>
> says "When the CnP bit is set, the software promises to use the ASIDs
> and VMIDs in the same way on all processors, which allows the TLB
> entries that are created by one processor to be used by another"
>


Sorry for being obtuse - I can understand inconsistent TLB states (old vs
new translations being present in separate TLBs) due to not sending the
flush IPI causing an issue with that, but not "flushing early". Even if TLB
entries can be shared/accessed between CPUs, a CPU should be allowed not to
have a shared entry in its TLB - what am I missing?

> 2. It's wrong to assume that TLB entries are only populated for
> addresses you access - thanks to speculative execution, you have to
> assume that the CPU might be populating random TLB entries all over
> the place.

Gotta love speculation. Now it is supposed to be limited to genuinely
accessible data & code, right? Say theoretically we have a full TLBi as
literally the last thing before doing the return-to-userspace, speculation
should be limited to executing maybe bits of the return-from-userspace
code?

Furthermore, I would hope that once a CPU is executing in userspace, it's
not going to populate the TLB with kernel address translations - AIUI the
whole vulnerability mitigation debacle was about preventing this sort of
thing.
Mark Rutland Feb. 11, 2025, 2:03 p.m. UTC | #16
On Tue, Feb 11, 2025 at 02:33:51PM +0100, Valentin Schneider wrote:
> On 10/02/25 23:08, Jann Horn wrote:
> > On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> What if isolated CPUs unconditionally did a TLBi as late as possible in
> >> the stack right before returning to userspace? This would mean that upon
> >> re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel
> >> range translation - with the exception of whatever lies between the
> >> last-minute flush and the actual userspace entry, which should be feasible
> >> to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI
> >> could be entirely silenced if it targets an isolated CPU.
> >
> > Two issues with that:
> 
> Firstly, thank you for entertaining the idea :-)
> 
> > 1. I think the "Common not Private" feature Will Deacon referred to is
> > incompatible with this idea:
> > <https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private>
> > says "When the CnP bit is set, the software promises to use the ASIDs
> > and VMIDs in the same way on all processors, which allows the TLB
> > entries that are created by one processor to be used by another"
> 
> Sorry for being obtuse - I can understand inconsistent TLB states (old vs
> new translations being present in separate TLBs) due to not sending the
> flush IPI causing an issue with that, but not "flushing early". Even if TLB
> entries can be shared/accessed between CPUs, a CPU should be allowed not to
> have a shared entry in its TLB - what am I missing?
> 
> > 2. It's wrong to assume that TLB entries are only populated for
> > addresses you access - thanks to speculative execution, you have to
> > assume that the CPU might be populating random TLB entries all over
> > the place.
> 
> Gotta love speculation. Now it is supposed to be limited to genuinely
> accessible data & code, right? Say theoretically we have a full TLBi as
> literally the last thing before doing the return-to-userspace, speculation
> should be limited to executing maybe bits of the return-from-userspace
> code?

I think it's easier to ignore speculation entirely, and just assume that
the MMU can arbitrarily fill TLB entries from any page table entries
which are valid/accessible in the active page tables. Hardware
prefetchers can do that regardless of the specific path of speculative
execution.

Thus TLB fills are not limited to VAs which would be used on that
return-to-userspace path.

> Furthermore, I would hope that once a CPU is executing in userspace, it's
> not going to populate the TLB with kernel address translations - AIUI the
> whole vulnerability mitigation debacle was about preventing this sort of
> thing.

The CPU can definitely do that; the vulnerability mitigations are all
about what userspace can observe rather than what the CPU can do in the
background. Additionally, there are features like SPE and TRBE that use
kernel addresses while the CPU is executing userspace instructions.

The latest ARM Architecture Reference Manual (ARM DDI 0487 L.a) is fairly clear
about that in section D8.16 "Translation Lookaside Buffers", where it says
(among other things):

  When address translation is enabled, if a translation table entry
  meets all of the following requirements, then that translation table
  entry is permitted to be cached in a TLB or intermediate TLB caching
  structure at any time:
  • The translation table entry itself does not generate a Translation
    fault, an Address size fault, or an Access flag fault.
  • The translation table entry is not from a translation regime
    configured by an Exception level that is lower than the current
    Exception level.

Here "permitted to be cached in a TLB" also implies that the HW is
allowed to fetch the translation table entry (which is what ARM calls page
table entries).

The PDF can be found at:

  https://developer.arm.com/documentation/ddi0487/la/?lang=en

Mark.
Dave Hansen Feb. 11, 2025, 2:22 p.m. UTC | #17
On 2/11/25 05:33, Valentin Schneider wrote:
>> 2. It's wrong to assume that TLB entries are only populated for
>> addresses you access - thanks to speculative execution, you have to
>> assume that the CPU might be populating random TLB entries all over
>> the place.
> Gotta love speculation. Now it is supposed to be limited to genuinely
> accessible data & code, right? Say theoretically we have a full TLBi as
> literally the last thing before doing the return-to-userspace, speculation
> should be limited to executing maybe bits of the return-from-userspace
> code?

In practice, it's mostly limited like that.

Architecturally, there are no promises from the CPU. It is within its
rights to cache anything from the page tables at any time. If it's in
the CR3 tree, it's fair game.

> Furthermore, I would hope that once a CPU is executing in userspace, it's
> not going to populate the TLB with kernel address translations - AIUI the
> whole vulnerability mitigation debacle was about preventing this sort of
> thing.

Nope, unfortunately. There are two big exceptions to this. First, "implicit
supervisor-mode accesses". There are structures for which the CPU gets a
virtual address and accesses it even while userspace is running. The LDT
and GDT are the most obvious examples, but there are some less
ubiquitous ones like the buffers for PEBS events.

Second, remember that user versus supervisor is determined *BY* the page
tables. Before Linear Address Space Separation (LASS), all virtual
memory accesses walk the page tables, even userspace accesses to kernel
addresses.  The User/Supervisor bit is *in* the page tables, of course.

A userspace access to a kernel address results in a page walk and the
CPU is completely free to cache all or part of that page walk. A
Meltdown-style _speculative_ userspace access to kernel memory won't
generate a fault either. It won't leak data like it used to, of course,
but it can still walk the page tables. That's one reason LASS is needed.
Valentin Schneider Feb. 11, 2025, 4:09 p.m. UTC | #18
On 11/02/25 14:03, Mark Rutland wrote:
> On Tue, Feb 11, 2025 at 02:33:51PM +0100, Valentin Schneider wrote:
>> On 10/02/25 23:08, Jann Horn wrote:
>> > 2. It's wrong to assume that TLB entries are only populated for
>> > addresses you access - thanks to speculative execution, you have to
>> > assume that the CPU might be populating random TLB entries all over
>> > the place.
>>
>> Gotta love speculation. Now it is supposed to be limited to genuinely
>> accessible data & code, right? Say theoretically we have a full TLBi as
>> literally the last thing before doing the return-to-userspace, speculation
>> should be limited to executing maybe bits of the return-from-userspace
>> code?
>
> I think it's easier to ignore speculation entirely, and just assume that
> the MMU can arbitrarily fill TLB entries from any page table entries
> which are valid/accessible in the active page tables. Hardware
> prefetchers can do that regardless of the specific path of speculative
> execution.
>
> Thus TLB fills are not limited to VAs which would be used on that
> return-to-userspace path.
>
>> Furthermore, I would hope that once a CPU is executing in userspace, it's
>> not going to populate the TLB with kernel address translations - AIUI the
>> whole vulnerability mitigation debacle was about preventing this sort of
>> thing.
>
> The CPU can definitely do that; the vulnerability mitigations are all
> about what userspace can observe rather than what the CPU can do in the
> background. Additionally, there are features like SPE and TRBE that use
> kernel addresses while the CPU is executing userspace instructions.
>
> The latest ARM Architecture Reference Manual (ARM DDI 0487 L.a) is fairly clear
> about that in section D8.16 "Translation Lookaside Buffers", where it says
> (among other things):
>
>   When address translation is enabled, if a translation table entry
>   meets all of the following requirements, then that translation table
>   entry is permitted to be cached in a TLB or intermediate TLB caching
>   structure at any time:
>   • The translation table entry itself does not generate a Translation
>     fault, an Address size fault, or an Access flag fault.
>   • The translation table entry is not from a translation regime
>     configured by an Exception level that is lower than the current
>     Exception level.
>
> Here "permitted to be cached in a TLB" also implies that the HW is
> allowed to fetch the translation table entry (which is what ARM call page
> table entries).
>

That's actually fairly clear all things considered, thanks for the
education and for fishing out the relevant DDI section!

> The PDF can be found at:
>
>   https://developer.arm.com/documentation/ddi0487/la/?lang=en
>
> Mark.
Valentin Schneider Feb. 11, 2025, 4:10 p.m. UTC | #19
On 11/02/25 06:22, Dave Hansen wrote:
> On 2/11/25 05:33, Valentin Schneider wrote:
>>> 2. It's wrong to assume that TLB entries are only populated for
>>> addresses you access - thanks to speculative execution, you have to
>>> assume that the CPU might be populating random TLB entries all over
>>> the place.
>> Gotta love speculation. Now it is supposed to be limited to genuinely
>> accessible data & code, right? Say theoretically we have a full TLBi as
>> literally the last thing before doing the return-to-userspace, speculation
>> should be limited to executing maybe bits of the return-from-userspace
>> code?
>
> In practice, it's mostly limited like that.
>
> Architecturally, there are no promises from the CPU. It is within its
> rights to cache anything from the page tables at any time. If it's in
> the CR3 tree, it's fair game.
>
>> Furthermore, I would hope that once a CPU is executing in userspace, it's
>> not going to populate the TLB with kernel address translations - AIUI the
>> whole vulnerability mitigation debacle was about preventing this sort of
>> thing.
>
> Nope, unfortunately. There are two big exceptions to this. First, "implicit
> supervisor-mode accesses". There are structures for which the CPU gets a
> virtual address and accesses it even while userspace is running. The LDT
> and GDT are the most obvious examples, but there are some less
> ubiquitous ones like the buffers for PEBS events.
>
> Second, remember that user versus supervisor is determined *BY* the page
> tables. Before Linear Address Space Separation (LASS), all virtual
> memory accesses walk the page tables, even userspace accesses to kernel
> addresses.  The User/Supervisor bit is *in* the page tables, of course.
>
> A userspace access to a kernel address results in a page walk and the
> CPU is completely free to cache all or part of that page walk. A
> Meltdown-style _speculative_ userspace access to kernel memory won't
> generate a fault either. It won't leak data like it used to, of course,
> but it can still walk the page tables. That's one reason LASS is needed.

Bummer, now I have at least two architectures proving me wrong :-) Thank
you as well for the education, I really appreciate it.
Valentin Schneider Feb. 18, 2025, 10:40 p.m. UTC | #20
On 11/02/25 06:22, Dave Hansen wrote:
> On 2/11/25 05:33, Valentin Schneider wrote:
>>> 2. It's wrong to assume that TLB entries are only populated for
>>> addresses you access - thanks to speculative execution, you have to
>>> assume that the CPU might be populating random TLB entries all over
>>> the place.
>> Gotta love speculation. Now it is supposed to be limited to genuinely
>> accessible data & code, right? Say theoretically we have a full TLBi as
>> literally the last thing before doing the return-to-userspace, speculation
>> should be limited to executing maybe bits of the return-from-userspace
>> code?
>
> In practice, it's mostly limited like that.
>
> Architecturally, there are no promises from the CPU. It is within its
> rights to cache anything from the page tables at any time. If it's in
> the CR3 tree, it's fair game.
>

So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is
executing in userspace?

AIUI that's the case with kPTI - the remaining kernel pages should mostly
be .entry.text and cpu_entry_area, at least for x86.

It sounds like it wouldn't do much for arm64 though: with CnP, a CPU
executing in userspace with the user/trampoline page table installed can
still use TLB entries created by another CPU executing in kernelspace with
the kernel page table installed.
Dave Hansen Feb. 19, 2025, 12:39 a.m. UTC | #21
On 2/18/25 14:40, Valentin Schneider wrote:
>> In practice, it's mostly limited like that.
>>
>> Architecturally, there are no promises from the CPU. It is within its
>> rights to cache anything from the page tables at any time. If it's in
>> the CR3 tree, it's fair game.
>>
> So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is
> executing in userspace?
> 
> AIUI that's the case with kPTI - the remaining kernel pages should mostly
> be .entry.text and cpu_entry_area, at least for x86.

Having part of VMEMMAP not in the CR3 tree should be harmless while
running userspace. VMEMMAP is a purely software structure; the hardware
doesn't do implicit supervisor accesses to it. It's also not needed in
super early entry.

Maybe I missed part of the discussion though. Is VMEMMAP your only
concern? I would have guessed that the more generic vmalloc()
functionality would be harder to pin down.
Joel Fernandes Feb. 19, 2025, 3:05 p.m. UTC | #22
On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
> On 17/01/25 16:52, Jann Horn wrote:
> > On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> On 14/01/25 19:16, Jann Horn wrote:
> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> >> flush_tlb_kernel_range() IPIs.
> >> >>
> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> >> range, these IPIs could be deferred until their next kernel entry.
> >> >>
> >> >> Deferral vs early entry danger zone
> >> >> ===================================
> >> >>
> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> >> and then accessed in early entry code.
> >> >
> >> > In other words, it needs a guarantee that no vmalloc allocations that
> >> > have been created in the vmalloc region while the CPU was idle can
> >> > then be accessed during early entry, right?
> >>
> >> I'm not sure if that would be a problem (not an mm expert, please do
> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> >> deferred anyway.
> >
> > flush_cache_vmap() is about stuff like flushing data caches on
> > architectures with virtually indexed caches; that doesn't do TLB
> > maintenance. When you look for its definition on x86 or arm64, you'll
> > see that they use the generic implementation which is simply an empty
> > inline function.
> >
> >> So after vmapping something, I wouldn't expect isolated CPUs to have
> >> invalid TLB entries for the newly vmapped page.
> >>
> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> >> stale TLB entries can and will remain on isolated CPUs, up until they
> >> execute the deferred flush themselves (IOW for the entire duration of the
> >> "danger zone").
> >>
> >> Does that make sense?
> >
> > The design idea wrt TLB flushes in the vmap code is that you don't do
> > TLB flushes when you unmap stuff or when you map stuff, because doing
> > TLB flushes across the entire system on every vmap/vunmap would be a
> > bit costly; instead you just do batched TLB flushes in between, in
> > __purge_vmap_area_lazy().
> >
> > In other words, the basic idea is that you can keep calling vmap() and
> > vunmap() a bunch of times without ever doing TLB flushes until you run
> > out of virtual memory in the vmap region; then you do one big TLB
> > flush, and afterwards you can reuse the free virtual address space for
> > new allocations again.
> >
> > So if you "defer" that batched TLB flush for CPUs that are not
> > currently running in the kernel, I think the consequence is that those
> > CPUs may end up with incoherent TLB state after a reallocation of the
> > virtual address space.
> >
> 
> Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
> that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
> CPU accesses it during early entry;

So the issue is:

CPU1: unmaps vmalloc page X which was previously mapped to physical page
P1.

CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy
threshold and sending out IPIs. It then goes ahead and does an allocation
that maps the same virtual page X to physical page P2.

CPU3 is isolated and executes some early entry code before receiving said IPIs
which are supposedly deferred by Valentin's patches.

It does not receive the IPI because it is deferred, thus access by early
entry code to page X on this CPU results in a UAF access to P1.

Is that the issue?

thanks,

 - Joel
Valentin Schneider Feb. 19, 2025, 3:13 p.m. UTC | #23
On 18/02/25 16:39, Dave Hansen wrote:
> On 2/18/25 14:40, Valentin Schneider wrote:
>>> In practice, it's mostly limited like that.
>>>
>>> Architecturally, there are no promises from the CPU. It is within its
>>> rights to cache anything from the page tables at any time. If it's in
>>> the CR3 tree, it's fair game.
>>>
>> So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is
>> executing in userspace?
>>
>> AIUI that's the case with kPTI - the remaining kernel pages should mostly
>> be .entry.text and cpu_entry_area, at least for x86.
>
> Having part of VMEMMAP not in the CR3 tree should be harmless while
> running userspace. VMEMMAP is a purely software structure; the hardware
> doesn't do implicit supervisor accesses to it. It's also not needed in
> super early entry.
>
> Maybe I missed part of the discussion though. Is VMEMMAP your only
> concern? I would have guessed that the more generic vmalloc()
> functionality would be harder to pin down.

Urgh, that'll teach me to send emails that late - I did indeed mean the
vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user
kPTI page table and AFAICT the page table swap is done before the actual vmap'd
stack (CONFIG_VMAP_STACK=y) gets used.
Valentin Schneider Feb. 19, 2025, 4:18 p.m. UTC | #24
On 19/02/25 10:05, Joel Fernandes wrote:
> On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
>> On 17/01/25 16:52, Jann Horn wrote:
>> > On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> On 14/01/25 19:16, Jann Horn wrote:
>> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> >> flush_tlb_kernel_range() IPIs.
>> >> >>
>> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> >> range, these IPIs could be deferred until their next kernel entry.
>> >> >>
>> >> >> Deferral vs early entry danger zone
>> >> >> ===================================
>> >> >>
>> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> >> and then accessed in early entry code.
>> >> >
>> >> > In other words, it needs a guarantee that no vmalloc allocations that
>> >> > have been created in the vmalloc region while the CPU was idle can
>> >> > then be accessed during early entry, right?
>> >>
>> >> I'm not sure if that would be a problem (not an mm expert, please do
>> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> >> deferred anyway.
>> >
>> > flush_cache_vmap() is about stuff like flushing data caches on
>> > architectures with virtually indexed caches; that doesn't do TLB
>> > maintenance. When you look for its definition on x86 or arm64, you'll
>> > see that they use the generic implementation which is simply an empty
>> > inline function.
>> >
>> >> So after vmapping something, I wouldn't expect isolated CPUs to have
>> >> invalid TLB entries for the newly vmapped page.
>> >>
>> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> >> stale TLB entries can and will remain on isolated CPUs, up until they
>> >> execute the deferred flush themselves (IOW for the entire duration of the
>> >> "danger zone").
>> >>
>> >> Does that make sense?
>> >
>> > The design idea wrt TLB flushes in the vmap code is that you don't do
>> > TLB flushes when you unmap stuff or when you map stuff, because doing
>> > TLB flushes across the entire system on every vmap/vunmap would be a
>> > bit costly; instead you just do batched TLB flushes in between, in
>> > __purge_vmap_area_lazy().
>> >
>> > In other words, the basic idea is that you can keep calling vmap() and
>> > vunmap() a bunch of times without ever doing TLB flushes until you run
>> > out of virtual memory in the vmap region; then you do one big TLB
>> > flush, and afterwards you can reuse the free virtual address space for
>> > new allocations again.
>> >
>> > So if you "defer" that batched TLB flush for CPUs that are not
>> > currently running in the kernel, I think the consequence is that those
>> > CPUs may end up with incoherent TLB state after a reallocation of the
>> > virtual address space.
>> >
>>
>> Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
>> that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
>> CPU accesses it during early entry;
>
> So the issue is:
>
> CPU1: unmaps vmalloc page X which was previously mapped to physical page
> P1.
>
> CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy
> threshold and sending out IPIs. It then goes ahead and does an allocation
> that maps the same virtual page X to physical page P2.
>
> CPU3 is isolated and executes some early entry code before receiving said IPIs
> which are supposedly deferred by Valentin's patches.
>
> It does not receive the IPI because it is deferred, thus access by early
> entry code to page X on this CPU results in a UAF access to P1.
>
> Is that the issue?
>

Pretty much so yeah. That is, *if* there is such a vmalloc'd address access in
early entry code - testing says it's not the case, but I haven't found a
way to instrumentally verify this.
Joel Fernandes Feb. 19, 2025, 5:08 p.m. UTC | #25
On 2/19/2025 11:18 AM, Valentin Schneider wrote:
> On 19/02/25 10:05, Joel Fernandes wrote:
>> On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote:
>>> On 17/01/25 16:52, Jann Horn wrote:
>>>> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>>>> On 14/01/25 19:16, Jann Horn wrote:
>>>>>> On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>>>>>>> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>>>>>>> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>>>>>>> flush_tlb_kernel_range() IPIs.
>>>>>>>
>>>>>>> Given that CPUs executing in userspace do not access data in the vmalloc
>>>>>>> range, these IPIs could be deferred until their next kernel entry.
>>>>>>>
>>>>>>> Deferral vs early entry danger zone
>>>>>>> ===================================
>>>>>>>
>>>>>>> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>>>>>>> and then accessed in early entry code.
>>>>>>
>>>>>> In other words, it needs a guarantee that no vmalloc allocations that
>>>>>> have been created in the vmalloc region while the CPU was idle can
>>>>>> then be accessed during early entry, right?
>>>>>
>>>>> I'm not sure if that would be a problem (not an mm expert, please do
>>>>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>>>>> deferred anyway.
>>>>
>>>> flush_cache_vmap() is about stuff like flushing data caches on
>>>> architectures with virtually indexed caches; that doesn't do TLB
>>>> maintenance. When you look for its definition on x86 or arm64, you'll
>>>> see that they use the generic implementation which is simply an empty
>>>> inline function.
>>>>
>>>>> So after vmapping something, I wouldn't expect isolated CPUs to have
>>>>> invalid TLB entries for the newly vmapped page.
>>>>>
>>>>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>>>>> stale TLB entries can and will remain on isolated CPUs, up until they
>>>>> execute the deferred flush themselves (IOW for the entire duration of the
>>>>> "danger zone").
>>>>>
>>>>> Does that make sense?
>>>>
>>>> The design idea wrt TLB flushes in the vmap code is that you don't do
>>>> TLB flushes when you unmap stuff or when you map stuff, because doing
>>>> TLB flushes across the entire system on every vmap/vunmap would be a
>>>> bit costly; instead you just do batched TLB flushes in between, in
>>>> __purge_vmap_area_lazy().
>>>>
>>>> In other words, the basic idea is that you can keep calling vmap() and
>>>> vunmap() a bunch of times without ever doing TLB flushes until you run
>>>> out of virtual memory in the vmap region; then you do one big TLB
>>>> flush, and afterwards you can reuse the free virtual address space for
>>>> new allocations again.
>>>>
>>>> So if you "defer" that batched TLB flush for CPUs that are not
>>>> currently running in the kernel, I think the consequence is that those
>>>> CPUs may end up with incoherent TLB state after a reallocation of the
>>>> virtual address space.
>>>>
>>>
>>> Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
>>> that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
>>> CPU accesses it during early entry;
>>
>> So the issue is:
>>
>> CPU1: unmaps vmalloc page X which was previously mapped to physical page
>> P1.
>>
>> CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy
>> threshold and sending out IPIs. It then goes ahead and does an allocation
>> that maps the same virtual page X to physical page P2.
>>
>> CPU3 is isolated and executes some early entry code before receiving said IPIs
>> which are supposedly deferred by Valentin's patches.
>>
>> It does not receive the IPI because it is deferred, thus access by early
>> entry code to page X on this CPU results in a UAF access to P1.
>>
>> Is that the issue?
>>
> 
> Pretty much so yeah. That is, *if* there is such a vmalloc'd address access in
> early entry code - testing says it's not the case, but I haven't found a
> way to instrumentally verify this.
Ok, thanks for confirming. Maybe there is an address sanitizer way of verifying,
but yeah it is subtle and there could be more than one way of solving it. Too
much 'fun' ;)

 - Joel
Dave Hansen Feb. 19, 2025, 8:25 p.m. UTC | #26
On 2/19/25 07:13, Valentin Schneider wrote:
>> Maybe I missed part of the discussion though. Is VMEMMAP your only
>> concern? I would have guessed that the more generic vmalloc()
>> functionality would be harder to pin down.
> Urgh, that'll teach me to send emails that late - I did indeed mean the
> vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user
> kPTI page table and AFAICT the page table swap is done before the actual vmap'd
> stack (CONFIG_VMAP_STACK=y) gets used.

OK, so rewriting your question... ;)

> So what if the vmalloc() range *isn't* in the CR3 tree when a CPU is
> executing in userspace?

The LDT and maybe the PEBS buffers are the only implicit supervisor
accesses to vmalloc()'d memory that I can think of. But those are both
handled specially and shouldn't ever get zapped while in use. The LDT
replacement has its own IPIs separate from TLB flushing.

But I'm actually not all that worried about accesses while actually
running userspace. It's that "danger zone" in the kernel between entry
and when the TLB might have dangerous garbage in it.

BTW, I hope this whole thing is turned off on 32-bit. There, we can
actually take and handle faults on the vmalloc() area. If you get one of
those faults in your "danger zone", it'll start running page fault code
which will branch out to god-knows-where and certainly isn't noinstr.
Dave Hansen Feb. 19, 2025, 8:32 p.m. UTC | #27
On 2/19/25 09:08, Joel Fernandes wrote:
>> Pretty much so yeah. That is, *if* there is such a vmalloc'd address access in
>> early entry code - testing says it's not the case, but I haven't found a
>> way to instrumentally verify this.
> Ok, thanks for confirming. Maybe there is an address sanitizer way of verifying,
> but yeah it is subtle and there could be more than one way of solving it. Too
> much 'fun' 
Valentin Schneider Feb. 20, 2025, 5:10 p.m. UTC | #28
On 19/02/25 12:25, Dave Hansen wrote:
> On 2/19/25 07:13, Valentin Schneider wrote:
>>> Maybe I missed part of the discussion though. Is VMEMMAP your only
>>> concern? I would have guessed that the more generic vmalloc()
>>> functionality would be harder to pin down.
>> Urgh, that'll teach me to send emails that late - I did indeed mean the
>> vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user
>> kPTI page table and AFAICT the page table swap is done before the actual vmap'd
>> stack (CONFIG_VMAP_STACK=y) gets used.
>
> OK, so rewriting your question... ;)
>
>> So what if the vmalloc() range *isn't* in the CR3 tree when a CPU is
>> executing in userspace?
>
> The LDT and maybe the PEBS buffers are the only implicit supervisor
> accesses to vmalloc()'d memory that I can think of. But those are both
> handled specially and shouldn't ever get zapped while in use. The LDT
> replacement has its own IPIs separate from TLB flushing.
>
> But I'm actually not all that worried about accesses while actually
> running userspace. It's that "danger zone" in the kernel between entry
> and when the TLB might have dangerous garbage in it.
>

So say we have kPTI, thus no vmalloc() mapped in CR3 when running
userspace, and do a full TLB flush right before switching to userspace -
could the TLB still end up with vmalloc()-range-related entries when we're
back in the kernel and going through the danger zone?

> BTW, I hope this whole thing is turned off on 32-bit. There, we can
> actually take and handle faults on the vmalloc() area. If you get one of
> those faults in your "danger zone", it'll start running page fault code
> which will branch out to god-knows-where and certainly isn't noinstr.

Sounds... Fun. Thanks for pointing out the landmines.
Dave Hansen Feb. 20, 2025, 5:38 p.m. UTC | #29
On 2/20/25 09:10, Valentin Schneider wrote:
>> The LDT and maybe the PEBS buffers are the only implicit supervisor
>> accesses to vmalloc()'d memory that I can think of. But those are both
>> handled specially and shouldn't ever get zapped while in use. The LDT
>> replacement has its own IPIs separate from TLB flushing.
>>
>> But I'm actually not all that worried about accesses while actually
>> running userspace. It's that "danger zone" in the kernel between entry
>> and when the TLB might have dangerous garbage in it.
>>
> So say we have kPTI, thus no vmalloc() mapped in CR3 when running
> userspace, and do a full TLB flush right before switching to userspace -
> could the TLB still end up with vmalloc()-range-related entries when we're
> back in the kernel and going through the danger zone?

Yes, because the danger zone includes the switch back to the kernel CR3
with vmalloc() fully mapped. All bets are off about what's in the TLB
the moment that CR3 write occurs.

Actually, you could probably use that.

If a mapping is in the PTI user page table, you can't defer the flushes
for it. Basically the same rule for text poking in the danger zone.

If there's a deferred flush pending, make sure that all of the
SWITCH_TO_KERNEL_CR3's fully flush the TLB. You'd need something similar
to user_pcid_flush_mask.
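
As a rough illustration of that idea (names here are hypothetical; a plain-C
sketch of what would really live in the PTI entry asm, not actual kernel code):

  #include <linux/percpu.h>
  #include <asm/processor-flags.h>        /* X86_CR3_PCID_NOFLUSH */

  /* Per-CPU "a kernel-range TLB flush was deferred" bit, in the spirit
   * of user_pcid_flush_mask. */
  static DEFINE_PER_CPU(unsigned long, kernel_tlb_flush_pending);

  /* Housekeeping side: mark a remote isolated CPU instead of IPI'ing it. */
  static void defer_kernel_tlb_flush(int cpu)
  {
          WRITE_ONCE(per_cpu(kernel_tlb_flush_pending, cpu), 1);
  }

  /* Entry side: pick the CR3 value for the user->kernel switch. If a
   * flush was deferred, leave the NOFLUSH bit clear so the CR3 write
   * itself flushes the (non-global) entries for the kernel PCID before
   * any vmalloc'd address is touched. */
  static unsigned long switch_to_kernel_cr3_value(unsigned long kernel_cr3)
  {
          if (this_cpu_xchg(kernel_tlb_flush_pending, 0))
                  return kernel_cr3;

          return kernel_cr3 | X86_CR3_PCID_NOFLUSH;
  }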

But, honestly, I'm still not sure this is worth all the trouble. If
folks want to avoid IPIs for TLB flushes, there are hardware features
that *DO* that. Just get new hardware instead of adding this complicated
pile of software that we have to maintain forever. In 10 years, we'll
still have this software *and* 95% of our hardware has the hardware
feature too.
Valentin Schneider Feb. 26, 2025, 4:52 p.m. UTC | #30
On 20/02/25 09:38, Dave Hansen wrote:
> On 2/20/25 09:10, Valentin Schneider wrote:
>>> The LDT and maybe the PEBS buffers are the only implicit supervisor
>>> accesses to vmalloc()'d memory that I can think of. But those are both
>>> handled specially and shouldn't ever get zapped while in use. The LDT
>>> replacement has its own IPIs separate from TLB flushing.
>>>
>>> But I'm actually not all that worried about accesses while actually
>>> running userspace. It's that "danger zone" in the kernel between entry
>>> and when the TLB might have dangerous garbage in it.
>>>
>> So say we have kPTI, thus no vmalloc() mapped in CR3 when running
>> userspace, and do a full TLB flush right before switching to userspace -
>> could the TLB still end up with vmalloc()-range-related entries when we're
>> back in the kernel and going through the danger zone?
>
> Yes, because the danger zone includes the switch back to the kernel CR3
> with vmalloc() fully mapped. All bets are off about what's in the TLB
> the moment that CR3 write occurs.
>
> Actually, you could probably use that.
>
> If a mapping is in the PTI user page table, you can't defer the flushes
> for it. Basically the same rule for text poking in the danger zone.
>
> If there's a deferred flush pending, make sure that all of the
> SWITCH_TO_KERNEL_CR3's fully flush the TLB. You'd need something similar
> to user_pcid_flush_mask.
>

Right, that's what I (roughly) had in mind...

> But, honestly, I'm still not sure this is worth all the trouble. If
> folks want to avoid IPIs for TLB flushes, there are hardware features
> that *DO* that. Just get new hardware instead of adding this complicated
> pile of software that we have to maintain forever. In 10 years, we'll
> still have this software *and* 95% of our hardware has the hardware
> feature too.

... But yeah, it pretty much circumvents arch_context_tracking_work, or at
the very least adds an early(er) flushing of the context tracking
work... Urgh.

Thank you for grounding my wild ideas into reality. I'll try to think some
more see if I see any other way out (other than "buy hardware that does
what you want and ditch the one that doesn't").
Valentin Schneider March 25, 2025, 5:52 p.m. UTC | #31
On 20/02/25 09:38, Dave Hansen wrote:
> But, honestly, I'm still not sure this is worth all the trouble. If
> folks want to avoid IPIs for TLB flushes, there are hardware features
> that *DO* that. Just get new hardware instead of adding this complicated
> pile of software that we have to maintain forever. In 10 years, we'll
> still have this software *and* 95% of our hardware has the hardware
> feature too.

Sorry, you're going to have to deal with my ignorance a little bit longer...

Were you thinking x86 hardware specifically, or something else?
AIUI things like arm64's TLBI VMALLE1IS can do what is required without any
IPI:

C5.5.78
"""
The invalidation applies to all PEs in the same Inner Shareable shareability domain as the PE that
executes this System instruction.
"""

But for (at least) these architectures:

  alpha
  x86
  loongarch
  mips
  (non-freescale 8xx) powerpc
  riscv
  xtensa

flush_tlb_kernel_range() has a path with a hardcoded use of on_each_cpu(),
so AFAICT for these the IPIs will be sent no matter the hardware.
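
For x86, that hardcoded path has roughly the following shape (a heavily
simplified sketch in the spirit of arch/x86/mm/tlb.c; the struct and field
names are illustrative, not the verbatim kernel code):

  struct kernel_flush_info {
          unsigned long start;
          unsigned long end;
  };

  /* Runs on every CPU, in IPI context: flush the range locally. */
  static void do_kernel_range_flush(void *info)
  {
          struct kernel_flush_info *f = info;
          unsigned long addr;

          for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
                  flush_tlb_one_kernel(addr);
  }

  void flush_tlb_kernel_range(unsigned long start, unsigned long end)
  {
          struct kernel_flush_info info = { .start = start, .end = end };

          /* Unconditional, synchronous IPI to every online CPU, isolated
           * NOHZ_FULL ones included. (The real code falls back to a full
           * flush when the range is large.) */
          on_each_cpu(do_kernel_range_flush, &info, 1);
  }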
Jann Horn March 25, 2025, 6:41 p.m. UTC | #32
On Tue, Mar 25, 2025 at 6:52 PM Valentin Schneider <vschneid@redhat.com> wrote:
> On 20/02/25 09:38, Dave Hansen wrote:
> > But, honestly, I'm still not sure this is worth all the trouble. If
> > folks want to avoid IPIs for TLB flushes, there are hardware features
> > that *DO* that. Just get new hardware instead of adding this complicated
> > pile of software that we have to maintain forever. In 10 years, we'll
> > still have this software *and* 95% of our hardware has the hardware
> > feature too.
>
> Sorry, you're going to have to deal with my ignorance a little bit longer...
>
> Were you thinking x86 hardware specifically, or something else?
> AIUI things like arm64's TLBI VMALLE1IS can do what is required without any
> IPI:
>
> C5.5.78
> """
> The invalidation applies to all PEs in the same Inner Shareable shareability domain as the PE that
> executes this System instruction.
> """
>
> But for (at least) these architectures:
>
>   alpha
>   x86
>   loongarch
>   mips
>   (non-freescale 8xx) powerpc
>   riscv
>   xtensa
>
> flush_tlb_kernel_range() has a path with a hardcoded use of on_each_cpu(),
> so AFAICT for these the IPIs will be sent no matter the hardware.

On X86, both AMD and Intel have some fairly recently introduced CPU
features that can shoot down TLBs remotely.

The patch series
<https://lore.kernel.org/all/20250226030129.530345-1-riel@surriel.com/>
adds support for the AMD flavor; that series landed in the current
merge window (it's present in the mainline git repository now and should
be part of 6.15). I think support for the Intel flavor has not yet
been implemented, but the linked patch series mentions a plan to look
at the Intel flavor next.