diff mbox series

[v4,29/30] x86/mm, mm/vmalloc: Defer flush_tlb_kernel_range() targeting NOHZ_FULL CPUs

Message ID 20250114175143.81438-30-vschneid@redhat.com (mailing list archive)
State New
Headers show
Series context_tracking,x86: Defer some IPIs until a user->kernel transition | expand

Commit Message

Valentin Schneider Jan. 14, 2025, 5:51 p.m. UTC
vunmap()'s issued from housekeeping CPUs are a relatively common source of
interference for isolated NOHZ_FULL CPUs, as they are hit by the
flush_tlb_kernel_range() IPIs.

Given that CPUs executing in userspace do not access data in the vmalloc
range, these IPIs could be deferred until their next kernel entry.

Deferral vs early entry danger zone
===================================

This requires a guarantee that nothing in the vmalloc range can be vunmap'd
and then accessed in early entry code.

Vmalloc uses are, as reported by vmallocinfo:

  $ cat /proc/vmallocinfo | awk '{ print $3 }' | sort | uniq
  __pci_enable_msix_range+0x32b/0x560
  acpi_os_map_iomem+0x22d/0x250
  bpf_prog_alloc_no_stats+0x34/0x140
  fork_idle+0x79/0x120
  gen_pool_add_owner+0x3e/0xb0          ?
  hpet_enable+0xbf/0x470
  irq_init_percpu_irqstack+0x129/0x160
  kernel_clone+0xab/0x3b0
  memremap+0x164/0x330
  n_tty_open+0x18/0xa0
  pcpu_create_chunk+0x4e/0x1b0
  pcpu_create_chunk+0x75/0x1b0
  pcpu_get_vm_areas+0x0/0x1100
  unpurged
  vp_modern_map_capability+0x140/0x270
  zisofs_init+0x16/0x30

I've categorized these as:

a) Device or percpu mappings

   For these to be unmapped, the device (or CPU) has to be removed and an
   eventual IRQ freed. Even if the IRQ isn't freed, context tracking entry
   happens before handling the IRQ itself, per irqentry_enter() usage.

   __pci_enable_msix_range()
   acpi_os_map_iomem()
   irq_init_percpu_irqstack() (not even unmapped when CPU is hot-unplugged!)
   memremap()
   n_tty_open()
   pcpu_create_chunk()
   pcpu_get_vm_areas()
   vp_modern_map_capability()

b) CONFIG_VMAP_STACK

  fork_idle() & kernel_clone()

  vmalloc'd kernel stacks are AFAICT a safe example, as a task running in
  userspace needs to enter kernelspace to execute do_exit() before its
  stack can be vfree'd.

c) Non-issues

  bpf_prog_alloc_no_stats() - early entry is noinstr, no BPF!
  hpet_enable() - hpet_clear_mapping() is only called if __init function
		  fails, no runtime worries
  zisofs_init () - used for zisofs block decompression, that's way past
		   context tracking entry

d) I'm not sure, have a look?

  gen_pool_add_owner() - AIUI this is mainly for PCI / DMA stuff, which
			 again I wouldn't expect to be accessed before
			 context tracking entry.

Changes
======

Blindly deferring any and all flush of the kernel mappings is a risky move,
so introduce a variant of flush_tlb_kernel_range() that explicitly allows
deferral. Use it for vunmap flushes.

Note that while flush_tlb_kernel_range() may end up issuing a full
flush (including user mappings), this only happens when reaching a
invalidation range threshold where it is cheaper to do a full flush than to
individually invalidate each page in the range via INVLPG. IOW, it doesn't
*require* invalidating user mappings, and thus remains safe to defer until
a later kernel entry.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 +++
 arch/x86/include/asm/tlbflush.h              |  1 +
 arch/x86/mm/tlb.c                            | 23 +++++++++++--
 include/linux/context_tracking_work.h        |  4 ++-
 mm/vmalloc.c                                 | 35 +++++++++++++++++---
 5 files changed, 58 insertions(+), 9 deletions(-)

Comments

Jann Horn Jan. 14, 2025, 6:16 p.m. UTC | #1
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> flush_tlb_kernel_range() IPIs.
>
> Given that CPUs executing in userspace do not access data in the vmalloc
> range, these IPIs could be deferred until their next kernel entry.
>
> Deferral vs early entry danger zone
> ===================================
>
> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> and then accessed in early entry code.

In other words, it needs a guarantee that no vmalloc allocations that
have been created in the vmalloc region while the CPU was idle can
then be accessed during early entry, right?
Valentin Schneider Jan. 17, 2025, 3:25 p.m. UTC | #2
On 14/01/25 19:16, Jann Horn wrote:
> On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> flush_tlb_kernel_range() IPIs.
>>
>> Given that CPUs executing in userspace do not access data in the vmalloc
>> range, these IPIs could be deferred until their next kernel entry.
>>
>> Deferral vs early entry danger zone
>> ===================================
>>
>> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> and then accessed in early entry code.
>
> In other words, it needs a guarantee that no vmalloc allocations that
> have been created in the vmalloc region while the CPU was idle can
> then be accessed during early entry, right?

I'm not sure if that would be a problem (not an mm expert, please do
correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
deferred anyway.

So after vmapping something, I wouldn't expect isolated CPUs to have
invalid TLB entries for the newly vmapped page.

However, upon vunmap'ing something, the TLB flush is deferred, and thus
stale TLB entries can and will remain on isolated CPUs, up until they
execute the deferred flush themselves (IOW for the entire duration of the
"danger zone").

Does that make sense?
Jann Horn Jan. 17, 2025, 3:52 p.m. UTC | #3
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
> On 14/01/25 19:16, Jann Horn wrote:
> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> flush_tlb_kernel_range() IPIs.
> >>
> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> range, these IPIs could be deferred until their next kernel entry.
> >>
> >> Deferral vs early entry danger zone
> >> ===================================
> >>
> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> and then accessed in early entry code.
> >
> > In other words, it needs a guarantee that no vmalloc allocations that
> > have been created in the vmalloc region while the CPU was idle can
> > then be accessed during early entry, right?
>
> I'm not sure if that would be a problem (not an mm expert, please do
> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> deferred anyway.

flush_cache_vmap() is about stuff like flushing data caches on
architectures with virtually indexed caches; that doesn't do TLB
maintenance. When you look for its definition on x86 or arm64, you'll
see that they use the generic implementation which is simply an empty
inline function.

> So after vmapping something, I wouldn't expect isolated CPUs to have
> invalid TLB entries for the newly vmapped page.
>
> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> stale TLB entries can and will remain on isolated CPUs, up until they
> execute the deferred flush themselves (IOW for the entire duration of the
> "danger zone").
>
> Does that make sense?

The design idea wrt TLB flushes in the vmap code is that you don't do
TLB flushes when you unmap stuff or when you map stuff, because doing
TLB flushes across the entire system on every vmap/vunmap would be a
bit costly; instead you just do batched TLB flushes in between, in
__purge_vmap_area_lazy().

In other words, the basic idea is that you can keep calling vmap() and
vunmap() a bunch of times without ever doing TLB flushes until you run
out of virtual memory in the vmap region; then you do one big TLB
flush, and afterwards you can reuse the free virtual address space for
new allocations again.

So if you "defer" that batched TLB flush for CPUs that are not
currently running in the kernel, I think the consequence is that those
CPUs may end up with incoherent TLB state after a reallocation of the
virtual address space.

Actually, I think this would mean that your optimization is disallowed
at least on arm64 - I'm not sure about the exact wording, but arm64
has a "break before make" rule that forbids conflicting writable
address translations or something like that.

(I said "until you run out of virtual memory in the vmap region", but
that's not actually true - see the comment above lazy_max_pages() for
an explanation of the actual heuristic. You might be able to tune that
a bit if you'd be significantly happier with less frequent
interruptions, or something along those lines.)
Uladzislau Rezki Jan. 17, 2025, 4:11 p.m. UTC | #4
On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
> On 14/01/25 19:16, Jann Horn wrote:
> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> flush_tlb_kernel_range() IPIs.
> >>
> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> range, these IPIs could be deferred until their next kernel entry.
> >>
> >> Deferral vs early entry danger zone
> >> ===================================
> >>
> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> and then accessed in early entry code.
> >
> > In other words, it needs a guarantee that no vmalloc allocations that
> > have been created in the vmalloc region while the CPU was idle can
> > then be accessed during early entry, right?
> 
> I'm not sure if that would be a problem (not an mm expert, please do
> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> deferred anyway.
> 
> So after vmapping something, I wouldn't expect isolated CPUs to have
> invalid TLB entries for the newly vmapped page.
> 
> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> stale TLB entries can and will remain on isolated CPUs, up until they
> execute the deferred flush themselves (IOW for the entire duration of the
> "danger zone").
> 
> Does that make sense?
> 
Probably i am missing something and need to have a look at your patches,
but how do you guarantee that no-one map same are that you defer for TLB
flushing?

As noted by Jann, we already defer a TLB flushing by backing freed areas
until certain threshold and just after we cross it we do a flush.

--
Uladzislau Rezki
Valentin Schneider Jan. 17, 2025, 4:53 p.m. UTC | #5
On 17/01/25 16:52, Jann Horn wrote:
> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>
> flush_cache_vmap() is about stuff like flushing data caches on
> architectures with virtually indexed caches; that doesn't do TLB
> maintenance. When you look for its definition on x86 or arm64, you'll
> see that they use the generic implementation which is simply an empty
> inline function.
>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>
> The design idea wrt TLB flushes in the vmap code is that you don't do
> TLB flushes when you unmap stuff or when you map stuff, because doing
> TLB flushes across the entire system on every vmap/vunmap would be a
> bit costly; instead you just do batched TLB flushes in between, in
> __purge_vmap_area_lazy().
>
> In other words, the basic idea is that you can keep calling vmap() and
> vunmap() a bunch of times without ever doing TLB flushes until you run
> out of virtual memory in the vmap region; then you do one big TLB
> flush, and afterwards you can reuse the free virtual address space for
> new allocations again.
>
> So if you "defer" that batched TLB flush for CPUs that are not
> currently running in the kernel, I think the consequence is that those
> CPUs may end up with incoherent TLB state after a reallocation of the
> virtual address space.
>

Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc
that occurred while an isolated CPU was NOHZ-FULL can be an issue if said
CPU accesses it during early entry;

> Actually, I think this would mean that your optimization is disallowed
> at least on arm64 - I'm not sure about the exact wording, but arm64
> has a "break before make" rule that forbids conflicting writable
> address translations or something like that.
>

On the bright side of things, arm64 is not as bad as x86 when it comes to
IPI'ing isolated CPUs :-) I'll add that to my notes, thanks!

> (I said "until you run out of virtual memory in the vmap region", but
> that's not actually true - see the comment above lazy_max_pages() for
> an explanation of the actual heuristic. You might be able to tune that
> a bit if you'd be significantly happier with less frequent
> interruptions, or something along those lines.)
Valentin Schneider Jan. 17, 2025, 5 p.m. UTC | #6
On 17/01/25 17:11, Uladzislau Rezki wrote:
> On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
>> On 14/01/25 19:16, Jann Horn wrote:
>> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> flush_tlb_kernel_range() IPIs.
>> >>
>> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> range, these IPIs could be deferred until their next kernel entry.
>> >>
>> >> Deferral vs early entry danger zone
>> >> ===================================
>> >>
>> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> and then accessed in early entry code.
>> >
>> > In other words, it needs a guarantee that no vmalloc allocations that
>> > have been created in the vmalloc region while the CPU was idle can
>> > then be accessed during early entry, right?
>>
>> I'm not sure if that would be a problem (not an mm expert, please do
>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> deferred anyway.
>>
>> So after vmapping something, I wouldn't expect isolated CPUs to have
>> invalid TLB entries for the newly vmapped page.
>>
>> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> stale TLB entries can and will remain on isolated CPUs, up until they
>> execute the deferred flush themselves (IOW for the entire duration of the
>> "danger zone").
>>
>> Does that make sense?
>>
> Probably i am missing something and need to have a look at your patches,
> but how do you guarantee that no-one map same are that you defer for TLB
> flushing?
>

That's the cool part: I don't :')

For deferring instruction patching IPIs, I (well Josh really) managed to
get instrumentation to back me up and catch any problematic area.

I looked into getting something similar for vmalloc region access in
.noinstr code, but I didn't get anywhere. I even tried using emulated
watchpoints on QEMU to watch the whole vmalloc range, but that went about
as well as you could expect.

That left me with staring at code. AFAICT the only vmap'd thing that is
accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
itself cannot be freed until the task exits - thus can't be subject to
invalidation when a task is entering kernelspace.

If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).

> As noted by Jann, we already defer a TLB flushing by backing freed areas
> until certain threshold and just after we cross it we do a flush.
>
> --
> Uladzislau Rezki
Uladzislau Rezki Jan. 20, 2025, 11:15 a.m. UTC | #7
On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
> On 17/01/25 17:11, Uladzislau Rezki wrote:
> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
> >> On 14/01/25 19:16, Jann Horn wrote:
> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
> >> >> flush_tlb_kernel_range() IPIs.
> >> >>
> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
> >> >> range, these IPIs could be deferred until their next kernel entry.
> >> >>
> >> >> Deferral vs early entry danger zone
> >> >> ===================================
> >> >>
> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
> >> >> and then accessed in early entry code.
> >> >
> >> > In other words, it needs a guarantee that no vmalloc allocations that
> >> > have been created in the vmalloc region while the CPU was idle can
> >> > then be accessed during early entry, right?
> >>
> >> I'm not sure if that would be a problem (not an mm expert, please do
> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
> >> deferred anyway.
> >>
> >> So after vmapping something, I wouldn't expect isolated CPUs to have
> >> invalid TLB entries for the newly vmapped page.
> >>
> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
> >> stale TLB entries can and will remain on isolated CPUs, up until they
> >> execute the deferred flush themselves (IOW for the entire duration of the
> >> "danger zone").
> >>
> >> Does that make sense?
> >>
> > Probably i am missing something and need to have a look at your patches,
> > but how do you guarantee that no-one map same are that you defer for TLB
> > flushing?
> >
> 
> That's the cool part: I don't :')
> 
Indeed, sounds unsafe :) Then we just do not need to free areas.

> For deferring instruction patching IPIs, I (well Josh really) managed to
> get instrumentation to back me up and catch any problematic area.
> 
> I looked into getting something similar for vmalloc region access in
> .noinstr code, but I didn't get anywhere. I even tried using emulated
> watchpoints on QEMU to watch the whole vmalloc range, but that went about
> as well as you could expect.
> 
> That left me with staring at code. AFAICT the only vmap'd thing that is
> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
> itself cannot be freed until the task exits - thus can't be subject to
> invalidation when a task is entering kernelspace.
> 
> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
> 
As noted before, we defer flushing for vmalloc. We have a lazy-threshold
which can be exposed(if you need it) over sysfs for tuning. So, we can add it.

--
Uladzislau Rezki
Valentin Schneider Jan. 20, 2025, 4:09 p.m. UTC | #8
On 20/01/25 12:15, Uladzislau Rezki wrote:
> On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote:
>> On 17/01/25 17:11, Uladzislau Rezki wrote:
>> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote:
>> >> On 14/01/25 19:16, Jann Horn wrote:
>> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote:
>> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of
>> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the
>> >> >> flush_tlb_kernel_range() IPIs.
>> >> >>
>> >> >> Given that CPUs executing in userspace do not access data in the vmalloc
>> >> >> range, these IPIs could be deferred until their next kernel entry.
>> >> >>
>> >> >> Deferral vs early entry danger zone
>> >> >> ===================================
>> >> >>
>> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd
>> >> >> and then accessed in early entry code.
>> >> >
>> >> > In other words, it needs a guarantee that no vmalloc allocations that
>> >> > have been created in the vmalloc region while the CPU was idle can
>> >> > then be accessed during early entry, right?
>> >>
>> >> I'm not sure if that would be a problem (not an mm expert, please do
>> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't
>> >> deferred anyway.
>> >>
>> >> So after vmapping something, I wouldn't expect isolated CPUs to have
>> >> invalid TLB entries for the newly vmapped page.
>> >>
>> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus
>> >> stale TLB entries can and will remain on isolated CPUs, up until they
>> >> execute the deferred flush themselves (IOW for the entire duration of the
>> >> "danger zone").
>> >>
>> >> Does that make sense?
>> >>
>> > Probably i am missing something and need to have a look at your patches,
>> > but how do you guarantee that no-one map same are that you defer for TLB
>> > flushing?
>> >
>>
>> That's the cool part: I don't :')
>>
> Indeed, sounds unsafe :) Then we just do not need to free areas.
>
>> For deferring instruction patching IPIs, I (well Josh really) managed to
>> get instrumentation to back me up and catch any problematic area.
>>
>> I looked into getting something similar for vmalloc region access in
>> .noinstr code, but I didn't get anywhere. I even tried using emulated
>> watchpoints on QEMU to watch the whole vmalloc range, but that went about
>> as well as you could expect.
>>
>> That left me with staring at code. AFAICT the only vmap'd thing that is
>> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which
>> itself cannot be freed until the task exits - thus can't be subject to
>> invalidation when a task is entering kernelspace.
>>
>> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?).
>>
> As noted before, we defer flushing for vmalloc. We have a lazy-threshold
> which can be exposed(if you need it) over sysfs for tuning. So, we can add it.
>

In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
single userspace application that will never enter the kernel, unless
forced to by some interference (e.g. IPI sent from a housekeeping CPU).

Increasing the lazy threshold would unfortunately only delay the
interference - housekeeping CPUs are free to run whatever, and so they will
eventually cause the lazy threshold to be hit and IPI all the CPUs,
including the isolated/NOHZ_FULL ones.

I was thinking maybe we could subdivide the vmap space into two regions
with their own thresholds, but a task may allocate/vmap stuff while on a HK
CPU and be moved to an isolated CPU afterwards, and also I still don't have
any strong guarantee about what accesses an isolated CPU can do in its
early entry code :(

> --
> Uladzislau Rezki
Uladzislau Rezki Jan. 21, 2025, 5 p.m. UTC | #9
> >
> > As noted before, we defer flushing for vmalloc. We have a lazy-threshold
> > which can be exposed(if you need it) over sysfs for tuning. So, we can add it.
> >
> 
> In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
> single userspace application that will never enter the kernel, unless
> forced to by some interference (e.g. IPI sent from a housekeeping CPU).
> 
> Increasing the lazy threshold would unfortunately only delay the
> interference - housekeeping CPUs are free to run whatever, and so they will
> eventually cause the lazy threshold to be hit and IPI all the CPUs,
> including the isolated/NOHZ_FULL ones.
> 
Do you have any testing results for your workload? I mean how much
potentially we can allocate. Again, maybe it is just enough to back
and once per-hour offload it.

Apart of that how critical IPIing CPUs affect your workloads?

> I was thinking maybe we could subdivide the vmap space into two regions
> with their own thresholds, but a task may allocate/vmap stuff while on a HK
> CPU and be moved to an isolated CPU afterwards, and also I still don't have
> any strong guarantee about what accesses an isolated CPU can do in its
> early entry code :(
> 
I agree this is not worth to play with a vmap space in terms of splitting it.

Sorry for later answer and thank you!

--
Uladzislau Rezki
Valentin Schneider Jan. 24, 2025, 3:22 p.m. UTC | #10
On 21/01/25 18:00, Uladzislau Rezki wrote:
>> >
>> > As noted before, we defer flushing for vmalloc. We have a lazy-threshold
>> > which can be exposed(if you need it) over sysfs for tuning. So, we can add it.
>> >
>>
>> In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a
>> single userspace application that will never enter the kernel, unless
>> forced to by some interference (e.g. IPI sent from a housekeeping CPU).
>>
>> Increasing the lazy threshold would unfortunately only delay the
>> interference - housekeeping CPUs are free to run whatever, and so they will
>> eventually cause the lazy threshold to be hit and IPI all the CPUs,
>> including the isolated/NOHZ_FULL ones.
>>
> Do you have any testing results for your workload? I mean how much
> potentially we can allocate. Again, maybe it is just enough to back
> and once per-hour offload it.
>

Potentially as much as you want... In our Openshift environments, you can
get any sort of container executing on the housekeeping CPUs and they're
free to do pretty much whatever they want. Per CPU isolation they're not
allowed/meant to disturb isolated CPUs, however.

> Apart of that how critical IPIing CPUs affect your workloads?
>

If I'm being pedantic, a single IPI to an isolated CPU breaks the
isolation. If we can't quiesce IPIs to isolated CPUs, then we can't
guarantee that whatever is running on the isolated CPUs is actually
isolated / shielded from third party interference.
diff mbox series

Patch

diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 485b32881fde5..da3d270289836 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@ 
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include <asm/sync_core.h>
+#include <asm/tlbflush.h>
 
 static __always_inline void arch_context_tracking_work(enum ct_work work)
 {
@@ -10,6 +11,9 @@  static __always_inline void arch_context_tracking_work(enum ct_work work)
 	case CT_WORK_SYNC:
 		sync_core();
 		break;
+	case CT_WORK_TLBI:
+		__flush_tlb_all();
+		break;
 	case CT_WORK_MAX:
 		WARN_ON_ONCE(true);
 	}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4d11396250999..6e690acc35e53 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -248,6 +248,7 @@  extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 119765772ab11..47fb437acf52a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@ 
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/context_tracking.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1042,6 +1043,11 @@  static void do_flush_tlb_all(void *info)
 	__flush_tlb_all();
 }
 
+static bool do_kernel_flush_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CT_WORK_TLBI);
+}
+
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1058,12 +1064,13 @@  static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end)
 {
 	/* Balance as user space task's flush, a bit conservative */
 	if (end == TLB_FLUSH_ALL ||
 	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info *info;
 
@@ -1071,13 +1078,23 @@  void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		info = get_flush_tlb_info(NULL, start, end, 0, false,
 					  TLB_GENERATION_INVALID);
 
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);
 
 		put_flush_tlb_info();
 		preempt_enable();
 	}
 }
 
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end);
+}
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 931ade1dcbcc2..1be60c64cdeca 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,14 @@ 
 #include <linux/bitops.h>
 
 enum {
-	CT_WORK_SYNC,
+	CT_WORK_SYNC_OFFSET,
+	CT_WORK_TLBI_OFFSET,
 	CT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
 	CT_WORK_SYNC     = BIT(CT_WORK_SYNC_OFFSET),
+	CT_WORK_TLBI     = BIT(CT_WORK_TLBI_OFFSET),
 	CT_WORK_MAX      = BIT(CT_WORK_MAX_OFFSET)
 };
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5c88d0e90c209..e8aad4d55e955 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -467,6 +467,31 @@  void vunmap_range_noflush(unsigned long start, unsigned long end)
 	__vunmap_range_noflush(start, end);
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+/*
+ * !!! BIG FAT WARNING !!!
+ *
+ * The CPU is free to cache any part of the paging hierarchy it wants at any
+ * time. It's also free to set accessed and dirty bits at any time, even for
+ * instructions that may never execute architecturally.
+ *
+ * This means that deferring a TLB flush affecting freed page-table-pages (IOW,
+ * keeping them in a CPU's paging hierarchy cache) is akin to dancing in a
+ * minefield.
+ *
+ * This isn't a problem for deferral of TLB flushes in vmalloc, because
+ * page-table-pages used for vmap() mappings are never freed - see how
+ * __vunmap_range_noflush() walks the whole mapping but only clears the leaf PTEs.
+ * If this ever changes, TLB flush deferral will cause misery.
+ */
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#else
+#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -480,7 +505,7 @@  void vunmap_range(unsigned long addr, unsigned long end)
 {
 	flush_cache_vunmap(addr, end);
 	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -2281,7 +2306,7 @@  static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 
 	nr_purge_nodes = cpumask_weight(&purge_nodes);
 	if (nr_purge_nodes > 0) {
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 
 		/* One extra worker is per a lazy_max_pages() full set minus one. */
 		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
@@ -2384,7 +2409,7 @@  static void free_unmap_vmap_area(struct vmap_area *va)
 	flush_cache_vunmap(va->va_start, va->va_end);
 	vunmap_range_noflush(va->va_start, va->va_end);
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
 
 	free_vmap_area_noflush(va);
 }
@@ -2832,7 +2857,7 @@  static void vb_free(unsigned long addr, unsigned long size)
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);
 
 	spin_lock(&vb->lock);
 
@@ -2897,7 +2922,7 @@  static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	free_purged_blocks(&purge_list);
 
 	if (!__purge_vmap_area_lazy(start, end, false) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }