Message ID | 20250114175143.81438-30-vschneid@redhat.com (mailing list archive) |
---|---|
State | New |
Series | context_tracking,x86: Defer some IPIs until a user->kernel transition |
On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: > vunmap()'s issued from housekeeping CPUs are a relatively common source of > interference for isolated NOHZ_FULL CPUs, as they are hit by the > flush_tlb_kernel_range() IPIs. > > Given that CPUs executing in userspace do not access data in the vmalloc > range, these IPIs could be deferred until their next kernel entry. > > Deferral vs early entry danger zone > =================================== > > This requires a guarantee that nothing in the vmalloc range can be vunmap'd > and then accessed in early entry code. In other words, it needs a guarantee that no vmalloc allocations that have been created in the vmalloc region while the CPU was idle can then be accessed during early entry, right?
On 14/01/25 19:16, Jann Horn wrote: > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: >> vunmap()'s issued from housekeeping CPUs are a relatively common source of >> interference for isolated NOHZ_FULL CPUs, as they are hit by the >> flush_tlb_kernel_range() IPIs. >> >> Given that CPUs executing in userspace do not access data in the vmalloc >> range, these IPIs could be deferred until their next kernel entry. >> >> Deferral vs early entry danger zone >> =================================== >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd >> and then accessed in early entry code. > > In other words, it needs a guarantee that no vmalloc allocations that > have been created in the vmalloc region while the CPU was idle can > then be accessed during early entry, right? I'm not sure if that would be a problem (not an mm expert, please do correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't deferred anyway. So after vmapping something, I wouldn't expect isolated CPUs to have invalid TLB entries for the newly vmapped page. However, upon vunmap'ing something, the TLB flush is deferred, and thus stale TLB entries can and will remain on isolated CPUs, up until they execute the deferred flush themselves (IOW for the entire duration of the "danger zone"). Does that make sense?
On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote: > On 14/01/25 19:16, Jann Horn wrote: > > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: > >> vunmap()'s issued from housekeeping CPUs are a relatively common source of > >> interference for isolated NOHZ_FULL CPUs, as they are hit by the > >> flush_tlb_kernel_range() IPIs. > >> > >> Given that CPUs executing in userspace do not access data in the vmalloc > >> range, these IPIs could be deferred until their next kernel entry. > >> > >> Deferral vs early entry danger zone > >> =================================== > >> > >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd > >> and then accessed in early entry code. > > > > In other words, it needs a guarantee that no vmalloc allocations that > > have been created in the vmalloc region while the CPU was idle can > > then be accessed during early entry, right? > > I'm not sure if that would be a problem (not an mm expert, please do > correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't > deferred anyway. flush_cache_vmap() is about stuff like flushing data caches on architectures with virtually indexed caches; that doesn't do TLB maintenance. When you look for its definition on x86 or arm64, you'll see that they use the generic implementation which is simply an empty inline function. > So after vmapping something, I wouldn't expect isolated CPUs to have > invalid TLB entries for the newly vmapped page. > > However, upon vunmap'ing something, the TLB flush is deferred, and thus > stale TLB entries can and will remain on isolated CPUs, up until they > execute the deferred flush themselves (IOW for the entire duration of the > "danger zone"). > > Does that make sense? The design idea wrt TLB flushes in the vmap code is that you don't do TLB flushes when you unmap stuff or when you map stuff, because doing TLB flushes across the entire system on every vmap/vunmap would be a bit costly; instead you just do batched TLB flushes in between, in __purge_vmap_area_lazy(). In other words, the basic idea is that you can keep calling vmap() and vunmap() a bunch of times without ever doing TLB flushes until you run out of virtual memory in the vmap region; then you do one big TLB flush, and afterwards you can reuse the free virtual address space for new allocations again. So if you "defer" that batched TLB flush for CPUs that are not currently running in the kernel, I think the consequence is that those CPUs may end up with incoherent TLB state after a reallocation of the virtual address space. Actually, I think this would mean that your optimization is disallowed at least on arm64 - I'm not sure about the exact wording, but arm64 has a "break before make" rule that forbids conflicting writable address translations or something like that. (I said "until you run out of virtual memory in the vmap region", but that's not actually true - see the comment above lazy_max_pages() for an explanation of the actual heuristic. You might be able to tune that a bit if you'd be significantly happier with less frequent interruptions, or something along those lines.)
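For readers following the thread, the batching described above can be modelled with a stripped-down userspace toy program (plain C, compilable as-is on a 64-bit system). The names flush_tlb_kernel_range() and the lazy threshold mirror the real machinery in mm/vmalloc.c (__purge_vmap_area_lazy(), lazy_max_pages()), but the function bodies and constants below are purely illustrative:

#include <stdio.h>

#define PAGE_SIZE      4096UL
#define LAZY_MAX_PAGES (32UL * 1024 * 1024 / PAGE_SIZE)   /* stand-in threshold */

static unsigned long vmap_lazy_nr;                  /* pages unmapped, flush still pending */
static unsigned long purge_start = ~0UL, purge_end;

/* Stand-in for the kernel's flush_tlb_kernel_range(): the IPI being discussed. */
static void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	printf("IPI all CPUs: flush kernel TLB for [%#lx, %#lx)\n", start, end);
}

/* Roughly what vunmap()/vfree() do: the PTEs go away now, the TLB flush happens later. */
static void lazy_unmap(unsigned long addr, unsigned long npages)
{
	unsigned long end = addr + npages * PAGE_SIZE;

	vmap_lazy_nr += npages;
	if (addr < purge_start)
		purge_start = addr;
	if (end > purge_end)
		purge_end = end;

	if (vmap_lazy_nr > LAZY_MAX_PAGES) {
		/* One batched flush; only after this can the freed VA space be reused. */
		flush_tlb_kernel_range(purge_start, purge_end);
		vmap_lazy_nr = 0;
		purge_start = ~0UL;
		purge_end = 0;
	}
}

int main(void)
{
	/* Many vunmap()s happen without any TLB flush at all... */
	for (unsigned long i = 0; i < 20000; i++)
		lazy_unmap(0xffffc90000000000UL + i * PAGE_SIZE, 1);
	/* ...then one big flush each time the lazy threshold is crossed. */
	return 0;
}

The window being debated is exactly the gap between lazy_unmap() tearing down the PTEs and the batched flush reaching a given CPU: a CPU whose flush is deferred can keep a stale translation for a range that another CPU has already reused.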
On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote: > On 14/01/25 19:16, Jann Horn wrote: > > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: > >> vunmap()'s issued from housekeeping CPUs are a relatively common source of > >> interference for isolated NOHZ_FULL CPUs, as they are hit by the > >> flush_tlb_kernel_range() IPIs. > >> > >> Given that CPUs executing in userspace do not access data in the vmalloc > >> range, these IPIs could be deferred until their next kernel entry. > >> > >> Deferral vs early entry danger zone > >> =================================== > >> > >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd > >> and then accessed in early entry code. > > > > In other words, it needs a guarantee that no vmalloc allocations that > > have been created in the vmalloc region while the CPU was idle can > > then be accessed during early entry, right? > > I'm not sure if that would be a problem (not an mm expert, please do > correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't > deferred anyway. > > So after vmapping something, I wouldn't expect isolated CPUs to have > invalid TLB entries for the newly vmapped page. > > However, upon vunmap'ing something, the TLB flush is deferred, and thus > stale TLB entries can and will remain on isolated CPUs, up until they > execute the deferred flush themselves (IOW for the entire duration of the > "danger zone"). > > Does that make sense? > Probably I am missing something and need to have a look at your patches, but how do you guarantee that nobody maps the same area that you have deferred the TLB flush for? As noted by Jann, we already defer TLB flushing by accumulating freed areas up to a certain threshold, and just after we cross it we do a flush. -- Uladzislau Rezki
On 17/01/25 16:52, Jann Horn wrote: > On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote: >> On 14/01/25 19:16, Jann Horn wrote: >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the >> >> flush_tlb_kernel_range() IPIs. >> >> >> >> Given that CPUs executing in userspace do not access data in the vmalloc >> >> range, these IPIs could be deferred until their next kernel entry. >> >> >> >> Deferral vs early entry danger zone >> >> =================================== >> >> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd >> >> and then accessed in early entry code. >> > >> > In other words, it needs a guarantee that no vmalloc allocations that >> > have been created in the vmalloc region while the CPU was idle can >> > then be accessed during early entry, right? >> >> I'm not sure if that would be a problem (not an mm expert, please do >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't >> deferred anyway. > > flush_cache_vmap() is about stuff like flushing data caches on > architectures with virtually indexed caches; that doesn't do TLB > maintenance. When you look for its definition on x86 or arm64, you'll > see that they use the generic implementation which is simply an empty > inline function. > >> So after vmapping something, I wouldn't expect isolated CPUs to have >> invalid TLB entries for the newly vmapped page. >> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus >> stale TLB entries can and will remain on isolated CPUs, up until they >> execute the deferred flush themselves (IOW for the entire duration of the >> "danger zone"). >> >> Does that make sense? > > The design idea wrt TLB flushes in the vmap code is that you don't do > TLB flushes when you unmap stuff or when you map stuff, because doing > TLB flushes across the entire system on every vmap/vunmap would be a > bit costly; instead you just do batched TLB flushes in between, in > __purge_vmap_area_lazy(). > > In other words, the basic idea is that you can keep calling vmap() and > vunmap() a bunch of times without ever doing TLB flushes until you run > out of virtual memory in the vmap region; then you do one big TLB > flush, and afterwards you can reuse the free virtual address space for > new allocations again. > > So if you "defer" that batched TLB flush for CPUs that are not > currently running in the kernel, I think the consequence is that those > CPUs may end up with incoherent TLB state after a reallocation of the > virtual address space. > Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc that occurred while an isolated CPU was NOHZ-FULL can be an issue if said CPU accesses it during early entry; > Actually, I think this would mean that your optimization is disallowed > at least on arm64 - I'm not sure about the exact wording, but arm64 > has a "break before make" rule that forbids conflicting writable > address translations or something like that. > On the bright side of things, arm64 is not as bad as x86 when it comes to IPI'ing isolated CPUs :-) I'll add that to my notes, thanks! > (I said "until you run out of virtual memory in the vmap region", but > that's not actually true - see the comment above lazy_max_pages() for > an explanation of the actual heuristic. 
> You might be able to tune that > a bit if you'd be significantly happier with less frequent > interruptions, or something along those lines.)
On 17/01/25 17:11, Uladzislau Rezki wrote: > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote: >> On 14/01/25 19:16, Jann Horn wrote: >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the >> >> flush_tlb_kernel_range() IPIs. >> >> >> >> Given that CPUs executing in userspace do not access data in the vmalloc >> >> range, these IPIs could be deferred until their next kernel entry. >> >> >> >> Deferral vs early entry danger zone >> >> =================================== >> >> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd >> >> and then accessed in early entry code. >> > >> > In other words, it needs a guarantee that no vmalloc allocations that >> > have been created in the vmalloc region while the CPU was idle can >> > then be accessed during early entry, right? >> >> I'm not sure if that would be a problem (not an mm expert, please do >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't >> deferred anyway. >> >> So after vmapping something, I wouldn't expect isolated CPUs to have >> invalid TLB entries for the newly vmapped page. >> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus >> stale TLB entries can and will remain on isolated CPUs, up until they >> execute the deferred flush themselves (IOW for the entire duration of the >> "danger zone"). >> >> Does that make sense? >> > Probably i am missing something and need to have a look at your patches, > but how do you guarantee that no-one map same are that you defer for TLB > flushing? > That's the cool part: I don't :') For deferring instruction patching IPIs, I (well Josh really) managed to get instrumentation to back me up and catch any problematic area. I looked into getting something similar for vmalloc region access in .noinstr code, but I didn't get anywhere. I even tried using emulated watchpoints on QEMU to watch the whole vmalloc range, but that went about as well as you could expect. That left me with staring at code. AFAICT the only vmap'd thing that is accessed during early entry is the task stack (CONFIG_VMAP_STACK), which itself cannot be freed until the task exits - thus can't be subject to invalidation when a task is entering kernelspace. If you have any tracing/instrumentation suggestions, I'm all ears (eyes?). > As noted by Jann, we already defer a TLB flushing by backing freed areas > until certain threshold and just after we cross it we do a flush. > > -- > Uladzislau Rezki
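For what it's worth, a per-site assertion is easy to write, but it illustrates exactly why this is hard to verify with tooling: it only covers accesses that someone already knew to annotate. In the hypothetical sketch below, is_vmalloc_addr(), this_cpu_read() and WARN_ON_ONCE() are real kernel helpers, while the per-CPU "flush deferred" flag and the idea of sprinkling this over noinstr early-entry code are assumptions made up purely for illustration:

/*
 * Hypothetical debug check, not from the series: it catches a vmalloc-range
 * access only at call sites that are explicitly annotated, which is the
 * opposite of what is needed here, and a real version would also have to
 * be noinstr-safe.
 */
static __always_inline void check_no_stale_vmalloc_access(const void *addr)
{
	/* kernel_tlb_flush_deferred is an invented per-CPU flag, not a real symbol. */
	WARN_ON_ONCE(this_cpu_read(kernel_tlb_flush_deferred) &&
		     is_vmalloc_addr(addr));
}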
On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote: > On 17/01/25 17:11, Uladzislau Rezki wrote: > > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote: > >> On 14/01/25 19:16, Jann Horn wrote: > >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: > >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of > >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the > >> >> flush_tlb_kernel_range() IPIs. > >> >> > >> >> Given that CPUs executing in userspace do not access data in the vmalloc > >> >> range, these IPIs could be deferred until their next kernel entry. > >> >> > >> >> Deferral vs early entry danger zone > >> >> =================================== > >> >> > >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd > >> >> and then accessed in early entry code. > >> > > >> > In other words, it needs a guarantee that no vmalloc allocations that > >> > have been created in the vmalloc region while the CPU was idle can > >> > then be accessed during early entry, right? > >> > >> I'm not sure if that would be a problem (not an mm expert, please do > >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't > >> deferred anyway. > >> > >> So after vmapping something, I wouldn't expect isolated CPUs to have > >> invalid TLB entries for the newly vmapped page. > >> > >> However, upon vunmap'ing something, the TLB flush is deferred, and thus > >> stale TLB entries can and will remain on isolated CPUs, up until they > >> execute the deferred flush themselves (IOW for the entire duration of the > >> "danger zone"). > >> > >> Does that make sense? > >> > > Probably i am missing something and need to have a look at your patches, > > but how do you guarantee that no-one map same are that you defer for TLB > > flushing? > > > > That's the cool part: I don't :') > Indeed, sounds unsafe :) Then we just do not need to free areas. > For deferring instruction patching IPIs, I (well Josh really) managed to > get instrumentation to back me up and catch any problematic area. > > I looked into getting something similar for vmalloc region access in > .noinstr code, but I didn't get anywhere. I even tried using emulated > watchpoints on QEMU to watch the whole vmalloc range, but that went about > as well as you could expect. > > That left me with staring at code. AFAICT the only vmap'd thing that is > accessed during early entry is the task stack (CONFIG_VMAP_STACK), which > itself cannot be freed until the task exits - thus can't be subject to > invalidation when a task is entering kernelspace. > > If you have any tracing/instrumentation suggestions, I'm all ears (eyes?). > As noted before, we defer flushing for vmalloc. We have a lazy-threshold which can be exposed(if you need it) over sysfs for tuning. So, we can add it. -- Uladzislau Rezki
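For context, the "lazy-threshold" mentioned here is computed by lazy_max_pages() in mm/vmalloc.c. Paraphrasing current kernels from memory (double-check the source tree), the heuristic scales with the CPU count roughly as follows:

/* Paraphrase of mm/vmalloc.c:lazy_max_pages(); see the comment above the
 * real function for the rationale behind the numbers. */
static unsigned long lazy_max_pages(void)
{
	unsigned int log = fls(num_online_cpus());

	/* roughly 32MB worth of lazily-freed pages per log2(nr_cpus) step */
	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
}

A sysfs knob would let that threshold be stretched, delaying the batched flush_tlb_kernel_range() and its IPIs, but not removing them.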
On 20/01/25 12:15, Uladzislau Rezki wrote: > On Fri, Jan 17, 2025 at 06:00:30PM +0100, Valentin Schneider wrote: >> On 17/01/25 17:11, Uladzislau Rezki wrote: >> > On Fri, Jan 17, 2025 at 04:25:45PM +0100, Valentin Schneider wrote: >> >> On 14/01/25 19:16, Jann Horn wrote: >> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: >> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of >> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the >> >> >> flush_tlb_kernel_range() IPIs. >> >> >> >> >> >> Given that CPUs executing in userspace do not access data in the vmalloc >> >> >> range, these IPIs could be deferred until their next kernel entry. >> >> >> >> >> >> Deferral vs early entry danger zone >> >> >> =================================== >> >> >> >> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd >> >> >> and then accessed in early entry code. >> >> > >> >> > In other words, it needs a guarantee that no vmalloc allocations that >> >> > have been created in the vmalloc region while the CPU was idle can >> >> > then be accessed during early entry, right? >> >> >> >> I'm not sure if that would be a problem (not an mm expert, please do >> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't >> >> deferred anyway. >> >> >> >> So after vmapping something, I wouldn't expect isolated CPUs to have >> >> invalid TLB entries for the newly vmapped page. >> >> >> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus >> >> stale TLB entries can and will remain on isolated CPUs, up until they >> >> execute the deferred flush themselves (IOW for the entire duration of the >> >> "danger zone"). >> >> >> >> Does that make sense? >> >> >> > Probably i am missing something and need to have a look at your patches, >> > but how do you guarantee that no-one map same are that you defer for TLB >> > flushing? >> > >> >> That's the cool part: I don't :') >> > Indeed, sounds unsafe :) Then we just do not need to free areas. > >> For deferring instruction patching IPIs, I (well Josh really) managed to >> get instrumentation to back me up and catch any problematic area. >> >> I looked into getting something similar for vmalloc region access in >> .noinstr code, but I didn't get anywhere. I even tried using emulated >> watchpoints on QEMU to watch the whole vmalloc range, but that went about >> as well as you could expect. >> >> That left me with staring at code. AFAICT the only vmap'd thing that is >> accessed during early entry is the task stack (CONFIG_VMAP_STACK), which >> itself cannot be freed until the task exits - thus can't be subject to >> invalidation when a task is entering kernelspace. >> >> If you have any tracing/instrumentation suggestions, I'm all ears (eyes?). >> > As noted before, we defer flushing for vmalloc. We have a lazy-threshold > which can be exposed(if you need it) over sysfs for tuning. So, we can add it. > In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a single userspace application that will never enter the kernel, unless forced to by some interference (e.g. IPI sent from a housekeeping CPU). Increasing the lazy threshold would unfortunately only delay the interference - housekeeping CPUs are free to run whatever, and so they will eventually cause the lazy threshold to be hit and IPI all the CPUs, including the isolated/NOHZ_FULL ones. 
I was thinking maybe we could subdivide the vmap space into two regions with their own thresholds, but a task may allocate/vmap stuff while on a HK CPU and be moved to an isolated CPU afterwards, and also I still don't have any strong guarantee about what accesses an isolated CPU can do in its early entry code :( > -- > Uladzislau Rezki
> > > > As noted before, we defer flushing for vmalloc. We have a lazy-threshold > > which can be exposed(if you need it) over sysfs for tuning. So, we can add it. > > > > In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a > single userspace application that will never enter the kernel, unless > forced to by some interference (e.g. IPI sent from a housekeeping CPU). > > Increasing the lazy threshold would unfortunately only delay the > interference - housekeeping CPUs are free to run whatever, and so they will > eventually cause the lazy threshold to be hit and IPI all the CPUs, > including the isolated/NOHZ_FULL ones. > Do you have any testing results for your workload? I mean, how much can we potentially allocate? Again, maybe it is enough to hold freed areas back and offload the flush once per hour. Apart from that, how badly does IPIing the CPUs affect your workloads? > I was thinking maybe we could subdivide the vmap space into two regions > with their own thresholds, but a task may allocate/vmap stuff while on a HK > CPU and be moved to an isolated CPU afterwards, and also I still don't have > any strong guarantee about what accesses an isolated CPU can do in its > early entry code :( > I agree it is not worth playing with the vmap space by splitting it like that. Sorry for the late answer, and thank you! -- Uladzislau Rezki
On 21/01/25 18:00, Uladzislau Rezki wrote: >> > >> > As noted before, we defer flushing for vmalloc. We have a lazy-threshold >> > which can be exposed(if you need it) over sysfs for tuning. So, we can add it. >> > >> >> In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a >> single userspace application that will never enter the kernel, unless >> forced to by some interference (e.g. IPI sent from a housekeeping CPU). >> >> Increasing the lazy threshold would unfortunately only delay the >> interference - housekeeping CPUs are free to run whatever, and so they will >> eventually cause the lazy threshold to be hit and IPI all the CPUs, >> including the isolated/NOHZ_FULL ones. >> > Do you have any testing results for your workload? I mean how much > potentially we can allocate. Again, maybe it is just enough to back > and once per-hour offload it. > Potentially as much as you want... In our Openshift environments, you can get any sort of container executing on the housekeeping CPUs and they're free to do pretty much whatever they want. Per CPU isolation they're not allowed/meant to disturb isolated CPUs, however. > Apart of that how critical IPIing CPUs affect your workloads? > If I'm being pedantic, a single IPI to an isolated CPU breaks the isolation. If we can't quiesce IPIs to isolated CPUs, then we can't guarantee that whatever is running on the isolated CPUs is actually isolated / shielded from third party interference.
On Fri, Jan 24, 2025 at 04:22:19PM +0100, Valentin Schneider wrote: > On 21/01/25 18:00, Uladzislau Rezki wrote: > >> > > >> > As noted before, we defer flushing for vmalloc. We have a lazy-threshold > >> > which can be exposed(if you need it) over sysfs for tuning. So, we can add it. > >> > > >> > >> In a CPU isolation / NOHZ_FULL context, isolated CPUs will be running a > >> single userspace application that will never enter the kernel, unless > >> forced to by some interference (e.g. IPI sent from a housekeeping CPU). > >> > >> Increasing the lazy threshold would unfortunately only delay the > >> interference - housekeeping CPUs are free to run whatever, and so they will > >> eventually cause the lazy threshold to be hit and IPI all the CPUs, > >> including the isolated/NOHZ_FULL ones. > >> > > Do you have any testing results for your workload? I mean how much > > potentially we can allocate. Again, maybe it is just enough to back > > and once per-hour offload it. > > > > Potentially as much as you want... In our Openshift environments, you can > get any sort of container executing on the housekeeping CPUs and they're > free to do pretty much whatever they want. Per CPU isolation they're not > allowed/meant to disturb isolated CPUs, however. > > > Apart of that how critical IPIing CPUs affect your workloads? > > > > If I'm being pedantic, a single IPI to an isolated CPU breaks the > isolation. If we can't quiesce IPIs to isolated CPUs, then we can't > guarantee that whatever is running on the isolated CPUs is actually > isolated / shielded from third party interference. > I see. I thought you were fixing some specific issue. I do not see a straightforward way to remove such "distortion". Perhaps we could block (prevent reuse of) the range whose flush is deferred, but that can also be problematic because of other constraints. Thanks! -- Uladzislau Rezki
On Fri, Jan 17, 2025 at 04:52:19PM +0100, Jann Horn wrote: > On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote: > > On 14/01/25 19:16, Jann Horn wrote: > > > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: > > >> vunmap()'s issued from housekeeping CPUs are a relatively common source of > > >> interference for isolated NOHZ_FULL CPUs, as they are hit by the > > >> flush_tlb_kernel_range() IPIs. > > >> > > >> Given that CPUs executing in userspace do not access data in the vmalloc > > >> range, these IPIs could be deferred until their next kernel entry. > > >> > > >> Deferral vs early entry danger zone > > >> =================================== > > >> > > >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd > > >> and then accessed in early entry code. > > > > > > In other words, it needs a guarantee that no vmalloc allocations that > > > have been created in the vmalloc region while the CPU was idle can > > > then be accessed during early entry, right? > > > > I'm not sure if that would be a problem (not an mm expert, please do > > correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't > > deferred anyway. > > flush_cache_vmap() is about stuff like flushing data caches on > architectures with virtually indexed caches; that doesn't do TLB > maintenance. When you look for its definition on x86 or arm64, you'll > see that they use the generic implementation which is simply an empty > inline function. > > > So after vmapping something, I wouldn't expect isolated CPUs to have > > invalid TLB entries for the newly vmapped page. > > > > However, upon vunmap'ing something, the TLB flush is deferred, and thus > > stale TLB entries can and will remain on isolated CPUs, up until they > > execute the deferred flush themselves (IOW for the entire duration of the > > "danger zone"). > > > > Does that make sense? > > The design idea wrt TLB flushes in the vmap code is that you don't do > TLB flushes when you unmap stuff or when you map stuff, because doing > TLB flushes across the entire system on every vmap/vunmap would be a > bit costly; instead you just do batched TLB flushes in between, in > __purge_vmap_area_lazy(). > > In other words, the basic idea is that you can keep calling vmap() and > vunmap() a bunch of times without ever doing TLB flushes until you run > out of virtual memory in the vmap region; then you do one big TLB > flush, and afterwards you can reuse the free virtual address space for > new allocations again. > > So if you "defer" that batched TLB flush for CPUs that are not > currently running in the kernel, I think the consequence is that those > CPUs may end up with incoherent TLB state after a reallocation of the > virtual address space. > > Actually, I think this would mean that your optimization is disallowed > at least on arm64 - I'm not sure about the exact wording, but arm64 > has a "break before make" rule that forbids conflicting writable > address translations or something like that. Yes, that would definitely be a problem. There's also the more obvious issue that the CnP ("Common not Private") feature of some Arm CPUs means that TLB entries can be shared between cores, so the whole idea of using a CPU's exception level to predicate invalidation is flawed on such a system. Will
On 17/01/25 16:52, Jann Horn wrote: > On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote: >> On 14/01/25 19:16, Jann Horn wrote: >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the >> >> flush_tlb_kernel_range() IPIs. >> >> >> >> Given that CPUs executing in userspace do not access data in the vmalloc >> >> range, these IPIs could be deferred until their next kernel entry. >> >> >> >> Deferral vs early entry danger zone >> >> =================================== >> >> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd >> >> and then accessed in early entry code. >> > >> > In other words, it needs a guarantee that no vmalloc allocations that >> > have been created in the vmalloc region while the CPU was idle can >> > then be accessed during early entry, right? >> >> I'm not sure if that would be a problem (not an mm expert, please do >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't >> deferred anyway. > > flush_cache_vmap() is about stuff like flushing data caches on > architectures with virtually indexed caches; that doesn't do TLB > maintenance. When you look for its definition on x86 or arm64, you'll > see that they use the generic implementation which is simply an empty > inline function. > >> So after vmapping something, I wouldn't expect isolated CPUs to have >> invalid TLB entries for the newly vmapped page. >> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus >> stale TLB entries can and will remain on isolated CPUs, up until they >> execute the deferred flush themselves (IOW for the entire duration of the >> "danger zone"). >> >> Does that make sense? > > The design idea wrt TLB flushes in the vmap code is that you don't do > TLB flushes when you unmap stuff or when you map stuff, because doing > TLB flushes across the entire system on every vmap/vunmap would be a > bit costly; instead you just do batched TLB flushes in between, in > __purge_vmap_area_lazy(). > > In other words, the basic idea is that you can keep calling vmap() and > vunmap() a bunch of times without ever doing TLB flushes until you run > out of virtual memory in the vmap region; then you do one big TLB > flush, and afterwards you can reuse the free virtual address space for > new allocations again. > > So if you "defer" that batched TLB flush for CPUs that are not > currently running in the kernel, I think the consequence is that those > CPUs may end up with incoherent TLB state after a reallocation of the > virtual address space. > > Actually, I think this would mean that your optimization is disallowed > at least on arm64 - I'm not sure about the exact wording, but arm64 > has a "break before make" rule that forbids conflicting writable > address translations or something like that. > > (I said "until you run out of virtual memory in the vmap region", but > that's not actually true - see the comment above lazy_max_pages() for > an explanation of the actual heuristic. You might be able to tune that > a bit if you'd be significantly happier with less frequent > interruptions, or something along those lines.) I've been thinking some more (this is your cue to grab a brown paper bag)... 
Experimentation (unmap the whole VMALLOC range upon return to userspace, see what explodes upon entry into the kernel) suggests that the early entry "danger zone" should only access the vmap'd stack, which itself isn't an issue. That is obviously just a test on one system configuration, and the problem I'm facing is trying to put in place /some/ form of instrumentation that would at the very least cause a warning for any future patch that would introduce a vmap'd access in early entry code. That, or a complete mitigation that prevents those accesses altogether. What if isolated CPUs unconditionally did a TLBi as late as possible in the stack right before returning to userspace? This would mean that upon re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel range translation - with the exception of whatever lies between the last-minute flush and the actual userspace entry, which should be feasible to vet? Then AFAICT there wouldn't be any work/flush to defer, and the IPI could be entirely silenced if it targets an isolated CPU.
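To make that proposal concrete, here is a minimal x86-flavoured sketch. Nothing like it exists in the series: the hook point and the cpu_is_isolated() check are assumptions, while __flush_tlb_all() is the existing x86 helper that drops the local CPU's cached translations, including global kernel mappings where supported. The replies that follow explain why even this would not be sufficient.

/*
 * Hypothetical sketch only: flush all of this CPU's kernel translations as
 * the very last step before returning to userspace on an isolated CPU, so
 * that a deferred flush_tlb_kernel_range() would have nothing left to do.
 */
static __always_inline void isolated_exit_to_user_flush(void)
{
	/* cpu_is_isolated() stands in for "is this a NOHZ_FULL isolated CPU?" */
	if (!cpu_is_isolated(smp_processor_id()))
		return;

	__flush_tlb_all();	/* local flush only; no IPIs involved */
}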
On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider <vschneid@redhat.com> wrote: > What if isolated CPUs unconditionally did a TLBi as late as possible in > the stack right before returning to userspace? This would mean that upon > re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel > range translation - with the exception of whatever lies between the > last-minute flush and the actual userspace entry, which should be feasible > to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI > could be entirely silenced if it targets an isolated CPU. Two issues with that: 1. I think the "Common not Private" feature Will Deacon referred to is incompatible with this idea: <https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private> says "When the CnP bit is set, the software promises to use the ASIDs and VMIDs in the same way on all processors, which allows the TLB entries that are created by one processor to be used by another" 2. It's wrong to assume that TLB entries are only populated for addresses you access - thanks to speculative execution, you have to assume that the CPU might be populating random TLB entries all over the place.
On 10/02/25 23:08, Jann Horn wrote: > On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider <vschneid@redhat.com> wrote: >> What if isolated CPUs unconditionally did a TLBi as late as possible in >> the stack right before returning to userspace? This would mean that upon >> re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel >> range translation - with the exception of whatever lies between the >> last-minute flush and the actual userspace entry, which should be feasible >> to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI >> could be entirely silenced if it targets an isolated CPU. > > Two issues with that: > Firstly, thank you for entertaining the idea :-) > 1. I think the "Common not Private" feature Will Deacon referred to is > incompatible with this idea: > <https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private> > says "When the CnP bit is set, the software promises to use the ASIDs > and VMIDs in the same way on all processors, which allows the TLB > entries that are created by one processor to be used by another" > Sorry for being obtuse - I can understand inconsistent TLB states (old vs new translations being present in separate TLBs) due to not sending the flush IPI causing an issue with that, but not "flushing early". Even if TLB entries can be shared/accessed between CPUs, a CPU should be allowed not to have a shared entry in its TLB - what am I missing? > 2. It's wrong to assume that TLB entries are only populated for > addresses you access - thanks to speculative execution, you have to > assume that the CPU might be populating random TLB entries all over > the place. Gotta love speculation. Now it is supposed to be limited to genuinely accessible data & code, right? Say theoretically we have a full TLBi as literally the last thing before doing the return-to-userspace, speculation should be limited to executing maybe bits of the return-from-userspace code? Furthermore, I would hope that once a CPU is executing in userspace, it's not going to populate the TLB with kernel address translations - AIUI the whole vulnerability mitigation debacle was about preventing this sort of thing.
On Tue, Feb 11, 2025 at 02:33:51PM +0100, Valentin Schneider wrote: > On 10/02/25 23:08, Jann Horn wrote: > > On Mon, Feb 10, 2025 at 7:36 PM Valentin Schneider <vschneid@redhat.com> wrote: > >> What if isolated CPUs unconditionally did a TLBi as late as possible in > >> the stack right before returning to userspace? This would mean that upon > >> re-entering the kernel, an isolated CPU's TLB wouldn't contain any kernel > >> range translation - with the exception of whatever lies between the > >> last-minute flush and the actual userspace entry, which should be feasible > >> to vet? Then AFAICT there wouldn't be any work/flush to defer, the IPI > >> could be entirely silenced if it targets an isolated CPU. > > > > Two issues with that: > > Firstly, thank you for entertaining the idea :-) > > > 1. I think the "Common not Private" feature Will Deacon referred to is > > incompatible with this idea: > > <https://developer.arm.com/documentation/101811/0104/Address-spaces/Common-not-Private> > > says "When the CnP bit is set, the software promises to use the ASIDs > > and VMIDs in the same way on all processors, which allows the TLB > > entries that are created by one processor to be used by another" > > Sorry for being obtuse - I can understand inconsistent TLB states (old vs > new translations being present in separate TLBs) due to not sending the > flush IPI causing an issue with that, but not "flushing early". Even if TLB > entries can be shared/accessed between CPUs, a CPU should be allowed not to > have a shared entry in its TLB - what am I missing? > > > 2. It's wrong to assume that TLB entries are only populated for > > addresses you access - thanks to speculative execution, you have to > > assume that the CPU might be populating random TLB entries all over > > the place. > > Gotta love speculation. Now it is supposed to be limited to genuinely > accessible data & code, right? Say theoretically we have a full TLBi as > literally the last thing before doing the return-to-userspace, speculation > should be limited to executing maybe bits of the return-from-userspace > code? I think it's easier to ignore speculation entirely, and just assume that the MMU can arbitrarily fill TLB entries from any page table entries which are valid/accessible in the active page tables. Hardware prefetchers can do that regardless of the specific path of speculative execution. Thus TLB fills are not limited to VAs which would be used on that return-to-userspace path. > Furthermore, I would hope that once a CPU is executing in userspace, it's > not going to populate the TLB with kernel address translations - AIUI the > whole vulnerability mitigation debacle was about preventing this sort of > thing. The CPU can definitely do that; the vulnerability mitigations are all about what userspace can observe rather than what the CPU can do in the background. Additionally, there are features like SPE and TRBE that use kernel addresses while the CPU is executing userspace instructions. The latest ARM Architecture Reference Manual (ARM DDI 0487 L.a) is fairly clear about that in section D8.16 "Translation Lookaside Buff", where it says (among other things): When address translation is enabled, if a translation table entry meets all of the following requirements, then that translation table entry is permitted to be cached in a TLB or intermediate TLB caching structure at any time: • The translation table entry itself does not generate a Translation fault, an Address size fault, or an Access flag fault. 
• The translation table entry is not from a translation regime configured by an Exception level that is lower than the current Exception level. Here "permitted to be cached in a TLB" also implies that the HW is allowed to fetch the translation table entry (which is what ARM calls page table entries). The PDF can be found at: https://developer.arm.com/documentation/ddi0487/la/?lang=en Mark.
On 2/11/25 05:33, Valentin Schneider wrote: >> 2. It's wrong to assume that TLB entries are only populated for >> addresses you access - thanks to speculative execution, you have to >> assume that the CPU might be populating random TLB entries all over >> the place. > Gotta love speculation. Now it is supposed to be limited to genuinely > accessible data & code, right? Say theoretically we have a full TLBi as > literally the last thing before doing the return-to-userspace, speculation > should be limited to executing maybe bits of the return-from-userspace > code? In practice, it's mostly limited like that. Architecturally, there are no promises from the CPU. It is within its rights to cache anything from the page tables at any time. If it's in the CR3 tree, it's fair game. > Furthermore, I would hope that once a CPU is executing in userspace, it's > not going to populate the TLB with kernel address translations - AIUI the > whole vulnerability mitigation debacle was about preventing this sort of > thing. Nope, unfortunately. There are two big exceptions to this. First, "implicit supervisor-mode accesses". There are structures for which the CPU gets a virtual address and accesses it even while userspace is running. The LDT and GDT are the most obvious examples, but there are some less ubiquitous ones like the buffers for PEBS events. Second, remember that user versus supervisor is determined *BY* the page tables. Before Linear Address Space Separation (LASS), all virtual memory accesses walk the page tables, even userspace accesses to kernel addresses. The User/Supervisor bit is *in* the page tables, of course. A userspace access to a kernel address results in a page walk and the CPU is completely free to cache all or part of that page walk. A Meltdown-style _speculative_ userspace access to kernel memory won't generate a fault either. It won't leak data like it used to, of course, but it can still walk the page tables. That's one reason LASS is needed.
On 11/02/25 14:03, Mark Rutland wrote: > On Tue, Feb 11, 2025 at 02:33:51PM +0100, Valentin Schneider wrote: >> On 10/02/25 23:08, Jann Horn wrote: >> > 2. It's wrong to assume that TLB entries are only populated for >> > addresses you access - thanks to speculative execution, you have to >> > assume that the CPU might be populating random TLB entries all over >> > the place. >> >> Gotta love speculation. Now it is supposed to be limited to genuinely >> accessible data & code, right? Say theoretically we have a full TLBi as >> literally the last thing before doing the return-to-userspace, speculation >> should be limited to executing maybe bits of the return-from-userspace >> code? > > I think it's easier to ignore speculation entirely, and just assume that > the MMU can arbitrarily fill TLB entries from any page table entries > which are valid/accessible in the active page tables. Hardware > prefetchers can do that regardless of the specific path of speculative > execution. > > Thus TLB fills are not limited to VAs which would be used on that > return-to-userspace path. > >> Furthermore, I would hope that once a CPU is executing in userspace, it's >> not going to populate the TLB with kernel address translations - AIUI the >> whole vulnerability mitigation debacle was about preventing this sort of >> thing. > > The CPU can definitely do that; the vulnerability mitigations are all > about what userspace can observe rather than what the CPU can do in the > background. Additionally, there are features like SPE and TRBE that use > kernel addresses while the CPU is executing userspace instructions. > > The latest ARM Architecture Reference Manual (ARM DDI 0487 L.a) is fairly clear > about that in section D8.16 "Translation Lookaside Buff", where it says > (among other things): > > When address translation is enabled, if a translation table entry > meets all of the following requirements, then that translation table > entry is permitted to be cached in a TLB or intermediate TLB caching > structure at any time: > • The translation table entry itself does not generate a Translation > fault, an Address size fault, or an Access flag fault. > • The translation table entry is not from a translation regime > configured by an Exception level that is lower than the current > Exception level. > > Here "permitted to be cached in a TLB" also implies that the HW is > allowed to fetch the translation tabl entry (which is what ARM call page > table entries). > That's actually fairly clear all things considered, thanks for the education and for fishing out the relevant DDI section! > The PDF can be found at: > > https://developer.arm.com/documentation/ddi0487/la/?lang=en > > Mark.
On 11/02/25 06:22, Dave Hansen wrote: > On 2/11/25 05:33, Valentin Schneider wrote: >>> 2. It's wrong to assume that TLB entries are only populated for >>> addresses you access - thanks to speculative execution, you have to >>> assume that the CPU might be populating random TLB entries all over >>> the place. >> Gotta love speculation. Now it is supposed to be limited to genuinely >> accessible data & code, right? Say theoretically we have a full TLBi as >> literally the last thing before doing the return-to-userspace, speculation >> should be limited to executing maybe bits of the return-from-userspace >> code? > > In practice, it's mostly limited like that. > > Architecturally, there are no promises from the CPU. It is within its > rights to cache anything from the page tables at any time. If it's in > the CR3 tree, it's fair game. > >> Furthermore, I would hope that once a CPU is executing in userspace, it's >> not going to populate the TLB with kernel address translations - AIUI the >> whole vulnerability mitigation debacle was about preventing this sort of >> thing. > > Nope, unfortunately. There's two big exception to this. First, "implicit > supervisor-mode accesses". There are structures for which the CPU gets a > virtual address and accesses it even while userspace is running. The LDT > and GDT are the most obvious examples, but there are some less > ubiquitous ones like the buffers for PEBS events. > > Second, remember that user versus supervisor is determined *BY* the page > tables. Before Linear Address Space Separation (LASS), all virtual > memory accesses walk the page tables, even userspace accesses to kernel > addresses. The User/Supervisor bit is *in* the page tables, of course. > > A userspace access to a kernel address results in a page walk and the > CPU is completely free to cache all or part of that page walk. A > Meltdown-style _speculative_ userspace access to kernel memory won't > generate a fault either. It won't leak data like it used to, of course, > but it can still walk the page tables. That's one reason LASS is needed. Bummer, now I have at least two architectures proving me wrong :-) Thank you as well for the education, I really appreciate it.
On 11/02/25 06:22, Dave Hansen wrote: > On 2/11/25 05:33, Valentin Schneider wrote: >>> 2. It's wrong to assume that TLB entries are only populated for >>> addresses you access - thanks to speculative execution, you have to >>> assume that the CPU might be populating random TLB entries all over >>> the place. >> Gotta love speculation. Now it is supposed to be limited to genuinely >> accessible data & code, right? Say theoretically we have a full TLBi as >> literally the last thing before doing the return-to-userspace, speculation >> should be limited to executing maybe bits of the return-from-userspace >> code? > > In practice, it's mostly limited like that. > > Architecturally, there are no promises from the CPU. It is within its > rights to cache anything from the page tables at any time. If it's in > the CR3 tree, it's fair game. > So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is executing in userspace? AIUI that's the case with kPTI - the remaining kernel pages should mostly be .entry.text and cpu_entry_area, at least for x86. It sounds like it wouldn't do much for arm64 though, if with CnP a CPU executing in userspace and with the user/trampoline page table installed can still use TLB entries of another CPU executing in kernelspace with the kernel page table installed.
On 2/18/25 14:40, Valentin Schneider wrote: >> In practice, it's mostly limited like that. >> >> Architecturally, there are no promises from the CPU. It is within its >> rights to cache anything from the page tables at any time. If it's in >> the CR3 tree, it's fair game. >> > So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is > executing in userspace? > > AIUI that's the case with kPTI - the remaining kernel pages should mostly > be .entry.text and cpu_entry_area, at least for x86. Having part of VMEMMAP not in the CR3 tree should be harmless while running userspace. VMEMMAP is a purely software structure; the hardware doesn't do implicit supervisor accesses to it. It's also not needed in super early entry. Maybe I missed part of the discussion though. Is VMEMMAP your only concern? I would have guessed that the more generic vmalloc() functionality would be harder to pin down.
On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote: > On 17/01/25 16:52, Jann Horn wrote: > > On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote: > >> On 14/01/25 19:16, Jann Horn wrote: > >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: > >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of > >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the > >> >> flush_tlb_kernel_range() IPIs. > >> >> > >> >> Given that CPUs executing in userspace do not access data in the vmalloc > >> >> range, these IPIs could be deferred until their next kernel entry. > >> >> > >> >> Deferral vs early entry danger zone > >> >> =================================== > >> >> > >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd > >> >> and then accessed in early entry code. > >> > > >> > In other words, it needs a guarantee that no vmalloc allocations that > >> > have been created in the vmalloc region while the CPU was idle can > >> > then be accessed during early entry, right? > >> > >> I'm not sure if that would be a problem (not an mm expert, please do > >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't > >> deferred anyway. > > > > flush_cache_vmap() is about stuff like flushing data caches on > > architectures with virtually indexed caches; that doesn't do TLB > > maintenance. When you look for its definition on x86 or arm64, you'll > > see that they use the generic implementation which is simply an empty > > inline function. > > > >> So after vmapping something, I wouldn't expect isolated CPUs to have > >> invalid TLB entries for the newly vmapped page. > >> > >> However, upon vunmap'ing something, the TLB flush is deferred, and thus > >> stale TLB entries can and will remain on isolated CPUs, up until they > >> execute the deferred flush themselves (IOW for the entire duration of the > >> "danger zone"). > >> > >> Does that make sense? > > > > The design idea wrt TLB flushes in the vmap code is that you don't do > > TLB flushes when you unmap stuff or when you map stuff, because doing > > TLB flushes across the entire system on every vmap/vunmap would be a > > bit costly; instead you just do batched TLB flushes in between, in > > __purge_vmap_area_lazy(). > > > > In other words, the basic idea is that you can keep calling vmap() and > > vunmap() a bunch of times without ever doing TLB flushes until you run > > out of virtual memory in the vmap region; then you do one big TLB > > flush, and afterwards you can reuse the free virtual address space for > > new allocations again. > > > > So if you "defer" that batched TLB flush for CPUs that are not > > currently running in the kernel, I think the consequence is that those > > CPUs may end up with incoherent TLB state after a reallocation of the > > virtual address space. > > > > Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc > that occurred while an isolated CPU was NOHZ-FULL can be an issue if said > CPU accesses it during early entry; So the issue is: CPU1: unmappes vmalloc page X which was previously mapped to physical page P1. CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy threshold and sending out IPIs. It then goes ahead and does an allocation that maps the same virtual page X to physical page P2. CPU3 is isolated and executes some early entry code before receving said IPIs which are supposedly deferred by Valentin's patches. 
It does not receive the IPI because it is deferred, so an access by early entry code to page X on this CPU results in a UAF access to P1. Is that the issue? thanks, - Joel
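Laid out as a timeline (X is a vmalloc virtual address, P1/P2 are physical pages), the scenario reads:

    CPU1 (housekeeping)            CPU2 (housekeeping)                CPU3 (isolated, userspace)
    -------------------            -------------------                --------------------------
    vunmap(X):
      X -> P1 PTE cleared,
      TLB flush deferred (lazy)
                                   many vmap()/vunmap() calls,
                                   lazy threshold crossed:
                                   flush_tlb_kernel_range() IPIs
                                   sent, but the one for CPU3 is
                                   deferred
                                   vmap() reuses X, now mapping P2
                                                                      enters the kernel; if early
                                                                      entry touches X before the
                                                                      deferred flush runs, the
                                                                      stale X -> P1 entry gives a
                                                                      use-after-free on P1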
On 18/02/25 16:39, Dave Hansen wrote: > On 2/18/25 14:40, Valentin Schneider wrote: >>> In practice, it's mostly limited like that. >>> >>> Architecturally, there are no promises from the CPU. It is within its >>> rights to cache anything from the page tables at any time. If it's in >>> the CR3 tree, it's fair game. >>> >> So what if the VMEMMAP range *isn't* in the CR3 tree when a CPU is >> executing in userspace? >> >> AIUI that's the case with kPTI - the remaining kernel pages should mostly >> be .entry.text and cpu_entry_area, at least for x86. > > Having part of VMEMMAP not in the CR3 tree should be harmless while > running userspace. VMEMMAP is a purely software structure; the hardware > doesn't do implicit supervisor accesses to it. It's also not needed in > super early entry. > > Maybe I missed part of the discussion though. Is VMEMMAP your only > concern? I would have guessed that the more generic vmalloc() > functionality would be harder to pin down. Urgh, that'll teach me to send emails that late - I did indeed mean the vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user kPTI page table and AFAICT the page table swap is done before the actual vmap'd stack (CONFIG_VMAP_STACK=y) gets used.
On 19/02/25 10:05, Joel Fernandes wrote: > On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote: >> On 17/01/25 16:52, Jann Horn wrote: >> > On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote: >> >> On 14/01/25 19:16, Jann Horn wrote: >> >> > On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: >> >> >> vunmap()'s issued from housekeeping CPUs are a relatively common source of >> >> >> interference for isolated NOHZ_FULL CPUs, as they are hit by the >> >> >> flush_tlb_kernel_range() IPIs. >> >> >> >> >> >> Given that CPUs executing in userspace do not access data in the vmalloc >> >> >> range, these IPIs could be deferred until their next kernel entry. >> >> >> >> >> >> Deferral vs early entry danger zone >> >> >> =================================== >> >> >> >> >> >> This requires a guarantee that nothing in the vmalloc range can be vunmap'd >> >> >> and then accessed in early entry code. >> >> > >> >> > In other words, it needs a guarantee that no vmalloc allocations that >> >> > have been created in the vmalloc region while the CPU was idle can >> >> > then be accessed during early entry, right? >> >> >> >> I'm not sure if that would be a problem (not an mm expert, please do >> >> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't >> >> deferred anyway. >> > >> > flush_cache_vmap() is about stuff like flushing data caches on >> > architectures with virtually indexed caches; that doesn't do TLB >> > maintenance. When you look for its definition on x86 or arm64, you'll >> > see that they use the generic implementation which is simply an empty >> > inline function. >> > >> >> So after vmapping something, I wouldn't expect isolated CPUs to have >> >> invalid TLB entries for the newly vmapped page. >> >> >> >> However, upon vunmap'ing something, the TLB flush is deferred, and thus >> >> stale TLB entries can and will remain on isolated CPUs, up until they >> >> execute the deferred flush themselves (IOW for the entire duration of the >> >> "danger zone"). >> >> >> >> Does that make sense? >> > >> > The design idea wrt TLB flushes in the vmap code is that you don't do >> > TLB flushes when you unmap stuff or when you map stuff, because doing >> > TLB flushes across the entire system on every vmap/vunmap would be a >> > bit costly; instead you just do batched TLB flushes in between, in >> > __purge_vmap_area_lazy(). >> > >> > In other words, the basic idea is that you can keep calling vmap() and >> > vunmap() a bunch of times without ever doing TLB flushes until you run >> > out of virtual memory in the vmap region; then you do one big TLB >> > flush, and afterwards you can reuse the free virtual address space for >> > new allocations again. >> > >> > So if you "defer" that batched TLB flush for CPUs that are not >> > currently running in the kernel, I think the consequence is that those >> > CPUs may end up with incoherent TLB state after a reallocation of the >> > virtual address space. >> > >> >> Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc >> that occurred while an isolated CPU was NOHZ-FULL can be an issue if said >> CPU accesses it during early entry; > > So the issue is: > > CPU1: unmappes vmalloc page X which was previously mapped to physical page > P1. > > CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy > threshold and sending out IPIs. It then goes ahead and does an allocation > that maps the same virtual page X to physical page P2. 
>
> CPU3 is isolated and executes some early entry code before receiving said IPIs
> which are supposedly deferred by Valentin's patches.
>
> It does not receive the IPI because it is deferred, thus access by early
> entry code to page X on this CPU results in a UAF access to P1.
>
> Is that the issue?
>

Pretty much so yeah. That is, *if* there is such a vmalloc'd address access in
early entry code - testing says it's not the case, but I haven't found a
way to instrumentally verify this.
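To make the interleaving above concrete, here is a toy, single-threaded C
model of it. This is purely illustrative: the CPU numbering follows Joel's
scenario, but the helpers (map_X(), unmap_X(), flush_broadcast(), access_X())
are made up and none of this is kernel code. A "TLB" is modelled as a per-CPU
cached copy of the page-table entry for virtual page X, and the batched flush
is the only thing that refreshes it.

#include <stdio.h>

#define NR_CPUS	4
#define NO_PAGE	-1

static int page_table_X = NO_PAGE;	/* current mapping of virtual page X */
static int tlb_X[NR_CPUS] = { NO_PAGE, NO_PAGE, NO_PAGE, NO_PAGE };

static void map_X(int phys)	{ page_table_X = phys; }
static void unmap_X(void)	{ page_table_X = NO_PAGE; }	/* no flush: lazy */

/* The batched flush; the CPU whose IPI is deferred keeps its stale entry. */
static void flush_broadcast(int deferred_cpu)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (cpu != deferred_cpu)
			tlb_X[cpu] = page_table_X;
}

/* An access goes through the TLB if a translation is cached, else it fills. */
static int access_X(int cpu)
{
	if (tlb_X[cpu] == NO_PAGE)
		tlb_X[cpu] = page_table_X;
	return tlb_X[cpu];
}

int main(void)
{
	map_X(1);		/* virtual page X -> physical page P1 */
	access_X(3);		/* isolated CPU3 caches X -> P1 */

	unmap_X();		/* CPU1: vunmap(X), flush is lazy/deferred */
	flush_broadcast(3);	/* CPU2: batched flush, CPU3's IPI deferred */
	map_X(2);		/* CPU2: reuses X for a new allocation -> P2 */

	/* CPU3 enters the kernel "danger zone" and touches X: */
	printf("CPU3 sees X -> P%d (page table says P%d)\n",
	       access_X(3), page_table_X);
	return 0;
}

Running it prints "CPU3 sees X -> P1 (page table says P2)", i.e. the stale
translation turns into a use-after-free of P1, exactly as described above.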
On 2/19/2025 11:18 AM, Valentin Schneider wrote: > On 19/02/25 10:05, Joel Fernandes wrote: >> On Fri, Jan 17, 2025 at 05:53:33PM +0100, Valentin Schneider wrote: >>> On 17/01/25 16:52, Jann Horn wrote: >>>> On Fri, Jan 17, 2025 at 4:25 PM Valentin Schneider <vschneid@redhat.com> wrote: >>>>> On 14/01/25 19:16, Jann Horn wrote: >>>>>> On Tue, Jan 14, 2025 at 6:51 PM Valentin Schneider <vschneid@redhat.com> wrote: >>>>>>> vunmap()'s issued from housekeeping CPUs are a relatively common source of >>>>>>> interference for isolated NOHZ_FULL CPUs, as they are hit by the >>>>>>> flush_tlb_kernel_range() IPIs. >>>>>>> >>>>>>> Given that CPUs executing in userspace do not access data in the vmalloc >>>>>>> range, these IPIs could be deferred until their next kernel entry. >>>>>>> >>>>>>> Deferral vs early entry danger zone >>>>>>> =================================== >>>>>>> >>>>>>> This requires a guarantee that nothing in the vmalloc range can be vunmap'd >>>>>>> and then accessed in early entry code. >>>>>> >>>>>> In other words, it needs a guarantee that no vmalloc allocations that >>>>>> have been created in the vmalloc region while the CPU was idle can >>>>>> then be accessed during early entry, right? >>>>> >>>>> I'm not sure if that would be a problem (not an mm expert, please do >>>>> correct me) - looking at vmap_pages_range(), flush_cache_vmap() isn't >>>>> deferred anyway. >>>> >>>> flush_cache_vmap() is about stuff like flushing data caches on >>>> architectures with virtually indexed caches; that doesn't do TLB >>>> maintenance. When you look for its definition on x86 or arm64, you'll >>>> see that they use the generic implementation which is simply an empty >>>> inline function. >>>> >>>>> So after vmapping something, I wouldn't expect isolated CPUs to have >>>>> invalid TLB entries for the newly vmapped page. >>>>> >>>>> However, upon vunmap'ing something, the TLB flush is deferred, and thus >>>>> stale TLB entries can and will remain on isolated CPUs, up until they >>>>> execute the deferred flush themselves (IOW for the entire duration of the >>>>> "danger zone"). >>>>> >>>>> Does that make sense? >>>> >>>> The design idea wrt TLB flushes in the vmap code is that you don't do >>>> TLB flushes when you unmap stuff or when you map stuff, because doing >>>> TLB flushes across the entire system on every vmap/vunmap would be a >>>> bit costly; instead you just do batched TLB flushes in between, in >>>> __purge_vmap_area_lazy(). >>>> >>>> In other words, the basic idea is that you can keep calling vmap() and >>>> vunmap() a bunch of times without ever doing TLB flushes until you run >>>> out of virtual memory in the vmap region; then you do one big TLB >>>> flush, and afterwards you can reuse the free virtual address space for >>>> new allocations again. >>>> >>>> So if you "defer" that batched TLB flush for CPUs that are not >>>> currently running in the kernel, I think the consequence is that those >>>> CPUs may end up with incoherent TLB state after a reallocation of the >>>> virtual address space. >>>> >>> >>> Ah, gotcha, thank you for laying this out! In which case yes, any vmalloc >>> that occurred while an isolated CPU was NOHZ-FULL can be an issue if said >>> CPU accesses it during early entry; >> >> So the issue is: >> >> CPU1: unmappes vmalloc page X which was previously mapped to physical page >> P1. >> >> CPU2: does a whole bunch of vmalloc and vfree eventually crossing some lazy >> threshold and sending out IPIs. 
It then goes ahead and does an allocation
>> that maps the same virtual page X to physical page P2.
>>
>> CPU3 is isolated and executes some early entry code before receiving said IPIs
>> which are supposedly deferred by Valentin's patches.
>>
>> It does not receive the IPI because it is deferred, thus access by early
>> entry code to page X on this CPU results in a UAF access to P1.
>>
>> Is that the issue?
>>
>
> Pretty much so yeah. That is, *if* there is such a vmalloc'd address access in
> early entry code - testing says it's not the case, but I haven't found a
> way to instrumentally verify this.

Ok, thanks for confirming. Maybe there is an address sanitizer way of verifying,
but yeah it is subtle and there could be more than one way of solving it. Too
much 'fun' ;)

 - Joel
On 2/19/25 07:13, Valentin Schneider wrote:
>> Maybe I missed part of the discussion though. Is VMEMMAP your only
>> concern? I would have guessed that the more generic vmalloc()
>> functionality would be harder to pin down.
> Urgh, that'll teach me to send emails that late - I did indeed mean the
> vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user
> kPTI page table and AFAICT the page table swap is done before the actual vmap'd
> stack (CONFIG_VMAP_STACK=y) gets used.

OK, so rewriting your question... ;)

> So what if the vmalloc() range *isn't* in the CR3 tree when a CPU is
> executing in userspace?

The LDT and maybe the PEBS buffers are the only implicit supervisor
accesses to vmalloc()'d memory that I can think of. But those are both
handled specially and shouldn't ever get zapped while in use. The LDT
replacement has its own IPIs separate from TLB flushing.

But I'm actually not all that worried about accesses while actually
running userspace. It's that "danger zone" in the kernel between entry
and when the TLB might have dangerous garbage in it.

BTW, I hope this whole thing is turned off on 32-bit. There, we can
actually take and handle faults on the vmalloc() area. If you get one of
those faults in your "danger zone", it'll start running page fault code
which will branch out to god-knows-where and certainly isn't noinstr.
On 19/02/25 12:25, Dave Hansen wrote:
> On 2/19/25 07:13, Valentin Schneider wrote:
>>> Maybe I missed part of the discussion though. Is VMEMMAP your only
>>> concern? I would have guessed that the more generic vmalloc()
>>> functionality would be harder to pin down.
>> Urgh, that'll teach me to send emails that late - I did indeed mean the
>> vmalloc() range, not at all VMEMMAP. IIUC *neither* are present in the user
>> kPTI page table and AFAICT the page table swap is done before the actual vmap'd
>> stack (CONFIG_VMAP_STACK=y) gets used.
>
> OK, so rewriting your question... ;)
>
>> So what if the vmalloc() range *isn't* in the CR3 tree when a CPU is
>> executing in userspace?
>
> The LDT and maybe the PEBS buffers are the only implicit supervisor
> accesses to vmalloc()'d memory that I can think of. But those are both
> handled specially and shouldn't ever get zapped while in use. The LDT
> replacement has its own IPIs separate from TLB flushing.
>
> But I'm actually not all that worried about accesses while actually
> running userspace. It's that "danger zone" in the kernel between entry
> and when the TLB might have dangerous garbage in it.
>

So say we have kPTI, thus no vmalloc() mapped in CR3 when running
userspace, and do a full TLB flush right before switching to userspace -
could the TLB still end up with vmalloc()-range-related entries when we're
back in the kernel and going through the danger zone?

> BTW, I hope this whole thing is turned off on 32-bit. There, we can
> actually take and handle faults on the vmalloc() area. If you get one of
> those faults in your "danger zone", it'll start running page fault code
> which will branch out to god-knows-where and certainly isn't noinstr.

Sounds... Fun. Thanks for pointing out the landmines.
On 2/20/25 09:10, Valentin Schneider wrote:
>> The LDT and maybe the PEBS buffers are the only implicit supervisor
>> accesses to vmalloc()'d memory that I can think of. But those are both
>> handled specially and shouldn't ever get zapped while in use. The LDT
>> replacement has its own IPIs separate from TLB flushing.
>>
>> But I'm actually not all that worried about accesses while actually
>> running userspace. It's that "danger zone" in the kernel between entry
>> and when the TLB might have dangerous garbage in it.
>>
> So say we have kPTI, thus no vmalloc() mapped in CR3 when running
> userspace, and do a full TLB flush right before switching to userspace -
> could the TLB still end up with vmalloc()-range-related entries when we're
> back in the kernel and going through the danger zone?

Yes, because the danger zone includes the switch back to the kernel CR3
with vmalloc() fully mapped. All bets are off about what's in the TLB the
moment that CR3 write occurs.

Actually, you could probably use that. If a mapping is in the PTI user
page table, you can't defer the flushes for it. Basically the same rule
as for text poking in the danger zone. If there's a deferred flush pending,
make sure that all of the SWITCH_TO_KERNEL_CR3's fully flush the TLB.
You'd need something similar to user_pcid_flush_mask.

But, honestly, I'm still not sure this is worth all the trouble. If folks
want to avoid IPIs for TLB flushes, there are hardware features that *DO*
that. Just get new hardware instead of adding this complicated pile of
software that we have to maintain forever. In 10 years, we'll still have
this software *and* 95% of our hardware will have the hardware feature too.
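A minimal sketch of what Dave suggests might look like the following. All
names here are hypothetical and this is not a working implementation: the
real check would have to live in, or immediately after, the
SWITCH_TO_KERNEL_CR3 entry assembly, analogous to how user_pcid_flush_mask
is consumed on the user->kernel CR3 switch.

#include <linux/percpu.h>
#include <asm/tlbflush.h>

/* Set by the flusher instead of IPI'ing a CPU that is running in userspace. */
static DEFINE_PER_CPU(bool, kernel_tlb_flush_pending);

/*
 * Would need to run right after the user->kernel CR3 switch, before any
 * "danger zone" code can dereference a vmalloc'd address through a stale
 * TLB entry.
 */
static inline void flush_if_deferred_pending(void)
{
	if (this_cpu_read(kernel_tlb_flush_pending)) {
		this_cpu_write(kernel_tlb_flush_pending, false);
		__flush_tlb_all();
	}
}

The series itself routes the deferred flush through the context_tracking
work mechanism (CT_WORK_TLBI in the patch below) rather than through the
entry assembly.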
diff --git a/arch/x86/include/asm/context_tracking_work.h b/arch/x86/include/asm/context_tracking_work.h
index 485b32881fde5..da3d270289836 100644
--- a/arch/x86/include/asm/context_tracking_work.h
+++ b/arch/x86/include/asm/context_tracking_work.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_CONTEXT_TRACKING_WORK_H
 
 #include <asm/sync_core.h>
+#include <asm/tlbflush.h>
 
 static __always_inline void arch_context_tracking_work(enum ct_work work)
 {
@@ -10,6 +11,9 @@ static __always_inline void arch_context_tracking_work(enum ct_work work)
 	case CT_WORK_SYNC:
 		sync_core();
 		break;
+	case CT_WORK_TLBI:
+		__flush_tlb_all();
+		break;
 	case CT_WORK_MAX:
 		WARN_ON_ONCE(true);
 	}
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4d11396250999..6e690acc35e53 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -248,6 +248,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 119765772ab11..47fb437acf52a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
 #include <linux/task_work.h>
 #include <linux/mmu_notifier.h>
 #include <linux/mmu_context.h>
+#include <linux/context_tracking.h>
 
 #include <asm/tlbflush.h>
 #include <asm/mmu_context.h>
@@ -1042,6 +1043,11 @@ static void do_flush_tlb_all(void *info)
 	__flush_tlb_all();
 }
 
+static bool do_kernel_flush_defer_cond(int cpu, void *info)
+{
+	return !ct_set_cpu_work(cpu, CT_WORK_TLBI);
+}
+
 void flush_tlb_all(void)
 {
 	count_vm_tlb_event(NR_TLB_REMOTE_FLUSH);
@@ -1058,12 +1064,13 @@ static void do_kernel_range_flush(void *info)
 		flush_tlb_one_kernel(addr);
 }
 
-void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+static inline void
+__flush_tlb_kernel_range(smp_cond_func_t cond_func, unsigned long start, unsigned long end)
 {
 	/* Balance as user space task's flush, a bit conservative */
 	if (end == TLB_FLUSH_ALL ||
 	    (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
-		on_each_cpu(do_flush_tlb_all, NULL, 1);
+		on_each_cpu_cond(cond_func, do_flush_tlb_all, NULL, 1);
 	} else {
 		struct flush_tlb_info *info;
 
@@ -1071,13 +1078,23 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		info = get_flush_tlb_info(NULL, start, end, 0, false,
 					  TLB_GENERATION_INVALID);
 
-		on_each_cpu(do_kernel_range_flush, info, 1);
+		on_each_cpu_cond(cond_func, do_kernel_range_flush, info, 1);
 
 		put_flush_tlb_info();
 		preempt_enable();
 	}
 }
 
+void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(NULL, start, end);
+}
+
+void flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	__flush_tlb_kernel_range(do_kernel_flush_defer_cond, start, end);
+}
+
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
diff --git a/include/linux/context_tracking_work.h b/include/linux/context_tracking_work.h
index 931ade1dcbcc2..1be60c64cdeca 100644
--- a/include/linux/context_tracking_work.h
+++ b/include/linux/context_tracking_work.h
@@ -5,12 +5,14 @@
 #include <linux/bitops.h>
 
 enum {
-	CT_WORK_SYNC,
+	CT_WORK_SYNC_OFFSET,
+	CT_WORK_TLBI_OFFSET,
 	CT_WORK_MAX_OFFSET
 };
 
 enum ct_work {
 	CT_WORK_SYNC	= BIT(CT_WORK_SYNC_OFFSET),
+	CT_WORK_TLBI	= BIT(CT_WORK_TLBI_OFFSET),
 	CT_WORK_MAX	= BIT(CT_WORK_MAX_OFFSET)
 };
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 5c88d0e90c209..e8aad4d55e955 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -467,6 +467,31 @@ void vunmap_range_noflush(unsigned long start, unsigned long end)
 	__vunmap_range_noflush(start, end);
 }
 
+#ifdef CONFIG_CONTEXT_TRACKING_WORK
+/*
+ * !!! BIG FAT WARNING !!!
+ *
+ * The CPU is free to cache any part of the paging hierarchy it wants at any
+ * time. It's also free to set accessed and dirty bits at any time, even for
+ * instructions that may never execute architecturally.
+ *
+ * This means that deferring a TLB flush affecting freed page-table-pages (IOW,
+ * keeping them in a CPU's paging hierarchy cache) is akin to dancing in a
+ * minefield.
+ *
+ * This isn't a problem for deferral of TLB flushes in vmalloc, because
+ * page-table-pages used for vmap() mappings are never freed - see how
+ * __vunmap_range_noflush() walks the whole mapping but only clears the leaf PTEs.
+ * If this ever changes, TLB flush deferral will cause misery.
+ */
+void __weak flush_tlb_kernel_range_deferrable(unsigned long start, unsigned long end)
+{
+	flush_tlb_kernel_range(start, end);
+}
+#else
+#define flush_tlb_kernel_range_deferrable(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
 /**
  * vunmap_range - unmap kernel virtual addresses
  * @addr: start of the VM area to unmap
@@ -480,7 +505,7 @@ void vunmap_range(unsigned long addr, unsigned long end)
 {
 	flush_cache_vunmap(addr, end);
 	vunmap_range_noflush(addr, end);
-	flush_tlb_kernel_range(addr, end);
+	flush_tlb_kernel_range_deferrable(addr, end);
 }
 
 static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr,
@@ -2281,7 +2306,7 @@ static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end,
 
 	nr_purge_nodes = cpumask_weight(&purge_nodes);
 	if (nr_purge_nodes > 0) {
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 
 		/* One extra worker is per a lazy_max_pages() full set minus one. */
 		nr_purge_helpers = atomic_long_read(&vmap_lazy_nr) / lazy_max_pages();
@@ -2384,7 +2409,7 @@ static void free_unmap_vmap_area(struct vmap_area *va)
 	flush_cache_vunmap(va->va_start, va->va_end);
 	vunmap_range_noflush(va->va_start, va->va_end);
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(va->va_start, va->va_end);
+		flush_tlb_kernel_range_deferrable(va->va_start, va->va_end);
 
 	free_vmap_area_noflush(va);
 }
@@ -2832,7 +2857,7 @@ static void vb_free(unsigned long addr, unsigned long size)
 	vunmap_range_noflush(addr, addr + size);
 
 	if (debug_pagealloc_enabled_static())
-		flush_tlb_kernel_range(addr, addr + size);
+		flush_tlb_kernel_range_deferrable(addr, addr + size);
 
 	spin_lock(&vb->lock);
 
@@ -2897,7 +2922,7 @@ static void _vm_unmap_aliases(unsigned long start, unsigned long end, int flush)
 	free_purged_blocks(&purge_list);
 
 	if (!__purge_vmap_area_lazy(start, end, false) && flush)
-		flush_tlb_kernel_range(start, end);
+		flush_tlb_kernel_range_deferrable(start, end);
 	mutex_unlock(&vmap_purge_lock);
 }
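The deferral decision above hinges on ct_set_cpu_work(), introduced earlier
in the series: do_kernel_flush_defer_cond() only skips the IPI when the work
bit could be set, i.e. when the target CPU is still outside the kernel. Below
is a toy model of that handshake; the names and the state encoding are made
up purely to illustrate the ordering, the real thing lives in the
context_tracking patches of this series.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define STATE_USER	0x0
#define STATE_KERNEL	0x1
#define WORK_TLBI	0x2	/* work bits live above the state bit */

static _Atomic unsigned int cpu_state = STATE_USER;

/* Housekeeping CPU: try to defer the flush instead of sending an IPI. */
static bool set_tlbi_work(void)
{
	unsigned int old = STATE_USER;

	/* Only succeeds if the target is (still) in userspace. */
	return atomic_compare_exchange_strong(&cpu_state, &old,
					      STATE_USER | WORK_TLBI);
}

/* Isolated CPU: kernel entry consumes pending work before the danger zone. */
static void kernel_enter(void)
{
	unsigned int old = atomic_exchange(&cpu_state, STATE_KERNEL);

	if (old & WORK_TLBI)
		printf("entry: running deferred TLB flush\n");
}

int main(void)
{
	if (set_tlbi_work())
		printf("flush deferred, no IPI sent\n");
	else
		printf("target already in kernel, send IPI\n");

	kernel_enter();
	return 0;
}

If the compare-and-exchange fails because the target has already entered the
kernel, do_kernel_flush_defer_cond() returns true and the IPI is sent exactly
as before this patch.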
vunmap()'s issued from housekeeping CPUs are a relatively common source of
interference for isolated NOHZ_FULL CPUs, as they are hit by the
flush_tlb_kernel_range() IPIs.

Given that CPUs executing in userspace do not access data in the vmalloc
range, these IPIs could be deferred until their next kernel entry.

Deferral vs early entry danger zone
===================================

This requires a guarantee that nothing in the vmalloc range can be vunmap'd
and then accessed in early entry code.

Vmalloc uses are, as reported by vmallocinfo:

  $ cat /proc/vmallocinfo | awk '{ print $3 }' | sort | uniq
  __pci_enable_msix_range+0x32b/0x560
  acpi_os_map_iomem+0x22d/0x250
  bpf_prog_alloc_no_stats+0x34/0x140
  fork_idle+0x79/0x120
  gen_pool_add_owner+0x3e/0xb0
  ? hpet_enable+0xbf/0x470
  irq_init_percpu_irqstack+0x129/0x160
  kernel_clone+0xab/0x3b0
  memremap+0x164/0x330
  n_tty_open+0x18/0xa0
  pcpu_create_chunk+0x4e/0x1b0
  pcpu_create_chunk+0x75/0x1b0
  pcpu_get_vm_areas+0x0/0x1100
  unpurged
  vp_modern_map_capability+0x140/0x270
  zisofs_init+0x16/0x30

I've categorized these as:

a) Device or percpu mappings

   For these to be unmapped, the device (or CPU) has to be removed and an
   eventual IRQ freed. Even if the IRQ isn't freed, context tracking entry
   happens before handling the IRQ itself, per irqentry_enter() usage.

   __pci_enable_msix_range()
   acpi_os_map_iomem()
   irq_init_percpu_irqstack() (not even unmapped when CPU is hot-unplugged!)
   memremap()
   n_tty_open()
   pcpu_create_chunk()
   pcpu_get_vm_areas()
   vp_modern_map_capability()

b) CONFIG_VMAP_STACK

   fork_idle() & kernel_clone(): vmalloc'd kernel stacks are AFAICT a safe
   example, as a task running in userspace needs to enter kernelspace to
   execute do_exit() before its stack can be vfree'd.

c) Non-issues

   bpf_prog_alloc_no_stats() - early entry is noinstr, no BPF!
   hpet_enable() - hpet_clear_mapping() is only called if the __init function
   fails, no runtime worries
   zisofs_init() - used for zisofs block decompression, that's way past
   context tracking entry

d) I'm not sure, have a look?

   gen_pool_add_owner() - AIUI this is mainly for PCI / DMA stuff, which
   again I wouldn't expect to be accessed before context tracking entry.

Changes
=======

Blindly deferring any and all flushes of the kernel mappings is a risky
move, so introduce a variant of flush_tlb_kernel_range() that explicitly
allows deferral. Use it for vunmap flushes.

Note that while flush_tlb_kernel_range() may end up issuing a full flush
(including user mappings), this only happens when reaching an invalidation
range threshold where it is cheaper to do a full flush than to individually
invalidate each page in the range via INVLPG. IOW, it doesn't *require*
invalidating user mappings, and thus remains safe to defer until a later
kernel entry.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
---
 arch/x86/include/asm/context_tracking_work.h |  4 +++
 arch/x86/include/asm/tlbflush.h              |  1 +
 arch/x86/mm/tlb.c                            | 23 +++++++++++--
 include/linux/context_tracking_work.h        |  4 ++-
 mm/vmalloc.c                                 | 35 +++++++++++++++++---
 5 files changed, 58 insertions(+), 9 deletions(-)
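For reference, a back-of-the-envelope illustration of the full-flush
threshold mentioned in the changelog above, assuming the current x86 default
of tlb_single_page_flush_ceiling == 33 (which I believe is still the default
and is tunable via debugfs); the check mirrors the one in
__flush_tlb_kernel_range() in the patch:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

int main(void)
{
	/* assumed x86 default for tlb_single_page_flush_ceiling */
	unsigned long ceiling = 33;
	unsigned long start = 0, end = 64 * PAGE_SIZE;

	if ((end - start) > (ceiling << PAGE_SHIFT))
		printf("range of %lu pages: full flush (user entries included)\n",
		       (end - start) / PAGE_SIZE);
	else
		printf("range of %lu pages: per-page INVLPG\n",
		       (end - start) / PAGE_SIZE);
	return 0;
}

With 4K pages, any range larger than 132 KiB therefore takes the full-flush
path, which with this patch can be deferred rather than IPI'd on NOHZ_FULL
CPUs running in userspace.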