Message ID: 20230915105933.495735-1-matteorizzo@google.com (mailing list archive)
Series: Prevent cross-cache attacks in the SLUB allocator

On 9/15/23 03:59, Matteo Rizzo wrote:
> The goal of this patch series is to deterministically prevent cross-cache
> attacks in the SLUB allocator.

What's the cost?

On Fri, 15 Sep 2023, Dave Hansen wrote:

> On 9/15/23 03:59, Matteo Rizzo wrote:
>> The goal of this patch series is to deterministically prevent cross-cache
>> attacks in the SLUB allocator.
>
> What's the cost?

The only thing that I see is 1-2% on kernel compilations (and "more on
machines with lots of cores")?

Having a virtualized slab subsystem could enable other things:

- The page order calculation could be simplified since vmalloc can stitch
  arbitrary base pages together to form larger contiguous virtual segments.
  So just use f.e. order 5 or so for all slabs to reduce contention?

- Maybe we could make slab pages movable (if we can ensure that slab objects
  are not touched somehow. At least stop_machine run could be used to move
  batches of slab memory)

- Maybe we can avoid allocating page structs somehow for slab memory? Looks
  like this is taking a step into that direction. The metadata storage of
  the slab allocator could be reworked and optimized better.

Problems:

- Overhead due to more TLB lookups

- Larger amounts of TLBs are used for the OS. Currently we are trying to
  use the maximum mappable TLBs to reduce their numbers. This presumably
  means using 4K TLBs for all slab access.

- Memory may not be physically contiguous which may be required by some
  drivers doing DMA.

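[The first point above relies on vmap()'s ability to build one virtually
contiguous region out of scattered order-0 pages. A rough, hypothetical
sketch of that mechanism only; the helper names are made up here and not
taken from the series:]

/*
 * Hypothetical sketch: back one "virtual slab" of nr_pages 4K pages with
 * arbitrary, possibly discontiguous physical pages, stitched into a single
 * virtually contiguous range with vmap().
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

static void *alloc_virtual_slab(struct page **pages, unsigned int nr_pages,
                                gfp_t gfp)
{
        unsigned int i;
        void *vaddr;

        for (i = 0; i < nr_pages; i++) {
                /* order-0 allocations: no physical contiguity required */
                pages[i] = alloc_page(gfp);
                if (!pages[i])
                        goto err;
        }

        /* Map the scattered pages into one contiguous virtual range. */
        vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
        if (vaddr)
                return vaddr;
err:
        while (i--)
                __free_page(pages[i]);
        return NULL;
}

static void free_virtual_slab(void *vaddr, struct page **pages,
                              unsigned int nr_pages)
{
        unsigned int i;

        vunmap(vaddr);                  /* tear down the virtual mapping */
        for (i = 0; i < nr_pages; i++)
                __free_page(pages[i]);  /* then release the physical pages */
}

[Because the backing pages need not be physically contiguous, a scheme like
this could use one large virtual order for every cache, at the cost of 4K
TLB entries and of the DMA caveat in the last bullet.]
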
On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
<cl@os.amperecomputing.com> wrote:
>
> On Fri, 15 Sep 2023, Dave Hansen wrote:
>
> > What's the cost?
>
> The only thing that I see is 1-2% on kernel compilations (and "more on
> machines with lots of cores")?

I used kernel compilation time (wall clock time) as a benchmark while
preparing the series. Lower is better.

Intel Skylake, 112 cores:

LABEL          | COUNT | MIN     | MAX     | MEAN    | MEDIAN  | STDDEV
---------------+-------+---------+---------+---------+---------+--------
SLAB_VIRTUAL=n | 150   | 49.700s | 51.320s | 50.449s | 50.430s | 0.29959
SLAB_VIRTUAL=y | 150   | 50.020s | 51.660s | 50.880s | 50.880s | 0.30495
               |       | +0.64%  | +0.66%  | +0.85%  | +0.89%  | +1.79%

AMD Milan, 256 cores:

LABEL          | COUNT | MIN     | MAX     | MEAN    | MEDIAN  | STDDEV
---------------+-------+---------+---------+---------+---------+--------
SLAB_VIRTUAL=n | 150   | 25.480s | 26.550s | 26.065s | 26.055s | 0.23495
SLAB_VIRTUAL=y | 150   | 25.820s | 27.080s | 26.531s | 26.540s | 0.25974
               |       | +1.33%  | +2.00%  | +1.79%  | +1.86%  | +10.55%

Are there any specific benchmarks that you would be interested in seeing or
that are usually used for SLUB?

> Problems:
>
> - Overhead due to more TLB lookups
>
> - Larger amounts of TLBs are used for the OS. Currently we are trying to
>   use the maximum mappable TLBs to reduce their numbers. This presumably
>   means using 4K TLBs for all slab access.

Yes, we are using 4K pages for the slab mappings which is going to increase
TLB pressure. I also tried writing a version of the patch that uses 2M
pages which had slightly better performance, but that had its own problems.
For example most slabs are much smaller than 2M, so we would need to create
and map multiple slabs at once and we wouldn't be able to release the
physical memory until all slabs in the 2M page are unused which increases
fragmentation.

> - Memory may not be physically contiguous which may be required by some
>   drivers doing DMA.

In the current implementation each slab is backed by physically contiguous
memory, but different slabs that are adjacent in virtual memory might not
be physically contiguous. Treating objects allocated from two different
slabs as one contiguous chunk of memory is probably wrong anyway, right?

--
Matteo

* Matteo Rizzo <matteorizzo@google.com> wrote:

> On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
> <cl@os.amperecomputing.com> wrote:
> >
> > On Fri, 15 Sep 2023, Dave Hansen wrote:
> >
> > > What's the cost?
> >
> > The only thing that I see is 1-2% on kernel compilations (and "more on
> > machines with lots of cores")?
>
> I used kernel compilation time (wall clock time) as a benchmark while
> preparing the series. Lower is better.
>
> Intel Skylake, 112 cores:
>
> LABEL          | COUNT | MIN     | MAX     | MEAN    | MEDIAN  | STDDEV
> ---------------+-------+---------+---------+---------+---------+--------
> SLAB_VIRTUAL=n | 150   | 49.700s | 51.320s | 50.449s | 50.430s | 0.29959
> SLAB_VIRTUAL=y | 150   | 50.020s | 51.660s | 50.880s | 50.880s | 0.30495
>                |       | +0.64%  | +0.66%  | +0.85%  | +0.89%  | +1.79%
>
> AMD Milan, 256 cores:
>
> LABEL          | COUNT | MIN     | MAX     | MEAN    | MEDIAN  | STDDEV
> ---------------+-------+---------+---------+---------+---------+--------
> SLAB_VIRTUAL=n | 150   | 25.480s | 26.550s | 26.065s | 26.055s | 0.23495
> SLAB_VIRTUAL=y | 150   | 25.820s | 27.080s | 26.531s | 26.540s | 0.25974
>                |       | +1.33%  | +2.00%  | +1.79%  | +1.86%  | +10.55%

That's sadly a rather substantial overhead for a compiler/linker workload
that is dominantly user-space: a kernel build is about 90% user-time and
10% system-time:

  $ perf stat --null make -j64 vmlinux

  ...

   Performance counter stats for 'make -j64 vmlinux':

      59.840704481 seconds time elapsed

    2000.774537000 seconds user
     219.138280000 seconds sys

What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
user-space execution and kernel-space execution?

Thanks,

        Ingo

On Mon, 18 Sept 2023 at 10:39, Ingo Molnar <mingo@kernel.org> wrote:
>
> What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
> user-space execution and kernel-space execution?

... and equally importantly, what about DMA?

Or what about the fixed-size slabs (aka kmalloc?) What's the point of
"never re-use the same address for a different slab", when the *same*
slab will contain different kinds of allocations anyway?

I think the whole "make it one single compile-time option" model is
completely and fundamentally broken.

              Linus

On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <mingo@kernel.org> wrote:
>
> What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
> user-space execution and kernel-space execution?

Same benchmark as before (compiling a kernel on a system running the
patched kernel):

Intel Skylake:

LABEL          | COUNT | MIN      | MAX      | MEAN     | MEDIAN   | STDDEV
---------------+-------+----------+----------+----------+----------+--------
wall clock     |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 49.700   | 51.320   | 50.449   | 50.430   | 0.29959
SLAB_VIRTUAL=y | 150   | 50.020   | 51.660   | 50.880   | 50.880   | 0.30495
               |       | +0.64%   | +0.66%   | +0.85%   | +0.89%   | +1.79%
system time    |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 358.560  | 362.900  | 360.922  | 360.985  | 0.91761
SLAB_VIRTUAL=y | 150   | 362.970  | 367.970  | 366.062  | 366.115  | 1.015
               |       | +1.23%   | +1.40%   | +1.42%   | +1.42%   | +10.60%
user time      |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 3110.000 | 3124.520 | 3118.143 | 3118.120 | 2.466
SLAB_VIRTUAL=y | 150   | 3115.070 | 3127.070 | 3120.762 | 3120.925 | 2.654
               |       | +0.16%   | +0.08%   | +0.08%   | +0.09%   | +7.63%

AMD Milan:

LABEL          | COUNT | MIN      | MAX      | MEAN     | MEDIAN   | STDDEV
---------------+-------+----------+----------+----------+----------+--------
wall clock     |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 25.480   | 26.550   | 26.065   | 26.055   | 0.23495
SLAB_VIRTUAL=y | 150   | 25.820   | 27.080   | 26.531   | 26.540   | 0.25974
               |       | +1.33%   | +2.00%   | +1.79%   | +1.86%   | +10.55%
system time    |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 478.530  | 540.420  | 520.803  | 521.485  | 9.166
SLAB_VIRTUAL=y | 150   | 530.520  | 572.460  | 552.825  | 552.985  | 7.161
               |       | +10.86%  | +5.93%   | +6.15%   | +6.04%   | -21.88%
user time      |       |          |          |          |          |
SLAB_VIRTUAL=n | 150   | 2373.540 | 2403.800 | 2386.343 | 2385.840 | 5.325
SLAB_VIRTUAL=y | 150   | 2388.690 | 2426.290 | 2408.325 | 2408.895 | 6.667
               |       | +0.64%   | +0.94%   | +0.92%   | +0.97%   | +25.20%

I'm not exactly sure why user time increases by almost 1% on Milan, it
could be TLB contention.

--
Matteo

On Mon, 18 Sept 2023 at 20:05, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> ... and equally importantly, what about DMA?

I'm not exactly sure what you mean by this, I don't think this should
affect the performance of DMA.

> Or what about the fixed-size slabs (aka kmalloc?) What's the point of
> "never re-use the same address for a different slab", when the *same*
> slab will contain different kinds of allocations anyway?

There are a number of patches out there (for example the random_kmalloc
series which recently got merged into v6.6) which attempt to segregate
kmalloc'd objects into different caches to make exploitation harder.
Another thing that we would like to have in the future is to segregate
objects by type (like XNU's kalloc_type
https://security.apple.com/blog/towards-the-next-generation-of-xnu-memory-safety/)
which makes exploiting use-after-free by type confusion much harder or
impossible.

All of these mitigations can be bypassed very easily if the attacker can
mount a cross-cache attack, which is what this series attempts to prevent.
This is not only theoretical, we've seen attackers use this all the time in
kCTF/kernelCTF submissions (for example
https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/).

> I think the whole "make it one single compile-time option" model is
> completely and fundamentally broken.

Wouldn't making this toggleable at boot time or runtime make performance
even worse?

--
Matteo

On 9/19/23 06:42, Matteo Rizzo wrote:
> On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <mingo@kernel.org> wrote:
>> What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
>> user-space execution and kernel-space execution?
>>
> Same benchmark as before (compiling a kernel on a system running the patched
> kernel):

Thanks for running those. One more situation that comes to mind is how this
will act under memory pressure. Will some memory pressure make contention
on 'slub_kworker_lock' visible or make the global TLB flushes less bearable?

In any case, none of this looks _catastrophic_. It's surely a cost that
some folks will pay. But I really do think it needs to be more dynamic.
There are a _couple_ of reasons for this.

If it's only a compile-time option, it's never going to get turned on
except for maybe ChromeOS and the datacenter folks that are paranoid. I
suspect the distros will never turn it on.

A lot of questions get easier if you can disable/enable it at runtime. For
instance, what do you do if the virtual area fills up? You _could_ just go
back to handing out direct map addresses. Less secure? Yep. But better
than crashing (for some folks).

It also opens up the door to do this per-slab. That alone would be a handy
debugging option.

On 9/19/23 08:48, Matteo Rizzo wrote:
>> I think the whole "make it one single compile-time option" model is
>> completely and fundamentally broken.
> Wouldn't making this toggleable at boot time or runtime make performance
> even worse?

Maybe.

But you can tolerate even more of a performance impact from a feature if
the people that don't care can actually disable it.

There are also plenty of ways to minimize the overhead of switching it on
and off at runtime. Static branches are your best friend here.

On September 19, 2023 9:02:07 AM PDT, Dave Hansen <dave.hansen@intel.com> wrote:
> On 9/19/23 08:48, Matteo Rizzo wrote:
>>> I think the whole "make it one single compile-time option" model is
>>> completely and fundamentally broken.
>> Wouldn't making this toggleable at boot time or runtime make performance
>> even worse?
>
> Maybe.
>
> But you can tolerate even more of a performance impact from a feature if
> the people that don't care can actually disable it.
>
> There are also plenty of ways to minimize the overhead of switching it
> on and off at runtime. Static branches are your best friend here.

Let's start with a boot time on/off toggle (no per-slab, no switch on
out-of-space, etc). That should be sufficient for initial ease of use for
testing, etc. But yes, using static_branch will nicely DTRT here.

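[For concreteness, a minimal sketch of what a static_branch-based boot-time
toggle could look like; the key name, parameter name, and helper below are
hypothetical and not taken from the series:]

#include <linux/init.h>
#include <linux/jump_label.h>
#include <linux/kernel.h>

/* Hypothetical key: default off, flipped at most once during early boot. */
static DEFINE_STATIC_KEY_FALSE(slab_virtual_enabled);

/* "slab_virtual=on/off" boot parameter (made-up name for this sketch). */
static int __init setup_slab_virtual(char *str)
{
        bool enable;

        if (kstrtobool(str, &enable))
                return -EINVAL;
        if (enable)
                static_branch_enable(&slab_virtual_enabled);
        return 0;
}
early_param("slab_virtual", setup_slab_virtual);

/*
 * Fast-path check: patched down to a plain NOP when the feature is off,
 * so kernels booted without the option pay (almost) nothing per
 * allocation. A real implementation would have to latch the decision
 * before the first kmem_cache is created.
 */
static inline bool slab_virtual_active(void)
{
        return static_branch_unlikely(&slab_virtual_enabled);
}
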
On Tue, 19 Sept 2023 at 08:48, Matteo Rizzo <matteorizzo@google.com> wrote:
>
> On Mon, 18 Sept 2023 at 20:05, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > ... and equally importantly, what about DMA?
>
> I'm not exactly sure what you mean by this, I don't think this should
> affect the performance of DMA.

I was more worried about just basic correctness.

We've traditionally had a lot of issues with using virtual addresses for
dma, simply because we've got random drivers, and I'm not entirely
convinced that your "virt_to_phys()" update will catch it all.

IOW, even on x86-64 - which is hopefully better than most architectures
because it already has that double mapping issue - we have things like

        unsigned long paddr = (unsigned long)vaddr - __PAGE_OFFSET;

in other places than just the __phys_addr() code.

The one place I grepped for looks to be just boot-time AMD memory
encryption, so wouldn't be any slab allocation, but ...

              Linus

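[To make the correctness worry concrete: "vaddr - PAGE_OFFSET"-style
arithmetic is only valid for direct-map addresses, while an object in a
virtually-mapped slab has to be resolved through the page tables. A purely
illustrative sketch of the distinction, not the helper the series actually
modifies:]

#include <linux/io.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Illustrative only: translate a kernel virtual address to a physical
 * address whether it lives in the direct map or in a vmalloc-style
 * region (as a virtually-mapped slab object would).
 */
static phys_addr_t example_virt_to_phys(const void *vaddr)
{
        if (is_vmalloc_addr(vaddr)) {
                /* Walk the page tables; valid only within this one page. */
                struct page *page = vmalloc_to_page(vaddr);

                return page_to_phys(page) + offset_in_page(vaddr);
        }

        /* Direct map: the classic "vaddr - PAGE_OFFSET" shortcut is fine. */
        return virt_to_phys((void *)vaddr);
}

[The "valid only within this one page" caveat is the driver/DMA trap: in a
general vmalloc-style mapping a buffer that crosses a 4K boundary need not
be physically contiguous. Matteo notes above that each individual slab in
this series stays physically contiguous, but the translation still cannot
go through the direct-map arithmetic that Linus quotes.]
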
* Matteo Rizzo <matteorizzo@google.com> wrote:

> On Mon, 18 Sept 2023 at 19:39, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > What's the split of the increase in overhead due to SLAB_VIRTUAL=y, between
> > user-space execution and kernel-space execution?
>
> Same benchmark as before (compiling a kernel on a system running the patched
> kernel):
>
> Intel Skylake:
>
> LABEL          | COUNT | MIN      | MAX      | MEAN     | MEDIAN   | STDDEV
> ---------------+-------+----------+----------+----------+----------+--------
> wall clock     |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 49.700   | 51.320   | 50.449   | 50.430   | 0.29959
> SLAB_VIRTUAL=y | 150   | 50.020   | 51.660   | 50.880   | 50.880   | 0.30495
>                |       | +0.64%   | +0.66%   | +0.85%   | +0.89%   | +1.79%
> system time    |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 358.560  | 362.900  | 360.922  | 360.985  | 0.91761
> SLAB_VIRTUAL=y | 150   | 362.970  | 367.970  | 366.062  | 366.115  | 1.015
>                |       | +1.23%   | +1.40%   | +1.42%   | +1.42%   | +10.60%
> user time      |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 3110.000 | 3124.520 | 3118.143 | 3118.120 | 2.466
> SLAB_VIRTUAL=y | 150   | 3115.070 | 3127.070 | 3120.762 | 3120.925 | 2.654
>                |       | +0.16%   | +0.08%   | +0.08%   | +0.09%   | +7.63%

These Skylake figures are a bit counter-intuitive: how does an increase of
only +0.08% user-time - which dominates 89.5% of execution, combined with
a +1.42% increase in system time that consumes only 10.5% of CPU capacity,
result in a +0.85% increase in wall-clock time?

There might be hidden factors at work in the DMA space, as Linus suggested?

Or perhaps wall-clock time is dominated by the single-threaded final link
time of the kernel, which phase might be disproportionately hurt by these
changes?

(Stddev seems low enough for this not to be a measurement artifact.)

The AMD Milan figures are more intuitive:

> AMD Milan:
>
> LABEL          | COUNT | MIN      | MAX      | MEAN     | MEDIAN   | STDDEV
> ---------------+-------+----------+----------+----------+----------+--------
> wall clock     |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 25.480   | 26.550   | 26.065   | 26.055   | 0.23495
> SLAB_VIRTUAL=y | 150   | 25.820   | 27.080   | 26.531   | 26.540   | 0.25974
>                |       | +1.33%   | +2.00%   | +1.79%   | +1.86%   | +10.55%
> system time    |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 478.530  | 540.420  | 520.803  | 521.485  | 9.166
> SLAB_VIRTUAL=y | 150   | 530.520  | 572.460  | 552.825  | 552.985  | 7.161
>                |       | +10.86%  | +5.93%   | +6.15%   | +6.04%   | -21.88%
> user time      |       |          |          |          |          |
> SLAB_VIRTUAL=n | 150   | 2373.540 | 2403.800 | 2386.343 | 2385.840 | 5.325
> SLAB_VIRTUAL=y | 150   | 2388.690 | 2426.290 | 2408.325 | 2408.895 | 6.667
>                |       | +0.64%   | +0.94%   | +0.92%   | +0.97%   | +25.20%
>
> I'm not exactly sure why user time increases by almost 1% on Milan, it
> could be TLB contention.

The other worrying aspect is the increase of +6.15% of system time ...
which is roughly in line with what we'd expect from a +1.79% increase in
wall-clock time.

Thanks,

        Ingo

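[A quick back-of-the-envelope check of Ingo's point, assuming wall-clock
time scaled in proportion to the user/system CPU-time split he quotes
(89.5% user, 10.5% system):

    expected wall-clock delta ~= 0.895 * 0.08% + 0.105 * 1.42% ~= 0.22%

versus the measured +0.85%. The user/system split alone explains only about
a quarter of the Skylake wall-clock regression, which is what makes a
serialized phase such as the final link, or some other non-CPU factor, the
more plausible explanation.]
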
On 9/18/23 14:08, Matteo Rizzo wrote:
> On Fri, 15 Sept 2023 at 18:30, Lameter, Christopher
>> Problems:
>>
>> - Overhead due to more TLB lookups
>>
>> - Larger amounts of TLBs are used for the OS. Currently we are trying to
>>   use the maximum mappable TLBs to reduce their numbers. This presumably
>>   means using 4K TLBs for all slab access.
>
> Yes, we are using 4K pages for the slab mappings which is going to increase
> TLB pressure. I also tried writing a version of the patch that uses 2M
> pages which had slightly better performance, but that had its own problems.
> For example most slabs are much smaller than 2M, so we would need to create
> and map multiple slabs at once and we wouldn't be able to release the
> physical memory until all slabs in the 2M page are unused which increases
> fragmentation.

At last LSF/MM [1] we basically discarded direct map fragmentation
avoidance as solving something that turns out to be insignificant, with the
exception of kernel code. As kernel code is unlikely to be allocated from
kmem caches due to W^X, we can hopefully assume it's also insignificant for
the virtual slab area.

[1] https://lwn.net/Articles/931406/