Message ID: 20230629221910.359711-1-julian.pidancet@oracle.com (mailing list archive)
State:      New
Series:     [v2] mm/slub: disable slab merging in the default configuration
On Fri, 30 Jun 2023, Julian Pidancet wrote:

> Make CONFIG_SLAB_MERGE_DEFAULT default to n unless CONFIG_SLUB_TINY is
> enabled. The benefits of slab merging are limited on systems that are not
> memory constrained: the memory overhead is low and evidence of its
> effect on cache hotness is hard to come by.
>
> On the other hand, distinguishing allocations into different slabs will
> make attacks that rely on "heap spraying" more difficult to carry out
> with success.
>
> Take sides with security in the default kernel configuration over
> questionable performance benefits/memory efficiency.
>
> A timed kernel compilation test, on x86 with 4K pages, conducted 10
> times with slab_merge, and the same test then conducted with
> slab_nomerge on the same hardware in a similar state, does not show any
> sign of a performance hit one way or another:
>
>       |   slab_merge    |  slab_nomerge   |
> ------+-----------------+-----------------|
>  Time | 588.080 ± 0.799 | 587.308 ± 1.411 |
>  Min  | 586.267         | 584.640         |
>  Max  | 589.248         | 590.091         |
>
> Peaks in slab usage during the test workload reveal a memory overhead
> of 2.2 MiB when using slab_nomerge. Slab usage overhead after a fresh
> boot amounts to 2.3 MiB:
>
>  Slab Usage         | slab_merge | slab_nomerge |
>  -------------------+------------+--------------|
>  After fresh boot   |   79908 kB |     82284 kB |
>  During test (peak) |  127940 kB |    130204 kB |
>
> Signed-off-by: Julian Pidancet <julian.pidancet@oracle.com>
> Reviewed-by: Kees Cook <keescook@chromium.org>

Thanks for continuing to work on this.

I think we need more data beyond just kernbench. Christoph's point about
different page sizes is interesting. In the above results, I don't know
the page orders for the various slab caches that this workload will
stress. I think the memory overhead data may be different depending on
how slab_max_order is being used, if at all.
We should be able to run this through a variety of different benchmarks
and measure peak slab usage at the same time for due diligence. I support
the change in the default, I would just prefer to know what the
implications of it are.

Is it possible to collect data for other microbenchmarks and real-world
workloads? And perhaps also with different page sizes where this will
impact memory overhead more? I can help running more workloads once we
have the next set of data.

> ---
>
> v2:
>   - Re-run benchmark to minimize variance in results due to CPU
>     frequency scaling.
>   - Record slab usage after boot and peaks during test workload.
>   - Include benchmark results in commit message.
>   - Fix typo: s/MEGE/MERGE/.
>   - Specify that "overhead" refers to memory overhead in SLUB doc.
>
> v1:
>   - Link: https://lore.kernel.org/linux-mm/20230627132131.214475-1-julian.pidancet@oracle.com/
>
>  .../admin-guide/kernel-parameters.txt | 29 ++++++++++---------
>  Documentation/mm/slub.rst             |  7 +++--
>  mm/Kconfig                            |  6 ++--
>  3 files changed, 22 insertions(+), 20 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index c5e7bb4babf0..7e78471a96b7 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5652,21 +5652,22 @@
>
>  	slram=		[HW,MTD]
>
> -	slab_merge	[MM]
> -			Enable merging of slabs with similar size when the
> -			kernel is built without CONFIG_SLAB_MERGE_DEFAULT.
> -
>  	slab_nomerge	[MM]
> -			Disable merging of slabs with similar size. May be
> -			necessary if there is some reason to distinguish
> -			allocs to different slabs, especially in hardened
> -			environments where the risk of heap overflows and
> -			layout control by attackers can usually be
> -			frustrated by disabling merging. This will reduce
> -			most of the exposure of a heap attack to a single
> -			cache (risks via metadata attacks are mostly
> -			unchanged). Debug options disable merging on their
> -			own.
> +			Disable merging of slabs with similar size when
> +			the kernel is built with CONFIG_SLAB_MERGE_DEFAULT.
> +			Allocations of the same size made in distinct
> +			caches will be placed in separate slabs. In
> +			hardened environment, the risk of heap overflows
> +			and layout control by attackers can usually be
> +			frustrated by disabling merging.
> +
> +	slab_merge	[MM]
> +			Enable merging of slabs with similar size. May be
> +			necessary to reduce overhead or increase cache
> +			hotness of objects, at the cost of increased
> +			exposure in case of a heap attack to a single
> +			cache (risks via metadata attacks are mostly
> +			unchanged).
>  			For more information see Documentation/mm/slub.rst.
>
>  	slab_max_order=	[MM, SLAB]
> diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst
> index be75971532f5..0e2ce82177c0 100644
> --- a/Documentation/mm/slub.rst
> +++ b/Documentation/mm/slub.rst
> @@ -122,9 +122,10 @@ used on the wrong slab.
>  Slab merging
>  ============
>
> -If no debug options are specified then SLUB may merge similar slabs together
> -in order to reduce overhead and increase cache hotness of objects.
> -``slabinfo -a`` displays which slabs were merged together.
> +If the kernel is built with ``CONFIG_SLAB_MERGE_DEFAULT`` or if ``slab_merge``
> +is specified on the kernel command line, then SLUB may merge similar slabs
> +together in order to reduce memory overhead and increase cache hotness of
> +objects. ``slabinfo -a`` displays which slabs were merged together.

Suggest mentioning that one of the primary goals of slab cache merging is
to reduce cache footprint.

>  Slab validation
>  ===============
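[As an illustrative aside on the ``slabinfo -a`` reference in the hunk
above: on a running SLUB kernel, merged aliases can also be observed
directly in sysfs, where aliased cache names are symlinks into a shared
cache directory. The sketch below assumes a Linux host with
/sys/kernel/slab present; the grouping logic itself is generic.]

```python
import os
from collections import defaultdict


def merged_cache_groups(slab_dir="/sys/kernel/slab"):
    """Group cache names by the real sysfs directory they resolve to.

    Merged (aliased) caches appear as multiple entries resolving to one
    shared directory, so any group with more than one name has been
    merged. With slab_nomerge on the command line, the expectation is
    that these groups mostly disappear.
    """
    groups = defaultdict(list)
    for name in sorted(os.listdir(slab_dir)):
        target = os.path.realpath(os.path.join(slab_dir, name))
        groups[target].append(name)
    return {t: names for t, names in groups.items() if len(names) > 1}


if __name__ == "__main__" and os.path.isdir("/sys/kernel/slab"):
    for target, names in merged_cache_groups().items():
        print(os.path.basename(target), "<-", ", ".join(names))
```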
On Mon Jul 3, 2023 at 02:09, David Rientjes wrote:
> I think we need more data beyond just kernbench. Christoph's point about
> different page sizes is interesting. In the above results, I don't know
> the page orders for the various slab caches that this workload will
> stress. I think the memory overhead data may be different depending on
> how slab_max_order is being used, if at all.
>
> We should be able to run this through a variety of different benchmarks
> and measure peak slab usage at the same time for due diligence. I support
> the change in the default, I would just prefer to know what the
> implications of it are.
>
> Is it possible to collect data for other microbenchmarks and real-world
> workloads? And perhaps also with different page sizes where this will
> impact memory overhead more? I can help running more workloads once we
> have the next set of data.

David,

I agree about the need to perform those tests on hardware using larger
pages. I will collect data if I have the chance to get my hands on one
of these systems.

Do you have specific tests or workloads in mind? Compiling the kernel
with files sitting on an XFS partition is not exhaustive, but it is the
only test I could think of that is both easy to set up and reproducible
while keeping external interference to a minimum.
On Mon, Jul 03, 2023 at 12:33:25PM +0200, Julian Pidancet wrote:
> On Mon Jul 3, 2023 at 02:09, David Rientjes wrote:
> > I think we need more data beyond just kernbench. Christoph's point about
> > different page sizes is interesting. In the above results, I don't know
> > the page orders for the various slab caches that this workload will
> > stress. I think the memory overhead data may be different depending on
> > how slab_max_order is being used, if at all.
> >
> > We should be able to run this through a variety of different benchmarks
> > and measure peak slab usage at the same time for due diligence. I support
> > the change in the default, I would just prefer to know what the
> > implications of it are.
> >
> > Is it possible to collect data for other microbenchmarks and real-world
> > workloads? And perhaps also with different page sizes where this will
> > impact memory overhead more? I can help running more workloads once we
> > have the next set of data.
>
> David,
>
> I agree about the need to perform those tests on hardware using larger
> pages. I will collect data if I have the chance to get my hands on one
> of these systems.
>
> Do you have specific tests or workloads in mind? Compiling the kernel
> with files sitting on an XFS partition is not exhaustive, but it is the
> only test I could think of that is both easy to set up and reproducible
> while keeping external interference to a minimum.

I think it is a sufficiently complicated heap allocation workload (and
real-world). I'd prefer we get this change landed in -next after -rc1 so
we can see if there are any regressions reported by the 0day and other CI
performance tests.

-Kees
On Mon, 3 Jul 2023, Julian Pidancet wrote:
> On Mon Jul 3, 2023 at 02:09, David Rientjes wrote:
> > I think we need more data beyond just kernbench. Christoph's point about
> > different page sizes is interesting. In the above results, I don't know
> > the page orders for the various slab caches that this workload will
> > stress. I think the memory overhead data may be different depending on
> > how slab_max_order is being used, if at all.
> >
> > We should be able to run this through a variety of different benchmarks
> > and measure peak slab usage at the same time for due diligence. I support
> > the change in the default, I would just prefer to know what the
> > implications of it are.
> >
> > Is it possible to collect data for other microbenchmarks and real-world
> > workloads? And perhaps also with different page sizes where this will
> > impact memory overhead more? I can help running more workloads once we
> > have the next set of data.
>
> David,
>
> I agree about the need to perform those tests on hardware using larger
> pages. I will collect data if I have the chance to get my hands on one
> of these systems.

Thanks. I think arm64 should suffice for things like 64KB pages that
Christoph was referring to. We also may want to play around with
slub_min_order on the kernel command line since that will inflate the
size of slab pages, and we may see some different results because of the
increased page size.

> Do you have specific tests or workloads in mind? Compiling the kernel
> with files sitting on an XFS partition is not exhaustive, but it is the
> only test I could think of that is both easy to set up and reproducible
> while keeping external interference to a minimum.

The ones that Binder, cc'd, used to evaluate SLAB vs SLUB memory
overhead:

  hackbench
  netperf
  redis
  specjbb2015
  unixbench
  will-it-scale

And Vlastimil had also suggested a few XFS-specific benchmarks.

I can try to help run benchmarks that you're not able to run or if you
can't get your hands on an arm64 system.

Additionally, I wouldn't consider this to be super urgent: slab cache
merging has been this way for several years, so we have some time to do
an assessment of the implications of changing an important aspect of
kernel memory allocation that will affect everybody. I agree with the
patch if we can make it work, I'd just like to study the effect of it
more fully beyond some kernbench runs.
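[The "measure peak slab usage at the same time" approach described above
can be approximated by polling /proc/meminfo while a benchmark runs.
A rough sketch, with the field names taken from standard /proc/meminfo
keys; the polling interval and duration are arbitrary choices, not part
of any tool used in this thread.]

```python
import re
import time

_FIELDS = ("Slab", "SReclaimable", "SUnreclaim")


def parse_slab_usage(meminfo_text):
    """Extract the slab counters (in kB) from /proc/meminfo contents."""
    usage = {}
    for line in meminfo_text.splitlines():
        m = re.match(r"(\w+):\s+(\d+)\s+kB", line)
        if m and m.group(1) in _FIELDS:
            usage[m.group(1)] = int(m.group(2))
    return usage


def sample_peak_slab(duration_s=60, interval_s=1.0):
    """Poll /proc/meminfo and report the peak of each slab counter.

    Run this alongside the benchmark workload; the peaks correspond to
    the "During test (peak)" row in the commit message's table.
    """
    peak = {f: 0 for f in _FIELDS}
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        with open("/proc/meminfo") as f:
            for field, kb in parse_slab_usage(f.read()).items():
                peak[field] = max(peak[field], kb)
        time.sleep(interval_s)
    return peak
```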
On Mon, 3 Jul 2023, David Rientjes wrote:
> hackbench
Running hackbench on Skylake with v6.1.30 (A) and v6.1.30 + your patch
(B), for example:
LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV | DIRECTION
--------------------------------+-------+------------+------------+------------+------------+-----------+----------------
SReclaimable | | | | | | |
(A) v6.1.30 | 11 | 129480.000 | 233208.000 | 189936.364 | 204316.000 | 31465.625 |
(B) <same sha> | 11 | 139084.000 | 236772.000 | 198931.273 | 213672.000 | 30013.204 |
| | +7.42% | +1.53% | +4.74% | +4.58% | -4.62% | <not defined>
SUnreclaim | | | | | | |
(A) v6.1.30 | 11 | 305400.000 | 538744.000 | 422148.000 | 449344.000 | 65005.045 |
(B) <same sha> | 11 | 305780.000 | 518300.000 | 422219.636 | 450252.000 | 61245.137 |
| | +0.12% | -3.79% | +0.02% | +0.20% | -5.78% | <not defined>
Amount of reclaimable slab significantly increases which is likely not a
problem because, well, it's reclaimable. But I suspect we'll find other
interesting data points with the other suggested benchmarks.
And benchmark results:
LABEL | COUNT | MIN | MAX | MEAN | MEDIAN | STDDEV | DIRECTION
--------------------------------+-------+------------+------------+------------+------------+-----------+----------------
hackbench_process_pipes_234 | | | | | | |
(A) v6.1.30 | 7 | 1.735 | 1.979 | 1.831 | 1.835 | 0.086291 |
(B) <same sha> | 7 | 1.687 | 2.023 | 1.886 | 1.911 | 0.10276 |
| | -2.77% | +2.22% | +3.00% | +4.14% | +19.09% | <not defined>
hackbench_process_pipes_max | | | | | | |
(A) v6.1.30 | 7 | 1.735 | 1.979 | 1.831 | 1.835 | 0.086291 |
(B) <same sha> | 7 | 1.687 | 2.023 | 1.886 | 1.911 | 0.10276 |
| | -2.77% | +2.22% | +3.00% | +4.14% | +19.09% | - is good
hackbench_process_sockets_234 | | | | | | |
(A) v6.1.30 | 7 | 7.883 | 7.909 | 7.899 | 7.899 | 0.0087808 |
(B) <same sha> | 7 | 7.872 | 7.961 | 7.907 | 7.904 | 0.028019 |
| | -0.14% | +0.66% | +0.10% | +0.06% | +219.09% | <not defined>
hackbench_process_sockets_max | | | | | | |
(A) v6.1.30 | 7 | 7.883 | 7.909 | 7.899 | 7.899 | 0.0087808 |
(B) <same sha> | 7 | 7.872 | 7.961 | 7.907 | 7.904 | 0.028019 |
| | -0.14% | +0.66% | +0.10% | +0.06% | +219.09% | - is good
hackbench_thread_pipes_234 | | | | | | |
(A) v6.1.30 | 7 | 2.146 | 2.677 | 2.410 | 2.418 | 0.18143 |
(B) <same sha> | 7 | 2.016 | 2.514 | 2.268 | 2.241 | 0.17474 |
| | -6.06% | -6.09% | -5.88% | -7.32% | -3.69% | <not defined>
hackbench_thread_pipes_max | | | | | | |
(A) v6.1.30 | 7 | 2.146 | 2.677 | 2.410 | 2.418 | 0.18143 |
(B) <same sha> | 7 | 2.016 | 2.514 | 2.268 | 2.241 | 0.17474 |
| | -6.06% | -6.09% | -5.88% | -7.32% | -3.69% | - is good
hackbench_thread_sockets_234 | | | | | | |
(A) v6.1.30 | 7 | 8.025 | 8.127 | 8.084 | 8.085 | 0.029755 |
(B) <same sha> | 7 | 7.990 | 8.093 | 8.042 | 8.035 | 0.035152 |
| | -0.44% | -0.42% | -0.53% | -0.62% | +18.14% | <not defined>
hackbench_thread_sockets_max | | | | | | |
(A) v6.1.30 | 7 | 8.025 | 8.127 | 8.084 | 8.085 | 0.029755 |
(B) <same sha> | 7 | 7.990 | 8.093 | 8.042 | 8.035 | 0.035152 |
| | -0.44% | -0.42% | -0.53% | -0.62% | +18.14% | - is good
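[For reference, the summary columns in these tables (COUNT, MIN, MAX,
MEAN, MEDIAN, STDDEV, and the relative-change row) can be reproduced
from raw per-run samples along the following lines. The use of the
sample standard deviation is an assumption, since the exact reporting
tool isn't shown in the thread.]

```python
import statistics


def summarize(samples):
    """COUNT/MIN/MAX/MEAN/MEDIAN/STDDEV columns for one run series."""
    return {
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "stddev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


def delta_row(baseline, candidate):
    """Percent change of run B relative to baseline A, per column."""
    sa, sb = summarize(baseline), summarize(candidate)
    return {k: 100.0 * (sb[k] - sa[k]) / sa[k]
            for k in ("min", "max", "mean", "median", "stddev")}
```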
On Thu, 6 Jul 2023, David Rientjes wrote:
> On Mon, 3 Jul 2023, David Rientjes wrote:
>
> > hackbench
>
> Running hackbench on Skylake with v6.1.30 (A) and v6.1.30 + your patch
> (B), for example:
>
> [SReclaimable/SUnreclaim tables snipped; quoted in full above]
>
> Amount of reclaimable slab significantly increases which is likely not a
> problem because, well, it's reclaimable. But I suspect we'll find other
> interesting data points with the other suggested benchmarks.
>
> And benchmark results:
>
> [hackbench tables snipped; quoted in full above]

My takeaway from running half a dozen benchmarks on Intel is that
performance is more impacted than slab memory usage. There are slight
regressions in memory usage, but only measurable for SReclaimable, which
would be the better form (as opposed to SUnreclaim).

There are some substantial performance degradations, most notably
context_switch1_per_thread_ops which regressed ~21%. I'll need to repeat
that test to confirm it and can also try on cascadelake to see if it
reproduces. There are some more negligible redis, specjbb, and
will-it-scale regressions which don't look terribly concerning.

I'll try running performance tests on AMD Zen3 and also ARM with
PAGE_SIZE == 4KB and 64KB.

Unixbench memory usage and performance is within +/- 1% for every metric,
so it's not presented here.

Full results for Skylake, removing results where the mean is within
+/- 1% of baseline:

============================== MEMORY USAGE ==============================

hackbench

LABEL                          | COUNT | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV    | DIRECTION
-------------------------------+-------+------------+------------+------------+------------+-----------+-----------
SReclaimable                   |       |            |            |            |            |           |
(A) v6.1.30                    | 11    | 129480.000 | 233208.000 | 189936.364 | 204316.000 | 31465.625 |
(B) v6.1.30 slab_nomerge       | 11    | 139084.000 | 236772.000 | 198931.273 | 213672.000 | 30013.204 |
                               |       | +7.42%     | +1.53%     | +4.74%     | +4.58%     | -4.62%    | - is good

redis

LABEL                          | COUNT | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV    | DIRECTION
-------------------------------+-------+------------+------------+------------+------------+-----------+-----------
SReclaimable                   |       |            |            |            |            |           |
(A) v6.1.30                    | 298   | 137056.000 | 238664.000 | 226005.477 | 226940.000 | 8109.328  |
(B) v6.1.30 slab_nomerge       | 302   | 139664.000 | 242664.000 | 229096.689 | 230098.000 | 8215.134  |
                               |       | +1.90%     | +1.68%     | +1.37%     | +1.39%     | +1.30%    | - is good

specjbb2015

LABEL                          | COUNT | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV    | DIRECTION
-------------------------------+-------+------------+------------+------------+------------+-----------+-----------
SReclaimable                   |       |            |            |            |            |           |
(A) v6.1.30                    | 1602  | 118344.000 | 217932.000 | 203559.618 | 205372.000 | 5314.410  |
(B) v6.1.30 slab_nomerge       | 1655  | 128000.000 | 222536.000 | 208099.973 | 209396.000 | 4608.582  |
                               |       | +8.16%     | +2.11%     | +2.23%     | +1.96%     | -13.28%   | - is good

============================== PERFORMANCE ==============================

hackbench

LABEL                          | COUNT | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV    | DIRECTION
-------------------------------+-------+------------+------------+------------+------------+-----------+-----------
hackbench_process_pipes_234    |       |            |            |            |            |           |
(A) v6.1.30                    | 7     | 1.735      | 1.979      | 1.831      | 1.835      | 0.086291  |
(B) v6.1.30 slab_nomerge       | 7     | 1.687      | 2.023      | 1.886      | 1.911      | 0.10276   |
                               |       | -2.77%     | +2.22%     | +3.00%     | +4.14%     | +19.09%   | - is good
hackbench_thread_pipes_234     |       |            |            |            |            |           |
(A) v6.1.30                    | 7     | 2.146      | 2.677      | 2.410      | 2.418      | 0.18143   |
(B) v6.1.30 slab_nomerge       | 7     | 2.016      | 2.514      | 2.268      | 2.241      | 0.17474   |
                               |       | -6.06%     | -6.09%     | -5.88%     | -7.32%     | -3.69%    | - is good

redis

LABEL                          | COUNT | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV    | DIRECTION
-------------------------------+-------+------------+------------+------------+------------+-----------+-----------
redis_medium_max_INCR          |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 108695.660 | 112637.980 | 110639.626 | 109757.440 | 1668.190  |
(B) v6.1.30 slab_nomerge       | 5     | 101853.740 | 106564.370 | 104166.478 | 104942.800 | 1833.377  |
                               |       | -6.29%     | -5.39%     | -5.85%     | -4.39%     | +9.90%    | + is good
redis_medium_max_LPOP          |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 102944.200 | 108471.630 | 105572.750 | 106303.820 | 2016.986  |
(B) v6.1.30 slab_nomerge       | 5     | 101471.340 | 104231.810 | 103361.688 | 104090.770 | 1064.277  |
                               |       | -1.43%     | -3.91%     | -2.09%     | -2.08%     | -47.23%   | + is good
redis_medium_max_LPUSH         |       |            |            |            |            |           |
(A) v6.1.30                    | 10    | 99255.590  | 108295.430 | 105960.440 | 106338.120 | 2553.802  |
(B) v6.1.30 slab_nomerge       | 10    | 100130.160 | 107032.000 | 104335.070 | 105091.705 | 2169.708  |
                               |       | +0.88%     | -1.17%     | -1.53%     | -1.17%     | -15.04%   | + is good
redis_medium_max_LRANGE_100    |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 72427.030  | 73046.020  | 72671.814  | 72626.910  | 202.812   |
(B) v6.1.30 slab_nomerge       | 5     | 70811.500  | 72030.540  | 71519.286  | 71761.750  | 450.918   |
                               |       | -2.23%     | -1.39%     | -1.59%     | -1.19%     | +122.33%  | + is good
redis_medium_max_MSET_10       |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 87642.420  | 89798.850  | 89044.390  | 89102.740  | 769.933   |
(B) v6.1.30 slab_nomerge       | 5     | 85287.840  | 89758.550  | 87876.598  | 88386.070  | 1641.608  |
                               |       | -2.69%     | -0.04%     | -1.31%     | -0.80%     | +113.21%  | + is good
redis_medium_max_PING_BULK     |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 101729.400 | 108189.980 | 105003.228 | 105307.490 | 2171.756  |
(B) v6.1.30 slab_nomerge       | 5     | 100553.050 | 105340.770 | 102561.464 | 101947.190 | 1789.953  |
                               |       | -1.16%     | -2.63%     | -2.33%     | -3.19%     | -17.58%   | + is good
redis_medium_max_PING_INLINE   |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 102522.050 | 107503.770 | 105209.902 | 106033.300 | 1981.499  |
(B) v6.1.30 slab_nomerge       | 5     | 97541.950  | 107319.170 | 103729.414 | 104854.780 | 3304.256  |
                               |       | -4.86%     | -0.17%     | -1.41%     | -1.11%     | +66.76%   | + is good
redis_medium_max_SET           |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 105663.570 | 112283.850 | 108917.118 | 109469.070 | 2663.234  |
(B) v6.1.30 slab_nomerge       | 5     | 103071.540 | 106723.590 | 105128.226 | 106179.660 | 1666.892  |
                               |       | -2.45%     | -4.95%     | -3.48%     | -3.00%     | -37.41%   | + is good
redis_medium_max_SPOP          |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 104079.940 | 107238.610 | 105140.616 | 104964.840 | 1150.370  |
(B) v6.1.30 slab_nomerge       | 5     | 102637.790 | 103885.300 | 103343.934 | 103412.620 | 437.159   |
                               |       | -1.39%     | -3.13%     | -1.71%     | -1.48%     | -62.00%   | + is good
redis_small_max_INCR           |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 98814.230  | 114942.530 | 107744.856 | 108813.920 | 6150.540  |
(B) v6.1.30 slab_nomerge       | 5     | 99800.400  | 109529.020 | 104451.708 | 104058.270 | 3732.461  |
                               |       | +1.00%     | -4.71%     | -3.06%     | -4.37%     | -39.31%   | + is good
redis_small_max_LPOP           |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 104275.290 | 118764.840 | 108648.192 | 106951.880 | 5208.918  |
(B) v6.1.30 slab_nomerge       | 5     | 97560.980  | 115074.800 | 103120.496 | 99800.400  | 6353.203  |
                               |       | -6.44%     | -3.11%     | -5.09%     | -6.69%     | +21.97%   | + is good
redis_small_max_LRANGE_100     |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 67980.970  | 72992.700  | 71589.644  | 72150.070  | 1832.810  |
(B) v6.1.30 slab_nomerge       | 5     | 64977.260  | 72046.110  | 70273.716  | 71684.590  | 2680.854  |
                               |       | -4.42%     | -1.30%     | -1.84%     | -0.65%     | +46.27%   | + is good
redis_small_max_MSET_10        |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 90497.730  | 106044.540 | 100756.422 | 102880.660 | 5455.768  |
(B) v6.1.30 slab_nomerge       | 5     | 97276.270  | 106951.880 | 102818.856 | 102880.660 | 3293.135  |
                               |       | +7.49%     | +0.86%     | +2.05%     | +0.00%     | -39.64%   | + is good
redis_small_max_PING_INLINE    |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 96153.850  | 108459.870 | 102493.414 | 102459.020 | 4995.757  |
(B) v6.1.30 slab_nomerge       | 5     | 84317.030  | 116144.020 | 99995.920  | 98039.220  | 11045.861 |
                               |       | -12.31%    | +7.08%     | -2.44%     | -4.31%     | +121.10%  | + is good
redis_small_max_SADD           |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 106044.540 | 115606.940 | 109804.052 | 110375.270 | 3451.251  |
(B) v6.1.30 slab_nomerge       | 5     | 95693.780  | 109769.480 | 102329.518 | 102249.490 | 4602.161  |
                               |       | -9.76%     | -5.05%     | -6.81%     | -7.36%     | +33.35%   | + is good
redis_small_max_SET            |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 91911.760  | 116686.120 | 104509.200 | 102354.150 | 8993.532  |
(B) v6.1.30 slab_nomerge       | 5     | 100502.520 | 113636.370 | 108815.700 | 109649.120 | 4750.002  |
                               |       | +9.35%     | -2.61%     | +4.12%     | +7.13%     | -47.18%   | + is good
redis_small_max_SPOP           |       |            |            |            |            |           |
(A) v6.1.30                    | 5     | 96899.230  | 108695.650 | 103648.652 | 104931.800 | 3901.567  |
(B) v6.1.30 slab_nomerge       | 5     | 93457.940  | 108108.110 | 101680.560 | 101626.020 | 5096.944  |
                               |       | -3.55%     | -0.54%     | -1.90%     | -3.15%     | +30.64%   | + is good

specjbb2015

LABEL                            | COUNT | MIN       | MAX       | MEAN      | MEDIAN    | STDDEV | DIRECTION
---------------------------------+-------+-----------+-----------+-----------+-----------+--------+-----------
specjbb2015_single_Critical_JOPS |       |           |           |           |           |        |
(A) v6.1.30                      | 1     | 46294.000 | 46294.000 | 46294.000 | 46294.000 | 0      |
(B) v6.1.30 slab_nomerge         | 1     | 46167.000 | 46167.000 | 46167.000 | 46167.000 | 0      |
                                 |       | -0.27%    | -0.27%    | -0.27%    | -0.27%    | ---    | + is good
specjbb2015_single_Max_JOPS      |       |           |           |           |           |        |
(A) v6.1.30                      | 1     | 68842.000 | 68842.000 | 68842.000 | 68842.000 | 0      |
(B) v6.1.30 slab_nomerge         | 1     | 67801.000 | 67801.000 | 67801.000 | 67801.000 | 0      |
                                 |       | -1.51%    | -1.51%    | -1.51%    | -1.51%    | ---    | + is good

vm-scalability

LABEL                                | COUNT | MIN             | MAX             | MEAN            | MEDIAN          | STDDEV        | DIRECTION
-------------------------------------+-------+-----------------+-----------------+-----------------+-----------------+---------------+-----------
300s_128G_truncate_throughput        |       |                 |                 |                 |                 |               |
(A) v6.1.30                          | 15    | 16398714804.000 | 17010339870.000 | 16772025703.867 | 16834675132.000 | 232697088.501 |
(B) v6.1.30 slab_nomerge             | 15    | 16704416343.000 | 17271437122.000 | 16948419991.200 | 16821799877.000 | 233146680.475 |
                                     |       | +1.86%          | +1.53%          | +1.05%          | -0.08%          | +0.19%        | + is good
300s_512G_anon_wx_rand_mt_throughput |       |                 |                 |                 |                 |               |
(A) v6.1.30                          | 15    | 7198561.000     | 7359712.000     | 7263944.200     | 7259418.000     | 50394.115     |
(B) v6.1.30 slab_nomerge             | 15    | 7191842.000     | 7628158.000     | 7390629.000     | 7407204.000     | 171602.612    |
                                     |       | -0.09%          | +3.65%          | +1.74%          | +2.04%          | +240.52%      | + is good

will-it-scale

LABEL                          | COUNT | MIN          | MAX          | MEAN         | MEDIAN       | STDDEV | DIRECTION
-------------------------------+-------+--------------+--------------+--------------+--------------+--------+-----------
context_switch1_per_thread_ops |       |              |              |              |              |        |
(A) v6.1.30                    | 1     | 324721.000   | 324721.000   | 324721.000   | 324721.000   | 0      |
(B) v6.1.30 slab_nomerge       | 1     | 255999.000   | 255999.000   | 255999.000   | 255999.000   | 0      |
!! REGRESSED !!                |       | -21.16%      | -21.16%      | -21.16%      | -21.16%      | ---    | + is good
getppid1_scalability           |       |              |              |              |              |        |
(A) v6.1.30                    | 1     | 0.71943      | 0.71943      | 0.71943      | 0.71943      | 0      |
(B) v6.1.30 slab_nomerge       | 1     | 0.70923      | 0.70923      | 0.70923      | 0.70923      | 0      |
                               |       | -1.42%       | -1.42%       | -1.42%       | -1.42%       | ---    | + is good
mmap1_scalability              |       |              |              |              |              |        |
(A) v6.1.30                    | 1     | 0.18831      | 0.18831      | 0.18831      | 0.18831      | 0      |
(B) v6.1.30 slab_nomerge       | 1     | 0.18413      | 0.18413      | 0.18413      | 0.18413      | 0      |
                               |       | -2.22%       | -2.22%       | -2.22%       | -2.22%       | ---    | + is good
poll2_scalability              |       |              |              |              |              |        |
(A) v6.1.30                    | 1     | 0.45608      | 0.45608      | 0.45608      | 0.45608      | 0      |
(B) v6.1.30 slab_nomerge       | 1     | 0.44207      | 0.44207      | 0.44207      | 0.44207      | 0      |
                               |       | -3.07%       | -3.07%       | -3.07%       | -3.07%       | ---    | + is good
pthread_mutex1_scalability     |       |              |              |              |              |        |
(A) v6.1.30                    | 1     | 0.45207      | 0.45207      | 0.45207      | 0.45207      | 0      |
(B) v6.1.30 slab_nomerge       | 1     | 0.44194      | 0.44194      | 0.44194      | 0.44194      | 0      |
                               |       | -2.24%       | -2.24%       | -2.24%       | -2.24%       | ---    | + is good
pthread_mutex2_per_process_ops |       |              |              |              |              |        |
(A) v6.1.30                    | 1     | 36292960.000 | 36292960.000 | 36292960.000 | 36292960.000 | 0      |
(B) v6.1.30 slab_nomerge       | 1     | 35446930.000 | 35446930.000 | 35446930.000 | 35446930.000 | 0      |
                               |       | -2.33%       | -2.33%       | -2.33%       | -2.33%       | ---    | + is good
signal1_scalability            |       |              |              |              |              |        |
(A) v6.1.30                    | 1     | 0.55541      | 0.55541      | 0.55541      | 0.55541      | 0      |
(B) v6.1.30 slab_nomerge       | 1     | 0.54773      | 0.54773      | 0.54773      | 0.54773      | 0      |
                               |       | -1.38%       | -1.38%       | -1.38%       | -1.38%       | ---    | + is good
unix1_scalability              |       |              |              |              |              |        |
(A) v6.1.30                    | 1     | 0.55085      | 0.55085      | 0.55085      | 0.55085      | 0      |
(B) v6.1.30 slab_nomerge       | 1     | 0.53957      | 0.53957      | 0.53957      | 0.53957      | 0      |
                               |       | -2.05%       | -2.05%       | -2.05%       | -2.05%       | ---    | + is good
On Sun, 9 Jul 2023, David Rientjes wrote:
> There are some substantial performance degradations, most notably
> context_switch1_per_thread_ops which regressed ~21%. I'll need to repeat
> that test to confirm it and can also try on cascadelake if it reproduces.

So the regression on skylake for will-it-scale appears to be real:

LABEL                          | COUNT | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV | DIRECTION
-------------------------------+-------+------------+------------+------------+------------+--------+-----------
context_switch1_per_thread_ops |       |            |            |            |            |        |
(A) v6.1.30                    | 1     | 314507.000 | 314507.000 | 314507.000 | 314507.000 | 0      |
(B) v6.1.30 slab_nomerge       | 1     | 257403.000 | 257403.000 | 257403.000 | 257403.000 | 0      |
!! REGRESSED !!                |       | -18.16%    | -18.16%    | -18.16%    | -18.16%    | ---    | + is good

but I can't reproduce this on cascadelake:

LABEL                          | COUNT | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV | DIRECTION
-------------------------------+-------+------------+------------+------------+------------+--------+-----------
context_switch1_per_thread_ops |       |            |            |            |            |        |
(A) v6.1.30                    | 1     | 301128.000 | 301128.000 | 301128.000 | 301128.000 | 0      |
(B) v6.1.30 slab_nomerge       | 1     | 301282.000 | 301282.000 | 301282.000 | 301282.000 | 0      |
                               |       | +0.05%     | +0.05%     | +0.05%     | +0.05%     | ---    | + is good

So I'm a bit baffled at the moment.

I'll try to dig deeper and see what slab caches this benchmark exercises
that apparently no other benchmarks do. (I'm really hoping that the only
way to recover this performance is by something like
kmem_cache_create(SLAB_MERGE).)
On 7/3/23 22:17, David Rientjes wrote:
> Additionally, I wouldn't consider this to be super urgent: slab cache
> merging has been this way for several years, we have some time to do an
> assessment of the implications of changing an important aspect of kernel
> memory allocation that will affect everybody.

Agreed, although I wouldn't say "affect everybody", because the changed
upstream default may not automatically translate to what distros will
use, and I'd expect most people rely on distro kernels.

> I agree with the patch if we can make it work, I'd just like to study
> the effect of it more fully beyond some kernbench runs.
On Mon Jul 10, 2023 at 04:40, David Rientjes wrote:
> On Sun, 9 Jul 2023, David Rientjes wrote:
>
> > There are some substantial performance degradations, most notably
> > context_switch1_per_thread_ops which regressed ~21%. I'll need to
> > repeat that test to confirm it and can also try on cascadelake if it
> > reproduces.
>
> So the regression on skylake for will-it-scale appears to be real:
>
> LABEL                          | COUNT | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV | DIRECTION
> -------------------------------+-------+------------+------------+------------+------------+--------+----------
> context_switch1_per_thread_ops |       |            |            |            |            |        |
> (A) v6.1.30                    | 1     | 314507.000 | 314507.000 | 314507.000 | 314507.000 | 0      |
> (B) v6.1.30 slab_nomerge       | 1     | 257403.000 | 257403.000 | 257403.000 | 257403.000 | 0      |
> !! REGRESSED !!                |       | -18.16%    | -18.16%    | -18.16%    | -18.16%    | ---    | + is good
>
> but I can't reproduce this on cascadelake:
>
> LABEL                          | COUNT | MIN        | MAX        | MEAN       | MEDIAN     | STDDEV | DIRECTION
> -------------------------------+-------+------------+------------+------------+------------+--------+----------
> context_switch1_per_thread_ops |       |            |            |            |            |        |
> (A) v6.1.30                    | 1     | 301128.000 | 301128.000 | 301128.000 | 301128.000 | 0      |
> (B) v6.1.30 slab_nomerge       | 1     | 301282.000 | 301282.000 | 301282.000 | 301282.000 | 0      |
>                                |       | +0.05%     | +0.05%     | +0.05%     | +0.05%     | ---    | + is good
>
> So I'm a bit baffled at the moment.
>
> I'll try to dig deeper and see what slab caches this benchmark exercises
> that apparently no other benchmarks do. (I'm really hoping that the only
> way to recover this performance is by something like
> kmem_cache_create(SLAB_MERGE).)

Hi David,

Many thanks for running all these tests. The amount of attention you've
given this change is simply amazing. I wish I had been able to assist you
by doing more tests, but I've been lacking the necessary resources to do
so.

I'm as surprised as you are regarding the skylake regression. 20% is
quite a large number, but perhaps it's less worrying than it looks, given
that benchmarks are usually very different from real-world workloads?

As Kees Cook was suggesting in his own reply, have you given any thought
to including this change in -next to see if regressions show up in CI
performance test results?

Regards,
On Tue, 18 Jul 2023, Julian Pidancet wrote:

> Hi David,
>
> Many thanks for running all these tests. The amount of attention you've
> given this change is simply amazing. I wish I had been able to assist
> you by doing more tests, but I've been lacking the necessary resources
> to do so.
>
> I'm as surprised as you are regarding the skylake regression. 20% is
> quite a large number, but perhaps it's less worrying than it looks,
> given that benchmarks are usually very different from real-world
> workloads?

I'm not an expert on context_switch1_per_thread_ops, so I can't infer
which workloads would be most affected by such a regression, other than
to point out that -18% is quite substantial.

I'm still hoping to run some benchmarks with 64KB page sizes as Christoph
suggested; I should be able to do this with arm64.

It's certainly good news that the overall memory footprint doesn't change
much with this change.

> As Kees Cook was suggesting in his own reply, have you given any
> thought to including this change in -next to see if regressions show up
> in CI performance test results?

I assume that anything we can run with CI performance tests can also be
run without merging into -next?

The performance degradation is substantial for a microbenchmark. I'd like
to complete the picture on other benchmarks and do a complete analysis
with 64KB page sizes, since I think the concern Christoph mentions could
be quite real. We just don't have the data yet to make an informed
assessment of it. Certainly would welcome any help that others would like
to provide for running benchmarks with this change as well :P

Once we have a complete picture, we might also want to discuss what we
are hoping to achieve with such a change. I was very supportive of it
prior to the -18% benchmark result. But if most users are simply using
whatever their distro defaults to, and other users may already be opting
into this via the kernel command line or .config, it's hard to determine
exactly the set of users that would be affected by this change. Suddenly
causing a -18% regression overnight would be surprising for them.
On 7/26/23 01:25, David Rientjes wrote:
> On Tue, 18 Jul 2023, Julian Pidancet wrote:
>
>> Hi David,
>>
>> Many thanks for running all these tests. The amount of attention
>> you've given this change is simply amazing. I wish I had been able to
>> assist you by doing more tests, but I've been lacking the necessary
>> resources to do so.
>>
>> I'm as surprised as you are regarding the skylake regression. 20% is
>> quite a large number, but perhaps it's less worrying than it looks,
>> given that benchmarks are usually very different from real-world
>> workloads?
>
> I'm not an expert on context_switch1_per_thread_ops, so I can't infer
> which workloads would be most affected by such a regression, other than
> to point out that -18% is quite substantial.

It might turn out that this regression is accidental, in that merging
happens to result in better caching that benefits the particular skylake
cache hierarchy (but not others), because the workload happens to use two
different classes of objects that are compatible for merging, and uses
them with identical lifetimes. But that would arguably still be a corner
case and not something to result in a hard go/no-go for the change, as
similar corner cases would likely exist that benefit from not merging.
But it's possible the reason for the regression is something less
expected than the above hypothesis, so indeed we should investigate
first.

> I'm still hoping to run some benchmarks with 64KB page sizes as
> Christoph suggested; I should be able to do this with arm64.
>
> It's certainly good news that the overall memory footprint doesn't
> change much with this change.
>
>> As Kees Cook was suggesting in his own reply, have you given any
>> thought to including this change in -next to see if regressions show
>> up in CI performance test results?
>
> I assume that anything we can run with CI performance tests can also be
> run without merging into -next?
>
> The performance degradation is substantial for a microbenchmark. I'd
> like to complete the picture on other benchmarks and do a complete
> analysis with 64KB page sizes, since I think the concern Christoph
> mentions could be quite real. We just don't have the data yet to make
> an informed assessment of it. Certainly would welcome any help that
> others would like to provide for running benchmarks with this change as
> well :P
>
> Once we have a complete picture, we might also want to discuss what we
> are hoping to achieve with such a change. I was very supportive of it
> prior to the -18% benchmark result. But if most users are simply using
> whatever their distro defaults to, and other users may already be
> opting into this via the kernel command line or .config, it's hard to
> determine exactly the set of users that would be affected by this
> change. Suddenly causing a -18% regression overnight would be
> surprising for them.

What I'd hope to achieve is that if we find out the differences between
merging and not merging are negligible (modulo corner cases) for both
performance and memory, we'd not only change the default, but even make
merging more exceptional. It should still be done under SLUB_TINY, and
maybe we can keep the slab_merge boot option, but that's it? Because in
case they are comparable, not merging does have benefits:
/proc/slabinfo accounting is not misleading, so when a bug is reported it
is not necessary to reboot with slab_nomerge to get the real picture; and
then there are the security benefits mentioned, etc.
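To make the merging being discussed concrete: SLUB's find_mergeable() in
mm/slab_common.c treats two caches as merge candidates, roughly, when
neither has a constructor or merge-inhibiting flags and their aligned
object sizes coincide. A simplified userspace sketch of that rule (the
names and the reduced flag check here are illustrative, not the kernel's
exact logic):

```python
# Simplified model of SLUB's merge-candidate check; illustrative only.
SLAB_NEVER_MERGE = 0x1  # stand-in bit for debug/poison/ctor-style flags

def aligned(size: int, align: int = 8) -> int:
    """Round size up to the alignment, similar to ALIGN() in the kernel."""
    return (size + align - 1) // align * align

def mergeable(a: dict, b: dict) -> bool:
    # Caches with constructors or merge-inhibiting flags never merge;
    # otherwise they merge when their aligned object sizes coincide.
    if a["ctor"] or b["ctor"]:
        return False
    if (a["flags"] | b["flags"]) & SLAB_NEVER_MERGE:
        return False
    return aligned(a["size"]) == aligned(b["size"])

cache_a = {"size": 190, "ctor": None, "flags": 0}
cache_b = {"size": 192, "ctor": None, "flags": 0}
print(mergeable(cache_a, cache_b))  # True: 190 rounds up to 192
```

With slab_nomerge (or the default this patch proposes), each cache keeps
its own slabs even when a check like this would pass, which is what keeps
heap-spray targets separated.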
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index c5e7bb4babf0..7e78471a96b7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5652,21 +5652,22 @@
 	slram=		[HW,MTD]
 
-	slab_merge	[MM]
-			Enable merging of slabs with similar size when the
-			kernel is built without CONFIG_SLAB_MERGE_DEFAULT.
-
 	slab_nomerge	[MM]
-			Disable merging of slabs with similar size. May be
-			necessary if there is some reason to distinguish
-			allocs to different slabs, especially in hardened
-			environments where the risk of heap overflows and
-			layout control by attackers can usually be
-			frustrated by disabling merging. This will reduce
-			most of the exposure of a heap attack to a single
-			cache (risks via metadata attacks are mostly
-			unchanged). Debug options disable merging on their
-			own.
+			Disable merging of slabs with similar size when
+			the kernel is built with CONFIG_SLAB_MERGE_DEFAULT.
+			Allocations of the same size made in distinct
+			caches will be placed in separate slabs. In
+			hardened environments, the risk of heap overflows
+			and layout control by attackers can usually be
+			frustrated by disabling merging.
+
+	slab_merge	[MM]
+			Enable merging of slabs with similar size. May be
+			necessary to reduce overhead or increase cache
+			hotness of objects, at the cost of increased
+			exposure in case of a heap attack to a single
+			cache (risks via metadata attacks are mostly
+			unchanged).
 
 			For more information see Documentation/mm/slub.rst.
 
 	slab_max_order=	[MM, SLAB]
diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst
index be75971532f5..0e2ce82177c0 100644
--- a/Documentation/mm/slub.rst
+++ b/Documentation/mm/slub.rst
@@ -122,9 +122,10 @@ used on the wrong slab.
 Slab merging
 ============
 
-If no debug options are specified then SLUB may merge similar slabs together
-in order to reduce overhead and increase cache hotness of objects.
-``slabinfo -a`` displays which slabs were merged together.
+If the kernel is built with ``CONFIG_SLAB_MERGE_DEFAULT`` or if ``slab_merge``
+is specified on the kernel command line, then SLUB may merge similar slabs
+together in order to reduce memory overhead and increase cache hotness of
+objects. ``slabinfo -a`` displays which slabs were merged together.
 
 Slab validation
 ===============
diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..05b0304302d4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -255,7 +255,7 @@ config SLUB_TINY
 
 config SLAB_MERGE_DEFAULT
 	bool "Allow slab caches to be merged"
-	default y
+	default n
 	depends on SLAB || SLUB
 	help
 	  For reduced kernel memory fragmentation, slab caches can be
@@ -264,8 +264,8 @@ config SLAB_MERGE_DEFAULT
 	  overwrite objects from merged caches (and more easily control
 	  cache layout), which makes such heap attacks easier to exploit
 	  by attackers. By keeping caches unmerged, these kinds of exploits
-	  can usually only damage objects in the same cache. To disable
-	  merging at runtime, "slab_nomerge" can be passed on the kernel
+	  can usually only damage objects in the same cache. To enable
+	  merging at runtime, "slab_merge" can be passed on the kernel
 	  command line.
 
 config SLAB_FREELIST_RANDOM