Message ID: 20210805152000.12817-1-vbabka@suse.cz
Series: SLUB: reduce irq disabled scope and make it RT compatible
On 2021-08-05 17:19:25 [+0200], Vlastimil Babka wrote:
> Hi Andrew,

Hi,

> I believe the series is ready for mmotm. No known bugs, Mel found no !RT perf
> regressions in v3 [9], Mike also (details below). RT guys validated it on RT
> config and already incorporated the series in the RT tree.

Correct, incl. the percpu-partial list fix.

…

> RT configs showed some throughput regressions, but that's expected tradeoff for
> the preemption improvements through the RT mutex. It didn't prevent the v2 to
> be incorporated to the 5.13 RT tree [7], leading to testing exposure and
> bugfixes.

There was a throughput regression in RT compared to previous releases
(without this series). The regression was (based on my testing) only
visible in hackbench, and was addressed by adding adaptive spinning to
RT-mutex. With that we are almost back to what we had before :)

…

> The remaining patches to upstream from the RT tree are small ones related to
> KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
> sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
> perhaps be re-evaluated as the series addresses some latency issues with it.

With your rework, CONFIG_SLUB_CPU_PARTIAL can be enabled on RT since
v5.14-rc3-rt1. So it has been re-evaluated :)

Regarding SLAB/SLOB: SLOB has a few design parts which are incompatible
with RT (if my memory serves, it was never even attempted to get it
working). SLAB was used before SLUB and required a lot of love. SLUB
performed better than SLAB (in both throughput and latency) and after a
while the SLAB patches were dropped.

Sebastian
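For readers who have not followed the RT tree, the adaptive-spinning
change mentioned above boils down to the loop below. This is a
simplified sketch, not a verbatim quote of the patch; the helper name
and the use of struct rt_mutex approximate the rtmutex internals. A
contended waiter keeps spinning only while the current lock owner is
running on a CPU (and is therefore likely to release the lock soon),
and falls back to blocking otherwise:

	/* Sketch of adaptive spinning for rtmutex waiters. */
	static bool adaptive_spinwait(struct rt_mutex *lock,
				      struct task_struct *owner)
	{
		bool res = true;

		rcu_read_lock();
		for (;;) {
			/* Owner changed (or lock released): retry the lock. */
			if (owner != rt_mutex_owner(lock))
				break;
			/*
			 * Ensure owner->on_cpu is dereferenced only after the
			 * ownership check above is known to still hold.
			 */
			barrier();
			/* Owner not on a CPU: spinning is pointless, block. */
			if (!owner->on_cpu) {
				res = false;
				break;
			}
			cpu_relax();
		}
		rcu_read_unlock();
		return res;
	}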
On Thu, 2021-08-05 at 18:42 +0200, Sebastian Andrzej Siewior wrote:
>
> There was a throughput regression in RT compared to previous releases
> (without this series). The regression was (based on my testing) only
> visible in hackbench, and was addressed by adding adaptive spinning to
> RT-mutex. With that we are almost back to what we had before :)

Numbers on my box say a throughput regression remains (silly fork bomb
scenario.. yawn), which can be recouped by either turning on all
SL[AU]B features or converting the list_lock to a raw lock. They also
seem to be saying that if you turned on PREEMPT_RT because you care
about RT performance first and foremost (gee), you'll do neither of
those, because either will eliminate an RT performance progression.

	-Mike

numbers...

box is old i4790 desktop
perf stat -r10 hackbench -s4096 -l500
full warmup, record, repeat twice for elapsed

SLUB+SLUB_DEBUG only

begin previously reported numbers

5.14.0.g79e92006-tip-rt (5.12-rt based as before, 5.13-rt didn't yet exist)
          7,984.52 msec task-clock            #    7.565 CPUs utilized            ( +-  0.66% )
           353,566      context-switches      #   44.281 K/sec                    ( +-  2.77% )
            37,685      cpu-migrations        #    4.720 K/sec                    ( +-  6.37% )
            12,939      page-faults           #    1.620 K/sec                    ( +-  0.67% )
    29,901,079,227      cycles                #    3.745 GHz                      ( +-  0.71% )
    14,550,797,818      instructions          #    0.49  insn per cycle           ( +-  0.47% )
     3,056,685,643      branches              #  382.826 M/sec                    ( +-  0.51% )
         9,598,083      branch-misses         #    0.31% of all branches          ( +-  2.11% )

           1.05542 +- 0.00409 seconds time elapsed  ( +-  0.39% )
           1.05990 +- 0.00244 seconds time elapsed  ( +-  0.23% )  (repeat)
           1.05367 +- 0.00303 seconds time elapsed  ( +-  0.29% )  (repeat)

5.14.0.g79e92006-tip-rt +slub-local-lock-v2r3
-0034-mm-slub-convert-kmem_cpu_slab-protection-to-local_lock.patch
          6,899.35 msec task-clock            #    5.637 CPUs utilized            ( +-  0.53% )
           420,304      context-switches      #   60.919 K/sec                    ( +-  2.83% )
           187,130      cpu-migrations        #   27.123 K/sec                    ( +-  1.81% )
            13,206      page-faults           #    1.914 K/sec                    ( +-  0.96% )
    25,110,362,933      cycles                #    3.640 GHz                      ( +-  0.49% )
    15,853,643,635      instructions          #    0.63  insn per cycle           ( +-  0.64% )
     3,366,261,524      branches              #  487.910 M/sec                    ( +-  0.70% )
        14,839,618      branch-misses         #    0.44% of all branches          ( +-  2.01% )

           1.22390 +- 0.00744 seconds time elapsed  ( +-  0.61% )
           1.21813 +- 0.00907 seconds time elapsed  ( +-  0.74% )  (repeat)
           1.22097 +- 0.00952 seconds time elapsed  ( +-  0.78% )  (repeat)

repeat of above with raw list_lock
          8,072.62 msec task-clock            #    7.605 CPUs utilized            ( +-  0.49% )
           359,514      context-switches      #   44.535 K/sec                    ( +-  4.95% )
            35,285      cpu-migrations        #    4.371 K/sec                    ( +-  5.82% )
            13,503      page-faults           #    1.673 K/sec                    ( +-  0.96% )
    30,247,989,681      cycles                #    3.747 GHz                      ( +-  0.52% )
    14,580,011,391      instructions          #    0.48  insn per cycle           ( +-  0.81% )
     3,063,743,405      branches              #  379.523 M/sec                    ( +-  0.85% )
         8,907,160      branch-misses         #    0.29% of all branches          ( +-  3.99% )

           1.06150 +- 0.00427 seconds time elapsed  ( +-  0.40% )
           1.05041 +- 0.00176 seconds time elapsed  ( +-  0.17% )  (repeat)
           1.06086 +- 0.00237 seconds time elapsed  ( +-  0.22% )  (repeat)

5.14.0.g79e92006-rt3-tip-rt +slub-local-lock-v2r3 full set
          7,598.44 msec task-clock            #    5.813 CPUs utilized            ( +-  0.85% )
           488,161      context-switches      #   64.245 K/sec                    ( +-  4.29% )
           196,866      cpu-migrations        #   25.909 K/sec                    ( +-  1.49% )
            13,042      page-faults           #    1.716 K/sec                    ( +-  0.73% )
    27,695,116,746      cycles                #    3.645 GHz                      ( +-  0.79% )
    18,423,934,168      instructions          #    0.67  insn per cycle           ( +-  0.88% )
     3,969,540,695      branches              #  522.415 M/sec                    ( +-  0.92% )
        15,493,482      branch-misses         #    0.39% of all branches          ( +-  2.15% )

           1.30709 +- 0.00890 seconds time elapsed  ( +-  0.68% )
            1.3205 +- 0.0134  seconds time elapsed  ( +-  1.02% )  (repeat)
            1.3083 +- 0.0132  seconds time elapsed  ( +-  1.01% )  (repeat)

end previously reported numbers

5.14.0.gf6a71a5-rt6-tip-rt (same config, full slub set.. obviously)
          7,707.63 msec task-clock            #    5.880 CPUs utilized            ( +-  1.46% )
           562,533      context-switches      #   72.984 K/sec                    ( +-  7.46% )
           208,475      cpu-migrations        #   27.048 K/sec                    ( +-  2.26% )
            13,022      page-faults           #    1.689 K/sec                    ( +-  0.80% )
    28,025,004,779      cycles                #    3.636 GHz                      ( +-  1.34% )
    18,487,135,489      instructions          #    0.66  insn per cycle           ( +-  1.58% )
     3,997,110,493      branches              #  518.591 M/sec                    ( +-  1.65% )
        16,078,322      branch-misses         #    0.40% of all branches          ( +-  4.23% )

            1.3108 +- 0.0135 seconds time elapsed  ( +-  1.03% )
            1.2997 +- 0.0138 seconds time elapsed  ( +-  1.06% )  (repeat)
            1.3009 +- 0.0166 seconds time elapsed  ( +-  1.28% )  (repeat)

5.14.0.gf6a71a5-rt6-tip-rt +list_lock=raw_spinlock_t
          8,252.59 msec task-clock            #    7.584 CPUs utilized            ( +-  0.27% )
           400,991      context-switches      #   48.590 K/sec                    ( +-  6.15% )
            35,979      cpu-migrations        #    4.360 K/sec                    ( +-  5.63% )
            13,261      page-faults           #    1.607 K/sec                    ( +-  0.73% )
    30,910,310,737      cycles                #    3.746 GHz                      ( +-  0.31% )
    16,522,383,240      instructions          #    0.53  insn per cycle           ( +-  0.92% )
     3,535,219,839      branches              #  428.377 M/sec                    ( +-  0.96% )
        10,115,967      branch-misses         #    0.29% of all branches          ( +-  4.32% )

           1.08817 +- 0.00238 seconds time elapsed  ( +-  0.22% )
           1.08583 +- 0.00243 seconds time elapsed  ( +-  0.22% )  (repeat)
           1.09003 +- 0.00164 seconds time elapsed  ( +-  0.15% )  (repeat)

5.14.0.g251a152-rt6-master-rt (+SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED)
          8,170.48 msec task-clock            #    7.390 CPUs utilized            ( +-  0.43% )
           449,994      context-switches      #   55.076 K/sec                    ( +-  4.20% )
            55,912      cpu-migrations        #    6.843 K/sec                    ( +-  4.28% )
            13,144      page-faults           #    1.609 K/sec                    ( +-  0.53% )
    30,484,114,812      cycles                #    3.731 GHz                      ( +-  0.44% )
    17,554,521,787      instructions          #    0.58  insn per cycle           ( +-  0.76% )
     3,751,725,852      branches              #  459.181 M/sec                    ( +-  0.81% )
        13,421,985      branch-misses         #    0.36% of all branches          ( +-  2.40% )

           1.10563 +- 0.00382 seconds time elapsed  ( +-  0.35% )
            1.1098 +- 0.0147  seconds time elapsed  ( +-  1.32% )  (repeat)
           1.11308 +- 0.00883 seconds time elapsed  ( +-  0.79% )  (repeat)

5.14.0.gf6a71a5-rt6-tip-rt +SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED
          8,026.39 msec task-clock            #    7.320 CPUs utilized            ( +-  0.70% )
           496,579      context-switches      #   61.868 K/sec                    ( +-  6.78% )
            65,022      cpu-migrations        #    8.101 K/sec                    ( +-  8.29% )
            13,161      page-faults           #    1.640 K/sec                    ( +-  0.51% )
    29,870,954,733      cycles                #    3.722 GHz                      ( +-  0.67% )
    17,617,522,235      instructions          #    0.59  insn per cycle           ( +-  1.36% )
     3,760,346,459      branches              #  468.498 M/sec                    ( +-  1.45% )
        12,863,520      branch-misses         #    0.34% of all branches          ( +-  4.45% )

            1.0965 +- 0.0103  seconds time elapsed  ( +-  0.94% )
           1.08149 +- 0.00362 seconds time elapsed  ( +-  0.33% )  (repeat)
           1.10027 +- 0.00916 seconds time elapsed  ( +-  0.83% )

yup, perf delta == config delta, let's have a peek at jitter

cyclictest -Smqp99& perf stat -r100 hackbench -s4096 -l500 && killall cyclictest

5.14.0.gf6a71a5-rt6-tip-rt SLUB+SLUB_DEBUG
T: 1 ( 5903) P:99 I:1500 C:  92330 Min:  1 Act:  2 Avg:  6 Max:  19
T: 2 ( 5904) P:99 I:2000 C:  69247 Min:  1 Act:  2 Avg:  6 Max:  21
T: 3 ( 5905) P:99 I:2500 C:  55395 Min:  1 Act:  3 Avg:  6 Max:  22
T: 4 ( 5906) P:99 I:3000 C:  46163 Min:  1 Act:  4 Avg:  7 Max:  22
T: 5 ( 5907) P:99 I:3500 C:  39568 Min:  1 Act:  3 Avg:  6 Max:  23
T: 6 ( 5909) P:99 I:4000 C:  34621 Min:  1 Act:  2 Avg:  7 Max:  22
T: 7 ( 5910) P:99 I:4500 C:  30774 Min:  1 Act:  3 Avg:  7 Max:  18

SLUB+SLUB_DEBUG+list_lock=raw_spinlock_t
T: 1 ( 4044) P:99 I:1500 C:  73340 Min:  1 Act:  3 Avg: 10 Max:  28
T: 2 ( 4045) P:99 I:2000 C:  55004 Min:  1 Act:  4 Avg: 10 Max:  33
T: 3 ( 4046) P:99 I:2500 C:  44002 Min:  1 Act:  2 Avg: 10 Max:  26
T: 4 ( 4047) P:99 I:3000 C:  36668 Min:  1 Act:  3 Avg: 10 Max:  24
T: 5 ( 4048) P:99 I:3500 C:  31429 Min:  1 Act:  3 Avg: 10 Max:  27
T: 6 ( 4049) P:99 I:4000 C:  27500 Min:  1 Act:  3 Avg: 11 Max:  30
T: 7 ( 4050) P:99 I:4500 C:  24444 Min:  1 Act:  4 Avg: 11 Max:  25

SLUB+SLUB_DEBUG+SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED
T: 1 ( 4036) P:99 I:1500 C:  74039 Min:  1 Act:  3 Avg:  9 Max:  31
T: 2 ( 4037) P:99 I:2000 C:  55528 Min:  1 Act:  3 Avg: 10 Max:  29
T: 3 ( 4038) P:99 I:2500 C:  44422 Min:  1 Act:  2 Avg: 10 Max:  31
T: 4 ( 4039) P:99 I:3000 C:  37017 Min:  1 Act:  2 Avg:  9 Max:  23
T: 5 ( 4040) P:99 I:3500 C:  31729 Min:  1 Act:  3 Avg: 10 Max:  29
T: 6 ( 4041) P:99 I:4000 C:  27762 Min:  1 Act:  2 Avg:  8 Max:  26
T: 7 ( 4042) P:99 I:4500 C:  24677 Min:  1 Act:  3 Avg:  9 Max:  27

conclusion: gee, pi both works and ain't free - ditto add more stuff=cycles :)
On 8/6/21 7:14 AM, Mike Galbraith wrote:
> On Thu, 2021-08-05 at 18:42 +0200, Sebastian Andrzej Siewior wrote:
>>
>> There was a throughput regression in RT compared to previous releases
>> (without this series). The regression was (based on my testing) only
>> visible in hackbench, and was addressed by adding adaptive spinning to
>> RT-mutex. With that we are almost back to what we had before :)
>
> Numbers on my box say a throughput regression remains (silly fork bomb
> scenario.. yawn), which can be recouped by either turning on all
> SL[AU]B features or converting the list_lock to a raw lock.

I'm surprised you can still do that raw lock in v3/v4, because there's
now a path where get_partial_node() takes the list_lock and can call
put_cpu_partial(), which takes the local_lock. But your results below
seem to indicate that this was without CONFIG_SLUB_CPU_PARTIAL, so that
would still work.

> They also
> seem to be saying that if you turned on PREEMPT_RT because you care
> about RT performance first and foremost (gee), you'll do neither of
> those, because either will eliminate an RT performance progression.

That was my assumption, that there would be some tradeoff and RT is
willing to sacrifice some throughput here... which should only be
visible if your benchmark is close to a slab microbenchmark, as
hackbench is.

Thanks again!
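To make the nesting concern concrete, here is a self-contained sketch
of why the local_lock cannot be taken inside a raw list_lock on
PREEMPT_RT. All identifiers (pcp_slab, the *_sketch helpers) are
hypothetical; this is not the mm/slub.c code, only the lock ordering it
would produce if list_lock were made a raw_spinlock_t with
CONFIG_SLUB_CPU_PARTIAL enabled:

	#include <linux/spinlock.h>
	#include <linux/local_lock.h>
	#include <linux/percpu.h>

	static DEFINE_RAW_SPINLOCK(list_lock);	/* hypothetical raw list_lock */

	struct pcp_slab {
		local_lock_t lock;
	};
	static DEFINE_PER_CPU(struct pcp_slab, pcp_slab) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};

	static void put_cpu_partial_sketch(void)
	{
		local_lock(&pcp_slab.lock);	/* sleeping lock on PREEMPT_RT */
		/* ... put slab on the per-CPU partial list ... */
		local_unlock(&pcp_slab.lock);
	}

	static void get_partial_node_sketch(void)
	{
		raw_spin_lock(&list_lock);	/* disables preemption, even on RT */
		/* ... detach slabs from the node partial list ... */
		put_cpu_partial_sketch();	/* BUG on RT: sleeping lock taken
						 * in a non-preemptible section */
		raw_spin_unlock(&list_lock);
	}

Without CONFIG_SLUB_CPU_PARTIAL the inner call does not exist, which is
why Mike's raw-lock experiment above still works.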
On 8/5/21 5:19 PM, Vlastimil Babka wrote:
> Series is based on 5.14-rc4 and also available as a git branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0

New branch with fixed up locking orders in patch 29/35:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1
On 8/10/21 4:36 PM, Vlastimil Babka wrote:
> On 8/5/21 5:19 PM, Vlastimil Babka wrote:
>> Series is based on 5.14-rc4 and also available as a git branch:
>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0
>
> New branch with fixed up locking orders in patch 29/35:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1

New branch with fixed up VM_BUG_ON in patch 13/35:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r2
On 8/15/21 12:18 PM, Vlastimil Babka wrote:
> On 8/10/21 4:36 PM, Vlastimil Babka wrote:
>> On 8/5/21 5:19 PM, Vlastimil Babka wrote:
>>> Series is based on 5.14-rc4 and also available as a git branch:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0
>>
>> New branch with fixed up locking orders in patch 29/35:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1
>
> New branch with fixed up VM_BUG_ON in patch 13/35:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r2

New branch with fixed struct kmem_cache_cpu layout in patch 35/35
(and a rebase to 5.14-rc6):

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r3
On 8/17/21 12:23 PM, Vlastimil Babka wrote:
> On 8/15/21 12:18 PM, Vlastimil Babka wrote:
>> On 8/10/21 4:36 PM, Vlastimil Babka wrote:
>>> On 8/5/21 5:19 PM, Vlastimil Babka wrote:
>>>> Series is based on 5.14-rc4 and also available as a git branch:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0
>>>
>>> New branch with fixed up locking orders in patch 29/35:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1
>>
>> New branch with fixed up VM_BUG_ON in patch 13/35:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r2
>
> New branch with fixed struct kmem_cache_cpu layout in patch 35/35
> (and a rebase to 5.14-rc6):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r3

Another update to patch 35/35, simplifying lockdep_assert_held() as
requested by RT:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r4
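For context on the lockdep_assert_held() mentioned above: it lets a
function document and enforce that its caller holds a given lock, so
lockdep complains loudly if a caller forgets it. A generic example of
the pattern (demo_lock and the demo_* helpers are hypothetical; this is
not the actual hunk from patch 35/35):

	#include <linux/lockdep.h>
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(demo_lock);	/* hypothetical lock */
	static int demo_state;

	/* Must only be called with demo_lock held. */
	static void demo_update_locked(int v)
	{
		lockdep_assert_held(&demo_lock);	/* enforce the precondition */
		demo_state = v;
	}

	static void demo_update(int v)
	{
		spin_lock(&demo_lock);
		demo_update_locked(v);
		spin_unlock(&demo_lock);
	}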