Message ID: 20210805152000.12817-1-vbabka@suse.cz
Series: SLUB: reduce irq disabled scope and make it RT compatible
On 2021-08-05 17:19:25 [+0200], Vlastimil Babka wrote:
> Hi Andrew,

Hi,

> I believe the series is ready for mmotm. No known bugs, Mel found no !RT perf
> regressions in v3 [9], Mike also (details below). RT guys validated it on RT
> config and already incorporated the series in the RT tree.

Correct, incl. the percpu-partial list fix.

…

> RT configs showed some throughput regressions, but that's expected tradeoff for
> the preemption improvements through the RT mutex. It didn't prevent the v2 to
> be incorporated to the 5.13 RT tree [7], leading to testing exposure and
> bugfixes.

There was a throughput regression in RT compared to previous releases
(without this series). The regression was (based on my testing) only
visible in hackbench, and was addressed by adding adaptive spinning to
RT-mutex. With that we are almost back to what we had before :)

…

> The remaining patches to upstream from the RT tree are small ones related to
> KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
> sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
> perhaps be re-evaluated as the series addresses some latency issues with it.

With your rework, CONFIG_SLUB_CPU_PARTIAL can be enabled on RT since
v5.14-rc3-rt1. So it has been re-evaluated :)

Regarding SLAB/SLOB: SLOB has a few design parts which are incompatible
with RT (if my memory serves, it was never even attempted to get it
working). SLAB was used before SLUB and required a lot of love. SLUB
performed better than SLAB (in both throughput and latency) and after a
while the SLAB patches were dropped.

Sebastian
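For readers who have not followed the RT tree, the adaptive-spinning
change mentioned above boils down to the loop below. This is a
simplified sketch, not a verbatim quote of the patch; the helper name
and the use of struct rt_mutex approximate the rtmutex internals. A
contended waiter keeps spinning only while the current lock owner is
running on a CPU (and is therefore likely to release the lock soon),
and falls back to blocking otherwise:

	/* Sketch of adaptive spinning for rtmutex waiters. */
	static bool adaptive_spinwait(struct rt_mutex *lock,
				      struct task_struct *owner)
	{
		bool res = true;

		rcu_read_lock();
		for (;;) {
			/* Owner changed (or lock released): retry the lock. */
			if (owner != rt_mutex_owner(lock))
				break;
			/*
			 * Ensure owner->on_cpu is dereferenced only after the
			 * ownership check above is known to still hold.
			 */
			barrier();
			/* Owner not on a CPU: spinning is pointless, block. */
			if (!owner->on_cpu) {
				res = false;
				break;
			}
			cpu_relax();
		}
		rcu_read_unlock();
		return res;
	}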
On Thu, 2021-08-05 at 18:42 +0200, Sebastian Andrzej Siewior wrote:
>
> There was a throughput regression in RT compared to previous releases
> (without this series). The regression was (based on my testing) only
> visible in hackbench, and was addressed by adding adaptive spinning to
> RT-mutex. With that we are almost back to what we had before :)

Numbers on my box say a throughput regression remains (silly fork bomb
scenario.. yawn), which can be recouped by either turning on all
SL[AU]B features or converting the list_lock to a raw lock. They also
seem to be saying that if you turned on PREEMPT_RT because you care
about RT performance first and foremost (gee), you'll do neither of
those, because either will eliminate an RT performance progression.

	-Mike

numbers...

box is old i4790 desktop
perf stat -r10 hackbench -s4096 -l500
full warmup, record, repeat twice for elapsed

SLUB+SLUB_DEBUG only

begin previously reported numbers

5.14.0.g79e92006-tip-rt (5.12-rt based as before, 5.13-rt didn't yet exist)
          7,984.52 msec task-clock            #    7.565 CPUs utilized            ( +-  0.66% )
           353,566      context-switches      #   44.281 K/sec                    ( +-  2.77% )
            37,685      cpu-migrations        #    4.720 K/sec                    ( +-  6.37% )
            12,939      page-faults           #    1.620 K/sec                    ( +-  0.67% )
    29,901,079,227      cycles                #    3.745 GHz                      ( +-  0.71% )
    14,550,797,818      instructions          #    0.49  insn per cycle           ( +-  0.47% )
     3,056,685,643      branches              #  382.826 M/sec                    ( +-  0.51% )
         9,598,083      branch-misses         #    0.31% of all branches          ( +-  2.11% )

           1.05542 +- 0.00409 seconds time elapsed  ( +-  0.39% )
           1.05990 +- 0.00244 seconds time elapsed  ( +-  0.23% )  (repeat)
           1.05367 +- 0.00303 seconds time elapsed  ( +-  0.29% )  (repeat)

5.14.0.g79e92006-tip-rt +slub-local-lock-v2r3
-0034-mm-slub-convert-kmem_cpu_slab-protection-to-local_lock.patch
          6,899.35 msec task-clock            #    5.637 CPUs utilized            ( +-  0.53% )
           420,304      context-switches      #   60.919 K/sec                    ( +-  2.83% )
           187,130      cpu-migrations        #   27.123 K/sec                    ( +-  1.81% )
            13,206      page-faults           #    1.914 K/sec                    ( +-  0.96% )
    25,110,362,933      cycles                #    3.640 GHz                      ( +-  0.49% )
    15,853,643,635      instructions          #    0.63  insn per cycle           ( +-  0.64% )
     3,366,261,524      branches              #  487.910 M/sec                    ( +-  0.70% )
        14,839,618      branch-misses         #    0.44% of all branches          ( +-  2.01% )

           1.22390 +- 0.00744 seconds time elapsed  ( +-  0.61% )
           1.21813 +- 0.00907 seconds time elapsed  ( +-  0.74% )  (repeat)
           1.22097 +- 0.00952 seconds time elapsed  ( +-  0.78% )  (repeat)

repeat of above with raw list_lock
          8,072.62 msec task-clock            #    7.605 CPUs utilized            ( +-  0.49% )
           359,514      context-switches      #   44.535 K/sec                    ( +-  4.95% )
            35,285      cpu-migrations        #    4.371 K/sec                    ( +-  5.82% )
            13,503      page-faults           #    1.673 K/sec                    ( +-  0.96% )
    30,247,989,681      cycles                #    3.747 GHz                      ( +-  0.52% )
    14,580,011,391      instructions          #    0.48  insn per cycle           ( +-  0.81% )
     3,063,743,405      branches              #  379.523 M/sec                    ( +-  0.85% )
         8,907,160      branch-misses         #    0.29% of all branches          ( +-  3.99% )

           1.06150 +- 0.00427 seconds time elapsed  ( +-  0.40% )
           1.05041 +- 0.00176 seconds time elapsed  ( +-  0.17% )  (repeat)
           1.06086 +- 0.00237 seconds time elapsed  ( +-  0.22% )  (repeat)

5.14.0.g79e92006-rt3-tip-rt +slub-local-lock-v2r3 full set
          7,598.44 msec task-clock            #    5.813 CPUs utilized            ( +-  0.85% )
           488,161      context-switches      #   64.245 K/sec                    ( +-  4.29% )
           196,866      cpu-migrations        #   25.909 K/sec                    ( +-  1.49% )
            13,042      page-faults           #    1.716 K/sec                    ( +-  0.73% )
    27,695,116,746      cycles                #    3.645 GHz                      ( +-  0.79% )
    18,423,934,168      instructions          #    0.67  insn per cycle           ( +-  0.88% )
     3,969,540,695      branches              #  522.415 M/sec                    ( +-  0.92% )
        15,493,482      branch-misses         #    0.39% of all branches          ( +-  2.15% )

           1.30709 +- 0.00890 seconds time elapsed  ( +-  0.68% )
            1.3205 +- 0.0134  seconds time elapsed  ( +-  1.02% )  (repeat)
            1.3083 +- 0.0132  seconds time elapsed  ( +-  1.01% )  (repeat)

end previously reported numbers

5.14.0.gf6a71a5-rt6-tip-rt (same config, full slub set.. obviously)
          7,707.63 msec task-clock            #    5.880 CPUs utilized            ( +-  1.46% )
           562,533      context-switches      #   72.984 K/sec                    ( +-  7.46% )
           208,475      cpu-migrations        #   27.048 K/sec                    ( +-  2.26% )
            13,022      page-faults           #    1.689 K/sec                    ( +-  0.80% )
    28,025,004,779      cycles                #    3.636 GHz                      ( +-  1.34% )
    18,487,135,489      instructions          #    0.66  insn per cycle           ( +-  1.58% )
     3,997,110,493      branches              #  518.591 M/sec                    ( +-  1.65% )
        16,078,322      branch-misses         #    0.40% of all branches          ( +-  4.23% )

            1.3108 +- 0.0135 seconds time elapsed  ( +-  1.03% )
            1.2997 +- 0.0138 seconds time elapsed  ( +-  1.06% )  (repeat)
            1.3009 +- 0.0166 seconds time elapsed  ( +-  1.28% )  (repeat)

5.14.0.gf6a71a5-rt6-tip-rt +list_lock=raw_spinlock_t
          8,252.59 msec task-clock            #    7.584 CPUs utilized            ( +-  0.27% )
           400,991      context-switches      #   48.590 K/sec                    ( +-  6.15% )
            35,979      cpu-migrations        #    4.360 K/sec                    ( +-  5.63% )
            13,261      page-faults           #    1.607 K/sec                    ( +-  0.73% )
    30,910,310,737      cycles                #    3.746 GHz                      ( +-  0.31% )
    16,522,383,240      instructions          #    0.53  insn per cycle           ( +-  0.92% )
     3,535,219,839      branches              #  428.377 M/sec                    ( +-  0.96% )
        10,115,967      branch-misses         #    0.29% of all branches          ( +-  4.32% )

           1.08817 +- 0.00238 seconds time elapsed  ( +-  0.22% )
           1.08583 +- 0.00243 seconds time elapsed  ( +-  0.22% )  (repeat)
           1.09003 +- 0.00164 seconds time elapsed  ( +-  0.15% )  (repeat)

5.14.0.g251a152-rt6-master-rt (+SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED)
          8,170.48 msec task-clock            #    7.390 CPUs utilized            ( +-  0.43% )
           449,994      context-switches      #   55.076 K/sec                    ( +-  4.20% )
            55,912      cpu-migrations        #    6.843 K/sec                    ( +-  4.28% )
            13,144      page-faults           #    1.609 K/sec                    ( +-  0.53% )
    30,484,114,812      cycles                #    3.731 GHz                      ( +-  0.44% )
    17,554,521,787      instructions          #    0.58  insn per cycle           ( +-  0.76% )
     3,751,725,852      branches              #  459.181 M/sec                    ( +-  0.81% )
        13,421,985      branch-misses         #    0.36% of all branches          ( +-  2.40% )

           1.10563 +- 0.00382 seconds time elapsed  ( +-  0.35% )
            1.1098 +- 0.0147  seconds time elapsed  ( +-  1.32% )  (repeat)
           1.11308 +- 0.00883 seconds time elapsed  ( +-  0.79% )  (repeat)

5.14.0.gf6a71a5-rt6-tip-rt +SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED
          8,026.39 msec task-clock            #    7.320 CPUs utilized            ( +-  0.70% )
           496,579      context-switches      #   61.868 K/sec                    ( +-  6.78% )
            65,022      cpu-migrations        #    8.101 K/sec                    ( +-  8.29% )
            13,161      page-faults           #    1.640 K/sec                    ( +-  0.51% )
    29,870,954,733      cycles                #    3.722 GHz                      ( +-  0.67% )
    17,617,522,235      instructions          #    0.59  insn per cycle           ( +-  1.36% )
     3,760,346,459      branches              #  468.498 M/sec                    ( +-  1.45% )
        12,863,520      branch-misses         #    0.34% of all branches          ( +-  4.45% )

            1.0965 +- 0.0103  seconds time elapsed  ( +-  0.94% )
           1.08149 +- 0.00362 seconds time elapsed  ( +-  0.33% )  (repeat)
           1.10027 +- 0.00916 seconds time elapsed  ( +-  0.83% )

yup, perf delta == config delta, let's have a peek at jitter

cyclictest -Smqp99& perf stat -r100 hackbench -s4096 -l500 && killall cyclictest

5.14.0.gf6a71a5-rt6-tip-rt SLUB+SLUB_DEBUG
T: 1 ( 5903) P:99 I:1500 C:  92330 Min:  1 Act:  2 Avg:  6 Max:  19
T: 2 ( 5904) P:99 I:2000 C:  69247 Min:  1 Act:  2 Avg:  6 Max:  21
T: 3 ( 5905) P:99 I:2500 C:  55395 Min:  1 Act:  3 Avg:  6 Max:  22
T: 4 ( 5906) P:99 I:3000 C:  46163 Min:  1 Act:  4 Avg:  7 Max:  22
T: 5 ( 5907) P:99 I:3500 C:  39568 Min:  1 Act:  3 Avg:  6 Max:  23
T: 6 ( 5909) P:99 I:4000 C:  34621 Min:  1 Act:  2 Avg:  7 Max:  22
T: 7 ( 5910) P:99 I:4500 C:  30774 Min:  1 Act:  3 Avg:  7 Max:  18

SLUB+SLUB_DEBUG+list_lock=raw_spinlock_t
T: 1 ( 4044) P:99 I:1500 C:  73340 Min:  1 Act:  3 Avg: 10 Max:  28
T: 2 ( 4045) P:99 I:2000 C:  55004 Min:  1 Act:  4 Avg: 10 Max:  33
T: 3 ( 4046) P:99 I:2500 C:  44002 Min:  1 Act:  2 Avg: 10 Max:  26
T: 4 ( 4047) P:99 I:3000 C:  36668 Min:  1 Act:  3 Avg: 10 Max:  24
T: 5 ( 4048) P:99 I:3500 C:  31429 Min:  1 Act:  3 Avg: 10 Max:  27
T: 6 ( 4049) P:99 I:4000 C:  27500 Min:  1 Act:  3 Avg: 11 Max:  30
T: 7 ( 4050) P:99 I:4500 C:  24444 Min:  1 Act:  4 Avg: 11 Max:  25

SLUB+SLUB_DEBUG+SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED
T: 1 ( 4036) P:99 I:1500 C:  74039 Min:  1 Act:  3 Avg:  9 Max:  31
T: 2 ( 4037) P:99 I:2000 C:  55528 Min:  1 Act:  3 Avg: 10 Max:  29
T: 3 ( 4038) P:99 I:2500 C:  44422 Min:  1 Act:  2 Avg: 10 Max:  31
T: 4 ( 4039) P:99 I:3000 C:  37017 Min:  1 Act:  2 Avg:  9 Max:  23
T: 5 ( 4040) P:99 I:3500 C:  31729 Min:  1 Act:  3 Avg: 10 Max:  29
T: 6 ( 4041) P:99 I:4000 C:  27762 Min:  1 Act:  2 Avg:  8 Max:  26
T: 7 ( 4042) P:99 I:4500 C:  24677 Min:  1 Act:  3 Avg:  9 Max:  27

conclusion: gee, pi both works and ain't free - ditto add more stuff=cycles :)
On 8/6/21 7:14 AM, Mike Galbraith wrote:
> On Thu, 2021-08-05 at 18:42 +0200, Sebastian Andrzej Siewior wrote:
>>
>> There was a throughput regression in RT compared to previous releases
>> (without this series). The regression was (based on my testing) only
>> visible in hackbench, and was addressed by adding adaptive spinning to
>> RT-mutex. With that we are almost back to what we had before :)
>
> Numbers on my box say a throughput regression remains (silly fork bomb
> scenario.. yawn), which can be recouped by either turning on all
> SL[AU]B features or converting the list_lock to a raw lock.

I'm surprised you can still do that raw lock in v3/v4, because there's
now a path where get_partial_node() takes the list_lock and can call
put_cpu_partial(), which takes the local_lock. But your results below
seem to indicate that this was without CONFIG_SLUB_CPU_PARTIAL, so that
would still work.

> They also
> seem to be saying that if you turned on PREEMPT_RT because you care
> about RT performance first and foremost (gee), you'll do neither of
> those, because either will eliminate an RT performance progression.

That was my assumption, that there would be some tradeoff and RT is
willing to sacrifice some throughput here... which should only be
visible if your benchmark is close to a slab microbenchmark, as
hackbench is.

Thanks again!
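To make the nesting concern concrete, here is a self-contained sketch
of why the local_lock cannot be taken inside a raw list_lock on
PREEMPT_RT. All identifiers (pcp_slab, the *_sketch helpers) are
hypothetical; this is not the mm/slub.c code, only the lock ordering it
would produce if list_lock were made a raw_spinlock_t with
CONFIG_SLUB_CPU_PARTIAL enabled:

	#include <linux/spinlock.h>
	#include <linux/local_lock.h>
	#include <linux/percpu.h>

	static DEFINE_RAW_SPINLOCK(list_lock);	/* hypothetical raw list_lock */

	struct pcp_slab {
		local_lock_t lock;
	};
	static DEFINE_PER_CPU(struct pcp_slab, pcp_slab) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};

	static void put_cpu_partial_sketch(void)
	{
		local_lock(&pcp_slab.lock);	/* sleeping lock on PREEMPT_RT */
		/* ... put slab on the per-CPU partial list ... */
		local_unlock(&pcp_slab.lock);
	}

	static void get_partial_node_sketch(void)
	{
		raw_spin_lock(&list_lock);	/* disables preemption, even on RT */
		/* ... detach slabs from the node partial list ... */
		put_cpu_partial_sketch();	/* BUG on RT: sleeping lock taken
						 * in a non-preemptible section */
		raw_spin_unlock(&list_lock);
	}

Without CONFIG_SLUB_CPU_PARTIAL the inner call does not exist, which is
why Mike's raw-lock experiment above still works.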
On 8/5/21 5:19 PM, Vlastimil Babka wrote:
> Series is based on 5.14-rc4 and also available as a git branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0

New branch with fixed up locking orders in patch 29/35:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1
On 8/10/21 4:36 PM, Vlastimil Babka wrote:
> On 8/5/21 5:19 PM, Vlastimil Babka wrote:
>> Series is based on 5.14-rc4 and also available as a git branch:
>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0
>
> New branch with fixed up locking orders in patch 29/35:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1

New branch with fixed up VM_BUG_ON in patch 13/35:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r2
On 8/15/21 12:18 PM, Vlastimil Babka wrote:
> On 8/10/21 4:36 PM, Vlastimil Babka wrote:
>> On 8/5/21 5:19 PM, Vlastimil Babka wrote:
>>> Series is based on 5.14-rc4 and also available as a git branch:
>>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0
>>
>> New branch with fixed up locking orders in patch 29/35:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1
>
> New branch with fixed up VM_BUG_ON in patch 13/35:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r2

New branch with fixed struct kmem_cache_cpu layout in patch 35/35
(and a rebase to 5.14-rc6):

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r3
On 8/17/21 12:23 PM, Vlastimil Babka wrote:
> On 8/15/21 12:18 PM, Vlastimil Babka wrote:
>> On 8/10/21 4:36 PM, Vlastimil Babka wrote:
>>> On 8/5/21 5:19 PM, Vlastimil Babka wrote:
>>>> Series is based on 5.14-rc4 and also available as a git branch:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r0
>>>
>>> New branch with fixed up locking orders in patch 29/35:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r1
>>
>> New branch with fixed up VM_BUG_ON in patch 13/35:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r2
>
> New branch with fixed struct kmem_cache_cpu layout in patch 35/35
> (and a rebase to 5.14-rc6):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r3

Another update to patch 35/35, simplifying lockdep_assert_held() as
requested by RT:

https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v4r4
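For context on the lockdep_assert_held() mentioned above: it lets a
function document and enforce that its caller holds a given lock, so
lockdep complains loudly if a caller forgets it. A generic example of
the pattern (demo_lock and the demo_* helpers are hypothetical; this is
not the actual hunk from patch 35/35):

	#include <linux/lockdep.h>
	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(demo_lock);	/* hypothetical lock */
	static int demo_state;

	/* Must only be called with demo_lock held. */
	static void demo_update_locked(int v)
	{
		lockdep_assert_held(&demo_lock);	/* enforce the precondition */
		demo_state = v;
	}

	static void demo_update(int v)
	{
		spin_lock(&demo_lock);
		demo_update_locked(v);
		spin_unlock(&demo_lock);
	}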