
[RFC,v2,00/34] SLUB: reduce irq disabled scope and make it RT compatible

Message ID: 20210609113903.1421-1-vbabka@suse.cz

Message

Vlastimil Babka June 9, 2021, 11:38 a.m. UTC
Changes since RFC v1 [1]:
* Addressed feedback from Christoph and Mel, added their acks.
* Finished RT conversion, including adopting 2 patches from the RT tree.
* The optional local_lock conversion has to sacrifice lockless fast paths on RT.
* Added some more cleanup patches to the front.

This series was initially inspired by Mel's pcplist local_lock rewrite, and
also by interest to better understand SLUB's locking and the new locking
primitives and their RT variants and implications. It should make SLUB more
preemption-friendly and fully RT compatible, hopefully without noticeable
regressions on !RT kernels, as the fast paths are not affected there.

Series is based on 5.13-rc5 and also available as a git branch:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r1

It received some light stability testing on !RT and no testing within an RT
kernel. The previous version also got basic performance screening (thanks, Mel)
that didn't show major regressions. This version shouldn't introduce further
regressions, but I'm still interested in e.g. Jesper's tests to confirm that
the bulk allocator and high-speed networking in general haven't regressed.

Before the series, SLUB is lockless in both allocation and free fast paths, but
elsewhere it disables irqs for considerable periods of time - especially in the
allocation slowpath and the bulk allocation, where irqs are re-enabled only
when a new page from the page allocator is needed and the context allows
blocking. The irq-disabled sections can then include deactivate_slab(), which
walks a full freelist and frees the slab back to the page allocator, or
unfreeze_partials(), which goes through a list of percpu partial slabs. The RT
tree currently has some patches mitigating these, but we can do much better in
mainline too.
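
For illustration, the pre-series slowpath pattern is roughly the following
(a heavily simplified sketch for orientation, not verbatim mm/slub.c code):

    local_irq_save(flags);                  /* irqs off for most of the slowpath */
    freelist = get_partial(s, ...);         /* may scan node partial lists */
    if (!freelist) {
            /* irqs re-enabled only here, and only if gfpflags allow blocking */
            local_irq_restore(flags);
            page = allocate_slab(s, gfpflags);      /* page allocator call */
            local_irq_save(flags);
    }
    /* deactivate_slab() / unfreeze_partials() run with irqs still off */
    deactivate_slab(s, page, freelist);
    local_irq_restore(flags);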

Patches 1-6 are straightforward improvements or cleanups that could exist
outside of this series too, but are prerequisites here.

Patches 7-10 are also preparatory code changes without functional changes, but
not so useful without the rest of the series.

Patch 11 simplifies the fast paths on systems with preemption, based on the
(hopefully correct) observation that the current loops to verify tid are
unnecessary.
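
To illustrate what gets removed: the current fast path pins a consistent
(cpu, tid) pair with a retry loop, roughly as below (paraphrased from
mm/slub.c); after this patch a plain snapshot is sufficient, since the final
this_cpu_cmpxchg_double() catches any cpu migration anyway:

    /* before: loop until tid and c are read on the same cpu */
    do {
            tid = this_cpu_read(s->cpu_slab->tid);
            c = raw_cpu_ptr(s->cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPTION) &&
             unlikely(tid != READ_ONCE(c->tid)));

    /* after: any (c, tid) snapshot works */
    c = raw_cpu_ptr(s->cpu_slab);
    tid = READ_ONCE(c->tid);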

Patches 12-21 focus on reducing irq disabled scope in the allocation slowpath:

Patch 12 moves disabling of irqs into ___slab_alloc() from its callers, which
are the allocation slowpath and bulk allocation. Instead, these callers only
disable preemption to stabilize the cpu. The following patches then gradually
reduce the scope of disabled irqs in ___slab_alloc() and the functions called
from there. As of patch 15, the re-enabling of irqs based on gfp flags before
calling the page allocator is removed from allocate_slab(). As of patch 18,
it's possible to reach the page allocator (when existing slabs are depleted of
free objects) without disabling and re-enabling irqs a single time on the way.
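
A rough sketch of the resulting structure (simplified; the individual patches
have the authoritative code):

    /* callers of ___slab_alloc() only pin the cpu, no irq disabling */
    c = get_cpu_ptr(s->cpu_slab);           /* disables preemption */
    p = ___slab_alloc(s, gfpflags, node, addr, c);
    put_cpu_ptr(s->cpu_slab);

    /* inside ___slab_alloc(), irqs are disabled only around the
     * manipulation of kmem_cache_cpu fields that needs it */
    local_irq_save(flags);
    /* ... update c->page / c->freelist ... */
    local_irq_restore(flags);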

Patches 22-27 similarly reduce the scope of disabled irqs in functions related
to unfreezing percpu partial slabs.

Patch 28 is preparatory. Patch 29 is adopted from the RT tree and converts the
flushing of percpu slabs on all cpus from using IPIs to a workqueue, so that the
processing isn't happening with irqs disabled in the IPI handler. The flushing
is not called from performance-critical contexts, so it should be acceptable.
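
The conversion follows the usual IPI-to-workqueue pattern, roughly as sketched
below; the per-cpu work item struct is an assumption here, the patch itself is
authoritative:

    struct slub_flush_work {
            struct work_struct work;
            struct kmem_cache *s;
    };
    static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);

    static void flush_cpu_slab(struct work_struct *w)
    {
            struct slub_flush_work *sfw =
                    container_of(w, struct slub_flush_work, work);
            /* flush this cpu's slab and percpu partial list, irqs enabled */
    }

    /* instead of on_each_cpu_cond() running an IPI handler: */
    for_each_online_cpu(cpu) {
            struct slub_flush_work *sfw = &per_cpu(slub_flush, cpu);
            INIT_WORK(&sfw->work, flush_cpu_slab);
            schedule_work_on(cpu, &sfw->work);
    }
    for_each_online_cpu(cpu)
            flush_work(&per_cpu(slub_flush, cpu).work);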

Patch 30 also comes from RT tree and makes object_map_lock RT compatible.

Patches 31-32 make slab_lock irq-safe on RT, where we cannot rely on irqs
already being disabled in the context of spin_lock_irqsave().
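
The shape of the change is roughly this (a hypothetical sketch of the helpers,
not the exact patch):

    static inline void slab_lock(struct page *page, unsigned long *flags)
    {
            if (IS_ENABLED(CONFIG_PREEMPT_RT))
                    local_irq_save(*flags); /* RT spinlocks don't disable irqs */
            bit_spin_lock(PG_locked, &page->flags);
    }

    static inline void slab_unlock(struct page *page, unsigned long *flags)
    {
            __bit_spin_unlock(PG_locked, &page->flags);
            if (IS_ENABLED(CONFIG_PREEMPT_RT))
                    local_irq_restore(*flags);
    }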

Patch 33 changes preempt disable to migrate disable, so that the nested
list_lock spinlock is safe to take on RT. Because migrate_disable() is a
function call even on !RT, a small set of private wrappers is introduced
to keep using the cheaper preempt_disable() on !PREEMPT_RT configurations.
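
The wrappers are essentially the following (reconstructed; the #ifndef
polarity is exactly what the -fix in reply #3 below corrects):

    #ifndef CONFIG_PREEMPT_RT
    #define slub_get_cpu_ptr(var)   get_cpu_ptr(var)    /* preempt_disable() based */
    #define slub_put_cpu_ptr(var)   put_cpu_ptr(var)
    #else
    /* on PREEMPT_RT, pin the cpu with migrate_disable() instead */
    #define slub_get_cpu_ptr(var)           \
    ({                                      \
            migrate_disable();              \
            this_cpu_ptr(var);              \
    })
    #define slub_put_cpu_ptr(var)           \
    do {                                    \
            (void)(var);                    \
            migrate_enable();               \
    } while (0)
    #endif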

As of this patch, SLUB should be compatible with RT's lock semantics, to the
best of my knowledge.

Finally, patch 34 replaces the irq-disabled sections that protect kmem_cache_cpu
fields in the slow paths with a local lock. However, on PREEMPT_RT this means the
lockless fast paths can now preempt slow paths which don't expect that, so the
local lock has to be taken also in the fast paths and they are no longer
lockless. It's up to RT folks to decide if this is a good tradeoff.
The patch also updates the locking documentation in the file's comment.
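
In sketch form (simplified; the 'lock' member name matches what the follow-up
patches in this thread use, the rest is paraphrased):

    struct kmem_cache_cpu {
            void **freelist;        /* protected by the local lock
                                       in the slow paths */
            unsigned long tid;
            struct page *page;
            local_lock_t lock;
            /* ... */
    };

    /* slow paths replace local_irq_save()/restore() with: */
    local_lock_irqsave(&s->cpu_slab->lock, flags);
    /* ... manipulate c->freelist / c->page ... */
    local_unlock_irqrestore(&s->cpu_slab->lock, flags);

On !PREEMPT_RT, local_lock_irqsave() maps to local_irq_save(), so the fast
paths are unaffected there; on PREEMPT_RT it is a per-cpu spinlock, which is
why the fast paths must take it too and lose their lockless nature.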

The main results of this series:

* irq disabling is only done for the minimum amount of time needed to protect
  the kmem_cache_cpu data, or as part of spin lock, local lock and bit spinlock
  operations to make them irq-safe

* SLUB should be fully PREEMPT_RT compatible

This should have obvious implications for better preemptibility, especially on RT.

Some details differ from how the current SLUB RT tree patches are
implemented:

  mm: sl[au]b: Change list_lock to raw_spinlock_t [2] - the SLAB part can be
  dropped as a different patch restricts RT to SLUB anyway. And after this series
  the list_lock in SLUB is never taken with irqs or preemption already disabled
  so it doesn't have to be converted to raw_spinlock_t.

  mm: slub: Move discard_slab() invocations out of IRQ-off sections [3] should be
  unnecessary as this series does move these invocations outside irq disabled
  sections in a different way.

The remaining patches to upstream from the RT tree are small ones related to
Kconfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
perhaps be re-evaluated as the series also addresses some latency issues with
percpu partial slabs.

[1] [RFC 00/26] SLUB: use local_lock for kmem_cache_cpu protection and reduce disabling irqs
    https://lore.kernel.org/lkml/20210524233946.20352-1-vbabka@suse.cz/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0001-mm-sl-au-b-Change-list_lock-to-raw_spinlock_t.patch?h=linux-5.12.y-rt-patches
[3] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0004-mm-slub-Move-discard_slab-invocations-out-of-IRQ-off.patch?h=linux-5.12.y-rt-patches
[4] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0005-mm-slub-Move-flush_cpu_slab-invocations-__free_slab-.patch?h=linux-5.12.y-rt-patches

Sebastian Andrzej Siewior (2):
  mm: slub: Move flush_cpu_slab() invocations __free_slab() invocations
    out of IRQ context
  mm: slub: Make object_map_lock a raw_spinlock_t

Vlastimil Babka (32):
  mm, slub: don't call flush_all() from list_locations()
  mm, slub: allocate private object map for sysfs listings
  mm, slub: allocate private object map for validate_slab_cache()
  mm, slub: don't disable irq for debug_check_no_locks_freed()
  mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()
  mm, slub: unify cmpxchg_double_slab() and __cmpxchg_double_slab()
  mm, slub: extract get_partial() from new_slab_objects()
  mm, slub: dissolve new_slab_objects() into ___slab_alloc()
  mm, slub: return slab page from get_partial() and set c->page
    afterwards
  mm, slub: restructure new page checks in ___slab_alloc()
  mm, slub: simplify kmem_cache_cpu and tid setup
  mm, slub: move disabling/enabling irqs to ___slab_alloc()
  mm, slub: do initial checks in ___slab_alloc() with irqs enabled
  mm, slub: move disabling irqs closer to get_partial() in
    ___slab_alloc()
  mm, slub: restore irqs around calling new_slab()
  mm, slub: validate slab from partial list or page allocator before
    making it cpu slab
  mm, slub: check new pages with restored irqs
  mm, slub: stop disabling irqs around get_partial()
  mm, slub: move reset of c->page and freelist out of deactivate_slab()
  mm, slub: make locking in deactivate_slab() irq-safe
  mm, slub: call deactivate_slab() without disabling irqs
  mm, slub: move irq control into unfreeze_partials()
  mm, slub: discard slabs in unfreeze_partials() without irqs disabled
  mm, slub: detach whole partial list at once in unfreeze_partials()
  mm, slub: detach percpu partial list in unfreeze_partials() using
    this_cpu_cmpxchg()
  mm, slub: only disable irq with spin_lock in __unfreeze_partials()
  mm, slub: don't disable irqs in slub_cpu_dead()
  mm, slab: make flush_slab() possible to call with irqs enabled
  mm, slub: optionally save/restore irqs in slab_[un]lock()
  mm, slub: make slab_lock() disable irqs with PREEMPT_RT
  mm, slub: use migrate_disable() on PREEMPT_RT
  mm, slub: convert kmem_cpu_slab protection to local_lock

 include/linux/slub_def.h |   2 +
 mm/slub.c                | 750 ++++++++++++++++++++++++++-------------
 2 files changed, 499 insertions(+), 253 deletions(-)

Comments

Mel Gorman June 14, 2021, 9:49 a.m. UTC | #1
On Wed, Jun 09, 2021 at 01:38:29PM +0200, Vlastimil Babka wrote:
> Changes since RFC v1 [1]:
> * Addressed feedback from Christoph and Mel, added their acks.
> * Finished RT conversion, including adopting 2 patches from the RT tree.
> * The optional local_lock conversion has to sacrifice lockless fast paths on RT.
> * Added some more cleanup patches to the front.
> 
> This series was initially inspired by Mel's pcplist local_lock rewrite, and
> also by interest to better understand SLUB's locking and the new locking
> primitives and their RT variants and implications. It should make SLUB more
> preemption-friendly and fully RT compatible, hopefully without noticeable
> regressions on !RT kernels, as the fast paths are not affected there.
> 
> Series is based on 5.13-rc5 and also available as a git branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r1
> 

This failed to boot for me inside KVM.

[    0.273576] SRBDS: Unknown: Dependent on hypervisor status
[    0.273576] MDS: Mitigation: Clear CPU buffers
[    0.273576] Freeing SMP alternatives memory: 36K
[    0.273576] bad: scheduling from the idle thread!
[    0.273576] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.0-rc5-unspecified+ #1
[    0.273576] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[    0.273576] Call Trace:
[    0.273576]  dump_stack+0x76/0x94
[    0.273576]  dequeue_task_idle+0x28/0x40
[    0.273576]  __do_set_cpus_allowed+0xe0/0x1a0
[    0.273576]  __schedule+0x7f7/0x8f0
[    0.273576]  __cond_resched+0x22/0x40
[    0.273576]  alloc_vmap_area+0x72/0x8b0
[    0.273576]  ? kmem_cache_alloc_node_trace+0x189/0x300
[    0.273576]  ? __get_vm_area_node+0x76/0x150
[    0.273576]  __get_vm_area_node+0xab/0x150
[    0.273576]  __vmalloc_node_range+0x6d/0x2c0
[    0.273576]  ? kernel_clone+0x9b/0x3e0
[    0.273576]  ? kmem_cache_alloc_node+0x18d/0x2f0
[    0.273576]  ? copy_process+0x218/0x1c40
[    0.273576]  copy_process+0x2c1/0x1c40
[    0.273576]  ? kernel_clone+0x9b/0x3e0
[    0.273576]  ? enqueue_task_fair+0xa1/0x590
[    0.273576]  kernel_clone+0x9b/0x3e0
[    0.273576]  ? acpi_hw_register_read+0x146/0x166
[    0.273576]  kernel_thread+0x55/0x70
[    0.273576]  ? kthread_is_per_cpu+0x30/0x30
[    0.273576]  rest_init+0x75/0xc0
[    0.273576]  start_kernel+0x7fb/0x822
[    0.273576]  secondary_startup_64_no_verify+0xc2/0xcb
[    0.273576] BUG: kernel NULL pointer dereference, address: 0000000000000000
[    0.273576] #PF: supervisor instruction fetch in kernel mode
[    0.273576] #PF: error_code(0x0010) - not-present page
[    0.273576] PGD 0 P4D 0
[    0.273576] Oops: 0010 [#1] SMP PTI
[    0.273576] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.0-rc5-unspecified+ #1
[    0.273576] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[    0.273576] RIP: 0010:0x0
[    0.273576] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[    0.273576] RSP: 0000:ffffffffbb803bb0 EFLAGS: 00010002
[    0.273576] RAX: 0000000000000000 RBX: ffff9fa64ea2ecc0 RCX: ffffffffb9e00101
[    0.273576] RDX: 000000000000000a RSI: ffffffffbb81a940 RDI: ffff9fa64ea2ecc0
[    0.273576] RBP: ffffffffbb803c08 R08: 0000000000000000 R09: ffffffffbbaa699c
[    0.273576] R10: 0000000000000000 R11: ffffffffbb803888 R12: ffffffffbb81a940
[    0.273576] R13: ffff9f9a80245f40 R14: ffffffffbb1f0060 R15: 0000000000003fff
[    0.273576] FS:  0000000000000000(0000) GS:ffff9fa64ea00000(0000) knlGS:0000000000000000
[    0.273576] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.273576] CR2: ffffffffffffffd6 CR3: 0000000a60e10001 CR4: 0000000000170ef0
[    0.273576] Call Trace:
[    0.273576]  __schedule+0x7f7/0x8f0
[    0.273576]  __cond_resched+0x22/0x40
[    0.273576]  alloc_vmap_area+0x72/0x8b0
[    0.273576]  ? kmem_cache_alloc_node_trace+0x189/0x300
[    0.273576]  ? __get_vm_area_node+0x76/0x150
[    0.273576]  __get_vm_area_node+0xab/0x150
[    0.273576]  __vmalloc_node_range+0x6d/0x2c0
[    0.273576]  ? kernel_clone+0x9b/0x3e0
[    0.273576]  ? kmem_cache_alloc_node+0x18d/0x2f0
[    0.273576]  ? copy_process+0x218/0x1c40
[    0.273576]  copy_process+0x2c1/0x1c40
[    0.273576]  ? kernel_clone+0x9b/0x3e0
[    0.273576]  ? enqueue_task_fair+0xa1/0x590
[    0.273576]  kernel_clone+0x9b/0x3e0
[    0.273576]  ? acpi_hw_register_read+0x146/0x166
[    0.273576]  kernel_thread+0x55/0x70
[    0.273576]  ? kthread_is_per_cpu+0x30/0x30
[    0.273576]  rest_init+0x75/0xc0
[    0.273576]  start_kernel+0x7fb/0x822
[    0.273576]  secondary_startup_64_no_verify+0xc2/0xcb
[    0.273576] Modules linked in:
[    0.273576] CR2: 0000000000000000
[    0.273576] ---[ end trace 7199d6fbb50b4cf7 ]---
[    0.273576] RIP: 0010:0x0
[    0.273576] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[    0.273576] RSP: 0000:ffffffffbb803bb0 EFLAGS: 00010002
[    0.273576] RAX: 0000000000000000 RBX: ffff9fa64ea2ecc0 RCX: ffffffffb9e00101
[    0.273576] RDX: 000000000000000a RSI: ffffffffbb81a940 RDI: ffff9fa64ea2ecc0
[    0.273576] RBP: ffffffffbb803c08 R08: 0000000000000000 R09: ffffffffbbaa699c
[    0.273576] R10: 0000000000000000 R11: ffffffffbb803888 R12: ffffffffbb81a940
[    0.273576] R13: ffff9f9a80245f40 R14: ffffffffbb1f0060 R15: 0000000000003fff
[    0.273576] FS:  0000000000000000(0000) GS:ffff9fa64ea00000(0000) knlGS:0000000000000000
[    0.273576] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.273576] CR2: ffffffffffffffd6 CR3: 0000000a60e10001 CR4: 0000000000170ef0
[    0.273576] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.273576] Rebooting in 90 seconds..
Vlastimil Babka June 14, 2021, 11:10 a.m. UTC | #2
On 6/9/21 1:38 PM, Vlastimil Babka wrote:
> Changes since RFC v1 [1]:
> * Addressed feedback from Christoph and Mel, added their acks.
> * Finished RT conversion, including adopting 2 patches from the RT tree.
> * The optional local_lock conversion has to sacrifice lockless fast paths on RT.
> * Added some more cleanup patches to the front.
> 
> This series was initially inspired by Mel's pcplist local_lock rewrite, and
> also by interest to better understand SLUB's locking and the new locking
> primitives and their RT variants and implications. It should make SLUB more
> preemption-friendly and fully RT compatible, hopefully without noticeable
> regressions on !RT kernels, as the fast paths are not affected there.
> 
> Series is based on 5.13-rc5 and also available as a git branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r1

Pushed a new revision branch:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r2

which fixes a bug in
[RFC v2 33/34] mm, slub: use migrate_disable() on PREEMPT_RT
that was reported by Mel.
Mel Gorman June 14, 2021, 11:31 a.m. UTC | #3
On Mon, Jun 14, 2021 at 10:49:02AM +0100, Mel Gorman wrote:
> On Wed, Jun 09, 2021 at 01:38:29PM +0200, Vlastimil Babka wrote:
> > Changes since RFC v1 [1]:
> > * Addressed feedback from Christoph and Mel, added their acks.
> > * Finished RT conversion, including adopting 2 patches from the RT tree.
> > * The optional local_lock conversion has to sacrifice lockless fast paths on RT.
> > * Added some more cleanup patches to the front.
> > 
> > This series was initially inspired by Mel's pcplist local_lock rewrite, and
> > also by interest to better understand SLUB's locking and the new locking
> > primitives and their RT variants and implications. It should make SLUB more
> > preemption-friendly and fully RT compatible, hopefully without noticeable
> > regressions on !RT kernels, as the fast paths are not affected there.
> > 
> > Series is based on 5.13-rc5 and also available as a git branch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v2r1
> > 
> 
> This failed to boot for me inside KVM.
> 

Bisection pointed to "mm, slub: use migrate_disable() on PREEMPT_RT".
Off-list, Vlastimil noted that the #ifdef was wrong and the patch below
booted. Wider battery of tests is queued but will take a few days to
complete.

---8<---
mm, slub: use migrate_disable() on PREEMPT_RT -fix

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index caa206213e72..f0359b0c8154 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -119,7 +119,7 @@
  * We could simply use migrate_disable()/enable() but as long as it's a
  * function call even on !PREEMPT_RT, use inline preempt_disable() there.
  */
-#ifdef CONFIG_PREEMPT_RT
+#ifndef CONFIG_PREEMPT_RT
 #define slub_get_cpu_ptr(var)	get_cpu_ptr(var)
 #define slub_put_cpu_ptr(var)	put_cpu_ptr(var)
 #else
Sebastian Andrzej Siewior July 2, 2021, 6:29 p.m. UTC | #4
I replaced my slub changes with slub-local-lock-v2r3.
I haven't seen any complaints from lockdep or so, which is good. Then I
did this with RT enabled (and no debug):

- A "time make -j32" run of allmodconfig on /dev/shm.
  Old:
| real    20m6,217s
| user    568m22,553s
| sys     48m33,126s

  New:
| real    20m9,049s
| user    569m32,096s
| sys     48m47,670s

  These 3 seconds here are probably in the noise range.

- perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
Old:
|         464.967,20 msec task-clock                #   27,220 CPUs utilized            ( +-  0,16% )
|          7.683.944      context-switches          #    0,017 M/sec                    ( +-  0,86% )
|            931.380      cpu-migrations            #    0,002 M/sec                    ( +-  4,94% )
|            219.569      page-faults               #    0,472 K/sec                    ( +-  0,39% )
|  1.104.727.599.918      cycles                    #    2,376 GHz                      ( +-  0,18% )
|    941.428.898.087      stalled-cycles-frontend   #   85,22% frontend cycles idle     ( +-  0,24% )
|    729.016.546.572      stalled-cycles-backend    #   65,99% backend cycles idle      ( +-  0,32% )
|    340.133.571.519      instructions              #    0,31  insn per cycle
|                                                   #    2,77  stalled cycles per insn  ( +-  0,12% )
|     73.746.821.314      branches                  #  158,607 M/sec                    ( +-  0,13% )
|        377.838.006      branch-misses             #    0,51% of all branches          ( +-  1,01% )
| 
|            17,0820 +- 0,0202 seconds time elapsed  ( +-  0,12% )

New:
|         422.865,71 msec task-clock                #    4,782 CPUs utilized            ( +-  0,34% )
|         14.594.238      context-switches          #    0,035 M/sec                    ( +-  0,43% )
|          3.737.926      cpu-migrations            #    0,009 M/sec                    ( +-  0,46% )
|            218.474      page-faults               #    0,517 K/sec                    ( +-  0,74% )
|    940.715.812.020      cycles                    #    2,225 GHz                      ( +-  0,34% )
|    716.593.827.820      stalled-cycles-frontend   #   76,18% frontend cycles idle     ( +-  0,39% )
|    550.730.862.839      stalled-cycles-backend    #   58,54% backend cycles idle      ( +-  0,43% )
|    417.274.588.907      instructions              #    0,44  insn per cycle
|                                                   #    1,72  stalled cycles per insn  ( +-  0,17% )
|     92.814.150.290      branches                  #  219,488 M/sec                    ( +-  0,17% )
|        822.102.170      branch-misses             #    0,89% of all branches          ( +-  0,41% )
| 
|             88,427 +- 0,618 seconds time elapsed  ( +-  0,70% )

So this is outside of the noise range.
I'm not sure where this is coming from. My guess would be higher lock
contention within the memory allocator.
  
> The remaining patches to upstream from the RT tree are small ones related to
> Kconfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
> sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
> perhaps be re-evaluated as the series also addresses some latency issues with
> percpu partial slabs.

With that series the PARTIAL slab can be indeed enabled. I have (had) a
half done series where I had PARTIAL enabled and noticed a slight
increase in latency so made it "default y on !RT". It wasn't dramatic
but appeared to be outside of noise.

Sebastian
Vlastimil Babka July 2, 2021, 8:25 p.m. UTC | #5
On 7/2/21 8:29 PM, Sebastian Andrzej Siewior wrote:
> I replaced my slub changes with slub-local-lock-v2r3.
> I haven't seen any complaints from lockdep or so, which is good. Then I
> did this with RT enabled (and no debug):

Thanks for testing!

> - A "time make -j32" run of allmodconfig on /dev/shm.
>   Old:
> | real    20m6,217s
> | user    568m22,553s
> | sys     48m33,126s
> 
>   New:
> | real    20m9,049s
> | user    569m32,096s
> | sys     48m47,670s
> 
>   These 3 seconds here are probably in the noise range.
> 
> - perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
> Old:
> |         464.967,20 msec task-clock                #   27,220 CPUs utilized            ( +-  0,16% )
> |          7.683.944      context-switches          #    0,017 M/sec                    ( +-  0,86% )
> |            931.380      cpu-migrations            #    0,002 M/sec                    ( +-  4,94% )
> |            219.569      page-faults               #    0,472 K/sec                    ( +-  0,39% )
> |  1.104.727.599.918      cycles                    #    2,376 GHz                      ( +-  0,18% )
> |    941.428.898.087      stalled-cycles-frontend   #   85,22% frontend cycles idle     ( +-  0,24% )
> |    729.016.546.572      stalled-cycles-backend    #   65,99% backend cycles idle      ( +-  0,32% )
> |    340.133.571.519      instructions              #    0,31  insn per cycle
> |                                                   #    2,77  stalled cycles per insn  ( +-  0,12% )
> |     73.746.821.314      branches                  #  158,607 M/sec                    ( +-  0,13% )
> |        377.838.006      branch-misses             #    0,51% of all branches          ( +-  1,01% )
> | 
> |            17,0820 +- 0,0202 seconds time elapsed  ( +-  0,12% )
> 
> New:
> |         422.865,71 msec task-clock                #    4,782 CPUs utilized            ( +-  0,34% )
> |         14.594.238      context-switches          #    0,035 M/sec                    ( +-  0,43% )
> |          3.737.926      cpu-migrations            #    0,009 M/sec                    ( +-  0,46% )
> |            218.474      page-faults               #    0,517 K/sec                    ( +-  0,74% )
> |    940.715.812.020      cycles                    #    2,225 GHz                      ( +-  0,34% )
> |    716.593.827.820      stalled-cycles-frontend   #   76,18% frontend cycles idle     ( +-  0,39% )
> |    550.730.862.839      stalled-cycles-backend    #   58,54% backend cycles idle      ( +-  0,43% )
> |    417.274.588.907      instructions              #    0,44  insn per cycle
> |                                                   #    1,72  stalled cycles per insn  ( +-  0,17% )
> |     92.814.150.290      branches                  #  219,488 M/sec                    ( +-  0,17% )
> |        822.102.170      branch-misses             #    0,89% of all branches          ( +-  0,41% )
> | 
> |             88,427 +- 0,618 seconds time elapsed  ( +-  0,70% )
> 
> So this is outside of the noise range.
> I'm not sure where this is coming from. My guess would be higher lock
> contention within the memory allocator.

The series shouldn't significantly change the memory allocator
interaction, though.
Seems there are fewer cycles, but more time elapsed, thus more sleeping -
is it locks becoming mutexes on RT?

My first guess - the last, local_lock patch. What would happen if you
take that one out? Should be still RT-compatible. If it improves a lot,
maybe that conversion to local_lock is not worth it then.

My second guess - list_lock remains spinlock with my series, thus RT
mutex, but the current RT tree converts it to raw_spinlock. I'd hope
leaving that one as non-raw spinlock would still be much better for RT
goals, even if hackbench (which is AFAIK very slab intensive) throughput
regresses - hopefully not that much.

>> The remaining patches to upstream from the RT tree are small ones related to
>> Kconfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes
>> sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could
>> perhaps be re-evaluated as the series also addresses some latency issues with
>> percpu partial slabs.
> 
> With that series the PARTIAL slab can be indeed enabled. I have (had) a
> half done series where I had PARTIAL enabled and noticed a slight
> increase in latency so made it "default y on !RT". It wasn't dramatic
> but appeared to be outside of noise.
> 
> Sebastian
>
Mike Galbraith July 3, 2021, 7:24 a.m. UTC | #6
On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
> I replaced my slub changes with slub-local-lock-v2r3.
> I haven't seen any complaints from lockdep or so, which is good. Then I
> did this with RT enabled (and no debug):

Below is some raw hackbench data from my little i4790 desktop box.  It
says we'll definitely still want list_lock to be raw.

It also appears to be saying that there's something RT specific to
stare at in addition to the list_lock business, but add a pinch of salt
to that due to the config of the virgin(ish) tip tree being much
lighter than the enterprise(ish) config of the tip-rt tree.

perf stat -r10 hackbench -s4096 -l500
full warmup, record, repeat twice for elapsed

5.13.0.g60ab3ed-tip-rt
          8,898.51 msec task-clock                #    7.525 CPUs utilized            ( +-  0.33% )
           368,922      context-switches          #    0.041 M/sec                    ( +-  5.20% )
            42,281      cpu-migrations            #    0.005 M/sec                    ( +-  5.28% )
            13,180      page-faults               #    0.001 M/sec                    ( +-  0.70% )
    33,343,378,867      cycles                    #    3.747 GHz                      ( +-  0.30% )
    21,656,783,887      instructions              #    0.65  insn per cycle           ( +-  0.67% )
     4,408,569,663      branches                  #  495.428 M/sec                    ( +-  0.73% )
        12,040,125      branch-misses             #    0.27% of all branches          ( +-  2.93% )

           1.18260 +- 0.00473 seconds time elapsed  ( +-  0.40% )
           1.19018 +- 0.00441 seconds time elapsed  ( +-  0.37% ) (repeat)
           1.18260 +- 0.00473 seconds time elapsed  ( +-  0.40% ) (repeat)

5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3 list_lock=raw_spinlock_t
          9,642.00 msec task-clock                #    7.521 CPUs utilized            ( +-  0.46% )
           462,091      context-switches          #    0.048 M/sec                    ( +-  4.79% )
            44,411      cpu-migrations            #    0.005 M/sec                    ( +-  4.34% )
            12,980      page-faults               #    0.001 M/sec                    ( +-  0.43% )
    36,098,859,429      cycles                    #    3.744 GHz                      ( +-  0.44% )
    25,462,853,462      instructions              #    0.71  insn per cycle           ( +-  0.50% )
     5,260,898,360      branches                  #  545.623 M/sec                    ( +-  0.52% )
        16,088,686      branch-misses             #    0.31% of all branches          ( +-  2.02% )

           1.28207 +- 0.00568 seconds time elapsed  ( +-  0.44% )
           1.28744 +- 0.00713 seconds time elapsed  ( +-  0.55% ) (repeat)
           1.28085 +- 0.00850 seconds time elapsed  ( +-  0.66% ) (repeat)

5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3 list_lock=spinlock_t
         10,004.89 msec task-clock                #    6.029 CPUs utilized            ( +-  1.37% )
           654,311      context-switches          #    0.065 M/sec                    ( +-  5.16% )
           211,070      cpu-migrations            #    0.021 M/sec                    ( +-  1.38% )
            13,262      page-faults               #    0.001 M/sec                    ( +-  0.79% )
    36,585,914,931      cycles                    #    3.657 GHz                      ( +-  1.35% )
    27,682,240,511      instructions              #    0.76  insn per cycle           ( +-  1.06% )
     5,766,064,432      branches                  #  576.325 M/sec                    ( +-  1.11% )
        24,269,069      branch-misses             #    0.42% of all branches          ( +-  2.03% )

            1.6595 +- 0.0116 seconds time elapsed  ( +-  0.70% )
            1.6270 +- 0.0180 seconds time elapsed  ( +-  1.11% ) (repeat)
            1.6213 +- 0.0150 seconds time elapsed  ( +-  0.93% ) (repeat)

virgin(ish) tip
5.13.0.g60ab3ed-tip
          7,320.67 msec task-clock                #    7.792 CPUs utilized            ( +-  0.31% )
           221,215      context-switches          #    0.030 M/sec                    ( +-  3.97% )
            16,234      cpu-migrations            #    0.002 M/sec                    ( +-  4.07% )
            13,233      page-faults               #    0.002 M/sec                    ( +-  0.91% )
    27,592,205,252      cycles                    #    3.769 GHz                      ( +-  0.32% )
     8,309,495,040      instructions              #    0.30  insn per cycle           ( +-  0.37% )
     1,555,210,607      branches                  #  212.441 M/sec                    ( +-  0.42% )
         5,484,209      branch-misses             #    0.35% of all branches          ( +-  2.13% )

           0.93949 +- 0.00423 seconds time elapsed  ( +-  0.45% )
           0.94608 +- 0.00384 seconds time elapsed  ( +-  0.41% ) (repeat)
           0.94422 +- 0.00410 seconds time elapsed  ( +-  0.43% )

5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
          7,343.57 msec task-clock                #    7.776 CPUs utilized            ( +-  0.44% )
           223,044      context-switches          #    0.030 M/sec                    ( +-  3.02% )
            16,057      cpu-migrations            #    0.002 M/sec                    ( +-  4.03% )
            13,164      page-faults               #    0.002 M/sec                    ( +-  0.97% )
    27,684,906,017      cycles                    #    3.770 GHz                      ( +-  0.45% )
     8,323,273,871      instructions              #    0.30  insn per cycle           ( +-  0.28% )
     1,556,106,680      branches                  #  211.901 M/sec                    ( +-  0.31% )
         5,463,468      branch-misses             #    0.35% of all branches          ( +-  1.33% )

           0.94440 +- 0.00352 seconds time elapsed  ( +-  0.37% )
           0.94830 +- 0.00228 seconds time elapsed  ( +-  0.24% ) (repeat)
           0.93813 +- 0.00440 seconds time elapsed  ( +-  0.47% ) (repeat)
Mike Galbraith July 3, 2021, 3:47 p.m. UTC | #7
On Sat, 2021-07-03 at 09:24 +0200, Mike Galbraith wrote:
>
> It also appears to be saying that there's something RT specific to
> stare at in addition to the list_lock business.

The what is ___slab_alloc() consuming 3.9% CPU in tip-rt-slub whereas
it consumes < 1% in both tip-rt (sans slub patches) and tip-slub.

The why remains to ponder.

            5.13.0.g60ab3ed-tip-rt           5.13.0.g60ab3ed-tip-rt-slub      5.13.0.g60ab3ed-tip-slub
    25.18%  copy_user_enhanced_fast_string   copy_user_enhanced_fast_string   copy_user_enhanced_fast_string
     5.08%  unix_stream_read_generic         unix_stream_read_generic         unix_stream_read_generic
     3.39%  rt_spin_lock                 *** ___slab_alloc ***                __skb_datagram_iter
     2.80%  __skb_datagram_iter              rt_spin_lock                     _raw_spin_lock
     2.11%  get_page_from_freelist           __skb_datagram_iter              __alloc_skb
     2.01%  skb_release_data                 rt_spin_unlock                   skb_release_data
     1.94%  rt_spin_unlock                   get_page_from_freelist           __alloc_pages
     1.85%  __alloc_skb                      migrate_enable                   unix_stream_sendmsg
     1.68%  __schedule                       skb_release_data                 _raw_spin_lock_irqsave
     1.67%  unix_stream_sendmsg              __schedule                       free_pcppages_bulk
     1.50%  free_pcppages_bulk               unix_stream_sendmsg              __slab_free
     1.38%  migrate_enable                   free_pcppages_bulk               __fget_light
     1.24%  __fget_light                     __alloc_pages                    vfs_write
     1.16%  __slab_free                      migrate_disable                  __schedule
     1.14%  __alloc_pages                    __fget_light                     get_page_from_freelist
     1.10%  fsnotify                         __slab_free                      new_sync_write
     1.07%  kfree                            fsnotify                         fsnotify

5.13.0.g60ab3ed-tip-rt-slub ___slab_alloc() consumes 3.90%
  0.40 │       mov    0x28(%r13),%edx
  0.42 │       add    %r15,%rdx
       │     __swab():
       │     #endif
       │
       │     static __always_inline unsigned long __swab(const unsigned long y)
       │     {
       │     #if __BITS_PER_LONG == 64
       │     return __swab64(y);
  0.05 │       mov    %rdx,%rax
  1.14 │       bswap  %rax
       │     freelist_ptr():
       │     return (void *)((unsigned long)ptr ^ s->random ^  <== CONFIG_SLAB_FREELIST_HARDENED
  0.72 │       xor    0xb0(%r13),%rax
 65.41 │       xor    (%rdx),%rax                              <== huh? miss = 65% of that 3.9% kernel util?
       │     next_tid():
       │     return tid + TID_STEP;
  0.09 │       addq   $0x200,0x48(%r12)
       │     ___slab_alloc():
       │     * freelist is pointing to the list of objects to be used.
       │     * page is pointing to the page from which the objects are obtained.
       │     * That page must be frozen for per cpu allocations to work.
       │     */
       │     VM_BUG_ON(!c->page->frozen);
       │     c->freelist = get_freepointer(s, freelist);
  0.05 │       mov    %rax,0x40(%r12)
       │     c->tid = next_tid(c->tid);
       │     local_unlock_irqrestore(&s->cpu_slab->lock, flags);

5.13.0.g60ab3ed-tip-rt ___slab_alloc() consumes < 1%
Percent│     }
       │
       │     /* must check again c->freelist in case of cpu migration or IRQ */
       │     freelist = c->freelist;
  0.02 │ a1:   mov    (%r14),%r13
       │     if (freelist)
       │       test   %r13,%r13
  0.02 │     ↓ je     460
       │     get_freepointer():
       │     return freelist_dereference(s, object + s->offset);
  0.23 │ ad:   mov    0x28(%r12),%edx
  0.18 │       add    %r13,%rdx
       │     __swab():
       │     #endif
       │
       │     static __always_inline unsigned long __swab(const unsigned long y)
       │     {
       │     #if __BITS_PER_LONG == 64
       │     return __swab64(y);
  0.06 │       mov    %rdx,%rax
  1.16 │       bswap  %rax
       │     freelist_ptr():
       │     return (void *)((unsigned long)ptr ^ s->random ^
  0.23 │       xor    0xb0(%r12),%rax
 35.25 │       xor    (%rdx),%rax              <== 35% of < 1% kernel util
       │     next_tid():
       │     return tid + TID_STEP;
  0.28 │       addq   $0x200,0x8(%r14)
       │     ___slab_alloc():
       │     * freelist is pointing to the list of objects to be used.
       │     * page is pointing to the page from which the objects are obtained.
       │     * That page must be frozen for per cpu allocations to work.
       │     */
       │     VM_BUG_ON(!c->page->frozen);
       │     c->freelist = get_freepointer(s, freelist);

5.13.0.g60ab3ed-tip-slub ___slab_alloc() also consumes < 1%
Percent│     load_freelist:
       │
       │     lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
  0.28 │ 84:   add    this_cpu_off,%rax
       │     get_freepointer():
       │     return freelist_dereference(s, object + s->offset);
  0.14 │       mov    0x28(%r14),%eax
       │     ___slab_alloc():
       │     * freelist is pointing to the list of objects to be used.
       │     * page is pointing to the page from which the objects are obtained.
       │     * That page must be frozen for per cpu allocations to work.
       │     */
       │     VM_BUG_ON(!c->page->frozen);
       │     c->freelist = get_freepointer(s, freelist);
 34.36 │       mov    0x0(%r13,%rax,1),%rax
       │     next_tid():
       │     return tid + TID_STEP;
  0.10 │       addq   $0x1,0x8(%r12)
       │     ___slab_alloc():
       │     c->freelist = get_freepointer(s, freelist);
  0.04 │       mov    %rax,(%r12)
       │     c->tid = next_tid(c->tid);
       │     local_unlock_irqrestore(&s->cpu_slab->lock, flags);
  0.12 │       mov    (%r14),%rax
  0.03 │       add    this_cpu_off,%rax
       │     arch_local_irq_restore():
       │     return arch_irqs_disabled_flags(flags);
       │     }
Mike Galbraith July 4, 2021, 5:37 a.m. UTC | #8
On Sat, 2021-07-03 at 17:47 +0200, Mike Galbraith wrote:
> On Sat, 2021-07-03 at 09:24 +0200, Mike Galbraith wrote:
> > It also appears to be saying that there's something RT specific to
> > stare at in addition to the list_lock business.
>
> The what is ___slab_alloc() consuming 3.9% CPU in tip-rt-slub whereas
> it consumes < 1% in both tip-rt (sans slub patches) and tip-slub.
>
> The why remains to ponder.

Ignoring odd distribution in the profile (red herring whackable via
prefetch games), and removing perf/debug options from the picture, the
remaining RT specific cost is just that of forced slow path.  For
hackbench in my little box, it adds up to right at 3%.

	-Mike
Mike Galbraith July 5, 2021, 4 p.m. UTC | #9
On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
>  
> > The remaining patches to upstream from the RT tree are small ones
> > related to Kconfig. The patch that restricts PREEMPT_RT to SLUB
> > (not SLAB or SLOB) makes sense. The patch that disables
> > CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could perhaps be re-
> > evaluated as the series also addresses some latency issues with
> > percpu partial slabs.
> 
> With that series the PARTIAL slab can be indeed enabled. I have (had)
> a half done series where I had PARTIAL enabled and noticed a slight
> increase in latency so made it "default y on !RT". It wasn't dramatic
> but appeared to be outside of noise.

I'm seeing warnings/explosions while exercising box IFF PARTIAL slab
thingy is enabled.  I aborted -PARTIAL after a little over 4 hours,
whereas the longest survival of 4 +PARTIAL runs was 50 minutes, so I'm
fairly confident that PARTIAL really really is the trigger.

My reproducer is a keep box busy loop of parallel kbuild along with
alternating ltp::controllers (sans swap-from-hell bits), ltp::zram and
hackbench.

The warning may or may not trigger before box explodes.
   
[  224.232200] ------------[ cut here ]------------
[  224.232202] WARNING: CPU: 3 PID: 9112 at mm/slub.c:2018 ___slab_alloc.constprop.100+0xb09/0x13a0
[  224.232210] Modules linked in: sr_mod(E) cdrom(E) zram(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) xfs(E) libcrc32c(E) af_packet(E) ip6table_mangle(E) ip6table_raw(E) iptable_raw(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nfnetlink(E) ebtable_filter(E) rfkill(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) bpfilter(E) joydev(E) usblp(E) intel_rapl_msr(E) intel_rapl_common(E) nls_iso8859_1(E) nls_cp437(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) snd_hda_intel(E) aesni_intel(E) snd_intel_dspcfg(E) crypto_simd(E) iTCO_wdt(E) at24(E) intel_pmc_bxt(E) regmap_i2c(E) mei_hdcp(E) iTCO_vendor_support(E) cryptd(E) snd_hda_codec(E) snd_hwdep(E) pcspkr(E) i2c_i801(E) snd_hda_core(E) i2c_smbus(E) r8169(E) snd_pcm(E) realtek(E) snd_timer(E)
[  224.232253]  mei_me(E) mdio_devres(E) lpc_ich(E) snd(E) libphy(E) mei(E) soundcore(E) mfd_core(E) fan(E) thermal(E) nfsd(E) intel_smartconnect(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sch_fq_codel(E) sunrpc(E) fuse(E) configfs(E) hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) nouveau(E) wmi(E) drm_ttm_helper(E) ttm(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) cec(E) ahci(E) rc_core(E) xhci_pci(E) ehci_pci(E) libahci(E) xhci_hcd(E) ehci_hcd(E) drm(E) libata(E) usbcore(E) video(E) button(E) sd_mod(E) t10_pi(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) virtio_ring(E) virtio(E) ext4(E) crc32c_intel(E) crc16(E) mbcache(E) jbd2(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) msr(E) autofs4(E)
[  224.232295] CPU: 3 PID: 9112 Comm: dd Kdump: loaded Tainted: G            E     5.13.0.g60ab3ed-tip-rt-slub #25
[  224.232297] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[  224.232299] RIP: 0010:___slab_alloc.constprop.100+0xb09/0x13a0
[  224.232302] Code: 48 8b 7d 88 4c 89 ee 8b 47 28 4c 01 f8 48 89 c2 48 0f ca 48 33 97 b8 00 00 00 48 33 10 e8 cf e5 ff ff e9 b5 00 00 00 4d 89 ce <0f> 0b e9 2d fa ff ff 8b 53 38 8b 05 8f 47 4a 01 48 8b 75 80 83 c2
[  224.232303] RSP: 0018:ffff8882047e3ba0 EFLAGS: 00010046
[  224.232305] RAX: ffff888100040f80 RBX: 0000000000000000 RCX: 0000000080190019
[  224.232307] RDX: ffffea0005d59a08 RSI: 0000000000000000 RDI: ffffffff8177ab54
[  224.232308] RBP: ffff8882047e3c70 R08: ffffffffffffffff R09: 0000000000000000
[  224.232309] R10: ffff8882047e3c90 R11: 0000000000000000 R12: ffffea0004d60780
[  224.232310] R13: 0000000000000019 R14: 0000000000000000 R15: 0000000080190019
[  224.232311] FS:  00007f88ffa09580(0000) GS:ffff88840ecc0000(0000) knlGS:0000000000000000
[  224.232312] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  224.232313] CR2: 00007f88ffa4a000 CR3: 000000015606a005 CR4: 00000000001706e0
[  224.232315] Call Trace:
[  224.232318]  ? __alloc_file+0x26/0xf0
[  224.232322]  ? prep_new_page+0xab/0x100
[  224.232326]  ? __alloc_file+0x26/0xf0
[  224.232329]  ? __slab_alloc.isra.87.constprop.99+0x2e/0x40
[  224.232331]  __slab_alloc.isra.87.constprop.99+0x2e/0x40
[  224.232333]  ? __alloc_file+0x26/0xf0
[  224.232335]  kmem_cache_alloc+0x4d/0x160
[  224.232338]  __alloc_file+0x26/0xf0
[  224.232340]  alloc_empty_file+0x43/0xe0
[  224.232343]  path_openat+0x35/0xe30
[  224.232345]  ? getname_flags+0x32/0x170
[  224.232347]  ? ___slab_alloc.constprop.100+0xbbb/0x13a0
[  224.232349]  ? filemap_map_pages+0xf0/0x3e0
[  224.232352]  do_filp_open+0xa2/0x100
[  224.232355]  ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[  224.232357]  ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[  224.232359]  ? getname_flags+0x32/0x170
[  224.232361]  ? alloc_fd+0xe2/0x1b0
[  224.232363]  ? do_sys_openat2+0x248/0x310
[  224.232365]  do_sys_openat2+0x248/0x310
[  224.232368]  do_sys_open+0x47/0x60
[  224.232370]  do_syscall_64+0x37/0x80
[  224.232373]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  224.232376] RIP: 0033:0x7f88ff54ab41
[  224.232378] Code: 41 83 e2 40 48 89 54 24 30 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 4c 24 18 64 48 33 0c 25 28 00 00 00
[  224.232379] RSP: 002b:00007ffc0916a720 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
[  224.232381] RAX: ffffffffffffffda RBX: 0000558d793efcf0 RCX: 00007f88ff54ab41
[  224.232382] RDX: 0000000000080000 RSI: 0000558d793efcb0 RDI: 00000000ffffff9c
[  224.232383] RBP: 00007ffc0916a860 R08: 0000000000000000 R09: 00007ffc0916a873
[  224.232384] R10: 0000000000000000 R11: 0000000000000287 R12: 0000558d793efa88
[  224.232385] R13: 000000000000000b R14: 0000558d793efa60 R15: 000000000000000b
[  224.232387] ---[ end trace 0000000000000002 ]---
...
[ 2010.420941] general protection fault, maybe for address 0xffffea00054ca398: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 2010.420946] CPU: 4 PID: 17541 Comm: dd Kdump: loaded Tainted: G        W   E     5.13.0.g60ab3ed-tip-rt-slub #25
[ 2010.420948] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 2010.420949] RIP: 0010:___slab_alloc.constprop.100+0x52b/0x13a0
[ 2010.420955] Code: b7 44 24 2a 31 db 66 25 ff 7f 66 89 45 b8 48 8b 45 88 80 4d bb 80 48 8b 4d b8 f6 40 0b 40 0f 84 5b 02 00 00 48 89 f0 4c 89 fa <f0> 49 0f c7 4c 24 20 0f 84 8f 01 00 00 4d 89 ce f3 90 48 8b b5 30
[ 2010.420957] RSP: 0018:ffff888102cf7ba0 EFLAGS: 00010002
[ 2010.420959] RAX: ffff888100047e00 RBX: 0000000000000000 RCX: 0000000080000000
[ 2010.420961] RDX: 0000000000000000 RSI: ffff888100047e00 RDI: ffff888100040f80
[ 2010.420962] RBP: ffff888102cf7c70 R08: ffffffffffffffff R09: 0000000000000000
[ 2010.420963] R10: ffff888102cf7c90 R11: 0000000000000000 R12: ffffea00054ca378
[ 2010.420964] R13: 0000000000000000 R14: 80000000000101f8 R15: 0000000000000000
[ 2010.420966] FS:  00007f9e8edbb580(0000) GS:ffff88840ed00000(0000) knlGS:0000000000000000
[ 2010.420967] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2010.420968] CR2: 00007f9e8edfa000 CR3: 0000000102ede004 CR4: 00000000001706e0
[ 2010.420970] Call Trace:
[ 2010.420973]  ? __alloc_file+0x26/0xf0
[ 2010.420977]  ? prep_new_page+0xab/0x100
[ 2010.420982]  ? vm_area_alloc+0x1a/0x60
[ 2010.420984]  ? __alloc_file+0x26/0xf0
[ 2010.420987]  ? __slab_alloc.isra.87.constprop.99+0x2e/0x40
[ 2010.420989]  __slab_alloc.isra.87.constprop.99+0x2e/0x40
[ 2010.420992]  ? __alloc_file+0x26/0xf0
[ 2010.420994]  kmem_cache_alloc+0x4d/0x160
[ 2010.420996]  __alloc_file+0x26/0xf0
[ 2010.420999]  alloc_empty_file+0x43/0xe0
[ 2010.421002]  path_openat+0x35/0xe30
[ 2010.421004]  ? getname_flags+0x32/0x170
[ 2010.421006]  ? ___slab_alloc.constprop.100+0xbbb/0x13a0
[ 2010.421008]  ? filemap_map_pages+0xf0/0x3e0
[ 2010.421012]  do_filp_open+0xa2/0x100
[ 2010.421015]  ? __handle_mm_fault+0x8b8/0xa40
[ 2010.421017]  ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[ 2010.421019]  ? __slab_alloc.isra.87.constprop.99+0x36/0x40
[ 2010.421021]  ? getname_flags+0x32/0x170
[ 2010.421023]  ? alloc_fd+0xe2/0x1b0
[ 2010.421026]  ? do_sys_openat2+0x248/0x310
[ 2010.421028]  do_sys_openat2+0x248/0x310
[ 2010.421030]  do_sys_open+0x47/0x60
[ 2010.421033]  do_syscall_64+0x37/0x80
[ 2010.421035]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 2010.421039] RIP: 0033:0x7f9e8e8fcb41
[ 2010.421041] Code: 41 83 e2 40 48 89 54 24 30 75 3e 89 f0 25 00 00 41 00 3d 00 00 41 00 74 30 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3f 48 8b 4c 24 18 64 48 33 0c 25 28 00 00 00
[ 2010.421043] RSP: 002b:00007ffe8964b7b0 EFLAGS: 00000287 ORIG_RAX: 0000000000000101
[ 2010.421044] RAX: ffffffffffffffda RBX: 0000562e1e31b450 RCX: 00007f9e8e8fcb41
[ 2010.421045] RDX: 0000000000080000 RSI: 0000562e1e31b230 RDI: 00000000ffffff9c
[ 2010.421047] RBP: 00007ffe8964b8f0 R08: 0000000000000000 R09: 00007ffe8964b903
[ 2010.421048] R10: 0000000000000000 R11: 0000000000000287 R12: 0000562e1e31b400
[ 2010.421049] R13: 0000000000000009 R14: 0000562e1e31b3e0 R15: 0000000000000009
[ 2010.421051] Modules linked in: zram(E) sr_mod(E) cdrom(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) xfs(E) libcrc32c(E) af_packet(E) ip6table_mangle(E) ip6table_raw(E) iptable_raw(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nfnetlink(E) ebtable_filter(E) rfkill(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) bpfilter(E) joydev(E) usblp(E) intel_rapl_msr(E) intel_rapl_common(E) nls_iso8859_1(E) nls_cp437(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) kvm(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) ledtrig_audio(E) snd_hda_codec_hdmi(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) snd_hda_intel(E) aesni_intel(E) snd_intel_dspcfg(E) crypto_simd(E) iTCO_wdt(E) at24(E) intel_pmc_bxt(E) regmap_i2c(E) mei_hdcp(E) iTCO_vendor_support(E) cryptd(E) snd_hda_codec(E) snd_hwdep(E) pcspkr(E) i2c_i801(E) snd_hda_core(E) i2c_smbus(E) r8169(E) snd_pcm(E) realtek(E) snd_timer(E)
[ 2010.421093]  mei_me(E) mdio_devres(E) lpc_ich(E) snd(E) libphy(E) mei(E) soundcore(E) mfd_core(E) fan(E) thermal(E) nfsd(E) intel_smartconnect(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sch_fq_codel(E) sunrpc(E) fuse(E) configfs(E) hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) nouveau(E) wmi(E) drm_ttm_helper(E) ttm(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) cec(E) ahci(E) rc_core(E) xhci_pci(E) ehci_pci(E) libahci(E) xhci_hcd(E) ehci_hcd(E) drm(E) libata(E) usbcore(E) video(E) button(E) sd_mod(E) t10_pi(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) virtio_ring(E) virtio(E) ext4(E) crc32c_intel(E) crc16(E) mbcache(E) jbd2(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) msr(E) autofs4(E) [last unloaded: zram]
[ 2010.421138] Dumping ftrace buffer:
[ 2010.421141]    (ftrace buffer empty)
Mike Galbraith July 6, 2021, 5:56 p.m. UTC | #10
On Mon, 2021-07-05 at 18:00 +0200, Mike Galbraith wrote:
> On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
> >
> > > The remaining patches to upstream from the RT tree are small ones
> > > related to KConfig. The patch that restricts PREEMPT_RT to SLUB
> > > (not SLAB or SLOB) makes sense. The patch that disables
> > > CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could perhaps be re-
> > > evaluated as the series also addresses some latency issues with
> > > percpu partial slabs.
> >
> > With that series the PARTIAL slab can be indeed enabled. I have
> > (had)
> > a half done series where I had PARTIAL enabled and noticed a slight
> > increase in latency so made it "default y on !RT". It wasn't
> > dramatic
> > but appeared to be outside of noise.
>
> I'm seeing warnings/explosions while exercising box IFF PARTIAL slab
> thingy is enabled.  I aborted -PARTIAL after a little over 4 hours,
> whereas the longest survival of 4 +PARTIAL runs was 50 minutes, so
> I'm fairly confident that PARTIAL really really is the trigger.

Resurrecting local exclusion around unfreeze_partials() seems to have
put an end to that.  Guess I can chop these trees down now.

---
 mm/slub.c |   13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2497,7 +2497,9 @@ static void put_cpu_partial(struct kmem_
 				 * partial array is full. Move the existing
 				 * set to the per node partial list.
 				 */
+				local_lock(&s->cpu_slab->lock);
 				unfreeze_partials(s);
+				local_unlock(&s->cpu_slab->lock);
 				oldpage = NULL;
 				pobjects = 0;
 				pages = 0;
@@ -2579,7 +2581,9 @@ static void flush_cpu_slab(struct work_s
 	if (c->page)
 		flush_slab(s, c, true);

+	local_lock(&s->cpu_slab->lock);
 	unfreeze_partials(s);
+	local_unlock(&s->cpu_slab->lock);
 }

 static bool has_cpu_slab(int cpu, struct kmem_cache *s)
@@ -3358,13 +3362,12 @@ static __always_inline void do_slab_free
 		 * we need to take the local_lock. We shouldn't simply defer to
 		 * __slab_free() as that wouldn't use the cpu freelist at all.
 		 */
-		unsigned long flags;
 		void **freelist;

-		local_lock_irqsave(&s->cpu_slab->lock, flags);
+		local_lock(&s->cpu_slab->lock);
 		c = this_cpu_ptr(s->cpu_slab);
 		if (unlikely(page != c->page)) {
-			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+			local_unlock(&s->cpu_slab->lock);
 			goto redo;
 		}
 		tid = c->tid;
@@ -3374,7 +3377,7 @@ static __always_inline void do_slab_free
 		c->freelist = head;
 		c->tid = next_tid(tid);

-		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
+		local_unlock(&s->cpu_slab->lock);
 #endif
 		stat(s, FREE_FASTPATH);
 	} else
@@ -3601,7 +3604,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 				slab_want_init_on_alloc(flags, s));
 	return i;
 error:
-	local_unlock_irq(&s->cpu_slab->lock);
+	slub_put_cpu_ptr(s->cpu_slab);
 	slab_post_alloc_hook(s, objcg, flags, i, p, false);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
Vlastimil Babka July 18, 2021, 7:41 a.m. UTC | #11
On 7/3/21 9:24 AM, Mike Galbraith wrote:
> On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
>> I replaced my slub changes with slub-local-lock-v2r3.
>> I haven't seen any complaints from lockdep or so, which is good. Then I
>> did this with RT enabled (and no debug):
> 
> Below is some raw hackbench data from my little i4790 desktop box.  It
> says we'll definitely still want list_lock to be raw.

Hi Mike, thanks a lot for the testing, sorry for late reply.

Did you try, instead of raw list_lock, not applying the last, local lock
patch, as I suggested in reply to bigeasy? I think the impact on reducing
the RT-specific overhead would be larger (than raw list_lock), the result
should still be RT compatible, and it would also deal with the bugs you
found there... (which I'll look into).

Thanks,
Vlastimil

> It also appears to be saying that there's something RT specific to
> stare at in addition to the list_lock business, but add a pinch of salt
> to that due to the config of the virgin(ish) tip tree being much
> lighter than the enterprise(ish) config of the tip-rt tree.
> 
> perf stat -r10 hackbench -s4096 -l500
> full warmup, record, repeat twice for elapsed
> 
> 5.13.0.g60ab3ed-tip-rt
>           8,898.51 msec task-clock                #    7.525 CPUs utilized            ( +-  0.33% )
>            368,922      context-switches          #    0.041 M/sec                    ( +-  5.20% )
>             42,281      cpu-migrations            #    0.005 M/sec                    ( +-  5.28% )
>             13,180      page-faults               #    0.001 M/sec                    ( +-  0.70% )
>     33,343,378,867      cycles                    #    3.747 GHz                      ( +-  0.30% )
>     21,656,783,887      instructions              #    0.65  insn per cycle           ( +-  0.67% )
>      4,408,569,663      branches                  #  495.428 M/sec                    ( +-  0.73% )
>         12,040,125      branch-misses             #    0.27% of all branches          ( +-  2.93% )
> 
>            1.18260 +- 0.00473 seconds time elapsed  ( +-  0.40% )
>            1.19018 +- 0.00441 seconds time elapsed  ( +-  0.37% ) (repeat)
>            1.18260 +- 0.00473 seconds time elapsed  ( +-  0.40% ) (repeat)
> 
> 5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3 list_lock=raw_spinlock_t
>           9,642.00 msec task-clock                #    7.521 CPUs utilized            ( +-  0.46% )
>            462,091      context-switches          #    0.048 M/sec                    ( +-  4.79% )
>             44,411      cpu-migrations            #    0.005 M/sec                    ( +-  4.34% )
>             12,980      page-faults               #    0.001 M/sec                    ( +-  0.43% )
>     36,098,859,429      cycles                    #    3.744 GHz                      ( +-  0.44% )
>     25,462,853,462      instructions              #    0.71  insn per cycle           ( +-  0.50% )
>      5,260,898,360      branches                  #  545.623 M/sec                    ( +-  0.52% )
>         16,088,686      branch-misses             #    0.31% of all branches          ( +-  2.02% )
> 
>            1.28207 +- 0.00568 seconds time elapsed  ( +-  0.44% )
>            1.28744 +- 0.00713 seconds time elapsed  ( +-  0.55% ) (repeat)
>            1.28085 +- 0.00850 seconds time elapsed  ( +-  0.66% ) (repeat)
> 
> 5.13.0.g60ab3ed-tip-rt +slub-local-lock-v2r3 list_lock=spinlock_t
>          10,004.89 msec task-clock                #    6.029 CPUs utilized            ( +-  1.37% )
>            654,311      context-switches          #    0.065 M/sec                    ( +-  5.16% )
>            211,070      cpu-migrations            #    0.021 M/sec                    ( +-  1.38% )
>             13,262      page-faults               #    0.001 M/sec                    ( +-  0.79% )
>     36,585,914,931      cycles                    #    3.657 GHz                      ( +-  1.35% )
>     27,682,240,511      instructions              #    0.76  insn per cycle           ( +-  1.06% )
>      5,766,064,432      branches                  #  576.325 M/sec                    ( +-  1.11% )
>         24,269,069      branch-misses             #    0.42% of all branches          ( +-  2.03% )
> 
>             1.6595 +- 0.0116 seconds time elapsed  ( +-  0.70% )
>             1.6270 +- 0.0180 seconds time elapsed  ( +-  1.11% ) (repeat)
>             1.6213 +- 0.0150 seconds time elapsed  ( +-  0.93% ) (repeat)
> 
> virgin(ish) tip
> 5.13.0.g60ab3ed-tip
>           7,320.67 msec task-clock                #    7.792 CPUs utilized            ( +-  0.31% )
>            221,215      context-switches          #    0.030 M/sec                    ( +-  3.97% )
>             16,234      cpu-migrations            #    0.002 M/sec                    ( +-  4.07% )
>             13,233      page-faults               #    0.002 M/sec                    ( +-  0.91% )
>     27,592,205,252      cycles                    #    3.769 GHz                      ( +-  0.32% )
>      8,309,495,040      instructions              #    0.30  insn per cycle           ( +-  0.37% )
>      1,555,210,607      branches                  #  212.441 M/sec                    ( +-  0.42% )
>          5,484,209      branch-misses             #    0.35% of all branches          ( +-  2.13% )
> 
>            0.93949 +- 0.00423 seconds time elapsed  ( +-  0.45% )
>            0.94608 +- 0.00384 seconds time elapsed  ( +-  0.41% ) (repeat)
>            0.94422 +- 0.00410 seconds time elapsed  ( +-  0.43% )
> 
> 5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
>           7,343.57 msec task-clock                #    7.776 CPUs utilized            ( +-  0.44% )
>            223,044      context-switches          #    0.030 M/sec                    ( +-  3.02% )
>             16,057      cpu-migrations            #    0.002 M/sec                    ( +-  4.03% )
>             13,164      page-faults               #    0.002 M/sec                    ( +-  0.97% )
>     27,684,906,017      cycles                    #    3.770 GHz                      ( +-  0.45% )
>      8,323,273,871      instructions              #    0.30  insn per cycle           ( +-  0.28% )
>      1,556,106,680      branches                  #  211.901 M/sec                    ( +-  0.31% )
>          5,463,468      branch-misses             #    0.35% of all branches          ( +-  1.33% )
> 
>            0.94440 +- 0.00352 seconds time elapsed  ( +-  0.37% )
>            0.94830 +- 0.00228 seconds time elapsed  ( +-  0.24% ) (repeat)
>            0.93813 +- 0.00440 seconds time elapsed  ( +-  0.47% ) (repeat)
>
Mike Galbraith July 18, 2021, 8:29 a.m. UTC | #12
On Sun, 2021-07-18 at 09:41 +0200, Vlastimil Babka wrote:
> On 7/3/21 9:24 AM, Mike Galbraith wrote:
> > On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior wrote:
> > > I replaced my slub changes with slub-local-lock-v2r3.
> > > I haven't seen any complaints from lockdep or so, which is good.
> > > Then I
> > > did this with RT enabled (and no debug):
> >
> > Below is some raw hackbench data from my little i4790 desktop
> > box.  It
> > says we'll definitely still want list_lock to be raw.
>
> Hi Mike, thanks a lot for the testing, sorry for the late reply.
>
> Did you try, instead of raw list_lock, not applying the last, local
> lock patch, as I suggested in reply to bigeasy?

No, but I suppose I can give that a go.

	-Mike
Mike Galbraith July 18, 2021, 12:09 p.m. UTC | #13
On Sun, 2021-07-18 at 10:29 +0200, Mike Galbraith wrote:
> On Sun, 2021-07-18 at 09:41 +0200, Vlastimil Babka wrote:
> > On 7/3/21 9:24 AM, Mike Galbraith wrote:
> > > On Fri, 2021-07-02 at 20:29 +0200, Sebastian Andrzej Siewior
> > > wrote:
> > > > I replaced my slub changes with slub-local-lock-v2r3.
> > > > I haven't seen any complaints from lockdep or so, which is good.
> > > > Then I
> > > > did this with RT enabled (and no debug):
> > >
> > > Below is some raw hackbench data from my little i4790 desktop
> > > box.  It
> > > says we'll definitely still want list_lock to be raw.
> >
> > Hi Mike, thanks a lot for the testing, sorry for the late reply.
> >
> > Did you try, instead of raw list_lock, not applying the last, local
> > lock patch, as I suggested in reply to bigeasy?
>
> No, but I suppose I can give that a go.

The kernel has of course moved forward, so I measured them again,
starting from the tree before the 5.12-rt patches were replaced with
slub-local-lock-v2r3.

box is old i4790 desktop
perf stat -r10 hackbench -s4096 -l500
full warmup, record, repeat twice for elapsed

Config has only SLUB+SLUB_DEBUG, as originally measured.

5.14.0.g79e92006-tip-rt (5.12-rt, 5.13-rt didn't exist when first measured)
          7,984.52 msec task-clock                #    7.565 CPUs utilized            ( +-  0.66% )
           353,566      context-switches          #   44.281 K/sec                    ( +-  2.77% )
            37,685      cpu-migrations            #    4.720 K/sec                    ( +-  6.37% )
            12,939      page-faults               #    1.620 K/sec                    ( +-  0.67% )
    29,901,079,227      cycles                    #    3.745 GHz                      ( +-  0.71% )
    14,550,797,818      instructions              #    0.49  insn per cycle           ( +-  0.47% )
     3,056,685,643      branches                  #  382.826 M/sec                    ( +-  0.51% )
         9,598,083      branch-misses             #    0.31% of all branches          ( +-  2.11% )

           1.05542 +- 0.00409 seconds time elapsed  ( +-  0.39% )
           1.05990 +- 0.00244 seconds time elapsed  ( +-  0.23% ) (repeat)
           1.05367 +- 0.00303 seconds time elapsed  ( +-  0.29% ) (repeat)

5.14.0.g79e92006-tip-rt +slub-local-lock-v2r3 -0034-mm-slub-convert-kmem_cpu_slab-protection-to-local_lock.patch
          6,899.35 msec task-clock                #    5.637 CPUs utilized            ( +-  0.53% )
           420,304      context-switches          #   60.919 K/sec                    ( +-  2.83% )
           187,130      cpu-migrations            #   27.123 K/sec                    ( +-  1.81% )
            13,206      page-faults               #    1.914 K/sec                    ( +-  0.96% )
    25,110,362,933      cycles                    #    3.640 GHz                      ( +-  0.49% )
    15,853,643,635      instructions              #    0.63  insn per cycle           ( +-  0.64% )
     3,366,261,524      branches                  #  487.910 M/sec                    ( +-  0.70% )
        14,839,618      branch-misses             #    0.44% of all branches          ( +-  2.01% )

           1.22390 +- 0.00744 seconds time elapsed  ( +-  0.61% )
           1.21813 +- 0.00907 seconds time elapsed  ( +-  0.74% ) (repeat)
           1.22097 +- 0.00952 seconds time elapsed  ( +-  0.78% ) (repeat)

repeat of above with raw list_lock
          8,072.62 msec task-clock                #    7.605 CPUs utilized            ( +-  0.49% )
           359,514      context-switches          #   44.535 K/sec                    ( +-  4.95% )
            35,285      cpu-migrations            #    4.371 K/sec                    ( +-  5.82% )
            13,503      page-faults               #    1.673 K/sec                    ( +-  0.96% )
    30,247,989,681      cycles                    #    3.747 GHz                      ( +-  0.52% )
    14,580,011,391      instructions              #    0.48  insn per cycle           ( +-  0.81% )
     3,063,743,405      branches                  #  379.523 M/sec                    ( +-  0.85% )
         8,907,160      branch-misses             #    0.29% of all branches          ( +-  3.99% )

           1.06150 +- 0.00427 seconds time elapsed  ( +-  0.40% )
           1.05041 +- 0.00176 seconds time elapsed  ( +-  0.17% ) (repeat)
           1.06086 +- 0.00237 seconds time elapsed  ( +-  0.22% ) (repeat)

5.14.0.g79e92006-rt3-tip-rt +slub-local-lock-v2r3 full set
          7,598.44 msec task-clock                #    5.813 CPUs utilized            ( +-  0.85% )
           488,161      context-switches          #   64.245 K/sec                    ( +-  4.29% )
           196,866      cpu-migrations            #   25.909 K/sec                    ( +-  1.49% )
            13,042      page-faults               #    1.716 K/sec                    ( +-  0.73% )
    27,695,116,746      cycles                    #    3.645 GHz                      ( +-  0.79% )
    18,423,934,168      instructions              #    0.67  insn per cycle           ( +-  0.88% )
     3,969,540,695      branches                  #  522.415 M/sec                    ( +-  0.92% )
        15,493,482      branch-misses             #    0.39% of all branches          ( +-  2.15% )

           1.30709 +- 0.00890 seconds time elapsed  ( +-  0.68% )
           1.3205 +- 0.0134 seconds time elapsed  ( +-  1.02% ) (repeat)
           1.3083 +- 0.0132 seconds time elapsed  ( +-  1.01% ) (repeat)
Sebastian Andrzej Siewior July 29, 2021, 1:49 p.m. UTC | #14
now that I'm slowly catching up…

On 2021-07-02 22:25:05 [+0200], Vlastimil Babka wrote:
> > - perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
> > Old:
> > |         464.967,20 msec task-clock                #   27,220 CPUs utilized            ( +-  0,16% )
> > New:
> > |         422.865,71 msec task-clock                #    4,782 CPUs utilized            ( +-  0,34% )
> 
> The series shouldn't significantly change the memory allocator
> interaction, though.
> Seems there are fewer cycles, but more time elapsed, thus more sleeping -
> is it locks becoming mutexes on RT?

yes, most likely since the !RT parts are mostly unchanged.

> My second guess - list_lock remains spinlock with my series, thus RT
> mutex, but the current RT tree converts it to raw_spinlock. I'd hope
> leaving that one as non-raw spinlock would still be much better for RT
> goals, even if hackbench (which is AFAIK very slab intensive) throughput
> regresses - hopefully not that much.

Yes, the list_lock seems to be the case. I picked your
slub-local-lock-v3r0 and changed the list_lock (plus slab_lock()) to use
raw_spinlock_t and disable interrupts; CPU utilisation went to
~23 CPUs (plus a bunch of warnings, which probably made it a little
slower again).

The difference between a sleeping lock (spinlock_t) and a mutex is
that we attempt not to preempt a task that has acquired a spinlock_t,
even if it has been running for some time and the scheduler would
otherwise preempt it (as it would if the task held a mutex). These are
the "lazy preempt" bits in the RT patch.
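
In source, the distinction is just the declared lock type; a minimal
sketch, not from the series:

	static DEFINE_RAW_SPINLOCK(raw_lock);	/* spins with irqs off, also on RT */
	static DEFINE_SPINLOCK(sleeping_lock);	/* rtmutex-backed sleeping lock on RT */
	unsigned long flags;

	raw_spin_lock_irqsave(&raw_lock, flags);
	/* holder is not preemptible and must not sleep, on any config */
	raw_spin_unlock_irqrestore(&raw_lock, flags);

	spin_lock(&sleeping_lock);
	/* on RT the holder remains preemptible in principle; the lazy
	 * preempt bits try to let it finish the critical section first */
	spin_unlock(&sleeping_lock);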

By making the list_lock a raw_spinlock_t, a lot of IRQ-flags dancing
needs to be done, as the page allocator must be entered with interrupts
enabled. And then there is the possibility that you may need to free
some memory even while allocating memory, which requires extra steps
on RT due to the IRQ-off part. All this vanishes by keeping list_lock a
spinlock_t.
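
Roughly the dance a raw list_lock forces, sketched by hand (not code
from the series; try_get_partial() is a made-up placeholder):

	raw_spin_lock_irqsave(&n->list_lock, flags);
	page = try_get_partial(n);		/* placeholder */
	if (!page) {
		/* on RT the page allocator must not be entered with
		 * interrupts disabled, so drop and re-take the lock */
		raw_spin_unlock_irqrestore(&n->list_lock, flags);
		page = alloc_pages(gfpflags, order);
		raw_spin_lock_irqsave(&n->list_lock, flags);
	}
	...
	raw_spin_unlock_irqrestore(&n->list_lock, flags);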

The kernel-build test on /dev/shm remained unchanged, so that is good.
Unless there is a real-world use-case that gets worse, I don't mind
keeping the spinlock_t here. I haven't seen tglx complaining so far.

Sebastian
Vlastimil Babka July 29, 2021, 2:17 p.m. UTC | #15
On 7/29/21 3:49 PM, Sebastian Andrzej Siewior wrote:
> now that I'm slowly catching up…
> 
> On 2021-07-02 22:25:05 [+0200], Vlastimil Babka wrote:
>> > - perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
>> > Old:
>> > |         464.967,20 msec task-clock                #   27,220 CPUs utilized            ( +-  0,16% )
>> > New:
>> > |         422.865,71 msec task-clock                #    4,782 CPUs utilized            ( +-  0,34% )
>> 
>> The series shouldn't significantly change the memory allocator
>> interaction, though.
>> Seems there are fewer cycles, but more time elapsed, thus more sleeping -
>> is it locks becoming mutexes on RT?
> 
> yes, most likely since the !RT parts are mostly unchanged.
> 
>> My second guess - list_lock remains spinlock with my series, thus RT
>> mutex, but the current RT tree converts it to raw_spinlock. I'd hope
>> leaving that one as non-raw spinlock would still be much better for RT
>> goals, even if hackbench (which is AFAIK very slab intensive) throughput
>> regresses - hopefully not that much.
> 
> Yes, the list_lock seems to be the case. I picked your
> slub-local-lock-v3r0 and changed the list_lock (plus slab_lock()) to use
> raw_spinlock_t and disable interrupts; CPU utilisation went to
> ~23 CPUs (plus a bunch of warnings, which probably made it a little
> slower again).

I forgot to point that out in the cover letter, but with v3 this change to
raw_spinlock_t is AFAICS no longer possible (at least with
CONFIG_SLUB_CPU_PARTIAL) because in put_cpu_partial() we now take the local_lock
and it can be called from get_partial_node() which takes the list_lock.
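
Sketched, the problematic nesting is (simplified, not the literal v3
code):

	get_partial_node()
	    spin_lock_irqsave(&n->list_lock, flags);	/* would have to become raw */
	    ...
	    put_cpu_partial()
	        local_lock_irqsave(&s->cpu_slab->lock, flags2);	/* sleeping on RT */

and on RT a sleeping local_lock_t must not be acquired inside a
raw_spinlock_t critical section.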

> The difference between a sleeping lock (spinlock_t) and a mutex is
> that we attempt not to preempt a task that has acquired a spinlock_t,
> even if it has been running for some time and the scheduler would
> otherwise preempt it (as it would if the task held a mutex). These are
> the "lazy preempt" bits in the RT patch.
> 
> By making the list_lock a raw_spinlock_t, a lot of IRQ-flags dancing
> needs to be done, as the page allocator must be entered with interrupts
> enabled.

Hm, but SLUB should never call the page allocator from under list_lock in my series?

> And then there is the possibility that you may need to free
> some memory even while allocating memory, which requires extra steps
> on RT due to the IRQ-off part. All this vanishes by keeping list_lock a
> spinlock_t.
> The kernel-build test on /dev/shm remained unchanged, so that is good.
> Unless there is a real-world use-case that gets worse, I don't mind
> keeping the spinlock_t here. I haven't seen tglx complaining so far.

Good. IIRC hackbench is very close to being a slab microbenchmark, so
regressions there are expected, but should not translate to notable
real-world regressions.

> Sebastian
Sebastian Andrzej Siewior July 29, 2021, 2:37 p.m. UTC | #16
On 2021-07-29 16:17:26 [+0200], Vlastimil Babka wrote:
> I forgot to point that out in the cover letter, but with v3 this change to
> raw_spinlock_t is AFAICS no longer possible (at least with
> CONFIG_SLUB_CPU_PARTIAL) because in put_cpu_partial() we now take the local_lock
> and it can be called from get_partial_node() which takes the list_lock.

I saw increased latency numbers with CONFIG_SLUB_CPU_PARTIAL before it
got disabled for other reasons, so I'm not too sad if it remains
disabled.

> Hm but SLUB should never call the page allocator from under list_lock in my series?

oh yes, I ran into CPU_PARTIAL instead. Sorry for the confusion. So
without PARTIAL it says "24,643 CPUs utilized" and no warnings :)

Sebastian