
[RFC,v3,13/13] uprobes: add speculative lockless VMA to inode resolution

Message ID 20240813042917.506057-14-andrii@kernel.org (mailing list archive)
State Handled Elsewhere
Series uprobes: RCU-protected hot path optimizations


Commit Message

Andrii Nakryiko Aug. 13, 2024, 4:29 a.m. UTC
Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
vma->vm_file->f_inode locklessly, protected only by rcu_read_lock(),
and attempt the uprobe lookup speculatively.

We rely on the newly added mmap_lock_speculation_{start,end}() helpers
to validate that mm_struct stays intact for the entire duration of this
speculation. If it doesn't, we fall back to the mmap_lock-protected
lookup.

This allows us to avoid contention on mmap_lock in the absolute
majority of cases, nicely improving uprobe/uretprobe scalability.

BEFORE
======
uprobe-nop            ( 1 cpus):    3.417 ± 0.013M/s  (  3.417M/s/cpu)
uprobe-nop            ( 2 cpus):    5.724 ± 0.006M/s  (  2.862M/s/cpu)
uprobe-nop            ( 3 cpus):    8.543 ± 0.012M/s  (  2.848M/s/cpu)
uprobe-nop            ( 4 cpus):   11.094 ± 0.004M/s  (  2.774M/s/cpu)
uprobe-nop            ( 5 cpus):   13.703 ± 0.006M/s  (  2.741M/s/cpu)
uprobe-nop            ( 6 cpus):   16.350 ± 0.010M/s  (  2.725M/s/cpu)
uprobe-nop            ( 7 cpus):   19.100 ± 0.031M/s  (  2.729M/s/cpu)
uprobe-nop            ( 8 cpus):   20.138 ± 0.029M/s  (  2.517M/s/cpu)
uprobe-nop            (10 cpus):   20.161 ± 0.020M/s  (  2.016M/s/cpu)
uprobe-nop            (12 cpus):   15.129 ± 0.011M/s  (  1.261M/s/cpu)
uprobe-nop            (14 cpus):   15.013 ± 0.013M/s  (  1.072M/s/cpu)
uprobe-nop            (16 cpus):   13.352 ± 0.007M/s  (  0.834M/s/cpu)
uprobe-nop            (24 cpus):   12.470 ± 0.005M/s  (  0.520M/s/cpu)
uprobe-nop            (32 cpus):   11.252 ± 0.042M/s  (  0.352M/s/cpu)
uprobe-nop            (40 cpus):   10.308 ± 0.001M/s  (  0.258M/s/cpu)
uprobe-nop            (48 cpus):   11.037 ± 0.007M/s  (  0.230M/s/cpu)
uprobe-nop            (56 cpus):   12.055 ± 0.002M/s  (  0.215M/s/cpu)
uprobe-nop            (64 cpus):   12.895 ± 0.004M/s  (  0.201M/s/cpu)
uprobe-nop            (72 cpus):   13.995 ± 0.005M/s  (  0.194M/s/cpu)
uprobe-nop            (80 cpus):   15.224 ± 0.030M/s  (  0.190M/s/cpu)

AFTER
=====
uprobe-nop            ( 1 cpus):    3.562 ± 0.006M/s  (  3.562M/s/cpu)
uprobe-nop            ( 2 cpus):    6.751 ± 0.007M/s  (  3.376M/s/cpu)
uprobe-nop            ( 3 cpus):   10.121 ± 0.007M/s  (  3.374M/s/cpu)
uprobe-nop            ( 4 cpus):   13.100 ± 0.007M/s  (  3.275M/s/cpu)
uprobe-nop            ( 5 cpus):   16.321 ± 0.008M/s  (  3.264M/s/cpu)
uprobe-nop            ( 6 cpus):   19.612 ± 0.004M/s  (  3.269M/s/cpu)
uprobe-nop            ( 7 cpus):   22.910 ± 0.037M/s  (  3.273M/s/cpu)
uprobe-nop            ( 8 cpus):   24.705 ± 0.011M/s  (  3.088M/s/cpu)
uprobe-nop            (10 cpus):   30.772 ± 0.020M/s  (  3.077M/s/cpu)
uprobe-nop            (12 cpus):   33.614 ± 0.009M/s  (  2.801M/s/cpu)
uprobe-nop            (14 cpus):   39.166 ± 0.004M/s  (  2.798M/s/cpu)
uprobe-nop            (16 cpus):   41.692 ± 0.014M/s  (  2.606M/s/cpu)
uprobe-nop            (24 cpus):   64.802 ± 0.048M/s  (  2.700M/s/cpu)
uprobe-nop            (32 cpus):   84.226 ± 0.223M/s  (  2.632M/s/cpu)
uprobe-nop            (40 cpus):  102.071 ± 0.067M/s  (  2.552M/s/cpu)
uprobe-nop            (48 cpus):  106.603 ± 1.198M/s  (  2.221M/s/cpu)
uprobe-nop            (56 cpus):  117.695 ± 0.059M/s  (  2.102M/s/cpu)
uprobe-nop            (64 cpus):  124.291 ± 0.485M/s  (  1.942M/s/cpu)
uprobe-nop            (72 cpus):  135.527 ± 0.134M/s  (  1.882M/s/cpu)
uprobe-nop            (80 cpus):  146.195 ± 0.230M/s  (  1.827M/s/cpu)

Previously, total throughput was maxing out at 20 mln/s with 8-10
cores, declining afterwards. With this change it keeps growing with
each added CPU, reaching 146 mln/s at 80 CPUs (measured on an 80-core
Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz system).

Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 51 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)
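
Since the diff body is not reproduced above, here is the core of the
new helper as quoted in the review below, with a few comments added for
readability:

static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
{
	const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
	struct mm_struct *mm = current->mm;
	struct uprobe *uprobe;
	struct vm_area_struct *vma;
	struct file *vm_file;
	struct inode *vm_inode;
	unsigned long vm_pgoff, vm_start;
	int seq;
	loff_t offset;

	/* snapshot mm_lock_seq; bail out if mmap_lock is write-locked */
	if (!mmap_lock_speculation_start(mm, &seq))
		return NULL;

	rcu_read_lock();

	vma = vma_lookup(mm, bp_vaddr);
	if (!vma)
		goto bail;

	/* racy reads by design; everything is revalidated at the end */
	vm_file = data_race(vma->vm_file);
	if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
		goto bail;

	vm_inode = data_race(vm_file->f_inode);
	vm_pgoff = data_race(vma->vm_pgoff);
	vm_start = data_race(vma->vm_start);

	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
	uprobe = find_uprobe_rcu(vm_inode, offset);
	if (!uprobe)
		goto bail;

	/* now double check that nothing about MM changed */
	if (!mmap_lock_speculation_end(mm, seq))
		goto bail;

	rcu_read_unlock();

	/* happy case, we speculated successfully */
	return uprobe;
bail:
	rcu_read_unlock();
	return NULL;
}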

Comments

Mateusz Guzik Aug. 13, 2024, 6:17 a.m. UTC | #1
On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> attempting uprobe look up speculatively.
> 
> We rely on newly added mmap_lock_speculation_{start,end}() helpers to
> validate that mm_struct stays intact for entire duration of this
> speculation. If not, we fall back to mmap_lock-protected lookup.
> 
> This allows to avoid contention on mmap_lock in absolutely majority of
> cases, nicely improving uprobe/uretprobe scalability.
> 

Here I have to admit to being mostly ignorant about the mm, so bear with
me. :>

I note the result of find_active_uprobe_speculative is immediately stale
in the face of modifications.

The thing I'm after is that the mmap_lock_speculation business adds
overhead on archs where a release fence is not a de facto nop, and I
don't believe the commit message justifies it. Definitely a bummer to
add it merely for uprobes. If there are bigger plans concerning it,
that's a different story of course.

With this in mind I have to ask if instead you could perhaps get away
with the already present per-vma sequence counter?
Suren Baghdasaryan Aug. 13, 2024, 3:36 p.m. UTC | #2
On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > attempting uprobe look up speculatively.
> >
> > We rely on newly added mmap_lock_speculation_{start,end}() helpers to
> > validate that mm_struct stays intact for entire duration of this
> > speculation. If not, we fall back to mmap_lock-protected lookup.
> >
> > This allows to avoid contention on mmap_lock in absolutely majority of
> > cases, nicely improving uprobe/uretprobe scalability.
> >
>
> Here I have to admit to being mostly ignorant about the mm, so bear with
> me. :>
>
> I note the result of find_active_uprobe_speculative is immediately stale
> in face of modifications.
>
> The thing I'm after is that the mmap_lock_speculation business adds
> overhead on archs where a release fence is not a de facto nop and I
> don't believe the commit message justifies it. Definitely a bummer to
> add merely it for uprobes. If there are bigger plans concerning it
> that's a different story of course.
>
> With this in mind I have to ask if instead you could perhaps get away
> with the already present per-vma sequence counter?

per-vma sequence counter does not implement acquire/release logic, it
relies on vma->vm_lock for synchronization. So if we want to use it,
we would have to add additional memory barriers here. This is likely
possible but as I mentioned before we would need to ensure the
pagefault path does not regress. OTOH mm->mm_lock_seq is already
halfway there (it implements acquire/release logic); we just had to
ensure mmap_write_lock() increments mm->mm_lock_seq.

So, from the release fence overhead POV I think whether we use
mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
here.
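
For illustration only, such helpers built on mm->mm_lock_seq could look
roughly like the sketch below; the actual definitions are added earlier
in this series and may differ in detail (the odd/even "writer active"
encoding here is an assumption):

static inline bool mmap_lock_speculation_start(struct mm_struct *mm, int *seq)
{
	/* pairs with the release when a writer bumps mm_lock_seq */
	*seq = smp_load_acquire(&mm->mm_lock_seq);
	/* sketch: treat an odd value as "mmap_lock held for write" */
	return !(*seq & 1);
}

static inline bool mmap_lock_speculation_end(struct mm_struct *mm, int seq)
{
	/* order all speculative reads before the recheck */
	smp_rmb();
	/* any writer in the meantime would have changed the sequence */
	return READ_ONCE(mm->mm_lock_seq) == seq;
}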
Mateusz Guzik Aug. 15, 2024, 1:44 p.m. UTC | #3
On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > attempting uprobe look up speculatively.
> > >
> > > We rely on newly added mmap_lock_speculation_{start,end}() helpers to
> > > validate that mm_struct stays intact for entire duration of this
> > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > >
> > > This allows to avoid contention on mmap_lock in absolutely majority of
> > > cases, nicely improving uprobe/uretprobe scalability.
> > >
> >
> > Here I have to admit to being mostly ignorant about the mm, so bear with
> > me. :>
> >
> > I note the result of find_active_uprobe_speculative is immediately stale
> > in face of modifications.
> >
> > The thing I'm after is that the mmap_lock_speculation business adds
> > overhead on archs where a release fence is not a de facto nop and I
> > don't believe the commit message justifies it. Definitely a bummer to
> > add merely it for uprobes. If there are bigger plans concerning it
> > that's a different story of course.
> >
> > With this in mind I have to ask if instead you could perhaps get away
> > with the already present per-vma sequence counter?
> 
> per-vma sequence counter does not implement acquire/release logic, it
> relies on vma->vm_lock for synchronization. So if we want to use it,
> we would have to add additional memory barriers here. This is likely
> possible but as I mentioned before we would need to ensure the
> pagefault path does not regress. OTOH mm->mm_lock_seq already halfway
> there (it implements acquire/release logic), we just had to ensure
> mmap_write_lock() increments mm->mm_lock_seq.
> 
> So, from the release fence overhead POV I think whether we use
> mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
> here.
> 

Per my previous e-mail I'm not particularly familiar with mm internals,
so I'm going to handwave a little bit with my $0,03 concerning multicore
in general and if you disagree with it that's your business. For the
time being I have no interest in digging into any of this.

Before I do, to prevent this thread from being a total waste, here are
some remarks concerning the patch with the assumption that the core idea
lands.

From the commit message:
> Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> attempting uprobe look up speculatively.

Just in case I'll note a nit that this paragraph will need to be removed
since the patch adding the flag is getting dropped.

A non-nit which may or may not end up mattering is that the flag (which
*is* set on the filep slab cache) makes things more difficult to
validate. Normal RCU usage guarantees that the object itself won't be
freed as long as you follow the rules. However, the SLAB_TYPESAFE_BY_RCU
flag weakens it significantly -- the thing at hand will always be a
'struct file', but it may get reallocated to *another* file from under
you. Whether this aspect plays a role here I don't know.
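
For reference, the canonical way SLAB_TYPESAFE_BY_RCU consumers guard
against such reuse is to take a reference and then re-check identity --
a generic sketch, not code from this patch:

struct file *file;

rcu_read_lock();
file = READ_ONCE(vma->vm_file);
/*
 * The memory stays a 'struct file' under RCU, but it may have been
 * recycled into *another* file, so pin it and re-check that it is
 * still the file this vma maps before trusting anything read from it.
 */
if (file && atomic_long_inc_not_zero(&file->f_count)) {
	if (READ_ONCE(vma->vm_file) != file) {
		fput(file);
		file = NULL;
	}
} else {
	file = NULL;
}
rcu_read_unlock();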

> +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> +{
> +	const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> +	struct mm_struct *mm = current->mm;
> +	struct uprobe *uprobe;
> +	struct vm_area_struct *vma;
> +	struct file *vm_file;
> +	struct inode *vm_inode;
> +	unsigned long vm_pgoff, vm_start;
> +	int seq;
> +	loff_t offset;
> +
> +	if (!mmap_lock_speculation_start(mm, &seq))
> +		return NULL;
> +
> +	rcu_read_lock();
> +

I don't think there is a correctness problem here, but entering rcu
*after* deciding to speculatively do the lookup feels backwards.

> +	vma = vma_lookup(mm, bp_vaddr);
> +	if (!vma)
> +		goto bail;
> +
> +	vm_file = data_race(vma->vm_file);
> +	if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> +		goto bail;
> +

If vma teardown is allowed to progress and the file got fput'ed...

> +	vm_inode = data_race(vm_file->f_inode);

... the inode can be NULL, I don't know if that's handled.

More importantly though, per my previous description of
SLAB_TYPESAFE_BY_RCU, by now the file could have been reallocated and
the inode you did find is completely unrelated.

I understand the intent is to backpedal from everything should the mm
seqc change, but the above may happen to matter.

> +	vm_pgoff = data_race(vma->vm_pgoff);
> +	vm_start = data_race(vma->vm_start);
> +
> +	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> +	uprobe = find_uprobe_rcu(vm_inode, offset);
> +	if (!uprobe)
> +		goto bail;
> +
> +	/* now double check that nothing about MM changed */
> +	if (!mmap_lock_speculation_end(mm, seq))
> +		goto bail;

This leaks the reference obtained by find_uprobe_rcu().

> +
> +	rcu_read_unlock();
> +
> +	/* happy case, we speculated successfully */
> +	return uprobe;
> +bail:
> +	rcu_read_unlock();
> +	return NULL;
> +}

Now to some handwaving, here it is:

The core of my concern is that adding more work to down_write on the
mmap semaphore comes with certain side-effects, and plausibly a more
than sufficient speedup can be achieved without doing it.

An mm-wide mechanism is just incredibly coarse-grained and it may happen
to perform poorly when faced with a program which likes to mess with its
address space -- the fast path is going to keep failing and only
inducing *more* overhead as the code decides to down_read the mmap
semaphore.
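
For context, in this series the speculative path sits in front of the
existing mmap_read_lock-protected lookup, roughly like the sketch below
(based on the current find_active_uprobe() logic in
kernel/events/uprobes.c; exact naming in the series may differ):

static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
{
	struct mm_struct *mm = current->mm;
	struct uprobe *uprobe = NULL;
	struct vm_area_struct *vma;

	/* lockless fast path; NULL means "retry under mmap_lock" */
	uprobe = find_active_uprobe_speculative(bp_vaddr);
	if (uprobe)
		return uprobe;

	mmap_read_lock(mm);
	vma = vma_lookup(mm, bp_vaddr);
	if (vma) {
		if (valid_vma(vma, false)) {
			struct inode *inode = file_inode(vma->vm_file);
			loff_t offset = vaddr_to_offset(vma, bp_vaddr);

			uprobe = find_uprobe_rcu(inode, offset);
		}
		if (!uprobe)
			*is_swbp = is_trap_at_addr(mm, bp_vaddr);
	} else {
		*is_swbp = -EFAULT;
	}
	mmap_read_unlock(mm);
	return uprobe;
}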

Furthermore there may be work currently synchronized with down_write
which perhaps can transition to "merely" down_read, but by the time it
happens this and possibly other consumers expect a change in the
sequence counter, messing with it.

To my understanding the kernel supports parallel faults with per-vma
locking. I would find it surprising if the same machinery could not be
used to sort out uprobe handling above.

I presume a down_read on vma around all the work would also sort out any
issues concerning stability of the file or inode objects.

Of course single-threaded performance would take a hit due to the
atomics stemming from down/up_read, and parallel uprobe lookups on the
same vma would also get slower, but I don't know if that's a problem
for a real workload.

I would not have any comments if all speed ups were achieved without
modifying non-uprobe code.
Andrii Nakryiko Aug. 15, 2024, 4:47 p.m. UTC | #4
On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > attempting uprobe look up speculatively.
> > > >
> > > > We rely on newly added mmap_lock_speculation_{start,end}() helpers to
> > > > validate that mm_struct stays intact for entire duration of this
> > > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > > >
> > > > This allows to avoid contention on mmap_lock in absolutely majority of
> > > > cases, nicely improving uprobe/uretprobe scalability.
> > > >
> > >
> > > Here I have to admit to being mostly ignorant about the mm, so bear with
> > > me. :>
> > >
> > > I note the result of find_active_uprobe_speculative is immediately stale
> > > in face of modifications.
> > >
> > > The thing I'm after is that the mmap_lock_speculation business adds
> > > overhead on archs where a release fence is not a de facto nop and I
> > > don't believe the commit message justifies it. Definitely a bummer to
> > > add merely it for uprobes. If there are bigger plans concerning it
> > > that's a different story of course.
> > >
> > > With this in mind I have to ask if instead you could perhaps get away
> > > with the already present per-vma sequence counter?
> >
> > per-vma sequence counter does not implement acquire/release logic, it
> > relies on vma->vm_lock for synchronization. So if we want to use it,
> > we would have to add additional memory barriers here. This is likely
> > possible but as I mentioned before we would need to ensure the
> > pagefault path does not regress. OTOH mm->mm_lock_seq already halfway
> > there (it implements acquire/release logic), we just had to ensure
> > mmap_write_lock() increments mm->mm_lock_seq.
> >
> > So, from the release fence overhead POV I think whether we use
> > mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
> > here.
> >
>
> Per my previous e-mail I'm not particularly familiar with mm internals,
> so I'm going to handwave a little bit with my $0,03 concerning multicore
> in general and if you disagree with it that's your business. For the
> time being I have no interest in digging into any of this.
>
> Before I do, to prevent this thread from being a total waste, here are
> some remarks concerning the patch with the assumption that the core idea
> lands.
>
> From the commit message:
> > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > attempting uprobe look up speculatively.
>
> Just in case I'll note a nit that this paragraph will need to be removed
> since the patch adding the flag is getting dropped.

Yep, of course, I'll update all that for the next revision (I'll wait
for non-RFC patches to land first before reposting).

>
> A non-nit which may or may not end up mattering is that the flag (which
> *is* set on the filep slab cache) makes things more difficult to
> validate. Normal RCU usage guarantees that the object itself wont be
> freed as long you follow the rules. However, the SLAB_TYPESAFE_BY_RCU
> flag weakens it significantly -- the thing at hand will always be a
> 'struct file', but it may get reallocated to *another* file from under
> you. Whether this aspect plays a role here I don't know.

Yes, that's ok and is accounted for. We care about that memory not
going away from under us (I'm not even sure it matters that it is
still a struct file, tbh; I think that shouldn't matter, as we are
prepared to deal with completely garbage values read from struct
file).

>
> > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > +{
> > +     const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > +     struct mm_struct *mm = current->mm;
> > +     struct uprobe *uprobe;
> > +     struct vm_area_struct *vma;
> > +     struct file *vm_file;
> > +     struct inode *vm_inode;
> > +     unsigned long vm_pgoff, vm_start;
> > +     int seq;
> > +     loff_t offset;
> > +
> > +     if (!mmap_lock_speculation_start(mm, &seq))
> > +             return NULL;
> > +
> > +     rcu_read_lock();
> > +
>
> I don't think there is a correctness problem here, but entering rcu
> *after* deciding to speculatively do the lookup feels backwards.

RCU should protect VMA and file, mm itself won't go anywhere, so this seems ok.

>
> > +     vma = vma_lookup(mm, bp_vaddr);
> > +     if (!vma)
> > +             goto bail;
> > +
> > +     vm_file = data_race(vma->vm_file);
> > +     if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > +             goto bail;
> > +
>
> If vma teardown is allowed to progress and the file got fput'ed...
>
> > +     vm_inode = data_race(vm_file->f_inode);
>
> ... the inode can be NULL, I don't know if that's handled.
>

Yep, the inode pointer value is part of the RB-tree key, so if it's
NULL we just won't find a matching uprobe. Same for any other "garbage"
f_inode value. Importantly, we should never dereference such inode
pointers, at least not until we find a valid uprobe (in which case the
uprobe itself keeps a reference to the inode).
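
To make that concrete, the lookup compares the inode pointer value as
part of the tree key and never dereferences it; a rough sketch of the
comparison (illustrative, not the exact helper from the series):

static int uprobe_cmp_key(const struct inode *inode, loff_t offset,
			  const struct uprobe *u)
{
	/* the pointer value is only compared, never dereferenced */
	if (inode < u->inode)
		return -1;
	if (inode > u->inode)
		return 1;
	if (offset < u->offset)
		return -1;
	if (offset > u->offset)
		return 1;
	return 0;
}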

> More importantly though, per my previous description of
> SLAB_TYPESAFE_BY_RCU, by now the file could have been reallocated and
> the inode you did find is completely unrelated.
>
> I understand the intent is to backpedal from everything should the mm
> seqc change, but the above may happen to matter.

Yes, I think we took that into account. All that we care about is
memory "type safety", i.e., even if struct file's memory is reused,
it's ok: we'll eventually detect the change and discard the wrong
uprobe that we might have looked up by accident (though probably in
most cases we just won't find a uprobe at all).

>
> > +     vm_pgoff = data_race(vma->vm_pgoff);
> > +     vm_start = data_race(vma->vm_start);
> > +
> > +     offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > +     uprobe = find_uprobe_rcu(vm_inode, offset);
> > +     if (!uprobe)
> > +             goto bail;
> > +
> > +     /* now double check that nothing about MM changed */
> > +     if (!mmap_lock_speculation_end(mm, seq))
> > +             goto bail;
>
> This leaks the reference obtained by find_uprobe_rcu().

find_uprobe_rcu() doesn't obtain a reference; the uprobe is
RCU-protected, and if the caller needs a refcount bump it will have to
use try_get_uprobe() (which might fail).
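
Roughly, the intended pattern for a caller that does need to hold the
uprobe beyond the RCU-protected region would be (a sketch; names as
used in this series, details assumed):

rcu_read_lock();
uprobe = find_uprobe_rcu(inode, offset);  /* RCU-protected, no refcount taken */
if (uprobe && !try_get_uprobe(uprobe))    /* may fail if uprobe is going away */
	uprobe = NULL;
rcu_read_unlock();
/* if non-NULL, 'uprobe' now holds its own reference */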

>
> > +
> > +     rcu_read_unlock();
> > +
> > +     /* happy case, we speculated successfully */
> > +     return uprobe;
> > +bail:
> > +     rcu_read_unlock();
> > +     return NULL;
> > +}
>
> Now to some handwaving, here it is:
>
> The core of my concern is that adding more work to down_write on the
> mmap semaphore comes with certain side-effects and plausibly more than a
> sufficient speed up can be achieved without doing it.
>
> An mm-wide mechanism is just incredibly coarse-grained and it may happen
> to perform poorly when faced with a program which likes to mess with its
> address space -- the fast path is going to keep failing and only
> inducing *more* overhead as the code decides to down_read the mmap
> semaphore.
>
> Furthermore there may be work currently synchronized with down_write
> which perhaps can transition to "merely" down_read, but by the time it
> happens this and possibly other consumers expect a change in the
> sequence counter, messing with it.
>
> To my understanding the kernel supports parallel faults with per-vma
> locking. I would find it surprising if the same machinery could not be
> used to sort out uprobe handling above.

per-vma locking is still *locking*. Which means memory sharing between
multiple CPUs, which means limited scalability. Lots of work in this
series went to avoid even refcounting (as I pointed out for
find_uprobe_rcu()) due to the same reason, and so relying on per-VMA
locking is just shifting the bottleneck from mmap_lock to
vma->vm_lock. Worst (and not uncommon) case is the same uprobe in the
same process (and thus vma) being hit on multiple CPUs at the same
time. Whether that's protected by mmap_lock or vma->vm_lock is
immaterial at that point (from scalability standpoint).

>
> I presume a down_read on vma around all the work would also sort out any
> issues concerning stability of the file or inode objects.
>
> Of course single-threaded performance would take a hit due to atomic
> stemming from down/up_read and parallel uprobe lookups on the same vma
> would also get slower, but I don't know if that's a problem for a real
> workload.
>
> I would not have any comments if all speed ups were achieved without
> modifying non-uprobe code.

I'm also not a mm-focused person, so I'll let Suren and others address
mm-specific concerns, but I (hopefully) addressed all the
uprobe-related questions and concerns you had.
Suren Baghdasaryan Aug. 15, 2024, 5:45 p.m. UTC | #5
On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > >
> > > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > > attempting uprobe look up speculatively.
> > > > >
> > > > > We rely on newly added mmap_lock_speculation_{start,end}() helpers to
> > > > > validate that mm_struct stays intact for entire duration of this
> > > > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > > > >
> > > > > This allows to avoid contention on mmap_lock in absolutely majority of
> > > > > cases, nicely improving uprobe/uretprobe scalability.
> > > > >
> > > >
> > > > Here I have to admit to being mostly ignorant about the mm, so bear with
> > > > me. :>
> > > >
> > > > I note the result of find_active_uprobe_speculative is immediately stale
> > > > in face of modifications.
> > > >
> > > > The thing I'm after is that the mmap_lock_speculation business adds
> > > > overhead on archs where a release fence is not a de facto nop and I
> > > > don't believe the commit message justifies it. Definitely a bummer to
> > > > add merely it for uprobes. If there are bigger plans concerning it
> > > > that's a different story of course.
> > > >
> > > > With this in mind I have to ask if instead you could perhaps get away
> > > > with the already present per-vma sequence counter?
> > >
> > > per-vma sequence counter does not implement acquire/release logic, it
> > > relies on vma->vm_lock for synchronization. So if we want to use it,
> > > we would have to add additional memory barriers here. This is likely
> > > possible but as I mentioned before we would need to ensure the
> > > pagefault path does not regress. OTOH mm->mm_lock_seq already halfway
> > > there (it implements acquire/release logic), we just had to ensure
> > > mmap_write_lock() increments mm->mm_lock_seq.
> > >
> > > So, from the release fence overhead POV I think whether we use
> > > mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
> > > here.
> > >
> >
> > Per my previous e-mail I'm not particularly familiar with mm internals,
> > so I'm going to handwave a little bit with my $0,03 concerning multicore
> > in general and if you disagree with it that's your business. For the
> > time being I have no interest in digging into any of this.
> >
> > Before I do, to prevent this thread from being a total waste, here are
> > some remarks concerning the patch with the assumption that the core idea
> > lands.
> >
> > From the commit message:
> > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > attempting uprobe look up speculatively.
> >
> > Just in case I'll note a nit that this paragraph will need to be removed
> > since the patch adding the flag is getting dropped.
>
> Yep, of course, I'll update all that for the next revision (I'll wait
> for non-RFC patches to land first before reposting).
>
> >
> > A non-nit which may or may not end up mattering is that the flag (which
> > *is* set on the filep slab cache) makes things more difficult to
> > validate. Normal RCU usage guarantees that the object itself wont be
> > freed as long you follow the rules. However, the SLAB_TYPESAFE_BY_RCU
> > flag weakens it significantly -- the thing at hand will always be a
> > 'struct file', but it may get reallocated to *another* file from under
> > you. Whether this aspect plays a role here I don't know.
>
> Yes, that's ok and is accounted for. We care about that memory not
> going even from under us (I'm not even sure if it matters that it is
> still a struct file, tbh; I think that shouldn't matter as we are
> prepared to deal with completely garbage values read from struct
> file).

Correct, with SLAB_TYPESAFE_BY_RCU we do need an additional check that
vma->vm_file has not been freed and reused. That's where
mmap_lock_speculation_{start|end} help us. For vma->vm_file to change
from under us, one would have to take mmap_lock for write. If that
happens, mmap_lock_speculation_{start|end} should detect it and
terminate our speculation.

>
> >
> > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > +{
> > > +     const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > +     struct mm_struct *mm = current->mm;
> > > +     struct uprobe *uprobe;
> > > +     struct vm_area_struct *vma;
> > > +     struct file *vm_file;
> > > +     struct inode *vm_inode;
> > > +     unsigned long vm_pgoff, vm_start;
> > > +     int seq;
> > > +     loff_t offset;
> > > +
> > > +     if (!mmap_lock_speculation_start(mm, &seq))
> > > +             return NULL;
> > > +
> > > +     rcu_read_lock();
> > > +
> >
> > I don't think there is a correctness problem here, but entering rcu
> > *after* deciding to speculatively do the lookup feels backwards.
>
> RCU should protect VMA and file, mm itself won't go anywhere, so this seems ok.
>
> >
> > > +     vma = vma_lookup(mm, bp_vaddr);
> > > +     if (!vma)
> > > +             goto bail;
> > > +
> > > +     vm_file = data_race(vma->vm_file);
> > > +     if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > > +             goto bail;
> > > +
> >
> > If vma teardown is allowed to progress and the file got fput'ed...
> >
> > > +     vm_inode = data_race(vm_file->f_inode);
> >
> > ... the inode can be NULL, I don't know if that's handled.
> >
>
> Yep, inode pointer value is part of RB-tree key, so if it's NULL, we
> just won't find a matching uprobe. Same for any other "garbage"
> f_inode value. Importantly, we never should dereference such inode
> pointers, at least until we find a valid uprobe (in which case we keep
> inode reference to it).
>
> > More importantly though, per my previous description of
> > SLAB_TYPESAFE_BY_RCU, by now the file could have been reallocated and
> > the inode you did find is completely unrelated.
> >
> > I understand the intent is to backpedal from everything should the mm
> > seqc change, but the above may happen to matter.
>
> Yes, I think we took that into account. All that we care about is
> memory "type safety", i.e., even if struct file's memory is reused,
> it's ok, we'll eventually detect the change and will discard wrong
> uprobe that we might by accident lookup (though probably in most cases
> we just won't find a uprobe at all).
>
> >
> > > +     vm_pgoff = data_race(vma->vm_pgoff);
> > > +     vm_start = data_race(vma->vm_start);
> > > +
> > > +     offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > +     uprobe = find_uprobe_rcu(vm_inode, offset);
> > > +     if (!uprobe)
> > > +             goto bail;
> > > +
> > > +     /* now double check that nothing about MM changed */
> > > +     if (!mmap_lock_speculation_end(mm, seq))
> > > +             goto bail;
> >
> > This leaks the reference obtained by find_uprobe_rcu().
>
> find_uprobe_rcu() doesn't obtain a reference, uprobe is RCU-protected,
> and if caller need a refcount bump it will have to use
> try_get_uprobe() (which might fail).
>
> >
> > > +
> > > +     rcu_read_unlock();
> > > +
> > > +     /* happy case, we speculated successfully */
> > > +     return uprobe;
> > > +bail:
> > > +     rcu_read_unlock();
> > > +     return NULL;
> > > +}
> >
> > Now to some handwaving, here it is:
> >
> > The core of my concern is that adding more work to down_write on the
> > mmap semaphore comes with certain side-effects and plausibly more than a
> > sufficient speed up can be achieved without doing it.

AFAIK writers of mmap_lock are not considered a fast path. In a sense
yes, we made any writer a bit heavier but OTOH we also made
mm->mm_lock_seq a proper sequence count which allows us to locklessly
check if mmap_lock is write-locked. I think you asked whether there
will be other uses for mmap_lock_speculation_{start|end} and yes. For
example, I am planning to use them for printing /proc/{pid}/maps
without taking mmap_lock (when it's uncontended). If we have VMA seq
counter-based detection it would be better (see below).

> >
> > An mm-wide mechanism is just incredibly coarse-grained and it may happen
> > to perform poorly when faced with a program which likes to mess with its
> > address space -- the fast path is going to keep failing and only
> > inducing *more* overhead as the code decides to down_read the mmap
> > semaphore.
> >
> > Furthermore there may be work currently synchronized with down_write
> > which perhaps can transition to "merely" down_read, but by the time it
> > happens this and possibly other consumers expect a change in the
> > sequence counter, messing with it.
> >
> > To my understanding the kernel supports parallel faults with per-vma
> > locking. I would find it surprising if the same machinery could not be
> > used to sort out uprobe handling above.

From all the above, my understanding of your objection is that
checking mmap_lock during our speculation is too coarse-grained and
you would prefer to use the VMA seq counter to check that the VMA we
are working on is unchanged. I agree, that would be ideal. I had a
quick chat with Jann about this and the conclusion we came to is that
we would need to add an additional smp_wmb() barrier inside
vma_start_write() and a smp_rmb() in the speculation code:

static inline void vma_start_write(struct vm_area_struct *vma)
{
        int mm_lock_seq;

        if (__is_vma_write_locked(vma, &mm_lock_seq))
                return;

        down_write(&vma->vm_lock->lock);
        /*
         * We should use WRITE_ONCE() here because we can have concurrent reads
         * from the early lockless pessimistic check in vma_start_read().
         * We don't really care about the correctness of that early check, but
         * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
         */
        WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
+        smp_wmb();
        up_write(&vma->vm_lock->lock);
}

Note: up_write(&vma->vm_lock->lock) in the vma_start_write() is not
enough because it's one-way permeable (it's a "RELEASE operation") and
later vma->vm_file store (or any other VMA modification) can move
before our vma->vm_lock_seq store.

This makes vma_start_write() heavier but again, it's write-locking, so
should not be considered a fast path.
With this change we can use the code suggested by Andrii in
https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@mail.gmail.com/
with an additional smp_rmb():

rcu_read_lock()
vma = find_vma(...)
if (!vma) /* bail */

vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
/* I think vm_lock has to be acquired first to avoid the race */
if (mm_lock_seq == vm_lock_seq)
        /* bail, vma is write-locked */
... perform uprobe lookup logic based on vma->vm_file->f_inode ...
smp_rmb();
if (vma->vm_lock_seq != vm_lock_seq)
        /* bail, VMA might have changed */

The smp_rmb() is needed so that vma->vm_lock_seq load does not get
reordered and moved up before speculation.

I'm CC'ing Jann since he understands memory barriers way better than
me and will keep me honest.


>
> per-vma locking is still *locking*. Which means memory sharing between
> multiple CPUs, which means limited scalability. Lots of work in this
> series went to avoid even refcounting (as I pointed out for
> find_uprobe_rcu()) due to the same reason, and so relying on per-VMA
> locking is just shifting the bottleneck from mmap_lock to
> vma->vm_lock. Worst (and not uncommon) case is the same uprobe in the
> same process (and thus vma) being hit on multiple CPUs at the same
> time. Whether that's protected by mmap_lock or vma->vm_lock is
> immaterial at that point (from scalability standpoint).
>
> >
> > I presume a down_read on vma around all the work would also sort out any
> > issues concerning stability of the file or inode objects.
> >
> > Of course single-threaded performance would take a hit due to atomic
> > stemming from down/up_read and parallel uprobe lookups on the same vma
> > would also get slower, but I don't know if that's a problem for a real
> > workload.
> >
> > I would not have any comments if all speed ups were achieved without
> > modifying non-uprobe code.
>
> I'm also not a mm-focused person, so I'll let Suren and others address
> mm-specific concerns, but I (hopefully) addressed all the
> uprobe-related questions and concerns you had.
Mateusz Guzik Aug. 15, 2024, 6:24 p.m. UTC | #6
On Thu, Aug 15, 2024 at 10:45:45AM -0700, Suren Baghdasaryan wrote:
> >From all the above, my understanding of your objection is that
> checking mmap_lock during our speculation is too coarse-grained and
> you would prefer to use the VMA seq counter to check that the VMA we
> are working on is unchanged. I agree, that would be ideal. I had a
> quick chat with Jann about this and the conclusion we came to is that
> we would need to add an additional smp_wmb() barrier inside
> vma_start_write() and a smp_rmb() in the speculation code:
> 
> static inline void vma_start_write(struct vm_area_struct *vma)
> {
>         int mm_lock_seq;
> 
>         if (__is_vma_write_locked(vma, &mm_lock_seq))
>                 return;
> 
>         down_write(&vma->vm_lock->lock);
>         /*
>          * We should use WRITE_ONCE() here because we can have concurrent reads
>          * from the early lockless pessimistic check in vma_start_read().
>          * We don't really care about the correctness of that early check, but
>          * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
>          */
>         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> +        smp_wmb();
>         up_write(&vma->vm_lock->lock);
> }
> 
> Note: up_write(&vma->vm_lock->lock) in the vma_start_write() is not
> enough because it's one-way permeable (it's a "RELEASE operation") and
> later vma->vm_file store (or any other VMA modification) can move
> before our vma->vm_lock_seq store.
> 
> This makes vma_start_write() heavier but again, it's write-locking, so
> should not be considered a fast path.
> With this change we can use the code suggested by Andrii in
> https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@mail.gmail.com/
> with an additional smp_rmb():
> 
> rcu_read_lock()
> vma = find_vma(...)
> if (!vma) /* bail */
> 
> vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
> mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
> /* I think vm_lock has to be acquired first to avoid the race */
> if (mm_lock_seq == vm_lock_seq)
>         /* bail, vma is write-locked */
> ... perform uprobe lookup logic based on vma->vm_file->f_inode ...
> smp_rmb();
> if (vma->vm_lock_seq != vm_lock_seq)
>         /* bail, VMA might have changed */
> 
> The smp_rmb() is needed so that vma->vm_lock_seq load does not get
> reordered and moved up before speculation.
> 
> I'm CC'ing Jann since he understands memory barriers way better than
> me and will keep me honest.
> 

So I briefly noted that maybe down_read on the vma would do it, but per
Andrii parallel lookups on the same vma on multiple CPUs are expected,
which whacks that out.

When I initially mentioned per-vma sequence counters I blindly assumed
they worked the usual way. I don't believe any fancy rework here is
warranted especially given that the per-mm counter thing is expected to
have other uses.

However, chances are decent this can still be worked out with per-vma
granularity, all while avoiding any stores on lookup and without
invasive (or complicated) changes. The lockless uprobe code claims to
guarantee only false negatives, and a miss always falls back to the
mmap semaphore lookup. There may be something here; I'm going to chew
on it.

That said, thank you both for writeup so far.
Jann Horn Aug. 15, 2024, 6:58 p.m. UTC | #7
+brauner for "struct file" lifetime

On Thu, Aug 15, 2024 at 7:45 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> > > > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > > >
> > > > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > > > attempting uprobe look up speculatively.

Stupid question: Is this uprobe stuff actually such a hot codepath
that it makes sense to optimize it to be faster than the page fault
path?

(Sidenote: I find it kinda interesting that this is sort of going back
in the direction of the old Speculative Page Faults design.)

> > > > > > We rely on newly added mmap_lock_speculation_{start,end}() helpers to
> > > > > > validate that mm_struct stays intact for entire duration of this
> > > > > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > > > > >
> > > > > > This allows to avoid contention on mmap_lock in absolutely majority of
> > > > > > cases, nicely improving uprobe/uretprobe scalability.
> > > > > >
> > > > >
> > > > > Here I have to admit to being mostly ignorant about the mm, so bear with
> > > > > me. :>
> > > > >
> > > > > I note the result of find_active_uprobe_speculative is immediately stale
> > > > > in face of modifications.
> > > > >
> > > > > The thing I'm after is that the mmap_lock_speculation business adds
> > > > > overhead on archs where a release fence is not a de facto nop and I
> > > > > don't believe the commit message justifies it. Definitely a bummer to
> > > > > add merely it for uprobes. If there are bigger plans concerning it
> > > > > that's a different story of course.
> > > > >
> > > > > With this in mind I have to ask if instead you could perhaps get away
> > > > > with the already present per-vma sequence counter?
> > > >
> > > > per-vma sequence counter does not implement acquire/release logic, it
> > > > relies on vma->vm_lock for synchronization. So if we want to use it,
> > > > we would have to add additional memory barriers here. This is likely
> > > > possible but as I mentioned before we would need to ensure the
> > > > pagefault path does not regress. OTOH mm->mm_lock_seq already halfway
> > > > there (it implements acquire/release logic), we just had to ensure
> > > > mmap_write_lock() increments mm->mm_lock_seq.
> > > >
> > > > So, from the release fence overhead POV I think whether we use
> > > > mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
> > > > here.
> > > >
> > >
> > > Per my previous e-mail I'm not particularly familiar with mm internals,
> > > so I'm going to handwave a little bit with my $0,03 concerning multicore
> > > in general and if you disagree with it that's your business. For the
> > > time being I have no interest in digging into any of this.
> > >
> > > Before I do, to prevent this thread from being a total waste, here are
> > > some remarks concerning the patch with the assumption that the core idea
> > > lands.
> > >
> > > From the commit message:
> > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > attempting uprobe look up speculatively.
> > >
> > > Just in case I'll note a nit that this paragraph will need to be removed
> > > since the patch adding the flag is getting dropped.
> >
> > Yep, of course, I'll update all that for the next revision (I'll wait
> > for non-RFC patches to land first before reposting).
> >
> > >
> > > A non-nit which may or may not end up mattering is that the flag (which
> > > *is* set on the filep slab cache) makes things more difficult to
> > > validate. Normal RCU usage guarantees that the object itself wont be
> > > freed as long you follow the rules. However, the SLAB_TYPESAFE_BY_RCU
> > > flag weakens it significantly -- the thing at hand will always be a
> > > 'struct file', but it may get reallocated to *another* file from under
> > > you. Whether this aspect plays a role here I don't know.
> >
> > Yes, that's ok and is accounted for. We care about that memory not
> > going even from under us (I'm not even sure if it matters that it is
> > still a struct file, tbh; I think that shouldn't matter as we are
> > prepared to deal with completely garbage values read from struct
> > file).
>
> Correct, with SLAB_TYPESAFE_BY_RCU we do need an additional check that
> vma->vm_file has not been freed and reused. That's where
> mmap_lock_speculation_{start|end} helps us. For vma->vm_file to change
> from under us one would have to take mmap_lock for write. If that
> happens mmap_lock_speculation_{start|end} should detect that and
> terminate our speculation.
>
> >
> > >
> > > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > > +{
> > > > +     const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > > +     struct mm_struct *mm = current->mm;
> > > > +     struct uprobe *uprobe;
> > > > +     struct vm_area_struct *vma;
> > > > +     struct file *vm_file;
> > > > +     struct inode *vm_inode;
> > > > +     unsigned long vm_pgoff, vm_start;
> > > > +     int seq;
> > > > +     loff_t offset;
> > > > +
> > > > +     if (!mmap_lock_speculation_start(mm, &seq))
> > > > +             return NULL;
> > > > +
> > > > +     rcu_read_lock();
> > > > +
> > >
> > > I don't think there is a correctness problem here, but entering rcu
> > > *after* deciding to speculatively do the lookup feels backwards.
> >
> > RCU should protect VMA and file, mm itself won't go anywhere, so this seems ok.
> >
> > >
> > > > +     vma = vma_lookup(mm, bp_vaddr);
> > > > +     if (!vma)
> > > > +             goto bail;
> > > > +
> > > > +     vm_file = data_race(vma->vm_file);
> > > > +     if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > > > +             goto bail;
> > > > +
> > >
> > > If vma teardown is allowed to progress and the file got fput'ed...
> > >
> > > > +     vm_inode = data_race(vm_file->f_inode);
> > >
> > > ... the inode can be NULL, I don't know if that's handled.
> > >
> >
> > Yep, inode pointer value is part of RB-tree key, so if it's NULL, we
> > just won't find a matching uprobe. Same for any other "garbage"
> > f_inode value. Importantly, we never should dereference such inode
> > pointers, at least until we find a valid uprobe (in which case we keep
> > inode reference to it).
> >
> > > More importantly though, per my previous description of
> > > SLAB_TYPESAFE_BY_RCU, by now the file could have been reallocated and
> > > the inode you did find is completely unrelated.
> > >
> > > I understand the intent is to backpedal from everything should the mm
> > > seqc change, but the above may happen to matter.
> >
> > Yes, I think we took that into account. All that we care about is
> > memory "type safety", i.e., even if struct file's memory is reused,
> > it's ok, we'll eventually detect the change and will discard wrong
> > uprobe that we might by accident lookup (though probably in most cases
> > we just won't find a uprobe at all).
> >
> > >
> > > > +     vm_pgoff = data_race(vma->vm_pgoff);
> > > > +     vm_start = data_race(vma->vm_start);
> > > > +
> > > > +     offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > > +     uprobe = find_uprobe_rcu(vm_inode, offset);
> > > > +     if (!uprobe)
> > > > +             goto bail;
> > > > +
> > > > +     /* now double check that nothing about MM changed */
> > > > +     if (!mmap_lock_speculation_end(mm, seq))
> > > > +             goto bail;
> > >
> > > This leaks the reference obtained by find_uprobe_rcu().
> >
> > find_uprobe_rcu() doesn't obtain a reference, uprobe is RCU-protected,
> > and if caller need a refcount bump it will have to use
> > try_get_uprobe() (which might fail).
> >
> > >
> > > > +
> > > > +     rcu_read_unlock();
> > > > +
> > > > +     /* happy case, we speculated successfully */
> > > > +     return uprobe;
> > > > +bail:
> > > > +     rcu_read_unlock();
> > > > +     return NULL;
> > > > +}
> > >
> > > Now to some handwaving, here it is:
> > >
> > > The core of my concern is that adding more work to down_write on the
> > > mmap semaphore comes with certain side-effects and plausibly more than a
> > > sufficient speed up can be achieved without doing it.
>
> AFAIK writers of mmap_lock are not considered a fast path. In a sense
> yes, we made any writer a bit heavier but OTOH we also made
> mm->mm_lock_seq a proper sequence count which allows us to locklessly
> check if mmap_lock is write-locked. I think you asked whether there
> will be other uses for mmap_lock_speculation_{start|end} and yes. For
> example, I am planning to use them for printing /proc/{pid}/maps
> without taking mmap_lock (when it's uncontended).

What would be the goal of this - to avoid cacheline bouncing of the
mmap lock between readers? Or to allow mmap_write_lock() to preempt
/proc/{pid}/maps readers who started out uncontended?

Is the idea that you'd change show_map_vma() to first do something
like get_file_active() to increment the file refcount (because
otherwise the dentry can be freed under you and you need the dentry
for path printing), then recheck your sequence count on the mm or vma
(to avoid accessing the dentry of an unrelated file that hasn't become
userspace-visible yet and may not have a proper dentry pointer yet),
then print the file path, drop the file reference again, and in the
end recheck the sequence count again before actually returning the
printed data to userspace?

> If we have VMA seq
> counter-based detection it would be better (see below).
>
> > >
> > > An mm-wide mechanism is just incredibly coarse-grained and it may happen
> > > to perform poorly when faced with a program which likes to mess with its
> > > address space -- the fast path is going to keep failing and only
> > > inducing *more* overhead as the code decides to down_read the mmap
> > > semaphore.
> > >
> > > Furthermore there may be work currently synchronized with down_write
> > > which perhaps can transition to "merely" down_read, but by the time it
> > > happens this and possibly other consumers expect a change in the
> > > sequence counter, messing with it.
> > >
> > > To my understanding the kernel supports parallel faults with per-vma
> > > locking. I would find it surprising if the same machinery could not be
> > > used to sort out uprobe handling above.
>
> From all the above, my understanding of your objection is that
> checking mmap_lock during our speculation is too coarse-grained and
> you would prefer to use the VMA seq counter to check that the VMA we
> are working on is unchanged. I agree, that would be ideal. I had a
> quick chat with Jann about this and the conclusion we came to is that
> we would need to add an additional smp_wmb() barrier inside
> vma_start_write() and a smp_rmb() in the speculation code:
>
> static inline void vma_start_write(struct vm_area_struct *vma)
> {
>         int mm_lock_seq;
>
>         if (__is_vma_write_locked(vma, &mm_lock_seq))
>                 return;
>
>         down_write(&vma->vm_lock->lock);
>         /*
>          * We should use WRITE_ONCE() here because we can have concurrent reads
>          * from the early lockless pessimistic check in vma_start_read().
>          * We don't really care about the correctness of that early check, but
>          * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
>          */
>         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> +        smp_wmb();
>         up_write(&vma->vm_lock->lock);
> }
>
> Note: up_write(&vma->vm_lock->lock) in the vma_start_write() is not
> enough because it's one-way permeable (it's a "RELEASE operation") and
> later vma->vm_file store (or any other VMA modification) can move
> before our vma->vm_lock_seq store.
>
> This makes vma_start_write() heavier but again, it's write-locking, so
> should not be considered a fast path.
> With this change we can use the code suggested by Andrii in
> https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@mail.gmail.com/
> with an additional smp_rmb():
>
> rcu_read_lock()
> vma = find_vma(...)
> if (!vma) /* bail */

And maybe add some comments like:

/*
 * Load the current VMA lock sequence - we will detect if anyone concurrently
 * locks the VMA after this point.
 * Pairs with smp_wmb() in vma_start_write().
 */
> vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
/*
 * Now we just have to detect if the VMA is already locked with its current
 * sequence count.
 *
 * The following load is ordered against the vm_lock_seq load above (using
 * smp_load_acquire() for the load above), and pairs with implicit memory
 * ordering between the mm_lock_seq write in mmap_write_unlock() and the
 * vm_lock_seq write in the next vma_start_write() after that (which can only
 * occur after an mmap_write_lock()).
 */
> mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
> /* I think vm_lock has to be acquired first to avoid the race */
> if (mm_lock_seq == vm_lock_seq)
>         /* bail, vma is write-locked */
> ... perform uprobe lookup logic based on vma->vm_file->f_inode ...
/*
 * Order the speculative accesses above against the following vm_lock_seq
 * recheck.
 */
> smp_rmb();
> if (vma->vm_lock_seq != vm_lock_seq)

(As I said on the other thread: Since this now relies on
vma->vm_lock_seq not wrapping back to the same value for correctness,
I'd like to see vma->vm_lock_seq being at least an "unsigned long", or
even better, an atomic64_t... though I realize we don't currently do
that for seqlocks either.)

>         /* bail, VMA might have changed */
>
> The smp_rmb() is needed so that vma->vm_lock_seq load does not get
> reordered and moved up before speculation.
>
> I'm CC'ing Jann since he understands memory barriers way better than
> me and will keep me honest.
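
(Putting the sketch and the comments above together, the speculative
section would look roughly like this; illustrative only, assuming the
smp_wmb() proposed for vma_start_write() above, with declarations
omitted:)

rcu_read_lock();
vma = find_vma(...);
if (!vma)
	goto bail;

/*
 * Load the current VMA lock sequence - we will detect if anyone
 * concurrently locks the VMA after this point.
 * Pairs with the smp_wmb() added to vma_start_write().
 */
vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
/* Ordered after the vm_lock_seq load by the acquire above. */
mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
if (mm_lock_seq == vm_lock_seq)
	goto bail;	/* vma is write-locked */

/* ... perform uprobe lookup logic based on vma->vm_file->f_inode ... */

/* Order the speculative accesses above against the recheck below. */
smp_rmb();
if (vma->vm_lock_seq != vm_lock_seq)
	goto bail;	/* VMA might have changed */
rcu_read_unlock();
/* success */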
Mateusz Guzik Aug. 15, 2024, 7:07 p.m. UTC | #8
On Thu, Aug 15, 2024 at 8:58 PM Jann Horn <jannh@google.com> wrote:
> Stupid question: Is this uprobe stuff actually such a hot codepath
> that it makes sense to optimize it to be faster than the page fault
> path?
>

That's what I implicitly asked, hoping a down_read on the vma would do
it, but Andrii claims multiple parallel lookups on the same vma are a
problem.

Even so, I suspect something *simple* is doable here which avoids any
writes to vmas and does not need the mm-wide sequence counter. It may
be that the requirements are lax enough that merely observing that some
state is the same before and after the uprobe lookup is sufficient, or
maybe some other hackery is viable without adding fences to
vma_start_write().
Suren Baghdasaryan Aug. 15, 2024, 7:44 p.m. UTC | #9
On Thu, Aug 15, 2024 at 11:58 AM Jann Horn <jannh@google.com> wrote:
>
> +brauner for "struct file" lifetime
>
> On Thu, Aug 15, 2024 at 7:45 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > >
> > > > On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > > > > attempting uprobe look up speculatively.
>
> Stupid question: Is this uprobe stuff actually such a hot codepath
> that it makes sense to optimize it to be faster than the page fault
> path?
>
> (Sidenote: I find it kinda interesting that this is sort of going back
> in the direction of the old Speculative Page Faults design.)
>
> > > > > > > We rely on newly added mmap_lock_speculation_{start,end}() helpers to
> > > > > > > validate that mm_struct stays intact for entire duration of this
> > > > > > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > > > > > >
> > > > > > > This allows to avoid contention on mmap_lock in absolutely majority of
> > > > > > > cases, nicely improving uprobe/uretprobe scalability.
> > > > > > >
> > > > > >
> > > > > > Here I have to admit to being mostly ignorant about the mm, so bear with
> > > > > > me. :>
> > > > > >
> > > > > > I note the result of find_active_uprobe_speculative is immediately stale
> > > > > > in face of modifications.
> > > > > >
> > > > > > The thing I'm after is that the mmap_lock_speculation business adds
> > > > > > overhead on archs where a release fence is not a de facto nop and I
> > > > > > don't believe the commit message justifies it. Definitely a bummer to
> > > > > > add merely it for uprobes. If there are bigger plans concerning it
> > > > > > that's a different story of course.
> > > > > >
> > > > > > With this in mind I have to ask if instead you could perhaps get away
> > > > > > with the already present per-vma sequence counter?
> > > > >
> > > > > per-vma sequence counter does not implement acquire/release logic, it
> > > > > relies on vma->vm_lock for synchronization. So if we want to use it,
> > > > > we would have to add additional memory barriers here. This is likely
> > > > > possible but as I mentioned before we would need to ensure the
> > > > > pagefault path does not regress. OTOH mm->mm_lock_seq already halfway
> > > > > there (it implements acquire/release logic), we just had to ensure
> > > > > mmap_write_lock() increments mm->mm_lock_seq.
> > > > >
> > > > > So, from the release fence overhead POV I think whether we use
> > > > > mm->mm_lock_seq or vma->vm_lock, we would still need a proper fence
> > > > > here.
> > > > >
> > > >
> > > > Per my previous e-mail I'm not particularly familiar with mm internals,
> > > > so I'm going to handwave a little bit with my $0,03 concerning multicore
> > > > in general and if you disagree with it that's your business. For the
> > > > time being I have no interest in digging into any of this.
> > > >
> > > > Before I do, to prevent this thread from being a total waste, here are
> > > > some remarks concerning the patch with the assumption that the core idea
> > > > lands.
> > > >
> > > > From the commit message:
> > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > > attempting uprobe look up speculatively.
> > > >
> > > > Just in case I'll note a nit that this paragraph will need to be removed
> > > > since the patch adding the flag is getting dropped.
> > >
> > > Yep, of course, I'll update all that for the next revision (I'll wait
> > > for non-RFC patches to land first before reposting).
> > >
> > > >
> > > > A non-nit which may or may not end up mattering is that the flag (which
> > > > *is* set on the filep slab cache) makes things more difficult to
> > > > validate. Normal RCU usage guarantees that the object itself wont be
> > > > freed as long you follow the rules. However, the SLAB_TYPESAFE_BY_RCU
> > > > flag weakens it significantly -- the thing at hand will always be a
> > > > 'struct file', but it may get reallocated to *another* file from under
> > > > you. Whether this aspect plays a role here I don't know.
> > >
> > > Yes, that's ok and is accounted for. We care about that memory not
> > > going even from under us (I'm not even sure if it matters that it is
> > > still a struct file, tbh; I think that shouldn't matter as we are
> > > prepared to deal with completely garbage values read from struct
> > > file).
> >
> > Correct, with SLAB_TYPESAFE_BY_RCU we do need an additional check that
> > vma->vm_file has not been freed and reused. That's where
> > mmap_lock_speculation_{start|end} helps us. For vma->vm_file to change
> > from under us one would have to take mmap_lock for write. If that
> > happens mmap_lock_speculation_{start|end} should detect that and
> > terminate our speculation.
> >
> > >
> > > >
> > > > > +static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
> > > > > +{
> > > > > +     const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
> > > > > +     struct mm_struct *mm = current->mm;
> > > > > +     struct uprobe *uprobe;
> > > > > +     struct vm_area_struct *vma;
> > > > > +     struct file *vm_file;
> > > > > +     struct inode *vm_inode;
> > > > > +     unsigned long vm_pgoff, vm_start;
> > > > > +     int seq;
> > > > > +     loff_t offset;
> > > > > +
> > > > > +     if (!mmap_lock_speculation_start(mm, &seq))
> > > > > +             return NULL;
> > > > > +
> > > > > +     rcu_read_lock();
> > > > > +
> > > >
> > > > I don't think there is a correctness problem here, but entering rcu
> > > > *after* deciding to speculatively do the lookup feels backwards.
> > >
> > > RCU should protect VMA and file, mm itself won't go anywhere, so this seems ok.
> > >
> > > >
> > > > > +     vma = vma_lookup(mm, bp_vaddr);
> > > > > +     if (!vma)
> > > > > +             goto bail;
> > > > > +
> > > > > +     vm_file = data_race(vma->vm_file);
> > > > > +     if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
> > > > > +             goto bail;
> > > > > +
> > > >
> > > > If vma teardown is allowed to progress and the file got fput'ed...
> > > >
> > > > > +     vm_inode = data_race(vm_file->f_inode);
> > > >
> > > > ... the inode can be NULL, I don't know if that's handled.
> > > >
> > >
> > > Yep, inode pointer value is part of RB-tree key, so if it's NULL, we
> > > just won't find a matching uprobe. Same for any other "garbage"
> > > f_inode value. Importantly, we never should dereference such inode
> > > pointers, at least until we find a valid uprobe (in which case we keep
> > > inode reference to it).
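
(To illustrate that point: a lookup keyed purely on the raw (inode
pointer, offset) pair never dereferences the inode, so a NULL or reused
pointer simply fails to compare equal. This is a hedged sketch, not the
actual uprobes tree code; names are assumptions.)

static int uprobe_key_cmp(const struct inode *inode, loff_t offset,
			  const struct uprobe *u)
{
	/* Compare the inode strictly by pointer value, never dereference it. */
	if (inode != u->inode)
		return inode < u->inode ? -1 : 1;
	if (offset != u->offset)
		return offset < u->offset ? -1 : 1;
	return 0;
}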
> > >
> > > > More importantly though, per my previous description of
> > > > SLAB_TYPESAFE_BY_RCU, by now the file could have been reallocated and
> > > > the inode you did find is completely unrelated.
> > > >
> > > > I understand the intent is to backpedal from everything should the mm
> > > > seqc change, but the above may happen to matter.
> > >
> > > Yes, I think we took that into account. All that we care about is
> > > memory "type safety", i.e., even if struct file's memory is reused,
> > > it's ok, we'll eventually detect the change and will discard wrong
> > > uprobe that we might by accident lookup (though probably in most cases
> > > we just won't find a uprobe at all).
> > >
> > > >
> > > > > +     vm_pgoff = data_race(vma->vm_pgoff);
> > > > > +     vm_start = data_race(vma->vm_start);
> > > > > +
> > > > > +     offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
> > > > > +     uprobe = find_uprobe_rcu(vm_inode, offset);
> > > > > +     if (!uprobe)
> > > > > +             goto bail;
> > > > > +
> > > > > +     /* now double check that nothing about MM changed */
> > > > > +     if (!mmap_lock_speculation_end(mm, seq))
> > > > > +             goto bail;
> > > >
> > > > This leaks the reference obtained by find_uprobe_rcu().
> > >
> > > find_uprobe_rcu() doesn't obtain a reference, uprobe is RCU-protected,
> > > and if caller need a refcount bump it will have to use
> > > try_get_uprobe() (which might fail).
> > >
> > > >
> > > > > +
> > > > > +     rcu_read_unlock();
> > > > > +
> > > > > +     /* happy case, we speculated successfully */
> > > > > +     return uprobe;
> > > > > +bail:
> > > > > +     rcu_read_unlock();
> > > > > +     return NULL;
> > > > > +}
> > > >
> > > > Now to some handwaving, here it is:
> > > >
> > > > The core of my concern is that adding more work to down_write on the
> > > > mmap semaphore comes with certain side-effects and plausibly more than a
> > > > sufficient speed up can be achieved without doing it.
> >
> > AFAIK writers of mmap_lock are not considered a fast path. In a sense
> > yes, we made any writer a bit heavier but OTOH we also made
> > mm->mm_lock_seq a proper sequence count which allows us to locklessly
> > check if mmap_lock is write-locked. I think you asked whether there
> > will be other uses for mmap_lock_speculation_{start|end} and yes. For
> > example, I am planning to use them for printing /proc/{pid}/maps
> > without taking mmap_lock (when it's uncontended).
>
> What would be the goal of this - to avoid cacheline bouncing of the
> mmap lock between readers? Or to allow mmap_write_lock() to preempt
> /proc/{pid}/maps readers who started out uncontended?

The latter, from my earlier patchset, which I need to refine
(https://lore.kernel.org/all/20240123231014.3801041-3-surenb@google.com/):

This change is designed to reduce mmap_lock contention and prevent a
process reading /proc/pid/maps files (often a low priority task, such as
monitoring/data collection services) from blocking address space updates.

>
> Is the idea that you'd change show_map_vma() to first do something
> like get_file_active() to increment the file refcount (because
> otherwise the dentry can be freed under you and you need the dentry
> for path printing), then recheck your sequence count on the mm or vma
> (to avoid accessing the dentry of an unrelated file that hasn't become
> userspace-visible yet and may not have a proper dentry pointer yet),
> then print the file path, drop the file reference again, and in the
> end recheck the sequence count again before actually returning the
> printed data to userspace?

Yeah, you can see the details in the link I posted above; see the
get_vma_snapshot() function.

>
> > If we have VMA seq
> > counter-based detection it would be better (see below).
> >
> > > >
> > > > An mm-wide mechanism is just incredibly coarse-grained and it may happen
> > > > to perform poorly when faced with a program which likes to mess with its
> > > > address space -- the fast path is going to keep failing and only
> > > > inducing *more* overhead as the code decides to down_read the mmap
> > > > semaphore.
> > > >
> > > > Furthermore there may be work currently synchronized with down_write
> > > > which perhaps can transition to "merely" down_read, but by the time it
> > > > happens this and possibly other consumers expect a change in the
> > > > sequence counter, messing with it.
> > > >
> > > > To my understanding the kernel supports parallel faults with per-vma
> > > > locking. I would find it surprising if the same machinery could not be
> > > > used to sort out uprobe handling above.
> >
> > From all the above, my understanding of your objection is that
> > checking mmap_lock during our speculation is too coarse-grained and
> > you would prefer to use the VMA seq counter to check that the VMA we
> > are working on is unchanged. I agree, that would be ideal. I had a
> > quick chat with Jann about this and the conclusion we came to is that
> > we would need to add an additional smp_wmb() barrier inside
> > vma_start_write() and a smp_rmb() in the speculation code:
> >
> > static inline void vma_start_write(struct vm_area_struct *vma)
> > {
> >         int mm_lock_seq;
> >
> >         if (__is_vma_write_locked(vma, &mm_lock_seq))
> >                 return;
> >
> >         down_write(&vma->vm_lock->lock);
> >         /*
> >          * We should use WRITE_ONCE() here because we can have concurrent reads
> >          * from the early lockless pessimistic check in vma_start_read().
> >          * We don't really care about the correctness of that early check, but
> >          * we should use WRITE_ONCE() for cleanliness and to keep KCSAN happy.
> >          */
> >         WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
> > +        smp_wmb();
> >         up_write(&vma->vm_lock->lock);
> > }
> >
> > Note: up_write(&vma->vm_lock->lock) in the vma_start_write() is not
> > enough because it's one-way permeable (it's a "RELEASE operation") and
> > later vma->vm_file store (or any other VMA modification) can move
> > before our vma->vm_lock_seq store.
> >
> > This makes vma_start_write() heavier but again, it's write-locking, so
> > should not be considered a fast path.
> > With this change we can use the code suggested by Andrii in
> > https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@mail.gmail.com/
> > with an additional smp_rmb():
> >
> > rcu_read_lock()
> > vma = find_vma(...)
> > if (!vma) /* bail */
>
> And maybe add some comments like:
>
> /*
>  * Load the current VMA lock sequence - we will detect if anyone concurrently
>  * locks the VMA after this point.
>  * Pairs with smp_wmb() in vma_start_write().
>  */
> > vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
> /*
>  * Now we just have to detect if the VMA is already locked with its current
>  * sequence count.
>  *
>  * The following load is ordered against the vm_lock_seq load above (using
>  * smp_load_acquire() for the load above), and pairs with implicit memory
>  * ordering between the mm_lock_seq write in mmap_write_unlock() and the
>  * vm_lock_seq write in the next vma_start_write() after that (which can only
>  * occur after an mmap_write_lock()).
>  */
> > mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
> > /* I think vm_lock has to be acquired first to avoid the race */
> > if (mm_lock_seq == vm_lock_seq)
> >         /* bail, vma is write-locked */
> > ... perform uprobe lookup logic based on vma->vm_file->f_inode ...
> /*
>  * Order the speculative accesses above against the following vm_lock_seq
>  * recheck.
>  */
> > smp_rmb();
> > if (vma->vm_lock_seq != vm_lock_seq)
>
> (As I said on the other thread: Since this now relies on
> vma->vm_lock_seq not wrapping back to the same value for correctness,
> I'd like to see vma->vm_lock_seq being at least an "unsigned long", or
> even better, an atomic64_t... though I realize we don't currently do
> that for seqlocks either.)
>
> >         /* bail, VMA might have changed */
> >
> > The smp_rmb() is needed so that vma->vm_lock_seq load does not get
> > reordered and moved up before speculation.
> >
> > I'm CC'ing Jann since he understands memory barriers way better than
> > me and will keep me honest.
Andrii Nakryiko Aug. 15, 2024, 8:17 p.m. UTC | #10
On Thu, Aug 15, 2024 at 11:58 AM Jann Horn <jannh@google.com> wrote:
>
> +brauner for "struct file" lifetime
>
> On Thu, Aug 15, 2024 at 7:45 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Thu, Aug 15, 2024 at 9:47 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Thu, Aug 15, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > >
> > > > On Tue, Aug 13, 2024 at 08:36:03AM -0700, Suren Baghdasaryan wrote:
> > > > > On Mon, Aug 12, 2024 at 11:18 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 12, 2024 at 09:29:17PM -0700, Andrii Nakryiko wrote:
> > > > > > > Now that files_cachep is SLAB_TYPESAFE_BY_RCU, we can safely access
> > > > > > > vma->vm_file->f_inode lockless only under rcu_read_lock() protection,
> > > > > > > attempting uprobe look up speculatively.
>
> Stupid question: Is this uprobe stuff actually such a hot codepath
> that it makes sense to optimize it to be faster than the page fault
> path?

Not a stupid question, but yes: generally speaking, uprobe performance
is critical for a bunch of tracing use cases. Having independent
threads implicitly contend with each other purely because of an
internal uprobe implementation detail (conceptually there should be no
dependency between triggering uprobes from multiple parallel threads)
is a big surprise to users, and it affects production use cases beyond
just the uprobe-handling BPF logic overhead ("useful overhead") that
they expect.

>
> (Sidenote: I find it kinda interesting that this is sort of going back
> in the direction of the old Speculative Page Faults design.)
>
> > > > > > > We rely on newly added mmap_lock_speculation_{start,end}() helpers to
> > > > > > > validate that mm_struct stays intact for entire duration of this
> > > > > > > speculation. If not, we fall back to mmap_lock-protected lookup.
> > > > > > >
> > > > > > > This allows to avoid contention on mmap_lock in absolutely majority of
> > > > > > > cases, nicely improving uprobe/uretprobe scalability.
> > > > > > >
> > > > > >

[...]

> > Note: up_write(&vma->vm_lock->lock) in the vma_start_write() is not
> > enough because it's one-way permeable (it's a "RELEASE operation") and
> > later vma->vm_file store (or any other VMA modification) can move
> > before our vma->vm_lock_seq store.
> >
> > This makes vma_start_write() heavier but again, it's write-locking, so
> > should not be considered a fast path.
> > With this change we can use the code suggested by Andrii in
> > https://lore.kernel.org/all/CAEf4BzZeLg0WsYw2M7KFy0+APrPaPVBY7FbawB9vjcA2+6k69Q@mail.gmail.com/
> > with an additional smp_rmb():
> >
> > rcu_read_lock()
> > vma = find_vma(...)
> > if (!vma) /* bail */
>
> And maybe add some comments like:
>
> /*
>  * Load the current VMA lock sequence - we will detect if anyone concurrently
>  * locks the VMA after this point.
>  * Pairs with smp_wmb() in vma_start_write().
>  */
> > vm_lock_seq = smp_load_acquire(&vma->vm_lock_seq);
> /*
>  * Now we just have to detect if the VMA is already locked with its current
>  * sequence count.
>  *
>  * The following load is ordered against the vm_lock_seq load above (using
>  * smp_load_acquire() for the load above), and pairs with implicit memory
>  * ordering between the mm_lock_seq write in mmap_write_unlock() and the
>  * vm_lock_seq write in the next vma_start_write() after that (which can only
>  * occur after an mmap_write_lock()).
>  */
> > mm_lock_seq = smp_load_acquire(&vma->mm->mm_lock_seq);
> > /* I think vm_lock has to be acquired first to avoid the race */
> > if (mm_lock_seq == vm_lock_seq)
> >         /* bail, vma is write-locked */
> > ... perform uprobe lookup logic based on vma->vm_file->f_inode ...
> /*
>  * Order the speculative accesses above against the following vm_lock_seq
>  * recheck.
>  */
> > smp_rmb();
> > if (vma->vm_lock_seq != vm_lock_seq)
>

thanks, will incorporate these comments into the next revision

> (As I said on the other thread: Since this now relies on
> vma->vm_lock_seq not wrapping back to the same value for correctness,
> I'd like to see vma->vm_lock_seq being at least an "unsigned long", or
> even better, an atomic64_t... though I realize we don't currently do
> that for seqlocks either.)
>
> >         /* bail, VMA might have changed */
> >
> > The smp_rmb() is needed so that vma->vm_lock_seq load does not get
> > reordered and moved up before speculation.
> >
> > I'm CC'ing Jann since he understands memory barriers way better than
> > me and will keep me honest.
diff mbox series

Patch

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 713824c8ca77..12f3edf2ffb1 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2286,6 +2286,53 @@  static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 	return is_trap_insn(&opcode);
 }
 
+static struct uprobe *find_active_uprobe_speculative(unsigned long bp_vaddr)
+{
+	const vm_flags_t flags = VM_HUGETLB | VM_MAYEXEC | VM_MAYSHARE;
+	struct mm_struct *mm = current->mm;
+	struct uprobe *uprobe;
+	struct vm_area_struct *vma;
+	struct file *vm_file;
+	struct inode *vm_inode;
+	unsigned long vm_pgoff, vm_start;
+	int seq;
+	loff_t offset;
+
+	if (!mmap_lock_speculation_start(mm, &seq))
+		return NULL;
+
+	rcu_read_lock();
+
+	vma = vma_lookup(mm, bp_vaddr);
+	if (!vma)
+		goto bail;
+
+	vm_file = data_race(vma->vm_file);
+	if (!vm_file || (vma->vm_flags & flags) != VM_MAYEXEC)
+		goto bail;
+
+	vm_inode = data_race(vm_file->f_inode);
+	vm_pgoff = data_race(vma->vm_pgoff);
+	vm_start = data_race(vma->vm_start);
+
+	offset = (loff_t)(vm_pgoff << PAGE_SHIFT) + (bp_vaddr - vm_start);
+	uprobe = find_uprobe_rcu(vm_inode, offset);
+	if (!uprobe)
+		goto bail;
+
+	/* now double check that nothing about MM changed */
+	if (!mmap_lock_speculation_end(mm, seq))
+		goto bail;
+
+	rcu_read_unlock();
+
+	/* happy case, we speculated successfully */
+	return uprobe;
+bail:
+	rcu_read_unlock();
+	return NULL;
+}
+
 /* assumes being inside RCU protected region */
 static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swbp)
 {
@@ -2293,6 +2340,10 @@  static struct uprobe *find_active_uprobe_rcu(unsigned long bp_vaddr, int *is_swb
 	struct uprobe *uprobe = NULL;
 	struct vm_area_struct *vma;
 
+	uprobe = find_active_uprobe_speculative(bp_vaddr);
+	if (uprobe)
+		return uprobe;
+
 	mmap_read_lock(mm);
 	vma = vma_lookup(mm, bp_vaddr);
 	if (vma) {