From patchwork Thu Aug 29 18:37:33 2024
X-Patchwork-Submitter: Andrii Nakryiko
From: Andrii Nakryiko <andrii@kernel.org>
To: linux-trace-kernel@vger.kernel.org, peterz@infradead.org, oleg@redhat.com
Cc: rostedt@goodmis.org, mhiramat@kernel.org, bpf@vger.kernel.org,
    linux-kernel@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org,
    willy@infradead.org, surenb@google.com, akpm@linux-foundation.org,
    linux-mm@kvack.org, Andrii Nakryiko
Subject: [PATCH v4 0/8] uprobes: RCU-protected hot path optimizations
Date: Thu, 29 Aug 2024 11:37:33 -0700
Message-ID: <20240829183741.3331213-1-andrii@kernel.org>
This patch set is heavily inspired by Peter Zijlstra's uprobe optimization
patches ([0]) and continues that work, albeit trying to keep complexity to the
minimum and attempting to reuse existing primitives as much as possible. The
goal here is to optimize the common uprobe triggering hot path, while keeping
the rest of the locking mostly intact.

I've added uprobe_unregister_sync() into the error handling code path inside
uprobe_unregister(). This is necessitated by Oleg Nesterov's recent
refactorings ([1]).

Except for the refcounting change patch (which I strongly believe is a good
improvement, letting us forget about the quasi-refcounting scheme of the
uprobe->consumers list), the rest of the changes are similar to Peter's
initial changes in [0]. The main differences are:

  - no special RCU protection for mmap and fork handling; we just stick to
    refcounts there, as those are infrequent, not performance-sensitive code
    paths, and are complex enough to benefit from proper locking;
  - the above means we don't need any custom SRCU additions to handle the
    forking code path;
  - I handled the UPROBE_HANDLER_REMOVE problem in handler_chain()
    differently, again leveraging the existing locking scheme;
  - I kept refcount usage for uretprobes and single-stepped uprobes; I plan
    to address that in separate follow-up patches. The plan is to avoid
    task_work with a lockless atomic xchg()/cmpxchg()-based solution;
  - finally, I dutifully used SRCU throughout all the changes, and only the
    last patch switches SRCU to RCU Tasks Trace and demonstrates significant
    performance and scalability gains from this.

The changes in this patch set were tested using BPF selftests and the
uprobe-stress ([2]) tool.

Now, for the benchmarking results. I used the following script (which
utilizes the BPF selftests-based bench tool). The CPU was an 80-core Intel
Xeon Gold 6138 @ 2.00GHz running a kernel with a production-like config.
I minimized background noise by stopping any service I could identify and
stop, so results are pretty stable and overall variability is small.

Benchmark script:

#!/bin/bash

set -eufo pipefail

for i in uprobe-nop uretprobe-nop; do
    for p in 1 2 4 8 16 32 64; do
        summary=$(sudo ./bench -w3 -d5 -p$p -a trig-$i | tail -n1)
        total=$(echo "$summary" | cut -d'(' -f1 | cut -d' ' -f3-)
        percpu=$(echo "$summary" | cut -d'(' -f2 | cut -d')' -f1 | cut -d'/' -f1)
        printf "%-15s (%2d cpus): %s (%s/s/cpu)\n" $i $p "$total" "$percpu"
    done
    echo
done

With all the lock-avoiding changes done in this patch set, we get a pretty
decent improvement in performance and scalability of uprobes with the number
of CPUs, even though we are still nowhere near linear scalability. This is
due to the remaining mmap_lock, which is currently taken to resolve the
interrupt address to inode+offset and then to a uprobe instance. And, of
course, uretprobes still need similar RCU protection to avoid refcounting in
the hot path, which will be addressed in follow-up patches.
BASELINE (on top of Oleg's clean up patches)
============================================
uprobe-nop      ( 1 cpus):    3.032 ± 0.023M/s  (  3.032M/s/cpu)
uprobe-nop      ( 2 cpus):    3.452 ± 0.005M/s  (  1.726M/s/cpu)
uprobe-nop      ( 4 cpus):    3.663 ± 0.005M/s  (  0.916M/s/cpu)
uprobe-nop      ( 8 cpus):    3.718 ± 0.038M/s  (  0.465M/s/cpu)
uprobe-nop      (16 cpus):    3.344 ± 0.008M/s  (  0.209M/s/cpu)
uprobe-nop      (32 cpus):    2.288 ± 0.021M/s  (  0.071M/s/cpu)
uprobe-nop      (64 cpus):    3.205 ± 0.004M/s  (  0.050M/s/cpu)
uretprobe-nop   ( 1 cpus):    1.979 ± 0.005M/s  (  1.979M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.361 ± 0.005M/s  (  1.180M/s/cpu)
uretprobe-nop   ( 4 cpus):    2.309 ± 0.002M/s  (  0.577M/s/cpu)
uretprobe-nop   ( 8 cpus):    2.253 ± 0.001M/s  (  0.282M/s/cpu)
uretprobe-nop   (16 cpus):    2.007 ± 0.000M/s  (  0.125M/s/cpu)
uretprobe-nop   (32 cpus):    1.624 ± 0.003M/s  (  0.051M/s/cpu)
uretprobe-nop   (64 cpus):    2.149 ± 0.001M/s  (  0.034M/s/cpu)

Up to second-to-last patch (i.e., SRCU-based optimizations)
===========================================================
uprobe-nop      ( 1 cpus):    3.276 ± 0.005M/s  (  3.276M/s/cpu)
uprobe-nop      ( 2 cpus):    4.125 ± 0.002M/s  (  2.063M/s/cpu)
uprobe-nop      ( 4 cpus):    7.713 ± 0.002M/s  (  1.928M/s/cpu)
uprobe-nop      ( 8 cpus):    8.097 ± 0.006M/s  (  1.012M/s/cpu)
uprobe-nop      (16 cpus):    6.501 ± 0.056M/s  (  0.406M/s/cpu)
uprobe-nop      (32 cpus):    4.398 ± 0.084M/s  (  0.137M/s/cpu)
uprobe-nop      (64 cpus):    6.452 ± 0.000M/s  (  0.101M/s/cpu)
uretprobe-nop   ( 1 cpus):    2.055 ± 0.001M/s  (  2.055M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.677 ± 0.000M/s  (  1.339M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.561 ± 0.003M/s  (  1.140M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.291 ± 0.002M/s  (  0.661M/s/cpu)
uretprobe-nop   (16 cpus):    5.065 ± 0.019M/s  (  0.317M/s/cpu)
uretprobe-nop   (32 cpus):    3.622 ± 0.003M/s  (  0.113M/s/cpu)
uretprobe-nop   (64 cpus):    3.723 ± 0.002M/s  (  0.058M/s/cpu)

RCU Tasks Trace
===============
uprobe-nop      ( 1 cpus):    3.396 ± 0.002M/s  (  3.396M/s/cpu)
uprobe-nop      ( 2 cpus):    4.271 ± 0.006M/s  (  2.135M/s/cpu)
uprobe-nop      ( 4 cpus):    8.499 ± 0.015M/s  (  2.125M/s/cpu)
uprobe-nop      ( 8 cpus):   10.355 ± 0.028M/s  (  1.294M/s/cpu)
uprobe-nop      (16 cpus):    7.615 ± 0.099M/s  (  0.476M/s/cpu)
uprobe-nop      (32 cpus):    4.430 ± 0.007M/s  (  0.138M/s/cpu)
uprobe-nop      (64 cpus):    6.887 ± 0.020M/s  (  0.108M/s/cpu)
uretprobe-nop   ( 1 cpus):    2.174 ± 0.001M/s  (  2.174M/s/cpu)
uretprobe-nop   ( 2 cpus):    2.853 ± 0.001M/s  (  1.426M/s/cpu)
uretprobe-nop   ( 4 cpus):    4.913 ± 0.002M/s  (  1.228M/s/cpu)
uretprobe-nop   ( 8 cpus):    5.883 ± 0.002M/s  (  0.735M/s/cpu)
uretprobe-nop   (16 cpus):    5.147 ± 0.001M/s  (  0.322M/s/cpu)
uretprobe-nop   (32 cpus):    3.738 ± 0.008M/s  (  0.117M/s/cpu)
uretprobe-nop   (64 cpus):    4.397 ± 0.002M/s  (  0.069M/s/cpu)

For baseline vs SRCU, peak throughput increased from 3.7 M/s (million uprobe
triggerings per second) up to about 8 M/s. For uretprobes it's a bit more
modest, with a bump from 2.4 M/s to 5 M/s.

For SRCU vs RCU Tasks Trace, peak throughput for uprobes increases further
from 8 M/s to 10.3 M/s (+28%!), and for uretprobes from 5.3 M/s to 5.8 M/s
(+11%), as we have more work to do on the uretprobes side.

Even single-threaded (no contention) performance is slightly better: 3.276
M/s to 3.396 M/s (+3.7%) for uprobes, and 2.055 M/s to 2.174 M/s (+5.8%) for
uretprobes.

  [0] https://lore.kernel.org/linux-trace-kernel/20240711110235.098009979@infradead.org/
  [1] https://lore.kernel.org/linux-trace-kernel/20240729134444.GA12293@redhat.com/
  [2] https://github.com/libbpf/libbpf-bootstrap/tree/uprobe-stress

v3->v4:
  - added back consumer_rwsem into consumer_del(), which was accidentally
    omitted earlier (Jiri);
  - left out RFC patches; we can continue discussion on the v3 patch set, if
    necessary;
v2->v3:
  - undid the rcu and rb_node fields colocation, which was causing crashes
    (Oleg);
  - ensure synchronize_srcu() on registration failure in patch #4 (Oleg);
v1->v2:
  - added back missed kfree() in patch #1 (Oleg);
  - forgot the rest, but there were a few small things here and there.
Andrii Nakryiko (6):
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: protected uprobe lifetime with SRCU
  uprobes: get rid of enum uprobe_filter_ctx in uprobe filter callbacks
  uprobes: travers uprobe's consumer list locklessly under SRCU protection
  uprobes: perform lockless SRCU-protected uprobes_tree lookup
  uprobes: switch to RCU Tasks Trace flavor for better performance

Peter Zijlstra (2):
  perf/uprobe: split uprobe_unregister()
  rbtree: provide rb_find_rcu() / rb_find_add_rcu()

 include/linux/rbtree.h                          |  67 +++
 include/linux/uprobes.h                         |  20 +-
 kernel/events/uprobes.c                         | 391 +++++++++++-------
 kernel/trace/bpf_trace.c                        |   8 +-
 kernel/trace/trace_uprobe.c                     |  15 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c     |   3 +-
 6 files changed, 321 insertions(+), 183 deletions(-)

Reviewed-by: Oleg Nesterov