[v2,04/12] uprobes: revamp uprobe refcounting and lifetime management

Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.

Right now, there are a few possible "owners" of uprobe refcount:
  - uprobes_tree RB tree assumes one refcount when uprobe is registered
    and added to the lookup tree;
  - while uprobe is triggered and kernel is handling it in the breakpoint
    handler code, temporary refcount bump is done to keep uprobe from
    being freed;
  - if we have uretprobe requested on a given struct uprobe instance, we
    take another refcount to keep uprobe alive until user space code
    returns from the function and triggers return handler.

The uprobes_tree's extra refcount of 1 is problematic and inconvenient.
Because of it, we have extra retry logic in uprobe_register(), and we
have extra logic in __uprobe_unregister(), which checks that uprobe
has no more consumers, and if that's the case, removes struct uprobe
from uprobes_tree (through delete_uprobe(), which takes writer lock on
uprobes_tree), decrementing refcount after that. The latter is the
source of an unfortunate race with uprobe_register(), necessitating
retries.

All of the above is a complication that makes adding batched uprobe
registration/unregistration APIs hard, and generally makes following the
logic harder.

This patch changes the refcounting scheme in such a way as to not have
uprobes_tree keeping an extra refcount for struct uprobe. Instead,
uprobe_consumer assumes this extra refcount, which is dropped when the
consumer is unregistered. Other than that, all the active users of
uprobe (entry and return uprobe handling code) keep exactly the same
refcounting approach.

With the above setup, once uprobe's refcount drops to zero, we need to
make sure that uprobe's "destructor" removes uprobe from uprobes_tree.
This, though, races with uprobe entry handling code in handle_swbp(),
whose find_active_uprobe()->find_uprobe() lookup can race with uprobe
being destroyed after refcount drops to zero (e.g., due to
uprobe_consumer unregistering). This is because find_active_uprobe()
bumps refcount without knowing for sure that uprobe's refcount is
already positive (and it has to be this way, there is no way around
that setup).

One way to solve this, attempted initially, is the
atomic_inc_not_zero() approach, turning get_uprobe() into
try_get_uprobe(), which can fail to bump refcount if uprobe is already
destined to be destroyed. This, unfortunately, turns out to be rather
expensive due to the underlying cmpxchg() operation in
atomic_inc_not_zero() and scales rather poorly with an increasing
number of parallel threads triggering uprobes.
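
For illustration, the inc-not-zero pattern described above can be
sketched in userspace C11. This is only an analogue, not the kernel
implementation: struct obj, try_get() and put() are hypothetical names,
and atomic_compare_exchange_weak() stands in for the kernel's cmpxchg().
The retry loop around the compare-and-exchange is the part that gets
expensive when many threads contend on the same counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct uprobe; only the refcount matters. */
struct obj {
	atomic_int ref;
};

/*
 * Userspace analogue of atomic_inc_not_zero(): take a reference only if
 * the count is still positive. Returns false once the count is zero.
 */
static bool try_get(struct obj *o)
{
	int old = atomic_load(&o->ref);

	while (old != 0) {
		/* On failure 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->ref, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this call released the last one. */
static bool put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->ref, 1);

	assert(old > 0); /* dropping a reference we do not hold is a bug */
	return old == 1;
}
```

Note that every caller of try_get() has to handle failure, which is one
of the things the scheme below avoids.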

So, we devise a refcounting scheme that doesn't require cmpxchg(),
instead relying only on atomic additions, which scale better and are
faster. While the solution has a bit of a trick to it, all the logic is
nicely compartmentalized in __get_uprobe() and put_uprobe() helpers and
doesn't leak outside of those low-level helpers.

We, effectively, structure uprobe's destruction (i.e., put_uprobe() logic)
in such a way that we support "resurrecting" uprobe by bumping its
refcount from zero back to one, and pretending like it never dropped to
zero in the first place. This is done in a race-free way under
exclusive writer uprobes_treelock. Crucially, we take lock only once
refcount drops to zero. If we had to take lock before decrementing
refcount, the approach would be prohibitively expensive.

Under exclusive writer lock, we double-check that refcount
didn't change and is still zero. If it is, we proceed with destruction,
because at that point we have a guarantee that find_active_uprobe()
can't successfully look up this uprobe instance, as it's going to be
removed in destructor under writer lock. If, on the other hand,
find_active_uprobe() managed to bump refcount from zero to one in
between put_uprobe()'s atomic_dec_and_test(&uprobe->ref) and
write_lock(&uprobes_treelock), we'll deterministically detect this with
an extra atomic_read(&uprobe->ref) check, and if refcount is no longer
zero, we pretend like atomic_dec_and_test() never returned true. There
is no resource freeing or any other irreversible action taken up till
this point, so we just exit early.
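
A minimal userspace sketch of this "decrement first, re-check under the
lock" idea follows. It is an assumption-laden illustration, not the
patch's code: struct obj, find_and_get(), put_fast() and put_slow() are
hypothetical names, in_tree stands in for membership in uprobes_tree,
and a trivial spinlock replaces the reader/writer uprobes_treelock. It
deliberately does not handle two CPUs both dropping the count to zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct obj {
	atomic_int ref;
	bool in_tree; /* stands in for membership in uprobes_tree */
};

/* Simplification: one exclusive spinlock instead of an rwlock. */
static atomic_flag tree_lock = ATOMIC_FLAG_INIT;

static void tree_lock_acquire(void)
{
	while (atomic_flag_test_and_set(&tree_lock))
		; /* spin */
}

static void tree_lock_release(void)
{
	atomic_flag_clear(&tree_lock);
}

/* Lookup path: unconditional increment, valid only under the lock. */
static struct obj *find_and_get(struct obj *o)
{
	struct obj *found = NULL;

	tree_lock_acquire();
	if (o->in_tree) {
		atomic_fetch_add(&o->ref, 1); /* may "resurrect" 0 -> 1 */
		found = o;
	}
	tree_lock_release();
	return found;
}

/* Step 1 of put: lock-free decrement. True means count reached zero. */
static bool put_fast(struct obj *o)
{
	int old = atomic_fetch_sub(&o->ref, 1);

	assert(old > 0);
	return old == 1;
}

/* Step 2 of put: re-check under the lock. True means caller may free. */
static bool put_slow(struct obj *o)
{
	bool destroy;

	tree_lock_acquire();
	destroy = atomic_load(&o->ref) == 0;
	if (destroy)
		o->in_tree = false; /* lookups can no longer find it */
	tree_lock_release();
	return destroy;
}
```

A caller would invoke put_slow() only when put_fast() returns true, so
the lock is taken only on the rare drop-to-zero path.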

One tricky part in the above is actually two CPUs racing and dropping
refcnt to zero, and then attempting to free resources. This can happen
as follows:
  - CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
  - before CPU #0 grabs a lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
    which point it decides that it needs to free uprobe as well.

At this point both CPU #0 and CPU #1 will believe they need to destroy
uprobe, which is obviously wrong. To prevent this situation, we augment
refcount with an epoch counter, which is always incremented by 1 on
either get or put operation. This allows those two CPUs above to
disambiguate who should actually free uprobe (it's CPU #1, because it
has the up-to-date epoch). See comments in the code and note the
specific values of UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants.
Keep in mind that a single atomic64_t is actually two
sort-of-independent 32-bit counters that are incremented/decremented
with a single atomic64_add_return() operation. Note also a small and
extremely rare (and thus having no effect on performance) need to clear
the highest bit every 2 billion get/put operations to prevent high
32-bit counter from "bleeding over" into lower 32-bit counter.
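
The epoch idea can be sketched in userspace C11 as follows. The layout
(epoch in the high 32 bits, refcount in the low 32 bits) and the
REF_GET/REF_PUT values are assumptions chosen for this illustration and
are not necessarily the patch's UPROBE_REFCNT_GET/UPROBE_REFCNT_PUT
values; ref_get(), ref_put() and may_destroy() are hypothetical helpers,
and the rare highest-bit fix-up is omitted. C11 atomic_fetch_add()
stands in for the kernel's atomic add-and-return operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: epoch in high 32 bits, refcount in low 32 bits. */
#define REF_GET ((UINT64_C(1) << 32) + 1) /* epoch + 1, refcount + 1 */
/*
 * Adding 0xffffffff to a non-zero low half decrements it and carries 1
 * into the high half, so a single add yields epoch + 1, refcount - 1.
 */
#define REF_PUT ((UINT64_C(1) << 32) - 1)

static uint32_t refcnt_of(uint64_t v)
{
	return (uint32_t)v;
}

/* Take a reference; returns the post-operation snapshot. */
static uint64_t ref_get(_Atomic uint64_t *ref)
{
	return atomic_fetch_add(ref, REF_GET) + REF_GET;
}

/* Drop a reference; returns the post-operation snapshot. */
static uint64_t ref_put(_Atomic uint64_t *ref)
{
	uint64_t old = atomic_fetch_add(ref, REF_PUT);

	assert(refcnt_of(old) > 0); /* must hold a reference to drop it */
	return old + REF_PUT;
}

/*
 * Meant to be called under the tree lock by a putter whose snapshot
 * showed refcount zero. Only the putter whose snapshot still matches
 * the current value (same epoch, still zero) is allowed to destroy.
 */
static bool may_destroy(_Atomic uint64_t *ref, uint64_t snapshot)
{
	return refcnt_of(snapshot) == 0 && atomic_load(ref) == snapshot;
}
```

Replaying the CPU #0 / CPU #1 interleaving above against this sketch,
both puts observe refcount zero, but their snapshots carry different
epochs, so only the later one passes may_destroy().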

Another aspect of this race is that the winning CPU might, at least
theoretically, be so quick that it will free uprobe memory before the
losing CPU gets a chance to discover that it lost. To prevent this, we
protect uprobe lifetime with RCU and delay freeing accordingly. We
can't use rcu_read_lock() + rcu_read_unlock(), because we need to take
locks inside the RCU critical section. Luckily, we have RCU Tasks Trace
flavor, which supports locking and sleeping. It is already used by BPF
subsystem for sleepable BPF programs (including sleepable BPF uprobe
programs), and is optimized for reader-dominated workflows. It fits
perfectly and doesn't seem to introduce any significant slowdowns in
uprobe hot path.

All of the above trickery aside, we end up with nice semantics for get
and put operations, where get always succeeds and put handles all the
races properly and transparently to the caller.

To justify this somewhat unorthodox refcounting approach: under a
uprobe triggering micro-benchmark (using BPF selftests' bench tool) with
8 triggering threads, the atomic_inc_not_zero() approach was producing
about 3.3 million/sec total uprobe triggerings across all threads, while
the final atomic64_add_return()-based approach managed to get
3.6 million/sec throughput under the same 8 competing threads.

Furthermore, CPU profiling showed the following overall CPU usage:
  - try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for
    atomic_inc_not_zero approach;
  - __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for
    atomic64_add_return approach implemented by this patch.

So the CPU spends relatively more time in get/put operations while
delivering less total throughput when using atomic_inc_not_zero(). And
this will be even more prominent once we optimize away
uprobe->register_rwsem in the subsequent patch sets. So while slightly
less straightforward, the current approach seems to be clearly winning
and justified.

We also rename get_uprobe() to __get_uprobe() to indicate it's
a delicate internal helper that is only safe to call under valid
circumstances:
  - while holding uprobes_treelock (to synchronize with exclusive write
    lock in put_uprobe(), as described above);
  - or if we have a guarantee that uprobe's refcount is already positive
    through caller holding at least one refcount (in this case there is
    no risk of refcount dropping to zero by any other CPU).

We also document why it's safe to do unconditional __get_uprobe() at all
call sites, to make it clear that we maintain the above invariants.

Note also, we now don't have a race between registration and
unregistration, so we remove the retry logic completely.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 260 ++++++++++++++++++++++++++++++----------
 1 file changed, 195 insertions(+), 65 deletions(-)

Message ID	20240701223935.3783951-5-andrii@kernel.org (mailing list archive)
State	Handled Elsewhere
Headers
  From: Andrii Nakryiko <andrii@kernel.org>
  To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com
  Cc: peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko <andrii@kernel.org>
  Subject: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
  Date: Mon, 1 Jul 2024 15:39:27 -0700
  In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org>
  References: <20240701223935.3783951-1-andrii@kernel.org>
  X-Mailing-List: linux-trace-kernel@vger.kernel.org
Series	uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  - [v2,00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore
  - [v2,01/12] uprobes: update outdated comment
  - [v2,02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
  - [v2,03/12] uprobes: simplify error handling for alloc_uprobe()
  - [v2,04/12] uprobes: revamp uprobe refcounting and lifetime management
  - [v2,05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer
  - [v2,06/12] uprobes: add batch uprobe register/unregister APIs
  - [v2,07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register()
  - [v2,08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps
  - [v2,09/12] uprobes: batch uprobes_treelock during registration
  - [v2,10/12] uprobes: improve lock batching for uprobe_unregister_batch
  - [v2,11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  - [v2,12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore
