From patchwork Tue Sep 3 17:45:56 2024
X-Patchwork-Submitter: Andrii Nakryiko
X-Patchwork-Id: 13789153
From: Andrii Nakryiko <andrii@kernel.org>
To: linux-trace-kernel@vger.kernel.org, peterz@infradead.org, oleg@redhat.com
Cc: rostedt@goodmis.org, mhiramat@kernel.org, bpf@vger.kernel.org,
	linux-kernel@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org,
	willy@infradead.org, surenb@google.com, akpm@linux-foundation.org,
	linux-mm@kvack.org, Andrii Nakryiko <andrii@kernel.org>
Subject: [PATCH v5 1/8] uprobes: revamp uprobe refcounting and lifetime management
Date: Tue, 3 Sep 2024 10:45:56 -0700
Message-ID: <20240903174603.3554182-2-andrii@kernel.org>
X-Mailer: git-send-email 2.43.5
In-Reply-To: <20240903174603.3554182-1-andrii@kernel.org>
References: <20240903174603.3554182-1-andrii@kernel.org>

Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.

Right now, there are a few possible "owners" of the uprobe refcount:

  - the uprobes_tree RB-tree assumes one refcount when a uprobe is
    registered and added to the lookup tree;
  - while a uprobe is triggered and the kernel is handling it in the
    breakpoint handler code, a temporary refcount bump is done to keep
    the uprobe from being freed;
  - if we have a uretprobe requested on a given struct uprobe instance,
    we take another refcount to keep the uprobe alive until user space
    code returns from the function and triggers the return handler.

The uprobes_tree's extra refcount of 1 is confusing and problematic. No
matter how many actual consumers are attached, they all share the same
refcount, and we need extra logic to drop the "last" (which might not
really be the last) refcount once the uprobe's consumer list becomes
empty. This is unconventional and has to be kept in mind as a special
case all the time.

Further, because of this design we have situations where find_uprobe()
finds a uprobe, bumps its refcount, and returns it to the caller, but
that uprobe still needs a uprobe_is_active() check, after which the
caller may be required to drop the refcount and try again. This is just
too many details leaking into the higher-level logic.

This patch changes the refcounting scheme so that uprobes_tree no
longer holds an extra refcount on struct uprobe. Instead, each
uprobe_consumer assumes its own refcount, which is dropped when the
consumer is unregistered. Other than that, all the active users of a
uprobe (entry and return uprobe handling code) keep exactly the same
refcounting approach.

With the above setup, once a uprobe's refcount drops to zero, we need
to make sure that the uprobe's "destructor" removes it from
uprobes_tree, of course. This, though, races with the uprobe entry
handling code in handle_swbp(), which, through the
find_active_uprobe()->find_uprobe() lookup, can observe a uprobe that
is being destroyed after its refcount dropped to zero (e.g., due to a
uprobe_consumer unregistering). So we add try_get_uprobe(), which bumps
the refcount only if it isn't already zero. The caller needs to
guarantee that the uprobe instance won't be freed in parallel, which is
the case while we hold uprobes_treelock (for read or write, it doesn't
matter).

Note also that the race between registration and unregistration no
longer leaks out to callers, so we remove the retry logic completely.
If find_uprobe() returns a valid uprobe, it's guaranteed to remain in
uprobes_tree with a properly incremented refcount. The race is handled
by __insert_uprobe() and put_uprobe() working together:
__insert_uprobe() removes a stale uprobe from the RB-tree if it can't
bump its refcount, and retries inserting the new uprobe instance;
put_uprobe() doesn't attempt to remove a uprobe from the RB-tree if
it's no longer there. All of that is protected by uprobes_treelock,
which keeps things simple.
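To make the resulting lifetime rules concrete, here is a minimal
userspace sketch of the same try-get pattern. This is illustrative
only, not the kernel code: C11 atomics stand in for
refcount_inc_not_zero()/refcount_dec_and_test(), and the in_tree flag
stands in for RB-tree membership maintained under uprobes_treelock.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct object {
	atomic_int ref;		/* stands in for uprobe->ref */
	bool in_tree;		/* stands in for RB-tree membership */
};

/* analog of try_get_uprobe(): never resurrect a zero refcount */
static bool try_get(struct object *o)
{
	int old = atomic_load(&o->ref);

	while (old > 0) {
		/* on failure, CAS reloads 'old' and we re-check it */
		if (atomic_compare_exchange_weak(&o->ref, &old, old + 1))
			return true;	/* got a reference */
	}
	return false;			/* object is already dying */
}

/* analog of put_uprobe(): the final put unlinks and frees */
static void put(struct object *o)
{
	if (atomic_fetch_sub(&o->ref, 1) != 1)
		return;
	/* the real code does rb_erase() under uprobes_treelock here */
	o->in_tree = false;
	free(o);
}

int main(void)
{
	struct object *o = malloc(sizeof(*o));

	if (!o)
		return 1;
	atomic_init(&o->ref, 1);	/* consumer's reference */
	o->in_tree = true;

	if (try_get(o))			/* lookup-style temporary ref */
		put(o);
	put(o);				/* last put frees the object */
	return 0;
}

The key property is that a lookup which loses the race observes a zero
refcount and simply fails, instead of resurrecting an object whose
destructor is already running.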
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 179 +++++++++++++++++++++++-----------------
 1 file changed, 101 insertions(+), 78 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 33349cc8de0c..147561c19d57 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -109,6 +109,11 @@ struct xol_area {
 	unsigned long 		vaddr;		/* Page(s) of instruction slots */
 };
 
+static void uprobe_warn(struct task_struct *t, const char *msg)
+{
+	pr_warn("uprobe: %s:%d failed to %s\n", current->comm, current->pid, msg);
+}
+
 /*
  * valid_vma: Verify if the specified vma is an executable vma
  * Relax restrictions while unregistering: vm_flags might have
@@ -587,25 +592,53 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
+/* uprobe should have guaranteed positive refcount */
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
 	refcount_inc(&uprobe->ref);
 	return uprobe;
 }
 
+/*
+ * uprobe should have guaranteed lifetime, which can be either of:
+ * - caller already has refcount taken (and wants an extra one);
+ * - uprobe is RCU protected and won't be freed until after grace period;
+ * - we are holding uprobes_treelock (for read or write, doesn't matter).
+ */
+static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
+{
+	if (refcount_inc_not_zero(&uprobe->ref))
+		return uprobe;
+	return NULL;
+}
+
+static inline bool uprobe_is_active(struct uprobe *uprobe)
+{
+	return !RB_EMPTY_NODE(&uprobe->rb_node);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
-	if (refcount_dec_and_test(&uprobe->ref)) {
-		/*
-		 * If application munmap(exec_vma) before uprobe_unregister()
-		 * gets called, we don't get a chance to remove uprobe from
-		 * delayed_uprobe_list from remove_breakpoint(). Do it here.
-		 */
-		mutex_lock(&delayed_uprobe_lock);
-		delayed_uprobe_remove(uprobe, NULL);
-		mutex_unlock(&delayed_uprobe_lock);
-		kfree(uprobe);
-	}
+	if (!refcount_dec_and_test(&uprobe->ref))
+		return;
+
+	write_lock(&uprobes_treelock);
+
+	if (uprobe_is_active(uprobe))
+		rb_erase(&uprobe->rb_node, &uprobes_tree);
+
+	write_unlock(&uprobes_treelock);
+
+	/*
+	 * If application munmap(exec_vma) before uprobe_unregister()
+	 * gets called, we don't get a chance to remove uprobe from
+	 * delayed_uprobe_list from remove_breakpoint(). Do it here.
+	 */
+	mutex_lock(&delayed_uprobe_lock);
+	delayed_uprobe_remove(uprobe, NULL);
+	mutex_unlock(&delayed_uprobe_lock);
+
+	kfree(uprobe);
 }
 
 static __always_inline
@@ -656,7 +689,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		return try_get_uprobe(__node_2_uprobe(node));
 
 	return NULL;
 }
@@ -676,26 +709,44 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+/*
+ * Attempt to insert a new uprobe into uprobes_tree.
+ *
+ * If uprobe already exists (for given inode+offset), we just increment
+ * refcount of previously existing uprobe.
+ *
+ * If not, a provided new instance of uprobe is inserted into the tree (with
+ * assumed initial refcount == 1).
+ *
+ * In any case, we return a uprobe instance that ends up being in uprobes_tree.
+ * Caller has to clean up new uprobe instance, if it ended up not being
+ * inserted into the tree.
+ *
+ * We assume that uprobes_treelock is held for writing.
+ */
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
-
+again:
 	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
-	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+	if (node) {
+		struct uprobe *u = __node_2_uprobe(node);
 
-	/* get access + creation ref */
-	refcount_set(&uprobe->ref, 2);
-	return NULL;
+		if (!try_get_uprobe(u)) {
+			rb_erase(node, &uprobes_tree);
+			RB_CLEAR_NODE(&u->rb_node);
+			goto again;
+		}
+
+		return u;
+	}
+
+	return uprobe;
 }
 
 /*
- * Acquire uprobes_treelock.
- * Matching uprobe already exists in rbtree;
- *	increment (access refcount) and return the matching uprobe.
- *
- * No matching uprobe; insert the uprobe in rb_tree;
- *	get a double refcount (access + creation) and return NULL.
+ * Acquire uprobes_treelock and insert uprobe into uprobes_tree
+ * (or reuse existing one, see __insert_uprobe() comments above).
 */
static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 {
@@ -732,11 +783,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	uprobe->ref_ctr_offset = ref_ctr_offset;
 	init_rwsem(&uprobe->register_rwsem);
 	init_rwsem(&uprobe->consumer_rwsem);
+	RB_CLEAR_NODE(&uprobe->rb_node);
+	refcount_set(&uprobe->ref, 1);
 
 	/* add to uprobes_tree, sorted on inode:offset */
 	cur_uprobe = insert_uprobe(uprobe);
 	/* a uprobe exists for this inode:offset combination */
-	if (cur_uprobe) {
+	if (cur_uprobe != uprobe) {
 		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 			ref_ctr_mismatch_warn(cur_uprobe, uprobe);
 			put_uprobe(cur_uprobe);
@@ -921,26 +974,6 @@ remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vad
 	return set_orig_insn(&uprobe->arch, mm, vaddr);
 }
 
-static inline bool uprobe_is_active(struct uprobe *uprobe)
-{
-	return !RB_EMPTY_NODE(&uprobe->rb_node);
-}
-/*
- * There could be threads that have already hit the breakpoint. They
- * will recheck the current insn and restart if find_uprobe() fails.
- * See find_active_uprobe().
- */
-static void delete_uprobe(struct uprobe *uprobe)
-{
-	if (WARN_ON(!uprobe_is_active(uprobe)))
-		return;
-
-	write_lock(&uprobes_treelock);
-	rb_erase(&uprobe->rb_node, &uprobes_tree);
-	write_unlock(&uprobes_treelock);
-	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
-}
-
 struct map_info {
 	struct map_info *next;
 	struct mm_struct *mm;
@@ -1094,17 +1127,13 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 	int err;
 
 	down_write(&uprobe->register_rwsem);
-	if (WARN_ON(!consumer_del(uprobe, uc)))
+	if (WARN_ON(!consumer_del(uprobe, uc))) {
 		err = -ENOENT;
-	else
+	} else {
 		err = register_for_each_vma(uprobe, NULL);
-
-	/* TODO : cant unregister? schedule a worker thread */
-	if (!err) {
-		if (!uprobe->consumers)
-			delete_uprobe(uprobe);
-		else
-			err = -EBUSY;
+		/* TODO : cant unregister? schedule a worker thread */
+		if (unlikely(err))
+			uprobe_warn(current, "unregister, leaking uprobe");
 	}
 	up_write(&uprobe->register_rwsem);
 
@@ -1159,27 +1188,16 @@ struct uprobe *uprobe_register(struct inode *inode,
 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
 		return ERR_PTR(-EINVAL);
 
- retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (IS_ERR(uprobe))
 		return uprobe;
 
-	/*
-	 * We can race with uprobe_unregister()->delete_uprobe().
-	 * Check uprobe_is_active() and retry if it is false.
-	 */
 	down_write(&uprobe->register_rwsem);
-	ret = -EAGAIN;
-	if (likely(uprobe_is_active(uprobe))) {
-		consumer_add(uprobe, uc);
-		ret = register_for_each_vma(uprobe, uc);
-	}
+	consumer_add(uprobe, uc);
+	ret = register_for_each_vma(uprobe, uc);
 	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
 
 	if (ret) {
-		if (unlikely(ret == -EAGAIN))
-			goto retry;
 		uprobe_unregister(uprobe, uc);
 		return ERR_PTR(ret);
 	}
@@ -1286,15 +1304,17 @@ static void build_probe_list(struct inode *inode,
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset < min)
 				break;
-			list_add(&u->pending_list, head);
-			get_uprobe(u);
+			/* if uprobe went away, it's safe to ignore it */
+			if (try_get_uprobe(u))
+				list_add(&u->pending_list, head);
 		}
 		for (t = n; (t = rb_next(t)); ) {
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset > max)
 				break;
-			list_add(&u->pending_list, head);
-			get_uprobe(u);
+			/* if uprobe went away, it's safe to ignore it */
+			if (try_get_uprobe(u))
+				list_add(&u->pending_list, head);
 		}
 	}
 	read_unlock(&uprobes_treelock);
@@ -1752,6 +1772,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 			return -ENOMEM;
 
 		*n = *o;
+		/*
+		 * uprobe's refcnt has to be positive at this point, kept by
+		 * utask->return_instances items; return_instances can't be
+		 * removed right now, as task is blocked due to duping; so
+		 * get_uprobe() is safe to use here.
+		 */
 		get_uprobe(n->uprobe);
 		n->next = NULL;
 
@@ -1763,12 +1789,6 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 	return 0;
 }
 
-static void uprobe_warn(struct task_struct *t, const char *msg)
-{
-	pr_warn("uprobe: %s:%d failed to %s\n",
-		current->comm, current->pid, msg);
-}
-
 static void dup_xol_work(struct callback_head *work)
 {
 	if (current->flags & PF_EXITING)
@@ -1894,7 +1914,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		}
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
-
+	/*
+	 * uprobe's refcnt is positive, held by caller, so it's safe to
+	 * unconditionally bump it one more time here
+	 */
 	ri->uprobe = get_uprobe(uprobe);
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);
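For reference, consumer-side usage of the revised API could look
roughly like the sketch below. The uprobe_register() and
uprobe_unregister() prototypes match this patch; the handler signature,
field names, and helper names are illustrative assumptions, not part of
the patch itself.

#include <linux/uprobes.h>
#include <linux/err.h>

/* assumed handler signature; returning 0 keeps the probe installed */
static int my_handler(struct uprobe_consumer *self, struct pt_regs *regs)
{
	return 0;
}

static struct uprobe_consumer my_consumer = {
	.handler = my_handler,
};

static struct uprobe *my_uprobe;

static int my_attach(struct inode *inode, loff_t offset)
{
	/* on success, the returned uprobe holds this consumer's refcount */
	my_uprobe = uprobe_register(inode, offset, 0, &my_consumer);
	return IS_ERR(my_uprobe) ? PTR_ERR(my_uprobe) : 0;
}

static void my_detach(void)
{
	/* drops this consumer's refcount; the last put frees the uprobe */
	uprobe_unregister(my_uprobe, &my_consumer);
}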