From patchwork Mon Jul 1 22:39:24 2024
X-Patchwork-Submitter: Andrii Nakryiko
X-Patchwork-Id: 13718699
From: Andrii Nakryiko
To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com
Cc: peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko
Subject: [PATCH v2 01/12] uprobes: update outdated comment
Date: Mon, 1 Jul 2024 15:39:24 -0700
Message-ID: <20240701223935.3783951-2-andrii@kernel.org>
In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org>
References: <20240701223935.3783951-1-andrii@kernel.org>

There is no task_struct passed into get_user_pages_remote() anymore, so drop
the parts of the comment mentioning a NULL tsk; they are just confusing at
this point.

Signed-off-by: Andrii Nakryiko
Acked-by: Oleg Nesterov
---
 kernel/events/uprobes.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 99be2adedbc0..081821fd529a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2030,10 +2030,8 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr)
 		goto out;
 
 	/*
-	 * The NULL 'tsk' here ensures that any faults that occur here
-	 * will not be accounted to the task.  'mm' *is* current->mm,
-	 * but we treat this as a 'remote' access since it is
-	 * essentially a kernel access to the memory.
+	 * 'mm' *is* current->mm, but we treat this as a 'remote' access since
+	 * it is essentially a kernel access to the memory.
 	 */
 	result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL);
 	if (result < 0)

From patchwork Mon Jul 1 22:39:25 2024
X-Patchwork-Submitter: Andrii Nakryiko
X-Patchwork-Id: 13718700
From: Andrii Nakryiko
To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com
Cc: peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko
Subject: [PATCH v2 02/12] uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
Date: Mon, 1 Jul 2024 15:39:25 -0700
Message-ID: <20240701223935.3783951-3-andrii@kernel.org>
In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org>
References: <20240701223935.3783951-1-andrii@kernel.org>

It seems like uprobe_write_opcode() doesn't require a write-locked mmap_sem;
any lock (reader or writer) should be sufficient. This was established in
the discussion in [0], and looking through the existing code confirms that
there is no need for a write-locked mmap_sem. Fix the comment to state this
clearly.

[0] https://lore.kernel.org/linux-trace-kernel/20240625190748.GC14254@redhat.com/

Fixes: 29dedee0e693 ("uprobes: Add mem_cgroup_charge_anon() into uprobe_write_opcode()")
Signed-off-by: Andrii Nakryiko
Acked-by: Oleg Nesterov
---
 kernel/events/uprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 081821fd529a..f87049c08ee9 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -453,7 +453,7 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
  * @vaddr: the virtual address to store the opcode.
  * @opcode: opcode to be written at @vaddr.
  *
- * Called with mm->mmap_lock held for write.
+ * Called with mm->mmap_lock held for read or write.
  * Return 0 (success) or a negative errno.
  */
 int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,

From patchwork Mon Jul 1 22:39:26 2024
X-Patchwork-Submitter: Andrii Nakryiko
X-Patchwork-Id: 13718701
From: Andrii Nakryiko
To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com
Cc: peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko
Subject: [PATCH v2 03/12] uprobes: simplify error handling for alloc_uprobe()
Date: Mon, 1 Jul 2024 15:39:26 -0700
Message-ID: <20240701223935.3783951-4-andrii@kernel.org>
In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org>
References: <20240701223935.3783951-1-andrii@kernel.org>

Return ERR_PTR(-ENOMEM) instead of NULL, which makes the caller's error
handling just a touch simpler.
Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f87049c08ee9..23449a8c5e7e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -725,7 +725,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 
 	uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
 	if (!uprobe)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	uprobe->inode = inode;
 	uprobe->offset = offset;
@@ -1161,8 +1161,6 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 
  retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
-	if (!uprobe)
-		return -ENOMEM;
 	if (IS_ERR(uprobe))
 		return PTR_ERR(uprobe);

From patchwork Mon Jul 1 22:39:27 2024
X-Patchwork-Submitter: Andrii Nakryiko
X-Patchwork-Id: 13718702
From: Andrii Nakryiko
To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com
Cc: peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko
Subject: [PATCH v2 04/12] uprobes: revamp uprobe refcounting and lifetime management
Date: Mon, 1 Jul 2024 15:39:27 -0700
Message-ID: <20240701223935.3783951-5-andrii@kernel.org>
In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org>
References: <20240701223935.3783951-1-andrii@kernel.org>

Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.
Right now, there are a few possible "owners" of the uprobe refcount:

- the uprobes_tree RB tree assumes one refcount when a uprobe is registered
  and added to the lookup tree;
- while a uprobe is triggered and the kernel is handling it in the
  breakpoint handler code, a temporary refcount bump is done to keep the
  uprobe from being freed;
- if we have a uretprobe requested on a given struct uprobe instance, we
  take another refcount to keep the uprobe alive until user space code
  returns from the function and triggers the return handler.

The uprobes_tree's extra refcount of 1 is problematic and inconvenient.
Because of it, we have extra retry logic in uprobe_register(), and extra
logic in __uprobe_unregister(), which checks that the uprobe has no more
consumers and, if that's the case, removes struct uprobe from uprobes_tree
(through delete_uprobe(), which takes the writer lock on uprobes_tree),
decrementing the refcount after that. The latter is the source of an
unfortunate race with uprobe_register(), necessitating retries.

All of the above is a complication that makes adding batched uprobe
registration/unregistration APIs hard, and generally makes the logic harder
to follow.

This patch changes the refcounting scheme so that uprobes_tree does not keep
an extra refcount for struct uprobe. Instead, uprobe_consumer assumes this
extra refcount, which is dropped when the consumer is unregistered. Other
than that, all the active users of a uprobe (entry and return uprobe
handling code) keep exactly the same refcounting approach.

With the above setup, once a uprobe's refcount drops to zero, we need to
make sure that the uprobe's "destructor" removes the uprobe from
uprobes_tree, of course. This, though, races with the uprobe entry handling
code in handle_swbp(), whose find_active_uprobe()->find_uprobe() lookup can
race with the uprobe being destroyed after its refcount drops to zero
(e.g., due to a uprobe_consumer unregistering).
This is because find_active_uprobe() bumps the refcount without knowing for
sure that the uprobe's refcount is already positive (and it has to be this
way; there is no way around that setup).

One way to solve this, attempted initially, is the atomic_inc_not_zero()
approach, turning get_uprobe() into try_get_uprobe(), which can fail to bump
the refcount if the uprobe is already destined to be destroyed. This,
unfortunately, turns out to be rather expensive due to the underlying
cmpxchg() operation in atomic_inc_not_zero(), and it scales rather poorly
with an increased number of parallel threads triggering uprobes.

So we devise a refcounting scheme that doesn't require cmpxchg(), instead
relying only on atomic additions, which scale better and are faster. While
the solution has a bit of a trick to it, all the logic is nicely
compartmentalized in the __get_uprobe() and put_uprobe() helpers and doesn't
leak outside of those low-level helpers.

We effectively structure the uprobe's destruction (i.e., the put_uprobe()
logic) in such a way that we support "resurrecting" a uprobe by bumping its
refcount from zero back to one, pretending it never dropped to zero in the
first place. This is done in a race-free way under the exclusive writer
uprobes_treelock. Crucially, we take the lock only once the refcount drops
to zero; if we had to take the lock before decrementing the refcount, the
approach would be prohibitively expensive.

Anyway, under the exclusive writer lock, we double-check that the refcount
didn't change and is still zero. If it is, we proceed with destruction,
because at that point we have a guarantee that find_active_uprobe() can't
successfully look up this uprobe instance, as it's going to be removed in
the destructor under the writer lock.
If, on the other hand, find_active_uprobe() managed to bump the refcount
from zero to one in between put_uprobe()'s atomic_dec_and_test(&uprobe->ref)
and write_lock(&uprobes_treelock), we'll deterministically detect this with
an extra atomic_read(&uprobe->ref) check, and if it doesn't hold, we pretend
atomic_dec_and_test() never returned true. No resource freeing or any other
irreversible action has been taken up to this point, so we just exit early.

One tricky part in the above is two CPUs racing, both dropping the refcnt to
zero, and then both attempting to free resources. This can happen as
follows:
- CPU #0 drops refcnt from 1 to 0, and proceeds to grab uprobes_treelock;
- before CPU #0 grabs the lock, CPU #1 updates refcnt as 0 -> 1 -> 0, at
  which point it decides that it needs to free the uprobe as well.

At this point both CPU #0 and CPU #1 will believe they need to destroy the
uprobe, which is obviously wrong. To prevent this situation, we augment the
refcount with an epoch counter, which is always incremented by 1 on either a
get or a put operation. This allows the two CPUs above to disambiguate who
should actually free the uprobe (it's CPU #1, because it has the up-to-date
epoch). See the comments in the code and note the specific values of the
UPROBE_REFCNT_GET and UPROBE_REFCNT_PUT constants. Keep in mind that a
single atomic64_t is actually two sort-of-independent 32-bit counters that
are incremented/decremented with a single atomic64_add_return() operation.
Note also a small and extremely rare (and thus having no effect on
performance) need to clear the highest bit every 2 billion get/put
operations, to prevent the high 32-bit counter from "bleeding over" into the
lower 32-bit counter.

Another aspect of this race is that the winning CPU might, at least
theoretically, be so quick that it frees the uprobe memory before the losing
CPU gets a chance to discover that it lost. To prevent this, we protect and
delay the uprobe's lifetime with RCU.
We can't use rcu_read_lock() + rcu_read_unlock(), because we need to take
locks inside the RCU critical section. Luckily, we have the RCU Tasks Trace
flavor, which supports locking and sleeping. It is already used by the BPF
subsystem for sleepable BPF programs (including sleepable BPF uprobe
programs) and is optimized for reader-dominated workloads. It fits perfectly
and doesn't seem to introduce any significant slowdowns in the uprobe hot
path.

All the above contained trickery aside, we end up with nice semantics for
the get and put operations, where get always succeeds and put handles all
the races properly and transparently to the caller.

And just to justify this somewhat unorthodox refcounting approach: under a
uprobe triggering micro-benchmark (using BPF selftests' bench tool) with 8
triggering threads, the atomic_inc_not_zero() approach was producing about
3.3 million/sec total uprobe triggerings across all threads, while the final
atomic64_add_return()-based approach managed to get 3.6 million/sec
throughput under the same 8 competing threads.

Furthermore, CPU profiling showed the following overall CPU usage:
- try_get_uprobe (19.3%) + put_uprobe (8.2%) = 27.5% CPU usage for the
  atomic_inc_not_zero() approach;
- __get_uprobe (12.3%) + put_uprobe (9.9%) = 22.2% CPU usage for the
  atomic64_add_return() approach implemented by this patch.

So the CPU spends relatively more time in get/put operations while
delivering less total throughput when using atomic_inc_not_zero(). And this
will be even more prominent once we optimize away uprobe->register_rwsem in
the subsequent patch sets. So, while slightly less straightforward, the
current approach seems to be clearly winning and justified.
We also rename get_uprobe() to __get_uprobe() to indicate it's a delicate
internal helper that is only safe to call under valid circumstances:
- while holding uprobes_treelock (to synchronize with the exclusive write
  lock in put_uprobe(), as described above);
- or if we have a guarantee that the uprobe's refcount is already positive
  through the caller holding at least one refcount (in this case there is no
  risk of the refcount dropping to zero on any other CPU).

We also document why it's safe to do an unconditional __get_uprobe() at all
call sites, to make it clear that we maintain the above invariants.

Note also that we no longer have a race between registration and
unregistration, so we remove the retry logic completely.

Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 260 ++++++++++++++++++++++++++++++----------
 1 file changed, 195 insertions(+), 65 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 23449a8c5e7e..560cf1ca512a 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -53,9 +53,10 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 
 struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
-	refcount_t		ref;
+	atomic64_t		ref;		/* see UPROBE_REFCNT_GET below */
 	struct rw_semaphore	register_rwsem;
 	struct rw_semaphore	consumer_rwsem;
+	struct rcu_head		rcu;
 	struct list_head	pending_list;
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
@@ -587,15 +588,138 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
-static struct uprobe *get_uprobe(struct uprobe *uprobe)
+/*
+ * Uprobe's 64-bit refcount is actually two independent counters co-located
+ * in a single u64 value:
+ *   - lower 32 bits are a normal refcount, which is incremented and
+ *     decremented on get and put, respectively, just like a normal refcount
+ *     would be;
+ *   - upper 32 bits are a tag (or epoch, if you will), which is always
+ *     incremented by one, no matter whether a get or put operation is done.
+ *
+ * This upper counter is meant to distinguish between:
+ *   - one CPU dropping refcnt from 1 -> 0 and proceeding with "destruction",
+ *   - while another CPU continuing further meanwhile with a 0 -> 1 -> 0
+ *     refcnt sequence, also proceeding to "destruction".
+ *
+ * In both cases refcount drops to zero, but in one case it will have epoch
+ * N, while the second drop to zero will have a different epoch N + 2,
+ * allowing the first destructor to bail out because the epoch changed
+ * between refcount going to zero and put_uprobe() taking uprobes_treelock
+ * (under which the overall 64-bit refcount is double-checked, see
+ * put_uprobe() for details).
+ *
+ * The lower 32-bit counter is not meant to overflow, while it's expected
+ * that the upper 32-bit counter will overflow occasionally. Note, though,
+ * that we can't allow the upper 32-bit counter to "bleed over" into the
+ * lower 32-bit counter, so whenever the epoch counter gets its highest bit
+ * set to 1, __get_uprobe() and put_uprobe() will attempt to clear that
+ * upper bit with cmpxchg(). This makes the epoch effectively a 31-bit
+ * counter with the highest bit used as a flag to perform a fix-up. This
+ * ensures the epoch and refcnt parts do not "interfere".
+ *
+ * The UPROBE_REFCNT_GET constant is chosen such that it will *increment
+ * both* epoch and refcnt parts atomically with one atomic_add().
+ * UPROBE_REFCNT_PUT is chosen such that it will *decrement* the refcnt part
+ * and *increment* the epoch part.
+ */
+#define UPROBE_REFCNT_GET ((1LL << 32) + 1LL) /* 0x0000000100000001LL */
+#define UPROBE_REFCNT_PUT ((1LL << 32) - 1LL) /* 0x00000000ffffffffLL */
+
+/*
+ * Caller has to make sure that:
+ * a) either uprobe's refcnt is positive before this call;
+ * b) or uprobes_treelock is held (doesn't matter if for read or write),
+ *    preventing uprobe's destructor from removing it from uprobes_tree.
+ *
+ * In the latter case, uprobe's destructor will "resurrect" the uprobe
+ * instance if it detects that its refcount went back to being positive
+ * again in between it dropping to zero at some point and the (potentially
+ * delayed) destructor callback actually running.
+ */
+static struct uprobe *__get_uprobe(struct uprobe *uprobe)
 {
-	refcount_inc(&uprobe->ref);
+	s64 v;
+
+	v = atomic64_add_return(UPROBE_REFCNT_GET, &uprobe->ref);
+
+	/*
+	 * If the highest bit is set, we need to clear it. If cmpxchg() fails,
+	 * we don't retry because there is another CPU that just managed to
+	 * update refcnt and will attempt the same "fix up". Eventually one of
+	 * them will succeed in clearing the highest bit.
+	 */
+	if (unlikely(v < 0))
+		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
+
 	return uprobe;
 }
 
+static inline bool uprobe_is_active(struct uprobe *uprobe)
+{
+	return !RB_EMPTY_NODE(&uprobe->rb_node);
+}
+
+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+
+	kfree(uprobe);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
-	if (refcount_dec_and_test(&uprobe->ref)) {
+	s64 v;
+
+	/*
+	 * Here uprobe instance is guaranteed to be alive, so we use Tasks
+	 * Trace RCU to guarantee that uprobe won't be freed from under us, if
+	 * we end up being a losing "destructor" inside uprobe_treelock'ed
+	 * section double-checking uprobe->ref value below.
+	 * Note call_rcu_tasks_trace() + uprobe_free_rcu below.
+	 */
+	rcu_read_lock_trace();
+
+	v = atomic64_add_return(UPROBE_REFCNT_PUT, &uprobe->ref);
+
+	if (unlikely((u32)v == 0)) {
+		bool destroy;
+
+		write_lock(&uprobes_treelock);
+		/*
+		 * We might race with find_uprobe()->__get_uprobe() executed
+		 * from inside read-locked uprobes_treelock, which can bump
+		 * refcount from zero back to one, after we got here. Even
+		 * worse, it's possible for another CPU to do 0 -> 1 -> 0
+		 * transition between this CPU doing atomic_add() and taking
+		 * uprobes_treelock. In either case this CPU should bail out
+		 * and not proceed with destruction.
+		 *
+		 * So now that we have exclusive write lock, we double check
+		 * the total 64-bit refcount value, which includes the epoch.
+		 * If nothing changed (i.e., epoch is the same and refcnt is
+		 * still zero), we are good and we proceed with the clean up.
+		 *
+		 * But if it managed to be updated back at least once, we just
+		 * pretend it never went to zero. If the lower 32-bit refcnt
+		 * part drops to zero again, another CPU will proceed with
+		 * destruction, due to a more up to date epoch.
+		 */
+		destroy = atomic64_read(&uprobe->ref) == v;
+		if (destroy && uprobe_is_active(uprobe))
+			rb_erase(&uprobe->rb_node, &uprobes_tree);
+		write_unlock(&uprobes_treelock);
+
+		/*
+		 * Beyond here we don't need RCU protection, we are either the
+		 * winning destructor and we control the rest of uprobe's
+		 * lifetime; or we lost and we are bailing without accessing
+		 * uprobe fields anymore.
+		 */
+		rcu_read_unlock_trace();
+
+		/* uprobe got resurrected, pretend we never tried to free it */
+		if (!destroy)
+			return;
+
 		/*
 		 * If application munmap(exec_vma) before uprobe_unregister()
 		 * gets called, we don't get a chance to remove uprobe from
@@ -604,8 +728,21 @@ static void put_uprobe(struct uprobe *uprobe)
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
-		kfree(uprobe);
+
+		call_rcu_tasks_trace(&uprobe->rcu, uprobe_free_rcu);
+		return;
 	}
+
+	/*
+	 * If the highest bit is set, we need to clear it. If cmpxchg() fails,
+	 * we don't retry because there is another CPU that just managed to
+	 * update refcnt and will attempt the same "fix up". Eventually one of
+	 * them will succeed in clearing the highest bit.
+	 */
+	if (unlikely(v < 0))
+		(void)atomic64_cmpxchg(&uprobe->ref, v, v & ~(1ULL << 63));
+
+	rcu_read_unlock_trace();
 }
 
 static __always_inline
@@ -653,12 +790,15 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 		.inode = inode,
 		.offset = offset,
 	};
-	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
+	struct rb_node *node;
+	struct uprobe *u = NULL;
 
+	node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		/* we hold uprobes_treelock, so it's safe to __get_uprobe() */
+		u = __get_uprobe(__node_2_uprobe(node));
 
-	return NULL;
+	return u;
 }
 
 /*
@@ -676,26 +816,37 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+/*
+ * Attempt to insert a new uprobe into uprobes_tree.
+ *
+ * If uprobe already exists (for given inode+offset), we just increment
+ * refcount of previously existing uprobe.
+ *
+ * If not, a provided new instance of uprobe is inserted into the tree (with
+ * assumed initial refcount == 1).
+ *
+ * In any case, we return a uprobe instance that ends up being in
+ * uprobes_tree. Caller has to clean up new uprobe instance, if it ended up
+ * not being inserted into the tree.
+ *
+ * We assume that uprobes_treelock is held for writing.
+ */
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
+	struct uprobe *u = uprobe;
 
 	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		/* we hold uprobes_treelock, so it's safe to __get_uprobe() */
+		u = __get_uprobe(__node_2_uprobe(node));
 
-	/* get access + creation ref */
-	refcount_set(&uprobe->ref, 2);
-	return NULL;
+	return u;
 }
 
 /*
- * Acquire uprobes_treelock.
- * Matching uprobe already exists in rbtree;
- * increment (access refcount) and return the matching uprobe.
- *
- * No matching uprobe; insert the uprobe in rb_tree;
- * get a double refcount (access + creation) and return NULL.
+ * Acquire uprobes_treelock and insert uprobe into uprobes_tree
+ * (or reuse existing one, see __insert_uprobe() comments above).
  */
 static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 {
@@ -732,11 +883,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	uprobe->ref_ctr_offset = ref_ctr_offset;
 	init_rwsem(&uprobe->register_rwsem);
 	init_rwsem(&uprobe->consumer_rwsem);
+	RB_CLEAR_NODE(&uprobe->rb_node);
+	atomic64_set(&uprobe->ref, 1);
 
 	/* add to uprobes_tree, sorted on inode:offset */
 	cur_uprobe = insert_uprobe(uprobe);
 	/* a uprobe exists for this inode:offset combination */
-	if (cur_uprobe) {
+	if (cur_uprobe != uprobe) {
 		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 			ref_ctr_mismatch_warn(cur_uprobe, uprobe);
 			put_uprobe(cur_uprobe);
@@ -921,27 +1074,6 @@ remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vad
 	return set_orig_insn(&uprobe->arch, mm, vaddr);
 }
 
-static inline bool uprobe_is_active(struct uprobe *uprobe)
-{
-	return !RB_EMPTY_NODE(&uprobe->rb_node);
-}
-
-/*
- * There could be threads that have already hit the breakpoint. They
- * will recheck the current insn and restart if find_uprobe() fails.
- * See find_active_uprobe().
- */
-static void delete_uprobe(struct uprobe *uprobe)
-{
-	if (WARN_ON(!uprobe_is_active(uprobe)))
-		return;
-
-	write_lock(&uprobes_treelock);
-	rb_erase(&uprobe->rb_node, &uprobes_tree);
-	write_unlock(&uprobes_treelock);
-	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
-	put_uprobe(uprobe);
-}
-
 struct map_info {
 	struct map_info *next;
 	struct mm_struct *mm;
@@ -1082,15 +1214,11 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 static void
 __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
-	int err;
-
 	if (WARN_ON(!consumer_del(uprobe, uc)))
 		return;
 
-	err = register_for_each_vma(uprobe, NULL);
 	/* TODO : cant unregister? schedule a worker thread */
-	if (!uprobe->consumers && !err)
-		delete_uprobe(uprobe);
+	(void)register_for_each_vma(uprobe, NULL);
 }
 
 /*
@@ -1159,28 +1287,20 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
 		return -EINVAL;
 
- retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (IS_ERR(uprobe))
 		return PTR_ERR(uprobe);
 
-	/*
-	 * We can race with uprobe_unregister()->delete_uprobe().
-	 * Check uprobe_is_active() and retry if it is false.
-	 */
 	down_write(&uprobe->register_rwsem);
-	ret = -EAGAIN;
-	if (likely(uprobe_is_active(uprobe))) {
-		consumer_add(uprobe, uc);
-		ret = register_for_each_vma(uprobe, uc);
-		if (ret)
-			__uprobe_unregister(uprobe, uc);
-	}
+	consumer_add(uprobe, uc);
+	ret = register_for_each_vma(uprobe, uc);
+	if (ret)
+		__uprobe_unregister(uprobe, uc);
 	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
 
-	if (unlikely(ret == -EAGAIN))
-		goto retry;
+	if (ret)
+		put_uprobe(uprobe);
+
 	return ret;
 }
 
@@ -1303,15 +1423,15 @@ static void build_probe_list(struct inode *inode,
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset < min)
 				break;
+			__get_uprobe(u); /* uprobes_treelock is held */
 			list_add(&u->pending_list, head);
-			get_uprobe(u);
 		}
 		for (t = n; (t = rb_next(t)); ) {
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset > max)
 				break;
+			__get_uprobe(u); /* uprobes_treelock is held */
 			list_add(&u->pending_list, head);
-			get_uprobe(u);
 		}
 	}
 	read_unlock(&uprobes_treelock);
@@ -1769,7 +1889,14 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 			return -ENOMEM;
 
 		*n = *o;
-		get_uprobe(n->uprobe);
+		/*
+		 * uprobe's refcnt has to be positive at this point, kept by
+		 * utask->return_instances items; return_instances can't be
+		 * removed right now, as task is blocked due to duping; so
+		 * __get_uprobe() is safe to use here without holding
+		 * uprobes_treelock.
+ */ + __get_uprobe(n->uprobe); n->next = NULL; *p = n; @@ -1911,8 +2038,11 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs) } orig_ret_vaddr = utask->return_instances->orig_ret_vaddr; } - - ri->uprobe = get_uprobe(uprobe); + /* + * uprobe's refcnt is positive, held by caller, so it's safe to + * unconditionally bump it one more time here + */ + ri->uprobe = __get_uprobe(uprobe); ri->func = instruction_pointer(regs); ri->stack = user_stack_pointer(regs); ri->orig_ret_vaddr = orig_ret_vaddr; From patchwork Mon Jul 1 22:39:28 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13718703 From: Andrii Nakryiko To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com Cc: peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko Subject: [PATCH v2 05/12] uprobes: move offset and ref_ctr_offset into uprobe_consumer Date: Mon, 1 Jul 2024 15:39:28 -0700 Message-ID: <20240701223935.3783951-6-andrii@kernel.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org> References: <20240701223935.3783951-1-andrii@kernel.org> Precedence: bulk X-Mailing-List: linux-trace-kernel@vger.kernel.org MIME-Version: 1.0 Simplify the uprobe registration/unregistration interfaces by making offset and ref_ctr_offset part of the uprobe_consumer "interface". In practice, all existing users already store these fields somewhere in uprobe_consumer's containing structure, so this doesn't pose any problem. We just move some fields around. On the other hand, this simplifies the uprobe_register() and uprobe_unregister() APIs: struct uprobe_consumer alone now represents the attachment/detachment entity.
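As a quick userspace illustration of this consumer-centric shape (all struct names and the register function below are mocked stand-ins for this sketch, not the kernel's actual definitions), the caller now fills the offsets into the consumer and passes only the consumer around:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Mock of the reshaped API: offsets live in the consumer, so registration
 * takes just the consumer (plus the inode in the real kernel API).
 */
struct mock_uprobe_consumer {
	int (*handler)(struct mock_uprobe_consumer *self);
	long long offset;         /* file offset of the probe */
	long long ref_ctr_offset; /* refctr file offset, or 0 */
};

static int on_hit(struct mock_uprobe_consumer *self)
{
	(void)self;
	return 0;
}

/* mimics uprobe_register()'s sanity checks on the consumer-carried fields */
static int mock_uprobe_register(struct mock_uprobe_consumer *uc)
{
	if (!uc->handler)
		return -1; /* need at least one handler, like -EINVAL */
	if (uc->ref_ctr_offset % (long long)sizeof(short))
		return -1; /* ref_ctr_offset must be 2-byte aligned */
	return 0;
}

/* what a user such as trace_uprobe now does: set offsets in the consumer,
 * then hand over just the consumer */
static int attach_aligned(void)
{
	struct mock_uprobe_consumer uc = {
		.handler = on_hit,
		.offset = 0x1040,
		.ref_ctr_offset = 0x2008,
	};
	return mock_uprobe_register(&uc);
}

static int attach_misaligned(void)
{
	struct mock_uprobe_consumer uc = {
		.handler = on_hit,
		.offset = 0x1040,
		.ref_ctr_offset = 0x2003, /* odd offset: rejected */
	};
	return mock_uprobe_register(&uc);
}
```

Because the consumer carries everything identifying an attachment, batched attachment later in the series only needs a way to enumerate consumers.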
This makes batched versions of uprobe_register() and uprobe_unregister() simpler. This also makes uprobe_register_refctr() unnecessary, so remove it and simplify consumers. No functional changes intended. Acked-by: Masami Hiramatsu (Google) Signed-off-by: Andrii Nakryiko --- include/linux/uprobes.h | 18 +++---- kernel/events/uprobes.c | 19 ++----- kernel/trace/bpf_trace.c | 21 +++----- kernel/trace/trace_uprobe.c | 53 ++++++++----------- .../selftests/bpf/bpf_testmod/bpf_testmod.c | 22 ++++---- 5 files changed, 55 insertions(+), 78 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index b503fafb7fb3..a75ba37ce3c8 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -42,6 +42,11 @@ struct uprobe_consumer { enum uprobe_filter_ctx ctx, struct mm_struct *mm); + /* associated file offset of this probe */ + loff_t offset; + /* associated refctr file offset of this probe, or zero */ + loff_t ref_ctr_offset; + /* for internal uprobe infra use, consumers shouldn't touch fields below */ struct uprobe_consumer *next; }; @@ -110,10 +115,9 @@ extern bool is_trap_insn(uprobe_opcode_t *insn); extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); -extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); -extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc); +extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); -extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc); +extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc); extern int uprobe_mmap(struct vm_area_struct 
*vma); extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void uprobe_start_dup_mmap(void); @@ -152,11 +156,7 @@ static inline void uprobes_init(void) #define uprobe_get_trap_addr(regs) instruction_pointer(regs) static inline int -uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) -{ - return -ENOSYS; -} -static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc) +uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { return -ENOSYS; } @@ -166,7 +166,7 @@ uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, boo return -ENOSYS; } static inline void -uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { } static inline int uprobe_mmap(struct vm_area_struct *vma) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 560cf1ca512a..8759c6d0683e 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1224,14 +1224,13 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) /* * uprobe_unregister - unregister an already registered probe. * @inode: the file in which the probe has to be removed. - * @offset: offset from the start of the file. - * @uc: identify which probe if multiple probes are colocated. + * @uc: identify which probe consumer to unregister. 
*/ -void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc) +void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { struct uprobe *uprobe; - uprobe = find_uprobe(inode, offset); + uprobe = find_uprobe(inode, uc->offset); if (WARN_ON(!uprobe)) return; @@ -1304,20 +1303,12 @@ static int __uprobe_register(struct inode *inode, loff_t offset, return ret; } -int uprobe_register(struct inode *inode, loff_t offset, - struct uprobe_consumer *uc) +int uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { - return __uprobe_register(inode, offset, 0, uc); + return __uprobe_register(inode, uc->offset, uc->ref_ctr_offset, uc); } EXPORT_SYMBOL_GPL(uprobe_register); -int uprobe_register_refctr(struct inode *inode, loff_t offset, - loff_t ref_ctr_offset, struct uprobe_consumer *uc) -{ - return __uprobe_register(inode, offset, ref_ctr_offset, uc); -} -EXPORT_SYMBOL_GPL(uprobe_register_refctr); - /* * uprobe_apply - unregister an already registered probe. * @inode: the file in which the probe has to be removed. 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index d1daeab1bbc1..ba62baec3152 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -3154,8 +3154,6 @@ struct bpf_uprobe_multi_link; struct bpf_uprobe { struct bpf_uprobe_multi_link *link; - loff_t offset; - unsigned long ref_ctr_offset; u64 cookie; struct uprobe_consumer consumer; }; @@ -3181,8 +3179,7 @@ static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes, u32 i; for (i = 0; i < cnt; i++) { - uprobe_unregister(d_real_inode(path->dentry), uprobes[i].offset, - &uprobes[i].consumer); + uprobe_unregister(d_real_inode(path->dentry), &uprobes[i].consumer); } } @@ -3262,10 +3259,10 @@ static int bpf_uprobe_multi_link_fill_link_info(const struct bpf_link *link, for (i = 0; i < ucount; i++) { if (uoffsets && - put_user(umulti_link->uprobes[i].offset, uoffsets + i)) + put_user(umulti_link->uprobes[i].consumer.offset, uoffsets + i)) return -EFAULT; if (uref_ctr_offsets && - put_user(umulti_link->uprobes[i].ref_ctr_offset, uref_ctr_offsets + i)) + put_user(umulti_link->uprobes[i].consumer.ref_ctr_offset, uref_ctr_offsets + i)) return -EFAULT; if (ucookies && put_user(umulti_link->uprobes[i].cookie, ucookies + i)) @@ -3439,15 +3436,16 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr goto error_free; for (i = 0; i < cnt; i++) { - if (__get_user(uprobes[i].offset, uoffsets + i)) { + if (__get_user(uprobes[i].consumer.offset, uoffsets + i)) { err = -EFAULT; goto error_free; } - if (uprobes[i].offset < 0) { + if (uprobes[i].consumer.offset < 0) { err = -EINVAL; goto error_free; } - if (uref_ctr_offsets && __get_user(uprobes[i].ref_ctr_offset, uref_ctr_offsets + i)) { + if (uref_ctr_offsets && + __get_user(uprobes[i].consumer.ref_ctr_offset, uref_ctr_offsets + i)) { err = -EFAULT; goto error_free; } @@ -3477,10 +3475,7 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr &bpf_uprobe_multi_link_lops, 
prog); for (i = 0; i < cnt; i++) { - err = uprobe_register_refctr(d_real_inode(link->path.dentry), - uprobes[i].offset, - uprobes[i].ref_ctr_offset, - &uprobes[i].consumer); + err = uprobe_register(d_real_inode(link->path.dentry), &uprobes[i].consumer); if (err) { bpf_uprobe_unregister(&path, uprobes, i); goto error_free; diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index c98e3b3386ba..d786f99114be 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -60,8 +60,6 @@ struct trace_uprobe { struct path path; struct inode *inode; char *filename; - unsigned long offset; - unsigned long ref_ctr_offset; unsigned long nhit; struct trace_probe tp; }; @@ -205,7 +203,7 @@ static unsigned long translate_user_vaddr(unsigned long file_offset) udd = (void *) current->utask->vaddr; - base_addr = udd->bp_addr - udd->tu->offset; + base_addr = udd->bp_addr - udd->tu->consumer.offset; return base_addr + file_offset; } @@ -286,13 +284,13 @@ static bool trace_uprobe_match_command_head(struct trace_uprobe *tu, if (strncmp(tu->filename, argv[0], len) || argv[0][len] != ':') return false; - if (tu->ref_ctr_offset == 0) - snprintf(buf, sizeof(buf), "0x%0*lx", - (int)(sizeof(void *) * 2), tu->offset); + if (tu->consumer.ref_ctr_offset == 0) + snprintf(buf, sizeof(buf), "0x%0*llx", + (int)(sizeof(void *) * 2), tu->consumer.offset); else - snprintf(buf, sizeof(buf), "0x%0*lx(0x%lx)", - (int)(sizeof(void *) * 2), tu->offset, - tu->ref_ctr_offset); + snprintf(buf, sizeof(buf), "0x%0*llx(0x%llx)", + (int)(sizeof(void *) * 2), tu->consumer.offset, + tu->consumer.ref_ctr_offset); if (strcmp(buf, &argv[0][len + 1])) return false; @@ -410,7 +408,7 @@ static bool trace_uprobe_has_same_uprobe(struct trace_uprobe *orig, list_for_each_entry(orig, &tpe->probes, tp.list) { if (comp_inode != d_real_inode(orig->path.dentry) || - comp->offset != orig->offset) + comp->consumer.offset != orig->consumer.offset) continue; /* @@ -472,8 +470,8 @@ static int 
validate_ref_ctr_offset(struct trace_uprobe *new) for_each_trace_uprobe(tmp, pos) { if (new_inode == d_real_inode(tmp->path.dentry) && - new->offset == tmp->offset && - new->ref_ctr_offset != tmp->ref_ctr_offset) { + new->consumer.offset == tmp->consumer.offset && + new->consumer.ref_ctr_offset != tmp->consumer.ref_ctr_offset) { pr_warn("Reference counter offset mismatch."); return -EINVAL; } @@ -675,8 +673,8 @@ static int __trace_uprobe_create(int argc, const char **argv) WARN_ON_ONCE(ret != -ENOMEM); goto fail_address_parse; } - tu->offset = offset; - tu->ref_ctr_offset = ref_ctr_offset; + tu->consumer.offset = offset; + tu->consumer.ref_ctr_offset = ref_ctr_offset; tu->path = path; tu->filename = filename; @@ -746,12 +744,12 @@ static int trace_uprobe_show(struct seq_file *m, struct dyn_event *ev) char c = is_ret_probe(tu) ? 'r' : 'p'; int i; - seq_printf(m, "%c:%s/%s %s:0x%0*lx", c, trace_probe_group_name(&tu->tp), + seq_printf(m, "%c:%s/%s %s:0x%0*llx", c, trace_probe_group_name(&tu->tp), trace_probe_name(&tu->tp), tu->filename, - (int)(sizeof(void *) * 2), tu->offset); + (int)(sizeof(void *) * 2), tu->consumer.offset); - if (tu->ref_ctr_offset) - seq_printf(m, "(0x%lx)", tu->ref_ctr_offset); + if (tu->consumer.ref_ctr_offset) + seq_printf(m, "(0x%llx)", tu->consumer.ref_ctr_offset); for (i = 0; i < tu->tp.nr_args; i++) seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm); @@ -1089,12 +1087,7 @@ static int trace_uprobe_enable(struct trace_uprobe *tu, filter_func_t filter) tu->consumer.filter = filter; tu->inode = d_real_inode(tu->path.dentry); - if (tu->ref_ctr_offset) - ret = uprobe_register_refctr(tu->inode, tu->offset, - tu->ref_ctr_offset, &tu->consumer); - else - ret = uprobe_register(tu->inode, tu->offset, &tu->consumer); - + ret = uprobe_register(tu->inode, &tu->consumer); if (ret) tu->inode = NULL; @@ -1112,7 +1105,7 @@ static void __probe_event_disable(struct trace_probe *tp) if (!tu->inode) continue; - uprobe_unregister(tu->inode, 
tu->offset, &tu->consumer); + uprobe_unregister(tu->inode, &tu->consumer); tu->inode = NULL; } } @@ -1310,7 +1303,7 @@ static int uprobe_perf_close(struct trace_event_call *call, return 0; list_for_each_entry(tu, trace_probe_probe_list(tp), tp.list) { - ret = uprobe_apply(tu->inode, tu->offset, &tu->consumer, false); + ret = uprobe_apply(tu->inode, tu->consumer.offset, &tu->consumer, false); if (ret) break; } @@ -1334,7 +1327,7 @@ static int uprobe_perf_open(struct trace_event_call *call, return 0; list_for_each_entry(tu, trace_probe_probe_list(tp), tp.list) { - err = uprobe_apply(tu->inode, tu->offset, &tu->consumer, true); + err = uprobe_apply(tu->inode, tu->consumer.offset, &tu->consumer, true); if (err) { uprobe_perf_close(call, event); break; @@ -1464,7 +1457,7 @@ int bpf_get_uprobe_info(const struct perf_event *event, u32 *fd_type, *fd_type = is_ret_probe(tu) ? BPF_FD_TYPE_URETPROBE : BPF_FD_TYPE_UPROBE; *filename = tu->filename; - *probe_offset = tu->offset; + *probe_offset = tu->consumer.offset; *probe_addr = 0; return 0; } @@ -1627,9 +1620,9 @@ create_local_trace_uprobe(char *name, unsigned long offs, return ERR_CAST(tu); } - tu->offset = offs; + tu->consumer.offset = offs; tu->path = path; - tu->ref_ctr_offset = ref_ctr_offset; + tu->consumer.ref_ctr_offset = ref_ctr_offset; tu->filename = kstrdup(name, GFP_KERNEL); if (!tu->filename) { ret = -ENOMEM; diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c index b0132a342bb5..ca7122cdbcd3 100644 --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c @@ -391,25 +391,24 @@ static int testmod_register_uprobe(loff_t offset) { int err = -EBUSY; - if (uprobe.offset) + if (uprobe.consumer.offset) return -EBUSY; mutex_lock(&testmod_uprobe_mutex); - if (uprobe.offset) + if (uprobe.consumer.offset) goto out; err = kern_path("/proc/self/exe", LOOKUP_FOLLOW, &uprobe.path); if (err) goto 
out; - err = uprobe_register_refctr(d_real_inode(uprobe.path.dentry), - offset, 0, &uprobe.consumer); - if (err) + uprobe.consumer.offset = offset; + err = uprobe_register(d_real_inode(uprobe.path.dentry), &uprobe.consumer); + if (err) { path_put(&uprobe.path); - else - uprobe.offset = offset; - + uprobe.consumer.offset = 0; + } out: mutex_unlock(&testmod_uprobe_mutex); return err; @@ -419,10 +418,9 @@ static void testmod_unregister_uprobe(void) { mutex_lock(&testmod_uprobe_mutex); - if (uprobe.offset) { - uprobe_unregister(d_real_inode(uprobe.path.dentry), - uprobe.offset, &uprobe.consumer); - uprobe.offset = 0; + if (uprobe.consumer.offset) { + uprobe_unregister(d_real_inode(uprobe.path.dentry), &uprobe.consumer); + uprobe.consumer.offset = 0; } mutex_unlock(&testmod_uprobe_mutex); From patchwork Mon Jul 1 22:39:29 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13718704 From: Andrii Nakryiko To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com Cc: peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko Subject: [PATCH v2 06/12] uprobes: add batch uprobe register/unregister APIs Date: Mon, 1 Jul 2024 15:39:29 -0700 Message-ID: <20240701223935.3783951-7-andrii@kernel.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org> References: <20240701223935.3783951-1-andrii@kernel.org> Precedence: bulk X-Mailing-List: linux-trace-kernel@vger.kernel.org MIME-Version: 1.0 Introduce batch versions of uprobe registration (attachment) and unregistration (detachment) APIs.
Unregistration is presumed to never fail, so that's easy. Batch registration can fail, so the semantics of uprobe_register_batch() are that either all uprobe_consumers are successfully attached or none of them remain attached after the return. There is no guarantee of atomicity of attachment, though, so while batch attachment is proceeding, some uprobes might start firing before others are completely attached. Even if overall attachment eventually fails, some successfully attached uprobes might fire and callers have to be prepared to handle that. This is in no way a regression compared to the current approach of attaching uprobes one-by-one, though. One crucial implementation detail is the addition of a `struct uprobe *uprobe` field to `struct uprobe_consumer`, which is meant for internal uprobe subsystem usage only. We use this field both as temporary storage (to avoid unnecessary allocations) and as a back link to the associated uprobe to simplify and speed up uprobe unregistration, as we can now avoid yet another tree lookup when unregistering a uprobe_consumer. The general direction with the uprobe registration implementation is to do batch attachment in distinct steps, each step performing some set of checks or actions on all uprobe_consumers before proceeding to the next phase. This, after some more changes in subsequent patches, allows us to batch locking for each phase and thus amortize any long delays that might be added by writer locks (especially once we switch uprobes_treelock to a per-CPU R/W semaphore later). Currently, uprobe_register_batch() performs all the sanity checks first. It then proceeds to allocate-and-insert (we'll split this up further later on) uprobe instances, as necessary. The last step is the actual uprobe registration for all affected VMAs. We take care to undo all the actions in the event of an error at any point in this lengthy process, so the end result is all-or-nothing, as described above.
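The index-callback convention plus the all-or-nothing rollback described above can be sketched in userspace as follows (mock types and error codes for illustration; the real implementation operates on struct inode and struct uprobe_consumer, takes uprobe->register_rwsem, and registers probes in each affected VMA):

```c
#include <assert.h>
#include <stddef.h>

struct mock_consumer {
	int attached;       /* stands in for the uc->uprobe back-link */
	int fail_on_attach; /* test hook simulating a register failure */
};

/* same shape as the uprobe_consumer_fn callback: index -> Nth consumer */
typedef struct mock_consumer *(*consumer_fn)(size_t idx, void *ctx);

/* identity helper mirroring uprobe_consumer_identity(): lets the
 * single-consumer register path reuse the batch implementation */
static struct mock_consumer *identity(size_t idx, void *ctx)
{
	(void)idx;
	return ctx;
}

/* typical caller-side callback: consumers stored in a plain array */
static struct mock_consumer *from_array(size_t idx, void *ctx)
{
	return &((struct mock_consumer *)ctx)[idx];
}

static int mock_register_batch(int cnt, consumer_fn get, void *ctx)
{
	int i, ret;

	for (i = 0; i < cnt; i++) {
		struct mock_consumer *uc = get(i, ctx);

		if (uc->fail_on_attach) {
			ret = -5; /* simulated attach error */
			goto rollback;
		}
		uc->attached = 1;
	}
	return 0;

rollback:
	/* all-or-nothing: detach everything attached before the failure */
	for (i--; i >= 0; i--)
		get(i, ctx)->attached = 0;
	return ret;
}

static int demo_success(void)
{
	struct mock_consumer c[2] = { {0, 0}, {0, 0} };

	if (mock_register_batch(2, from_array, c))
		return 0;
	return c[0].attached && c[1].attached;
}

static int demo_rollback(void)
{
	struct mock_consumer c[2] = { {0, 0}, {0, 1} };
	int ret = mock_register_batch(2, from_array, c);

	/* failed overall, and nothing is left attached */
	return ret == -5 && !c[0].attached && !c[1].attached;
}

static int demo_single(void)
{
	struct mock_consumer one = {0, 0};

	return mock_register_batch(1, identity, &one) == 0 && one.attached;
}
```

The array-style callback is the shape a caller like the bpf multi-uprobe link can use, while the identity callback is how the existing single-consumer uprobe_register() is layered on top of the batch API.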
Signed-off-by: Andrii Nakryiko --- include/linux/uprobes.h | 17 ++++ kernel/events/uprobes.c | 180 ++++++++++++++++++++++++++++------------ 2 files changed, 146 insertions(+), 51 deletions(-) diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index a75ba37ce3c8..a6e6eb70539d 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h @@ -33,6 +33,8 @@ enum uprobe_filter_ctx { UPROBE_FILTER_MMAP, }; +typedef struct uprobe_consumer *(*uprobe_consumer_fn)(size_t idx, void *ctx); + struct uprobe_consumer { int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs); int (*ret_handler)(struct uprobe_consumer *self, @@ -48,6 +50,8 @@ struct uprobe_consumer { loff_t ref_ctr_offset; /* for internal uprobe infra use, consumers shouldn't touch fields below */ struct uprobe_consumer *next; + /* associated uprobe instance (or NULL if consumer isn't attached) */ + struct uprobe *uprobe; }; #ifdef CONFIG_UPROBES @@ -116,8 +120,12 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs); extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs); extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t); extern int uprobe_register(struct inode *inode, struct uprobe_consumer *uc); +extern int uprobe_register_batch(struct inode *inode, int cnt, + uprobe_consumer_fn get_uprobe_consumer, void *ctx); extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool); extern void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc); +extern void uprobe_unregister_batch(struct inode *inode, int cnt, + uprobe_consumer_fn get_uprobe_consumer, void *ctx); extern int uprobe_mmap(struct vm_area_struct *vma); extern void uprobe_munmap(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void uprobe_start_dup_mmap(void); @@ -160,6 +168,11 @@ uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { return -ENOSYS; } 
+static inline int uprobe_register_batch(struct inode *inode, int cnt, + uprobe_consumer_fn get_uprobe_consumer, void *ctx) +{ + return -ENOSYS; +} static inline int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add) { @@ -169,6 +182,10 @@ static inline void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { } +static inline void uprobe_unregister_batch(struct inode *inode, int cnt, + uprobe_consumer_fn get_uprobe_consumer, void *ctx) +{ +} static inline int uprobe_mmap(struct vm_area_struct *vma) { return 0; diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 8759c6d0683e..68fdf1b8e4bf 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1221,6 +1221,41 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) (void)register_for_each_vma(uprobe, NULL); } +/* + * uprobe_unregister_batch - unregister a batch of already registered uprobe + * consumers. + * @inode: the file in which the probes have to be removed. + * @cnt: number of consumers to unregister + * @get_uprobe_consumer: a callback that returns Nth uprobe_consumer to detach + * @ctx: an arbitrary context passed through into get_uprobe_consumer callback + */ +void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn get_uprobe_consumer, void *ctx) +{ + struct uprobe *uprobe; + struct uprobe_consumer *uc; + int i; + + for (i = 0; i < cnt; i++) { + uc = get_uprobe_consumer(i, ctx); + uprobe = uc->uprobe; + + if (WARN_ON(!uprobe)) + continue; + + down_write(&uprobe->register_rwsem); + __uprobe_unregister(uprobe, uc); + up_write(&uprobe->register_rwsem); + put_uprobe(uprobe); + + uc->uprobe = NULL; + } +} + +static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx) +{ + return (struct uprobe_consumer *)ctx; +} + /* * uprobe_unregister - unregister an already registered probe. * @inode: the file in which the probe has to be removed.
@@ -1228,84 +1263,127 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) */ void uprobe_unregister(struct inode *inode, struct uprobe_consumer *uc) { - struct uprobe *uprobe; - - uprobe = find_uprobe(inode, uc->offset); - if (WARN_ON(!uprobe)) - return; - - down_write(&uprobe->register_rwsem); - __uprobe_unregister(uprobe, uc); - up_write(&uprobe->register_rwsem); - put_uprobe(uprobe); + uprobe_unregister_batch(inode, 1, uprobe_consumer_identity, uc); } EXPORT_SYMBOL_GPL(uprobe_unregister); /* - * __uprobe_register - register a probe - * @inode: the file in which the probe has to be placed. - * @offset: offset from the start of the file. - * @uc: information on howto handle the probe.. + * uprobe_register_batch - register a batch of probes for a given inode + * @inode: the file in which the probes have to be placed. + * @cnt: number of probes to register + * @get_uprobe_consumer: a callback that returns Nth uprobe_consumer + * @ctx: an arbitrary context passed through into get_uprobe_consumer callback + * + * uprobe_consumer instance itself contains offset and (optional) + * ref_ctr_offset within inode to attach to. + * + * On success, each attached uprobe_consumer assumes one refcount taken for + * respective uprobe instance (uniquely identified by inode+offset + * combination). Each uprobe_consumer is expected to eventually be detached + * through uprobe_unregister() or uprobe_unregister_batch() call, dropping + * their owning refcount. + * + * Caller of uprobe_register()/uprobe_register_batch() is required to keep + * @inode (and the containing mount) referenced. * - * Apart from the access refcount, __uprobe_register() takes a creation - * refcount (thro alloc_uprobe) if and only if this @uprobe is getting - * inserted into the rbtree (i.e first consumer for a @inode:@offset - * tuple). Creation refcount stops uprobe_unregister from freeing the - * @uprobe even before the register operation is complete. 
Creation - * refcount is released when the last @uc for the @uprobe - * unregisters. Caller of __uprobe_register() is required to keep @inode - * (and the containing mount) referenced. + * If not all probes are successfully installed, then all the successfully + * installed ones are rolled back. Note, there is no atomicity guarantees + * w.r.t. batch attachment. Some probes might start firing before batch + * attachment is completed. Even more so, some consumers might fire even if + * overall batch attachment ultimately fails. * * Return errno if it cannot successully install probes * else return 0 (success) */ -static int __uprobe_register(struct inode *inode, loff_t offset, - loff_t ref_ctr_offset, struct uprobe_consumer *uc) +int uprobe_register_batch(struct inode *inode, int cnt, + uprobe_consumer_fn get_uprobe_consumer, void *ctx) { struct uprobe *uprobe; - int ret; - - /* Uprobe must have at least one set consumer */ - if (!uc->handler && !uc->ret_handler) - return -EINVAL; + struct uprobe_consumer *uc; + int ret, i; /* copy_insn() uses read_mapping_page() or shmem_read_mapping_page() */ if (!inode->i_mapping->a_ops->read_folio && !shmem_mapping(inode->i_mapping)) return -EIO; - /* Racy, just to catch the obvious mistakes */ - if (offset > i_size_read(inode)) - return -EINVAL; - /* - * This ensures that copy_from_page(), copy_to_page() and - * __update_ref_ctr() can't cross page boundary. 
- */ - if (!IS_ALIGNED(offset, UPROBE_SWBP_INSN_SIZE)) - return -EINVAL; - if (!IS_ALIGNED(ref_ctr_offset, sizeof(short))) + if (cnt <= 0 || !get_uprobe_consumer) return -EINVAL; - uprobe = alloc_uprobe(inode, offset, ref_ctr_offset); - if (IS_ERR(uprobe)) - return PTR_ERR(uprobe); + for (i = 0; i < cnt; i++) { + uc = get_uprobe_consumer(i, ctx); + + /* Each consumer must have at least one set consumer */ + if (!uc || (!uc->handler && !uc->ret_handler)) + return -EINVAL; + /* Racy, just to catch the obvious mistakes */ + if (uc->offset > i_size_read(inode)) + return -EINVAL; + if (uc->uprobe) + return -EINVAL; + /* + * This ensures that copy_from_page(), copy_to_page() and + * __update_ref_ctr() can't cross page boundary. + */ + if (!IS_ALIGNED(uc->offset, UPROBE_SWBP_INSN_SIZE)) + return -EINVAL; + if (!IS_ALIGNED(uc->ref_ctr_offset, sizeof(short))) + return -EINVAL; + } - down_write(&uprobe->register_rwsem); - consumer_add(uprobe, uc); - ret = register_for_each_vma(uprobe, uc); - if (ret) - __uprobe_unregister(uprobe, uc); - up_write(&uprobe->register_rwsem); + for (i = 0; i < cnt; i++) { + uc = get_uprobe_consumer(i, ctx); - if (ret) - put_uprobe(uprobe); + uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset); + if (IS_ERR(uprobe)) { + ret = PTR_ERR(uprobe); + goto cleanup_uprobes; + } + + uc->uprobe = uprobe; + } + for (i = 0; i < cnt; i++) { + uc = get_uprobe_consumer(i, ctx); + uprobe = uc->uprobe; + + down_write(&uprobe->register_rwsem); + consumer_add(uprobe, uc); + ret = register_for_each_vma(uprobe, uc); + if (ret) + __uprobe_unregister(uprobe, uc); + up_write(&uprobe->register_rwsem); + + if (ret) + goto cleanup_unreg; + } + + return 0; + +cleanup_unreg: + /* unregister all uprobes we managed to register until failure */ + for (i--; i >= 0; i--) { + uc = get_uprobe_consumer(i, ctx); + + down_write(&uprobe->register_rwsem); + __uprobe_unregister(uc->uprobe, uc); + up_write(&uprobe->register_rwsem); + } +cleanup_uprobes: + /* put all the 
successfully allocated/reused uprobes */ + for (i = 0; i < cnt; i++) { + uc = get_uprobe_consumer(i, ctx); + + if (uc->uprobe) + put_uprobe(uc->uprobe); + uc->uprobe = NULL; + } return ret; } int uprobe_register(struct inode *inode, struct uprobe_consumer *uc) { - return __uprobe_register(inode, uc->offset, uc->ref_ctr_offset, uc); + return uprobe_register_batch(inode, 1, uprobe_consumer_identity, uc); } EXPORT_SYMBOL_GPL(uprobe_register);

From patchwork Mon Jul 1 22:39:30 2024
From: Andrii Nakryiko
To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com
Cc: peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko
Subject: [PATCH v2 07/12] uprobes: inline alloc_uprobe() logic into __uprobe_register()
Date: Mon, 1 Jul 2024 15:39:30 -0700
Message-ID: <20240701223935.3783951-8-andrii@kernel.org>
In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org>

To allow unbundling the currently tightly coupled alloc-uprobe-and-insert step, inline the alloc_uprobe() logic into the uprobe_register_batch() loop. It's called from one place, so we don't really lose much in terms of maintainability. No functional changes.
Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 65 ++++++++++++++++++----------------------- 1 file changed, 28 insertions(+), 37 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 68fdf1b8e4bf..0f928a72a658 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -869,40 +869,6 @@ ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe) (unsigned long long) uprobe->ref_ctr_offset); } -static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset, - loff_t ref_ctr_offset) -{ - struct uprobe *uprobe, *cur_uprobe; - - uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); - if (!uprobe) - return ERR_PTR(-ENOMEM); - - uprobe->inode = inode; - uprobe->offset = offset; - uprobe->ref_ctr_offset = ref_ctr_offset; - init_rwsem(&uprobe->register_rwsem); - init_rwsem(&uprobe->consumer_rwsem); - RB_CLEAR_NODE(&uprobe->rb_node); - atomic64_set(&uprobe->ref, 1); - - /* add to uprobes_tree, sorted on inode:offset */ - cur_uprobe = insert_uprobe(uprobe); - /* a uprobe exists for this inode:offset combination */ - if (cur_uprobe != uprobe) { - if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { - ref_ctr_mismatch_warn(cur_uprobe, uprobe); - put_uprobe(cur_uprobe); - kfree(uprobe); - return ERR_PTR(-EINVAL); - } - kfree(uprobe); - uprobe = cur_uprobe; - } - - return uprobe; -} - static void consumer_add(struct uprobe *uprobe, struct uprobe_consumer *uc) { down_write(&uprobe->consumer_rwsem); @@ -1332,14 +1298,39 @@ int uprobe_register_batch(struct inode *inode, int cnt, } for (i = 0; i < cnt; i++) { + struct uprobe *cur_uprobe; + uc = get_uprobe_consumer(i, ctx); - uprobe = alloc_uprobe(inode, uc->offset, uc->ref_ctr_offset); - if (IS_ERR(uprobe)) { - ret = PTR_ERR(uprobe); + uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); + if (!uprobe) { + ret = -ENOMEM; goto cleanup_uprobes; } + uprobe->inode = inode; + uprobe->offset = uc->offset; + uprobe->ref_ctr_offset = uc->ref_ctr_offset; + 
init_rwsem(&uprobe->register_rwsem); + init_rwsem(&uprobe->consumer_rwsem); + RB_CLEAR_NODE(&uprobe->rb_node); + atomic64_set(&uprobe->ref, 1); + + /* add to uprobes_tree, sorted on inode:offset */ + cur_uprobe = insert_uprobe(uprobe); + /* a uprobe exists for this inode:offset combination */ + if (cur_uprobe != uprobe) { + if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { + ref_ctr_mismatch_warn(cur_uprobe, uprobe); + put_uprobe(cur_uprobe); + kfree(uprobe); + ret = -EINVAL; + goto cleanup_uprobes; + } + kfree(uprobe); + uprobe = cur_uprobe; + } + uc->uprobe = uprobe; + }

From patchwork Mon Jul 1 22:39:31 2024
From: Andrii Nakryiko
To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com
Cc: peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko
Subject: [PATCH v2 08/12] uprobes: split uprobe allocation and uprobes_tree insertion steps
Date: Mon, 1 Jul 2024 15:39:31 -0700
Message-ID: <20240701223935.3783951-9-andrii@kernel.org>
In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org>

Now we are ready to split the coupled alloc-and-insert step into two separate phases. First, we allocate and prepare all potentially-to-be-inserted uprobe instances, assuming the corresponding uprobes are not yet in uprobes_tree. This is needed so that we don't do memory allocations under uprobes_treelock (once we batch locking for each step). Second, we insert new uprobes or reuse already existing ones in uprobes_tree.
Any uprobe that turned out to be not necessary is immediately freed, as there are no other references to it. This concludes preparations that make uprobes_register_batch() ready to batch and optimize locking per each phase. Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 0f928a72a658..128677ffe662 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1297,9 +1297,8 @@ int uprobe_register_batch(struct inode *inode, int cnt, return -EINVAL; } + /* pre-allocate new uprobe instances */ for (i = 0; i < cnt; i++) { - struct uprobe *cur_uprobe; - uc = get_uprobe_consumer(i, ctx); uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL); @@ -1316,6 +1315,15 @@ int uprobe_register_batch(struct inode *inode, int cnt, RB_CLEAR_NODE(&uprobe->rb_node); atomic64_set(&uprobe->ref, 1); + uc->uprobe = uprobe; + } + + for (i = 0; i < cnt; i++) { + struct uprobe *cur_uprobe; + + uc = get_uprobe_consumer(i, ctx); + uprobe = uc->uprobe; + /* add to uprobes_tree, sorted on inode:offset */ cur_uprobe = insert_uprobe(uprobe); /* a uprobe exists for this inode:offset combination */ @@ -1323,15 +1331,12 @@ int uprobe_register_batch(struct inode *inode, int cnt, if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { ref_ctr_mismatch_warn(cur_uprobe, uprobe); put_uprobe(cur_uprobe); - kfree(uprobe); ret = -EINVAL; goto cleanup_uprobes; } kfree(uprobe); - uprobe = cur_uprobe; + uc->uprobe = cur_uprobe; } - - uc->uprobe = uprobe; } for (i = 0; i < cnt; i++) { From patchwork Mon Jul 1 22:39:32 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13718707 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client 
certificate requested) by smtp.subspace.kernel.org; Mon, 1 Jul 2024 22:40:08 +0000 (UTC)
From: Andrii Nakryiko
To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com
Cc:
peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko
Subject: [PATCH v2 09/12] uprobes: batch uprobes_treelock during registration
Date: Mon, 1 Jul 2024 15:39:32 -0700
Message-ID: <20240701223935.3783951-10-andrii@kernel.org>
In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org>

Now that we have a clean separation of the registration steps, take uprobes_treelock just once per relevant registration step, and then process all relevant uprobes in one go. Even if the writer lock introduces a relatively large delay (as might happen with a per-CPU RW semaphore), this will keep overall batch attachment reasonably fast. We teach put_uprobe(), through a __put_uprobe() helper, to optionally take uprobes_treelock or assume it is already held, to accommodate this pattern. With these changes we don't need the insert_uprobe() operation that unconditionally takes uprobes_treelock, so get rid of it, leaving only the lower-level __insert_uprobe() helper.
Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 45 +++++++++++++++++++++-------------------- 1 file changed, 23 insertions(+), 22 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 128677ffe662..ced85284bbf4 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -665,7 +665,7 @@ static void uprobe_free_rcu(struct rcu_head *rcu) kfree(uprobe); } -static void put_uprobe(struct uprobe *uprobe) +static void __put_uprobe(struct uprobe *uprobe, bool tree_locked) { s64 v; @@ -683,7 +683,8 @@ static void put_uprobe(struct uprobe *uprobe) if (unlikely((u32)v == 0)) { bool destroy; - write_lock(&uprobes_treelock); + if (!tree_locked) + write_lock(&uprobes_treelock); /* * We might race with find_uprobe()->__get_uprobe() executed * from inside read-locked uprobes_treelock, which can bump @@ -706,7 +707,8 @@ static void put_uprobe(struct uprobe *uprobe) destroy = atomic64_read(&uprobe->ref) == v; if (destroy && uprobe_is_active(uprobe)) rb_erase(&uprobe->rb_node, &uprobes_tree); - write_unlock(&uprobes_treelock); + if (!tree_locked) + write_unlock(&uprobes_treelock); /* * Beyond here we don't need RCU protection, we are either the @@ -745,6 +747,11 @@ static void put_uprobe(struct uprobe *uprobe) rcu_read_unlock_trace(); } +static void put_uprobe(struct uprobe *uprobe) +{ + __put_uprobe(uprobe, false); +} + static __always_inline int uprobe_cmp(const struct inode *l_inode, const loff_t l_offset, const struct uprobe *r) @@ -844,21 +851,6 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe) return u; } -/* - * Acquire uprobes_treelock and insert uprobe into uprobes_tree - * (or reuse existing one, see __insert_uprobe() comments above). 
- */ -static struct uprobe *insert_uprobe(struct uprobe *uprobe) -{ - struct uprobe *u; - - write_lock(&uprobes_treelock); - u = __insert_uprobe(uprobe); - write_unlock(&uprobes_treelock); - - return u; -} - static void ref_ctr_mismatch_warn(struct uprobe *cur_uprobe, struct uprobe *uprobe) { @@ -1318,6 +1310,8 @@ int uprobe_register_batch(struct inode *inode, int cnt, uc->uprobe = uprobe; } + ret = 0; + write_lock(&uprobes_treelock); for (i = 0; i < cnt; i++) { struct uprobe *cur_uprobe; @@ -1325,19 +1319,24 @@ int uprobe_register_batch(struct inode *inode, int cnt, uprobe = uc->uprobe; /* add to uprobes_tree, sorted on inode:offset */ - cur_uprobe = insert_uprobe(uprobe); + cur_uprobe = __insert_uprobe(uprobe); /* a uprobe exists for this inode:offset combination */ if (cur_uprobe != uprobe) { if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { ref_ctr_mismatch_warn(cur_uprobe, uprobe); - put_uprobe(cur_uprobe); + + __put_uprobe(cur_uprobe, true); ret = -EINVAL; - goto cleanup_uprobes; + goto unlock_treelock; } kfree(uprobe); uc->uprobe = cur_uprobe; } } +unlock_treelock: + write_unlock(&uprobes_treelock); + if (ret) + goto cleanup_uprobes; for (i = 0; i < cnt; i++) { uc = get_uprobe_consumer(i, ctx); @@ -1367,13 +1366,15 @@ int uprobe_register_batch(struct inode *inode, int cnt, } cleanup_uprobes: /* put all the successfully allocated/reused uprobes */ + write_lock(&uprobes_treelock); for (i = 0; i < cnt; i++) { uc = get_uprobe_consumer(i, ctx); if (uc->uprobe) - put_uprobe(uc->uprobe); + __put_uprobe(uc->uprobe, true); uc->uprobe = NULL; } + write_unlock(&uprobes_treelock); return ret; } From patchwork Mon Jul 1 22:39:33 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13718708 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client 
certificate requested) by smtp.subspace.kernel.org; Mon, 1 Jul 2024 22:40:11 +0000 (UTC)
From: Andrii Nakryiko
To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com
Cc:
peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko
Subject: [PATCH v2 10/12] uprobes: improve lock batching for uprobe_unregister_batch
Date: Mon, 1 Jul 2024 15:39:33 -0700
Message-ID: <20240701223935.3783951-11-andrii@kernel.org>
In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org>

Similarly to what we did for uprobe_register_batch(), split uprobe_unregister_batch() into two separate phases with different locking needs. First, all the VMA unregistration is performed while holding a per-uprobe register_rwsem. Then, we take uprobes_treelock once, in batch mode, to __put_uprobe() all the uprobe_consumers. The uprobe_consumer->uprobe field is really handy in helping with this.

Signed-off-by: Andrii Nakryiko --- kernel/events/uprobes.c | 14 ++++++++++++-- 1 file changed, 12 insertions(+), 2 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index ced85284bbf4..bb480a2400e1 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1189,8 +1189,8 @@ __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) */ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn get_uprobe_consumer, void *ctx) { - struct uprobe *uprobe; struct uprobe_consumer *uc; + struct uprobe *uprobe; int i; for (i = 0; i < cnt; i++) { @@ -1203,10 +1203,20 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge down_write(&uprobe->register_rwsem); __uprobe_unregister(uprobe, uc); up_write(&uprobe->register_rwsem); - put_uprobe(uprobe); + } + write_lock(&uprobes_treelock); + for (i = 0; i < cnt; i++) { + uc = get_uprobe_consumer(i, ctx); + uprobe = uc->uprobe; + + if (!uprobe) + continue; + + 
__put_uprobe(uprobe, true); uc->uprobe = NULL; } + write_unlock(&uprobes_treelock); } static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)

From patchwork Mon Jul 1 22:39:34 2024
From: Andrii Nakryiko
To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com
Cc: peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko
Subject: [PATCH v2 11/12] uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
Date: Mon, 1 Jul 2024 15:39:34 -0700
Message-ID: <20240701223935.3783951-12-andrii@kernel.org>
In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org>

Switch the internals of BPF multi-uprobes to the batched versions of the uprobe registration and unregistration APIs. This also simplifies the BPF cleanup code a bit, thanks to the all-or-nothing guarantee of uprobe_register_batch().
Signed-off-by: Andrii Nakryiko --- kernel/trace/bpf_trace.c | 23 +++++++++-------------- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ba62baec3152..41bf6736c542 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -3173,14 +3173,11 @@ struct bpf_uprobe_multi_run_ctx { struct bpf_uprobe *uprobe; }; -static void bpf_uprobe_unregister(struct path *path, struct bpf_uprobe *uprobes, - u32 cnt) +static struct uprobe_consumer *umulti_link_get_uprobe_consumer(size_t idx, void *ctx) { - u32 i; + struct bpf_uprobe_multi_link *link = ctx; - for (i = 0; i < cnt; i++) { - uprobe_unregister(d_real_inode(path->dentry), &uprobes[i].consumer); - } + return &link->uprobes[idx].consumer; } static void bpf_uprobe_multi_link_release(struct bpf_link *link) @@ -3188,7 +3185,8 @@ static void bpf_uprobe_multi_link_release(struct bpf_link *link) struct bpf_uprobe_multi_link *umulti_link; umulti_link = container_of(link, struct bpf_uprobe_multi_link, link); - bpf_uprobe_unregister(&umulti_link->path, umulti_link->uprobes, umulti_link->cnt); + uprobe_unregister_batch(d_real_inode(umulti_link->path.dentry), umulti_link->cnt, + umulti_link_get_uprobe_consumer, umulti_link); if (umulti_link->task) put_task_struct(umulti_link->task); path_put(&umulti_link->path); @@ -3474,13 +3472,10 @@ int bpf_uprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog *pr bpf_link_init(&link->link, BPF_LINK_TYPE_UPROBE_MULTI, &bpf_uprobe_multi_link_lops, prog); - for (i = 0; i < cnt; i++) { - err = uprobe_register(d_real_inode(link->path.dentry), &uprobes[i].consumer); - if (err) { - bpf_uprobe_unregister(&path, uprobes, i); - goto error_free; - } - } + err = uprobe_register_batch(d_real_inode(link->path.dentry), cnt, + umulti_link_get_uprobe_consumer, link); + if (err) + goto error_free; err = bpf_link_prime(&link->link, &link_primer); if (err) From patchwork Mon Jul 1 22:39:35 2024 Content-Type: text/plain; 
charset="utf-8"
From: Andrii Nakryiko
To: linux-trace-kernel@vger.kernel.org, rostedt@goodmis.org, mhiramat@kernel.org, oleg@redhat.com
Cc: peterz@infradead.org, mingo@redhat.com, bpf@vger.kernel.org, jolsa@kernel.org, paulmck@kernel.org, clm@meta.com, Andrii Nakryiko
Subject: [PATCH v2 12/12] uprobes: switch uprobes_treelock to per-CPU RW semaphore
Date: Mon, 1 Jul 2024 15:39:35 -0700
Message-ID: <20240701223935.3783951-13-andrii@kernel.org>
In-Reply-To: <20240701223935.3783951-1-andrii@kernel.org>

With all the batch uprobe API work done, we are now finally ready to reap the benefits. Switch uprobes_treelock from a reader-writer spinlock to a much more efficient and scalable per-CPU RW semaphore.

Time for benchmarks and numbers. I've used BPF selftests' bench tool, the trig-uprobe-nop benchmark specifically, to see how total uprobe throughput scales with the number of competing threads (mapped to individual CPUs). Here are the results:

  # threads   BEFORE (mln/s)   AFTER (mln/s)
  ---------   --------------   -------------
          1        3.131            3.140
          2        3.394            3.601
          3        3.630            3.960
          4        3.317            3.551
          5        3.448            3.464
          6        3.345            3.283
          7        3.469            3.444
          8        3.182            3.258
          9        3.138            3.139
         10        2.999            3.212
         11        2.903            3.183
         12        2.802            3.027
         13        2.792            3.027
         14        2.695            3.086
         15        2.822            2.965
         16        2.679            2.939
         17        2.622            2.888
         18        2.628            2.914
         19        2.702            2.836
         20        2.561            2.837

One can see that the per-CPU RW semaphore-based implementation scales better with the number of CPUs, while single-CPU throughput stays basically the same. Note that scalability is still limited by register_rwsem; this will hopefully be addressed in follow-up patch set(s).
Signed-off-by: Andrii Nakryiko
---
 kernel/events/uprobes.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bb480a2400e1..1d76551e5e23 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
  */
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
-static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);	/* serialize rbtree access */
 
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
@@ -684,7 +684,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 	bool destroy;
 
 	if (!tree_locked)
-		write_lock(&uprobes_treelock);
+		percpu_down_write(&uprobes_treelock);
 	/*
 	 * We might race with find_uprobe()->__get_uprobe() executed
 	 * from inside read-locked uprobes_treelock, which can bump
@@ -708,7 +708,7 @@ static void __put_uprobe(struct uprobe *uprobe, bool tree_locked)
 	if (destroy && uprobe_is_active(uprobe))
 		rb_erase(&uprobe->rb_node, &uprobes_tree);
 	if (!tree_locked)
-		write_unlock(&uprobes_treelock);
+		percpu_up_write(&uprobes_treelock);
 
 	/*
 	 * Beyond here we don't need RCU protection, we are either the
@@ -816,9 +816,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
 	struct uprobe *uprobe;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	uprobe = __find_uprobe(inode, offset);
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 
 	return uprobe;
 }
@@ -1205,7 +1205,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		up_write(&uprobe->register_rwsem);
 	}
 
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
 		uprobe = uc->uprobe;
@@ -1216,7 +1216,7 @@ void uprobe_unregister_batch(struct inode *inode, int cnt, uprobe_consumer_fn ge
 		__put_uprobe(uprobe, true);
 		uc->uprobe = NULL;
 	}
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 }
 
 static struct uprobe_consumer *uprobe_consumer_identity(size_t idx, void *ctx)
@@ -1321,7 +1321,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 
 	ret = 0;
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		struct uprobe *cur_uprobe;
 
@@ -1344,7 +1344,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 		}
 	}
 unlock_treelock:
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 	if (ret)
 		goto cleanup_uprobes;
 
@@ -1376,7 +1376,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 	}
 cleanup_uprobes:
 	/* put all the successfully allocated/reused uprobes */
-	write_lock(&uprobes_treelock);
+	percpu_down_write(&uprobes_treelock);
 	for (i = 0; i < cnt; i++) {
 		uc = get_uprobe_consumer(i, ctx);
 
@@ -1384,7 +1384,7 @@ int uprobe_register_batch(struct inode *inode, int cnt,
 			__put_uprobe(uc->uprobe, true);
 		uc->uprobe = NULL;
 	}
-	write_unlock(&uprobes_treelock);
+	percpu_up_write(&uprobes_treelock);
 
 	return ret;
 }
@@ -1492,7 +1492,7 @@ static void build_probe_list(struct inode *inode,
 	min = vaddr_to_offset(vma, start);
 	max = min + (end - start) - 1;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	n = find_node_in_range(inode, min, max);
 	if (n) {
 		for (t = n; t; t = rb_prev(t)) {
@@ -1510,7 +1510,7 @@ static void build_probe_list(struct inode *inode,
 			list_add(&u->pending_list, head);
 		}
 	}
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 }
 
 /* @vma contains reference counter, not the probed instruction. */
@@ -1601,9 +1601,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
 	min = vaddr_to_offset(vma, start);
 	max = min + (end - start) - 1;
 
-	read_lock(&uprobes_treelock);
+	percpu_down_read(&uprobes_treelock);
 	n = find_node_in_range(inode, min, max);
-	read_unlock(&uprobes_treelock);
+	percpu_up_read(&uprobes_treelock);
 
 	return !!n;
 }