From patchwork Wed Jun 5 00:24:46 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13686006 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 80A5C13A401; Wed, 5 Jun 2024 00:25:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547114; cv=none; b=hx+n+MqyzMau9NajK00LRqQxC9+5+AE/x2P5nd65pq1RSFeWfffBgnbMDzlsUtWM+gKATNVv70m5cvcTOv+LR3Igp1Ua+/9Kbt/l2Agdg1JBxOAvABU7oeARdZQus2toS7T9T1XHuuM9Bh6b3VG+IAX9dbl5uKuxaXbfNbVYuTk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547114; c=relaxed/simple; bh=MdL1CnIYtJxVuaoJQ4v36gHWpY8r7Nqra5mr+RAHejU=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=B/kYrKe8CHwhcTVFMzFZxwMJVJ1lcqC7DlUVOzjxBoynL+3vy3v3y4CAKoicTFZKHwRgFIlXmXbarP7R1fIvx4he9cBGkGQkO0eCCUqXJ1PON4dWtE6DJCFp168f2U1PewM+0K/j0slYCSqZfvsrahytlvd/MoTPOQ0M2kK1sqs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=NOGRQgVN; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="NOGRQgVN" Received: by smtp.kernel.org (Postfix) with ESMTPSA id D5E94C2BBFC; Wed, 5 Jun 2024 00:25:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1717547114; bh=MdL1CnIYtJxVuaoJQ4v36gHWpY8r7Nqra5mr+RAHejU=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=NOGRQgVNavyC3fwxEaZyJrjxDcYUvSoz/8YM6QedZh1sBl8wowm7FizTiavEjNHn/ 
EvjD1syryMww2h/Cd4Bk1oUVRd0afZpPTXHLoi/FeUHaRnDFDDMMPRviTgxBJSCq12 tt9tGFK9cPMKulFga2lx7O7/gt+WZaNagmDdfWm6cGfLykpdKe8Wt4dLAvnv8fo30u qTNSE/DJb4zv2vaKfCZ/NqrLUbBBvVZqt3vYDDF4cw4Wz8o6nIehEZOjQ2NDnzHcqM gl4tEPY3uixZj4n7PhMAbTTCssXdChHgeYz/B+Quhh2vTHPQC0HknreP7LHgGlYcRy WSMJWgtfEVf9w==
From: Andrii Nakryiko
To: linux-fsdevel@vger.kernel.org, brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org, gregkh@linuxfoundation.org, linux-mm@kvack.org, liam.howlett@oracle.com, surenb@google.com, rppt@kernel.org, Andrii Nakryiko
Subject: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock
Date: Tue, 4 Jun 2024 17:24:46 -0700
Message-ID: <20240605002459.4091285-2-andrii@kernel.org>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20240605002459.4091285-1-andrii@kernel.org>
References: <20240605002459.4091285-1-andrii@kernel.org>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0

The existing lock_vma_under_rcu() API assumes an exact VMA match, so it
is not a full equivalent of find_vma(). There are use cases that do want
the find_vma() semantics of finding an exact VMA or the next one.

Also, it's important for such an API to let the user distinguish between
not being able to get the per-VMA lock and not having any VMAs at or
after the provided address.

As such, this patch adds a new find_vma()-like API,
find_and_lock_vma_rcu(), which finds the exact or next VMA, attempts to
take the per-VMA lock, and if that fails, returns ERR_PTR(-EBUSY). It
still returns NULL if there is no VMA at or after the address. On
success it returns a valid and non-isolated VMA with the VMA lock taken.

This API will be used in a subsequent patch in this patch set to
implement a new user-facing API for querying process VMAs.
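The three-way return convention described above (valid VMA, NULL, or
ERR_PTR(-EBUSY)) can be sketched in plain userspace C. This is only an
illustration, not kernel code: ERR_PTR()/IS_ERR()/PTR_ERR() below are
local stand-ins for the kernel helpers of the same name, and struct
vma_model, find_and_lock_model() and classify() are hypothetical names
invented for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-MAX_ERRNO);
}

/* Hypothetical toy VMA: [start, end) plus a flag modeling a failed lock. */
struct vma_model {
	unsigned long start, end;
	int busy;
};

/*
 * Toy lookup over an address-sorted array: returns the VMA covering addr
 * or the next one, ERR_PTR(-EBUSY) if that VMA cannot be locked, or NULL
 * if there is no VMA at or after addr.
 */
static struct vma_model *find_and_lock_model(struct vma_model *vmas, size_t n,
					     unsigned long addr)
{
	for (size_t i = 0; i < n; i++) {
		if (vmas[i].end > addr)
			return vmas[i].busy ? ERR_PTR(-EBUSY) : &vmas[i];
	}
	return NULL;
}

enum lookup_result { LOOKUP_FOUND, LOOKUP_NONE, LOOKUP_BUSY };

/* How a caller can tell the three outcomes apart. */
static enum lookup_result classify(const struct vma_model *vma)
{
	if (!vma)
		return LOOKUP_NONE;	/* nothing at or after the address */
	if (IS_ERR(vma))
		return LOOKUP_BUSY;	/* lock not taken, e.g. -EBUSY */
	return LOOKUP_FOUND;		/* locked VMA, caller must unlock it */
}
```

What a caller does on LOOKUP_BUSY is its own policy choice; this sketch
only shows that the result stays distinguishable from "no VMA".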

Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Liam Howlett
Signed-off-by: Andrii Nakryiko
---
 include/linux/mm.h |  8 ++++++
 mm/memory.c        | 62 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c41c82bcbec2..3ab52b7e124c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -776,6 +776,8 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
 	mmap_assert_locked(vmf->vma->vm_mm);
 }
 
+struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+					      unsigned long address);
 struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 					  unsigned long address);
 
@@ -790,6 +792,12 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 				     bool detached) {}
 
+static inline struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+							    unsigned long address)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
 		unsigned long address)
 {
diff --git a/mm/memory.c b/mm/memory.c
index eef4e482c0c2..c9517742bd6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5913,6 +5913,68 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
 #endif
 
 #ifdef CONFIG_PER_VMA_LOCK
+/*
+ * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
+ * next VMA. Search is done under RCU protection, without taking or assuming
+ * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
+ *
+ * @mm: The mm_struct to check
+ * @address: The address
+ *
+ * Returns: The VMA associated with address, or the next VMA.
+ * May return %NULL in the case of no VMA at address or above.
+ * If the VMA is being modified and can't be locked, -EBUSY is returned.
+ */
+struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+					      unsigned long address)
+{
+	MA_STATE(mas, &mm->mm_mt, address, address);
+	struct vm_area_struct *vma;
+	int err;
+
+	rcu_read_lock();
+retry:
+	vma = mas_find(&mas, ULONG_MAX);
+	if (!vma) {
+		err = 0; /* no VMA, return NULL */
+		goto inval;
+	}
+
+	if (!vma_start_read(vma)) {
+		err = -EBUSY;
+		goto inval;
+	}
+
+	/*
+	 * Check since vm_start/vm_end might change before we lock the VMA.
+	 * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
+	 * address or the next one, so we only make sure VMA wasn't updated to
+	 * end before the address.
+	 */
+	if (unlikely(vma->vm_end <= address)) {
+		err = -EBUSY;
+		goto inval_end_read;
+	}
+
+	/* Check if the VMA got isolated after we found it */
+	if (vma->detached) {
+		vma_end_read(vma);
+		count_vm_vma_lock_event(VMA_LOCK_MISS);
+		/* The area was replaced with another one */
+		goto retry;
+	}
+
+	rcu_read_unlock();
+	return vma;
+
+inval_end_read:
+	vma_end_read(vma);
+inval:
+	rcu_read_unlock();
+	count_vm_vma_lock_event(VMA_LOCK_ABORT);
+	return ERR_PTR(err);
+}
+
 /*
  * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
  * stable and not isolated.
If the VMA is not found or is being modified the From patchwork Wed Jun 5 00:24:47 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13686007 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C624D13A885; Wed, 5 Jun 2024 00:25:17 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547117; cv=none; b=KQLazRBn3OB9ubvZIYqhR/jmrImYubo0Nh1VLbTERwMG3yh40xxXMUE3y/DqOZgQdSjwVgirc7ki1zj1g3e+bTqX4P1Shwa6VEJBwhacJbCB9eN3HFWfoRZVJ2ZISZ7ady7+dI9fWG71zJIWczkfv41bkZWRNeEnSpf1MQO2EWs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547117; c=relaxed/simple; bh=MRe3J2lR02W8fKMz0jOORh8WM6XjEyHtsdoj2x2PKJ8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=nDFr/BhRwQG+FFSrZMndLhNrwOYz7QDDh0+ExxhEBa4ass2c6NlhfHT2wOCElqiGZJEp/n2hB0oabSyx4WF5W7mPEtqLRN1U1mVFwP1P6b8ybfJOiTtg2T+qvzdhjyEnBCeIS99sVvQMv9Qv9YnQc/1za0tYiHTp0IyOJldfD8Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=U2n4XMzf; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="U2n4XMzf" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 2E0F9C2BBFC; Wed, 5 Jun 2024 00:25:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1717547117; bh=MRe3J2lR02W8fKMz0jOORh8WM6XjEyHtsdoj2x2PKJ8=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; 
b=U2n4XMzfqQtlHII1pzt80oFo9YPq+fDQa4PXaugpXWcJF4d1I2pk1JT+RuboPqznW WSXGEMTL5blWc5VR6nUNwCIrk7CaRzbzBM5P7G0L0RlKgx3pyNTjuH5cQyw6d+ayJe rp4XqvEIjAxQeYjG5UZH4m3NAPDS1RKlY8tZ2wKqVfcobGVJY8VY4szQLlLQHGJ9ev WkYq9peZRmUxa8FaJZ+a+TZXGGAYdRwyqhrwA8FhDWd92cnQ6hS6NiS5HhispaoK3F OzDpPYsw4203pK+abejqee1NMfjjHnaKRwWsu+r4SXlfLTgXGivRptL5/oF+AOB5kI 9KO1xQQMAGJCg==
From: Andrii Nakryiko
To: linux-fsdevel@vger.kernel.org, brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org, gregkh@linuxfoundation.org, linux-mm@kvack.org, liam.howlett@oracle.com, surenb@google.com, rppt@kernel.org, Andrii Nakryiko
Subject: [PATCH v3 2/9] fs/procfs: extract logic for getting VMA name constituents
Date: Tue, 4 Jun 2024 17:24:47 -0700
Message-ID: <20240605002459.4091285-3-andrii@kernel.org>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20240605002459.4091285-1-andrii@kernel.org>
References: <20240605002459.4091285-1-andrii@kernel.org>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0

Extract generic logic to fetch relevant pieces of data to describe VMA
name. This could be just some string (either special constant or
user-provided), or a string with some formatted wrapping text (e.g.,
"[anon_shmem:]"), or, commonly, a file path. seq_file-based logic has
different methods to handle all three cases, but they are currently
mixed in with extracting underlying sources of data.

This patch splits this into data fetching and data formatting, so that
data fetching can be reused later on.

There should be no functional changes.
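The fetching/formatting split can be pictured with a small userspace
sketch of the formatting half. It is a simplified model under stated
assumptions: struct name_parts and format_vma_name() are hypothetical
names, and the path is modeled as a plain string instead of a
struct path.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model of the three outputs handed back by the fetch step. */
struct name_parts {
	const char *path;	/* backing file path, modeled as a string */
	const char *name;	/* constant or user-provided name */
	const char *name_fmt;	/* optional printf-style wrapper with one %s */
};

/*
 * Formatting step only. It follows the same priority as the patched
 * show_map_vma(): path first, then name_fmt + name, then bare name.
 * Returns snprintf()'s result, or 0 if there is nothing to print.
 */
static int format_vma_name(char *buf, size_t sz, const struct name_parts *p)
{
	if (p->path)
		return snprintf(buf, sz, "%s", p->path);
	if (p->name_fmt)
		return snprintf(buf, sz, p->name_fmt, p->name);
	if (p->name)
		return snprintf(buf, sz, "%s", p->name);
	if (sz)
		buf[0] = '\0';
	return 0;
}
```

Because the fetch step only fills in pointers, a different consumer can
format the same data into its own buffer without touching seq_file.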
Signed-off-by: Andrii Nakryiko --- fs/proc/task_mmu.c | 125 +++++++++++++++++++++++++-------------------- 1 file changed, 71 insertions(+), 54 deletions(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index f8d35f993fe5..334ae210a95a 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -239,6 +239,67 @@ static int do_maps_open(struct inode *inode, struct file *file, sizeof(struct proc_maps_private)); } +static void get_vma_name(struct vm_area_struct *vma, + const struct path **path, + const char **name, + const char **name_fmt) +{ + struct anon_vma_name *anon_name = vma->vm_mm ? anon_vma_name(vma) : NULL; + + *name = NULL; + *path = NULL; + *name_fmt = NULL; + + /* + * Print the dentry name for named mappings, and a + * special [heap] marker for the heap: + */ + if (vma->vm_file) { + /* + * If user named this anon shared memory via + * prctl(PR_SET_VMA ..., use the provided name. + */ + if (anon_name) { + *name_fmt = "[anon_shmem:%s]"; + *name = anon_name->name; + } else { + *path = file_user_path(vma->vm_file); + } + return; + } + + if (vma->vm_ops && vma->vm_ops->name) { + *name = vma->vm_ops->name(vma); + if (*name) + return; + } + + *name = arch_vma_name(vma); + if (*name) + return; + + if (!vma->vm_mm) { + *name = "[vdso]"; + return; + } + + if (vma_is_initial_heap(vma)) { + *name = "[heap]"; + return; + } + + if (vma_is_initial_stack(vma)) { + *name = "[stack]"; + return; + } + + if (anon_name) { + *name_fmt = "[anon:%s]"; + *name = anon_name->name; + return; + } +} + static void show_vma_header_prefix(struct seq_file *m, unsigned long start, unsigned long end, vm_flags_t flags, unsigned long long pgoff, @@ -262,17 +323,15 @@ static void show_vma_header_prefix(struct seq_file *m, static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma) { - struct anon_vma_name *anon_name = NULL; - struct mm_struct *mm = vma->vm_mm; - struct file *file = vma->vm_file; + const struct path *path; + const char *name_fmt, *name; vm_flags_t flags = 
vma->vm_flags; unsigned long ino = 0; unsigned long long pgoff = 0; unsigned long start, end; dev_t dev = 0; - const char *name = NULL; - if (file) { + if (vma->vm_file) { const struct inode *inode = file_user_inode(vma->vm_file); dev = inode->i_sb->s_dev; @@ -283,57 +342,15 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma) start = vma->vm_start; end = vma->vm_end; show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino); - if (mm) - anon_name = anon_vma_name(vma); - /* - * Print the dentry name for named mappings, and a - * special [heap] marker for the heap: - */ - if (file) { + get_vma_name(vma, &path, &name, &name_fmt); + if (path) { seq_pad(m, ' '); - /* - * If user named this anon shared memory via - * prctl(PR_SET_VMA ..., use the provided name. - */ - if (anon_name) - seq_printf(m, "[anon_shmem:%s]", anon_name->name); - else - seq_path(m, file_user_path(file), "\n"); - goto done; - } - - if (vma->vm_ops && vma->vm_ops->name) { - name = vma->vm_ops->name(vma); - if (name) - goto done; - } - - name = arch_vma_name(vma); - if (!name) { - if (!mm) { - name = "[vdso]"; - goto done; - } - - if (vma_is_initial_heap(vma)) { - name = "[heap]"; - goto done; - } - - if (vma_is_initial_stack(vma)) { - name = "[stack]"; - goto done; - } - - if (anon_name) { - seq_pad(m, ' '); - seq_printf(m, "[anon:%s]", anon_name->name); - } - } - -done: - if (name) { + seq_path(m, path, "\n"); + } else if (name_fmt) { + seq_pad(m, ' '); + seq_printf(m, name_fmt, name); + } else if (name) { seq_pad(m, ' '); seq_puts(m, name); } From patchwork Wed Jun 5 00:24:48 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13686008 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 
1DA7B13AD18; Wed, 5 Jun 2024 00:25:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547121; cv=none; b=nfbn2Pm4dU3QXXXoN7rL9gkO/rQYP7u8eQkvKQjYiiLXVhZBBiLUNw37vHzqC9TVGDWy+3ZUMYydH+VbfpJuDCMPWVf1tzaSwVaNx4RgmgngMrXOTkJ7FO5oCvH6WV3VQrhcmtCSOAH27sUIxB1ca+TR16+M92pyPRzYPorNc3Q= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547121; c=relaxed/simple; bh=aE4lwRQpXMQzqaA6H+9edhrn83KZ0AMQxprZsEKorRM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=K3tQpkPvzmntF8QFT1VvPb245xPW4OzEEisv2h9qBVbhJ7n36fAXDYg3bMeDGVOS/wzNDPXdHE/fUt6pIMGQv2y547yCmSoALEVK11bdsLxXFJE/2yOkEWBiDlCkggZA0kgezDMm2jZpuqHKjD1xsk7e4zXrYhjFUydfEiV+xKM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=fQv+9SJo; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="fQv+9SJo" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 77831C4AF0A; Wed, 5 Jun 2024 00:25:20 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1717547120; bh=aE4lwRQpXMQzqaA6H+9edhrn83KZ0AMQxprZsEKorRM=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=fQv+9SJogYYssbBDwL8eChULYMMxRESk4n1UJ+b5dfECdLXG0yhx7fNqshC9undiJ n9Lsmu+fNlD9lqBLfBUsP+ZFW3qOT3DpFyRe3qsF+cbbXuXl9oMNXYMetjdgtPPLIm EJ6Pp3DctdTH8BPvRv1Kvz3phjaqP7zFJA0sGaUz6jtqp/VRn2tTJ/BqmmzPeV7y6r PXff7UC1bAXFMSgwxmkOwKh+1nMp/UuAKcwY6UPLGR5VhiNuFaY+0Ctx+xnJiGDtHM lcHukrPRQ8OyCCFxCiOIWuUAVN2afNX5TzylOf0WkCDuysy8IIe5h5kFYRrw8htR8z EmFT2IZUdrIDQ== From: Andrii Nakryiko To: linux-fsdevel@vger.kernel.org, brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org, gregkh@linuxfoundation.org, 
 linux-mm@kvack.org, liam.howlett@oracle.com, surenb@google.com, rppt@kernel.org, Andrii Nakryiko
Subject: [PATCH v3 3/9] fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
Date: Tue, 4 Jun 2024 17:24:48 -0700
Message-ID: <20240605002459.4091285-4-andrii@kernel.org>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20240605002459.4091285-1-andrii@kernel.org>
References: <20240605002459.4091285-1-andrii@kernel.org>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0

The /proc/<pid>/maps file is extremely useful in practice for various
tasks involving figuring out process memory layout, what files are
backing any given memory range, etc. One important class of applications
that absolutely rely on this are profilers/stack symbolizers (perf tool
being one of them). Patterns of use differ, but they generally fall into
two categories.

In the on-demand pattern, a profiler/symbolizer would normally capture a
stack trace containing absolute memory addresses of some functions, and
would then use the /proc/<pid>/maps file to find the corresponding
backing ELF files (normally, only executable VMAs are of interest) and
file offsets within them, and then continue from there to get yet more
information (ELF symbols, DWARF information) to produce human-readable
symbolic information. This pattern is used by Meta's fleet-wide
profiler, as one example.

In the preprocessing pattern, the application doesn't know the set of
addresses of interest, so it has to fetch all relevant VMAs (again,
probably only executable ones), store or cache them, then proceed with
profiling and stack trace capture. Once done, it would do symbolization
based on the stored VMA information. This can happen at a much later
point in time. This pattern is used by the perf tool, as an example.

In either case, there are both performance and correctness requirements
involved.
This address to VMA information translation has to be done as
efficiently as possible, but also must not miss any VMA (especially in
the case of loading/unloading shared libraries). In practice,
correctness can't be guaranteed (due to the process dying before VMA
data can be captured, or a shared library being unloaded, etc), but any
effort to maximize the chance of finding the VMA is appreciated.

Unfortunately, for all the universality and usefulness of the
/proc/<pid>/maps file, it doesn't fit the above use cases 100%.

First, its main purpose is to emit all VMAs sequentially, but in
practice captured addresses would fall only into a smaller subset of all
of the process' VMAs, mainly those containing executable text. Yet, a
library would need to parse most or all of the contents to find the
needed VMAs, as there is no way to skip VMAs that are of no use. An
efficient library can do the linear pass relatively cheaply, but it's
definitely an overhead that could be avoided if there were a way to do
more targeted querying of the relevant VMA information.

Second, it's a text-based interface, which makes its programmatic use
from applications and libraries more cumbersome and inefficient due to
the need to handle text parsing to get the necessary pieces of
information. The overhead is actually paid both by the kernel,
formatting originally binary VMA data into text, and then by the user
space application, parsing it back into binary data for further use.

For the on-demand pattern of usage, described above, another problem
when writing a generic stack trace symbolization library is an
unfortunate performance-vs-correctness tradeoff that needs to be made.
The library has to decide to either cache the parsed contents of
/proc/<pid>/maps (after initial processing) to service future requests
without the higher cost of re-parsing this file (such requests arise
when the application asks to symbolize another set of addresses for the
same process, captured at some later time, which is typical for
periodic/continuous profiling cases).
Or it has to re-open and re-parse the file on every request to make sure
it works with fresh data. In the former case, more memory is used for
the cache and there is a risk of getting stale data if the application
loads or unloads shared libraries, or otherwise changes its set of VMAs
somehow, e.g., through additional mmap() calls. In the latter case, it's
the performance hit that comes from re-opening the file and re-parsing
its contents all over again.

This patch aims to solve this problem by providing a new API built on
top of /proc/<pid>/maps. It's meant to address both the
non-selectiveness and the text nature of /proc/<pid>/maps, by giving the
user more control over what sort of VMA(s) needs to be queried, and,
being a binary-based interface, it eliminates the overhead of text
formatting (on the kernel side) and parsing (on the user space side).

It's also designed to be extensible and forward/backward compatible by
including a required struct size field, which the user has to provide.
We use the established copy_struct_from_user() approach to handle
extensibility.

The user has a choice of either getting the VMA that covers the provided
address, or -ENOENT if none is found (the exact, least surprising,
case). Or, with an extra query flag (PROCMAP_QUERY_COVERING_OR_NEXT_VMA),
they can get either the VMA that covers the address (if there is one),
or the closest next VMA (i.e., the VMA with the smallest
vm_start > addr). The latter allows more efficient use, but, given it
could be a surprising behavior, requires an explicit opt-in.

There is another query flag that is useful for some use cases.
PROCMAP_QUERY_FILE_BACKED_VMA instructs this API to only return
file-backed VMAs. Combining this with PROCMAP_QUERY_COVERING_OR_NEXT_VMA
makes it possible to efficiently iterate only file-backed VMAs of the
process, which is what profilers/symbolizers are normally interested in.

All the above querying flags can be combined with an (also optional) set
of desired VMA permission flags.
This allows, for example, iterating only the executable subset of VMAs,
which is what the preprocessing pattern, used by the perf tool, would
benefit from, as the assumption is that captured stack traces would have
addresses of executable code. This saves time by skipping non-executable
VMAs altogether.

All these querying flags (modifiers) are orthogonal and can be combined
in a semantically meaningful and natural way.

Basing this ioctl()-based API on top of the /proc/<pid>/maps FD makes
sense given it's querying the same set of VMA data. It's also beneficial
because permission checks for /proc/<pid>/maps are performed once at
open time, and the actual read of the text contents of /proc/<pid>/maps
is done without further permission checks. We piggyback on this pattern
with the ioctl()-based API as well, as that's a desired property, both
for performance reasons and for security and flexibility reasons.
Allowing an application to open an FD for /proc/self/maps without any
extra capabilities, and then pass it to some sort of profiling agent
through a Unix-domain socket, would allow such a profiling agent to not
require some of the capabilities that are otherwise expected when
opening the /proc/<pid>/maps file for *another* process. This is a
desirable property for some more restricted setups.

This new ioctl-based implementation doesn't interfere with the
seq_file-based implementation of the /proc/<pid>/maps textual interface,
and so the two can be used together or independently without paying any
price for that.

Note also that fetching the VMA name (e.g., backing file path, or
special hard-coded or user-provided names) is optional, just like build
ID. If the user sets vma_name_size to zero, kernel code won't attempt to
retrieve it, saving resources.

To simplify reviewing, per-VMA locking is not yet added in this patch,
but the overall code structure is ready for it and will be adjusted in
the next patch to take per-VMA locking into account.
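For illustration, user space could drive the query roughly as sketched
below. This is a hedged example, not part of the patch: query_vma() is a
hypothetical helper, and it assumes UAPI headers that already define
PROCMAP_QUERY and struct procmap_query in <linux/fs.h>. With older
headers the helper compiles to a stub returning -ENOSYS, and on kernels
without this ioctl the call simply reports the errno set by ioctl().

```c
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

/*
 * Hypothetical helper: find the VMA covering addr, or the next one,
 * through an already opened /proc/<pid>/maps FD. On success returns 0,
 * fills *start/*end and writes the VMA name (possibly empty) to name.
 * On failure returns a negative errno value.
 */
static int query_vma(int maps_fd, uint64_t addr, char *name, uint32_t name_sz,
		     uint64_t *start, uint64_t *end)
{
#ifdef PROCMAP_QUERY
	struct procmap_query q;

	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);		/* struct size, for extensibility */
	q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;
	q.query_addr = addr;
	/* name size and address must be both set or both zero */
	q.vma_name_size = name_sz;
	q.vma_name_addr = name_sz ? (uint64_t)(uintptr_t)name : 0;

	if (ioctl(maps_fd, PROCMAP_QUERY, &q) < 0)
		return -errno;

	/* kernel reports zero size when the VMA has no name */
	if (name_sz && q.vma_name_size == 0)
		name[0] = '\0';
	*start = q.vma_start;
	*end = q.vma_end;
	return 0;
#else
	/* headers predate this API; nothing to call */
	(void)maps_fd; (void)addr; (void)name; (void)name_sz;
	(void)start; (void)end;
	return -ENOSYS;
#endif
}
```

To walk all matching VMAs, a caller would start at address 0 and pass
the returned vma_end as the next query address until -ENOENT comes back.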
Signed-off-by: Andrii Nakryiko --- fs/proc/task_mmu.c | 218 ++++++++++++++++++++++++++++++++++++++++ include/uapi/linux/fs.h | 128 ++++++++++++++++++++++- 2 files changed, 345 insertions(+), 1 deletion(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 334ae210a95a..614fbe5d0667 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -375,11 +375,229 @@ static int pid_maps_open(struct inode *inode, struct file *file) return do_maps_open(inode, file, &proc_pid_maps_op); } +#define PROCMAP_QUERY_VMA_FLAGS ( \ + PROCMAP_QUERY_VMA_READABLE | \ + PROCMAP_QUERY_VMA_WRITABLE | \ + PROCMAP_QUERY_VMA_EXECUTABLE | \ + PROCMAP_QUERY_VMA_SHARED \ +) + +#define PROCMAP_QUERY_VALID_FLAGS_MASK ( \ + PROCMAP_QUERY_COVERING_OR_NEXT_VMA | \ + PROCMAP_QUERY_FILE_BACKED_VMA | \ + PROCMAP_QUERY_VMA_FLAGS \ +) + +static int query_vma_setup(struct mm_struct *mm) +{ + return mmap_read_lock_killable(mm); +} + +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) +{ + mmap_read_unlock(mm); +} + +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) +{ + return find_vma(mm, addr); +} + +static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, + unsigned long addr, u32 flags) +{ + struct vm_area_struct *vma; + +next_vma: + vma = query_vma_find_by_addr(mm, addr); + if (!vma) + goto no_vma; + + /* user requested only file-backed VMA, keep iterating */ + if ((flags & PROCMAP_QUERY_FILE_BACKED_VMA) && !vma->vm_file) + goto skip_vma; + + /* VMA permissions should satisfy query flags */ + if (flags & PROCMAP_QUERY_VMA_FLAGS) { + u32 perm = 0; + + if (flags & PROCMAP_QUERY_VMA_READABLE) + perm |= VM_READ; + if (flags & PROCMAP_QUERY_VMA_WRITABLE) + perm |= VM_WRITE; + if (flags & PROCMAP_QUERY_VMA_EXECUTABLE) + perm |= VM_EXEC; + if (flags & PROCMAP_QUERY_VMA_SHARED) + perm |= VM_MAYSHARE; + + if ((vma->vm_flags & perm) != perm) + goto skip_vma; + } + + /* found covering VMA or user is OK with the 
matching next VMA */ + if ((flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) || vma->vm_start <= addr) + return vma; + +skip_vma: + /* + * If the user needs closest matching VMA, keep iterating. + */ + addr = vma->vm_end; + if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) + goto next_vma; +no_vma: + return ERR_PTR(-ENOENT); +} + +static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) +{ + struct procmap_query karg; + struct vm_area_struct *vma; + struct mm_struct *mm; + const char *name = NULL; + char *name_buf = NULL; + __u64 usize; + int err; + + if (copy_from_user(&usize, (void __user *)uarg, sizeof(usize))) + return -EFAULT; + /* argument struct can never be that large, reject abuse */ + if (usize > PAGE_SIZE) + return -E2BIG; + /* argument struct should have at least query_flags and query_addr fields */ + if (usize < offsetofend(struct procmap_query, query_addr)) + return -EINVAL; + err = copy_struct_from_user(&karg, sizeof(karg), uarg, usize); + if (err) + return err; + + /* reject unknown flags */ + if (karg.query_flags & ~PROCMAP_QUERY_VALID_FLAGS_MASK) + return -EINVAL; + /* either both buffer address and size are set, or both should be zero */ + if (!!karg.vma_name_size != !!karg.vma_name_addr) + return -EINVAL; + + mm = priv->mm; + if (!mm || !mmget_not_zero(mm)) + return -ESRCH; + + err = query_vma_setup(mm); + if (err) { + mmput(mm); + return err; + } + + vma = query_matching_vma(mm, karg.query_addr, karg.query_flags); + if (IS_ERR(vma)) { + err = PTR_ERR(vma); + vma = NULL; + goto out; + } + + karg.vma_start = vma->vm_start; + karg.vma_end = vma->vm_end; + + if (vma->vm_file) { + const struct inode *inode = file_user_inode(vma->vm_file); + + karg.vma_offset = ((__u64)vma->vm_pgoff) << PAGE_SHIFT; + karg.dev_major = MAJOR(inode->i_sb->s_dev); + karg.dev_minor = MINOR(inode->i_sb->s_dev); + karg.inode = inode->i_ino; + } else { + karg.vma_offset = 0; + karg.dev_major = 0; + karg.dev_minor = 0; + karg.inode = 0; + } + + karg.vma_flags = 
0; + if (vma->vm_flags & VM_READ) + karg.vma_flags |= PROCMAP_QUERY_VMA_READABLE; + if (vma->vm_flags & VM_WRITE) + karg.vma_flags |= PROCMAP_QUERY_VMA_WRITABLE; + if (vma->vm_flags & VM_EXEC) + karg.vma_flags |= PROCMAP_QUERY_VMA_EXECUTABLE; + if (vma->vm_flags & VM_MAYSHARE) + karg.vma_flags |= PROCMAP_QUERY_VMA_SHARED; + + if (karg.vma_name_size) { + size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size); + const struct path *path; + const char *name_fmt; + size_t name_sz = 0; + + get_vma_name(vma, &path, &name, &name_fmt); + + if (path || name_fmt || name) { + name_buf = kmalloc(name_buf_sz, GFP_KERNEL); + if (!name_buf) { + err = -ENOMEM; + goto out; + } + } + if (path) { + name = d_path(path, name_buf, name_buf_sz); + if (IS_ERR(name)) { + err = PTR_ERR(name); + goto out; + } + name_sz = name_buf + name_buf_sz - name; + } else if (name || name_fmt) { + name_sz = 1 + snprintf(name_buf, name_buf_sz, name_fmt ?: "%s", name); + name = name_buf; + } + if (name_sz > name_buf_sz) { + err = -ENAMETOOLONG; + goto out; + } + karg.vma_name_size = name_sz; + } + + /* unlock vma or mmap_lock, and put mm_struct before copying data to user */ + query_vma_teardown(mm, vma); + mmput(mm); + + if (karg.vma_name_size && copy_to_user((void __user *)karg.vma_name_addr, + name, karg.vma_name_size)) { + kfree(name_buf); + return -EFAULT; + } + kfree(name_buf); + + if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize))) + return -EFAULT; + + return 0; + +out: + query_vma_teardown(mm, vma); + mmput(mm); + kfree(name_buf); + return err; +} + +static long procfs_procmap_ioctl(struct file *file, unsigned int cmd, unsigned long arg) +{ + struct seq_file *seq = file->private_data; + struct proc_maps_private *priv = seq->private; + + switch (cmd) { + case PROCMAP_QUERY: + return do_procmap_query(priv, (void __user *)arg); + default: + return -ENOIOCTLCMD; + } +} + const struct file_operations proc_pid_maps_operations = { .open = pid_maps_open, .read = seq_read, .llseek 
= seq_lseek,
 	.release	= proc_map_release,
+	.unlocked_ioctl = procfs_procmap_ioctl,
+	.compat_ioctl	= procfs_procmap_ioctl,
 };
 
 /*
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 45e4e64fd664..f25e7004972d 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -333,8 +333,10 @@ typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND | RWF_NOAPPEND)
 
+#define PROCFS_IOCTL_MAGIC 'f'
+
 /* Pagemap ioctl */
-#define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
+#define PAGEMAP_SCAN	_IOWR(PROCFS_IOCTL_MAGIC, 16, struct pm_scan_arg)
 
 /* Bitmasks provided in pm_scan_args masks and reported in page_region.categories. */
 #define PAGE_IS_WPALLOWED	(1 << 0)
@@ -393,4 +395,128 @@ struct pm_scan_arg {
 	__u64 return_mask;
 };
 
+/* /proc/<pid>/maps ioctl */
+#define PROCMAP_QUERY	_IOWR(PROCFS_IOCTL_MAGIC, 17, struct procmap_query)
+
+enum procmap_query_flags {
+	/*
+	 * VMA permission flags.
+	 *
+	 * Can be used as part of procmap_query.query_flags field to look up
+	 * only VMAs satisfying specified subset of permissions. E.g., specifying
+	 * PROCMAP_QUERY_VMA_READABLE only will return both readable and read/write VMAs,
+	 * while having PROCMAP_QUERY_VMA_READABLE | PROCMAP_QUERY_VMA_WRITABLE will only
+	 * return read/write VMAs, regardless of whether they are executable
+	 * or shared.
+	 *
+	 * PROCMAP_QUERY_VMA_* flags are also returned in procmap_query.vma_flags
+	 * field to specify actual VMA permissions.
+	 */
+	PROCMAP_QUERY_VMA_READABLE		= 0x01,
+	PROCMAP_QUERY_VMA_WRITABLE		= 0x02,
+	PROCMAP_QUERY_VMA_EXECUTABLE		= 0x04,
+	PROCMAP_QUERY_VMA_SHARED		= 0x08,
+	/*
+	 * Query modifier flags.
+	 *
+	 * By default VMA that covers provided address is returned, or -ENOENT
+	 * is returned. With PROCMAP_QUERY_COVERING_OR_NEXT_VMA flag set, closest
+	 * VMA with vma_start > addr will be returned if no covering VMA is
+	 * found.
+ * + * PROCMAP_QUERY_FILE_BACKED_VMA instructs query to consider only VMAs that + * have file backing. Can be combined with PROCMAP_QUERY_COVERING_OR_NEXT_VMA + * to iterate all VMAs with file backing. + */ + PROCMAP_QUERY_COVERING_OR_NEXT_VMA = 0x10, + PROCMAP_QUERY_FILE_BACKED_VMA = 0x20, +}; + +/* + * Input/output argument structure passed into ioctl() call. It can be used + * to query a set of VMAs (Virtual Memory Areas) of a process. + * + * Each field can be one of three kinds, marked in a short comment to the + * right of the field: + * - "in", input argument, user has to provide this value, kernel doesn't modify it; + * - "out", output argument, kernel sets this field with VMA data; + * - "in/out", input and output argument; user provides initial value (used + * to specify maximum allowable buffer size), and kernel sets it to actual + * amount of data written (or zero, if there is no data). + * + * If matching VMA is found (according to criteria specified by + * query_addr/query_flags), all the out fields are filled out, and ioctl() + * returns 0. If there is no matching VMA, -ENOENT will be returned. + * In case of any other error, negative error code other than -ENOENT is + * returned. + * + * Most of the data is similar to the one returned as text in /proc/<pid>/maps + * file, but procmap_query provides more querying flexibility. There are no + * consistency guarantees between subsequent ioctl() calls, but data returned + * for matched VMA is self-consistent. + */ +struct procmap_query { + /* Query struct size, for backwards/forward compatibility */ + __u64 size; + /* + * Query flags, a combination of enum procmap_query_flags values. + * Defines query filtering and behavior, see enum procmap_query_flags. + * + * Input argument, provided by user. Kernel doesn't modify it. + */ + __u64 query_flags; /* in */ + /* + * Query address. By default, VMA that covers this address will + * be looked up.
PROCMAP_QUERY_* flags above modify this default + * behavior further. + * + * Input argument, provided by user. Kernel doesn't modify it. + */ + __u64 query_addr; /* in */ + /* VMA starting (inclusive) and ending (exclusive) address, if VMA is found. */ + __u64 vma_start; /* out */ + __u64 vma_end; /* out */ + /* VMA permissions flags. A combination of PROCMAP_QUERY_VMA_* flags. */ + __u64 vma_flags; /* out */ + /* + * VMA file offset. If VMA has file backing, this specifies offset + * within the file that VMA's start address corresponds to. + * Is set to zero if VMA has no backing file. + */ + __u64 vma_offset; /* out */ + /* Backing file's inode number, or zero, if VMA has no backing file. */ + __u64 inode; /* out */ + /* Backing file's device major/minor number, or zero, if VMA has no backing file. */ + __u32 dev_major; /* out */ + __u32 dev_minor; /* out */ + /* + * If set to non-zero value, signals the request to return VMA name + * (i.e., VMA's backing file's absolute path, with " (deleted)" suffix + * appended, if file was unlinked from FS) for matched VMA. VMA name + * can also be some special name (e.g., "[heap]", "[stack]") or could + * be even user-supplied with prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME). + * + * Kernel will set this field to zero, if VMA has no associated name. + * Otherwise kernel will return actual amount of bytes filled in + * user-supplied buffer (see vma_name_addr field below), including the + * terminating zero. + * + * If VMA name is longer than user-supplied maximum buffer size, + * -E2BIG error is returned. + * + * If this field is set to non-zero value, vma_name_addr should point + * to valid user space memory buffer of at least vma_name_size bytes.
+ * If set to zero, vma_name_addr should be set to zero as well + */ + __u32 vma_name_size; /* in/out */ + /* + * User-supplied address of a buffer of at least vma_name_size bytes + * for kernel to fill with matched VMA's name (see vma_name_size field + * description above for details). + * + * Should be set to zero if VMA name should not be returned. + */ + __u64 vma_name_addr; /* in */ +}; + #endif /* _UAPI_LINUX_FS_H */ From patchwork Wed Jun 5 00:24:49 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13686009 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 71B0913AD39; Wed, 5 Jun 2024 00:25:24 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547125; cv=none; b=XywbvD+Oe4IKHX4eKJDMvxa2FP//Ep+R3kr4gy6SFszGTaE/Tik3BIlDEWtZl6W1bq4GtT6z0uEMYWWdI396OXoAvHvjhPpJKje1A1CO1h+BYg2TfY7G551JTQ+j3wIQbnzV8IrygQuQYD8ruQaO5GahP2YGcVUn2ExxS7H6iac= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547125; c=relaxed/simple; bh=Pq+QGLVWTUWBT2KD39eaZ277HgWCnfkSzzJkgOa7/lc=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=cceUYLu4Xo+9tQS+JycjlqF/a/HxbrH9DU/tnxN8RztVU+bOoVEQcw2Nk+aEnwwUgrR64jNTfqXLzBrG5EDfuf8uJkT2mWNcrU8SzP51JKvPV7VVxG8KVvCxlHcZ0uW4otO1bhjgeHtDLso6MEhW3KHte2SjGjABufZhbSdIs5M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=fpgXLkOC; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org 
header.b="fpgXLkOC" Received: by smtp.kernel.org (Postfix) with ESMTPSA id CF59CC2BBFC; Wed, 5 Jun 2024 00:25:23 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1717547124; bh=Pq+QGLVWTUWBT2KD39eaZ277HgWCnfkSzzJkgOa7/lc=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=fpgXLkOCnVfC96uuFhU6H4kf3FRlKyUiWC5AwEmD+MVxxbpcpWofrljXwKK3VDQ6C Atrrq7u8WckvkQnCVMLNC4Jn+q+pukJjrExfB0JctjvySmUVy4339wU7WjsbrsY9Yl wbf5qxJp5OJeLwAYiXGpbl3AFJA0Jn6SM/8TGzyBAo6mGrHPBLpLmBb4z2e+n2i/b/ 8RpYXfAwc1M7v6AuOdAjRr5hIU/Jr0u+gGyG9iC/WL508facstcWJVdyjFFwiaSmgk E/RRK/3MUxZGhXKbdvlWC9Rbf4UIpEd0cGJJA3Iw3qQtYXNO3Mo9Klt6Qqec2IjKUa 2jql6xImxK5VQ== From: Andrii Nakryiko To: linux-fsdevel@vger.kernel.org, brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org, gregkh@linuxfoundation.org, linux-mm@kvack.org, liam.howlett@oracle.com, surenb@google.com, rppt@kernel.org, Andrii Nakryiko Subject: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API Date: Tue, 4 Jun 2024 17:24:49 -0700 Message-ID: <20240605002459.4091285-5-andrii@kernel.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240605002459.4091285-1-andrii@kernel.org> References: <20240605002459.4091285-1-andrii@kernel.org> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Attempt to use RCU-protected per-VMA lock when looking up requested VMA as much as possible, only falling back to mmap_lock if per-VMA lock failed. This is done so that querying of VMAs doesn't interfere with other critical tasks, like page fault handling. This has been suggested by mm folks, and we make use of a newly added internal API that works like find_vma(), but tries to use per-VMA lock. 
We have two sets of setup/query/teardown helper functions with different implementations depending on availability of per-VMA lock (conditioned on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties. When per-VMA lock is available, lookup is done under RCU, attempting to take a per-VMA lock. If that fails, we fall back to mmap_lock, but then proceed to unconditionally grab per-VMA lock again, dropping mmap_lock immediately. In this configuration mmap_lock is never held for long, minimizing disruptions while querying. When per-VMA lock is compiled out, we take mmap_lock once, query VMAs using find_vma() API, and then unlock mmap_lock at the very end once as well. In this setup we avoid locking/unlocking mmap_lock on every looked up VMA (depending on query parameters we might need to iterate a few of them). Signed-off-by: Andrii Nakryiko --- fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 46 insertions(+) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 614fbe5d0667..140032ffc551 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file) PROCMAP_QUERY_VMA_FLAGS \ ) +#ifdef CONFIG_PER_VMA_LOCK +static int query_vma_setup(struct mm_struct *mm) +{ + /* in the presence of per-VMA lock we don't need any setup/teardown */ + return 0; +} + +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma) +{ + /* in the presence of per-VMA lock we need to unlock vma, if present */ + if (vma) + vma_end_read(vma); +} + +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr) +{ + struct vm_area_struct *vma; + + /* try to use less disruptive per-VMA lock */ + vma = find_and_lock_vma_rcu(mm, addr); + if (IS_ERR(vma)) { + /* failed to take per-VMA lock, fall back to mmap_lock */ + if (mmap_read_lock_killable(mm)) + return ERR_PTR(-EINTR); + + vma = find_vma(mm, addr); + if (vma) { + /* + * We
cannot use vma_start_read() as it may fail due to + * false locked (see comment in vma_start_read()). We + * can avoid that by directly locking vm_lock under + * mmap_lock, which guarantees that nobody can lock the + * vma for write (vma_start_write()) under us. + */ + down_read(&vma->vm_lock->lock); + } + + mmap_read_unlock(mm); + } + + return vma; +} +#else static int query_vma_setup(struct mm_struct *mm) { return mmap_read_lock_killable(mm); @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig { return find_vma(mm, addr); } +#endif static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, unsigned long addr, u32 flags) @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, skip_vma: /* * If the user needs closest matching VMA, keep iterating. + * But before we proceed we might need to unlock current VMA. */ addr = vma->vm_end; + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */ if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) goto next_vma; no_vma: From patchwork Wed Jun 5 00:24:50 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13686010 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C831D13B794; Wed, 5 Jun 2024 00:25:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547127; cv=none; b=VsFaXiGJfeQlG2Tl/1P2sdVaVTfc7oXqja+McD5xdE7Hqzd5fvyDKMhOXNdZ8zo0M+rR+g3Ip77ExeTEfJPe8RTI4vecceDnbKAoNqDV8THi23QWxG23WP9c80WzQUN0QPJs+fhpAopNehmk6uc2bNwEE8m1qcKf6ihrmVQ5Phk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; 
s=arc-20240116; t=1717547127; c=relaxed/simple; bh=AklsDx8YWfb7MKXDA8kKw+F7thNTeWZboAKDAUz54mw=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=dr+GfUXo4zA0T4Xo4S2D5gmlGuhSu0c2QGSJjG9BeWpTO6DjFsLo1AwLwxR77Tq/AvgY8GP9AS3NRNMY8O+4zxP/mw8OazXY/GGQxtqfmbLa7ekp2kFOl2j+y49NI+EsJjhhyGyGJzcOUtCC2fGrjwuyb0EpCbUwMKwLMerge0k= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=aIIA33H5; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="aIIA33H5" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 18380C32786; Wed, 5 Jun 2024 00:25:27 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1717547127; bh=AklsDx8YWfb7MKXDA8kKw+F7thNTeWZboAKDAUz54mw=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=aIIA33H5QH+tD7y9NWtd3jWhzSau3y32gUEvdrPPCiaQj/oK8V7ws3/uhQqrGa8E3 G+rv4eCpgX8sYu4K6Ej1pkHk/DPd7rD1GiC33BMXQo/fSWIbpYqeBzSrc4dStTjJpq QZEZ6HGWIOeHp5x0zXBgVUoLLvXoD/CQO6gWw9fbki6qkV+qJoavysMPxs1BPfbM0W 3QsrVoUwFAS0QIMSppOa7H3qNufxa55g2CEMyy6DNN8ZCwB71Ns3L2G+WsmIx3UYX7 dTQwEieiqCwoAxfrATLcF5OS/CPBC6hr53IizeOVUsTFk/ufmFusqFS+Czfb/Dw4Bp 4Dh/vlrF/FLfw== From: Andrii Nakryiko To: linux-fsdevel@vger.kernel.org, brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org, gregkh@linuxfoundation.org, linux-mm@kvack.org, liam.howlett@oracle.com, surenb@google.com, rppt@kernel.org, Andrii Nakryiko Subject: [PATCH v3 5/9] fs/procfs: add build ID fetching to PROCMAP_QUERY API Date: Tue, 4 Jun 2024 17:24:50 -0700 Message-ID: <20240605002459.4091285-6-andrii@kernel.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240605002459.4091285-1-andrii@kernel.org> References: <20240605002459.4091285-1-andrii@kernel.org> Precedence: bulk X-Mailing-List: 
linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 The need to get ELF build ID reliably is an important aspect when dealing with profiling and stack trace symbolization, and /proc/<pid>/maps textual representation doesn't help with this. To get backing file's ELF build ID, application has to first resolve VMA, then use its start/end address range to follow a special /proc/<pid>/map_files/<start>-<end> symlink to open the ELF file (this is necessary because backing file might have been removed from the disk or was already replaced with another binary in the same file path). Such an approach, beyond just adding complexity of having to do a bunch of extra work, has extra security implications. Because application opens underlying ELF file and needs read access to its entire contents (as far as kernel is concerned), kernel puts additional capable() checks on following /proc/<pid>/map_files/<start>-<end> symlink. And that makes sense in general. But in the case of build ID, profiler/symbolizer doesn't need the contents of ELF file, per se. It's only build ID that is of interest, and ELF build ID itself doesn't provide any sensitive information. So this patch adds a way to request backing file's ELF build ID along with the rest of VMA information in the same API. User has control over whether this piece of information is requested or not by setting build_id_size field either to zero or to the non-zero maximum size of the buffer they provide through build_id_addr field (which encodes user pointer as __u64 field). This is a completely optional piece of information, and so has no performance implications for use cases that don't care about build ID, while improving performance and simplifying the setup for those applications that do need it. Kernel already implements build ID fetching, which is used from BPF subsystem.
We are reusing this code here, but plan follow-up changes to make it work better under more relaxed assumption (compared to what existing code assumes) of being called from user process context, in which page faults are allowed. BPF-specific implementation currently bails out if necessary part of ELF file is not paged in, all due to extra BPF-specific restrictions (like the need to fetch build ID in restrictive contexts such as NMI handler). Signed-off-by: Andrii Nakryiko --- fs/proc/task_mmu.c | 25 ++++++++++++++++++++++++- include/uapi/linux/fs.h | 28 ++++++++++++++++++++++++++++ 2 files changed, 52 insertions(+), 1 deletion(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 140032ffc551..4b7251fb1a4b 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -491,6 +492,7 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm, vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */ if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA) goto next_vma; + no_vma: return ERR_PTR(-ENOENT); } @@ -501,7 +503,7 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) struct vm_area_struct *vma; struct mm_struct *mm; const char *name = NULL; - char *name_buf = NULL; + char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL; __u64 usize; int err; @@ -523,6 +525,8 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) /* either both buffer address and size are set, or both should be zero */ if (!!karg.vma_name_size != !!karg.vma_name_addr) return -EINVAL; + if (!!karg.build_id_size != !!karg.build_id_addr) + return -EINVAL; mm = priv->mm; if (!mm || !mmget_not_zero(mm)) @@ -568,6 +572,21 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) if (vma->vm_flags & VM_MAYSHARE) karg.vma_flags |= PROCMAP_QUERY_VMA_SHARED; + if (karg.build_id_size) { + __u32 build_id_sz; + + err = build_id_parse(vma,
build_id_buf, &build_id_sz); + if (err) { + karg.build_id_size = 0; + } else { + if (karg.build_id_size < build_id_sz) { + err = -ENAMETOOLONG; + goto out; + } + karg.build_id_size = build_id_sz; + } + } + if (karg.vma_name_size) { size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size); const struct path *path; @@ -612,6 +631,10 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg) } kfree(name_buf); + if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr, + build_id_buf, karg.build_id_size)) + return -EFAULT; + if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize))) return -EFAULT; diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index f25e7004972d..7306022780d3 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -509,6 +509,26 @@ struct procmap_query { * If set to zero, vma_name_addr should be set to zero as well */ __u32 vma_name_size; /* in/out */ + /* + * If set to non-zero value, signals the request to extract and return + * VMA's backing file's build ID, if the backing file is an ELF file + * and it contains embedded build ID. + * + * Kernel will set this field to zero, if VMA has no backing file, + * backing file is not an ELF file, or ELF file has no build ID + * embedded. + * + * Build ID is a binary value (not a string). Kernel will set + * build_id_size field to exact number of bytes used for build ID. + * If build ID is requested and present, but needs more bytes than + * user-supplied maximum buffer size (see build_id_addr field below), + * -E2BIG error will be returned. + * + * If this field is set to non-zero value, build_id_addr should point + * to valid user space memory buffer of at least build_id_size bytes. 
+ * If set to zero, build_id_addr should be set to zero as well + */ + __u32 build_id_size; /* in/out */ /* * User-supplied address of a buffer of at least vma_name_size bytes * for kernel to fill with matched VMA's name (see vma_name_size field @@ -517,6 +537,14 @@ struct procmap_query { * Should be set to zero if VMA name should not be returned. */ __u64 vma_name_addr; /* in */ + /* + * User-supplied address of a buffer of at least build_id_size bytes + * for kernel to fill with matched VMA's ELF build ID, if available + * (see build_id_size field description above for details). + * + * Should be set to zero if build ID should not be returned. + */ + __u64 build_id_addr; /* in */ }; #endif /* _UAPI_LINUX_FS_H */ From patchwork Wed Jun 5 00:24:51 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13686011 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8D77AEAF0; Wed, 5 Jun 2024 00:25:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547138; cv=none; b=AcS3bvxB81QBIJAwLHs5JypDHcaIsrkGNkDR9dl7xW86KOP94zQRczNUbc9Zi4Aez2B4SvsNrTieEaRy0PtxRl2iNiTjTaO5kM950hBSVGKUUiTpSFlUL1xb8MUXxf675JeQHaK+/+/85Xgj2wYR7IkjwZXyL64gzG7fJrXm2pE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547138; c=relaxed/simple; bh=ZB4t6JlnJmyNfXJGfSg5ESgmhwAqw9FNf1Hmazm6kO0=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=CSP27qhTxNc7bFFjgyGN6ELh8ikgaidTHD64j4EaxadDeLx0XnnRPZ+2SeQmiaQg7x2ROHgMbdXg2TJr+d6sS6F0ituOdx0jk7sJeHRjkJw3tt0IgIpnKIUeq6pK7ZWLsiKo/FxkYegWe5Qhpbmg40l4hmsHF7o4KqsQ7osDzz4= 
ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=OjIONyKY; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="OjIONyKY" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4F9BAC2BBFC; Wed, 5 Jun 2024 00:25:38 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1717547138; bh=ZB4t6JlnJmyNfXJGfSg5ESgmhwAqw9FNf1Hmazm6kO0=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=OjIONyKYr0+xpTgXOjcVF6bOu02uR2+txTrRdm4Ixq+QkB+2eXEz0kuSKxOnhbNv+ n4Mw3XADEz/zMDgWgsnoCtNCX+wDYNP/uoyMk2FDPgaQ1IfIloSv/lZSECMlFyarar DpohbGRGcfJaJ4hdzHB3Tratzz77+u8gEQDjOEuiJw3aBaKwuO1bJZ8X0E/seeHgHX q2cx7mBYyK8gJ5dFXkXh8Y9Bxu99EhogGZY1NuRAP/ayKfGAuOcbZy1dmBkpKT9YW0 wz7tZsETraSeriNv1SZHfrH06MX7u9Gc8yZ7/mlxd/jNwiW4Z3R2XTqq4QIB08cT4I kLfEN+PJfbx9Q== From: Andrii Nakryiko To: linux-fsdevel@vger.kernel.org, brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org, gregkh@linuxfoundation.org, linux-mm@kvack.org, liam.howlett@oracle.com, surenb@google.com, rppt@kernel.org, Andrii Nakryiko Subject: [PATCH v3 6/9] docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence Date: Tue, 4 Jun 2024 17:24:51 -0700 Message-ID: <20240605002459.4091285-7-andrii@kernel.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240605002459.4091285-1-andrii@kernel.org> References: <20240605002459.4091285-1-andrii@kernel.org> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Call out PROCMAP_QUERY ioctl() existence in the section describing /proc/PID/maps file in documentation. We refer user to UAPI header for low-level details of this programmatic interface. 
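For illustration only, the user-space side of the command might look roughly like the sketch below. It is not part of this patch: procmap_query_init() and walk_self_maps() are hypothetical helper names, and the fallback definitions under #ifndef are an assumption for build environments whose linux/fs.h predates this series (they mirror the layout from this v3 posting, which is not guaranteed to match a released kernel, in which case the ioctl() is expected to fail rather than succeed).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

#ifndef PROCMAP_QUERY
/* Assumed fallback for older UAPI headers: copied from this series (v3). */
enum procmap_query_flags {
	PROCMAP_QUERY_VMA_READABLE		= 0x01,
	PROCMAP_QUERY_VMA_WRITABLE		= 0x02,
	PROCMAP_QUERY_VMA_EXECUTABLE		= 0x04,
	PROCMAP_QUERY_VMA_SHARED		= 0x08,
	PROCMAP_QUERY_COVERING_OR_NEXT_VMA	= 0x10,
	PROCMAP_QUERY_FILE_BACKED_VMA		= 0x20,
};

struct procmap_query {
	uint64_t size;
	uint64_t query_flags;
	uint64_t query_addr;
	uint64_t vma_start;
	uint64_t vma_end;
	uint64_t vma_flags;
	uint64_t vma_offset;
	uint64_t inode;
	uint32_t dev_major;
	uint32_t dev_minor;
	uint32_t vma_name_size;
	uint32_t build_id_size;
	uint64_t vma_name_addr;
	uint64_t build_id_addr;
};
#define PROCMAP_QUERY _IOWR('f', 17, struct procmap_query)
#endif

/* Hypothetical helper: prepare a "covering or next VMA" query for addr. */
void procmap_query_init(struct procmap_query *q, uint64_t addr,
			char *name_buf, uint32_t name_buf_sz)
{
	assert(q != NULL);
	memset(q, 0, sizeof(*q));
	q->size = sizeof(*q);
	q->query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;
	q->query_addr = addr;
	/* name size and address must be both zero or both non-zero */
	if (name_buf && name_buf_sz) {
		q->vma_name_size = name_buf_sz;
		q->vma_name_addr = (uint64_t)(uintptr_t)name_buf;
	}
}

/*
 * Hypothetical helper: print every VMA of the calling process.
 * Returns the number of VMAs seen, or -errno on failure (e.g. -ENOTTY
 * when the running kernel does not know this command).
 */
int walk_self_maps(void)
{
	struct procmap_query q;
	char name[4096];
	uint64_t addr = 0;
	int fd, n = 0, err = 0;

	fd = open("/proc/self/maps", O_RDONLY);
	if (fd < 0)
		return -errno;

	for (;;) {
		procmap_query_init(&q, addr, name, sizeof(name));
		if (ioctl(fd, PROCMAP_QUERY, &q) < 0) {
			/* ENOENT: no VMA at or after addr, so we are done */
			err = (errno == ENOENT) ? 0 : -errno;
			break;
		}
		printf("%llx-%llx %c%c%c%c %s\n",
		       (unsigned long long)q.vma_start,
		       (unsigned long long)q.vma_end,
		       (q.vma_flags & PROCMAP_QUERY_VMA_READABLE) ? 'r' : '-',
		       (q.vma_flags & PROCMAP_QUERY_VMA_WRITABLE) ? 'w' : '-',
		       (q.vma_flags & PROCMAP_QUERY_VMA_EXECUTABLE) ? 'x' : '-',
		       (q.vma_flags & PROCMAP_QUERY_VMA_SHARED) ? 's' : 'p',
		       q.vma_name_size ? name : "");
		addr = q.vma_end;	/* continue right after this VMA */
		n++;
	}
	close(fd);
	return err ? err : n;
}
```

A caller would invoke walk_self_maps() once and treat a negative return value as "command unavailable or failed", falling back to parsing the text file in that case. Restarting each query at vma_end is what turns the single-VMA lookup into an iteration.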
Signed-off-by: Andrii Nakryiko --- Documentation/filesystems/proc.rst | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 7c3a565ffbef..f2bbd1e86204 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -443,6 +443,14 @@ is not associated with a file: or if empty, the mapping is anonymous. +Starting with 6.11 kernel, /proc/PID/maps provides an alternative +ioctl()-based API that gives ability to flexibly and efficiently query and +filter individual VMAs. This interface is binary and is meant for more +efficient programmatic use. `struct procmap_query`, defined in linux/fs.h UAPI +header, serves as an input/output argument to the `PROCMAP_QUERY` ioctl() +command. See comments in linux/fs.h UAPI header for details on query +semantics, supported flags, data returned, and general API usage information. + The /proc/PID/smaps is an extension based on maps, showing the memory consumption for each of the process's mappings.
For each mapping (aka Virtual Memory Area, or VMA) there is a series of lines such as the following:: From patchwork Wed Jun 5 00:24:52 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13686012 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 318272F43; Wed, 5 Jun 2024 00:25:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547142; cv=none; b=Pfs3jQ3DO7l1XKQuDsU6lhb5EE7XW+esINk7BmKQrlg+rW0VgjTe/MbuCbDCEwa+Rqf0O1SbMGc5ynDVVEjg1adxgSOKxm2LgfR3akxmj6wN/Bx8NkOP6Lu5wcAhqSFKLXRPrNzQTHLrgiestaWK2fmcuQYgmiFpILcgDQGrg34= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547142; c=relaxed/simple; bh=XhT7MyIpfN/Vsi8sYrA9Vx1PIwjnTH2I6tc+o8hojXY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=YHl8PKwWF9hXfDmgtw//yr4sdSWDOEXNr9IasLvCZUIPlbi2ZZXyAm5ty7xO7dX/XEbXONwYcPQLlMYtXDaLoUDtaiKJ6ctRQaVIyjX/hkk3VO6VdPWcNoEey3PDjfJrjOjbmrjy2MBtyaur93gaQSUduKlDRUbt0IqCpyJxjlY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=rKcQEWQM; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="rKcQEWQM" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 7CD21C2BBFC; Wed, 5 Jun 2024 00:25:41 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1717547141; bh=XhT7MyIpfN/Vsi8sYrA9Vx1PIwjnTH2I6tc+o8hojXY=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; 
b=rKcQEWQMeWdesGvQLL7bIVc6E0VZpu7OkRYVDovsDdv089zMFcwU6CHtlVDOsuJRJ QOVt4eRPH2RClctmS9GCSBzMj0m/I+KFY7ZvL9EQjuSAkYYKGkpeesjwZT/fFnuXAJ Yv+D8R3oWWznZDMQLYDFR6Ru5PBds+5+KEWaSJ0xkSpK5C77/19uuG5R7l5atcFFBC PLDrpQSdBIbk/ye0evRakElzfx5S0WsBJtmfytYB6uwRHLfY2IDWO7uAFRl5Je6W4I uZalXDJ8flxEouKOyouPgP3HmAosYh6UBCRs3XQrzvTN5WllAdQwyt3B1nUAQmjfRh rSg6xqlzOY6dQ== From: Andrii Nakryiko To: linux-fsdevel@vger.kernel.org, brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org, gregkh@linuxfoundation.org, linux-mm@kvack.org, liam.howlett@oracle.com, surenb@google.com, rppt@kernel.org, Andrii Nakryiko Subject: [PATCH v3 7/9] tools: sync uapi/linux/fs.h header into tools subdir Date: Tue, 4 Jun 2024 17:24:52 -0700 Message-ID: <20240605002459.4091285-8-andrii@kernel.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240605002459.4091285-1-andrii@kernel.org> References: <20240605002459.4091285-1-andrii@kernel.org> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 We need this UAPI header in tools/include subdirectory for using it from BPF selftests. Signed-off-by: Andrii Nakryiko --- tools/include/uapi/linux/fs.h | 550 ++++++++++++++++++++++++++++++++++ 1 file changed, 550 insertions(+) create mode 100644 tools/include/uapi/linux/fs.h diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h new file mode 100644 index 000000000000..7306022780d3 --- /dev/null +++ b/tools/include/uapi/linux/fs.h @@ -0,0 +1,550 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _UAPI_LINUX_FS_H +#define _UAPI_LINUX_FS_H + +/* + * This file has definitions for some important file table structures + * and constants and structures used by various generic file system + * ioctl's. 
Please do not make any changes in this file before + * sending patches for review to linux-fsdevel@vger.kernel.org and + * linux-api@vger.kernel.org. + */ + +#include +#include +#include +#ifndef __KERNEL__ +#include +#endif + +/* Use of MS_* flags within the kernel is restricted to core mount(2) code. */ +#if !defined(__KERNEL__) +#include +#endif + +/* + * It's silly to have NR_OPEN bigger than NR_FILE, but you can change + * the file limit at runtime and only root can increase the per-process + * nr_file rlimit, so it's safe to set up a ridiculously high absolute + * upper limit on files-per-process. + * + * Some programs (notably those using select()) may have to be + * recompiled to take full advantage of the new limits.. + */ + +/* Fixed constants first: */ +#undef NR_OPEN +#define INR_OPEN_CUR 1024 /* Initial setting for nfile rlimits */ +#define INR_OPEN_MAX 4096 /* Hard limit for nfile rlimits */ + +#define BLOCK_SIZE_BITS 10 +#define BLOCK_SIZE (1</maps ioctl */ +#define PROCMAP_QUERY _IOWR(PROCFS_IOCTL_MAGIC, 17, struct procmap_query) + +enum procmap_query_flags { + /* + * VMA permission flags. + * + * Can be used as part of procmap_query.query_flags field to look up + * only VMAs satisfying specified subset of permissions. E.g., specifying + * PROCMAP_QUERY_VMA_READABLE only will return both readable and read/write VMAs, + * while having PROCMAP_QUERY_VMA_READABLE | PROCMAP_QUERY_VMA_WRITABLE will only + * return read/write VMAs, though both executable/non-executable and + * private/shared will be ignored. + * + * PROCMAP_QUERY_VMA_* flags are also returned in procmap_query.vma_flags + * field to specify actual VMA permissions. + */ + PROCMAP_QUERY_VMA_READABLE = 0x01, + PROCMAP_QUERY_VMA_WRITABLE = 0x02, + PROCMAP_QUERY_VMA_EXECUTABLE = 0x04, + PROCMAP_QUERY_VMA_SHARED = 0x08, + /* + * Query modifier flags. + * + * By default VMA that covers provided address is returned, or -ENOENT + * is returned. 
With PROCMAP_QUERY_COVERING_OR_NEXT_VMA flag set, closest + * VMA with vma_start > addr will be returned if no covering VMA is + * found. + * + * PROCMAP_QUERY_FILE_BACKED_VMA instructs query to consider only VMAs that + * have file backing. Can be combined with PROCMAP_QUERY_COVERING_OR_NEXT_VMA + * to iterate all VMAs with file backing. + */ + PROCMAP_QUERY_COVERING_OR_NEXT_VMA = 0x10, + PROCMAP_QUERY_FILE_BACKED_VMA = 0x20, +}; + +/* + * Input/output argument structure passed into ioctl() call. It can be used + * to query a set of VMAs (Virtual Memory Areas) of a process. + * + * Each field can be one of three kinds, marked in a short comment to the + * right of the field: + * - "in", input argument, user has to provide this value, kernel doesn't modify it; + * - "out", output argument, kernel sets this field with VMA data; + * - "in/out", input and output argument; user provides initial value (used + * to specify maximum allowable buffer size), and kernel sets it to actual + * amount of data written (or zero, if there is no data). + * + * If matching VMA is found (according to criteria specified by + * query_addr/query_flags), all the out fields are filled out, and ioctl() + * returns 0. If there is no matching VMA, -ENOENT will be returned. + * In case of any other error, negative error code other than -ENOENT is + * returned. + * + * Most of the data is similar to the one returned as text in /proc/<pid>/maps + * file, but procmap_query provides more querying flexibility. There are no + * consistency guarantees between subsequent ioctl() calls, but data returned + * for matched VMA is self-consistent. + */ +struct procmap_query { + /* Query struct size, for backwards/forward compatibility */ + __u64 size; + /* + * Query flags, a combination of enum procmap_query_flags values. + * Defines query filtering and behavior, see enum procmap_query_flags. + * + * Input argument, provided by user. Kernel doesn't modify it.
+ */ + __u64 query_flags; /* in */ + /* + * Query address. By default, VMA that covers this address will + * be looked up. PROCMAP_QUERY_* flags above modify this default + * behavior further. + * + * Input argument, provided by user. Kernel doesn't modify it. + */ + __u64 query_addr; /* in */ + /* VMA starting (inclusive) and ending (exclusive) address, if VMA is found. */ + __u64 vma_start; /* out */ + __u64 vma_end; /* out */ + /* VMA permissions flags. A combination of PROCMAP_QUERY_VMA_* flags. */ + __u64 vma_flags; /* out */ + /* + * VMA file offset. If VMA has file backing, this specifies offset + * within the file that VMA's start address corresponds to. + * Is set to zero if VMA has no backing file. + */ + __u64 vma_offset; /* out */ + /* Backing file's inode number, or zero, if VMA has no backing file. */ + __u64 inode; /* out */ + /* Backing file's device major/minor number, or zero, if VMA has no backing file. */ + __u32 dev_major; /* out */ + __u32 dev_minor; /* out */ + /* + * If set to non-zero value, signals the request to return VMA name + * (i.e., VMA's backing file's absolute path, with " (deleted)" suffix + * appended, if file was unlinked from FS) for matched VMA. VMA name + * can also be some special name (e.g., "[heap]", "[stack]") or could + * be even user-supplied with prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME). + * + * Kernel will set this field to zero, if VMA has no associated name. + * Otherwise kernel will return actual amount of bytes filled in + * user-supplied buffer (see vma_name_addr field below), including the + * terminating zero. + * + * If VMA name is longer than user-supplied maximum buffer size, + * -E2BIG error is returned. + * + * If this field is set to non-zero value, vma_name_addr should point + * to valid user space memory buffer of at least vma_name_size bytes.
+	 * If set to zero, vma_name_addr should be set to zero as well
+	 */
+	__u32 vma_name_size;		/* in/out */
+	/*
+	 * If set to non-zero value, signals the request to extract and return
+	 * VMA's backing file's build ID, if the backing file is an ELF file
+	 * and it contains embedded build ID.
+	 *
+	 * Kernel will set this field to zero, if VMA has no backing file,
+	 * backing file is not an ELF file, or ELF file has no build ID
+	 * embedded.
+	 *
+	 * Build ID is a binary value (not a string). Kernel will set
+	 * build_id_size field to exact number of bytes used for build ID.
+	 * If build ID is requested and present, but needs more bytes than
+	 * user-supplied maximum buffer size (see build_id_addr field below),
+	 * -E2BIG error will be returned.
+	 *
+	 * If this field is set to non-zero value, build_id_addr should point
+	 * to valid user space memory buffer of at least build_id_size bytes.
+	 * If set to zero, build_id_addr should be set to zero as well
+	 */
+	__u32 build_id_size;		/* in/out */
+	/*
+	 * User-supplied address of a buffer of at least vma_name_size bytes
+	 * for kernel to fill with matched VMA's name (see vma_name_size field
+	 * description above for details).
+	 *
+	 * Should be set to zero if VMA name should not be returned.
+	 */
+	__u64 vma_name_addr;		/* in */
+	/*
+	 * User-supplied address of a buffer of at least build_id_size bytes
+	 * for kernel to fill with matched VMA's ELF build ID, if available
+	 * (see build_id_size field description above for details).
+	 *
+	 * Should be set to zero if build ID should not be returned.
+ */ + __u64 build_id_addr; /* in */ +}; + +#endif /* _UAPI_LINUX_FS_H */ From patchwork Wed Jun 5 00:24:53 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrii Nakryiko X-Patchwork-Id: 13686013 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2CCD613D291; Wed, 5 Jun 2024 00:25:45 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547145; cv=none; b=W7O43i766cD7hoUecK2PYrbGFwvQa3W6HX++Dh/PGQhALErp4m2Ft5hHjg7pSs6t+btJUU6vCuAbe7mmNX7ojYnXXC3Hu1bwl5dNGUUwfc+jAQ7C3LN3Qlj6l4MIjC+hUkvSowZZEiEUA0xf66RKY6Buqj6BdKhNtQY/tQ8WXRY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547145; c=relaxed/simple; bh=6bfQltsebIDFmP7fT8/qzCkTKSF6gxCrj2vhm2ulkGA=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=S7PhYT2qPQvo3cC/E3itLqB7LwuvOfKTcDcQtchXo5yMcghqeQkXNIx86Us5bDm48HtneVd+tjDLxy8ePGNDz7pKE8xMOVNZIe3Grh4jPtYX6BtNIgWWDCsMhzGvUS7mCYnRueEVyuyEwNVOeAC7j322qp3e0e2O7ua1K3b8OHs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=opmt81wZ; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="opmt81wZ" Received: by smtp.kernel.org (Postfix) with ESMTPSA id C6DDFC2BBFC; Wed, 5 Jun 2024 00:25:44 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1717547145; bh=6bfQltsebIDFmP7fT8/qzCkTKSF6gxCrj2vhm2ulkGA=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; 
b=opmt81wZngUknnQTTOcKDFZcCS76fVo2OK9CC+rIqkSO7slsl+ZrmNXcyw8lAJFfQ lltNT7n3/XTJsliYev/jBmNdHj3L+IkgDxtAWwarYwErSobyTQvv8nwWqq8uHmrfkk f3ZhmU657tfWxPI/MftfyHufeH6/TUFttPah7xKDgJSe2uC7kh9wLK3VkDGDqUDnCO GbJL+gBvZ/vCZ+ll4umgv+XdMnqqrNR15Fa2AvUO1vnvgpCsxepqvyB/de8UzDN0w9 XBpibd02VOFN/+spe2BmXEN6mjX+D6u48sKiTm2+Z0iLyN3esXsMgoViYV5mk2ijk5 RdEjiN6v+knVg==
From: Andrii Nakryiko
To: linux-fsdevel@vger.kernel.org, brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org, gregkh@linuxfoundation.org, linux-mm@kvack.org, liam.howlett@oracle.com, surenb@google.com, rppt@kernel.org, Andrii Nakryiko
Subject: [PATCH v3 8/9] selftests/bpf: make use of PROCMAP_QUERY ioctl if available
Date: Tue, 4 Jun 2024 17:24:53 -0700
Message-ID: <20240605002459.4091285-9-andrii@kernel.org>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20240605002459.4091285-1-andrii@kernel.org>
References: <20240605002459.4091285-1-andrii@kernel.org>
Precedence: bulk
X-Mailing-List: linux-fsdevel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0

Instead of parsing text-based /proc/<pid>/maps file, try to use
PROCMAP_QUERY ioctl() to simplify and speed up data fetching.
This logic is used to do uprobe file offset calculation, so any bugs in
this logic would manifest as failing uprobe BPF selftests.

This also serves as a simple demonstration of one of the intended uses.
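
For reference, the file offset calculation mentioned above comes down to
addr - vma_start + vma_offset for the VMA covering addr. Below is a minimal
sketch of that step for the text-based fallback, working on a single line in
/proc/<pid>/maps format. The helper name maps_line_to_file_off() is
hypothetical and exists only for illustration; it is not part of this patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (illustration only, not part of the patch):
 * translate addr into a file offset using one line formatted like
 * /proc/<pid>/maps. Returns 0 on success, -1 if the line is malformed
 * or its [start, end) range does not cover addr.
 */
static int maps_line_to_file_off(const char *line, uint64_t addr,
				 uint64_t *file_off)
{
	uint64_t start, end, off;
	char perms[5];

	assert(line && file_off);

	/* start-end perms offset; the rest of the line is ignored here */
	if (sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s %" SCNx64,
		   &start, &end, perms, &off) != 4)
		return -1;

	/* VMA range is inclusive at start and exclusive at end */
	if (addr < start || addr >= end)
		return -1;

	*file_off = addr - start + off;
	return 0;
}
```

The same arithmetic applies to the ioctl() path, with vma_start and
vma_offset taken from struct procmap_query instead of parsed text.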
Signed-off-by: Andrii Nakryiko
---
 tools/testing/selftests/bpf/test_progs.c    |   3 +
 tools/testing/selftests/bpf/test_progs.h    |   2 +
 tools/testing/selftests/bpf/trace_helpers.c | 104 +++++++++++++++++---
 3 files changed, 94 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 89ff704e9dad..6a19970f2531 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -19,6 +19,8 @@
 #include
 #include "json_writer.h"

+int env_verbosity = 0;
+
 static bool verbose(void)
 {
 	return env.verbosity > VERBOSE_NONE;
@@ -848,6 +850,7 @@ static error_t parse_arg(int key, char *arg, struct argp_state *state)
 				return -EINVAL;
 			}
 		}
+		env_verbosity = env->verbosity;

 		if (verbose()) {
 			if (setenv("SELFTESTS_VERBOSE", "1", 1) == -1) {
diff --git a/tools/testing/selftests/bpf/test_progs.h b/tools/testing/selftests/bpf/test_progs.h
index 0ba5a20b19ba..6eae7fdab0d7 100644
--- a/tools/testing/selftests/bpf/test_progs.h
+++ b/tools/testing/selftests/bpf/test_progs.h
@@ -95,6 +95,8 @@ struct test_state {
 	FILE *stdout;
 };

+extern int env_verbosity;
+
 struct test_env {
 	struct test_selector test_selector;
 	struct test_selector subtest_selector;
diff --git a/tools/testing/selftests/bpf/trace_helpers.c b/tools/testing/selftests/bpf/trace_helpers.c
index 70e29f316fe7..6723186c46bb 100644
--- a/tools/testing/selftests/bpf/trace_helpers.c
+++ b/tools/testing/selftests/bpf/trace_helpers.c
@@ -10,6 +10,8 @@
 #include
 #include
 #include
+#include
+#include
 #include
 #include "trace_helpers.h"
 #include
@@ -233,29 +235,91 @@ int kallsyms_find(const char *sym, unsigned long long *addr)
 	return err;
 }

+#ifdef PROCMAP_QUERY
+int env_verbosity __weak = 0;
+
+int procmap_query(int fd, const void *addr, __u32 query_flags, size_t *start, size_t *offset, int *flags)
+{
+	char path_buf[PATH_MAX], build_id_buf[20];
+	struct procmap_query q;
+	int err;
+
+	memset(&q, 0, sizeof(q));
+	q.size = sizeof(q);
+	q.query_flags = query_flags;
+	q.query_addr = (__u64)addr;
+	q.vma_name_addr = (__u64)path_buf;
+	q.vma_name_size = sizeof(path_buf);
+	q.build_id_addr = (__u64)build_id_buf;
+	q.build_id_size = sizeof(build_id_buf);
+
+	err = ioctl(fd, PROCMAP_QUERY, &q);
+	if (err < 0) {
+		err = -errno;
+		if (err == -ENOTTY)
+			return -EOPNOTSUPP; /* ioctl() not implemented yet */
+		if (err == -ENOENT)
+			return -ESRCH; /* vma not found */
+		return err;
+	}
+
+	if (env_verbosity >= 1) {
+		printf("VMA FOUND (addr %08lx): %08lx-%08lx %c%c%c%c %08lx %02x:%02x %ld %s (build ID: %s, %d bytes)\n",
+		       (long)addr, (long)q.vma_start, (long)q.vma_end,
+		       (q.vma_flags & PROCMAP_QUERY_VMA_READABLE) ? 'r' : '-',
+		       (q.vma_flags & PROCMAP_QUERY_VMA_WRITABLE) ? 'w' : '-',
+		       (q.vma_flags & PROCMAP_QUERY_VMA_EXECUTABLE) ? 'x' : '-',
+		       (q.vma_flags & PROCMAP_QUERY_VMA_SHARED) ? 's' : 'p',
+		       (long)q.vma_offset, q.dev_major, q.dev_minor, (long)q.inode,
+		       q.vma_name_size ? path_buf : "",
+		       q.build_id_size ? "YES" : "NO",
+		       q.build_id_size);
+	}
+
+	*start = q.vma_start;
+	*offset = q.vma_offset;
+	*flags = q.vma_flags;
+	return 0;
+}
+#else
+/* stub keeps the same signature as the PROCMAP_QUERY-enabled variant above */
+int procmap_query(int fd, const void *addr, __u32 query_flags, size_t *start, size_t *offset, int *flags)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
 ssize_t get_uprobe_offset(const void *addr)
 {
-	size_t start, end, base;
-	char buf[256];
-	bool found = false;
+	size_t start, base, end;
 	FILE *f;
+	char buf[256];
+	int err, flags;

 	f = fopen("/proc/self/maps", "r");
 	if (!f)
 		return -errno;

-	while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
-		if (buf[2] == 'x' && (uintptr_t)addr >= start && (uintptr_t)addr < end) {
-			found = true;
-			break;
+	/* requested executable VMA only */
+	err = procmap_query(fileno(f), addr, PROCMAP_QUERY_VMA_EXECUTABLE, &start, &base, &flags);
+	if (err == -EOPNOTSUPP) {
+		bool found = false;
+
+		while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &base) == 4) {
+			if (buf[2] == 'x' && (uintptr_t)addr >= start && (uintptr_t)addr <
+			    end) {
+				found = true;
+				break;
+			}
+		}
+		if (!found) {
+			fclose(f);
+			return -ESRCH;
 		}
+	} else if (err) {
+		fclose(f);
+		return err;
 	}
-
 	fclose(f);

-	if (!found)
-		return -ESRCH;
-
 #if defined(__powerpc64__) && defined(_CALL_ELF) && _CALL_ELF == 2

 #define OP_RT_RA_MASK 0xffff0000UL
@@ -296,15 +360,25 @@ ssize_t get_rel_offset(uintptr_t addr)
 	size_t start, end, offset;
 	char buf[256];
 	FILE *f;
+	int err, flags;

 	f = fopen("/proc/self/maps", "r");
 	if (!f)
 		return -errno;

-	while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &offset) == 4) {
-		if (addr >= start && addr < end) {
-			fclose(f);
-			return (size_t)addr - start + offset;
+	err = procmap_query(fileno(f), (const void *)addr, 0, &start, &offset, &flags);
+	if (err == 0) {
+		fclose(f);
+		return (size_t)addr - start + offset;
+	} else if (err != -EOPNOTSUPP) {
+		fclose(f);
+		return err;
+	} else if (err) {
+		while (fscanf(f, "%zx-%zx %s %zx %*[^\n]\n", &start, &end, buf, &offset) == 4) {
+			if (addr >= start && addr < end) {
+				fclose(f);
+				return (size_t)addr - start + offset;
+			}
 		}
 	}

From patchwork Wed Jun 5 00:24:54 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Andrii Nakryiko
X-Patchwork-Id: 13686014
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 753C214B071; Wed, 5 Jun 2024 00:25:48 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717547148; cv=none; b=TS4UyZ30/BTrWrt2k6hl46ypsNmBLsl+CqQ2tC/pY0JfMr+k8TWb2K5+h3AwXzZE8hP8gimmSDSeUCQHi0GEmqLtQqG8/GNG2A+VkOZ/8QVoezGlGtkXAMdSBztRpWEDScbXmW7ugDAjupeM7engOwoaIPCEosxXpj+Yv2cHpQA=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
t=1717547148; c=relaxed/simple; bh=NP8+W+bH6pHbXy3KBT/Ddc18s0vHb1N4SFb+qkc6u48=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=jmw65yV6bZGLnq+U+qzcKNQ33KtXsEbTv6oqyko6ElleaLHlgjqsgLrxBMacYiBX+duFtDqW1mXDzD1SkFu8ck7i9+qL8E7D5r6Ry6bvWXdrBtam5yKpCqtyIXHQq5/q6xGnU04EcBecYgy/OLE95d6whVJg+wPXmgghHzOWe5M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=P8wNTrs9; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="P8wNTrs9" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 20434C2BBFC; Wed, 5 Jun 2024 00:25:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1717547148; bh=NP8+W+bH6pHbXy3KBT/Ddc18s0vHb1N4SFb+qkc6u48=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=P8wNTrs9jtprlX6mlecKnGpXHERAfE1l5gK04NjIESARVbZSqk+Q14GN4aFlu6qAs 3CcagYWcZ7CnQHItJhL2c2RBU2c5woxMvw1TSFaJAGabOCC6FPKFeQVOt7K8sXPNxJ XvPgb8DOZolYDBFoxvIxTPfScFUcz4s9ocqER0d8W6xPlzXbBdZ78xVn2Lx80f+KOf 4Vfh0AKDuO4mdu7wk0j3pKtkFuetw9DtTWhK/ALzRXGTJAH//A4BTjiVdJvJPzv6TB +UxijZ9f2+tuRWI0J+B8yXqWPe88AmjMqysO5KhYsJVMIUpQrM93rF+gvEgH2lewkW N/FNjY0HbSzMg== From: Andrii Nakryiko To: linux-fsdevel@vger.kernel.org, brauner@kernel.org, viro@zeniv.linux.org.uk, akpm@linux-foundation.org Cc: linux-kernel@vger.kernel.org, bpf@vger.kernel.org, gregkh@linuxfoundation.org, linux-mm@kvack.org, liam.howlett@oracle.com, surenb@google.com, rppt@kernel.org, Andrii Nakryiko Subject: [PATCH v3 9/9] selftests/bpf: add simple benchmark tool for /proc//maps APIs Date: Tue, 4 Jun 2024 17:24:54 -0700 Message-ID: <20240605002459.4091285-10-andrii@kernel.org> X-Mailer: git-send-email 2.43.0 In-Reply-To: <20240605002459.4091285-1-andrii@kernel.org> References: <20240605002459.4091285-1-andrii@kernel.org> Precedence: bulk X-Mailing-List: 
linux-fsdevel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0

Implement a simple tool/benchmark for comparing address "resolution"
logic based on the textual /proc/<pid>/maps interface and the new binary
ioctl-based PROCMAP_QUERY command.

The tool expects a file with a list of hex addresses and the relevant PID,
and provides control over whether the textual or the binary ioctl-based
way of processing VMAs should be used.

The overall logic implements an efficient way to do batched processing of
a given set of (unsorted) addresses. We first sort them in increasing
order (remembering their original position to restore original order, if
necessary), and then process all VMAs from /proc/<pid>/maps, matching
addresses to VMAs and calculating file offsets, if matched. For the
ioctl-based approach the idea is similar, but is implemented even more
efficiently, requesting only VMAs that cover the given addresses and
skipping all the irrelevant VMAs altogether.

To be able to compare efficiency of both APIs the tool has a "benchmark"
mode. User provides a number of processing runs to run in a tight loop,
timing only the /proc/<pid>/maps parsing and processing parts of the
logic. Address sorting and re-sorting is excluded. This gives a more
direct way to compare ioctl- vs text-based APIs.

We used a medium-sized production application to do a representative
benchmark. A bunch of stack traces were captured, resulting in 3244 user
space addresses (464 unique ones, but we didn't deduplicate them).
Application itself had 655 VMAs reported in /proc/<pid>/maps.

Averaging time taken to process all addresses 10000 times showed that:
  - text-based approach took 333 microseconds *per one batch run*;
  - ioctl-based approach took 8 microseconds *per (identical) batch run*.

This gives about a 40x speed up (333 / 8 is roughly 41.6) to do exactly
the same amount of work (build IDs were not fetched for ioctl-based
benchmark; fetching build IDs resulted in 2x slowdown compared to
no-build-ID case).
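
The batched matching described above (sort the addresses, then sweep once
over VMAs in increasing address order) can be sketched as follows. This is
an illustration only, not the code from this patch: struct vma_range,
struct query_addr and resolve_sorted() are hypothetical names, and the VMA
list is assumed to be already sorted and non-overlapping, as it is in
/proc/<pid>/maps.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical types for illustration; not the tool's actual structs. */
struct vma_range { uint64_t start, end, file_off; };	/* covers [start, end) */
struct query_addr { uint64_t addr; int idx; };		/* idx = original position */

static int cmp_query_addr(const void *a, const void *b)
{
	const struct query_addr *x = a, *y = b;

	if (x->addr != y->addr)
		return x->addr < y->addr ? -1 : 1;
	return (x->idx > y->idx) - (x->idx < y->idx);
}

/*
 * Resolve n addresses against m VMAs sorted by start address.
 * out[k] receives the file offset of the address whose original position
 * was k, or UINT64_MAX if no VMA covers it.
 * Returns the number of resolved addresses.
 */
static size_t resolve_sorted(struct query_addr *addrs, size_t n,
			     const struct vma_range *vmas, size_t m,
			     uint64_t *out)
{
	size_t i = 0, v, resolved = 0;

	assert(addrs && vmas && out);
	qsort(addrs, n, sizeof(*addrs), cmp_query_addr);

	for (v = 0; v < m && i < n; v++) {
		/* addresses below this VMA fall into a gap: unresolved */
		for (; i < n && addrs[i].addr < vmas[v].start; i++)
			out[addrs[i].idx] = UINT64_MAX;
		/* addresses inside [start, end) translate to a file offset */
		for (; i < n && addrs[i].addr < vmas[v].end; i++) {
			out[addrs[i].idx] = addrs[i].addr - vmas[v].start +
					    vmas[v].file_off;
			resolved++;
		}
	}
	/* anything past the last VMA is unresolved as well */
	for (; i < n; i++)
		out[addrs[i].idx] = UINT64_MAX;

	return resolved;
}
```

Each address and each VMA is visited at most once, so after sorting the
sweep is linear in the number of addresses plus the number of VMAs. Results
are written by original index, so no re-sort is needed to report them in
input order.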
The ratio will vary depending on the exact set of addresses and how many
VMAs they are mapped to. So 40x isn't something to take for granted, but
it does show possible improvements that are achievable.

I also did an strace run of both cases. In the text-based one the tool did
27 read() syscalls, fetching up to 4KB of data in one go (this is a
seq_file limitation; bumping the buffer size has no effect, as data is
always capped at 4KB). In comparison, the ioctl-based implementation had
to do only 5 ioctl() calls to fetch all relevant VMAs.

It is projected that savings from processing big production applications
would only widen the gap in favor of the binary-based querying ioctl API,
as bigger applications will tend to have even more non-executable VMA
mappings relative to executable ones. E.g., one of the larger production
applications in the server fleet has upwards of 20000 VMAs, which would
make the benchmark even more unfair to processing the /proc/<pid>/maps
file.

This tool implements one of the patterns of usage, referred to as the
"on-demand profiling" use case in the main patch implementing the ioctl()
API. perf is an example of the pre-processing pattern, in which all (or
all executable) VMAs are loaded and stored for further querying. We
implemented an experimental change to perf to benchmark text-based and
ioctl-based APIs, and in perf benchmarks the ioctl-based interface was no
worse than the optimized text-based parsing benchmark. Filtering to only
executable VMAs further made ioctl-based benchmarks faster, as perf
would be querying only about 1/3 of all VMAs, compared to the need to
read and parse all of the VMAs.

E.g., running `perf bench internals synthesize --mt -M 8`, we get:

TEXT-BASED
==========

  # ./perf-parse bench internals synthesize --mt -M 8
  # Running 'internals/synthesize' benchmark:
  Computing performance of multi threaded perf event synthesis by
  synthesizing events on CPU 0:
    Number of synthesis threads: 1
      Average synthesis took: 10238.600 usec (+- 309.656 usec)
      Average num.
events: 3744.000 (+- 0.000)
      Average time per event 2.735 usec
    ...
    Number of synthesis threads: 8
      Average synthesis took: 6814.600 usec (+- 149.418 usec)
      Average num. events: 3744.000 (+- 0.000)
      Average time per event 1.820 usec

IOCTL-BASED, FETCHING ALL VMAS
==============================

  # ./perf-ioctl-all bench internals synthesize --mt -M 8
  # Running 'internals/synthesize' benchmark:
  Computing performance of multi threaded perf event synthesis by
  synthesizing events on CPU 0:
    Number of synthesis threads: 1
      Average synthesis took: 9944.800 usec (+- 381.794 usec)
      Average num. events: 3593.000 (+- 0.000)
      Average time per event 2.768 usec
    ...
    Number of synthesis threads: 8
      Average synthesis took: 6598.600 usec (+- 137.503 usec)
      Average num. events: 3595.000 (+- 0.000)
      Average time per event 1.835 usec

IOCTL-BASED, FETCHING EXECUTABLE VMAS
=====================================

  # ./perf-ioctl-exec bench internals synthesize --mt -M 8
  # Running 'internals/synthesize' benchmark:
  Computing performance of multi threaded perf event synthesis by
  synthesizing events on CPU 0:
    Number of synthesis threads: 1
      Average synthesis took: 8539.600 usec (+- 364.875 usec)
      Average num. events: 3569.000 (+- 0.000)
      Average time per event 2.393 usec
    ...
    Number of synthesis threads: 8
      Average synthesis took: 5657.600 usec (+- 107.219 usec)
      Average num.
events: 3571.000 (+- 0.000)
      Average time per event 1.584 usec

Signed-off-by: Andrii Nakryiko
---
 tools/testing/selftests/bpf/.gitignore     |   1 +
 tools/testing/selftests/bpf/Makefile       |   2 +-
 tools/testing/selftests/bpf/procfs_query.c | 386 +++++++++++++++++++++
 3 files changed, 388 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/procfs_query.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 5025401323af..903b14931bfe 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -44,6 +44,7 @@ test_cpp
 /veristat
 /sign-file
 /uprobe_multi
+/procfs_query
 *.ko
 *.tmp
 xskxceiver
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e0b3887b3d2d..0afa667a54e5 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -144,7 +144,7 @@ TEST_GEN_PROGS_EXTENDED = test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
 	test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
 	xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \
-	xdp_features bpf_test_no_cfi.ko
+	xdp_features bpf_test_no_cfi.ko procfs_query

 TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi

diff --git a/tools/testing/selftests/bpf/procfs_query.c b/tools/testing/selftests/bpf/procfs_query.c
new file mode 100644
index 000000000000..63e06568f1ff
--- /dev/null
+++ b/tools/testing/selftests/bpf/procfs_query.c
@@ -0,0 +1,386 @@
+// SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+static bool verbose;
+static bool quiet;
+static bool use_ioctl;
+static bool request_build_id;
+static char *addrs_path;
+static int pid;
+static int bench_runs;
+
+const char *argp_program_version = "procfs_query 0.0";
+const char
+	*argp_program_bug_address = "";
+
+static inline uint64_t get_time_ns(void)
+{
+	struct timespec t;
+
+	clock_gettime(CLOCK_MONOTONIC, &t);
+
+	return (uint64_t)t.tv_sec * 1000000000 + t.tv_nsec;
+}
+
+static const struct argp_option opts[] = {
+	{ "verbose", 'v', NULL, 0, "Verbose mode" },
+	{ "quiet", 'q', NULL, 0, "Quiet mode (no output)" },
+	{ "pid", 'p', "PID", 0, "PID of the process" },
+	{ "addrs-path", 'f', "PATH", 0, "File with addresses to resolve" },
+	{ "benchmark", 'B', "RUNS", 0, "Benchmark mode" },
+	{ "query", 'Q', NULL, 0, "Use ioctl()-based point query API (by default text parsing is done)" },
+	{ "build-id", 'b', NULL, 0, "Fetch build ID, if available (only for ioctl mode)" },
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	switch (key) {
+	case 'v':
+		verbose = true;
+		break;
+	case 'q':
+		quiet = true;
+		break;
+	case 'Q':
+		use_ioctl = true;
+		break;
+	case 'b':
+		request_build_id = true;
+		break;
+	case 'p':
+		pid = strtol(arg, NULL, 10);
+		break;
+	case 'f':
+		addrs_path = strdup(arg);
+		break;
+	case 'B':
+		bench_runs = strtol(arg, NULL, 10);
+		if (bench_runs <= 0) {
+			fprintf(stderr, "Invalid benchmark run count: %s\n", arg);
+			return -EINVAL;
+		}
+		break;
+	case ARGP_KEY_ARG:
+		argp_usage(state);
+		break;
+	default:
+		return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+static const struct argp argp = {
+	.options = opts,
+	.parser = parse_arg,
+};
+
+struct addr {
+	unsigned long long addr;
+	int idx;
+};
+
+static struct addr *addrs;
+static size_t addr_cnt, addr_cap;
+
+struct resolved_addr {
+	unsigned long long file_off;
+	const char *vma_name;
+	int build_id_sz;
+	char build_id[20];
+};
+
+static struct resolved_addr *resolved;
+
+static int resolve_addrs_ioctl(void)
+{
+	char buf[32], build_id_buf[20], vma_name[PATH_MAX];
+	struct procmap_query q;
+	int fd, err, i;
+	struct addr *a = &addrs[0];
+	struct resolved_addr *r;
+
+	snprintf(buf, sizeof(buf), "/proc/%d/maps", pid);
+	fd = open(buf,
+		  O_RDONLY);
+	if (fd < 0) {
+		err = -errno;
+		fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err);
+		return err;
+	}
+
+	memset(&q, 0, sizeof(q));
+	q.size = sizeof(q);
+	q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;
+	q.vma_name_addr = (__u64)vma_name;
+	if (request_build_id)
+		q.build_id_addr = (__u64)build_id_buf;
+
+	for (i = 0; i < addr_cnt; ) {
+		char *name = NULL;
+
+		q.query_addr = (__u64)a->addr;
+		q.vma_name_size = sizeof(vma_name);
+		if (request_build_id)
+			q.build_id_size = sizeof(build_id_buf);
+
+		err = ioctl(fd, PROCMAP_QUERY, &q);
+		if (err < 0 && errno == ENOTTY) {
+			close(fd);
+			fprintf(stderr, "PROCMAP_QUERY ioctl() command is not supported on this kernel!\n");
+			return -EOPNOTSUPP; /* ioctl() not implemented yet */
+		}
+		if (err < 0 && errno == ENOENT) {
+			fprintf(stderr, "ENOENT addr %lx\n", (long)q.query_addr);
+			i++;
+			a++;
+			continue; /* unresolved address */
+		}
+		if (err < 0) {
+			err = -errno;
+			close(fd);
+			fprintf(stderr, "PROCMAP_QUERY ioctl() returned error: %d\n", err);
+			return err;
+		}
+
+		if (verbose) {
+			printf("VMA FOUND (addr %08lx): %08lx-%08lx %c%c%c%c %08lx %02x:%02x %ld %s (build ID: %s, %d bytes)\n",
+			       (long)q.query_addr, (long)q.vma_start, (long)q.vma_end,
+			       (q.vma_flags & PROCMAP_QUERY_VMA_READABLE) ? 'r' : '-',
+			       (q.vma_flags & PROCMAP_QUERY_VMA_WRITABLE) ? 'w' : '-',
+			       (q.vma_flags & PROCMAP_QUERY_VMA_EXECUTABLE) ? 'x' : '-',
+			       (q.vma_flags & PROCMAP_QUERY_VMA_SHARED) ? 's' : 'p',
+			       (long)q.vma_offset, q.dev_major, q.dev_minor, (long)q.inode,
+			       q.vma_name_size ? vma_name : "",
+			       q.build_id_size ?
"YES" : "NO", + q.build_id_size); + } + + /* skip addrs falling before current VMA */ + for (; i < addr_cnt && a->addr < q.vma_start; i++, a++) { + } + /* process addrs covered by current VMA */ + for (; i < addr_cnt && a->addr < q.vma_end; i++, a++) { + r = &resolved[a->idx]; + r->file_off = a->addr - q.vma_start + q.vma_offset; + + /* reuse name, if it was already strdup()'ed */ + if (q.vma_name_size) + name = name ?: strdup(vma_name); + r->vma_name = name; + + if (q.build_id_size) { + r->build_id_sz = q.build_id_size; + memcpy(r->build_id, build_id_buf, q.build_id_size); + } + } + } + + close(fd); + return 0; +} + +static int resolve_addrs_parse(void) +{ + size_t vma_start, vma_end, vma_offset, ino; + uint32_t dev_major, dev_minor; + char perms[4], buf[32], vma_name[PATH_MAX], fbuf[4096]; + FILE *f; + int err, idx = 0; + struct addr *a = &addrs[idx]; + struct resolved_addr *r; + + snprintf(buf, sizeof(buf), "/proc/%d/maps", pid); + f = fopen(buf, "r"); + if (!f) { + err = -errno; + fprintf(stderr, "Failed to open process map file (%s): %d\n", buf, err); + return err; + } + + err = setvbuf(f, fbuf, _IOFBF, sizeof(fbuf)); + if (err) { + err = -errno; + fprintf(stderr, "Failed to set custom file buffer size: %d\n", err); + return err; + } + + while ((err = fscanf(f, "%zx-%zx %c%c%c%c %zx %x:%x %zu %[^\n]\n", + &vma_start, &vma_end, + &perms[0], &perms[1], &perms[2], &perms[3], + &vma_offset, &dev_major, &dev_minor, &ino, vma_name)) >= 10) { + const char *name = NULL; + + /* skip addrs before current vma, they stay unresolved */ + for (; idx < addr_cnt && a->addr < vma_start; idx++, a++) { + } + + /* resolve all addrs within current vma now */ + for (; idx < addr_cnt && a->addr < vma_end; idx++, a++) { + r = &resolved[a->idx]; + r->file_off = a->addr - vma_start + vma_offset; + + /* reuse name, if it was already strdup()'ed */ + if (err > 10) + name = name ?: strdup(vma_name); + else + name = NULL; + r->vma_name = name; + } + + /* ran out of addrs to resolve, stop 
+		   early */
+		if (idx >= addr_cnt)
+			break;
+	}
+
+	fclose(f);
+	return 0;
+}
+
+static int cmp_by_addr(const void *a, const void *b)
+{
+	const struct addr *x = a, *y = b;
+
+	if (x->addr != y->addr)
+		return x->addr < y->addr ? -1 : 1;
+	return x->idx < y->idx ? -1 : 1;
+}
+
+static int cmp_by_idx(const void *a, const void *b)
+{
+	const struct addr *x = a, *y = b;
+
+	return x->idx < y->idx ? -1 : 1;
+}
+
+int main(int argc, char **argv)
+{
+	FILE* f;
+	int err, i;
+	unsigned long long addr;
+	uint64_t start_ns;
+	double total_ns;
+
+	/* Parse command line arguments */
+	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
+	if (err)
+		return err;
+
+	if (pid <= 0 || !addrs_path) {
+		fprintf(stderr, "Please provide PID and file with addresses to process!\n");
+		exit(1);
+	}
+
+	if (verbose) {
+		fprintf(stderr, "PID: %d\n", pid);
+		fprintf(stderr, "PATH: %s\n", addrs_path);
+	}
+
+	f = fopen(addrs_path, "r");
+	if (!f) {
+		err = -errno;
+		fprintf(stderr, "Failed to open '%s': %d\n", addrs_path, err);
+		goto out;
+	}
+
+	while ((err = fscanf(f, "%llx\n", &addr)) == 1) {
+		if (addr_cnt == addr_cap) {
+			addr_cap = addr_cap == 0 ?
+				   16 : (addr_cap * 3 / 2);
+			addrs = realloc(addrs, sizeof(*addrs) * addr_cap);
+			memset(addrs + addr_cnt, 0, (addr_cap - addr_cnt) * sizeof(*addrs));
+		}
+
+		addrs[addr_cnt].addr = addr;
+		addrs[addr_cnt].idx = addr_cnt;
+
+		addr_cnt++;
+	}
+	if (verbose)
+		fprintf(stderr, "READ %zu addrs!\n", addr_cnt);
+	if (!feof(f)) {
+		fprintf(stderr, "Failure parsing full list of addresses at '%s'!\n", addrs_path);
+		err = -EINVAL;
+		fclose(f);
+		goto out;
+	}
+	fclose(f);
+	if (addr_cnt == 0) {
+		fprintf(stderr, "No addresses provided, bailing out!\n");
+		err = -ENOENT;
+		goto out;
+	}
+
+	resolved = calloc(addr_cnt, sizeof(*resolved));
+
+	qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_addr);
+	if (verbose) {
+		fprintf(stderr, "SORTED ADDRS (%zu):\n", addr_cnt);
+		for (i = 0; i < addr_cnt; i++) {
+			fprintf(stderr, "ADDR #%d: %#llx\n", addrs[i].idx, addrs[i].addr);
+		}
+	}
+
+	start_ns = get_time_ns();
+	for (i = bench_runs ?: 1; i > 0; i--) {
+		if (use_ioctl) {
+			err = resolve_addrs_ioctl();
+		} else {
+			err = resolve_addrs_parse();
+		}
+		if (err) {
+			fprintf(stderr, "Failed to resolve addrs: %d!\n", err);
+			goto out;
+		}
+	}
+	total_ns = get_time_ns() - start_ns;
+
+	if (bench_runs) {
+		fprintf(stderr, "BENCHMARK MODE.
RUNS: %d TOTAL TIME (ms): %.3lf TIME/RUN (ms): %.3lf TIME/ADDR (us): %.3lf\n",
+			bench_runs, total_ns / 1000000.0, total_ns / bench_runs / 1000000.0,
+			total_ns / bench_runs / addr_cnt / 1000.0);
+	}
+
+	/* sort them back into the original order */
+	qsort(addrs, addr_cnt, sizeof(*addrs), cmp_by_idx);
+
+	if (!quiet) {
+		printf("RESOLVED ADDRS (%zu):\n", addr_cnt);
+		for (i = 0; i < addr_cnt; i++) {
+			const struct addr *a = &addrs[i];
+			const struct resolved_addr *r = &resolved[a->idx];
+
+			if (r->file_off) {
+				printf("RESOLVED #%d: %#llx -> OFF %#llx",
+					a->idx, a->addr, r->file_off);
+				if (r->vma_name)
+					printf(" NAME %s", r->vma_name);
+				if (r->build_id_sz) {
+					char build_id_str[41];
+					int j;
+
+					for (j = 0; j < r->build_id_sz; j++)
+						sprintf(&build_id_str[j * 2], "%02hhx", r->build_id[j]);
+					printf(" BUILDID %s", build_id_str);
+				}
+				printf("\n");
+			} else {
+				printf("UNRESOLVED #%d: %#llx\n", a->idx, a->addr);
+			}
+		}
+	}
+out:
+	free(addrs);
+	free(addrs_path);
+	free(resolved);
+
+	return err < 0 ? -err : 0;
+}
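
The output loop above hex-encodes the build ID with sprintf() into a fixed
41-byte buffer, which is sized for a 20-byte ID. A bounds-checked variant of
that step might look like the sketch below. build_id_to_hex() is a
hypothetical helper used only for illustration; it is not part of this
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (illustration only, not part of the patch):
 * hex-encode id_sz bytes of a binary build ID into out as a
 * NUL-terminated lowercase string.
 * Returns 0 on success, -1 if out is too small to hold the result.
 */
static int build_id_to_hex(const unsigned char *id, size_t id_sz,
			   char *out, size_t out_sz)
{
	size_t i;

	assert(id && out);

	/* two hex digits per byte plus the terminating NUL */
	if (out_sz < id_sz * 2 + 1)
		return -1;

	for (i = 0; i < id_sz; i++)
		snprintf(out + i * 2, 3, "%02x", id[i]);
	out[id_sz * 2] = '\0';

	return 0;
}
```

Taking the ID as unsigned char avoids relying on the hh length modifier to
undo sign extension of plain char values.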