diff mbox series

[v5,3/6] fs/procfs: add build ID fetching to PROCMAP_QUERY API

Message ID 20240618224527.3685213-4-andrii@kernel.org (mailing list archive)
State Superseded
Headers show
Series ioctl()-based API to query VMAs from /proc/<pid>/maps | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Andrii Nakryiko June 18, 2024, 10:45 p.m. UTC
The need to get ELF build ID reliably is an important aspect when
dealing with profiling and stack trace symbolization, and
/proc/<pid>/maps textual representation doesn't help with this.

To get backing file's ELF build ID, application has to first resolve
VMA, then use it's start/end address range to follow a special
/proc/<pid>/map_files/<start>-<end> symlink to open the ELF file (this
is necessary because backing file might have been removed from the disk
or was already replaced with another binary in the same file path.

Such approach, beyond just adding complexity of having to do a bunch of
extra work, has extra security implications. Because application opens
underlying ELF file and needs read access to its entire contents (as far
as kernel is concerned), kernel puts additional capable() checks on
following /proc/<pid>/map_files/<start>-<end> symlink. And that makes
sense in general.

But in the case of build ID, profiler/symbolizer doesn't need the
contents of ELF file, per se. It's only build ID that is of interest,
and ELF build ID itself doesn't provide any sensitive information.

So this patch adds a way to request backing file's ELF build ID along
the rest of VMA information in the same API. User has control over
whether this piece of information is requested or not by either setting
build_id_size field to zero or non-zero maximum buffer size they
provided through build_id_addr field (which encodes user pointer as
__u64 field). This is a completely optional piece of information, and so
has no performance implications for user cases that don't care about
build ID, while improving performance and simplifying the setup for
those application that do need it.

Kernel already implements build ID fetching, which is used from BPF
subsystem. We are reusing this code here, but plan a follow up changes
to make it work better under more relaxed assumption (compared to what
existing code assumes) of being called from user process context, in
which page faults are allowed. BPF-specific implementation currently
bails out if necessary part of ELF file is not paged in, all due to
extra BPF-specific restrictions (like the need to fetch build ID in
restrictive contexts such as NMI handler).

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 fs/proc/task_mmu.c      | 25 ++++++++++++++++++++++++-
 include/uapi/linux/fs.h | 28 ++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+), 1 deletion(-)

Comments

Alexey Dobriyan June 19, 2024, 10:14 a.m. UTC | #1
On Tue, Jun 18, 2024 at 03:45:22PM -0700, Andrii Nakryiko wrote:
> The need to get ELF build ID reliably is an important aspect when
> dealing with profiling and stack trace symbolization, and
> /proc/<pid>/maps textual representation doesn't help with this.

> @@ -539,6 +543,21 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
>  		}
>  	}
>  
> +	if (karg.build_id_size) {
> +		__u32 build_id_sz;
> +
> +		err = build_id_parse(vma, build_id_buf, &build_id_sz);

This is not your bug but build_id_parse() assumes program headers
immediately follow ELF header which is not guaranteed.

> +	 * If this field is set to non-zero value, build_id_addr should point
> +	 * to valid user space memory buffer of at least build_id_size bytes.
> +	 * If set to zero, build_id_addr should be set to zero as well
> +	 */
> +	__u32 build_id_size;		/* in/out */
>  	/*
>  	 * User-supplied address of a buffer of at least vma_name_size bytes
>  	 * for kernel to fill with matched VMA's name (see vma_name_size field
> @@ -519,6 +539,14 @@ struct procmap_query {
>  	 * Should be set to zero if VMA name should not be returned.
>  	 */
>  	__u64 vma_name_addr;		/* in */
> +	/*
> +	 * User-supplied address of a buffer of at least build_id_size bytes
> +	 * for kernel to fill with matched VMA's ELF build ID, if available
> +	 * (see build_id_size field description above for details).
> +	 *
> +	 * Should be set to zero if build ID should not be returned.
> +	 */
> +	__u64 build_id_addr;		/* in */

Can this be simplified to 512-bit buffer in ioctl structure?
BUILD_ID_SIZE_MAX is 20 which is sha1.
Andrii Nakryiko June 20, 2024, 6:50 p.m. UTC | #2
On Wed, Jun 19, 2024 at 3:14 AM Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
> On Tue, Jun 18, 2024 at 03:45:22PM -0700, Andrii Nakryiko wrote:
> > The need to get ELF build ID reliably is an important aspect when
> > dealing with profiling and stack trace symbolization, and
> > /proc/<pid>/maps textual representation doesn't help with this.
>
> > @@ -539,6 +543,21 @@ static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
> >               }
> >       }
> >
> > +     if (karg.build_id_size) {
> > +             __u32 build_id_sz;
> > +
> > +             err = build_id_parse(vma, build_id_buf, &build_id_sz);
>
> This is not your bug but build_id_parse() assumes program headers
> immediately follow ELF header which is not guaranteed.

Yes, I'm aware, and I think I stated somewhere that I want to
fix/improve that. The thing is, current build_id_parse() was built for
BPF under NMI context assumption, which is why it can't page in memory
and so on (and this "build ID has to be in the first page" was a
surprise to me, but probably just a technical shortcut to make it a
bit easier to implement). Regardless, my plan, once this API is
merged, is to follow up with make build_id_parse() variant that would
work reliably under sleepable context assumptions. Hopefully that's ok
not to bundle all that with these patches?

>
> > +      * If this field is set to non-zero value, build_id_addr should point
> > +      * to valid user space memory buffer of at least build_id_size bytes.
> > +      * If set to zero, build_id_addr should be set to zero as well
> > +      */
> > +     __u32 build_id_size;            /* in/out */
> >       /*
> >        * User-supplied address of a buffer of at least vma_name_size bytes
> >        * for kernel to fill with matched VMA's name (see vma_name_size field
> > @@ -519,6 +539,14 @@ struct procmap_query {
> >        * Should be set to zero if VMA name should not be returned.
> >        */
> >       __u64 vma_name_addr;            /* in */
> > +     /*
> > +      * User-supplied address of a buffer of at least build_id_size bytes
> > +      * for kernel to fill with matched VMA's ELF build ID, if available
> > +      * (see build_id_size field description above for details).
> > +      *
> > +      * Should be set to zero if build ID should not be returned.
> > +      */
> > +     __u64 build_id_addr;            /* in */
>
> Can this be simplified to 512-bit buffer in ioctl structure?
> BUILD_ID_SIZE_MAX is 20 which is sha1.

I'd prefer not to because vma_name can't use the same trick, so we'll
have to have this size+buffer address approach anyway. And because of
that I'd like to have all these optional variable-length/string output
arguments handled in a uniform way. In practice, it's really simple to
use this from user-space, the only mildly annoying part is casting
pointer to __u64. But as I said, for vma_name users will do this
anyways, so not much benefit simplifying the build_id part only.
diff mbox series

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 674405b99d0d..32bef3eeab7f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -22,6 +22,7 @@ 
 #include <linux/pkeys.h>
 #include <linux/minmax.h>
 #include <linux/overflow.h>
+#include <linux/buildid.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -445,6 +446,7 @@  static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
 	addr = vma->vm_end;
 	if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
 		goto next_vma;
+
 no_vma:
 	return ERR_PTR(-ENOENT);
 }
@@ -455,7 +457,7 @@  static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 	const char *name = NULL;
-	char *name_buf = NULL;
+	char build_id_buf[BUILD_ID_SIZE_MAX], *name_buf = NULL;
 	__u64 usize;
 	int err;
 
@@ -477,6 +479,8 @@  static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	/* either both buffer address and size are set, or both should be zero */
 	if (!!karg.vma_name_size != !!karg.vma_name_addr)
 		return -EINVAL;
+	if (!!karg.build_id_size != !!karg.build_id_addr)
+		return -EINVAL;
 
 	mm = priv->mm;
 	if (!mm || !mmget_not_zero(mm))
@@ -539,6 +543,21 @@  static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 		}
 	}
 
+	if (karg.build_id_size) {
+		__u32 build_id_sz;
+
+		err = build_id_parse(vma, build_id_buf, &build_id_sz);
+		if (err) {
+			karg.build_id_size = 0;
+		} else {
+			if (karg.build_id_size < build_id_sz) {
+				err = -ENAMETOOLONG;
+				goto out;
+			}
+			karg.build_id_size = build_id_sz;
+		}
+	}
+
 	if (karg.vma_name_size) {
 		size_t name_buf_sz = min_t(size_t, PATH_MAX, karg.vma_name_size);
 		const struct path *path;
@@ -583,6 +602,10 @@  static int do_procmap_query(struct proc_maps_private *priv, void __user *uarg)
 	}
 	kfree(name_buf);
 
+	if (karg.build_id_size && copy_to_user((void __user *)karg.build_id_addr,
+					       build_id_buf, karg.build_id_size))
+		return -EFAULT;
+
 	if (copy_to_user(uarg, &karg, min_t(size_t, sizeof(karg), usize)))
 		return -EFAULT;
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 74480ed4fa78..cad6375044bc 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -511,6 +511,26 @@  struct procmap_query {
 	 * If set to zero, vma_name_addr should be set to zero as well
 	 */
 	__u32 vma_name_size;		/* in/out */
+	/*
+	 * If set to non-zero value, signals the request to extract and return
+	 * VMA's backing file's build ID, if the backing file is an ELF file
+	 * and it contains embedded build ID.
+	 *
+	 * Kernel will set this field to zero, if VMA has no backing file,
+	 * backing file is not an ELF file, or ELF file has no build ID
+	 * embedded.
+	 *
+	 * Build ID is a binary value (not a string). Kernel will set
+	 * build_id_size field to exact number of bytes used for build ID.
+	 * If build ID is requested and present, but needs more bytes than
+	 * user-supplied maximum buffer size (see build_id_addr field below),
+	 * -E2BIG error will be returned.
+	 *
+	 * If this field is set to non-zero value, build_id_addr should point
+	 * to valid user space memory buffer of at least build_id_size bytes.
+	 * If set to zero, build_id_addr should be set to zero as well
+	 */
+	__u32 build_id_size;		/* in/out */
 	/*
 	 * User-supplied address of a buffer of at least vma_name_size bytes
 	 * for kernel to fill with matched VMA's name (see vma_name_size field
@@ -519,6 +539,14 @@  struct procmap_query {
 	 * Should be set to zero if VMA name should not be returned.
 	 */
 	__u64 vma_name_addr;		/* in */
+	/*
+	 * User-supplied address of a buffer of at least build_id_size bytes
+	 * for kernel to fill with matched VMA's ELF build ID, if available
+	 * (see build_id_size field description above for details).
+	 *
+	 * Should be set to zero if build ID should not be returned.
+	 */
+	__u64 build_id_addr;		/* in */
 };
 
 #endif /* _UAPI_LINUX_FS_H */