[0/5] ioctl()-based API to query VMAs from /proc/<pid>/maps

Message ID 20240504003006.3303334-1-andrii@kernel.org (mailing list archive)

Andrii Nakryiko May 4, 2024, 12:30 a.m. UTC
Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
applications to query VMA information more efficiently than through textual
processing of /proc/<pid>/maps contents. See patch #2 for the context,
justification, and nuances of the API design.

Patch #1 is a refactoring to keep VMA name determination logic in one place.
Patch #2 is the meat of the kernel-side API.
Patch #3 just syncs the UAPI header (linux/fs.h) into tools/include.
Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
optionally use this new ioctl()-based API, if supported.
Patch #5 implements a simple C tool that demonstrates the intended efficient
use (of both the textual and binary interfaces) and allows benchmarking them.
The patch itself also includes performance numbers from a test based on a
medium-sized internal application taken from production.

This patch set is based on top of the next-20240503 tag in the linux-next
tree. I'm not sure what the target tree for this should be, so I'd appreciate
any guidance, thank you!

Andrii Nakryiko (5):
  fs/procfs: extract logic for getting VMA name constituents
  fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
  tools: sync uapi/linux/fs.h header into tools subdir
  selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available
  selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

 fs/proc/task_mmu.c                            | 290 +++++++++++---
 include/uapi/linux/fs.h                       |  32 ++
 .../perf/trace/beauty/include/uapi/linux/fs.h |  32 ++
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   2 +-
 tools/testing/selftests/bpf/procfs_query.c    | 366 ++++++++++++++++++
 tools/testing/selftests/bpf/test_progs.c      |   3 +
 tools/testing/selftests/bpf/test_progs.h      |   2 +
 tools/testing/selftests/bpf/trace_helpers.c   | 105 ++++-
 9 files changed, 763 insertions(+), 70 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/procfs_query.c

Comments

Christian Brauner May 4, 2024, 11:24 a.m. UTC | #1
On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> applications to query VMA information more efficiently than through textual
> processing of /proc/<pid>/maps contents. See patch #2 for the context,
> justification, and nuances of the API design.
> 
> Patch #1 is a refactoring to keep VMA name logic determination in one place.
> Patch #2 is the meat of kernel-side API.
> Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> optionally use this new ioctl()-based API, if supported.
> Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> both textual and binary interfaces) and allows benchmarking them. Patch itself
> also has performance numbers of a test based on one of the medium-sized
> internal applications taken from production.

I don't have anything against adding a binary interface for this. But
it's somewhat odd to do ioctls based on /proc files. I wonder if there
isn't a more suitable place for this. prctl()? New vmstat() system call
using a pidfd/pid as reference? ioctl() on fs/pidfs.c?
Greg KH May 4, 2024, 3:33 p.m. UTC | #2
On Sat, May 04, 2024 at 01:24:23PM +0200, Christian Brauner wrote:
> On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than through textual
> > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > justification, and nuances of the API design.
> > 
> > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > Patch #2 is the meat of kernel-side API.
> > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > optionally use this new ioctl()-based API, if supported.
> > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > also has performance numbers of a test based on one of the medium-sized
> > internal applications taken from production.
> 
> I don't have anything against adding a binary interface for this. But
> it's somewhat odd to do ioctls based on /proc files. I wonder if there
> isn't a more suitable place for this. prctl()? New vmstat() system call
> using a pidfd/pid as reference? ioctl() on fs/pidfs.c?

See my objection to the ioctl api in the patch review itself.

Also, as this is a new user/kernel api, it needs loads of documentation
(there was none), and probably also cc: linux-api, right?

thanks,

greg k-h
Andrii Nakryiko May 4, 2024, 9:50 p.m. UTC | #3
On Sat, May 4, 2024 at 8:34 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Sat, May 04, 2024 at 01:24:23PM +0200, Christian Brauner wrote:
> > On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> > > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > > applications to query VMA information more efficiently than through textual
> > > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > > justification, and nuances of the API design.
> > >
> > > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > > Patch #2 is the meat of kernel-side API.
> > > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > > optionally use this new ioctl()-based API, if supported.
> > > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > > also has performance numbers of a test based on one of the medium-sized
> > > internal applications taken from production.
> >
> > I don't have anything against adding a binary interface for this. But
> > it's somewhat odd to do ioctls based on /proc files. I wonder if there
> > isn't a more suitable place for this. prctl()? New vmstat() system call
> > using a pidfd/pid as reference? ioctl() on fs/pidfs.c?
>
> See my objection to the ioctl api in the patch review itself.

Will address them there.


>
> Also, as this is a new user/kernel api, it needs loads of documentation
> (there was none), and probably also cc: linux-api, right?

Will cc linux-api. And yes, I didn't want to invest too much time in
documentation upfront, as I knew the API itself would be tweaked and
tuned, and possibly moved to some other place (see Christian's pidfd
suggestion). But I'm happy to write it; I'd appreciate pointers on
where exactly this should live. Thanks!

>
> thanks,
>
> greg k-h
Andrii Nakryiko May 4, 2024, 9:50 p.m. UTC | #4
On Sat, May 4, 2024 at 4:24 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, May 03, 2024 at 05:30:01PM -0700, Andrii Nakryiko wrote:
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than through textual
> > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > justification, and nuances of the API design.
> >
> > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > Patch #2 is the meat of kernel-side API.
> > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > optionally use this new ioctl()-based API, if supported.
> > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > also has performance numbers of a test based on one of the medium-sized
> > internal applications taken from production.
>
> I don't have anything against adding a binary interface for this. But
> it's somewhat odd to do ioctls based on /proc files. I wonder if there
> isn't a more suitable place for this. prctl()? New vmstat() system call
> using a pidfd/pid as reference? ioctl() on fs/pidfs.c?

I did ioctl() on /proc/<pid>/maps because that's the file that's used
for the same use cases today, and it can be opened from other processes
for any target PID. I'm open to any suggestions that make more sense;
this v1 is mostly to start the conversation.

prctl() probably doesn't make sense, as, according to its man page:

       prctl() manipulates various aspects of the behavior of the
       calling thread or process.

And this facility is most often used from another (profiler or
symbolizer) process.

A new syscall feels like overkill, but if that's the only way, so be it.

I do like the idea of ioctl() on top of a pidfd (I assume that's what
you mean by "fs/pidfs.c", right?). This seems most promising. One
question/nuance: if I understand correctly, a pidfd won't hold a
task_struct (and its mm_struct) reference, right? So if the process
exits, even if I have the pidfd, that task is gone and we won't be
able to query it. Is that right?

If yes, then it's still workable in a lot of situations, but it would
be nice to have the ability to query VMAs (at least for the binary's
own text segments) even after the process exits. This is the case for
short-lived processes that profilers capture some stack traces from,
but which are gone by the time those stack traces are processed.

This might be a stupid idea and question, but what if an ioctl() on the
pidfd itself created another FD representing the mm_struct of that
process, and then we had an ioctl() on *that* sort-of-mm-struct-fd to
query VMAs? Would that work at all? This approach would allow a
long-running profiler application to open the pidfd and this other "mm
fd" once, cache them, and then just query them. Meanwhile, we can
epoll() the pidfd itself to know when the process exits, so that these
mm_structs are not referenced for longer than necessary.

Is this pushing it too far, or do you think that would work and be acceptable?

But in any case, I think ioctl() on top of pidfd makes total sense for
this, thanks.
Ian Rogers May 5, 2024, 5:26 a.m. UTC | #5
On Fri, May 3, 2024 at 5:30 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> applications to query VMA information more efficiently than through textual
> processing of /proc/<pid>/maps contents. See patch #2 for the context,
> justification, and nuances of the API design.
>
> Patch #1 is a refactoring to keep VMA name logic determination in one place.
> Patch #2 is the meat of kernel-side API.
> Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> optionally use this new ioctl()-based API, if supported.
> Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> both textual and binary interfaces) and allows benchmarking them. Patch itself
> also has performance numbers of a test based on one of the medium-sized
> internal applications taken from production.
>
> This patch set was based on top of next-20240503 tag in linux-next tree.
> Not sure what should be the target tree for this, I'd appreciate any guidance,
> thank you!
>
> Andrii Nakryiko (5):
>   fs/procfs: extract logic for getting VMA name constituents
>   fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
>   tools: sync uapi/linux/fs.h header into tools subdir
>   selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available
>   selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs

I'd love to see improvements like this for the Linux perf command.
Some thoughts:

 - Could we do something better, scalability-wise, than a file
descriptor per pid? If a profiler is running in a container, the cost
of many file descriptors can be significant, and it is something that
increases as machines get larger. Could we have a /proc/maps for all
processes?

 - Something that is broken in perf currently is that we can race
between reading /proc and opening events on the pids it contains. For
example, perf top supports a uid option that first scans to find all
processes owned by a user, then tries to open an event on each process.
This fails if a process terminates between the scan and the open,
leading to a frequent:
```
$ sudo perf top -u `id -u`
The sys_perf_event_open() syscall returned with 3 (No such process)
for event (cycles:P).
```
It would be nice for the API to consider cgroups, uids and the like as
ways to get a subset of things to scan.

 - Somewhat related: the mmap perf events give data after the mmap
call has happened. As VMAs get merged, this can make mmap perf events
look like the memory overlaps (for JITs using anonymous memory), and
we lack munmap/mremap events.

Jiri Olsa has looked at improvements in this area in the past.

Thanks,
Ian

>  fs/proc/task_mmu.c                            | 290 +++++++++++---
>  include/uapi/linux/fs.h                       |  32 ++
>  .../perf/trace/beauty/include/uapi/linux/fs.h |  32 ++
>  tools/testing/selftests/bpf/.gitignore        |   1 +
>  tools/testing/selftests/bpf/Makefile          |   2 +-
>  tools/testing/selftests/bpf/procfs_query.c    | 366 ++++++++++++++++++
>  tools/testing/selftests/bpf/test_progs.c      |   3 +
>  tools/testing/selftests/bpf/test_progs.h      |   2 +
>  tools/testing/selftests/bpf/trace_helpers.c   | 105 ++++-
>  9 files changed, 763 insertions(+), 70 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/procfs_query.c
>
> --
> 2.43.0
>
>
Andrii Nakryiko May 6, 2024, 6:58 p.m. UTC | #6
On Sat, May 4, 2024 at 10:26 PM Ian Rogers <irogers@google.com> wrote:
>
> On Fri, May 3, 2024 at 5:30 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >
> > Implement binary ioctl()-based interface to /proc/<pid>/maps file to allow
> > applications to query VMA information more efficiently than through textual
> > processing of /proc/<pid>/maps contents. See patch #2 for the context,
> > justification, and nuances of the API design.
> >
> > Patch #1 is a refactoring to keep VMA name logic determination in one place.
> > Patch #2 is the meat of kernel-side API.
> > Patch #3 just syncs UAPI header (linux/fs.h) into tools/include.
> > Patch #4 adjusts BPF selftests logic that currently parses /proc/<pid>/maps to
> > optionally use this new ioctl()-based API, if supported.
> > Patch #5 implements a simple C tool to demonstrate intended efficient use (for
> > both textual and binary interfaces) and allows benchmarking them. Patch itself
> > also has performance numbers of a test based on one of the medium-sized
> > internal applications taken from production.
> >
> > This patch set was based on top of next-20240503 tag in linux-next tree.
> > Not sure what should be the target tree for this, I'd appreciate any guidance,
> > thank you!
> >
> > Andrii Nakryiko (5):
> >   fs/procfs: extract logic for getting VMA name constituents
> >   fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
> >   tools: sync uapi/linux/fs.h header into tools subdir
> >   selftests/bpf: make use of PROCFS_PROCMAP_QUERY ioctl, if available
> >   selftests/bpf: a simple benchmark tool for /proc/<pid>/maps APIs
>
> I'd love to see improvements like this for the Linux perf command.
> Some thoughts:
>
>  - Could we do something scalability wise better than a file
> descriptor per pid? If a profiler is running in a container the cost
> of many file descriptors can be significant, and something that
> increases as machines get larger. Could we have a /proc/maps for all
> processes?

It's probably not a question for me, as it seems like an entirely
different set of APIs. But it also seems a bit convoluted to mix
together information about many address spaces.

As for the cost of FDs, I haven't run into this limitation, and it
seems like the trend in Linux in general is towards "everything is a
file". Just look at pidfd, for example.

Also, having an fd that can be queried has an extra nice property. For
example, opening /proc/self/maps (i.e., the process' own maps file)
doesn't require any extra permissions, and the fd can then be
transferred to another trusted process that does address
resolution/symbolization. In practice, right now it's unavoidable to
grant extra caps/root permissions to the profiling process even if the
only thing it needs is the contents of /proc/<pid>/maps (and the use
case is as benign as symbol resolution). Not having an FD for this API
would make that use case unworkable.

>
>  - Something that is broken in perf currently is that we can race
> between reading /proc and opening events on the pids it contains. For
> example, perf top supports a uid option that first scans to find all
> processes owned by a user then tries to open an event on each process.
> This fails if the process terminates between the scan and the open
> leading to a frequent:
> ```
> $ sudo perf top -u `id -u`
> The sys_perf_event_open() syscall returned with 3 (No such process)
> for event (cycles:P).
> ```
> It would be nice for the API to consider cgroups, uids and the like as
> ways to get a subset of things to scan.

This seems like putting too much into an API, tbh. Mapping cgroups/uids
to their processes feels like its own problem, and if we don't have
efficient APIs to do that, we should add them. But conflating it with
"get VMAs from this process" seems wrong to me.

>
>  - Some what related, the mmap perf events give data after the mmap
> call has happened. As VMAs get merged this can lead to mmap perf
> events looking like the memory overlaps (for jits using anonymous
> memory) and we lack munmap/mremap events.

Is this related to the "VMA generation" that Arnaldo mentioned? I'd
happily add it to the new API, as the API is easily extensible, if the
kernel already maintains it. If not, then whether the kernel *should*
track this information should be a separate discussion.

>
> Jiri Olsa has looked at improvements in this area in the past.
>
> Thanks,
> Ian
>
> >  fs/proc/task_mmu.c                            | 290 +++++++++++---
> >  include/uapi/linux/fs.h                       |  32 ++
> >  .../perf/trace/beauty/include/uapi/linux/fs.h |  32 ++
> >  tools/testing/selftests/bpf/.gitignore        |   1 +
> >  tools/testing/selftests/bpf/Makefile          |   2 +-
> >  tools/testing/selftests/bpf/procfs_query.c    | 366 ++++++++++++++++++
> >  tools/testing/selftests/bpf/test_progs.c      |   3 +
> >  tools/testing/selftests/bpf/test_progs.h      |   2 +
> >  tools/testing/selftests/bpf/trace_helpers.c   | 105 ++++-
> >  9 files changed, 763 insertions(+), 70 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/procfs_query.c
> >
> > --
> > 2.43.0
> >
> >