Message ID: 20230316170149.4106586-1-jolsa@kernel.org (mailing list archive)
Series: mm/bpf/perf: Store build id in file object
On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> hi,
> this patchset adds build id object pointer to struct file object.
>
> We have several use cases for build id to be used in BPF programs [2][3].

Yes, you have use cases, but you never answered the question I asked:

Is this going to be enabled by every distro kernel, or is it for special use-cases where only people doing a very specialised thing who are willing to build their own kernels will use it?

Saying "hubble/tetragon" doesn't answer that question. Maybe it does to you, but I have no idea what that software is.

Put it another way: how does this make *MY* life better? Literally me. How will it affect my life?
On Thu, Mar 16, 2023 at 10:35 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > hi,
> > this patchset adds build id object pointer to struct file object.
> >
> > We have several use cases for build id to be used in BPF programs [2][3].
>
> Yes, you have use cases, but you never answered the question I asked:
>
> Is this going to be enabled by every distro kernel, or is it for special use-cases where only people doing a very specialised thing who are willing to build their own kernels will use it?
>
> Saying "hubble/tetragon" doesn't answer that question. Maybe it does to you, but I have no idea what that software is.
>
> Put it another way: how does this make *MY* life better? Literally me. How will it affect my life?

So at Google we use build IDs for all profiling; I believe Meta is the same, but obviously I can't speak for them. For BPF program stack traces, using build ID + offset stack traces is preferable to perf's whole-system synthesis of mmap events based on data held in /proc/pid/maps. Individual stack traces are larger, but you avoid the ever-growing problem of coming up with some initial virtual memory state that will allow you to identify samples.

This doesn't answer the question about how this will help you, but I expect over time you will see scalability issues and also want to use tools assuming build IDs are present and cheap to access.

Thanks,
Ian
On Thu, Mar 16, 2023 at 10:50 AM Ian Rogers <irogers@google.com> wrote:
>
> On Thu, Mar 16, 2023 at 10:35 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > > hi,
> > > this patchset adds build id object pointer to struct file object.
> > >
> > > We have several use cases for build id to be used in BPF programs [2][3].
> >
> > Yes, you have use cases, but you never answered the question I asked:
> >
> > Is this going to be enabled by every distro kernel, or is it for special use-cases where only people doing a very specialised thing who are willing to build their own kernels will use it?
> >
> > Saying "hubble/tetragon" doesn't answer that question. Maybe it does to you, but I have no idea what that software is.
> >
> > Put it another way: how does this make *MY* life better? Literally me. How will it affect my life?
>
> So at Google we use build IDs for all profiling, I believe Meta is the same but obviously I can't speak for them. For BPF program stack

Yep, Meta is capturing stack traces with build IDs as well, if possible. Build IDs help with profiling short-lived processes which exit before the profiling session is done and before user-space tooling is able to collect /proc/<pid>/maps contents (which is what Ian is referring to here). But build IDs also allow offloading more of the expensive stack symbolization process (converting raw memory addresses into human-readable function+offset+file path+line number information) to dedicated remote servers, by allowing preprocessed DWARF/ELF information to be cached and reused based on build ID.

I believe the perf tool is also using build IDs, so any tool relying on perf capturing full and complete profiling data for system-wide performance analysis would benefit as well.

Generally speaking, there is a whole ecosystem built on top of the assumption that binaries have a build ID, and profiling tooling is able to provide more value if those build IDs are more reliably collected. Which ultimately benefits the entire open-source ecosystem by allowing people to spot issues (not necessarily just performance, it could be correctness issues as well) more reliably, fix them, and benefit every user.

> traces, using build ID + offset stack traces is preferable to perf's whole system synthesis of mmap events based on data held in /proc/pid/maps. Individual stack traces are larger, but you avoid the ever growing problem of coming up with some initial virtual memory state that will allow you to identify samples.
>
> This doesn't answer the question about how this will help you, but I expect over time you will see scalability issues and also want to use tools assuming build IDs are present and cheap to access.
>
> Thanks,
> Ian
On Thu, Mar 16, 2023 at 02:51:52PM -0700, Andrii Nakryiko wrote:
> Yep, Meta is also capturing stack traces with build ID as well, if possible. Build IDs help with profiling short-lived processes which exit before the profiling session is done and user-space tooling is able to collect /proc/<pid>/maps contents (which is what Ian is referring to here). But also build ID allows to offload more of the expensive stack symbolization process (converting raw memory addresses into human readable function+offset+file path+line numbers information) to dedicated remote servers, by allowing to cache and reuse preprocessed DWARF/ELF information based on build ID.
>
> I believe perf tool is also using build ID, so any tool relying on perf capturing full and complete profiling data for system-wide performance analysis would benefit as well.
>
> Generally speaking, there is a whole ecosystem built on top of assumption that binaries have build ID and profiling tooling is able to provide more value if those build IDs are more reliably collected. Which ultimately benefits the entire open-source ecosystem by allowing people to spot issues (not necessarily just performance, it could be correctness issues as well) more reliably, fix them, and benefit every user.

But build IDs are _generally_ available. The only problem (AIUI) is when you're trying to examine the contents of one container from another container. And to solve that problem, you're imposing a cost on everybody else with (so far) pretty vague justifications. I really don't like to see you growing struct file for this (nor struct inode, nor struct vm_area_struct). It's all quite unsatisfactory and I don't have a good suggestion.
On Thu, Mar 16, 2023 at 8:51 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Mar 16, 2023 at 02:51:52PM -0700, Andrii Nakryiko wrote:
> > Yep, Meta is also capturing stack traces with build ID as well, if possible. Build IDs help with profiling short-lived processes which exit before the profiling session is done and user-space tooling is able to collect /proc/<pid>/maps contents (which is what Ian is referring to here). But also build ID allows to offload more of the expensive stack symbolization process (converting raw memory addresses into human readable function+offset+file path+line numbers information) to dedicated remote servers, by allowing to cache and reuse preprocessed DWARF/ELF information based on build ID.
> >
> > I believe perf tool is also using build ID, so any tool relying on perf capturing full and complete profiling data for system-wide performance analysis would benefit as well.
> >
> > Generally speaking, there is a whole ecosystem built on top of assumption that binaries have build ID and profiling tooling is able to provide more value if those build IDs are more reliably collected. Which ultimately benefits the entire open-source ecosystem by allowing people to spot issues (not necessarily just performance, it could be correctness issues as well) more reliably, fix them, and benefit every user.
>
> But build IDs are _generally_ available. The only problem (AIUI) is when you're trying to examine the contents of one container from another container. And to solve that problem, you're imposing a cost on everybody else with (so far) pretty vague justifications. I really don't like to see you growing struct file for this (nor struct inode, nor struct vm_area_struct). It's all quite unsatisfactory and I don't have a good suggestion.

There is a lot of profiling, observability and debugging tooling built using BPF. And when capturing stack traces from BPF programs, if the build ID note is not physically present in memory, fetching it from the BPF program might fail in NMI (and other non-faultable) contexts. This patch set is about making sure we can always fetch the build ID, even from the most restrictive environments. It's guarded by a Kconfig option to avoid adding 8 bytes of overhead to struct file for environments where this might be unacceptable, giving users and distros a choice.
On Fri, Mar 17, 2023 at 09:33:17AM -0700, Andrii Nakryiko wrote:
> > But build IDs are _generally_ available. The only problem (AIUI) is when you're trying to examine the contents of one container from another container. And to solve that problem, you're imposing a cost on everybody else with (so far) pretty vague justifications. I really don't like to see you growing struct file for this (nor struct inode, nor struct vm_area_struct). It's all quite unsatisfactory and I don't have a good suggestion.
>
> There is a lot of profiling, observability and debugging tooling built using BPF. And when capturing stack traces from BPF programs, if the build ID note is not physically present in memory, fetching it from the BPF program might fail in NMI (and other non-faultable contexts). This patch set is about making sure we always can fetch build ID, even from most restrictive environments. It's guarded by Kconfig to avoid adding 8 bytes of overhead to struct file for environment where this might be unacceptable, giving users and distros a choice.

Lovely. As an exercise you might want to collect the stats on the number of struct file instances on the system vs. the number of files that happen to be ELF objects and are currently mmapped anywhere. That does depend upon the load, obviously, but it's not hard to collect - you already have more than enough hooks inserted in the relevant places. That might give a better appreciation of the reactions...
On Fri, Mar 17, 2023 at 09:14:03PM +0000, Al Viro wrote:
> On Fri, Mar 17, 2023 at 09:33:17AM -0700, Andrii Nakryiko wrote:
> > > But build IDs are _generally_ available. The only problem (AIUI) is when you're trying to examine the contents of one container from another container. And to solve that problem, you're imposing a cost on everybody else with (so far) pretty vague justifications. I really don't like to see you growing struct file for this (nor struct inode, nor struct vm_area_struct). It's all quite unsatisfactory and I don't have a good suggestion.
> >
> > There is a lot of profiling, observability and debugging tooling built using BPF. And when capturing stack traces from BPF programs, if the build ID note is not physically present in memory, fetching it from the BPF program might fail in NMI (and other non-faultable contexts). This patch set is about making sure we always can fetch build ID, even from most restrictive environments. It's guarded by Kconfig to avoid adding 8 bytes of overhead to struct file for environment where this might be unacceptable, giving users and distros a choice.
>
> Lovely. As an exercise you might want to collect the stats on the number of struct file instances on the system vs. the number of files that happen to be ELF objects and are currently mmapped anywhere. That does depend upon the load, obviously, but it's not hard to collect - you already have more than enough hooks inserted in the relevant places. That might give a better appreciation of the reactions...

One possibility would be a bit stolen from inode flags + a hash keyed by struct inode address (middle bits make for a decent hash function); inode eviction would check that bit and kick the corresponding thing from the hash if the bit is set.

Associating that thing with an inode => hash lookup/insert + set the bit.
On Fri, Mar 17, 2023 at 2:21 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Mar 17, 2023 at 09:14:03PM +0000, Al Viro wrote:
> > On Fri, Mar 17, 2023 at 09:33:17AM -0700, Andrii Nakryiko wrote:
> > > > But build IDs are _generally_ available. The only problem (AIUI) is when you're trying to examine the contents of one container from another container. And to solve that problem, you're imposing a cost on everybody else with (so far) pretty vague justifications. I really don't like to see you growing struct file for this (nor struct inode, nor struct vm_area_struct). It's all quite unsatisfactory and I don't have a good suggestion.
> > >
> > > There is a lot of profiling, observability and debugging tooling built using BPF. And when capturing stack traces from BPF programs, if the build ID note is not physically present in memory, fetching it from the BPF program might fail in NMI (and other non-faultable contexts). This patch set is about making sure we always can fetch build ID, even from most restrictive environments. It's guarded by Kconfig to avoid adding 8 bytes of overhead to struct file for environment where this might be unacceptable, giving users and distros a choice.
> >
> > Lovely. As an exercise you might want to collect the stats on the number of struct file instances on the system vs. the number of files that happen to be ELF objects and are currently mmapped anywhere.

That's a good suggestion. I wrote a simple script that uses the drgn tool ([0]); it enables nice introspection of the state of kernel memory for the running kernel. The script is at the bottom ([1]) for anyone to sanity check. I didn't try to figure out which files are mmaped as executable and which aren't, so let's do the worst case and assume that none of the files is executable, and thus that the 8-byte pointer is a waste for all of them.

On my devserver I got:

task_cnt=15984 uniq_file_cnt=56780

On a randomly chosen production host I got:

task_cnt=6387 uniq_file_cnt=22514

So it seems like my devserver is "busier" than the production host. :)

The above numbers suggest that my devserver's kernel has about 57000 *unique* `struct file *` instances. That's 450KB of overhead. That's not much by any modern standard. But let's say I'm way off, and we have 1 million struct files. That's 8MB of overhead. I'd argue that those 8MB are not a big deal even on a normal laptop, even less so on production servers. Especially if you have 1 million active struct file instances created in the system, as way more memory will be used for application-specific needs.

> > That does depend upon the load, obviously, but it's not hard to collect - you already have more than enough hooks inserted in the relevant places. That might give a better appreciation of the reactions...
>
> One possibility would be a bit stolen from inode flags + hash keyed by struct inode address (middle bits make for a decent hash function); inode eviction would check that bit and kick the corresponding thing from hash if the bit is set.
>
> Associating that thing with inode => hash lookup/insert + set the bit.

This is an interesting idea, but now we are running into a few unnecessary problems. We need to have a global dynamically sized hash map in the system. If we fix the number of buckets, we risk either wasting memory on an underutilized system (if we oversize) or performance problems due to collisions (if we undersize) on a busy system with lots of executables mapped in memory. If we don't pre-size, then we are talking about reallocations, rehashing, and doing that under a global lock or something like that. Further, we'd have to take locks on buckets, which causes further problems for looking up the build ID from this hashmap in NMI context for perf events and BPF programs, as locks can't be safely taken under those conditions, and thus fetching the build ID would still be unreliable (though less so than it is today, of course).

All of this is solvable to some degree (but not perfectly and not with simple and elegant approaches), but it seems like an unnecessary overcomplication compared to the amount of memory that we hope to save. It still feels like a Kconfig-guarded 8-byte field per struct file is a reasonable price for gaining reliable build ID information for profiling/tracing tools.

[0] https://drgn.readthedocs.io/en/latest/index.html

[1] Script I used:

from drgn.helpers.linux.pid import for_each_task
from drgn.helpers.linux.fs import for_each_file

task_cnt = 0
file_set = set()

for task in for_each_task(prog):
    task_cnt += 1
    try:
        for (fd, file) in for_each_file(task):
            file_set.add(file.value_())
    except:
        pass

uniq_file_cnt = len(file_set)
print(f"task_cnt={task_cnt} uniq_file_cnt={uniq_file_cnt}")
On Thu, Mar 16, 2023 at 05:34:41PM +0000, Matthew Wilcox wrote:
> On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > hi,
> > this patchset adds build id object pointer to struct file object.
> >
> > We have several use cases for build id to be used in BPF programs [2][3].
>
> Yes, you have use cases, but you never answered the question I asked:
>
> Is this going to be enabled by every distro kernel, or is it for special use-cases where only people doing a very specialised thing who are willing to build their own kernels will use it?

I hope so, but I guess only time will tell.. given the responses by Ian and Andrii there are 3 big users already

> Saying "hubble/tetragon" doesn't answer that question. Maybe it does to you, but I have no idea what that software is.
>
> Put it another way: how does this make *MY* life better? Literally me. How will it affect my life?

sorry, I'm not sure how to answer this.. there are possible users of the feature and the price seems reasonable to me, but of course I understand file/inode objects are tricky to mess with

jirka
On Fri, Mar 17, 2023 at 11:08:44PM -0700, Andrii Nakryiko wrote:

SNIP

> > > That does depend upon the load, obviously, but it's not hard to collect - you already have more than enough hooks inserted in the relevant places. That might give a better appreciation of the reactions...
> >
> > One possibility would be a bit stolen from inode flags + hash keyed by struct inode address (middle bits make for a decent hash function); inode eviction would check that bit and kick the corresponding thing from hash if the bit is set.
> >
> > Associating that thing with inode => hash lookup/insert + set the bit.
>
> This is an interesting idea, but now we are running into a few unnecessary problems. We need to have a global dynamically sized hash map in the system. If we fix the number of buckets, we risk either wasting memory on an underutilized system (if we oversize), or performance problems due to collisions (if we undersize) if we have a busy system with lots of executables mapped in memory. If we don't pre-size, then we are talking about reallocations, rehashing, and doing that under global lock or something like that. Further, we'd have to take locks on buckets, which causes further problems for looking up build ID from this hashmap in NMI context for perf events and BPF programs, as locks can't be safely taken under those conditions, and thus fetching build ID would still be unreliable (though less so than it is today, of course).
>
> All of this is solvable to some degree (but not perfectly and not with simple and elegant approaches), but seems like an unnecessary overcomplication compared to the amount of memory that we hope to save. It still feels like a Kconfig-guarded 8 byte field per struct file is a reasonable price for gaining reliable build ID information for profiling/tracing tools.
>
> [0] https://drgn.readthedocs.io/en/latest/index.html
>
> [1] Script I used:
>
> from drgn.helpers.linux.pid import for_each_task
> from drgn.helpers.linux.fs import for_each_file
>
> task_cnt = 0
> file_set = set()
>
> for task in for_each_task(prog):
>     task_cnt += 1
>     try:
>         for (fd, file) in for_each_file(task):
>             file_set.add(file.value_())
>     except:
>         pass
>
> uniq_file_cnt = len(file_set)
> print(f"task_cnt={task_cnt} uniq_file_cnt={uniq_file_cnt}")

great you beat me to this, I wouldn't have thought of using drgn for this ;-) I'll see if I can install it on some of our test servers

thanks,
jirka
On Sat, Mar 18, 2023 at 09:33:49AM +0100, Jiri Olsa wrote:
> On Thu, Mar 16, 2023 at 05:34:41PM +0000, Matthew Wilcox wrote:
> > On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > > hi,
> > > this patchset adds build id object pointer to struct file object.
> > >
> > > We have several use cases for build id to be used in BPF programs [2][3].
> >
> > Yes, you have use cases, but you never answered the question I asked:
> >
> > Is this going to be enabled by every distro kernel, or is it for special use-cases where only people doing a very specialised thing who are willing to build their own kernels will use it?
>
> I hope so, but I guess only time tell.. given the response by Ian and Andrii there are 3 big users already

So the whole "There's a config option to turn it off" shtick is just a fig-leaf. I won't ever see it turned off. You're imposing the cost of this on EVERYONE who runs a distro kernel. And almost nobody will see any benefits from it. Thanks for admitting that.
On Sat, Mar 18, 2023 at 03:16:45PM +0000, Matthew Wilcox wrote:
> On Sat, Mar 18, 2023 at 09:33:49AM +0100, Jiri Olsa wrote:
> > On Thu, Mar 16, 2023 at 05:34:41PM +0000, Matthew Wilcox wrote:
> > > On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > > > hi,
> > > > this patchset adds build id object pointer to struct file object.
> > > >
> > > > We have several use cases for build id to be used in BPF programs [2][3].
> > >
> > > Yes, you have use cases, but you never answered the question I asked:
> > >
> > > Is this going to be enabled by every distro kernel, or is it for special use-cases where only people doing a very specialised thing who are willing to build their own kernels will use it?
> >
> > I hope so, but I guess only time tell.. given the response by Ian and Andrii there are 3 big users already
>
> So the whole "There's a config option to turn it off" shtick is just a fig-leaf. I won't ever see it turned off. You're imposing the cost of this on EVERYONE who runs a distro kernel. And almost nobody will see any benefits from it. Thanks for admitting that.

sure, I understand that's a legit way of looking at this

I can imagine distros would have that enabled for the debugging version of the kernel (like in fedora), and if that proves to be useful the standard kernel might take it, but yes, there's a price (for discussion, as pointed out by Andrii) and it'd be for the distro maintainers to decide

jirka
Em Sat, Mar 18, 2023 at 03:16:45PM +0000, Matthew Wilcox escreveu:
> On Sat, Mar 18, 2023 at 09:33:49AM +0100, Jiri Olsa wrote:
> > On Thu, Mar 16, 2023 at 05:34:41PM +0000, Matthew Wilcox wrote:
> > > On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > > > hi,
> > > > this patchset adds build id object pointer to struct file object.
> > > >
> > > > We have several use cases for build id to be used in BPF programs [2][3].
> > >
> > > Yes, you have use cases, but you never answered the question I asked:
> > >
> > > Is this going to be enabled by every distro kernel, or is it for special use-cases where only people doing a very specialised thing who are willing to build their own kernels will use it?
> >
> > I hope so, but I guess only time tell.. given the response by Ian and Andrii there are 3 big users already
>
> So the whole "There's a config option to turn it off" shtick is just a fig-leaf. I won't ever see it turned off. You're imposing the cost of this on EVERYONE who runs a distro kernel. And almost nobody will see any benefits from it. Thanks for admitting that.

I agree that build-ids are not useful for all 'struct file' uses, just for executable files and for people wanting better observability capabilities.

Having said that, it seems there will be no extra memory overhead, at least for a fedora:36 x86_64 kernel:

void __init files_init(void)
{
	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
}

[root@quaco ~]# pahole file | grep size: -A2
	/* size: 232, cachelines: 4, members: 20 */
	/* sum members: 228, holes: 1, sum holes: 4 */
	/* last cacheline: 40 bytes */
[acme@quaco perf-tools]$ uname -a
Linux quaco 6.1.11-100.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 9 20:36:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
[root@quaco ~]# head -2 /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
[root@quaco ~]# grep -w filp /proc/slabinfo
filp               12452  13056    256   32    2 : tunables    0    0    0 : slabdata    408    408      0
[root@quaco ~]#

so there are 24 bytes on the 4th cacheline that are not being used, right?

One other observation is that maybe we could do it as in the 'struct sock' hierarchy in networking, where we would have a 'struct exec_file' that would be:

struct exec_file {
	struct file file;
	char	    build_id[20];
};

say, and then when we create the 'struct file' in __alloc_file() we could check some bit in 'flags', like Al Viro suggested, and pick a different slab than 'filp_cachep', one that has that extra space for the build_id (and whatever else exec-related state we may end up wanting, if ever).

No core fs code would need to know about that except when we go to free it, to free from the right slab cache.

With current distro configs, no overhead would take place if I read that SLAB_HWCACHE_ALIGN thing right, no?

- Arnaldo
On Wed, Mar 22, 2023 at 8:45 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
>
> Em Sat, Mar 18, 2023 at 03:16:45PM +0000, Matthew Wilcox escreveu:
> > On Sat, Mar 18, 2023 at 09:33:49AM +0100, Jiri Olsa wrote:
> > > On Thu, Mar 16, 2023 at 05:34:41PM +0000, Matthew Wilcox wrote:
> > > > On Thu, Mar 16, 2023 at 06:01:40PM +0100, Jiri Olsa wrote:
> > > > > hi,
> > > > > this patchset adds build id object pointer to struct file object.
> > > > >
> > > > > We have several use cases for build id to be used in BPF programs [2][3].
> > > >
> > > > Yes, you have use cases, but you never answered the question I asked:
> > > >
> > > > Is this going to be enabled by every distro kernel, or is it for special use-cases where only people doing a very specialised thing who are willing to build their own kernels will use it?
> > >
> > > I hope so, but I guess only time tell.. given the response by Ian and Andrii there are 3 big users already
> >
> > So the whole "There's a config option to turn it off" shtick is just a fig-leaf. I won't ever see it turned off. You're imposing the cost of this on EVERYONE who runs a distro kernel. And almost nobody will see any benefits from it. Thanks for admitting that.
>
> I agree that build-ids are not useful for all 'struct file' uses, just for executable files and for people wanting to have better observability capabilities.
>
> Having said that, it seems there will be no extra memory overhead, at least for a fedora:36 x86_64 kernel:
>
> void __init files_init(void)
> {
> 	filp_cachep = kmem_cache_create("filp", sizeof(struct file), 0,
> 			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT, NULL);
> 	percpu_counter_init(&nr_files, 0, GFP_KERNEL);
> }
>
> [root@quaco ~]# pahole file | grep size: -A2
> 	/* size: 232, cachelines: 4, members: 20 */
> 	/* sum members: 228, holes: 1, sum holes: 4 */
> 	/* last cacheline: 40 bytes */
> [acme@quaco perf-tools]$ uname -a
> Linux quaco 6.1.11-100.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 9 20:36:30 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
> [root@quaco ~]# head -2 /proc/slabinfo
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> [root@quaco ~]# grep -w filp /proc/slabinfo
> filp               12452  13056    256   32    2 : tunables    0    0    0 : slabdata    408    408      0
> [root@quaco ~]#
>
> so there are 24 bytes on the 4th cacheline that are not being used, right?

Well, even better then!

> One other observation is that maybe we could do it as in the 'struct sock' hierarchy in networking, where we would have a 'struct exec_file' that would be:
>
> struct exec_file {
> 	struct file file;
> 	char	    build_id[20];
> };
>
> say, and then when we create the 'struct file' in __alloc_file() we could check some bit in 'flags', like Al Viro suggested, and pick a different slab than 'filp_cachep', one that has that extra space for the build_id (and whatever else exec-related state we may end up wanting, if ever).
>
> No core fs code would need to know about that except when we go to free it, to free from the right slab cache.
>
> With current distro configs, no overhead would take place if I read that SLAB_HWCACHE_ALIGN thing right, no?

Makes sense to me as well.

Whatever the solution, as long as it's usable from NMI contexts it would be fine for the purposes of fetching build IDs. It would be good to hear from the folks that are opposing adding a pointer field to struct file whether they prefer this way instead?

> - Arnaldo
On Fri, Mar 31, 2023 at 11:19:45AM -0700, Andrii Nakryiko wrote:
> On Wed, Mar 22, 2023 at 8:45 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > Having said that, it seems there will be no extra memory overhead at least for a fedora:36 x86_64 kernel:
>
> Makes sense to me as well. Whatever the solution, as long as it's usable from NMI contexts would be fine for the purposes of fetching build ID. It would be good to hear from folks that are opposing adding a pointer field to struct file whether they prefer this way instead?

Still no. While it may not take up any room right now, this will surely not be the last thing added to struct file. When something which is genuinely useful needs to be added, that person should not have to sort out your mess first.

NAK now, NAK tomorrow, NAK forever. Al told you how you could do it without trampling on core data structures.
On Fri, Mar 31, 2023 at 11:36 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Mar 31, 2023 at 11:19:45AM -0700, Andrii Nakryiko wrote:
> > On Wed, Mar 22, 2023 at 8:45 AM Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > > Having said that, it seems there will be no extra memory overhead at least for a fedora:36 x86_64 kernel:
> >
> > Makes sense to me as well. Whatever the solution, as long as it's usable from NMI contexts would be fine for the purposes of fetching build ID. It would be good to hear from folks that are opposing adding a pointer field to struct file whether they prefer this way instead?
>
> Still no. While it may not take up any room right now, this will surely not be the last thing added to struct file. When something which is genuinely useful needs to be added, that person should not have to sort out your mess first,

So I assume you are talking about adding a pointer field to the struct file, right? What about the alternative proposed by Arnaldo to have a struct exec_file that extends a struct file?

> NAK now, NAK tomorrow, NAK forever. Al told you how you could do it without trampling on core data structures.

As I replied to Al, any solution that has a lookup table on the side isn't compatible with usage from NMI context due to locking. And lots of tracing use cases are based on perf counters, which are handled in NMI context. And that's besides all the complexities with right-sizing hash maps (if hashmaps are to be used).