Per-CPU variables in modules and pahole

Message ID CAEf4BzZWabv_hExaANQyQ71L2JHYqXaT4hFj52w-poWoVYWKqQ@mail.gmail.com (mailing list archive)
State RFC
Delegated to: BPF

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Andrii Nakryiko Dec. 9, 2020, 8:53 p.m. UTC
Hi,

I'm working on supporting per-CPU symbols in BPF/libbpf, and the
prerequisite for that is BTF data for .data..percpu data section and
variables inside that.

Turns out, pahole doesn't currently emit any BTF information for such
variables in kernel modules. And the reason why is quite confusing and
I can't figure it out myself, so was hoping someone else might be able
to help.

To repro, you can take latest bpf-next tree and add this to
bpf_testmod/bpf_testmod.c inside selftests/bpf:

$ git diff bpf_testmod/bpf_testmod.c
diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
index 2df19d73ca49..b2086b798019 100644

1. So the very first issue (that I'm going to ignore for now) is that
if I just added bpf_testmod_ksym_percpu, it would get addr == 0 and
would be ignored by the current pahole logic. So we need to fix that
for modules. Adding dummy1 and dummy2 takes care of this for now,
bpf_testmod_ksym_percpu has offset 4.

2. Second issue is more interesting. Somehow, when pahole iterates
over DWARF variables, the address of bpf_testmod_ksym_percpu is
reported as 0x10e74, not 4. Which totally confuses pahole because
according to ELF symbols, bpf_testmod_ksym_percpu symbol has value 4.
I tracked this down to dwarf_getlocation() returning 10e74 as number
field in expr.

But this seems wrong, because when looking at DWARF:

$ readelf -wi bpf_testmod.ko | grep bpf_testmod_ksym_percpu -B1 -A6
 <1><fbc5>: Abbrev Number: 97 (DW_TAG_variable)
    <fbc6>   DW_AT_name        : (indirect string, offset: 0x4afb): bpf_testmod_ksym_percpu
    <fbca>   DW_AT_decl_file   : 5
    <fbcb>   DW_AT_decl_line   : 15
    <fbcc>   DW_AT_decl_column : 1
    <fbcd>   DW_AT_type        : <0xce>
    <fbd1>   DW_AT_external    : 1
    <fbd1>   DW_AT_location    : 9 byte block: 3 4 0 0 0 0 0 0 0 (DW_OP_addr: 4)

You can see that addr is actually 4.
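
For reference, the 9-byte block can be decoded by hand. The snippet
below is only an illustrative sketch, not pahole or libdw code:
decode_op_addr() is a made-up helper, and it assumes a 64-bit
little-endian target where the block is the DW_OP_addr opcode (0x03)
followed by an 8-byte address.

```c
#include <stddef.h>
#include <stdint.h>

#define DW_OP_ADDR 0x03	/* opcode value of DW_OP_addr in the DWARF spec */

/*
 * Decode a location block of the form "DW_OP_addr <8-byte LE address>".
 * Returns 0 and stores the address on success, -1 if the block has any
 * other shape. Hypothetical helper, for illustration only.
 */
static int decode_op_addr(const uint8_t *block, size_t len, uint64_t *addr)
{
	uint64_t v = 0;
	size_t i;

	if (len != 9 || block[0] != DW_OP_ADDR)
		return -1;

	/* Assemble the little-endian operand byte by byte. */
	for (i = 0; i < 8; i++)
		v |= (uint64_t)block[1 + i] << (8 * i);

	*addr = v;
	return 0;
}
```

Feeding it the bytes from the readelf output (03 04 00 00 00 00 00 00
00) gives address 4, matching the (DW_OP_addr: 4) annotation.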

And ELF symbols agree:

$ readelf -a bpf_testmod.ko | grep bpf_testmod_ksym_percpu
   102: 0000000000000004     4 OBJECT  GLOBAL DEFAULT   33 bpf_testmod_ksym_percpu


I also can't seem to match 0x10e74 to anything in bpf_testmod.ko, no
section or anything like that.

So, help! Is this just a libdw bug? If yes, why don't we see it
anywhere else? If not, what am I missing and how can we make pahole
emit BTF data for variables in modules?

Thanks!


-- Andrii

Comments

Jiri Olsa Dec. 10, 2020, 4:43 p.m. UTC | #1
On Wed, Dec 09, 2020 at 12:53:44PM -0800, Andrii Nakryiko wrote:
> Hi,
> 
> I'm working on supporting per-CPU symbols in BPF/libbpf, and the
> prerequisite for that is BTF data for .data..percpu data section and
> variables inside that.
> 
> Turns out, pahole doesn't currently emit any BTF information for such
> variables in kernel modules. And the reason why is quite confusing and
> I can't figure it out myself, so was hoping someone else might be able
> to help.
> 
> To repro, you can take latest bpf-next tree and add this to
> bpf_testmod/bpf_testmod.c inside selftests/bpf:
> 
> $ git diff bpf_testmod/bpf_testmod.c
>       diff --git
> a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> index 2df19d73ca49..b2086b798019 100644
> --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> @@ -3,6 +3,7 @@
>  #include <linux/error-injection.h>
>  #include <linux/init.h>
>  #include <linux/module.h>
> +#include <linux/percpu-defs.h>
>  #include <linux/sysfs.h>
>  #include <linux/tracepoint.h>
>  #include "bpf_testmod.h"
> @@ -10,6 +11,10 @@
>  #define CREATE_TRACE_POINTS
>  #include "bpf_testmod-events.h"
> 
> +DEFINE_PER_CPU(int, bpf_testmod_ksym_dummy1) = -1;
> +DEFINE_PER_CPU(int, bpf_testmod_ksym_percpu) = 123;
> +DEFINE_PER_CPU(int, bpf_testmod_ksym_dummy2) = -1;
> +
>  noinline ssize_t
>  bpf_testmod_test_read(struct file *file, struct kobject *kobj,
>                       struct bin_attribute *bin_attr,
> 
> 1. So the very first issue (that I'm going to ignore for now) is that
> if I just added bpf_testmod_ksym_percpu, it would get addr == 0 and
> would be ignored by the current pahole logic. So we need to fix that
> for modules. Adding dummy1 and dummy2 takes care of this for now,
> bpf_testmod_ksym_percpu has offset 4.

I removed that addr zero check in the module changes, but only when
collecting functions; it's still there in collect_percpu_var

> 
> 2. Second issue is more interesting. Somehow, when pahole iterates
> over DWARF variables, the address of bpf_testmod_ksym_percpu is
> reported as 0x10e74, not 4. Which totally confuses pahole because
> according to ELF symbols, bpf_testmod_ksym_percpu symbol has value 4.
> I tracked this down to dwarf_getlocation() returning 10e74 as number
> field in expr.

in which place do you see that address? when I display the
address from collect_percpu_var it shows 4

not sure this is related but looks like similar issue I had to
solve for modules functions, as described in the changelog:
(not merged yet)

    btf_encoder: Detect kernel module ftrace addresses

    ...
    There's one tricky point with kernel modules wrt Elf object,
    which we get from dwfl_module_getelf function. This function
    performs all possible relocations, including __mcount_loc
    section.

    So addrs array contains relocated values, which we need take
    into account when we compare them to functions values which
    are relative to their sections.
    ...

The 0x10e74 value could be relocated 4.. but it's me guessing,
because not sure where you see that address exactly

jirka
Andrii Nakryiko Dec. 10, 2020, 5:02 p.m. UTC | #2
On Thu, Dec 10, 2020 at 8:43 AM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Wed, Dec 09, 2020 at 12:53:44PM -0800, Andrii Nakryiko wrote:
> >
> > 1. So the very first issue (that I'm going to ignore for now) is that
> > if I just added bpf_testmod_ksym_percpu, it would get addr == 0 and
> > would be ignored by the current pahole logic. So we need to fix that
> > for modules. Adding dummy1 and dummy2 takes care of this for now,
> > bpf_testmod_ksym_percpu has offset 4.
>
> I removed that addr zero check in the module changes, but only when
> collecting functions; it's still there in collect_percpu_var

Hao had some reason to skip per-cpu variables with offset 0, maybe he
can comment on that before we change it.

>
> >
> > 2. Second issue is more interesting. Somehow, when pahole iterates
> > over DWARF variables, the address of bpf_testmod_ksym_percpu is
> > reported as 0x10e74, not 4. Which totally confuses pahole because
> > according to ELF symbols, bpf_testmod_ksym_percpu symbol has value 4.
> > I tracked this down to dwarf_getlocation() returning 10e74 as number
> > field in expr.
>
> in which place do you see that address? when I display the
> address from collect_percpu_var it shows 4

yes, the ELF symbol's value is 4, but when iterating DWARF variables
(0x10e70 + 4) is returned. It does look like special handling of
modules. I missed that libdw does some special things specifically for
modules. Further debugging yesterday showed that 0x10e70 roughly
corresponds to the offset of .data..percpu if you count all the
allocatable data sections that come before it. So I think you are
right. We should probably centralize the logic of kernel module
detection so that we can handle these module vs non-module differences
properly.
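
To make the "count all the allocatable sections that come before it"
observation concrete, here is a toy model. It is an assumption about
the general idea, not libdw's actual algorithm: struct sec, align_up()
and assigned_base() are hypothetical names, and allocatable sections
are simply packed one after another from address 0, respecting
alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down section description. */
struct sec {
	const char *name;
	uint64_t size;
	uint64_t align;
	int alloc;	/* non-zero if the section is allocatable */
};

static uint64_t align_up(uint64_t v, uint64_t a)
{
	return a > 1 ? (v + a - 1) / a * a : v;
}

/*
 * Pack allocatable sections back to back starting at address 0 and
 * return the address assigned to section 'target'. Returns UINT64_MAX
 * if 'target' is out of range or not allocatable.
 */
static uint64_t assigned_base(const struct sec *secs, size_t n, size_t target)
{
	uint64_t addr = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!secs[i].alloc)
			continue;
		addr = align_up(addr, secs[i].align);
		if (i == target)
			return addr;
		addr += secs[i].size;
	}
	return UINT64_MAX;
}
```

Under such a layout the DWARF-side address of a per-CPU variable would
be assigned_base() of .data..percpu plus the variable's offset inside
the section, which is the shape of the 0x10e70 + 4 value.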

>
> not sure this is related but looks like similar issue I had to
> solve for modules functions, as described in the changelog:
> (not merged yet)
>
>     btf_encoder: Detect kernel module ftrace addresses
>
>     ...
>     There's one tricky point with kernel modules wrt Elf object,
>     which we get from dwfl_module_getelf function. This function
>     performs all possible relocations, including __mcount_loc
>     section.
>
>     So addrs array contains relocated values, which we need take
>     into account when we compare them to functions values which
>     are relative to their sections.
>     ...
>
> The 0x10e74 value could be relocated 4.. but it's me guessing,
> because not sure where you see that address exactly


It comes up in cu__encode_btf(), var->ip.addr is not 4, as we expect it to be.

>
> jirka
>
Hao Luo Dec. 10, 2020, 6:28 p.m. UTC | #3
On Thu, Dec 10, 2020 at 9:02 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Dec 10, 2020 at 8:43 AM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > On Wed, Dec 09, 2020 at 12:53:44PM -0800, Andrii Nakryiko wrote:
> > >
> > > 1. So the very first issue (that I'm going to ignore for now) is that
> > > if I just added bpf_testmod_ksym_percpu, it would get addr == 0 and
> > > would be ignored by the current pahole logic. So we need to fix that
> > > for modules. Adding dummy1 and dummy2 takes care of this for now,
> > > bpf_testmod_ksym_percpu has offset 4.
> >
> > I removed that addr zero check in the module changes, but only when
> > collecting functions; it's still there in collect_percpu_var
>
> Hao had some reason to skip per-cpu variables with offset 0, maybe he
> can comment on that before we change it.
>

When I initially wrote that check, I saw that there are multiple
symbols of the same name associated with a single variable, but only
one of them has a non-zero address. Besides, there are symbols that
aren't associated with any variable and have a zero address, for
example those defined via __ADDRESSABLE(sym) and __UNIQUE_ID(prefix).
There are quite a lot of them, as I remember. So I filtered out zero
addresses to speed up encoding. I noticed that on x86_64 the first
page of the percpu section is reserved, so I assumed that symbols of
normal interest would have positive addresses.
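
A simplified sketch of what that filter boils down to is below. It is
not the actual collect_percpu_var() code: struct sym_info and
percpu_sym_wanted() are made-up names, and the real function does more
than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an ELF symbol. */
struct sym_info {
	const char *name;
	uint64_t value;		/* st_value */
	unsigned int shndx;	/* st_shndx */
};

/* Decide whether a symbol should be recorded as a per-CPU variable. */
static bool percpu_sym_wanted(const struct sym_info *sym,
			      unsigned int percpu_shndx)
{
	/* Only symbols that live in the per-CPU data section qualify. */
	if (sym->shndx != percpu_shndx)
		return false;

	/* The check under discussion: zero-address symbols are skipped. */
	if (sym->value == 0)
		return false;

	return true;
}
```

This also shows why the check hurts modules: the first per-CPU variable
of a module sits at offset 0 of its section and gets dropped.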

Jiri Olsa Dec. 10, 2020, 11:42 p.m. UTC | #4
On Thu, Dec 10, 2020 at 09:02:05AM -0800, Andrii Nakryiko wrote:

SNIP

> 
> yes, the ELF symbol's value is 4, but when iterating DWARF variables
> (0x10e70 + 4) is returned. It does look like special handling of
> modules. I missed that libdw does some special things specifically for
> modules. Further debugging yesterday showed that 0x10e70 roughly
> corresponds to the offset of .data..percpu if you count all the
> allocatable data sections that come before it. So I think you are
> right. We should probably centralize the logic of kernel module
> detection so that we can handle these module vs non-module differences
> properly.
> 
> >
> > not sure this is related but looks like similar issue I had to
> > solve for modules functions, as described in the changelog:
> > (not merged yet)
> >
> >     btf_encoder: Detect kernel module ftrace addresses
> >
> >     ...
> >     There's one tricky point with kernel modules wrt Elf object,
> >     which we get from dwfl_module_getelf function. This function
> >     performs all possible relocations, including __mcount_loc
> >     section.
> >
> >     So addrs array contains relocated values, which we need take
> >     into account when we compare them to functions values which
> >     are relative to their sections.
> >     ...
> >
> > The 0x10e74 value could be relocated 4.. but it's me guessing,
> > because not sure where you see that address exactly
> 
> 
> It comes up in cu__encode_btf(), var->ip.addr is not 4, as we expect it to be.

I'm taking the section sh_addr for each function and relocating
the addr value for kernel modules, see the setup_functions
function

I don't see this being centralized somehow, it looks simple
enough to me for each case
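
roughly like this (a simplified sketch, not the actual
setup_functions() code; func_cmp_addr() and its parameters are names
made up for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Bring a function symbol's value into the same address space as the
 * already-relocated addresses it is compared against. In a kernel
 * module the symbol value is relative to its section, so the section's
 * sh_addr is added; for vmlinux the value is used as is.
 */
static uint64_t func_cmp_addr(uint64_t st_value, uint64_t sh_addr,
			      bool is_kernel_module)
{
	return is_kernel_module ? st_value + sh_addr : st_value;
}
```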

jirka
Andrii Nakryiko Dec. 10, 2020, 11:49 p.m. UTC | #5
On Thu, Dec 10, 2020 at 3:42 PM Jiri Olsa <jolsa@redhat.com> wrote:
>
> On Thu, Dec 10, 2020 at 09:02:05AM -0800, Andrii Nakryiko wrote:
>
> SNIP
>
> I'm taking the section sh_addr for each function and relocating
> the addr value for kernel modules, see the setup_functions
> function
>
> I don't see this being centralized somehow, it looks simple
> enough to me for each case

I meant centralized detection of whether we are working with the
module or vmlinux or something else. setup_functions() currently has
very specific heuristic for that. So I'd like to extract that or come
up with some other way that won't be so function specific
(__start_mcount_loc symbol vs __mcount_loc section).

>
> jirka
>
Andrii Nakryiko Dec. 11, 2020, 2:56 a.m. UTC | #6
On Thu, Dec 10, 2020 at 10:29 AM Hao Luo <haoluo@google.com> wrote:
>
> On Thu, Dec 10, 2020 at 9:02 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> >
> > On Thu, Dec 10, 2020 at 8:43 AM Jiri Olsa <jolsa@redhat.com> wrote:
> > >
> > > On Wed, Dec 09, 2020 at 12:53:44PM -0800, Andrii Nakryiko wrote:
> > > >
> > > > 1. So the very first issue (that I'm going to ignore for now) is that
> > > > if I just added bpf_testmod_ksym_percpu, it would get addr == 0 and
> > > > would be ignored by the current pahole logic. So we need to fix that
> > > > for modules. Adding dummy1 and dummy2 takes care of this for now,
> > > > bpf_testmod_ksym_percpu has offset 4.
> > >
> > > I removed that addr zero check in the module changes, but only when
> > > collecting functions; it's still there in collect_percpu_var
> >
> > Hao had some reason to skip per-cpu variables with offset 0, maybe he
> > can comment on that before we change it.
> >
>
> When I initially wrote that check, I saw that there are multiple
> symbols of the same name associated with a single variable, but only
> one of them has a non-zero address. Besides, there are symbols that
> aren't associated with any variable and have a zero address, for
> example those defined via __ADDRESSABLE(sym) and __UNIQUE_ID(prefix).
> There are quite a lot of them, as I remember. So I filtered out zero
> addresses to speed up encoding. I noticed that on x86_64 the first
> page of the percpu section is reserved, so I assumed that symbols of
> normal interest would have positive addresses.

So I just checked my local vmlinux image, and seems like the only one
with addr == 0 is fixed_percpu_data. Everything else that's detected
as belonging to .data..percpu section looks sane and has non-zero
offset.

So I think this might have been the case before we switched to using
ELF symbols and now it's not? I think I'll just drop this check, will
post the patch, and would really appreciate if you can test it in your
environment. Does that sound ok?

Andrii Nakryiko Dec. 11, 2020, 2:57 a.m. UTC | #7
On Thu, Dec 10, 2020 at 3:49 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Dec 10, 2020 at 3:42 PM Jiri Olsa <jolsa@redhat.com> wrote:
> >
> > On Thu, Dec 10, 2020 at 09:02:05AM -0800, Andrii Nakryiko wrote:
> >
> > SNIP
> >
> > I'm taking the section sh_addr for each function and relocating
> > the addr value for kernel modules, see the setup_functions
> > function
> >
> > I don't see this being centralized somehow, it looks simple
> > enough to me for each case
>
> I meant centralized detection of whether we are working with the
> module or vmlinux or something else. setup_functions() currently has
> very specific heuristic for that. So I'd like to extract that or come
> up with some other way that won't be so function specific
> (__start_mcount_loc symbol vs __mcount_loc section).
>

This seems to be unnecessary, actually. We already record
btfe->percpu_base_addr, which is always zero for vmlinux and
non-zero for modules. So just subtracting this base addr before
looking up the ELF symbol solves the problem for me and still works
for vmlinux. So I'm going with that for now.
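
In sketch form (a hypothetical helper, not the actual patch; only
percpu_base_addr corresponds to the field mentioned here, and the
bounds check against the section size is an extra assumption added for
the example):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Turn an address reported for a DWARF variable into an offset within
 * the per-CPU section, so that it can be matched against the ELF
 * symbol value. Returns false if the address is outside the section.
 */
static bool percpu_offset_from_dwarf(uint64_t dwarf_addr,
				     uint64_t percpu_base_addr,
				     uint64_t percpu_sec_sz,
				     uint64_t *offset)
{
	if (dwarf_addr < percpu_base_addr)
		return false;
	if (dwarf_addr - percpu_base_addr >= percpu_sec_sz)
		return false;

	*offset = dwarf_addr - percpu_base_addr;
	return true;
}
```

With a module base of 0x10e70 the DWARF address 0x10e74 maps back to
offset 4, and with a vmlinux base of 0 the address is left unchanged.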

> >
> > jirka
> >
Andrii Nakryiko Dec. 11, 2020, 3:29 a.m. UTC | #8
On Thu, Dec 10, 2020 at 6:56 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Dec 10, 2020 at 10:29 AM Hao Luo <haoluo@google.com> wrote:
> >
> > On Thu, Dec 10, 2020 at 9:02 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Thu, Dec 10, 2020 at 8:43 AM Jiri Olsa <jolsa@redhat.com> wrote:
> > > >
> > > > On Wed, Dec 09, 2020 at 12:53:44PM -0800, Andrii Nakryiko wrote:
> > > > >
> > > > > 1. So the very first issue (that I'm going to ignore for now) is that
> > > > > if I just added bpf_testmod_ksym_percpu, it would get addr == 0 and
> > > > > would be ignored by the current pahole logic. So we need to fix that
> > > > > for modules. Adding dummy1 and dummy2 takes care of this for now,
> > > > > bpf_testmod_ksym_percpu has offset 4.
> > > >
> > > > I removed that addr zero check in the module changes, but only when
> > > > collecting functions; it's still there in collect_percpu_var
> > >
> > > Hao had some reason to skip per-cpu variables with offset 0, maybe he
> > > can comment on that before we change it.
> > >
> >
> > When I initially wrote that check, I saw that there are multiple
> > symbols of the same name associated with a single variable, but only
> > one of them has a non-zero address. Besides, there are symbols that
> > aren't associated with any variable and have a zero address, for
> > example those defined via __ADDRESSABLE(sym) and __UNIQUE_ID(prefix).
> > There are quite a lot of them, as I remember. So I filtered out zero
> > addresses to speed up encoding. I noticed that on x86_64 the first
> > page of the percpu section is reserved, so I assumed that symbols of
> > normal interest would have positive addresses.
>
> So I just checked my local vmlinux image, and seems like the only one
> with addr == 0 is fixed_percpu_data. Everything else that's detected
> as belonging to .data..percpu section looks sane and has non-zero
> offset.
>
> So I think this might have been the case before we switched to using
> ELF symbols and now it's not? I think I'll just drop this check, will
> post the patch, and would really appreciate if you can test it in your
> environment. Does that sound ok?

Ah, never mind. While ELF symbols look good, it's the DWARF variables
side where the problem is. There are lots of DWARF variables that map
to addr 0 and which are impossible to distinguish from the real
fixed_percpu_data, because we can't even rely on getting the DWARF
variable name.

I guess I'll leave it as is for now, but we should come up with some
solution, ideally.

Patch

--- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
+++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
@@ -3,6 +3,7 @@ 
 #include <linux/error-injection.h>
 #include <linux/init.h>
 #include <linux/module.h>
+#include <linux/percpu-defs.h>
 #include <linux/sysfs.h>
 #include <linux/tracepoint.h>
 #include "bpf_testmod.h"
@@ -10,6 +11,10 @@ 
 #define CREATE_TRACE_POINTS
 #include "bpf_testmod-events.h"

+DEFINE_PER_CPU(int, bpf_testmod_ksym_dummy1) = -1;
+DEFINE_PER_CPU(int, bpf_testmod_ksym_percpu) = 123;
+DEFINE_PER_CPU(int, bpf_testmod_ksym_dummy2) = -1;
+
 noinline ssize_t
 bpf_testmod_test_read(struct file *file, struct kobject *kobj,
                      struct bin_attribute *bin_attr,