diff mbox series

[bpf-next,2/2] fork: Rename mm_init to task_mm_init

Message ID 20230424161104.3737-3-laoar.shao@gmail.com (mailing list archive)
State Changes Requested
Delegated to: BPF
Headers show
Series bpf: Fix issues caused by bpf trampoline | expand

Checks

Context Check Description
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-2 success Logs for build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-3 success Logs for build for aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-4 success Logs for build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-31 success Logs for test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-32 success Logs for test_verifier on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-34 success Logs for test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-35 success Logs for test_verifier on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-8 success Logs for test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for test_maps on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-11 success Logs for test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for test_maps on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-13 success Logs for test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-14 success Logs for test_progs on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-16 success Logs for test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-17 success Logs for test_progs on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-18 fail Logs for test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-19 fail Logs for test_progs_no_alu32 on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-20 fail Logs for test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-21 fail Logs for test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 fail Logs for test_progs_no_alu32 on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-23 success Logs for test_progs_no_alu32_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for test_progs_no_alu32_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-25 success Logs for test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for test_progs_no_alu32_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-27 success Logs for test_progs_parallel on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for test_progs_parallel on aarch64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-29 success Logs for test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-30 success Logs for test_progs_parallel on x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-33 success Logs for test_verifier on s390x with gcc
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for bpf-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 46 this patch: 46
netdev/cc_maintainers fail 2 maintainers not CCed: Liam.Howlett@oracle.com akpm@linux-foundation.org
netdev/build_clang fail Errors and warnings before: 13 this patch: 13
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn fail Errors and warnings before: 14 this patch: 14
netdev/checkpatch warning CHECK: Alignment should match open parenthesis WARNING: line length of 82 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 8 this patch: 8
netdev/source_inline success Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-15 success Logs for test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-5 success Logs for build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-6 success Logs for build for x86_64 with llvm-16
bpf/vmtest-bpf-next-VM_Test-7 success Logs for set-matrix

Commit Message

Yafang Shao April 24, 2023, 4:11 p.m. UTC
The kernel will panic as follows when attaching fexit to mm_init,

[   86.549700] ------------[ cut here ]------------
[   86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078
[   86.549713] #PF: supervisor read access in kernel mode
[   86.549715] #PF: error_code(0x0000) - not-present page
[   86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0
[   86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI
[   86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12
[   86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310
[   86.549754] Call Trace:
[   86.549755]  <TASK>
[   86.549757]  check_preempt_curr+0x5e/0x70
[   86.549761]  ttwu_do_activate+0xab/0x350
[   86.549763]  try_to_wake_up+0x314/0x680
[   86.549765]  wake_up_process+0x15/0x20
[   86.549767]  insert_work+0xb2/0xd0
[   86.549772]  __queue_work+0x20a/0x400
[   86.549774]  queue_work_on+0x7b/0x90
[   86.549778]  drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper]
[   86.549801]  drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper]
[   86.549813]  soft_cursor+0x1cb/0x250
[   86.549816]  bit_cursor+0x3ce/0x630
[   86.549818]  fbcon_cursor+0x139/0x1c0
[   86.549821]  ? __pfx_bit_cursor+0x10/0x10
[   86.549822]  hide_cursor+0x31/0xd0
[   86.549825]  vt_console_print+0x477/0x4e0
[   86.549828]  console_flush_all+0x182/0x440
[   86.549832]  console_unlock+0x58/0xf0
[   86.549834]  vprintk_emit+0x1ae/0x200
[   86.549837]  vprintk_default+0x1d/0x30
[   86.549839]  vprintk+0x5c/0x90
[   86.549841]  _printk+0x58/0x80
[   86.549843]  __warn_printk+0x7e/0x1a0
[   86.549845]  ? trace_preempt_off+0x1b/0x70
[   86.549848]  ? trace_preempt_on+0x1b/0x70
[   86.549849]  ? __percpu_counter_init+0x8e/0xb0
[   86.549853]  refcount_warn_saturate+0x9f/0x150
[   86.549855]  mm_init+0x379/0x390
[   86.549859]  bpf_trampoline_6442453440_0+0x23/0x1000
[   86.549862]  mm_init+0x5/0x390
[   86.549865]  ? mm_alloc+0x4e/0x60
[   86.549866]  alloc_bprm+0x8a/0x2e0
[   86.549869]  do_execveat_common.isra.0+0x67/0x240
[   86.549872]  __x64_sys_execve+0x37/0x50
[   86.549874]  do_syscall_64+0x38/0x90
[   86.549877]  entry_SYSCALL_64_after_hwframe+0x72/0xdc

The reason is that when we attach the btf id of the function mm_init we
actually attach the mm_init defined in init/main.c rather than the
function defined in kernel/fork.c. That can be proved by parsing
/sys/kernel/btf/vmlinux:

[2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static
[2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1
        'buf' type_id=57
[2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static
[2496] FUNC 'mm_init' type_id=118 linkage=static
[2497] FUNC 'trap_init' type_id=118 linkage=static
[2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static

From the above information we can find that the FUNCs above and below
mm_init are all defined in init/main.c. So there's no doubt that the
mm_init is also the function defined in init/main.c.

So when a task calls mm_init and thus the bpf trampoline is triggered it
will use the information of the mm_init defined in init/main.c. Then the
panic will occur.

It seems that there're issues in btf, for example it is unnecessary to
generate btf for the functions annonated with __init. We need to improve
btf. However we also need to change the function defined in
kernel/fork.c to task_mm_init to better distinguish them. After it is
renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be:

[13970] FUNC 'mm_alloc' type_id=13969 linkage=static
[13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3
        'mm' type_id=204
        'p' type_id=197
        'user_ns' type_id=452
[13972] FUNC 'task_mm_init' type_id=13971 linkage=static
[13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static
[13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2
        'orig' type_id=197
        'node' type_id=21
[13975] FUNC 'dup_task_struct' type_id=13974 linkage=static

And then attaching task_mm_init won't panic. Improving the btf will be
handled later.

This issue can be reproduced with a simple bpf program as such:
  SEC("fexit/mm_init")
  int fentry_run()
  {
      return 0;
  }

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/fork.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Alexei Starovoitov April 24, 2023, 9:13 p.m. UTC | #1
On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> The kernel will panic as follows when attaching fexit to mm_init,
>
> [   86.549700] ------------[ cut here ]------------
> [   86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078
> [   86.549713] #PF: supervisor read access in kernel mode
> [   86.549715] #PF: error_code(0x0000) - not-present page
> [   86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0
> [   86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [   86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12
> [   86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310
> [   86.549754] Call Trace:
> [   86.549755]  <TASK>
> [   86.549757]  check_preempt_curr+0x5e/0x70
> [   86.549761]  ttwu_do_activate+0xab/0x350
> [   86.549763]  try_to_wake_up+0x314/0x680
> [   86.549765]  wake_up_process+0x15/0x20
> [   86.549767]  insert_work+0xb2/0xd0
> [   86.549772]  __queue_work+0x20a/0x400
> [   86.549774]  queue_work_on+0x7b/0x90
> [   86.549778]  drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper]
> [   86.549801]  drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper]
> [   86.549813]  soft_cursor+0x1cb/0x250
> [   86.549816]  bit_cursor+0x3ce/0x630
> [   86.549818]  fbcon_cursor+0x139/0x1c0
> [   86.549821]  ? __pfx_bit_cursor+0x10/0x10
> [   86.549822]  hide_cursor+0x31/0xd0
> [   86.549825]  vt_console_print+0x477/0x4e0
> [   86.549828]  console_flush_all+0x182/0x440
> [   86.549832]  console_unlock+0x58/0xf0
> [   86.549834]  vprintk_emit+0x1ae/0x200
> [   86.549837]  vprintk_default+0x1d/0x30
> [   86.549839]  vprintk+0x5c/0x90
> [   86.549841]  _printk+0x58/0x80
> [   86.549843]  __warn_printk+0x7e/0x1a0
> [   86.549845]  ? trace_preempt_off+0x1b/0x70
> [   86.549848]  ? trace_preempt_on+0x1b/0x70
> [   86.549849]  ? __percpu_counter_init+0x8e/0xb0
> [   86.549853]  refcount_warn_saturate+0x9f/0x150
> [   86.549855]  mm_init+0x379/0x390
> [   86.549859]  bpf_trampoline_6442453440_0+0x23/0x1000
> [   86.549862]  mm_init+0x5/0x390
> [   86.549865]  ? mm_alloc+0x4e/0x60
> [   86.549866]  alloc_bprm+0x8a/0x2e0
> [   86.549869]  do_execveat_common.isra.0+0x67/0x240
> [   86.549872]  __x64_sys_execve+0x37/0x50
> [   86.549874]  do_syscall_64+0x38/0x90
> [   86.549877]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
>
> The reason is that when we attach the btf id of the function mm_init we
> actually attach the mm_init defined in init/main.c rather than the
> function defined in kernel/fork.c. That can be proved by parsing
> /sys/kernel/btf/vmlinux:
>
> [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static
> [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1
>         'buf' type_id=57
> [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static
> [2496] FUNC 'mm_init' type_id=118 linkage=static
> [2497] FUNC 'trap_init' type_id=118 linkage=static
> [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static
>
> From the above information we can find that the FUNCs above and below
> mm_init are all defined in init/main.c. So there's no doubt that the
> mm_init is also the function defined in init/main.c.
>
> So when a task calls mm_init and thus the bpf trampoline is triggered it
> will use the information of the mm_init defined in init/main.c. Then the
> panic will occur.
>
> It seems that there're issues in btf, for example it is unnecessary to
> generate btf for the functions annonated with __init. We need to improve
> btf. However we also need to change the function defined in
> kernel/fork.c to task_mm_init to better distinguish them. After it is
> renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be:
>
> [13970] FUNC 'mm_alloc' type_id=13969 linkage=static
> [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3
>         'mm' type_id=204
>         'p' type_id=197
>         'user_ns' type_id=452
> [13972] FUNC 'task_mm_init' type_id=13971 linkage=static
> [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static
> [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2
>         'orig' type_id=197
>         'node' type_id=21
> [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static
>
> And then attaching task_mm_init won't panic. Improving the btf will be
> handled later.

We're not going to hack the kernel to workaround pahole issue.
Let's fix pahole instead.
cc-ing Alan for ideas.
Yafang Shao April 27, 2023, 11:34 a.m. UTC | #2
On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > The kernel will panic as follows when attaching fexit to mm_init,
> >
> > [   86.549700] ------------[ cut here ]------------
> > [   86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078
> > [   86.549713] #PF: supervisor read access in kernel mode
> > [   86.549715] #PF: error_code(0x0000) - not-present page
> > [   86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0
> > [   86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > [   86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12
> > [   86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310
> > [   86.549754] Call Trace:
> > [   86.549755]  <TASK>
> > [   86.549757]  check_preempt_curr+0x5e/0x70
> > [   86.549761]  ttwu_do_activate+0xab/0x350
> > [   86.549763]  try_to_wake_up+0x314/0x680
> > [   86.549765]  wake_up_process+0x15/0x20
> > [   86.549767]  insert_work+0xb2/0xd0
> > [   86.549772]  __queue_work+0x20a/0x400
> > [   86.549774]  queue_work_on+0x7b/0x90
> > [   86.549778]  drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper]
> > [   86.549801]  drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper]
> > [   86.549813]  soft_cursor+0x1cb/0x250
> > [   86.549816]  bit_cursor+0x3ce/0x630
> > [   86.549818]  fbcon_cursor+0x139/0x1c0
> > [   86.549821]  ? __pfx_bit_cursor+0x10/0x10
> > [   86.549822]  hide_cursor+0x31/0xd0
> > [   86.549825]  vt_console_print+0x477/0x4e0
> > [   86.549828]  console_flush_all+0x182/0x440
> > [   86.549832]  console_unlock+0x58/0xf0
> > [   86.549834]  vprintk_emit+0x1ae/0x200
> > [   86.549837]  vprintk_default+0x1d/0x30
> > [   86.549839]  vprintk+0x5c/0x90
> > [   86.549841]  _printk+0x58/0x80
> > [   86.549843]  __warn_printk+0x7e/0x1a0
> > [   86.549845]  ? trace_preempt_off+0x1b/0x70
> > [   86.549848]  ? trace_preempt_on+0x1b/0x70
> > [   86.549849]  ? __percpu_counter_init+0x8e/0xb0
> > [   86.549853]  refcount_warn_saturate+0x9f/0x150
> > [   86.549855]  mm_init+0x379/0x390
> > [   86.549859]  bpf_trampoline_6442453440_0+0x23/0x1000
> > [   86.549862]  mm_init+0x5/0x390
> > [   86.549865]  ? mm_alloc+0x4e/0x60
> > [   86.549866]  alloc_bprm+0x8a/0x2e0
> > [   86.549869]  do_execveat_common.isra.0+0x67/0x240
> > [   86.549872]  __x64_sys_execve+0x37/0x50
> > [   86.549874]  do_syscall_64+0x38/0x90
> > [   86.549877]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> >
> > The reason is that when we attach the btf id of the function mm_init we
> > actually attach the mm_init defined in init/main.c rather than the
> > function defined in kernel/fork.c. That can be proved by parsing
> > /sys/kernel/btf/vmlinux:
> >
> > [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static
> > [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1
> >         'buf' type_id=57
> > [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static
> > [2496] FUNC 'mm_init' type_id=118 linkage=static
> > [2497] FUNC 'trap_init' type_id=118 linkage=static
> > [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static
> >
> > From the above information we can find that the FUNCs above and below
> > mm_init are all defined in init/main.c. So there's no doubt that the
> > mm_init is also the function defined in init/main.c.
> >
> > So when a task calls mm_init and thus the bpf trampoline is triggered it
> > will use the information of the mm_init defined in init/main.c. Then the
> > panic will occur.
> >
> > It seems that there're issues in btf, for example it is unnecessary to
> > generate btf for the functions annonated with __init. We need to improve
> > btf. However we also need to change the function defined in
> > kernel/fork.c to task_mm_init to better distinguish them. After it is
> > renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be:
> >
> > [13970] FUNC 'mm_alloc' type_id=13969 linkage=static
> > [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3
> >         'mm' type_id=204
> >         'p' type_id=197
> >         'user_ns' type_id=452
> > [13972] FUNC 'task_mm_init' type_id=13971 linkage=static
> > [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static
> > [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2
> >         'orig' type_id=197
> >         'node' type_id=21
> > [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static
> >
> > And then attaching task_mm_init won't panic. Improving the btf will be
> > handled later.
>
> We're not going to hack the kernel to workaround pahole issue.
> Let's fix pahole instead.
> cc-ing Alan for ideas.

Any comment on it, Alan ?
I think we can just skip generating BTF for the functions in
__section(".init.text"),  as these functions will be freed after
kernel init. There won't be use cases for them.
Alexei Starovoitov May 2, 2023, 3:40 a.m. UTC | #3
Alan,

wdyt on below?

On Thu, Apr 27, 2023 at 4:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > The kernel will panic as follows when attaching fexit to mm_init,
> > >
> > > [   86.549700] ------------[ cut here ]------------
> > > [   86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078
> > > [   86.549713] #PF: supervisor read access in kernel mode
> > > [   86.549715] #PF: error_code(0x0000) - not-present page
> > > [   86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0
> > > [   86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > [   86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12
> > > [   86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310
> > > [   86.549754] Call Trace:
> > > [   86.549755]  <TASK>
> > > [   86.549757]  check_preempt_curr+0x5e/0x70
> > > [   86.549761]  ttwu_do_activate+0xab/0x350
> > > [   86.549763]  try_to_wake_up+0x314/0x680
> > > [   86.549765]  wake_up_process+0x15/0x20
> > > [   86.549767]  insert_work+0xb2/0xd0
> > > [   86.549772]  __queue_work+0x20a/0x400
> > > [   86.549774]  queue_work_on+0x7b/0x90
> > > [   86.549778]  drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper]
> > > [   86.549801]  drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper]
> > > [   86.549813]  soft_cursor+0x1cb/0x250
> > > [   86.549816]  bit_cursor+0x3ce/0x630
> > > [   86.549818]  fbcon_cursor+0x139/0x1c0
> > > [   86.549821]  ? __pfx_bit_cursor+0x10/0x10
> > > [   86.549822]  hide_cursor+0x31/0xd0
> > > [   86.549825]  vt_console_print+0x477/0x4e0
> > > [   86.549828]  console_flush_all+0x182/0x440
> > > [   86.549832]  console_unlock+0x58/0xf0
> > > [   86.549834]  vprintk_emit+0x1ae/0x200
> > > [   86.549837]  vprintk_default+0x1d/0x30
> > > [   86.549839]  vprintk+0x5c/0x90
> > > [   86.549841]  _printk+0x58/0x80
> > > [   86.549843]  __warn_printk+0x7e/0x1a0
> > > [   86.549845]  ? trace_preempt_off+0x1b/0x70
> > > [   86.549848]  ? trace_preempt_on+0x1b/0x70
> > > [   86.549849]  ? __percpu_counter_init+0x8e/0xb0
> > > [   86.549853]  refcount_warn_saturate+0x9f/0x150
> > > [   86.549855]  mm_init+0x379/0x390
> > > [   86.549859]  bpf_trampoline_6442453440_0+0x23/0x1000
> > > [   86.549862]  mm_init+0x5/0x390
> > > [   86.549865]  ? mm_alloc+0x4e/0x60
> > > [   86.549866]  alloc_bprm+0x8a/0x2e0
> > > [   86.549869]  do_execveat_common.isra.0+0x67/0x240
> > > [   86.549872]  __x64_sys_execve+0x37/0x50
> > > [   86.549874]  do_syscall_64+0x38/0x90
> > > [   86.549877]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > >
> > > The reason is that when we attach the btf id of the function mm_init we
> > > actually attach the mm_init defined in init/main.c rather than the
> > > function defined in kernel/fork.c. That can be proved by parsing
> > > /sys/kernel/btf/vmlinux:
> > >
> > > [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static
> > > [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1
> > >         'buf' type_id=57
> > > [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static
> > > [2496] FUNC 'mm_init' type_id=118 linkage=static
> > > [2497] FUNC 'trap_init' type_id=118 linkage=static
> > > [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static
> > >
> > > From the above information we can find that the FUNCs above and below
> > > mm_init are all defined in init/main.c. So there's no doubt that the
> > > mm_init is also the function defined in init/main.c.
> > >
> > > So when a task calls mm_init and thus the bpf trampoline is triggered it
> > > will use the information of the mm_init defined in init/main.c. Then the
> > > panic will occur.
> > >
> > > It seems that there're issues in btf, for example it is unnecessary to
> > > generate btf for the functions annonated with __init. We need to improve
> > > btf. However we also need to change the function defined in
> > > kernel/fork.c to task_mm_init to better distinguish them. After it is
> > > renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be:
> > >
> > > [13970] FUNC 'mm_alloc' type_id=13969 linkage=static
> > > [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3
> > >         'mm' type_id=204
> > >         'p' type_id=197
> > >         'user_ns' type_id=452
> > > [13972] FUNC 'task_mm_init' type_id=13971 linkage=static
> > > [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static
> > > [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2
> > >         'orig' type_id=197
> > >         'node' type_id=21
> > > [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static
> > >
> > > And then attaching task_mm_init won't panic. Improving the btf will be
> > > handled later.
> >
> > We're not going to hack the kernel to workaround pahole issue.
> > Let's fix pahole instead.
> > cc-ing Alan for ideas.
>
> Any comment on it, Alan ?
> I think we can just skip generating BTF for the functions in
> __section(".init.text"),  as these functions will be freed after
> kernel init. There won't be use cases for them.
>
> --
> Regards
> Yafang
Yafang Shao May 9, 2023, 3:35 p.m. UTC | #4
On Tue, May 2, 2023 at 11:40 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Alan,
>
> wdyt on below?
>

Hi Alexei,

Per my understanding, not only does pahole have issues, but also there
are issues in the kernel.
This panic is caused by the inconsistency between BTF and kallsyms as such:
   bpf_check_attach_target
       tname = btf_name_by_offset(btf, t->name_off); // btf
       addr = kallsyms_lookup_name(tname); // kallsyms

So if the function displayed in /proc/sys/btf/vmlinux is not the same
with the function displayed in /proc/kallsyms, we will get a wrong
addr.  I think it is not proper to rely wholly on the userspace tools
to make them the same. The kernel should also imrpve the verifier to
make sure they are really the same function.  WDYT?

> On Thu, Apr 27, 2023 at 4:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > > >
> > > > The kernel will panic as follows when attaching fexit to mm_init,
> > > >
> > > > [   86.549700] ------------[ cut here ]------------
> > > > [   86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078
> > > > [   86.549713] #PF: supervisor read access in kernel mode
> > > > [   86.549715] #PF: error_code(0x0000) - not-present page
> > > > [   86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0
> > > > [   86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > > > [   86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12
> > > > [   86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310
> > > > [   86.549754] Call Trace:
> > > > [   86.549755]  <TASK>
> > > > [   86.549757]  check_preempt_curr+0x5e/0x70
> > > > [   86.549761]  ttwu_do_activate+0xab/0x350
> > > > [   86.549763]  try_to_wake_up+0x314/0x680
> > > > [   86.549765]  wake_up_process+0x15/0x20
> > > > [   86.549767]  insert_work+0xb2/0xd0
> > > > [   86.549772]  __queue_work+0x20a/0x400
> > > > [   86.549774]  queue_work_on+0x7b/0x90
> > > > [   86.549778]  drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper]
> > > > [   86.549801]  drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper]
> > > > [   86.549813]  soft_cursor+0x1cb/0x250
> > > > [   86.549816]  bit_cursor+0x3ce/0x630
> > > > [   86.549818]  fbcon_cursor+0x139/0x1c0
> > > > [   86.549821]  ? __pfx_bit_cursor+0x10/0x10
> > > > [   86.549822]  hide_cursor+0x31/0xd0
> > > > [   86.549825]  vt_console_print+0x477/0x4e0
> > > > [   86.549828]  console_flush_all+0x182/0x440
> > > > [   86.549832]  console_unlock+0x58/0xf0
> > > > [   86.549834]  vprintk_emit+0x1ae/0x200
> > > > [   86.549837]  vprintk_default+0x1d/0x30
> > > > [   86.549839]  vprintk+0x5c/0x90
> > > > [   86.549841]  _printk+0x58/0x80
> > > > [   86.549843]  __warn_printk+0x7e/0x1a0
> > > > [   86.549845]  ? trace_preempt_off+0x1b/0x70
> > > > [   86.549848]  ? trace_preempt_on+0x1b/0x70
> > > > [   86.549849]  ? __percpu_counter_init+0x8e/0xb0
> > > > [   86.549853]  refcount_warn_saturate+0x9f/0x150
> > > > [   86.549855]  mm_init+0x379/0x390
> > > > [   86.549859]  bpf_trampoline_6442453440_0+0x23/0x1000
> > > > [   86.549862]  mm_init+0x5/0x390
> > > > [   86.549865]  ? mm_alloc+0x4e/0x60
> > > > [   86.549866]  alloc_bprm+0x8a/0x2e0
> > > > [   86.549869]  do_execveat_common.isra.0+0x67/0x240
> > > > [   86.549872]  __x64_sys_execve+0x37/0x50
> > > > [   86.549874]  do_syscall_64+0x38/0x90
> > > > [   86.549877]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> > > >
> > > > The reason is that when we attach the btf id of the function mm_init we
> > > > actually attach the mm_init defined in init/main.c rather than the
> > > > function defined in kernel/fork.c. That can be proved by parsing
> > > > /sys/kernel/btf/vmlinux:
> > > >
> > > > [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static
> > > > [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1
> > > >         'buf' type_id=57
> > > > [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static
> > > > [2496] FUNC 'mm_init' type_id=118 linkage=static
> > > > [2497] FUNC 'trap_init' type_id=118 linkage=static
> > > > [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static
> > > >
> > > > From the above information we can find that the FUNCs above and below
> > > > mm_init are all defined in init/main.c. So there's no doubt that the
> > > > mm_init is also the function defined in init/main.c.
> > > >
> > > > So when a task calls mm_init and thus the bpf trampoline is triggered it
> > > > will use the information of the mm_init defined in init/main.c. Then the
> > > > panic will occur.
> > > >
> > > > It seems that there're issues in btf, for example it is unnecessary to
> > > > generate btf for the functions annonated with __init. We need to improve
> > > > btf. However we also need to change the function defined in
> > > > kernel/fork.c to task_mm_init to better distinguish them. After it is
> > > > renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be:
> > > >
> > > > [13970] FUNC 'mm_alloc' type_id=13969 linkage=static
> > > > [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3
> > > >         'mm' type_id=204
> > > >         'p' type_id=197
> > > >         'user_ns' type_id=452
> > > > [13972] FUNC 'task_mm_init' type_id=13971 linkage=static
> > > > [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static
> > > > [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2
> > > >         'orig' type_id=197
> > > >         'node' type_id=21
> > > > [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static
> > > >
> > > > And then attaching task_mm_init won't panic. Improving the btf will be
> > > > handled later.
> > >
> > > We're not going to hack the kernel to workaround pahole issue.
> > > Let's fix pahole instead.
> > > cc-ing Alan for ideas.
> >
> > Any comment on it, Alan ?
> > I think we can just skip generating BTF for the functions in
> > __section(".init.text"),  as these functions will be freed after
> > kernel init. There won't be use cases for them.
> >
> > --
> > Regards
> > Yafang
Alexei Starovoitov May 9, 2023, 5:18 p.m. UTC | #5
On Tue, May 9, 2023 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Tue, May 2, 2023 at 11:40 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Alan,
> >
> > wdyt on below?
> >
>
> Hi Alexei,
>
> Per my understanding, not only does pahole have issues, but also there
> are issues in the kernel.
> This panic is caused by the inconsistency between BTF and kallsyms as such:
>    bpf_check_attach_target
>        tname = btf_name_by_offset(btf, t->name_off); // btf
>        addr = kallsyms_lookup_name(tname); // kallsyms
>
> So if the function displayed in /proc/sys/btf/vmlinux is not the same
> with the function displayed in /proc/kallsyms, we will get a wrong
> addr.  I think it is not proper to rely wholly on the userspace tools
> to make them the same. The kernel should also imrpve the verifier to
> make sure they are really the same function.  WDYT?

Are you saying it's not proper to rely on compilers
and linkers to build the kernel?
pahole, resolved_btfid, kallsym gen, objtool are part of the
compilation process.
The bugs in them are discovered from time to time and
have to be fixed. Just like compiler and linker bugs.
Alan Maguire May 9, 2023, 6:43 p.m. UTC | #6
On 02/05/2023 04:40, Alexei Starovoitov wrote:
> Alan,
> 
> wdyt on below?
>

apologies, missed this; see below..

> On Thu, Apr 27, 2023 at 4:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>>
>> On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>>>
>>> On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>>
>>>> The kernel will panic as follows when attaching fexit to mm_init,
>>>>
>>>> [   86.549700] ------------[ cut here ]------------
>>>> [   86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078
>>>> [   86.549713] #PF: supervisor read access in kernel mode
>>>> [   86.549715] #PF: error_code(0x0000) - not-present page
>>>> [   86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0
>>>> [   86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI
>>>> [   86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12
>>>> [   86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310
>>>> [   86.549754] Call Trace:
>>>> [   86.549755]  <TASK>
>>>> [   86.549757]  check_preempt_curr+0x5e/0x70
>>>> [   86.549761]  ttwu_do_activate+0xab/0x350
>>>> [   86.549763]  try_to_wake_up+0x314/0x680
>>>> [   86.549765]  wake_up_process+0x15/0x20
>>>> [   86.549767]  insert_work+0xb2/0xd0
>>>> [   86.549772]  __queue_work+0x20a/0x400
>>>> [   86.549774]  queue_work_on+0x7b/0x90
>>>> [   86.549778]  drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper]
>>>> [   86.549801]  drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper]
>>>> [   86.549813]  soft_cursor+0x1cb/0x250
>>>> [   86.549816]  bit_cursor+0x3ce/0x630
>>>> [   86.549818]  fbcon_cursor+0x139/0x1c0
>>>> [   86.549821]  ? __pfx_bit_cursor+0x10/0x10
>>>> [   86.549822]  hide_cursor+0x31/0xd0
>>>> [   86.549825]  vt_console_print+0x477/0x4e0
>>>> [   86.549828]  console_flush_all+0x182/0x440
>>>> [   86.549832]  console_unlock+0x58/0xf0
>>>> [   86.549834]  vprintk_emit+0x1ae/0x200
>>>> [   86.549837]  vprintk_default+0x1d/0x30
>>>> [   86.549839]  vprintk+0x5c/0x90
>>>> [   86.549841]  _printk+0x58/0x80
>>>> [   86.549843]  __warn_printk+0x7e/0x1a0
>>>> [   86.549845]  ? trace_preempt_off+0x1b/0x70
>>>> [   86.549848]  ? trace_preempt_on+0x1b/0x70
>>>> [   86.549849]  ? __percpu_counter_init+0x8e/0xb0
>>>> [   86.549853]  refcount_warn_saturate+0x9f/0x150
>>>> [   86.549855]  mm_init+0x379/0x390
>>>> [   86.549859]  bpf_trampoline_6442453440_0+0x23/0x1000
>>>> [   86.549862]  mm_init+0x5/0x390
>>>> [   86.549865]  ? mm_alloc+0x4e/0x60
>>>> [   86.549866]  alloc_bprm+0x8a/0x2e0
>>>> [   86.549869]  do_execveat_common.isra.0+0x67/0x240
>>>> [   86.549872]  __x64_sys_execve+0x37/0x50
>>>> [   86.549874]  do_syscall_64+0x38/0x90
>>>> [   86.549877]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>>>
>>>> The reason is that when we attach the btf id of the function mm_init we
>>>> actually attach the mm_init defined in init/main.c rather than the
>>>> function defined in kernel/fork.c. That can be proved by parsing
>>>> /sys/kernel/btf/vmlinux:
>>>>
>>>> [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static
>>>> [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1
>>>>         'buf' type_id=57
>>>> [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static
>>>> [2496] FUNC 'mm_init' type_id=118 linkage=static
>>>> [2497] FUNC 'trap_init' type_id=118 linkage=static
>>>> [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static
>>>>
>>>> From the above information we can find that the FUNCs above and below
>>>> mm_init are all defined in init/main.c. So there's no doubt that the
>>>> mm_init is also the function defined in init/main.c.
>>>>
>>>> So when a task calls mm_init and thus the bpf trampoline is triggered it
>>>> will use the information of the mm_init defined in init/main.c. Then the
>>>> panic will occur.
>>>>
>>>> It seems that there're issues in btf, for example it is unnecessary to
>>>> generate btf for the functions annonated with __init. We need to improve
>>>> btf. However we also need to change the function defined in
>>>> kernel/fork.c to task_mm_init to better distinguish them. After it is
>>>> renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be:
>>>>
>>>> [13970] FUNC 'mm_alloc' type_id=13969 linkage=static
>>>> [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3
>>>>         'mm' type_id=204
>>>>         'p' type_id=197
>>>>         'user_ns' type_id=452
>>>> [13972] FUNC 'task_mm_init' type_id=13971 linkage=static
>>>> [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static
>>>> [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2
>>>>         'orig' type_id=197
>>>>         'node' type_id=21
>>>> [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static
>>>>
>>>> And then attaching task_mm_init won't panic. Improving the btf will be
>>>> handled later.
>>>
>>> We're not going to hack the kernel to workaround pahole issue.
>>> Let's fix pahole instead.
>>> cc-ing Alan for ideas.
>>
>> Any comment on it, Alan ?
>> I think we can just skip generating BTF for the functions in
>> __section(".init.text"),  as these functions will be freed after
>> kernel init. There won't be use cases for them.
>>

won't the pahole v1.25 changes help here; can you try applying

https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@oracle.com/

...and build using pahole; this should eliminate any functions
with inconsistent prototypes via

      --skip_encoding_btf_inconsistent_proto
        Do not encode functions with multiple inconsistent prototypes or
        unexpected register use for their parameters, where  the  regis‐
        ters used do not match calling conventions.


I'll check this at my end too.

Alexei, if this works should we look at applying the above
again to bpf-next? If so I'll resend the patch.

Thanks!

Alan

>> --
>> Regards
>> Yafang
Alexei Starovoitov May 9, 2023, 9:21 p.m. UTC | #7
On Tue, May 9, 2023 at 11:44 AM Alan Maguire <alan.maguire@oracle.com> wrote:
>
> On 02/05/2023 04:40, Alexei Starovoitov wrote:
> > Alan,
> >
> > wdyt on below?
> >
>
> apologies, missed this; see below..
>
> > On Thu, Apr 27, 2023 at 4:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>
> >> On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >>>
> >>> On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >>>>
> >>>> The kernel will panic as follows when attaching fexit to mm_init,
> >>>>
> >>>> [   86.549700] ------------[ cut here ]------------
> >>>> [   86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078
> >>>> [   86.549713] #PF: supervisor read access in kernel mode
> >>>> [   86.549715] #PF: error_code(0x0000) - not-present page
> >>>> [   86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0
> >>>> [   86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI
> >>>> [   86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12
> >>>> [   86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310
> >>>> [   86.549754] Call Trace:
> >>>> [   86.549755]  <TASK>
> >>>> [   86.549757]  check_preempt_curr+0x5e/0x70
> >>>> [   86.549761]  ttwu_do_activate+0xab/0x350
> >>>> [   86.549763]  try_to_wake_up+0x314/0x680
> >>>> [   86.549765]  wake_up_process+0x15/0x20
> >>>> [   86.549767]  insert_work+0xb2/0xd0
> >>>> [   86.549772]  __queue_work+0x20a/0x400
> >>>> [   86.549774]  queue_work_on+0x7b/0x90
> >>>> [   86.549778]  drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper]
> >>>> [   86.549801]  drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper]
> >>>> [   86.549813]  soft_cursor+0x1cb/0x250
> >>>> [   86.549816]  bit_cursor+0x3ce/0x630
> >>>> [   86.549818]  fbcon_cursor+0x139/0x1c0
> >>>> [   86.549821]  ? __pfx_bit_cursor+0x10/0x10
> >>>> [   86.549822]  hide_cursor+0x31/0xd0
> >>>> [   86.549825]  vt_console_print+0x477/0x4e0
> >>>> [   86.549828]  console_flush_all+0x182/0x440
> >>>> [   86.549832]  console_unlock+0x58/0xf0
> >>>> [   86.549834]  vprintk_emit+0x1ae/0x200
> >>>> [   86.549837]  vprintk_default+0x1d/0x30
> >>>> [   86.549839]  vprintk+0x5c/0x90
> >>>> [   86.549841]  _printk+0x58/0x80
> >>>> [   86.549843]  __warn_printk+0x7e/0x1a0
> >>>> [   86.549845]  ? trace_preempt_off+0x1b/0x70
> >>>> [   86.549848]  ? trace_preempt_on+0x1b/0x70
> >>>> [   86.549849]  ? __percpu_counter_init+0x8e/0xb0
> >>>> [   86.549853]  refcount_warn_saturate+0x9f/0x150
> >>>> [   86.549855]  mm_init+0x379/0x390
> >>>> [   86.549859]  bpf_trampoline_6442453440_0+0x23/0x1000
> >>>> [   86.549862]  mm_init+0x5/0x390
> >>>> [   86.549865]  ? mm_alloc+0x4e/0x60
> >>>> [   86.549866]  alloc_bprm+0x8a/0x2e0
> >>>> [   86.549869]  do_execveat_common.isra.0+0x67/0x240
> >>>> [   86.549872]  __x64_sys_execve+0x37/0x50
> >>>> [   86.549874]  do_syscall_64+0x38/0x90
> >>>> [   86.549877]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> >>>>
> >>>> The reason is that when we attach the btf id of the function mm_init we
> >>>> actually attach the mm_init defined in init/main.c rather than the
> >>>> function defined in kernel/fork.c. That can be proved by parsing
> >>>> /sys/kernel/btf/vmlinux:
> >>>>
> >>>> [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static
> >>>> [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1
> >>>>         'buf' type_id=57
> >>>> [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static
> >>>> [2496] FUNC 'mm_init' type_id=118 linkage=static
> >>>> [2497] FUNC 'trap_init' type_id=118 linkage=static
> >>>> [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static
> >>>>
> >>>> From the above information we can find that the FUNCs above and below
> >>>> mm_init are all defined in init/main.c. So there's no doubt that the
> >>>> mm_init is also the function defined in init/main.c.
> >>>>
> >>>> So when a task calls mm_init and thus the bpf trampoline is triggered it
> >>>> will use the information of the mm_init defined in init/main.c. Then the
> >>>> panic will occur.
> >>>>
> >>>> It seems that there're issues in btf, for example it is unnecessary to
> >>>> generate btf for the functions annonated with __init. We need to improve
> >>>> btf. However we also need to change the function defined in
> >>>> kernel/fork.c to task_mm_init to better distinguish them. After it is
> >>>> renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be:
> >>>>
> >>>> [13970] FUNC 'mm_alloc' type_id=13969 linkage=static
> >>>> [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3
> >>>>         'mm' type_id=204
> >>>>         'p' type_id=197
> >>>>         'user_ns' type_id=452
> >>>> [13972] FUNC 'task_mm_init' type_id=13971 linkage=static
> >>>> [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static
> >>>> [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2
> >>>>         'orig' type_id=197
> >>>>         'node' type_id=21
> >>>> [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static
> >>>>
> >>>> And then attaching task_mm_init won't panic. Improving the btf will be
> >>>> handled later.
> >>>
> >>> We're not going to hack the kernel to workaround pahole issue.
> >>> Let's fix pahole instead.
> >>> cc-ing Alan for ideas.
> >>
> >> Any comment on it, Alan ?
> >> I think we can just skip generating BTF for the functions in
> >> __section(".init.text"),  as these functions will be freed after
> >> kernel init. There won't be use cases for them.
> >>
>
> won't the pahole v1.25 changes help here; can you try applying
>
> https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@oracle.com/
>
> ...and build using pahole; this should eliminate any functions
> with inconsistent prototypes via
>
>       --skip_encoding_btf_inconsistent_proto
>         Do not encode functions with multiple inconsistent prototypes or
>         unexpected register use for their parameters, where  the  regis‐
>         ters used do not match calling conventions.
>
>
> I'll check this at my end too.
>
> Alexei, if this works should we look at applying the above
> again to bpf-next? If so I'll resend the patch.

I've lost the track with pahole fixes.
Did Arnaldo re-tag pahole 1.25 or released 1.26 with the fixes?

Alan,
please submit a fresh patch for bpf-next to enable
--skip_encoding_btf_inconsistent_proto, so it can go through CI.
I cannot test all combinations manually.

Thanks!
Yafang Shao May 10, 2023, 5:38 a.m. UTC | #8
On Wed, May 10, 2023 at 1:18 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, May 9, 2023 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > On Tue, May 2, 2023 at 11:40 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > Alan,
> > >
> > > wdyt on below?
> > >
> >
> > Hi Alexei,
> >
> > Per my understanding, not only does pahole have issues, but also there
> > are issues in the kernel.
> > This panic is caused by the inconsistency between BTF and kallsyms as such:
> >    bpf_check_attach_target
> >        tname = btf_name_by_offset(btf, t->name_off); // btf
> >        addr = kallsyms_lookup_name(tname); // kallsyms
> >
> > So if the function displayed in /proc/sys/btf/vmlinux is not the same
> > with the function displayed in /proc/kallsyms, we will get a wrong
> > addr.  I think it is not proper to rely wholly on the userspace tools
> > to make them the same. The kernel should also imrpve the verifier to
> > make sure they are really the same function.  WDYT?
>
> Are you saying it's not proper to rely on compilers
> and linkers to build the kernel?
> pahole, resolved_btfid, kallsym gen, objtool are part of the
> compilation process.
> The bugs in them are discovered from time to time and
> have to be fixed. Just like compiler and linker bugs.

I was wondering if it is possible to add BTF_ID into kallsyms or to
add function address into BTF. Because the function name is not
unique, while the function ID is unique. So with the function ID we
can always get what we want.

For example,

$ cat /proc/kallsyms | awk '{if ($2=="t"||$2=="T") {print $3}}' |
sort|   uniq -c | sort -n -r | less
     56 __pfx_cleanup_module
     56 cleanup_module
     47 __pfx_cpumask_weight.constprop.0
     47 cpumask_weight.constprop.0
     21 __pfx_jhash
     21 __pfx_cpumask_weight
     21 jhash
     21 cpumask_weight
     17 type_show
     17 __pfx_type_show
     14 __rhashtable_insert_fast.constprop.0
     14 __pfx___rhashtable_insert_fast.constprop.0
     12 __rhashtable_remove_fast_one.cold
     12 __rhashtable_remove_fast_one
     12 __pfx___rhashtable_remove_fast_one
     11 __xfrm_policy_check2.constprop.0
     11 __pfx___xfrm_policy_check2.constprop.0
     11 __pfx_modalias_show
     11 modalias_show
     10 rht_key_get_hash.isra.0
     10 __pfx_rht_key_get_hash.isra.0
     10 __pfx_name_show
     10 __pfx_init_once
     10 name_show
     10 init_once
      9 __pfx_event_show
      9 event_show
      8 __pfx_dst_output
      8 dst_output
      7 state_show
      7 size_show
      7 __pfx_state_show
      7 __pfx_size_show

kallsyms_lookup_name() always returns the first function and ignores
the others, so it is impossible to trace the other functions with the
same name AFAIK.
Alan Maguire May 10, 2023, 1:08 p.m. UTC | #9
On 09/05/2023 22:21, Alexei Starovoitov wrote:
> On Tue, May 9, 2023 at 11:44 AM Alan Maguire <alan.maguire@oracle.com> wrote:
>>
>> On 02/05/2023 04:40, Alexei Starovoitov wrote:
>>> Alan,
>>>
>>> wdyt on below?
>>>
>>
>> apologies, missed this; see below..
>>
>>> On Thu, Apr 27, 2023 at 4:35 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>>
>>>> On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov
>>>> <alexei.starovoitov@gmail.com> wrote:
>>>>>
>>>>> On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>>>>
>>>>>> The kernel will panic as follows when attaching fexit to mm_init,
>>>>>>
>>>>>> [   86.549700] ------------[ cut here ]------------
>>>>>> [   86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078
>>>>>> [   86.549713] #PF: supervisor read access in kernel mode
>>>>>> [   86.549715] #PF: error_code(0x0000) - not-present page
>>>>>> [   86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0
>>>>>> [   86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI
>>>>>> [   86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12
>>>>>> [   86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310
>>>>>> [   86.549754] Call Trace:
>>>>>> [   86.549755]  <TASK>
>>>>>> [   86.549757]  check_preempt_curr+0x5e/0x70
>>>>>> [   86.549761]  ttwu_do_activate+0xab/0x350
>>>>>> [   86.549763]  try_to_wake_up+0x314/0x680
>>>>>> [   86.549765]  wake_up_process+0x15/0x20
>>>>>> [   86.549767]  insert_work+0xb2/0xd0
>>>>>> [   86.549772]  __queue_work+0x20a/0x400
>>>>>> [   86.549774]  queue_work_on+0x7b/0x90
>>>>>> [   86.549778]  drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper]
>>>>>> [   86.549801]  drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper]
>>>>>> [   86.549813]  soft_cursor+0x1cb/0x250
>>>>>> [   86.549816]  bit_cursor+0x3ce/0x630
>>>>>> [   86.549818]  fbcon_cursor+0x139/0x1c0
>>>>>> [   86.549821]  ? __pfx_bit_cursor+0x10/0x10
>>>>>> [   86.549822]  hide_cursor+0x31/0xd0
>>>>>> [   86.549825]  vt_console_print+0x477/0x4e0
>>>>>> [   86.549828]  console_flush_all+0x182/0x440
>>>>>> [   86.549832]  console_unlock+0x58/0xf0
>>>>>> [   86.549834]  vprintk_emit+0x1ae/0x200
>>>>>> [   86.549837]  vprintk_default+0x1d/0x30
>>>>>> [   86.549839]  vprintk+0x5c/0x90
>>>>>> [   86.549841]  _printk+0x58/0x80
>>>>>> [   86.549843]  __warn_printk+0x7e/0x1a0
>>>>>> [   86.549845]  ? trace_preempt_off+0x1b/0x70
>>>>>> [   86.549848]  ? trace_preempt_on+0x1b/0x70
>>>>>> [   86.549849]  ? __percpu_counter_init+0x8e/0xb0
>>>>>> [   86.549853]  refcount_warn_saturate+0x9f/0x150
>>>>>> [   86.549855]  mm_init+0x379/0x390
>>>>>> [   86.549859]  bpf_trampoline_6442453440_0+0x23/0x1000
>>>>>> [   86.549862]  mm_init+0x5/0x390
>>>>>> [   86.549865]  ? mm_alloc+0x4e/0x60
>>>>>> [   86.549866]  alloc_bprm+0x8a/0x2e0
>>>>>> [   86.549869]  do_execveat_common.isra.0+0x67/0x240
>>>>>> [   86.549872]  __x64_sys_execve+0x37/0x50
>>>>>> [   86.549874]  do_syscall_64+0x38/0x90
>>>>>> [   86.549877]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>>>>>
>>>>>> The reason is that when we attach the btf id of the function mm_init we
>>>>>> actually attach the mm_init defined in init/main.c rather than the
>>>>>> function defined in kernel/fork.c. That can be proved by parsing
>>>>>> /sys/kernel/btf/vmlinux:
>>>>>>
>>>>>> [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static
>>>>>> [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1
>>>>>>         'buf' type_id=57
>>>>>> [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static
>>>>>> [2496] FUNC 'mm_init' type_id=118 linkage=static
>>>>>> [2497] FUNC 'trap_init' type_id=118 linkage=static
>>>>>> [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static
>>>>>>
>>>>>> From the above information we can find that the FUNCs above and below
>>>>>> mm_init are all defined in init/main.c. So there's no doubt that the
>>>>>> mm_init is also the function defined in init/main.c.
>>>>>>
>>>>>> So when a task calls mm_init and thus the bpf trampoline is triggered it
>>>>>> will use the information of the mm_init defined in init/main.c. Then the
>>>>>> panic will occur.
>>>>>>
>>>>>> It seems that there're issues in btf, for example it is unnecessary to
>>>>>> generate btf for the functions annonated with __init. We need to improve
>>>>>> btf. However we also need to change the function defined in
>>>>>> kernel/fork.c to task_mm_init to better distinguish them. After it is
>>>>>> renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be:
>>>>>>
>>>>>> [13970] FUNC 'mm_alloc' type_id=13969 linkage=static
>>>>>> [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3
>>>>>>         'mm' type_id=204
>>>>>>         'p' type_id=197
>>>>>>         'user_ns' type_id=452
>>>>>> [13972] FUNC 'task_mm_init' type_id=13971 linkage=static
>>>>>> [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static
>>>>>> [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2
>>>>>>         'orig' type_id=197
>>>>>>         'node' type_id=21
>>>>>> [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static
>>>>>>
>>>>>> And then attaching task_mm_init won't panic. Improving the btf will be
>>>>>> handled later.
>>>>>
>>>>> We're not going to hack the kernel to workaround pahole issue.
>>>>> Let's fix pahole instead.
>>>>> cc-ing Alan for ideas.
>>>>
>>>> Any comment on it, Alan ?
>>>> I think we can just skip generating BTF for the functions in
>>>> __section(".init.text"),  as these functions will be freed after
>>>> kernel init. There won't be use cases for them.
>>>>
>>
>> won't the pahole v1.25 changes help here; can you try applying
>>
>> https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@oracle.com/
>>
>> ...and build using pahole; this should eliminate any functions
>> with inconsistent prototypes via
>>
>>       --skip_encoding_btf_inconsistent_proto
>>         Do not encode functions with multiple inconsistent prototypes or
>>         unexpected register use for their parameters, where  the  regis‐
>>         ters used do not match calling conventions.
>>
>>
>> I'll check this at my end too.
>>
>> Alexei, if this works should we look at applying the above
>> again to bpf-next? If so I'll resend the patch.
> 
> I've lost the track with pahole fixes.
> Did Arnaldo re-tag pahole 1.25 or released 1.26 with the fixes?
>

My understanding is the pahole repo is prepped for v1.25,
meaning that it will build with version 1.25 but it is
not officially released yet; Arnaldo will correct me if
I've got that wrong.

> Alan,
> please submit a fresh patch for bpf-next to enable
> --skip_encoding_btf_inconsistent_proto, so it can go through CI.
> I cannot test all combinations manually.
> 

Done; see [1]. If CI picks up the latest version from
the dwarves repo, it will see version 1.25.

I've tested the above specific case along with general testing
using latest pahole.

When running pahole with --verbose and
--skip_encoding_btf_inconsistent_proto

we see:

skipping addition of 'mm_init'(mm_init) due to multiple inconsistent
function prototypes


...and

 bpftool btf dump file vmlinux |grep mm_init

shows the function is not encoded in BTF due to these inconsistencies.

Thanks!

Alan

[1]
https://lore.kernel.org/bpf/20230510130241.1696561-1-alan.maguire@oracle.com/
Yonghong Song May 10, 2023, 1:54 p.m. UTC | #10
On 5/9/23 10:38 PM, Yafang Shao wrote:
> On Wed, May 10, 2023 at 1:18 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>>
>> On Tue, May 9, 2023 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>>>
>>> On Tue, May 2, 2023 at 11:40 AM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>>
>>>> Alan,
>>>>
>>>> wdyt on below?
>>>>
>>>
>>> Hi Alexei,
>>>
>>> Per my understanding, not only does pahole have issues, but also there
>>> are issues in the kernel.
>>> This panic is caused by the inconsistency between BTF and kallsyms as such:
>>>     bpf_check_attach_target
>>>         tname = btf_name_by_offset(btf, t->name_off); // btf
>>>         addr = kallsyms_lookup_name(tname); // kallsyms
>>>
>>> So if the function displayed in /proc/sys/btf/vmlinux is not the same
>>> with the function displayed in /proc/kallsyms, we will get a wrong
>>> addr.  I think it is not proper to rely wholly on the userspace tools
>>> to make them the same. The kernel should also imrpve the verifier to
>>> make sure they are really the same function.  WDYT?
>>
>> Are you saying it's not proper to rely on compilers
>> and linkers to build the kernel?
>> pahole, resolved_btfid, kallsym gen, objtool are part of the
>> compilation process.
>> The bugs in them are discovered from time to time and
>> have to be fixed. Just like compiler and linker bugs.
> 
> I was wondering if it is possible to add BTF_ID into kallsyms or to
> add function address into BTF. Because the function name is not
> unique, while the function ID is unique. So with the function ID we
> can always get what we want.

We just had some discussions during LSFMMBPF led by Jiri
about how to resolve such issues. Yes, adding 'addr' to BTF
is one solution. Also, if this patch set 
(https://lore.kernel.org/lkml/20221205163157.269335-1-nick.alcock@oracle.com/) 
can make it
into the kernel, adding 'file_path' to the BTF should also work.
So ya, two possible solutions here.

> 
> For example,
> 
> $ cat /proc/kallsyms | awk '{if ($2=="t"||$2=="T") {print $3}}' |
> sort|   uniq -c | sort -n -r | less
>       56 __pfx_cleanup_module
>       56 cleanup_module
>       47 __pfx_cpumask_weight.constprop.0
>       47 cpumask_weight.constprop.0
>       21 __pfx_jhash
>       21 __pfx_cpumask_weight
>       21 jhash
>       21 cpumask_weight
>       17 type_show
>       17 __pfx_type_show
>       14 __rhashtable_insert_fast.constprop.0
>       14 __pfx___rhashtable_insert_fast.constprop.0
>       12 __rhashtable_remove_fast_one.cold
>       12 __rhashtable_remove_fast_one
>       12 __pfx___rhashtable_remove_fast_one
>       11 __xfrm_policy_check2.constprop.0
>       11 __pfx___xfrm_policy_check2.constprop.0
>       11 __pfx_modalias_show
>       11 modalias_show
>       10 rht_key_get_hash.isra.0
>       10 __pfx_rht_key_get_hash.isra.0
>       10 __pfx_name_show
>       10 __pfx_init_once
>       10 name_show
>       10 init_once
>        9 __pfx_event_show
>        9 event_show
>        8 __pfx_dst_output
>        8 dst_output
>        7 state_show
>        7 size_show
>        7 __pfx_state_show
>        7 __pfx_size_show
> 
> kallsyms_lookup_name() always returns the first function and ignores
> the others, so it is impossible to trace the other functions with the
> same name AFAIK.
>
Arnaldo Carvalho de Melo May 10, 2023, 3:28 p.m. UTC | #11
Em Wed, May 10, 2023 at 02:08:49PM +0100, Alan Maguire escreveu:
> On 09/05/2023 22:21, Alexei Starovoitov wrote:
> > On Tue, May 9, 2023 at 11:44 AM Alan Maguire <alan.maguire@oracle.com> wrote:
> >> On 02/05/2023 04:40, Alexei Starovoitov wrote:
> >> won't the pahole v1.25 changes help here; can you try applying

> >> https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@oracle.com/

> >> ...and build using pahole; this should eliminate any functions
> >> with inconsistent prototypes via

> >>       --skip_encoding_btf_inconsistent_proto
> >>         Do not encode functions with multiple inconsistent prototypes or
> >>         unexpected register use for their parameters, where  the  regis‐
> >>         ters used do not match calling conventions.

> >> I'll check this at my end too.

> >> Alexei, if this works should we look at applying the above
> >> again to bpf-next? If so I'll resend the patch.

> > I've lost the track with pahole fixes.
> > Did Arnaldo re-tag pahole 1.25 or released 1.26 with the fixes?
 
> My understanding is the pahole repo is prepped for v1.25,
> meaning that it will build with version 1.25 but it is
> not officially released yet; Arnaldo will correct me if
> I've got that wrong.

Right, pahole 1.25 was released with those options/fixes.

- Arnaldo
 
> > Alan,
> > please submit a fresh patch for bpf-next to enable
> > --skip_encoding_btf_inconsistent_proto, so it can go through CI.
> > I cannot test all combinations manually.
> > 
> 
> Done; see [1]. If CI picks up the latest version from
> the dwarves repo, it will see version 1.25.
> 
> I've tested the above specific case along with general testing
> using latest pahole.
> 
> When running pahole with --verbose and
> --skip_encoding_btf_inconsistent_proto
> 
> we see:
> 
> skipping addition of 'mm_init'(mm_init) due to multiple inconsistent
> function prototypes
> 
> 
> ...and
> 
>  bpftool btf dump file vmlinux |grep mm_init
> 
> shows the function is not encoded in BTF due to these inconsistencies.
> 
> Thanks!
> 
> Alan
> 
> [1]
> https://lore.kernel.org/bpf/20230510130241.1696561-1-alan.maguire@oracle.com/
diff mbox series

Patch

diff --git a/kernel/fork.c b/kernel/fork.c
index 0c92f22..af8110d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1116,7 +1116,7 @@  static void mm_init_uprobes_state(struct mm_struct *mm)
 #endif
 }
 
-static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
+static struct mm_struct *task_mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
 	int i;
@@ -1193,7 +1193,7 @@  struct mm_struct *mm_alloc(void)
 		return NULL;
 
 	memset(mm, 0, sizeof(*mm));
-	return mm_init(mm, current, current_user_ns());
+	return task_mm_init(mm, current, current_user_ns());
 }
 
 static inline void __mmput(struct mm_struct *mm)
@@ -1542,7 +1542,7 @@  static struct mm_struct *dup_mm(struct task_struct *tsk,
 
 	memcpy(mm, oldmm, sizeof(*mm));
 
-	if (!mm_init(mm, tsk, mm->user_ns))
+	if (!task_mm_init(mm, tsk, mm->user_ns))
 		goto fail_nomem;
 
 	err = dup_mmap(mm, oldmm);