Message ID | 20230424161104.3737-3-laoar.shao@gmail.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | BPF |
Headers | show |
Series | bpf: Fix issues caused by bpf trampoline | expand |
On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > The kernel will panic as follows when attaching fexit to mm_init, > > [ 86.549700] ------------[ cut here ]------------ > [ 86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078 > [ 86.549713] #PF: supervisor read access in kernel mode > [ 86.549715] #PF: error_code(0x0000) - not-present page > [ 86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0 > [ 86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI > [ 86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12 > [ 86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310 > [ 86.549754] Call Trace: > [ 86.549755] <TASK> > [ 86.549757] check_preempt_curr+0x5e/0x70 > [ 86.549761] ttwu_do_activate+0xab/0x350 > [ 86.549763] try_to_wake_up+0x314/0x680 > [ 86.549765] wake_up_process+0x15/0x20 > [ 86.549767] insert_work+0xb2/0xd0 > [ 86.549772] __queue_work+0x20a/0x400 > [ 86.549774] queue_work_on+0x7b/0x90 > [ 86.549778] drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper] > [ 86.549801] drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper] > [ 86.549813] soft_cursor+0x1cb/0x250 > [ 86.549816] bit_cursor+0x3ce/0x630 > [ 86.549818] fbcon_cursor+0x139/0x1c0 > [ 86.549821] ? __pfx_bit_cursor+0x10/0x10 > [ 86.549822] hide_cursor+0x31/0xd0 > [ 86.549825] vt_console_print+0x477/0x4e0 > [ 86.549828] console_flush_all+0x182/0x440 > [ 86.549832] console_unlock+0x58/0xf0 > [ 86.549834] vprintk_emit+0x1ae/0x200 > [ 86.549837] vprintk_default+0x1d/0x30 > [ 86.549839] vprintk+0x5c/0x90 > [ 86.549841] _printk+0x58/0x80 > [ 86.549843] __warn_printk+0x7e/0x1a0 > [ 86.549845] ? trace_preempt_off+0x1b/0x70 > [ 86.549848] ? trace_preempt_on+0x1b/0x70 > [ 86.549849] ? __percpu_counter_init+0x8e/0xb0 > [ 86.549853] refcount_warn_saturate+0x9f/0x150 > [ 86.549855] mm_init+0x379/0x390 > [ 86.549859] bpf_trampoline_6442453440_0+0x23/0x1000 > [ 86.549862] mm_init+0x5/0x390 > [ 86.549865] ? mm_alloc+0x4e/0x60 > [ 86.549866] alloc_bprm+0x8a/0x2e0 > [ 86.549869] do_execveat_common.isra.0+0x67/0x240 > [ 86.549872] __x64_sys_execve+0x37/0x50 > [ 86.549874] do_syscall_64+0x38/0x90 > [ 86.549877] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > The reason is that when we attach the btf id of the function mm_init we > actually attach the mm_init defined in init/main.c rather than the > function defined in kernel/fork.c. That can be proved by parsing > /sys/kernel/btf/vmlinux: > > [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static > [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1 > 'buf' type_id=57 > [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static > [2496] FUNC 'mm_init' type_id=118 linkage=static > [2497] FUNC 'trap_init' type_id=118 linkage=static > [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static > > From the above information we can find that the FUNCs above and below > mm_init are all defined in init/main.c. So there's no doubt that the > mm_init is also the function defined in init/main.c. > > So when a task calls mm_init and thus the bpf trampoline is triggered it > will use the information of the mm_init defined in init/main.c. Then the > panic will occur. > > It seems that there're issues in btf, for example it is unnecessary to > generate btf for the functions annonated with __init. We need to improve > btf. However we also need to change the function defined in > kernel/fork.c to task_mm_init to better distinguish them. After it is > renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be: > > [13970] FUNC 'mm_alloc' type_id=13969 linkage=static > [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3 > 'mm' type_id=204 > 'p' type_id=197 > 'user_ns' type_id=452 > [13972] FUNC 'task_mm_init' type_id=13971 linkage=static > [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static > [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2 > 'orig' type_id=197 > 'node' type_id=21 > [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static > > And then attaching task_mm_init won't panic. Improving the btf will be > handled later. We're not going to hack the kernel to workaround pahole issue. Let's fix pahole instead. cc-ing Alan for ideas.
On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > The kernel will panic as follows when attaching fexit to mm_init, > > > > [ 86.549700] ------------[ cut here ]------------ > > [ 86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078 > > [ 86.549713] #PF: supervisor read access in kernel mode > > [ 86.549715] #PF: error_code(0x0000) - not-present page > > [ 86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0 > > [ 86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI > > [ 86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12 > > [ 86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310 > > [ 86.549754] Call Trace: > > [ 86.549755] <TASK> > > [ 86.549757] check_preempt_curr+0x5e/0x70 > > [ 86.549761] ttwu_do_activate+0xab/0x350 > > [ 86.549763] try_to_wake_up+0x314/0x680 > > [ 86.549765] wake_up_process+0x15/0x20 > > [ 86.549767] insert_work+0xb2/0xd0 > > [ 86.549772] __queue_work+0x20a/0x400 > > [ 86.549774] queue_work_on+0x7b/0x90 > > [ 86.549778] drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper] > > [ 86.549801] drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper] > > [ 86.549813] soft_cursor+0x1cb/0x250 > > [ 86.549816] bit_cursor+0x3ce/0x630 > > [ 86.549818] fbcon_cursor+0x139/0x1c0 > > [ 86.549821] ? __pfx_bit_cursor+0x10/0x10 > > [ 86.549822] hide_cursor+0x31/0xd0 > > [ 86.549825] vt_console_print+0x477/0x4e0 > > [ 86.549828] console_flush_all+0x182/0x440 > > [ 86.549832] console_unlock+0x58/0xf0 > > [ 86.549834] vprintk_emit+0x1ae/0x200 > > [ 86.549837] vprintk_default+0x1d/0x30 > > [ 86.549839] vprintk+0x5c/0x90 > > [ 86.549841] _printk+0x58/0x80 > > [ 86.549843] __warn_printk+0x7e/0x1a0 > > [ 86.549845] ? trace_preempt_off+0x1b/0x70 > > [ 86.549848] ? trace_preempt_on+0x1b/0x70 > > [ 86.549849] ? __percpu_counter_init+0x8e/0xb0 > > [ 86.549853] refcount_warn_saturate+0x9f/0x150 > > [ 86.549855] mm_init+0x379/0x390 > > [ 86.549859] bpf_trampoline_6442453440_0+0x23/0x1000 > > [ 86.549862] mm_init+0x5/0x390 > > [ 86.549865] ? mm_alloc+0x4e/0x60 > > [ 86.549866] alloc_bprm+0x8a/0x2e0 > > [ 86.549869] do_execveat_common.isra.0+0x67/0x240 > > [ 86.549872] __x64_sys_execve+0x37/0x50 > > [ 86.549874] do_syscall_64+0x38/0x90 > > [ 86.549877] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > > > The reason is that when we attach the btf id of the function mm_init we > > actually attach the mm_init defined in init/main.c rather than the > > function defined in kernel/fork.c. That can be proved by parsing > > /sys/kernel/btf/vmlinux: > > > > [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static > > [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1 > > 'buf' type_id=57 > > [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static > > [2496] FUNC 'mm_init' type_id=118 linkage=static > > [2497] FUNC 'trap_init' type_id=118 linkage=static > > [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static > > > > From the above information we can find that the FUNCs above and below > > mm_init are all defined in init/main.c. So there's no doubt that the > > mm_init is also the function defined in init/main.c. > > > > So when a task calls mm_init and thus the bpf trampoline is triggered it > > will use the information of the mm_init defined in init/main.c. Then the > > panic will occur. > > > > It seems that there're issues in btf, for example it is unnecessary to > > generate btf for the functions annonated with __init. We need to improve > > btf. However we also need to change the function defined in > > kernel/fork.c to task_mm_init to better distinguish them. After it is > > renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be: > > > > [13970] FUNC 'mm_alloc' type_id=13969 linkage=static > > [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3 > > 'mm' type_id=204 > > 'p' type_id=197 > > 'user_ns' type_id=452 > > [13972] FUNC 'task_mm_init' type_id=13971 linkage=static > > [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static > > [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2 > > 'orig' type_id=197 > > 'node' type_id=21 > > [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static > > > > And then attaching task_mm_init won't panic. Improving the btf will be > > handled later. > > We're not going to hack the kernel to workaround pahole issue. > Let's fix pahole instead. > cc-ing Alan for ideas. Any comment on it, Alan ? I think we can just skip generating BTF for the functions in __section(".init.text"), as these functions will be freed after kernel init. There won't be use cases for them.
Alan, wdyt on below? On Thu, Apr 27, 2023 at 4:35 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > The kernel will panic as follows when attaching fexit to mm_init, > > > > > > [ 86.549700] ------------[ cut here ]------------ > > > [ 86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078 > > > [ 86.549713] #PF: supervisor read access in kernel mode > > > [ 86.549715] #PF: error_code(0x0000) - not-present page > > > [ 86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0 > > > [ 86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI > > > [ 86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12 > > > [ 86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310 > > > [ 86.549754] Call Trace: > > > [ 86.549755] <TASK> > > > [ 86.549757] check_preempt_curr+0x5e/0x70 > > > [ 86.549761] ttwu_do_activate+0xab/0x350 > > > [ 86.549763] try_to_wake_up+0x314/0x680 > > > [ 86.549765] wake_up_process+0x15/0x20 > > > [ 86.549767] insert_work+0xb2/0xd0 > > > [ 86.549772] __queue_work+0x20a/0x400 > > > [ 86.549774] queue_work_on+0x7b/0x90 > > > [ 86.549778] drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper] > > > [ 86.549801] drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper] > > > [ 86.549813] soft_cursor+0x1cb/0x250 > > > [ 86.549816] bit_cursor+0x3ce/0x630 > > > [ 86.549818] fbcon_cursor+0x139/0x1c0 > > > [ 86.549821] ? __pfx_bit_cursor+0x10/0x10 > > > [ 86.549822] hide_cursor+0x31/0xd0 > > > [ 86.549825] vt_console_print+0x477/0x4e0 > > > [ 86.549828] console_flush_all+0x182/0x440 > > > [ 86.549832] console_unlock+0x58/0xf0 > > > [ 86.549834] vprintk_emit+0x1ae/0x200 > > > [ 86.549837] vprintk_default+0x1d/0x30 > > > [ 86.549839] vprintk+0x5c/0x90 > > > [ 86.549841] _printk+0x58/0x80 > > > [ 86.549843] __warn_printk+0x7e/0x1a0 > > > [ 86.549845] ? trace_preempt_off+0x1b/0x70 > > > [ 86.549848] ? trace_preempt_on+0x1b/0x70 > > > [ 86.549849] ? __percpu_counter_init+0x8e/0xb0 > > > [ 86.549853] refcount_warn_saturate+0x9f/0x150 > > > [ 86.549855] mm_init+0x379/0x390 > > > [ 86.549859] bpf_trampoline_6442453440_0+0x23/0x1000 > > > [ 86.549862] mm_init+0x5/0x390 > > > [ 86.549865] ? mm_alloc+0x4e/0x60 > > > [ 86.549866] alloc_bprm+0x8a/0x2e0 > > > [ 86.549869] do_execveat_common.isra.0+0x67/0x240 > > > [ 86.549872] __x64_sys_execve+0x37/0x50 > > > [ 86.549874] do_syscall_64+0x38/0x90 > > > [ 86.549877] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > > > > > The reason is that when we attach the btf id of the function mm_init we > > > actually attach the mm_init defined in init/main.c rather than the > > > function defined in kernel/fork.c. That can be proved by parsing > > > /sys/kernel/btf/vmlinux: > > > > > > [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static > > > [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1 > > > 'buf' type_id=57 > > > [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static > > > [2496] FUNC 'mm_init' type_id=118 linkage=static > > > [2497] FUNC 'trap_init' type_id=118 linkage=static > > > [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static > > > > > > From the above information we can find that the FUNCs above and below > > > mm_init are all defined in init/main.c. So there's no doubt that the > > > mm_init is also the function defined in init/main.c. > > > > > > So when a task calls mm_init and thus the bpf trampoline is triggered it > > > will use the information of the mm_init defined in init/main.c. Then the > > > panic will occur. > > > > > > It seems that there're issues in btf, for example it is unnecessary to > > > generate btf for the functions annonated with __init. We need to improve > > > btf. However we also need to change the function defined in > > > kernel/fork.c to task_mm_init to better distinguish them. After it is > > > renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be: > > > > > > [13970] FUNC 'mm_alloc' type_id=13969 linkage=static > > > [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3 > > > 'mm' type_id=204 > > > 'p' type_id=197 > > > 'user_ns' type_id=452 > > > [13972] FUNC 'task_mm_init' type_id=13971 linkage=static > > > [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static > > > [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2 > > > 'orig' type_id=197 > > > 'node' type_id=21 > > > [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static > > > > > > And then attaching task_mm_init won't panic. Improving the btf will be > > > handled later. > > > > We're not going to hack the kernel to workaround pahole issue. > > Let's fix pahole instead. > > cc-ing Alan for ideas. > > Any comment on it, Alan ? > I think we can just skip generating BTF for the functions in > __section(".init.text"), as these functions will be freed after > kernel init. There won't be use cases for them. > > -- > Regards > Yafang
On Tue, May 2, 2023 at 11:40 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > Alan, > > wdyt on below? > Hi Alexei, Per my understanding, not only does pahole have issues, but also there are issues in the kernel. This panic is caused by the inconsistency between BTF and kallsyms as such: bpf_check_attach_target tname = btf_name_by_offset(btf, t->name_off); // btf addr = kallsyms_lookup_name(tname); // kallsyms So if the function displayed in /proc/sys/btf/vmlinux is not the same with the function displayed in /proc/kallsyms, we will get a wrong addr. I think it is not proper to rely wholly on the userspace tools to make them the same. The kernel should also imrpve the verifier to make sure they are really the same function. WDYT? > On Thu, Apr 27, 2023 at 4:35 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > > > > > The kernel will panic as follows when attaching fexit to mm_init, > > > > > > > > [ 86.549700] ------------[ cut here ]------------ > > > > [ 86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078 > > > > [ 86.549713] #PF: supervisor read access in kernel mode > > > > [ 86.549715] #PF: error_code(0x0000) - not-present page > > > > [ 86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0 > > > > [ 86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI > > > > [ 86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12 > > > > [ 86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310 > > > > [ 86.549754] Call Trace: > > > > [ 86.549755] <TASK> > > > > [ 86.549757] check_preempt_curr+0x5e/0x70 > > > > [ 86.549761] ttwu_do_activate+0xab/0x350 > > > > [ 86.549763] try_to_wake_up+0x314/0x680 > > > > [ 86.549765] wake_up_process+0x15/0x20 > > > > [ 86.549767] insert_work+0xb2/0xd0 > > > > [ 86.549772] __queue_work+0x20a/0x400 > > > > [ 86.549774] queue_work_on+0x7b/0x90 > > > > [ 86.549778] drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper] > > > > [ 86.549801] drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper] > > > > [ 86.549813] soft_cursor+0x1cb/0x250 > > > > [ 86.549816] bit_cursor+0x3ce/0x630 > > > > [ 86.549818] fbcon_cursor+0x139/0x1c0 > > > > [ 86.549821] ? __pfx_bit_cursor+0x10/0x10 > > > > [ 86.549822] hide_cursor+0x31/0xd0 > > > > [ 86.549825] vt_console_print+0x477/0x4e0 > > > > [ 86.549828] console_flush_all+0x182/0x440 > > > > [ 86.549832] console_unlock+0x58/0xf0 > > > > [ 86.549834] vprintk_emit+0x1ae/0x200 > > > > [ 86.549837] vprintk_default+0x1d/0x30 > > > > [ 86.549839] vprintk+0x5c/0x90 > > > > [ 86.549841] _printk+0x58/0x80 > > > > [ 86.549843] __warn_printk+0x7e/0x1a0 > > > > [ 86.549845] ? trace_preempt_off+0x1b/0x70 > > > > [ 86.549848] ? trace_preempt_on+0x1b/0x70 > > > > [ 86.549849] ? __percpu_counter_init+0x8e/0xb0 > > > > [ 86.549853] refcount_warn_saturate+0x9f/0x150 > > > > [ 86.549855] mm_init+0x379/0x390 > > > > [ 86.549859] bpf_trampoline_6442453440_0+0x23/0x1000 > > > > [ 86.549862] mm_init+0x5/0x390 > > > > [ 86.549865] ? mm_alloc+0x4e/0x60 > > > > [ 86.549866] alloc_bprm+0x8a/0x2e0 > > > > [ 86.549869] do_execveat_common.isra.0+0x67/0x240 > > > > [ 86.549872] __x64_sys_execve+0x37/0x50 > > > > [ 86.549874] do_syscall_64+0x38/0x90 > > > > [ 86.549877] entry_SYSCALL_64_after_hwframe+0x72/0xdc > > > > > > > > The reason is that when we attach the btf id of the function mm_init we > > > > actually attach the mm_init defined in init/main.c rather than the > > > > function defined in kernel/fork.c. That can be proved by parsing > > > > /sys/kernel/btf/vmlinux: > > > > > > > > [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static > > > > [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1 > > > > 'buf' type_id=57 > > > > [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static > > > > [2496] FUNC 'mm_init' type_id=118 linkage=static > > > > [2497] FUNC 'trap_init' type_id=118 linkage=static > > > > [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static > > > > > > > > From the above information we can find that the FUNCs above and below > > > > mm_init are all defined in init/main.c. So there's no doubt that the > > > > mm_init is also the function defined in init/main.c. > > > > > > > > So when a task calls mm_init and thus the bpf trampoline is triggered it > > > > will use the information of the mm_init defined in init/main.c. Then the > > > > panic will occur. > > > > > > > > It seems that there're issues in btf, for example it is unnecessary to > > > > generate btf for the functions annonated with __init. We need to improve > > > > btf. However we also need to change the function defined in > > > > kernel/fork.c to task_mm_init to better distinguish them. After it is > > > > renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be: > > > > > > > > [13970] FUNC 'mm_alloc' type_id=13969 linkage=static > > > > [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3 > > > > 'mm' type_id=204 > > > > 'p' type_id=197 > > > > 'user_ns' type_id=452 > > > > [13972] FUNC 'task_mm_init' type_id=13971 linkage=static > > > > [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static > > > > [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2 > > > > 'orig' type_id=197 > > > > 'node' type_id=21 > > > > [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static > > > > > > > > And then attaching task_mm_init won't panic. Improving the btf will be > > > > handled later. > > > > > > We're not going to hack the kernel to workaround pahole issue. > > > Let's fix pahole instead. > > > cc-ing Alan for ideas. > > > > Any comment on it, Alan ? > > I think we can just skip generating BTF for the functions in > > __section(".init.text"), as these functions will be freed after > > kernel init. There won't be use cases for them. > > > > -- > > Regards > > Yafang
On Tue, May 9, 2023 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > On Tue, May 2, 2023 at 11:40 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > Alan, > > > > wdyt on below? > > > > Hi Alexei, > > Per my understanding, not only does pahole have issues, but also there > are issues in the kernel. > This panic is caused by the inconsistency between BTF and kallsyms as such: > bpf_check_attach_target > tname = btf_name_by_offset(btf, t->name_off); // btf > addr = kallsyms_lookup_name(tname); // kallsyms > > So if the function displayed in /proc/sys/btf/vmlinux is not the same > with the function displayed in /proc/kallsyms, we will get a wrong > addr. I think it is not proper to rely wholly on the userspace tools > to make them the same. The kernel should also imrpve the verifier to > make sure they are really the same function. WDYT? Are you saying it's not proper to rely on compilers and linkers to build the kernel? pahole, resolved_btfid, kallsym gen, objtool are part of the compilation process. The bugs in them are discovered from time to time and have to be fixed. Just like compiler and linker bugs.
On 02/05/2023 04:40, Alexei Starovoitov wrote: > Alan, > > wdyt on below? > apologies, missed this; see below.. > On Thu, Apr 27, 2023 at 4:35 AM Yafang Shao <laoar.shao@gmail.com> wrote: >> >> On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov >> <alexei.starovoitov@gmail.com> wrote: >>> >>> On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote: >>>> >>>> The kernel will panic as follows when attaching fexit to mm_init, >>>> >>>> [ 86.549700] ------------[ cut here ]------------ >>>> [ 86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078 >>>> [ 86.549713] #PF: supervisor read access in kernel mode >>>> [ 86.549715] #PF: error_code(0x0000) - not-present page >>>> [ 86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0 >>>> [ 86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI >>>> [ 86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12 >>>> [ 86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310 >>>> [ 86.549754] Call Trace: >>>> [ 86.549755] <TASK> >>>> [ 86.549757] check_preempt_curr+0x5e/0x70 >>>> [ 86.549761] ttwu_do_activate+0xab/0x350 >>>> [ 86.549763] try_to_wake_up+0x314/0x680 >>>> [ 86.549765] wake_up_process+0x15/0x20 >>>> [ 86.549767] insert_work+0xb2/0xd0 >>>> [ 86.549772] __queue_work+0x20a/0x400 >>>> [ 86.549774] queue_work_on+0x7b/0x90 >>>> [ 86.549778] drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper] >>>> [ 86.549801] drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper] >>>> [ 86.549813] soft_cursor+0x1cb/0x250 >>>> [ 86.549816] bit_cursor+0x3ce/0x630 >>>> [ 86.549818] fbcon_cursor+0x139/0x1c0 >>>> [ 86.549821] ? __pfx_bit_cursor+0x10/0x10 >>>> [ 86.549822] hide_cursor+0x31/0xd0 >>>> [ 86.549825] vt_console_print+0x477/0x4e0 >>>> [ 86.549828] console_flush_all+0x182/0x440 >>>> [ 86.549832] console_unlock+0x58/0xf0 >>>> [ 86.549834] vprintk_emit+0x1ae/0x200 >>>> [ 86.549837] vprintk_default+0x1d/0x30 >>>> [ 86.549839] vprintk+0x5c/0x90 >>>> [ 86.549841] _printk+0x58/0x80 >>>> [ 86.549843] __warn_printk+0x7e/0x1a0 >>>> [ 86.549845] ? trace_preempt_off+0x1b/0x70 >>>> [ 86.549848] ? trace_preempt_on+0x1b/0x70 >>>> [ 86.549849] ? __percpu_counter_init+0x8e/0xb0 >>>> [ 86.549853] refcount_warn_saturate+0x9f/0x150 >>>> [ 86.549855] mm_init+0x379/0x390 >>>> [ 86.549859] bpf_trampoline_6442453440_0+0x23/0x1000 >>>> [ 86.549862] mm_init+0x5/0x390 >>>> [ 86.549865] ? mm_alloc+0x4e/0x60 >>>> [ 86.549866] alloc_bprm+0x8a/0x2e0 >>>> [ 86.549869] do_execveat_common.isra.0+0x67/0x240 >>>> [ 86.549872] __x64_sys_execve+0x37/0x50 >>>> [ 86.549874] do_syscall_64+0x38/0x90 >>>> [ 86.549877] entry_SYSCALL_64_after_hwframe+0x72/0xdc >>>> >>>> The reason is that when we attach the btf id of the function mm_init we >>>> actually attach the mm_init defined in init/main.c rather than the >>>> function defined in kernel/fork.c. That can be proved by parsing >>>> /sys/kernel/btf/vmlinux: >>>> >>>> [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static >>>> [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1 >>>> 'buf' type_id=57 >>>> [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static >>>> [2496] FUNC 'mm_init' type_id=118 linkage=static >>>> [2497] FUNC 'trap_init' type_id=118 linkage=static >>>> [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static >>>> >>>> From the above information we can find that the FUNCs above and below >>>> mm_init are all defined in init/main.c. So there's no doubt that the >>>> mm_init is also the function defined in init/main.c. >>>> >>>> So when a task calls mm_init and thus the bpf trampoline is triggered it >>>> will use the information of the mm_init defined in init/main.c. Then the >>>> panic will occur. >>>> >>>> It seems that there're issues in btf, for example it is unnecessary to >>>> generate btf for the functions annonated with __init. We need to improve >>>> btf. However we also need to change the function defined in >>>> kernel/fork.c to task_mm_init to better distinguish them. After it is >>>> renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be: >>>> >>>> [13970] FUNC 'mm_alloc' type_id=13969 linkage=static >>>> [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3 >>>> 'mm' type_id=204 >>>> 'p' type_id=197 >>>> 'user_ns' type_id=452 >>>> [13972] FUNC 'task_mm_init' type_id=13971 linkage=static >>>> [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static >>>> [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2 >>>> 'orig' type_id=197 >>>> 'node' type_id=21 >>>> [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static >>>> >>>> And then attaching task_mm_init won't panic. Improving the btf will be >>>> handled later. >>> >>> We're not going to hack the kernel to workaround pahole issue. >>> Let's fix pahole instead. >>> cc-ing Alan for ideas. >> >> Any comment on it, Alan ? >> I think we can just skip generating BTF for the functions in >> __section(".init.text"), as these functions will be freed after >> kernel init. There won't be use cases for them. >> won't the pahole v1.25 changes help here; can you try applying https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@oracle.com/ ...and build using pahole; this should eliminate any functions with inconsistent prototypes via --skip_encoding_btf_inconsistent_proto Do not encode functions with multiple inconsistent prototypes or unexpected register use for their parameters, where the regis‐ ters used do not match calling conventions. I'll check this at my end too. Alexei, if this works should we look at applying the above again to bpf-next? If so I'll resend the patch. Thanks! Alan >> -- >> Regards >> Yafang
On Tue, May 9, 2023 at 11:44 AM Alan Maguire <alan.maguire@oracle.com> wrote: > > On 02/05/2023 04:40, Alexei Starovoitov wrote: > > Alan, > > > > wdyt on below? > > > > apologies, missed this; see below.. > > > On Thu, Apr 27, 2023 at 4:35 AM Yafang Shao <laoar.shao@gmail.com> wrote: > >> > >> On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov > >> <alexei.starovoitov@gmail.com> wrote: > >>> > >>> On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote: > >>>> > >>>> The kernel will panic as follows when attaching fexit to mm_init, > >>>> > >>>> [ 86.549700] ------------[ cut here ]------------ > >>>> [ 86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078 > >>>> [ 86.549713] #PF: supervisor read access in kernel mode > >>>> [ 86.549715] #PF: error_code(0x0000) - not-present page > >>>> [ 86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0 > >>>> [ 86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI > >>>> [ 86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12 > >>>> [ 86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310 > >>>> [ 86.549754] Call Trace: > >>>> [ 86.549755] <TASK> > >>>> [ 86.549757] check_preempt_curr+0x5e/0x70 > >>>> [ 86.549761] ttwu_do_activate+0xab/0x350 > >>>> [ 86.549763] try_to_wake_up+0x314/0x680 > >>>> [ 86.549765] wake_up_process+0x15/0x20 > >>>> [ 86.549767] insert_work+0xb2/0xd0 > >>>> [ 86.549772] __queue_work+0x20a/0x400 > >>>> [ 86.549774] queue_work_on+0x7b/0x90 > >>>> [ 86.549778] drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper] > >>>> [ 86.549801] drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper] > >>>> [ 86.549813] soft_cursor+0x1cb/0x250 > >>>> [ 86.549816] bit_cursor+0x3ce/0x630 > >>>> [ 86.549818] fbcon_cursor+0x139/0x1c0 > >>>> [ 86.549821] ? __pfx_bit_cursor+0x10/0x10 > >>>> [ 86.549822] hide_cursor+0x31/0xd0 > >>>> [ 86.549825] vt_console_print+0x477/0x4e0 > >>>> [ 86.549828] console_flush_all+0x182/0x440 > >>>> [ 86.549832] console_unlock+0x58/0xf0 > >>>> [ 86.549834] vprintk_emit+0x1ae/0x200 > >>>> [ 86.549837] vprintk_default+0x1d/0x30 > >>>> [ 86.549839] vprintk+0x5c/0x90 > >>>> [ 86.549841] _printk+0x58/0x80 > >>>> [ 86.549843] __warn_printk+0x7e/0x1a0 > >>>> [ 86.549845] ? trace_preempt_off+0x1b/0x70 > >>>> [ 86.549848] ? trace_preempt_on+0x1b/0x70 > >>>> [ 86.549849] ? __percpu_counter_init+0x8e/0xb0 > >>>> [ 86.549853] refcount_warn_saturate+0x9f/0x150 > >>>> [ 86.549855] mm_init+0x379/0x390 > >>>> [ 86.549859] bpf_trampoline_6442453440_0+0x23/0x1000 > >>>> [ 86.549862] mm_init+0x5/0x390 > >>>> [ 86.549865] ? mm_alloc+0x4e/0x60 > >>>> [ 86.549866] alloc_bprm+0x8a/0x2e0 > >>>> [ 86.549869] do_execveat_common.isra.0+0x67/0x240 > >>>> [ 86.549872] __x64_sys_execve+0x37/0x50 > >>>> [ 86.549874] do_syscall_64+0x38/0x90 > >>>> [ 86.549877] entry_SYSCALL_64_after_hwframe+0x72/0xdc > >>>> > >>>> The reason is that when we attach the btf id of the function mm_init we > >>>> actually attach the mm_init defined in init/main.c rather than the > >>>> function defined in kernel/fork.c. That can be proved by parsing > >>>> /sys/kernel/btf/vmlinux: > >>>> > >>>> [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static > >>>> [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1 > >>>> 'buf' type_id=57 > >>>> [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static > >>>> [2496] FUNC 'mm_init' type_id=118 linkage=static > >>>> [2497] FUNC 'trap_init' type_id=118 linkage=static > >>>> [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static > >>>> > >>>> From the above information we can find that the FUNCs above and below > >>>> mm_init are all defined in init/main.c. So there's no doubt that the > >>>> mm_init is also the function defined in init/main.c. > >>>> > >>>> So when a task calls mm_init and thus the bpf trampoline is triggered it > >>>> will use the information of the mm_init defined in init/main.c. Then the > >>>> panic will occur. > >>>> > >>>> It seems that there're issues in btf, for example it is unnecessary to > >>>> generate btf for the functions annonated with __init. We need to improve > >>>> btf. However we also need to change the function defined in > >>>> kernel/fork.c to task_mm_init to better distinguish them. After it is > >>>> renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be: > >>>> > >>>> [13970] FUNC 'mm_alloc' type_id=13969 linkage=static > >>>> [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3 > >>>> 'mm' type_id=204 > >>>> 'p' type_id=197 > >>>> 'user_ns' type_id=452 > >>>> [13972] FUNC 'task_mm_init' type_id=13971 linkage=static > >>>> [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static > >>>> [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2 > >>>> 'orig' type_id=197 > >>>> 'node' type_id=21 > >>>> [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static > >>>> > >>>> And then attaching task_mm_init won't panic. Improving the btf will be > >>>> handled later. > >>> > >>> We're not going to hack the kernel to workaround pahole issue. > >>> Let's fix pahole instead. > >>> cc-ing Alan for ideas. > >> > >> Any comment on it, Alan ? > >> I think we can just skip generating BTF for the functions in > >> __section(".init.text"), as these functions will be freed after > >> kernel init. There won't be use cases for them. > >> > > won't the pahole v1.25 changes help here; can you try applying > > https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@oracle.com/ > > ...and build using pahole; this should eliminate any functions > with inconsistent prototypes via > > --skip_encoding_btf_inconsistent_proto > Do not encode functions with multiple inconsistent prototypes or > unexpected register use for their parameters, where the regis‐ > ters used do not match calling conventions. > > > I'll check this at my end too. > > Alexei, if this works should we look at applying the above > again to bpf-next? If so I'll resend the patch. I've lost the track with pahole fixes. Did Arnaldo re-tag pahole 1.25 or released 1.26 with the fixes? Alan, please submit a fresh patch for bpf-next to enable --skip_encoding_btf_inconsistent_proto, so it can go through CI. I cannot test all combinations manually. Thanks!
On Wed, May 10, 2023 at 1:18 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Tue, May 9, 2023 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote: > > > > On Tue, May 2, 2023 at 11:40 AM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > Alan, > > > > > > wdyt on below? > > > > > > > Hi Alexei, > > > > Per my understanding, not only does pahole have issues, but also there > > are issues in the kernel. > > This panic is caused by the inconsistency between BTF and kallsyms as such: > > bpf_check_attach_target > > tname = btf_name_by_offset(btf, t->name_off); // btf > > addr = kallsyms_lookup_name(tname); // kallsyms > > > > So if the function displayed in /proc/sys/btf/vmlinux is not the same > > with the function displayed in /proc/kallsyms, we will get a wrong > > addr. I think it is not proper to rely wholly on the userspace tools > > to make them the same. The kernel should also imrpve the verifier to > > make sure they are really the same function. WDYT? > > Are you saying it's not proper to rely on compilers > and linkers to build the kernel? > pahole, resolved_btfid, kallsym gen, objtool are part of the > compilation process. > The bugs in them are discovered from time to time and > have to be fixed. Just like compiler and linker bugs. I was wondering if it is possible to add BTF_ID into kallsyms or to add function address into BTF. Because the function name is not unique, while the function ID is unique. So with the function ID we can always get what we want. For example, $ cat /proc/kallsyms | awk '{if ($2=="t"||$2=="T") {print $3}}' | sort| uniq -c | sort -n -r | less 56 __pfx_cleanup_module 56 cleanup_module 47 __pfx_cpumask_weight.constprop.0 47 cpumask_weight.constprop.0 21 __pfx_jhash 21 __pfx_cpumask_weight 21 jhash 21 cpumask_weight 17 type_show 17 __pfx_type_show 14 __rhashtable_insert_fast.constprop.0 14 __pfx___rhashtable_insert_fast.constprop.0 12 __rhashtable_remove_fast_one.cold 12 __rhashtable_remove_fast_one 12 __pfx___rhashtable_remove_fast_one 11 __xfrm_policy_check2.constprop.0 11 __pfx___xfrm_policy_check2.constprop.0 11 __pfx_modalias_show 11 modalias_show 10 rht_key_get_hash.isra.0 10 __pfx_rht_key_get_hash.isra.0 10 __pfx_name_show 10 __pfx_init_once 10 name_show 10 init_once 9 __pfx_event_show 9 event_show 8 __pfx_dst_output 8 dst_output 7 state_show 7 size_show 7 __pfx_state_show 7 __pfx_size_show kallsyms_lookup_name() always returns the first function and ignores the others, so it is impossible to trace the other functions with the same name AFAIK.
On 09/05/2023 22:21, Alexei Starovoitov wrote: > On Tue, May 9, 2023 at 11:44 AM Alan Maguire <alan.maguire@oracle.com> wrote: >> >> On 02/05/2023 04:40, Alexei Starovoitov wrote: >>> Alan, >>> >>> wdyt on below? >>> >> >> apologies, missed this; see below.. >> >>> On Thu, Apr 27, 2023 at 4:35 AM Yafang Shao <laoar.shao@gmail.com> wrote: >>>> >>>> On Tue, Apr 25, 2023 at 5:13 AM Alexei Starovoitov >>>> <alexei.starovoitov@gmail.com> wrote: >>>>> >>>>> On Mon, Apr 24, 2023 at 9:12 AM Yafang Shao <laoar.shao@gmail.com> wrote: >>>>>> >>>>>> The kernel will panic as follows when attaching fexit to mm_init, >>>>>> >>>>>> [ 86.549700] ------------[ cut here ]------------ >>>>>> [ 86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078 >>>>>> [ 86.549713] #PF: supervisor read access in kernel mode >>>>>> [ 86.549715] #PF: error_code(0x0000) - not-present page >>>>>> [ 86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0 >>>>>> [ 86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI >>>>>> [ 86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12 >>>>>> [ 86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310 >>>>>> [ 86.549754] Call Trace: >>>>>> [ 86.549755] <TASK> >>>>>> [ 86.549757] check_preempt_curr+0x5e/0x70 >>>>>> [ 86.549761] ttwu_do_activate+0xab/0x350 >>>>>> [ 86.549763] try_to_wake_up+0x314/0x680 >>>>>> [ 86.549765] wake_up_process+0x15/0x20 >>>>>> [ 86.549767] insert_work+0xb2/0xd0 >>>>>> [ 86.549772] __queue_work+0x20a/0x400 >>>>>> [ 86.549774] queue_work_on+0x7b/0x90 >>>>>> [ 86.549778] drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper] >>>>>> [ 86.549801] drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper] >>>>>> [ 86.549813] soft_cursor+0x1cb/0x250 >>>>>> [ 86.549816] bit_cursor+0x3ce/0x630 >>>>>> [ 86.549818] fbcon_cursor+0x139/0x1c0 >>>>>> [ 86.549821] ? __pfx_bit_cursor+0x10/0x10 >>>>>> [ 86.549822] hide_cursor+0x31/0xd0 >>>>>> [ 86.549825] vt_console_print+0x477/0x4e0 >>>>>> [ 86.549828] console_flush_all+0x182/0x440 >>>>>> [ 86.549832] console_unlock+0x58/0xf0 >>>>>> [ 86.549834] vprintk_emit+0x1ae/0x200 >>>>>> [ 86.549837] vprintk_default+0x1d/0x30 >>>>>> [ 86.549839] vprintk+0x5c/0x90 >>>>>> [ 86.549841] _printk+0x58/0x80 >>>>>> [ 86.549843] __warn_printk+0x7e/0x1a0 >>>>>> [ 86.549845] ? trace_preempt_off+0x1b/0x70 >>>>>> [ 86.549848] ? trace_preempt_on+0x1b/0x70 >>>>>> [ 86.549849] ? __percpu_counter_init+0x8e/0xb0 >>>>>> [ 86.549853] refcount_warn_saturate+0x9f/0x150 >>>>>> [ 86.549855] mm_init+0x379/0x390 >>>>>> [ 86.549859] bpf_trampoline_6442453440_0+0x23/0x1000 >>>>>> [ 86.549862] mm_init+0x5/0x390 >>>>>> [ 86.549865] ? mm_alloc+0x4e/0x60 >>>>>> [ 86.549866] alloc_bprm+0x8a/0x2e0 >>>>>> [ 86.549869] do_execveat_common.isra.0+0x67/0x240 >>>>>> [ 86.549872] __x64_sys_execve+0x37/0x50 >>>>>> [ 86.549874] do_syscall_64+0x38/0x90 >>>>>> [ 86.549877] entry_SYSCALL_64_after_hwframe+0x72/0xdc >>>>>> >>>>>> The reason is that when we attach the btf id of the function mm_init we >>>>>> actually attach the mm_init defined in init/main.c rather than the >>>>>> function defined in kernel/fork.c. That can be proved by parsing >>>>>> /sys/kernel/btf/vmlinux: >>>>>> >>>>>> [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static >>>>>> [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1 >>>>>> 'buf' type_id=57 >>>>>> [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static >>>>>> [2496] FUNC 'mm_init' type_id=118 linkage=static >>>>>> [2497] FUNC 'trap_init' type_id=118 linkage=static >>>>>> [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static >>>>>> >>>>>> From the above information we can find that the FUNCs above and below >>>>>> mm_init are all defined in init/main.c. So there's no doubt that the >>>>>> mm_init is also the function defined in init/main.c. >>>>>> >>>>>> So when a task calls mm_init and thus the bpf trampoline is triggered it >>>>>> will use the information of the mm_init defined in init/main.c. Then the >>>>>> panic will occur. >>>>>> >>>>>> It seems that there're issues in btf, for example it is unnecessary to >>>>>> generate btf for the functions annonated with __init. We need to improve >>>>>> btf. However we also need to change the function defined in >>>>>> kernel/fork.c to task_mm_init to better distinguish them. After it is >>>>>> renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be: >>>>>> >>>>>> [13970] FUNC 'mm_alloc' type_id=13969 linkage=static >>>>>> [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3 >>>>>> 'mm' type_id=204 >>>>>> 'p' type_id=197 >>>>>> 'user_ns' type_id=452 >>>>>> [13972] FUNC 'task_mm_init' type_id=13971 linkage=static >>>>>> [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static >>>>>> [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2 >>>>>> 'orig' type_id=197 >>>>>> 'node' type_id=21 >>>>>> [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static >>>>>> >>>>>> And then attaching task_mm_init won't panic. Improving the btf will be >>>>>> handled later. >>>>> >>>>> We're not going to hack the kernel to workaround pahole issue. >>>>> Let's fix pahole instead. >>>>> cc-ing Alan for ideas. >>>> >>>> Any comment on it, Alan ? >>>> I think we can just skip generating BTF for the functions in >>>> __section(".init.text"), as these functions will be freed after >>>> kernel init. There won't be use cases for them. >>>> >> >> won't the pahole v1.25 changes help here; can you try applying >> >> https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@oracle.com/ >> >> ...and build using pahole; this should eliminate any functions >> with inconsistent prototypes via >> >> --skip_encoding_btf_inconsistent_proto >> Do not encode functions with multiple inconsistent prototypes or >> unexpected register use for their parameters, where the regis‐ >> ters used do not match calling conventions. >> >> >> I'll check this at my end too. >> >> Alexei, if this works should we look at applying the above >> again to bpf-next? If so I'll resend the patch. > > I've lost the track with pahole fixes. > Did Arnaldo re-tag pahole 1.25 or released 1.26 with the fixes? > My understanding is the pahole repo is prepped for v1.25, meaning that it will build with version 1.25 but it is not officially released yet; Arnaldo will correct me if I've got that wrong. > Alan, > please submit a fresh patch for bpf-next to enable > --skip_encoding_btf_inconsistent_proto, so it can go through CI. > I cannot test all combinations manually. > Done; see [1]. If CI picks up the latest version from the dwarves repo, it will see version 1.25. I've tested the above specific case along with general testing using latest pahole. When running pahole with --verbose and --skip_encoding_btf_inconsistent_proto we see: skipping addition of 'mm_init'(mm_init) due to multiple inconsistent function prototypes ...and bpftool btf dump file vmlinux |grep mm_init shows the function is not encoded in BTF due to these inconsistencies. Thanks! Alan [1] https://lore.kernel.org/bpf/20230510130241.1696561-1-alan.maguire@oracle.com/
On 5/9/23 10:38 PM, Yafang Shao wrote: > On Wed, May 10, 2023 at 1:18 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> >> On Tue, May 9, 2023 at 8:36 AM Yafang Shao <laoar.shao@gmail.com> wrote: >>> >>> On Tue, May 2, 2023 at 11:40 AM Alexei Starovoitov >>> <alexei.starovoitov@gmail.com> wrote: >>>> >>>> Alan, >>>> >>>> wdyt on below? >>>> >>> >>> Hi Alexei, >>> >>> Per my understanding, not only does pahole have issues, but also there >>> are issues in the kernel. >>> This panic is caused by the inconsistency between BTF and kallsyms as such: >>> bpf_check_attach_target >>> tname = btf_name_by_offset(btf, t->name_off); // btf >>> addr = kallsyms_lookup_name(tname); // kallsyms >>> >>> So if the function displayed in /proc/sys/btf/vmlinux is not the same >>> with the function displayed in /proc/kallsyms, we will get a wrong >>> addr. I think it is not proper to rely wholly on the userspace tools >>> to make them the same. The kernel should also imrpve the verifier to >>> make sure they are really the same function. WDYT? >> >> Are you saying it's not proper to rely on compilers >> and linkers to build the kernel? >> pahole, resolved_btfid, kallsym gen, objtool are part of the >> compilation process. >> The bugs in them are discovered from time to time and >> have to be fixed. Just like compiler and linker bugs. > > I was wondering if it is possible to add BTF_ID into kallsyms or to > add function address into BTF. Because the function name is not > unique, while the function ID is unique. So with the function ID we > can always get what we want. We just had some discussions during LSFMMBPF led by Jiri about how to resolve such issues. Yes, adding 'addr' to BTF is one solution. Also, if this patch set (https://lore.kernel.org/lkml/20221205163157.269335-1-nick.alcock@oracle.com/) can make it into the kernel, adding 'file_path' to the BTF should also work. So ya, two possible solutions here. > > For example, > > $ cat /proc/kallsyms | awk '{if ($2=="t"||$2=="T") {print $3}}' | > sort| uniq -c | sort -n -r | less > 56 __pfx_cleanup_module > 56 cleanup_module > 47 __pfx_cpumask_weight.constprop.0 > 47 cpumask_weight.constprop.0 > 21 __pfx_jhash > 21 __pfx_cpumask_weight > 21 jhash > 21 cpumask_weight > 17 type_show > 17 __pfx_type_show > 14 __rhashtable_insert_fast.constprop.0 > 14 __pfx___rhashtable_insert_fast.constprop.0 > 12 __rhashtable_remove_fast_one.cold > 12 __rhashtable_remove_fast_one > 12 __pfx___rhashtable_remove_fast_one > 11 __xfrm_policy_check2.constprop.0 > 11 __pfx___xfrm_policy_check2.constprop.0 > 11 __pfx_modalias_show > 11 modalias_show > 10 rht_key_get_hash.isra.0 > 10 __pfx_rht_key_get_hash.isra.0 > 10 __pfx_name_show > 10 __pfx_init_once > 10 name_show > 10 init_once > 9 __pfx_event_show > 9 event_show > 8 __pfx_dst_output > 8 dst_output > 7 state_show > 7 size_show > 7 __pfx_state_show > 7 __pfx_size_show > > kallsyms_lookup_name() always returns the first function and ignores > the others, so it is impossible to trace the other functions with the > same name AFAIK. >
Em Wed, May 10, 2023 at 02:08:49PM +0100, Alan Maguire escreveu: > On 09/05/2023 22:21, Alexei Starovoitov wrote: > > On Tue, May 9, 2023 at 11:44 AM Alan Maguire <alan.maguire@oracle.com> wrote: > >> On 02/05/2023 04:40, Alexei Starovoitov wrote: > >> won't the pahole v1.25 changes help here; can you try applying > >> https://lore.kernel.org/bpf/1675949331-27935-1-git-send-email-alan.maguire@oracle.com/ > >> ...and build using pahole; this should eliminate any functions > >> with inconsistent prototypes via > >> --skip_encoding_btf_inconsistent_proto > >> Do not encode functions with multiple inconsistent prototypes or > >> unexpected register use for their parameters, where the regis‐ > >> ters used do not match calling conventions. > >> I'll check this at my end too. > >> Alexei, if this works should we look at applying the above > >> again to bpf-next? If so I'll resend the patch. > > I've lost the track with pahole fixes. > > Did Arnaldo re-tag pahole 1.25 or released 1.26 with the fixes? > My understanding is the pahole repo is prepped for v1.25, > meaning that it will build with version 1.25 but it is > not officially released yet; Arnaldo will correct me if > I've got that wrong. Right, pahole 1.25 was released with those options/fixes. - Arnaldo > > Alan, > > please submit a fresh patch for bpf-next to enable > > --skip_encoding_btf_inconsistent_proto, so it can go through CI. > > I cannot test all combinations manually. > > > > Done; see [1]. If CI picks up the latest version from > the dwarves repo, it will see version 1.25. > > I've tested the above specific case along with general testing > using latest pahole. > > When running pahole with --verbose and > --skip_encoding_btf_inconsistent_proto > > we see: > > skipping addition of 'mm_init'(mm_init) due to multiple inconsistent > function prototypes > > > ...and > > bpftool btf dump file vmlinux |grep mm_init > > shows the function is not encoded in BTF due to these inconsistencies. > > Thanks! > > Alan > > [1] > https://lore.kernel.org/bpf/20230510130241.1696561-1-alan.maguire@oracle.com/
diff --git a/kernel/fork.c b/kernel/fork.c index 0c92f22..af8110d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1116,7 +1116,7 @@ static void mm_init_uprobes_state(struct mm_struct *mm) #endif } -static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, +static struct mm_struct *task_mm_init(struct mm_struct *mm, struct task_struct *p, struct user_namespace *user_ns) { int i; @@ -1193,7 +1193,7 @@ struct mm_struct *mm_alloc(void) return NULL; memset(mm, 0, sizeof(*mm)); - return mm_init(mm, current, current_user_ns()); + return task_mm_init(mm, current, current_user_ns()); } static inline void __mmput(struct mm_struct *mm) @@ -1542,7 +1542,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk, memcpy(mm, oldmm, sizeof(*mm)); - if (!mm_init(mm, tsk, mm->user_ns)) + if (!task_mm_init(mm, tsk, mm->user_ns)) goto fail_nomem; err = dup_mmap(mm, oldmm);
The kernel will panic as follows when attaching fexit to mm_init, [ 86.549700] ------------[ cut here ]------------ [ 86.549712] BUG: kernel NULL pointer dereference, address: 0000000000000078 [ 86.549713] #PF: supervisor read access in kernel mode [ 86.549715] #PF: error_code(0x0000) - not-present page [ 86.549716] PGD 10308f067 P4D 10308f067 PUD 11754e067 PMD 0 [ 86.549719] Oops: 0000 [#1] PREEMPT SMP NOPTI [ 86.549722] CPU: 9 PID: 9829 Comm: main_amd64 Kdump: loaded Not tainted 6.3.0-rc6+ #12 [ 86.549725] RIP: 0010:check_preempt_wakeup+0xd1/0x310 [ 86.549754] Call Trace: [ 86.549755] <TASK> [ 86.549757] check_preempt_curr+0x5e/0x70 [ 86.549761] ttwu_do_activate+0xab/0x350 [ 86.549763] try_to_wake_up+0x314/0x680 [ 86.549765] wake_up_process+0x15/0x20 [ 86.549767] insert_work+0xb2/0xd0 [ 86.549772] __queue_work+0x20a/0x400 [ 86.549774] queue_work_on+0x7b/0x90 [ 86.549778] drm_fb_helper_sys_imageblit+0xd7/0xf0 [drm_kms_helper] [ 86.549801] drm_fbdev_fb_imageblit+0x5b/0xb0 [drm_kms_helper] [ 86.549813] soft_cursor+0x1cb/0x250 [ 86.549816] bit_cursor+0x3ce/0x630 [ 86.549818] fbcon_cursor+0x139/0x1c0 [ 86.549821] ? __pfx_bit_cursor+0x10/0x10 [ 86.549822] hide_cursor+0x31/0xd0 [ 86.549825] vt_console_print+0x477/0x4e0 [ 86.549828] console_flush_all+0x182/0x440 [ 86.549832] console_unlock+0x58/0xf0 [ 86.549834] vprintk_emit+0x1ae/0x200 [ 86.549837] vprintk_default+0x1d/0x30 [ 86.549839] vprintk+0x5c/0x90 [ 86.549841] _printk+0x58/0x80 [ 86.549843] __warn_printk+0x7e/0x1a0 [ 86.549845] ? trace_preempt_off+0x1b/0x70 [ 86.549848] ? trace_preempt_on+0x1b/0x70 [ 86.549849] ? __percpu_counter_init+0x8e/0xb0 [ 86.549853] refcount_warn_saturate+0x9f/0x150 [ 86.549855] mm_init+0x379/0x390 [ 86.549859] bpf_trampoline_6442453440_0+0x23/0x1000 [ 86.549862] mm_init+0x5/0x390 [ 86.549865] ? mm_alloc+0x4e/0x60 [ 86.549866] alloc_bprm+0x8a/0x2e0 [ 86.549869] do_execveat_common.isra.0+0x67/0x240 [ 86.549872] __x64_sys_execve+0x37/0x50 [ 86.549874] do_syscall_64+0x38/0x90 [ 86.549877] entry_SYSCALL_64_after_hwframe+0x72/0xdc The reason is that when we attach the btf id of the function mm_init we actually attach the mm_init defined in init/main.c rather than the function defined in kernel/fork.c. That can be proved by parsing /sys/kernel/btf/vmlinux: [2493] FUNC 'initcall_blacklist' type_id=2477 linkage=static [2494] FUNC_PROTO '(anon)' ret_type_id=21 vlen=1 'buf' type_id=57 [2495] FUNC 'early_randomize_kstack_offset' type_id=2494 linkage=static [2496] FUNC 'mm_init' type_id=118 linkage=static [2497] FUNC 'trap_init' type_id=118 linkage=static [2498] FUNC 'thread_stack_cache_init' type_id=118 linkage=static From the above information we can find that the FUNCs above and below mm_init are all defined in init/main.c. So there's no doubt that the mm_init is also the function defined in init/main.c. So when a task calls mm_init and thus the bpf trampoline is triggered it will use the information of the mm_init defined in init/main.c. Then the panic will occur. It seems that there're issues in btf, for example it is unnecessary to generate btf for the functions annonated with __init. We need to improve btf. However we also need to change the function defined in kernel/fork.c to task_mm_init to better distinguish them. After it is renamed to task_mm_init, the /sys/kernel/btf/vmlinux will be: [13970] FUNC 'mm_alloc' type_id=13969 linkage=static [13971] FUNC_PROTO '(anon)' ret_type_id=204 vlen=3 'mm' type_id=204 'p' type_id=197 'user_ns' type_id=452 [13972] FUNC 'task_mm_init' type_id=13971 linkage=static [13973] FUNC 'coredump_filter_setup' type_id=3804 linkage=static [13974] FUNC_PROTO '(anon)' ret_type_id=197 vlen=2 'orig' type_id=197 'node' type_id=21 [13975] FUNC 'dup_task_struct' type_id=13974 linkage=static And then attaching task_mm_init won't panic. Improving the btf will be handled later. This issue can be reproduced with a simple bpf program as such: SEC("fexit/mm_init") int fentry_run() { return 0; } Signed-off-by: Yafang Shao <laoar.shao@gmail.com> --- kernel/fork.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)