Message ID | 20231130134204.136058029@infradead.org (mailing list archive)
State      | Handled Elsewhere
Series     | x86/bpf: Fix FineIBT vs eBPF
On Thu, Nov 30, 2023 at 5:43 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > void bpf_prog_kallsyms_del(struct bpf_prog *fp) > @@ -691,6 +708,9 @@ void bpf_prog_kallsyms_del(struct bpf_pr > return; > > bpf_ksym_del(&fp->aux->ksym); > +#ifdef CONFIG_FINEIBT > + bpf_ksym_del(&fp->aux->ksym_prefix); > +#endif > } Thank you for addressing all comments, but it panics during boot with: [ 3.109474] RIP: 0010:bpf_prog_kallsyms_del+0x10f/0x140 [ 3.109867] Code: 26 e0 00 ff 05 32 dd dd 01 48 8d bb 80 03 00 00 48 c7 c6 b8 b3 00 83 e8 ef 25 e0 00 48 8b 83 58 03 00 00 48 8b 8b 60 03 00 00 <48> 89 48 08 48 89 01 4c 89 b3 60 03 00 00 48 c7 c7 10 0b 7b 83 5b [ 3.111282] RSP: 0000:ffffc90000013e08 EFLAGS: 00010246 [ 3.116968] Call Trace: [ 3.117163] <TASK> [ 3.117328] ? __die_body+0x68/0xb0 [ 3.117599] ? page_fault_oops+0x317/0x390 [ 3.117909] ? debug_objects_fill_pool+0x19/0x440 [ 3.118283] ? debug_objects_fill_pool+0x19/0x440 [ 3.118715] ? do_user_addr_fault+0x4cd/0x560 [ 3.119045] ? exc_page_fault+0x62/0x1c0 [ 3.119350] ? asm_exc_page_fault+0x26/0x30 [ 3.119675] ? bpf_prog_kallsyms_del+0x10f/0x140 [ 3.120023] ? bpf_prog_kallsyms_del+0x101/0x140 [ 3.120381] __bpf_prog_put_noref+0x12/0xf0 [ 3.120704] bpf_prog_put_deferred+0xe9/0x110 [ 3.121035] bpf_prog_put+0xbb/0xd0 [ 3.121307] bpf_prog_release+0x15/0x20 Adding the following: diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 5c84a935ba63..5013fd53adfd 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -709,6 +709,8 @@ void bpf_prog_kallsyms_del(struct bpf_prog *fp) bpf_ksym_del(&fp->aux->ksym); #ifdef CONFIG_FINEIBT + if (cfi_mode != CFI_FINEIBT) + return; bpf_ksym_del(&fp->aux->ksym_prefix); #endif } fixes the boot issue, but test_progs is not happy. Just running test_progs it splats right away: [ 74.047757] kmemleak: Found object by alias at 0xffffffffa0001d80 [ 74.048272] CPU: 14 PID: 104 Comm: kworker/14:0 Tainted: G W O 6.7.0-rc3-00702-g41c30fec304d-dirty #5241 [ 74.049118] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 74.050042] Workqueue: events bpf_prog_free_deferred [ 74.050448] Call Trace: [ 74.050663] <TASK> [ 74.050841] dump_stack_lvl+0x55/0x80 [ 74.051141] __find_and_remove_object+0xdb/0x110 [ 74.051521] kmemleak_free+0x41/0x70 [ 74.051828] vfree+0x36/0x130 [ 74.052076] ? process_scheduled_works+0x1d7/0x520 [ 74.052463] bpf_prog_pack_free+0x42/0x1a0 [ 74.052803] ? process_scheduled_works+0x1d7/0x520 [ 74.053194] bpf_jit_binary_pack_free+0x17/0x30 [ 74.053565] bpf_jit_free+0x57/0x90 [ 74.053856] process_scheduled_works+0x250/0x520 [ 74.054234] worker_thread+0x26f/0x400 [ 74.054542] ? __cfi_worker_thread+0x10/0x10 [ 74.054909] kthread+0x113/0x130 [ 74.055178] ? __cfi_kthread+0x10/0x10 [ 74.055487] ret_from_fork+0x48/0x60 [ 74.055793] ? 
__cfi_kthread+0x10/0x10 [ 74.056102] ret_from_fork_asm+0x11/0x30 [ 74.056427] </TASK> [ 74.056616] kmemleak: Object 0xffffffffa0000000 (size 2097152): [ 74.057089] kmemleak: comm "swapper/0", pid 1, jiffies 4294667572 [ 74.057594] kmemleak: min_count = 2 [ 74.057892] kmemleak: count = 2 [ 74.058164] kmemleak: flags = 0x1 [ 74.058448] kmemleak: checksum = 0 [ 74.058746] kmemleak: backtrace: [ 74.059025] kmemleak_vmalloc+0x2d/0xd0 [ 74.059338] __vmalloc_node_range+0x7e0/0x810 [ 74.059726] module_alloc+0x5f/0x70 [ 74.060015] bpf_prog_pack_alloc+0x167/0x260 [ 74.060374] bpf_jit_binary_pack_alloc+0xca/0x1e0 [ 74.060760] bpf_int_jit_compile+0x3c5d/0x4140 [ 74.061120] bpf_prog_select_runtime+0x239/0x320 [ 74.061496] bpf_prepare_filter+0x49d/0x4c0 [ 74.061844] bpf_prog_create+0x80/0xc0 [ 74.062149] ptp_classifier_init+0x29/0x40 [ 74.062480] sock_init+0x9c/0xb0 [ 74.062753] do_one_initcall+0xdd/0x2f0 [ 74.063067] do_initcall_level+0x98/0x105 [ 74.063394] do_initcalls+0x43/0x80 [ 74.063687] kernel_init_freeable+0x15f/0x1d0 [ 74.064039] kernel_init+0x1a/0x1b0 [ 74.064993] Trying to vfree() bad address (000000001f212011) [ 74.065625] WARNING: CPU: 14 PID: 104 at mm/vmalloc.c:2692 remove_vm_area+0x141/0x150 [ 74.089515] Trying to vfree() nonexistent vm area (000000001f212011) [ 74.090234] WARNING: CPU: 14 PID: 104 at mm/vmalloc.c:2827 vfree+0xfe/0x130 [ 74.129930] Trying to vfree() bad address (000000009ed2080e) [ 74.130408] WARNING: CPU: 14 PID: 149 at mm/vmalloc.c:2692 remove_vm_area+0x141/0x150 and eventually panics with: [ 74.195676] BUG: unable to handle page fault for address: ffffffffa00020c0 [ 74.196541] #PF: supervisor read access in kernel mode [ 74.197548] #PF: error_code(0x0000) - not-present page [ 74.201441] PGD 3058067 P4D 3058067 PUD 3059063 PMD 101d69067 PTE 0 [ 74.202162] Oops: 0000 [#1] PREEMPT SMP PTI [ 74.202602] CPU: 14 PID: 2151 Comm: kworker/14:5 Tainted: G W O 6.7.0-rc3-00702-g41c30fec304d-dirty #5241 [ 74.203567] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 74.204551] Workqueue: events bpf_prog_free_deferred [ 74.205039] RIP: 0010:bpf_prog_pack_free+0x20/0x1a0 [ 74.205469] Code: 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 55 41 57 41 56 53 49 89 fe 48 c7 c7 60 0b 7b 83 31 f6 e8 30 a7 e2 00 <41> 8b 0e 48 8b 3d 36 71 6d 02 f3 48 0f b8 c7 48 98 48 c1 e0 15 48 [ 74.207102] RSP: 0018:ffffc900006e7de0 EFLAGS: 00010282 [ 74.214890] Call Trace: [ 74.215108] <TASK> [ 74.215305] ? __die_body+0x68/0xb0 [ 74.215620] ? page_fault_oops+0x317/0x390 [ 74.215977] ? do_kern_addr_fault+0x8a/0xb0 [ 74.216351] ? exc_page_fault+0xa0/0x1c0 [ 74.216697] ? asm_exc_page_fault+0x26/0x30 [ 74.217055] ? process_scheduled_works+0x1d7/0x520 [ 74.217481] ? bpf_prog_pack_free+0x20/0x1a0 [ 74.217857] ? process_scheduled_works+0x1d7/0x520 [ 74.218279] bpf_jit_binary_pack_free+0x17/0x30 [ 74.218676] bpf_jit_free+0x57/0x90 [ 74.218983] process_scheduled_works+0x250/0x520 [ 74.219388] worker_thread+0x26f/0x400 The kernel was compiled with: CONFIG_CC_HAS_SLS=y CONFIG_CC_HAS_RETURN_THUNK=y CONFIG_CC_HAS_ENTRY_PADDING=y CONFIG_FUNCTION_PADDING_CFI=11 CONFIG_FUNCTION_PADDING_BYTES=11 CONFIG_CALL_PADDING=y CONFIG_FINEIBT=y CONFIG_HAVE_CALL_THUNKS=y CONFIG_SPECULATION_MITIGATIONS=y CONFIG_PAGE_TABLE_ISOLATION=y CONFIG_RETPOLINE=y CONFIG_RETHUNK=y CONFIG_CPU_UNRET_ENTRY=y ... CONFIG_DEBUG_KMEMLEAK=y ... CONFIG_CFI_CLANG=y and 'make LLVM=1', of course. 
I suspect the above vmalloc/prog_pack issue is somehow related to the patches, but I cannot prove, since without this CFI fixes it also panics: [ 29.079722] CFI failure at bpf_for_each_array_elem+0xa6/0x100 (target: bpf_prog_5a19eca5d8e54e9b_check_elem+0x0/0x42; expected type: 0xe37465df) [ 29.080884] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 29.081244] CPU: 8 PID: 2142 Comm: test_progs Tainted: G O 6.7.0-rc3-00699-g90679706d486 #5242 [ 29.082662] RIP: 0010:bpf_for_each_array_elem+0xa6/0x100 [ 29.083027] Code: af ef 4c 01 ed 44 89 7c 24 04 4c 89 e7 48 8d 74 24 04 48 89 ea 48 89 d9 45 31 c0 4d 89 f3 41 ba 21 9a 8b 1c 45 03 53 f1 74 02 <0f> 0b 2e e8 62 95 de 00 48 85 c0 75 0e 49 8d 47 01 41 8b 4c 24 24 [ 29.084282] RSP: 0018:ffffc9000269fea8 EFLAGS: 00010286 [ 29.089633] Call Trace: [ 29.089805] <TASK> [ 29.089953] ? __die_body+0x68/0xb0 [ 29.090192] ? die+0xa4/0xd0 [ 29.090391] ? do_trap+0xa5/0x180 [ 29.090619] ? bpf_for_each_array_elem+0xa6/0x100 [ 29.090941] ? do_error_trap+0xb6/0x100 [ 29.091200] ? bpf_for_each_array_elem+0xa6/0x100 [ 29.091516] ? bpf_for_each_array_elem+0xa6/0x100 [ 29.091848] ? handle_invalid_op+0x2c/0x40 [ 29.092123] ? bpf_for_each_array_elem+0xa6/0x100 [ 29.092439] ? exc_invalid_op+0x38/0x60 [ 29.092699] ? asm_exc_invalid_op+0x1a/0x20 [ 29.092985] ? 0xffffffffa0000b8c [ 29.093212] ? 0xffffffffa0000b8c [ 29.093439] ? bpf_for_each_array_elem+0xa6/0x100 [ 29.093759] ? preempt_count_add+0x5d/0xb0 [ 29.094034] bpf_prog_ca45ea7f9cb8ac1a_inner_map+0x94/0x98 [ 29.094415] bpf_trampoline_6442516600+0x47/0x1000 [ 29.094743] __x64_sys_getpgid+0x9/0x20 which is expected. In this case test_progs proceeds further before it CFI aborts. With CFI fixes vmalloc panics sooner. Song, you're an expert in prog_pack logic, please take a look as well. Peter, if you're struggling to setup bpf tests locally feel free to add an extra patch that adds CONFIG_FINEIBT=y and others to tools/testing/selftests/bpf/config.x86_64 and resend. BPF CI will apply that patch to kconfig while building the kernel and will run the tests accordingly. It will be ignored with gcc builds, but clang builds should pick it up. Or do this: https://docs.kernel.org/bpf/bpf_devel_QA.html#q-how-do-i-run-bpf-ci-on-my-changes-before-sending-them-out-for-review It might be easier to test this way. Same point about extra patch for tools/testing/selftests/bpf/config.x86_64.
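For context on why that cfi_mode check is the right fix: the kallsyms *add* side in the series only registers the prefix symbol when FineIBT is actually active, so the unconditional delete tears down a ksym_prefix that was never initialised or added. A sketch of what the tail of the add side presumably looks like (the "__cfi_%s" naming and the 16-byte preamble size are inferred from the rest of this thread, not quoted from the patch):

/* kernel/bpf/core.c, tail of bpf_prog_kallsyms_add() -- sketch only */
	bpf_ksym_add(&fp->aux->ksym);

#ifdef CONFIG_FINEIBT
	/*
	 * Only FineIBT emits a __cfi_<name> preamble in front of the JITed
	 * image, so only then is a matching prefix symbol registered.
	 */
	if (cfi_mode != CFI_FINEIBT)
		return;

	snprintf(fp->aux->ksym_prefix.name, KSYM_NAME_LEN,
		 "__cfi_%s", fp->aux->ksym.name);
	fp->aux->ksym_prefix.start = (unsigned long)fp->bpf_func - 16;
	fp->aux->ksym_prefix.end   = (unsigned long)fp->bpf_func;
	bpf_ksym_add(&fp->aux->ksym_prefix);
#endif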
On Sun, Dec 03, 2023 at 02:56:34PM -0800, Alexei Starovoitov wrote: > On Thu, Nov 30, 2023 at 5:43 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > void bpf_prog_kallsyms_del(struct bpf_prog *fp) > > @@ -691,6 +708,9 @@ void bpf_prog_kallsyms_del(struct bpf_pr > > return; > > > > bpf_ksym_del(&fp->aux->ksym); > > +#ifdef CONFIG_FINEIBT > > + bpf_ksym_del(&fp->aux->ksym_prefix); > > +#endif > > } > > Thank you for addressing all comments, but it panics during boot with: > > [ 3.109474] RIP: 0010:bpf_prog_kallsyms_del+0x10f/0x140 > [ 3.109867] Code: 26 e0 00 ff 05 32 dd dd 01 48 8d bb 80 03 00 00 > 48 c7 c6 b8 b3 00 83 e8 ef 25 e0 00 48 8b 83 58 03 00 00 48 8b 8b 60 > 03 00 00 <48> 89 48 08 48 89 01 4c 89 b3 60 03 00 00 48 c7 c7 10 0b 7b > 83 5b > [ 3.111282] RSP: 0000:ffffc90000013e08 EFLAGS: 00010246 > [ 3.116968] Call Trace: > [ 3.117163] <TASK> > [ 3.117328] ? __die_body+0x68/0xb0 > [ 3.117599] ? page_fault_oops+0x317/0x390 > [ 3.117909] ? debug_objects_fill_pool+0x19/0x440 > [ 3.118283] ? debug_objects_fill_pool+0x19/0x440 > [ 3.118715] ? do_user_addr_fault+0x4cd/0x560 > [ 3.119045] ? exc_page_fault+0x62/0x1c0 > [ 3.119350] ? asm_exc_page_fault+0x26/0x30 > [ 3.119675] ? bpf_prog_kallsyms_del+0x10f/0x140 > [ 3.120023] ? bpf_prog_kallsyms_del+0x101/0x140 > [ 3.120381] __bpf_prog_put_noref+0x12/0xf0 > [ 3.120704] bpf_prog_put_deferred+0xe9/0x110 > [ 3.121035] bpf_prog_put+0xbb/0xd0 > [ 3.121307] bpf_prog_release+0x15/0x20 > > Adding the following: > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > index 5c84a935ba63..5013fd53adfd 100644 > --- a/kernel/bpf/core.c > +++ b/kernel/bpf/core.c > @@ -709,6 +709,8 @@ void bpf_prog_kallsyms_del(struct bpf_prog *fp) > > bpf_ksym_del(&fp->aux->ksym); > #ifdef CONFIG_FINEIBT > + if (cfi_mode != CFI_FINEIBT) > + return; > bpf_ksym_del(&fp->aux->ksym_prefix); > #endif > } > > fixes the boot issue, but test_progs is not happy. Damn, I'm an idiot :-), I knew I should've boot tested all configurations again :/ > Just running test_progs it splats right away: > > [ 74.047757] kmemleak: Found object by alias at 0xffffffffa0001d80 > [ 74.048272] CPU: 14 PID: 104 Comm: kworker/14:0 Tainted: G W > O 6.7.0-rc3-00702-g41c30fec304d-dirty #5241 > [ 74.049118] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 > [ 74.050042] Workqueue: events bpf_prog_free_deferred > [ 74.050448] Call Trace: > [ 74.050663] <TASK> > [ 74.050841] dump_stack_lvl+0x55/0x80 > [ 74.051141] __find_and_remove_object+0xdb/0x110 > [ 74.051521] kmemleak_free+0x41/0x70 > [ 74.051828] vfree+0x36/0x130 Durr, I'll see if I can get that stuff running locally, and otherwise play with the robot as you suggested. Thanks!
On Mon, Dec 04, 2023 at 10:13:34AM +0100, Peter Zijlstra wrote:
> > Just running test_progs it splats right away:
> >
> > [ 74.047757] kmemleak: Found object by alias at 0xffffffffa0001d80
> > [ 74.048272] CPU: 14 PID: 104 Comm: kworker/14:0 Tainted: G W
> > O 6.7.0-rc3-00702-g41c30fec304d-dirty #5241
> > [ 74.049118] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> > BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
> > [ 74.050042] Workqueue: events bpf_prog_free_deferred
> > [ 74.050448] Call Trace:
> > [ 74.050663] <TASK>
> > [ 74.050841] dump_stack_lvl+0x55/0x80
> > [ 74.051141] __find_and_remove_object+0xdb/0x110
> > [ 74.051521] kmemleak_free+0x41/0x70
> > [ 74.051828] vfree+0x36/0x130
>
> Durr, I'll see if I can get that stuff running locally, and otherwise
> play with the robot as you suggested. Thanks!

I think it is bpf_jit_binary_pack_hdr(), which is using prog->bpf_func
as a start address for the image, instead of jit_data->image.

This used to be true, but now it's offset.

Let me see what to do about that...
On Mon, Dec 04, 2023 at 12:11:28PM +0100, Peter Zijlstra wrote: > On Mon, Dec 04, 2023 at 10:13:34AM +0100, Peter Zijlstra wrote: > > > > Just running test_progs it splats right away: > > > > > > [ 74.047757] kmemleak: Found object by alias at 0xffffffffa0001d80 > > > [ 74.048272] CPU: 14 PID: 104 Comm: kworker/14:0 Tainted: G W > > > O 6.7.0-rc3-00702-g41c30fec304d-dirty #5241 > > > [ 74.049118] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > > BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 > > > [ 74.050042] Workqueue: events bpf_prog_free_deferred > > > [ 74.050448] Call Trace: > > > [ 74.050663] <TASK> > > > [ 74.050841] dump_stack_lvl+0x55/0x80 > > > [ 74.051141] __find_and_remove_object+0xdb/0x110 > > > [ 74.051521] kmemleak_free+0x41/0x70 > > > [ 74.051828] vfree+0x36/0x130 > > > > Durr, I'll see if I can get that stuff running locally, and otherwise > > play with the robot as you suggested. Thanks! > > I think it is bpf_jit_binary_pack_hdr(), which is using prog->bpf_func > as a start address for the image, instead of jit_data->image. > > This used to be true, but now it's offset. > > Let me see what to do about that... Not the prettiest of things, but the below seems to make the thing happy... --- diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 196cc1481dec..f4357c3211bc 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -3024,6 +3024,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog) prog->bpf_func = (void *)image + ctx.prog_offset; prog->jited = 1; prog->jited_len = proglen - ctx.prog_offset; + prog->aux->cfi_offset = ctx.prog_offset; } else { prog = orig_prog; } @@ -3078,6 +3079,7 @@ void bpf_jit_free(struct bpf_prog *prog) kvfree(jit_data->addrs); kfree(jit_data); } + prog->bpf_func = (void *)prog->bpf_func - prog->aux->cfi_offset; hdr = bpf_jit_binary_pack_hdr(prog); bpf_jit_binary_pack_free(hdr, NULL); WARN_ON_ONCE(!bpf_prog_kallsyms_verify_off(prog)); diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 8b725776e70a..e5fa0852a20f 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1483,6 +1483,7 @@ struct bpf_prog_aux { struct work_struct work; struct rcu_head rcu; }; + u32 cfi_offset; }; struct bpf_prog { diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 5c84a935ba63..763742f4740f 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -121,6 +121,9 @@ struct bpf_prog *bpf_prog_alloc_no_stats(unsigned int size, gfp_t gfp_extra_flag #endif INIT_LIST_HEAD_RCU(&fp->aux->ksym.lnode); +#ifdef CONFIG_FINEIBT + INIT_LIST_HEAD_RCU(&fp->aux->ksym_prefix.lnode); +#endif mutex_init(&fp->aux->used_maps_mutex); mutex_init(&fp->aux->dst_mutex); @@ -709,6 +712,8 @@ void bpf_prog_kallsyms_del(struct bpf_prog *fp) bpf_ksym_del(&fp->aux->ksym); #ifdef CONFIG_FINEIBT + if (cfi_mode != CFI_FINEIBT) + return; bpf_ksym_del(&fp->aux->ksym_prefix); #endif }
On Mon, Dec 04, 2023 at 01:52:39PM +0100, Peter Zijlstra wrote: > On Mon, Dec 04, 2023 at 12:11:28PM +0100, Peter Zijlstra wrote: > > On Mon, Dec 04, 2023 at 10:13:34AM +0100, Peter Zijlstra wrote: > > > > > > Just running test_progs it splats right away: > > > > > > > > [ 74.047757] kmemleak: Found object by alias at 0xffffffffa0001d80 > > > > [ 74.048272] CPU: 14 PID: 104 Comm: kworker/14:0 Tainted: G W > > > > O 6.7.0-rc3-00702-g41c30fec304d-dirty #5241 > > > > [ 74.049118] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > > > BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 > > > > [ 74.050042] Workqueue: events bpf_prog_free_deferred > > > > [ 74.050448] Call Trace: > > > > [ 74.050663] <TASK> > > > > [ 74.050841] dump_stack_lvl+0x55/0x80 > > > > [ 74.051141] __find_and_remove_object+0xdb/0x110 > > > > [ 74.051521] kmemleak_free+0x41/0x70 > > > > [ 74.051828] vfree+0x36/0x130 > > > > > > Durr, I'll see if I can get that stuff running locally, and otherwise > > > play with the robot as you suggested. Thanks! > > > > I think it is bpf_jit_binary_pack_hdr(), which is using prog->bpf_func > > as a start address for the image, instead of jit_data->image. > > > > This used to be true, but now it's offset. > > > > Let me see what to do about that... > > Not the prettiest of things, but the below seems to make the thing > happy... > hyea, that boots properly for me but gives crash below when running bpf tests jirka --- [ 482.145182][ T699] RIP: 0010:bpf_for_each_array_elem+0xbb/0x120 [ 482.145672][ T699] Code: 4c 01 f5 89 5c 24 04 4c 89 e7 48 8d 74 24 04 48 89 ea 4c 89 fd 4c 89 f9 45 31 c0 4d 89 eb 41 ba ef 86 cd 67 45 03 53 f1 74 02 <0f> 0b 41 ff d3 0f 1f 00 48 85 c0 75 0e 48 8d 43 01 41 8b 4c 24 24 [ 482.147221][ T699] RSP: 0018:ffffc900017e3e88 EFLAGS: 00010217 [ 482.147702][ T699] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc900017e3ed8 [ 482.152162][ T699] RDX: ffff888152eb0210 RSI: ffffc900017e3e8c RDI: ffff888152eb0000 [ 482.152770][ T699] RBP: ffffc900017e3ed8 R08: 0000000000000000 R09: 0000000000000000 [ 482.153350][ T699] R10: 000000004704ef28 R11: ffffffffa0012774 R12: ffff888152eb0000 [ 482.153951][ T699] R13: ffffffffa0012774 R14: ffff888152eb0210 R15: ffffc900017e3ed8 [ 482.154554][ T699] FS: 00007fa60d4fdd00(0000) GS:ffff88846d200000(0000) knlGS:0000000000000000 [ 482.155138][ T699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 482.155564][ T699] CR2: 00007fa60d7d8000 CR3: 00000001502a2005 CR4: 0000000000770ef0 [ 482.156095][ T699] PKRU: 55555554 [ 482.156349][ T699] Call Trace: [ 482.156596][ T699] <TASK> [ 482.156816][ T699] ? __die_body+0x68/0xb0 [ 482.157138][ T699] ? die+0xba/0xe0 [ 482.157456][ T699] ? do_trap+0xa5/0x180 [ 482.157826][ T699] ? bpf_for_each_array_elem+0xbb/0x120 [ 482.158277][ T699] ? bpf_for_each_array_elem+0xbb/0x120 [ 482.158711][ T699] ? do_error_trap+0xc4/0x140 [ 482.159052][ T699] ? bpf_for_each_array_elem+0xbb/0x120 [ 482.159506][ T699] ? handle_invalid_op+0x2c/0x40 [ 482.159906][ T699] ? bpf_for_each_array_elem+0xbb/0x120 [ 482.160990][ T699] ? exc_invalid_op+0x38/0x60 [ 482.161375][ T699] ? asm_exc_invalid_op+0x1a/0x20 [ 482.161788][ T699] ? 0xffffffffa0012774 [ 482.162149][ T699] ? 0xffffffffa0012774 [ 482.162513][ T699] ? 
bpf_for_each_array_elem+0xbb/0x120 [ 482.162905][ T699] bpf_prog_ca45ea7f9cb8ac1a_inner_map+0x94/0x98 [ 482.163471][ T699] bpf_trampoline_6442549234+0x47/0x1000 [ 482.163981][ T699] __x64_sys_getpgid+0x9/0x20 [ 482.164770][ T699] do_syscall_64+0x53/0x110 [ 482.165184][ T699] entry_SYSCALL_64_after_hwframe+0x6e/0x76 [ 482.165646][ T699] RIP: 0033:0x7fa60d6c5b4d [ 482.166005][ T699] Code: c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b 92 0c 00 f7 d8 64 89 01 48 [ 482.169662][ T699] RSP: 002b:00007ffddfcf99e8 EFLAGS: 00000202 ORIG_RAX: 0000000000000079 [ 482.170582][ T699] RAX: ffffffffffffffda RBX: 00007ffddfcf9d98 RCX: 00007fa60d6c5b4d [ 482.171376][ T699] RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000557baa5ab850 [ 482.172010][ T699] RBP: 00007ffddfcf9b30 R08: 0000000000000000 R09: 0000000000000000 [ 482.172665][ T699] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000001 [ 482.173359][ T699] R13: 0000000000000000 R14: 00007fa60d80d000 R15: 0000557ba6ab9790 [ 482.174014][ T699] </TASK> [ 482.174289][ T699] Modules linked in: bpf_testmod(OE) intel_rapl_msr intel_rapl_common crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel kvm_intel rapl iTCO_wdt iTCO_vendor_support i2c_i801 i2c_smbus lpc_ich drm loop drm_panel_orientation_quirks zram [ 482.176040][ T699] ---[ end trace 0000000000000000 ]--- [ 482.176534][ T699] RIP: 0010:bpf_for_each_array_elem+0xbb/0x120 [ 482.177215][ T699] Code: 4c 01 f5 89 5c 24 04 4c 89 e7 48 8d 74 24 04 48 89 ea 4c 89 fd 4c 89 f9 45 31 c0 4d 89 eb 41 ba ef 86 cd 67 45 03 53 f1 74 02 <0f> 0b 41 ff d3 0f 1f 00 48 85 c0 75 0e 48 8d 43 01 41 8b 4c 24 24 [ 482.179405][ T699] RSP: 0018:ffffc900017e3e88 EFLAGS: 00010217 [ 482.179971][ T699] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc900017e3ed8 [ 482.180615][ T699] RDX: ffff888152eb0210 RSI: ffffc900017e3e8c RDI: ffff888152eb0000 [ 482.181195][ T699] RBP: ffffc900017e3ed8 R08: 0000000000000000 R09: 0000000000000000 [ 482.181805][ T699] R10: 000000004704ef28 R11: ffffffffa0012774 R12: ffff888152eb0000 [ 482.182411][ T699] R13: ffffffffa0012774 R14: ffff888152eb0210 R15: ffffc900017e3ed8 [ 482.183043][ T699] FS: 00007fa60d4fdd00(0000) GS:ffff88846d200000(0000) knlGS:0000000000000000 [ 482.183753][ T699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 482.184361][ T699] CR2: 00007fa60d7d8000 CR3: 00000001502a2005 CR4: 0000000000770ef0 [ 482.185649][ T699] PKRU: 55555554 [ 482.185985][ T699] BUG: sleeping function called from invalid context at include/linux/percpu-rwsem.h:49 [ 482.186846][ T699] in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 699, name: test_progs [ 482.187590][ T699] preempt_count: 0, expected: 0 [ 482.188055][ T699] RCU nest depth: 1, expected: 0 [ 482.188460][ T699] INFO: lockdep is turned off. [ 482.188861][ T699] CPU: 1 PID: 699 Comm: test_progs Tainted: G D OE 6.7.0-rc3+ #118 bfe8e46ac948d811e03ae3149ad95ea179efe638 [ 482.189821][ T699] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014 [ 482.191140][ T699] Call Trace: [ 482.191469][ T699] <TASK> [ 482.191766][ T699] dump_stack_lvl+0xbd/0x120 [ 482.192222][ T699] __might_resched+0x24a/0x280 [ 482.192694][ T699] exit_signals+0x31/0x280 [ 482.193123][ T699] do_exit+0x1a6/0xbb0 [ 482.193511][ T699] ? 
bpf_trampoline_6442549234+0x47/0x1000 [ 482.194025][ T699] make_task_dead+0x94/0x180 [ 482.194469][ T699] rewind_stack_and_make_dead+0x17/0x20 [ 482.194968][ T699] RIP: 0033:0x7fa60d6c5b4d [ 482.195399][ T699] Code: c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b 92 0c 00 f7 d8 64 89 01 48 [ 482.197082][ T699] RSP: 002b:00007ffddfcf99e8 EFLAGS: 00000202 ORIG_RAX: 0000000000000079 [ 482.197887][ T699] RAX: ffffffffffffffda RBX: 00007ffddfcf9d98 RCX: 00007fa60d6c5b4d [ 482.198604][ T699] RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000557baa5ab850 [ 482.199327][ T699] RBP: 00007ffddfcf9b30 R08: 0000000000000000 R09: 0000000000000000 [ 482.200035][ T699] R10: 0000000000000064 R11: 0000000000000202 R12: 0000000000000001 [ 482.200765][ T699] R13: 0000000000000000 R14: 00007fa60d80d000 R15: 0000557ba6ab9790 [ 482.201499][ T699] </TASK>
On Mon, Dec 04, 2023 at 06:25:34PM +0100, Jiri Olsa wrote: > that boots properly for me but gives crash below when running bpf tests OK, more funnies.. > [ 482.145182][ T699] RIP: 0010:bpf_for_each_array_elem+0xbb/0x120 > [ 482.145672][ T699] Code: 4c 01 f5 89 5c 24 04 4c 89 e7 48 8d 74 24 04 48 89 ea 4c 89 fd 4c 89 f9 45 31 c0 4d 89 eb 41 ba ef 86 cd 67 45 03 53 f1 74 02 <0f> 0b 41 ff d3 0f 1f 00 48 85 c0 75 0e 48 8d 43 01 41 8b 4c 24 24 > [ 482.147221][ T699] RSP: 0018:ffffc900017e3e88 EFLAGS: 00010217 > [ 482.147702][ T699] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc900017e3ed8 > [ 482.152162][ T699] RDX: ffff888152eb0210 RSI: ffffc900017e3e8c RDI: ffff888152eb0000 > [ 482.152770][ T699] RBP: ffffc900017e3ed8 R08: 0000000000000000 R09: 0000000000000000 > [ 482.153350][ T699] R10: 000000004704ef28 R11: ffffffffa0012774 R12: ffff888152eb0000 > [ 482.153951][ T699] R13: ffffffffa0012774 R14: ffff888152eb0210 R15: ffffc900017e3ed8 > [ 482.154554][ T699] FS: 00007fa60d4fdd00(0000) GS:ffff88846d200000(0000) knlGS:0000000000000000 > [ 482.155138][ T699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 482.155564][ T699] CR2: 00007fa60d7d8000 CR3: 00000001502a2005 CR4: 0000000000770ef0 > [ 482.156095][ T699] PKRU: 55555554 > [ 482.156349][ T699] Call Trace: > [ 482.156596][ T699] <TASK> > [ 482.156816][ T699] ? __die_body+0x68/0xb0 > [ 482.157138][ T699] ? die+0xba/0xe0 > [ 482.157456][ T699] ? do_trap+0xa5/0x180 > [ 482.157826][ T699] ? bpf_for_each_array_elem+0xbb/0x120 > [ 482.158277][ T699] ? bpf_for_each_array_elem+0xbb/0x120 > [ 482.158711][ T699] ? do_error_trap+0xc4/0x140 > [ 482.159052][ T699] ? bpf_for_each_array_elem+0xbb/0x120 > [ 482.159506][ T699] ? handle_invalid_op+0x2c/0x40 > [ 482.159906][ T699] ? bpf_for_each_array_elem+0xbb/0x120 > [ 482.160990][ T699] ? exc_invalid_op+0x38/0x60 > [ 482.161375][ T699] ? asm_exc_invalid_op+0x1a/0x20 > [ 482.161788][ T699] ? 0xffffffffa0012774 > [ 482.162149][ T699] ? 0xffffffffa0012774 > [ 482.162513][ T699] ? bpf_for_each_array_elem+0xbb/0x120 > [ 482.162905][ T699] bpf_prog_ca45ea7f9cb8ac1a_inner_map+0x94/0x98 > [ 482.163471][ T699] bpf_trampoline_6442549234+0x47/0x1000 Looks like this trips an #UD, I'll go try and figure out what this bpf_for_each_array_elem() does to cause this. Looks like it has an indirect call, could be the callback_fn thing has a CFI mis-match.
On Mon, Dec 04, 2023 at 07:16:14PM +0100, Peter Zijlstra wrote: > On Mon, Dec 04, 2023 at 06:25:34PM +0100, Jiri Olsa wrote: > > > that boots properly for me but gives crash below when running bpf tests > > OK, more funnies.. > > > [ 482.145182][ T699] RIP: 0010:bpf_for_each_array_elem+0xbb/0x120 > > [ 482.145672][ T699] Code: 4c 01 f5 89 5c 24 04 4c 89 e7 48 8d 74 24 04 48 89 ea 4c 89 fd 4c 89 f9 45 31 c0 4d 89 eb 41 ba ef 86 cd 67 45 03 53 f1 74 02 <0f> 0b 41 ff d3 0f 1f 00 48 85 c0 75 0e 48 8d 43 01 41 8b 4c 24 24 > > [ 482.147221][ T699] RSP: 0018:ffffc900017e3e88 EFLAGS: 00010217 > > [ 482.147702][ T699] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc900017e3ed8 > > [ 482.152162][ T699] RDX: ffff888152eb0210 RSI: ffffc900017e3e8c RDI: ffff888152eb0000 > > [ 482.152770][ T699] RBP: ffffc900017e3ed8 R08: 0000000000000000 R09: 0000000000000000 > > [ 482.153350][ T699] R10: 000000004704ef28 R11: ffffffffa0012774 R12: ffff888152eb0000 > > [ 482.153951][ T699] R13: ffffffffa0012774 R14: ffff888152eb0210 R15: ffffc900017e3ed8 > > [ 482.154554][ T699] FS: 00007fa60d4fdd00(0000) GS:ffff88846d200000(0000) knlGS:0000000000000000 > > [ 482.155138][ T699] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 482.155564][ T699] CR2: 00007fa60d7d8000 CR3: 00000001502a2005 CR4: 0000000000770ef0 > > [ 482.156095][ T699] PKRU: 55555554 > > [ 482.156349][ T699] Call Trace: > > [ 482.156596][ T699] <TASK> > > [ 482.156816][ T699] ? __die_body+0x68/0xb0 > > [ 482.157138][ T699] ? die+0xba/0xe0 > > [ 482.157456][ T699] ? do_trap+0xa5/0x180 > > [ 482.157826][ T699] ? bpf_for_each_array_elem+0xbb/0x120 > > [ 482.158277][ T699] ? bpf_for_each_array_elem+0xbb/0x120 > > [ 482.158711][ T699] ? do_error_trap+0xc4/0x140 > > [ 482.159052][ T699] ? bpf_for_each_array_elem+0xbb/0x120 > > [ 482.159506][ T699] ? handle_invalid_op+0x2c/0x40 > > [ 482.159906][ T699] ? bpf_for_each_array_elem+0xbb/0x120 > > [ 482.160990][ T699] ? exc_invalid_op+0x38/0x60 > > [ 482.161375][ T699] ? asm_exc_invalid_op+0x1a/0x20 > > [ 482.161788][ T699] ? 0xffffffffa0012774 > > [ 482.162149][ T699] ? 0xffffffffa0012774 > > [ 482.162513][ T699] ? bpf_for_each_array_elem+0xbb/0x120 > > [ 482.162905][ T699] bpf_prog_ca45ea7f9cb8ac1a_inner_map+0x94/0x98 > > [ 482.163471][ T699] bpf_trampoline_6442549234+0x47/0x1000 > > Looks like this trips an #UD, I'll go try and figure out what this > bpf_for_each_array_elem() does to cause this. Looks like it has an > indirect call, could be the callback_fn thing has a CFI mis-match. So afaict this is used through bpf_for_each_map_elem(), where the argument still is properly callback_fn. However, in the desriptor bpf_for_each_map_elem_proto the argument gets described as: ARG_PTR_TO_FUNC, which in turn has a comment like: ARG_PTR_TO_FUNC, /* pointer to a bpf program function */ Which to me sounds like there is definite type punning involved. The call in bpf_for_each_array_elem() is a regular C indirect call, which gets adorned with the kCFI magic. But I doubt the BPF function that gets used gets the correct matching bits on. TL;DR, I think this is a pre-existing problem with kCFI + eBPF and not caused by my patches. Could any of you bpf knowledgeable folks please explain me exactly what gets used as the function pointer in this case? -- I'm not sure I can follow along well enough to begin looking for a solution at this point :/
On Mon, Dec 4, 2023 at 10:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> So afaict this is used through bpf_for_each_map_elem(), where the
> argument still is properly callback_fn. However, in the desriptor
> bpf_for_each_map_elem_proto the argument gets described as:
> ARG_PTR_TO_FUNC, which in turn has a comment like:
>
> ARG_PTR_TO_FUNC, /* pointer to a bpf program function */
>
> Which to me sounds like there is definite type punning involved. The
> call in bpf_for_each_array_elem() is a regular C indirect call, which
> gets adorned with the kCFI magic.
>
> But I doubt the BPF function that gets used gets the correct matching
> bits on.
>
> TL;DR, I think this is a pre-existing problem with kCFI + eBPF and not
> caused by my patches.

It is a pre-existing problem, I ran into the same failures when I looked
into this briefly last year:

https://github.com/ClangBuiltLinux/linux/issues/1727

In addition to bpf_for_each_array_elem, a few other callers also use the
same function pointer type that doesn't match cfi_bpf_hash.

Sami
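To make the mismatch concrete: the helper calls through a real C function pointer, so clang instruments that call site with the kCFI type id of bpf_callback_t, while the JIT stamps every program with the type id of the bpf_prog_run-style entry point. A small illustration (bpf_callback_t is the real typedef from include/linux/bpf.h; bpf_prog_run_t is only an illustrative name for the signature the series hashes into cfi_bpf_hash):

/* What the indirect call in bpf_for_each_array_elem() checks for: */
typedef u64 (*bpf_callback_t)(u64, u64, u64, u64, u64);

/* What the JITed program advertises, roughly: */
typedef u64 (*bpf_prog_run_t)(const void *ctx, const struct bpf_insn *insn);

/* Two different types, two different hashes, so the kCFI check traps (#UD). */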
On Mon, Dec 4, 2023 at 10:34 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > TL;DR, I think this is a pre-existing problem with kCFI + eBPF and not > caused by my patches. It's an old issue indeed. To workaround I just did: +__nocfi static long bpf_for_each_array_elem(struct bpf_map *map, bpf_callback_t callback_fn, void *callback_ctx, u64 flags) to proceed further. test_progs passed few more tests, but then it hit: [ 13.965472] CFI failure at tcp_set_ca_state+0x51/0xd0 (target: 0xffffffffa02050d6; expected type: 0x3a47ac32) [ 13.966351] invalid opcode: 0000 [#1] PREEMPT SMP PTI [ 13.966752] CPU: 3 PID: 2142 Comm: test_progs Tainted: G O 6.7.0-rc3-00705-g421defd1bea0-dirty #5246 [ 13.967552] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 [ 13.968421] RIP: 0010:tcp_set_ca_state+0x51/0xd0 [ 13.968751] Code: 70 40 ff 84 c0 74 49 48 8b 83 60 07 00 00 4c 8b 58 10 4d 85 db 74 1b 40 0f b6 f5 48 89 df 41 ba ce 53 b8 c5 45 03 53 f1 74 02 <0f> 0b 2e e8 c7 ee 31 00 0f b6 83 90 07 00 00 40 80 e5 1f 24 e0 40 [ 13.975460] Call Trace: [ 13.975640] <IRQ> [ 13.975791] ? __die_body+0x68/0xb0 [ 13.976062] ? die+0xa4/0xd0 [ 13.976273] ? do_trap+0xa5/0x180 [ 13.976513] ? tcp_set_ca_state+0x51/0xd0 [ 13.976800] ? do_error_trap+0xb6/0x100 [ 13.977076] ? tcp_set_ca_state+0x51/0xd0 [ 13.977360] ? tcp_set_ca_state+0x51/0xd0 [ 13.977644] ? handle_invalid_op+0x2c/0x40 [ 13.977934] ? tcp_set_ca_state+0x51/0xd0 [ 13.978222] ? exc_invalid_op+0x38/0x60 [ 13.978497] ? asm_exc_invalid_op+0x1a/0x20 [ 13.978798] ? tcp_set_ca_state+0x51/0xd0 [ 13.979087] tcp_v6_syn_recv_sock+0x45c/0x6c0 [ 13.979401] tcp_check_req+0x497/0x590 [ 13.979671] tcp_v6_rcv+0x728/0xce0 [ 13.979923] ? raw6_local_deliver+0x63/0x350 [ 13.980257] ip6_protocol_deliver_rcu+0x2f6/0x560 [ 13.980596] ? ip6_input_finish+0x59/0x140 [ 13.980887] ? NF_HOOK+0x29/0x1d0 [ 13.981136] ip6_input_finish+0xcb/0x140 [ 13.981415] ? __cfi_ip6_input_finish+0x10/0x10 [ 13.981738] NF_HOOK+0x177/0x1d0 [ 13.981970] ? rcu_is_watching+0x10/0x40 [ 13.982279] ? lock_release+0x35/0x2e0 [ 13.982547] ? lock_release+0x35/0x2e0 [ 13.982822] ? NF_HOOK+0x29/0x1d0 [ 13.983064] ? __cfi_ip6_rcv_finish+0x10/0x10 [ 13.983409] NF_HOOK+0x177/0x1d0 [ 13.983664] ? ip6_rcv_core+0x50/0x6c0 [ 13.983956] ? process_backlog+0x132/0x290 [ 13.984264] ? process_backlog+0x132/0x290 [ 13.984557] __netif_receive_skb+0x5c/0x160 [ 13.984856] process_backlog+0x19e/0x290 [ 13.985140] __napi_poll+0x3f/0x1f0 [ 13.985402] net_rx_action+0x193/0x330 [ 13.985672] __do_softirq+0x14d/0x3ea [ 13.985963] ? do_softirq+0x7f/0xb0 [ 13.986243] ? __dev_queue_xmit+0x5b/0xd50 [ 13.986563] ? ip6_finish_output2+0x222/0x7a0 [ 13.986906] do_softirq+0x7f/0xb0 The stack trace doesn't have any bpf, but it's a bpf issue too. Here tcp_set_ca_state() calls icsk->icsk_ca_ops->set_state(sk, ca_state); which calls bpf prog via bpf trampoline. re: bpf_jit_binary_pack_hdr(). since cfi_mode is __ro_after_init we don't need to waste cfi_offset variable in prog->aux and in jit_context. 
How about +int get_cfi_offset(void) +{ + switch (cfi_mode) { + case CFI_FINEIBT: + return 16; + case CFI_KCFI: +#ifdef CONFIG_CALL_PADDING + return 16; +#else + return 5; +#endif + default: + return 0; + } +} + struct bpf_binary_header * bpf_jit_binary_pack_hdr(const struct bpf_prog *fp) { - unsigned long real_start = (unsigned long)fp->bpf_func; + unsigned long real_start = (unsigned long)fp->bpf_func - get_cfi_offset(); and have __weak version of get_cfi_offset() in bpf/core.c that returns 0 and non-weak in arch/x86 like above? Similarly remove prog_offset from jit_context and undo: ctx->prog_offset = emit_prologue(...) to keep it as 'static void emit_prologue' since cfi offset is fixed at early boot and the same for all progs. Separately we need to deal with bpf_for_each_array_elem() which doesn't look easy. And fix tcp_set_ca_state() as well (which is even harder). Just to see where places like these are I did: +__nocfi BPF_CALL_4(bpf_loop, u32, nr_loops, void *, callback_fn, void *, callback_ctx, +__nocfi static long bpf_for_each_hash_elem(struct bpf_map *map, bpf_callback_t callback_fn, +__nocfi static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer) +__nocfi static int __bpf_rbtree_add(struct bpf_rb_root *root, +__nocfi BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map, +__nocfi void tcp_set_ca_state(struct sock *sk, const u8 ca_state) +__nocfi void tcp_init_congestion_control(struct sock *sk) +__nocfi void tcp_enter_loss(struct sock *sk) +__nocfi static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 acked) +__nocfi static inline void tcp_in_ack_event(struct sock *sk, u32 flags) and more... Which is clearly not a direction to go. Instead of annotating callers is there a way to say that all bpf_callback_t calls are nocfi? I feel the patches scratched the iceberg.
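The __weak default mentioned above would then be a one-liner on the generic side; a sketch (the final series may name or place this differently):

/* kernel/bpf/core.c -- default for architectures that emit no CFI preamble;
 * the x86 version above, switching on cfi_mode, overrides it. */
int __weak get_cfi_offset(void)
{
	return 0;
}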
On Mon, Dec 04, 2023 at 05:18:31PM -0800, Alexei Starovoitov wrote: > How about > +int get_cfi_offset(void) > +{ > + switch (cfi_mode) { > + case CFI_FINEIBT: > + return 16; > + case CFI_KCFI: > +#ifdef CONFIG_CALL_PADDING > + return 16; > +#else > + return 5; > +#endif > + default: > + return 0; > + } > +} Yeah, that works. I'll go make it happen. > Separately we need to deal with bpf_for_each_array_elem() > which doesn't look easy. > And fix tcp_set_ca_state() as well (which is even harder). > > Just to see where places like these are I did: > +__nocfi > BPF_CALL_4(bpf_loop, u32, nr_loops, void *, callback_fn, void *, callback_ctx, > +__nocfi > static long bpf_for_each_hash_elem(struct bpf_map *map, > bpf_callback_t callback_fn, > +__nocfi > static enum hrtimer_restart bpf_timer_cb(struct hrtimer *hrtimer) > +__nocfi > static int __bpf_rbtree_add(struct bpf_rb_root *root, > +__nocfi > BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map, > +__nocfi > void tcp_set_ca_state(struct sock *sk, const u8 ca_state) > +__nocfi > void tcp_init_congestion_control(struct sock *sk) > +__nocfi > void tcp_enter_loss(struct sock *sk) > +__nocfi > static void tcp_cong_avoid(struct sock *sk, u32 ack, u32 acked) > +__nocfi > static inline void tcp_in_ack_event(struct sock *sk, u32 flags) > > and more... Which is clearly not a direction to go. > > Instead of annotating callers is there a way to say that > all bpf_callback_t calls are nocfi? Well, ideally they would all actually use CFI, I'll go figure out how all this works and think about it. Thanks! > I feel the patches scratched the iceberg. Yeah, clearly :/ I'll go stare at it all.
On Mon, Dec 04, 2023 at 05:18:31PM -0800, Alexei Starovoitov wrote: > [ 13.978497] ? asm_exc_invalid_op+0x1a/0x20 > [ 13.978798] ? tcp_set_ca_state+0x51/0xd0 > [ 13.979087] tcp_v6_syn_recv_sock+0x45c/0x6c0 > [ 13.979401] tcp_check_req+0x497/0x590 > The stack trace doesn't have any bpf, but it's a bpf issue too. > Here tcp_set_ca_state() calls > icsk->icsk_ca_ops->set_state(sk, ca_state); > which calls bpf prog via bpf trampoline. Specifically, I think this is tools/testing/selftests/bpf/progs/bpf_cubic.c, which has: .set_state = (void *)bpf_cubic_state, which comes from: BPF_STRUCT_OPS(bpf_cubic_state, struct sock *sk, __u8 *new_state) which then wraps: BPF_PROG() which ends up generating: static __always_inline ___bpf_cubic_state(unsigned long long *ctx, struct sock *sk, __u8 *new_state) { ... } void bpf_cubic_state(unsigned long long *ctx) { return ____bpf_cubic_state(ctx, ctx[0], ctx[1]); } I think this then uses arch_prepare_bpf_trampoline(), but I'm entirely lost how this all comes together, because the way I understand it the whole bpf_trampoline is used to hook into an ftrace __fentry hook. And a __fentry hook is very much not a function pointer. Help!?!? The other case: For tools/testing/selftests/bpf/progs/bloom_filter_bench.c we have: bpf_for_each_map_elem(&array_map, bloom_callback, &data, 0); and here bloom callback appears like a normal function: static __u64 bloom_callback(struct bpf_map *map, __u32 *key, void *val, struct callback_ctx *data) But what do functions looks like in the JIT? What's the actual address that's then passed into the helper function. Given this seems to work without kCFI, it should at least have an ENDBR, but there's only 3 of those afaict: - emit_prologue() first insn - emit_prologue() tail-call site - arch_preprare_bpf_trampoline() If the function passed to the helper is from do_jit()/emit_prologue(), then how do I tell what 'function' is being JIT'ed ? If it is arch_prepare_bpf_trampoline(), then we're back at the previous question and I don't see how a __fentry site becomes a callable function pointer. Any clues would be much appreciated.
On Wed, Dec 06, 2023 at 05:38:14PM +0100, Peter Zijlstra wrote: > On Mon, Dec 04, 2023 at 05:18:31PM -0800, Alexei Starovoitov wrote: > > > [ 13.978497] ? asm_exc_invalid_op+0x1a/0x20 > > [ 13.978798] ? tcp_set_ca_state+0x51/0xd0 > > [ 13.979087] tcp_v6_syn_recv_sock+0x45c/0x6c0 > > [ 13.979401] tcp_check_req+0x497/0x590 > > > The stack trace doesn't have any bpf, but it's a bpf issue too. > > Here tcp_set_ca_state() calls > > icsk->icsk_ca_ops->set_state(sk, ca_state); > > which calls bpf prog via bpf trampoline. > > > > Specifically, I think this is > tools/testing/selftests/bpf/progs/bpf_cubic.c, which has: > > .set_state = (void *)bpf_cubic_state, > > which comes from: > > BPF_STRUCT_OPS(bpf_cubic_state, struct sock *sk, __u8 *new_state) > > which then wraps: > > BPF_PROG() > > which ends up generating: > > static __always_inline ___bpf_cubic_state(unsigned long long *ctx, struct sock *sk, __u8 *new_state) > { > ... > } > > void bpf_cubic_state(unsigned long long *ctx) > { > return ____bpf_cubic_state(ctx, ctx[0], ctx[1]); > } > > > I think this then uses arch_prepare_bpf_trampoline(), but I'm entirely > lost how this all comes together, because the way I understand it the > whole bpf_trampoline is used to hook into an ftrace __fentry hook. > > And a __fentry hook is very much not a function pointer. Help!?!? kernel/bpf/bpf_struct_ops.c:bpf_struct_ops_prepare_trampoline() And yeah, it seems to use the ftrace trampoline for indirect calls here, *sigh*. > The other case: > > For tools/testing/selftests/bpf/progs/bloom_filter_bench.c we have: > > bpf_for_each_map_elem(&array_map, bloom_callback, &data, 0); > > and here bloom callback appears like a normal function: > > static __u64 > bloom_callback(struct bpf_map *map, __u32 *key, void *val, > struct callback_ctx *data) > > > But what do functions looks like in the JIT? What's the actual address > that's then passed into the helper function. Given this seems to work > without kCFI, it should at least have an ENDBR, but there's only 3 of > those afaict: > > - emit_prologue() first insn > - emit_prologue() tail-call site > - arch_preprare_bpf_trampoline() > > If the function passed to the helper is from do_jit()/emit_prologue(), > then how do I tell what 'function' is being JIT'ed ? > > If it is arch_prepare_bpf_trampoline(), then we're back at the previous > question and I don't see how a __fentry site becomes a callable function > pointer. > > > Any clues would be much appreciated. Still not figured out how this one works...
On Wed, Dec 06, 2023 at 07:37:13PM +0100, Peter Zijlstra wrote: > On Wed, Dec 06, 2023 at 05:38:14PM +0100, Peter Zijlstra wrote: > > On Mon, Dec 04, 2023 at 05:18:31PM -0800, Alexei Starovoitov wrote: > > > > > [ 13.978497] ? asm_exc_invalid_op+0x1a/0x20 > > > [ 13.978798] ? tcp_set_ca_state+0x51/0xd0 > > > [ 13.979087] tcp_v6_syn_recv_sock+0x45c/0x6c0 > > > [ 13.979401] tcp_check_req+0x497/0x590 > > > > > The stack trace doesn't have any bpf, but it's a bpf issue too. > > > Here tcp_set_ca_state() calls > > > icsk->icsk_ca_ops->set_state(sk, ca_state); > > > which calls bpf prog via bpf trampoline. > > > > > > > > Specifically, I think this is > > tools/testing/selftests/bpf/progs/bpf_cubic.c, which has: > > > > .set_state = (void *)bpf_cubic_state, > > > > which comes from: > > > > BPF_STRUCT_OPS(bpf_cubic_state, struct sock *sk, __u8 *new_state) > > > > which then wraps: > > > > BPF_PROG() > > > > which ends up generating: > > > > static __always_inline ___bpf_cubic_state(unsigned long long *ctx, struct sock *sk, __u8 *new_state) > > { > > ... > > } > > > > void bpf_cubic_state(unsigned long long *ctx) > > { > > return ____bpf_cubic_state(ctx, ctx[0], ctx[1]); > > } Yep. That's correct. > > I think this then uses arch_prepare_bpf_trampoline(), but I'm entirely > > lost how this all comes together, because the way I understand it the > > whole bpf_trampoline is used to hook into an ftrace __fentry hook. > > > > And a __fentry hook is very much not a function pointer. Help!?!? > > kernel/bpf/bpf_struct_ops.c:bpf_struct_ops_prepare_trampoline() > > And yeah, it seems to use the ftrace trampoline for indirect calls here, > *sigh*. Not quite. bpf_struct_ops_prepare_trampoline() prepares a trampoline that does conversion from native calling convention to bpf calling convention. We could have optimized it for the case of x86-64 and num_args <= 5 and it would be a nop trampoline, but so far it's generic and works on x86-64, arm64, etc. There were patches posted to make it work on 32-bit archs too (not landed yet). All native args are stored one by one into u64 ctx[0], u64 ctx[1], on stack and then bpf prog is called with a single ctx pointer. For example for the case of struct tcp_congestion_ops { void (*set_state)(struct sock *sk, u8 new_state); } The translation of 'sk' into ctx[0] is based on 'struct btf_func_model' which is discovered from 'set_state' func prototype as stored in BTF. So the trampoline for set_state, copies %rdi into ctx[0] and %rsi into ctx[1] and _directly_ calls bpf_cubic_state(). Note arch_prepare_bpf_trampoline() emits ENDBR as the first insn, because the pointer to this trampoline is directly stored in 'struct tcp_congestion_ops'. Later from TCP stack point of view 'icsk_ca_ops' are exactly the same for built-in cong controls (CCs), kernel module's CCs and bpf-based CCs. All calls to struct tcp_congestion_ops callbacks are normal indirect calls. Different CCs have different struct tcp_congestion_ops with their own pointers to functions, of course. There is no ftrace here at all. No .text live patching either. All is ok until kCFI comes into picture. Here we probably need to teach arch_prepare_bpf_trampoline() to emit different __kcfi_typeid depending on kernel function proto, so that caller hash checking logic won't be tripped. I suspect that requires to reverse engineer an algorithm of computing kcfi from clang. other ideas? 
> > The other case: > > > > For tools/testing/selftests/bpf/progs/bloom_filter_bench.c we have: > > > > bpf_for_each_map_elem(&array_map, bloom_callback, &data, 0); > > > > and here bloom callback appears like a normal function: > > > > static __u64 > > bloom_callback(struct bpf_map *map, __u32 *key, void *val, > > struct callback_ctx *data) > > > > > > But what do functions looks like in the JIT? What's the actual address > > that's then passed into the helper function. Given this seems to work > > without kCFI, it should at least have an ENDBR, but there's only 3 of > > those afaict: Right. That is very different from struct_ops/trampoline. There is no trampoline here at all. A bpf prog is JITed as normal, but its prototype is technically bpf_callback_t. We do the same JITing for all progs. Both main entry prog and subprograms. They all are treated as accepting 5 u64 args and returning single u64. For the main prog the prototype: bpf_prog_runX(u64 *regs, const struct bpf_insn *insn) is partially true, since 2nd arg is never used and the 1st arg is 'ctx'. So from x86-64 JIT pov there is no difference whether it's single 8-byte arg or five 8-byte args. In the case of bpf_for_each_map_elem() the 'bloom_callback' is a subprog of bpf_callback_t type. So the kernel is doing: ret = callback_fn((u64)(long)map, (u64)(long)&key, (u64)(long)val, (u64)(long)callback_ctx, 0); and that works on all archs including 32-bit. The kernel is doing conversion from native calling convention to bpf calling convention and for lucky archs like x86-64 the conversion is a true nop. It's a plain indirect call to JITed bpf prog. Note there is no interpreter support here. This works on archs with JITs only. No ftrace and no trampoline. This case is easier to make work with kCFI. The JIT will use: cfi_bpf_hash: .long __kcfi_typeid___bpf_prog_runX like your patch already does. And will use extern u64 __bpf_callback_fn(u64, u64, u64, u64, u64); cfi_bpf_subprog_hash: .long __kcfi_typeid___bpf_callback_fn to JIT all subprogs. See bpf_is_subprog(). Which shouldn't be too difficult to add. We'd need to tweak the verifier to make sure bpf_for_each_map_elem and friends never callback the main subprog which is technically the case today. Just need to add more guardrails. I can work on this. btw there are two patchsets in progress that will touch core bits of JITs. This one: https://patchwork.kernel.org/project/netdevbpf/cover/20231201190654.1233153-1-song@kernel.org/ and this one: https://patchwork.kernel.org/project/netdevbpf/cover/20231011152725.95895-1-hffilwlqm@gmail.com/ so do you mind resending your current set with get_cfi_offset() change and I can land it into bpf-next, so we can fix one bug at a time, build on top, and avoid conflicts? The more we dig the more it looks like that the follow up you planned to do on top of this set isn't going to happen soon. So should be ok going through bpf-next and then you can follow up with x86 things after merge window?
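A conceptual rendering of the struct_ops path described above, using the .set_state example from earlier in the thread (illustrative C only; the real trampoline is asm emitted by arch_prepare_bpf_trampoline() from the btf_func_model of set_state):

extern void bpf_cubic_state(unsigned long long *ctx);	/* the JITed prog */

/* What the generated trampoline installed as tcp_congestion_ops::set_state
 * effectively does: native args -> u64 ctx[] -> direct call, no ftrace. */
static void set_state_trampoline(struct sock *sk, u8 new_state)
{
	unsigned long long ctx[2];

	ctx[0] = (unsigned long long)(long)sk;	/* was %rdi on x86-64 */
	ctx[1] = new_state;			/* was %rsi on x86-64 */

	bpf_cubic_state(ctx);
}

The kCFI problem is then that tcp_set_ca_state() performs an ordinary indirect call typed as void (*)(struct sock *, u8), but the generated trampoline carries no matching type hash in front of it.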
On Wed, Dec 06, 2023 at 01:39:43PM -0800, Alexei Starovoitov wrote: > All is ok until kCFI comes into picture. > Here we probably need to teach arch_prepare_bpf_trampoline() to emit > different __kcfi_typeid depending on kernel function proto, > so that caller hash checking logic won't be tripped. > I suspect that requires to reverse engineer an algorithm of computing kcfi from clang. > other ideas? I was going to try and extend bpf_struct_ops with a pointer, this pointer will point to a struct of the right type with all ops filled out as stubs. Then I was going to have bpf_struct_ops_map_update_elem() pass a pointer to the stub op (using moff) into bpf_struct_ops_prepare_trampoline() and eventually arch_prepare_bpf_trampoline(). Additionally I was going to add BPF_TRAMP_F_INDIRECT. Then when F_INDIRECT is set, have it generate the CFI preamble based on the stub passed -- which will have the correct preamble for that method. At least, that's what I'm thinking now, I've yet to try and implement it. > > > The other case: > In the case of bpf_for_each_map_elem() the 'bloom_callback' is a subprog > of bpf_callback_t type. > So the kernel is doing: > ret = callback_fn((u64)(long)map, (u64)(long)&key, > (u64)(long)val, (u64)(long)callback_ctx, 0); > and that works on all archs including 32-bit. > The kernel is doing conversion from native calling convention to bpf calling convention > and for lucky archs like x86-64 the conversion is a true nop. > It's a plain indirect call to JITed bpf prog. > Note there is no interpreter support here. This works on archs with JITs only. > No ftrace and no trampoline. > > This case is easier to make work with kCFI. > The JIT will use: > cfi_bpf_hash: > .long __kcfi_typeid___bpf_prog_runX > like your patch already does. > And will use > extern u64 __bpf_callback_fn(u64, u64, u64, u64, u64); > cfi_bpf_subprog_hash: > .long __kcfi_typeid___bpf_callback_fn > to JIT all subprogs. See bpf_is_subprog(). Aaah!, yes it should be trivial to use another hash value when is_subprog in emit_prologue(). > btw there are two patchsets in progress that will touch core bits of JITs. > This one: > https://patchwork.kernel.org/project/netdevbpf/cover/20231201190654.1233153-1-song@kernel.org/ > and this one: > https://patchwork.kernel.org/project/netdevbpf/cover/20231011152725.95895-1-hffilwlqm@gmail.com/ > > so do you mind resending your current set with get_cfi_offset() change and > I can land it into bpf-next, so we can fix one bug at a time, > build on top, and avoid conflicts? I can do. > The more we dig the more it looks like that the follow up you planned to do > on top of this set isn't going to happen soon. > So should be ok going through bpf-next and then you can follow up with x86 things > after merge window? Yes, we can do that. Plans have changed on my side too -- I'm taking a 6 week break soon, so I'll do whatever I can before I'm out, and then continue from whatever state I find when I get back. Thanks for the details!
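A minimal sketch of that emit_prologue() change (emit_cfi() is assumed here as the helper that writes the kCFI hash or FineIBT preamble; the series may structure this differently):

/* arch/x86/net/bpf_jit_comp.c -- sketch, parameters trimmed */
static void emit_prologue(u8 **pprog, bool is_subprog)
{
	/*
	 * The main program is entered like __bpf_prog_runX(); subprogs get
	 * handed to helpers as bpf_callback_t, so stamp a different hash.
	 */
	emit_cfi(pprog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash);

	/* ... ENDBR, frame setup, tail-call bookkeeping ... */
}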
On Thu, Dec 07, 2023 at 10:31:05AM +0100, Peter Zijlstra wrote: > On Wed, Dec 06, 2023 at 01:39:43PM -0800, Alexei Starovoitov wrote: > > > > All is ok until kCFI comes into picture. > > Here we probably need to teach arch_prepare_bpf_trampoline() to emit > > different __kcfi_typeid depending on kernel function proto, > > so that caller hash checking logic won't be tripped. > > I suspect that requires to reverse engineer an algorithm of computing kcfi from clang. > > other ideas? > > I was going to try and extend bpf_struct_ops with a pointer, this > pointer will point to a struct of the right type with all ops filled out > as stubs. Right. Something like this, but it's more nuanced. The bpf_struct_ops concept is a generic mechanism to provide bpf-based callback to any set of kernel callbacks. bpf tcp CC plugs into: struct tcp_congestion_ops { /* do new cwnd calculation (required) */ void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked); /* call before changing ca_state (optional) */ void (*set_state)(struct sock *sk, u8 new_state); /* call when cwnd event occurs (optional) */ void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev); ... }; and from bpf side we don't touch tcp kernel bits at all. tcp stack doesn't know whether it's calling bpf based CC or builtin CC or CC provided by kernel module. bpf struct_ops mechanim is a zero cost extension to potentially any kernel mechanism that is already setup with callbacks. tcp_congestion_ops is one of them. The allowlisting of tcp_congestion_ops for bpf use is done in net/ipv4/bpf_tcp_ca.c via: struct bpf_struct_ops bpf_tcp_congestion_ops = { ... .reg = bpf_tcp_ca_reg, .unreg = bpf_tcp_ca_unreg, ... .name = "tcp_congestion_ops", }; static int bpf_tcp_ca_reg(void *kdata) { return tcp_register_congestion_control(kdata); } and int tcp_register_congestion_control(struct tcp_congestion_ops *type); is a normal TCP CC registration routine that is used by all CCs. The bpf struct_ops infra prepares everything inside 'struct tcp_congestion_ops' that makes it indistinguishable from normal kernel CC, except kCFI part. sadly. The kCFI challenge is that clang may not generate any __cfi + __kcfi_typeid at all. Like if vmlinux doesn't have any TCP CCs built-in there will be no kCFI hashes in the kernel that represent a required hash to call cong_avoid, set_state, cwnd_event. At the callsite like in net/ipv4/tcp_input.c icsk->icsk_ca_ops->cong_avoid(sk, ack, acked); the clang will compute the kcfi hash, but there will be no __cfi symbol in vmlinux. If there was one we could teach the verifier to look for __kcfi...cong_avoid in kallsyms, then read cfi hash from there and populate it into generated asm in arch_prepare_bpf_trampoline. So I'm thinking to do this: diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c index c7bbd8f3c708..afaadc2c0827 100644 --- a/net/ipv4/bpf_tcp_ca.c +++ b/net/ipv4/bpf_tcp_ca.c @@ -283,6 +283,12 @@ struct bpf_struct_ops bpf_tcp_congestion_ops = { .name = "tcp_congestion_ops", }; +/* never accessed, here for clang kCFI only */ +extern void __kcfi_cong_avoid(struct sock *sk, u32 ack, u32 acked); +__ADDRESSABLE(__kcfi_cong_avoid); +extern void __kcfi_set_state(struct sock *sk, u8 new_state); +__ADDRESSABLE(__kcfi_set_state); To force kcfi generation and then teach struct_ops infra to look for __kcfi_typeid___kcfi_##callbackname in kallsyms, read kcfi from there and populate into bpf trampoline. Since kcfi and bpf are not working well, I believe it's bpf folks job to fix it. 
Especially since it's starting to look bpf infra heavy. If you're interested I can, of course, point to relevant bits in kernel/bpf/bpf_struct_ops.c that would need to be extended to support such kcfi_typeid search, but I think it's my job to fix it. If I get stuck, I'll ask for help. I also researched a different approach. llvm does the following to compute the kcfi hash: llvm::ConstantInt *CodeGenModule::CreateKCFITypeId(QualType T) { if (auto *FnType = T->getAs<FunctionProtoType>()) T = getContext().getFunctionType( FnType->getReturnType(), FnType->getParamTypes(), FnType->getExtProtoInfo().withExceptionSpec(EST_None)); std::string OutName; llvm::raw_string_ostream Out(OutName); getCXXABI().getMangleContext().mangleCanonicalTypeName( T, Out, getCodeGenOpts().SanitizeCfiICallNormalizeIntegers); if (getCodeGenOpts().SanitizeCfiICallNormalizeIntegers) Out << ".normalized"; return llvm::ConstantInt::get(Int32Ty, static_cast<uint32_t>(llvm::xxHash64(OutName))); } xxhash is already available in the kernel. We can add type mangling logic and convert prototype of cong_avoid, set_state, etc (that are already available in vmlinux BTF) into a mangled string and apply xxhash on that string. This way we wouldn't need to add __kcfi stubs to net/ipv4/bpf_tcp_ca.c. kcfi will be computed on demand. > > > > This case is easier to make work with kCFI. > > The JIT will use: > > cfi_bpf_hash: > > .long __kcfi_typeid___bpf_prog_runX > > like your patch already does. > > And will use > > extern u64 __bpf_callback_fn(u64, u64, u64, u64, u64); > > cfi_bpf_subprog_hash: > > .long __kcfi_typeid___bpf_callback_fn > > to JIT all subprogs. See bpf_is_subprog(). > > Aaah!, yes it should be trivial to use another hash value when > is_subprog in emit_prologue(). Great. I'll wait for your respin and then will start building "kcfi for struct-ops" via one of the two approaches above. > Yes, we can do that. Plans have changed on my side too -- I'm taking a 6 > week break soon, so I'll do whatever I can before I'm out, and then > continue from whatever state I find when I get back. 6 weeks! Nice. Enjoy the long break. Last time I took 3 week PTO in a row I got bored after week 2 and went back to hacking :)
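On the second approach: the hashing half already exists in the kernel, since <linux/xxhash.h> exports xxh64(), and per the CreateKCFITypeId() snippet above the id is just the truncated xxHash64 of the mangled canonical function type. A sketch of that last step (untested against clang's output; producing the mangled string from BTF is the hard part and is not shown):

#include <linux/string.h>
#include <linux/xxhash.h>

/* kCFI type id = low 32 bits of xxHash64 over the mangled type name. */
static u32 kcfi_typeid_from_mangled(const char *mangled)
{
	return (u32)xxh64(mangled, strlen(mangled), 0);
}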
On Thu, Dec 07, 2023 at 02:32:12PM -0800, Alexei Starovoitov wrote: > On Thu, Dec 07, 2023 at 10:31:05AM +0100, Peter Zijlstra wrote: > > On Wed, Dec 06, 2023 at 01:39:43PM -0800, Alexei Starovoitov wrote: > > > > > > > All is ok until kCFI comes into picture. > > > Here we probably need to teach arch_prepare_bpf_trampoline() to emit > > > different __kcfi_typeid depending on kernel function proto, > > > so that caller hash checking logic won't be tripped. > > > I suspect that requires to reverse engineer an algorithm of computing kcfi from clang. > > > other ideas? > > > > I was going to try and extend bpf_struct_ops with a pointer, this > > pointer will point to a struct of the right type with all ops filled out > > as stubs. > > Right. Something like this, but it's more nuanced. > > The bpf_struct_ops concept is a generic mechanism to provide bpf-based callback > to any set of kernel callbacks. > > bpf tcp CC plugs into: > struct tcp_congestion_ops { > /* do new cwnd calculation (required) */ > void (*cong_avoid)(struct sock *sk, u32 ack, u32 acked); > > /* call before changing ca_state (optional) */ > void (*set_state)(struct sock *sk, u8 new_state); > > /* call when cwnd event occurs (optional) */ > void (*cwnd_event)(struct sock *sk, enum tcp_ca_event ev); > ... > }; > > and from bpf side we don't touch tcp kernel bits at all. > tcp stack doesn't know whether it's calling bpf based CC or builtin CC or CC provided by kernel module. > > bpf struct_ops mechanim is a zero cost extension to potentially any kernel mechanism > that is already setup with callbacks. tcp_congestion_ops is one of them. > > The allowlisting of tcp_congestion_ops for bpf use is done in net/ipv4/bpf_tcp_ca.c via: > > struct bpf_struct_ops bpf_tcp_congestion_ops = { > ... > .reg = bpf_tcp_ca_reg, > .unreg = bpf_tcp_ca_unreg, > ... > .name = "tcp_congestion_ops", > }; > static int bpf_tcp_ca_reg(void *kdata) > { > return tcp_register_congestion_control(kdata); > } > and > int tcp_register_congestion_control(struct tcp_congestion_ops *type); > is a normal TCP CC registration routine that is used by all CCs. > > The bpf struct_ops infra prepares everything inside > 'struct tcp_congestion_ops' that makes it indistinguishable from normal kernel CC, > except kCFI part. sadly. > > The kCFI challenge is that clang may not generate any __cfi + __kcfi_typeid at all. > Like if vmlinux doesn't have any TCP CCs built-in there will be no kCFI hashes > in the kernel that represent a required hash to call cong_avoid, set_state, cwnd_event. Right, I got that. So what I meant was something like the below (compiled only). By adding an actual ops struct, but filled with no-op stubs to bpf_struct_ops, we can crib the CFI hash from those functions. They'll never be called, but the compiler cannot tell and has to generate them. The only problem I now have is the one XXX, I'm not entirely sure what signature to use there. --- Index: linux-2.6/include/linux/bpf.h =================================================================== --- linux-2.6.orig/include/linux/bpf.h +++ linux-2.6/include/linux/bpf.h @@ -1060,6 +1060,17 @@ struct btf_func_model { */ #define BPF_TRAMP_F_TAIL_CALL_CTX BIT(7) +/* + * Indicate the trampoline should be suitable to receive indirect calls; + * without this indirectly calling the generated code can result in #UD/#CP, + * depending on the CFI options. + * + * Used by bpf_struct_ops. + * + * Incompatible with FENTRY usage, overloads @func_addr argument. 
+ */ +#define BPF_TRAMP_F_INDIRECT BIT(8) + /* Each call __bpf_prog_enter + call bpf_func + call __bpf_prog_exit is ~50 * bytes on x86. */ @@ -1695,6 +1706,7 @@ struct bpf_struct_ops { struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS]; u32 type_id; u32 value_id; + void *bpf_ops_stubs; }; #if defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL) @@ -1708,6 +1720,7 @@ int bpf_struct_ops_map_sys_lookup_elem(s int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks, struct bpf_tramp_link *link, const struct btf_func_model *model, + void *stub_func, void *image, void *image_end); static inline bool bpf_try_module_get(const void *data, struct module *owner) { Index: linux-2.6/kernel/bpf/bpf_struct_ops.c =================================================================== --- linux-2.6.orig/kernel/bpf/bpf_struct_ops.c +++ linux-2.6/kernel/bpf/bpf_struct_ops.c @@ -352,17 +352,16 @@ const struct bpf_link_ops bpf_struct_ops int bpf_struct_ops_prepare_trampoline(struct bpf_tramp_links *tlinks, struct bpf_tramp_link *link, const struct btf_func_model *model, - void *image, void *image_end) + void *stub_func, void *image, void *image_end) { - u32 flags; + u32 flags = BPF_TRAMP_F_INDIRECT; int size; tlinks[BPF_TRAMP_FENTRY].links[0] = link; tlinks[BPF_TRAMP_FENTRY].nr_links = 1; - /* BPF_TRAMP_F_RET_FENTRY_RET is only used by bpf_struct_ops, - * and it must be used alone. - */ - flags = model->ret_size > 0 ? BPF_TRAMP_F_RET_FENTRY_RET : 0; + + if (model->ret_size > 0) + flags |= BPF_TRAMP_F_RET_FENTRY_RET; size = arch_bpf_trampoline_size(model, flags, tlinks, NULL); if (size < 0) @@ -370,7 +369,7 @@ int bpf_struct_ops_prepare_trampoline(st if (size > (unsigned long)image_end - (unsigned long)image) return -E2BIG; return arch_prepare_bpf_trampoline(NULL, image, image_end, - model, flags, tlinks, NULL); + model, flags, tlinks, stub_func); } static long bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, @@ -504,11 +503,12 @@ static long bpf_struct_ops_map_update_el err = bpf_struct_ops_prepare_trampoline(tlinks, link, &st_ops->func_models[i], + *(void **)(st_ops->bpf_ops_stubs + moff), image, image_end); if (err < 0) goto reset_unlock; - *(void **)(kdata + moff) = image; + *(void **)(kdata + moff) = image + cfi_get_offset(); image += err; /* put prog_id to udata */ Index: linux-2.6/net/ipv4/bpf_tcp_ca.c =================================================================== --- linux-2.6.orig/net/ipv4/bpf_tcp_ca.c +++ linux-2.6/net/ipv4/bpf_tcp_ca.c @@ -271,6 +271,74 @@ static int bpf_tcp_ca_validate(void *kda return tcp_validate_congestion_control(kdata); } +static u32 bpf_tcp_ca_ssthresh(struct sock *sk) +{ + return 0; +} + +static void bpf_tcp_ca_cong_avoid(struct sock *sk, u32 ack, u32 acked) +{ +} + +static void bpf_tcp_ca_set_state(struct sock *sk, u8 new_state) +{ +} + +static void bpf_tcp_ca_cwnd_event(struct sock *sk, enum tcp_ca_event ev) +{ +} + +static void bpf_tcp_ca_in_ack_event(struct sock *sk, u32 flags) +{ +} + +static void bpf_tcp_ca_pkts_acked(struct sock *sk, const struct ack_sample *sample) +{ +} + +static u32 bpf_tcp_ca_min_tso_segs(struct sock *sk) +{ + return 0; +} + +static void bpf_tcp_ca_cong_control(struct sock *sk, const struct rate_sample *rs) +{ +} + +static u32 bpf_tcp_ca_undo_cwnd(struct sock *sk) +{ + return 0; +} + +static u32 bpf_tcp_ca_sndbuf_expand(struct sock *sk) +{ + return 0; +} + +static void __bpf_tcp_ca_init(struct sock *sk) +{ +} + +static void __bpf_tcp_ca_release(struct sock *sk) +{ +} + +static struct tcp_congestion_ops 
__bpf_ops_tcp_congestion_ops = { + .ssthresh = bpf_tcp_ca_ssthresh, + .cong_avoid = bpf_tcp_ca_cong_avoid, + .set_state = bpf_tcp_ca_set_state, + .cwnd_event = bpf_tcp_ca_cwnd_event, + .in_ack_event = bpf_tcp_ca_in_ack_event, + .pkts_acked = bpf_tcp_ca_pkts_acked, + .min_tso_segs = bpf_tcp_ca_min_tso_segs, + .cong_control = bpf_tcp_ca_cong_control, + .undo_cwnd = bpf_tcp_ca_undo_cwnd, + .sndbuf_expand = bpf_tcp_ca_sndbuf_expand, + + .init = __bpf_tcp_ca_init, + .release = __bpf_tcp_ca_release, +}; + struct bpf_struct_ops bpf_tcp_congestion_ops = { .verifier_ops = &bpf_tcp_ca_verifier_ops, .reg = bpf_tcp_ca_reg, @@ -281,6 +349,7 @@ struct bpf_struct_ops bpf_tcp_congestion .init = bpf_tcp_ca_init, .validate = bpf_tcp_ca_validate, .name = "tcp_congestion_ops", + .bpf_ops_stubs = &__bpf_ops_tcp_congestion_ops, }; static int __init bpf_tcp_ca_kfunc_init(void) Index: linux-2.6/arch/x86/include/asm/cfi.h =================================================================== --- linux-2.6.orig/arch/x86/include/asm/cfi.h +++ linux-2.6/arch/x86/include/asm/cfi.h @@ -123,6 +123,8 @@ static inline cfi_get_offset(void) } #define cfi_get_offset cfi_get_offset +extern u32 cfi_get_func_hash(void *func); + #else static inline enum bug_trap_type handle_cfi_failure(struct pt_regs *regs) { @@ -130,6 +132,10 @@ static inline enum bug_trap_type handle_ } #define cfi_bpf_hash 0U #define cfi_bpf_subprog_hash 0U +static inline u32 cfi_get_func_hash(void *func) +{ + return 0; +} #endif /* CONFIG_CFI_CLANG */ #endif /* _ASM_X86_CFI_H */ Index: linux-2.6/arch/x86/kernel/alternative.c =================================================================== --- linux-2.6.orig/arch/x86/kernel/alternative.c +++ linux-2.6/arch/x86/kernel/alternative.c @@ -883,6 +883,28 @@ asm ( " .size cfi_bpf_subprog_hash, 4 \n" " .popsection \n" ); + +u32 cfi_get_func_hash(void *func) +{ + u32 hash; + + func -= cfi_get_offset(); + switch (cfi_mode) { + case CFI_FINEIBT: + func += 7; + break; + case CFI_KCFI: + func += 1; + break; + default: + return 0; + } + + if (get_kernel_nofault(hash, func)) + return 0; + + return hash; +} #endif #ifdef CONFIG_FINEIBT Index: linux-2.6/net/bpf/bpf_dummy_struct_ops.c =================================================================== --- linux-2.6.orig/net/bpf/bpf_dummy_struct_ops.c +++ linux-2.6/net/bpf/bpf_dummy_struct_ops.c @@ -62,7 +62,7 @@ static int dummy_ops_copy_args(struct bp static int dummy_ops_call_op(void *image, struct bpf_dummy_ops_test_args *args) { - dummy_ops_test_ret_fn test = (void *)image; + dummy_ops_test_ret_fn test = (void *)image + cfi_get_offset(); struct bpf_dummy_ops_state *state = NULL; /* state needs to be NULL if args[0] is 0 */ @@ -119,6 +119,7 @@ int bpf_struct_ops_test_run(struct bpf_p op_idx = prog->expected_attach_type; err = bpf_struct_ops_prepare_trampoline(tlinks, link, &st_ops->func_models[op_idx], + /* XXX */ NULL, image, image + PAGE_SIZE); if (err < 0) goto out; @@ -219,6 +220,28 @@ static void bpf_dummy_unreg(void *kdata) { } +static int bpf_dummy_test_1(struct bpf_dummy_ops_state *cb) +{ + return 0; +} + +static int bpf_dummy_test_2(struct bpf_dummy_ops_state *cb, int a1, unsigned short a2, + char a3, unsigned long a4) +{ + return 0; +} + +static int bpf_dummy_test_sleepable(struct bpf_dummy_ops_state *cb) +{ + return 0; +} + +static struct bpf_dummy_ops __bpf_bpf_dummy_ops = { + .test_1 = bpf_dummy_test_1, + .test_2 = bpf_dummy_test_2, + .test_sleepable = bpf_dummy_test_sleepable, +}; + struct bpf_struct_ops bpf_bpf_dummy_ops = { .verifier_ops = 
&bpf_dummy_verifier_ops, .init = bpf_dummy_init, @@ -227,4 +250,5 @@ struct bpf_struct_ops bpf_bpf_dummy_ops .reg = bpf_dummy_reg, .unreg = bpf_dummy_unreg, .name = "bpf_dummy_ops", + .bpf_ops_stubs = &__bpf_bpf_dummy_ops, }; Index: linux-2.6/arch/x86/net/bpf_jit_comp.c =================================================================== --- linux-2.6.orig/arch/x86/net/bpf_jit_comp.c +++ linux-2.6/arch/x86/net/bpf_jit_comp.c @@ -312,9 +312,8 @@ static void pop_callee_regs(u8 **pprog, * in arch/x86/kernel/alternative.c */ -static void emit_fineibt(u8 **pprog, bool is_subprog) +static void emit_fineibt(u8 **pprog, u32 hash) { - u32 hash = is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash; u8 *prog = *pprog; EMIT_ENDBR(); @@ -327,9 +326,8 @@ static void emit_fineibt(u8 **pprog, boo *pprog = prog; } -static void emit_kcfi(u8 **pprog, bool is_subprog) +static void emit_kcfi(u8 **pprog, u32 hash) { - u32 hash = is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash; u8 *prog = *pprog; EMIT1_off32(0xb8, hash); /* movl $hash, %eax */ @@ -351,17 +349,17 @@ static void emit_kcfi(u8 **pprog, bool i *pprog = prog; } -static void emit_cfi(u8 **pprog, bool is_subprog) +static void emit_cfi(u8 **pprog, u32 hash) { u8 *prog = *pprog; switch (cfi_mode) { case CFI_FINEIBT: - emit_fineibt(&prog, is_subprog); + emit_fineibt(&prog, hash); break; case CFI_KCFI: - emit_kcfi(&prog, is_subprog); + emit_kcfi(&prog, hash); break; default: @@ -383,7 +381,7 @@ static void emit_prologue(u8 **pprog, u3 { u8 *prog = *pprog; - emit_cfi(&prog, is_subprog); + emit_cfi(&prog, is_subprog ? cfi_bpf_subprog_hash : cfi_bpf_hash); /* BPF trampoline can be made to work without these nops, * but let's waste 5 bytes for now and optimize later */ @@ -2596,20 +2594,27 @@ static int __arch_prepare_bpf_trampoline prog = rw_image; - EMIT_ENDBR(); - /* - * This is the direct-call trampoline, as such it needs accounting - * for the __fentry__ call. - */ - x86_call_depth_emit_accounting(&prog, NULL); + if (flags & BPF_TRAMP_F_INDIRECT) { + /* + * Indirect call for bpf_struct_ops + */ + emit_cfi(&prog, cfi_get_func_hash(func_addr)); + } else { + /* + * Direct-call fentry stub, as such it needs accounting for the + * __fentry__ call. + */ + x86_call_depth_emit_accounting(&prog, NULL); + } EMIT1(0x55); /* push rbp */ EMIT3(0x48, 0x89, 0xE5); /* mov rbp, rsp */ - if (!is_imm8(stack_size)) + if (!is_imm8(stack_size)) { /* sub rsp, stack_size */ EMIT3_off32(0x48, 0x81, 0xEC, stack_size); - else + } else { /* sub rsp, stack_size */ EMIT4(0x48, 0x83, 0xEC, stack_size); + } if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) EMIT1(0x50); /* push rax */ /* mov QWORD PTR [rbp - rbx_off], rbx */ @@ -2643,10 +2648,11 @@ static int __arch_prepare_bpf_trampoline } } - if (fentry->nr_links) + if (fentry->nr_links) { if (invoke_bpf(m, &prog, fentry, regs_off, run_ctx_off, flags & BPF_TRAMP_F_RET_FENTRY_RET, image, rw_image)) return -EINVAL; + } if (fmod_ret->nr_links) { branches = kcalloc(fmod_ret->nr_links, sizeof(u8 *), @@ -2665,11 +2671,12 @@ static int __arch_prepare_bpf_trampoline restore_regs(m, &prog, regs_off); save_args(m, &prog, arg_stack_off, true); - if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) + if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) { /* Before calling the original function, restore the * tail_call_cnt from stack to rax. 
*/ RESTORE_TAIL_CALL_CNT(stack_size); + } if (flags & BPF_TRAMP_F_ORIG_STACK) { emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, 8); @@ -2698,17 +2705,19 @@ static int __arch_prepare_bpf_trampoline /* Update the branches saved in invoke_bpf_mod_ret with the * aligned address of do_fexit. */ - for (i = 0; i < fmod_ret->nr_links; i++) + for (i = 0; i < fmod_ret->nr_links; i++) { emit_cond_near_jump(&branches[i], image + (prog - (u8 *)rw_image), image + (branches[i] - (u8 *)rw_image), X86_JNE); + } } - if (fexit->nr_links) + if (fexit->nr_links) { if (invoke_bpf(m, &prog, fexit, regs_off, run_ctx_off, false, image, rw_image)) { ret = -EINVAL; goto cleanup; } + } if (flags & BPF_TRAMP_F_RESTORE_REGS) restore_regs(m, &prog, regs_off); @@ -2725,11 +2734,12 @@ static int __arch_prepare_bpf_trampoline ret = -EINVAL; goto cleanup; } - } else if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) + } else if (flags & BPF_TRAMP_F_TAIL_CALL_CTX) { /* Before running the original function, restore the * tail_call_cnt from stack to rax. */ RESTORE_TAIL_CALL_CNT(stack_size); + } /* restore return value of orig_call or fentry prog back into RAX */ if (save_ret) @@ -2737,9 +2747,10 @@ static int __arch_prepare_bpf_trampoline emit_ldx(&prog, BPF_DW, BPF_REG_6, BPF_REG_FP, -rbx_off); EMIT1(0xC9); /* leave */ - if (flags & BPF_TRAMP_F_SKIP_FRAME) + if (flags & BPF_TRAMP_F_SKIP_FRAME) { /* skip our return address and return to parent */ EMIT4(0x48, 0x83, 0xC4, 8); /* add rsp, 8 */ + } emit_return(&prog, image + (prog - (u8 *)rw_image)); /* Make sure the trampoline generation logic doesn't overflow */ if (WARN_ON_ONCE(prog > (u8 *)rw_image_end - BPF_INSN_SAFETY)) {
On Fri, Dec 08, 2023 at 11:29:40AM +0100, Peter Zijlstra wrote:
> The only problem I now have is the one XXX, I'm not entirely sure what
> signature to use there.
> @@ -119,6 +119,7 @@ int bpf_struct_ops_test_run(struct bpf_p
>  	op_idx = prog->expected_attach_type;
>  	err = bpf_struct_ops_prepare_trampoline(tlinks, link,
>  					&st_ops->func_models[op_idx],
> +					/* XXX */ NULL,
>  					image, image + PAGE_SIZE);
>  	if (err < 0)
>  		goto out;

Duh, that should of course be something of dummy_ops_test_ret_fn type.
Let me go fix that.
On Fri, Dec 08, 2023 at 02:40:41PM +0100, Peter Zijlstra wrote: > On Fri, Dec 08, 2023 at 11:29:40AM +0100, Peter Zijlstra wrote: > > The only problem I now have is the one XXX, I'm not entirely sure what > > signature to use there. > > > @@ -119,6 +119,7 @@ int bpf_struct_ops_test_run(struct bpf_p > > op_idx = prog->expected_attach_type; > > err = bpf_struct_ops_prepare_trampoline(tlinks, link, > > &st_ops->func_models[op_idx], > > + /* XXX */ NULL, > > image, image + PAGE_SIZE); > > if (err < 0) > > goto out; > > Duh, that should ofcourse be something of dummy_ops_test_ret_fn type. > Let me go fix that. Next one.. bpf_obj_free_fields: field->kptr.dtor(xchg_field); The one that trips is bpf_cgroup_release(). objtool doesn't think the address of that function 'escapes' and 'helpfully' seals that function, and then BPF thinks it does escape and manages the above indirect call and *boom*. How can I tell which functions escape according to BPF such that I might teach objtool this?
On Fri, Dec 8, 2023 at 5:41 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Dec 08, 2023 at 11:29:40AM +0100, Peter Zijlstra wrote: > > The only problem I now have is the one XXX, I'm not entirely sure what > > signature to use there. > > > @@ -119,6 +119,7 @@ int bpf_struct_ops_test_run(struct bpf_p > > op_idx = prog->expected_attach_type; > > err = bpf_struct_ops_prepare_trampoline(tlinks, link, > > &st_ops->func_models[op_idx], > > + /* XXX */ NULL, > > image, image + PAGE_SIZE); > > if (err < 0) > > goto out; > > Duh, that should ofcourse be something of dummy_ops_test_ret_fn type. > Let me go fix that. Right. That should work. A bit wasteful to generate real code just to read hash from it via cfi_get_func_hash(), but it's a neat idea. I guess it's hard to get kcfi from __ADDRESSABLE in plain C and sprinkling asm("cfi_xxx: .long __kcfi_typeid..."); is worse? Even if it's a macro ? That macro would be used to define cfi_bpf_hash and all other stubs?
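For what it's worth, a rough, untested sketch of what such a macro could
look like, modeled directly on the cfi_bpf_hash asm from the patch further
down in this thread; DEFINE_CFI_HASH is a made-up name:

#include <linux/compiler.h>
#include <linux/types.h>

/*
 * Untested sketch: emit a u32 object initialized with the compiler-generated
 * __kcfi_typeid_<func> symbol for a given function.  __ADDRESSABLE() forces
 * a reference so the compiler emits the typeid symbol at all, the same trick
 * the cfi_bpf_hash patch uses for __bpf_prog_runX.
 */
#define DEFINE_CFI_HASH(name, func)					\
	__ADDRESSABLE(func);						\
	asm("	.pushsection .data..ro_after_init,\"aw\",@progbits\n"	\
	    "	.type " #name ",@object				\n"	\
	    "	.globl " #name "				\n"	\
	    "	.p2align 2, 0x0					\n"	\
	    #name ":						\n"	\
	    "	.long __kcfi_typeid_" #func "			\n"	\
	    "	.size " #name ", 4				\n"	\
	    "	.popsection					\n");	\
	extern u32 name

Usage would be something like DEFINE_CFI_HASH(cfi_cong_avoid_hash,
bpf_tcp_ca_cong_avoid); note it still needs an extern declaration with the
exact prototype for every op so the compiler has something to hash, which is
part of why the stub-struct approach ends up no worse.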
On Fri, Dec 8, 2023 at 9:22 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Dec 08, 2023 at 02:40:41PM +0100, Peter Zijlstra wrote: > > On Fri, Dec 08, 2023 at 11:29:40AM +0100, Peter Zijlstra wrote: > > > The only problem I now have is the one XXX, I'm not entirely sure what > > > signature to use there. > > > > > @@ -119,6 +119,7 @@ int bpf_struct_ops_test_run(struct bpf_p > > > op_idx = prog->expected_attach_type; > > > err = bpf_struct_ops_prepare_trampoline(tlinks, link, > > > &st_ops->func_models[op_idx], > > > + /* XXX */ NULL, > > > image, image + PAGE_SIZE); > > > if (err < 0) > > > goto out; > > > > Duh, that should ofcourse be something of dummy_ops_test_ret_fn type. > > Let me go fix that. > > Next one.. bpf_obj_free_fields: field->kptr.dtor(xchg_field); > > The one that trips is bpf_cgroup_release(). > > objtool doesn't think the address of that function 'escapes' and > 'helpfully' seals that function, and then BPF thinks it does escape and > manages the above indirect call and *boom*. > > How can I tell which functions escape according to BPF such that I might > teach objtool this? I'm not following. Are you asking to annotate __bpf_kfunc void bpf_cgroup_release(struct cgroup *cgrp) somehow so that objtool knows that it will be called indirectly? typedef void (*btf_dtor_kfunc_t)(void *); btf_dtor_kfunc_t dtor; but the bpf_cgroup_release takes 'struct cgroup*'. From kcfi pov void * == struct cgroup * ? Do we need to change it to 'void *cgrp' ? What is "sealing" by objtool?
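On the void * vs struct cgroup * question: kCFI hashes the mangled
prototype, so the two give different type ids even though the ABI is
identical. A stand-alone illustration (names simplified from
bpf_obj_free_fields(), not the actual kernel code):

struct cgroup;
extern void bpf_cgroup_release(struct cgroup *cgrp);

typedef void (*btf_dtor_kfunc_t)(void *);

static void free_kptr_field(void *xchg_field)
{
	/*
	 * In the kernel the address comes from kallsyms_lookup_name(), so
	 * the compiler never sees this cast; the mismatch only surfaces at
	 * runtime, when the type id embedded at the call site (that of
	 * void (*)(void *)) is compared against the one in the callee's
	 * preamble (that of void (*)(struct cgroup *)).
	 */
	btf_dtor_kfunc_t dtor = (btf_dtor_kfunc_t)bpf_cgroup_release;

	dtor(xchg_field);	/* trips kCFI (#UD) or FineIBT (#CP) */
}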
On Fri, Dec 08, 2023 at 11:32:07AM -0800, Alexei Starovoitov wrote: > On Fri, Dec 8, 2023 at 5:41 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Fri, Dec 08, 2023 at 11:29:40AM +0100, Peter Zijlstra wrote: > > > The only problem I now have is the one XXX, I'm not entirely sure what > > > signature to use there. > > > > > @@ -119,6 +119,7 @@ int bpf_struct_ops_test_run(struct bpf_p > > > op_idx = prog->expected_attach_type; > > > err = bpf_struct_ops_prepare_trampoline(tlinks, link, > > > &st_ops->func_models[op_idx], > > > + /* XXX */ NULL, > > > image, image + PAGE_SIZE); > > > if (err < 0) > > > goto out; > > > > Duh, that should ofcourse be something of dummy_ops_test_ret_fn type. > > Let me go fix that. > > Right. That should work. > A bit wasteful to generate real code just to read hash from it > via cfi_get_func_hash(), but it's a neat idea. Right, bit wasteful. But the advantage is that I get a structure with pointers that exactly mirrors the structure we're writing. > I guess it's hard to get kcfi from __ADDRESSABLE in plain C > and sprinkling asm("cfi_xxx: .long __kcfi_typeid..."); is worse? > Even if it's a macro ? I can try this, but I'm not sure it'll be pretty. Even if I wrap it in a decent macro, I still get to define a ton of variables and then wrap the lot into a structure -- one that expects function pointers. I'll see how horrible it will become.
On Fri, Dec 08, 2023 at 11:40:27AM -0800, Alexei Starovoitov wrote:
> What is "sealing" by objtool?
Ah, an LTO-like pass that tries to determine if a function ever gets its
address taken.
The basic problem is that the compiler (barring its own LTO pass) must
emit CFI for every non-local symbol in a translation unit. This means
that a ton of functions will have CFI on, even if they're never
indirectly called.
So objtool collects all functions that have CFI but never get their
address taken and sticks their addresses in a .discard section; at
boot time we iterate this section and scribble over the CFI state of
all these functions, making them invalid as indirect call targets.
For one, this prevents malicious code from finding a function address in
the symbol table and calling it indirectly anyway as a means to
circumvent the EXPORT_SYMBOL restrictions.
So objtool does not think bpf_cgroup_release() gets its address taken;
specifically, it does not find its address in any section it knows about.
Hence it goes on the list, we scribble it, and the indirect call
goes *boom*.
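A conceptual sketch of that boot-time pass, with made-up names and section
handling (the real machinery lives in objtool plus
arch/x86/kernel/alternative.c):

#include <linux/init.h>
#include <linux/types.h>

/* Hypothetical: wipe the kCFI hash / ENDBR so indirect calls to @func trap. */
extern void poison_cfi_preamble(void *func);

/*
 * Conceptual only: objtool is assumed to have emitted a table of PC-relative
 * offsets to every function that has CFI but whose address it never saw
 * escape; walking the table at boot "seals" those functions as
 * indirect-call targets.
 */
static void __init seal_unreferenced_functions(s32 *start, s32 *end)
{
	s32 *s;

	for (s = start; s < end; s++) {
		void *func = (void *)s + *s;	/* decode PC-relative entry */

		poison_cfi_preamble(func);
	}
}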
On Fri, Dec 08, 2023 at 11:40:27AM -0800, Alexei Starovoitov wrote: > typedef void (*btf_dtor_kfunc_t)(void *); > btf_dtor_kfunc_t dtor; > but the bpf_cgroup_release takes 'struct cgroup*'. > From kcfi pov void * == struct cgroup * ? > Do we need to change it to 'void *cgrp' ? Yes, doing that naively like the below, gets me lovely things like: validate_case:FAIL:expect_msg unexpected error: -22 VERIFIER LOG: ============= ============= EXPECTED MSG: 'Possibly NULL pointer passed to trusted arg0' #48/7 cgrp_kfunc/cgrp_kfunc_acquire_untrusted:FAIL run_subtest:PASS:obj_open_mem 0 nsec libbpf: extern (func ksym) 'bpf_cgroup_release': func_proto [148] incompatible with vmlinux [125610] libbpf: failed to load object 'cgrp_kfunc_failure' But let me try rebuilding everything.. --- diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index b3be5742d6f1..078b207af7f0 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2145,10 +2145,11 @@ __bpf_kfunc struct task_struct *bpf_task_acquire(struct task_struct *p) * bpf_task_release - Release the reference acquired on a task. * @p: The task on which a reference is being released. */ -__bpf_kfunc void bpf_task_release(struct task_struct *p) +__bpf_kfunc void bpf_task_release(void *p) { put_task_struct_rcu_user(p); } +EXPORT_SYMBOL_GPL(bpf_task_release); #ifdef CONFIG_CGROUPS /** @@ -2169,10 +2170,11 @@ __bpf_kfunc struct cgroup *bpf_cgroup_acquire(struct cgroup *cgrp) * drops to 0. * @cgrp: The cgroup on which a reference is being released. */ -__bpf_kfunc void bpf_cgroup_release(struct cgroup *cgrp) +__bpf_kfunc void bpf_cgroup_release(void *cgrp) { cgroup_put(cgrp); } +EXPORT_SYMBOL_GPL(bpf_cgroup_release); /** * bpf_cgroup_ancestor - Perform a lookup on an entry in a cgroup's ancestor
On Fri, Dec 8, 2023 at 12:35 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Dec 08, 2023 at 11:40:27AM -0800, Alexei Starovoitov wrote: > > > typedef void (*btf_dtor_kfunc_t)(void *); > > btf_dtor_kfunc_t dtor; > > but the bpf_cgroup_release takes 'struct cgroup*'. > > From kcfi pov void * == struct cgroup * ? > > Do we need to change it to 'void *cgrp' ? > > Yes, doing that naively like the below, gets me lovely things like: > > validate_case:FAIL:expect_msg unexpected error: -22 > VERIFIER LOG: > ============= > ============= > EXPECTED MSG: 'Possibly NULL pointer passed to trusted arg0' > #48/7 cgrp_kfunc/cgrp_kfunc_acquire_untrusted:FAIL > run_subtest:PASS:obj_open_mem 0 nsec > libbpf: extern (func ksym) 'bpf_cgroup_release': func_proto [148] incompatible with vmlinux [125610] > libbpf: failed to load object 'cgrp_kfunc_failure' > > > But let me try rebuilding everything.. > > > --- > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index b3be5742d6f1..078b207af7f0 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -2145,10 +2145,11 @@ __bpf_kfunc struct task_struct *bpf_task_acquire(struct task_struct *p) > * bpf_task_release - Release the reference acquired on a task. > * @p: The task on which a reference is being released. > */ > -__bpf_kfunc void bpf_task_release(struct task_struct *p) > +__bpf_kfunc void bpf_task_release(void *p) Yeah. That won't work. We need a wrapper. Since bpf prog is also calling it directly. In progs/task_kfunc_common.h void bpf_task_release(struct task_struct *p) __ksym; than later both libbpf and the verifier check that what bpf prog is calling actually matches the proto of what is in the kernel. Effectively we're doing strong prototype check at load time. btw instead of EXPORT_SYMBOL_GPL(bpf_task_release) can __ADDRESSABLE be used ? Since it's not an export symbol.
On Fri, Dec 8, 2023 at 12:18 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Dec 08, 2023 at 11:32:07AM -0800, Alexei Starovoitov wrote: > > On Fri, Dec 8, 2023 at 5:41 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Fri, Dec 08, 2023 at 11:29:40AM +0100, Peter Zijlstra wrote: > > > > The only problem I now have is the one XXX, I'm not entirely sure what > > > > signature to use there. > > > > > > > @@ -119,6 +119,7 @@ int bpf_struct_ops_test_run(struct bpf_p > > > > op_idx = prog->expected_attach_type; > > > > err = bpf_struct_ops_prepare_trampoline(tlinks, link, > > > > &st_ops->func_models[op_idx], > > > > + /* XXX */ NULL, > > > > image, image + PAGE_SIZE); > > > > if (err < 0) > > > > goto out; > > > > > > Duh, that should ofcourse be something of dummy_ops_test_ret_fn type. > > > Let me go fix that. > > > > Right. That should work. > > A bit wasteful to generate real code just to read hash from it > > via cfi_get_func_hash(), but it's a neat idea. > > Right, bit wasteful. But the advantage is that I get a structure with > pointers that exactly mirrors the structure we're writing. > > > I guess it's hard to get kcfi from __ADDRESSABLE in plain C > > and sprinkling asm("cfi_xxx: .long __kcfi_typeid..."); is worse? > > Even if it's a macro ? > > I can try this, but I'm not sure it'll be pretty. Even if I wrap it in a > decent macro, I still get to define a ton of variables and then wrap the > lot into a structure -- one that expects function pointers. > > I'll see how horrible it will become. I mean we don't need to store a pointer to a func in stubs. Can it be, roughly: extern void bpf_tcp_ca_cong_avoid(struct sock *sk, u32 ack, u32 acked); KCFI_MACRO(hash_of_cong_avoid, bpf_tcp_ca_cong_avoid); u32 __array_of_kcfi_hash[] = {hash_of_cong_avoid, hash_of_set_state,...}; .bpf_ops_stubs = __array_of_kcfi_hash,
On Fri, Dec 08, 2023 at 12:41:03PM -0800, Alexei Starovoitov wrote:
> On Fri, Dec 8, 2023 at 12:35 PM Peter Zijlstra <peterz@infradead.org> wrote:
> > -__bpf_kfunc void bpf_task_release(struct task_struct *p)
> > +__bpf_kfunc void bpf_task_release(void *p)
>
> Yeah. That won't work. We need a wrapper.
> Since bpf prog is also calling it directly.
> In progs/task_kfunc_common.h
> void bpf_task_release(struct task_struct *p) __ksym;
>
> than later both libbpf and the verifier check that
> what bpf prog is calling actually matches the proto
> of what is in the kernel.
> Effectively we're doing strong prototype check at load time.

I'm still somewhat confused about how this works: where does BPF get the
address of the function from? And what should I call the wrapper?

> btw instead of EXPORT_SYMBOL_GPL(bpf_task_release)
> can __ADDRESSABLE be used ?
> Since it's not an export symbol.

No, __ADDRESSABLE() is expressly ignored, but we have IBT_NOSEAL() that
should do it. I'll rename the thing and lift it out of x86 to avoid
breaking all other arch builds.
On Fri, Dec 08, 2023 at 12:45:51PM -0800, Alexei Starovoitov wrote:
> I mean we don't need to store a pointer to a func in stubs.
> Can it be, roughly:
>
> extern void bpf_tcp_ca_cong_avoid(struct sock *sk, u32 ack, u32 acked);
> KCFI_MACRO(hash_of_cong_avoid, bpf_tcp_ca_cong_avoid);
> u32 __array_of_kcfi_hash[] = {hash_of_cong_avoid, hash_of_set_state,...};
> .bpf_ops_stubs = __array_of_kcfi_hash,

But then how do I index this array? The bpf_ops_stubs thing having the
same layout as the target struct made it easy and we could use 'moff'
for both.

That also keeps working if someone adds a member to the struct or
moves some members around.
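Spelled out as a helper (hypothetical name; types and fields as in the patch
earlier in the thread, so it assumes linux/bpf.h and asm/cfi.h), the same
BTF member offset indexes both the stub struct and the real ops struct:

/*
 * Sketch using the names from the patch above: bpf_ops_stubs mirrors the
 * layout of the target struct, so the BTF member offset (moff) of e.g.
 * .cong_avoid finds the matching stub, whose preamble donates the kCFI
 * hash.  The stub itself is never called.
 */
static u32 bpf_struct_ops_member_hash(const struct bpf_struct_ops *st_ops, u32 moff)
{
	void *stub = *(void **)(st_ops->bpf_ops_stubs + moff);

	return stub ? cfi_get_func_hash(stub) : 0;
}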
On Fri, Dec 8, 2023 at 12:52 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Dec 08, 2023 at 12:41:03PM -0800, Alexei Starovoitov wrote: > > On Fri, Dec 8, 2023 at 12:35 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > -__bpf_kfunc void bpf_task_release(struct task_struct *p) > > > +__bpf_kfunc void bpf_task_release(void *p) > > > > Yeah. That won't work. We need a wrapper. > > Since bpf prog is also calling it directly. > > In progs/task_kfunc_common.h > > void bpf_task_release(struct task_struct *p) __ksym; > > > > than later both libbpf and the verifier check that > > what bpf prog is calling actually matches the proto > > of what is in the kernel. > > Effectively we're doing strong prototype check at load time. > > I'm still somewhat confused on how this works, where does BPF get the > address of the function from? and what should I call the wrapper? It starts with register_btf_id_dtor_kfuncs() that takes a set of btf_ids: {btf_id_of_type, btf_id_of_dtor_function}, ... Then based on btf_id_of_dtor_function we find its type proto, name, do checks, and eventually: addr = kallsyms_lookup_name(dtor_func_name); field->kptr.dtor = (void *)addr; bpf_task_release(struct task_struct *p) would need to stay as-is, but we can have a wrapper void bpf_task_release_dtor(void *p) { bpf_task_release(p); } And adjust the above lookup with extra "_dtor" suffix. > > btw instead of EXPORT_SYMBOL_GPL(bpf_task_release) > > can __ADDRESSABLE be used ? > > Since it's not an export symbol. > > No __ADDRESSABLE() is expressly ignored, but we have IBT_NOSEAL() that > should do it. I'll rename the thing and lift it out of x86 to avoid > breaking all other arch builds. Makes sense.
On Fri, Dec 8, 2023 at 12:56 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Dec 08, 2023 at 12:45:51PM -0800, Alexei Starovoitov wrote: > > > I mean we don't need to store a pointer to a func in stubs. > > Can it be, roughly: > > > > extern void bpf_tcp_ca_cong_avoid(struct sock *sk, u32 ack, u32 acked); > > KCFI_MACRO(hash_of_cong_avoid, bpf_tcp_ca_cong_avoid); > > u32 __array_of_kcfi_hash[] = {hash_of_cong_avoid, hash_of_set_state,...}; > > .bpf_ops_stubs = __array_of_kcfi_hash, > > But then how do I index this array? The bpf_ops_stubs thing having the > same layout at the target struct made it easy and we could use 'moff' > for both. > > That also remains working if someone adds a member to the struct or > moves some members around. I was thinking to just moff/2 assuming u32 array will have the same order, but I missed the fact that init() and release() in tcp_congestion_ops come after some variables. And you're right it's more fragile when things change in tcp_congestion_ops. Storing u32 hash as (void *) into function pointers is just ugly. Let's go with your initial approach.
On Fri, Dec 08, 2023 at 12:58:01PM -0800, Alexei Starovoitov wrote: > On Fri, Dec 8, 2023 at 12:52 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Fri, Dec 08, 2023 at 12:41:03PM -0800, Alexei Starovoitov wrote: > > > On Fri, Dec 8, 2023 at 12:35 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > -__bpf_kfunc void bpf_task_release(struct task_struct *p) > > > > +__bpf_kfunc void bpf_task_release(void *p) > > > > > > Yeah. That won't work. We need a wrapper. > > > Since bpf prog is also calling it directly. > > > In progs/task_kfunc_common.h > > > void bpf_task_release(struct task_struct *p) __ksym; > > > > > > than later both libbpf and the verifier check that > > > what bpf prog is calling actually matches the proto > > > of what is in the kernel. > > > Effectively we're doing strong prototype check at load time. > > > > I'm still somewhat confused on how this works, where does BPF get the > > address of the function from? and what should I call the wrapper? > > It starts with > register_btf_id_dtor_kfuncs() that takes a set of btf_ids: > {btf_id_of_type, btf_id_of_dtor_function}, ... > > Then based on btf_id_of_dtor_function we find its type proto, name, do checks, > and eventually: > addr = kallsyms_lookup_name(dtor_func_name); > field->kptr.dtor = (void *)addr; > > bpf_task_release(struct task_struct *p) would need to stay as-is, > but we can have a wrapper > void bpf_task_release_dtor(void *p) > { > bpf_task_release(p); > } > > And adjust the above lookup with extra "_dtor" suffix. > > > > btw instead of EXPORT_SYMBOL_GPL(bpf_task_release) > > > can __ADDRESSABLE be used ? > > > Since it's not an export symbol. > > > > No __ADDRESSABLE() is expressly ignored, but we have IBT_NOSEAL() that > > should do it. I'll rename the thing and lift it out of x86 to avoid > > breaking all other arch builds. > > Makes sense. Ok, did that. Current patches (on top of bpf-next) are here: git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/cfi (really should try and write better changelogs, but it's too late) The test_progs thing still doesn't run to completion, the next problem seems to be bpf_throw(): [ 247.720159] ? die+0xa4/0xd0 [ 247.720216] ? do_trap+0xa5/0x180 [ 247.720281] ? __cfi_bpf_prog_8ac473954ac6d431_F+0xd/0x10 [ 247.720368] ? __cfi_bpf_prog_8ac473954ac6d431_F+0xd/0x10 [ 247.720459] ? do_error_trap+0xba/0x120 [ 247.720525] ? __cfi_bpf_prog_8ac473954ac6d431_F+0xd/0x10 [ 247.720614] ? handle_invalid_op+0x2c/0x40 [ 247.720684] ? __cfi_bpf_prog_8ac473954ac6d431_F+0xd/0x10 [ 247.720775] ? exc_invalid_op+0x38/0x60 [ 247.720840] ? asm_exc_invalid_op+0x1a/0x20 [ 247.720909] ? 0xffffffffc001ba54 [ 247.720971] ? __cfi_bpf_prog_8ac473954ac6d431_F+0xd/0x10 [ 247.721063] ? bpf_throw+0x9b/0xf0 [ 247.721126] ? bpf_test_run+0x108/0x350 [ 247.721191] ? bpf_prog_5555714b685bf0cf_exception_throw_always_1+0x26/0x26 [ 247.721301] ? bpf_test_run+0x108/0x350 [ 247.721368] bpf_test_run+0x212/0x350 [ 247.721433] ? slab_build_skb+0x22/0x110 [ 247.721503] bpf_prog_test_run_skb+0x347/0x4a0 But I'm too tired to think staight. Is this a bpf_callback_t vs bpf_exception_cb difference? I'll prod more later. Zzzz..
On Fri, Dec 8, 2023 at 2:46 PM Peter Zijlstra <peterz@infradead.org> wrote: > > > Ok, did that. Current patches (on top of bpf-next) are here: > > git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git x86/cfi Looks really great. The last patch is cleaner than I expected. Good idea. > (really should try and write better changelogs, but it's too late) commit logs look fine except the "pilfer" word that I had to look up in the dictionary :) > [ 247.721063] ? bpf_throw+0x9b/0xf0 > [ 247.721126] ? bpf_test_run+0x108/0x350 > [ 247.721191] ? bpf_prog_5555714b685bf0cf_exception_throw_always_1+0x26/0x26 > [ 247.721301] ? bpf_test_run+0x108/0x350 > [ 247.721368] bpf_test_run+0x212/0x350 > [ 247.721433] ? slab_build_skb+0x22/0x110 > [ 247.721503] bpf_prog_test_run_skb+0x347/0x4a0 > > But I'm too tired to think staight. Is this a bpf_callback_t vs > bpf_exception_cb difference? Yep. It's easy to fix: diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 0e162eae8639..e36b3f41751e 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1484,7 +1484,7 @@ struct bpf_prog_aux { int cgroup_atype; /* enum cgroup_bpf_attach_type */ struct bpf_map *cgroup_storage[MAX_BPF_CGROUP_STORAGE_TYPE]; char name[BPF_OBJ_NAME_LEN]; - unsigned int (*bpf_exception_cb)(u64 cookie, u64 sp, u64 bp); + u64 (*bpf_exception_cb)(u64 cookie, u64 sp, u64 bp, u64, u64); #ifdef CONFIG_SECURITY void *security; #endif diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index fe229b28e4a9..650ebe8ff183 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -2537,7 +2537,7 @@ __bpf_kfunc void bpf_throw(u64 cookie) * which skips compiler generated instrumentation to do the same. */ kasan_unpoison_task_stack_below((void *)(long)ctx.sp); - ctx.aux->bpf_exception_cb(cookie, ctx.sp, ctx.bp); + ctx.aux->bpf_exception_cb(cookie, ctx.sp, ctx.bp, 0, 0); WARN(1, "A call to BPF exception callback should never return\n"); } and with that all of test_progs runs successfully without CFI panics. *happy dance* Only test_progs -t btf/line_info fails suspiciously. There we check that line info embedded in the prog looks sane. New cfi preamble is probably tripping something. It could be a test issue. I'll investigate. It's not a blocker. Do you mind resending the whole set so that BPF CI can test it on different archs ?
--- a/arch/x86/include/asm/cfi.h +++ b/arch/x86/include/asm/cfi.h @@ -9,15 +9,109 @@ */ #include <linux/bug.h> +/* + * An overview of the various calling conventions... + * + * Traditional: + * + * foo: + * ... code here ... + * ret + * + * direct caller: + * call foo + * + * indirect caller: + * lea foo(%rip), %r11 + * ... + * call *%r11 + * + * + * IBT: + * + * foo: + * endbr64 + * ... code here ... + * ret + * + * direct caller: + * call foo / call foo+4 + * + * indirect caller: + * lea foo(%rip), %r11 + * ... + * call *%r11 + * + * + * kCFI: + * + * __cfi_foo: + * movl $0x12345678, %eax + * # 11 nops when CONFIG_CALL_PADDING + * foo: + * endbr64 # when IBT + * ... code here ... + * ret + * + * direct call: + * call foo # / call foo+4 when IBT + * + * indirect call: + * lea foo(%rip), %r11 + * ... + * movl $(-0x12345678), %r10d + * addl -4(%r11), %r10d # -15 when CONFIG_CALL_PADDING + * jz 1f + * ud2 + * 1:call *%r11 + * + * + * FineIBT (builds as kCFI + CALL_PADDING + IBT + RETPOLINE and runtime patches into): + * + * __cfi_foo: + * endbr64 + * subl 0x12345678, %r10d + * jz foo + * ud2 + * nop + * foo: + * osp nop3 # was endbr64 + * ... code here ... + * ret + * + * direct caller: + * call foo / call foo+4 + * + * indirect caller: + * lea foo(%rip), %r11 + * ... + * movl $0x12345678, %r10d + * subl $16, %r11 + * nop4 + * call *%r11 + * + */ +enum cfi_mode { + CFI_DEFAULT, /* FineIBT if hardware has IBT, otherwise kCFI */ + CFI_OFF, /* Taditional / IBT depending on .config */ + CFI_KCFI, /* Optionally CALL_PADDING, IBT, RETPOLINE */ + CFI_FINEIBT, /* see arch/x86/kernel/alternative.c */ +}; + +extern enum cfi_mode cfi_mode; + struct pt_regs; #ifdef CONFIG_CFI_CLANG enum bug_trap_type handle_cfi_failure(struct pt_regs *regs); +#define __bpfcall +extern u32 cfi_bpf_hash; #else static inline enum bug_trap_type handle_cfi_failure(struct pt_regs *regs) { return BUG_TRAP_TYPE_NONE; } +#define cfi_bpf_hash 0U #endif /* CONFIG_CFI_CLANG */ #endif /* _ASM_X86_CFI_H */ --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -30,6 +30,7 @@ #include <asm/fixmap.h> #include <asm/paravirt.h> #include <asm/asm-prototypes.h> +#include <asm/cfi.h> int __read_mostly alternatives_patched; @@ -832,15 +833,43 @@ void __init_or_module apply_seal_endbr(s #endif /* CONFIG_X86_KERNEL_IBT */ #ifdef CONFIG_FINEIBT +#define __CFI_DEFAULT CFI_DEFAULT +#elif defined(CONFIG_CFI_CLANG) +#define __CFI_DEFAULT CFI_KCFI +#else +#define __CFI_DEFAULT CFI_OFF +#endif -enum cfi_mode { - CFI_DEFAULT, - CFI_OFF, - CFI_KCFI, - CFI_FINEIBT, -}; +enum cfi_mode cfi_mode __ro_after_init = __CFI_DEFAULT; + +#ifdef CONFIG_CFI_CLANG +struct bpf_insn; + +/* Must match bpf_func_t / DEFINE_BPF_PROG_RUN() */ +extern unsigned int __bpf_prog_runX(const void *ctx, + const struct bpf_insn *insn); + +/* + * Force a reference to the external symbol so the compiler generates + * __kcfi_typid. 
+ */ +__ADDRESSABLE(__bpf_prog_runX); + +/* u32 __ro_after_init cfi_bpf_hash = __kcfi_typeid___bpf_prog_runX; */ +asm ( +" .pushsection .data..ro_after_init,\"aw\",@progbits \n" +" .type cfi_bpf_hash,@object \n" +" .globl cfi_bpf_hash \n" +" .p2align 2, 0x0 \n" +"cfi_bpf_hash: \n" +" .long __kcfi_typeid___bpf_prog_runX \n" +" .size cfi_bpf_hash, 4 \n" +" .popsection \n" +); +#endif + +#ifdef CONFIG_FINEIBT -static enum cfi_mode cfi_mode __ro_after_init = CFI_DEFAULT; static bool cfi_rand __ro_after_init = true; static u32 cfi_seed __ro_after_init; @@ -1149,8 +1178,10 @@ static void __apply_fineibt(s32 *start_r goto err; if (cfi_rand) { - if (builtin) + if (builtin) { cfi_seed = get_random_u32(); + cfi_bpf_hash = cfi_rehash(cfi_bpf_hash); + } ret = cfi_rand_preamble(start_cfi, end_cfi); if (ret) --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -17,6 +17,7 @@ #include <asm/nospec-branch.h> #include <asm/text-patching.h> #include <asm/unwind.h> +#include <asm/cfi.h> static bool all_callee_regs_used[4] = {true, true, true, true}; @@ -51,9 +52,11 @@ static u8 *emit_code(u8 *ptr, u32 bytes, do { EMIT4(b1, b2, b3, b4); EMIT(off, 4); } while (0) #ifdef CONFIG_X86_KERNEL_IBT -#define EMIT_ENDBR() EMIT(gen_endbr(), 4) +#define EMIT_ENDBR() EMIT(gen_endbr(), 4) +#define EMIT_ENDBR_POISON() EMIT(gen_endbr_poison(), 4) #else #define EMIT_ENDBR() +#define EMIT_ENDBR_POISON() #endif static bool is_imm8(int value) @@ -247,6 +250,7 @@ struct jit_context { */ int tail_call_direct_label; int tail_call_indirect_label; + int prog_offset; }; /* Maximum number of bytes emitted while JITing one eBPF insn */ @@ -305,20 +309,90 @@ static void pop_callee_regs(u8 **pprog, } /* + * Emit the various CFI preambles, see asm/cfi.h and the comments about FineIBT + * in arch/x86/kernel/alternative.c + */ + +static int emit_fineibt(u8 **pprog) +{ + u8 *prog = *pprog; + + EMIT_ENDBR(); + EMIT3_off32(0x41, 0x81, 0xea, cfi_bpf_hash); /* subl $hash, %r10d */ + EMIT2(0x74, 0x07); /* jz.d8 +7 */ + EMIT2(0x0f, 0x0b); /* ud2 */ + EMIT1(0x90); /* nop */ + EMIT_ENDBR_POISON(); + + *pprog = prog; + return 16; +} + +static int emit_kcfi(u8 **pprog) +{ + u8 *prog = *pprog; + int offset = 5; + + EMIT1_off32(0xb8, cfi_bpf_hash); /* movl $hash, %eax */ +#ifdef CONFIG_CALL_PADDING + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + EMIT1(0x90); + offset += 11; +#endif + EMIT_ENDBR(); + + *pprog = prog; + return offset; +} + +static int emit_cfi(u8 **pprog) +{ + u8 *prog = *pprog; + int offset = 0; + + switch (cfi_mode) { + case CFI_FINEIBT: + offset = emit_fineibt(&prog); + break; + + case CFI_KCFI: + offset = emit_kcfi(&prog); + break; + + default: + EMIT_ENDBR(); + break; + } + + *pprog = prog; + return offset; +} + +/* * Emit x86-64 prologue code for BPF program. 
* bpf_tail_call helper will skip the first X86_TAIL_CALL_OFFSET bytes * while jumping to another program */ -static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, - bool tail_call_reachable, bool is_subprog, - bool is_exception_cb) +static int emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, + bool tail_call_reachable, bool is_subprog, + bool is_exception_cb) { u8 *prog = *pprog; + int offset; + offset = emit_cfi(&prog); /* BPF trampoline can be made to work without these nops, * but let's waste 5 bytes for now and optimize later */ - EMIT_ENDBR(); memcpy(prog, x86_nops[5], X86_PATCH_SIZE); prog += X86_PATCH_SIZE; if (!ebpf_from_cbpf) { @@ -357,6 +431,8 @@ static void emit_prologue(u8 **pprog, u3 if (tail_call_reachable) EMIT1(0x50); /* push rax */ *pprog = prog; + + return offset; } static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) @@ -1083,8 +1159,8 @@ static int do_jit(struct bpf_prog *bpf_p bool tail_call_seen = false; bool seen_exit = false; u8 temp[BPF_MAX_INSN_SIZE + BPF_INSN_SAFETY]; - int i, excnt = 0; int ilen, proglen = 0; + int i, excnt = 0; u8 *prog = temp; int err; @@ -1094,9 +1170,12 @@ static int do_jit(struct bpf_prog *bpf_p /* tail call's presence in current prog implies it is reachable */ tail_call_reachable |= tail_call_seen; - emit_prologue(&prog, bpf_prog->aux->stack_depth, - bpf_prog_was_classic(bpf_prog), tail_call_reachable, - bpf_is_subprog(bpf_prog), bpf_prog->aux->exception_cb); + ctx->prog_offset = emit_prologue(&prog, bpf_prog->aux->stack_depth, + bpf_prog_was_classic(bpf_prog), + tail_call_reachable, + bpf_is_subprog(bpf_prog), + bpf_prog->aux->exception_cb); + /* Exception callback will clobber callee regs for its own use, and * restore the original callee regs from main prog's stack frame. */ @@ -2935,9 +3014,16 @@ struct bpf_prog *bpf_int_jit_compile(str jit_data->header = header; jit_data->rw_header = rw_header; } - prog->bpf_func = (void *)image; + /* + * ctx.prog_offset is used when CFI preambles put code *before* + * the function. See emit_cfi(). For FineIBT specifically this code + * can also be executed and bpf_prog_kallsyms_add() will + * generate an additional symbol to cover this, hence also + * decrement proglen. 
+ */ + prog->bpf_func = (void *)image + ctx.prog_offset; prog->jited = 1; - prog->jited_len = proglen; + prog->jited_len = proglen - ctx.prog_offset; } else { prog = orig_prog; } --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -29,6 +29,7 @@ #include <linux/rcupdate_trace.h> #include <linux/static_call.h> #include <linux/memcontrol.h> +#include <linux/cfi.h> struct bpf_verifier_env; struct bpf_verifier_log; @@ -1188,7 +1189,11 @@ struct bpf_dispatcher { #endif }; -static __always_inline __nocfi unsigned int bpf_dispatcher_nop_func( +#ifndef __bpfcall +#define __bpfcall __nocfi +#endif + +static __always_inline __bpfcall unsigned int bpf_dispatcher_nop_func( const void *ctx, const struct bpf_insn *insnsi, bpf_func_t bpf_func) @@ -1278,7 +1283,7 @@ int arch_prepare_bpf_dispatcher(void *im #define DEFINE_BPF_DISPATCHER(name) \ __BPF_DISPATCHER_SC(name); \ - noinline __nocfi unsigned int bpf_dispatcher_##name##_func( \ + noinline __bpfcall unsigned int bpf_dispatcher_##name##_func( \ const void *ctx, \ const struct bpf_insn *insnsi, \ bpf_func_t bpf_func) \ @@ -1426,6 +1431,9 @@ struct bpf_prog_aux { struct bpf_kfunc_desc_tab *kfunc_tab; struct bpf_kfunc_btf_tab *kfunc_btf_tab; u32 size_poke_tab; +#ifdef CONFIG_FINEIBT + struct bpf_ksym ksym_prefix; +#endif struct bpf_ksym ksym; const struct bpf_prog_ops *ops; struct bpf_map **used_maps; --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -683,6 +683,23 @@ void bpf_prog_kallsyms_add(struct bpf_pr fp->aux->ksym.prog = true; bpf_ksym_add(&fp->aux->ksym); + +#ifdef CONFIG_FINEIBT + /* + * When FineIBT, code in the __cfi_foo() symbols can get executed + * and hence unwinder needs help. + */ + if (cfi_mode != CFI_FINEIBT) + return; + + snprintf(fp->aux->ksym_prefix.name, KSYM_NAME_LEN, + "__cfi_%s", fp->aux->ksym.name); + + fp->aux->ksym_prefix.start = (unsigned long) fp->bpf_func - 16; + fp->aux->ksym_prefix.end = (unsigned long) fp->bpf_func; + + bpf_ksym_add(&fp->aux->ksym_prefix); +#endif } void bpf_prog_kallsyms_del(struct bpf_prog *fp) @@ -691,6 +708,9 @@ void bpf_prog_kallsyms_del(struct bpf_pr return; bpf_ksym_del(&fp->aux->ksym); +#ifdef CONFIG_FINEIBT + bpf_ksym_del(&fp->aux->ksym_prefix); +#endif } static struct bpf_ksym *bpf_ksym_find(unsigned long addr)
The current BPF call convention is __nocfi, except when it calls !JIT
things, in which case it calls regular C functions. It so happens that
with FineIBT the __nocfi and C calling conventions are incompatible.
Specifically, __nocfi will call at func+0, while FineIBT will have
endbr-poison there, which is not a valid indirect target, causing #CP.

Notably this only triggers on IBT-enabled hardware, which is probably why
this hasn't been reported (also, most people will have the JIT on anyway).

Implement proper CFI prologues for the BPF JIT codegen and drop __nocfi
for x86.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/cfi.h    |  94 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/alternative.c |  47 +++++++++++++++---
 arch/x86/net/bpf_jit_comp.c   | 108 +++++++++++++++++++++++++++++++++++++-----
 include/linux/bpf.h           |  12 +++-
 kernel/bpf/core.c             |  20 +++++++
 5 files changed, 260 insertions(+), 21 deletions(-)