diff mbox series

[bpf] bpf: Fix fexit trampoline.

Message ID 20210316210007.38949-1-alexei.starovoitov@gmail.com (mailing list archive)
State Accepted
Commit e21aa341785c679dd409c8cb71f864c00fe6c463
Delegated to: BPF
Headers show
Series [bpf] bpf: Fix fexit trampoline. | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf
netdev/subject_prefix success Link
netdev/cc_maintainers fail 2 blamed authors not CCed: ast@kernel.org songliubraving@fb.com; 14 maintainers not CCed: netdev@vger.kernel.org dsahern@kernel.org bp@alien8.de yhs@fb.com kpsingh@kernel.org yoshfuji@linux-ipv6.org mingo@redhat.com tglx@linutronix.de kafai@fb.com hpa@zytor.com ast@kernel.org john.fastabend@gmail.com x86@kernel.org songliubraving@fb.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 12194 this patch: 12194
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/verify_fixes success Link
netdev/checkpatch warning CHECK: No space is necessary after a cast CHECK: multiple assignments should be avoided WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 84 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 89 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 12700 this patch: 12700
netdev/header_inline success Link

Commit Message

Alexei Starovoitov March 16, 2021, 9 p.m. UTC
From: Alexei Starovoitov <ast@kernel.org>

The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
The synchronize_rcu_tasks() will not wait for such tasks to complete.
In such case the trampoline image will be freed and when the task
wakes up the return IP will point to freed memory causing the crash.
Solve this by adding percpu_ref_get/put for the duration of trampoline
and separate trampoline vs its image life times.
The "half page" optimization has to be removed, since
first_half->second_half->first_half transition cannot be guaranteed to
complete in deterministic time. Every trampoline update becomes a new image.
The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
call_rcu_tasks. Together they will wait for the original function and
trampoline asm to complete. The trampoline is patched from nop to jmp to skip
fexit progs. They are freed independently from the trampoline. The image with
fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
will wait for both sleepable and non-sleepable progs to complete.

Reported-by: Andrii Nakryiko <andrii@kernel.org>
Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>  # for RCU
---
Without ftrace fix:
https://patchwork.kernel.org/project/netdevbpf/patch/20210316195815.34714-1-alexei.starovoitov@gmail.com/
this patch will trigger warn in ftrace.

 arch/x86/net/bpf_jit_comp.c |  26 ++++-
 include/linux/bpf.h         |  24 +++-
 kernel/bpf/bpf_struct_ops.c |   2 +-
 kernel/bpf/core.c           |   4 +-
 kernel/bpf/trampoline.c     | 218 +++++++++++++++++++++++++++---------
 5 files changed, 213 insertions(+), 61 deletions(-)

Comments

patchwork-bot+netdevbpf@kernel.org March 17, 2021, 11:30 p.m. UTC | #1
Hello:

This patch was applied to bpf/bpf.git (refs/heads/master):

On Tue, 16 Mar 2021 14:00:07 -0700 you wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
> The synchronize_rcu_tasks() will not wait for such tasks to complete.
> In such case the trampoline image will be freed and when the task
> wakes up the return IP will point to freed memory causing the crash.
> Solve this by adding percpu_ref_get/put for the duration of trampoline
> and separate trampoline vs its image life times.
> The "half page" optimization has to be removed, since
> first_half->second_half->first_half transition cannot be guaranteed to
> complete in deterministic time. Every trampoline update becomes a new image.
> The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
> call_rcu_tasks. Together they will wait for the original function and
> trampoline asm to complete. The trampoline is patched from nop to jmp to skip
> fexit progs. They are freed independently from the trampoline. The image with
> fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
> will wait for both sleepable and non-sleepable progs to complete.
> 
> [...]

Here is the summary with links:
  - [bpf] bpf: Fix fexit trampoline.
    https://git.kernel.org/bpf/bpf/c/e21aa341785c

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
Jiri Olsa March 21, 2021, 11:32 p.m. UTC | #2
On Tue, Mar 16, 2021 at 02:00:07PM -0700, Alexei Starovoitov wrote:
> From: Alexei Starovoitov <ast@kernel.org>
> 
> The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
> The synchronize_rcu_tasks() will not wait for such tasks to complete.
> In such case the trampoline image will be freed and when the task
> wakes up the return IP will point to freed memory causing the crash.
> Solve this by adding percpu_ref_get/put for the duration of trampoline
> and separate trampoline vs its image life times.
> The "half page" optimization has to be removed, since
> first_half->second_half->first_half transition cannot be guaranteed to
> complete in deterministic time. Every trampoline update becomes a new image.
> The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
> call_rcu_tasks. Together they will wait for the original function and
> trampoline asm to complete. The trampoline is patched from nop to jmp to skip
> fexit progs. They are freed independently from the trampoline. The image with
> fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
> will wait for both sleepable and non-sleepable progs to complete.
> 
> Reported-by: Andrii Nakryiko <andrii@kernel.org>
> Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>  # for RCU
> ---
> Without ftrace fix:
> https://patchwork.kernel.org/project/netdevbpf/patch/20210316195815.34714-1-alexei.starovoitov@gmail.com/
> this patch will trigger warn in ftrace.
> 
>  arch/x86/net/bpf_jit_comp.c |  26 ++++-
>  include/linux/bpf.h         |  24 +++-
>  kernel/bpf/bpf_struct_ops.c |   2 +-
>  kernel/bpf/core.c           |   4 +-
>  kernel/bpf/trampoline.c     | 218 +++++++++++++++++++++++++++---------
>  5 files changed, 213 insertions(+), 61 deletions(-)
> 

hi,
I'm on bpf/master and I'm triggering warnings below when running together:

  # while :; do ./test_progs -t fentry_test ; done
  # while :; do ./test_progs -t module_attach ; done

when I revert this patch (plus b90829704780) it seems ok

jirka


---
[  548.594548] bpf_testmod: loading out-of-tree module taints kernel.
[  548.600787] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
[  558.353423] ------------[ cut here ]------------
[  558.358064] WARNING: CPU: 35 PID: 1572 at kernel/bpf/syscall.c:2516 bpf_tracing_link_release+0x3b/0x40
[  558.367399] Modules linked in: intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass rapl mei_me iTCO_wdt intel_cstate wmi_bmof iTCO_ve]
[  558.409989] CPU: 35 PID: 1572 Comm: test-66 Tainted: G          IOE     5.12.0-rc2+ #25
[  558.418005] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 1.7.0 12/14/2018
[  558.425492] RIP: 0010:bpf_tracing_link_release+0x3b/0x40
[  558.430829] Code: 48 8b 7f 18 e8 26 5c 02 00 85 c0 75 1d 48 8b 7b 48 e8 29 53 02 00 48 8b 7b 50 48 85 ff 74 05 e8 bb f4 ff ff 48 8b 5d f8 c9 c3 <0f> 0b eb df 90 0f 1f 44 00 00 55 48 89 e5 41 54 4f
[  558.449588] RSP: 0018:ffffc90002107e40 EFLAGS: 00010286
[  558.454828] RAX: 00000000ffffffed RBX: ffff888105982300 RCX: 0000000000000000
[  558.461969] RDX: ffff8881132c2540 RSI: 4c376cb4fcbc233e RDI: ffff8881058595d0
[  558.469110] RBP: ffffc90002107e48 R08: 0000000000000000 R09: 00000000ffff850f
[  558.476250] R10: 000000000000000a R11: 0000000000000008 R12: ffff888105982300
[  558.483391] R13: ffff8881040743f8 R14: ffff8881039465a0 R15: ffff888141359440
[  558.490532] FS:  00007f4232341740(0000) GS:ffff8897e10c0000(0000) knlGS:0000000000000000
[  558.498625] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  558.504371] CR2: 0000000001ef164d CR3: 000000014b23a002 CR4: 00000000007706e0
[  558.511505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  558.518645] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  558.525778] PKRU: 55555554
[  558.528492] Call Trace:
[  558.530956]  bpf_link_free+0x55/0x80
[  558.534539]  bpf_link_release+0x29/0x70
[  558.538389]  __fput+0x9f/0x250
[  558.541457]  ____fput+0xe/0x10
[  558.544525]  task_work_run+0x64/0xa0
[  558.548112]  exit_to_user_mode_prepare+0x11c/0x120
[  558.552914]  syscall_exit_to_user_mode+0x21/0x40
[  558.557543]  ? __x64_sys_close+0x12/0x40
[  558.561476]  do_syscall_64+0x45/0x50
[  558.565065]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  558.570125] RIP: 0033:0x7f4232524167
[  558.573713] Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0f
[  558.592470] RSP: 002b:00007ffd76f88ec8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  558.600047] RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007f4232524167
[  558.607187] RDX: 0000000005824950 RSI: 000000000582ec50 RDI: 0000000000000009
[  558.614326] RBP: 000000000582ec80 R08: 0000000000000000 R09: 00007ffd76f88df7
[  558.621466] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  558.628600] R13: 00000000000001c8 R14: 00000000fffffffb R15: 000000000000000b
[  558.635741] ---[ end trace 878f3b01fdcfe925 ]---
[  563.521703] ------------[ cut here ]------------
[  563.526335] WARNING: CPU: 37 PID: 1586 at kernel/trace/ftrace.c:6321 ftrace_module_enable+0x33d/0x370
[  563.535559] Modules linked in: bpf_testmod(OE+) intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass rapl mei_me iTCO_wdt intel_cstate]
[  563.579581] CPU: 37 PID: 1586 Comm: test_progs Tainted: G        W IOE     5.12.0-rc2+ #25
[  563.587862] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 1.7.0 12/14/2018
[  563.595348] RIP: 0010:ftrace_module_enable+0x33d/0x370
[  563.600495] Code: 74 99 48 81 ca 00 00 00 10 49 89 54 24 08 e9 dc fe ff ff 8b 8b 98 01 00 00 48 01 ca 48 39 d0 0f 83 2e fd ff ff e9 65 fd ff ff <0f> 0b e9 be fe ff ff 0f 0b e9 b7 fe ff ff 48 83 7d
[  563.619249] RSP: 0018:ffffc90002137d18 EFLAGS: 00010206
[  563.624483] RAX: 0000000000031045 RBX: ffffffffa06793c0 RCX: 000000000000003d
[  563.631617] RDX: ffff888108eb2080 RSI: ffffffffa06763c0 RDI: 0000000000000000
[  563.638757] RBP: ffffc90002137d40 R08: ffffffff82b32ea0 R09: 0000000000000000
[  563.645890] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88810abd2030
[  563.653027] R13: 61c8864680b583eb R14: 0000000000000003 R15: ffff888115d7d080
[  563.660166] FS:  00007fcf8c7cb740(0000) GS:ffff8897e1140000(0000) knlGS:0000000000000000
[  563.668259] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  563.674005] CR2: 000000000159d3d8 CR3: 000000010aad6001 CR4: 00000000007706e0
[  563.681136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  563.688270] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  563.695403] PKRU: 55555554
[  563.698116] Call Trace:
[  563.700573]  load_module+0x2142/0x2610
[  563.704333]  __do_sys_finit_module+0xc2/0x120
[  563.708700]  __x64_sys_finit_module+0x1a/0x20
[  563.713064]  do_syscall_64+0x38/0x50
[  563.716656]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  563.721715] RIP: 0033:0x7fcf8c8cc55d
[  563.725302] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d eb 78 0c 00 f8
[  563.744050] RSP: 002b:00007fffc6d14f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  563.751625] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fcf8c8cc55d
[  563.758764] RDX: 0000000000000000 RSI: 000000000159d3da RDI: 0000000000000004
[  563.765899] RBP: 0000000000000004 R08: 0000000000000000 R09: 00007fffc6d83000
[  563.773039] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
[  563.780174] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  563.787316] ---[ end trace 878f3b01fdcfe926 ]---
[  563.903431] ------------[ cut here ]------------
[  563.908057] WARNING: CPU: 32 PID: 1584 at kernel/trace/ftrace.c:1748 __ftrace_hash_rec_update.part.0+0x326/0x430
[  563.918243] Modules linked in: bpf_testmod(OE) intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass rapl mei_me iTCO_wdt intel_cstate ]
[  563.962163] CPU: 32 PID: 1584 Comm: test-38 Tainted: G        W IOE     5.12.0-rc2+ #25
[  563.970172] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 1.7.0 12/14/2018
[  563.977658] RIP: 0010:__ftrace_hash_rec_update.part.0+0x326/0x430
[  563.983757] Code: e7 c4 82 75 ca 49 8b 41 08 e9 30 ff ff ff 49 8b 41 08 4d 85 d2 0f 84 23 ff ff ff 48 0d 00 00 00 10 49 89 41 08 e9 63 fe ff ff <0f> 0b c7 05 2e fa aa 01 01 00 00 00 c7 05 34 fa a0
[  564.002503] RSP: 0018:ffffc9000212fc48 EFLAGS: 00010246
[  564.007731] RAX: 0000000000000000 RBX: ffff888115d7d080 RCX: 0000000000000001
[  564.014873] RDX: 0000000000000003 RSI: ffffffffa06763c0 RDI: 0000000000000000
[  564.022015] RBP: ffffc9000212fcc0 R08: 0000000000000001 R09: ffff88810abd2030
[  564.029154] R10: ffff888108eb2080 R11: 0000000000000000 R12: 0000000000000000
[  564.036288] R13: 0000000000000007 R14: 0000000000000000 R15: 0000000000000003
[  564.043418] FS:  00007ff79bdeb740(0000) GS:ffff8897e1000000(0000) knlGS:0000000000000000
[  564.051503] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  564.057250] CR2: 00007ff79c039000 CR3: 000000010baee002 CR4: 00000000007706e0
[  564.064383] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  564.071515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  564.078649] PKRU: 55555554
[  564.081361] Call Trace:
[  564.083821]  ? udp_destruct_sock+0x140/0x140
[  564.088107]  ftrace_hash_rec_update_modify+0x1f/0x80
[  564.093081]  ftrace_hash_move_and_update_ops+0xcf/0x240
[  564.098314]  ? bpf_fentry_test1+0x10/0x10
[  564.102336]  ftrace_set_hash+0x121/0x1d0
[  564.106275]  ? 0xffffffffa052e000
[  564.109599]  ? bpf_fentry_test1+0x10/0x10
[  564.113620]  unregister_ftrace_direct+0x7a/0x200
[  564.118247]  ? bpf_fentry_test1+0x10/0x10
[  564.122271]  bpf_trampoline_update+0x31e/0x3f0
[  564.126726]  ? __radix_tree_delete+0x87/0xf0
[  564.131005]  bpf_trampoline_unlink_prog+0x9c/0x140
[  564.135808]  bpf_tracing_link_release+0x1a/0x40
[  564.140347]  bpf_link_free+0x55/0x80
[  564.143935]  bpf_link_release+0x29/0x70
[  564.147784]  __fput+0x9f/0x250
[  564.150851]  ____fput+0xe/0x10
[  564.153911]  task_work_run+0x64/0xa0
[  564.157498]  exit_to_user_mode_prepare+0x11c/0x120
[  564.162299]  syscall_exit_to_user_mode+0x21/0x40
[  564.166928]  ? __x64_sys_close+0x12/0x40
[  564.170863]  do_syscall_64+0x45/0x50
[  564.174451]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  564.179510] RIP: 0033:0x7ff79bfce167
[  564.183090] Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0f
[  564.201836] RSP: 002b:00007ffd8a9d85a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  564.209403] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00007ff79bfce167
[  564.216536] RDX: 0000000007274970 RSI: 0000000007276370 RDI: 000000000000000f
[  564.223669] RBP: 00000000072763a0 R08: 0000000000000000 R09: 00007ffd8a9d84d7
[  564.230801] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  564.237936] R13: 00007ff79c039030 R14: 0000000000000002 R15: 0000000000000000
[  564.245077] ---[ end trace 878f3b01fdcfe927 ]---
[  564.467617] ------------[ cut here ]------------
[  564.472245] WARNING: CPU: 32 PID: 1584 at kernel/trace/ftrace.c:5228 unregister_ftrace_direct+0x1df/0x200
[  564.481814] Modules linked in: bpf_testmod(OE) intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass rapl mei_me iTCO_wdt intel_cstate ]
[  564.525736] CPU: 32 PID: 1584 Comm: test-38 Tainted: G        W IOE     5.12.0-rc2+ #25
[  564.533745] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 1.7.0 12/14/2018
[  564.541233] RIP: 0010:unregister_ftrace_direct+0x1df/0x200
[  564.546726] Code: 85 c0 75 0e 31 f6 48 c7 c7 a0 2e b3 82 e8 79 e4 ff ff 48 c7 c7 60 30 b3 82 e8 5d b1 94 00 e9 73 fe ff ff 0f 0b e9 1a ff ff ff <0f> 0b e9 a1 fe ff ff 0f 0b e9 0c ff ff ff 41 be e1
[  564.565480] RSP: 0018:ffffc9000212fd90 EFLAGS: 00010286
[  564.570716] RAX: 0000000000000001 RBX: ffffffffa0503000 RCX: ffffffff81954730
[  564.577860] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff82b32ea0
[  564.584996] RBP: ffffc9000212fdb8 R08: 0000000000000001 R09: 0000000000000000
[  564.592128] R10: 0000000000000002 R11: ffff8881067cc790 R12: ffffffff81954730
[  564.599261] R13: ffff888103ac18e0 R14: 00000000ffffffed R15: ffff888106251140
[  564.606396] FS:  00007ff79bdeb740(0000) GS:ffff8897e1000000(0000) knlGS:0000000000000000
[  564.614488] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  564.620235] CR2: 00007ff79c039000 CR3: 000000010baee002 CR4: 00000000007706e0
[  564.627368] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  564.634500] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  564.641633] PKRU: 55555554
[  564.644345] Call Trace:
[  564.646800]  ? bpf_fentry_test2+0x10/0x10
[  564.650821]  bpf_trampoline_update+0x31e/0x3f0
[  564.655275]  ? __radix_tree_delete+0x87/0xf0
[  564.659558]  bpf_trampoline_unlink_prog+0x9c/0x140
[  564.664359]  bpf_tracing_link_release+0x1a/0x40
[  564.668900]  bpf_link_free+0x55/0x80
[  564.672486]  bpf_link_release+0x29/0x70
[  564.676333]  __fput+0x9f/0x250
[  564.679395]  ____fput+0xe/0x10
[  564.682462]  task_work_run+0x64/0xa0
[  564.686051]  exit_to_user_mode_prepare+0x11c/0x120
[  564.690869]  syscall_exit_to_user_mode+0x21/0x40
[  564.695502]  ? __x64_sys_close+0x12/0x40
[  564.699442]  do_syscall_64+0x45/0x50
[  564.703029]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  564.708089] RIP: 0033:0x7ff79bfce167
[  564.711669] Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0f
[  564.730422] RSP: 002b:00007ffd8a9d85a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  564.737989] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 00007ff79bfce167
[  564.745121] RDX: 0000000007274970 RSI: 00000000072763a0 RDI: 0000000000000010
[  564.752254] RBP: 00000000072763d0 R08: 0000000000000000 R09: 00007ffd8a9d84d7
[  564.759387] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  564.766520] R13: 00007ff79c039030 R14: 0000000000000002 R15: 0000000000000000
[  564.773655] ---[ end trace 878f3b01fdcfe928 ]---
[  565.117608] ------------[ cut here ]------------
Jiri Olsa March 22, 2021, 4:24 p.m. UTC | #3
On Mon, Mar 22, 2021 at 12:32:05AM +0100, Jiri Olsa wrote:
> On Tue, Mar 16, 2021 at 02:00:07PM -0700, Alexei Starovoitov wrote:
> > From: Alexei Starovoitov <ast@kernel.org>
> > 
> > The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
> > The synchronize_rcu_tasks() will not wait for such tasks to complete.
> > In such case the trampoline image will be freed and when the task
> > wakes up the return IP will point to freed memory causing the crash.
> > Solve this by adding percpu_ref_get/put for the duration of trampoline
> > and separate trampoline vs its image life times.
> > The "half page" optimization has to be removed, since
> > first_half->second_half->first_half transition cannot be guaranteed to
> > complete in deterministic time. Every trampoline update becomes a new image.
> > The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
> > call_rcu_tasks. Together they will wait for the original function and
> > trampoline asm to complete. The trampoline is patched from nop to jmp to skip
> > fexit progs. They are freed independently from the trampoline. The image with
> > fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
> > will wait for both sleepable and non-sleepable progs to complete.
> > 
> > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > Acked-by: Paul E. McKenney <paulmck@kernel.org>  # for RCU
> > ---
> > Without ftrace fix:
> > https://patchwork.kernel.org/project/netdevbpf/patch/20210316195815.34714-1-alexei.starovoitov@gmail.com/
> > this patch will trigger warn in ftrace.
> > 
> >  arch/x86/net/bpf_jit_comp.c |  26 ++++-
> >  include/linux/bpf.h         |  24 +++-
> >  kernel/bpf/bpf_struct_ops.c |   2 +-
> >  kernel/bpf/core.c           |   4 +-
> >  kernel/bpf/trampoline.c     | 218 +++++++++++++++++++++++++++---------
> >  5 files changed, 213 insertions(+), 61 deletions(-)
> > 
> 
> hi,
> I'm on bpf/master and I'm triggering warnings below when running together:
> 
>   # while :; do ./test_progs -t fentry_test ; done
>   # while :; do ./test_progs -t module_attach ; done

hum, is it possible that we don't take module ref and it can get
unloaded even if there's trampoline attach to it..? I can't see
that in the code.. ftrace_release_mod can't fail ;-)

jirka

> 
> when I revert this patch (plus b90829704780) it seems ok
> 
> jirka
> 
> 
> ---
> [  548.594548] bpf_testmod: loading out-of-tree module taints kernel.
> [  548.600787] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
> [  558.353423] ------------[ cut here ]------------
> [  558.358064] WARNING: CPU: 35 PID: 1572 at kernel/bpf/syscall.c:2516 bpf_tracing_link_release+0x3b/0x40
> [  558.367399] Modules linked in: intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass rapl mei_me iTCO_wdt intel_cstate wmi_bmof iTCO_ve]
> [  558.409989] CPU: 35 PID: 1572 Comm: test-66 Tainted: G          IOE     5.12.0-rc2+ #25
> [  558.418005] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 1.7.0 12/14/2018
> [  558.425492] RIP: 0010:bpf_tracing_link_release+0x3b/0x40
> [  558.430829] Code: 48 8b 7f 18 e8 26 5c 02 00 85 c0 75 1d 48 8b 7b 48 e8 29 53 02 00 48 8b 7b 50 48 85 ff 74 05 e8 bb f4 ff ff 48 8b 5d f8 c9 c3 <0f> 0b eb df 90 0f 1f 44 00 00 55 48 89 e5 41 54 4f
> [  558.449588] RSP: 0018:ffffc90002107e40 EFLAGS: 00010286
> [  558.454828] RAX: 00000000ffffffed RBX: ffff888105982300 RCX: 0000000000000000
> [  558.461969] RDX: ffff8881132c2540 RSI: 4c376cb4fcbc233e RDI: ffff8881058595d0
> [  558.469110] RBP: ffffc90002107e48 R08: 0000000000000000 R09: 00000000ffff850f
> [  558.476250] R10: 000000000000000a R11: 0000000000000008 R12: ffff888105982300
> [  558.483391] R13: ffff8881040743f8 R14: ffff8881039465a0 R15: ffff888141359440
> [  558.490532] FS:  00007f4232341740(0000) GS:ffff8897e10c0000(0000) knlGS:0000000000000000
> [  558.498625] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  558.504371] CR2: 0000000001ef164d CR3: 000000014b23a002 CR4: 00000000007706e0
> [  558.511505] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  558.518645] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  558.525778] PKRU: 55555554
> [  558.528492] Call Trace:
> [  558.530956]  bpf_link_free+0x55/0x80
> [  558.534539]  bpf_link_release+0x29/0x70
> [  558.538389]  __fput+0x9f/0x250
> [  558.541457]  ____fput+0xe/0x10
> [  558.544525]  task_work_run+0x64/0xa0
> [  558.548112]  exit_to_user_mode_prepare+0x11c/0x120
> [  558.552914]  syscall_exit_to_user_mode+0x21/0x40
> [  558.557543]  ? __x64_sys_close+0x12/0x40
> [  558.561476]  do_syscall_64+0x45/0x50
> [  558.565065]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  558.570125] RIP: 0033:0x7f4232524167
> [  558.573713] Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0f
> [  558.592470] RSP: 002b:00007ffd76f88ec8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> [  558.600047] RAX: 0000000000000000 RBX: 0000000000000005 RCX: 00007f4232524167
> [  558.607187] RDX: 0000000005824950 RSI: 000000000582ec50 RDI: 0000000000000009
> [  558.614326] RBP: 000000000582ec80 R08: 0000000000000000 R09: 00007ffd76f88df7
> [  558.621466] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [  558.628600] R13: 00000000000001c8 R14: 00000000fffffffb R15: 000000000000000b
> [  558.635741] ---[ end trace 878f3b01fdcfe925 ]---
> [  563.521703] ------------[ cut here ]------------
> [  563.526335] WARNING: CPU: 37 PID: 1586 at kernel/trace/ftrace.c:6321 ftrace_module_enable+0x33d/0x370
> [  563.535559] Modules linked in: bpf_testmod(OE+) intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass rapl mei_me iTCO_wdt intel_cstate]
> [  563.579581] CPU: 37 PID: 1586 Comm: test_progs Tainted: G        W IOE     5.12.0-rc2+ #25
> [  563.587862] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 1.7.0 12/14/2018
> [  563.595348] RIP: 0010:ftrace_module_enable+0x33d/0x370
> [  563.600495] Code: 74 99 48 81 ca 00 00 00 10 49 89 54 24 08 e9 dc fe ff ff 8b 8b 98 01 00 00 48 01 ca 48 39 d0 0f 83 2e fd ff ff e9 65 fd ff ff <0f> 0b e9 be fe ff ff 0f 0b e9 b7 fe ff ff 48 83 7d
> [  563.619249] RSP: 0018:ffffc90002137d18 EFLAGS: 00010206
> [  563.624483] RAX: 0000000000031045 RBX: ffffffffa06793c0 RCX: 000000000000003d
> [  563.631617] RDX: ffff888108eb2080 RSI: ffffffffa06763c0 RDI: 0000000000000000
> [  563.638757] RBP: ffffc90002137d40 R08: ffffffff82b32ea0 R09: 0000000000000000
> [  563.645890] R10: 0000000000000001 R11: 0000000000000000 R12: ffff88810abd2030
> [  563.653027] R13: 61c8864680b583eb R14: 0000000000000003 R15: ffff888115d7d080
> [  563.660166] FS:  00007fcf8c7cb740(0000) GS:ffff8897e1140000(0000) knlGS:0000000000000000
> [  563.668259] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  563.674005] CR2: 000000000159d3d8 CR3: 000000010aad6001 CR4: 00000000007706e0
> [  563.681136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  563.688270] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  563.695403] PKRU: 55555554
> [  563.698116] Call Trace:
> [  563.700573]  load_module+0x2142/0x2610
> [  563.704333]  __do_sys_finit_module+0xc2/0x120
> [  563.708700]  __x64_sys_finit_module+0x1a/0x20
> [  563.713064]  do_syscall_64+0x38/0x50
> [  563.716656]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  563.721715] RIP: 0033:0x7fcf8c8cc55d
> [  563.725302] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d eb 78 0c 00 f8
> [  563.744050] RSP: 002b:00007fffc6d14f38 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> [  563.751625] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fcf8c8cc55d
> [  563.758764] RDX: 0000000000000000 RSI: 000000000159d3da RDI: 0000000000000004
> [  563.765899] RBP: 0000000000000004 R08: 0000000000000000 R09: 00007fffc6d83000
> [  563.773039] R10: 0000000000000003 R11: 0000000000000246 R12: 0000000000000000
> [  563.780174] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  563.787316] ---[ end trace 878f3b01fdcfe926 ]---
> [  563.903431] ------------[ cut here ]------------
> [  563.908057] WARNING: CPU: 32 PID: 1584 at kernel/trace/ftrace.c:1748 __ftrace_hash_rec_update.part.0+0x326/0x430
> [  563.918243] Modules linked in: bpf_testmod(OE) intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass rapl mei_me iTCO_wdt intel_cstate ]
> [  563.962163] CPU: 32 PID: 1584 Comm: test-38 Tainted: G        W IOE     5.12.0-rc2+ #25
> [  563.970172] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 1.7.0 12/14/2018
> [  563.977658] RIP: 0010:__ftrace_hash_rec_update.part.0+0x326/0x430
> [  563.983757] Code: e7 c4 82 75 ca 49 8b 41 08 e9 30 ff ff ff 49 8b 41 08 4d 85 d2 0f 84 23 ff ff ff 48 0d 00 00 00 10 49 89 41 08 e9 63 fe ff ff <0f> 0b c7 05 2e fa aa 01 01 00 00 00 c7 05 34 fa a0
> [  564.002503] RSP: 0018:ffffc9000212fc48 EFLAGS: 00010246
> [  564.007731] RAX: 0000000000000000 RBX: ffff888115d7d080 RCX: 0000000000000001
> [  564.014873] RDX: 0000000000000003 RSI: ffffffffa06763c0 RDI: 0000000000000000
> [  564.022015] RBP: ffffc9000212fcc0 R08: 0000000000000001 R09: ffff88810abd2030
> [  564.029154] R10: ffff888108eb2080 R11: 0000000000000000 R12: 0000000000000000
> [  564.036288] R13: 0000000000000007 R14: 0000000000000000 R15: 0000000000000003
> [  564.043418] FS:  00007ff79bdeb740(0000) GS:ffff8897e1000000(0000) knlGS:0000000000000000
> [  564.051503] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  564.057250] CR2: 00007ff79c039000 CR3: 000000010baee002 CR4: 00000000007706e0
> [  564.064383] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  564.071515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  564.078649] PKRU: 55555554
> [  564.081361] Call Trace:
> [  564.083821]  ? udp_destruct_sock+0x140/0x140
> [  564.088107]  ftrace_hash_rec_update_modify+0x1f/0x80
> [  564.093081]  ftrace_hash_move_and_update_ops+0xcf/0x240
> [  564.098314]  ? bpf_fentry_test1+0x10/0x10
> [  564.102336]  ftrace_set_hash+0x121/0x1d0
> [  564.106275]  ? 0xffffffffa052e000
> [  564.109599]  ? bpf_fentry_test1+0x10/0x10
> [  564.113620]  unregister_ftrace_direct+0x7a/0x200
> [  564.118247]  ? bpf_fentry_test1+0x10/0x10
> [  564.122271]  bpf_trampoline_update+0x31e/0x3f0
> [  564.126726]  ? __radix_tree_delete+0x87/0xf0
> [  564.131005]  bpf_trampoline_unlink_prog+0x9c/0x140
> [  564.135808]  bpf_tracing_link_release+0x1a/0x40
> [  564.140347]  bpf_link_free+0x55/0x80
> [  564.143935]  bpf_link_release+0x29/0x70
> [  564.147784]  __fput+0x9f/0x250
> [  564.150851]  ____fput+0xe/0x10
> [  564.153911]  task_work_run+0x64/0xa0
> [  564.157498]  exit_to_user_mode_prepare+0x11c/0x120
> [  564.162299]  syscall_exit_to_user_mode+0x21/0x40
> [  564.166928]  ? __x64_sys_close+0x12/0x40
> [  564.170863]  do_syscall_64+0x45/0x50
> [  564.174451]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  564.179510] RIP: 0033:0x7ff79bfce167
> [  564.183090] Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0f
> [  564.201836] RSP: 002b:00007ffd8a9d85a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> [  564.209403] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 00007ff79bfce167
> [  564.216536] RDX: 0000000007274970 RSI: 0000000007276370 RDI: 000000000000000f
> [  564.223669] RBP: 00000000072763a0 R08: 0000000000000000 R09: 00007ffd8a9d84d7
> [  564.230801] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [  564.237936] R13: 00007ff79c039030 R14: 0000000000000002 R15: 0000000000000000
> [  564.245077] ---[ end trace 878f3b01fdcfe927 ]---
> [  564.467617] ------------[ cut here ]------------
> [  564.472245] WARNING: CPU: 32 PID: 1584 at kernel/trace/ftrace.c:5228 unregister_ftrace_direct+0x1df/0x200
> [  564.481814] Modules linked in: bpf_testmod(OE) intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif irqbypass rapl mei_me iTCO_wdt intel_cstate ]
> [  564.525736] CPU: 32 PID: 1584 Comm: test-38 Tainted: G        W IOE     5.12.0-rc2+ #25
> [  564.533745] Hardware name: Dell Inc. PowerEdge R440/08CYF7, BIOS 1.7.0 12/14/2018
> [  564.541233] RIP: 0010:unregister_ftrace_direct+0x1df/0x200
> [  564.546726] Code: 85 c0 75 0e 31 f6 48 c7 c7 a0 2e b3 82 e8 79 e4 ff ff 48 c7 c7 60 30 b3 82 e8 5d b1 94 00 e9 73 fe ff ff 0f 0b e9 1a ff ff ff <0f> 0b e9 a1 fe ff ff 0f 0b e9 0c ff ff ff 41 be e1
> [  564.565480] RSP: 0018:ffffc9000212fd90 EFLAGS: 00010286
> [  564.570716] RAX: 0000000000000001 RBX: ffffffffa0503000 RCX: ffffffff81954730
> [  564.577860] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff82b32ea0
> [  564.584996] RBP: ffffc9000212fdb8 R08: 0000000000000001 R09: 0000000000000000
> [  564.592128] R10: 0000000000000002 R11: ffff8881067cc790 R12: ffffffff81954730
> [  564.599261] R13: ffff888103ac18e0 R14: 00000000ffffffed R15: ffff888106251140
> [  564.606396] FS:  00007ff79bdeb740(0000) GS:ffff8897e1000000(0000) knlGS:0000000000000000
> [  564.614488] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  564.620235] CR2: 00007ff79c039000 CR3: 000000010baee002 CR4: 00000000007706e0
> [  564.627368] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  564.634500] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  564.641633] PKRU: 55555554
> [  564.644345] Call Trace:
> [  564.646800]  ? bpf_fentry_test2+0x10/0x10
> [  564.650821]  bpf_trampoline_update+0x31e/0x3f0
> [  564.655275]  ? __radix_tree_delete+0x87/0xf0
> [  564.659558]  bpf_trampoline_unlink_prog+0x9c/0x140
> [  564.664359]  bpf_tracing_link_release+0x1a/0x40
> [  564.668900]  bpf_link_free+0x55/0x80
> [  564.672486]  bpf_link_release+0x29/0x70
> [  564.676333]  __fput+0x9f/0x250
> [  564.679395]  ____fput+0xe/0x10
> [  564.682462]  task_work_run+0x64/0xa0
> [  564.686051]  exit_to_user_mode_prepare+0x11c/0x120
> [  564.690869]  syscall_exit_to_user_mode+0x21/0x40
> [  564.695502]  ? __x64_sys_close+0x12/0x40
> [  564.699442]  do_syscall_64+0x45/0x50
> [  564.703029]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  564.708089] RIP: 0033:0x7ff79bfce167
> [  564.711669] Code: 00 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0f
> [  564.730422] RSP: 002b:00007ffd8a9d85a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
> [  564.737989] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 00007ff79bfce167
> [  564.745121] RDX: 0000000007274970 RSI: 00000000072763a0 RDI: 0000000000000010
> [  564.752254] RBP: 00000000072763d0 R08: 0000000000000000 R09: 00007ffd8a9d84d7
> [  564.759387] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [  564.766520] R13: 00007ff79c039030 R14: 0000000000000002 R15: 0000000000000000
> [  564.773655] ---[ end trace 878f3b01fdcfe928 ]---
> [  565.117608] ------------[ cut here ]------------
Jiri Olsa March 22, 2021, 6:53 p.m. UTC | #4
On Mon, Mar 22, 2021 at 05:24:26PM +0100, Jiri Olsa wrote:
> On Mon, Mar 22, 2021 at 12:32:05AM +0100, Jiri Olsa wrote:
> > On Tue, Mar 16, 2021 at 02:00:07PM -0700, Alexei Starovoitov wrote:
> > > From: Alexei Starovoitov <ast@kernel.org>
> > > 
> > > The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
> > > The synchronize_rcu_tasks() will not wait for such tasks to complete.
> > > In such case the trampoline image will be freed and when the task
> > > wakes up the return IP will point to freed memory causing the crash.
> > > Solve this by adding percpu_ref_get/put for the duration of trampoline
> > > and separate trampoline vs its image life times.
> > > The "half page" optimization has to be removed, since
> > > first_half->second_half->first_half transition cannot be guaranteed to
> > > complete in deterministic time. Every trampoline update becomes a new image.
> > > The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
> > > call_rcu_tasks. Together they will wait for the original function and
> > > trampoline asm to complete. The trampoline is patched from nop to jmp to skip
> > > fexit progs. They are freed independently from the trampoline. The image with
> > > fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
> > > will wait for both sleepable and non-sleepable progs to complete.
> > > 
> > > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > > Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
> > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > Acked-by: Paul E. McKenney <paulmck@kernel.org>  # for RCU
> > > ---
> > > Without ftrace fix:
> > > https://patchwork.kernel.org/project/netdevbpf/patch/20210316195815.34714-1-alexei.starovoitov@gmail.com/
> > > this patch will trigger warn in ftrace.
> > > 
> > >  arch/x86/net/bpf_jit_comp.c |  26 ++++-
> > >  include/linux/bpf.h         |  24 +++-
> > >  kernel/bpf/bpf_struct_ops.c |   2 +-
> > >  kernel/bpf/core.c           |   4 +-
> > >  kernel/bpf/trampoline.c     | 218 +++++++++++++++++++++++++++---------
> > >  5 files changed, 213 insertions(+), 61 deletions(-)
> > > 
> > 
> > hi,
> > I'm on bpf/master and I'm triggering warnings below when running together:
> > 
> >   # while :; do ./test_progs -t fentry_test ; done
> >   # while :; do ./test_progs -t module_attach ; done
> 
> hum, is it possible that we don't take module ref and it can get
> unloaded even if there's trampoline attach to it..? I can't see
> that in the code.. ftrace_release_mod can't fail ;-)

when I get the module for each module trampoline,
I can no longer see those warnings (link for Steven):
  https://lore.kernel.org/bpf/YFfXcqnksPsSe0Bv@krava/

Steven,
I might be missing something, but it looks like module
can be unloaded even if the trampoline (direct function)
is registered in it.. is that right?

thanks,
jirka


---
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index b7e29db127fa..ab0b2c8df283 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -5059,6 +5059,28 @@ static struct ftrace_direct_func *ftrace_alloc_direct_func(unsigned long addr)
 	return direct;
 }
 
+static struct module *ftrace_direct_module_get(unsigned long ip)
+{
+	struct module *mod;
+	int err = 0;
+
+	preempt_disable();
+	mod = __module_text_address(ip);
+	if (mod && !try_module_get(mod))
+		err = -ENOENT;
+	preempt_enable();
+	return err ? ERR_PTR(err) : mod;
+}
+
+static void ftrace_direct_module_put(unsigned long ip)
+{
+	struct module *mod;
+
+	mod = __module_text_address(ip);
+	if (mod)
+		module_put(mod);
+}
+
 /**
  * register_ftrace_direct - Call a custom trampoline directly
  * @ip: The address of the nop at the beginning of a function
@@ -5081,6 +5103,7 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
 	struct ftrace_direct_func *direct;
 	struct ftrace_func_entry *entry;
 	struct ftrace_hash *free_hash = NULL;
+	struct module *mod = NULL;
 	struct dyn_ftrace *rec;
 	int ret = -EBUSY;
 
@@ -5095,6 +5118,13 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
 	if (!rec)
 		goto out_unlock;
 
+	mod = ftrace_direct_module_get(ip);
+	if (IS_ERR(mod)) {
+		ret = -ENOENT;
+		mod = NULL;
+		goto out_unlock;
+	}
+
 	/*
 	 * Check if the rec says it has a direct call but we didn't
 	 * find one earlier?
@@ -5172,6 +5202,8 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
  out_unlock:
 	mutex_unlock(&direct_mutex);
 
+	if (ret)
+		module_put(mod);
 	if (free_hash) {
 		synchronize_rcu_tasks();
 		free_ftrace_hash(free_hash);
@@ -5242,6 +5274,8 @@ int unregister_ftrace_direct(unsigned long ip, unsigned long addr)
 			ftrace_direct_func_count--;
 		}
 	}
+	ftrace_direct_module_put(ip);
+
  out_unlock:
 	mutex_unlock(&direct_mutex);
Steven Rostedt March 23, 2021, 12:59 p.m. UTC | #5
On Mon, 22 Mar 2021 19:53:10 +0100
Jiri Olsa <jolsa@redhat.com> wrote:

> On Mon, Mar 22, 2021 at 05:24:26PM +0100, Jiri Olsa wrote:
> > On Mon, Mar 22, 2021 at 12:32:05AM +0100, Jiri Olsa wrote:  
> > > On Tue, Mar 16, 2021 at 02:00:07PM -0700, Alexei Starovoitov wrote:  
> > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > 
> > > > The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
> > > > The synchronize_rcu_tasks() will not wait for such tasks to complete.
> > > > In such case the trampoline image will be freed and when the task
> > > > wakes up the return IP will point to freed memory causing the crash.
> > > > Solve this by adding percpu_ref_get/put for the duration of trampoline
> > > > and separate trampoline vs its image life times.
> > > > The "half page" optimization has to be removed, since
> > > > first_half->second_half->first_half transition cannot be guaranteed to
> > > > complete in deterministic time. Every trampoline update becomes a new image.
> > > > The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
> > > > call_rcu_tasks. Together they will wait for the original function and
> > > > trampoline asm to complete. The trampoline is patched from nop to jmp to skip
> > > > fexit progs. They are freed independently from the trampoline. The image with
> > > > fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
> > > > will wait for both sleepable and non-sleepable progs to complete.
> > > > 
> > > > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > > > Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
> > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > > Acked-by: Paul E. McKenney <paulmck@kernel.org>  # for RCU
> > > > ---
> > > > Without ftrace fix:
> > > > https://patchwork.kernel.org/project/netdevbpf/patch/20210316195815.34714-1-alexei.starovoitov@gmail.com/
> > > > this patch will trigger warn in ftrace.
> > > > 
> > > >  arch/x86/net/bpf_jit_comp.c |  26 ++++-
> > > >  include/linux/bpf.h         |  24 +++-
> > > >  kernel/bpf/bpf_struct_ops.c |   2 +-
> > > >  kernel/bpf/core.c           |   4 +-
> > > >  kernel/bpf/trampoline.c     | 218 +++++++++++++++++++++++++++---------
> > > >  5 files changed, 213 insertions(+), 61 deletions(-)
> > > >   
> > > 
> > > hi,
> > > I'm on bpf/master and I'm triggering warnings below when running together:
> > > 
> > >   # while :; do ./test_progs -t fentry_test ; done
> > >   # while :; do ./test_progs -t module_attach ; done  
> > 
> > hum, is it possible that we don't take module ref and it can get
> > unloaded even if there's trampoline attach to it..? I can't see
> > that in the code.. ftrace_release_mod can't fail ;-)  
> 
> when I get the module for each module trampoline,
> I can no longer see those warnings (link for Steven):
>   https://lore.kernel.org/bpf/YFfXcqnksPsSe0Bv@krava/
> 
> Steven,
> I might be missing something, but it looks like module
> can be unloaded even if the trampoline (direct function)
> is registered in it.. is that right?
> 

Not with your patch below ;-)

But yes, ftrace does not currently manage module text for direct calls,
it's assumed that whoever attaches to the module text would do that. But
I'm not adverse to the patch below.

-- Steve


> thanks,
> jirka
> 
> 
> ---
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index b7e29db127fa..ab0b2c8df283 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -5059,6 +5059,28 @@ static struct ftrace_direct_func *ftrace_alloc_direct_func(unsigned long addr)
>  	return direct;
>  }
>  
> +static struct module *ftrace_direct_module_get(unsigned long ip)
> +{
> +	struct module *mod;
> +	int err = 0;
> +
> +	preempt_disable();
> +	mod = __module_text_address(ip);
> +	if (mod && !try_module_get(mod))
> +		err = -ENOENT;
> +	preempt_enable();
> +	return err ? ERR_PTR(err) : mod;
> +}
> +
> +static void ftrace_direct_module_put(unsigned long ip)
> +{
> +	struct module *mod;
> +
> +	mod = __module_text_address(ip);
> +	if (mod)
> +		module_put(mod);
> +}
> +
>  /**
>   * register_ftrace_direct - Call a custom trampoline directly
>   * @ip: The address of the nop at the beginning of a function
> @@ -5081,6 +5103,7 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
>  	struct ftrace_direct_func *direct;
>  	struct ftrace_func_entry *entry;
>  	struct ftrace_hash *free_hash = NULL;
> +	struct module *mod = NULL;
>  	struct dyn_ftrace *rec;
>  	int ret = -EBUSY;
>  
> @@ -5095,6 +5118,13 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
>  	if (!rec)
>  		goto out_unlock;
>  
> +	mod = ftrace_direct_module_get(ip);
> +	if (IS_ERR(mod)) {
> +		ret = -ENOENT;
> +		mod = NULL;
> +		goto out_unlock;
> +	}
> +
>  	/*
>  	 * Check if the rec says it has a direct call but we didn't
>  	 * find one earlier?
> @@ -5172,6 +5202,8 @@ int register_ftrace_direct(unsigned long ip, unsigned long addr)
>   out_unlock:
>  	mutex_unlock(&direct_mutex);
>  
> +	if (ret)
> +		module_put(mod);
>  	if (free_hash) {
>  		synchronize_rcu_tasks();
>  		free_ftrace_hash(free_hash);
> @@ -5242,6 +5274,8 @@ int unregister_ftrace_direct(unsigned long ip, unsigned long addr)
>  			ftrace_direct_func_count--;
>  		}
>  	}
> +	ftrace_direct_module_put(ip);
> +
>   out_unlock:
>  	mutex_unlock(&direct_mutex);
>
Alexei Starovoitov March 23, 2021, 2:50 p.m. UTC | #6
On 3/23/21 5:59 AM, Steven Rostedt wrote:
> On Mon, 22 Mar 2021 19:53:10 +0100
> Jiri Olsa <jolsa@redhat.com> wrote:
> 
>> On Mon, Mar 22, 2021 at 05:24:26PM +0100, Jiri Olsa wrote:
>>> On Mon, Mar 22, 2021 at 12:32:05AM +0100, Jiri Olsa wrote:
>>>> On Tue, Mar 16, 2021 at 02:00:07PM -0700, Alexei Starovoitov wrote:
>>>>> From: Alexei Starovoitov <ast@kernel.org>
>>>>>
>>>>> The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
>>>>> The synchronize_rcu_tasks() will not wait for such tasks to complete.
>>>>> In such case the trampoline image will be freed and when the task
>>>>> wakes up the return IP will point to freed memory causing the crash.
>>>>> Solve this by adding percpu_ref_get/put for the duration of trampoline
>>>>> and separate trampoline vs its image life times.
>>>>> The "half page" optimization has to be removed, since
>>>>> first_half->second_half->first_half transition cannot be guaranteed to
>>>>> complete in deterministic time. Every trampoline update becomes a new image.
>>>>> The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
>>>>> call_rcu_tasks. Together they will wait for the original function and
>>>>> trampoline asm to complete. The trampoline is patched from nop to jmp to skip
>>>>> fexit progs. They are freed independently from the trampoline. The image with
>>>>> fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
>>>>> will wait for both sleepable and non-sleepable progs to complete.
>>>>>
>>>>> Reported-by: Andrii Nakryiko <andrii@kernel.org>
>>>>> Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
>>>>> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>>>>> Acked-by: Paul E. McKenney <paulmck@kernel.org>  # for RCU
>>>>> ---
>>>>> Without ftrace fix:
>>>>> https://patchwork.kernel.org/project/netdevbpf/patch/20210316195815.34714-1-alexei.starovoitov@gmail.com/
>>>>> this patch will trigger warn in ftrace.
>>>>>
>>>>>   arch/x86/net/bpf_jit_comp.c |  26 ++++-
>>>>>   include/linux/bpf.h         |  24 +++-
>>>>>   kernel/bpf/bpf_struct_ops.c |   2 +-
>>>>>   kernel/bpf/core.c           |   4 +-
>>>>>   kernel/bpf/trampoline.c     | 218 +++++++++++++++++++++++++++---------
>>>>>   5 files changed, 213 insertions(+), 61 deletions(-)
>>>>>    
>>>>
>>>> hi,
>>>> I'm on bpf/master and I'm triggering warnings below when running together:
>>>>
>>>>    # while :; do ./test_progs -t fentry_test ; done
>>>>    # while :; do ./test_progs -t module_attach ; done
>>>
>>> hum, is it possible that we don't take module ref and it can get
>>> unloaded even if there's trampoline attach to it..? I can't see
>>> that in the code.. ftrace_release_mod can't fail ;-)
>>
>> when I get the module for each module trampoline,
>> I can no longer see those warnings (link for Steven):
>>    https://lore.kernel.org/bpf/YFfXcqnksPsSe0Bv@krava/
>>
>> Steven,
>> I might be missing something, but it looks like module
>> can be unloaded even if the trampoline (direct function)
>> is registered in it.. is that right?
>>
> 
> Not with your patch below ;-)
> 
> But yes, ftrace does not currently manage module text for direct calls,
> it's assumed that whoever attaches to the module text would do that. But
> I'm not adverse to the patch below.

Jiri,

could you please refactor your patch to do the same in bpf trampoline?
The selftest/bpf would be great as well. It can come as a follow up.
Let's fix the issue for bpf tree first.
Jiri Olsa March 23, 2021, 8:59 p.m. UTC | #7
On Tue, Mar 23, 2021 at 07:50:55AM -0700, Alexei Starovoitov wrote:
> On 3/23/21 5:59 AM, Steven Rostedt wrote:
> > On Mon, 22 Mar 2021 19:53:10 +0100
> > Jiri Olsa <jolsa@redhat.com> wrote:
> > 
> > > On Mon, Mar 22, 2021 at 05:24:26PM +0100, Jiri Olsa wrote:
> > > > On Mon, Mar 22, 2021 at 12:32:05AM +0100, Jiri Olsa wrote:
> > > > > On Tue, Mar 16, 2021 at 02:00:07PM -0700, Alexei Starovoitov wrote:
> > > > > > From: Alexei Starovoitov <ast@kernel.org>
> > > > > > 
> > > > > > The fexit/fmod_ret programs can be attached to kernel functions that can sleep.
> > > > > > The synchronize_rcu_tasks() will not wait for such tasks to complete.
> > > > > > In such case the trampoline image will be freed and when the task
> > > > > > wakes up the return IP will point to freed memory causing the crash.
> > > > > > Solve this by adding percpu_ref_get/put for the duration of trampoline
> > > > > > and separate trampoline vs its image life times.
> > > > > > The "half page" optimization has to be removed, since
> > > > > > first_half->second_half->first_half transition cannot be guaranteed to
> > > > > > complete in deterministic time. Every trampoline update becomes a new image.
> > > > > > The image with fmod_ret or fexit progs will be freed via percpu_ref_kill and
> > > > > > call_rcu_tasks. Together they will wait for the original function and
> > > > > > trampoline asm to complete. The trampoline is patched from nop to jmp to skip
> > > > > > fexit progs. They are freed independently from the trampoline. The image with
> > > > > > fentry progs only will be freed via call_rcu_tasks_trace+call_rcu_tasks which
> > > > > > will wait for both sleepable and non-sleepable progs to complete.
> > > > > > 
> > > > > > Reported-by: Andrii Nakryiko <andrii@kernel.org>
> > > > > > Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
> > > > > > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > > > > > Acked-by: Paul E. McKenney <paulmck@kernel.org>  # for RCU
> > > > > > ---
> > > > > > Without ftrace fix:
> > > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20210316195815.34714-1-alexei.starovoitov@gmail.com/
> > > > > > this patch will trigger warn in ftrace.
> > > > > > 
> > > > > >   arch/x86/net/bpf_jit_comp.c |  26 ++++-
> > > > > >   include/linux/bpf.h         |  24 +++-
> > > > > >   kernel/bpf/bpf_struct_ops.c |   2 +-
> > > > > >   kernel/bpf/core.c           |   4 +-
> > > > > >   kernel/bpf/trampoline.c     | 218 +++++++++++++++++++++++++++---------
> > > > > >   5 files changed, 213 insertions(+), 61 deletions(-)
> > > > > 
> > > > > hi,
> > > > > I'm on bpf/master and I'm triggering warnings below when running together:
> > > > > 
> > > > >    # while :; do ./test_progs -t fentry_test ; done
> > > > >    # while :; do ./test_progs -t module_attach ; done
> > > > 
> > > > hum, is it possible that we don't take module ref and it can get
> > > > unloaded even if there's trampoline attach to it..? I can't see
> > > > that in the code.. ftrace_release_mod can't fail ;-)
> > > 
> > > when I get the module for each module trampoline,
> > > I can no longer see those warnings (link for Steven):
> > >    https://lore.kernel.org/bpf/YFfXcqnksPsSe0Bv@krava/
> > > 
> > > Steven,
> > > I might be missing something, but it looks like module
> > > can be unloaded even if the trampoline (direct function)
> > > is registered in it.. is that right?
> > > 
> > 
> > Not with your patch below ;-)
> > 
> > But yes, ftrace does not currently manage module text for direct calls,
> > it's assumed that whoever attaches to the module text would do that. But
> > I'm not adverse to the patch below.
> 
> Jiri,
> 
> could you please refactor your patch to do the same in bpf trampoline?

right, we need to take care about bpf_arch_text_poke
interface as well.. I'll resend

> The selftest/bpf would be great as well. It can come as a follow up.
> Let's fix the issue for bpf tree first.

ok, will send test later

jirka
diff mbox series

Patch

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 747bba0a584a..72b5a57e9e31 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1936,7 +1936,7 @@  static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog,
  * add rsp, 8                      // skip eth_type_trans's frame
  * ret                             // return to its caller
  */
-int arch_prepare_bpf_trampoline(void *image, void *image_end,
+int arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *image, void *image_end,
 				const struct btf_func_model *m, u32 flags,
 				struct bpf_tramp_progs *tprogs,
 				void *orig_call)
@@ -1975,6 +1975,15 @@  int arch_prepare_bpf_trampoline(void *image, void *image_end,
 
 	save_regs(m, &prog, nr_args, stack_size);
 
+	if (flags & BPF_TRAMP_F_CALL_ORIG) {
+		/* arg1: mov rdi, im */
+		emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im);
+		if (emit_call(&prog, __bpf_tramp_enter, prog)) {
+			ret = -EINVAL;
+			goto cleanup;
+		}
+	}
+
 	if (fentry->nr_progs)
 		if (invoke_bpf(m, &prog, fentry, stack_size))
 			return -EINVAL;
@@ -1993,8 +2002,7 @@  int arch_prepare_bpf_trampoline(void *image, void *image_end,
 	}
 
 	if (flags & BPF_TRAMP_F_CALL_ORIG) {
-		if (fentry->nr_progs || fmod_ret->nr_progs)
-			restore_regs(m, &prog, nr_args, stack_size);
+		restore_regs(m, &prog, nr_args, stack_size);
 
 		/* call original function */
 		if (emit_call(&prog, orig_call, prog)) {
@@ -2003,6 +2011,8 @@  int arch_prepare_bpf_trampoline(void *image, void *image_end,
 		}
 		/* remember return value in a stack for bpf prog to access */
 		emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8);
+		im->ip_after_call = prog;
+		emit_nops(&prog, 5);
 	}
 
 	if (fmod_ret->nr_progs) {
@@ -2033,9 +2043,17 @@  int arch_prepare_bpf_trampoline(void *image, void *image_end,
 	 * the return value is only updated on the stack and still needs to be
 	 * restored to R0.
 	 */
-	if (flags & BPF_TRAMP_F_CALL_ORIG)
+	if (flags & BPF_TRAMP_F_CALL_ORIG) {
+		im->ip_epilogue = prog;
+		/* arg1: mov rdi, im */
+		emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im);
+		if (emit_call(&prog, __bpf_tramp_exit, prog)) {
+			ret = -EINVAL;
+			goto cleanup;
+		}
 		/* restore original return value back into RAX */
 		emit_ldx(&prog, BPF_DW, BPF_REG_0, BPF_REG_FP, -8);
+	}
 
 	EMIT1(0x5B); /* pop rbx */
 	EMIT1(0xC9); /* leave */
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d7e0f479a5b0..3625f019767d 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -21,6 +21,7 @@ 
 #include <linux/capability.h>
 #include <linux/sched/mm.h>
 #include <linux/slab.h>
+#include <linux/percpu-refcount.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -556,7 +557,8 @@  struct bpf_tramp_progs {
  *      fentry = a set of program to run before calling original function
  *      fexit = a set of program to run after original function
  */
-int arch_prepare_bpf_trampoline(void *image, void *image_end,
+struct bpf_tramp_image;
+int arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end,
 				const struct btf_func_model *m, u32 flags,
 				struct bpf_tramp_progs *tprogs,
 				void *orig_call);
@@ -565,6 +567,8 @@  u64 notrace __bpf_prog_enter(struct bpf_prog *prog);
 void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start);
 u64 notrace __bpf_prog_enter_sleepable(struct bpf_prog *prog);
 void notrace __bpf_prog_exit_sleepable(struct bpf_prog *prog, u64 start);
+void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr);
+void notrace __bpf_tramp_exit(struct bpf_tramp_image *tr);
 
 struct bpf_ksym {
 	unsigned long		 start;
@@ -583,6 +587,18 @@  enum bpf_tramp_prog_type {
 	BPF_TRAMP_REPLACE, /* more than MAX */
 };
 
+struct bpf_tramp_image {
+	void *image;
+	struct bpf_ksym ksym;
+	struct percpu_ref pcref;
+	void *ip_after_call;
+	void *ip_epilogue;
+	union {
+		struct rcu_head rcu;
+		struct work_struct work;
+	};
+};
+
 struct bpf_trampoline {
 	/* hlist for trampoline_table */
 	struct hlist_node hlist;
@@ -605,9 +621,8 @@  struct bpf_trampoline {
 	/* Number of attached programs. A counter per kind. */
 	int progs_cnt[BPF_TRAMP_MAX];
 	/* Executable image of trampoline */
-	void *image;
+	struct bpf_tramp_image *cur_image;
 	u64 selector;
-	struct bpf_ksym ksym;
 };
 
 struct bpf_attach_target_info {
@@ -691,6 +706,8 @@  void bpf_image_ksym_add(void *data, struct bpf_ksym *ksym);
 void bpf_image_ksym_del(struct bpf_ksym *ksym);
 void bpf_ksym_add(struct bpf_ksym *ksym);
 void bpf_ksym_del(struct bpf_ksym *ksym);
+int bpf_jit_charge_modmem(u32 pages);
+void bpf_jit_uncharge_modmem(u32 pages);
 #else
 static inline int bpf_trampoline_link_prog(struct bpf_prog *prog,
 					   struct bpf_trampoline *tr)
@@ -787,7 +804,6 @@  struct bpf_prog_aux {
 	bool func_proto_unreliable;
 	bool sleepable;
 	bool tail_call_reachable;
-	enum bpf_tramp_prog_type trampoline_prog_type;
 	struct hlist_node tramp_hlist;
 	/* BTF_KIND_FUNC_PROTO for valid attach_btf_id */
 	const struct btf_type *attach_func_proto;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 1a666a975416..70f6fd4fa305 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -430,7 +430,7 @@  static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 
 		tprogs[BPF_TRAMP_FENTRY].progs[0] = prog;
 		tprogs[BPF_TRAMP_FENTRY].nr_progs = 1;
-		err = arch_prepare_bpf_trampoline(image,
+		err = arch_prepare_bpf_trampoline(NULL, image,
 						  st_map->image + PAGE_SIZE,
 						  &st_ops->func_models[i], 0,
 						  tprogs, NULL);
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 3a283bf97f2f..75244ecb2389 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -827,7 +827,7 @@  static int __init bpf_jit_charge_init(void)
 }
 pure_initcall(bpf_jit_charge_init);
 
-static int bpf_jit_charge_modmem(u32 pages)
+int bpf_jit_charge_modmem(u32 pages)
 {
 	if (atomic_long_add_return(pages, &bpf_jit_current) >
 	    (bpf_jit_limit >> PAGE_SHIFT)) {
@@ -840,7 +840,7 @@  static int bpf_jit_charge_modmem(u32 pages)
 	return 0;
 }
 
-static void bpf_jit_uncharge_modmem(u32 pages)
+void bpf_jit_uncharge_modmem(u32 pages)
 {
 	atomic_long_sub(pages, &bpf_jit_current);
 }
diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 7bc3b3209224..1f3a4be4b175 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -57,19 +57,10 @@  void bpf_image_ksym_del(struct bpf_ksym *ksym)
 			   PAGE_SIZE, true, ksym->name);
 }
 
-static void bpf_trampoline_ksym_add(struct bpf_trampoline *tr)
-{
-	struct bpf_ksym *ksym = &tr->ksym;
-
-	snprintf(ksym->name, KSYM_NAME_LEN, "bpf_trampoline_%llu", tr->key);
-	bpf_image_ksym_add(tr->image, ksym);
-}
-
 static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
 {
 	struct bpf_trampoline *tr;
 	struct hlist_head *head;
-	void *image;
 	int i;
 
 	mutex_lock(&trampoline_mutex);
@@ -84,14 +75,6 @@  static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
 	if (!tr)
 		goto out;
 
-	/* is_root was checked earlier. No need for bpf_jit_charge_modmem() */
-	image = bpf_jit_alloc_exec_page();
-	if (!image) {
-		kfree(tr);
-		tr = NULL;
-		goto out;
-	}
-
 	tr->key = key;
 	INIT_HLIST_NODE(&tr->hlist);
 	hlist_add_head(&tr->hlist, head);
@@ -99,9 +82,6 @@  static struct bpf_trampoline *bpf_trampoline_lookup(u64 key)
 	mutex_init(&tr->mutex);
 	for (i = 0; i < BPF_TRAMP_MAX; i++)
 		INIT_HLIST_HEAD(&tr->progs_hlist[i]);
-	tr->image = image;
-	INIT_LIST_HEAD_RCU(&tr->ksym.lnode);
-	bpf_trampoline_ksym_add(tr);
 out:
 	mutex_unlock(&trampoline_mutex);
 	return tr;
@@ -185,10 +165,142 @@  bpf_trampoline_get_progs(const struct bpf_trampoline *tr, int *total)
 	return tprogs;
 }
 
+static void __bpf_tramp_image_put_deferred(struct work_struct *work)
+{
+	struct bpf_tramp_image *im;
+
+	im = container_of(work, struct bpf_tramp_image, work);
+	bpf_image_ksym_del(&im->ksym);
+	bpf_jit_free_exec(im->image);
+	bpf_jit_uncharge_modmem(1);
+	percpu_ref_exit(&im->pcref);
+	kfree_rcu(im, rcu);
+}
+
+/* callback, fexit step 3 or fentry step 2 */
+static void __bpf_tramp_image_put_rcu(struct rcu_head *rcu)
+{
+	struct bpf_tramp_image *im;
+
+	im = container_of(rcu, struct bpf_tramp_image, rcu);
+	INIT_WORK(&im->work, __bpf_tramp_image_put_deferred);
+	schedule_work(&im->work);
+}
+
+/* callback, fexit step 2. Called after percpu_ref_kill confirms. */
+static void __bpf_tramp_image_release(struct percpu_ref *pcref)
+{
+	struct bpf_tramp_image *im;
+
+	im = container_of(pcref, struct bpf_tramp_image, pcref);
+	call_rcu_tasks(&im->rcu, __bpf_tramp_image_put_rcu);
+}
+
+/* callback, fexit or fentry step 1 */
+static void __bpf_tramp_image_put_rcu_tasks(struct rcu_head *rcu)
+{
+	struct bpf_tramp_image *im;
+
+	im = container_of(rcu, struct bpf_tramp_image, rcu);
+	if (im->ip_after_call)
+		/* the case of fmod_ret/fexit trampoline and CONFIG_PREEMPTION=y */
+		percpu_ref_kill(&im->pcref);
+	else
+		/* the case of fentry trampoline */
+		call_rcu_tasks(&im->rcu, __bpf_tramp_image_put_rcu);
+}
+
+static void bpf_tramp_image_put(struct bpf_tramp_image *im)
+{
+	/* The trampoline image that calls original function is using:
+	 * rcu_read_lock_trace to protect sleepable bpf progs
+	 * rcu_read_lock to protect normal bpf progs
+	 * percpu_ref to protect trampoline itself
+	 * rcu tasks to protect trampoline asm not covered by percpu_ref
+	 * (which are few asm insns before __bpf_tramp_enter and
+	 *  after __bpf_tramp_exit)
+	 *
+	 * The trampoline is unreachable before bpf_tramp_image_put().
+	 *
+	 * First, patch the trampoline to avoid calling into fexit progs.
+	 * The progs will be freed even if the original function is still
+	 * executing or sleeping.
+	 * In case of CONFIG_PREEMPT=y use call_rcu_tasks() to wait on
+	 * first few asm instructions to execute and call into
+	 * __bpf_tramp_enter->percpu_ref_get.
+	 * Then use percpu_ref_kill to wait for the trampoline and the original
+	 * function to finish.
+	 * Then use call_rcu_tasks() to make sure few asm insns in
+	 * the trampoline epilogue are done as well.
+	 *
+	 * In !PREEMPT case the task that got interrupted in the first asm
+	 * insns won't go through an RCU quiescent state which the
+	 * percpu_ref_kill will be waiting for. Hence the first
+	 * call_rcu_tasks() is not necessary.
+	 */
+	if (im->ip_after_call) {
+		int err = bpf_arch_text_poke(im->ip_after_call, BPF_MOD_JUMP,
+					     NULL, im->ip_epilogue);
+		WARN_ON(err);
+		if (IS_ENABLED(CONFIG_PREEMPTION))
+			call_rcu_tasks(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
+		else
+			percpu_ref_kill(&im->pcref);
+		return;
+	}
+
+	/* The trampoline without fexit and fmod_ret progs doesn't call original
+	 * function and doesn't use percpu_ref.
+	 * Use call_rcu_tasks_trace() to wait for sleepable progs to finish.
+	 * Then use call_rcu_tasks() to wait for the rest of trampoline asm
+	 * and normal progs.
+	 */
+	call_rcu_tasks_trace(&im->rcu, __bpf_tramp_image_put_rcu_tasks);
+}
+
+static struct bpf_tramp_image *bpf_tramp_image_alloc(u64 key, u32 idx)
+{
+	struct bpf_tramp_image *im;
+	struct bpf_ksym *ksym;
+	void *image;
+	int err = -ENOMEM;
+
+	im = kzalloc(sizeof(*im), GFP_KERNEL);
+	if (!im)
+		goto out;
+
+	err = bpf_jit_charge_modmem(1);
+	if (err)
+		goto out_free_im;
+
+	err = -ENOMEM;
+	im->image = image = bpf_jit_alloc_exec_page();
+	if (!image)
+		goto out_uncharge;
+
+	err = percpu_ref_init(&im->pcref, __bpf_tramp_image_release, 0, GFP_KERNEL);
+	if (err)
+		goto out_free_image;
+
+	ksym = &im->ksym;
+	INIT_LIST_HEAD_RCU(&ksym->lnode);
+	snprintf(ksym->name, KSYM_NAME_LEN, "bpf_trampoline_%llu_%u", key, idx);
+	bpf_image_ksym_add(image, ksym);
+	return im;
+
+out_free_image:
+	bpf_jit_free_exec(im->image);
+out_uncharge:
+	bpf_jit_uncharge_modmem(1);
+out_free_im:
+	kfree(im);
+out:
+	return ERR_PTR(err);
+}
+
 static int bpf_trampoline_update(struct bpf_trampoline *tr)
 {
-	void *old_image = tr->image + ((tr->selector + 1) & 1) * PAGE_SIZE/2;
-	void *new_image = tr->image + (tr->selector & 1) * PAGE_SIZE/2;
+	struct bpf_tramp_image *im;
 	struct bpf_tramp_progs *tprogs;
 	u32 flags = BPF_TRAMP_F_RESTORE_REGS;
 	int err, total;
@@ -198,41 +310,42 @@  static int bpf_trampoline_update(struct bpf_trampoline *tr)
 		return PTR_ERR(tprogs);
 
 	if (total == 0) {
-		err = unregister_fentry(tr, old_image);
+		err = unregister_fentry(tr, tr->cur_image->image);
+		bpf_tramp_image_put(tr->cur_image);
+		tr->cur_image = NULL;
 		tr->selector = 0;
 		goto out;
 	}
 
+	im = bpf_tramp_image_alloc(tr->key, tr->selector);
+	if (IS_ERR(im)) {
+		err = PTR_ERR(im);
+		goto out;
+	}
+
 	if (tprogs[BPF_TRAMP_FEXIT].nr_progs ||
 	    tprogs[BPF_TRAMP_MODIFY_RETURN].nr_progs)
 		flags = BPF_TRAMP_F_CALL_ORIG | BPF_TRAMP_F_SKIP_FRAME;
 
-	/* Though the second half of trampoline page is unused a task could be
-	 * preempted in the middle of the first half of trampoline and two
-	 * updates to trampoline would change the code from underneath the
-	 * preempted task. Hence wait for tasks to voluntarily schedule or go
-	 * to userspace.
-	 * The same trampoline can hold both sleepable and non-sleepable progs.
-	 * synchronize_rcu_tasks_trace() is needed to make sure all sleepable
-	 * programs finish executing.
-	 * Wait for these two grace periods together.
-	 */
-	synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace);
-
-	err = arch_prepare_bpf_trampoline(new_image, new_image + PAGE_SIZE / 2,
+	err = arch_prepare_bpf_trampoline(im, im->image, im->image + PAGE_SIZE,
 					  &tr->func.model, flags, tprogs,
 					  tr->func.addr);
 	if (err < 0)
 		goto out;
 
-	if (tr->selector)
+	WARN_ON(tr->cur_image && tr->selector == 0);
+	WARN_ON(!tr->cur_image && tr->selector);
+	if (tr->cur_image)
 		/* progs already running at this address */
-		err = modify_fentry(tr, old_image, new_image);
+		err = modify_fentry(tr, tr->cur_image->image, im->image);
 	else
 		/* first time registering */
-		err = register_fentry(tr, new_image);
+		err = register_fentry(tr, im->image);
 	if (err)
 		goto out;
+	if (tr->cur_image)
+		bpf_tramp_image_put(tr->cur_image);
+	tr->cur_image = im;
 	tr->selector++;
 out:
 	kfree(tprogs);
@@ -364,17 +477,12 @@  void bpf_trampoline_put(struct bpf_trampoline *tr)
 		goto out;
 	if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[BPF_TRAMP_FEXIT])))
 		goto out;
-	bpf_image_ksym_del(&tr->ksym);
-	/* This code will be executed when all bpf progs (both sleepable and
-	 * non-sleepable) went through
-	 * bpf_prog_put()->call_rcu[_tasks_trace]()->bpf_prog_free_deferred().
-	 * Hence no need for another synchronize_rcu_tasks_trace() here,
-	 * but synchronize_rcu_tasks() is still needed, since trampoline
-	 * may not have had any sleepable programs and we need to wait
-	 * for tasks to get out of trampoline code before freeing it.
+	/* This code will be executed even when the last bpf_tramp_image
+	 * is alive. All progs are detached from the trampoline and the
+	 * trampoline image is patched with jmp into epilogue to skip
+	 * fexit progs. The fentry-only trampoline will be freed via
+	 * multiple rcu callbacks.
 	 */
-	synchronize_rcu_tasks();
-	bpf_jit_free_exec(tr->image);
 	hlist_del(&tr->hlist);
 	kfree(tr);
 out:
@@ -478,8 +586,18 @@  void notrace __bpf_prog_exit_sleepable(struct bpf_prog *prog, u64 start)
 	rcu_read_unlock_trace();
 }
 
+void notrace __bpf_tramp_enter(struct bpf_tramp_image *tr)
+{
+	percpu_ref_get(&tr->pcref);
+}
+
+void notrace __bpf_tramp_exit(struct bpf_tramp_image *tr)
+{
+	percpu_ref_put(&tr->pcref);
+}
+
 int __weak
-arch_prepare_bpf_trampoline(void *image, void *image_end,
+arch_prepare_bpf_trampoline(struct bpf_tramp_image *tr, void *image, void *image_end,
 			    const struct btf_func_model *m, u32 flags,
 			    struct bpf_tramp_progs *tprogs,
 			    void *orig_call)