Message ID | 20230720154941.1504-1-puranjay12@gmail.com |
---|---|
Series | bpf, riscv: use BPF prog pack allocator in BPF JIT |
Puranjay Mohan <puranjay12@gmail.com> writes: > BPF programs currently consume a page each on RISCV. For systems with many BPF > programs, this adds significant pressure to instruction TLB. High iTLB pressure > usually causes slow down for the whole system. > > Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. > It packs multiple BPF programs into a single huge page. It is currently only > enabled for the x86_64 BPF JIT. > > I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. > > This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. > This series needs a patch[3] from the ARM64 series to work. Just a heads-up; I'm on vacation for 2 more weeks, so expect a somewhat late review from my side! Thank you for working on the RISC-V BPF JIT! Björn
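[Editorial context: for readers who have not followed the x86/arm64 work, "enabling the prog pack allocator" means the arch JIT allocates its image through the pack-aware helpers in kernel/bpf/core.c and overrides two weak text-manipulation hooks. The prototypes below are an approximate sketch based on the 6.5-era BPF headers; treat the exact parameter lists as assumptions and check the tree.]

/* Weak hooks an arch JIT overrides so the core can write into, and later
 * poison, the shared read-only pack: */
void *bpf_arch_text_copy(void *dst, void *src, size_t len);
int bpf_arch_text_invalidate(void *dst, size_t len);

/* Pack-aware image allocation/teardown used instead of
 * bpf_jit_binary_alloc()/bpf_jit_binary_free() (approximate signatures): */
struct bpf_binary_header *
bpf_jit_binary_pack_alloc(unsigned int proglen, u8 **image_ptr,
			  unsigned int alignment,
			  struct bpf_binary_header **rw_header,
			  u8 **rw_image,
			  bpf_jit_fill_hole_t bpf_fill_ill_insns);
int bpf_jit_binary_pack_finalize(struct bpf_prog *prog,
				 struct bpf_binary_header *ro_header,
				 struct bpf_binary_header *rw_header);
void bpf_jit_binary_pack_free(struct bpf_binary_header *ro_header,
			      struct bpf_binary_header *rw_header);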
Puranjay Mohan <puranjay12@gmail.com> writes: > BPF programs currently consume a page each on RISCV. For systems with many BPF > programs, this adds significant pressure to instruction TLB. High iTLB pressure > usually causes slow down for the whole system. > > Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. > It packs multiple BPF programs into a single huge page. It is currently only > enabled for the x86_64 BPF JIT. > > I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. > > This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. > This series needs a patch[3] from the ARM64 series to work. > > ====================================================== > Performance Analysis of prog pack allocator on RISCV64 > ====================================================== > > Test setup: > =========== > > Host machine: Debian GNU/Linux 11 (bullseye) > Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1) > u-boot-qemu Version: 2023.07+dfsg-1 > opensbi Version: 1.3-1 > > To test the performance of the BPF prog pack allocator on RV, a stresser > tool[4] linked below was built. This tool loads 8 BPF programs on the system and > triggers 5 of them in an infinite loop by doing system calls. > > The runner script starts 20 instances of the above which loads 8*20=160 BPF > programs on the system, 5*20=100 of which are being constantly triggered. > The script is passed a command which would be run in the above environment. > > The script was run with following perf command: > ./run.sh "perf stat -a \ > -e iTLB-load-misses \ > -e dTLB-load-misses \ > -e dTLB-store-misses \ > -e instructions \ > --timeout 60000" > > The output of the above command is discussed below before and after enabling the > BPF prog pack allocator. > > The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs > was created using Bjorn's riscv-cross-builder[5] docker container linked below. Back in the saddle! Sorry for the horribly late reply... Did you run the test_progs kselftest test, and passed w/o regressions? I ran a test without/with your series (plus the patch from the arm64 series that you pointed out), and I'm getting regressions with this series: w/o Summary: 318/3114 PASSED, 27 SKIPPED, 60 FAILED w/ Summary: 299/3026 PASSED, 33 SKIPPED, 79 FAILED I'm did the test on commit 4c75bf7e4a0e ("Merge tag 'kbuild-fixes-v6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild"). I'm re-running, and investigating now. Björn
Björn Töpel <bjorn@kernel.org> writes: > Puranjay Mohan <puranjay12@gmail.com> writes: > >> BPF programs currently consume a page each on RISCV. For systems with many BPF >> programs, this adds significant pressure to instruction TLB. High iTLB pressure >> usually causes slow down for the whole system. >> >> Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. >> It packs multiple BPF programs into a single huge page. It is currently only >> enabled for the x86_64 BPF JIT. >> >> I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. >> >> This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. >> This series needs a patch[3] from the ARM64 series to work. >> >> ====================================================== >> Performance Analysis of prog pack allocator on RISCV64 >> ====================================================== >> >> Test setup: >> =========== >> >> Host machine: Debian GNU/Linux 11 (bullseye) >> Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1) >> u-boot-qemu Version: 2023.07+dfsg-1 >> opensbi Version: 1.3-1 >> >> To test the performance of the BPF prog pack allocator on RV, a stresser >> tool[4] linked below was built. This tool loads 8 BPF programs on the system and >> triggers 5 of them in an infinite loop by doing system calls. >> >> The runner script starts 20 instances of the above which loads 8*20=160 BPF >> programs on the system, 5*20=100 of which are being constantly triggered. >> The script is passed a command which would be run in the above environment. >> >> The script was run with following perf command: >> ./run.sh "perf stat -a \ >> -e iTLB-load-misses \ >> -e dTLB-load-misses \ >> -e dTLB-store-misses \ >> -e instructions \ >> --timeout 60000" >> >> The output of the above command is discussed below before and after enabling the >> BPF prog pack allocator. >> >> The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs >> was created using Bjorn's riscv-cross-builder[5] docker container linked below. > > Back in the saddle! Sorry for the horribly late reply... > > Did you run the test_progs kselftest test, and passed w/o regressions? I > ran a test without/with your series (plus the patch from the arm64 > series that you pointed out), and I'm getting regressions with this > series: > > w/o Summary: 318/3114 PASSED, 27 SKIPPED, 60 FAILED > w/ Summary: 299/3026 PASSED, 33 SKIPPED, 79 FAILED > > I'm did the test on commit 4c75bf7e4a0e ("Merge tag > 'kbuild-fixes-v6.5-2' of > git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild"). > > I'm re-running, and investigating now. I had a bad environment on for the rebuild; A proper rebuild worked. No regressions. Sorry for the noise!
Puranjay Mohan <puranjay12@gmail.com> writes: > BPF programs currently consume a page each on RISCV. For systems with many BPF > programs, this adds significant pressure to instruction TLB. High iTLB pressure > usually causes slow down for the whole system. > > Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. > It packs multiple BPF programs into a single huge page. It is currently only > enabled for the x86_64 BPF JIT. > > I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. > > This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. > This series needs a patch[3] from the ARM64 series to work. > > ====================================================== > Performance Analysis of prog pack allocator on RISCV64 > ====================================================== > > Test setup: > =========== > > Host machine: Debian GNU/Linux 11 (bullseye) > Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1) > u-boot-qemu Version: 2023.07+dfsg-1 > opensbi Version: 1.3-1 > > To test the performance of the BPF prog pack allocator on RV, a stresser > tool[4] linked below was built. This tool loads 8 BPF programs on the system and > triggers 5 of them in an infinite loop by doing system calls. > > The runner script starts 20 instances of the above which loads 8*20=160 BPF > programs on the system, 5*20=100 of which are being constantly triggered. > The script is passed a command which would be run in the above environment. > > The script was run with following perf command: > ./run.sh "perf stat -a \ > -e iTLB-load-misses \ > -e dTLB-load-misses \ > -e dTLB-store-misses \ > -e instructions \ > --timeout 60000" > > The output of the above command is discussed below before and after enabling the > BPF prog pack allocator. > > The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs > was created using Bjorn's riscv-cross-builder[5] docker container linked below. > > Results > ======= > > Before enabling prog pack allocator: > ------------------------------------ > > Performance counter stats for 'system wide': > > 4939048 iTLB-load-misses > 5468689 dTLB-load-misses > 465234 dTLB-store-misses > 1441082097998 instructions > > 60.045791200 seconds time elapsed > > After enabling prog pack allocator: > ----------------------------------- > > Performance counter stats for 'system wide': > > 3430035 iTLB-load-misses > 5008745 dTLB-load-misses > 409944 dTLB-store-misses > 1441535637988 instructions > > 60.046296600 seconds time elapsed > > Improvements in metrics > ======================= > > It was expected that the iTLB-load-misses would decrease as now a single huge > page is used to keep all the BPF programs compared to a single page for each > program earlier. > > -------------------------------------------- > The improvement in iTLB-load-misses: -30.5 % > -------------------------------------------- > > I repeated this expriment more than 100 times in different setups and the > improvement was always greater than 30%. > > This patch series is boot tested on the Starfive VisionFive 2 board[6]. > The performance analysis was not done on the board because it doesn't > expose iTLB-load-misses, etc. 
The stresser program was run on the board to test > the loading and unloading of BPF programs > > [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/ > [2] https://lore.kernel.org/all/20230626085811.3192402-1-puranjay12@gmail.com/ > [3] https://lore.kernel.org/all/20230626085811.3192402-2-puranjay12@gmail.com/ > [4] https://github.com/puranjaymohan/BPF-Allocator-Bench > [5] https://github.com/bjoto/riscv-cross-builder > [6] https://www.starfivetech.com/en/site/boards > > Puranjay Mohan (2): > riscv: Extend patch_text_nosync() for multiple pages > bpf, riscv: use prog pack allocator in the BPF JIT I get a hang for "test_tag", but it's not directly related to your series, but rather "remote fence.i". | rcu: INFO: rcu_sched detected stalls on CPUs/tasks: | rcu: 0-....: (1400 ticks this GP) idle=d5e4/1/0x4000000000000000 softirq=5542/5542 fqs=1862 | rcu: (detected by 1, t=5252 jiffies, g=10253, q=195 ncpus=4) | Task dump for CPU 0: | task:kworker/0:5 state:R running task stack:0 pid:319 ppid:2 flags:0x00000008 | Workqueue: events bpf_prog_free_deferred | Call Trace: | [<ffffffff80cbc444>] __schedule+0x2d0/0x940 | watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [kworker/0:5:319] | Modules linked in: nls_iso8859_1 drm fuse i2c_core drm_panel_orientation_quirks backlight dm_mod configfs ip_tables x_tables | CPU: 0 PID: 319 Comm: kworker/0:5 Not tainted 6.5.0-rc5 #1 | Hardware name: riscv-virtio,qemu (DT) | Workqueue: events bpf_prog_free_deferred | epc : __sbi_rfence_v02_call.isra.0+0x74/0x11a | ra : __sbi_rfence_v02+0xda/0x1a4 | epc : ffffffff8000ab4c ra : ffffffff8000accc sp : ff20000001c9bbd0 | gp : ffffffff82078c48 tp : ff600000888e6a40 t0 : ff20000001c9bd44 | t1 : 0000000000000000 t2 : 0000000000000040 s0 : ff20000001c9bbf0 | s1 : 0000000000000010 a0 : 0000000000000000 a1 : 0000000000000000 | a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000 | a5 : 0000000000000000 a6 : 0000000000000000 a7 : 0000000052464e43 | s2 : 000000000000ffff s3 : 00000000ffffffff s4 : ffffffff81667528 | s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000 | s8 : 0000000000000001 s9 : 0000000000000003 s10: 0000000000000040 | s11: ffffffff8207d240 t3 : 000000000000000f t4 : 000000000000002a | t5 : ff600000872df140 t6 : ffffffff81e26828 | status: 0000000200000120 badaddr: 0000000000000000 cause: 8000000000000005 | [<ffffffff8000ab4c>] __sbi_rfence_v02_call.isra.0+0x74/0x11a | [<ffffffff8000accc>] __sbi_rfence_v02+0xda/0x1a4 | [<ffffffff8000a886>] sbi_remote_fence_i+0x1e/0x26 | [<ffffffff8000cee2>] flush_icache_all+0x1a/0x48 | [<ffffffff80007736>] patch_text_nosync+0x6c/0x8c | [<ffffffff8000f0f8>] bpf_arch_text_invalidate+0x62/0xac | [<ffffffff8016c538>] bpf_prog_pack_free+0x9c/0x1b2 | [<ffffffff8016c84a>] bpf_jit_binary_pack_free+0x20/0x4a | [<ffffffff8000f198>] bpf_jit_free+0x56/0x9e | [<ffffffff8016b43a>] bpf_prog_free_deferred+0x15a/0x182 | [<ffffffff800576c4>] process_one_work+0x1b6/0x3d6 | [<ffffffff80057d52>] worker_thread+0x84/0x378 | [<ffffffff8005fc2c>] kthread+0xe8/0x108 | [<ffffffff80003ffa>] ret_from_fork+0xe/0x20 I'm digging into that now, and I would appreciate if you could run the test_tag on VF2 or similar (I'm missing that HW). It seems like we're hitting a bug with this series, so let's try to figure out where the problems is, prior merging it. Björn
Hi Björn, On Mon, Aug 14, 2023 at 11:12 AM Björn Töpel <bjorn@kernel.org> wrote: > > Puranjay Mohan <puranjay12@gmail.com> writes: > > > BPF programs currently consume a page each on RISCV. For systems with many BPF > > programs, this adds significant pressure to instruction TLB. High iTLB pressure > > usually causes slow down for the whole system. > > > > Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. > > It packs multiple BPF programs into a single huge page. It is currently only > > enabled for the x86_64 BPF JIT. > > > > I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. > > > > This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. > > This series needs a patch[3] from the ARM64 series to work. > > > > ====================================================== > > Performance Analysis of prog pack allocator on RISCV64 > > ====================================================== > > > > Test setup: > > =========== > > > > Host machine: Debian GNU/Linux 11 (bullseye) > > Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1) > > u-boot-qemu Version: 2023.07+dfsg-1 > > opensbi Version: 1.3-1 > > > > To test the performance of the BPF prog pack allocator on RV, a stresser > > tool[4] linked below was built. This tool loads 8 BPF programs on the system and > > triggers 5 of them in an infinite loop by doing system calls. > > > > The runner script starts 20 instances of the above which loads 8*20=160 BPF > > programs on the system, 5*20=100 of which are being constantly triggered. > > The script is passed a command which would be run in the above environment. > > > > The script was run with following perf command: > > ./run.sh "perf stat -a \ > > -e iTLB-load-misses \ > > -e dTLB-load-misses \ > > -e dTLB-store-misses \ > > -e instructions \ > > --timeout 60000" > > > > The output of the above command is discussed below before and after enabling the > > BPF prog pack allocator. > > > > The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs > > was created using Bjorn's riscv-cross-builder[5] docker container linked below. > > > > Results > > ======= > > > > Before enabling prog pack allocator: > > ------------------------------------ > > > > Performance counter stats for 'system wide': > > > > 4939048 iTLB-load-misses > > 5468689 dTLB-load-misses > > 465234 dTLB-store-misses > > 1441082097998 instructions > > > > 60.045791200 seconds time elapsed > > > > After enabling prog pack allocator: > > ----------------------------------- > > > > Performance counter stats for 'system wide': > > > > 3430035 iTLB-load-misses > > 5008745 dTLB-load-misses > > 409944 dTLB-store-misses > > 1441535637988 instructions > > > > 60.046296600 seconds time elapsed > > > > Improvements in metrics > > ======================= > > > > It was expected that the iTLB-load-misses would decrease as now a single huge > > page is used to keep all the BPF programs compared to a single page for each > > program earlier. > > > > -------------------------------------------- > > The improvement in iTLB-load-misses: -30.5 % > > -------------------------------------------- > > > > I repeated this expriment more than 100 times in different setups and the > > improvement was always greater than 30%. > > > > This patch series is boot tested on the Starfive VisionFive 2 board[6]. > > The performance analysis was not done on the board because it doesn't > > expose iTLB-load-misses, etc. 
The stresser program was run on the board to test > > the loading and unloading of BPF programs > > > > [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/ > > [2] https://lore.kernel.org/all/20230626085811.3192402-1-puranjay12@gmail.com/ > > [3] https://lore.kernel.org/all/20230626085811.3192402-2-puranjay12@gmail.com/ > > [4] https://github.com/puranjaymohan/BPF-Allocator-Bench > > [5] https://github.com/bjoto/riscv-cross-builder > > [6] https://www.starfivetech.com/en/site/boards > > > > Puranjay Mohan (2): > > riscv: Extend patch_text_nosync() for multiple pages > > bpf, riscv: use prog pack allocator in the BPF JIT > > I get a hang for "test_tag", but it's not directly related to your > series, but rather "remote fence.i". I was seeing some stalls like this even without my series but couldn't debug them at that time. > > | rcu: INFO: rcu_sched detected stalls on CPUs/tasks: > | rcu: 0-....: (1400 ticks this GP) idle=d5e4/1/0x4000000000000000 softirq=5542/5542 fqs=1862 > | rcu: (detected by 1, t=5252 jiffies, g=10253, q=195 ncpus=4) > | Task dump for CPU 0: > | task:kworker/0:5 state:R running task stack:0 pid:319 ppid:2 flags:0x00000008 > | Workqueue: events bpf_prog_free_deferred > | Call Trace: > | [<ffffffff80cbc444>] __schedule+0x2d0/0x940 > | watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [kworker/0:5:319] > | Modules linked in: nls_iso8859_1 drm fuse i2c_core drm_panel_orientation_quirks backlight dm_mod configfs ip_tables x_tables > | CPU: 0 PID: 319 Comm: kworker/0:5 Not tainted 6.5.0-rc5 #1 > | Hardware name: riscv-virtio,qemu (DT) > | Workqueue: events bpf_prog_free_deferred > | epc : __sbi_rfence_v02_call.isra.0+0x74/0x11a > | ra : __sbi_rfence_v02+0xda/0x1a4 > | epc : ffffffff8000ab4c ra : ffffffff8000accc sp : ff20000001c9bbd0 > | gp : ffffffff82078c48 tp : ff600000888e6a40 t0 : ff20000001c9bd44 > | t1 : 0000000000000000 t2 : 0000000000000040 s0 : ff20000001c9bbf0 > | s1 : 0000000000000010 a0 : 0000000000000000 a1 : 0000000000000000 > | a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000 > | a5 : 0000000000000000 a6 : 0000000000000000 a7 : 0000000052464e43 > | s2 : 000000000000ffff s3 : 00000000ffffffff s4 : ffffffff81667528 > | s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000 > | s8 : 0000000000000001 s9 : 0000000000000003 s10: 0000000000000040 > | s11: ffffffff8207d240 t3 : 000000000000000f t4 : 000000000000002a > | t5 : ff600000872df140 t6 : ffffffff81e26828 > | status: 0000000200000120 badaddr: 0000000000000000 cause: 8000000000000005 > | [<ffffffff8000ab4c>] __sbi_rfence_v02_call.isra.0+0x74/0x11a > | [<ffffffff8000accc>] __sbi_rfence_v02+0xda/0x1a4 > | [<ffffffff8000a886>] sbi_remote_fence_i+0x1e/0x26 > | [<ffffffff8000cee2>] flush_icache_all+0x1a/0x48 > | [<ffffffff80007736>] patch_text_nosync+0x6c/0x8c > | [<ffffffff8000f0f8>] bpf_arch_text_invalidate+0x62/0xac > | [<ffffffff8016c538>] bpf_prog_pack_free+0x9c/0x1b2 > | [<ffffffff8016c84a>] bpf_jit_binary_pack_free+0x20/0x4a > | [<ffffffff8000f198>] bpf_jit_free+0x56/0x9e > | [<ffffffff8016b43a>] bpf_prog_free_deferred+0x15a/0x182 > | [<ffffffff800576c4>] process_one_work+0x1b6/0x3d6 > | [<ffffffff80057d52>] worker_thread+0x84/0x378 > | [<ffffffff8005fc2c>] kthread+0xe8/0x108 > | [<ffffffff80003ffa>] ret_from_fork+0xe/0x20 > > I'm digging into that now, and I would appreciate if you could run the > test_tag on VF2 or similar (I'm missing that HW). Sure, I will try to run this on the board. 
I will rebase my series (+ the patch from the arm64 series) on the latest
bpf-next tree and try to run it.

Let me know if I need to add:
+ select HAVE_EFFICIENT_UNALIGNED_ACCESS if MMU && 64BIT

>
> It seems like we're hitting a bug with this series, so let's try to
> figure out where the problems is, prior merging it.
>
>
> Björn

Thanks,
Puranjay
Puranjay Mohan <puranjay12@gmail.com> writes: >> I get a hang for "test_tag", but it's not directly related to your >> series, but rather "remote fence.i". > > I was seeing some stalls like this even without my series but couldn't > debug them at that time. Yeah, I think it's not related to your series -- it's just a good reproducer. ;-) >> >> | rcu: INFO: rcu_sched detected stalls on CPUs/tasks: >> | rcu: 0-....: (1400 ticks this GP) idle=d5e4/1/0x4000000000000000 softirq=5542/5542 fqs=1862 >> | rcu: (detected by 1, t=5252 jiffies, g=10253, q=195 ncpus=4) >> | Task dump for CPU 0: >> | task:kworker/0:5 state:R running task stack:0 pid:319 ppid:2 flags:0x00000008 >> | Workqueue: events bpf_prog_free_deferred >> | Call Trace: >> | [<ffffffff80cbc444>] __schedule+0x2d0/0x940 >> | watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [kworker/0:5:319] >> | Modules linked in: nls_iso8859_1 drm fuse i2c_core drm_panel_orientation_quirks backlight dm_mod configfs ip_tables x_tables >> | CPU: 0 PID: 319 Comm: kworker/0:5 Not tainted 6.5.0-rc5 #1 >> | Hardware name: riscv-virtio,qemu (DT) >> | Workqueue: events bpf_prog_free_deferred >> | epc : __sbi_rfence_v02_call.isra.0+0x74/0x11a >> | ra : __sbi_rfence_v02+0xda/0x1a4 >> | epc : ffffffff8000ab4c ra : ffffffff8000accc sp : ff20000001c9bbd0 >> | gp : ffffffff82078c48 tp : ff600000888e6a40 t0 : ff20000001c9bd44 >> | t1 : 0000000000000000 t2 : 0000000000000040 s0 : ff20000001c9bbf0 >> | s1 : 0000000000000010 a0 : 0000000000000000 a1 : 0000000000000000 >> | a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000 >> | a5 : 0000000000000000 a6 : 0000000000000000 a7 : 0000000052464e43 >> | s2 : 000000000000ffff s3 : 00000000ffffffff s4 : ffffffff81667528 >> | s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000 >> | s8 : 0000000000000001 s9 : 0000000000000003 s10: 0000000000000040 >> | s11: ffffffff8207d240 t3 : 000000000000000f t4 : 000000000000002a >> | t5 : ff600000872df140 t6 : ffffffff81e26828 >> | status: 0000000200000120 badaddr: 0000000000000000 cause: 8000000000000005 >> | [<ffffffff8000ab4c>] __sbi_rfence_v02_call.isra.0+0x74/0x11a >> | [<ffffffff8000accc>] __sbi_rfence_v02+0xda/0x1a4 >> | [<ffffffff8000a886>] sbi_remote_fence_i+0x1e/0x26 >> | [<ffffffff8000cee2>] flush_icache_all+0x1a/0x48 >> | [<ffffffff80007736>] patch_text_nosync+0x6c/0x8c >> | [<ffffffff8000f0f8>] bpf_arch_text_invalidate+0x62/0xac >> | [<ffffffff8016c538>] bpf_prog_pack_free+0x9c/0x1b2 >> | [<ffffffff8016c84a>] bpf_jit_binary_pack_free+0x20/0x4a >> | [<ffffffff8000f198>] bpf_jit_free+0x56/0x9e >> | [<ffffffff8016b43a>] bpf_prog_free_deferred+0x15a/0x182 >> | [<ffffffff800576c4>] process_one_work+0x1b6/0x3d6 >> | [<ffffffff80057d52>] worker_thread+0x84/0x378 >> | [<ffffffff8005fc2c>] kthread+0xe8/0x108 >> | [<ffffffff80003ffa>] ret_from_fork+0xe/0x20 >> >> I'm digging into that now, and I would appreciate if you could run the >> test_tag on VF2 or similar (I'm missing that HW). > > Sure, I will try to run this on the board. > I will rebase my series(+ the patch from arm64 series) on the latest > bpf-next tree and try to run it. Thank you! > Let me know if I need to add: > + select HAVE_EFFICIENT_UNALIGNED_ACCESS if MMU && 64BIT I usually run with that *on*, for better coverage. Björn
Björn Töpel <bjorn@kernel.org> writes: > Puranjay Mohan <puranjay12@gmail.com> writes: > >> BPF programs currently consume a page each on RISCV. For systems with many BPF >> programs, this adds significant pressure to instruction TLB. High iTLB pressure >> usually causes slow down for the whole system. >> >> Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. >> It packs multiple BPF programs into a single huge page. It is currently only >> enabled for the x86_64 BPF JIT. >> >> I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. >> >> This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. >> This series needs a patch[3] from the ARM64 series to work. >> >> ====================================================== >> Performance Analysis of prog pack allocator on RISCV64 >> ====================================================== >> >> Test setup: >> =========== >> >> Host machine: Debian GNU/Linux 11 (bullseye) >> Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1) >> u-boot-qemu Version: 2023.07+dfsg-1 >> opensbi Version: 1.3-1 >> >> To test the performance of the BPF prog pack allocator on RV, a stresser >> tool[4] linked below was built. This tool loads 8 BPF programs on the system and >> triggers 5 of them in an infinite loop by doing system calls. >> >> The runner script starts 20 instances of the above which loads 8*20=160 BPF >> programs on the system, 5*20=100 of which are being constantly triggered. >> The script is passed a command which would be run in the above environment. >> >> The script was run with following perf command: >> ./run.sh "perf stat -a \ >> -e iTLB-load-misses \ >> -e dTLB-load-misses \ >> -e dTLB-store-misses \ >> -e instructions \ >> --timeout 60000" >> >> The output of the above command is discussed below before and after enabling the >> BPF prog pack allocator. >> >> The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs >> was created using Bjorn's riscv-cross-builder[5] docker container linked below. >> >> Results >> ======= >> >> Before enabling prog pack allocator: >> ------------------------------------ >> >> Performance counter stats for 'system wide': >> >> 4939048 iTLB-load-misses >> 5468689 dTLB-load-misses >> 465234 dTLB-store-misses >> 1441082097998 instructions >> >> 60.045791200 seconds time elapsed >> >> After enabling prog pack allocator: >> ----------------------------------- >> >> Performance counter stats for 'system wide': >> >> 3430035 iTLB-load-misses >> 5008745 dTLB-load-misses >> 409944 dTLB-store-misses >> 1441535637988 instructions >> >> 60.046296600 seconds time elapsed >> >> Improvements in metrics >> ======================= >> >> It was expected that the iTLB-load-misses would decrease as now a single huge >> page is used to keep all the BPF programs compared to a single page for each >> program earlier. >> >> -------------------------------------------- >> The improvement in iTLB-load-misses: -30.5 % >> -------------------------------------------- >> >> I repeated this expriment more than 100 times in different setups and the >> improvement was always greater than 30%. >> >> This patch series is boot tested on the Starfive VisionFive 2 board[6]. >> The performance analysis was not done on the board because it doesn't >> expose iTLB-load-misses, etc. 
The stresser program was run on the board to test >> the loading and unloading of BPF programs >> >> [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/ >> [2] https://lore.kernel.org/all/20230626085811.3192402-1-puranjay12@gmail.com/ >> [3] https://lore.kernel.org/all/20230626085811.3192402-2-puranjay12@gmail.com/ >> [4] https://github.com/puranjaymohan/BPF-Allocator-Bench >> [5] https://github.com/bjoto/riscv-cross-builder >> [6] https://www.starfivetech.com/en/site/boards >> >> Puranjay Mohan (2): >> riscv: Extend patch_text_nosync() for multiple pages >> bpf, riscv: use prog pack allocator in the BPF JIT > > I get a hang for "test_tag", but it's not directly related to your > series, but rather "remote fence.i". > > | rcu: INFO: rcu_sched detected stalls on CPUs/tasks: > | rcu: 0-....: (1400 ticks this GP) idle=d5e4/1/0x4000000000000000 softirq=5542/5542 fqs=1862 > | rcu: (detected by 1, t=5252 jiffies, g=10253, q=195 ncpus=4) > | Task dump for CPU 0: > | task:kworker/0:5 state:R running task stack:0 pid:319 ppid:2 flags:0x00000008 > | Workqueue: events bpf_prog_free_deferred > | Call Trace: > | [<ffffffff80cbc444>] __schedule+0x2d0/0x940 > | watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [kworker/0:5:319] > | Modules linked in: nls_iso8859_1 drm fuse i2c_core drm_panel_orientation_quirks backlight dm_mod configfs ip_tables x_tables > | CPU: 0 PID: 319 Comm: kworker/0:5 Not tainted 6.5.0-rc5 #1 > | Hardware name: riscv-virtio,qemu (DT) > | Workqueue: events bpf_prog_free_deferred > | epc : __sbi_rfence_v02_call.isra.0+0x74/0x11a > | ra : __sbi_rfence_v02+0xda/0x1a4 > | epc : ffffffff8000ab4c ra : ffffffff8000accc sp : ff20000001c9bbd0 > | gp : ffffffff82078c48 tp : ff600000888e6a40 t0 : ff20000001c9bd44 > | t1 : 0000000000000000 t2 : 0000000000000040 s0 : ff20000001c9bbf0 > | s1 : 0000000000000010 a0 : 0000000000000000 a1 : 0000000000000000 > | a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000 > | a5 : 0000000000000000 a6 : 0000000000000000 a7 : 0000000052464e43 > | s2 : 000000000000ffff s3 : 00000000ffffffff s4 : ffffffff81667528 > | s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000 > | s8 : 0000000000000001 s9 : 0000000000000003 s10: 0000000000000040 > | s11: ffffffff8207d240 t3 : 000000000000000f t4 : 000000000000002a > | t5 : ff600000872df140 t6 : ffffffff81e26828 > | status: 0000000200000120 badaddr: 0000000000000000 cause: 8000000000000005 > | [<ffffffff8000ab4c>] __sbi_rfence_v02_call.isra.0+0x74/0x11a > | [<ffffffff8000accc>] __sbi_rfence_v02+0xda/0x1a4 > | [<ffffffff8000a886>] sbi_remote_fence_i+0x1e/0x26 > | [<ffffffff8000cee2>] flush_icache_all+0x1a/0x48 > | [<ffffffff80007736>] patch_text_nosync+0x6c/0x8c > | [<ffffffff8000f0f8>] bpf_arch_text_invalidate+0x62/0xac > | [<ffffffff8016c538>] bpf_prog_pack_free+0x9c/0x1b2 > | [<ffffffff8016c84a>] bpf_jit_binary_pack_free+0x20/0x4a > | [<ffffffff8000f198>] bpf_jit_free+0x56/0x9e > | [<ffffffff8016b43a>] bpf_prog_free_deferred+0x15a/0x182 > | [<ffffffff800576c4>] process_one_work+0x1b6/0x3d6 > | [<ffffffff80057d52>] worker_thread+0x84/0x378 > | [<ffffffff8005fc2c>] kthread+0xe8/0x108 > | [<ffffffff80003ffa>] ret_from_fork+0xe/0x20 > > I'm digging into that now, and I would appreciate if you could run the > test_tag on VF2 or similar (I'm missing that HW). > > It seems like we're hitting a bug with this series, so let's try to > figure out where the problems is, prior merging it. 
Hmm, it looks like the bpf_arch_text_invalidate() implementation is a
bit problematic:

+int bpf_arch_text_invalidate(void *dst, size_t len)
+{
+	__le32 *ptr;
+	int ret = 0;
+	u32 inval = 0;
+
+	for (ptr = dst; ret == 0 && len >= sizeof(u32); len -= sizeof(u32)) {
+		mutex_lock(&text_mutex);
+		ret = patch_text_nosync(ptr++, &inval, sizeof(u32));
+		mutex_unlock(&text_mutex);
+	}
+
+	return ret;
+}

Each patch_text_nosync() is a remote fence.i, and for a big "len", we'll
be flooded with remote fences. I think that's exactly what we hit with
"test_tag".

Björn
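[Editorial context: the per-call cost comes from patch_text_nosync() itself. Roughly, as of the 6.5-era arch/riscv/kernel/patch.c (shown here for context only, not part of this series), every call ends in an icache flush, and RISC-V has no ranged variant, so it becomes the SBI remote fence.i seen in the trace above.]

int patch_text_nosync(void *addr, const void *insns, size_t len)
{
	u32 *tp = addr;
	int ret;

	ret = patch_insn_write(tp, insns, len);	/* copy via the fixmap window */
	if (!ret)
		/* no ranged icache op on RISC-V: this ends up in
		 * flush_icache_all(), an IPI to every other hart */
		flush_icache_range((uintptr_t)tp, (uintptr_t)tp + len);

	return ret;
}

So the loop quoted above pays one all-hart fence.i broadcast per 32-bit word of the region being invalidated.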
On Mon, Aug 14, 2023 at 12:40 PM Björn Töpel <bjorn@kernel.org> wrote: > > Björn Töpel <bjorn@kernel.org> writes: > > > Puranjay Mohan <puranjay12@gmail.com> writes: > > > >> BPF programs currently consume a page each on RISCV. For systems with many BPF > >> programs, this adds significant pressure to instruction TLB. High iTLB pressure > >> usually causes slow down for the whole system. > >> > >> Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. > >> It packs multiple BPF programs into a single huge page. It is currently only > >> enabled for the x86_64 BPF JIT. > >> > >> I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. > >> > >> This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. > >> This series needs a patch[3] from the ARM64 series to work. > >> > >> ====================================================== > >> Performance Analysis of prog pack allocator on RISCV64 > >> ====================================================== > >> > >> Test setup: > >> =========== > >> > >> Host machine: Debian GNU/Linux 11 (bullseye) > >> Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1) > >> u-boot-qemu Version: 2023.07+dfsg-1 > >> opensbi Version: 1.3-1 > >> > >> To test the performance of the BPF prog pack allocator on RV, a stresser > >> tool[4] linked below was built. This tool loads 8 BPF programs on the system and > >> triggers 5 of them in an infinite loop by doing system calls. > >> > >> The runner script starts 20 instances of the above which loads 8*20=160 BPF > >> programs on the system, 5*20=100 of which are being constantly triggered. > >> The script is passed a command which would be run in the above environment. > >> > >> The script was run with following perf command: > >> ./run.sh "perf stat -a \ > >> -e iTLB-load-misses \ > >> -e dTLB-load-misses \ > >> -e dTLB-store-misses \ > >> -e instructions \ > >> --timeout 60000" > >> > >> The output of the above command is discussed below before and after enabling the > >> BPF prog pack allocator. > >> > >> The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs > >> was created using Bjorn's riscv-cross-builder[5] docker container linked below. > >> > >> Results > >> ======= > >> > >> Before enabling prog pack allocator: > >> ------------------------------------ > >> > >> Performance counter stats for 'system wide': > >> > >> 4939048 iTLB-load-misses > >> 5468689 dTLB-load-misses > >> 465234 dTLB-store-misses > >> 1441082097998 instructions > >> > >> 60.045791200 seconds time elapsed > >> > >> After enabling prog pack allocator: > >> ----------------------------------- > >> > >> Performance counter stats for 'system wide': > >> > >> 3430035 iTLB-load-misses > >> 5008745 dTLB-load-misses > >> 409944 dTLB-store-misses > >> 1441535637988 instructions > >> > >> 60.046296600 seconds time elapsed > >> > >> Improvements in metrics > >> ======================= > >> > >> It was expected that the iTLB-load-misses would decrease as now a single huge > >> page is used to keep all the BPF programs compared to a single page for each > >> program earlier. > >> > >> -------------------------------------------- > >> The improvement in iTLB-load-misses: -30.5 % > >> -------------------------------------------- > >> > >> I repeated this expriment more than 100 times in different setups and the > >> improvement was always greater than 30%. > >> > >> This patch series is boot tested on the Starfive VisionFive 2 board[6]. 
> >> The performance analysis was not done on the board because it doesn't > >> expose iTLB-load-misses, etc. The stresser program was run on the board to test > >> the loading and unloading of BPF programs > >> > >> [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/ > >> [2] https://lore.kernel.org/all/20230626085811.3192402-1-puranjay12@gmail.com/ > >> [3] https://lore.kernel.org/all/20230626085811.3192402-2-puranjay12@gmail.com/ > >> [4] https://github.com/puranjaymohan/BPF-Allocator-Bench > >> [5] https://github.com/bjoto/riscv-cross-builder > >> [6] https://www.starfivetech.com/en/site/boards > >> > >> Puranjay Mohan (2): > >> riscv: Extend patch_text_nosync() for multiple pages > >> bpf, riscv: use prog pack allocator in the BPF JIT > > > > I get a hang for "test_tag", but it's not directly related to your > > series, but rather "remote fence.i". > > > > | rcu: INFO: rcu_sched detected stalls on CPUs/tasks: > > | rcu: 0-....: (1400 ticks this GP) idle=d5e4/1/0x4000000000000000 softirq=5542/5542 fqs=1862 > > | rcu: (detected by 1, t=5252 jiffies, g=10253, q=195 ncpus=4) > > | Task dump for CPU 0: > > | task:kworker/0:5 state:R running task stack:0 pid:319 ppid:2 flags:0x00000008 > > | Workqueue: events bpf_prog_free_deferred > > | Call Trace: > > | [<ffffffff80cbc444>] __schedule+0x2d0/0x940 > > | watchdog: BUG: soft lockup - CPU#0 stuck for 21s! [kworker/0:5:319] > > | Modules linked in: nls_iso8859_1 drm fuse i2c_core drm_panel_orientation_quirks backlight dm_mod configfs ip_tables x_tables > > | CPU: 0 PID: 319 Comm: kworker/0:5 Not tainted 6.5.0-rc5 #1 > > | Hardware name: riscv-virtio,qemu (DT) > > | Workqueue: events bpf_prog_free_deferred > > | epc : __sbi_rfence_v02_call.isra.0+0x74/0x11a > > | ra : __sbi_rfence_v02+0xda/0x1a4 > > | epc : ffffffff8000ab4c ra : ffffffff8000accc sp : ff20000001c9bbd0 > > | gp : ffffffff82078c48 tp : ff600000888e6a40 t0 : ff20000001c9bd44 > > | t1 : 0000000000000000 t2 : 0000000000000040 s0 : ff20000001c9bbf0 > > | s1 : 0000000000000010 a0 : 0000000000000000 a1 : 0000000000000000 > > | a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000 > > | a5 : 0000000000000000 a6 : 0000000000000000 a7 : 0000000052464e43 > > | s2 : 000000000000ffff s3 : 00000000ffffffff s4 : ffffffff81667528 > > | s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000 > > | s8 : 0000000000000001 s9 : 0000000000000003 s10: 0000000000000040 > > | s11: ffffffff8207d240 t3 : 000000000000000f t4 : 000000000000002a > > | t5 : ff600000872df140 t6 : ffffffff81e26828 > > | status: 0000000200000120 badaddr: 0000000000000000 cause: 8000000000000005 > > | [<ffffffff8000ab4c>] __sbi_rfence_v02_call.isra.0+0x74/0x11a > > | [<ffffffff8000accc>] __sbi_rfence_v02+0xda/0x1a4 > > | [<ffffffff8000a886>] sbi_remote_fence_i+0x1e/0x26 > > | [<ffffffff8000cee2>] flush_icache_all+0x1a/0x48 > > | [<ffffffff80007736>] patch_text_nosync+0x6c/0x8c > > | [<ffffffff8000f0f8>] bpf_arch_text_invalidate+0x62/0xac > > | [<ffffffff8016c538>] bpf_prog_pack_free+0x9c/0x1b2 > > | [<ffffffff8016c84a>] bpf_jit_binary_pack_free+0x20/0x4a > > | [<ffffffff8000f198>] bpf_jit_free+0x56/0x9e > > | [<ffffffff8016b43a>] bpf_prog_free_deferred+0x15a/0x182 > > | [<ffffffff800576c4>] process_one_work+0x1b6/0x3d6 > > | [<ffffffff80057d52>] worker_thread+0x84/0x378 > > | [<ffffffff8005fc2c>] kthread+0xe8/0x108 > > | [<ffffffff80003ffa>] ret_from_fork+0xe/0x20 > > > > I'm digging into that now, and I would appreciate if you could run the > > test_tag on VF2 or similar (I'm missing 
that HW). > > > > It seems like we're hitting a bug with this series, so let's try to > > figure out where the problems is, prior merging it. > > Hmm, it looks like the bpf_arch_text_invalidate() implementation is a > bit problematic: > > +int bpf_arch_text_invalidate(void *dst, size_t len) > +{ > + __le32 *ptr; > + int ret = 0; > + u32 inval = 0; > + > + for (ptr = dst; ret == 0 && len >= sizeof(u32); len -= sizeof(u32)) { > + mutex_lock(&text_mutex); > + ret = patch_text_nosync(ptr++, &inval, sizeof(u32)); > + mutex_unlock(&text_mutex); > + } > + > + return ret; > +} > > Each patch_text_nosync() is a remote fence.i, and for a big "len", we'll > be flooded with remote fences. I understand this now, thanks for debugging this. We are calling patch_text_nosync() for each word (u32) which calls flush_icache_range() and therefore "fence.i" is inserted after every word. I still don't fully understand how it causes this bug because I lack the prerequisite knowledge about test_tag and what the failing test is doing. But to solve this issue we would need a function like the x86 text_poke_set() that will only insert a single "fence.i" after setting the whole memory area. This can be done by implementing a wrapper around patch_insn_write() which would set the memory area and at the end call flush_icache_range(). Something like: void *text_set_nosync(void *dst, int c, size_t len) { __le32 *ptr; int ret = 0; for (ptr = dst; ret == 0 && len >= sizeof(u32); len -= sizeof(u32)) { ret = patch_insn_write(ptr++, &c, sizeof(u32)); } if(!ret) flush_icache_range((uintptr_t) dst, (uintptr_t) dst + len); return ret; } Let me know if this looks correct or we need more details here. I will then send v2 with this implemented as a separate patch. > > I think that's exactly what we hit with "test_tag". > > > Björn Thanks, Puranjay
Puranjay Mohan <puranjay12@gmail.com> writes: > On Mon, Aug 14, 2023 at 12:40 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> Björn Töpel <bjorn@kernel.org> writes: >> >> > Puranjay Mohan <puranjay12@gmail.com> writes: >> > >> >> BPF programs currently consume a page each on RISCV. For systems with many BPF >> >> programs, this adds significant pressure to instruction TLB. High iTLB pressure >> >> usually causes slow down for the whole system. >> >> >> >> Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. >> >> It packs multiple BPF programs into a single huge page. It is currently only >> >> enabled for the x86_64 BPF JIT. >> >> >> >> I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. >> >> >> >> This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. >> >> This series needs a patch[3] from the ARM64 series to work. >> >> >> >> ====================================================== >> >> Performance Analysis of prog pack allocator on RISCV64 >> >> ====================================================== >> >> >> >> Test setup: >> >> =========== >> >> >> >> Host machine: Debian GNU/Linux 11 (bullseye) >> >> Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1) >> >> u-boot-qemu Version: 2023.07+dfsg-1 >> >> opensbi Version: 1.3-1 >> >> >> >> To test the performance of the BPF prog pack allocator on RV, a stresser >> >> tool[4] linked below was built. This tool loads 8 BPF programs on the system and >> >> triggers 5 of them in an infinite loop by doing system calls. >> >> >> >> The runner script starts 20 instances of the above which loads 8*20=160 BPF >> >> programs on the system, 5*20=100 of which are being constantly triggered. >> >> The script is passed a command which would be run in the above environment. >> >> >> >> The script was run with following perf command: >> >> ./run.sh "perf stat -a \ >> >> -e iTLB-load-misses \ >> >> -e dTLB-load-misses \ >> >> -e dTLB-store-misses \ >> >> -e instructions \ >> >> --timeout 60000" >> >> >> >> The output of the above command is discussed below before and after enabling the >> >> BPF prog pack allocator. >> >> >> >> The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs >> >> was created using Bjorn's riscv-cross-builder[5] docker container linked below. >> >> >> >> Results >> >> ======= >> >> >> >> Before enabling prog pack allocator: >> >> ------------------------------------ >> >> >> >> Performance counter stats for 'system wide': >> >> >> >> 4939048 iTLB-load-misses >> >> 5468689 dTLB-load-misses >> >> 465234 dTLB-store-misses >> >> 1441082097998 instructions >> >> >> >> 60.045791200 seconds time elapsed >> >> >> >> After enabling prog pack allocator: >> >> ----------------------------------- >> >> >> >> Performance counter stats for 'system wide': >> >> >> >> 3430035 iTLB-load-misses >> >> 5008745 dTLB-load-misses >> >> 409944 dTLB-store-misses >> >> 1441535637988 instructions >> >> >> >> 60.046296600 seconds time elapsed >> >> >> >> Improvements in metrics >> >> ======================= >> >> >> >> It was expected that the iTLB-load-misses would decrease as now a single huge >> >> page is used to keep all the BPF programs compared to a single page for each >> >> program earlier. 
>> >> >> >> -------------------------------------------- >> >> The improvement in iTLB-load-misses: -30.5 % >> >> -------------------------------------------- >> >> >> >> I repeated this expriment more than 100 times in different setups and the >> >> improvement was always greater than 30%. >> >> >> >> This patch series is boot tested on the Starfive VisionFive 2 board[6]. >> >> The performance analysis was not done on the board because it doesn't >> >> expose iTLB-load-misses, etc. The stresser program was run on the board to test >> >> the loading and unloading of BPF programs >> >> >> >> [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/ >> >> [2] https://lore.kernel.org/all/20230626085811.3192402-1-puranjay12@gmail.com/ >> >> [3] https://lore.kernel.org/all/20230626085811.3192402-2-puranjay12@gmail.com/ >> >> [4] https://github.com/puranjaymohan/BPF-Allocator-Bench >> >> [5] https://github.com/bjoto/riscv-cross-builder >> >> [6] https://www.starfivetech.com/en/site/boards >> >> >> >> Puranjay Mohan (2): >> >> riscv: Extend patch_text_nosync() for multiple pages >> >> bpf, riscv: use prog pack allocator in the BPF JIT >> > >> > I get a hang for "test_tag", but it's not directly related to your >> > series, but rather "remote fence.i". >> > >> > | rcu: INFO: rcu_sched detected stalls on CPUs/tasks: >> > | rcu: 0-....: (1400 ticks this GP) idle=d5e4/1/0x4000000000000000 softirq=5542/5542 fqs=1862 >> > | rcu: (detected by 1, t=5252 jiffies, g=10253, q=195 ncpus=4) >> > | Task dump for CPU 0: >> > | task:kworker/0:5 state:R running task stack:0 pid:319 ppid:2 flags:0x00000008 >> > | Workqueue: events bpf_prog_free_deferred >> > | Call Trace: >> > | [<ffffffff80cbc444>] __schedule+0x2d0/0x940 >> > | watchdog: BUG: soft lockup - CPU#0 stuck for 21s! 
[kworker/0:5:319] >> > | Modules linked in: nls_iso8859_1 drm fuse i2c_core drm_panel_orientation_quirks backlight dm_mod configfs ip_tables x_tables >> > | CPU: 0 PID: 319 Comm: kworker/0:5 Not tainted 6.5.0-rc5 #1 >> > | Hardware name: riscv-virtio,qemu (DT) >> > | Workqueue: events bpf_prog_free_deferred >> > | epc : __sbi_rfence_v02_call.isra.0+0x74/0x11a >> > | ra : __sbi_rfence_v02+0xda/0x1a4 >> > | epc : ffffffff8000ab4c ra : ffffffff8000accc sp : ff20000001c9bbd0 >> > | gp : ffffffff82078c48 tp : ff600000888e6a40 t0 : ff20000001c9bd44 >> > | t1 : 0000000000000000 t2 : 0000000000000040 s0 : ff20000001c9bbf0 >> > | s1 : 0000000000000010 a0 : 0000000000000000 a1 : 0000000000000000 >> > | a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000 >> > | a5 : 0000000000000000 a6 : 0000000000000000 a7 : 0000000052464e43 >> > | s2 : 000000000000ffff s3 : 00000000ffffffff s4 : ffffffff81667528 >> > | s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000 >> > | s8 : 0000000000000001 s9 : 0000000000000003 s10: 0000000000000040 >> > | s11: ffffffff8207d240 t3 : 000000000000000f t4 : 000000000000002a >> > | t5 : ff600000872df140 t6 : ffffffff81e26828 >> > | status: 0000000200000120 badaddr: 0000000000000000 cause: 8000000000000005 >> > | [<ffffffff8000ab4c>] __sbi_rfence_v02_call.isra.0+0x74/0x11a >> > | [<ffffffff8000accc>] __sbi_rfence_v02+0xda/0x1a4 >> > | [<ffffffff8000a886>] sbi_remote_fence_i+0x1e/0x26 >> > | [<ffffffff8000cee2>] flush_icache_all+0x1a/0x48 >> > | [<ffffffff80007736>] patch_text_nosync+0x6c/0x8c >> > | [<ffffffff8000f0f8>] bpf_arch_text_invalidate+0x62/0xac >> > | [<ffffffff8016c538>] bpf_prog_pack_free+0x9c/0x1b2 >> > | [<ffffffff8016c84a>] bpf_jit_binary_pack_free+0x20/0x4a >> > | [<ffffffff8000f198>] bpf_jit_free+0x56/0x9e >> > | [<ffffffff8016b43a>] bpf_prog_free_deferred+0x15a/0x182 >> > | [<ffffffff800576c4>] process_one_work+0x1b6/0x3d6 >> > | [<ffffffff80057d52>] worker_thread+0x84/0x378 >> > | [<ffffffff8005fc2c>] kthread+0xe8/0x108 >> > | [<ffffffff80003ffa>] ret_from_fork+0xe/0x20 >> > >> > I'm digging into that now, and I would appreciate if you could run the >> > test_tag on VF2 or similar (I'm missing that HW). >> > >> > It seems like we're hitting a bug with this series, so let's try to >> > figure out where the problems is, prior merging it. >> >> Hmm, it looks like the bpf_arch_text_invalidate() implementation is a >> bit problematic: >> >> +int bpf_arch_text_invalidate(void *dst, size_t len) >> +{ >> + __le32 *ptr; >> + int ret = 0; >> + u32 inval = 0; >> + >> + for (ptr = dst; ret == 0 && len >= sizeof(u32); len -= sizeof(u32)) { >> + mutex_lock(&text_mutex); >> + ret = patch_text_nosync(ptr++, &inval, sizeof(u32)); >> + mutex_unlock(&text_mutex); >> + } >> + >> + return ret; >> +} >> >> Each patch_text_nosync() is a remote fence.i, and for a big "len", we'll >> be flooded with remote fences. > > I understand this now, thanks for debugging this. > > We are calling patch_text_nosync() for each word (u32) which calls > flush_icache_range() and therefore "fence.i" is inserted after every > word. But more importantly, it does a remote fence.i (which is an IPI to all cores). > I still don't fully understand how it causes this bug because I lack > the prerequisite > knowledge about test_tag and what the failing test is doing. The test_tag is part of kselftest/bpf: tools/testing/selftests/bpf/test_tag.c TL;DR: it generates a bunch of programs, where some have a length of, e.g, 41024. 
bpf_arch_text_invalidate() does ~10k of remote fences in that case. > But to solve this issue we would need a function like the x86 > text_poke_set() that will only > insert a single "fence.i" after setting the whole memory area. This > can be done by > implementing a wrapper around patch_insn_write() which would set the memory area > and at the end call flush_icache_range(). > > Something like: > > void *text_set_nosync(void *dst, int c, size_t len) > { > __le32 *ptr; > int ret = 0; > > for (ptr = dst; ret == 0 && len >= sizeof(u32); len -= sizeof(u32)) { > ret = patch_insn_write(ptr++, &c, sizeof(u32)); > } > if(!ret) > flush_icache_range((uintptr_t) dst, (uintptr_t) dst + len); > > return ret; > } > > Let me know if this looks correct or we need more details here. > I will then send v2 with this implemented as a separate patch. Can't we do better here? Perhaps a similar pattern like the 2 page fill? Otherwise we'll have a bunch of fixmap updates as well. I'd keep the patch_ prefix in the name for consistency. Please measure the runtime of test_tag pre/after the change. I don't know if your arm64 work has similar problems? Björn
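[Editorial sketch of the shape Björn asks for above: a patch_-prefixed setter that does the whole fill first and issues a single fence at the end. patch_insn_set() is a hypothetical helper, assumed to memset the text mapping through the same two-page fixmap window that patch_insn_write() uses, with no icache maintenance of its own.]

#include <asm/cacheflush.h>
#include <asm/patch.h>

/* Sketch only, not the series' actual patch. */
int patch_text_set_nosync(void *addr, u8 c, size_t len)
{
	int ret;

	/* hypothetical memset-style twin of patch_insn_write() */
	ret = patch_insn_set(addr, c, len);
	if (!ret)
		/* one fence.i broadcast for the whole range */
		flush_icache_range((uintptr_t)addr, (uintptr_t)addr + len);

	return ret;
}

bpf_arch_text_invalidate() could then be a thin wrapper that takes text_mutex and calls this once with c = 0, instead of once per word.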
Hi Björn, On Mon, Aug 14, 2023 at 4:29 PM Björn Töpel <bjorn@kernel.org> wrote: > > Puranjay Mohan <puranjay12@gmail.com> writes: > > > On Mon, Aug 14, 2023 at 12:40 PM Björn Töpel <bjorn@kernel.org> wrote: > >> > >> Björn Töpel <bjorn@kernel.org> writes: > >> > >> > Puranjay Mohan <puranjay12@gmail.com> writes: > >> > > >> >> BPF programs currently consume a page each on RISCV. For systems with many BPF > >> >> programs, this adds significant pressure to instruction TLB. High iTLB pressure > >> >> usually causes slow down for the whole system. > >> >> > >> >> Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. > >> >> It packs multiple BPF programs into a single huge page. It is currently only > >> >> enabled for the x86_64 BPF JIT. > >> >> > >> >> I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. > >> >> > >> >> This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. > >> >> This series needs a patch[3] from the ARM64 series to work. > >> >> > >> >> ====================================================== > >> >> Performance Analysis of prog pack allocator on RISCV64 > >> >> ====================================================== > >> >> > >> >> Test setup: > >> >> =========== > >> >> > >> >> Host machine: Debian GNU/Linux 11 (bullseye) > >> >> Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1) > >> >> u-boot-qemu Version: 2023.07+dfsg-1 > >> >> opensbi Version: 1.3-1 > >> >> > >> >> To test the performance of the BPF prog pack allocator on RV, a stresser > >> >> tool[4] linked below was built. This tool loads 8 BPF programs on the system and > >> >> triggers 5 of them in an infinite loop by doing system calls. > >> >> > >> >> The runner script starts 20 instances of the above which loads 8*20=160 BPF > >> >> programs on the system, 5*20=100 of which are being constantly triggered. > >> >> The script is passed a command which would be run in the above environment. > >> >> > >> >> The script was run with following perf command: > >> >> ./run.sh "perf stat -a \ > >> >> -e iTLB-load-misses \ > >> >> -e dTLB-load-misses \ > >> >> -e dTLB-store-misses \ > >> >> -e instructions \ > >> >> --timeout 60000" > >> >> > >> >> The output of the above command is discussed below before and after enabling the > >> >> BPF prog pack allocator. > >> >> > >> >> The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs > >> >> was created using Bjorn's riscv-cross-builder[5] docker container linked below. > >> >> > >> >> Results > >> >> ======= > >> >> > >> >> Before enabling prog pack allocator: > >> >> ------------------------------------ > >> >> > >> >> Performance counter stats for 'system wide': > >> >> > >> >> 4939048 iTLB-load-misses > >> >> 5468689 dTLB-load-misses > >> >> 465234 dTLB-store-misses > >> >> 1441082097998 instructions > >> >> > >> >> 60.045791200 seconds time elapsed > >> >> > >> >> After enabling prog pack allocator: > >> >> ----------------------------------- > >> >> > >> >> Performance counter stats for 'system wide': > >> >> > >> >> 3430035 iTLB-load-misses > >> >> 5008745 dTLB-load-misses > >> >> 409944 dTLB-store-misses > >> >> 1441535637988 instructions > >> >> > >> >> 60.046296600 seconds time elapsed > >> >> > >> >> Improvements in metrics > >> >> ======================= > >> >> > >> >> It was expected that the iTLB-load-misses would decrease as now a single huge > >> >> page is used to keep all the BPF programs compared to a single page for each > >> >> program earlier. 
> >> >> > >> >> -------------------------------------------- > >> >> The improvement in iTLB-load-misses: -30.5 % > >> >> -------------------------------------------- > >> >> > >> >> I repeated this expriment more than 100 times in different setups and the > >> >> improvement was always greater than 30%. > >> >> > >> >> This patch series is boot tested on the Starfive VisionFive 2 board[6]. > >> >> The performance analysis was not done on the board because it doesn't > >> >> expose iTLB-load-misses, etc. The stresser program was run on the board to test > >> >> the loading and unloading of BPF programs > >> >> > >> >> [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/ > >> >> [2] https://lore.kernel.org/all/20230626085811.3192402-1-puranjay12@gmail.com/ > >> >> [3] https://lore.kernel.org/all/20230626085811.3192402-2-puranjay12@gmail.com/ > >> >> [4] https://github.com/puranjaymohan/BPF-Allocator-Bench > >> >> [5] https://github.com/bjoto/riscv-cross-builder > >> >> [6] https://www.starfivetech.com/en/site/boards > >> >> > >> >> Puranjay Mohan (2): > >> >> riscv: Extend patch_text_nosync() for multiple pages > >> >> bpf, riscv: use prog pack allocator in the BPF JIT > >> > > >> > I get a hang for "test_tag", but it's not directly related to your > >> > series, but rather "remote fence.i". > >> > > >> > | rcu: INFO: rcu_sched detected stalls on CPUs/tasks: > >> > | rcu: 0-....: (1400 ticks this GP) idle=d5e4/1/0x4000000000000000 softirq=5542/5542 fqs=1862 > >> > | rcu: (detected by 1, t=5252 jiffies, g=10253, q=195 ncpus=4) > >> > | Task dump for CPU 0: > >> > | task:kworker/0:5 state:R running task stack:0 pid:319 ppid:2 flags:0x00000008 > >> > | Workqueue: events bpf_prog_free_deferred > >> > | Call Trace: > >> > | [<ffffffff80cbc444>] __schedule+0x2d0/0x940 > >> > | watchdog: BUG: soft lockup - CPU#0 stuck for 21s! 
[kworker/0:5:319] > >> > | Modules linked in: nls_iso8859_1 drm fuse i2c_core drm_panel_orientation_quirks backlight dm_mod configfs ip_tables x_tables > >> > | CPU: 0 PID: 319 Comm: kworker/0:5 Not tainted 6.5.0-rc5 #1 > >> > | Hardware name: riscv-virtio,qemu (DT) > >> > | Workqueue: events bpf_prog_free_deferred > >> > | epc : __sbi_rfence_v02_call.isra.0+0x74/0x11a > >> > | ra : __sbi_rfence_v02+0xda/0x1a4 > >> > | epc : ffffffff8000ab4c ra : ffffffff8000accc sp : ff20000001c9bbd0 > >> > | gp : ffffffff82078c48 tp : ff600000888e6a40 t0 : ff20000001c9bd44 > >> > | t1 : 0000000000000000 t2 : 0000000000000040 s0 : ff20000001c9bbf0 > >> > | s1 : 0000000000000010 a0 : 0000000000000000 a1 : 0000000000000000 > >> > | a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000 > >> > | a5 : 0000000000000000 a6 : 0000000000000000 a7 : 0000000052464e43 > >> > | s2 : 000000000000ffff s3 : 00000000ffffffff s4 : ffffffff81667528 > >> > | s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000 > >> > | s8 : 0000000000000001 s9 : 0000000000000003 s10: 0000000000000040 > >> > | s11: ffffffff8207d240 t3 : 000000000000000f t4 : 000000000000002a > >> > | t5 : ff600000872df140 t6 : ffffffff81e26828 > >> > | status: 0000000200000120 badaddr: 0000000000000000 cause: 8000000000000005 > >> > | [<ffffffff8000ab4c>] __sbi_rfence_v02_call.isra.0+0x74/0x11a > >> > | [<ffffffff8000accc>] __sbi_rfence_v02+0xda/0x1a4 > >> > | [<ffffffff8000a886>] sbi_remote_fence_i+0x1e/0x26 > >> > | [<ffffffff8000cee2>] flush_icache_all+0x1a/0x48 > >> > | [<ffffffff80007736>] patch_text_nosync+0x6c/0x8c > >> > | [<ffffffff8000f0f8>] bpf_arch_text_invalidate+0x62/0xac > >> > | [<ffffffff8016c538>] bpf_prog_pack_free+0x9c/0x1b2 > >> > | [<ffffffff8016c84a>] bpf_jit_binary_pack_free+0x20/0x4a > >> > | [<ffffffff8000f198>] bpf_jit_free+0x56/0x9e > >> > | [<ffffffff8016b43a>] bpf_prog_free_deferred+0x15a/0x182 > >> > | [<ffffffff800576c4>] process_one_work+0x1b6/0x3d6 > >> > | [<ffffffff80057d52>] worker_thread+0x84/0x378 > >> > | [<ffffffff8005fc2c>] kthread+0xe8/0x108 > >> > | [<ffffffff80003ffa>] ret_from_fork+0xe/0x20 > >> > > >> > I'm digging into that now, and I would appreciate if you could run the > >> > test_tag on VF2 or similar (I'm missing that HW). > >> > > >> > It seems like we're hitting a bug with this series, so let's try to > >> > figure out where the problems is, prior merging it. > >> > >> Hmm, it looks like the bpf_arch_text_invalidate() implementation is a > >> bit problematic: > >> > >> +int bpf_arch_text_invalidate(void *dst, size_t len) > >> +{ > >> + __le32 *ptr; > >> + int ret = 0; > >> + u32 inval = 0; > >> + > >> + for (ptr = dst; ret == 0 && len >= sizeof(u32); len -= sizeof(u32)) { > >> + mutex_lock(&text_mutex); > >> + ret = patch_text_nosync(ptr++, &inval, sizeof(u32)); > >> + mutex_unlock(&text_mutex); > >> + } > >> + > >> + return ret; > >> +} > >> > >> Each patch_text_nosync() is a remote fence.i, and for a big "len", we'll > >> be flooded with remote fences. > > > > I understand this now, thanks for debugging this. > > > > We are calling patch_text_nosync() for each word (u32) which calls > > flush_icache_range() and therefore "fence.i" is inserted after every > > word. > > But more importantly, it does a remote fence.i (which is an IPI to all > cores). > > > I still don't fully understand how it causes this bug because I lack > > the prerequisite > > knowledge about test_tag and what the failing test is doing. 
> > The test_tag is part of kselftest/bpf: > tools/testing/selftests/bpf/test_tag.c > > TL;DR: it generates a bunch of programs, where some have a length of, > e.g, 41024. bpf_arch_text_invalidate() does ~10k of remote fences in > that case. > > > But to solve this issue we would need a function like the x86 > > text_poke_set() that will only > > insert a single "fence.i" after setting the whole memory area. This > > can be done by > > implementing a wrapper around patch_insn_write() which would set the memory area > > and at the end call flush_icache_range(). > > > > Something like: > > > > void *text_set_nosync(void *dst, int c, size_t len) > > { > > __le32 *ptr; > > int ret = 0; > > > > for (ptr = dst; ret == 0 && len >= sizeof(u32); len -= sizeof(u32)) { > > ret = patch_insn_write(ptr++, &c, sizeof(u32)); > > } > > if(!ret) > > flush_icache_range((uintptr_t) dst, (uintptr_t) dst + len); > > > > return ret; > > } > > > > Let me know if this looks correct or we need more details here. > > I will then send v2 with this implemented as a separate patch. > > Can't we do better here? Perhaps a similar pattern like the 2 page fill? > Otherwise we'll have a bunch of fixmap updates as well. I agree that we can make it more efficient by first filling a RW buffer with the value using a normal memset() and then copying that buffer to the RO area using patch_insn_write(). That would solve both problems. Or we implement a new function like patch_insn_write() that does the 2-page map and set explicitly. Which approach would you prefer? 1) A wrapper around patch_insn_write() that first memsets a RW buffer and then copies the complete RW buffer to the RO area with a single patch_insn_write() call for the whole len. 2) A new function like patch_insn_write() that takes dst, src, and len, maps dst 2 pages at a time, and sets it to *src in a loop. > > I'd keep the patch_ prefix in the name for consistency. Please measure > the runtime of test_tag pre/after the change. test_tag currently wouldn't even complete with the current version of the patch, right? > > I don't know if your arm64 work has similar problems? Thanks for bringing this up. I will revisit that and verify whether test_tag passes there. There, too, bpf_arch_text_invalidate() calls aarch64_insn_patch_text_nosync() in a loop, which in turn calls caches_clean_inval_pou(), so I might see similar issues. I think https://github.com/kernel-patches doesn't run test_tag, hence I might have missed it. > > > Björn Thanks, Puranjay
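A minimal corrected sketch of the text_set_nosync() idea quoted above, assuming it sits next to patch_text_nosync() in arch/riscv/kernel/patch.c where the static patch_insn_write() is visible; the name patch_text_set_nosync() and the u8 fill parameter are illustrative only and not part of the posted series. Compared with the quoted snippet, it returns int and keeps len intact so the final flush_icache_range() covers the whole range; compared with the current bpf_arch_text_invalidate(), that means one remote fence.i instead of one per word:

/*
 * Memset-like text patching: write the fill pattern word by word through
 * patch_insn_write() and flush the icache once at the end. Like
 * patch_text_nosync(), the caller is expected to hold text_mutex.
 */
int patch_text_set_nosync(void *dst, u8 c, size_t len)
{
	size_t patched = 0;
	int ret = 0;
	u32 val;

	/* Build one 32-bit word of the fill pattern. */
	memset(&val, c, sizeof(val));

	while (ret == 0 && len - patched >= sizeof(val)) {
		ret = patch_insn_write(dst + patched, &val, sizeof(val));
		patched += sizeof(val);
	}

	/* A single fence.i for the whole range instead of one per word. */
	if (!ret)
		flush_icache_range((uintptr_t)dst, (uintptr_t)dst + len);

	return ret;
}

Note that this still remaps the text-poke fixmap for every 32-bit word inside patch_insn_write(), which is the remaining cost the two-page variant discussed below avoids.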
Puranjay Mohan <puranjay12@gmail.com> writes: > Hi Björn, > > On Mon, Aug 14, 2023 at 4:29 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> Puranjay Mohan <puranjay12@gmail.com> writes: >> >> > On Mon, Aug 14, 2023 at 12:40 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> >> >> Björn Töpel <bjorn@kernel.org> writes: >> >> >> >> > Puranjay Mohan <puranjay12@gmail.com> writes: >> >> > >> >> >> BPF programs currently consume a page each on RISCV. For systems with many BPF >> >> >> programs, this adds significant pressure to instruction TLB. High iTLB pressure >> >> >> usually causes slow down for the whole system. >> >> >> >> >> >> Song Liu introduced the BPF prog pack allocator[1] to mitigate the above issue. >> >> >> It packs multiple BPF programs into a single huge page. It is currently only >> >> >> enabled for the x86_64 BPF JIT. >> >> >> >> >> >> I enabled this allocator on the ARM64 BPF JIT[2]. It is being reviewed now. >> >> >> >> >> >> This patch series enables the BPF prog pack allocator for the RISCV BPF JIT. >> >> >> This series needs a patch[3] from the ARM64 series to work. >> >> >> >> >> >> ====================================================== >> >> >> Performance Analysis of prog pack allocator on RISCV64 >> >> >> ====================================================== >> >> >> >> >> >> Test setup: >> >> >> =========== >> >> >> >> >> >> Host machine: Debian GNU/Linux 11 (bullseye) >> >> >> Qemu Version: QEMU emulator version 8.0.3 (Debian 1:8.0.3+dfsg-1) >> >> >> u-boot-qemu Version: 2023.07+dfsg-1 >> >> >> opensbi Version: 1.3-1 >> >> >> >> >> >> To test the performance of the BPF prog pack allocator on RV, a stresser >> >> >> tool[4] linked below was built. This tool loads 8 BPF programs on the system and >> >> >> triggers 5 of them in an infinite loop by doing system calls. >> >> >> >> >> >> The runner script starts 20 instances of the above which loads 8*20=160 BPF >> >> >> programs on the system, 5*20=100 of which are being constantly triggered. >> >> >> The script is passed a command which would be run in the above environment. >> >> >> >> >> >> The script was run with following perf command: >> >> >> ./run.sh "perf stat -a \ >> >> >> -e iTLB-load-misses \ >> >> >> -e dTLB-load-misses \ >> >> >> -e dTLB-store-misses \ >> >> >> -e instructions \ >> >> >> --timeout 60000" >> >> >> >> >> >> The output of the above command is discussed below before and after enabling the >> >> >> BPF prog pack allocator. >> >> >> >> >> >> The tests were run on qemu-system-riscv64 with 8 cpus, 16G memory. The rootfs >> >> >> was created using Bjorn's riscv-cross-builder[5] docker container linked below. 
>> >> >> >> >> >> Results >> >> >> ======= >> >> >> >> >> >> Before enabling prog pack allocator: >> >> >> ------------------------------------ >> >> >> >> >> >> Performance counter stats for 'system wide': >> >> >> >> >> >> 4939048 iTLB-load-misses >> >> >> 5468689 dTLB-load-misses >> >> >> 465234 dTLB-store-misses >> >> >> 1441082097998 instructions >> >> >> >> >> >> 60.045791200 seconds time elapsed >> >> >> >> >> >> After enabling prog pack allocator: >> >> >> ----------------------------------- >> >> >> >> >> >> Performance counter stats for 'system wide': >> >> >> >> >> >> 3430035 iTLB-load-misses >> >> >> 5008745 dTLB-load-misses >> >> >> 409944 dTLB-store-misses >> >> >> 1441535637988 instructions >> >> >> >> >> >> 60.046296600 seconds time elapsed >> >> >> >> >> >> Improvements in metrics >> >> >> ======================= >> >> >> >> >> >> It was expected that the iTLB-load-misses would decrease as now a single huge >> >> >> page is used to keep all the BPF programs compared to a single page for each >> >> >> program earlier. >> >> >> >> >> >> -------------------------------------------- >> >> >> The improvement in iTLB-load-misses: -30.5 % >> >> >> -------------------------------------------- >> >> >> >> >> >> I repeated this expriment more than 100 times in different setups and the >> >> >> improvement was always greater than 30%. >> >> >> >> >> >> This patch series is boot tested on the Starfive VisionFive 2 board[6]. >> >> >> The performance analysis was not done on the board because it doesn't >> >> >> expose iTLB-load-misses, etc. The stresser program was run on the board to test >> >> >> the loading and unloading of BPF programs >> >> >> >> >> >> [1] https://lore.kernel.org/bpf/20220204185742.271030-1-song@kernel.org/ >> >> >> [2] https://lore.kernel.org/all/20230626085811.3192402-1-puranjay12@gmail.com/ >> >> >> [3] https://lore.kernel.org/all/20230626085811.3192402-2-puranjay12@gmail.com/ >> >> >> [4] https://github.com/puranjaymohan/BPF-Allocator-Bench >> >> >> [5] https://github.com/bjoto/riscv-cross-builder >> >> >> [6] https://www.starfivetech.com/en/site/boards >> >> >> >> >> >> Puranjay Mohan (2): >> >> >> riscv: Extend patch_text_nosync() for multiple pages >> >> >> bpf, riscv: use prog pack allocator in the BPF JIT >> >> > >> >> > I get a hang for "test_tag", but it's not directly related to your >> >> > series, but rather "remote fence.i". >> >> > >> >> > | rcu: INFO: rcu_sched detected stalls on CPUs/tasks: >> >> > | rcu: 0-....: (1400 ticks this GP) idle=d5e4/1/0x4000000000000000 softirq=5542/5542 fqs=1862 >> >> > | rcu: (detected by 1, t=5252 jiffies, g=10253, q=195 ncpus=4) >> >> > | Task dump for CPU 0: >> >> > | task:kworker/0:5 state:R running task stack:0 pid:319 ppid:2 flags:0x00000008 >> >> > | Workqueue: events bpf_prog_free_deferred >> >> > | Call Trace: >> >> > | [<ffffffff80cbc444>] __schedule+0x2d0/0x940 >> >> > | watchdog: BUG: soft lockup - CPU#0 stuck for 21s! 
[kworker/0:5:319] >> >> > | Modules linked in: nls_iso8859_1 drm fuse i2c_core drm_panel_orientation_quirks backlight dm_mod configfs ip_tables x_tables >> >> > | CPU: 0 PID: 319 Comm: kworker/0:5 Not tainted 6.5.0-rc5 #1 >> >> > | Hardware name: riscv-virtio,qemu (DT) >> >> > | Workqueue: events bpf_prog_free_deferred >> >> > | epc : __sbi_rfence_v02_call.isra.0+0x74/0x11a >> >> > | ra : __sbi_rfence_v02+0xda/0x1a4 >> >> > | epc : ffffffff8000ab4c ra : ffffffff8000accc sp : ff20000001c9bbd0 >> >> > | gp : ffffffff82078c48 tp : ff600000888e6a40 t0 : ff20000001c9bd44 >> >> > | t1 : 0000000000000000 t2 : 0000000000000040 s0 : ff20000001c9bbf0 >> >> > | s1 : 0000000000000010 a0 : 0000000000000000 a1 : 0000000000000000 >> >> > | a2 : 0000000000000000 a3 : 0000000000000000 a4 : 0000000000000000 >> >> > | a5 : 0000000000000000 a6 : 0000000000000000 a7 : 0000000052464e43 >> >> > | s2 : 000000000000ffff s3 : 00000000ffffffff s4 : ffffffff81667528 >> >> > | s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000 >> >> > | s8 : 0000000000000001 s9 : 0000000000000003 s10: 0000000000000040 >> >> > | s11: ffffffff8207d240 t3 : 000000000000000f t4 : 000000000000002a >> >> > | t5 : ff600000872df140 t6 : ffffffff81e26828 >> >> > | status: 0000000200000120 badaddr: 0000000000000000 cause: 8000000000000005 >> >> > | [<ffffffff8000ab4c>] __sbi_rfence_v02_call.isra.0+0x74/0x11a >> >> > | [<ffffffff8000accc>] __sbi_rfence_v02+0xda/0x1a4 >> >> > | [<ffffffff8000a886>] sbi_remote_fence_i+0x1e/0x26 >> >> > | [<ffffffff8000cee2>] flush_icache_all+0x1a/0x48 >> >> > | [<ffffffff80007736>] patch_text_nosync+0x6c/0x8c >> >> > | [<ffffffff8000f0f8>] bpf_arch_text_invalidate+0x62/0xac >> >> > | [<ffffffff8016c538>] bpf_prog_pack_free+0x9c/0x1b2 >> >> > | [<ffffffff8016c84a>] bpf_jit_binary_pack_free+0x20/0x4a >> >> > | [<ffffffff8000f198>] bpf_jit_free+0x56/0x9e >> >> > | [<ffffffff8016b43a>] bpf_prog_free_deferred+0x15a/0x182 >> >> > | [<ffffffff800576c4>] process_one_work+0x1b6/0x3d6 >> >> > | [<ffffffff80057d52>] worker_thread+0x84/0x378 >> >> > | [<ffffffff8005fc2c>] kthread+0xe8/0x108 >> >> > | [<ffffffff80003ffa>] ret_from_fork+0xe/0x20 >> >> > >> >> > I'm digging into that now, and I would appreciate if you could run the >> >> > test_tag on VF2 or similar (I'm missing that HW). >> >> > >> >> > It seems like we're hitting a bug with this series, so let's try to >> >> > figure out where the problems is, prior merging it. >> >> >> >> Hmm, it looks like the bpf_arch_text_invalidate() implementation is a >> >> bit problematic: >> >> >> >> +int bpf_arch_text_invalidate(void *dst, size_t len) >> >> +{ >> >> + __le32 *ptr; >> >> + int ret = 0; >> >> + u32 inval = 0; >> >> + >> >> + for (ptr = dst; ret == 0 && len >= sizeof(u32); len -= sizeof(u32)) { >> >> + mutex_lock(&text_mutex); >> >> + ret = patch_text_nosync(ptr++, &inval, sizeof(u32)); >> >> + mutex_unlock(&text_mutex); >> >> + } >> >> + >> >> + return ret; >> >> +} >> >> >> >> Each patch_text_nosync() is a remote fence.i, and for a big "len", we'll >> >> be flooded with remote fences. >> > >> > I understand this now, thanks for debugging this. >> > >> > We are calling patch_text_nosync() for each word (u32) which calls >> > flush_icache_range() and therefore "fence.i" is inserted after every >> > word. >> >> But more importantly, it does a remote fence.i (which is an IPI to all >> cores). >> >> > I still don't fully understand how it causes this bug because I lack >> > the prerequisite >> > knowledge about test_tag and what the failing test is doing. 
>> >> The test_tag is part of kselftest/bpf: >> tools/testing/selftests/bpf/test_tag.c >> >> TL;DR: it generates a bunch of programs, where some have a length of, >> e.g, 41024. bpf_arch_text_invalidate() does ~10k of remote fences in >> that case. >> >> > But to solve this issue we would need a function like the x86 >> > text_poke_set() that will only >> > insert a single "fence.i" after setting the whole memory area. This >> > can be done by >> > implementing a wrapper around patch_insn_write() which would set the memory area >> > and at the end call flush_icache_range(). >> > >> > Something like: >> > >> > void *text_set_nosync(void *dst, int c, size_t len) >> > { >> > __le32 *ptr; >> > int ret = 0; >> > >> > for (ptr = dst; ret == 0 && len >= sizeof(u32); len -= sizeof(u32)) { >> > ret = patch_insn_write(ptr++, &c, sizeof(u32)); >> > } >> > if(!ret) >> > flush_icache_range((uintptr_t) dst, (uintptr_t) dst + len); >> > >> > return ret; >> > } >> > >> > Let me know if this looks correct or we need more details here. >> > I will then send v2 with this implemented as a separate patch. >> >> Can't we do better here? Perhaps a similar pattern like the 2 page fill? >> Otherwise we'll have a bunch of fixmap updates as well. > > I agree that we can make it more efficient by first filling a RW buffer > with the value using a normal memset() and then copying that buffer to the RO area > using patch_insn_write(). That would solve both problems. Or we implement > a new function like patch_insn_write() that does the 2-page map and > set explicitly. > > Which approach would you prefer? > 1) A wrapper around patch_insn_write() that first memsets a RW buffer and then > copies the complete RW buffer to the RO area with a single > patch_insn_write() call for the whole len. > > 2) A new function like patch_insn_write() that takes dst, src, and len, > maps dst 2 pages > at a time, and sets it to *src in a loop. I think 2) is the way to go: a "patch_insn_set(void *addr, u16 c, size_t len)" or smth. >> I'd keep the patch_ prefix in the name for consistency. Please measure >> the runtime of test_tag pre/after the change. > > test_tag currently wouldn't even complete with the current > version of the patch, right? It will, but you'd need to crank up the watchdog timeout. :-) What I meant was test_tag w/ and w/o your pack allocator. >> I don't know if your arm64 work has similar problems? > > Thanks for bringing this up. I will revisit that and verify whether > test_tag passes > there. There, too, bpf_arch_text_invalidate() calls > aarch64_insn_patch_text_nosync() > in a loop, which in turn calls caches_clean_inval_pou(), so I might see > similar issues. Ok! > I think https://github.com/kernel-patches doesn't run test_tag, hence I > might have missed it. You're right, it doesn't. I usually run the full suite locally. Again, thanks a lot for spending time on this! It's nice work! Björn
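A rough sketch of the patch_insn_set() direction, shown here with a memset-style u8 fill value rather than the u16 suggested in the mail, and reusing the same two-page fixmap pattern as the existing patch_insn_write() via the patch_map()/patch_unmap() helpers in arch/riscv/kernel/patch.c; the chunking loop and leaving the single fence.i to a patch_text_set_nosync()-style caller are assumptions here, not the final implementation:

/*
 * Map at most two consecutive pages of the (read-only) destination through
 * the text-poke fixmap slots and memset them there. No icache flush here;
 * the caller issues one fence.i for the whole range when it is done.
 */
static int __patch_insn_set(void *addr, u8 c, size_t len)
{
	bool across_pages = (offset_in_page(addr) + len) > PAGE_SIZE;
	void *waddr;

	/* Same locking rule as patch_insn_write(). */
	lockdep_assert_held(&text_mutex);

	if (across_pages)
		patch_map(addr + PAGE_SIZE, FIX_TEXT_POKE1);
	waddr = patch_map(addr, FIX_TEXT_POKE0);

	memset(waddr, c, len);

	patch_unmap(FIX_TEXT_POKE0);
	if (across_pages)
		patch_unmap(FIX_TEXT_POKE1);

	return 0;
}

int patch_insn_set(void *addr, u8 c, size_t len)
{
	size_t size, patched = 0;
	int ret = 0;

	/* Walk the range in chunks that never span more than two pages. */
	while (patched < len && !ret) {
		size = min_t(size_t, len - patched,
			     2 * PAGE_SIZE - offset_in_page(addr + patched));
		ret = __patch_insn_set(addr + patched, c, size);
		patched += size;
	}

	return ret;
}

With something like this in place, bpf_arch_text_invalidate() could take text_mutex once, set the whole image in two-page chunks, and finish with a single remote fence.i instead of roughly one per 32-bit word.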