Message ID | 20230112090603.1295340-1-guoren@kernel.org (mailing list archive)
---|---
Series | riscv: Optimize function trace
Hi,

On 12.01.2023 12:05, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> The previous ftrace detour implementation fc76b8b8011 ("riscv: Using
> PATCHABLE_FUNCTION_ENTRY instead of MCOUNT") contains the following
> problems.
>
> - The most horrible bug is the preemption panic found by Andy [1].
>   Let's disable preemption for ftrace first, and Andy could continue
>   the ftrace preemption work.

It seems the patches #2-#7 of this series do not require "riscv:
ftrace: Fixup panic by disabling preemption" and can be used without it.

How about moving that patch out of the series and processing it
separately? As was pointed out in the discussion of that patch, some
other solution to non-atomic changes of the prologue might be needed
anyway.

> - The "-fpatchable-function-entry= CFLAG" wastes code size
>   when !RISCV_ISA_C.
> - The ftrace detour implementation wastes code size.
> - When livepatching, the trampoline (ftrace_regs_caller) would not
>   return to <func_prolog+12> but would rather jump to the new function.
>   So, "REG_L ra, -SZREG(sp)" would not run and the original return
>   address would not be restored. The kernel is likely to hang or crash
>   as a result. (Found by Evgenii Shatokhin [4])
>
> Patches 1-3 fix the problems above.
>
> Patches 4-7 are features based on the reduced detour code patch; we
> include them in the series for test and maintenance.
>
> You can directly try it with:
> https://github.com/guoren83/linux/tree/ftrace_fixup_v7
>
> Make function graph use ftrace directly [2] (patches 4, 5)
> ==========================================================
>
> In the RISC-V architecture, when we enable the ftrace_graph tracer on
> some functions, function tracing on the other functions suffers extra
> graph tracing work. In essence, graph_ops isn't limited by its
> func_hash due to the global ftrace_graph_[regs]_call label. That
> should be corrected.
>
> What inspired me is commit 0c0593b45c9b ("x86/ftrace: Make function
> graph use ftrace directly"), which uses the graph_ops::func function
> to install return_hooker and makes the function called against its
> func_hash.
>
> This series of patches makes function graph use ftrace directly for
> riscv.
>
> If FTRACE_WITH_REGS isn't defined, ftrace_caller keeps
> ftrace_graph_call so that it can be replaced with a call to
> prepare_ftrace_return by the enable/disable helper.
>
> When FTRACE_WITH_REGS is defined, ftrace_caller is adjusted to save
> the necessary regs against the pt_regs layout, so it can reasonably
> call the graph_ops::func function - ftrace_graph_func. And
> ftrace_graph_[regs]_call and its enable/disable helper aren't needed.
>
> Test log:
>
> The tests generated by CONFIG_FTRACE_STARTUP_TEST have passed in the
> local qemu-system-riscv64 virt machine. The following is the log
> during startup.
>
> ```
> Nov 15 03:07:13 stage4 kernel: Testing tracer function: PASSED
> Nov 15 03:07:13 stage4 kernel: Testing dynamic ftrace: PASSED
> Nov 15 03:07:13 stage4 kernel: Testing dynamic ftrace ops #1:
> Nov 15 03:07:13 stage4 kernel: (1 0 1 0 0)
> Nov 15 03:07:13 stage4 kernel: (1 1 2 0 0)
> Nov 15 03:07:13 stage4 kernel: (2 1 3 0 365)
> Nov 15 03:07:13 stage4 kernel: (2 2 4 0 399)
> Nov 15 03:07:13 stage4 kernel: (3 2 4 0 146071)
> Nov 15 03:07:13 stage4 kernel: (3 3 5 0 146105) PASSED
> Nov 15 03:07:13 stage4 kernel: Testing dynamic ftrace ops #2:
> Nov 15 03:07:13 stage4 kernel: (1 0 1 589 0)
> Nov 15 03:07:13 stage4 kernel: (1 1 2 635 0)
> Nov 15 03:07:13 stage4 kernel: (2 1 3 1 2)
> Nov 15 03:07:13 stage4 kernel: (2 2 4 125 126)
> Nov 15 03:07:13 stage4 kernel: (3 2 4 146001 146078)
> Nov 15 03:07:13 stage4 kernel: (3 3 5 146035 146112) PASSED
> Nov 15 03:07:13 stage4 kernel: Testing ftrace recursion: PASSED
> Nov 15 03:07:13 stage4 kernel: Testing ftrace recursion safe: PASSED
> Nov 15 03:07:13 stage4 kernel: Testing ftrace regs: PASSED
> Nov 15 03:07:13 stage4 kernel: Testing tracer nop: PASSED
> Nov 15 03:07:13 stage4 kernel: Testing tracer irqsoff: PASSED
> Nov 15 03:07:13 stage4 kernel: Testing tracer wakeup:
> Nov 15 03:07:13 stage4 kernel: sched: DL replenish lagged too much
> Nov 15 03:07:13 stage4 kernel: PASSED
> Nov 15 03:07:13 stage4 kernel: Testing tracer wakeup_rt: PASSED
> Nov 15 03:07:13 stage4 kernel: Testing tracer wakeup_dl: PASSED
> Nov 15 03:07:13 stage4 kernel: Testing tracer function_graph: PASSED
> ```
>
> Add WITH_DIRECT_CALLS support [3] (patches 6, 7)
> ================================================
>
> This series adds DYNAMIC_FTRACE_WITH_DIRECT_CALLS support for RISC-V.
> SAMPLE_FTRACE_DIRECT and SAMPLE_FTRACE_DIRECT_MULTI are also included
> here as samples for testing the DIRECT_CALLS-related interfaces.
>
> First, select DYNAMIC_FTRACE_WITH_DIRECT_CALLS to provide the
> register_ftrace_direct[_multi] interfaces, allowing users to register
> a custom trampoline (direct_caller) as the mcount for one or more
> target functions. modify_ftrace_direct[_multi] is also provided to
> modify the direct_caller.
>
> At the same time, the samples in ./samples/ftrace/ can be built as
> kernel modules for testing these interfaces, with
> SAMPLE_FTRACE_DIRECT and SAMPLE_FTRACE_DIRECT_MULTI selected.
>
> Second, to make the direct_caller and the other ftrace hooks (e.g.
> function/fgraph tracer, k[ret]probes) co-exist, a temporary register
> is nominated to store the address of direct_caller in
> ftrace_regs_caller. After direct_ops->func sets the direct_caller
> address and RESTORE_REGS runs in ftrace_regs_caller, direct_caller
> is jumped to via the `jr` instruction.
>
> The following tests have passed in my local qemu-riscv64 virt
> machine.
>
> 1. tests with CONFIG_FTRACE_STARTUP_TEST
> 2. tests of samples/ftrace/ftrace*.ko
> 3. manual tests with any combination of the following hooks
>    - function/function_graph tracer
>    - ftrace*.ko
>    - kprobe/kretprobe
>
> For your reference, here is the log when the function tracer,
> kretprobe and ftrace-direct-too.ko co-hook the handle_mm_fault
> function.
>
> ```
> [root@stage4 tracing]# echo handle_mm_fault > set_ftrace_filter
> [root@stage4 tracing]# echo 'r:myr handle_mm_fault' > kprobe_events
> [root@stage4 tracing]# echo function > current_tracer
> [root@stage4 tracing]# echo 1 > events/kprobes/myr/enable
> [root@stage4 tracing]# insmod /root/ftrace-direct-too.ko
> [root@stage4 tracing]#
> [root@stage4 tracing]# cat trace | tail
> cat-388 [000] ...1. 583.051438: myr: (do_page_fault+0x16c/0x5f2 <- handle_mm_fault)
> cat-388 [000] ...2. 583.057930: handle_mm_fault <-do_page_fault
> cat-388 [000] ..... 583.057990: my_direct_func: handle mm fault vma=000000002d9fe19c address=ffffffae9b7000 flags=215
> cat-388 [000] ...1. 583.058284: myr: (do_page_fault+0x16c/0x5f2 <- handle_mm_fault)
> tail-389 [001] ...2. 583.059062: handle_mm_fault <-do_page_fault
> tail-389 [001] ..... 583.059104: my_direct_func: handle mm fault vma=0000000017f3c48e address=aaaaaabebf3000 flags=215
> tail-389 [001] ...1. 583.059325: myr: (do_page_fault+0x16c/0x5f2 <- handle_mm_fault)
> tail-389 [001] ...2. 583.060371: handle_mm_fault <-do_page_fault
> tail-389 [001] ..... 583.060410: my_direct_func: handle mm fault vma=0000000017f3c48e address=aaaaaabebf1000 flags=255
> tail-389 [001] ...1. 583.060996: myr: (do_page_fault+0x16c/0x5f2 <- handle_mm_fault)
> ```
>
> Note1:
>
> checkpatch.pl will output some warnings on this series, like this:
>
> ```
> WARNING: Prefer using '"%s...", __func__' to using 'my_direct_func2',
> this function's name, in a string
> 111: FILE: samples/ftrace/ftrace-direct-multi-modify.c:48:
> +" call my_direct_func2\n"
> ```
>
> The reason is that checkpatch depends on the patch context providing
> the function name. In the warning above, my_direct_func2 is some
> distance away from the changed hunk, so its declaration doesn't
> appear in the patch context, and the warning pops up.
>
> You may notice the location of the `my_ip` variable changes in the
> 2nd patch. I did that to reduce the warnings to some extent, but
> killing all the warnings would make the patch less readable, so I
> stopped there.
>
> [1] https://lpc.events/event/16/contributions/1171/
> [2] https://lore.kernel.org/lkml/20221120084230.910152-1-suagrfillet@gmail.com/
> [3] https://lore.kernel.org/linux-riscv/20221123142025.1504030-1-suagrfillet@gmail.com/
> [4] https://lore.kernel.org/linux-riscv/d7d5730b-ebef-68e5-5046-e763e1ee6164@yadro.com/
>
> Changes in v7:
> - Fixup RESTORE_ABI_REGS by removing the PT_T0(sp) overwrite.
> - Add FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
> - Fixup kconfig with HAVE_SAMPLE_FTRACE_DIRECT &
>   HAVE_SAMPLE_FTRACE_DIRECT_MULTI
>
> Changes in v6:
> https://lore.kernel.org/linux-riscv/20230107133549.4192639-1-guoren@kernel.org/
> - Replace 8 with MCOUNT_INSN_SIZE
> - Replace "REG_L a1, PT_RA(sp)" with "mv a1, ra"
> - Add Evgenii Shatokhin comment
>
> Changes in v5:
> https://lore.kernel.org/linux-riscv/20221208091244.203407-1-guoren@kernel.org/
> - Sort Kconfig entries in alphabetical order.
>
> Changes in v4:
> https://lore.kernel.org/linux-riscv/20221129033230.255947-1-guoren@kernel.org/
> - Include [3] for maintenance. [Song Shuai]
>
> Changes in V3:
> https://lore.kernel.org/linux-riscv/20221123153950.2911981-1-guoren@kernel.org/
> - Include [2] for maintenance. [Song Shuai]
>
> Changes in V2:
> https://lore.kernel.org/linux-riscv/20220921034910.3142465-1-guoren@kernel.org/
> - Add Signed-off for preemption fixup.
>
> Changes in V1:
> https://lore.kernel.org/linux-riscv/20220916103817.9490-1-guoren@kernel.org/
>
> Andy Chiu (1):
>   riscv: ftrace: Fixup panic by disabling preemption
>
> Guo Ren (2):
>   riscv: ftrace: Remove wasted nops for !RISCV_ISA_C
>   riscv: ftrace: Reduce the detour code size to half
>
> Song Shuai (4):
>   riscv: ftrace: Add ftrace_graph_func
>   riscv: ftrace: Add DYNAMIC_FTRACE_WITH_DIRECT_CALLS support
>   samples: ftrace: Add riscv support for SAMPLE_FTRACE_DIRECT[_MULTI]
>   riscv: select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
>
>  arch/riscv/Kconfig                          |   6 +-
>  arch/riscv/Makefile                         |   6 +-
>  arch/riscv/include/asm/ftrace.h             |  71 ++++++--
>  arch/riscv/kernel/ftrace.c                  |  91 ++++------
>  arch/riscv/kernel/mcount-dyn.S              | 179 +++++++++++++-------
>  samples/ftrace/ftrace-direct-modify.c       |  33 ++++
>  samples/ftrace/ftrace-direct-multi-modify.c |  37 ++++
>  samples/ftrace/ftrace-direct-multi.c        |  22 +++
>  samples/ftrace/ftrace-direct-too.c          |  26 +++
>  samples/ftrace/ftrace-direct.c              |  22 +++
>  10 files changed, 356 insertions(+), 137 deletions(-)
>
> --
> 2.36.1

I tested this series a bit on top of 6.2-rc4, with
https://github.com/sugarfillet/linux/commit/9539a80dc6e7d1137ec7a96ebef2ab912a694bd7.patch
added.

The kernel was built with CONFIG_DYNAMIC_FTRACE_WITH_REGS=y and
CONFIG_RISCV_ISA_C=y.

Here are some results:

* The kernel was built without visible issues.

* Most of the Ftrace startup tests ran and passed. Certain
  event-specific tests caused soft lockups in one of the test runs,
  but they are likely unrelated to this patchset and could happen
  without it too.

* The "function" and "function_graph" tracers worked in my simple use
  cases.

* The "ipmodify" functionality worked.

* The kprobe sample modules worked OK, which is good, because they
  actually use Ftrace: they plant the probes at the beginning of the
  respective functions.

* The ftrace-direct-multi and ftrace-direct-multi-modify sample
  modules reported possibly invalid data. More info - in my reply to
  "[PATCH -next V7 6/7] samples: ftrace: Add riscv support for
  SAMPLE_FTRACE_DIRECT[_MULTI]"

Regards,
Evgenii
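For readers unfamiliar with these samples, below is a minimal sketch of
what such a direct-call sample module boils down to, modeled on
samples/ftrace/ftrace-direct-too.c as extended for riscv by patch 6.
The register_ftrace_direct()/unregister_ftrace_direct() calls are the
real kernel API of this era; the riscv64 trampoline body, in particular
its final `jr t0` (assuming this series' t0-based return convention),
is a reconstruction rather than the exact code of the patch:

```c
#include <linux/module.h>
#include <linux/mm.h>		/* for handle_mm_fault() */
#include <linux/ftrace.h>

extern void my_direct_func(struct vm_area_struct *vma,
			   unsigned long address, unsigned int flags);

void my_direct_func(struct vm_area_struct *vma,
		    unsigned long address, unsigned int flags)
{
	trace_printk("handle mm fault vma=%p address=%lx flags=%x\n",
		     vma, address, flags);
}

extern void my_tramp(void *);

/* riscv64 trampoline sketch: save the args and the two link registers,
 * call the C handler, then return to the traced function via t0. */
asm (
"	.pushsection	.text, \"ax\", @progbits\n"
"	.type		my_tramp, @function\n"
"	.globl		my_tramp\n"
"   my_tramp:\n"
"	addi	sp,sp,-48\n"
"	sd	a0,0(sp)\n"
"	sd	a1,8(sp)\n"
"	sd	a2,16(sp)\n"
"	sd	t0,24(sp)\n"
"	sd	ra,32(sp)\n"
"	call	my_direct_func\n"
"	ld	a0,0(sp)\n"
"	ld	a1,8(sp)\n"
"	ld	a2,16(sp)\n"
"	ld	t0,24(sp)\n"
"	ld	ra,32(sp)\n"
"	addi	sp,sp,48\n"
"	jr	t0\n"
"	.size		my_tramp, .-my_tramp\n"
"	.popsection\n"
);

static int __init ftrace_direct_init(void)
{
	return register_ftrace_direct((unsigned long)handle_mm_fault,
				      (unsigned long)my_tramp);
}

static void __exit ftrace_direct_exit(void)
{
	unregister_ftrace_direct((unsigned long)handle_mm_fault,
				 (unsigned long)my_tramp);
}

module_init(ftrace_direct_init);
module_exit(ftrace_direct_exit);
MODULE_LICENSE("GPL");
```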
On Mon, Jan 16, 2023 at 11:02 PM Evgenii Shatokhin
<e.shatokhin@yadro.com> wrote:
>
> Hi,
>
> On 12.01.2023 12:05, guoren@kernel.org wrote:
> > From: Guo Ren <guoren@linux.alibaba.com>
> >
> > The previous ftrace detour implementation fc76b8b8011 ("riscv: Using
> > PATCHABLE_FUNCTION_ENTRY instead of MCOUNT") contains the following
> > problems.
> >
> > - The most horrible bug is the preemption panic found by Andy [1].
> >   Let's disable preemption for ftrace first, and Andy could continue
> >   the ftrace preemption work.
>
> It seems the patches #2-#7 of this series do not require "riscv:
> ftrace: Fixup panic by disabling preemption" and can be used without it.
>
> How about moving that patch out of the series and processing it separately?

Okay.

> As was pointed out in the discussion of that patch, some other
> solution to non-atomic changes of the prologue might be needed anyway.

I think you mean Mark Rutland's DYNAMIC_FTRACE_WITH_CALL_OPS. But that
still needs to be ready. Let's disable PREEMPT for ftrace first.

[...]

> * The kprobe sample modules worked OK, which is good, because they
>   actually use Ftrace: they plant the probes at the beginning of the
>   respective functions.

Thx for the test.

> * The ftrace-direct-multi and ftrace-direct-multi-modify sample
>   modules reported possibly invalid data. More info - in my reply to
>   "[PATCH -next V7 6/7] samples: ftrace: Add riscv support for
>   SAMPLE_FTRACE_DIRECT[_MULTI]"

Good report! We will solve it in the next version.

> Regards,
> Evgenii

--
Best Regards
Guo Ren
On Sat, Feb 04, 2023 at 02:40:52PM +0800, Guo Ren wrote:
> On Mon, Jan 16, 2023 at 11:02 PM Evgenii Shatokhin
> <e.shatokhin@yadro.com> wrote:
> > It seems the patches #2-#7 of this series do not require "riscv:
> > ftrace: Fixup panic by disabling preemption" and can be used without it.
> >
> > How about moving that patch out of the series and processing it separately?
> Okay.
>
> > As was pointed out in the discussion of that patch, some other
> > solution to non-atomic changes of the prologue might be needed anyway.
> I think you mean Mark Rutland's DYNAMIC_FTRACE_WITH_CALL_OPS. But that
> still needs to be ready. Let's disable PREEMPT for ftrace first.

FWIW, taking the patch to disable FTRACE with PREEMPT for now makes
sense to me, too.

The DYNAMIC_FTRACE_WITH_CALL_OPS patches should be in v6.3. They're
currently queued in the arm64 tree in the for-next/ftrace branch:

  git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/ftrace
  https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/

... and those *should* be in v6.3.

Patches to implement DIRECT_CALLS atop that are in review at the moment:

  https://lore.kernel.org/linux-arm-kernel/20230201163420.1579014-1-revest@chromium.org/

... and if riscv uses the CALL_OPS approach, I believe it can do much
the same there.

If riscv wants to do a single atomic patch to each patch-site (to avoid
stop_machine()), then direct calls would always need to bounce through
the ftrace_caller trampoline (and acquire the direct call from the
ftrace_ops), but that might not be as bad as it sounds -- from
benchmarking on arm64, the bulk of the overhead seen with direct calls
is when using the list_ops or having to do a hash lookup, and both of
those are avoided with the CALL_OPS approach. Calling directly from the
patch-site is a minor optimization relative to skipping that work.

Thanks,
Mark.
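To make the dispatch Mark describes concrete, here is a rough,
illustrative sketch; the function name and the literal offset are
assumptions, not arm64's or riscv's actual trampoline code. The point
is that with CALL_OPS the per-site ftrace_ops is one load away, so no
ftrace_ops_list walk or hash lookup is needed:

```c
/*
 * Illustrative sketch only: with CALL_OPS, each patch site has its
 * ftrace_ops pointer stored in a literal next to the patched call, so
 * the common trampoline can dispatch with a single load.
 */
static void notrace callops_dispatch(unsigned long ip, unsigned long parent_ip,
				     struct ftrace_regs *fregs)
{
	/* assumed layout: literal 8 bytes before the patch site */
	struct ftrace_ops *ops = *(struct ftrace_ops **)(ip - 8);

	ops->func(ip, parent_ip, ops, fregs);	/* no list walk, no hash lookup */
}
```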
On Mon, Feb 6, 2023 at 5:56 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> FWIW, taking the patch to disable FTRACE with PREEMPT for now makes
> sense to me, too.

Thx for agreeing with that.

> The DYNAMIC_FTRACE_WITH_CALL_OPS patches should be in v6.3. They're
> currently queued in the arm64 tree in the for-next/ftrace branch:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/ftrace
>   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/
>
> ... and those *should* be in v6.3.

Glad to hear that. Great!

> Patches to implement DIRECT_CALLS atop that are in review at the moment:
>
>   https://lore.kernel.org/linux-arm-kernel/20230201163420.1579014-1-revest@chromium.org/

Good reference. Thx for sharing.

> ... and if riscv uses the CALL_OPS approach, I believe it can do much
> the same there.
>
> If riscv wants to do a single atomic patch to each patch-site (to avoid
> stop_machine()), then direct calls would always need to bounce through
> the ftrace_caller trampoline (and acquire the direct call from the
> ftrace_ops), but that might not be as bad as it sounds -- from
> benchmarking on arm64, the bulk of the overhead seen with direct calls
> is when using the list_ops or having to do a hash lookup, and both of
> those are avoided with the CALL_OPS approach. Calling directly from the
> patch-site is a minor optimization relative to skipping that work.

Yes, CALL_OPS could solve the PREEMPTION & stop_machine problems. I
would follow up.

The difference from arm64 is that RISC-V has a 16-bit/32-bit mixed
instruction ISA, so we must keep ftrace_caller & ftrace_regs_caller
2048 aligned. Then:

FTRACE_UPDATE_MAKE_CALL:

 * addr+00: NOP             // Literal (first 32 bits)
 * addr+04: NOP             // Literal (last 32 bits)
 * addr+08: func: auipc t0, ?   // All trampolines are in the same
                                // 2048-aligned place, so this point
                                // won't be changed.
 * addr+12: jalr ?(t0)      // For the different trampolines:
                            // ftrace_regs_caller, ftrace_caller

FTRACE_UPDATE_MAKE_NOP:

 * addr+00: NOP             // Literal (first 32 bits)
 * addr+04: NOP             // Literal (last 32 bits)
 * addr+08: func: c.j       // jump to addr+16, skipping the broken insn & jalr
 * addr+10: xxx             // last half (broken insn) of auipc t0, ?
 * addr+12: jalr ?(t0)      // to be patched to jalr ?(t0)
 * addr+16: func body

Right? (The call site ahead of func would grow from 64 to 128 bits.)

> Thanks,
> Mark.

--
Best Regards
Guo Ren
On Tue, Feb 07, 2023 at 11:57:06AM +0800, Guo Ren wrote:
> On Mon, Feb 6, 2023 at 5:56 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > If riscv wants to do a single atomic patch to each patch-site (to avoid
> > stop_machine()), then direct calls would always need to bounce through
> > the ftrace_caller trampoline (and acquire the direct call from the
> > ftrace_ops), but that might not be as bad as it sounds [...]
> Yes, CALL_OPS could solve the PREEMPTION & stop_machine problems. I
> would follow up.
>
> The difference from arm64 is that RISC-V has a 16-bit/32-bit mixed
> instruction ISA, so we must keep ftrace_caller & ftrace_regs_caller
> 2048 aligned. Then:

Where does the 2048-bit alignment requirement come from?

Note that I'm assuming you will *always* go through a common
ftrace_caller trampoline (even for direct calls), with the trampoline
responsible for recovering the direct trampoline (or ops->func) from
the ops pointer.

That would only require 64-bit alignment on 64-bit (or 32-bit alignment
on 32-bit) to keep the literal naturally aligned; the rest of the
instructions wouldn't require additional alignment.

For example, I would expect that (for 64-bit) you'd use:

  # place 2 NOPs *immediately before* the function, and 3 NOPs at the start
  -fpatchable-function-entry=5,2

  # Align the function to 8 bytes
  -falign-functions=8

... and your trampoline in each function could be initialized to:

  # Note: aligned to 8 bytes
  addr-08       // Literal (first 32-bits)    // set to ftrace_nop_ops
  addr-04       // Literal (last 32-bits)     // set to ftrace_nop_ops
  addr+00 func: mv    t0, ra
  addr+04       auipc t1, ftrace_caller
  addr+08       nop

... and when enabled can be set to:

  # Note: aligned to 8 bytes
  addr-08       // Literal (first 32-bits)    // patched to ops ptr
  addr-04       // Literal (last 32-bits)     // patched to ops ptr
  addr+00 func: mv    t0, ra
  addr+04       auipc t1, ftrace_caller
  addr+08       jalr  ftrace_caller(t1)

Note: this *only* requires patching the literal and NOP<->JALR; the MV
and AUIPC aren't harmful and can always be there. This way, you won't
need to use stop_machine().

With that, the ftrace_caller trampoline can recover the `ops` pointer
at a negative offset from `ra`, and can recover the instrumented
function's return address in `t0`. Using the `ops` pointer, it can
figure out whether to branch to a direct trampoline or whether to
save/restore the regs around invoking ops->func.

For 32-bit it would be exactly the same, except you'd only need a
single nop before the function, and the offset would be -0x10.

That's what arm64 does; the only difference is that riscv would
*always* need to go via the trampoline in order to make direct calls.

Thanks,
Mark.
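As a sanity check on the arithmetic in Mark's 64-bit layout: the JALR
at addr+0x08 leaves a link address of addr+0x0c, and the ops literal
sits at addr-0x08, so the trampoline can fetch it 0x14 bytes back from
the link register; the same arithmetic on the 32-bit layout yields the
-0x10 Mark quotes. A sketch with illustrative names:

```c
/* Sketch: recover the per-site ops pointer inside ftrace_caller. */
static inline struct ftrace_ops *site_ops_from_link(unsigned long link)
{
	/* 64-bit layout: literal at (addr - 0x8), link = (addr + 0xc) */
	return *(struct ftrace_ops **)(link - 0x14);
}
```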
Hi Mark,

Thx for the thoughtful reply.

On Tue, Feb 7, 2023 at 5:17 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Where does the 2048-bit alignment requirement come from?

Sorry for the typo. It's 2048 bytes: keeping the two trampolines
(ftrace_caller & ftrace_regs_caller) in one aligned region, because
jalr has only a +-2048-byte offset range. Then the
"auipc t1, ftrace(_regs)_caller" is fixed.

> Note that I'm assuming you will *always* go through a common
> ftrace_caller trampoline (even for direct calls), with the trampoline
> responsible for recovering the direct trampoline (or ops->func) from
> the ops pointer.
>
> That would only require 64-bit alignment on 64-bit (or 32-bit alignment
> on 32-bit) to keep the literal naturally aligned; the rest of the
> instructions wouldn't require additional alignment.
>
> For example, I would expect that (for 64-bit) you'd use:
>
>   # place 2 NOPs *immediately before* the function, and 3 NOPs at the start
>   -fpatchable-function-entry=5,2
>
>   # Align the function to 8 bytes
>   -falign-functions=8
>
> ... and your trampoline in each function could be initialized to:
>
>   # Note: aligned to 8 bytes
>   addr-08       // Literal (first 32-bits)    // set to ftrace_nop_ops
>   addr-04       // Literal (last 32-bits)     // set to ftrace_nop_ops
>   addr+00 func: mv    t0, ra
>   addr+04       auipc t1, ftrace_caller
>   addr+08       nop
>
> ... and when enabled can be set to:
>
>   # Note: aligned to 8 bytes
>   addr-08       // Literal (first 32-bits)    // patched to ops ptr
>   addr-04       // Literal (last 32-bits)     // patched to ops ptr
>   addr+00 func: mv    t0, ra

We needn't "mv t0, ra" here because our "jalr" could work with t0 and
won't affect ra. Let's do it in the trampoline code; then we can save
another word here.

>   addr+04       auipc t1, ftrace_caller
>   addr+08       jalr  ftrace_caller(t1)

Here is the call site:

  # Note: aligned to 8 bytes
  addr-08       // Literal (first 32-bits)    // patched to ops ptr
  addr-04       // Literal (last 32-bits)     // patched to ops ptr
  addr+00       auipc t0, ftrace_caller
  addr+04       jalr  ftrace_caller(t0)

> Note: this *only* requires patching the literal and NOP<->JALR; the MV
> and AUIPC aren't harmful and can always be there. This way, you won't
> need to use stop_machine().

Yes, the simplest nop is better than c.j. I was confused.

> With that, the ftrace_caller trampoline can recover the `ops` pointer
> at a negative offset from `ra`, and can recover the instrumented
> function's return address in `t0`. Using the `ops` pointer, it can
> figure out whether to branch to a direct trampoline or whether to
> save/restore the regs around invoking ops->func.
>
> For 32-bit it would be exactly the same, except you'd only need a
> single nop before the function, and the offset would be -0x10.

Yes, we save another 4 bytes & get a smaller alignment for better code
size on 32-bit:

  # Note: aligned to 4 bytes
  addr-04       // Literal (last 32-bits)     // patched to ops ptr
  addr+00       auipc t0, ftrace_caller
  addr+04       jalr  ftrace_caller(t0)

> That's what arm64 does; the only difference is that riscv would
> *always* need to go via the trampoline in order to make direct calls.

We need one more trampoline here besides ftrace_caller &
ftrace_regs_caller: "direct_caller".

  addr+04       nop -> direct_caller/ftrace_caller/ftrace_regs_caller

> Thanks,
> Mark.
On Wed, Feb 08, 2023 at 10:30:56AM +0800, Guo Ren wrote:
> Hi Mark,
>
> Thx for the thoughtful reply.
>
> On Tue, Feb 7, 2023 at 5:17 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > [...]
> > ... and when enabled can be set to:
> >
> >   # Note: aligned to 8 bytes
> >   addr-08       // Literal (first 32-bits)    // patched to ops ptr
> >   addr-04       // Literal (last 32-bits)     // patched to ops ptr
> >   addr+00 func: mv    t0, ra
> We needn't "mv t0, ra" here because our "jalr" could work with t0 and
> won't affect ra. Let's do it in the trampoline code; then we can save
> another word here.

Ah; I thought JALR always clobbered ra? Or can it specify the register
to save the link address to? I'm not that familiar with riscv asm, so
I've probably just got that wrong.

> >   addr+04       auipc t1, ftrace_caller
> >   addr+08       jalr  ftrace_caller(t1)
>
> Here is the call site:
>
>   # Note: aligned to 8 bytes
>   addr-08       // Literal (first 32-bits)    // patched to ops ptr
>   addr-04       // Literal (last 32-bits)     // patched to ops ptr
>   addr+00       auipc t0, ftrace_caller
>   addr+04       jalr  ftrace_caller(t0)

I'm a bit confused there; I thought that the `symbol(reg)` addressing
mode was generating additional bits that the AUIPC didn't -- have I got
that wrong?

What specifies which register the JALR will write the link address to?

> > Note: this *only* requires patching the literal and NOP<->JALR; the MV
> > and AUIPC aren't harmful and can always be there. This way, you won't
> > need to use stop_machine().
> Yes, the simplest nop is better than c.j. I was confused.
>
> > For 32-bit it would be exactly the same, except you'd only need a
> > single nop before the function, and the offset would be -0x10.
> Yes, we save another 4 bytes & get a smaller alignment for better code
> size on 32-bit:
>
>   # Note: aligned to 4 bytes
>   addr-04       // Literal (last 32-bits)     // patched to ops ptr
>   addr+00       auipc t0, ftrace_caller
>   addr+04       jalr  ftrace_caller(t0)
>
> > That's what arm64 does; the only difference is that riscv would
> > *always* need to go via the trampoline in order to make direct calls.
> We need one more trampoline here besides ftrace_caller &
> ftrace_regs_caller: "direct_caller".
>
>   addr+04       nop -> direct_caller/ftrace_caller/ftrace_regs_caller

I'd strongly recommend that you instead implement FTRACE_WITH_ARGS and
deprecate FTRACE_WITH_REGS, like arm64 has done; then you only need a
single ftrace_caller, as I mentioned above. That way there's no risk
that you need to patch the AUIPC after initialization.

The arm64 FTRACE_WITH_ARGS conversion is in mainline, and arm64's
FTRACE_WITH_CALL_OPS is based upon that. Florent's DIRECT_CALLS patches
add the direct call logic to the same ftrace_caller trampoline.

Thanks,
Mark.
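For context, the WITH_ARGS-style hook Mark recommends is what patch 4
of this series ("riscv: ftrace: Add ftrace_graph_func") provides. A
sketch along those lines; the pt_regs field access and frame_pointer()
helper are assumptions here, not taken verbatim from the patch:

```c
void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
		       struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	struct pt_regs *regs = ftrace_get_regs(fregs);
	unsigned long *parent = (unsigned long *)&regs->ra;

	/* install return_hooker only for functions in graph_ops' hash */
	prepare_ftrace_return(parent, ip, frame_pointer(regs));
}
```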
> > # Note: aligned to 8 bytes
> > addr-08       // Literal (first 32-bits)    // patched to ops ptr
> > addr-04       // Literal (last 32-bits)     // patched to ops ptr
> > addr+00 func: mv    t0, ra
> We needn't "mv t0, ra" here because our "jalr" could work with t0 and
> won't affect ra. Let's do it in the trampoline code; then we can save
> another word here.
> > addr+04       auipc t1, ftrace_caller
> > addr+08       jalr  ftrace_caller(t1)

Is that some kind of 'load high' and 'add offset' pair?
I guess 64bit kernels guarantee to put all module code
within +-2G of the main kernel?

> Here is the call site:
>
>   # Note: aligned to 8 bytes
>   addr-08       // Literal (first 32-bits)    // patched to ops ptr
>   addr-04       // Literal (last 32-bits)     // patched to ops ptr
>   addr+00       auipc t0, ftrace_caller
>   addr+04       jalr  ftrace_caller(t0)

Could you even do something like:

  addr-n    call ftrace-function
  addr-n+x  literals
  addr+0    nop or jmp addr-n
  addr+4    function_code

So that all the code executed when tracing is enabled is before the
label and only one 'nop' is in the body. The called code can use the
return address to find the literals and then modify it to return to
addr+4.

The code cost when trace is enabled is probably irrelevant here -
dominated by what happens later. It probably isn't even worth aligning
a 64bit constant. Doing two reads probably won't be noticeable.

What you do want to ensure is that the initial patch is overwriting
nop - just in case the gap isn't there.

David
On Wed, Feb 8, 2023 at 10:46 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Ah; I thought JALR always clobbered ra? Or can it specify the register
> to save the link address to?

Yes, that's the feature of riscv :) We could use any register to save
the link address.

> > Here is the call site:
> >
> >   # Note: aligned to 8 bytes
> >   addr-08       // Literal (first 32-bits)    // patched to ops ptr
> >   addr-04       // Literal (last 32-bits)     // patched to ops ptr
> >   addr+00       auipc t0, ftrace_caller
> >   addr+04       jalr  ftrace_caller(t0)

Sorry, it should be:

  addr+04       jalr t0, ftrace_caller(t0)

> I'm a bit confused there; I thought that the `symbol(reg)` addressing
> mode was generating additional bits that the AUIPC didn't -- have I got
> that wrong?
>
> What specifies which register the JALR will write the link address to?

According to the spec, "auipc t1, 0x0" writes PC + (0x0 << 12) (which
is equal to PC) to t1, and then "jalr t0, 0(t0)" jumps to the address
stored in t0 + 0x0 and stores the return address to t0.

That means auipc supplies the upper bits (xxx << 12) and jalr supplies
the lowest 12 bits.

> > For 32-bit it would be exactly the same, except you'd only need a
> > single nop before the function, and the offset would be -0x10.
> Yes, we save another 4 bytes & get a smaller alignment for better code
> size on 32-bit:
>
>   # Note: aligned to 4 bytes
>   addr-04       // Literal (last 32-bits)     // patched to ops ptr
>   addr+00       auipc t0, ftrace_caller
>   addr+04       jalr  ftrace_caller(t0)

  addr+04       jalr t0, ftrace_caller(t0)

> I'd strongly recommend that you instead implement FTRACE_WITH_ARGS and
> deprecate FTRACE_WITH_REGS, like arm64 has done; then you only need a
> single ftrace_caller, as I mentioned above. That way there's no risk
> that you need to patch the AUIPC after initialization.
>
> The arm64 FTRACE_WITH_ARGS conversion is in mainline, and arm64's
> FTRACE_WITH_CALL_OPS is based upon that. Florent's DIRECT_CALLS patches
> add the direct call logic to the same ftrace_caller trampoline.

Thx for the suggestion of keeping only ftrace_caller, but that's
another topic. What I want to point out: if we keep the "auipc
(addr+00)" fixed, we could use different trampolines at the "jalr
(addr+04)" (all of them must be in one 2k-aligned area).

> Thanks,
> Mark.
On Thu, Feb 9, 2023 at 6:29 AM David Laight <David.Laight@aculab.com> wrote:
>
> > > addr+04       auipc t1, ftrace_caller
> > > addr+08       jalr  ftrace_caller(t1)
>
> Is that some kind of 'load high' and 'add offset' pair?

Yes.

> I guess 64bit kernels guarantee to put all module code
> within +-2G of the main kernel?

Yes, 32-bit is enough. So we only need one 32-bit literal size for the
current rv64, just like CONFIG_32BIT.

> Could you even do something like:
>
>   addr-n    call ftrace-function
>   addr-n+x  literals
>   addr+0    nop or jmp addr-n
>   addr+4    function_code

Yours costs one more instruction, right?

  addr-12   auipc
  addr-8    jalr
  addr-4    // Literal (32-bits)
  addr+0    nop or jmp addr-n   // one more?
  addr+4    function_code

> So that all the code executed when tracing is enabled is before the
> label and only one 'nop' is in the body. The called code can use the
> return address to find the literals and then modify it to return to
> addr+4. The code cost when trace is enabled is probably irrelevant
> here - dominated by what happens later. It probably isn't even worth
> aligning a 64bit constant. Doing two reads probably won't be
> noticeable.
>
> What you do want to ensure is that the initial patch is overwriting
> nop - just in case the gap isn't there.
>
> David
On Thu, Feb 9, 2023 at 9:51 AM Guo Ren <guoren@kernel.org> wrote:
>
> On Thu, Feb 9, 2023 at 6:29 AM David Laight <David.Laight@aculab.com> wrote:
> > I guess 64bit kernels guarantee to put all module code
> > within +-2G of the main kernel?
> Yes, 32-bit is enough. So we only need one 32-bit literal size for the
> current rv64, just like CONFIG_32BIT.

We need kernel_addr_base + this 32-bit literal.

@Mark Rutland
What do you think of the idea of shaving another 32 bits off the call
site? (It should also work for arm64.)

[...]

--
Best Regards
Guo Ren
From: Guo Ren
> Sent: 09 February 2023 01:51
...
> Yours costs one more instruction, right?
>
>   addr-12   auipc
>   addr-8    jalr
>   addr-4    // Literal (32-bits)
>   addr+0    nop or jmp addr-n   // one more?
>   addr+4    function_code

Yes, it is 4 bytes larger but there is one less instruction executed
(only one nop) when ftrace is disabled. That probably matters more
than anything in the ftrace 'prologue' code.

I also suspect that you can use a 32bit integer as a table index in
64bit mode to save a word there.

David
On Thu, Feb 9, 2023 at 5:00 PM David Laight <David.Laight@aculab.com> wrote:
>
> Yes, it is 4 bytes larger but there is one less instruction executed
> (only one nop) when ftrace is disabled. That probably matters more
> than anything in the ftrace 'prologue' code.

I've got your point, thx. I'll consider your advice and weigh the
tradeoff.

> I also suspect that you can use a 32bit integer as a table index in
> 64bit mode to save a word there.

Yes, good idea. I've asked Mark Rutland; let's wait for his feedback.
It would also help arm64.

> David
On Thu, Feb 09, 2023 at 09:59:33AM +0800, Guo Ren wrote:
> On Thu, Feb 9, 2023 at 9:51 AM Guo Ren <guoren@kernel.org> wrote:
> > Yes, 32-bit is enough. So we only need one 32-bit literal size for the
> > current rv64, just like CONFIG_32BIT.
> We need kernel_addr_base + this 32-bit literal.
>
> @Mark Rutland
> What do you think of the idea of shaving another 32 bits off the call
> site? (It should also work for arm64.)

The literal pointer is for a struct ftrace_ops, which is data, not code.

An ftrace_ops can be allocated from anywhere (e.g. core kernel data,
module data, linear map, vmalloc space), and so is not guaranteed to be
within 2GiB of all code. The literal needs to be able to address the
entire kernel address range, and since it can be modified concurrently
(with PREEMPT and not using stop_machine()) it needs to be possible to
read/write it atomically. So practically speaking it needs to be the
native pointer size (i.e. 64-bit on a 64-bit kernel).

Other schemes for compressing that (e.g. using an integer index into an
array of pointers) are possible, but use more memory and get more
complicated for concurrent manipulation, so I would strongly recommend
keeping this simple and using a native pointer size here.

> > > Could you even do something like:
> > >
> > >   addr-n    call ftrace-function
> > >   addr-n+x  literals
> > >   addr+0    nop or jmp addr-n
> > >   addr+4    function_code
> > Yours costs one more instruction, right?
> >
> >   addr-12   auipc
> >   addr-8    jalr
> >   addr-4    // Literal (32-bits)
> >   addr+0    nop or jmp addr-n   // one more?
> >   addr+4    function_code

Placing instructions before the entry point is going to confuse symbol
resolution and unwind code, so I would not recommend that. It also
means the trampoline will need to re-adjust the return address back
into the function, but that is relatively simple.

I also think that this is micro-optimizing. The AUIPC *should* be
cheap, so executing that unconditionally should be fine. I think the
form that Guo suggested with AUIPC + {JALR || NOP} in the function (and
64 bits reserved immediately before the function) is the way to go, so
long as that does the right thing with ra.

Thanks,
Mark.
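Mark's atomicity point in one line of illustrative code: a
native-width, naturally aligned literal can be republished with a
single store that concurrent readers observe atomically, so no
stop_machine() is needed for this part (the helper name and layout are
assumptions):

```c
static void site_set_ops(unsigned long func, struct ftrace_ops *ops)
{
	/* assumed 64-bit layout: literal resides 8 bytes before func */
	WRITE_ONCE(*(struct ftrace_ops **)(func - 8), ops);
}
```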
From: Guo Ren
> Sent: 09 February 2023 01:31
...
> > I'm a bit confused there; I thought that the `symbol(reg)` addressing mode was
> > generating additional bits that the AUIPC didn't -- have I got that wrong?
> >
> > What specifies which register the JALR will write the link address to?
>
> According to the spec, "auipc t1, 0x0" writes PC + (0x0 << 12) (which
> is equal to PC) to t1, and "jalr t0, 0(t0)" jumps to the address
> stored in t0 + 0x0 and stores the return address in t0.
>
> That means auipc supplies the (xxx << 12) bits and jalr supplies the
> lowest 12 bits.
...
> What I want to point out:
> If we keep the "auipc (addr+00)" fixed, we could select different
> trampolines with the "jalr (addr+0x4)" alone (all of them must be in
> one 2k-aligned area).

I looked up AUIPC:
"AUIPC is used to build PC-relative addresses and uses the U-type format.
AUIPC forms a 32-bit offset from the U-immediate, filling in the lowest
12 bits with zeros, adds this offset to the address of the AUIPC
instruction, then places the result in rd."

So it generates 'pc + (val << 12)'.
And the jalr then adds in a 12-bit offset.

I think that means that if you have two trampolines you might need to
change both instructions even if the two trampolines are adjacent
instructions.
It is the distance from the call site that mustn't cross a 2k boundary -
not the absolute address of the trampoline itself.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Fri, Feb 10, 2023 at 6:47 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Guo Ren
> > Sent: 09 February 2023 01:31
> ...
> > What I want to point out:
> > If we keep the "auipc (addr+00)" fixed, we could select different
> > trampolines with the "jalr (addr+0x4)" alone (all of them must be in
> > one 2k-aligned area).
>
> I looked up AUIPC:
> "AUIPC is used to build PC-relative addresses and uses the U-type format.
> AUIPC forms a 32-bit offset from the U-immediate, filling in the lowest
> 12 bits with zeros, adds this offset to the address of the AUIPC
> instruction, then places the result in rd."
>
> So it generates 'pc + (val << 12)'.
> And the jalr then adds in a 12-bit offset.
Correct!

> I think that means that if you have two trampolines you might need to
> change both instructions even if the two trampolines are adjacent
> instructions.
> It is the distance from the call site that mustn't cross a 2k boundary -
> not the absolute address of the trampoline itself.
We could have multiple adjacent trampolines, but they must all sit in the
same 2k window relative to the call site. Then we needn't change the
auipc part when patching.

> David
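A small standalone sketch of the arithmetic being discussed (all
addresses and immediates below are made up): with the AUIPC immediate
held fixed, the JALR immediate alone can select any target in the signed
12-bit window around pc + (hi20 << 12), which is why a group of
trampolines must share one such window if only the jalr is to be patched.

```c
#include <inttypes.h>
#include <stdio.h>

/* target = pc + (hi20 << 12) + lo12, per the AUIPC/JALR definitions above */
static uint64_t auipc_jalr_target(uint64_t pc, int32_t hi20, int32_t lo12)
{
	return pc + ((int64_t)hi20 << 12) + lo12;
}

int main(void)
{
	uint64_t call_site = 0x80002000;	/* made-up call-site pc */

	/* Same auipc immediate, different jalr immediates: two adjacent
	 * trampolines selectable by rewriting only the jalr instruction. */
	printf("tramp A: %#" PRIx64 "\n",
	       auipc_jalr_target(call_site, 0x123, 0x10));
	printf("tramp B: %#" PRIx64 "\n",
	       auipc_jalr_target(call_site, 0x123, 0x18));

	/* A target outside [-2048, 2047] of pc + (0x123 << 12) would
	 * force the auipc immediate to change as well. */
	return 0;
}
```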
On Thu, Feb 9, 2023 at 5:54 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Thu, Feb 09, 2023 at 09:59:33AM +0800, Guo Ren wrote:
> > @Mark Rutland
> > What do you think of the idea of shaving one more 32-bit word off the
> > call site? (It should also work for arm64.)
>
> The literal pointer is for a struct ftrace_ops, which is data, not code.
>
> An ftrace_ops can be allocated from anywhere (e.g. core kernel data,
> module data, the linear map, vmalloc space), and so is not guaranteed to
> be within 2GiB of all code. The literal needs to be able to address the
> entire kernel address range, and since it can be modified concurrently
> (with PREEMPT and without stop_machine()) it needs to be possible to
> read/write it atomically. So practically speaking it needs to be the
> native pointer size (i.e. 64-bit on a 64-bit kernel).
Got it, thanks. Let's start with an absolute pointer.

> Other schemes for compressing it (e.g. an integer index into an array of
> pointers) are possible, but they use more memory and complicate
> concurrent manipulation, so I would strongly recommend keeping this
> simple and using the native pointer size here.
>
> [...]
>
> I also think that this is micro-optimizing. The AUIPC *should* be cheap,
> so executing it unconditionally should be fine. I think the form that
> Guo suggested, with AUIPC + {JALR || NOP} in the function (and 64 bits
> reserved immediately before the function), is the way to go, so long as
> that does the right thing with ra.
>
> Thanks,
> Mark.
Hello:

This series was applied to riscv/linux.git (for-next)
by Palmer Dabbelt <palmer@rivosinc.com>:

On Thu, 12 Jan 2023 04:05:56 -0500 you wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> The previous ftrace detour implementation fc76b8b8011 ("riscv: Using
> PATCHABLE_FUNCTION_ENTRY instead of MCOUNT") contains the following
> problems.
>
> - The worst bug is a preemption panic found by Andy [1]. Let's disable
>   preemption for ftrace first, and Andy can continue the ftrace
>   preemption work.
> - The "-fpatchable-function-entry=" CFLAG wastes code size
>   when !RISCV_ISA_C.
> - The ftrace detour implementation wastes code size.
> - When livepatching, the trampoline (ftrace_regs_caller) would not
>   return to <func_prolog+12> but would rather jump to the new function.
>   So, "REG_L ra, -SZREG(sp)" would not run and the original return
>   address would not be restored. The kernel is likely to hang or crash
>   as a result. (Found by Evgenii Shatokhin [4])
>
> [...]

Here is the summary with links:
  - [-next,V7,1/7] riscv: ftrace: Fixup panic by disabling preemption
    https://git.kernel.org/riscv/c/8547649981e6
  - [-next,V7,2/7] riscv: ftrace: Remove wasted nops for !RISCV_ISA_C
    https://git.kernel.org/riscv/c/409c8fb20c66
  - [-next,V7,3/7] riscv: ftrace: Reduce the detour code size to half
    https://git.kernel.org/riscv/c/6724a76cff85
  - [-next,V7,4/7] riscv: ftrace: Add ftrace_graph_func
    (no matching commit)
  - [-next,V7,5/7] riscv: ftrace: Add DYNAMIC_FTRACE_WITH_DIRECT_CALLS support
    (no matching commit)
  - [-next,V7,6/7] samples: ftrace: Add riscv support for SAMPLE_FTRACE_DIRECT[_MULTI]
    (no matching commit)
  - [-next,V7,7/7] riscv : select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY
    (no matching commit)

You are awesome, thank you!
From: Guo Ren <guoren@linux.alibaba.com>

The previous ftrace detour implementation fc76b8b8011 ("riscv: Using
PATCHABLE_FUNCTION_ENTRY instead of MCOUNT") contains the following
problems.

- The worst bug is a preemption panic found by Andy [1]. Let's disable
  preemption for ftrace first, and Andy can continue the ftrace
  preemption work.
- The "-fpatchable-function-entry=" CFLAG wastes code size
  when !RISCV_ISA_C.
- The ftrace detour implementation wastes code size.
- When livepatching, the trampoline (ftrace_regs_caller) would not
  return to <func_prolog+12> but would rather jump to the new function.
  So, "REG_L ra, -SZREG(sp)" would not run and the original return
  address would not be restored. The kernel is likely to hang or crash
  as a result. (Found by Evgenii Shatokhin [4])

Patches 1-3 fix the problems above.

Patches 4-7 are features built on the reduced detour code patch; we
include them in the series for testing and maintenance.

You can directly try it with:
https://github.com/guoren83/linux/tree/ftrace_fixup_v7

Make function graph use ftrace directly [2] (patches 4, 5)
==========================================================

On RISC-V, when we enable the ftrace_graph tracer on some functions,
function tracing on the other functions suffers extra graph tracing
work. In essence, graph_ops isn't limited by its func_hash, because of
the global ftrace_graph_[regs]_call label. That should be corrected.

The inspiration is commit 0c0593b45c9b ("x86/ftrace: Make function
graph use ftrace directly"), which uses the graph_ops::func function to
install the return_hooker, so a function is only hooked according to
graph_ops' func_hash.

This series of patches makes function graph use ftrace directly for
riscv.

If FTRACE_WITH_REGS isn't defined, ftrace_caller keeps ftrace_graph_call
so that it can be replaced with a call to prepare_ftrace_return by the
enable/disable helper.

When FTRACE_WITH_REGS is defined, ftrace_caller is adjusted to save the
necessary regs against the pt_regs layout, so it can properly call the
graph_ops::func function, ftrace_graph_func. ftrace_graph_[regs]_call
and its enable/disable helper are then no longer needed.
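For reference, a sketch of what such an ftrace_graph_func can look like,
modeled on the x86 commit above; the exact code in patch 4 may differ.
prepare_ftrace_return, arch_ftrace_get_regs and frame_pointer() are the
existing kernel interfaces:

```c
#include <linux/ftrace.h>

/*
 * With FTRACE_WITH_REGS, ftrace core invokes this only for functions
 * matching graph_ops' func_hash, so other traced functions no longer
 * pay the graph-tracing cost.
 */
void ftrace_graph_func(unsigned long ip, unsigned long parent_ip,
		       struct ftrace_ops *op, struct ftrace_regs *fregs)
{
	struct pt_regs *regs = arch_ftrace_get_regs(fregs);
	unsigned long *parent = (unsigned long *)&regs->ra;

	/* Hook the saved return address (ra) with the graph return
	 * trampoline; s0 serves as the frame pointer on riscv. */
	prepare_ftrace_return(parent, ip, frame_pointer(regs));
}
```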
Test log:

The tests generated by CONFIG_FTRACE_STARTUP_TEST have passed in a local
qemu-system-riscv64 virt machine. The following is the log during
startup.

```
Nov 15 03:07:13 stage4 kernel: Testing tracer function: PASSED
Nov 15 03:07:13 stage4 kernel: Testing dynamic ftrace: PASSED
Nov 15 03:07:13 stage4 kernel: Testing dynamic ftrace ops #1:
Nov 15 03:07:13 stage4 kernel: (1 0 1 0 0)
Nov 15 03:07:13 stage4 kernel: (1 1 2 0 0)
Nov 15 03:07:13 stage4 kernel: (2 1 3 0 365)
Nov 15 03:07:13 stage4 kernel: (2 2 4 0 399)
Nov 15 03:07:13 stage4 kernel: (3 2 4 0 146071)
Nov 15 03:07:13 stage4 kernel: (3 3 5 0 146105) PASSED
Nov 15 03:07:13 stage4 kernel: Testing dynamic ftrace ops #2:
Nov 15 03:07:13 stage4 kernel: (1 0 1 589 0)
Nov 15 03:07:13 stage4 kernel: (1 1 2 635 0)
Nov 15 03:07:13 stage4 kernel: (2 1 3 1 2)
Nov 15 03:07:13 stage4 kernel: (2 2 4 125 126)
Nov 15 03:07:13 stage4 kernel: (3 2 4 146001 146078)
Nov 15 03:07:13 stage4 kernel: (3 3 5 146035 146112) PASSED
Nov 15 03:07:13 stage4 kernel: Testing ftrace recursion: PASSED
Nov 15 03:07:13 stage4 kernel: Testing ftrace recursion safe: PASSED
Nov 15 03:07:13 stage4 kernel: Testing ftrace regs: PASSED
Nov 15 03:07:13 stage4 kernel: Testing tracer nop: PASSED
Nov 15 03:07:13 stage4 kernel: Testing tracer irqsoff: PASSED
Nov 15 03:07:13 stage4 kernel: Testing tracer wakeup:
Nov 15 03:07:13 stage4 kernel: sched: DL replenish lagged too much
Nov 15 03:07:13 stage4 kernel: PASSED
Nov 15 03:07:13 stage4 kernel: Testing tracer wakeup_rt: PASSED
Nov 15 03:07:13 stage4 kernel: Testing tracer wakeup_dl: PASSED
Nov 15 03:07:13 stage4 kernel: Testing tracer function_graph: PASSED
```

Add WITH_DIRECT_CALLS support [3] (patches 6, 7)
================================================

This series adds DYNAMIC_FTRACE_WITH_DIRECT_CALLS support for RISC-V.
SAMPLE_FTRACE_DIRECT and SAMPLE_FTRACE_DIRECT_MULTI are also included
here as samples for testing the DIRECT_CALLS-related interfaces.

First, select DYNAMIC_FTRACE_WITH_DIRECT_CALLS to provide the
register_ftrace_direct[_multi] interfaces, which allow users to register
a custom trampoline (direct_caller) as the mcount for one or more target
functions. modify_ftrace_direct[_multi] is also provided for modifying
the direct_caller.

At the same time, the samples in ./samples/ftrace/ can be built as
kernel modules for testing these interfaces, with SAMPLE_FTRACE_DIRECT
and SAMPLE_FTRACE_DIRECT_MULTI selected.

Second, to make the direct_caller co-exist with the other ftrace hooks
(e.g. function/fgraph tracer, k[ret]probes), a temporary register is
nominated to store the address of direct_caller in ftrace_regs_caller.
After direct_ops->func sets the direct_caller address and RESTORE_REGS
runs in ftrace_regs_caller, direct_caller is reached via a `jr`
instruction. A condensed sketch of such a sample module follows the test
log below.

The following tests have passed in my local qemu-riscv64 virt machine.

1. tests with CONFIG_FTRACE_STARTUP_TEST
2. tests of samples/ftrace/ftrace*.ko
3. manual tests with any combination of the following hooks
   - function/function_graph tracer
   - ftrace*.ko
   - kprobe/kretprobe

For your reference, here is the log when the function tracer, a
kretprobe and ftrace-direct-too.ko co-hook the handle_mm_fault function.

```
[root@stage4 tracing]# echo handle_mm_fault > set_ftrace_filter
[root@stage4 tracing]# echo 'r:myr handle_mm_fault' > kprobe_events
[root@stage4 tracing]# echo function > current_tracer
[root@stage4 tracing]# echo 1 > events/kprobes/myr/enable
[root@stage4 tracing]# insmod /root/ftrace-direct-too.ko
[root@stage4 tracing]#
[root@stage4 tracing]# cat trace | tail
cat-388  [000] ...1. 583.051438: myr: (do_page_fault+0x16c/0x5f2 <- handle_mm_fault)
cat-388  [000] ...2. 583.057930: handle_mm_fault <-do_page_fault
cat-388  [000] ..... 583.057990: my_direct_func: handle mm fault vma=000000002d9fe19c address=ffffffae9b7000 flags=215
cat-388  [000] ...1. 583.058284: myr: (do_page_fault+0x16c/0x5f2 <- handle_mm_fault)
tail-389 [001] ...2. 583.059062: handle_mm_fault <-do_page_fault
tail-389 [001] ..... 583.059104: my_direct_func: handle mm fault vma=0000000017f3c48e address=aaaaaabebf3000 flags=215
tail-389 [001] ...1. 583.059325: myr: (do_page_fault+0x16c/0x5f2 <- handle_mm_fault)
tail-389 [001] ...2. 583.060371: handle_mm_fault <-do_page_fault
tail-389 [001] ..... 583.060410: my_direct_func: handle mm fault vma=0000000017f3c48e address=aaaaaabebf1000 flags=255
tail-389 [001] ...1. 583.060996: myr: (do_page_fault+0x16c/0x5f2 <- handle_mm_fault)
```
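As referenced above, here is a condensed sketch in the spirit of
samples/ftrace/ftrace-direct.c with a riscv trampoline added.
register_ftrace_direct/unregister_ftrace_direct are the interfaces named
earlier; the trampoline body is a sketch assuming the reduced detour
passes the return address in t0, and the actual sample added by patch 6
may differ in detail:

```c
#include <linux/module.h>
#include <linux/ftrace.h>
#include <linux/sched.h>

void my_direct_func(struct task_struct *p)
{
	trace_printk("waking up %s-%d\n", p->comm, p->pid);
}

extern void my_tramp(void *);

/*
 * With the reduced detour, the call site uses t0 for the return
 * address, so the trampoline must preserve ra and return via t0.
 */
asm (
"	.pushsection	.text, \"ax\", @progbits\n"
"	.type		my_tramp, @function\n"
"	.globl		my_tramp\n"
"my_tramp:\n"
"	addi	sp, sp, -32\n"		/* keep 16-byte stack alignment */
"	sd	a0, 0(sp)\n"
"	sd	t0, 8(sp)\n"
"	sd	ra, 16(sp)\n"
"	call	my_direct_func\n"
"	ld	a0, 0(sp)\n"
"	ld	t0, 8(sp)\n"
"	ld	ra, 16(sp)\n"
"	addi	sp, sp, 32\n"
"	jr	t0\n"			/* back into the traced function */
"	.size		my_tramp, .-my_tramp\n"
"	.popsection\n"
);

static int __init ftrace_direct_init(void)
{
	return register_ftrace_direct((unsigned long)wake_up_process,
				      (unsigned long)my_tramp);
}

static void __exit ftrace_direct_exit(void)
{
	unregister_ftrace_direct((unsigned long)wake_up_process,
				 (unsigned long)my_tramp);
}

module_init(ftrace_direct_init);
module_exit(ftrace_direct_exit);
MODULE_DESCRIPTION("ftrace direct call sample (riscv sketch)");
MODULE_LICENSE("GPL");
```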
Note 1: checkpatch.pl outputs some warnings on this series, like this:

```
WARNING: Prefer using '"%s...", __func__' to using 'my_direct_func2',
this function's name, in a string
111: FILE: samples/ftrace/ftrace-direct-multi-modify.c:48:
+"       call my_direct_func2\n"
```

The reason is that checkpatch depends on the patch context to provide
the function name. In the warning above, my_direct_func2 is declared
some distance from the changed hunk, so its declaration doesn't appear
in the patch context and the warning is emitted.

You may notice that the location of the `my_ip` variable changes in the
2nd patch. I did that to reduce the warnings to some extent, but
silencing all of them would make the patch less readable, so I stopped
there.

[1] https://lpc.events/event/16/contributions/1171/
[2] https://lore.kernel.org/lkml/20221120084230.910152-1-suagrfillet@gmail.com/
[3] https://lore.kernel.org/linux-riscv/20221123142025.1504030-1-suagrfillet@gmail.com/
[4] https://lore.kernel.org/linux-riscv/d7d5730b-ebef-68e5-5046-e763e1ee6164@yadro.com/

Changes in v7:
- Fixup RESTORE_ABI_REGS by removing the PT_T0(sp) overwrite.
- Add FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY.
- Fixup kconfig with HAVE_SAMPLE_FTRACE_DIRECT &
  HAVE_SAMPLE_FTRACE_DIRECT_MULTI.

Changes in v6:
https://lore.kernel.org/linux-riscv/20230107133549.4192639-1-guoren@kernel.org/
- Replace 8 with MCOUNT_INSN_SIZE.
- Replace "REG_L a1, PT_RA(sp)" with "mv a1, ra".
- Add Evgenii Shatokhin's comment.

Changes in v5:
https://lore.kernel.org/linux-riscv/20221208091244.203407-1-guoren@kernel.org/
- Sort Kconfig entries in alphabetical order.

Changes in v4:
https://lore.kernel.org/linux-riscv/20221129033230.255947-1-guoren@kernel.org/
- Include [3] for maintenance. [Song Shuai]

Changes in v3:
https://lore.kernel.org/linux-riscv/20221123153950.2911981-1-guoren@kernel.org/
- Include [2] for maintenance. [Song Shuai]

Changes in v2:
https://lore.kernel.org/linux-riscv/20220921034910.3142465-1-guoren@kernel.org/
- Add Signed-off for preemption fixup.
Changes in v1:
https://lore.kernel.org/linux-riscv/20220916103817.9490-1-guoren@kernel.org/

Andy Chiu (1):
  riscv: ftrace: Fixup panic by disabling preemption

Guo Ren (2):
  riscv: ftrace: Remove wasted nops for !RISCV_ISA_C
  riscv: ftrace: Reduce the detour code size to half

Song Shuai (4):
  riscv: ftrace: Add ftrace_graph_func
  riscv: ftrace: Add DYNAMIC_FTRACE_WITH_DIRECT_CALLS support
  samples: ftrace: Add riscv support for SAMPLE_FTRACE_DIRECT[_MULTI]
  riscv : select FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY

 arch/riscv/Kconfig                          |   6 +-
 arch/riscv/Makefile                         |   6 +-
 arch/riscv/include/asm/ftrace.h             |  71 ++++++--
 arch/riscv/kernel/ftrace.c                  |  91 ++++------
 arch/riscv/kernel/mcount-dyn.S              | 179 +++++++++++++-------
 samples/ftrace/ftrace-direct-modify.c       |  33 ++++
 samples/ftrace/ftrace-direct-multi-modify.c |  37 ++++
 samples/ftrace/ftrace-direct-multi.c        |  22 +++
 samples/ftrace/ftrace-direct-too.c          |  26 +++
 samples/ftrace/ftrace-direct.c              |  22 +++
 10 files changed, 356 insertions(+), 137 deletions(-)