mbox series

[RFC,v2,riscv/for-next,0/5] Enable ftrace with kernel preemption for RISC-V

Message ID 20220913094252.3555240-1-andy.chiu@sifive.com (mailing list archive)

Message

Andy Chiu Sept. 13, 2022, 9:42 a.m. UTC
This patchset removes the dependency of dynamic ftrace on
stop_machine(), and makes it compatible with kernel preemption.
Originally, we ran into stack corruptions, or execution of partially
updated instructions, when starting or stopping ftrace on a fully
preemptible kernel configuration. The reason is that the kernel
periodically calls rcu_momentary_dyntick_idle() on cores waiting for the
code-patching core running in ftrace. Though
rcu_momentary_dyntick_idle() itself is marked as notrace, it calls a
number of traceable functions if the kernel is configured as
preemptible. For example, these are some functions that happen to have a
symbol and have not been marked as notrace on a RISC-V preemptible
kernel compiled with GCC 11:
 - __rcu_report_exp_rnp()
 - rcu_report_exp_cpu_mult()
 - rcu_preempt_deferred_qs()
 - rcu_preempt_need_deferred_qs()
 - rcu_preempt_deferred_qs_irqrestore()

Thus, it is not ideal for us to rely on stop_machine() and hand-marked
"notrace" annotations to perform runtime code patching. To remove this
dependency, we must make code updates appear atomic to the running
cores. This is not straightforward on RISC-V, since it usually uses a
pair of AUIPC + JALR instructions to perform a long jump, and that pair
cannot safely be modified while it may be executed concurrently once
preemption is taken into account. As such, this patchset proposes a way
to make it possible: it embeds a 32-bit relative address into the
instructions of each ftrace prologue and jumps through it indirectly.
This way, the address can be stored and loaded atomically, so the
code-patching core can run simultaneously with the rest of the running
cores.
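For illustration, the patched prologue could be shaped roughly like the
sketch below. This is only a reconstruction of the idea described above,
not the exact instruction sequence from the patches; the register usage
and the trampoline symbol (ftrace_tramp) are placeholders:
------------------
<traced_func>:                 # entry forced to 4-byte alignment
    j       2f                 # tracing disabled: skip the prologue.
                               # Enabling/disabling tracing rewrites only
                               # this single, naturally aligned instruction.
1:  auipc   t1, 0              # t1 = address of label 1
    lw      t2, 16(t1)         # atomically load the 32-bit offset below
    add     t1, t1, t2
    jalr    t0, t1, 0          # enter the trampoline; it must return
                               # past the embedded data word
    .word   ftrace_tramp - 1b  # 32-bit relative address, updated with a
                               # single aligned 4-byte store
2:                             # original function body
------------------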

After applying the patchset, we compiled a preemptible kernel with all
tracers and the ftrace selftests enabled, and booted it on a 2-core QEMU
virt machine. The kernel booted successfully and passed all ftrace test
suites. In addition, we ran a script that randomly picks a tracer every
0-5 seconds. The kernel has sustained over 20K rounds of that test. In
contrast, a preemptible kernel without this patchset would panic within
a few rounds on the same machine.

However, we ran into errors when using the hwlat or irqsoff tracers
together with the cpu-online stressor from stress-ng on a preemptible
kernel. We believe the reason may be that per-cpu workers of the tracers
are queued into an unbound workqueue when a CPU goes offline; fixes for
this will go through the tracing tree.

Additionally, we found that patching tracepoints is unsafe, since the
instructions being patched are not naturally aligned. During code
patching, this may result in two half-word stores, which breaks
atomicity.
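As a minimal illustration of the alignment constraint (our sketch, not
code from the series):
------------------
/*
 * A 4-byte instruction can be replaced with a single store only if its
 * address is naturally aligned; a misaligned patch site degenerates
 * into two 2-byte stores, and another hart may execute the half-updated
 * instruction in between.
 */
static bool insn_patch_is_atomic(unsigned long insn_addr)
{
	return IS_ALIGNED(insn_addr, 4);
}
------------------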

Changes in v2:
 - Enforce alignment of all functions with a compiler workaround.
 - Support 64-bit addressing for ftrace targets if xlen == 64.
 - Initialize ftrace target addresses to avoid calling a bad address in
   a hypothesized case.
 - Use LGPTR instead of SZPTR in mcount-dyn.S, since .align is
   log-scaled.
 - Require the nop instruction of all jump_labels to be naturally
   aligned on 4B.

Andy Chiu (5):
  riscv: align ftrace to 4 Byte boundary and increase ftrace prologue
    size
  riscv: export patch_insn_write
  riscv: ftrace: use indirect jump to work with kernel preemption
  riscv: ftrace: do not use stop_machine to update code
  riscv: align arch_static_branch function

 arch/riscv/Makefile                 |   2 +-
 arch/riscv/include/asm/ftrace.h     |  24 ----
 arch/riscv/include/asm/jump_label.h |   2 +
 arch/riscv/include/asm/patch.h      |   1 +
 arch/riscv/kernel/ftrace.c          | 179 ++++++++++++++++++++--------
 arch/riscv/kernel/mcount-dyn.S      |  69 ++++++++---
 arch/riscv/kernel/patch.c           |   4 +-
 7 files changed, 188 insertions(+), 93 deletions(-)

Comments

Evgenii Shatokhin Feb. 13, 2024, 7:42 p.m. UTC | #1
Hi,

On 13.09.2022 12:42, Andy Chiu wrote:
> This patchset removes the dependency of dynamic ftrace on
> stop_machine(), and makes it compatible with kernel preemption.
> [...]

First of all, thank you for working on making dynamic Ftrace robust in 
preemptible kernels on RISC-V.
It is an important use case but, for now, dynamic Ftrace and related 
tracers cannot be safely used with such kernels.

Are there any updates on this series?
It needs a rebase, of course, but it looks doable.

If I understand the discussion correctly, the only blocker was that 
using "-falign-functions" was not enough to properly align cold 
functions and "-fno-guess-branch-probability" would likely have a 
performance cost.

It seems GCC developers have recently provided a workaround for that
(https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c24).

"-fmin-function-alignment" should help, but I do not know which GCC
versions have that patch already. In the meantime, one could probably
check whether "-fmin-function-alignment" is supported by the compiler
and use it if it is.
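For instance, a hedged Makefile sketch (assuming the usual Kbuild
cc-option helper; CC_FLAGS_FN_ALIGN is a made-up variable name, and the
-fpatchable-function-entry values mirror the existing CC_FLAGS_FTRACE
setup):
------------------
# Prefer -fmin-function-alignment=4 when the compiler supports it,
# fall back to -falign-functions=4 otherwise.
CC_FLAGS_FN_ALIGN := $(call cc-option,-fmin-function-alignment=4,-falign-functions=4)

ifeq ($(CONFIG_RISCV_ISA_C),y)
	CC_FLAGS_FTRACE := -fpatchable-function-entry=12 $(CC_FLAGS_FN_ALIGN)
else
	CC_FLAGS_FTRACE := -fpatchable-function-entry=6 $(CC_FLAGS_FN_ALIGN)
endif
------------------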

Thoughts?

Regards,
Evgenii
Andy Chiu Feb. 21, 2024, 5:27 a.m. UTC | #2
On Wed, Feb 14, 2024 at 3:42 AM Evgenii Shatokhin <e.shatokhin@yadro.com> wrote:
>
> Hi,
>
> On 13.09.2022 12:42, Andy Chiu wrote:
> > This patchset removes the dependency of dynamic ftrace on
> > stop_machine(), and makes it compatible with kernel preemption.
> > [...]
>
> First of all, thank you for working on making dynamic Ftrace robust in
> preemptible kernels on RISC-V.
> It is an important use case but, for now, dynamic Ftrace and related
> tracers cannot be safely used with such kernels.
>
> Are there any updates on this series?
> It needs a rebase, of course, but it looks doable.
>
> If I understand the discussion correctly, the only blocker was that
> using "-falign-functions" was not enough to properly align cold
> functions and "-fno-guess-branch-probability" would likely have a
> performance cost.
>
> It seems GCC developers have recently provided a workaround for that
> (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326,
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c24).
>
> "-fmin-function-alignment" should help, but I do not know which GCC
> versions have that patch already. In the meantime, one could probably
> check whether "-fmin-function-alignment" is supported by the compiler
> and use it if it is.
>
> Thoughts?

Hi Evgenii,

Thanks for the update. Indeed, it is essential for this patchset that
the toolchain provide forced alignment. We can test the flag in the
Makefile to sort out whether the toolchain supports it. Meanwhile, I had
figured out a way to make this work on any 2-byte-aligned address, but
haven't implemented it yet. Basically, it would require more patching
space so that we can do the alignment in software. I would opt for the
special toolchain flag if the toolchain supports it.

Let me take some time to look into it and get back to you soon.

>
> Regards,
> Evgenii

Regards,
Andy
Evgenii Shatokhin Feb. 21, 2024, 4:55 p.m. UTC | #3
On 21.02.2024 08:27, Andy Chiu wrote:
> 
> On Wed, Feb 14, 2024 at 3:42 AM Evgenii Shatokhin <e.shatokhin@yadro.com> wrote:
>>
>> Hi,
>>
>> On 13.09.2022 12:42, Andy Chiu wrote:
>>> This patchset removes the dependency of dynamic ftrace on
>>> stop_machine(), and makes it compatible with kernel preemption.
>>> [...]
>>
>> [...]
>>
>> "-fmin-function-alignment" should help, but I do not know which GCC
>> versions have that patch already. In the meantime, one could probably
>> check whether "-fmin-function-alignment" is supported by the compiler
>> and use it if it is.
>>
>> Thoughts?
> 
> Hi Evgenii,
> 
> Thanks for the update. Indeed, it is essential for this patchset that
> the toolchain provide forced alignment. We can test the flag in the
> Makefile to sort out whether the toolchain supports it. Meanwhile, I
> had figured out a way to make this work on any 2-byte-aligned address,
> but haven't implemented it yet. Basically, it would require more
> patching space so that we can do the alignment in software. I would
> opt for the special toolchain flag if the toolchain supports it.
>
> Let me take some time to look into it and get back to you soon.

Thank you! Looking forward to it.

In case it helps, here is what I have checked so far.

1.
I added the patch
https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326
to the current revision of GCC 13.2.0 from the RISC-V toolchain.

Rebased your patchset on top of Linux 6.8-rc4 (mostly context changes,
SYM_FUNC_START/SYM_FUNC_END for asm symbols, etc.).

Reverted 8547649981e6 ("riscv: ftrace: Fixup panic by disabling 
preemption").

Switched from -falign-functions=4 to -fmin-function-alignment=4:
------------------
diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index b33b787c8b07..dcd0adeebaae 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -15,9 +15,9 @@ ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
  	LDFLAGS_vmlinux += --no-relax
  	KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
  ifeq ($(CONFIG_RISCV_ISA_C),y)
-	CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -falign-functions=4
+	CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -fmin-function-alignment=4
  else
-	CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -falign-functions=4
+	CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -fmin-function-alignment=4
  endif
  endif

------------------

As far as I can see from objdump, the functions that were not aligned at
a 4-byte boundary with -falign-functions=4 are now aligned correctly
with -fmin-function-alignment=4.
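(For reference, one way to double-check this, assuming a riscv64 cross
toolchain and GNU awk; this is our example, not part of the series:)
------------------
# List function symbols in vmlinux whose address is not 4-byte aligned.
riscv64-unknown-linux-gnu-nm vmlinux \
    | gawk '$2 ~ /^[tT]$/ && strtonum("0x" $1) % 4 != 0 { print }'
------------------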

2.
I tried the kernel in a QEMU VM with 2 CPUs and "-machine virt".

The boot-time tests for Ftrace passed, except the tests for
function_graph. I described the failure and the possible fix here:
https://lore.kernel.org/all/dcc5976d-635a-4710-92df-94a99653314e@yadro.com/

3.
There were also boot-time warnings about "RCU not on for:
arch_cpu_idle+0x0/0x2c". These are probably not related to your
patchset, but rather to the fact that Ftrace is enabled in a preemptible
kernel where RCU does different things.

As a workaround, I disabled tracing of arch_cpu_idle() for now:
------------------
diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
index 92922dbd5b5c..6abeecbfc51d 100644
--- a/arch/riscv/kernel/process.c
+++ b/arch/riscv/kernel/process.c
@@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);

  extern asmlinkage void ret_from_fork(void);

-void arch_cpu_idle(void)
+void noinstr arch_cpu_idle(void)
  {
  	cpu_do_idle();
  }

------------------

4.
Stress-testing revealed an issue though, which I do not understand yet.

Probably similar to what you did earlier, I ran a script that switched
the current tracer to "function", "function_graph", "nop", or "blk"
every 1-5 seconds. In another shell, "stress-ng --hrtimers 1" was
running.
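(The original script was not posted; a rough reconstruction, assuming
tracefs is mounted at /sys/kernel/tracing, could look like this:)
------------------
#!/bin/bash
# Cycle through the tracers, staying on each one for 1-5 seconds.
cd /sys/kernel/tracing || exit 1
while true; do
    for t in function function_graph nop blk; do
        echo "$t" > current_tracer
        sleep $(( RANDOM % 5 + 1 ))
    done
done
------------------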

The kernel usually crashed within a few minutes, in seemingly random 
locations, but often in one of two ways:

(a) Invalid instruction, because the address of the ftrace_caller
function was somehow written to the body of the traced function rather
than just to the Ftrace prologue.

In the following example, the crash happened at 0xffffffff800d3398.
"b0 d7" is actually not part of the code here, but rather the lower
bytes of 0xffffffff8000d7b0, the address of ftrace_caller() in this
kernel.

(gdb) disas /r 0xffffffff800d3382,+0x20
Dump of assembler code from 0xffffffff800d3382 to 0xffffffff800d33a2:
...
   0xffffffff800d3394 <clockevents_program_event+144>:  ba 87   mv      a5,a4
   0xffffffff800d3396 <clockevents_program_event+146>:  c1 bf   j       0xffffffff800d3366 <clockevents_program_event+98>
   0xffffffff800d3398 <clockevents_program_event+148>:  b0 d7   sw      a2,104(a5) // 0xffffffff8000d7b0, the address of ftrace_caller().
   0xffffffff800d339a <clockevents_program_event+150>:  00 80   .2byte  0x8000
   0xffffffff800d339c <clockevents_program_event+152>:  ff ff   .2byte  0xffff
   0xffffffff800d339e <clockevents_program_event+154>:  ff ff   .2byte  0xffff
   0xffffffff800d33a0 <clockevents_program_event+156>:  d5 bf   j       0xffffffff800d3394 <clockevents_program_event+144>

The backtrace usually contains one or more occurrences of 
return_to_handler() in this case.

[  260.520394] [<ffffffff800d3398>] clockevents_program_event+0xac/0x100
[  260.521195] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
[  260.521843] [<ffffffff800c50ba>] hrtimer_interrupt+0x122/0x20c
[  260.522492] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
[  260.523132] [<ffffffff8009785e>] handle_percpu_devid_irq+0x9e/0x1ec
[  260.523788] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
[  260.524437] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
[  260.525080] [<ffffffff80a8acfa>] handle_riscv_irq+0x4a/0x74
[  260.525726] [<ffffffff80a97b9a>] call_on_irq_stack+0x32/0x40
----------------------

(b) Jump to an invalid location, e.g. to the middle of a valid 4-byte
instruction. %ra usually points right after the last instruction,
"jalr a2", in return_to_handler() in such cases, so the jump was likely
made from there.

The problem is reproducible, although I have not found what causes it yet.

Any help is appreciated, of course.

> 
>>
>> Regards,
>> Evgenii
> 
> Regards,
> Andy
Alexandre Ghiti March 6, 2024, 8:57 p.m. UTC | #4
Hi Evgenii,

On 21/02/2024 17:55, Evgenii Shatokhin wrote:
> [...]
>
> 3.
> There were also boot-time warnings about "RCU not on for:
> arch_cpu_idle+0x0/0x2c". These are probably not related to your
> patchset, but rather to the fact that Ftrace is enabled in a
> preemptible kernel where RCU does different things.
>
> As a workaround, I disabled tracing of arch_cpu_idle() for now:
> ------------------
> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
> index 92922dbd5b5c..6abeecbfc51d 100644
> --- a/arch/riscv/kernel/process.c
> +++ b/arch/riscv/kernel/process.c
> @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);
>
>  extern asmlinkage void ret_from_fork(void);
>
> -void arch_cpu_idle(void)
> +void noinstr arch_cpu_idle(void)
>  {
>      cpu_do_idle();
>  }


I came up with the same fix for this, based on a similar fix for s390. I
have a patch ready and will send it soon since, to me, it is a fix, not
a workaround.

Thanks,

Alex


Evgenii Shatokhin March 7, 2024, 8:35 a.m. UTC | #5
Hi Alexandre,

On 06.03.2024 23:57, Alexandre Ghiti wrote:
> Hi Evgenii,
> 
> On 21/02/2024 17:55, Evgenii Shatokhin wrote:
>> [...]
>>
>> 3.
>> There were also boot-time warnings about "RCU not on for:
>> arch_cpu_idle+0x0/0x2c". These are probably not related to your
>> patchset, but rather to the fact that Ftrace is enabled in a
>> preemptible kernel where RCU does different things.
>>
>> As a workaround, I disabled tracing of arch_cpu_idle() for now:
>> ------------------
>> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
>> index 92922dbd5b5c..6abeecbfc51d 100644
>> --- a/arch/riscv/kernel/process.c
>> +++ b/arch/riscv/kernel/process.c
>> @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);
>>
>>  extern asmlinkage void ret_from_fork(void);
>>
>> -void arch_cpu_idle(void)
>> +void noinstr arch_cpu_idle(void)
>>  {
>>      cpu_do_idle();
>>  }
> 
> 
> I came up with the same fix for this, based on a similar fix for s390.
> I have a patch ready and will send it soon since, to me, it is a fix,
> not a workaround.
> 
> Thanks,
> 
> Alex

Great! Thank you. That is very good news.

By the way, have you tried switching dynamic tracers like "function", 
"function_graph", etc. while the system is under pressure, on a kernel 
with this patchset?

I am using 'stress-ng --hrtimers 1' and memory corruption still happens 
within a few minutes each time. I described the issue earlier.

It seems as if the address of ftrace_caller is sometimes written to a
wrong location when enabling the "function" or "function_graph" tracer.
Perhaps a barrier is missing somewhere.
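If it helps the investigation, a hypothetical debug guard around
patch_insn_write() (the helper exported by patch 2 of this series) could
at least rule out misaligned, torn stores while debugging. This is only
our sketch; the wrapper name and the check are assumptions:
------------------
/*
 * Hypothetical debug aid: warn if a 4-byte instruction patch is about
 * to be written to a misaligned address, to rule out torn stores as
 * the cause of the corruption observed above.
 */
static int patch_insn_write_checked(void *addr, const void *insn, size_t len)
{
	if (WARN_ON(len == 4 && !IS_ALIGNED((unsigned long)addr, 4)))
		return -EINVAL;
	return patch_insn_write(addr, insn, len);
}
------------------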

Regards,
Evgenii
Andy Chiu March 7, 2024, 12:27 p.m. UTC | #6
Hi Alex,

On Thu, Mar 7, 2024 at 4:57 AM Alexandre Ghiti <alex@ghiti.fr> wrote:
> [...]
>
> > 3.
> > There were also boottime warnings about "RCU not on for:
> > arch_cpu_idle+0x0/0x2c". These are probably not related to your
> > patchset, but rather to the fact that Ftrace is enabled in a
> > preemptible kernel where RCU does different things.
> >
> > As a workaround, I disabled tracing of arch_cpu_idle() for now:
> > ------------------
> > diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
> > index 92922dbd5b5c..6abeecbfc51d 100644
> > --- a/arch/riscv/kernel/process.c
> > +++ b/arch/riscv/kernel/process.c
> > @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);
> >
> >  extern asmlinkage void ret_from_fork(void);
> >
> > -void arch_cpu_idle(void)
> > +void noinstr arch_cpu_idle(void)
> >  {
> >      cpu_do_idle();
> >  }
>
>
> I came up with the same fix for this, based on a similar fix for s390. I
> have a patch ready and will send it soon since to me, it is a fix, not a
> workaround.

Just making sure we aren't duplicating work. Are you also working on
getting rid of stop_machine() while patching ftrace entries, or only on
a patch to fix the issue in arch_cpu_idle()? I was just about to
restart my patchset for the first purpose. In case I missed anything,
could you point me to the patchset if it's already on the ML?

>
> Thanks,
>
> Alex
>
>
> >
> > ------------------
> >
> > 4.
> > Stress-testing revealed an issue though, which I do not understand yet.
> >
> > Probably similar to what you did earlier, I ran a script that switched
> > the current tracer to "function", "function_graph", "nop", "blk" each
> > 1-5 seconds. In another shell, "stress-ng --hrtimers 1" was running.
> >
> > The kernel usually crashed within a few minutes, in seemingly random
> > locations, but often in one of two ways:
> >
> > (a) Invalid instruction, because the address of ftrace_caller function
> > was somehow written to the body of the traced function rather than
> > just to the Ftrace prologue.
> >
> > In the following example, the crash happened at 0xffffffff800d3398.
> > "b0 d7" is actually not part of the code here, but rather the lower
> > bytes of 0xffffffff8000d7b0, the address of ftrace_caller() in this
> > kernel.
> >
> > (gdb) disas /r 0xffffffff800d3382,+0x20
> > Dump of assembler code from 0xffffffff800d3382 to 0xffffffff800d33a2:
> > ...
> >    0xffffffff800d3394 <clockevents_program_event+144>:  ba 87   mv a5,a4
> >    0xffffffff800d3396 <clockevents_program_event+146>:  c1 bf   j
> > 0xffffffff800d3366 <clockevents_program_event+98>
> >    0xffffffff800d3398 <clockevents_program_event+148>:  b0 d7   sw
> > a2,104(a5) // 0xffffffff8000d7b0, the address of ftrace_caller().
> >    0xffffffff800d339a <clockevents_program_event+150>:  00 80   .2byte
> > 0x8000
> >    0xffffffff800d339c <clockevents_program_event+152>:  ff ff   .2byte
> > 0xffff
> >    0xffffffff800d339e <clockevents_program_event+154>:  ff ff   .2byte
> > 0xffff
> >    0xffffffff800d33a0 <clockevents_program_event+156>:  d5 bf   j
> > 0xffffffff800d3394 <clockevents_program_event+144>
> >
> > The backtrace usually contains one or more occurrences of
> > return_to_handler() in this case.
> >
> > [  260.520394] [<ffffffff800d3398>] clockevents_program_event+0xac/0x100
> > [  260.521195] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> > [  260.521843] [<ffffffff800c50ba>] hrtimer_interrupt+0x122/0x20c
> > [  260.522492] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> > [  260.523132] [<ffffffff8009785e>] handle_percpu_devid_irq+0x9e/0x1ec
> > [  260.523788] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> > [  260.524437] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> > [  260.525080] [<ffffffff80a8acfa>] handle_riscv_irq+0x4a/0x74
> > [  260.525726] [<ffffffff80a97b9a>] call_on_irq_stack+0x32/0x40
> > ----------------------
> >
> > (b) Jump to an invalid location, e.g. to the middle of a valid 4-byte
> > instruction. %ra usually points right after the last instruction,
> > "jalr   a2", in return_to_handler() in such cases, so the jump was
> > likely made from there.
> >
> > The problem is reproducible, although I have not found what causes it
> > yet.
> >
> > Any help is appreciated, of course.
> >
> >>
> >>>
> >>> Regards,
> >>> Evgenii
> >>
> >> Regards,
> >> Andy
> >
> >

Thanks,
Andy
Alexandre Ghiti March 7, 2024, 1:21 p.m. UTC | #7
Hi Andy,

On 07/03/2024 13:27, Andy Chiu wrote:
> Hi Alex,
>
> On Thu, Mar 7, 2024 at 4:57 AM Alexandre Ghiti <alex@ghiti.fr> wrote:
>> Hi Evgenii,
>>
>> On 21/02/2024 17:55, Evgenii Shatokhin wrote:
>>> On 21.02.2024 08:27, Andy Chiu wrote:
>>>>
>>>> On Wed, Feb 14, 2024 at 3:42 AM Evgenii Shatokhin
>>>> <e.shatokhin@yadro.com> wrote:
>>>>> Hi,
>>>>>
>>>>> On 13.09.2022 12:42, Andy Chiu wrote:
>>>>>> This patch removes dependency of dynamic ftrace from calling
>>>>>> stop_machine(), and makes it compatiable with kernel preemption.
>>>>>> Originally, we ran into stack corruptions, or execution of partially
>>>>>> updated instructions when starting or stopping ftrace on a fully
>>>>>> preemptible kernel configuration. The reason is that kernel
>>>>>> periodically
>>>>>> calls rcu_momentary_dyntick_idle() on cores waiting for the
>>>>>> code-patching
>>>>>> core running in ftrace. Though rcu_momentary_dyntick_idle() itself is
>>>>>> marked as notrace, it would call a bunch of tracable functions if we
>>>>>> configured the kernel as preemptible. For example, these are some
>>>>>> functions
>>>>>> that happened to have a symbol and have not been marked as notrace
>>>>>> on a
>>>>>> RISC-V preemptible kernel compiled with GCC-11:
>>>>>>     - __rcu_report_exp_rnp()
>>>>>>     - rcu_report_exp_cpu_mult()
>>>>>>     - rcu_preempt_deferred_qs()
>>>>>>     - rcu_preempt_need_deferred_qs()
>>>>>>     - rcu_preempt_deferred_qs_irqrestore()
>>>>>>
>>>>>> Thus, this make it not ideal for us to rely on stop_machine() and
>>>>>> handly marked "notrace"s to perform runtime code patching. To remove
>>>>>> such dependency, we must make updates of code seemed atomic on running
>>>>>> cores. This might not be obvious for RISC-V since it usaually uses
>>>>>> a pair
>>>>>> of AUIPC + JALR to perform a long jump, which cannot be modified and
>>>>>> executed concurrently if we consider preemptions. As such, this patch
>>>>>> proposed a way to make it possible. It embeds a 32-bit rel-address
>>>>>> data
>>>>>> into instructions of each ftrace prologue and jumps indirectly. In
>>>>>> this
>>>>>> way, we could store and load the address atomically so that the code
>>>>>> patching core could run simutaneously with the rest of running cores.
>>>>>>
>>>>>> After applying the patchset, we compiled a preemptible kernel with all
>>>>>> tracers and ftrace-selftest enabled, and booted it on a 2-core QEMU
>>>>>> virt
>>>>>> machine. The kernel could boot up successfully, passing all ftrace
>>>>>> testsuits. Besides, we ran a script that randomly pick a tracer on
>>>>>> every
>>>>>> 0~5 seconds. The kernel has sustained over 20K rounds of the test. In
>>>>>> contrast, a preemptible kernel without our patch would panic in few
>>>>>> rounds on the same machine.
>>>>>>
>>>>>> Though we ran into errors when using hwlat or irqsoff tracers together
>>>>>> with cpu-online stressor from stress-ng on a preemptible kernel. We
>>>>>> believe the reason may be that  percpu workers of the tracers are
>>>>>> being
>>>>>> queued into unbounded workqueue when cpu get offlined and patches
>>>>>> will go
>>>>>> through tracing tree.
>>>>>>
>>>>>> Additionally, we found patching of tracepoints unsafe since the
>>>>>> instructions being patched are not naturally aligned. This may
>>>>>> result in
>>>>>> 2 half-word stores, which breaks atomicity, during the code patching.
>>>>>>
>>>>>> changes in patch v2:
>>>>>>     - Enforce alignments on all functions with a compiler workaround.
>>>>>>     - Support 64bit addressing for ftrace targets if xlen == 64
>>>>>>     - Initialize ftrace target addresses to avoid calling bad
>>>>>> address in a
>>>>>>       hypothesized case.
>>>>>>     - Use LGPTR instead of SZPTR since .align is log-scaled for
>>>>>>       mcount-dyn.S
>>>>>>     - Require the nop instruction of all jump_labels aligns
>>>>>> naturally on
>>>>>>       4B.
>>>>>>
>>>>>> Andy Chiu (5):
>>>>>>      riscv: align ftrace to 4 Byte boundary and increase ftrace
>>>>>> prologue
>>>>>>        size
>>>>>>      riscv: export patch_insn_write
>>>>>>      riscv: ftrace: use indirect jump to work with kernel preemption
>>>>>>      riscv: ftrace: do not use stop_machine to update code
>>>>>>      riscv: align arch_static_branch function
>>>>>>
>>>>>>     arch/riscv/Makefile                 |   2 +-
>>>>>>     arch/riscv/include/asm/ftrace.h     |  24 ----
>>>>>>     arch/riscv/include/asm/jump_label.h |   2 +
>>>>>>     arch/riscv/include/asm/patch.h      |   1 +
>>>>>>     arch/riscv/kernel/ftrace.c          | 179
>>>>>> ++++++++++++++++++++--------
>>>>>>     arch/riscv/kernel/mcount-dyn.S      |  69 ++++++++---
>>>>>>     arch/riscv/kernel/patch.c           |   4 +-
>>>>>>     7 files changed, 188 insertions(+), 93 deletions(-)
>>>>>>
>>>>> First of all, thank you for working on making dynamic Ftrace robust in
>>>>> preemptible kernels on RISC-V.
>>>>> It is an important use case but, for now, dynamic Ftrace and related
>>>>> tracers cannot be safely used with such kernels.
>>>>>
>>>>> Are there any updates on this series?
>>>>> It needs a rebase, of course, but it looks doable.
>>>>>
>>>>> If I understand the discussion correctly, the only blocker was that
>>>>> using "-falign-functions" was not enough to properly align cold
>>>>> functions and "-fno-guess-branch-probability" would likely have a
>>>>> performance cost.
>>>>>
>>>>> It seems, GCC developers have recently provided a workaround for that
>>>>> (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326,
>>>>>
>>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c24).
>>>>>
>>>>> "-fmin-function-alignment" should help but, I do not know, which GCC
>>>>> versions have got that patch already. In the meantime, one could
>>>>> probably check if "-fmin-function-alignment" is supported by the
>>>>> compiler and use it, if it is.
>>>>>
>>>>> Thoughts?
>>>> Hi Evgenii,
>>>>
>>>> Thanks for the update. Indeed, it is essential to this patch for
>>>> toolchain to provide forced alignment. We can test this flag in the
>>>> Makefile to sort out if toolchain supports it or not. Meanwhile, I had
>>>> figured out a way for this to work on any 2-B align addresses but
>>>> hadn't implemented it out yet. Basically it would require more
>>>> patching space for us to do software alignment. I would opt for a
>>>> special toolchain flag if the toolchain just supports it.
>>>>
>>>> Let me take some time to look and get back to you soon.
>>> Thank you! Looking forward to it.
>>>
>>> In case it helps, here is what I have checked so far.
>>>
>>> 1.
>>> I added the patch
>>> https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326
>>> to the current revision of GCC 13.2.0 from RISC-V toolchain.
>>>
>>> Rebased your patchset on top of Linux 6.8-rc4 (mostly - context
>>> changes, SYM_FUNC_START/SYM_FUNC_END for asm symbols, etc.).
>>>
>>> Reverted 8547649981e6 ("riscv: ftrace: Fixup panic by disabling
>>> preemption").
>>>
>>> Switched from -falign-functions=4 to -fmin-function-alignment=4:
>>> ------------------
>>> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
>>> index b33b787c8b07..dcd0adeebaae 100644
>>> --- a/arch/riscv/Makefile
>>> +++ b/arch/riscv/Makefile
>>> @@ -15,9 +15,9 @@ ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
>>>       LDFLAGS_vmlinux += --no-relax
>>>       KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
>>>   ifeq ($(CONFIG_RISCV_ISA_C),y)
>>> -    CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -falign-functions=4
>>> +    CC_FLAGS_FTRACE := -fpatchable-function-entry=12
>>> -fmin-function-alignment=4
>>>   else
>>> -    CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -falign-functions=4
>>> +    CC_FLAGS_FTRACE := -fpatchable-function-entry=6
>>> -fmin-function-alignment=4
>>>   endif
>>>   endif
>>>
>>> ------------------
>>>
>>> As far as I can see from objdump, the functions that were not aligned
>>> at 4-byte boundary with -falign-functions=4, are now aligned correctly
>>> with -fmin-function-alignment=4.
>>>
>>> 2.
>>> I tried the kernel in a QEMU VM with 2 CPUs and "-machine virt".
>>>
>>> The boottime tests for Ftrace had passed, except the tests for
>>> function_graph. I described the failure and the possible fix here:
>>> https://lore.kernel.org/all/dcc5976d-635a-4710-92df-94a99653314e@yadro.com/
>>>
>>>
>>> 3.
>>> There were also boottime warnings about "RCU not on for:
>>> arch_cpu_idle+0x0/0x2c". These are probably not related to your
>>> patchset, but rather to the fact that Ftrace is enabled in a
>>> preemptible kernel where RCU does different things.
>>>
>>> As a workaround, I disabled tracing of arch_cpu_idle() for now:
>>> ------------------
>>> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
>>> index 92922dbd5b5c..6abeecbfc51d 100644
>>> --- a/arch/riscv/kernel/process.c
>>> +++ b/arch/riscv/kernel/process.c
>>> @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);
>>>
>>>   extern asmlinkage void ret_from_fork(void);
>>>
>>> -void arch_cpu_idle(void)
>>> +void noinstr arch_cpu_idle(void)
>>>   {
>>>       cpu_do_idle();
>>>   }
>>
>> I came up with the same fix for this, based on a similar fix for s390. I
>> have a patch ready and will send it soon since to me, it is a fix, not a
>> workaround.
> Just making sure we aren't duplicating works. Are you also working on
> getting rid of stop_machine() while patching ftrace entries? Or to
> provide a patch to fix the issue in arch_cpu_idle()? I was just about
> to restart my patchset for the first purpose. In case if I missed
> anything, could you help pointing me to the patchset if it's already
> on the ML?


I'm currently trying to fix ftrace because I noticed that the ftrace
kselftests triggered a lot of panics and warnings. For now I have only
fixed this one ^.

But TBH, I have started thinking about the issue your patch is trying to
deal with. IIUC, you're trying to avoid traps (or silent errors) that
could happen because of concurrent accesses while an auipc/jalr pair is
being patched.

I'm wondering if, instead, we could actually handle the potential
traps: before storing the auipc + jalr pair, we could use a
well-identified trapping instruction that could be recognized in the
trap handler as a legitimate trap. For example:


auipc  -->  auipc  -->  XXXX  -->  XXXX  -->  auipc
jalr        XXXX        XXXX       jalr       jalr


If a core traps on a XXXX instruction, we know this address is being
patched, so we can simply return; by the time we retry, the patching
will probably be over. We could also identify a half-patched word
instruction (I mean one with only XX written).
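
To make the idea concrete, here is a minimal sketch of what the
trap-handler side could look like. Everything below is hypothetical:
the helper name and the way the patched range is published are made up,
not taken from any existing series.

------------------
/*
 * Hypothetical fragment of the illegal-instruction trap path. It
 * assumes the patching core publishes the address range it is
 * rewriting before storing the first XXXX word.
 */
static unsigned long patch_start, patch_end;	/* set by the patcher */

static bool fixup_ftrace_patching_trap(struct pt_regs *regs)
{
	unsigned long epc = regs->epc;

	if (epc < READ_ONCE(patch_start) || epc >= READ_ONCE(patch_end))
		return false;	/* a genuine illegal instruction */

	/*
	 * Legitimate transient trap: the instruction at epc is being
	 * rewritten. Returning without advancing epc makes the hart
	 * retry it; by then the patcher has probably stored the final
	 * auipc/jalr pair.
	 */
	return true;
}
------------------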

But please let me know if that's completely stupid and I did not
understand the problem. Since my patchset to support Svvptc, I have been
wondering whether it is not more performant to actually take very
unlikely traps instead of trying to avoid them.

Thanks,

Alex


>> Thanks,
>>
>> Alex
>>
>>
>>> ------------------
>>>
>>> 4.
>>> Stress-testing revealed an issue though, which I do not understand yet.
>>>
>>> Probably similar to what you did earlier, I ran a script that switched
>>> the current tracer to "function", "function_graph", "nop", "blk" each
>>> 1-5 seconds. In another shell, "stress-ng --hrtimers 1" was running.
>>>
>>> The kernel usually crashed within a few minutes, in seemingly random
>>> locations, but often in one of two ways:
>>>
>>> (a) Invalid instruction, because the address of ftrace_caller function
>>> was somehow written to the body of the traced function rather than
>>> just to the Ftrace prologue.
>>>
>>> In the following example, the crash happened at 0xffffffff800d3398.
>>> "b0 d7" is actually not part of the code here, but rather the lower
>>> bytes of 0xffffffff8000d7b0, the address of ftrace_caller() in this
>>> kernel.
>>>
>>> (gdb) disas /r 0xffffffff800d3382,+0x20
>>> Dump of assembler code from 0xffffffff800d3382 to 0xffffffff800d33a2:
>>> ...
>>>     0xffffffff800d3394 <clockevents_program_event+144>:  ba 87   mv a5,a4
>>>     0xffffffff800d3396 <clockevents_program_event+146>:  c1 bf   j
>>> 0xffffffff800d3366 <clockevents_program_event+98>
>>>     0xffffffff800d3398 <clockevents_program_event+148>:  b0 d7   sw
>>> a2,104(a5) // 0xffffffff8000d7b0, the address of ftrace_caller().
>>>     0xffffffff800d339a <clockevents_program_event+150>:  00 80   .2byte
>>> 0x8000
>>>     0xffffffff800d339c <clockevents_program_event+152>:  ff ff   .2byte
>>> 0xffff
>>>     0xffffffff800d339e <clockevents_program_event+154>:  ff ff   .2byte
>>> 0xffff
>>>     0xffffffff800d33a0 <clockevents_program_event+156>:  d5 bf   j
>>> 0xffffffff800d3394 <clockevents_program_event+144>
>>>
>>> The backtrace usually contains one or more occurrences of
>>> return_to_handler() in this case.
>>>
>>> [  260.520394] [<ffffffff800d3398>] clockevents_program_event+0xac/0x100
>>> [  260.521195] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>> [  260.521843] [<ffffffff800c50ba>] hrtimer_interrupt+0x122/0x20c
>>> [  260.522492] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>> [  260.523132] [<ffffffff8009785e>] handle_percpu_devid_irq+0x9e/0x1ec
>>> [  260.523788] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>> [  260.524437] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>> [  260.525080] [<ffffffff80a8acfa>] handle_riscv_irq+0x4a/0x74
>>> [  260.525726] [<ffffffff80a97b9a>] call_on_irq_stack+0x32/0x40
>>> ----------------------
>>>
>>> (b) Jump to an invalid location, e.g. to the middle of a valid 4-byte
>>> instruction. %ra usually points right after the last instruction,
>>> "jalr   a2", in return_to_handler() in such cases, so the jump was
>>> likely made from there.
>>>
>>> The problem is reproducible, although I have not found what causes it
>>> yet.
>>>
>>> Any help is appreciated, of course.
>>>
>>>>> Regards,
>>>>> Evgenii
>>>> Regards,
>>>> Andy
>>>
> Thanks,
> Andy
>
Samuel Holland March 7, 2024, 3:57 p.m. UTC | #8
Hi Alex,

On 2024-03-07 7:21 AM, Alexandre Ghiti wrote:
> But TBH, I have started thinking about the issue your patch is trying to
> deal with. IIUC, you're trying to avoid traps (or silent errors) that
> could happen because of concurrent accesses while an auipc/jalr pair is
> being patched.
> 
> I'm wondering if, instead, we could actually handle the potential traps:
> before storing the auipc + jalr pair, we could use a well-identified trapping
> instruction that could be recognized in the trap handler as a legitimate trap.
> For example:
> 
> 
> auipc  -->  auipc  -->  XXXX  -->  XXXX  -->  auipc
> jalr        XXXX        XXXX       jalr       jalr
> 
> 
> If a core traps on a XXXX instruction, we know this address is being
> patched, so we can simply return; by the time we retry, the patching
> will probably be over. We could also identify a half-patched word
> instruction (I mean one with only XX written).

Unfortunately this does not work without some fence.i in the middle. The
processor is free to fetch any instruction that has been written to a location
since the last fence.i instruction. So it would be perfectly valid to fetch the
old auipc and new jalr or vice versa and not trap. This would happen if, for
example, the two instructions were in different cache lines, and only one of the
cache lines got evicted and refilled.

But sending an IPI to run the fence.i probably negates the performance benefit.
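
For reference, here is a sketch of what the writer has to do to
guarantee that all harts fetch the new instructions. This is only an
illustration of where the cost comes from, not code from the series;
patch_insn_write() and flush_icache_all() are the riscv kernel helpers
as I understand them.

------------------
/*
 * Illustration: after the new bytes are stored, another hart may
 * still fetch the stale ones until it executes fence.i, so the
 * writer has to broadcast the fence.
 */
static void patch_and_sync(void *addr, const void *insn, size_t len)
{
	patch_insn_write(addr, insn, len);	/* store the new bytes */

	/*
	 * fence.i on the local hart, plus an SBI call or IPIs asking
	 * every other hart to execute fence.i -- the expensive part.
	 */
	flush_icache_all();
}
------------------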

Maybe there is some creative way to overcome this.

> But please let me know if that's completely stupid and I did not
> understand the problem. Since my patchset to support Svvptc, I have been
> wondering whether it is not more performant to actually take very
> unlikely traps instead of trying to avoid them.

I agree in general it is a good idea to optimize the hot path like this.

Regards,
Samuel
Andy Chiu March 11, 2024, 2:24 p.m. UTC | #9
On Thu, Mar 7, 2024 at 11:57 PM Samuel Holland
<samuel.holland@sifive.com> wrote:
>
> Hi Alex,
>
> On 2024-03-07 7:21 AM, Alexandre Ghiti wrote:
> > But TBH, I have started thinking about the issue your patch is trying
> > to deal with. IIUC, you're trying to avoid traps (or silent errors)
> > that could happen because of concurrent accesses while an auipc/jalr
> > pair is being patched.
> >
> > I'm wondering if, instead, we could actually handle the potential traps:
> > before storing the auipc + jalr pair, we could use a well-identified trapping
> > instruction that could be recognized in the trap handler as a legitimate trap.
> > For example:
> >
> >
> > auipc  -->  auipc  -->  XXXX  -->  XXXX  -->  auipc
> > jalr        XXXX        XXXX       jalr       jalr
> >
> >
> > If a core traps on a XXXX instruction, we know this address is being
> > patched, so we can simply return; by the time we retry, the patching
> > will probably be over. We could also identify a half-patched word
> > instruction (I mean one with only XX written).
>
> Unfortunately this does not work without some fence.i in the middle. The
> processor is free to fetch any instruction that has been written to a location
> since the last fence.i instruction. So it would be perfectly valid to fetch the
> old auipc and new jalr or vice versa and not trap. This would happen if, for
> example, the two instructions were in different cache lines, and only one of the
> cache lines got evicted and refilled.
>
> But sending an IPI to run the fence.i probably negates the performance benefit.

Maybe, like x86, we can hook ftrace_replace_code() and batch-send IPIs
to prevent storms of remote fences. The solution Alex proposed can save
code size at function entries, but we would have to send out remote
fences at each "-->" transition, which is 4 sets of remote IPIs. On the
other hand, this series increases the per-function patch size to 24
bytes but decreases the number of remote fences to 1 set.

The performance hit could be observable for the auipc + jalr case,
because all remote cores would be executing the XXXX instructions and
taking a trap at each function entry during code patching.

Besides, this series would give us a chance not to send any remote
fences at all if we only changed the destination of ftrace (e.g. to a
custom ftrace trampoline). As that would be a regular store for the
writer and a regular load for the readers, only a fence w,w is needed.
However, I am not very certain how common this particular use case
would be. I'd need some time to investigate it.
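
To illustrate that last point, the retargeting path could reduce to a
plain store. This is a sketch with made-up names: "slot" stands for the
8-byte target address that the patched prologue loads and jumps through.

------------------
/*
 * Hypothetical fast path: the ftrace prologue loads an 8-byte target
 * address embedded near the function entry and jumps to it, so only
 * the destination changes here -- no instruction bytes are rewritten
 * and no fence.i is needed.
 */
static void ftrace_retarget(unsigned long *slot, unsigned long tramp)
{
	smp_wmb();		/* fence w,w: publish the trampoline first */
	WRITE_ONCE(*slot, tramp);	/* aligned, single-copy atomic */
}
------------------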

>
> Maybe there is some creative way to overcome this.
>
> > But please let me know if that's completely stupid and I did not
> > understand the problem. Since my patchset to support Svvptc, I have been
> > wondering whether it is not more performant to actually take very
> > unlikely traps instead of trying to avoid them.
>
> I agree in general it is a good idea to optimize the hot path like this.
>
> Regards,
> Samuel
>

Regards,
Andy
Andy Chiu March 18, 2024, 3:31 p.m. UTC | #10
Hi Evgenii,

Thanks for your help!

I just rebased onto 6.8-rc1 and passed the stress-ng + ftrace/nop
testing. I will add tests with some random tracers and some
optimizations before sending it out again. Here are a few things needed:

On Thu, Feb 22, 2024 at 12:55 AM Evgenii Shatokhin
<e.shatokhin@yadro.com> wrote:
>
> On 21.02.2024 08:27, Andy Chiu wrote:
> >
> > On Wed, Feb 14, 2024 at 3:42 AM Evgenii Shatokhin <e.shatokhin@yadro.com> wrote:
> >>
> >> Hi,
> >>
> >> On 13.09.2022 12:42, Andy Chiu wrote:
> >>> This patch removes dependency of dynamic ftrace from calling
> >>> stop_machine(), and makes it compatiable with kernel preemption.
> >>> Originally, we ran into stack corruptions, or execution of partially
> >>> updated instructions when starting or stopping ftrace on a fully
> >>> preemptible kernel configuration. The reason is that kernel periodically
> >>> calls rcu_momentary_dyntick_idle() on cores waiting for the code-patching
> >>> core running in ftrace. Though rcu_momentary_dyntick_idle() itself is
> >>> marked as notrace, it would call a bunch of tracable functions if we
> >>> configured the kernel as preemptible. For example, these are some functions
> >>> that happened to have a symbol and have not been marked as notrace on a
> >>> RISC-V preemptible kernel compiled with GCC-11:
> >>>    - __rcu_report_exp_rnp()
> >>>    - rcu_report_exp_cpu_mult()
> >>>    - rcu_preempt_deferred_qs()
> >>>    - rcu_preempt_need_deferred_qs()
> >>>    - rcu_preempt_deferred_qs_irqrestore()
> >>>
> >>> Thus, this make it not ideal for us to rely on stop_machine() and
> >>> handly marked "notrace"s to perform runtime code patching. To remove
> >>> such dependency, we must make updates of code seemed atomic on running
> >>> cores. This might not be obvious for RISC-V since it usaually uses a pair
> >>> of AUIPC + JALR to perform a long jump, which cannot be modified and
> >>> executed concurrently if we consider preemptions. As such, this patch
> >>> proposed a way to make it possible. It embeds a 32-bit rel-address data
> >>> into instructions of each ftrace prologue and jumps indirectly. In this
> >>> way, we could store and load the address atomically so that the code
> >>> patching core could run simutaneously with the rest of running cores.
> >>>
> >>> After applying the patchset, we compiled a preemptible kernel with all
> >>> tracers and ftrace-selftest enabled, and booted it on a 2-core QEMU virt
> >>> machine. The kernel could boot up successfully, passing all ftrace
> >>> testsuits. Besides, we ran a script that randomly pick a tracer on every
> >>> 0~5 seconds. The kernel has sustained over 20K rounds of the test. In
> >>> contrast, a preemptible kernel without our patch would panic in few
> >>> rounds on the same machine.
> >>>
> >>> Though we ran into errors when using hwlat or irqsoff tracers together
> >>> with cpu-online stressor from stress-ng on a preemptible kernel. We
> >>> believe the reason may be that  percpu workers of the tracers are being
> >>> queued into unbounded workqueue when cpu get offlined and patches will go
> >>> through tracing tree.
> >>>
> >>> Additionally, we found patching of tracepoints unsafe since the
> >>> instructions being patched are not naturally aligned. This may result in
> >>> 2 half-word stores, which breaks atomicity, during the code patching.
> >>>
> >>> changes in patch v2:
> >>>    - Enforce alignments on all functions with a compiler workaround.
> >>>    - Support 64bit addressing for ftrace targets if xlen == 64
> >>>    - Initialize ftrace target addresses to avoid calling bad address in a
> >>>      hypothesized case.
> >>>    - Use LGPTR instead of SZPTR since .align is log-scaled for
> >>>      mcount-dyn.S
> >>>    - Require the nop instruction of all jump_labels aligns naturally on
> >>>      4B.
> >>>
> >>> Andy Chiu (5):
> >>>     riscv: align ftrace to 4 Byte boundary and increase ftrace prologue
> >>>       size
> >>>     riscv: export patch_insn_write
> >>>     riscv: ftrace: use indirect jump to work with kernel preemption
> >>>     riscv: ftrace: do not use stop_machine to update code
> >>>     riscv: align arch_static_branch function
> >>>
> >>>    arch/riscv/Makefile                 |   2 +-
> >>>    arch/riscv/include/asm/ftrace.h     |  24 ----
> >>>    arch/riscv/include/asm/jump_label.h |   2 +
> >>>    arch/riscv/include/asm/patch.h      |   1 +
> >>>    arch/riscv/kernel/ftrace.c          | 179 ++++++++++++++++++++--------
> >>>    arch/riscv/kernel/mcount-dyn.S      |  69 ++++++++---
> >>>    arch/riscv/kernel/patch.c           |   4 +-
> >>>    7 files changed, 188 insertions(+), 93 deletions(-)
> >>>
> >>
> >> First of all, thank you for working on making dynamic Ftrace robust in
> >> preemptible kernels on RISC-V.
> >> It is an important use case but, for now, dynamic Ftrace and related
> >> tracers cannot be safely used with such kernels.
> >>
> >> Are there any updates on this series?
> >> It needs a rebase, of course, but it looks doable.
> >>
> >> If I understand the discussion correctly, the only blocker was that
> >> using "-falign-functions" was not enough to properly align cold
> >> functions and "-fno-guess-branch-probability" would likely have a
> >> performance cost.
> >>
> >> It seems, GCC developers have recently provided a workaround for that
> >> (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326,
> >> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c24).
> >>
> >> "-fmin-function-alignment" should help but, I do not know, which GCC
> >> versions have got that patch already. In the meantime, one could
> >> probably check if "-fmin-function-alignment" is supported by the
> >> compiler and use it, if it is.
> >>
> >> Thoughts?
> >
> > Hi Evgenii,
> >
> > Thanks for the update. Indeed, it is essential to this patch for
> > toolchain to provide forced alignment. We can test this flag in the
> > Makefile to sort out if toolchain supports it or not. Meanwhile, I had
> > figured out a way for this to work on any 2-B align addresses but
> > hadn't implemented it out yet. Basically it would require more
> > patching space for us to do software alignment. I would opt for a
> > special toolchain flag if the toolchain just supports it.
> >
> > Let me take some time to look and get back to you soon.
>
> Thank you! Looking forward to it.
>
> In case it helps, here is what I have checked so far.
>
> 1.
> I added the patch
> https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326
> to the current revision of GCC 13.2.0 from RISC-V toolchain.
>
> Rebased your patchset on top of Linux 6.8-rc4 (mostly - context changes,
> SYM_FUNC_START/SYM_FUNC_END for asm symbols, etc.).
>
> Reverted 8547649981e6 ("riscv: ftrace: Fixup panic by disabling
> preemption").
>
> Switched from -falign-functions=4 to -fmin-function-alignment=4:
> ------------------
> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
> index b33b787c8b07..dcd0adeebaae 100644
> --- a/arch/riscv/Makefile
> +++ b/arch/riscv/Makefile
> @@ -15,9 +15,9 @@ ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
>         LDFLAGS_vmlinux += --no-relax
>         KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
>   ifeq ($(CONFIG_RISCV_ISA_C),y)
> -       CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -falign-functions=4
> +       CC_FLAGS_FTRACE := -fpatchable-function-entry=12
> -fmin-function-alignment=4
>   else
> -       CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -falign-functions=4
> +       CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -fmin-function-alignment=4
>   endif
>   endif
>
> ------------------
>
> As far as I can see from objdump, the functions that were not aligned at
> 4-byte boundary with -falign-functions=4, are now aligned correctly with
> -fmin-function-alignment=4.
>
> 2.
> I tried the kernel in a QEMU VM with 2 CPUs and "-machine virt".
>
> The boottime tests for Ftrace had passed, except the tests for
> function_graph. I described the failure and the possible fix here:
> https://lore.kernel.org/all/dcc5976d-635a-4710-92df-94a99653314e@yadro.com/

Indeed, this is needed. I am not sure why the ftrace boot-time tests
passed for me back then. Thank you for solving it!

>
> 3.
> There were also boottime warnings about "RCU not on for:
> arch_cpu_idle+0x0/0x2c". These are probably not related to your
> patchset, but rather to the fact that Ftrace is enabled in a preemptible
> kernel where RCU does different things.
>
> As a workaround, I disabled tracing of arch_cpu_idle() for now:
> ------------------
> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
> index 92922dbd5b5c..6abeecbfc51d 100644
> --- a/arch/riscv/kernel/process.c
> +++ b/arch/riscv/kernel/process.c
> @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);
>
>   extern asmlinkage void ret_from_fork(void);
>
> -void arch_cpu_idle(void)
> +void noinstr arch_cpu_idle(void)
>   {
>         cpu_do_idle();
>   }
>
> ------------------
>
> 4.
> Stress-testing revealed an issue though, which I do not understand yet.
>
> Probably similar to what you did earlier, I ran a script that switched
> the current tracer to "function", "function_graph", "nop", "blk" each
> 1-5 seconds. In another shell, "stress-ng --hrtimers 1" was running.
>
> The kernel usually crashed within a few minutes, in seemingly random
> locations, but often in one of two ways:
>
> (a) Invalid instruction, because the address of ftrace_caller function
> was somehow written to the body of the traced function rather than just
> to the Ftrace prologue.

The reason for this is probably that one of your ftrace_*_call sites is
not 8-byte aligned.
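
If it helps to rule that out, something like this hypothetical sanity
check could be added at the point where the target slot is written:

------------------
/*
 * Hypothetical debug check: an 8-byte store is only single-copy
 * atomic if the slot is naturally aligned; a misaligned slot would
 * be written in pieces, and another hart could observe a torn
 * address.
 */
static void check_ftrace_call_slot(void *slot)
{
	WARN_ON_ONCE(!IS_ALIGNED((unsigned long)slot, sizeof(u64)));
}
------------------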

>
> In the following example, the crash happened at 0xffffffff800d3398. "b0
> d7" is actually not part of the code here, but rather the lower bytes of
> 0xffffffff8000d7b0, the address of ftrace_caller() in this kernel.

It seems like there is a bug in patch_insn_write(). I think we should
at least disable migration during patch_map() and patch_unmap(). I'd
need some time to dig into patch_map(), but since __set_fixmap() only
flushes the local TLB, I'd assume it is not safe to be context-switched
out and migrated while holding the fixmap mapping. Adding
preempt_disable() and preempt_enable() around the call to
__patch_insn_write() solves the issue.
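
In other words, a sketch of the fix (against patch.c as I understand
it; the exact signatures may differ):

------------------
static int patch_insn_write(void *addr, const void *insn, size_t len)
{
	int ret;

	/*
	 * __patch_insn_write() maps the target page through a fixmap
	 * slot, and __set_fixmap() only flushes the local TLB, so the
	 * task must not migrate to another CPU while the mapping is
	 * in use.
	 */
	preempt_disable();
	ret = __patch_insn_write(addr, insn, len);
	preempt_enable();

	return ret;
}
------------------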

>
> (gdb) disas /r 0xffffffff800d3382,+0x20
> Dump of assembler code from 0xffffffff800d3382 to 0xffffffff800d33a2:
> ...
>     0xffffffff800d3394 <clockevents_program_event+144>:  ba 87   mv
> a5,a4
>     0xffffffff800d3396 <clockevents_program_event+146>:  c1 bf   j
> 0xffffffff800d3366 <clockevents_program_event+98>
>     0xffffffff800d3398 <clockevents_program_event+148>:  b0 d7   sw
> a2,104(a5) // 0xffffffff8000d7b0, the address of ftrace_caller().
>     0xffffffff800d339a <clockevents_program_event+150>:  00 80   .2byte
> 0x8000
>     0xffffffff800d339c <clockevents_program_event+152>:  ff ff   .2byte
> 0xffff
>     0xffffffff800d339e <clockevents_program_event+154>:  ff ff   .2byte
> 0xffff
>     0xffffffff800d33a0 <clockevents_program_event+156>:  d5 bf   j
> 0xffffffff800d3394 <clockevents_program_event+144>
>
> The backtrace usually contains one or more occurrences of
> return_to_handler() in this case.
>
> [  260.520394] [<ffffffff800d3398>] clockevents_program_event+0xac/0x100
> [  260.521195] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> [  260.521843] [<ffffffff800c50ba>] hrtimer_interrupt+0x122/0x20c
> [  260.522492] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> [  260.523132] [<ffffffff8009785e>] handle_percpu_devid_irq+0x9e/0x1ec
> [  260.523788] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> [  260.524437] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
> [  260.525080] [<ffffffff80a8acfa>] handle_riscv_irq+0x4a/0x74
> [  260.525726] [<ffffffff80a97b9a>] call_on_irq_stack+0x32/0x40
> ----------------------
>
> (b) Jump to an invalid location, e.g. to the middle of a valid 4-byte
> instruction. %ra usually points right after the last instruction, "jalr
>    a2", in return_to_handler() in such cases, so the jump was likely
> made from there.

I haven't run the fgraph tests yet. I will try them out and see.

>
> The problem is reproducible, although I have not found what causes it yet.
>
> Any help is appreciated, of course.
>
> >
> >>
> >> Regards,
> >> Evgenii
> >
> > Regards,
> > Andy
>

Also, here is another side note:

It seems like the ftrace save/restore routine should save more
registers, as clang's fastcc may use t2 when the number of arguments
exceeds what the ABI defines for passing arguments through registers.
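
For illustration, a contrived example (not taken from the kernel):

------------------
/*
 * Contrived example: clang may promote a static (local-linkage)
 * function to its internal fastcc convention, which can pass the
 * 9th and later arguments in t2, t3, ... instead of on the stack.
 * If the ftrace trampoline clobbers t2 without saving it, a8 would
 * be corrupted across a traced call to a function like this one.
 */
static noinline long sum9(long a0, long a1, long a2, long a3, long a4,
			  long a5, long a6, long a7, long a8)
{
	return a0 + a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8;
}
------------------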

Cheers,
Andy
Alexandre Ghiti March 19, 2024, 2:50 p.m. UTC | #11
On 11/03/2024 15:24, Andy Chiu wrote:
> On Thu, Mar 7, 2024 at 11:57 PM Samuel Holland
> <samuel.holland@sifive.com> wrote:
>> Hi Alex,
>>
>> On 2024-03-07 7:21 AM, Alexandre Ghiti wrote:
>>> But TBH, I have started thinking about the issue your patch is trying to
>>> deal with. IIUC, you're trying to avoid traps (or silent errors) that
>>> could happen because of concurrent accesses while an auipc/jalr pair is
>>> being patched.
>>>
>>> I'm wondering if, instead, we could actually handle the potential traps:
>>> before storing the auipc + jalr pair, we could use a well-identified trapping
>>> instruction that could be recognized in the trap handler as a legitimate trap.
>>> For example:
>>>
>>>
>>> auipc  -->  auipc  -->  XXXX  -->  XXXX  -->  auipc
>>> jalr        XXXX        XXXX       jalr       jalr
>>>
>>>
>>> If a core traps on a XXXX instruction, we know this address is being
>>> patched, so we can simply return; by the time we retry, the patching
>>> will probably be over. We could also identify a half-patched word
>>> instruction (I mean one with only XX written).
>> Unfortunately this does not work without some fence.i in the middle. The
>> processor is free to fetch any instruction that has been written to a location
>> since the last fence.i instruction. So it would be perfectly valid to fetch the
>> old auipc and new jalr or vice versa and not trap. This would happen if, for
>> example, the two instructions were in different cache lines, and only one of the
>> cache lines got evicted and refilled.
>>
>> But sending an IPI to run the fence.i probably negates the performance benefit.
> Maybe, like x86, we can hook ftrace_replace_code() and batch-send IPIs
> to prevent storms of remote fences. The solution Alex proposed can save
> code size at function entries, but we would have to send out remote
> fences at each "-->" transition, which is 4 sets of remote IPIs. On the
> other hand, this series increases the per-function patch size to 24
> bytes but decreases the number of remote fences to 1 set.
>
> The performance hit could be observable for the auipc + jalr case,
> because all remote cores would be executing the XXXX instructions and
> taking a trap at each function entry during code patching.
>
> Besides, this series would give us a chance not to send any remote
> fences at all if we only changed the destination of ftrace (e.g. to a
> custom ftrace trampoline). As that would be a regular store for the
> writer and a regular load for the readers, only a fence w,w is needed.
> However, I am not very certain how common this particular use case
> would be. I'd need some time to investigate it.
>
>> Maybe there is some creative way to overcome this.
>>
>> But please let me know if that's completely stupid and I did not
>> understand the problem. Since my patchset to support Svvptc, I have been
>> wondering whether it is not more performant to actually take very
>> unlikely traps instead of trying to avoid them.
>> I agree in general it is a good idea to optimize the hot path like this.
>>
>> Regards,
>> Samuel
>>
> Regards,
> Andy
>


So indeed my solution was way too naive, and we've been discussing this
with Björn lately. He worked a lot on it and came up with the solution
he proposed here:
https://lore.kernel.org/linux-riscv/87zfv0onre.fsf@all.your.base.are.belong.to.us/

The thing is, ftrace seems to be quite broken: the ftrace kselftests
raise a lot of issues which I have started to debug but which are not
that easy, so we are wondering whether *someone* should work on Björn's
solution (or another one, open to discussion) for 6.10. @Andy WDYT? Do
you have free cycles? Björn could work on that too (and I'll help if
needed).

Let me know what you think!

Alex
Conor Dooley March 19, 2024, 2:58 p.m. UTC | #12
On Tue, Mar 19, 2024 at 03:50:01PM +0100, Alexandre Ghiti wrote:

> The thing is, ftrace seems to be quite broken: the ftrace kselftests
> raise a lot of issues which I have started to debug but which are not
> that easy, so we are wondering whether *someone* should work on Björn's
> solution (or another one, open to discussion) for 6.10. @Andy WDYT? Do
> you have free cycles? Björn could work on that too (and I'll help if
> needed).

If patching is broken, I wouldn't be too worried about targeting 6.10;
just do it right and get Palmer to take it on fixes when everyone is
happy with it.
Evgenii Shatokhin March 19, 2024, 3:32 p.m. UTC | #13
Hi,

On 18.03.2024 18:31, Andy Chiu wrote:
> Hi Evgenii,
> 
> Thanks for your help!

You are welcome!

> 
> I just rebased onto 6.8-rc1 and passed the stress-ng + ftrace/nop
> testing. I will add tests with some random tracers and some
> optimizations before sending it out again. Here are a few things needed:
> 
> On Thu, Feb 22, 2024 at 12:55 AM Evgenii Shatokhin
> <e.shatokhin@yadro.com> wrote:
>>
>> On 21.02.2024 08:27, Andy Chiu wrote:
>>>
>>> On Wed, Feb 14, 2024 at 3:42 AM Evgenii Shatokhin <e.shatokhin@yadro.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 13.09.2022 12:42, Andy Chiu wrote:
>>>>> This patch removes dependency of dynamic ftrace from calling
>>>>> stop_machine(), and makes it compatiable with kernel preemption.
>>>>> Originally, we ran into stack corruptions, or execution of partially
>>>>> updated instructions when starting or stopping ftrace on a fully
>>>>> preemptible kernel configuration. The reason is that kernel periodically
>>>>> calls rcu_momentary_dyntick_idle() on cores waiting for the code-patching
>>>>> core running in ftrace. Though rcu_momentary_dyntick_idle() itself is
>>>>> marked as notrace, it would call a bunch of tracable functions if we
>>>>> configured the kernel as preemptible. For example, these are some functions
>>>>> that happened to have a symbol and have not been marked as notrace on a
>>>>> RISC-V preemptible kernel compiled with GCC-11:
>>>>>     - __rcu_report_exp_rnp()
>>>>>     - rcu_report_exp_cpu_mult()
>>>>>     - rcu_preempt_deferred_qs()
>>>>>     - rcu_preempt_need_deferred_qs()
>>>>>     - rcu_preempt_deferred_qs_irqrestore()
>>>>>
>>>>> Thus, this make it not ideal for us to rely on stop_machine() and
>>>>> handly marked "notrace"s to perform runtime code patching. To remove
>>>>> such dependency, we must make updates of code seemed atomic on running
>>>>> cores. This might not be obvious for RISC-V since it usaually uses a pair
>>>>> of AUIPC + JALR to perform a long jump, which cannot be modified and
>>>>> executed concurrently if we consider preemptions. As such, this patch
>>>>> proposed a way to make it possible. It embeds a 32-bit rel-address data
>>>>> into instructions of each ftrace prologue and jumps indirectly. In this
>>>>> way, we could store and load the address atomically so that the code
>>>>> patching core could run simutaneously with the rest of running cores.
>>>>>
>>>>> After applying the patchset, we compiled a preemptible kernel with all
>>>>> tracers and ftrace-selftest enabled, and booted it on a 2-core QEMU virt
>>>>> machine. The kernel could boot up successfully, passing all ftrace
>>>>> testsuits. Besides, we ran a script that randomly pick a tracer on every
>>>>> 0~5 seconds. The kernel has sustained over 20K rounds of the test. In
>>>>> contrast, a preemptible kernel without our patch would panic in few
>>>>> rounds on the same machine.
>>>>>
>>>>> Though we ran into errors when using hwlat or irqsoff tracers together
>>>>> with cpu-online stressor from stress-ng on a preemptible kernel. We
>>>>> believe the reason may be that  percpu workers of the tracers are being
>>>>> queued into unbounded workqueue when cpu get offlined and patches will go
>>>>> through tracing tree.
>>>>>
>>>>> Additionally, we found patching of tracepoints unsafe since the
>>>>> instructions being patched are not naturally aligned. This may result in
>>>>> 2 half-word stores, which breaks atomicity, during the code patching.
>>>>>
>>>>> changes in patch v2:
>>>>>     - Enforce alignments on all functions with a compiler workaround.
>>>>>     - Support 64bit addressing for ftrace targets if xlen == 64
>>>>>     - Initialize ftrace target addresses to avoid calling bad address in a
>>>>>       hypothesized case.
>>>>>     - Use LGPTR instead of SZPTR since .align is log-scaled for
>>>>>       mcount-dyn.S
>>>>>     - Require the nop instruction of all jump_labels aligns naturally on
>>>>>       4B.
>>>>>
>>>>> Andy Chiu (5):
>>>>>      riscv: align ftrace to 4 Byte boundary and increase ftrace prologue
>>>>>        size
>>>>>      riscv: export patch_insn_write
>>>>>      riscv: ftrace: use indirect jump to work with kernel preemption
>>>>>      riscv: ftrace: do not use stop_machine to update code
>>>>>      riscv: align arch_static_branch function
>>>>>
>>>>>     arch/riscv/Makefile                 |   2 +-
>>>>>     arch/riscv/include/asm/ftrace.h     |  24 ----
>>>>>     arch/riscv/include/asm/jump_label.h |   2 +
>>>>>     arch/riscv/include/asm/patch.h      |   1 +
>>>>>     arch/riscv/kernel/ftrace.c          | 179 ++++++++++++++++++++--------
>>>>>     arch/riscv/kernel/mcount-dyn.S      |  69 ++++++++---
>>>>>     arch/riscv/kernel/patch.c           |   4 +-
>>>>>     7 files changed, 188 insertions(+), 93 deletions(-)
>>>>>
>>>>
>>>> First of all, thank you for working on making dynamic Ftrace robust in
>>>> preemptible kernels on RISC-V.
>>>> It is an important use case but, for now, dynamic Ftrace and related
>>>> tracers cannot be safely used with such kernels.
>>>>
>>>> Are there any updates on this series?
>>>> It needs a rebase, of course, but it looks doable.
>>>>
>>>> If I understand the discussion correctly, the only blocker was that
>>>> using "-falign-functions" was not enough to properly align cold
>>>> functions and "-fno-guess-branch-probability" would likely have a
>>>> performance cost.
>>>>
>>>> It seems, GCC developers have recently provided a workaround for that
>>>> (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326,
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c24).
>>>>
>>>> "-fmin-function-alignment" should help but, I do not know, which GCC
>>>> versions have got that patch already. In the meantime, one could
>>>> probably check if "-fmin-function-alignment" is supported by the
>>>> compiler and use it, if it is.
>>>>
>>>> Thoughts?
>>>
>>> Hi Evgenii,
>>>
>>> Thanks for the update. Indeed, it is essential to this patch for
>>> toolchain to provide forced alignment. We can test this flag in the
>>> Makefile to sort out if toolchain supports it or not. Meanwhile, I had
>>> figured out a way for this to work on any 2-B align addresses but
>>> hadn't implemented it out yet. Basically it would require more
>>> patching space for us to do software alignment. I would opt for a
>>> special toolchain flag if the toolchain just supports it.
>>>
>>> Let me take some time to look and get back to you soon.
>>
>> Thank you! Looking forward to it.
>>
>> In case it helps, here is what I have checked so far.
>>
>> 1.
>> I added the patch
>> https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326
>> to the current revision of GCC 13.2.0 from RISC-V toolchain.
>>
>> Rebased your patchset on top of Linux 6.8-rc4 (mostly - context changes,
>> SYM_FUNC_START/SYM_FUNC_END for asm symbols, etc.).
>>
>> Reverted 8547649981e6 ("riscv: ftrace: Fixup panic by disabling
>> preemption").
>>
>> Switched from -falign-functions=4 to -fmin-function-alignment=4:
>> ------------------
>> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
>> index b33b787c8b07..dcd0adeebaae 100644
>> --- a/arch/riscv/Makefile
>> +++ b/arch/riscv/Makefile
>> @@ -15,9 +15,9 @@ ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
>>          LDFLAGS_vmlinux += --no-relax
>>          KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
>>    ifeq ($(CONFIG_RISCV_ISA_C),y)
>> -       CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -falign-functions=4
>> +       CC_FLAGS_FTRACE := -fpatchable-function-entry=12
>> -fmin-function-alignment=4
>>    else
>> -       CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -falign-functions=4
>> +       CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -fmin-function-alignment=4
>>    endif
>>    endif
>>
>> ------------------
>>
>> As far as I can see from objdump, the functions that were not aligned at
>> 4-byte boundary with -falign-functions=4, are now aligned correctly with
>> -fmin-function-alignment=4.
>>
>> 2.
>> I tried the kernel in a QEMU VM with 2 CPUs and "-machine virt".
>>
>> The boottime tests for Ftrace had passed, except the tests for
>> function_graph. I described the failure and the possible fix here:
>> https://lore.kernel.org/all/dcc5976d-635a-4710-92df-94a99653314e@yadro.com/
> 
> Indeed, this is needed. I am not sure why the ftrace boot-time tests
> passed for me back then. Thank you for solving it!
> 
>>
>> 3.
>> There were also boottime warnings about "RCU not on for:
>> arch_cpu_idle+0x0/0x2c". These are probably not related to your
>> patchset, but rather to the fact that Ftrace is enabled in a preemptible
>> kernel where RCU does different things.
>>
>> As a workaround, I disabled tracing of arch_cpu_idle() for now:
>> ------------------
>> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
>> index 92922dbd5b5c..6abeecbfc51d 100644
>> --- a/arch/riscv/kernel/process.c
>> +++ b/arch/riscv/kernel/process.c
>> @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);
>>
>>    extern asmlinkage void ret_from_fork(void);
>>
>> -void arch_cpu_idle(void)
>> +void noinstr arch_cpu_idle(void)
>>    {
>>          cpu_do_idle();
>>    }
>>
>> ------------------
>>
>> 4.
>> Stress-testing revealed an issue though, which I do not understand yet.
>>
>> Probably similar to what you did earlier, I ran a script that switched
>> the current tracer to "function", "function_graph", "nop", "blk" each
>> 1-5 seconds. In another shell, "stress-ng --hrtimers 1" was running.
>>
>> The kernel usually crashed within a few minutes, in seemingly random
>> locations, but often in one of two ways:
>>
>> (a) Invalid instruction, because the address of ftrace_caller function
>> was somehow written to the body of the traced function rather than just
>> to the Ftrace prologue.
> 
> The reason for this is probably that one of your ftrace_*_call sites is
> not 8-byte aligned.

I thought all locations where the address of a ftrace_caller function
is written were 8-byte aligned, provided the compiler guarantees that
the start addresses of all functions are 4-byte aligned. Your patchset
provides 2 kinds of function prologues exactly for that purpose. Am I
missing something?

> 
>>
>> In the following example, the crash happened at 0xffffffff800d3398. "b0
>> d7" is actually not part of the code here, but rather the lower bytes of
>> 0xffffffff8000d7b0, the address of ftrace_caller() in this kernel.
> 
> It seems like there is a bug in patch_insn_write(). I think we should
> at least disable migration during patch_map() and patch_unmap(). I'd
> need some time to dig into patch_map(), but since __set_fixmap() only
> flushes the local TLB, I'd assume it is not safe to be context-switched
> out and migrated while holding the fixmap mapping. Adding
> preempt_disable() and preempt_enable() around the call to
> __patch_insn_write() solves the issue.
> 

Interesting.
Thanks for pointing that out! I never thought that the task could
migrate to a different CPU while patch_insn_write() is running. If it
could, that would cause such issues, sure. And probably the issues with
"function_graph" too, if some data were corrupted that way rather than
code.

>>
>> (gdb) disas /r 0xffffffff800d3382,+0x20
>> Dump of assembler code from 0xffffffff800d3382 to 0xffffffff800d33a2:
>> ...
>>      0xffffffff800d3394 <clockevents_program_event+144>:  ba 87   mv
>> a5,a4
>>      0xffffffff800d3396 <clockevents_program_event+146>:  c1 bf   j
>> 0xffffffff800d3366 <clockevents_program_event+98>
>>      0xffffffff800d3398 <clockevents_program_event+148>:  b0 d7   sw
>> a2,104(a5) // 0xffffffff8000d7b0, the address of ftrace_caller().
>>      0xffffffff800d339a <clockevents_program_event+150>:  00 80   .2byte
>> 0x8000
>>      0xffffffff800d339c <clockevents_program_event+152>:  ff ff   .2byte
>> 0xffff
>>      0xffffffff800d339e <clockevents_program_event+154>:  ff ff   .2byte
>> 0xffff
>>      0xffffffff800d33a0 <clockevents_program_event+156>:  d5 bf   j
>> 0xffffffff800d3394 <clockevents_program_event+144>
>>
>> The backtrace usually contains one or more occurrences of
>> return_to_handler() in this case.
>>
>> [  260.520394] [<ffffffff800d3398>] clockevents_program_event+0xac/0x100
>> [  260.521195] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>> [  260.521843] [<ffffffff800c50ba>] hrtimer_interrupt+0x122/0x20c
>> [  260.522492] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>> [  260.523132] [<ffffffff8009785e>] handle_percpu_devid_irq+0x9e/0x1ec
>> [  260.523788] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>> [  260.524437] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>> [  260.525080] [<ffffffff80a8acfa>] handle_riscv_irq+0x4a/0x74
>> [  260.525726] [<ffffffff80a97b9a>] call_on_irq_stack+0x32/0x40
>> ----------------------
>>
>> (b) Jump to an invalid location, e.g. to the middle of a valid 4-byte
>> instruction. %ra usually points right after the last instruction, "jalr
>>     a2", in return_to_handler() in such cases, so the jump was likely
>> made from there.
> 
> I haven't run the fgraph tests yet. I will try them out and see.
> 
>>
>> The problem is reproducible, although I have not found what causes it yet.
>>
>> Any help is appreciated, of course.
>>
>>>
>>>>
>>>> Regards,
>>>> Evgenii
>>>
>>> Regards,
>>> Andy
>>
> 
> Also, here is another side note:
> 
> It seems like the ftrace save/restore routine should save more
> registers, as clang's fastcc may use t2 when the number of arguments
> exceeds what the ABI defines for passing arguments through registers.

Yes, I reported that issue to the LLVM maintainers in
https://github.com/llvm/llvm-project/issues/83111. It seems static
functions with 9+ arguments use t2, t3, etc. for the 9th and 10th
arguments when compiled with clang.

Clang seems to leave t0 and t1 alone, but I do not know yet if that is
just a coincidence. I haven't found the exact rules for the fastcc
calling convention on RISC-V so far.

A compiler option to disable fastcc for the Linux kernel builds would be 
great. But, it seems, the discussion with LLVM maintainers will go 
nowhere without benchmarks to show whether that optimization has any 
significant effect. I plan to find and run proper benchmarks when I have 
time, but not just yet.

> 
> Cheers,
> Andy

Regards,
Evgenii
Alexandre Ghiti March 19, 2024, 5:37 p.m. UTC | #14
Hi Andy,

On 18/03/2024 16:31, Andy Chiu wrote:
> Hi Evgenii,
>
> Thanks for your help!
>
> I just rebased upon 6.8-rc1 and passed the stress-ng + ftrace/nop
> testing. I will add some random tracers to test and some optimization
> before sending out again. Here are a few things needed:
>
> On Thu, Feb 22, 2024 at 12:55 AM Evgenii Shatokhin
> <e.shatokhin@yadro.com> wrote:
>> On 21.02.2024 08:27, Andy Chiu wrote:
>>>
>>> On Wed, Feb 14, 2024 at 3:42 AM Evgenii Shatokhin <e.shatokhin@yadro.com> wrote:
>>>> Hi,
>>>>
>>>> On 13.09.2022 12:42, Andy Chiu wrote:
>>>>> [...]
>>>>
>>>> First of all, thank you for working on making dynamic Ftrace robust in
>>>> preemptible kernels on RISC-V.
>>>> It is an important use case but, for now, dynamic Ftrace and related
>>>> tracers cannot be safely used with such kernels.
>>>>
>>>> Are there any updates on this series?
>>>> It needs a rebase, of course, but it looks doable.
>>>>
>>>> If I understand the discussion correctly, the only blocker was that
>>>> using "-falign-functions" was not enough to properly align cold
>>>> functions and "-fno-guess-branch-probability" would likely have a
>>>> performance cost.
>>>>
>>>> It seems, GCC developers have recently provided a workaround for that
>>>> (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326,
>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c24).
>>>>
>>>> "-fmin-function-alignment" should help but, I do not know, which GCC
>>>> versions have got that patch already. In the meantime, one could
>>>> probably check if "-fmin-function-alignment" is supported by the
>>>> compiler and use it, if it is.
>>>>
>>>> Thoughts?
>>> Hi Evgenii,
>>>
>>> Thanks for the update. Indeed, it is essential for this patch that the
>>> toolchain provide forced alignment. We can test this flag in the
>>> Makefile to sort out whether the toolchain supports it. Meanwhile, I had
>>> figured out a way for this to work on any 2-byte-aligned address but
>>> hadn't implemented it yet. Basically it would require more
>>> patching space for us to do software alignment. I would opt for a
>>> special toolchain flag if the toolchain supports it.
>>>
>>> Let me take some time to look and get back to you soon.
>> Thank you! Looking forward to it.
>>
>> In case it helps, here is what I have checked so far.
>>
>> 1.
>> I added the patch
>> https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326
>> to the current revision of GCC 13.2.0 from RISC-V toolchain.
>>
>> Rebased your patchset on top of Linux 6.8-rc4 (mostly - context changes,
>> SYM_FUNC_START/SYM_FUNC_END for asm symbols, etc.).
>>
>> Reverted 8547649981e6 ("riscv: ftrace: Fixup panic by disabling
>> preemption").
>>
>> Switched from -falign-functions=4 to -fmin-function-alignment=4:
>> ------------------
>> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
>> index b33b787c8b07..dcd0adeebaae 100644
>> --- a/arch/riscv/Makefile
>> +++ b/arch/riscv/Makefile
>> @@ -15,9 +15,9 @@ ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
>>          LDFLAGS_vmlinux += --no-relax
>>          KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
>>    ifeq ($(CONFIG_RISCV_ISA_C),y)
>> -       CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -falign-functions=4
>> +       CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -fmin-function-alignment=4
>>    else
>> -       CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -falign-functions=4
>> +       CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -fmin-function-alignment=4
>>    endif
>>    endif
>>
>> ------------------
>>
>> As far as I can see from objdump, the functions that were not aligned at
>> 4-byte boundary with -falign-functions=4, are now aligned correctly with
>> -fmin-function-alignment=4.
>>
>> 2.
>> I tried the kernel in a QEMU VM with 2 CPUs and "-machine virt".
>>
>> The boottime tests for Ftrace had passed, except the tests for
>> function_graph. I described the failure and the possible fix here:
>> https://lore.kernel.org/all/dcc5976d-635a-4710-92df-94a99653314e@yadro.com/
> Indeed, this is needed. I am not sure why I got ftrace boot-time tests
> passed back then. Thank you for solving it!
>
>> 3.
>> There were also boottime warnings about "RCU not on for:
>> arch_cpu_idle+0x0/0x2c". These are probably not related to your
>> patchset, but rather to the fact that Ftrace is enabled in a preemptble
>> kernel where RCU does different things.
>>
>> As a workaround, I disabled tracing of arch_cpu_idle() for now:
>> ------------------
>> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
>> index 92922dbd5b5c..6abeecbfc51d 100644
>> --- a/arch/riscv/kernel/process.c
>> +++ b/arch/riscv/kernel/process.c
>> @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);
>>
>>    extern asmlinkage void ret_from_fork(void);
>>
>> -void arch_cpu_idle(void)
>> +void noinstr arch_cpu_idle(void)
>>    {
>>          cpu_do_idle();
>>    }
>>
>> ------------------
>>
>> 4.
>> Stress-testing revealed an issue though, which I do not understand yet.
>>
>> Probably similar to what you did earlier, I ran a script that switched
>> the current tracer to "function", "function_graph", "nop", "blk" each
>> 1-5 seconds. In another shell, "stress-ng --hrtimers 1" was running.
>>
>> The kernel usually crashed within a few minutes, in seemingly random
>> locations, but often in one of two ways:
>>
>> (a) Invalid instruction, because the address of ftrace_caller function
>> was somehow written to the body of the traced function rather than just
>> to the Ftrace prologue.
> The reason for this is probably that any one of your ftrace_*_call is
> not 8-B aligned.
>
>> In the following example, the crash happened at 0xffffffff800d3398. "b0
>> d7" is actually not part of the code here, but rather the lower bytes of
>> 0xffffffff8000d7b0, the address of ftrace_caller() in this kernel.
> It seems like there is a bug in patch_insn_write(). I think we should
> at least disable migration during patch_map() and patch_unmap(). I'd
> need some time to dig into patch_map(). But since __set_fixmap() only
> flushes the local TLB, I'd assume it is not safe to context switch out and
> migrate while holding the fixmap mapping. Adding preempt_disable()
> and preempt_enable() before calling __patch_insn_write() solves the
> issue.


Yes, Andrea already mentioned this. I came up with the same idea of 
preempt_disable(), but then I noticed arm64 actually disables IRQs: any 
idea why? 
https://lore.kernel.org/linux-riscv/CAHVXubj7ChgpvN4F_QO0oASaT5WC2VS0Q-bEqhnmF8z8QV=yDQ@mail.gmail.com/
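
For reference, here is a minimal sketch of the fix Andy describes above 
(assuming the patch_insn_write()/__patch_insn_write() split in 
arch/riscv/kernel/patch.c; the exact call site may differ):
------------------
static int patch_insn_write(void *addr, const void *insn, size_t len)
{
	int ret;

	/*
	 * __set_fixmap() only flushes the local TLB, so the task must
	 * not migrate to another hart while the fixmap window is live.
	 */
	preempt_disable();
	ret = __patch_insn_write(addr, insn, len);
	preempt_enable();

	return ret;
}
------------------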


>> [...]
Andy Chiu March 20, 2024, 4:36 p.m. UTC | #15
On Wed, Mar 20, 2024 at 1:37 AM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Hi Andy,
>
> On 18/03/2024 16:31, Andy Chiu wrote:
> > [...]
> >> In the following example, the crash happened at 0xffffffff800d3398. "b0
> >> d7" is actually not part of the code here, but rather the lower bytes of
> >> 0xffffffff8000d7b0, the address of ftrace_caller() in this kernel.
> > It seems like there is a bug in patch_insn_write(). I think we should
> > at least disable migration during patch_map() and patch_unmap(). I'd
> > need some time to dig into patch_map(). But since __set_fixmap() only
> > flushes the local TLB, I'd assume it is not safe to context switch out and
> > migrate while holding the fixmap mapping. Adding preempt_disable()
> > and preempt_enable() before calling __patch_insn_write() solves the
> > issue.
>
>
> Yes, Andrea already mentioned this. I came up with the same idea of
> preempt_disable(), but then I noticed arm64 actually disables IRQs: any
> idea why?
> https://lore.kernel.org/linux-riscv/CAHVXubj7ChgpvN4F_QO0oASaT5WC2VS0Q-bEqhnmF8z8QV=yDQ@mail.gmail.com/

Hi, I took a quick look and it seems that it is a design choice in
software to me. ARM uses a spinlock to protect text and we use a
mutex. If they have a requirement to do patching while irq is off
(maybe in an ipi handler), then the only viable option would be to use
raw_spin_lock_irqsave. I think preempt_disable should be enough for us
if we use text_mutex to protect patching. Or, am I missing something?
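
To illustrate the contrast (my sketch, not code from either tree; both 
function names are hypothetical):
------------------
/*
 * arm64-style: a raw spinlock with IRQs off allows patching even from
 * IRQ/IPI context, at the cost of keeping interrupts disabled.
 */
static DEFINE_RAW_SPINLOCK(patch_lock);

static void patch_text_irqsafe(void *addr, const void *insn, size_t len)
{
	unsigned long flags;

	raw_spin_lock_irqsave(&patch_lock, flags);
	/* ... write insn through the fixmap window ... */
	raw_spin_unlock_irqrestore(&patch_lock, flags);
}

/*
 * riscv-style: writers already serialize on the sleepable text_mutex,
 * so only migration needs to be prevented around the fixmap window.
 */
static void patch_text_preempt_off(void *addr, const void *insn, size_t len)
{
	mutex_lock(&text_mutex);
	preempt_disable();
	/* ... write insn through the fixmap window ... */
	preempt_enable();
	mutex_unlock(&text_mutex);
}
------------------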




> [...]
Andy Chiu March 20, 2024, 4:37 p.m. UTC | #16
On Tue, Mar 19, 2024 at 10:50 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> On 11/03/2024 15:24, Andy Chiu wrote:
> > On Thu, Mar 7, 2024 at 11:57 PM Samuel Holland
> > <samuel.holland@sifive.com> wrote:
> >> Hi Alex,
> >>
> >> On 2024-03-07 7:21 AM, Alexandre Ghiti wrote:
> >>> But TBH, I have started thinking about the issue your patch is trying to deal
> >>> with. IIUC you're trying to avoid traps (or silent errors) that could happen
> >>> because of concurrent accesses while patching is happening on an auipc/jalr pair.
> >>>
> >>> I'm wondering if instead, we could not actually handle the potential traps:
> >>> before storing the auipc + jalr pair, we could use a well-identified trapping
> >>> instruction that could be recognized in the trap handler as a legitimate trap.
> >>> For example:
> >>>
> >>>
> >>> auipc  -->  auipc  -->  XXXX  -->  XXXX  -->  auipc
> >>> jalr        XXXX        XXXX       jalr       jalr
> >>>
> >>>
> >>> If a core traps on a XXXX instruction, we know this address is being patched, so
> >>> we can return, and the patching will probably be over. We could also identify
> >>> a half-patched instruction word (I mean with only XX).
> >> Unfortunately this does not work without some fence.i in the middle. The
> >> processor is free to fetch any instruction that has been written to a location
> >> since the last fence.i instruction. So it would be perfectly valid to fetch the
> >> old aiupc and new jalr or vice versa and not trap. This would happen if, for
> >> example, the two instructions were in different cache lines, and only one of the
> >> cache lines got evicted and refilled.
> >>
> >> But sending an IPI to run the fence.i probably negates the performance benefit.
> > Maybe, like x86, we can hook ftrace_replace_code() and batch-send
> > IPIs to prevent storms of remote fences. The solution Alex
> > proposed can save code size at function entries. But we have to
> > send out remote fences at each "-->" transition, which is 4 sets of
> > remote IPIs. On the other hand, this series increases the per-function
> > patch size to 24 bytes. However, it decreases the number of remote
> > fences to 1 set.
> >
> > The performance hit could be observable for the auipc + jalr case,
> > because all remote cores would be executing the XXXX instructions and
> > taking a trap at each function entry during code patching.
> >
> > Besides, this series would give us a chance not to send any remote
> > fences if we were to change only the destination of ftrace (e.g. to a
> > custom ftrace trampoline). As it would be a regular store for the
> > writer and regular load for readers, only fence w,w is needed.
> > However, I am not very certain on how often would be for this
> > particular use case. I'd need some time to investigate it.
> >
> >> Maybe there is some creative way to overcome this.
> >>
> >>> But please let me know if that's completely stupid or if I did not understand the
> >>> problem. Since my patchset to support svvptc, I have been wondering whether it is
> >>> not more performant to actually take very unlikely traps instead of trying to avoid them.
> >> I agree in general it is a good idea to optimize the hot path like this.
> >>
> >> Regards,
> >> Samuel
> >>
> > Regards,
> > Andy
> >
>
>
> So indeed my solution was way too naive and we've been discussing that
> with Björn lately. He worked a lot on that and came up with the solution
> he proposed here
> https://lore.kernel.org/linux-riscv/87zfv0onre.fsf@all.your.base.are.belong.to.us/
>
> The thing is, ftrace seems to be quite broken: the ftrace kselftests
> raise a lot of issues, which I have started to debug but which are not
> that easy. So we are wondering whether *someone* should work on Björn's
> solution (or another, open to discussion) for 6.10. @Andy WDYT? Do you
> have free cycles? Björn could work on that too (and I'll help if needed).

Do you mean FTRACE_STARTUP_TEST, or something else? I am also
happy to help on text-patching issues. It would be great if we could
define the remaining work items and share them. Currently I am focusing
on having dynamic ftrace with preemption and getting rid of
stop_machine() while patching code. I am going to spin a revision of
this patch series in a few days if possible. There are quite a few
things that need to be discussed, and I'd like to join any conversation!
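
As a rough sketch of the batching idea mentioned above (my illustration;
ftrace_replace_code_batched() is a hypothetical name, while on_each_cpu()
and local_flush_icache_all() are existing kernel helpers):
------------------
#include <linux/smp.h>
#include <asm/cacheflush.h>

static void do_local_fence_i(void *unused)
{
	local_flush_icache_all();	/* executes fence.i on this hart */
}

static void ftrace_replace_code_batched(void)
{
	/*
	 * Step 1: patch every call site with patch_insn_write().
	 * Step 2: make the new code visible everywhere with a single
	 * round of IPIs instead of one round per patched site.
	 */
	on_each_cpu(do_local_fence_i, NULL, 1);
}
------------------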

>
> Let me know what you think!
>
> Alex
>
>

Cheers,
Andy
Andy Chiu March 20, 2024, 4:38 p.m. UTC | #17
On Tue, Mar 19, 2024 at 11:32 PM Evgenii Shatokhin
<e.shatokhin@yadro.com> wrote:
>
> Hi,
>
> On 18.03.2024 18:31, Andy Chiu wrote:
> > Hi Evgenii,
> >
> > Thanks for your help!
>
> You are welcome!
>
> >
> > I just rebased upon 6.8-rc1 and passed the stress-ng + ftrace/nop
> > testing. I will add some random tracers to test and some optimization
> > before sending out again. Here are a few things needed:
> >
> > On Thu, Feb 22, 2024 at 12:55 AM Evgenii Shatokhin
> > <e.shatokhin@yadro.com> wrote:
> >>
> >> On 21.02.2024 08:27, Andy Chiu wrote:
> >>>
> >>> On Wed, Feb 14, 2024 at 3:42 AM Evgenii Shatokhin <e.shatokhin@yadro.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> On 13.09.2022 12:42, Andy Chiu wrote:
> >>>>> [...]
> >>
> >> (a) Invalid instruction, because the address of ftrace_caller function
> >> was somehow written to the body of the traced function rather than just
> >> to the Ftrace prologue.
> >
> > The reason for this is probably that any one of your ftrace_*_call is
> > not 8-B aligned.
>
> I thought all locations where the address of the ftrace_caller function
> is written are 8-byte aligned, given that the compiler guarantees that start
> addresses of all functions are 4-byte aligned. Your patchset provides 2
> kinds of function prologues exactly for that purpose. Am I missing
> something?

Yes, that's true, and it covers the first step of ftrace, i.e. jumping
into the ftrace trampoline. The second step of ftrace is jumping to the
actual ftrace handler function. We have to use an 8-byte-aligned .text
address to store the pointer to that handler, so that it can be
atomically patched, or loaded, by dynamic ftrace.
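
For illustration, a patched function entry could look roughly like this
(a simplified sketch, not the exact instructions or offsets from the
series; it only shows why the naturally aligned 8-byte literal can be
updated with a single store):
------------------
	.align	2
func:				/* assume func is 8-byte aligned; a second
				   prologue variant is needed for entries
				   at 4 mod 8 addresses */
	auipc	t0, 0		/* t0 = &func */
	ld	t1, 16(t0)	/* load the 8-byte target slot */
	jalr	t1, t1, 0	/* indirect call into the trampoline */
	j	1f		/* skip over the embedded literal */
	.align	3		/* keep the slot naturally aligned */
	.dword	ftrace_stub	/* patched to ftrace_caller at runtime;
				   readers see the old or the new value,
				   never a torn one */
1:
	/* ... function body ... */
------------------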

>
> >
> >>
> >> In the following example, the crash happened at 0xffffffff800d3398. "b0
> >> d7" is actually not part of the code here, but rather the lower bytes of
> >> 0xffffffff8000d7b0, the address of ftrace_caller() in this kernel.
> >
> > It seems like there is a bug in patch_insn_write(). I think we should
> > at least disable migration during patch_map() and patch_unmap(). I'd
> > need some time to dig into patch_map(). But since __set_fixmap() only
> > flushes the local TLB, I'd assume it is not safe to context switch out and
> > migrate while holding the fixmap mapping. Adding preempt_disable()
> > and preempt_enable() before calling __patch_insn_write() solves the
> > issue.
> >
>
> Interesting.
> Thanks for pointing that out! I never thought that the task could migrate
> to a different CPU while patch_insn_write() is running. If it could,
> that would cause such issues, sure. And probably - the issues with
> "function_graph" too, if some data were corrupted that way rather than code.

I found another issue with function_graph in preemptible Vector,
although it is not directly related to function_graph. Currently we
don't support calling schedule() within kernel_vector_{begin,end}.
However, this could be inevitable with ftrace + preemption. For example,
preemptible vectorized uaccess could call into return_to_handler, and
then call schedule() when returning from kernel_vector_begin(). This can
cause the following Vector operation to fail with an illegal instruction
because VS was turned off during the context switch.

        kernel_vector_begin();
        //=> return_to_handler
        //==> ... schedule()
        remain = __asm_vector_usercopy(dst, src, n);
        kernel_vector_end();

Here is what we could do to support calling schedule() while in an
active preempt_v:

 static __always_inline void __vstate_csr_save(struct __riscv_v_ext_state *dest)
 {
        asm volatile (
@@ -243,6 +248,11 @@ static inline void __switch_to_vector(struct task_struct *prev,
        struct pt_regs *regs;

        if (riscv_preempt_v_started(prev)) {
+               if (riscv_v_is_on()) {
+                       WARN_ON(prev->thread.riscv_v_flags & RISCV_V_CTX_DEPTH_MASK);
+                       riscv_v_disable();
+                       prev->thread.riscv_v_flags |= RISCV_PREEMPT_V_IN_SCHEDULE;
+               }
                if (riscv_preempt_v_dirty(prev)) {
                        __riscv_v_vstate_save(&prev->thread.kernel_vstate,
                                              prev->thread.kernel_vstate.datap);
@@ -253,10 +263,16 @@ static inline void __switch_to_vector(struct task_struct *prev,
        riscv_v_vstate_save(&prev->thread.vstate, regs);
        }

-       if (riscv_preempt_v_started(next))
-               riscv_preempt_v_set_restore(next);
-       else
+       if (riscv_preempt_v_started(next)) {
+               if (next->thread.riscv_v_flags & RISCV_PREEMPT_V_IN_SCHEDULE) {
+                       next->thread.riscv_v_flags &= ~RISCV_PREEMPT_V_IN_SCHEDULE;
+                       riscv_v_enable();
+               } else {
+                       riscv_preempt_v_set_restore(next);
+               }
+       } else {
                riscv_v_vstate_set_restore(next, task_pt_regs(next));
+       }

 }

>
> >>
> >> [...]
> >>
> >> (b) Jump to an invalid location, e.g. to the middle of a valid 4-byte
> >> instruction. %ra usually points right after the last instruction, "jalr
> >>     a2", in return_to_handler() in such cases, so the jump was likely
> >> made from there.
> >
> > I haven't done fgraph tests yet. I will try out and see.

With the above fixed, I can pass several hundred (and counting)
rounds of the random tracer + stress-ng --hrtimers test.


> [...]
Alexandre Ghiti March 21, 2024, 11:02 a.m. UTC | #18
On 20/03/2024 17:36, Andy Chiu wrote:
> On Wed, Mar 20, 2024 at 1:37 AM Alexandre Ghiti <alex@ghiti.fr> wrote:
>> Hi Andy,
>>
>> On 18/03/2024 16:31, Andy Chiu wrote:
>>> Hi Evgenii,
>>>
>>> Thanks for your help!
>>>
>>> I just rebased upon 6.8-rc1 and passed the stress-ng + ftrace/nop
>>> testing. I will add some random tracers to test and some optimization
>>> before sending out again. Here are a few things needed:
>>>
>>> On Thu, Feb 22, 2024 at 12:55 AM Evgenii Shatokhin
>>> <e.shatokhin@yadro.com> wrote:
>>>> On 21.02.2024 08:27, Andy Chiu wrote:
>>>>> «Внимание! Данное письмо от внешнего адресата!»
>>>>>
>>>>> On Wed, Feb 14, 2024 at 3:42 AM Evgenii Shatokhin <e.shatokhin@yadro.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 13.09.2022 12:42, Andy Chiu wrote:
>>>>>>> This patch removes dependency of dynamic ftrace from calling
>>>>>>> stop_machine(), and makes it compatiable with kernel preemption.
>>>>>>> Originally, we ran into stack corruptions, or execution of partially
>>>>>>> updated instructions when starting or stopping ftrace on a fully
>>>>>>> preemptible kernel configuration. The reason is that kernel periodically
>>>>>>> calls rcu_momentary_dyntick_idle() on cores waiting for the code-patching
>>>>>>> core running in ftrace. Though rcu_momentary_dyntick_idle() itself is
>>>>>>> marked as notrace, it would call a bunch of tracable functions if we
>>>>>>> configured the kernel as preemptible. For example, these are some functions
>>>>>>> that happened to have a symbol and have not been marked as notrace on a
>>>>>>> RISC-V preemptible kernel compiled with GCC-11:
>>>>>>>      - __rcu_report_exp_rnp()
>>>>>>>      - rcu_report_exp_cpu_mult()
>>>>>>>      - rcu_preempt_deferred_qs()
>>>>>>>      - rcu_preempt_need_deferred_qs()
>>>>>>>      - rcu_preempt_deferred_qs_irqrestore()
>>>>>>>
>>>>>>> Thus, this make it not ideal for us to rely on stop_machine() and
>>>>>>> handly marked "notrace"s to perform runtime code patching. To remove
>>>>>>> such dependency, we must make updates of code seemed atomic on running
>>>>>>> cores. This might not be obvious for RISC-V since it usaually uses a pair
>>>>>>> of AUIPC + JALR to perform a long jump, which cannot be modified and
>>>>>>> executed concurrently if we consider preemptions. As such, this patch
>>>>>>> proposed a way to make it possible. It embeds a 32-bit rel-address data
>>>>>>> into instructions of each ftrace prologue and jumps indirectly. In this
>>>>>>> way, we could store and load the address atomically so that the code
>>>>>>> patching core could run simutaneously with the rest of running cores.
>>>>>>>
>>>>>>> After applying the patchset, we compiled a preemptible kernel with all
>>>>>>> tracers and ftrace-selftest enabled, and booted it on a 2-core QEMU virt
>>>>>>> machine. The kernel could boot up successfully, passing all ftrace
>>>>>>> testsuits. Besides, we ran a script that randomly pick a tracer on every
>>>>>>> 0~5 seconds. The kernel has sustained over 20K rounds of the test. In
>>>>>>> contrast, a preemptible kernel without our patch would panic in few
>>>>>>> rounds on the same machine.
>>>>>>>
>>>>>>> Though we ran into errors when using hwlat or irqsoff tracers together
>>>>>>> with cpu-online stressor from stress-ng on a preemptible kernel. We
>>>>>>> believe the reason may be that  percpu workers of the tracers are being
>>>>>>> queued into unbounded workqueue when cpu get offlined and patches will go
>>>>>>> through tracing tree.
>>>>>>>
>>>>>>> Additionally, we found patching of tracepoints unsafe since the
>>>>>>> instructions being patched are not naturally aligned. This may result in
>>>>>>> 2 half-word stores, which breaks atomicity, during the code patching.
>>>>>>>
>>>>>>> changes in patch v2:
>>>>>>>      - Enforce alignments on all functions with a compiler workaround.
>>>>>>>      - Support 64bit addressing for ftrace targets if xlen == 64
>>>>>>>      - Initialize ftrace target addresses to avoid calling bad address in a
>>>>>>>        hypothesized case.
>>>>>>>      - Use LGPTR instead of SZPTR since .align is log-scaled for
>>>>>>>        mcount-dyn.S
>>>>>>>      - Require the nop instruction of all jump_labels aligns naturally on
>>>>>>>        4B.
>>>>>>>
>>>>>>> Andy Chiu (5):
>>>>>>>       riscv: align ftrace to 4 Byte boundary and increase ftrace prologue
>>>>>>>         size
>>>>>>>       riscv: export patch_insn_write
>>>>>>>       riscv: ftrace: use indirect jump to work with kernel preemption
>>>>>>>       riscv: ftrace: do not use stop_machine to update code
>>>>>>>       riscv: align arch_static_branch function
>>>>>>>
>>>>>>>      arch/riscv/Makefile                 |   2 +-
>>>>>>>      arch/riscv/include/asm/ftrace.h     |  24 ----
>>>>>>>      arch/riscv/include/asm/jump_label.h |   2 +
>>>>>>>      arch/riscv/include/asm/patch.h      |   1 +
>>>>>>>      arch/riscv/kernel/ftrace.c          | 179 ++++++++++++++++++++--------
>>>>>>>      arch/riscv/kernel/mcount-dyn.S      |  69 ++++++++---
>>>>>>>      arch/riscv/kernel/patch.c           |   4 +-
>>>>>>>      7 files changed, 188 insertions(+), 93 deletions(-)
>>>>>>>
>>>>>> First of all, thank you for working on making dynamic Ftrace robust in
>>>>>> preemptible kernels on RISC-V.
>>>>>> It is an important use case but, for now, dynamic Ftrace and related
>>>>>> tracers cannot be safely used with such kernels.
>>>>>>
>>>>>> Are there any updates on this series?
>>>>>> It needs a rebase, of course, but it looks doable.
>>>>>>
>>>>>> If I understand the discussion correctly, the only blocker was that
>>>>>> using "-falign-functions" was not enough to properly align cold
>>>>>> functions and "-fno-guess-branch-probability" would likely have a
>>>>>> performance cost.
>>>>>>
>>>>>> It seems, GCC developers have recently provided a workaround for that
>>>>>> (https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326,
>>>>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88345#c24).
>>>>>>
>>>>>> "-fmin-function-alignment" should help but, I do not know, which GCC
>>>>>> versions have got that patch already. In the meantime, one could
>>>>>> probably check if "-fmin-function-alignment" is supported by the
>>>>>> compiler and use it, if it is.
>>>>>>
>>>>>> Thoughts?
>>>>> Hi Evgenii,
>>>>>
>>>>> Thanks for the update. Indeed, it is essential to this patch for
>>>>> toolchain to provide forced alignment. We can test this flag in the
>>>>> Makefile to sort out if toolchain supports it or not. Meanwhile, I had
>>>>> figured out a way for this to work on any 2-B align addresses but
>>>>> hadn't implemented it out yet. Basically it would require more
>>>>> patching space for us to do software alignment. I would opt for a
>>>>> special toolchain flag if the toolchain just supports it.
>>>>>
>>>>> Let me take some time to look and get back to you soon.
>>>> Thank you! Looking forward to it.
>>>>
>>>> In case it helps, here is what I have checked so far.
>>>>
>>>> 1.
>>>> I added the patch
>>>> https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=0f5a9a00e3ab1fe96142f304cfbcf3f63b15f326
>>>> to the current revision of GCC 13.2.0 from RISC-V toolchain.
>>>>
>>>> Rebased your patchset on top of Linux 6.8-rc4 (mostly - context changes,
>>>> SYM_FUNC_START/SYM_FUNC_END for asm symbols, etc.).
>>>>
>>>> Reverted 8547649981e6 ("riscv: ftrace: Fixup panic by disabling
>>>> preemption").
>>>>
>>>> Switched from -falign-functions=4 to -fmin-function-alignment=4:
>>>> ------------------
>>>> diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
>>>> index b33b787c8b07..dcd0adeebaae 100644
>>>> --- a/arch/riscv/Makefile
>>>> +++ b/arch/riscv/Makefile
>>>> @@ -15,9 +15,9 @@ ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
>>>>           LDFLAGS_vmlinux += --no-relax
>>>>           KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
>>>>     ifeq ($(CONFIG_RISCV_ISA_C),y)
>>>> -       CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -falign-functions=4
>>>> +       CC_FLAGS_FTRACE := -fpatchable-function-entry=12 -fmin-function-alignment=4
>>>>     else
>>>> -       CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -falign-functions=4
>>>> +       CC_FLAGS_FTRACE := -fpatchable-function-entry=6 -fmin-function-alignment=4
>>>>     endif
>>>>     endif
>>>>
>>>> ------------------
>>>>
>>>> As far as I can see from objdump, the functions that were not aligned
>>>> at a 4-byte boundary with -falign-functions=4 are now aligned correctly
>>>> with -fmin-function-alignment=4.
>>>>
>>>> 2.
>>>> I tried the kernel in a QEMU VM with 2 CPUs and "-machine virt".
>>>>
>>>> The boot-time tests for Ftrace passed, except the tests for
>>>> function_graph. I described the failure and the possible fix here:
>>>> https://lore.kernel.org/all/dcc5976d-635a-4710-92df-94a99653314e@yadro.com/
>>> Indeed, this is needed. I am not sure why the ftrace boot-time tests
>>> passed for me back then. Thank you for solving it!
>>>
>>>> 3.
>>>> There were also boot-time warnings about "RCU not on for:
>>>> arch_cpu_idle+0x0/0x2c". These are probably not related to your
>>>> patchset, but rather to the fact that Ftrace is enabled in a
>>>> preemptible kernel, where RCU does different things.
>>>>
>>>> As a workaround, I disabled tracing of arch_cpu_idle() for now:
>>>> ------------------
>>>> diff --git a/arch/riscv/kernel/process.c b/arch/riscv/kernel/process.c
>>>> index 92922dbd5b5c..6abeecbfc51d 100644
>>>> --- a/arch/riscv/kernel/process.c
>>>> +++ b/arch/riscv/kernel/process.c
>>>> @@ -37,7 +37,7 @@ EXPORT_SYMBOL(__stack_chk_guard);
>>>>
>>>>     extern asmlinkage void ret_from_fork(void);
>>>>
>>>> -void arch_cpu_idle(void)
>>>> +void noinstr arch_cpu_idle(void)
>>>>     {
>>>>           cpu_do_idle();
>>>>     }
>>>>
>>>> ------------------
>>>>
>>>> 4.
>>>> Stress-testing revealed an issue though, which I do not understand yet.
>>>>
>>>> Probably similar to what you did earlier, I ran a script that switched
>>>> the current tracer between "function", "function_graph", "nop", and
>>>> "blk" every 1-5 seconds. In another shell, "stress-ng --hrtimers 1" was
>>>> running.
>>>>
>>>> The kernel usually crashed within a few minutes, in seemingly random
>>>> locations, but often in one of two ways:
>>>>
>>>> (a) Invalid instruction, because the address of the ftrace_caller()
>>>> function was somehow written to the body of the traced function rather
>>>> than just to the Ftrace prologue.
>>> The reason for this is probably that one of your ftrace_*_call sites
>>> is not 8-byte aligned.
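>>>
>>> Just to illustrate the constraint (a hypothetical snippet, not the
>>> actual mcount-dyn.S code; the label name is made up): the 64-bit target
>>> that gets patched has to be naturally aligned, so that a single aligned
>>> store can update the whole pointer atomically:
>>> ------------------
>>> 	.align	3		/* 2^3 = 8-byte alignment */
>>> ftrace_call_dest:		/* illustrative label */
>>> 	.dword	ftrace_stub	/* rewritten at runtime; 8-byte alignment
>>> 				   keeps the update a single aligned store */
>>> ------------------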
>>>
>>>> In the following example, the crash happened at 0xffffffff800d3398. "b0
>>>> d7" is actually not part of the code here, but rather the lower bytes of
>>>> 0xffffffff8000d7b0, the address of ftrace_caller() in this kernel.
>>> It seems like there is a bug in patch_insn_write(). I think we should
>>> at least disable migration during patch_map() and patch_unmap(). I'd
>>> need some time to dig into patch_map(). But since __set_fixmap() only
>>> flushes the local TLB, I'd assume it is not safe to be context-switched
>>> out and migrated while holding the fixmap mapping. Adding
>>> preempt_disable() and preempt_enable() around the call to
>>> __patch_insn_write() solves the issue.
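>>>
>>> Roughly what I have in mind (a sketch only; the exact function
>>> signatures in patch.c may differ):
>>> ------------------
>>> static int patch_insn_write(void *addr, const void *insn, size_t len)
>>> {
>>> 	int ret;
>>>
>>> 	/*
>>> 	 * __set_fixmap() only flushes the local TLB, so keep the task
>>> 	 * on this CPU while the fixmap mapping is in use.
>>> 	 */
>>> 	preempt_disable();
>>> 	ret = __patch_insn_write(addr, insn, len);
>>> 	preempt_enable();
>>>
>>> 	return ret;
>>> }
>>> ------------------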
>>
>> Yes, Andrea already mentioned this, I came up with the same idea of
>> preempt_disable() but then I noticed arm64 actually disables IRQ: any
>> idea why?
>> https://lore.kernel.org/linux-riscv/CAHVXubj7ChgpvN4F_QO0oASaT5WC2VS0Q-bEqhnmF8z8QV=yDQ@mail.gmail.com/
> Hi, I took a quick look, and it seems to me that this is a software
> design choice. ARM uses a spinlock to protect text patching while we use
> a mutex. If they have a requirement to patch code while IRQs are off
> (maybe in an IPI handler), then the only viable option would be to use
> raw_spin_lock_irqsave. I think preempt_disable() should be enough for us
> if we use text_mutex to protect patching. Or am I missing something?
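>
> To make the contrast concrete (a hedged sketch; patch_lock and
> __do_patch are illustrative names, not the real arm64 or riscv code):
> ------------------
> /* arm64-style: patching may be called with IRQs off (e.g. from an
>  * IPI handler), so a raw spinlock with irqsave is the safe choice. */
> unsigned long flags;
>
> raw_spin_lock_irqsave(&patch_lock, flags);
> __do_patch(addr, insn);
> raw_spin_unlock_irqrestore(&patch_lock, flags);
>
> /* riscv-style: all callers can sleep and serialize on text_mutex,
>  * so disabling preemption around the fixmap window is enough. */
> mutex_lock(&text_mutex);
> preempt_disable();
> __do_patch(addr, insn);
> preempt_enable();
> mutex_unlock(&text_mutex);
> ------------------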


I agree with you, I convinced myself that it should be enough :)

Do you intend to send this patch, or should I? I have another small fix 
for ftrace, so I don't mind sending this one. Up to you; we just need to 
make sure it lands in 6.9 :)

Thanks


>>
>>>> (gdb) disas /r 0xffffffff800d3382,+0x20
>>>> Dump of assembler code from 0xffffffff800d3382 to 0xffffffff800d33a2:
>>>> ...
>>>>     0xffffffff800d3394 <clockevents_program_event+144>:  ba 87   mv      a5,a4
>>>>     0xffffffff800d3396 <clockevents_program_event+146>:  c1 bf   j       0xffffffff800d3366 <clockevents_program_event+98>
>>>>     0xffffffff800d3398 <clockevents_program_event+148>:  b0 d7   sw      a2,104(a5)  // lower bytes of 0xffffffff8000d7b0, the address of ftrace_caller()
>>>>     0xffffffff800d339a <clockevents_program_event+150>:  00 80   .2byte  0x8000
>>>>     0xffffffff800d339c <clockevents_program_event+152>:  ff ff   .2byte  0xffff
>>>>     0xffffffff800d339e <clockevents_program_event+154>:  ff ff   .2byte  0xffff
>>>>     0xffffffff800d33a0 <clockevents_program_event+156>:  d5 bf   j       0xffffffff800d3394 <clockevents_program_event+144>
>>>>
>>>> The backtrace usually contains one or more occurrences of
>>>> return_to_handler() in this case.
>>>>
>>>> [  260.520394] [<ffffffff800d3398>] clockevents_program_event+0xac/0x100
>>>> [  260.521195] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>>> [  260.521843] [<ffffffff800c50ba>] hrtimer_interrupt+0x122/0x20c
>>>> [  260.522492] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>>> [  260.523132] [<ffffffff8009785e>] handle_percpu_devid_irq+0x9e/0x1ec
>>>> [  260.523788] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>>> [  260.524437] [<ffffffff8000d2bc>] return_to_handler+0x0/0x26
>>>> [  260.525080] [<ffffffff80a8acfa>] handle_riscv_irq+0x4a/0x74
>>>> [  260.525726] [<ffffffff80a97b9a>] call_on_irq_stack+0x32/0x40
>>>> ----------------------
>>>>
>>>> (b) Jump to an invalid location, e.g. to the middle of a valid 4-byte
>>>> instruction. %ra usually points right after the last instruction,
>>>> "jalr a2", in return_to_handler() in such cases, so the jump was
>>>> likely made from there.
>>> I haven't done fgraph tests yet. I will try out and see.
>>>
>>>> The problem is reproducible, although I have not found what causes it yet.
>>>>
>>>> Any help is appreciated, of course.
>>>>
>>>>>> Regards,
>>>>>> Evgenii
>>>>> Regards,
>>>>> Andy
>>> Also, here is another side note,
>>>
>>> It seems like the ftrace save/restore routine should save more
>>> registers, as clang's fastcc may use t2 when the number of arguments
>>> exceeds what the ABI defines for passing arguments through registers.
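>>>
>>> For illustration only (an assumed fragment, not a tested patch), the
>>> trampoline in mcount-dyn.S would then need to preserve t2 as well,
>>> something like:
>>> ------------------
>>> 	addi	sp, sp, -(SZREG)
>>> 	REG_S	t2, 0(sp)	/* fastcc may pass data in t2 */
>>> 	...
>>> 	REG_L	t2, 0(sp)
>>> 	addi	sp, sp, SZREG
>>> ------------------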
>>>
>>> Cheers,
>>> Andy
>>>