mbox series

[v4,0/8] Add OPTPROBES feature on RISCV

Message ID 20221106100316.2803176-1-chenguokai17@mails.ucas.ac.cn (mailing list archive)
Headers show
Series Add OPTPROBES feature on RISCV | expand

Message

Xim Nov. 6, 2022, 10:03 a.m. UTC
From: Liao Chang <liaoclark@163.com>

From: Liao Chang <liaochang1@huawei.com>

Add jump optimization support for RISC-V.

Replaces ebreak instructions used by normal kprobes with an
auipc+jalr instruction pair, at the aim of suppressing the probe-hit
overhead.

All known optprobe-capable RISC architectures have been using a single
jump or branch instructions while this patch chooses not. RISC-V has a
quite limited jump range (4KB or 2MB) for both its branch and jump
instructions, which prevent optimizations from supporting probes that
spread all over the kernel.

Auipc-jalr instruction pair is introduced with a much wider jump range
(4GB), where auipc loads the upper 12 bits to a free register and jalr
Deaconappends the lower 20 bits to form a 32 bit immediate. Note that
returns from probe handler requires another free register. As kprobes
can appear almost anywhere inside the kernel, the free register should
be found in a generic way, not depending on calling convention or any
other regulations.

The algorithm for finding the free register is inspired by the register
renaming in modern processors. From the perspective of register renaming,
a register could be represented as two different registers if two neighbour
instructions both write to it but no one ever reads. Extending this fact,
a register is considered to be free if there is no read before its next
write in the execution flow. We are free to change its value without
interfering normal execution.

Static analysis shows that 51% instructions of the kernel (default config)
is capable of being replaced i.e. one free register can be found at both
the start and end of replaced instruction pairs while the replaced
instructions can be directly executed.

Contribution:
Chen Guokai invents the algorithm of searching free register, evaluate
the ratio of optimizaion, the basic function support RVI kernel binary.
Liao Chang adds the support for hybrid RVI and RVC kernel binary, fix
some bugs with different kernel configure, refactor out entire feature
into some individual patches.

v4:
Correct the sequence of Signed-off-by and Co-developed-by.

v3:
1. Support of hybrid RVI and RVC kernel binary.
2. Refactor out entire feature into some individual patches.

v2:
1. Adjust comments
2. Remove improper copyright
3. Clean up format issues that is no common practice
4. Extract common definition of instruction decoder
5. Fix race issue in SMP platform.

v1:
Chen Guokai contribute the basic functionality code.

Liao Chang (8):
  riscv/kprobe: Prepare the skeleton to implement RISCV OPTPROBES
    feature
  riscv/kprobe: Allocate detour buffer from module area
  riscv/kprobe: Prepare the skeleton to prepare optimized kprobe
  riscv/kprobe: Add common RVI and RVC instruction decoder code
  riscv/kprobe: Search free register(s) to clobber for 'AUIPC/JALR'
  riscv/kprobe: Add code to check if kprobe can be optimized
  riscv/kprobe: Prepare detour buffer for optimized kprobe
  riscv/kprobe: Patch AUIPC/JALR pair to optimize kprobe

 arch/riscv/Kconfig                        |   1 +
 arch/riscv/include/asm/bug.h              |   5 +-
 arch/riscv/include/asm/kprobes.h          |  48 ++
 arch/riscv/include/asm/patch.h            |   1 +
 arch/riscv/kernel/patch.c                 |  22 +-
 arch/riscv/kernel/probes/Makefile         |   1 +
 arch/riscv/kernel/probes/decode-insn.h    | 145 ++++++
 arch/riscv/kernel/probes/kprobes.c        |  25 +
 arch/riscv/kernel/probes/opt.c            | 602 ++++++++++++++++++++++
 arch/riscv/kernel/probes/opt_trampoline.S | 137 +++++
 arch/riscv/kernel/probes/simulate-insn.h  |  41 ++
 11 files changed, 1023 insertions(+), 5 deletions(-)
 create mode 100644 arch/riscv/kernel/probes/opt.c
 create mode 100644 arch/riscv/kernel/probes/opt_trampoline.S

Comments

Björn Töpel Nov. 7, 2022, 4:54 p.m. UTC | #1
Chen Guokai <chenguokai17@mails.ucas.ac.cn> writes:

> From: Liao Chang <liaoclark@163.com>
>
> From: Liao Chang <liaochang1@huawei.com>
>
> Add jump optimization support for RISC-V.

Thanks for working on this! I have some comments on the series, but I'll
do that on a per-patch basis.

Have you run the series on real hardware, or just qemu?


Cheers,
Björn
Xim Nov. 8, 2022, 11:04 a.m. UTC | #2
Hi Björn,

Thanks for your great review! Some explanations below.

> 2022年11月8日 00:54,Björn Töpel <bjorn@kernel.org> 写道:
> 
> Have you run the series on real hardware, or just qemu?

Currently only qemu tests are made, I will try to test it on a FPGA real hardware soon.

> AFAIU, the algorithm only tracks registers that are *in use*. You are
> already scanning the whole function (next patch). What about the caller
> saved registers that are *not* used by the function in the probe range?
> Can those, potentially unused, regs be used?

Great missing part! I have made a static analyzation right upon receiving this mail.
The result shows that this newly purposed idea reaches about the same
success rate on my test set (rv64 defconf with RVI only) while when combined,
they can reach a higher success rate, 1/3 above their baseline. A patch that
includes this strategy will be sent soon.
> 
>> +static void arch_find_register(unsigned long start, unsigned long end,
> 
> Nit; When I see "arch_" I think it's functionality that can be
> overridden per-arch. This is not the case, but just a helper for RV.

It can be explained from two aspects. First, it can be extended to most RISC
archs, which can be extracted into the common flow of Kprobe. Second, it is indeed
a internal helper for now, so I will correct the name in the next version.

>> static void find_free_registers(struct kprobe *kp, struct optimized_kprobe *op,
>> -				int *rd1, int *rd2)
>> +				int *rd, int *ra)
> 
> Nit; Please get rid of this code churn, just name the parameters
> correctly on introduction in the previous patch.

Will be fixed.

>> +	*rd = ((kw | ow) == 1UL) ? 0 : __builtin_ctzl((kw | ow) & ~1UL);
>> +	*ra = (kw == 1UL) ? 0 : __builtin_ctzl(kw & ~1UL);
> 
> Hmm, __builtin_ctzl is undefined for 0, right? Can that be triggered
> here?

Will be fixed.

Regards,
Guokai Chen
Liao, Chang Nov. 8, 2022, 11:33 a.m. UTC | #3
在 2022/11/8 19:04, Xim 写道:
> Hi Björn,
> 
> Thanks for your great review! Some explanations below.
> 
>> 2022年11月8日 00:54,Björn Töpel <bjorn@kernel.org> 写道:
>>
>> Have you run the series on real hardware, or just qemu?
> 
> Currently only qemu tests are made, I will try to test it on a FPGA real hardware soon.
> 
>> AFAIU, the algorithm only tracks registers that are *in use*. You are
>> already scanning the whole function (next patch). What about the caller
>> saved registers that are *not* used by the function in the probe range?
>> Can those, potentially unused, regs be used?
> 
> Great missing part! I have made a static analyzation right upon receiving this mail.
> The result shows that this newly purposed idea reaches about the same
> success rate on my test set (rv64 defconf with RVI only) while when combined,
> they can reach a higher success rate, 1/3 above their baseline. A patch that
> includes this strategy will be sent soon.
>>
>>> +static void arch_find_register(unsigned long start, unsigned long end,
>>
>> Nit; When I see "arch_" I think it's functionality that can be
>> overridden per-arch. This is not the case, but just a helper for RV.
> 
> It can be explained from two aspects. First, it can be extended to most RISC
> archs, which can be extracted into the common flow of Kprobe. Second, it is indeed
> a internal helper for now, so I will correct the name in the next version.
> 
>>> static void find_free_registers(struct kprobe *kp, struct optimized_kprobe *op,
>>> -				int *rd1, int *rd2)
>>> +				int *rd, int *ra)
>>
>> Nit; Please get rid of this code churn, just name the parameters
>> correctly on introduction in the previous patch.
> 
> Will be fixed.
> 
>>> +	*rd = ((kw | ow) == 1UL) ? 0 : __builtin_ctzl((kw | ow) & ~1UL);
>>> +	*ra = (kw == 1UL) ? 0 : __builtin_ctzl(kw & ~1UL);
>>
>> Hmm, __builtin_ctzl is undefined for 0, right? Can that be triggered
>> here?

This corner case has been taken into account, look these condition parts,
if kw == 1UL this expression will return 0 directly, no chance to invoke __builtin_ctzl.

Thanks.

> 
> Will be fixed.
> 
> Regards,
> Guokai Chen
>
Björn Töpel Nov. 8, 2022, 1:12 p.m. UTC | #4
"liaochang (A)" <liaochang1@huawei.com> writes:

>>>> +	*rd = ((kw | ow) == 1UL) ? 0 : __builtin_ctzl((kw | ow) & ~1UL);
>>>> +	*ra = (kw == 1UL) ? 0 : __builtin_ctzl(kw & ~1UL);
>>>
>>> Hmm, __builtin_ctzl is undefined for 0, right? Can that be triggered
>>> here?
>
> This corner case has been taken into account, look these condition parts,
> if kw == 1UL this expression will return 0 directly, no chance to invoke __builtin_ctzl.

Indeed! Thanks for making that clear! Looking forward to the next
revision!


Björn