
arm64: insn: Simulate nop and push instruction for better uprobe performance

Message ID 20240814080356.2639544-1-liaochang1@huawei.com (mailing list archive)
State Superseded
Delegated to: Masami Hiramatsu
Series arm64: insn: Simulate nop and push instruction for better uprobe performance

Commit Message

Liao, Chang Aug. 14, 2024, 8:03 a.m. UTC
As Andrii pointed out, the uprobe/uretprobe selftest bench runs into a
counterintuitive result: the nop and push variants are much slower than
the ret variant [0]. The root cause lies in arch_probe_analyse_insn(),
which excludes 'nop' and 'stp' from the list of emulatable instructions.
This forces the kernel to return to userspace, execute them out-of-line,
and then trap back into the kernel to run the uprobe callback functions.
This leads to a significant performance overhead compared to the 'ret'
variant, which is already emulated.

Typically, a uprobe is installed on a 'nop' instruction for USDT, or on a
function entry that starts with the instruction 'stp x29, x30, [sp, #imm]!'
to push lr and fp onto the stack, regardless of whether the binary is a
kernel or userspace one. In order to improve the performance of handling
uprobes for these common use cases, this patch supports emulation of the
Arm64 equivalents of the 'nop' and 'push' instructions. The benchmark
results below indicate an obvious performance gain from emulation.

On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.

xol (1 cpus)
------------
uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)

emulation (1 cpus)
-------------------
uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)

As shown above, the performance gap between the 'nop/push' and 'ret'
variants has been significantly reduced. Because emulating the 'push'
instruction needs to access userspace memory, it spends more cycles than
the others.

[0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
 arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
 arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
 arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
 arch/arm64/kernel/probes/simulate-insn.h |  2 ++
 arch/arm64/kernel/probes/uprobes.c       |  2 +-
 6 files changed, 70 insertions(+), 4 deletions(-)

Comments

Mark Rutland Aug. 15, 2024, 9:58 a.m. UTC | #1
On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
> counterintuitive result that nop and push variants are much slower than
> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
> which excludes 'nop' and 'stp' from the emulatable instructions list.
> This force the kernel returns to userspace and execute them out-of-line,
> then trapping back to kernel for running uprobe callback functions. This
> leads to a significant performance overhead compared to 'ret' variant,
> which is already emulated.

I appreciate this might be surprising, but does it actually matter
outside of a microbenchmark?

> Typicall uprobe is installed on 'nop' for USDT and on function entry
> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
> and fp into stack regardless kernel or userspace binary. 

Function entry doesn't always start with a STP; these days it's often a
BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
the compiler), it could be any arbitrary instruction. This might happen
to be the common case today, but there are certainly codebases where it
is not.

STP (or any instruction that accesses memory) is fairly painful to
emulate because you need to ensure that the correct atomicity and
ordering properties are provided (e.g. an aligned STP should be
single-copy-atomic, but copy_to_user() doesn't guarantee that except by
chance), and that the correct VMSA behaviour is provided (e.g. when
interacting with MTE, POE, etc, while the uaccess primitives don't try
to be 100% equivalent to instructions in userspace).

For those reasons, in general I don't think we should be emulating any
instruction which accesses memory, and we should not try to emulate the
STP, but I think it's entirely reasonable to emulate NOP.

> In order to
> improve the performance of handling uprobe for common usecases. This
> patch supports the emulation of Arm64 equvialents instructions of 'nop'
> and 'push'. The benchmark results below indicates the performance gain
> of emulation is obvious.
> 
> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> 
> xol (1 cpus)
> ------------
> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> 
> emulation (1 cpus)
> -------------------
> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
> 
> As shown above, the performance gap between 'nop/push' and 'ret'
> variants has been significantly reduced. Due to the emulation of 'push'
> instruction needs to access userspace memory, it spent more cycles than
> the other.
> 
> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> 
> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> ---
>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
>  6 files changed, 70 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> index 8c0a36f72d6f..a246e6e550ba 100644
> --- a/arch/arm64/include/asm/insn.h
> +++ b/arch/arm64/include/asm/insn.h
> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>  	       aarch64_insn_is_prfm_lit(insn);
>  }
>  
> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
> +{
> +	/* nop */
> +	return aarch64_insn_is_hint(insn) &&
> +	       ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
> +}

This looks fine, but the comment can go.

> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
> +{
> +	/*
> +	 * The 1st instruction on function entry often follows the
> +	 * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
> +	 * into stack.
> +	 */
> +	return aarch64_insn_is_stp_pre(insn) &&
> +	       (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
> +	       (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
> +	       (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x29 */
> +	       (((insn >>  0) & 0x1F) == 29);	/* Rt  is x30 */
> +}

We have accessors for these fields. Please use them.

Regardless, as above I do not think we should have a helper this
specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
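
For illustration, using the existing accessors could look roughly like the
sketch below (a sketch only, not a suggested final form: it assumes the
AARCH64_INSN_REGTYPE_* / AARCH64_INSN_REG_* definitions already in
<asm/insn.h>, while the 64-bit opc check (bits [31:30] == 0b10) still needs
an explicit mask since there is no dedicated accessor for that field):

| static bool is_stp_fp_lr_sp_pre(u32 insn)
| {
| 	/* 'stp x29, x30, [sp, #imm]!': Rt is fp, Rt2 is lr, Rn is sp */
| 	return aarch64_insn_is_stp_pre(insn) &&
| 	       ((insn >> 30) & 0x3) == 2 &&
| 	       aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, insn) ==
| 			AARCH64_INSN_REG_FP &&
| 	       aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT2, insn) ==
| 			AARCH64_INSN_REG_LR &&
| 	       aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RN, insn) ==
| 			AARCH64_INSN_REG_SP;
| }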

>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
> index 968d5fffe233..df7ca16fc763 100644
> --- a/arch/arm64/kernel/probes/decode-insn.c
> +++ b/arch/arm64/kernel/probes/decode-insn.c
> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
>   */
>  enum probe_insn __kprobes
> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
> +		      bool kernel)
>  {
> +	/*
> +	 * While 'nop' and 'stp x29, x30, [sp, #imm]! instructions can
> +	 * execute in the out-of-line slot, simulating them in breakpoint
> +	 * handling offers better performance.
> +	 */
> +	if (aarch64_insn_is_nop(insn)) {
> +		api->handler = simulate_nop;
> +		return INSN_GOOD_NO_SLOT;
> +	} else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
> +		api->handler = simulate_stp_fp_lr_sp_64b;
> +		return INSN_GOOD_NO_SLOT;
> +	}

With the STP emulation gone, you won't need the kernel parameter here.

> +
>  	/*
>  	 * Instructions reading or modifying the PC won't work from the XOL
>  	 * slot.
> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
>  		else
>  			scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
>  	}
> -	decoded = arm_probe_decode_insn(insn, &asi->api);
> +	decoded = arm_probe_decode_insn(insn, &asi->api, true);
>  
>  	if (decoded != INSN_REJECTED && scan_end)
>  		if (is_probed_address_atomic(addr - 1, scan_end))
> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
> index 8b758c5a2062..ec4607189933 100644
> --- a/arch/arm64/kernel/probes/decode-insn.h
> +++ b/arch/arm64/kernel/probes/decode-insn.h
> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
>  #endif
>  enum probe_insn __kprobes
> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
> +		      bool kernel);
>  
>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
> index 22d0b3252476..0b1623fa7003 100644
> --- a/arch/arm64/kernel/probes/simulate-insn.c
> +++ b/arch/arm64/kernel/probes/simulate-insn.c
> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>  
>  	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>  }
> +
> +void __kprobes
> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> +{
> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> +}

Hmm, this forgets to update the single-step state machine and PSTATE.BT,
and that's an extant bug in arch_uprobe_post_xol(). This can be:

| void __kprobes
| simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
| {
| 	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
| }

> +
> +void __kprobes
> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
> +{
> +	long imm7;
> +	u64 buf[2];
> +	long new_sp;
> +
> +	imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
> +	new_sp = regs->sp + (imm7 << 3);

We have accessors for these fields, please use them.
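
For reference, a sketch of the same decode using the existing immediate
accessor (assuming AARCH64_INSN_IMM_7 maps to the STP pre-index imm7 field;
the accessor returns the raw field bits, so the sign extension and the
8-byte scaling stay with the caller):

| 	imm7 = sign_extend64(aarch64_insn_decode_immediate(AARCH64_INSN_IMM_7,
| 							    opcode), 6);
| 	new_sp = regs->sp + (imm7 << 3);
| 
| 	/* e.g. for 'stp x29, x30, [sp, #-16]!': imm7 == -2, new_sp == sp - 16 */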

> +
> +	buf[0] = regs->regs[29];
> +	buf[1] = regs->regs[30];
> +
> +	if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
> +		force_sig(SIGSEGV);
> +		return;
> +	}

As above, this won't interact with VMSA features (e.g. MTE, POE) in the
same way as an STP in userspace, and this will not have the same
atomicity properties as an STP.

> +
> +	regs->sp = new_sp;
> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);

Likewise, this would need to use arm64_skip_faulting_instruction(),
though as above I think we should drop STP emulation entirely.

Mark.
Andrii Nakryiko Aug. 15, 2024, 5:07 p.m. UTC | #2
On Thu, Aug 15, 2024 at 2:58 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
> > As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
> > counterintuitive result that nop and push variants are much slower than
> > ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
> > which excludes 'nop' and 'stp' from the emulatable instructions list.
> > This force the kernel returns to userspace and execute them out-of-line,
> > then trapping back to kernel for running uprobe callback functions. This
> > leads to a significant performance overhead compared to 'ret' variant,
> > which is already emulated.
>
> I appreciate this might be surprising, but does it actually matter
> outside of a microbenchmark?

I'll leave the ARM parts to Liao, but yes, it matters a lot. Admittedly,
my main focus right now is x86-64, but ARM64 keeps growing in
importance.

But on x86-64 we specifically added emulation of push/pop operations
(a while ago) so we can mitigate performance degradation for a common
case of installing uprobes on (user space) function entry. That was a
significant speed up because we avoided doing one extra interrupt hop
between kernel and user space, which is a big chunk of uprobe
activation cost. And then in BPF cases, BPF uprobe program logic is
usually pretty lightweight, so the uprobe triggering overhead is still
very noticeable in practice.

So if there is anything that can be done to improve performance on
ARM64 for similar function entry situations, that would be greatly
appreciated by many bpftrace and BPF users at the very least.

>
> > Typicall uprobe is installed on 'nop' for USDT and on function entry
> > which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
> > and fp into stack regardless kernel or userspace binary.
>
> Function entry doesn't always start with a STP; these days it's often a
> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
> the compiler), it could be any arbitrary instruction. This might happen
> to be the common case today, but there are certain;y codebases where it
> is not.
>

[...]
Liao, Chang Aug. 21, 2024, 7:55 a.m. UTC | #3
Hi, Mark

My bad for taking so long to reply, I generally agree with your suggestions
on the STP emulation.

On 2024/8/15 17:58, Mark Rutland wrote:
> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>> counterintuitive result that nop and push variants are much slower than
>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>> This force the kernel returns to userspace and execute them out-of-line,
>> then trapping back to kernel for running uprobe callback functions. This
>> leads to a significant performance overhead compared to 'ret' variant,
>> which is already emulated.
> 
> I appreciate this might be surprising, but does it actually matter
> outside of a microbenchmark?

I just did a simple comparison of the performance impact of single-stepped and
emulated STP on my local machine. Three use cases were measured: Redis GET and
SET throughput (Requests Per Second, RPS), and the time taken to run a grep
command for the "arch_uprobe_copy_xol" string within the kernel source.

Redis GET (higher is better)
----------------------------
No uprobe: 49149.71 RPS
Single-stepped STP: 46750.82 RPS
Emulated STP: 48981.19 RPS

Redis SET (larger is better)
----------------------------
No uprobe: 49761.14 RPS
Single-stepped STP: 45255.01 RPS
Emulated STP: 48619.21 RPS

Grep (lower is better)
----------------------
No uprobe: 2.165s
Single-stepped STP: 15.314s
Emulated STP: 2.216s

The results reveal that single-stepping the STP instruction used to push fp/lr
onto the stack significantly impacts Redis and grep performance, leading to a
notable decrease in RPS and increase in runtime respectively. So emulating the
STP at function entry might be a more viable option for uprobe.

> 
>> Typicall uprobe is installed on 'nop' for USDT and on function entry
>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
>> and fp into stack regardless kernel or userspace binary. 
> 
> Function entry doesn't always start with a STP; these days it's often a
> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
> the compiler), it could be any arbitrary instruction. This might happen
> to be the common case today, but there are certain;y codebases where it
> is not.

Sure, if the kernel, CPU and compiler support BTI and PAC, the entry instruction
is definitely not STP. But for CPUs and kernels lacking these supports, STP as
the entry instruction is still the common case. I also profiled the entry
instruction for all leaf and non-leaf functions: the ratio of STP is 64.5%
for redis and 76.1% for the BPF selftest bench. So I am thinking it is still
useful to emulate the STP at function entry. Perhaps, for CPUs and kernels
with BTI and PAC enabled, uprobe could choose the slower single-stepping to
execute the STP that pushes onto the stack.

> 
> STP (or any instruction that accesses memory) is fairly painful to
> emulate because you need to ensure that the correct atomicity and
> ordering properties are provided (e.g. an aligned STP should be
> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
> chance), and that the correct VMSA behaviour is provided (e.g. when
> interacting with MTE, POE, etc, while the uaccess primitives don't try
> to be 100% equivalent to instructions in userspace).
Agreed, but I don't think it has to strictly emulate the single-copy-atomic
property of the STP used to push fp/lr onto the stack. In most cases, only one
CPU will push registers to a given position on the stack, and I barely
understand why other CPUs would depend on the ordering of pushes onto the
stack. So the atomicity and ordering are not so important for this scenario.
Regarding MTE and POE, a similar strategy to BTI and PAC can be applied: for
CPUs and kernels with MTE and POE enabled, uprobe chooses the slower
single-stepping to execute the STP that pushes onto the stack (a rough sketch
of that gating follows below).
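
A rough sketch of the gating I have in mind (an assumption on my side rather
than final code: it reuses the existing system_supports_mte() cpufeature
helper, a POE check would be analogous once an equivalent helper is available,
and checking the system-wide capability is coarser than checking the probed
task):

	/*
	 * Sketch only, not part of this patch: take the STP emulation fast
	 * path only when no feature that changes store semantics (e.g. MTE,
	 * POE) may be in use, otherwise keep the single-stepping path.
	 */
	static bool stp_emulation_allowed(void)
	{
		return !system_supports_mte();	/* and a POE check, if available */
	}

and the decode above would then test '!kernel && stp_emulation_allowed() &&
aarch64_insn_is_stp_fp_lr_sp_64b(insn)' before picking
simulate_stp_fp_lr_sp_64b.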

> 
> For those reasons, in general I don't think we should be emulating any
> instruction which accesses memory, and we should not try to emulate the
> STP, but I think it's entirely reasonable to emulate NOP.
> 
>> In order to
>> improve the performance of handling uprobe for common usecases. This
>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
>> and 'push'. The benchmark results below indicates the performance gain
>> of emulation is obvious.
>>
>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>
>> xol (1 cpus)
>> ------------
>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>
>> emulation (1 cpus)
>> -------------------
>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
>>
>> As shown above, the performance gap between 'nop/push' and 'ret'
>> variants has been significantly reduced. Due to the emulation of 'push'
>> instruction needs to access userspace memory, it spent more cycles than
>> the other.
>>
>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>>
>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>> ---
>>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
>>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
>>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
>>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
>>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
>>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
>>  6 files changed, 70 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>> index 8c0a36f72d6f..a246e6e550ba 100644
>> --- a/arch/arm64/include/asm/insn.h
>> +++ b/arch/arm64/include/asm/insn.h
>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>>  	       aarch64_insn_is_prfm_lit(insn);
>>  }
>>  
>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
>> +{
>> +	/* nop */
>> +	return aarch64_insn_is_hint(insn) &&
>> +	       ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
>> +}
> 
> This looks fine, but the comment can go.

Removed.

> 
>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
>> +{
>> +	/*
>> +	 * The 1st instruction on function entry often follows the
>> +	 * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
>> +	 * into stack.
>> +	 */
>> +	return aarch64_insn_is_stp_pre(insn) &&
>> +	       (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
>> +	       (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
>> +	       (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x29 */
>> +	       (((insn >>  0) & 0x1F) == 29);	/* Rt  is x30 */
>> +}
> 
> We have accessors for these fields. Please use them.

Do you mean aarch64_insn_decode_register()?

> 
> Regardless, as above I do not think we should have a helper this
> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.

If we leave the necessity of STP emulation aside, which file would be the best
place for these hard-coded decoder helpers?

> 
>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
>> index 968d5fffe233..df7ca16fc763 100644
>> --- a/arch/arm64/kernel/probes/decode-insn.c
>> +++ b/arch/arm64/kernel/probes/decode-insn.c
>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
>>   */
>>  enum probe_insn __kprobes
>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
>> +		      bool kernel)
>>  {
>> +	/*
>> +	 * While 'nop' and 'stp x29, x30, [sp, #imm]! instructions can
>> +	 * execute in the out-of-line slot, simulating them in breakpoint
>> +	 * handling offers better performance.
>> +	 */
>> +	if (aarch64_insn_is_nop(insn)) {
>> +		api->handler = simulate_nop;
>> +		return INSN_GOOD_NO_SLOT;
>> +	} else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
>> +		api->handler = simulate_stp_fp_lr_sp_64b;
>> +		return INSN_GOOD_NO_SLOT;
>> +	}
> 
> With the STP emulation gone, you won't need the kernel parameter here.>
>> +
>>  	/*
>>  	 * Instructions reading or modifying the PC won't work from the XOL
>>  	 * slot.
>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
>>  		else
>>  			scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
>>  	}
>> -	decoded = arm_probe_decode_insn(insn, &asi->api);
>> +	decoded = arm_probe_decode_insn(insn, &asi->api, true);
>>  
>>  	if (decoded != INSN_REJECTED && scan_end)
>>  		if (is_probed_address_atomic(addr - 1, scan_end))
>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
>> index 8b758c5a2062..ec4607189933 100644
>> --- a/arch/arm64/kernel/probes/decode-insn.h
>> +++ b/arch/arm64/kernel/probes/decode-insn.h
>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
>>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
>>  #endif
>>  enum probe_insn __kprobes
>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
>> +		      bool kernel);
>>  
>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
>> index 22d0b3252476..0b1623fa7003 100644
>> --- a/arch/arm64/kernel/probes/simulate-insn.c
>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>>  
>>  	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>  }
>> +
>> +void __kprobes
>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>> +{
>> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>> +}
> 
> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
> and that's an extant bug in arch_uprobe_post_xol(). This can be:

For emulated instructions, uprobe won't enable the single-step mode of the CPU;
please check handle_swbp() in kernel/events/uprobes.c:

  if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
          goto out;

  if (!pre_ssout(uprobe, regs, bp_vaddr))
          return;

For emulated instructions, it will skip the entire single-stepping and the
associated exception, which typically begins with pre_ssout() and ends with
arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
NOP is generally not a bad idea.
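
For context, the arm64 arch_uprobe_skip_sstep() that invokes the simulate
handler looks roughly like this (paraphrased from
arch/arm64/kernel/probes/uprobes.c, not a verbatim copy):

  bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
  {
          probe_opcode_t insn;

          /* No simulate handler: fall back to XOL single-stepping. */
          if (!auprobe->simulate)
                  return false;

          insn = *(probe_opcode_t *)(&auprobe->insn[0]);
          if (auprobe->api.handler)
                  auprobe->api.handler(insn, instruction_pointer(regs), regs);

          /* Returning true bypasses pre_ssout()/arch_uprobe_post_xol(). */
          return true;
  }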

> 
> | void __kprobes
> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> | {
> | 	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> | }
> 
>> +
>> +void __kprobes
>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
>> +{
>> +	long imm7;
>> +	u64 buf[2];
>> +	long new_sp;
>> +
>> +	imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
>> +	new_sp = regs->sp + (imm7 << 3);
> 
> We have accessors for these fields, please use them.

Do you mean aarch64_insn_decode_immediate()?

> 
>> +
>> +	buf[0] = regs->regs[29];
>> +	buf[1] = regs->regs[30];
>> +
>> +	if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
>> +		force_sig(SIGSEGV);
>> +		return;
>> +	}
> 
> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
> same way as an STP in userspace, and this will not have the same
> atomicity properties as an STP>
>> +
>> +	regs->sp = new_sp;
>> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> 
> Likewise, this sould need ot use arm64_skip_faulting_instruction(),
> though as above I think we should drop STP emulation entirely.

I explained the reason for using instruction_pointer_set() under your comments
on simulate_nop().

Thanks.

> 
> Mark.
>
Liao, Chang Aug. 27, 2024, 11:33 a.m. UTC | #4
Hi, Mark

Would you like to discuss this patch further, or do you still believe emulating
STP to push FP/LR into the stack in the kernel is not a good idea?

Thanks.


On 2024/8/21 15:55, Liao, Chang wrote:
> Hi, Mark
> 
> My bad for taking so long to rely, I generally agree with your suggestions to
> STP emulation.
> 
> On 2024/8/15 17:58, Mark Rutland wrote:
>> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
>>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>>> counterintuitive result that nop and push variants are much slower than
>>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>>> This force the kernel returns to userspace and execute them out-of-line,
>>> then trapping back to kernel for running uprobe callback functions. This
>>> leads to a significant performance overhead compared to 'ret' variant,
>>> which is already emulated.
>>
>> I appreciate this might be surprising, but does it actually matter
>> outside of a microbenchmark?
> 
> I just do a simple comparsion the performance impact of single-stepped and
> emulated STP on my local machine. Three user cases were measured: Redis GET and
> SET throughput (Request Per Second, RPS), and the time taken to execute a grep
> command on the "arch_uprobe_copy_xol" string within the kernel source.
> 
> Redis GET (higher is better)
> ----------------------------
> No uprobe: 49149.71 RPS
> Single-stepped STP: 46750.82 RPS
> Emulated STP: 48981.19 RPS
> 
> Redis SET (larger is better)
> ----------------------------
> No uprobe: 49761.14 RPS
> Single-stepped STP: 45255.01 RPS
> Emulated stp: 48619.21 RPS
> 
> Grep (lower is better)
> ----------------------
> No uprobe: 2.165s
> Single-stepped STP: 15.314s
> Emualted STP: 2.216s
> 
> The result reveals single-stepped STP instruction that used to push fp/lr into
> stack significantly impacts the Redis and grep performance, leading to a notable
> notable decrease RPS and increase time individually. So emulating STP on the
> function entry might be a more viable option for uprobe.
> 
>>
>>> Typicall uprobe is installed on 'nop' for USDT and on function entry
>>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
>>> and fp into stack regardless kernel or userspace binary. 
>>
>> Function entry doesn't always start with a STP; these days it's often a
>> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
>> the compiler), it could be any arbitrary instruction. This might happen
>> to be the common case today, but there are certain;y codebases where it
>> is not.
> 
> Sure, if kernel, CPU and compiler support BTI and PAC, the entry instruction
> is definitly not STP. But for CPU and kernel lack of these supports, STP as
> the entry instruction is still the common case. And I profiled the entry
> instruction for all leaf and non-leaf function, the ratio of STP is 64.5%
> for redis, 76.1% for the BPF selftest bench. So I am thinking it is still
> useful to emulate the STP on the function entry. Perhaps, for CPU and kernel
> with BTI and PAC enabled, uprobe chooses the slower single-stepping to execute
> STP for pushing stack.
> 
>>
>> STP (or any instruction that accesses memory) is fairly painful to
>> emulate because you need to ensure that the correct atomicity and
>> ordering properties are provided (e.g. an aligned STP should be
>> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
>> chance), and that the correct VMSA behaviour is provided (e.g. when
>> interacting with MTE, POE, etc, while the uaccess primitives don't try
>> to be 100% equivalent to instructions in userspace).
> Agreed, but I don't think it has to emulate strictly the single-copy-atomic
> feature of STP that is used to push fp/lr into stack. In most cases, only one
> CPU will push registers to the same position on stack. And I barely understand
> why other CPUs would depends on the ordering of pushing data into stack. So it
> means the atomicity and ordering is not so important for this scenario. Regarding
> MTE and POE, a similar stragety to BTI and PAC can be applied: for CPUs and kernel
> with MTE and POE enabled, uprobe chooses the slower single-stepping to execute
> STP for pushing stack.
> 
>>
>> For those reasons, in general I don't think we should be emulating any
>> instruction which accesses memory, and we should not try to emulate the
>> STP, but I think it's entirely reasonable to emulate NOP.
>>
>>> In order to
>>> improve the performance of handling uprobe for common usecases. This
>>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
>>> and 'push'. The benchmark results below indicates the performance gain
>>> of emulation is obvious.
>>>
>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>>
>>> xol (1 cpus)
>>> ------------
>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>
>>> emulation (1 cpus)
>>> -------------------
>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
>>>
>>> As shown above, the performance gap between 'nop/push' and 'ret'
>>> variants has been significantly reduced. Due to the emulation of 'push'
>>> instruction needs to access userspace memory, it spent more cycles than
>>> the other.
>>>
>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>>>
>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>> ---
>>>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
>>>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
>>>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
>>>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
>>>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
>>>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
>>>  6 files changed, 70 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>> index 8c0a36f72d6f..a246e6e550ba 100644
>>> --- a/arch/arm64/include/asm/insn.h
>>> +++ b/arch/arm64/include/asm/insn.h
>>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>>>  	       aarch64_insn_is_prfm_lit(insn);
>>>  }
>>>  
>>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
>>> +{
>>> +	/* nop */
>>> +	return aarch64_insn_is_hint(insn) &&
>>> +	       ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
>>> +}
>>
>> This looks fine, but the comment can go.
> 
> Removed.
> 
>>
>>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
>>> +{
>>> +	/*
>>> +	 * The 1st instruction on function entry often follows the
>>> +	 * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
>>> +	 * into stack.
>>> +	 */
>>> +	return aarch64_insn_is_stp_pre(insn) &&
>>> +	       (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
>>> +	       (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
>>> +	       (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x29 */
>>> +	       (((insn >>  0) & 0x1F) == 29);	/* Rt  is x30 */
>>> +}
>>
>> We have accessors for these fields. Please use them.
> 
> Do you mean aarch64_insn_decode_register()?
> 
>>
>> Regardless, as above I do not think we should have a helper this
>> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
> 
> If we left necessary of emulation of STP aside, where would the best file to
> place these hard-coded decoder helper?
> 
>>
>>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
>>> index 968d5fffe233..df7ca16fc763 100644
>>> --- a/arch/arm64/kernel/probes/decode-insn.c
>>> +++ b/arch/arm64/kernel/probes/decode-insn.c
>>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>>>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
>>>   */
>>>  enum probe_insn __kprobes
>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
>>> +		      bool kernel)
>>>  {
>>> +	/*
>>> +	 * While 'nop' and 'stp x29, x30, [sp, #imm]! instructions can
>>> +	 * execute in the out-of-line slot, simulating them in breakpoint
>>> +	 * handling offers better performance.
>>> +	 */
>>> +	if (aarch64_insn_is_nop(insn)) {
>>> +		api->handler = simulate_nop;
>>> +		return INSN_GOOD_NO_SLOT;
>>> +	} else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
>>> +		api->handler = simulate_stp_fp_lr_sp_64b;
>>> +		return INSN_GOOD_NO_SLOT;
>>> +	}
>>
>> With the STP emulation gone, you won't need the kernel parameter here.>
>>> +
>>>  	/*
>>>  	 * Instructions reading or modifying the PC won't work from the XOL
>>>  	 * slot.
>>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
>>>  		else
>>>  			scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
>>>  	}
>>> -	decoded = arm_probe_decode_insn(insn, &asi->api);
>>> +	decoded = arm_probe_decode_insn(insn, &asi->api, true);
>>>  
>>>  	if (decoded != INSN_REJECTED && scan_end)
>>>  		if (is_probed_address_atomic(addr - 1, scan_end))
>>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
>>> index 8b758c5a2062..ec4607189933 100644
>>> --- a/arch/arm64/kernel/probes/decode-insn.h
>>> +++ b/arch/arm64/kernel/probes/decode-insn.h
>>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
>>>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
>>>  #endif
>>>  enum probe_insn __kprobes
>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
>>> +		      bool kernel);
>>>  
>>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
>>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
>>> index 22d0b3252476..0b1623fa7003 100644
>>> --- a/arch/arm64/kernel/probes/simulate-insn.c
>>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
>>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>>>  
>>>  	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>  }
>>> +
>>> +void __kprobes
>>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>> +{
>>> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>> +}
>>
>> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
>> and that's an extant bug in arch_uprobe_post_xol(). This can be:
> 
> For emulated instruction, uprobe won't enable single-step mode of CPU,
> please check the handle_swbp() in kernel/events/uprobes.c:
> 
>   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>           goto out;
> 
>   if (!pre_ssout(uprobe, regs, bp_vaddr))
>           return;
> 
> For emualted instruction, It will skip entire single-stepping and associated
> exception, which typically begins with pre_ssout() and ends with
> arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
> NOP is generally not a bad idea.
> 
>>
>> | void __kprobes
>> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>> | {
>> | 	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
>> | }
>>
>>> +
>>> +void __kprobes
>>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
>>> +{
>>> +	long imm7;
>>> +	u64 buf[2];
>>> +	long new_sp;
>>> +
>>> +	imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
>>> +	new_sp = regs->sp + (imm7 << 3);
>>
>> We have accessors for these fields, please use them.
> 
> Do you mean aarch64_insn_decode_immediate()?
> 
>>
>>> +
>>> +	buf[0] = regs->regs[29];
>>> +	buf[1] = regs->regs[30];
>>> +
>>> +	if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
>>> +		force_sig(SIGSEGV);
>>> +		return;
>>> +	}
>>
>> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
>> same way as an STP in userspace, and this will not have the same
>> atomicity properties as an STP>
>>> +
>>> +	regs->sp = new_sp;
>>> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>
>> Likewise, this sould need ot use arm64_skip_faulting_instruction(),
>> though as above I think we should drop STP emulation entirely.
> 
> I explain the reason why using instruction_pointer_set() under your comments
> for simulate_nop().
> 
> Thanks.
> 
>>
>> Mark.
>>
>
Andrii Nakryiko Aug. 29, 2024, 7:26 p.m. UTC | #5
On Tue, Aug 27, 2024 at 4:34 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
> Hi, Mark
>
> Would you like to discuss this patch further, or do you still believe emulating
> STP to push FP/LR into the stack in kernel is not a good idea?
>

Please send an updated version of your patches taking into account
various smaller issues Mark pointed out. But please keep STP
emulation, I think it's very important, even if it's not covering all
possible uprobe tracing scenarios.

> Thanks.
>
>
> On 2024/8/21 15:55, Liao, Chang wrote:
> > Hi, Mark
> >
> > My bad for taking so long to rely, I generally agree with your suggestions to
> > STP emulation.
> >
> > On 2024/8/15 17:58, Mark Rutland wrote:
> >> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
> >>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
> >>> counterintuitive result that nop and push variants are much slower than
> >>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
> >>> which excludes 'nop' and 'stp' from the emulatable instructions list.
> >>> This force the kernel returns to userspace and execute them out-of-line,
> >>> then trapping back to kernel for running uprobe callback functions. This
> >>> leads to a significant performance overhead compared to 'ret' variant,
> >>> which is already emulated.
> >>
> >> I appreciate this might be surprising, but does it actually matter
> >> outside of a microbenchmark?
> >
> > I just do a simple comparsion the performance impact of single-stepped and
> > emulated STP on my local machine. Three user cases were measured: Redis GET and
> > SET throughput (Request Per Second, RPS), and the time taken to execute a grep
> > command on the "arch_uprobe_copy_xol" string within the kernel source.
> >
> > Redis GET (higher is better)
> > ----------------------------
> > No uprobe: 49149.71 RPS
> > Single-stepped STP: 46750.82 RPS
> > Emulated STP: 48981.19 RPS
> >
> > Redis SET (larger is better)
> > ----------------------------
> > No uprobe: 49761.14 RPS
> > Single-stepped STP: 45255.01 RPS
> > Emulated stp: 48619.21 RPS
> >
> > Grep (lower is better)
> > ----------------------
> > No uprobe: 2.165s
> > Single-stepped STP: 15.314s
> > Emualted STP: 2.216s
> >
> > The result reveals single-stepped STP instruction that used to push fp/lr into
> > stack significantly impacts the Redis and grep performance, leading to a notable
> > notable decrease RPS and increase time individually. So emulating STP on the
> > function entry might be a more viable option for uprobe.
> >
> >>
> >>> Typicall uprobe is installed on 'nop' for USDT and on function entry
> >>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
> >>> and fp into stack regardless kernel or userspace binary.
> >>
> >> Function entry doesn't always start with a STP; these days it's often a
> >> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
> >> the compiler), it could be any arbitrary instruction. This might happen
> >> to be the common case today, but there are certain;y codebases where it
> >> is not.
> >
> > Sure, if kernel, CPU and compiler support BTI and PAC, the entry instruction
> > is definitly not STP. But for CPU and kernel lack of these supports, STP as
> > the entry instruction is still the common case. And I profiled the entry
> > instruction for all leaf and non-leaf function, the ratio of STP is 64.5%
> > for redis, 76.1% for the BPF selftest bench. So I am thinking it is still
> > useful to emulate the STP on the function entry. Perhaps, for CPU and kernel
> > with BTI and PAC enabled, uprobe chooses the slower single-stepping to execute
> > STP for pushing stack.
> >
> >>
> >> STP (or any instruction that accesses memory) is fairly painful to
> >> emulate because you need to ensure that the correct atomicity and
> >> ordering properties are provided (e.g. an aligned STP should be
> >> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
> >> chance), and that the correct VMSA behaviour is provided (e.g. when
> >> interacting with MTE, POE, etc, while the uaccess primitives don't try
> >> to be 100% equivalent to instructions in userspace).
> > Agreed, but I don't think it has to emulate strictly the single-copy-atomic
> > feature of STP that is used to push fp/lr into stack. In most cases, only one
> > CPU will push registers to the same position on stack. And I barely understand
> > why other CPUs would depends on the ordering of pushing data into stack. So it
> > means the atomicity and ordering is not so important for this scenario. Regarding
> > MTE and POE, a similar stragety to BTI and PAC can be applied: for CPUs and kernel
> > with MTE and POE enabled, uprobe chooses the slower single-stepping to execute
> > STP for pushing stack.
> >
> >>
> >> For those reasons, in general I don't think we should be emulating any
> >> instruction which accesses memory, and we should not try to emulate the
> >> STP, but I think it's entirely reasonable to emulate NOP.
> >>
> >>> In order to
> >>> improve the performance of handling uprobe for common usecases. This
> >>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
> >>> and 'push'. The benchmark results below indicates the performance gain
> >>> of emulation is obvious.
> >>>
> >>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> >>>
> >>> xol (1 cpus)
> >>> ------------
> >>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>>
> >>> emulation (1 cpus)
> >>> -------------------
> >>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
> >>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
> >>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
> >>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
> >>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
> >>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
> >>>
> >>> As shown above, the performance gap between 'nop/push' and 'ret'
> >>> variants has been significantly reduced. Due to the emulation of 'push'
> >>> instruction needs to access userspace memory, it spent more cycles than
> >>> the other.
> >>>
> >>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> >>>
> >>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >>> ---
> >>>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
> >>>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
> >>>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
> >>>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
> >>>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
> >>>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
> >>>  6 files changed, 70 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> >>> index 8c0a36f72d6f..a246e6e550ba 100644
> >>> --- a/arch/arm64/include/asm/insn.h
> >>> +++ b/arch/arm64/include/asm/insn.h
> >>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
> >>>            aarch64_insn_is_prfm_lit(insn);
> >>>  }
> >>>
> >>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
> >>> +{
> >>> +   /* nop */
> >>> +   return aarch64_insn_is_hint(insn) &&
> >>> +          ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
> >>> +}
> >>
> >> This looks fine, but the comment can go.
> >
> > Removed.
> >
> >>
> >>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
> >>> +{
> >>> +   /*
> >>> +    * The 1st instruction on function entry often follows the
> >>> +    * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
> >>> +    * into stack.
> >>> +    */
> >>> +   return aarch64_insn_is_stp_pre(insn) &&
> >>> +          (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
> >>> +          (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
> >>> +          (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x29 */
> >>> +          (((insn >>  0) & 0x1F) == 29);   /* Rt  is x30 */
> >>> +}
> >>
> >> We have accessors for these fields. Please use them.
> >
> > Do you mean aarch64_insn_decode_register()?
> >
> >>
> >> Regardless, as above I do not think we should have a helper this
> >> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
> >
> > If we left necessary of emulation of STP aside, where would the best file to
> > place these hard-coded decoder helper?
> >
> >>
> >>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
> >>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
> >>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
> >>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
> >>> index 968d5fffe233..df7ca16fc763 100644
> >>> --- a/arch/arm64/kernel/probes/decode-insn.c
> >>> +++ b/arch/arm64/kernel/probes/decode-insn.c
> >>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
> >>>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
> >>>   */
> >>>  enum probe_insn __kprobes
> >>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
> >>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
> >>> +                 bool kernel)
> >>>  {
> >>> +   /*
> >>> +    * While 'nop' and 'stp x29, x30, [sp, #imm]! instructions can
> >>> +    * execute in the out-of-line slot, simulating them in breakpoint
> >>> +    * handling offers better performance.
> >>> +    */
> >>> +   if (aarch64_insn_is_nop(insn)) {
> >>> +           api->handler = simulate_nop;
> >>> +           return INSN_GOOD_NO_SLOT;
> >>> +   } else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
> >>> +           api->handler = simulate_stp_fp_lr_sp_64b;
> >>> +           return INSN_GOOD_NO_SLOT;
> >>> +   }
> >>
> >> With the STP emulation gone, you won't need the kernel parameter here.>
> >>> +
> >>>     /*
> >>>      * Instructions reading or modifying the PC won't work from the XOL
> >>>      * slot.
> >>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
> >>>             else
> >>>                     scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
> >>>     }
> >>> -   decoded = arm_probe_decode_insn(insn, &asi->api);
> >>> +   decoded = arm_probe_decode_insn(insn, &asi->api, true);
> >>>
> >>>     if (decoded != INSN_REJECTED && scan_end)
> >>>             if (is_probed_address_atomic(addr - 1, scan_end))
> >>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
> >>> index 8b758c5a2062..ec4607189933 100644
> >>> --- a/arch/arm64/kernel/probes/decode-insn.h
> >>> +++ b/arch/arm64/kernel/probes/decode-insn.h
> >>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
> >>>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
> >>>  #endif
> >>>  enum probe_insn __kprobes
> >>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
> >>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
> >>> +                 bool kernel);
> >>>
> >>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
> >>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
> >>> index 22d0b3252476..0b1623fa7003 100644
> >>> --- a/arch/arm64/kernel/probes/simulate-insn.c
> >>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
> >>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
> >>>
> >>>     instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> >>>  }
> >>> +
> >>> +void __kprobes
> >>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> >>> +{
> >>> +   instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> >>> +}
> >>
> >> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
> >> and that's an extant bug in arch_uprobe_post_xol(). This can be:
> >
> > For emulated instruction, uprobe won't enable single-step mode of CPU,
> > please check the handle_swbp() in kernel/events/uprobes.c:
> >
> >   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> >           goto out;
> >
> >   if (!pre_ssout(uprobe, regs, bp_vaddr))
> >           return;
> >
> > For emualted instruction, It will skip entire single-stepping and associated
> > exception, which typically begins with pre_ssout() and ends with
> > arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
> > NOP is generally not a bad idea.
> >
> >>
> >> | void __kprobes
> >> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> >> | {
> >> |    arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> >> | }
> >>
> >>> +
> >>> +void __kprobes
> >>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
> >>> +{
> >>> +   long imm7;
> >>> +   u64 buf[2];
> >>> +   long new_sp;
> >>> +
> >>> +   imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
> >>> +   new_sp = regs->sp + (imm7 << 3);
> >>
> >> We have accessors for these fields, please use them.
> >
> > Do you mean aarch64_insn_decode_immediate()?
> >
> >>
> >>> +
> >>> +   buf[0] = regs->regs[29];
> >>> +   buf[1] = regs->regs[30];
> >>> +
> >>> +   if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
> >>> +           force_sig(SIGSEGV);
> >>> +           return;
> >>> +   }
> >>
> >> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
> >> same way as an STP in userspace, and this will not have the same
> >> atomicity properties as an STP>
> >>> +
> >>> +   regs->sp = new_sp;
> >>> +   instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> >>
> >> Likewise, this sould need ot use arm64_skip_faulting_instruction(),
> >> though as above I think we should drop STP emulation entirely.
> >
> > I explain the reason why using instruction_pointer_set() under your comments
> > for simulate_nop().
> >
> > Thanks.
> >
> >>
> >> Mark.
> >>
> >
>
> --
> BR
> Liao, Chang
>
Liao, Chang Aug. 30, 2024, 9:25 a.m. UTC | #6
On 2024/8/30 3:26, Andrii Nakryiko wrote:
> On Tue, Aug 27, 2024 at 4:34 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>> Hi, Mark
>>
>> Would you like to discuss this patch further, or do you still believe emulating
>> STP to push FP/LR into the stack in kernel is not a good idea?
>>
> 
> Please send an updated version of your patches taking into account
> various smaller issues Mark pointed out. But please keep STP
> emulation, I think it's very important, even if it's not covering all
> possible uprobe tracing scenarios.

OK, I will send it out over the weekend, including an enhancement of STP
emulation that addresses the MTE and POE issues Mark mentioned. I hope this
will lead to more feedback from him.

> 
>> Thanks.
>>
>>
>> On 2024/8/21 15:55, Liao, Chang wrote:
>>> Hi, Mark
>>>
>>> My bad for taking so long to rely, I generally agree with your suggestions to
>>> STP emulation.
>>>
>>> On 2024/8/15 17:58, Mark Rutland wrote:
>>>> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
>>>>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>>>>> counterintuitive result that nop and push variants are much slower than
>>>>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>>>>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>>>>> This force the kernel returns to userspace and execute them out-of-line,
>>>>> then trapping back to kernel for running uprobe callback functions. This
>>>>> leads to a significant performance overhead compared to 'ret' variant,
>>>>> which is already emulated.
>>>>
>>>> I appreciate this might be surprising, but does it actually matter
>>>> outside of a microbenchmark?
>>>
>>> I just do a simple comparsion the performance impact of single-stepped and
>>> emulated STP on my local machine. Three user cases were measured: Redis GET and
>>> SET throughput (Request Per Second, RPS), and the time taken to execute a grep
>>> command on the "arch_uprobe_copy_xol" string within the kernel source.
>>>
>>> Redis GET (higher is better)
>>> ----------------------------
>>> No uprobe: 49149.71 RPS
>>> Single-stepped STP: 46750.82 RPS
>>> Emulated STP: 48981.19 RPS
>>>
>>> Redis SET (larger is better)
>>> ----------------------------
>>> No uprobe: 49761.14 RPS
>>> Single-stepped STP: 45255.01 RPS
>>> Emulated stp: 48619.21 RPS
>>>
>>> Grep (lower is better)
>>> ----------------------
>>> No uprobe: 2.165s
>>> Single-stepped STP: 15.314s
>>> Emualted STP: 2.216s
>>>
>>> The result reveals single-stepped STP instruction that used to push fp/lr into
>>> stack significantly impacts the Redis and grep performance, leading to a notable
>>> decrease in RPS and increase in time, respectively. So emulating STP on the
>>> function entry might be a more viable option for uprobe.
>>>
>>>>
>>>>> Typicall uprobe is installed on 'nop' for USDT and on function entry
>>>>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
>>>>> and fp into stack regardless kernel or userspace binary.
>>>>
>>>> Function entry doesn't always start with a STP; these days it's often a
>>>> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
>>>> the compiler), it could be any arbitrary instruction. This might happen
>>>> to be the common case today, but there are certainly codebases where it
>>>> is not.
>>>
>>> Sure, if kernel, CPU and compiler support BTI and PAC, the entry instruction
>>> is definitely not STP. But for CPUs and kernels that lack these features, STP as
>>> the entry instruction is still the common case. And I profiled the entry
>>> instruction for all leaf and non-leaf function, the ratio of STP is 64.5%
>>> for redis, 76.1% for the BPF selftest bench. So I am thinking it is still
>>> useful to emulate the STP on the function entry. Perhaps, for CPU and kernel
>>> with BTI and PAC enabled, uprobe chooses the slower single-stepping to execute
>>> STP for pushing stack.
>>>
>>>>
>>>> STP (or any instruction that accesses memory) is fairly painful to
>>>> emulate because you need to ensure that the correct atomicity and
>>>> ordering properties are provided (e.g. an aligned STP should be
>>>> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
>>>> chance), and that the correct VMSA behaviour is provided (e.g. when
>>>> interacting with MTE, POE, etc, while the uaccess primitives don't try
>>>> to be 100% equivalent to instructions in userspace).
>>> Agreed, but I don't think it has to emulate strictly the single-copy-atomic
>>> feature of STP that is used to push fp/lr into stack. In most cases, only one
>>> CPU will push registers to the same position on stack. And I barely understand
>>> why other CPUs would depends on the ordering of pushing data into stack. So it
>>> means the atomicity and ordering is not so important for this scenario. Regarding
>>> MTE and POE, a similar strategy to BTI and PAC can be applied: for CPUs and kernels
>>> with MTE and POE enabled, uprobe chooses the slower single-stepping to execute
>>> STP for pushing stack.
>>>
>>>>
>>>> For those reasons, in general I don't think we should be emulating any
>>>> instruction which accesses memory, and we should not try to emulate the
>>>> STP, but I think it's entirely reasonable to emulate NOP.
>>>>
>>>>> In order to
>>>>> improve the performance of handling uprobe for common usecases. This
>>>>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
>>>>> and 'push'. The benchmark results below indicates the performance gain
>>>>> of emulation is obvious.
>>>>>
>>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>>>>
>>>>> xol (1 cpus)
>>>>> ------------
>>>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>>
>>>>> emulation (1 cpus)
>>>>> -------------------
>>>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
>>>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
>>>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
>>>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
>>>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
>>>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
>>>>>
>>>>> As shown above, the performance gap between 'nop/push' and 'ret'
>>>>> variants has been significantly reduced. Due to the emulation of 'push'
>>>>> instruction needs to access userspace memory, it spent more cycles than
>>>>> the other.
>>>>>
>>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>>>>>
>>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>>>> ---
>>>>>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
>>>>>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
>>>>>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
>>>>>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
>>>>>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
>>>>>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
>>>>>  6 files changed, 70 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>>>> index 8c0a36f72d6f..a246e6e550ba 100644
>>>>> --- a/arch/arm64/include/asm/insn.h
>>>>> +++ b/arch/arm64/include/asm/insn.h
>>>>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>>>>>            aarch64_insn_is_prfm_lit(insn);
>>>>>  }
>>>>>
>>>>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
>>>>> +{
>>>>> +   /* nop */
>>>>> +   return aarch64_insn_is_hint(insn) &&
>>>>> +          ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
>>>>> +}
>>>>
>>>> This looks fine, but the comment can go.
>>>
>>> Removed.
>>>
>>>>
>>>>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
>>>>> +{
>>>>> +   /*
>>>>> +    * The 1st instruction on function entry often follows the
>>>>> +    * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
>>>>> +    * into stack.
>>>>> +    */
>>>>> +   return aarch64_insn_is_stp_pre(insn) &&
>>>>> +          (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
>>>>> +          (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
>>>>> +          (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x29 */
>>>>> +          (((insn >>  0) & 0x1F) == 29);   /* Rt  is x30 */
>>>>> +}
>>>>
>>>> We have accessors for these fields. Please use them.
>>>
>>> Do you mean aarch64_insn_decode_register()?
>>>
>>>>
>>>> Regardless, as above I do not think we should have a helper this
>>>> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
>>>
>>> If we left necessary of emulation of STP aside, where would the best file to
>>> place these hard-coded decoder helper?
>>>
>>>>
>>>>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>>>>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>>>>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
>>>>> index 968d5fffe233..df7ca16fc763 100644
>>>>> --- a/arch/arm64/kernel/probes/decode-insn.c
>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.c
>>>>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>>>>>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
>>>>>   */
>>>>>  enum probe_insn __kprobes
>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
>>>>> +                 bool kernel)
>>>>>  {
>>>>> +   /*
>>>>> +    * While 'nop' and 'stp x29, x30, [sp, #imm]! instructions can
>>>>> +    * execute in the out-of-line slot, simulating them in breakpoint
>>>>> +    * handling offers better performance.
>>>>> +    */
>>>>> +   if (aarch64_insn_is_nop(insn)) {
>>>>> +           api->handler = simulate_nop;
>>>>> +           return INSN_GOOD_NO_SLOT;
>>>>> +   } else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
>>>>> +           api->handler = simulate_stp_fp_lr_sp_64b;
>>>>> +           return INSN_GOOD_NO_SLOT;
>>>>> +   }
>>>>
>>>> With the STP emulation gone, you won't need the kernel parameter here.>
>>>>> +
>>>>>     /*
>>>>>      * Instructions reading or modifying the PC won't work from the XOL
>>>>>      * slot.
>>>>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
>>>>>             else
>>>>>                     scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
>>>>>     }
>>>>> -   decoded = arm_probe_decode_insn(insn, &asi->api);
>>>>> +   decoded = arm_probe_decode_insn(insn, &asi->api, true);
>>>>>
>>>>>     if (decoded != INSN_REJECTED && scan_end)
>>>>>             if (is_probed_address_atomic(addr - 1, scan_end))
>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
>>>>> index 8b758c5a2062..ec4607189933 100644
>>>>> --- a/arch/arm64/kernel/probes/decode-insn.h
>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.h
>>>>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
>>>>>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
>>>>>  #endif
>>>>>  enum probe_insn __kprobes
>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
>>>>> +                 bool kernel);
>>>>>
>>>>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
>>>>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
>>>>> index 22d0b3252476..0b1623fa7003 100644
>>>>> --- a/arch/arm64/kernel/probes/simulate-insn.c
>>>>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
>>>>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>>>>>
>>>>>     instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>>  }
>>>>> +
>>>>> +void __kprobes
>>>>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>>> +{
>>>>> +   instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>> +}
>>>>
>>>> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
>>>> and that's an extant bug in arch_uprobe_post_xol(). This can be:
>>>
>>> For emulated instruction, uprobe won't enable single-step mode of CPU,
>>> please check the handle_swbp() in kernel/events/uprobes.c:
>>>
>>>   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>>>           goto out;
>>>
>>>   if (!pre_ssout(uprobe, regs, bp_vaddr))
>>>           return;
>>>
>>> For an emulated instruction, it will skip the entire single-stepping and associated
>>> exception, which typically begins with pre_ssout() and ends with
>>> arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
>>> NOP is generally not a bad idea.
>>>
>>>>
>>>> | void __kprobes
>>>> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>> | {
>>>> |    arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
>>>> | }
>>>>
>>>>> +
>>>>> +void __kprobes
>>>>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
>>>>> +{
>>>>> +   long imm7;
>>>>> +   u64 buf[2];
>>>>> +   long new_sp;
>>>>> +
>>>>> +   imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
>>>>> +   new_sp = regs->sp + (imm7 << 3);
>>>>
>>>> We have accessors for these fields, please use them.
>>>
>>> Do you mean aarch64_insn_decode_immediate()?
>>>
>>>>
>>>>> +
>>>>> +   buf[0] = regs->regs[29];
>>>>> +   buf[1] = regs->regs[30];
>>>>> +
>>>>> +   if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
>>>>> +           force_sig(SIGSEGV);
>>>>> +           return;
>>>>> +   }
>>>>
>>>> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
>>>> same way as an STP in userspace, and this will not have the same
>>>> atomicity properties as an STP>
>>>>> +
>>>>> +   regs->sp = new_sp;
>>>>> +   instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>
>>>> Likewise, this would need to use arm64_skip_faulting_instruction(),
>>>> though as above I think we should drop STP emulation entirely.
>>>
>>> I explain the reason why using instruction_pointer_set() under your comments
>>> for simulate_nop().
>>>
>>> Thanks.
>>>
>>>>
>>>> Mark.
>>>>
>>>
>>
>> --
>> BR
>> Liao, Chang
>>
Andrii Nakryiko Sept. 5, 2024, 8:17 p.m. UTC | #7
On Fri, Aug 30, 2024 at 2:25 AM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
>
> 在 2024/8/30 3:26, Andrii Nakryiko 写道:
> > On Tue, Aug 27, 2024 at 4:34 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>
> >> Hi, Mark
> >>
> >> Would you like to discuss this patch further, or do you still believe emulating
> >> STP to push FP/LR into the stack in kernel is not a good idea?
> >>
> >
> > Please send an updated version of your patches taking into account
> > various smaller issues Mark pointed out. But please keep STP
> > emulation, I think it's very important, even if it's not covering all
> > possible uprobe tracing scenarios.
>
> OK, I will send it out over the weekend, including an enhancement of STP
> emuation that addresses the MTE and POE issues Mark mentioned. I hope this
> will lead to more feedback from him.
>

Hey Liao,

Did you get a chance to prepare and send a new revision of this patch?

> >
> >> Thanks.
> >>
> >>
> >> 在 2024/8/21 15:55, Liao, Chang 写道:
> >>> Hi, Mark
> >>>
> >>> My bad for taking so long to reply, I generally agree with your suggestions on
> >>> STP emulation.
> >>>
> >>> 在 2024/8/15 17:58, Mark Rutland 写道:
> >>>> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
> >>>>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
> >>>>> counterintuitive result that nop and push variants are much slower than
> >>>>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
> >>>>> which excludes 'nop' and 'stp' from the emulatable instructions list.
> >>>>> This force the kernel returns to userspace and execute them out-of-line,
> >>>>> then trapping back to kernel for running uprobe callback functions. This
> >>>>> leads to a significant performance overhead compared to 'ret' variant,
> >>>>> which is already emulated.
> >>>>
> >>>> I appreciate this might be surprising, but does it actually matter
> >>>> outside of a microbenchmark?
> >>>
> >>> I just did a simple comparison of the performance impact of single-stepped and
> >>> emulated STP on my local machine. Three user cases were measured: Redis GET and
> >>> SET throughput (Request Per Second, RPS), and the time taken to execute a grep
> >>> command on the "arch_uprobe_copy_xol" string within the kernel source.
> >>>
> >>> Redis GET (higher is better)
> >>> ----------------------------
> >>> No uprobe: 49149.71 RPS
> >>> Single-stepped STP: 46750.82 RPS
> >>> Emulated STP: 48981.19 RPS
> >>>
> >>> Redis SET (larger is better)
> >>> ----------------------------
> >>> No uprobe: 49761.14 RPS
> >>> Single-stepped STP: 45255.01 RPS
> >>> Emulated stp: 48619.21 RPS
> >>>
> >>> Grep (lower is better)
> >>> ----------------------
> >>> No uprobe: 2.165s
> >>> Single-stepped STP: 15.314s
> >>> Emulated STP: 2.216s
> >>>
> >>> The result reveals single-stepped STP instruction that used to push fp/lr into
> >>> stack significantly impacts the Redis and grep performance, leading to a notable
> >>> decrease in RPS and increase in time, respectively. So emulating STP on the
> >>> function entry might be a more viable option for uprobe.
> >>>
> >>>>
> >>>>> Typicall uprobe is installed on 'nop' for USDT and on function entry
> >>>>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
> >>>>> and fp into stack regardless kernel or userspace binary.
> >>>>
> >>>> Function entry doesn't always start with a STP; these days it's often a
> >>>> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
> >>>> the compiler), it could be any arbitrary instruction. This might happen
> >>>> to be the common case today, but there are certainly codebases where it
> >>>> is not.
> >>>
> >>> Sure, if kernel, CPU and compiler support BTI and PAC, the entry instruction
> >>> is definitely not STP. But for CPUs and kernels that lack these features, STP as
> >>> the entry instruction is still the common case. And I profiled the entry
> >>> instruction for all leaf and non-leaf function, the ratio of STP is 64.5%
> >>> for redis, 76.1% for the BPF selftest bench. So I am thinking it is still
> >>> useful to emulate the STP on the function entry. Perhaps, for CPU and kernel
> >>> with BTI and PAC enabled, uprobe chooses the slower single-stepping to execute
> >>> STP for pushing stack.
> >>>
> >>>>
> >>>> STP (or any instruction that accesses memory) is fairly painful to
> >>>> emulate because you need to ensure that the correct atomicity and
> >>>> ordering properties are provided (e.g. an aligned STP should be
> >>>> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
> >>>> chance), and that the correct VMSA behaviour is provided (e.g. when
> >>>> interacting with MTE, POE, etc, while the uaccess primitives don't try
> >>>> to be 100% equivalent to instructions in userspace).
> >>> Agreed, but I don't think it has to emulate strictly the single-copy-atomic
> >>> feature of STP that is used to push fp/lr into stack. In most cases, only one
> >>> CPU will push registers to the same position on stack. And I barely understand
> >>> why other CPUs would depends on the ordering of pushing data into stack. So it
> >>> means the atomicity and ordering is not so important for this scenario. Regarding
> >>> MTE and POE, a similar strategy to BTI and PAC can be applied: for CPUs and kernels
> >>> with MTE and POE enabled, uprobe chooses the slower single-stepping to execute
> >>> STP for pushing stack.
> >>>
> >>>>
> >>>> For those reasons, in general I don't think we should be emulating any
> >>>> instruction which accesses memory, and we should not try to emulate the
> >>>> STP, but I think it's entirely reasonable to emulate NOP.
> >>>>
> >>>>> In order to
> >>>>> improve the performance of handling uprobe for common usecases. This
> >>>>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
> >>>>> and 'push'. The benchmark results below indicates the performance gain
> >>>>> of emulation is obvious.
> >>>>>
> >>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> >>>>>
> >>>>> xol (1 cpus)
> >>>>> ------------
> >>>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >>>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >>>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >>>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>>>>
> >>>>> emulation (1 cpus)
> >>>>> -------------------
> >>>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
> >>>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
> >>>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
> >>>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
> >>>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
> >>>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
> >>>>>
> >>>>> As shown above, the performance gap between 'nop/push' and 'ret'
> >>>>> variants has been significantly reduced. Due to the emulation of 'push'
> >>>>> instruction needs to access userspace memory, it spent more cycles than
> >>>>> the other.
> >>>>>
> >>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> >>>>>
> >>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >>>>> ---
> >>>>>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
> >>>>>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
> >>>>>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
> >>>>>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
> >>>>>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
> >>>>>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
> >>>>>  6 files changed, 70 insertions(+), 4 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> >>>>> index 8c0a36f72d6f..a246e6e550ba 100644
> >>>>> --- a/arch/arm64/include/asm/insn.h
> >>>>> +++ b/arch/arm64/include/asm/insn.h
> >>>>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
> >>>>>            aarch64_insn_is_prfm_lit(insn);
> >>>>>  }
> >>>>>
> >>>>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
> >>>>> +{
> >>>>> +   /* nop */
> >>>>> +   return aarch64_insn_is_hint(insn) &&
> >>>>> +          ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
> >>>>> +}
> >>>>
> >>>> This looks fine, but the comment can go.
> >>>
> >>> Removed.
> >>>
> >>>>
> >>>>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
> >>>>> +{
> >>>>> +   /*
> >>>>> +    * The 1st instruction on function entry often follows the
> >>>>> +    * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
> >>>>> +    * into stack.
> >>>>> +    */
> >>>>> +   return aarch64_insn_is_stp_pre(insn) &&
> >>>>> +          (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
> >>>>> +          (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
> >>>>> +          (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x29 */
> >>>>> +          (((insn >>  0) & 0x1F) == 29);   /* Rt  is x30 */
> >>>>> +}
> >>>>
> >>>> We have accessors for these fields. Please use them.
> >>>
> >>> Do you mean aarch64_insn_decode_register()?
> >>>
> >>>>
> >>>> Regardless, as above I do not think we should have a helper this
> >>>> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
> >>>
> >>> If we left necessary of emulation of STP aside, where would the best file to
> >>> place these hard-coded decoder helper?
> >>>
> >>>>
> >>>>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
> >>>>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
> >>>>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
> >>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
> >>>>> index 968d5fffe233..df7ca16fc763 100644
> >>>>> --- a/arch/arm64/kernel/probes/decode-insn.c
> >>>>> +++ b/arch/arm64/kernel/probes/decode-insn.c
> >>>>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
> >>>>>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
> >>>>>   */
> >>>>>  enum probe_insn __kprobes
> >>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
> >>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
> >>>>> +                 bool kernel)
> >>>>>  {
> >>>>> +   /*
> >>>>> +    * While 'nop' and 'stp x29, x30, [sp, #imm]! instructions can
> >>>>> +    * execute in the out-of-line slot, simulating them in breakpoint
> >>>>> +    * handling offers better performance.
> >>>>> +    */
> >>>>> +   if (aarch64_insn_is_nop(insn)) {
> >>>>> +           api->handler = simulate_nop;
> >>>>> +           return INSN_GOOD_NO_SLOT;
> >>>>> +   } else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
> >>>>> +           api->handler = simulate_stp_fp_lr_sp_64b;
> >>>>> +           return INSN_GOOD_NO_SLOT;
> >>>>> +   }
> >>>>
> >>>> With the STP emulation gone, you won't need the kernel parameter here.>
> >>>>> +
> >>>>>     /*
> >>>>>      * Instructions reading or modifying the PC won't work from the XOL
> >>>>>      * slot.
> >>>>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
> >>>>>             else
> >>>>>                     scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
> >>>>>     }
> >>>>> -   decoded = arm_probe_decode_insn(insn, &asi->api);
> >>>>> +   decoded = arm_probe_decode_insn(insn, &asi->api, true);
> >>>>>
> >>>>>     if (decoded != INSN_REJECTED && scan_end)
> >>>>>             if (is_probed_address_atomic(addr - 1, scan_end))
> >>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
> >>>>> index 8b758c5a2062..ec4607189933 100644
> >>>>> --- a/arch/arm64/kernel/probes/decode-insn.h
> >>>>> +++ b/arch/arm64/kernel/probes/decode-insn.h
> >>>>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
> >>>>>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
> >>>>>  #endif
> >>>>>  enum probe_insn __kprobes
> >>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
> >>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
> >>>>> +                 bool kernel);
> >>>>>
> >>>>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
> >>>>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
> >>>>> index 22d0b3252476..0b1623fa7003 100644
> >>>>> --- a/arch/arm64/kernel/probes/simulate-insn.c
> >>>>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
> >>>>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
> >>>>>
> >>>>>     instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> >>>>>  }
> >>>>> +
> >>>>> +void __kprobes
> >>>>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> >>>>> +{
> >>>>> +   instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> >>>>> +}
> >>>>
> >>>> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
> >>>> and that's an extant bug in arch_uprobe_post_xol(). This can be:
> >>>
> >>> For emulated instruction, uprobe won't enable single-step mode of CPU,
> >>> please check the handle_swbp() in kernel/events/uprobes.c:
> >>>
> >>>   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> >>>           goto out;
> >>>
> >>>   if (!pre_ssout(uprobe, regs, bp_vaddr))
> >>>           return;
> >>>
> >>> For an emulated instruction, it will skip the entire single-stepping and associated
> >>> exception, which typically begins with pre_ssout() and ends with
> >>> arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
> >>> NOP is generally not a bad idea.
> >>>
> >>>>
> >>>> | void __kprobes
> >>>> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> >>>> | {
> >>>> |    arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> >>>> | }
> >>>>
> >>>>> +
> >>>>> +void __kprobes
> >>>>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
> >>>>> +{
> >>>>> +   long imm7;
> >>>>> +   u64 buf[2];
> >>>>> +   long new_sp;
> >>>>> +
> >>>>> +   imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
> >>>>> +   new_sp = regs->sp + (imm7 << 3);
> >>>>
> >>>> We have accessors for these fields, please use them.
> >>>
> >>> Do you mean aarch64_insn_decode_immediate()?
> >>>
> >>>>
> >>>>> +
> >>>>> +   buf[0] = regs->regs[29];
> >>>>> +   buf[1] = regs->regs[30];
> >>>>> +
> >>>>> +   if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
> >>>>> +           force_sig(SIGSEGV);
> >>>>> +           return;
> >>>>> +   }
> >>>>
> >>>> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
> >>>> same way as an STP in userspace, and this will not have the same
> >>>> atomicity properties as an STP>
> >>>>> +
> >>>>> +   regs->sp = new_sp;
> >>>>> +   instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> >>>>
> >>>> Likewise, this would need to use arm64_skip_faulting_instruction(),
> >>>> though as above I think we should drop STP emulation entirely.
> >>>
> >>> I explain the reason why using instruction_pointer_set() under your comments
> >>> for simulate_nop().
> >>>
> >>> Thanks.
> >>>
> >>>>
> >>>> Mark.
> >>>>
> >>>
> >>
> >> --
> >> BR
> >> Liao, Chang
> >>
>
> --
> BR
> Liao, Chang
Mark Rutland Sept. 6, 2024, 9:39 a.m. UTC | #8
On Tue, Aug 27, 2024 at 07:33:55PM +0800, Liao, Chang wrote:
> Hi, Mark
> 
> Would you like to discuss this patch further, or do you still believe emulating
> STP to push FP/LR into the stack in kernel is not a good idea?

I'm happy with the NOP emulation in principle, so please send a new
version with *just* the NOP emulation, and I can review that.

Regarding STP emulation, I stand by my earlier comments, and in addition
to those comments, AFAICT it's currently unsafe to use any uaccess
routine in the uprobe BRK handler anyway, so that's moot. The uprobe BRK
handler runs with preemption disabled and IRQs (and all other maskable
exceptions) masked, and faults cannot be handled. IIUC
CONFIG_DEBUG_ATOMIC_SLEEP should scream about that.
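
To spell that out, here is a rough sketch (paraphrasing the generic uaccess
entry points, not quoting them) of where I'd expect the debug check to fire
if a uaccess were issued straight from the BRK handler:

/*
 * Sketch, not kernel source: roughly what sits between a copy_to_user()
 * call and the actual store to user memory.
 */
static unsigned long sketch_copy_to_user(void __user *to, const void *from,
					 unsigned long n)
{
	/*
	 * might_fault(): under CONFIG_DEBUG_ATOMIC_SLEEP this boils down to
	 * a might_sleep() check, which warns here because the uprobe BRK
	 * handler runs with IRQs masked and preemption disabled.
	 */
	might_fault();

	/*
	 * Even ignoring the warning, a fault taken on the store below cannot
	 * be handled from this context, so the access would simply fail.
	 */
	return raw_copy_to_user(to, from, n);
}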

Longer-term I'm happy to look into making it possible to do uaccesses in
the uprobe handlers, but that requires a much wider rework of the way we
handle BRK instructions, single-step, and hardware breakpoints and
watchpoints.

Mark.

> Thanks.
> 
> 
> 在 2024/8/21 15:55, Liao, Chang 写道:
> > Hi, Mark
> > 
> > My bad for taking so long to reply, I generally agree with your suggestions on
> > STP emulation.
> > 
> > 在 2024/8/15 17:58, Mark Rutland 写道:
> >> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
> >>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
> >>> counterintuitive result that nop and push variants are much slower than
> >>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
> >>> which excludes 'nop' and 'stp' from the emulatable instructions list.
> >>> This force the kernel returns to userspace and execute them out-of-line,
> >>> then trapping back to kernel for running uprobe callback functions. This
> >>> leads to a significant performance overhead compared to 'ret' variant,
> >>> which is already emulated.
> >>
> >> I appreciate this might be surprising, but does it actually matter
> >> outside of a microbenchmark?
> > 
> > I just did a simple comparison of the performance impact of single-stepped and
> > emulated STP on my local machine. Three user cases were measured: Redis GET and
> > SET throughput (Request Per Second, RPS), and the time taken to execute a grep
> > command on the "arch_uprobe_copy_xol" string within the kernel source.
> > 
> > Redis GET (higher is better)
> > ----------------------------
> > No uprobe: 49149.71 RPS
> > Single-stepped STP: 46750.82 RPS
> > Emulated STP: 48981.19 RPS
> > 
> > Redis SET (larger is better)
> > ----------------------------
> > No uprobe: 49761.14 RPS
> > Single-stepped STP: 45255.01 RPS
> > Emulated stp: 48619.21 RPS
> > 
> > Grep (lower is better)
> > ----------------------
> > No uprobe: 2.165s
> > Single-stepped STP: 15.314s
> > Emulated STP: 2.216s
> > 
> > The result reveals single-stepped STP instruction that used to push fp/lr into
> > stack significantly impacts the Redis and grep performance, leading to a notable
> > decrease in RPS and increase in time, respectively. So emulating STP on the
> > function entry might be a more viable option for uprobe.
> > 
> >>
> >>> Typicall uprobe is installed on 'nop' for USDT and on function entry
> >>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
> >>> and fp into stack regardless kernel or userspace binary. 
> >>
> >> Function entry doesn't always start with a STP; these days it's often a
> >> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
> >> the compiler), it could be any arbitrary instruction. This might happen
> >> to be the common case today, but there are certainly codebases where it
> >> is not.
> > 
> > Sure, if kernel, CPU and compiler support BTI and PAC, the entry instruction
> > is definitely not STP. But for CPUs and kernels that lack these features, STP as
> > the entry instruction is still the common case. And I profiled the entry
> > instruction for all leaf and non-leaf function, the ratio of STP is 64.5%
> > for redis, 76.1% for the BPF selftest bench. So I am thinking it is still
> > useful to emulate the STP on the function entry. Perhaps, for CPU and kernel
> > with BTI and PAC enabled, uprobe chooses the slower single-stepping to execute
> > STP for pushing stack.
> > 
> >>
> >> STP (or any instruction that accesses memory) is fairly painful to
> >> emulate because you need to ensure that the correct atomicity and
> >> ordering properties are provided (e.g. an aligned STP should be
> >> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
> >> chance), and that the correct VMSA behaviour is provided (e.g. when
> >> interacting with MTE, POE, etc, while the uaccess primitives don't try
> >> to be 100% equivalent to instructions in userspace).
> > Agreed, but I don't think it has to emulate strictly the single-copy-atomic
> > feature of STP that is used to push fp/lr into stack. In most cases, only one
> > CPU will push registers to the same position on stack. And I barely understand
> > why other CPUs would depends on the ordering of pushing data into stack. So it
> > means the atomicity and ordering is not so important for this scenario. Regarding
> > MTE and POE, a similar strategy to BTI and PAC can be applied: for CPUs and kernels
> > with MTE and POE enabled, uprobe chooses the slower single-stepping to execute
> > STP for pushing stack.
> > 
> >>
> >> For those reasons, in general I don't think we should be emulating any
> >> instruction which accesses memory, and we should not try to emulate the
> >> STP, but I think it's entirely reasonable to emulate NOP.
> >>
> >>> In order to
> >>> improve the performance of handling uprobe for common usecases. This
> >>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
> >>> and 'push'. The benchmark results below indicates the performance gain
> >>> of emulation is obvious.
> >>>
> >>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> >>>
> >>> xol (1 cpus)
> >>> ------------
> >>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> >>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> >>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> >>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> >>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> >>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> >>>
> >>> emulation (1 cpus)
> >>> -------------------
> >>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
> >>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
> >>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
> >>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
> >>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
> >>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
> >>>
> >>> As shown above, the performance gap between 'nop/push' and 'ret'
> >>> variants has been significantly reduced. Due to the emulation of 'push'
> >>> instruction needs to access userspace memory, it spent more cycles than
> >>> the other.
> >>>
> >>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> >>>
> >>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >>> ---
> >>>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
> >>>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
> >>>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
> >>>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
> >>>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
> >>>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
> >>>  6 files changed, 70 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> >>> index 8c0a36f72d6f..a246e6e550ba 100644
> >>> --- a/arch/arm64/include/asm/insn.h
> >>> +++ b/arch/arm64/include/asm/insn.h
> >>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
> >>>  	       aarch64_insn_is_prfm_lit(insn);
> >>>  }
> >>>  
> >>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
> >>> +{
> >>> +	/* nop */
> >>> +	return aarch64_insn_is_hint(insn) &&
> >>> +	       ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
> >>> +}
> >>
> >> This looks fine, but the comment can go.
> > 
> > Removed.
> > 
> >>
> >>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
> >>> +{
> >>> +	/*
> >>> +	 * The 1st instruction on function entry often follows the
> >>> +	 * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
> >>> +	 * into stack.
> >>> +	 */
> >>> +	return aarch64_insn_is_stp_pre(insn) &&
> >>> +	       (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
> >>> +	       (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
> >>> +	       (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x29 */
> >>> +	       (((insn >>  0) & 0x1F) == 29);	/* Rt  is x30 */
> >>> +}
> >>
> >> We have accessors for these fields. Please use them.
> > 
> > Do you mean aarch64_insn_decode_register()?
> > 
> >>
> >> Regardless, as above I do not think we should have a helper this
> >> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
> > 
> > If we left necessary of emulation of STP aside, where would the best file to
> > place these hard-coded decoder helper?
> > 
> >>
> >>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
> >>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
> >>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
> >>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
> >>> index 968d5fffe233..df7ca16fc763 100644
> >>> --- a/arch/arm64/kernel/probes/decode-insn.c
> >>> +++ b/arch/arm64/kernel/probes/decode-insn.c
> >>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
> >>>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
> >>>   */
> >>>  enum probe_insn __kprobes
> >>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
> >>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
> >>> +		      bool kernel)
> >>>  {
> >>> +	/*
> >>> +	 * While 'nop' and 'stp x29, x30, [sp, #imm]! instructions can
> >>> +	 * execute in the out-of-line slot, simulating them in breakpoint
> >>> +	 * handling offers better performance.
> >>> +	 */
> >>> +	if (aarch64_insn_is_nop(insn)) {
> >>> +		api->handler = simulate_nop;
> >>> +		return INSN_GOOD_NO_SLOT;
> >>> +	} else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
> >>> +		api->handler = simulate_stp_fp_lr_sp_64b;
> >>> +		return INSN_GOOD_NO_SLOT;
> >>> +	}
> >>
> >> With the STP emulation gone, you won't need the kernel parameter here.>
> >>> +
> >>>  	/*
> >>>  	 * Instructions reading or modifying the PC won't work from the XOL
> >>>  	 * slot.
> >>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
> >>>  		else
> >>>  			scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
> >>>  	}
> >>> -	decoded = arm_probe_decode_insn(insn, &asi->api);
> >>> +	decoded = arm_probe_decode_insn(insn, &asi->api, true);
> >>>  
> >>>  	if (decoded != INSN_REJECTED && scan_end)
> >>>  		if (is_probed_address_atomic(addr - 1, scan_end))
> >>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
> >>> index 8b758c5a2062..ec4607189933 100644
> >>> --- a/arch/arm64/kernel/probes/decode-insn.h
> >>> +++ b/arch/arm64/kernel/probes/decode-insn.h
> >>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
> >>>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
> >>>  #endif
> >>>  enum probe_insn __kprobes
> >>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
> >>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
> >>> +		      bool kernel);
> >>>  
> >>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
> >>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
> >>> index 22d0b3252476..0b1623fa7003 100644
> >>> --- a/arch/arm64/kernel/probes/simulate-insn.c
> >>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
> >>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
> >>>  
> >>>  	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> >>>  }
> >>> +
> >>> +void __kprobes
> >>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> >>> +{
> >>> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> >>> +}
> >>
> >> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
> >> and that's an extant bug in arch_uprobe_post_xol(). This can be:
> > 
> > For emulated instruction, uprobe won't enable single-step mode of CPU,
> > please check the handle_swbp() in kernel/events/uprobes.c:
> > 
> >   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> >           goto out;
> > 
> >   if (!pre_ssout(uprobe, regs, bp_vaddr))
> >           return;
> > 
> > For an emulated instruction, it will skip the entire single-stepping and associated
> > exception, which typically begins with pre_ssout() and ends with
> > arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
> > NOP is generally not a bad idea.
> > 
> >>
> >> | void __kprobes
> >> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> >> | {
> >> | 	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> >> | }
> >>
> >>> +
> >>> +void __kprobes
> >>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
> >>> +{
> >>> +	long imm7;
> >>> +	u64 buf[2];
> >>> +	long new_sp;
> >>> +
> >>> +	imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
> >>> +	new_sp = regs->sp + (imm7 << 3);
> >>
> >> We have accessors for these fields, please use them.
> > 
> > Do you mean aarch64_insn_decode_immediate()?
> > 
> >>
> >>> +
> >>> +	buf[0] = regs->regs[29];
> >>> +	buf[1] = regs->regs[30];
> >>> +
> >>> +	if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
> >>> +		force_sig(SIGSEGV);
> >>> +		return;
> >>> +	}
> >>
> >> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
> >> same way as an STP in userspace, and this will not have the same
> >> atomicity properties as an STP>
> >>> +
> >>> +	regs->sp = new_sp;
> >>> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> >>
> >> Likewise, this would need to use arm64_skip_faulting_instruction(),
> >> though as above I think we should drop STP emulation entirely.
> > 
> > I explain the reason why using instruction_pointer_set() under your comments
> > for simulate_nop().
> > 
> > Thanks.
> > 
> >>
> >> Mark.
> >>
> > 
> 
> -- 
> BR
> Liao, Chang
Andrii Nakryiko Sept. 6, 2024, 5:46 p.m. UTC | #9
On Fri, Sep 6, 2024 at 2:39 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Aug 27, 2024 at 07:33:55PM +0800, Liao, Chang wrote:
> > Hi, Mark
> >
> > Would you like to discuss this patch further, or do you still believe emulating
> > STP to push FP/LR into the stack in kernel is not a good idea?
>
> I'm happy with the NOP emulation in principle, so please send a new
> version with *just* the NOP emulation, and I can review that.

Let's definitely start with that, this is important for faster USDT tracing.

>
> Regarding STP emulation, I stand by my earlier comments, and in addition
> to those comments, AFAICT it's currently unsafe to use any uaccess
> routine in the uprobe BRK handler anyway, so that's moot. The uprobe BRK
> handler runs with preemption disabled and IRQs (and all other maskable
> exceptions) masked, and faults cannot be handled. IIUC
> CONFIG_DEBUG_ATOMIC_SLEEP should scream about that.

This part I don't really get, and this might be some very
ARM64-specific issue, so I'm sorry ahead of time.

But in general, at the lowest level uprobes work in two logical steps.
First, user space hits a breakpoint and the kernel gets control; if the VMA
that was hit might contain a uprobe, the kernel sets the TIF_UPROBE thread
flag and exits. This is the only part that runs in hard IRQ context. See
uprobe_notify_resume() and comments around it.

Then uprobe infrastructure gets called in user context on the way back
to user space. This is where we confirm that this is uprobe/uretprobe
hit, and, if supported, perform instruction emulation.
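
For reference, a compressed paraphrase of that flow (stitched together from
the arm64 entry code and kernel/events/uprobes.c, simplified and not
verbatim):

/*
 * Step 1: the BRK trap handler, with IRQs masked. It only tags the task;
 * no uprobe lookup, no emulation, no user memory access happens here.
 */
static int uprobe_breakpoint_handler(struct pt_regs *regs, unsigned long esr)
{
	if (uprobe_pre_sstep_notifier(regs))	/* sets TIF_UPROBE */
		return DBG_HOOK_HANDLED;
	return DBG_HOOK_ERROR;
}

/*
 * Step 2: on the exit-to-user path, in ordinary task context.
 * do_notify_resume() sees TIF_UPROBE and calls uprobe_notify_resume(),
 * which ends up in handle_swbp(): that is where the uprobe is looked up,
 * consumers (e.g. BPF programs) run, and the probed instruction is either
 * emulated (arch_uprobe_skip_sstep()) or single-stepped out of line
 * (pre_ssout()).
 */
void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
{
	if (thread_flags & _TIF_UPROBE)
		uprobe_notify_resume(regs);
	/* ... signal delivery and other exit work elided ... */
}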

So I'm wondering if your above comment refers to instruction emulation
within the first part of uprobe handling? If yes, then, no, that's not
where emulation will happen.

>
> Longer-term I'm happy to look into making it possible to do uaccesses in
> the uprobe handlers, but that requires a much wider rework of the way we
> handle BRK instructions, single-step, and hardware breakpoints and
> watchpoints.

See above and please let me know whether we are talking about the same
issue. In general, we run BPF programs in the same context where this
instruction emulation will happen, and BPF uprobe programs can even be
sleepable, so not just non-faultable user memory accesses, but even
faultable user accesses are supported.
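
For example, a minimal sleepable uprobe program of the kind referred to here.
This is a sketch: the "uprobe.s" section convention and bpf_copy_from_user()
come from libbpf and the BPF helpers, while the target binary, symbol and
argument are made up.

// Sleepable ("uprobe.s") BPF uprobe: runs in task context on the
// return-to-user path, so it may take faults while reading user memory.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("uprobe.s//usr/bin/example:some_function")	/* placeholder target */
int BPF_UPROBE(handle_some_function, const char *msg)
{
	char buf[64] = {};

	/* Faultable user read, only allowed in sleepable programs. */
	bpf_copy_from_user(buf, sizeof(buf), msg);
	bpf_printk("some_function(%s)", buf);
	return 0;
}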

>
> Mark.
>
> > Thanks.
> >
> >
> > 在 2024/8/21 15:55, Liao, Chang 写道:
> > > Hi, Mark
> > >
> > > My bad for taking so long to reply, I generally agree with your suggestions on
> > > STP emulation.
> > >
> > > 在 2024/8/15 17:58, Mark Rutland 写道:
> > >> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
> > >>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
> > >>> counterintuitive result that nop and push variants are much slower than
> > >>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
> > >>> which excludes 'nop' and 'stp' from the emulatable instructions list.
> > >>> This force the kernel returns to userspace and execute them out-of-line,
> > >>> then trapping back to kernel for running uprobe callback functions. This
> > >>> leads to a significant performance overhead compared to 'ret' variant,
> > >>> which is already emulated.
> > >>
> > >> I appreciate this might be surprising, but does it actually matter
> > >> outside of a microbenchmark?
> > >
> > > I just did a simple comparison of the performance impact of single-stepped and
> > > emulated STP on my local machine. Three user cases were measured: Redis GET and
> > > SET throughput (Request Per Second, RPS), and the time taken to execute a grep
> > > command on the "arch_uprobe_copy_xol" string within the kernel source.
> > >
> > > Redis GET (higher is better)
> > > ----------------------------
> > > No uprobe: 49149.71 RPS
> > > Single-stepped STP: 46750.82 RPS
> > > Emulated STP: 48981.19 RPS
> > >
> > > Redis SET (larger is better)
> > > ----------------------------
> > > No uprobe: 49761.14 RPS
> > > Single-stepped STP: 45255.01 RPS
> > > Emulated stp: 48619.21 RPS
> > >
> > > Grep (lower is better)
> > > ----------------------
> > > No uprobe: 2.165s
> > > Single-stepped STP: 15.314s
> > > Emulated STP: 2.216s
> > >
> > > The result reveals single-stepped STP instruction that used to push fp/lr into
> > > stack significantly impacts the Redis and grep performance, leading to a notable
> > > decrease in RPS and increase in time, respectively. So emulating STP on the
> > > function entry might be a more viable option for uprobe.
> > >
> > >>
> > >>> Typicall uprobe is installed on 'nop' for USDT and on function entry
> > >>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
> > >>> and fp into stack regardless kernel or userspace binary.
> > >>
> > >> Function entry doesn't always start with a STP; these days it's often a
> > >> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
> > >> the compiler), it could be any arbitrary instruction. This might happen
> > >> to be the common case today, but there are certainly codebases where it
> > >> is not.
> > >
> > > Sure, if kernel, CPU and compiler support BTI and PAC, the entry instruction
> > > is definitely not STP. But for CPUs and kernels that lack these features, STP as
> > > the entry instruction is still the common case. And I profiled the entry
> > > instruction for all leaf and non-leaf function, the ratio of STP is 64.5%
> > > for redis, 76.1% for the BPF selftest bench. So I am thinking it is still
> > > useful to emulate the STP on the function entry. Perhaps, for CPU and kernel
> > > with BTI and PAC enabled, uprobe chooses the slower single-stepping to execute
> > > STP for pushing stack.
> > >
> > >>
> > >> STP (or any instruction that accesses memory) is fairly painful to
> > >> emulate because you need to ensure that the correct atomicity and
> > >> ordering properties are provided (e.g. an aligned STP should be
> > >> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
> > >> chance), and that the correct VMSA behaviour is provided (e.g. when
> > >> interacting with MTE, POE, etc, while the uaccess primitives don't try
> > >> to be 100% equivalent to instructions in userspace).
> > > Agreed, but I don't think it has to emulate strictly the single-copy-atomic
> > > feature of STP that is used to push fp/lr into stack. In most cases, only one
> > > CPU will push registers to the same position on stack. And I barely understand
> > > why other CPUs would depends on the ordering of pushing data into stack. So it
> > > means the atomicity and ordering is not so important for this scenario. Regarding
> > > MTE and POE, a similar strategy to BTI and PAC can be applied: for CPUs and kernels
> > > with MTE and POE enabled, uprobe chooses the slower single-stepping to execute
> > > STP for pushing stack.
> > >
> > >>
> > >> For those reasons, in general I don't think we should be emulating any
> > >> instruction which accesses memory, and we should not try to emulate the
> > >> STP, but I think it's entirely reasonable to emulate NOP.
> > >>
> > >>> In order to
> > >>> improve the performance of handling uprobe for common usecases. This
> > >>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
> > >>> and 'push'. The benchmark results below indicates the performance gain
> > >>> of emulation is obvious.
> > >>>
> > >>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
> > >>>
> > >>> xol (1 cpus)
> > >>> ------------
> > >>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
> > >>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
> > >>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
> > >>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
> > >>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
> > >>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
> > >>>
> > >>> emulation (1 cpus)
> > >>> -------------------
> > >>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
> > >>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
> > >>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
> > >>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
> > >>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
> > >>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
> > >>>
> > >>> As shown above, the performance gap between 'nop/push' and 'ret'
> > >>> variants has been significantly reduced. Due to the emulation of 'push'
> > >>> instruction needs to access userspace memory, it spent more cycles than
> > >>> the other.
> > >>>
> > >>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> > >>>
> > >>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> > >>> ---
> > >>>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
> > >>>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
> > >>>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
> > >>>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
> > >>>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
> > >>>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
> > >>>  6 files changed, 70 insertions(+), 4 deletions(-)
> > >>>
> > >>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
> > >>> index 8c0a36f72d6f..a246e6e550ba 100644
> > >>> --- a/arch/arm64/include/asm/insn.h
> > >>> +++ b/arch/arm64/include/asm/insn.h
> > >>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
> > >>>          aarch64_insn_is_prfm_lit(insn);
> > >>>  }
> > >>>
> > >>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
> > >>> +{
> > >>> + /* nop */
> > >>> + return aarch64_insn_is_hint(insn) &&
> > >>> +        ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
> > >>> +}
> > >>
> > >> This looks fine, but the comment can go.
> > >
> > > Removed.
> > >
> > >>
> > >>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
> > >>> +{
> > >>> + /*
> > >>> +  * The 1st instruction on function entry often follows the
> > >>> +  * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
> > >>> +  * into stack.
> > >>> +  */
> > >>> + return aarch64_insn_is_stp_pre(insn) &&
> > >>> +        (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
> > >>> +        (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
> > >>> +        (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x29 */
> > >>> +        (((insn >>  0) & 0x1F) == 29);   /* Rt  is x30 */
> > >>> +}
> > >>
> > >> We have accessors for these fields. Please use them.
> > >
> > > Do you mean aarch64_insn_decode_register()?
> > >
> > >>
> > >> Regardless, as above I do not think we should have a helper this
> > >> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
> > >
> > > If we left necessary of emulation of STP aside, where would the best file to
> > > place these hard-coded decoder helper?
> > >
> > >>
> > >>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
> > >>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
> > >>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
> > >>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
> > >>> index 968d5fffe233..df7ca16fc763 100644
> > >>> --- a/arch/arm64/kernel/probes/decode-insn.c
> > >>> +++ b/arch/arm64/kernel/probes/decode-insn.c
> > >>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
> > >>>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
> > >>>   */
> > >>>  enum probe_insn __kprobes
> > >>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
> > >>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
> > >>> +               bool kernel)
> > >>>  {
> > >>> + /*
> > >>> +  * While 'nop' and 'stp x29, x30, [sp, #imm]! instructions can
> > >>> +  * execute in the out-of-line slot, simulating them in breakpoint
> > >>> +  * handling offers better performance.
> > >>> +  */
> > >>> + if (aarch64_insn_is_nop(insn)) {
> > >>> +         api->handler = simulate_nop;
> > >>> +         return INSN_GOOD_NO_SLOT;
> > >>> + } else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
> > >>> +         api->handler = simulate_stp_fp_lr_sp_64b;
> > >>> +         return INSN_GOOD_NO_SLOT;
> > >>> + }
> > >>
> > >> With the STP emulation gone, you won't need the kernel parameter here.>
> > >>> +
> > >>>   /*
> > >>>    * Instructions reading or modifying the PC won't work from the XOL
> > >>>    * slot.
> > >>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
> > >>>           else
> > >>>                   scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
> > >>>   }
> > >>> - decoded = arm_probe_decode_insn(insn, &asi->api);
> > >>> + decoded = arm_probe_decode_insn(insn, &asi->api, true);
> > >>>
> > >>>   if (decoded != INSN_REJECTED && scan_end)
> > >>>           if (is_probed_address_atomic(addr - 1, scan_end))
> > >>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
> > >>> index 8b758c5a2062..ec4607189933 100644
> > >>> --- a/arch/arm64/kernel/probes/decode-insn.h
> > >>> +++ b/arch/arm64/kernel/probes/decode-insn.h
> > >>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
> > >>>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
> > >>>  #endif
> > >>>  enum probe_insn __kprobes
> > >>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
> > >>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
> > >>> +               bool kernel);
> > >>>
> > >>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
> > >>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
> > >>> index 22d0b3252476..0b1623fa7003 100644
> > >>> --- a/arch/arm64/kernel/probes/simulate-insn.c
> > >>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
> > >>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
> > >>>
> > >>>   instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> > >>>  }
> > >>> +
> > >>> +void __kprobes
> > >>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> > >>> +{
> > >>> + instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> > >>> +}
> > >>
> > >> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
> > >> and that's an extant bug in arch_uprobe_post_xol(). This can be:
> > >
> > > For emulated instruction, uprobe won't enable single-step mode of CPU,
> > > please check the handle_swbp() in kernel/events/uprobes.c:
> > >
> > >   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
> > >           goto out;
> > >
> > >   if (!pre_ssout(uprobe, regs, bp_vaddr))
> > >           return;
> > >
> > > For an emulated instruction, it will skip the entire single-stepping and associated
> > > exception, which typically begins with pre_ssout() and ends with
> > > arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
> > > NOP is generally not a bad idea.
> > >
> > >>
> > >> | void __kprobes
> > >> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
> > >> | {
> > >> |  arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
> > >> | }
> > >>
> > >>> +
> > >>> +void __kprobes
> > >>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
> > >>> +{
> > >>> + long imm7;
> > >>> + u64 buf[2];
> > >>> + long new_sp;
> > >>> +
> > >>> + imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
> > >>> + new_sp = regs->sp + (imm7 << 3);
> > >>
> > >> We have accessors for these fields, please use them.
> > >
> > > Do you mean aarch64_insn_decode_immediate()?
> > >
> > >>
> > >>> +
> > >>> + buf[0] = regs->regs[29];
> > >>> + buf[1] = regs->regs[30];
> > >>> +
> > >>> + if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
> > >>> +         force_sig(SIGSEGV);
> > >>> +         return;
> > >>> + }
> > >>
> > >> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
> > >> same way as an STP in userspace, and this will not have the same
> > >> atomicity properties as an STP>
> > >>> +
> > >>> + regs->sp = new_sp;
> > >>> + instruction_pointer_set(regs, instruction_pointer(regs) + 4);
> > >>
> > >> Likewise, this sould need ot use arm64_skip_faulting_instruction(),
> > >> though as above I think we should drop STP emulation entirely.
> > >
> > > I explain the reason why using instruction_pointer_set() under your comments
> > > for simulate_nop().
> > >
> > > Thanks.
> > >
> > >>
> > >> Mark.
> > >>
> > >
> >
> > --
> > BR
> > Liao, Chang
Liao, Chang Sept. 9, 2024, 7:24 a.m. UTC | #10
On 2024/9/7 1:46, Andrii Nakryiko wrote:
> On Fri, Sep 6, 2024 at 2:39 AM Mark Rutland <mark.rutland@arm.com> wrote:
>>
>> On Tue, Aug 27, 2024 at 07:33:55PM +0800, Liao, Chang wrote:
>>> Hi, Mark
>>>
>>> Would you like to discuss this patch further, or do you still believe emulating
>>> STP to push FP/LR into the stack in kernel is not a good idea?
>>
>> I'm happy with the NOP emulation in principle, so please send a new
>> version with *just* the NOP emulation, and I can review that.
> 
> Let's definitely start with that, this is important for faster USDT tracing.
> 
>>
>> Regarding STP emulation, I stand by my earlier comments, and in addition
>> to those comments, AFAICT it's currently unsafe to use any uaccess
>> routine in the uprobe BRK handler anyway, so that's moot. The uprobe BRK
>> handler runs with preemption disabled and IRQs (and all other maskable
>> exceptions) masked, and faults cannot be handled. IIUC
>> CONFIG_DEBUG_ATOMIC_SLEEP should scream about that.
> 
> This part I don't really get, and this might be some very
> ARM64-specific issue, so I'm sorry ahead of time.
> 
> But in general, at the lowest level uprobes work in two logical steps.
> First, there is a breakpoint that user space hits, kernel gets
> control, and if VMA which hit breakpoint might contain uprobe, kernel
> sets TIF_UPROBE thread flag and exits. This is the only part that's in
> hard IRQ context. See uprobe_notify_resume() and comments around it.
> 
> Then uprobe infrastructure gets called in user context on the way back
> to user space. This is where we confirm that this is uprobe/uretprobe
> hit, and, if supported, perform instruction emulation.
> 
> So I'm wondering if your above comment refers to instruction emulation
> within the first part of uprobe handling? If yes, then, no, that's not
> where emulation will happen.

Andrii, for uprobes on ARM64, instruction emulation occurs in the second
step, with preemption enabled after the first step. For *kprobes*, however,
simulation does execute in the first step. Since we are only concerned with
uprobes here, I would raise the same question of why using uaccess routines
is unsafe in this context.
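
To make the point concrete, the arm64 path looks roughly like the sketch
below (paraphrased from arch/arm64/kernel/probes/uprobes.c, so details may
differ between kernel versions): when a simulate handler was selected at
decode time, arch_uprobe_skip_sstep() runs it directly and handle_swbp()
never reaches pre_ssout(), i.e. no single-step state is set up at all.

  bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
  {
  	probe_opcode_t insn;

  	/* No simulate handler chosen at decode time: fall back to XOL. */
  	if (!auprobe->simulate)
  		return false;

  	insn = *(probe_opcode_t *)(&auprobe->insn[0]);

  	/* Run the simulate_*() handler, e.g. simulate_nop(), in place. */
  	if (auprobe->api.handler)
  		auprobe->api.handler(insn, instruction_pointer(regs), regs);

  	/* Returning true makes handle_swbp() skip pre_ssout() entirely. */
  	return true;
  }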

> 
>>
>> Longer-term I'm happy to look into making it possible to do uaccesses in
>> the uprobe handlers, but that requires a much wider rework of the way we
>> handle BRK instructions, single-step, and hardware breakpoints and
>> watchpoints.
> 
> See above and please let me know whether we are talking about the same
> issue. In general, we run BPF programs in the same context where this
> instruction emulation will happen, and BPF uprobe programs can be even
> sleepable, so not just non-faultable user memory accesses, but even
> faultable user accesses are supported.
> 
>>
>> Mark.
>>
>>> Thanks.
>>>
>>>
>>> 在 2024/8/21 15:55, Liao, Chang 写道:
>>>> Hi, Mark
>>>>
>>>> My bad for taking so long to rely, I generally agree with your suggestions to
>>>> STP emulation.
>>>>
>>>> 在 2024/8/15 17:58, Mark Rutland 写道:
>>>>> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
>>>>>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>>>>>> counterintuitive result that nop and push variants are much slower than
>>>>>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>>>>>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>>>>>> This force the kernel returns to userspace and execute them out-of-line,
>>>>>> then trapping back to kernel for running uprobe callback functions. This
>>>>>> leads to a significant performance overhead compared to 'ret' variant,
>>>>>> which is already emulated.
>>>>>
>>>>> I appreciate this might be surprising, but does it actually matter
>>>>> outside of a microbenchmark?
>>>>
>>>> I just do a simple comparsion the performance impact of single-stepped and
>>>> emulated STP on my local machine. Three user cases were measured: Redis GET and
>>>> SET throughput (Request Per Second, RPS), and the time taken to execute a grep
>>>> command on the "arch_uprobe_copy_xol" string within the kernel source.
>>>>
>>>> Redis GET (higher is better)
>>>> ----------------------------
>>>> No uprobe: 49149.71 RPS
>>>> Single-stepped STP: 46750.82 RPS
>>>> Emulated STP: 48981.19 RPS
>>>>
>>>> Redis SET (larger is better)
>>>> ----------------------------
>>>> No uprobe: 49761.14 RPS
>>>> Single-stepped STP: 45255.01 RPS
>>>> Emulated stp: 48619.21 RPS
>>>>
>>>> Grep (lower is better)
>>>> ----------------------
>>>> No uprobe: 2.165s
>>>> Single-stepped STP: 15.314s
>>>> Emualted STP: 2.216s
>>>>
>>>> The result reveals single-stepped STP instruction that used to push fp/lr into
>>>> stack significantly impacts the Redis and grep performance, leading to a notable
>>>> notable decrease RPS and increase time individually. So emulating STP on the
>>>> function entry might be a more viable option for uprobe.
>>>>
>>>>>
>>>>>> Typicall uprobe is installed on 'nop' for USDT and on function entry
>>>>>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
>>>>>> and fp into stack regardless kernel or userspace binary.
>>>>>
>>>>> Function entry doesn't always start with a STP; these days it's often a
>>>>> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
>>>>> the compiler), it could be any arbitrary instruction. This might happen
>>>>> to be the common case today, but there are certain;y codebases where it
>>>>> is not.
>>>>
>>>> Sure, if kernel, CPU and compiler support BTI and PAC, the entry instruction
>>>> is definitly not STP. But for CPU and kernel lack of these supports, STP as
>>>> the entry instruction is still the common case. And I profiled the entry
>>>> instruction for all leaf and non-leaf function, the ratio of STP is 64.5%
>>>> for redis, 76.1% for the BPF selftest bench. So I am thinking it is still
>>>> useful to emulate the STP on the function entry. Perhaps, for CPU and kernel
>>>> with BTI and PAC enabled, uprobe chooses the slower single-stepping to execute
>>>> STP for pushing stack.
>>>>
>>>>>
>>>>> STP (or any instruction that accesses memory) is fairly painful to
>>>>> emulate because you need to ensure that the correct atomicity and
>>>>> ordering properties are provided (e.g. an aligned STP should be
>>>>> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
>>>>> chance), and that the correct VMSA behaviour is provided (e.g. when
>>>>> interacting with MTE, POE, etc, while the uaccess primitives don't try
>>>>> to be 100% equivalent to instructions in userspace).
>>>> Agreed, but I don't think it has to emulate strictly the single-copy-atomic
>>>> feature of STP that is used to push fp/lr into stack. In most cases, only one
>>>> CPU will push registers to the same position on stack. And I barely understand
>>>> why other CPUs would depends on the ordering of pushing data into stack. So it
>>>> means the atomicity and ordering is not so important for this scenario. Regarding
>>>> MTE and POE, a similar stragety to BTI and PAC can be applied: for CPUs and kernel
>>>> with MTE and POE enabled, uprobe chooses the slower single-stepping to execute
>>>> STP for pushing stack.
>>>>
>>>>>
>>>>> For those reasons, in general I don't think we should be emulating any
>>>>> instruction which accesses memory, and we should not try to emulate the
>>>>> STP, but I think it's entirely reasonable to emulate NOP.
>>>>>
>>>>>> In order to
>>>>>> improve the performance of handling uprobe for common usecases. This
>>>>>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
>>>>>> and 'push'. The benchmark results below indicates the performance gain
>>>>>> of emulation is obvious.
>>>>>>
>>>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>>>>>
>>>>>> xol (1 cpus)
>>>>>> ------------
>>>>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>>>
>>>>>> emulation (1 cpus)
>>>>>> -------------------
>>>>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
>>>>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
>>>>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
>>>>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
>>>>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
>>>>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
>>>>>>
>>>>>> As shown above, the performance gap between 'nop/push' and 'ret'
>>>>>> variants has been significantly reduced. Due to the emulation of 'push'
>>>>>> instruction needs to access userspace memory, it spent more cycles than
>>>>>> the other.
>>>>>>
>>>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>>>>>>
>>>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>>>>> ---
>>>>>>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
>>>>>>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
>>>>>>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
>>>>>>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
>>>>>>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
>>>>>>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
>>>>>>  6 files changed, 70 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>>>>> index 8c0a36f72d6f..a246e6e550ba 100644
>>>>>> --- a/arch/arm64/include/asm/insn.h
>>>>>> +++ b/arch/arm64/include/asm/insn.h
>>>>>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>>>>>>          aarch64_insn_is_prfm_lit(insn);
>>>>>>  }
>>>>>>
>>>>>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
>>>>>> +{
>>>>>> + /* nop */
>>>>>> + return aarch64_insn_is_hint(insn) &&
>>>>>> +        ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
>>>>>> +}
>>>>>
>>>>> This looks fine, but the comment can go.
>>>>
>>>> Removed.
>>>>
>>>>>
>>>>>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
>>>>>> +{
>>>>>> + /*
>>>>>> +  * The 1st instruction on function entry often follows the
>>>>>> +  * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
>>>>>> +  * into stack.
>>>>>> +  */
>>>>>> + return aarch64_insn_is_stp_pre(insn) &&
>>>>>> +        (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
>>>>>> +        (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
>>>>>> +        (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x29 */
>>>>>> +        (((insn >>  0) & 0x1F) == 29);   /* Rt  is x30 */
>>>>>> +}
>>>>>
>>>>> We have accessors for these fields. Please use them.
>>>>
>>>> Do you mean aarch64_insn_decode_register()?
>>>>
>>>>>
>>>>> Regardless, as above I do not think we should have a helper this
>>>>> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
>>>>
>>>> If we left necessary of emulation of STP aside, where would the best file to
>>>> place these hard-coded decoder helper?
>>>>
>>>>>
>>>>>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>>>>>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>>>>>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
>>>>>> index 968d5fffe233..df7ca16fc763 100644
>>>>>> --- a/arch/arm64/kernel/probes/decode-insn.c
>>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.c
>>>>>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>>>>>>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
>>>>>>   */
>>>>>>  enum probe_insn __kprobes
>>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
>>>>>> +               bool kernel)
>>>>>>  {
>>>>>> + /*
>>>>>> +  * While 'nop' and 'stp x29, x30, [sp, #imm]! instructions can
>>>>>> +  * execute in the out-of-line slot, simulating them in breakpoint
>>>>>> +  * handling offers better performance.
>>>>>> +  */
>>>>>> + if (aarch64_insn_is_nop(insn)) {
>>>>>> +         api->handler = simulate_nop;
>>>>>> +         return INSN_GOOD_NO_SLOT;
>>>>>> + } else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
>>>>>> +         api->handler = simulate_stp_fp_lr_sp_64b;
>>>>>> +         return INSN_GOOD_NO_SLOT;
>>>>>> + }
>>>>>
>>>>> With the STP emulation gone, you won't need the kernel parameter here.>
>>>>>> +
>>>>>>   /*
>>>>>>    * Instructions reading or modifying the PC won't work from the XOL
>>>>>>    * slot.
>>>>>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
>>>>>>           else
>>>>>>                   scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
>>>>>>   }
>>>>>> - decoded = arm_probe_decode_insn(insn, &asi->api);
>>>>>> + decoded = arm_probe_decode_insn(insn, &asi->api, true);
>>>>>>
>>>>>>   if (decoded != INSN_REJECTED && scan_end)
>>>>>>           if (is_probed_address_atomic(addr - 1, scan_end))
>>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
>>>>>> index 8b758c5a2062..ec4607189933 100644
>>>>>> --- a/arch/arm64/kernel/probes/decode-insn.h
>>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.h
>>>>>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
>>>>>>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
>>>>>>  #endif
>>>>>>  enum probe_insn __kprobes
>>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
>>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
>>>>>> +               bool kernel);
>>>>>>
>>>>>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
>>>>>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
>>>>>> index 22d0b3252476..0b1623fa7003 100644
>>>>>> --- a/arch/arm64/kernel/probes/simulate-insn.c
>>>>>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
>>>>>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>>>>>>
>>>>>>   instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>>>  }
>>>>>> +
>>>>>> +void __kprobes
>>>>>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>>>> +{
>>>>>> + instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>>> +}
>>>>>
>>>>> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
>>>>> and that's an extant bug in arch_uprobe_post_xol(). This can be:
>>>>
>>>> For emulated instruction, uprobe won't enable single-step mode of CPU,
>>>> please check the handle_swbp() in kernel/events/uprobes.c:
>>>>
>>>>   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>>>>           goto out;
>>>>
>>>>   if (!pre_ssout(uprobe, regs, bp_vaddr))
>>>>           return;
>>>>
>>>> For emualted instruction, It will skip entire single-stepping and associated
>>>> exception, which typically begins with pre_ssout() and ends with
>>>> arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
>>>> NOP is generally not a bad idea.
>>>>
>>>>>
>>>>> | void __kprobes
>>>>> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>>> | {
>>>>> |  arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
>>>>> | }
>>>>>
>>>>>> +
>>>>>> +void __kprobes
>>>>>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
>>>>>> +{
>>>>>> + long imm7;
>>>>>> + u64 buf[2];
>>>>>> + long new_sp;
>>>>>> +
>>>>>> + imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
>>>>>> + new_sp = regs->sp + (imm7 << 3);
>>>>>
>>>>> We have accessors for these fields, please use them.
>>>>
>>>> Do you mean aarch64_insn_decode_immediate()?
>>>>
>>>>>
>>>>>> +
>>>>>> + buf[0] = regs->regs[29];
>>>>>> + buf[1] = regs->regs[30];
>>>>>> +
>>>>>> + if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
>>>>>> +         force_sig(SIGSEGV);
>>>>>> +         return;
>>>>>> + }
>>>>>
>>>>> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
>>>>> same way as an STP in userspace, and this will not have the same
>>>>> atomicity properties as an STP>
>>>>>> +
>>>>>> + regs->sp = new_sp;
>>>>>> + instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>>
>>>>> Likewise, this sould need ot use arm64_skip_faulting_instruction(),
>>>>> though as above I think we should drop STP emulation entirely.
>>>>
>>>> I explain the reason why using instruction_pointer_set() under your comments
>>>> for simulate_nop().
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>> Mark.
>>>>>
>>>>
>>>
>>> --
>>> BR
>>> Liao, Chang
Liao, Chang Sept. 9, 2024, 7:40 a.m. UTC | #11
On 2024/9/6 4:17, Andrii Nakryiko wrote:
> On Fri, Aug 30, 2024 at 2:25 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>
>>
>>
>> 在 2024/8/30 3:26, Andrii Nakryiko 写道:
>>> On Tue, Aug 27, 2024 at 4:34 AM Liao, Chang <liaochang1@huawei.com> wrote:
>>>>
>>>> Hi, Mark
>>>>
>>>> Would you like to discuss this patch further, or do you still believe emulating
>>>> STP to push FP/LR into the stack in kernel is not a good idea?
>>>>
>>>
>>> Please send an updated version of your patches taking into account
>>> various smaller issues Mark pointed out. But please keep STP
>>> emulation, I think it's very important, even if it's not covering all
>>> possible uprobe tracing scenarios.
>>
>> OK, I will send it out over the weekend, including an enhancement of STP
>> emuation that addresses the MTE and POE issues Mark mentioned. I hope this
>> will lead to more feedback from him.
>>
> 
> Hey Liao,
> 
> Did you get a chance to prepare and send a new revision of this patch?
Andrii,

Yes, v2 is on the way; please find it here:

  https://lore.kernel.org/all/20240909071114.1150053-1-liaochang1@huawei.com/

Due to the issues Mark pointed out, primarily the atomicity and ordering
of STP and the VMSA features in the uprobe-push context, I have removed the
STP simulation and all related bits from v1. However, I've been working on a
more complex method to simulate STP that differs from the uaccess routines
used in v1. I'm currently debugging it to ensure everything works correctly.
Please allow me a few more hours to complete it before sending it as a
separate patch for review, thanks.

BR,
Chang

> 
>>>
>>>> Thanks.
>>>>
>>>>
>>>> 在 2024/8/21 15:55, Liao, Chang 写道:
>>>>> Hi, Mark
>>>>>
>>>>> My bad for taking so long to rely, I generally agree with your suggestions to
>>>>> STP emulation.
>>>>>
>>>>> 在 2024/8/15 17:58, Mark Rutland 写道:
>>>>>> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
>>>>>>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>>>>>>> counterintuitive result that nop and push variants are much slower than
>>>>>>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>>>>>>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>>>>>>> This force the kernel returns to userspace and execute them out-of-line,
>>>>>>> then trapping back to kernel for running uprobe callback functions. This
>>>>>>> leads to a significant performance overhead compared to 'ret' variant,
>>>>>>> which is already emulated.
>>>>>>
>>>>>> I appreciate this might be surprising, but does it actually matter
>>>>>> outside of a microbenchmark?
>>>>>
>>>>> I just do a simple comparsion the performance impact of single-stepped and
>>>>> emulated STP on my local machine. Three user cases were measured: Redis GET and
>>>>> SET throughput (Request Per Second, RPS), and the time taken to execute a grep
>>>>> command on the "arch_uprobe_copy_xol" string within the kernel source.
>>>>>
>>>>> Redis GET (higher is better)
>>>>> ----------------------------
>>>>> No uprobe: 49149.71 RPS
>>>>> Single-stepped STP: 46750.82 RPS
>>>>> Emulated STP: 48981.19 RPS
>>>>>
>>>>> Redis SET (larger is better)
>>>>> ----------------------------
>>>>> No uprobe: 49761.14 RPS
>>>>> Single-stepped STP: 45255.01 RPS
>>>>> Emulated stp: 48619.21 RPS
>>>>>
>>>>> Grep (lower is better)
>>>>> ----------------------
>>>>> No uprobe: 2.165s
>>>>> Single-stepped STP: 15.314s
>>>>> Emualted STP: 2.216s
>>>>>
>>>>> The result reveals single-stepped STP instruction that used to push fp/lr into
>>>>> stack significantly impacts the Redis and grep performance, leading to a notable
>>>>> notable decrease RPS and increase time individually. So emulating STP on the
>>>>> function entry might be a more viable option for uprobe.
>>>>>
>>>>>>
>>>>>>> Typicall uprobe is installed on 'nop' for USDT and on function entry
>>>>>>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
>>>>>>> and fp into stack regardless kernel or userspace binary.
>>>>>>
>>>>>> Function entry doesn't always start with a STP; these days it's often a
>>>>>> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
>>>>>> the compiler), it could be any arbitrary instruction. This might happen
>>>>>> to be the common case today, but there are certain;y codebases where it
>>>>>> is not.
>>>>>
>>>>> Sure, if kernel, CPU and compiler support BTI and PAC, the entry instruction
>>>>> is definitly not STP. But for CPU and kernel lack of these supports, STP as
>>>>> the entry instruction is still the common case. And I profiled the entry
>>>>> instruction for all leaf and non-leaf function, the ratio of STP is 64.5%
>>>>> for redis, 76.1% for the BPF selftest bench. So I am thinking it is still
>>>>> useful to emulate the STP on the function entry. Perhaps, for CPU and kernel
>>>>> with BTI and PAC enabled, uprobe chooses the slower single-stepping to execute
>>>>> STP for pushing stack.
>>>>>
>>>>>>
>>>>>> STP (or any instruction that accesses memory) is fairly painful to
>>>>>> emulate because you need to ensure that the correct atomicity and
>>>>>> ordering properties are provided (e.g. an aligned STP should be
>>>>>> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
>>>>>> chance), and that the correct VMSA behaviour is provided (e.g. when
>>>>>> interacting with MTE, POE, etc, while the uaccess primitives don't try
>>>>>> to be 100% equivalent to instructions in userspace).
>>>>> Agreed, but I don't think it has to emulate strictly the single-copy-atomic
>>>>> feature of STP that is used to push fp/lr into stack. In most cases, only one
>>>>> CPU will push registers to the same position on stack. And I barely understand
>>>>> why other CPUs would depends on the ordering of pushing data into stack. So it
>>>>> means the atomicity and ordering is not so important for this scenario. Regarding
>>>>> MTE and POE, a similar stragety to BTI and PAC can be applied: for CPUs and kernel
>>>>> with MTE and POE enabled, uprobe chooses the slower single-stepping to execute
>>>>> STP for pushing stack.
>>>>>
>>>>>>
>>>>>> For those reasons, in general I don't think we should be emulating any
>>>>>> instruction which accesses memory, and we should not try to emulate the
>>>>>> STP, but I think it's entirely reasonable to emulate NOP.
>>>>>>
>>>>>>> In order to
>>>>>>> improve the performance of handling uprobe for common usecases. This
>>>>>>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
>>>>>>> and 'push'. The benchmark results below indicates the performance gain
>>>>>>> of emulation is obvious.
>>>>>>>
>>>>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>>>>>>
>>>>>>> xol (1 cpus)
>>>>>>> ------------
>>>>>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>>>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>>>>
>>>>>>> emulation (1 cpus)
>>>>>>> -------------------
>>>>>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
>>>>>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
>>>>>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
>>>>>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
>>>>>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
>>>>>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
>>>>>>>
>>>>>>> As shown above, the performance gap between 'nop/push' and 'ret'
>>>>>>> variants has been significantly reduced. Due to the emulation of 'push'
>>>>>>> instruction needs to access userspace memory, it spent more cycles than
>>>>>>> the other.
>>>>>>>
>>>>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>>>>>>>
>>>>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>>>>>> ---
>>>>>>>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
>>>>>>>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
>>>>>>>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
>>>>>>>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
>>>>>>>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
>>>>>>>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
>>>>>>>  6 files changed, 70 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>>>>>> index 8c0a36f72d6f..a246e6e550ba 100644
>>>>>>> --- a/arch/arm64/include/asm/insn.h
>>>>>>> +++ b/arch/arm64/include/asm/insn.h
>>>>>>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>>>>>>>            aarch64_insn_is_prfm_lit(insn);
>>>>>>>  }
>>>>>>>
>>>>>>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
>>>>>>> +{
>>>>>>> +   /* nop */
>>>>>>> +   return aarch64_insn_is_hint(insn) &&
>>>>>>> +          ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
>>>>>>> +}
>>>>>>
>>>>>> This looks fine, but the comment can go.
>>>>>
>>>>> Removed.
>>>>>
>>>>>>
>>>>>>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
>>>>>>> +{
>>>>>>> +   /*
>>>>>>> +    * The 1st instruction on function entry often follows the
>>>>>>> +    * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
>>>>>>> +    * into stack.
>>>>>>> +    */
>>>>>>> +   return aarch64_insn_is_stp_pre(insn) &&
>>>>>>> +          (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
>>>>>>> +          (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
>>>>>>> +          (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x29 */
>>>>>>> +          (((insn >>  0) & 0x1F) == 29);   /* Rt  is x30 */
>>>>>>> +}
>>>>>>
>>>>>> We have accessors for these fields. Please use them.
>>>>>
>>>>> Do you mean aarch64_insn_decode_register()?
>>>>>
>>>>>>
>>>>>> Regardless, as above I do not think we should have a helper this
>>>>>> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
>>>>>
>>>>> If we left necessary of emulation of STP aside, where would the best file to
>>>>> place these hard-coded decoder helper?
>>>>>
>>>>>>
>>>>>>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>>>>>>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>>>>>>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>>>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
>>>>>>> index 968d5fffe233..df7ca16fc763 100644
>>>>>>> --- a/arch/arm64/kernel/probes/decode-insn.c
>>>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.c
>>>>>>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>>>>>>>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
>>>>>>>   */
>>>>>>>  enum probe_insn __kprobes
>>>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>>>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
>>>>>>> +                 bool kernel)
>>>>>>>  {
>>>>>>> +   /*
>>>>>>> +    * While 'nop' and 'stp x29, x30, [sp, #imm]! instructions can
>>>>>>> +    * execute in the out-of-line slot, simulating them in breakpoint
>>>>>>> +    * handling offers better performance.
>>>>>>> +    */
>>>>>>> +   if (aarch64_insn_is_nop(insn)) {
>>>>>>> +           api->handler = simulate_nop;
>>>>>>> +           return INSN_GOOD_NO_SLOT;
>>>>>>> +   } else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
>>>>>>> +           api->handler = simulate_stp_fp_lr_sp_64b;
>>>>>>> +           return INSN_GOOD_NO_SLOT;
>>>>>>> +   }
>>>>>>
>>>>>> With the STP emulation gone, you won't need the kernel parameter here.>
>>>>>>> +
>>>>>>>     /*
>>>>>>>      * Instructions reading or modifying the PC won't work from the XOL
>>>>>>>      * slot.
>>>>>>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
>>>>>>>             else
>>>>>>>                     scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
>>>>>>>     }
>>>>>>> -   decoded = arm_probe_decode_insn(insn, &asi->api);
>>>>>>> +   decoded = arm_probe_decode_insn(insn, &asi->api, true);
>>>>>>>
>>>>>>>     if (decoded != INSN_REJECTED && scan_end)
>>>>>>>             if (is_probed_address_atomic(addr - 1, scan_end))
>>>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
>>>>>>> index 8b758c5a2062..ec4607189933 100644
>>>>>>> --- a/arch/arm64/kernel/probes/decode-insn.h
>>>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.h
>>>>>>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
>>>>>>>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
>>>>>>>  #endif
>>>>>>>  enum probe_insn __kprobes
>>>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
>>>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
>>>>>>> +                 bool kernel);
>>>>>>>
>>>>>>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
>>>>>>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
>>>>>>> index 22d0b3252476..0b1623fa7003 100644
>>>>>>> --- a/arch/arm64/kernel/probes/simulate-insn.c
>>>>>>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
>>>>>>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>>>>>>>
>>>>>>>     instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>>>>  }
>>>>>>> +
>>>>>>> +void __kprobes
>>>>>>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>>>>> +{
>>>>>>> +   instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>>>> +}
>>>>>>
>>>>>> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
>>>>>> and that's an extant bug in arch_uprobe_post_xol(). This can be:
>>>>>
>>>>> For emulated instruction, uprobe won't enable single-step mode of CPU,
>>>>> please check the handle_swbp() in kernel/events/uprobes.c:
>>>>>
>>>>>   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>>>>>           goto out;
>>>>>
>>>>>   if (!pre_ssout(uprobe, regs, bp_vaddr))
>>>>>           return;
>>>>>
>>>>> For emualted instruction, It will skip entire single-stepping and associated
>>>>> exception, which typically begins with pre_ssout() and ends with
>>>>> arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
>>>>> NOP is generally not a bad idea.
>>>>>
>>>>>>
>>>>>> | void __kprobes
>>>>>> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>>>> | {
>>>>>> |    arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
>>>>>> | }
>>>>>>
>>>>>>> +
>>>>>>> +void __kprobes
>>>>>>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
>>>>>>> +{
>>>>>>> +   long imm7;
>>>>>>> +   u64 buf[2];
>>>>>>> +   long new_sp;
>>>>>>> +
>>>>>>> +   imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
>>>>>>> +   new_sp = regs->sp + (imm7 << 3);
>>>>>>
>>>>>> We have accessors for these fields, please use them.
>>>>>
>>>>> Do you mean aarch64_insn_decode_immediate()?
>>>>>
>>>>>>
>>>>>>> +
>>>>>>> +   buf[0] = regs->regs[29];
>>>>>>> +   buf[1] = regs->regs[30];
>>>>>>> +
>>>>>>> +   if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
>>>>>>> +           force_sig(SIGSEGV);
>>>>>>> +           return;
>>>>>>> +   }
>>>>>>
>>>>>> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
>>>>>> same way as an STP in userspace, and this will not have the same
>>>>>> atomicity properties as an STP>
>>>>>>> +
>>>>>>> +   regs->sp = new_sp;
>>>>>>> +   instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>>>
>>>>>> Likewise, this sould need ot use arm64_skip_faulting_instruction(),
>>>>>> though as above I think we should drop STP emulation entirely.
>>>>>
>>>>> I explain the reason why using instruction_pointer_set() under your comments
>>>>> for simulate_nop().
>>>>>
>>>>> Thanks.
>>>>>
>>>>>>
>>>>>> Mark.
>>>>>>
>>>>>
>>>>
>>>> --
>>>> BR
>>>> Liao, Chang
>>>>
>>
>> --
>> BR
>> Liao, Chang
Liao, Chang Sept. 9, 2024, 7:42 a.m. UTC | #12
On 2024/9/6 17:39, Mark Rutland wrote:
> On Tue, Aug 27, 2024 at 07:33:55PM +0800, Liao, Chang wrote:
>> Hi, Mark
>>
>> Would you like to discuss this patch further, or do you still believe emulating
>> STP to push FP/LR into the stack in kernel is not a good idea?
> 
> I'm happy with the NOP emulation in principle, so please send a new
> version with *just* the NOP emulation, and I can review that.

Mark, please check v2 at the link below:

  https://lore.kernel.org/all/20240909071114.1150053-1-liaochang1@huawei.com/

Thanks.

> 
> Regarding STP emulation, I stand by my earlier comments, and in addition
> to those comments, AFAICT it's currently unsafe to use any uaccess
> routine in the uprobe BRK handler anyway, so that's moot. The uprobe BRK
> handler runs with preemption disabled and IRQs (and all other maskable
> exceptions) masked, and faults cannot be handled. IIUC
> CONFIG_DEBUG_ATOMIC_SLEEP should scream about that.
> 
> Longer-term I'm happy to look into making it possible to do uaccesses in
> the uprobe handlers, but that requires a much wider rework of the way we
> handle BRK instructions, single-step, and hardware breakpoints and
> watchpoints.
> 
> Mark.
> 
>> Thanks.
>>
>>
>> 在 2024/8/21 15:55, Liao, Chang 写道:
>>> Hi, Mark
>>>
>>> My bad for taking so long to rely, I generally agree with your suggestions to
>>> STP emulation.
>>>
>>> 在 2024/8/15 17:58, Mark Rutland 写道:
>>>> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
>>>>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>>>>> counterintuitive result that nop and push variants are much slower than
>>>>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>>>>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>>>>> This force the kernel returns to userspace and execute them out-of-line,
>>>>> then trapping back to kernel for running uprobe callback functions. This
>>>>> leads to a significant performance overhead compared to 'ret' variant,
>>>>> which is already emulated.
>>>>
>>>> I appreciate this might be surprising, but does it actually matter
>>>> outside of a microbenchmark?
>>>
>>> I just do a simple comparsion the performance impact of single-stepped and
>>> emulated STP on my local machine. Three user cases were measured: Redis GET and
>>> SET throughput (Request Per Second, RPS), and the time taken to execute a grep
>>> command on the "arch_uprobe_copy_xol" string within the kernel source.
>>>
>>> Redis GET (higher is better)
>>> ----------------------------
>>> No uprobe: 49149.71 RPS
>>> Single-stepped STP: 46750.82 RPS
>>> Emulated STP: 48981.19 RPS
>>>
>>> Redis SET (larger is better)
>>> ----------------------------
>>> No uprobe: 49761.14 RPS
>>> Single-stepped STP: 45255.01 RPS
>>> Emulated stp: 48619.21 RPS
>>>
>>> Grep (lower is better)
>>> ----------------------
>>> No uprobe: 2.165s
>>> Single-stepped STP: 15.314s
>>> Emualted STP: 2.216s
>>>
>>> The result reveals single-stepped STP instruction that used to push fp/lr into
>>> stack significantly impacts the Redis and grep performance, leading to a notable
>>> notable decrease RPS and increase time individually. So emulating STP on the
>>> function entry might be a more viable option for uprobe.
>>>
>>>>
>>>>> Typicall uprobe is installed on 'nop' for USDT and on function entry
>>>>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
>>>>> and fp into stack regardless kernel or userspace binary. 
>>>>
>>>> Function entry doesn't always start with a STP; these days it's often a
>>>> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
>>>> the compiler), it could be any arbitrary instruction. This might happen
>>>> to be the common case today, but there are certain;y codebases where it
>>>> is not.
>>>
>>> Sure, if kernel, CPU and compiler support BTI and PAC, the entry instruction
>>> is definitly not STP. But for CPU and kernel lack of these supports, STP as
>>> the entry instruction is still the common case. And I profiled the entry
>>> instruction for all leaf and non-leaf function, the ratio of STP is 64.5%
>>> for redis, 76.1% for the BPF selftest bench. So I am thinking it is still
>>> useful to emulate the STP on the function entry. Perhaps, for CPU and kernel
>>> with BTI and PAC enabled, uprobe chooses the slower single-stepping to execute
>>> STP for pushing stack.
>>>
>>>>
>>>> STP (or any instruction that accesses memory) is fairly painful to
>>>> emulate because you need to ensure that the correct atomicity and
>>>> ordering properties are provided (e.g. an aligned STP should be
>>>> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
>>>> chance), and that the correct VMSA behaviour is provided (e.g. when
>>>> interacting with MTE, POE, etc, while the uaccess primitives don't try
>>>> to be 100% equivalent to instructions in userspace).
>>> Agreed, but I don't think it has to emulate strictly the single-copy-atomic
>>> feature of STP that is used to push fp/lr into stack. In most cases, only one
>>> CPU will push registers to the same position on stack. And I barely understand
>>> why other CPUs would depends on the ordering of pushing data into stack. So it
>>> means the atomicity and ordering is not so important for this scenario. Regarding
>>> MTE and POE, a similar stragety to BTI and PAC can be applied: for CPUs and kernel
>>> with MTE and POE enabled, uprobe chooses the slower single-stepping to execute
>>> STP for pushing stack.
>>>
>>>>
>>>> For those reasons, in general I don't think we should be emulating any
>>>> instruction which accesses memory, and we should not try to emulate the
>>>> STP, but I think it's entirely reasonable to emulate NOP.
>>>>
>>>>> In order to
>>>>> improve the performance of handling uprobe for common usecases. This
>>>>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
>>>>> and 'push'. The benchmark results below indicates the performance gain
>>>>> of emulation is obvious.
>>>>>
>>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>>>>
>>>>> xol (1 cpus)
>>>>> ------------
>>>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>>
>>>>> emulation (1 cpus)
>>>>> -------------------
>>>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
>>>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
>>>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
>>>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
>>>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
>>>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
>>>>>
>>>>> As shown above, the performance gap between 'nop/push' and 'ret'
>>>>> variants has been significantly reduced. Due to the emulation of 'push'
>>>>> instruction needs to access userspace memory, it spent more cycles than
>>>>> the other.
>>>>>
>>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>>>>>
>>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>>>> ---
>>>>>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
>>>>>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
>>>>>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
>>>>>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
>>>>>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
>>>>>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
>>>>>  6 files changed, 70 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>>>> index 8c0a36f72d6f..a246e6e550ba 100644
>>>>> --- a/arch/arm64/include/asm/insn.h
>>>>> +++ b/arch/arm64/include/asm/insn.h
>>>>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>>>>>  	       aarch64_insn_is_prfm_lit(insn);
>>>>>  }
>>>>>  
>>>>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
>>>>> +{
>>>>> +	/* nop */
>>>>> +	return aarch64_insn_is_hint(insn) &&
>>>>> +	       ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
>>>>> +}
>>>>
>>>> This looks fine, but the comment can go.
>>>
>>> Removed.
>>>
>>>>
>>>>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
>>>>> +{
>>>>> +	/*
>>>>> +	 * The 1st instruction on function entry often follows the
>>>>> +	 * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr
>>>>> +	 * into stack.
>>>>> +	 */
>>>>> +	return aarch64_insn_is_stp_pre(insn) &&
>>>>> +	       (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
>>>>> +	       (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
>>>>> +	       (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x29 */
>>>>> +	       (((insn >>  0) & 0x1F) == 29);	/* Rt  is x30 */
>>>>> +}
>>>>
>>>> We have accessors for these fields. Please use them.
>>>
>>> Do you mean aarch64_insn_decode_register()?
>>>
>>>>
>>>> Regardless, as above I do not think we should have a helper this
>>>> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
>>>
>>> If we left necessary of emulation of STP aside, where would the best file to
>>> place these hard-coded decoder helper?
>>>
>>>>
>>>>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>>>>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>>>>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
>>>>> index 968d5fffe233..df7ca16fc763 100644
>>>>> --- a/arch/arm64/kernel/probes/decode-insn.c
>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.c
>>>>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>>>>>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
>>>>>   */
>>>>>  enum probe_insn __kprobes
>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
>>>>> +		      bool kernel)
>>>>>  {
>>>>> +	/*
>>>>> +	 * While 'nop' and 'stp x29, x30, [sp, #imm]! instructions can
>>>>> +	 * execute in the out-of-line slot, simulating them in breakpoint
>>>>> +	 * handling offers better performance.
>>>>> +	 */
>>>>> +	if (aarch64_insn_is_nop(insn)) {
>>>>> +		api->handler = simulate_nop;
>>>>> +		return INSN_GOOD_NO_SLOT;
>>>>> +	} else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
>>>>> +		api->handler = simulate_stp_fp_lr_sp_64b;
>>>>> +		return INSN_GOOD_NO_SLOT;
>>>>> +	}
>>>>
>>>> With the STP emulation gone, you won't need the kernel parameter here.>
>>>>> +
>>>>>  	/*
>>>>>  	 * Instructions reading or modifying the PC won't work from the XOL
>>>>>  	 * slot.
>>>>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
>>>>>  		else
>>>>>  			scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
>>>>>  	}
>>>>> -	decoded = arm_probe_decode_insn(insn, &asi->api);
>>>>> +	decoded = arm_probe_decode_insn(insn, &asi->api, true);
>>>>>  
>>>>>  	if (decoded != INSN_REJECTED && scan_end)
>>>>>  		if (is_probed_address_atomic(addr - 1, scan_end))
>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
>>>>> index 8b758c5a2062..ec4607189933 100644
>>>>> --- a/arch/arm64/kernel/probes/decode-insn.h
>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.h
>>>>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
>>>>>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
>>>>>  #endif
>>>>>  enum probe_insn __kprobes
>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
>>>>> +		      bool kernel);
>>>>>  
>>>>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
>>>>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
>>>>> index 22d0b3252476..0b1623fa7003 100644
>>>>> --- a/arch/arm64/kernel/probes/simulate-insn.c
>>>>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
>>>>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>>>>>  
>>>>>  	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>>  }
>>>>> +
>>>>> +void __kprobes
>>>>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>>> +{
>>>>> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>> +}
>>>>
>>>> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
>>>> and that's an extant bug in arch_uprobe_post_xol(). This can be:
>>>
>>> For emulated instruction, uprobe won't enable single-step mode of CPU,
>>> please check the handle_swbp() in kernel/events/uprobes.c:
>>>
>>>   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>>>           goto out;
>>>
>>>   if (!pre_ssout(uprobe, regs, bp_vaddr))
>>>           return;
>>>
>>> For emualted instruction, It will skip entire single-stepping and associated
>>> exception, which typically begins with pre_ssout() and ends with
>>> arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
>>> NOP is generally not a bad idea.
>>>
>>>>
>>>> | void __kprobes
>>>> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>> | {
>>>> | 	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
>>>> | }
>>>>
>>>>> +
>>>>> +void __kprobes
>>>>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
>>>>> +{
>>>>> +	long imm7;
>>>>> +	u64 buf[2];
>>>>> +	long new_sp;
>>>>> +
>>>>> +	imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
>>>>> +	new_sp = regs->sp + (imm7 << 3);
>>>>
>>>> We have accessors for these fields, please use them.
>>>
>>> Do you mean aarch64_insn_decode_immediate()?
>>>
>>>>
>>>>> +
>>>>> +	buf[0] = regs->regs[29];
>>>>> +	buf[1] = regs->regs[30];
>>>>> +
>>>>> +	if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
>>>>> +		force_sig(SIGSEGV);
>>>>> +		return;
>>>>> +	}
>>>>
>>>> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
>>>> same way as an STP in userspace, and this will not have the same
>>>> atomicity properties as an STP>
>>>>> +
>>>>> +	regs->sp = new_sp;
>>>>> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>
>>>> Likewise, this sould need ot use arm64_skip_faulting_instruction(),
>>>> though as above I think we should drop STP emulation entirely.
>>>
>>> I explain the reason why using instruction_pointer_set() under your comments
>>> for simulate_nop().
>>>
>>> Thanks.
>>>
>>>>
>>>> Mark.
>>>>
>>>
>>
>> -- 
>> BR
>> Liao, Chang
>
Mark Rutland Sept. 9, 2024, 10:18 a.m. UTC | #13
On Fri, Sep 06, 2024 at 10:46:00AM -0700, Andrii Nakryiko wrote:
> On Fri, Sep 6, 2024 at 2:39 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Tue, Aug 27, 2024 at 07:33:55PM +0800, Liao, Chang wrote:
> > > Hi, Mark
> > >
> > > Would you like to discuss this patch further, or do you still believe emulating
> > > STP to push FP/LR into the stack in kernel is not a good idea?
> >
> > I'm happy with the NOP emulation in principle, so please send a new
> > version with *just* the NOP emulation, and I can review that.
> 
> Let's definitely start with that, this is important for faster USDT tracing.
> 
> > Regarding STP emulation, I stand by my earlier comments, and in addition
> > to those comments, AFAICT it's currently unsafe to use any uaccess
> > routine in the uprobe BRK handler anyway, so that's moot. The uprobe BRK
> > handler runs with preemption disabled and IRQs (and all other maskable
> > exceptions) masked, and faults cannot be handled. IIUC
> > CONFIG_DEBUG_ATOMIC_SLEEP should scream about that.
> 
> This part I don't really get, and this might be some very
> ARM64-specific issue, so I'm sorry ahead of time.
> 
> But in general, at the lowest level uprobes work in two logical steps.
> First, there is a breakpoint that user space hits, kernel gets
> control, and if VMA which hit breakpoint might contain uprobe, kernel
> sets TIF_UPROBE thread flag and exits. This is the only part that's in
> hard IRQ context. See uprobe_notify_resume() and comments around it.
> 
> Then uprobe infrastructure gets called in user context on the way back
> to user space. This is where we confirm that this is uprobe/uretprobe
> hit, and, if supported, perform instruction emulation.
> 
> So I'm wondering if your above comment refers to instruction emulation
> within the first part of uprobe handling? If yes, then, no, that's not
> where emulation will happen.

You're right -- I had misunderstood that the emulation happened during
handling of the breakpoint, rather than on the return-to-userspace path.
Looking at the arm64 entry code, the way uprobe_notify_resume() is
plumbed in is safe as it happens after we've re-enabled preemption and
unmasked other exceptions.
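
Roughly, the relevant part of the exit path is the sketch below (a
paraphrased sketch of do_notify_resume() on arm64; exact flags and file
layout vary by kernel version) -- DAIF is restored before the TIF_UPROBE
work runs:

  void do_notify_resume(struct pt_regs *regs, unsigned long thread_flags)
  {
  	do {
  		if (thread_flags & _TIF_NEED_RESCHED) {
  			local_daif_restore(DAIF_PROCCTX_NOIRQ);
  			schedule();
  		} else {
  			/*
  			 * IRQs and the other maskable exceptions are unmasked
  			 * here, so uprobe_notify_resume() can sleep and take
  			 * faults.
  			 */
  			local_daif_restore(DAIF_PROCCTX);
  			if (thread_flags & _TIF_UPROBE)
  				uprobe_notify_resume(regs);
  			/* ... signals, FP state restore, etc. ... */
  		}
  		local_daif_mask();
  		thread_flags = read_thread_flags();
  	} while (thread_flags & _TIF_WORK_MASK);
  }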

Sorry about that.

For the moment I'd still prefer to get the NOP case out of the way
first, so I'll review the NOP-only patch shortly.

Mark.
Andrii Nakryiko Sept. 9, 2024, 4:57 p.m. UTC | #14
On Mon, Sep 9, 2024 at 3:18 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Fri, Sep 06, 2024 at 10:46:00AM -0700, Andrii Nakryiko wrote:
> > On Fri, Sep 6, 2024 at 2:39 AM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > On Tue, Aug 27, 2024 at 07:33:55PM +0800, Liao, Chang wrote:
> > > > Hi, Mark
> > > >
> > > > Would you like to discuss this patch further, or do you still believe emulating
> > > > STP to push FP/LR into the stack in kernel is not a good idea?
> > >
> > > I'm happy with the NOP emulation in principle, so please send a new
> > > version with *just* the NOP emulation, and I can review that.
> >
> > Let's definitely start with that, this is important for faster USDT tracing.
> >
> > > Regarding STP emulation, I stand by my earlier comments, and in addition
> > > to those comments, AFAICT it's currently unsafe to use any uaccess
> > > routine in the uprobe BRK handler anyway, so that's moot. The uprobe BRK
> > > handler runs with preemption disabled and IRQs (and all other maskable
> > > exceptions) masked, and faults cannot be handled. IIUC
> > > CONFIG_DEBUG_ATOMIC_SLEEP should scream about that.
> >
> > This part I don't really get, and this might be some very
> > ARM64-specific issue, so I'm sorry ahead of time.
> >
> > But in general, at the lowest level uprobes work in two logical steps.
> > First, there is a breakpoint that user space hits, kernel gets
> > control, and if VMA which hit breakpoint might contain uprobe, kernel
> > sets TIF_UPROBE thread flag and exits. This is the only part that's in
> > hard IRQ context. See uprobe_notify_resume() and comments around it.
> >
> > Then uprobe infrastructure gets called in user context on the way back
> > to user space. This is where we confirm that this is uprobe/uretprobe
> > hit, and, if supported, perform instruction emulation.
> >
> > So I'm wondering if your above comment refers to instruction emulation
> > within the first part of uprobe handling? If yes, then, no, that's not
> > where emulation will happen.
>
> You're right -- I had misunderstood that the emulation happened during
> handling of the breakpoint, rather than on the return-to-userspace path.
> Looking at the arm64 entry code, the way uprobe_notify_resume() is
> plumbed in is safe as it happens after we've re-enabled preemption and
> unmasked other exceptions.
>
> Sorry about that.
>
> For the moment I'd still prefer to get the NOP case out of the way
> first, so I'll review the NOP-only patch shortly.
>

Yep, one step at a time makes sense, thanks! Regardless, I'm glad we
clarified the confusion.

> Mark.
Liao, Chang Sept. 10, 2024, 6:36 a.m. UTC | #15
Hi, Mark

On 2024/9/6 17:39, Mark Rutland wrote:
> On Tue, Aug 27, 2024 at 07:33:55PM +0800, Liao, Chang wrote:
>> Hi, Mark
>>
>> Would you like to discuss this patch further, or do you still believe emulating
>> STP to push FP/LR into the stack in kernel is not a good idea?
> 
> I'm happy with the NOP emulation in principle, so please send a new
> version with *just* the NOP emulation, and I can review that.
> 
> Regarding STP emulation, I stand by my earlier comments, and in addition
> to those comments, AFAICT it's currently unsafe to use any uaccess
> routine in the uprobe BRK handler anyway, so that's moot. The uprobe BRK
> handler runs with preemption disabled and IRQs (and all other maskable
> exceptions) masked, and faults cannot be handled. IIUC
> CONFIG_DEBUG_ATOMIC_SLEEP should scream about that.
> 
> Longer-term I'm happy to look into making it possible to do uaccesses in
> the uprobe handlers, but that requires a much wider rework of the way we
> handle BRK instructions, single-step, and hardware breakpoints and
> watchpoints.

I've just sent a new patch for STP simulation:

  https://lore.kernel.org/all/20240910060407.1427716-1-liaochang1@huawei.com/

This patch tries to maintain the correct properties of STP, even though it
comes at an obvious cost in efficiency. Simulating STP directly within the
kernel becomes painful due to POE and MTE interactions, so for now the patch
focuses on scenarios where the kernel and hardware don't support MTE.

> 
> Mark.
> 
>> Thanks.
>>
>>
>> 在 2024/8/21 15:55, Liao, Chang 写道:
>>> Hi, Mark
>>>
>>> My bad for taking so long to rely, I generally agree with your suggestions to
>>> STP emulation.
>>>
>>> 在 2024/8/15 17:58, Mark Rutland 写道:
>>>> On Wed, Aug 14, 2024 at 08:03:56AM +0000, Liao Chang wrote:
>>>>> As Andrii pointed out, the uprobe/uretprobe selftest bench run into a
>>>>> counterintuitive result that nop and push variants are much slower than
>>>>> ret variant [0]. The root cause lies in the arch_probe_analyse_insn(),
>>>>> which excludes 'nop' and 'stp' from the emulatable instructions list.
>>>>> This force the kernel returns to userspace and execute them out-of-line,
>>>>> then trapping back to kernel for running uprobe callback functions. This
>>>>> leads to a significant performance overhead compared to 'ret' variant,
>>>>> which is already emulated.
>>>>
>>>> I appreciate this might be surprising, but does it actually matter
>>>> outside of a microbenchmark?
>>>
>>> I did a simple comparison of the performance impact of single-stepped and
>>> emulated STP on my local machine. Three use cases were measured: Redis GET and
>>> SET throughput (requests per second, RPS), and the time taken to grep for the
>>> "arch_uprobe_copy_xol" string within the kernel source.
>>>
>>> Redis GET (higher is better)
>>> ----------------------------
>>> No uprobe: 49149.71 RPS
>>> Single-stepped STP: 46750.82 RPS
>>> Emulated STP: 48981.19 RPS
>>>
>>> Redis SET (higher is better)
>>> ----------------------------
>>> No uprobe: 49761.14 RPS
>>> Single-stepped STP: 45255.01 RPS
>>> Emulated stp: 48619.21 RPS
>>>
>>> Grep (lower is better)
>>> ----------------------
>>> No uprobe: 2.165s
>>> Single-stepped STP: 15.314s
>>> Emulated STP: 2.216s
>>>
>>> The results reveal that single-stepping the STP instruction used to push fp/lr
>>> onto the stack significantly impacts Redis and grep performance, leading to a
>>> notable drop in RPS and a notable increase in runtime respectively. So emulating
>>> the STP at function entry might be a more viable option for uprobes.
>>>
>>>>
>>>>> Typicall uprobe is installed on 'nop' for USDT and on function entry
>>>>> which starts with the instrucion 'stp x29, x30, [sp, #imm]!' to push lr
>>>>> and fp into stack regardless kernel or userspace binary. 
>>>>
>>>> Function entry doesn't always start with a STP; these days it's often a
>>>> BTI or PACIASP, and for non-leaf functions (or with shrink-wrapping in
>>>> the compiler), it could be any arbitrary instruction. This might happen
>>>> to be the common case today, but there are certainly codebases where it
>>>> is not.
>>>
>>> Sure, if the kernel, CPU and compiler support BTI and PAC, the entry instruction
>>> is definitely not STP. But for CPUs and kernels that lack this support, STP as
>>> the entry instruction is still the common case. I profiled the entry
>>> instruction of all leaf and non-leaf functions: the ratio of STP is 64.5%
>>> for Redis and 76.1% for the BPF selftest bench. So I still think it is
>>> useful to emulate the STP at function entry. Perhaps, for CPUs and kernels
>>> with BTI and PAC enabled, uprobe could fall back to the slower single-stepping
>>> to execute the STP that pushes the stack frame.
>>>
>>>>
>>>> STP (or any instruction that accesses memory) is fairly painful to
>>>> emulate because you need to ensure that the correct atomicity and
>>>> ordering properties are provided (e.g. an aligned STP should be
>>>> single-copy-atomic, but copy_to_user() doesn't guarantee that except by
>>>> chance), and that the correct VMSA behaviour is provided (e.g. when
>>>> interacting with MTE, POE, etc, while the uaccess primitives don't try
>>>> to be 100% equivalent to instructions in userspace).
>>>
>>> Agreed, but I don't think it has to strictly emulate the single-copy-atomic
>>> property of the STP used to push fp/lr onto the stack. In most cases, only one
>>> CPU will push registers to that position on the stack, and I don't see why
>>> other CPUs would depend on the ordering of the stores to the stack, so the
>>> atomicity and ordering are not that important for this scenario. Regarding
>>> MTE and POE, a strategy similar to BTI and PAC can be applied: for CPUs and
>>> kernels with MTE and POE enabled, uprobe falls back to the slower
>>> single-stepping to execute the STP that pushes the stack frame.
>>>
>>>>
>>>> For those reasons, in general I don't think we should be emulating any
>>>> instruction which accesses memory, and we should not try to emulate the
>>>> STP, but I think it's entirely reasonable to emulate NOP.
>>>>
>>>>> In order to
>>>>> improve the performance of handling uprobe for common usecases. This
>>>>> patch supports the emulation of Arm64 equvialents instructions of 'nop'
>>>>> and 'push'. The benchmark results below indicates the performance gain
>>>>> of emulation is obvious.
>>>>>
>>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>>>>
>>>>> xol (1 cpus)
>>>>> ------------
>>>>> uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
>>>>> uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
>>>>> uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
>>>>> uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
>>>>> uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
>>>>> uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
>>>>>
>>>>> emulation (1 cpus)
>>>>> -------------------
>>>>> uprobe-nop:  1.862 ± 0.002M/s  (1.862M/prod)
>>>>> uprobe-push: 1.743 ± 0.006M/s  (1.743M/prod)
>>>>> uprobe-ret:  1.840 ± 0.001M/s  (1.840M/prod)
>>>>> uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/prod)
>>>>> uretprobe-push: 0.936 ± 0.004M/s  (0.936M/prod)
>>>>> uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/prod)
>>>>>
>>>>> As shown above, the performance gap between 'nop/push' and 'ret'
>>>>> variants has been significantly reduced. Due to the emulation of 'push'
>>>>> instruction needs to access userspace memory, it spent more cycles than
>>>>> the other.
>>>>>
>>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
>>>>>
>>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
>>>>> ---
>>>>>  arch/arm64/include/asm/insn.h            | 21 ++++++++++++++++++
>>>>>  arch/arm64/kernel/probes/decode-insn.c   | 18 +++++++++++++--
>>>>>  arch/arm64/kernel/probes/decode-insn.h   |  3 ++-
>>>>>  arch/arm64/kernel/probes/simulate-insn.c | 28 ++++++++++++++++++++++++
>>>>>  arch/arm64/kernel/probes/simulate-insn.h |  2 ++
>>>>>  arch/arm64/kernel/probes/uprobes.c       |  2 +-
>>>>>  6 files changed, 70 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
>>>>> index 8c0a36f72d6f..a246e6e550ba 100644
>>>>> --- a/arch/arm64/include/asm/insn.h
>>>>> +++ b/arch/arm64/include/asm/insn.h
>>>>> @@ -549,6 +549,27 @@ static __always_inline bool aarch64_insn_uses_literal(u32 insn)
>>>>>  	       aarch64_insn_is_prfm_lit(insn);
>>>>>  }
>>>>>  
>>>>> +static __always_inline bool aarch64_insn_is_nop(u32 insn)
>>>>> +{
>>>>> +	/* nop */
>>>>> +	return aarch64_insn_is_hint(insn) &&
>>>>> +	       ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
>>>>> +}
>>>>
>>>> This looks fine, but the comment can go.
>>>
>>> Removed.
>>>
>>>>
>>>>> +static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
>>>>> +{
>>>>> +	/*
>>>>> +	 * The 1st instruction on function entry often follows the
>>>>> +	 * pattern 'stp x29, x30, [sp, #imm]!', which pushes fp and lr
>>>>> +	 * onto the stack.
>>>>> +	 */
>>>>> +	return aarch64_insn_is_stp_pre(insn) &&
>>>>> +	       (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
>>>>> +	       (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
>>>>> +	       (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x30 */
>>>>> +	       (((insn >>  0) & 0x1F) == 29);	/* Rt  is x29 */
>>>>> +}
>>>>
>>>> We have accessors for these fields. Please use them.
>>>
>>> Do you mean aarch64_insn_decode_register()?
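>>>
>>> Something like the following, perhaps (untested sketch; I'm assuming the
>>> AARCH64_INSN_REGTYPE_RT/RT2/RN accessors map onto the STP fields and that
>>> AARCH64_INSN_REG_FP/LR/SP are the right register constants):
>>>
>>>   return aarch64_insn_is_stp_pre(insn) &&
>>>          ((insn >> 30) == 2) &&	/* opc == 0b10: 64-bit registers */
>>>          aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RN, insn) == AARCH64_INSN_REG_SP &&
>>>          aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT2, insn) == AARCH64_INSN_REG_LR &&
>>>          aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, insn) == AARCH64_INSN_REG_FP;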
>>>
>>>>
>>>> Regardless, as above I do not think we should have a helper this
>>>> specific (with Rn, Rt, and Rt2 values hard-coded) inside <asm/insn.h>.
>>>
>>> Leaving aside whether emulating STP is necessary, which file would be the best
>>> place for such a hard-coded decode helper?
>>>
>>>>
>>>>>  enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
>>>>>  u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
>>>>>  u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
>>>>> index 968d5fffe233..df7ca16fc763 100644
>>>>> --- a/arch/arm64/kernel/probes/decode-insn.c
>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.c
>>>>> @@ -73,8 +73,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
>>>>>   *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
>>>>>   */
>>>>>  enum probe_insn __kprobes
>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
>>>>> +		      bool kernel)
>>>>>  {
>>>>> +	/*
>>>>> +	 * While 'nop' and 'stp x29, x30, [sp, #imm]!' instructions can
>>>>> +	 * execute in the out-of-line slot, simulating them in breakpoint
>>>>> +	 * handling offers better performance.
>>>>> +	 */
>>>>> +	if (aarch64_insn_is_nop(insn)) {
>>>>> +		api->handler = simulate_nop;
>>>>> +		return INSN_GOOD_NO_SLOT;
>>>>> +	} else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
>>>>> +		api->handler = simulate_stp_fp_lr_sp_64b;
>>>>> +		return INSN_GOOD_NO_SLOT;
>>>>> +	}
>>>>
>>>> With the STP emulation gone, you won't need the kernel parameter here.
>>>>
>>>>> +
>>>>>  	/*
>>>>>  	 * Instructions reading or modifying the PC won't work from the XOL
>>>>>  	 * slot.
>>>>> @@ -157,7 +171,7 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
>>>>>  		else
>>>>>  			scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
>>>>>  	}
>>>>> -	decoded = arm_probe_decode_insn(insn, &asi->api);
>>>>> +	decoded = arm_probe_decode_insn(insn, &asi->api, true);
>>>>>  
>>>>>  	if (decoded != INSN_REJECTED && scan_end)
>>>>>  		if (is_probed_address_atomic(addr - 1, scan_end))
>>>>> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
>>>>> index 8b758c5a2062..ec4607189933 100644
>>>>> --- a/arch/arm64/kernel/probes/decode-insn.h
>>>>> +++ b/arch/arm64/kernel/probes/decode-insn.h
>>>>> @@ -28,6 +28,7 @@ enum probe_insn __kprobes
>>>>>  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
>>>>>  #endif
>>>>>  enum probe_insn __kprobes
>>>>> -arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
>>>>> +arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
>>>>> +		      bool kernel);
>>>>>  
>>>>>  #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
>>>>> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
>>>>> index 22d0b3252476..0b1623fa7003 100644
>>>>> --- a/arch/arm64/kernel/probes/simulate-insn.c
>>>>> +++ b/arch/arm64/kernel/probes/simulate-insn.c
>>>>> @@ -200,3 +200,31 @@ simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
>>>>>  
>>>>>  	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>>  }
>>>>> +
>>>>> +void __kprobes
>>>>> +simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>>> +{
>>>>> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>> +}
>>>>
>>>> Hmm, this forgets to update the single-step state machine and PSTATE.BT,
>>>> and that's an extant bug in arch_uprobe_post_xol(). This can be:
>>>
>>> For emulated instructions, uprobe won't enable the CPU's single-step mode;
>>> please check handle_swbp() in kernel/events/uprobes.c:
>>>
>>>   if (arch_uprobe_skip_sstep(&uprobe->arch, regs))
>>>           goto out;
>>>
>>>   if (!pre_ssout(uprobe, regs, bp_vaddr))
>>>           return;
>>>
>>> For emulated instructions, it skips the entire single-stepping path and the
>>> associated exception, which normally begins with pre_ssout() and ends with
>>> arch_uprobe_post_xol(). Therefore, using instruction_pointer_set() to emulate
>>> NOP is generally not a bad idea.
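>>>
>>> For reference, the arm64 arch_uprobe_skip_sstep() is roughly the following
>>> (paraphrased from memory, so the details may differ slightly from the tree);
>>> this is where the simulate handler runs instead of setting up single-step:
>>>
>>>   bool arch_uprobe_skip_sstep(struct arch_uprobe *auprobe, struct pt_regs *regs)
>>>   {
>>>           probe_opcode_t insn;
>>>           unsigned long addr;
>>>
>>>           /* only instructions decoded as INSN_GOOD_NO_SLOT get here */
>>>           if (!auprobe->simulate)
>>>                   return false;
>>>
>>>           insn = *(probe_opcode_t *)(&auprobe->insn[0]);
>>>           addr = instruction_pointer(regs);
>>>
>>>           if (auprobe->api.handler)
>>>                   auprobe->api.handler(insn, addr, regs);
>>>
>>>           return true;
>>>   }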
>>>
>>>>
>>>> | void __kprobes
>>>> | simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
>>>> | {
>>>> | 	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
>>>> | }
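>>>>
>>>> For context, that helper is roughly the following (paraphrased from memory;
>>>> see arch/arm64/kernel/traps.c for the authoritative version), i.e. besides
>>>> advancing the PC it also fast-forwards the single-step state machine and
>>>> clears PSTATE.BTYPE:
>>>>
>>>> | void arm64_skip_faulting_instruction(struct pt_regs *regs, unsigned long size)
>>>> | {
>>>> | 	instruction_pointer_set(regs, instruction_pointer(regs) + size);
>>>> |
>>>> | 	/* let a pending single-step exception fire after returning */
>>>> | 	if (user_mode(regs))
>>>> | 		user_fastforward_single_step(current);
>>>> |
>>>> | 	/* compat ITSTATE handling omitted in this paraphrase */
>>>> | 	if (!compat_user_mode(regs))
>>>> | 		regs->pstate &= ~PSR_BTYPE_MASK;
>>>> | }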
>>>>
>>>>> +
>>>>> +void __kprobes
>>>>> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
>>>>> +{
>>>>> +	long imm7;
>>>>> +	u64 buf[2];
>>>>> +	long new_sp;
>>>>> +
>>>>> +	imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
>>>>> +	new_sp = regs->sp + (imm7 << 3);
>>>>
>>>> We have accessors for these fields, please use them.
>>>
>>> Do you mean aarch64_insn_decode_immediate()?
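>>>
>>> e.g. (untested sketch, assuming AARCH64_INSN_IMM_7 selects the imm7 field of
>>> the pre-index STP encoding):
>>>
>>>   imm7 = sign_extend64(aarch64_insn_decode_immediate(AARCH64_INSN_IMM_7, opcode), 6);
>>>   new_sp = regs->sp + (imm7 << 3);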
>>>
>>>>
>>>>> +
>>>>> +	buf[0] = regs->regs[29];
>>>>> +	buf[1] = regs->regs[30];
>>>>> +
>>>>> +	if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
>>>>> +		force_sig(SIGSEGV);
>>>>> +		return;
>>>>> +	}
>>>>
>>>> As above, this won't interact with VMSA features (e.g. MTE, POE) in the
>>>> same way as an STP in userspace, and this will not have the same
>>>> atomicity properties as an STP.
>>>>
>>>>> +
>>>>> +	regs->sp = new_sp;
>>>>> +	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
>>>>
>>>> Likewise, this would need to use arm64_skip_faulting_instruction(),
>>>> though as above I think we should drop STP emulation entirely.
>>>
>>> I explained the reason for using instruction_pointer_set() under your comments
>>> on simulate_nop().
>>>
>>> Thanks.
>>>
>>>>
>>>> Mark.
>>>>
>>>
>>
>> -- 
>> BR
>> Liao, Chang
>
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index 8c0a36f72d6f..a246e6e550ba 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -549,6 +549,27 @@  static __always_inline bool aarch64_insn_uses_literal(u32 insn)
 	       aarch64_insn_is_prfm_lit(insn);
 }
 
+static __always_inline bool aarch64_insn_is_nop(u32 insn)
+{
+	/* nop */
+	return aarch64_insn_is_hint(insn) &&
+	       ((insn & 0xFE0) == AARCH64_INSN_HINT_NOP);
+}
+
+static __always_inline bool aarch64_insn_is_stp_fp_lr_sp_64b(u32 insn)
+{
+	/*
+	 * The 1st instruction on function entry often follows the
+	 * pattern 'stp x29, x30, [sp, #imm]!', which pushes fp and lr
+	 * onto the stack.
+	 */
+	return aarch64_insn_is_stp_pre(insn) &&
+	       (((insn >> 30) & 0x03) ==  2) && /* opc == 10 */
+	       (((insn >>  5) & 0x1F) == 31) && /* Rn  is sp */
+	       (((insn >> 10) & 0x1F) == 30) && /* Rt2 is x30 */
+	       (((insn >>  0) & 0x1F) == 29);	/* Rt  is x29 */
+}
+
 enum aarch64_insn_encoding_class aarch64_get_insn_class(u32 insn);
 u64 aarch64_insn_decode_immediate(enum aarch64_insn_imm_type type, u32 insn);
 u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
index 968d5fffe233..df7ca16fc763 100644
--- a/arch/arm64/kernel/probes/decode-insn.c
+++ b/arch/arm64/kernel/probes/decode-insn.c
@@ -73,8 +73,22 @@  static bool __kprobes aarch64_insn_is_steppable(u32 insn)
  *   INSN_GOOD_NO_SLOT If instruction is supported but doesn't use its slot.
  */
 enum probe_insn __kprobes
-arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api)
+arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *api,
+		      bool kernel)
 {
+	/*
+	 * While 'nop' and 'stp x29, x30, [sp, #imm]!' instructions can
+	 * execute in the out-of-line slot, simulating them in breakpoint
+	 * handling offers better performance.
+	 */
+	if (aarch64_insn_is_nop(insn)) {
+		api->handler = simulate_nop;
+		return INSN_GOOD_NO_SLOT;
+	} else if (!kernel && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
+		api->handler = simulate_stp_fp_lr_sp_64b;
+		return INSN_GOOD_NO_SLOT;
+	}
+
 	/*
 	 * Instructions reading or modifying the PC won't work from the XOL
 	 * slot.
@@ -157,7 +171,7 @@  arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi)
 		else
 			scan_end = addr - MAX_ATOMIC_CONTEXT_SIZE;
 	}
-	decoded = arm_probe_decode_insn(insn, &asi->api);
+	decoded = arm_probe_decode_insn(insn, &asi->api, true);
 
 	if (decoded != INSN_REJECTED && scan_end)
 		if (is_probed_address_atomic(addr - 1, scan_end))
diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
index 8b758c5a2062..ec4607189933 100644
--- a/arch/arm64/kernel/probes/decode-insn.h
+++ b/arch/arm64/kernel/probes/decode-insn.h
@@ -28,6 +28,7 @@  enum probe_insn __kprobes
 arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
 #endif
 enum probe_insn __kprobes
-arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
+arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi,
+		      bool kernel);
 
 #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
index 22d0b3252476..0b1623fa7003 100644
--- a/arch/arm64/kernel/probes/simulate-insn.c
+++ b/arch/arm64/kernel/probes/simulate-insn.c
@@ -200,3 +200,31 @@  simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs)
 
 	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
 }
+
+void __kprobes
+simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
+{
+	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
+}
+
+void __kprobes
+simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
+{
+	long imm7;
+	u64 buf[2];
+	long new_sp;
+
+	imm7 = sign_extend64((opcode >> 15) & 0x7f, 6);
+	new_sp = regs->sp + (imm7 << 3);
+
+	buf[0] = regs->regs[29];
+	buf[1] = regs->regs[30];
+
+	if (copy_to_user((void __user *)new_sp, buf, sizeof(buf))) {
+		force_sig(SIGSEGV);
+		return;
+	}
+
+	regs->sp = new_sp;
+	instruction_pointer_set(regs, instruction_pointer(regs) + 4);
+}
diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
index e065dc92218e..733a47ffa2e5 100644
--- a/arch/arm64/kernel/probes/simulate-insn.h
+++ b/arch/arm64/kernel/probes/simulate-insn.h
@@ -16,5 +16,7 @@  void simulate_cbz_cbnz(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
+void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
+void simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs);
 
 #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c
index d49aef2657cd..ec5881db3b7a 100644
--- a/arch/arm64/kernel/probes/uprobes.c
+++ b/arch/arm64/kernel/probes/uprobes.c
@@ -44,7 +44,7 @@  int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
 
 	insn = *(probe_opcode_t *)(&auprobe->insn[0]);
 
-	switch (arm_probe_decode_insn(insn, &auprobe->api)) {
+	switch (arm_probe_decode_insn(insn, &auprobe->api, false)) {
 	case INSN_REJECTED:
 		return -EINVAL;