[v4,2/2] arm64: kprobe: Enable OPTPROBE for arm64

Message ID	20210818073336.59678-3-liuqi115@huawei.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=NLw9=NJ=lists.infradead.org=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org B0FFB60FBF From: Qi Liu <liuqi115@huawei.com> To: <catalin.marinas@arm.com>, <will@kernel.org>, <naveen.n.rao@linux.ibm.com>, <anil.s.keshavamurthy@intel.com>, <davem@davemloft.net>, <mhiramat@kernel.org>, <linux-arm-kernel@lists.infradead.org> CC: <song.bao.hua@hisilicon.com>, <prime.zeng@hisilicon.com>, <robin.murphy@arm.com>, <f.fangjian@huawei.com>, <liuqi115@huawei.com>, <linuxarm@huawei.com>, <linux-kernel@vger.kernel.org> Subject: [PATCH v4 2/2] arm64: kprobe: Enable OPTPROBE for arm64 Date: Wed, 18 Aug 2021 15:33:36 +0800 Message-ID: <20210818073336.59678-3-liuqi115@huawei.com> In-Reply-To: <20210818073336.59678-1-liuqi115@huawei.com> References: <20210818073336.59678-1-liuqi115@huawei.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	arm64: Enable OPTPROBE for arm64 \| expand [v4,0/2] arm64: Enable OPTPROBE for arm64 [v4,1/2] Make save_all_base_regs and restore_all_base_regs as common macro [v4,2/2] arm64: kprobe: Enable OPTPROBE for arm64

liuqi (BA) Aug. 18, 2021, 7:33 a.m. UTC

This patch introduce optprobe for ARM64. In optprobe, probed
instruction is replaced by a branch instruction to detour
buffer. Detour buffer contains trampoline code and a call to
optimized_callback(). optimized_callback() calls opt_pre_handler()
to execute kprobe handler.

Performance of optprobe on Hip08 platform is test using kprobe
example module[1] to analyze the latency of a kernel function,
and here is the result:

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/kprobes/kretprobe_example.c

kprobe before optimized:
[280709.846380] do_empty returned 0 and took 1530 ns to execute
[280709.852057] do_empty returned 0 and took 550 ns to execute
[280709.857631] do_empty returned 0 and took 440 ns to execute
[280709.863215] do_empty returned 0 and took 380 ns to execute
[280709.868787] do_empty returned 0 and took 360 ns to execute
[280709.874362] do_empty returned 0 and took 340 ns to execute
[280709.879936] do_empty returned 0 and took 320 ns to execute
[280709.885505] do_empty returned 0 and took 300 ns to execute
[280709.891075] do_empty returned 0 and took 280 ns to execute
[280709.896646] do_empty returned 0 and took 290 ns to execute
[280709.902220] do_empty returned 0 and took 290 ns to execute
[280709.907807] do_empty returned 0 and took 290 ns to execute

optprobe:
[ 2965.964572] do_empty returned 0 and took 90 ns to execute
[ 2965.969952] do_empty returned 0 and took 80 ns to execute
[ 2965.975332] do_empty returned 0 and took 70 ns to execute
[ 2965.980714] do_empty returned 0 and took 60 ns to execute
[ 2965.986128] do_empty returned 0 and took 80 ns to execute
[ 2965.991507] do_empty returned 0 and took 70 ns to execute
[ 2965.996884] do_empty returned 0 and took 70 ns to execute
[ 2966.002262] do_empty returned 0 and took 80 ns to execute
[ 2966.007642] do_empty returned 0 and took 70 ns to execute
[ 2966.013020] do_empty returned 0 and took 70 ns to execute
[ 2966.018400] do_empty returned 0 and took 70 ns to execute
[ 2966.023779] do_empty returned 0 and took 70 ns to execute
[ 2966.029158] do_empty returned 0 and took 70 ns to execute

Signed-off-by: Qi Liu <liuqi115@huawei.com>

Note:
To guarantee the offset between probe point and kprobe pre_handler
is smaller than 128MiB, users should set
CONFIG_RANDOMIZE_MODULE_REGION_FULL=N or set nokaslr in command line, or
optprobe will not work and fall back to normal kprobe.
---
 arch/arm64/Kconfig                            |   1 +
 arch/arm64/include/asm/kprobes.h              |  24 ++
 arch/arm64/kernel/probes/Makefile             |   2 +
 arch/arm64/kernel/probes/kprobes.c            |  19 +-
 arch/arm64/kernel/probes/opt_arm64.c          | 276 ++++++++++++++++++
 .../arm64/kernel/probes/optprobe_trampoline.S |  37 +++
 6 files changed, 356 insertions(+), 3 deletions(-)
 create mode 100644 arch/arm64/kernel/probes/opt_arm64.c
 create mode 100644 arch/arm64/kernel/probes/optprobe_trampoline.S

Masami Hiramatsu (Google) Aug. 18, 2021, 4:27 p.m. UTC | #1

On Wed, 18 Aug 2021 15:33:36 +0800
Qi Liu <liuqi115@huawei.com> wrote:

> This patch introduce optprobe for ARM64. In optprobe, probed
> instruction is replaced by a branch instruction to detour
> buffer. Detour buffer contains trampoline code and a call to
> optimized_callback(). optimized_callback() calls opt_pre_handler()
> to execute kprobe handler.
> 
> Performance of optprobe on Hip08 platform is test using kprobe
> example module[1] to analyze the latency of a kernel function,
> and here is the result:
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/kprobes/kretprobe_example.c
> 
> kprobe before optimized:
> [280709.846380] do_empty returned 0 and took 1530 ns to execute
> [280709.852057] do_empty returned 0 and took 550 ns to execute
> [280709.857631] do_empty returned 0 and took 440 ns to execute
> [280709.863215] do_empty returned 0 and took 380 ns to execute
> [280709.868787] do_empty returned 0 and took 360 ns to execute
> [280709.874362] do_empty returned 0 and took 340 ns to execute
> [280709.879936] do_empty returned 0 and took 320 ns to execute
> [280709.885505] do_empty returned 0 and took 300 ns to execute
> [280709.891075] do_empty returned 0 and took 280 ns to execute
> [280709.896646] do_empty returned 0 and took 290 ns to execute
> [280709.902220] do_empty returned 0 and took 290 ns to execute
> [280709.907807] do_empty returned 0 and took 290 ns to execute
> 
> optprobe:
> [ 2965.964572] do_empty returned 0 and took 90 ns to execute
> [ 2965.969952] do_empty returned 0 and took 80 ns to execute
> [ 2965.975332] do_empty returned 0 and took 70 ns to execute
> [ 2965.980714] do_empty returned 0 and took 60 ns to execute
> [ 2965.986128] do_empty returned 0 and took 80 ns to execute
> [ 2965.991507] do_empty returned 0 and took 70 ns to execute
> [ 2965.996884] do_empty returned 0 and took 70 ns to execute
> [ 2966.002262] do_empty returned 0 and took 80 ns to execute
> [ 2966.007642] do_empty returned 0 and took 70 ns to execute
> [ 2966.013020] do_empty returned 0 and took 70 ns to execute
> [ 2966.018400] do_empty returned 0 and took 70 ns to execute
> [ 2966.023779] do_empty returned 0 and took 70 ns to execute
> [ 2966.029158] do_empty returned 0 and took 70 ns to execute
> 
> Signed-off-by: Qi Liu <liuqi115@huawei.com>

Thanks for updating. This looks good to me. :D

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>

Thank you!

> 
> Note:
> To guarantee the offset between probe point and kprobe pre_handler
> is smaller than 128MiB, users should set
> CONFIG_RANDOMIZE_MODULE_REGION_FULL=N or set nokaslr in command line, or
> optprobe will not work and fall back to normal kprobe.
> ---
>  arch/arm64/Kconfig                            |   1 +
>  arch/arm64/include/asm/kprobes.h              |  24 ++
>  arch/arm64/kernel/probes/Makefile             |   2 +
>  arch/arm64/kernel/probes/kprobes.c            |  19 +-
>  arch/arm64/kernel/probes/opt_arm64.c          | 276 ++++++++++++++++++
>  .../arm64/kernel/probes/optprobe_trampoline.S |  37 +++
>  6 files changed, 356 insertions(+), 3 deletions(-)
>  create mode 100644 arch/arm64/kernel/probes/opt_arm64.c
>  create mode 100644 arch/arm64/kernel/probes/optprobe_trampoline.S
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index b5b13a932561..b05d1d275d87 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -200,6 +200,7 @@ config ARM64
>  	select HAVE_STACKPROTECTOR
>  	select HAVE_SYSCALL_TRACEPOINTS
>  	select HAVE_KPROBES
> +	select HAVE_OPTPROBES
>  	select HAVE_KRETPROBES
>  	select HAVE_GENERIC_VDSO
>  	select IOMMU_DMA if IOMMU_SUPPORT
> diff --git a/arch/arm64/include/asm/kprobes.h b/arch/arm64/include/asm/kprobes.h
> index 5d38ff4a4806..6b2fdd2ad7d8 100644
> --- a/arch/arm64/include/asm/kprobes.h
> +++ b/arch/arm64/include/asm/kprobes.h
> @@ -39,6 +39,30 @@ void arch_remove_kprobe(struct kprobe *);
>  int kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr);
>  int kprobe_exceptions_notify(struct notifier_block *self,
>  			     unsigned long val, void *data);
> +
> +#define RELATIVEJUMP_SIZE (4)
> +#define MAX_COPIED_INSN	DIV_ROUND_UP(RELATIVEJUMP_SIZE, sizeof(kprobe_opcode_t))
> +struct arch_optimized_insn {
> +	kprobe_opcode_t copied_insn[MAX_COPIED_INSN];
> +	/* detour code buffer */
> +	kprobe_opcode_t *insn;
> +};
> +
> +/* optinsn template addresses */
> +extern __visible kprobe_opcode_t optprobe_template_entry[];
> +extern __visible kprobe_opcode_t optprobe_template_val[];
> +extern __visible kprobe_opcode_t optprobe_template_call[];
> +extern __visible kprobe_opcode_t optprobe_template_end[];
> +extern __visible kprobe_opcode_t optprobe_template_restore_begin[];
> +extern __visible kprobe_opcode_t optprobe_template_restore_orig_insn[];
> +extern __visible kprobe_opcode_t optprobe_template_restore_end[];
> +extern __visible kprobe_opcode_t optprobe_template_max_length[];
> +
> +#define MAX_OPTIMIZED_LENGTH	4
> +#define MAX_OPTINSN_SIZE				\
> +	((unsigned long)optprobe_template_end -	\
> +	 (unsigned long)optprobe_template_entry)
> +
>  void kretprobe_trampoline(void);
>  void __kprobes *trampoline_probe_handler(struct pt_regs *regs);
>  
> diff --git a/arch/arm64/kernel/probes/Makefile b/arch/arm64/kernel/probes/Makefile
> index 8e4be92e25b1..07105fd3261d 100644
> --- a/arch/arm64/kernel/probes/Makefile
> +++ b/arch/arm64/kernel/probes/Makefile
> @@ -4,3 +4,5 @@ obj-$(CONFIG_KPROBES)		+= kprobes.o decode-insn.o	\
>  				   simulate-insn.o
>  obj-$(CONFIG_UPROBES)		+= uprobes.o decode-insn.o	\
>  				   simulate-insn.o
> +obj-$(CONFIG_OPTPROBES)		+= opt_arm64.o			\
> +				   optprobe_trampoline.o
> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
> index 6dbcc89f6662..83755ad62abe 100644
> --- a/arch/arm64/kernel/probes/kprobes.c
> +++ b/arch/arm64/kernel/probes/kprobes.c
> @@ -11,6 +11,7 @@
>  #include <linux/kasan.h>
>  #include <linux/kernel.h>
>  #include <linux/kprobes.h>
> +#include <linux/moduleloader.h>
>  #include <linux/sched/debug.h>
>  #include <linux/set_memory.h>
>  #include <linux/slab.h>
> @@ -113,9 +114,21 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
>  
>  void *alloc_insn_page(void)
>  {
> -	return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
> -			GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
> -			NUMA_NO_NODE, __builtin_return_address(0));
> +	void *page;
> +
> +	page = module_alloc(PAGE_SIZE);
> +	if (!page)
> +		return NULL;
> +
> +	set_vm_flush_reset_perms(page);
> +	/*
> +	 * First make the page read-only, and only then make it executable to
> +	 * prevent it from being W+X in between.
> +	 */
> +	set_memory_ro((unsigned long)page, 1);
> +	set_memory_x((unsigned long)page, 1);
> +
> +	return page;
>  }
>  
>  /* arm kprobe: install breakpoint in text */
> diff --git a/arch/arm64/kernel/probes/opt_arm64.c b/arch/arm64/kernel/probes/opt_arm64.c
> new file mode 100644
> index 000000000000..4de535bee534
> --- /dev/null
> +++ b/arch/arm64/kernel/probes/opt_arm64.c
> @@ -0,0 +1,276 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Code for Kernel probes Jump optimization.
> + *
> + * Copyright (C) 2021 Hisilicon Limited
> + */
> +
> +#include <linux/jump_label.h>
> +#include <linux/kprobes.h>
> +
> +#include <asm/cacheflush.h>
> +#include <asm/insn.h>
> +#include <asm/kprobes.h>
> +#include <asm/patching.h>
> +
> +#define TMPL_VAL_IDX \
> +	(optprobe_template_val - optprobe_template_entry)
> +#define TMPL_CALL_BACK \
> +	(optprobe_template_call - optprobe_template_entry)
> +#define TMPL_END_IDX \
> +	(optprobe_template_end - optprobe_template_entry)
> +#define TMPL_RESTORE_ORIGN_INSN \
> +	(optprobe_template_restore_orig_insn - optprobe_template_entry)
> +#define TMPL_RESTORE_END \
> +	(optprobe_template_restore_end - optprobe_template_entry)
> +#define TMPL_MAX_LENGTH \
> +	(optprobe_template_max_length - optprobe_template_entry)
> +#define OPTPROBE_BATCH_SIZE 64
> +
> +int arch_check_optimized_kprobe(struct optimized_kprobe *op)
> +{
> +	return 0;
> +}
> +
> +int arch_prepared_optinsn(struct arch_optimized_insn *optinsn)
> +{
> +	return optinsn->insn != NULL;
> +}
> +
> +int arch_within_optimized_kprobe(struct optimized_kprobe *op,
> +				unsigned long addr)
> +{
> +	return ((unsigned long)op->kp.addr <= addr &&
> +		(unsigned long)op->kp.addr + RELATIVEJUMP_SIZE > addr);
> +}
> +
> +static void
> +optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs)
> +{
> +	/* This is possible if op is under delayed unoptimizing */
> +	if (kprobe_disabled(&op->kp))
> +		return;
> +
> +	preempt_disable();
> +
> +	if (kprobe_running()) {
> +		kprobes_inc_nmissed_count(&op->kp);
> +	} else {
> +		__this_cpu_write(current_kprobe, &op->kp);
> +		regs->pc = (unsigned long)op->kp.addr;
> +		get_kprobe_ctlblk()->kprobe_status = KPROBE_HIT_ACTIVE;
> +		opt_pre_handler(&op->kp, regs);
> +		__this_cpu_write(current_kprobe, NULL);
> +	}
> +
> +	preempt_enable_no_resched();
> +}
> +NOKPROBE_SYMBOL(optimized_callback)
> +
> +static bool is_offset_in_range(unsigned long start, unsigned long end)
> +{
> +	long offset = end - start;
> +
> +	/*
> +	 * Verify if the address gap is in 128MiB range, because this uses
> +	 * a relative jump.
> +	 *
> +	 * kprobe opt use a 'b' instruction to branch to optinsn.insn.
> +	 * According to ARM manual, branch instruction is:
> +	 *
> +	 *   31  30                  25              0
> +	 *  +----+---+---+---+---+---+---------------+
> +	 *  |cond| 0 | 0 | 1 | 0 | 1 |     imm26     |
> +	 *  +----+---+---+---+---+---+---------------+
> +	 *
> +	 * imm26 is a signed 26 bits integer. The real branch offset is computed
> +	 * by: imm64 = SignExtend(imm26:'00', 64);
> +	 *
> +	 * So the maximum forward branch should be:
> +	 *   (0x01ffffff << 2) = 0x07fffffc
> +	 * The maximum backward branch should be:
> +	 *   (0xfe000000 << 2) = 0xFFFFFFFFF8000000 = -0x08000000
> +	 *
> +	 * We can simply check (rel & 0xf8000003):
> +	 *  if rel is positive, (rel & 0xf8000003) should be 0
> +	 *  if rel is negitive, (rel & 0xf8000003) should be 0xf8000000
> +	 *  the last '3' is used for alignment checking.
> +	 */
> +	return (offset >= -0x8000000 && offset <= 0x7fffffc && !(offset & 0x3));
> +}
> +
> +int arch_prepare_optimized_kprobe(struct optimized_kprobe *op,
> +				  struct kprobe *orig)
> +{
> +	kprobe_opcode_t *code, *buf;
> +	void **addrs;
> +	u32 insn;
> +	int ret, i;
> +
> +	addrs = kcalloc(TMPL_MAX_LENGTH, sizeof(void *), GFP_KERNEL);
> +	if (!addrs)
> +		return -ENOMEM;
> +
> +	buf = kcalloc(TMPL_MAX_LENGTH, sizeof(kprobe_opcode_t), GFP_KERNEL);
> +	if (!buf) {
> +		kfree(addrs);
> +		return -ENOMEM;
> +	}
> +
> +	code = get_optinsn_slot();
> +	if (!code) {
> +		kfree(addrs);
> +		kfree(buf);
> +		return -ENOMEM;
> +	}
> +
> +	if (!is_offset_in_range((unsigned long)code,
> +				(unsigned long)orig->addr + 8)) {
> +		ret = -ERANGE;
> +		goto error;
> +	}
> +
> +	if (!is_offset_in_range((unsigned long)code + TMPL_CALL_BACK,
> +				(unsigned long)optimized_callback)) {
> +		ret = -ERANGE;
> +		goto error;
> +	}
> +
> +	if (!is_offset_in_range((unsigned long)&code[TMPL_RESTORE_END],
> +				(unsigned long)op->kp.addr + 4)) {
> +		ret = -ERANGE;
> +		goto error;
> +	}
> +
> +	memcpy(buf, optprobe_template_entry,
> +	       TMPL_END_IDX * sizeof(kprobe_opcode_t));
> +
> +	buf[TMPL_VAL_IDX] = FIELD_GET(GENMASK(31, 0), (unsigned long long)op);
> +	buf[TMPL_VAL_IDX + 1] =
> +		FIELD_GET(GENMASK(63, 32), (unsigned long long)op);
> +	buf[TMPL_RESTORE_ORIGN_INSN] = orig->opcode;
> +
> +	insn = aarch64_insn_gen_branch_imm(
> +		(unsigned long)(&code[TMPL_CALL_BACK]),
> +		(unsigned long)optimized_callback, AARCH64_INSN_BRANCH_LINK);
> +	buf[TMPL_CALL_BACK] = insn;
> +
> +	insn = aarch64_insn_gen_branch_imm(
> +		(unsigned long)(&code[TMPL_RESTORE_END]),
> +		(unsigned long)(op->kp.addr) + 4, AARCH64_INSN_BRANCH_NOLINK);
> +	buf[TMPL_RESTORE_END] = insn;
> +
> +	/* Setup template */
> +	for (i = 0; i < TMPL_MAX_LENGTH; i++)
> +		addrs[i] = code + i;
> +
> +	ret = aarch64_insn_patch_text(addrs, buf, TMPL_MAX_LENGTH);
> +	if (ret < 0)
> +		goto error;
> +
> +	flush_icache_range((unsigned long)code,
> +			   (unsigned long)(&code[TMPL_END_IDX]));
> +
> +	/* Set op->optinsn.insn means prepared. */
> +	op->optinsn.insn = code;
> +
> +out:
> +	kfree(addrs);
> +	kfree(buf);
> +	return ret;
> +
> +error:
> +	free_optinsn_slot(code, 0);
> +	goto out;
> +}
> +
> +void arch_optimize_kprobes(struct list_head *oplist)
> +{
> +	struct optimized_kprobe *op, *tmp;
> +	kprobe_opcode_t *insns;
> +	void **addrs;
> +	int i = 0;
> +
> +	addrs = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(void *), GFP_KERNEL);
> +	if (!addrs)
> +		return;
> +
> +	insns = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(kprobe_opcode_t), GFP_KERNEL);
> +	if (!insns) {
> +		kfree(addrs);
> +		return;
> +	}
> +
> +	list_for_each_entry_safe(op, tmp, oplist, list) {
> +		WARN_ON(kprobe_disabled(&op->kp));
> +
> +		/*
> +		 * Backup instructions which will be replaced
> +		 * by jump address
> +		 */
> +		memcpy(op->optinsn.copied_insn, op->kp.addr,
> +			RELATIVEJUMP_SIZE);
> +
> +		addrs[i] = (void *)op->kp.addr;
> +		insns[i] = aarch64_insn_gen_branch_imm((unsigned long)op->kp.addr,
> +				(unsigned long)op->optinsn.insn,
> +				AARCH64_INSN_BRANCH_NOLINK);
> +
> +		list_del_init(&op->list);
> +		if (++i == OPTPROBE_BATCH_SIZE)
> +			break;
> +	}
> +
> +	aarch64_insn_patch_text(addrs, insns, i);
> +	kfree(addrs);
> +	kfree(insns);
> +}
> +
> +void arch_unoptimize_kprobe(struct optimized_kprobe *op)
> +{
> +	arch_arm_kprobe(&op->kp);
> +}
> +
> +/*
> + * Recover original instructions and breakpoints from relative jumps.
> + * Caller must call with locking kprobe_mutex.
> + */
> +void arch_unoptimize_kprobes(struct list_head *oplist,
> +			    struct list_head *done_list)
> +{
> +	struct optimized_kprobe *op, *tmp;
> +	kprobe_opcode_t *insns;
> +	void **addrs;
> +	int i = 0;
> +
> +	addrs = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(void *), GFP_KERNEL);
> +	if (!addrs)
> +		return;
> +
> +	insns = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(kprobe_opcode_t), GFP_KERNEL);
> +	if (!insns) {
> +		kfree(addrs);
> +		return;
> +	}
> +
> +	list_for_each_entry_safe(op, tmp, oplist, list) {
> +		addrs[i] = (void *)op->kp.addr;
> +		insns[i] = BRK64_OPCODE_KPROBES;
> +		list_move(&op->list, done_list);
> +
> +		if (++i == OPTPROBE_BATCH_SIZE)
> +			break;
> +	}
> +
> +	aarch64_insn_patch_text(addrs, insns, i);
> +	kfree(addrs);
> +	kfree(insns);
> +}
> +
> +void arch_remove_optimized_kprobe(struct optimized_kprobe *op)
> +{
> +	if (op->optinsn.insn) {
> +		free_optinsn_slot(op->optinsn.insn, 1);
> +		op->optinsn.insn = NULL;
> +	}
> +}
> diff --git a/arch/arm64/kernel/probes/optprobe_trampoline.S b/arch/arm64/kernel/probes/optprobe_trampoline.S
> new file mode 100644
> index 000000000000..24d713d400cd
> --- /dev/null
> +++ b/arch/arm64/kernel/probes/optprobe_trampoline.S
> @@ -0,0 +1,37 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * trampoline entry and return code for optprobes.
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/asm-offsets.h>
> +#include <asm/assembler.h>
> +
> +	.global optprobe_template_entry
> +optprobe_template_entry:
> +	sub sp, sp, #PT_REGS_SIZE
> +	save_all_base_regs
> +	/* Get parameters to optimized_callback() */
> +	ldr	x0, 1f
> +	mov	x1, sp
> +	/* Branch to optimized_callback() */
> +	.global optprobe_template_call
> +optprobe_template_call:
> +	nop
> +	restore_all_base_regs
> +	ldr lr, [sp, #S_LR]
> +        add sp, sp, #PT_REGS_SIZE
> +	.global optprobe_template_restore_orig_insn
> +optprobe_template_restore_orig_insn:
> +	nop
> +	.global optprobe_template_restore_end
> +optprobe_template_restore_end:
> +	nop
> +	.global optprobe_template_end
> +optprobe_template_end:
> +	.global optprobe_template_val
> +optprobe_template_val:
> +	1:	.long 0
> +		.long 0
> +	.global optprobe_template_max_length
> +optprobe_template_max_length:
> -- 
> 2.17.1
>

Mark Rutland Aug. 24, 2021, 10:50 a.m. UTC | #2

Hi,

I have a bunch of comments below.

At a high-level, I'm not all that keen on adding yet another set of
trampolines, especially given we have constraints on how we can branch
to them which render this not that useful in common configurations (e.g.
where KASLR and module randomization is enabled).

So importantly, do we actually need this? I don't think the sampel is
that compelling since we can already use ftrace to measure function
latencies.

If we do need this, I think we need to do some more substantial rework
to address those branch range limitations. I know that we could permit
arbitrary branching if we expand the ftrace-with-regs callsites to ~6
instructions, but that interacts rather poorly with stacktracing and
will make the kernel a bit bigger.

On Wed, Aug 18, 2021 at 03:33:36PM +0800, Qi Liu wrote:
> This patch introduce optprobe for ARM64. In optprobe, probed
> instruction is replaced by a branch instruction to detour
> buffer. Detour buffer contains trampoline code and a call to
> optimized_callback(). optimized_callback() calls opt_pre_handler()
> to execute kprobe handler.
> 
> Performance of optprobe on Hip08 platform is test using kprobe
> example module[1] to analyze the latency of a kernel function,
> and here is the result:
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/kprobes/kretprobe_example.c
> 
> kprobe before optimized:
> [280709.846380] do_empty returned 0 and took 1530 ns to execute
> [280709.852057] do_empty returned 0 and took 550 ns to execute
> [280709.857631] do_empty returned 0 and took 440 ns to execute
> [280709.863215] do_empty returned 0 and took 380 ns to execute
> [280709.868787] do_empty returned 0 and took 360 ns to execute
> [280709.874362] do_empty returned 0 and took 340 ns to execute
> [280709.879936] do_empty returned 0 and took 320 ns to execute
> [280709.885505] do_empty returned 0 and took 300 ns to execute
> [280709.891075] do_empty returned 0 and took 280 ns to execute
> [280709.896646] do_empty returned 0 and took 290 ns to execute
> [280709.902220] do_empty returned 0 and took 290 ns to execute
> [280709.907807] do_empty returned 0 and took 290 ns to execute
> 
> optprobe:
> [ 2965.964572] do_empty returned 0 and took 90 ns to execute
> [ 2965.969952] do_empty returned 0 and took 80 ns to execute
> [ 2965.975332] do_empty returned 0 and took 70 ns to execute
> [ 2965.980714] do_empty returned 0 and took 60 ns to execute
> [ 2965.986128] do_empty returned 0 and took 80 ns to execute
> [ 2965.991507] do_empty returned 0 and took 70 ns to execute
> [ 2965.996884] do_empty returned 0 and took 70 ns to execute
> [ 2966.002262] do_empty returned 0 and took 80 ns to execute
> [ 2966.007642] do_empty returned 0 and took 70 ns to execute
> [ 2966.013020] do_empty returned 0 and took 70 ns to execute
> [ 2966.018400] do_empty returned 0 and took 70 ns to execute
> [ 2966.023779] do_empty returned 0 and took 70 ns to execute
> [ 2966.029158] do_empty returned 0 and took 70 ns to execute

Do we have any examples of where this latency matters in practice?

> 
> Signed-off-by: Qi Liu <liuqi115@huawei.com>
> 
> Note:
> To guarantee the offset between probe point and kprobe pre_handler
> is smaller than 128MiB, users should set
> CONFIG_RANDOMIZE_MODULE_REGION_FULL=N or set nokaslr in command line, or
> optprobe will not work and fall back to normal kprobe.

Hmm... I don't think that's something we want to recommend, and
certainly distros *should* use KASLR and
CONFIG_RANDOMIZE_MODULE_REGION_FULL.

What happens with defconfig? Do we always get the fallback behaviour?

> ---
>  arch/arm64/Kconfig                            |   1 +
>  arch/arm64/include/asm/kprobes.h              |  24 ++
>  arch/arm64/kernel/probes/Makefile             |   2 +
>  arch/arm64/kernel/probes/kprobes.c            |  19 +-
>  arch/arm64/kernel/probes/opt_arm64.c          | 276 ++++++++++++++++++
>  .../arm64/kernel/probes/optprobe_trampoline.S |  37 +++
>  6 files changed, 356 insertions(+), 3 deletions(-)
>  create mode 100644 arch/arm64/kernel/probes/opt_arm64.c
>  create mode 100644 arch/arm64/kernel/probes/optprobe_trampoline.S
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index b5b13a932561..b05d1d275d87 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -200,6 +200,7 @@ config ARM64
>  	select HAVE_STACKPROTECTOR
>  	select HAVE_SYSCALL_TRACEPOINTS
>  	select HAVE_KPROBES
> +	select HAVE_OPTPROBES
>  	select HAVE_KRETPROBES
>  	select HAVE_GENERIC_VDSO
>  	select IOMMU_DMA if IOMMU_SUPPORT
> diff --git a/arch/arm64/include/asm/kprobes.h b/arch/arm64/include/asm/kprobes.h
> index 5d38ff4a4806..6b2fdd2ad7d8 100644
> --- a/arch/arm64/include/asm/kprobes.h
> +++ b/arch/arm64/include/asm/kprobes.h
> @@ -39,6 +39,30 @@ void arch_remove_kprobe(struct kprobe *);
>  int kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr);
>  int kprobe_exceptions_notify(struct notifier_block *self,
>  			     unsigned long val, void *data);
> +
> +#define RELATIVEJUMP_SIZE (4)
> +#define MAX_COPIED_INSN	DIV_ROUND_UP(RELATIVEJUMP_SIZE, sizeof(kprobe_opcode_t))

On arm64, instructions are always 4 bytes, so this can be:

| #define MAX_COPIED_INSN		AARCH64_INSN_SIZE

Note: AARCH64_INSN_SIZE == sizeof(kprobe_opcode_t), so either could be
used here.

> +struct arch_optimized_insn {
> +	kprobe_opcode_t copied_insn[MAX_COPIED_INSN];

This is always a single insn. For clarity, it would be nicer to call
this something like `orig_insn`.

> +	/* detour code buffer */
> +	kprobe_opcode_t *insn;

Could we call this `trampoline`?

> +};
> +
> +/* optinsn template addresses */
> +extern __visible kprobe_opcode_t optprobe_template_entry[];
> +extern __visible kprobe_opcode_t optprobe_template_val[];
> +extern __visible kprobe_opcode_t optprobe_template_call[];
> +extern __visible kprobe_opcode_t optprobe_template_end[];
> +extern __visible kprobe_opcode_t optprobe_template_restore_begin[];
> +extern __visible kprobe_opcode_t optprobe_template_restore_orig_insn[];
> +extern __visible kprobe_opcode_t optprobe_template_restore_end[];
> +extern __visible kprobe_opcode_t optprobe_template_max_length[];

Why is this called "max_length"? It's the end of the template, value
included. Other architectures (e.g. arm) include that between
optprobe_template_entry and optprobe_template_end, so I don't believe we
need to separate that out.

> +
> +#define MAX_OPTIMIZED_LENGTH	4

As above, please define this as either AARCH64_INSN_SIZE or
sizeof(kprobe_opcode_t).

> +#define MAX_OPTINSN_SIZE				\
> +	((unsigned long)optprobe_template_end -	\
> +	 (unsigned long)optprobe_template_entry)

As above, this seems to be the trampoline size minus the size of the
value, which doesn't match other architectures, and doesn't seem right.
The kprobe core code has:

| kprobe_optinsn_slots.insn_size = MAX_OPTINSN_SIZE;

... which IIUC is used to alocate the buffers, so surely we *must* take
the size of the value into account?

> +
>  void kretprobe_trampoline(void);
>  void __kprobes *trampoline_probe_handler(struct pt_regs *regs);
>  
> diff --git a/arch/arm64/kernel/probes/Makefile b/arch/arm64/kernel/probes/Makefile
> index 8e4be92e25b1..07105fd3261d 100644
> --- a/arch/arm64/kernel/probes/Makefile
> +++ b/arch/arm64/kernel/probes/Makefile
> @@ -4,3 +4,5 @@ obj-$(CONFIG_KPROBES)		+= kprobes.o decode-insn.o	\
>  				   simulate-insn.o
>  obj-$(CONFIG_UPROBES)		+= uprobes.o decode-insn.o	\
>  				   simulate-insn.o
> +obj-$(CONFIG_OPTPROBES)		+= opt_arm64.o			\
> +				   optprobe_trampoline.o
> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
> index 6dbcc89f6662..83755ad62abe 100644
> --- a/arch/arm64/kernel/probes/kprobes.c
> +++ b/arch/arm64/kernel/probes/kprobes.c
> @@ -11,6 +11,7 @@
>  #include <linux/kasan.h>
>  #include <linux/kernel.h>
>  #include <linux/kprobes.h>
> +#include <linux/moduleloader.h>
>  #include <linux/sched/debug.h>
>  #include <linux/set_memory.h>
>  #include <linux/slab.h>
> @@ -113,9 +114,21 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
>  
>  void *alloc_insn_page(void)
>  {
> -	return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
> -			GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
> -			NUMA_NO_NODE, __builtin_return_address(0));
> +	void *page;
> +
> +	page = module_alloc(PAGE_SIZE);
> +	if (!page)
> +		return NULL;
> +
> +	set_vm_flush_reset_perms(page);
> +	/*
> +	 * First make the page read-only, and only then make it executable to
> +	 * prevent it from being W+X in between.
> +	 */
> +	set_memory_ro((unsigned long)page, 1);
> +	set_memory_x((unsigned long)page, 1);
> +
> +	return page;
>  }

Today this is only used for xol pages, which don't need to live in the
module space. Why is this not a separate function?

>  /* arm kprobe: install breakpoint in text */
> diff --git a/arch/arm64/kernel/probes/opt_arm64.c b/arch/arm64/kernel/probes/opt_arm64.c
> new file mode 100644
> index 000000000000..4de535bee534
> --- /dev/null
> +++ b/arch/arm64/kernel/probes/opt_arm64.c
> @@ -0,0 +1,276 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Code for Kernel probes Jump optimization.
> + *
> + * Copyright (C) 2021 Hisilicon Limited
> + */
> +
> +#include <linux/jump_label.h>
> +#include <linux/kprobes.h>
> +
> +#include <asm/cacheflush.h>
> +#include <asm/insn.h>
> +#include <asm/kprobes.h>
> +#include <asm/patching.h>
> +
> +#define TMPL_VAL_IDX \
> +	(optprobe_template_val - optprobe_template_entry)
> +#define TMPL_CALL_BACK \
> +	(optprobe_template_call - optprobe_template_entry)
> +#define TMPL_END_IDX \
> +	(optprobe_template_end - optprobe_template_entry)
> +#define TMPL_RESTORE_ORIGN_INSN \
> +	(optprobe_template_restore_orig_insn - optprobe_template_entry)
> +#define TMPL_RESTORE_END \
> +	(optprobe_template_restore_end - optprobe_template_entry)
> +#define TMPL_MAX_LENGTH \
> +	(optprobe_template_max_length - optprobe_template_entry)
> +#define OPTPROBE_BATCH_SIZE 64

What is batch size, and why is it 64?

> +
> +int arch_check_optimized_kprobe(struct optimized_kprobe *op)
> +{
> +	return 0;
> +}
> +

The core code has the comment:

|	/* Check there is no other kprobes at the optimized instructions */
|	if (arch_check_optimized_kprobe(op) < 0) 
|		return;

... is that comment misleading, or do we need to do something here?

> +int arch_prepared_optinsn(struct arch_optimized_insn *optinsn)
> +{
> +	return optinsn->insn != NULL;
> +}
> +
> +int arch_within_optimized_kprobe(struct optimized_kprobe *op,
> +				unsigned long addr)
> +{
> +	return ((unsigned long)op->kp.addr <= addr &&
> +		(unsigned long)op->kp.addr + RELATIVEJUMP_SIZE > addr);
> +}

Since we have a single instruction, and all instructions are naturally aligned 4 bytes, this can be:

|	return (unsigned long)op->kp.addr == addr;

> +
> +static void
> +optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs)
> +{
> +	/* This is possible if op is under delayed unoptimizing */

I see this is a copy-paste from x86 andor powerpc, but it's a bit hard
to understand, and I'm not sure what it is trying to say. What scenario
is this trying to handle?

I note that arch/arm doesn't seem to do this -- is that a bug?

> +	if (kprobe_disabled(&op->kp))
> +		return;
> +
> +	preempt_disable();
> +
> +	if (kprobe_running()) {
> +		kprobes_inc_nmissed_count(&op->kp);
> +	} else {
> +		__this_cpu_write(current_kprobe, &op->kp);
> +		regs->pc = (unsigned long)op->kp.addr;
> +		get_kprobe_ctlblk()->kprobe_status = KPROBE_HIT_ACTIVE;
> +		opt_pre_handler(&op->kp, regs);
> +		__this_cpu_write(current_kprobe, NULL);
> +	}


The body of this looks like it could be factored out in to the core
kprobes code if we had an arch hook to save any missed registers
(apparently just the pc on arm64).

> +	preempt_enable_no_resched();

Why the `_no_resched` variant? I see that x86 just does a plain
preempt_enable().

> +}
> +NOKPROBE_SYMBOL(optimized_callback)
> +
> +static bool is_offset_in_range(unsigned long start, unsigned long end)
> +{
> +	long offset = end - start;
> +
> +	/*
> +	 * Verify if the address gap is in 128MiB range, because this uses
> +	 * a relative jump.
> +	 *
> +	 * kprobe opt use a 'b' instruction to branch to optinsn.insn.
> +	 * According to ARM manual, branch instruction is:
> +	 *
> +	 *   31  30                  25              0
> +	 *  +----+---+---+---+---+---+---------------+
> +	 *  |cond| 0 | 0 | 1 | 0 | 1 |     imm26     |
> +	 *  +----+---+---+---+---+---+---------------+
> +	 *
> +	 * imm26 is a signed 26 bits integer. The real branch offset is computed
> +	 * by: imm64 = SignExtend(imm26:'00', 64);
> +	 *
> +	 * So the maximum forward branch should be:
> +	 *   (0x01ffffff << 2) = 0x07fffffc
> +	 * The maximum backward branch should be:
> +	 *   (0xfe000000 << 2) = 0xFFFFFFFFF8000000 = -0x08000000
> +	 *
> +	 * We can simply check (rel & 0xf8000003):
> +	 *  if rel is positive, (rel & 0xf8000003) should be 0
> +	 *  if rel is negitive, (rel & 0xf8000003) should be 0xf8000000
> +	 *  the last '3' is used for alignment checking.
> +	 */
> +	return (offset >= -0x8000000 && offset <= 0x7fffffc && !(offset & 0x3));
> +}

Please re-use existing code for this. In insn.c we already have
branch_imm_common() and aarch64_insn_gen_branch_imm(), which we should
be able to refactor for easier usage (and should move ftrace_make_nop()
over to that too).

We could *also* add a aarch64_insn_try_gen_branch_imm() helper that
returns an error code rather than printing an error, which would allow
us to consolidate the checks with the codegen, and would be more robust
if this changes in future.

> +
> +int arch_prepare_optimized_kprobe(struct optimized_kprobe *op,
> +				  struct kprobe *orig)
> +{
> +	kprobe_opcode_t *code, *buf;
> +	void **addrs;
> +	u32 insn;
> +	int ret, i;
> +
> +	addrs = kcalloc(TMPL_MAX_LENGTH, sizeof(void *), GFP_KERNEL);
> +	if (!addrs)
> +		return -ENOMEM;

As a general thing, when using allocators, please use sizeof() on the
pointer being allocated, e.g.

	addrs = kcalloc(TMPL_MAX_LENGTH, sizeof(*addrs), GFP_KERNEL);

... as that ensures the type is always correct, even if the type of the
pointer is changed later.

That said, I don't believe this needs to be allocated from the heap.
It's only used for the duration of this function, it's relatively small,
and we could add a wrapper in insn.c to patch a range of instructions,
which would avoid the need for the array entirely.

> +
> +	buf = kcalloc(TMPL_MAX_LENGTH, sizeof(kprobe_opcode_t), GFP_KERNEL);
> +	if (!buf) {
> +		kfree(addrs);
> +		return -ENOMEM;
> +	}
> +
> +	code = get_optinsn_slot();
> +	if (!code) {
> +		kfree(addrs);
> +		kfree(buf);
> +		return -ENOMEM;
> +	}

You can initialize `ret` to -ENOMEM, and goto `out` here.

> +
> +	if (!is_offset_in_range((unsigned long)code,
> +				(unsigned long)orig->addr + 8)) {
> +		ret = -ERANGE;
> +		goto error;
> +	}
> +
> +	if (!is_offset_in_range((unsigned long)code + TMPL_CALL_BACK,
> +				(unsigned long)optimized_callback)) {
> +		ret = -ERANGE;
> +		goto error;
> +	}
> +
> +	if (!is_offset_in_range((unsigned long)&code[TMPL_RESTORE_END],
> +				(unsigned long)op->kp.addr + 4)) {
> +		ret = -ERANGE;
> +		goto error;
> +	}
> +
> +	memcpy(buf, optprobe_template_entry,
> +	       TMPL_END_IDX * sizeof(kprobe_opcode_t));

Why do we need to copy this into a temporary buffer? Can't we work on it
directly in the destination?

> +
> +	buf[TMPL_VAL_IDX] = FIELD_GET(GENMASK(31, 0), (unsigned long long)op);
> +	buf[TMPL_VAL_IDX + 1] =
> +		FIELD_GET(GENMASK(63, 32), (unsigned long long)op);

This is a 64-bit pointer, and needs to be treated as a single unit. The
above is broken for big-endian kernels.

Rather than using instruction indices (and pretending this pointer is
two instructions), it would be nicer to use byte offsets consistently.

> +	buf[TMPL_RESTORE_ORIGN_INSN] = orig->opcode;
> +
> +	insn = aarch64_insn_gen_branch_imm(
> +		(unsigned long)(&code[TMPL_CALL_BACK]),
> +		(unsigned long)optimized_callback, AARCH64_INSN_BRANCH_LINK);
> +	buf[TMPL_CALL_BACK] = insn;
> +
> +	insn = aarch64_insn_gen_branch_imm(
> +		(unsigned long)(&code[TMPL_RESTORE_END]),
> +		(unsigned long)(op->kp.addr) + 4, AARCH64_INSN_BRANCH_NOLINK);
> +	buf[TMPL_RESTORE_END] = insn;
> +
> +	/* Setup template */
> +	for (i = 0; i < TMPL_MAX_LENGTH; i++)
> +		addrs[i] = code + i;
> +
> +	ret = aarch64_insn_patch_text(addrs, buf, TMPL_MAX_LENGTH);
> +	if (ret < 0)
> +		goto error;

This does a stop_machine(), which isn't strictly necessary here if we
write the buffer *then* swing the direct branch.

> +
> +	flush_icache_range((unsigned long)code,
> +			   (unsigned long)(&code[TMPL_END_IDX]));
> +
> +	/* Set op->optinsn.insn means prepared. */
> +	op->optinsn.insn = code;
> +
> +out:
> +	kfree(addrs);
> +	kfree(buf);
> +	return ret;
> +
> +error:
> +	free_optinsn_slot(code, 0);
> +	goto out;
> +}
> +
> +void arch_optimize_kprobes(struct list_head *oplist)
> +{
> +	struct optimized_kprobe *op, *tmp;
> +	kprobe_opcode_t *insns;
> +	void **addrs;
> +	int i = 0;
> +
> +	addrs = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(void *), GFP_KERNEL);
> +	if (!addrs)
> +		return;
> +
> +	insns = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(kprobe_opcode_t), GFP_KERNEL);
> +	if (!insns) {
> +		kfree(addrs);
> +		return;
> +	}

Why do these need to be dynamically allocated? They have small fixed
sizes.

> +
> +	list_for_each_entry_safe(op, tmp, oplist, list) {
> +		WARN_ON(kprobe_disabled(&op->kp));
> +
> +		/*
> +		 * Backup instructions which will be replaced
> +		 * by jump address
> +		 */
> +		memcpy(op->optinsn.copied_insn, op->kp.addr,
> +			RELATIVEJUMP_SIZE);

Hmmm... how does this end up getting restored? This hasn't been through
an le32_to_cpu(), so if we patch it back with the usual insn functions
we'll restore the wrong thing on a big-endian kernel.

> +		addrs[i] = (void *)op->kp.addr;
> +		insns[i] = aarch64_insn_gen_branch_imm((unsigned long)op->kp.addr,
> +				(unsigned long)op->optinsn.insn,
> +				AARCH64_INSN_BRANCH_NOLINK);
> +
> +		list_del_init(&op->list);
> +		if (++i == OPTPROBE_BATCH_SIZE)
> +			break;
> +	}

What happens if the list was bigger than OPTPROBE_BATCH_SIZE?

> +
> +	aarch64_insn_patch_text(addrs, insns, i);
> +	kfree(addrs);
> +	kfree(insns);
> +}
> +
> +void arch_unoptimize_kprobe(struct optimized_kprobe *op)
> +{
> +	arch_arm_kprobe(&op->kp);
> +}
> +
> +/*
> + * Recover original instructions and breakpoints from relative jumps.
> + * Caller must call with locking kprobe_mutex.
> + */
> +void arch_unoptimize_kprobes(struct list_head *oplist,
> +			    struct list_head *done_list)
> +{
> +	struct optimized_kprobe *op, *tmp;
> +	kprobe_opcode_t *insns;
> +	void **addrs;
> +	int i = 0;
> +
> +	addrs = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(void *), GFP_KERNEL);
> +	if (!addrs)
> +		return;
> +
> +	insns = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(kprobe_opcode_t), GFP_KERNEL);
> +	if (!insns) {
> +		kfree(addrs);
> +		return;
> +	}

Again, I do not think these need to be dynamically allocated.

> +
> +	list_for_each_entry_safe(op, tmp, oplist, list) {
> +		addrs[i] = (void *)op->kp.addr;
> +		insns[i] = BRK64_OPCODE_KPROBES;
> +		list_move(&op->list, done_list);
> +
> +		if (++i == OPTPROBE_BATCH_SIZE)
> +			break;

What happens if the list was larger rthan OPTPROBE_BATCH_SIZE ?

> +	}
> +
> +	aarch64_insn_patch_text(addrs, insns, i);
> +	kfree(addrs);
> +	kfree(insns);
> +}
> +
> +void arch_remove_optimized_kprobe(struct optimized_kprobe *op)
> +{
> +	if (op->optinsn.insn) {
> +		free_optinsn_slot(op->optinsn.insn, 1);
> +		op->optinsn.insn = NULL;
> +	}
> +}
> diff --git a/arch/arm64/kernel/probes/optprobe_trampoline.S b/arch/arm64/kernel/probes/optprobe_trampoline.S
> new file mode 100644
> index 000000000000..24d713d400cd
> --- /dev/null
> +++ b/arch/arm64/kernel/probes/optprobe_trampoline.S
> @@ -0,0 +1,37 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * trampoline entry and return code for optprobes.
> + */
> +
> +#include <linux/linkage.h>
> +#include <asm/asm-offsets.h>
> +#include <asm/assembler.h>
> +
> +	.global optprobe_template_entry
> +optprobe_template_entry:

Please use SYM_*(); see arch/arm64/kernel/entry-ftrace.S for examples of
how to use that for trampolines.

This should be:

SYM_CODE_START(optprobe_template)

... and note the matching end below.

> +	sub sp, sp, #PT_REGS_SIZE
> +	save_all_base_regs
> +	/* Get parameters to optimized_callback() */
> +	ldr	x0, 1f
> +	mov	x1, sp
> +	/* Branch to optimized_callback() */
> +	.global optprobe_template_call
> +optprobe_template_call:

SYM_INNER_LABEL(optprobe_template_call, SYM_L_GLOBAL)

...and likewise for all the other labels.

> +	nop
> +	restore_all_base_regs
> +	ldr lr, [sp, #S_LR]
> +        add sp, sp, #PT_REGS_SIZE
> +	.global optprobe_template_restore_orig_insn
> +optprobe_template_restore_orig_insn:
> +	nop
> +	.global optprobe_template_restore_end
> +optprobe_template_restore_end:
> +	nop
> +	.global optprobe_template_end
> +optprobe_template_end:
> +	.global optprobe_template_val
> +optprobe_template_val:
> +	1:	.long 0
> +		.long 0
> +	.global optprobe_template_max_length
> +optprobe_template_max_length:

SYM_INNER_LABEL(optprobe_template_end, SYM_L_GLOBAL)
SYM_CODE_END(optprobe_template)

Thanks,
Mark.

> -- 
> 2.17.1
>

Barry Song Aug. 24, 2021, 11:50 a.m. UTC | #3

On Tue, Aug 24, 2021 at 10:51 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Hi,
>
> I have a bunch of comments below.
>
> At a high-level, I'm not all that keen on adding yet another set of
> trampolines, especially given we have constraints on how we can branch
> to them which render this not that useful in common configurations (e.g.
> where KASLR and module randomization is enabled).
>
> So importantly, do we actually need this? I don't think the sampel is
> that compelling since we can already use ftrace to measure function
> latencies.

Hopefully I can help answer some general questions. for code details,
I'll still rely on liuqi.

As far as I know, the functionality of ftrace is very limited. kprobe,
on the other hand, can do a lot of
things, for example, hooking some ebpf code.
Tons of userspace tools depend on kprobe, eg. ebpf/bcc which is widely
used nowadays:
https://github.com/iovisor/bcc

Function latency is really a very small part of what kprobe can do.
kprobe can do much much
more. For example, while doing disksnoop by kprobe, we are easily
collecting the request struct,
and get the layout of each IO request. without kprobe, we will
probably need to add tracepoints
here and there in kernel code to achieve the same goal.

I believe Masami and kprobe maintainers can answer this question better than me.

>
> If we do need this, I think we need to do some more substantial rework
> to address those branch range limitations. I know that we could permit
> arbitrary branching if we expand the ftrace-with-regs callsites to ~6
> instructions, but that interacts rather poorly with stacktracing and
> will make the kernel a bit bigger.

This patch is probably the first step towards a faster kprobe on ARM64.
I assume PLT or some similar tech will help break the limitation next
step.

>
> On Wed, Aug 18, 2021 at 03:33:36PM +0800, Qi Liu wrote:
> > This patch introduce optprobe for ARM64. In optprobe, probed
> > instruction is replaced by a branch instruction to detour
> > buffer. Detour buffer contains trampoline code and a call to
> > optimized_callback(). optimized_callback() calls opt_pre_handler()
> > to execute kprobe handler.
> >
> > Performance of optprobe on Hip08 platform is test using kprobe
> > example module[1] to analyze the latency of a kernel function,
> > and here is the result:
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/kprobes/kretprobe_example.c
> >
> > kprobe before optimized:
> > [280709.846380] do_empty returned 0 and took 1530 ns to execute
> > [280709.852057] do_empty returned 0 and took 550 ns to execute
> > [280709.857631] do_empty returned 0 and took 440 ns to execute
> > [280709.863215] do_empty returned 0 and took 380 ns to execute
> > [280709.868787] do_empty returned 0 and took 360 ns to execute
> > [280709.874362] do_empty returned 0 and took 340 ns to execute
> > [280709.879936] do_empty returned 0 and took 320 ns to execute
> > [280709.885505] do_empty returned 0 and took 300 ns to execute
> > [280709.891075] do_empty returned 0 and took 280 ns to execute
> > [280709.896646] do_empty returned 0 and took 290 ns to execute
> > [280709.902220] do_empty returned 0 and took 290 ns to execute
> > [280709.907807] do_empty returned 0 and took 290 ns to execute
> >
> > optprobe:
> > [ 2965.964572] do_empty returned 0 and took 90 ns to execute
> > [ 2965.969952] do_empty returned 0 and took 80 ns to execute
> > [ 2965.975332] do_empty returned 0 and took 70 ns to execute
> > [ 2965.980714] do_empty returned 0 and took 60 ns to execute
> > [ 2965.986128] do_empty returned 0 and took 80 ns to execute
> > [ 2965.991507] do_empty returned 0 and took 70 ns to execute
> > [ 2965.996884] do_empty returned 0 and took 70 ns to execute
> > [ 2966.002262] do_empty returned 0 and took 80 ns to execute
> > [ 2966.007642] do_empty returned 0 and took 70 ns to execute
> > [ 2966.013020] do_empty returned 0 and took 70 ns to execute
> > [ 2966.018400] do_empty returned 0 and took 70 ns to execute
> > [ 2966.023779] do_empty returned 0 and took 70 ns to execute
> > [ 2966.029158] do_empty returned 0 and took 70 ns to execute
>
> Do we have any examples of where this latency matters in practice?

While doing performance analysis or debugging performance issues  in
lab, the big latency of the current kprobe
can significantly impact the real result.

And it is particularly important when we want to do some online
diagnostics where machines are running on the cloud.
we need to minimally affect the real system while watching the system.

For example, https://gitee.com/xiebaoyou/diagnosis_tools is an
open-source tool which is used by alibaba, it uses
kprobe a lot to detect various problems running on aloud.

>
> >
> > Signed-off-by: Qi Liu <liuqi115@huawei.com>
> >
> > Note:
> > To guarantee the offset between probe point and kprobe pre_handler
> > is smaller than 128MiB, users should set
> > CONFIG_RANDOMIZE_MODULE_REGION_FULL=N or set nokaslr in command line, or
> > optprobe will not work and fall back to normal kprobe.
>
> Hmm... I don't think that's something we want to recommend, and
> certainly distros *should* use KASLR and
> CONFIG_RANDOMIZE_MODULE_REGION_FULL.
>
> What happens with defconfig? Do we always get the fallback behaviour?

For this stage, with no_kaslr cmdline for defconfig, we are always
going to optprobe. otherwise, it
always goes to the fallback behaviour. So for this stage, it is
probably only useful in lab while people
are debugging problems. but this is still a widely useful scenario.
>

Thanks
Barry

Mark Rutland Aug. 24, 2021, 12:11 p.m. UTC | #4

On Tue, Aug 24, 2021 at 11:50:08PM +1200, Barry Song wrote:
> On Tue, Aug 24, 2021 at 10:51 PM Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > Hi,
> >
> > I have a bunch of comments below.
> >
> > At a high-level, I'm not all that keen on adding yet another set of
> > trampolines, especially given we have constraints on how we can branch
> > to them which render this not that useful in common configurations (e.g.
> > where KASLR and module randomization is enabled).
> >
> > So importantly, do we actually need this? I don't think the sampel is
> > that compelling since we can already use ftrace to measure function
> > latencies.
> 
> Hopefully I can help answer some general questions. for code details,
> I'll still rely on liuqi.
> 
> As far as I know, the functionality of ftrace is very limited. kprobe,
> on the other hand, can do a lot of
> things, for example, hooking some ebpf code.
> Tons of userspace tools depend on kprobe, eg. ebpf/bcc which is widely
> used nowadays:
> https://github.com/iovisor/bcc
> 
> Function latency is really a very small part of what kprobe can do.

I appreciate that, but *functionally* we can already do the other bits
with the existing kprobes code, so the question for this series is where
the latency/overhead matters in practice.

> kprobe can do much much more. For example, while doing disksnoop by
> kprobe, we are easily collecting the request struct, and get the
> layout of each IO request. without kprobe, we will probably need to
> add tracepoints here and there in kernel code to achieve the same
> goal.
> 
> I believe Masami and kprobe maintainers can answer this question better than me.
> 
> > If we do need this, I think we need to do some more substantial rework
> > to address those branch range limitations. I know that we could permit
> > arbitrary branching if we expand the ftrace-with-regs callsites to ~6
> > instructions, but that interacts rather poorly with stacktracing and
> > will make the kernel a bit bigger.
> 
> This patch is probably the first step towards a faster kprobe on ARM64.
> I assume PLT or some similar tech will help break the limitation next
> step.

I have ideas about how we can make that work (basically an inline PLT
next to the function), but the implementation is *very* different, and
my thinking is that either:

* The latency/overhead is a real concern generally, and this needs to
  work in the majority of cases (e.g. where KASLR and module
  randomization are on). If so, I think we should implement the more
  invasive but complete solution in one go rather than implementing
  something we know we're going to have to throw away.

* The latency is not that much of a concern, or is only a concern ina
  few niche cases. If so, it's not clear to me whether this is
  worthwhile as-is.

> > On Wed, Aug 18, 2021 at 03:33:36PM +0800, Qi Liu wrote:
> > > This patch introduce optprobe for ARM64. In optprobe, probed
> > > instruction is replaced by a branch instruction to detour
> > > buffer. Detour buffer contains trampoline code and a call to
> > > optimized_callback(). optimized_callback() calls opt_pre_handler()
> > > to execute kprobe handler.
> > >
> > > Performance of optprobe on Hip08 platform is test using kprobe
> > > example module[1] to analyze the latency of a kernel function,
> > > and here is the result:
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/kprobes/kretprobe_example.c
> > >
> > > kprobe before optimized:
> > > [280709.846380] do_empty returned 0 and took 1530 ns to execute
> > > [280709.852057] do_empty returned 0 and took 550 ns to execute
> > > [280709.857631] do_empty returned 0 and took 440 ns to execute
> > > [280709.863215] do_empty returned 0 and took 380 ns to execute
> > > [280709.868787] do_empty returned 0 and took 360 ns to execute
> > > [280709.874362] do_empty returned 0 and took 340 ns to execute
> > > [280709.879936] do_empty returned 0 and took 320 ns to execute
> > > [280709.885505] do_empty returned 0 and took 300 ns to execute
> > > [280709.891075] do_empty returned 0 and took 280 ns to execute
> > > [280709.896646] do_empty returned 0 and took 290 ns to execute
> > > [280709.902220] do_empty returned 0 and took 290 ns to execute
> > > [280709.907807] do_empty returned 0 and took 290 ns to execute
> > >
> > > optprobe:
> > > [ 2965.964572] do_empty returned 0 and took 90 ns to execute
> > > [ 2965.969952] do_empty returned 0 and took 80 ns to execute
> > > [ 2965.975332] do_empty returned 0 and took 70 ns to execute
> > > [ 2965.980714] do_empty returned 0 and took 60 ns to execute
> > > [ 2965.986128] do_empty returned 0 and took 80 ns to execute
> > > [ 2965.991507] do_empty returned 0 and took 70 ns to execute
> > > [ 2965.996884] do_empty returned 0 and took 70 ns to execute
> > > [ 2966.002262] do_empty returned 0 and took 80 ns to execute
> > > [ 2966.007642] do_empty returned 0 and took 70 ns to execute
> > > [ 2966.013020] do_empty returned 0 and took 70 ns to execute
> > > [ 2966.018400] do_empty returned 0 and took 70 ns to execute
> > > [ 2966.023779] do_empty returned 0 and took 70 ns to execute
> > > [ 2966.029158] do_empty returned 0 and took 70 ns to execute
> >
> > Do we have any examples of where this latency matters in practice?
> 
> While doing performance analysis or debugging performance issues  in
> lab, the big latency of the current kprobe
> can significantly impact the real result.
> 
> And it is particularly important when we want to do some online
> diagnostics where machines are running on the cloud.
> we need to minimally affect the real system while watching the system.
> 
> For example, https://gitee.com/xiebaoyou/diagnosis_tools is an
> open-source tool which is used by alibaba, it uses
> kprobe a lot to detect various problems running on aloud.

Ok. It would be good to mention something like this in the commit
message. If it's possible to get an estimate of the overhead reduction,
that would be great.

> > > Signed-off-by: Qi Liu <liuqi115@huawei.com>
> > >
> > > Note:
> > > To guarantee the offset between probe point and kprobe pre_handler
> > > is smaller than 128MiB, users should set
> > > CONFIG_RANDOMIZE_MODULE_REGION_FULL=N or set nokaslr in command line, or
> > > optprobe will not work and fall back to normal kprobe.
> >
> > Hmm... I don't think that's something we want to recommend, and
> > certainly distros *should* use KASLR and
> > CONFIG_RANDOMIZE_MODULE_REGION_FULL.
> >
> > What happens with defconfig? Do we always get the fallback behaviour?
> 
> For this stage, with no_kaslr cmdline for defconfig, we are always
> going to optprobe. otherwise, it always goes to the fallback
> behaviour. So for this stage, it is probably only useful in lab while
> people are debugging problems. but this is still a widely useful
> scenario.

If it's only going to be used for lab debugging, why is the existing
kprobes logic not sufficient?

Thanks,
Mark.

Barry Song Aug. 24, 2021, 12:42 p.m. UTC | #5

On Wed, Aug 25, 2021 at 12:11 AM Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Aug 24, 2021 at 11:50:08PM +1200, Barry Song wrote:
> > On Tue, Aug 24, 2021 at 10:51 PM Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > Hi,
> > >
> > > I have a bunch of comments below.
> > >
> > > At a high-level, I'm not all that keen on adding yet another set of
> > > trampolines, especially given we have constraints on how we can branch
> > > to them which render this not that useful in common configurations (e.g.
> > > where KASLR and module randomization is enabled).
> > >
> > > So importantly, do we actually need this? I don't think the sampel is
> > > that compelling since we can already use ftrace to measure function
> > > latencies.
> >
> > Hopefully I can help answer some general questions. for code details,
> > I'll still rely on liuqi.
> >
> > As far as I know, the functionality of ftrace is very limited. kprobe,
> > on the other hand, can do a lot of
> > things, for example, hooking some ebpf code.
> > Tons of userspace tools depend on kprobe, eg. ebpf/bcc which is widely
> > used nowadays:
> > https://github.com/iovisor/bcc
> >
> > Function latency is really a very small part of what kprobe can do.
>
> I appreciate that, but *functionally* we can already do the other bits
> with the existing kprobes code, so the question for this series is where
> the latency/overhead matters in practice.
>
> > kprobe can do much much more. For example, while doing disksnoop by
> > kprobe, we are easily collecting the request struct, and get the
> > layout of each IO request. without kprobe, we will probably need to
> > add tracepoints here and there in kernel code to achieve the same
> > goal.
> >
> > I believe Masami and kprobe maintainers can answer this question better than me.
> >
> > > If we do need this, I think we need to do some more substantial rework
> > > to address those branch range limitations. I know that we could permit
> > > arbitrary branching if we expand the ftrace-with-regs callsites to ~6
> > > instructions, but that interacts rather poorly with stacktracing and
> > > will make the kernel a bit bigger.
> >
> > This patch is probably the first step towards a faster kprobe on ARM64.
> > I assume PLT or some similar tech will help break the limitation next
> > step.
>
> I have ideas about how we can make that work (basically an inline PLT
> next to the function), but the implementation is *very* different, and
> my thinking is that either:
>
> * The latency/overhead is a real concern generally, and this needs to
>   work in the majority of cases (e.g. where KASLR and module
>   randomization are on). If so, I think we should implement the more
>   invasive but complete solution in one go rather than implementing
>   something we know we're going to have to throw away.

For an online scenario, I do think we should have a complete implementation
to make it more useful as we can't force users to use nokaslr. in this
thread V1:
https://lore.kernel.org/lkml/6a97dff6c33c4b84887223de2502bd3d@hisilicon.com/
I actually also posted some similar ideas with you regarding using PLT to break
the limitation.

>
> * The latency is not that much of a concern, or is only a concern ina
>   few niche cases. If so, it's not clear to me whether this is
>   worthwhile as-is.
>
> > > On Wed, Aug 18, 2021 at 03:33:36PM +0800, Qi Liu wrote:
> > > > This patch introduce optprobe for ARM64. In optprobe, probed
> > > > instruction is replaced by a branch instruction to detour
> > > > buffer. Detour buffer contains trampoline code and a call to
> > > > optimized_callback(). optimized_callback() calls opt_pre_handler()
> > > > to execute kprobe handler.
> > > >
> > > > Performance of optprobe on Hip08 platform is test using kprobe
> > > > example module[1] to analyze the latency of a kernel function,
> > > > and here is the result:
> > > >
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/kprobes/kretprobe_example.c
> > > >
> > > > kprobe before optimized:
> > > > [280709.846380] do_empty returned 0 and took 1530 ns to execute
> > > > [280709.852057] do_empty returned 0 and took 550 ns to execute
> > > > [280709.857631] do_empty returned 0 and took 440 ns to execute
> > > > [280709.863215] do_empty returned 0 and took 380 ns to execute
> > > > [280709.868787] do_empty returned 0 and took 360 ns to execute
> > > > [280709.874362] do_empty returned 0 and took 340 ns to execute
> > > > [280709.879936] do_empty returned 0 and took 320 ns to execute
> > > > [280709.885505] do_empty returned 0 and took 300 ns to execute
> > > > [280709.891075] do_empty returned 0 and took 280 ns to execute
> > > > [280709.896646] do_empty returned 0 and took 290 ns to execute
> > > > [280709.902220] do_empty returned 0 and took 290 ns to execute
> > > > [280709.907807] do_empty returned 0 and took 290 ns to execute
> > > >
> > > > optprobe:
> > > > [ 2965.964572] do_empty returned 0 and took 90 ns to execute
> > > > [ 2965.969952] do_empty returned 0 and took 80 ns to execute
> > > > [ 2965.975332] do_empty returned 0 and took 70 ns to execute
> > > > [ 2965.980714] do_empty returned 0 and took 60 ns to execute
> > > > [ 2965.986128] do_empty returned 0 and took 80 ns to execute
> > > > [ 2965.991507] do_empty returned 0 and took 70 ns to execute
> > > > [ 2965.996884] do_empty returned 0 and took 70 ns to execute
> > > > [ 2966.002262] do_empty returned 0 and took 80 ns to execute
> > > > [ 2966.007642] do_empty returned 0 and took 70 ns to execute
> > > > [ 2966.013020] do_empty returned 0 and took 70 ns to execute
> > > > [ 2966.018400] do_empty returned 0 and took 70 ns to execute
> > > > [ 2966.023779] do_empty returned 0 and took 70 ns to execute
> > > > [ 2966.029158] do_empty returned 0 and took 70 ns to execute
> > >
> > > Do we have any examples of where this latency matters in practice?
> >
> > While doing performance analysis or debugging performance issues  in
> > lab, the big latency of the current kprobe
> > can significantly impact the real result.
> >
> > And it is particularly important when we want to do some online
> > diagnostics where machines are running on the cloud.
> > we need to minimally affect the real system while watching the system.
> >
> > For example, https://gitee.com/xiebaoyou/diagnosis_tools is an
> > open-source tool which is used by alibaba, it uses
> > kprobe a lot to detect various problems running on aloud.
>
> Ok. It would be good to mention something like this in the commit
> message. If it's possible to get an estimate of the overhead reduction,
> that would be great.

I believe liuqi's commit log has estimated the overhead reduction.
The below is exactly the overhead of kprobe before and after the patch.
not only does it decrease the overhead,  it also makes the
standard deviation more even.

i believe liuqi's estimation doesn't open cpufreq, if cpufreq is open,
un-optimized kprobe
latency can be much much bigger and have much much bigger stdev.

kprobe before optimized:
[280709.846380] do_empty returned 0 and took 1530 ns to execute
[280709.852057] do_empty returned 0 and took 550 ns to execute
[280709.857631] do_empty returned 0 and took 440 ns to execute
[280709.863215] do_empty returned 0 and took 380 ns to execute
[280709.868787] do_empty returned 0 and took 360 ns to execute
[280709.874362] do_empty returned 0 and took 340 ns to execute
[280709.879936] do_empty returned 0 and took 320 ns to execute
[280709.885505] do_empty returned 0 and took 300 ns to execute
[280709.891075] do_empty returned 0 and took 280 ns to execute
[280709.896646] do_empty returned 0 and took 290 ns to execute
[280709.902220] do_empty returned 0 and took 290 ns to execute
[280709.907807] do_empty returned 0 and took 290 ns to execute

optprobe:
[ 2965.964572] do_empty returned 0 and took 90 ns to execute
[ 2965.969952] do_empty returned 0 and took 80 ns to execute
[ 2965.975332] do_empty returned 0 and took 70 ns to execute
[ 2965.980714] do_empty returned 0 and took 60 ns to execute
[ 2965.986128] do_empty returned 0 and took 80 ns to execute
[ 2965.991507] do_empty returned 0 and took 70 ns to execute
[ 2965.996884] do_empty returned 0 and took 70 ns to execute
[ 2966.002262] do_empty returned 0 and took 80 ns to execute
[ 2966.007642] do_empty returned 0 and took 70 ns to execute
[ 2966.013020] do_empty returned 0 and took 70 ns to execute
[ 2966.018400] do_empty returned 0 and took 70 ns to execute
[ 2966.023779] do_empty returned 0 and took 70 ns to execute
[ 2966.029158] do_empty returned 0 and took 70 ns to execute

>
> > > > Signed-off-by: Qi Liu <liuqi115@huawei.com>
> > > >
> > > > Note:
> > > > To guarantee the offset between probe point and kprobe pre_handler
> > > > is smaller than 128MiB, users should set
> > > > CONFIG_RANDOMIZE_MODULE_REGION_FULL=N or set nokaslr in command line, or
> > > > optprobe will not work and fall back to normal kprobe.
> > >
> > > Hmm... I don't think that's something we want to recommend, and
> > > certainly distros *should* use KASLR and
> > > CONFIG_RANDOMIZE_MODULE_REGION_FULL.
> > >
> > > What happens with defconfig? Do we always get the fallback behaviour?
> >
> > For this stage, with no_kaslr cmdline for defconfig, we are always
> > going to optprobe. otherwise, it always goes to the fallback
> > behaviour. So for this stage, it is probably only useful in lab while
> > people are debugging problems. but this is still a widely useful
> > scenario.
>
> If it's only going to be used for lab debugging, why is the existing
> kprobes logic not sufficient?

My understanding is the lab needs to emulate the real scenario while
it is not a real
online system. The huge latency of the existing un-optimized kprobe
make all the data
from ebpf/bcc tools seriously distorted.


>
> Thanks,
> Mark.

Thanks
barry

Masami Hiramatsu (Google) Aug. 25, 2021, 2:13 a.m. UTC | #6

On Tue, 24 Aug 2021 11:50:01 +0100
Mark Rutland <mark.rutland@arm.com> wrote:

> Hi,
> 
> I have a bunch of comments below.
> 
> At a high-level, I'm not all that keen on adding yet another set of
> trampolines, especially given we have constraints on how we can branch
> to them which render this not that useful in common configurations (e.g.
> where KASLR and module randomization is enabled).

Yes, that makes kprobe jump optimization hard to implement on
RISC architecture in general. (x86 has 32bit offset jump instruction)
To solve this issue, something like "intermedate jump area" is needed
for each module. (Or, overwriting multiple instructions)

> 
> So importantly, do we actually need this? I don't think the sampel is
> that compelling since we can already use ftrace to measure function
> latencies.

That depends on what you use it for, as you may know, kprobes allows
you to put the probes on function body (and inlined function),
on the other hand, ftrace can put only on the entry of the function.
I guess Qi may want to use it for improving performance of BPF.

(BTW, as far as I know, Jisheng Zhang once tried to implement
kprobe on ftrace, that may be more helpful in this example.
https://lore.kernel.org/linux-arm-kernel/20191225172625.69811b3e@xhacker.debian/T/#m23a7aa55d32d140ee6a92102534446cfd4a43007
I will pick them up again)

> 
> If we do need this, I think we need to do some more substantial rework
> to address those branch range limitations. I know that we could permit
> arbitrary branching if we expand the ftrace-with-regs callsites to ~6
> instructions, but that interacts rather poorly with stacktracing and
> will make the kernel a bit bigger.

Would you mean we reuse the ftrace-with-regs callsites for kprobes?

arm32 avoids this limitation partially with reserved text pages
for trampoline in the kernel. But I think that is also a partial
solution. It may not work with module randomization at least on
arm64.

On arm64, I think there are several way to solve it.

- Add optprobe trampoline buffer for each module.
  This is the simplest way to solve this issue, but requires some
  pages to be added to each module (and kernel).

- Add intermediate trampoline area for each module. (2-stage jump)
  This jumps into an intermediate trampoline entry, save a partial
  registers and jump the actual trampoline using that register.
  This can reduce the size of trampoline buffer for each module.

- Replace multiple instructions with the above intermediate jump
  code. (single jump, but replace multiple instructions)
  This requires to emulate multiple instructions and also the
  kprobe must decode the instructions in the target function to
  identify the replaced instructions are in one basic block. But
  no need to add intermediate trampoline area (page).



> 
> On Wed, Aug 18, 2021 at 03:33:36PM +0800, Qi Liu wrote:
> > This patch introduce optprobe for ARM64. In optprobe, probed
> > instruction is replaced by a branch instruction to detour
> > buffer. Detour buffer contains trampoline code and a call to
> > optimized_callback(). optimized_callback() calls opt_pre_handler()
> > to execute kprobe handler.
> > 
> > Performance of optprobe on Hip08 platform is test using kprobe
> > example module[1] to analyze the latency of a kernel function,
> > and here is the result:
> > 
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/kprobes/kretprobe_example.c
> > 
> > kprobe before optimized:
> > [280709.846380] do_empty returned 0 and took 1530 ns to execute
> > [280709.852057] do_empty returned 0 and took 550 ns to execute
> > [280709.857631] do_empty returned 0 and took 440 ns to execute
> > [280709.863215] do_empty returned 0 and took 380 ns to execute
> > [280709.868787] do_empty returned 0 and took 360 ns to execute
> > [280709.874362] do_empty returned 0 and took 340 ns to execute
> > [280709.879936] do_empty returned 0 and took 320 ns to execute
> > [280709.885505] do_empty returned 0 and took 300 ns to execute
> > [280709.891075] do_empty returned 0 and took 280 ns to execute
> > [280709.896646] do_empty returned 0 and took 290 ns to execute
> > [280709.902220] do_empty returned 0 and took 290 ns to execute
> > [280709.907807] do_empty returned 0 and took 290 ns to execute
> > 
> > optprobe:
> > [ 2965.964572] do_empty returned 0 and took 90 ns to execute
> > [ 2965.969952] do_empty returned 0 and took 80 ns to execute
> > [ 2965.975332] do_empty returned 0 and took 70 ns to execute
> > [ 2965.980714] do_empty returned 0 and took 60 ns to execute
> > [ 2965.986128] do_empty returned 0 and took 80 ns to execute
> > [ 2965.991507] do_empty returned 0 and took 70 ns to execute
> > [ 2965.996884] do_empty returned 0 and took 70 ns to execute
> > [ 2966.002262] do_empty returned 0 and took 80 ns to execute
> > [ 2966.007642] do_empty returned 0 and took 70 ns to execute
> > [ 2966.013020] do_empty returned 0 and took 70 ns to execute
> > [ 2966.018400] do_empty returned 0 and took 70 ns to execute
> > [ 2966.023779] do_empty returned 0 and took 70 ns to execute
> > [ 2966.029158] do_empty returned 0 and took 70 ns to execute
> 
> Do we have any examples of where this latency matters in practice?
> 
> > 
> > Signed-off-by: Qi Liu <liuqi115@huawei.com>
> > 
> > Note:
> > To guarantee the offset between probe point and kprobe pre_handler
> > is smaller than 128MiB, users should set
> > CONFIG_RANDOMIZE_MODULE_REGION_FULL=N or set nokaslr in command line, or
> > optprobe will not work and fall back to normal kprobe.
> 
> Hmm... I don't think that's something we want to recommend, and
> certainly distros *should* use KASLR and
> CONFIG_RANDOMIZE_MODULE_REGION_FULL.
> 
> What happens with defconfig? Do we always get the fallback behaviour?

Yes, in such case, it fails back to normal kprobe.
Anyway, optprobe is a background optimization. User can not specify
which kprobe is optimized. That is automatically done.

Thank you,

Barry Song Aug. 25, 2021, 3:12 a.m. UTC | #7

On Wed, Aug 25, 2021 at 2:15 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Tue, 24 Aug 2021 11:50:01 +0100
> Mark Rutland <mark.rutland@arm.com> wrote:
>
> > Hi,
> >
> > I have a bunch of comments below.
> >
> > At a high-level, I'm not all that keen on adding yet another set of
> > trampolines, especially given we have constraints on how we can branch
> > to them which render this not that useful in common configurations (e.g.
> > where KASLR and module randomization is enabled).
>
> Yes, that makes kprobe jump optimization hard to implement on
> RISC architecture in general. (x86 has 32bit offset jump instruction)
> To solve this issue, something like "intermedate jump area" is needed
> for each module. (Or, overwriting multiple instructions)
>
> >
> > So importantly, do we actually need this? I don't think the sampel is
> > that compelling since we can already use ftrace to measure function
> > latencies.
>
> That depends on what you use it for, as you may know, kprobes allows
> you to put the probes on function body (and inlined function),
> on the other hand, ftrace can put only on the entry of the function.
> I guess Qi may want to use it for improving performance of BPF.
>
> (BTW, as far as I know, Jisheng Zhang once tried to implement
> kprobe on ftrace, that may be more helpful in this example.
> https://lore.kernel.org/linux-arm-kernel/20191225172625.69811b3e@xhacker.debian/T/#m23a7aa55d32d140ee6a92102534446cfd4a43007
> I will pick them up again)
>
> >
> > If we do need this, I think we need to do some more substantial rework
> > to address those branch range limitations. I know that we could permit
> > arbitrary branching if we expand the ftrace-with-regs callsites to ~6
> > instructions, but that interacts rather poorly with stacktracing and
> > will make the kernel a bit bigger.
>
> Would you mean we reuse the ftrace-with-regs callsites for kprobes?
>
> arm32 avoids this limitation partially with reserved text pages
> for trampoline in the kernel. But I think that is also a partial
> solution. It may not work with module randomization at least on
> arm64.
>
> On arm64, I think there are several way to solve it.
>
> - Add optprobe trampoline buffer for each module.
>   This is the simplest way to solve this issue, but requires some
>   pages to be added to each module (and kernel).
>
> - Add intermediate trampoline area for each module. (2-stage jump)
>   This jumps into an intermediate trampoline entry, save a partial
>   registers and jump the actual trampoline using that register.
>   This can reduce the size of trampoline buffer for each module.
>
> - Replace multiple instructions with the above intermediate jump
>   code. (single jump, but replace multiple instructions)
>   This requires to emulate multiple instructions and also the
>   kprobe must decode the instructions in the target function to
>   identify the replaced instructions are in one basic block. But
>   no need to add intermediate trampoline area (page).
>
>
>
> >
> > On Wed, Aug 18, 2021 at 03:33:36PM +0800, Qi Liu wrote:
> > > This patch introduce optprobe for ARM64. In optprobe, probed
> > > instruction is replaced by a branch instruction to detour
> > > buffer. Detour buffer contains trampoline code and a call to
> > > optimized_callback(). optimized_callback() calls opt_pre_handler()
> > > to execute kprobe handler.
> > >
> > > Performance of optprobe on Hip08 platform is test using kprobe
> > > example module[1] to analyze the latency of a kernel function,
> > > and here is the result:
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/samples/kprobes/kretprobe_example.c
> > >
> > > kprobe before optimized:
> > > [280709.846380] do_empty returned 0 and took 1530 ns to execute
> > > [280709.852057] do_empty returned 0 and took 550 ns to execute
> > > [280709.857631] do_empty returned 0 and took 440 ns to execute
> > > [280709.863215] do_empty returned 0 and took 380 ns to execute
> > > [280709.868787] do_empty returned 0 and took 360 ns to execute
> > > [280709.874362] do_empty returned 0 and took 340 ns to execute
> > > [280709.879936] do_empty returned 0 and took 320 ns to execute
> > > [280709.885505] do_empty returned 0 and took 300 ns to execute
> > > [280709.891075] do_empty returned 0 and took 280 ns to execute
> > > [280709.896646] do_empty returned 0 and took 290 ns to execute
> > > [280709.902220] do_empty returned 0 and took 290 ns to execute
> > > [280709.907807] do_empty returned 0 and took 290 ns to execute
> > >
> > > optprobe:
> > > [ 2965.964572] do_empty returned 0 and took 90 ns to execute
> > > [ 2965.969952] do_empty returned 0 and took 80 ns to execute
> > > [ 2965.975332] do_empty returned 0 and took 70 ns to execute
> > > [ 2965.980714] do_empty returned 0 and took 60 ns to execute
> > > [ 2965.986128] do_empty returned 0 and took 80 ns to execute
> > > [ 2965.991507] do_empty returned 0 and took 70 ns to execute
> > > [ 2965.996884] do_empty returned 0 and took 70 ns to execute
> > > [ 2966.002262] do_empty returned 0 and took 80 ns to execute
> > > [ 2966.007642] do_empty returned 0 and took 70 ns to execute
> > > [ 2966.013020] do_empty returned 0 and took 70 ns to execute
> > > [ 2966.018400] do_empty returned 0 and took 70 ns to execute
> > > [ 2966.023779] do_empty returned 0 and took 70 ns to execute
> > > [ 2966.029158] do_empty returned 0 and took 70 ns to execute
> >
> > Do we have any examples of where this latency matters in practice?
> >
> > >
> > > Signed-off-by: Qi Liu <liuqi115@huawei.com>
> > >
> > > Note:
> > > To guarantee the offset between probe point and kprobe pre_handler
> > > is smaller than 128MiB, users should set
> > > CONFIG_RANDOMIZE_MODULE_REGION_FULL=N or set nokaslr in command line, or
> > > optprobe will not work and fall back to normal kprobe.
> >
> > Hmm... I don't think that's something we want to recommend, and
> > certainly distros *should* use KASLR and
> > CONFIG_RANDOMIZE_MODULE_REGION_FULL.
> >
> > What happens with defconfig? Do we always get the fallback behaviour?
>
> Yes, in such case, it fails back to normal kprobe.

just one minor comment. as Qi pointed out before, bootargs nokaslr
will make kernel built by defconfig
use optprobe:

        nokaslr         [KNL]
                        When CONFIG_RANDOMIZE_BASE is set, this disables
                        kernel and module base offset ASLR (Address Space
                        Layout Randomization).

in lab, while security is not a concern as online, it would be a good option.

> Anyway, optprobe is a background optimization. User can not specify
> which kprobe is optimized. That is automatically done.
>
> Thank you,
>
>
> --
> Masami Hiramatsu <mhiramat@kernel.org>

Thanks
barry

liuqi (BA) Sept. 7, 2021, 3:14 a.m. UTC | #8

Hi Mark,
Thanks for reviewing this patch.

On 2021/8/24 18:50, Mark Rutland wrote:
> Hi,
> 
> I have a bunch of comments below.
> 
> At a high-level, I'm not all that keen on adding yet another set of
> trampolines, especially given we have constraints on how we can branch
> to them which render this not that useful in common configurations (e.g.
> where KASLR and module randomization is enabled).
> 
> So importantly, do we actually need this? I don't think the sampel is
> that compelling since we can already use ftrace to measure function
> latencies.
> 
> If we do need this, I think we need to do some more substantial rework
> to address those branch range limitations. I know that we could permit
> arbitrary branching if we expand the ftrace-with-regs callsites to ~6
> instructions, but that interacts rather poorly with stacktracing and
> will make the kernel a bit bigger.
>

As Barry mentioned, kprobe do have some use in commercial scenarios, 
like https://gitee.com/xiebaoyou/diagnosis_tools, an open-source tool 
which is used by alibaba. So reduce the latency of kprobe has some 
practical value.

Branch limitation sometimes will make optprobe fall back to normal 
kprobe, but perhaps we could use this as the first step, and try PLT or 
some other methods to avoid the limitation latter.

> On Wed, Aug 18, 2021 at 03:33:36PM +0800, Qi Liu wrote:
>> This patch introduce optprobe for ARM64. In optprobe, probed
>> instruction is replaced by a branch instruction to detour
>> buffer. Detour buffer contains trampoline code and a call to
>> optimized_callback(). optimized_callback() calls opt_pre_handler()
>> to execute kprobe handler.

[...]

>> diff --git a/arch/arm64/include/asm/kprobes.h b/arch/arm64/include/asm/kprobes.h
>> index 5d38ff4a4806..6b2fdd2ad7d8 100644
>> --- a/arch/arm64/include/asm/kprobes.h
>> +++ b/arch/arm64/include/asm/kprobes.h
>> @@ -39,6 +39,30 @@ void arch_remove_kprobe(struct kprobe *);
>>   int kprobe_fault_handler(struct pt_regs *regs, unsigned int fsr);
>>   int kprobe_exceptions_notify(struct notifier_block *self,
>>   			     unsigned long val, void *data);
>> +
>> +#define RELATIVEJUMP_SIZE (4)
>> +#define MAX_COPIED_INSN	DIV_ROUND_UP(RELATIVEJUMP_SIZE, sizeof(kprobe_opcode_t))
> 
> On arm64, instructions are always 4 bytes, so this can be:
> 
> | #define MAX_COPIED_INSN		AARCH64_INSN_SIZE
> 
> Note: AARCH64_INSN_SIZE == sizeof(kprobe_opcode_t), so either could be
> used here.
> 
got it, will change this.
>> +struct arch_optimized_insn {
>> +	kprobe_opcode_t copied_insn[MAX_COPIED_INSN];
> 
> This is always a single insn. For clarity, it would be nicer to call
> this something like `orig_insn`.
> 
>> +	/* detour code buffer */
>> +	kprobe_opcode_t *insn;
> 
> Could we call this `trampoline`?
> 
ok, will change this, thanks.
>> +};
>> +
>> +/* optinsn template addresses */
>> +extern __visible kprobe_opcode_t optprobe_template_entry[];
>> +extern __visible kprobe_opcode_t optprobe_template_val[];
>> +extern __visible kprobe_opcode_t optprobe_template_call[];
>> +extern __visible kprobe_opcode_t optprobe_template_end[];
>> +extern __visible kprobe_opcode_t optprobe_template_restore_begin[];
>> +extern __visible kprobe_opcode_t optprobe_template_restore_orig_insn[];
>> +extern __visible kprobe_opcode_t optprobe_template_restore_end[];
>> +extern __visible kprobe_opcode_t optprobe_template_max_length[];
> 
> Why is this called "max_length"? It's the end of the template, value
> included. Other architectures (e.g. arm) include that between
> optprobe_template_entry and optprobe_template_end, so I don't believe we
> need to separate that out.
> 
got it, will remove this next time, thanks.
>> +
>> +#define MAX_OPTIMIZED_LENGTH	4
> 
> As above, please define this as either AARCH64_INSN_SIZE or
> sizeof(kprobe_opcode_t).
> 
ok, will change this, thanks.
>> +#define MAX_OPTINSN_SIZE				\
>> +	((unsigned long)optprobe_template_end -	\
>> +	 (unsigned long)optprobe_template_entry)
> 
> As above, this seems to be the trampoline size minus the size of the
> value, which doesn't match other architectures, and doesn't seem right.
> The kprobe core code has:
> 
> | kprobe_optinsn_slots.insn_size = MAX_OPTINSN_SIZE;
> 
> ... which IIUC is used to alocate the buffers, so surely we *must* take
> the size of the value into account?
> >> +
>>   void kretprobe_trampoline(void);
>>   void __kprobes *trampoline_probe_handler(struct pt_regs *regs);
>>   
>> diff --git a/arch/arm64/kernel/probes/Makefile b/arch/arm64/kernel/probes/Makefile
>> index 8e4be92e25b1..07105fd3261d 100644
>> --- a/arch/arm64/kernel/probes/Makefile
>> +++ b/arch/arm64/kernel/probes/Makefile
>> @@ -4,3 +4,5 @@ obj-$(CONFIG_KPROBES)		+= kprobes.o decode-insn.o	\
>>   				   simulate-insn.o
>>   obj-$(CONFIG_UPROBES)		+= uprobes.o decode-insn.o	\
>>   				   simulate-insn.o
>> +obj-$(CONFIG_OPTPROBES)		+= opt_arm64.o			\
>> +				   optprobe_trampoline.o
>> diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
>> index 6dbcc89f6662..83755ad62abe 100644
>> --- a/arch/arm64/kernel/probes/kprobes.c
>> +++ b/arch/arm64/kernel/probes/kprobes.c
>> @@ -11,6 +11,7 @@
>>   #include <linux/kasan.h>
>>   #include <linux/kernel.h>
>>   #include <linux/kprobes.h>
>> +#include <linux/moduleloader.h>
>>   #include <linux/sched/debug.h>
>>   #include <linux/set_memory.h>
>>   #include <linux/slab.h>
>> @@ -113,9 +114,21 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
>>   
>>   void *alloc_insn_page(void)
>>   {
>> -	return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
>> -			GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
>> -			NUMA_NO_NODE, __builtin_return_address(0));
>> +	void *page;
>> +
>> +	page = module_alloc(PAGE_SIZE);
>> +	if (!page)
>> +		return NULL;
>> +
>> +	set_vm_flush_reset_perms(page);
>> +	/*
>> +	 * First make the page read-only, and only then make it executable to
>> +	 * prevent it from being W+X in between.
>> +	 */
>> +	set_memory_ro((unsigned long)page, 1);
>> +	set_memory_x((unsigned long)page, 1);
>> +
>> +	return page;
>>   }
> 
> Today this is only used for xol pages, which don't need to live in the
> module space. Why is this not a separate function?
> 

I'm not clear about this, do you mean following set can be moved to a 
separate function? thanks.

set_vm_flush_reset_perms(page);
set_memory_ro((unsigned long)page, 1);
set_memory_x((unsigned long)page, 1);

>>   /* arm kprobe: install breakpoint in text */
>> diff --git a/arch/arm64/kernel/probes/opt_arm64.c b/arch/arm64/kernel/probes/opt_arm64.c

[...]
>> +int arch_check_optimized_kprobe(struct optimized_kprobe *op)
>> +{
>> +	return 0;
>> +}
>> +
> 
> The core code has the comment:
> 
> |	/* Check there is no other kprobes at the optimized instructions */
> |	if (arch_check_optimized_kprobe(op) < 0)
> |		return;
> 
> ... is that comment misleading, or do we need to do something here?
> 

In ARM64, kprobe opt always replace one instruction and it is impossible 
to encounter another kprobe in the address range. So always return 0.

>> +int arch_prepared_optinsn(struct arch_optimized_insn *optinsn)
>> +{
>> +	return optinsn->insn != NULL;
>> +}
>> +
>> +int arch_within_optimized_kprobe(struct optimized_kprobe *op,
>> +				unsigned long addr)
>> +{
>> +	return ((unsigned long)op->kp.addr <= addr &&
>> +		(unsigned long)op->kp.addr + RELATIVEJUMP_SIZE > addr);
>> +}
> 
> Since we have a single instruction, and all instructions are naturally aligned 4 bytes, this can be:
> 
> |	return (unsigned long)op->kp.addr == addr;
> 
got it , will change this next time, thanks.
>> +
>> +static void
>> +optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs)
>> +{
>> +	/* This is possible if op is under delayed unoptimizing */
> 
> I see this is a copy-paste from x86 andor powerpc, but it's a bit hard
> to understand, and I'm not sure what it is trying to say. What scenario
> is this trying to handle?
> 
> I note that arch/arm doesn't seem to do this -- is that a bug?
> 

I think we should check the status of the op, if it is disabled, 
pre-handler should not be called.
And I guess this might be a bug of arm, but not sure...

>> +	if (kprobe_disabled(&op->kp))
>> +		return;
>> +
>> +	preempt_disable();
>> +
>> +	if (kprobe_running()) {
>> +		kprobes_inc_nmissed_count(&op->kp);
>> +	} else {
>> +		__this_cpu_write(current_kprobe, &op->kp);
>> +		regs->pc = (unsigned long)op->kp.addr;
>> +		get_kprobe_ctlblk()->kprobe_status = KPROBE_HIT_ACTIVE;
>> +		opt_pre_handler(&op->kp, regs);
>> +		__this_cpu_write(current_kprobe, NULL);
>> +	}
> 
>  > The body of this looks like it could be factored out in to the core
> kprobes code if we had an arch hook to save any missed registers
> (apparently just the pc on arm64).

got it, I'll add a opt_callback() function in core kprobes to factor the 
body of this optimized_callback function. and add a op->set_regs() hook 
for each arch.

> 
>> +	preempt_enable_no_resched();
> 
> Why the `_no_resched` variant? I see that x86 just does a plain
> preempt_enable().
> 
>> +}
>> +NOKPROBE_SYMBOL(optimized_callback)
>> +
>> +static bool is_offset_in_range(unsigned long start, unsigned long end)
>> +{
>> +	long offset = end - start;
>> +
>> +	/*
>> +	 * Verify if the address gap is in 128MiB range, because this uses
>> +	 * a relative jump.
>> +	 *
>> +	 * kprobe opt use a 'b' instruction to branch to optinsn.insn.
>> +	 * According to ARM manual, branch instruction is:
>> +	 *
>> +	 *   31  30                  25              0
>> +	 *  +----+---+---+---+---+---+---------------+
>> +	 *  |cond| 0 | 0 | 1 | 0 | 1 |     imm26     |
>> +	 *  +----+---+---+---+---+---+---------------+
>> +	 *
>> +	 * imm26 is a signed 26 bits integer. The real branch offset is computed
>> +	 * by: imm64 = SignExtend(imm26:'00', 64);
>> +	 *
>> +	 * So the maximum forward branch should be:
>> +	 *   (0x01ffffff << 2) = 0x07fffffc
>> +	 * The maximum backward branch should be:
>> +	 *   (0xfe000000 << 2) = 0xFFFFFFFFF8000000 = -0x08000000
>> +	 *
>> +	 * We can simply check (rel & 0xf8000003):
>> +	 *  if rel is positive, (rel & 0xf8000003) should be 0
>> +	 *  if rel is negitive, (rel & 0xf8000003) should be 0xf8000000
>> +	 *  the last '3' is used for alignment checking.
>> +	 */
>> +	return (offset >= -0x8000000 && offset <= 0x7fffffc && !(offset & 0x3));
>> +}
> 
> Please re-use existing code for this. In insn.c we already have
> branch_imm_common() and aarch64_insn_gen_branch_imm(), which we should
> be able to refactor for easier usage (and should move ftrace_make_nop()
> over to that too).
> 
> We could *also* add a aarch64_insn_try_gen_branch_imm() helper that
> returns an error code rather than printing an error, which would allow
> us to consolidate the checks with the codegen, and would be more robust
> if this changes in future.
> 

Got it, I'll remove this function, and add a 
aarch64_insn_try_gen_branch_imm() in insn.c, like this:

bool aarch64_insn_try_gen_branch_imm(unsigned long pc, unsigned long addr,
				     long range)
{
	long offset;

	offset = branch_imm_common(pc, addr, SZ_128M);
	if (offset >= SZ_128M)
		return -ERANGE;

	return 0;
}

>> +
>> +int arch_prepare_optimized_kprobe(struct optimized_kprobe *op,
>> +				  struct kprobe *orig)
>> +{
>> +	kprobe_opcode_t *code, *buf;
>> +	void **addrs;
>> +	u32 insn;
>> +	int ret, i;
>> +
>> +	addrs = kcalloc(TMPL_MAX_LENGTH, sizeof(void *), GFP_KERNEL);
>> +	if (!addrs)
>> +		return -ENOMEM;
> 
> As a general thing, when using allocators, please use sizeof() on the
> pointer being allocated, e.g.
> 
> 	addrs = kcalloc(TMPL_MAX_LENGTH, sizeof(*addrs), GFP_KERNEL);
> 
> ... as that ensures the type is always correct, even if the type of the
> pointer is changed later.
> 
> That said, I don't believe this needs to be allocated from the heap.
> It's only used for the duration of this function, it's relatively small,
> and we could add a wrapper in insn.c to patch a range of instructions,
> which would avoid the need for the array entirely.
> 

as your comments below, aarch64_insn_patch_text() with stop_machine is 
not nessesary in this function, so addrs[] could be dropped here.

>> +
>> +	buf = kcalloc(TMPL_MAX_LENGTH, sizeof(kprobe_opcode_t), GFP_KERNEL);
>> +	if (!buf) {
>> +		kfree(addrs);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	code = get_optinsn_slot();
>> +	if (!code) {
>> +		kfree(addrs);
>> +		kfree(buf);
>> +		return -ENOMEM;
>> +	}
> 
> You can initialize `ret` to -ENOMEM, and goto `out` here.
> 
got it, thanks
>> +
>> +	if (!is_offset_in_range((unsigned long)code,
>> +				(unsigned long)orig->addr + 8)) {
>> +		ret = -ERANGE;
>> +		goto error;
>> +	}
>> +
>> +	if (!is_offset_in_range((unsigned long)code + TMPL_CALL_BACK,
>> +				(unsigned long)optimized_callback)) {
>> +		ret = -ERANGE;
>> +		goto error;
>> +	}
>> +
>> +	if (!is_offset_in_range((unsigned long)&code[TMPL_RESTORE_END],
>> +				(unsigned long)op->kp.addr + 4)) {
>> +		ret = -ERANGE;
>> +		goto error;
>> +	}
>> +
>> +	memcpy(buf, optprobe_template_entry,
>> +	       TMPL_END_IDX * sizeof(kprobe_opcode_t));
> 
> Why do we need to copy this into a temporary buffer? Can't we work on it
> directly in the destination?
> 

copy directly to destination will call aarch64_insn_patch_text() 
frequently, this will call too much stop_machine() and add delay(as 
Masami said).

>> +
>> +	buf[TMPL_VAL_IDX] = FIELD_GET(GENMASK(31, 0), (unsigned long long)op);
>> +	buf[TMPL_VAL_IDX + 1] =
>> +		FIELD_GET(GENMASK(63, 32), (unsigned long long)op);
> 
> This is a 64-bit pointer, and needs to be treated as a single unit. The
> above is broken for big-endian kernels.
> 
> Rather than using instruction indices (and pretending this pointer is
> two instructions), it would be nicer to use byte offsets consistently.
> 

got it, thanks, so I'll use memcpy here instead, thanks.
>> +	buf[TMPL_RESTORE_ORIGN_INSN] = orig->opcode;
>> +
>> +	insn = aarch64_insn_gen_branch_imm(
>> +		(unsigned long)(&code[TMPL_CALL_BACK]),
>> +		(unsigned long)optimized_callback, AARCH64_INSN_BRANCH_LINK);
>> +	buf[TMPL_CALL_BACK] = insn;
>> +
>> +	insn = aarch64_insn_gen_branch_imm(
>> +		(unsigned long)(&code[TMPL_RESTORE_END]),
>> +		(unsigned long)(op->kp.addr) + 4, AARCH64_INSN_BRANCH_NOLINK);
>> +	buf[TMPL_RESTORE_END] = insn;
>> +
>> +	/* Setup template */
>> +	for (i = 0; i < TMPL_MAX_LENGTH; i++)
>> +		addrs[i] = code + i;
>> +
>> +	ret = aarch64_insn_patch_text(addrs, buf, TMPL_MAX_LENGTH);
>> +	if (ret < 0)
>> +		goto error;
> 
> This does a stop_machine(), which isn't strictly necessary here if we
> write the buffer *then* swing the direct branch.
> 

got it, so I'll use aarch64_insn_patch_text_nosync() instead. thanks.
>> +
>> +	flush_icache_range((unsigned long)code,
>> +			   (unsigned long)(&code[TMPL_END_IDX]));
>> +
>> +	/* Set op->optinsn.insn means prepared. */
>> +	op->optinsn.insn = code;
>> +
>> +out:
>> +	kfree(addrs);
>> +	kfree(buf);
>> +	return ret;
>> +
>> +error:
>> +	free_optinsn_slot(code, 0);
>> +	goto out;
>> +}
>> +
>> +void arch_optimize_kprobes(struct list_head *oplist)
>> +{
>> +	struct optimized_kprobe *op, *tmp;
>> +	kprobe_opcode_t *insns;
>> +	void **addrs;
>> +	int i = 0;
>> +
>> +	addrs = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(void *), GFP_KERNEL);
>> +	if (!addrs)
>> +		return;
>> +
>> +	insns = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(kprobe_opcode_t), GFP_KERNEL);
>> +	if (!insns) {
>> +		kfree(addrs);
>> +		return;
>> +	}
> 
> Why do these need to be dynamically allocated? They have small fixed
> sizes.
> 
got it, will change this next time.
>> +
>> +	list_for_each_entry_safe(op, tmp, oplist, list) {
>> +		WARN_ON(kprobe_disabled(&op->kp));
>> +
>> +		/*
>> +		 * Backup instructions which will be replaced
>> +		 * by jump address
>> +		 */
>> +		memcpy(op->optinsn.copied_insn, op->kp.addr,
>> +			RELATIVEJUMP_SIZE);
> 
> Hmmm... how does this end up getting restored? This hasn't been through
> an le32_to_cpu(), so if we patch it back with the usual insn functions
> we'll restore the wrong thing on a big-endian kernel.

op->optinsn.copied_insn should be a "kprobe_opcode_t copied_insn[1];", 
so we copy a u32 insn to copied_insn[1], and don't need to worry about 
the endian problem here.

> 
>> +		addrs[i] = (void *)op->kp.addr;
>> +		insns[i] = aarch64_insn_gen_branch_imm((unsigned long)op->kp.addr,
>> +				(unsigned long)op->optinsn.insn,
>> +				AARCH64_INSN_BRANCH_NOLINK);
>> +
>> +		list_del_init(&op->list);
>> +		if (++i == OPTPROBE_BATCH_SIZE)
>> +			break;
>> +	}
> 
> What happens if the list was bigger than OPTPROBE_BATCH_SIZE?
> 

Will fall back to normal kprobe.

As Masami's comment in V3 patchset, we add a buffer to store all the 
optprobes' insntructions and call aarch64_insn_patch_text() only once.

https://lore.kernel.org/lkml/20210811162004.9b3349e6bda68e74e0a3a4ad@kernel.org/
>> +
>> +	aarch64_insn_patch_text(addrs, insns, i);
>> +	kfree(addrs);
>> +	kfree(insns);
>> +}
>> +
>> +void arch_unoptimize_kprobe(struct optimized_kprobe *op)
>> +{
>> +	arch_arm_kprobe(&op->kp);
>> +}
>> +
>> +/*
>> + * Recover original instructions and breakpoints from relative jumps.
>> + * Caller must call with locking kprobe_mutex.
>> + */
>> +void arch_unoptimize_kprobes(struct list_head *oplist,
>> +			    struct list_head *done_list)
>> +{
>> +	struct optimized_kprobe *op, *tmp;
>> +	kprobe_opcode_t *insns;
>> +	void **addrs;
>> +	int i = 0;
>> +
>> +	addrs = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(void *), GFP_KERNEL);
>> +	if (!addrs)
>> +		return;
>> +
>> +	insns = kcalloc(OPTPROBE_BATCH_SIZE, sizeof(kprobe_opcode_t), GFP_KERNEL);
>> +	if (!insns) {
>> +		kfree(addrs);
>> +		return;
>> +	}
> 
> Again, I do not think these need to be dynamically allocated.
> 
got it.
>> +
>> +	list_for_each_entry_safe(op, tmp, oplist, list) {
>> +		addrs[i] = (void *)op->kp.addr;
>> +		insns[i] = BRK64_OPCODE_KPROBES;
>> +		list_move(&op->list, done_list);
>> +
>> +		if (++i == OPTPROBE_BATCH_SIZE)
>> +			break;
> 
> What happens if the list was larger rthan OPTPROBE_BATCH_SIZE ?
> 

Will fall back to normal kprobe.
>> +	}
>> +
>> +	aarch64_insn_patch_text(addrs, insns, i);
>> +	kfree(addrs);
>> +	kfree(insns);
>> +}
>> +
>> +void arch_remove_optimized_kprobe(struct optimized_kprobe *op)
>> +{
>> +	if (op->optinsn.insn) {
>> +		free_optinsn_slot(op->optinsn.insn, 1);
>> +		op->optinsn.insn = NULL;
>> +	}
>> +}
>> diff --git a/arch/arm64/kernel/probes/optprobe_trampoline.S b/arch/arm64/kernel/probes/optprobe_trampoline.S
>> new file mode 100644
>> index 000000000000..24d713d400cd
>> --- /dev/null
>> +++ b/arch/arm64/kernel/probes/optprobe_trampoline.S
>> @@ -0,0 +1,37 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * trampoline entry and return code for optprobes.
>> + */
>> +
>> +#include <linux/linkage.h>
>> +#include <asm/asm-offsets.h>
>> +#include <asm/assembler.h>
>> +
>> +	.global optprobe_template_entry
>> +optprobe_template_entry:
> 
> Please use SYM_*(); see arch/arm64/kernel/entry-ftrace.S for examples of
> how to use that for trampolines.
> 
> This should be:
> 
> SYM_CODE_START(optprobe_template)
> 
> ... and note the matching end below.
> 
got it , thanks.
>> +	sub sp, sp, #PT_REGS_SIZE
>> +	save_all_base_regs
>> +	/* Get parameters to optimized_callback() */
>> +	ldr	x0, 1f
>> +	mov	x1, sp
>> +	/* Branch to optimized_callback() */
>> +	.global optprobe_template_call
>> +optprobe_template_call:
> 
> SYM_INNER_LABEL(optprobe_template_call, SYM_L_GLOBAL)
> 
> ...and likewise for all the other labels.
> 
ok, will change this next time.

Thanks for your review,
Qi

>> +	nop
>> +	restore_all_base_regs
>> +	ldr lr, [sp, #S_LR]
>> +        add sp, sp, #PT_REGS_SIZE
>> +	.global optprobe_template_restore_orig_insn
>> +optprobe_template_restore_orig_insn:
>> +	nop
>> +	.global optprobe_template_restore_end
>> +optprobe_template_restore_end:
>> +	nop
>> +	.global optprobe_template_end
>> +optprobe_template_end:
>> +	.global optprobe_template_val
>> +optprobe_template_val:
>> +	1:	.long 0
>> +		.long 0
>> +	.global optprobe_template_max_length
>> +optprobe_template_max_length:
> 
> SYM_INNER_LABEL(optprobe_template_end, SYM_L_GLOBAL)
> SYM_CODE_END(optprobe_template)
> 
> Thanks,
> Mark.
> 
>> -- 
>> 2.17.1
>>
> .
>

liuqi (BA) Nov. 26, 2021, 10:31 a.m. UTC | #9

On 2021/8/24 18:50, Mark Rutland wrote:
>> diff --git a/arch/arm64/kernel/probes/optprobe_trampoline.S b/arch/arm64/kernel/probes/optprobe_trampoline.S
>> new file mode 100644
>> index 000000000000..24d713d400cd
>> --- /dev/null
>> +++ b/arch/arm64/kernel/probes/optprobe_trampoline.S
>> @@ -0,0 +1,37 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/*
>> + * trampoline entry and return code for optprobes.
>> + */
>> +
>> +#include <linux/linkage.h>
>> +#include <asm/asm-offsets.h>
>> +#include <asm/assembler.h>
>> +
>> +	.global optprobe_template_entry
>> +optprobe_template_entry:
> Please use SYM_*(); see arch/arm64/kernel/entry-ftrace.S for examples of
> how to use that for trampolines.
> 
> This should be:
> 
> SYM_CODE_START(optprobe_template)
> 
Hi all,

I meet a problem when I use SYM_CODE_START(optprobe_template) to replace 
optprobe_template_entry.

If SYM_CODE_START is used, all optprobe will share one trampoline space. 
Under this circumstances, if user register two optprobes, trampoline 
will be overwritten by the newer one, and this will cause kernel panic 
when the old optprobe is trigger.

Using optprobe_template_entry will not have this problem, as each 
optprobe has its own trampoline space (alloced in get_opinsn_slot()).

So how to reuse SYM_CODE_START  in this situation, does anyone has a 
good idea?

Thanks,
Qi
> ... and note the matching end below.
> 
>> +	sub sp, sp, #PT_REGS_SIZE
>> +	save_all_base_regs
>> +	/* Get parameters to optimized_callback() */
>> +	ldr	x0, 1f
>> +	mov	x1, sp
>> +	/* Branch to optimized_callback() */
>> +	.global optprobe_template_call
>> +optprobe_template_call:
> SYM_INNER_LABEL(optprobe_template_call, SYM_L_GLOBAL)
> 
> ...and likewise for all the other labels.
> 
>> +	nop
>> +	restore_all_base_regs
>> +	ldr lr, [sp, #S_LR]
>> +        add sp, sp, #PT_REGS_SIZE
>> +	.global optprobe_template_restore_orig_insn
>> +optprobe_template_restore_orig_insn:
>> +	nop
>> +	.global optprobe_template_restore_end
>> +optprobe_template_restore_end:
>> +	nop
>> +	.global optprobe_template_end
>> +optprobe_template_end:
>> +	.global optprobe_template_val
>> +optprobe_template_val:
>> +	1:	.long 0
>> +		.long 0
>> +	.global optprobe_template_max_length
>> +optprobe_template_max_length:
> SYM_INNER_LABEL(optprobe_template_end, SYM_L_GLOBAL)
> SYM_CODE_END(optprobe_template)
> 
> Thanks,
> Mark.
> 
>> --

Masami Hiramatsu (Google) Nov. 27, 2021, 12:23 p.m. UTC | #10

On Fri, 26 Nov 2021 18:31:06 +0800
"liuqi (BA)" <liuqi115@huawei.com> wrote:

> 
> 
> On 2021/8/24 18:50, Mark Rutland wrote:
> >> diff --git a/arch/arm64/kernel/probes/optprobe_trampoline.S b/arch/arm64/kernel/probes/optprobe_trampoline.S
> >> new file mode 100644
> >> index 000000000000..24d713d400cd
> >> --- /dev/null
> >> +++ b/arch/arm64/kernel/probes/optprobe_trampoline.S
> >> @@ -0,0 +1,37 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +/*
> >> + * trampoline entry and return code for optprobes.
> >> + */
> >> +
> >> +#include <linux/linkage.h>
> >> +#include <asm/asm-offsets.h>
> >> +#include <asm/assembler.h>
> >> +
> >> +	.global optprobe_template_entry
> >> +optprobe_template_entry:
> > Please use SYM_*(); see arch/arm64/kernel/entry-ftrace.S for examples of
> > how to use that for trampolines.
> > 
> > This should be:
> > 
> > SYM_CODE_START(optprobe_template)
> > 
> Hi all,
> 
> I meet a problem when I use SYM_CODE_START(optprobe_template) to replace 
> optprobe_template_entry.
> 
> If SYM_CODE_START is used, all optprobe will share one trampoline space. 
> Under this circumstances, if user register two optprobes, trampoline 
> will be overwritten by the newer one, and this will cause kernel panic 
> when the old optprobe is trigger.

Hm, this is curious, because the template should be copied to the
trampoline buffer for each optprobe and be modified.

> 
> Using optprobe_template_entry will not have this problem, as each 
> optprobe has its own trampoline space (alloced in get_opinsn_slot()).

Yes, it is designed to do so.

Thank you,

> 
> So how to reuse SYM_CODE_START  in this situation, does anyone has a 
> good idea?
> 
> Thanks,
> Qi
> > ... and note the matching end below.
> > 
> >> +	sub sp, sp, #PT_REGS_SIZE
> >> +	save_all_base_regs
> >> +	/* Get parameters to optimized_callback() */
> >> +	ldr	x0, 1f
> >> +	mov	x1, sp
> >> +	/* Branch to optimized_callback() */
> >> +	.global optprobe_template_call
> >> +optprobe_template_call:
> > SYM_INNER_LABEL(optprobe_template_call, SYM_L_GLOBAL)
> > 
> > ...and likewise for all the other labels.
> > 
> >> +	nop
> >> +	restore_all_base_regs
> >> +	ldr lr, [sp, #S_LR]
> >> +        add sp, sp, #PT_REGS_SIZE
> >> +	.global optprobe_template_restore_orig_insn
> >> +optprobe_template_restore_orig_insn:
> >> +	nop
> >> +	.global optprobe_template_restore_end
> >> +optprobe_template_restore_end:
> >> +	nop
> >> +	.global optprobe_template_end
> >> +optprobe_template_end:
> >> +	.global optprobe_template_val
> >> +optprobe_template_val:
> >> +	1:	.long 0
> >> +		.long 0
> >> +	.global optprobe_template_max_length
> >> +optprobe_template_max_length:
> > SYM_INNER_LABEL(optprobe_template_end, SYM_L_GLOBAL)
> > SYM_CODE_END(optprobe_template)
> > 
> > Thanks,
> > Mark.
> > 
> >> --

liuqi (BA) Nov. 29, 2021, 1:40 a.m. UTC | #11

On 2021/11/27 20:23, Masami Hiramatsu wrote:
> On Fri, 26 Nov 2021 18:31:06 +0800
> "liuqi (BA)" <liuqi115@huawei.com> wrote:
> 
>>
>>
>> On 2021/8/24 18:50, Mark Rutland wrote:
>>>> diff --git a/arch/arm64/kernel/probes/optprobe_trampoline.S b/arch/arm64/kernel/probes/optprobe_trampoline.S
>>>> new file mode 100644
>>>> index 000000000000..24d713d400cd
>>>> --- /dev/null
>>>> +++ b/arch/arm64/kernel/probes/optprobe_trampoline.S
>>>> @@ -0,0 +1,37 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +/*
>>>> + * trampoline entry and return code for optprobes.
>>>> + */
>>>> +
>>>> +#include <linux/linkage.h>
>>>> +#include <asm/asm-offsets.h>
>>>> +#include <asm/assembler.h>
>>>> +
>>>> +	.global optprobe_template_entry
>>>> +optprobe_template_entry:
>>> Please use SYM_*(); see arch/arm64/kernel/entry-ftrace.S for examples of
>>> how to use that for trampolines.
>>>
>>> This should be:
>>>
>>> SYM_CODE_START(optprobe_template)
>>>
>> Hi all,
>>
>> I meet a problem when I use SYM_CODE_START(optprobe_template) to replace
>> optprobe_template_entry.
>>
>> If SYM_CODE_START is used, all optprobe will share one trampoline space.
>> Under this circumstances, if user register two optprobes, trampoline
>> will be overwritten by the newer one, and this will cause kernel panic
>> when the old optprobe is trigger.
> 
> Hm, this is curious, because the template should be copied to the
> trampoline buffer for each optprobe and be modified.
> 
>>
>> Using optprobe_template_entry will not have this problem, as each
>> optprobe has its own trampoline space (alloced in get_opinsn_slot()).
> 
> Yes, it is designed to do so.
> 
> Thank you,
> 

Hi Masami,

Thanks for your reply. But I also met a problem when using 
get_opinsn_slot() to alloc trampoline buffer.

As module_alloc(like x86) is used to alloc buffer, trampoline is in 
module space, so if origin insn is in kernel space, the range between 
origin insn and trampoline is out of 128M.

As module PLT cannot used here, I have no idea to achieve long jump in 
this situation. Do you have any good idea?

Thanks,
Qi

>>
>> So how to reuse SYM_CODE_START  in this situation, does anyone has a
>> good idea?
>>
>> Thanks,
>> Qi
>>> ... and note the matching end below.
>>>
>>>> +	sub sp, sp, #PT_REGS_SIZE
>>>> +	save_all_base_regs
>>>> +	/* Get parameters to optimized_callback() */
>>>> +	ldr	x0, 1f
>>>> +	mov	x1, sp
>>>> +	/* Branch to optimized_callback() */
>>>> +	.global optprobe_template_call
>>>> +optprobe_template_call:
>>> SYM_INNER_LABEL(optprobe_template_call, SYM_L_GLOBAL)
>>>
>>> ...and likewise for all the other labels.
>>>
>>>> +	nop
>>>> +	restore_all_base_regs
>>>> +	ldr lr, [sp, #S_LR]
>>>> +        add sp, sp, #PT_REGS_SIZE
>>>> +	.global optprobe_template_restore_orig_insn
>>>> +optprobe_template_restore_orig_insn:
>>>> +	nop
>>>> +	.global optprobe_template_restore_end
>>>> +optprobe_template_restore_end:
>>>> +	nop
>>>> +	.global optprobe_template_end
>>>> +optprobe_template_end:
>>>> +	.global optprobe_template_val
>>>> +optprobe_template_val:
>>>> +	1:	.long 0
>>>> +		.long 0
>>>> +	.global optprobe_template_max_length
>>>> +optprobe_template_max_length:
>>> SYM_INNER_LABEL(optprobe_template_end, SYM_L_GLOBAL)
>>> SYM_CODE_END(optprobe_template)
>>>
>>> Thanks,
>>> Mark.
>>>
>>>> -- 
> 
>

Masami Hiramatsu (Google) Nov. 29, 2021, 5 a.m. UTC | #12

On Mon, 29 Nov 2021 09:40:30 +0800
"liuqi (BA)" <liuqi115@huawei.com> wrote:

> 
> 
> On 2021/11/27 20:23, Masami Hiramatsu wrote:
> > On Fri, 26 Nov 2021 18:31:06 +0800
> > "liuqi (BA)" <liuqi115@huawei.com> wrote:
> > 
> >>
> >>
> >> On 2021/8/24 18:50, Mark Rutland wrote:
> >>>> diff --git a/arch/arm64/kernel/probes/optprobe_trampoline.S b/arch/arm64/kernel/probes/optprobe_trampoline.S
> >>>> new file mode 100644
> >>>> index 000000000000..24d713d400cd
> >>>> --- /dev/null
> >>>> +++ b/arch/arm64/kernel/probes/optprobe_trampoline.S
> >>>> @@ -0,0 +1,37 @@
> >>>> +/* SPDX-License-Identifier: GPL-2.0 */
> >>>> +/*
> >>>> + * trampoline entry and return code for optprobes.
> >>>> + */
> >>>> +
> >>>> +#include <linux/linkage.h>
> >>>> +#include <asm/asm-offsets.h>
> >>>> +#include <asm/assembler.h>
> >>>> +
> >>>> +	.global optprobe_template_entry
> >>>> +optprobe_template_entry:
> >>> Please use SYM_*(); see arch/arm64/kernel/entry-ftrace.S for examples of
> >>> how to use that for trampolines.
> >>>
> >>> This should be:
> >>>
> >>> SYM_CODE_START(optprobe_template)
> >>>
> >> Hi all,
> >>
> >> I meet a problem when I use SYM_CODE_START(optprobe_template) to replace
> >> optprobe_template_entry.
> >>
> >> If SYM_CODE_START is used, all optprobe will share one trampoline space.
> >> Under this circumstances, if user register two optprobes, trampoline
> >> will be overwritten by the newer one, and this will cause kernel panic
> >> when the old optprobe is trigger.
> > 
> > Hm, this is curious, because the template should be copied to the
> > trampoline buffer for each optprobe and be modified.
> > 
> >>
> >> Using optprobe_template_entry will not have this problem, as each
> >> optprobe has its own trampoline space (alloced in get_opinsn_slot()).
> > 
> > Yes, it is designed to do so.
> > 
> > Thank you,
> > 
> 
> Hi Masami,
> 
> Thanks for your reply. But I also met a problem when using 
> get_opinsn_slot() to alloc trampoline buffer.
> 
> As module_alloc(like x86) is used to alloc buffer, trampoline is in 
> module space, so if origin insn is in kernel space, the range between 
> origin insn and trampoline is out of 128M.
> 
> As module PLT cannot used here, I have no idea to achieve long jump in 
> this situation. Do you have any good idea?

One possible solution is to use pre-allocated trampoline space in
the text area, as same as ppc64 does.
(See arch/powerpc/kernel/optprobes_head.S, it embeds a space at "optinsn_slot")

Also, the trampoline can be minimized, since what we need is the
probed address (and the address of struct optprobe).
A single trampoline entry will do the following;

1. push lr and a victim register (here, x0)
2. load the address of optprobe to x0
3. call(br) common-optprobe asm code
4. pop lr and x0
5. jump back to (next to) the original place

Here the common-optprobe asm code does;

c1. push all registers on the stack (like save_all_base_regs) for making
  struct pt_regs.
c2. set the pt_regs address to x1.
c3. call optimized_callback()
c4. return

Since arm64 will emulate the probed instruction, we can do this.
(On the other hand, x86 needs to run the probed insn in trampoline
 code, it will do that between step 4 and 5)

The trampoline entry code is just 5 instructions (but may need an
immediate value (&optprobe) needs to be embedded).

Thank you,

> 
> Thanks,
> Qi
> 
> >>
> >> So how to reuse SYM_CODE_START  in this situation, does anyone has a
> >> good idea?
> >>
> >> Thanks,
> >> Qi
> >>> ... and note the matching end below.
> >>>
> >>>> +	sub sp, sp, #PT_REGS_SIZE
> >>>> +	save_all_base_regs
> >>>> +	/* Get parameters to optimized_callback() */
> >>>> +	ldr	x0, 1f
> >>>> +	mov	x1, sp
> >>>> +	/* Branch to optimized_callback() */
> >>>> +	.global optprobe_template_call
> >>>> +optprobe_template_call:
> >>> SYM_INNER_LABEL(optprobe_template_call, SYM_L_GLOBAL)
> >>>
> >>> ...and likewise for all the other labels.
> >>>
> >>>> +	nop
> >>>> +	restore_all_base_regs
> >>>> +	ldr lr, [sp, #S_LR]
> >>>> +        add sp, sp, #PT_REGS_SIZE
> >>>> +	.global optprobe_template_restore_orig_insn
> >>>> +optprobe_template_restore_orig_insn:
> >>>> +	nop
> >>>> +	.global optprobe_template_restore_end
> >>>> +optprobe_template_restore_end:
> >>>> +	nop
> >>>> +	.global optprobe_template_end
> >>>> +optprobe_template_end:
> >>>> +	.global optprobe_template_val
> >>>> +optprobe_template_val:
> >>>> +	1:	.long 0
> >>>> +		.long 0
> >>>> +	.global optprobe_template_max_length
> >>>> +optprobe_template_max_length:
> >>> SYM_INNER_LABEL(optprobe_template_end, SYM_L_GLOBAL)
> >>> SYM_CODE_END(optprobe_template)
> >>>
> >>> Thanks,
> >>> Mark.
> >>>
> >>>> -- 
> > 
> >

liuqi (BA) Nov. 29, 2021, 6:50 a.m. UTC | #13

On 2021/11/29 13:00, Masami Hiramatsu wrote:
> On Mon, 29 Nov 2021 09:40:30 +0800
> "liuqi (BA)" <liuqi115@huawei.com> wrote:
> 
>>
>>
>> On 2021/11/27 20:23, Masami Hiramatsu wrote:
>>> On Fri, 26 Nov 2021 18:31:06 +0800
>>> "liuqi (BA)" <liuqi115@huawei.com> wrote:
>>>
>>>>
>>>>
>>>> On 2021/8/24 18:50, Mark Rutland wrote:
>>>>>> diff --git a/arch/arm64/kernel/probes/optprobe_trampoline.S b/arch/arm64/kernel/probes/optprobe_trampoline.S
>>>>>> new file mode 100644
>>>>>> index 000000000000..24d713d400cd
>>>>>> --- /dev/null
>>>>>> +++ b/arch/arm64/kernel/probes/optprobe_trampoline.S
>>>>>> @@ -0,0 +1,37 @@
>>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>>> +/*
>>>>>> + * trampoline entry and return code for optprobes.
>>>>>> + */
>>>>>> +
>>>>>> +#include <linux/linkage.h>
>>>>>> +#include <asm/asm-offsets.h>
>>>>>> +#include <asm/assembler.h>
>>>>>> +
>>>>>> +	.global optprobe_template_entry
>>>>>> +optprobe_template_entry:
>>>>> Please use SYM_*(); see arch/arm64/kernel/entry-ftrace.S for examples of
>>>>> how to use that for trampolines.
>>>>>
>>>>> This should be:
>>>>>
>>>>> SYM_CODE_START(optprobe_template)
>>>>>
>>>> Hi all,
>>>>
>>>> I meet a problem when I use SYM_CODE_START(optprobe_template) to replace
>>>> optprobe_template_entry.
>>>>
>>>> If SYM_CODE_START is used, all optprobe will share one trampoline space.
>>>> Under this circumstances, if user register two optprobes, trampoline
>>>> will be overwritten by the newer one, and this will cause kernel panic
>>>> when the old optprobe is trigger.
>>>
>>> Hm, this is curious, because the template should be copied to the
>>> trampoline buffer for each optprobe and be modified.
>>>
>>>>
>>>> Using optprobe_template_entry will not have this problem, as each
>>>> optprobe has its own trampoline space (alloced in get_opinsn_slot()).
>>>
>>> Yes, it is designed to do so.
>>>
>>> Thank you,
>>>
>>
>> Hi Masami,
>>
>> Thanks for your reply. But I also met a problem when using
>> get_opinsn_slot() to alloc trampoline buffer.
>>
>> As module_alloc(like x86) is used to alloc buffer, trampoline is in
>> module space, so if origin insn is in kernel space, the range between
>> origin insn and trampoline is out of 128M.
>>
>> As module PLT cannot used here, I have no idea to achieve long jump in
>> this situation. Do you have any good idea?
> 
Hi Masami,

Thanks so much for your reply.

> One possible solution is to use pre-allocated trampoline space in
> the text area, as same as ppc64 does.
> (See arch/powerpc/kernel/optprobes_head.S, it embeds a space at "optinsn_slot")
> 

I find something interesting in arch/powerpc/kernel/optprobes.c, it use 
"optinsn_slot" as a public buffer, and use a static "insn_page_in_use" 
to make sure there is only one optprobe in kernel.

If we use this solution , users could only register one optprobe each 
time. This will also be a limitation for users, what's your opinion 
about this?


> Also, the trampoline can be minimized, since what we need is the
> probed address (and the address of struct optprobe).
> A single trampoline entry will do the following;
> 
> 1. push lr and a victim register (here, x0)
> 2. load the address of optprobe to x0
> 3. call(br) common-optprobe asm code
> 4. pop lr and x0
> 5. jump back to (next to) the original place
> 
> Here the common-optprobe asm code does;
> 
> c1. push all registers on the stack (like save_all_base_regs) for making
>    struct pt_regs.
> c2. set the pt_regs address to x1.
> c3. call optimized_callback()
> c4. return
> 
> Since arm64 will emulate the probed instruction, we can do this.
> (On the other hand, x86 needs to run the probed insn in trampoline
>   code, it will do that between step 4 and 5)
> 

I'll try to minimize the trampoline according to this,

Thanks,
Qi
> The trampoline entry code is just 5 instructions (but may need an
> immediate value (&optprobe) needs to be embedded).
> 
> Thank you,
> 
>>
>> Thanks,
>> Qi
>>
>>>>
>>>> So how to reuse SYM_CODE_START  in this situation, does anyone has a
>>>> good idea?
>>>>
>>>> Thanks,
>>>> Qi
>>>>> ... and note the matching end below.
>>>>>
>>>>>> +	sub sp, sp, #PT_REGS_SIZE
>>>>>> +	save_all_base_regs
>>>>>> +	/* Get parameters to optimized_callback() */
>>>>>> +	ldr	x0, 1f
>>>>>> +	mov	x1, sp
>>>>>> +	/* Branch to optimized_callback() */
>>>>>> +	.global optprobe_template_call
>>>>>> +optprobe_template_call:
>>>>> SYM_INNER_LABEL(optprobe_template_call, SYM_L_GLOBAL)
>>>>>
>>>>> ...and likewise for all the other labels.
>>>>>
>>>>>> +	nop
>>>>>> +	restore_all_base_regs
>>>>>> +	ldr lr, [sp, #S_LR]
>>>>>> +        add sp, sp, #PT_REGS_SIZE
>>>>>> +	.global optprobe_template_restore_orig_insn
>>>>>> +optprobe_template_restore_orig_insn:
>>>>>> +	nop
>>>>>> +	.global optprobe_template_restore_end
>>>>>> +optprobe_template_restore_end:
>>>>>> +	nop
>>>>>> +	.global optprobe_template_end
>>>>>> +optprobe_template_end:
>>>>>> +	.global optprobe_template_val
>>>>>> +optprobe_template_val:
>>>>>> +	1:	.long 0
>>>>>> +		.long 0
>>>>>> +	.global optprobe_template_max_length
>>>>>> +optprobe_template_max_length:
>>>>> SYM_INNER_LABEL(optprobe_template_end, SYM_L_GLOBAL)
>>>>> SYM_CODE_END(optprobe_template)
>>>>>
>>>>> Thanks,
>>>>> Mark.
>>>>>
>>>>>> -- 
>>>
>>>
> 
>

Masami Hiramatsu (Google) Nov. 29, 2021, 2:35 p.m. UTC | #14

Hi,

On Mon, 29 Nov 2021 14:50:22 +0800
"liuqi (BA)" <liuqi115@huawei.com> wrote:

> 
> 
> On 2021/11/29 13:00, Masami Hiramatsu wrote:
> > On Mon, 29 Nov 2021 09:40:30 +0800
> > "liuqi (BA)" <liuqi115@huawei.com> wrote:
> > 
> >>
> >>
> >> On 2021/11/27 20:23, Masami Hiramatsu wrote:
> >>> On Fri, 26 Nov 2021 18:31:06 +0800
> >>> "liuqi (BA)" <liuqi115@huawei.com> wrote:
> >>>
> >>>>
> >>>>
> >>>> On 2021/8/24 18:50, Mark Rutland wrote:
> >>>>>> diff --git a/arch/arm64/kernel/probes/optprobe_trampoline.S b/arch/arm64/kernel/probes/optprobe_trampoline.S
> >>>>>> new file mode 100644
> >>>>>> index 000000000000..24d713d400cd
> >>>>>> --- /dev/null
> >>>>>> +++ b/arch/arm64/kernel/probes/optprobe_trampoline.S
> >>>>>> @@ -0,0 +1,37 @@
> >>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
> >>>>>> +/*
> >>>>>> + * trampoline entry and return code for optprobes.
> >>>>>> + */
> >>>>>> +
> >>>>>> +#include <linux/linkage.h>
> >>>>>> +#include <asm/asm-offsets.h>
> >>>>>> +#include <asm/assembler.h>
> >>>>>> +
> >>>>>> +	.global optprobe_template_entry
> >>>>>> +optprobe_template_entry:
> >>>>> Please use SYM_*(); see arch/arm64/kernel/entry-ftrace.S for examples of
> >>>>> how to use that for trampolines.
> >>>>>
> >>>>> This should be:
> >>>>>
> >>>>> SYM_CODE_START(optprobe_template)
> >>>>>
> >>>> Hi all,
> >>>>
> >>>> I meet a problem when I use SYM_CODE_START(optprobe_template) to replace
> >>>> optprobe_template_entry.
> >>>>
> >>>> If SYM_CODE_START is used, all optprobe will share one trampoline space.
> >>>> Under this circumstances, if user register two optprobes, trampoline
> >>>> will be overwritten by the newer one, and this will cause kernel panic
> >>>> when the old optprobe is trigger.
> >>>
> >>> Hm, this is curious, because the template should be copied to the
> >>> trampoline buffer for each optprobe and be modified.
> >>>
> >>>>
> >>>> Using optprobe_template_entry will not have this problem, as each
> >>>> optprobe has its own trampoline space (alloced in get_opinsn_slot()).
> >>>
> >>> Yes, it is designed to do so.
> >>>
> >>> Thank you,
> >>>
> >>
> >> Hi Masami,
> >>
> >> Thanks for your reply. But I also met a problem when using
> >> get_opinsn_slot() to alloc trampoline buffer.
> >>
> >> As module_alloc(like x86) is used to alloc buffer, trampoline is in
> >> module space, so if origin insn is in kernel space, the range between
> >> origin insn and trampoline is out of 128M.
> >>
> >> As module PLT cannot used here, I have no idea to achieve long jump in
> >> this situation. Do you have any good idea?
> > 
> Hi Masami,
> 
> Thanks so much for your reply.
> 
> > One possible solution is to use pre-allocated trampoline space in
> > the text area, as same as ppc64 does.
> > (See arch/powerpc/kernel/optprobes_head.S, it embeds a space at "optinsn_slot")
> > 
> 
> I find something interesting in arch/powerpc/kernel/optprobes.c, it use 
> "optinsn_slot" as a public buffer, and use a static "insn_page_in_use" 
> to make sure there is only one optprobe in kernel.
> 
> If we use this solution , users could only register one optprobe each 
> time. This will also be a limitation for users, what's your opinion 
> about this?

No, that is just a memory area for pooling trampoline buffer. So optprobe
can allocate the buffer from that area. Please see kernel/kprobes.c:344.
optprobe allocates "insn_slot" from kprobe_optinsn_slots, which uses
alloc_optinsn_page() to allocate the pool of slots.

Thank you,

> 
> 
> > Also, the trampoline can be minimized, since what we need is the
> > probed address (and the address of struct optprobe).
> > A single trampoline entry will do the following;
> > 
> > 1. push lr and a victim register (here, x0)
> > 2. load the address of optprobe to x0
> > 3. call(br) common-optprobe asm code
> > 4. pop lr and x0
> > 5. jump back to (next to) the original place
> > 
> > Here the common-optprobe asm code does;
> > 
> > c1. push all registers on the stack (like save_all_base_regs) for making
> >    struct pt_regs.
> > c2. set the pt_regs address to x1.
> > c3. call optimized_callback()
> > c4. return
> > 
> > Since arm64 will emulate the probed instruction, we can do this.
> > (On the other hand, x86 needs to run the probed insn in trampoline
> >   code, it will do that between step 4 and 5)
> > 
> 
> I'll try to minimize the trampoline according to this,
> 
> Thanks,
> Qi
> > The trampoline entry code is just 5 instructions (but may need an
> > immediate value (&optprobe) needs to be embedded).
> > 
> > Thank you,
> > 
> >>
> >> Thanks,
> >> Qi
> >>
> >>>>
> >>>> So how to reuse SYM_CODE_START  in this situation, does anyone has a
> >>>> good idea?
> >>>>
> >>>> Thanks,
> >>>> Qi
> >>>>> ... and note the matching end below.
> >>>>>
> >>>>>> +	sub sp, sp, #PT_REGS_SIZE
> >>>>>> +	save_all_base_regs
> >>>>>> +	/* Get parameters to optimized_callback() */
> >>>>>> +	ldr	x0, 1f
> >>>>>> +	mov	x1, sp
> >>>>>> +	/* Branch to optimized_callback() */
> >>>>>> +	.global optprobe_template_call
> >>>>>> +optprobe_template_call:
> >>>>> SYM_INNER_LABEL(optprobe_template_call, SYM_L_GLOBAL)
> >>>>>
> >>>>> ...and likewise for all the other labels.
> >>>>>
> >>>>>> +	nop
> >>>>>> +	restore_all_base_regs
> >>>>>> +	ldr lr, [sp, #S_LR]
> >>>>>> +        add sp, sp, #PT_REGS_SIZE
> >>>>>> +	.global optprobe_template_restore_orig_insn
> >>>>>> +optprobe_template_restore_orig_insn:
> >>>>>> +	nop
> >>>>>> +	.global optprobe_template_restore_end
> >>>>>> +optprobe_template_restore_end:
> >>>>>> +	nop
> >>>>>> +	.global optprobe_template_end
> >>>>>> +optprobe_template_end:
> >>>>>> +	.global optprobe_template_val
> >>>>>> +optprobe_template_val:
> >>>>>> +	1:	.long 0
> >>>>>> +		.long 0
> >>>>>> +	.global optprobe_template_max_length
> >>>>>> +optprobe_template_max_length:
> >>>>> SYM_INNER_LABEL(optprobe_template_end, SYM_L_GLOBAL)
> >>>>> SYM_CODE_END(optprobe_template)
> >>>>>
> >>>>> Thanks,
> >>>>> Mark.
> >>>>>
> >>>>>> -- 
> >>>
> >>>
> > 
> >

liuqi (BA) Nov. 30, 2021, 6:48 a.m. UTC | #15

On 2021/11/29 22:35, Masami Hiramatsu wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I meet a problem when I use SYM_CODE_START(optprobe_template) to replace
>>>>>> optprobe_template_entry.
>>>>>>
>>>>>> If SYM_CODE_START is used, all optprobe will share one trampoline space.
>>>>>> Under this circumstances, if user register two optprobes, trampoline
>>>>>> will be overwritten by the newer one, and this will cause kernel panic
>>>>>> when the old optprobe is trigger.
>>>>> Hm, this is curious, because the template should be copied to the
>>>>> trampoline buffer for each optprobe and be modified.
>>>>>
>>>>>> Using optprobe_template_entry will not have this problem, as each
>>>>>> optprobe has its own trampoline space (alloced in get_opinsn_slot()).
>>>>> Yes, it is designed to do so.
>>>>>
>>>>> Thank you,
>>>>>
>>>> Hi Masami,
>>>>
>>>> Thanks for your reply. But I also met a problem when using
>>>> get_opinsn_slot() to alloc trampoline buffer.
>>>>
>>>> As module_alloc(like x86) is used to alloc buffer, trampoline is in
>>>> module space, so if origin insn is in kernel space, the range between
>>>> origin insn and trampoline is out of 128M.
>>>>
>>>> As module PLT cannot used here, I have no idea to achieve long jump in
>>>> this situation. Do you have any good idea?
>> Hi Masami,
>>
>> Thanks so much for your reply.
>>
>>> One possible solution is to use pre-allocated trampoline space in
>>> the text area, as same as ppc64 does.
>>> (See arch/powerpc/kernel/optprobes_head.S, it embeds a space at "optinsn_slot")
>>>
>> I find something interesting in arch/powerpc/kernel/optprobes.c, it use
>> "optinsn_slot" as a public buffer, and use a static "insn_page_in_use"
>> to make sure there is only one optprobe in kernel.
>>
>> If we use this solution , users could only register one optprobe each
>> time. This will also be a limitation for users, what's your opinion
>> about this?
> No, that is just a memory area for pooling trampoline buffer. So optprobe
> can allocate the buffer from that area. Please see kernel/kprobes.c:344.
> optprobe allocates "insn_slot" from kprobe_optinsn_slots, which uses
> alloc_optinsn_page() to allocate the pool of slots.
> 
> Thank you,
> 

Hi,

Thanks for your reply, I test alloc_optinsn_page() and it does work well 
to alloc the pool of slots.

But when I tried to use module PLT, something seems wrong here.
Arm64 Module PLT in mod->arch.ftrace_trampolines is set in 
module_finalize, after that, mod->arch.ftrace_trampolines seems to be a 
read-only memory. But in arch_optimize_kprobes() we need to modify the 
destination of PLT (as each optprobe has its own trampoline buffer), if 
so, we cannot get rid of the 128M branch limit :(

Thanks,
Qi

Masami Hiramatsu (Google) Dec. 1, 2021, 1:50 a.m. UTC | #16

On Tue, 30 Nov 2021 14:48:06 +0800
"liuqi (BA)" <liuqi115@huawei.com> wrote:

> 
> 
> On 2021/11/29 22:35, Masami Hiramatsu wrote:
> >>>>>> Hi all,
> >>>>>>
> >>>>>> I meet a problem when I use SYM_CODE_START(optprobe_template) to replace
> >>>>>> optprobe_template_entry.
> >>>>>>
> >>>>>> If SYM_CODE_START is used, all optprobe will share one trampoline space.
> >>>>>> Under this circumstances, if user register two optprobes, trampoline
> >>>>>> will be overwritten by the newer one, and this will cause kernel panic
> >>>>>> when the old optprobe is trigger.
> >>>>> Hm, this is curious, because the template should be copied to the
> >>>>> trampoline buffer for each optprobe and be modified.
> >>>>>
> >>>>>> Using optprobe_template_entry will not have this problem, as each
> >>>>>> optprobe has its own trampoline space (alloced in get_opinsn_slot()).
> >>>>> Yes, it is designed to do so.
> >>>>>
> >>>>> Thank you,
> >>>>>
> >>>> Hi Masami,
> >>>>
> >>>> Thanks for your reply. But I also met a problem when using
> >>>> get_opinsn_slot() to alloc trampoline buffer.
> >>>>
> >>>> As module_alloc(like x86) is used to alloc buffer, trampoline is in
> >>>> module space, so if origin insn is in kernel space, the range between
> >>>> origin insn and trampoline is out of 128M.
> >>>>
> >>>> As module PLT cannot used here, I have no idea to achieve long jump in
> >>>> this situation. Do you have any good idea?
> >> Hi Masami,
> >>
> >> Thanks so much for your reply.
> >>
> >>> One possible solution is to use pre-allocated trampoline space in
> >>> the text area, as same as ppc64 does.
> >>> (See arch/powerpc/kernel/optprobes_head.S, it embeds a space at "optinsn_slot")
> >>>
> >> I find something interesting in arch/powerpc/kernel/optprobes.c, it use
> >> "optinsn_slot" as a public buffer, and use a static "insn_page_in_use"
> >> to make sure there is only one optprobe in kernel.
> >>
> >> If we use this solution , users could only register one optprobe each
> >> time. This will also be a limitation for users, what's your opinion
> >> about this?
> > No, that is just a memory area for pooling trampoline buffer. So optprobe
> > can allocate the buffer from that area. Please see kernel/kprobes.c:344.
> > optprobe allocates "insn_slot" from kprobe_optinsn_slots, which uses
> > alloc_optinsn_page() to allocate the pool of slots.
> > 
> > Thank you,
> > 
> 
> Hi,
> 
> Thanks for your reply, I test alloc_optinsn_page() and it does work well 
> to alloc the pool of slots.
> 
> But when I tried to use module PLT, something seems wrong here.
> Arm64 Module PLT in mod->arch.ftrace_trampolines is set in 
> module_finalize, after that, mod->arch.ftrace_trampolines seems to be a 
> read-only memory. But in arch_optimize_kprobes() we need to modify the 
> destination of PLT (as each optprobe has its own trampoline buffer), if 
> so, we cannot get rid of the 128M branch limit :(

Hmm, OK, we need to introduce trampoline buffer allocation pool for modules
for such arch. But that should be another story. I think you should start
from the core-kernel. At this moment, if the probe address is in the module,
please return -ERANGE from arch_prepare_optimized_kprobe().
Module support must be done in the next step (series), since that will involve
the kprobes generic code change.

Thank you,

> 
> Thanks,
> Qi

liuqi (BA) Dec. 1, 2021, 2:55 a.m. UTC | #17

On 2021/12/1 9:50, Masami Hiramatsu wrote:
> On Tue, 30 Nov 2021 14:48:06 +0800
> "liuqi (BA)" <liuqi115@huawei.com> wrote:
> 
>>
>>
>> On 2021/11/29 22:35, Masami Hiramatsu wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I meet a problem when I use SYM_CODE_START(optprobe_template) to replace
>>>>>>>> optprobe_template_entry.
>>>>>>>>
>>>>>>>> If SYM_CODE_START is used, all optprobe will share one trampoline space.
>>>>>>>> Under this circumstances, if user register two optprobes, trampoline
>>>>>>>> will be overwritten by the newer one, and this will cause kernel panic
>>>>>>>> when the old optprobe is trigger.
>>>>>>> Hm, this is curious, because the template should be copied to the
>>>>>>> trampoline buffer for each optprobe and be modified.
>>>>>>>
>>>>>>>> Using optprobe_template_entry will not have this problem, as each
>>>>>>>> optprobe has its own trampoline space (alloced in get_opinsn_slot()).
>>>>>>> Yes, it is designed to do so.
>>>>>>>
>>>>>>> Thank you,
>>>>>>>
>>>>>> Hi Masami,
>>>>>>
>>>>>> Thanks for your reply. But I also met a problem when using
>>>>>> get_opinsn_slot() to alloc trampoline buffer.
>>>>>>
>>>>>> As module_alloc(like x86) is used to alloc buffer, trampoline is in
>>>>>> module space, so if origin insn is in kernel space, the range between
>>>>>> origin insn and trampoline is out of 128M.
>>>>>>
>>>>>> As module PLT cannot used here, I have no idea to achieve long jump in
>>>>>> this situation. Do you have any good idea?
>>>> Hi Masami,
>>>>
>>>> Thanks so much for your reply.
>>>>
>>>>> One possible solution is to use pre-allocated trampoline space in
>>>>> the text area, as same as ppc64 does.
>>>>> (See arch/powerpc/kernel/optprobes_head.S, it embeds a space at "optinsn_slot")
>>>>>
>>>> I find something interesting in arch/powerpc/kernel/optprobes.c, it use
>>>> "optinsn_slot" as a public buffer, and use a static "insn_page_in_use"
>>>> to make sure there is only one optprobe in kernel.
>>>>
>>>> If we use this solution , users could only register one optprobe each
>>>> time. This will also be a limitation for users, what's your opinion
>>>> about this?
>>> No, that is just a memory area for pooling trampoline buffer. So optprobe
>>> can allocate the buffer from that area. Please see kernel/kprobes.c:344.
>>> optprobe allocates "insn_slot" from kprobe_optinsn_slots, which uses
>>> alloc_optinsn_page() to allocate the pool of slots.
>>>
>>> Thank you,
>>>
>>
>> Hi,
>>
>> Thanks for your reply, I test alloc_optinsn_page() and it does work well
>> to alloc the pool of slots.
>>
>> But when I tried to use module PLT, something seems wrong here.
>> Arm64 Module PLT in mod->arch.ftrace_trampolines is set in
>> module_finalize, after that, mod->arch.ftrace_trampolines seems to be a
>> read-only memory. But in arch_optimize_kprobes() we need to modify the
>> destination of PLT (as each optprobe has its own trampoline buffer), if
>> so, we cannot get rid of the 128M branch limit :(
> 
Hi Masami,

> Hmm, OK, we need to introduce trampoline buffer allocation pool for modules
> for such arch. 

Yes, but will this expand the size of Image?

But that should be another story. I think you should start
> from the core-kernel. At this moment, if the probe address is in the module,
> please return -ERANGE from arch_prepare_optimized_kprobe().
> Module support must be done in the next step (series), since that will involve
> the kprobes generic code change.
> 

got it, I'll send a new patchset which support core-kernel optprobe 
soon. thanks a lot for your reply.

Qi
> Thank you,
> 
>>
>> Thanks,
>> Qi
> 
>

[v4,2/2] arm64: kprobe: Enable OPTPROBE for arm64

Commit Message

Comments

Patch