Message ID | 20240910060407.1427716-1-liaochang1@huawei.com (mailing list archive)
---|---
State | Not Applicable
Series | arm64: uprobes: Simulate STP for pushing fp/lr into user stack

Context | Check | Description
---|---|---
netdev/tree_selection | success | Not a local patch
On Mon, Sep 9, 2024 at 11:14 PM Liao Chang <liaochang1@huawei.com> wrote:
>
> This patch is the second part of a series to improve the selftest bench
> of uprobe/uretprobe [0]. The lack of simulation for 'stp fp, lr, [sp, #imm]'
> significantly impacts uprobe/uretprobe performance at function entry in
> most use cases. The profiling results below reveal that an STP executed
> in the XOL slot, trapping back into the kernel, reduces redis RPS and
> increases the runtime of string grep noticeably.
>
> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>
> Redis GET (higher is better)
> ----------------------------
> No uprobe: 49149.71 RPS
> Single-stepped STP: 46750.82 RPS
> Emulated STP: 48981.19 RPS
>
> Redis SET (higher is better)
> ----------------------------
> No uprobe: 49761.14 RPS
> Single-stepped STP: 45255.01 RPS
> Emulated STP: 48619.21 RPS
>
> Grep (lower is better)
> ----------------------
> No uprobe: 2.165s
> Single-stepped STP: 15.314s
> Emulated STP: 2.216s
>
> Additionally, profiling the entry instruction of all leaf and non-leaf
> functions shows that the ratio of 'stp fp, lr, [sp, #imm]' is larger
> than 50%. So simulating the STP at function entry is a viable option
> for uprobe.
>
> In the first version [1], a uaccess routine was used to simulate the
> STP that pushes fp/lr onto the stack, using two STTR instructions for
> the memory stores. But as Mark pointed out, this approach can't
> simulate the correct single-copy atomicity and ordering properties of
> STP, especially when it interacts with MTE, POE, etc. So this patch
> uses a more complex

Do all those effects matter if the thread is stopped after the
breakpoint? This is pushing to the stack, right? Other threads are not
supposed to access that memory anyway (not the well-defined ones, at
least, I suppose). Do we really need all these complications for
uprobes?

We use a similar approach on x86-64, see emulate_push_stack() in
arch/x86/kernel/uprobes.c, and it works great in practice (and has for
years by now). It would be nice to keep things simple, knowing that
this is specifically for this rather well-defined and restricted
uprobe/uretprobe use case.

Sorry, I can't help reviewing this, but I have a hunch that we might be
over-killing it with this approach, no?

> and inefficient approach that acquires the user stack pages, maps them
> into the kernel address space, and lets the kernel use STP to push
> fp/lr directly into the stack pages.
>
> xol-stp
> -------
> uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu)
> uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu)
> uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu)
> uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu)
> uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu)
> uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu)
>
> simulated-stp
> -------------
> uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu)
> uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu)
> uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu)
> uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu)
> uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu)
> uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu)
>
> The profiling results, based on the upstream kernel with the spinlock
> optimization patches [2], reveal that simulating STP increases
> uprobe-push throughput by ~30% (from 0.868M/s/cpu to 1.128M/s/cpu) and
> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
>
> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
>
> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> ---
>  arch/arm64/include/asm/insn.h            |  1 +
>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
>  arch/arm64/lib/insn.c                    |  5 ++
>  7 files changed, 134 insertions(+)

[...]
On 2024/9/11 4:54, Andrii Nakryiko wrote:
> On Mon, Sep 9, 2024 at 11:14 PM Liao Chang <liaochang1@huawei.com> wrote:
>>
>> This patch is the second part of a series to improve the selftest bench
>> of uprobe/uretprobe [0]. The lack of simulation for 'stp fp, lr, [sp, #imm]'
>> significantly impacts uprobe/uretprobe performance at function entry in
>> most use cases. The profiling results below reveal that an STP executed
>> in the XOL slot, trapping back into the kernel, reduces redis RPS and
>> increases the runtime of string grep noticeably.
>>
>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>>
>> Redis GET (higher is better)
>> ----------------------------
>> No uprobe: 49149.71 RPS
>> Single-stepped STP: 46750.82 RPS
>> Emulated STP: 48981.19 RPS
>>
>> Redis SET (higher is better)
>> ----------------------------
>> No uprobe: 49761.14 RPS
>> Single-stepped STP: 45255.01 RPS
>> Emulated STP: 48619.21 RPS
>>
>> Grep (lower is better)
>> ----------------------
>> No uprobe: 2.165s
>> Single-stepped STP: 15.314s
>> Emulated STP: 2.216s
>>
>> Additionally, profiling the entry instruction of all leaf and non-leaf
>> functions shows that the ratio of 'stp fp, lr, [sp, #imm]' is larger
>> than 50%. So simulating the STP at function entry is a viable option
>> for uprobe.
>>
>> In the first version [1], a uaccess routine was used to simulate the
>> STP that pushes fp/lr onto the stack, using two STTR instructions for
>> the memory stores. But as Mark pointed out, this approach can't
>> simulate the correct single-copy atomicity and ordering properties of
>> STP, especially when it interacts with MTE, POE, etc. So this patch
>> uses a more complex
>
> Do all those effects matter if the thread is stopped after the
> breakpoint? This is pushing to the stack, right? Other threads are not
> supposed to access that memory anyway (not the well-defined ones, at
> least, I suppose). Do we really need all these complications for

I have raised the same question in my reply to Mark.
Since the STP simulation focuses on uprobe/uretprobe at function entry,
which pushes two registers onto the *stack*, I believe it might not
require strict alignment with the exact properties of STP. However, as
you know, Mark stands by his comments about STP simulation, which is
why I sent this patch out. Although the gain is not as good as with the
uaccess version, it still offers better results than the current XOL
code.

> uprobes? We use a similar approach on x86-64, see emulate_push_stack()
> in arch/x86/kernel/uprobes.c, and it works great in practice (and has

Yes, I've noticed the x86 routine. The CPU-specific difference lies in
Arm64 CPUs with PAN enabled: for security reasons, STP (storing a pair
of registers to memory) cannot be used when accessing a userspace
address. This forces the kernel to use two STTR instructions (each
storing a single register to unprivileged memory), which can't meet the
atomicity and ordering properties of the original STP in userspace. In
the future, if Arm64 adds an instruction for storing a pair of
registers to unprivileged memory, it ought to replace this inefficient
approach.

> for years by now). It would be nice to keep things simple, knowing
> that this is specifically for this rather well-defined and restricted
> uprobe/uretprobe use case.
>
> Sorry, I can't help reviewing this, but I have a hunch that we might
> be over-killing it with this approach, no?

This approach indeed fails to obtain the maximum benefit from
simulation.

>
>> and inefficient approach that acquires the user stack pages, maps them
>> into the kernel address space, and lets the kernel use STP to push
>> fp/lr directly into the stack pages.
>> >> xol-stp >> ------- >> uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu) >> uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu) >> uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu) >> uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu) >> uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu) >> uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu) >> >> simulated-stp >> ------------- >> uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu) >> uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu) >> uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu) >> uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu) >> uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu) >> uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu) >> >> The profiling results based on the upstream kernel with spinlock >> optimization patches [2] reveals the simulation of STP increase the >> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and >> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu). >> >> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/ >> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/ >> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/ >> >> Signed-off-by: Liao Chang <liaochang1@huawei.com> >> --- >> arch/arm64/include/asm/insn.h | 1 + >> arch/arm64/kernel/probes/decode-insn.c | 16 +++++ >> arch/arm64/kernel/probes/decode-insn.h | 1 + >> arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++ >> arch/arm64/kernel/probes/simulate-insn.h | 1 + >> arch/arm64/kernel/probes/uprobes.c | 21 ++++++ >> arch/arm64/lib/insn.c | 5 ++ >> 7 files changed, 134 insertions(+) >> > > [...] > >
On Tue, Sep 10, 2024 at 8:07 PM Liao, Chang <liaochang1@huawei.com> wrote: > > > > 在 2024/9/11 4:54, Andrii Nakryiko 写道: > > On Mon, Sep 9, 2024 at 11:14 PM Liao Chang <liaochang1@huawei.com> wrote: > >> > >> This patch is the second part of a series to improve the selftest bench > >> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]' > >> significantly impact uprobe/uretprobe performance at function entry in > >> most user cases. Profiling results below reveals the STP that executes > >> in the xol slot and trap back to kernel, reduce redis RPS and increase > >> the time of string grep obviously. > >> > >> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz. > >> > >> Redis GET (higher is better) > >> ---------------------------- > >> No uprobe: 49149.71 RPS > >> Single-stepped STP: 46750.82 RPS > >> Emulated STP: 48981.19 RPS > >> > >> Redis SET (larger is better) > >> ---------------------------- > >> No uprobe: 49761.14 RPS > >> Single-stepped STP: 45255.01 RPS > >> Emulated stp: 48619.21 RPS > >> > >> Grep (lower is better) > >> ---------------------- > >> No uprobe: 2.165s > >> Single-stepped STP: 15.314s > >> Emualted STP: 2.216s > >> > >> Additionally, a profiling of the entry instruction for all leaf and > >> non-leaf function, the ratio of 'stp fp, lr, [sp, #imm]' is larger than > >> 50%. So simulting the STP on the function entry is a more viable option > >> for uprobe. > >> > >> In the first version [1], it used a uaccess routine to simulate the STP > >> that push fp/lr into stack, which use double STTR instructions for > >> memory store. But as Mark pointed out, this approach can't simulate the > >> correct single-atomicity and ordering properties of STP, especiallly > >> when it interacts with MTE, POE, etc. So this patch uses a more complex > > > > Does all those effects matter if the thread is stopped after > > breakpoint? This is pushing to stack, right? 
Other threads are not > > supposed to access that memory anyways (not the well-defined ones, at > > least, I suppose). Do we really need all these complications for > > I have raised the same question in my reply to Mark. Since the STP > simulation focuses on the uprobe/uretprob at function entry, which > push two registers onto *stack*. I believe it might not require strict > alignment with the exact property of STP. However, as you know, Mark Agreed. > stand by his comments about STP simulation, which is why I send this > patch out. Although the gain is not good as the uaccess version, it > still offer some better result than the current XOL code. > > > uprobes? We use a similar approach in x86-64, see emulate_push_stack() > > in arch/x86/kernel/uprobes.c and it works great in practice (and has > > Yes, I've noticed the X86 routine. Actually. The CPU-specific difference > lies in Arm64 CPUs with PAN enabled. Due to security reasons, it doesn't > support STP (storing pairs of registers to memory) when accessing userpsace > address. This leads to kernel has to use STTR instructions (storing single > register to unprivileged memory) twice, which can't meet the atomicity > and ordering properties of original STP at userspace. In future, if Arm64 > would add some instruction for storing pairs of registers to unprivileged > memory, it ought to replace this inefficient approach. > > > been for years by now). Would be nice to keep things simple knowing > > that this is specifically for this rather well-defined and restricted > > uprobe/uretprobe use case. > > > > Sorry, I can't help reviewing this, but I have a hunch that we might > > be over-killing it with this approach, no? > > This approach fails to obtain the max benefit from simuation indeed. > Yes, the performance hit is very large for seemingly no good reason, which is why I'm asking. And all this performance concern is not just some pure microbenchmarking. 
We do have use cases with millions of uprobe calls per second. E.g., tracing every single Python function call, then rolling a dice (in BPF program), and sampling some portion of them (more heavy-weight logic). As such, it's critical to be able to trigger uprobe as fast as possible, then most of the time we do nothing. So any overheads like this one are very noticeable and limit possible applications. > > > > > >> and inefficient approach that acquires user stack pages, maps them to > >> kernel address space, and allows kernel to use STP directly push fp/lr > >> into the stack pages. > >> > >> xol-stp > >> ------- > >> uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu) > >> uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu) > >> uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu) > >> uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu) > >> uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu) > >> uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu) > >> > >> simulated-stp > >> ------------- > >> uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu) > >> uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu) > >> uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu) > >> uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu) > >> uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu) > >> uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu) > >> > >> The profiling results based on the upstream kernel with spinlock > >> optimization patches [2] reveals the simulation of STP increase the > >> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and > >> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu). 
> >> > >> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/ > >> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/ > >> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/ > >> > >> Signed-off-by: Liao Chang <liaochang1@huawei.com> > >> --- > >> arch/arm64/include/asm/insn.h | 1 + > >> arch/arm64/kernel/probes/decode-insn.c | 16 +++++ > >> arch/arm64/kernel/probes/decode-insn.h | 1 + > >> arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++ > >> arch/arm64/kernel/probes/simulate-insn.h | 1 + > >> arch/arm64/kernel/probes/uprobes.c | 21 ++++++ > >> arch/arm64/lib/insn.c | 5 ++ > >> 7 files changed, 134 insertions(+) > >> > > > > [...] > > > > > > -- > BR > Liao, Chang
On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
> This patch is the second part of a series to improve the selftest bench
> of uprobe/uretprobe [0]. The lack of simulation for 'stp fp, lr, [sp, #imm]'
> significantly impacts uprobe/uretprobe performance at function entry in
> most use cases. The profiling results below reveal that an STP executed
> in the XOL slot, trapping back into the kernel, reduces redis RPS and
> increases the runtime of string grep noticeably.
>
> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.
>
> Redis GET (higher is better)
> ----------------------------
> No uprobe: 49149.71 RPS
> Single-stepped STP: 46750.82 RPS
> Emulated STP: 48981.19 RPS
>
> Redis SET (higher is better)
> ----------------------------
> No uprobe: 49761.14 RPS
> Single-stepped STP: 45255.01 RPS
> Emulated STP: 48619.21 RPS
>
> Grep (lower is better)
> ----------------------
> No uprobe: 2.165s
> Single-stepped STP: 15.314s
> Emulated STP: 2.216s

The results for grep are concerning.

In theory, the overhead for stepping should be roughly double the
overhead for emulating, assuming the exception entry and exception
return are the dominant cost. The cost of the step itself should be
trivial.

Those results show emulating adds 0.051s (for a ~2.4% overhead), while
stepping adds 13.149s (for a ~607% overhead), meaning stepping is ~250x
more expensive.

Was this tested bare-metal, or in a VM?

AFAICT either:

* Single-stepping is unexpectedly expensive.

  Historically we had performance issues with hypervisor trapping of
  debug features, and there are things we might be able to improve in
  the hypervisor and kernel, which would improve stepping of *all*
  instructions.

  If stepping is the big problem, we could move uprobes over to a BRK
  rather than a single-step. That would require updating and fixing the
  logic that decides which instructions are steppable, but that's
  necessary anyway given it has extant soundness issues.

* XOL management is absurdly expensive.

  Does uprobes keep the XOL slot around (like kprobes does), or does it
  create the slot afresh for each trap?

  If it's trying to create a slot afresh for each trap, there are
  several opportunities for improvement, e.g. keep the slot around for
  as long as the uprobe exists, or pre-allocate shared slots for common
  instructions and use those.

Mark.

>
> Additionally, profiling the entry instruction of all leaf and non-leaf
> functions shows that the ratio of 'stp fp, lr, [sp, #imm]' is larger
> than 50%. So simulating the STP at function entry is a viable option
> for uprobe.
>
> In the first version [1], a uaccess routine was used to simulate the
> STP that pushes fp/lr onto the stack, using two STTR instructions for
> the memory stores. But as Mark pointed out, this approach can't
> simulate the correct single-copy atomicity and ordering properties of
> STP, especially when it interacts with MTE, POE, etc. So this patch
> uses a more complex
> and inefficient approach that acquires the user stack pages, maps them
> into the kernel address space, and lets the kernel use STP to push
> fp/lr directly into the stack pages.
> > xol-stp > ------- > uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu) > uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu) > uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu) > uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu) > uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu) > uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu) > > simulated-stp > ------------- > uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu) > uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu) > uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu) > uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu) > uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu) > uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu) > > The profiling results based on the upstream kernel with spinlock > optimization patches [2] reveals the simulation of STP increase the > uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and > uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu). 
> > [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/ > [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/ > [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/ > > Signed-off-by: Liao Chang <liaochang1@huawei.com> > --- > arch/arm64/include/asm/insn.h | 1 + > arch/arm64/kernel/probes/decode-insn.c | 16 +++++ > arch/arm64/kernel/probes/decode-insn.h | 1 + > arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++ > arch/arm64/kernel/probes/simulate-insn.h | 1 + > arch/arm64/kernel/probes/uprobes.c | 21 ++++++ > arch/arm64/lib/insn.c | 5 ++ > 7 files changed, 134 insertions(+) > > diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h > index dd530d5c3d67..74e25debfa75 100644 > --- a/arch/arm64/include/asm/insn.h > +++ b/arch/arm64/include/asm/insn.h > @@ -561,6 +561,7 @@ u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, > u32 insn, u64 imm); > u32 aarch64_insn_decode_register(enum aarch64_insn_register_type type, > u32 insn); > +u32 aarch64_insn_decode_ldst_size(u32 insn); > u32 aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr, > enum aarch64_insn_branch_type type); > u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr, > diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c > index be54539e309e..847a7a61ff6d 100644 > --- a/arch/arm64/kernel/probes/decode-insn.c > +++ b/arch/arm64/kernel/probes/decode-insn.c > @@ -67,6 +67,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn) > return true; > } > > +bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn) > +{ > + /* > + * The 1st instruction on function entry often follows the > + * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr > + * into stack. 
> + */ > + u32 opc = aarch64_insn_decode_ldst_size(insn); > + u32 rt2 = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT2, insn); > + u32 rn = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RN, insn); > + u32 rt = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, insn); > + > + return aarch64_insn_is_stp_pre(insn) && > + (opc == 2) && (rt2 == 30) && (rn == 31) && (rt == 29); > +} > + > /* Return: > * INSN_REJECTED If instruction is one not allowed to kprobe, > * INSN_GOOD If instruction is supported and uses instruction slot, > diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h > index 8b758c5a2062..033ccab73da6 100644 > --- a/arch/arm64/kernel/probes/decode-insn.h > +++ b/arch/arm64/kernel/probes/decode-insn.h > @@ -29,5 +29,6 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi); > #endif > enum probe_insn __kprobes > arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi); > +bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn); > > #endif /* _ARM_KERNEL_KPROBES_ARM64_H */ > diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c > index 5e4f887a074c..3906851c07b2 100644 > --- a/arch/arm64/kernel/probes/simulate-insn.c > +++ b/arch/arm64/kernel/probes/simulate-insn.c > @@ -8,6 +8,9 @@ > #include <linux/bitops.h> > #include <linux/kernel.h> > #include <linux/kprobes.h> > +#include <linux/highmem.h> > +#include <linux/vmalloc.h> > +#include <linux/mm.h> > > #include <asm/ptrace.h> > #include <asm/traps.h> > @@ -211,3 +214,89 @@ simulate_nop(u32 opcode, long addr, struct pt_regs *regs) > */ > arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE); > } > + > +static inline > +bool stack_align_check(unsigned long sp) > +{ > + return (IS_ALIGNED(sp, 16) || > + !(read_sysreg(sctlr_el1) & SCTLR_EL1_SA0_MASK)); > +} > + > +static inline > +void put_user_stack_pages(struct page **pages, int nr_pages) > +{ > + int i; > + > + for 
(i = 0; i < nr_pages; i++) > + put_page(pages[i]); > +} > + > +static inline > +int get_user_stack_pages(long start, long end, struct page **pages) > +{ > + int ret; > + int nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1; > + > + ret = get_user_pages_fast(start, nr_pages, > + FOLL_WRITE | FOLL_FORCE, pages); > + if (unlikely(ret != nr_pages)) { > + if (ret > 0) > + put_user_stack_pages(pages, ret); > + return 0; > + } > + > + return nr_pages; > +} > + > +static inline > +void *map_user_stack_pages(struct page **pages, int nr_pages) > +{ > + if (likely(nr_pages == 1)) > + return kmap_local_page(pages[0]); > + else > + return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); > +} > + > +static inline > +void unmap_user_stack_pages(void *kaddr, int nr_pages) > +{ > + if (likely(nr_pages == 1)) > + kunmap_local(kaddr); > + else > + vunmap(kaddr); > +} > + > +void __kprobes > +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs) > +{ > + long imm7; > + long new_sp; > + int nr_pages; > + void *kaddr, *dst; > + struct page *pages[2] = { NULL }; > + > + imm7 = aarch64_insn_decode_immediate(AARCH64_INSN_IMM_7, opcode); > + new_sp = regs->sp + (sign_extend64(imm7, 6) << 3); > + if (!stack_align_check(new_sp)) { > + force_sig(SIGSEGV); > + goto done; > + } > + > + nr_pages = get_user_stack_pages(new_sp, regs->sp, pages); > + if (!nr_pages) { > + force_sig(SIGSEGV); > + goto done; > + } > + > + kaddr = map_user_stack_pages(pages, nr_pages); > + dst = kaddr + (new_sp & ~PAGE_MASK); > + asm volatile("stp %0, %1, [%2]" > + : : "r"(regs->regs[29]), "r"(regs->regs[30]), "r"(dst)); > + > + unmap_user_stack_pages(kaddr, nr_pages); > + put_user_stack_pages(pages, nr_pages); > + > +done: > + regs->sp = new_sp; > + arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE); > +} > diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h > index efb2803ec943..733a47ffa2e5 100644 > --- a/arch/arm64/kernel/probes/simulate-insn.h 
> +++ b/arch/arm64/kernel/probes/simulate-insn.h > @@ -17,5 +17,6 @@ void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs); > void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs); > void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs); > void simulate_nop(u32 opcode, long addr, struct pt_regs *regs); > +void simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs); > > #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */ > diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c > index d49aef2657cd..c70862314fde 100644 > --- a/arch/arm64/kernel/probes/uprobes.c > +++ b/arch/arm64/kernel/probes/uprobes.c > @@ -8,6 +8,7 @@ > #include <asm/cacheflush.h> > > #include "decode-insn.h" > +#include "simulate-insn.h" > > #define UPROBE_INV_FAULT_CODE UINT_MAX > > @@ -31,6 +32,21 @@ unsigned long uprobe_get_swbp_addr(struct pt_regs *regs) > return instruction_pointer(regs); > } > > +static enum probe_insn > +arm_uprobe_decode_special_insn(probe_opcode_t insn, struct arch_probe_insn *api) > +{ > + /* > + * When uprobe interact with VMSA features, such as MTE, POE, etc, it > + * give up the simulation of memory access related instructions. 
> + */ > + if (!system_supports_mte() && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) { > + api->handler = simulate_stp_fp_lr_sp_64b; > + return INSN_GOOD_NO_SLOT; > + } > + > + return INSN_REJECTED; > +} > + > int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, > unsigned long addr) > { > @@ -44,6 +60,11 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, > > insn = *(probe_opcode_t *)(&auprobe->insn[0]); > > + if (arm_uprobe_decode_special_insn(insn, &auprobe->api)) { > + auprobe->simulate = true; > + return 0; > + } > + > switch (arm_probe_decode_insn(insn, &auprobe->api)) { > case INSN_REJECTED: > return -EINVAL; > diff --git a/arch/arm64/lib/insn.c b/arch/arm64/lib/insn.c > index b008a9b46a7f..0635219d2196 100644 > --- a/arch/arm64/lib/insn.c > +++ b/arch/arm64/lib/insn.c > @@ -238,6 +238,11 @@ static u32 aarch64_insn_encode_ldst_size(enum aarch64_insn_size_type type, > return insn; > } > > +u32 aarch64_insn_decode_ldst_size(u32 insn) > +{ > + return (insn & GENMASK(31, 30)) >> 30; > +} > + > static inline long label_imm_common(unsigned long pc, unsigned long addr, > long range) > { > -- > 2.34.1 >
On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@arm.com> wrote: > > On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote: > > This patch is the second part of a series to improve the selftest bench > > of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]' > > significantly impact uprobe/uretprobe performance at function entry in > > most user cases. Profiling results below reveals the STP that executes > > in the xol slot and trap back to kernel, reduce redis RPS and increase > > the time of string grep obviously. > > > > On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz. > > > > Redis GET (higher is better) > > ---------------------------- > > No uprobe: 49149.71 RPS > > Single-stepped STP: 46750.82 RPS > > Emulated STP: 48981.19 RPS > > > > Redis SET (larger is better) > > ---------------------------- > > No uprobe: 49761.14 RPS > > Single-stepped STP: 45255.01 RPS > > Emulated stp: 48619.21 RPS > > > > Grep (lower is better) > > ---------------------- > > No uprobe: 2.165s > > Single-stepped STP: 15.314s > > Emualted STP: 2.216s > > The results for grep are concerning. > > In theory, the overhead for stepping should be roughly double the > overhead for emulating, assuming the exception-entry and > exception-return are the dominant cost. The cost of stepping should be > trivial. > > Those results show emulating adds 0.051s (for a ~2.4% overhead), while > stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x > more expensive. > > Was this tested bare-metal, or in a VM? Hey Mark, I hope Liao will have a chance to reply, I don't know the details of his benchmarking. But I can try to give you my numbers and maybe answer a few questions, hopefully that helps move the conversation forward. 
So, first of all, I did a quick benchmark on bare metal (without Liao's optimization, though), here are my results: uprobe-nop ( 1 cpus): 2.334 ± 0.011M/s ( 2.334M/s/cpu) uprobe-push ( 1 cpus): 2.321 ± 0.010M/s ( 2.321M/s/cpu) uprobe-ret ( 1 cpus): 4.144 ± 0.041M/s ( 4.144M/s/cpu) uretprobe-nop ( 1 cpus): 1.684 ± 0.004M/s ( 1.684M/s/cpu) uretprobe-push ( 1 cpus): 1.736 ± 0.003M/s ( 1.736M/s/cpu) uretprobe-ret ( 1 cpus): 2.502 ± 0.006M/s ( 2.502M/s/cpu) uretprobes are inherently slower, so I'll just compare uprobe, as the differences are very clear either way. -nop is literally nop (Liao solved that issue, I just don't have his patch applied on my test machine). -push has `stp x29, x30, [sp, #-0x10]!` instruction traced. -ret is literally just `ret` instruction. So you can see that -ret is almost twice as fast as the -push variant (it's a microbenchmark, yes, but still). > > AFAICT either: > > * Single-stepping is unexpectedly expensive. > > Historically we had performance issues with hypervisor trapping of > debug features, and there are things we might be able to improve in > the hypervisor and kernel, which would improve stepping *all* > instructions. > Single-stepping will always be more expensive, as it necessitates extra hop kernel->user space->kernel, so no matter the optimization for single-stepping, if we can avoid it, we should. It will be noticeable. > If stepping is the big problem, we could move uprobes over to a BRK > rather than a single-step. That would require require updating and > fixing the logic to decide which instructions are steppable, but > that's necessary anyway given it has extant soundness issues. I'm afraid I don't understand what BRK means and what are the consequences in terms of overheads. I'm not an ARM person either, so sorry if that's a stupid question. But either way, I can't address this. But see above, emulating an instruction feels like a much better approach, if possible. > > * XOL management is absurdly expensive. 
> > Does uprobes keep the XOL slot around (like kprobes does), or does it > create the slot afresh for each trap? XOL *page* is created once per process, lazily, and then we just juggle a bunch of fixed slots there for each instance of single-stepped uprobe. And yes, there are some bottlenecks in XOL management, though it's mostly due to lock contention (as it is implemented right now). Liao and Oleg have been improving XOL management, but still, avoiding XOL in the first place is the much preferred way. > > If that's trying to create a slot afresh for each trap, there are > several opportunities for improvement, e.g. keep the slot around for > as long as the uprobe exists, or pre-allocate shared slots for common > instructions and use those. As I mentioned, an XOL page is allocated and mapped once, but yes, it seems like we dynamically get a slot in it for each single-stepped execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's probably not a bad idea to just cache and hold an XOL slot for each specific uprobe; I don't see why we should limit ourselves to just one XOL page. We also don't need to pre-size each slot; we can probably allocate just the right amount of space for a given uprobe. All good ideas for sure, we should do them, IMO. But we'll still be paying an extra kernel->user->kernel switch, which almost certainly is slower than doing a simple stack push emulation just like we do in the x86-64 case, no? BTW, I did a quick local profiling run. I don't think XOL management is the main source of overhead. I see 5% of CPU cycles spent in arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack traces. There are at least 22% of CPU cycles spent in some local_daif_restore function, though; not sure what that is, but it might be related to interrupt handling, right? The takeaway I'd like to communicate here is that avoiding the need for single-stepping is *the best way* to go, IMO.
So if we can emulate those STP instructions for uprobe *cheaply*, that would be awesome. > > Mark. > > > > > Additionally, a profiling of the entry instruction for all leaf and > > non-leaf function, the ratio of 'stp fp, lr, [sp, #imm]' is larger than > > 50%. So simulting the STP on the function entry is a more viable option > > for uprobe. > > > > In the first version [1], it used a uaccess routine to simulate the STP > > that push fp/lr into stack, which use double STTR instructions for > > memory store. But as Mark pointed out, this approach can't simulate the > > correct single-atomicity and ordering properties of STP, especiallly > > when it interacts with MTE, POE, etc. So this patch uses a more complex > > and inefficient approach that acquires user stack pages, maps them to > > kernel address space, and allows kernel to use STP directly push fp/lr > > into the stack pages. > > > > xol-stp > > ------- > > uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu) > > uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu) > > uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu) > > uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu) > > uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu) > > uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu) > > > > simulated-stp > > ------------- > > uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu) > > uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu) > > uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu) > > uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu) > > uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu) > > uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu) > > > > The profiling results based on the upstream kernel with spinlock > > optimization patches [2] reveals the simulation of STP increase the > > uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and > > uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu). 
> > > > [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/ > > [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/ > > [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/ > > > > Signed-off-by: Liao Chang <liaochang1@huawei.com> > > --- > > arch/arm64/include/asm/insn.h | 1 + > > arch/arm64/kernel/probes/decode-insn.c | 16 +++++ > > arch/arm64/kernel/probes/decode-insn.h | 1 + > > arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++ > > arch/arm64/kernel/probes/simulate-insn.h | 1 + > > arch/arm64/kernel/probes/uprobes.c | 21 ++++++ > > arch/arm64/lib/insn.c | 5 ++ > > 7 files changed, 134 insertions(+) > > [...]
On Fri, 25 Oct 2024 13:51:14 -0700 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > All good ideas for sure, we should do them, IMO. But we'll still be > paying an extra kernel->user->kernel switch, which almost certainly is > slower than doing a simple stack push emulation just like we do in > x86-64 case, no? > > > BTW, I did a quick local profiling run. I don't think XOL management > is the main source of overhead. I see 5% of CPU cycles spent in > arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack > traces. There are at least 22% CPU cycles spent in some > local_daif_restore function, though, not sure what that is, but might > be related to interrupt handling, right? > > > The take away I'd like to communicate here is avoiding the > single-stepping need is *the best way* to go, IMO. So if we can > emulate those STP instructions for uprobe *cheaply*, that would be > awesome. +1. Unlike kprobes, uprobe single-stepping needs to go through user space (for security), which can introduce a much bigger overhead. If we can just emulate the instruction safely in the kernel, it should be done. Thank you,
On 2024/10/24 22:06, Mark Rutland wrote: > On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote: >> This patch is the second part of a series to improve the selftest bench >> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]' >> significantly impact uprobe/uretprobe performance at function entry in >> most user cases. Profiling results below reveals the STP that executes >> in the xol slot and trap back to kernel, reduce redis RPS and increase >> the time of string grep obviously. >> >> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz. >> >> Redis GET (higher is better) >> ---------------------------- >> No uprobe: 49149.71 RPS >> Single-stepped STP: 46750.82 RPS >> Emulated STP: 48981.19 RPS >> >> Redis SET (larger is better) >> ---------------------------- >> No uprobe: 49761.14 RPS >> Single-stepped STP: 45255.01 RPS >> Emulated stp: 48619.21 RPS >> >> Grep (lower is better) >> ---------------------- >> No uprobe: 2.165s >> Single-stepped STP: 15.314s >> Emualted STP: 2.216s > > The results for grep are concerning. > > In theory, the overhead for stepping should be roughly double the > overhead for emulating, assuming the exception-entry and > exception-return are the dominant cost. The cost of stepping should be > trivial. > > Those results show emulating adds 0.051s (for a ~2.4% overhead), while > stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x > more expensive. To provide more context, I set an uprobe at the grep_file() function from the grep source code [1], which is likely on a very hot path. I suspect that the high frequency of uprobe traps contributes to the significant overhead. I'm not entirely certain about this, and I'll need to investigate further to pinpoint the source of the 13.149s. [1] https://github.com/brgl/busybox/blob/master/findutils/grep.c#L302 > > Was this tested bare-metal, or in a VM? Bare-metal. > > AFAICT either: > > * Single-stepping is unexpectedly expensive.
> > Historically we had performance issues with hypervisor trapping of > debug features, and there are things we might be able to improve in > the hypervisor and kernel, which would improve stepping *all* > instructions. > > If stepping is the big problem, we could move uprobes over to a BRK > rather than a single-step. That would require updating and > fixing the logic to decide which instructions are steppable, but > that's necessary anyway given it has extant soundness issues. That's very helpful input. I'll rerun the profiling in a VM. > > * XOL management is absurdly expensive. > > Does uprobes keep the XOL slot around (like kprobes does), or does it > create the slot afresh for each trap? As far as I know, it doesn't create a new slot for each trap. Instead, it grabs a slot from a pre-allocated pool. I believe the cost of acquiring a slot from this pool is relatively low compared to creating a new one. > > If that's trying to create a slot afresh for each trap, there are > several opportunities for improvement, e.g. keep the slot around for > as long as the uprobe exists, or pre-allocate shared slots for common > instructions and use those. I have sent a patch [2] aimed at improving the scalability of uprobe handling. While that patch focuses on reducing the cacheline bouncing caused by atomic operations when grabbing and releasing slots from a pool, I wouldn't say this has an impact on grep, as it typically uses a single thread for pattern matching on my system. [2] https://lore.kernel.org/all/20240927094549.3382916-1-liaochang1@huawei.com/ > > Mark. > >> >> Additionally, a profiling of the entry instruction for all leaf and >> non-leaf function, the ratio of 'stp fp, lr, [sp, #imm]' is larger than >> 50%. So simulting the STP on the function entry is a more viable option >> for uprobe. >> >> In the first version [1], it used a uaccess routine to simulate the STP >> that push fp/lr into stack, which use double STTR instructions for >> memory store.
But as Mark pointed out, this approach can't simulate the >> correct single-atomicity and ordering properties of STP, especiallly >> when it interacts with MTE, POE, etc. So this patch uses a more complex >> and inefficient approach that acquires user stack pages, maps them to >> kernel address space, and allows kernel to use STP directly push fp/lr >> into the stack pages. >> >> xol-stp >> ------- >> uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu) >> uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu) >> uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu) >> uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu) >> uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu) >> uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu) >> >> simulated-stp >> ------------- >> uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu) >> uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu) >> uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu) >> uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu) >> uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu) >> uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu) >> >> The profiling results based on the upstream kernel with spinlock >> optimization patches [2] reveals the simulation of STP increase the >> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and >> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu). 
>> >> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/ >> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/ >> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/ >> >> Signed-off-by: Liao Chang <liaochang1@huawei.com> >> --- >> arch/arm64/include/asm/insn.h | 1 + >> arch/arm64/kernel/probes/decode-insn.c | 16 +++++ >> arch/arm64/kernel/probes/decode-insn.h | 1 + >> arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++ >> arch/arm64/kernel/probes/simulate-insn.h | 1 + >> arch/arm64/kernel/probes/uprobes.c | 21 ++++++ >> arch/arm64/lib/insn.c | 5 ++ >> 7 files changed, 134 insertions(+) >> >> diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h >> index dd530d5c3d67..74e25debfa75 100644 >> --- a/arch/arm64/include/asm/insn.h >> +++ b/arch/arm64/include/asm/insn.h >> @@ -561,6 +561,7 @@ u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, >> u32 insn, u64 imm); >> u32 aarch64_insn_decode_register(enum aarch64_insn_register_type type, >> u32 insn); >> +u32 aarch64_insn_decode_ldst_size(u32 insn); >> u32 aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr, >> enum aarch64_insn_branch_type type); >> u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr, >> diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c >> index be54539e309e..847a7a61ff6d 100644 >> --- a/arch/arm64/kernel/probes/decode-insn.c >> +++ b/arch/arm64/kernel/probes/decode-insn.c >> @@ -67,6 +67,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn) >> return true; >> } >> >> +bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn) >> +{ >> + /* >> + * The 1st instruction on function entry often follows the >> + * patten 'stp x29, x30, [sp, #imm]!' that pushing fp and lr >> + * into stack. 
>> + */ >> + u32 opc = aarch64_insn_decode_ldst_size(insn); >> + u32 rt2 = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT2, insn); >> + u32 rn = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RN, insn); >> + u32 rt = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, insn); >> + >> + return aarch64_insn_is_stp_pre(insn) && >> + (opc == 2) && (rt2 == 30) && (rn == 31) && (rt == 29); >> +} >> + >> /* Return: >> * INSN_REJECTED If instruction is one not allowed to kprobe, >> * INSN_GOOD If instruction is supported and uses instruction slot, >> diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h >> index 8b758c5a2062..033ccab73da6 100644 >> --- a/arch/arm64/kernel/probes/decode-insn.h >> +++ b/arch/arm64/kernel/probes/decode-insn.h >> @@ -29,5 +29,6 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi); >> #endif >> enum probe_insn __kprobes >> arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi); >> +bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn); >> >> #endif /* _ARM_KERNEL_KPROBES_ARM64_H */ >> diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c >> index 5e4f887a074c..3906851c07b2 100644 >> --- a/arch/arm64/kernel/probes/simulate-insn.c >> +++ b/arch/arm64/kernel/probes/simulate-insn.c >> @@ -8,6 +8,9 @@ >> #include <linux/bitops.h> >> #include <linux/kernel.h> >> #include <linux/kprobes.h> >> +#include <linux/highmem.h> >> +#include <linux/vmalloc.h> >> +#include <linux/mm.h> >> >> #include <asm/ptrace.h> >> #include <asm/traps.h> >> @@ -211,3 +214,89 @@ simulate_nop(u32 opcode, long addr, struct pt_regs *regs) >> */ >> arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE); >> } >> + >> +static inline >> +bool stack_align_check(unsigned long sp) >> +{ >> + return (IS_ALIGNED(sp, 16) || >> + !(read_sysreg(sctlr_el1) & SCTLR_EL1_SA0_MASK)); >> +} >> + >> +static inline >> +void put_user_stack_pages(struct page 
**pages, int nr_pages) >> +{ >> + int i; >> + >> + for (i = 0; i < nr_pages; i++) >> + put_page(pages[i]); >> +} >> + >> +static inline >> +int get_user_stack_pages(long start, long end, struct page **pages) >> +{ >> + int ret; >> + int nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1; >> + >> + ret = get_user_pages_fast(start, nr_pages, >> + FOLL_WRITE | FOLL_FORCE, pages); >> + if (unlikely(ret != nr_pages)) { >> + if (ret > 0) >> + put_user_stack_pages(pages, ret); >> + return 0; >> + } >> + >> + return nr_pages; >> +} >> + >> +static inline >> +void *map_user_stack_pages(struct page **pages, int nr_pages) >> +{ >> + if (likely(nr_pages == 1)) >> + return kmap_local_page(pages[0]); >> + else >> + return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); >> +} >> + >> +static inline >> +void unmap_user_stack_pages(void *kaddr, int nr_pages) >> +{ >> + if (likely(nr_pages == 1)) >> + kunmap_local(kaddr); >> + else >> + vunmap(kaddr); >> +} >> + >> +void __kprobes >> +simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs) >> +{ >> + long imm7; >> + long new_sp; >> + int nr_pages; >> + void *kaddr, *dst; >> + struct page *pages[2] = { NULL }; >> + >> + imm7 = aarch64_insn_decode_immediate(AARCH64_INSN_IMM_7, opcode); >> + new_sp = regs->sp + (sign_extend64(imm7, 6) << 3); >> + if (!stack_align_check(new_sp)) { >> + force_sig(SIGSEGV); >> + goto done; >> + } >> + >> + nr_pages = get_user_stack_pages(new_sp, regs->sp, pages); >> + if (!nr_pages) { >> + force_sig(SIGSEGV); >> + goto done; >> + } >> + >> + kaddr = map_user_stack_pages(pages, nr_pages); >> + dst = kaddr + (new_sp & ~PAGE_MASK); >> + asm volatile("stp %0, %1, [%2]" >> + : : "r"(regs->regs[29]), "r"(regs->regs[30]), "r"(dst)); >> + >> + unmap_user_stack_pages(kaddr, nr_pages); >> + put_user_stack_pages(pages, nr_pages); >> + >> +done: >> + regs->sp = new_sp; >> + arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE); >> +} >> diff --git a/arch/arm64/kernel/probes/simulate-insn.h 
b/arch/arm64/kernel/probes/simulate-insn.h >> index efb2803ec943..733a47ffa2e5 100644 >> --- a/arch/arm64/kernel/probes/simulate-insn.h >> +++ b/arch/arm64/kernel/probes/simulate-insn.h >> @@ -17,5 +17,6 @@ void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs); >> void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs); >> void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs); >> void simulate_nop(u32 opcode, long addr, struct pt_regs *regs); >> +void simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs); >> >> #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */ >> diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c >> index d49aef2657cd..c70862314fde 100644 >> --- a/arch/arm64/kernel/probes/uprobes.c >> +++ b/arch/arm64/kernel/probes/uprobes.c >> @@ -8,6 +8,7 @@ >> #include <asm/cacheflush.h> >> >> #include "decode-insn.h" >> +#include "simulate-insn.h" >> >> #define UPROBE_INV_FAULT_CODE UINT_MAX >> >> @@ -31,6 +32,21 @@ unsigned long uprobe_get_swbp_addr(struct pt_regs *regs) >> return instruction_pointer(regs); >> } >> >> +static enum probe_insn >> +arm_uprobe_decode_special_insn(probe_opcode_t insn, struct arch_probe_insn *api) >> +{ >> + /* >> + * When uprobe interact with VMSA features, such as MTE, POE, etc, it >> + * give up the simulation of memory access related instructions. 
>> + */ >> + if (!system_supports_mte() && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) { >> + api->handler = simulate_stp_fp_lr_sp_64b; >> + return INSN_GOOD_NO_SLOT; >> + } >> + >> + return INSN_REJECTED; >> +} >> + >> int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, >> unsigned long addr) >> { >> @@ -44,6 +60,11 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, >> >> insn = *(probe_opcode_t *)(&auprobe->insn[0]); >> >> + if (arm_uprobe_decode_special_insn(insn, &auprobe->api)) { >> + auprobe->simulate = true; >> + return 0; >> + } >> + >> switch (arm_probe_decode_insn(insn, &auprobe->api)) { >> case INSN_REJECTED: >> return -EINVAL; >> diff --git a/arch/arm64/lib/insn.c b/arch/arm64/lib/insn.c >> index b008a9b46a7f..0635219d2196 100644 >> --- a/arch/arm64/lib/insn.c >> +++ b/arch/arm64/lib/insn.c >> @@ -238,6 +238,11 @@ static u32 aarch64_insn_encode_ldst_size(enum aarch64_insn_size_type type, >> return insn; >> } >> >> +u32 aarch64_insn_decode_ldst_size(u32 insn) >> +{ >> + return (insn & GENMASK(31, 30)) >> 30; >> +} >> + >> static inline long label_imm_common(unsigned long pc, unsigned long addr, >> long range) >> { >> -- >> 2.34.1 >> >
Andrii and Mark. On 2024/10/26 4:51, Andrii Nakryiko wrote: > On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@arm.com> wrote: >> >> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote: >>> This patch is the second part of a series to improve the selftest bench >>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]' >>> significantly impact uprobe/uretprobe performance at function entry in >>> most user cases. Profiling results below reveals the STP that executes >>> in the xol slot and trap back to kernel, reduce redis RPS and increase >>> the time of string grep obviously. >>> >>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz. >>> >>> Redis GET (higher is better) >>> ---------------------------- >>> No uprobe: 49149.71 RPS >>> Single-stepped STP: 46750.82 RPS >>> Emulated STP: 48981.19 RPS >>> >>> Redis SET (larger is better) >>> ---------------------------- >>> No uprobe: 49761.14 RPS >>> Single-stepped STP: 45255.01 RPS >>> Emulated stp: 48619.21 RPS >>> >>> Grep (lower is better) >>> ---------------------- >>> No uprobe: 2.165s >>> Single-stepped STP: 15.314s >>> Emualted STP: 2.216s >> >> The results for grep are concerning. >> >> In theory, the overhead for stepping should be roughly double the >> overhead for emulating, assuming the exception-entry and >> exception-return are the dominant cost. The cost of stepping should be >> trivial. >> >> Those results show emulating adds 0.051s (for a ~2.4% overhead), while >> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x >> more expensive. >> >> Was this tested bare-metal, or in a VM? > > Hey Mark, I hope Liao will have a chance to reply, I don't know the > details of his benchmarking. But I can try to give you my numbers and > maybe answer a few questions, hopefully that helps move the > conversation forward.
> > So, first of all, I did a quick benchmark on bare metal (without > Liao's optimization, though), here are my results: > > uprobe-nop ( 1 cpus): 2.334 ± 0.011M/s ( 2.334M/s/cpu) > uprobe-push ( 1 cpus): 2.321 ± 0.010M/s ( 2.321M/s/cpu) > uprobe-ret ( 1 cpus): 4.144 ± 0.041M/s ( 4.144M/s/cpu) > > uretprobe-nop ( 1 cpus): 1.684 ± 0.004M/s ( 1.684M/s/cpu) > uretprobe-push ( 1 cpus): 1.736 ± 0.003M/s ( 1.736M/s/cpu) > uretprobe-ret ( 1 cpus): 2.502 ± 0.006M/s ( 2.502M/s/cpu) > > uretprobes are inherently slower, so I'll just compare uprobe, as the > differences are very clear either way. > > -nop is literally nop (Liao solved that issue, I just don't have his > patch applied on my test machine). -push has `stp x29, x30, [sp, > #-0x10]!` instruction traced. -ret is literally just `ret` > instruction. > > So you can see that -ret is almost twice as fast as the -push variant > (it's a microbenchmark, yes, but still). > >> >> AFAICT either: >> >> * Single-stepping is unexpectedly expensive. >> >> Historically we had performance issues with hypervisor trapping of >> debug features, and there are things we might be able to improve in >> the hypervisor and kernel, which would improve stepping *all* >> instructions. >> > > Single-stepping will always be more expensive, as it necessitates > extra hop kernel->user space->kernel, so no matter the optimization > for single-stepping, if we can avoid it, we should. It will be > noticeable. > >> If stepping is the big problem, we could move uprobes over to a BRK >> rather than a single-step. That would require require updating and >> fixing the logic to decide which instructions are steppable, but >> that's necessary anyway given it has extant soundness issues. > > I'm afraid I don't understand what BRK means and what are the > consequences in terms of overheads. I'm not an ARM person either, so > sorry if that's a stupid question. But either way, I can't address > this. 
But see above, emulating an instruction feels like a much better > approach, if possible. As I understand it, Mark's suggestion is to place a BRK instruction next to the instruction in the xol slot. Once the instruction in the xol slot has executed, the BRK instruction triggers a trap back into the kernel. This is a common technique on platforms that don't support hardware single-step. However, since Arm64 does support hardware single-stepping, the kernel enables it in pre_ssout(), allowing the CPU to automatically trap into the kernel after the instruction in the xol slot has executed. But even if we move uprobes over to a BRK rather than a single-step, it can't reduce the overhead of the user->kernel->user context switch on bare metal. Maybe I am wrong, Mark; could you give more details about the BRK approach? > >> >> * XOL management is absurdly expensive. >> >> Does uprobes keep the XOL slot around (like kprobes does), or does it >> create the slot afresh for each trap? > > XOL *page* is created once per process, lazily, and then we just > juggle a bunch of fixed slots there for each instance of > single-stepped uprobe. And yes, there are some bottlenecks in XOL > management, though it's mostly due to lock contention (as it is > implemented right now). Liao and Oleg have been improving XOL > management, but still, avoiding XOL in the first place is the much > preferred way. > >> >> If that's trying to create a slot afresh for each trap, there are >> several opportunities for improvement, e.g. keep the slot around for >> as long as the uprobe exists, or pre-allocate shared slots for common >> instructions and use those. > > As I mentioned, a XOL page is allocated and mapped once, but yes, it > seems like we dynamically get a slot in it for each single-stepped > execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's > probably not a bad idea to just cache and hold a XOL slot for each > specific uprobe, I don't see why we should limit ourselves to just one > XOL page.
We also don't need to pre-size each slot, we can probably > allocate just the right amount of space for a given uprobe. > > All good ideas for sure, we should do them, IMO. But we'll still be > paying an extra kernel->user->kernel switch, which almost certainly is > slower than doing a simple stack push emulation just like we do in > x86-64 case, no? > > > BTW, I did a quick local profiling run. I don't think XOL management > is the main source of overhead. I see 5% of CPU cycles spent in > arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack > traces. There are at least 22% CPU cycles spent in some > local_daif_restore function, though, not sure what that is, but might > be related to interrupt handling, right? The local_daif_restore() is part of the path for all user->kernel->user context switches, including interrupt handling, breakpoints, single-stepping, etc. I am surprised to see it consuming 22% of CPU cycles as well. I haven't been able to reproduce this on my local machine. Andrii, could you use the patch below to see if it reduces the 5% of CPU cycles spent in arch_uprobe_copy_ixol? I suspect that D/I-cache synchronization is the cause of this part of the overhead. https://lore.kernel.org/all/20240919121719.2148361-1-liaochang1@huawei.com/ > > > The take away I'd like to communicate here is avoiding the > single-stepping need is *the best way* to go, IMO. So if we can > emulate those STP instructions for uprobe *cheaply*, that would be > awesome. Given that some significant uprobe optimizations from Oleg and Andrii have been merged, I am curious to see how these changes impact the profiling results on Arm64. So I re-ran the selftest bench on the latest kernel (based on tag next-20241104) and the kernel (based on tag next-20240909) that I used when I submitted this patch. The re-run results are shown below.
next-20240909 (xol stp + xol nop)
---------------------------------
uprobe-nop ( 1 cpus): 0.424 ± 0.000M/s ( 0.424M/s/cpu)
uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu)
uprobe-ret ( 1 cpus): 2.101 ± 0.002M/s ( 2.101M/s/cpu)
uretprobe-nop ( 1 cpus): 0.347 ± 0.000M/s ( 0.347M/s/cpu)
uretprobe-push ( 1 cpus): 0.349 ± 0.000M/s ( 0.349M/s/cpu)
uretprobe-ret ( 1 cpus): 1.051 ± 0.001M/s ( 1.051M/s/cpu)

next-20240909 (sim stp + sim nop)
---------------------------------
uprobe-nop ( 1 cpus): 2.042 ± 0.002M/s ( 2.042M/s/cpu)
uprobe-push ( 1 cpus): 1.363 ± 0.002M/s ( 1.363M/s/cpu)
uprobe-ret ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu)
uretprobe-nop ( 1 cpus): 1.049 ± 0.001M/s ( 1.049M/s/cpu)
uretprobe-push ( 1 cpus): 0.780 ± 0.000M/s ( 0.780M/s/cpu)
uretprobe-ret ( 1 cpus): 1.065 ± 0.001M/s ( 1.065M/s/cpu)

next-20241104 (xol stp + sim nop)
---------------------------------
uprobe-nop ( 1 cpus): 2.044 ± 0.003M/s ( 2.044M/s/cpu)
uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu)
uprobe-ret ( 1 cpus): 2.047 ± 0.001M/s ( 2.047M/s/cpu)
uretprobe-nop ( 1 cpus): 0.832 ± 0.003M/s ( 0.832M/s/cpu)
uretprobe-push ( 1 cpus): 0.328 ± 0.000M/s ( 0.328M/s/cpu)
uretprobe-ret ( 1 cpus): 0.833 ± 0.003M/s ( 0.833M/s/cpu)

next-20241104 (sim stp + sim nop)
---------------------------------
uprobe-nop ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu)
uprobe-push ( 1 cpus): 1.411 ± 0.002M/s ( 1.411M/s/cpu)
uprobe-ret ( 1 cpus): 2.052 ± 0.005M/s ( 2.052M/s/cpu)
uretprobe-nop ( 1 cpus): 0.839 ± 0.005M/s ( 0.839M/s/cpu)
uretprobe-push ( 1 cpus): 0.702 ± 0.002M/s ( 0.702M/s/cpu)
uretprobe-ret ( 1 cpus): 0.837 ± 0.001M/s ( 0.837M/s/cpu)

It seems that the STP simulation approach in this patch significantly improves uprobe-push throughput by 240% (from 0.415M/s to 1.411M/s) and uretprobe-push by 114% (from 0.328M/s to 0.702M/s) on kernels based on next-20240909 and next-20241104. While there is still room for improvement to reach the throughput of -nop and -ret, the gains are very substantial.
But I'm a bit puzzled by the throughput of uprobe/uretprobe-push using single-stepped STP, which is far lower than the results from when I submitted the patch (look closely at the uprobe-push and uretprobe-push results in the commit log). I'm certain that the tests were run on the same bare-metal machine with background tasks minimized. I suspect some uncommitted uprobe optimization in my local repo twisted the single-stepped -push results. In addition to the microbenchmark, I also re-ran the Redis benchmark to compare the impact of single-stepped STP and simulated STP on the throughput of redis-server. I believe the impact of uprobes on a real application depends on the frequency of uprobe triggers and the application's hot paths. Therefore, I wouldn't say the simulated STP will benefit all real-world applications.

$ redis-benchmark -h [redis-server IP] -p 7778 -n 64000 -d 4 -c 128 -t SET
$ redis-server --port 7778 --protected-mode no --save "" --appendonly no & && bpftrace -e 'uprobe:redis-server:readQueryFromClient{} uprobe:redis-server:processCommand{} uprobe:redis-server:aeApiPoll {}'

next-20241104
-------------
RPS: 55602.1

next-20241104 + ss stp
----------------------
RPS: 47220.9
uprobe@aeApiPoll: 554565
uprobe@processCommand: 1275160
uprobe@readQueryFromClient: 1277710

next-20241104 + sim stp
-----------------------
RPS: 54290.09
uprobe@aeApiPoll: 496007
uprobe@processCommand: 1275160
uprobe@readQueryFromClient: 1277710

Andrii expressed concern that the STP simulation in this patch is too expensive. If we believe the results I re-ran, perhaps it is not a bad way to simulate STP. Looking forward to your feedback; if someone can propose a cheaper way to simulate STP, I'm very happy to test it on my machine. Thanks. [...]
>>> >>> xol-stp >>> ------- >>> uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu) >>> uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu) >>> uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu) >>> uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu) >>> uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu) >>> uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu) >>> >>> simulated-stp >>> ------------- >>> uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu) >>> uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu) >>> uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu) >>> uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu) >>> uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu) >>> uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu) >>> >>> The profiling results based on the upstream kernel with spinlock >>> optimization patches [2] reveals the simulation of STP increase the >>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and >>> uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu). >>> >>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/ >>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/ >>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/ >>> >>> Signed-off-by: Liao Chang <liaochang1@huawei.com> >>> --- >>> arch/arm64/include/asm/insn.h | 1 + >>> arch/arm64/kernel/probes/decode-insn.c | 16 +++++ >>> arch/arm64/kernel/probes/decode-insn.h | 1 + >>> arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++ >>> arch/arm64/kernel/probes/simulate-insn.h | 1 + >>> arch/arm64/kernel/probes/uprobes.c | 21 ++++++ >>> arch/arm64/lib/insn.c | 5 ++ >>> 7 files changed, 134 insertions(+) >>> > > [...]
On Tue, Nov 5, 2024 at 4:22 AM Liao, Chang <liaochang1@huawei.com> wrote: > > Andrii and Mark. > > 在 2024/10/26 4:51, Andrii Nakryiko 写道: > > On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@arm.com> wrote: > >> > >> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote: > >>> This patch is the second part of a series to improve the selftest bench > >>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]' > >>> significantly impact uprobe/uretprobe performance at function entry in > >>> most user cases. Profiling results below reveals the STP that executes > >>> in the xol slot and trap back to kernel, reduce redis RPS and increase > >>> the time of string grep obviously. > >>> > >>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz. > >>> > >>> Redis GET (higher is better) > >>> ---------------------------- > >>> No uprobe: 49149.71 RPS > >>> Single-stepped STP: 46750.82 RPS > >>> Emulated STP: 48981.19 RPS > >>> > >>> Redis SET (larger is better) > >>> ---------------------------- > >>> No uprobe: 49761.14 RPS > >>> Single-stepped STP: 45255.01 RPS > >>> Emulated stp: 48619.21 RPS > >>> > >>> Grep (lower is better) > >>> ---------------------- > >>> No uprobe: 2.165s > >>> Single-stepped STP: 15.314s > >>> Emualted STP: 2.216s > >> > >> The results for grep are concerning. > >> > >> In theory, the overhead for stepping should be roughly double the > >> overhead for emulating, assuming the exception-entry and > >> exception-return are the dominant cost. The cost of stepping should be > >> trivial. > >> > >> Those results show emulating adds 0.051s (for a ~2.4% overhead), while > >> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x > >> more expensive. > >> > >> Was this tested bare-metal, or in a VM? > > > > Hey Mark, I hope Liao will have a chance to reply, I don't know the > > details of his benchmarking. 
But I can try to give you my numbers and > > maybe answer a few questions, hopefully that helps move the > > conversation forward. > > > > So, first of all, I did a quick benchmark on bare metal (without > > Liao's optimization, though), here are my results: > > > > uprobe-nop ( 1 cpus): 2.334 ± 0.011M/s ( 2.334M/s/cpu) > > uprobe-push ( 1 cpus): 2.321 ± 0.010M/s ( 2.321M/s/cpu) > > uprobe-ret ( 1 cpus): 4.144 ± 0.041M/s ( 4.144M/s/cpu) > > > > uretprobe-nop ( 1 cpus): 1.684 ± 0.004M/s ( 1.684M/s/cpu) > > uretprobe-push ( 1 cpus): 1.736 ± 0.003M/s ( 1.736M/s/cpu) > > uretprobe-ret ( 1 cpus): 2.502 ± 0.006M/s ( 2.502M/s/cpu) > > > > uretprobes are inherently slower, so I'll just compare uprobe, as the > > differences are very clear either way. > > > > -nop is literally nop (Liao solved that issue, I just don't have his > > patch applied on my test machine). -push has `stp x29, x30, [sp, > > #-0x10]!` instruction traced. -ret is literally just `ret` > > instruction. > > > > So you can see that -ret is almost twice as fast as the -push variant > > (it's a microbenchmark, yes, but still). > > > >> > >> AFAICT either: > >> > >> * Single-stepping is unexpectedly expensive. > >> > >> Historically we had performance issues with hypervisor trapping of > >> debug features, and there are things we might be able to improve in > >> the hypervisor and kernel, which would improve stepping *all* > >> instructions. > >> > > > > Single-stepping will always be more expensive, as it necessitates > > extra hop kernel->user space->kernel, so no matter the optimization > > for single-stepping, if we can avoid it, we should. It will be > > noticeable. > > > >> If stepping is the big problem, we could move uprobes over to a BRK > >> rather than a single-step. That would require require updating and > >> fixing the logic to decide which instructions are steppable, but > >> that's necessary anyway given it has extant soundness issues. 
> > > > I'm afraid I don't understand what BRK means and what are the > > consequences in terms of overheads. I'm not an ARM person either, so > > sorry if that's a stupid question. But either way, I can't address > > this. But see above, emulating an instruction feels like a much better > > approach, if possible. > > As I understand, Mark's suggestion is to place a BRK instruction next to > the instruction in the xol slot. Once the instruction in the xol slot > executed, the BRK instruction would trigger a trap into kernel. This is > a common technique used on platforms that don't support hardware single- > step. However, since Arm64 does support hardware single-stepping, kernel > enables it in pre_ssout(), allowing the CPU to automatically trap into kernel > after instruction in xol slot executed. But even we move uprobes over > to a BRK rather than a single-step. It can't reduce the overhead of user-> > kernel->user context switch on the bare-metal. Maybe I am wrong, Mark, > could you give more details about the BRK. > I see, thanks for elaborating. So the suggestion was to go from very expensive single-stepping mode to still expensive breakpoint-based kernel->user->kernel workflow. I think either way it's going to be much slower than avoiding kernel->user->kernel hop, so we should emulate STP instead, yep. > > > >> > >> * XOL management is absurdly expensive. > >> > >> Does uprobes keep the XOL slot around (like krpobes does), or does it > >> create the slot afresh for each trap? > > > > XOL *page* is created once per process, lazily, and then we just > > juggle a bunch of fixed slots there for each instance of > > single-stepped uprobe. And yes, there are some bottlenecks in XOL > > management, though it's mostly due to lock contention (as it is > > implemented right now). Liao and Oleg have been improving XOL > > management, but still, avoiding XOL in the first place is the much > > preferred way. 
> > > >> > >> If that's trying to create a slot afresh for each trap, there are > >> several opportunities for improvement, e.g. keep the slot around for > >> as long as the uprobe exists, or pre-allocate shared slots for common > >> instructions and use those. > > > > As I mentioned, a XOL page is allocated and mapped once, but yes, it > > seems like we dynamically get a slot in it for each single-stepped > > execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's > > probably not a bad idea to just cache and hold a XOL slot for each > > specific uprobe, I don't see why we should limit ourselves to just one > > XOL page. We also don't need to pre-size each slot, we can probably > > allocate just the right amount of space for a given uprobe. > > > > All good ideas for sure, we should do them, IMO. But we'll still be > > paying an extra kernel->user->kernel switch, which almost certainly is > > slower than doing a simple stack push emulation just like we do in > > x86-64 case, no? > > > > > > BTW, I did a quick local profiling run. I don't think XOL management > > is the main source of overhead. I see 5% of CPU cycles spent in > > arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack > > traces. There are at least 22% CPU cycles spent in some > > local_daif_restore function, though, not sure what that is, but might > > be related to interrupt handling, right? > > The local_daif_restore() is part of the path for all user->kernel->user > context switch, including interrupt handling, breakpoints, and single-stepping > etc. I am surprised to see it consuming 22% of CPU cycles as well. I haven't > been enable to reproduce this on my local machine. > > Andrii, could you use the patch below to see if it can reduce the 5% of > CPU cycles spent in arch_uprobe_copy_ixol, I doubt that D/I cache > synchronization is the cause of this part of overhead. 
> > https://lore.kernel.org/all/20240919121719.2148361-1-liaochang1@huawei.com/ tbh, I think pre-allocating and setting up fixed XOL slots instead of dynamically allocating them is the way to go. We can allocate as many special "[uprobes]" pages as necessary to accommodate all the single-stepped uprobes in XOL, remember their index, etc. I think that's a much more performant (and simpler, IMO) approach overall. All this preparation, however expensive it might be, will be done once per attached/detached uprobe/uretprobe, which is the place where we want to do expensive stuff. Not when uprobe/uretprobe is actually triggered. > > > > > > The take away I'd like to communicate here is avoiding the > > single-stepping need is *the best way* to go, IMO. So if we can > > emulate those STP instructions for uprobe *cheaply*, that would be > > awesome. > > Given some significant uprobe optimizations from Oleg and Andrii > merged, I am curious to see how these changes impact the profiling > result on Arm64. So I re-ran the selftest bench on the latest kernel > (based on tag next-20241104) and the kernel (based on tag next-20240909) > that I used when I submitted this patch. The results re-ran are shown > below.
> > next-20240909(xol stp + xol nop) > -------------------------------- > uprobe-nop ( 1 cpus): 0.424 ± 0.000M/s ( 0.424M/s/cpu) > uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu) > uprobe-ret ( 1 cpus): 2.101 ± 0.002M/s ( 2.101M/s/cpu) > uretprobe-nop ( 1 cpus): 0.347 ± 0.000M/s ( 0.347M/s/cpu) > uretprobe-push ( 1 cpus): 0.349 ± 0.000M/s ( 0.349M/s/cpu) > uretprobe-ret ( 1 cpus): 1.051 ± 0.001M/s ( 1.051M/s/cpu) > > next-20240909(sim stp + sim nop) > -------------------------------- > uprobe-nop ( 1 cpus): 2.042 ± 0.002M/s ( 2.042M/s/cpu) > uprobe-push ( 1 cpus): 1.363 ± 0.002M/s ( 1.363M/s/cpu) > uprobe-ret ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu) > uretprobe-nop ( 1 cpus): 1.049 ± 0.001M/s ( 1.049M/s/cpu) > uretprobe-push ( 1 cpus): 0.780 ± 0.000M/s ( 0.780M/s/cpu) > uretprobe-ret ( 1 cpus): 1.065 ± 0.001M/s ( 1.065M/s/cpu) > > next-20241104 (xol stp + sim nop) > --------------------------------- > uprobe-nop ( 1 cpus): 2.044 ± 0.003M/s ( 2.044M/s/cpu) > uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu) > uprobe-ret ( 1 cpus): 2.047 ± 0.001M/s ( 2.047M/s/cpu) > uretprobe-nop ( 1 cpus): 0.832 ± 0.003M/s ( 0.832M/s/cpu) > uretprobe-push ( 1 cpus): 0.328 ± 0.000M/s ( 0.328M/s/cpu) > uretprobe-ret ( 1 cpus): 0.833 ± 0.003M/s ( 0.833M/s/cpu) > > next-20241104 (sim stp + sim nop) > --------------------------------- > uprobe-nop ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu) > uprobe-push ( 1 cpus): 1.411 ± 0.002M/s ( 1.411M/s/cpu) > uprobe-ret ( 1 cpus): 2.052 ± 0.005M/s ( 2.052M/s/cpu) > uretprobe-nop ( 1 cpus): 0.839 ± 0.005M/s ( 0.839M/s/cpu) > uretprobe-push ( 1 cpus): 0.702 ± 0.002M/s ( 0.702M/s/cpu) > uretprobe-ret ( 1 cpus): 0.837 ± 0.001M/s ( 0.837M/s/cpu) > > It seems that the STP simluation approach in this patch significantly > improves uprobe-push throughtput by 240% (from 0.415Ms/ to 1.411M/s) > and uretprobe-push by 114% (from 0.328M/s to 0.702M/s) on kernels > bases on next-20240909 and next-20241104. 
While there is still room > for improvement to reach the throughput of -nop and -ret, the gains > are very substantail. > > But I'm a bit puzzled by the throughput of uprobe/uretprobe-push using > single-stepping stp, which are far lower compared to the result when > when I submitted patch(look closely to the uprobe-push and uretprobe-push > results in commit log). I'm certain that the tests were run on the > same bare-metal machine with background tasked minimized. I doubt some > uncommitted uprobe optimization on my local repo twist the result of > -push using single-step. You can always profile and compare before/after, right? See where added costs are coming from? > > In addition to the micro benchmark, I also re-ran Redis benchmark to > compare the impact of single-stepping STP and simluated STP to the > throughput of redis-server. I believe the impact of uprobe on the real > application depends on the frequency of uprobe triggered and the application's > hot paths. Therefore, I wouldn't say the simluated STP will benefit all > real world applications. It will benefit *a lot* of real world applications, though, so I think it's very important to improve. > > $ redis-benchmark -h [redis-server IP] -p 7778 -n 64000 -d 4 -c 128 -t SET > $ redis-server --port 7778 --protected-mode no --save "" --appendonly no & && > bpftrace -e 'uprobe:redis-server:readQueryFromClient{} > uprobe:redis-server:processCommand{} > uprobe:redis-server:aeApiPoll {}' > > next-20241104 > ------------- > RPS: 55602.1 > > next-20241104 + ss stp > ---------------------- > RPS: 47220.9 > uprobe@@aeApiPoll: 554565 > uprobe@processCommand: 1275160 > uprobe@readQueryFromClient: 1277710 > > next-20241104 + sim stp > ----------------------- > RPS 54290.09 > uprobe@aeApiPoll: 496007 > uprobe@processCommand: 1275160 > uprobe@readQueryFromClient: 1277710 > > Andrii expressed concern that the STP simulation in this patch is too > expensive.
If we believe the result I re-ran, perhaps it is not a > bad way to simluate STP. Looking forward to your feedbacks, or someone > could propose a cheaper way to simluate STP, I'm very happy to test it > on my machine, thanks. I'm no ARM64 expert, but seeing that we simulate stack pushes with just memory reads/writes for x86-64, it feels like that should be satisfactory for ARM64. So I'd suggest you go back to the initial implementation, clean it up, rebase, re-benchmark, and send a new revision. Let's continue the discussion there? > > [...] > > >>> > >>> xol-stp > >>> ------- > >>> uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu) > >>> uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu) > >>> uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu) > >>> uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu) > >>> uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu) > >>> uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu) > >>> > >>> simulated-stp > >>> ------------- > >>> uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu) > >>> uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu) > >>> uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu) > >>> uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu) > >>> uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu) > >>> uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu) > >>> > >>> The profiling results based on the upstream kernel with spinlock > >>> optimization patches [2] reveals the simulation of STP increase the > >>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and > >>> uretprobe-push by 15.9% (from
> >>> > >>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/ > >>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/ > >>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/ > >>> > >>> Signed-off-by: Liao Chang <liaochang1@huawei.com> > >>> --- > >>> arch/arm64/include/asm/insn.h | 1 + > >>> arch/arm64/kernel/probes/decode-insn.c | 16 +++++ > >>> arch/arm64/kernel/probes/decode-insn.h | 1 + > >>> arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++ > >>> arch/arm64/kernel/probes/simulate-insn.h | 1 + > >>> arch/arm64/kernel/probes/uprobes.c | 21 ++++++ > >>> arch/arm64/lib/insn.c | 5 ++ > >>> 7 files changed, 134 insertions(+) > >>> > > > > [...] > > -- > BR > Liao, Chang >
在 2024/11/7 3:45, Andrii Nakryiko 写道: > On Tue, Nov 5, 2024 at 4:22 AM Liao, Chang <liaochang1@huawei.com> wrote: >> >> Andrii and Mark. >> >> 在 2024/10/26 4:51, Andrii Nakryiko 写道: >>> On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@arm.com> wrote: >>>> >>>> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote: >>>>> This patch is the second part of a series to improve the selftest bench >>>>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]' >>>>> significantly impact uprobe/uretprobe performance at function entry in >>>>> most user cases. Profiling results below reveals the STP that executes >>>>> in the xol slot and trap back to kernel, reduce redis RPS and increase >>>>> the time of string grep obviously. >>>>> >>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz. >>>>> >>>>> Redis GET (higher is better) >>>>> ---------------------------- >>>>> No uprobe: 49149.71 RPS >>>>> Single-stepped STP: 46750.82 RPS >>>>> Emulated STP: 48981.19 RPS >>>>> >>>>> Redis SET (larger is better) >>>>> ---------------------------- >>>>> No uprobe: 49761.14 RPS >>>>> Single-stepped STP: 45255.01 RPS >>>>> Emulated stp: 48619.21 RPS >>>>> >>>>> Grep (lower is better) >>>>> ---------------------- >>>>> No uprobe: 2.165s >>>>> Single-stepped STP: 15.314s >>>>> Emualted STP: 2.216s >>>> >>>> The results for grep are concerning. >>>> >>>> In theory, the overhead for stepping should be roughly double the >>>> overhead for emulating, assuming the exception-entry and >>>> exception-return are the dominant cost. The cost of stepping should be >>>> trivial. >>>> >>>> Those results show emulating adds 0.051s (for a ~2.4% overhead), while >>>> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x >>>> more expensive. >>>> >>>> Was this tested bare-metal, or in a VM? >>> >>> Hey Mark, I hope Liao will have a chance to reply, I don't know the >>> details of his benchmarking. 
But I can try to give you my numbers and >>> maybe answer a few questions, hopefully that helps move the >>> conversation forward. >>> >>> So, first of all, I did a quick benchmark on bare metal (without >>> Liao's optimization, though), here are my results: >>> >>> uprobe-nop ( 1 cpus): 2.334 ± 0.011M/s ( 2.334M/s/cpu) >>> uprobe-push ( 1 cpus): 2.321 ± 0.010M/s ( 2.321M/s/cpu) >>> uprobe-ret ( 1 cpus): 4.144 ± 0.041M/s ( 4.144M/s/cpu) >>> >>> uretprobe-nop ( 1 cpus): 1.684 ± 0.004M/s ( 1.684M/s/cpu) >>> uretprobe-push ( 1 cpus): 1.736 ± 0.003M/s ( 1.736M/s/cpu) >>> uretprobe-ret ( 1 cpus): 2.502 ± 0.006M/s ( 2.502M/s/cpu) >>> >>> uretprobes are inherently slower, so I'll just compare uprobe, as the >>> differences are very clear either way. >>> >>> -nop is literally nop (Liao solved that issue, I just don't have his >>> patch applied on my test machine). -push has `stp x29, x30, [sp, >>> #-0x10]!` instruction traced. -ret is literally just `ret` >>> instruction. >>> >>> So you can see that -ret is almost twice as fast as the -push variant >>> (it's a microbenchmark, yes, but still). >>> >>>> >>>> AFAICT either: >>>> >>>> * Single-stepping is unexpectedly expensive. >>>> >>>> Historically we had performance issues with hypervisor trapping of >>>> debug features, and there are things we might be able to improve in >>>> the hypervisor and kernel, which would improve stepping *all* >>>> instructions. >>>> >>> >>> Single-stepping will always be more expensive, as it necessitates >>> extra hop kernel->user space->kernel, so no matter the optimization >>> for single-stepping, if we can avoid it, we should. It will be >>> noticeable. >>> >>>> If stepping is the big problem, we could move uprobes over to a BRK >>>> rather than a single-step. That would require require updating and >>>> fixing the logic to decide which instructions are steppable, but >>>> that's necessary anyway given it has extant soundness issues. 
>>> >>> I'm afraid I don't understand what BRK means and what are the >>> consequences in terms of overheads. I'm not an ARM person either, so >>> sorry if that's a stupid question. But either way, I can't address >>> this. But see above, emulating an instruction feels like a much better >>> approach, if possible. >> >> As I understand, Mark's suggestion is to place a BRK instruction next to >> the instruction in the xol slot. Once the instruction in the xol slot >> executed, the BRK instruction would trigger a trap into kernel. This is >> a common technique used on platforms that don't support hardware single- >> step. However, since Arm64 does support hardware single-stepping, kernel >> enables it in pre_ssout(), allowing the CPU to automatically trap into kernel >> after instruction in xol slot executed. But even we move uprobes over >> to a BRK rather than a single-step. It can't reduce the overhead of user-> >> kernel->user context switch on the bare-metal. Maybe I am wrong, Mark, >> could you give more details about the BRK. >> > > I see, thanks for elaborating. So the suggestion was to go from very > expensive single-stepping mode to still expensive breakpoint-based > kernel->user->kernel workflow. > > I think either way it's going to be much slower than avoiding > kernel->user->kernel hop, so we should emulate STP instead, yep. Exactly, in most cases, simulation is the better option. > >>> >>>> >>>> * XOL management is absurdly expensive. >>>> >>>> Does uprobes keep the XOL slot around (like krpobes does), or does it >>>> create the slot afresh for each trap? >>> >>> XOL *page* is created once per process, lazily, and then we just >>> juggle a bunch of fixed slots there for each instance of >>> single-stepped uprobe. And yes, there are some bottlenecks in XOL >>> management, though it's mostly due to lock contention (as it is >>> implemented right now).
Liao and Oleg have been improving XOL >>> management, but still, avoiding XOL in the first place is the much >>> preferred way. >>> >>>> >>>> If that's trying to create a slot afresh for each trap, there are >>>> several opportunities for improvement, e.g. keep the slot around for >>>> as long as the uprobe exists, or pre-allocate shared slots for common >>>> instructions and use those. >>> >>> As I mentioned, a XOL page is allocated and mapped once, but yes, it >>> seems like we dynamically get a slot in it for each single-stepped >>> execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's >>> probably not a bad idea to just cache and hold a XOL slot for each >>> specific uprobe, I don't see why we should limit ourselves to just one >>> XOL page. We also don't need to pre-size each slot, we can probably >>> allocate just the right amount of space for a given uprobe. >>> >>> All good ideas for sure, we should do them, IMO. But we'll still be >>> paying an extra kernel->user->kernel switch, which almost certainly is >>> slower than doing a simple stack push emulation just like we do in >>> x86-64 case, no? >>> >>> >>> BTW, I did a quick local profiling run. I don't think XOL management >>> is the main source of overhead. I see 5% of CPU cycles spent in >>> arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack >>> traces. There are at least 22% CPU cycles spent in some >>> local_daif_restore function, though, not sure what that is, but might >>> be related to interrupt handling, right? >> >> The local_daif_restore() is part of the path for all user->kernel->user >> context switch, including interrupt handling, breakpoints, and single-stepping >> etc. I am surprised to see it consuming 22% of CPU cycles as well. I haven't >> been enable to reproduce this on my local machine. 
>> >> Andrii, could you use the patch below to see if it can reduce the 5% of >> CPU cycles spent in arch_uprobe_copy_ixol, I doubt that D/I cache >> synchronization is the cause of this part of overhead. >> >> https://lore.kernel.org/all/20240919121719.2148361-1-liaochang1@huawei.com/ > > tbh, I think pre-allocating and setting up fixed XOL slots instead of > dynamically allocating them is the way to go. We can allocate as many > special "[uprobes]" pages as necessary to accommodate all the > single-stepped uprobes in XOL, remember their index, etc. I think > that's much more performant (and simpler, IMO) approach overall. All > this preparation, however expensive it might be, will be done once per > each attached/detached uprobe/uretprobe, which is the place where we > want to do expensive stuff. Not when uprobe/uretprobe is actually > triggered.

Generally agreed. But I have two concerns about pre-allocating XOL slots:

1. If some uprobes/uretprobes are rarely or never triggered, pre-allocating slots for them seems wasteful. However, that isn't an issue on machines with ample memory (e.g., hundreds of GB or a couple of TB).

2. Currently, threads that trigger the same uprobe/uretprobe are dynamically allocated different slots. Since we can't predict how many threads will trigger the same uprobe/uretprobe, we can't pre-allocate enough slots per uprobe/uretprobe. If instead all threads share the slot associated with each uprobe/uretprobe, XOL pages would end up shared across processes, which I think would require significant changes to the page management of xol_area and to fault handling.

> >> >>> >>> >>> The take away I'd like to communicate here is avoiding the >>> single-stepping need is *the best way* to go, IMO. So if we can >>> emulate those STP instructions for uprobe *cheaply*, that would be >>> awesome.
>> >> Given some significant uprobe optimizations from Oleg and Andrii >> merged, I am curious to see how these changes impact the profiling >> result on Arm64. So I re-ran the selftest bench on the latest kernel >> (based on tag next-20241104) and the kernel (based on tag next-20240909) >> that I used when I submitted this patch. The results re-ran are shown >> below. >> >> next-20240909(xol stp + xol nop) >> -------------------------------- >> uprobe-nop ( 1 cpus): 0.424 ± 0.000M/s ( 0.424M/s/cpu) >> uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu) >> uprobe-ret ( 1 cpus): 2.101 ± 0.002M/s ( 2.101M/s/cpu) >> uretprobe-nop ( 1 cpus): 0.347 ± 0.000M/s ( 0.347M/s/cpu) >> uretprobe-push ( 1 cpus): 0.349 ± 0.000M/s ( 0.349M/s/cpu) >> uretprobe-ret ( 1 cpus): 1.051 ± 0.001M/s ( 1.051M/s/cpu) >> >> next-20240909(sim stp + sim nop) >> -------------------------------- >> uprobe-nop ( 1 cpus): 2.042 ± 0.002M/s ( 2.042M/s/cpu) >> uprobe-push ( 1 cpus): 1.363 ± 0.002M/s ( 1.363M/s/cpu) >> uprobe-ret ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu) >> uretprobe-nop ( 1 cpus): 1.049 ± 0.001M/s ( 1.049M/s/cpu) >> uretprobe-push ( 1 cpus): 0.780 ± 0.000M/s ( 0.780M/s/cpu) >> uretprobe-ret ( 1 cpus): 1.065 ± 0.001M/s ( 1.065M/s/cpu) >> >> next-20241104 (xol stp + sim nop) >> --------------------------------- >> uprobe-nop ( 1 cpus): 2.044 ± 0.003M/s ( 2.044M/s/cpu) >> uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu) >> uprobe-ret ( 1 cpus): 2.047 ± 0.001M/s ( 2.047M/s/cpu) >> uretprobe-nop ( 1 cpus): 0.832 ± 0.003M/s ( 0.832M/s/cpu) >> uretprobe-push ( 1 cpus): 0.328 ± 0.000M/s ( 0.328M/s/cpu) >> uretprobe-ret ( 1 cpus): 0.833 ± 0.003M/s ( 0.833M/s/cpu) >> >> next-20241104 (sim stp + sim nop) >> --------------------------------- >> uprobe-nop ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu) >> uprobe-push ( 1 cpus): 1.411 ± 0.002M/s ( 1.411M/s/cpu) >> uprobe-ret ( 1 cpus): 2.052 ± 0.005M/s ( 2.052M/s/cpu) >> uretprobe-nop ( 1 cpus): 0.839 ± 0.005M/s ( 0.839M/s/cpu) >> 
uretprobe-push ( 1 cpus): 0.702 ± 0.002M/s ( 0.702M/s/cpu) >> uretprobe-ret ( 1 cpus): 0.837 ± 0.001M/s ( 0.837M/s/cpu) >> >> It seems that the STP simluation approach in this patch significantly >> improves uprobe-push throughtput by 240% (from 0.415Ms/ to 1.411M/s) >> and uretprobe-push by 114% (from 0.328M/s to 0.702M/s) on kernels >> bases on next-20240909 and next-20241104. While there is still room >> for improvement to reach the throughput of -nop and -ret, the gains >> are very substantail. >> >> But I'm a bit puzzled by the throughput of uprobe/uretprobe-push using >> single-stepping stp, which are far lower compared to the result when >> when I submitted patch(look closely to the uprobe-push and uretprobe-push >> results in commit log). I'm certain that the tests were run on the >> same bare-metal machine with background tasked minimized. I doubt some >> uncommitted uprobe optimization on my local repo twist the result of >> -push using single-step. > > You can always profiler and compare before/after, right? See where > added costs are coming from?

Yes, I've been looking through the git reflog to restore the kernel tree to the state it was in when I submitted this patch. Regarding the throughput data of uprobe/uretprobe-push using single-stepping, I believe *the re-run results provide a more accurate picture*. If we carefully compare the throughput of -nop and -push, we can see that they are very close (uprobe-nop is 0.424M/s/cpu vs uprobe-push at 0.415M/s/cpu; uretprobe-nop is 0.347M/s/cpu vs uretprobe-push at 0.349M/s/cpu). This is expected, as both use single-stepping to execute NOP and STP, and there's no reason -push using single-step should outperform -nop (uprobe-push was 0.868M/s/cpu a month ago). In summary, understanding these performance characteristics is crucial for selecting an approach that balances accuracy and efficiency. This patch offers a lower gain compared to my initial implementation.
But, considering the complexity of 'STP', the cost seems acceptable, right? > >> >> In addition to the micro benchmark, I also re-ran Redis benchmark to >> compare the impact of single-stepping STP and simluated STP to the >> throughput of redis-server. I believe the impact of uprobe on the real >> application depends on the frequency of uprobe triggered and the application's >> hot paths. Therefore, I wouldn't say the simluated STP will benefit all >> real world applications. > > It will benefit *a lot* of real world applications, though, so I think > it's very important to improve. Good to see that. > >> >> $ redis-benchmark -h [redis-server IP] -p 7778 -n 64000 -d 4 -c 128 -t SET >> $ redis-server --port 7778 --protected-mode no --save "" --appendonly no & && >> bpftrace -e 'uprobe:redis-server:readQueryFromClient{} >> uprobe:redis-server:processCommand{} >> uprobe:redis-server:aeApiPoll {}' >> >> next-20241104 >> ------------- >> RPS: 55602.1 >> >> next-20241104 + ss stp >> ---------------------- >> RPS: 47220.9 >> uprobe@@aeApiPoll: 554565 >> uprobe@processCommand: 1275160 >> uprobe@readQueryFromClient: 1277710 >>>> next-20241104 + sim stp >> ----------------------- >> RPS 54290.09 >> uprobe@aeApiPoll: 496007 >> uprobe@processCommand: 1275160 >> uprobe@readQueryFromClient: 1277710 >> >> Andrii expressed concern that the STP simulation in this patch is too >> expensive. If we believe the result I re-ran, perhaps it is not a >> bad way to simluate STP. Looking forward to your feedbacks, or someone >> could propose a cheaper way to simluate STP, I'm very happy to test it >> on my machine, thanks. > > I'm no ARM64 expert, but seeing that we simulate stack pushes with > just memory reads/write for x86-64, it feels like that should be > satisfactory for ARM64. So I'd suggest you to go back to the initial > implementation, clean it up, rebase, re-benchmark, and send a new > revision. Let's continue the discussion there? 
I've re-benchmarked the initial implementation on the latest kernel tree (tag next-20241104). I've pasted the results for single-stepping, the initial patch, and this patch below for comparison:

next-20241104 + xol stp
-----------------------
uprobe-push ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu)
uretprobe-push ( 1 cpus): 0.328 ± 0.000M/s ( 0.328M/s/cpu)

next-20241104 + initial patch
-----------------------------
uprobe-push ( 1 cpus): 1.798 ± 0.001M/s ( 1.798M/s/cpu)
uretprobe-push ( 1 cpus): 0.806 ± 0.001M/s ( 0.806M/s/cpu)

next-20241104 + this patch
--------------------------
uprobe-push ( 1 cpus): 1.411 ± 0.002M/s ( 1.411M/s/cpu)
uretprobe-push ( 1 cpus): 0.702 ± 0.002M/s ( 0.702M/s/cpu)

As shown in the benchmark results, the initial implementation offers a limited performance advantage, and it comes at the cost of accuracy. > >> >> [...] >> >>>>> >>>>> xol-stp >>>>> ------- >>>>> uprobe-nop ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu) >>>>> uprobe-push ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu) >>>>> uprobe-ret ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu) >>>>> uretprobe-nop ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu) >>>>> uretprobe-push ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu) >>>>> uretprobe-ret ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu) >>>>> >>>>> simulated-stp >>>>> ------------- >>>>> uprobe-nop ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu) >>>>> uprobe-push ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu) >>>>> uprobe-ret ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu) >>>>> uretprobe-nop ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu) >>>>> uretprobe-push ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu) >>>>> uretprobe-ret ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu) >>>>> >>>>> The profiling results based on the upstream kernel with spinlock >>>>> optimization patches [2] reveals the simulation of STP increase the >>>>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.1238M/s/cpu) and >>>>> uretprobe-push by 15.9% (from
0.616M/s/cpu to 0.714M/s/cpu). >>>>> >>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/ >>>>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/ >>>>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/ >>>>> >>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com> >>>>> --- >>>>> arch/arm64/include/asm/insn.h | 1 + >>>>> arch/arm64/kernel/probes/decode-insn.c | 16 +++++ >>>>> arch/arm64/kernel/probes/decode-insn.h | 1 + >>>>> arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++ >>>>> arch/arm64/kernel/probes/simulate-insn.h | 1 + >>>>> arch/arm64/kernel/probes/uprobes.c | 21 ++++++ >>>>> arch/arm64/lib/insn.c | 5 ++ >>>>> 7 files changed, 134 insertions(+) >>>>> >>> >>> [...] >> >> -- >> BR >> Liao, Chang >>
On Thu, Nov 7, 2024 at 7:13 PM Liao, Chang <liaochang1@huawei.com> wrote:
>
>
> On 2024/11/7 3:45, Andrii Nakryiko wrote:
> > On Tue, Nov 5, 2024 at 4:22 AM Liao, Chang <liaochang1@huawei.com> wrote:
> >>
> >> Andrii and Mark.
> >>
> >> On 2024/10/26 4:51, Andrii Nakryiko wrote:
> >>> On Thu, Oct 24, 2024 at 7:06 AM Mark Rutland <mark.rutland@arm.com> wrote:
> >>>>
> >>>> On Tue, Sep 10, 2024 at 06:04:07AM +0000, Liao Chang wrote:
> >>>>> This patch is the second part of a series to improve the selftest bench
> >>>>> of uprobe/uretprobe [0]. The lack of simulating 'stp fp, lr, [sp, #imm]'
> >>>>> significantly impacts uprobe/uretprobe performance at function entry in
> >>>>> most use cases. The profiling results below reveal that the STP executed
> >>>>> in the xol slot, which traps back to the kernel, noticeably reduces
> >>>>> redis RPS and increases the runtime of string grep.
> >>>>>
> >>>>> On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores @ 2.4GHz.
> >>>>>
> >>>>> Redis GET (higher is better)
> >>>>> ----------------------------
> >>>>> No uprobe:          49149.71 RPS
> >>>>> Single-stepped STP: 46750.82 RPS
> >>>>> Emulated STP:       48981.19 RPS
> >>>>>
> >>>>> Redis SET (higher is better)
> >>>>> ----------------------------
> >>>>> No uprobe:          49761.14 RPS
> >>>>> Single-stepped STP: 45255.01 RPS
> >>>>> Emulated STP:       48619.21 RPS
> >>>>>
> >>>>> Grep (lower is better)
> >>>>> ----------------------
> >>>>> No uprobe:          2.165s
> >>>>> Single-stepped STP: 15.314s
> >>>>> Emulated STP:       2.216s
> >>>>
> >>>> The results for grep are concerning.
> >>>>
> >>>> In theory, the overhead for stepping should be roughly double the
> >>>> overhead for emulating, assuming the exception-entry and
> >>>> exception-return are the dominant cost. The cost of stepping should be
> >>>> trivial.
> >>>>
> >>>> Those results show emulating adds 0.051s (for a ~2.4% overhead), while
> >>>> stepping adds 13.149s (for a ~607% overhead), meaning stepping is 250x
> >>>> more expensive.
> >>>>
> >>>> Was this tested bare-metal, or in a VM?
> >>>
> >>> Hey Mark, I hope Liao will have a chance to reply, I don't know the
> >>> details of his benchmarking. But I can try to give you my numbers and
> >>> maybe answer a few questions, hopefully that helps move the
> >>> conversation forward.
> >>>
> >>> So, first of all, I did a quick benchmark on bare metal (without
> >>> Liao's optimization, though), here are my results:
> >>>
> >>> uprobe-nop      ( 1 cpus): 2.334 ± 0.011M/s ( 2.334M/s/cpu)
> >>> uprobe-push     ( 1 cpus): 2.321 ± 0.010M/s ( 2.321M/s/cpu)
> >>> uprobe-ret      ( 1 cpus): 4.144 ± 0.041M/s ( 4.144M/s/cpu)
> >>>
> >>> uretprobe-nop   ( 1 cpus): 1.684 ± 0.004M/s ( 1.684M/s/cpu)
> >>> uretprobe-push  ( 1 cpus): 1.736 ± 0.003M/s ( 1.736M/s/cpu)
> >>> uretprobe-ret   ( 1 cpus): 2.502 ± 0.006M/s ( 2.502M/s/cpu)
> >>>
> >>> uretprobes are inherently slower, so I'll just compare uprobe, as the
> >>> differences are very clear either way.
> >>>
> >>> -nop is literally nop (Liao solved that issue, I just don't have his
> >>> patch applied on my test machine). -push has `stp x29, x30, [sp,
> >>> #-0x10]!` instruction traced. -ret is literally just `ret`
> >>> instruction.
> >>>
> >>> So you can see that -ret is almost twice as fast as the -push variant
> >>> (it's a microbenchmark, yes, but still).
> >>>
> >>>>
> >>>> AFAICT either:
> >>>>
> >>>> * Single-stepping is unexpectedly expensive.
> >>>>
> >>>>   Historically we had performance issues with hypervisor trapping of
> >>>>   debug features, and there are things we might be able to improve in
> >>>>   the hypervisor and kernel, which would improve stepping *all*
> >>>>   instructions.
> >>>>
> >>>
> >>> Single-stepping will always be more expensive, as it necessitates an
> >>> extra hop kernel->user space->kernel, so no matter the optimization
> >>> for single-stepping, if we can avoid it, we should. It will be
> >>> noticeable.
> >>>
> >>>> If stepping is the big problem, we could move uprobes over to a BRK
> >>>> rather than a single-step. That would require updating and fixing
> >>>> the logic to decide which instructions are steppable, but that's
> >>>> necessary anyway given it has extant soundness issues.
> >>>
> >>> I'm afraid I don't understand what BRK means and what are the
> >>> consequences in terms of overheads. I'm not an ARM person either, so
> >>> sorry if that's a stupid question. But either way, I can't address
> >>> this. But see above, emulating an instruction feels like a much better
> >>> approach, if possible.
> >>
> >> As I understand, Mark's suggestion is to place a BRK instruction next to
> >> the instruction in the xol slot. Once the instruction in the xol slot
> >> has executed, the BRK instruction triggers a trap into the kernel. This
> >> is a common technique used on platforms that don't support hardware
> >> single-step. However, since Arm64 does support hardware single-stepping,
> >> the kernel enables it in pre_ssout(), allowing the CPU to automatically
> >> trap into the kernel after the instruction in the xol slot has executed.
> >> But even if we move uprobes over to a BRK rather than a single-step, it
> >> can't reduce the overhead of the user->kernel->user context switch on
> >> bare metal. Maybe I am wrong, Mark, could you give more details about
> >> the BRK?
> >
> > I see, thanks for elaborating. So the suggestion was to go from a very
> > expensive single-stepping mode to a still expensive breakpoint-based
> > kernel->user->kernel workflow.
> >
> > I think either way it's going to be much slower than avoiding the
> > kernel->user->kernel hop, so we should emulate STP instead, yep.
>
> Exactly, in most cases, simulation is the better option.
>
> >
> >>>
> >>>>
> >>>> * XOL management is absurdly expensive.
> >>>>
> >>>>   Does uprobes keep the XOL slot around (like kprobes does), or does it
> >>>>   create the slot afresh for each trap?
> >>>
> >>> XOL *page* is created once per process, lazily, and then we just
> >>> juggle a bunch of fixed slots there for each instance of
> >>> single-stepped uprobe. And yes, there are some bottlenecks in XOL
> >>> management, though it's mostly due to lock contention (as it is
> >>> implemented right now). Liao and Oleg have been improving XOL
> >>> management, but still, avoiding XOL in the first place is the much
> >>> preferred way.
> >>>
> >>>>
> >>>> If that's trying to create a slot afresh for each trap, there are
> >>>> several opportunities for improvement, e.g. keep the slot around for
> >>>> as long as the uprobe exists, or pre-allocate shared slots for common
> >>>> instructions and use those.
> >>>
> >>> As I mentioned, a XOL page is allocated and mapped once, but yes, it
> >>> seems like we dynamically get a slot in it for each single-stepped
> >>> execution (see xol_take_insn_slot() in kernel/events/uprobes.c). It's
> >>> probably not a bad idea to just cache and hold a XOL slot for each
> >>> specific uprobe, I don't see why we should limit ourselves to just one
> >>> XOL page. We also don't need to pre-size each slot, we can probably
> >>> allocate just the right amount of space for a given uprobe.
> >>>
> >>> All good ideas for sure, we should do them, IMO. But we'll still be
> >>> paying an extra kernel->user->kernel switch, which almost certainly is
> >>> slower than doing a simple stack push emulation just like we do in
> >>> the x86-64 case, no?
> >>>
> >>>
> >>> BTW, I did a quick local profiling run. I don't think XOL management
> >>> is the main source of overhead. I see 5% of CPU cycles spent in
> >>> arch_uprobe_copy_ixol, but other than that XOL doesn't figure in stack
> >>> traces. There are at least 22% CPU cycles spent in some
> >>> local_daif_restore function, though, not sure what that is, but might
> >>> be related to interrupt handling, right?
> >>
> >> The local_daif_restore() is part of the path for all user->kernel->user
> >> context switches, including interrupt handling, breakpoints,
> >> single-stepping, etc. I am surprised to see it consuming 22% of CPU
> >> cycles as well. I haven't been able to reproduce this on my local
> >> machine.
> >>
> >> Andrii, could you use the patch below to see if it can reduce the 5% of
> >> CPU cycles spent in arch_uprobe_copy_ixol? I doubt that D/I cache
> >> synchronization is the cause of this part of the overhead.
> >>
> >> https://lore.kernel.org/all/20240919121719.2148361-1-liaochang1@huawei.com/
> >
> > tbh, I think pre-allocating and setting up fixed XOL slots instead of
> > dynamically allocating them is the way to go. We can allocate as many
> > special "[uprobes]" pages as necessary to accommodate all the
> > single-stepped uprobes in XOL, remember their index, etc. I think
> > that's a much more performant (and simpler, IMO) approach overall. All
> > this preparation, however expensive it might be, will be done once per
> > each attached/detached uprobe/uretprobe, which is the place where we
> > want to do expensive stuff. Not when uprobe/uretprobe is actually
> > triggered.
>
> Generally agreed. But I have two concerns about pre-allocating XOL slots:
>
> 1. If some uprobes/uretprobes are rarely or never triggered, pre-allocating
>    slots for them seems wasteful. However, it isn't an issue on machines
>    with ample memory (e.g., hundreds of GB or a couple of TB).

I'm not too worried here because a) hopefully most uprobes won't need
XOL and b) if a user attaches to lots of uprobes and needs lots of XOL
slots, then the machine is assumed to be powerful enough to handle that.
If we run out of memory, then that machine shouldn't be uprobed that
extensively. Generally, I think this is not a big concern.
But keep in mind, if we go this way, I think we need to make XOL slots
as small and exactly sized as necessary to fit just enough code to
handle single-stepping for that particular uprobe. Currently we size
them at 128 bytes to make it easy to dynamically allocate them in the
hot path. With pre-allocated XOL we have different tradeoffs and I'd
allocate them as tightly as possible, because attach/detach can be
significantly slower than uprobe/uretprobe triggering.

>
> 2. Currently, threads that trigger the same uprobe/uretprobe are
>    dynamically allocated different slots. Since we can't predict how many
>    threads will trigger the same uprobe/uretprobe, we can't pre-allocate
>    enough slots for each uprobe/uretprobe. If you allow all threads to
>    share the slot associated with each uprobe/uretprobe, this would
>    result in XOL pages being shared across processes; I think it would
>    introduce significant changes to the page management of xol_area and
>    fault handling.

Yes, I was thinking that all threads will share the same XOL slot, why
would you want that per-thread? And yes, it's a pretty significant
change, but I think it's a good one.

>
>
> >>
> >>>
> >>>
> >>> The takeaway I'd like to communicate here is that avoiding the
> >>> single-stepping need is *the best way* to go, IMO. So if we can
> >>> emulate those STP instructions for uprobe *cheaply*, that would be
> >>> awesome.
> >>
> >> Given some significant uprobe optimizations from Oleg and Andrii
> >> merged, I am curious to see how these changes impact the profiling
> >> results on Arm64. So I re-ran the selftest bench on the latest kernel
> >> (based on tag next-20241104) and the kernel (based on tag next-20240909)
> >> that I used when I submitted this patch. The re-run results are shown
> >> below.
> >>
> >> next-20240909 (xol stp + xol nop)
> >> ---------------------------------
> >> uprobe-nop      ( 1 cpus): 0.424 ± 0.000M/s ( 0.424M/s/cpu)
> >> uprobe-push     ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu)
> >> uprobe-ret      ( 1 cpus): 2.101 ± 0.002M/s ( 2.101M/s/cpu)
> >> uretprobe-nop   ( 1 cpus): 0.347 ± 0.000M/s ( 0.347M/s/cpu)
> >> uretprobe-push  ( 1 cpus): 0.349 ± 0.000M/s ( 0.349M/s/cpu)
> >> uretprobe-ret   ( 1 cpus): 1.051 ± 0.001M/s ( 1.051M/s/cpu)
> >>
> >> next-20240909 (sim stp + sim nop)
> >> ---------------------------------
> >> uprobe-nop      ( 1 cpus): 2.042 ± 0.002M/s ( 2.042M/s/cpu)
> >> uprobe-push     ( 1 cpus): 1.363 ± 0.002M/s ( 1.363M/s/cpu)
> >> uprobe-ret      ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu)
> >> uretprobe-nop   ( 1 cpus): 1.049 ± 0.001M/s ( 1.049M/s/cpu)
> >> uretprobe-push  ( 1 cpus): 0.780 ± 0.000M/s ( 0.780M/s/cpu)
> >> uretprobe-ret   ( 1 cpus): 1.065 ± 0.001M/s ( 1.065M/s/cpu)
> >>
> >> next-20241104 (xol stp + sim nop)
> >> ---------------------------------
> >> uprobe-nop      ( 1 cpus): 2.044 ± 0.003M/s ( 2.044M/s/cpu)
> >> uprobe-push     ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu)
> >> uprobe-ret      ( 1 cpus): 2.047 ± 0.001M/s ( 2.047M/s/cpu)
> >> uretprobe-nop   ( 1 cpus): 0.832 ± 0.003M/s ( 0.832M/s/cpu)
> >> uretprobe-push  ( 1 cpus): 0.328 ± 0.000M/s ( 0.328M/s/cpu)
> >> uretprobe-ret   ( 1 cpus): 0.833 ± 0.003M/s ( 0.833M/s/cpu)
> >>
> >> next-20241104 (sim stp + sim nop)
> >> ---------------------------------
> >> uprobe-nop      ( 1 cpus): 2.052 ± 0.002M/s ( 2.052M/s/cpu)
> >> uprobe-push     ( 1 cpus): 1.411 ± 0.002M/s ( 1.411M/s/cpu)
> >> uprobe-ret      ( 1 cpus): 2.052 ± 0.005M/s ( 2.052M/s/cpu)
> >> uretprobe-nop   ( 1 cpus): 0.839 ± 0.005M/s ( 0.839M/s/cpu)
> >> uretprobe-push  ( 1 cpus): 0.702 ± 0.002M/s ( 0.702M/s/cpu)
> >> uretprobe-ret   ( 1 cpus): 0.837 ± 0.001M/s ( 0.837M/s/cpu)
> >>
> >> It seems that the STP simulation approach in this patch significantly
> >> improves uprobe-push throughput by 240% (from 0.415M/s to 1.411M/s)
> >> and uretprobe-push by 114% (from 0.328M/s to 0.702M/s) on kernels
> >> based on next-20240909 and next-20241104. While there is still room
> >> for improvement to reach the throughput of -nop and -ret, the gains
> >> are very substantial.
> >>
> >> But I'm a bit puzzled by the throughput of uprobe/uretprobe-push using
> >> single-stepped STP, which is far lower than the results from when I
> >> submitted this patch (look closely at the uprobe-push and uretprobe-push
> >> results in the commit log). I'm certain that the tests were run on the
> >> same bare-metal machine with background tasks minimized. I suspect some
> >> uncommitted uprobe optimization in my local repo twisted the earlier
> >> single-stepped -push results.
> >
> > You can always profile and compare before/after, right? See where
> > added costs are coming from?
>
> Yes, I've been looking through the git reflog to restore the
> kernel tree to the state when I submitted this patch.
>
> Regarding the throughput data of uprobe/uretprobe-push using
> single-stepping, I believe *the re-run results provide a more
> accurate picture*. If we carefully compare the throughput of
> -nop and -push, we can see that they are very close (uprobe-nop
> is 0.424M/s/cpu and uprobe-push is 0.415M/s/cpu; uretprobe-nop
> is 0.347M/s/cpu and uretprobe-push is 0.349M/s/cpu). This is
> expected, as both use single-stepping to execute NOP and STP.
> There's no reason -push using single-step should outperform
> -nop (uprobe-push was 0.868M/s/cpu a month ago).
>
> In summary, understanding this performance is crucial for selecting
> an approach that balances accuracy and efficiency. Although this
> patch offers a lower gain compared to my initial implementation,
> considering the complexity of 'STP', the cost seems acceptable,
> right?

But why do the suboptimal thing if we can do better?
> > > > >> > >> In addition to the micro benchmark, I also re-ran Redis benchmark to > >> compare the impact of single-stepping STP and simluated STP to the > >> throughput of redis-server. I believe the impact of uprobe on the real > >> application depends on the frequency of uprobe triggered and the application's > >> hot paths. Therefore, I wouldn't say the simluated STP will benefit all > >> real world applications. > > > > It will benefit *a lot* of real world applications, though, so I think > > it's very important to improve. > > Good to see that. > > > > >> > >> $ redis-benchmark -h [redis-server IP] -p 7778 -n 64000 -d 4 -c 128 -t SET > >> $ redis-server --port 7778 --protected-mode no --save "" --appendonly no & && > >> bpftrace -e 'uprobe:redis-server:readQueryFromClient{} > >> uprobe:redis-server:processCommand{} > >> uprobe:redis-server:aeApiPoll {}' > >> > >> next-20241104 > >> ------------- > >> RPS: 55602.1 > >> > >> next-20241104 + ss stp > >> ---------------------- > >> RPS: 47220.9 > >> uprobe@@aeApiPoll: 554565 > >> uprobe@processCommand: 1275160 > >> uprobe@readQueryFromClient: 1277710 > >>>> next-20241104 + sim stp > >> ----------------------- > >> RPS 54290.09 > >> uprobe@aeApiPoll: 496007 > >> uprobe@processCommand: 1275160 > >> uprobe@readQueryFromClient: 1277710 > >> > >> Andrii expressed concern that the STP simulation in this patch is too > >> expensive. If we believe the result I re-ran, perhaps it is not a > >> bad way to simluate STP. Looking forward to your feedbacks, or someone > >> could propose a cheaper way to simluate STP, I'm very happy to test it > >> on my machine, thanks. > > > > I'm no ARM64 expert, but seeing that we simulate stack pushes with > > just memory reads/write for x86-64, it feels like that should be > > satisfactory for ARM64. So I'd suggest you to go back to the initial > > implementation, clean it up, rebase, re-benchmark, and send a new > > revision. Let's continue the discussion there? 
>
> I've re-benchmarked the initial implementation on the latest kernel
> tree (tag next-20241104). I paste the results for single-stepping,
> the initial patch, and this patch here for comparison. Please see
> the results below:
>
> next-20241104 + xol stp
> -----------------------
> uprobe-push    ( 1 cpus): 0.415 ± 0.001M/s ( 0.415M/s/cpu)
> uretprobe-push ( 1 cpus): 0.328 ± 0.000M/s ( 0.328M/s/cpu)
>
> next-20241104 + initial patch
> -----------------------------
> uprobe-push    ( 1 cpus): 1.798 ± 0.001M/s ( 1.798M/s/cpu)
> uretprobe-push ( 1 cpus): 0.806 ± 0.001M/s ( 0.806M/s/cpu)
>
> next-20241104 + this patch
> --------------------------
> uprobe-push    ( 1 cpus): 1.411 ± 0.002M/s ( 1.411M/s/cpu)
> uretprobe-push ( 1 cpus): 0.702 ± 0.002M/s ( 0.702M/s/cpu)
>
> As shown in the benchmark results, the initial implementation offers
> a limited performance advantage, and it comes at the cost of
> accuracy.

What limited accuracy? And since when is almost 30% (1.4M/s -> 1.8M/s)
a "limited performance advantage"?

>
>
> >>
> >> [...]
> >>
> >>>>>
> >>>>> xol-stp
> >>>>> -------
> >>>>> uprobe-nop      ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu)
> >>>>> uprobe-push     ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu)
> >>>>> uprobe-ret      ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu)
> >>>>> uretprobe-nop   ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu)
> >>>>> uretprobe-push  ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu)
> >>>>> uretprobe-ret   ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu)
> >>>>>
> >>>>> simulated-stp
> >>>>> -------------
> >>>>> uprobe-nop      ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu)
> >>>>> uprobe-push     ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu)
> >>>>> uprobe-ret      ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu)
> >>>>> uretprobe-nop   ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu)
> >>>>> uretprobe-push  ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu)
> >>>>> uretprobe-ret   ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu)
> >>>>>
> >>>>> The profiling results based on the upstream kernel with the spinlock
> >>>>> optimization patches [2] reveal that simulating STP increases
> >>>>> uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.128M/s/cpu)
> >>>>> and uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).
> >>>>>
> >>>>> [0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
> >>>>> [1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
> >>>>> [2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/
> >>>>>
> >>>>> Signed-off-by: Liao Chang <liaochang1@huawei.com>
> >>>>> ---
> >>>>>  arch/arm64/include/asm/insn.h            |  1 +
> >>>>>  arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
> >>>>>  arch/arm64/kernel/probes/decode-insn.h   |  1 +
> >>>>>  arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
> >>>>>  arch/arm64/kernel/probes/simulate-insn.h |  1 +
> >>>>>  arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
> >>>>>  arch/arm64/lib/insn.c                    |  5 ++
> >>>>>  7 files changed, 134 insertions(+)
> >>>>>
> >>>
> >>> [...]
> >>
> >> --
> >> BR
> >> Liao, Chang
> >>
>
> --
> BR
> Liao, Chang
>
diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h
index dd530d5c3d67..74e25debfa75 100644
--- a/arch/arm64/include/asm/insn.h
+++ b/arch/arm64/include/asm/insn.h
@@ -561,6 +561,7 @@ u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type,
 				  u32 insn, u64 imm);
 u32 aarch64_insn_decode_register(enum aarch64_insn_register_type type,
 				 u32 insn);
+u32 aarch64_insn_decode_ldst_size(u32 insn);
 u32 aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr,
 				enum aarch64_insn_branch_type type);
 u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr,
diff --git a/arch/arm64/kernel/probes/decode-insn.c b/arch/arm64/kernel/probes/decode-insn.c
index be54539e309e..847a7a61ff6d 100644
--- a/arch/arm64/kernel/probes/decode-insn.c
+++ b/arch/arm64/kernel/probes/decode-insn.c
@@ -67,6 +67,22 @@ static bool __kprobes aarch64_insn_is_steppable(u32 insn)
 	return true;
 }
 
+bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn)
+{
+	/*
+	 * The 1st instruction on function entry often follows the
+	 * pattern 'stp x29, x30, [sp, #imm]!' that pushes fp and lr
+	 * onto the stack.
+	 */
+	u32 opc = aarch64_insn_decode_ldst_size(insn);
+	u32 rt2 = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT2, insn);
+	u32 rn = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RN, insn);
+	u32 rt = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, insn);
+
+	return aarch64_insn_is_stp_pre(insn) &&
+	       (opc == 2) && (rt2 == 30) && (rn == 31) && (rt == 29);
+}
+
 /* Return:
  *   INSN_REJECTED     If instruction is one not allowed to kprobe,
  *   INSN_GOOD         If instruction is supported and uses instruction slot,
diff --git a/arch/arm64/kernel/probes/decode-insn.h b/arch/arm64/kernel/probes/decode-insn.h
index 8b758c5a2062..033ccab73da6 100644
--- a/arch/arm64/kernel/probes/decode-insn.h
+++ b/arch/arm64/kernel/probes/decode-insn.h
@@ -29,5 +29,6 @@ arm_kprobe_decode_insn(kprobe_opcode_t *addr, struct arch_specific_insn *asi);
 #endif
 enum probe_insn __kprobes
 arm_probe_decode_insn(probe_opcode_t insn, struct arch_probe_insn *asi);
+bool aarch64_insn_is_stp_fp_lr_sp_64b(probe_opcode_t insn);
 
 #endif /* _ARM_KERNEL_KPROBES_ARM64_H */
diff --git a/arch/arm64/kernel/probes/simulate-insn.c b/arch/arm64/kernel/probes/simulate-insn.c
index 5e4f887a074c..3906851c07b2 100644
--- a/arch/arm64/kernel/probes/simulate-insn.c
+++ b/arch/arm64/kernel/probes/simulate-insn.c
@@ -8,6 +8,9 @@
 #include <linux/bitops.h>
 #include <linux/kernel.h>
 #include <linux/kprobes.h>
+#include <linux/highmem.h>
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
 
 #include <asm/ptrace.h>
 #include <asm/traps.h>
@@ -211,3 +214,89 @@ simulate_nop(u32 opcode, long addr, struct pt_regs *regs)
 	 */
 	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
 }
+
+static inline
+bool stack_align_check(unsigned long sp)
+{
+	return (IS_ALIGNED(sp, 16) ||
+		!(read_sysreg(sctlr_el1) & SCTLR_EL1_SA0_MASK));
+}
+
+static inline
+void put_user_stack_pages(struct page **pages, int nr_pages)
+{
+	int i;
+
+	for (i = 0; i < nr_pages; i++)
+		put_page(pages[i]);
+}
+
+static inline
+int get_user_stack_pages(long start, long end, struct page **pages)
+{
+	int ret;
+	int nr_pages = (end >> PAGE_SHIFT) - (start >> PAGE_SHIFT) + 1;
+
+	ret = get_user_pages_fast(start, nr_pages,
+				  FOLL_WRITE | FOLL_FORCE, pages);
+	if (unlikely(ret != nr_pages)) {
+		if (ret > 0)
+			put_user_stack_pages(pages, ret);
+		return 0;
+	}
+
+	return nr_pages;
+}
+
+static inline
+void *map_user_stack_pages(struct page **pages, int nr_pages)
+{
+	if (likely(nr_pages == 1))
+		return kmap_local_page(pages[0]);
+	else
+		return vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
+}
+
+static inline
+void unmap_user_stack_pages(void *kaddr, int nr_pages)
+{
+	if (likely(nr_pages == 1))
+		kunmap_local(kaddr);
+	else
+		vunmap(kaddr);
+}
+
+void __kprobes
+simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs)
+{
+	long imm7;
+	long new_sp;
+	int nr_pages;
+	void *kaddr, *dst;
+	struct page *pages[2] = { NULL };
+
+	imm7 = aarch64_insn_decode_immediate(AARCH64_INSN_IMM_7, opcode);
+	new_sp = regs->sp + (sign_extend64(imm7, 6) << 3);
+	if (!stack_align_check(new_sp)) {
+		force_sig(SIGSEGV);
+		goto done;
+	}
+
+	nr_pages = get_user_stack_pages(new_sp, regs->sp, pages);
+	if (!nr_pages) {
+		force_sig(SIGSEGV);
+		goto done;
+	}
+
+	kaddr = map_user_stack_pages(pages, nr_pages);
+	dst = kaddr + (new_sp & ~PAGE_MASK);
+	asm volatile("stp %0, %1, [%2]"
+		     : : "r"(regs->regs[29]), "r"(regs->regs[30]), "r"(dst));
+
+	unmap_user_stack_pages(kaddr, nr_pages);
+	put_user_stack_pages(pages, nr_pages);
+
+done:
+	regs->sp = new_sp;
+	arm64_skip_faulting_instruction(regs, AARCH64_INSN_SIZE);
+}
diff --git a/arch/arm64/kernel/probes/simulate-insn.h b/arch/arm64/kernel/probes/simulate-insn.h
index efb2803ec943..733a47ffa2e5 100644
--- a/arch/arm64/kernel/probes/simulate-insn.h
+++ b/arch/arm64/kernel/probes/simulate-insn.h
@@ -17,5 +17,6 @@ void simulate_tbz_tbnz(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_ldr_literal(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_ldrsw_literal(u32 opcode, long addr, struct pt_regs *regs);
 void simulate_nop(u32 opcode, long addr, struct pt_regs *regs);
+void simulate_stp_fp_lr_sp_64b(u32 opcode, long addr, struct pt_regs *regs);
 
 #endif /* _ARM_KERNEL_KPROBES_SIMULATE_INSN_H */
diff --git a/arch/arm64/kernel/probes/uprobes.c b/arch/arm64/kernel/probes/uprobes.c
index d49aef2657cd..c70862314fde 100644
--- a/arch/arm64/kernel/probes/uprobes.c
+++ b/arch/arm64/kernel/probes/uprobes.c
@@ -8,6 +8,7 @@
 #include <asm/cacheflush.h>
 
 #include "decode-insn.h"
+#include "simulate-insn.h"
 
 #define UPROBE_INV_FAULT_CODE	UINT_MAX
 
@@ -31,6 +32,21 @@ unsigned long uprobe_get_swbp_addr(struct pt_regs *regs)
 	return instruction_pointer(regs);
 }
 
+static enum probe_insn
+arm_uprobe_decode_special_insn(probe_opcode_t insn, struct arch_probe_insn *api)
+{
+	/*
+	 * When uprobe interacts with VMSA features, such as MTE, POE, etc.,
+	 * it gives up the simulation of memory access related instructions.
+	 */
+	if (!system_supports_mte() && aarch64_insn_is_stp_fp_lr_sp_64b(insn)) {
+		api->handler = simulate_stp_fp_lr_sp_64b;
+		return INSN_GOOD_NO_SLOT;
+	}
+
+	return INSN_REJECTED;
+}
+
 int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
 		unsigned long addr)
 {
@@ -44,6 +60,11 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm,
 
 	insn = *(probe_opcode_t *)(&auprobe->insn[0]);
 
+	if (arm_uprobe_decode_special_insn(insn, &auprobe->api)) {
+		auprobe->simulate = true;
+		return 0;
+	}
+
 	switch (arm_probe_decode_insn(insn, &auprobe->api)) {
 	case INSN_REJECTED:
 		return -EINVAL;
diff --git a/arch/arm64/lib/insn.c b/arch/arm64/lib/insn.c
index b008a9b46a7f..0635219d2196 100644
--- a/arch/arm64/lib/insn.c
+++ b/arch/arm64/lib/insn.c
@@ -238,6 +238,11 @@ static u32 aarch64_insn_encode_ldst_size(enum aarch64_insn_size_type type,
 	return insn;
 }
 
+u32 aarch64_insn_decode_ldst_size(u32 insn)
+{
+	return (insn & GENMASK(31, 30)) >> 30;
+}
+
 static inline long label_imm_common(unsigned long pc, unsigned long addr,
 				    long range)
 {
This patch is the second part of a series to improve the selftest bench
of uprobe/uretprobe [0]. The lack of simulation for 'stp fp, lr,
[sp, #imm]' significantly impacts uprobe/uretprobe performance at
function entry in most use cases. The profiling results below reveal
that the STP executed in the xol slot, which traps back to the kernel,
noticeably reduces redis RPS and increases the runtime of string grep.

On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores @ 2.4GHz.

Redis GET (higher is better)
----------------------------
No uprobe:          49149.71 RPS
Single-stepped STP: 46750.82 RPS
Emulated STP:       48981.19 RPS

Redis SET (higher is better)
----------------------------
No uprobe:          49761.14 RPS
Single-stepped STP: 45255.01 RPS
Emulated STP:       48619.21 RPS

Grep (lower is better)
----------------------
No uprobe:          2.165s
Single-stepped STP: 15.314s
Emulated STP:       2.216s

Additionally, profiling the entry instruction of all leaf and non-leaf
functions shows that 'stp fp, lr, [sp, #imm]' accounts for more than
50% of entry instructions. So simulating the STP at function entry is a
worthwhile option for uprobe.

The first version [1] used a uaccess routine to simulate the STP that
pushes fp/lr onto the stack, which used two STTR instructions for the
memory store. But as Mark pointed out, that approach can't reproduce
the correct single-copy atomicity and ordering properties of STP,
especially when it interacts with MTE, POE, etc. So this patch uses a
more complex and less efficient approach that pins the user stack
pages, maps them into the kernel address space, and lets the kernel use
STP directly to push fp/lr into the stack pages.
xol-stp
-------
uprobe-nop      ( 1 cpus): 1.566 ± 0.006M/s ( 1.566M/s/cpu)
uprobe-push     ( 1 cpus): 0.868 ± 0.001M/s ( 0.868M/s/cpu)
uprobe-ret      ( 1 cpus): 1.629 ± 0.001M/s ( 1.629M/s/cpu)
uretprobe-nop   ( 1 cpus): 0.871 ± 0.001M/s ( 0.871M/s/cpu)
uretprobe-push  ( 1 cpus): 0.616 ± 0.001M/s ( 0.616M/s/cpu)
uretprobe-ret   ( 1 cpus): 0.878 ± 0.002M/s ( 0.878M/s/cpu)

simulated-stp
-------------
uprobe-nop      ( 1 cpus): 1.544 ± 0.001M/s ( 1.544M/s/cpu)
uprobe-push     ( 1 cpus): 1.128 ± 0.002M/s ( 1.128M/s/cpu)
uprobe-ret      ( 1 cpus): 1.550 ± 0.005M/s ( 1.550M/s/cpu)
uretprobe-nop   ( 1 cpus): 0.872 ± 0.004M/s ( 0.872M/s/cpu)
uretprobe-push  ( 1 cpus): 0.714 ± 0.001M/s ( 0.714M/s/cpu)
uretprobe-ret   ( 1 cpus): 0.896 ± 0.001M/s ( 0.896M/s/cpu)

The profiling results based on the upstream kernel with the spinlock
optimization patches [2] reveal that simulating STP increases
uprobe-push throughput by 29.3% (from 0.868M/s/cpu to 1.128M/s/cpu) and
uretprobe-push by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).

[0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
[1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
[2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 arch/arm64/include/asm/insn.h            |  1 +
 arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
 arch/arm64/kernel/probes/decode-insn.h   |  1 +
 arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
 arch/arm64/kernel/probes/simulate-insn.h |  1 +
 arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
 arch/arm64/lib/insn.c                    |  5 ++
 7 files changed, 134 insertions(+)