arm64: uprobes: Simulate STP for pushing fp/lr into user stack

This patch is the second part of a series to improve the selftest bench
of uprobe/uretprobe [0]. The lack of simulation for 'stp fp, lr, [sp, #imm]'
significantly impacts uprobe/uretprobe performance at function entry in
most use cases. The profiling results below show that an STP that
executes in the xol slot and traps back to the kernel reduces Redis RPS
and noticeably increases the time of a string grep.

On Kunpeng916 (Hi1616), 4 NUMA nodes, 64 Arm64 cores@2.4GHz.

Redis GET (higher is better)
----------------------------
No uprobe: 49149.71 RPS
Single-stepped STP: 46750.82 RPS
Emulated STP: 48981.19 RPS

Redis SET (higher is better)
----------------------------
No uprobe: 49761.14 RPS
Single-stepped STP: 45255.01 RPS
Emulated STP: 48619.21 RPS

Grep (lower is better)
----------------------
No uprobe: 2.165s
Single-stepped STP: 15.314s
Emulated STP: 2.216s

Additionally, profiling the entry instructions of all leaf and non-leaf
functions shows that 'stp fp, lr, [sp, #imm]' accounts for more than 50%
of them. So simulating the STP at function entry is a more viable option
for uprobe.
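
As an illustration of what recognizing this entry instruction involves,
the sketch below matches the 64-bit pre-indexed form
'stp x29, x30, [sp, #imm]!' with a mask/value pair that leaves only the
imm7 field (bits [21:15]) free. The macro and function names are
hypothetical and chosen for this example only, and restricting the match
to the pre-indexed form is an assumption; this is not necessarily the
encoding check the patch adds.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical names, for illustration only.
 *
 * A64 "STP (64-bit, pre-index)" layout:
 *   [31:30] opc = 10, [29:27] = 101, [26] V = 0, [25:23] = 011,
 *   [22] L = 0, [21:15] imm7, [14:10] Rt2, [9:5] Rn, [4:0] Rt
 *
 * The value fixes Rt = x29 (fp), Rt2 = x30 (lr), Rn = 31 (sp);
 * the mask clears only imm7, so any immediate matches.
 */
#define STP_FP_LR_SP_MASK	0xFFC07FFFu
#define STP_FP_LR_SP_VALUE	0xA9807BFDu

/* Return true if insn is "stp x29, x30, [sp, #imm]!" for any imm. */
static bool is_stp_fp_lr_sp(uint32_t insn)
{
	return (insn & STP_FP_LR_SP_MASK) == STP_FP_LR_SP_VALUE;
}
```

For example, 0xa9bf7bfd ('stp x29, x30, [sp, #-16]!') matches, while
0xa9be53f3 ('stp x19, x20, [sp, #-32]!') and the signed-offset form
0xa9017bfd ('stp x29, x30, [sp, #16]') do not.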

The first version [1] used a uaccess routine to simulate the STP that
pushes fp/lr onto the stack, which uses two STTR instructions for the
memory store. But as Mark pointed out, this approach can't simulate the
correct single-copy atomicity and ordering properties of STP, especially
when it interacts with MTE, POE, etc. So this patch uses a more complex
and less efficient approach that acquires the user stack pages, maps
them into the kernel address space, and lets the kernel use STP to push
fp/lr directly into the stack pages.
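
The architectural effect that has to be reproduced can be sketched in
plain user-space C. The model below only shows the address computation,
the paired store and the SP write-back of 'stp x29, x30, [sp, #imm]!';
it uses memcpy() on an ordinary buffer, so it deliberately does not
provide the atomicity and ordering guarantees discussed above, and it
does not pin or map any pages. All names (model_regs, stp_offset,
model_stp_fp_lr_sp) are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified register file for this sketch. */
struct model_regs {
	uint64_t x[31];		/* x0..x30; x29 is fp, x30 is lr */
	uint64_t sp;
};

/* Sign-extend imm7 (bits [21:15]) and scale by 8, as 64-bit STP does. */
static int64_t stp_offset(uint32_t insn)
{
	int64_t imm7 = (insn >> 15) & 0x7f;

	if (imm7 & 0x40)
		imm7 -= 0x80;
	return imm7 * 8;
}

/*
 * Model "stp x29, x30, [sp, #imm]!". 'stack' stands in for a kernel
 * mapping of the user stack; its byte 0 corresponds to user address
 * 'base' and it is 'size' bytes long.
 * Returns 0 on success, -1 if the store would fall outside the buffer
 * (registers and memory are left untouched in that case).
 */
static int model_stp_fp_lr_sp(uint32_t insn, struct model_regs *regs,
			      uint8_t *stack, uint64_t base, size_t size)
{
	uint64_t addr = regs->sp + (uint64_t)stp_offset(insn);
	uint64_t pair[2] = { regs->x[29], regs->x[30] };

	if (size < sizeof(pair) || addr < base ||
	    addr - base > size - sizeof(pair))
		return -1;

	/* fp goes to the lower address, lr to addr + 8. */
	memcpy(stack + (addr - base), pair, sizeof(pair));
	/* Pre-index form: write the updated address back to sp. */
	regs->sp = addr;
	return 0;
}
```

With sp = 0x1040 and a 64-byte buffer based at 0x1000, simulating
0xa9bf7bfd stores fp and lr at buffer offset 48 and leaves sp = 0x1030.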

xol-stp
-------
uprobe-nop      ( 1 cpus):    1.566 ± 0.006M/s  (  1.566M/s/cpu)
uprobe-push     ( 1 cpus):    0.868 ± 0.001M/s  (  0.868M/s/cpu)
uprobe-ret      ( 1 cpus):    1.629 ± 0.001M/s  (  1.629M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.871 ± 0.001M/s  (  0.871M/s/cpu)
uretprobe-push  ( 1 cpus):    0.616 ± 0.001M/s  (  0.616M/s/cpu)
uretprobe-ret   ( 1 cpus):    0.878 ± 0.002M/s  (  0.878M/s/cpu)

simulated-stp
-------------
uprobe-nop      ( 1 cpus):    1.544 ± 0.001M/s  (  1.544M/s/cpu)
uprobe-push     ( 1 cpus):    1.128 ± 0.002M/s  (  1.128M/s/cpu)
uprobe-ret      ( 1 cpus):    1.550 ± 0.005M/s  (  1.550M/s/cpu)
uretprobe-nop   ( 1 cpus):    0.872 ± 0.004M/s  (  0.872M/s/cpu)
uretprobe-push  ( 1 cpus):    0.714 ± 0.001M/s  (  0.714M/s/cpu)
uretprobe-ret   ( 1 cpus):    0.896 ± 0.001M/s  (  0.896M/s/cpu)

The profiling results, based on the upstream kernel with the spinlock
optimization patches [2], show that simulating STP increases
uprobe-push throughput by 30.0% (from 0.868M/s/cpu to 1.128M/s/cpu) and
uretprobe-push throughput by 15.9% (from 0.616M/s/cpu to 0.714M/s/cpu).

[0] https://lore.kernel.org/all/CAEf4BzaO4eG6hr2hzXYpn+7Uer4chS0R99zLn02ezZ5YruVuQw@mail.gmail.com/
[1] https://lore.kernel.org/all/Zr3RN4zxF5XPgjEB@J2N7QTR9R3/
[2] https://lore.kernel.org/all/20240815014629.2685155-1-liaochang1@huawei.com/

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 arch/arm64/include/asm/insn.h            |  1 +
 arch/arm64/kernel/probes/decode-insn.c   | 16 +++++
 arch/arm64/kernel/probes/decode-insn.h   |  1 +
 arch/arm64/kernel/probes/simulate-insn.c | 89 ++++++++++++++++++++++++
 arch/arm64/kernel/probes/simulate-insn.h |  1 +
 arch/arm64/kernel/probes/uprobes.c       | 21 ++++++
 arch/arm64/lib/insn.c                    |  5 ++
 7 files changed, 134 insertions(+)

Message-ID: <20240910060407.1427716-1-liaochang1@huawei.com>
Date: Tue, 10 Sep 2024 06:04:07 +0000
From: Liao Chang <liaochang1@huawei.com>
To: <catalin.marinas@arm.com>, <will@kernel.org>, <mhiramat@kernel.org>, <oleg@redhat.com>, <peterz@infradead.org>, <ast@kernel.org>, <puranjay@kernel.org>, <liaochang1@huawei.com>, <andrii@kernel.org>, <andrii.nakryiko@gmail.com>, <mark.rutland@arm.com>
Cc: <linux-arm-kernel@lists.infradead.org>, <linux-kernel@vger.kernel.org>, <linux-trace-kernel@vger.kernel.org>, <bpf@vger.kernel.org>
State: Not Applicable