Message ID | 20241010-arm-generic-entry-v1-0-b94f451d087b@linaro.org (mailing list archive) |
---|---|
Series | ARM: Switch to generic entry |
On Thu, Oct 10, 2024 at 01:33:38PM +0200, Linus Walleij wrote:
> This patch series converts a slew of ARM assembly into the
> corresponding C code, step by step moving the codebase
> closer to the expectations of the generic entry code,
> and as a last step switches ARM over to the generic
> entry code.

I haven't looked at the series yet, but I guess we're throwing away
all the effort I put in to make stuff like syscalls as fast as
possible.

So the question is... do we want performance, or do we want generic
(and slower) code?

It seems insane to me that we spend time micro-optimising things like
memcpy, memset, divide routines, but then go and throw away performance
that applications actually rely upon, such as syscall performance.

On Thu, Oct 10, 2024 at 1:55 PM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:

> I haven't looked at the series yet, but I guess we're throwing away
> all the effort I put in to make stuff like syscalls as fast as
> possible.
>
> So the question is... do we want performance, or do we want generic
> (and slower) code?

Yes, the very same question came to me as I was working on it, we
need to reach some conclusion here. Al Viro also put some nice
assembly optimizations in the syscall restart that just go out the
window as well.

Some of the C interspersing relates to the RCU context tracking that
really likes to be called at every single IRQ, FIQ or SWI, and where ARM32
is one of the few last users of the user_exit_callable()/user_enter_callable()
API which is obviously less intrusive as it only needs to get called
at transitions to/from userspace, while these calls are marked
with big block letters as obsolete in the context tracker.

> It seems insane to me that we spend time micro-optimising things like
> memcpy, memset, divide routines, but then go and throw away performance
> that applications actually rely upon, such as syscall performance.

Yes, this series is a real RFC in the true sense of the word.

Yours,
Linus Walleij

On Thu, Oct 10, 2024 at 02:11:09PM +0200, Linus Walleij wrote:
> On Thu, Oct 10, 2024 at 1:55 PM Russell King (Oracle)
> <linux@armlinux.org.uk> wrote:
>
> > I haven't looked at the series yet, but I guess we're throwing away
> > all the effort I put in to make stuff like syscalls as fast as
> > possible.
> >
> > So the question is... do we want performance, or do we want generic
> > (and slower) code?
>
> Yes, the very same question that came to me as I was working on it, we
> need to reach some conclusion here. Al Viro also put some nice
> assembly optimizations in the syscall restart that just go out the
> window as well.
>
> Some of the C interspersing relates to the RCU context tracking that
> really likes to be called at every single IRQ, FIQ or SWI, and where ARM32
> is one of the few last users of the user_exit_callable()/user_enter_callable()
> API which is obviously less intrusive as it only needs to get called
> at transitions to/from userspace, while these calls are marked
> with big block letters as obsolete in the context tracker.
>
> > It seems insane to me that we spend time micro-optimising things like
> > memcpy, memset, divide routines, but then go and throw away performance
> > that applications actually rely upon, such as syscall performance.
>
> Yes, this series is a real RFC in the true sense of the word.

I think we need to quantify what the effect is on performance by
making these changes, and I think we need to measure more than just
syscall entry/exit performance: we need the overall performance impact
on userspace when the system is under a certain interrupt load.

One of the things we have to remember is that applications like to
endlessly get system time. Many of these other architectures that
have been converted to this generic code support VDSO. However, 32-bit
ARM generally does not have VDSO to avoid the syscall overhead for
e.g. gettimeofday().

So, we also need to time real workloads to properly understand what
the effect of making these changes is.

This patch series converts a slew of ARM assembly into the
corresponding C code, step by step moving the codebase
closer to the expectations of the generic entry code,
and as a last step switches ARM over to the generic
entry code.

This was inspired by Jinjie Ruan's similar work for ARM64.

The low-level assembly calls into arch/arm/kernel/syscall.c
to invoke syscalls from userspace, and to the functions
listed in arch/arm/kernel/entry.c for any other transitions
to and from userspace. Looking at these functions and the
call sites in the assembly in the final result should give
a pretty good idea about how this works, and what the generic
entry expects from an architecture.

To test the code the following seccomp patch is needed
on older ARM systems:
https://lore.kernel.org/lkml/20241008-seccomp-compile-error-v1-1-f87de4007095@linaro.org

There is a git branch you can pull in and test (v6.12-rc1 based):
https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-integrator.git/log/?h=b4/arm-generic-entry-v6.12-rc1

Upsides:

- The same code paths as x86, S390, RISCV, Loongarch and probably
  soon ARM64 are used for the ARM systems. This includes some
  instrumentation stubs helping out with things we haven't
  even started to look at such as kmsan and live patching (!).

- By introducing the new callbacks to C, we can move away from
  the deprecated (and I think partly unmaintained) context tracking
  mechanism for RCU (user_exit_callable(), user_enter_callable())
  in favor of what everyone else is using, i.e. calling
  rcu_irq_enter_check_tick() on IRQ entry.

- I think also lockdep is now behaving more according to
  expectations (the lockdep calls in ARM64 and generic entry
  seem different and more fine-granular than the ARM32 code)
  but I am no expert in lockdep so I cannot really tell if
  this is a real improvement.

Downsides:

- I had to remove the "fast syscall restart" from Al Viro.
  I don't know how much it will affect performance, but if this
  is something we must have, let's try to make the solution
  generic, i.e. add fast syscall restart in the generic
  entry code.

- The "superfast return to userspace" using just very small
  assembly snippets to get back to userspace on e.g. IRQs
  if and only if no instrumentation was compiled in, is
  no longer possible, since we unconditionally call into
  code written in C.

Testing:

- Booted into Versatile Express QEMU (ARMv7), Ux500 full graphic
  UI (PostmarketOS Phosh), ARMv7 on hardware, Gemini ARMv4 on
  hardware. No special issues.

- Tested some ptrace/strace obviously, such as issuing
  several instances of "ptrace find /" and let this scroll
  by in the terminal over some 10 minutes or so.

- Turned on RCU torture tests and ran for a while. Seems stable
  and the test outputs look normal.

- Ran stress-ng, which triggers the idle bug below that also
  appears during boot.

- perf top doesn't give any output, I don't really know how
  to enable interesting stuff in the kernel to run this tool.
  Help needed.

Potential bugs:

- This comes up during boot and stress-ng runs:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 0 at kernel/context_tracking.c:128 ct_kernel_exit+0xf8/0x100
  CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.12.0-rc1+ #31
  Hardware name: ARM-Versatile Express
  (...)

  It is emitted in kernel/context_tracking.c, ct_kernel_exit():
  WARN_ON_ONCE(ct_nmi_nesting() != CT_NESTING_IRQ_NONIDLE);

  I don't know exactly what's going on here, but it happens right
  after CPU1 is brought online at boot, so there might be some
  unexpected nesting of IPIs happening when CPU1 is brought up?

Open questions:

- Generic entry requires PTRACE_SYSEMU and PTRACE_SYSEMU_SINGLESTEP
  to be defined. I added them but don't even know what they
  do or if generic entry magically adds support for them
  (probably not) so I need help here.

- I need Al Viro's input on how to deal with the "fast syscall
  restart" that I bluntly deleted, if we need to reincarnate
  it in the generic entry or what we shall do here.

- I need to test with an OABI rootfs.

- Performance impact. If this is major I think it's a no-go,
  we need to agree on metrics here however and I need
  suggestions on what to test with.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
---
Linus Walleij (28):
      ARM: Prepare includes for generic entry
      ARM: ptrace: Split report_syscall()
      ARM: entry: Skip ret_slow_syscall label
      ARM: process: Rewrite ret_from_fork i C
      ARM: process: Remove local restart
      ARM: entry: Invoke syscalls using C
      ARM: entry: Rewrite two asm calls in C
      ARM: entry: Move trace entry to C function
      ARM: entry: save the syscall sp in thread_info
      ARM: entry: move all tracing invocation to C
      ARM: entry: Merge the common and trace entry code
      ARM: entry: Rename syscall invocation
      ARM: entry: Create user_mode_enter/exit
      ARM: entry: Drop trace argument from usr_entry macro
      ARM: entry: Separate call path for syscall SWI entry
      ARM: entry: Drop argument to asm_irqentry macros
      ARM: entry: Implement syscall_exit_to_user_mode()
      ARM: entry: Drop the superfast ret_fast_syscall
      ARM: entry: Remove fast and offset register restore
      ARM: entry: Untangle ret_fast_syscall/to_user
      ARM: entry: Do not double-call exit functions
      ARM: entry: Move work processing to C
      ARM: entry: Stop exiting syscalls like IRQs
      ARM: entry: Complete syscall and IRQ transition to C
      ARM: entry: Create irqentry calls from kernel mode
      ARM: entry: Move in-kernel hardirq tracing to C
      ARM: entry: Add FIQ/NMI C callbacks
      ARM: entry: Convert to generic entry

 arch/arm/Kconfig                    |   1 +
 arch/arm/include/asm/entry-common.h |  66 ++++++++++++
 arch/arm/include/asm/entry.h        |  17 +++
 arch/arm/include/asm/ptrace.h       |   8 +-
 arch/arm/include/asm/signal.h       |   4 -
 arch/arm/include/asm/stacktrace.h   |   2 +-
 arch/arm/include/asm/switch_to.h    |   4 +
 arch/arm/include/asm/syscall.h      |   7 ++
 arch/arm/include/asm/thread_info.h  |  18 +---
 arch/arm/include/asm/traps.h        |   2 +-
 arch/arm/include/uapi/asm/ptrace.h  |   2 +
 arch/arm/kernel/Makefile            |   5 +-
 arch/arm/kernel/asm-offsets.c       |   1 +
 arch/arm/kernel/entry-armv.S        |  39 +++----
 arch/arm/kernel/entry-common.S      | 202 ++++++++++++++----------------------
 arch/arm/kernel/entry-header.S      | 108 +++++--------------
 arch/arm/kernel/entry.c             |  59 +++++++++++
 arch/arm/kernel/process.c           |  22 +++-
 arch/arm/kernel/ptrace.c            |  76 --------------
 arch/arm/kernel/signal.c            |  57 ++--------
 arch/arm/kernel/syscall.c           |  31 ++++++
 arch/arm/kernel/traps.c             |   2 +-
 22 files changed, 349 insertions(+), 384 deletions(-)
---
base-commit: e1dc5c87445c608a99e508fe4d3102e2b32858ef
change-id: 20240903-arm-generic-entry-ada145378bbe

Best regards,