From patchwork Wed Sep 28 11:58:51 2022
X-Patchwork-Submitter: Robin Murphy
X-Patchwork-Id: 12992178
From: Robin Murphy
To: will@kernel.org, catalin.marinas@arm.com
Cc: linux-arm-kernel@lists.infradead.org, mark.rutland@arm.com, kristina.martsenko@arm.com
Subject: [PATCH 1/3] arm64: Update copy_from_user()
Date: Wed, 28 Sep 2022 12:58:51 +0100

Replace the old mangled-beyond-hope copy_from_user() routine with a shiny new one based on our newest memcpy() implementation. When plumbed into the standalone memcpy benchmark from the Arm Optimized Routines library, this routine outperforms the old one by up to ~1.7x for small copies, with comparable gains across a range of microarchitectures, levelling off once sizes get up to ~2KB and general load/store throughput starts to dominate.
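
For orientation, the overall shape of the new routine corresponds roughly to the C sketch below. This is illustrative only and not taken from the patch: the function name is made up, the fixed-size memcpy() calls stand in for the paired register loads and stores, the user-side accesses are really unprivileged LDTR loads, and all of the exception-table fixup handling is omitted.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of the copy strategy; not the kernel code. */
static void sketch_copy(char *dst, const char *src, size_t count)
{
	if (count <= 32) {
		/* Small: one chunk from each end; the two may overlap. */
		if (count >= 16) {
			memcpy(dst, src, 16);
			memcpy(dst + count - 16, src + count - 16, 16);
		} else if (count >= 8) {
			memcpy(dst, src, 8);
			memcpy(dst + count - 8, src + count - 8, 8);
		} else if (count >= 4) {
			memcpy(dst, src, 4);
			memcpy(dst + count - 4, src + count - 4, 4);
		} else if (count) {
			if (count & 2)
				memcpy(dst, src, 2);
			if (count & 1)
				dst[count - 1] = src[count - 1];
		}
	} else if (count <= 128) {
		/* Medium: 32 or 64 bytes from the front, then the same
		 * amount anchored at the end, overlapping as needed. */
		memcpy(dst, src, 32);
		if (count > 64) {
			memcpy(dst + 32, src + 32, 32);
			memcpy(dst + count - 64, src + count - 64, 32);
		}
		memcpy(dst + count - 32, src + count - 32, 32);
	} else {
		/* Large: copy 16 bytes, align the source, stream the bulk
		 * in 48-byte chunks, then finish with a chunk anchored at
		 * the very end so no byte-by-byte tail loop is needed. */
		const char *srcend = src + count;
		char *dstend = dst + count;
		size_t skew = 16 - ((uintptr_t)src & 15);

		memcpy(dst, src, 16);
		dst += skew;
		src += skew;
		while (srcend - src > 48) {
			memcpy(dst, src, 48);
			dst += 48;
			src += 48;
		}
		memcpy(dstend - 48, srcend - 48, 48);
	}
}

The trick throughout is that chunks anchored at the start and end of the buffer may overlap, so there is no per-byte tail loop; the real routine additionally threads exception-table fixups through every user access, which is where the remaining complexity lives.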
Some of this is paid for by pushing more complexity into the fixup handlers, but much of that could be recovered again with cleverer exception records that can give more information about the original access to the handler itself. For now though, the label sleds are at least entertaining. Signed-off-by: Robin Murphy --- arch/arm64/lib/copy_from_user.S | 274 ++++++++++++++++++++++++++------ 1 file changed, 223 insertions(+), 51 deletions(-) diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S index 34e317907524..a4b9bd73a5a8 100644 --- a/arch/arm64/lib/copy_from_user.S +++ b/arch/arm64/lib/copy_from_user.S @@ -1,73 +1,245 @@ /* SPDX-License-Identifier: GPL-2.0-only */ /* - * Copyright (C) 2012 ARM Ltd. + * Copyright (c) 2012-2022, Arm Limited. */ #include #include #include -#include -/* - * Copy from user space to a kernel buffer (alignment handled by the hardware) +/* Assumptions: + * + * ARMv8-a, AArch64, unaligned accesses. * - * Parameters: - * x0 - to - * x1 - from - * x2 - n - * Returns: - * x0 - bytes not copied */ - .macro ldrb1 reg, ptr, val - user_ldst 9998f, ldtrb, \reg, \ptr, \val - .endm +#define L(label) .L ## label - .macro strb1 reg, ptr, val - strb \reg, [\ptr], \val - .endm +#define dstin x0 +#define src x1 +#define count x2 +#define dst x3 +#define srcend x4 +#define dstend x5 +#define A_l x6 +#define A_lw w6 +#define A_h x7 +#define B_l x8 +#define B_lw w8 +#define B_h x9 +#define C_l x10 +#define C_lw w10 +#define C_h x11 +#define D_l x12 +#define D_h x13 +#define E_l x14 +#define E_h x15 +#define F_l x16 +#define F_h x17 +#define tmp1 x14 - .macro ldrh1 reg, ptr, val - user_ldst 9997f, ldtrh, \reg, \ptr, \val - .endm +/* + * Derived from memcpy with various adjustments: + * + * - memmove parts are removed since user and kernel pointers won't overlap. + * - The main loop is scaled down to 48 bytes per iteration since the increase + * in load ops changes the balance; little cores barely notice the difference, + * so big cores can benefit from keeping the loop relatively short. + * - Similarly, preferring source rather than destination alignment works out + * better on average. + * - The 33-128 byte cases are reworked to better balance the stores with the + * doubled-up load ops, and keep a more consistent access pattern. + * - The 0-3 byte sequence is replaced with the one borrowed from clear_user, + * since LDTRB lacks a register-offset addressing mode. + */ - .macro strh1 reg, ptr, val - strh \reg, [\ptr], \val - .endm +#define U_pre(x...) USER(L(fixup_pre), x) +#define U_dst(x...) USER(L(fixup_dst), x) +#define U_S1(x...) USER(L(fixup_s1), x) +#define U_M16(x...) USER(L(fixup_m16), x) +#define U_M32(x...) USER(L(fixup_m32), x) +#define U_M64(x...) USER(L(fixup_m64), x) +#define U_L32(x...) USER(L(fixup_l32), x) +#define U_L48(x...) USER(L(fixup_l48), x) +#define U_L64(x...) USER(L(fixup_l64), x) - .macro ldr1 reg, ptr, val - user_ldst 9997f, ldtr, \reg, \ptr, \val - .endm - - .macro str1 reg, ptr, val - str \reg, [\ptr], \val - .endm - - .macro ldp1 reg1, reg2, ptr, val - user_ldp 9997f, \reg1, \reg2, \ptr, \val - .endm - - .macro stp1 reg1, reg2, ptr, val - stp \reg1, \reg2, [\ptr], \val - .endm - -end .req x5 -srcin .req x15 SYM_FUNC_START(__arch_copy_from_user) - add end, x0, x2 - mov srcin, x1 -#include "copy_template.S" - mov x0, #0 // Nothing to copy + add srcend, src, count + add dstend, dstin, count + cmp count, 128 + b.hi L(copy_long) + cmp count, 32 + b.hi L(copy32_128) + + /* Small copies: 0..32 bytes. 
*/ + cmp count, 16 + b.lo L(copy16) +U_pre( ldtr A_l, [src]) +U_pre( ldtr A_h, [src, 8]) +U_pre( ldtr D_l, [srcend, -16]) +U_pre( ldtr D_h, [srcend, -8]) + stp A_l, A_h, [dstin] + stp D_l, D_h, [dstend, -16] + mov x0, #0 ret - // Exception fixups -9997: cmp dst, dstin - b.ne 9998f - // Before being absolutely sure we couldn't copy anything, try harder -USER(9998f, ldtrb tmp1w, [srcin]) - strb tmp1w, [dst], #1 -9998: sub x0, end, dst // bytes not copied + /* Copy 8-15 bytes. */ +L(copy16): + tbz count, 3, L(copy8) +U_pre( ldtr A_l, [src]) +U_pre( ldtr A_h, [srcend, -8]) + str A_l, [dstin] + str A_h, [dstend, -8] + mov x0, #0 ret + + .p2align 3 + /* Copy 4-7 bytes. */ +L(copy8): + tbz count, 2, L(copy4) +U_pre( ldtr A_lw, [src]) +U_pre( ldtr B_lw, [srcend, -4]) + str A_lw, [dstin] + str B_lw, [dstend, -4] + mov x0, #0 + ret + + /* Copy 0..3 bytes. */ +L(copy4): + tbz count, #1, L(copy1) +U_pre( ldtrh A_lw, [src]) + strh A_lw, [dstin] +L(copy1): + tbz count, #0, L(copy0) +U_S1( ldtrb A_lw, [srcend, -1]) + strb A_lw, [dstend, -1] +L(copy0): + mov x0, #0 + ret + + .p2align 4 + /* Medium copies: 33..128 bytes. */ +L(copy32_128): +U_pre( ldtr A_l, [src]) +U_pre( ldtr A_h, [src, 8]) +U_pre( ldtr B_l, [src, 16]) +U_pre( ldtr B_h, [src, 24]) + stp A_l, A_h, [dstin] + stp B_l, B_h, [dstin, 16] +U_M32( ldtr C_l, [srcend, -32]) +U_M32( ldtr C_h, [srcend, -24]) +U_M32( ldtr D_l, [srcend, -16]) +U_M32( ldtr D_h, [srcend, -8]) + cmp count, 64 + b.ls L(copy64) +U_M32( ldtr E_l, [src, 32]) +U_M32( ldtr E_h, [src, 40]) +U_M32( ldtr F_l, [src, 48]) +U_M32( ldtr F_h, [src, 56]) + stp E_l, E_h, [dstin, 32] + stp F_l, F_h, [dstin, 48] +U_M64( ldtr A_l, [srcend, -64]) +U_M64( ldtr A_h, [srcend, -56]) +U_M64( ldtr B_l, [srcend, -48]) +U_M64( ldtr B_h, [srcend, -40]) + stp A_l, A_h, [dstend, -64] + stp B_l, B_h, [dstend, -48] +L(copy64): + stp C_l, C_h, [dstend, -32] + stp D_l, D_h, [dstend, -16] + mov x0, #0 + ret + + .p2align 4 + /* Copy more than 128 bytes. */ +L(copy_long): + /* Copy 16 bytes and then align src to 16-byte alignment. */ + +U_pre( ldtr D_l, [src]) +U_pre( ldtr D_h, [src, 8]) + and tmp1, src, 15 + bic src, src, 15 + sub dst, dstin, tmp1 + add count, count, tmp1 /* Count is now 16 too large. */ +U_pre( ldtr A_l, [src, 16]) +U_pre( ldtr A_h, [src, 24]) + stp D_l, D_h, [dstin] +U_M16( ldtr B_l, [src, 32]) +U_M16( ldtr B_h, [src, 40]) +U_M16( ldtr C_l, [src, 48]) +U_M16( ldtr C_h, [src, 56]) + add src, src, #48 + subs count, count, 96 + 16 /* Test and readjust count. */ + b.ls L(copy48_from_end) + +L(loop48): + stp A_l, A_h, [dst, 16] +U_L32( ldtr A_l, [src, 16]) +U_L32( ldtr A_h, [src, 24]) + stp B_l, B_h, [dst, 32] +U_L48( ldtr B_l, [src, 32]) +U_L48( ldtr B_h, [src, 40]) + stp C_l, C_h, [dst, 48]! +U_dst( ldtr C_l, [src, 48]) +U_dst( ldtr C_h, [src, 56]) + add src, src, #48 + subs count, count, 48 + b.hi L(loop48) + + /* Write the last iteration and copy 48 bytes from the end. */ +L(copy48_from_end): + stp A_l, A_h, [dst, 16] +U_L32( ldtr A_l, [srcend, -48]) +U_L32( ldtr A_h, [srcend, -40]) + stp B_l, B_h, [dst, 32] +U_L48( ldtr B_l, [srcend, -32]) +U_L48( ldtr B_h, [srcend, -24]) + stp C_l, C_h, [dst, 48] +U_L64( ldtr C_l, [srcend, -16]) +U_L64( ldtr C_h, [srcend, -8]) + stp A_l, A_h, [dstend, -48] + stp B_l, B_h, [dstend, -32] + stp C_l, C_h, [dstend, -16] + mov x0, #0 + ret + + /* Fixups... */ + + /* + * Fault before anything has been written, but progress may have + * been possible; realign dst and retry a single byte to confirm. 
+ */ +L(fixup_pre): + mov dst, dstin +U_dst( ldtrb A_lw, [src]) + strb A_lw, [dst], #1 +L(fixup_dst): + sub x0, dstend, dst + ret + + /* Small: Fault with 1 byte remaining, regardless of count */ +L(fixup_s1): + mov x0, #1 + ret + + /* Medium: Faults after n bytes beyond dstin have been written */ +L(fixup_m64): + add dstin, dstin, #32 +L(fixup_m32): + add dstin, dstin, #16 +L(fixup_m16): + add dst, dstin, #16 + b L(fixup_dst) + + /* Large: Faults after n bytes beyond dst have been written */ +L(fixup_l64): + add dst, dst, #16 +L(fixup_l48): + add dst, dst, #16 +L(fixup_l32): + add dst, dst, #32 + b L(fixup_dst) + SYM_FUNC_END(__arch_copy_from_user) EXPORT_SYMBOL(__arch_copy_from_user)
From patchwork Wed Sep 28 11:58:52 2022
X-Patchwork-Submitter: Robin Murphy
X-Patchwork-Id: 12992179
From: Robin Murphy
To: will@kernel.org, catalin.marinas@arm.com
Cc: linux-arm-kernel@lists.infradead.org, mark.rutland@arm.com, kristina.martsenko@arm.com
Subject: [PATCH 2/3] arm64: Update copy_to_user()
Date: Wed, 28 Sep 2022 12:58:52 +0100
Message-Id: <13333310fd93096c8b71a35ae66f634ca82a2dc6.1664363162.git.robin.murphy@arm.com>
As with its counterpart, replace copy_to_user() with a new and improved version similarly derived from memcpy(). Different tradeoffs from the base implementation are made relative to copy_from_user() to get the most consistent results across different microarchitectures, but the overall shape of the performance gain ends up about the same. The exception fixups are even more comical this time around, but that's down to now needing to reconstruct the faulting address, and cope with overlapping stores at various points. Again, improvements to the exception mechanism could significantly simplify things in future. Signed-off-by: Robin Murphy --- arch/arm64/lib/copy_to_user.S | 391 +++++++++++++++++++++++++++++----- 1 file changed, 341 insertions(+), 50 deletions(-) diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S index 802231772608..b641f00f50d6 100644 --- a/arch/arm64/lib/copy_to_user.S +++ b/arch/arm64/lib/copy_to_user.S @@ -1,73 +1,364 @@ /* SPDX-License-Identifier: GPL-2.0-only */ /* - * Copyright (C) 2012 ARM Ltd. + * Copyright (c) 2012-2022, Arm Limited. */ #include #include #include -#include + +/* Assumptions: + * + * ARMv8-a, AArch64, unaligned accesses. + * + */ + +#define L(label) .L ## label + +#define dstin x0 +#define src x1 +#define count x2 +#define dst x3 +#define srcend x4 +#define dstend x5 +#define A_l x6 +#define A_lw w6 +#define A_h x7 +#define B_l x8 +#define B_lw w8 +#define B_h x9 +#define C_l x10 +#define C_lw w10 +#define C_h x11 +#define D_l x12 +#define D_h x13 +#define E_l x14 +#define E_h x15 +#define F_l x16 +#define F_h x17 +#define tmp1 x14 /* - * Copy to user space from a kernel buffer (alignment handled by the hardware) + * Derived from memcpy with various adjustments: * - * Parameters: - * x0 - to - * x1 - from - * x2 - n - * Returns: - * x0 - bytes not copied + * - memmove parts are removed since user and kernel pointers won't overlap. + * - Contrary to copy_from_user, although big cores still aren't particularly + * keen on the increased instruction count in the main loop, processing fewer + * than 64 bytes per iteration here hurts little cores more. + * - The medium-size cases are reworked to better balance the loads with the + * doubled-up store ops, avoid potential out-of-sequence faults, and preserve + * the input arguments for the sake of fault handling. + * - The 0-3 byte sequence is replaced with the one borrowed from clear_user, + * since STTRB lacks a register-offset addressing mode. */ - .macro ldrb1 reg, ptr, val - ldrb \reg, [\ptr], \val - .endm - .macro strb1 reg, ptr, val - user_ldst 9998f, sttrb, \reg, \ptr, \val - .endm +#define U_pre(x...) USER(L(fixup_pre), x) +#define U_dst(x...) USER(L(fixup_dst), x) - .macro ldrh1 reg, ptr, val - ldrh \reg, [\ptr], \val - .endm +#define U_S1(x...) USER(L(fixup_s1), x) +#define U_S4(x...) USER(L(fixup_s4), x) +#define U_S8(x...) USER(L(fixup_s8), x) +#define U_ST8(x...) USER(L(fixup_st8), x) +#define U_S16(x...) USER(L(fixup_s16), x) +#define U_M24(x...) USER(L(fixup_m24), x) +#define U_M32(x...) USER(L(fixup_m32), x) +#define U_M40(x...) USER(L(fixup_m40), x) +#define U_M48(x...) USER(L(fixup_m48), x) +#define U_M56(x...) USER(L(fixup_m56), x) +#define U_M64(x...) USER(L(fixup_m64), x) +#define U_MT8(x...) USER(L(fixup_mt8), x) +#define U_MT16(x...)
USER(L(fixup_mt16), x) +#define U_MT24(x...) USER(L(fixup_mt24), x) +#define U_MT32(x...) USER(L(fixup_mt32), x) +#define U_MT40(x...) USER(L(fixup_mt40), x) +#define U_MT48(x...) USER(L(fixup_mt48), x) +#define U_MT56(x...) USER(L(fixup_mt56), x) - .macro strh1 reg, ptr, val - user_ldst 9997f, sttrh, \reg, \ptr, \val - .endm +#define U_L16(x...) USER(L(fixup_l16), x) +#define U_L24(x...) USER(L(fixup_l24), x) +#define U_L32(x...) USER(L(fixup_l32), x) +#define U_L40(x...) USER(L(fixup_l40), x) +#define U_L48(x...) USER(L(fixup_l48), x) +#define U_L56(x...) USER(L(fixup_l56), x) +#define U_L64(x...) USER(L(fixup_l64), x) +#define U_L72(x...) USER(L(fixup_l72), x) +#define U_LT8(x...) USER(L(fixup_lt8), x) +#define U_LT16(x...) USER(L(fixup_lt16), x) +#define U_LT24(x...) USER(L(fixup_lt24), x) +#define U_LT32(x...) USER(L(fixup_lt32), x) +#define U_LT40(x...) USER(L(fixup_lt40), x) +#define U_LT48(x...) USER(L(fixup_lt48), x) +#define U_LT56(x...) USER(L(fixup_lt56), x) +#define U_LT64(x...) USER(L(fixup_lt64), x) - .macro ldr1 reg, ptr, val - ldr \reg, [\ptr], \val - .endm - - .macro str1 reg, ptr, val - user_ldst 9997f, sttr, \reg, \ptr, \val - .endm - - .macro ldp1 reg1, reg2, ptr, val - ldp \reg1, \reg2, [\ptr], \val - .endm - - .macro stp1 reg1, reg2, ptr, val - user_stp 9997f, \reg1, \reg2, \ptr, \val - .endm - -end .req x5 -srcin .req x15 SYM_FUNC_START(__arch_copy_to_user) - add end, x0, x2 - mov srcin, x1 -#include "copy_template.S" + add srcend, src, count + add dstend, dstin, count + cmp count, 128 + b.hi L(copy_long) + cmp count, 32 + b.hi L(copy32_128) + + /* Small copies: 0..32 bytes. */ + cmp count, 16 + b.lo L(copy16) + ldp A_l, A_h, [src] + ldp D_l, D_h, [srcend, -16] +U_pre( sttr A_l, [dstin]) +U_S8( sttr A_h, [dstin, 8]) +U_S16( sttr D_l, [dstend, -16]) +U_ST8( sttr D_h, [dstend, -8]) mov x0, #0 ret - // Exception fixups -9997: cmp dst, dstin - b.ne 9998f - // Before being absolutely sure we couldn't copy anything, try harder - ldrb tmp1w, [srcin] -USER(9998f, sttrb tmp1w, [dst]) - add dst, dst, #1 -9998: sub x0, end, dst // bytes not copied + /* Copy 8-15 bytes. */ +L(copy16): + tbz count, 3, L(copy8) + ldr A_l, [src] + ldr A_h, [srcend, -8] +U_pre( sttr A_l, [dstin]) +U_S8( sttr A_h, [dstend, -8]) + mov x0, #0 ret + + .p2align 3 + /* Copy 4-7 bytes. */ +L(copy8): + tbz count, 2, L(copy4) + ldr A_lw, [src] + ldr B_lw, [srcend, -4] +U_pre( sttr A_lw, [dstin]) +U_S4( sttr B_lw, [dstend, -4]) + mov x0, #0 + ret + + /* Copy 0..3 bytes. */ +L(copy4): + tbz count, #1, L(copy1) + ldrh A_lw, [src] +U_pre( sttrh A_lw, [dstin]) +L(copy1): + tbz count, #0, L(copy0) + ldrb A_lw, [srcend, -1] +U_S1( sttrb A_lw, [dstend, -1]) +L(copy0): + mov x0, #0 + ret + + .p2align 4 + /* Medium copies: 33..128 bytes. 
*/ +L(copy32_128): + ldp A_l, A_h, [src] + ldp B_l, B_h, [src, 16] +U_pre( sttr A_l, [dstin]) +U_S8( sttr A_h, [dstin, 8]) +U_S16( sttr B_l, [dstin, 16]) +U_M24( sttr B_h, [dstin, 24]) + ldp C_l, C_h, [srcend, -32] + ldp D_l, D_h, [srcend, -16] + cmp count, 64 + b.ls L(copy64) + ldp E_l, E_h, [src, 32] + ldp F_l, F_h, [src, 48] +U_M32( sttr E_l, [dstin, 32]) +U_M40( sttr E_h, [dstin, 40]) +U_M48( sttr F_l, [dstin, 48]) +U_M56( sttr F_h, [dstin, 56]) + ldp A_l, A_h, [srcend, -64] + ldp B_l, B_h, [srcend, -48] +U_M64( sttr A_l, [dstend, -64]) +U_MT56( sttr A_h, [dstend, -56]) +U_MT48( sttr B_l, [dstend, -48]) +U_MT40( sttr B_h, [dstend, -40]) +L(copy64): +U_MT32( sttr C_l, [dstend, -32]) +U_MT24( sttr C_h, [dstend, -24]) +U_MT16( sttr D_l, [dstend, -16]) +U_MT8( sttr D_h, [dstend, -8]) + mov x0, #0 + ret + + .p2align 4 + nop + nop + nop + /* Copy more than 128 bytes. */ +L(copy_long): + /* Copy 16 bytes and then align dst to 16-byte alignment. */ + + ldp D_l, D_h, [src] + and tmp1, dstin, 15 + bic dst, dstin, 15 + sub src, src, tmp1 + add count, count, tmp1 /* Count is now 16 too large. */ + ldp A_l, A_h, [src, 16] +U_pre( sttr D_l, [dstin]) +U_S8( sttr D_h, [dstin, 8]) + ldp B_l, B_h, [src, 32] + ldp C_l, C_h, [src, 48] + ldp D_l, D_h, [src, 64]! + subs count, count, 128 + 16 /* Test and readjust count. */ + b.ls L(copy64_from_end) + +L(loop64): +U_L16( sttr A_l, [dst, 16]) +U_L24( sttr A_h, [dst, 24]) + ldp A_l, A_h, [src, 16] +U_L32( sttr B_l, [dst, 32]) +U_L40( sttr B_h, [dst, 40]) + ldp B_l, B_h, [src, 32] +U_L48( sttr C_l, [dst, 48]) +U_L56( sttr C_h, [dst, 56]) + ldp C_l, C_h, [src, 48] +U_L64( sttr D_l, [dst, 64]) +U_L72( sttr D_h, [dst, 72]) + add dst, dst, #64 + ldp D_l, D_h, [src, 64]! + subs count, count, 64 + b.hi L(loop64) + + /* Write the last iteration and copy 64 bytes from the end. */ +L(copy64_from_end): + ldp E_l, E_h, [srcend, -64] +U_L16( sttr A_l, [dst, 16]) +U_L24( sttr A_h, [dst, 24]) + ldp A_l, A_h, [srcend, -48] +U_L32( sttr B_l, [dst, 32]) +U_L40( sttr B_h, [dst, 40]) + ldp B_l, B_h, [srcend, -32] +U_L48( sttr C_l, [dst, 48]) +U_L56( sttr C_h, [dst, 56]) + ldp C_l, C_h, [srcend, -16] +U_L64( sttr D_l, [dst, 64]) +U_L72( sttr D_h, [dst, 72]) +U_LT64( sttr E_l, [dstend, -64]) +U_LT56( sttr E_h, [dstend, -56]) +U_LT48( sttr A_l, [dstend, -48]) +U_LT40( sttr A_h, [dstend, -40]) +U_LT32( sttr B_l, [dstend, -32]) +U_LT24( sttr B_h, [dstend, -24]) +U_LT16( sttr C_l, [dstend, -16]) +U_LT8( sttr C_h, [dstend, -8]) + mov x0, #0 + ret + + /* Fixups... */ + + /* + * Fault on the first write, but progress may have been possible; + * realign dst and retry a single byte to confirm. 
+ */ +L(fixup_pre): + mov dst, dstin +U_dst( ldtrb A_lw, [src]) + strb A_lw, [dst], #1 +L(fixup_dst): + sub x0, dstend, dst + ret + + /* Small: Fault with 1 byte remaining, regardless of count */ +L(fixup_s1): + mov x0, #1 + ret + + /* Small tail case: Fault 8 bytes before dstend, >=16 bytes written */ +L(fixup_st8): + sub dst, dstend, #8 + add dstin, dstin, #16 +L(fixup_tail): + cmp dst, dstin + csel dst, dst, dstin, hi + b L(fixup_dst) + + /* Small/medium: Faults n bytes past dstin, that much written */ +L(fixup_m64): + add dstin, dstin, #8 +L(fixup_m56): + add dstin, dstin, #8 +L(fixup_m48): + add dstin, dstin, #8 +L(fixup_m40): + add dstin, dstin, #8 +L(fixup_m32): + add dstin, dstin, #8 +L(fixup_m24): + add dstin, dstin, #8 +L(fixup_s16): + add dstin, dstin, #8 +L(fixup_s8): + add dstin, dstin, #4 +L(fixup_s4): + add dst, dstin, #4 + b L(fixup_dst) + + /* + * Medium tail cases: Faults n bytes before dstend, 64 or 32 bytes + * past dstin written, depending on original count + */ +L(fixup_mt56): + sub count, count, #8 +L(fixup_mt48): + sub count, count, #8 +L(fixup_mt40): + sub count, count, #8 +L(fixup_mt32): + sub count, count, #8 +L(fixup_mt24): + sub count, count, #8 +L(fixup_mt16): + sub count, count, #8 +L(fixup_mt8): + add count, count, #8 + add dst, dstin, count + + sub tmp1, dstend, dstin + cmp tmp1, #64 + add tmp1, dstin, #64 + add dstin, dstin, #32 + csel dstin, dstin, tmp1, ls + b L(fixup_tail) + + /* Large: Faults n bytes past dst, at least 16 bytes past dstin written */ +L(fixup_l72): + add dst, dst, #8 +L(fixup_l64): + add dst, dst, #8 +L(fixup_l56): + add dst, dst, #8 +L(fixup_l48): + add dst, dst, #8 +L(fixup_l40): + add dst, dst, #8 +L(fixup_l32): + add dst, dst, #8 +L(fixup_l24): + add dst, dst, #8 +L(fixup_l16): + add dst, dst, #16 + add dstin, dstin, #16 + b L(fixup_tail) + + /* Large tail: Faults n bytes before dstend, 80 bytes past dst written */ +L(fixup_lt64): + sub count, count, #8 +L(fixup_lt56): + sub count, count, #8 +L(fixup_lt48): + sub count, count, #8 +L(fixup_lt40): + sub count, count, #8 +L(fixup_lt32): + sub count, count, #8 +L(fixup_lt24): + sub count, count, #8 +L(fixup_lt16): + sub count, count, #8 +L(fixup_lt8): + add count, count, #56 /* Count was also off by 64 */ + add dstin, dst, #80 + add dst, dst, count + b L(fixup_tail) + SYM_FUNC_END(__arch_copy_to_user) EXPORT_SYMBOL(__arch_copy_to_user)
From patchwork Wed Sep 28 11:58:53 2022
X-Patchwork-Submitter: Robin Murphy
X-Patchwork-Id: 12992180
From: Robin Murphy
To: will@kernel.org, catalin.marinas@arm.com
Cc: linux-arm-kernel@lists.infradead.org, mark.rutland@arm.com, kristina.martsenko@arm.com
Subject: [PATCH 3/3] arm64: Garbage-collect usercopy leftovers
Date: Wed, 28 Sep 2022 12:58:53 +0100

With both usercopy routines replaced, remove the now-unused template and supporting macros. Signed-off-by: Robin Murphy --- arch/arm64/include/asm/asm-uaccess.h | 30 ----- arch/arm64/lib/copy_template.S | 181 --------------------------- 2 files changed, 211 deletions(-) delete mode 100644 arch/arm64/lib/copy_template.S diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h index 75b211c98dea..c2dea9b6b9e9 100644 --- a/arch/arm64/include/asm/asm-uaccess.h +++ b/arch/arm64/include/asm/asm-uaccess.h @@ -62,34 +62,4 @@ alternative_else_nop_endif #define USER(l, x...) \ 9999: x; \ _asm_extable_uaccess 9999b, l - -/* - * Generate the assembly for LDTR/STTR with exception table entries. - * This is complicated as there is no post-increment or pair versions of the - * unprivileged instructions, and USER() only works for single instructions.
- */ - .macro user_ldp l, reg1, reg2, addr, post_inc -8888: ldtr \reg1, [\addr]; -8889: ldtr \reg2, [\addr, #8]; - add \addr, \addr, \post_inc; - - _asm_extable_uaccess 8888b, \l; - _asm_extable_uaccess 8889b, \l; - .endm - - .macro user_stp l, reg1, reg2, addr, post_inc -8888: sttr \reg1, [\addr]; -8889: sttr \reg2, [\addr, #8]; - add \addr, \addr, \post_inc; - - _asm_extable_uaccess 8888b,\l; - _asm_extable_uaccess 8889b,\l; - .endm - - .macro user_ldst l, inst, reg, addr, post_inc -8888: \inst \reg, [\addr]; - add \addr, \addr, \post_inc; - - _asm_extable_uaccess 8888b, \l; - .endm #endif diff --git a/arch/arm64/lib/copy_template.S b/arch/arm64/lib/copy_template.S deleted file mode 100644 index 488df234c49a..000000000000 --- a/arch/arm64/lib/copy_template.S +++ /dev/null @@ -1,181 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0-only */ -/* - * Copyright (C) 2013 ARM Ltd. - * Copyright (C) 2013 Linaro. - * - * This code is based on glibc cortex strings work originally authored by Linaro - * be found @ - * - * http://bazaar.launchpad.net/~linaro-toolchain-dev/cortex-strings/trunk/ - * files/head:/src/aarch64/ - */ - - -/* - * Copy a buffer from src to dest (alignment handled by the hardware) - * - * Parameters: - * x0 - dest - * x1 - src - * x2 - n - * Returns: - * x0 - dest - */ -dstin .req x0 -src .req x1 -count .req x2 -tmp1 .req x3 -tmp1w .req w3 -tmp2 .req x4 -tmp2w .req w4 -dst .req x6 - -A_l .req x7 -A_h .req x8 -B_l .req x9 -B_h .req x10 -C_l .req x11 -C_h .req x12 -D_l .req x13 -D_h .req x14 - - mov dst, dstin - cmp count, #16 - /*When memory length is less than 16, the accessed are not aligned.*/ - b.lo .Ltiny15 - - neg tmp2, src - ands tmp2, tmp2, #15/* Bytes to reach alignment. */ - b.eq .LSrcAligned - sub count, count, tmp2 - /* - * Copy the leading memory data from src to dst in an increasing - * address order.By this way,the risk of overwriting the source - * memory data is eliminated when the distance between src and - * dst is less than 16. The memory accesses here are alignment. - */ - tbz tmp2, #0, 1f - ldrb1 tmp1w, src, #1 - strb1 tmp1w, dst, #1 -1: - tbz tmp2, #1, 2f - ldrh1 tmp1w, src, #2 - strh1 tmp1w, dst, #2 -2: - tbz tmp2, #2, 3f - ldr1 tmp1w, src, #4 - str1 tmp1w, dst, #4 -3: - tbz tmp2, #3, .LSrcAligned - ldr1 tmp1, src, #8 - str1 tmp1, dst, #8 - -.LSrcAligned: - cmp count, #64 - b.ge .Lcpy_over64 - /* - * Deal with small copies quickly by dropping straight into the - * exit block. - */ -.Ltail63: - /* - * Copy up to 48 bytes of data. At this point we only need the - * bottom 6 bits of count to be accurate. - */ - ands tmp1, count, #0x30 - b.eq .Ltiny15 - cmp tmp1w, #0x20 - b.eq 1f - b.lt 2f - ldp1 A_l, A_h, src, #16 - stp1 A_l, A_h, dst, #16 -1: - ldp1 A_l, A_h, src, #16 - stp1 A_l, A_h, dst, #16 -2: - ldp1 A_l, A_h, src, #16 - stp1 A_l, A_h, dst, #16 -.Ltiny15: - /* - * Prefer to break one ldp/stp into several load/store to access - * memory in an increasing address order,rather than to load/store 16 - * bytes from (src-16) to (dst-16) and to backward the src to aligned - * address,which way is used in original cortex memcpy. If keeping - * the original memcpy process here, memmove need to satisfy the - * precondition that src address is at least 16 bytes bigger than dst - * address,otherwise some source data will be overwritten when memove - * call memcpy directly. To make memmove simpler and decouple the - * memcpy's dependency on memmove, withdrew the original process. 
- */ - tbz count, #3, 1f - ldr1 tmp1, src, #8 - str1 tmp1, dst, #8 -1: - tbz count, #2, 2f - ldr1 tmp1w, src, #4 - str1 tmp1w, dst, #4 -2: - tbz count, #1, 3f - ldrh1 tmp1w, src, #2 - strh1 tmp1w, dst, #2 -3: - tbz count, #0, .Lexitfunc - ldrb1 tmp1w, src, #1 - strb1 tmp1w, dst, #1 - - b .Lexitfunc - -.Lcpy_over64: - subs count, count, #128 - b.ge .Lcpy_body_large - /* - * Less than 128 bytes to copy, so handle 64 here and then jump - * to the tail. - */ - ldp1 A_l, A_h, src, #16 - stp1 A_l, A_h, dst, #16 - ldp1 B_l, B_h, src, #16 - ldp1 C_l, C_h, src, #16 - stp1 B_l, B_h, dst, #16 - stp1 C_l, C_h, dst, #16 - ldp1 D_l, D_h, src, #16 - stp1 D_l, D_h, dst, #16 - - tst count, #0x3f - b.ne .Ltail63 - b .Lexitfunc - - /* - * Critical loop. Start at a new cache line boundary. Assuming - * 64 bytes per line this ensures the entire loop is in one line. - */ - .p2align L1_CACHE_SHIFT -.Lcpy_body_large: - /* pre-get 64 bytes data. */ - ldp1 A_l, A_h, src, #16 - ldp1 B_l, B_h, src, #16 - ldp1 C_l, C_h, src, #16 - ldp1 D_l, D_h, src, #16 -1: - /* - * interlace the load of next 64 bytes data block with store of the last - * loaded 64 bytes data. - */ - stp1 A_l, A_h, dst, #16 - ldp1 A_l, A_h, src, #16 - stp1 B_l, B_h, dst, #16 - ldp1 B_l, B_h, src, #16 - stp1 C_l, C_h, dst, #16 - ldp1 C_l, C_h, src, #16 - stp1 D_l, D_h, dst, #16 - ldp1 D_l, D_h, src, #16 - subs count, count, #64 - b.ge 1b - stp1 A_l, A_h, dst, #16 - stp1 B_l, B_h, dst, #16 - stp1 C_l, C_h, dst, #16 - stp1 D_l, D_h, dst, #16 - - tst count, #0x3f - b.ne .Ltail63 -.Lexitfunc: