From patchwork Tue Nov 5 16:09:01 2024
X-Patchwork-Submitter: Ard Biesheuvel
X-Patchwork-Id: 13863214
Date: Tue, 5 Nov 2024 17:09:01 +0100
In-Reply-To: <20241105160859.1459261-8-ardb+git@google.com>
References: <20241105160859.1459261-8-ardb+git@google.com>
Message-ID: <20241105160859.1459261-9-ardb+git@google.com>
Subject: [PATCH v2 1/6] crypto: arm64/crct10dif - Remove obsolete chunking logic
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel,
 Eric Biggers

From: Ard Biesheuvel

This is a partial revert of commit fc754c024a343b, which moved the
logic into C code which ensures that kernel mode NEON code does not
hog the CPU for too long.

This is no longer needed now that kernel mode NEON no longer disables
preemption, so we can drop this.
Reviewed-by: Eric Biggers
Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/crct10dif-ce-glue.c | 30 ++++----------------
 1 file changed, 6 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-glue.c b/arch/arm64/crypto/crct10dif-ce-glue.c
index 606d25c559ed..7b05094a0480 100644
--- a/arch/arm64/crypto/crct10dif-ce-glue.c
+++ b/arch/arm64/crypto/crct10dif-ce-glue.c
@@ -37,18 +37,9 @@ static int crct10dif_update_pmull_p8(struct shash_desc *desc, const u8 *data,
 	u16 *crc = shash_desc_ctx(desc);
 
 	if (length >= CRC_T10DIF_PMULL_CHUNK_SIZE && crypto_simd_usable()) {
-		do {
-			unsigned int chunk = length;
-
-			if (chunk > SZ_4K + CRC_T10DIF_PMULL_CHUNK_SIZE)
-				chunk = SZ_4K;
-
-			kernel_neon_begin();
-			*crc = crc_t10dif_pmull_p8(*crc, data, chunk);
-			kernel_neon_end();
-			data += chunk;
-			length -= chunk;
-		} while (length);
+		kernel_neon_begin();
+		*crc = crc_t10dif_pmull_p8(*crc, data, length);
+		kernel_neon_end();
 	} else {
 		*crc = crc_t10dif_generic(*crc, data, length);
 	}
@@ -62,18 +53,9 @@ static int crct10dif_update_pmull_p64(struct shash_desc *desc, const u8 *data,
 	u16 *crc = shash_desc_ctx(desc);
 
 	if (length >= CRC_T10DIF_PMULL_CHUNK_SIZE && crypto_simd_usable()) {
-		do {
-			unsigned int chunk = length;
-
-			if (chunk > SZ_4K + CRC_T10DIF_PMULL_CHUNK_SIZE)
-				chunk = SZ_4K;
-
-			kernel_neon_begin();
-			*crc = crc_t10dif_pmull_p64(*crc, data, chunk);
-			kernel_neon_end();
-			data += chunk;
-			length -= chunk;
-		} while (length);
+		kernel_neon_begin();
+		*crc = crc_t10dif_pmull_p64(*crc, data, length);
+		kernel_neon_end();
 	} else {
 		*crc = crc_t10dif_generic(*crc, data, length);
 	}
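For reference, the chunking logic being removed is equivalent to the
following standalone sketch (a restatement of the deleted lines above,
not new code; kernel_neon_begin()/kernel_neon_end(), SZ_4K and
CRC_T10DIF_PMULL_CHUNK_SIZE are the kernel symbols already visible in
the diff):

	static u16 crc_update_chunked(u16 crc, const u8 *data,
				      unsigned int length)
	{
		do {
			unsigned int chunk = length;

			/* Cap each kernel_neon_begin()/kernel_neon_end()
			 * section at 4 KiB of input, since these used to
			 * disable preemption. The extra
			 * CRC_T10DIF_PMULL_CHUNK_SIZE of headroom presumably
			 * keeps the final chunk from dropping below the
			 * 16-byte minimum that the PMULL routines require.
			 */
			if (chunk > SZ_4K + CRC_T10DIF_PMULL_CHUNK_SIZE)
				chunk = SZ_4K;

			kernel_neon_begin();
			crc = crc_t10dif_pmull_p64(crc, data, chunk);
			kernel_neon_end();
			data += chunk;
			length -= chunk;
		} while (length);

		return crc;
	}

Now that kernel mode NEON is preemptible, a single begin/end pair
around the whole buffer no longer creates a scheduling latency problem,
so the loop can simply go away.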
From patchwork Tue Nov 5 16:09:02 2024
X-Patchwork-Submitter: Ard Biesheuvel
X-Patchwork-Id: 13863223
Date: Tue, 5 Nov 2024 17:09:02 +0100
In-Reply-To: <20241105160859.1459261-8-ardb+git@google.com>
References: <20241105160859.1459261-8-ardb+git@google.com>
Message-ID: <20241105160859.1459261-10-ardb+git@google.com>
Subject: [PATCH v2 2/6] crypto: arm64/crct10dif - Use faster 16x64 bit polynomial multiply
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel

From: Ard Biesheuvel

The CRC-T10DIF implementation for arm64 has a version that uses 8x8
polynomial multiplication, for cores that lack the crypto extensions,
which cover the
64x64 polynomial multiplication instruction that the algorithm was
built around.

This fallback version rather naively adopted the 64x64 polynomial
multiplication algorithm that I ported from ARM for the GHASH driver,
which needs 8 PMULL8 instructions to implement one PMULL64. This is
reasonable, given that each 8-bit vector element needs to be multiplied
with each element in the other vector, producing 8 vectors with partial
results that need to be combined to yield the correct result.

However, most PMULL64 invocations in the CRC-T10DIF code involve
multiplication by a pair of 16-bit folding coefficients, and so all the
partial results from higher order bytes will be zero, and there is no
need to calculate them to begin with.

Then, the CRC-T10DIF algorithm always XORs the output values of the
PMULL64 instructions being issued in pairs, and so there is no need to
faithfully implement each individual PMULL64 instruction, as long as
XORing the results pairwise produces the expected result.

Implementing these improvements results in a speedup of 3.3x on
low-end platforms such as Raspberry Pi 4 (Cortex-A72).

Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/crct10dif-ce-core.S | 121 +++++++++++++++++---
 1 file changed, 104 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
index 5604de61d06d..d2acaa2b5a01 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -1,8 +1,11 @@
 //
 // Accelerated CRC-T10DIF using arm64 NEON and Crypto Extensions instructions
 //
-// Copyright (C) 2016 Linaro Ltd
-// Copyright (C) 2019 Google LLC
+// Copyright (C) 2016 Linaro Ltd
+// Copyright (C) 2019-2024 Google LLC
+//
+// Authors: Ard Biesheuvel
+//          Eric Biggers
 //
 // This program is free software; you can redistribute it and/or modify
 // it under the terms of the GNU General Public License version 2 as
@@ -122,6 +125,13 @@
 	sli		perm2.2d, perm1.2d, #56
 	sli		perm3.2d, perm1.2d, #48
 	sli		perm4.2d, perm1.2d, #40
+
+	// Compose { 0,0,0,0, 8,8,8,8, 1,1,1,1, 9,9,9,9 }
+	movi		bd1.4h, #8, lsl #8
+	orr		bd1.2s, #1, lsl #16
+	orr		bd1.2s, #1, lsl #24
+	zip1		bd1.16b, bd1.16b, bd1.16b
+	zip1		bd1.16b, bd1.16b, bd1.16b
 	.endm
 
 	.macro		__pmull_pre_p8, bd
@@ -196,6 +206,92 @@ SYM_FUNC_START_LOCAL(__pmull_p8_core)
 	ret
 SYM_FUNC_END(__pmull_p8_core)
 
+	.macro		pmull16x64_p64, a16, b64, c64
+	pmull2		\c64\().1q, \a16\().2d, \b64\().2d
+	pmull		\b64\().1q, \a16\().1d, \b64\().1d
+	.endm
+
+	/*
+	 * Pairwise long polynomial multiplication of two 16-bit values
+	 *
+	 *   { w0, w1 }, { y0, y1 }
+	 *
+	 * by two 64-bit values
+	 *
+	 *   { x0, x1, x2, x3, x4, x5, x6, x7 }, { z0, z1, z2, z3, z4, z5, z6, z7 }
+	 *
+	 * where each vector element is a byte, ordered from least to most
+	 * significant.
+	 *
+	 * This can be implemented using 8x8 long polynomial multiplication, by
+	 * reorganizing the input so that each pairwise 8x8 multiplication
+	 * produces one of the terms from the decomposition below, and
+	 * combining the results of each rank and shifting them into place.
+	 *
+	 * Rank
+	 *  0  w0*x0 ^                     | y0*z0 ^
+	 *  1  (w0*x1 ^ w1*x0) <<  8 ^     | (y0*z1 ^ y1*z0) <<  8 ^
+	 *  2  (w0*x2 ^ w1*x1) << 16 ^     | (y0*z2 ^ y1*z1) << 16 ^
+	 *  3  (w0*x3 ^ w1*x2) << 24 ^     | (y0*z3 ^ y1*z2) << 24 ^
+	 *  4  (w0*x4 ^ w1*x3) << 32 ^     | (y0*z4 ^ y1*z3) << 32 ^
+	 *  5  (w0*x5 ^ w1*x4) << 40 ^     | (y0*z5 ^ y1*z4) << 40 ^
+	 *  6  (w0*x6 ^ w1*x5) << 48 ^     | (y0*z6 ^ y1*z5) << 48 ^
+	 *  7  (w0*x7 ^ w1*x6) << 56 ^     | (y0*z7 ^ y1*z6) << 56 ^
+	 *  8  w1*x7 << 64                 | y1*z7 << 64
+	 *
+	 * The inputs can be reorganized into
+	 *
+	 *   { w0, w0, w0, w0, y0, y0, y0, y0 }, { w1, w1, w1, w1, y1, y1, y1, y1 }
+	 *   { x0, x2, x4, x6, z0, z2, z4, z6 }, { x1, x3, x5, x7, z1, z3, z5, z7 }
+	 *
+	 * and after performing 8x8->16 bit long polynomial multiplication of
+	 * each of the halves of the first vector with those of the second one,
+	 * we obtain the following four vectors of 16-bit elements:
+	 *
+	 *   a := { w0*x0, w0*x2, w0*x4, w0*x6 }, { y0*z0, y0*z2, y0*z4, y0*z6 }
+	 *   b := { w0*x1, w0*x3, w0*x5, w0*x7 }, { y0*z1, y0*z3, y0*z5, y0*z7 }
+	 *   c := { w1*x0, w1*x2, w1*x4, w1*x6 }, { y1*z0, y1*z2, y1*z4, y1*z6 }
+	 *   d := { w1*x1, w1*x3, w1*x5, w1*x7 }, { y1*z1, y1*z3, y1*z5, y1*z7 }
+	 *
+	 * Results b and c can be XORed together, as the vector elements have
+	 * matching ranks. Then, the final XOR (*) can be pulled forward, and
+	 * applied between the halves of each of the remaining three vectors,
+	 * which are then shifted into place, and combined to produce two
+	 * 80-bit results.
+	 *
+	 * (*) NOTE: the 16x64 bit polynomial multiply below is not equivalent
+	 * to the 64x64 bit one above, but XOR'ing the outputs together will
+	 * produce the expected result, and this is sufficient in the context
+	 * of this algorithm.
+	 */
+	.macro		pmull16x64_p8, a16, b64, c64
+	ext		t7.16b, \b64\().16b, \b64\().16b, #1
+	tbl		t5.16b, {\a16\().16b}, bd1.16b
+	uzp1		t7.16b, \b64\().16b, t7.16b
+	bl		__pmull_p8_16x64
+	ext		\b64\().16b, t4.16b, t4.16b, #15
+	eor		\c64\().16b, t8.16b, t5.16b
+	.endm
+
+SYM_FUNC_START_LOCAL(__pmull_p8_16x64)
+	ext		t6.16b, t5.16b, t5.16b, #8
+
+	pmull		t3.8h, t7.8b, t5.8b
+	pmull		t4.8h, t7.8b, t6.8b
+	pmull2		t5.8h, t7.16b, t5.16b
+	pmull2		t6.8h, t7.16b, t6.16b
+
+	ext		t8.16b, t3.16b, t3.16b, #8
+	eor		t4.16b, t4.16b, t6.16b
+	ext		t7.16b, t5.16b, t5.16b, #8
+	ext		t6.16b, t4.16b, t4.16b, #8
+	eor		t8.8b, t8.8b, t3.8b
+	eor		t5.8b, t5.8b, t7.8b
+	eor		t4.8b, t4.8b, t6.8b
+	ext		t5.16b, t5.16b, t5.16b, #14
+	ret
+SYM_FUNC_END(__pmull_p8_16x64)
+
 	.macro		__pmull_p8, rq, ad, bd, i
 	.ifnc		\bd, fold_consts
 	.err
@@ -218,14 +314,12 @@ SYM_FUNC_END(__pmull_p8_core)
 	.macro		fold_32_bytes, p, reg1, reg2
 	ldp		q11, q12, [buf], #0x20
 
-	__pmull_\p	v8, \reg1, fold_consts, 2
-	__pmull_\p	\reg1, \reg1, fold_consts
+	pmull16x64_\p	fold_consts, \reg1, v8
 
CPU_LE(	rev64		v11.16b, v11.16b	)
CPU_LE(	rev64		v12.16b, v12.16b	)
 
-	__pmull_\p	v9, \reg2, fold_consts, 2
-	__pmull_\p	\reg2, \reg2, fold_consts
+	pmull16x64_\p	fold_consts, \reg2, v9
 
CPU_LE(	ext		v11.16b, v11.16b, v11.16b, #8	)
CPU_LE(	ext		v12.16b, v12.16b, v12.16b, #8	)
@@ -238,11 +332,9 @@ CPU_LE(	ext		v12.16b, v12.16b, v12.16b, #8	)
 
 	// Fold src_reg into dst_reg, optionally loading the next fold constants
 	.macro		fold_16_bytes, p, src_reg, dst_reg, load_next_consts
-	__pmull_\p	v8, \src_reg, fold_consts
-	__pmull_\p	\src_reg, \src_reg, fold_consts, 2
+	pmull16x64_\p	fold_consts, \src_reg, v8
 	.ifnb		\load_next_consts
 	ld1		{fold_consts.2d}, [fold_consts_ptr], #16
-	__pmull_pre_\p	fold_consts
 	.endif
 	eor		\dst_reg\().16b, \dst_reg\().16b, v8.16b
 	eor		\dst_reg\().16b, \dst_reg\().16b, \src_reg\().16b
@@ -296,7 +388,6 @@
CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
 	// Load the constants for folding across 128 bytes.
 	ld1		{fold_consts.2d}, [fold_consts_ptr]
-	__pmull_pre_\p	fold_consts
 
 	// Subtract 128 for the 128 data bytes just consumed. Subtract another
 	// 128 to simplify the termination condition of the following loop.
@@ -318,7 +409,6 @@ CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
 	// Fold across 64 bytes.
 	add		fold_consts_ptr, fold_consts_ptr, #16
 	ld1		{fold_consts.2d}, [fold_consts_ptr], #16
-	__pmull_pre_\p	fold_consts
 	fold_16_bytes	\p, v0, v4
 	fold_16_bytes	\p, v1, v5
 	fold_16_bytes	\p, v2, v6
@@ -339,8 +429,7 @@ CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
 	// into them, storing the result back into v7.
 	b.lt		.Lfold_16_bytes_loop_done_\@
 .Lfold_16_bytes_loop_\@:
-	__pmull_\p	v8, v7, fold_consts
-	__pmull_\p	v7, v7, fold_consts, 2
+	pmull16x64_\p	fold_consts, v7, v8
 	eor		v7.16b, v7.16b, v8.16b
 	ldr		q0, [buf], #16
CPU_LE(	rev64		v0.16b, v0.16b	)
@@ -387,9 +476,8 @@ CPU_LE(	ext		v0.16b, v0.16b, v0.16b, #8	)
 	bsl		v2.16b, v1.16b, v0.16b
 
 	// Fold the first chunk into the second chunk, storing the result in v7.
-	__pmull_\p	v0, v3, fold_consts
-	__pmull_\p	v7, v3, fold_consts, 2
-	eor		v7.16b, v7.16b, v0.16b
+	pmull16x64_\p	fold_consts, v3, v0
+	eor		v7.16b, v3.16b, v0.16b
 	eor		v7.16b, v7.16b, v2.16b
 
 .Lreduce_final_16_bytes_\@:
@@ -450,7 +538,6 @@ CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
 
 	// Load the fold-across-16-bytes constants.
 	ld1		{fold_consts.2d}, [fold_consts_ptr], #16
-	__pmull_pre_\p	fold_consts
 
 	cmp		len, #16
 	b.eq		.Lreduce_final_16_bytes_\@	// len == 16
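The decomposition described in the comment block above can be modelled
in plain C. The sketch below is illustrative userspace code, not
kernel code: clmul8() stands in for a single lane of an 8-bit PMULL,
clmul16x64() applies the rank decomposition, and the bit-serial
reference plus main() merely check the identity (this assumes a
compiler that provides unsigned __int128, such as GCC or Clang):

	#include <stdint.h>
	#include <stdio.h>

	/* 8x8 -> 16 bit carryless multiply: one lane of PMULL.8B */
	static uint16_t clmul8(uint8_t a, uint8_t b)
	{
		uint16_t r = 0;

		for (int i = 0; i < 8; i++)
			if (b & (1u << i))
				r ^= (uint16_t)a << i;
		return r;
	}

	/* 16x64 -> 80 bit carryless multiply via the rank decomposition:
	 * rank k receives w0*x[k] and w1*x[k-1], shifted into place. */
	static unsigned __int128 clmul16x64(uint16_t w, uint64_t x)
	{
		uint8_t w0 = w & 0xff, w1 = w >> 8;
		unsigned __int128 acc = 0;

		for (int k = 0; k < 8; k++) {
			uint8_t xk = x >> (8 * k);

			acc ^= (unsigned __int128)clmul8(w0, xk) << (8 * k);
			acc ^= (unsigned __int128)clmul8(w1, xk) << (8 * (k + 1));
		}
		return acc;
	}

	/* bit-serial reference for the same carryless product */
	static unsigned __int128 clmul16x64_ref(uint16_t w, uint64_t x)
	{
		unsigned __int128 acc = 0;

		for (int i = 0; i < 16; i++)
			if (w & (1u << i))
				acc ^= (unsigned __int128)x << i;
		return acc;
	}

	int main(void)
	{
		uint16_t w = 0x8bb7;	/* arbitrary test values */
		uint64_t x = 0x0123456789abcdefULL;

		printf("%s\n", clmul16x64(w, x) == clmul16x64_ref(w, x) ?
			       "match" : "MISMATCH");
		return 0;
	}

Since the 16-bit multiplicand contributes only two nonzero bytes, each
64-bit input needs far fewer 8x8 partial products than a full 64x64
decomposition would, which is why the __pmull_p8_16x64 helper above
gets by with four PMULL instructions for a pair of 16x64 products.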
From patchwork Tue Nov 5 16:09:03 2024
X-Patchwork-Submitter: Ard Biesheuvel
X-Patchwork-Id: 13863224
Date: Tue, 5 Nov 2024 17:09:03 +0100
In-Reply-To: <20241105160859.1459261-8-ardb+git@google.com>
References: <20241105160859.1459261-8-ardb+git@google.com>
Message-ID: <20241105160859.1459261-11-ardb+git@google.com>
Subject: [PATCH v2 3/6] crypto: arm64/crct10dif - Remove remaining 64x64 PMULL fallback code
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel

From: Ard Biesheuvel

The only remaining user of the fallback implementation of 64x64
polynomial multiplication using 8x8 PMULL instructions is the final
reduction from a 16 byte vector to a 16-bit CRC.
The fallback code is complicated and messy, and this reduction has
little impact on the overall performance, so instead, let's calculate
the final CRC by passing the 16 byte vector to the generic CRC-T10DIF
implementation when running the fallback version.

Signed-off-by: Ard Biesheuvel
---
 arch/arm64/crypto/crct10dif-ce-core.S | 244 +++++----------------
 arch/arm64/crypto/crct10dif-ce-glue.c |  18 +-
 2 files changed, 68 insertions(+), 194 deletions(-)

diff --git a/arch/arm64/crypto/crct10dif-ce-core.S b/arch/arm64/crypto/crct10dif-ce-core.S
index d2acaa2b5a01..87dd6d46224d 100644
--- a/arch/arm64/crypto/crct10dif-ce-core.S
+++ b/arch/arm64/crypto/crct10dif-ce-core.S
@@ -74,137 +74,18 @@
 	init_crc	.req	w0
 	buf		.req	x1
 	len		.req	x2
-	fold_consts_ptr	.req	x3
+	fold_consts_ptr	.req	x5
 
 	fold_consts	.req	v10
 
-	ad		.req	v14
-
-	k00_16		.req	v15
-	k32_48		.req	v16
-
 	t3		.req	v17
 	t4		.req	v18
 	t5		.req	v19
 	t6		.req	v20
 	t7		.req	v21
 	t8		.req	v22
-	t9		.req	v23
-
-	perm1		.req	v24
-	perm2		.req	v25
-	perm3		.req	v26
-	perm4		.req	v27
-
-	bd1		.req	v28
-	bd2		.req	v29
-	bd3		.req	v30
-	bd4		.req	v31
-
-	.macro		__pmull_init_p64
-	.endm
-
-	.macro		__pmull_pre_p64, bd
-	.endm
-
-	.macro		__pmull_init_p8
-	// k00_16 := 0x0000000000000000_000000000000ffff
-	// k32_48 := 0x00000000ffffffff_0000ffffffffffff
-	movi		k32_48.2d, #0xffffffff
-	mov		k32_48.h[2], k32_48.h[0]
-	ushr		k00_16.2d, k32_48.2d, #32
-
-	// prepare the permutation vectors
-	mov_q		x5, 0x080f0e0d0c0b0a09
-	movi		perm4.8b, #8
-	dup		perm1.2d, x5
-	eor		perm1.16b, perm1.16b, perm4.16b
-	ushr		perm2.2d, perm1.2d, #8
-	ushr		perm3.2d, perm1.2d, #16
-	ushr		perm4.2d, perm1.2d, #24
-	sli		perm2.2d, perm1.2d, #56
-	sli		perm3.2d, perm1.2d, #48
-	sli		perm4.2d, perm1.2d, #40
-
-	// Compose { 0,0,0,0, 8,8,8,8, 1,1,1,1, 9,9,9,9 }
-	movi		bd1.4h, #8, lsl #8
-	orr		bd1.2s, #1, lsl #16
-	orr		bd1.2s, #1, lsl #24
-	zip1		bd1.16b, bd1.16b, bd1.16b
-	zip1		bd1.16b, bd1.16b, bd1.16b
-	.endm
-
-	.macro		__pmull_pre_p8, bd
-	tbl		bd1.16b, {\bd\().16b}, perm1.16b
-	tbl		bd2.16b, {\bd\().16b}, perm2.16b
-	tbl		bd3.16b, {\bd\().16b}, perm3.16b
-	tbl		bd4.16b, {\bd\().16b}, perm4.16b
-	.endm
-
-SYM_FUNC_START_LOCAL(__pmull_p8_core)
-.L__pmull_p8_core:
-	ext		t4.8b, ad.8b, ad.8b, #1			// A1
-	ext		t5.8b, ad.8b, ad.8b, #2			// A2
-	ext		t6.8b, ad.8b, ad.8b, #3			// A3
-
-	pmull		t4.8h, t4.8b, fold_consts.8b		// F = A1*B
-	pmull		t8.8h, ad.8b, bd1.8b			// E = A*B1
-	pmull		t5.8h, t5.8b, fold_consts.8b		// H = A2*B
-	pmull		t7.8h, ad.8b, bd2.8b			// G = A*B2
-	pmull		t6.8h, t6.8b, fold_consts.8b		// J = A3*B
-	pmull		t9.8h, ad.8b, bd3.8b			// I = A*B3
-	pmull		t3.8h, ad.8b, bd4.8b			// K = A*B4
-	b		0f
-
-.L__pmull_p8_core2:
-	tbl		t4.16b, {ad.16b}, perm1.16b		// A1
-	tbl		t5.16b, {ad.16b}, perm2.16b		// A2
-	tbl		t6.16b, {ad.16b}, perm3.16b		// A3
-
-	pmull2		t4.8h, t4.16b, fold_consts.16b		// F = A1*B
-	pmull2		t8.8h, ad.16b, bd1.16b			// E = A*B1
-	pmull2		t5.8h, t5.16b, fold_consts.16b		// H = A2*B
-	pmull2		t7.8h, ad.16b, bd2.16b			// G = A*B2
-	pmull2		t6.8h, t6.16b, fold_consts.16b		// J = A3*B
-	pmull2		t9.8h, ad.16b, bd3.16b			// I = A*B3
-	pmull2		t3.8h, ad.16b, bd4.16b			// K = A*B4
-
-0:	eor		t4.16b, t4.16b, t8.16b			// L = E + F
-	eor		t5.16b, t5.16b, t7.16b			// M = G + H
-	eor		t6.16b, t6.16b, t9.16b			// N = I + J
-
-	uzp1		t8.2d, t4.2d, t5.2d
-	uzp2		t4.2d, t4.2d, t5.2d
-	uzp1		t7.2d, t6.2d, t3.2d
-	uzp2		t6.2d, t6.2d, t3.2d
-
-	// t4 = (L) (P0 + P1) << 8
-	// t5 = (M) (P2 + P3) << 16
-	eor		t8.16b, t8.16b, t4.16b
-	and		t4.16b, t4.16b, k32_48.16b
-
-	// t6 = (N) (P4 + P5) << 24
-	// t7 = (K) (P6 + P7) << 32
-	eor		t7.16b, t7.16b, t6.16b
-	and		t6.16b, t6.16b, k00_16.16b
-
-	eor		t8.16b, t8.16b, t4.16b
-	eor		t7.16b, t7.16b, t6.16b
-
-	zip2		t5.2d, t8.2d, t4.2d
-	zip1		t4.2d, t8.2d, t4.2d
-	zip2		t3.2d, t7.2d, t6.2d
-	zip1		t6.2d, t7.2d, t6.2d
-
-	ext		t4.16b, t4.16b, t4.16b, #15
-	ext		t5.16b, t5.16b, t5.16b, #14
-	ext		t6.16b, t6.16b, t6.16b, #13
-	ext		t3.16b, t3.16b, t3.16b, #12
-
-	eor		t4.16b, t4.16b, t5.16b
-	eor		t6.16b, t6.16b, t3.16b
-	ret
-SYM_FUNC_END(__pmull_p8_core)
+	perm		.req	v27
 
 	.macro		pmull16x64_p64, a16, b64, c64
 	pmull2		\c64\().1q, \a16\().2d, \b64\().2d
@@ -266,7 +147,7 @@ SYM_FUNC_END(__pmull_p8_core)
 	 */
 	.macro		pmull16x64_p8, a16, b64, c64
 	ext		t7.16b, \b64\().16b, \b64\().16b, #1
-	tbl		t5.16b, {\a16\().16b}, bd1.16b
+	tbl		t5.16b, {\a16\().16b}, perm.16b
 	uzp1		t7.16b, \b64\().16b, t7.16b
 	bl		__pmull_p8_16x64
 	ext		\b64\().16b, t4.16b, t4.16b, #15
@@ -292,22 +173,6 @@ SYM_FUNC_START_LOCAL(__pmull_p8_16x64)
 	ret
 SYM_FUNC_END(__pmull_p8_16x64)
 
-	.macro		__pmull_p8, rq, ad, bd, i
-	.ifnc		\bd, fold_consts
-	.err
-	.endif
-	mov		ad.16b, \ad\().16b
-	.ifb		\i
-	pmull		\rq\().8h, \ad\().8b, \bd\().8b		// D = A*B
-	.else
-	pmull2		\rq\().8h, \ad\().16b, \bd\().16b	// D = A*B
-	.endif
-
-	bl		.L__pmull_p8_core\i
-
-	eor		\rq\().16b, \rq\().16b, t4.16b
-	eor		\rq\().16b, \rq\().16b, t6.16b
-	.endm
 
 // Fold reg1, reg2 into the next 32 data bytes, storing the result back
 // into reg1, reg2.
 	.macro		fold_32_bytes, p, reg1, reg2
 	ldp		q11, q12, [buf], #0x20
@@ -340,16 +205,7 @@ CPU_LE(	ext		v12.16b, v12.16b, v12.16b, #8	)
 	eor		\dst_reg\().16b, \dst_reg\().16b, \src_reg\().16b
 	.endm
 
-	.macro		__pmull_p64, rd, rn, rm, n
-	.ifb		\n
-	pmull		\rd\().1q, \rn\().1d, \rm\().1d
-	.else
-	pmull2		\rd\().1q, \rn\().2d, \rm\().2d
-	.endif
-	.endm
-
 	.macro		crc_t10dif_pmull, p
-	__pmull_init_\p
 
 	// For sizes less than 256 bytes, we can't fold 128 bytes at a time.
 	cmp		len, #256
@@ -479,47 +335,7 @@ CPU_LE(	ext		v0.16b, v0.16b, v0.16b, #8	)
 	pmull16x64_\p	fold_consts, v3, v0
 	eor		v7.16b, v3.16b, v0.16b
 	eor		v7.16b, v7.16b, v2.16b
-
-.Lreduce_final_16_bytes_\@:
-	// Reduce the 128-bit value M(x), stored in v7, to the final 16-bit CRC.
-
-	movi		v2.16b, #0		// init zero register
-
-	// Load 'x^48 * (x^48 mod G(x))' and 'x^48 * (x^80 mod G(x))'.
-	ld1		{fold_consts.2d}, [fold_consts_ptr], #16
-	__pmull_pre_\p	fold_consts
-
-	// Fold the high 64 bits into the low 64 bits, while also multiplying by
-	// x^64. This produces a 128-bit value congruent to x^64 * M(x) and
-	// whose low 48 bits are 0.
-	ext		v0.16b, v2.16b, v7.16b, #8
-	__pmull_\p	v7, v7, fold_consts, 2	// high bits * x^48 * (x^80 mod G(x))
-	eor		v0.16b, v0.16b, v7.16b	// + low bits * x^64
-
-	// Fold the high 32 bits into the low 96 bits. This produces a 96-bit
-	// value congruent to x^64 * M(x) and whose low 48 bits are 0.
-	ext		v1.16b, v0.16b, v2.16b, #12	// extract high 32 bits
-	mov		v0.s[3], v2.s[0]		// zero high 32 bits
-	__pmull_\p	v1, v1, fold_consts		// high 32 bits * x^48 * (x^48 mod G(x))
-	eor		v0.16b, v0.16b, v1.16b		// + low bits
-
-	// Load G(x) and floor(x^48 / G(x)).
-	ld1		{fold_consts.2d}, [fold_consts_ptr]
-	__pmull_pre_\p	fold_consts
-
-	// Use Barrett reduction to compute the final CRC value.
-	__pmull_\p	v1, v0, fold_consts, 2	// high 32 bits * floor(x^48 / G(x))
-	ushr		v1.2d, v1.2d, #32	// /= x^32
-	__pmull_\p	v1, v1, fold_consts	// *= G(x)
-	ushr		v0.2d, v0.2d, #48
-	eor		v0.16b, v0.16b, v1.16b	// + low 16 nonzero bits
-	// Final CRC value (x^16 * M(x)) mod G(x) is in low 16 bits of v0.
-
-	umov		w0, v0.h[0]
-	.ifc		\p, p8
-	frame_pop
-	.endif
-	ret
+	b		.Lreduce_final_16_bytes_\@
 
 .Lless_than_256_bytes_\@:
 	// Checksumming a buffer of length 16...255 bytes
@@ -545,6 +361,8 @@ CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
 	b.ge		.Lfold_16_bytes_loop_\@	// 32 <= len <= 255
 	add		len, len, #16
 	b		.Lhandle_partial_segment_\@	// 17 <= len <= 31
+
+.Lreduce_final_16_bytes_\@:
 	.endm
 
 //
@@ -554,7 +372,22 @@
 //
 SYM_FUNC_START(crc_t10dif_pmull_p8)
 	frame_push	1
+
+	// Compose { 0,0,0,0, 8,8,8,8, 1,1,1,1, 9,9,9,9 }
+	movi		perm.4h, #8, lsl #8
+	orr		perm.2s, #1, lsl #16
+	orr		perm.2s, #1, lsl #24
+	zip1		perm.16b, perm.16b, perm.16b
+	zip1		perm.16b, perm.16b, perm.16b
+
 	crc_t10dif_pmull	p8
+
+CPU_LE(	rev64		v7.16b, v7.16b	)
+CPU_LE(	ext		v7.16b, v7.16b, v7.16b, #8	)
+	str		q7, [x3]
+
+	frame_pop
+	ret
 SYM_FUNC_END(crc_t10dif_pmull_p8)
 
 	.align		5
@@ -565,6 +398,41 @@ SYM_FUNC_END(crc_t10dif_pmull_p8)
 //
 SYM_FUNC_START(crc_t10dif_pmull_p64)
 	crc_t10dif_pmull	p64
+
+	// Reduce the 128-bit value M(x), stored in v7, to the final 16-bit CRC.
+
+	movi		v2.16b, #0		// init zero register
+
+	// Load 'x^48 * (x^48 mod G(x))' and 'x^48 * (x^80 mod G(x))'.
+	ld1		{fold_consts.2d}, [fold_consts_ptr], #16
+
+	// Fold the high 64 bits into the low 64 bits, while also multiplying by
+	// x^64. This produces a 128-bit value congruent to x^64 * M(x) and
+	// whose low 48 bits are 0.
+	ext		v0.16b, v2.16b, v7.16b, #8
+	pmull2		v7.1q, v7.2d, fold_consts.2d	// high bits * x^48 * (x^80 mod G(x))
+	eor		v0.16b, v0.16b, v7.16b		// + low bits * x^64
+
+	// Fold the high 32 bits into the low 96 bits. This produces a 96-bit
+	// value congruent to x^64 * M(x) and whose low 48 bits are 0.
+	ext		v1.16b, v0.16b, v2.16b, #12	// extract high 32 bits
+	mov		v0.s[3], v2.s[0]		// zero high 32 bits
+	pmull		v1.1q, v1.1d, fold_consts.1d	// high 32 bits * x^48 * (x^48 mod G(x))
+	eor		v0.16b, v0.16b, v1.16b		// + low bits
+
+	// Load G(x) and floor(x^48 / G(x)).
+	ld1		{fold_consts.2d}, [fold_consts_ptr]
+
+	// Use Barrett reduction to compute the final CRC value.
+	pmull2		v1.1q, v0.2d, fold_consts.2d	// high 32 bits * floor(x^48 / G(x))
+	ushr		v1.2d, v1.2d, #32	// /= x^32
+	pmull		v1.1q, v1.1d, fold_consts.1d	// *= G(x)
+	ushr		v0.2d, v0.2d, #48
+	eor		v0.16b, v0.16b, v1.16b	// + low 16 nonzero bits
+	// Final CRC value (x^16 * M(x)) mod G(x) is in low 16 bits of v0.
+
+	umov		w0, v0.h[0]
+	ret
 SYM_FUNC_END(crc_t10dif_pmull_p64)
 
 	.section	".rodata", "a"
diff --git a/arch/arm64/crypto/crct10dif-ce-glue.c b/arch/arm64/crypto/crct10dif-ce-glue.c
index 7b05094a0480..08bcbd884395 100644
--- a/arch/arm64/crypto/crct10dif-ce-glue.c
+++ b/arch/arm64/crypto/crct10dif-ce-glue.c
@@ -20,7 +20,8 @@
 
 #define CRC_T10DIF_PMULL_CHUNK_SIZE	16U
 
-asmlinkage u16 crc_t10dif_pmull_p8(u16 init_crc, const u8 *buf, size_t len);
+asmlinkage void crc_t10dif_pmull_p8(u16 init_crc, const u8 *buf, size_t len,
+				    u8 out[16]);
 asmlinkage u16 crc_t10dif_pmull_p64(u16 init_crc, const u8 *buf, size_t len);
 
 static int crct10dif_init(struct shash_desc *desc)
@@ -34,16 +35,21 @@ static int crct10dif_init(struct shash_desc *desc)
 static int crct10dif_update_pmull_p8(struct shash_desc *desc, const u8 *data,
 			    unsigned int length)
 {
-	u16 *crc = shash_desc_ctx(desc);
+	u16 *crcp = shash_desc_ctx(desc);
+	u16 crc = *crcp;
+	u8 buf[16];
 
-	if (length >= CRC_T10DIF_PMULL_CHUNK_SIZE && crypto_simd_usable()) {
+	if (length > CRC_T10DIF_PMULL_CHUNK_SIZE && crypto_simd_usable()) {
 		kernel_neon_begin();
-		*crc = crc_t10dif_pmull_p8(*crc, data, length);
+		crc_t10dif_pmull_p8(crc, data, length, buf);
 		kernel_neon_end();
-	} else {
-		*crc = crc_t10dif_generic(*crc, data, length);
+
+		crc = 0;
+		data = buf;
+		length = sizeof(buf);
 	}
 
+	*crcp = crc_t10dif_generic(crc, data, length);
 	return 0;
 }
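For context on what the 16 byte vector is now handed to:
crc_t10dif_generic() computes CRC-16/T10-DIF, whose generator
polynomial is G(x) = x^16 + x^15 + x^11 + x^9 + x^8 + x^7 + x^5 + x^4 +
x^2 + x + 1 (0x18BB7), processed MSB first with a zero initial value.
A bit-serial C model of that computation (the kernel's implementation
is table driven; this sketch only illustrates the math):

	#include <stdint.h>
	#include <stddef.h>

	static uint16_t crc_t10dif_bitwise(uint16_t crc, const uint8_t *data,
					   size_t len)
	{
		while (len--) {
			crc ^= (uint16_t)*data++ << 8;	/* feed in 8 bits */
			for (int i = 0; i < 8; i++)	/* reduce mod G(x) */
				crc = (crc & 0x8000) ? (crc << 1) ^ 0x8bb7
						     : crc << 1;
		}
		return crc;
	}

In the p8 path above, the NEON code folds the entire input into the
16 byte buffer, and the trailing crc_t10dif_generic(0, buf, 16) call
performs this reduction to the final CRC, which is what makes the
Barrett reduction fallback removable.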
From patchwork Tue Nov 5 16:09:04 2024
X-Patchwork-Submitter: Ard Biesheuvel
X-Patchwork-Id: 13863225
Date: Tue, 5 Nov 2024 17:09:04 +0100
In-Reply-To: <20241105160859.1459261-8-ardb+git@google.com>
References: <20241105160859.1459261-8-ardb+git@google.com>
Message-ID: <20241105160859.1459261-12-ardb+git@google.com>
Subject: [PATCH v2 4/6] crypto: arm/crct10dif - Use existing mov_l macro instead of __adrl
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel,
 Eric Biggers

From: Ard Biesheuvel

Reviewed-by: Eric Biggers
Signed-off-by: Ard Biesheuvel
---
 arch/arm/crypto/crct10dif-ce-core.S | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/arch/arm/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
index 46c02c518a30..4dac32e020de 100644
--- a/arch/arm/crypto/crct10dif-ce-core.S
+++ b/arch/arm/crypto/crct10dif-ce-core.S
@@ -144,11 +144,6 @@ CPU_LE(	vrev64.8	q12, q12	)
 	veor.8		\dst_reg, \dst_reg, \src_reg
 	.endm
 
-	.macro		__adrl, out, sym
-	movw		\out, #:lower16:\sym
-	movt		\out, #:upper16:\sym
-	.endm
-
 //
 // u16 crc_t10dif_pmull(u16 init_crc, const u8 *buf, size_t len);
 //
@@ -160,7 +155,7 @@ ENTRY(crc_t10dif_pmull)
 	cmp		len, #256
 	blt		.Lless_than_256_bytes
 
-	__adrl		fold_consts_ptr, .Lfold_across_128_bytes_consts
+	mov_l		fold_consts_ptr, .Lfold_across_128_bytes_consts
 
 	// Load the first 128 data bytes. Byte swapping is necessary to make
 	// the bit order match the polynomial coefficient order.
@@ -262,7 +257,7 @@ CPU_LE(	vrev64.8	q0, q0	)
 	vswp		q0l, q0h
 
 	// q1 = high order part of second chunk: q7 left-shifted by 'len' bytes.
-	__adrl		r3, .Lbyteshift_table + 16
+	mov_l		r3, .Lbyteshift_table + 16
 	sub		r3, r3, len
 	vld1.8		{q2}, [r3]
 	vtbl.8		q1l, {q7l-q7h}, q2l
@@ -324,7 +319,7 @@ CPU_LE(	vrev64.8	q0, q0	)
 .Lless_than_256_bytes:
 	// Checksumming a buffer of length 16...255 bytes
 
-	__adrl		fold_consts_ptr, .Lfold_across_16_bytes_consts
+	mov_l		fold_consts_ptr, .Lfold_across_16_bytes_consts
 
 	// Load the first 16 data bytes.
 	vld1.64		{q7}, [buf]!
From patchwork Tue Nov 5 16:09:05 2024
X-Patchwork-Submitter: Ard Biesheuvel
X-Patchwork-Id: 13863235
Date: Tue, 5 Nov 2024 17:09:05 +0100
In-Reply-To: <20241105160859.1459261-8-ardb+git@google.com>
References: <20241105160859.1459261-8-ardb+git@google.com>
Message-ID: <20241105160859.1459261-13-ardb+git@google.com>
Subject: [PATCH v2 5/6] crypto: arm/crct10dif - Macroify PMULL asm code
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel,
 Eric Biggers

From: Ard Biesheuvel

To allow an alternative version to be created of the PMULL based
CRC-T10DIF algorithm, turn the bulk of it into a macro, except for the
final reduction, which will only be used by the existing version.
Reviewed-by: Eric Biggers
Signed-off-by: Ard Biesheuvel
---
 arch/arm/crypto/crct10dif-ce-core.S | 154 ++++++++++----------
 arch/arm/crypto/crct10dif-ce-glue.c |  10 +-
 2 files changed, 83 insertions(+), 81 deletions(-)

diff --git a/arch/arm/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
index 4dac32e020de..6b72167574b2 100644
--- a/arch/arm/crypto/crct10dif-ce-core.S
+++ b/arch/arm/crypto/crct10dif-ce-core.S
@@ -112,48 +112,42 @@
 
 FOLD_CONST_L	.req	q10l
 FOLD_CONST_H	.req	q10h
 
+	.macro		pmull16x64_p64, v16, v64
+	vmull.p64	q11, \v64\()l, \v16\()_L
+	vmull.p64	\v64, \v64\()h, \v16\()_H
+	veor		\v64, \v64, q11
+	.endm
+
 // Fold reg1, reg2 into the next 32 data bytes, storing the result back
 // into reg1, reg2.
-	.macro		fold_32_bytes, reg1, reg2
-	vld1.64		{q11-q12}, [buf]!
+	.macro		fold_32_bytes, reg1, reg2, p
+	vld1.64		{q8-q9}, [buf]!
 
-	vmull.p64	q8, \reg1\()h, FOLD_CONST_H
-	vmull.p64	\reg1, \reg1\()l, FOLD_CONST_L
-	vmull.p64	q9, \reg2\()h, FOLD_CONST_H
-	vmull.p64	\reg2, \reg2\()l, FOLD_CONST_L
+	pmull16x64_\p	FOLD_CONST, \reg1
+	pmull16x64_\p	FOLD_CONST, \reg2
 
-CPU_LE(	vrev64.8	q11, q11	)
-CPU_LE(	vrev64.8	q12, q12	)
-	vswp		q11l, q11h
-	vswp		q12l, q12h
+CPU_LE(	vrev64.8	q8, q8	)
+CPU_LE(	vrev64.8	q9, q9	)
+	vswp		q8l, q8h
+	vswp		q9l, q9h
 
 	veor.8		\reg1, \reg1, q8
 	veor.8		\reg2, \reg2, q9
-	veor.8		\reg1, \reg1, q11
-	veor.8		\reg2, \reg2, q12
 	.endm
 
 // Fold src_reg into dst_reg, optionally loading the next fold constants
-	.macro		fold_16_bytes, src_reg, dst_reg, load_next_consts
-	vmull.p64	q8, \src_reg\()l, FOLD_CONST_L
-	vmull.p64	\src_reg, \src_reg\()h, FOLD_CONST_H
+	.macro		fold_16_bytes, src_reg, dst_reg, p, load_next_consts
+	pmull16x64_\p	FOLD_CONST, \src_reg
 	.ifnb		\load_next_consts
 	vld1.64		{FOLD_CONSTS}, [fold_consts_ptr, :128]!
 	.endif
-	veor.8		\dst_reg, \dst_reg, q8
 	veor.8		\dst_reg, \dst_reg, \src_reg
 	.endm
 
-//
-// u16 crc_t10dif_pmull(u16 init_crc, const u8 *buf, size_t len);
-//
-// Assumes len >= 16.
-//
-ENTRY(crc_t10dif_pmull)
-
+	.macro		crct10dif, p
 	// For sizes less than 256 bytes, we can't fold 128 bytes at a time.
 	cmp		len, #256
-	blt		.Lless_than_256_bytes
+	blt		.Lless_than_256_bytes\@
 
 	mov_l		fold_consts_ptr, .Lfold_across_128_bytes_consts
 
@@ -194,27 +188,27 @@ CPU_LE(	vrev64.8	q7, q7	)
 
 	// While >= 128 data bytes remain (not counting q0-q7), fold the 128
 	// bytes q0-q7 into them, storing the result back into q0-q7.
-.Lfold_128_bytes_loop:
-	fold_32_bytes	q0, q1
-	fold_32_bytes	q2, q3
-	fold_32_bytes	q4, q5
-	fold_32_bytes	q6, q7
+.Lfold_128_bytes_loop\@:
+	fold_32_bytes	q0, q1, \p
+	fold_32_bytes	q2, q3, \p
+	fold_32_bytes	q4, q5, \p
+	fold_32_bytes	q6, q7, \p
 	subs		len, len, #128
-	bge		.Lfold_128_bytes_loop
+	bge		.Lfold_128_bytes_loop\@
 
 	// Now fold the 112 bytes in q0-q6 into the 16 bytes in q7.
 
 	// Fold across 64 bytes.
 	vld1.64		{FOLD_CONSTS}, [fold_consts_ptr, :128]!
-	fold_16_bytes	q0, q4
-	fold_16_bytes	q1, q5
-	fold_16_bytes	q2, q6
-	fold_16_bytes	q3, q7, 1
+	fold_16_bytes	q0, q4, \p
+	fold_16_bytes	q1, q5, \p
+	fold_16_bytes	q2, q6, \p
+	fold_16_bytes	q3, q7, \p, 1
 
 	// Fold across 32 bytes.
-	fold_16_bytes	q4, q6
-	fold_16_bytes	q5, q7, 1
+	fold_16_bytes	q4, q6, \p
+	fold_16_bytes	q5, q7, \p, 1
 
 	// Fold across 16 bytes.
-	fold_16_bytes	q6, q7
+	fold_16_bytes	q6, q7, \p
 
 	// Add 128 to get the correct number of data bytes remaining in 0...127
 	// (not counting q7), following the previous extra subtraction by 128.
@@ -224,25 +218,23 @@ CPU_LE(	vrev64.8	q7, q7	)
 
 	// While >= 16 data bytes remain (not counting q7), fold the 16 bytes q7
 	// into them, storing the result back into q7.
-	blt		.Lfold_16_bytes_loop_done
-.Lfold_16_bytes_loop:
-	vmull.p64	q8, q7l, FOLD_CONST_L
-	vmull.p64	q7, q7h, FOLD_CONST_H
-	veor.8		q7, q7, q8
+	blt		.Lfold_16_bytes_loop_done\@
+.Lfold_16_bytes_loop\@:
+	pmull16x64_\p	FOLD_CONST, q7
 	vld1.64		{q0}, [buf]!
CPU_LE(	vrev64.8	q0, q0	)
 	vswp		q0l, q0h
 	veor.8		q7, q7, q0
 	subs		len, len, #16
-	bge		.Lfold_16_bytes_loop
+	bge		.Lfold_16_bytes_loop\@
 
-.Lfold_16_bytes_loop_done:
+.Lfold_16_bytes_loop_done\@:
 	// Add 16 to get the correct number of data bytes remaining in 0...15
 	// (not counting q7), following the previous extra subtraction by 16.
 	adds		len, len, #16
-	beq		.Lreduce_final_16_bytes
+	beq		.Lreduce_final_16_bytes\@
 
-.Lhandle_partial_segment:
+.Lhandle_partial_segment\@:
 	// Reduce the last '16 + len' bytes where 1 <= len <= 15 and the first
 	// 16 bytes are in q7 and the rest are the remaining data in 'buf'. To
 	// do this without needing a fold constant for each possible 'len',
@@ -277,12 +269,46 @@ CPU_LE(	vrev64.8	q0, q0	)
 	vbsl.8		q2, q1, q0
 
 	// Fold the first chunk into the second chunk, storing the result in q7.
-	vmull.p64	q0, q3l, FOLD_CONST_L
-	vmull.p64	q7, q3h, FOLD_CONST_H
-	veor.8		q7, q7, q0
-	veor.8		q7, q7, q2
+	pmull16x64_\p	FOLD_CONST, q3
+	veor.8		q7, q3, q2
+	b		.Lreduce_final_16_bytes\@
+
+.Lless_than_256_bytes\@:
+	// Checksumming a buffer of length 16...255 bytes
+
+	mov_l		fold_consts_ptr, .Lfold_across_16_bytes_consts
+
+	// Load the first 16 data bytes.
+	vld1.64		{q7}, [buf]!
+CPU_LE(	vrev64.8	q7, q7	)
+	vswp		q7l, q7h
+
+	// XOR the first 16 data *bits* with the initial CRC value.
+	vmov.i8		q0h, #0
+	vmov.u16	q0h[3], init_crc
+	veor.8		q7h, q7h, q0h
+
+	// Load the fold-across-16-bytes constants.
+	vld1.64		{FOLD_CONSTS}, [fold_consts_ptr, :128]!
+
+	cmp		len, #16
+	beq		.Lreduce_final_16_bytes\@	// len == 16
+	subs		len, len, #32
+	addlt		len, len, #16
+	blt		.Lhandle_partial_segment\@	// 17 <= len <= 31
+	b		.Lfold_16_bytes_loop\@	// 32 <= len <= 255
+
+.Lreduce_final_16_bytes\@:
+	.endm
+
+//
+// u16 crc_t10dif_pmull(u16 init_crc, const u8 *buf, size_t len);
+//
+// Assumes len >= 16.
+//
+ENTRY(crc_t10dif_pmull64)
+	crct10dif	p64
 
-.Lreduce_final_16_bytes:
 	// Reduce the 128-bit value M(x), stored in q7, to the final 16-bit CRC.
 
 	// Load 'x^48 * (x^48 mod G(x))' and 'x^48 * (x^80 mod G(x))'.
@@ -316,31 +342,7 @@ CPU_LE(	vrev64.8	q0, q0	)
 	vmov.u16	r0, q0l[0]
 	bx		lr
 
-.Lless_than_256_bytes:
-	// Checksumming a buffer of length 16...255 bytes
-
-	mov_l		fold_consts_ptr, .Lfold_across_16_bytes_consts
-
-	// Load the first 16 data bytes.
-	vld1.64		{q7}, [buf]!
-CPU_LE(	vrev64.8	q7, q7	)
-	vswp		q7l, q7h
-
-	// XOR the first 16 data *bits* with the initial CRC value.
-	vmov.i8		q0h, #0
-	vmov.u16	q0h[3], init_crc
-	veor.8		q7h, q7h, q0h
-
-	// Load the fold-across-16-bytes constants.
-	vld1.64		{FOLD_CONSTS}, [fold_consts_ptr, :128]!
-
-	cmp		len, #16
-	beq		.Lreduce_final_16_bytes	// len == 16
-	subs		len, len, #32
-	addlt		len, len, #16
-	blt		.Lhandle_partial_segment	// 17 <= len <= 31
-	b		.Lfold_16_bytes_loop	// 32 <= len <= 255
-ENDPROC(crc_t10dif_pmull)
+ENDPROC(crc_t10dif_pmull64)
 
 	.section	".rodata", "a"
 	.align		4
diff --git a/arch/arm/crypto/crct10dif-ce-glue.c b/arch/arm/crypto/crct10dif-ce-glue.c
index 79f3b204d8c0..60aa79c2fcdb 100644
--- a/arch/arm/crypto/crct10dif-ce-glue.c
+++ b/arch/arm/crypto/crct10dif-ce-glue.c
@@ -19,7 +19,7 @@
 
 #define CRC_T10DIF_PMULL_CHUNK_SIZE	16U
 
-asmlinkage u16 crc_t10dif_pmull(u16 init_crc, const u8 *buf, size_t len);
+asmlinkage u16 crc_t10dif_pmull64(u16 init_crc, const u8 *buf, size_t len);
 
 static int crct10dif_init(struct shash_desc *desc)
 {
@@ -29,14 +29,14 @@ static int crct10dif_init(struct shash_desc *desc)
 	return 0;
 }
 
-static int crct10dif_update(struct shash_desc *desc, const u8 *data,
-			    unsigned int length)
+static int crct10dif_update_ce(struct shash_desc *desc, const u8 *data,
+			       unsigned int length)
 {
 	u16 *crc = shash_desc_ctx(desc);
 
 	if (length >= CRC_T10DIF_PMULL_CHUNK_SIZE && crypto_simd_usable()) {
 		kernel_neon_begin();
-		*crc = crc_t10dif_pmull(*crc, data, length);
+		*crc = crc_t10dif_pmull64(*crc, data, length);
 		kernel_neon_end();
 	} else {
 		*crc = crc_t10dif_generic(*crc, data, length);
@@ -56,7 +56,7 @@ static int crct10dif_final(struct shash_desc *desc, u8 *out)
 static struct shash_alg crc_t10dif_alg = {
 	.digestsize		= CRC_T10DIF_DIGEST_SIZE,
 	.init			= crct10dif_init,
-	.update			= crct10dif_update,
+	.update			= crct10dif_update_ce,
 	.final			= crct10dif_final,
 	.descsize		= CRC_T10DIF_DIGEST_SIZE,
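The structure this macroification sets up is easiest to see in C terms:
one copy of the folding algorithm, parameterized over the multiply
primitive, instantiated once per entry point. The userspace sketch
below only illustrates that pattern (none of these names are kernel
code), using two interchangeable implementations of the same carryless
multiply the way crct10dif takes \p to select pmull16x64_p64 or, in the
next patch, pmull16x64_p8:

	#include <stdint.h>
	#include <stdio.h>

	typedef uint32_t (*clmul16_fn)(uint16_t a, uint16_t b);

	/* bit-serial 16x16 carryless multiply */
	static uint32_t clmul16_direct(uint16_t a, uint16_t b)
	{
		uint32_t r = 0;

		for (int i = 0; i < 16; i++)
			if (b & (1u << i))
				r ^= (uint32_t)a << i;
		return r;
	}

	/* the same product composed from the two bytes of b, the way
	 * pmull16x64_p8 composes a wide product from 8x8 pieces */
	static uint32_t clmul16_bytewise(uint16_t a, uint16_t b)
	{
		return clmul16_direct(a, b & 0xff) ^
		       (clmul16_direct(a, b >> 8) << 8);
	}

	/* the "macro body": written once, generic over the primitive */
	static uint32_t fold(clmul16_fn pmull, const uint16_t *v, int n,
			     uint16_t konst)
	{
		uint32_t acc = 0;

		for (int i = 0; i < n; i++)
			acc ^= pmull(v[i], konst);
		return acc;
	}

	int main(void)
	{
		const uint16_t v[4] = { 1, 2, 3, 0x8bb7 };

		/* both instantiations agree, like the p64/p8 variants */
		printf("%08x\n", fold(clmul16_direct, v, 4, 0x1234));
		printf("%08x\n", fold(clmul16_bytewise, v, 4, 0x1234));
		return 0;
	}

In the assembler, the same effect is achieved with the \@ pseudo
variable, which GNU as expands to a per-invocation counter, so the
local labels of the p64 and p8 expansions of crct10dif do not collide.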
From patchwork Tue Nov 5 16:09:06 2024
X-Patchwork-Submitter: Ard Biesheuvel
X-Patchwork-Id: 13863236
Date: Tue, 5 Nov 2024 17:09:06 +0100
In-Reply-To: <20241105160859.1459261-8-ardb+git@google.com>
References: <20241105160859.1459261-8-ardb+git@google.com>
Message-ID: <20241105160859.1459261-14-ardb+git@google.com>
Subject: [PATCH v2 6/6] crypto: arm/crct10dif - Implement plain NEON variant
From: Ard Biesheuvel
To: linux-crypto@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org, ebiggers@kernel.org,
 herbert@gondor.apana.org.au, keescook@chromium.org, Ard Biesheuvel

From: Ard Biesheuvel

The CRC-T10DIF algorithm produces a 16-bit CRC, and this is reflected
in the folding coefficients, which are also only 16 bits wide. This
means that the polynomial multiplications involving these coefficients
can be performed using 8-bit long polynomial multiplication (8x8 -> 16)
in only a few steps, using an instruction that is part of the base NEON
ISA, which is all that most real ARMv7 cores implement. (The 64-bit
PMULL instruction is part of the crypto extensions, which are only
implemented by 64-bit cores.)

The final reduction is a bit more involved, but we can delegate that to
the generic CRC-T10DIF implementation after folding the entire input
into a 16-byte vector.

This results in a speedup of around 6.6x on Cortex-A72 running in
32-bit mode. On Cortex-A8 (BeagleBone White), the results are
substantially better than that, but not sufficiently reproducible (with
tcrypt) to quote a number here.
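Before diving into the assembler, the decomposition can be modelled in scalar
C. This is only a sketch (clmul8 and clmul16x64 are invented names); bytes are
ordered least to most significant, as in the comment block the patch adds:

#include <stdint.h>
#include <string.h>

/* 8x8 -> 16-bit carryless multiply: what one vmull.p8 lane computes. */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
	uint16_t r = 0;

	for (int i = 0; i < 8; i++)
		if (b & (1 << i))
			r ^= (uint16_t)a << i;
	return r;
}

/* 16x64 -> 80-bit carryless multiply of the 16-bit coefficient
 * w = { w0, w1 } by the 64-bit value x = { x[0], ..., x[7] }: byte x[i]
 * contributes w0*x[i] at rank i and w1*x[i] at rank i+1. The 80-bit
 * result accumulates into out[0..9]. */
static void clmul16x64(uint16_t w, const uint8_t x[8], uint8_t out[10])
{
	uint8_t w0 = w & 0xff, w1 = w >> 8;

	memset(out, 0, 10);
	for (int i = 0; i < 8; i++) {
		uint16_t lo = clmul8(w0, x[i]);	/* rank i     */
		uint16_t hi = clmul8(w1, x[i]);	/* rank i + 1 */

		out[i]     ^= lo & 0xff;
		out[i + 1] ^= (lo >> 8) ^ (hi & 0xff);
		out[i + 2] ^= hi >> 8;
	}
}

Running clmul16x64 over both halves of a q register and XORing the two 80-bit
results gives the quantity that the new pmull16x64_p8 macro computes from four
vmull.p8 instructions.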
Signed-off-by: Ard Biesheuvel
---
 arch/arm/crypto/crct10dif-ce-core.S | 98 +++++++++++++++++++-
 arch/arm/crypto/crct10dif-ce-glue.c | 45 ++++++++-
 2 files changed, 134 insertions(+), 9 deletions(-)

diff --git a/arch/arm/crypto/crct10dif-ce-core.S b/arch/arm/crypto/crct10dif-ce-core.S
index 6b72167574b2..2bbf2df9c1e2 100644
--- a/arch/arm/crypto/crct10dif-ce-core.S
+++ b/arch/arm/crypto/crct10dif-ce-core.S
@@ -112,6 +112,82 @@
 FOLD_CONST_L	.req	q10l
 FOLD_CONST_H	.req	q10h

+	/*
+	 * Pairwise long polynomial multiplication of two 16-bit values
+	 *
+	 *   { w0, w1 }, { y0, y1 }
+	 *
+	 * by two 64-bit values
+	 *
+	 *   { x0, x1, x2, x3, x4, x5, x6, x7 }, { z0, z1, z2, z3, z4, z5, z6, z7 }
+	 *
+	 * where each vector element is a byte, ordered from least to most
+	 * significant. The resulting 80-bit vectors are XOR'ed together.
+	 *
+	 * This can be implemented using 8x8 long polynomial multiplication, by
+	 * reorganizing the input so that each pairwise 8x8 multiplication
+	 * produces one of the terms from the decomposition below, and
+	 * combining the results of each rank and shifting them into place.
+	 *
+	 * Rank
+	 *  0            w0*x0 ^              | y0*z0 ^
+	 *  1       (w0*x1 ^ w1*x0) <<  8 ^   | (y0*z1 ^ y1*z0) <<  8 ^
+	 *  2       (w0*x2 ^ w1*x1) << 16 ^   | (y0*z2 ^ y1*z1) << 16 ^
+	 *  3       (w0*x3 ^ w1*x2) << 24 ^   | (y0*z3 ^ y1*z2) << 24 ^
+	 *  4       (w0*x4 ^ w1*x3) << 32 ^   | (y0*z4 ^ y1*z3) << 32 ^
+	 *  5       (w0*x5 ^ w1*x4) << 40 ^   | (y0*z5 ^ y1*z4) << 40 ^
+	 *  6       (w0*x6 ^ w1*x5) << 48 ^   | (y0*z6 ^ y1*z5) << 48 ^
+	 *  7       (w0*x7 ^ w1*x6) << 56 ^   | (y0*z7 ^ y1*z6) << 56 ^
+	 *  8            w1*x7 << 64          | y1*z7 << 64
+	 *
+	 * The inputs can be reorganized into
+	 *
+	 *   { w0, w0, w0, w0, y0, y0, y0, y0 }, { w1, w1, w1, w1, y1, y1, y1, y1 }
+	 *   { x0, x2, x4, x6, z0, z2, z4, z6 }, { x1, x3, x5, x7, z1, z3, z5, z7 }
+	 *
+	 * and after performing 8x8->16 bit long polynomial multiplication of
+	 * each of the halves of the first vector with those of the second one,
+	 * we obtain the following four vectors of 16-bit elements:
+	 *
+	 *   a := { w0*x0, w0*x2, w0*x4, w0*x6 }, { y0*z0, y0*z2, y0*z4, y0*z6 }
+	 *   b := { w0*x1, w0*x3, w0*x5, w0*x7 }, { y0*z1, y0*z3, y0*z5, y0*z7 }
+	 *   c := { w1*x0, w1*x2, w1*x4, w1*x6 }, { y1*z0, y1*z2, y1*z4, y1*z6 }
+	 *   d := { w1*x1, w1*x3, w1*x5, w1*x7 }, { y1*z1, y1*z3, y1*z5, y1*z7 }
+	 *
+	 * Results b and c can be XORed together, as the vector elements have
+	 * matching ranks. Then, the final XOR can be pulled forward, and
+	 * applied between the halves of each of the remaining three vectors,
+	 * which are then shifted into place, and XORed together to produce the
+	 * final 80-bit result.
+	 */
+	.macro		pmull16x64_p8, v16, v64
+	vext.8		q11, \v64, \v64, #1
+	vld1.64		{q12}, [r4, :128]
+	vuzp.8		q11, \v64
+	vtbl.8		d24, {\v16\()_L-\v16\()_H}, d24
+	vtbl.8		d25, {\v16\()_L-\v16\()_H}, d25
+	bl		__pmull16x64_p8
+	veor		\v64, q12, q14
+	.endm
+
+__pmull16x64_p8:
+	vmull.p8	q13, d23, d24
+	vmull.p8	q14, d23, d25
+	vmull.p8	q15, d22, d24
+	vmull.p8	q12, d22, d25
+
+	veor		q14, q14, q15
+	veor		d24, d24, d25
+	veor		d26, d26, d27
+	veor		d28, d28, d29
+	vmov.i32	d25, #0
+	vmov.i32	d29, #0
+	vext.8		q12, q12, q12, #14
+	vext.8		q14, q14, q14, #15
+	veor		d24, d24, d26
+	bx		lr
+ENDPROC(__pmull16x64_p8)
+
 	.macro		pmull16x64_p64, v16, v64
 	vmull.p64	q11, \v64\()l, \v16\()_L
 	vmull.p64	\v64, \v64\()h, \v16\()_H
@@ -249,9 +325,9 @@ CPU_LE(	vrev64.8	q0, q0	)
 	vswp		q0l, q0h

 	// q1 = high order part of second chunk: q7 left-shifted by 'len' bytes.
-	mov_l		r3, .Lbyteshift_table + 16
-	sub		r3, r3, len
-	vld1.8		{q2}, [r3]
+	mov_l		r1, .Lbyteshift_table + 16
+	sub		r1, r1, len
+	vld1.8		{q2}, [r1]
 	vtbl.8		q1l, {q7l-q7h}, q2l
 	vtbl.8		q1h, {q7l-q7h}, q2h

@@ -341,9 +417,20 @@ ENTRY(crc_t10dif_pmull64)
 	vmov.u16	r0, q0l[0]
 	bx		lr
-
 ENDPROC(crc_t10dif_pmull64)

+ENTRY(crc_t10dif_pmull8)
+	push		{r4, lr}
+	mov_l		r4, .L16x64perm
+
+	crct10dif	p8
+
+CPU_LE(	vrev64.8	q7, q7	)
+	vswp		q7l, q7h
+	vst1.64		{q7}, [r3, :128]
+	pop		{r4, pc}
+ENDPROC(crc_t10dif_pmull8)
+
 	.section	".rodata", "a"
 	.align		4

@@ -376,3 +463,6 @@ ENDPROC(crc_t10dif_pmull64)
 	.byte		0x88, 0x89, 0x8a, 0x8b, 0x8c, 0x8d, 0x8e, 0x8f
 	.byte		0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7
 	.byte		0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe , 0x0
+
+.L16x64perm:
+	.quad		0x808080800000000, 0x909090901010101
diff --git a/arch/arm/crypto/crct10dif-ce-glue.c b/arch/arm/crypto/crct10dif-ce-glue.c
index 60aa79c2fcdb..a8b74523729e 100644
--- a/arch/arm/crypto/crct10dif-ce-glue.c
+++ b/arch/arm/crypto/crct10dif-ce-glue.c
@@ -20,6 +20,8 @@
 #define CRC_T10DIF_PMULL_CHUNK_SIZE	16U

 asmlinkage u16 crc_t10dif_pmull64(u16 init_crc, const u8 *buf, size_t len);
+asmlinkage void crc_t10dif_pmull8(u16 init_crc, const u8 *buf, size_t len,
+				  u8 out[16]);

 static int crct10dif_init(struct shash_desc *desc)
 {
@@ -45,6 +47,27 @@ static int crct10dif_update_ce(struct shash_desc *desc, const u8 *data,
 	return 0;
 }

+static int crct10dif_update_neon(struct shash_desc *desc, const u8 *data,
+				 unsigned int length)
+{
+	u16 *crcp = shash_desc_ctx(desc);
+	u8 buf[16] __aligned(16);
+	u16 crc = *crcp;
+
+	if (length > CRC_T10DIF_PMULL_CHUNK_SIZE && crypto_simd_usable()) {
+		kernel_neon_begin();
+		crc_t10dif_pmull8(crc, data, length, buf);
+		kernel_neon_end();
+
+		crc = 0;
+		data = buf;
+		length = sizeof(buf);
+	}
+
+	*crcp = crc_t10dif_generic(crc, data, length);
+	return 0;
+}
+
 static int crct10dif_final(struct shash_desc *desc, u8 *out)
 {
 	u16 *crc = shash_desc_ctx(desc);
@@ -53,7 +76,19 @@ static int crct10dif_final(struct shash_desc *desc, u8 *out)
 	return 0;
 }

-static struct shash_alg crc_t10dif_alg = {
+static struct shash_alg algs[] = {{
+	.digestsize		= CRC_T10DIF_DIGEST_SIZE,
+	.init			= crct10dif_init,
+	.update			= crct10dif_update_neon,
+	.final			= crct10dif_final,
+	.descsize		= CRC_T10DIF_DIGEST_SIZE,
+
+	.base.cra_name		= "crct10dif",
+	.base.cra_driver_name	= "crct10dif-arm-neon",
+	.base.cra_priority	= 150,
+	.base.cra_blocksize	= CRC_T10DIF_BLOCK_SIZE,
+	.base.cra_module	= THIS_MODULE,
+}, {
 	.digestsize		= CRC_T10DIF_DIGEST_SIZE,
 	.init			= crct10dif_init,
 	.update			= crct10dif_update_ce,
 	.final			= crct10dif_final,
 	.descsize		= CRC_T10DIF_DIGEST_SIZE,
@@ -65,19 +100,19 @@
 	.base.cra_priority	= 200,
 	.base.cra_blocksize	= CRC_T10DIF_BLOCK_SIZE,
 	.base.cra_module	= THIS_MODULE,
-};
+}};

 static int __init crc_t10dif_mod_init(void)
 {
-	if (!(elf_hwcap2 & HWCAP2_PMULL))
+	if (!(elf_hwcap & HWCAP_NEON))
 		return -ENODEV;

-	return crypto_register_shash(&crc_t10dif_alg);
+	return crypto_register_shashes(algs, 1 + !!(elf_hwcap2 & HWCAP2_PMULL));
 }

 static void __exit crc_t10dif_mod_exit(void)
 {
-	crypto_unregister_shash(&crc_t10dif_alg);
+	crypto_unregister_shashes(algs, 1 + !!(elf_hwcap2 & HWCAP2_PMULL));
 }

 module_init(crc_t10dif_mod_init);
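With both variants registered, allocating "crct10dif" by name resolves to the
highest-priority implementation the CPU supports: the PMULL/CE version at
cra_priority 200 when HWCAP2_PMULL is present, the plain NEON version at 150
otherwise, with the generic C implementation below both. A minimal kernel-side
usage sketch (the function name is hypothetical and error handling is
abbreviated):

#include <linux/err.h>
#include <crypto/hash.h>

static u16 example_crc_t10dif(const u8 *data, unsigned int len)
{
	struct crypto_shash *tfm;
	u16 crc = 0;

	/* Picks crct10dif-arm-ce, crct10dif-arm-neon or the generic
	 * fallback, whichever registered with the highest priority. */
	tfm = crypto_alloc_shash("crct10dif", 0, 0);
	if (IS_ERR(tfm))
		return 0;

	{
		SHASH_DESC_ON_STACK(desc, tfm);

		desc->tfm = tfm;
		crypto_shash_digest(desc, data, len, (u8 *)&crc);
	}

	crypto_free_shash(tfm);
	return crc;
}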