[v3,14/14] crypto: arm/blake2b - add NEON-accelerated BLAKE2b

From: Eric Biggers <ebiggers@google.com>

Add a NEON-accelerated implementation of BLAKE2b.

On Cortex-A7 (which these days is the most common ARM processor that
doesn't have the ARMv8 Crypto Extensions), this is over twice as fast as
SHA-256, and slightly faster than SHA-1.  It is also almost three times
as fast as the generic implementation of BLAKE2b:

	Algorithm            Cycles per byte (on 4096-byte messages)
	===================  =======================================
	blake2b-256-neon     14.0
	sha1-neon            16.3
	blake2s-256-arm      18.8
	sha1-asm             20.8
	blake2s-256-generic  26.0
	sha256-neon          28.9
	sha256-asm           32.0
	blake2b-256-generic  38.9
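
The speedup claims above follow directly from the table: a speedup is the ratio of two cycles-per-byte figures. The sketch below only reproduces that arithmetic from the numbers listed; the names cpb_entry, results, speedup() and print_speedups() are made up for this illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Cycles per byte on Cortex-A7 for 4096-byte messages, copied from the table. */
struct cpb_entry {
	const char *name;
	double cpb;
};

static const struct cpb_entry results[] = {
	{ "blake2b-256-neon",    14.0 },
	{ "sha1-neon",           16.3 },
	{ "blake2s-256-arm",     18.8 },
	{ "sha1-asm",            20.8 },
	{ "blake2s-256-generic", 26.0 },
	{ "sha256-neon",         28.9 },
	{ "sha256-asm",          32.0 },
	{ "blake2b-256-generic", 38.9 },
};

/* How many times faster the implementation with fast_cpb is. */
static double speedup(double slow_cpb, double fast_cpb)
{
	assert(fast_cpb > 0.0);
	return slow_cpb / fast_cpb;
}

/* Print each implementation's cost relative to blake2b-256-neon (entry 0). */
static void print_speedups(void)
{
	size_t i;

	for (i = 1; i < sizeof(results) / sizeof(results[0]); i++)
		printf("%-20s %.2fx slower than %s\n", results[i].name,
		       speedup(results[i].cpb, results[0].cpb),
		       results[0].name);
}
```

Calling print_speedups() from a small main() lists every other entry relative to blake2b-256-neon, which is where "over twice as fast as SHA-256" and "almost three times as fast as the generic implementation" come from.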

This implementation isn't directly based on any other implementation,
but it borrows some ideas from previous NEON code I've written as well
as from chacha-neon-core.S.  At least on Cortex-A7, it is faster than
the other NEON implementations of BLAKE2b I'm aware of (the
implementation in the BLAKE2 official repository using intrinsics, and
Andrew Moon's implementation which can be found in SUPERCOP).  It does
only one block at a time, so it performs well on short messages too.

NEON-accelerated BLAKE2b is useful because there is interest in using
BLAKE2b-256 for dm-verity on low-end Android devices (specifically,
devices that lack the ARMv8 Crypto Extensions) to replace SHA-1.  On
these devices, the performance cost of upgrading to SHA-256 may be
unacceptable, whereas BLAKE2b-256 would actually improve performance.

Although BLAKE2b is intended for 64-bit platforms (unlike BLAKE2s which
is intended for 32-bit platforms), on 32-bit ARM processors with NEON,
BLAKE2b is actually faster than BLAKE2s.  This is because NEON supports
64-bit operations, and because BLAKE2s's block size is too small for
NEON to be helpful for it.  The best I've been able to do with BLAKE2s
on Cortex-A7 is 18.8 cpb with an optimized scalar implementation.
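
To illustrate why 64-bit operations matter here, the following is a portable C sketch of the BLAKE2b mixing function G as specified in RFC 7693. It is not the NEON code from this patch; the helper names rotr64() and blake2b_g() are chosen for this example only. Every addition, XOR and rotation acts on a 64-bit word, which 32-bit scalar ARM code has to split into pairs of 32-bit operations, whereas NEON can operate on 64-bit lanes directly.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit word right by n bits; n must be in 1..63. */
static uint64_t rotr64(uint64_t x, unsigned int n)
{
	assert(n > 0 && n < 64);
	return (x >> n) | (x << (64 - n));
}

/*
 * BLAKE2b mixing function G (RFC 7693, section 3.1).
 * Mixes two message words x and y into four words of the 16-word
 * working vector v, selected by the indices a, b, c and d.
 * The rotation counts 32, 24, 16 and 63 are the BLAKE2b constants.
 */
static void blake2b_g(uint64_t v[16], int a, int b, int c, int d,
		      uint64_t x, uint64_t y)
{
	v[a] = v[a] + v[b] + x;
	v[d] = rotr64(v[d] ^ v[a], 32);
	v[c] = v[c] + v[d];
	v[b] = rotr64(v[b] ^ v[c], 24);
	v[a] = v[a] + v[b] + y;
	v[d] = rotr64(v[d] ^ v[a], 16);
	v[c] = v[c] + v[d];
	v[b] = rotr64(v[b] ^ v[c], 63);
}
```

A round applies G eight times, first to the four columns and then to the four diagonals of the 4x4 state, e.g. blake2b_g(v, 0, 4, 8, 12, x, y) for the first column, where x and y are the two message words selected for that call.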

(I didn't try BLAKE2sp or BLAKE3, which in theory would be faster but
are more complex, since they require running multiple hashes at once.
Note that BLAKE2b already uses all the NEON bandwidth on the Cortex-A7,
so I expect that any speedup from BLAKE2sp or BLAKE3 would come only
from the smaller number of rounds, not from the extra parallelism.)

For now this BLAKE2b implementation is only wired up to the shash API,
since there is no library API for BLAKE2b yet.  However, I've tried to
keep things consistent with BLAKE2s, e.g. by defining
blake2b_compress_arch() which is analogous to blake2s_compress_arch()
and could be exported for use by the library API later if needed.

Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 arch/arm/crypto/Kconfig             |  10 +
 arch/arm/crypto/Makefile            |   2 +
 arch/arm/crypto/blake2b-neon-core.S | 347 ++++++++++++++++++++++++++++
 arch/arm/crypto/blake2b-neon-glue.c | 105 +++++++++
 4 files changed, 464 insertions(+)
 create mode 100644 arch/arm/crypto/blake2b-neon-core.S
 create mode 100644 arch/arm/crypto/blake2b-neon-glue.c

Message ID	20201223081003.373663-15-ebiggers@kernel.org (mailing list archive)
State	Accepted
Delegated to:	Herbert Xu
Date	Wed, 23 Dec 2020 00:10:03 -0800
From	Eric Biggers <ebiggers@kernel.org>
To	linux-crypto@vger.kernel.org
Cc	linux-arm-kernel@lists.infradead.org, Ard Biesheuvel <ardb@kernel.org>, Herbert Xu <herbert@gondor.apana.org.au>, David Sterba <dsterba@suse.com>, "Jason A . Donenfeld" <Jason@zx2c4.com>, Paul Crowley <paulcrowley@google.com>
In-Reply-To	<20201223081003.373663-1-ebiggers@kernel.org>
Series	crypto: arm32-optimized BLAKE2b and BLAKE2s
	[v3,00/14] crypto: arm32-optimized BLAKE2b and BLAKE2s
	[v3,01/14] crypto: blake2s - define shash_alg structs using macros
	[v3,02/14] crypto: x86/blake2s - define shash_alg structs using macros
	[v3,03/14] crypto: blake2s - remove unneeded includes
	[v3,04/14] crypto: blake2s - move update and final logic to internal/blake2s.h
	[v3,05/14] crypto: blake2s - share the "shash" API boilerplate code
	[v3,06/14] crypto: blake2s - optimize blake2s initialization
	[v3,07/14] crypto: blake2s - add comment for blake2s_state fields
	[v3,08/14] crypto: blake2s - adjust include guard naming
	[v3,09/14] crypto: blake2s - include <linux/bug.h> instead of <asm/bug.h>
	[v3,10/14] crypto: arm/blake2s - add ARM scalar optimized BLAKE2s
	[v3,11/14] wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM
	[v3,12/14] crypto: blake2b - sync with blake2s implementation
	[v3,13/14] crypto: blake2b - update file comment
	[v3,14/14] crypto: arm/blake2b - add NEON-accelerated BLAKE2b
