Message ID | 20231025183644.8735-13-jerry.shih@sifive.com (mailing list archive)
---|---
State | Changes Requested
Delegated to: | Herbert Xu
Series | RISC-V: provide some accelerated cryptography implementations using vector extensions
On Thu, Oct 26, 2023 at 02:36:44AM +0800, Jerry Shih wrote:
> +static struct skcipher_alg riscv64_chacha_alg_zvkb[] = { {
> +	.base = {
> +		.cra_name = "chacha20",
> +		.cra_driver_name = "chacha20-riscv64-zvkb",
> +		.cra_priority = 300,
> +		.cra_blocksize = 1,
> +		.cra_ctxsize = sizeof(struct chacha_ctx),
> +		.cra_module = THIS_MODULE,
> +	},
> +	.min_keysize = CHACHA_KEY_SIZE,
> +	.max_keysize = CHACHA_KEY_SIZE,
> +	.ivsize = CHACHA_IV_SIZE,
> +	.chunksize = CHACHA_BLOCK_SIZE,
> +	.walksize = CHACHA_BLOCK_SIZE * 4,
> +	.setkey = chacha20_setkey,
> +	.encrypt = chacha20_encrypt,
> +	.decrypt = chacha20_encrypt,
> +} };
> +
> +static inline bool check_chacha20_ext(void)
> +{
> +	return riscv_isa_extension_available(NULL, ZVKB) &&
> +	       riscv_vector_vlen() >= 128;
> +}
> +
> +static int __init riscv64_chacha_mod_init(void)
> +{
> +	if (check_chacha20_ext())
> +		return crypto_register_skciphers(
> +			riscv64_chacha_alg_zvkb,
> +			ARRAY_SIZE(riscv64_chacha_alg_zvkb));
> +
> +	return -ENODEV;
> +}
> +
> +static void __exit riscv64_chacha_mod_fini(void)
> +{
> +	if (check_chacha20_ext())
> +		crypto_unregister_skciphers(
> +			riscv64_chacha_alg_zvkb,
> +			ARRAY_SIZE(riscv64_chacha_alg_zvkb));
> +}

When there's just one algorithm being registered/unregistered,
crypto_register_skcipher() and crypto_unregister_skcipher() can be used.

> +# - RV64I
> +# - RISC-V Vector ('V') with VLEN >= 128
> +# - RISC-V Vector Cryptography Bit-manipulation extension ('Zvkb')
> +# - RISC-V Zicclsm(Main memory supports misaligned loads/stores)

How is the presence of the Zicclsm extension guaranteed?

- Eric
On Nov 2, 2023, at 13:43, Eric Biggers <ebiggers@kernel.org> wrote:
> On Thu, Oct 26, 2023 at 02:36:44AM +0800, Jerry Shih wrote:
>> +static struct skcipher_alg riscv64_chacha_alg_zvkb[] = { {
>> +	.base = {
>> +		.cra_name = "chacha20",
>> +		.cra_driver_name = "chacha20-riscv64-zvkb",
>> +		.cra_priority = 300,
>> +		.cra_blocksize = 1,
>> +		.cra_ctxsize = sizeof(struct chacha_ctx),
>> +		.cra_module = THIS_MODULE,
>> +	},
>> +	.min_keysize = CHACHA_KEY_SIZE,
>> +	.max_keysize = CHACHA_KEY_SIZE,
>> +	.ivsize = CHACHA_IV_SIZE,
>> +	.chunksize = CHACHA_BLOCK_SIZE,
>> +	.walksize = CHACHA_BLOCK_SIZE * 4,
>> +	.setkey = chacha20_setkey,
>> +	.encrypt = chacha20_encrypt,
>> +	.decrypt = chacha20_encrypt,
>> +} };
>> +
>> +static inline bool check_chacha20_ext(void)
>> +{
>> +	return riscv_isa_extension_available(NULL, ZVKB) &&
>> +	       riscv_vector_vlen() >= 128;
>> +}
>> +
>> +static int __init riscv64_chacha_mod_init(void)
>> +{
>> +	if (check_chacha20_ext())
>> +		return crypto_register_skciphers(
>> +			riscv64_chacha_alg_zvkb,
>> +			ARRAY_SIZE(riscv64_chacha_alg_zvkb));
>> +
>> +	return -ENODEV;
>> +}
>> +
>> +static void __exit riscv64_chacha_mod_fini(void)
>> +{
>> +	if (check_chacha20_ext())
>> +		crypto_unregister_skciphers(
>> +			riscv64_chacha_alg_zvkb,
>> +			ARRAY_SIZE(riscv64_chacha_alg_zvkb));
>> +}
>
> When there's just one algorithm being registered/unregistered,
> crypto_register_skcipher() and crypto_unregister_skcipher() can be used.

Fixed.

>> +# - RV64I
>> +# - RISC-V Vector ('V') with VLEN >= 128
>> +# - RISC-V Vector Cryptography Bit-manipulation extension ('Zvkb')
>> +# - RISC-V Zicclsm(Main memory supports misaligned loads/stores)
>
> How is the presence of the Zicclsm extension guaranteed?
>
> - Eric

I have added an extension parser for `Zicclsm` in the v2 patch set.

-Jerry
Hi Jerry!

On Mon, Nov 20, 2023 at 10:55:15AM +0800, Jerry Shih wrote:
> >> +# - RV64I
> >> +# - RISC-V Vector ('V') with VLEN >= 128
> >> +# - RISC-V Vector Cryptography Bit-manipulation extension ('Zvkb')
> >> +# - RISC-V Zicclsm(Main memory supports misaligned loads/stores)
> >
> > How is the presence of the Zicclsm extension guaranteed?
> >
> > - Eric
>
> I have the addition extension parser for `Zicclsm` in the v2 patch set.

First, I can see your updated patchset at branch
"dev/jerrys/vector-crypto-upstream-v2" of https://github.com/JerryShih/linux,
but I haven't seen it on the mailing list yet. Are you planning to send it out?

Second, with your updated patchset, I'm not seeing any of the RISC-V optimized
algorithms be registered when I boot the kernel in QEMU. This is caused by the
new check 'riscv_isa_extension_available(NULL, ZICCLSM)' not passing. Is
checking for "Zicclsm" the correct way to determine whether unaligned memory
accesses are supported?

I'm using 'qemu-system-riscv64 -cpu max -machine virt', with the very latest
QEMU commit (af9264da80073435), so it should have all the CPU features.

- Eric
On Nov 21, 2023, at 03:18, Eric Biggers <ebiggers@kernel.org> wrote:
> First, I can see your updated patchset at branch
> "dev/jerrys/vector-crypto-upstream-v2" of https://github.com/JerryShih/linux,
> but I haven't seen it on the mailing list yet. Are you planning to send it out?

I will send it out soon.

> Second, with your updated patchset, I'm not seeing any of the RISC-V optimized
> algorithms be registered when I boot the kernel in QEMU. This is caused by the
> new check 'riscv_isa_extension_available(NULL, ZICCLSM)' not passing. Is
> checking for "Zicclsm" the correct way to determine whether unaligned memory
> accesses are supported?
>
> I'm using 'qemu-system-riscv64 -cpu max -machine virt', with the very latest
> QEMU commit (af9264da80073435), so it should have all the CPU features.
>
> - Eric

Sorry, I was using my `internal` qemu with vector-crypto and rva22 patches.

The public qemu doesn't support rva22 profiles yet. Here is the qemu patch[1]
for that, along with the discussion of why qemu doesn't export these
`named extensions` (e.g. Zicclsm).
I tried to add Zicclsm to the DT in the v2 patch set. Maybe we will have more
discussion about the rva22 profiles in the kernel DT.

[1]
LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t

I don't know whether it's a good practice to check unaligned access using
`Zicclsm`.

Here is another related cpu feature for unaligned access:
RISCV_HWPROBE_MISALIGNED_*
But it looks like it is always initialized with `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
It implies that the linux kernel always supports unaligned access. But we have
actual HW which doesn't support unaligned access for the vector unit.

[2]
LINK: https://github.com/torvalds/linux/blob/98b1cc82c4affc16f5598d4fa14b1858671b2263/arch/riscv/kernel/cpufeature.c#L575

I will still use the `Zicclsm` check at this stage for reviewing, and I will
create a qemu branch with Zicclsm enabled for testing.

-Jerry
On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
> On Nov 21, 2023, at 03:18, Eric Biggers <ebiggers@kernel.org> wrote:
> > First, I can see your updated patchset at branch
> > "dev/jerrys/vector-crypto-upstream-v2" of https://github.com/JerryShih/linux,
> > but I haven't seen it on the mailing list yet. Are you planning to send it out?
>
> I will send it out soon.
>
> > Second, with your updated patchset, I'm not seeing any of the RISC-V optimized
> > algorithms be registered when I boot the kernel in QEMU. This is caused by the
> > new check 'riscv_isa_extension_available(NULL, ZICCLSM)' not passing. Is
> > checking for "Zicclsm" the correct way to determine whether unaligned memory
> > accesses are supported?
> >
> > I'm using 'qemu-system-riscv64 -cpu max -machine virt', with the very latest
> > QEMU commit (af9264da80073435), so it should have all the CPU features.
> >
> > - Eric
>
> Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
>
> The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
> that. But here is the discussion why the qemu doesn't export these
> `named extensions`(e.g. Zicclsm).
> I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
> about the rva22 profiles in kernel DT.

Please do, that'll be fun! Please take some time to read what the
profiles spec actually defines Zicclsm for before you send those patches
though. I think you might come to find you have misunderstood what it
means - certainly I did the first time I saw it!

> [1]
> LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t
>
> I don't know whether it's a good practice to check unaligned access using
> `Zicclsm`.
>
> Here is another related cpu feature for unaligned access:
> RISCV_HWPROBE_MISALIGNED_*
> But it looks like it always be initialized with `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
> It implies that linux kernel always supports unaligned access. But we have the
> actual HW which doesn't support unaligned access for vector unit.

https://docs.kernel.org/arch/riscv/uabi.html#misaligned-accesses

Misaligned accesses are part of the user ABI & the hwprobe stuff for
that allows userspace to figure out whether they're fast (likely
implemented in hardware), slow (likely emulated in firmware) or emulated
in the kernel.

Cheers,
Conor.

> [2]
> LINK: https://github.com/torvalds/linux/blob/98b1cc82c4affc16f5598d4fa14b1858671b2263/arch/riscv/kernel/cpufeature.c#L575
>
> I will still use `Zicclsm` checking in this stage for reviewing. And I will create qemu
> branch with Zicclsm enabled feature for testing.
>
> -Jerry
On Tue, Nov 21, 2023 at 01:14:47PM +0000, Conor Dooley wrote:
> On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
> > On Nov 21, 2023, at 03:18, Eric Biggers <ebiggers@kernel.org> wrote:
> > > First, I can see your updated patchset at branch
> > > "dev/jerrys/vector-crypto-upstream-v2" of https://github.com/JerryShih/linux,
> > > but I haven't seen it on the mailing list yet. Are you planning to send it out?
> >
> > I will send it out soon.
> >
> > > Second, with your updated patchset, I'm not seeing any of the RISC-V optimized
> > > algorithms be registered when I boot the kernel in QEMU. This is caused by the
> > > new check 'riscv_isa_extension_available(NULL, ZICCLSM)' not passing. Is
> > > checking for "Zicclsm" the correct way to determine whether unaligned memory
> > > accesses are supported?
> > >
> > > I'm using 'qemu-system-riscv64 -cpu max -machine virt', with the very latest
> > > QEMU commit (af9264da80073435), so it should have all the CPU features.
> > >
> > > - Eric
> >
> > Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
> >
> > The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
> > that. But here is the discussion why the qemu doesn't export these
> > `named extensions`(e.g. Zicclsm).
> > I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
> > about the rva22 profiles in kernel DT.
>
> Please do, that'll be fun! Please take some time to read what the
> profiles spec actually defines Zicclsm fore before you send those patches
> though. I think you might come to find you have misunderstood what it
> means - certainly I did the first time I saw it!
>
> > [1]
> > LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t
> >
> > I don't know whether it's a good practice to check unaligned access using
> > `Zicclsm`.
> >
> > Here is another related cpu feature for unaligned access:
> > RISCV_HWPROBE_MISALIGNED_*
> > But it looks like it always be initialized with `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
> > It implies that linux kernel always supports unaligned access. But we have the
> > actual HW which doesn't support unaligned access for vector unit.
>
> https://docs.kernel.org/arch/riscv/uabi.html#misaligned-accesses
>
> Misaligned accesses are part of the user ABI & the hwprobe stuff for
> that allows userspace to figure out whether they're fast (likely
> implemented in hardware), slow (likely emulated in firmware) or emulated
> in the kernel.
>
> Cheers,
> Conor.
>
> > [2]
> > LINK: https://github.com/torvalds/linux/blob/98b1cc82c4affc16f5598d4fa14b1858671b2263/arch/riscv/kernel/cpufeature.c#L575
> >
> > I will still use `Zicclsm` checking in this stage for reviewing. And I will create qemu
> > branch with Zicclsm enabled feature for testing.

According to https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc,
Zicclsm means that "main memory supports misaligned loads/stores", but they
"might execute extremely slowly."

In general, the vector crypto routines that Jerry is adding assume that
misaligned vector loads/stores are supported *and* are fast. I think the kernel
mustn't register those algorithms if that isn't the case. Zicclsm sounds like
the wrong thing to check. Maybe RISCV_HWPROBE_MISALIGNED_FAST is the right
thing to check?

BTW, something else I was wondering about is endianness. Most of the vector
crypto routines also assume little endian byte order, but I don't see that
being explicitly checked for anywhere. Should it be?

- Eric
On Tue, Nov 21, 2023 at 03:37:43PM -0800, Eric Biggers wrote:
> On Tue, Nov 21, 2023 at 01:14:47PM +0000, Conor Dooley wrote:
> > On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
> > > On Nov 21, 2023, at 03:18, Eric Biggers <ebiggers@kernel.org> wrote:
> > > > First, I can see your updated patchset at branch
> > > > "dev/jerrys/vector-crypto-upstream-v2" of https://github.com/JerryShih/linux,
> > > > but I haven't seen it on the mailing list yet. Are you planning to send it out?
> > >
> > > I will send it out soon.
> > >
> > > > Second, with your updated patchset, I'm not seeing any of the RISC-V optimized
> > > > algorithms be registered when I boot the kernel in QEMU. This is caused by the
> > > > new check 'riscv_isa_extension_available(NULL, ZICCLSM)' not passing. Is
> > > > checking for "Zicclsm" the correct way to determine whether unaligned memory
> > > > accesses are supported?
> > > >
> > > > I'm using 'qemu-system-riscv64 -cpu max -machine virt', with the very latest
> > > > QEMU commit (af9264da80073435), so it should have all the CPU features.
> > > >
> > > > - Eric
> > >
> > > Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
> > >
> > > The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
> > > that. But here is the discussion why the qemu doesn't export these
> > > `named extensions`(e.g. Zicclsm).
> > > I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
> > > about the rva22 profiles in kernel DT.
> >
> > Please do, that'll be fun! Please take some time to read what the
> > profiles spec actually defines Zicclsm fore before you send those patches
> > though. I think you might come to find you have misunderstood what it
> > means - certainly I did the first time I saw it!
>
> > > [1]
> > > LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t
> > >
> > > I don't know whether it's a good practice to check unaligned access using
> > > `Zicclsm`.
> > >
> > > Here is another related cpu feature for unaligned access:
> > > RISCV_HWPROBE_MISALIGNED_*
> > > But it looks like it always be initialized with `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
> > > It implies that linux kernel always supports unaligned access. But we have the
> > > actual HW which doesn't support unaligned access for vector unit.
> >
> > https://docs.kernel.org/arch/riscv/uabi.html#misaligned-accesses
> >
> > Misaligned accesses are part of the user ABI & the hwprobe stuff for
> > that allows userspace to figure out whether they're fast (likely
> > implemented in hardware), slow (likely emulated in firmware) or emulated
> > in the kernel.
>
> > > [2]
> > > LINK: https://github.com/torvalds/linux/blob/98b1cc82c4affc16f5598d4fa14b1858671b2263/arch/riscv/kernel/cpufeature.c#L575
> > >
> > > I will still use `Zicclsm` checking in this stage for reviewing. And I will create qemu
> > > branch with Zicclsm enabled feature for testing.
>
> According to https://github.com/riscv/riscv-profiles/blob/main/profiles.adoc,
> Zicclsm means that "main memory supports misaligned loads/stores", but they
> "might execute extremely slowly."

Check the section it is defined in - it is only defined for the RVA22U64
profile which describes "features available to user-mode execution
environments". It otherwise has no meaning, so it is not suitable for
detecting anything from within the kernel.

For other operating systems it might actually mean something, but for
Linux the uABI on RISC-V unconditionally provides what Zicclsm is
intended to convey:
https://www.kernel.org/doc/html/next/riscv/uabi.html#misaligned-accesses
We could (_perhaps_) set it in /proc/cpuinfo in riscv,isa there - but a
conversation would have to be had about what these non-extension
"features" actually are & whether it makes sense to put them there.

> In general, the vector crypto routines that Jerry is adding assume that
> misaligned vector loads/stores are supported *and* are fast. I think the kernel
> mustn't register those algorithms if that isn't the case. Zicclsm sounds like
> the wrong thing to check. Maybe RISCV_HWPROBE_MISALIGNED_FAST is the right
> thing to check?

It actually means something, so it is certainly better ;) I think
checking it makes sense as a good surrogate for actually knowing whether
or not the hardware supports misaligned access.

> BTW, something else I was wondering about is endianness. Most of the vector
> crypto routines also assume little endian byte order, but I don't see that being
> explicitly checked for anywhere. Should it be?

The RISC-V kernel only supports LE at the moment. I hope that doesn't
change tbh.

Cheers,
Conor.
On Thu, Oct 26, 2023 at 02:36:44AM +0800, Jerry Shih wrote:
> diff --git a/arch/riscv/crypto/chacha-riscv64-glue.c b/arch/riscv/crypto/chacha-riscv64-glue.c
> new file mode 100644
> index 000000000000..72011949f705
> --- /dev/null
> +++ b/arch/riscv/crypto/chacha-riscv64-glue.c
> @@ -0,0 +1,120 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Port of the OpenSSL ChaCha20 implementation for RISC-V 64
> + *
> + * Copyright (C) 2023 SiFive, Inc.
> + * Author: Jerry Shih <jerry.shih@sifive.com>
> + */
> +
> +#include <asm/simd.h>
> +#include <asm/vector.h>
> +#include <crypto/internal/chacha.h>
> +#include <crypto/internal/simd.h>
> +#include <crypto/internal/skcipher.h>
> +#include <linux/crypto.h>
> +#include <linux/module.h>
> +#include <linux/types.h>
> +
> +#define CHACHA_BLOCK_VALID_SIZE_MASK (~(CHACHA_BLOCK_SIZE - 1))
> +#define CHACHA_BLOCK_REMAINING_SIZE_MASK (CHACHA_BLOCK_SIZE - 1)
> +#define CHACHA_KEY_OFFSET 4
> +#define CHACHA_IV_OFFSET 12
> +
> +/* chacha20 using zvkb vector crypto extension */
> +void ChaCha20_ctr32_zvkb(u8 *out, const u8 *input, size_t len, const u32 *key,
> +			 const u32 *counter);
> +
> +static int chacha20_encrypt(struct skcipher_request *req)
> +{
> +	u32 state[CHACHA_STATE_WORDS];

This function doesn't need to create the whole state matrix on the stack, since
the underlying assembly function takes as input the key and counter, not the
state matrix.
I recommend something like the following:

diff --git a/arch/riscv/crypto/chacha-riscv64-glue.c b/arch/riscv/crypto/chacha-riscv64-glue.c
index df185d0663fcc..216b4cd9d1e01 100644
--- a/arch/riscv/crypto/chacha-riscv64-glue.c
+++ b/arch/riscv/crypto/chacha-riscv64-glue.c
@@ -16,45 +16,42 @@
 #include <linux/module.h>
 #include <linux/types.h>
 
-#define CHACHA_KEY_OFFSET 4
-#define CHACHA_IV_OFFSET 12
-
 /* chacha20 using zvkb vector crypto extension */
 asmlinkage void ChaCha20_ctr32_zvkb(u8 *out, const u8 *input, size_t len,
 				    const u32 *key, const u32 *counter);
 
 static int chacha20_encrypt(struct skcipher_request *req)
 {
-	u32 state[CHACHA_STATE_WORDS];
 	u8 block_buffer[CHACHA_BLOCK_SIZE];
 	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
 	const struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
 	struct skcipher_walk walk;
 	unsigned int nbytes;
 	unsigned int tail_bytes;
+	u32 iv[4];
 	int err;
 
-	chacha_init_generic(state, ctx->key, req->iv);
+	iv[0] = get_unaligned_le32(req->iv);
+	iv[1] = get_unaligned_le32(req->iv + 4);
+	iv[2] = get_unaligned_le32(req->iv + 8);
+	iv[3] = get_unaligned_le32(req->iv + 12);
 
 	err = skcipher_walk_virt(&walk, req, false);
 	while (walk.nbytes) {
-		nbytes = walk.nbytes & (~(CHACHA_BLOCK_SIZE - 1));
+		nbytes = walk.nbytes & ~(CHACHA_BLOCK_SIZE - 1);
 		tail_bytes = walk.nbytes & (CHACHA_BLOCK_SIZE - 1);
 		kernel_vector_begin();
 		if (nbytes) {
 			ChaCha20_ctr32_zvkb(walk.dst.virt.addr,
 					    walk.src.virt.addr, nbytes,
-					    state + CHACHA_KEY_OFFSET,
-					    state + CHACHA_IV_OFFSET);
-			state[CHACHA_IV_OFFSET] += nbytes / CHACHA_BLOCK_SIZE;
+					    ctx->key, iv);
+			iv[0] += nbytes / CHACHA_BLOCK_SIZE;
 		}
 		if (walk.nbytes == walk.total && tail_bytes > 0) {
 			memcpy(block_buffer, walk.src.virt.addr + nbytes,
 			       tail_bytes);
 			ChaCha20_ctr32_zvkb(block_buffer, block_buffer,
-					    CHACHA_BLOCK_SIZE,
-					    state + CHACHA_KEY_OFFSET,
-					    state + CHACHA_IV_OFFSET);
+					    CHACHA_BLOCK_SIZE, ctx->key, iv);
 			memcpy(walk.dst.virt.addr + nbytes, block_buffer,
 			       tail_bytes);
 			tail_bytes = 0;
On Nov 21, 2023, at 21:14, Conor Dooley <conor.dooley@microchip.com> wrote:
> On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
>> Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
>>
>> The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
>> that. But here is the discussion why the qemu doesn't export these
>> `named extensions`(e.g. Zicclsm).
>> I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
>> about the rva22 profiles in kernel DT.
>
> Please do, that'll be fun! Please take some time to read what the
> profiles spec actually defines Zicclsm fore before you send those patches
> though. I think you might come to find you have misunderstood what it
> means - certainly I did the first time I saw it!

From the rva22 profile:

    This requires misaligned support for all regular load and store
    instructions (including scalar and ``vector``)

The spec includes the explicit `vector` keyword. So, I still think we could use
Zicclsm checking for these vector-crypto implementations.

My proposed patch is just a simple patch which only updates the DT document and
the isa string parser for Zicclsm. If it's still not recommended to use Zicclsm
checking, I will turn to `RISCV_HWPROBE_MISALIGNED_*` instead.

>> [1]
>> LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t
>>
>> I don't know whether it's a good practice to check unaligned access using
>> `Zicclsm`.
>>
>> Here is another related cpu feature for unaligned access:
>> RISCV_HWPROBE_MISALIGNED_*
>> But it looks like it always be initialized with `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
>> It implies that linux kernel always supports unaligned access. But we have the
>> actual HW which doesn't support unaligned access for vector unit.
>
> https://docs.kernel.org/arch/riscv/uabi.html#misaligned-accesses
>
> Misaligned accesses are part of the user ABI & the hwprobe stuff for
> that allows userspace to figure out whether they're fast (likely
> implemented in hardware), slow (likely emulated in firmware) or emulated
> in the kernel.

The HWPROBE_MISALIGNED_* checking function is at:
https://github.com/torvalds/linux/blob/c2d5304e6c648ebcf653bace7e51e0e6742e46c8/arch/riscv/kernel/cpufeature.c#L564-L647
The tests are all scalar. No `vector` test inside. So, I'm not sure whether
HWPROBE_MISALIGNED_* covers the vector unit or not.

The goal in this crypto patch is to check whether `vector` supports unaligned
access or not.

I haven't seen the emulated path for unaligned-vector-access in OpenSBI or the
kernel. Is unaligned-vector-access included in the user ABI?

Thanks,
Jerry
On Wed, 22 Nov 2023 09:37:33 PST (-0800), jerry.shih@sifive.com wrote:
> On Nov 21, 2023, at 21:14, Conor Dooley <conor.dooley@microchip.com> wrote:
>> On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
>>> Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
>>>
>>> The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
>>> that. But here is the discussion why the qemu doesn't export these
>>> `named extensions`(e.g. Zicclsm).
>>> I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
>>> about the rva22 profiles in kernel DT.
>>
>> Please do, that'll be fun! Please take some time to read what the
>> profiles spec actually defines Zicclsm fore before you send those patches
>> though. I think you might come to find you have misunderstood what it
>> means - certainly I did the first time I saw it!
>
> From the rva22 profile:
>     This requires misaligned support for all regular load and store
>     instructions (including scalar and ``vector``)
>
> The spec includes the explicit `vector` keyword.
> So, I still think we could use Zicclsm checking for these vector-crypto implementations.
>
> My proposed patch is just a simple patch which only update the DT document and
> update the isa string parser for Zicclsm. If it's still not recommend to use Zicclsm
> checking, I will turn to use `RISCV_HWPROBE_MISALIGNED_*` instead.

IMO that's the way to go: even if these are required to be supported by
Zicclsm, we still need to deal with the performance implications.

>>> [1]
>>> LINK: https://lore.kernel.org/all/d1d6f2dc-55b2-4dce-a48a-4afbbf6df526@ventanamicro.com/#t
>>>
>>> I don't know whether it's a good practice to check unaligned access using
>>> `Zicclsm`.
>>>
>>> Here is another related cpu feature for unaligned access:
>>> RISCV_HWPROBE_MISALIGNED_*
>>> But it looks like it always be initialized with `RISCV_HWPROBE_MISALIGNED_SLOW`[2].
>>> It implies that linux kernel always supports unaligned access. But we have the
>>> actual HW which doesn't support unaligned access for vector unit.
>>
>> https://docs.kernel.org/arch/riscv/uabi.html#misaligned-accesses
>>
>> Misaligned accesses are part of the user ABI & the hwprobe stuff for
>> that allows userspace to figure out whether they're fast (likely
>> implemented in hardware), slow (likely emulated in firmware) or emulated
>> in the kernel.
>
> The HWPROBE_MISALIGNED_* checking function is at:
> https://github.com/torvalds/linux/blob/c2d5304e6c648ebcf653bace7e51e0e6742e46c8/arch/riscv/kernel/cpufeature.c#L564-L647
> The tests are all scalar. No `vector` test inside. So, I'm not sure the
> HWPROBE_MISALIGNED_* is related to vector unit or not.
>
> The goal is to check whether `vector` support unaligned access or not
> in this crypto patch.
>
> I haven't seen the emulated path for unaligned-vector-access in OpenSBI
> and kernel. Is the unaligned-vector-access included in user ABI?

I guess it's kind of a grey area, but I'd argue that it is: we merged support
for V when the only implementation (ie, QEMU) supported misaligned accesses, so
we're stuck with that being the de facto behavior. As part of adding support
for the K230 we'll need to then add the kernel-mode vector misaligned access
handlers, but that doesn't seem so hard.

So I'd say we should update the hwprobe docs to say that key only reflects
scalar accesses (or maybe even just integer accesses? that's all we're testing
for) -- essentially just make the documentation match the implementation, as
that'll keep ABI compatibility. Then we can add a new key for vector misaligned
access performance.
On Thu, Nov 23, 2023 at 01:37:33AM +0800, Jerry Shih wrote:
> On Nov 21, 2023, at 21:14, Conor Dooley <conor.dooley@microchip.com> wrote:
> > On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
> >> Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
> >>
> >> The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
> >> that. But here is the discussion why the qemu doesn't export these
> >> `named extensions`(e.g. Zicclsm).
> >> I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
> >> about the rva22 profiles in kernel DT.
> >
> > Please do, that'll be fun! Please take some time to read what the
> > profiles spec actually defines Zicclsm fore before you send those patches
> > though. I think you might come to find you have misunderstood what it
> > means - certainly I did the first time I saw it!
>
> From the rva22 profile:

"rva22" is not a profile. As I pointed out to Eric, this is defined in
the RVA22U64 profile (and the RVA20U64 one, but that is effectively a
moot point). The profile descriptions for these only specify "the ISA
features available to user-mode execution environments", so it is not
suitable for use in any other context.

> This requires misaligned support for all regular load and store instructions (including
> scalar and ``vector``)
>
> The spec includes the explicit `vector` keyword.
> So, I still think we could use Zicclsm checking for these vector-crypto implementations.

In userspace, if Zicclsm was exported somewhere, that would be a valid
argument. Even for userspace, the hwprobe flags probably provide more
information though, since the firmware emulation is insanely slow.

> My proposed patch is just a simple patch which only update the DT document and
> update the isa string parser for Zicclsm.

Zicclsm has no meaning outside of user mode, so it's not suitable for
use in that context. Other "features" defined in the profiles spec might
be suitable for inclusion, but it'll be a case-by-case basis.

> If it's still not recommend to use Zicclsm
> checking, I will turn to use `RISCV_HWPROBE_MISALIGNED_*` instead.

Palmer has commented on the rest, so no need for me :)
On Nov 23, 2023, at 02:20, Conor Dooley <conor@kernel.org> wrote:
> On Thu, Nov 23, 2023 at 01:37:33AM +0800, Jerry Shih wrote:
>> On Nov 21, 2023, at 21:14, Conor Dooley <conor.dooley@microchip.com> wrote:
>>> On Tue, Nov 21, 2023 at 06:55:07PM +0800, Jerry Shih wrote:
>>>> Sorry, I just use my `internal` qemu with vector-crypto and rva22 patches.
>>>>
>>>> The public qemu haven't supported rva22 profiles. Here is the qemu patch[1] for
>>>> that. But here is the discussion why the qemu doesn't export these
>>>> `named extensions`(e.g. Zicclsm).
>>>> I try to add Zicclsm in DT in the v2 patch set. Maybe we will have more discussion
>>>> about the rva22 profiles in kernel DT.
>>>
>>> Please do, that'll be fun! Please take some time to read what the
>>> profiles spec actually defines Zicclsm fore before you send those patches
>>> though. I think you might come to find you have misunderstood what it
>>> means - certainly I did the first time I saw it!
>>
>> From the rva22 profile:
>
> "rva22" is not a profile. As I pointed out to Eric, this is defined in
> the RVA22U64 profile (and the RVA20U64 one, but that is effectively a
> moot point). The profile descriptions for these only specify "the ISA
> features available to user-mode execution environments", so it is not
> suitable for use in any other context.

I missed that important part: it's for user space. Thx.

>> This requires misaligned support for all regular load and store instructions (including
>> scalar and ``vector``)
>>
>> The spec includes the explicit `vector` keyword.
>> So, I still think we could use Zicclsm checking for these vector-crypto implementations.
>
> In userspace, if Zicclsm was exported somewhere, that would be a valid
> argument. Even for userspace, the hwprobe flags probably provide more
> information though, since the firmware emulation is insanely slow.

I agree. It would be more useful to have a flag like `VECTOR_MISALIGNED_FAST`
instead.

>> My proposed patch is just a simple patch which only update the DT document and
>> update the isa string parser for Zicclsm.
>
> Zicclsm has no meaning outside of user mode, so it's not suitable for
> use in that context. Other "features" defined in the profiles spec might
> be suitable for inclusion, but it'll be a case-by-case basis.

I will skip the Zicclsm part in my v2 patch.

>> If it's still not recommend to use Zicclsm
>> checking, I will turn to use `RISCV_HWPROBE_MISALIGNED_*` instead.
>
> Palmer has commented on the rest, so no need for me :)

All crypto algorithms will assume that the vector unit supports misaligned
access in the next v2 patch. And the algorithms will also not check for
`RISCV_HWPROBE_MISALIGNED_*` since it's related to scalar accesses. Once we
have a vector-performance-related flag, we can come back and use it here.

-Jerry
On Nov 22, 2023, at 09:29, Eric Biggers <ebiggers@kernel.org> wrote:
> On Thu, Oct 26, 2023 at 02:36:44AM +0800, Jerry Shih wrote:
>> diff --git a/arch/riscv/crypto/chacha-riscv64-glue.c b/arch/riscv/crypto/chacha-riscv64-glue.c
>> new file mode 100644
>> index 000000000000..72011949f705
>> --- /dev/null
>> +++ b/arch/riscv/crypto/chacha-riscv64-glue.c
>> @@ -0,0 +1,120 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Port of the OpenSSL ChaCha20 implementation for RISC-V 64
>> + *
>> + * Copyright (C) 2023 SiFive, Inc.
>> + * Author: Jerry Shih <jerry.shih@sifive.com>
>> + */
>> +
>> +#include <asm/simd.h>
>> +#include <asm/vector.h>
>> +#include <crypto/internal/chacha.h>
>> +#include <crypto/internal/simd.h>
>> +#include <crypto/internal/skcipher.h>
>> +#include <linux/crypto.h>
>> +#include <linux/module.h>
>> +#include <linux/types.h>
>> +
>> +#define CHACHA_BLOCK_VALID_SIZE_MASK (~(CHACHA_BLOCK_SIZE - 1))
>> +#define CHACHA_BLOCK_REMAINING_SIZE_MASK (CHACHA_BLOCK_SIZE - 1)
>> +#define CHACHA_KEY_OFFSET 4
>> +#define CHACHA_IV_OFFSET 12
>> +
>> +/* chacha20 using zvkb vector crypto extension */
>> +void ChaCha20_ctr32_zvkb(u8 *out, const u8 *input, size_t len, const u32 *key,
>> +                         const u32 *counter);
>> +
>> +static int chacha20_encrypt(struct skcipher_request *req)
>> +{
>> +        u32 state[CHACHA_STATE_WORDS];
>
> This function doesn't need to create the whole state matrix on the stack, since
> the underlying assembly function takes as input the key and counter, not the
> state matrix.
I recommend something like the following:
>
> diff --git a/arch/riscv/crypto/chacha-riscv64-glue.c b/arch/riscv/crypto/chacha-riscv64-glue.c
> index df185d0663fcc..216b4cd9d1e01 100644
> --- a/arch/riscv/crypto/chacha-riscv64-glue.c
> +++ b/arch/riscv/crypto/chacha-riscv64-glue.c
> @@ -16,45 +16,42 @@
>  #include <linux/module.h>
>  #include <linux/types.h>
>
> -#define CHACHA_KEY_OFFSET 4
> -#define CHACHA_IV_OFFSET 12
> -
>  /* chacha20 using zvkb vector crypto extension */
>  asmlinkage void ChaCha20_ctr32_zvkb(u8 *out, const u8 *input, size_t len,
>                                      const u32 *key, const u32 *counter);
>
>  static int chacha20_encrypt(struct skcipher_request *req)
>  {
> -        u32 state[CHACHA_STATE_WORDS];
>          u8 block_buffer[CHACHA_BLOCK_SIZE];
>          struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
>          const struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
>          struct skcipher_walk walk;
>          unsigned int nbytes;
>          unsigned int tail_bytes;
> +        u32 iv[4];
>          int err;
>
> -        chacha_init_generic(state, ctx->key, req->iv);
> +        iv[0] = get_unaligned_le32(req->iv);
> +        iv[1] = get_unaligned_le32(req->iv + 4);
> +        iv[2] = get_unaligned_le32(req->iv + 8);
> +        iv[3] = get_unaligned_le32(req->iv + 12);
>
>          err = skcipher_walk_virt(&walk, req, false);
>          while (walk.nbytes) {
> -                nbytes = walk.nbytes & (~(CHACHA_BLOCK_SIZE - 1));
> +                nbytes = walk.nbytes & ~(CHACHA_BLOCK_SIZE - 1);
>                  tail_bytes = walk.nbytes & (CHACHA_BLOCK_SIZE - 1);
>                  kernel_vector_begin();
>                  if (nbytes) {
>                          ChaCha20_ctr32_zvkb(walk.dst.virt.addr,
>                                              walk.src.virt.addr, nbytes,
> -                                            state + CHACHA_KEY_OFFSET,
> -                                            state + CHACHA_IV_OFFSET);
> -                        state[CHACHA_IV_OFFSET] += nbytes / CHACHA_BLOCK_SIZE;
> +                                            ctx->key, iv);
> +                        iv[0] += nbytes / CHACHA_BLOCK_SIZE;
>                  }
>                  if (walk.nbytes == walk.total && tail_bytes > 0) {
>                          memcpy(block_buffer, walk.src.virt.addr + nbytes,
>                                 tail_bytes);
>                          ChaCha20_ctr32_zvkb(block_buffer, block_buffer,
> -                                            CHACHA_BLOCK_SIZE,
> -                                            state + CHACHA_KEY_OFFSET,
> -                                            state + CHACHA_IV_OFFSET);
> +                                            CHACHA_BLOCK_SIZE, ctx->key, iv);
>                          memcpy(walk.dst.virt.addr + nbytes, block_buffer,
>                                 tail_bytes);
>                          tail_bytes = 0;

Fixed. We will only use the iv instead of the full chacha state matrix.
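Eric's rework relies on the fact that the 16-word ChaCha20 state matrix is fully determined by fixed constants, the key, and the 4-word counter/nonce block, so the assembly routine only needs the raw key and iv. A minimal Python sketch of that layout (the helper name `chacha20_init_state` is ours, purely for illustration):

```python
import struct

# The "expand 32-byte k" constants always occupy state words 0-3.
CONSTANTS = (0x61707865, 0x3320646E, 0x79622D32, 0x6B206574)

def chacha20_init_state(key: bytes, iv: bytes) -> list:
    """Build the 16-word ChaCha20 state: constants | key | counter+nonce."""
    assert len(key) == 32 and len(iv) == 16
    key_words = struct.unpack("<8L", key)  # little-endian words 4-11
    iv_words = struct.unpack("<4L", iv)    # words 12-15: 32-bit counter + nonce
    return list(CONSTANTS) + list(key_words) + list(iv_words)

# Only words 12-15 vary between calls for a given key, so the caller can keep
# just iv[4] on the stack and bump iv[0] by the number of blocks processed,
# instead of materializing the whole matrix.
state = chacha20_init_state(bytes(range(32)), bytes(16))
```

This also shows why `iv[0] += nbytes / CHACHA_BLOCK_SIZE` in the suggested diff is equivalent to the old `state[CHACHA_IV_OFFSET] += ...`: word 12 of the matrix is exactly `iv[0]`.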
diff --git a/arch/riscv/crypto/Kconfig b/arch/riscv/crypto/Kconfig
index 2797b37394bb..41ce453afafa 100644
--- a/arch/riscv/crypto/Kconfig
+++ b/arch/riscv/crypto/Kconfig
@@ -35,6 +35,18 @@ config CRYPTO_AES_BLOCK_RISCV64
 	  - Zvkg vector crypto extension (XTS)
 	  - Zvkned vector crypto extension
 
+config CRYPTO_CHACHA20_RISCV64
+	default y if RISCV_ISA_V
+	tristate "Ciphers: ChaCha20"
+	depends on 64BIT && RISCV_ISA_V
+	select CRYPTO_SKCIPHER
+	select CRYPTO_LIB_CHACHA_GENERIC
+	help
+	  Length-preserving ciphers: ChaCha20 stream cipher algorithm
+
+	  Architecture: riscv64 using:
+	  - Zvkb vector crypto extension
+
 config CRYPTO_GHASH_RISCV64
 	default y if RISCV_ISA_V
 	tristate "Hash functions: GHASH"
diff --git a/arch/riscv/crypto/Makefile b/arch/riscv/crypto/Makefile
index b772417703fd..80b0ebc956a3 100644
--- a/arch/riscv/crypto/Makefile
+++ b/arch/riscv/crypto/Makefile
@@ -9,6 +9,9 @@ aes-riscv64-y := aes-riscv64-glue.o aes-riscv64-zvkned.o
 obj-$(CONFIG_CRYPTO_AES_BLOCK_RISCV64) += aes-block-riscv64.o
 aes-block-riscv64-y := aes-riscv64-block-mode-glue.o aes-riscv64-zvbb-zvkg-zvkned.o aes-riscv64-zvkb-zvkned.o
 
+obj-$(CONFIG_CRYPTO_CHACHA20_RISCV64) += chacha-riscv64.o
+chacha-riscv64-y := chacha-riscv64-glue.o chacha-riscv64-zvkb.o
+
 obj-$(CONFIG_CRYPTO_GHASH_RISCV64) += ghash-riscv64.o
 ghash-riscv64-y := ghash-riscv64-glue.o ghash-riscv64-zvkg.o
 
@@ -36,6 +39,9 @@ $(obj)/aes-riscv64-zvbb-zvkg-zvkned.S: $(src)/aes-riscv64-zvbb-zvkg-zvkned.pl
 $(obj)/aes-riscv64-zvkb-zvkned.S: $(src)/aes-riscv64-zvkb-zvkned.pl
 	$(call cmd,perlasm)
 
+$(obj)/chacha-riscv64-zvkb.S: $(src)/chacha-riscv64-zvkb.pl
+	$(call cmd,perlasm)
+
 $(obj)/ghash-riscv64-zvkg.S: $(src)/ghash-riscv64-zvkg.pl
 	$(call cmd,perlasm)
 
@@ -54,6 +60,7 @@ $(obj)/sm4-riscv64-zvksed.S: $(src)/sm4-riscv64-zvksed.pl
 clean-files += aes-riscv64-zvkned.S
 clean-files += aes-riscv64-zvbb-zvkg-zvkned.S
 clean-files += aes-riscv64-zvkb-zvkned.S
+clean-files += chacha-riscv64-zvkb.S
 clean-files += ghash-riscv64-zvkg.S
 clean-files += sha256-riscv64-zvkb-zvknha_or_zvknhb.S
 clean-files += sha512-riscv64-zvkb-zvknhb.S
diff --git a/arch/riscv/crypto/chacha-riscv64-glue.c b/arch/riscv/crypto/chacha-riscv64-glue.c
new file mode 100644
index 000000000000..72011949f705
--- /dev/null
+++ b/arch/riscv/crypto/chacha-riscv64-glue.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Port of the OpenSSL ChaCha20 implementation for RISC-V 64
+ *
+ * Copyright (C) 2023 SiFive, Inc.
+ * Author: Jerry Shih <jerry.shih@sifive.com>
+ */
+
+#include <asm/simd.h>
+#include <asm/vector.h>
+#include <crypto/internal/chacha.h>
+#include <crypto/internal/simd.h>
+#include <crypto/internal/skcipher.h>
+#include <linux/crypto.h>
+#include <linux/module.h>
+#include <linux/types.h>
+
+#define CHACHA_BLOCK_VALID_SIZE_MASK (~(CHACHA_BLOCK_SIZE - 1))
+#define CHACHA_BLOCK_REMAINING_SIZE_MASK (CHACHA_BLOCK_SIZE - 1)
+#define CHACHA_KEY_OFFSET 4
+#define CHACHA_IV_OFFSET 12
+
+/* chacha20 using zvkb vector crypto extension */
+void ChaCha20_ctr32_zvkb(u8 *out, const u8 *input, size_t len, const u32 *key,
+			 const u32 *counter);
+
+static int chacha20_encrypt(struct skcipher_request *req)
+{
+	u32 state[CHACHA_STATE_WORDS];
+	u8 block_buffer[CHACHA_BLOCK_SIZE];
+	struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
+	const struct chacha_ctx *ctx = crypto_skcipher_ctx(tfm);
+	struct skcipher_walk walk;
+	unsigned int nbytes;
+	unsigned int tail_bytes;
+	int err;
+
+	chacha_init_generic(state, ctx->key, req->iv);
+
+	err = skcipher_walk_virt(&walk, req, false);
+	while (walk.nbytes) {
+		nbytes = walk.nbytes & CHACHA_BLOCK_VALID_SIZE_MASK;
+		tail_bytes = walk.nbytes & CHACHA_BLOCK_REMAINING_SIZE_MASK;
+		kernel_vector_begin();
+		if (nbytes) {
+			ChaCha20_ctr32_zvkb(walk.dst.virt.addr,
+					    walk.src.virt.addr, nbytes,
+					    state + CHACHA_KEY_OFFSET,
+					    state + CHACHA_IV_OFFSET);
+			state[CHACHA_IV_OFFSET] += nbytes / CHACHA_BLOCK_SIZE;
+		}
+		if (walk.nbytes == walk.total && tail_bytes > 0) {
+			memcpy(block_buffer, walk.src.virt.addr + nbytes,
+			       tail_bytes);
+			ChaCha20_ctr32_zvkb(block_buffer, block_buffer,
+					    CHACHA_BLOCK_SIZE,
+					    state + CHACHA_KEY_OFFSET,
+					    state + CHACHA_IV_OFFSET);
+			memcpy(walk.dst.virt.addr + nbytes, block_buffer,
+			       tail_bytes);
+			tail_bytes = 0;
+		}
+		kernel_vector_end();
+
+		err = skcipher_walk_done(&walk, tail_bytes);
+	}
+
+	return err;
+}
+
+static struct skcipher_alg riscv64_chacha_alg_zvkb[] = { {
+	.base = {
+		.cra_name = "chacha20",
+		.cra_driver_name = "chacha20-riscv64-zvkb",
+		.cra_priority = 300,
+		.cra_blocksize = 1,
+		.cra_ctxsize = sizeof(struct chacha_ctx),
+		.cra_module = THIS_MODULE,
+	},
+	.min_keysize = CHACHA_KEY_SIZE,
+	.max_keysize = CHACHA_KEY_SIZE,
+	.ivsize = CHACHA_IV_SIZE,
+	.chunksize = CHACHA_BLOCK_SIZE,
+	.walksize = CHACHA_BLOCK_SIZE * 4,
+	.setkey = chacha20_setkey,
+	.encrypt = chacha20_encrypt,
+	.decrypt = chacha20_encrypt,
+} };
+
+static inline bool check_chacha20_ext(void)
+{
+	return riscv_isa_extension_available(NULL, ZVKB) &&
+	       riscv_vector_vlen() >= 128;
+}
+
+static int __init riscv64_chacha_mod_init(void)
+{
+	if (check_chacha20_ext())
+		return crypto_register_skciphers(
+			riscv64_chacha_alg_zvkb,
+			ARRAY_SIZE(riscv64_chacha_alg_zvkb));
+
+	return -ENODEV;
+}
+
+static void __exit riscv64_chacha_mod_fini(void)
+{
+	if (check_chacha20_ext())
+		crypto_unregister_skciphers(
+			riscv64_chacha_alg_zvkb,
+			ARRAY_SIZE(riscv64_chacha_alg_zvkb));
+}
+
+module_init(riscv64_chacha_mod_init);
+module_exit(riscv64_chacha_mod_fini);
+
+MODULE_DESCRIPTION("ChaCha20 (RISC-V accelerated)");
+MODULE_AUTHOR("Jerry Shih <jerry.shih@sifive.com>");
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_CRYPTO("chacha20");
diff --git a/arch/riscv/crypto/chacha-riscv64-zvkb.pl b/arch/riscv/crypto/chacha-riscv64-zvkb.pl
new file mode 100644
index 000000000000..9caf7b247804
--- /dev/null
+++ b/arch/riscv/crypto/chacha-riscv64-zvkb.pl
@@ -0,0 +1,322 @@
+#! /usr/bin/env perl
+# SPDX-License-Identifier: Apache-2.0 OR BSD-2-Clause
+#
+# This file is dual-licensed, meaning that you can use it under your
+# choice of either of the following two licenses:
+#
+# Copyright 2023-2023 The OpenSSL Project Authors. All Rights Reserved.
+#
+# Licensed under the Apache License 2.0 (the "License"). You may not use
+# this file except in compliance with the License. You can obtain a copy
+# in the file LICENSE in the source distribution or at
+# https://www.openssl.org/source/license.html
+#
+# or
+#
+# Copyright (c) 2023, Jerry Shih <jerry.shih@sifive.com>
+# All rights reserved.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions
+# are met:
+# 1. Redistributions of source code must retain the above copyright
+#    notice, this list of conditions and the following disclaimer.
+# 2. Redistributions in binary form must reproduce the above copyright
+#    notice, this list of conditions and the following disclaimer in the
+#    documentation and/or other materials provided with the distribution.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+# - RV64I
+# - RISC-V Vector ('V') with VLEN >= 128
+# - RISC-V Vector Cryptography Bit-manipulation extension ('Zvkb')
+# - RISC-V Zicclsm(Main memory supports misaligned loads/stores)
+
+use strict;
+use warnings;
+
+use FindBin qw($Bin);
+use lib "$Bin";
+use lib "$Bin/../../perlasm";
+use riscv;
+
+# $output is the last argument if it looks like a file (it has an extension)
+# $flavour is the first argument if it doesn't look like a file
+my $output = $#ARGV >= 0 && $ARGV[$#ARGV] =~ m|\.\w+$| ? pop : undef;
+my $flavour = $#ARGV >= 0 && $ARGV[0] !~ m|\.| ? shift : undef;
+
+$output and open STDOUT, ">$output";
+
+my $code = <<___;
+.text
+___
+
+# void ChaCha20_ctr32_zvkb(unsigned char *out, const unsigned char *inp,
+#                          size_t len, const unsigned int key[8],
+#                          const unsigned int counter[4]);
+################################################################################
+my ( $OUTPUT, $INPUT, $LEN, $KEY, $COUNTER ) = ( "a0", "a1", "a2", "a3", "a4" );
+my ( $T0 ) = ( "t0" );
+my ( $CONST_DATA0, $CONST_DATA1, $CONST_DATA2, $CONST_DATA3 ) =
+  ( "a5", "a6", "a7", "t1" );
+my ( $KEY0, $KEY1, $KEY2, $KEY3, $KEY4, $KEY5, $KEY6, $KEY7,
+    $COUNTER0, $COUNTER1, $NONCE0, $NONCE1
+) = ( "s0", "s1", "s2", "s3", "s4", "s5", "s6",
+    "s7", "s8", "s9", "s10", "s11" );
+my ( $VL, $STRIDE, $CHACHA_LOOP_COUNT ) = ( "t2", "t3", "t4" );
+my (
+    $V0,  $V1,  $V2,  $V3,  $V4,  $V5,  $V6,  $V7,  $V8,  $V9,  $V10,
+    $V11, $V12, $V13, $V14, $V15, $V16, $V17, $V18, $V19, $V20, $V21,
+    $V22, $V23, $V24, $V25, $V26, $V27, $V28, $V29, $V30, $V31,
+) = map( "v$_", ( 0 .. 31 ) );
+
+sub chacha_quad_round_group {
+    my (
+        $A0, $B0, $C0, $D0, $A1, $B1, $C1, $D1,
+        $A2, $B2, $C2, $D2, $A3, $B3, $C3, $D3
+    ) = @_;
+
+    my $code = <<___;
+    # a += b; d ^= a; d <<<= 16;
+    @{[vadd_vv $A0, $A0, $B0]}
+    @{[vadd_vv $A1, $A1, $B1]}
+    @{[vadd_vv $A2, $A2, $B2]}
+    @{[vadd_vv $A3, $A3, $B3]}
+    @{[vxor_vv $D0, $D0, $A0]}
+    @{[vxor_vv $D1, $D1, $A1]}
+    @{[vxor_vv $D2, $D2, $A2]}
+    @{[vxor_vv $D3, $D3, $A3]}
+    @{[vror_vi $D0, $D0, 32 - 16]}
+    @{[vror_vi $D1, $D1, 32 - 16]}
+    @{[vror_vi $D2, $D2, 32 - 16]}
+    @{[vror_vi $D3, $D3, 32 - 16]}
+    # c += d; b ^= c; b <<<= 12;
+    @{[vadd_vv $C0, $C0, $D0]}
+    @{[vadd_vv $C1, $C1, $D1]}
+    @{[vadd_vv $C2, $C2, $D2]}
+    @{[vadd_vv $C3, $C3, $D3]}
+    @{[vxor_vv $B0, $B0, $C0]}
+    @{[vxor_vv $B1, $B1, $C1]}
+    @{[vxor_vv $B2, $B2, $C2]}
+    @{[vxor_vv $B3, $B3, $C3]}
+    @{[vror_vi $B0, $B0, 32 - 12]}
+    @{[vror_vi $B1, $B1, 32 - 12]}
+    @{[vror_vi $B2, $B2, 32 - 12]}
+    @{[vror_vi $B3, $B3, 32 - 12]}
+    # a += b; d ^= a; d <<<= 8;
+    @{[vadd_vv $A0, $A0, $B0]}
+    @{[vadd_vv $A1, $A1, $B1]}
+    @{[vadd_vv $A2, $A2, $B2]}
+    @{[vadd_vv $A3, $A3, $B3]}
+    @{[vxor_vv $D0, $D0, $A0]}
+    @{[vxor_vv $D1, $D1, $A1]}
+    @{[vxor_vv $D2, $D2, $A2]}
+    @{[vxor_vv $D3, $D3, $A3]}
+    @{[vror_vi $D0, $D0, 32 - 8]}
+    @{[vror_vi $D1, $D1, 32 - 8]}
+    @{[vror_vi $D2, $D2, 32 - 8]}
+    @{[vror_vi $D3, $D3, 32 - 8]}
+    # c += d; b ^= c; b <<<= 7;
+    @{[vadd_vv $C0, $C0, $D0]}
+    @{[vadd_vv $C1, $C1, $D1]}
+    @{[vadd_vv $C2, $C2, $D2]}
+    @{[vadd_vv $C3, $C3, $D3]}
+    @{[vxor_vv $B0, $B0, $C0]}
+    @{[vxor_vv $B1, $B1, $C1]}
+    @{[vxor_vv $B2, $B2, $C2]}
+    @{[vxor_vv $B3, $B3, $C3]}
+    @{[vror_vi $B0, $B0, 32 - 7]}
+    @{[vror_vi $B1, $B1, 32 - 7]}
+    @{[vror_vi $B2, $B2, 32 - 7]}
+    @{[vror_vi $B3, $B3, 32 - 7]}
+___
+
+    return $code;
+}
+
+$code .= <<___;
+.p2align 3
+.globl ChaCha20_ctr32_zvkb
+.type ChaCha20_ctr32_zvkb,\@function
+ChaCha20_ctr32_zvkb:
+    srli $LEN, $LEN, 6
+    beqz $LEN, .Lend
+
+    addi sp, sp, -96
+    sd s0, 0(sp)
+    sd s1, 8(sp)
+    sd s2, 16(sp)
+    sd s3, 24(sp)
+    sd s4, 32(sp)
+    sd s5, 40(sp)
+    sd s6, 48(sp)
+    sd s7, 56(sp)
+    sd s8, 64(sp)
+    sd s9, 72(sp)
+    sd s10, 80(sp)
+    sd s11, 88(sp)
+
+    li $STRIDE, 64
+
+    #### chacha block data
+    # "expa" little endian
+    li $CONST_DATA0, 0x61707865
+    # "nd 3" little endian
+    li $CONST_DATA1, 0x3320646e
+    # "2-by" little endian
+    li $CONST_DATA2, 0x79622d32
+    # "te k" little endian
+    li $CONST_DATA3, 0x6b206574
+
+    lw $KEY0, 0($KEY)
+    lw $KEY1, 4($KEY)
+    lw $KEY2, 8($KEY)
+    lw $KEY3, 12($KEY)
+    lw $KEY4, 16($KEY)
+    lw $KEY5, 20($KEY)
+    lw $KEY6, 24($KEY)
+    lw $KEY7, 28($KEY)
+
+    lw $COUNTER0, 0($COUNTER)
+    lw $COUNTER1, 4($COUNTER)
+    lw $NONCE0, 8($COUNTER)
+    lw $NONCE1, 12($COUNTER)
+
+.Lblock_loop:
+    @{[vsetvli $VL, $LEN, "e32", "m1", "ta", "ma"]}
+
+    # init chacha const states
+    @{[vmv_v_x $V0, $CONST_DATA0]}
+    @{[vmv_v_x $V1, $CONST_DATA1]}
+    @{[vmv_v_x $V2, $CONST_DATA2]}
+    @{[vmv_v_x $V3, $CONST_DATA3]}
+
+    # init chacha key states
+    @{[vmv_v_x $V4, $KEY0]}
+    @{[vmv_v_x $V5, $KEY1]}
+    @{[vmv_v_x $V6, $KEY2]}
+    @{[vmv_v_x $V7, $KEY3]}
+    @{[vmv_v_x $V8, $KEY4]}
+    @{[vmv_v_x $V9, $KEY5]}
+    @{[vmv_v_x $V10, $KEY6]}
+    @{[vmv_v_x $V11, $KEY7]}
+
+    # init chacha key states
+    @{[vid_v $V12]}
+    @{[vadd_vx $V12, $V12, $COUNTER0]}
+    @{[vmv_v_x $V13, $COUNTER1]}
+
+    # init chacha nonce states
+    @{[vmv_v_x $V14, $NONCE0]}
+    @{[vmv_v_x $V15, $NONCE1]}
+
+    # load the top-half of input data
+    @{[vlsseg_nf_e32_v 8, $V16, $INPUT, $STRIDE]}
+
+    li $CHACHA_LOOP_COUNT, 10
+.Lround_loop:
+    addi $CHACHA_LOOP_COUNT, $CHACHA_LOOP_COUNT, -1
+    @{[chacha_quad_round_group
+        $V0, $V4, $V8, $V12,
+        $V1, $V5, $V9, $V13,
+        $V2, $V6, $V10, $V14,
+        $V3, $V7, $V11, $V15]}
+    @{[chacha_quad_round_group
+        $V0, $V5, $V10, $V15,
+        $V1, $V6, $V11, $V12,
+        $V2, $V7, $V8, $V13,
+        $V3, $V4, $V9, $V14]}
+    bnez $CHACHA_LOOP_COUNT, .Lround_loop
+
+    # load the bottom-half of input data
+    addi $T0, $INPUT, 32
+    @{[vlsseg_nf_e32_v 8, $V24, $T0, $STRIDE]}
+
+    # add chacha top-half initial block states
+    @{[vadd_vx $V0, $V0, $CONST_DATA0]}
+    @{[vadd_vx $V1, $V1, $CONST_DATA1]}
+    @{[vadd_vx $V2, $V2, $CONST_DATA2]}
+    @{[vadd_vx $V3, $V3, $CONST_DATA3]}
+    @{[vadd_vx $V4, $V4, $KEY0]}
+    @{[vadd_vx $V5, $V5, $KEY1]}
+    @{[vadd_vx $V6, $V6, $KEY2]}
+    @{[vadd_vx $V7, $V7, $KEY3]}
+    # xor with the top-half input
+    @{[vxor_vv $V16, $V16, $V0]}
+    @{[vxor_vv $V17, $V17, $V1]}
+    @{[vxor_vv $V18, $V18, $V2]}
+    @{[vxor_vv $V19, $V19, $V3]}
+    @{[vxor_vv $V20, $V20, $V4]}
+    @{[vxor_vv $V21, $V21, $V5]}
+    @{[vxor_vv $V22, $V22, $V6]}
+    @{[vxor_vv $V23, $V23, $V7]}
+
+    # save the top-half of output
+    @{[vssseg_nf_e32_v 8, $V16, $OUTPUT, $STRIDE]}
+
+    # add chacha bottom-half initial block states
+    @{[vadd_vx $V8, $V8, $KEY4]}
+    @{[vadd_vx $V9, $V9, $KEY5]}
+    @{[vadd_vx $V10, $V10, $KEY6]}
+    @{[vadd_vx $V11, $V11, $KEY7]}
+    @{[vid_v $V0]}
+    @{[vadd_vx $V12, $V12, $COUNTER0]}
+    @{[vadd_vx $V13, $V13, $COUNTER1]}
+    @{[vadd_vx $V14, $V14, $NONCE0]}
+    @{[vadd_vx $V15, $V15, $NONCE1]}
+    @{[vadd_vv $V12, $V12, $V0]}
+    # xor with the bottom-half input
+    @{[vxor_vv $V24, $V24, $V8]}
+    @{[vxor_vv $V25, $V25, $V9]}
+    @{[vxor_vv $V26, $V26, $V10]}
+    @{[vxor_vv $V27, $V27, $V11]}
+    @{[vxor_vv $V29, $V29, $V13]}
+    @{[vxor_vv $V28, $V28, $V12]}
+    @{[vxor_vv $V30, $V30, $V14]}
+    @{[vxor_vv $V31, $V31, $V15]}
+
+    # save the bottom-half of output
+    addi $T0, $OUTPUT, 32
+    @{[vssseg_nf_e32_v 8, $V24, $T0, $STRIDE]}
+
+    # update counter
+    add $COUNTER0, $COUNTER0, $VL
+    sub $LEN, $LEN, $VL
+    # increase offset for `4 * 16 * VL = 64 * VL`
+    slli $T0, $VL, 6
+    add $INPUT, $INPUT, $T0
+    add $OUTPUT, $OUTPUT, $T0
+    bnez $LEN, .Lblock_loop
+
+    ld s0, 0(sp)
+    ld s1, 8(sp)
+    ld s2, 16(sp)
+    ld s3, 24(sp)
+    ld s4, 32(sp)
+    ld s5, 40(sp)
+    ld s6, 48(sp)
+    ld s7, 56(sp)
+    ld s8, 64(sp)
+    ld s9, 72(sp)
+    ld s10, 80(sp)
+    ld s11, 88(sp)
+    addi sp, sp, 96
+
+.Lend:
+    ret
+.size ChaCha20_ctr32_zvkb,.-ChaCha20_ctr32_zvkb
___
+
+print $code;
+
+close STDOUT or die "error closing STDOUT: $!";
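The `chacha_quad_round_group` helper above emits four independent ChaCha quarter-rounds at a time (rotations by 16, 12, 8, and 7 bits, with `vror_vi x, 32 - n` acting as a rotate-left by `n`). As a reference for what each column group computes, here is a scalar Python sketch of a single quarter-round, checked against the RFC 8439 test vector:

```python
MASK32 = 0xFFFFFFFF

def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits (what `vror_vi x, 32 - n` performs)."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarter_round(a: int, b: int, c: int, d: int):
    """One ChaCha quarter-round; the vector code runs this on four
    (a, b, c, d) groups in parallel, one element per vector lane."""
    a = (a + b) & MASK32; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32; b = rotl32(b ^ c, 7)
    return a, b, c, d

# RFC 8439 section 2.1.1 test vector for the quarter round
assert quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567) == \
    (0xEA2A92F4, 0xCB1CF8CE, 0x4581472E, 0x5881C4BB)
```

The two `chacha_quad_round_group` calls per round-loop iteration correspond to the column round and the diagonal round; ten iterations give the 20 rounds of ChaCha20.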
Add a ChaCha20 vector implementation from OpenSSL (openssl/openssl#21923).

Signed-off-by: Jerry Shih <jerry.shih@sifive.com>
---
 arch/riscv/crypto/Kconfig                |  12 +
 arch/riscv/crypto/Makefile               |   7 +
 arch/riscv/crypto/chacha-riscv64-glue.c  | 120 +++++++++
 arch/riscv/crypto/chacha-riscv64-zvkb.pl | 322 +++++++++++++++++++++++
 4 files changed, 461 insertions(+)
 create mode 100644 arch/riscv/crypto/chacha-riscv64-glue.c
 create mode 100644 arch/riscv/crypto/chacha-riscv64-zvkb.pl