mbox series

[RFC,00/13] RISC-V crypto with reworked asm files

Message ID 20240102064743.220490-1-ebiggers@kernel.org (mailing list archive)
Headers show
Series RISC-V crypto with reworked asm files | expand

Message

Eric Biggers Jan. 2, 2024, 6:47 a.m. UTC
As discussed previously, the proposed use of the so-called perlasm for
the RISC-V crypto assembly files makes them difficult to read, and these
files have some other issues such extensively duplicating source code
for the different AES key lengths and for the unrolled hash functions.
There is/was a desire to share code with OpenSSL, but many of the files
have already diverged significantly; also, for most of the algorithms
the source code can be quite short anyway, due to the native support for
them in the RISC-V vector crypto extensions combined with the way the
RISC-V vector extension naturally scales to arbitrary vector lengths.

Since we're still waiting for prerequisite patches to be merged anyway,
we have a bit more time to get this cleaned up properly.  So I've had a
go at cleaning up the patchset to use standard .S files, with the code
duplication fixed.  I also made some tweaks to make the different
algorithms consistent with each other and with what exists in the kernel
already for other architectures, and tried to improve comments.

The result is this series, which passes all tests and is about 2400
lines shorter than the latest version with the perlasm
(https://lore.kernel.org/linux-crypto/20231231152743.6304-1-jerry.shih@sifive.com/).
All the same functionality and general optimizations are still included,
except for some minor optimizations in XTS that I dropped since it's not
clear they are worth the complexity.  (Note that almost all users of XTS
in the kernel only use it with block-aligned messages, so it's not very
important to optimize the ciphertext stealing case.)

I'd appreciate people's thoughts on this series.  Jerry, I hope I'm not
stepping on your toes too much here, but I think there are some big
improvements here.

This series is based on riscv/for-next
(https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/log/?h=for-next)
commit f352a28cc2fb4ee8d08c6a6362c9a861fcc84236, and for convenience
I've included the prerequisite patches too.

Andy Chiu (1):
  riscv: vector: make Vector always available for softirq context

Eric Biggers (1):
  RISC-V: add TOOLCHAIN_HAS_VECTOR_CRYPTO

Greentime Hu (1):
  riscv: Add support for kernel mode vector

Heiko Stuebner (2):
  RISC-V: add helper function to read the vector VLEN
  RISC-V: hook new crypto subdir into build-system

Jerry Shih (8):
  crypto: riscv - add vector crypto accelerated AES
  crypto: riscv - add vector crypto accelerated AES-{ECB,CBC,CTR,XTS}
  crypto: riscv - add vector crypto accelerated ChaCha20
  crypto: riscv - add vector crypto accelerated GHASH
  crypto: riscv - add vector crypto accelerated SHA-{256,224}
  crypto: riscv - add vector crypto accelerated SHA-{512,384}
  crypto: riscv - add vector crypto accelerated SM3
  crypto: riscv - add vector crypto accelerated SM4

 arch/riscv/Kbuild                             |   1 +
 arch/riscv/Kconfig                            |   7 +
 arch/riscv/crypto/Kconfig                     | 108 +++++
 arch/riscv/crypto/Makefile                    |  28 ++
 arch/riscv/crypto/aes-macros.S                | 156 +++++++
 .../crypto/aes-riscv64-block-mode-glue.c      | 435 ++++++++++++++++++
 arch/riscv/crypto/aes-riscv64-glue.c          | 123 +++++
 arch/riscv/crypto/aes-riscv64-glue.h          |  15 +
 .../crypto/aes-riscv64-zvkned-zvbb-zvkg.S     | 300 ++++++++++++
 arch/riscv/crypto/aes-riscv64-zvkned-zvkb.S   | 146 ++++++
 arch/riscv/crypto/aes-riscv64-zvkned.S        | 180 ++++++++
 arch/riscv/crypto/chacha-riscv64-glue.c       | 101 ++++
 arch/riscv/crypto/chacha-riscv64-zvkb.S       | 294 ++++++++++++
 arch/riscv/crypto/ghash-riscv64-glue.c        | 170 +++++++
 arch/riscv/crypto/ghash-riscv64-zvkg.S        |  72 +++
 arch/riscv/crypto/sha256-riscv64-glue.c       | 137 ++++++
 .../sha256-riscv64-zvknha_or_zvknhb-zvkb.S    | 225 +++++++++
 arch/riscv/crypto/sha512-riscv64-glue.c       | 133 ++++++
 .../riscv/crypto/sha512-riscv64-zvknhb-zvkb.S | 203 ++++++++
 arch/riscv/crypto/sm3-riscv64-glue.c          | 112 +++++
 arch/riscv/crypto/sm3-riscv64-zvksh-zvkb.S    | 123 +++++
 arch/riscv/crypto/sm4-riscv64-glue.c          | 109 +++++
 arch/riscv/crypto/sm4-riscv64-zvksed-zvkb.S   | 117 +++++
 arch/riscv/include/asm/processor.h            |  14 +-
 arch/riscv/include/asm/simd.h                 |  48 ++
 arch/riscv/include/asm/vector.h               |  20 +
 arch/riscv/kernel/Makefile                    |   1 +
 arch/riscv/kernel/kernel_mode_vector.c        | 126 +++++
 arch/riscv/kernel/process.c                   |   1 +
 crypto/Kconfig                                |   3 +
 30 files changed, 3507 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/crypto/Kconfig
 create mode 100644 arch/riscv/crypto/Makefile
 create mode 100644 arch/riscv/crypto/aes-macros.S
 create mode 100644 arch/riscv/crypto/aes-riscv64-block-mode-glue.c
 create mode 100644 arch/riscv/crypto/aes-riscv64-glue.c
 create mode 100644 arch/riscv/crypto/aes-riscv64-glue.h
 create mode 100644 arch/riscv/crypto/aes-riscv64-zvkned-zvbb-zvkg.S
 create mode 100644 arch/riscv/crypto/aes-riscv64-zvkned-zvkb.S
 create mode 100644 arch/riscv/crypto/aes-riscv64-zvkned.S
 create mode 100644 arch/riscv/crypto/chacha-riscv64-glue.c
 create mode 100644 arch/riscv/crypto/chacha-riscv64-zvkb.S
 create mode 100644 arch/riscv/crypto/ghash-riscv64-glue.c
 create mode 100644 arch/riscv/crypto/ghash-riscv64-zvkg.S
 create mode 100644 arch/riscv/crypto/sha256-riscv64-glue.c
 create mode 100644 arch/riscv/crypto/sha256-riscv64-zvknha_or_zvknhb-zvkb.S
 create mode 100644 arch/riscv/crypto/sha512-riscv64-glue.c
 create mode 100644 arch/riscv/crypto/sha512-riscv64-zvknhb-zvkb.S
 create mode 100644 arch/riscv/crypto/sm3-riscv64-glue.c
 create mode 100644 arch/riscv/crypto/sm3-riscv64-zvksh-zvkb.S
 create mode 100644 arch/riscv/crypto/sm4-riscv64-glue.c
 create mode 100644 arch/riscv/crypto/sm4-riscv64-zvksed-zvkb.S
 create mode 100644 arch/riscv/include/asm/simd.h
 create mode 100644 arch/riscv/kernel/kernel_mode_vector.c


base-commit: f352a28cc2fb4ee8d08c6a6362c9a861fcc84236

Comments

Ard Biesheuvel Jan. 3, 2024, 2 p.m. UTC | #1
On Tue, 2 Jan 2024 at 07:50, Eric Biggers <ebiggers@kernel.org> wrote:
>
> As discussed previously, the proposed use of the so-called perlasm for
> the RISC-V crypto assembly files makes them difficult to read, and these
> files have some other issues such extensively duplicating source code
> for the different AES key lengths and for the unrolled hash functions.
> There is/was a desire to share code with OpenSSL, but many of the files
> have already diverged significantly; also, for most of the algorithms
> the source code can be quite short anyway, due to the native support for
> them in the RISC-V vector crypto extensions combined with the way the
> RISC-V vector extension naturally scales to arbitrary vector lengths.
>
> Since we're still waiting for prerequisite patches to be merged anyway,
> we have a bit more time to get this cleaned up properly.  So I've had a
> go at cleaning up the patchset to use standard .S files, with the code
> duplication fixed.  I also made some tweaks to make the different
> algorithms consistent with each other and with what exists in the kernel
> already for other architectures, and tried to improve comments.
>
> The result is this series, which passes all tests and is about 2400
> lines shorter than the latest version with the perlasm
> (https://lore.kernel.org/linux-crypto/20231231152743.6304-1-jerry.shih@sifive.com/).
> All the same functionality and general optimizations are still included,
> except for some minor optimizations in XTS that I dropped since it's not
> clear they are worth the complexity.  (Note that almost all users of XTS
> in the kernel only use it with block-aligned messages, so it's not very
> important to optimize the ciphertext stealing case.)
>
> I'd appreciate people's thoughts on this series.  Jerry, I hope I'm not
> stepping on your toes too much here, but I think there are some big
> improvements here.
>

As I have indicated before, I fully agree with Eric here that avoiding
perlasm is preferable: sharing code with OpenSSL is great if we can
simply adopt the exact same code (and track OpenSSL as its upstream)
but this never really worked that well for skciphers, due to API
differences. (The SHA transforms can be reused a bit more easily)

I will also note that perlasm is not as useful for RISC-V as it is for
other architectures: in OpenSSL, perlasm is also used to abstract
differences in calling conventions between, e.g., x86_64 on Linux vs
Windows, or to support building with outdated [proprietary]
toolchains.

I do wonder if we could also use .req directives for register aliases
instead of CPP defines? It shouldn't matter for working code, but the
diagnostics tend to be a bit more useful if the aliases are visible to
the assembler.







> This series is based on riscv/for-next
> (https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/log/?h=for-next)
> commit f352a28cc2fb4ee8d08c6a6362c9a861fcc84236, and for convenience
> I've included the prerequisite patches too.
>
> Andy Chiu (1):
>   riscv: vector: make Vector always available for softirq context
>
> Eric Biggers (1):
>   RISC-V: add TOOLCHAIN_HAS_VECTOR_CRYPTO
>
> Greentime Hu (1):
>   riscv: Add support for kernel mode vector
>
> Heiko Stuebner (2):
>   RISC-V: add helper function to read the vector VLEN
>   RISC-V: hook new crypto subdir into build-system
>
> Jerry Shih (8):
>   crypto: riscv - add vector crypto accelerated AES
>   crypto: riscv - add vector crypto accelerated AES-{ECB,CBC,CTR,XTS}
>   crypto: riscv - add vector crypto accelerated ChaCha20
>   crypto: riscv - add vector crypto accelerated GHASH
>   crypto: riscv - add vector crypto accelerated SHA-{256,224}
>   crypto: riscv - add vector crypto accelerated SHA-{512,384}
>   crypto: riscv - add vector crypto accelerated SM3
>   crypto: riscv - add vector crypto accelerated SM4
>
>  arch/riscv/Kbuild                             |   1 +
>  arch/riscv/Kconfig                            |   7 +
>  arch/riscv/crypto/Kconfig                     | 108 +++++
>  arch/riscv/crypto/Makefile                    |  28 ++
>  arch/riscv/crypto/aes-macros.S                | 156 +++++++
>  .../crypto/aes-riscv64-block-mode-glue.c      | 435 ++++++++++++++++++
>  arch/riscv/crypto/aes-riscv64-glue.c          | 123 +++++
>  arch/riscv/crypto/aes-riscv64-glue.h          |  15 +
>  .../crypto/aes-riscv64-zvkned-zvbb-zvkg.S     | 300 ++++++++++++
>  arch/riscv/crypto/aes-riscv64-zvkned-zvkb.S   | 146 ++++++
>  arch/riscv/crypto/aes-riscv64-zvkned.S        | 180 ++++++++
>  arch/riscv/crypto/chacha-riscv64-glue.c       | 101 ++++
>  arch/riscv/crypto/chacha-riscv64-zvkb.S       | 294 ++++++++++++
>  arch/riscv/crypto/ghash-riscv64-glue.c        | 170 +++++++
>  arch/riscv/crypto/ghash-riscv64-zvkg.S        |  72 +++
>  arch/riscv/crypto/sha256-riscv64-glue.c       | 137 ++++++
>  .../sha256-riscv64-zvknha_or_zvknhb-zvkb.S    | 225 +++++++++
>  arch/riscv/crypto/sha512-riscv64-glue.c       | 133 ++++++
>  .../riscv/crypto/sha512-riscv64-zvknhb-zvkb.S | 203 ++++++++
>  arch/riscv/crypto/sm3-riscv64-glue.c          | 112 +++++
>  arch/riscv/crypto/sm3-riscv64-zvksh-zvkb.S    | 123 +++++
>  arch/riscv/crypto/sm4-riscv64-glue.c          | 109 +++++
>  arch/riscv/crypto/sm4-riscv64-zvksed-zvkb.S   | 117 +++++
>  arch/riscv/include/asm/processor.h            |  14 +-
>  arch/riscv/include/asm/simd.h                 |  48 ++
>  arch/riscv/include/asm/vector.h               |  20 +
>  arch/riscv/kernel/Makefile                    |   1 +
>  arch/riscv/kernel/kernel_mode_vector.c        | 126 +++++
>  arch/riscv/kernel/process.c                   |   1 +
>  crypto/Kconfig                                |   3 +
>  30 files changed, 3507 insertions(+), 1 deletion(-)
>  create mode 100644 arch/riscv/crypto/Kconfig
>  create mode 100644 arch/riscv/crypto/Makefile
>  create mode 100644 arch/riscv/crypto/aes-macros.S
>  create mode 100644 arch/riscv/crypto/aes-riscv64-block-mode-glue.c
>  create mode 100644 arch/riscv/crypto/aes-riscv64-glue.c
>  create mode 100644 arch/riscv/crypto/aes-riscv64-glue.h
>  create mode 100644 arch/riscv/crypto/aes-riscv64-zvkned-zvbb-zvkg.S
>  create mode 100644 arch/riscv/crypto/aes-riscv64-zvkned-zvkb.S
>  create mode 100644 arch/riscv/crypto/aes-riscv64-zvkned.S
>  create mode 100644 arch/riscv/crypto/chacha-riscv64-glue.c
>  create mode 100644 arch/riscv/crypto/chacha-riscv64-zvkb.S
>  create mode 100644 arch/riscv/crypto/ghash-riscv64-glue.c
>  create mode 100644 arch/riscv/crypto/ghash-riscv64-zvkg.S
>  create mode 100644 arch/riscv/crypto/sha256-riscv64-glue.c
>  create mode 100644 arch/riscv/crypto/sha256-riscv64-zvknha_or_zvknhb-zvkb.S
>  create mode 100644 arch/riscv/crypto/sha512-riscv64-glue.c
>  create mode 100644 arch/riscv/crypto/sha512-riscv64-zvknhb-zvkb.S
>  create mode 100644 arch/riscv/crypto/sm3-riscv64-glue.c
>  create mode 100644 arch/riscv/crypto/sm3-riscv64-zvksh-zvkb.S
>  create mode 100644 arch/riscv/crypto/sm4-riscv64-glue.c
>  create mode 100644 arch/riscv/crypto/sm4-riscv64-zvksed-zvkb.S
>  create mode 100644 arch/riscv/include/asm/simd.h
>  create mode 100644 arch/riscv/kernel/kernel_mode_vector.c
>
>
> base-commit: f352a28cc2fb4ee8d08c6a6362c9a861fcc84236
> --
> 2.43.0
>
Eric Biggers Jan. 3, 2024, 2:35 p.m. UTC | #2
On Wed, Jan 03, 2024 at 03:00:29PM +0100, Ard Biesheuvel wrote:
> On Tue, 2 Jan 2024 at 07:50, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > As discussed previously, the proposed use of the so-called perlasm for
> > the RISC-V crypto assembly files makes them difficult to read, and these
> > files have some other issues such extensively duplicating source code
> > for the different AES key lengths and for the unrolled hash functions.
> > There is/was a desire to share code with OpenSSL, but many of the files
> > have already diverged significantly; also, for most of the algorithms
> > the source code can be quite short anyway, due to the native support for
> > them in the RISC-V vector crypto extensions combined with the way the
> > RISC-V vector extension naturally scales to arbitrary vector lengths.
> >
> > Since we're still waiting for prerequisite patches to be merged anyway,
> > we have a bit more time to get this cleaned up properly.  So I've had a
> > go at cleaning up the patchset to use standard .S files, with the code
> > duplication fixed.  I also made some tweaks to make the different
> > algorithms consistent with each other and with what exists in the kernel
> > already for other architectures, and tried to improve comments.
> >
> > The result is this series, which passes all tests and is about 2400
> > lines shorter than the latest version with the perlasm
> > (https://lore.kernel.org/linux-crypto/20231231152743.6304-1-jerry.shih@sifive.com/).
> > All the same functionality and general optimizations are still included,
> > except for some minor optimizations in XTS that I dropped since it's not
> > clear they are worth the complexity.  (Note that almost all users of XTS
> > in the kernel only use it with block-aligned messages, so it's not very
> > important to optimize the ciphertext stealing case.)
> >
> > I'd appreciate people's thoughts on this series.  Jerry, I hope I'm not
> > stepping on your toes too much here, but I think there are some big
> > improvements here.
> >
> 
> As I have indicated before, I fully agree with Eric here that avoiding
> perlasm is preferable: sharing code with OpenSSL is great if we can
> simply adopt the exact same code (and track OpenSSL as its upstream)
> but this never really worked that well for skciphers, due to API
> differences. (The SHA transforms can be reused a bit more easily)
> 
> I will also note that perlasm is not as useful for RISC-V as it is for
> other architectures: in OpenSSL, perlasm is also used to abstract
> differences in calling conventions between, e.g., x86_64 on Linux vs
> Windows, or to support building with outdated [proprietary]
> toolchains.
> 
> I do wonder if we could also use .req directives for register aliases
> instead of CPP defines? It shouldn't matter for working code, but the
> diagnostics tend to be a bit more useful if the aliases are visible to
> the assembler.

.req unfortunately isn't an option since it is specific to AArch64 and ARM
assembly.  So we have to use #defines like x86 does.  Ultimately, the effect is
about the same.

- Eric
Jerry Shih Jan. 4, 2024, 10:23 a.m. UTC | #3
On Jan 3, 2024, at 22:35, Eric Biggers <ebiggers@kernel.org> wrote:
> On Wed, Jan 03, 2024 at 03:00:29PM +0100, Ard Biesheuvel wrote:
>> On Tue, 2 Jan 2024 at 07:50, Eric Biggers <ebiggers@kernel.org> wrote:
>>> 
>>> As discussed previously, the proposed use of the so-called perlasm for
>>> the RISC-V crypto assembly files makes them difficult to read, and these
>>> files have some other issues such extensively duplicating source code
>>> for the different AES key lengths and for the unrolled hash functions.
>>> There is/was a desire to share code with OpenSSL, but many of the files
>>> have already diverged significantly; also, for most of the algorithms
>>> the source code can be quite short anyway, due to the native support for
>>> them in the RISC-V vector crypto extensions combined with the way the
>>> RISC-V vector extension naturally scales to arbitrary vector lengths.
>>> 
>>> Since we're still waiting for prerequisite patches to be merged anyway,
>>> we have a bit more time to get this cleaned up properly.  So I've had a
>>> go at cleaning up the patchset to use standard .S files, with the code
>>> duplication fixed.  I also made some tweaks to make the different
>>> algorithms consistent with each other and with what exists in the kernel

Do you mean the xts gluing part only? Do we have the different algorithms
for the actual implementation in .S? 

>>> already for other architectures, and tried to improve comments.

The improved comments are better. Thanks.

>>> The result is this series, which passes all tests and is about 2400
>>> lines shorter than the latest version with the perlasm
>>> (https://lore.kernel.org/linux-crypto/20231231152743.6304-1-jerry.shih@sifive.com/).

For the unrolled hash functions case(sha256/512), the .S implementation uses
macro instead. But I think the macro expanded code will be the same. Do we
care about the source code size instead of the final binary code size?

For the AES variants part, the .S uses branches inside the inner loop. Does the
branch inside the inner loop better than the bigger code size?

>>> All the same functionality and general optimizations are still included,
>>> except for some minor optimizations in XTS that I dropped since it's not
>>> clear they are worth the complexity.  (Note that almost all users of XTS
>>> in the kernel only use it with block-aligned messages, so it's not very
>>> important to optimize the ciphertext stealing case.)

The aesni/neon are SIMD, so I think the rvv version could have the different
design. And I think my implementation is very similar to x86-xts except the
tail block numbers for ciphertext stealing case.
The x86-xts-like implementation uses the fixed 2 block for the ciphertext
stealing case.

+		int xts_blocks = DIV_ROUND_UP(req->cryptlen,
+					      AES_BLOCK_SIZE) - 2;

>>> I'd appreciate people's thoughts on this series.  Jerry, I hope I'm not
>>> stepping on your toes too much here, but I think there are some big
>>> improvements here.
>>> 
>> 
>> As I have indicated before, I fully agree with Eric here that avoiding
>> perlasm is preferable: sharing code with OpenSSL is great if we can
>> simply adopt the exact same code (and track OpenSSL as its upstream)
>> but this never really worked that well for skciphers, due to API
>> differences. (The SHA transforms can be reused a bit more easily)

In my opinion, I would prefer the perlasm with the usage of asm mnemonics.
I could see the expanded asm source from perlsm. I don't know how to dump the
expanded asm source when we use asm `.macro` directives. I use objdump to
see the final asm.
And we could use function to modularize the asm implementations. The macro
might do the same things, but it's more clear for me for the argument passing
and more powerful string manipulation.
And I could have scope for the register name binding. It's more clear to me
comparing with a long list of `#define xxx`.

If the pure .S is still more preferable in kernel, let's do that.

-Jerry

>> I will also note that perlasm is not as useful for RISC-V as it is for
>> other architectures: in OpenSSL, perlasm is also used to abstract
>> differences in calling conventions between, e.g., x86_64 on Linux vs
>> Windows, or to support building with outdated [proprietary]
>> toolchains.
>> 
>> I do wonder if we could also use .req directives for register aliases
>> instead of CPP defines? It shouldn't matter for working code, but the
>> diagnostics tend to be a bit more useful if the aliases are visible to
>> the assembler.
> 
> .req unfortunately isn't an option since it is specific to AArch64 and ARM
> assembly.  So we have to use #defines like x86 does.  Ultimately, the effect is
> about the same.
> 
> - Eric
Eric Biggers Jan. 4, 2024, 6:02 p.m. UTC | #4
On Thu, Jan 04, 2024 at 06:23:42PM +0800, Jerry Shih wrote:
> On Jan 3, 2024, at 22:35, Eric Biggers <ebiggers@kernel.org> wrote:
> > On Wed, Jan 03, 2024 at 03:00:29PM +0100, Ard Biesheuvel wrote:
> >> On Tue, 2 Jan 2024 at 07:50, Eric Biggers <ebiggers@kernel.org> wrote:
> >>> 
> >>> As discussed previously, the proposed use of the so-called perlasm for
> >>> the RISC-V crypto assembly files makes them difficult to read, and these
> >>> files have some other issues such extensively duplicating source code
> >>> for the different AES key lengths and for the unrolled hash functions.
> >>> There is/was a desire to share code with OpenSSL, but many of the files
> >>> have already diverged significantly; also, for most of the algorithms
> >>> the source code can be quite short anyway, due to the native support for
> >>> them in the RISC-V vector crypto extensions combined with the way the
> >>> RISC-V vector extension naturally scales to arbitrary vector lengths.
> >>> 
> >>> Since we're still waiting for prerequisite patches to be merged anyway,
> >>> we have a bit more time to get this cleaned up properly.  So I've had a
> >>> go at cleaning up the patchset to use standard .S files, with the code
> >>> duplication fixed.  I also made some tweaks to make the different
> >>> algorithms consistent with each other and with what exists in the kernel
> 
> Do you mean the xts gluing part only? Do we have the different algorithms
> for the actual implementation in .S? 

I did change both the XTS assembly and XTS glue code slightly.  However, the
general approach is still the same, except for a couple minor optimizations that
I dropped.  See what I wrote below:

    All the same functionality and general optimizations are still included,
    except for some minor optimizations in XTS that I dropped since it's not
    clear they are worth the complexity.  (Note that almost all users of XTS
    in the kernel only use it with block-aligned messages, so it's not very
    important to optimize the ciphertext stealing case.)

> >>> The result is this series, which passes all tests and is about 2400
> >>> lines shorter than the latest version with the perlasm
> >>> (https://lore.kernel.org/linux-crypto/20231231152743.6304-1-jerry.shih@sifive.com/).
> 
> For the unrolled hash functions case(sha256/512), the .S implementation uses
> macro instead. But I think the macro expanded code will be the same. Do we
> care about the source code size instead of the final binary code size?

Yes, because it's very difficult to review code that's been hand-unrolled.
Essentially what I had been doing to review your code was to reverse engineer
what the loop that was being unrolled was, and then go through each unrolled
iteration and verify that it matches what it is supposed to.  (While also
ignoring the code comments, as some of the comments had been copy+pasted without
being updated to be correct for what iteration they were commenting.)

It's much simpler to just use .rept to unroll the code for us.

> For the AES variants part, the .S uses branches inside the inner loop.

No, it does not.  The macros generate a separate inner loop for each AES key
length, just like what the perl scripts had.  The source code just isn't
duplicated 3 times anymore.

> 
> >>> All the same functionality and general optimizations are still included,
> >>> except for some minor optimizations in XTS that I dropped since it's not
> >>> clear they are worth the complexity.  (Note that almost all users of XTS
> >>> in the kernel only use it with block-aligned messages, so it's not very
> >>> important to optimize the ciphertext stealing case.)
> 
> The aesni/neon are SIMD, so I think the rvv version could have the different
> design. And I think my implementation is very similar to x86-xts except the
> tail block numbers for ciphertext stealing case.
> The x86-xts-like implementation uses the fixed 2 block for the ciphertext
> stealing case.
> 
> +		int xts_blocks = DIV_ROUND_UP(req->cryptlen,
> +					      AES_BLOCK_SIZE) - 2;

The two specific XTS optimizations that I dropped are:
    
1. There was a parameter 'update_iv' that told the assembly code whether to
   calculate and return the next tweak or not.  Computing the next tweak is a
   fairly small computation; it's not clear to me that it's worth adding a
   branch and extra code complexity to skip that computation.

2. For XTS encryption with ciphertext stealing, the assembly code encrypted the
   last full block during the main loop instead of separately, in order to take
   better advantage of parallelism.  I don't like how this optimization only
   applied to inputs whose length is not a multiple of 16, which are very rare.
   Also, it only applied to encryption, not to decryption.  That meant that
   separate code was needed for encryption and decryption.  Note that for disk
   encryption, decryption is more performance-critical than encryption.

Again, we could bring these back later.  I'd just like to error on the side of
simplicity for the initial version.  If we include optimizations that aren't
actually worthwhile, it might be hard to remove them later.

> >>> I'd appreciate people's thoughts on this series.  Jerry, I hope I'm not
> >>> stepping on your toes too much here, but I think there are some big
> >>> improvements here.
> >>> 
> >> 
> >> As I have indicated before, I fully agree with Eric here that avoiding
> >> perlasm is preferable: sharing code with OpenSSL is great if we can
> >> simply adopt the exact same code (and track OpenSSL as its upstream)
> >> but this never really worked that well for skciphers, due to API
> >> differences. (The SHA transforms can be reused a bit more easily)
> 
> In my opinion, I would prefer the perlasm with the usage of asm mnemonics.
> I could see the expanded asm source from perlsm. I don't know how to dump the
> expanded asm source when we use asm `.macro` directives. I use objdump to
> see the final asm.

Yes, you can always use objdump to see everything expanded out.  But there's not
much of a need to do that when you can just read the source code instead.

> And we could use function to modularize the asm implementations. The macro
> might do the same things, but it's more clear for me for the argument passing
> and more powerful string manipulation.

The perl scripts were not using any advanced capabilities of perl, and in some
cases they even emitted assembly macros like .rept instead of using perl.  So
the use of perl seemed unnecessary and just over-complicated things.

BTW, we could always use perlasm for specific algorithms where we actually need
advanced scripting, like to interleave two independent computations with each
other.  It doesn't have to be all one or the other.  It's just that the code
that's been written for RISC-V so far is relatively straightforward and doesn't
seem to benefit from perlasm.

> And I could have scope for the register name binding. It's more clear to me
> comparing with a long list of `#define xxx`.

I think it's good enough.  We can always re-#define things later in the file, or
just split functions with different parameters into different files.

BTW, the reason that some of the lists of defines got longer is that I actually
gave things proper names, whereas some the perl scripts used trivial names like
"V28" for the v28 register.

- Eric