mbox series

[0/7] crypto: switch to static calls for CRC-T10DIF

Message ID 20210111165237.18178-1-ardb@kernel.org (mailing list archive)
Headers show
Series crypto: switch to static calls for CRC-T10DIF | expand

Message

Ard Biesheuvel Jan. 11, 2021, 4:52 p.m. UTC
CRC-T10DIF is a very poor match for the crypto API:
- every user in the kernel calls it via a library wrapper around the
  shash API, so all callers share a single instance of the transform
- each architecture provides at most a single optimized implementation,
  based on SIMD instructions for carryless multiplication

In other words, none of the flexibility it provides is put to good use,
and so it is better to get rid of this complexity, and expose the optimized
implementations via static calls instead. This removes a substantial chunk
of code, and also gets rid of any indirect calls on architectures that
obsess about those (x86)

If this approach is deemed suitable, there are other places where we might
consider adopting it: CRC32 and CRC32(C).

Patch #1 does some preparatory refactoring and removes the library wrapper
around the shash transform.

Patch #2 introduces the actual static calls, along with the registration
routines to update the crc-t10dif implementation at runtime.

Patch #3 updates the generic CRC-T10DIF shash driver so it distinguishes
between the optimized library code and the generic library code.

Patches #4 to #7 update the various arch implementations to switch over to
the new method.

Special request to Peter to take a look at patch #2, and in particular,
whether synchronize_rcu_tasks() is sufficient to ensure that a module
providing the target of a static call can be unloaded safely.
 
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>

Ard Biesheuvel (7):
  crypto: crc-t10dif - turn library wrapper for shash into generic
    library
  crypto: lib/crc-t10dif - add static call support for optimized
    versions
  crypto: generic/crc-t10dif - expose both arch and generic shashes
  crypto: x86/crc-t10dif - convert to static call library API
  crypto: arm/crc-t10dif - convert to static call library API
  crypto: arm64/crc-t10dif - convert to static call API
  crypto: powerpc/crc-t10dif - convert to static call API

 arch/arm/crypto/Kconfig                     |   2 +-
 arch/arm/crypto/crct10dif-ce-glue.c         |  58 ++------
 arch/arm64/crypto/Kconfig                   |   3 +-
 arch/arm64/crypto/crct10dif-ce-glue.c       |  85 ++---------
 arch/powerpc/crypto/crct10dif-vpmsum_glue.c |  51 +------
 arch/x86/crypto/crct10dif-pclmul_glue.c     |  90 ++----------
 crypto/Kconfig                              |   7 +-
 crypto/Makefile                             |   2 +-
 crypto/crct10dif_common.c                   |  82 -----------
 crypto/crct10dif_generic.c                  | 100 +++++++++----
 include/linux/crc-t10dif.h                  |  21 ++-
 lib/Kconfig                                 |   2 -
 lib/crc-t10dif.c                            | 152 +++++++++-----------
 13 files changed, 204 insertions(+), 451 deletions(-)
 delete mode 100644 crypto/crct10dif_common.c

Comments

Ard Biesheuvel Jan. 11, 2021, 5:27 p.m. UTC | #1
On Mon, 11 Jan 2021 at 17:52, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> CRC-T10DIF is a very poor match for the crypto API:
> - every user in the kernel calls it via a library wrapper around the
>   shash API, so all callers share a single instance of the transform
> - each architecture provides at most a single optimized implementation,
>   based on SIMD instructions for carryless multiplication
>
> In other words, none of the flexibility it provides is put to good use,
> and so it is better to get rid of this complexity, and expose the optimized
> implementations via static calls instead. This removes a substantial chunk
> of code, and also gets rid of any indirect calls on architectures that
> obsess about those (x86)
>
> If this approach is deemed suitable, there are other places where we might
> consider adopting it: CRC32 and CRC32(C).
>
> Patch #1 does some preparatory refactoring and removes the library wrapper
> around the shash transform.
>
> Patch #2 introduces the actual static calls, along with the registration
> routines to update the crc-t10dif implementation at runtime.
>
> Patch #3 updates the generic CRC-T10DIF shash driver so it distinguishes
> between the optimized library code and the generic library code.
>
> Patches #4 to #7 update the various arch implementations to switch over to
> the new method.
>
> Special request to Peter to take a look at patch #2, and in particular,
> whether synchronize_rcu_tasks() is sufficient to ensure that a module
> providing the target of a static call can be unloaded safely.
>

It seems I may have managed to confuse myself slightly here: without
an upper bound on the size of the input of the crc_t10dif() routine, I
suppose we can never assume that all its callers have finished.

Insights welcome ...
Ard Biesheuvel Jan. 11, 2021, 6:36 p.m. UTC | #2
On Mon, 11 Jan 2021 at 18:27, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Mon, 11 Jan 2021 at 17:52, Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > CRC-T10DIF is a very poor match for the crypto API:
> > - every user in the kernel calls it via a library wrapper around the
> >   shash API, so all callers share a single instance of the transform
> > - each architecture provides at most a single optimized implementation,
> >   based on SIMD instructions for carryless multiplication
> >
> > In other words, none of the flexibility it provides is put to good use,
> > and so it is better to get rid of this complexity, and expose the optimized
> > implementations via static calls instead. This removes a substantial chunk
> > of code, and also gets rid of any indirect calls on architectures that
> > obsess about those (x86)
> >
> > If this approach is deemed suitable, there are other places where we might
> > consider adopting it: CRC32 and CRC32(C).
> >
> > Patch #1 does some preparatory refactoring and removes the library wrapper
> > around the shash transform.
> >
> > Patch #2 introduces the actual static calls, along with the registration
> > routines to update the crc-t10dif implementation at runtime.
> >
> > Patch #3 updates the generic CRC-T10DIF shash driver so it distinguishes
> > between the optimized library code and the generic library code.
> >
> > Patches #4 to #7 update the various arch implementations to switch over to
> > the new method.
> >
> > Special request to Peter to take a look at patch #2, and in particular,
> > whether synchronize_rcu_tasks() is sufficient to ensure that a module
> > providing the target of a static call can be unloaded safely.
> >
>
> It seems I may have managed to confuse myself slightly here: without
> an upper bound on the size of the input of the crc_t10dif() routine, I
> suppose we can never assume that all its callers have finished.
>

Replying to self again - apologies.

I think this is actually correct after all: synchronize_rcu_tasks()
guarantees that all tasks have passed through a 'safe state', i.e.,
voluntary schedule(), return to userland, etc, which guarantees that
no task could be executing the old static call target after
synchronize_rcu_tasks() returns.
Peter Zijlstra Jan. 11, 2021, 8:56 p.m. UTC | #3
On Mon, Jan 11, 2021 at 07:36:20PM +0100, Ard Biesheuvel wrote:
> On Mon, 11 Jan 2021 at 18:27, Ard Biesheuvel <ardb@kernel.org> wrote:
> > On Mon, 11 Jan 2021 at 17:52, Ard Biesheuvel <ardb@kernel.org> wrote:

> > > Special request to Peter to take a look at patch #2, and in particular,
> > > whether synchronize_rcu_tasks() is sufficient to ensure that a module
> > > providing the target of a static call can be unloaded safely.
> >
> > It seems I may have managed to confuse myself slightly here: without
> > an upper bound on the size of the input of the crc_t10dif() routine, I
> > suppose we can never assume that all its callers have finished.
> >
> 
> Replying to self again - apologies.
> 
> I think this is actually correct after all: synchronize_rcu_tasks()
> guarantees that all tasks have passed through a 'safe state', i.e.,
> voluntary schedule(), return to userland, etc, which guarantees that
> no task could be executing the old static call target after
> synchronize_rcu_tasks() returns.

Right, I think it should work.

My initial question was why you'd want to support the unreg at all.
AFAICT these implementations are tiny, why bother having them as a
module, or if you insist having them as a module, why allowing removal?
Ard Biesheuvel Jan. 11, 2021, 9:01 p.m. UTC | #4
On Mon, 11 Jan 2021 at 21:56, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Jan 11, 2021 at 07:36:20PM +0100, Ard Biesheuvel wrote:
> > On Mon, 11 Jan 2021 at 18:27, Ard Biesheuvel <ardb@kernel.org> wrote:
> > > On Mon, 11 Jan 2021 at 17:52, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> > > > Special request to Peter to take a look at patch #2, and in particular,
> > > > whether synchronize_rcu_tasks() is sufficient to ensure that a module
> > > > providing the target of a static call can be unloaded safely.
> > >
> > > It seems I may have managed to confuse myself slightly here: without
> > > an upper bound on the size of the input of the crc_t10dif() routine, I
> > > suppose we can never assume that all its callers have finished.
> > >
> >
> > Replying to self again - apologies.
> >
> > I think this is actually correct after all: synchronize_rcu_tasks()
> > guarantees that all tasks have passed through a 'safe state', i.e.,
> > voluntary schedule(), return to userland, etc, which guarantees that
> > no task could be executing the old static call target after
> > synchronize_rcu_tasks() returns.
>
> Right, I think it should work.
>
> My initial question was why you'd want to support the unreg at all.
> AFAICT these implementations are tiny, why bother having them as a
> module, or if you insist having them as a module, why allowing removal?

Yeah, good question.

Having the accelerated version as a module makes sense imo: it will
only be loaded if it is supported on the system, and it can be
blacklisted by the user if it does not want it for any reason. Unload
is just for completeness/symmetry, but it is unlikely to be crucial to
anyone in practice.

In any case, thanks for confirming - if this looks sane to you then we
may be able to use this pattern in other places as well.
Eric Biggers Jan. 11, 2021, 9:05 p.m. UTC | #5
On Mon, Jan 11, 2021 at 05:52:30PM +0100, Ard Biesheuvel wrote:
> CRC-T10DIF is a very poor match for the crypto API:
> - every user in the kernel calls it via a library wrapper around the
>   shash API, so all callers share a single instance of the transform
> - each architecture provides at most a single optimized implementation,
>   based on SIMD instructions for carryless multiplication
> 
> In other words, none of the flexibility it provides is put to good use,
> and so it is better to get rid of this complexity, and expose the optimized
> implementations via static calls instead. This removes a substantial chunk
> of code, and also gets rid of any indirect calls on architectures that
> obsess about those (x86)
> 
> If this approach is deemed suitable, there are other places where we might
> consider adopting it: CRC32 and CRC32(C).
> 
> Patch #1 does some preparatory refactoring and removes the library wrapper
> around the shash transform.
> 
> Patch #2 introduces the actual static calls, along with the registration
> routines to update the crc-t10dif implementation at runtime.
> 
> Patch #3 updates the generic CRC-T10DIF shash driver so it distinguishes
> between the optimized library code and the generic library code.
> 
> Patches #4 to #7 update the various arch implementations to switch over to
> the new method.
> 
> Special request to Peter to take a look at patch #2, and in particular,
> whether synchronize_rcu_tasks() is sufficient to ensure that a module
> providing the target of a static call can be unloaded safely.
>  
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> Cc: Eric Biggers <ebiggers@google.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> 
> Ard Biesheuvel (7):
>   crypto: crc-t10dif - turn library wrapper for shash into generic
>     library
>   crypto: lib/crc-t10dif - add static call support for optimized
>     versions
>   crypto: generic/crc-t10dif - expose both arch and generic shashes
>   crypto: x86/crc-t10dif - convert to static call library API
>   crypto: arm/crc-t10dif - convert to static call library API
>   crypto: arm64/crc-t10dif - convert to static call API
>   crypto: powerpc/crc-t10dif - convert to static call API
> 
>  arch/arm/crypto/Kconfig                     |   2 +-
>  arch/arm/crypto/crct10dif-ce-glue.c         |  58 ++------
>  arch/arm64/crypto/Kconfig                   |   3 +-
>  arch/arm64/crypto/crct10dif-ce-glue.c       |  85 ++---------
>  arch/powerpc/crypto/crct10dif-vpmsum_glue.c |  51 +------
>  arch/x86/crypto/crct10dif-pclmul_glue.c     |  90 ++----------
>  crypto/Kconfig                              |   7 +-
>  crypto/Makefile                             |   2 +-
>  crypto/crct10dif_common.c                   |  82 -----------
>  crypto/crct10dif_generic.c                  | 100 +++++++++----
>  include/linux/crc-t10dif.h                  |  21 ++-
>  lib/Kconfig                                 |   2 -
>  lib/crc-t10dif.c                            | 152 +++++++++-----------
>  13 files changed, 204 insertions(+), 451 deletions(-)
>  delete mode 100644 crypto/crct10dif_common.c

There is already a library API for two other hash functions, BLAKE2s and
Poly1305, which takes advantage of architecture-specific implementations without
using static calls.  Also, those algorithms are likewise also exposed through
the shash API, but in a different way from what this patchset proposes.

Is there a reason not to do things in the same way?  What are the advantages of
the new approach that you're proposing?

- Eric
Ard Biesheuvel Jan. 11, 2021, 9:31 p.m. UTC | #6
On Mon, 11 Jan 2021 at 22:05, Eric Biggers <ebiggers@kernel.org> wrote:
>
> On Mon, Jan 11, 2021 at 05:52:30PM +0100, Ard Biesheuvel wrote:
> > CRC-T10DIF is a very poor match for the crypto API:
> > - every user in the kernel calls it via a library wrapper around the
> >   shash API, so all callers share a single instance of the transform
> > - each architecture provides at most a single optimized implementation,
> >   based on SIMD instructions for carryless multiplication
> >
> > In other words, none of the flexibility it provides is put to good use,
> > and so it is better to get rid of this complexity, and expose the optimized
> > implementations via static calls instead. This removes a substantial chunk
> > of code, and also gets rid of any indirect calls on architectures that
> > obsess about those (x86)
> >
> > If this approach is deemed suitable, there are other places where we might
> > consider adopting it: CRC32 and CRC32(C).
> >
> > Patch #1 does some preparatory refactoring and removes the library wrapper
> > around the shash transform.
> >
> > Patch #2 introduces the actual static calls, along with the registration
> > routines to update the crc-t10dif implementation at runtime.
> >
> > Patch #3 updates the generic CRC-T10DIF shash driver so it distinguishes
> > between the optimized library code and the generic library code.
> >
> > Patches #4 to #7 update the various arch implementations to switch over to
> > the new method.
> >
> > Special request to Peter to take a look at patch #2, and in particular,
> > whether synchronize_rcu_tasks() is sufficient to ensure that a module
> > providing the target of a static call can be unloaded safely.
> >
> > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > Cc: Eric Biggers <ebiggers@google.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> >
> > Ard Biesheuvel (7):
> >   crypto: crc-t10dif - turn library wrapper for shash into generic
> >     library
> >   crypto: lib/crc-t10dif - add static call support for optimized
> >     versions
> >   crypto: generic/crc-t10dif - expose both arch and generic shashes
> >   crypto: x86/crc-t10dif - convert to static call library API
> >   crypto: arm/crc-t10dif - convert to static call library API
> >   crypto: arm64/crc-t10dif - convert to static call API
> >   crypto: powerpc/crc-t10dif - convert to static call API
> >
> >  arch/arm/crypto/Kconfig                     |   2 +-
> >  arch/arm/crypto/crct10dif-ce-glue.c         |  58 ++------
> >  arch/arm64/crypto/Kconfig                   |   3 +-
> >  arch/arm64/crypto/crct10dif-ce-glue.c       |  85 ++---------
> >  arch/powerpc/crypto/crct10dif-vpmsum_glue.c |  51 +------
> >  arch/x86/crypto/crct10dif-pclmul_glue.c     |  90 ++----------
> >  crypto/Kconfig                              |   7 +-
> >  crypto/Makefile                             |   2 +-
> >  crypto/crct10dif_common.c                   |  82 -----------
> >  crypto/crct10dif_generic.c                  | 100 +++++++++----
> >  include/linux/crc-t10dif.h                  |  21 ++-
> >  lib/Kconfig                                 |   2 -
> >  lib/crc-t10dif.c                            | 152 +++++++++-----------
> >  13 files changed, 204 insertions(+), 451 deletions(-)
> >  delete mode 100644 crypto/crct10dif_common.c
>
> There is already a library API for two other hash functions, BLAKE2s and
> Poly1305, which takes advantage of architecture-specific implementations without
> using static calls.  Also, those algorithms are likewise also exposed through
> the shash API, but in a different way from what this patchset proposes.
>
> Is there a reason not to do things in the same way?  What are the advantages of
> the new approach that you're proposing?
>

The current approach uses build time dependencies - i.e., if you
decide to build the accelerated implementation, you always have to
load it, even if the accelerated implementation cannot be used on the
system in question, and it is up to that code to use fallbacks for
everything, This was kind of a compromise on my part when we were
having the crypto library vs Wireguard discussion - I mentioned at the
time that [in my opinion] it was a temporary approach because static
call support was taking so long to arrive. (lwn article here [0] but
the links in it seem to be dead). It also results in some nasty
Kconfig dependencies because building the generic code into the kernel
and building the accelerated code as a module gives problems.

I agree that having different approaches for doing the same thing is
suboptimal, but I think the situation may be slightly different for
plain SIMD code such as blake2 and poly1305 versus crc implementations
based on carryless multiplication instructions which may be rare but
10-20x faster if they are supported.

So in summary, I think this approach is better. The generic code can
be built in and superseded by module code, and there are no
dependencies going both ways.

There are some complications as well, though:
- module softdeps don't work {afaik) if the depender is builtin
- our poly1305 implementations use a different layout for the state
that is passed between the init/update/final calls, so it may be
tricky to use a similar approach there.




[0] https://lwn.net/Articles/802376/
Ard Biesheuvel Jan. 28, 2021, 8:19 a.m. UTC | #7
On Mon, 11 Jan 2021 at 22:31, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Mon, 11 Jan 2021 at 22:05, Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > On Mon, Jan 11, 2021 at 05:52:30PM +0100, Ard Biesheuvel wrote:
> > > CRC-T10DIF is a very poor match for the crypto API:
> > > - every user in the kernel calls it via a library wrapper around the
> > >   shash API, so all callers share a single instance of the transform
> > > - each architecture provides at most a single optimized implementation,
> > >   based on SIMD instructions for carryless multiplication
> > >
> > > In other words, none of the flexibility it provides is put to good use,
> > > and so it is better to get rid of this complexity, and expose the optimized
> > > implementations via static calls instead. This removes a substantial chunk
> > > of code, and also gets rid of any indirect calls on architectures that
> > > obsess about those (x86)
> > >
> > > If this approach is deemed suitable, there are other places where we might
> > > consider adopting it: CRC32 and CRC32(C).
> > >
> > > Patch #1 does some preparatory refactoring and removes the library wrapper
> > > around the shash transform.
> > >
> > > Patch #2 introduces the actual static calls, along with the registration
> > > routines to update the crc-t10dif implementation at runtime.
> > >
> > > Patch #3 updates the generic CRC-T10DIF shash driver so it distinguishes
> > > between the optimized library code and the generic library code.
> > >
> > > Patches #4 to #7 update the various arch implementations to switch over to
> > > the new method.
> > >
> > > Special request to Peter to take a look at patch #2, and in particular,
> > > whether synchronize_rcu_tasks() is sufficient to ensure that a module
> > > providing the target of a static call can be unloaded safely.
> > >
> > > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > > Cc: Eric Biggers <ebiggers@google.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > >
> > > Ard Biesheuvel (7):
> > >   crypto: crc-t10dif - turn library wrapper for shash into generic
> > >     library
> > >   crypto: lib/crc-t10dif - add static call support for optimized
> > >     versions
> > >   crypto: generic/crc-t10dif - expose both arch and generic shashes
> > >   crypto: x86/crc-t10dif - convert to static call library API
> > >   crypto: arm/crc-t10dif - convert to static call library API
> > >   crypto: arm64/crc-t10dif - convert to static call API
> > >   crypto: powerpc/crc-t10dif - convert to static call API
> > >
> > >  arch/arm/crypto/Kconfig                     |   2 +-
> > >  arch/arm/crypto/crct10dif-ce-glue.c         |  58 ++------
> > >  arch/arm64/crypto/Kconfig                   |   3 +-
> > >  arch/arm64/crypto/crct10dif-ce-glue.c       |  85 ++---------
> > >  arch/powerpc/crypto/crct10dif-vpmsum_glue.c |  51 +------
> > >  arch/x86/crypto/crct10dif-pclmul_glue.c     |  90 ++----------
> > >  crypto/Kconfig                              |   7 +-
> > >  crypto/Makefile                             |   2 +-
> > >  crypto/crct10dif_common.c                   |  82 -----------
> > >  crypto/crct10dif_generic.c                  | 100 +++++++++----
> > >  include/linux/crc-t10dif.h                  |  21 ++-
> > >  lib/Kconfig                                 |   2 -
> > >  lib/crc-t10dif.c                            | 152 +++++++++-----------
> > >  13 files changed, 204 insertions(+), 451 deletions(-)
> > >  delete mode 100644 crypto/crct10dif_common.c
> >
> > There is already a library API for two other hash functions, BLAKE2s and
> > Poly1305, which takes advantage of architecture-specific implementations without
> > using static calls.  Also, those algorithms are likewise also exposed through
> > the shash API, but in a different way from what this patchset proposes.
> >
> > Is there a reason not to do things in the same way?  What are the advantages of
> > the new approach that you're proposing?
> >
>
> The current approach uses build time dependencies - i.e., if you
> decide to build the accelerated implementation, you always have to
> load it, even if the accelerated implementation cannot be used on the
> system in question, and it is up to that code to use fallbacks for
> everything, This was kind of a compromise on my part when we were
> having the crypto library vs Wireguard discussion - I mentioned at the
> time that [in my opinion] it was a temporary approach because static
> call support was taking so long to arrive. (lwn article here [0] but
> the links in it seem to be dead). It also results in some nasty
> Kconfig dependencies because building the generic code into the kernel
> and building the accelerated code as a module gives problems.
>
> I agree that having different approaches for doing the same thing is
> suboptimal, but I think the situation may be slightly different for
> plain SIMD code such as blake2 and poly1305 versus crc implementations
> based on carryless multiplication instructions which may be rare but
> 10-20x faster if they are supported.
>
> So in summary, I think this approach is better. The generic code can
> be built in and superseded by module code, and there are no
> dependencies going both ways.
>
> There are some complications as well, though:
> - module softdeps don't work {afaik) if the depender is builtin
> - our poly1305 implementations use a different layout for the state
> that is passed between the init/update/final calls, so it may be
> tricky to use a similar approach there.
>
>

This does not actually work. The kernel_fpu_end() occurring in the
CRC-T10 code amounts to a voluntary schedule, which means
synchronize_rcu_tasks() does not guarantee that the accelerated code
is no longer live after the static call is updated to refer to the
generic code again, and so the remove is risky.

There are some other issues I want to focus on first, so I am going to
disregard this series for now.