Message ID | 1428396724-19962-2-git-send-email-ard.biesheuvel@linaro.org (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Herbert Xu |
Headers | show |
On Tue, Apr 07, 2015 at 10:51:49AM +0200, Ard Biesheuvel wrote: > > +typedef void (sha1_block_fn)(int blocks, u8 const *src, u32 *state, > + const u8 *head, void *p); Does this really need five arguments? First of all we can get rid of head by just calling this function twice. The last argument appears to only be used by arm64 where it is simply another way of saying (sctx->count + len) % SHA_BLOCK_SIZE != 0. So why not get rid of it and just use the conditional? Cheers,
On 8 April 2015 at 15:19, Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Tue, Apr 07, 2015 at 10:51:49AM +0200, Ard Biesheuvel wrote: >> >> +typedef void (sha1_block_fn)(int blocks, u8 const *src, u32 *state, >> + const u8 *head, void *p); > > Does this really need five arguments? First of all we can get rid > of head by just calling this function twice. Not having to call the function twice is the whole point. In the arm64 case, all the SHA-256 round keys can be kept in registers (it has 32 16-byte SIMD registers), and that is what motivates this pattern. By passing a head block, a pointer to the source and the generic pointer (which arm64 uses to finalize the block, we can process all data in a single invocation of the block transform) > The last argument > appears to only be used by arm64 where it is simply another way > of saying (sctx->count + len) % SHA_BLOCK_SIZE != 0. So why not > get rid of it and just use the conditional? > Do note that these are only used by static inline functions, so the unused arguments are all eliminated from the binary anyway. In fact, looking at the generated code, the function calls don't use function pointers at all anymore, but just call the block transform directly, so the typedef is only used as a prototype, really.
On Wed, Apr 08, 2015 at 03:25:14PM +0200, Ard Biesheuvel wrote: > > Not having to call the function twice is the whole point. In the arm64 > case, all the SHA-256 round keys can be kept in registers (it has 32 > 16-byte SIMD registers), and that is what motivates this pattern. By > passing a head block, a pointer to the source and the generic pointer > (which arm64 uses to finalize the block, we can process all data in a > single invocation of the block transform) Does this really make any difference? With IPsec the partial code path is never even going to get executed. > Do note that these are only used by static inline functions, so the > unused arguments are all eliminated from the binary anyway. In fact, > looking at the generated code, the function calls don't use function > pointers at all anymore, > but just call the block transform directly, so the typedef is only > used as a prototype, really. It's not just the generated code. The next guy that comes along and writes a SHA implementation is going to go WTH is this p argument. I'm not going to add crap to the generic layer just because ARM needs it. In fact ARM doesn't even need it. Cheers,
On 8 April 2015 at 15:30, Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Wed, Apr 08, 2015 at 03:25:14PM +0200, Ard Biesheuvel wrote: >> >> Not having to call the function twice is the whole point. In the arm64 >> case, all the SHA-256 round keys can be kept in registers (it has 32 >> 16-byte SIMD registers), and that is what motivates this pattern. By >> passing a head block, a pointer to the source and the generic pointer >> (which arm64 uses to finalize the block, we can process all data in a >> single invocation of the block transform) > > Does this really make any difference? With IPsec the partial code > path is never even going to get executed. > This is not the partial code path, it is the .finup path, in fact. Anything that hashes data that is often a multiple of the block size (which is more likely for block based applications than for IPsec, I think) should benefit from this. But even if it is not, using a head block and a pointer to the src eliminates one call of the block transform. Note that, in the arm64 case, calling a SHA-256 block transform in non-process context involves: - stacking the contents of 28 SIMD registers (28 x 16 = 448 bytes) - loading the SHA-256 constants (16 x 16 = 256 bytes) - processing the data - unstacking the contents of 28 SIMD registers (448 bytes) so anything that can prevent needlessly calling these functions multiple times in quick successsion is going to help, and 'just calling it twice' just doesn't cut it. >> Do note that these are only used by static inline functions, so the >> unused arguments are all eliminated from the binary anyway. In fact, >> looking at the generated code, the function calls don't use function >> pointers at all anymore, >> but just call the block transform directly, so the typedef is only >> used as a prototype, really. > > It's not just the generated code. The next guy that comes along > and writes a SHA implementation is going to go WTH is this p > argument. I'm not going to add crap to the generic layer just > because ARM needs it. In fact ARM doesn't even need it. > OK, so there are 2 pieces of crap [sic] in this proposed generic layer: - the head block - the generic pointer The generic pointer is used in the arm64 case to convey the information that the current invocation of the block transform is the final one, and the core code can apply the padding and finalize /and/ pass back whether it has done so or not. (the latter can easily be done in the C code as well) I used a generic pointer to allow other uses, but if you have a better idea for this particular use case, I'd be happy to hear it. -- To unsubscribe from this list: send the line "unsubscribe linux-crypto" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 8 April 2015 at 15:40, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote: > On 8 April 2015 at 15:30, Herbert Xu <herbert@gondor.apana.org.au> wrote: >> On Wed, Apr 08, 2015 at 03:25:14PM +0200, Ard Biesheuvel wrote: >>> >>> Not having to call the function twice is the whole point. In the arm64 >>> case, all the SHA-256 round keys can be kept in registers (it has 32 >>> 16-byte SIMD registers), and that is what motivates this pattern. By >>> passing a head block, a pointer to the source and the generic pointer >>> (which arm64 uses to finalize the block, we can process all data in a >>> single invocation of the block transform) >> >> Does this really make any difference? With IPsec the partial code >> path is never even going to get executed. >> > > This is not the partial code path, it is the .finup path, in fact. > Anything that hashes data that is often a multiple of the block size > (which is more likely for block based applications than for IPsec, I > think) should benefit from this. But even if it is not, using a head > block and a pointer to the src eliminates one call of the block > transform. > > Note that, in the arm64 case, calling a SHA-256 block transform in > non-process context involves: > - stacking the contents of 28 SIMD registers (28 x 16 = 448 bytes) > - loading the SHA-256 constants (16 x 16 = 256 bytes) > - processing the data > - unstacking the contents of 28 SIMD registers (448 bytes) > > so anything that can prevent needlessly calling these functions > multiple times in quick successsion is going to help, and 'just > calling it twice' just doesn't cut it. > OK, stacking/unstacking can be amortized over multiple invocations of the block transform, only loading the round constants cannot. >>> Do note that these are only used by static inline functions, so the >>> unused arguments are all eliminated from the binary anyway. In fact, >>> looking at the generated code, the function calls don't use function >>> pointers at all anymore, >>> but just call the block transform directly, so the typedef is only >>> used as a prototype, really. >> >> It's not just the generated code. The next guy that comes along >> and writes a SHA implementation is going to go WTH is this p >> argument. I'm not going to add crap to the generic layer just >> because ARM needs it. In fact ARM doesn't even need it. >> > > OK, so there are 2 pieces of crap [sic] in this proposed generic layer: > - the head block > - the generic pointer > > The generic pointer is used in the arm64 case to convey the > information that the current invocation of the block transform is the > final one, and the core code can apply the padding and finalize /and/ > pass back whether it has done so or not. (the latter can easily be > done in the C code as well) I used a generic pointer to allow other > uses, but if you have a better idea for this particular use case, I'd > be happy to hear it. -- To unsubscribe from this list: send the line "unsubscribe linux-crypto" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 08, 2015 at 03:40:56PM +0200, Ard Biesheuvel wrote: > > This is not the partial code path, it is the .finup path, in fact. > Anything that hashes data that is often a multiple of the block size > (which is more likely for block based applications than for IPsec, I > think) should benefit from this. But even if it is not, using a head > block and a pointer to the src eliminates one call of the block > transform. This would only ever happen if you call update on a partial block which should be the slow path anyway. IPsec only calls digest so it can never do a partial block update unless you have a badly fragmented skb. Incidentally if you did want to deal with such cases in light of the high overhead that you have you may want to switch over to ahash which would allow you to do the saving and restoring once per operation rather than once per SG entry. > OK, so there are 2 pieces of crap [sic] in this proposed generic layer: > - the head block > - the generic pointer Yes your patches look great otherwise. I just want to keep things simple because you're creating a template for all future SHA authors. > The generic pointer is used in the arm64 case to convey the > information that the current invocation of the block transform is the > final one, and the core code can apply the padding and finalize /and/ > pass back whether it has done so or not. (the latter can easily be > done in the C code as well) I used a generic pointer to allow other > uses, but if you have a better idea for this particular use case, I'd > be happy to hear it. As I said in the original email, this appears to be equivalent to (sctx->count + len) % SHA1_BLOCK_SIZE != 0 So why not just do that check in C? It'll get compiled into a simple mask so it should be no slower than a branch on finalize. In fact if your assembly function always updates sctx->count then you could simplify that to sctx->count % SHA1_BLOCK_SIZE != 0 Cheers,
On 8 April 2015 at 16:06, Herbert Xu <herbert@gondor.apana.org.au> wrote: > On Wed, Apr 08, 2015 at 03:40:56PM +0200, Ard Biesheuvel wrote: >> >> This is not the partial code path, it is the .finup path, in fact. >> Anything that hashes data that is often a multiple of the block size >> (which is more likely for block based applications than for IPsec, I >> think) should benefit from this. But even if it is not, using a head >> block and a pointer to the src eliminates one call of the block >> transform. > > This would only ever happen if you call update on a partial block > which should be the slow path anyway. > > IPsec only calls digest so it can never do a partial block update > unless you have a badly fragmented skb. > OK, so using digest() (which uses finup()), there will never be any data stashed in the struct::buf[], so the possible invocations of the block transform are - first call with all data - second call with the unaligned chunk at the end + padding - third call with the end of the padding + count where the second call is only made in a minority of cases where there is no room for the count and single padding byte after the data. And the head block only helps with partial data, not with reducing the potential number of invocations done for padding and count etc. > Incidentally if you did want to deal with such cases in light > of the high overhead that you have you may want to switch over > to ahash which would allow you to do the saving and restoring > once per operation rather than once per SG entry. > OK I will look into that. >> OK, so there are 2 pieces of crap [sic] in this proposed generic layer: >> - the head block >> - the generic pointer > > Yes your patches look great otherwise. I just want to keep things > simple because you're creating a template for all future SHA authors. > >> The generic pointer is used in the arm64 case to convey the >> information that the current invocation of the block transform is the >> final one, and the core code can apply the padding and finalize /and/ >> pass back whether it has done so or not. (the latter can easily be >> done in the C code as well) I used a generic pointer to allow other >> uses, but if you have a better idea for this particular use case, I'd >> be happy to hear it. > > As I said in the original email, this appears to be equivalent to > > (sctx->count + len) % SHA1_BLOCK_SIZE != 0 > > So why not just do that check in C? It'll get compiled into a > simple mask so it should be no slower than a branch on finalize. > > In fact if your assembly function always updates sctx->count > then you could simplify that to > > sctx->count % SHA1_BLOCK_SIZE != 0 > The problem is that the block transform needs to be informed of the fact that the current invocation is the final one, and it can perform the padding, add the count etc. The C code can figure out by looking at the count whether the asm code has applied the padding or not, but it cannot really tell the asm code whether it should apply the padding without an additional argument. So what about > +typedef void (sha1_block_fn)(int blocks, u8 const *src, u32 *state, > + bool final); ? -- To unsubscribe from this list: send the line "unsubscribe linux-crypto" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 08, 2015 at 04:18:56PM +0200, Ard Biesheuvel wrote: > > So what about > > > +typedef void (sha1_block_fn)(int blocks, u8 const *src, u32 *state, > > + bool final); > > ? Looks good to me. Thanks,
diff --git a/include/crypto/sha1_base.h b/include/crypto/sha1_base.h new file mode 100644 index 000000000000..919db0920203 --- /dev/null +++ b/include/crypto/sha1_base.h @@ -0,0 +1,123 @@ +/* + * sha1_base.h - core logic for SHA-1 implementations + * + * Copyright (C) 2015 Linaro Ltd <ard.biesheuvel@linaro.org> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + */ + +#include <crypto/internal/hash.h> +#include <crypto/sha.h> +#include <linux/crypto.h> +#include <linux/module.h> + +#include <asm/unaligned.h> + +typedef void (sha1_block_fn)(int blocks, u8 const *src, u32 *state, + const u8 *head, void *p); + +static inline int sha1_base_init(struct shash_desc *desc) +{ + struct sha1_state *sctx = shash_desc_ctx(desc); + + sctx->state[0] = SHA1_H0; + sctx->state[1] = SHA1_H1; + sctx->state[2] = SHA1_H2; + sctx->state[3] = SHA1_H3; + sctx->state[4] = SHA1_H4; + sctx->count = 0; + + return 0; +} + +static inline int sha1_base_export(struct shash_desc *desc, void *out) +{ + struct sha1_state *sctx = shash_desc_ctx(desc); + struct sha1_state *dst = out; + + *dst = *sctx; + + return 0; +} + +static inline int sha1_base_import(struct shash_desc *desc, const void *in) +{ + struct sha1_state *sctx = shash_desc_ctx(desc); + struct sha1_state const *src = in; + + *sctx = *src; + + return 0; +} + +static inline int sha1_base_do_update(struct shash_desc *desc, const u8 *data, + unsigned int len, sha1_block_fn *block_fn, + void *p) +{ + struct sha1_state *sctx = shash_desc_ctx(desc); + unsigned int partial = sctx->count % SHA1_BLOCK_SIZE; + + sctx->count += len; + + if (unlikely((partial + len) >= SHA1_BLOCK_SIZE)) { + int blocks; + + if (partial) { + int p = SHA1_BLOCK_SIZE - partial; + + memcpy(sctx->buffer + partial, data, p); + data += p; + len -= p; + } + + blocks = len / SHA1_BLOCK_SIZE; + len %= SHA1_BLOCK_SIZE; + + block_fn(blocks, data, sctx->state, + partial ? sctx->buffer : NULL, p); + data += blocks * SHA1_BLOCK_SIZE; + partial = 0; + } + if (len) + memcpy(sctx->buffer + partial, data, len); + + return 0; +} + +static inline int sha1_base_do_finalize(struct shash_desc *desc, + sha1_block_fn *block_fn, void *p) +{ + const int bit_offset = SHA1_BLOCK_SIZE - sizeof(__be64); + struct sha1_state *sctx = shash_desc_ctx(desc); + __be64 *bits = (__be64 *)(sctx->buffer + bit_offset); + unsigned int partial = sctx->count % SHA1_BLOCK_SIZE; + + sctx->buffer[partial++] = 0x80; + if (partial > bit_offset) { + memset(sctx->buffer + partial, 0x0, SHA1_BLOCK_SIZE - partial); + partial = 0; + + block_fn(1, sctx->buffer, sctx->state, NULL, p); + } + + memset(sctx->buffer + partial, 0x0, bit_offset - partial); + *bits = cpu_to_be64(sctx->count << 3); + block_fn(1, sctx->buffer, sctx->state, NULL, p); + + return 0; +} + +static inline int sha1_base_finish(struct shash_desc *desc, u8 *out) +{ + struct sha1_state *sctx = shash_desc_ctx(desc); + __be32 *digest = (__be32 *)out; + int i; + + for (i = 0; i < SHA1_DIGEST_SIZE / sizeof(__be32); i++) + put_unaligned_be32(sctx->state[i], digest++); + + *sctx = (struct sha1_state){}; + return 0; +}
To reduce the number of copies of boilerplate code throughout the tree, this patch implements generic glue for the SHA-1 algorithm. This allows a specific arch or hardware implementation to only implement the special handling that it needs. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> --- include/crypto/sha1_base.h | 123 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 123 insertions(+) create mode 100644 include/crypto/sha1_base.h