diff mbox

[v3,01/16] crypto: sha1: implement base layer for SHA-1

Message ID 1428396724-19962-2-git-send-email-ard.biesheuvel@linaro.org (mailing list archive)
State Changes Requested
Delegated to: Herbert Xu
Headers show

Commit Message

Ard Biesheuvel April 7, 2015, 8:51 a.m. UTC
To reduce the number of copies of boilerplate code throughout
the tree, this patch implements generic glue for the SHA-1
algorithm. This allows a specific arch or hardware implementation
to only implement the special handling that it needs.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
 include/crypto/sha1_base.h | 123 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 include/crypto/sha1_base.h

Comments

Herbert Xu April 8, 2015, 1:19 p.m. UTC | #1
On Tue, Apr 07, 2015 at 10:51:49AM +0200, Ard Biesheuvel wrote:
>
> +typedef void (sha1_block_fn)(int blocks, u8 const *src, u32 *state,
> +			     const u8 *head, void *p);

Does this really need five arguments? First of all we can get rid
of head by just calling this function twice.  The last argument
appears to only be used by arm64 where it is simply another way
of saying (sctx->count + len) % SHA_BLOCK_SIZE != 0.  So why not
get rid of it and just use the conditional?

Cheers,
Ard Biesheuvel April 8, 2015, 1:25 p.m. UTC | #2
On 8 April 2015 at 15:19, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Tue, Apr 07, 2015 at 10:51:49AM +0200, Ard Biesheuvel wrote:
>>
>> +typedef void (sha1_block_fn)(int blocks, u8 const *src, u32 *state,
>> +                          const u8 *head, void *p);
>
> Does this really need five arguments? First of all we can get rid
> of head by just calling this function twice.

Not having to call the function twice is the whole point. In the arm64
case, all the SHA-256 round keys can be kept in registers (it has 32
16-byte SIMD registers), and that is what motivates this pattern. By
passing a head block, a pointer to the source and the generic pointer
(which arm64 uses to finalize the block, we can process all data in a
single invocation of the block transform)

>  The last argument
> appears to only be used by arm64 where it is simply another way
> of saying (sctx->count + len) % SHA_BLOCK_SIZE != 0.  So why not
> get rid of it and just use the conditional?
>

Do note that these are only used by static inline functions, so the
unused arguments are all eliminated from the binary anyway. In fact,
looking at the generated code, the function calls don't use function
pointers at all anymore,
but just call the block transform directly, so the typedef is only
used as a prototype, really.
Herbert Xu April 8, 2015, 1:30 p.m. UTC | #3
On Wed, Apr 08, 2015 at 03:25:14PM +0200, Ard Biesheuvel wrote:
>
> Not having to call the function twice is the whole point. In the arm64
> case, all the SHA-256 round keys can be kept in registers (it has 32
> 16-byte SIMD registers), and that is what motivates this pattern. By
> passing a head block, a pointer to the source and the generic pointer
> (which arm64 uses to finalize the block, we can process all data in a
> single invocation of the block transform)

Does this really make any difference? With IPsec the partial code
path is never even going to get executed.

> Do note that these are only used by static inline functions, so the
> unused arguments are all eliminated from the binary anyway. In fact,
> looking at the generated code, the function calls don't use function
> pointers at all anymore,
> but just call the block transform directly, so the typedef is only
> used as a prototype, really.

It's not just the generated code.  The next guy that comes along
and writes a SHA implementation is going to go WTH is this p
argument.  I'm not going to add crap to the generic layer just
because ARM needs it.  In fact ARM doesn't even need it.

Cheers,
Ard Biesheuvel April 8, 2015, 1:40 p.m. UTC | #4
On 8 April 2015 at 15:30, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Wed, Apr 08, 2015 at 03:25:14PM +0200, Ard Biesheuvel wrote:
>>
>> Not having to call the function twice is the whole point. In the arm64
>> case, all the SHA-256 round keys can be kept in registers (it has 32
>> 16-byte SIMD registers), and that is what motivates this pattern. By
>> passing a head block, a pointer to the source and the generic pointer
>> (which arm64 uses to finalize the block, we can process all data in a
>> single invocation of the block transform)
>
> Does this really make any difference? With IPsec the partial code
> path is never even going to get executed.
>

This is not the partial code path, it is the .finup path, in fact.
Anything that hashes data that is often a multiple of the block size
(which is more likely for block based applications than for IPsec, I
think) should benefit from this. But even if it is not, using a head
block and a pointer to the src eliminates one call of the block
transform.

Note that, in the arm64 case, calling a SHA-256 block transform in
non-process context involves:
- stacking the contents of 28 SIMD registers (28 x 16 = 448 bytes)
- loading the SHA-256 constants (16 x 16 = 256 bytes)
- processing the data
- unstacking the contents of 28 SIMD registers (448 bytes)

so anything that can prevent needlessly calling these functions
multiple times in quick successsion is going to help, and 'just
calling it twice' just doesn't cut it.

>> Do note that these are only used by static inline functions, so the
>> unused arguments are all eliminated from the binary anyway. In fact,
>> looking at the generated code, the function calls don't use function
>> pointers at all anymore,
>> but just call the block transform directly, so the typedef is only
>> used as a prototype, really.
>
> It's not just the generated code.  The next guy that comes along
> and writes a SHA implementation is going to go WTH is this p
> argument.  I'm not going to add crap to the generic layer just
> because ARM needs it.  In fact ARM doesn't even need it.
>

OK, so there are 2 pieces of crap [sic] in this proposed generic layer:
- the head block
- the generic pointer

The generic pointer is used in the arm64 case to convey the
information that the current invocation of the block transform is the
final one, and the core code can apply the padding and finalize /and/
pass back whether it has done so or not. (the latter can easily be
done in the C code as well)  I used a generic pointer to allow other
uses, but if you have a better idea for this particular use case, I'd
be happy to hear it.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ard Biesheuvel April 8, 2015, 1:52 p.m. UTC | #5
On 8 April 2015 at 15:40, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 8 April 2015 at 15:30, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Wed, Apr 08, 2015 at 03:25:14PM +0200, Ard Biesheuvel wrote:
>>>
>>> Not having to call the function twice is the whole point. In the arm64
>>> case, all the SHA-256 round keys can be kept in registers (it has 32
>>> 16-byte SIMD registers), and that is what motivates this pattern. By
>>> passing a head block, a pointer to the source and the generic pointer
>>> (which arm64 uses to finalize the block, we can process all data in a
>>> single invocation of the block transform)
>>
>> Does this really make any difference? With IPsec the partial code
>> path is never even going to get executed.
>>
>
> This is not the partial code path, it is the .finup path, in fact.
> Anything that hashes data that is often a multiple of the block size
> (which is more likely for block based applications than for IPsec, I
> think) should benefit from this. But even if it is not, using a head
> block and a pointer to the src eliminates one call of the block
> transform.
>
> Note that, in the arm64 case, calling a SHA-256 block transform in
> non-process context involves:
> - stacking the contents of 28 SIMD registers (28 x 16 = 448 bytes)
> - loading the SHA-256 constants (16 x 16 = 256 bytes)
> - processing the data
> - unstacking the contents of 28 SIMD registers (448 bytes)
>
> so anything that can prevent needlessly calling these functions
> multiple times in quick successsion is going to help, and 'just
> calling it twice' just doesn't cut it.
>

OK, stacking/unstacking can be amortized over multiple invocations of
the block transform, only loading the round constants cannot.


>>> Do note that these are only used by static inline functions, so the
>>> unused arguments are all eliminated from the binary anyway. In fact,
>>> looking at the generated code, the function calls don't use function
>>> pointers at all anymore,
>>> but just call the block transform directly, so the typedef is only
>>> used as a prototype, really.
>>
>> It's not just the generated code.  The next guy that comes along
>> and writes a SHA implementation is going to go WTH is this p
>> argument.  I'm not going to add crap to the generic layer just
>> because ARM needs it.  In fact ARM doesn't even need it.
>>
>
> OK, so there are 2 pieces of crap [sic] in this proposed generic layer:
> - the head block
> - the generic pointer
>
> The generic pointer is used in the arm64 case to convey the
> information that the current invocation of the block transform is the
> final one, and the core code can apply the padding and finalize /and/
> pass back whether it has done so or not. (the latter can easily be
> done in the C code as well)  I used a generic pointer to allow other
> uses, but if you have a better idea for this particular use case, I'd
> be happy to hear it.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu April 8, 2015, 2:06 p.m. UTC | #6
On Wed, Apr 08, 2015 at 03:40:56PM +0200, Ard Biesheuvel wrote:
>
> This is not the partial code path, it is the .finup path, in fact.
> Anything that hashes data that is often a multiple of the block size
> (which is more likely for block based applications than for IPsec, I
> think) should benefit from this. But even if it is not, using a head
> block and a pointer to the src eliminates one call of the block
> transform.

This would only ever happen if you call update on a partial block
which should be the slow path anyway.

IPsec only calls digest so it can never do a partial block update
unless you have a badly fragmented skb.

Incidentally if you did want to deal with such cases in light
of the high overhead that you have you may want to switch over
to ahash which would allow you to do the saving and restoring
once per operation rather than once per SG entry.

> OK, so there are 2 pieces of crap [sic] in this proposed generic layer:
> - the head block
> - the generic pointer

Yes your patches look great otherwise.  I just want to keep things
simple because you're creating a template for all future SHA authors.

> The generic pointer is used in the arm64 case to convey the
> information that the current invocation of the block transform is the
> final one, and the core code can apply the padding and finalize /and/
> pass back whether it has done so or not. (the latter can easily be
> done in the C code as well)  I used a generic pointer to allow other
> uses, but if you have a better idea for this particular use case, I'd
> be happy to hear it.

As I said in the original email, this appears to be equivalent to

	(sctx->count + len) % SHA1_BLOCK_SIZE != 0

So why not just do that check in C? It'll get compiled into a
simple mask so it should be no slower than a branch on finalize.

In fact if your assembly function always updates sctx->count
then you could simplify that to

	sctx->count % SHA1_BLOCK_SIZE != 0

Cheers,
Ard Biesheuvel April 8, 2015, 2:18 p.m. UTC | #7
On 8 April 2015 at 16:06, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Wed, Apr 08, 2015 at 03:40:56PM +0200, Ard Biesheuvel wrote:
>>
>> This is not the partial code path, it is the .finup path, in fact.
>> Anything that hashes data that is often a multiple of the block size
>> (which is more likely for block based applications than for IPsec, I
>> think) should benefit from this. But even if it is not, using a head
>> block and a pointer to the src eliminates one call of the block
>> transform.
>
> This would only ever happen if you call update on a partial block
> which should be the slow path anyway.
>
> IPsec only calls digest so it can never do a partial block update
> unless you have a badly fragmented skb.
>

OK, so using digest() (which uses finup()), there will never be any
data stashed in the struct::buf[], so the possible invocations of the
block transform are
- first call with all data
- second call with the unaligned chunk at the end + padding
- third call with the end of the padding + count

where the second call is only made in a minority of cases where there
is no room for the count and single padding byte after the data.
And the head block only helps with partial data, not with reducing the
potential number of invocations done for padding and count etc.

> Incidentally if you did want to deal with such cases in light
> of the high overhead that you have you may want to switch over
> to ahash which would allow you to do the saving and restoring
> once per operation rather than once per SG entry.
>

OK I will look into that.

>> OK, so there are 2 pieces of crap [sic] in this proposed generic layer:
>> - the head block
>> - the generic pointer
>
> Yes your patches look great otherwise.  I just want to keep things
> simple because you're creating a template for all future SHA authors.
>
>> The generic pointer is used in the arm64 case to convey the
>> information that the current invocation of the block transform is the
>> final one, and the core code can apply the padding and finalize /and/
>> pass back whether it has done so or not. (the latter can easily be
>> done in the C code as well)  I used a generic pointer to allow other
>> uses, but if you have a better idea for this particular use case, I'd
>> be happy to hear it.
>
> As I said in the original email, this appears to be equivalent to
>
>         (sctx->count + len) % SHA1_BLOCK_SIZE != 0
>
> So why not just do that check in C? It'll get compiled into a
> simple mask so it should be no slower than a branch on finalize.
>
> In fact if your assembly function always updates sctx->count
> then you could simplify that to
>
>         sctx->count % SHA1_BLOCK_SIZE != 0
>

The problem is that the block transform needs to be informed of the
fact that the current invocation is the final one, and it can perform
the padding, add the count etc. The C code can figure out by looking
at the count whether the asm code has applied the padding or not, but
it cannot really tell the asm code whether it should apply the padding
without an additional argument.

So what about

> +typedef void (sha1_block_fn)(int blocks, u8 const *src, u32 *state,
> +                          bool final);

?
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu April 8, 2015, 2:22 p.m. UTC | #8
On Wed, Apr 08, 2015 at 04:18:56PM +0200, Ard Biesheuvel wrote:
>
> So what about
> 
> > +typedef void (sha1_block_fn)(int blocks, u8 const *src, u32 *state,
> > +                          bool final);
> 
> ?

Looks good to me.

Thanks,
diff mbox

Patch

diff --git a/include/crypto/sha1_base.h b/include/crypto/sha1_base.h
new file mode 100644
index 000000000000..919db0920203
--- /dev/null
+++ b/include/crypto/sha1_base.h
@@ -0,0 +1,123 @@ 
+/*
+ * sha1_base.h - core logic for SHA-1 implementations
+ *
+ * Copyright (C) 2015 Linaro Ltd <ard.biesheuvel@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <crypto/internal/hash.h>
+#include <crypto/sha.h>
+#include <linux/crypto.h>
+#include <linux/module.h>
+
+#include <asm/unaligned.h>
+
+typedef void (sha1_block_fn)(int blocks, u8 const *src, u32 *state,
+			     const u8 *head, void *p);
+
+static inline int sha1_base_init(struct shash_desc *desc)
+{
+	struct sha1_state *sctx = shash_desc_ctx(desc);
+
+	sctx->state[0] = SHA1_H0;
+	sctx->state[1] = SHA1_H1;
+	sctx->state[2] = SHA1_H2;
+	sctx->state[3] = SHA1_H3;
+	sctx->state[4] = SHA1_H4;
+ 	sctx->count = 0;
+
+	return 0;
+}
+
+static inline int sha1_base_export(struct shash_desc *desc, void *out)
+{
+	struct sha1_state *sctx = shash_desc_ctx(desc);
+	struct sha1_state *dst = out;
+
+	*dst = *sctx;
+
+	return 0;
+}
+
+static inline int sha1_base_import(struct shash_desc *desc, const void *in)
+{
+	struct sha1_state *sctx = shash_desc_ctx(desc);
+	struct sha1_state const *src = in;
+
+	*sctx = *src;
+
+	return 0;
+}
+
+static inline int sha1_base_do_update(struct shash_desc *desc, const u8 *data,
+				      unsigned int len, sha1_block_fn *block_fn,
+				      void *p)
+{
+	struct sha1_state *sctx = shash_desc_ctx(desc);
+	unsigned int partial = sctx->count % SHA1_BLOCK_SIZE;
+
+	sctx->count += len;
+
+	if (unlikely((partial + len) >= SHA1_BLOCK_SIZE)) {
+		int blocks;
+
+		if (partial) {
+			int p = SHA1_BLOCK_SIZE - partial;
+
+			memcpy(sctx->buffer + partial, data, p);
+			data += p;
+			len -= p;
+		}
+
+		blocks = len / SHA1_BLOCK_SIZE;
+		len %= SHA1_BLOCK_SIZE;
+
+		block_fn(blocks, data, sctx->state,
+			 partial ? sctx->buffer : NULL, p);
+		data += blocks * SHA1_BLOCK_SIZE;
+		partial = 0;
+	}
+	if (len)
+		memcpy(sctx->buffer + partial, data, len);
+
+	return 0;
+}
+
+static inline int sha1_base_do_finalize(struct shash_desc *desc,
+					sha1_block_fn *block_fn, void *p)
+{
+	const int bit_offset = SHA1_BLOCK_SIZE - sizeof(__be64);
+	struct sha1_state *sctx = shash_desc_ctx(desc);
+	__be64 *bits = (__be64 *)(sctx->buffer + bit_offset);
+	unsigned int partial = sctx->count % SHA1_BLOCK_SIZE;
+
+	sctx->buffer[partial++] = 0x80;
+	if (partial > bit_offset) {
+		memset(sctx->buffer + partial, 0x0, SHA1_BLOCK_SIZE - partial);
+		partial = 0;
+
+		block_fn(1, sctx->buffer, sctx->state, NULL, p);
+	}
+
+	memset(sctx->buffer + partial, 0x0, bit_offset - partial);
+	*bits = cpu_to_be64(sctx->count << 3);
+	block_fn(1, sctx->buffer, sctx->state, NULL, p);
+
+	return 0;
+}
+
+static inline int sha1_base_finish(struct shash_desc *desc, u8 *out)
+{
+	struct sha1_state *sctx = shash_desc_ctx(desc);
+	__be32 *digest = (__be32 *)out;
+	int i;
+
+	for (i = 0; i < SHA1_DIGEST_SIZE / sizeof(__be32); i++)
+		put_unaligned_be32(sctx->state[i], digest++);
+
+	*sctx = (struct sha1_state){};
+	return 0;
+}