
random: ensure use of aligned buffers with ChaCha20

Message ID 5834509.PAxzGmSUSz@positron.chronox.de (mailing list archive)
State Not Applicable
Delegated to: Herbert Xu
Series random: ensure use of aligned buffers with ChaCha20

Commit Message

Stephan Mueller Aug. 9, 2018, 6:38 p.m. UTC
The function extract_crng invokes the ChaCha20 block operation directly
on the caller-provided buffer. The block operation operates on u32
words, so extract_crng expects the buffer to be u32-aligned, as is
evident from its parameter type. However, get_random_bytes takes a void
pointer which may or may not be aligned. Thus, an alignment check is
necessary, and the temporary buffer must be used when u32 alignment
cannot be guaranteed.

Cc: <stable@vger.kernel.org> # v4.16+
Cc: Ted Tso <tytso@mit.edu>
Signed-off-by: Stephan Mueller <smueller@chronox.de>
---
 drivers/char/random.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
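
As the review below points out, the posted hunk has its condition inverted
and never advances buf in the temporary-buffer branch. A minimal sketch of
the intended logic (assuming the fast path should be taken only for
u32-aligned buffers, as agreed later in the discussion) would be:

	while (nbytes >= CHACHA20_BLOCK_SIZE) {
		if (likely(((unsigned long)buf & (sizeof(tmp[0]) - 1)) == 0)) {
			/* u32-aligned: write the keystream block directly */
			extract_crng(buf);
		} else {
			/* misaligned: extract into the aligned tmp, then copy */
			extract_crng(tmp);
			memcpy(buf, tmp, CHACHA20_BLOCK_SIZE);
		}
		buf += CHACHA20_BLOCK_SIZE;
		nbytes -= CHACHA20_BLOCK_SIZE;
	}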

Comments

Eric Biggers Aug. 9, 2018, 7:07 p.m. UTC | #1
On Thu, Aug 09, 2018 at 08:38:56PM +0200, Stephan Müller wrote:
> The function extract_crng invokes the ChaCha20 block operation directly
> on the caller-provided buffer. The block operation operates on u32
> words, so extract_crng expects the buffer to be u32-aligned, as is
> evident from its parameter type. However, get_random_bytes takes a void
> pointer which may or may not be aligned. Thus, an alignment check is
> necessary, and the temporary buffer must be used when u32 alignment
> cannot be guaranteed.
> 
> Cc: <stable@vger.kernel.org> # v4.16+
> Cc: Ted Tso <tytso@mit.edu>
> Signed-off-by: Stephan Mueller <smueller@chronox.de>
> ---
>  drivers/char/random.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index bd449ad52442..23f336872426 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -1617,8 +1617,14 @@ static void _get_random_bytes(void *buf, int nbytes)
>  	trace_get_random_bytes(nbytes, _RET_IP_);
>  
>  	while (nbytes >= CHACHA20_BLOCK_SIZE) {
> -		extract_crng(buf);
> -		buf += CHACHA20_BLOCK_SIZE;
> +		if (likely((unsigned long)buf & (sizeof(tmp[0]) - 1))) {
> +			extract_crng(buf);
> +			buf += CHACHA20_BLOCK_SIZE;
> +		} else {
> +			extract_crng(tmp);
> +			memcpy(buf, tmp, CHACHA20_BLOCK_SIZE);
> +		}
> +
>  		nbytes -= CHACHA20_BLOCK_SIZE;
>  	}
>  
> -- 
> 2.17.1

This patch is backwards: the temporary buffer is used when the buffer is
*aligned*, not misaligned.  And more problematically, 'buf' is never incremented
in one of the cases...

Note that I had tried to fix the chacha20_block() alignment bugs in commit
9f480faec58cd6197a ("crypto: chacha20 - Fix keystream alignment for
chacha20_block()"), but I had missed this case.  I don't like seeing the
alignment requirement being worked around with a temporary buffer; it's
error-prone, and inefficient on common platforms.  How about we instead make the
output of chacha20_block() a u8 array and output the 16 32-bit words using
put_unaligned_le32()?  In retrospect I probably should have just done that, but
at the time I didn't know of any case where the alignment would be a problem.

- Eric
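
For context, put_unaligned_le32() stores a 32-bit value as four
little-endian bytes without any alignment requirement on the destination.
The kernel's actual helpers in <asm/unaligned.h> are per-architecture; a
simplified model of the semantics is:

	static inline void put_unaligned_le32(u32 val, void *p)
	{
		__le32 v = cpu_to_le32(val);

		/* byte-granular copy: safe for any destination alignment */
		memcpy(p, &v, sizeof(v));
	}

On CPUs with efficient unaligned access this typically compiles down to a
single 32-bit store, so the proposed change costs little in the common case.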
Theodore Ts'o Aug. 9, 2018, 7:21 p.m. UTC | #2
On Thu, Aug 09, 2018 at 08:38:56PM +0200, Stephan Müller wrote:
> The function extract_crng invokes the ChaCha20 block operation directly
> on the caller-provided buffer. The block operation operates on u32
> words, so extract_crng expects the buffer to be u32-aligned, as is
> evident from its parameter type. However, get_random_bytes takes a void
> pointer which may or may not be aligned. Thus, an alignment check is
> necessary, and the temporary buffer must be used when u32 alignment
> cannot be guaranteed.

I'm wondering whether we have kernel code that actually tries to
extract more than 64 bytes, so I'm not sure how often we enter the
while loop at all.  Out of curiosity, did you find this from code
inspection or via a kernel fault on some architecture that doesn't
allow unaligned u32 memory accesses?

>  	while (nbytes >= CHACHA20_BLOCK_SIZE) {
> -		extract_crng(buf);
> -		buf += CHACHA20_BLOCK_SIZE;
> +		if (likely((unsigned long)buf & (sizeof(tmp[0]) - 1))) {
> +			extract_crng(buf);
> +			buf += CHACHA20_BLOCK_SIZE;
> +		} else {
> +			extract_crng(tmp);
> +			memcpy(buf, tmp, CHACHA20_BLOCK_SIZE);
> +		}
> +

That should be

		if (likely(((unsigned long)buf & (sizeof(tmp[0]) - 1)) == 0)) {

shouldn't it?

If we *did* have callers who were calling get_random_bytes with nbytes
significantly larger than 64 bytes (CHACHA20_BLOCK_SIZE), I wonder if
we should be doing something like this:

	while (nbytes >= CHACHA20_BLOCK_SIZE) {
		int adjust = (unsigned long)buf & (sizeof(tmp[0]) - 1);

		extract_crng(buf);
		buf += CHACHA20_BLOCK_SIZE;
		if (likely(adjust == 0)) {
			extract_crng(buf);
			buf += CHACHA20_BLOCK_SIZE;
			nbytes -= CHACHA20_BLOCK_SIZE;
		} else {
			extract_crng(tmp);
			memcpy(buf, tmp, CHACHA20_BLOCK_SIZE - adjust);
			buf += CHACHA20_BLOCK_SIZE - adjust;
			nbytes -= CHACHA20_BLOCK_SIZE - adjust;
		}

	}

This may be a hyper-optimization, though --- it's not clear how often,
say, the kernel would be calling get_random_bytes with size >> 64 at
all, never mind with an unaligned buffer.

					- Ted
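
For background on Ted's question: on strict-alignment architectures, a
plain u32 store through a pointer that is not 4-byte aligned raises an
alignment fault rather than silently working. A contrived sketch of the
failure mode (not from the thread):

	u8 buf[8];
	u32 *p = (u32 *)&buf[1];	/* misaligned by one byte */

	*p = 0x12345678;	/* undefined behavior: may trap on strict-alignment CPUs */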
Eric Biggers Aug. 9, 2018, 7:40 p.m. UTC | #3
On Thu, Aug 09, 2018 at 12:07:18PM -0700, Eric Biggers wrote:
> On Thu, Aug 09, 2018 at 08:38:56PM +0200, Stephan Müller wrote:
> > The function extract_crng invokes the ChaCha20 block operation directly
> > on the caller-provided buffer. The block operation operates on u32
> > words, so extract_crng expects the buffer to be u32-aligned, as is
> > evident from its parameter type. However, get_random_bytes takes a void
> > pointer which may or may not be aligned. Thus, an alignment check is
> > necessary, and the temporary buffer must be used when u32 alignment
> > cannot be guaranteed.
> > 
> > Cc: <stable@vger.kernel.org> # v4.16+
> > Cc: Ted Tso <tytso@mit.edu>
> > Signed-off-by: Stephan Mueller <smueller@chronox.de>
> > ---
> >  drivers/char/random.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > index bd449ad52442..23f336872426 100644
> > --- a/drivers/char/random.c
> > +++ b/drivers/char/random.c
> > @@ -1617,8 +1617,14 @@ static void _get_random_bytes(void *buf, int nbytes)
> >  	trace_get_random_bytes(nbytes, _RET_IP_);
> >  
> >  	while (nbytes >= CHACHA20_BLOCK_SIZE) {
> > -		extract_crng(buf);
> > -		buf += CHACHA20_BLOCK_SIZE;
> > +		if (likely((unsigned long)buf & (sizeof(tmp[0]) - 1))) {
> > +			extract_crng(buf);
> > +			buf += CHACHA20_BLOCK_SIZE;
> > +		} else {
> > +			extract_crng(tmp);
> > +			memcpy(buf, tmp, CHACHA20_BLOCK_SIZE);
> > +		}
> > +
> >  		nbytes -= CHACHA20_BLOCK_SIZE;
> >  	}
> >  
> > -- 
> > 2.17.1
> 
> This patch is backwards: the temporary buffer is used when the buffer is
> *aligned*, not misaligned.  And more problematically, 'buf' is never incremented
> in one of the cases...
> 
> Note that I had tried to fix the chacha20_block() alignment bugs in commit
> 9f480faec58cd6197a ("crypto: chacha20 - Fix keystream alignment for
> chacha20_block()"), but I had missed this case.  I don't like seeing the
> alignment requirement being worked around with a temporary buffer; it's
> error-prone, and inefficient on common platforms.  How about we instead make the
> output of chacha20_block() a u8 array and output the 16 32-bit words using
> put_unaligned_le32()?  In retrospect I probably should have just done that, but
> at the time I didn't know of any case where the alignment would be a problem.
> 
> - Eric

For example:

-----8<-----

From: Eric Biggers <ebiggers@google.com>
Subject: [PATCH] crypto: chacha20 - Fix keystream alignment for chacha20_block() (again)

In commit 9f480faec58cd6 ("crypto: chacha20 - Fix keystream alignment
for chacha20_block()") I had missed that chacha20_block() can end up
being called on the buffer passed to get_random_bytes(), which can have
any alignment.  So, while my commit didn't break anything since
chacha20_block() has actually always had a u32-alignment requirement for
the output, it didn't fully solve the alignment problems.

Revert my solution and just update chacha20_block() to use
put_unaligned_le32(), so the output buffer doesn't have to be aligned.

This is simpler, and on many CPUs it's the same speed.

Reported-by: Stephan Müller <smueller@chronox.de>
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 crypto/chacha20_generic.c |  7 ++++---
 drivers/char/random.c     | 24 ++++++++++++------------
 include/crypto/chacha20.h |  3 +--
 lib/chacha20.c            |  6 +++---
 4 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
index e451c3cb6a56..3ae96587caf9 100644
--- a/crypto/chacha20_generic.c
+++ b/crypto/chacha20_generic.c
@@ -18,20 +18,21 @@
 static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
 			     unsigned int bytes)
 {
-	u32 stream[CHACHA20_BLOCK_WORDS];
+	/* aligned to potentially speed up crypto_xor() */
+	u8 stream[CHACHA20_BLOCK_SIZE] __aligned(sizeof(long));
 
 	if (dst != src)
 		memcpy(dst, src, bytes);
 
 	while (bytes >= CHACHA20_BLOCK_SIZE) {
 		chacha20_block(state, stream);
-		crypto_xor(dst, (const u8 *)stream, CHACHA20_BLOCK_SIZE);
+		crypto_xor(dst, stream, CHACHA20_BLOCK_SIZE);
 		bytes -= CHACHA20_BLOCK_SIZE;
 		dst += CHACHA20_BLOCK_SIZE;
 	}
 	if (bytes) {
 		chacha20_block(state, stream);
-		crypto_xor(dst, (const u8 *)stream, bytes);
+		crypto_xor(dst, stream, bytes);
 	}
 }
 
diff --git a/drivers/char/random.c b/drivers/char/random.c
index bd449ad52442..b8f4345a50f4 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -433,9 +433,9 @@ static int crng_init_cnt = 0;
 static unsigned long crng_global_init_time = 0;
 #define CRNG_INIT_CNT_THRESH (2*CHACHA20_KEY_SIZE)
 static void _extract_crng(struct crng_state *crng,
-			  __u32 out[CHACHA20_BLOCK_WORDS]);
+			  __u8 out[CHACHA20_BLOCK_SIZE]);
 static void _crng_backtrack_protect(struct crng_state *crng,
-				    __u32 tmp[CHACHA20_BLOCK_WORDS], int used);
+				    __u8 tmp[CHACHA20_BLOCK_SIZE], int used);
 static void process_random_ready_list(void);
 static void _get_random_bytes(void *buf, int nbytes);
 
@@ -912,7 +912,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
 	unsigned long	flags;
 	int		i, num;
 	union {
-		__u32	block[CHACHA20_BLOCK_WORDS];
+		__u8	block[CHACHA20_BLOCK_SIZE];
 		__u32	key[8];
 	} buf;
 
@@ -959,7 +959,7 @@ static void crng_reseed(struct crng_state *crng, struct entropy_store *r)
 }
 
 static void _extract_crng(struct crng_state *crng,
-			  __u32 out[CHACHA20_BLOCK_WORDS])
+			  __u8 out[CHACHA20_BLOCK_SIZE])
 {
 	unsigned long v, flags;
 
@@ -976,7 +976,7 @@ static void _extract_crng(struct crng_state *crng,
 	spin_unlock_irqrestore(&crng->lock, flags);
 }
 
-static void extract_crng(__u32 out[CHACHA20_BLOCK_WORDS])
+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
 {
 	struct crng_state *crng = NULL;
 
@@ -994,7 +994,7 @@ static void extract_crng(__u32 out[CHACHA20_BLOCK_WORDS])
  * enough) to mutate the CRNG key to provide backtracking protection.
  */
 static void _crng_backtrack_protect(struct crng_state *crng,
-				    __u32 tmp[CHACHA20_BLOCK_WORDS], int used)
+				    __u8 tmp[CHACHA20_BLOCK_SIZE], int used)
 {
 	unsigned long	flags;
 	__u32		*s, *d;
@@ -1006,14 +1006,14 @@ static void _crng_backtrack_protect(struct crng_state *crng,
 		used = 0;
 	}
 	spin_lock_irqsave(&crng->lock, flags);
-	s = &tmp[used / sizeof(__u32)];
+	s = (__u32 *) &tmp[used];
 	d = &crng->state[4];
 	for (i=0; i < 8; i++)
 		*d++ ^= *s++;
 	spin_unlock_irqrestore(&crng->lock, flags);
 }
 
-static void crng_backtrack_protect(__u32 tmp[CHACHA20_BLOCK_WORDS], int used)
+static void crng_backtrack_protect(__u8 tmp[CHACHA20_BLOCK_SIZE], int used)
 {
 	struct crng_state *crng = NULL;
 
@@ -1029,7 +1029,7 @@ static void crng_backtrack_protect(__u32 tmp[CHACHA20_BLOCK_WORDS], int used)
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
 	ssize_t ret = 0, i = CHACHA20_BLOCK_SIZE;
-	__u32 tmp[CHACHA20_BLOCK_WORDS];
+	__u8 tmp[CHACHA20_BLOCK_SIZE];
 	int large_request = (nbytes > 256);
 
 	while (nbytes) {
@@ -1612,7 +1612,7 @@ static void _warn_unseeded_randomness(const char *func_name, void *caller,
  */
 static void _get_random_bytes(void *buf, int nbytes)
 {
-	__u32 tmp[CHACHA20_BLOCK_WORDS];
+	__u8 tmp[CHACHA20_BLOCK_SIZE];
 
 	trace_get_random_bytes(nbytes, _RET_IP_);
 
@@ -2222,7 +2222,7 @@ u64 get_random_u64(void)
 	if (use_lock)
 		read_lock_irqsave(&batched_entropy_reset_lock, flags);
 	if (batch->position % ARRAY_SIZE(batch->entropy_u64) == 0) {
-		extract_crng((__u32 *)batch->entropy_u64);
+		extract_crng((u8 *)batch->entropy_u64);
 		batch->position = 0;
 	}
 	ret = batch->entropy_u64[batch->position++];
@@ -2252,7 +2252,7 @@ u32 get_random_u32(void)
 	if (use_lock)
 		read_lock_irqsave(&batched_entropy_reset_lock, flags);
 	if (batch->position % ARRAY_SIZE(batch->entropy_u32) == 0) {
-		extract_crng(batch->entropy_u32);
+		extract_crng((u8 *)batch->entropy_u32);
 		batch->position = 0;
 	}
 	ret = batch->entropy_u32[batch->position++];
diff --git a/include/crypto/chacha20.h b/include/crypto/chacha20.h
index b83d66073db0..f76302d99e2b 100644
--- a/include/crypto/chacha20.h
+++ b/include/crypto/chacha20.h
@@ -13,13 +13,12 @@
 #define CHACHA20_IV_SIZE	16
 #define CHACHA20_KEY_SIZE	32
 #define CHACHA20_BLOCK_SIZE	64
-#define CHACHA20_BLOCK_WORDS	(CHACHA20_BLOCK_SIZE / sizeof(u32))
 
 struct chacha20_ctx {
 	u32 key[8];
 };
 
-void chacha20_block(u32 *state, u32 *stream);
+void chacha20_block(u32 *state, u8 *stream);
 void crypto_chacha20_init(u32 *state, struct chacha20_ctx *ctx, u8 *iv);
 int crypto_chacha20_setkey(struct crypto_skcipher *tfm, const u8 *key,
 			   unsigned int keysize);
diff --git a/lib/chacha20.c b/lib/chacha20.c
index c1cc50fb68c9..d907fec6a9ed 100644
--- a/lib/chacha20.c
+++ b/lib/chacha20.c
@@ -16,9 +16,9 @@
 #include <asm/unaligned.h>
 #include <crypto/chacha20.h>
 
-void chacha20_block(u32 *state, u32 *stream)
+void chacha20_block(u32 *state, u8 *stream)
 {
-	u32 x[16], *out = stream;
+	u32 x[16];
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(x); i++)
@@ -67,7 +67,7 @@ void chacha20_block(u32 *state, u32 *stream)
 	}
 
 	for (i = 0; i < ARRAY_SIZE(x); i++)
-		out[i] = cpu_to_le32(x[i] + state[i]);
+		put_unaligned_le32(x[i] + state[i], &stream[i * sizeof(u32)]);
 
 	state[12]++;
 }
Yann Droneaud Aug. 9, 2018, 8:28 p.m. UTC | #4
Hi,

On Thursday, August 9, 2018 at 12:40 -0700, Eric Biggers wrote:
> From: Eric Biggers <ebiggers@google.com>
> Subject: [PATCH] crypto: chacha20 - Fix keystream alignment for chacha20_block() (again)
> 
> In commit 9f480faec58cd6 ("crypto: chacha20 - Fix keystream alignment
> for chacha20_block()") I had missed that chacha20_block() can end up
> being called on the buffer passed to get_random_bytes(), which can have
> any alignment.  So, while my commit didn't break anything since
> chacha20_block() has actually always had a u32-alignment requirement for
> the output, it didn't fully solve the alignment problems.
> 
> Revert my solution and just update chacha20_block() to use
> put_unaligned_le32(), so the output buffer doesn't have to be aligned.
> 
> This is simpler, and on many CPUs it's the same speed.
> 
> Reported-by: Stephan Müller <smueller@chronox.de>
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
>  crypto/chacha20_generic.c |  7 ++++---
>  drivers/char/random.c     | 24 ++++++++++++------------
>  include/crypto/chacha20.h |  3 +--
>  lib/chacha20.c            |  6 +++---
>  4 files changed, 20 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index bd449ad52442..b8f4345a50f4 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -433,9 +433,9 @@ static int crng_init_cnt = 0;
>  static unsigned long crng_global_init_time = 0;
>  #define CRNG_INIT_CNT_THRESH (2*CHACHA20_KEY_SIZE)
>  static void _extract_crng(struct crng_state *crng,
> -			  __u32 out[CHACHA20_BLOCK_WORDS]);
> +			  __u8 out[CHACHA20_BLOCK_SIZE]);
>  static void _crng_backtrack_protect(struct crng_state *crng,
> -				    __u32 tmp[CHACHA20_BLOCK_WORDS], int used);
> +				    __u8 tmp[CHACHA20_BLOCK_SIZE], int used);
>  static void process_random_ready_list(void);
>  static void _get_random_bytes(void *buf, int nbytes);
>  

[...]

> @@ -994,7 +994,7 @@ static void extract_crng(__u32 out[CHACHA20_BLOCK_WORDS])
>   * enough) to mutate the CRNG key to provide backtracking protection.
>   */
>  static void _crng_backtrack_protect(struct crng_state *crng,
> -				    __u32 tmp[CHACHA20_BLOCK_WORDS], int used)
> +				    __u8 tmp[CHACHA20_BLOCK_SIZE], int used)
>  {
>  	unsigned long	flags;
>  	__u32		*s, *d;
> @@ -1006,14 +1006,14 @@ static void _crng_backtrack_protect(struct crng_state *crng,
>  		used = 0;
>  	}
>  	spin_lock_irqsave(&crng->lock, flags);
> -	s = &tmp[used / sizeof(__u32)];
> +	s = (__u32 *) &tmp[used];

tmp is __u8* while s is __u32*, that sounds like an alignment issue ...

>  	d = &crng->state[4];
>  	for (i=0; i < 8; i++)
>  		*d++ ^= *s++;

... when s is dereferenced

>  	spin_unlock_irqrestore(&crng->lock, flags);
>  }
>  


Regards.
Stephan Mueller Aug. 10, 2018, 6:13 a.m. UTC | #5
On Thursday, August 9, 2018, 21:07:18 CEST, Eric Biggers wrote:

Hi Eric,

> This patch is backwards: the temporary buffer is used when the buffer is
> *aligned*, not misaligned.  And more problematically, 'buf' is never
> incremented in one of the cases...

Of course, it needs to be reversed. Thanks.

Ciao
Stephan
Stephan Mueller Aug. 10, 2018, 6:20 a.m. UTC | #6
On Thursday, August 9, 2018, 21:21:32 CEST, Theodore Y. Ts'o wrote:

Hi Theodore,

> I'm wondering whether we have kernel code that actually tries to
> extract more than 64 bytes, so I'm not sure how often we enter the
> while loop at all.  Out of curiosity, did you find this from code
> inspection or via a kernel fault on some architecture that doesn't
> allow unaligned u32 memory accesses?

I found it by inspection during performance assessments, and considering
that we have such alignment checks in the kernel crypto API. My LRNG
implementation using ChaCha20 is between 20% and 120% faster, and I would
like random.c to benefit from such a speed improvement as well.

> 
> That should be
> 
> 		if (likely(((unsigned long)buf & (sizeof(tmp[0]) - 1)) == 0)) {
> 
> shouldn't it?

As I said to Eric, my patch is of course reversed.
> 
> If we *did* have callers who were calling get_random_bytes with nbytes
> significantly larger than 64 bytes (CHACHA20_BLOCK_SIZE), I wonder if
> we should be doing something like this:
> 
> 	while (nbytes >= CHACHA20_BLOCK_SIZE) {
> 		int adjust = (unsigned long)buf & (sizeof(tmp[0]) - 1);
> 
> 		extract_crng(buf);

Why this line?

> 		buf += CHACHA20_BLOCK_SIZE;
> 		if (likely(adjust == 0)) {
> 			extract_crng(buf);
> 			buf += CHACHA20_BLOCK_SIZE;
> 			nbytes -= CHACHA20_BLOCK_SIZE;
> 		} else {
> 			extract_crng(tmp);
> 			memcpy(buf, tmp, CHACHA20_BLOCK_SIZE - adjust);
> 			buf += CHACHA20_BLOCK_SIZE - adjust;
> 			nbytes -= CHACHA20_BLOCK_SIZE - adjust;

Sure, why not.

> 		}
> 
> 	}
> 
> This may be a hyper-optimization, though --- it's not clear how often,
> say, the kernel would be calling get_random_bytes with size >> 64 at
> all, never mind with an unaligned buffer.

I agree it is not likely that we have unaligned buffers. But in case we
do, we have the potential to overwrite memory that does not belong to
us, with unknown consequences.


Ciao
Stephan
Stephan Mueller Aug. 10, 2018, 6:27 a.m. UTC | #7
On Thursday, August 9, 2018, 21:40:12 CEST, Eric Biggers wrote:

Hi Eric,

>  	while (bytes >= CHACHA20_BLOCK_SIZE) {
>  		chacha20_block(state, stream);
> -		crypto_xor(dst, (const u8 *)stream, CHACHA20_BLOCK_SIZE);
> +		crypto_xor(dst, stream, CHACHA20_BLOCK_SIZE);

While we are at it, I am wondering whether we should use crypto_xor at
all. At this point we know exactly that the data is CHACHA20_BLOCK_SIZE
bytes in length, which is divisible by sizeof(u32). Hence, shouldn't we
disregard crypto_xor in favor of a loop iterating over 32-bit words?
crypto_xor contains some checks for trailing bytes which we could spare.

>  		bytes -= CHACHA20_BLOCK_SIZE;
>  		dst += CHACHA20_BLOCK_SIZE;
>  	}
>  	if (bytes) {
>  		chacha20_block(state, stream);
> -		crypto_xor(dst, (const u8 *)stream, bytes);
> +		crypto_xor(dst, stream, bytes);

Same here.

> @@ -1006,14 +1006,14 @@ static void _crng_backtrack_protect(struct
> crng_state *crng, used = 0;
>  	}
>  	spin_lock_irqsave(&crng->lock, flags);
> -	s = &tmp[used / sizeof(__u32)];
> +	s = (__u32 *) &tmp[used];

As Yann said, wouldn't you have the alignment problem here again?

Somehow, somebody must check the provided input buffer at some point.


Ciao
Stephan
Theodore Ts'o Aug. 10, 2018, 2:58 p.m. UTC | #8
On Fri, Aug 10, 2018 at 08:20:51AM +0200, Stephan Mueller wrote:
> > 	while (nbytes >= CHACHA20_BLOCK_SIZE) {
> > 		int adjust = (unsigned long)buf & (sizeof(tmp[0]) - 1);
> > 
> > 		extract_crng(buf);
> 
> Why this line?
> 
> > 		buf += CHACHA20_BLOCK_SIZE;

Sorry, the above two lines should be removed, of course.

> > 		if (likely(adjust == 0)) {
> > 			extract_crng(buf);
> > 			buf += CHACHA20_BLOCK_SIZE;
> > 			nbytes -= CHACHA20_BLOCK_SIZE;
> > 		} else {
> > 			extract_crng(tmp);
> > 			memcpy(buf, tmp, CHACHA20_BLOCK_SIZE - adjust);
> > 			buf += CHACHA20_BLOCK_SIZE - adjust;
> > 			nbytes -= CHACHA20_BLOCK_SIZE - adjust;
> 
> Sure, why not.
> 
> > 		}
> > 
> > 	}
> > 
> > This may be a hyper optimization, though --- it's not clear how often,
> > say the kernel would be calling get_random_bytes with size >> 64 at
> > all, never mind with an unaligned buffer.
> 
> I agree it is not likely that we have unaligned buffers. But in case we have, 
> we have the potential to overwrite memory that does not belong to us with 
> unknown consequences.

Sure, fair enough.  The potential wouldn't be overwriting memory,
though.  It would be a kernel panic when the CPU trapped a non-aligned
pointer dereference.

						- Ted
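
With those two stray lines removed, Ted's realignment variant would read
roughly as follows; this is a sketch only, since the thread ultimately
converged on Eric's put_unaligned_le32() approach instead:

	while (nbytes >= CHACHA20_BLOCK_SIZE) {
		int adjust = (unsigned long)buf & (sizeof(tmp[0]) - 1);

		if (likely(adjust == 0)) {
			/* aligned: extract directly into the caller's buffer */
			extract_crng(buf);
			buf += CHACHA20_BLOCK_SIZE;
			nbytes -= CHACHA20_BLOCK_SIZE;
		} else {
			/*
			 * Misaligned: copy just enough of one keystream block
			 * that buf becomes u32-aligned for the next iteration.
			 */
			extract_crng(tmp);
			memcpy(buf, tmp, CHACHA20_BLOCK_SIZE - adjust);
			buf += CHACHA20_BLOCK_SIZE - adjust;
			nbytes -= CHACHA20_BLOCK_SIZE - adjust;
		}
	}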
Eric Biggers Sept. 12, 2018, 3:02 a.m. UTC | #9
To revive this...

On Fri, Aug 10, 2018 at 08:27:58AM +0200, Stephan Mueller wrote:
> On Thursday, August 9, 2018, 21:40:12 CEST, Eric Biggers wrote:
> 
> Hi Eric,
> 
> >  	while (bytes >= CHACHA20_BLOCK_SIZE) {
> >  		chacha20_block(state, stream);
> > -		crypto_xor(dst, (const u8 *)stream, CHACHA20_BLOCK_SIZE);
> > +		crypto_xor(dst, stream, CHACHA20_BLOCK_SIZE);
> 
> While we are at it, I am wondering whether we should use crypto_xor at
> all. At this point we know exactly that the data is CHACHA20_BLOCK_SIZE
> bytes in length, which is divisible by sizeof(u32). Hence, shouldn't we
> disregard crypto_xor in favor of a loop iterating over 32-bit words?
> crypto_xor contains some checks for trailing bytes which we could spare.

crypto_xor() here is fine.  It already meets the conditions for the inlined
version that XORs a long at a time:

	if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
	    __builtin_constant_p(size) &&
	    (size % sizeof(unsigned long)) == 0) {
		unsigned long *d = (unsigned long *)dst;
		unsigned long *s = (unsigned long *)src;

		while (size > 0) {
			*d++ ^= *s++;
			size -= sizeof(unsigned long);
		}
	}

And regardless, it's better to optimize crypto_xor() once than all the callers.

> 
> >  		bytes -= CHACHA20_BLOCK_SIZE;
> >  		dst += CHACHA20_BLOCK_SIZE;
> >  	}
> >  	if (bytes) {
> >  		chacha20_block(state, stream);
> > -		crypto_xor(dst, (const u8 *)stream, bytes);
> > +		crypto_xor(dst, stream, bytes);
> 
> Same here.

'bytes' need not be a multiple of sizeof(u32) or sizeof(long), and 'dst' can
have any alignment...  So we should just call crypto_xor() which does the right
thing, and is intended to do so as efficiently as possible.

> 
> > @@ -1006,14 +1006,14 @@ static void _crng_backtrack_protect(struct
> > crng_state *crng, used = 0;
> >  	}
> >  	spin_lock_irqsave(&crng->lock, flags);
> > -	s = &tmp[used / sizeof(__u32)];
> > +	s = (__u32 *) &tmp[used];
> 
> As Yann said, wouldn't you have the alignment problem here again?
> 
> Somehow, somebody must check the provided input buffer at some point.
> 

I guess we should just explicitly align the 'tmp' buffers in _get_random_bytes()
and extract_crng_user().

- Eric
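
A minimal sketch of that explicit alignment, reusing the __u8-array form
from the patch above:

	/* e.g. in _get_random_bytes() and extract_crng_user(): */
	__u8 tmp[CHACHA20_BLOCK_SIZE] __aligned(sizeof(__u32));

With tmp guaranteed u32-aligned, casting &tmp[used] to __u32 * in
_crng_backtrack_protect() is safe whenever used is a multiple of 4.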

Patch

diff --git a/drivers/char/random.c b/drivers/char/random.c
index bd449ad52442..23f336872426 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1617,8 +1617,14 @@  static void _get_random_bytes(void *buf, int nbytes)
 	trace_get_random_bytes(nbytes, _RET_IP_);
 
 	while (nbytes >= CHACHA20_BLOCK_SIZE) {
-		extract_crng(buf);
-		buf += CHACHA20_BLOCK_SIZE;
+		if (likely((unsigned long)buf & (sizeof(tmp[0]) - 1))) {
+			extract_crng(buf);
+			buf += CHACHA20_BLOCK_SIZE;
+		} else {
+			extract_crng(tmp);
+			memcpy(buf, tmp, CHACHA20_BLOCK_SIZE);
+		}
+
 		nbytes -= CHACHA20_BLOCK_SIZE;
 	}