[v1,6/7] random: use simpler fast key erasure flow on per-cpu keys

Message ID: 20220208155335.378318-7-Jason@zx2c4.com (mailing list archive)
State: Not Applicable
Delegated to: Herbert Xu
Series: random: cleanups around per-cpu crng & rdrand

Commit Message

Jason A. Donenfeld Feb. 8, 2022, 3:53 p.m. UTC
Rather than the clunky NUMA full ChaCha state system we had prior, this
commit is closer to the original "fast key erasure RNG" proposal from
<https://blog.cr.yp.to/20170723-random.html>, by simply treating ChaCha
keys on a per-cpu basis.

All entropy is extracted to a base crng key of 32 bytes. This base crng
has a birthdate and a generation counter. When we go to take bytes from
the crng, we first check if the birthdate is too old; if it is, we
reseed per usual. Then we start working on a per-cpu crng.

This per-cpu crng makes sure that it has the same generation counter as
the base crng. If it doesn't, it does fast key erasure with the base
crng key and uses the output as its new per-cpu key, and then updates
its local generation counter. Then, using this per-cpu state, we do
ordinary fast key erasure. Half of this first block is used to overwrite
the per-cpu crng key for the next call -- this is the fast key erasure
RNG idea -- and the other half, along with the ChaCha state, is returned
to the caller. If the caller desires more than this remaining half, it
can generate more ChaCha blocks, unlocked, using the now detached ChaCha
state that was just returned. Crypto-wise, this is more or less what we
were doing before, but this simply makes it more explicit and ensures
that we always have backtrack protection by not playing games with a
shared block counter.

The flow looks like this:

──extract()──► base_crng.key ◄──memcpy()───┐
                   │                       │
                   └──chacha()──────┬─► new_base_key
                                    └─► crngs[n].key ◄──memcpy()───┐
                                              │                    │
                                              └──chacha()───┬─► new_key
                                                            └─► random_bytes
                                                                      │
                                                                      └────►
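
In rough C, the per-call step inside crng_make_state() boils down to
something like this (a simplified sketch only: locking, READ_ONCE(), and
the unlikely pre-init path are elided; crng_fast_key_erasure() is the
helper added by this patch):

	if (crng->generation != base_crng.generation) {
		/* Stale per-cpu key: derive a fresh one from the base key,
		 * which rotates the base key as a side effect. */
		crng_fast_key_erasure(base_crng.key, chacha_state,
				      crng->key, sizeof(crng->key));
		crng->generation = base_crng.generation;
	}

	/* Fast key erasure on the per-cpu key: the first half of the block
	 * becomes the next per-cpu key, the rest goes to the caller. */
	crng_fast_key_erasure(crng->key, chacha_state,
			      random_data, random_data_len);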

There are a few hairy details around early init. Just as was done
before, prior to having gathered enough entropy, crng_fast_load() and
crng_slow_load() dump bytes directly into the base crng, and when we go
to take bytes from the crng, in that case, we're doing fast key erasure
with the base crng rather than the fast unlocked per-cpu crngs. This is
fine as that's only the state of affairs during very early boot; once
the crng initializes we never use these paths again.

In the process of all this, the APIs into the crng become a bit simpler:
we have get_random_bytes(buf, len) and get_random_bytes_user(buf, len),
which both do what you'd expect. All of the details of fast key erasure
and per-cpu selection happen only in a very short critical section of
crng_make_state(), which selects the right per-cpu key, does the fast
key erasure, and returns a local state to the caller's stack. So, we no
longer have a need for a separate backtrack function, as this happens
all at once here. The API then allows us to extend backtrack protection
to batched entropy without really having to do much at all.
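
For a caller that needs more than those initial 32 bytes, the pattern is
roughly the following (a sketch of what get_random_bytes() in this patch
does, with the block counter overflow handling left out):

	u32 chacha_state[CHACHA_STATE_WORDS];
	u8 block[CHACHA_BLOCK_SIZE];
	size_t len = min_t(size_t, 32, nbytes);

	crng_make_state(chacha_state, buf, len);	/* short locked section */
	buf += len;
	nbytes -= len;

	while (nbytes) {				/* unlocked from here on */
		chacha20_block(chacha_state, block);
		len = min_t(size_t, nbytes, CHACHA_BLOCK_SIZE);
		memcpy(buf, block, len);
		buf += len;
		nbytes -= len;
	}

	memzero_explicit(chacha_state, sizeof(chacha_state));
	memzero_explicit(block, sizeof(block));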

The result is a bit simpler than before and has fewer foot guns. The
init time state machine also gets a lot simpler as we don't need to wait
for workqueues to come online and do deferred work. And the multi-core
performance should be increased significantly, by virtue of having hardly
any locking on the fast path.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 drivers/char/random.c | 278 +++++++++++++++++++++++++-----------------
 1 file changed, 169 insertions(+), 109 deletions(-)

Comments

Eric Biggers Feb. 8, 2022, 11:39 p.m. UTC | #1
Hi Jason,

On Tue, Feb 08, 2022 at 04:53:34PM +0100, Jason A. Donenfeld wrote:
> Rather than the clunky NUMA full ChaCha state system we had prior, this
> commit is closer to the original "fast key erasure RNG" proposal from
> <https://blog.cr.yp.to/20170723-random.html>, by simply treating ChaCha
> keys on a per-cpu basis.
> 
> All entropy is extracted to a base crng key of 32 bytes. This base crng
> has a birthdate and a generation counter. When we go to take bytes from
> the crng, we first check if the birthdate is too old; if it is, we
> reseed per usual. Then we start working on a per-cpu crng.
> 
> This per-cpu crng makes sure that it has the same generation counter as
> the base crng. If it doesn't, it does fast key erasure with the base
> crng key and uses the output as its new per-cpu key, and then updates
> its local generation counter. Then, using this per-cpu state, we do
> ordinary fast key erasure. Half of this first block is used to overwrite
> the per-cpu crng key for the next call -- this is the fast key erasure
> RNG idea -- and the other half, along with the ChaCha state, is returned
> to the caller. If the caller desires more than this remaining half, it
> can generate more ChaCha blocks, unlocked, using the now detached ChaCha
> state that was just returned. Crypto-wise, this is more or less what we
> were doing before, but this simply makes it more explicit and ensures
> that we always have backtrack protection by not playing games with a
> shared block counter.
> 
> The flow looks like this:
> 
> ──extract()──► base_crng.key ◄──memcpy()───┐
>                    │                       │
>                    └──chacha()──────┬─► new_base_key
>                                     └─► crngs[n].key ◄──memcpy()───┐
>                                               │                    │
>                                               └──chacha()───┬─► new_key
>                                                             └─► random_bytes
>                                                                       │
>                                                                       └────►
> 
> There are a few hairy details around early init. Just as was done
> before, prior to having gathered enough entropy, crng_fast_load() and
> crng_slow_load() dump bytes directly into the base crng, and when we go
> to take bytes from the crng, in that case, we're doing fast key erasure
> with the base crng rather than the fast unlocked per-cpu crngs. This is
> fine as that's only the state of affairs during very early boot; once
> the crng initializes we never use these paths again.
> 
> In the process of all this, the APIs into the crng become a bit simpler:
> we have get_random_bytes(buf, len) and get_random_bytes_user(buf, len),
> which both do what you'd expect. All of the details of fast key erasure
> and per-cpu selection happen only in a very short critical section of
> crng_make_state(), which selects the right per-cpu key, does the fast
> key erasure, and returns a local state to the caller's stack. So, we no
> longer have a need for a separate backtrack function, as this happens
> all at once here. The API then allows us to extend backtrack protection
> to batched entropy without really having to do much at all.
> 
> The result is a bit simpler than before and has fewer foot guns. The
> init time state machine also gets a lot simpler as we don't need to wait
> for workqueues to come online and do deferred work. And the multi-core
> performance should be increased significantly, by virtue of having hardly
> any locking on the fast path.
> 
> Cc: Theodore Ts'o <tytso@mit.edu>
> Cc: Dominik Brodowski <linux@dominikbrodowski.net>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
>  drivers/char/random.c | 278 +++++++++++++++++++++++++-----------------
>  1 file changed, 169 insertions(+), 109 deletions(-)

This looks like a big improvement, thanks.  A few comments below.

>  static void crng_reseed(void)
>  {
>  	unsigned long flags;
> -	int i, entropy_count;
> -	union {
> -		u8 block[CHACHA_BLOCK_SIZE];
> -		u32 key[8];
> -	} buf;
> +	int entropy_count;
> +	unsigned long next_gen;
> +	u8 key[CHACHA_KEY_SIZE];
>  
>  	do {
>  		entropy_count = READ_ONCE(input_pool.entropy_count);
>  		if (entropy_count < POOL_MIN_BITS)
>  			return;
>  	} while (cmpxchg(&input_pool.entropy_count, entropy_count, 0) != entropy_count);
> -	extract_entropy(buf.key, sizeof(buf.key));
> +	extract_entropy(key, sizeof(key));
>  	wake_up_interruptible(&random_write_wait);
>  	kill_fasync(&fasync, SIGIO, POLL_OUT);
>  
> -	spin_lock_irqsave(&primary_crng.lock, flags);
> -	for (i = 0; i < 8; i++)
> -		primary_crng.state[i + 4] ^= buf.key[i];
> -	memzero_explicit(&buf, sizeof(buf));
> -	WRITE_ONCE(primary_crng.init_time, jiffies);
> -	spin_unlock_irqrestore(&primary_crng.lock, flags);
> +	spin_lock_irqsave(&base_crng.lock, flags);
> +	memcpy(base_crng.key, key, sizeof(base_crng.key));
> +	next_gen = base_crng.generation + 1;
> +	if (next_gen == ULONG_MAX)
> +		++next_gen;
> +	WRITE_ONCE(base_crng.generation, next_gen);
> +	base_crng.birth = jiffies;
> +	spin_unlock_irqrestore(&base_crng.lock, flags);
> +	memzero_explicit(key, sizeof(key));

Previously, the extracted entropy was being XOR'ed into the ChaCha key.  Now the
key is just being overwritten.  This is the approach we should be aiming for,
but I'm concerned about the fact that add_interrupt_randomness() still sometimes
adds entropy to the ChaCha key directly without mixing it into the entropy pool.
That happens via crng_fast_load() when crng_init == 0.  This new approach will
destroy any entropy that was present in the key only.

The right fix, IMO, would be to always send entropy through the entropy pool.
Until that is done, though, I'm not sure it's a good idea to overwrite the key
like this.

Unrelated: this function could use some comments that explain what is going on.

> +/*
> + * The general form here is based on a "fast key erasure RNG" from:
> + * https://blog.cr.yp.to/20170723-random.html
> + */
> +static void crng_fast_key_erasure(u8 key[CHACHA_KEY_SIZE],
> +				  u32 chacha_state[CHACHA_STATE_WORDS],
> +				  u8 random_data[32], size_t random_data_len)

Given that random_data is variable-length it should be a 'u8 *'.

Also, the above comment could use some more explanation.  I.e. what does this
function actually *do*.  (I understand it, but a comment will really help any
future readers...)

> + /*
> + * This function returns a ChaCha block that you may use for generating

ChaCha state, not ChaCha block.

> + * random data. It also returns up to 32 bytes on its own of random data
> + * that may be used.
> +*/
> +static void crng_make_state(u32 chacha_state[CHACHA_STATE_WORDS],
> +			    u8 random_data[32], size_t random_data_len)

Likewise, it's weird having random_data be declared as length 32 when it's
actually variable-length.

>  {
>  	unsigned long flags;
> -	u32 *s, *d;
> -	int i;
> +	struct crng *crng;
> +
> +	/* For the fast path, we check this unlocked first. */
> +	if (unlikely(!crng_ready())) {
> +		bool not_ready;
> +
> +		spin_lock_irqsave(&base_crng.lock, flags);
> +		/* Now that we're locked, re-check. */
> +		not_ready = !crng_ready();
> +		if (not_ready)
> +			crng_fast_key_erasure(base_crng.key, chacha_state,
> +					      random_data, random_data_len);
> +		spin_unlock_irqrestore(&base_crng.lock, flags);
> +		if (!not_ready)
> +			return;
> +	}

Isn't the '!not_ready' check above backwards?  And in any case, doubly-inverted
logic like that should be avoided.

And again, a few more comments please :-)

>  struct batched_entropy {
>  	union {
> -		u64 entropy_u64[CHACHA_BLOCK_SIZE / sizeof(u64)];
> -		u32 entropy_u32[CHACHA_BLOCK_SIZE / sizeof(u32)];
> +		u64 entropy_u64[CHACHA_BLOCK_SIZE * 3 / (2 * sizeof(u64))];
> +		u32 entropy_u32[CHACHA_BLOCK_SIZE * 3 / (2 * sizeof(u32))];
>  	};

The above numbers could use an explanation.

>  /*
>   * Get a random word for internal kernel use only. The quality of the random
> - * number is good as /dev/urandom, but there is no backtrack protection, with
> - * the goal of being quite fast and not depleting entropy. In order to ensure
> - * that the randomness provided by this function is okay, the function
> - * wait_for_random_bytes() should be called and return 0 at least once at any
> - * point prior.
> + * number is good as /dev/urandom. In order to ensure that the randomness
> + * provided by this function is okay, the function wait_for_random_bytes()
> + * should be called and return 0 at least once at any point prior.
>   */

There's also a comment at the top of the file that still claims that
get_random_int() et al. don't implement backtracking protection.  That needs to
be updated.

- Eric
Jason A. Donenfeld Feb. 9, 2022, 12:21 a.m. UTC | #2
Hey Eric,

Thanks a bunch for the review.

On Wed, Feb 9, 2022 at 12:39 AM Eric Biggers <ebiggers@kernel.org> wrote:
> Previously, the extracted entropy was being XOR'ed into the ChaCha key.  Now the
> key is just being overwritten.  This is the approach we should be aiming for,
> but I'm concerned about the fact that add_interrupt_randomness() still sometimes
> adds entropy to the ChaCha key directly without mixing it into the entropy pool.
> That happens via crng_fast_load() when crng_init == 0.  This new approach will
> destroy any entropy that was present in the key only.
>
> The right fix, IMO, would be to always send entropy through the entropy pool.
> Until that is done, though, I'm not sure it's a good idea to overwrite the key
> like this.

I agree with this in principle, and I've been trying to think of a
good way to do it, but it's a bit hard to do, because of the workqueue
deferred dumping. I'll try to see if I can figure out something there.
In practice, it's not _such_ a big deal, as it's "only" 64 credit-bits
of entropy we're talking about. A sort of hacky, but maybe satisfactory,
solution would be to hash the base_crng key into the pool, once, just at
the transition point. I'll think a bit more on it.
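
Very roughly, something like this at the 0 -> 1 transition in
crng_fast_load() (untested, and the locking would need a look since
mix_pool_bytes() takes the input pool lock while we hold base_crng.lock
there):

	if (crng_init_cnt >= CRNG_INIT_CNT_THRESH) {
		/* Fold the directly-loaded key bytes back into the pool so
		 * they aren't lost when crng_reseed() overwrites the key. */
		mix_pool_bytes(base_crng.key, sizeof(base_crng.key));
		invalidate_batched_entropy();
		crng_init = 1;
	}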

>
> Unrelated: this function could use some comments that explain what is going on.

Will do.

>
> > +/*
> > + * The general form here is based on a "fast key erasure RNG" from:
> > + * https://blog.cr.yp.to/20170723-random.html
> > + */
> > +static void crng_fast_key_erasure(u8 key[CHACHA_KEY_SIZE],
> > +                               u32 chacha_state[CHACHA_STATE_WORDS],
> > +                               u8 random_data[32], size_t random_data_len)
>
> Given that random_data is variable-length it should be a 'u8 *'.

I'll do that and mention the size aspect in the comment on top.

>
> Also, the above comment could use some more explanation.  I.e. what does this
> function actually *do*.  (I understand it, but a comment will really help any
> future readers...)

Will do.

>
> > + /*
> > + * This function returns a ChaCha block that you may use for generating
>
> ChaCha state, not ChaCha block.

Thanks.

>
> > + * random data. It also returns up to 32 bytes on its own of random data
> > + * that may be used.
> > +*/
> > +static void crng_make_state(u32 chacha_state[CHACHA_STATE_WORDS],
> > +                         u8 random_data[32], size_t random_data_len)
>
> Likewise, it's weird having random_data be declared as length 32 when it's
> actually variable-length.

Will fix.

> > +     /* For the fast path, we check this unlocked first. */
> > +     if (unlikely(!crng_ready())) {
> > +             bool not_ready;
> > +
> > +             spin_lock_irqsave(&base_crng.lock, flags);
> > +             /* Now that we're locked, re-check. */
> > +             not_ready = !crng_ready();
> > +             if (not_ready)
> > +                     crng_fast_key_erasure(base_crng.key, chacha_state,
> > +                                           random_data, random_data_len);
> > +             spin_unlock_irqrestore(&base_crng.lock, flags);
> > +             if (!not_ready)
> > +                     return;
> > +     }
>
> Isn't the '!not_ready' check above backwards?  And in any case, doubly-inverted
> logic like that should be avoided.

You're right; it's all screwed up. I'll call it "ready" like it should
be and fix the logic on that last if statement. Thank you for pointing
this one out.
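
I.e., roughly (untested):

	if (unlikely(!crng_ready())) {
		bool ready;

		spin_lock_irqsave(&base_crng.lock, flags);
		ready = crng_ready();	/* re-check under the lock */
		if (!ready)
			crng_fast_key_erasure(base_crng.key, chacha_state,
					      random_data, random_data_len);
		spin_unlock_irqrestore(&base_crng.lock, flags);
		if (!ready)
			return;
	}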

>
> And again, a few more comments please :-)

I'll pepper it up.

>
> >  struct batched_entropy {
> >       union {
> > -             u64 entropy_u64[CHACHA_BLOCK_SIZE / sizeof(u64)];
> > -             u32 entropy_u32[CHACHA_BLOCK_SIZE / sizeof(u32)];
> > +             u64 entropy_u64[CHACHA_BLOCK_SIZE * 3 / (2 * sizeof(u64))];
> > +             u32 entropy_u32[CHACHA_BLOCK_SIZE * 3 / (2 * sizeof(u32))];
> >       };
>
> The above numbers could use an explanation.

It's 32 bytes from fast key erasure + one full chacha block. I'll
leave a big comment there explaining this.
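(With CHACHA_BLOCK_SIZE == 64 that works out to 32 + 64 = 96 bytes, i.e.
CHACHA_BLOCK_SIZE * 3 / 2, or 12 u64s / 24 u32s per batch.)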

> There's also a comment at the top of the file that still claims that
> get_random_int() et al. don't implement backtracking protection.  That needs to
> be updated.

Will do. Saw this too after submitting.

Thanks again for your review!

Jason

Patch

diff --git a/drivers/char/random.c b/drivers/char/random.c
index ed7fcef1ba31..ac22dc94b037 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -300,20 +300,6 @@  static struct fasync_struct *fasync;
 static DEFINE_SPINLOCK(random_ready_list_lock);
 static LIST_HEAD(random_ready_list);
 
-struct crng_state {
-	u32 state[16];
-	unsigned long init_time;
-	spinlock_t lock;
-};
-
-static struct crng_state primary_crng = {
-	.lock = __SPIN_LOCK_UNLOCKED(primary_crng.lock),
-	.state[0] = CHACHA_CONSTANT_EXPA,
-	.state[1] = CHACHA_CONSTANT_ND_3,
-	.state[2] = CHACHA_CONSTANT_2_BY,
-	.state[3] = CHACHA_CONSTANT_TE_K,
-};
-
 /*
  * crng_init =  0 --> Uninitialized
  *		1 --> Initialized
@@ -325,9 +311,6 @@  static struct crng_state primary_crng = {
 static int crng_init = 0;
 #define crng_ready() (likely(crng_init > 1))
 static int crng_init_cnt = 0;
-#define CRNG_INIT_CNT_THRESH (2 * CHACHA_KEY_SIZE)
-static void extract_crng(u8 out[CHACHA_BLOCK_SIZE]);
-static void crng_backtrack_protect(u8 tmp[CHACHA_BLOCK_SIZE], int used);
 static void process_random_ready_list(void);
 static void _get_random_bytes(void *buf, int nbytes);
 
@@ -465,7 +448,28 @@  static void credit_entropy_bits(int nbits)
  *
  *********************************************************************/
 
-#define CRNG_RESEED_INTERVAL (300 * HZ)
+enum {
+	CRNG_RESEED_INTERVAL = 300 * HZ,
+	CRNG_INIT_CNT_THRESH = 2 * CHACHA_KEY_SIZE
+};
+
+struct {
+	u8 key[CHACHA_KEY_SIZE] __aligned(__alignof__(long));
+	unsigned long birth;
+	unsigned long generation;
+	spinlock_t lock;
+} base_crng;
+
+struct crng {
+	u8 key[CHACHA_KEY_SIZE];
+	unsigned long generation;
+	local_lock_t lock;
+};
+
+static DEFINE_PER_CPU(struct crng, crngs) = {
+	.generation = ULONG_MAX,
+	.lock = INIT_LOCAL_LOCK(crngs.lock),
+};
 
 static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);
 
@@ -482,22 +486,22 @@  static size_t crng_fast_load(const u8 *cp, size_t len)
 	u8 *p;
 	size_t ret = 0;
 
-	if (!spin_trylock_irqsave(&primary_crng.lock, flags))
+	if (!spin_trylock_irqsave(&base_crng.lock, flags))
 		return 0;
 	if (crng_init != 0) {
-		spin_unlock_irqrestore(&primary_crng.lock, flags);
+		spin_unlock_irqrestore(&base_crng.lock, flags);
 		return 0;
 	}
-	p = (u8 *)&primary_crng.state[4];
+	p = base_crng.key;
 	while (len > 0 && crng_init_cnt < CRNG_INIT_CNT_THRESH) {
-		p[crng_init_cnt % CHACHA_KEY_SIZE] ^= *cp;
+		p[crng_init_cnt % sizeof(base_crng.key)] ^= *cp;
 		cp++; crng_init_cnt++; len--; ret++;
 	}
 	if (crng_init_cnt >= CRNG_INIT_CNT_THRESH) {
 		invalidate_batched_entropy();
 		crng_init = 1;
 	}
-	spin_unlock_irqrestore(&primary_crng.lock, flags);
+	spin_unlock_irqrestore(&base_crng.lock, flags);
 	if (crng_init == 1)
 		pr_notice("fast init done\n");
 	return ret;
@@ -522,14 +526,14 @@  static int crng_slow_load(const u8 *cp, size_t len)
 	unsigned long flags;
 	static u8 lfsr = 1;
 	u8 tmp;
-	unsigned int i, max = CHACHA_KEY_SIZE;
+	unsigned int i, max = sizeof(base_crng.key);
 	const u8 *src_buf = cp;
-	u8 *dest_buf = (u8 *)&primary_crng.state[4];
+	u8 *dest_buf = base_crng.key;
 
-	if (!spin_trylock_irqsave(&primary_crng.lock, flags))
+	if (!spin_trylock_irqsave(&base_crng.lock, flags))
 		return 0;
 	if (crng_init != 0) {
-		spin_unlock_irqrestore(&primary_crng.lock, flags);
+		spin_unlock_irqrestore(&base_crng.lock, flags);
 		return 0;
 	}
 	if (len > max)
@@ -540,38 +544,40 @@  static int crng_slow_load(const u8 *cp, size_t len)
 		lfsr >>= 1;
 		if (tmp & 1)
 			lfsr ^= 0xE1;
-		tmp = dest_buf[i % CHACHA_KEY_SIZE];
-		dest_buf[i % CHACHA_KEY_SIZE] ^= src_buf[i % len] ^ lfsr;
+		tmp = dest_buf[i % sizeof(base_crng.key)];
+		dest_buf[i % sizeof(base_crng.key)] ^= src_buf[i % len] ^ lfsr;
 		lfsr += (tmp << 3) | (tmp >> 5);
 	}
-	spin_unlock_irqrestore(&primary_crng.lock, flags);
+	spin_unlock_irqrestore(&base_crng.lock, flags);
 	return 1;
 }
 
 static void crng_reseed(void)
 {
 	unsigned long flags;
-	int i, entropy_count;
-	union {
-		u8 block[CHACHA_BLOCK_SIZE];
-		u32 key[8];
-	} buf;
+	int entropy_count;
+	unsigned long next_gen;
+	u8 key[CHACHA_KEY_SIZE];
 
 	do {
 		entropy_count = READ_ONCE(input_pool.entropy_count);
 		if (entropy_count < POOL_MIN_BITS)
 			return;
 	} while (cmpxchg(&input_pool.entropy_count, entropy_count, 0) != entropy_count);
-	extract_entropy(buf.key, sizeof(buf.key));
+	extract_entropy(key, sizeof(key));
 	wake_up_interruptible(&random_write_wait);
 	kill_fasync(&fasync, SIGIO, POLL_OUT);
 
-	spin_lock_irqsave(&primary_crng.lock, flags);
-	for (i = 0; i < 8; i++)
-		primary_crng.state[i + 4] ^= buf.key[i];
-	memzero_explicit(&buf, sizeof(buf));
-	WRITE_ONCE(primary_crng.init_time, jiffies);
-	spin_unlock_irqrestore(&primary_crng.lock, flags);
+	spin_lock_irqsave(&base_crng.lock, flags);
+	memcpy(base_crng.key, key, sizeof(base_crng.key));
+	next_gen = base_crng.generation + 1;
+	if (next_gen == ULONG_MAX)
+		++next_gen;
+	WRITE_ONCE(base_crng.generation, next_gen);
+	base_crng.birth = jiffies;
+	spin_unlock_irqrestore(&base_crng.lock, flags);
+	memzero_explicit(key, sizeof(key));
+
 	if (crng_init < 2) {
 		invalidate_batched_entropy();
 		crng_init = 2;
@@ -592,50 +598,88 @@  static void crng_reseed(void)
 	}
 }
 
-static void extract_crng(u8 out[CHACHA_BLOCK_SIZE])
+/*
+ * The general form here is based on a "fast key erasure RNG" from:
+ * https://blog.cr.yp.to/20170723-random.html
+ */
+static void crng_fast_key_erasure(u8 key[CHACHA_KEY_SIZE],
+				  u32 chacha_state[CHACHA_STATE_WORDS],
+				  u8 random_data[32], size_t random_data_len)
 {
-	unsigned long flags, init_time;
+	u8 first_block[CHACHA_BLOCK_SIZE];
 
-	if (crng_ready()) {
-		init_time = READ_ONCE(primary_crng.init_time);
-		if (time_after(jiffies, init_time + CRNG_RESEED_INTERVAL))
-			crng_reseed();
-	}
-	spin_lock_irqsave(&primary_crng.lock, flags);
-	chacha20_block(&primary_crng.state[0], out);
-	if (primary_crng.state[12] == 0)
-		primary_crng.state[13]++;
-	spin_unlock_irqrestore(&primary_crng.lock, flags);
+	chacha_init_consts(chacha_state);
+	memcpy(&chacha_state[4], key, CHACHA_KEY_SIZE);
+	memset(&chacha_state[12], 0, sizeof(u32) * 4);
+	chacha20_block(chacha_state, first_block);
+
+	memcpy(key, first_block, CHACHA_KEY_SIZE);
+	memcpy(random_data, first_block + CHACHA_KEY_SIZE, random_data_len);
+	memzero_explicit(first_block, sizeof(first_block));
 }
 
 /*
- * Use the leftover bytes from the CRNG block output (if there is
- * enough) to mutate the CRNG key to provide backtracking protection.
+ * This function returns a ChaCha block that you may use for generating
+ * random data. It also returns up to 32 bytes on its own of random data
+ * that may be used.
  */
-static void crng_backtrack_protect(u8 tmp[CHACHA_BLOCK_SIZE], int used)
+static void crng_make_state(u32 chacha_state[CHACHA_STATE_WORDS],
+			    u8 random_data[32], size_t random_data_len)
 {
 	unsigned long flags;
-	u32 *s, *d;
-	int i;
+	struct crng *crng;
+
+	/* For the fast path, we check this unlocked first. */
+	if (unlikely(!crng_ready())) {
+		bool not_ready;
+
+		spin_lock_irqsave(&base_crng.lock, flags);
+		/* Now that we're locked, re-check. */
+		not_ready = !crng_ready();
+		if (not_ready)
+			crng_fast_key_erasure(base_crng.key, chacha_state,
+					      random_data, random_data_len);
+		spin_unlock_irqrestore(&base_crng.lock, flags);
+		if (!not_ready)
+			return;
+	}
+
+	if (unlikely(time_after(jiffies, READ_ONCE(base_crng.birth) + CRNG_RESEED_INTERVAL)))
+		crng_reseed();
+
+	local_lock_irqsave(&crngs.lock, flags);
+	crng = raw_cpu_ptr(&crngs);
 
-	used = round_up(used, sizeof(u32));
-	if (used + CHACHA_KEY_SIZE > CHACHA_BLOCK_SIZE) {
-		extract_crng(tmp);
-		used = 0;
+	if (unlikely(crng->generation != READ_ONCE(base_crng.generation))) {
+		spin_lock(&base_crng.lock);
+		crng_fast_key_erasure(base_crng.key, chacha_state,
+				      crng->key, sizeof(crng->key));
+		crng->generation = base_crng.generation;
+		spin_unlock(&base_crng.lock);
 	}
-	spin_lock_irqsave(&primary_crng.lock, flags);
-	s = (u32 *)&tmp[used];
-	d = &primary_crng.state[4];
-	for (i = 0; i < 8; i++)
-		*d++ ^= *s++;
-	spin_unlock_irqrestore(&primary_crng.lock, flags);
+
+	crng_fast_key_erasure(crng->key, chacha_state, random_data, random_data_len);
+	local_unlock_irqrestore(&crngs.lock, flags);
 }
 
-static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
+static ssize_t get_random_bytes_user(void __user *buf, size_t nbytes)
 {
-	ssize_t ret = 0, i = CHACHA_BLOCK_SIZE;
-	u8 tmp[CHACHA_BLOCK_SIZE] __aligned(4);
-	int large_request = (nbytes > 256);
+	bool large_request = (nbytes > 256);
+	ssize_t ret = 0, len;
+	u32 chacha_state[CHACHA_STATE_WORDS];
+	u8 output[CHACHA_BLOCK_SIZE];
+
+	if (!nbytes)
+		return 0;
+
+	len = min_t(ssize_t, 32, nbytes);
+	crng_make_state(chacha_state, output, len);
+
+	if (copy_to_user(buf, output, len))
+		return -EFAULT;
+	nbytes -= len;
+	buf += len;
+	ret += len;
 
 	while (nbytes) {
 		if (large_request && need_resched()) {
@@ -647,22 +691,23 @@  static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 			schedule();
 		}
 
-		extract_crng(tmp);
-		i = min_t(int, nbytes, CHACHA_BLOCK_SIZE);
-		if (copy_to_user(buf, tmp, i)) {
+		chacha20_block(chacha_state, output);
+		if (unlikely(chacha_state[12] == 0))
+			++chacha_state[13];
+
+		len = min_t(ssize_t, nbytes, CHACHA_BLOCK_SIZE);
+		if (copy_to_user(buf, output, len)) {
 			ret = -EFAULT;
 			break;
 		}
 
-		nbytes -= i;
-		buf += i;
-		ret += i;
+		nbytes -= len;
+		buf += len;
+		ret += len;
 	}
-	crng_backtrack_protect(tmp, i);
-
-	/* Wipe data just written to memory */
-	memzero_explicit(tmp, sizeof(tmp));
 
+	memzero_explicit(chacha_state, sizeof(chacha_state));
+	memzero_explicit(output, sizeof(output));
 	return ret;
 }
 
@@ -982,23 +1027,37 @@  static void _warn_unseeded_randomness(const char *func_name, void *caller, void
  */
 static void _get_random_bytes(void *buf, int nbytes)
 {
-	u8 tmp[CHACHA_BLOCK_SIZE] __aligned(4);
+	u32 chacha_state[CHACHA_STATE_WORDS];
+	u8 tmp[CHACHA_BLOCK_SIZE];
+	ssize_t len;
 
 	trace_get_random_bytes(nbytes, _RET_IP_);
 
-	while (nbytes >= CHACHA_BLOCK_SIZE) {
-		extract_crng(buf);
-		buf += CHACHA_BLOCK_SIZE;
-		nbytes -= CHACHA_BLOCK_SIZE;
+	if (!nbytes)
+		return;
+
+	len = min_t(ssize_t, 32, nbytes);
+	crng_make_state(chacha_state, buf, len);
+	nbytes -= len;
+	buf += len;
+
+	while (nbytes) {
+		if (nbytes < CHACHA_BLOCK_SIZE) {
+			chacha20_block(chacha_state, tmp);
+			memcpy(buf, tmp, nbytes);
+			memzero_explicit(tmp, sizeof(tmp));
+			break;
+		}
+
+		len = min_t(ssize_t, nbytes, CHACHA_BLOCK_SIZE);
+		chacha20_block(chacha_state, buf);
+		if (unlikely(chacha_state[12] == 0))
+			++chacha_state[13];
+		nbytes -= len;
+		buf += len;
 	}
 
-	if (nbytes > 0) {
-		extract_crng(tmp);
-		memcpy(buf, tmp, nbytes);
-		crng_backtrack_protect(tmp, nbytes);
-	} else
-		crng_backtrack_protect(tmp, CHACHA_BLOCK_SIZE);
-	memzero_explicit(tmp, sizeof(tmp));
+	memzero_explicit(chacha_state, sizeof(chacha_state));
 }
 
 void get_random_bytes(void *buf, int nbytes)
@@ -1229,13 +1288,12 @@  int __init rand_initialize(void)
 		mix_pool_bytes(&rv, sizeof(rv));
 	}
 
-	extract_entropy(&primary_crng.state[4], sizeof(u32) * 12);
+	extract_entropy(base_crng.key, sizeof(base_crng.key));
 	if (arch_init && trust_cpu && crng_init < 2) {
 		invalidate_batched_entropy();
 		crng_init = 2;
 		pr_notice("crng init done (trusting CPU's manufacturer)\n");
 	}
-	primary_crng.init_time = jiffies - CRNG_RESEED_INTERVAL - 1;
 
 	if (ratelimit_disable) {
 		urandom_warning.interval = 0;
@@ -1267,7 +1325,7 @@  static ssize_t urandom_read_nowarn(struct file *file, char __user *buf,
 	int ret;
 
 	nbytes = min_t(size_t, nbytes, INT_MAX >> 6);
-	ret = extract_crng_user(buf, nbytes);
+	ret = get_random_bytes_user(buf, nbytes);
 	trace_urandom_read(8 * nbytes, 0, input_pool.entropy_count);
 	return ret;
 }
@@ -1583,8 +1641,8 @@  static atomic_t batch_generation = ATOMIC_INIT(0);
 
 struct batched_entropy {
 	union {
-		u64 entropy_u64[CHACHA_BLOCK_SIZE / sizeof(u64)];
-		u32 entropy_u32[CHACHA_BLOCK_SIZE / sizeof(u32)];
+		u64 entropy_u64[CHACHA_BLOCK_SIZE * 3 / (2 * sizeof(u64))];
+		u32 entropy_u32[CHACHA_BLOCK_SIZE * 3 / (2 * sizeof(u32))];
 	};
 	local_lock_t lock;
 	unsigned int position;
@@ -1593,11 +1651,9 @@  struct batched_entropy {
 
 /*
  * Get a random word for internal kernel use only. The quality of the random
- * number is good as /dev/urandom, but there is no backtrack protection, with
- * the goal of being quite fast and not depleting entropy. In order to ensure
- * that the randomness provided by this function is okay, the function
- * wait_for_random_bytes() should be called and return 0 at least once at any
- * point prior.
+ * number is good as /dev/urandom. In order to ensure that the randomness
+ * provided by this function is okay, the function wait_for_random_bytes()
+ * should be called and return 0 at least once at any point prior.
  */
 static DEFINE_PER_CPU(struct batched_entropy, batched_entropy_u64) = {
 	.lock = INIT_LOCAL_LOCK(batched_entropy_u64.lock)
@@ -1619,12 +1675,14 @@  u64 get_random_u64(void)
 	next_gen = atomic_read(&batch_generation);
 	if (batch->position % ARRAY_SIZE(batch->entropy_u64) == 0 ||
 	    next_gen != batch->generation) {
-		extract_crng((u8 *)batch->entropy_u64);
+		get_random_bytes(batch->entropy_u64, sizeof(batch->entropy_u64));
 		batch->position = 0;
 		batch->generation = next_gen;
 	}
 
-	ret = batch->entropy_u64[batch->position++];
+	ret = batch->entropy_u64[batch->position];
+	batch->entropy_u64[batch->position] = 0;
+	++batch->position;
 	local_unlock_irqrestore(&batched_entropy_u64.lock, flags);
 	return ret;
 }
@@ -1650,12 +1708,14 @@  u32 get_random_u32(void)
 	next_gen = atomic_read(&batch_generation);
 	if (batch->position % ARRAY_SIZE(batch->entropy_u32) == 0 ||
 	    next_gen != batch->generation) {
-		extract_crng((u8 *)batch->entropy_u32);
+		get_random_bytes(batch->entropy_u32, sizeof(batch->entropy_u32));
 		batch->position = 0;
 		batch->generation = next_gen;
 	}
 
-	ret = batch->entropy_u32[batch->position++];
+	ret = batch->entropy_u32[batch->position];
+	batch->entropy_u32[batch->position] = 0;
+	++batch->position;
 	local_unlock_irqrestore(&batched_entropy_u32.lock, flags);
 	return ret;
 }