
[2/3] random: make /dev/urandom scalable for silly userspace programs

Message ID 1462170413-7164-3-git-send-email-tytso@mit.edu (mailing list archive)
State Not Applicable
Delegated to: Herbert Xu

Commit Message

Theodore Ts'o May 2, 2016, 6:26 a.m. UTC
On a 4-socket NUMA system where a large number of
application processes were all trying to read from /dev/urandom, this
can result in the system spending 80% of its time contending on the
global urandom spinlock.  The application should have used its own
PRNG, but let's try to keep it from running, lemming-like, straight
over the locking cliff.

Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 drivers/char/random.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 62 insertions(+), 5 deletions(-)

Comments

Stephan Mueller May 2, 2016, 7 a.m. UTC | #1
On Monday, 2 May 2016 at 02:26:52, Theodore Ts'o wrote:

Hi Theodore,

I have not digested the patch set yet, but I have the following questions to 
your patch set.

> On a system with a 4 socket (NUMA) system where a large number of
> application processes were all trying to read from /dev/urandom, this
> can result in the system spending 80% of its time contending on the
> global urandom spinlock.  The application have used its own PRNG, but
> let's try to help it from running, lemming-like, straight over the
> locking cliff.

- initialization: In my DRBG-based patch-set I tried to serialize the 
initialization of the per-NUMA node RNGs as follows: first the node 0 pool is 
seeded completely, followed by the other nodes in a completely serial fashion. 
If during that initialization time, say, node 3 wants some random number, but 
the RNG for node 3 is not yet fully seeded, it goes back to the "default" RNG 
of node 0. This way, it is ensured that we try to have properly seeded RNGs 
even during heavy load at boot time. Would that make sense here?

- reseed avalanche: I see that you added a time-based reseed code too (I am 
glad about that one). What I fear is that there is a reseed avalanche when the 
various RNGs are seeded initially closely after each other (and thus the 
reseed timer will expire at the same time). That would mean they can all be 
reseeded at the same time again when the timer-based threshold expires and 
drain the input_pool such that if you have many nodes, the input pool will not 
have sufficient capacity (I am not speaking about entropy, but the potential 
to store entropy) to satisfy all RNGs at the same time. Hence, we would then 
have the potential to have entropy-starved RNGs.

- entropy pool draining: with timer-based reseeding on a quiet system, the 
entropy pool can be drained while waiting for the timer to expire. So, I 
tried to handle that by increasing the timer by, say, 100 seconds for each new 
NUMA node. Note, even the baseline of 300 seconds with CRNG_RESEED_INTERVAL is 
low. When I experimented with that on a KVM test system and left it quiet, 
entropy pool draining was prevented at around 500 seconds.

Ciao
Stephan
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o May 2, 2016, 12:50 p.m. UTC | #2
On Mon, May 02, 2016 at 09:00:22AM +0200, Stephan Mueller wrote:
> - reseed avalanche: I see that you added a time-based reseed code too (I am 
> glad about that one). What I fear is that there is a reseed avalanche when the 
> various RNGs are seeded initially closely after each other (and thus the 
> reseed timer will expire at the same time). That would mean that they can be 
> reseeded all at the same time again when the timer based threshold expires and 
> drain the input_pool such that if you have many nodes, the input pool will not 
> have sufficient capacity (I am not speaking about entropy, but the potential 
> to store entropy) to satisfy all RNGs at the same time. Hence, we would then 
> have the potential to have entropy-starved RNGs.

The crng is a CRNG, not an entropy pool.  So we don't pretend to track
entropy on the CRNGs at all.  The current rule is that when you draw
from a crng, if it has been over 5 minutes, it will reseed from its
"parent" source.  In the case of the primary_crng, it will draw between
128 and 256 bits of entropy from the input pool.  In the per-NUMA node
case, they draw from the primary_crng.

So if there are many secondary (per-NUMA node) CRNGs that are seeded
within five minutes of each other, the input pool only gets drawn down
once to seed the primary_crng.  The per-NUMA node crngs feed from the
primary crng, and absent some catastrophic security breach where the
adversary can read kernel memory (at which point you're toast anyway),
the output of the primary_crng is never exposed directly outside of
the system.  So even if you have some crazy SGI system with 1024 NUMA
nodes, the primary_crng will only be generating at most 32k worth of
data to seed the secondary crngs before it gets reseeded --- and the
input pool is only going to be debited at most 128-256 bits of entropy
each time.

I thought about using the primary_crng to serve double duty as the
CRNG for NUMA node 0, but I decided that on a NUMA system you have
TBs and TBs of memory, and so blowing another 80 bytes or so on a
separate primary_crng state makes the security analysis much simpler,
and the code much simpler.  I also thought about only dynamically
initializing a node_id's CRNG if a spin_trylock on node 0's CRNG
failed, but again, decided against it in the interests of keeping
things simple and because NUMA people can afford to be profligate with
memory --- they're blowing way more than 80 bytes per NUMA node
anyway.  Besides, manufacturers of crazy-expensive NUMA systems have to
feed their children, too.  :-)

> - entropy pool draining: when having a timer-based reseeding on a quiet 
> system, the entropy pool can be drained during the expiry of the timer. So, I 
> tried to handle that by increasing the timer by, say, 100 seconds for each new 
> NUMA node. Note, even the baseline of 300 seconds with CRNG_RESEED_INTERVAL is 
> low. When I experimented with that on a KVM test system and left it quiet, 
> entropy pool draining was prevented at around 500 seconds.

Sure, but if no one is actually *using* the system, who cares about
whether the input pool's entropy is getting drawn down?  The usual
reason why we might want to worry about reseeding frequently is if the
system is generating a huge amount of randomness for some reason.
This might be a good reason (you're running an IPsec server and
generating lots of IKE session keys) or it might be for a really
stupid reason (dd if=/dev/urandom of=/dev/sdX bs=4k), but either way,
there will be lots of disk or networking interrupts to feed the input
pool.

I have thought about adding something a bit more sophisticated to
control the reseed logic (either tracking amount of data used, or
making the reseed interval adjustable, or dynamically adjustable), but
this was the simplest thing to do as a starting point.  Besides, for
the people who believe that it's realistic to write academic papers
about recovering from catastrophic security exposures where the bad
guy can read arbitrary kernel memory, and has somehow _not_ managed to
bootstrap that into a full privilege escalation attack and installed a
backdoor into your BIOS so that you are permanently pwned, they might
be happy that we will be trying to recover within 5 minutes.  :-)

					- Ted
Theodore Ts'o May 2, 2016, 1:48 p.m. UTC | #3
On Mon, May 02, 2016 at 08:50:14AM -0400, Theodore Ts'o wrote:
> > - entropy pool draining: when having a timer-based reseeding on a quiet 
> > system, the entropy pool can be drained during the expiry of the timer. So, I 
> > tried to handle that by increasing the timer by, say, 100 seconds for each new 
> > NUMA node. Note, even the baseline of 300 seconds with CRNG_RESEED_INTERVAL is 
> > low. When I experimented with that on a KVM test system and left it quiet, 
> > entropy pool draining was prevented at around 500 seconds.

One other thought.  If your KVM test system was completely quiet, then
all of the entropy was coming from timer interrupts.  It is an open
question whether an adversary could predict the bit of "entropy" you
are generating with better than 50% probability if both the host and
the guest system are quiescent.  And if they can, then assuming
one bit of entropy per interrupt might be too optimistic.

This is especially true on bare metal where, very often, especially
on smaller machines, there is a single oscillator from which all of
the clocks on the SoC or motherboard are derived.  There is a reason
why I was being ultra conservative in sampling 64 interrupts into a
32-bit fast-mix pool before mixing it into the input pool, and only
crediting the pool with a single bit of entropy each time I did this.

(It's also because of this conservatism that I was comfortable with
having add_disk_randomness giving some extra credit for interrupts
that are probably more likely to be hard-to-predict by an adversary.
Especially if the interrupts are coming from a device with spinning
rust platters.)

						- Ted
Stephan Mueller May 2, 2016, 1:53 p.m. UTC | #4
On Monday, 2 May 2016 at 09:48:57, Theodore Ts'o wrote:

Hi Theodore,

> On Mon, May 02, 2016 at 08:50:14AM -0400, Theodore Ts'o wrote:
> > > - entropy pool draining: when having a timer-based reseeding on a quiet
> > > system, the entropy pool can be drained during the expiry of the timer.
> > > So, I tried to handle that by increasing the timer by, say, 100 seconds
> > > for each new NUMA node. Note, even the baseline of 300 seconds with
> > > CRNG_RESEED_INTERVAL is low. When I experimented with that on a KVM
> > > test system and left it quiet, entropy pool draining was prevented at
> > > around 500 seconds.
> 
> One other thought.  If your KVM test system was completely quiet, then
> all of the entropy was coming from timer interrupts.  It is an open
> question whether an adversary could predict the bit of "entropy" you
> are generating with better than 50% probability if both the host and
> the guest system are quiescent.  And if they can, then maybe assuming
> one bit of entropy per interrupt might be a too optimistic.

It is not entirely quiet -- systemd likes to dump data to the disk once in a 
while.  So it is not timer interrupts that I am seeing.

Note, my test system runs a tickless kernel.
> 
> This is especially true on bare metal where very often, especially on
> smaller machines, where there is a single oscillator from which all of
> the clocks on the SOC or motherboard are derived.  There is a reason
> why I was being ultra conservative in sampling 64 interrupts into a
> 32-bit fast-mix pool before mixing it into the input pool, and only
> crediting the pool with a single bit of entropy each time I did this.

As I do not think that we see any timer interrupts, I think this argument may 
not apply here.

Besides, I have not seen any timer interrupts lately (with or without a 
tickless kernel). Thus, which interrupts do you think are timer interrupts?


Ciao
Stephan

Patch

diff --git a/drivers/char/random.c b/drivers/char/random.c
index 95f4451..d5bb3b3 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -746,6 +746,17 @@  struct crng_state primary_crng = {
 };
 static DECLARE_WAIT_QUEUE_HEAD(crng_init_wait);
 
+#ifdef CONFIG_NUMA
+/*
+ * Hack to deal with crazy userspace programs when they are all trying
+ * to access /dev/urandom in parallel.  The programs are almost
+ * certainly doing something terribly wrong, but we'll work around
+ * their brain damage.
+ */
+static struct crng_state **crng_node_pool __read_mostly;
+#endif
+
+
 static void _initialize_crng(struct crng_state *crng)
 {
 	int		i;
@@ -761,11 +772,13 @@  static void _initialize_crng(struct crng_state *crng)
 	crng->init_time = jiffies - CRNG_RESEED_INTERVAL;
 }
 
+#ifdef CONFIG_NUMA
 static void initialize_crng(struct crng_state *crng)
 {
 	_initialize_crng(crng);
 	spin_lock_init(&crng->lock);
 }
+#endif
 
 static int crng_fast_load(__u32 pool[4])
 {
@@ -822,19 +835,23 @@  out:
 	return ret;
 }
 
+static inline void maybe_reseed_primary_crng(void)
+{
+	if (crng_init > 2 &&
+	    time_after(jiffies, primary_crng.init_time + CRNG_RESEED_INTERVAL))
+		crng_reseed(&input_pool);
+}
+
 static inline void crng_wait_ready(void)
 {
 	wait_event_interruptible(crng_init_wait, crng_ready());
 }
 
-static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+static void _extract_crng(struct crng_state *crng,
+			  __u8 out[CHACHA20_BLOCK_SIZE])
 {
 	unsigned long v, flags;
-	struct crng_state *crng = &primary_crng;
 
-	if (crng_init > 2 &&
-	    time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL))
-		crng_reseed(&input_pool);
 	spin_lock_irqsave(&crng->lock, flags);
 	if (arch_get_random_long(&v))
 		crng->state[14] ^= v;
@@ -844,6 +861,30 @@  static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
 	spin_unlock_irqrestore(&crng->lock, flags);
 }
 
+static void extract_crng(__u8 out[CHACHA20_BLOCK_SIZE])
+{
+#ifndef CONFIG_NUMA
+	maybe_reseed_primary_crng();
+	_extract_crng(&primary_crng, out);
+#else
+	int node_id = numa_node_id();
+	struct crng_state *crng = crng_node_pool[node_id];
+
+	if (time_after(jiffies, crng->init_time + CRNG_RESEED_INTERVAL)) {
+		unsigned long flags;
+
+		maybe_reseed_primary_crng();
+		_extract_crng(&primary_crng, out);
+		spin_lock_irqsave(&crng->lock, flags);
+		memcpy(&crng->state[4], out, CHACHA20_KEY_SIZE);
+		crng->state[15] = numa_node_id();
+		crng->init_time = jiffies;
+		spin_unlock_irqrestore(&crng->lock, flags);
+	}
+	_extract_crng(crng, out);
+#endif
+}
+
 static ssize_t extract_crng_user(void __user *buf, size_t nbytes)
 {
 	ssize_t ret = 0, i;
@@ -1548,6 +1589,22 @@  static void init_std_data(struct entropy_store *r)
  */
 static int rand_initialize(void)
 {
+#ifdef CONFIG_NUMA
+	int i;
+	int num_nodes = num_possible_nodes();
+	struct crng_state *crng;
+
+	crng_node_pool = kmalloc(num_nodes * sizeof(void *),
+				 GFP_KERNEL|__GFP_NOFAIL);
+
+	for (i=0; i < num_nodes; i++) {
+		crng = kmalloc(sizeof(struct crng_state),
+			       GFP_KERNEL | __GFP_NOFAIL);
+		initialize_crng(crng);
+		crng_node_pool[i] = crng;
+
+	}
+#endif
 	init_std_data(&input_pool);
 	init_std_data(&blocking_pool);
 	_initialize_crng(&primary_crng);