Message ID | 20211230165052.2698-1-Jason@zx2c4.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Herbert Xu |
Headers | show |
Series | random: avoid superfluous call to RDRAND in CRNG extraction | expand |
On Thu, Dec 30, 2021 at 05:50:52PM +0100, Jason A. Donenfeld wrote: > RDRAND is not fast. RDRAND is actually quite slow. We've known this for > a while, which is why functions like get_random_u{32,64} were converted > to use batching of our ChaCha-based CRNG instead. > > Yet CRNG extraction still includes a call to RDRAND, in the hot path of > every call to get_random_bytes(), /dev/urandom, and getrandom(2). > > This call to RDRAND here seems quite superfluous. CRNG is already > extracting things based on a 256-bit key, based on good entropy, which > is then reseeded periodically, updated, backtrack-mutated, and so > forth. The CRNG extraction construction is something that we're already > relying on to be secure and solid. If it's not, that's a serious > problem, and it's unlikely that mixing in a measly 32 bits from RDRAND > is going to alleviate things. > > There is one place, though, where such last-ditch moves might be > quasi-sensible, and that's before the CRNG is actually ready. In that case, > we're already very much operating from a position of trying to get > whatever we can, so we might as well throw in the RDRAND call because > why not. So I'm not sure we how desperately we *need* the 370% performance improvement, but realistically speaking, in crng_init_try_arch_early(), which gets called from rand_initialize(), we will have already set crng->state[4..15] via RDSEED or RDRAND. So there's no point in setting crng->state[0] from RDRAND. So if we're wanting to speed things up, we should just remove the crng->state[0] <= RDRAND entirely. Or if we want to improve the security of get_random_bytes() pre crng_ready(), then we should try to XOR RDRAND bytes into all returned buffer from get_random_bytes(). In other words, I'd argue that we should "go big, or go home". (And if we do have some real, security-critical users of get_random_bytes() pre-crng_ready(), maybe "go big" is the right way to go. Of course, if those do exist, we're still screwed for those architectures which don't have an RDRAND or equivalent --- arm32, RISC-V, I'm looking at you.) - Ted
Hi Ted, On 12/30/21, Theodore Ts'o <tytso@mit.edu> wrote: > but realistically speaking, in > crng_init_try_arch_early(), which gets called from rand_initialize(), > we will have already set crng->state[4..15] via RDSEED or RDRAND. > > So there's no point in setting crng->state[0] from RDRAND. So if > we're wanting to speed things up, we should just remove the > crng->state[0] <= RDRAND entirely. Good point, and that seems reasonable. I'll do that for v+1. > Or if we want to improve the security of get_random_bytes() pre > crng_ready(), then we should try to XOR RDRAND bytes into all returned > buffer from get_random_bytes(). In other words, I'd argue that we > should "go big, or go home". (And if we do have some real, > security-critical users of get_random_bytes() pre-crng_ready(), maybe > "go big" is the right way to go. That's a decent way of looking at it. Rather than dallying with 32bits, we may as well go all the way. Or, to compromise on efficiency, we could just xor in 16 or 32 bytes into the key rows prior to each extraction. Alternatively, we have fewer things to think about with the "go home" route, and then it's just a matter of important users using get_random_bytes_wait(), which I think I mostly took care of through the tree a few years back. > So I'm not sure we how desperately we *need* the 370% performance improvement It's not necessary (aside from, like, people using sendfile to erase NVMes or something weird?), but it appeals to me for two reasons: - The superfluous RDRAND with only 32bits really isn't doing much, and having it there makes the design every so slightly more confusing and less straightforward. - I would like to see if at some point (not now, just in the future) it's feasible, performance wise, to replace all of prandom with get_batched_random() and company. I was on some thread a few years ago where a researcher pointed out one place prandom was used when get_random_u64() should have been, and in the ensuing discussion a few more places were found with the same issue, and then more. And then nobody could agree on whether the performance hit was worth it for whichever security model. And in the end I don't recall anything really happening. If that whole discussion could magicially go away because we make all uses secure with no performance hit, it'd be a major win against footguns like prandom. Maybe it won't be feasible in the end, but simplifying a design in the process of seeing seems like decent enough motivation. Jason
On Thu, Dec 30, 2021 at 11:58:05PM +0100, Jason A. Donenfeld wrote: > > Or if we want to improve the security of get_random_bytes() pre > > crng_ready(), then we should try to XOR RDRAND bytes into all returned > > buffer from get_random_bytes(). In other words, I'd argue that we > > should "go big, or go home". (And if we do have some real, > > security-critical users of get_random_bytes() pre-crng_ready(), maybe > > "go big" is the right way to go. > > That's a decent way of looking at it. Rather than dallying with > 32bits, we may as well go all the way. Or, to compromise on > efficiency, we could just xor in 16 or 32 bytes into the key rows > prior to each extraction. Alternatively, we have fewer things to think > about with the "go home" route, and then it's just a matter of > important users using get_random_bytes_wait(), which I think I mostly > took care of through the tree a few years back. I was too lazy to do an audit of all of the get_random_bytes() users before I wrote my last e-mail, but I'm good with "go home" route --- especially since actually doing that full audit to make sure we don't have any pre-crng_ready() security-critical users of get_random_bytes() would still be important to do on architectures like RISC-V that don't have a RDRAND equivalent. The challenge is here is short of making adding a WARN_ON(!crng_ready()) to get_random_bytes(), it's hard to be sure that some future security critical user of get_random_bytes() in early boot won't creep back in. And last I checked, we still have some non-security get_random_bytes() users in early boot where the WARN_ON() isn't going to be welcome. > - I would like to see if at some point (not now, just in the future) > it's feasible, performance wise, to replace all of prandom with > get_batched_random() and company. That's going to be challenging, I suspect. Some of the networking users of prandom() have some *very* strong performance constraints. Or at least, the networking developers have some benchmarks where they won't countenance any performance regressions. When the prandom implementation was added, some of the networking devs were positively doing cycle counting to try to trim it down as much as possible.... - Ted
diff --git a/drivers/char/random.c b/drivers/char/random.c index 54086e9da05b..239b1455b1a8 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1030,7 +1030,7 @@ static void _extract_crng(struct crng_state *crng, &input_pool : NULL); } spin_lock_irqsave(&crng->lock, flags); - if (arch_get_random_long(&v)) + if (unlikely(!crng_ready()) && arch_get_random_long(&v)) crng->state[14] ^= v; chacha20_block(&crng->state[0], out); if (crng->state[12] == 0)
RDRAND is not fast. RDRAND is actually quite slow. We've known this for a while, which is why functions like get_random_u{32,64} were converted to use batching of our ChaCha-based CRNG instead. Yet CRNG extraction still includes a call to RDRAND, in the hot path of every call to get_random_bytes(), /dev/urandom, and getrandom(2). This call to RDRAND here seems quite superfluous. CRNG is already extracting things based on a 256-bit key, based on good entropy, which is then reseeded periodically, updated, backtrack-mutated, and so forth. The CRNG extraction construction is something that we're already relying on to be secure and solid. If it's not, that's a serious problem, and it's unlikely that mixing in a measly 32 bits from RDRAND is going to alleviate things. There is one place, though, where such last-ditch moves might be quasi-sensible, and that's before the CRNG is actually ready. In that case, we're already very much operating from a position of trying to get whatever we can, so we might as well throw in the RDRAND call because why not. But once the CRNG is actually up, it's simply not sensible. Removing the call there improves performance on an i7-11850H by 370%. In other words, the vast majority of the work done by extract_crng() prior to this commit was devoted to fetching 32 bits of RDRAND. This commit fixes the issue by only making that call to RDRAND when the CRNG is not yet ready. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> --- drivers/char/random.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)