Message ID: 20161223233904.11739.qmail@ns.sciencehorizons.net (mailing list archive)
Hi,

On 24.12.2016 00:39, George Spelvin wrote:
> Hannes Frederic Sowa wrote:
>> In general this looks good, but bitbuffer needs to be protected from
>> concurrent access, thus needing at least some atomic instruction and
>> disabling of interrupts for the locking if done outside of
>> get_random_long. Thus I liked your previous approach more where you
>> simply embed this in the already necessary get_random_long and aliased
>> get_random_long as get_random_bits(BITS_PER_LONG) accordingly, IMHO.
>
> It's meant to be part of the same approach, and I didn't include locking
> because that's a requirement for *any* solution, and isn't specific
> to the part I was trying to illustrate.
>
> (As for re-using the name "get_random_long", that was just so
> I didn't have to explain it. Call it "get_random_long_internal"
> if you like.)
>
> Possible locking implementations include:
> 1) Use the same locking as applies to get_random_long_internal(), or
> 2) Make bitbuffer a per-CPU variable (note that we currently have 128
>    bits of per-CPU state in get_random_int_hash[]), and this is all a
>    fast path to bypass heavier locking in get_random_long_internal().

I understood that this is definitely a solvable problem.

>>> But, I just realized I've been overlooking something glaringly
>>> obvious... there's no reason you can't compute multiple blocks in
>>> advance.
>>
>> In the current code at the cost of per-CPU allocations, thus memory.
>
> Yes, but on 4-core machines it's still not much, and 4096-core
> behemoths have RAM to burn.
>
>> In the extract_crng case, couldn't we simply use 8 bytes of the 64-byte
>> return block to feed it directly back into the state chacha? So we pass
>> on 56 bytes into the pcpu buffer, and consume 8 bytes for the next
>> state. This would shrink the maximum anti-backtracking window from the
>> current 300s to 14 get_random_int calls. Not sure if this is worth it.
>
> We just finished discussing why 8 bytes isn't enough. If you only
> feed back 8 bytes, an attacker who can do 2^64 computation can find it
> (by guessing and computing forward to verify the guess) and recover the
> previous state. You need to feed back at least as much output as your
> security target. For /dev/urandom's ChaCha20, that's 256 bits.

I followed the discussion, but it appeared to me that this has the
additional constraint of the time until the next reseeding event
happens, which is 300s (under the assumption that calls to
get_random_int happen regularly, which I expect right now). After that,
the existing reseeding mechanism will ensure enough backtracking
protection. The number of bytes can easily be increased here, given
that reseeding was shown to be quite fast already and we produce enough
output. But I am not sure if this is a bit overengineered in the end?

>>> For example, suppose we gave each CPU a small pool to minimize locking.
>>> When one runs out and calls the global refill, it could refill *all*
>>> of the CPU pools. (You don't even need locking; there may be a race to
>>> determine *which* random numbers the reader sees, but they're equally
>>> good either way.)
>
>> Yes, but still disabled interrupts, otherwise the same state could be
>> used twice on the same CPU. Uff, I think we have this problem in
>> prandom_u32.
>
> There are some locking gotchas, but it is doable lock-free.
>
> Basically, it's a seqlock. The writer increments it once (to an odd
> number) before starting to overwrite the buffer, and a second time (to
> an even number) after. "Before" and "after" mean smp_wmb().
>
> The reader can use this to figure out how much of the data in the buffer
> is safely fresh. The full sequence of checks is a bit intricate,
> but straightforward.
>
> I didn't discuss the locking because I'm confident it's solvable,
> not because I wasn't aware it has to be solved.

Also agreed. Given your solution below to prandom_u32, I do think it
might also work without the seqlock now.
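The odd/even counter scheme described here can be sketched in user space. This is a minimal illustration, not kernel code: conservative seq_cst fences stand in for smp_wmb()/smp_rmb(), the buffer size and all names (pool_seq, pool_refill, pool_read) are made up for the example, and, as noted in the mail, the reader does not spin; it just reports failure so the caller can fall back to a slower path.

```c
/* Sketch: writer bumps the counter to odd before overwriting the
 * buffer and back to even afterwards; a reader that observes the same
 * even value before and after its copy knows the snapshot is stable. */
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

#define POOL_BYTES 64

static _Atomic unsigned int pool_seq;   /* even: stable, odd: refill running */
static unsigned char pool[POOL_BYTES];

static void pool_refill(const unsigned char *fresh)
{
    atomic_fetch_add_explicit(&pool_seq, 1, memory_order_relaxed); /* odd */
    atomic_thread_fence(memory_order_seq_cst);      /* ~ smp_wmb() */
    memcpy(pool, fresh, POOL_BYTES);
    atomic_thread_fence(memory_order_seq_cst);      /* ~ smp_wmb() */
    atomic_fetch_add_explicit(&pool_seq, 1, memory_order_relaxed); /* even */
}

/* Copy a snapshot into out; returns 1 on success, 0 if a refill raced
 * with us (the caller falls back instead of spinning). */
static int pool_read(unsigned char *out)
{
    unsigned int before = atomic_load_explicit(&pool_seq,
                                               memory_order_relaxed);
    if (before & 1)
        return 0;                       /* update in progress */
    atomic_thread_fence(memory_order_seq_cst);      /* ~ smp_rmb() */
    memcpy(out, pool, POOL_BYTES);
    atomic_thread_fence(memory_order_seq_cst);      /* ~ smp_rmb() */
    return atomic_load_explicit(&pool_seq, memory_order_relaxed) == before;
}
```

(Formally, reading a buffer that may be concurrently overwritten is a data race in C11; the kernel gets away with the equivalent, which is part of why this is only a sketch.)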
> As for prandom_u32(), what's the problem? Are you worried that
> get_cpu_var disables preemption but not interrupts, and so an
> ISR might return the same value as process-level code?

Yes, exactly those were my thoughts.

> First of all, that's not a problem because prandom_u32() doesn't
> have security guarantees. Occasionally returning a duplicate number
> is okay.
>
> Second, if you do care, that could be trivially fixed by throwing
> a barrier() in the middle of the code. (Patch appended; S-o-B
> if anyone wants it.)

I would otherwise have added the disabling of irqs, but given that I
really like your proposal, I will take it into my todo branch and
submit it when the net-next window opens.

> diff --git a/lib/random32.c b/lib/random32.c
> index c750216d..6bee4a36 100644
> --- a/lib/random32.c
> +++ b/lib/random32.c
> [...]

Thanks,
Hannes
Hannes Frederic Sowa wrote:
> On 24.12.2016 00:39, George Spelvin wrote:
>> We just finished discussing why 8 bytes isn't enough. If you only
>> feed back 8 bytes, an attacker who can do 2^64 computation can find it
>> (by guessing and computing forward to verify the guess) and recover the
>> previous state. You need to feed back at least as much output as your
>> security target. For /dev/urandom's ChaCha20, that's 256 bits.

> I followed the discussion but it appeared to me that this has the
> additional constraint of time until the next reseeding event happens,
> which is 300s (under the assumption that calls to get_random_int happen
> regularly, which I expect right now). After that the existing reseeding
> mechanism will ensure enough backtracking protection. The number of
> bytes can easily be increased here, given that reseeding was shown to be
> quite fast already and we produce enough output. But I am not sure if
> this is a bit overengineered in the end?

I'm not following your description of how the time-based and call-based
mechanisms interact, but for any mix-back, you should either do enough
or none at all. (Also called "catastrophic reseeding".)

For example, two mix-backs of 64 bits give you 65-bit security, not 128.
(Because each mix-back can be guessed and verified separately.)

> Also agreed. Given your solution below to prandom_u32, I do think it
> might also work without the seqlock now.

It's not technically a seqlock; in particular the reader doesn't
spin. But the write side, and general logic, is so similar that it's a
good mental model.

Basically, assume a 64-byte buffer. The reader has gone through
32 bytes of it, and has 32 left, and when he reads another 8 bytes,
has to distinguish three cases:

1) No update; we read the old bytes and there are now 32 - 8 = 24 bytes
   left.
2) Update completed while we weren't looking. There are now new
   bytes in the buffer, and we took 8, leaving 64 - 8 = 56.
3) Update in progress at the time of the read. We don't know if we are
   seeing old bytes or new bytes, so we have to assume the worst and
   not proceed unless 32 >= 8, but assume at the end there are
   64 - 8 = 56 new bytes left.

> I wouldn't have added a disable irqs, but given that I really like your
> proposal, I would take it in my todo branch and submit it when net-next
> window opens.

If you want that, I have a pile of patches to prandom I really
should push upstream. Shall I refresh them and send them to you?

commit 4cf1b3d9f4fbccc29ffc2fe4ca4ff52ea77253f1
Author: George Spelvin <linux@horizon.com>
Date:   Mon Aug 31 00:05:00 2015 -0400

    net: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    The net/802 code was already efficient, but prandom_u32_max() is
    simpler.

    In net/batman-adv/bat_iv_ogm.c, batadv_iv_ogm_fwd_send_time() got
    changed from picking a random number of milliseconds and converting
    to jiffies to picking a random number of jiffies, since the number
    of milliseconds (and thus the conversion to jiffies) is a
    compile-time constant. The equivalent code in
    batadv_iv_ogm_emit_send_time was not changed, because the number of
    milliseconds is variable.

    In net/ipv6/ip6_flowlabel.c, ip6_flowlabel had htonl(prandom_u32()),
    which is silly. Just cast to __be32 without permuting the bits.

    net/sched/sch_netem.c got adjusted to only need one call to
    prandom_u32 instead of 2. (Assuming skb_headlen can't exceed
    512 MiB, which is hopefully safe for some time yet.)

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 9c8fb80e1fd2be42c35cab1af27187d600fd85e3
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 15:20:47 2014 -0400

    mm/swapfile.c: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 2743eb01e5c5958fd88ae78d19c5fea772d4b117
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 15:19:53 2014 -0400

    lib: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 6a5e91bf395060a3351bfe5efc40ac20ffba2c1b
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 15:18:50 2014 -0400

    fs/xfs: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Also changed the last argument of xfs_error_test() from
    "unsigned long" to "unsigned", since the code never did support
    values > 2^32, and the largest value ever passed is 100.

    The code could be improved even further by passing in 2^32/rf
    rather than rf, but I'll leave that to some XFS developers.

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 6f6d485d9179ca6ec4e30caa06ade0e0c6931810
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 15:00:17 2014 -0400

    fs/ubifs: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    In fs/ubifs/debug.c, the chance() function got rewritten entirely
    to take advantage of the fact its arguments are always constant.

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 0b6bf2c874bbbcfa74f6be0413c772b3ac134d12
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 14:52:17 2014 -0400

    fs/extX: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    In ext4/ialloc.c:436, there's one more chance to do that, but the
    modulo is required to keep the deterministic option consistent, so
    I left it alone. Switching to
    "parent_group = ((u64)grp * ngroups) >> 32;" would be more
    efficient, but would change user-visible behavior.

    Signed-off-by: George Spelvin <linux@horizon.com>

commit e79e0e8b491bc976c0b4e1b2070eccdf55b34f43
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 14:47:15 2014 -0400

    fs/ceph: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Signed-off-by: George Spelvin <linux@horizon.com>

commit fc628326d8cf4abe364ea01259bc392c0bbaf5a0
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 14:46:29 2014 -0400

    drivers/scsi/fcoe/fcoe_ctlr.c: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 4810d67dd2edf08e7801ef47550d46b7397fe2dc
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 14:45:55 2014 -0400

    drivers/ntb/ntb_hw.c: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Signed-off-by: George Spelvin <linux@horizon.com>

commit f4a806abbc0785e8f0363e02fac613246eed9e34
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 14:45:27 2014 -0400

    drivers/net: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    In some cases (like ethernet/broadcom/cnic.c and
    drivers/net/hamradio), the range is a compile-time constant power
    of 2, so the code doesn't get any better, but I'm trying to do a
    full sweep.

    In drivers/net/wireless/brcm80211/brcmfmac/p2p.c, a timeout is
    selected from 100, 200 or 300 ms. It would be easy enough to make
    the granularity finer, but I assume the existing code is that way
    for a reason.

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 603e0c311fac09d633ff6c0fd9242108f1c52ead
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:55:09 2014 -0400

    drivers/mtd: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    And rewrote the 1-in-N code in drivers/mtd/ubi/debug.h to avoid
    even more arithmetic.

    Signed-off-by: George Spelvin <linux@horizon.com>

commit e0657cc865e8e02768906935b8e8bf63af58aa46
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:52:13 2014 -0400

    drivers/mmc/core: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 017ee6841ec8d416093fc1f18bdd3df0dfc6aacc
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:51:33 2014 -0400

    drivers/lguest/page_tables.c: Use prandom_u32_max()

    This doesn't actually change the code because the array size is a
    power of 2 (it's a 4-element array). But I'm on a roll.

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 87556165f46eb42c756bcb94e062c3fbd272a4bf
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:49:41 2014 -0400

    drivers/infiniband: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 1eafe1d429f442218810e8c604d4e7c466414cf3
Author: George Spelvin <linux@horizon.com>
Date:   Sun Aug 30 23:42:41 2015 -0400

    block/blk-mq-tag.c: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Signed-off-by: George Spelvin <linux@horizon.com>

commit f03ee59a63098d244d5b8932fc68c9fc3e2bb222
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:46:52 2014 -0400

    arch/x86: Use prandom_u32_max()

    It's slightly more efficient than "prandom_u32() % range".

    Signed-off-by: George Spelvin <linux@horizon.com>

commit feafd3a3fb09924ea633d677a7ab8a25a817f39d
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 13:44:49 2014 -0400

    lib/random32.c: Remove redundant U suffixes on integers

    Get rid of a few of the extraneous U suffixes on ordinary integers.

    Signed-off-by: George Spelvin <linux@horizon.com>

commit f14328d248e59c862478633479528c9cb8554d7a
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 12:40:19 2014 -0400

    lib/random32.c: Randomize timeout to the millisecond, not the second

    If you're going to bother randomizing it, do it right. And use
    prandom_u32_max(), which is designed for the job, rather than
    "% 40".

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 143342006adfff718afedf58f639b72337d7c816
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 12:51:26 2014 -0400

    lib/random32.c: Make prandom_u32_max efficient for powers of 2

    The multiply-and-shift is efficient in the general case, but slower
    than a simple bitwise AND if the range is a power of 2. Make the
    function handle this case so callers don't have to worry about
    micro-optimizing it.

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 99864db5cb023e6b09d71cde4997a5f6cafb5cca
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 12:02:17 2014 -0400

    lib/random32.c: Use <asm/unaligned.h> instead of hand-rolling it

    The functions exist for a reason; the manual byte-at-a-time code
    is unnecessarily slow (and bloated).

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 4ecd45f6eadd4d171782dc6b451ed1040e47d419
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 11:55:59 2014 -0400

    lib/random32.c: Debloat non-speed-critical code

    Unrolling code in rarely-used code paths is just silly. There are
    two places where static calls to prandom_u32_state() can be removed:

    1) prandom_warmup() calls prandom_u32_state 10 times.
    2) prandom_state_selftest() can avoid one call and simplify the
       loop logic by repeating an assignment to a local variable (which
       probably adds zero code anyway).

    Signed-off-by: George Spelvin <linux@horizon.com>

commit 8765cff45da1d96e4310d50dd49231790c49b612
Author: George Spelvin <linux@horizon.com>
Date:   Sat May 24 11:52:34 2014 -0400

    lib/random32.c: Mark self-test data as __initconst

    So it can be thrown away along with the code that uses it.

    Signed-off-by: George Spelvin <linux@horizon.com>
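The commits above all replace "prandom_u32() % range" with prandom_u32_max(). A user-space sketch of the underlying multiply-and-shift idea, including the power-of-2 special case one of the commits adds, looks like this (the function name and structure here are illustrative, not the kernel's exact code):

```c
/* Map a full 32-bit random word into [0, range) with one widening
 * multiply instead of a divide; for a power-of-2 range, a mask is
 * cheaper still. */
#include <assert.h>
#include <stdint.h>

static uint32_t u32_max_sketch(uint32_t rnd, uint32_t range)
{
    if ((range & (range - 1)) == 0)         /* power of 2: just mask */
        return rnd & (range - 1);
    /* (rnd * range) / 2^32 scales the uniform 32-bit input down */
    return (uint32_t)(((uint64_t)rnd * range) >> 32);
}
```

Both this and the modulo version share the same slight bias for ranges that don't divide 2^32; the win is purely that a multiply and shift are cheaper than a division.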
Hello,

On Fri, 2016-12-23 at 20:17 -0500, George Spelvin wrote:
> Hannes Frederic Sowa wrote:
> > On 24.12.2016 00:39, George Spelvin wrote:
> > > We just finished discussing why 8 bytes isn't enough. If you only
> > > feed back 8 bytes, an attacker who can do 2^64 computation can
> > > find it (by guessing and computing forward to verify the guess)
> > > and recover the previous state. You need to feed back at least as
> > > much output as your security target. For /dev/urandom's ChaCha20,
> > > that's 256 bits.
> > I followed the discussion but it appeared to me that this has the
> > additional constraint of time until the next reseeding event
> > happens, which is 300s (under the assumption that calls to
> > get_random_int happen regularly, which I expect right now). After
> > that the existing reseeding mechanism will ensure enough
> > backtracking protection. The number of bytes can easily be increased
> > here, given that reseeding was shown to be quite fast already and we
> > produce enough output. But I am not sure if this is a bit
> > overengineered in the end?
>
> I'm not following your description of how the time-based and
> call-based mechanisms interact, but for any mix-back, you should
> either do enough or none at all. (Also called "catastrophic
> reseeding".)

We call extract_crng when we run out of batched entropy and reseed. How
often we call down to extract_crng depends on how much entropy we
extracted by calls to get_random_int/long, so the number of calls into
those functions matters.

In extract_crng we have a timer which reseeds the CPRNG every 300s and
either uses completely new entropy from the CRNG or calls down into the
CPRNG while also doing backtracking protection (which feeds chacha's
block size / 2 back into chacha, if I read the code correctly, thus
1024 bits, which should be enough).

> For example, two mix-backs of 64 bits give you 65-bit security, not
> 128. (Because each mix-back can be guessed and verified separately.)
Exactly, but the full reseed after running out of entropy is strong
enough not to be defeated by your argument, and neither is the reseed
from the CRNG.

> > Also agreed. Given your solution below to prandom_u32, I do think it
> > might also work without the seqlock now.
>
> It's not technically a seqlock; in particular the reader doesn't
> spin. But the write side, and general logic, is so similar it's
> a good mental model.
>
> Basically, assume a 64-byte buffer. The reader has gone through
> 32 bytes of it, and has 32 left, and when he reads another 8 bytes,
> has to distinguish three cases:
>
> 1) No update; we read the old bytes and there are now 32 - 8 = 24
>    bytes left.
> 2) Update completed while we weren't looking. There are now new
>    bytes in the buffer, and we took 8, leaving 64 - 8 = 56.
> 3) Update in progress at the time of the read. We don't know if we
>    are seeing old bytes or new bytes, so we have to assume the worst
>    and not proceed unless 32 >= 8, but assume at the end there are
>    64 - 8 = 56 new bytes left.
>
> > I wouldn't have added a disable irqs, but given that I really like
> > your proposal, I would take it in my todo branch and submit it when
> > net-next window opens.
>
> If you want that, I have a pile of patches to prandom I really
> should push upstream. Shall I refresh them and send them to you?

I would like to have a look at them in the new year, certainly! I can
also take care of the core prandom patches, but I don't know if I have
time to submit the others to the different subsystems.

Maybe, if David is okay with that, we can submit all patches through
his tree, as he is also the dedicated maintainer for prandom.

> [... patch descriptions ...]

Thanks,
Hannes
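The three-case accounting quoted above can be written down as a small pure function, which makes the byte arithmetic easy to check. This is only a model of the bookkeeping, under the assumption of a 64-byte buffer and a writer-side sequence count (odd while a refill is in progress); none of these names come from the kernel.

```c
/* 'left' is how many unread bytes the reader believed it had, 'want'
 * is the read size, my_seq is the sequence value from the reader's
 * last successful read, and seq_before/seq_after bracket the copy.
 * Returns the new unread-byte count, or -1 if the read is unsafe. */
#include <assert.h>

#define POOL_BYTES 64

static int bytes_left_after_read(unsigned seq_before, unsigned seq_after,
                                 unsigned my_seq, int left, int want)
{
    if ((seq_after & 1) || seq_before != seq_after) {
        /* Case 3: update in progress; only proceed on bytes we already
         * owned, but afterwards everything but our read is fresh. */
        if (left < want)
            return -1;
        return POOL_BYTES - want;
    }
    if (seq_before != my_seq)
        /* Case 2: a refill completed while we weren't looking. */
        return POOL_BYTES - want;
    /* Case 1: no update; just consume from what was left. */
    return left - want;
}
```

With 32 bytes left and an 8-byte read, the three cases give 24, 56, and 56 remaining bytes, matching the mail.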
Hannes Frederic Sowa wrote:
> We call extract_crng when we run out of batched entropy and reseed. How
> often we call down to extract_crng depends on how much entropy we
> extracted by calls to get_random_int/long, so the number of calls into
> those functions matters.
>
> In extract_crng we have a timer which reseeds the CPRNG every 300s and
> either uses completely new entropy from the CRNG or calls down into the
> CPRNG while also doing backtracking protection (which feeds chacha's
> block size / 2 back into chacha, if I read the code correctly, thus
> 1024 bits, which should be enough).

In the current code, _extract_crng checks to see if more than 300 s have
elapsed since the last time it was reseeded, and if so, reseeds with
fresh entropy.

In addition, on every read (or get_random_bytes), if the request leaves
enough random bits in the last ChaCha block, it feeds back 256 bits (the
ChaCha block size is 16*32 = 512 bits) for anti-backtracking.

If the last read happened to not fit under that limit (size % 512 > 256),
*and* there are no calls for RNG output for a long time, there is no
upper limit to how long the old ChaCha key can hang around.

> On Fri, 2016-12-23 at 20:17 -0500, George Spelvin wrote:
>> For example, two mix-backs of 64 bits give you 65-bit security, not
>> 128. (Because each mix-back can be guessed and verified separately.)

> Exactly, but the full reseed after running out of entropy is strong
> enough not to be defeated by your argument, and neither is the reseed
> from the CRNG.

Yes, I was just reacting to your original statement:

>>>>> couldn't we simply use 8 bytes of the 64 byte
>>>>> return block to feed it directly back into the state chacha?

It's not the idea that's bad, just the proposed quantity.

>> If you want that, I have a pile of patches to prandom I really
>> should push upstream. Shall I refresh them and send them to you?

> I would like to have a look at them in the new year, certainly! I can
> also take care of the core prandom patches, but I don't know if I have
> time to submit the others to the different subsystems.
>
> Maybe, if David is okay with that, we can submit all patches through
> his tree, as he is also the dedicated maintainer for prandom.

Amazing, thank you very much! They're just minor cleanups, nothing too
exciting. I'll put it in the queue to make sure they're up to date.
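The 256-bit feedback idea can be modeled in a few lines: after generating a 64-byte block, keep 32 bytes (the ChaCha key size, i.e. the 256-bit security target) out of the caller's hands and use them to replace the key, so a later state compromise cannot be run backwards to earlier outputs. Everything below is a toy under stated assumptions: block_fn is a throwaway placeholder mixer, not ChaCha20, and toy_extract caps the request instead of handling the partial-block cases the real code deals with.

```c
/* Toy model of anti-backtracking: the unconsumed tail of each
 * generated block becomes the next key. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 64
#define KEY_BYTES   32          /* 256 bits, the security target */

struct toy_crng { uint8_t key[KEY_BYTES]; };

/* Placeholder "block function": NOT ChaCha20, just a mixer so the
 * sketch is self-contained and deterministic. */
static void block_fn(const struct toy_crng *c, uint8_t out[BLOCK_BYTES])
{
    for (int i = 0; i < BLOCK_BYTES; i++)
        out[i] = (uint8_t)(c->key[i % KEY_BYTES] * 167u + i * 29u + 1u);
}

/* Deliver at most BLOCK_BYTES - KEY_BYTES bytes, then overwrite the
 * key with the block's unconsumed tail; returns bytes delivered. */
static int toy_extract(struct toy_crng *c, uint8_t *dst, int want)
{
    uint8_t block[BLOCK_BYTES];

    if (want > BLOCK_BYTES - KEY_BYTES)
        want = BLOCK_BYTES - KEY_BYTES;     /* keep 256 bits back */
    block_fn(c, block);
    memcpy(dst, block, want);
    /* anti-backtracking: old key is gone after this line */
    memcpy(c->key, block + BLOCK_BYTES - KEY_BYTES, KEY_BYTES);
    memset(block, 0, sizeof(block));        /* wipe the working copy */
    return want;
}
```

The point of the thread is exactly the KEY_BYTES constant: feeding back only 8 bytes would let a 2^64 attacker guess the fed-back value and recover the old key, while 32 bytes matches the cipher's security level.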
diff --git a/lib/random32.c b/lib/random32.c
index c750216d..6bee4a36 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -55,16 +55,29 @@ static DEFINE_PER_CPU(struct rnd_state, net_rand_state);
  *
  * This is used for pseudo-randomness with no outside seeding.
  * For more random results, use prandom_u32().
+ *
+ * The barrier() is to allow prandom_u32() to be called from interrupt
+ * context without locking.  An interrupt will run to completion,
+ * updating all four state variables.  The barrier() ensures that
+ * the interrupted code will compute a different result.  Either it
+ * will have written s1 and s2 (so the interrupt will start with
+ * the updated values), or it will use the values of s3 and s4
+ * updated by the interrupt.
+ *
+ * (The same logic applies recursively to nested interrupts, trap
+ * handlers, and NMIs.)
  */
 u32 prandom_u32_state(struct rnd_state *state)
 {
+	register u32 x;
 #define TAUSWORTHE(s, a, b, c, d) ((s & c) << d) ^ (((s << a) ^ s) >> b)
-	state->s1 = TAUSWORTHE(state->s1,  6, 13, 4294967294U, 18U);
-	state->s2 = TAUSWORTHE(state->s2,  2, 27, 4294967288U,  2U);
-	state->s3 = TAUSWORTHE(state->s3, 13, 21, 4294967280U,  7U);
-	state->s4 = TAUSWORTHE(state->s4,  3, 12, 4294967168U, 13U);
+	x  = state->s1 = TAUSWORTHE(state->s1,  6, 13, ~1u,   18);
+	x += state->s2 = TAUSWORTHE(state->s2,  2, 27, ~7u,    2);
+	barrier();
+	x ^= state->s3 = TAUSWORTHE(state->s3, 13, 21, ~15u,   7);
+	x += state->s4 = TAUSWORTHE(state->s4,  3, 12, ~127u, 13);
 
-	return (state->s1 ^ state->s2 ^ state->s3 ^ state->s4);
+	return x;
 }
 EXPORT_SYMBOL(prandom_u32_state);
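The patched function ports directly to user space, and splitting it at the barrier() point makes the comment's interrupt argument concrete: an interrupt that runs a complete update between the two halves leaves new s3/s4 behind, so the interrupted call finishes on different state than it started with. This is an illustrative port (barrier() has no user-space stand-in here, and the helper split is mine, not the patch's):

```c
/* User-space port of the patched prandom_u32_state(), split at the
 * point where the patch places barrier(). */
#include <assert.h>
#include <stdint.h>

struct rnd_state { uint32_t s1, s2, s3, s4; };

#define TAUSWORTHE(s, a, b, c, d) ((((s) & (c)) << (d)) ^ ((((s) << (a)) ^ (s)) >> (b)))

static uint32_t half1(struct rnd_state *st)     /* before barrier() */
{
    uint32_t x;
    x  = st->s1 = TAUSWORTHE(st->s1,  6, 13, 4294967294U, 18);
    x += st->s2 = TAUSWORTHE(st->s2,  2, 27, 4294967288U,  2);
    return x;
}

static uint32_t half2(struct rnd_state *st, uint32_t x) /* after barrier() */
{
    x ^= st->s3 = TAUSWORTHE(st->s3, 13, 21, 4294967280U,  7);
    x += st->s4 = TAUSWORTHE(st->s4,  3, 12, 4294967168U, 13);
    return x;
}

static uint32_t toy_prandom_u32_state(struct rnd_state *st)
{
    return half2(st, half1(st));
}
```

Note that the patch also changes the output function from a plain XOR of all four state words to the add/xor/add mix accumulated in x; the seed constraints (s1 > 1, s2 > 7, s3 > 15, s4 > 127) are unchanged.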