Message ID | 20200317135035.GA19442@SDF.ORG (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | mm/shuffle.c: optimize add_to_free_area_random() | expand |
On Tue, Mar 17, 2020 at 01:50:35PM +0000, George Spelvin wrote: > First, use long rather than u64 for the bit buffer type, which > is significantly more efficient on 32-bit processors. Why? This is twice as many get_random_*() calls for 32-bit: it'll save whatever bits it doesn't use. (Speaking to word-sized stores, yes, that makes sense, but see below...) > Second, avoid the need for a separate rand_bits counter. > rand_bits is never more than 63, so there's always room in rand > for a bit to mark the end of the available bits. This makes the > shared state atomically updatable, which makes it a lot easier > to reason about race conditions. What are the race conditions? I had to go look at I see that add_to_free_area_random() is called under __free_one_page() which is under &zone->lock. Would it make more sense to store the randomness per-zone and entirely side-step the issues? > Third, use READ_ONCE and WRITE_ONCE. Without them, the compiler > may spill to the shared static in arbitrarily perverse ways, > and combined with the fact that the code eschews locking, that > is a recipe for hard-to-find bugs. Now, a race might cause a bit > to be used twice, or get_random_long() to be called redundantly, > but it can't summon nasal daemons. All these things might make the results less predictable, too. :) But, yes, if there was deterministic sharing of random values, that'd be bad. But this patch doesn't really solve the "use the same random bits" problem, which is the very thing that might get us into a predictable state. If two zones both call READ_ONCE(), use the same value (eek), and then both call WRITE_ONCE(), one of them wins the write (no big deal).
On Tue, Mar 17, 2020 at 02:44:33PM -0700, Kees Cook wrote: > On Tue, Mar 17, 2020 at 01:50:35PM +0000, George Spelvin wrote: > > First, use long rather than u64 for the bit buffer type, which > > is significantly more efficient on 32-bit processors. > > Why? This is twice as many get_random_*() calls for 32-bit: it'll save > whatever bits it doesn't use. (Speaking to word-sized stores, yes, that > makes sense, but see below...) Yes, but the per-call overhead is fairly low until you run out of the 64-byte buffer; then there's an expensive crypto operation to refill the buffer. So two get_random_32 calls is not a lot more than one get_random_64. (The buffers are per-CPU, so they are pretty cheap to access. (I have other patches I'm working on to speed up the locking they have, but this one is independent so I'm sending it separately.) And if you want faster, prandom_u32() is probably good enough for this application. I did it to speed up and shrink the fast path. It saves at least 3 instructions on 32-bit machines (one load, one rotate through carry, and one store), at the cost of one extra get_random_*() call per 64 bits. If get_random_u32()'s fast path is less than 192 instructions, it's a win. It's also relevant to the race condition reasoning mentioned below; a one-word value can be atomically updated. On a 32-bit machine, 64-bit operations get tearing and it's a PITA to reason about. >> Second, avoid the need for a separate rand_bits counter. >> rand_bits is never more than 63, so there's always room in rand >> for a bit to mark the end of the available bits. This makes the >> shared state atomically updatable, which makes it a lot easier >> to reason about race conditions. > > What are the race conditions? I had to go look at I see that > add_to_free_area_random() is called under __free_one_page() which is > under &zone->lock. Would it make more sense to store the randomness > per-zone and entirely side-step the issues? The most serious is that if two threads simultaneously observe rand_bits == 1, but do their decrements separately, you end up with rand_bits = 255 and you generate 255 consecutive 0 bits before refilling the buffer. Since we're only generating random bits, a screwed-up answer occasionally doesn't really matter (like the comment says "lack of locking is deliberate"), but 255 screwed up bits is a bit much. I avoided changing the underlying locking model because I didn't feel up to such an invasive change; I managed to fix the problems I saw without going there. And shrink the code; tht seemed like enough of a win to justify it to me. If you want to change the locking, the other alternative to per-zone is per-CPU, which also avoids locking. We could even add a generic get_random_bit() function. >> Third, use READ_ONCE and WRITE_ONCE. Without them, the compiler >> may spill to the shared static in arbitrarily perverse ways, >> and combined with the fact that the code eschews locking, that >> is a recipe for hard-to-find bugs. Now, a race might cause a bit >> to be used twice, or get_random_long() to be called redundantly, >> but it can't summon nasal daemons. > > All these things might make the results less predictable, too. :) But, > yes, if there was deterministic sharing of random values, that'd be > bad. But this patch doesn't really solve the "use the same random bits" > problem, which is the very thing that might get us into a predictable > state. If two zones both call READ_ONCE(), use the same value (eek), > and then both call WRITE_ONCE(), one of them wins the write (no big deal). Absolutely, but in case of a tight race, only *one* bit is used twice, and the bit used is uniformly distributed, so it's a really small correlation. Unlike the 255x zero race the existing code can have. The main reason I did this is to, as I already wrote, keep the nasal demons at bay. The existing code does things (having another thread change a non-volatile variable while this code is executing) that are explicitly undefined by the ANSI C standard. The compiler is allowed to (in John Woods' memorable explanation) produce code that makes demons fly out of your nose. (More plausibly, it may simply crash.) It probably won't, but I prefer to stay on the safe side of the language specification contact. Linus has ranted on more than one occasion about the perverse ways that GCC has used its permission to assume that people won't violate the contract.
On Tue, Mar 17, 2020 at 11:06:12PM +0000, George Spelvin wrote: > The most serious is that if two threads simultaneously observe > rand_bits == 1, but do their decrements separately, you end up > with rand_bits = 255 and you generate 255 consecutive 0 bits > before refilling the buffer. > > Since we're only generating random bits, a screwed-up answer occasionally > doesn't really matter (like the comment says "lack of locking is > deliberate"), but 255 screwed up bits is a bit much. Okay, I'm on board! :) Thanks for spelling this race out; I hadn't seen quite how nasty it could get. (Perhaps mention in the commit log for v2?) > I avoided changing the underlying locking model because I didn't > feel up to such an invasive change; I managed to fix the problems > I saw without going there. And shrink the code; tht seemed like > enough of a win to justify it to me. Fair enough. > The compiler is allowed to (in John Woods' memorable explanation) > produce code that makes demons fly out of your nose. (More plausibly, > it may simply crash.) So one thing that I see here that is still in the nasal demon realm is that the left-shift of a signed value, which is technically undefined behavior in C. (See the comment on check_shl_overflow().) Doing a signedness check is very cheap in the resulting machine code; but I suspect sticking to unsigned and reversing direction for a bottom-bit test too bad? i.e.: static unsigned long rand_bits; unsigned long r = READ_ONCE(rand_bits), rshift = r >> 1; if (unlikely(rshift == 0)) { r = get_random_long(); rshift = (r >> 1) | (0x1UL << (BITS_PER_LONG - 1)); } WRITE_ONCE(rand_bits, rshift); if (r & 1) add_to... else add_to...tail
diff --git a/mm/shuffle.c b/mm/shuffle.c index b3fe97fd6654..0e4bf6a8da52 100644 --- a/mm/shuffle.c +++ b/mm/shuffle.c @@ -186,22 +186,25 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat) void add_to_free_area_random(struct page *page, struct free_area *area, int migratetype) { - static u64 rand; - static u8 rand_bits; + static long rand; /* 0..BITS_PER_LONG-1 buffered random bits */ + long r = READ_ONCE(rand), rshift = r << 1;; /* - * The lack of locking is deliberate. If 2 threads race to + * rand holds some random msbits, with the end marked by a 1 bit. + * This allows us to maintain the pre-generated bits and the + * count of bits in a single, atomically updatable, variable. + * + * The lack of locking is deliberate. If two threads race to * update the rand state it just adds to the entropy. */ - if (rand_bits == 0) { - rand_bits = 64; - rand = get_random_u64(); + if (unlikely(rshift == 0)) { + r = get_random_long(); + rshift = r << 1 | 1; } + WRITE_ONCE(rand, rshift); - if (rand & 1) + if (r < 0) add_to_free_area(page, area, migratetype); else add_to_free_area_tail(page, area, migratetype); - rand_bits--; - rand >>= 1; }
First, use long rather than u64 for the bit buffer type, which is significantly more efficient on 32-bit processors. Second, avoid the need for a separate rand_bits counter. rand_bits is never more than 63, so there's always room in rand for a bit to mark the end of the available bits. This makes the shared state atomically updatable, which makes it a lot easier to reason about race conditions. Third, use READ_ONCE and WRITE_ONCE. Without them, the compiler may spill to the shared static in arbitrarily perverse ways, and combined with the fact that the code eschews locking, that is a recipe for hard-to-find bugs. Now, a race might cause a bit to be used twice, or get_random_long() to be called redundantly, but it can't summon nasal daemons. I've tried a few variants. Keeping random lsbits with a most- significant end marker, or using an explicit bool variable rather than testing r both increase code size slightly. x86_64 i386 This code 94 95 Explicit bool 103 99 Lsbits 99 101 Both 96 100 Signed-off-by: George Spelvin <lkml@sdf.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org --- mm/shuffle.c | 21 ++++++++++++--------- 1 file changed, 12 insertions(+), 9 deletions(-)