Message ID | 20200318203914.GA16083@SDF.ORG (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] mm/shuffle.c: Fix races in add_to_free_area_random() | expand |
On Wed, Mar 18, 2020 at 1:39 PM George Spelvin <lkml@sdf.org> wrote: > > The old code had separate "rand" and "rand_count" variables, > which could get out of sync with bad results. > > In the worst case, two threads would see rand_count = 1 and > both decrement it, resulting in rand_count = 255 and rand being > filled with zeros for the next 255 calls. > > Instead, pack them both into a single, atomically updateable, > variable. This makes it a lot easier to reason about race > conditions. They are still there - the code deliberately eschews > locking - but basically harmless on the rare occasions that > they happen. > > Second, use READ_ONCE and WRITE_ONCE. Without them, we are deep > in the land of nasal demons. The compiler would be free to spill > temporaries to the static variables in arbitrarily perverse ways > and create hard-to-find bugs. > > (Alternatively, we could declare the static variable "volatile", > one of the few places in the Linux kernel that would be correct, > but it would probably annoy Linus.) > > Third, use long rather than u64. This not only keeps the state > atomically updateable, it also speeds up the fast path on 32-bit > machines. Saving at least three instructions on the fast path (one > load, one add-with-carry, and one store) is worth a second call > to get_random_u*() per 64 bits. The fast path of get_random_u* > is less than the 3*64 = 192 instructions saved, and the slow path > happens every 64 bytes so isn't affected by the change. > > Fourth, use left-shift rather than right. Testing the sign bit > produces slightly smaller/faster code than testing the lsbit. > > I've tried shifting both ways, and copying the desired bit to > a boolean before shifting rather than keeping separate full-width > r and rshift variables, but both produce larger code: > > x86_64 i386 > This code 94 95 > Explicit bool 103 99 > Lsbits 99 101 > Both 96 100 > > In a perfect world, the x86-64 object code would be tiny, > and dominated by the unavoidable unpredictable branch, but > this function doesn't warrant arch-specific optimization. > > add_to_free_area_random: > shlq rand(%rip) > jz 3f > 1: > jnc 2f > jmp add_to_free_area # tail call > 2: > jmp add_to_free_area_tail > 3: > pushq %rdx > pushq %rsi > pushq %rdi > call get_random_u64 > popq %rdi > popq %rsi > popq %rdx > stc > adcq %rax,%rax # not lea because we need carry out > movq %rax, rand(%rip) > jmp 1b > > Signed-off-by: George Spelvin <lkml@sdf.org> > Acked-by: Kees Cook <keescook@chromium.org> > Acked-by: Dan Williams <dan.j.williams@intel.com> > Cc: Alexander Duyck <alexander.duyck@gmail.com> > Cc: Randy Dunlap <rdunlap@infradead.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: linux-mm@kvack.org What kernel is this based on? You might want to rebase on the latest linux-next as it occurs to me that this function was renamed to shuffle_pick_tail as I had incorporated a few bits of it into the logic for placing buddy pages and reported pages on the tail of the list. > --- > v2: Rewrote commit message to explain existing races better. > Made local variables unsigned to avoid (technically undefined) > signed overflow. > v3: Typos fixed, Acked-by, expanded commit message. > > mm/shuffle.c | 26 ++++++++++++++++---------- > 1 file changed, 16 insertions(+), 10 deletions(-) > > diff --git a/mm/shuffle.c b/mm/shuffle.c > index e0ed247f8d90..16c5fddc292f 100644 > --- a/mm/shuffle.c > +++ b/mm/shuffle.c > @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat) > void add_to_free_area_random(struct page *page, struct free_area *area, > int migratetype) > { > - static u64 rand; > - static u8 rand_bits; > + static unsigned long rand; /* buffered random bits */ > + unsigned long r = READ_ONCE(rand), rshift = r << 1; > > /* > - * The lack of locking is deliberate. If 2 threads race to > - * update the rand state it just adds to the entropy. > + * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a > + * 1 bit, then zero-padding in the lsbits. This allows us to > + * maintain the pre-generated bits and the count of bits in a > + * single, atomically updatable, variable. > + * > + * The lack of locking is deliberate. If two threads race to > + * update the rand state it just adds to the entropy. The > + * worst that can happen is a random bit is used twice, or > + * get_random_long is called redundantly. > */ > - if (rand_bits == 0) { > - rand_bits = 64; > - rand = get_random_u64(); > + if (unlikely(rshift == 0)) { > + r = get_random_long(); > + rshift = r << 1 | 1; > } > + WRITE_ONCE(rand, rshift); > > - if (rand & 1) > + if ((long)r < 0) > add_to_free_area(page, area, migratetype); > else > add_to_free_area_tail(page, area, migratetype); > - rand_bits--; > - rand >>= 1; > } > -- > 2.26.0.rc2
On Wed, Mar 18, 2020 at 02:34:04PM -0700, Alexander Duyck wrote: > What kernel is this based on? You might want to rebase on the latest > linux-next as it occurs to me that this function was renamed to > shuffle_pick_tail as I had incorporated a few bits of it into the > logic for placing buddy pages and reported pages on the tail of the > list. 5.5.8. I didn't realize it made much difference, but I see it does. Due to the different return type, the best variant has probably changed and I'll have to re-check. Since there's only one call site, should the function be moved to page_alloc.c and inlined?
On Wed, Mar 18, 2020 at 3:49 PM George Spelvin <lkml@sdf.org> wrote: > > On Wed, Mar 18, 2020 at 02:34:04PM -0700, Alexander Duyck wrote: > > What kernel is this based on? You might want to rebase on the latest > > linux-next as it occurs to me that this function was renamed to > > shuffle_pick_tail as I had incorporated a few bits of it into the > > logic for placing buddy pages and reported pages on the tail of the > > list. > > 5.5.8. I didn't realize it made much difference, but I see it does. Due > to the different return type, the best variant has probably changed and > I'll have to re-check. > > Since there's only one call site, should the function be moved to > page_alloc.c and inlined? The intent was to make it compile-away on CONFIG_SHUFFLE_PAGE_ALLOCATOR=n builds. However, CONFIG_SHUFFLE_PAGE_ALLOCATOR=y is now the common case for distros. So, it's not serving much purpose being in mm/shuffle.c. I'd support moving it.
On Wed, Mar 18, 2020 at 03:57:01PM -0700, Dan Williams wrote: > On Wed, Mar 18, 2020 at 3:49 PM George Spelvin <lkml@sdf.org> wrote: >> Since there's only one call site, should the function be moved to >> page_alloc.c and inlined? > > The intent was to make it compile-away on > CONFIG_SHUFFLE_PAGE_ALLOCATOR=n builds. However, > CONFIG_SHUFFLE_PAGE_ALLOCATOR=y is now the common case for distros. > So, it's not serving much purpose being in mm/shuffle.c. I'd support > moving it. The options are to define it as an inline function in shuffle.h, which keeps the #ifdef localized, or to drop it into page_alloc.c, which shrinks the shuffle.h interface but requires an #ifdef in page_alloc.h. Um... or does it? If is_shuffle_order always returns false, shuffle_pick_tail is dead code and will be elided by the compiler. But will the static variable be elided, too? Need to check.
diff --git a/mm/shuffle.c b/mm/shuffle.c index e0ed247f8d90..16c5fddc292f 100644 --- a/mm/shuffle.c +++ b/mm/shuffle.c @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat) void add_to_free_area_random(struct page *page, struct free_area *area, int migratetype) { - static u64 rand; - static u8 rand_bits; + static unsigned long rand; /* buffered random bits */ + unsigned long r = READ_ONCE(rand), rshift = r << 1; /* - * The lack of locking is deliberate. If 2 threads race to - * update the rand state it just adds to the entropy. + * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a + * 1 bit, then zero-padding in the lsbits. This allows us to + * maintain the pre-generated bits and the count of bits in a + * single, atomically updatable, variable. + * + * The lack of locking is deliberate. If two threads race to + * update the rand state it just adds to the entropy. The + * worst that can happen is a random bit is used twice, or + * get_random_long is called redundantly. */ - if (rand_bits == 0) { - rand_bits = 64; - rand = get_random_u64(); + if (unlikely(rshift == 0)) { + r = get_random_long(); + rshift = r << 1 | 1; } + WRITE_ONCE(rand, rshift); - if (rand & 1) + if ((long)r < 0) add_to_free_area(page, area, migratetype); else add_to_free_area_tail(page, area, migratetype); - rand_bits--; - rand >>= 1; }