[v3] mm/shuffle.c: Fix races in add_to_free_area_random()

Message ID 20200318203914.GA16083@SDF.ORG (mailing list archive)
State New, archived
Series [v3] mm/shuffle.c: Fix races in add_to_free_area_random()

Commit Message

George Spelvin March 18, 2020, 8:39 p.m. UTC
The old code had separate "rand" and "rand_count" variables,
which could get out of sync with bad results.

In the worst case, two threads would see rand_count = 1 and
both decrement it, wrapping the u8 counter around to rand_count =
255 and leaving rand full of zeros for the next 255 calls.

Instead, pack them both into a single, atomically updatable
variable.  This makes it a lot easier to reason about race
conditions.  They are still there - the code deliberately eschews
locking - but they are basically harmless on the rare occasions
that they happen.
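
(Illustration only, not part of the patch: a standalone userspace
sketch of the packed encoding, assuming a 64-bit long; the xorshift
step stands in for get_random_long().)

#include <stdio.h>

/* Stand-in for the kernel's get_random_long(); any PRNG would do. */
static unsigned long fake_get_random_long(void)
{
	static unsigned long s = 0x9e3779b97f4a7c15UL;

	s ^= s << 13; s ^= s >> 7; s ^= s << 17;
	return s;
}

int main(void)
{
	/* Packed state: random msbits, then a 1 sentinel, then 0s. */
	unsigned long rand = 0;
	int refills = 0, heads = 0;

	for (int i = 0; i < 1000; i++) {
		unsigned long r = rand, rshift = r << 1;

		if (rshift == 0) {	/* sentinel shifted out: refill */
			r = fake_get_random_long();
			rshift = r << 1 | 1;	/* 63 bits + new sentinel */
			refills++;
		}
		rand = rshift;

		if ((long)r < 0)	/* consume the current msbit */
			heads++;
	}
	/* 1000 bits at 64 usable bits per refill -> 16 refills. */
	printf("refills=%d heads=%d\n", refills, heads);
	return 0;
}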

Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
in the land of nasal demons.  The compiler would be free to spill
temporaries to the static variables in arbitrarily perverse ways
and create hard-to-find bugs.

(Alternatively, we could declare the static variable "volatile",
one of the few places in the Linux kernel that would be correct,
but it would probably annoy Linus.)

Third, use long rather than u64.  This not only keeps the state
atomically updatable, it also speeds up the fast path on 32-bit
machines.  Saving at least three instructions per call on the fast
path (one load, one add-with-carry, and one store for the high
word) is worth a second call to get_random_u*() per 64 bits: those
savings total 3*64 = 192 instructions per 64 bits, more than the
fast path of get_random_u* costs, and the slow path happens only
once per 64 bytes of output so isn't affected by the change.

Fourth, use left-shift rather than right.  Testing the sign bit
produces slightly smaller/faster code than testing the lsbit; the
sketch below shows the two tests.
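
(A hedged sketch, not the patch itself: the two branch conditions
being compared, where r holds the pre-shift state in each variant:)

static inline int use_msbit(unsigned long r)
{
	return (long)r < 0;	/* left-shift variant: consume the msbit */
}

static inline int use_lsbit(unsigned long r)
{
	return r & 1;		/* right-shift variant: consume the lsbit */
}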

I've tried shifting both directions, and copying the desired bit
to a boolean before shifting rather than keeping separate full-width
r and rshift variables, but all the alternatives produce larger
code (sizes in bytes):

		x86_64	i386
This code	 94	 95
Explicit bool	103	 99
Lsbits		 99	101
Both		 96	100

In a perfect world, the x86-64 object code would be tiny,
and dominated by the unavoidable unpredictable branch, but
this function doesn't warrant arch-specific optimization.

add_to_free_area_random:
	shlq	rand(%rip)		# CF = next bit, ZF = only the
					# sentinel remained
	jz	3f			# out of bits: refill
1:
	jnc	2f
	jmp	add_to_free_area	# tail call
2:
	jmp	add_to_free_area_tail
3:
	pushq	%rdx			# save arg registers around the call
	pushq	%rsi
	pushq	%rdi
	call	get_random_u64
	popq	%rdi
	popq	%rsi
	popq	%rdx
	stc				# CF = 1, the new sentinel bit
	adcq	%rax,%rax		# rax = rax<<1 | 1, CF = fresh msbit;
					# not lea because we need carry out
	movq	%rax, rand(%rip)	# mov leaves CF intact
	jmp	1b			# branch on the fresh bit in CF

Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
v2: Rewrote commit message to explain existing races better.
    Made local variables unsigned to avoid (technically undefined)
    signed overflow.
v3: Typos fixed, Acked-by, expanded commit message.

 mm/shuffle.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

Comments

Alexander Duyck March 18, 2020, 9:34 p.m. UTC | #1
On Wed, Mar 18, 2020 at 1:39 PM George Spelvin <lkml@sdf.org> wrote:
> [...]

What kernel is this based on? You might want to rebase on the latest
linux-next as it occurs to me that this function was renamed to
shuffle_pick_tail as I had incorporated a few bits of it into the
logic for placing buddy pages and reported pages on the tail of the
list.

George Spelvin March 18, 2020, 10:49 p.m. UTC | #2
On Wed, Mar 18, 2020 at 02:34:04PM -0700, Alexander Duyck wrote:
> What kernel is this based on? You might want to rebase on the latest
> linux-next as it occurs to me that this function was renamed to
> shuffle_pick_tail as I had incorporated a few bits of it into the
> logic for placing buddy pages and reported pages on the tail of the
> list.

5.5.8.  I didn't realize it made much difference, but I see it does. Due 
to the different return type, the best variant has probably changed and 
I'll have to re-check.

Since there's only one call site, should the function be moved to 
page_alloc.c and inlined?
Dan Williams March 18, 2020, 10:57 p.m. UTC | #3
On Wed, Mar 18, 2020 at 3:49 PM George Spelvin <lkml@sdf.org> wrote:
> [...]
> Since there's only one call site, should the function be moved to
> page_alloc.c and inlined?

The intent was to make it compile-away on
CONFIG_SHUFFLE_PAGE_ALLOCATOR=n builds. However,
CONFIG_SHUFFLE_PAGE_ALLOCATOR=y is now the common case for distros.
So, it's not serving much purpose being in mm/shuffle.c. I'd support
moving it.
George Spelvin March 18, 2020, 11:18 p.m. UTC | #4
On Wed, Mar 18, 2020 at 03:57:01PM -0700, Dan Williams wrote:
> On Wed, Mar 18, 2020 at 3:49 PM George Spelvin <lkml@sdf.org> wrote:
>> Since there's only one call site, should the function be moved to
>> page_alloc.c and inlined?
> 
> The intent was to make it compile-away on
> CONFIG_SHUFFLE_PAGE_ALLOCATOR=n builds. However,
> CONFIG_SHUFFLE_PAGE_ALLOCATOR=y is now the common case for distros.
> So, it's not serving much purpose being in mm/shuffle.c. I'd support
> moving it.

The options are to define it as an inline function in shuffle.h,
which keeps the #ifdef localized, or to drop it into page_alloc.c,
which shrinks the shuffle.h interface but requires an #ifdef in
page_alloc.c.

Um... or does it?  If is_shuffle_order always returns false,
shuffle_pick_tail is dead code and will be elided by the compiler.

But will the static variable be elided, too?  Need to check.
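
(A minimal sketch of the shuffle.h option, with assumed names: the
shuffle_pick_tail() name and bool return type are taken from the
linux-next rename mentioned above, and the stub pattern mirrors
shuffle.h's other helpers:)

#ifdef CONFIG_SHUFFLE_PAGE_ALLOCATOR
bool shuffle_pick_tail(void);
#else
static inline bool shuffle_pick_tail(void)
{
	return false;	/* constant-false: callers' shuffle branches
			 * become dead code the compiler can elide */
}
#endif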

Patch

diff --git a/mm/shuffle.c b/mm/shuffle.c
index e0ed247f8d90..16c5fddc292f 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -186,22 +186,28 @@  void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 void add_to_free_area_random(struct page *page, struct free_area *area,
 		int migratetype)
 {
-	static u64 rand;
-	static u8 rand_bits;
+	static unsigned long rand;	/* buffered random bits */
+	unsigned long r = READ_ONCE(rand), rshift = r << 1;
 
 	/*
-	 * The lack of locking is deliberate. If 2 threads race to
-	 * update the rand state it just adds to the entropy.
+	 * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a
+	 * 1 bit, then zero-padding in the lsbits.  This allows us to
+	 * maintain the pre-generated bits and the count of bits in a
+	 * single, atomically updatable, variable.
+	 *
+	 * The lack of locking is deliberate. If two threads race to
+	 * update the rand state it just adds to the entropy.  The
+	 * worst that can happen is a random bit is used twice, or
+	 * get_random_long is called redundantly.
 	 */
-	if (rand_bits == 0) {
-		rand_bits = 64;
-		rand = get_random_u64();
+	if (unlikely(rshift == 0)) {
+		r = get_random_long();
+		rshift = r << 1 | 1;
 	}
+	WRITE_ONCE(rand, rshift);
 
-	if (rand & 1)
+	if ((long)r < 0)
 		add_to_free_area(page, area, migratetype);
 	else
 		add_to_free_area_tail(page, area, migratetype);
-	rand_bits--;
-	rand >>= 1;
 }