diff mbox series

[v4] mm/shuffle.c: Fix races in add_to_free_area_random()

Message ID 20200319120522.GA1484@SDF.ORG (mailing list archive)
State New, archived
Headers show
Series [v4] mm/shuffle.c: Fix races in add_to_free_area_random() | expand

Commit Message

George Spelvin March 19, 2020, 12:05 p.m. UTC
The separate "rand" and "rand_count" variables could get out of
sync with bad results.

In the worst case, two threads would see rand_count=1 and both
decrement it, resulting in rand_count=255 and rand being filled
with zeros for the next 255 calls.

Instead, pack them both into a single, atomically updateable,
variable.  This makes it a lot easier to reason about race
conditions.  They are still there - the code deliberately eschews
locking - but basically harmless on the rare occasions that
they happen.

Second, use READ_ONCE and WRITE_ONCE.  Because the random bit
buffer is accessed by multiple threads concurrently without
locking, omitting those puts us deep in the land of nasal demons.
The compiler would be free to spill to the static variable in
arbitrarily perverse ways and create hard-to-find bugs.

(I'm torn between this and just declaring the buffer "volatile".
Linux tends to prefer marking accesses rather than variables,
but in this case, every access to the buffer is volatile.
It makes no difference to the generated code.)

Third, use long rather than u64.  This not only keeps the state
atomically updateable, it also speeds up the fast path on 32-bit
machines.  Saving at least three instructions on the fast path (one
load, one add-with-carry, and one store) is worth a second call
to get_random_u*() per 64 bits.  The fast path of get_random_u*
is less than the 3*64 = 192 instructions saved, and the slow path
happens every 64 bytes so isn't affected by the change.

Fourth, make the function inline.  It's small, and there's only
one caller (in mm/page_alloc.c:__free_one_page()), so avoid the
function call overhead.

Fifth, use the msbits of the buffer first (left shift) rather
than the lsbits (right shift).  Testing the sign bit produces
slightly smaller/faster code than testing the lsbit.

I've tried shifting both ways, and copying the desired bit to a
boolean before shifting rather than keeping separate full-width
r and rshift variables, but both produce larger code:

	x86-64 text size
Msbit   	42236
Lsbit   	42242 (+6)
Lsbit+bool	42258 (+22)
Msbit+bool	42284 (+52)

(Since this is straight-line code, size is a good proxy for number
of instructions and execution time.  Using READ/WRITE_ONCE instead of
volatile makes no difference.)

In a perfect world, on x86-64 the fast path would be:
	shlq	rand(%eip)
	jz	refill
refill_complete:
	jc	add_to_tail

but I don't see how to get gcc to generate that, and this
function isn't worth arch-specific implementation.

Signed-off-by: George Spelvin <lkml@sdf.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
v2: Rewrote commit message to explain existing races better.
    Made local variables unsigned to avoid (technically undefined)
    signed overflow.
v3: Typos fixed, Acked-by, expanded commit message.
v4: Rebase against -next; function has changed from 
    add_to_free_area_random() to shuffle_pick_tail.  Move to
    inline function in shuffle.h.
    Not sure if it's okay to keep Acked-by: after such a
    significant change.

 mm/shuffle.c | 23 -----------------------
 mm/shuffle.h | 26 +++++++++++++++++++++++++-
 2 files changed, 25 insertions(+), 24 deletions(-)


base-commit: 47780d7892b77e922bbe19b5dea99cde06b2f0e5

Comments

Alexander Duyck March 19, 2020, 5:49 p.m. UTC | #1
On Thu, Mar 19, 2020 at 5:05 AM George Spelvin <lkml@sdf.org> wrote:
>
> The separate "rand" and "rand_count" variables could get out of
> sync with bad results.
>
> In the worst case, two threads would see rand_count=1 and both
> decrement it, resulting in rand_count=255 and rand being filled
> with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updateable,
> variable.  This makes it a lot easier to reason about race
> conditions.  They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE.  Because the random bit
> buffer is accessed by multiple threads concurrently without
> locking, omitting those puts us deep in the land of nasal demons.
> The compiler would be free to spill to the static variable in
> arbitrarily perverse ways and create hard-to-find bugs.
>
> (I'm torn between this and just declaring the buffer "volatile".
> Linux tends to prefer marking accesses rather than variables,
> but in this case, every access to the buffer is volatile.
> It makes no difference to the generated code.)
>
> Third, use long rather than u64.  This not only keeps the state
> atomically updateable, it also speeds up the fast path on 32-bit
> machines.  Saving at least three instructions on the fast path (one
> load, one add-with-carry, and one store) is worth a second call
> to get_random_u*() per 64 bits.  The fast path of get_random_u*
> is less than the 3*64 = 192 instructions saved, and the slow path
> happens every 64 bytes so isn't affected by the change.
>
> Fourth, make the function inline.  It's small, and there's only
> one caller (in mm/page_alloc.c:__free_one_page()), so avoid the
> function call overhead.
>
> Fifth, use the msbits of the buffer first (left shift) rather
> than the lsbits (right shift).  Testing the sign bit produces
> slightly smaller/faster code than testing the lsbit.
>
> I've tried shifting both ways, and copying the desired bit to a
> boolean before shifting rather than keeping separate full-width
> r and rshift variables, but both produce larger code:
>
>         x86-64 text size
> Msbit           42236
> Lsbit           42242 (+6)
> Lsbit+bool      42258 (+22)
> Msbit+bool      42284 (+52)
>
> (Since this is straight-line code, size is a good proxy for number
> of instructions and execution time.  Using READ/WRITE_ONCE instead of
> volatile makes no difference.)
>
> In a perfect world, on x86-64 the fast path would be:
>         shlq    rand(%eip)
>         jz      refill
> refill_complete:
>         jc      add_to_tail
>
> but I don't see how to get gcc to generate that, and this
> function isn't worth arch-specific implementation.
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Acked-by: Kees Cook <keescook@chromium.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Alexander Duyck <alexander.duyck@gmail.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org

Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>

> ---
> v2: Rewrote commit message to explain existing races better.
>     Made local variables unsigned to avoid (technically undefined)
>     signed overflow.
> v3: Typos fixed, Acked-by, expanded commit message.
> v4: Rebase against -next; function has changed from
>     add_to_free_area_random() to shuffle_pick_tail.  Move to
>     inline function in shuffle.h.
>     Not sure if it's okay to keep Acked-by: after such a
>     significant change.
>
>  mm/shuffle.c | 23 -----------------------
>  mm/shuffle.h | 26 +++++++++++++++++++++++++-
>  2 files changed, 25 insertions(+), 24 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index 44406d9977c7..ea281d5e1f23 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -182,26 +182,3 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
>         for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
>                 shuffle_zone(z);
>  }
> -
> -bool shuffle_pick_tail(void)
> -{
> -       static u64 rand;
> -       static u8 rand_bits;
> -       bool ret;
> -
> -       /*
> -        * The lack of locking is deliberate. If 2 threads race to
> -        * update the rand state it just adds to the entropy.
> -        */
> -       if (rand_bits == 0) {
> -               rand_bits = 64;
> -               rand = get_random_u64();
> -       }
> -
> -       ret = rand & 1;
> -
> -       rand_bits--;
> -       rand >>= 1;
> -
> -       return ret;
> -}
> diff --git a/mm/shuffle.h b/mm/shuffle.h
> index 4d79f03b6658..fb79e05cd86d 100644
> --- a/mm/shuffle.h
> +++ b/mm/shuffle.h
> @@ -22,7 +22,31 @@ enum mm_shuffle_ctl {
>  DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
>  extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
>  extern void __shuffle_free_memory(pg_data_t *pgdat);
> -extern bool shuffle_pick_tail(void);
> +static inline bool shuffle_pick_tail(void)
> +{
> +       static unsigned long rand;      /* buffered random bits */
> +       unsigned long r = READ_ONCE(rand), rshift = r << 1;
> +
> +       /*
> +        * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a
> +        * 1 bit, then zero-padding in the lsbits.  This allows us to
> +        * maintain the pre-generated bits and the count of bits in a
> +        * single, atomically updatable, variable.
> +        *
> +        * The lack of locking is deliberate. If two threads race to
> +        * update the rand state it just adds to the entropy.  The
> +        * worst that can happen is a random bit is used twice, or
> +        * get_random_long is called redundantly.
> +        */
> +       if (unlikely(rshift == 0)) {
> +               r = get_random_long();
> +               rshift = r << 1 | 1;
> +       }
> +       WRITE_ONCE(rand, rshift);
> +
> +       return (long)r < 0;
> +}
> +
>  static inline void shuffle_free_memory(pg_data_t *pgdat)
>  {
>         if (!static_branch_unlikely(&page_alloc_shuffle_key))
>
> base-commit: 47780d7892b77e922bbe19b5dea99cde06b2f0e5
> --
> 2.26.0.rc2
Kees Cook March 20, 2020, 5:58 p.m. UTC | #2
On Thu, Mar 19, 2020 at 12:05:22PM +0000, George Spelvin wrote:
> The separate "rand" and "rand_count" variables could get out of
> sync with bad results.
> 
> In the worst case, two threads would see rand_count=1 and both
> decrement it, resulting in rand_count=255 and rand being filled
> with zeros for the next 255 calls.
> 
> Instead, pack them both into a single, atomically updateable,
> variable.  This makes it a lot easier to reason about race
> conditions.  They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
> 
> Second, use READ_ONCE and WRITE_ONCE.  Because the random bit
> buffer is accessed by multiple threads concurrently without
> locking, omitting those puts us deep in the land of nasal demons.
> The compiler would be free to spill to the static variable in
> arbitrarily perverse ways and create hard-to-find bugs.
> 
> (I'm torn between this and just declaring the buffer "volatile".
> Linux tends to prefer marking accesses rather than variables,
> but in this case, every access to the buffer is volatile.
> It makes no difference to the generated code.)
> 
> Third, use long rather than u64.  This not only keeps the state
> atomically updateable, it also speeds up the fast path on 32-bit
> machines.  Saving at least three instructions on the fast path (one
> load, one add-with-carry, and one store) is worth a second call
> to get_random_u*() per 64 bits.  The fast path of get_random_u*
> is less than the 3*64 = 192 instructions saved, and the slow path
> happens every 64 bytes so isn't affected by the change.
> 
> Fourth, make the function inline.  It's small, and there's only
> one caller (in mm/page_alloc.c:__free_one_page()), so avoid the
> function call overhead.
> 
> Fifth, use the msbits of the buffer first (left shift) rather
> than the lsbits (right shift).  Testing the sign bit produces
> slightly smaller/faster code than testing the lsbit.
> 
> I've tried shifting both ways, and copying the desired bit to a
> boolean before shifting rather than keeping separate full-width
> r and rshift variables, but both produce larger code:
> 
> 	x86-64 text size
> Msbit   	42236
> Lsbit   	42242 (+6)
> Lsbit+bool	42258 (+22)
> Msbit+bool	42284 (+52)
> 
> (Since this is straight-line code, size is a good proxy for number
> of instructions and execution time.  Using READ/WRITE_ONCE instead of
> volatile makes no difference.)
> 
> In a perfect world, on x86-64 the fast path would be:
> 	shlq	rand(%eip)
> 	jz	refill
> refill_complete:
> 	jc	add_to_tail
> 
> but I don't see how to get gcc to generate that, and this
> function isn't worth arch-specific implementation.
> 
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Acked-by: Kees Cook <keescook@chromium.org>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
> Cc: Alexander Duyck <alexander.duyck@gmail.com>
> Cc: Randy Dunlap <rdunlap@infradead.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
> v2: Rewrote commit message to explain existing races better.
>     Made local variables unsigned to avoid (technically undefined)
>     signed overflow.
> v3: Typos fixed, Acked-by, expanded commit message.
> v4: Rebase against -next; function has changed from 
>     add_to_free_area_random() to shuffle_pick_tail.  Move to
>     inline function in shuffle.h.
>     Not sure if it's okay to keep Acked-by: after such a
>     significant change.

Good by me. Thanks!

-Kees
diff mbox series

Patch

diff --git a/mm/shuffle.c b/mm/shuffle.c
index 44406d9977c7..ea281d5e1f23 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -182,26 +182,3 @@  void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 	for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
 		shuffle_zone(z);
 }
-
-bool shuffle_pick_tail(void)
-{
-	static u64 rand;
-	static u8 rand_bits;
-	bool ret;
-
-	/*
-	 * The lack of locking is deliberate. If 2 threads race to
-	 * update the rand state it just adds to the entropy.
-	 */
-	if (rand_bits == 0) {
-		rand_bits = 64;
-		rand = get_random_u64();
-	}
-
-	ret = rand & 1;
-
-	rand_bits--;
-	rand >>= 1;
-
-	return ret;
-}
diff --git a/mm/shuffle.h b/mm/shuffle.h
index 4d79f03b6658..fb79e05cd86d 100644
--- a/mm/shuffle.h
+++ b/mm/shuffle.h
@@ -22,7 +22,31 @@  enum mm_shuffle_ctl {
 DECLARE_STATIC_KEY_FALSE(page_alloc_shuffle_key);
 extern void page_alloc_shuffle(enum mm_shuffle_ctl ctl);
 extern void __shuffle_free_memory(pg_data_t *pgdat);
-extern bool shuffle_pick_tail(void);
+static inline bool shuffle_pick_tail(void)
+{
+	static unsigned long rand;	/* buffered random bits */
+	unsigned long r = READ_ONCE(rand), rshift = r << 1;
+
+	/*
+	 * rand holds 0..BITS_PER_LONG-1 random msbits, followed by a
+	 * 1 bit, then zero-padding in the lsbits.  This allows us to
+	 * maintain the pre-generated bits and the count of bits in a
+	 * single, atomically updatable, variable.
+	 *
+	 * The lack of locking is deliberate. If two threads race to
+	 * update the rand state it just adds to the entropy.  The
+	 * worst that can happen is a random bit is used twice, or
+	 * get_random_long is called redundantly.
+	 */
+	if (unlikely(rshift == 0)) {
+		r = get_random_long();
+		rshift = r << 1 | 1;
+	}
+	WRITE_ONCE(rand, rshift);
+
+	return (long)r < 0;
+}
+
 static inline void shuffle_free_memory(pg_data_t *pgdat)
 {
 	if (!static_branch_unlikely(&page_alloc_shuffle_key))