
[v2] mm/shuffle.c: Fix races in add_to_free_area_random()

Message ID 20200318014410.GA2281@SDF.ORG (mailing list archive)
State New, archived
Series [v2] mm/shuffle.c: Fix races in add_to_free_area_random()

Commit Message

George Spelvin March 18, 2020, 1:44 a.m. UTC
The old code had separate "rand" and "rand_count" variables,
which could get out of sync with bad results.

In the worst case, two threads would see rand_count == 1 and
both decrement it, resultint in rand_count = 255 and rand being
filled with zeros for the next 255 calls.

Instead, pack them both into a single, atomically updatable,
variable.  This makes it a lot easier to reason about race
conditions.  They are still there - the code deliberately eschews
locking - but basically harmless on the rare occasions that
they happen.
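
(Purely as illustration -- a stand-alone sketch of the encoding, not
the patch itself; consume_bit() and refill() are made-up names:)

	/*
	 * The word holds pre-generated random bits in the msbits, then a
	 * single 1 "end marker" bit, then zero padding.  Shifting left
	 * consumes the msbit; the value reaching zero means the marker
	 * has been shifted out and the buffer must be refilled.
	 */
	static int consume_bit(unsigned long *state, unsigned long (*refill)(void))
	{
		unsigned long r = *state, rshift = r << 1;

		if (rshift == 0) {
			r = refill();		/* fresh random word */
			rshift = r << 1 | 1;	/* append the end marker */
		}
		*state = rshift;
		return (long)r < 0;		/* the bit just consumed (old msb) */
	}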

Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
in the land of nasal demons.  The compiler would be free to spill
temporaries to the static variables in arbitrary perverse ways
and create hard-to-find bugs.

(Alternatively, we could declare the static variable "volatile",
one of the few places in the Linux kernel that would be correct,
but it would probably annoy Linus.)
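
(For reference, READ_ONCE and WRITE_ONCE boil down to volatile
accesses; a simplified sketch of the kernel macros, omitting the
type checking:)

	#define READ_ONCE(x)		(*(const volatile typeof(x) *)&(x))
	#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))

which keeps the compiler from fusing, re-reading, or otherwise
transforming the access the way it could a plain load or store.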

Third, use long rather than u64.  This not only keeps the
state atomically updatable, it also speeds up the fast path
on 32-bit machines.  Saving at least three instructions on
the fast path (one load, one add-with-carry, and one store)
is worth exchanging one call to get_random_u64 for two
calls to get_random_u32.  The fast path of get_random_* costs
less than the 3*64 = 192 instructions saved, and its slow path
happens every 64 bytes of output, so it isn't affected by the change.

I've tried a few variants.  Keeping the random bits in the lsbits
with a most-significant end marker, and using an explicit bool
flag rather than testing r, both increase code size slightly.

                x86_64  i386
This code           94    95
Explicit bool      103    99
Lsbits              99   101
Both                96   100

Signed-off-by: George Spelvin <lkml@sdf.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
 mm/shuffle.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

Comments

Randy Dunlap March 18, 2020, 1:49 a.m. UTC | #1
Hi,

On 3/17/20 6:44 PM, George Spelvin wrote:
> +	unsigned long r = READ_ONCE(rand), rshift = r << 1;;

don't need the double ending ;;
Dan Williams March 18, 2020, 3:53 a.m. UTC | #2
On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>
> The old code had separate "rand" and "rand_count" variables,
> which could get out of sync with bad results.
>
> In the worst case, two threads would see rand_count == 1 and
> both decrement it, resultint in rand_count = 255 and rand being
> filled with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updatable,
> variable.  This makes it a lot easier to reason about race
> conditions.  They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
> in the land of nasal demons.  The compiler would be free to spill
> temporaries to the static variables in arbitrary perverse ways
> and create hard-to-find bugs.
>
> (Alternatively, we could declare the static variable "volatile",
> one of the few places in the Linux kernel that would be correct,
> but it would probably annoy Linus.)
>
> Third, use long rather than u64.  This not only keeps the
> state atomically updatable, it also speeds up the fast path
> on 32-bit machines.  Saving at least three instructions on
> the fast path (one load, one add-with-carry, and one store)
> is worth exchanging one call to get_random_u64 for two
> calls to get_random_u32.  The fast path of get_random_* is
> less than the 3*64 = 192 instructions saved, and the slow
> path happens every 64 bytes so isn't affectrd by the change.
>
> I've tried a few variants.  Keeping random lsbits with
> a most-significant end marker, and using an explicit bool
> flag rather than testing r both increase code size slightly.
>
>                 x86_64  i386
> This code                94      95
> Explicit bool           103      99
> Lsbits           99     101
> Both             96     100
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
>  mm/shuffle.c | 26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index e0ed247f8d90..4ba3ba84764d 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
>  void add_to_free_area_random(struct page *page, struct free_area *area,
>                 int migratetype)
>  {
> -       static u64 rand;
> -       static u8 rand_bits;
> +       static long rand;       /* 0..BITS_PER_LONG-1 buffered random bits */
> +       unsigned long r = READ_ONCE(rand), rshift = r << 1;;
>
>         /*
> -        * The lack of locking is deliberate. If 2 threads race to
> -        * update the rand state it just adds to the entropy.
> +        * rand holds some random msbits, with a 1 bit appended, followed
> +        * by zero-padding in the lsbits.  This allows us to maintain
> +        * the pre-generated bits and the count of bits in a single,
> +        * atomically updatable, variable.
> +        *
> +        * The lack of locking is deliberate. If two threads race to
> +        * update the rand state it just adds to the entropy.  The
> +        * worst that can happen is a random bit is used twice, or
> +        * get_random_long is called redundantly.
>          */
> -       if (rand_bits == 0) {
> -               rand_bits = 64;
> -               rand = get_random_u64();
> +       if (unlikely(rshift == 0)) {

I had the impression that unless unlikely is "mostly never" then it
can do more harm than good. Is a branch guaranteed to be taken every
BITS_PER_LONG'th occurrence really a candidate for unlikely()
annotation?

Otherwise this change looks good to me.

> +               r = get_random_long();
> +               rshift = r << 1 | 1;
>         }
> +       WRITE_ONCE(rand, rshift);
>
> -       if (rand & 1)
> +       if ((long)r < 0)
>                 add_to_free_area(page, area, migratetype);
>         else
>                 add_to_free_area_tail(page, area, migratetype);
> -       rand_bits--;
> -       rand >>= 1;
>  }
> --
> 2.26.0.rc2
Kees Cook March 18, 2020, 3:58 a.m. UTC | #3
On Wed, Mar 18, 2020 at 01:44:10AM +0000, George Spelvin wrote:
> The old code had separate "rand" and "rand_count" variables,
> which could get out of sync with bad results.
> 
> In the worst case, two threads would see rand_count == 1 and
> both decrement it, resultint in rand_count = 255 and rand being

typo: resultint -> resulting

> filled with zeros for the next 255 calls.
> 
> Instead, pack them both into a single, atomically updatable,
> variable.  This makes it a lot easier to reason about race
> conditions.  They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
> 
> Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
> in the land of nasal demons.  The compiler would be free to spill
> temporaries to the static variables in arbitrary perverse ways
> and create hard-to-find bugs.
> 
> (Alternatively, we could declare the static variable "volatile",
> one of the few places in the Linux kernel that would be correct,
> but it would probably annoy Linus.)
> 
> Third, use long rather than u64.  This not only keeps the
> state atomically updatable, it also speeds up the fast path
> on 32-bit machines.  Saving at least three instructions on
> the fast path (one load, one add-with-carry, and one store)
> is worth exchanging one call to get_random_u64 for two
> calls to get_random_u32.  The fast path of get_random_* is
> less than the 3*64 = 192 instructions saved, and the slow
> path happens every 64 bytes so isn't affectrd by the change.
> 
> I've tried a few variants.  Keeping random lsbits with
> a most-significant end marker, and using an explicit bool
> flag rather than testing r both increase code size slightly.
> 
> 		x86_64	i386
> This code		 94	 95
> Explicit bool		103	 99
> Lsbits		 99	101
> Both		 96	100
> 
> Signed-off-by: George Spelvin <lkml@sdf.org>

And with Randy's other fix, please consider this:

Acked-by: Kees Cook <keescook@chromium.org>

-Kees

> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
>  mm/shuffle.c | 26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index e0ed247f8d90..4ba3ba84764d 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
>  void add_to_free_area_random(struct page *page, struct free_area *area,
>  		int migratetype)
>  {
> -	static u64 rand;
> -	static u8 rand_bits;
> +	static long rand;	/* 0..BITS_PER_LONG-1 buffered random bits */
> +	unsigned long r = READ_ONCE(rand), rshift = r << 1;;
>  
>  	/*
> -	 * The lack of locking is deliberate. If 2 threads race to
> -	 * update the rand state it just adds to the entropy.
> +	 * rand holds some random msbits, with a 1 bit appended, followed
> +	 * by zero-padding in the lsbits.  This allows us to maintain
> +	 * the pre-generated bits and the count of bits in a single,
> +	 * atomically updatable, variable.
> +	 *
> +	 * The lack of locking is deliberate. If two threads race to
> +	 * update the rand state it just adds to the entropy.  The
> +	 * worst that can happen is a random bit is used twice, or
> +	 * get_random_long is called redundantly.
>  	 */
> -	if (rand_bits == 0) {
> -		rand_bits = 64;
> -		rand = get_random_u64();
> +	if (unlikely(rshift == 0)) {
> +		r = get_random_long();
> +		rshift = r << 1 | 1;
>  	}
> +	WRITE_ONCE(rand, rshift);
>  
> -	if (rand & 1)
> +	if ((long)r < 0)
>  		add_to_free_area(page, area, migratetype);
>  	else
>  		add_to_free_area_tail(page, area, migratetype);
> -	rand_bits--;
> -	rand >>= 1;
>  }
> -- 
> 2.26.0.rc2
George Spelvin March 18, 2020, 8:20 a.m. UTC | #4
On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
> On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>> -       if (rand_bits == 0) {
>> -               rand_bits = 64;
>> -               rand = get_random_u64();
>> +       if (unlikely(rshift == 0)) {
> 
> I had the impression that unless unlikely is "mostly never" then it
> can do more harm than good. Is a branch guaranteed to be taken every
> BITS_PER_LONG'th occurrence really a candidate for unlikely()
> annotation?

I had to look this up.  GCC manual:

For the purposes of branch prediction optimizations, the probability
that a '__builtin_expect' expression is 'true' is controlled by GCC's
'builtin-expect-probability' parameter, which defaults to 90%.  You can
also use '__builtin_expect_with_probability' to explicitly assign a
probability value to individual expressions.

So I think that <= 10% is good enough, which is true in this case.

I was trying to encourage the compiler to:
* Place this code path out of line, and
* Not do the stack manipulations (build a frame, spill registers)
  needed for a non-leaf function if this path isn't taken.
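
(For concreteness, the annotations are thin wrappers around the GCC
builtin; a simplified sketch of the kernel macros, plus a hypothetical
GCC >= 9 form that spells out the probability explicitly:)

	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)

	/* Hypothetical: state the 1/BITS_PER_LONG probability directly */
	if (__builtin_expect_with_probability(rshift == 0, 1, 1.0 / BITS_PER_LONG)) {
		r = get_random_long();
		rshift = r << 1 | 1;
	}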
Alexander Duyck March 18, 2020, 3:26 p.m. UTC | #5
On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>
> The old code had separate "rand" and "rand_count" variables,
> which could get out of sync with bad results.
>
> In the worst case, two threads would see rand_count == 1 and
> both decrement it, resultint in rand_count = 255 and rand being
> filled with zeros for the next 255 calls.
>
> Instead, pack them both into a single, atomically updatable,
> variable.  This makes it a lot easier to reason about race
> conditions.  They are still there - the code deliberately eschews
> locking - but basically harmless on the rare occasions that
> they happen.
>
> Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
> in the land of nasal demons.  The compiler would be free to spill
> temporaries to the static variables in arbitrary perverse ways
> and create hard-to-find bugs.
>
> (Alternatively, we could declare the static variable "volatile",
> one of the few places in the Linux kernel that would be correct,
> but it would probably annoy Linus.)
>
> Third, use long rather than u64.  This not only keeps the
> state atomically updatable, it also speeds up the fast path
> on 32-bit machines.  Saving at least three instructions on
> the fast path (one load, one add-with-carry, and one store)
> is worth exchanging one call to get_random_u64 for two
> calls to get_random_u32.  The fast path of get_random_* is
> less than the 3*64 = 192 instructions saved, and the slow
> path happens every 64 bytes so isn't affectrd by the change.
>
> I've tried a few variants.  Keeping random lsbits with
> a most-significant end marker, and using an explicit bool
> flag rather than testing r both increase code size slightly.
>
>                 x86_64  i386
> This code                94      95
> Explicit bool           103      99
> Lsbits           99     101
> Both             96     100
>
> Signed-off-by: George Spelvin <lkml@sdf.org>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> ---
>  mm/shuffle.c | 26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/mm/shuffle.c b/mm/shuffle.c
> index e0ed247f8d90..4ba3ba84764d 100644
> --- a/mm/shuffle.c
> +++ b/mm/shuffle.c
> @@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
>  void add_to_free_area_random(struct page *page, struct free_area *area,
>                 int migratetype)
>  {
> -       static u64 rand;
> -       static u8 rand_bits;
> +       static long rand;       /* 0..BITS_PER_LONG-1 buffered random bits */
> +       unsigned long r = READ_ONCE(rand), rshift = r << 1;;
>
>         /*
> -        * The lack of locking is deliberate. If 2 threads race to
> -        * update the rand state it just adds to the entropy.
> +        * rand holds some random msbits, with a 1 bit appended, followed
> +        * by zero-padding in the lsbits.  This allows us to maintain
> +        * the pre-generated bits and the count of bits in a single,
> +        * atomically updatable, variable.
> +        *
> +        * The lack of locking is deliberate. If two threads race to
> +        * update the rand state it just adds to the entropy.  The
> +        * worst that can happen is a random bit is used twice, or
> +        * get_random_long is called redundantly.
>          */
> -       if (rand_bits == 0) {
> -               rand_bits = 64;
> -               rand = get_random_u64();
> +       if (unlikely(rshift == 0)) {
> +               r = get_random_long();
> +               rshift = r << 1 | 1;

You might want to wrap the "r << 1" in parenthesis. Also you could
probably use a + 1 instead of an | 1.

>         }
> +       WRITE_ONCE(rand, rshift);
>
> -       if (rand & 1)
> +       if ((long)r < 0)

One trick you might be able to get away with here is to actually
compare r to rshift. "If (rshift <= r)" should give you the same
result. This works since what you are essentially doing is just adding
r to itself so if you overflow rshift will be equal to at most r - 1.
However with the addition of the single bit in the rshift == 0 case it
could potentially be equal in the unlikely case of r being all 1's.

>                 add_to_free_area(page, area, migratetype);
>         else
>                 add_to_free_area_tail(page, area, migratetype);
> -       rand_bits--;
> -       rand >>= 1;
>  }
> --
> 2.26.0.rc2
>
Dan Williams March 18, 2020, 5:36 p.m. UTC | #6
On Wed, Mar 18, 2020 at 1:20 AM George Spelvin <lkml@sdf.org> wrote:
>
> On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
> > On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
> >> -       if (rand_bits == 0) {
> >> -               rand_bits = 64;
> >> -               rand = get_random_u64();
> >> +       if (unlikely(rshift == 0)) {
> >
> > I had the impression that unless unlikely is "mostly never" then it
> > can do more harm than good. Is a branch guaranteed to be taken every
> > BITS_PER_LONG'th occurrence really a candidate for unlikely()
> > annotation?
>
> I had to look this up.  GCC manual:
>
> For the purposes of branch prediction optimizations, the probability
> that a '__builtin_expect' expression is 'true' is controlled by GCC's
> 'builtin-expect-probability' parameter, which defaults to 90%.  You can
> also use '__builtin_expect_with_probability' to explicitly assign a
> probability value to individual expressions.
>
> So I think that <= 10% is good enough, which is true in this case.
>
> I was tring to encourage the compiler to:
> * Place this code path out of line, and
> * Not do the stack manipulations (build a frame, spill registers)
>   needed for a non-leaf function if this path isn't taken.

Understood, I think it's ok in this case because the shuffling only
happens for order-10 page free events by default so it will be
difficult to measure the perf impact either way.  But in other kernel
contexts I think unlikely() annotation should come with numbers, 90%
not taken is not sufficient in and of itself.

You can add:

Acked-by: Dan Williams <dan.j.williams@intel.com>
George Spelvin March 18, 2020, 6:35 p.m. UTC | #7
On Wed, Mar 18, 2020 at 08:26:06AM -0700, Alexander Duyck wrote:
> On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>> +       if (unlikely(rshift == 0)) {
>> +               r = get_random_long();
>> +               rshift = r << 1 | 1;
> 
> You might want to wrap the "r << 1" in parenthesis. Also you could
> probably use a + 1 instead of an | 1.

I could, but what would it matter?  I have just confirmed that all of:
	x << 1 | 1;
	(x << 1) + 1;
	x + x + 1;
	x + x | 1;
	2*x + 1;
	2*x | 1;
compile to
	leal	1(%rdi,%rdi), %eax

on x86, and two instructions on every other processor I can think of. 

Since this is conceptually a bit-manipulation operation where carry
propagation is undesirable, the logical operation form seems the most
natural way to write it.

As for the parens, all C programmers are forced to remember that the
boolean operators have weirdly low precedence (below < <= == >= >),
so there's no risk of confusion.

>>         }
>> +       WRITE_ONCE(rand, rshift);
>>
>> -       if (rand & 1)
>> +       if ((long)r < 0)
> 
> One trick you might be able to get away with here is to actually
> compare r to rshift. "If (rshift <= r)" should give you the same
> result. This works since what you are essentially doing is just adding
> r to itself so if you overflow rshift will be equal to at most r - 1.
> However with the addition of the single bit in the rshift == 0 case it
> could potentially be equal in the unlikely case of r being all 1's.

Er... but why would I want to?  On most processors, "branch on sign bit"
is a single instruction, and that's the instruction I'm hoping the 
compiler will generate.

That's why I changed the shift direction from the original right (testing
the lsbit) to left (testing the msbit): slight code size reduction.

Anything else produces larger and slower object code, for no benefit.
Alexander Duyck March 18, 2020, 7:17 p.m. UTC | #8
On Wed, Mar 18, 2020 at 11:35 AM George Spelvin <lkml@sdf.org> wrote:
>
> On Wed, Mar 18, 2020 at 08:26:06AM -0700, Alexander Duyck wrote:
> > On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
> >> +       if (unlikely(rshift == 0)) {
> >> +               r = get_random_long();
> >> +               rshift = r << 1 | 1;
> >
> > You might want to wrap the "r << 1" in parenthesis. Also you could
> > probably use a + 1 instead of an | 1.
>
> I could, but what would it matter?  I have just confirmed that all of:
>         x << 1 | 1;
>         (x << 1) + 1;
>         x + x + 1;
>         x + x | 1;
>         2*x + 1;
>         2*x | 1;
> compile to
>         leal    1(%rdi,%rdi), %eax
>
> on x86, and two instructions on every other processor I can think of.
>
> Since this is concpetually a bit-manipulation operation where carry
> propagation is undesirable, the logical operation form seems the most
> natural way to write it.
>
> As for the parens, all C programmers are forced to remember that the
> boolean operators have weirdly low precedence (below < <= == >= >),
> so there's no risk of confusion.

Okay, if this is coming out the same regardless I suppose it doesn't
matter. I am still not a huge fan of skipping the parenthesis as I
feel it is a bit uglier to read but that isn't anything that has to be
fixed.

> >>         }
> >> +       WRITE_ONCE(rand, rshift);
> >>
> >> -       if (rand & 1)
> >> +       if ((long)r < 0)
> >
> > One trick you might be able to get away with here is to actually
> > compare r to rshift. "If (rshift <= r)" should give you the same
> > result. This works since what you are essentially doing is just adding
> > r to itself so if you overflow rshift will be equal to at most r - 1.
> > However with the addition of the single bit in the rshift == 0 case it
> > could potentially be equal in the unlikely case of r being all 1's.
>
> Er... but why would I want to?  On most processors, "branch on sign bit"
> is a single instruction, and that's the instruction I'm hoping the
> compiler will generate.
>
> That's why I changed the shift direction from the original right (testing
> the lsbit) to left (testing the msbit): slight code size reduction.
>
> Anything else produces larger and slower object code, for no benefit.

I was just putting it out there as a possibility. What I have seen in
the past is that under some circumstances gcc can be smart enough to
interpret that as a "branch on carry". My thought was you are likely
having to test the value against itself and then you might be able to
make use of shift and carry flag to avoid that. In addition you could
get away from having to recast an unsigned value as a signed one in
order to perform the bit test.
George Spelvin March 18, 2020, 7:29 p.m. UTC | #9
On Wed, Mar 18, 2020 at 10:36:10AM -0700, Dan Williams wrote:
> On Wed, Mar 18, 2020 at 1:20 AM George Spelvin <lkml@sdf.org> wrote:
>> On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
>>> I had the impression that unless unlikely is "mostly never" then it
>>> can do more harm than good. Is a branch guaranteed to be taken every
>>> BITS_PER_LONG'th occurrence really a candidate for unlikely()
>>> annotation?
>>
>> I had to look this up.  GCC manual:
>>
>> For the purposes of branch prediction optimizations, the probability
>> that a '__builtin_expect' expression is 'true' is controlled by GCC's
>> 'builtin-expect-probability' parameter, which defaults to 90%.  You can
>> also use '__builtin_expect_with_probability' to explicitly assign a
>> probability value to individual expressions.
>>
>> So I think that <= 10% is good enough, which is true in this case.
>>
>> I was tring to encourage the compiler to:
>> * Place this code path out of line, and
>> * Not do the stack manipulations (build a frame, spill registers)
>>   needed for a non-leaf function if this path isn't taken.
> 
> Understood, I think it's ok in this case because the shuffling only
> happens for order-10 page free events by default so it will be
> difficult to measure the perf impact either way.  But in other kernel
> contexts I think unlikely() annotation should come with numbers, 90%
> not taken is not sufficient in and of itself.

I'm not sure I fully understand your point.  I *think* you're
editorializing on unlikely() in general and not this specific code,
but it's a little hard to follow.

Your mention of "order-10 page free events" is confusing.  Do you mean 
"(order-10 page) free events", i.e. freeing of 1024 consecutive pages?  
Or are you using "order" as a synonym for "approximately" and you mean 
"approximately 10 (page free event)s"?

We both agree (I hope) that the number here is obvious on brief 
inspection: 1/BITS_PER_LONG.  

GCC's heuristics are tuned to value cycles on the fast path 9x as much as 
cycles on the slow path, so it will spend up to 9 cycles on the slow path 
to save a cycle on the fast path.

I've found one comment (https://pastebin.com/S8Y8tqZy) saying that
GCC < 9.x was a lot sloppier on the cost ratio and could pessimize
the code if the branch was more than ~ 1% taken.  Perhaps that's what 
you're remembering?

Fortunately, 1/64 = 1.56% is fairly close to 1%, so I'm not too
worried.

> You can add:
> 
> Acked-by: Dan Williams <dan.j.williams@intel.com>

Thank you!
Dan Williams March 18, 2020, 7:40 p.m. UTC | #10
On Wed, Mar 18, 2020 at 12:29 PM George Spelvin <lkml@sdf.org> wrote:
>
> On Wed, Mar 18, 2020 at 10:36:10AM -0700, Dan Williams wrote:
> > On Wed, Mar 18, 2020 at 1:20 AM George Spelvin <lkml@sdf.org> wrote:
> >> On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
> >>> I had the impression that unless unlikely is "mostly never" then it
> >>> can do more harm than good. Is a branch guaranteed to be taken every
> >>> BITS_PER_LONG'th occurrence really a candidate for unlikely()
> >>> annotation?
> >>
> >> I had to look this up.  GCC manual:
> >>
> >> For the purposes of branch prediction optimizations, the probability
> >> that a '__builtin_expect' expression is 'true' is controlled by GCC's
> >> 'builtin-expect-probability' parameter, which defaults to 90%.  You can
> >> also use '__builtin_expect_with_probability' to explicitly assign a
> >> probability value to individual expressions.
> >>
> >> So I think that <= 10% is good enough, which is true in this case.
> >>
> >> I was tring to encourage the compiler to:
> >> * Place this code path out of line, and
> >> * Not do the stack manipulations (build a frame, spill registers)
> >>   needed for a non-leaf function if this path isn't taken.
> >
> > Understood, I think it's ok in this case because the shuffling only
> > happens for order-10 page free events by default so it will be
> > difficult to measure the perf impact either way.  But in other kernel
> > contexts I think unlikely() annotation should come with numbers, 90%
> > not taken is not sufficient in and of itself.
>
> I'm not sure I fully understand your point.  I *think* you're
> editorializing on unlikely() in general and not this specific code,
> but it's a little hard to follow.

Yes, editorializing on unlikely(). Specifically I would normally ask
for perf numbers to show that the hint is worth it, but I talked
myself out of asking for that in this case.

> Your mention of "order-10 page free events" is confusing.  Do you mean
> "(order-10 page) free events", i.e. freeing of 1024 consecutive pages?
> Or are you using "order" as a synonym for "approximately" and you mean
> "approximately 10 (page free event)s"?

I'm referring to this:

      if (is_shuffle_order(order))
                add_to_free_area_random(page, &zone->free_area[order],

Where shuffle order is MAX_ORDER-1. I.e. this code is only triggered
when we might be releasing a 4MB buddy-page.

> We both agree (I hope) that the number here is obvious on brief
> inspection: 1/BITS_PER_LONG.
>
> GCC's heuristics are tuned to value cycles on the fast path 9x as much as
> cycles on the slow path, so it will spend up to 9 cycles on the slow path
> to save a cycle on the fast path.
>
> I've found one comment (https://pastebin.com/S8Y8tqZy) saying that
> GCC < 9.x was a lot sloppier on the cost ratio and could pessimize
> the code if the branch was more than ~ 1% taken.  Perhaps that's what
> you're remembering?

Yes, thanks for that digging!

> Fortunately, 1/64 = 1.56% is fairly close to 1%. so I'm not too
> worried.

Makes sense.
George Spelvin March 18, 2020, 8:06 p.m. UTC | #11
On Wed, Mar 18, 2020 at 12:17:14PM -0700, Alexander Duyck wrote:
> I was just putting it out there as a possibility. What I have seen in
> the past is that under some circumstances gcc can be smart enough to
> interpret that as a "branch on carry". My thought was you are likely
> having to test the value against itself and then you might be able to
> make use of shift and carry flag to avoid that. In addition you could
> get away from having to recast a unsigned value as a signed one in
> order to perform the bit test.

Ah, yes, it would be nice if gcc could use the carry bit for r
rather than having to devote a whole register to it.  But it has
to do two unrelated flag tests (zero and carry), and it's generally
pretty bad at taking advantage of preserved flag bits like that.

My ideal x86-64 object code would be:
	shlq	rand(%rip)
	jz	fixup
fixed:
	jnc	tail
	jmp	add_to_free_area
tail:
	jmp	add_to_free_area_tail
fixup:
	pushq	%rdx
	pushq	%rsi
	pushq	%rdi
	call	get_random_u64
	popq	%rdi
	popq	%rsi
	popq	%rdx
	stc
	adcq	%rax,%rax
	movq	%rax, rand(%rip)
	jmp	fixed

... but I don't know how to induce GCC to generate that, and
the function doesn't seem worthy of platform-specific asm.

(Note that I have to use an add on the slow path because lea doesn't
set the carry bit.)
George Spelvin March 18, 2020, 9:02 p.m. UTC | #12
On Wed, Mar 18, 2020 at 12:40:33PM -0700, Dan Williams wrote:
> Yes, editorializing on unlikely(). Specifically I would normally ask
> for perf numbers to show that the hint is worth it, but I talked
> myself out of asking for that in this case.

Ah, now I understand!  Yes, it's just gilding the lily, but as long as I 
was messing about in the area it seemed worth adding.

You're saying that if the code were a hotter code path, you'd want 
benchmarks, but given its actual infrequent usage, you're satisfied with 
the argument that it's probably right; it's not like we've crippled the 
kernel even if wrong.

> I'm referring to this:
> 
>       if (is_shuffle_order(order))
>                 add_to_free_area_random(page, &zone->free_area[order],
> 
> Where shuffle order is MAX_ORDER-1. I.e. this code is only triggered
> when we might be releasing a 4MB buddy-page.

Ding!  Okay, now it makes sense.  I didn't look that far up the call
stack.  I was auditing each and every call to a get_random function in
the kernel, when I came across this call site, struggled to understand
what it was doing, and rewrote it to be less wrong.

I only vaguely understand where it fits into the larger mm/ system
so I never noticed that condition on its use.

Yes, given that infrequent use, I'm even happier that I focused on code
size rather than performance.

Thank you for taking the time to explain.

Patch

diff --git a/mm/shuffle.c b/mm/shuffle.c
index e0ed247f8d90..4ba3ba84764d 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -186,22 +186,28 @@  void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 void add_to_free_area_random(struct page *page, struct free_area *area,
 		int migratetype)
 {
-	static u64 rand;
-	static u8 rand_bits;
+	static long rand;	/* 0..BITS_PER_LONG-1 buffered random bits */
+	unsigned long r = READ_ONCE(rand), rshift = r << 1;;
 
 	/*
-	 * The lack of locking is deliberate. If 2 threads race to
-	 * update the rand state it just adds to the entropy.
+	 * rand holds some random msbits, with a 1 bit appended, followed
+	 * by zero-padding in the lsbits.  This allows us to maintain
+	 * the pre-generated bits and the count of bits in a single,
+	 * atomically updatable, variable.
+	 *
+	 * The lack of locking is deliberate. If two threads race to
+	 * update the rand state it just adds to the entropy.  The
+	 * worst that can happen is a random bit is used twice, or
+	 * get_random_long is called redundantly.
 	 */
-	if (rand_bits == 0) {
-		rand_bits = 64;
-		rand = get_random_u64();
+	if (unlikely(rshift == 0)) {
+		r = get_random_long();
+		rshift = r << 1 | 1;
 	}
+	WRITE_ONCE(rand, rshift);
 
-	if (rand & 1)
+	if ((long)r < 0)
 		add_to_free_area(page, area, migratetype);
 	else
 		add_to_free_area_tail(page, area, migratetype);
-	rand_bits--;
-	rand >>= 1;
 }