Message ID | 20200318014410.GA2281@SDF.ORG (mailing list archive)
State      | New, archived
Series     | [v2] mm/shuffle.c: Fix races in add_to_free_area_random()
Hi,
On 3/17/20 6:44 PM, George Spelvin wrote:
> + unsigned long r = READ_ONCE(rand), rshift = r << 1;;
don't need the double ending ;;
On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
[...]
> -	if (rand_bits == 0) {
> -		rand_bits = 64;
> -		rand = get_random_u64();
> +	if (unlikely(rshift == 0)) {

I had the impression that unless unlikely is "mostly never" then it
can do more harm than good. Is a branch guaranteed to be taken every
BITS_PER_LONG'th occurrence really a candidate for unlikely()
annotation?

Otherwise this change looks good to me.
On Wed, Mar 18, 2020 at 01:44:10AM +0000, George Spelvin wrote:
[...]
> In the worst case, two threads would see rand_count == 1 and
> both decrement it, resultint in rand_count = 255 and rand being

typo: resultint -> resulting

> filled with zeros for the next 255 calls.
[...]
> Signed-off-by: George Spelvin <lkml@sdf.org>

And with Randy's other fix, please consider this:

Acked-by: Kees Cook <keescook@chromium.org>

-Kees
On Tue, Mar 17, 2020 at 08:53:55PM -0700, Dan Williams wrote:
> On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>> -	if (rand_bits == 0) {
>> -		rand_bits = 64;
>> -		rand = get_random_u64();
>> +	if (unlikely(rshift == 0)) {
>
> I had the impression that unless unlikely is "mostly never" then it
> can do more harm than good. Is a branch guaranteed to be taken every
> BITS_PER_LONG'th occurrence really a candidate for unlikely()
> annotation?

I had to look this up.  GCC manual:

    For the purposes of branch prediction optimizations, the probability
    that a '__builtin_expect' expression is 'true' is controlled by GCC's
    'builtin-expect-probability' parameter, which defaults to 90%.  You can
    also use '__builtin_expect_with_probability' to explicitly assign a
    probability value to individual expressions.

So I think that <= 10% is good enough, which is true in this case.

I was trying to encourage the compiler to:
* Place this code path out of line, and
* Not do the stack manipulations (build a frame, spill registers)
  needed for a non-leaf function if this path isn't taken.
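(For reference: the kernel's unlikely(x) is essentially a wrapper around
__builtin_expect(!!(x), 0).  The fragment below is a sketch only, not part
of the posted patch; it reuses the patch's variable names to show how the
refill branch could state its true probability explicitly on GCC 9 and
later instead of relying on the 90% default.)

	/*
	 * Sketch only -- not in the patch.  GCC 9+ lets the code state
	 * the actual branch probability, here 1/BITS_PER_LONG:
	 */
	if (__builtin_expect_with_probability(rshift == 0, 1,
					      1.0 / BITS_PER_LONG)) {
		r = get_random_long();
		rshift = r << 1 | 1;
	}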
On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
[...]
> +	if (unlikely(rshift == 0)) {
> +		r = get_random_long();
> +		rshift = r << 1 | 1;

You might want to wrap the "r << 1" in parenthesis. Also you could
probably use a + 1 instead of an | 1.

>  	}
> +	WRITE_ONCE(rand, rshift);
>
> -	if (rand & 1)
> +	if ((long)r < 0)

One trick you might be able to get away with here is to actually compare
r to rshift.  "If (rshift <= r)" should give you the same result.  This
works since what you are essentially doing is just adding r to itself so
if you overflow rshift will be equal to at most r - 1.  However with the
addition of the single bit in the rshift == 0 case it could potentially
be equal in the unlikely case of r being all 1's.

>  		add_to_free_area(page, area, migratetype);
>  	else
>  		add_to_free_area_tail(page, area, migratetype);
> -	rand_bits--;
> -	rand >>= 1;
>  }
> --
> 2.26.0.rc2
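(As a quick sanity check of that suggestion -- illustrative only, nothing
in the thread says anyone ran this -- a small user-space program can
confirm that "rshift <= r" agrees with the sign-bit test on both the fast
path, rshift = r << 1 with r nonzero, and the refill path,
rshift = r << 1 | 1.)

	#include <assert.h>
	#include <stdio.h>

	int main(void)
	{
		/* a few representative patterns, including the all-ones edge case */
		unsigned long samples[] = {
			1UL, 2UL, ~0UL, ~0UL >> 1,
			1UL << (8 * sizeof(long) - 1),
		};
		unsigned int i;

		for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
			unsigned long r = samples[i];
			int sign_test = (long)r < 0;

			/* fast path: rshift = r << 1, r known non-zero */
			assert(((r << 1) <= r) == sign_test);
			/* refill path: rshift = r << 1 | 1 */
			assert((((r << 1) | 1) <= r) == sign_test);
			printf("r=%#lx: msb=%d\n", r, sign_test);
		}
		return 0;
	}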
On Wed, Mar 18, 2020 at 1:20 AM George Spelvin <lkml@sdf.org> wrote:
[...]
> I was trying to encourage the compiler to:
> * Place this code path out of line, and
> * Not do the stack manipulations (build a frame, spill registers)
>   needed for a non-leaf function if this path isn't taken.

Understood, I think it's ok in this case because the shuffling only
happens for order-10 page free events by default so it will be
difficult to measure the perf impact either way.  But in other kernel
contexts I think unlikely() annotation should come with numbers; 90%
not taken is not sufficient in and of itself.

You can add:

Acked-by: Dan Williams <dan.j.williams@intel.com>
On Wed, Mar 18, 2020 at 08:26:06AM -0700, Alexander Duyck wrote:
> On Tue, Mar 17, 2020 at 6:44 PM George Spelvin <lkml@sdf.org> wrote:
>> +	if (unlikely(rshift == 0)) {
>> +		r = get_random_long();
>> +		rshift = r << 1 | 1;
>
> You might want to wrap the "r << 1" in parenthesis. Also you could
> probably use a + 1 instead of an | 1.

I could, but what would it matter?  I have just confirmed that all of:
	x << 1 | 1;
	(x << 1) + 1;
	x + x + 1;
	x + x | 1;
	2*x + 1;
	2*x | 1;
compile to
	leal	1(%rdi,%rdi), %eax
on x86, and two instructions on every other processor I can think of.

Since this is conceptually a bit-manipulation operation where carry
propagation is undesirable, the logical operation form seems the most
natural way to write it.

As for the parens, all C programmers are forced to remember that the
boolean operators have weirdly low precedence (below < <= == >= >),
so there's no risk of confusion.

>>  	}
>> +	WRITE_ONCE(rand, rshift);
>>
>> -	if (rand & 1)
>> +	if ((long)r < 0)
>
> One trick you might be able to get away with here is to actually compare
> r to rshift.  "If (rshift <= r)" should give you the same result. [...]

Er... but why would I want to?  On most processors, "branch on sign bit"
is a single instruction, and that's the instruction I'm hoping the
compiler will generate.

That's why I changed the shift direction from the original right (testing
the lsbit) to left (testing the msbit): slight code size reduction.

Anything else produces larger and slower object code, for no benefit.
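(To reproduce that comparison yourself, a one-line function like the
sketch below can be fed to "gcc -O2 -S" and the assembly inspected; the
function name is purely illustrative.)

	unsigned long shift_in_one(unsigned long x)
	{
		return x << 1 | 1;	/* gcc -O2 on x86-64 emits a single lea */
	}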
On Wed, Mar 18, 2020 at 11:35 AM George Spelvin <lkml@sdf.org> wrote:
[...]
> Since this is conceptually a bit-manipulation operation where carry
> propagation is undesirable, the logical operation form seems the most
> natural way to write it.
>
> As for the parens, all C programmers are forced to remember that the
> boolean operators have weirdly low precedence (below < <= == >= >),
> so there's no risk of confusion.

Okay, if this is coming out the same regardless I suppose it doesn't
matter.  I am still not a huge fan of skipping the parenthesis as I feel
it is a bit uglier to read but that isn't anything that has to be fixed.

[...]
> Er... but why would I want to?  On most processors, "branch on sign bit"
> is a single instruction, and that's the instruction I'm hoping the
> compiler will generate.
>
> That's why I changed the shift direction from the original right (testing
> the lsbit) to left (testing the msbit): slight code size reduction.
>
> Anything else produces larger and slower object code, for no benefit.

I was just putting it out there as a possibility.  What I have seen in
the past is that under some circumstances gcc can be smart enough to
interpret that as a "branch on carry".  My thought was you are likely
having to test the value against itself and then you might be able to
make use of shift and carry flag to avoid that.  In addition you could
get away from having to recast an unsigned value as a signed one in
order to perform the bit test.
On Wed, Mar 18, 2020 at 10:36:10AM -0700, Dan Williams wrote:
[...]
> Understood, I think it's ok in this case because the shuffling only
> happens for order-10 page free events by default so it will be
> difficult to measure the perf impact either way.  But in other kernel
> contexts I think unlikely() annotation should come with numbers; 90%
> not taken is not sufficient in and of itself.

I'm not sure I fully understand your point.  I *think* you're
editorializing on unlikely() in general and not this specific code,
but it's a little hard to follow.

Your mention of "order-10 page free events" is confusing.  Do you mean
"(order-10 page) free events", i.e. freeing of 1024 consecutive pages?
Or are you using "order" as a synonym for "approximately" and you mean
"approximately 10 (page free event)s"?

We both agree (I hope) that the number here is obvious on brief
inspection: 1/BITS_PER_LONG.

GCC's heuristics are tuned to value cycles on the fast path 9x as much as
cycles on the slow path, so it will spend up to 9 cycles on the slow path
to save a cycle on the fast path.

I've found one comment (https://pastebin.com/S8Y8tqZy) saying that
GCC < 9.x was a lot sloppier on the cost ratio and could pessimize
the code if the branch was more than ~1% taken.  Perhaps that's what
you're remembering?

Fortunately, 1/64 = 1.56% is fairly close to 1%, so I'm not too
worried.

> You can add:
>
> Acked-by: Dan Williams <dan.j.williams@intel.com>

Thank you!
On Wed, Mar 18, 2020 at 12:29 PM George Spelvin <lkml@sdf.org> wrote:
[...]
> I'm not sure I fully understand your point.  I *think* you're
> editorializing on unlikely() in general and not this specific code,
> but it's a little hard to follow.

Yes, editorializing on unlikely().  Specifically I would normally ask
for perf numbers to show that the hint is worth it, but I talked
myself out of asking for that in this case.

> Your mention of "order-10 page free events" is confusing.  Do you mean
> "(order-10 page) free events", i.e. freeing of 1024 consecutive pages?
> Or are you using "order" as a synonym for "approximately" and you mean
> "approximately 10 (page free event)s"?

I'm referring to this:

	if (is_shuffle_order(order))
		add_to_free_area_random(page, &zone->free_area[order],

Where shuffle order is MAX_ORDER-1.  I.e. this code is only triggered
when we might be releasing a 4MB buddy-page.

> I've found one comment (https://pastebin.com/S8Y8tqZy) saying that
> GCC < 9.x was a lot sloppier on the cost ratio and could pessimize
> the code if the branch was more than ~1% taken.  Perhaps that's what
> you're remembering?

Yes, thanks for that digging!

> Fortunately, 1/64 = 1.56% is fairly close to 1%, so I'm not too
> worried.

Makes sense.
On Wed, Mar 18, 2020 at 12:17:14PM -0700, Alexander Duyck wrote:
> I was just putting it out there as a possibility.  What I have seen in
> the past is that under some circumstances gcc can be smart enough to
> interpret that as a "branch on carry".  My thought was you are likely
> having to test the value against itself and then you might be able to
> make use of shift and carry flag to avoid that.  In addition you could
> get away from having to recast an unsigned value as a signed one in
> order to perform the bit test.

Ah, yes, it would be nice if gcc could use the carry bit for r rather
than having to devote a whole register to it.  But it has to do two
unrelated flag tests (zero and carry), and it's generally pretty bad at
taking advantage of preserved flag bits like that.

My ideal x86-64 object code would be:

	shlq	rand(%rip)
	jz	fixup
fixed:
	jnc	tail
	jmp	add_to_free_area
tail:
	jmp	add_to_free_area_tail

fixup:
	pushq	%rdx
	pushq	%rsi
	pushq	%rdi
	call	get_random_u64
	popq	%rdi
	popq	%rsi
	popq	%rdx
	stc
	adcq	%rax,%rax
	movq	%rax, rand(%rip)
	jmp	fixed

... but I don't know how to induce GCC to generate that, and the
function doesn't seem worthy of platform-specific asm.

(Note that I have to use an add on the slow path because lea doesn't
set the carry bit.)
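(A portable alternative, sketched here only and not posted in the thread:
move the refill into a noinline helper so that, shrink-wrapping
permitting, the frame setup and register spills stay on the cold path.
The helper name rand_refill is hypothetical, and whether this actually
approaches the hand-written code above would need checking against the
generated assembly.)

	/* Sketch only: hypothetical refill helper, not in the patch. */
	static noinline unsigned long rand_refill(unsigned long *rp)
	{
		unsigned long r = get_random_long();

		*rp = r;
		return r << 1 | 1;	/* fresh bits plus the sentinel lsbit */
	}

	...
		if (unlikely(rshift == 0))
			rshift = rand_refill(&r);
		WRITE_ONCE(rand, rshift);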
On Wed, Mar 18, 2020 at 12:40:33PM -0700, Dan Williams wrote:
> Yes, editorializing on unlikely().  Specifically I would normally ask
> for perf numbers to show that the hint is worth it, but I talked
> myself out of asking for that in this case.

Ah, now I understand!  Yes, it's just gilding the lily, but as long as
I was messing about in the area it seemed worth adding.

You're saying that if the code were a hotter code path, you'd want
benchmarks, but given its actual infrequent usage, you're satisfied
with the argument that it's probably right; it's not like we've
crippled the kernel even if wrong.

> I'm referring to this:
>
>	if (is_shuffle_order(order))
>		add_to_free_area_random(page, &zone->free_area[order],
>
> Where shuffle order is MAX_ORDER-1.  I.e. this code is only triggered
> when we might be releasing a 4MB buddy-page.

Ding!  Okay, now it makes sense.  I didn't look that far up the call
stack.

I was auditing each and every call to a get_random function in the
kernel, when I came across this call site, struggled to understand what
it was doing, and rewrote it to be less wrong.  I only vaguely
understand where it fits into the larger mm/ system so I never noticed
that condition on its use.

Yes, given that infrequent use, I'm even happier that I focused on
code size rather than performance.

Thank you for taking the time to explain.
diff --git a/mm/shuffle.c b/mm/shuffle.c
index e0ed247f8d90..4ba3ba84764d 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -186,22 +186,28 @@ void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 void add_to_free_area_random(struct page *page, struct free_area *area,
 		int migratetype)
 {
-	static u64 rand;
-	static u8 rand_bits;
+	static long rand;	/* 0..BITS_PER_LONG-1 buffered random bits */
+	unsigned long r = READ_ONCE(rand), rshift = r << 1;;
 
 	/*
-	 * The lack of locking is deliberate. If 2 threads race to
-	 * update the rand state it just adds to the entropy.
+	 * rand holds some random msbits, with a 1 bit appended, followed
+	 * by zero-padding in the lsbits.  This allows us to maintain
+	 * the pre-generated bits and the count of bits in a single,
+	 * atomically updatable, variable.
+	 *
+	 * The lack of locking is deliberate. If two threads race to
+	 * update the rand state it just adds to the entropy.  The
+	 * worst that can happen is a random bit is used twice, or
+	 * get_random_long is called redundantly.
 	 */
-	if (rand_bits == 0) {
-		rand_bits = 64;
-		rand = get_random_u64();
+	if (unlikely(rshift == 0)) {
+		r = get_random_long();
+		rshift = r << 1 | 1;
 	}
+	WRITE_ONCE(rand, rshift);
 
-	if (rand & 1)
+	if ((long)r < 0)
 		add_to_free_area(page, area, migratetype);
 	else
 		add_to_free_area_tail(page, area, migratetype);
-	rand_bits--;
-	rand >>= 1;
 }
The old code had separate "rand" and "rand_count" variables,
which could get out of sync with bad results.

In the worst case, two threads would see rand_count == 1 and
both decrement it, resultint in rand_count = 255 and rand being
filled with zeros for the next 255 calls.

Instead, pack them both into a single, atomically updatable,
variable.  This makes it a lot easier to reason about race
conditions.  They are still there - the code deliberately eschews
locking - but basically harmless on the rare occasions that
they happen.

Second, use READ_ONCE and WRITE_ONCE.  Without them, we are deep
in the land of nasal demons.  The compiler would be free to spill
temporaries to the static variables in arbitrary perverse ways
and create hard-to-find bugs.

(Alternatively, we could declare the static variable "volatile",
one of the few places in the Linux kernel that would be correct,
but it would probably annoy Linus.)

Third, use long rather than u64.  This not only keeps the
state atomically updatable, it also speeds up the fast path
on 32-bit machines.  Saving at least three instructions on
the fast path (one load, one add-with-carry, and one store)
is worth exchanging one call to get_random_u64 for two
calls to get_random_u32.  The fast path of get_random_* is
less than the 3*64 = 192 instructions saved, and the slow
path happens every 64 bytes so isn't affected by the change.

I've tried a few variants.  Keeping random lsbits with
a most-significant end marker, and using an explicit bool
flag rather than testing r both increase code size slightly.

                x86_64  i386
  This code     94      95
  Explicit bool 103     99
  Lsbits        99      101
  Both          96      100

Signed-off-by: George Spelvin <lkml@sdf.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
 mm/shuffle.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)
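(For anyone who wants to experiment with the buffering scheme outside the
kernel, here is a minimal user-space sketch; the helper names are
hypothetical, fake_get_random_long() is a toy stand-in for the kernel's
get_random_long(), and it assumes a 64-bit unsigned long.  The static word
keeps pre-generated random msbits, then a single 1 sentinel bit, then zero
padding, so the buffered bits and the implicit count live in one
atomically-updatable variable.)

	#include <stdio.h>
	#include <stdlib.h>

	static unsigned long rand_word;	/* plays the role of the static "rand" */

	/* Toy stand-in for get_random_long(); quality doesn't matter here. */
	static unsigned long fake_get_random_long(void)
	{
		return ((unsigned long)rand() << 33) ^
		       ((unsigned long)rand() << 11) ^
		       (unsigned long)rand();
	}

	static int next_random_bit(void)
	{
		unsigned long r = rand_word, rshift = r << 1;

		if (rshift == 0) {		/* only the sentinel was left: refill */
			r = fake_get_random_long();
			rshift = r << 1 | 1;	/* fresh bits, new sentinel in the lsbit */
		}
		rand_word = rshift;		/* racy in the kernel, but benignly so */

		return (long)r < 0;		/* consume the msbit of r */
	}

	int main(void)
	{
		int i;

		/* 130 bits crosses two refills on a 64-bit machine */
		for (i = 0; i < 130; i++)
			putchar('0' + next_random_bit());
		putchar('\n');
		return 0;
	}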