diff mbox series

mm/shuffle.c: optimize add_to_free_area_random()

Message ID 20200317135035.GA19442@SDF.ORG (mailing list archive)
State New, archived
Headers show
Series mm/shuffle.c: optimize add_to_free_area_random() | expand

Commit Message

George Spelvin March 17, 2020, 1:50 p.m. UTC
First, use long rather than u64 for the bit buffer type, which
is significantly more efficient on 32-bit processors.

Second, avoid the need for a separate rand_bits counter.
rand_bits is never more than 63, so there's always room in rand
for a bit to mark the end of the available bits.  This makes the
shared state atomically updatable, which makes it a lot easier
to reason about race conditions.

Third, use READ_ONCE and WRITE_ONCE.  Without them, the compiler
may spill to the shared static in arbitrarily perverse ways,
and combined with the fact that the code eschews locking, that
is a recipe for hard-to-find bugs.  Now, a race might cause a bit
to be used twice, or get_random_long() to be called redundantly,
but it can't summon nasal daemons.

I've tried a few variants.  Keeping the random lsbits with a
most-significant end marker, and using an explicit bool variable
rather than testing the sign of r, both increase code size slightly:

                x86_64   i386
This code           94     95
Explicit bool      103     99
Lsbits              99    101
Both                96    100

Signed-off-by: George Spelvin <lkml@sdf.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
---
 mm/shuffle.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

Comments

Kees Cook March 17, 2020, 9:44 p.m. UTC | #1
On Tue, Mar 17, 2020 at 01:50:35PM +0000, George Spelvin wrote:
> First, use long rather than u64 for the bit buffer type, which
> is significantly more efficient on 32-bit processors.

Why? This is twice as many get_random_*() calls for 32-bit: it'll save
whatever bits it doesn't use. (Speaking to word-sized stores, yes, that
makes sense, but see below...)

> Second, avoid the need for a separate rand_bits counter.
> rand_bits is never more than 63, so there's always room in rand
> for a bit to mark the end of the available bits.  This makes the
> shared state atomically updatable, which makes it a lot easier
> to reason about race conditions.

What are the race conditions? I had to go look, and I see that
add_to_free_area_random() is called under __free_one_page() which is
under &zone->lock. Would it make more sense to store the randomness
per-zone and entirely side-step the issues?

> Third, use READ_ONCE and WRITE_ONCE.  Without them, the compiler
> may spill to the shared static in arbitrarily perverse ways,
> and combined with the fact that the code eschews locking, that
> is a recipe for hard-to-find bugs.  Now, a race might cause a bit
> to be used twice, or get_random_long() to be called redundantly,
> but it can't summon nasal daemons.

All these things might make the results less predictable, too. :) But,
yes, if there was deterministic sharing of random values, that'd be
bad. But this patch doesn't really solve the "use the same random bits"
problem, which is the very thing that might get us into a predictable
state. If two zones both call READ_ONCE(), use the same value (eek),
and then both call WRITE_ONCE(), one of them wins the write (no big deal).
George Spelvin March 17, 2020, 11:06 p.m. UTC | #2
On Tue, Mar 17, 2020 at 02:44:33PM -0700, Kees Cook wrote:
> On Tue, Mar 17, 2020 at 01:50:35PM +0000, George Spelvin wrote:
> > First, use long rather than u64 for the bit buffer type, which
> > is significantly more efficient on 32-bit processors.
> 
> Why? This is twice as many get_random_*() calls for 32-bit: it'll save
> whatever bits it doesn't use. (Speaking to word-sized stores, yes, that
> makes sense, but see below...)

Yes, but the per-call overhead is fairly low until you run out of the
64-byte buffer; then there's an expensive crypto operation to
refill the buffer.

So two get_random_u32() calls are not a lot more than one
get_random_u64().  (The buffers are per-CPU, so they are pretty cheap
to access.)  I have other patches I'm working on to speed up the
locking they have, but this one is independent so I'm sending it
separately.

And if you want faster, prandom_u32() is probably good enough for this
application.

I did it to speed up and shrink the fast path.  It saves at least
3 instructions on 32-bit machines (one load, one rotate through carry, 
and one store), at the cost of one extra get_random_*() call per 64
bits.  If get_random_u32()'s fast path is less than 192 instructions,
it's a win.

It's also relevant to the race condition reasoning mentioned below;
a one-word value can be atomically updated.  On a 32-bit machine,
64-bit operations get tearing and it's a PITA to reason about.

>> Second, avoid the need for a separate rand_bits counter.
>> rand_bits is never more than 63, so there's always room in rand
>> for a bit to mark the end of the available bits.  This makes the
>> shared state atomically updatable, which makes it a lot easier
>> to reason about race conditions.
> 
> What are the race conditions? I had to go look, and I see that
> add_to_free_area_random() is called under __free_one_page() which is
> under &zone->lock. Would it make more sense to store the randomness
> per-zone and entirely side-step the issues?

The most serious is that if two threads simultaneously observe
rand_bits == 1, but do their decrements separately, you end up
with rand_bits = 255 and you generate 255 consecutive 0 bits
before refilling the buffer.

Since we're only generating random bits, a screwed-up answer occasionally 
doesn't really matter (like the comment says "lack of locking is 
deliberate"), but 255 screwed up bits is a bit much.

I avoided changing the underlying locking model because I didn't
feel up to such an invasive change; I managed to fix the problems
I saw without going there.  And shrink the code; that seemed like
enough of a win to justify it to me.

If you want to change the locking, the other alternative to per-zone is 
per-CPU, which also avoids locking.  We could even add a generic 
get_random_bit() function.

>> Third, use READ_ONCE and WRITE_ONCE.  Without them, the compiler
>> may spill to the shared static in arbitrarily perverse ways,
>> and combined with the fact that the code eschews locking, that
>> is a recipe for hard-to-find bugs.  Now, a race might cause a bit
>> to be used twice, or get_random_long() to be called redundantly,
>> but it can't summon nasal daemons.
> 
> All these things might make the results less predictable, too. :) But,
> yes, if there was deterministic sharing of random values, that'd be
> bad. But this patch doesn't really solve the "use the same random bits"
> problem, which is the very thing that might get us into a predictable
> state. If two zones both call READ_ONCE(), use the same value (eek),
> and then both call WRITE_ONCE(), one of them wins the write (no big deal).

Absolutely, but in case of a tight race, only *one* bit is used twice,
and the bit used is uniformly distributed, so it's a really small
correlation.  Unlike the 255x zero race the existing code can have.

The main reason I did this is to, as I already wrote, keep the nasal 
demons at bay.  The existing code does things (having another thread 
change a non-volatile variable while this code is executing) that are 
explicitly undefined by the ANSI C standard.

The compiler is allowed to (in John Woods' memorable explanation)
produce code that makes demons fly out of your nose.  (More plausibly,
it may simply crash.)

It probably won't, but I prefer to stay on the safe side of the
language specification contract.  Linus has ranted on more than one
occasion about the perverse ways that GCC has used its permission
to assume that people won't violate the contract.
Kees Cook March 17, 2020, 11:38 p.m. UTC | #3
On Tue, Mar 17, 2020 at 11:06:12PM +0000, George Spelvin wrote:
> The most serious is that if two threads simultaneously observe
> rand_bits == 1, but do their decrements separately, you end up
> with rand_bits = 255 and you generate 255 consecutive 0 bits
> before refilling the buffer.
> 
> Since we're only generating random bits, a screwed-up answer occasionally 
> doesn't really matter (like the comment says "lack of locking is 
> deliberate"), but 255 screwed up bits is a bit much.

Okay, I'm on board! :) Thanks for spelling this race out; I hadn't seen
quite how nasty it could get. (Perhaps mention in the commit log for v2?)

> I avoided changing the underlying locking model because I didn't
> feel up to such an invasive change; I managed to fix the problems
> I saw without going there.  And shrink the code; that seemed like
> enough of a win to justify it to me.

Fair enough.

> The compiler is allowed to (in John Woods' memorable explanation)
> produce code that makes demons fly out of your nose.  (More plausibly,
> it may simply crash.)

So one thing that I see here that is still in the nasal demon realm is
that the left-shift of a signed value, which is technically undefined
behavior in C. (See the comment on check_shl_overflow().)

Doing a sign-bit test is very cheap in the resulting machine code,
but would sticking to unsigned and reversing the shift direction for
a bottom-bit test be too bad?

i.e.:

	static unsigned long rand_bits;
	unsigned long r = READ_ONCE(rand_bits), rshift = r >> 1;

	if (unlikely(rshift == 0)) {
		r = get_random_long();
		rshift = (r >> 1) | (0x1UL << (BITS_PER_LONG - 1));
	}
	WRITE_ONCE(rand_bits, rshift);

	if (r & 1)
		add_to...
	else
		add_to...tail

Patch

diff --git a/mm/shuffle.c b/mm/shuffle.c
index b3fe97fd6654..0e4bf6a8da52 100644
--- a/mm/shuffle.c
+++ b/mm/shuffle.c
@@ -186,22 +186,25 @@  void __meminit __shuffle_free_memory(pg_data_t *pgdat)
 void add_to_free_area_random(struct page *page, struct free_area *area,
 		int migratetype)
 {
-	static u64 rand;
-	static u8 rand_bits;
+	static long rand;	/* 0..BITS_PER_LONG-1 buffered random bits */
+	long r = READ_ONCE(rand), rshift = r << 1;
 
 	/*
-	 * The lack of locking is deliberate. If 2 threads race to
+	 * rand holds some random msbits, with the end marked by a 1 bit.
+	 * This allows us to maintain the pre-generated bits and the
+	 * count of bits in a single, atomically updatable, variable.
+	 *
+	 * The lack of locking is deliberate. If two threads race to
 	 * update the rand state it just adds to the entropy.
 	 */
-	if (rand_bits == 0) {
-		rand_bits = 64;
-		rand = get_random_u64();
+	if (unlikely(rshift == 0)) {
+		r = get_random_long();
+		rshift = r << 1 | 1;
 	}
+	WRITE_ONCE(rand, rshift);
 
-	if (rand & 1)
+	if (r < 0)
 		add_to_free_area(page, area, migratetype);
 	else
 		add_to_free_area_tail(page, area, migratetype);
-	rand_bits--;
-	rand >>= 1;
 }