
[v2] make buffer_locked provide an acquire semantics

Message ID alpine.LRH.2.02.2207311104020.16444@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State New, archived
Series [v2] make buffer_locked provide an acquire semantics

Commit Message

Mikulas Patocka July 31, 2022, 3:08 p.m. UTC
> The only problem is that test_bit doesn't provide any memory barriers. 
> Should we add the barrier to buffer_locked() instead of wait_on_buffer()? 
> Perhaps it would fix more bugs - in reiserfs, there's this piece of code:

Here I'm sending the second version of the patch, which changes 
buffer_locked to provide acquire semantics.

Mikulas




From: Mikulas Patocka <mpatocka@redhat.com>

Let's have a look at this piece of code in __bread_slow:
	get_bh(bh);
	bh->b_end_io = end_buffer_read_sync;
	submit_bh(REQ_OP_READ, 0, bh);
	wait_on_buffer(bh);
	if (buffer_uptodate(bh))
		return bh;
Neither wait_on_buffer nor buffer_uptodate contains any memory barrier.
Consequently, if someone calls sb_bread and then reads the buffer data,
the read of the buffer data may be executed before wait_on_buffer(bh)
returns on architectures with weak memory ordering, and it may return
invalid data.
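
To make this concrete, here is an illustrative sketch of the two sides of
the race (the names 'struct buf', end_io_side and read_side are made up
for the example; they only stand in for the buffer_head machinery):

	/* Illustration only - not kernel code. */
	struct buf {
		unsigned long state;	/* bit 0 plays the role of BH_Lock */
		int data;
	};

	static void end_io_side(struct buf *b)		/* I/O completion */
	{
		b->data = 42;				/* fill the buffer */
		clear_bit_unlock(0, &b->state);		/* release-"unlock" */
	}

	static int read_side(struct buf *b)		/* __bread_slow-like */
	{
		while (test_bit(0, &b->state))		/* plain test_bit(): no barrier */
			cpu_relax();
		/*
		 * On a weakly ordered CPU the load below may be satisfied
		 * before the lock bit was observed clear, so without acquire
		 * semantics on the bit test (or an smp_rmb() here) this can
		 * return stale data.
		 */
		return b->data;
	}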

Also, there is this pattern present several times:
	wait_on_buffer(bh);
	if (!buffer_uptodate(bh))
		err = -EIO;
It may be possible that buffer_uptodate is executed before wait_on_buffer
and that it returns a spurious error.

Fix these bugs by changing the function buffer_locked to have acquire
semantics - so that code that follows buffer_locked cannot be moved before
it. We must also add a read barrier after wait_on_bit_io because
wait_on_bit_io doesn't provide any barrier. (Perhaps this smp_rmb()
should be moved into wait_on_bit_io?)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

Comments

Linus Torvalds July 31, 2022, 4:51 p.m. UTC | #1
[ Will and Paul, question for you below ]

On Sun, Jul 31, 2022 at 8:08 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> Also, there is this pattern present several times:
>         wait_on_buffer(bh);
>         if (!buffer_uptodate(bh))
>                 err = -EIO;
> It may be possible that buffer_uptodate is executed before wait_on_buffer
> and it may return spurious error.

I'm not convinced that's actually valid.

They are testing the same memory location, and I don't think our
memory ordering model allows for _that_ to be out-of-order. Memory
barriers are for accesses to different memory locations.

Even alpha is specified to be locally ordered wrt *one* memory
location, including for reads (See table 5-1: "Processor issue order",
and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
new value, a subsequent read is not allowed to see an older one - even
without a memory barrier.

Will, Paul? Maybe that's only for overlapping loads/stores, not for
loads/loads. Because maybe alpha for once isn't the weakest possible
ordering.

I didn't find this actually in our documentation, so maybe other
architectures allow it? Our docs talk about "overlapping loads and
stores", and maybe that was meant to imply that "load overlaps with
load" case, but it really reads like load-vs-store, not load-vs-load.

But the patch looks fine, though I agree that the ordering in
__wait_on_buffer should probably be moved into
wait_on_bit/wait_on_bit_io.

And would we perhaps want the bitops to have the different ordering
versions? Like "set_bit_release()" and "test_bit_acquire()"? That
would seem to be (a) cleaner and (b) possibly generate better code for
architectures where that makes a difference?

               Linus
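
A rough sketch of what a generic test_bit_acquire() along those lines might
look like, assuming it is built on smp_load_acquire() and the existing
BIT_WORD()/BITS_PER_LONG helpers (an illustration, not a statement of what
the final API would be; architectures with a native acquire load could do
better):

	static __always_inline bool test_bit_acquire(unsigned long nr,
						     const volatile unsigned long *addr)
	{
		unsigned long *p = (unsigned long *)addr + BIT_WORD(nr);

		/* Acquire: later accesses cannot be reordered before this load. */
		return 1UL & (smp_load_acquire(p) >> (nr & (BITS_PER_LONG - 1)));
	}

buffer_locked() could then simply test BH_Lock with such a helper instead
of open-coding an smp_load_acquire() of ->b_state.
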
Paul E. McKenney July 31, 2022, 5:30 p.m. UTC | #2
On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> [ Will and Paul, question for you below ]
> 
> On Sun, Jul 31, 2022 at 8:08 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > Also, there is this pattern present several times:
> >         wait_on_buffer(bh);
> >         if (!buffer_uptodate(bh))
> >                 err = -EIO;
> > It may be possible that buffer_uptodate is executed before wait_on_buffer
> > and it may return spurious error.
> 
> I'm not convinced that's actually valid.
> 
> They are testing the same memory location, and I don't think our
> memory ordering model allows for _that_ to be out-of-order. Memory
> barriers are for accesses to different memory locations.

Yes, aligned same-sized marked accesses to a given location are always
executed so as to be consistent with some global order.  Please note the
"marked accesses".  The compiler can rearrange unmarked reads and in
some cases can discard unmarked writes.  For more information, please
see tools/memory-model/Documentation/recipes.txt's "Single CPU or single
memory location" section.

> Even alpha is specified to be locally ordered wrt *one* memory
> location, including for reads (See table 5-1: "Processor issue order",
> and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> new value, a subsequent read is not allowed to see an older one - even
> without a memory barrier.
> 
> Will, Paul? Maybe that's only for overlapping loads/stores, not for
> loads/loads. Because maybe alpha for once isn't the weakest possible
> ordering.

The "bad boy" in this case is Itanium, which can do some VLIW reordering
of accesses.  Or could, I am not sure that newer Itanium hardware
does this.  But this is why Itanium compilers made volatile loads use
the ld,acq instruction.

Which means that aligned same-sized marked accesses to a single location
really do execute consistently with some global ordering, even on Itanium.

That said, I confess that I am having a hard time finding the
buffer_locked() definition.  So if buffer_locked() uses normal C-language
accesses to sample the BH_Lock bit of the ->b_state field, then yes,
there could be a problem.  The compiler would then be free to reorder
calls to buffer_locked() because it could then assume that no other
thread was touching that ->b_state field.

> I didn't find this actually in our documentation, so maybe other
> architectures allow it? Our docs talk about "overlapping loads and
> stores", and maybe that was meant to imply that "load overlaps with
> load" case, but it really reads like load-vs-store, not load-vs-load.

I placed the relevant text from recipes.txt at the end of this email.

> But the patch looks fine, though I agree that the ordering in
> __wait_on_buffer should probably be moved into
> wait_on_bit/wait_on_bit_io.
> 
> And would we perhaps want the bitops to have the different ordering
> versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> would seem to be (a) cleaner and (b) possibly generate better code for
> architectures where that makes a difference?

As always, I defer to the architecture maintainers on this one.

							Thanx, Paul

------------------------------------------------------------------------

Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable
on the other, the code will execute in order.  There are (as
usual) some things to be careful of:

1.	Some aspects of the C language are unordered.  For example,
	in the expression "f(x) + g(y)", the order in which f and g are
	called is not defined; the object code is allowed to use either
	order or even to interleave the computations.

2.	Compilers are permitted to use the "as-if" rule.  That is, a
	compiler can emit whatever code it likes for normal accesses,
	as long as the results of a single-threaded execution appear
	just as if the compiler had followed all the relevant rules.
	To see this, compile with a high level of optimization and run
	the debugger on the resulting binary.

3.	If there is only one variable but multiple CPUs, that variable
	must be properly aligned and all accesses to that variable must
	be full sized.	Variables that straddle cachelines or pages void
	your full-ordering warranty, as do undersized accesses that load
	from or store to only part of the variable.

4.	If there are multiple CPUs, accesses to shared variables should
	use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
	tearing, load/store fusing, and invented loads and stores.
	There are exceptions to this rule, including:

	i.	When there is no possibility of a given shared variable
		being updated by some other CPU, for example, while
		holding the update-side lock, reads from that variable
		need not use READ_ONCE().

	ii.	When there is no possibility of a given shared variable
		being either read or updated by other CPUs, for example,
		when running during early boot, reads from that variable
		need not use READ_ONCE() and writes to that variable
		need not use WRITE_ONCE().
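
As a small illustration of the load-fusing hazard that item 4 warns about
(a sketch, not part of recipes.txt; 'stop' is a hypothetical flag shared
between CPUs):

	extern int stop;

	void worker_plain(void)
	{
		/*
		 * With a plain read the compiler may load 'stop' once,
		 * keep it in a register and spin forever ("load fusing").
		 */
		while (!stop)
			cpu_relax();
	}

	void worker_marked(void)
	{
		/* READ_ONCE() forces a fresh load on every iteration. */
		while (!READ_ONCE(stop))
			cpu_relax();
	}
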
Mikulas Patocka July 31, 2022, 8:39 p.m. UTC | #3
On Sun, 31 Jul 2022, Linus Torvalds wrote:

> [ Will and Paul, question for you below ]
> 
> On Sun, Jul 31, 2022 at 8:08 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > Also, there is this pattern present several times:
> >         wait_on_buffer(bh);
> >         if (!buffer_uptodate(bh))
> >                 err = -EIO;
> > It may be possible that buffer_uptodate is executed before wait_on_buffer
> > and it may return spurious error.
> 
> I'm not convinced that's actually valid.
> 
> They are testing the same memory location, and I don't think our
> memory ordering model allows for _that_ to be out-of-order. Memory
> barriers are for accesses to different memory locations.

You are right. And the bit tests are volatile, so the compiler can't 
reorder them.

But the compiler can reorder non-volatile accesses around volatile 
accesses (gcc does this, clang afaik doesn't), so the bit tests need at 
least a compiler barrier after them.
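
A minimal sketch of that gcc reordering ('flag' and 'payload' are
illustrative variables, not buffer_head fields):

	extern unsigned long flag;	/* bit 0 set by another CPU */
	extern int payload;		/* data published before the bit */

	int consume(void)
	{
		if (!test_bit(0, &flag))	/* volatile access */
			return -EAGAIN;
		/*
		 * Without at least barrier() here, gcc may hoist the plain
		 * load of 'payload' above the volatile test_bit() and hand
		 * back stale data; an acquire on the bit test also works.
		 */
		barrier();
		return payload;
	}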

> But the patch looks fine, though I agree that the ordering in
> __wait_on_buffer should probably be moved into
> wait_on_bit/wait_on_bit_io.

Yes, there are more bugs where the code does wait_on_bit and then reads 
some data without any barrier. Adding the barrier to wait_on_bit fixes 
that.

I'll send two patches, one for wait_on_bit and the other for 
buffer_locked.

Do you think that wait_event also needs a read memory barrier? It is 
defined as:
#define wait_event(wq_head, condition)                                          \
do {                                                                            \
        might_sleep();                                                          \
        if (condition)                                                          \
                break;                                                          \
        __wait_event(wq_head, condition);                                       \
} while (0)

Mikulas

> And would we perhaps want the bitops to have the different ordering
> versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> would seem to be (a) cleaner and (b) possibly generate better code for
> architectures where that makes a difference?
> 
>                Linus
>
Linus Torvalds July 31, 2022, 8:46 p.m. UTC | #4
On Sun, Jul 31, 2022 at 1:39 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> Do you think that wait_event also needs a read memory barrier?

Not really, no.

__wait_event() uses prepare_to_wait(), and it uses set_current_state()
very much so that the process state setting is serialized with the
test afterwards.

And the only race wait_event should worry about is exactly the "go to
sleep" vs "make the event true and then wake up" race, so that it
doesn't wait forever.

There is no guarantee of memory ordering wrt anything else, and I
don't think there is any need for such a guarantee.

If somebody wants that guarantee, they should probably make the
condition for wait_event() to be a "load_acquire()" or similar. Those
cases do exist.

                       Linus
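
For completeness, a hedged sketch of the kind of caller described above,
where the condition itself carries the acquire ('struct work_result' and
its fields are made-up names for the example):

	struct work_result {
		wait_queue_head_t wq;	/* assume init_waitqueue_head() was called */
		int ready;
		int value;
	};

	/* Producer: publish the payload, then the flag, then wake waiters. */
	static void publish(struct work_result *r, int v)
	{
		r->value = v;
		smp_store_release(&r->ready, 1);
		wake_up(&r->wq);
	}

	/* Consumer: the acquire in the condition orders the read of ->value. */
	static int consume(struct work_result *r)
	{
		wait_event(r->wq, smp_load_acquire(&r->ready));
		return r->value;
	}
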
Matthew Wilcox July 31, 2022, 10:48 p.m. UTC | #5
On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> That said, I confess that I am having a hard time finding the
> buffer_locked() definition.  So if buffer_locked() uses normal C-language
> accesses to sample the BH_Lock bit of the ->b_state field, then yes,
> there could be a problem.  The compiler would then be free to reorder
> calls to buffer_locked() because it could then assume that no other
> thread was touching that ->b_state field.

You're having a hard time finding it because it's constructed with the C
preprocessor.  I really wish we generated header files using CPP once
and then included the generated/ file.  That would make them greppable.

You're looking for include/linux/buffer_head.h and it's done like this:

enum bh_state_bits {
...
        BH_Lock,        /* Is locked */
...

#define BUFFER_FNS(bit, name)                                           \
...
static __always_inline int buffer_##name(const struct buffer_head *bh)  \
{                                                                       \
        return test_bit(BH_##bit, &(bh)->b_state);                      \
}

BUFFER_FNS(Lock, locked)

(fwiw, the page/folio versions of these functions don't autogenerate
the lock or uptodate ones because they need extra functions called)
Paul E. McKenney Aug. 1, 2022, 3:20 a.m. UTC | #6
On Sun, Jul 31, 2022 at 11:48:32PM +0100, Matthew Wilcox wrote:
> On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > That said, I confess that I am having a hard time finding the
> > buffer_locked() definition.  So if buffer_locked() uses normal C-language
> > accesses to sample the BH_Lock bit of the ->b_state field, then yes,
> > there could be a problem.  The compiler would then be free to reorder
> > calls to buffer_locked() because it could then assume that no other
> > thread was touching that ->b_state field.
> 
> You're having a hard time finding it because it's constructed with the C
> preprocessor.  I really wish we generated header files using CPP once
> and then included the generated/ file.  That would make them greppable.
> 
> You're looking for include/linux/buffer_head.h and it's done like this:
> 
> enum bh_state_bits {
> ...
>         BH_Lock,        /* Is locked */
> ...
> 
> #define BUFFER_FNS(bit, name)                                           \
> ...
> static __always_inline int buffer_##name(const struct buffer_head *bh)  \
> {                                                                       \
>         return test_bit(BH_##bit, &(bh)->b_state);                      \
> }
> 
> BUFFER_FNS(Lock, locked)
> 
> (fwiw, the page/folio versions of these functions don't autogenerate
> the lock or uptodate ones because they need extra functions called)

Thank you!

Another thing that would have helped me find it would have been to leave
the "BH_" prefix on the bit name in the BUFFER_FNS() invocation, as in
ditch the "BH_##" in the definition and then just say "BUFFER_FNS(BH_Lock,
locked)".

But what is life without a challenge?  ;-/

Anyway, on x86 this will use the constant_test_bit() function, which
uses a volatile declaration for its parameter.  So it should avoid
vulnerability to the vicissitudes of the compiler.

So I feel much better now.

							Thanx, Paul
Will Deacon Aug. 1, 2022, 3:41 p.m. UTC | #7
Hi Linus, Paul,

Apologies for the slow response here; believe it or not, I was attending
a workshop about memory ordering.

On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > Even alpha is specified to be locally ordered wrt *one* memory
> > location, including for reads (See table 5-1: "Processor issue order",
> > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > new value, a subsequent read is not allowed to see an older one - even
> > without a memory barrier.
> > 
> > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > loads/loads. Because maybe alpha for once isn't the weakest possible
> > ordering.
> 
> The "bad boy" in this case is Itanium, which can do some VLIW reordering
> of accesses.  Or could, I am not sure that newer Itanium hardware
> does this.  But this is why Itanium compilers made volatile loads use
> the ld,acq instruction.
> 
> Which means that aligned same-sized marked accesses to a single location
> really do execute consistently with some global ordering, even on Itanium.

Although this is true, there's a really subtle issue which crops up if you
try to compose this read-after-read ordering with dependencies in the case
where the two reads read the same value (which is encapsulated by the
unusual RSW litmus test that I've tried to convert to C below):


/* Global definitions; assume everything zero-initialised */
struct foo {
	int *x;
};

int x;
struct foo foo;
struct foo *ptr;


/* CPU 0 */
WRITE_ONCE(x, 1);
WRITE_ONCE(foo.x, &x);

/*
 * Release ordering to ensure that somebody following a non-NULL ptr will
 * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
 */
smp_store_release(&ptr, &foo);


/* CPU 1 */
int *xp1, *xp2, val;
struct foo *foop;

/* Load the global pointer and check that it's not NULL. */
foop = READ_ONCE(ptr);
if (!foop)
	return;

/*
 * Load 'foo.x' via the pointer we just loaded. This is ordered after the
 * previous READ_ONCE() because of the address dependency.
 */
xp1 = READ_ONCE(foop->x);

/*
 * Load 'foo.x' directly via the global 'foo'.
 * _This is loading the same address as the previous READ_ONCE() and
 *  therefore cannot return a stale (NULL) value!_
 */
xp2 = READ_ONCE(foo.x);

/*
 * Load 'x' via the pointer we just loaded.
 * _We may see zero here!_
 */
val = READ_ONCE(*xp2);


So in this case, the two same-location reads on CPU1 are actually executed
out of order, but you can't tell just by looking at them in isolation
because they returned the same value (i.e. xp1 == xp2 == &x). However, you
*can* tell when you compose them in funny situations such as above (and I
believe that this is demonstrably true on PPC and Arm; effectively the
read-after-read ordering machinery only triggers in response to incoming
snoops).

There's probably a more-compelling variant using an (RCpc) load acquire
instead of the last address dependency on CPU 1 as well.

Anyway, I think all I'm trying to say is that I'd tend to shy away from
relying on read-after-read ordering if it forms part of a more involved
ordering relationship.

> > But the patch looks fine, though I agree that the ordering in
> > __wait_on_buffer should probably be moved into
> > wait_on_bit/wait_on_bit_io.
> > 
> > And would we perhaps want the bitops to have the different ordering
> > versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> > would seem to be (a) cleaner and (b) possibly generate better code for
> > architectures where that makes a difference?
> 
> As always, I defer to the architecture maintainers on this one.

FWIW, that makes sense to me. We already have release/acquire/relaxed
variants of the bitops in the atomic_t API so it seems natural to have
a counterpart in the actual bitops API as well.

Will
Paul E. McKenney Aug. 1, 2022, 7:20 p.m. UTC | #8
On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote:
> Hi Linus, Paul,
> 
> Apologies for the slow response here; believe it or not, I was attending
> a workshop about memory ordering.

Nice!!!  Anything that I can/should know from that gathering?  ;-)

> On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > > Even alpha is specified to be locally ordered wrt *one* memory
> > > location, including for reads (See table 5-1: "Processor issue order",
> > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > > new value, a subsequent read is not allowed to see an older one - even
> > > without a memory barrier.
> > > 
> > > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > > loads/loads. Because maybe alpha for once isn't the weakest possible
> > > ordering.
> > 
> > The "bad boy" in this case is Itanium, which can do some VLIW reordering
> > of accesses.  Or could, I am not sure that newer Itanium hardware
> > does this.  But this is why Itanium compilers made volatile loads use
> > the ld,acq instruction.
> > 
> > Which means that aligned same-sized marked accesses to a single location
> > really do execute consistently with some global ordering, even on Itanium.
> 
> Although this is true, there's a really subtle issue which crops up if you
> try to compose this read-after-read ordering with dependencies in the case
> where the two reads read the same value (which is encapsulated by the
> unusual RSW litmus test that I've tried to convert to C below):

RSW from the infamous test6.pdf, correct?

> /* Global definitions; assume everything zero-initialised */
> struct foo {
> 	int *x;
> };
> 
> int x;
> struct foo foo;
> struct foo *ptr;
> 
> 
> /* CPU 0 */
> WRITE_ONCE(x, 1);

Your x is RSW's z?

> WRITE_ONCE(foo.x, &x);

And your foo.x is RSW's x?  If so, the above WRITE_ONCE() could happen at
compile time, correct?  Or in the initialization clause of a litmus test?

> /*
>  * Release ordering to ensure that somebody following a non-NULL ptr will
>  * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
>  */
> smp_store_release(&ptr, &foo);

Your ptr is RSW's y, correct?

> /* CPU 1 */
> int *xp1, *xp2, val;
> struct foo *foop;
> 
> /* Load the global pointer and check that it's not NULL. */
> foop = READ_ONCE(ptr);
> if (!foop)
> 	return;

A litmus tests can do this via the filter clause.

> /*
>  * Load 'foo.x' via the pointer we just loaded. This is ordered after the
>  * previous READ_ONCE() because of the address dependency.
>  */
> xp1 = READ_ONCE(foop->x);
> 
> /*
>  * Load 'foo.x' directly via the global 'foo'.
>  * _This is loading the same address as the previous READ_ONCE() and
>  *  therefore cannot return a stale (NULL) value!_
>  */
> xp2 = READ_ONCE(foo.x);

OK, same location, but RSW calls only for po, not addr from the initial
read to this read, got it.  (My first attempt left out this nuance,
in case you were wondering.)

> /*
>  * Load 'x' via the pointer we just loaded.
>  * _We may see zero here!_
>  */
> val = READ_ONCE(*xp2);

And herd7/LKMM agree with this, at least assuming I got the litmus
test right.  (I used RSW's variables as a cross-check.)

> So in this case, the two same-location reads on CPU1 are actually executed
> out of order, but you can't tell just by looking at them in isolation
> because they returned the same value (i.e. xp1 == xp2 == &x). However, you
> *can* tell when you compose them in funny situations such as above (and I
> believe that this is demonstrably true on PPC and Arm; effectively the
> read-after-read ordering machinery only triggers in response to incoming
> snoops).

Again, I leave the hardware specifics to the respective maintainers.

> There's probably a more-compelling variant using an (RCpc) load acquire
> instead of the last address dependency on CPU 1 as well.
> 
> Anyway, I think all I'm trying to say is that I'd tend to shy away from
> relying on read-after-read ordering if it forms part of a more involved
> ordering relationship.

Agreed.  As soon as you add that second variable, you need more ordering.
The read-after-read ordering applies only within the confines of a single
variable, and you need more ordering when you add that second variable.
And as Will's RSW example shows, more ordering than you might think.

> > > But the patch looks fine, though I agree that the ordering in
> > > __wait_on_buffer should probably be moved into
> > > wait_on_bit/wait_on_bit_io.
> > > 
> > > And would we perhaps want the bitops to have the different ordering
> > > versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> > > would seem to be (a) cleaner and (b) possibly generate better code for
> > > architectures where that makes a difference?
> > 
> > As always, I defer to the architecture maintainers on this one.
> 
> FWIW, that makes sense to me. We already have release/acquire/relaxed
> variants of the bitops in the atomic_t API so it seems natural to have
> a counterpart in the actual bitops API as well.

Just to join you guys in liking acquire and release better than smp_wmb()
and smp_rmb().  My RCU life got much better after update-side uses of
smp_wmb() were replaced by rcu_assign_pointer() and read-side uses of
smp_read_barrier_depends() were replaced by rcu_dereference().

							Thanx, Paul

------------------------------------------------------------------------

C rsw

{
	a=0;
	x=z;
	y=a;
	z=0;
}

P0(int *x, int **y, int *z)
{
	WRITE_ONCE(*z, 1);
	WRITE_ONCE(*y, x);
}

P1(int *x, int **y, int *z)
{
	r1 = READ_ONCE(*y);
	r2 = READ_ONCE(*r1);
	r3 = READ_ONCE(*x);
	r4 = READ_ONCE(*r3);
}

filter(1:r1=x)
exists(1:r2=z /\ 1:r3=z /\ 1:r4=0)

------------------------------------------------------------------------

$ herd7 -conf linux-kernel.cfg /tmp/rsw.litmus
Test rsw Allowed
States 2
1:r2=z; 1:r3=z; 1:r4=0;
1:r2=z; 1:r3=z; 1:r4=1;
Ok
Witnesses
Positive: 1 Negative: 1
Condition exists (1:r2=z /\ 1:r3=z /\ 1:r4=0)
Observation rsw Sometimes 1 1
Time rsw 0.00
Hash=03618e3079eb0d371ca0f1ab679f3eae
Will Deacon Aug. 2, 2022, 8:54 a.m. UTC | #9
On Mon, Aug 01, 2022 at 12:20:35PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote:
> > Apologies for the slow response here; believe it or not, I was attending
> > a workshop about memory ordering.
> 
> Nice!!!  Anything that I can/should know from that gathering?  ;-)

Oh come off it, you know this stuff already ;)

> > On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > > > Even alpha is specified to be locally ordered wrt *one* memory
> > > > location, including for reads (See table 5-1: "Processor issue order",
> > > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > > > new value, a subsequent read is not allowed to see an older one - even
> > > > without a memory barrier.
> > > > 
> > > > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > > > loads/loads. Because maybe alpha for once isn't the weakest possible
> > > > ordering.
> > > 
> > > The "bad boy" in this case is Itanium, which can do some VLIW reordering
> > > of accesses.  Or could, I am not sure that newer Itanium hardware
> > > does this.  But this is why Itanium compilers made volatile loads use
> > > the ld,acq instruction.
> > > 
> > > Which means that aligned same-sized marked accesses to a single location
> > > really do execute consistently with some global ordering, even on Itanium.
> > 
> > Although this is true, there's a really subtle issue which crops up if you
> > try to compose this read-after-read ordering with dependencies in the case
> > where the two reads read the same value (which is encapsulated by the
> > unusual RSW litmus test that I've tried to convert to C below):
> 
> RSW from the infamous test6.pdf, correct?

That's the badger. I've no doubt that you're aware of it already, but I
thought it was a useful exercise to transcribe it to C and have it on the
mailing list for folks to look at.

> > /* Global definitions; assume everything zero-initialised */
> > struct foo {
> > 	int *x;
> > };
> > 
> > int x;
> > struct foo foo;
> > struct foo *ptr;
> > 
> > 
> > /* CPU 0 */
> > WRITE_ONCE(x, 1);
> 
> Your x is RSW's z?

Yes.

> > WRITE_ONCE(foo.x, &x);
> 
> And your foo.x is RSW's x?  If so, the above WRITE_ONCE() could happen at
> compile time, correct?  Or in the initialization clause of a litmus test?

Yes, although I think it's a tiny bit more like real code to have it done
here, although it means that the "surprising" outcome relies on this being
reordered before the store to x.

> > /*
> >  * Release ordering to ensure that somebody following a non-NULL ptr will
> >  * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
> >  */
> > smp_store_release(&ptr, &foo);
> 
> Your ptr is RSW's y, correct?

Yes.

> > /* CPU 1 */
> > int *xp1, *xp2, val;
> > struct foo *foop;
> > 
> > /* Load the global pointer and check that it's not NULL. */
> > foop = READ_ONCE(ptr);
> > if (!foop)
> > 	return;
> 
> A litmus tests can do this via the filter clause.

Indeed, but I was trying to make this look like C code for non-litmus
speakers!

> > /*
> >  * Load 'foo.x' via the pointer we just loaded. This is ordered after the
> >  * previous READ_ONCE() because of the address dependency.
> >  */
> > xp1 = READ_ONCE(foop->x);
> > 
> > /*
> >  * Load 'foo.x' directly via the global 'foo'.
> >  * _This is loading the same address as the previous READ_ONCE() and
> >  *  therefore cannot return a stale (NULL) value!_
> >  */
> > xp2 = READ_ONCE(foo.x);
> 
> OK, same location, but RSW calls only for po, not addr from the initial
> read to this read, got it.  (My first attempt left out this nuance,
> in case you were wondering.)

Right, there is only po from the initial read to this read. If there was an
address dependency, then we'd have a chain of address dependencies from the
first read to the last read on this CPU and the result (of x == 0) would be
forbidden.

> > /*
> >  * Load 'x' via the pointer we just loaded.
> >  * _We may see zero here!_
> >  */
> > val = READ_ONCE(*xp2);
> 
> And herd7/LKMM agree with this, at least assuming I got the litmus
> test right.  (I used RSW's variables as a cross-check.)

That's promising, but see below...

> C rsw
> 
> {
> 	a=0;
> 	x=z;
> 	y=a;
> 	z=0;
> }
> 
> P0(int *x, int **y, int *z)
> {
> 	WRITE_ONCE(*z, 1);
> 	WRITE_ONCE(*y, x);
> }

Ah wait, you need a barrier between these two writes, don't you? I used
an smp_store_release() but smp_[w]mb() should do too.

Will
Paul E. McKenney Aug. 2, 2022, 1:49 p.m. UTC | #10
On Tue, Aug 02, 2022 at 09:54:55AM +0100, Will Deacon wrote:
> On Mon, Aug 01, 2022 at 12:20:35PM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote:
> > > Apologies for the slow response here; believe it or not, I was attending
> > > a workshop about memory ordering.
> > 
> > Nice!!!  Anything that I can/should know from that gathering?  ;-)
> 
> Oh come off it, you know this stuff already ;)

Thank you for the kind words, but the most devastating learning disability
of all is thinking that you already know everything about the topic
in question.  ;-)

> > > On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > > > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > > > > Even alpha is specified to be locally ordered wrt *one* memory
> > > > > location, including for reads (See table 5-1: "Processor issue order",
> > > > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > > > > new value, a subsequent read is not allowed to see an older one - even
> > > > > without a memory barrier.
> > > > > 
> > > > > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > > > > loads/loads. Because maybe alpha for once isn't the weakest possible
> > > > > ordering.
> > > > 
> > > > The "bad boy" in this case is Itanium, which can do some VLIW reordering
> > > > of accesses.  Or could, I am not sure that newer Itanium hardware
> > > > does this.  But this is why Itanium compilers made volatile loads use
> > > > the ld,acq instruction.
> > > > 
> > > > Which means that aligned same-sized marked accesses to a single location
> > > > really do execute consistently with some global ordering, even on Itanium.
> > > 
> > > Although this is true, there's a really subtle issue which crops up if you
> > > try to compose this read-after-read ordering with dependencies in the case
> > > where the two reads read the same value (which is encapsulated by the
> > > unusual RSW litmus test that I've tried to convert to C below):
> > 
> > RSW from the infamous test6.pdf, correct?
> 
> That's the badger. I've no doubt that you're aware of it already, but I
> thought it was a useful exercise to transcribe it to C and have it on the
> mailing list for folks to look at.

I have seen it, but this was nevertheless a useful reminder.

> > > /* Global definitions; assume everything zero-initialised */
> > > struct foo {
> > > 	int *x;
> > > };
> > > 
> > > int x;
> > > struct foo foo;
> > > struct foo *ptr;
> > > 
> > > 
> > > /* CPU 0 */
> > > WRITE_ONCE(x, 1);
> > 
> > Your x is RSW's z?
> 
> Yes.
> 
> > > WRITE_ONCE(foo.x, &x);
> > 
> > And your foo.x is RSW's x?  If so, the above WRITE_ONCE() could happen at
> > compile time, correct?  Or in the initialization clause of a litmus test?
> 
> Yes, although I think it's a tiny bit more like real code to have it done
> here, although it means that the "surprising" outcome relies on this being
> reordered before the store to x.
> 
> > > /*
> > >  * Release ordering to ensure that somebody following a non-NULL ptr will
> > >  * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
> > >  */
> > > smp_store_release(&ptr, &foo);
> > 
> > Your ptr is RSW's y, correct?
> 
> Yes.
> 
> > > /* CPU 1 */
> > > int *xp1, *xp2, val;
> > > struct foo *foop;
> > > 
> > > /* Load the global pointer and check that it's not NULL. */
> > > foop = READ_ONCE(ptr);
> > > if (!foop)
> > > 	return;
> > 
> > A litmus tests can do this via the filter clause.
> 
> Indeed, but I was trying to make this look like C code for non-litmus
> speakers!
> 
> > > /*
> > >  * Load 'foo.x' via the pointer we just loaded. This is ordered after the
> > >  * previous READ_ONCE() because of the address dependency.
> > >  */
> > > xp1 = READ_ONCE(foop->x);
> > > 
> > > /*
> > >  * Load 'foo.x' directly via the global 'foo'.
> > >  * _This is loading the same address as the previous READ_ONCE() and
> > >  *  therefore cannot return a stale (NULL) value!_
> > >  */
> > > xp2 = READ_ONCE(foo.x);
> > 
> > OK, same location, but RSW calls only for po, not addr from the initial
> > read to this read, got it.  (My first attempt left out this nuance,
> > in case you were wondering.)
> 
> Right, there is only po from the initial read to this read. If there was an
> address dependency, then we'd have a chain of address dependencies from the
> first read to the last read on this CPU and the result (of x == 0) would be
> forbidden.
> 
> > > /*
> > >  * Load 'x' via the pointer we just loaded.
> > >  * _We may see zero here!_
> > >  */
> > > val = READ_ONCE(*xp2);
> > 
> > And herd7/LKMM agree with this, at least assuming I got the litmus
> > test right.  (I used RSW's variables as a cross-check.)
> 
> That's promising, but see below...
> 
> > C rsw
> > 
> > {
> > 	a=0;
> > 	x=z;
> > 	y=a;
> > 	z=0;
> > }
> > 
> > P0(int *x, int **y, int *z)
> > {
> > 	WRITE_ONCE(*z, 1);
> > 	WRITE_ONCE(*y, x);
> > }
> 
> Ah wait, you need a barrier between these two writes, don't you? I used
> an smp_store_release() but smp_[w]mb() should do too.

You are quite right, thank you!  Here is the fixed version and output,
which LKMM still says is allowed.

							Thanx, Paul

------------------------------------------------------------------------

C rsw

{
	a=0;
	x=z;
	y=a;
	z=0;
}

P0(int *x, int **y, int *z)
{
	WRITE_ONCE(*z, 1);
	smp_store_release(y, x);
}

P1(int *x, int **y, int *z)
{
	r1 = READ_ONCE(*y);
	r2 = READ_ONCE(*r1);
	r3 = READ_ONCE(*x);
	r4 = READ_ONCE(*r3);
}

filter(1:r1=x)
exists(1:r2=z /\ 1:r3=z /\ 1:r4=0)

------------------------------------------------------------------------

$ herd7 -conf linux-kernel.cfg /tmp/rsw.litmus
Test rsw Allowed
States 2
1:r2=z; 1:r3=z; 1:r4=0;
1:r2=z; 1:r3=z; 1:r4=1;
Ok
Witnesses
Positive: 1 Negative: 1
Condition exists (1:r2=z /\ 1:r3=z /\ 1:r4=0)
Observation rsw Sometimes 1 1
Time rsw 0.01
Hash=588486c0f4d521fa3ce559a19ed118d5
Paul E. McKenney Aug. 2, 2022, 3:29 p.m. UTC | #11
On Tue, Aug 02, 2022 at 06:49:21AM -0700, Paul E. McKenney wrote:
> On Tue, Aug 02, 2022 at 09:54:55AM +0100, Will Deacon wrote:
> > On Mon, Aug 01, 2022 at 12:20:35PM -0700, Paul E. McKenney wrote:
> > > On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote:
> > > > Apologies for the slow response here; believe it or not, I was attending
> > > > a workshop about memory ordering.
> > > 
> > > Nice!!!  Anything that I can/should know from that gathering?  ;-)
> > 
> > Oh come off it, you know this stuff already ;)
> 
> Thank you for the kind words, but the most devastating learning disability
> of all is thinking that you already know everything about the topic
> in question.  ;-)
> 
> > > > On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > > > > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > > > > > Even alpha is specified to be locally ordered wrt *one* memory
> > > > > > location, including for reads (See table 5-1: "Processor issue order",
> > > > > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > > > > > new value, a subsequent read is not allowed to see an older one - even
> > > > > > without a memory barrier.
> > > > > > 
> > > > > > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > > > > > loads/loads. Because maybe alpha for once isn't the weakest possible
> > > > > > ordering.
> > > > > 
> > > > > The "bad boy" in this case is Itanium, which can do some VLIW reordering
> > > > > of accesses.  Or could, I am not sure that newer Itanium hardware
> > > > > does this.  But this is why Itanium compilers made volatile loads use
> > > > > the ld,acq instruction.
> > > > > 
> > > > > Which means that aligned same-sized marked accesses to a single location
> > > > > really do execute consistently with some global ordering, even on Itanium.
> > > > 
> > > > Although this is true, there's a really subtle issue which crops up if you
> > > > try to compose this read-after-read ordering with dependencies in the case
> > > > where the two reads read the same value (which is encapsulated by the
> > > > unusual RSW litmus test that I've tried to convert to C below):
> > > 
> > > RSW from the infamous test6.pdf, correct?
> > 
> > That's the badger. I've no doubt that you're aware of it already, but I
> > thought it was a useful exercise to transcribe it to C and have it on the
> > mailing list for folks to look at.
> 
> I have seen it, but this was nevertheless a useful reminder.
> 
> > > > /* Global definitions; assume everything zero-initialised */
> > > > struct foo {
> > > > 	int *x;
> > > > };
> > > > 
> > > > int x;
> > > > struct foo foo;
> > > > struct foo *ptr;
> > > > 
> > > > 
> > > > /* CPU 0 */
> > > > WRITE_ONCE(x, 1);
> > > 
> > > Your x is RSW's z?
> > 
> > Yes.
> > 
> > > > WRITE_ONCE(foo.x, &x);
> > > 
> > > And your foo.x is RSW's x?  If so, the above WRITE_ONCE() could happen at
> > > compile time, correct?  Or in the initialization clause of a litmus test?
> > 
> > Yes, although I think it's a tiny bit more like real code to have it done
> > here, although it means that the "surprising" outcome relies on this being
> > reordered before the store to x.
> > 
> > > > /*
> > > >  * Release ordering to ensure that somebody following a non-NULL ptr will
> > > >  * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
> > > >  */
> > > > smp_store_release(&ptr, &foo);
> > > 
> > > Your ptr is RSW's y, correct?
> > 
> > Yes.
> > 
> > > > /* CPU 1 */
> > > > int *xp1, *xp2, val;
> > > > struct foo *foop;
> > > > 
> > > > /* Load the global pointer and check that it's not NULL. */
> > > > foop = READ_ONCE(ptr);
> > > > if (!foop)
> > > > 	return;
> > > 
> > > A litmus tests can do this via the filter clause.
> > 
> > Indeed, but I was trying to make this look like C code for non-litmus
> > speakers!
> > 
> > > > /*
> > > >  * Load 'foo.x' via the pointer we just loaded. This is ordered after the
> > > >  * previous READ_ONCE() because of the address dependency.
> > > >  */
> > > > xp1 = READ_ONCE(foop->x);
> > > > 
> > > > /*
> > > >  * Load 'foo.x' directly via the global 'foo'.
> > > >  * _This is loading the same address as the previous READ_ONCE() and
> > > >  *  therefore cannot return a stale (NULL) value!_
> > > >  */
> > > > xp2 = READ_ONCE(foo.x);
> > > 
> > > OK, same location, but RSW calls only for po, not addr from the initial
> > > read to this read, got it.  (My first attempt left out this nuance,
> > > in case you were wondering.)
> > 
> > Right, there is only po from the initial read to this read. If there was an
> > address dependency, then we'd have a chain of address dependencies from the
> > first read to the last read on this CPU and the result (of x == 0) would be
> > forbidden.
> > 
> > > > /*
> > > >  * Load 'x' via the pointer we just loaded.
> > > >  * _We may see zero here!_
> > > >  */
> > > > val = READ_ONCE(*xp2);
> > > 
> > > And herd7/LKMM agree with this, at least assuming I got the litmus
> > > test right.  (I used RSW's variables as a cross-check.)
> > 
> > That's promising, but see below...
> > 
> > > C rsw
> > > 
> > > {
> > > 	a=0;
> > > 	x=z;
> > > 	y=a;
> > > 	z=0;
> > > }
> > > 
> > > P0(int *x, int **y, int *z)
> > > {
> > > 	WRITE_ONCE(*z, 1);
> > > 	WRITE_ONCE(*y, x);
> > > }
> > 
> > Ah wait, you need a barrier between these two writes, don't you? I used
> > an smp_store_release() but smp[w]_mb() should do too.
> 
> You are quite right, thank you!  Here is the fixed version and output,
> which LKMM still says is allowed.
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> C rsw
> 
> {
> 	a=0;
> 	x=z;
> 	y=a;
> 	z=0;
> }
> 
> P0(int *x, int **y, int *z)
> {
> 	WRITE_ONCE(*z, 1);
> 	smp_store_release(y, x);

And like test6.pdf says, this is still allowed even if these are a pair
of WRITE_ONCE() invocations separated by smp_mb().  So I took that form.

							Thanx, Paul

> }
> 
> P1(int *x, int **y, int *z)
> {
> 	r1 = READ_ONCE(*y);
> 	r2 = READ_ONCE(*r1);
> 	r3 = READ_ONCE(*x);
> 	r4 = READ_ONCE(*r3);
> }
> 
> filter(1:r1=x)
> exists(1:r2=z /\ 1:r3=z /\ 1:r4=0)
> 
> ------------------------------------------------------------------------
> 
> $ herd7 -conf linux-kernel.cfg /tmp/rsw.litmus
> Test rsw Allowed
> States 2
> 1:r2=z; 1:r3=z; 1:r4=0;
> 1:r2=z; 1:r3=z; 1:r4=1;
> Ok
> Witnesses
> Positive: 1 Negative: 1
> Condition exists (1:r2=z /\ 1:r3=z /\ 1:r4=0)
> Observation rsw Sometimes 1 1
> Time rsw 0.01
> Hash=588486c0f4d521fa3ce559a19ed118d5

Patch

Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -120,7 +120,6 @@  static __always_inline int test_clear_bu
 BUFFER_FNS(Uptodate, uptodate)
 BUFFER_FNS(Dirty, dirty)
 TAS_BUFFER_FNS(Dirty, dirty)
-BUFFER_FNS(Lock, locked)
 BUFFER_FNS(Req, req)
 TAS_BUFFER_FNS(Req, req)
 BUFFER_FNS(Mapped, mapped)
@@ -135,6 +134,17 @@  BUFFER_FNS(Meta, meta)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
 
+static __always_inline void set_buffer_locked(struct buffer_head *bh)
+{
+	set_bit(BH_Lock, &bh->b_state);
+}
+
+static __always_inline int buffer_locked(const struct buffer_head *bh)
+{
+	unsigned long state = smp_load_acquire(&bh->b_state);
+	return test_bit(BH_Lock, &state);
+}
+
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
 
 /* If we *know* page->private refers to buffer_heads */
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -120,6 +120,7 @@  EXPORT_SYMBOL(buffer_check_dirty_writeba
 void __wait_on_buffer(struct buffer_head * bh)
 {
 	wait_on_bit_io(&bh->b_state, BH_Lock, TASK_UNINTERRUPTIBLE);
+	smp_rmb();
 }
 EXPORT_SYMBOL(__wait_on_buffer);