Message ID | alpine.LRH.2.02.2207311104020.16444@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
---|---
State | New, archived
Series | [v2] make buffer_locked provide an acquire semantics
[ Will and Paul, question for you below ]

On Sun, Jul 31, 2022 at 8:08 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> Also, there is this pattern present several times:
> 	wait_on_buffer(bh);
> 	if (!buffer_uptodate(bh))
> 		err = -EIO;
> It may be possible that buffer_uptodate is executed before wait_on_buffer
> and it may return spurious error.

I'm not convinced that's actually valid.

They are testing the same memory location, and I don't think our
memory ordering model allows for _that_ to be out-of-order. Memory
barriers are for accesses to different memory locations.

Even alpha is specified to be locally ordered wrt *one* memory
location, including for reads (See table 5-1: "Processor issue order",
and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
new value, a subsequent read is not allowed to see an older one - even
without a memory barrier.

Will, Paul? Maybe that's only for overlapping loads/stores, not for
loads/loads. Because maybe alpha for once isn't the weakest possible
ordering.

I didn't find this actually in our documentation, so maybe other
architectures allow it? Our docs talk about "overlapping loads and
stores", and maybe that was meant to imply that "load overlaps with
load" case, but it really reads like load-vs-store, not load-vs-load.

But the patch looks fine, though I agree that the ordering in
__wait_on_buffer should probably be moved into
wait_on_bit/wait_on_bit_io.

And would we perhaps want the bitops to have the different ordering
versions? Like "set_bit_release()" and "test_bit_acquire()"? That
would seem to be (a) cleaner and (b) possibly generate better code for
architectures where that makes a difference?

		Linus
On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> [ Will and Paul, question for you below ]
>
> On Sun, Jul 31, 2022 at 8:08 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > Also, there is this pattern present several times:
> > 	wait_on_buffer(bh);
> > 	if (!buffer_uptodate(bh))
> > 		err = -EIO;
> > It may be possible that buffer_uptodate is executed before wait_on_buffer
> > and it may return spurious error.
>
> I'm not convinced that's actually valid.
>
> They are testing the same memory location, and I don't think our
> memory ordering model allows for _that_ to be out-of-order. Memory
> barriers are for accesses to different memory locations.

Yes, aligned same-sized marked accesses to a given location are always
executed so as to be consistent with some global order. Please note the
"marked accesses". The compiler can rearrange unmarked reads and in some
cases can discard unmarked writes. For more information, please see
tools/memory-model/Documentation/recipes.txt's "Single CPU or single
memory location" section.

> Even alpha is specified to be locally ordered wrt *one* memory
> location, including for reads (See table 5-1: "Processor issue order",
> and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> new value, a subsequent read is not allowed to see an older one - even
> without a memory barrier.
>
> Will, Paul? Maybe that's only for overlapping loads/stores, not for
> loads/loads. Because maybe alpha for once isn't the weakest possible
> ordering.

The "bad boy" in this case is Itanium, which can do some VLIW reordering
of accesses. Or could, I am not sure that newer Itanium hardware does
this. But this is why Itanium compilers made volatile loads use the
ld,acq instruction.

Which means that aligned same-sized marked accesses to a single location
really do execute consistently with some global ordering, even on Itanium.

That said, I confess that I am having a hard time finding the
buffer_locked() definition. So if buffer_locked() uses normal C-language
accesses to sample the BH_Lock bit of the ->b_state field, then yes,
there could be a problem. The compiler would then be free to reorder
calls to buffer_locked() because it could then assume that no other
thread was touching that ->b_state field.

> I didn't find this actually in our documentation, so maybe other
> architectures allow it? Our docs talk about "overlapping loads and
> stores", and maybe that was meant to imply that "load overlaps with
> load" case, but it really reads like load-vs-store, not load-vs-load.

I placed the relevant text from recipes.txt at the end of this email.

> But the patch looks fine, though I agree that the ordering in
> __wait_on_buffer should probably be moved into
> wait_on_bit/wait_on_bit_io.
>
> And would we perhaps want the bitops to have the different ordering
> versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> would seem to be (a) cleaner and (b) possibly generate better code for
> architectures where that makes a difference?

As always, I defer to the architecture maintainers on this one.

							Thanx, Paul

------------------------------------------------------------------------

Single CPU or single memory location
------------------------------------

If there is only one CPU on the one hand or only one variable on
the other, the code will execute in order.  There are (as usual)
some things to be careful of:

1.	Some aspects of the C language are unordered.  For example,
	in the expression "f(x) + g(y)", the order in which f and g are
	called is not defined; the object code is allowed to use either
	order or even to interleave the computations.

2.	Compilers are permitted to use the "as-if" rule.  That is, a
	compiler can emit whatever code it likes for normal accesses,
	as long as the results of a single-threaded execution appear
	just as if the compiler had followed all the relevant rules.
	To see this, compile with a high level of optimization and run
	the debugger on the resulting binary.

3.	If there is only one variable but multiple CPUs, that variable
	must be properly aligned and all accesses to that variable must
	be full sized.  Variables that straddle cachelines or pages void
	your full-ordering warranty, as do undersized accesses that load
	from or store to only part of the variable.

4.	If there are multiple CPUs, accesses to shared variables should
	use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store
	tearing, load/store fusing, and invented loads and stores.
	There are exceptions to this rule, including:

	i.	When there is no possibility of a given shared variable
		being updated by some other CPU, for example, while
		holding the update-side lock, reads from that variable
		need not use READ_ONCE().

	ii.	When there is no possibility of a given shared variable
		being either read or updated by other CPUs, for example,
		when running during early boot, reads from that variable
		need not use READ_ONCE() and writes to that variable
		need not use WRITE_ONCE().
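[ For illustration, a minimal sketch of the load fusing that rule 4 above
guards against; 'flag' is a hypothetical variable shared with another CPU,
not something from the thread: ]

	static int flag;	/* set to non-zero by another CPU */

	static void wait_plain(void)
	{
		while (!flag)		/* plain load: the compiler may fuse this
					 * into a single load hoisted out of the
					 * loop, spinning forever		*/
			cpu_relax();
	}

	static void wait_marked(void)
	{
		while (!READ_ONCE(flag))	/* marked load: re-read on every
						 * iteration			*/
			cpu_relax();
	}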
On Sun, 31 Jul 2022, Linus Torvalds wrote:

> [ Will and Paul, question for you below ]
>
> On Sun, Jul 31, 2022 at 8:08 AM Mikulas Patocka <mpatocka@redhat.com> wrote:
> >
> > Also, there is this pattern present several times:
> > 	wait_on_buffer(bh);
> > 	if (!buffer_uptodate(bh))
> > 		err = -EIO;
> > It may be possible that buffer_uptodate is executed before wait_on_buffer
> > and it may return spurious error.
>
> I'm not convinced that's actually valid.
>
> They are testing the same memory location, and I don't think our
> memory ordering model allows for _that_ to be out-of-order. Memory
> barriers are for accesses to different memory locations.

You are right. And the bit tests are volatile, so the compiler can't
reorder them. But the compiler can reorder non-volatile accesses around
volatile accesses (gcc does this, clang afaik doesn't), so the bit tests
need at least a compiler barrier after them.

> But the patch looks fine, though I agree that the ordering in
> __wait_on_buffer should probably be moved into
> wait_on_bit/wait_on_bit_io.

Yes, there are more bugs where the code does wait_on_bit and then reads
some data without any barrier. Adding the barrier to wait_on_bit fixes
that. I'll send two patches, one for wait_on_bit and the other for
buffer_locked.

Do you think that wait_event also needs a read memory barrier? It is
defined as:

#define wait_event(wq_head, condition)						\
do {										\
	might_sleep();								\
	if (condition)								\
		break;								\
	__wait_event(wq_head, condition);					\
} while (0)

Mikulas

> And would we perhaps want the bitops to have the different ordering
> versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> would seem to be (a) cleaner and (b) possibly generate better code for
> architectures where that makes a difference?
>
>		Linus
>
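[ A hedged sketch of the hazard Mikulas describes; the helper and the field
use are hypothetical and only stand in for "data published by the I/O
completion path": ]

	static int copy_first_byte(struct buffer_head *bh, char *first)
	{
		wait_on_buffer(bh);		/* volatile test + sleep on BH_Lock */
		if (!buffer_uptodate(bh))	/* volatile test of BH_Uptodate     */
			return -EIO;
		/*
		 * Plain load: without the smp_rmb()/acquire discussed in this
		 * thread, the CPU (and, with enough inlining, the compiler)
		 * may perform it before the bit tests above have completed.
		 */
		*first = bh->b_data[0];
		return 0;
	}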
On Sun, Jul 31, 2022 at 1:39 PM Mikulas Patocka <mpatocka@redhat.com> wrote:
>
> Do you think that wait_event also needs a read memory barrier?

Not really, no.

__wait_event() uses prepare_to_wait(), and it uses set_current_state()
very much so that the process state setting is serialized with the test
afterwards.

And the only race wait_event should worry about is exactly the "go to
sleep" vs "make the event true and then wake up" race, so that it
doesn't wait forever.

There is no guarantee of memory ordering wrt anything else, and I don't
think there is any need for such a guarantee. If somebody wants that
guarantee, they should probably make the condition for wait_event() to
be a "load_acquire()" or similar. Those cases do exist.

		Linus
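[ A hedged sketch of what Linus is suggesting: put the acquire in the
wait_event() condition itself. 'wq', 'done' and 'payload' are made-up names
for this illustration: ]

	static DECLARE_WAIT_QUEUE_HEAD(wq);
	static int done;
	static int payload;

	static void producer(void)
	{
		payload = 42;			/* publish the data...            */
		smp_store_release(&done, 1);	/* ...then set the flag (release) */
		wake_up(&wq);
	}

	static int consumer(void)
	{
		wait_event(wq, smp_load_acquire(&done));
		return payload;			/* ordered after the acquire load */
	}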
On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> That said, I confess that I am having a hard time finding the
> buffer_locked() definition. So if buffer_locked() uses normal C-language
> accesses to sample the BH_Lock bit of the ->b_state field, then yes,
> there could be a problem. The compiler would then be free to reorder
> calls to buffer_locked() because it could then assume that no other
> thread was touching that ->b_state field.

You're having a hard time finding it because it's constructed with the C
preprocessor. I really wish we generated header files using CPP once
and then included the generated/ file. That would make them greppable.

You're looking for include/linux/buffer_head.h and it's done like this:

enum bh_state_bits {
...
	BH_Lock,	/* Is locked */
...

#define BUFFER_FNS(bit, name)						\
...
static __always_inline int buffer_##name(const struct buffer_head *bh)	\
{									\
	return test_bit(BH_##bit, &(bh)->b_state);			\
}

BUFFER_FNS(Lock, locked)

(fwiw, the page/folio versions of these functions don't autogenerate
the lock or uptodate ones because they need extra functions called)
On Sun, Jul 31, 2022 at 11:48:32PM +0100, Matthew Wilcox wrote:
> On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> > That said, I confess that I am having a hard time finding the
> > buffer_locked() definition. So if buffer_locked() uses normal C-language
> > accesses to sample the BH_Lock bit of the ->b_state field, then yes,
> > there could be a problem. The compiler would then be free to reorder
> > calls to buffer_locked() because it could then assume that no other
> > thread was touching that ->b_state field.
>
> You're having a hard time finding it because it's constructed with the C
> preprocessor. I really wish we generated header files using CPP once
> and then included the generated/ file. That would make them greppable.
>
> You're looking for include/linux/buffer_head.h and it's done like this:
>
> enum bh_state_bits {
> ...
> 	BH_Lock,	/* Is locked */
> ...
>
> #define BUFFER_FNS(bit, name)						\
> ...
> static __always_inline int buffer_##name(const struct buffer_head *bh)	\
> {									\
> 	return test_bit(BH_##bit, &(bh)->b_state);			\
> }
>
> BUFFER_FNS(Lock, locked)
>
> (fwiw, the page/folio versions of these functions don't autogenerate
> the lock or uptodate ones because they need extra functions called)

Thank you!

Another thing that would have helped me find it would have been to leave
the "BH_" prefix on the bit name in the BUFFER_FNS() invocation, as in
ditch the "BH_##" in the definition and then just say
"BUFFER_FNS(BH_Lock, locked)". But what is life without a challenge? ;-/

Anyway, on x86 this will use the constant_test_bit() function, which uses
a volatile declaration for its parameter. So it should avoid vulnerability
to the vicissitudes of the compiler.

So I feel much better now.

							Thanx, Paul
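[ The helper Paul refers to is something like the following, paraphrased
from arch/x86/include/asm/bitops.h of that era; treat it as a sketch rather
than an exact quote. The 'volatile' on 'addr' is what keeps the compiler
from caching or eliding the ->b_state load that buffer_locked() performs: ]

	static __always_inline bool constant_test_bit(long nr,
					const volatile unsigned long *addr)
	{
		return ((1UL << (nr & (BITS_PER_LONG - 1))) &
			(addr[nr >> _BITOPS_LONG_SHIFT])) != 0;
	}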
Hi Linus, Paul,

Apologies for the slow response here; believe it or not, I was attending
a workshop about memory ordering.

On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote:
> On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote:
> > Even alpha is specified to be locally ordered wrt *one* memory
> > location, including for reads (See table 5-1: "Processor issue order",
> > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a
> > new value, a subsequent read is not allowed to see an older one - even
> > without a memory barrier.
> >
> > Will, Paul? Maybe that's only for overlapping loads/stores, not for
> > loads/loads. Because maybe alpha for once isn't the weakest possible
> > ordering.
>
> The "bad boy" in this case is Itanium, which can do some VLIW reordering
> of accesses. Or could, I am not sure that newer Itanium hardware
> does this. But this is why Itanium compilers made volatile loads use
> the ld,acq instruction.
>
> Which means that aligned same-sized marked accesses to a single location
> really do execute consistently with some global ordering, even on Itanium.

Although this is true, there's a really subtle issue which crops up if you
try to compose this read-after-read ordering with dependencies in the case
where the two reads read the same value (which is encapsulated by the
unusual RSW litmus test that I've tried to convert to C below):

/* Global definitions; assume everything zero-initialised */
struct foo {
	int *x;
};

int x;
struct foo foo;
struct foo *ptr;


/* CPU 0 */
WRITE_ONCE(x, 1);
WRITE_ONCE(foo.x, &x);

/*
 * Release ordering to ensure that somebody following a non-NULL ptr will
 * see a fully-initialised 'foo'. smp_[w]mb() would work as well.
 */
smp_store_release(&ptr, &foo);


/* CPU 1 */
int *xp1, *xp2, val;
struct foo *foop;

/* Load the global pointer and check that it's not NULL. */
foop = READ_ONCE(ptr);
if (!foop)
	return;

/*
 * Load 'foo.x' via the pointer we just loaded. This is ordered after the
 * previous READ_ONCE() because of the address dependency.
 */
xp1 = READ_ONCE(foop->x);

/*
 * Load 'foo.x' directly via the global 'foo'.
 * _This is loading the same address as the previous READ_ONCE() and
 * therefore cannot return a stale (NULL) value!_
 */
xp2 = READ_ONCE(foo.x);

/*
 * Load 'x' via the pointer we just loaded.
 * _We may see zero here!_
 */
val = READ_ONCE(*xp2);

So in this case, the two same-location reads on CPU1 are actually executed
out of order, but you can't tell just by looking at them in isolation
because they returned the same value (i.e. xp1 == xp2 == &x). However, you
*can* tell when you compose them in funny situations such as above (and I
believe that this is demonstrably true on PPC and Arm; effectively the
read-after-read ordering machinery only triggers in response to incoming
snoops).

There's probably a more-compelling variant using an (RCpc) load acquire
instead of the last address dependency on CPU 1 as well.

Anyway, I think all I'm trying to say is that I'd tend to shy away from
relying on read-after-read ordering if it forms part of a more involved
ordering relationship.

> > But the patch looks fine, though I agree that the ordering in
> > __wait_on_buffer should probably be moved into
> > wait_on_bit/wait_on_bit_io.
> >
> > And would we perhaps want the bitops to have the different ordering
> > versions? Like "set_bit_release()" and "test_bit_acquire()"? That
> > would seem to be (a) cleaner and (b) possibly generate better code for
> > architectures where that makes a difference?
>
> As always, I defer to the architecture maintainers on this one.

FWIW, that makes sense to me. We already have release/acquire/relaxed
variants of the bitops in the atomic_t API so it seems natural to have
a counterpart in the actual bitops API as well.

Will
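[ A hedged sketch of the API shape being discussed; the names come from
Linus's mail, and the bodies are illustrative portable fallbacks, not any
architecture's optimized implementation: ]

	static __always_inline bool test_bit_acquire(long nr,
					const volatile unsigned long *addr)
	{
		unsigned long *p = ((unsigned long *)addr) + BIT_WORD(nr);

		/* Acquire: later accesses cannot be hoisted above this load. */
		return 1UL & (smp_load_acquire(p) >> (nr & (BITS_PER_LONG - 1)));
	}

	static __always_inline void set_bit_release(long nr,
					volatile unsigned long *addr)
	{
		/* Conservative release-like ordering ahead of the RMW. */
		smp_mb__before_atomic();
		set_bit(nr, addr);
	}

[ With helpers like these, the wait_on_bit()/buffer_locked() callers in the
patch could take the acquire on the bit test itself instead of adding a
separate smp_rmb() after the wait. ]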
On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote: > Hi Linus, Paul, > > Apologies for the slow response here; believe it or not, I was attending > a workshop about memory ordering. Nice!!! Anything that I can/should know from that gathering? ;-) > On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote: > > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote: > > > Even alpha is specified to be locally ordered wrt *one* memory > > > location, including for reads (See table 5-1: "Processor issue order", > > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a > > > new value, a subsequent read is not allowed to see an older one - even > > > without a memory barrier. > > > > > > Will, Paul? Maybe that's only for overlapping loads/stores, not for > > > loads/loads. Because maybe alpha for once isn't the weakest possible > > > ordering. > > > > The "bad boy" in this case is Itanium, which can do some VLIW reordering > > of accesses. Or could, I am not sure that newer Itanium hardware > > does this. But this is why Itanium compilers made volatile loads use > > the ld,acq instruction. > > > > Which means that aligned same-sized marked accesses to a single location > > really do execute consistently with some global ordering, even on Itanium. > > Although this is true, there's a really subtle issue which crops up if you > try to compose this read-after-read ordering with dependencies in the case > where the two reads read the same value (which is encapsulated by the > unusual RSW litmus test that I've tried to convert to C below): RSW from the infamous test6.pdf, correct? > /* Global definitions; assume everything zero-initialised */ > struct foo { > int *x; > }; > > int x; > struct foo foo; > struct foo *ptr; > > > /* CPU 0 */ > WRITE_ONCE(x, 1); Your x is RSW's z? > WRITE_ONCE(foo.x, &x); And your foo.x is RSW's x? If so, the above WRITE_ONCE() could happen at compile time, correct? Or in the initialization clause of a litmus test? > /* > * Release ordering to ensure that somebody following a non-NULL ptr will > * see a fully-initialised 'foo'. smp_[w]mb() would work as well. > */ > smp_store_release(&ptr, &foo); Your ptr is RSW's y, correct? > /* CPU 1 */ > int *xp1, *xp2, val; > struct foo *foop; > > /* Load the global pointer and check that it's not NULL. */ > foop = READ_ONCE(ptr); > if (!foop) > return; A litmus tests can do this via the filter clause. > /* > * Load 'foo.x' via the pointer we just loaded. This is ordered after the > * previous READ_ONCE() because of the address dependency. > */ > xp1 = READ_ONCE(foop->x); > > /* > * Load 'foo.x' directly via the global 'foo'. > * _This is loading the same address as the previous READ_ONCE() and > * therefore cannot return a stale (NULL) value!_ > */ > xp2 = READ_ONCE(foo.x); OK, same location, but RSW calls only for po, not addr from the initial read to this read, got it. (My first attempt left out this nuance, in case you were wondering.) > /* > * Load 'x' via the pointer we just loaded. > * _We may see zero here!_ > */ > val = READ_ONCE(*xp2); And herd7/LKMM agree with this, at least assuming I got the litmus test right. (I used RSW's variables as a cross-check.) > So in this case, the two same-location reads on CPU1 are actually executed > out of order, but you can't tell just by looking at them in isolation > because they returned the same value (i.e. xp1 == xp2 == &x). 
However, you > *can* tell when you compose them in funny situations such as above (and I > believe that this is demonstrably true on PPC and Arm; effectively the > read-after-read ordering machinery only triggers in response to incoming > snoops). Again, I leave the hardware specifics to the respective maintainers. > There's probably a more-compelling variant using an (RCpc) load acquire > instead of the last address dependency on CPU 1 as well. > > Anyway, I think all I'm trying to say is that I'd tend to shy away from > relying on read-after-read ordering if it forms part of a more involved > ordering relationship. Agreed. As soon as you add that second variable, you need more ordering. The read-after-read ordering applies only within the confines of a single variable, and you need more ordering when you add that second variable. And as Will's RSW example shows, more ordering than you might think. > > > But the patch looks fine, though I agree that the ordering in > > > __wait_on_buffer should probably be moved into > > > wait_on_bit/wait_on_bit_io. > > > > > > And would we perhaps want the bitops to have the different ordering > > > versions? Like "set_bit_release()" and "test_bit_acquire()"? That > > > would seem to be (a) cleaner and (b) possibly generate better code for > > > architectures where that makes a difference? > > > > As always, I defer to the architecture maintainers on this one. > > FWIW, that makes sense to me. We already have release/acquire/releaxed > variants of the bitops in the atomic_t API so it seems natural to have > a counterpart in the actual bitops API as well. Just to join you guys in liking acquire and release better than smp_wmb() and smp_rmb(). My RCU life got much better after update-side uses of smp_wmb() were replaced by rcu_assign_pointer() and read-side uses of smp_read_barrier_depends() were replaced by rcu_dereference(). Thanx, Paul ------------------------------------------------------------------------ C rsw { a=0; x=z; y=a; z=0; } P0(int *x, int **y, int *z) { WRITE_ONCE(*z, 1); WRITE_ONCE(*y, x); } P1(int *x, int **y, int *z) { r1 = READ_ONCE(*y); r2 = READ_ONCE(*r1); r3 = READ_ONCE(*x); r4 = READ_ONCE(*r3); } filter(1:r1=x) exists(1:r2=z /\ 1:r3=z /\ 1:r4=0) ------------------------------------------------------------------------ $ herd7 -conf linux-kernel.cfg /tmp/rsw.litmus Test rsw Allowed States 2 1:r2=z; 1:r3=z; 1:r4=0; 1:r2=z; 1:r3=z; 1:r4=1; Ok Witnesses Positive: 1 Negative: 1 Condition exists (1:r2=z /\ 1:r3=z /\ 1:r4=0) Observation rsw Sometimes 1 1 Time rsw 0.00 Hash=03618e3079eb0d371ca0f1ab679f3eae
On Mon, Aug 01, 2022 at 12:20:35PM -0700, Paul E. McKenney wrote: > On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote: > > Apologies for the slow response here; believe it or not, I was attending > > a workshop about memory ordering. > > Nice!!! Anything that I can/should know from that gathering? ;-) Oh come off it, you know this stuff already ;) > > On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote: > > > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote: > > > > Even alpha is specified to be locally ordered wrt *one* memory > > > > location, including for reads (See table 5-1: "Processor issue order", > > > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a > > > > new value, a subsequent read is not allowed to see an older one - even > > > > without a memory barrier. > > > > > > > > Will, Paul? Maybe that's only for overlapping loads/stores, not for > > > > loads/loads. Because maybe alpha for once isn't the weakest possible > > > > ordering. > > > > > > The "bad boy" in this case is Itanium, which can do some VLIW reordering > > > of accesses. Or could, I am not sure that newer Itanium hardware > > > does this. But this is why Itanium compilers made volatile loads use > > > the ld,acq instruction. > > > > > > Which means that aligned same-sized marked accesses to a single location > > > really do execute consistently with some global ordering, even on Itanium. > > > > Although this is true, there's a really subtle issue which crops up if you > > try to compose this read-after-read ordering with dependencies in the case > > where the two reads read the same value (which is encapsulated by the > > unusual RSW litmus test that I've tried to convert to C below): > > RSW from the infamous test6.pdf, correct? That's the badger. I've no doubt that you're aware of it already, but I thought it was a useful exercise to transcribe it to C and have it on the mailing list for folks to look at. > > /* Global definitions; assume everything zero-initialised */ > > struct foo { > > int *x; > > }; > > > > int x; > > struct foo foo; > > struct foo *ptr; > > > > > > /* CPU 0 */ > > WRITE_ONCE(x, 1); > > Your x is RSW's z? Yes. > > WRITE_ONCE(foo.x, &x); > > And your foo.x is RSW's x? If so, the above WRITE_ONCE() could happen at > compile time, correct? Or in the initialization clause of a litmus test? Yes, although I think it's a tiny bit more like real code to have it done here, although it means that the "surprising" outcome relies on this being reordered before the store to x. > > /* > > * Release ordering to ensure that somebody following a non-NULL ptr will > > * see a fully-initialised 'foo'. smp_[w]mb() would work as well. > > */ > > smp_store_release(&ptr, &foo); > > Your ptr is RSW's y, correct? Yes. > > /* CPU 1 */ > > int *xp1, *xp2, val; > > struct foo *foop; > > > > /* Load the global pointer and check that it's not NULL. */ > > foop = READ_ONCE(ptr); > > if (!foop) > > return; > > A litmus tests can do this via the filter clause. Indeed, but I was trying to make this look like C code for non-litmus speakers! > > /* > > * Load 'foo.x' via the pointer we just loaded. This is ordered after the > > * previous READ_ONCE() because of the address dependency. > > */ > > xp1 = READ_ONCE(foop->x); > > > > /* > > * Load 'foo.x' directly via the global 'foo'. 
> > * _This is loading the same address as the previous READ_ONCE() and > > * therefore cannot return a stale (NULL) value!_ > > */ > > xp2 = READ_ONCE(foo.x); > > OK, same location, but RSW calls only for po, not addr from the initial > read to this read, got it. (My first attempt left out this nuance, > in case you were wondering.) Right, there is only po from the initial read to this read. If there was an address dependency, then we'd have a chain of address dependencies from the first read to the last read on this CPU and the result (of x == 0) would be forbidden. > > /* > > * Load 'x' via the pointer we just loaded. > > * _We may see zero here!_ > > */ > > val = READ_ONCE(*xp2); > > And herd7/LKMM agree with this, at least assuming I got the litmus > test right. (I used RSW's variables as a cross-check.) That's promising, but see below... > C rsw > > { > a=0; > x=z; > y=a; > z=0; > } > > P0(int *x, int **y, int *z) > { > WRITE_ONCE(*z, 1); > WRITE_ONCE(*y, x); > } Ah wait, you need a barrier between these two writes, don't you? I used an smp_store_release() but smp[w]_mb() should do too. Will
On Tue, Aug 02, 2022 at 09:54:55AM +0100, Will Deacon wrote: > On Mon, Aug 01, 2022 at 12:20:35PM -0700, Paul E. McKenney wrote: > > On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote: > > > Apologies for the slow response here; believe it or not, I was attending > > > a workshop about memory ordering. > > > > Nice!!! Anything that I can/should know from that gathering? ;-) > > Oh come off it, you know this stuff already ;) Thank you for the kind words, but the most devastating learning disability of all is thinking that you already know everything about the topic in question. ;-) > > > On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote: > > > > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote: > > > > > Even alpha is specified to be locally ordered wrt *one* memory > > > > > location, including for reads (See table 5-1: "Processor issue order", > > > > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a > > > > > new value, a subsequent read is not allowed to see an older one - even > > > > > without a memory barrier. > > > > > > > > > > Will, Paul? Maybe that's only for overlapping loads/stores, not for > > > > > loads/loads. Because maybe alpha for once isn't the weakest possible > > > > > ordering. > > > > > > > > The "bad boy" in this case is Itanium, which can do some VLIW reordering > > > > of accesses. Or could, I am not sure that newer Itanium hardware > > > > does this. But this is why Itanium compilers made volatile loads use > > > > the ld,acq instruction. > > > > > > > > Which means that aligned same-sized marked accesses to a single location > > > > really do execute consistently with some global ordering, even on Itanium. > > > > > > Although this is true, there's a really subtle issue which crops up if you > > > try to compose this read-after-read ordering with dependencies in the case > > > where the two reads read the same value (which is encapsulated by the > > > unusual RSW litmus test that I've tried to convert to C below): > > > > RSW from the infamous test6.pdf, correct? > > That's the badger. I've no doubt that you're aware of it already, but I > thought it was a useful exercise to transcribe it to C and have it on the > mailing list for folks to look at. I have seen it, but this was nevertheless a useful reminder. > > > /* Global definitions; assume everything zero-initialised */ > > > struct foo { > > > int *x; > > > }; > > > > > > int x; > > > struct foo foo; > > > struct foo *ptr; > > > > > > > > > /* CPU 0 */ > > > WRITE_ONCE(x, 1); > > > > Your x is RSW's z? > > Yes. > > > > WRITE_ONCE(foo.x, &x); > > > > And your foo.x is RSW's x? If so, the above WRITE_ONCE() could happen at > > compile time, correct? Or in the initialization clause of a litmus test? > > Yes, although I think it's a tiny bit more like real code to have it done > here, although it means that the "surprising" outcome relies on this being > reordered before the store to x. > > > > /* > > > * Release ordering to ensure that somebody following a non-NULL ptr will > > > * see a fully-initialised 'foo'. smp_[w]mb() would work as well. > > > */ > > > smp_store_release(&ptr, &foo); > > > > Your ptr is RSW's y, correct? > > Yes. > > > > /* CPU 1 */ > > > int *xp1, *xp2, val; > > > struct foo *foop; > > > > > > /* Load the global pointer and check that it's not NULL. */ > > > foop = READ_ONCE(ptr); > > > if (!foop) > > > return; > > > > A litmus tests can do this via the filter clause. 
> > Indeed, but I was trying to make this look like C code for non-litmus > speakers! > > > > /* > > > * Load 'foo.x' via the pointer we just loaded. This is ordered after the > > > * previous READ_ONCE() because of the address dependency. > > > */ > > > xp1 = READ_ONCE(foop->x); > > > > > > /* > > > * Load 'foo.x' directly via the global 'foo'. > > > * _This is loading the same address as the previous READ_ONCE() and > > > * therefore cannot return a stale (NULL) value!_ > > > */ > > > xp2 = READ_ONCE(foo.x); > > > > OK, same location, but RSW calls only for po, not addr from the initial > > read to this read, got it. (My first attempt left out this nuance, > > in case you were wondering.) > > Right, there is only po from the initial read to this read. If there was an > address dependency, then we'd have a chain of address dependencies from the > first read to the last read on this CPU and the result (of x == 0) would be > forbidden. > > > > /* > > > * Load 'x' via the pointer we just loaded. > > > * _We may see zero here!_ > > > */ > > > val = READ_ONCE(*xp2); > > > > And herd7/LKMM agree with this, at least assuming I got the litmus > > test right. (I used RSW's variables as a cross-check.) > > That's promising, but see below... > > > C rsw > > > > { > > a=0; > > x=z; > > y=a; > > z=0; > > } > > > > P0(int *x, int **y, int *z) > > { > > WRITE_ONCE(*z, 1); > > WRITE_ONCE(*y, x); > > } > > Ah wait, you need a barrier between these two writes, don't you? I used > an smp_store_release() but smp[w]_mb() should do too. You are quite right, thank you! Here is the fixed version and output, which LKMM still says is allowed. Thanx, Paul ------------------------------------------------------------------------ C rsw { a=0; x=z; y=a; z=0; } P0(int *x, int **y, int *z) { WRITE_ONCE(*z, 1); smp_store_release(y, x); } P1(int *x, int **y, int *z) { r1 = READ_ONCE(*y); r2 = READ_ONCE(*r1); r3 = READ_ONCE(*x); r4 = READ_ONCE(*r3); } filter(1:r1=x) exists(1:r2=z /\ 1:r3=z /\ 1:r4=0) ------------------------------------------------------------------------ $ herd7 -conf linux-kernel.cfg /tmp/rsw.litmus Test rsw Allowed States 2 1:r2=z; 1:r3=z; 1:r4=0; 1:r2=z; 1:r3=z; 1:r4=1; Ok Witnesses Positive: 1 Negative: 1 Condition exists (1:r2=z /\ 1:r3=z /\ 1:r4=0) Observation rsw Sometimes 1 1 Time rsw 0.01 Hash=588486c0f4d521fa3ce559a19ed118d5
On Tue, Aug 02, 2022 at 06:49:21AM -0700, Paul E. McKenney wrote: > On Tue, Aug 02, 2022 at 09:54:55AM +0100, Will Deacon wrote: > > On Mon, Aug 01, 2022 at 12:20:35PM -0700, Paul E. McKenney wrote: > > > On Mon, Aug 01, 2022 at 04:41:09PM +0100, Will Deacon wrote: > > > > Apologies for the slow response here; believe it or not, I was attending > > > > a workshop about memory ordering. > > > > > > Nice!!! Anything that I can/should know from that gathering? ;-) > > > > Oh come off it, you know this stuff already ;) > > Thank you for the kind words, but the most devastating learning disability > of all is thinking that you already know everything about the topic > in question. ;-) > > > > > On Sun, Jul 31, 2022 at 10:30:11AM -0700, Paul E. McKenney wrote: > > > > > On Sun, Jul 31, 2022 at 09:51:47AM -0700, Linus Torvalds wrote: > > > > > > Even alpha is specified to be locally ordered wrt *one* memory > > > > > > location, including for reads (See table 5-1: "Processor issue order", > > > > > > and also 5.6.2.2: "Litmus test 2"). So if a previous read has seen a > > > > > > new value, a subsequent read is not allowed to see an older one - even > > > > > > without a memory barrier. > > > > > > > > > > > > Will, Paul? Maybe that's only for overlapping loads/stores, not for > > > > > > loads/loads. Because maybe alpha for once isn't the weakest possible > > > > > > ordering. > > > > > > > > > > The "bad boy" in this case is Itanium, which can do some VLIW reordering > > > > > of accesses. Or could, I am not sure that newer Itanium hardware > > > > > does this. But this is why Itanium compilers made volatile loads use > > > > > the ld,acq instruction. > > > > > > > > > > Which means that aligned same-sized marked accesses to a single location > > > > > really do execute consistently with some global ordering, even on Itanium. > > > > > > > > Although this is true, there's a really subtle issue which crops up if you > > > > try to compose this read-after-read ordering with dependencies in the case > > > > where the two reads read the same value (which is encapsulated by the > > > > unusual RSW litmus test that I've tried to convert to C below): > > > > > > RSW from the infamous test6.pdf, correct? > > > > That's the badger. I've no doubt that you're aware of it already, but I > > thought it was a useful exercise to transcribe it to C and have it on the > > mailing list for folks to look at. > > I have seen it, but this was nevertheless a useful reminder. > > > > > /* Global definitions; assume everything zero-initialised */ > > > > struct foo { > > > > int *x; > > > > }; > > > > > > > > int x; > > > > struct foo foo; > > > > struct foo *ptr; > > > > > > > > > > > > /* CPU 0 */ > > > > WRITE_ONCE(x, 1); > > > > > > Your x is RSW's z? > > > > Yes. > > > > > > WRITE_ONCE(foo.x, &x); > > > > > > And your foo.x is RSW's x? If so, the above WRITE_ONCE() could happen at > > > compile time, correct? Or in the initialization clause of a litmus test? > > > > Yes, although I think it's a tiny bit more like real code to have it done > > here, although it means that the "surprising" outcome relies on this being > > reordered before the store to x. > > > > > > /* > > > > * Release ordering to ensure that somebody following a non-NULL ptr will > > > > * see a fully-initialised 'foo'. smp_[w]mb() would work as well. > > > > */ > > > > smp_store_release(&ptr, &foo); > > > > > > Your ptr is RSW's y, correct? > > > > Yes. 
> > > > > > /* CPU 1 */ > > > > int *xp1, *xp2, val; > > > > struct foo *foop; > > > > > > > > /* Load the global pointer and check that it's not NULL. */ > > > > foop = READ_ONCE(ptr); > > > > if (!foop) > > > > return; > > > > > > A litmus tests can do this via the filter clause. > > > > Indeed, but I was trying to make this look like C code for non-litmus > > speakers! > > > > > > /* > > > > * Load 'foo.x' via the pointer we just loaded. This is ordered after the > > > > * previous READ_ONCE() because of the address dependency. > > > > */ > > > > xp1 = READ_ONCE(foop->x); > > > > > > > > /* > > > > * Load 'foo.x' directly via the global 'foo'. > > > > * _This is loading the same address as the previous READ_ONCE() and > > > > * therefore cannot return a stale (NULL) value!_ > > > > */ > > > > xp2 = READ_ONCE(foo.x); > > > > > > OK, same location, but RSW calls only for po, not addr from the initial > > > read to this read, got it. (My first attempt left out this nuance, > > > in case you were wondering.) > > > > Right, there is only po from the initial read to this read. If there was an > > address dependency, then we'd have a chain of address dependencies from the > > first read to the last read on this CPU and the result (of x == 0) would be > > forbidden. > > > > > > /* > > > > * Load 'x' via the pointer we just loaded. > > > > * _We may see zero here!_ > > > > */ > > > > val = READ_ONCE(*xp2); > > > > > > And herd7/LKMM agree with this, at least assuming I got the litmus > > > test right. (I used RSW's variables as a cross-check.) > > > > That's promising, but see below... > > > > > C rsw > > > > > > { > > > a=0; > > > x=z; > > > y=a; > > > z=0; > > > } > > > > > > P0(int *x, int **y, int *z) > > > { > > > WRITE_ONCE(*z, 1); > > > WRITE_ONCE(*y, x); > > > } > > > > Ah wait, you need a barrier between these two writes, don't you? I used > > an smp_store_release() but smp[w]_mb() should do too. > > You are quite right, thank you! Here is the fixed version and output, > which LKMM still says is allowed. > > Thanx, Paul > > ------------------------------------------------------------------------ > > C rsw > > { > a=0; > x=z; > y=a; > z=0; > } > > P0(int *x, int **y, int *z) > { > WRITE_ONCE(*z, 1); > smp_store_release(y, x); And like test6.pdf says, this is still allowed even if these are a pair of WRITE_ONCE() invocations separated by smp_mb(). So I took that form. Thanx, Paul > } > > P1(int *x, int **y, int *z) > { > r1 = READ_ONCE(*y); > r2 = READ_ONCE(*r1); > r3 = READ_ONCE(*x); > r4 = READ_ONCE(*r3); > } > > filter(1:r1=x) > exists(1:r2=z /\ 1:r3=z /\ 1:r4=0) > > ------------------------------------------------------------------------ > > $ herd7 -conf linux-kernel.cfg /tmp/rsw.litmus > Test rsw Allowed > States 2 > 1:r2=z; 1:r3=z; 1:r4=0; > 1:r2=z; 1:r3=z; 1:r4=1; > Ok > Witnesses > Positive: 1 Negative: 1 > Condition exists (1:r2=z /\ 1:r3=z /\ 1:r4=0) > Observation rsw Sometimes 1 1 > Time rsw 0.01 > Hash=588486c0f4d521fa3ce559a19ed118d5
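[ For reference, the smp_mb() form Paul mentions above, transcribed from
the quoted litmus test with only P0 changed and given an assumed name;
Paul reports that LKMM still allows the exists clause for this variant: ]

C rsw-mb

{
	a=0;
	x=z;
	y=a;
	z=0;
}

P0(int *x, int **y, int *z)
{
	WRITE_ONCE(*z, 1);
	smp_mb();
	WRITE_ONCE(*y, x);
}

P1(int *x, int **y, int *z)
{
	r1 = READ_ONCE(*y);
	r2 = READ_ONCE(*r1);
	r3 = READ_ONCE(*x);
	r4 = READ_ONCE(*r3);
}

filter(1:r1=x)
exists(1:r2=z /\ 1:r3=z /\ 1:r4=0)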
Index: linux-2.6/include/linux/buffer_head.h
===================================================================
--- linux-2.6.orig/include/linux/buffer_head.h
+++ linux-2.6/include/linux/buffer_head.h
@@ -120,7 +120,6 @@ static __always_inline int test_clear_bu
 BUFFER_FNS(Uptodate, uptodate)
 BUFFER_FNS(Dirty, dirty)
 TAS_BUFFER_FNS(Dirty, dirty)
-BUFFER_FNS(Lock, locked)
 BUFFER_FNS(Req, req)
 TAS_BUFFER_FNS(Req, req)
 BUFFER_FNS(Mapped, mapped)
@@ -135,6 +134,17 @@ BUFFER_FNS(Meta, meta)
 BUFFER_FNS(Prio, prio)
 BUFFER_FNS(Defer_Completion, defer_completion)
 
+static __always_inline void set_buffer_locked(struct buffer_head *bh)
+{
+	set_bit(BH_Lock, &bh->b_state);
+}
+
+static __always_inline int buffer_locked(const struct buffer_head *bh)
+{
+	unsigned long state = smp_load_acquire(&bh->b_state);
+	return test_bit(BH_Lock, &state);
+}
+
 #define bh_offset(bh)		((unsigned long)(bh)->b_data & ~PAGE_MASK)
 
 /* If we *know* page->private refers to buffer_heads */
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -120,6 +120,7 @@ EXPORT_SYMBOL(buffer_check_dirty_writeba
 void __wait_on_buffer(struct buffer_head * bh)
 {
 	wait_on_bit_io(&bh->b_state, BH_Lock, TASK_UNINTERRUPTIBLE);
+	smp_rmb();
 }
 EXPORT_SYMBOL(__wait_on_buffer);
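[ A hedged usage sketch of what the acquire in buffer_locked() buys a
caller that polls the lock bit instead of sleeping in wait_on_buffer();
'dst' and the copy are hypothetical, for illustration only: ]

	while (buffer_locked(bh))	/* acquire load of bh->b_state	     */
		cpu_relax();
	if (buffer_uptodate(bh))	/* cannot be hoisted above the acquire */
		memcpy(dst, bh->b_data, bh->b_size);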