Message ID | 87shti74no.fsf@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 02/09/2016 16:33, Pranith Kumar wrote: > > Hi Paolo, > > This is in reference to the discussion we had yesterday on IRC. I am trying to > understand the need for smp_read_barrier_depends() and how it prevents the > following race condition. I think a regular barrier() should suffice instead > of smp_read_barrier_depends(). Consider: > > P0 P1 > ---------------------------------------- > bh = ctx->first_bh; > smp_read_barrier_depends(); // barrier() should be sufficient since bh > // is local variable > next = bh->next; > lock(bh_lock); > new_bh->next = ctx->first_bh; > smp_wmb(); > ctx->first_bh = new_bh; > unlock(bh_lock); > > if (bh) { > // do something > } > > Why do you think smp_read_barrier_depends() is necessary here? If bh was a > shared variable I would understand, but here bh is local and a regular > barrier() would make sure that we are not optimizing the initial load into bh. Honestly, I don't think you understand why memory barriers exist... They are used to synchronize writes to shared *data*, not to shared variables. It doesn't matter whether bh is a shared variable. The *data that it points to* is shared with other threads. ctx->first_bh and bh->next are both shared by P0 and P1. P1 must make sure that ctx->first_bh is written after all of its context (which in aio_bh_new includes new_bh->next) is ready. It uses smp_wmb for that. A "release store" for ctx->first_bh would be okay too. This is easy. P0 must make sure that bh->next is read after ctx->first_bh. The simplest way to ensure this is an "acquire load" for ctx->first_bh and bh->next place an smp_rmb where there is currently smp_read_barrier_depends(). This is easy too, but a bit overkill because bh->next is really ctx->first_bh->next and data dependent reads do not need full-blown acquire semantics. However, you still need to make sure that bh->next is read from _exactly_ the ctx->first_bh that was assigned to bh, and not for example a value that was changed in the meanwhile by another processor. Most processors promise this (except the Alpha!) but compilers might reload values if they think it's useful. For this reason Linux and QEMU have smp_read_barrier_depends(), and for this reason C11/C++11 introduce the "consume" memory order. smp_read_barrier_depends() is the same as C11's atomic_thread_fence(MEMORDER_CONSUME). We didn't make it up. So instead of smp_read_barrier_depends() you could load ctx->first_bh and bh->next with the consume memory order, but you do need _something_. Paolo
Paolo Bonzini writes: > On 02/09/2016 16:33, Pranith Kumar wrote: >> >> Hi Paolo, >> >> This is in reference to the discussion we had yesterday on IRC. I am trying to >> understand the need for smp_read_barrier_depends() and how it prevents the >> following race condition. I think a regular barrier() should suffice instead >> of smp_read_barrier_depends(). Consider: >> >> P0 P1 >> ---------------------------------------- >> bh = ctx->first_bh; >> smp_read_barrier_depends(); // barrier() should be sufficient since bh >> // is local variable >> next = bh->next; >> lock(bh_lock); >> new_bh->next = ctx->first_bh; >> smp_wmb(); >> ctx->first_bh = new_bh; >> unlock(bh_lock); >> >> if (bh) { >> // do something >> } >> >> Why do you think smp_read_barrier_depends() is necessary here? If bh was a >> shared variable I would understand, but here bh is local and a regular >> barrier() would make sure that we are not optimizing the initial load into bh. > > Honestly, I don't think you understand why memory barriers exist... Well, you are not entirely wrong :-). I think this is a difficult topic to excel at. And I am trying to wrap my head around Alpha barrier semantics which is an entirely different ball game. > They are used to synchronize writes to shared *data*, not to shared > variables. My bad. Shared memory location is what I meant too. I will be more precise when I write. > > It doesn't matter whether bh is a shared variable. The *data that it > points to* is shared with other threads. ctx->first_bh and bh->next are > both shared by P0 and P1. > > P1 must make sure that ctx->first_bh is written after all of its context > (which in aio_bh_new includes new_bh->next) is ready. It uses smp_wmb > for that. A "release store" for ctx->first_bh would be okay too. This > is easy. > > P0 must make sure that bh->next is read after ctx->first_bh. The > simplest way to ensure this is an "acquire load" for ctx->first_bh and > bh->next place an smp_rmb where there is currently > smp_read_barrier_depends(). This is easy too, but a bit overkill > because bh->next is really ctx->first_bh->next and data dependent reads > do not need full-blown acquire semantics. > > However, you still need to make sure that bh->next is read from > _exactly_ the ctx->first_bh that was assigned to bh, and not for example > a value that was changed in the meanwhile by another processor. Most > processors promise this (except the Alpha!) but compilers might reload > values if they think it's useful. For this reason Linux and QEMU have > smp_read_barrier_depends(), and for this reason C11/C++11 introduce the > "consume" memory order. smp_read_barrier_depends() is the same as C11's > atomic_thread_fence(MEMORDER_CONSUME). We didn't make it up. If I understand you correctly, this is what might happen without the barrier(): P0 P1 ---------------------------------------- // bh = ctx->first_bh; optimized if (ctx->first_bh) { // next = ctx->first_bh->next; lock(bh_lock); new_bh->next = ctx->first_bh; smp_wmb(); // this alone is not sufficient // for Alpha ctx->first_bh = new_bh; unlock(bh_lock); // bh = next; bh = ctx->first_bh->next; if (bh) {do something} } Is this what might happen? If so, inserting a barrier() after the first load into bh will prevent the compiler from optimizing the load into bh since the compiler cannot optimize away loads and stores past the barrier(). And on Alpha processors barrier() should really be smp_read_barrier_depends() to prevent this from happening because of it's memory model(issue a barrier after loading a pointer to shared memory and before dereferencing it). > > So instead of smp_read_barrier_depends() you could load ctx->first_bh > and bh->next with the consume memory order, but you do need _something_. OK, if the above situation is possible, then I think I understand the need for this barrier.
On 02/09/2016 20:23, Pranith Kumar wrote: > If I understand you correctly, this is what might happen without the > barrier(): > > P0 P1 > ---------------------------------------- > // bh = ctx->first_bh; optimized > if (ctx->first_bh) { > // next = ctx->first_bh->next; > lock(bh_lock); > new_bh->next = ctx->first_bh; > smp_wmb(); // this alone is not sufficient > // for Alpha > ctx->first_bh = new_bh; > unlock(bh_lock); > // bh = next; > bh = ctx->first_bh->next; > > if (bh) {do something} > } > > Is this what might happen? If so, inserting a barrier() after the first load > into bh will prevent the compiler from optimizing the load into bh since the > compiler cannot optimize away loads and stores past the barrier(). Yes, this is what you can expect from a compiler. > And on Alpha processors barrier() should really be smp_read_barrier_depends() > to prevent this from happening because of it's memory model(issue a barrier > after loading a pointer to shared memory and before dereferencing it). Yes; the actual effect you could see on the Alpha it's even crazier. Without the barrier, the "bh = ctx->first_bh" and "next = bh->next" can be reordered. If you take both threads into account, "next = bh->next" can use a value from before P1's "ctx->first = new_bh". Indeed, it can use a value from before P1's smp_wmb() or before new_bh->next = ctx->first_bh. For example, "next" could load a NULL value. FWIW, RISCv currently has the same memory model as the Alpha, but they plan to fix it before they finalize the relevant specs! Paolo >> > >> > So instead of smp_read_barrier_depends() you could load ctx->first_bh >> > and bh->next with the consume memory order, but you do need _something_. > OK, if the above situation is possible, then I think I understand the need for > this barrier.
diff --git a/async.c b/async.c index 3bca9b0..f4f8b17 100644 --- a/async.c +++ b/async.c @@ -76,9 +76,8 @@ int aio_bh_poll(AioContext *ctx) ctx->walking_bh++; ret = 0; - for (bh = ctx->first_bh; bh; bh = next) { - /* Make sure that fetching bh happens before accessing its members */ - smp_read_barrier_depends(); + for (bh = atomic_read(&ctx->first_bh); bh; bh = next) { + /* store bh->next since bh can be freed in aio_bh_call() */ next = bh->next; /* The atomic_xchg is paired with the one in qemu_bh_schedule. The * implicit memory barrier ensures that the callback sees all writes