diff mbox

aio: reg. smp_read_barrier_depends() in aio_bh_poll()

Message ID 87shti74no.fsf@gmail.com (mailing list archive)
State New, archived

Commit Message

Pranith Kumar Sept. 2, 2016, 2:33 p.m. UTC
Hi Paolo,

This is in reference to the discussion we had yesterday on IRC. I am trying to
understand the need for smp_read_barrier_depends() and how it prevents the
following race condition. I think a regular barrier() should suffice instead
of smp_read_barrier_depends(). Consider:

           P0                        P1
     ----------------------------------------
     bh = ctx->first_bh;        
     smp_read_barrier_depends();  // barrier() should be sufficient since bh
                                  // is local variable
     next = bh->next;
                                  lock(bh_lock);
                                  new_bh->next = ctx->first_bh;
                                  smp_wmb();
                                  ctx->first_bh = new_bh;
                                  unlock(bh_lock);
                                  
     if (bh) {
        // do something
     }

Why do you think smp_read_barrier_depends() is necessary here? If bh was a
shared variable I would understand, but here bh is local and a regular
barrier() would make sure that we are not optimizing the initial load into bh.

A patch fixing this follows.

Thanks,
--
Pranith

From: Pranith Kumar <bobby.prani@gmail.com>
Date: Fri, 2 Sep 2016 10:30:23 -0400
Subject: [PATCH] aio: Remove spurious smp_read_barrier_depends()

smp_read_barrier_depends() should be used only if you are reading dependent
pointers which are shared. Here 'bh' is a local variable and dereferencing it
will always be ordered after loading 'bh', i.e., bh->next will always be
ordered after fetching bh. However, the initial load into 'bh' should not be
optimized away, hence we use atomic_read() to ensure this.

Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
---
 async.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Paolo Bonzini Sept. 2, 2016, 3:38 p.m. UTC | #1
On 02/09/2016 16:33, Pranith Kumar wrote:
> 
> Hi Paolo,
> 
> This is in reference to the discussion we had yesterday on IRC. I am trying to
> understand the need for smp_read_barrier_depends() and how it prevents the
> following race condition. I think a regular barrier() should suffice instead
> of smp_read_barrier_depends(). Consider:
> 
>            P0                        P1
>      ----------------------------------------
>      bh = ctx->first_bh;        
>      smp_read_barrier_depends();  // barrier() should be sufficient since bh
>                                   // is local variable
>      next = bh->next;
>                                   lock(bh_lock);
>                                   new_bh->next = ctx->first_bh;
>                                   smp_wmb();
>                                   ctx->first_bh = new_bh;
>                                   unlock(bh_lock);
>                                   
>      if (bh) {
>         // do something
>      }
> 
> Why do you think smp_read_barrier_depends() is necessary here? If bh was a
> shared variable I would understand, but here bh is local and a regular
> barrier() would make sure that we are not optimizing the initial load into bh.

Honestly, I don't think you understand why memory barriers exist...
They are used to synchronize writes to shared *data*, not to shared
variables.

It doesn't matter whether bh is a shared variable.  The *data that it
points to* is shared with other threads.  ctx->first_bh and bh->next are
both shared by P0 and P1.

P1 must make sure that ctx->first_bh is written after all of its context
(which in aio_bh_new includes new_bh->next) is ready.  It uses smp_wmb
for that.  A "release store" for ctx->first_bh would be okay too.  This
is easy.

P0 must make sure that bh->next is read after ctx->first_bh.  The
simplest way to ensure this is an "acquire load" for ctx->first_bh, or
to place an smp_rmb between the two loads where there is currently
smp_read_barrier_depends().  This is easy too, but a bit overkill
because bh->next is really ctx->first_bh->next and data-dependent reads
do not need full-blown acquire semantics.

However, you still need to make sure that bh->next is read from
_exactly_ the ctx->first_bh that was assigned to bh, and not for example
a value that was changed in the meanwhile by another processor.  Most
processors promise this (except the Alpha!) but compilers might reload
values if they think it's useful.  For this reason Linux and QEMU have
smp_read_barrier_depends(), and for this reason C11/C++11 introduce the
"consume" memory order.  smp_read_barrier_depends() is the same as C11's
atomic_thread_fence(memory_order_consume).  We didn't make it up.

So instead of smp_read_barrier_depends() you could load ctx->first_bh
and bh->next with the consume memory order, but you do need _something_.

Paolo
Pranith Kumar Sept. 2, 2016, 6:23 p.m. UTC | #2
Paolo Bonzini writes:

> On 02/09/2016 16:33, Pranith Kumar wrote:
>> 
>> Hi Paolo,
>> 
>> This is in reference to the discussion we had yesterday on IRC. I am trying to
>> understand the need for smp_read_barrier_depends() and how it prevents the
>> following race condition. I think a regular barrier() should suffice instead
>> of smp_read_barrier_depends(). Consider:
>> 
>>            P0                        P1
>>      ----------------------------------------
>>      bh = ctx->first_bh;        
>>      smp_read_barrier_depends();  // barrier() should be sufficient since bh
>>                                   // is local variable
>>      next = bh->next;
>>                                   lock(bh_lock);
>>                                   new_bh->next = ctx->first_bh;
>>                                   smp_wmb();
>>                                   ctx->first_bh = new_bh;
>>                                   unlock(bh_lock);
>>                                   
>>      if (bh) {
>>         // do something
>>      }
>> 
>> Why do you think smp_read_barrier_depends() is necessary here? If bh was a
>> shared variable I would understand, but here bh is local and a regular
>> barrier() would make sure that we are not optimizing the initial load into bh.
>
> Honestly, I don't think you understand why memory barriers exist...

Well, you are not entirely wrong :-). I think this is a difficult topic to
excel at. And I am trying to wrap my head around Alpha barrier semantics,
which are an entirely different ball game.

> They are used to synchronize writes to shared *data*, not to shared
> variables.

My bad. Shared memory location is what I meant too. I will be more precise
when I write.

>
> It doesn't matter whether bh is a shared variable.  The *data that it
> points to* is shared with other threads.  ctx->first_bh and bh->next are
> both shared by P0 and P1.
>
> P1 must make sure that ctx->first_bh is written after all of its context
> (which in aio_bh_new includes new_bh->next) is ready.  It uses smp_wmb
> for that.  A "release store" for ctx->first_bh would be okay too.  This
> is easy.
>
> P0 must make sure that bh->next is read after ctx->first_bh.  The
> simplest way to ensure this is an "acquire load" for ctx->first_bh, or
> to place an smp_rmb between the two loads where there is currently
> smp_read_barrier_depends().  This is easy too, but a bit overkill
> because bh->next is really ctx->first_bh->next and data-dependent reads
> do not need full-blown acquire semantics.
>
> However, you still need to make sure that bh->next is read from
> _exactly_ the ctx->first_bh that was assigned to bh, and not for example
> a value that was changed in the meanwhile by another processor.  Most
> processors promise this (except the Alpha!) but compilers might reload
> values if they think it's useful.  For this reason Linux and QEMU have
> smp_read_barrier_depends(), and for this reason C11/C++11 introduce the
> "consume" memory order.  smp_read_barrier_depends() is the same as C11's
> atomic_thread_fence(memory_order_consume).  We didn't make it up.

If I understand you correctly, this is what might happen without the
barrier():

            P0                        P1
      ----------------------------------------
      // bh = ctx->first_bh;  optimized       
      if (ctx->first_bh) {
      // next = ctx->first_bh->next;
                                   lock(bh_lock);
                                   new_bh->next = ctx->first_bh;
                                   smp_wmb();  // this alone is not sufficient
                                               // for Alpha
                                   ctx->first_bh = new_bh;
                                   unlock(bh_lock);
      // bh = next;
      bh = ctx->first_bh->next;

      if (bh) {do something}
      }

Is this what might happen? If so, inserting a barrier() after the first load
into bh will prevent the compiler from eliminating that load, since the
compiler cannot move loads and stores across the barrier().

And on Alpha processors barrier() should really be smp_read_barrier_depends()
to prevent this from happening, because of its memory model (a barrier must be
issued after loading a pointer to shared memory and before dereferencing it).

>
> So instead of smp_read_barrier_depends() you could load ctx->first_bh
> and bh->next with the consume memory order, but you do need _something_.

OK, if the above situation is possible, then I think I understand the need for
this barrier.
Paolo Bonzini Sept. 5, 2016, 11:31 a.m. UTC | #3
On 02/09/2016 20:23, Pranith Kumar wrote:
> If I understand you correctly, this is what might happen without the
> barrier():
> 
>             P0                        P1
>       ----------------------------------------
>       // bh = ctx->first_bh;  optimized       
>       if (ctx->first_bh) {
>       // next = ctx->first_bh->next;
>                                    lock(bh_lock);
>                                    new_bh->next = ctx->first_bh;
>                                    smp_wmb();  // this alone is not sufficient
>                                                // for Alpha
>                                    ctx->first_bh = new_bh;
>                                    unlock(bh_lock);
>       // bh = next;
>       bh = ctx->first_bh->next;
> 
>       if (bh) {do something}
>       }
> 
> Is this what might happen? If so, inserting a barrier() after the first load
> into bh will prevent the compiler from optimizing the load into bh since the
> compiler cannot optimize away loads and stores past the barrier().

Yes, this is what you can expect from a compiler.

> And on Alpha processors barrier() should really be smp_read_barrier_depends()
> to prevent this from happening, because of its memory model (a barrier must be
> issued after loading a pointer to shared memory and before dereferencing it).

Yes; the actual effect you could see on the Alpha is even crazier.
Without the barrier, "bh = ctx->first_bh" and "next = bh->next" can
be reordered.  If you take both threads into account, "next = bh->next"
can use a value from before P1's "ctx->first_bh = new_bh".  Indeed, it can
use a value from before P1's smp_wmb() or before "new_bh->next =
ctx->first_bh".  For example, "next" could load a NULL value.

FWIW, RISC-V currently has the same memory model as the Alpha, but they
plan to fix it before they finalize the relevant specs!

Paolo

>> >
>> > So instead of smp_read_barrier_depends() you could load ctx->first_bh
>> > and bh->next with the consume memory order, but you do need _something_.
> OK, if the above situation is possible, then I think I understand the need for
> this barrier.

Patch

diff --git a/async.c b/async.c
index 3bca9b0..f4f8b17 100644
--- a/async.c
+++ b/async.c
@@ -76,9 +76,8 @@  int aio_bh_poll(AioContext *ctx)
     ctx->walking_bh++;
 
     ret = 0;
-    for (bh = ctx->first_bh; bh; bh = next) {
-        /* Make sure that fetching bh happens before accessing its members */
-        smp_read_barrier_depends();
+    for (bh = atomic_read(&ctx->first_bh); bh; bh = next) {
+        /* store bh->next since bh can be freed in aio_bh_call() */
         next = bh->next;
         /* The atomic_xchg is paired with the one in qemu_bh_schedule.  The
          * implicit memory barrier ensures that the callback sees all writes