[Bug,1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues

Message ID 20191009080220.GA2905@hc (mailing list archive)
State New, archived

Commit Message

Jan Glauber Oct. 9, 2019, 8:02 a.m. UTC
On Mon, Oct 07, 2019 at 04:58:30PM +0200, Paolo Bonzini wrote:
> On 07/10/19 16:44, dann frazier wrote:
> > On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
> >> On 02/10/19 11:23, Jan Glauber wrote:
> >>> I've looked into this on ThunderX2. The arm64 code generated for the
> >>> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> >>> memory barriers. It is just plain ldaxr/stlxr.
> >>>
> >>> From my understanding this is not sufficient for SMP sync.
> >>>
> >>> If I read this comment correctly:
> >>>
> >>>     void aio_notify(AioContext *ctx)
> >>>     {
> >>>         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >>>          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >>>          */
> >>>         smp_mb();
> >>>         if (ctx->notify_me) {
> >>>
> >>> it points out that the smp_mb() should be paired. But as
> >>> I said, the atomics used don't generate any barriers at all.
> >>
> >> Based on the rest of the thread, this patch should also fix the bug:
> >>
> >> diff --git a/util/async.c b/util/async.c
> >> index 47dcbfa..721ea53 100644
> >> --- a/util/async.c
> >> +++ b/util/async.c
> >> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
> >>      aio_notify_accept(ctx);
> >>  
> >>      for (bh = ctx->first_bh; bh; bh = bh->next) {
> >> -        if (bh->scheduled) {
> >> +        if (atomic_mb_read(&bh->scheduled)) {
> >>              return true;
> >>          }
> >>      }
> >>
> >>
> >> And also the memory barrier in aio_notify can actually be replaced
> >> with a SEQ_CST load:
> >>
> >> diff --git a/util/async.c b/util/async.c
> >> index 47dcbfa..721ea53 100644
> >> --- a/util/async.c
> >> +++ b/util/async.c
> >> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
> >>  
> >>  void aio_notify(AioContext *ctx)
> >>  {
> >> -    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
> >> -     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
> >> +    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
> >> +     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
> >> +     * atomic_add in aio_poll.
> >>       */
> >> -    smp_mb();
> >> -    if (ctx->notify_me) {
> >> +    if (atomic_mb_read(&ctx->notify_me)) {
> >>          event_notifier_set(&ctx->notifier);
> >>          atomic_mb_set(&ctx->notified, true);
> >>      }
> >>
> >>
> >> Would you be able to test these (one by one possibly)?
> > 
> > Paolo,
> >   I tried them both separately and together on a Hi1620 system, each
> > time it hung in the first iteration. Here's a backtrace of a run with
> > both patches applied:
> 
> Ok, not a great start...  I'll find myself an aarch64 machine and look
> at it more closely.  I'd like the patch to be something we can
> understand and document, since this is probably the second most-used
> memory barrier idiom in QEMU.
> 
> Paolo

I'm still not sure what the actual issue is here, but could it be some bad
interaction between notify_me and list_lock? They are both 4 bytes
and sit side by side:

address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4

AFAICS the generated code looks OK (all load/store exclusives are done
with 32-bit size):

     e6c:       885ffc01        ldaxr   w1, [x0]
     e70:       11000821        add     w1, w1, #0x2
     e74:       8802fc01        stlxr   w2, w1, [x0]

...but if I bump notify_me size to uint64_t the issue goes away.

BTW, the image file I convert in the testcase is ~20 GB.

--Jan
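
For illustration, a minimal standalone C sketch of the layout concern
described above (the struct is a simplified stand-in for AioContext, not
QEMU's actual definition, and the 64-byte line size is an assumption):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the relevant part of AioContext. */
struct layout {
    uint32_t notify_me;   /* at 0xaaaadb528aa0 in the run above */
    uint32_t list_lock;   /* at 0xaaaadb528aa4: 4 bytes later   */
};

int main(void)
{
    size_t nm = offsetof(struct layout, notify_me);
    size_t ll = offsetof(struct layout, list_lock);

    /* Two fields share a cache line when their offsets fall into the
     * same 64-byte window (assuming the struct is line-aligned). */
    printf("same 64-byte line: %s\n", nm / 64 == ll / 64 ? "yes" : "no");
    return 0;
}

If the exclusive monitors on these cores track whole cache lines rather
than individual words, ldaxr/stlxr sequences on one field could in
principle be perturbed by traffic on the other.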

Comments

Paolo Bonzini Oct. 9, 2019, 9:15 a.m. UTC | #1
On 09/10/19 10:02, Jan Glauber wrote:
> On Mon, Oct 07, 2019 at 04:58:30PM +0200, Paolo Bonzini wrote:
>> On 07/10/19 16:44, dann frazier wrote:
>>> On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote:
>>>> On 02/10/19 11:23, Jan Glauber wrote:
>>>>> I've looked into this on ThunderX2. The arm64 code generated for the
>>>>> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
>>>>> memory barriers. It is just plain ldaxr/stlxr.
>>>>>
>>>>> From my understanding this is not sufficient for SMP sync.
>>>>>
> >>>>> If I read this comment correctly:
>>>>>
>>>>>     void aio_notify(AioContext *ctx)
>>>>>     {
>>>>>         /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>>>>>          * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>>>>>          */
>>>>>         smp_mb();
>>>>>         if (ctx->notify_me) {
>>>>>
> >>>>> it points out that the smp_mb() should be paired. But as
> >>>>> I said, the atomics used don't generate any barriers at all.
>>>>
>>>> Based on the rest of the thread, this patch should also fix the bug:
>>>>
>>>> diff --git a/util/async.c b/util/async.c
>>>> index 47dcbfa..721ea53 100644
>>>> --- a/util/async.c
>>>> +++ b/util/async.c
>>>> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source)
>>>>      aio_notify_accept(ctx);
>>>>  
>>>>      for (bh = ctx->first_bh; bh; bh = bh->next) {
>>>> -        if (bh->scheduled) {
>>>> +        if (atomic_mb_read(&bh->scheduled)) {
>>>>              return true;
>>>>          }
>>>>      }
>>>>
>>>>
>>>> And also the memory barrier in aio_notify can actually be replaced
>>>> with a SEQ_CST load:
>>>>
>>>> diff --git a/util/async.c b/util/async.c
>>>> index 47dcbfa..721ea53 100644
>>>> --- a/util/async.c
>>>> +++ b/util/async.c
>>>> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
>>>>  
>>>>  void aio_notify(AioContext *ctx)
>>>>  {
>>>> -    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>>>> -     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>>>> +    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
>>>> +     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
>>>> +     * atomic_add in aio_poll.
>>>>       */
>>>> -    smp_mb();
>>>> -    if (ctx->notify_me) {
>>>> +    if (atomic_mb_read(&ctx->notify_me)) {
>>>>          event_notifier_set(&ctx->notifier);
>>>>          atomic_mb_set(&ctx->notified, true);
>>>>      }
>>>>
>>>>
>>>> Would you be able to test these (one by one possibly)?
>>>
>>> Paolo,
>>>   I tried them both separately and together on a Hi1620 system, each
>>> time it hung in the first iteration. Here's a backtrace of a run with
>>> both patches applied:
>>
>> Ok, not a great start...  I'll find myself an aarch64 machine and look
>> at it more closely.  I'd like the patch to be something we can
>> understand and document, since this is probably the second most-used
>> memory barrier idiom in QEMU.
>>
>> Paolo
> 
> I'm still not sure what the actual issue is here, but could it be some bad
> interaction between notify_me and list_lock? They are both 4 bytes
> and sit side by side:
> 
> address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
> address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4
> 
> AFAICS the generated code looks OK (all load/store exclusives are done
> with 32-bit size):
> 
>      e6c:       885ffc01        ldaxr   w1, [x0]
>      e70:       11000821        add     w1, w1, #0x2
>      e74:       8802fc01        stlxr   w2, w1, [x0]
> 
> ...but if I bump notify_me size to uint64_t the issue goes away.

Ouch. :)  Is this with or without my patch(es)?

Also, what if you just add a dummy uint32_t after notify_me?

Thanks,

Paolo

> 
> BTW, the image file I convert in the testcase is ~20 GB.
> 
> --Jan
> 
> diff --git a/include/block/aio.h b/include/block/aio.h
> index a1d6b9e24939..e8a5ea3860bb 100644
> --- a/include/block/aio.h
> +++ b/include/block/aio.h
> @@ -83,7 +83,7 @@ struct AioContext {
>       * Instead, the aio_poll calls include both the prepare and the
>       * dispatch phase, hence a simple counter is enough for them.
>       */
> -    uint32_t notify_me;
> +    uint64_t notify_me;
>  
>      /* A lock to protect between QEMUBH and AioHandler adders and deleter,
>       * and to ensure that no callbacks are removed while we're walking and
>
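
For reference, Paolo's dummy-field suggestion would look roughly like
this (a sketch: the field name "dummy" is made up and the surrounding
AioContext members are omitted):

#include <stdint.h>

struct aio_context_fragment {
    uint32_t notify_me;
    uint32_t dummy;       /* padding only, never accessed */
    uint32_t list_lock;   /* was directly adjacent to notify_me */
};

Note that 4 bytes of padding only moves the fields apart within the
line; whether they land in different 64-byte cache lines still depends
on the struct's base alignment, which is what the alignment question
below probes.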
Jan Glauber Oct. 11, 2019, 6:05 a.m. UTC | #2
On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> On 09/10/19 10:02, Jan Glauber wrote:

> > I'm still not sure what the actual issue is here, but could it be some bad
> > interaction between notify_me and list_lock? They are both 4 bytes
> > and sit side by side:
> > 
> > address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
> > address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4
> > 
> > AFAICS the generated code looks OK (all load/store exclusives are done
> > with 32-bit size):
> > 
> >      e6c:       885ffc01        ldaxr   w1, [x0]
> >      e70:       11000821        add     w1, w1, #0x2
> >      e74:       8802fc01        stlxr   w2, w1, [x0]
> > 
> > ...but if I bump notify_me size to uint64_t the issue goes away.
> 
> Ouch. :)  Is this with or without my patch(es)?
> 
> Also, what if you just add a dummy uint32_t after notify_me?

With the dummy the testcase also runs fine for 500 iterations.

Dann, can you check whether this works on the Hi1620 too?

--Jan
Paolo Bonzini Oct. 11, 2019, 8:18 a.m. UTC | #3
On 11/10/19 08:05, Jan Glauber wrote:
> On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
>>> ...but if I bump notify_me size to uint64_t the issue goes away.
>>
>> Ouch. :)  Is this with or without my patch(es)?

You didn't answer this question.

>> Also, what if you just add a dummy uint32_t after notify_me?
> 
> With the dummy the testcase also runs fine for 500 iterations.

You might just be getting lucky, with the dummy pushing list_lock into
another cache line.  What if you add __attribute__((aligned(16))) to
notify_me (and keep the dummy)?

Paolo

> Dann, can you check whether this works on the Hi1620 too?
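
The effect Paolo describes can be checked at compile time; a sketch of
the probe (names hypothetical, 64-byte cache lines assumed):

#include <stddef.h>
#include <stdint.h>

struct layout_test {
    __attribute__((aligned(16))) uint64_t notify_me;
    uint32_t dummy;       /* keep the padding from the earlier test */
    uint32_t list_lock;
};

/* With aligned(16) all three fields occupy offsets 0..15, so they
 * share one 64-byte cache line whenever the struct itself starts on
 * a line boundary -- the "same cacheline" case to be forced here. */
_Static_assert(offsetof(struct layout_test, list_lock) <
               offsetof(struct layout_test, notify_me) + 16,
               "fields fit within 16 bytes");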
Jan Glauber Oct. 11, 2019, 8:30 a.m. UTC | #4
On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote:
> On 11/10/19 08:05, Jan Glauber wrote:
> > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> >>> ...but if I bump notify_me size to uint64_t the issue goes away.
> >>
> >> Ouch. :)  Is this with or without my patch(es)?
> 
> You didn't answer this question.

Oh, sorry... I did, but the mail probably didn't make it out.
I have both of your changes applied (as I think they make sense).

> >> Also, what if you just add a dummy uint32_t after notify_me?
> > 
> > With the dummy the testcase also runs fine for 500 iterations.
> 
> You might just be getting lucky, with the dummy pushing list_lock into
> another cache line.  What if you add __attribute__((aligned(16))) to
> notify_me (and keep the dummy)?

Good point. I'll try to force both into the same cacheline.

--Jan

> Paolo
> 
> > Dann, can you check whether this works on the Hi1620 too?
dann frazier Oct. 11, 2019, 5:50 p.m. UTC | #5
On Fri, Oct 11, 2019 at 06:05:25AM +0000, Jan Glauber wrote:
> On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> > On 09/10/19 10:02, Jan Glauber wrote:
> 
> > > I'm still not sure what the actual issue is here, but could it be some bad
> > > interaction between notify_me and list_lock? They are both 4 bytes
> > > and sit side by side:
> > > 
> > > address notify_me: 0xaaaadb528aa0  sizeof notify_me: 4
> > > address list_lock: 0xaaaadb528aa4  sizeof list_lock: 4
> > > 
> > > AFAICS the generated code looks OK (all load/store exclusives are done
> > > with 32-bit size):
> > > 
> > >      e6c:       885ffc01        ldaxr   w1, [x0]
> > >      e70:       11000821        add     w1, w1, #0x2
> > >      e74:       8802fc01        stlxr   w2, w1, [x0]
> > > 
> > > ...but if I bump notify_me size to uint64_t the issue goes away.
> > 
> > Ouch. :)  Is this with or without my patch(es)?
> > 
> > Also, what if you just add a dummy uint32_t after notify_me?
> 
> With the dummy the testcase also runs fine for 500 iterations.
> 
> Dann, can you check whether this works on the Hi1620 too?

On the Hi1620, it hung on the first iteration. Here's the complete patch
I'm running with:

diff --git a/include/block/aio.h b/include/block/aio.h
index 6b0d52f732..e6fd6f1a1a 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -82,7 +82,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
-    uint32_t notify_me;
+    uint64_t notify_me;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/async.c b/util/async.c
index ca83e32c7f..024c4c567d 100644
--- a/util/async.c
+++ b/util/async.c
@@ -242,7 +242,7 @@ aio_ctx_check(GSource *source)
     aio_notify_accept(ctx);
 
     for (bh = ctx->first_bh; bh; bh = bh->next) {
-        if (bh->scheduled) {
+        if (atomic_mb_read(&bh->scheduled)) {
             return true;
         }
     }
@@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 
 void aio_notify(AioContext *ctx)
 {
-    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
-     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
+    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
+     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
+     * atomic_add in aio_poll.
      */
-    smp_mb();
-    if (ctx->notify_me) {
-        event_notifier_set(&ctx->notifier);
+    if (atomic_mb_read(&ctx->notify_me)) {
+        event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
 }
dann frazier Oct. 11, 2019, 5:55 p.m. UTC | #6
On Fri, Oct 11, 2019 at 08:30:02AM +0000, Jan Glauber wrote:
> On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote:
> > On 11/10/19 08:05, Jan Glauber wrote:
> > > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote:
> > >>> ...but if I bump notify_me size to uint64_t the issue goes away.
> > >>
> > >> Ouch. :)  Is this with or without my patch(es)?
> > 
> > You didn't answer this question.
> 
> Oh, sorry... I did but the mail probably didn't make it out.
> I have both of your changes applied (as I think they make sense).
> 
> > >> Also, what if you just add a dummy uint32_t after notify_me?
> > > 
> > > With the dummy the testcase also runs fine for 500 iterations.
> > 
> > You might be lucky and causing list_lock to be in another cache line.
> > What if you add __attribute__((aligned(16)) to notify_me (and keep the
> > dummy)?
> 
> Good point. I'll try to force both into the same cacheline.

On the Hi1620, this still hangs in the first iteration:

diff --git a/include/block/aio.h b/include/block/aio.h
index 6b0d52f732..00e56a5412 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -82,7 +82,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
-    uint32_t notify_me;
+    __attribute__((aligned(16))) uint64_t notify_me;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and
diff --git a/util/async.c b/util/async.c
index ca83e32c7f..024c4c567d 100644
--- a/util/async.c
+++ b/util/async.c
@@ -242,7 +242,7 @@ aio_ctx_check(GSource *source)
     aio_notify_accept(ctx);
 
     for (bh = ctx->first_bh; bh; bh = bh->next) {
-        if (bh->scheduled) {
+        if (atomic_mb_read(&bh->scheduled)) {
             return true;
         }
     }
@@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx)
 
 void aio_notify(AioContext *ctx)
 {
-    /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
-     * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
+    /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before
+     * ctx->notify_me is read.  Pairs with atomic_or in aio_ctx_prepare or
+     * atomic_add in aio_poll.
      */
-    smp_mb();
-    if (ctx->notify_me) {
-        event_notifier_set(&ctx->notifier);
+    if (atomic_mb_read(&ctx->notify_me)) {
+        event_notifier_set(&ctx->notifier);
         atomic_mb_set(&ctx->notified, true);
     }
 }
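
In C11 terms, the two idioms being swapped in these patches look roughly
as follows (a sketch, not QEMU's actual macros: atomic_mb_read() and
smp_mb() are defined in qemu/atomic.h and expand differently per
architecture):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint32_t notify_me;

/* Original idiom: a full barrier, then a read. */
static bool poll_with_fence(void)
{
    atomic_thread_fence(memory_order_seq_cst);            /* smp_mb() */
    return atomic_load_explicit(&notify_me, memory_order_relaxed) != 0;
}

/* Patched idiom: a single sequentially consistent load. */
static bool poll_with_seq_cst_load(void)
{
    return atomic_load_explicit(&notify_me, memory_order_seq_cst) != 0;
}

The two are not interchangeable in the C11 model: the fence orders the
preceding writes (e.g. bh->scheduled) against the later load, while a
seq_cst load on its own does not, which is part of what makes this bug
subtle on a weakly ordered architecture like arm64.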

Patch

diff --git a/include/block/aio.h b/include/block/aio.h
index a1d6b9e24939..e8a5ea3860bb 100644
--- a/include/block/aio.h
+++ b/include/block/aio.h
@@ -83,7 +83,7 @@ struct AioContext {
      * Instead, the aio_poll calls include both the prepare and the
      * dispatch phase, hence a simple counter is enough for them.
      */
-    uint32_t notify_me;
+    uint64_t notify_me;
 
     /* A lock to protect between QEMUBH and AioHandler adders and deleter,
      * and to ensure that no callbacks are removed while we're walking and