Message ID | 20191009080220.GA2905@hc (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [Bug,1805256] Re: [Qemu-devel] qemu_futex_wait() lockups in ARM64: 2 possible issues | expand |
On 09/10/19 10:02, Jan Glauber wrote: > On Mon, Oct 07, 2019 at 04:58:30PM +0200, Paolo Bonzini wrote: >> On 07/10/19 16:44, dann frazier wrote: >>> On Mon, Oct 07, 2019 at 01:06:20PM +0200, Paolo Bonzini wrote: >>>> On 02/10/19 11:23, Jan Glauber wrote: >>>>> I've looked into this on ThunderX2. The arm64 code generated for the >>>>> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any >>>>> memory barriers. It is just plain ldaxr/stlxr. >>>>> >>>>> From my understanding this is not sufficient for SMP sync. >>>>> >>>>> If I read this comment correct: >>>>> >>>>> void aio_notify(AioContext *ctx) >>>>> { >>>>> /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs >>>>> * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll. >>>>> */ >>>>> smp_mb(); >>>>> if (ctx->notify_me) { >>>>> >>>>> it points out that the smp_mb() should be paired. But as >>>>> I said the used atomics don't generate any barriers at all. >>>> >>>> Based on the rest of the thread, this patch should also fix the bug: >>>> >>>> diff --git a/util/async.c b/util/async.c >>>> index 47dcbfa..721ea53 100644 >>>> --- a/util/async.c >>>> +++ b/util/async.c >>>> @@ -249,7 +249,7 @@ aio_ctx_check(GSource *source) >>>> aio_notify_accept(ctx); >>>> >>>> for (bh = ctx->first_bh; bh; bh = bh->next) { >>>> - if (bh->scheduled) { >>>> + if (atomic_mb_read(&bh->scheduled)) { >>>> return true; >>>> } >>>> } >>>> >>>> >>>> And also the memory barrier in aio_notify can actually be replaced >>>> with a SEQ_CST load: >>>> >>>> diff --git a/util/async.c b/util/async.c >>>> index 47dcbfa..721ea53 100644 >>>> --- a/util/async.c >>>> +++ b/util/async.c >>>> @@ -349,11 +349,11 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx) >>>> >>>> void aio_notify(AioContext *ctx) >>>> { >>>> - /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs >>>> - * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll. >>>> + /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before >>>> + * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or >>>> + * atomic_add in aio_poll. >>>> */ >>>> - smp_mb(); >>>> - if (ctx->notify_me) { >>>> + if (atomic_mb_read(&ctx->notify_me)) { >>>> event_notifier_set(&ctx->notifier); >>>> atomic_mb_set(&ctx->notified, true); >>>> } >>>> >>>> >>>> Would you be able to test these (one by one possibly)? >>> >>> Paolo, >>> I tried them both separately and together on a Hi1620 system, each >>> time it hung in the first iteration. Here's a backtrace of a run with >>> both patches applied: >> >> Ok, not a great start... I'll find myself an aarch64 machine and look >> at it more closely. I'd like the patch to be something we can >> understand and document, since this is probably the second most-used >> memory barrier idiom in QEMU. >> >> Paolo > > I'm still not sure what the actual issue is here, but could it be some bad > interaction between the notify_me and the list_lock? The are both 4 byte > and side-by-side: > > address notify_me: 0xaaaadb528aa0 sizeof notify_me: 4 > address list_lock: 0xaaaadb528aa4 sizeof list_lock: 4 > > AFAICS the generated code looks OK (all load/store exclusive done > with 32 bit size): > > e6c: 885ffc01 ldaxr w1, [x0] > e70: 11000821 add w1, w1, #0x2 > e74: 8802fc01 stlxr w2, w1, [x0] > > ...but if I bump notify_me size to uint64_t the issue goes away. Ouch. :) Is this with or without my patch(es)? Also, what if you just add a dummy uint32_t after notify_me? Thanks, Paolo > > BTW, the image file I convert in the testcase is ~20 GB. > > --Jan > > diff --git a/include/block/aio.h b/include/block/aio.h > index a1d6b9e24939..e8a5ea3860bb 100644 > --- a/include/block/aio.h > +++ b/include/block/aio.h > @@ -83,7 +83,7 @@ struct AioContext { > * Instead, the aio_poll calls include both the prepare and the > * dispatch phase, hence a simple counter is enough for them. > */ > - uint32_t notify_me; > + uint64_t notify_me; > > /* A lock to protect between QEMUBH and AioHandler adders and deleter, > * and to ensure that no callbacks are removed while we're walking and >
On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote: > On 09/10/19 10:02, Jan Glauber wrote: > > I'm still not sure what the actual issue is here, but could it be some bad > > interaction between the notify_me and the list_lock? The are both 4 byte > > and side-by-side: > > > > address notify_me: 0xaaaadb528aa0 sizeof notify_me: 4 > > address list_lock: 0xaaaadb528aa4 sizeof list_lock: 4 > > > > AFAICS the generated code looks OK (all load/store exclusive done > > with 32 bit size): > > > > e6c: 885ffc01 ldaxr w1, [x0] > > e70: 11000821 add w1, w1, #0x2 > > e74: 8802fc01 stlxr w2, w1, [x0] > > > > ...but if I bump notify_me size to uint64_t the issue goes away. > > Ouch. :) Is this with or without my patch(es)? > > Also, what if you just add a dummy uint32_t after notify_me? With the dummy the testcase also runs fine for 500 iterations. Dann, can you try if this works on the Hi1620 too? --Jan
On 11/10/19 08:05, Jan Glauber wrote: > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote: >>> ...but if I bump notify_me size to uint64_t the issue goes away. >> >> Ouch. :) Is this with or without my patch(es)? You didn't answer this question. >> Also, what if you just add a dummy uint32_t after notify_me? > > With the dummy the testcase also runs fine for 500 iterations. You might be lucky and causing list_lock to be in another cache line. What if you add __attribute__((aligned(16)) to notify_me (and keep the dummy)? Paolo > Dann, can you try if this works on the Hi1620 too?
On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote: > On 11/10/19 08:05, Jan Glauber wrote: > > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote: > >>> ...but if I bump notify_me size to uint64_t the issue goes away. > >> > >> Ouch. :) Is this with or without my patch(es)? > > You didn't answer this question. Oh, sorry... I did but the mail probably didn't make it out. I have both of your changes applied (as I think they make sense). > >> Also, what if you just add a dummy uint32_t after notify_me? > > > > With the dummy the testcase also runs fine for 500 iterations. > > You might be lucky and causing list_lock to be in another cache line. > What if you add __attribute__((aligned(16)) to notify_me (and keep the > dummy)? Good point. I'll try to force both into the same cacheline. --Jan > Paolo > > > Dann, can you try if this works on the Hi1620 too?
On Fri, Oct 11, 2019 at 06:05:25AM +0000, Jan Glauber wrote: > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote: > > On 09/10/19 10:02, Jan Glauber wrote: > > > > I'm still not sure what the actual issue is here, but could it be some bad > > > interaction between the notify_me and the list_lock? The are both 4 byte > > > and side-by-side: > > > > > > address notify_me: 0xaaaadb528aa0 sizeof notify_me: 4 > > > address list_lock: 0xaaaadb528aa4 sizeof list_lock: 4 > > > > > > AFAICS the generated code looks OK (all load/store exclusive done > > > with 32 bit size): > > > > > > e6c: 885ffc01 ldaxr w1, [x0] > > > e70: 11000821 add w1, w1, #0x2 > > > e74: 8802fc01 stlxr w2, w1, [x0] > > > > > > ...but if I bump notify_me size to uint64_t the issue goes away. > > > > Ouch. :) Is this with or without my patch(es)? > > > > Also, what if you just add a dummy uint32_t after notify_me? > > With the dummy the testcase also runs fine for 500 iterations. > > Dann, can you try if this works on the Hi1620 too? On Hi1620, it hung on the first iteration. Here's the complete patch I'm running with: diff --git a/include/block/aio.h b/include/block/aio.h index 6b0d52f732..e6fd6f1a1a 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -82,7 +82,7 @@ struct AioContext { * Instead, the aio_poll calls include both the prepare and the * dispatch phase, hence a simple counter is enough for them. */ - uint32_t notify_me; + uint64_t notify_me; /* A lock to protect between QEMUBH and AioHandler adders and deleter, * and to ensure that no callbacks are removed while we're walking and diff --git a/util/async.c b/util/async.c index ca83e32c7f..024c4c567d 100644 --- a/util/async.c +++ b/util/async.c @@ -242,7 +242,7 @@ aio_ctx_check(GSource *source) aio_notify_accept(ctx); for (bh = ctx->first_bh; bh; bh = bh->next) { - if (bh->scheduled) { + if (atomic_mb_read(&bh->scheduled)) { return true; } } @@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx) void aio_notify(AioContext *ctx) { - /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs - * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll. + /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before + * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or + * atomic_add in aio_poll. */ - smp_mb(); - if (ctx->notify_me) { - event_notifier_set(&ctx->notifier); + if (atomic_mb_read(&ctx->notify_me)) { + event_notifier_set(&ctx->notifier); atomic_mb_set(&ctx->notified, true); } }
On Fri, Oct 11, 2019 at 08:30:02AM +0000, Jan Glauber wrote: > On Fri, Oct 11, 2019 at 10:18:18AM +0200, Paolo Bonzini wrote: > > On 11/10/19 08:05, Jan Glauber wrote: > > > On Wed, Oct 09, 2019 at 11:15:04AM +0200, Paolo Bonzini wrote: > > >>> ...but if I bump notify_me size to uint64_t the issue goes away. > > >> > > >> Ouch. :) Is this with or without my patch(es)? > > > > You didn't answer this question. > > Oh, sorry... I did but the mail probably didn't make it out. > I have both of your changes applied (as I think they make sense). > > > >> Also, what if you just add a dummy uint32_t after notify_me? > > > > > > With the dummy the testcase also runs fine for 500 iterations. > > > > You might be lucky and causing list_lock to be in another cache line. > > What if you add __attribute__((aligned(16)) to notify_me (and keep the > > dummy)? > > Good point. I'll try to force both into the same cacheline. On the Hi1620, this still hangs in the first iteration: diff --git a/include/block/aio.h b/include/block/aio.h index 6b0d52f732..00e56a5412 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -82,7 +82,7 @@ struct AioContext { * Instead, the aio_poll calls include both the prepare and the * dispatch phase, hence a simple counter is enough for them. */ - uint32_t notify_me; + __attribute__((aligned(16))) uint64_t notify_me; /* A lock to protect between QEMUBH and AioHandler adders and deleter, * and to ensure that no callbacks are removed while we're walking and diff --git a/util/async.c b/util/async.c index ca83e32c7f..024c4c567d 100644 --- a/util/async.c +++ b/util/async.c @@ -242,7 +242,7 @@ aio_ctx_check(GSource *source) aio_notify_accept(ctx); for (bh = ctx->first_bh; bh; bh = bh->next) { - if (bh->scheduled) { + if (atomic_mb_read(&bh->scheduled)) { return true; } } @@ -342,12 +342,12 @@ LinuxAioState *aio_get_linux_aio(AioContext *ctx) void aio_notify(AioContext *ctx) { - /* Write e.g. bh->scheduled before reading ctx->notify_me. Pairs - * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll. + /* Using atomic_mb_read ensures that e.g. bh->scheduled is written before + * ctx->notify_me is read. Pairs with atomic_or in aio_ctx_prepare or + * atomic_add in aio_poll. */ - smp_mb(); - if (ctx->notify_me) { - event_notifier_set(&ctx->notifier); + if (atomic_mb_read(&ctx->notify_me)) { + event_notifier_set(&ctx->notifier); atomic_mb_set(&ctx->notified, true); } }
diff --git a/include/block/aio.h b/include/block/aio.h index a1d6b9e24939..e8a5ea3860bb 100644 --- a/include/block/aio.h +++ b/include/block/aio.h @@ -83,7 +83,7 @@ struct AioContext { * Instead, the aio_poll calls include both the prepare and the * dispatch phase, hence a simple counter is enough for them. */ - uint32_t notify_me; + uint64_t notify_me; /* A lock to protect between QEMUBH and AioHandler adders and deleter, * and to ensure that no callbacks are removed while we're walking and