Message ID | 20200511075722.13483-2-chris@chris-wilson.co.uk (mailing list archive)
State      | New, archived
Series     | [01/20] drm/i915/gt: Mark up the racy read of execlists->context_tag
On 11/05/2020 08:57, Chris Wilson wrote:
> The second try at staging the transfer of the breadcrumb. In part one,
> we realised we could not simply move to the second engine as we were
> only holding the breadcrumb lock on the first. So in commit 6c81e21a4742
> ("drm/i915/gt: Stage the transfer of the virtual breadcrumb"), we
> removed it from the first engine and marked up this request to reattach
> the signaling on the new engine. However, this failed to take into
> account that we only attach the breadcrumb if the new request is added
> at the start of the queue, which if we are transferring, it is because
> we know there to be a request to be signaled (and hence we would not be
> attached). In this second try, we remove from the first list under its
> lock, take ownership of the link, and then take the second lock to
> complete the transfer.

Overall just an optimisation not to call i915_request_enable_breadcrumb,
I mean not add to the list indirectly?

Regards,

Tvrtko

> Fixes: 6c81e21a4742 ("drm/i915/gt: Stage the transfer of the virtual breadcrumb")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_lrc.c | 16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index ed45fc40f884..c5591248dafb 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -1825,13 +1825,12 @@ static void virtual_xfer_breadcrumbs(struct virtual_engine *ve,
>  				     struct i915_request *rq)
>  {
>  	struct intel_engine_cs *old = ve->siblings[0];
> +	bool xfer = false;
>
>  	/* All unattached (rq->engine == old) must already be completed */
>
>  	spin_lock(&old->breadcrumbs.irq_lock);
>  	if (!list_empty(&ve->context.signal_link)) {
> -		list_del_init(&ve->context.signal_link);
> -
>  		/*
>  		 * We cannot acquire the new engine->breadcrumbs.irq_lock
>  		 * (as we are holding a breadcrumbs.irq_lock already),
> @@ -1839,12 +1838,21 @@ static void virtual_xfer_breadcrumbs(struct virtual_engine *ve,
>  		 * The queued irq_work will occur when we finally drop
>  		 * the engine->active.lock after dequeue.
>  		 */
> -		set_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &rq->fence.flags);
> +		__list_del_entry(&ve->context.signal_link);
> +		xfer = true;
> +	}
> +	spin_unlock(&old->breadcrumbs.irq_lock);
> +
> +	if (xfer) {
> +		struct intel_breadcrumbs *b = &rq->engine->breadcrumbs;
> +
> +		spin_lock(&b->irq_lock);
> +		list_add_tail(&ve->context.signal_link, &b->signalers);
> +		spin_unlock(&b->irq_lock);
>
>  		/* Also transfer the pending irq_work for the old breadcrumb. */
>  		intel_engine_signal_breadcrumbs(rq->engine);
>  	}
> -	spin_unlock(&old->breadcrumbs.irq_lock);
>  }
>
>  #define for_each_waiter(p__, rq__) \
Quoting Tvrtko Ursulin (2020-05-12 09:41:01)
>
> On 11/05/2020 08:57, Chris Wilson wrote:
> > The second try at staging the transfer of the breadcrumb. In part one,
> > we realised we could not simply move to the second engine as we were
> > only holding the breadcrumb lock on the first. So in commit 6c81e21a4742
> > ("drm/i915/gt: Stage the transfer of the virtual breadcrumb"), we
> > removed it from the first engine and marked up this request to reattach
> > the signaling on the new engine. However, this failed to take into
> > account that we only attach the breadcrumb if the new request is added
> > at the start of the queue, which if we are transferring, it is because
> > we know there to be a request to be signaled (and hence we would not be
> > attached). In this second try, we remove from the first list under its
> > lock, take ownership of the link, and then take the second lock to
> > complete the transfer.
>
> Overall just an optimisation not to call i915_request_enable_breadcrumb,
> I mean not add to the list indirectly?

The request that we need to add already has its breadcrumb enabled. The
request is on the veng->context.signals list, it's just that the veng is
on the siblings[0] signalers list and we are no longer guaranteed to
generate an interrupt on the engine.

There's an explosion in the current code due to the lists not moving
as expected on enabling the breadcrumb on the next request (because of

	if (pos == &ce->signals) /* catch transitions from empty list */
		list_move_tail(&ce->signal_link, &b->signalers);

)

The explosion is on a dead list, but has on a couple of occasions looked
like

<4> [373.551331] RIP: 0010:i915_request_enable_breadcrumb+0x144/0x380 [i915]
<4> [373.551341] Code: c7 c2 20 f1 42 c0 48 c7 c7 77 85 28 c0 e8 44 bc f2 ec bf 01 00 00 00 e8 5a 8e f2 ec 31 f6 bf 09 00 00 00 e8 6e 09 e3 ec 0f 0b <3b> 45 80 0f 89 5d ff ff ff 48 8b 6d 08 4c 39 e5 75 ee 49 8b 4d 38
<4> [373.551356] RSP: 0018:ffffb64d0114b9f8 EFLAGS: 00010083
<4> [373.551363] RAX: 00000000000036b2 RBX: ffffa310385096c0 RCX: 0000000000000003
<4> [373.551372] RDX: 00000000000036b2 RSI: 000000002ac5cf63 RDI: 00000000ffffffff
<4> [373.551379] RBP: dead000000000122 R08: ffffa31047075a50 R09: 00000000fffffffe
<4> [373.551385] R10: 0000000053a90a70 R11: 000000005e84b7e5 R12: ffffa3103fde38c0
<4> [373.551392] R13: ffffa3103fde3888 R14: ffffa30ff0982328 R15: ffffa30ff0982000
<4> [373.551401] FS:  00007f19f3359e40(0000) GS:ffffa3104ed00000(0000) knlGS:0000000000000000
<4> [373.551410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [373.551414] CR2: 00007f19f2aac778 CR3: 0000000232b0c004 CR4: 00000000003606e0
<4> [373.551421] Call Trace:
<4> [373.551466]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
<4> [373.551506]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
<4> [373.551515]  __dma_fence_enable_signaling+0x60/0x160
<4> [373.551558]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
<4> [373.551564]  dma_fence_add_callback+0x44/0xd0
<4> [373.551605]  __i915_sw_fence_await_dma_fence+0x6f/0xc0 [i915]
<4> [373.551665]  __i915_request_commit+0x442/0x5b0 [i915]
<4> [373.551721]  i915_gem_do_execbuffer+0x17fb/0x2eb0 [i915]

kasan/kcsan do not complain; it's just a broken list.
-Chris
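For reference, RBP in the oops above holds dead000000000122, which matches the kernel's LIST_POISON2 (assuming the default CONFIG_ILLEGAL_POINTER_VALUE on x86-64), the value list_del() writes into a removed node's ->prev. A minimal sketch of that generic failure mode, using a made-up list node rather than the driver's structures:

    #include <linux/list.h>

    static LIST_HEAD(signalers);

    struct ctx {
            struct list_head signal_link;
    };

    static void broken_transfer(struct ctx *ce)
    {
            /* list_del() poisons the node: ->next = LIST_POISON1, ->prev = LIST_POISON2 */
            list_del(&ce->signal_link);

            /*
             * Any later list operation that assumes the node is still
             * linked, such as the list_move_tail() in the transition
             * check quoted above, unlinks it again and so dereferences
             * the poison values: this is where the oops fires.
             */
            list_move_tail(&ce->signal_link, &signalers);
    }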
On 12/05/2020 09:49, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-05-12 09:41:01)
>> On 11/05/2020 08:57, Chris Wilson wrote:
>>> The second try at staging the transfer of the breadcrumb. In part one,
>>> we realised we could not simply move to the second engine as we were
>>> only holding the breadcrumb lock on the first. So in commit 6c81e21a4742
>>> ("drm/i915/gt: Stage the transfer of the virtual breadcrumb"), we
>>> removed it from the first engine and marked up this request to reattach
>>> the signaling on the new engine. However, this failed to take into
>>> account that we only attach the breadcrumb if the new request is added
>>> at the start of the queue, which if we are transferring, it is because
>>> we know there to be a request to be signaled (and hence we would not be
>>> attached). In this second try, we remove from the first list under its
>>> lock, take ownership of the link, and then take the second lock to
>>> complete the transfer.
>>
>> Overall just an optimisation not to call i915_request_enable_breadcrumb,
>> I mean not add to the list indirectly?
>
> The request that we need to add already has its breadcrumb enabled. The
> request is on the veng->context.signals list, it's just that the veng is
> on the siblings[0] signalers list and we are no longer guaranteed to
> generate an interrupt on the engine.
>
> There's an explosion in the current code due to the lists not moving
> as expected on enabling the breadcrumb on the next request (because of
>
> 	if (pos == &ce->signals) /* catch transitions from empty list */
> 		list_move_tail(&ce->signal_link, &b->signalers);
>
> )
>
> The explosion is on a dead list, but has on a couple of occasions looked
> like
>
> <4> [373.551331] RIP: 0010:i915_request_enable_breadcrumb+0x144/0x380 [i915]
> <4> [373.551341] Code: c7 c2 20 f1 42 c0 48 c7 c7 77 85 28 c0 e8 44 bc f2 ec bf 01 00 00 00 e8 5a 8e f2 ec 31 f6 bf 09 00 00 00 e8 6e 09 e3 ec 0f 0b <3b> 45 80 0f 89 5d ff ff ff 48 8b 6d 08 4c 39 e5 75 ee 49 8b 4d 38
> <4> [373.551356] RSP: 0018:ffffb64d0114b9f8 EFLAGS: 00010083
> <4> [373.551363] RAX: 00000000000036b2 RBX: ffffa310385096c0 RCX: 0000000000000003
> <4> [373.551372] RDX: 00000000000036b2 RSI: 000000002ac5cf63 RDI: 00000000ffffffff
> <4> [373.551379] RBP: dead000000000122 R08: ffffa31047075a50 R09: 00000000fffffffe
> <4> [373.551385] R10: 0000000053a90a70 R11: 000000005e84b7e5 R12: ffffa3103fde38c0
> <4> [373.551392] R13: ffffa3103fde3888 R14: ffffa30ff0982328 R15: ffffa30ff0982000
> <4> [373.551401] FS:  00007f19f3359e40(0000) GS:ffffa3104ed00000(0000) knlGS:0000000000000000
> <4> [373.551410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4> [373.551414] CR2: 00007f19f2aac778 CR3: 0000000232b0c004 CR4: 00000000003606e0
> <4> [373.551421] Call Trace:
> <4> [373.551466]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> <4> [373.551506]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> <4> [373.551515]  __dma_fence_enable_signaling+0x60/0x160
> <4> [373.551558]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> <4> [373.551564]  dma_fence_add_callback+0x44/0xd0
> <4> [373.551605]  __i915_sw_fence_await_dma_fence+0x6f/0xc0 [i915]
> <4> [373.551665]  __i915_request_commit+0x442/0x5b0 [i915]
> <4> [373.551721]  i915_gem_do_execbuffer+0x17fb/0x2eb0 [i915]
>
> kasan/kcsan do not complain; it's just a broken list.

Which list gets broken?

But it does sound plausible that intel_lrc.c messing around with the
breadcrumb lists directly could cause a problem due to the special
handling of empty <-> non-empty ce signalers transitions.
Maybe virtual_xfer_breadcrumbs should be moved to intel_breadcrumbs.c;
well, not moved exactly: the breadcrumb-internal logic moved, with the
veng part left in intel_lrc.c:

	static void virtual_xfer_breadcrumbs(struct virtual_engine *ve,
					     struct intel_engine_cs *target)
	{
		intel_breadcrumbs_transfer(&ve->context, ve->siblings[0], target);
	}

?

Regards,

Tvrtko
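For illustration, such a helper could simply host the staged locking from this patch; a rough sketch, noting that intel_breadcrumbs_transfer is the hypothetical name from the suggestion above, not an existing function, and the signature is an assumption:

    /* Sketch only: imagines the patch's staged transfer living in
     * intel_breadcrumbs.c behind Tvrtko's proposed helper.
     */
    void intel_breadcrumbs_transfer(struct intel_context *ce,
                                    struct intel_engine_cs *old,
                                    struct intel_engine_cs *target)
    {
            bool xfer = false;

            /* Detach from the old engine's signalers under its lock. */
            spin_lock(&old->breadcrumbs.irq_lock);
            if (!list_empty(&ce->signal_link)) {
                    __list_del_entry(&ce->signal_link); /* we now own the link */
                    xfer = true;
            }
            spin_unlock(&old->breadcrumbs.irq_lock);

            /* Reattach under the target engine's lock, then kick its irq_work. */
            if (xfer) {
                    struct intel_breadcrumbs *b = &target->breadcrumbs;

                    spin_lock(&b->irq_lock);
                    list_add_tail(&ce->signal_link, &b->signalers);
                    spin_unlock(&b->irq_lock);

                    intel_engine_signal_breadcrumbs(target);
            }
    }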
Quoting Tvrtko Ursulin (2020-05-12 11:12:23)
>
> On 12/05/2020 09:49, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-05-12 09:41:01)
> >> On 11/05/2020 08:57, Chris Wilson wrote:
> >>> The second try at staging the transfer of the breadcrumb. In part one,
> >>> we realised we could not simply move to the second engine as we were
> >>> only holding the breadcrumb lock on the first. So in commit 6c81e21a4742
> >>> ("drm/i915/gt: Stage the transfer of the virtual breadcrumb"), we
> >>> removed it from the first engine and marked up this request to reattach
> >>> the signaling on the new engine. However, this failed to take into
> >>> account that we only attach the breadcrumb if the new request is added
> >>> at the start of the queue, which if we are transferring, it is because
> >>> we know there to be a request to be signaled (and hence we would not be
> >>> attached). In this second try, we remove from the first list under its
> >>> lock, take ownership of the link, and then take the second lock to
> >>> complete the transfer.
> >>
> >> Overall just an optimisation not to call i915_request_enable_breadcrumb,
> >> I mean not add to the list indirectly?
> >
> > The request that we need to add already has its breadcrumb enabled. The
> > request is on the veng->context.signals list, it's just that the veng is
> > on the siblings[0] signalers list and we are no longer guaranteed to
> > generate an interrupt on the engine.
> >
> > There's an explosion in the current code due to the lists not moving
> > as expected on enabling the breadcrumb on the next request (because of
> >
> > 	if (pos == &ce->signals) /* catch transitions from empty list */
> > 		list_move_tail(&ce->signal_link, &b->signalers);
> >
> > )
> >
> > The explosion is on a dead list, but has on a couple of occasions looked
> > like
> >
> > <4> [373.551331] RIP: 0010:i915_request_enable_breadcrumb+0x144/0x380 [i915]
> > <4> [373.551341] Code: c7 c2 20 f1 42 c0 48 c7 c7 77 85 28 c0 e8 44 bc f2 ec bf 01 00 00 00 e8 5a 8e f2 ec 31 f6 bf 09 00 00 00 e8 6e 09 e3 ec 0f 0b <3b> 45 80 0f 89 5d ff ff ff 48 8b 6d 08 4c 39 e5 75 ee 49 8b 4d 38
> > <4> [373.551356] RSP: 0018:ffffb64d0114b9f8 EFLAGS: 00010083
> > <4> [373.551363] RAX: 00000000000036b2 RBX: ffffa310385096c0 RCX: 0000000000000003
> > <4> [373.551372] RDX: 00000000000036b2 RSI: 000000002ac5cf63 RDI: 00000000ffffffff
> > <4> [373.551379] RBP: dead000000000122 R08: ffffa31047075a50 R09: 00000000fffffffe
> > <4> [373.551385] R10: 0000000053a90a70 R11: 000000005e84b7e5 R12: ffffa3103fde38c0
> > <4> [373.551392] R13: ffffa3103fde3888 R14: ffffa30ff0982328 R15: ffffa30ff0982000
> > <4> [373.551401] FS:  00007f19f3359e40(0000) GS:ffffa3104ed00000(0000) knlGS:0000000000000000
> > <4> [373.551410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4> [373.551414] CR2: 00007f19f2aac778 CR3: 0000000232b0c004 CR4: 00000000003606e0
> > <4> [373.551421] Call Trace:
> > <4> [373.551466]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> > <4> [373.551506]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> > <4> [373.551515]  __dma_fence_enable_signaling+0x60/0x160
> > <4> [373.551558]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> > <4> [373.551564]  dma_fence_add_callback+0x44/0xd0
> > <4> [373.551605]  __i915_sw_fence_await_dma_fence+0x6f/0xc0 [i915]
> > <4> [373.551665]  __i915_request_commit+0x442/0x5b0 [i915]
> > <4> [373.551721]  i915_gem_do_execbuffer+0x17fb/0x2eb0 [i915]
> >
> > kasan/kcsan do not complain; it's just a broken list.
>
> Which list gets broken?
Since we may not signal the requests immediately from the new engine
(and have decoupled them from the old), they will call
i915_request_cancel_breadcrumbs() on their stale rq->engine->breadcrumbs,
which is no longer the lock owner. Following that logic, this is not safe
either; we are just better at winning the race.
-Chris
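The hazard conceded here, sketched in simplified form (the field and function names are abbreviations of the i915 code of this era, not exact):

    static void cancel_breadcrumb(struct i915_request *rq)
    {
            /*
             * The canceller derives the lock from rq->engine, but the
             * transfer above may have moved the signal list under a
             * different engine's irq_lock, so this lock no longer
             * protects the list being edited.
             */
            struct intel_breadcrumbs *b = &rq->engine->breadcrumbs; /* may be stale */

            spin_lock(&b->irq_lock);
            list_del(&rq->signal_link); /* racy if the list changed engines */
            spin_unlock(&b->irq_lock);
    }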
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index ed45fc40f884..c5591248dafb 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1825,13 +1825,12 @@ static void virtual_xfer_breadcrumbs(struct virtual_engine *ve,
 				     struct i915_request *rq)
 {
 	struct intel_engine_cs *old = ve->siblings[0];
+	bool xfer = false;
 
 	/* All unattached (rq->engine == old) must already be completed */
 
 	spin_lock(&old->breadcrumbs.irq_lock);
 	if (!list_empty(&ve->context.signal_link)) {
-		list_del_init(&ve->context.signal_link);
-
 		/*
 		 * We cannot acquire the new engine->breadcrumbs.irq_lock
 		 * (as we are holding a breadcrumbs.irq_lock already),
@@ -1839,12 +1838,21 @@ static void virtual_xfer_breadcrumbs(struct virtual_engine *ve,
 		 * The queued irq_work will occur when we finally drop
 		 * the engine->active.lock after dequeue.
 		 */
-		set_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &rq->fence.flags);
+		__list_del_entry(&ve->context.signal_link);
+		xfer = true;
+	}
+	spin_unlock(&old->breadcrumbs.irq_lock);
+
+	if (xfer) {
+		struct intel_breadcrumbs *b = &rq->engine->breadcrumbs;
+
+		spin_lock(&b->irq_lock);
+		list_add_tail(&ve->context.signal_link, &b->signalers);
+		spin_unlock(&b->irq_lock);
 
 		/* Also transfer the pending irq_work for the old breadcrumb. */
 		intel_engine_signal_breadcrumbs(rq->engine);
 	}
-	spin_unlock(&old->breadcrumbs.irq_lock);
 }
 
 #define for_each_waiter(p__, rq__) \
The second try at staging the transfer of the breadcrumb. In part one,
we realised we could not simply move to the second engine as we were
only holding the breadcrumb lock on the first. So in commit 6c81e21a4742
("drm/i915/gt: Stage the transfer of the virtual breadcrumb"), we
removed it from the first engine and marked up this request to reattach
the signaling on the new engine. However, this failed to take into
account that we only attach the breadcrumb if the new request is added
at the start of the queue, which if we are transferring, it is because
we know there to be a request to be signaled (and hence we would not be
attached). In this second try, we remove from the first list under its
lock, take ownership of the link, and then take the second lock to
complete the transfer.

Fixes: 6c81e21a4742 ("drm/i915/gt: Stage the transfer of the virtual breadcrumb")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)
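The staging is forced by the in-code comment the patch retains: the new engine's breadcrumbs.irq_lock cannot be taken while the old one is held. A sketch of why nesting would be unsafe (illustrative reasoning, not driver code; since the sibling locks share a lock class, lockdep would also object to nesting them):

    /*
     * Sibling engines' breadcrumb irq_locks have no global ordering, so
     * two concurrent transfers in opposite directions could deadlock:
     *
     *   CPU0: transfer old -> new        CPU1: transfer new -> old
     *   spin_lock(&old->b.irq_lock);     spin_lock(&new->b.irq_lock);
     *   spin_lock(&new->b.irq_lock);     spin_lock(&old->b.irq_lock);
     *          ^ waits on CPU1                  ^ waits on CPU0
     *
     * Dropping the first lock before taking the second, while owning the
     * unlinked signal_link, avoids both the deadlock and the lockdep splat.
     */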