[02/20] drm/i915/gt: Couple up old virtual breadcrumb on new sibling

Message ID 20200511075722.13483-2-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Series [01/20] drm/i915/gt: Mark up the racy read of execlists->context_tag

Commit Message

Chris Wilson May 11, 2020, 7:57 a.m. UTC
The second try at staging the transfer of the breadcrumb. In part one,
we realised we could not simply move to the second engine as we were
only holding the breadcrumb lock on the first. So in commit 6c81e21a4742
("drm/i915/gt: Stage the transfer of the virtual breadcrumb"), we
removed it from the first engine and marked up this request to reattach
the signaling on the new engine. However, this failed to take into
account that we only attach the breadcrumb if the new request is added
at the start of the queue; if we are transferring, it is precisely
because we know there is already a request waiting to be signaled (and
hence we would not be attached). In this second try, we remove from
the first list under its
lock, take ownership of the link, and then take the second lock to
complete the transfer.

Fixes: 6c81e21a4742 ("drm/i915/gt: Stage the transfer of the virtual breadcrumb")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

Comments

Tvrtko Ursulin May 12, 2020, 8:41 a.m. UTC | #1
On 11/05/2020 08:57, Chris Wilson wrote:
> The second try at staging the transfer of the breadcrumb. In part one,
> we realised we could not simply move to the second engine as we were
> only holding the breadcrumb lock on the first. So in commit 6c81e21a4742
> ("drm/i915/gt: Stage the transfer of the virtual breadcrumb"), we
> removed it from the first engine and marked up this request to reattach
> the signaling on the new engine. However, this failed to take into
> account that we only attach the breadcrumb if the new request is added
> at the start of the queue, which if we are transferring, it is because
> we know there to be a request to be signaled (and hence we would not be
> attached). In this second try, we remove from the first list under its
> lock, take ownership of the link, and then take the second lock to
> complete the transfer.

Overall, is this just an optimisation to avoid calling
i915_request_enable_breadcrumb, I mean to avoid adding to the list
indirectly?

Regards,

Tvrtko

> Fixes: 6c81e21a4742 ("drm/i915/gt: Stage the transfer of the virtual breadcrumb")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_lrc.c | 16 ++++++++++++----
>   1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index ed45fc40f884..c5591248dafb 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -1825,13 +1825,12 @@ static void virtual_xfer_breadcrumbs(struct virtual_engine *ve,
>   				     struct i915_request *rq)
>   {
>   	struct intel_engine_cs *old = ve->siblings[0];
> +	bool xfer = false;
>   
>   	/* All unattached (rq->engine == old) must already be completed */
>   
>   	spin_lock(&old->breadcrumbs.irq_lock);
>   	if (!list_empty(&ve->context.signal_link)) {
> -		list_del_init(&ve->context.signal_link);
> -
>   		/*
>   		 * We cannot acquire the new engine->breadcrumbs.irq_lock
>   		 * (as we are holding a breadcrumbs.irq_lock already),
> @@ -1839,12 +1838,21 @@ static void virtual_xfer_breadcrumbs(struct virtual_engine *ve,
>   		 * The queued irq_work will occur when we finally drop
>   		 * the engine->active.lock after dequeue.
>   		 */
> -		set_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &rq->fence.flags);
> +		__list_del_entry(&ve->context.signal_link);
> +		xfer = true;
> +	}
> +	spin_unlock(&old->breadcrumbs.irq_lock);
> +
> +	if (xfer) {
> +		struct intel_breadcrumbs *b = &rq->engine->breadcrumbs;
> +
> +		spin_lock(&b->irq_lock);
> +		list_add_tail(&ve->context.signal_link, &b->signalers);
> +		spin_unlock(&b->irq_lock);
>   
>   		/* Also transfer the pending irq_work for the old breadcrumb. */
>   		intel_engine_signal_breadcrumbs(rq->engine);
>   	}
> -	spin_unlock(&old->breadcrumbs.irq_lock);
>   }
>   
>   #define for_each_waiter(p__, rq__) \
>
Chris Wilson May 12, 2020, 8:49 a.m. UTC | #2
Quoting Tvrtko Ursulin (2020-05-12 09:41:01)
> 
> On 11/05/2020 08:57, Chris Wilson wrote:
> > The second try at staging the transfer of the breadcrumb. In part one,
> > we realised we could not simply move to the second engine as we were
> > only holding the breadcrumb lock on the first. So in commit 6c81e21a4742
> > ("drm/i915/gt: Stage the transfer of the virtual breadcrumb"), we
> > removed it from the first engine and marked up this request to reattach
> > the signaling on the new engine. However, this failed to take into
> > account that we only attach the breadcrumb if the new request is added
> > at the start of the queue, which if we are transferring, it is because
> > we know there to be a request to be signaled (and hence we would not be
> > attached). In this second try, we remove from the first list under its
> > lock, take ownership of the link, and then take the second lock to
> > complete the transfer.
> 
> Overall just an optimisation not to call i915_request_enable_breadcrumb, 
> I mean not add to the list indirectly?

The request that we need to add already has its breadcrumb enabled. The
request is on the veng->context.signals list; it's just that the veng is
on the siblings[0] signalers list and we are no longer guaranteed to
generate an interrupt on that engine.

There's an explosion in the current code due to the lists not moving
as expected on enabling the breadcrumb on the next request, because of:

                if (pos == &ce->signals) /* catch transitions from empty list */
                        list_move_tail(&ce->signal_link, &b->signalers);
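
For reference, the attach logic being referred to has roughly the
following shape (a condensed sketch, not the real
i915_request_enable_breadcrumb: the helper name and the simplified
list_empty() form of the head-insertion test are assumptions, only the
field names come from the driver):

/*
 * Condensed sketch: the context is hooked onto the engine's signalers
 * list only when its own list of signalling requests transitions from
 * empty to non-empty, i.e. only when the new request ends up at the
 * head of ce->signals.
 */
static void attach_signal_sketch(struct intel_breadcrumbs *b,
				 struct intel_context *ce,
				 struct i915_request *rq)
{
	lockdep_assert_held(&b->irq_lock);

	if (list_empty(&ce->signals)) /* catch transitions from empty list */
		list_move_tail(&ce->signal_link, &b->signalers);

	/* requests are kept in seqno order; the sorted insert is elided */
	list_add_tail(&rq->signal_link, &ce->signals);
}

With the virtual context detached from the old engine while ce->signals
is still non-empty, this path never re-attaches it to the new engine,
which is the situation described in the commit message.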

The explosion is on a dead list, but has on a couple of occasions looked
like

<4> [373.551331] RIP: 0010:i915_request_enable_breadcrumb+0x144/0x380 [i915]
<4> [373.551341] Code: c7 c2 20 f1 42 c0 48 c7 c7 77 85 28 c0 e8 44 bc f2 ec bf 01 00 00 00 e8 5a 8e f2 ec 31 f6 bf 09 00 00 00 e8 6e 09 e3 ec 0f 0b <3b> 45 80 0f 89 5d ff ff ff 48 8b 6d 08 4c 39 e5 75 ee 49 8b 4d 38
<4> [373.551356] RSP: 0018:ffffb64d0114b9f8 EFLAGS: 00010083
<4> [373.551363] RAX: 00000000000036b2 RBX: ffffa310385096c0 RCX: 0000000000000003
<4> [373.551372] RDX: 00000000000036b2 RSI: 000000002ac5cf63 RDI: 00000000ffffffff
<4> [373.551379] RBP: dead000000000122 R08: ffffa31047075a50 R09: 00000000fffffffe
<4> [373.551385] R10: 0000000053a90a70 R11: 000000005e84b7e5 R12: ffffa3103fde38c0
<4> [373.551392] R13: ffffa3103fde3888 R14: ffffa30ff0982328 R15: ffffa30ff0982000
<4> [373.551401] FS:  00007f19f3359e40(0000) GS:ffffa3104ed00000(0000) knlGS:0000000000000000
<4> [373.551410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [373.551414] CR2: 00007f19f2aac778 CR3: 0000000232b0c004 CR4: 00000000003606e0
<4> [373.551421] Call Trace:
<4> [373.551466]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
<4> [373.551506]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
<4> [373.551515]  __dma_fence_enable_signaling+0x60/0x160
<4> [373.551558]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
<4> [373.551564]  dma_fence_add_callback+0x44/0xd0
<4> [373.551605]  __i915_sw_fence_await_dma_fence+0x6f/0xc0 [i915]
<4> [373.551665]  __i915_request_commit+0x442/0x5b0 [i915]
<4> [373.551721]  i915_gem_do_execbuffer+0x17fb/0x2eb0 [i915]

kasan/kcsan do not complain; it's just a broken list.
-Chris
Tvrtko Ursulin May 12, 2020, 10:12 a.m. UTC | #3
On 12/05/2020 09:49, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-05-12 09:41:01)
>> On 11/05/2020 08:57, Chris Wilson wrote:
>>> The second try at staging the transfer of the breadcrumb. In part one,
>>> we realised we could not simply move to the second engine as we were
>>> only holding the breadcrumb lock on the first. So in commit 6c81e21a4742
>>> ("drm/i915/gt: Stage the transfer of the virtual breadcrumb"), we
>>> removed it from the first engine and marked up this request to reattach
>>> the signaling on the new engine. However, this failed to take into
>>> account that we only attach the breadcrumb if the new request is added
>>> at the start of the queue, which if we are transferring, it is because
>>> we know there to be a request to be signaled (and hence we would not be
>>> attached). In this second try, we remove from the first list under its
>>> lock, take ownership of the link, and then take the second lock to
>>> complete the transfer.
>>
>> Overall just an optimisation not to call i915_request_enable_breadcrumb,
>> I mean not add to the list indirectly?
> 
> The request that we need to add already has its breadcrumb enabled. The
> request is on the veng->context.signals list, it's just that the veng is
> on siblings[0] signalers list and we are no longer guaranteed to
> generate an interrupt on engine.
> 
> There's an explosion in the current code due to the lists not moving
> as expected on enabling the breadcrumb on the next request (because of
>                  if (pos == &ce->signals) /* catch transitions from empty list */
>                          list_move_tail(&ce->signal_link, &b->signalers);
> 
> )
> 
> The explosion is on a dead list, but has on a couple of occasions looked
> like
> 
> <4> [373.551331] RIP: 0010:i915_request_enable_breadcrumb+0x144/0x380 [i915]
> <4> [373.551341] Code: c7 c2 20 f1 42 c0 48 c7 c7 77 85 28 c0 e8 44 bc f2 ec bf 01 00 00 00 e8 5a 8e f2 ec 31 f6 bf 09 00 00 00 e8 6e 09 e3 ec 0f 0b <3b> 45 80 0f 89 5d ff ff ff 48 8b 6d 08 4c 39 e5 75 ee 49 8b 4d 38
> <4> [373.551356] RSP: 0018:ffffb64d0114b9f8 EFLAGS: 00010083
> <4> [373.551363] RAX: 00000000000036b2 RBX: ffffa310385096c0 RCX: 0000000000000003
> <4> [373.551372] RDX: 00000000000036b2 RSI: 000000002ac5cf63 RDI: 00000000ffffffff
> <4> [373.551379] RBP: dead000000000122 R08: ffffa31047075a50 R09: 00000000fffffffe
> <4> [373.551385] R10: 0000000053a90a70 R11: 000000005e84b7e5 R12: ffffa3103fde38c0
> <4> [373.551392] R13: ffffa3103fde3888 R14: ffffa30ff0982328 R15: ffffa30ff0982000
> <4> [373.551401] FS:  00007f19f3359e40(0000) GS:ffffa3104ed00000(0000) knlGS:0000000000000000
> <4> [373.551410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> <4> [373.551414] CR2: 00007f19f2aac778 CR3: 0000000232b0c004 CR4: 00000000003606e0
> <4> [373.551421] Call Trace:
> <4> [373.551466]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> <4> [373.551506]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> <4> [373.551515]  __dma_fence_enable_signaling+0x60/0x160
> <4> [373.551558]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> <4> [373.551564]  dma_fence_add_callback+0x44/0xd0
> <4> [373.551605]  __i915_sw_fence_await_dma_fence+0x6f/0xc0 [i915]
> <4> [373.551665]  __i915_request_commit+0x442/0x5b0 [i915]
> <4> [373.551721]  i915_gem_do_execbuffer+0x17fb/0x2eb0 [i915]
> 
> kasan/kcsan do not complain; it's just a broken list.

Which list gets broken? But it does sound plausible that intel_lrc.c 
messing around with the breadcrumb lists directly could cause a problem 
due to the special handling on empty <-> non-empty ce signalers 
transitions.

Maybe virtual_xfer_breadcrumbs should be moved to intel_breadcrumbs.c; 
well, "moved" in the sense that the breadcrumb-internal logic moves 
there, with the veng part left in intel_lrc.c:

static void
virtual_xfer_breadcrumbs(struct virtual_engine *ve,
			 struct intel_engine_cs *target)
{
	intel_breadcrumbs_transfer(&ve->context,
				   ve->siblings[0],
				   target);
}

?
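
A possible body for such a helper, following the two-phase handover from
the patch (purely illustrative: intel_breadcrumbs_transfer() does not
exist in the tree, and the field names are taken from the diff above):

static void intel_breadcrumbs_transfer(struct intel_context *ce,
				       struct intel_engine_cs *old,
				       struct intel_engine_cs *new)
{
	bool xfer = false;

	/* Detach under the old engine's breadcrumb lock... */
	spin_lock(&old->breadcrumbs.irq_lock);
	if (!list_empty(&ce->signal_link)) {
		__list_del_entry(&ce->signal_link);
		xfer = true;
	}
	spin_unlock(&old->breadcrumbs.irq_lock);

	/* ...then reattach under the new lock, never holding both at once. */
	if (xfer) {
		spin_lock(&new->breadcrumbs.irq_lock);
		list_add_tail(&ce->signal_link, &new->breadcrumbs.signalers);
		spin_unlock(&new->breadcrumbs.irq_lock);

		/* Also transfer the pending irq_work, as in the patch. */
		intel_engine_signal_breadcrumbs(new);
	}
}

The two critical sections are kept disjoint for the same reason as in
the patch: the new engine's breadcrumbs.irq_lock cannot be taken while
the old one is held.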

Regards,

Tvrtko
Chris Wilson May 12, 2020, 10:28 a.m. UTC | #4
Quoting Tvrtko Ursulin (2020-05-12 11:12:23)
> 
> On 12/05/2020 09:49, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-05-12 09:41:01)
> >> On 11/05/2020 08:57, Chris Wilson wrote:
> >>> The second try at staging the transfer of the breadcrumb. In part one,
> >>> we realised we could not simply move to the second engine as we were
> >>> only holding the breadcrumb lock on the first. So in commit 6c81e21a4742
> >>> ("drm/i915/gt: Stage the transfer of the virtual breadcrumb"), we
> >>> removed it from the first engine and marked up this request to reattach
> >>> the signaling on the new engine. However, this failed to take into
> >>> account that we only attach the breadcrumb if the new request is added
> >>> at the start of the queue, which if we are transferring, it is because
> >>> we know there to be a request to be signaled (and hence we would not be
> >>> attached). In this second try, we remove from the first list under its
> >>> lock, take ownership of the link, and then take the second lock to
> >>> complete the transfer.
> >>
> >> Overall just an optimisation not to call i915_request_enable_breadcrumb,
> >> I mean not add to the list indirectly?
> > 
> > The request that we need to add already has its breadcrumb enabled. The
> > request is on the veng->context.signals list, it's just that the veng is
> > on siblings[0] signalers list and we are no longer guaranteed to
> > generate an interrupt on engine.
> > 
> > There's an explosion in the current code due to the lists not moving
> > as expected on enabling the breadcrumb on the next request (because of
> >                  if (pos == &ce->signals) /* catch transitions from empty list */
> >                          list_move_tail(&ce->signal_link, &b->signalers);
> > 
> > )
> > 
> > The explosion is on a dead list, but has on a couple of occasions looked
> > like
> > 
> > <4> [373.551331] RIP: 0010:i915_request_enable_breadcrumb+0x144/0x380 [i915]
> > <4> [373.551341] Code: c7 c2 20 f1 42 c0 48 c7 c7 77 85 28 c0 e8 44 bc f2 ec bf 01 00 00 00 e8 5a 8e f2 ec 31 f6 bf 09 00 00 00 e8 6e 09 e3 ec 0f 0b <3b> 45 80 0f 89 5d ff ff ff 48 8b 6d 08 4c 39 e5 75 ee 49 8b 4d 38
> > <4> [373.551356] RSP: 0018:ffffb64d0114b9f8 EFLAGS: 00010083
> > <4> [373.551363] RAX: 00000000000036b2 RBX: ffffa310385096c0 RCX: 0000000000000003
> > <4> [373.551372] RDX: 00000000000036b2 RSI: 000000002ac5cf63 RDI: 00000000ffffffff
> > <4> [373.551379] RBP: dead000000000122 R08: ffffa31047075a50 R09: 00000000fffffffe
> > <4> [373.551385] R10: 0000000053a90a70 R11: 000000005e84b7e5 R12: ffffa3103fde38c0
> > <4> [373.551392] R13: ffffa3103fde3888 R14: ffffa30ff0982328 R15: ffffa30ff0982000
> > <4> [373.551401] FS:  00007f19f3359e40(0000) GS:ffffa3104ed00000(0000) knlGS:0000000000000000
> > <4> [373.551410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > <4> [373.551414] CR2: 00007f19f2aac778 CR3: 0000000232b0c004 CR4: 00000000003606e0
> > <4> [373.551421] Call Trace:
> > <4> [373.551466]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> > <4> [373.551506]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> > <4> [373.551515]  __dma_fence_enable_signaling+0x60/0x160
> > <4> [373.551558]  ? dma_i915_sw_fence_wake+0x40/0x40 [i915]
> > <4> [373.551564]  dma_fence_add_callback+0x44/0xd0
> > <4> [373.551605]  __i915_sw_fence_await_dma_fence+0x6f/0xc0 [i915]
> > <4> [373.551665]  __i915_request_commit+0x442/0x5b0 [i915]
> > <4> [373.551721]  i915_gem_do_execbuffer+0x17fb/0x2eb0 [i915]
> > 
> > kasan/kcsan do not complain; it's just a broken list.
> 
> Which list gets broken?

Since we may not signal the requests immediately from the new engine
(and have decoupled them from the old), they will call
i915_request_cancel_breadcrumbs() on their stale rq->engine->breadcrumbs
which is no longer the lock owner.

Following that logic, this is not safe either; we are just better at
winning the race.
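
In sketch form, the hazard described here is (illustrative only, not
the actual i915 cancel path; the helper name is made up and
rq->signal_link stands for the request's entry on ce->signals):

static void cancel_breadcrumb_sketch(struct i915_request *rq)
{
	struct intel_breadcrumbs *b = &rq->engine->breadcrumbs;

	/* Serialise on the lock of the engine the request still points at. */
	spin_lock(&b->irq_lock);
	/*
	 * rq->signal_link sits on ce->signals, and ce->signal_link may by
	 * now have been moved under another engine's breadcrumbs by the
	 * transfer above, so this list manipulation is not actually
	 * covered by the lock being held here.
	 */
	list_del(&rq->signal_link);
	spin_unlock(&b->irq_lock);
}
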
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index ed45fc40f884..c5591248dafb 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -1825,13 +1825,12 @@  static void virtual_xfer_breadcrumbs(struct virtual_engine *ve,
 				     struct i915_request *rq)
 {
 	struct intel_engine_cs *old = ve->siblings[0];
+	bool xfer = false;
 
 	/* All unattached (rq->engine == old) must already be completed */
 
 	spin_lock(&old->breadcrumbs.irq_lock);
 	if (!list_empty(&ve->context.signal_link)) {
-		list_del_init(&ve->context.signal_link);
-
 		/*
 		 * We cannot acquire the new engine->breadcrumbs.irq_lock
 		 * (as we are holding a breadcrumbs.irq_lock already),
@@ -1839,12 +1838,21 @@  static void virtual_xfer_breadcrumbs(struct virtual_engine *ve,
 		 * The queued irq_work will occur when we finally drop
 		 * the engine->active.lock after dequeue.
 		 */
-		set_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &rq->fence.flags);
+		__list_del_entry(&ve->context.signal_link);
+		xfer = true;
+	}
+	spin_unlock(&old->breadcrumbs.irq_lock);
+
+	if (xfer) {
+		struct intel_breadcrumbs *b = &rq->engine->breadcrumbs;
+
+		spin_lock(&b->irq_lock);
+		list_add_tail(&ve->context.signal_link, &b->signalers);
+		spin_unlock(&b->irq_lock);
 
 		/* Also transfer the pending irq_work for the old breadcrumb. */
 		intel_engine_signal_breadcrumbs(rq->engine);
 	}
-	spin_unlock(&old->breadcrumbs.irq_lock);
 }
 
 #define for_each_waiter(p__, rq__) \