diff mbox series

[08/22] drm/i915/guc: Don't enable scheduling on a banned context, guc_id invalid, not registered

Message ID 20210816135139.10060-9-matthew.brost@intel.com (mailing list archive)
State New, archived
Headers show
Series Clean up GuC CI failures, simplify locking, and kernel DOC | expand

Commit Message

Matthew Brost Aug. 16, 2021, 1:51 p.m. UTC
When unblocking a context, do not enable scheduling if the context is
banned, guc_id invalid, or not registered.

Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Daniel Vetter Aug. 17, 2021, 9:47 a.m. UTC | #1
On Mon, Aug 16, 2021 at 06:51:25AM -0700, Matthew Brost wrote:
> When unblocking a context, do not enable scheduling if the context is
> banned, guc_id invalid, or not registered.
> 
> Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Cc: <stable@vger.kernel.org>
> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index c3b7bf7319dd..353899634fa8 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1579,6 +1579,9 @@ static void guc_context_unblock(struct intel_context *ce)
>  	spin_lock_irqsave(&ce->guc_state.lock, flags);
>  
>  	if (unlikely(submission_disabled(guc) ||
> +		     intel_context_is_banned(ce) ||
> +		     context_guc_id_invalid(ce) ||
> +		     !lrc_desc_registered(guc, ce->guc_id) ||
>  		     !intel_context_is_pinned(ce) ||
>  		     context_pending_disable(ce) ||
>  		     context_blocked(ce) > 1)) {

I think this entire if condition here is screaming that our intel_context
state machinery for guc is way too complex, and on the wrong side of
incomprehensible.

Also some of these check state outside of the context, and we don't seem
to hold spinlocks for those, or anything else.

I general I have no idea which of these are defensive programming and
cannot ever happen, and which actually can happen. There's for sure way
too many races going on given that this is all context-local stuff.
-Daniel
Daniel Vetter Aug. 17, 2021, 9:57 a.m. UTC | #2
On Tue, Aug 17, 2021 at 11:47:53AM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:25AM -0700, Matthew Brost wrote:
> > When unblocking a context, do not enable scheduling if the context is
> > banned, guc_id invalid, or not registered.
> > 
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index c3b7bf7319dd..353899634fa8 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1579,6 +1579,9 @@ static void guc_context_unblock(struct intel_context *ce)
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  
> >  	if (unlikely(submission_disabled(guc) ||
> > +		     intel_context_is_banned(ce) ||
> > +		     context_guc_id_invalid(ce) ||
> > +		     !lrc_desc_registered(guc, ce->guc_id) ||
> >  		     !intel_context_is_pinned(ce) ||
> >  		     context_pending_disable(ce) ||
> >  		     context_blocked(ce) > 1)) {
> 
> I think this entire if condition here is screaming that our intel_context
> state machinery for guc is way too complex, and on the wrong side of
> incomprehensible.
> 
> Also some of these check state outside of the context, and we don't seem
> to hold spinlocks for those, or anything else.
> 
> I general I have no idea which of these are defensive programming and
> cannot ever happen, and which actually can happen. There's for sure way
> too many races going on given that this is all context-local stuff.

Races here meaining that we seem to be dropping locks while the context is
in an inconsistent state, which then means that every other code path
touching contexts needs to check whether the context is in an inconsistent
state.

This is a bit an example of protecting code, vs protecting datastructures.
Protecting code is having state bits of intermediate/transitional state
leak outside of the locked section (like context_blocked), so that every
other piece of code must be aware about the transition and not screw
things up for worse when they race.

This means your review and validation effort scales O(N^2) with the amount
of code and features you have. Which doesn't work.

Datastructure or object oriented locking design goes different:

1. You figure out what the invariants of your datastructure are. That
means what should hold after each state transition is finished. I have no
idea what is the solution for all them here, but e.g. why is
context_blocked even visible to other threads? Usual approach is a) take
lock b) do whatever is necessary (we're talking about reset stuff here, so
performance really doesn't matter) c) unlock. I know that i915-gem is full
of these leaky counting things, but that's really not a good design.

2. Next up, for every piece of state you think how it's protected with a
per-object lock. The fewer locks you have (but still per-objects so it's
not becoming a mess for different reasons) the higher chances that you
don't leak inconsistent state to other threads. This is a bit tricky when
multipled objects are involved, or if you have to split your locks for a
single object because some of it needs to be accessed from irq context
(like a tasklet).

3. Document your rules in kerneldoc, so that when new code gets added you
don't have to review everything for consistency against the rules. This
way you get overall O(N) effort for validation and review, because all you
have to do is check every function that changes state against the overall
contract, and not everything against everything else.

If you have a pile of if checks every time you grab a lock, your locking
design has too much state that leaks outside of the locked sections.
-Daniel
Matthew Brost Aug. 17, 2021, 4:44 p.m. UTC | #3
On Tue, Aug 17, 2021 at 11:47:53AM +0200, Daniel Vetter wrote:
> On Mon, Aug 16, 2021 at 06:51:25AM -0700, Matthew Brost wrote:
> > When unblocking a context, do not enable scheduling if the context is
> > banned, guc_id invalid, or not registered.
> > 
> > Fixes: 62eaf0ae217d ("drm/i915/guc: Support request cancellation")
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > index c3b7bf7319dd..353899634fa8 100644
> > --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> > @@ -1579,6 +1579,9 @@ static void guc_context_unblock(struct intel_context *ce)
> >  	spin_lock_irqsave(&ce->guc_state.lock, flags);
> >  
> >  	if (unlikely(submission_disabled(guc) ||
> > +		     intel_context_is_banned(ce) ||
> > +		     context_guc_id_invalid(ce) ||
> > +		     !lrc_desc_registered(guc, ce->guc_id) ||
> >  		     !intel_context_is_pinned(ce) ||
> >  		     context_pending_disable(ce) ||
> >  		     context_blocked(ce) > 1)) {
> 
> I think this entire if condition here is screaming that our intel_context
> state machinery for guc is way too complex, and on the wrong side of
> incomprehensible.
> 
> Also some of these check state outside of the context, and we don't seem
> to hold spinlocks for those, or anything else.
> 
> I general I have no idea which of these are defensive programming and
> cannot ever happen, and which actually can happen. There's for sure way
> too many races going on given that this is all context-local stuff.

A lot of this is guarding against a full GT reset while trying to
cancel a request. Full GT resets make everything really hard and in
pratice should never really happen because the GuC does per engine /
context resets. Unfortunately IGTs do weird things like turn off per
engine / contexts resets and full GT reset the only way to recover so
the IGTs can will expose all the races around GT reset, especially when
we run IGTs a pre-prod HW that tends to hang for whatever reason.

Matt 

> -Daniel
> -- 
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index c3b7bf7319dd..353899634fa8 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1579,6 +1579,9 @@  static void guc_context_unblock(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 
 	if (unlikely(submission_disabled(guc) ||
+		     intel_context_is_banned(ce) ||
+		     context_guc_id_invalid(ce) ||
+		     !lrc_desc_registered(guc, ce->guc_id) ||
 		     !intel_context_is_pinned(ce) ||
 		     context_pending_disable(ce) ||
 		     context_blocked(ce) > 1)) {