Message ID | 20180405123923.22671-4-tvrtko.ursulin@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Quoting Tvrtko Ursulin (2018-04-05 13:39:19)
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> Keep a count of requests submitted from userspace and not yet runnable
> due to unresolved dependencies.
>
> v2: Rename and move under the container struct. (Chris Wilson)
> v3: Rebase.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_request.c     | 3 +++
>  drivers/gpu/drm/i915/intel_engine_cs.c  | 3 ++-
>  drivers/gpu/drm/i915/intel_ringbuffer.h | 8 ++++++++
>  3 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
> index 5c01291ad1cc..152321655fe6 100644
> --- a/drivers/gpu/drm/i915/i915_request.c
> +++ b/drivers/gpu/drm/i915/i915_request.c
> @@ -640,6 +640,7 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
>                 rcu_read_lock();
>                 request->engine->submit_request(request);
>                 rcu_read_unlock();
> +               atomic_dec(&request->engine->request_stats.queued);

But we use atomic here? Might as well use atomic for
request_stats.runnable here as well?
-Chris
On 06/04/2018 21:17, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-04-05 13:39:19)
> [...]
>> +               atomic_dec(&request->engine->request_stats.queued);
>
> But we use atomic here? Might as well use atomic for
> request_stats.runnable here as well?

I admit it can read a bit uneven.

For runnable I wanted to avoid another atomic by putting it under the
engine timeline lock.

But for queued I did not want to start taking the same lock when adding
a request.

Your proposal to make runnable atomic_t and move it to submit_notify
would indeed simplify things, but at the cost of one more atomic in that
path. Perhaps the code path is heavy enough for one new atomic to be
completely hidden in it, and for the code simplification to win?

Regards,

Tvrtko
Quoting Tvrtko Ursulin (2018-04-09 10:11:53)
> On 06/04/2018 21:17, Chris Wilson wrote:
> [...]
> Your proposal to make runnable atomic_t and move to submit_notify would
> indeed simplify things, but at a cost of one more atomic in that path.
> Perhaps the code path is heavy enough for one new atomic to be
> completely hidden in it, and code simplification to win?

It also solidifies that we are moving from one counter to the next.
(There must be some common 64b cmpxchg for doing that!) Going from +1
locked operation to +2 here isn't the end of the world, but I can
certainly appreciate trying to keep the number down (especially for aux
information like stats).

	now = atomic64_read(&stats.queued_runnable);
	do {
		old = now;
		new_queued = upper_32_bits(old) - 1;
		new_runnable = lower_32_bits(old) + 1;
		now = atomic64_cmpxchg(&stats.queued_runnable,
				       old,
				       (new_runnable | (u64)new_queued << 32));
	} while (now != old);

Downside being that we either then use atomic64 throughout, or we mix
atomic32/atomic64 knowing that we're on x86. (I feel like someone else
must have solved this problem in a much neater way, before they went to
per-cpu stats ;)
-Chris
On 09/04/2018 10:25, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-04-09 10:11:53)
> [...]
> It also solidifies that we are moving from one counter to the next.
> (There must be some common 64b cmpxchg for doing that!)
>
>	now = atomic64_read(&stats.queued_runnable);
>	do {
>		old = now;
>		new_queued = upper_32_bits(old) - 1;
>		new_runnable = lower_32_bits(old) + 1;
>		now = atomic64_cmpxchg(&stats.queued_runnable,
>				       old,
>				       (new_runnable | (u64)new_queued << 32));
>	} while (now != old);

Hm, I don't know — we have to be careful with these retry loops. More
importantly, I am not sure it isn't overkill.

> Downside being that we either then use atomic64 throughout or we mix
> atomic32/atomic64 knowing that we're on x86. (I feel like someone else
> must have solved this problem in a much neater way, before they went to
> per-cpu stats ;)

Is the winky implying you know who and where? :) We have three potential
solutions now, even before counting whatever the winky may be
suggesting. For me it is still a choice between what I have and
simplifying the code paths by going with another atomic_t.

Regards,

Tvrtko
Quoting Tvrtko Ursulin (2018-04-09 11:17:04)
> On 09/04/2018 10:25, Chris Wilson wrote:
> [...]
> Is the winky implying you know who and where? :) We have three potential
> solutions now

Nah, just that atomic/locked counters are so old hat. Not sure if there
remain any good examples of hotpath counters that are still applicable
to our code.
-Chris
Quoting Chris Wilson (2018-04-09 11:27:17)
> Nah, just that atomic/locked counters are so old hat. Not sure if there
> remain any good examples for hotpath counters that remain applicable to
> our code.

With an underlying nudge that perhaps our hotpath code isn't so hot, and
a question of whether we can squeeze out a few more cycles.
-Chris
On 09/04/2018 11:27, Chris Wilson wrote:
> [...]
> Nah, just that atomic/locked counters are so old hat. Not sure if there
> remain any good examples for hotpath counters that remain applicable to
> our code.

Leave it as is then for now, and improve if we discover it is not good
enough?

Regards,

Tvrtko
Quoting Tvrtko Ursulin (2018-04-09 11:40:08)
> [...]
> Leave it as is then for now and improve if we discover it is not good
> enough?

I did have an ulterior motive, in that the cmpxchg did resolve one issue
that irked me: the two counters being updated out of sync. Minor, minor
glitches :)

I don't have a strong preference either way. These instructions on the
submit path are not likely to stand out compared to the biggest fish of
ksoftirqd, execlists_schedule() and execlists_dequeue().
-Chris
On 09/04/2018 11:51, Chris Wilson wrote:
> [...]
> I don't have a strong preference either way. These instructions on the
> submit are not likely to stand out, as compared to the biggest fish of
> ksoftirqd, execlists_schedule() and execlists_dequeue().

I could move the queued decrement from submit_notify to the backends,
right next to runnable++. Then both would be under the
engine->timeline->lock, so I'd hope any inconsistencies in readout
should be dismissable?

Regards,

Tvrtko
Quoting Tvrtko Ursulin (2018-04-09 12:43:50)
> [...]
> I could move the queued decrement from submit_notify to backends, right
> next to runnable++? Then both would be under the engine->timeline->lock
> so any inconsistencies in readout I'd hope should be dismissable?

Fair. I have this itch to add a request->state,

	switch (request->state) {
	case QUEUED:
		stats->queued--;
		switch (now) {
		case QUEUED:
			BUG();
		case READY:
			stats->runnable++;
		case EXEC:
			break;
		}
		break;
	case ...
	}
	request->state = now;

Stop me. Please, stop me.
-Chris
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index 5c01291ad1cc..152321655fe6 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -640,6 +640,7 @@ submit_notify(struct i915_sw_fence *fence, enum i915_sw_fence_notify state)
 		rcu_read_lock();
 		request->engine->submit_request(request);
 		rcu_read_unlock();
+		atomic_dec(&request->engine->request_stats.queued);
 		break;
 
 	case FENCE_FREE:
@@ -1118,6 +1119,8 @@ void __i915_request_add(struct i915_request *request, bool flush_caches)
 		engine->schedule(request, request->ctx->priority);
 	rcu_read_unlock();
 
+	atomic_inc(&engine->request_stats.queued);
+
 	local_bh_disable();
 	i915_sw_fence_commit(&request->submit);
 	local_bh_enable(); /* Kick the execlists tasklet if just scheduled */
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 98254ff92785..e4992c2e23a4 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -1934,12 +1934,13 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 	if (i915_terminally_wedged(&engine->i915->gpu_error))
 		drm_printf(m, "*** WEDGED ***\n");
 
-	drm_printf(m, "\tcurrent seqno %x, last %x, hangcheck %x [%d ms], inflight %d, runnable %u\n",
+	drm_printf(m, "\tcurrent seqno %x, last %x, hangcheck %x [%d ms], inflight %d, queued %u, runnable %u\n",
 		   intel_engine_get_seqno(engine),
 		   intel_engine_last_submit(engine),
 		   engine->hangcheck.seqno,
 		   jiffies_to_msecs(jiffies - engine->hangcheck.action_timestamp),
 		   engine->timeline->inflight_seqnos,
+		   atomic_read(&engine->request_stats.queued),
 		   engine->request_stats.runnable);
 	drm_printf(m, "\tReset count: %d (global %d)\n",
 		   i915_reset_engine_count(error, engine),
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 54d2ad1c8daa..616066f536c9 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -339,6 +339,14 @@ struct intel_engine_cs {
 	struct drm_i915_gem_object *default_state;
 
 	struct {
+		/**
+		 * @queued: Number of submitted requests with dependencies.
+		 *
+		 * Count of requests waiting for dependencies before they can
+		 * be submitted to the backend.
+		 */
+		atomic_t queued;
+
 		/**
 		 * @runnable: Number of runnable requests sent to the backend.
 		 *