| Message ID | 20180328063845.4884-10-currojerez@riseup.net (mailing list archive) |
|---|---|
| State | New, archived |
Quoting Francisco Jerez (2018-03-28 07:38:45)
> This allows cpufreq governors to realize when the system becomes
> non-CPU-bound due to GPU rendering activity, which will cause the
> intel_pstate LP controller to behave more conservatively: CPU energy
> usage will be reduced when there isn't a good chance for system
> performance to scale with CPU frequency.  This leaves additional TDP
> budget available for the GPU to reach higher frequencies, which is
> translated into an improvement in graphics performance to the extent
> that the workload remains TDP-limited (Most non-trivial graphics
> benchmarks out there improve significantly in TDP-constrained
> platforms, see the cover letter for some numbers).  If the workload
> isn't (anymore) TDP-limited performance should stay roughly constant,
> but energy usage will be divided by a similar factor.

And that's what I thought IPS was already meant to be achieving;
intelligent distribution between different units...

> The intel_pstate LP controller is only enabled on BXT+ small-core
> platforms at this point, so this shouldn't have any effect on other
> systems.

Although that's probably only a feature for big core :)

> Signed-off-by: Francisco Jerez <currojerez@riseup.net>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 3a69b367e565..721f915115bd 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -132,6 +132,7 @@
>   *
>   */
>  #include <linux/interrupt.h>
> +#include <linux/cpufreq.h>
>
>  #include <drm/drmP.h>
>  #include <drm/i915_drm.h>
> @@ -379,11 +380,13 @@ execlists_context_schedule_in(struct i915_request *rq)
>  {
>      execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
>      intel_engine_context_in(rq->engine);
> +    cpufreq_io_active_begin();

Since you only care about a binary value for GPU activity, we don't need
to do this on each context, just between submitting the first request
and the final completion, i.e. couple this to EXECLISTS_ACTIVE_USER.

Haven't yet gone back to check how heavy io_active_begin/end are, but I
trust you appreciate that this code is particularly latency sensitive.
-Chris
Hi Chris,

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Francisco Jerez (2018-03-28 07:38:45)
>> This allows cpufreq governors to realize when the system becomes
>> non-CPU-bound due to GPU rendering activity, which will cause the
>> intel_pstate LP controller to behave more conservatively: CPU energy
>> usage will be reduced when there isn't a good chance for system
>> performance to scale with CPU frequency.  This leaves additional TDP
>> budget available for the GPU to reach higher frequencies, which is
>> translated into an improvement in graphics performance to the extent
>> that the workload remains TDP-limited (Most non-trivial graphics
>> benchmarks out there improve significantly in TDP-constrained
>> platforms, see the cover letter for some numbers).  If the workload
>> isn't (anymore) TDP-limited performance should stay roughly constant,
>> but energy usage will be divided by a similar factor.
>
> And that's what I thought IPS was already meant to be achieving;
> intelligent distribution between different units...
>

I'm not aware of IPS ever being available on any small core systems.

>> The intel_pstate LP controller is only enabled on BXT+ small-core
>> platforms at this point, so this shouldn't have any effect on other
>> systems.
>
> Although that's probably only a feature for big core :)
>

Actually I wouldn't rule out it being useful on big core up front.  I've
been playing around with the idea of hooking up this same interface to
the EPP preference used by HWP, which would allow the HWP controller to
reduce energy usage in GPU-bound conditions, which could potentially
leave additional TDP headroom available for the GPU -- It's not uncommon
for graphics workloads to be TDP-limited even on big core, and even
while not TDP-limited (so neither IPS nor IBC will ever help you) it
would still allow optimizing CPU energy usage for workloads that are
consistently GPU-bound (which would mean longer battery life while
gaming on a big-core laptop).

>> Signed-off-by: Francisco Jerez <currojerez@riseup.net>
>> ---
>>  drivers/gpu/drm/i915/intel_lrc.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 3a69b367e565..721f915115bd 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -132,6 +132,7 @@
>>   *
>>   */
>>  #include <linux/interrupt.h>
>> +#include <linux/cpufreq.h>
>>
>>  #include <drm/drmP.h>
>>  #include <drm/i915_drm.h>
>> @@ -379,11 +380,13 @@ execlists_context_schedule_in(struct i915_request *rq)
>>  {
>>      execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
>>      intel_engine_context_in(rq->engine);
>> +    cpufreq_io_active_begin();
>
> Since you only care about a binary value for GPU activity, we don't need
> to do this on each context, just between submitting the first request
> and the final completion, i.e. couple this to EXECLISTS_ACTIVE_USER.
>
> Haven't yet gone back to check how heavy io_active_begin/end are, but I
> trust you appreciate that this code is particularly latency sensitive.

The cpufreq_io_active_begin/end() interfaces are probably as lightweight
as they can be.  There's no locking involved.  In cases where there
already is an overlapping request, cpufreq_io_active_begin() and
cpufreq_io_active_end() should return without doing much more than an
atomic increment and an atomic cmpxchg respectively.

That said, it wouldn't hurt to call each of them once per sequence of
overlapping requests.  Do you want me to call them from
execlists_set/clear_active() themselves when bit == EXECLISTS_ACTIVE_USER,
or at each callsite of execlists_set/clear_active()?  [possibly protected
with a check that execlists_is_active(EXECLISTS_ACTIVE_USER) wasn't
already the expected value]

> -Chris
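For readers without the rest of the series at hand, the behaviour described
above -- no locking, nothing heavier than an atomic op once requests already
overlap -- corresponds to roughly the following kind of helper.  This is only
an illustrative sketch; the real cpufreq_io_active_begin()/end() live in the
intel_pstate LP patches and may be implemented differently, and
cpufreq_notify_io_active() is a made-up name standing in for whatever the
governor-side notification actually is:

#include <linux/atomic.h>
#include <linux/types.h>

/* Hypothetical governor-side hook, not an existing cpufreq API. */
extern void cpufreq_notify_io_active(bool active);

static atomic_t io_active_count = ATOMIC_INIT(0);

void cpufreq_io_active_begin(void)
{
        /* Only the 0 -> 1 transition needs to tell the governor anything. */
        if (atomic_inc_return(&io_active_count) == 1)
                cpufreq_notify_io_active(true);
}

void cpufreq_io_active_end(void)
{
        /* Only the 1 -> 0 transition needs to tell the governor anything. */
        if (atomic_dec_return(&io_active_count) == 0)
                cpufreq_notify_io_active(false);
}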
Quoting Francisco Jerez (2018-03-28 19:55:12)
> Hi Chris,
>
[snip]
> That said, it wouldn't hurt to call each of them once per sequence of
> overlapping requests.  Do you want me to call them from
> execlists_set/clear_active() themselves when bit == EXECLISTS_ACTIVE_USER,
> or at each callsite of execlists_set/clear_active()?  [possibly protected
> with a check that execlists_is_active(EXECLISTS_ACTIVE_USER) wasn't
> already the expected value]

No, I was thinking of adding an execlists_start()/execlists_stop()
[naming suggestions welcome, begin/end?] where we could hook additional
bookkeeping into.
-Chris
Quoting Chris Wilson (2018-03-28 20:20:19)
> Quoting Francisco Jerez (2018-03-28 19:55:12)
> > Hi Chris,
> >
> [snip]
> > That said, it wouldn't hurt to call each of them once per sequence of
> > overlapping requests.  Do you want me to call them from
> > execlists_set/clear_active() themselves when bit == EXECLISTS_ACTIVE_USER,
> > or at each callsite of execlists_set/clear_active()?  [possibly protected
> > with a check that execlists_is_active(EXECLISTS_ACTIVE_USER) wasn't
> > already the expected value]
>
> No, I was thinking of adding an execlists_start()/execlists_stop()
> [naming suggestions welcome, begin/end?] where we could hook additional
> bookkeeping into.

Trying to call execlist_begin() once didn't pan out. It's easier to
reuse for similar bookkeeping used in future patches if execlist_begin()
(or whatever name suits best) at the start of each context.

Something along the lines of:

@@ -374,6 +374,19 @@ execlists_context_status_change(struct i915_request *rq, unsigned long status)
                                   status, rq);
 }
 
+static inline void
+execlists_begin(struct intel_engine_execlists *execlists,
+               struct execlist_port *port)
+{
+       execlists_set_active_once(execlists, EXECLISTS_ACTIVE_USER);
+}
+
+static inline void
+execlists_end(struct intel_engine_execlists *execlists)
+{
+       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
+}
+
 static inline void
 execlists_context_schedule_in(struct i915_request *rq)
 {
@@ -710,7 +723,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
        spin_unlock_irq(&engine->timeline->lock);
 
        if (submit) {
-               execlists_set_active(execlists, EXECLISTS_ACTIVE_USER);
+               execlists_begin(execlists, execlists->port);
                execlists_submit_ports(engine);
        }
 
@@ -741,7 +754,7 @@ execlists_cancel_port_requests(struct intel_engine_execlists * const execlists)
                port++;
        }
 
-       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
+       execlists_end(execlists);
 }
 
 static void clear_gtiir(struct intel_engine_cs *engine)
@@ -872,7 +885,7 @@ static void execlists_submission_tasklet(unsigned long data)
 {
        struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
        struct intel_engine_execlists * const execlists = &engine->execlists;
-       struct execlist_port * const port = execlists->port;
+       struct execlist_port *port = execlists->port;
        struct drm_i915_private *dev_priv = engine->i915;
        bool fw = false;
 
@@ -1010,9 +1023,19 @@ static void execlists_submission_tasklet(unsigned long data)
 
                        GEM_BUG_ON(count == 0);
                        if (--count == 0) {
+                               /*
+                                * On the final event corresponding to the
+                                * submission of this context, we expect either
+                                * an element-switch event or an completion
+                                * event (and on completion, the active-idle
+                                * marker). No more preemptions, lite-restore
+                                * or otherwise
+                                */
                                GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
                                GEM_BUG_ON(port_isset(&port[1]) &&
                                           !(status & GEN8_CTX_STATUS_ELEMENT_SWITCH));
+                               GEM_BUG_ON(!port_isset(&port[1]) &&
+                                          !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
                                GEM_BUG_ON(!i915_request_completed(rq));
                                execlists_context_schedule_out(rq);
                                trace_i915_request_out(rq);
@@ -1021,17 +1044,14 @@ static void execlists_submission_tasklet(unsigned long data)
                                GEM_TRACE("%s completed ctx=%d\n",
                                          engine->name, port->context_id);
 
-                               execlists_port_complete(execlists, port);
+                               port = execlists_port_complete(execlists, port);
+                               if (port_isset(port))
+                                       execlists_begin(execlists, port);
+                               else
+                                       execlists_end(execlists);
                        } else {
                                port_set(port, port_pack(rq, count));
                        }
-Chris
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Chris Wilson (2018-03-28 20:20:19)
>> Quoting Francisco Jerez (2018-03-28 19:55:12)
>> > Hi Chris,
>> >
>> [snip]
>> > That said, it wouldn't hurt to call each of them once per sequence of
>> > overlapping requests.  Do you want me to call them from
>> > execlists_set/clear_active() themselves when bit == EXECLISTS_ACTIVE_USER,
>> > or at each callsite of execlists_set/clear_active()?  [possibly protected
>> > with a check that execlists_is_active(EXECLISTS_ACTIVE_USER) wasn't
>> > already the expected value]
>>
>> No, I was thinking of adding an execlists_start()/execlists_stop()
>> [naming suggestions welcome, begin/end?] where we could hook additional
>> bookkeeping into.
>
> Trying to call execlist_begin() once didn't pan out. It's easier to
> reuse for similar bookkeeping used in future patches if execlist_begin()
> (or whatever name suits best) at the start of each context.
>
> Something along the lines of:
>
> @@ -374,6 +374,19 @@ execlists_context_status_change(struct i915_request *rq, unsigned long status)
>                                    status, rq);
>  }
>
> +static inline void
> +execlists_begin(struct intel_engine_execlists *execlists,

I had started writing something along the same lines in my working tree
called execlists_active_user_begin/end -- Which name do you prefer?

> +               struct execlist_port *port)
> +{

What do you expect the port argument to be useful for?  Is it ever going
to be anything other than execlists->port?

> +       execlists_set_active_once(execlists, EXECLISTS_ACTIVE_USER);
> +}
> +
> +static inline void
> +execlists_end(struct intel_engine_execlists *execlists)
> +{
> +       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
> +}
> +
>  static inline void
>  execlists_context_schedule_in(struct i915_request *rq)
>  {
> @@ -710,7 +723,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>         spin_unlock_irq(&engine->timeline->lock);
>
>         if (submit) {
> -               execlists_set_active(execlists, EXECLISTS_ACTIVE_USER);
> +               execlists_begin(execlists, execlists->port);
>                 execlists_submit_ports(engine);
>         }
>
> @@ -741,7 +754,7 @@ execlists_cancel_port_requests(struct intel_engine_execlists * const execlists)
>                 port++;
>         }
>
> -       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);

This line doesn't seem to exist in my working tree.  I guess it was just
added?

> +       execlists_end(execlists);
>  }
>
>  static void clear_gtiir(struct intel_engine_cs *engine)
> @@ -872,7 +885,7 @@ static void execlists_submission_tasklet(unsigned long data)
>  {
>         struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
>         struct intel_engine_execlists * const execlists = &engine->execlists;
> -       struct execlist_port * const port = execlists->port;
> +       struct execlist_port *port = execlists->port;
>         struct drm_i915_private *dev_priv = engine->i915;
>         bool fw = false;
>
> @@ -1010,9 +1023,19 @@ static void execlists_submission_tasklet(unsigned long data)
>
>                         GEM_BUG_ON(count == 0);
>                         if (--count == 0) {
> +                               /*
> +                                * On the final event corresponding to the
> +                                * submission of this context, we expect either
> +                                * an element-switch event or an completion
> +                                * event (and on completion, the active-idle
> +                                * marker). No more preemptions, lite-restore
> +                                * or otherwise
> +                                */
>                                 GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
>                                 GEM_BUG_ON(port_isset(&port[1]) &&
>                                            !(status & GEN8_CTX_STATUS_ELEMENT_SWITCH));
> +                               GEM_BUG_ON(!port_isset(&port[1]) &&
> +                                          !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
>                                 GEM_BUG_ON(!i915_request_completed(rq));
>                                 execlists_context_schedule_out(rq);
>                                 trace_i915_request_out(rq);
> @@ -1021,17 +1044,14 @@ static void execlists_submission_tasklet(unsigned long data)
>                                 GEM_TRACE("%s completed ctx=%d\n",
>                                           engine->name, port->context_id);
>
> -                               execlists_port_complete(execlists, port);
> +                               port = execlists_port_complete(execlists, port);
> +                               if (port_isset(port))
> +                                       execlists_begin(execlists, port);

Isn't this going to call execlists_begin() roughly once per request?
What's the purpose if we expect it to be a no-op except for the first
request submitted after execlists_end()?  Isn't the intention to provide
a hook for bookkeeping that depends on idle to active and active to idle
transitions of the hardware?

> +                               else
> +                                       execlists_end(execlists);
>                         } else {
>                                 port_set(port, port_pack(rq, count));
>                         }
>

Isn't there an "execlists_clear_active(execlists,
EXECLISTS_ACTIVE_USER);" call in your tree a few lines below that is now
redundant?

> -Chris
Quoting Francisco Jerez (2018-03-29 01:32:07)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
>
> > Quoting Chris Wilson (2018-03-28 20:20:19)
> >> Quoting Francisco Jerez (2018-03-28 19:55:12)
> >> > Hi Chris,
> >> >
> >> [snip]
> >> > That said, it wouldn't hurt to call each of them once per sequence of
> >> > overlapping requests.  Do you want me to call them from
> >> > execlists_set/clear_active() themselves when bit == EXECLISTS_ACTIVE_USER,
> >> > or at each callsite of execlists_set/clear_active()?  [possibly protected
> >> > with a check that execlists_is_active(EXECLISTS_ACTIVE_USER) wasn't
> >> > already the expected value]
> >>
> >> No, I was thinking of adding an execlists_start()/execlists_stop()
> >> [naming suggestions welcome, begin/end?] where we could hook additional
> >> bookkeeping into.
> >
> > Trying to call execlist_begin() once didn't pan out. It's easier to
> > reuse for similar bookkeeping used in future patches if execlist_begin()
> > (or whatever name suits best) at the start of each context.
> >
> > Something along the lines of:
> >
> > @@ -374,6 +374,19 @@ execlists_context_status_change(struct i915_request *rq, unsigned long status)
> >                                    status, rq);
> >  }
> >
> > +static inline void
> > +execlists_begin(struct intel_engine_execlists *execlists,
>
> I had started writing something along the same lines in my working tree
> called execlists_active_user_begin/end -- Which name do you prefer?

Hmm, so the reason why the user distinction exists is that we may
momentarily remain active after the last user context is switched out,
if there is a preemption event still pending. Keeping the user tag does
help maintain that distinction.

>
> > +               struct execlist_port *port)
> > +{
>
> What do you expect the port argument to be useful for?  Is it ever going
> to be anything other than execlists->port?

There are patches afoot to change that, so since I needed to inspect
port here it seemed to tie in nicely with the proposed changes to
execlists_port_complete.

> > +       execlists_set_active_once(execlists, EXECLISTS_ACTIVE_USER);
> > +}
> > +
> > +static inline void
> > +execlists_end(struct intel_engine_execlists *execlists)
> > +{
> > +       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
> > +}
> > +
> >  static inline void
> >  execlists_context_schedule_in(struct i915_request *rq)
> >  {
> > @@ -710,7 +723,7 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
> >         spin_unlock_irq(&engine->timeline->lock);
> >
> >         if (submit) {
> > -               execlists_set_active(execlists, EXECLISTS_ACTIVE_USER);
> > +               execlists_begin(execlists, execlists->port);
> >                 execlists_submit_ports(engine);
> >         }
> >
> > @@ -741,7 +754,7 @@ execlists_cancel_port_requests(struct intel_engine_execlists * const execlists)
> >                 port++;
> >         }
> >
> > -       execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
>
> This line doesn't seem to exist in my working tree.  I guess it was just
> added?

A few days ago, yes.

> > +       execlists_end(execlists);
> >  }
> >
> >  static void clear_gtiir(struct intel_engine_cs *engine)
> > @@ -872,7 +885,7 @@ static void execlists_submission_tasklet(unsigned long data)
> >  {
> >         struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
> >         struct intel_engine_execlists * const execlists = &engine->execlists;
> > -       struct execlist_port * const port = execlists->port;
> > +       struct execlist_port *port = execlists->port;
> >         struct drm_i915_private *dev_priv = engine->i915;
> >         bool fw = false;
> >
> > @@ -1010,9 +1023,19 @@ static void execlists_submission_tasklet(unsigned long data)
> >
> >                         GEM_BUG_ON(count == 0);
> >                         if (--count == 0) {
> > +                               /*
> > +                                * On the final event corresponding to the
> > +                                * submission of this context, we expect either
> > +                                * an element-switch event or an completion
> > +                                * event (and on completion, the active-idle
> > +                                * marker). No more preemptions, lite-restore
> > +                                * or otherwise
> > +                                */
> >                                 GEM_BUG_ON(status & GEN8_CTX_STATUS_PREEMPTED);
> >                                 GEM_BUG_ON(port_isset(&port[1]) &&
> >                                            !(status & GEN8_CTX_STATUS_ELEMENT_SWITCH));
> > +                               GEM_BUG_ON(!port_isset(&port[1]) &&
> > +                                          !(status & GEN8_CTX_STATUS_ACTIVE_IDLE));
> >                                 GEM_BUG_ON(!i915_request_completed(rq));
> >                                 execlists_context_schedule_out(rq);
> >                                 trace_i915_request_out(rq);
> > @@ -1021,17 +1044,14 @@ static void execlists_submission_tasklet(unsigned long data)
> >                                 GEM_TRACE("%s completed ctx=%d\n",
> >                                           engine->name, port->context_id);
> >
> > -                               execlists_port_complete(execlists, port);
> > +                               port = execlists_port_complete(execlists, port);
> > +                               if (port_isset(port))
> > +                                       execlists_begin(execlists, port);
>
> Isn't this going to call execlists_begin() roughly once per request?
> What's the purpose if we expect it to be a no-op except for the first
> request submitted after execlists_end()?  Isn't the intention to provide
> a hook for bookkeeping that depends on idle to active and active to idle
> transitions of the hardware?

Because on a context switch I need to update the GPU clocks, update
tracking for preemption, and that's just the two I have in my tree that
need to land in the next couple of weeks...

Currently I have,

static inline void
execlists_begin(struct intel_engine_execlists *execlists,
                struct execlist_port *port)
{
        struct intel_engine_cs *engine =
                container_of(execlists, typeof(*engine), execlists);
        struct i915_request *rq = port_request(port);

        execlists_set_active_once(execlists, EXECLISTS_ACTIVE_USER);
        intel_rps_update_engine(engine, rq->ctx);
        execlists->current_priority = rq_prio(rq);
}

static inline void
execlists_end(struct intel_engine_execlists *execlists)
{
        struct intel_engine_cs *engine =
                container_of(execlists, typeof(*engine), execlists);

        execlists->current_priority = INT_MIN;
        intel_rps_update_engine(engine, NULL);
        execlists_clear_active(execlists, EXECLISTS_ACTIVE_USER);
}

> > +                               else
> > +                                       execlists_end(execlists);
> >                         } else {
> >                                 port_set(port, port_pack(rq, count));
> >                         }
> >
>
> Isn't there an "execlists_clear_active(execlists,
> EXECLISTS_ACTIVE_USER);" call in your tree a few lines below that is now
> redundant?

This is only half the patch to show the gist. :)
-Chris
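A note on the "once" semantics being discussed here: execlists_set_active_once()
is taken in this thread to be a test-and-set style helper, so any bookkeeping
that must only run on the idle-to-active transition can key off its return
value even if execlists_begin() itself runs on every context switch.  A minimal
sketch of what such a helper could look like, assuming the bit lives in
execlists->active as in the snippets above (this is an illustration, not
necessarily the final i915 code):

static inline bool
execlists_set_active_once(struct intel_engine_execlists *execlists,
                          unsigned int bit)
{
        /* True only for the caller that actually flips the bit. */
        return !__test_and_set_bit(bit, (unsigned long *)&execlists->active);
}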
Quoting Chris Wilson (2018-03-29 02:01:37)
> Quoting Francisco Jerez (2018-03-29 01:32:07)
> > Chris Wilson <chris@chris-wilson.co.uk> writes:
> > > +                               else
> > > +                                       execlists_end(execlists);
> > >                         } else {
> > >                                 port_set(port, port_pack(rq, count));
> > >                         }
> > >
> >
> > Isn't there an "execlists_clear_active(execlists,
> > EXECLISTS_ACTIVE_USER);" call in your tree a few lines below that is now
> > redundant?
>
> This is only half the patch to show the gist. :)

A more concrete sketch,
https://patchwork.freedesktop.org/patch/213594/
-Chris
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 3a69b367e565..721f915115bd 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -132,6 +132,7 @@
  *
  */
 #include <linux/interrupt.h>
+#include <linux/cpufreq.h>
 
 #include <drm/drmP.h>
 #include <drm/i915_drm.h>
@@ -379,11 +380,13 @@ execlists_context_schedule_in(struct i915_request *rq)
 {
        execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
        intel_engine_context_in(rq->engine);
+       cpufreq_io_active_begin();
 }
 
 static inline void
 execlists_context_schedule_out(struct i915_request *rq)
 {
+       cpufreq_io_active_end();
        intel_engine_context_out(rq->engine);
        execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_OUT);
 }
@@ -726,6 +729,7 @@ execlists_cancel_port_requests(struct intel_engine_execlists * const execlists)
                struct i915_request *rq = port_request(port);
 
                GEM_BUG_ON(!execlists->active);
+               cpufreq_io_active_end();
                intel_engine_context_out(rq->engine);
                execlists_context_status_change(rq,
This allows cpufreq governors to realize when the system becomes
non-CPU-bound due to GPU rendering activity, which will cause the
intel_pstate LP controller to behave more conservatively: CPU energy
usage will be reduced when there isn't a good chance for system
performance to scale with CPU frequency.  This leaves additional TDP
budget available for the GPU to reach higher frequencies, which is
translated into an improvement in graphics performance to the extent
that the workload remains TDP-limited (Most non-trivial graphics
benchmarks out there improve significantly in TDP-constrained
platforms, see the cover letter for some numbers).  If the workload
isn't (anymore) TDP-limited performance should stay roughly constant,
but energy usage will be divided by a similar factor.

The intel_pstate LP controller is only enabled on BXT+ small-core
platforms at this point, so this shouldn't have any effect on other
systems.

Signed-off-by: Francisco Jerez <currojerez@riseup.net>
---
 drivers/gpu/drm/i915/intel_lrc.c | 4 ++++
 1 file changed, 4 insertions(+)
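For a sense of how a governor might consume this signal, a deliberately
simplified, hypothetical sketch follows.  The actual intel_pstate LP
controller heuristics live in the rest of the series and are more elaborate;
every name below (struct lp_controller, lp_io_active(), io_limited_pstate,
lp_next_pstate()) is illustrative only and not taken from the patches:

struct lp_controller {
        unsigned int min_pstate;
        unsigned int max_pstate;
        unsigned int io_limited_pstate; /* cap applied while the GPU is busy */
};

/* Hypothetical: true while cpufreq_io_active_begin()/end() calls nest. */
bool lp_io_active(const struct lp_controller *lp);

static unsigned int lp_next_pstate(const struct lp_controller *lp,
                                   unsigned int busy_pct)
{
        unsigned int target = lp->min_pstate +
                busy_pct * (lp->max_pstate - lp->min_pstate) / 100;

        /*
         * While the GPU reports activity the system is unlikely to be
         * CPU-bound, so ramp the CPU less aggressively and leave TDP
         * headroom for the GPU instead.
         */
        if (lp_io_active(lp) && target > lp->io_limited_pstate)
                target = lp->io_limited_pstate;

        return target;
}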