Message ID | 20180604125239.5856-1-tvrtko.ursulin@linux.intel.com (mailing list archive)
---|---
State | New, archived
Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> As Chris has discovered on his Ivybridge, and later automated test runs
> have confirmed, on most of our platforms hrtimer faced with heavy GPU load
> ca occasionally become sufficiently imprecise to affect PMU sampling

s/ca/can/

> calculations.
>
> This means we cannot assume sampling frequency is what we asked for, but
> we need to measure the interval ourselves.
>
> This patch is similar to Chris' original proposal for per-engine counters,
> but instead of introducing a new set to work around the problem with
> frequency sampling, it swaps around the way internal frequency accounting
> is done. Instead of accumulating current frequency and dividing by
> sampling frequency on readout, it accumulates frequency scaled by each
> period.

My ulterior motive failed to win favour ;)

> Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

> @@ -237,14 +251,21 @@ static enum hrtimer_restart i915_sample(struct hrtimer *hrtimer)
> {
>         struct drm_i915_private *i915 =
>                 container_of(hrtimer, struct drm_i915_private, pmu.timer);
> +       unsigned int period_ns;
> +       ktime_t now;
>
>         if (!READ_ONCE(i915->pmu.timer_enabled))
>                 return HRTIMER_NORESTART;
>
> -       engines_sample(i915);
> -       frequency_sample(i915);
> +       now = ktime_get();
> +       period_ns = ktime_to_ns(ktime_sub(now, i915->pmu.timer_last));
> +       i915->pmu.timer_last = now;

Maybe worth a comment about how the lag coming from the timer outweighs our
concern with using a single timestamp over multiple samplers, with
indeterminate delays due to forcewake.

> +       engines_sample(i915, period_ns);
> +       frequency_sample(i915, period_ns);
> +
> +       hrtimer_forward(hrtimer, now, ns_to_ktime(PERIOD));
>
> -       hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
>         return HRTIMER_RESTART;
> }
>
> @@ -519,12 +540,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
>         case I915_PMU_ACTUAL_FREQUENCY:
>                 val =
>                    div_u64(i915->pmu.sample[__I915_SAMPLE_FREQ_ACT].cur,
> -                          FREQUENCY);
> +                          1000000 /* to MHz */);

USEC_PER_SEC /* to MHz */

Storage in sample[] is MHz.usec, so here we cancel out the usec to leave
MHz.
-Chris
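As a rough illustration of the MHz.usec bookkeeping described above, here is a
standalone userspace sketch (not i915 code; the tick lengths and frequency are
made up): each tick accumulates freq_mhz * period_us, and the readout divides
by USEC_PER_SEC to cancel the microseconds.

/*
 * Illustrative only: mimics the MHz.usec accumulation discussed above.
 * The values below are assumptions for the example, not taken from
 * i915_pmu.c.
 */
#include <stdint.h>
#include <stdio.h>

#define USEC_PER_SEC 1000000ULL

int main(void)
{
	uint64_t sample_cur = 0;
	/* five ticks: four of ~5ms and one delayed ~10ms tick, all at 800 MHz */
	const unsigned int period_us[] = { 5000, 5010, 4990, 5000, 10000 };
	const unsigned int freq_mhz = 800;

	for (unsigned int i = 0; i < 5; i++)
		sample_cur += (uint64_t)freq_mhz * period_us[i]; /* MHz.usec */

	/* 800 MHz * 30000 us / USEC_PER_SEC = 24 "MHz.sec" accumulated */
	printf("%llu\n", (unsigned long long)(sample_cur / USEC_PER_SEC));
	return 0;
}

The counter therefore grows by one per second spent at 1 MHz, so the intent is
that a rate taken over wall-clock time reads back as average MHz, regardless of
how irregular the individual sampling ticks were.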
Quoting Chris Wilson (2018-06-04 13:59:12)
> Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
> > From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >
> > As Chris has discovered on his Ivybridge, and later automated test runs
> > have confirmed, on most of our platforms hrtimer faced with heavy GPU load
> > ca occasionally become sufficiently imprecise to affect PMU sampling
>
> s/ca/can/
>
> > calculations.
> >
> > This means we cannot assume sampling frequency is what we asked for, but
> > we need to measure the interval ourselves.
> >
> > This patch is similar to Chris' original proposal for per-engine counters,
> > but instead of introducing a new set to work around the problem with
> > frequency sampling, it swaps around the way internal frequency accounting
> > is done. Instead of accumulating current frequency and dividing by
> > sampling frequency on readout, it accumulates frequency scaled by each
> > period.
>
> My ulterior motive failed to win favour ;)
>
> > Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
> > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

I should point out that even with this fix (or rather my patch), we can
still see errors of 25-30us, enough to fail the test. However, without
the fix our errors can be orders of magnitude worse (e.g. reporting 80us
busy instead of 500us).
-Chris
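To make the failure mode concrete, a toy model of the two accumulation schemes
(my own numbers, not from the thread): the engine is busy for the whole window,
the sampler should tick every 100us, but two ticks are starved and the third
fires late.

/*
 * Toy model, not driver code: the old scheme credits a fixed PERIOD per
 * busy tick; the patched scheme credits the measured interval since the
 * previous tick.
 */
#include <stdio.h>

#define PERIOD_US 100

int main(void)
{
	/* actual firing times in us; the ticks due at 300us and 400us never ran */
	const unsigned int fire_us[] = { 100, 200, 500, 600, 700 };
	unsigned int old_busy = 0, new_busy = 0, last = 0;

	for (unsigned int i = 0; i < 5; i++) {
		old_busy += PERIOD_US;           /* fixed credit per tick */
		new_busy += fire_us[i] - last;   /* measured interval credit */
		last = fire_us[i];
	}

	/* prints old=500us new=700us for 700us of real busyness */
	printf("old=%uus new=%uus\n", old_busy, new_busy);
	return 0;
}

With the fixed per-tick credit the starved ticks simply vanish from the busy
total, which is the kind of undercount described above; crediting the measured
interval recovers the lost time as soon as the late tick fires.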
On 04/06/2018 14:03, Chris Wilson wrote:
> Quoting Chris Wilson (2018-06-04 13:59:12)
>> Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>
>>> As Chris has discovered on his Ivybridge, and later automated test runs
>>> have confirmed, on most of our platforms hrtimer faced with heavy GPU load
>>> ca occasionally become sufficiently imprecise to affect PMU sampling
>>
>> s/ca/can/
>>
>>> calculations.
>>>
>>> This means we cannot assume sampling frequency is what we asked for, but
>>> we need to measure the interval ourselves.
>>>
>>> This patch is similar to Chris' original proposal for per-engine counters,
>>> but instead of introducing a new set to work around the problem with
>>> frequency sampling, it swaps around the way internal frequency accounting
>>> is done. Instead of accumulating current frequency and dividing by
>>> sampling frequency on readout, it accumulates frequency scaled by each
>>> period.
>>
>> My ulterior motive failed to win favour ;)
>>
>>> Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>
> I should point out that even with this fix (or rather my patch), we can

I did not mean to steal it, just that you seemed very uncompromising on
the new counters approach. If you prefer you can respin your patch with
this approach for frequency counters, it is fine by me.

> still see errors of 25-30us, enough to fail the test. However, without
> the fix our errors can be orders of magnitude worse (e.g. reporting 80us
> busy instead of 500us).

Yeah I guess if the timer is delayed it is delayed and all samples just
won't be there. I don't see a way to fix that elegantly. Even practically.

Regards,

Tvrtko
Quoting Tvrtko Ursulin (2018-06-04 16:01:26)
>
> On 04/06/2018 14:03, Chris Wilson wrote:
> > Quoting Chris Wilson (2018-06-04 13:59:12)
> >> Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
> >>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>
> >>> As Chris has discovered on his Ivybridge, and later automated test runs
> >>> have confirmed, on most of our platforms hrtimer faced with heavy GPU load
> >>> ca occasionally become sufficiently imprecise to affect PMU sampling
> >>
> >> s/ca/can/
> >>
> >>> calculations.
> >>>
> >>> This means we cannot assume sampling frequency is what we asked for, but
> >>> we need to measure the interval ourselves.
> >>>
> >>> This patch is similar to Chris' original proposal for per-engine counters,
> >>> but instead of introducing a new set to work around the problem with
> >>> frequency sampling, it swaps around the way internal frequency accounting
> >>> is done. Instead of accumulating current frequency and dividing by
> >>> sampling frequency on readout, it accumulates frequency scaled by each
> >>> period.
> >>
> >> My ulterior motive failed to win favour ;)
> >>
> >>> Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
> >>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> >
> > I should point out that even with this fix (or rather my patch), we can
>
> I did not mean to steal it,

Don't mind, I view such patches as "this here is a problem, and what I
think can be done". I'm quite happy to be told "nope, the problem
is..."; the end result is we fix and move on.

> just that you seemed very uncompromising on
> the new counters approach. If you prefer you can respin your patch with
> this approach for frequency counters, it is fine by me.

I'm just not convinced yet by the frequency sampler, and I still feel
that we would be better off with cycles. I just haven't found a
convincing argument ;)

> > still see errors of 25-30us, enough to fail the test. However, without
> > the fix our errors can be orders of magnitude worse (e.g. reporting 80us
> > busy instead of 500us).
>
> Yeah I guess if the timer is delayed it is delayed and all samples just
> won't be there. I don't see a way to fix that elegantly. Even practically.

Right, but it's not the case that we aren't sampling. If the timer was
delayed entirely, then we would see 0 (start_val == end_val). I'm not
sure exactly what goes wrong, but it probably is related to the timer
being unreliable. (It's just that in this case we are sampling a square
wave, so really hard to mess up!)
-Chris
On 04/06/2018 16:11, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-06-04 16:01:26)
>>
>> On 04/06/2018 14:03, Chris Wilson wrote:
>>> Quoting Chris Wilson (2018-06-04 13:59:12)
>>>> Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>
>>>>> As Chris has discovered on his Ivybridge, and later automated test runs
>>>>> have confirmed, on most of our platforms hrtimer faced with heavy GPU load
>>>>> ca occasionally become sufficiently imprecise to affect PMU sampling
>>>>
>>>> s/ca/can/
>>>>
>>>>> calculations.
>>>>>
>>>>> This means we cannot assume sampling frequency is what we asked for, but
>>>>> we need to measure the interval ourselves.
>>>>>
>>>>> This patch is similar to Chris' original proposal for per-engine counters,
>>>>> but instead of introducing a new set to work around the problem with
>>>>> frequency sampling, it swaps around the way internal frequency accounting
>>>>> is done. Instead of accumulating current frequency and dividing by
>>>>> sampling frequency on readout, it accumulates frequency scaled by each
>>>>> period.
>>>>
>>>> My ulterior motive failed to win favour ;)
>>>>
>>>>> Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>
>>> I should point out that even with this fix (or rather my patch), we can
>>
>> I did not mean to steal it,
>
> Don't mind, I view such patches as "this here is a problem, and what I
> think can be done". I'm quite happy to be told "nope, the problem
> is..."; the end result is we fix and move on.
>
>> just that you seemed very uncompromising on
>> the new counters approach. If you prefer you can respin your patch with
>> this approach for frequency counters, it is fine by me.
>
> I'm just not convinced yet by the frequency sampler, and I still feel
> that we would be better off with cycles. I just haven't found a
> convincing argument ;)

Just imagine the .unit file has Mcycles instead of MHz in it? :) Since we
went with this when adding it initially I think we should exercise
restraint with deprecating and adding so quickly.

>>> still see errors of 25-30us, enough to fail the test. However, without
>>> the fix our errors can be orders of magnitude worse (e.g. reporting 80us
>>> busy instead of 500us).
>>
>> Yeah I guess if the timer is delayed it is delayed and all samples just
>> won't be there. I don't see a way to fix that elegantly. Even practically.
>
> Right, but it's not the case that we aren't sampling. If the timer was
> delayed entirely, then we would see 0 (start_val == end_val). I'm not
> sure exactly what goes wrong, but it probably is related to the timer
> being unreliable. (It's just that in this case we are sampling a square
> wave, so really hard to mess up!)

Hm, maybe I am not completely understanding what you are seeing. I thought
that multiple timer overruns (aka delay by multiple periods) explain
everything. Then even with this patch, if they happen to happen (!) at the
end of a sampling period (as observed from userspace), a large-ish chunk of
samples (depending on the total sampling duration) may be missing (pending),
and so still be observed as a fail. The difference after this patch is that
when the timer eventually fires the counter will account for the delay, so
subsequent queries will be fine.

If readout waited for at least one real sampling tick, that would be solved,
but at the cost of several other undesirable consequences. At the least it
would slow down readout to 200Hz, and I don't know off the top of my head
whether it would cause unresolvable deadlock scenarios, or if it is even
technically possible within the perf framework. (Since I don't think we
should do it, I don't even want to start thinking about it.)

Regards,

Tvrtko
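A worked example of the "pending sample at readout" case described above (the
numbers are my own, chosen only to illustrate the shape of the problem): the
timer should tick every 100us, the engine is busy throughout, and userspace
reads the counter at 500us while the tick due at 500us is running ~30us late.

/*
 * Illustration only, not driver code: the counter credits the measured
 * interval at each tick, as in the patch, but a read that lands during
 * a timer overrun still sees the not-yet-credited time as missing.
 */
#include <stdio.h>

int main(void)
{
	unsigned int counter = 0, last = 0;
	const unsigned int fire_us[] = { 100, 200, 300, 400, 530, 600 };

	for (unsigned int i = 0; i < 6; i++) {
		if (fire_us[i] > 500 && last <= 500)
			/* a read at t=500us still sees only 400us of busyness */
			printf("read@500us -> %uus\n", counter);
		counter += fire_us[i] - last;  /* credit measured interval */
		last = fire_us[i];
	}

	/* after the late tick lands, a read at t=600us sees 600us again */
	printf("read@600us -> %uus\n", counter);
	return 0;
}

The sampling window that ends during the overrun still comes up short, but once
the late tick fires the full interval has been credited, so the next read is
accurate again, which is the behaviour described in the reply above.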
diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index dc87797db500..ff3e3f865567 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -127,6 +127,7 @@ static void __i915_pmu_maybe_start_timer(struct drm_i915_private *i915)
 {
 	if (!i915->pmu.timer_enabled && pmu_needs_timer(i915, true)) {
 		i915->pmu.timer_enabled = true;
+		i915->pmu.timer_last = ktime_get();
 		hrtimer_start_range_ns(&i915->pmu.timer,
 				       ns_to_ktime(PERIOD), 0,
 				       HRTIMER_MODE_REL_PINNED);
@@ -155,12 +156,13 @@ static bool grab_forcewake(struct drm_i915_private *i915, bool fw)
 }
 
 static void
-update_sample(struct i915_pmu_sample *sample, u32 unit, u32 val)
+add_sample(struct i915_pmu_sample *sample, u32 val)
 {
-	sample->cur += mul_u32_u32(val, unit);
+	sample->cur += val;
 }
 
-static void engines_sample(struct drm_i915_private *dev_priv)
+static void
+engines_sample(struct drm_i915_private *dev_priv, unsigned int period_ns)
 {
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
@@ -182,8 +184,9 @@ static void engines_sample(struct drm_i915_private *dev_priv)
 
 		val = !i915_seqno_passed(current_seqno, last_seqno);
 
-		update_sample(&engine->pmu.sample[I915_SAMPLE_BUSY],
-			      PERIOD, val);
+		if (val)
+			add_sample(&engine->pmu.sample[I915_SAMPLE_BUSY],
+				   period_ns);
 
 		if (val && (engine->pmu.enable &
 		    (BIT(I915_SAMPLE_WAIT) | BIT(I915_SAMPLE_SEMA)))) {
@@ -194,11 +197,13 @@ static void engines_sample(struct drm_i915_private *dev_priv)
 			val = 0;
 		}
 
-		update_sample(&engine->pmu.sample[I915_SAMPLE_WAIT],
-			      PERIOD, !!(val & RING_WAIT));
+		if (val & RING_WAIT)
+			add_sample(&engine->pmu.sample[I915_SAMPLE_WAIT],
+				   period_ns);
 
-		update_sample(&engine->pmu.sample[I915_SAMPLE_SEMA],
-			      PERIOD, !!(val & RING_WAIT_SEMAPHORE));
+		if (val & RING_WAIT_SEMAPHORE)
+			add_sample(&engine->pmu.sample[I915_SAMPLE_SEMA],
+				   period_ns);
 	}
 
 	if (fw)
@@ -207,7 +212,14 @@ static void engines_sample(struct drm_i915_private *dev_priv)
 	intel_runtime_pm_put(dev_priv);
 }
 
-static void frequency_sample(struct drm_i915_private *dev_priv)
+static void
+add_sample_mult(struct i915_pmu_sample *sample, u32 val, u32 mul)
+{
+	sample->cur += mul_u32_u32(val, mul);
+}
+
+static void
+frequency_sample(struct drm_i915_private *dev_priv, unsigned int period_ns)
 {
 	if (dev_priv->pmu.enable &
 	    config_enabled_mask(I915_PMU_ACTUAL_FREQUENCY)) {
@@ -221,15 +233,17 @@ static void frequency_sample(struct drm_i915_private *dev_priv)
 			intel_runtime_pm_put(dev_priv);
 		}
 
-		update_sample(&dev_priv->pmu.sample[__I915_SAMPLE_FREQ_ACT],
-			      1, intel_gpu_freq(dev_priv, val));
+		add_sample_mult(&dev_priv->pmu.sample[__I915_SAMPLE_FREQ_ACT],
+				intel_gpu_freq(dev_priv, val),
+				period_ns / 1000);
 	}
 
 	if (dev_priv->pmu.enable &
 	    config_enabled_mask(I915_PMU_REQUESTED_FREQUENCY)) {
-		update_sample(&dev_priv->pmu.sample[__I915_SAMPLE_FREQ_REQ], 1,
-			      intel_gpu_freq(dev_priv,
-					     dev_priv->gt_pm.rps.cur_freq));
+		add_sample_mult(&dev_priv->pmu.sample[__I915_SAMPLE_FREQ_REQ],
+				intel_gpu_freq(dev_priv,
+					       dev_priv->gt_pm.rps.cur_freq),
+				period_ns / 1000);
 	}
 }
 
@@ -237,14 +251,21 @@ static enum hrtimer_restart i915_sample(struct hrtimer *hrtimer)
 {
 	struct drm_i915_private *i915 =
 		container_of(hrtimer, struct drm_i915_private, pmu.timer);
+	unsigned int period_ns;
+	ktime_t now;
 
 	if (!READ_ONCE(i915->pmu.timer_enabled))
 		return HRTIMER_NORESTART;
 
-	engines_sample(i915);
-	frequency_sample(i915);
+	now = ktime_get();
+	period_ns = ktime_to_ns(ktime_sub(now, i915->pmu.timer_last));
+	i915->pmu.timer_last = now;
+
+	engines_sample(i915, period_ns);
+	frequency_sample(i915, period_ns);
+
+	hrtimer_forward(hrtimer, now, ns_to_ktime(PERIOD));
 
-	hrtimer_forward_now(hrtimer, ns_to_ktime(PERIOD));
 	return HRTIMER_RESTART;
 }
 
@@ -519,12 +540,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
 	case I915_PMU_ACTUAL_FREQUENCY:
 		val =
 		   div_u64(i915->pmu.sample[__I915_SAMPLE_FREQ_ACT].cur,
-			   FREQUENCY);
+			   1000000 /* to MHz */);
 		break;
 	case I915_PMU_REQUESTED_FREQUENCY:
 		val =
 		   div_u64(i915->pmu.sample[__I915_SAMPLE_FREQ_REQ].cur,
-			   FREQUENCY);
+			   1000000 /* to MHz */);
 		break;
 	case I915_PMU_INTERRUPTS:
 		val = count_interrupts(i915);
diff --git a/drivers/gpu/drm/i915/i915_pmu.h b/drivers/gpu/drm/i915/i915_pmu.h
index 2ba735299f7c..7f164ca3db12 100644
--- a/drivers/gpu/drm/i915/i915_pmu.h
+++ b/drivers/gpu/drm/i915/i915_pmu.h
@@ -65,6 +65,14 @@ struct i915_pmu {
 	 * event types.
 	 */
 	u64 enable;
+
+	/**
+	 * @timer_last:
+	 *
+	 * Timestmap of the previous timer invocation.
+	 */
+	ktime_t timer_last;
+
 	/**
 	 * @enable_count: Reference counts for the enabled events.
 	 *