Message ID | 152818960780.9058.9489807079671744910@mail.alporthouse.com (mailing list archive)
---|---
State | New, archived
On 05/06/2018 10:06, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-06-05 09:51:01)
>>
>> On 04/06/2018 16:11, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-06-04 16:01:26)
>>>>
>>>> On 04/06/2018 14:03, Chris Wilson wrote:
>>>>> Quoting Chris Wilson (2018-06-04 13:59:12)
>>>>>> Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>
>>>>>>> As Chris has discovered on his Ivybridge, and later automated test runs have confirmed, on most of our platforms hrtimer faced with heavy GPU load ca occasionally become sufficiently imprecise to affect PMU sampling
>>>>>>
>>>>>> s/ca/can/
>>>>>>
>>>>>>> calculations.
>>>>>>>
>>>>>>> This means we cannot assume sampling frequency is what we asked for, but we need to measure the interval ourselves.
>>>>>>>
>>>>>>> This patch is similar to Chris' original proposal for per-engine counters, but instead of introducing a new set to work around the problem with frequency sampling, it swaps around the way internal frequency accounting is done. Instead of accumulating current frequency and dividing by sampling frequency on readout, it accumulates frequency scaled by each period.
>>>>>>
>>>>>> My ulterior motive failed to win favour ;)
>>>>>>
>>>>>>> Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>
>>>>> I should point out that even with this fix (or rather my patch), we can
>>>>
>>>> I did not mean to steal it,
>>>
>>> Don't mind, I view such patches as "this here is a problem, and what I think can be done". I'm quite happy to be told "nope, the problem is..."; the end result is we fix and move on.
>>>
>>>> just that you seemed very uncompromising on the new counters approach. If you prefer you can respin your patch with this approach for frequency counters, it is fine by me.
>>>
>>> I'm just not convinced yet by the frequency sampler, and I still feel that we would be better off with cycles. I just haven't found a convincing argument ;)
>>
>> Just imagine .unit file has Mcycles instead of MHz in it? :)
>>
>> Since we went with this when adding it initially I think we should exercise restraint with deprecating and adding so quickly.
>>
>>>>> still see errors of 25-30us, enough to fail the test. However, without the fix our errors can be orders of magnitude worse (e.g. reporting 80us busy instead of 500us).
>>>>
>>>> Yeah I guess if the timer is delayed it is delayed and all samples just won't be there. I don't see a way to fix that elegantly. Even practically.
>>>
>>> Right, but it's not the case that we aren't sampling. If the timer was delayed entirely, then we would see 0 (start_val == end_val). I'm not sure exactly what goes wrong, but it probably is related to the timer being unreliable. (It's just that in this case we are sampling a square wave, so really hard to mess up!)
>>
>> Hm maybe I am not completely understanding what you are seeing. I thought that multiple timer overruns (aka delay by multiple periods) explains everything.
>>
>> Then even with this patch, if they happen to happen (!) at the end of a sampling period (as observed from userspace), the large-ish chunk of samples (depending on the total sampling duration) may be missing (pending), and so still observed as fail.
>
> Ah, but with the variable period timer, we can force the timer to run before we report the same (perf_event_read). That should still be inside the busy batch, and it still has the same inaccuracy (~28us, for a small sample population, measuring the tail is on my wishlist):
>
> diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> index 34cc6f6..8578a43 100644
> --- a/drivers/gpu/drm/i915/i915_pmu.c
> +++ b/drivers/gpu/drm/i915/i915_pmu.c
> @@ -534,6 +534,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
>                         val = engine->pmu.sample[sample].cur;
>                 }
>         } else {
> +               if (hrtimer_cancel(&i915->pmu.timer)) {
> +                       i915_sample(&i915->pmu.timer);
> +                       hrtimer_start_expires(&i915->pmu.timer,
> +                                             HRTIMER_MODE_ABS);
> +               }
> +
>                 switch (event->attr.config) {
>                 case I915_PMU_ACTUAL_FREQUENCY:
>                         val =
>
> I haven't thought enough about why it still has the tail, nor measured the distribution of errors. I just wanted to warn that we aren't quite out of the mess yet, but with the variable period we do at least stop drowning!
>
>> Difference after this patch is that when the timer eventually fires the counter will account for the delay, so subsequent queries will be fine.
>>
>> If readout waited for at least one real sampling tick, that would be solved, but by several other undesirable consequences. At least it would slow down readout to 200Hz, but I don't know from the top of my head if it wouldn't cause unresolvable deadlock scenarios, or if it is even technically possible within the perf framework. (Since I don't think we should do it, I don't even want to start thinking about it.)
>
> The above should do what you have in mind, I think.

I think you got the direct sample criteria reversed - it should be if the timer wasn't running. If it was, it means it just exited so the sampling is already pretty correct.

Also I was thinking to limit this to only when the timer was delayed, so as to avoid many readers messing with a single timer a lot. In other words we keep accepting the sampling error (0.5% I think for a steady input signal) and only improve the timer delay scenario.

I'll send a patch to discuss in more detail.

Regards,

Tvrtko
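For reference, the accounting swap described in the quoted commit message ("instead of accumulating current frequency and dividing by sampling frequency on readout, it accumulates frequency scaled by each period") boils down to something like the following. This is a minimal standalone C sketch with made-up names (freq_sample, sample_old/sample_new, SAMPLE_HZ), not the actual i915_pmu.c code:

#include <stdint.h>

#define SAMPLE_HZ    200ULL          /* nominal sampling rate assumed by the old scheme */
#define NSEC_PER_SEC 1000000000ULL

struct freq_sample {
	uint64_t sum_mhz;            /* old scheme: plain sum of per-tick frequency */
	uint64_t sum_mhz_ns;         /* new scheme: frequency weighted by the measured period */
};

/* Old scheme: every tick is assumed to be exactly 1/SAMPLE_HZ apart. */
static void sample_old(struct freq_sample *s, uint32_t freq_mhz)
{
	s->sum_mhz += freq_mhz;
}

static uint64_t read_old(const struct freq_sample *s)
{
	/* MHz-seconds; under-reports whenever the hrtimer ran late. */
	return s->sum_mhz / SAMPLE_HZ;
}

/* New scheme: weight each sample by the interval that actually elapsed. */
static void sample_new(struct freq_sample *s, uint32_t freq_mhz, uint64_t period_ns)
{
	s->sum_mhz_ns += (uint64_t)freq_mhz * period_ns;
}

static uint64_t read_new(const struct freq_sample *s)
{
	/* MHz-seconds again, but with no fixed-rate assumption at readout. */
	return s->sum_mhz_ns / NSEC_PER_SEC;
}

Both readouts return the time-integral of frequency (MHz-seconds), which perf turns into an average rate by dividing by the elapsed time; the difference is that a late tick in the new scheme simply contributes a proportionally larger period_ns, so timer delays no longer distort the result.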
Quoting Tvrtko Ursulin (2018-06-05 12:35:36)
>
> On 05/06/2018 10:06, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-06-05 09:51:01)
> >>
> >> On 04/06/2018 16:11, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-06-04 16:01:26)
> >>>>
> >>>> On 04/06/2018 14:03, Chris Wilson wrote:
> >>>>> Quoting Chris Wilson (2018-06-04 13:59:12)
> >>>>>> Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
> >>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>>>
> >>>>>>> As Chris has discovered on his Ivybridge, and later automated test runs have confirmed, on most of our platforms hrtimer faced with heavy GPU load ca occasionally become sufficiently imprecise to affect PMU sampling
> >>>>>>
> >>>>>> s/ca/can/
> >>>>>>
> >>>>>>> calculations.
> >>>>>>>
> >>>>>>> This means we cannot assume sampling frequency is what we asked for, but we need to measure the interval ourselves.
> >>>>>>>
> >>>>>>> This patch is similar to Chris' original proposal for per-engine counters, but instead of introducing a new set to work around the problem with frequency sampling, it swaps around the way internal frequency accounting is done. Instead of accumulating current frequency and dividing by sampling frequency on readout, it accumulates frequency scaled by each period.
> >>>>>>
> >>>>>> My ulterior motive failed to win favour ;)
> >>>>>>
> >>>>>>> Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
> >>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>
> >>>>> I should point out that even with this fix (or rather my patch), we can
> >>>>
> >>>> I did not mean to steal it,
> >>>
> >>> Don't mind, I view such patches as "this here is a problem, and what I think can be done". I'm quite happy to be told "nope, the problem is..."; the end result is we fix and move on.
> >>>
> >>>> just that you seemed very uncompromising on the new counters approach. If you prefer you can respin your patch with this approach for frequency counters, it is fine by me.
> >>>
> >>> I'm just not convinced yet by the frequency sampler, and I still feel that we would be better off with cycles. I just haven't found a convincing argument ;)
> >>
> >> Just imagine .unit file has Mcycles instead of MHz in it? :)
> >>
> >> Since we went with this when adding it initially I think we should exercise restraint with deprecating and adding so quickly.
> >>
> >>>>> still see errors of 25-30us, enough to fail the test. However, without the fix our errors can be orders of magnitude worse (e.g. reporting 80us busy instead of 500us).
> >>>>
> >>>> Yeah I guess if the timer is delayed it is delayed and all samples just won't be there. I don't see a way to fix that elegantly. Even practically.
> >>>
> >>> Right, but it's not the case that we aren't sampling. If the timer was delayed entirely, then we would see 0 (start_val == end_val). I'm not sure exactly what goes wrong, but it probably is related to the timer being unreliable. (It's just that in this case we are sampling a square wave, so really hard to mess up!)
> >>
> >> Hm maybe I am not completely understanding what you are seeing. I thought that multiple timer overruns (aka delay by multiple periods) explains everything.
> >>
> >> Then even with this patch, if they happen to happen (!) at the end of a sampling period (as observed from userspace), the large-ish chunk of samples (depending on the total sampling duration) may be missing (pending), and so still observed as fail.
> >
> > Ah, but with the variable period timer, we can force the timer to run before we report the same (perf_event_read). That should still be inside the busy batch, and it still has the same inaccuracy (~28us, for a small sample population, measuring the tail is on my wishlist):
> >
> > diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> > index 34cc6f6..8578a43 100644
> > --- a/drivers/gpu/drm/i915/i915_pmu.c
> > +++ b/drivers/gpu/drm/i915/i915_pmu.c
> > @@ -534,6 +534,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
> >                         val = engine->pmu.sample[sample].cur;
> >                 }
> >         } else {
> > +               if (hrtimer_cancel(&i915->pmu.timer)) {
> > +                       i915_sample(&i915->pmu.timer);
> > +                       hrtimer_start_expires(&i915->pmu.timer,
> > +                                             HRTIMER_MODE_ABS);
> > +               }
> > +
> >                 switch (event->attr.config) {
> >                 case I915_PMU_ACTUAL_FREQUENCY:
> >                         val =
> >
> > I haven't thought enough about why it still has the tail, nor measured the distribution of errors. I just wanted to warn that we aren't quite out of the mess yet, but with the variable period we do at least stop drowning!
> >
> >> Difference after this patch is that when the timer eventually fires the counter will account for the delay, so subsequent queries will be fine.
> >>
> >> If readout waited for at least one real sampling tick, that would be solved, but by several other undesirable consequences. At least it would slow down readout to 200Hz, but I don't know from the top of my head if it wouldn't cause unresolvable deadlock scenarios, or if it is even technically possible within the perf framework. (Since I don't think we should do it, I don't even want to start thinking about it.)
> >
> > The above should do what you have in mind, I think.
>
> I think you got the direct sample criteria reversed - it should be if the timer wasn't running. If it was, it means it just exited so the sampling is already pretty correct.

Note it returns 1 if the timer was queued or active, 0 if the timer wasn't, iirc; it's try_to_cancel that has the more complex interface. (It should always be true along this path if we are using the timer.)

> Also I was thinking to limit this to only when the timer was delayed, so as to avoid many readers messing with a single timer a lot. In other words we keep accepting the sampling error (0.5% I think for a steady input signal) and only improve the timer delay scenario.

You mean just: if time_between_samples < 10us, return. It didn't seem worth it. There are still a number of issues with the timer granularity that make chasing edge cases ... disappointing.

-Chris
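The narrower variant Tvrtko suggests (skip the forced sample when the timer fired recently enough) could look roughly like the sketch below. hrtimer_cancel(), hrtimer_start_expires() and HRTIMER_MODE_ABS are the real hrtimer API, and i915_sample() is the sampling callback used in the diff above, but the timer_last_ns/timer_period_ns bookkeeping fields are hypothetical, not actual i915_pmu members:

static void i915_pmu_resample_if_delayed(struct drm_i915_private *i915)
{
	struct i915_pmu *pmu = &i915->pmu;
	u64 now = ktime_get_ns();

	/* Timer fired recently enough: accept the usual ~0.5% sampling error. */
	if (now - pmu->timer_last_ns < pmu->timer_period_ns)
		return;

	/*
	 * Timer is overdue: cancel it if it was queued, sample synchronously
	 * so the counters cover the delayed interval, then re-arm it at its
	 * already-programmed expiry.
	 */
	if (hrtimer_cancel(&pmu->timer)) {
		i915_sample(&pmu->timer);
		hrtimer_start_expires(&pmu->timer, HRTIMER_MODE_ABS);
	}
}

The early return keeps the common read path from cancelling and re-arming a shared timer on every perf_event_read(), which is the "many readers messing with a single timer" concern raised above.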
diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index 34cc6f6..8578a43 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -534,6 +534,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
 			val = engine->pmu.sample[sample].cur;
 		}
 	} else {
+		if (hrtimer_cancel(&i915->pmu.timer)) {
+			i915_sample(&i915->pmu.timer);
+			hrtimer_start_expires(&i915->pmu.timer,
+					      HRTIMER_MODE_ABS);
+		}
+
 		switch (event->attr.config) {
 		case I915_PMU_ACTUAL_FREQUENCY:
 			val =
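For context on the calls in this fragment: hrtimer_cancel() waits for any in-flight callback to finish and returns nonzero if the timer was queued or running, so along this read path the sampler is invoked synchronously via i915_sample() and then re-armed with hrtimer_start_expires(), which restarts the timer at the expiry it already had programmed. The effect is that the counters are brought up to date at the moment of the read even when the periodic tick has been delayed.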