
drm/i915/pmu: Do not assume fixed hrtimer period

Message ID 152818960780.9058.9489807079671744910@mail.alporthouse.com (mailing list archive)
State New, archived

Commit Message

Chris Wilson June 5, 2018, 9:06 a.m. UTC
Quoting Tvrtko Ursulin (2018-06-05 09:51:01)
> 
> On 04/06/2018 16:11, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-06-04 16:01:26)
> >>
> >> On 04/06/2018 14:03, Chris Wilson wrote:
> >>> Quoting Chris Wilson (2018-06-04 13:59:12)
> >>>> Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
> >>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>
> >>>>> As Chris has discovered on his Ivybridge, and later automated test runs
> >>>>> have confirmed, on most of our platforms hrtimer faced with heavy GPU load
> >>>>> ca occasionally become sufficiently imprecise to affect PMU sampling
> >>>>
> >>>> s/ca/can/
> >>>>
> >>>>> calculations.
> >>>>>
> >>>>> This means we cannot assume sampling frequency is what we asked for, but
> >>>>> we need to measure the interval ourselves.
> >>>>>
> >>>>> This patch is similar to Chris' original proposal for per-engine counters,
> >>>>> but instead of introducing a new set to work around the problem with
> >>>>> frequency sampling, it swaps around the way internal frequency accounting
> >>>>> is done. Instead of accumulating current frequency and dividing by
> >>>>> sampling frequency on readout, it accumulates frequency scaled by each
> >>>>> period.
> >>>>
> >>>> My ulterior motive failed to win favour ;)
> >>>>    
> >>>>> Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
> >>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>
> >>> I should point out that even with this fix (or rather my patch), we can
> >>
> >> I did not mean to steal it,
> > 
> > Don't mind, I view such patches as "this here is a problem, and what I
> > think can be done". I'm quite happy to be told "nope, the problem
> > is..."; the end result is we fix and move on.
> > 
> >> just that you seemed very uncompromising on
> >> the new counters approach. If you prefer you can respin your patch with
> >> this approach for frequency counters, it is fine by me.
> > 
> > I'm just not convinced yet by the frequency sampler, and I still feel
> > that we would be better off with cycles. I just haven't found a
> > convincing argument ;)
> 
> Just imagine .unit file has Mcycles instead of MHz in it? :)
> 
> Since we went with this when adding it initially I think we should 
> exercise restraint in deprecating and adding so quickly.
> 
> >>> still see errors of 25-30us, enough to fail the test. However, without
> >>> the fix our errors can be orders of magnitude worse (e.g. reporting 80us
> >>> busy instead of 500us).
> >>
> >> Yeah I guess if the timer is delayed it is delayed and all samples just
> >> won't be there. I don't see a way to fix that elegantly. Even practically.
> > 
> > Right, but it's not the case that we aren't sampling. If the timer was
> > delayed entirely, then we would see 0 (start_val == end_val). I'm not
> > sure exactly what goes wrong, but it probably is related to the timer
> > being unreliable. (It's just that in this case we are sampling a square
> > wave, so really hard to mess up!)
> 
> Hm maybe I am not completely understanding what you are seeing. I 
> thought that multiple timer overruns (aka delay by multiple periods) 
> explain everything.
> 
> Then even with this patch, if they happen to happen (!) at the end of a 
> sampling period (as observed from userspace), the large-ish chunk of 
> samples (depending on the total sampling duration) may be missing 
> (pending), and so still observed as fail.

Ah, but with the variable period timer, we can force the timer to run
before we report the sample (perf_event_read). That should still be inside
the busy batch, and it still has the same inaccuracy (~28us, for a small
sample population, measuring the tail is on my wishlist):

diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index 34cc6f6..8578a43 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -534,6 +534,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
                        val = engine->pmu.sample[sample].cur;
                }
        } else {
+               if (hrtimer_cancel(&i915->pmu.timer)) {
+                       i915_sample(&i915->pmu.timer);
+                       hrtimer_start_expires(&i915->pmu.timer,
+                                             HRTIMER_MODE_ABS);
+               }
+
                switch (event->attr.config) {
                case I915_PMU_ACTUAL_FREQUENCY:
                        val =

I haven't thought enough about why it still has the tail, nor measured 
the distribution of errors. I just wanted to warn that we aren't quite
out of the mess yet, but with the variable period we do at least stop
drowning!
 
> Difference after this patch is that when the timer eventually fires the 
> counter will account for the delay, so subsequent queries will be fine.
> 
> If readout waited for at least one real sampling tick, that would be 
> solved, but with several other undesirable consequences. At least it would
> slow down readout to 200Hz, but I don't know off the top of my head if
> it wouldn't cause unresolvable deadlock scenarios, or if it is even 
> technically possible within the perf framework. (Since I don't think we 
> should do it, I don't even want to start thinking about it.)

The above should do what you have in mind, I think.
-Chris
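
For reference, the accounting change described in the quoted commit message
amounts to something like the following sketch; the names are hypothetical,
not the actual i915 code:

	#include <stdint.h>

	struct freq_sample {
		uint64_t cur; /* accumulated frequency x time, MHz * ns */
	};

	/* Scale each reading by the period actually measured between
	 * hrtimer callbacks, rather than assuming a fixed period... */
	static void sample_freq(struct freq_sample *s, uint32_t freq_mhz,
				uint64_t period_ns)
	{
		s->cur += (uint64_t)freq_mhz * period_ns;
	}

	/* ...and divide by the total elapsed time on readout, so late
	 * or missed ticks no longer skew the reported average. */
	static uint64_t read_freq(const struct freq_sample *s,
				  uint64_t total_ns)
	{
		return total_ns ? s->cur / total_ns : 0; /* average MHz */
	}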

Comments

Tvrtko Ursulin June 5, 2018, 11:35 a.m. UTC | #1
On 05/06/2018 10:06, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-06-05 09:51:01)
>>
>> On 04/06/2018 16:11, Chris Wilson wrote:
>>> Quoting Tvrtko Ursulin (2018-06-04 16:01:26)
>>>>
>>>> On 04/06/2018 14:03, Chris Wilson wrote:
>>>>> Quoting Chris Wilson (2018-06-04 13:59:12)
>>>>>> Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
>>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>>
>>>>>>> As Chris has discovered on his Ivybridge, and later automated test runs
>>>>>>> have confirmed, on most of our platforms hrtimer faced with heavy GPU load
>>>>>>> ca occasionally become sufficiently imprecise to affect PMU sampling
>>>>>>
>>>>>> s/ca/can/
>>>>>>
>>>>>>> calculations.
>>>>>>>
>>>>>>> This means we cannot assume sampling frequency is what we asked for, but
>>>>>>> we need to measure the interval ourselves.
>>>>>>>
>>>>>>> This patch is similar to Chris' original proposal for per-engine counters,
>>>>>>> but instead of introducing a new set to work around the problem with
>>>>>>> frequency sampling, it swaps around the way internal frequency accounting
>>>>>>> is done. Instead of accumulating current frequency and dividing by
>>>>>>> sampling frequency on readout, it accumulates frequency scaled by each
>>>>>>> period.
>>>>>>
>>>>>> My ulterior motive failed to win favour ;)
>>>>>>     
>>>>>>> Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
>>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>
>>>>> I should point out that even with this fix (or rather my patch), we can
>>>>
>>>> I did not mean to steal it,
>>>
>>> Don't mind, I view such patches as "this here is a problem, and what I
>>> think can be done". I'm quite happy to be told "nope, the problem
>>> is..."; the end result is we fix and move on.
>>>
>>>> just that you seemed very uncompromising on
>>>> the new counters approach. If you prefer you can respin your patch with
>>>> this approach for frequency counters, it is fine by me.
>>>
>>> I'm just not convinced yet by the frequency sampler, and I still feel
>>> that we would be better off with cycles. I just haven't found a
>>> convincing argument ;)
>>
>> Just imagine .unit file has Mcycles instead of MHz in it? :)
>>
>> Since we went with this when adding it initially I think we should
>> exercise restraint in deprecating and adding so quickly.
>>
>>>>> still see errors of 25-30us, enough to fail the test. However, without
>>>>> the fix our errors can be orders of magnitude worse (e.g. reporting 80us
>>>>> busy instead of 500us).
>>>>
>>>> Yeah I guess if the timer is delayed it is delayed and all samples just
>>>> won't be there. I don't see a way to fix that elegantly. Even practically.
>>>
>>> Right, but it's not the case that we aren't sampling. If the timer was
>>> delayed entirely, then we would see 0 (start_val == end_val). I'm not
>>> sure exactly what goes wrong, but it probably is related to the timer
>>> being unreliable. (It's just that in this case we are sampling a square
>>> wave, so really hard to mess up!)
>>
>> Hm maybe I am not completely understanding what you are seeing. I
>> thought that multiple timer overruns (aka delay by multiple periods)
>> explain everything.
>>
>> Then even with this patch, if they happen to happen (!) at the end of a
>> sampling period (as observed from userspace), the large-ish chunk of
>> samples (depending on the total sampling duration) may be missing
>> (pending), and so still observed as fail.
> 
> Ah, but with the variable period timer, we can force the timer to run
> before we report the sample (perf_event_read). That should still be inside
> the busy batch, and it still has the same inaccuracy (~28us, for a small
> sample population, measuring the tail is on my wishlist):
> 
> diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> index 34cc6f6..8578a43 100644
> --- a/drivers/gpu/drm/i915/i915_pmu.c
> +++ b/drivers/gpu/drm/i915/i915_pmu.c
> @@ -534,6 +534,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
>                          val = engine->pmu.sample[sample].cur;
>                  }
>          } else {
> +               if (hrtimer_cancel(&i915->pmu.timer)) {
> +                       i915_sample(&i915->pmu.timer);
> +                       hrtimer_start_expires(&i915->pmu.timer,
> +                                             HRTIMER_MODE_ABS);
> +               }
> +
>                  switch (event->attr.config) {
>                  case I915_PMU_ACTUAL_FREQUENCY:
>                          val =
> 
> I haven't thought enough about why it still has the tail, nor measured
> the distribution of errors. I just wanted to warn that we aren't quite
> out of the mess yet, but with the variable period we do at least stop
> drowning!
>   
>> Difference after this patch is that when the timer eventually fires the
>> counter will account for the delay, so subsequent queries will be fine.
>>
>> If readout waited for at least one real sampling tick, that would be
>> solved, but with several other undesirable consequences. At least it would
>> slow down readout to 200Hz, but I don't know off the top of my head if
>> it wouldn't cause unresolvable deadlock scenarios, or if it is even
>> technically possible within the perf framework. (Since I don't think we
>> should do it, I don't even want to start thinking about it.)
> 
> The above should do what you have in mind, I think.

I think you got the direct sample criteria reversed - it should be if 
the timer wasn't running. If it was, it means it just exited so the 
sampling is already pretty correct.

Also I was thinking to limit this to only when the timer was delayed, so 
as to avoid many readers messing with a single timer a lot. In other words 
we keep accepting the sampling error (0.5% I think for a steady input 
signal) and only improve the timer delay scenario. See the sketch below.
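
Something like this, as a rough sketch (PERIOD_NS and the exact overdue
check are just illustrative, and it ignores the race between the check
and the cancel):

	if (ktime_after(ktime_get(),
			ktime_add_ns(hrtimer_get_expires(&i915->pmu.timer),
				     PERIOD_NS)) &&
	    hrtimer_cancel(&i915->pmu.timer)) {
		/* Timer overdue by more than a period - sample now and
		 * restart it, instead of waiting for the late callback. */
		i915_sample(&i915->pmu.timer);
		hrtimer_start_expires(&i915->pmu.timer, HRTIMER_MODE_ABS);
	}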

I'll send a patch to discuss in more detail.

Regards,

Tvrtko
Chris Wilson June 5, 2018, 11:42 a.m. UTC | #2
Quoting Tvrtko Ursulin (2018-06-05 12:35:36)
> 
> On 05/06/2018 10:06, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-06-05 09:51:01)
> >>
> >> On 04/06/2018 16:11, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2018-06-04 16:01:26)
> >>>>
> >>>> On 04/06/2018 14:03, Chris Wilson wrote:
> >>>>> Quoting Chris Wilson (2018-06-04 13:59:12)
> >>>>>> Quoting Tvrtko Ursulin (2018-06-04 13:52:39)
> >>>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>>>
> >>>>>>> As Chris has discovered on his Ivybridge, and later automated test runs
> >>>>>>> have confirmed, on most of our platforms hrtimer faced with heavy GPU load
> >>>>>>> ca occasionally become sufficiently imprecise to affect PMU sampling
> >>>>>>
> >>>>>> s/ca/can/
> >>>>>>
> >>>>>>> calculations.
> >>>>>>>
> >>>>>>> This means we cannot assume sampling frequency is what we asked for, but
> >>>>>>> we need to measure the interval ourselves.
> >>>>>>>
> >>>>>>> This patch is similar to Chris' original proposal for per-engine counters,
> >>>>>>> but instead of introducing a new set to work around the problem with
> >>>>>>> frequency sampling, it swaps around the way internal frequency accounting
> >>>>>>> is done. Instead of accumulating current frequency and dividing by
> >>>>>>> sampling frequency on readout, it accumulates frequency scaled by each
> >>>>>>> period.
> >>>>>>
> >>>>>> My ulterior motive failed to win favour ;)
> >>>>>>     
> >>>>>>> Testcase: igt/perf_pmu/*busy* # snb, ivb, hsw
> >>>>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>>> Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>>>
> >>>>> I should point out that even with this fix (or rather my patch), we can
> >>>>
> >>>> I did not mean to steal it,
> >>>
> >>> Don't mind, I view such patches as "this here is a problem, and what I
> >>> think can be done". I'm quite happy to be told "nope, the problem
> >>> is..."; the end result is we fix and move on.
> >>>
> >>>> just that you seemed very uncompromising on
> >>>> the new counters approach. If you prefer you can respin your patch with
> >>>> this approach for frequency counters, it is fine by me.
> >>>
> >>> I'm just not convinced yet by the frequency sampler, and I still feel
> >>> that we would be better off with cycles. I just haven't found a
> >>> convincing argument ;)
> >>
> >> Just imagine .unit file has Mcycles instead of MHz in it? :)
> >>
> >> Since we went with this when adding it initially I think we should
> >> exercise restraint in deprecating and adding so quickly.
> >>
> >>>>> still see errors of 25-30us, enough to fail the test. However, without
> >>>>> the fix our errors can be orders of magnitude worse (e.g. reporting 80us
> >>>>> busy instead of 500us).
> >>>>
> >>>> Yeah I guess if the timer is delayed it is delayed and all samples just
> >>>> won't be there. I don't see a way to fix that elegantly. Even practically.
> >>>
> >>> Right, but it's not the case that we aren't sampling. If the timer was
> >>> delayed entirely, then we would see 0 (start_val == end_val). I'm not
> >>> sure exactly what goes wrong, but it probably is related to the timer
> >>> being unreliable. (It's just that in this case we are sampling a square
> >>> wave, so really hard to mess up!)
> >>
> >> Hm maybe I am not completely understanding what you are seeing. I
> >> thought that multiple timer overruns (aka delay by multiple periods)
> >> explain everything.
> >>
> >> Then even with this patch, if they happen to happen (!) at the end of a
> >> sampling period (as observed from userspace), the large-ish chunk of
> >> samples (depending on the total sampling duration) may be missing
> >> (pending), and so still observed as fail.
> > 
> > Ah, but with the variable period timer, we can force the timer to run
> > before we report the sample (perf_event_read). That should still be inside
> > the busy batch, and it still has the same inaccuracy (~28us, for a small
> > sample population, measuring the tail is on my wishlist):
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
> > index 34cc6f6..8578a43 100644
> > --- a/drivers/gpu/drm/i915/i915_pmu.c
> > +++ b/drivers/gpu/drm/i915/i915_pmu.c
> > @@ -534,6 +534,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
> >                          val = engine->pmu.sample[sample].cur;
> >                  }
> >          } else {
> > +               if (hrtimer_cancel(&i915->pmu.timer)) {
> > +                       i915_sample(&i915->pmu.timer);
> > +                       hrtimer_start_expires(&i915->pmu.timer,
> > +                                             HRTIMER_MODE_ABS);
> > +               }
> > +
> >                  switch (event->attr.config) {
> >                  case I915_PMU_ACTUAL_FREQUENCY:
> >                          val =
> > 
> > I haven't thought enough about why it still has the tail, nor measured
> > the distribution of errors. I just wanted to warn that we aren't quite
> > out of the mess yet, but with the variable period we do at least stop
> > drowning!
> >   
> >> Difference after this patch is that when the timer eventually fires the
> >> counter will account for the delay, so subsequent queries will be fine.
> >>
> >> If readout waited for at least one real sampling tick, that would be
> >> solved, but with several other undesirable consequences. At least it would
> >> slow down readout to 200Hz, but I don't know off the top of my head if
> >> it wouldn't cause unresolvable deadlock scenarios, or if it is even
> >> technically possible within the perf framework. (Since I don't think we
> >> should do it, I don't even want to start thinking about it.)
> > 
> > The above should do what you have in mind, I think.
> 
> I think you got the direct sample criteria reversed - it should be if 
> the timer wasn't running. If it was, it means it just exited so the 
> sampling is already pretty correct.

Note it returns 1 if the timer was queued or active, 0 if the timer
wasn't, iirc; it's try_to_cancel that has the more complex interface.
(It should always be true along this path if we are using the timer.)
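
From memory (worth double-checking against <linux/hrtimer.h>):

	/*
	 * hrtimer_cancel(t)        - waits for a running callback to
	 *                            finish; returns 1 if the timer was
	 *                            queued or active, 0 otherwise.
	 *
	 * hrtimer_try_to_cancel(t) - does not wait; returns 0 or 1 as
	 *                            above, or -1 if the callback is
	 *                            currently running and could not
	 *                            be stopped.
	 */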

> Also I was thinking to limit this to only when the timer was delayed, so 
> as to avoid many readers messing with a single timer a lot. In other words 
> we keep accepting the sampling error (0.5% I think for a steady input 
> signal) and only improve the timer delay scenario.

You mean just if time_between_samples < 10us, return? It didn't seem
worth it. There are still a number of issues with the timer granularity
that make chasing edge cases ... disappointing.
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index 34cc6f6..8578a43 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -534,6 +534,12 @@ static u64 __i915_pmu_event_read(struct perf_event *event)
                        val = engine->pmu.sample[sample].cur;
                }
        } else {
+               if (hrtimer_cancel(&i915->pmu.timer)) {
+                       i915_sample(&i915->pmu.timer);
+                       hrtimer_start_expires(&i915->pmu.timer,
+                                             HRTIMER_MODE_ABS);
+               }
+
                switch (event->attr.config) {
                case I915_PMU_ACTUAL_FREQUENCY:
                        val =