arm64: perf_event: Fix time_offset for arch timer

Message ID 20200320093545.28227-1-leo.yan@linaro.org (mailing list archive)
State New, archived
Series: arm64: perf_event: Fix time_offset for arch timer

Commit Message

Leo Yan March 20, 2020, 9:35 a.m. UTC
The arch timer is usually enabled early during boot, well before the
kernel registers its sched clock, so its counter keeps incrementing
throughout the boot period.  As a result the arch timer's counter is not
fully accounted for by the sched clock, and there is a delta between the
arch timer's counter and the sched clock.  This delta should be stored in
userpg->time_offset, so that the perf tool in user space can later
retrieve it for sample timestamp calculation.

Currently userpg->time_offset is assigned the negated sched clock value
'-now'; this value does not reflect the delta between the arch timer's
counter and the sched clock, so perf cannot use it to calculate the
sample time.

To fix this, this patch calculates the delta between the arch timer's
counter and the sched clock and assigns it to userpg->time_offset.
Concretely, it first converts the counter to nanoseconds 'ns', then
calculates the offset as 'now' minus 'ns'.

        |<------------------- 'ns' ---------------------->|
                                |<-------- 'now' -------->|
        |<---- time_offset ---->|
        |-----------------------|-------------------------|
        ^                       ^                         ^
  Power on system     sched clock registration      Perf starts

Signed-off-by: Leo Yan <leo.yan@linaro.org>
---
 arch/arm64/kernel/perf_event.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)
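
For illustration, a minimal user-space sketch of how a consumer (such as
the perf tool) might combine these fields, assuming the time_offset
semantics proposed in this patch; the helper name is made up, and the
arithmetic follows the mult/shift convention documented for the perf
userpage:

#include <stdint.h>
#include <linux/perf_event.h>

/*
 * Convert a raw generic timer value (e.g. taken from an SPE record) into
 * a sched_clock-comparable timestamp, using the time_offset/time_mult/
 * time_shift fields read from the mmap'ed perf_event_mmap_page.
 */
static uint64_t cyc_to_sched_clock(const struct perf_event_mmap_page *pc,
				   uint64_t cyc)
{
	uint64_t quot = cyc >> pc->time_shift;
	uint64_t rem  = cyc & (((uint64_t)1 << pc->time_shift) - 1);

	return pc->time_offset +
	       quot * pc->time_mult +
	       ((rem * pc->time_mult) >> pc->time_shift);
}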

Comments

Leo Yan April 1, 2020, 1:24 a.m. UTC | #1
On Fri, Mar 20, 2020 at 05:35:45PM +0800, Leo Yan wrote:
> Between the system powering on and kernel's sched clock registration,
> the arch timer usually has been enabled at the early time and its
> counter is incremented during the period of the booting up.  Thus the
> arch timer's counter is not completely accounted into the sched clock,
> and has a delta between the arch timer's counter and sched clock.  This
> delta value should be stored into userpg->time_offset, which later can
> be retrieved by Perf tool in the user space for sample timestamp
> calculation.
> 
> Now userpg->time_offset is assigned to the negative sched clock with
> '-now', this value cannot reflect the delta between arch timer's counter
> and sched clock, so Perf cannot use it to calculate the sample time.
> 
> To fix this issue, this patch calculate the delta between the arch
> timer's and sched clock and assign the delta to userpg->time_offset.
> The detailed steps are firstly to convert counter to nanoseconds 'ns',
> then the offset is calculated as 'now' minus 'ns'.
> 
>         |<------------------- 'ns' ---------------------->|
>                                 |<-------- 'now' -------->|
>         |<---- time_offset ---->|
>         |-----------------------|-------------------------|
>         ^                       ^                         ^
>   Power on system     sched clock registration      Perf starts
> 
> Signed-off-by: Leo Yan <leo.yan@linaro.org>
> ---
>  arch/arm64/kernel/perf_event.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)

Gentle ping ...

Hi Mike R., Peter,

If possible, could you take a look at this patch?

Thank you,
Leo

> 
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index e40b65645c86..226d25d77072 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -1143,6 +1143,7 @@ void arch_perf_update_userpage(struct perf_event *event,
>  {
>  	u32 freq;
>  	u32 shift;
> +	u64 count, ns, quot, rem;
>  
>  	/*
>  	 * Internal timekeeping for enabled/running/stopped times
> @@ -1164,5 +1165,21 @@ void arch_perf_update_userpage(struct perf_event *event,
>  		userpg->time_mult >>= 1;
>  	}
>  	userpg->time_shift = (u16)shift;
> -	userpg->time_offset = -now;
> +
> +	/*
> +	 * Since arch timer is enabled ealier than sched clock registration,
> +	 * compuate the delta (in nanosecond unit) between the arch timer
> +	 * counter and sched clock, assign the delta to time_offset and
> +	 * perf tool can use it for timestamp calculation.
> +	 *
> +	 * The formula for conversion arch timer cycle to ns is:
> +	 *   quot = (cyc >> time_shift);
> +	 *   rem  = cyc & ((1 << time_shift) - 1);
> +	 *   ns   = quot * time_mult + ((rem * time_mult) >> time_shift);
> +	 */
> +	count = arch_timer_read_counter();
> +	quot = count >> shift;
> +	rem = count & ((1 << shift) - 1);
> +	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >> shift);
> +	userpg->time_offset = now - ns;
>  }
> -- 
> 2.17.1
>
Will Deacon April 30, 2020, 2:58 p.m. UTC | #2
Hi Leo,

[+Maz and tglx in case I'm barking up the wrong tree]

On Fri, Mar 20, 2020 at 05:35:45PM +0800, Leo Yan wrote:
> Between the system powering on and kernel's sched clock registration,
> the arch timer usually has been enabled at the early time and its
> counter is incremented during the period of the booting up.  Thus the
> arch timer's counter is not completely accounted into the sched clock,
> and has a delta between the arch timer's counter and sched clock.  This
> delta value should be stored into userpg->time_offset, which later can
> be retrieved by Perf tool in the user space for sample timestamp
> calculation.
> 
> Now userpg->time_offset is assigned to the negative sched clock with
> '-now', this value cannot reflect the delta between arch timer's counter
> and sched clock, so Perf cannot use it to calculate the sample time.
> 
> To fix this issue, this patch calculate the delta between the arch
> timer's and sched clock and assign the delta to userpg->time_offset.
> The detailed steps are firstly to convert counter to nanoseconds 'ns',
> then the offset is calculated as 'now' minus 'ns'.
> 
>         |<------------------- 'ns' ---------------------->|
>                                 |<-------- 'now' -------->|
>         |<---- time_offset ---->|
>         |-----------------------|-------------------------|
>         ^                       ^                         ^
>   Power on system     sched clock registration      Perf starts

FWIW, I'm /really/ struggling to understand the problem here.

If I've grokked it correctly (big 'if'), then you can't just factor in
what you call "time_offset" in the diagram above, because there isn't
a guarantee that the counter is zero-initialised at the start.

> Signed-off-by: Leo Yan <leo.yan@linaro.org>
> ---
>  arch/arm64/kernel/perf_event.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index e40b65645c86..226d25d77072 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -1143,6 +1143,7 @@ void arch_perf_update_userpage(struct perf_event *event,
>  {
>  	u32 freq;
>  	u32 shift;
> +	u64 count, ns, quot, rem;
>  
>  	/*
>  	 * Internal timekeeping for enabled/running/stopped times
> @@ -1164,5 +1165,21 @@ void arch_perf_update_userpage(struct perf_event *event,
>  		userpg->time_mult >>= 1;
>  	}
>  	userpg->time_shift = (u16)shift;
> -	userpg->time_offset = -now;
> +
> +	/*
> +	 * Since arch timer is enabled ealier than sched clock registration,
> +	 * compuate the delta (in nanosecond unit) between the arch timer
> +	 * counter and sched clock, assign the delta to time_offset and
> +	 * perf tool can use it for timestamp calculation.
> +	 *
> +	 * The formula for conversion arch timer cycle to ns is:
> +	 *   quot = (cyc >> time_shift);
> +	 *   rem  = cyc & ((1 << time_shift) - 1);
> +	 *   ns   = quot * time_mult + ((rem * time_mult) >> time_shift);
> +	 */
> +	count = arch_timer_read_counter();
> +	quot = count >> shift;
> +	rem = count & ((1 << shift) - 1);
> +	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >> shift);
> +	userpg->time_offset = now - ns;

Hmm, reading the counter and calculating the delta feels horribly
approximate to me. It would be much better if we could get hold of the
initial epoch cycles from the point at which sched_clock was initialised
using the counter. This represents the true cycle delta between the counter
and what sched_clock uses for 0 ns.

Unfortunately, I can't see a straightforward way to grab that information.
It looks like x86 pulls this directly from the TSC driver.

Will
Marc Zyngier April 30, 2020, 3:29 p.m. UTC | #3
On 2020-04-30 15:58, Will Deacon wrote:
> Hi Leo,
> 
> [+Maz and tglx in case I'm barking up the wrong tree]
> 
> On Fri, Mar 20, 2020 at 05:35:45PM +0800, Leo Yan wrote:
>> Between the system powering on and kernel's sched clock registration,
>> the arch timer usually has been enabled at the early time and its
>> counter is incremented during the period of the booting up.  Thus the
>> arch timer's counter is not completely accounted into the sched clock,
>> and has a delta between the arch timer's counter and sched clock.  
>> This
>> delta value should be stored into userpg->time_offset, which later can
>> be retrieved by Perf tool in the user space for sample timestamp
>> calculation.
>> 
>> Now userpg->time_offset is assigned to the negative sched clock with
>> '-now', this value cannot reflect the delta between arch timer's 
>> counter
>> and sched clock, so Perf cannot use it to calculate the sample time.
>> 
>> To fix this issue, this patch calculate the delta between the arch
>> timer's and sched clock and assign the delta to userpg->time_offset.
>> The detailed steps are firstly to convert counter to nanoseconds 'ns',
>> then the offset is calculated as 'now' minus 'ns'.
>> 
>>         |<------------------- 'ns' ---------------------->|
>>                                 |<-------- 'now' -------->|
>>         |<---- time_offset ---->|
>>         |-----------------------|-------------------------|
>>         ^                       ^                         ^
>>   Power on system     sched clock registration      Perf starts
> 
> FWIW, I'm /really/ struggling to understand the problem here.
> 
> If I've grokked it correctly (big 'if'), then you can't just factor in
> what you call "time_offset" in the diagram above, because there isn't
> a guarantee that the counter is zero-initialised at the start.

Even if it was, we have no idea of *when* that was. Think kexec, for a
start. Or spending some variable amount of time in firmware because of
$REASON.

> 
>> Signed-off-by: Leo Yan <leo.yan@linaro.org>
>> ---
>>  arch/arm64/kernel/perf_event.c | 19 ++++++++++++++++++-
>>  1 file changed, 18 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/arm64/kernel/perf_event.c 
>> b/arch/arm64/kernel/perf_event.c
>> index e40b65645c86..226d25d77072 100644
>> --- a/arch/arm64/kernel/perf_event.c
>> +++ b/arch/arm64/kernel/perf_event.c
>> @@ -1143,6 +1143,7 @@ void arch_perf_update_userpage(struct perf_event 
>> *event,
>>  {
>>  	u32 freq;
>>  	u32 shift;
>> +	u64 count, ns, quot, rem;
>> 
>>  	/*
>>  	 * Internal timekeeping for enabled/running/stopped times
>> @@ -1164,5 +1165,21 @@ void arch_perf_update_userpage(struct 
>> perf_event *event,
>>  		userpg->time_mult >>= 1;
>>  	}
>>  	userpg->time_shift = (u16)shift;
>> -	userpg->time_offset = -now;
>> +
>> +	/*
>> +	 * Since arch timer is enabled ealier than sched clock registration,
>> +	 * compuate the delta (in nanosecond unit) between the arch timer
>> +	 * counter and sched clock, assign the delta to time_offset and
>> +	 * perf tool can use it for timestamp calculation.
>> +	 *
>> +	 * The formula for conversion arch timer cycle to ns is:
>> +	 *   quot = (cyc >> time_shift);
>> +	 *   rem  = cyc & ((1 << time_shift) - 1);
>> +	 *   ns   = quot * time_mult + ((rem * time_mult) >> time_shift);
>> +	 */
>> +	count = arch_timer_read_counter();
>> +	quot = count >> shift;
>> +	rem = count & ((1 << shift) - 1);
>> +	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >> 
>> shift);
>> +	userpg->time_offset = now - ns;
> 
> Hmm, reading the counter and calculating the delta feels horribly
> approximate to me. It would be much better if we could get hold of the
> initial epoch cycles from the point at which sched_clock was 
> initialised
> using the counter. This represents the true cycle delta between the 
> counter
> and what sched_clock uses for 0 ns.

I think this is a sensible solution if you want an epoch that starts at 0
with sched_clock being initialized. The other question is whether it is
possible to use a different timestamping source for perf that wouldn't
need to be offset.

> Unfortunately, I can't see a straightforward way to grab that 
> information.
> It looks like x86 pulls this directly from the TSC driver.

I wonder if we could/should make __sched_clock_offset available even when
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
help with this particular can of worms...

         M.
Peter Zijlstra April 30, 2020, 4:04 p.m. UTC | #4
On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:

> I wonder if we could/should make __sched_clock_offset available even when
> CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
> help with this particular can or worm...

Errrgh. __sched_clock_offset is only needed on x86 because we transition
from one clock device to another on boot. It really shouldn't exist on
anything sane.

Let me try and understand your particular problem better.
Will Deacon April 30, 2020, 4:18 p.m. UTC | #5
On Thu, Apr 30, 2020 at 06:04:36PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:
> 
> > I wonder if we could/should make __sched_clock_offset available even when
> > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
> > help with this particular can or worm...
> 
> Errrgh. __sched_clock_offset is only needed on x86 because we transition
> from one clock device to another on boot. It really shouldn't exist on
> anything sane.

I think we still transition from jiffies on arm64, because we don't register
with sched_clock until the timer driver probes. Marc, is that right?

> Let me try and understand your particular problem better.

I think the long and short of it is that userspace needs a way to convert
the raw counter cycles into a ns value that can be compared against values
coming out of sched_clock. To do this accurately, I think it needs the
cycles value at the point when sched_clock was initialised.

Will
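
Spelling out the relationship Will describes (a sketch, using 'epoch_cyc'
and 'epoch_ns' for the counter value and sched_clock value captured at
registration, and ignoring mult/shift rounding):

	sched_clock(t) = epoch_ns + cyc_to_ns(cyc(t) - epoch_cyc)

so an exact userpg->time_offset would be 'epoch_ns - cyc_to_ns(epoch_cyc)',
which needs the epoch values themselves rather than a fresh counter read
at userpage-update time.
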
Peter Zijlstra April 30, 2020, 4:27 p.m. UTC | #6
On Thu, Apr 30, 2020 at 03:58:24PM +0100, Will Deacon wrote:
> On Fri, Mar 20, 2020 at 05:35:45PM +0800, Leo Yan wrote:

> > @@ -1164,5 +1165,21 @@ void arch_perf_update_userpage(struct perf_event *event,
> >  		userpg->time_mult >>= 1;
> >  	}
> >  	userpg->time_shift = (u16)shift;
> > -	userpg->time_offset = -now;
> > +
> > +	/*
> > +	 * Since arch timer is enabled ealier than sched clock registration,
> > +	 * compuate the delta (in nanosecond unit) between the arch timer
> > +	 * counter and sched clock, assign the delta to time_offset and
> > +	 * perf tool can use it for timestamp calculation.
> > +	 *
> > +	 * The formula for conversion arch timer cycle to ns is:
> > +	 *   quot = (cyc >> time_shift);
> > +	 *   rem  = cyc & ((1 << time_shift) - 1);
> > +	 *   ns   = quot * time_mult + ((rem * time_mult) >> time_shift);
> > +	 */
> > +	count = arch_timer_read_counter();
> > +	quot = count >> shift;
> > +	rem = count & ((1 << shift) - 1);
> > +	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >> shift);
> > +	userpg->time_offset = now - ns;
> 
> Hmm, reading the counter and calculating the delta feels horribly
> approximate to me. It would be much better if we could get hold of the
> initial epoch cycles from the point at which sched_clock was initialised
> using the counter. This represents the true cycle delta between the counter
> and what sched_clock uses for 0 ns.
> 
> Unfortunately, I can't see a straightforward way to grab that information.
> It looks like x86 pulls this directly from the TSC driver.

Yeah, and I'm thinking you should do the same. IIRC ARM uses this
kernel/time/sched_clock.c thing, and if I read that right, the struct
clock_data there has all the bits you need here.

So I'm thinking that you might want to add a helper function here to get
you the good stuff.
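
A rough sketch of the kind of helper Peter is suggesting; it is
hypothetical (no such API exists today) and assumes the clock_data /
clock_read_data internals of kernel/time/sched_clock.c, i.e. the
per-clock epoch_cyc/epoch_ns pair protected by the latch seqcount:

/*
 * Hypothetical helper for kernel/time/sched_clock.c -- not an existing
 * API.  Returns the cycle/ns pair sampled when the current sched_clock
 * source was registered, so perf could compute an exact time_offset.
 */
void sched_clock_get_epoch(u64 *epoch_cyc, u64 *epoch_ns)
{
	struct clock_read_data *rd;
	unsigned int seq;

	do {
		seq = raw_read_seqcount(&cd.seq);
		rd = cd.read_data + (seq & 1);
		*epoch_cyc = rd->epoch_cyc;
		*epoch_ns  = rd->epoch_ns;
	} while (read_seqcount_retry(&cd.seq, seq));
}
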
Will Deacon April 30, 2020, 4:45 p.m. UTC | #7
On Thu, Apr 30, 2020 at 06:27:50PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 30, 2020 at 03:58:24PM +0100, Will Deacon wrote:
> > On Fri, Mar 20, 2020 at 05:35:45PM +0800, Leo Yan wrote:
> > > +	/*
> > > +	 * Since arch timer is enabled ealier than sched clock registration,
> > > +	 * compuate the delta (in nanosecond unit) between the arch timer
> > > +	 * counter and sched clock, assign the delta to time_offset and
> > > +	 * perf tool can use it for timestamp calculation.
> > > +	 *
> > > +	 * The formula for conversion arch timer cycle to ns is:
> > > +	 *   quot = (cyc >> time_shift);
> > > +	 *   rem  = cyc & ((1 << time_shift) - 1);
> > > +	 *   ns   = quot * time_mult + ((rem * time_mult) >> time_shift);
> > > +	 */
> > > +	count = arch_timer_read_counter();
> > > +	quot = count >> shift;
> > > +	rem = count & ((1 << shift) - 1);
> > > +	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >> shift);
> > > +	userpg->time_offset = now - ns;
> > 
> > Hmm, reading the counter and calculating the delta feels horribly
> > approximate to me. It would be much better if we could get hold of the
> > initial epoch cycles from the point at which sched_clock was initialised
> > using the counter. This represents the true cycle delta between the counter
> > and what sched_clock uses for 0 ns.
> > 
> > Unfortunately, I can't see a straightforward way to grab that information.
> > It looks like x86 pulls this directly from the TSC driver.
> 
> Yeah, and I'm thinking you should do the same. IIRC ARM uses this
> kernel/time/sched_clock.c thing, and if I read that right, the struct
> clock_data there has all the bits you need here.
> 
> So I'm thinking that you might want to add a helper function here to get
> you the good stuff.

Thanks, Peter.

Leo -- do you think you could look at implementing this as part of a v2,
please?

Will
Marc Zyngier April 30, 2020, 5:33 p.m. UTC | #8
On 2020-04-30 17:18, Will Deacon wrote:
> On Thu, Apr 30, 2020 at 06:04:36PM +0200, Peter Zijlstra wrote:
>> On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:
>> 
>> > I wonder if we could/should make __sched_clock_offset available even when
>> > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
>> > help with this particular can or worm...
>> 
>> Errrgh. __sched_clock_offset is only needed on x86 because we 
>> transition
>> from one clock device to another on boot. It really shouldn't exist on
>> anything sane.
> 
> I think we still transition from jiffies on arm64, because we don't 
> register
> with sched_clock until the timer driver probes. Marc, is that right?

Indeed. The clocksource is only available relatively late, as we need to
discover the details of the platform and enable the various workarounds
(because nobody can get a simple 64bit counter right). So it is only at
that stage that we transition to it.

         M.
Leo Yan May 1, 2020, 3:14 p.m. UTC | #9
Hi all,

On Thu, Apr 30, 2020 at 05:18:15PM +0100, Will Deacon wrote:
> On Thu, Apr 30, 2020 at 06:04:36PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:
> > 
> > > I wonder if we could/should make __sched_clock_offset available even when
> > > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
> > > help with this particular can or worm...
> > 
> > Errrgh. __sched_clock_offset is only needed on x86 because we transition
> > from one clock device to another on boot. It really shouldn't exist on
> > anything sane.
> 
> I think we still transition from jiffies on arm64, because we don't register
> with sched_clock until the timer driver probes. Marc, is that right?
> 
> > Let me try and understand your particular problem better.
> 
> I think the long and short of it is that userspace needs a way to convert
> the raw counter cycles into a ns value that can be compared against values
> coming out of sched_clock. To do this accurately, I think it needs the
> cycles value at the point when sched_clock was initialised.

Will's understanding is exactly the problem I want to resolve in this patch.

The background is ARM SPE [1] decoding with the perf tool: when the
timestamp is enabled, SPE uses the generic timer's counter as its
timestamp source.  SPE trace data only contains the raw counter cycles,
and, as Will mentioned, the perf tool needs to convert them into values
in the sched_clock coordinate.  This is why this patch tries to calculate
the offset between the raw counter's ns value and sched_clock; eventually
this offset value will be used by the SPE decoding code in the perf tool
to calibrate a 'correct' timestamp.

Based on your suggestions, I will use __sched_clock_offset to resolve
the accuracy issue in patch v2.  (I noticed Peter suggested using a new
API wrapping the clock_data structure; IIUC, __sched_clock_offset is
more straightforward for this case.)

Please correct me if I have missed anything.  Thank you for the review and
suggestions!

Thanks,
Leo

[1] https://lkml.org/lkml/2020/1/23/571
Will Deacon May 1, 2020, 3:26 p.m. UTC | #10
On Fri, May 01, 2020 at 11:14:48PM +0800, Leo Yan wrote:
> On Thu, Apr 30, 2020 at 05:18:15PM +0100, Will Deacon wrote:
> > On Thu, Apr 30, 2020 at 06:04:36PM +0200, Peter Zijlstra wrote:
> > > On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:
> > > 
> > > > I wonder if we could/should make __sched_clock_offset available even when
> > > > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
> > > > help with this particular can or worm...
> > > 
> > > Errrgh. __sched_clock_offset is only needed on x86 because we transition
> > > from one clock device to another on boot. It really shouldn't exist on
> > > anything sane.
> > 
> > I think we still transition from jiffies on arm64, because we don't register
> > with sched_clock until the timer driver probes. Marc, is that right?
> > 
> > > Let me try and understand your particular problem better.
> > 
> > I think the long and short of it is that userspace needs a way to convert
> > the raw counter cycles into a ns value that can be compared against values
> > coming out of sched_clock. To do this accurately, I think it needs the
> > cycles value at the point when sched_clock was initialised.
> 
> Will's understanding is exactly what I want to resolve in this patch.
> 
> The background info is for the ARM SPE [1] decoding with perf tool, if
> the timestamp is enabled, it uses the generic timer's counter as
> timestamp source.  SPE trace data only contains the raw counter cycles,
> as Will mentioned, the perf tool needs to convert it to a coordinate
> value with sched_clock.  This is why this patch tries to calculate the
> offset between the raw counter's ns value and sched_clock, eventually
> this offset value will be used by SPE's decoding code in Perf tool to
> calibrate a 'correct' timestamp.
> 
> Based on your suggestions, I will use __sched_clock_offset to resolve
> the accuracy issue in patch v2.  (I noticed Peter suggested to use a
> new API for wrapping clock_data structure, IIUC, __sched_clock_offset
> is more straightforward for this case).
> 
> Please correct if I miss anything.  Thank you for reviewing and
> suggestions!

I don't think you can use __sched_clock_offset without selecting
HAVE_UNSTABLE_SCHED_CLOCK, and we really don't want to do that just
for this. So Peter's idea about exposing what we need is better, although
you'll probably need to take care with the switch-over from jiffies.

It needs some thought, but one possibility would be to introduce a new
variant of the sched_clock_register() function that returns the cycle
offset, and then we could fish that out of the timer driver. If we're
crossing all the 'i's and dotting all the 't's then we'd want to disable the
perf userpage if sched_clock changes clocksource too (a bit like we do for
the vDSO).

Will
Marc Zyngier May 1, 2020, 3:29 p.m. UTC | #11
On 2020-05-01 16:14, Leo Yan wrote:
> Hi all,
> 
> On Thu, Apr 30, 2020 at 05:18:15PM +0100, Will Deacon wrote:
>> On Thu, Apr 30, 2020 at 06:04:36PM +0200, Peter Zijlstra wrote:
>> > On Thu, Apr 30, 2020 at 04:29:23PM +0100, Marc Zyngier wrote:
>> >
>> > > I wonder if we could/should make __sched_clock_offset available even when
>> > > CONFIG_HAVE_UNSTABLE_SCHED_CLOCK isn't defined. It feels like it would
>> > > help with this particular can or worm...
>> >
>> > Errrgh. __sched_clock_offset is only needed on x86 because we transition
>> > from one clock device to another on boot. It really shouldn't exist on
>> > anything sane.
>> 
>> I think we still transition from jiffies on arm64, because we don't 
>> register
>> with sched_clock until the timer driver probes. Marc, is that right?
>> 
>> > Let me try and understand your particular problem better.
>> 
>> I think the long and short of it is that userspace needs a way to 
>> convert
>> the raw counter cycles into a ns value that can be compared against 
>> values
>> coming out of sched_clock. To do this accurately, I think it needs the
>> cycles value at the point when sched_clock was initialised.
> 
> Will's understanding is exactly what I want to resolve in this patch.
> 
> The background info is for the ARM SPE [1] decoding with perf tool, if
> the timestamp is enabled, it uses the generic timer's counter as
> timestamp source.  SPE trace data only contains the raw counter cycles,
> as Will mentioned, the perf tool needs to convert it to a coordinate
> value with sched_clock.  This is why this patch tries to calculate the
> offset between the raw counter's ns value and sched_clock, eventually
> this offset value will be used by SPE's decoding code in Perf tool to
> calibrate a 'correct' timestamp.
> 
> Based on your suggestions, I will use __sched_clock_offset to resolve
> the accuracy issue in patch v2.  (I noticed Peter suggested to use a
> new API for wrapping clock_data structure, IIUC, __sched_clock_offset
> is more straightforward for this case).

I think Peter's point was *not* to use __sched_clock_offset which
appears to be only there for the purpose of an x86 workaround (and not
available to other architectures), but to instead expose the relevant
field (epoch_cyc) to the perf subsystem.

I feel it makes sense to make this a generic API, and see whether x86
can move over to it rather than the other way around.

Thanks,

         M.
Leo Yan May 1, 2020, 4:10 p.m. UTC | #12
On Fri, May 01, 2020 at 04:26:09PM +0100, Will Deacon wrote:

[...]

> > > > Let me try and understand your particular problem better.
> > > 
> > > I think the long and short of it is that userspace needs a way to convert
> > > the raw counter cycles into a ns value that can be compared against values
> > > coming out of sched_clock. To do this accurately, I think it needs the
> > > cycles value at the point when sched_clock was initialised.
> > 
> > Will's understanding is exactly what I want to resolve in this patch.
> > 
> > The background info is for the ARM SPE [1] decoding with perf tool, if
> > the timestamp is enabled, it uses the generic timer's counter as
> > timestamp source.  SPE trace data only contains the raw counter cycles,
> > as Will mentioned, the perf tool needs to convert it to a coordinate
> > value with sched_clock.  This is why this patch tries to calculate the
> > offset between the raw counter's ns value and sched_clock, eventually
> > this offset value will be used by SPE's decoding code in Perf tool to
> > calibrate a 'correct' timestamp.
> > 
> > Based on your suggestions, I will use __sched_clock_offset to resolve
> > the accuracy issue in patch v2.  (I noticed Peter suggested to use a
> > new API for wrapping clock_data structure, IIUC, __sched_clock_offset
> > is more straightforward for this case).
> > 
> > Please correct if I miss anything.  Thank you for reviewing and
> > suggestions!
> 
> I don't think you can use __sched_clock_offset without selecting
> HAVE_UNSTABLE_SCHED_CLOCK, and we really don't want to do that just
> for this. So Peter's idea about exposing what we need is better, although
> you'll probably need to take care with the switch-over from jiffies.
> 
> It needs some thought, but one possibility would be to introduce a new
> variant of sthe ched_clock_register() function that returns the cycle
> offset, and then we could fish that out of the timer driver.

Thanks a lot for you and Marc for correction.

> If we're
> crossing all the 'i's and dotting all the 't's then we'd want to disable the
> perf userpage if sched_clock changes clocksource too (a bit like we do for
> the vDSO).

To be honest, one thing that is not clear to me is how the perf tool
would update the arch timer's parameters in the middle of tracing, after
the userpage is disabled and re-enabled.  I will note this down and look
into the detailed implementation for this part.

Thanks for sharing comprehensive thoughts!

Leo
Will Deacon May 1, 2020, 5:13 p.m. UTC | #13
On Sat, May 02, 2020 at 12:10:50AM +0800, Leo Yan wrote:
> On Fri, May 01, 2020 at 04:26:09PM +0100, Will Deacon wrote:
> 
> [...]
> 
> > > > > Let me try and understand your particular problem better.
> > > > 
> > > > I think the long and short of it is that userspace needs a way to convert
> > > > the raw counter cycles into a ns value that can be compared against values
> > > > coming out of sched_clock. To do this accurately, I think it needs the
> > > > cycles value at the point when sched_clock was initialised.
> > > 
> > > Will's understanding is exactly what I want to resolve in this patch.
> > > 
> > > The background info is for the ARM SPE [1] decoding with perf tool, if
> > > the timestamp is enabled, it uses the generic timer's counter as
> > > timestamp source.  SPE trace data only contains the raw counter cycles,
> > > as Will mentioned, the perf tool needs to convert it to a coordinate
> > > value with sched_clock.  This is why this patch tries to calculate the
> > > offset between the raw counter's ns value and sched_clock, eventually
> > > this offset value will be used by SPE's decoding code in Perf tool to
> > > calibrate a 'correct' timestamp.
> > > 
> > > Based on your suggestions, I will use __sched_clock_offset to resolve
> > > the accuracy issue in patch v2.  (I noticed Peter suggested to use a
> > > new API for wrapping clock_data structure, IIUC, __sched_clock_offset
> > > is more straightforward for this case).
> > > 
> > > Please correct if I miss anything.  Thank you for reviewing and
> > > suggestions!
> > 
> > I don't think you can use __sched_clock_offset without selecting
> > HAVE_UNSTABLE_SCHED_CLOCK, and we really don't want to do that just
> > for this. So Peter's idea about exposing what we need is better, although
> > you'll probably need to take care with the switch-over from jiffies.
> > 
> > It needs some thought, but one possibility would be to introduce a new
> > variant of sthe ched_clock_register() function that returns the cycle
> > offset, and then we could fish that out of the timer driver.
> 
> Thanks a lot for you and Marc for correction.
> 
> > If we're
> > crossing all the 'i's and dotting all the 't's then we'd want to disable the
> > perf userpage if sched_clock changes clocksource too (a bit like we do for
> > the vDSO).
> 
> To be honest, one thing is not clear for me is how the perf tool to
> update the arch timer's parameters in the middle of tracing after
> disable and re-enable per userpage.  I will note for this and look
> into detailed implementation for this part.

I don't fully understand the concern but, generally, the seqlock should
take care of any inconsistencies in the data page.

Will
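
For reference, a sketch of the reader side of that seqlock, following the
retry loop documented in the comments of include/uapi/linux/perf_event.h
(the helper name and the use of __sync_synchronize() as the barrier are
illustrative choices):

#include <stdint.h>
#include <linux/perf_event.h>

/*
 * Snapshot the time conversion fields from the mmap'ed userpage; the
 * kernel increments pc->lock before and after each update, so retry if
 * the sequence number changed while we were reading.
 */
static void read_time_fields(const volatile struct perf_event_mmap_page *pc,
			     uint64_t *offset, uint32_t *mult,
			     uint16_t *shift)
{
	uint32_t seq;

	do {
		seq = pc->lock;
		__sync_synchronize();

		*offset = pc->time_offset;
		*mult   = pc->time_mult;
		*shift  = pc->time_shift;

		__sync_synchronize();
	} while (pc->lock != seq);
}
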

Patch

diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index e40b65645c86..226d25d77072 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -1143,6 +1143,7 @@  void arch_perf_update_userpage(struct perf_event *event,
 {
 	u32 freq;
 	u32 shift;
+	u64 count, ns, quot, rem;
 
 	/*
 	 * Internal timekeeping for enabled/running/stopped times
@@ -1164,5 +1165,21 @@  void arch_perf_update_userpage(struct perf_event *event,
 		userpg->time_mult >>= 1;
 	}
 	userpg->time_shift = (u16)shift;
-	userpg->time_offset = -now;
+
+	/*
+	 * Since the arch timer is enabled earlier than the sched clock
+	 * registration, compute the delta (in nanoseconds) between the arch
+	 * timer counter and the sched clock, and assign the delta to
+	 * time_offset so the perf tool can use it for timestamp calculation.
+	 *
+	 * The formula for converting arch timer cycles to ns is:
+	 *   quot = (cyc >> time_shift);
+	 *   rem  = cyc & ((1 << time_shift) - 1);
+	 *   ns   = quot * time_mult + ((rem * time_mult) >> time_shift);
+	 */
+	count = arch_timer_read_counter();
+	quot = count >> shift;
+	rem = count & ((1 << shift) - 1);
+	ns = quot * userpg->time_mult + ((rem * userpg->time_mult) >> shift);
+	userpg->time_offset = now - ns;
 }