[PATCHv2,0/3] clocksource: add db8500 PRCMU timer
diff mbox

Message ID 20110602094622.GS3660@n2100.arm.linux.org.uk
State New, archived
Headers show

Commit Message

Russell King - ARM Linux June 2, 2011, 9:46 a.m. UTC
On Thu, Jun 02, 2011 at 11:34:31AM +0200, Mattias Wallin wrote:
> The Multi Timer Unit (MTU) is currently used as clocksource and sched_clk
> for the u8500 machine. The MTU block loose power during cpuidle sleep states
> so an alternate clocksource is needed and these patches adds the db8500 PRCMU
> timer.

Why don't we just find a way of fixing sched_clock so that the value
doesn't reset over a suspend/resume cycle?  IOW, lets fix the problem
for _everyone_ rather than only fixing it for one platform at a time.

Could you try this patch to check whether sched_clock() behaves better
across a suspend/resume cycle please?

 arch/arm/kernel/sched_clock.c |   18 ++++++++++++++++++
 1 files changed, 18 insertions(+), 0 deletions(-)

Comments

Mattias Wallin June 2, 2011, 10:18 a.m. UTC | #1
On 06/02/2011 11:46 AM, Russell King - ARM Linux wrote:
> Why don't we just find a way of fixing sched_clock so that the value
> doesn't reset over a suspend/resume cycle?
Even if the value isn't reset during suspend/resume we want the 
clocksource to keep counting. Or is it ok to have the clocksource stop 
or freeze during suspend?
Then we have cpuidle. Is it ok to stop/freeze the timer during cpuidle 
sleep states?
My view is that sched_clock and the clocksource read should return the 
time since boot and keep counting during sleep and suspend. (In this 
case we use the same timer for both sched_clk and clocksource.)
> IOW, lets fix the problem
> for_everyone_  rather than only fixing it for one platform at a time.
I agree that fixing it for everyone would be better and I can give it a 
try if I get pointed in the right direction.
Russell King - ARM Linux June 2, 2011, 11:01 a.m. UTC | #2
On Thu, Jun 02, 2011 at 12:18:35PM +0200, Mattias Wallin wrote:
> On 06/02/2011 11:46 AM, Russell King - ARM Linux wrote:
>> Why don't we just find a way of fixing sched_clock so that the value
>> doesn't reset over a suspend/resume cycle?
> Even if the value isn't reset during suspend/resume we want the  
> clocksource to keep counting. Or is it ok to have the clocksource stop  
> or freeze during suspend?

kernel/time/timekeeping.c:timekeeping_suspend():

        timekeeping_forward_now();

which does:
        cycle_now = clock->read(clock);
        cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
        clock->cycle_last = cycle_now;

So that updates the time with the current offset between cycle_last and
the current value.

kernel/time/timekeeping.c:timekeeping_resume():
        /* re-base the last cycle value */
        timekeeper.clock->cycle_last = timekeeper.clock->read(timekeeper.clock);

So this re-sets cycle_last to the current value of the clocksource.  This
means that on resume, the clocksource can start counting from any value it
likes.

So, without any additional external inputs, time resumes incrementing at
the point where the suspend occurred without any jump backwards or forwards.

The code accounts for the sleep time by using read_persistent_clock() read
a timer which continues running during sleep to calculate the delta between
suspend and resume, and injects the delta between them to wind the time
forward.

> Then we have cpuidle. Is it ok to stop/freeze the timer during cpuidle  
> sleep states?

During _idle_ (iow, no task running) sched_clock and the clocksource
should both continue to run - the scheduler needs to know how long the
system has been idle for, and the clocksource can't stop because we'll
lose track of time.

Remember that the clockevent stuff is used as a trigger to the timekeeping
code to read the clocksource, and update the current time.  Time is moved
forward by the delta between a previous clocksource read and the current
clocksource read.  So stopping or resetting the clocksource in unexpected
ways (other than over suspend/resume as mentioned above) will result in
time going weird.
Mattias Wallin June 2, 2011, 12:10 p.m. UTC | #3
On 06/02/2011 01:01 PM, Russell King - ARM Linux wrote:
> On Thu, Jun 02, 2011 at 12:18:35PM +0200, Mattias Wallin wrote:
>> On 06/02/2011 11:46 AM, Russell King - ARM Linux wrote:
>>> Why don't we just find a way of fixing sched_clock so that the value
>>> doesn't reset over a suspend/resume cycle?
>> Even if the value isn't reset during suspend/resume we want the
>> clocksource to keep counting. Or is it ok to have the clocksource stop
>> or freeze during suspend?
>
> kernel/time/timekeeping.c:timekeeping_suspend():
>
>          timekeeping_forward_now();
>
> which does:
>          cycle_now = clock->read(clock);
>          cycle_delta = (cycle_now - clock->cycle_last)&  clock->mask;
>          clock->cycle_last = cycle_now;
>
> So that updates the time with the current offset between cycle_last and
> the current value.
>
> kernel/time/timekeeping.c:timekeeping_resume():
>          /* re-base the last cycle value */
>          timekeeper.clock->cycle_last = timekeeper.clock->read(timekeeper.clock);
>
> So this re-sets cycle_last to the current value of the clocksource.  This
> means that on resume, the clocksource can start counting from any value it
> likes.
>
> So, without any additional external inputs, time resumes incrementing at
> the point where the suspend occurred without any jump backwards or forwards.
>
> The code accounts for the sleep time by using read_persistent_clock() read
> a timer which continues running during sleep to calculate the delta between
> suspend and resume, and injects the delta between them to wind the time
> forward.
>
>> Then we have cpuidle. Is it ok to stop/freeze the timer during cpuidle
>> sleep states?
>
> During _idle_ (iow, no task running) sched_clock and the clocksource
> should both continue to run - the scheduler needs to know how long the
> system has been idle for, and the clocksource can't stop because we'll
> lose track of time.
>
> Remember that the clockevent stuff is used as a trigger to the timekeeping
> code to read the clocksource, and update the current time.  Time is moved
> forward by the delta between a previous clocksource read and the current
> clocksource read.  So stopping or resetting the clocksource in unexpected
> ways (other than over suspend/resume as mentioned above) will result in
> time going weird.

Hmm, I have missed the existence of the read_persistent_clock(). It 
sounds like I should keep the MTU as my clocksource / sched_clock and 
have the PRCMU Timer as a persistent_clock instead.

Then one problem remains. The MTU will be powered during cstates: 
running, wfi, ApIdle (arm retenetion). The MTU will loose power during 
cstates ApSleep and ApDeepSleep. So I need to do a similar sync as 
suspend does against the persistent_clock but when leaving and enter the 
deeper cstates.

Should I solve it in the clocksource framework with a flag telling which 
cstates the timer will stop/freeze and then inject time from the 
persistent_clock for those cstates? (I am thinking something like the 
CLOCK_EVT_FEAT_C3STOP flag)

Am I on the wrong track here or how should I solve it?
Santosh Shilimkar June 2, 2011, 12:57 p.m. UTC | #4
+ John,

On 6/2/2011 5:40 PM, Mattias Wallin wrote:
> On 06/02/2011 01:01 PM, Russell King - ARM Linux wrote:
>> On Thu, Jun 02, 2011 at 12:18:35PM +0200, Mattias Wallin wrote:
>>> On 06/02/2011 11:46 AM, Russell King - ARM Linux wrote:
>>>> Why don't we just find a way of fixing sched_clock so that the value
>>>> doesn't reset over a suspend/resume cycle?
>>> Even if the value isn't reset during suspend/resume we want the
>>> clocksource to keep counting. Or is it ok to have the clocksource stop
>>> or freeze during suspend?
>>
>> kernel/time/timekeeping.c:timekeeping_suspend():
>>
>> timekeeping_forward_now();
>>
>> which does:
>> cycle_now = clock->read(clock);
>> cycle_delta = (cycle_now - clock->cycle_last)& clock->mask;
>> clock->cycle_last = cycle_now;
>>
>> So that updates the time with the current offset between cycle_last and
>> the current value.
>>
>> kernel/time/timekeeping.c:timekeeping_resume():
>> /* re-base the last cycle value */
>> timekeeper.clock->cycle_last = timekeeper.clock->read(timekeeper.clock);
>>
>> So this re-sets cycle_last to the current value of the clocksource. This
>> means that on resume, the clocksource can start counting from any
>> value it
>> likes.
>>
>> So, without any additional external inputs, time resumes incrementing at
>> the point where the suspend occurred without any jump backwards or
>> forwards.
>>
>> The code accounts for the sleep time by using read_persistent_clock()
>> read
>> a timer which continues running during sleep to calculate the delta
>> between
>> suspend and resume, and injects the delta between them to wind the time
>> forward.
>>
>>> Then we have cpuidle. Is it ok to stop/freeze the timer during cpuidle
>>> sleep states?
>>
>> During _idle_ (iow, no task running) sched_clock and the clocksource
>> should both continue to run - the scheduler needs to know how long the
>> system has been idle for, and the clocksource can't stop because we'll
>> lose track of time.
>>
>> Remember that the clockevent stuff is used as a trigger to the
>> timekeeping
>> code to read the clocksource, and update the current time. Time is moved
>> forward by the delta between a previous clocksource read and the current
>> clocksource read. So stopping or resetting the clocksource in unexpected
>> ways (other than over suspend/resume as mentioned above) will result in
>> time going weird.
>
> Hmm, I have missed the existence of the read_persistent_clock(). It
> sounds like I should keep the MTU as my clocksource / sched_clock and
> have the PRCMU Timer as a persistent_clock instead.
>
> Then one problem remains. The MTU will be powered during cstates:
> running, wfi, ApIdle (arm retenetion). The MTU will loose power during
> cstates ApSleep and ApDeepSleep. So I need to do a similar sync as
> suspend does against the persistent_clock but when leaving and enter the
> deeper cstates.
>
> Should I solve it in the clocksource framework with a flag telling which
> cstates the timer will stop/freeze and then inject time from the
> persistent_clock for those cstates? (I am thinking something like the
> CLOCK_EVT_FEAT_C3STOP flag)
>
> Am I on the wrong track here or how should I solve it?
>
IIUC, what you are trying here is to use high-precision clock-source
but since it doesn't work in low power modes, you want it to supplement
with always running low resolution timer.

Now just making the persistent_clock() read from low-resolution timer
is not going to help. Because there is no reference available for
the kernel on whatever counting is done by the low-resolution timer.
In other words, it has to be a registered clock-source first.

Earlier this year at ELC SFO, I had a discussion with
John and Thomas on how to have a high-resolution clock-source
and a low-resolution clock-source working together to cover
the low power scenario and still manage to get the highest
timer resolution.
The idea was to do dynamic switching of clock-source
which initially looked simple. Here the idea was to
have this working for suspend and as well as cupidle.

John mentioned that because of frequent clock-source
switching, will affect the NTP correction badly to an
extent that NTP correction may not work.

Here is what John suggested to do but I got busy with
other stuff and this one got pushed out from my todo.

--------------------
John wrote ...
A different approach would be to create a meta-clocksource, that
utilizes the two underlying clocks to create a what looks like a unified
counter.

Basically you use the slow-always running counter as your long-term freq
adjusted clock, but supplement its low granularity with the highres
halting clock.

This works best if both counters are driven by the same crystal, so
there won't be much drift between them.
----------------------

This approach should solve most of the issues and get
the functionality what you are looking for.

If you like, you can work on this scheme.

Regards
Santosh
Russell King - ARM Linux June 2, 2011, 1:04 p.m. UTC | #5
On Thu, Jun 02, 2011 at 06:27:22PM +0530, Santosh Shilimkar wrote:
> Earlier this year at ELC SFO, I had a discussion with
> John and Thomas on how to have a high-resolution clock-source
> and a low-resolution clock-source working together to cover
> the low power scenario and still manage to get the highest
> timer resolution.
> The idea was to do dynamic switching of clock-source
> which initially looked simple. Here the idea was to
> have this working for suspend and as well as cupidle.

I don't think you can do this, because you'll lose precision whenever you
switch from the high resolution clocksource to the low resolution
clocksource.

While you can quickly update the current time of day from the highres one,
the lowres may or may not be a long way from its next tick rollover.  So
you lose the precision whenever you switch.

However, over a suspend/resume cycle, the precision loss is normally very
small compared to the time which you have been suspended, so the %age
error also becomes very small.

With cpuidle, it becomes a completely different matter.  Here, the %age
error is much larger because of the smaller sleep periods, and chances
are we can't wait for the low-res timer to change.

So if you're using cpuidle, you really need a clocksource which runs
continuously, even in whatever states cpuidle drops you into.
Santosh Shilimkar June 2, 2011, 1:16 p.m. UTC | #6
On 6/2/2011 6:34 PM, Russell King - ARM Linux wrote:
> On Thu, Jun 02, 2011 at 06:27:22PM +0530, Santosh Shilimkar wrote:
>> Earlier this year at ELC SFO, I had a discussion with
>> John and Thomas on how to have a high-resolution clock-source
>> and a low-resolution clock-source working together to cover
>> the low power scenario and still manage to get the highest
>> timer resolution.
>> The idea was to do dynamic switching of clock-source
>> which initially looked simple. Here the idea was to
>> have this working for suspend and as well as cupidle.
>
> I don't think you can do this, because you'll lose precision whenever you
> switch from the high resolution clocksource to the low resolution
> clocksource.
>
> While you can quickly update the current time of day from the highres one,
> the lowres may or may not be a long way from its next tick rollover.  So
> you lose the precision whenever you switch.
>
> However, over a suspend/resume cycle, the precision loss is normally very
> small compared to the time which you have been suspended, so the %age
> error also becomes very small.
>
> With cpuidle, it becomes a completely different matter.  Here, the %age
> error is much larger because of the smaller sleep periods, and chances
> are we can't wait for the low-res timer to change.
>
> So if you're using cpuidle, you really need a clocksource which runs
> continuously, even in whatever states cpuidle drops you into.

According to John, that's what the meta-clock source will for.
It will be continuous and will make use of underneath high res,
low res clock-sources based on the availability.

I let John comment on this on details but I think any other
method would have shortcoming.

Regards
Santosh
John Stultz June 2, 2011, 6:47 p.m. UTC | #7
On Thu, 2011-06-02 at 18:27 +0530, Santosh Shilimkar wrote:
> On 6/2/2011 5:40 PM, Mattias Wallin wrote:
> > On 06/02/2011 01:01 PM, Russell King - ARM Linux wrote:
> >> On Thu, Jun 02, 2011 at 12:18:35PM +0200, Mattias Wallin wrote:
> >>> On 06/02/2011 11:46 AM, Russell King - ARM Linux wrote:
> >>>> Why don't we just find a way of fixing sched_clock so that the value
> >>>> doesn't reset over a suspend/resume cycle?
> >>> Even if the value isn't reset during suspend/resume we want the
> >>> clocksource to keep counting. Or is it ok to have the clocksource stop
> >>> or freeze during suspend?
> >>
> >> kernel/time/timekeeping.c:timekeeping_suspend():
> >>
> >> timekeeping_forward_now();
> >>
> >> which does:
> >> cycle_now = clock->read(clock);
> >> cycle_delta = (cycle_now - clock->cycle_last)& clock->mask;
> >> clock->cycle_last = cycle_now;
> >>
> >> So that updates the time with the current offset between cycle_last and
> >> the current value.
> >>
> >> kernel/time/timekeeping.c:timekeeping_resume():
> >> /* re-base the last cycle value */
> >> timekeeper.clock->cycle_last = timekeeper.clock->read(timekeeper.clock);
> >>
> >> So this re-sets cycle_last to the current value of the clocksource. This
> >> means that on resume, the clocksource can start counting from any
> >> value it
> >> likes.
> >>
> >> So, without any additional external inputs, time resumes incrementing at
> >> the point where the suspend occurred without any jump backwards or
> >> forwards.
> >>
> >> The code accounts for the sleep time by using read_persistent_clock()
> >> read
> >> a timer which continues running during sleep to calculate the delta
> >> between
> >> suspend and resume, and injects the delta between them to wind the time
> >> forward.
> >>
> >>> Then we have cpuidle. Is it ok to stop/freeze the timer during cpuidle
> >>> sleep states?
> >>
> >> During _idle_ (iow, no task running) sched_clock and the clocksource
> >> should both continue to run - the scheduler needs to know how long the
> >> system has been idle for, and the clocksource can't stop because we'll
> >> lose track of time.
> >>
> >> Remember that the clockevent stuff is used as a trigger to the
> >> timekeeping
> >> code to read the clocksource, and update the current time. Time is moved
> >> forward by the delta between a previous clocksource read and the current
> >> clocksource read. So stopping or resetting the clocksource in unexpected
> >> ways (other than over suspend/resume as mentioned above) will result in
> >> time going weird.
> >
> > Hmm, I have missed the existence of the read_persistent_clock(). It
> > sounds like I should keep the MTU as my clocksource / sched_clock and
> > have the PRCMU Timer as a persistent_clock instead.
> >
> > Then one problem remains. The MTU will be powered during cstates:
> > running, wfi, ApIdle (arm retenetion). The MTU will loose power during
> > cstates ApSleep and ApDeepSleep. So I need to do a similar sync as
> > suspend does against the persistent_clock but when leaving and enter the
> > deeper cstates.
> >
> > Should I solve it in the clocksource framework with a flag telling which
> > cstates the timer will stop/freeze and then inject time from the
> > persistent_clock for those cstates? (I am thinking something like the
> > CLOCK_EVT_FEAT_C3STOP flag)
> >
> > Am I on the wrong track here or how should I solve it?
> >
[snip]
> John mentioned that because of frequent clock-source
> switching, will affect the NTP correction badly to an
> extent that NTP correction may not work.
> 
> Here is what John suggested to do but I got busy with
> other stuff and this one got pushed out from my todo.
> 
> --------------------
> John wrote ...
> A different approach would be to create a meta-clocksource, that
> utilizes the two underlying clocks to create a what looks like a unified
> counter.
> 
> Basically you use the slow-always running counter as your long-term freq
> adjusted clock, but supplement its low granularity with the highres
> halting clock.
> 
> This works best if both counters are driven by the same crystal, so
> there won't be much drift between them.
> ----------------------

Yea. The basic idea is to interpolate between two counters to produce a
a continuous clocksource for the timekeeping core.

As Russell pointed out, error is injected to the timekeeping code every
time you switch between clocksources or the persistent clock. Doing this
very frequently will provide very poor time.

So by using an interpolated clocksource, we would be able to maintain
clock adjustments via ntp and provide accurate long term time, while
still having fine-grained resolution (along with some short term error).

However, its all easier said then done. There are lots of gotchas and
corner-cases with interpolation. That's the whole reason we moved from
tick based timekeeping to continuous clocksource based timekeeping. I
believe the KVM folks have used similar interpolation method for their
kvm clocksource, and I know they had numerous issues. But it seems to
work well enough, so I would take a peek at their code to start.

thanks
-john
Mattias Wallin June 8, 2011, 1:44 p.m. UTC | #8
On 06/02/2011 08:47 PM, john stultz wrote:
> On Thu, 2011-06-02 at 18:27 +0530, Santosh Shilimkar wrote:
>> On 6/2/2011 5:40 PM, Mattias Wallin wrote:
>>> On 06/02/2011 01:01 PM, Russell King - ARM Linux wrote:
>>>> On Thu, Jun 02, 2011 at 12:18:35PM +0200, Mattias Wallin wrote:
>>>>> On 06/02/2011 11:46 AM, Russell King - ARM Linux wrote:
>>>>>> Why don't we just find a way of fixing sched_clock so that the value
>>>>>> doesn't reset over a suspend/resume cycle?
>>>>> Even if the value isn't reset during suspend/resume we want the
>>>>> clocksource to keep counting. Or is it ok to have the clocksource stop
>>>>> or freeze during suspend?
>>>>
>>>> kernel/time/timekeeping.c:timekeeping_suspend():
>>>>
>>>> timekeeping_forward_now();
>>>>
>>>> which does:
>>>> cycle_now = clock->read(clock);
>>>> cycle_delta = (cycle_now - clock->cycle_last)&  clock->mask;
>>>> clock->cycle_last = cycle_now;
>>>>
>>>> So that updates the time with the current offset between cycle_last and
>>>> the current value.
>>>>
>>>> kernel/time/timekeeping.c:timekeeping_resume():
>>>> /* re-base the last cycle value */
>>>> timekeeper.clock->cycle_last = timekeeper.clock->read(timekeeper.clock);
>>>>
>>>> So this re-sets cycle_last to the current value of the clocksource. This
>>>> means that on resume, the clocksource can start counting from any
>>>> value it
>>>> likes.
>>>>
>>>> So, without any additional external inputs, time resumes incrementing at
>>>> the point where the suspend occurred without any jump backwards or
>>>> forwards.
>>>>
>>>> The code accounts for the sleep time by using read_persistent_clock()
>>>> read
>>>> a timer which continues running during sleep to calculate the delta
>>>> between
>>>> suspend and resume, and injects the delta between them to wind the time
>>>> forward.
>>>>
>>>>> Then we have cpuidle. Is it ok to stop/freeze the timer during cpuidle
>>>>> sleep states?
>>>>
>>>> During _idle_ (iow, no task running) sched_clock and the clocksource
>>>> should both continue to run - the scheduler needs to know how long the
>>>> system has been idle for, and the clocksource can't stop because we'll
>>>> lose track of time.
>>>>
>>>> Remember that the clockevent stuff is used as a trigger to the
>>>> timekeeping
>>>> code to read the clocksource, and update the current time. Time is moved
>>>> forward by the delta between a previous clocksource read and the current
>>>> clocksource read. So stopping or resetting the clocksource in unexpected
>>>> ways (other than over suspend/resume as mentioned above) will result in
>>>> time going weird.
>>>
>>> Hmm, I have missed the existence of the read_persistent_clock(). It
>>> sounds like I should keep the MTU as my clocksource / sched_clock and
>>> have the PRCMU Timer as a persistent_clock instead.
>>>
>>> Then one problem remains. The MTU will be powered during cstates:
>>> running, wfi, ApIdle (arm retenetion). The MTU will loose power during
>>> cstates ApSleep and ApDeepSleep. So I need to do a similar sync as
>>> suspend does against the persistent_clock but when leaving and enter the
>>> deeper cstates.
>>>
>>> Should I solve it in the clocksource framework with a flag telling which
>>> cstates the timer will stop/freeze and then inject time from the
>>> persistent_clock for those cstates? (I am thinking something like the
>>> CLOCK_EVT_FEAT_C3STOP flag)
>>>
>>> Am I on the wrong track here or how should I solve it?
>>>
> [snip]
>> John mentioned that because of frequent clock-source
>> switching, will affect the NTP correction badly to an
>> extent that NTP correction may not work.
>>
>> Here is what John suggested to do but I got busy with
>> other stuff and this one got pushed out from my todo.
>>
>> --------------------
>> John wrote ...
>> A different approach would be to create a meta-clocksource, that
>> utilizes the two underlying clocks to create a what looks like a unified
>> counter.
>>
>> Basically you use the slow-always running counter as your long-term freq
>> adjusted clock, but supplement its low granularity with the highres
>> halting clock.
>>
>> This works best if both counters are driven by the same crystal, so
>> there won't be much drift between them.
>> ----------------------
>
> Yea. The basic idea is to interpolate between two counters to produce a
> a continuous clocksource for the timekeeping core.
>
> As Russell pointed out, error is injected to the timekeeping code every
> time you switch between clocksources or the persistent clock. Doing this
> very frequently will provide very poor time.
>
> So by using an interpolated clocksource, we would be able to maintain
> clock adjustments via ntp and provide accurate long term time, while
> still having fine-grained resolution (along with some short term error).
>
> However, its all easier said then done. There are lots of gotchas and
> corner-cases with interpolation. That's the whole reason we moved from
> tick based timekeeping to continuous clocksource based timekeeping. I
> believe the KVM folks have used similar interpolation method for their
> kvm clocksource, and I know they had numerous issues. But it seems to
> work well enough, so I would take a peek at their code to start.

Russell,
What is your gut feeling about above interpolated clocksource strategy 
suggestion?


While we (I) look at an alternative clocksource strategy and experiment 
are you ok with the three patches I sent?
I thought about the persistant_clock that I did not know about but that 
will only help me in suspend so I conclude that I still need the a 
clocksource that counts in all low power modes.

I would like to have something merged in case the alternate strategy 
fails. Something that work in the current kernel. The warnings above 
hints that it will be difficult and can take a long time to find a 
solution where both my timers can work together. It also easier to 
discuss framework changes with improvements of in-kernel drivers.

Thanks,
/Mattias Wallin
Russell King - ARM Linux June 9, 2011, 9:59 p.m. UTC | #9
On Wed, Jun 08, 2011 at 03:44:01PM +0200, Mattias Wallin wrote:
> On 06/02/2011 08:47 PM, john stultz wrote:
>> Yea. The basic idea is to interpolate between two counters to produce a
>> a continuous clocksource for the timekeeping core.
>>
>> As Russell pointed out, error is injected to the timekeeping code every
>> time you switch between clocksources or the persistent clock. Doing this
>> very frequently will provide very poor time.
>>
>> So by using an interpolated clocksource, we would be able to maintain
>> clock adjustments via ntp and provide accurate long term time, while
>> still having fine-grained resolution (along with some short term error).
>>
>> However, its all easier said then done. There are lots of gotchas and
>> corner-cases with interpolation. That's the whole reason we moved from
>> tick based timekeeping to continuous clocksource based timekeeping. I
>> believe the KVM folks have used similar interpolation method for their
>> kvm clocksource, and I know they had numerous issues. But it seems to
>> work well enough, so I would take a peek at their code to start.
>
> Russell,
> What is your gut feeling about above interpolated clocksource strategy  
> suggestion?

I think you're not going to maintain precision even with the above idea.
There's two problems:

1. When you switch from the high resolution to the low resolution, you
need to know how many high-res count increments have elapsed since the
last low-res count increment occurred.  From this you know the offset
between current time and the last low-res count increment and so can
correct for this change.

2. When you switch from the low-res to high-res, because the high-res
has restarted, you have lost track of the high-res counter.  The only
point when things are re-synchronisable is at the point when the low-res
counter increments.  So you either:
a) have to wait for the low-res to increment,
b) or you use the high-res counter to time until the low-res increments
 and then factor in some correction factor.

Either way looks like it will be fraught with problems - particularly
the resulting increase in latency caused by this when switching between
states, or the period of loss of precision caused by (2b) which may be
more disruptive to NTP.

> While we (I) look at an alternative clocksource strategy and experiment  
> are you ok with the three patches I sent?

No - I'd like you to check whether the patch I sent fixes your
sched_clock problem across suspend/resume so that we can get that sorted
for everyone, instead of coming up with platform specific fixes.  Then
we can work out how to hook that into cpuidle so that it works properly
there.

We're no longer going to go down the route of platforms working around
problems with common infrastructure in their own individual private
ways.  That madness has to stop right now.
Mattias Wallin June 10, 2011, 8:54 a.m. UTC | #10
On 06/09/2011 11:59 PM, Russell King - ARM Linux wrote:
> No - I'd like you to check whether the patch I sent fixes your
> sched_clock problem across suspend/resume so that we can get that sorted
> for everyone, instead of coming up with platform specific fixes.  Then
> we can work out how to hook that into cpuidle so that it works properly
> there.

Ahh, I see your plan. I was focused on the cpuidle problem and your 
patch only addressed suspend/resume.
Anyway, now that I understand more what you are after I'll have a look 
at it and come back to you.
Mattias Wallin June 10, 2011, 4 p.m. UTC | #11
On 06/10/2011 10:54 AM, Mattias Wallin wrote:
> On 06/09/2011 11:59 PM, Russell King - ARM Linux wrote:
>> No - I'd like you to check whether the patch I sent fixes your
>> sched_clock problem across suspend/resume so that we can get that sorted
>> for everyone, instead of coming up with platform specific fixes.  Then
>> we can work out how to hook that into cpuidle so that it works properly
>> there.
>
> Ahh, I see your plan. I was focused on the cpuidle problem and your
> patch only addressed suspend/resume.
> Anyway, now that I understand more what you are after I'll have a look
> at it and come back to you.

Unfortunately you will not get an answer if your patch works or not now. 
Our u8500 mainline code is missing suspend support and I have been 
trying to get suspend for u8500 to work without success. Since I will be 
away for two weeks I will try to get the suspend guys to work on it 
meanwhile.

Kind Regards,
Mattias Wallin
Russell King - ARM Linux July 10, 2011, 2:19 p.m. UTC | #12
On Thu, Jun 02, 2011 at 10:46:22AM +0100, Russell King - ARM Linux wrote:
> On Thu, Jun 02, 2011 at 11:34:31AM +0200, Mattias Wallin wrote:
> > The Multi Timer Unit (MTU) is currently used as clocksource and sched_clk
> > for the u8500 machine. The MTU block loose power during cpuidle sleep states
> > so an alternate clocksource is needed and these patches adds the db8500 PRCMU
> > timer.
> 
> Why don't we just find a way of fixing sched_clock so that the value
> doesn't reset over a suspend/resume cycle?  IOW, lets fix the problem
> for _everyone_ rather than only fixing it for one platform at a time.
> 
> Could you try this patch to check whether sched_clock() behaves better
> across a suspend/resume cycle please?

So do I merge this patch?

Patch
diff mbox

diff --git a/arch/arm/kernel/sched_clock.c b/arch/arm/kernel/sched_clock.c
index 9a46370..4be4019 100644
--- a/arch/arm/kernel/sched_clock.c
+++ b/arch/arm/kernel/sched_clock.c
@@ -10,6 +10,7 @@ 
 #include <linux/jiffies.h>
 #include <linux/kernel.h>
 #include <linux/sched.h>
+#include <linux/syscore_ops.h>
 #include <linux/timer.h>
 
 #include <asm/sched_clock.h>
@@ -72,3 +73,20 @@  void __init sched_clock_postinit(void)
 {
 	sched_clock_poll(sched_clock_timer.data);
 }
+
+static int sched_clock_suspend(void)
+{
+	sched_clock_poll(sched_clock_timer.data);
+	return 0;
+}
+
+static struct syscore_ops sched_clock_ops = {
+	.suspend	= sched_clock_suspend,
+};
+
+static int __init sched_clock_syscore_init(void)
+{
+	register_syscore_ops(&sched_clock_ops);
+	return 0;
+}
+device_initcall(sched_clock_syscore_init);