diff mbox

arm: spin one more cycle in timer-based delays

Message ID 582B0F61.6030903@free.fr (mailing list archive)
State New, archived
Headers show

Commit Message

Mason Nov. 15, 2016, 1:36 p.m. UTC
When polling a tick counter in a busy loop, one might sample the
counter just *before* it is updated, and then again just *after*
it is updated. In that case, while mathematically v2-v1 equals 1,
only epsilon has really passed.

So, if a caller requests an N-cycle delay, we spin until v2-v1
is strictly greater than N to avoid these random corner cases.

Signed-off-by: Mason <slash.tmp@free.fr>
---
 arch/arm/lib/delay.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Will Deacon Nov. 18, 2016, 12:06 p.m. UTC | #1
On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
> When polling a tick counter in a busy loop, one might sample the
> counter just *before* it is updated, and then again just *after*
> it is updated. In that case, while mathematically v2-v1 equals 1,
> only epsilon has really passed.
> 
> So, if a caller requests an N-cycle delay, we spin until v2-v1
> is strictly greater than N to avoid these random corner cases.
> 
> Signed-off-by: Mason <slash.tmp@free.fr>
> ---
>  arch/arm/lib/delay.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
> index 69aad80a3af4..3f1cd15ca102 100644
> --- a/arch/arm/lib/delay.c
> +++ b/arch/arm/lib/delay.c
> @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
>  {
>  	cycles_t start = get_cycles();
>  
> -	while ((get_cycles() - start) < cycles)
> +	while ((get_cycles() - start) <= cycles)
>  		cpu_relax();
>  }

I thought a bit about this last night. It is well known that the delay
routines are not guaranteed to be accurate, and I don't *think* that's
limited to over-spinning, so arguably this isn't a bug. However, taking
the approach that "drivers should figure it out" is also unhelpful,
because the frequency of the underlying counter isn't generally known
and so drivers can't simply adjust the delay parameter.

If you want to go ahead with this change, I do think that you should fix
the other architectures for consistency (particularly arm64, which is
likely to share drivers with arch/arm/). However, given that I don't
think you've seen a failure in practice, it might be a hard sell to get
the patches picked up, especially given the deliberately vague guarantees
offered by the delay loop.

Will
Mason Nov. 18, 2016, 12:24 p.m. UTC | #2
On 18/11/2016 13:06, Will Deacon wrote:

> On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
>
>> When polling a tick counter in a busy loop, one might sample the
>> counter just *before* it is updated, and then again just *after*
>> it is updated. In that case, while mathematically v2-v1 equals 1,
>> only epsilon has really passed.
>>
>> So, if a caller requests an N-cycle delay, we spin until v2-v1
>> is strictly greater than N to avoid these random corner cases.
>>
>> Signed-off-by: Mason <slash.tmp@free.fr>
>> ---
>>  arch/arm/lib/delay.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
>> index 69aad80a3af4..3f1cd15ca102 100644
>> --- a/arch/arm/lib/delay.c
>> +++ b/arch/arm/lib/delay.c
>> @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
>>  {
>>  	cycles_t start = get_cycles();
>>  
>> -	while ((get_cycles() - start) < cycles)
>> +	while ((get_cycles() - start) <= cycles)
>>  		cpu_relax();
>>  }
> 
> I thought a bit about this last night. It is well known that the delay
> routines are not guaranteed to be accurate, and I don't *think* that's
> limited to over-spinning, so arguably this isn't a bug. However, taking
> the approach that "drivers should figure it out" is also unhelpful,
> because the frequency of the underlying counter isn't generally known
> and so drivers can't simply adjust the delay parameter.
> 
> If you want to go ahead with this change, I do think that you should fix
> the other architectures for consistency (particularly arm64, which is
> likely to share drivers with arch/arm/). However, given that I don't
> think you've seen a failure in practice, it might be a hard sell to get
> the patches picked up, especially given the deliberately vague guarantees
> offered by the delay loop.

Hello Will,

Thanks for the in-depth analysis.

This is, conceptually, the first patch in a 2-patch series, and perhaps
the intent would have been clearer had I posted the series, along with
full rationale in the cover letter. I'll do that next week, so everyone
can judge with all the information in hand.

Regards.
Russell King (Oracle) Nov. 18, 2016, 12:54 p.m. UTC | #3
On Fri, Nov 18, 2016 at 12:06:30PM +0000, Will Deacon wrote:
> On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
> > When polling a tick counter in a busy loop, one might sample the
> > counter just *before* it is updated, and then again just *after*
> > it is updated. In that case, while mathematically v2-v1 equals 1,
> > only epsilon has really passed.
> > 
> > So, if a caller requests an N-cycle delay, we spin until v2-v1
> > is strictly greater than N to avoid these random corner cases.
> > 
> > Signed-off-by: Mason <slash.tmp@free.fr>
> > ---
> >  arch/arm/lib/delay.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
> > index 69aad80a3af4..3f1cd15ca102 100644
> > --- a/arch/arm/lib/delay.c
> > +++ b/arch/arm/lib/delay.c
> > @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
> >  {
> >  	cycles_t start = get_cycles();
> >  
> > -	while ((get_cycles() - start) < cycles)
> > +	while ((get_cycles() - start) <= cycles)
> >  		cpu_relax();
> >  }
> 
> I thought a bit about this last night. It is well known that the delay
> routines are not guaranteed to be accurate, and I don't *think* that's
> limited to over-spinning, so arguably this isn't a bug. However, taking
> the approach that "drivers should figure it out" is also unhelpful,
> because the frequency of the underlying counter isn't generally known
> and so drivers can't simply adjust the delay parameter.

I don't think this change makes any sense whatsoever.  udelay() is
inaccurate, period.  It _will_ give delays shorter than requested
for many reasons, many of which can't be fixed.

Having a super-accurate version just encourages people to write broken
drivers which assume (eg) that udelay(10) will give _at least_ a 10us
delay.  However, there is no such guarantee.

So, having udelay() for timers return slightly short is actually a good
thing - it causes people not to make the mistake to be soo accurate
with their delay specifications.

So, NAK on this change.  udelay is not super-accurate.

Reference (and this is the most important one):

  http://lists.openwall.net/linux-kernel/2011/01/12/372
Mason Nov. 18, 2016, 2:18 p.m. UTC | #4
On 18/11/2016 13:54, Russell King - ARM Linux wrote:
> On Fri, Nov 18, 2016 at 12:06:30PM +0000, Will Deacon wrote:
>> On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
>>> When polling a tick counter in a busy loop, one might sample the
>>> counter just *before* it is updated, and then again just *after*
>>> it is updated. In that case, while mathematically v2-v1 equals 1,
>>> only epsilon has really passed.
>>>
>>> So, if a caller requests an N-cycle delay, we spin until v2-v1
>>> is strictly greater than N to avoid these random corner cases.
>>>
>>> Signed-off-by: Mason <slash.tmp@free.fr>
>>> ---
>>>  arch/arm/lib/delay.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
>>> index 69aad80a3af4..3f1cd15ca102 100644
>>> --- a/arch/arm/lib/delay.c
>>> +++ b/arch/arm/lib/delay.c
>>> @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
>>>  {
>>>  	cycles_t start = get_cycles();
>>>  
>>> -	while ((get_cycles() - start) < cycles)
>>> +	while ((get_cycles() - start) <= cycles)
>>>  		cpu_relax();
>>>  }
>>
>> I thought a bit about this last night. It is well known that the delay
>> routines are not guaranteed to be accurate, and I don't *think* that's
>> limited to over-spinning, so arguably this isn't a bug. However, taking
>> the approach that "drivers should figure it out" is also unhelpful,
>> because the frequency of the underlying counter isn't generally known
>> and so drivers can't simply adjust the delay parameter.
> 
> I don't think this change makes any sense whatsoever.  udelay() is
> inaccurate, period.  It _will_ give delays shorter than requested
> for many reasons, many of which can't be fixed.

If I understand correctly, udelay is, in fact, a wrapper around
several possible implementations. (Run-time decision on arch/arm)

1) The "historical" software loop-based method, which is calibrated
during init.

2) A hardware solution, based on an actual timer, ticking away
at a constant frequency.

3) others?

Solution 1 (which probably dates back to Linux 0.1) suffers from
the inaccuracies you point out, because of interrupt latency
during calibration, DVFS, etc.

On the other hand, it is simple enough to implement solution 2
in a way that guarantees that spin_for_n_cycles(n) spins
/at least/ for n cycles.

On platforms using timer-based delays, there can indeed exist
a function guaranteeing a minimal delay.


> Having a super-accurate version just encourages people to write broken
> drivers which assume (eg) that udelay(10) will give _at least_ a 10us
> delay.  However, there is no such guarantee.

You say that calling udelay(10) expecting at least 10 µs is broken.
This fails the principle of least astonishment in a pretty big way.
(If it returns after 9.9 µs, I suppose it could be OK. But for
shorter delays, the relative error grows.)

A lot of drivers implement a spec that says
"do this, wait 1 µs, do that".

Driver writers would typically write
do_this(); udelay(1); do_that();

Do you suggest they should write udelay(2); ?
But then, it's not so easy to follow the spec because everything
has been translated to a different number.
Hide everything behind a macro?

Note that people have been using ndelay too.
(I count 182 in drivers, 288 overall.)

drivers/cpufreq/s3c2440-cpufreq.c:      ndelay(20);

I don't think the author expects this to return immediately.


> So, having udelay() for timers return slightly short is actually a good
> thing - it causes people not to make the mistake to be soo accurate
> with their delay specifications.

I disagree that having an implementation which introduces
hard-to-track-down random heisenbugs is a good thing.


> So, NAK on this change.  udelay is not super-accurate.

usleep_range() fixed this issue recently.
6c5e9059692567740a4ee51530dffe51a4b9584d
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/core&id=6c5e9059692567740a4ee51530dffe51a4b9584d

Do you suggest driver writers should use usleep_range()
instead of udelay?
(When does usleep_range start making sense? Around 50/100 µs
and above?)

> Reference (and this is the most important one):
> 
>   http://lists.openwall.net/linux-kernel/2011/01/12/372


Regards.
Peter Zijlstra Nov. 18, 2016, 2:42 p.m. UTC | #5
On Fri, Nov 18, 2016 at 03:18:58PM +0100, Mason wrote:
> If I understand correctly, udelay is, in fact, a wrapper around
> several possible implementations. (Run-time decision on arch/arm)
> 
> 1) The "historical" software loop-based method, which is calibrated
> during init.
> 
> 2) A hardware solution, based on an actual timer, ticking away
> at a constant frequency.

I'd say clock, a timer would imply interrupts or somesuch, which is of
course not useful for udelay.

x86 with TSC uses this, although recent chips have grown an timed-mwait
instruction and that too can be used.

The TSC based ones (including the mwait) try very hard to not return
early.

> 3) others?

On x86 there's a variant that times itself based on io port accesses
which have a 'known' slowness.
Doug Anderson Nov. 18, 2016, 5:13 p.m. UTC | #6
Hi,

On Fri, Nov 18, 2016 at 4:54 AM, Russell King - ARM Linux
<linux@armlinux.org.uk> wrote:
> On Fri, Nov 18, 2016 at 12:06:30PM +0000, Will Deacon wrote:
>> On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
>> > When polling a tick counter in a busy loop, one might sample the
>> > counter just *before* it is updated, and then again just *after*
>> > it is updated. In that case, while mathematically v2-v1 equals 1,
>> > only epsilon has really passed.
>> >
>> > So, if a caller requests an N-cycle delay, we spin until v2-v1
>> > is strictly greater than N to avoid these random corner cases.
>> >
>> > Signed-off-by: Mason <slash.tmp@free.fr>
>> > ---
>> >  arch/arm/lib/delay.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
>> > index 69aad80a3af4..3f1cd15ca102 100644
>> > --- a/arch/arm/lib/delay.c
>> > +++ b/arch/arm/lib/delay.c
>> > @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
>> >  {
>> >     cycles_t start = get_cycles();
>> >
>> > -   while ((get_cycles() - start) < cycles)
>> > +   while ((get_cycles() - start) <= cycles)
>> >             cpu_relax();
>> >  }
>>
>> I thought a bit about this last night. It is well known that the delay
>> routines are not guaranteed to be accurate, and I don't *think* that's
>> limited to over-spinning, so arguably this isn't a bug. However, taking
>> the approach that "drivers should figure it out" is also unhelpful,
>> because the frequency of the underlying counter isn't generally known
>> and so drivers can't simply adjust the delay parameter.
>
> I don't think this change makes any sense whatsoever.  udelay() is
> inaccurate, period.  It _will_ give delays shorter than requested
> for many reasons, many of which can't be fixed.
>
> Having a super-accurate version just encourages people to write broken
> drivers which assume (eg) that udelay(10) will give _at least_ a 10us
> delay.  However, there is no such guarantee.
>
> So, having udelay() for timers return slightly short is actually a good
> thing - it causes people not to make the mistake to be soo accurate
> with their delay specifications.
>
> So, NAK on this change.  udelay is not super-accurate.
>
> Reference (and this is the most important one):
>
>   http://lists.openwall.net/linux-kernel/2011/01/12/372

Mason's change adds no complexity to the code and seems like a
generally good idea.

As per John Stultz [1] there is agreement that udelay() really
shouldn't delay less than the requested amount.  In fact, we even have
a test committed to the tree: "kernel/time/test_udelay.c".  As your
reference says, maybe 1% isn't enough to throw a hissyfit about, but
when the change is so simple and adds no complexity then it seems like
a worthwhile thing to do.  Why make things inaccurate for no reason?

[1] http://www.gossamer-threads.com/lists/linux/kernel/1918249


For what it's worth, Mason's change is:

Reviewed-by: Douglas Anderson <dianders@chromium.org>

-Doug
Russell King (Oracle) Nov. 18, 2016, 5:32 p.m. UTC | #7
On Fri, Nov 18, 2016 at 09:13:18AM -0800, Doug Anderson wrote:
> Mason's change adds no complexity to the code and seems like a
> generally good idea.
> 
> As per John Stultz [1] there is agreement that udelay() really
> shouldn't delay less than the requested amount.

There is NO such agreement, read the reference that I gave, which is
a statement from Linus who clearly says that if you're stupid enough
to want udelay() to be accurate to within 5% then you're doing
something wrong.

FACT: software-based udelay()s are _always_ short by the very nature
of the way the delay loop is calibrated.

It's not something that's going to get fixed either - read the full
thread from my reference, and you'll see that Linus has said that
udelay() being short is not something anyone should care about.
That's at odds from John, and Linus' opinion rather trumps everyone
elses - it's Linus' kernel, his is the ultimate last say on anything.

>  In fact, we even have
> a test committed to the tree: "kernel/time/test_udelay.c".  As your
> reference says, maybe 1% isn't enough to throw a hissyfit about, but
> when the change is so simple and adds no complexity then it seems like
> a worthwhile thing to do.

Maybe, but my point is that it's just encouraging this fiction that
udelay() never delays shorter than the requested delay.

Also, given that this is architecture code, it's my decision to make,
not yours.  So while you can supply any attributation you like, I'm
saying no at the moment - unless someone can show that it causes
_significant_ errors.

In other words, if it causes significantly short or long delays that
the reasonable inflation of delay value that drivers are expected to
make becomes a problem, then yes, it should be fixed.  If it's merely
"lets stop udelay() returning slightly early" then no, I'm not
interested because it's encouraging the fiction.

And... if a data sheet says "needs a delay of 2us" and you put in the
driver udelay(2) then you're doing it wrong, and you need to read
Linus' mails on this subject, such as the one I've provided a link
to... that udelay() must be _at least_ udelay(3), if not 4.

The best thing when using udelay() is to remember the most important
point - udelay() is very _approximate_.

Adding Linus in case he wishes to add to the discussion.
Mason Nov. 18, 2016, 5:51 p.m. UTC | #8
On 18/11/2016 15:18, Mason wrote:
> On 18/11/2016 13:54, Russell King - ARM Linux wrote:
>> On Fri, Nov 18, 2016 at 12:06:30PM +0000, Will Deacon wrote:
>>> On Tue, Nov 15, 2016 at 02:36:33PM +0100, Mason wrote:
>>>> When polling a tick counter in a busy loop, one might sample the
>>>> counter just *before* it is updated, and then again just *after*
>>>> it is updated. In that case, while mathematically v2-v1 equals 1,
>>>> only epsilon has really passed.
>>>>
>>>> So, if a caller requests an N-cycle delay, we spin until v2-v1
>>>> is strictly greater than N to avoid these random corner cases.
>>>>
>>>> Signed-off-by: Mason <slash.tmp@free.fr>
>>>> ---
>>>>  arch/arm/lib/delay.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
>>>> index 69aad80a3af4..3f1cd15ca102 100644
>>>> --- a/arch/arm/lib/delay.c
>>>> +++ b/arch/arm/lib/delay.c
>>>> @@ -60,7 +60,7 @@ static void __timer_delay(unsigned long cycles)
>>>>  {
>>>>  	cycles_t start = get_cycles();
>>>>  
>>>> -	while ((get_cycles() - start) < cycles)
>>>> +	while ((get_cycles() - start) <= cycles)
>>>>  		cpu_relax();
>>>>  }
>>>
>>> I thought a bit about this last night. It is well known that the delay
>>> routines are not guaranteed to be accurate, and I don't *think* that's
>>> limited to over-spinning, so arguably this isn't a bug. However, taking
>>> the approach that "drivers should figure it out" is also unhelpful,
>>> because the frequency of the underlying counter isn't generally known
>>> and so drivers can't simply adjust the delay parameter.
>>
>> I don't think this change makes any sense whatsoever.  udelay() is
>> inaccurate, period.  It _will_ give delays shorter than requested
>> for many reasons, many of which can't be fixed.
> 
> If I understand correctly, udelay is, in fact, a wrapper around
> several possible implementations. (Run-time decision on arch/arm)
> 
> 1) The "historical" software loop-based method, which is calibrated
> during init.
> 
> 2) A hardware solution, based on an actual timer, ticking away
> at a constant frequency.
> 
> 3) others?
> 
> Solution 1 (which probably dates back to Linux 0.1) suffers from
> the inaccuracies you point out, because of interrupt latency
> during calibration, DVFS, etc.
> 
> On the other hand, it is simple enough to implement solution 2
> in a way that guarantees that spin_for_n_cycles(n) spins
> /at least/ for n cycles.
> 
> On platforms using timer-based delays, there can indeed exist
> a function guaranteeing a minimal delay.
> 
> 
>> Having a super-accurate version just encourages people to write broken
>> drivers which assume (eg) that udelay(10) will give _at least_ a 10us
>> delay.  However, there is no such guarantee.
> 
> You say that calling udelay(10) expecting at least 10 µs is broken.
> This fails the principle of least astonishment in a pretty big way.
> (If it returns after 9.9 µs, I suppose it could be OK. But for
> shorter delays, the relative error grows.)
> 
> A lot of drivers implement a spec that says
> "do this, wait 1 µs, do that".
> 
> Driver writers would typically write
> do_this(); udelay(1); do_that();
> 
> Do you suggest they should write udelay(2); ?
> But then, it's not so easy to follow the spec because everything
> has been translated to a different number.
> Hide everything behind a macro?
> 
> Note that people have been using ndelay too.
> (I count 182 in drivers, 288 overall.)
> 
> drivers/cpufreq/s3c2440-cpufreq.c:      ndelay(20);
> 
> I don't think the author expects this to return immediately.
> 
> 
>> So, having udelay() for timers return slightly short is actually a good
>> thing - it causes people not to make the mistake to be soo accurate
>> with their delay specifications.
> 
> I disagree that having an implementation which introduces
> hard-to-track-down random heisenbugs is a good thing.
> 
> 
>> So, NAK on this change.  udelay is not super-accurate.
> 
> usleep_range() fixed this issue recently.
> 6c5e9059692567740a4ee51530dffe51a4b9584d
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/core&id=6c5e9059692567740a4ee51530dffe51a4b9584d
> 
> Do you suggest driver writers should use usleep_range()
> instead of udelay?
> (When does usleep_range start making sense? Around 50/100 µs
> and above?)
> 
>> Reference (and this is the most important one):
>>
>>   http://lists.openwall.net/linux-kernel/2011/01/12/372

I forgot to explain how I came to work in this area of the code.

When my NAND controller's driver probes, it scans the chips
for bad blocks, which generates 65000 calls to ndelay(100);
(for 1 GB of NAND). For larger sizes of NAND, the number can
climb to 500k calls.

Currently, on arch/arm, ndelay is rounded *up* to the
nearest microsecond. So this might generate a total
delay of up to 500 ms, when 50 ms would be sufficient.

So I locally defined ndelay as
#define NDELAY_MULT	((UL(2199)    * HZ) >> 11)
#define ndelay(n)	__const_udelay((n) * NDELAY_MULT)

I use a 27 MHz tick counter (37 ns period).
ndelay() was resolving to __timer_delay(2)
which might delay only a single tick, so 37 ns.

So the NAND framework occasionally (falsely) flags a block
as bad (random).

There are two sources of error in the timer-based code:
1) rounding down in __timer_const_udelay => loses 1 cycle
2) spinning one cycle too few => loses 1 cycle

With these two errors corrected, I am able to init the
NAND driver faster, with no blocks incorrectly marked
as bad.

Regards.
afzal mohammed Nov. 19, 2016, 7:17 a.m. UTC | #9
Hi Mason,

On Fri, Nov 18, 2016 at 03:18:58PM +0100, Mason wrote:
> On 18/11/2016 13:54, Russell King - ARM Linux wrote:

> > So, NAK on this change.  udelay is not super-accurate.
> 
> usleep_range() fixed this issue recently.
> 6c5e9059692567740a4ee51530dffe51a4b9584d
> https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=timers/core&id=6c5e9059692567740a4ee51530dffe51a4b9584d

But the above "timers: Fix usleep_range() in the context of
wake_up_process()" is to avoid wakeup causing premature return than
about being precise, no ?

With conflicting opinion on delay/sleep fn's from the players, the one
in gallery would get confused.

But Linus has mentioned udelay as not meant to be precise, okay ?

Regards
afzal
diff mbox

Patch

diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
index 69aad80a3af4..3f1cd15ca102 100644
--- a/arch/arm/lib/delay.c
+++ b/arch/arm/lib/delay.c
@@ -60,7 +60,7 @@  static void __timer_delay(unsigned long cycles)
 {
 	cycles_t start = get_cycles();
 
-	while ((get_cycles() - start) < cycles)
+	while ((get_cycles() - start) <= cycles)
 		cpu_relax();
 }