diff mbox

Guarantee udelay(N) spins at least N microseconds

Message ID 5527B331.5000205@free.fr (mailing list archive)
State New, archived
Headers show

Commit Message

Mason April 10, 2015, 11:25 a.m. UTC
Hello everyone,

This is take 2 of my tiny delay.c patch

Problem statement

When converting microseconds to timer cycles in __timer_udelay() and
__timer_const_udelay(), the result is rounded down(*), which means the
system will not spin as long as requested (specifically, between
epsilon and 1 cycle shorter).

If I understand correctly, most drivers expect udelay(N) to spin for
at least N µs. Is that correct? In that use case, spinning less might
introduce subtle heisenbugs.


Typical example

timer->freq = 90 kHz && HZ = 100
(thus UDELAY_MULT = 107374 && ticks_per_jiffy = 900)

udelay(10) => __timer_const_udelay(10*107374)
            => __timer_delay((1073740*900) >> 30)
            => __timer_delay(0)

So udelay(10) resolves to no delay at all.


(*) 2^41 / 10^6 = 2199023,255552
2199023 < 2^41 / 10^6
UDELAY_MULT = 2199023*HZ / 2^11 < 2^30*HZ / 10^6

cycles = N * UDELAY_MULT * freq/HZ / 2^30
        < N * 2^30*HZ / 10^6 * freq/HZ / 2^30
        < N / 10^6 * freq


Proposed fix

Since results are always rounded down, all we need is to increment
the result by 1 to round it up.

Would someone ACK the patch below?

Regards.


Patch against 4.0-rc4

Comments

Willy Tarreau April 10, 2015, 11:42 a.m. UTC | #1
Hi,

On Fri, Apr 10, 2015 at 01:25:37PM +0200, Mason wrote:
> Hello everyone,
> 
> This is take 2 of my tiny delay.c patch
> 
> Problem statement
> 
> When converting microseconds to timer cycles in __timer_udelay() and
> __timer_const_udelay(), the result is rounded down(*), which means the
> system will not spin as long as requested (specifically, between
> epsilon and 1 cycle shorter).
> 
> If I understand correctly, most drivers expect udelay(N) to spin for
> at least N µs. Is that correct? In that use case, spinning less might
> introduce subtle heisenbugs.
> 
> 
> Typical example
> 
> timer->freq = 90 kHz && HZ = 100
> (thus UDELAY_MULT = 107374 && ticks_per_jiffy = 900)
> 
> udelay(10) => __timer_const_udelay(10*107374)
>            => __timer_delay((1073740*900) >> 30)
>            => __timer_delay(0)
> 
> So udelay(10) resolves to no delay at all.
> 
> 
> (*) 2^41 / 10^6 = 2199023,255552
> 2199023 < 2^41 / 10^6
> UDELAY_MULT = 2199023*HZ / 2^11 < 2^30*HZ / 10^6
> 
> cycles = N * UDELAY_MULT * freq/HZ / 2^30
>        < N * 2^30*HZ / 10^6 * freq/HZ / 2^30
>        < N / 10^6 * freq
> 
> 
> Proposed fix
> 
> Since results are always rounded down, all we need is to increment
> the result by 1 to round it up.
> 
> Would someone ACK the patch below?
> 
> Regards.
> 
> 
> Patch against 4.0-rc4
> 
> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
> index 312d43e..3cfbd07 100644
> --- a/arch/arm/lib/delay.c
> +++ b/arch/arm/lib/delay.c
> @@ -66,7 +66,7 @@ static void __timer_const_udelay(unsigned long xloops)
>  {
>         unsigned long long loops = xloops;
>         loops *= arm_delay_ops.ticks_per_jiffy;
> -       __timer_delay(loops >> UDELAY_SHIFT);
> +       __timer_delay((loops >> UDELAY_SHIFT) + 1);
>  }

If loops is a multiple of 2 ^ UDELAY_SHIFT, then your result is too
high by one. The proper way to round by excess is the following :

    __timer_delay((loops + (1 << UDELAY_SHIFT) - 1) >> UDELAY_SHIFT);

That way it does +1 for every value of loop not an exact multiple
of 2^UDELAY_SHIFT.

Regards,
willy
Russell King - ARM Linux April 10, 2015, 11:44 a.m. UTC | #2
On Fri, Apr 10, 2015 at 01:25:37PM +0200, Mason wrote:
> If I understand correctly, most drivers expect udelay(N) to spin for
> at least N µs. Is that correct? In that use case, spinning less might
> introduce subtle heisenbugs.

We've never guaranteed this.

The fact is that udelay() can delay for _approximately_ the time you
ask for - it might be slightly shorter, or it could be much longer
than you expect.  On most UP implementations using the software loop
it will typically be around 1% slower than requested.

Adding 1us to every delay is going to be very bad.  Rather than doing
that, why not arrange for the rounding error to be accomodated?

> Typical example
> 
> timer->freq = 90 kHz && HZ = 100
> (thus UDELAY_MULT = 107374 && ticks_per_jiffy = 900)
> 
> udelay(10) => __timer_const_udelay(10*107374)
>            => __timer_delay((1073740*900) >> 30)
>            => __timer_delay(0)

In other words, add (1 << 30) - 1 before shifting right by 30.
Mason April 10, 2015, 12:41 p.m. UTC | #3
On 10/04/2015 13:44, Russell King - ARM Linux wrote:
> On Fri, Apr 10, 2015 at 01:25:37PM +0200, Mason wrote:
>> If I understand correctly, most drivers expect udelay(N) to spin for
>> at least N µs. Is that correct? In that use case, spinning less might
>> introduce subtle heisenbugs.
>
> We've never guaranteed this.
>
> The fact is that udelay() can delay for _approximately_ the time you
> ask for - it might be slightly shorter, or it could be much longer
> than you expect.

OK, but asking for 10 µs and spinning 0 is a problem, right?

> On most UP implementations using the software loop
> it will typically be around 1% slower than requested.

Please correct any misconception, it seems to me that typical
driver writers, reading "operation X takes 10 µs to complete"
will write udelay(10); (if they want to spin-wait)

Do you think they should take the inherent imprecision of loop-based
delays into account, and add a small cushion to be safe?

Also, it seems timer-based delays were introduced, in part, because
they did away with the imprecision of loop-based delays on DVFS
systems.

> Adding 1us to every delay is going to be very bad.  Rather than doing
> that, why not arrange for the rounding error to be accommodated?

Are you saying that my proposed patch blindly adds a 1 µs delay?

In fact, it adds one /timer cycle/ (which, in my 90 kHz example,
is much more than 1 µs), and this increment is just rounding up
the conversion result.

So if a user wants to spin N µs, and N µs is equivalent to
X.7 cycles, then we delay X+1 cycles.

Regards.
Mason April 10, 2015, 2:53 p.m. UTC | #4
Hello Willy,

On 10/04/2015 13:42, Willy Tarreau wrote:
> Hi,
>
> On Fri, Apr 10, 2015 at 01:25:37PM +0200, Mason wrote:
>> Hello everyone,
>>
>> This is take 2 of my tiny delay.c patch
>>
>> Problem statement
>>
>> When converting microseconds to timer cycles in __timer_udelay() and
>> __timer_const_udelay(), the result is rounded down(*), which means the
>> system will not spin as long as requested (specifically, between
>> epsilon and 1 cycle shorter).
>>
>> If I understand correctly, most drivers expect udelay(N) to spin for
>> at least N µs. Is that correct? In that use case, spinning less might
>> introduce subtle heisenbugs.
>>
>>
>> Typical example
>>
>> timer->freq = 90 kHz && HZ = 100
>> (thus UDELAY_MULT = 107374 && ticks_per_jiffy = 900)
>>
>> udelay(10) => __timer_const_udelay(10*107374)
>>             => __timer_delay((1073740*900) >> 30)
>>             => __timer_delay(0)
>>
>> So udelay(10) resolves to no delay at all.
>>
>>
>> (*) 2^41 / 10^6 = 2199023,255552
>> 2199023 < 2^41 / 10^6
>> UDELAY_MULT = 2199023*HZ / 2^11 < 2^30*HZ / 10^6
>>
>> cycles = N * UDELAY_MULT * freq/HZ / 2^30
>>         < N * 2^30*HZ / 10^6 * freq/HZ / 2^30
>>         < N / 10^6 * freq
>>
>>
>> Proposed fix
>>
>> Since results are always rounded down, all we need is to increment
>> the result by 1 to round it up.
>>
>> Would someone ACK the patch below?
>>
>> Regards.
>>
>>
>> Patch against 4.0-rc4
>>
>> diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
>> index 312d43e..3cfbd07 100644
>> --- a/arch/arm/lib/delay.c
>> +++ b/arch/arm/lib/delay.c
>> @@ -66,7 +66,7 @@ static void __timer_const_udelay(unsigned long xloops)
>>   {
>>          unsigned long long loops = xloops;
>>          loops *= arm_delay_ops.ticks_per_jiffy;
>> -       __timer_delay(loops >> UDELAY_SHIFT);
>> +       __timer_delay((loops >> UDELAY_SHIFT) + 1);
>>   }
>
> If loops is a multiple of 2 ^ UDELAY_SHIFT, then your result is too
> high by one. The proper way to round by excess is the following :
>
>      __timer_delay((loops + (1 << UDELAY_SHIFT) - 1) >> UDELAY_SHIFT);
>
> That way it does +1 for every value of loop not an exact multiple
> of 2^UDELAY_SHIFT.

The important thing to realize is that xloops is already rounded down,
because we use 2199023 as an approximation of 2^41 / 10^6.

Thus, even when 'loops' is a multiple of 2^30, we'll want to round up.

Illustration

timer->freq = 100*2^20 && HZ = 100
(thus UDELAY_MULT = 107374 && ticks_per_jiffy = 2^20)

Suppose udelay(512)
so we want to spin for 512 / 10^6 * 100*2^20 = 53687,0912 cycles
i.e. 53688 cycles if we round up.

loops = 512 * 107374 * 2^20 = 53687 * 2^30

If we just add (2^30-1) before shifting by 30, the result comes out
to 53687, but (loops >> 30) + 1 is closer to what we really want.

One might argue that the difference between 53687 and 53688 is
lost in the noise and thus irrelevant. I can agree with that.

Which is why I chose the simpler

   __timer_delay((loops >> UDELAY_SHIFT) + 1);

over

   __timer_delay((loops + (1 << UDELAY_SHIFT) - 1) >> UDELAY_SHIFT);


Do you disagree with this logic?

Regards.
Russell King - ARM Linux April 10, 2015, 3:06 p.m. UTC | #5
On Fri, Apr 10, 2015 at 02:41:37PM +0200, Mason wrote:
> On 10/04/2015 13:44, Russell King - ARM Linux wrote:
> >On Fri, Apr 10, 2015 at 01:25:37PM +0200, Mason wrote:
> >>If I understand correctly, most drivers expect udelay(N) to spin for
> >>at least N µs. Is that correct? In that use case, spinning less might
> >>introduce subtle heisenbugs.
> >
> >We've never guaranteed this.
> >
> >The fact is that udelay() can delay for _approximately_ the time you
> >ask for - it might be slightly shorter, or it could be much longer
> >than you expect.
> 
> OK, but asking for 10 µs and spinning 0 is a problem, right?

Potentially.

> >On most UP implementations using the software loop
> >it will typically be around 1% slower than requested.
> 
> Please correct any misconception, it seems to me that typical
> driver writers, reading "operation X takes 10 µs to complete"
> will write udelay(10); (if they want to spin-wait)

Arguments do not matter here, sorry.  What I said above is the way it
behaves, period.  It's not for discussion.

It may interest you that I discussed this issue with Linus, and Linus
also said that it _isn't_ a problem and it _isn't_ something we care
to fix.

So, like it or not, we're stuck with udelay(10) being possible to
delay by 9.99us intead of 10us.

Where the inaccuracy comes from is entirely down to the way we calculate
loops_per_jiffy - this is the number of loop cycles between two timer
interrupts - but this does _not_ take account of the time to execute a
timer interrupt.

So, loops_per_jiffy is the number of loop cycles between two timer
interrupts minus the time taken to service the timer interrupt.

And what this means is that udelay(n) where 'n' is less than the
period between two timer interrupts /will/ be, and is /expected to
be/ potentially shorter than the requested period.

There's no getting away from that, we can't estimate how long the timer
interrupt takes to handle without the use of an external timer, and if
we've got an external timer, we might as well use it for all delays.

> Do you think they should take the inherent imprecision of loop-based
> delays into account, and add a small cushion to be safe?

No.  See above.  Not doing that.  Live with it.

Fix the rounding errors if you wish, but do _not_ try to fix the
"udelay() may return slightly earlier than requested" problem.  I
will not apply a patch to fix _that_ problem.
Willy Tarreau April 10, 2015, 3:06 p.m. UTC | #6
Hi Mason,

On Fri, Apr 10, 2015 at 04:53:10PM +0200, Mason wrote:
> The important thing to realize is that xloops is already rounded down,
> because we use 2199023 as an approximation of 2^41 / 10^6.
> 
> Thus, even when 'loops' is a multiple of 2^30, we'll want to round up.
> 
> Illustration
> 
> timer->freq = 100*2^20 && HZ = 100
> (thus UDELAY_MULT = 107374 && ticks_per_jiffy = 2^20)
> 
> Suppose udelay(512)
> so we want to spin for 512 / 10^6 * 100*2^20 = 53687,0912 cycles
> i.e. 53688 cycles if we round up.
> 
> loops = 512 * 107374 * 2^20 = 53687 * 2^30
> 
> If we just add (2^30-1) before shifting by 30, the result comes out
> to 53687, but (loops >> 30) + 1 is closer to what we really want.
> 
> One might argue that the difference between 53687 and 53688 is
> lost in the noise and thus irrelevant. I can agree with that.
> 
> Which is why I chose the simpler
> 
>   __timer_delay((loops >> UDELAY_SHIFT) + 1);
> 
> over
> 
>   __timer_delay((loops + (1 << UDELAY_SHIFT) - 1) >> UDELAY_SHIFT);
> 
> 
> Do you disagree with this logic?

OK it seems to make sense then.

Regards,
Willy
Mason April 10, 2015, 3:30 p.m. UTC | #7
Hello Russell,

I appreciate (very much so) that you spend time replying to me,
but I also sense a lot of animosity, and I don't know what I've
done wrong to deserve it :-(

On 10/04/2015 17:06, Russell King - ARM Linux wrote:
> On Fri, Apr 10, 2015 at 02:41:37PM +0200, Mason wrote:
>> On 10/04/2015 13:44, Russell King - ARM Linux wrote:
>>> On Fri, Apr 10, 2015 at 01:25:37PM +0200, Mason wrote:
>>>> If I understand correctly, most drivers expect udelay(N) to spin for
>>>> at least N µs. Is that correct? In that use case, spinning less might
>>>> introduce subtle heisenbugs.
>>>
>>> We've never guaranteed this.
>>>
>>> The fact is that udelay() can delay for _approximately_ the time you
>>> ask for - it might be slightly shorter, or it could be much longer
>>> than you expect.
>>
>> OK, but asking for 10 µs and spinning 0 is a problem, right?
>
> Potentially.
>
>>> On most UP implementations using the software loop
>>> it will typically be around 1% slower than requested.
>>
>> Please correct any misconception, it seems to me that typical
>> driver writers, reading "operation X takes 10 µs to complete"
>> will write udelay(10); (if they want to spin-wait)
>
> Arguments do not matter here, sorry.  What I said above is the way it
> behaves, period.  It's not for discussion.
>
> It may interest you that I discussed this issue with Linus, and Linus
> also said that it _isn't_ a problem and it _isn't_ something we care
> to fix.
>
> So, like it or not, we're stuck with udelay(10) being possible to
> delay by 9.99us instead of 10us.
>
> Where the inaccuracy comes from is entirely down to the way we calculate
> loops_per_jiffy - this is the number of loop cycles between two timer
> interrupts - but this does _not_ take account of the time to execute a
> timer interrupt.
>
> So, loops_per_jiffy is the number of loop cycles between two timer
> interrupts minus the time taken to service the timer interrupt.
>
> And what this means is that udelay(n) where 'n' is less than the
> period between two timer interrupts /will/ be, and is /expected to
> be/ potentially shorter than the requested period.

You've made it clear how loop-based delays are implemented; and also
that loop-based delays are typically 1% shorter than requested.
(Thanks for the overview, by the way.) Please note that I haven't
touched to the loop-based code, I'm only discussing the timer-based
code.

> There's no getting away from that, we can't estimate how long the timer
> interrupt takes to handle without the use of an external timer, and if
> we've got an external timer, we might as well use it for all delays.

Exactly! And my patch only changes __timer_const_udelay() so again I'm
not touching loop-based code.

>> Do you think they should take the inherent imprecision of loop-based
>> delays into account, and add a small cushion to be safe?
>
> No.  See above.  Not doing that.  Live with it.

See, I don't understand your "Not doing that" statement.
I didn't suggest changing the implementation, I asked your opinion
(and the opinion of others on the list) whether driver _writers_
should take into account _in their code_ the 1% error you mentioned.

Specifically, should a driver writer use

   udelay(101);

when his spec says to spin 100 µs?

(Anyway, this is just a tangential question, as I digest the ins
and outs of kernel and driver development.)

> Fix the rounding errors if you wish, but do _not_ try to fix the
> "udelay() may return slightly earlier than requested" problem.  I
> will not apply a patch to fix _that_ problem.

Can I submit as-is the one-line patch I proposed earlier?

Regards.
Russell King - ARM Linux April 10, 2015, 4:08 p.m. UTC | #8
On Fri, Apr 10, 2015 at 05:30:24PM +0200, Mason wrote:
> I appreciate (very much so) that you spend time replying to me,
> but I also sense a lot of animosity, and I don't know what I've
> done wrong to deserve it :-(

I'm putting the point across strongly because I really don't think
there is an issue to be solved here.

> On 10/04/2015 17:06, Russell King - ARM Linux wrote:
> >And what this means is that udelay(n) where 'n' is less than the
> >period between two timer interrupts /will/ be, and is /expected to
> >be/ potentially shorter than the requested period.
> 
> You've made it clear how loop-based delays are implemented; and also
> that loop-based delays are typically 1% shorter than requested.
> (Thanks for the overview, by the way.) Please note that I haven't
> touched to the loop-based code, I'm only discussing the timer-based
> code.

1% is a figure I pulled out of the air.  It really depends on the CPU
instructions per cycle, and how much work is being done in the timer
interrupt handler.

> >There's no getting away from that, we can't estimate how long the timer
> >interrupt takes to handle without the use of an external timer, and if
> >we've got an external timer, we might as well use it for all delays.
> 
> Exactly! And my patch only changes __timer_const_udelay() so again I'm
> not touching loop-based code.

What I'm trying to get through to you is that udelay() as a _whole_ does
not provide a guarantee that it will wait for _at least_ the time you
asked for.  All that it does is provide an _approximate_ delay.

Yes, we can improve the timer delays to provide a guaranteed delay of at
least the requested period.  We _can't_ do the same for the loop-based
delay.

So, if we fix the timer based delays, udelay() will then have two differing
expectations depending on whether it's using a timer or a loop based delay.
That is bad.  It's extremely poor.  The expectations of an API should not
change because of a different implementation, that's the way bugs happen.

That's why...

> >No.  See above.  Not doing that.  Live with it.

it's not a problem, and why we're not going to fix the timer code to
provide a minimum guaranteed delay.

> Specifically, should a driver writer use
> 
>   udelay(101);
> 
> when his spec says to spin 100 µs?
> 
> (Anyway, this is just a tangential question, as I digest the ins
> and outs of kernel and driver development.)

A driver writer should always use the required delay plus an adequate
cushion to ensure that the delay required by the hardware is met.

I suggest you read this:

	https://lkml.org/lkml/2011/1/9/37

which is the discussion I had with Linus on the point you are raising
here, which is the 4th hit in google for "udelay shorter delays".

Given that some udelay() implementations have been known to be as much
as 50% off, that suggests using a delay value of 200 in your above
example for a delay of 100µs.  ARM is /relatively/ good in that
regard, cpufreq and scheduling effects withstanding.
diff mbox

Patch

diff --git a/arch/arm/lib/delay.c b/arch/arm/lib/delay.c
index 312d43e..3cfbd07 100644
--- a/arch/arm/lib/delay.c
+++ b/arch/arm/lib/delay.c
@@ -66,7 +66,7 @@  static void __timer_const_udelay(unsigned long xloops)
  {
         unsigned long long loops = xloops;
         loops *= arm_delay_ops.ticks_per_jiffy;
-       __timer_delay(loops >> UDELAY_SHIFT);
+       __timer_delay((loops >> UDELAY_SHIFT) + 1);
  }
  
  static void __timer_udelay(unsigned long usecs)