[v2] timers: Fix usleep_range() in the context of wake_up_process()
diff mbox

Message ID 1476133442-17757-1-git-send-email-dianders@chromium.org
State New
Headers show

Commit Message

Doug Anderson Oct. 10, 2016, 9:04 p.m. UTC
Users of usleep_range() expect that it will _never_ return in less time
than the minimum passed parameter.  However, nothing in any of the code
ensures this.  Specifically:

usleep_range() => do_usleep_range() => schedule_hrtimeout_range() =>
schedule_hrtimeout_range_clock() just ends up calling schedule() with an
appropriate timeout set using the hrtimer.  If someone else happens to
wake up our task then we'll happily return from usleep_range() early.

msleep() already has code to handle this case since it will loop as long
as there was still time left.  usleep_range() had no such loop.

The problem is is easily demonstrated with a small bit of test code:

  static int usleep_test_task(void *data)
  {
    atomic_t *done = data;
    ktime_t start, end;

    start = ktime_get();
    usleep_range(50000, 100000);
    end = ktime_get();
    pr_info("Requested 50000 - 100000 us.  Actually slept for %llu us\n",
      (unsigned long long)ktime_to_us(ktime_sub(end, start)));
    atomic_set(done, 1);

    return 0;
  }

  static void run_usleep_test(void)
  {
    struct task_struct *t;
    atomic_t done;

    atomic_set(&done, 0);

    t = kthread_run(usleep_test_task, &done, "usleep_test_task");
    while (!atomic_read(&done)) {
      wake_up_process(t);
      udelay(1000);
    }
    kthread_stop(t);
  }

If you run the above code without this patch you get things like:
  Requested 50000 - 100000 us.  Actually slept for 967 us

If you run the above code _with_ this patch, you get:
  Requested 50000 - 100000 us.  Actually slept for 50001 us

Presumably this problem was not detected before because:
- It's not terribly common to use wake_up_process() directly.
- Other ways for processes to wake up are not typically mixed with
  usleep_range().
- There aren't lots of places that use usleep_range(), since many people
  call either msleep() or udelay().

Reported-by: Tao Huang <huangtao@rock-chips.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
Changes in v2:
- Fixed stupid bug that snuck in before posting
- Use ktime_before
- Remove delta from the loop

NOTE: Tested against 4.4 tree w/ backports.  I'm trying to get myself
up and running with mainline again to test there now but it might be a
little while.  Hopefully this time I don't shoot myself in the foot.

 kernel/time/timer.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

Comments

Doug Anderson Oct. 10, 2016, 10:39 p.m. UTC | #1
Hi,

On Mon, Oct 10, 2016 at 2:04 PM, Douglas Anderson <dianders@chromium.org> wrote:
> Users of usleep_range() expect that it will _never_ return in less time
> than the minimum passed parameter.  However, nothing in any of the code
> ensures this.  Specifically:
>
> usleep_range() => do_usleep_range() => schedule_hrtimeout_range() =>
> schedule_hrtimeout_range_clock() just ends up calling schedule() with an
> appropriate timeout set using the hrtimer.  If someone else happens to
> wake up our task then we'll happily return from usleep_range() early.
>
> msleep() already has code to handle this case since it will loop as long
> as there was still time left.  usleep_range() had no such loop.
>
> The problem is is easily demonstrated with a small bit of test code:
>
>   static int usleep_test_task(void *data)
>   {
>     atomic_t *done = data;
>     ktime_t start, end;
>
>     start = ktime_get();
>     usleep_range(50000, 100000);
>     end = ktime_get();
>     pr_info("Requested 50000 - 100000 us.  Actually slept for %llu us\n",
>       (unsigned long long)ktime_to_us(ktime_sub(end, start)));
>     atomic_set(done, 1);
>
>     return 0;
>   }
>
>   static void run_usleep_test(void)
>   {
>     struct task_struct *t;
>     atomic_t done;
>
>     atomic_set(&done, 0);
>
>     t = kthread_run(usleep_test_task, &done, "usleep_test_task");
>     while (!atomic_read(&done)) {
>       wake_up_process(t);
>       udelay(1000);
>     }
>     kthread_stop(t);
>   }
>
> If you run the above code without this patch you get things like:
>   Requested 50000 - 100000 us.  Actually slept for 967 us
>
> If you run the above code _with_ this patch, you get:
>   Requested 50000 - 100000 us.  Actually slept for 50001 us
>
> Presumably this problem was not detected before because:
> - It's not terribly common to use wake_up_process() directly.
> - Other ways for processes to wake up are not typically mixed with
>   usleep_range().
> - There aren't lots of places that use usleep_range(), since many people
>   call either msleep() or udelay().
>
> Reported-by: Tao Huang <huangtao@rock-chips.com>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> Changes in v2:
> - Fixed stupid bug that snuck in before posting
> - Use ktime_before
> - Remove delta from the loop
>
> NOTE: Tested against 4.4 tree w/ backports.  I'm trying to get myself
> up and running with mainline again to test there now but it might be a
> little while.  Hopefully this time I don't shoot myself in the foot.

OK, I finally resolved all my issues and have officially tested this
on mainline (not just in my private kernel) linux 4.8 on an
rk3288-veyron-jerry based device.  Problem still exists there and fix
still works.

Sorry for the noise earlier.  Hopefully this version of the patch
addresses all of Andreas's review feedback (and thank you very much
for the review!)

-Doug
Thomas Gleixner Oct. 11, 2016, 7:14 a.m. UTC | #2
On Mon, 10 Oct 2016, Douglas Anderson wrote:
> Users of usleep_range() expect that it will _never_ return in less time
> than the minimum passed parameter.  However, nothing in any of the code
> ensures this.  Specifically:

There is no such guarantee for that interface and never has been, so how
did you make sure that none of the existing users is relying on this?

You can't just can't just declare that all all of the users expect that and
be done with it.

Thanks,

	tglx
Doug Anderson Oct. 11, 2016, 4:33 p.m. UTC | #3
Thomas,

+clock and regulator folks since part of my arguments below involve
the regulator / clock core.  If you're not interested in this topic,
feel free to ignore.  Original patch can be found on LKML or at
<https://patchwork.kernel.org/patch/9369963/>

On Tue, Oct 11, 2016 at 12:14 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Mon, 10 Oct 2016, Douglas Anderson wrote:
>> Users of usleep_range() expect that it will _never_ return in less time
>> than the minimum passed parameter.  However, nothing in any of the code
>> ensures this.  Specifically:
>
> There is no such guarantee for that interface and never has been, so how
> did you make sure that none of the existing users is relying on this?
>
> You can't just can't just declare that all all of the users expect that and
> be done with it.

You're right that I can't guarantee that no callers are relying on the
existing behavior of a wake_up_process() causing usleep_range() to
return early.  I would say, however, that all callers I've seen are
absolutely relying on the min delay being enforced and I've never
personally seen a caller relying on being woken up from
usleep_range().  All the users relying on the min delay being enforced
have never had a problem because it's not common for a task that's in
usleep_range() to be woken up, but if you happen to call some generic
code in a context where a wake up is possible you'll get a very
unexpected bug (like I just did).

I can try to look through some callers to usleep_range(), but there
are 1620 of them as per my "git grep", so I can't look at them all.
:(

...in any case, below are my arguments about why we should change
this.  If we can't agree to change this, then IMHO the way forward is
to deprecate usleep_range() and start the transition of all callers to
one of two functions: usleep_atlest() and usleep_wakeable().  As
argued below I think that usleep_range() name implies that it will at
least sleep the minimum so I would really like to avoid keeping the
name usleep_range() and also keeping the existing behavior.

--

1. I believe msleep() and usleep_range() should be parallel functions.
msleep() has this guarantee of not ending early.  I believe users will
expect that usleep() will also.

2. The "timers-howto.txt" does not have any information about the fact
that usleep_range() might end early.  This document implies that
(assuming you're not in atomic) udelay() / usleep_range() should be
chosen between simply based on how long the expected delay is.  There
is no note that you shouldn't use usleep_range() if you can't handle
ending early.

3. It seems to me that if someone wants to wait to be woken up but
they want a timeout, they wouldn't reach for usleep_range().  They
would instead think to use schedule_hrtimeout_range() directly.  Said
another way: we already have a function for scheduling a delay that
can be interrupted with a wakeup, so why do we need usleep_range() to
also behave this way?

4. We need to change a lot of places if we want to handle the fact
that usleep_range() might wake up early.  Every instance of
usleep_range() that I've ever seen is not expecting to end early and
is expecting at least the minimum delay for correctness.  These
functions have all worked correctly because they didn't happen to be
called in a context where someone might wake up the calling task.
...but if suddenly one of these functions is called on a task that
might be woken up then all their correctness will go out the door.

In other words, I see a lot of usleep_range() calls that look like this:

  /* After 5 ms we know reset is done */
  assert_reset()
  usleep_range(5000, 7000)

If that returns early then we'll get badness.

--

Here's a (perhaps more realistic) example of the problem.  Imagine
that we are trying to do some type of DVF (Dynamic Voltage Frequency)
switch.  That involves changing both a regulator and a clock.  It's
possible that we might want to try to do these two things in parallel
on two tasks, so we could imagine doing:

Task #1 (voltage)
A) Call into regulator core to change voltage.
A1) Regulator core will call into arbitrary regulator driver.
A2) Regulator core won't return until regulator is at the right level.
A3) Delay waiting for regulator might be done with usleep_range().
B) Signal Task #2 that we're done.
C) Wait on Task #2 to finish.

Task #2 (clock)
A) Call into clock core to change clock.
A1) Clock core will call into arbitrary clock driver.
A2) Clock core won't return until clock is stable.
A3) Delay waiting for clock might be done with usleep_range().
B) Signal Task #1 that we're done.
C) Wait on Task #1 to finish.

Depending on the delays and the mechanism used to signal and wait, it
is possible that the "Signal" step above will end up waking up the
other task.  If it does this while the other task is in the
usleep_range() code then we'll have serious problems: we'll either run
with an clock or a regulator that isn't all the way ready.

Now perhaps you will say that we should re-design the signaling
mechanism above so that it doesn't cause a wake up of the other task.
We certainly could try to do that if need be (we'd have to validate
that there is really NEVER an unexpected wakeup), but I'd expect a lot
of push back since I don't think folks would think that waking up a
task would cause such incorrect behavior.

Now perhaps you will say that we should re-design all clock and
regulator drivers to not call usleep_range() (or to add a loop).  This
implies that we should add a new function so they don't all need to
get this loop right.


NOTE: In the particular failure case I'm running into, I'm on a system
using the (out of tree) "interactive" governor.  I haven't followed
through the exact code path, but I see that it is using
wake_up_process() in at least several places to wake up the
"speedchange_task".  This is the same task that might be calling into
cpufreq to change the frequency and might be calling into the
regulator core to do the delay.  We were specifically seeing cases
where the usleep_range() in PWM regulator was returning early.

--

OK, I know that the above is pretty long.  Hopefully it made sense.
Feel free to grab me on IRC if it helps.  I can be found on freenode
as dianders and can join any channel where it's helpful to hash this
out.

-Doug
Andreas Mohr Oct. 11, 2016, 6:25 p.m. UTC | #4
On Tue, Oct 11, 2016 at 09:14:38AM +0200, Thomas Gleixner wrote:
> On Mon, 10 Oct 2016, Douglas Anderson wrote:
> > Users of usleep_range() expect that it will _never_ return in less time
> > than the minimum passed parameter.  However, nothing in any of the code
> > ensures this.  Specifically:
> 
> There is no such guarantee for that interface and never has been, so how
> did you make sure that none of the existing users is relying on this?
> 
> You can't just can't just declare that all all of the users expect that and
> be done with it.

Hmm, somehow I don't manage to follow these thoughts.

https://www.kernel.org/doc/htmldocs/device-drivers/API-usleep-range.html
(as a hopefully sufficiently authoritative source of documentation)
clearly specifies min to be
"Minimum time in usecs to sleep"
, which is what one would expect a two-param interface here to be
(minimum-maximum),
i.e. what would be the *natural* protocol I'd think.

Also, [finally...] starting to enforce the minimum time
is an additional *constraint* on the protocol,
i.e. it's not at all like we are getting more *liberal* here
(since usually getting more liberal in certain protocols
is what will cause trouble, I'd think).

Not to mention that
desiring a delay in processing most certainly is
what caused users of this API to decide to invoke it in the first place
(else they would just have chosen to carry on with delay-less processing
and be done with it).
And those users then surely wouldn't want to experience a behaviour
where the delay may be ended at any time,
however short that may end up being.


A related topic probably is
premature wakeups (e.g. signal-induced) of select() etc. protocol.

Greetings,

Andreas Mohr
Brian Norris Oct. 11, 2016, 6:54 p.m. UTC | #5
Hi Doug,

On Mon, Oct 10, 2016 at 02:04:02PM -0700, Doug Anderson wrote:
> Users of usleep_range() expect that it will _never_ return in less time
> than the minimum passed parameter.  However, nothing in any of the code
> ensures this.

Like you and Andreas, I also don't understand Thomas's objection to your
above claim on what users of this function expect. I believe you have
clearly laid out why the current behavior needs to be changed somehow;
IMO the only remaining question is "how." Your follow up covers all this
plenty well for me.

Either we need a fix along the lines of what you've proposed, or else we
need to completely rethink almost all uses of usleep_range().

...

> Reported-by: Tao Huang <huangtao@rock-chips.com>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> Changes in v2:
> - Fixed stupid bug that snuck in before posting
> - Use ktime_before
> - Remove delta from the loop
> 
> NOTE: Tested against 4.4 tree w/ backports.  I'm trying to get myself
> up and running with mainline again to test there now but it might be a
> little while.  Hopefully this time I don't shoot myself in the foot.
> 
>  kernel/time/timer.c | 20 ++++++++++++++++++--
>  1 file changed, 18 insertions(+), 2 deletions(-)

I've reviewed the logic here to the best of my ability, and it looks
good to me now. I'll admit that I'm not really a timekeeping or
scheduler expert, but FWIW:

Reviewed-by: Brian Norris <briannorris@chromium.org>

> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 32bf6f75a8fe..219439efd56a 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1898,12 +1898,28 @@ EXPORT_SYMBOL(msleep_interruptible);
>  
>  static void __sched do_usleep_range(unsigned long min, unsigned long max)
>  {
> +	ktime_t now, end;
>  	ktime_t kmin;
>  	u64 delta;
> +	int ret;
>  
> -	kmin = ktime_set(0, min * NSEC_PER_USEC);
> +	now = ktime_get();
> +	end = ktime_add_us(now, min);
>  	delta = (u64)(max - min) * NSEC_PER_USEC;
> -	schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
> +	do {
> +		kmin = ktime_sub(end, now);
> +		ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
> +
> +		/*
> +		 * If schedule_hrtimeout_range() returns 0 then we actually
> +		 * hit the timeout. If not then we need to re-calculate the
> +		 * new timeout ourselves.
> +		 */
> +		if (ret == 0)
> +			break;
> +
> +		now = ktime_get();
> +	} while (ktime_before(now, end));
>  }
>  
>  /**
> -- 
> 2.8.0.rc3.226.g39d4020
>
Andreas Mohr Oct. 11, 2016, 7:30 p.m. UTC | #6
Hi,

decided to write a review now, slightly delayed, sorry.

On Mon, Oct 10, 2016 at 02:04:02PM -0700, Douglas Anderson wrote:
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 32bf6f75a8fe..219439efd56a 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1898,12 +1898,28 @@ EXPORT_SYMBOL(msleep_interruptible);
>  
>  static void __sched do_usleep_range(unsigned long min, unsigned long max)
>  {
> +	ktime_t now, end;
>  	ktime_t kmin;
>  	u64 delta;
> +	int ret;
>  
> -	kmin = ktime_set(0, min * NSEC_PER_USEC);
> +	now = ktime_get();
> +	end = ktime_add_us(now, min);
>  	delta = (u64)(max - min) * NSEC_PER_USEC;
> -	schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
> +	do {
> +		kmin = ktime_sub(end, now);
> +		ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
> +
> +		/*
> +		 * If schedule_hrtimeout_range() returns 0 then we actually
> +		 * hit the timeout. If not then we need to re-calculate the
> +		 * new timeout ourselves.
> +		 */
> +		if (ret == 0)
> +			break;
> +
> +		now = ktime_get();
> +	} while (ktime_before(now, end));

This loop implementation relies on negative kmin (result of ktime_sub())
getting internally handled by schedule_hrtimeout_range() as a 0 result.
If that ain't the case, then the loop itself will need to handle
negative values.
OK, scratch that, due to guaranteed-unchanged values of now and end,
result of directly subsequent ktime_sub() is guaranteed to
not deviate from what ktime_before() figured.
However, this is somewhat differing handling of these two APIs.


Which brings me to my second point:
somehow doing both ktime_before() and ktime_sub() feels redundant,
since they are both about arguments now and end,
i.e. they are *both* evaluating a "distance".
(this could simply calculate the distance, and then
1. use that calculated distance value, and
2. check that calculated distance value against being negative).
So, most likely it ought to be achievable to have just *one* of these
two active (which would likely be ktime_sub(),
simply since we need its result as schedule_hrtimeout_range() input...).
And that way we would save big (hohumm) on instruction cache footprint :)

Hmm, but ktime API does not have sufficiently open-coded handling of the
ktime_sub()-based distance value - the possibly only way to do an
"is it negative?" check would be by doing
ktime_compare(dist, ktime_null_value),
which might be pointless
since it's a comparably large effort
vs. the current ktime_before(), ktime_sub() case.

BTW, another argument for loop rework might be that
the result of ktime_sub() might already be improper
(due to being negative!) for
subsequent invocation of schedule_hrtimeout_range(),
i.e. there's an argument to be made that
the loop tail check instead ought to be done as a negative-value check
directly prior to schedule_hrtimeout_range() invocation.


Hmm, if schedule_hrtimeout_range() can be relied on to
internally properly be doing the negative check,
then one could simply say that
the annoyingly extra calculation via ktime_before() call
should simply be removed completely.
Which would be a good step since it would centralize protocol behaviour
within the inner handling of the helper
rather than bloating user-side instruction cache footprint.


</micro_optimization> ?

Thanks,

Andreas Mohr
Doug Anderson Oct. 11, 2016, 8:02 p.m. UTC | #7
Andreas,

On Tue, Oct 11, 2016 at 12:30 PM, Andreas Mohr <andi@lisas.de> wrote:
> Hi,
>
> decided to write a review now, slightly delayed, sorry.
>
> On Mon, Oct 10, 2016 at 02:04:02PM -0700, Douglas Anderson wrote:
>> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
>> index 32bf6f75a8fe..219439efd56a 100644
>> --- a/kernel/time/timer.c
>> +++ b/kernel/time/timer.c
>> @@ -1898,12 +1898,28 @@ EXPORT_SYMBOL(msleep_interruptible);
>>
>>  static void __sched do_usleep_range(unsigned long min, unsigned long max)
>>  {
>> +     ktime_t now, end;
>>       ktime_t kmin;
>>       u64 delta;
>> +     int ret;
>>
>> -     kmin = ktime_set(0, min * NSEC_PER_USEC);
>> +     now = ktime_get();
>> +     end = ktime_add_us(now, min);
>>       delta = (u64)(max - min) * NSEC_PER_USEC;
>> -     schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
>> +     do {
>> +             kmin = ktime_sub(end, now);
>> +             ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
>> +
>> +             /*
>> +              * If schedule_hrtimeout_range() returns 0 then we actually
>> +              * hit the timeout. If not then we need to re-calculate the
>> +              * new timeout ourselves.
>> +              */
>> +             if (ret == 0)
>> +                     break;
>> +
>> +             now = ktime_get();
>> +     } while (ktime_before(now, end));
>
> This loop implementation relies on negative kmin (result of ktime_sub())
> getting internally handled by schedule_hrtimeout_range() as a 0 result.
> If that ain't the case, then the loop itself will need to handle
> negative values.
> OK, scratch that, due to guaranteed-unchanged values of now and end,
> result of directly subsequent ktime_sub() is guaranteed to
> not deviate from what ktime_before() figured.
> However, this is somewhat differing handling of these two APIs.

I read this as: no bug, feel free to ignore.


> Which brings me to my second point:
> somehow doing both ktime_before() and ktime_sub() feels redundant,
> since they are both about arguments now and end,
> i.e. they are *both* evaluating a "distance".
> (this could simply calculate the distance, and then
> 1. use that calculated distance value, and
> 2. check that calculated distance value against being negative).
> So, most likely it ought to be achievable to have just *one* of these
> two active (which would likely be ktime_sub(),
> simply since we need its result as schedule_hrtimeout_range() input...).
> And that way we would save big (hohumm) on instruction cache footprint :)
>
> Hmm, but ktime API does not have sufficiently open-coded handling of the
> ktime_sub()-based distance value - the possibly only way to do an
> "is it negative?" check would be by doing
> ktime_compare(dist, ktime_null_value),
> which might be pointless
> since it's a comparably large effort
> vs. the current ktime_before(), ktime_sub() case.

I agree that we could try to save some math by making the loop more
complicated.  On the other hand, one could argue that a sufficiently
good compiler might actually be able to figure some of this stuff out,
since much of this stuff is just inlined 64-bit math comparisons.

That being said, if you want to propose some concrete changes that
save an instruction or two on most architectures and don't make the
code too complicated, I'm happy to spin my patch.


> BTW, another argument for loop rework might be that
> the result of ktime_sub() might already be improper
> (due to being negative!) for
> subsequent invocation of schedule_hrtimeout_range(),
> i.e. there's an argument to be made that
> the loop tail check instead ought to be done as a negative-value check
> directly prior to schedule_hrtimeout_range() invocation.
>
> Hmm, if schedule_hrtimeout_range() can be relied on to
> internally properly be doing the negative check,
> then one could simply say that
> the annoyingly extra calculation via ktime_before() call
> should simply be removed completely.
> Which would be a good step since it would centralize protocol behaviour
> within the inner handling of the helper
> rather than bloating user-side instruction cache footprint.

Ah, so you're saying just do a "while (true)" after changing the
behavior of schedule_hrtimeout_range_clock().  If everyone agrees that
they'd like to see this, I can do it.  I'd have to change
schedule_hrtimeout_range_clock() to check for <= 0 instead of just
comparing against 0.  ...and that would be an API change for
schedule_hrtimeout_range_clock(), and it seems like that would be yet
another source of bugs.  ...and this is to save an instruction?  It
doesn't seem worth it.


-Doug
Heiko Stuebner Oct. 11, 2016, 8:34 p.m. UTC | #8
Am Dienstag, 11. Oktober 2016, 09:14:38 CEST schrieb Thomas Gleixner:
> On Mon, 10 Oct 2016, Douglas Anderson wrote:
> > Users of usleep_range() expect that it will _never_ return in less time
> > than the minimum passed parameter.  However, nothing in any of the code
> 
> > ensures this.  Specifically:
> There is no such guarantee for that interface and never has been, so how
> did you make sure that none of the existing users is relying on this?
>
> You can't just can't just declare that all all of the users expect that and
> be done with it.

It may be my personal ignorance for not finding this explained, but the 
function documentation [0] states "min ... Minimum time in usecs to sleep"
which sounds pretty guaranteed to me.

One should of course make sure to not break anything intentionally, but having 
things expect to work outside these parameters sounds somewhat broken

If the specified values are flexible beyond their stated range, I guess the 
documentation should be updated to reflect that.


[0] https://www.kernel.org/doc/htmldocs/device-drivers/API-usleep-range.html
Andreas Mohr Oct. 11, 2016, 8:40 p.m. UTC | #9
On Tue, Oct 11, 2016 at 01:02:10PM -0700, Doug Anderson wrote:
> Andreas,
> 
> On Tue, Oct 11, 2016 at 12:30 PM, Andreas Mohr <andi@lisas.de> wrote:
> > This loop implementation relies on negative kmin (result of ktime_sub())
> > getting internally handled by schedule_hrtimeout_range() as a 0 result.
> > If that ain't the case, then the loop itself will need to handle
> > negative values.
> > OK, scratch that, due to guaranteed-unchanged values of now and end,
> > result of directly subsequent ktime_sub() is guaranteed to
> > not deviate from what ktime_before() figured.
> > However, this is somewhat differing handling of these two APIs.
> 
> I read this as: no bug, feel free to ignore.

Heh, yeah ("an alternate style of review" ;).


> I agree that we could try to save some math by making the loop more
> complicated.  On the other hand, one could argue that a sufficiently
> good compiler might actually be able to figure some of this stuff out,
> since much of this stuff is just inlined 64-bit math comparisons.

Indeed. "micro-optimization again".


> > BTW, another argument for loop rework might be that
> > the result of ktime_sub() might already be improper
> > (due to being negative!) for
> > subsequent invocation of schedule_hrtimeout_range(),
> > i.e. there's an argument to be made that
> > the loop tail check instead ought to be done as a negative-value check
> > directly prior to schedule_hrtimeout_range() invocation.
> >
> > Hmm, if schedule_hrtimeout_range() can be relied on to
> > internally properly be doing the negative check,
> > then one could simply say that
> > the annoyingly extra calculation via ktime_before() call
> > should simply be removed completely.
> > Which would be a good step since it would centralize protocol behaviour
> > within the inner handling of the helper
> > rather than bloating user-side instruction cache footprint.
> 
> Ah, so you're saying just do a "while (true)" after changing the
> behavior of schedule_hrtimeout_range_clock().  If everyone agrees that
> they'd like to see this, I can do it.  I'd have to change
> schedule_hrtimeout_range_clock() to check for <= 0 instead of just
> comparing against 0.  ...and that would be an API change for
> schedule_hrtimeout_range_clock(), and it seems like that would be yet
> another source of bugs.  ...and this is to save an instruction?  It
> doesn't seem worth it.

Yup, I just realized that probably what we'd ideally want is
a very thin decorating wrapper (loop handler)
which cares about nothing else other than
eradicating the relevant "feature" of the inner handler
(namely, premature exit),
and otherwise leaves a much as possible to
central inner handler decision-making implementation.
That to me sounds like
the theoretically most precise (since dead simple) handling.
And even if such exceedingly simple handling is dangerous -
in case of failure we would realize it very soon (infinite loop),
and would then know what will need fixing.


I've been reviewing schedule_hrtimeout_range_clock() as well,
and I'm actually puzzled that it checks for zero value only,
and not less-equal zero.
That to me does not seem like a true-to-the-core protocol
to handle the task at hand.
OTOH perhaps less-equal comparison was deemed to be
much more expensive than a zero check,
and thus possibly has been done in inner handlers only.



For the loop code itself,
I'm not sure whether to pursue it -
given the schedule_hrtimeout_range_clock() API change risks you outlined,
and especially given that
optimization is likely to mostly benefit the "repeat" case only,
which likely would be the less relevant non-hotpath case anyway.

Plus, let's not forget the fact that
even hotpath handling *is* an invocation of the entire delay handler,
and then a couple measly opcodes to have the loop exited reliably. So...


Now, at least we have a
Reviewed-by: Andreas Mohr <andim2@users.sf.net>


Oh well, lotsa discussion, some good thoughts,
but no truly revolutionary outcome so far.
However, sometimes the most important thing is
to have had a good educating discussion
(not everything in software circles is about the Get Rich Quick scheme -
you do have to spend some ample time - decades... -
to truly get mental concept things right).

Andreas Mohr
Mark Brown Oct. 12, 2016, 8:56 a.m. UTC | #10
On Tue, Oct 11, 2016 at 09:33:15AM -0700, Doug Anderson wrote:
> On Tue, Oct 11, 2016 at 12:14 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Mon, 10 Oct 2016, Douglas Anderson wrote:

> >> Users of usleep_range() expect that it will _never_ return in less time
> >> than the minimum passed parameter.  However, nothing in any of the code
> >> ensures this.  Specifically:

> > There is no such guarantee for that interface and never has been, so how
> > did you make sure that none of the existing users is relying on this?

> > You can't just can't just declare that all all of the users expect that and
> > be done with it.

> You're right that I can't guarantee that no callers are relying on the
> existing behavior of a wake_up_process() causing usleep_range() to
> return early.  I would say, however, that all callers I've seen are
> absolutely relying on the min delay being enforced and I've never
> personally seen a caller relying on being woken up from
> usleep_range().  All the users relying on the min delay being enforced

Indeed.  It's *highly* surprising for any sleep interface to undershoot
on delays, the usual thing is that they might delay for longer.  If the
function doesn't actually reliably delay for the minimum time then I'd
expect that a large proportion of those conversions and other recent
code that's been added is buggy.

> one of two functions: usleep_atlest() and usleep_wakeable().  As
> argued below I think that usleep_range() name implies that it will at
> least sleep the minimum so I would really like to avoid keeping the
> name usleep_range() and also keeping the existing behavior.

I tend to agree with everything Doug is saying in terms of API
expectations.
Thomas Gleixner Oct. 12, 2016, 1:11 p.m. UTC | #11
On Tue, 11 Oct 2016, Andreas Mohr wrote:
> On Tue, Oct 11, 2016 at 09:14:38AM +0200, Thomas Gleixner wrote:
> > On Mon, 10 Oct 2016, Douglas Anderson wrote:
> > > Users of usleep_range() expect that it will _never_ return in less time
> > > than the minimum passed parameter.  However, nothing in any of the code
> > > ensures this.  Specifically:
> > 
> > There is no such guarantee for that interface and never has been, so how
> > did you make sure that none of the existing users is relying on this?
> > 
> > You can't just can't just declare that all all of the users expect that and
> > be done with it.
> 
> Hmm, somehow I don't manage to follow these thoughts.
> 
> https://www.kernel.org/doc/htmldocs/device-drivers/API-usleep-range.html
> (as a hopefully sufficiently authoritative source of documentation)
> clearly specifies min to be
> "Minimum time in usecs to sleep"
> , which is what one would expect a two-param interface here to be
> (minimum-maximum),
> i.e. what would be the *natural* protocol I'd think.
> 
> Also, [finally...] starting to enforce the minimum time
> is an additional *constraint* on the protocol,
> i.e. it's not at all like we are getting more *liberal* here
> (since usually getting more liberal in certain protocols
> is what will cause trouble, I'd think).
> 
> Not to mention that
> desiring a delay in processing most certainly is
> what caused users of this API to decide to invoke it in the first place
> (else they would just have chosen to carry on with delay-less processing
> and be done with it).
> And those users then surely wouldn't want to experience a behaviour
> where the delay may be ended at any time,
> however short that may end up being.

I'm well aware what Doug wants to do and I'm not saying that this is wrong,
but I'm not going to look at all usleep() usage sites to make sure none is
relying on such a behaviour and gets surprised by the change,

The point is that we had cases over and over where stuff was depending on
implementation bugs which made the buggy behaviour into an expected
behaviour. I'm not saying that this is the case here, but it's not my duty
to make sure it isn't.

So the very minimum I need in the changelog is some mentioning that the
author at least tried to verify that this is not going to break the world
and some more. That's what I meant by:

You can't just can't just declare that all all of the users expect that and
be done with it.

Thanks,

	tglx
Guenter Roeck Oct. 12, 2016, 4:03 p.m. UTC | #12
On Mon, Oct 10, 2016 at 02:04:02PM -0700, Douglas Anderson wrote:
> Users of usleep_range() expect that it will _never_ return in less time
> than the minimum passed parameter.  However, nothing in any of the code
> ensures this.  Specifically:
> 
> usleep_range() => do_usleep_range() => schedule_hrtimeout_range() =>
> schedule_hrtimeout_range_clock() just ends up calling schedule() with an
> appropriate timeout set using the hrtimer.  If someone else happens to
> wake up our task then we'll happily return from usleep_range() early.
> 
> msleep() already has code to handle this case since it will loop as long
> as there was still time left.  usleep_range() had no such loop.
> 
> The problem is is easily demonstrated with a small bit of test code:
> 
>   static int usleep_test_task(void *data)
>   {
>     atomic_t *done = data;
>     ktime_t start, end;
> 
>     start = ktime_get();
>     usleep_range(50000, 100000);
>     end = ktime_get();
>     pr_info("Requested 50000 - 100000 us.  Actually slept for %llu us\n",
>       (unsigned long long)ktime_to_us(ktime_sub(end, start)));
>     atomic_set(done, 1);
> 
>     return 0;
>   }
> 
>   static void run_usleep_test(void)
>   {
>     struct task_struct *t;
>     atomic_t done;
> 
>     atomic_set(&done, 0);
> 
>     t = kthread_run(usleep_test_task, &done, "usleep_test_task");
>     while (!atomic_read(&done)) {
>       wake_up_process(t);
>       udelay(1000);
>     }
>     kthread_stop(t);
>   }
> 
> If you run the above code without this patch you get things like:
>   Requested 50000 - 100000 us.  Actually slept for 967 us
> 
> If you run the above code _with_ this patch, you get:
>   Requested 50000 - 100000 us.  Actually slept for 50001 us
> 
> Presumably this problem was not detected before because:
> - It's not terribly common to use wake_up_process() directly.
> - Other ways for processes to wake up are not typically mixed with
>   usleep_range().
> - There aren't lots of places that use usleep_range(), since many people
>   call either msleep() or udelay().
> 
> Reported-by: Tao Huang <huangtao@rock-chips.com>
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> Reviewed-by: Brian Norris <briannorris@chromium.org>
> Reviewed-by: Andreas Mohr <andim2@users.sf.net>

Reviewed-by: Guenter Roeck <linux@roeck-us.net>

The following drivers may expect the function to be interruptible.

drivers/iio/accel/kxcjk-1013.c: kxcjk1013_runtime_resume()
drivers/iio/accel/bmc150-accel-core.c:bmc150_accel_runtime_resume()
drivers/iio/accel/mma8452.c:mma8452_runtime_resume()
drivers/iio/accel/mma9551_core.c:mma9551_sleep()
kernel/trace/ring_buffer.c:rb_test()

A possible solution might be to introduce usleep_range_interruptible()
and use it there.

Note:
drivers/scsi/mvumi.c:mvumi_rescan_bus() uses msleep() but should possibly
use msleep_interruptible() instead.

Guenter
Doug Anderson Oct. 12, 2016, 4:27 p.m. UTC | #13
Hi,

On Wed, Oct 12, 2016 at 9:03 AM, Guenter Roeck <linux@roeck-us.net> wrote:
> drivers/iio/accel/kxcjk-1013.c: kxcjk1013_runtime_resume()
> drivers/iio/accel/bmc150-accel-core.c:bmc150_accel_runtime_resume()
> drivers/iio/accel/mma8452.c:mma8452_runtime_resume()
> drivers/iio/accel/mma9551_core.c:mma9551_sleep()

As far as I can tell these drivers will not suffer unduly from my
change.  Worse case they will delay 20us more, which is listed as the
max.

Also note that I assume the reason you flagged these is because they
follow the pattern:

        if (sleep_val < 20000)
                usleep_range(sleep_val, 20000);
        else
                msleep_interruptible(sleep_val/1000);

I will note that usleep_range() is and has always been
uninterruptible, since the implementation says:

void __sched usleep_range(unsigned long min, unsigned long max)
{
       __set_current_state(TASK_UNINTERRUPTIBLE);
       do_usleep_range(min, max);
}

So I'm not at all convinced that we are changing behavior here.  The
"interruptible" vs. "uninterruptible" affects whether signals can
interrupt the sleep, not whether a random wake up of a task can.  What
we really need to know is if they are affected by a wakeup.

> kernel/trace/ring_buffer.c:rb_test()

I assume that the person who wrote this code was confused since they wrote:

  set_current_state(TASK_INTERRUPTIBLE);
  /* Now sleep between a min of 100-300us and a max of 1ms */
  usleep_range(((data->cnt % 3) + 1) * 100, 1000);

That doesn't seem to make sense given the first line of usleep_range().

In any case, again I don't think I am changing behavior.

> A possible solution might be to introduce usleep_range_interruptible()
> and use it there.

This could be a useful function, but I don't think we need it if we
find someone who needs a wakeup to cut short a sleep.  We can just
call one of the schedule functions directly and use a timeout.


Thank you for searching through for stuff and for your review, though!

-Doug
Guenter Roeck Oct. 12, 2016, 4:53 p.m. UTC | #14
On Wed, Oct 12, 2016 at 09:27:35AM -0700, Doug Anderson wrote:
> Hi,
> 
> On Wed, Oct 12, 2016 at 9:03 AM, Guenter Roeck <linux@roeck-us.net> wrote:
> > drivers/iio/accel/kxcjk-1013.c: kxcjk1013_runtime_resume()
> > drivers/iio/accel/bmc150-accel-core.c:bmc150_accel_runtime_resume()
> > drivers/iio/accel/mma8452.c:mma8452_runtime_resume()
> > drivers/iio/accel/mma9551_core.c:mma9551_sleep()
> 
> As far as I can tell these drivers will not suffer unduly from my
> change.  Worse case they will delay 20us more, which is listed as the
> max.
> 
20 ms.

> Also note that I assume the reason you flagged these is because they
> follow the pattern:
> 
>         if (sleep_val < 20000)
>                 usleep_range(sleep_val, 20000);
>         else
>                 msleep_interruptible(sleep_val/1000);
> 
Correct

> I will note that usleep_range() is and has always been
> uninterruptible, since the implementation says:
> 
> void __sched usleep_range(unsigned long min, unsigned long max)
> {
>        __set_current_state(TASK_UNINTERRUPTIBLE);
>        do_usleep_range(min, max);
> }
> 
Good point.

> So I'm not at all convinced that we are changing behavior here.  The
> "interruptible" vs. "uninterruptible" affects whether signals can
> interrupt the sleep, not whether a random wake up of a task can.  What
> we really need to know is if they are affected by a wakeup.
> 
Yes, you are correct.

> > kernel/trace/ring_buffer.c:rb_test()
> 
> I assume that the person who wrote this code was confused since they wrote:
> 
>   set_current_state(TASK_INTERRUPTIBLE);
>   /* Now sleep between a min of 100-300us and a max of 1ms */
>   usleep_range(((data->cnt % 3) + 1) * 100, 1000);
> 
> That doesn't seem to make sense given the first line of usleep_range().
> 
... which, for those who don't pay attention (like me), is
	__set_current_state(TASK_UNINTERRUPTIBLE);

> In any case, again I don't think I am changing behavior.
> 
> > A possible solution might be to introduce usleep_range_interruptible()
> > and use it there.
> 
> This could be a useful function, but I don't think we need it if we
> find someone who needs a wakeup to cut short a sleep.  We can just
> call one of the schedule functions directly and use a timeout.
> 

Agreed.

> 
> Thank you for searching through for stuff and for your review, though!
> 
No problem. Thanks for correcting me.

Note that I also searched for use of usleep_range() in conjunction with a
a task wakeup, but did not find anything. I did find a large number of cases,
though, where the explicit assumption is made that the minimum sleep time
is well defined.

Guenter
Doug Anderson Oct. 12, 2016, 5:39 p.m. UTC | #15
Hi,

On Wed, Oct 12, 2016 at 6:11 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> I'm well aware what Doug wants to do and I'm not saying that this is wrong,
> but I'm not going to look at all usleep() usage sites to make sure none is
> relying on such a behaviour and gets surprised by the change,
>
> The point is that we had cases over and over where stuff was depending on
> implementation bugs which made the buggy behaviour into an expected
> behaviour. I'm not saying that this is the case here, but it's not my duty
> to make sure it isn't.

Every change breaks something.  <https://xkcd.com/1172/>.  I don't
think we promise bug for bug compatibility for Linux releases, do we?
Though I definitely agree that we shouldn't be breaking something on
purpose and it's good to be careful.


> So the very minimum I need in the changelog is some mentioning that the
> author at least tried to verify that this is not going to break the world
> and some more. That's what I meant by:
>
> You can't just can't just declare that all all of the users expect that and
> be done with it.

Do you have some suggestion about how to do that?  In general I'm
hesitant to rely on code analysis of 1620 call sites and that would
really be the only way to "be sure".  It would also take a really long
time.  I think Guenter was searching for files that contained
usleep_range() and some other text.  That's a pretty good idea but it
is definitely not exhaustive since there's no reason that the usleep
and a wakeup are guaranteed to be in the same file.  It's also
entirely possible that a wakeup could happen through some other source
than just a direct call to wake_up_process().  It's why I didn't try
it originally.

...but I can do it anyway.  I tried "git grep -C10000 usleep_range |
grep wake_up_process".  I actually do find some things that I don't
think Guenter found (or maybe he found, analyzed, and rejected):

> drivers/block/cciss.c-  wake_up_process(cciss_scan_thread);

We possibly wake up scan_thread().  Sleep is in
cciss_wait_for_mode_change_ack().  Callers of that are
cciss_enter_performant_mode() and cciss_enter_simple_mode().  Callers
of those are cciss_put_controller_into_performant_mode() and
cciss_pci_init() (and the performant mode boils down to
cciss_pci_init() anyway).  Caller is cciss_init_one().  That is the
probe function.  AKA: the thread that is being woken up isn't the one
doing the usleep_range().

No obvious bug here.

> drivers/memstick/host/rtsx_usb_ms.c-    wake_up_process(host->detect_ms);

Function being woken is rtsx_usb_detect_ms_card().  Sleeps are in
ms_power_on() and rtsx_usb_ms_set_param() (with no loops).
ms_power_on() is called by rtsx_usb_ms_set_param() anyway.  That
function is stored in msh->set_param.  That will be fairly hard to
track down, but it seems unlikely it is called by
rtsx_usb_detect_ms_card() and that we intentionally want the usleep to
end early.

No obvious bug here.


> drivers/scsi/mvumi.c-                   wake_up_process(mhba->dm_thread);
> drivers/scsi/mvumi.c-   wake_up_process(mhba->dm_thread);

Function being woken is mvumi_rescan_bus().  Intended to wake up the
task when it's in schedule().  Even after the schedule there is a
hardcoded msleep(1000).  To me the msleep(1000) makes it seem like
this loop can't possibly be too performance critical.

...but even if there are performance critical parts of that loop and
somehow the wake was intended not just to wakeup from the schedule but
also the usleep_range, worse case we will sleep 2 ms extra.

No obvious bug here.


> kernel/time/timer.c-    wake_up_process((struct task_struct *)__data);

Generic timer code.


> kernel/trace/ring_buffer.c-             wake_up_process(rb_threads[cpu]);

This is test code and already seems a bit confused.  ...but it does
appear that the caller might actually be intending the the
wake_up_process() to take effect.  Looking closer.  I see that the
wakeup is called once at the start of the test, not midway through the
test.  So I don't think there's a problem already than the already
identified problem of a useless
"set_current_state(TASK_INTERRUPTIBLE);"

No obvious bug here.

===

So the net result of all the above is that:

* With my git grep + code inspection, I see no obvious bug.  I will
admit that I didn't do a super deep search and finding bugs by code
inspection is iffy at best, but it might give us some sort of warm
fuzzy.

* With Guenter's search we saw no obvious bugs.

* Several subsystem maintainers and reviewers of lots of code have
chimed in and say that they're not aware of any code where
usleep_range() was expected to wake up early and they were also aware
of lots of code that would break if usleep_range() returned early.


IMHO: let's land it in linuxnext and give it some bake time.  If
nobody screams then it can go into linux proper, possibly to stable
trees.

-Doug
Daniel Kurtz Oct. 18, 2016, 1:44 p.m. UTC | #16
Hi Doug,

On Tue, Oct 11, 2016 at 5:04 AM, Douglas Anderson <dianders@chromium.org> wrote:
> Users of usleep_range() expect that it will _never_ return in less time
> than the minimum passed parameter.  However, nothing in any of the code
> ensures this.  Specifically:
>
> usleep_range() => do_usleep_range() => schedule_hrtimeout_range() =>
> schedule_hrtimeout_range_clock() just ends up calling schedule() with an
> appropriate timeout set using the hrtimer.  If someone else happens to
> wake up our task then we'll happily return from usleep_range() early.

I think this change works, and fixes a real issue, however, I don't
think you are fixing this at the right layer.
The comment for schedule_hrtimeout_range says:

/**
 * schedule_hrtimeout_range - sleep until timeout
 * @expires:    timeout value (ktime_t)
 * @delta:      slack in expires timeout (ktime_t)
 * @mode:       timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL
 *
 * Make the current task sleep until the given expiry time has
 * elapsed. The routine will return immediately unless
 * the current task state has been set (see set_current_state()).
 *
 * The @delta argument gives the kernel the freedom to schedule the
 * actual wakeup to a time that is both power and performance friendly.
 * The kernel give the normal best effort behavior for "@expires+@delta",
 * but may decide to fire the timer earlier, but no earlier than @expires.
 *
 * You can set the task state as follows -
 *
 * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
 * pass before the routine returns.
 *
 * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
 * delivered to the current task.
 *
 * The current task state is guaranteed to be TASK_RUNNING when this
 * routine returns.
 *
 * Returns 0 when the timer has expired otherwise -EINTR
 */

The behavior as specified for this function "at least @timeout time is
guaranteed to pass before the routine returns" already guarantees the
behavior you are adding to do_usleep_range() whenever the current task
state is (pre-)set to TASK_UNINTERRUPTIBLE.

Thus, I think the loop around 'schedule()' should be moved to
schedule_hrtimeout_range() itself.
This would also fix direct callers of schedule_hrtimeout_range() that
use TASK_UNINTERRUPTIBLE, although, I could only find one:

pt3_fetch_thread()

-Dan
Doug Anderson Oct. 18, 2016, 8:29 p.m. UTC | #17
Dan,

On Tue, Oct 18, 2016 at 6:44 AM, Daniel Kurtz <djkurtz@chromium.org> wrote:
> Hi Doug,
>
> On Tue, Oct 11, 2016 at 5:04 AM, Douglas Anderson <dianders@chromium.org> wrote:
>> Users of usleep_range() expect that it will _never_ return in less time
>> than the minimum passed parameter.  However, nothing in any of the code
>> ensures this.  Specifically:
>>
>> usleep_range() => do_usleep_range() => schedule_hrtimeout_range() =>
>> schedule_hrtimeout_range_clock() just ends up calling schedule() with an
>> appropriate timeout set using the hrtimer.  If someone else happens to
>> wake up our task then we'll happily return from usleep_range() early.
>
> I think this change works, and fixes a real issue, however, I don't
> think you are fixing this at the right layer.
> The comment for schedule_hrtimeout_range says:
>
> /**
>  * schedule_hrtimeout_range - sleep until timeout
>  * @expires:    timeout value (ktime_t)
>  * @delta:      slack in expires timeout (ktime_t)
>  * @mode:       timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL
>  *
>  * Make the current task sleep until the given expiry time has
>  * elapsed. The routine will return immediately unless
>  * the current task state has been set (see set_current_state()).
>  *
>  * The @delta argument gives the kernel the freedom to schedule the
>  * actual wakeup to a time that is both power and performance friendly.
>  * The kernel give the normal best effort behavior for "@expires+@delta",
>  * but may decide to fire the timer earlier, but no earlier than @expires.
>  *
>  * You can set the task state as follows -
>  *
>  * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
>  * pass before the routine returns.
>  *
>  * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
>  * delivered to the current task.
>  *
>  * The current task state is guaranteed to be TASK_RUNNING when this
>  * routine returns.
>  *
>  * Returns 0 when the timer has expired otherwise -EINTR
>  */
>
> The behavior as specified for this function "at least @timeout time is
> guaranteed to pass before the routine returns" already guarantees the
> behavior you are adding to do_usleep_range() whenever the current task
> state is (pre-)set to TASK_UNINTERRUPTIBLE.
>
> Thus, I think the loop around 'schedule()' should be moved to
> schedule_hrtimeout_range() itself.
> This would also fix direct callers of schedule_hrtimeout_range() that
> use TASK_UNINTERRUPTIBLE, although, I could only find one:
>
> pt3_fetch_thread()

Hmmm, I would agree with you that the behavior of
schedule_hrtimeout_range() doesn't seem to match the function
comments.

...but I'm not sure I agree with you about what to do here.
Specifically I think that whatever we do we need to try to keep
schedule_hrtimeout_range() and schedule_timeout() parallel.  For
schedule_timeout() we have the same comments but it's my understanding
that you'd expect that wake_up_process() would wake it up.  In any
case, if wake_up_process() doesn't wake it up then it seems like
msleep() and schedule_timeout_uninterruptible() are the same function
with two names, when in fact one is implemented in terms o the other.

NOTE that also it seems as if we need some other return values besides
0 and -EINTR from schedule_hrtimeout_range() (again, to match
schedule_timeout()) since right now we'll return -EINTR if we were
woken up with wake_up_process().  This would be unexpected in the case
where we had TASK_UNINTERRUPTIBLE set.

-Doug
Daniel Kurtz Oct. 20, 2016, 8:57 a.m. UTC | #18
On Wed, Oct 19, 2016 at 4:29 AM, Doug Anderson <dianders@chromium.org> wrote:
>
> Dan,
>
> On Tue, Oct 18, 2016 at 6:44 AM, Daniel Kurtz <djkurtz@chromium.org> wrote:
> > Hi Doug,
> >
> > On Tue, Oct 11, 2016 at 5:04 AM, Douglas Anderson <dianders@chromium.org> wrote:
> >> Users of usleep_range() expect that it will _never_ return in less time
> >> than the minimum passed parameter.  However, nothing in any of the code
> >> ensures this.  Specifically:
> >>
> >> usleep_range() => do_usleep_range() => schedule_hrtimeout_range() =>
> >> schedule_hrtimeout_range_clock() just ends up calling schedule() with an
> >> appropriate timeout set using the hrtimer.  If someone else happens to
> >> wake up our task then we'll happily return from usleep_range() early.
> >
> > I think this change works, and fixes a real issue, however, I don't
> > think you are fixing this at the right layer.
> > The comment for schedule_hrtimeout_range says:
> >
> > /**
> >  * schedule_hrtimeout_range - sleep until timeout
> >  * @expires:    timeout value (ktime_t)
> >  * @delta:      slack in expires timeout (ktime_t)
> >  * @mode:       timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL
> >  *
> >  * Make the current task sleep until the given expiry time has
> >  * elapsed. The routine will return immediately unless
> >  * the current task state has been set (see set_current_state()).
> >  *
> >  * The @delta argument gives the kernel the freedom to schedule the
> >  * actual wakeup to a time that is both power and performance friendly.
> >  * The kernel give the normal best effort behavior for "@expires+@delta",
> >  * but may decide to fire the timer earlier, but no earlier than @expires.
> >  *
> >  * You can set the task state as follows -
> >  *
> >  * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to
> >  * pass before the routine returns.
> >  *
> >  * %TASK_INTERRUPTIBLE - the routine may return early if a signal is
> >  * delivered to the current task.
> >  *
> >  * The current task state is guaranteed to be TASK_RUNNING when this
> >  * routine returns.
> >  *
> >  * Returns 0 when the timer has expired otherwise -EINTR
> >  */
> >
> > The behavior as specified for this function "at least @timeout time is
> > guaranteed to pass before the routine returns" already guarantees the
> > behavior you are adding to do_usleep_range() whenever the current task
> > state is (pre-)set to TASK_UNINTERRUPTIBLE.
> >
> > Thus, I think the loop around 'schedule()' should be moved to
> > schedule_hrtimeout_range() itself.
> > This would also fix direct callers of schedule_hrtimeout_range() that
> > use TASK_UNINTERRUPTIBLE, although, I could only find one:
> >
> > pt3_fetch_thread()
>
> Hmmm, I would agree with you that the behavior of
> schedule_hrtimeout_range() doesn't seem to match the function
> comments.
>
> ...but I'm not sure I agree with you about what to do here.
> Specifically I think that whatever we do we need to try to keep
> schedule_hrtimeout_range() and schedule_timeout() parallel.  For
> schedule_timeout() we have the same comments but it's my understanding
> that you'd expect that wake_up_process() would wake it up.  In any
> case, if wake_up_process() doesn't wake it up then it seems like
> msleep() and schedule_timeout_uninterruptible() are the same function
> with two names, when in fact one is implemented in terms o the other.

Sounds reasonable.
It would be nice to add a note to all of those function comments
though to make them sound less absolute -
"at least @timeout time is guaranteed to pass before the routine
returns unless the current task is explicitly woken up, (e.g. by
wake_up_process())"

> NOTE that also it seems as if we need some other return values besides
> 0 and -EINTR from schedule_hrtimeout_range() (again, to match
> schedule_timeout()) since right now we'll return -EINTR if we were
> woken up with wake_up_process().  This would be unexpected in the case
> where we had TASK_UNINTERRUPTIBLE set.
>
> -Doug
Thomas Gleixner Oct. 20, 2016, 9:51 a.m. UTC | #19
On Thu, 20 Oct 2016, Daniel Kurtz wrote:
> On Wed, Oct 19, 2016 at 4:29 AM, Doug Anderson <dianders@chromium.org> wrote:
> > ...but I'm not sure I agree with you about what to do here.
> > Specifically I think that whatever we do we need to try to keep
> > schedule_hrtimeout_range() and schedule_timeout() parallel.  For
> > schedule_timeout() we have the same comments but it's my understanding
> > that you'd expect that wake_up_process() would wake it up.  In any
> > case, if wake_up_process() doesn't wake it up then it seems like
> > msleep() and schedule_timeout_uninterruptible() are the same function
> > with two names, when in fact one is implemented in terms o the other.
> 
> Sounds reasonable.
> It would be nice to add a note to all of those function comments
> though to make them sound less absolute -
> "at least @timeout time is guaranteed to pass before the routine
> returns unless the current task is explicitly woken up, (e.g. by
> wake_up_process())"

Agreed.

Patch
diff mbox

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 32bf6f75a8fe..219439efd56a 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1898,12 +1898,28 @@  EXPORT_SYMBOL(msleep_interruptible);
 
 static void __sched do_usleep_range(unsigned long min, unsigned long max)
 {
+	ktime_t now, end;
 	ktime_t kmin;
 	u64 delta;
+	int ret;
 
-	kmin = ktime_set(0, min * NSEC_PER_USEC);
+	now = ktime_get();
+	end = ktime_add_us(now, min);
 	delta = (u64)(max - min) * NSEC_PER_USEC;
-	schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
+	do {
+		kmin = ktime_sub(end, now);
+		ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
+
+		/*
+		 * If schedule_hrtimeout_range() returns 0 then we actually
+		 * hit the timeout. If not then we need to re-calculate the
+		 * new timeout ourselves.
+		 */
+		if (ret == 0)
+			break;
+
+		now = ktime_get();
+	} while (ktime_before(now, end));
 }
 
 /**