timers: Fix usleep_range() in the context of wake_up_process()
diff mbox

Message ID 1476125277-6061-1-git-send-email-dianders@chromium.org
State New
Headers show

Commit Message

Douglas Anderson Oct. 10, 2016, 6:47 p.m. UTC
Users of usleep_range() expect that it will _never_ return in less time
than the minimum passed parameter.  However, nothing in any of the code
ensures this.  Specifically:

usleep_range() => do_usleep_range() => schedule_hrtimeout_range() =>
schedule_hrtimeout_range_clock() just ends up calling schedule() with an
appropriate timeout set using the hrtimer.  If someone else happens to
wake up our task then we'll happily return from usleep_range() early.

msleep() already has code to handle this case since it will loop as long
as there was still time left.  usleep_range() had no such loop.

The problem is is easily demonstrated with a small bit of test code:

  static int usleep_test_task(void *data)
  {
    atomic_t *done = data;
    ktime_t start, end;

    start = ktime_get();
    usleep_range(50000, 100000);
    end = ktime_get();
    pr_info("Requested 50000 - 100000 us.  Actually slept for %llu us\n",
      (unsigned long long)ktime_to_us(ktime_sub(end, start)));
    atomic_set(done, 1);

    return 0;
  }

  static void run_usleep_test(void)
  {
    struct task_struct *t;
    atomic_t done;

    atomic_set(&done, 0);

    t = kthread_run(usleep_test_task, &done, "usleep_test_task");
    while (!atomic_read(&done)) {
      wake_up_process(t);
      udelay(1000);
    }
    kthread_stop(t);
  }

If you run the above code without this patch you get things like:
  Requested 50000 - 100000 us.  Actually slept for 967 us

If you run the above code _with_ this patch, you get:
  Requested 50000 - 100000 us.  Actually slept for 50001 us

Presumably this problem was not detected before because:
- It's not terribly common to use wake_up_process() directly.
- Other ways for processes to wake up are not typically mixed with
  usleep_range().
- There aren't lotsof places that use usleep_range(), since many people
  call either msleep() or udelay().

Reported-by: Tao Huang <huangtao@rock-chips.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
 kernel/time/timer.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

Comments

Andreas Mohr Oct. 10, 2016, 7:53 p.m. UTC | #1
Hi,

On Mon, Oct 10, 2016 at 11:47:57AM -0700, Douglas Anderson wrote:
> Users of usleep_range() expect that it will _never_ return in less time
> than the minimum passed parameter.  However, nothing in any of the code
> ensures this.  Specifically:

Ouch (aka: good catch! - by the reporter, though... ;)


> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 32bf6f75a8fe..ab03c7e403a4 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1898,12 +1898,29 @@ EXPORT_SYMBOL(msleep_interruptible);
>  
>  static void __sched do_usleep_range(unsigned long min, unsigned long max)
>  {
> +	ktime_t start, end;
>  	ktime_t kmin;
>  	u64 delta;
> +	unsigned long elapsed = 0;
> +	int ret;
> +
> +	start = ktime_get();
> +	do {
> +		kmin = ktime_set(0, min * NSEC_PER_USEC);
> +		delta = (u64)(max - min) * NSEC_PER_USEC;
> +		ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
> +
> +		/*
> +		 * If schedule_hrtimeout_range() returns 0 then we actually
> +		 * hit the timeout. If not then we need to re-calculate the
> +		 * new timeout ourselves.
> +		 */
> +		if (ret == 0)
> +			break;
>  
> -	kmin = ktime_set(0, min * NSEC_PER_USEC);
> -	delta = (u64)(max - min) * NSEC_PER_USEC;
> -	schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
> +		end = ktime_get();
> +		elapsed = ktime_to_us(ktime_sub(end, start));
> +	} while (elapsed < min);

Some thoughts:
- max >= min pre-condition is validated somewhere, I'd hope?
  (somewhere in outer API frame?)
> +		delta = (u64)(max - min) * NSEC_PER_USEC;

- there is a domain transition in there, **within** the repeated loop:
> +		elapsed = ktime_to_us(ktime_sub(end, start));
  -->
> +	} while (elapsed < min);
  should likely be made to be able to
  directly compare a *non-converted* "elapsed" value
  (IIRC there's an API such as ktime_before()).
  One could argue that the "repeat" case is the non-likely case
  (thus the conversion should possibly not be done before
  actually being required),
  however I guess it's exactly hitting the non-likely cases
  which will contribute towards
  non-predictable, non-deterministic latency behaviour of
  a code path
  (think RT requirements)

Thanks,

Andreas Mohr
Brian Norris Oct. 10, 2016, 8:04 p.m. UTC | #2
Hi Doug,

On Mon, Oct 10, 2016 at 11:47:57AM -0700, Doug Anderson wrote:
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 32bf6f75a8fe..ab03c7e403a4 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -1898,12 +1898,29 @@ EXPORT_SYMBOL(msleep_interruptible);
>  
>  static void __sched do_usleep_range(unsigned long min, unsigned long max)
>  {
> +	ktime_t start, end;
>  	ktime_t kmin;
>  	u64 delta;
> +	unsigned long elapsed = 0;
> +	int ret;
> +
> +	start = ktime_get();
> +	do {
> +		kmin = ktime_set(0, min * NSEC_PER_USEC);

I believe 'min' is unmodified throughout, and therefore 'kmin' is
computed to be the same minimum timeout in each loop. Shouldn't this be
decreasing on each iteration of the loop? (i.e., either your compute
'kmin' differently here, or you recompute 'min' based on the elapsed
time?)

> +		delta = (u64)(max - min) * NSEC_PER_USEC;
> +		ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
> +
> +		/*
> +		 * If schedule_hrtimeout_range() returns 0 then we actually
> +		 * hit the timeout. If not then we need to re-calculate the
> +		 * new timeout ourselves.
> +		 */
> +		if (ret == 0)
> +			break;
>  
> -	kmin = ktime_set(0, min * NSEC_PER_USEC);
> -	delta = (u64)(max - min) * NSEC_PER_USEC;
> -	schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
> +		end = ktime_get();
> +		elapsed = ktime_to_us(ktime_sub(end, start));
> +	} while (elapsed < min);

I think Andreas had similar comments, but it seemed to me like
ktime_before() might be nicer. But (as Andreas did) you might get into
(premature?) micro-optimizations down that path, and I'm certainly not
an expert there.

>  }
>  
>  /**

Brian
Douglas Anderson Oct. 10, 2016, 8:12 p.m. UTC | #3
Hi,

On Mon, Oct 10, 2016 at 1:04 PM, Brian Norris <briannorris@chromium.org> wrote:
> Hi Doug,
>
> On Mon, Oct 10, 2016 at 11:47:57AM -0700, Doug Anderson wrote:
>> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
>> index 32bf6f75a8fe..ab03c7e403a4 100644
>> --- a/kernel/time/timer.c
>> +++ b/kernel/time/timer.c
>> @@ -1898,12 +1898,29 @@ EXPORT_SYMBOL(msleep_interruptible);
>>
>>  static void __sched do_usleep_range(unsigned long min, unsigned long max)
>>  {
>> +     ktime_t start, end;
>>       ktime_t kmin;
>>       u64 delta;
>> +     unsigned long elapsed = 0;
>> +     int ret;
>> +
>> +     start = ktime_get();
>> +     do {
>> +             kmin = ktime_set(0, min * NSEC_PER_USEC);
>
> I believe 'min' is unmodified throughout, and therefore 'kmin' is
> computed to be the same minimum timeout in each loop. Shouldn't this be
> decreasing on each iteration of the loop? (i.e., either your compute
> 'kmin' differently here, or you recompute 'min' based on the elapsed
> time?)

Yes, I stupidly changed something at the last second and then didn't
test again after my stupid change.  Fix coming soon with all comments
addressed.  Sorry for posting broken code.  :( :( :(

-Doug
Andreas Mohr Oct. 10, 2016, 8:42 p.m. UTC | #4
On Mon, Oct 10, 2016 at 01:12:39PM -0700, Doug Anderson wrote:
> Hi,
> 
> On Mon, Oct 10, 2016 at 1:04 PM, Brian Norris <briannorris@chromium.org> wrote:
> > I believe 'min' is unmodified throughout, and therefore 'kmin' is
> > computed to be the same minimum timeout in each loop. Shouldn't this be
> > decreasing on each iteration of the loop? (i.e., either your compute
> > 'kmin' differently here, or you recompute 'min' based on the elapsed
> > time?)
> 
> Yes, I stupidly changed something at the last second and then didn't
> test again after my stupid change.  Fix coming soon with all comments
> addressed.  Sorry for posting broken code.  :( :( :(

With a loop style that is actively re-calculating things,
such implementations should then not fall into the trap of
basing the "next" value on "current" time,
thereby bogusly accumulating scheduling-based delays
with each new loop iteration etc.
(i.e., things should still be based on hard, precise termination according to
an *initially* calculated, *absolute*, *minimum* expiry time).

Andreas Mohr
Andreas Mohr Oct. 10, 2016, 8:47 p.m. UTC | #5
Hi,

On Mon, Oct 10, 2016 at 01:04:01PM -0700, Brian Norris wrote:
> Hi Doug,
> 
> On Mon, Oct 10, 2016 at 11:47:57AM -0700, Doug Anderson wrote:
> > +		end = ktime_get();
> > +		elapsed = ktime_to_us(ktime_sub(end, start));
> > +	} while (elapsed < min);
> 
> I think Andreas had similar comments, but it seemed to me like
> ktime_before() might be nicer. But (as Andreas did) you might get into
> (premature?) micro-optimizations down that path, and I'm certainly not
> an expert there.

I'd justify it this way:
orientation/type of input data is rather irrelevant -
all that matters is how you want to treat things in your *inner handling* -
and if that happens to be ktime-based
(as is usually - if not pretty much always - the case within kernel code),
then it ought to be that way,
*consistently* (to avoid conversion transition overhead).

Andreas Mohr
Douglas Anderson Oct. 10, 2016, 9:16 p.m. UTC | #6
Hi,

On Mon, Oct 10, 2016 at 12:53 PM, Andreas Mohr <andi@lisas.de> wrote:
> Some thoughts:
> - max >= min pre-condition is validated somewhere, I'd hope?
>   (somewhere in outer API frame?)

I don't think this is validated, but it's not a new problem.  My patch
doesn't attempt to solve this.  A future patch could, though.  You can
speculate on whether a WARN_ON would be better or some type of
friendly warning.

-Doug

Patch
diff mbox

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 32bf6f75a8fe..ab03c7e403a4 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1898,12 +1898,29 @@  EXPORT_SYMBOL(msleep_interruptible);
 
 static void __sched do_usleep_range(unsigned long min, unsigned long max)
 {
+	ktime_t start, end;
 	ktime_t kmin;
 	u64 delta;
+	unsigned long elapsed = 0;
+	int ret;
+
+	start = ktime_get();
+	do {
+		kmin = ktime_set(0, min * NSEC_PER_USEC);
+		delta = (u64)(max - min) * NSEC_PER_USEC;
+		ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
+
+		/*
+		 * If schedule_hrtimeout_range() returns 0 then we actually
+		 * hit the timeout. If not then we need to re-calculate the
+		 * new timeout ourselves.
+		 */
+		if (ret == 0)
+			break;
 
-	kmin = ktime_set(0, min * NSEC_PER_USEC);
-	delta = (u64)(max - min) * NSEC_PER_USEC;
-	schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL);
+		end = ktime_get();
+		elapsed = ktime_to_us(ktime_sub(end, start));
+	} while (elapsed < min);
 }
 
 /**