Message ID | 1476133442-17757-1-git-send-email-dianders@chromium.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi, On Mon, Oct 10, 2016 at 2:04 PM, Douglas Anderson <dianders@chromium.org> wrote: > Users of usleep_range() expect that it will _never_ return in less time > than the minimum passed parameter. However, nothing in any of the code > ensures this. Specifically: > > usleep_range() => do_usleep_range() => schedule_hrtimeout_range() => > schedule_hrtimeout_range_clock() just ends up calling schedule() with an > appropriate timeout set using the hrtimer. If someone else happens to > wake up our task then we'll happily return from usleep_range() early. > > msleep() already has code to handle this case since it will loop as long > as there was still time left. usleep_range() had no such loop. > > The problem is is easily demonstrated with a small bit of test code: > > static int usleep_test_task(void *data) > { > atomic_t *done = data; > ktime_t start, end; > > start = ktime_get(); > usleep_range(50000, 100000); > end = ktime_get(); > pr_info("Requested 50000 - 100000 us. Actually slept for %llu us\n", > (unsigned long long)ktime_to_us(ktime_sub(end, start))); > atomic_set(done, 1); > > return 0; > } > > static void run_usleep_test(void) > { > struct task_struct *t; > atomic_t done; > > atomic_set(&done, 0); > > t = kthread_run(usleep_test_task, &done, "usleep_test_task"); > while (!atomic_read(&done)) { > wake_up_process(t); > udelay(1000); > } > kthread_stop(t); > } > > If you run the above code without this patch you get things like: > Requested 50000 - 100000 us. Actually slept for 967 us > > If you run the above code _with_ this patch, you get: > Requested 50000 - 100000 us. Actually slept for 50001 us > > Presumably this problem was not detected before because: > - It's not terribly common to use wake_up_process() directly. > - Other ways for processes to wake up are not typically mixed with > usleep_range(). > - There aren't lots of places that use usleep_range(), since many people > call either msleep() or udelay(). > > Reported-by: Tao Huang <huangtao@rock-chips.com> > Signed-off-by: Douglas Anderson <dianders@chromium.org> > --- > Changes in v2: > - Fixed stupid bug that snuck in before posting > - Use ktime_before > - Remove delta from the loop > > NOTE: Tested against 4.4 tree w/ backports. I'm trying to get myself > up and running with mainline again to test there now but it might be a > little while. Hopefully this time I don't shoot myself in the foot. OK, I finally resolved all my issues and have officially tested this on mainline (not just in my private kernel) linux 4.8 on an rk3288-veyron-jerry based device. Problem still exists there and fix still works. Sorry for the noise earlier. Hopefully this version of the patch addresses all of Andreas's review feedback (and thank you very much for the review!) -Doug
On Mon, 10 Oct 2016, Douglas Anderson wrote: > Users of usleep_range() expect that it will _never_ return in less time > than the minimum passed parameter. However, nothing in any of the code > ensures this. Specifically: There is no such guarantee for that interface and never has been, so how did you make sure that none of the existing users is relying on this? You can't just can't just declare that all all of the users expect that and be done with it. Thanks, tglx
Thomas, +clock and regulator folks since part of my arguments below involve the regulator / clock core. If you're not interested in this topic, feel free to ignore. Original patch can be found on LKML or at <https://patchwork.kernel.org/patch/9369963/> On Tue, Oct 11, 2016 at 12:14 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > On Mon, 10 Oct 2016, Douglas Anderson wrote: >> Users of usleep_range() expect that it will _never_ return in less time >> than the minimum passed parameter. However, nothing in any of the code >> ensures this. Specifically: > > There is no such guarantee for that interface and never has been, so how > did you make sure that none of the existing users is relying on this? > > You can't just can't just declare that all all of the users expect that and > be done with it. You're right that I can't guarantee that no callers are relying on the existing behavior of a wake_up_process() causing usleep_range() to return early. I would say, however, that all callers I've seen are absolutely relying on the min delay being enforced and I've never personally seen a caller relying on being woken up from usleep_range(). All the users relying on the min delay being enforced have never had a problem because it's not common for a task that's in usleep_range() to be woken up, but if you happen to call some generic code in a context where a wake up is possible you'll get a very unexpected bug (like I just did). I can try to look through some callers to usleep_range(), but there are 1620 of them as per my "git grep", so I can't look at them all. :( ...in any case, below are my arguments about why we should change this. If we can't agree to change this, then IMHO the way forward is to deprecate usleep_range() and start the transition of all callers to one of two functions: usleep_atlest() and usleep_wakeable(). As argued below I think that usleep_range() name implies that it will at least sleep the minimum so I would really like to avoid keeping the name usleep_range() and also keeping the existing behavior. -- 1. I believe msleep() and usleep_range() should be parallel functions. msleep() has this guarantee of not ending early. I believe users will expect that usleep() will also. 2. The "timers-howto.txt" does not have any information about the fact that usleep_range() might end early. This document implies that (assuming you're not in atomic) udelay() / usleep_range() should be chosen between simply based on how long the expected delay is. There is no note that you shouldn't use usleep_range() if you can't handle ending early. 3. It seems to me that if someone wants to wait to be woken up but they want a timeout, they wouldn't reach for usleep_range(). They would instead think to use schedule_hrtimeout_range() directly. Said another way: we already have a function for scheduling a delay that can be interrupted with a wakeup, so why do we need usleep_range() to also behave this way? 4. We need to change a lot of places if we want to handle the fact that usleep_range() might wake up early. Every instance of usleep_range() that I've ever seen is not expecting to end early and is expecting at least the minimum delay for correctness. These functions have all worked correctly because they didn't happen to be called in a context where someone might wake up the calling task. ...but if suddenly one of these functions is called on a task that might be woken up then all their correctness will go out the door. In other words, I see a lot of usleep_range() calls that look like this: /* After 5 ms we know reset is done */ assert_reset() usleep_range(5000, 7000) If that returns early then we'll get badness. -- Here's a (perhaps more realistic) example of the problem. Imagine that we are trying to do some type of DVF (Dynamic Voltage Frequency) switch. That involves changing both a regulator and a clock. It's possible that we might want to try to do these two things in parallel on two tasks, so we could imagine doing: Task #1 (voltage) A) Call into regulator core to change voltage. A1) Regulator core will call into arbitrary regulator driver. A2) Regulator core won't return until regulator is at the right level. A3) Delay waiting for regulator might be done with usleep_range(). B) Signal Task #2 that we're done. C) Wait on Task #2 to finish. Task #2 (clock) A) Call into clock core to change clock. A1) Clock core will call into arbitrary clock driver. A2) Clock core won't return until clock is stable. A3) Delay waiting for clock might be done with usleep_range(). B) Signal Task #1 that we're done. C) Wait on Task #1 to finish. Depending on the delays and the mechanism used to signal and wait, it is possible that the "Signal" step above will end up waking up the other task. If it does this while the other task is in the usleep_range() code then we'll have serious problems: we'll either run with an clock or a regulator that isn't all the way ready. Now perhaps you will say that we should re-design the signaling mechanism above so that it doesn't cause a wake up of the other task. We certainly could try to do that if need be (we'd have to validate that there is really NEVER an unexpected wakeup), but I'd expect a lot of push back since I don't think folks would think that waking up a task would cause such incorrect behavior. Now perhaps you will say that we should re-design all clock and regulator drivers to not call usleep_range() (or to add a loop). This implies that we should add a new function so they don't all need to get this loop right. NOTE: In the particular failure case I'm running into, I'm on a system using the (out of tree) "interactive" governor. I haven't followed through the exact code path, but I see that it is using wake_up_process() in at least several places to wake up the "speedchange_task". This is the same task that might be calling into cpufreq to change the frequency and might be calling into the regulator core to do the delay. We were specifically seeing cases where the usleep_range() in PWM regulator was returning early. -- OK, I know that the above is pretty long. Hopefully it made sense. Feel free to grab me on IRC if it helps. I can be found on freenode as dianders and can join any channel where it's helpful to hash this out. -Doug
On Tue, Oct 11, 2016 at 09:14:38AM +0200, Thomas Gleixner wrote: > On Mon, 10 Oct 2016, Douglas Anderson wrote: > > Users of usleep_range() expect that it will _never_ return in less time > > than the minimum passed parameter. However, nothing in any of the code > > ensures this. Specifically: > > There is no such guarantee for that interface and never has been, so how > did you make sure that none of the existing users is relying on this? > > You can't just can't just declare that all all of the users expect that and > be done with it. Hmm, somehow I don't manage to follow these thoughts. https://www.kernel.org/doc/htmldocs/device-drivers/API-usleep-range.html (as a hopefully sufficiently authoritative source of documentation) clearly specifies min to be "Minimum time in usecs to sleep" , which is what one would expect a two-param interface here to be (minimum-maximum), i.e. what would be the *natural* protocol I'd think. Also, [finally...] starting to enforce the minimum time is an additional *constraint* on the protocol, i.e. it's not at all like we are getting more *liberal* here (since usually getting more liberal in certain protocols is what will cause trouble, I'd think). Not to mention that desiring a delay in processing most certainly is what caused users of this API to decide to invoke it in the first place (else they would just have chosen to carry on with delay-less processing and be done with it). And those users then surely wouldn't want to experience a behaviour where the delay may be ended at any time, however short that may end up being. A related topic probably is premature wakeups (e.g. signal-induced) of select() etc. protocol. Greetings, Andreas Mohr
Hi Doug, On Mon, Oct 10, 2016 at 02:04:02PM -0700, Doug Anderson wrote: > Users of usleep_range() expect that it will _never_ return in less time > than the minimum passed parameter. However, nothing in any of the code > ensures this. Like you and Andreas, I also don't understand Thomas's objection to your above claim on what users of this function expect. I believe you have clearly laid out why the current behavior needs to be changed somehow; IMO the only remaining question is "how." Your follow up covers all this plenty well for me. Either we need a fix along the lines of what you've proposed, or else we need to completely rethink almost all uses of usleep_range(). ... > Reported-by: Tao Huang <huangtao@rock-chips.com> > Signed-off-by: Douglas Anderson <dianders@chromium.org> > --- > Changes in v2: > - Fixed stupid bug that snuck in before posting > - Use ktime_before > - Remove delta from the loop > > NOTE: Tested against 4.4 tree w/ backports. I'm trying to get myself > up and running with mainline again to test there now but it might be a > little while. Hopefully this time I don't shoot myself in the foot. > > kernel/time/timer.c | 20 ++++++++++++++++++-- > 1 file changed, 18 insertions(+), 2 deletions(-) I've reviewed the logic here to the best of my ability, and it looks good to me now. I'll admit that I'm not really a timekeeping or scheduler expert, but FWIW: Reviewed-by: Brian Norris <briannorris@chromium.org> > diff --git a/kernel/time/timer.c b/kernel/time/timer.c > index 32bf6f75a8fe..219439efd56a 100644 > --- a/kernel/time/timer.c > +++ b/kernel/time/timer.c > @@ -1898,12 +1898,28 @@ EXPORT_SYMBOL(msleep_interruptible); > > static void __sched do_usleep_range(unsigned long min, unsigned long max) > { > + ktime_t now, end; > ktime_t kmin; > u64 delta; > + int ret; > > - kmin = ktime_set(0, min * NSEC_PER_USEC); > + now = ktime_get(); > + end = ktime_add_us(now, min); > delta = (u64)(max - min) * NSEC_PER_USEC; > - schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL); > + do { > + kmin = ktime_sub(end, now); > + ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL); > + > + /* > + * If schedule_hrtimeout_range() returns 0 then we actually > + * hit the timeout. If not then we need to re-calculate the > + * new timeout ourselves. > + */ > + if (ret == 0) > + break; > + > + now = ktime_get(); > + } while (ktime_before(now, end)); > } > > /** > -- > 2.8.0.rc3.226.g39d4020 >
Hi, decided to write a review now, slightly delayed, sorry. On Mon, Oct 10, 2016 at 02:04:02PM -0700, Douglas Anderson wrote: > diff --git a/kernel/time/timer.c b/kernel/time/timer.c > index 32bf6f75a8fe..219439efd56a 100644 > --- a/kernel/time/timer.c > +++ b/kernel/time/timer.c > @@ -1898,12 +1898,28 @@ EXPORT_SYMBOL(msleep_interruptible); > > static void __sched do_usleep_range(unsigned long min, unsigned long max) > { > + ktime_t now, end; > ktime_t kmin; > u64 delta; > + int ret; > > - kmin = ktime_set(0, min * NSEC_PER_USEC); > + now = ktime_get(); > + end = ktime_add_us(now, min); > delta = (u64)(max - min) * NSEC_PER_USEC; > - schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL); > + do { > + kmin = ktime_sub(end, now); > + ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL); > + > + /* > + * If schedule_hrtimeout_range() returns 0 then we actually > + * hit the timeout. If not then we need to re-calculate the > + * new timeout ourselves. > + */ > + if (ret == 0) > + break; > + > + now = ktime_get(); > + } while (ktime_before(now, end)); This loop implementation relies on negative kmin (result of ktime_sub()) getting internally handled by schedule_hrtimeout_range() as a 0 result. If that ain't the case, then the loop itself will need to handle negative values. OK, scratch that, due to guaranteed-unchanged values of now and end, result of directly subsequent ktime_sub() is guaranteed to not deviate from what ktime_before() figured. However, this is somewhat differing handling of these two APIs. Which brings me to my second point: somehow doing both ktime_before() and ktime_sub() feels redundant, since they are both about arguments now and end, i.e. they are *both* evaluating a "distance". (this could simply calculate the distance, and then 1. use that calculated distance value, and 2. check that calculated distance value against being negative). So, most likely it ought to be achievable to have just *one* of these two active (which would likely be ktime_sub(), simply since we need its result as schedule_hrtimeout_range() input...). And that way we would save big (hohumm) on instruction cache footprint :) Hmm, but ktime API does not have sufficiently open-coded handling of the ktime_sub()-based distance value - the possibly only way to do an "is it negative?" check would be by doing ktime_compare(dist, ktime_null_value), which might be pointless since it's a comparably large effort vs. the current ktime_before(), ktime_sub() case. BTW, another argument for loop rework might be that the result of ktime_sub() might already be improper (due to being negative!) for subsequent invocation of schedule_hrtimeout_range(), i.e. there's an argument to be made that the loop tail check instead ought to be done as a negative-value check directly prior to schedule_hrtimeout_range() invocation. Hmm, if schedule_hrtimeout_range() can be relied on to internally properly be doing the negative check, then one could simply say that the annoyingly extra calculation via ktime_before() call should simply be removed completely. Which would be a good step since it would centralize protocol behaviour within the inner handling of the helper rather than bloating user-side instruction cache footprint. </micro_optimization> ? Thanks, Andreas Mohr
Andreas, On Tue, Oct 11, 2016 at 12:30 PM, Andreas Mohr <andi@lisas.de> wrote: > Hi, > > decided to write a review now, slightly delayed, sorry. > > On Mon, Oct 10, 2016 at 02:04:02PM -0700, Douglas Anderson wrote: >> diff --git a/kernel/time/timer.c b/kernel/time/timer.c >> index 32bf6f75a8fe..219439efd56a 100644 >> --- a/kernel/time/timer.c >> +++ b/kernel/time/timer.c >> @@ -1898,12 +1898,28 @@ EXPORT_SYMBOL(msleep_interruptible); >> >> static void __sched do_usleep_range(unsigned long min, unsigned long max) >> { >> + ktime_t now, end; >> ktime_t kmin; >> u64 delta; >> + int ret; >> >> - kmin = ktime_set(0, min * NSEC_PER_USEC); >> + now = ktime_get(); >> + end = ktime_add_us(now, min); >> delta = (u64)(max - min) * NSEC_PER_USEC; >> - schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL); >> + do { >> + kmin = ktime_sub(end, now); >> + ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL); >> + >> + /* >> + * If schedule_hrtimeout_range() returns 0 then we actually >> + * hit the timeout. If not then we need to re-calculate the >> + * new timeout ourselves. >> + */ >> + if (ret == 0) >> + break; >> + >> + now = ktime_get(); >> + } while (ktime_before(now, end)); > > This loop implementation relies on negative kmin (result of ktime_sub()) > getting internally handled by schedule_hrtimeout_range() as a 0 result. > If that ain't the case, then the loop itself will need to handle > negative values. > OK, scratch that, due to guaranteed-unchanged values of now and end, > result of directly subsequent ktime_sub() is guaranteed to > not deviate from what ktime_before() figured. > However, this is somewhat differing handling of these two APIs. I read this as: no bug, feel free to ignore. > Which brings me to my second point: > somehow doing both ktime_before() and ktime_sub() feels redundant, > since they are both about arguments now and end, > i.e. they are *both* evaluating a "distance". > (this could simply calculate the distance, and then > 1. use that calculated distance value, and > 2. check that calculated distance value against being negative). > So, most likely it ought to be achievable to have just *one* of these > two active (which would likely be ktime_sub(), > simply since we need its result as schedule_hrtimeout_range() input...). > And that way we would save big (hohumm) on instruction cache footprint :) > > Hmm, but ktime API does not have sufficiently open-coded handling of the > ktime_sub()-based distance value - the possibly only way to do an > "is it negative?" check would be by doing > ktime_compare(dist, ktime_null_value), > which might be pointless > since it's a comparably large effort > vs. the current ktime_before(), ktime_sub() case. I agree that we could try to save some math by making the loop more complicated. On the other hand, one could argue that a sufficiently good compiler might actually be able to figure some of this stuff out, since much of this stuff is just inlined 64-bit math comparisons. That being said, if you want to propose some concrete changes that save an instruction or two on most architectures and don't make the code too complicated, I'm happy to spin my patch. > BTW, another argument for loop rework might be that > the result of ktime_sub() might already be improper > (due to being negative!) for > subsequent invocation of schedule_hrtimeout_range(), > i.e. there's an argument to be made that > the loop tail check instead ought to be done as a negative-value check > directly prior to schedule_hrtimeout_range() invocation. > > Hmm, if schedule_hrtimeout_range() can be relied on to > internally properly be doing the negative check, > then one could simply say that > the annoyingly extra calculation via ktime_before() call > should simply be removed completely. > Which would be a good step since it would centralize protocol behaviour > within the inner handling of the helper > rather than bloating user-side instruction cache footprint. Ah, so you're saying just do a "while (true)" after changing the behavior of schedule_hrtimeout_range_clock(). If everyone agrees that they'd like to see this, I can do it. I'd have to change schedule_hrtimeout_range_clock() to check for <= 0 instead of just comparing against 0. ...and that would be an API change for schedule_hrtimeout_range_clock(), and it seems like that would be yet another source of bugs. ...and this is to save an instruction? It doesn't seem worth it. -Doug
Am Dienstag, 11. Oktober 2016, 09:14:38 CEST schrieb Thomas Gleixner: > On Mon, 10 Oct 2016, Douglas Anderson wrote: > > Users of usleep_range() expect that it will _never_ return in less time > > than the minimum passed parameter. However, nothing in any of the code > > > ensures this. Specifically: > There is no such guarantee for that interface and never has been, so how > did you make sure that none of the existing users is relying on this? > > You can't just can't just declare that all all of the users expect that and > be done with it. It may be my personal ignorance for not finding this explained, but the function documentation [0] states "min ... Minimum time in usecs to sleep" which sounds pretty guaranteed to me. One should of course make sure to not break anything intentionally, but having things expect to work outside these parameters sounds somewhat broken If the specified values are flexible beyond their stated range, I guess the documentation should be updated to reflect that. [0] https://www.kernel.org/doc/htmldocs/device-drivers/API-usleep-range.html
On Tue, Oct 11, 2016 at 01:02:10PM -0700, Doug Anderson wrote: > Andreas, > > On Tue, Oct 11, 2016 at 12:30 PM, Andreas Mohr <andi@lisas.de> wrote: > > This loop implementation relies on negative kmin (result of ktime_sub()) > > getting internally handled by schedule_hrtimeout_range() as a 0 result. > > If that ain't the case, then the loop itself will need to handle > > negative values. > > OK, scratch that, due to guaranteed-unchanged values of now and end, > > result of directly subsequent ktime_sub() is guaranteed to > > not deviate from what ktime_before() figured. > > However, this is somewhat differing handling of these two APIs. > > I read this as: no bug, feel free to ignore. Heh, yeah ("an alternate style of review" ;). > I agree that we could try to save some math by making the loop more > complicated. On the other hand, one could argue that a sufficiently > good compiler might actually be able to figure some of this stuff out, > since much of this stuff is just inlined 64-bit math comparisons. Indeed. "micro-optimization again". > > BTW, another argument for loop rework might be that > > the result of ktime_sub() might already be improper > > (due to being negative!) for > > subsequent invocation of schedule_hrtimeout_range(), > > i.e. there's an argument to be made that > > the loop tail check instead ought to be done as a negative-value check > > directly prior to schedule_hrtimeout_range() invocation. > > > > Hmm, if schedule_hrtimeout_range() can be relied on to > > internally properly be doing the negative check, > > then one could simply say that > > the annoyingly extra calculation via ktime_before() call > > should simply be removed completely. > > Which would be a good step since it would centralize protocol behaviour > > within the inner handling of the helper > > rather than bloating user-side instruction cache footprint. > > Ah, so you're saying just do a "while (true)" after changing the > behavior of schedule_hrtimeout_range_clock(). If everyone agrees that > they'd like to see this, I can do it. I'd have to change > schedule_hrtimeout_range_clock() to check for <= 0 instead of just > comparing against 0. ...and that would be an API change for > schedule_hrtimeout_range_clock(), and it seems like that would be yet > another source of bugs. ...and this is to save an instruction? It > doesn't seem worth it. Yup, I just realized that probably what we'd ideally want is a very thin decorating wrapper (loop handler) which cares about nothing else other than eradicating the relevant "feature" of the inner handler (namely, premature exit), and otherwise leaves a much as possible to central inner handler decision-making implementation. That to me sounds like the theoretically most precise (since dead simple) handling. And even if such exceedingly simple handling is dangerous - in case of failure we would realize it very soon (infinite loop), and would then know what will need fixing. I've been reviewing schedule_hrtimeout_range_clock() as well, and I'm actually puzzled that it checks for zero value only, and not less-equal zero. That to me does not seem like a true-to-the-core protocol to handle the task at hand. OTOH perhaps less-equal comparison was deemed to be much more expensive than a zero check, and thus possibly has been done in inner handlers only. For the loop code itself, I'm not sure whether to pursue it - given the schedule_hrtimeout_range_clock() API change risks you outlined, and especially given that optimization is likely to mostly benefit the "repeat" case only, which likely would be the less relevant non-hotpath case anyway. Plus, let's not forget the fact that even hotpath handling *is* an invocation of the entire delay handler, and then a couple measly opcodes to have the loop exited reliably. So... Now, at least we have a Reviewed-by: Andreas Mohr <andim2@users.sf.net> Oh well, lotsa discussion, some good thoughts, but no truly revolutionary outcome so far. However, sometimes the most important thing is to have had a good educating discussion (not everything in software circles is about the Get Rich Quick scheme - you do have to spend some ample time - decades... - to truly get mental concept things right). Andreas Mohr
On Tue, Oct 11, 2016 at 09:33:15AM -0700, Doug Anderson wrote: > On Tue, Oct 11, 2016 at 12:14 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > > On Mon, 10 Oct 2016, Douglas Anderson wrote: > >> Users of usleep_range() expect that it will _never_ return in less time > >> than the minimum passed parameter. However, nothing in any of the code > >> ensures this. Specifically: > > There is no such guarantee for that interface and never has been, so how > > did you make sure that none of the existing users is relying on this? > > You can't just can't just declare that all all of the users expect that and > > be done with it. > You're right that I can't guarantee that no callers are relying on the > existing behavior of a wake_up_process() causing usleep_range() to > return early. I would say, however, that all callers I've seen are > absolutely relying on the min delay being enforced and I've never > personally seen a caller relying on being woken up from > usleep_range(). All the users relying on the min delay being enforced Indeed. It's *highly* surprising for any sleep interface to undershoot on delays, the usual thing is that they might delay for longer. If the function doesn't actually reliably delay for the minimum time then I'd expect that a large proportion of those conversions and other recent code that's been added is buggy. > one of two functions: usleep_atlest() and usleep_wakeable(). As > argued below I think that usleep_range() name implies that it will at > least sleep the minimum so I would really like to avoid keeping the > name usleep_range() and also keeping the existing behavior. I tend to agree with everything Doug is saying in terms of API expectations.
On Tue, 11 Oct 2016, Andreas Mohr wrote: > On Tue, Oct 11, 2016 at 09:14:38AM +0200, Thomas Gleixner wrote: > > On Mon, 10 Oct 2016, Douglas Anderson wrote: > > > Users of usleep_range() expect that it will _never_ return in less time > > > than the minimum passed parameter. However, nothing in any of the code > > > ensures this. Specifically: > > > > There is no such guarantee for that interface and never has been, so how > > did you make sure that none of the existing users is relying on this? > > > > You can't just can't just declare that all all of the users expect that and > > be done with it. > > Hmm, somehow I don't manage to follow these thoughts. > > https://www.kernel.org/doc/htmldocs/device-drivers/API-usleep-range.html > (as a hopefully sufficiently authoritative source of documentation) > clearly specifies min to be > "Minimum time in usecs to sleep" > , which is what one would expect a two-param interface here to be > (minimum-maximum), > i.e. what would be the *natural* protocol I'd think. > > Also, [finally...] starting to enforce the minimum time > is an additional *constraint* on the protocol, > i.e. it's not at all like we are getting more *liberal* here > (since usually getting more liberal in certain protocols > is what will cause trouble, I'd think). > > Not to mention that > desiring a delay in processing most certainly is > what caused users of this API to decide to invoke it in the first place > (else they would just have chosen to carry on with delay-less processing > and be done with it). > And those users then surely wouldn't want to experience a behaviour > where the delay may be ended at any time, > however short that may end up being. I'm well aware what Doug wants to do and I'm not saying that this is wrong, but I'm not going to look at all usleep() usage sites to make sure none is relying on such a behaviour and gets surprised by the change, The point is that we had cases over and over where stuff was depending on implementation bugs which made the buggy behaviour into an expected behaviour. I'm not saying that this is the case here, but it's not my duty to make sure it isn't. So the very minimum I need in the changelog is some mentioning that the author at least tried to verify that this is not going to break the world and some more. That's what I meant by: You can't just can't just declare that all all of the users expect that and be done with it. Thanks, tglx
On Mon, Oct 10, 2016 at 02:04:02PM -0700, Douglas Anderson wrote: > Users of usleep_range() expect that it will _never_ return in less time > than the minimum passed parameter. However, nothing in any of the code > ensures this. Specifically: > > usleep_range() => do_usleep_range() => schedule_hrtimeout_range() => > schedule_hrtimeout_range_clock() just ends up calling schedule() with an > appropriate timeout set using the hrtimer. If someone else happens to > wake up our task then we'll happily return from usleep_range() early. > > msleep() already has code to handle this case since it will loop as long > as there was still time left. usleep_range() had no such loop. > > The problem is is easily demonstrated with a small bit of test code: > > static int usleep_test_task(void *data) > { > atomic_t *done = data; > ktime_t start, end; > > start = ktime_get(); > usleep_range(50000, 100000); > end = ktime_get(); > pr_info("Requested 50000 - 100000 us. Actually slept for %llu us\n", > (unsigned long long)ktime_to_us(ktime_sub(end, start))); > atomic_set(done, 1); > > return 0; > } > > static void run_usleep_test(void) > { > struct task_struct *t; > atomic_t done; > > atomic_set(&done, 0); > > t = kthread_run(usleep_test_task, &done, "usleep_test_task"); > while (!atomic_read(&done)) { > wake_up_process(t); > udelay(1000); > } > kthread_stop(t); > } > > If you run the above code without this patch you get things like: > Requested 50000 - 100000 us. Actually slept for 967 us > > If you run the above code _with_ this patch, you get: > Requested 50000 - 100000 us. Actually slept for 50001 us > > Presumably this problem was not detected before because: > - It's not terribly common to use wake_up_process() directly. > - Other ways for processes to wake up are not typically mixed with > usleep_range(). > - There aren't lots of places that use usleep_range(), since many people > call either msleep() or udelay(). > > Reported-by: Tao Huang <huangtao@rock-chips.com> > Signed-off-by: Douglas Anderson <dianders@chromium.org> > Reviewed-by: Brian Norris <briannorris@chromium.org> > Reviewed-by: Andreas Mohr <andim2@users.sf.net> Reviewed-by: Guenter Roeck <linux@roeck-us.net> The following drivers may expect the function to be interruptible. drivers/iio/accel/kxcjk-1013.c: kxcjk1013_runtime_resume() drivers/iio/accel/bmc150-accel-core.c:bmc150_accel_runtime_resume() drivers/iio/accel/mma8452.c:mma8452_runtime_resume() drivers/iio/accel/mma9551_core.c:mma9551_sleep() kernel/trace/ring_buffer.c:rb_test() A possible solution might be to introduce usleep_range_interruptible() and use it there. Note: drivers/scsi/mvumi.c:mvumi_rescan_bus() uses msleep() but should possibly use msleep_interruptible() instead. Guenter
Hi, On Wed, Oct 12, 2016 at 9:03 AM, Guenter Roeck <linux@roeck-us.net> wrote: > drivers/iio/accel/kxcjk-1013.c: kxcjk1013_runtime_resume() > drivers/iio/accel/bmc150-accel-core.c:bmc150_accel_runtime_resume() > drivers/iio/accel/mma8452.c:mma8452_runtime_resume() > drivers/iio/accel/mma9551_core.c:mma9551_sleep() As far as I can tell these drivers will not suffer unduly from my change. Worse case they will delay 20us more, which is listed as the max. Also note that I assume the reason you flagged these is because they follow the pattern: if (sleep_val < 20000) usleep_range(sleep_val, 20000); else msleep_interruptible(sleep_val/1000); I will note that usleep_range() is and has always been uninterruptible, since the implementation says: void __sched usleep_range(unsigned long min, unsigned long max) { __set_current_state(TASK_UNINTERRUPTIBLE); do_usleep_range(min, max); } So I'm not at all convinced that we are changing behavior here. The "interruptible" vs. "uninterruptible" affects whether signals can interrupt the sleep, not whether a random wake up of a task can. What we really need to know is if they are affected by a wakeup. > kernel/trace/ring_buffer.c:rb_test() I assume that the person who wrote this code was confused since they wrote: set_current_state(TASK_INTERRUPTIBLE); /* Now sleep between a min of 100-300us and a max of 1ms */ usleep_range(((data->cnt % 3) + 1) * 100, 1000); That doesn't seem to make sense given the first line of usleep_range(). In any case, again I don't think I am changing behavior. > A possible solution might be to introduce usleep_range_interruptible() > and use it there. This could be a useful function, but I don't think we need it if we find someone who needs a wakeup to cut short a sleep. We can just call one of the schedule functions directly and use a timeout. Thank you for searching through for stuff and for your review, though! -Doug
On Wed, Oct 12, 2016 at 09:27:35AM -0700, Doug Anderson wrote: > Hi, > > On Wed, Oct 12, 2016 at 9:03 AM, Guenter Roeck <linux@roeck-us.net> wrote: > > drivers/iio/accel/kxcjk-1013.c: kxcjk1013_runtime_resume() > > drivers/iio/accel/bmc150-accel-core.c:bmc150_accel_runtime_resume() > > drivers/iio/accel/mma8452.c:mma8452_runtime_resume() > > drivers/iio/accel/mma9551_core.c:mma9551_sleep() > > As far as I can tell these drivers will not suffer unduly from my > change. Worse case they will delay 20us more, which is listed as the > max. > 20 ms. > Also note that I assume the reason you flagged these is because they > follow the pattern: > > if (sleep_val < 20000) > usleep_range(sleep_val, 20000); > else > msleep_interruptible(sleep_val/1000); > Correct > I will note that usleep_range() is and has always been > uninterruptible, since the implementation says: > > void __sched usleep_range(unsigned long min, unsigned long max) > { > __set_current_state(TASK_UNINTERRUPTIBLE); > do_usleep_range(min, max); > } > Good point. > So I'm not at all convinced that we are changing behavior here. The > "interruptible" vs. "uninterruptible" affects whether signals can > interrupt the sleep, not whether a random wake up of a task can. What > we really need to know is if they are affected by a wakeup. > Yes, you are correct. > > kernel/trace/ring_buffer.c:rb_test() > > I assume that the person who wrote this code was confused since they wrote: > > set_current_state(TASK_INTERRUPTIBLE); > /* Now sleep between a min of 100-300us and a max of 1ms */ > usleep_range(((data->cnt % 3) + 1) * 100, 1000); > > That doesn't seem to make sense given the first line of usleep_range(). > ... which, for those who don't pay attention (like me), is __set_current_state(TASK_UNINTERRUPTIBLE); > In any case, again I don't think I am changing behavior. > > > A possible solution might be to introduce usleep_range_interruptible() > > and use it there. > > This could be a useful function, but I don't think we need it if we > find someone who needs a wakeup to cut short a sleep. We can just > call one of the schedule functions directly and use a timeout. > Agreed. > > Thank you for searching through for stuff and for your review, though! > No problem. Thanks for correcting me. Note that I also searched for use of usleep_range() in conjunction with a a task wakeup, but did not find anything. I did find a large number of cases, though, where the explicit assumption is made that the minimum sleep time is well defined. Guenter
Hi, On Wed, Oct 12, 2016 at 6:11 AM, Thomas Gleixner <tglx@linutronix.de> wrote: > I'm well aware what Doug wants to do and I'm not saying that this is wrong, > but I'm not going to look at all usleep() usage sites to make sure none is > relying on such a behaviour and gets surprised by the change, > > The point is that we had cases over and over where stuff was depending on > implementation bugs which made the buggy behaviour into an expected > behaviour. I'm not saying that this is the case here, but it's not my duty > to make sure it isn't. Every change breaks something. <https://xkcd.com/1172/>. I don't think we promise bug for bug compatibility for Linux releases, do we? Though I definitely agree that we shouldn't be breaking something on purpose and it's good to be careful. > So the very minimum I need in the changelog is some mentioning that the > author at least tried to verify that this is not going to break the world > and some more. That's what I meant by: > > You can't just can't just declare that all all of the users expect that and > be done with it. Do you have some suggestion about how to do that? In general I'm hesitant to rely on code analysis of 1620 call sites and that would really be the only way to "be sure". It would also take a really long time. I think Guenter was searching for files that contained usleep_range() and some other text. That's a pretty good idea but it is definitely not exhaustive since there's no reason that the usleep and a wakeup are guaranteed to be in the same file. It's also entirely possible that a wakeup could happen through some other source than just a direct call to wake_up_process(). It's why I didn't try it originally. ...but I can do it anyway. I tried "git grep -C10000 usleep_range | grep wake_up_process". I actually do find some things that I don't think Guenter found (or maybe he found, analyzed, and rejected): > drivers/block/cciss.c- wake_up_process(cciss_scan_thread); We possibly wake up scan_thread(). Sleep is in cciss_wait_for_mode_change_ack(). Callers of that are cciss_enter_performant_mode() and cciss_enter_simple_mode(). Callers of those are cciss_put_controller_into_performant_mode() and cciss_pci_init() (and the performant mode boils down to cciss_pci_init() anyway). Caller is cciss_init_one(). That is the probe function. AKA: the thread that is being woken up isn't the one doing the usleep_range(). No obvious bug here. > drivers/memstick/host/rtsx_usb_ms.c- wake_up_process(host->detect_ms); Function being woken is rtsx_usb_detect_ms_card(). Sleeps are in ms_power_on() and rtsx_usb_ms_set_param() (with no loops). ms_power_on() is called by rtsx_usb_ms_set_param() anyway. That function is stored in msh->set_param. That will be fairly hard to track down, but it seems unlikely it is called by rtsx_usb_detect_ms_card() and that we intentionally want the usleep to end early. No obvious bug here. > drivers/scsi/mvumi.c- wake_up_process(mhba->dm_thread); > drivers/scsi/mvumi.c- wake_up_process(mhba->dm_thread); Function being woken is mvumi_rescan_bus(). Intended to wake up the task when it's in schedule(). Even after the schedule there is a hardcoded msleep(1000). To me the msleep(1000) makes it seem like this loop can't possibly be too performance critical. ...but even if there are performance critical parts of that loop and somehow the wake was intended not just to wakeup from the schedule but also the usleep_range, worse case we will sleep 2 ms extra. No obvious bug here. > kernel/time/timer.c- wake_up_process((struct task_struct *)__data); Generic timer code. > kernel/trace/ring_buffer.c- wake_up_process(rb_threads[cpu]); This is test code and already seems a bit confused. ...but it does appear that the caller might actually be intending the the wake_up_process() to take effect. Looking closer. I see that the wakeup is called once at the start of the test, not midway through the test. So I don't think there's a problem already than the already identified problem of a useless "set_current_state(TASK_INTERRUPTIBLE);" No obvious bug here. === So the net result of all the above is that: * With my git grep + code inspection, I see no obvious bug. I will admit that I didn't do a super deep search and finding bugs by code inspection is iffy at best, but it might give us some sort of warm fuzzy. * With Guenter's search we saw no obvious bugs. * Several subsystem maintainers and reviewers of lots of code have chimed in and say that they're not aware of any code where usleep_range() was expected to wake up early and they were also aware of lots of code that would break if usleep_range() returned early. IMHO: let's land it in linuxnext and give it some bake time. If nobody screams then it can go into linux proper, possibly to stable trees. -Doug
Hi Doug, On Tue, Oct 11, 2016 at 5:04 AM, Douglas Anderson <dianders@chromium.org> wrote: > Users of usleep_range() expect that it will _never_ return in less time > than the minimum passed parameter. However, nothing in any of the code > ensures this. Specifically: > > usleep_range() => do_usleep_range() => schedule_hrtimeout_range() => > schedule_hrtimeout_range_clock() just ends up calling schedule() with an > appropriate timeout set using the hrtimer. If someone else happens to > wake up our task then we'll happily return from usleep_range() early. I think this change works, and fixes a real issue, however, I don't think you are fixing this at the right layer. The comment for schedule_hrtimeout_range says: /** * schedule_hrtimeout_range - sleep until timeout * @expires: timeout value (ktime_t) * @delta: slack in expires timeout (ktime_t) * @mode: timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL * * Make the current task sleep until the given expiry time has * elapsed. The routine will return immediately unless * the current task state has been set (see set_current_state()). * * The @delta argument gives the kernel the freedom to schedule the * actual wakeup to a time that is both power and performance friendly. * The kernel give the normal best effort behavior for "@expires+@delta", * but may decide to fire the timer earlier, but no earlier than @expires. * * You can set the task state as follows - * * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to * pass before the routine returns. * * %TASK_INTERRUPTIBLE - the routine may return early if a signal is * delivered to the current task. * * The current task state is guaranteed to be TASK_RUNNING when this * routine returns. * * Returns 0 when the timer has expired otherwise -EINTR */ The behavior as specified for this function "at least @timeout time is guaranteed to pass before the routine returns" already guarantees the behavior you are adding to do_usleep_range() whenever the current task state is (pre-)set to TASK_UNINTERRUPTIBLE. Thus, I think the loop around 'schedule()' should be moved to schedule_hrtimeout_range() itself. This would also fix direct callers of schedule_hrtimeout_range() that use TASK_UNINTERRUPTIBLE, although, I could only find one: pt3_fetch_thread() -Dan
Dan, On Tue, Oct 18, 2016 at 6:44 AM, Daniel Kurtz <djkurtz@chromium.org> wrote: > Hi Doug, > > On Tue, Oct 11, 2016 at 5:04 AM, Douglas Anderson <dianders@chromium.org> wrote: >> Users of usleep_range() expect that it will _never_ return in less time >> than the minimum passed parameter. However, nothing in any of the code >> ensures this. Specifically: >> >> usleep_range() => do_usleep_range() => schedule_hrtimeout_range() => >> schedule_hrtimeout_range_clock() just ends up calling schedule() with an >> appropriate timeout set using the hrtimer. If someone else happens to >> wake up our task then we'll happily return from usleep_range() early. > > I think this change works, and fixes a real issue, however, I don't > think you are fixing this at the right layer. > The comment for schedule_hrtimeout_range says: > > /** > * schedule_hrtimeout_range - sleep until timeout > * @expires: timeout value (ktime_t) > * @delta: slack in expires timeout (ktime_t) > * @mode: timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL > * > * Make the current task sleep until the given expiry time has > * elapsed. The routine will return immediately unless > * the current task state has been set (see set_current_state()). > * > * The @delta argument gives the kernel the freedom to schedule the > * actual wakeup to a time that is both power and performance friendly. > * The kernel give the normal best effort behavior for "@expires+@delta", > * but may decide to fire the timer earlier, but no earlier than @expires. > * > * You can set the task state as follows - > * > * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to > * pass before the routine returns. > * > * %TASK_INTERRUPTIBLE - the routine may return early if a signal is > * delivered to the current task. > * > * The current task state is guaranteed to be TASK_RUNNING when this > * routine returns. > * > * Returns 0 when the timer has expired otherwise -EINTR > */ > > The behavior as specified for this function "at least @timeout time is > guaranteed to pass before the routine returns" already guarantees the > behavior you are adding to do_usleep_range() whenever the current task > state is (pre-)set to TASK_UNINTERRUPTIBLE. > > Thus, I think the loop around 'schedule()' should be moved to > schedule_hrtimeout_range() itself. > This would also fix direct callers of schedule_hrtimeout_range() that > use TASK_UNINTERRUPTIBLE, although, I could only find one: > > pt3_fetch_thread() Hmmm, I would agree with you that the behavior of schedule_hrtimeout_range() doesn't seem to match the function comments. ...but I'm not sure I agree with you about what to do here. Specifically I think that whatever we do we need to try to keep schedule_hrtimeout_range() and schedule_timeout() parallel. For schedule_timeout() we have the same comments but it's my understanding that you'd expect that wake_up_process() would wake it up. In any case, if wake_up_process() doesn't wake it up then it seems like msleep() and schedule_timeout_uninterruptible() are the same function with two names, when in fact one is implemented in terms o the other. NOTE that also it seems as if we need some other return values besides 0 and -EINTR from schedule_hrtimeout_range() (again, to match schedule_timeout()) since right now we'll return -EINTR if we were woken up with wake_up_process(). This would be unexpected in the case where we had TASK_UNINTERRUPTIBLE set. -Doug
On Wed, Oct 19, 2016 at 4:29 AM, Doug Anderson <dianders@chromium.org> wrote: > > Dan, > > On Tue, Oct 18, 2016 at 6:44 AM, Daniel Kurtz <djkurtz@chromium.org> wrote: > > Hi Doug, > > > > On Tue, Oct 11, 2016 at 5:04 AM, Douglas Anderson <dianders@chromium.org> wrote: > >> Users of usleep_range() expect that it will _never_ return in less time > >> than the minimum passed parameter. However, nothing in any of the code > >> ensures this. Specifically: > >> > >> usleep_range() => do_usleep_range() => schedule_hrtimeout_range() => > >> schedule_hrtimeout_range_clock() just ends up calling schedule() with an > >> appropriate timeout set using the hrtimer. If someone else happens to > >> wake up our task then we'll happily return from usleep_range() early. > > > > I think this change works, and fixes a real issue, however, I don't > > think you are fixing this at the right layer. > > The comment for schedule_hrtimeout_range says: > > > > /** > > * schedule_hrtimeout_range - sleep until timeout > > * @expires: timeout value (ktime_t) > > * @delta: slack in expires timeout (ktime_t) > > * @mode: timer mode, HRTIMER_MODE_ABS or HRTIMER_MODE_REL > > * > > * Make the current task sleep until the given expiry time has > > * elapsed. The routine will return immediately unless > > * the current task state has been set (see set_current_state()). > > * > > * The @delta argument gives the kernel the freedom to schedule the > > * actual wakeup to a time that is both power and performance friendly. > > * The kernel give the normal best effort behavior for "@expires+@delta", > > * but may decide to fire the timer earlier, but no earlier than @expires. > > * > > * You can set the task state as follows - > > * > > * %TASK_UNINTERRUPTIBLE - at least @timeout time is guaranteed to > > * pass before the routine returns. > > * > > * %TASK_INTERRUPTIBLE - the routine may return early if a signal is > > * delivered to the current task. > > * > > * The current task state is guaranteed to be TASK_RUNNING when this > > * routine returns. > > * > > * Returns 0 when the timer has expired otherwise -EINTR > > */ > > > > The behavior as specified for this function "at least @timeout time is > > guaranteed to pass before the routine returns" already guarantees the > > behavior you are adding to do_usleep_range() whenever the current task > > state is (pre-)set to TASK_UNINTERRUPTIBLE. > > > > Thus, I think the loop around 'schedule()' should be moved to > > schedule_hrtimeout_range() itself. > > This would also fix direct callers of schedule_hrtimeout_range() that > > use TASK_UNINTERRUPTIBLE, although, I could only find one: > > > > pt3_fetch_thread() > > Hmmm, I would agree with you that the behavior of > schedule_hrtimeout_range() doesn't seem to match the function > comments. > > ...but I'm not sure I agree with you about what to do here. > Specifically I think that whatever we do we need to try to keep > schedule_hrtimeout_range() and schedule_timeout() parallel. For > schedule_timeout() we have the same comments but it's my understanding > that you'd expect that wake_up_process() would wake it up. In any > case, if wake_up_process() doesn't wake it up then it seems like > msleep() and schedule_timeout_uninterruptible() are the same function > with two names, when in fact one is implemented in terms o the other. Sounds reasonable. It would be nice to add a note to all of those function comments though to make them sound less absolute - "at least @timeout time is guaranteed to pass before the routine returns unless the current task is explicitly woken up, (e.g. by wake_up_process())" > NOTE that also it seems as if we need some other return values besides > 0 and -EINTR from schedule_hrtimeout_range() (again, to match > schedule_timeout()) since right now we'll return -EINTR if we were > woken up with wake_up_process(). This would be unexpected in the case > where we had TASK_UNINTERRUPTIBLE set. > > -Doug
On Thu, 20 Oct 2016, Daniel Kurtz wrote: > On Wed, Oct 19, 2016 at 4:29 AM, Doug Anderson <dianders@chromium.org> wrote: > > ...but I'm not sure I agree with you about what to do here. > > Specifically I think that whatever we do we need to try to keep > > schedule_hrtimeout_range() and schedule_timeout() parallel. For > > schedule_timeout() we have the same comments but it's my understanding > > that you'd expect that wake_up_process() would wake it up. In any > > case, if wake_up_process() doesn't wake it up then it seems like > > msleep() and schedule_timeout_uninterruptible() are the same function > > with two names, when in fact one is implemented in terms o the other. > > Sounds reasonable. > It would be nice to add a note to all of those function comments > though to make them sound less absolute - > "at least @timeout time is guaranteed to pass before the routine > returns unless the current task is explicitly woken up, (e.g. by > wake_up_process())" Agreed.
diff --git a/kernel/time/timer.c b/kernel/time/timer.c index 32bf6f75a8fe..219439efd56a 100644 --- a/kernel/time/timer.c +++ b/kernel/time/timer.c @@ -1898,12 +1898,28 @@ EXPORT_SYMBOL(msleep_interruptible); static void __sched do_usleep_range(unsigned long min, unsigned long max) { + ktime_t now, end; ktime_t kmin; u64 delta; + int ret; - kmin = ktime_set(0, min * NSEC_PER_USEC); + now = ktime_get(); + end = ktime_add_us(now, min); delta = (u64)(max - min) * NSEC_PER_USEC; - schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL); + do { + kmin = ktime_sub(end, now); + ret = schedule_hrtimeout_range(&kmin, delta, HRTIMER_MODE_REL); + + /* + * If schedule_hrtimeout_range() returns 0 then we actually + * hit the timeout. If not then we need to re-calculate the + * new timeout ourselves. + */ + if (ret == 0) + break; + + now = ktime_get(); + } while (ktime_before(now, end)); } /**
Users of usleep_range() expect that it will _never_ return in less time than the minimum passed parameter. However, nothing in any of the code ensures this. Specifically: usleep_range() => do_usleep_range() => schedule_hrtimeout_range() => schedule_hrtimeout_range_clock() just ends up calling schedule() with an appropriate timeout set using the hrtimer. If someone else happens to wake up our task then we'll happily return from usleep_range() early. msleep() already has code to handle this case since it will loop as long as there was still time left. usleep_range() had no such loop. The problem is is easily demonstrated with a small bit of test code: static int usleep_test_task(void *data) { atomic_t *done = data; ktime_t start, end; start = ktime_get(); usleep_range(50000, 100000); end = ktime_get(); pr_info("Requested 50000 - 100000 us. Actually slept for %llu us\n", (unsigned long long)ktime_to_us(ktime_sub(end, start))); atomic_set(done, 1); return 0; } static void run_usleep_test(void) { struct task_struct *t; atomic_t done; atomic_set(&done, 0); t = kthread_run(usleep_test_task, &done, "usleep_test_task"); while (!atomic_read(&done)) { wake_up_process(t); udelay(1000); } kthread_stop(t); } If you run the above code without this patch you get things like: Requested 50000 - 100000 us. Actually slept for 967 us If you run the above code _with_ this patch, you get: Requested 50000 - 100000 us. Actually slept for 50001 us Presumably this problem was not detected before because: - It's not terribly common to use wake_up_process() directly. - Other ways for processes to wake up are not typically mixed with usleep_range(). - There aren't lots of places that use usleep_range(), since many people call either msleep() or udelay(). Reported-by: Tao Huang <huangtao@rock-chips.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> --- Changes in v2: - Fixed stupid bug that snuck in before posting - Use ktime_before - Remove delta from the loop NOTE: Tested against 4.4 tree w/ backports. I'm trying to get myself up and running with mainline again to test there now but it might be a little while. Hopefully this time I don't shoot myself in the foot. kernel/time/timer.c | 20 ++++++++++++++++++-- 1 file changed, 18 insertions(+), 2 deletions(-)