Message ID | 2803b89021f991662b000f50e45dbaebdcca438a.1724729626.git.len.brown@intel.com (mailing list archive) |
---|---|
State | Changes Requested, archived |
Headers | show |
Series | ACPI: Remove msleep() bloat from acpi_os_sleep() | expand |
First, let me add a few people who know more about timers than I do. On Tue, Aug 27, 2024 at 5:42 AM Len Brown <lenb@kernel.org> wrote: > > From: Len Brown <len.brown@intel.com> > > Optimize acpi_os_sleep(ms) using usleep_range(floor, ceiling). > The floor of the range is the exact requested ms, > with an additional 1ms of slack for sleeps above 20ms. > > This reduces the kernel resume time of the Dell 9300 > to 1,124 ms from 2,471 ms. > > The ACPI AML Sleep(ms) method calls acpi_os_sleep(ms), > which has invoked msleep(ms) since 2013. > > But msleep(ms) is based on jiffies, and the rounding-up > logic to convert to jiffies on a HZ=250 system causes > msleep(5) to bloat to a minimum of a 12ms delay. > msleep(5) typically takes over 15ms! > > As a result, AML delay loops with small Sleep() inside > magnify the entire loop. A particularly painful example > is ACPI support for powering-on ICL and TGL > thunderbolt/pcie_ports during system resume. > > Regarding jiffy-based msleep() being inexpensive > and hrtimer-based usleep_range() being expensive. > ACPI AML timer invocations are rare, and so it > is unlikely the hrtimer cost will be noticible, > or even measurable. At the same time, the msleep() > timer duration bloat is significant enough to > be noticed by end users. I'm not sure why you are refusing to follow the implementation of fsleep() and Documentation/timers/timers-howto.rst and still use msleep() for sleep durations longer than 20 ms. Why should usleep_range() be used for 100 ms sleeps, for instance? This goes against the recommendation in the above document, so is there a particular reason? > Regarding usleep_range() timer coalescing. > It virtually never works during ACPI flows, which > commonly run when there are few coalescing > opportunities. As a result, the timers almost > always expire at the maximum end of their specified range. I don't think that's the main point of using a nonzero range in usleep_range(). AFAICS, this is about letting the timekeeping subsystem know how much you care about timer precision so it can arrange things to meet everyone's needs. > It was tempting to use usleep_range(us, us) > for all values of us. But 1 ms is added to the > range for timers over 20ms on the reasoning that > the AML Sleep interface has a granularity of 1ms, > most costly loops use duration under 20ms inside, > and singular long sleeps are unlitly to notice an > additiona 1ms, so why not allow some coalescing... So again, why not use msleep() for sleeps longer than 20 ms? > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263 > Signed-off-by: Len Brown <len.brown@intel.com> > Suggested-by: Arjan van de Ven <arjan@linux.intel.com> > Tested-by: Todd Brandt <todd.e.brandt@intel.com> > --- > drivers/acpi/osl.c | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c > index 70af3fbbebe5..c4c76f86cd7a 100644 > --- a/drivers/acpi/osl.c > +++ b/drivers/acpi/osl.c > @@ -607,7 +607,13 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler) > > void acpi_os_sleep(u64 ms) > { > - msleep(ms); > + u64 us = ms * 1000; > + > + if (us <= 20000) > + usleep_range(us, us); > + else > + usleep_range(us, us + 1000); > + > } > > void acpi_os_stall(u32 us) > -- While I agree with using usleep_range() for sleeps up to 20 ms in acpi_os_sleep(), I disagree with the patch as is. Thanks!
On Tue, Aug 27, 2024 at 7:29 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > > First, let me add a few people who know more about timers than I do. > > On Tue, Aug 27, 2024 at 5:42 AM Len Brown <lenb@kernel.org> wrote: > > > > From: Len Brown <len.brown@intel.com> > > > > Optimize acpi_os_sleep(ms) using usleep_range(floor, ceiling). > > The floor of the range is the exact requested ms, > > with an additional 1ms of slack for sleeps above 20ms. > > > > This reduces the kernel resume time of the Dell 9300 > > to 1,124 ms from 2,471 ms. > > > > The ACPI AML Sleep(ms) method calls acpi_os_sleep(ms), > > which has invoked msleep(ms) since 2013. > > > > But msleep(ms) is based on jiffies, and the rounding-up > > logic to convert to jiffies on a HZ=250 system causes > > msleep(5) to bloat to a minimum of a 12ms delay. > > msleep(5) typically takes over 15ms! > > > > As a result, AML delay loops with small Sleep() inside > > magnify the entire loop. A particularly painful example > > is ACPI support for powering-on ICL and TGL > > thunderbolt/pcie_ports during system resume. > > > > Regarding jiffy-based msleep() being inexpensive > > and hrtimer-based usleep_range() being expensive. > > ACPI AML timer invocations are rare, and so it > > is unlikely the hrtimer cost will be noticible, > > or even measurable. At the same time, the msleep() > > timer duration bloat is significant enough to > > be noticed by end users. > > I'm not sure why you are refusing to follow the implementation of > fsleep() and Documentation/timers/timers-howto.rst and still use > msleep() for sleep durations longer than 20 ms. timers_howto.rst could use an update to reflect reality. It doesn't disclose how toxic msleep actually is for small values. msleep(1) takes 11ms on a HZ=250 system. Linux/ACPI has to support any random AML Sleep(ms) call, and sometimes we see timeout loops implemented around an inner Sleep(1ms). If we use msleep those loops explode by 11x and aggregate durations that are noticed by end users. fsleep does three things -- and none of them are a good fit for acpi_os_sleep: static inline void fsleep(unsigned long usecs) { if (usecs <= 10) udelay(usecs); else if (usecs <= 20000) usleep_range(usecs, 2 * usecs); else msleep(DIV_ROUND_UP(usecs, 1000)); } > udelay(usecs); will never execute in the ACPI case, as the minimum delay is 1000 usec. > usleep_range(usecs, 2 * usecs); timers-howto.rst says this: "With the introduction of a range, the scheduler is free to coalesce your wakeup with any other wakeup that may have happened for other reasons, or at the worst case, fire an interrupt for your upper bound." But the way usleep_range works is it sets the timer for the upper bound, and opportunistically wakes/cancels if another timer fires after the lower bound and before the upper bound. It calls it a "worst case" that the timer fires at the upper bound. But when ACPI suspend/resume flows are running the only other timer is the tick, and so the "worst case" happens effectively ALWAYS. So when fsleep does a usleep_range(usecs, 2 * usecs), it is effectively DOUBLING the duration of all timers 20 ms and smaller. There may be scenarios where doing this makes sense, but acpi_os_sleep() is not one of them. > msleep(DIV_ROUND_UP(usecs, 1000)); msleep(50) takes 59.8ms -- a 20% penalty. We have loops with AML Sleep(50) in the middle, and this code would bloat them by 20%, while the user is waiting -- for no benefit. o Again, there may be scenarios where doing this makes sense, but acpi_os_sleep() is not one of them. > Why should usleep_range() be used for 100 ms sleeps, for instance? > This goes against the recommendation in the above document, so is > there a particular reason? The document doesn't say *why* msleep is recommended. One would assume that when there are many timers, msleep is efficient because it consolidates them to jiffy boundaries, and if they are long duration timers, perhaps the assumption is that they don't care so much about the additional delays? Again, there are certainly scenarios where that makes sense, but at the end of the day, msleep(100) takes 106 msec. ACPI is not a heavy timer user, so the msleep efficiency justification for making the user wait longer holds no weight. Now that we realize that end-users notice timer bloat in acp_os_sleep, it is clear that msleep is simply a poor choice for acpi_os_sleep. > > Regarding usleep_range() timer coalescing. > > It virtually never works during ACPI flows, which > > commonly run when there are few coalescing > > opportunities. As a result, the timers almost > > always expire at the maximum end of their specified range. > > I don't think that's the main point of using a nonzero range in > usleep_range(). AFAICS, this is about letting the timekeeping > subsystem know how much you care about timer precision so it can > arrange things to meet everyone's needs. The range in usleep_range is to enable timer coalescing, which is to reduce the number of timers firing on the system. If it has another purpose, neither the code nor the API documentation are forthcoming. > > It was tempting to use usleep_range(us, us) > > for all values of us. But 1 ms is added to the > > range for timers over 20ms on the reasoning that > > the AML Sleep interface has a granularity of 1ms, > > most costly loops use duration under 20ms inside, > > and singular long sleeps are unlitly to notice an > > additiona 1ms, so why not allow some coalescing... > > So again, why not use msleep() for sleeps longer than 20 ms? per above. Too slow. > > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263 > > Signed-off-by: Len Brown <len.brown@intel.com> > > Suggested-by: Arjan van de Ven <arjan@linux.intel.com> > > Tested-by: Todd Brandt <todd.e.brandt@intel.com> > > --- > > drivers/acpi/osl.c | 8 +++++++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c > > index 70af3fbbebe5..c4c76f86cd7a 100644 > > --- a/drivers/acpi/osl.c > > +++ b/drivers/acpi/osl.c > > @@ -607,7 +607,13 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler) > > > > void acpi_os_sleep(u64 ms) > > { > > - msleep(ms); > > + u64 us = ms * 1000; > > + > > + if (us <= 20000) > > + usleep_range(us, us); > > + else > > + usleep_range(us, us + 1000); > > + > > } > > > > void acpi_os_stall(u32 us) > > -- > > While I agree with using usleep_range() for sleeps up to 20 ms in > acpi_os_sleep(), I disagree with the patch as is. The measurement results do not support any form of acpi_os_sleep that invokes any form of msleep(). Honestly, I think because of when and how acpi_os_sleep is called, we should consider making it yet simpler: void acpi_os_sleep(u64 ms) { u64 us = ms * 1000; usleep_range(us, us); } thanks, -Len
On Wed, Aug 28, 2024 at 6:12 AM Len Brown <lenb@kernel.org> wrote: > > On Tue, Aug 27, 2024 at 7:29 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > First, let me add a few people who know more about timers than I do. > > > > On Tue, Aug 27, 2024 at 5:42 AM Len Brown <lenb@kernel.org> wrote: > > > > > > From: Len Brown <len.brown@intel.com> > > > > > > Optimize acpi_os_sleep(ms) using usleep_range(floor, ceiling). > > > The floor of the range is the exact requested ms, > > > with an additional 1ms of slack for sleeps above 20ms. > > > > > > This reduces the kernel resume time of the Dell 9300 > > > to 1,124 ms from 2,471 ms. > > > > > > The ACPI AML Sleep(ms) method calls acpi_os_sleep(ms), > > > which has invoked msleep(ms) since 2013. > > > > > > But msleep(ms) is based on jiffies, and the rounding-up > > > logic to convert to jiffies on a HZ=250 system causes > > > msleep(5) to bloat to a minimum of a 12ms delay. > > > msleep(5) typically takes over 15ms! > > > > > > As a result, AML delay loops with small Sleep() inside > > > magnify the entire loop. A particularly painful example > > > is ACPI support for powering-on ICL and TGL > > > thunderbolt/pcie_ports during system resume. > > > > > > Regarding jiffy-based msleep() being inexpensive > > > and hrtimer-based usleep_range() being expensive. > > > ACPI AML timer invocations are rare, and so it > > > is unlikely the hrtimer cost will be noticible, > > > or even measurable. At the same time, the msleep() > > > timer duration bloat is significant enough to > > > be noticed by end users. > > > > I'm not sure why you are refusing to follow the implementation of > > fsleep() and Documentation/timers/timers-howto.rst and still use > > msleep() for sleep durations longer than 20 ms. > > timers_howto.rst could use an update to reflect reality. Maybe it could. > It doesn't disclose how toxic msleep actually is for small values. > msleep(1) takes 11ms on a HZ=250 system. > > Linux/ACPI has to support any random AML Sleep(ms) call, > and sometimes we see timeout loops implemented around > an inner Sleep(1ms). If we use msleep those loops explode > by 11x and aggregate durations that are noticed by end users. So this is all about short sleeps. You don't need to argue with me about short sleeps, I'm convinced. Thus I'm going to skip all arguments regarding short sleeps going forward. > fsleep does three things -- and none of them are a good fit > for acpi_os_sleep: > > static inline void fsleep(unsigned long usecs) > { > if (usecs <= 10) > udelay(usecs); > else if (usecs <= 20000) > usleep_range(usecs, 2 * usecs); > else > msleep(DIV_ROUND_UP(usecs, 1000)); > } > > > udelay(usecs); > will never execute in the ACPI case, as the minimum delay is 1000 usec. > > > usleep_range(usecs, 2 * usecs); > > timers-howto.rst says this: > > "With the introduction of a range, the scheduler is > free to coalesce your wakeup with any other wakeup > that may have happened for other reasons, or at the > worst case, fire an interrupt for your upper bound." > > But the way usleep_range works is it sets the timer for the > upper bound, and opportunistically wakes/cancels if another timer > fires after the lower bound and before the upper bound. Yes, that's how it works. > It calls it a "worst case" that the timer fires at the upper bound. > > But when ACPI suspend/resume flows are running the only other > timer is the tick, and so the "worst case" happens effectively ALWAYS. Arguably though, ACPI suspend/resume flows are not the only case in which AML skeeps. > So when fsleep does a usleep_range(usecs, 2 * usecs), it is effectively > DOUBLING the duration of all timers 20 ms and smaller. > There may be scenarios where doing this makes sense, > but acpi_os_sleep() is not one of them. It would be good to actually try this. However, I'm not arguing that usleep_range(usecs, 2 * usecs) is suitable for ACPI sleeps. I'm saying that it is friendly with respect to the timekeeping subsystem to give it a nonzero range if it is affordable. And I do think that it is affordable in the ACPI sleep case. > > msleep(DIV_ROUND_UP(usecs, 1000)); > > msleep(50) takes 59.8ms -- a 20% penalty. I guess you mean on a HZ=250 system system and I guess this is a measured number. > We have loops with AML Sleep(50) in the middle, > and this code would bloat them by 20%, while > the user is waiting -- for no benefit. o If this happens during system suspend/resume and if this is the only thing that is waited for (ie. nothing runs in parallel with it). So your resume time is 20% longer than it can be in theory if every Sleep() causes an extra timer interrupt to be programmed. Is this extra overhead justified? > Again, there may be scenarios where doing this makes sense, > but acpi_os_sleep() is not one of them. I'm not convinced. > > Why should usleep_range() be used for 100 ms sleeps, for instance? > > This goes against the recommendation in the above document, so is > > there a particular reason? > > The document doesn't say *why* msleep is recommended. > One would assume that when there are many timers, msleep > is efficient because it consolidates them to jiffy boundaries, > and if they are long duration timers, perhaps the assumption is > that they don't care so much about the additional delays? > > Again, there are certainly scenarios where that makes sense, > but at the end of the day, msleep(100) takes 106 msec. And is this a big deal really? Is a user really going to notice the 6% difference and even if they do, would they care? For one, I wouldn't. Arguably, it all depends on the intention of the developer who uses the msleep(), os Sleep() in the AML case. Do they mean "exactly 100 ms" or do they mean "at least 100 ms"? If the latter is the case, the whole argument is moot if not misguided. As a side note, I generally think that if a developer puts Sleep(n) in their code they need to assume that the sleep will be slightly longer than n for reasons related to scheduling etc. so I don't believe that they ever mean "exactly n". However, they probably don't think that it may take 3 times longer, so it is better to make that not happen. > ACPI is not a heavy timer user, so the msleep efficiency justification > for making the user wait longer holds no weight. > > Now that we realize that end-users notice timer bloat in acp_os_sleep, > it is clear that msleep is simply a poor choice for acpi_os_sleep. For short sleeps, yes. For long sleeps, using an hrtimer generally ends up in programming an extra timer interrupt unless the range is really wide, but then msleep() gets better again. So is the extra timer interrupt really justified every time AML evaluates Sleep()? And again, system suspend/resume are not the only cases when that happens. Honestly, I don't think so and that's why I prefer msleep() to be used for longer sleeps even though it is "slacky". > > > Regarding usleep_range() timer coalescing. > > > It virtually never works during ACPI flows, which > > > commonly run when there are few coalescing > > > opportunities. As a result, the timers almost > > > always expire at the maximum end of their specified range. > > > > I don't think that's the main point of using a nonzero range in > > usleep_range(). AFAICS, this is about letting the timekeeping > > subsystem know how much you care about timer precision so it can > > arrange things to meet everyone's needs. > > The range in usleep_range is to enable timer coalescing, > which is to reduce the number of timers firing on the system. Exactly. Which reduces the number of times a timer interrupts needs to be programmed. Which generally reduces overhead. > If it has another purpose, neither the code nor the API documentation > are forthcoming. > > > > It was tempting to use usleep_range(us, us) > > > for all values of us. But 1 ms is added to the > > > range for timers over 20ms on the reasoning that > > > the AML Sleep interface has a granularity of 1ms, > > > most costly loops use duration under 20ms inside, > > > and singular long sleeps are unlitly to notice an > > > additiona 1ms, so why not allow some coalescing... > > > > So again, why not use msleep() for sleeps longer than 20 ms? > > per above. Too slow. per above, I'm not convinced. > > > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263 > > > Signed-off-by: Len Brown <len.brown@intel.com> > > > Suggested-by: Arjan van de Ven <arjan@linux.intel.com> > > > Tested-by: Todd Brandt <todd.e.brandt@intel.com> > > > --- > > > drivers/acpi/osl.c | 8 +++++++- > > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c > > > index 70af3fbbebe5..c4c76f86cd7a 100644 > > > --- a/drivers/acpi/osl.c > > > +++ b/drivers/acpi/osl.c > > > @@ -607,7 +607,13 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler) > > > > > > void acpi_os_sleep(u64 ms) > > > { > > > - msleep(ms); > > > + u64 us = ms * 1000; > > > + > > > + if (us <= 20000) > > > + usleep_range(us, us); > > > + else > > > + usleep_range(us, us + 1000); > > > + > > > } > > > > > > void acpi_os_stall(u32 us) > > > -- > > > > While I agree with using usleep_range() for sleeps up to 20 ms in > > acpi_os_sleep(), I disagree with the patch as is. > > The measurement results do not support any form of acpi_os_sleep that > invokes any form of msleep(). That's a bold claim and it seems a bit unfounded. Have you tested anything other than system suspend/resume? Have you tested anything other than x86 laptops? > Honestly, I think because of when and how acpi_os_sleep is called, we should > consider making it yet simpler: > > void acpi_os_sleep(u64 ms) > { > u64 us = ms * 1000; > > usleep_range(us, us); > } Which I obviously disagree with. In the meantime I have checked how many users of usleep_range() use the same value as both arguments of it and I have found 2 out of over 4000. So now, is everyone clueless or is ACPI so special that it has to do something different than everyone else? And because I would actually like to see what the difference in terms of system suspend/resume time is between the $subject patch and something alternative that uses msleep() for longer sleeps and generally uses nonzero ranges with usleep_range(), attached is a patch to compare. Thanks!
On Wed, Aug 28, 2024 at 12:58 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Wed, Aug 28, 2024 at 6:12 AM Len Brown <lenb@kernel.org> wrote: > > > > On Tue, Aug 27, 2024 at 7:29 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > First, let me add a few people who know more about timers than I do. > > > > > > On Tue, Aug 27, 2024 at 5:42 AM Len Brown <lenb@kernel.org> wrote: > > > > > > > > From: Len Brown <len.brown@intel.com> > > > > > > > > Optimize acpi_os_sleep(ms) using usleep_range(floor, ceiling). > > > > The floor of the range is the exact requested ms, > > > > with an additional 1ms of slack for sleeps above 20ms. > > > > > > > > This reduces the kernel resume time of the Dell 9300 > > > > to 1,124 ms from 2,471 ms. > > > > > > > > The ACPI AML Sleep(ms) method calls acpi_os_sleep(ms), > > > > which has invoked msleep(ms) since 2013. > > > > > > > > But msleep(ms) is based on jiffies, and the rounding-up > > > > logic to convert to jiffies on a HZ=250 system causes > > > > msleep(5) to bloat to a minimum of a 12ms delay. > > > > msleep(5) typically takes over 15ms! > > > > > > > > As a result, AML delay loops with small Sleep() inside > > > > magnify the entire loop. A particularly painful example > > > > is ACPI support for powering-on ICL and TGL > > > > thunderbolt/pcie_ports during system resume. > > > > > > > > Regarding jiffy-based msleep() being inexpensive > > > > and hrtimer-based usleep_range() being expensive. > > > > ACPI AML timer invocations are rare, and so it > > > > is unlikely the hrtimer cost will be noticible, > > > > or even measurable. At the same time, the msleep() > > > > timer duration bloat is significant enough to > > > > be noticed by end users. > > > > > > I'm not sure why you are refusing to follow the implementation of > > > fsleep() and Documentation/timers/timers-howto.rst and still use > > > msleep() for sleep durations longer than 20 ms. > > > > timers_howto.rst could use an update to reflect reality. > > Maybe it could. > > > It doesn't disclose how toxic msleep actually is for small values. > > msleep(1) takes 11ms on a HZ=250 system. > > > > Linux/ACPI has to support any random AML Sleep(ms) call, > > and sometimes we see timeout loops implemented around > > an inner Sleep(1ms). If we use msleep those loops explode > > by 11x and aggregate durations that are noticed by end users. > > So this is all about short sleeps. > > You don't need to argue with me about short sleeps, I'm convinced. > > Thus I'm going to skip all arguments regarding short sleeps going forward. > > > fsleep does three things -- and none of them are a good fit > > for acpi_os_sleep: > > > > static inline void fsleep(unsigned long usecs) > > { > > if (usecs <= 10) > > udelay(usecs); > > else if (usecs <= 20000) > > usleep_range(usecs, 2 * usecs); > > else > > msleep(DIV_ROUND_UP(usecs, 1000)); > > } > > > > > udelay(usecs); > > will never execute in the ACPI case, as the minimum delay is 1000 usec. > > > > > usleep_range(usecs, 2 * usecs); > > > > timers-howto.rst says this: > > > > "With the introduction of a range, the scheduler is > > free to coalesce your wakeup with any other wakeup > > that may have happened for other reasons, or at the > > worst case, fire an interrupt for your upper bound." > > > > But the way usleep_range works is it sets the timer for the > > upper bound, and opportunistically wakes/cancels if another timer > > fires after the lower bound and before the upper bound. > > Yes, that's how it works. > > > It calls it a "worst case" that the timer fires at the upper bound. > > > > But when ACPI suspend/resume flows are running the only other > > timer is the tick, and so the "worst case" happens effectively ALWAYS. > > Arguably though, ACPI suspend/resume flows are not the only case in > which AML skeeps. > > > So when fsleep does a usleep_range(usecs, 2 * usecs), it is effectively > > DOUBLING the duration of all timers 20 ms and smaller. > > There may be scenarios where doing this makes sense, > > but acpi_os_sleep() is not one of them. > > It would be good to actually try this. > > However, I'm not arguing that usleep_range(usecs, 2 * usecs) is > suitable for ACPI sleeps. I'm saying that it is friendly with respect > to the timekeeping subsystem to give it a nonzero range if it is > affordable. > > And I do think that it is affordable in the ACPI sleep case. > > > > msleep(DIV_ROUND_UP(usecs, 1000)); > > > > msleep(50) takes 59.8ms -- a 20% penalty. > > I guess you mean on a HZ=250 system system and I guess this is a > measured number. > > > We have loops with AML Sleep(50) in the middle, > > and this code would bloat them by 20%, while > > the user is waiting -- for no benefit. o > > If this happens during system suspend/resume and if this is the only > thing that is waited for (ie. nothing runs in parallel with it). > > So your resume time is 20% longer than it can be in theory if every > Sleep() causes an extra timer interrupt to be programmed. Is this > extra overhead justified? > > > Again, there may be scenarios where doing this makes sense, > > but acpi_os_sleep() is not one of them. > > I'm not convinced. > > > > Why should usleep_range() be used for 100 ms sleeps, for instance? > > > This goes against the recommendation in the above document, so is > > > there a particular reason? > > > > The document doesn't say *why* msleep is recommended. > > One would assume that when there are many timers, msleep > > is efficient because it consolidates them to jiffy boundaries, > > and if they are long duration timers, perhaps the assumption is > > that they don't care so much about the additional delays? > > > > Again, there are certainly scenarios where that makes sense, > > but at the end of the day, msleep(100) takes 106 msec. > > And is this a big deal really? Is a user really going to notice the > 6% difference and even if they do, would they care? For one, I > wouldn't. > > Arguably, it all depends on the intention of the developer who uses > the msleep(), os Sleep() in the AML case. > > Do they mean "exactly 100 ms" or do they mean "at least 100 ms"? If > the latter is the case, the whole argument is moot if not misguided. > > As a side note, I generally think that if a developer puts Sleep(n) in > their code they need to assume that the sleep will be slightly longer > than n for reasons related to scheduling etc. so I don't believe that > they ever mean "exactly n". However, they probably don't think that > it may take 3 times longer, so it is better to make that not happen. > > > ACPI is not a heavy timer user, so the msleep efficiency justification > > for making the user wait longer holds no weight. > > > > Now that we realize that end-users notice timer bloat in acp_os_sleep, > > it is clear that msleep is simply a poor choice for acpi_os_sleep. > > For short sleeps, yes. > > For long sleeps, using an hrtimer generally ends up in programming an > extra timer interrupt unless the range is really wide, but then > msleep() gets better again. So is the extra timer interrupt really > justified every time AML evaluates Sleep()? And again, system > suspend/resume are not the only cases when that happens. > > Honestly, I don't think so and that's why I prefer msleep() to be used > for longer sleeps even though it is "slacky". That being said, I've read the thread mentioned in timers-howto.rst and it turns out to be interesting in the context of this patch. So Jon attempted to reimplement msleep() with the help of hrtimers and it basically worked except that it broke USB mouse hotplug for Andrew. IIUC, this issue was the reason why that patch was not applied. But this appears to mean that there are no fundamental objections to using hrtimers for long sleeps even though it is still recommended to use msleep() in that case. So it all boils down to the question: If a sleep is relatively long, say 100 ms or longer, why exactly is msleep(delay) more suitable for it than usleep_range(delay, delay + delta), where delta is of the order of 1 jiffy? I get that msleep() is based on tick and so it does not require an extra hrtimer (which leads to programming an additional timer interrupt in many cases), but is it worth the trouble of checking the delay and using msleep() instead of usleep_range() if the sleep is long enough? I'm not sure now TBH.
Hi, I try to give some input from the timer perspective and maybe it helps to clarify which should be the proper way to go for acpi_os_sleep(). It also identifies some problems with the current documentation and implementation of msleep/fsleep. Len Brown <lenb@kernel.org> writes: > Attempting to grab all the loose ends from this discussion > and put them into a list of 4 things: > > 1. Is it time to re-visit Jon's proposal to fix msleep, especially for > small sleep values? Lets have a deeper look to msleep() internals: msleep() uses timer list timers. Because of the design of the timer wheel (granularity of buckets increases with the levels) and because of the granularity of jiffies the sleep values will be longer as specified. Let's assume we are executing a msleep(1) on a HZ=250 system: First msecs are mapped on jiffies, so this results in a 4ms timeout value, as there is nothing shorter than 1 jiffie. Then the jiffie value is handed over to timer code and msleep() adds another jiffie to the timeout. The timeout is then 2 jiffies or 8ms. With this timeout a timer list timer is queued. To make sure that timers will not fire early or race with a concurrent incrementation of jiffie, timers are queued always into the next bucket. As the timer will end up in the first level of the timer wheel the granularity of the buckets is 1 jiffies. This means that the timeout would be 3 jiffies in worst case. The additional jiffie in msleep() is the historical prevention that the sleep time is at least the specified time. This is handled by timer wheel core code itself, so this extra jiffie could be removed. I will provide a patch for it. But even with this change, the worst case sleep length will be 8ms instead of 1ms. For comparison, see here a table with the values for all the steps explained above taking some different msleep values. I already dropped the addition of the extra jiffie. The granularity of the timer wheel levels can be found at the first long comment in kernel/time/timer.c. This is still a HZ=250 system: msleep() | msecs_to_jiffies() | worst case delay after | timer wheel | | enqueue (next bucket) | level -------------------------------------------------------------------------- 1 ms | 1 jiffie | 2 jiffies, 8 ms | 0 256 ms | 64 jiffies | 72 jiffies, 288 ms | 1 257 ms | 65 jiffies | 72 jiffies, 288 ms | 1 2048 ms | 513 jiffies | 576 jiffies, 2304 ms | 2 2300 ms | 575 jiffies | 576 jiffies, 2304 ms | 2 4096 ms | 1024 jiffies | 1088 jiffies, 4352 ms | 2 The same values on a HZ=1000 system: msleep() | msecs_to_jiffies() | worst case delay after | timer wheel | | enqueue (next bucket) | level -------------------------------------------------------------------------- 1 ms | 1 jiffie | 2 jiffies, 2 ms | 0 256 ms | 256 jiffies | 264 jiffies, 264 ms | 1 257 ms | 257 jiffies | 264 jiffies, 264 ms | 1 2048 ms | 2048 jiffies | 2112 jiffies, 2112 ms | 2 2300 ms | 2300 jiffies | 2304 jiffies, 2304 ms | 2 4096 ms | 4096 jiffies | 4608 jiffies, 4608 ms | 3 The documentation states, that msleep is not valid for short sleep values. But as demonstrated with the two tables, this is not enirely true. So the descision whether msleep() is suitable for the usecase is not as simple as documented. It depends on the HZ configuration in combination with the timeout value, and on the request how precise the timeout has to be. And another important point is: hrtimers are more expensive then timer list timers... The documentation was originally written back in 2010 where the non cascading timer wheel wasn't in place yet. So is has to be updated. > 2. We agree that usleep_range() is appropriate for short acpi_os_sleep() > due to ASL loops with small Sleep() values. > But if we do something different for large and small values, > where is the line between small and large? As pointed out above - this depends on HZ and what the requirements of the callsite are. > fsleep anointed 20ms, but didn't document why. > (and it made both short sleeps *and* long sleeps too slow to be useful > here, IMO) fsleep() just implements what the outdated documentation states. And it adds a magic max value for usleep_range(). It seems to me, that fsleep() somehow accidentially found the way into the kernel in 2020. As it is now there this needs to be fixed and should take the above things into account in some generic way. > 3. Is usleep_range(min, max) with min= max bad? > If it is good, why is virtually nobody else using min=max? It's not bad to use it. It depends on your use case. If you really need the precise sleep length, then it should be valid to use min = max. I hope the timer information will help to find the proper solution for acpi_os_sleep(). Thanks, Anna-Maria
Hi, On Wed, Aug 28, 2024 at 9:44 PM Anna-Maria Behnsen <anna-maria@linutronix.de> wrote: > > Hi, > > I try to give some input from the timer perspective and maybe it helps > to clarify which should be the proper way to go for acpi_os_sleep(). It > also identifies some problems with the current documentation and > implementation of msleep/fsleep. > > Len Brown <lenb@kernel.org> writes: > > > Attempting to grab all the loose ends from this discussion > > and put them into a list of 4 things: > > > > 1. Is it time to re-visit Jon's proposal to fix msleep, especially for > > small sleep values? > > Lets have a deeper look to msleep() internals: msleep() uses timer list > timers. Because of the design of the timer wheel (granularity of buckets > increases with the levels) and because of the granularity of jiffies the > sleep values will be longer as specified. Let's assume we are executing > a msleep(1) on a HZ=250 system: > > First msecs are mapped on jiffies, so this results in a 4ms timeout > value, as there is nothing shorter than 1 jiffie. Then the jiffie value > is handed over to timer code and msleep() adds another jiffie to the > timeout. The timeout is then 2 jiffies or 8ms. With this timeout a timer > list timer is queued. To make sure that timers will not fire early or > race with a concurrent incrementation of jiffie, timers are queued > always into the next bucket. As the timer will end up in the first level > of the timer wheel the granularity of the buckets is 1 jiffies. This > means that the timeout would be 3 jiffies in worst case. > > The additional jiffie in msleep() is the historical prevention that the > sleep time is at least the specified time. This is handled by timer > wheel core code itself, so this extra jiffie could be removed. I will > provide a patch for it. > > But even with this change, the worst case sleep length will be 8ms > instead of 1ms. > > For comparison, see here a table with the values for all the steps > explained above taking some different msleep values. I already dropped > the addition of the extra jiffie. The granularity of the timer wheel > levels can be found at the first long comment in kernel/time/timer.c. > > This is still a HZ=250 system: > > msleep() | msecs_to_jiffies() | worst case delay after | timer wheel > | | enqueue (next bucket) | level > -------------------------------------------------------------------------- > 1 ms | 1 jiffie | 2 jiffies, 8 ms | 0 > 256 ms | 64 jiffies | 72 jiffies, 288 ms | 1 > 257 ms | 65 jiffies | 72 jiffies, 288 ms | 1 > 2048 ms | 513 jiffies | 576 jiffies, 2304 ms | 2 > 2300 ms | 575 jiffies | 576 jiffies, 2304 ms | 2 > 4096 ms | 1024 jiffies | 1088 jiffies, 4352 ms | 2 > > The same values on a HZ=1000 system: > > msleep() | msecs_to_jiffies() | worst case delay after | timer wheel > | | enqueue (next bucket) | level > -------------------------------------------------------------------------- > 1 ms | 1 jiffie | 2 jiffies, 2 ms | 0 > 256 ms | 256 jiffies | 264 jiffies, 264 ms | 1 > 257 ms | 257 jiffies | 264 jiffies, 264 ms | 1 > 2048 ms | 2048 jiffies | 2112 jiffies, 2112 ms | 2 > 2300 ms | 2300 jiffies | 2304 jiffies, 2304 ms | 2 > 4096 ms | 4096 jiffies | 4608 jiffies, 4608 ms | 3 > > > The documentation states, that msleep is not valid for short sleep > values. But as demonstrated with the two tables, this is not enirely > true. So the descision whether msleep() is suitable for the usecase is > not as simple as documented. It depends on the HZ configuration in > combination with the timeout value, and on the request how precise the > timeout has to be. And another important point is: hrtimers are more > expensive then timer list timers... > > The documentation was originally written back in 2010 where the non > cascading timer wheel wasn't in place yet. So is has to be updated. > > > 2. We agree that usleep_range() is appropriate for short acpi_os_sleep() > > due to ASL loops with small Sleep() values. > > But if we do something different for large and small values, > > where is the line between small and large? > > As pointed out above - this depends on HZ and what the requirements of > the callsite are. > > > fsleep anointed 20ms, but didn't document why. > > (and it made both short sleeps *and* long sleeps too slow to be useful > > here, IMO) > > fsleep() just implements what the outdated documentation states. And it > adds a magic max value for usleep_range(). It seems to me, that fsleep() > somehow accidentially found the way into the kernel in 2020. As it is > now there this needs to be fixed and should take the above things into > account in some generic way. > > > 3. Is usleep_range(min, max) with min= max bad? > > If it is good, why is virtually nobody else using min=max? > > It's not bad to use it. It depends on your use case. If you really need > the precise sleep length, then it should be valid to use min = max. > > I hope the timer information will help to find the proper solution for > acpi_os_sleep(). Yes, it is useful, thank you! The main difficulty is that acpi_os_sleep() really works on behalf of AML code (it is called when the Sleep() operator is evaluated in AML) and it is hard to say what behavior is intended there. We know that the msleep() precision is not sufficient at least in some cases (for example, a high-count loop in AML that sleeps for 5 ms in every iteration), but we don't really know what precision is needed. IMV it generally is reasonable to assume that firmware developers would not expect the exact sleep time (and the ACPI specification is not very precise regarding this either), but they wouldn't also expect sleep times much longer (in relative terms) than requested. So roughly speaking they probably assume something between t and t + t/4, where t is the requested sleep time in milliseconds, regardless of the HZ value. Also, you said above that hrtimers were more expensive than the timer wheel timers, which we all agree with, but it is not clear to me what exactly the difference is. For example, if there is a loop of 1000 iterations in AML that each sleep for 5 ms, how much more overhead from using hrtimers would be there relative to using timer-wheel timers? Generally speaking, Len's idea of using usleep_range() for all sleep lengths in acpi_os_sleep() is compelling because it leads to simple code, but it is not clear to me how much more it would cost relative to msleep() (or what issues it may provoke for that matter) and what delta value to use in there. One idea is to derive the delta from the sleep length (say, take 1/4 of it like I did above), but the counter-argument is that this would cause the loops in AML in question to take measurably more time and there may be no real reason for it. Thanks!
Anna-Maria Behnsen <anna-maria@linutronix.de> writes: [...] > Lets have a deeper look to msleep() internals: msleep() uses timer list > timers. Because of the design of the timer wheel (granularity of buckets > increases with the levels) and because of the granularity of jiffies the > sleep values will be longer as specified. Let's assume we are executing > a msleep(1) on a HZ=250 system: > > First msecs are mapped on jiffies, so this results in a 4ms timeout > value, as there is nothing shorter than 1 jiffie. Then the jiffie value > is handed over to timer code and msleep() adds another jiffie to the > timeout. The timeout is then 2 jiffies or 8ms. With this timeout a timer > list timer is queued. To make sure that timers will not fire early or > race with a concurrent incrementation of jiffie, timers are queued > always into the next bucket. As the timer will end up in the first level > of the timer wheel the granularity of the buckets is 1 jiffies. This > means that the timeout would be 3 jiffies in worst case. > > The additional jiffie in msleep() is the historical prevention that the > sleep time is at least the specified time. This is handled by timer > wheel core code itself, so this extra jiffie could be removed. I will > provide a patch for it. I missed to use the whole cc list above when sending the patch: https://lore.kernel.org/r/20240829074133.4547-1-anna-maria@linutronix.de/ Thanks, Anna-Maria
Hi, "Rafael J. Wysocki" <rafael@kernel.org> writes: [...] > The main difficulty is that acpi_os_sleep() really works on behalf of > AML code (it is called when the Sleep() operator is evaluated in AML) > and it is hard to say what behavior is intended there. > > We know that the msleep() precision is not sufficient at least in some > cases (for example, a high-count loop in AML that sleeps for 5 ms in > every iteration), but we don't really know what precision is needed. > > IMV it generally is reasonable to assume that firmware developers > would not expect the exact sleep time (and the ACPI specification is > not very precise regarding this either), but they wouldn't also expect > sleep times much longer (in relative terms) than requested. So > roughly speaking they probably assume something between t and t + t/4, > where t is the requested sleep time in milliseconds, regardless of the > HZ value. > > Also, you said above that hrtimers were more expensive than the timer > wheel timers, which we all agree with, but it is not clear to me what > exactly the difference is. For example, if there is a loop of 1000 > iterations in AML that each sleep for 5 ms, how much more overhead > from using hrtimers would be there relative to using timer-wheel > timers? The general differences are: - hrtimers are using a rbtree instead of hasing, which is more expensive - in case the new hrtimer is the new first hrtimer, hardware has to be programmed, which is expensive too - the timer wheel uses batching. hrtimers not. They might get an own interrupt for every hrtimer when they do not have the exact same expiry time. - because of the above it is a lot more expensive to use hrtimers when hrtimers are cancelled and requeued with a new expiry time (like e.g. network timeouts do it) It's hard to measure, because it depends... I briefly talked to Thomas and his opinion is, that there is an overhead but it should be in a range of ok. > Generally speaking, Len's idea of using usleep_range() for all sleep > lengths in acpi_os_sleep() is compelling because it leads to simple > code, but it is not clear to me how much more it would cost relative > to msleep() (or what issues it may provoke for that matter) and what > delta value to use in there. One idea is to derive the delta from the > sleep length (say, take 1/4 of it like I did above), but the > counter-argument is that this would cause the loops in AML in question > to take measurably more time and there may be no real reason for it. I created a patch for fsleep() - only complie tested - which should make sure that the slack is maximum 25%. What do you think about it? Might it be helpful? ---8<--- --- a/include/linux/delay.h +++ b/include/linux/delay.h @@ -78,15 +78,36 @@ static inline void ssleep(unsigned int s msleep(seconds * 1000); } -/* see Documentation/timers/timers-howto.rst for the thresholds */ +/** + * fsleep - flexible sleep which autoselects the best mechanism + * @usecs: requested sleep duration in microseconds + * + * flseep() selects the best mechanism that will provide maximum 25% slack + * to the requested sleep duration. Therefore it uses: + * + * - udelay() loop for sleep durations <= 10 microseconds to avoid hrtimer + * overhead for really short sleep durations. + * + * - usleep_range() for sleep durations which would lead with the usage of + * msleep() to a slack larger than 25%. This depends on the granularity of + * jiffies. + * + * - msleep() for all other sleep durations. + * + * Note: When CONFIG_HIGH_RES_TIMERS is not set, all sleeps are processed with + * the granularity of jiffies and the slack might exceed 25% especially for + * short sleep durations. + */ static inline void fsleep(unsigned long usecs) { + const unsigned int max_slack_shift = 2; + if (usecs <= 10) udelay(usecs); - else if (usecs <= 20000) - usleep_range(usecs, 2 * usecs); + else if (usecs < ((TICK_NSEC << max_slack_shift) / NSEC_PER_USEC)) + usleep_range(usecs, usecs + (usecs >> max_slack_shift)); else - msleep(DIV_ROUND_UP(usecs, 1000)); + msleep(DIV_ROUND_UP(usecs, USEC_PER_MSEC)); } #endif /* defined(_LINUX_DELAY_H) */ ---8<---- Thanks, Anna-Maria
On Fri, Aug 30, 2024 at 7:55 AM Len Brown <lenb@kernel.org> wrote: > > On Thu, Aug 29, 2024 at 11:37 AM Anna-Maria Behnsen > <anna-maria@linutronix.de> wrote: > > > I created a patch for fsleep() - only complie tested - which should make > > sure that the slack is maximum 25%. What do you think about it? Might it > > be helpful? > > If the purpose of using msleep instead of usleep_range is to lower > the cost of the timer sub-system.... > > then I'm not sure that choosing to do an msleep instead of a usleep_range > based on the timer duration makes any sense. > > The lighter overhead of the msleep is something that is more important > when there are more timers. More timers is not the same as longer timers. My understanding is that when a new timer is added, it is not actually known how many timers there are overall in use in the system, so it is generally better to use a more lightweight variant in case there are many of them. However, the more lightweight variant is not suitable for short sleep durations because it adds bloat of 2 jiffies (after the recent change from Anna-Maria), so the idea is to start using msleep() at the point when the bloat added by it starts to be comparable to the delta between the usleep_range() arguments (used for short sleeps). Now, because it is not known exactly what timer precision is required, some assumptions need to be made and IMV it is reasonable to assume that the actual sleep duration is expected to be somewhat greater than requested, but it is also not expected to be much greater than requested, so using a fraction of the requested sleep time as the usleep_range() makes sense to me. Of course, you can argue that in the ACPI case there are those high-count loops and so more exact sleep durations are better, but realistically this is mostly about system suspend/resume times and the difference is not something that cannot be tolerated in that case IMV. Also it would be better to avoid running those high-count loops in the first place. Overall, if my understanding is correct, I generally agree with Anna-Maria's approach.
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c index 70af3fbbebe5..c4c76f86cd7a 100644 --- a/drivers/acpi/osl.c +++ b/drivers/acpi/osl.c @@ -607,7 +607,13 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler) void acpi_os_sleep(u64 ms) { - msleep(ms); + u64 us = ms * 1000; + + if (us <= 20000) + usleep_range(us, us); + else + usleep_range(us, us + 1000); + } void acpi_os_stall(u32 us)