diff mbox series

[v3,2/7] iopoll: Do not use timekeeping in read_poll_timeout_atomic()

Message ID 3d2a2f4e553489392d871108797c3be08f88300b.1685692810.git.geert+renesas@glider.be (mailing list archive)
State New, archived
Headers show
Series iopoll: Busy loop and timeout improvements + conversions | expand

Commit Message

Geert Uytterhoeven June 2, 2023, 8:50 a.m. UTC
read_poll_timeout_atomic() uses ktime_get() to implement the timeout
feature, just like its non-atomic counterpart.  However, there are
several issues with this, due to its use in atomic contexts:

  1. When called in the s2ram path (as typically done by clock or PM
     domain drivers), timekeeping may be suspended, triggering the
     WARN_ON(timekeeping_suspended) in ktime_get():

	WARNING: CPU: 0 PID: 654 at kernel/time/timekeeping.c:843 ktime_get+0x28/0x78

     Calling ktime_get_mono_fast_ns() instead of ktime_get() would get
     rid of that warning.  However, that would break timeout handling,
     as (at least on systems with an ARM architectured timer), the time
     returned by ktime_get_mono_fast_ns() does not advance while
     timekeeping is suspended.
     Interestingly, (on the same ARM systems) the time returned by
     ktime_get() does advance while timekeeping is suspended, despite
     the warning.

  2. Depending on the actual clock source, and especially before a
     high-resolution clocksource (e.g. the ARM architectured timer)
     becomes available, time may not advance in atomic contexts, thus
     breaking timeout handling.

Fix this by abandoning the idea that one can rely on timekeeping to
implement timeout handling in all atomic contexts, and switch from a
global time-based to a locally-estimated timeout handling.  In most
(all?) cases the timeout condition is exceptional and an error
condition, hence any additional delays due to underestimating wall clock
time are irrelevant.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Tony Lindgren <tony@atomide.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
---
The first issue was seen with the rcar-sysc driver in the BSP, as the
BSP contains modifications to the resume sequence of various PM Domains.

v3:
  - Add Acked-by, Reviewed-by,
  - Add comment about not using timekeeping, and its impact,

v2:
  - New.
---
 include/linux/iopoll.h | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

Comments

Zong Li March 26, 2024, 1:31 a.m. UTC | #1
On Fri, Jun 02, 2023 at 10:50:37AM +0200, Geert Uytterhoeven wrote:
> read_poll_timeout_atomic() uses ktime_get() to implement the timeout
> feature, just like its non-atomic counterpart.  However, there are
> several issues with this, due to its use in atomic contexts:
> 
>   1. When called in the s2ram path (as typically done by clock or PM
>      domain drivers), timekeeping may be suspended, triggering the
>      WARN_ON(timekeeping_suspended) in ktime_get():
> 
> 	WARNING: CPU: 0 PID: 654 at kernel/time/timekeeping.c:843 ktime_get+0x28/0x78
> 
>      Calling ktime_get_mono_fast_ns() instead of ktime_get() would get
>      rid of that warning.  However, that would break timeout handling,
>      as (at least on systems with an ARM architectured timer), the time
>      returned by ktime_get_mono_fast_ns() does not advance while
>      timekeeping is suspended.
>      Interestingly, (on the same ARM systems) the time returned by
>      ktime_get() does advance while timekeeping is suspended, despite
>      the warning.
> 
>   2. Depending on the actual clock source, and especially before a
>      high-resolution clocksource (e.g. the ARM architectured timer)
>      becomes available, time may not advance in atomic contexts, thus
>      breaking timeout handling.
> 
> Fix this by abandoning the idea that one can rely on timekeeping to
> implement timeout handling in all atomic contexts, and switch from a
> global time-based to a locally-estimated timeout handling.  In most
> (all?) cases the timeout condition is exceptional and an error
> condition, hence any additional delays due to underestimating wall clock
> time are irrelevant.
>

Hi Geert,
I tested this patch on the FPGA, and I noticed the timeout duration
was much longer than expected. I tested it by removing the op operation
and break condition for avoiding the influence of other factors.
The code would look like as follows:

for (;;) {
        if (__timeout_us && __left_ns < 0)
                break;
        if (__delay_us) {
                udelay(__delay_us);
                if (__timeout_us)
                        __left_ns -= __delay_ns;;
	cpu_relex();
        if (__timeout_us)
                __left_ns--;
        }
}

Despite setting the timeout to 1 second, it actually takes 25 seconds
to reach the specified timeout value. I displayed the value of
__left_ns when a timeout occurred. As follows: __delay_us is 1, when
__left_ns counts down to -1, the system has run for 25 seconds.

[   26.016213] __timeout_us: 1000000 __left_ns: -1
[   50.818585] __timeout_us: 1000000  __left_ns: -1
[   75.620467] __timeout_us: 1000000  __left_ns: -1
[  100.422664] __timeout_us: 1000000  __left_ns: -1
[  125.224775] __timeout_us: 1000000  __left_ns: -1
...

I attempted to blend the two versions (e.g., ktime version and the
current version) for discarding the value of __left_ns. The resulting
output is as follows: __delay_us is 1, when it exceeds 1 second
according to ktime, __left_ns only counts around 40 ms.

[    6.734482] __timeout_us: 1000000  __left_ns: 961699000
[    7.738485] __timeout_us: 1000000  __left_ns: 961228000
[    8.812797] __timeout_us: 1000000  __left_ns: 961755000
[    9.814021] __timeout_us: 1000000  __left_ns: 961542000
[   10.815373] __timeout_us: 1000000 __left_ns: 962464000
[   11.816184] __timeout_us: 1000000 __left_ns: 961536000
[   12.817137] __timeout_us: 1000000 __left_ns: 961121000
...

Per your suggestion, I attempted to increase delay_us to 10 us,
it really helps to eliminate the underestimation. The actual
timeout became 3 secs on the FPGA.

I moved on my host x86 machine, the timeout has been reduced to
2 seconds even if the delay_us is 1. And the timeout can be
precise 1 seconds when delay_us is 10. I'm not sure if the clock
frequency or RTC frequency might also determine the underestimation
of wall clock time? Is there a suggested value of delay_us for a
driver that runs on various platforms?
What is your perspective for those situation?

Thanks.

> Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
> Acked-by: Arnd Bergmann <arnd@arndb.de>
> Reviewed-by: Tony Lindgren <tony@atomide.com>
> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
> ---
> The first issue was seen with the rcar-sysc driver in the BSP, as the
> BSP contains modifications to the resume sequence of various PM Domains.
> 
> v3:
>   - Add Acked-by, Reviewed-by,
>   - Add comment about not using timekeeping, and its impact,
> 
> v2:
>   - New.
> ---
>  include/linux/iopoll.h | 22 +++++++++++++++++-----
>  1 file changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/iopoll.h b/include/linux/iopoll.h
> index 0417360a6db9b0d6..19a7b00baff43595 100644
> --- a/include/linux/iopoll.h
> +++ b/include/linux/iopoll.h
> @@ -74,6 +74,10 @@
>   * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
>   * case, the last read value at @args is stored in @val.
>   *
> + * This macro does not rely on timekeeping.  Hence it is safe to call even when
> + * timekeeping is suspended, at the expense of an underestimation of wall clock
> + * time, which is rather minimal with a non-zero delay_us.
> + *
>   * When available, you'll probably want to use one of the specialized
>   * macros defined below rather than this macro directly.
>   */
> @@ -81,22 +85,30 @@
>  					delay_before_read, args...) \
>  ({ \
>  	u64 __timeout_us = (timeout_us); \
> +	s64 __left_ns = __timeout_us * NSEC_PER_USEC; \
>  	unsigned long __delay_us = (delay_us); \
> -	ktime_t __timeout = ktime_add_us(ktime_get(), __timeout_us); \
> -	if (delay_before_read && __delay_us) \
> +	u64 __delay_ns = __delay_us * NSEC_PER_USEC; \
> +	if (delay_before_read && __delay_us) { \
>  		udelay(__delay_us); \
> +		if (__timeout_us) \
> +			__left_ns -= __delay_ns; \
> +	} \
>  	for (;;) { \
>  		(val) = op(args); \
>  		if (cond) \
>  			break; \
> -		if (__timeout_us && \
> -		    ktime_compare(ktime_get(), __timeout) > 0) { \
> +		if (__timeout_us && __left_ns < 0) { \
>  			(val) = op(args); \
>  			break; \
>  		} \
> -		if (__delay_us) \
> +		if (__delay_us) { \
>  			udelay(__delay_us); \
> +			if (__timeout_us) \
> +				__left_ns -= __delay_ns; \
> +		} \
>  		cpu_relax(); \
> +		if (__timeout_us) \
> +			__left_ns--; \
>  	} \
>  	(cond) ? 0 : -ETIMEDOUT; \
>  })
> -- 
> 2.34.1
>
Geert Uytterhoeven April 18, 2024, 9:20 a.m. UTC | #2
Hi Zong,

On Tue, Mar 26, 2024 at 2:31 AM Zong Li <zong.li@sifive.com> wrote:
> On Fri, Jun 02, 2023 at 10:50:37AM +0200, Geert Uytterhoeven wrote:
> > read_poll_timeout_atomic() uses ktime_get() to implement the timeout
> > feature, just like its non-atomic counterpart.  However, there are
> > several issues with this, due to its use in atomic contexts:
> >
> >   1. When called in the s2ram path (as typically done by clock or PM
> >      domain drivers), timekeeping may be suspended, triggering the
> >      WARN_ON(timekeeping_suspended) in ktime_get():
> >
> >       WARNING: CPU: 0 PID: 654 at kernel/time/timekeeping.c:843 ktime_get+0x28/0x78
> >
> >      Calling ktime_get_mono_fast_ns() instead of ktime_get() would get
> >      rid of that warning.  However, that would break timeout handling,
> >      as (at least on systems with an ARM architectured timer), the time
> >      returned by ktime_get_mono_fast_ns() does not advance while
> >      timekeeping is suspended.
> >      Interestingly, (on the same ARM systems) the time returned by
> >      ktime_get() does advance while timekeeping is suspended, despite
> >      the warning.
> >
> >   2. Depending on the actual clock source, and especially before a
> >      high-resolution clocksource (e.g. the ARM architectured timer)
> >      becomes available, time may not advance in atomic contexts, thus
> >      breaking timeout handling.
> >
> > Fix this by abandoning the idea that one can rely on timekeeping to
> > implement timeout handling in all atomic contexts, and switch from a
> > global time-based to a locally-estimated timeout handling.  In most
> > (all?) cases the timeout condition is exceptional and an error
> > condition, hence any additional delays due to underestimating wall clock
> > time are irrelevant.
> >
>
> Hi Geert,
> I tested this patch on the FPGA, and I noticed the timeout duration
> was much longer than expected. I tested it by removing the op operation
> and break condition for avoiding the influence of other factors.
> The code would look like as follows:
>
> for (;;) {
>         if (__timeout_us && __left_ns < 0)
>                 break;
>         if (__delay_us) {
>                 udelay(__delay_us);
>                 if (__timeout_us)
>                         __left_ns -= __delay_ns;;
>         cpu_relex();
>         if (__timeout_us)
>                 __left_ns--;
>         }
> }
>
> Despite setting the timeout to 1 second, it actually takes 25 seconds
> to reach the specified timeout value. I displayed the value of
> __left_ns when a timeout occurred. As follows: __delay_us is 1, when
> __left_ns counts down to -1, the system has run for 25 seconds.
>
> [   26.016213] __timeout_us: 1000000 __left_ns: -1
> [   50.818585] __timeout_us: 1000000  __left_ns: -1
> [   75.620467] __timeout_us: 1000000  __left_ns: -1
> [  100.422664] __timeout_us: 1000000  __left_ns: -1
> [  125.224775] __timeout_us: 1000000  __left_ns: -1
> ...
>
> I attempted to blend the two versions (e.g., ktime version and the
> current version) for discarding the value of __left_ns. The resulting
> output is as follows: __delay_us is 1, when it exceeds 1 second
> according to ktime, __left_ns only counts around 40 ms.
>
> [    6.734482] __timeout_us: 1000000  __left_ns: 961699000
> [    7.738485] __timeout_us: 1000000  __left_ns: 961228000
> [    8.812797] __timeout_us: 1000000  __left_ns: 961755000
> [    9.814021] __timeout_us: 1000000  __left_ns: 961542000
> [   10.815373] __timeout_us: 1000000 __left_ns: 962464000
> [   11.816184] __timeout_us: 1000000 __left_ns: 961536000
> [   12.817137] __timeout_us: 1000000 __left_ns: 961121000
> ...
>
> Per your suggestion, I attempted to increase delay_us to 10 us,
> it really helps to eliminate the underestimation. The actual
> timeout became 3 secs on the FPGA.
>
> I moved on my host x86 machine, the timeout has been reduced to
> 2 seconds even if the delay_us is 1. And the timeout can be
> precise 1 seconds when delay_us is 10. I'm not sure if the clock
> frequency or RTC frequency might also determine the underestimation
> of wall clock time? Is there a suggested value of delay_us for a
> driver that runs on various platforms?
> What is your perspective for those situation?

RTC frequency does not impact the timeout, as the macro no longer
relies on timekeeping.
CPU clock frequency does impact the timeout, especially when op()
executes lots of instructions.  The code assumes op() takes 1 ns,
which is a conservative value to prevent overestimation of wall clock
time.  This assumes that such an overestimation (i.e. timing out too
early) is much worse than an underestimation (i.e. timing out too late).
If op() takes much more time than 1 ns, or even more time than delay_us,
the effective timeout will be much larger than expected.

I can only suggest to use a reasonable value for delay_us, i.e. a value
that is (sufficiently) larger than the expected execution time of op().

I hope this helps.

> > --- a/include/linux/iopoll.h
> > +++ b/include/linux/iopoll.h
> > @@ -74,6 +74,10 @@
> >   * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
> >   * case, the last read value at @args is stored in @val.
> >   *
> > + * This macro does not rely on timekeeping.  Hence it is safe to call even when
> > + * timekeeping is suspended, at the expense of an underestimation of wall clock
> > + * time, which is rather minimal with a non-zero delay_us.
> > + *
> >   * When available, you'll probably want to use one of the specialized
> >   * macros defined below rather than this macro directly.
> >   */
> > @@ -81,22 +85,30 @@
> >                                       delay_before_read, args...) \
> >  ({ \
> >       u64 __timeout_us = (timeout_us); \
> > +     s64 __left_ns = __timeout_us * NSEC_PER_USEC; \
> >       unsigned long __delay_us = (delay_us); \
> > -     ktime_t __timeout = ktime_add_us(ktime_get(), __timeout_us); \
> > -     if (delay_before_read && __delay_us) \
> > +     u64 __delay_ns = __delay_us * NSEC_PER_USEC; \
> > +     if (delay_before_read && __delay_us) { \
> >               udelay(__delay_us); \
> > +             if (__timeout_us) \
> > +                     __left_ns -= __delay_ns; \
> > +     } \
> >       for (;;) { \
> >               (val) = op(args); \
> >               if (cond) \
> >                       break; \
> > -             if (__timeout_us && \
> > -                 ktime_compare(ktime_get(), __timeout) > 0) { \
> > +             if (__timeout_us && __left_ns < 0) { \
> >                       (val) = op(args); \
> >                       break; \
> >               } \
> > -             if (__delay_us) \
> > +             if (__delay_us) { \
> >                       udelay(__delay_us); \
> > +                     if (__timeout_us) \
> > +                             __left_ns -= __delay_ns; \
> > +             } \
> >               cpu_relax(); \
> > +             if (__timeout_us) \
> > +                     __left_ns--; \
> >       } \
> >       (cond) ? 0 : -ETIMEDOUT; \
> >  })

Gr{oetje,eeting}s,

                        Geert
diff mbox series

Patch

diff --git a/include/linux/iopoll.h b/include/linux/iopoll.h
index 0417360a6db9b0d6..19a7b00baff43595 100644
--- a/include/linux/iopoll.h
+++ b/include/linux/iopoll.h
@@ -74,6 +74,10 @@ 
  * Returns 0 on success and -ETIMEDOUT upon a timeout. In either
  * case, the last read value at @args is stored in @val.
  *
+ * This macro does not rely on timekeeping.  Hence it is safe to call even when
+ * timekeeping is suspended, at the expense of an underestimation of wall clock
+ * time, which is rather minimal with a non-zero delay_us.
+ *
  * When available, you'll probably want to use one of the specialized
  * macros defined below rather than this macro directly.
  */
@@ -81,22 +85,30 @@ 
 					delay_before_read, args...) \
 ({ \
 	u64 __timeout_us = (timeout_us); \
+	s64 __left_ns = __timeout_us * NSEC_PER_USEC; \
 	unsigned long __delay_us = (delay_us); \
-	ktime_t __timeout = ktime_add_us(ktime_get(), __timeout_us); \
-	if (delay_before_read && __delay_us) \
+	u64 __delay_ns = __delay_us * NSEC_PER_USEC; \
+	if (delay_before_read && __delay_us) { \
 		udelay(__delay_us); \
+		if (__timeout_us) \
+			__left_ns -= __delay_ns; \
+	} \
 	for (;;) { \
 		(val) = op(args); \
 		if (cond) \
 			break; \
-		if (__timeout_us && \
-		    ktime_compare(ktime_get(), __timeout) > 0) { \
+		if (__timeout_us && __left_ns < 0) { \
 			(val) = op(args); \
 			break; \
 		} \
-		if (__delay_us) \
+		if (__delay_us) { \
 			udelay(__delay_us); \
+			if (__timeout_us) \
+				__left_ns -= __delay_ns; \
+		} \
 		cpu_relax(); \
+		if (__timeout_us) \
+			__left_ns--; \
 	} \
 	(cond) ? 0 : -ETIMEDOUT; \
 })