Message ID | 20241107190818.522639-2-ankur.a.arora@oracle.com (mailing list archive) |
---|---|
State | New |
Series | arm64: support poll_idle() |
On Thu, 7 Nov 2024, Ankur Arora wrote:

> +#ifndef smp_cond_time_check_count
> +/*
> + * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
> + * This helps reduce the number of instructions executed while spin-waiting.
> + */
> +#define smp_cond_time_check_count 200
> +#endif

I don't like these loops that execute differently depending on the
hardware. Can we use cycles and ns instead to have defined periods of
time? Later patches establish the infrastructure to convert cycles to
nanoseconds and microseconds. Use that?

> +#ifndef smp_cond_load_relaxed_timeout
> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
> +				      time_limit_ns) ({			\
> +	typeof(ptr) __PTR = (ptr);					\
> +	__unqual_scalar_typeof(*ptr) VAL;				\
> +	unsigned int __count = 0;					\
> +	for (;;) {							\
> +		VAL = READ_ONCE(*__PTR);				\
> +		if (cond_expr)						\
> +			break;						\
> +		cpu_relax();						\
> +		if (__count++ < smp_cond_time_check_count)		\
> +			continue;					\
> +		if ((time_expr_ns) >= time_limit_ns)			\
> +			break;						\

Calling the clock retrieval function repeatedly should be fine and is
typically done in user space as well as in kernel space for functions
that need to wait short time periods.
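(For illustration, the kind of change being asked for might look roughly
like the sketch below: bound the interval between time_expr_ns
evaluations in time rather than in loop iterations, so the check period
does not depend on how fast a given CPU runs the cpu_relax() loop.
SMP_COND_CHECK_INTERVAL_NS and ns_to_cycles() are hypothetical
placeholders for the cycles/ns conversion infrastructure the later
patches are said to add; get_cycles() is the usual arch cycle counter.)

#define SMP_COND_CHECK_INTERVAL_NS	1000	/* hypothetical: re-check about every 1us */

#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
				      time_limit_ns) ({			\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	/* hypothetical ns_to_cycles() helper; evaluated once up front */ \
	u64 __interval = ns_to_cycles(SMP_COND_CHECK_INTERVAL_NS);	\
	u64 __prev = get_cycles();					\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
		/* cheap raw cycle-counter read on every pass */	\
		if (get_cycles() - __prev < __interval)			\
			continue;					\
		/* interval elapsed: evaluate the (possibly costly) clock expression */ \
		if ((time_expr_ns) >= time_limit_ns)			\
			break;						\
		__prev = get_cycles();					\
	}								\
	(typeof(*ptr))VAL;						\
})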
Christoph Lameter (Ampere) <cl@gentwo.org> writes:

> On Thu, 7 Nov 2024, Ankur Arora wrote:
>
>> +#ifndef smp_cond_time_check_count
>> +/*
>> + * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
>> + * This helps reduce the number of instructions executed while spin-waiting.
>> + */
>> +#define smp_cond_time_check_count 200
>> +#endif
>
> I don't like these loops that execute differently depending on the
> hardware. Can we use cycles and ns instead to have defined periods of
> time? Later patches establish the infrastructure to convert cycles to
> nanoseconds and microseconds. Use that?
>
>> +#ifndef smp_cond_load_relaxed_timeout
>> +#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
>> +				      time_limit_ns) ({			\
>> +	typeof(ptr) __PTR = (ptr);					\
>> +	__unqual_scalar_typeof(*ptr) VAL;				\
>> +	unsigned int __count = 0;					\
>> +	for (;;) {							\
>> +		VAL = READ_ONCE(*__PTR);				\
>> +		if (cond_expr)						\
>> +			break;						\
>> +		cpu_relax();						\
>> +		if (__count++ < smp_cond_time_check_count)		\
>> +			continue;					\
>> +		if ((time_expr_ns) >= time_limit_ns)			\
>> +			break;						\
>
> Calling the clock retrieval function repeatedly should be fine and is
> typically done in user space as well as in kernel space for functions
> that need to wait short time periods.

The problem is that you might have multiple CPUs polling in idle for
prolonged periods of time, and so you want to minimize your
power/thermal envelope.

For instance, see commit 4dc2375c1a4e "cpuidle: poll_state: Avoid
invoking local_clock() too often", which originally added a similar
rate limit to poll_idle() where they saw exactly that issue.

--
ankur
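(For reference, the rate limit that commit added to poll_idle() in
drivers/cpuidle/poll_state.c has roughly the shape below; this is a
simplified paraphrase, not the verbatim kernel source.)

#define POLL_IDLE_TIME_LIMIT	(TICK_NSEC / 16)
#define POLL_IDLE_RELAX_COUNT	200

	/* inner loop of poll_idle(); time_start = local_clock() taken earlier */
	unsigned int loop_count = 0;

	while (!need_resched()) {
		cpu_relax();
		if (loop_count++ < POLL_IDLE_RELAX_COUNT)
			continue;

		/* only pay for local_clock() once every 200 relaxed spins */
		loop_count = 0;
		if (local_clock() - time_start > POLL_IDLE_TIME_LIMIT) {
			dev->poll_time_limit = true;
			break;
		}
	}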
On Thu, 7 Nov 2024, Ankur Arora wrote:

>> Calling the clock retrieval function repeatedly should be fine and is
>> typically done in user space as well as in kernel space for functions
>> that need to wait short time periods.
>
> The problem is that you might have multiple CPUs polling in idle for
> prolonged periods of time, and so you want to minimize your
> power/thermal envelope.

On ARM, the cpu_relax() in that loop maps to YIELD, which does not do
anything for the power envelope AFAICT. It switches to the other
hyperthread.

> For instance, see commit 4dc2375c1a4e "cpuidle: poll_state: Avoid
> invoking local_clock() too often", which originally added a similar
> rate limit to poll_idle() where they saw exactly that issue.

Looping without calling local_clock() may increase the wait period,
etc. For power saving, most arches have special instructions like
ARM's WFE/WFET. Wouldn't those give more accurate wait times than the
counting loop?
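(As a sketch of that direction: on arm64 the existing
smp_cond_load_relaxed() already parks the CPU in WFE via
__cmpwait_relaxed(), waking when the monitored cacheline changes or the
event stream ticks. A timeout-capable variant could plausibly be
structured as below; this is an illustration of the idea, not the patch
from this series.)

#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
				      time_limit_ns)			\
({									\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		/* WFE: sleep until the cacheline changes or the event stream fires */ \
		__cmpwait_relaxed(__PTR, VAL);				\
		/* wakeups are already rate limited, so check the time on each one */ \
		if ((time_expr_ns) >= time_limit_ns)			\
			break;						\
	}								\
	(typeof(*ptr))VAL;						\
})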
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..77726ef807e4 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -273,6 +273,48 @@ do {									\
 })
 #endif
 
+#ifndef smp_cond_time_check_count
+/*
+ * Limit how often smp_cond_load_relaxed_timeout() evaluates time_expr_ns.
+ * This helps reduce the number of instructions executed while spin-waiting.
+ */
+#define smp_cond_time_check_count 200
+#endif
+
+/**
+ * smp_cond_load_relaxed_timeout() - (Spin) wait for cond with no ordering
+ * guarantees until a timeout expires.
+ * @ptr: pointer to the variable to wait on
+ * @cond: boolean expression to wait for
+ * @time_expr_ns: evaluates to the current time
+ * @time_limit_ns: compared against time_expr_ns
+ *
+ * Equivalent to using READ_ONCE() on the condition variable.
+ *
+ * Due to C lacking lambda expressions we load the value of *ptr into a
+ * pre-named variable @VAL to be used in @cond.
+ */
+#ifndef smp_cond_load_relaxed_timeout
+#define smp_cond_load_relaxed_timeout(ptr, cond_expr, time_expr_ns,	\
+				      time_limit_ns) ({			\
+	typeof(ptr) __PTR = (ptr);					\
+	__unqual_scalar_typeof(*ptr) VAL;				\
+	unsigned int __count = 0;					\
+	for (;;) {							\
+		VAL = READ_ONCE(*__PTR);				\
+		if (cond_expr)						\
+			break;						\
+		cpu_relax();						\
+		if (__count++ < smp_cond_time_check_count)		\
+			continue;					\
+		if ((time_expr_ns) >= time_limit_ns)			\
+			break;						\
+		__count = 0;						\
+	}								\
+	(typeof(*ptr))VAL;						\
+})
+#endif
+
 /*
  * pmem_wmb() ensures that all stores for which the modification
  * are written to persistent storage by preceding instructions have
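(A hypothetical caller, to show how the new macro is meant to be used:
spin until a flag becomes non-zero or until local_clock() passes an
absolute deadline. VAL is the pre-named variable the macro loads the
flag value into for the condition check; wakeup_flag and the 50us
budget are made up for the example.)

	u64 deadline = local_clock() + 50 * NSEC_PER_USEC;	/* ~50us polling budget */
	unsigned long seen;

	seen = smp_cond_load_relaxed_timeout(&wakeup_flag, VAL != 0,
					     local_clock(), deadline);
	if (!seen)
		pr_debug("timed out waiting for wakeup_flag\n");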
Add a timed variant of smp_cond_load_relaxed().

This is useful because arm64 supports polling on a conditional
variable by directly waiting on the cacheline instead of spin waiting
for the condition to change. However, an implementation such as this
has a problem that it can block forever -- unless there's an explicit
timeout or another out-of-band mechanism which allows it to come out
of the wait state periodically.

smp_cond_load_relaxed_timeout() supports these semantics by specifying
a time-check expression and an associated time-limit.

However, note that for the generic spin-wait implementation we want to
minimize the number of instructions executed in each iteration. So,
limit how often we evaluate the time-check expression by doing it once
every smp_cond_time_check_count iterations.

The inner loop in poll_idle() has a substantially similar structure
and constraints to smp_cond_load_relaxed_timeout(), so define
smp_cond_time_check_count to the same value used in poll_idle().

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 include/asm-generic/barrier.h | 42 +++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)