
[1/5] asm-generic: add smp_vcond_load_relaxed()

Message ID 20241105183041.1531976-2-harisokn@amazon.com (mailing list archive)
State New
Headers show
Series [1/5] asm-generic: add smp_vcond_load_relaxed()

Commit Message

Okanovic, Haris Nov. 5, 2024, 6:30 p.m. UTC
Relaxed poll until desired mask/value is observed at the specified
address or timeout.

This macro is a specialization of the generic smp_cond_load_relaxed(),
which takes a simple mask/value condition (vcond) instead of an
arbitrary expression. It allows architectures to better specialize the
implementation, e.g. to enable wfe() polling of the address on arm.
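
For illustration, a hypothetical call site could look like the sketch
below (lockp and the 10us timeout are made-up example values, not taken
from this series):

```
/*
 * Wait for bit 0 of *lockp to clear, or give up after 10us; the last
 * observed value is returned either way.
 */
u32 seen = smp_vcond_load_relaxed(10 * NSEC_PER_USEC, lockp, 0x1, 0x0);
if (seen & 0x1)
	return -ETIMEDOUT;	/* timed out, bit 0 still set */
```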

Signed-off-by: Haris Okanovic <harisokn@amazon.com>
---
 include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

Comments

Christoph Lameter (Ampere) Nov. 5, 2024, 7:36 p.m. UTC | #1
On Tue, 5 Nov 2024, Haris Okanovic wrote:

> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds

Please use an absolute time in nsecs instead of a timeout. You do not know
what will happen to your execution thread until the local_clock_noinstr()
is run.
Catalin Marinas Nov. 6, 2024, 11:08 a.m. UTC | #2
On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..112027eabbfc 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -256,6 +256,31 @@ do {									\
>  })
>  #endif
>  
> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds

FWIW, I don't mind the relative timeout, it makes the API easier to use.
Yes, it may take longer in absolute time if the thread is scheduled out
before local_clock_noinstr() is read but the same can happen in the
caller anyway. It's similar to udelay(), it can take longer if the
thread is scheduled out.

> + * @addr: pointer to an integer
> + * @mask: a bit mask applied to read values
> + * @val: Expected value with mask
> + */
> +#ifndef smp_vcond_load_relaxed
> +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({	\
> +	const u64 __start = local_clock_noinstr();		\
> +	u64 __nsecs = (nsecs);					\
> +	typeof(addr) __addr = (addr);				\
> +	typeof(*__addr) __mask = (mask);			\
> +	typeof(*__addr) __val = (val);				\
> +	typeof(*__addr) __cur;					\
> +	smp_cond_load_relaxed(__addr, (				\
> +		(VAL & __mask) == __val ||			\
> +		local_clock_noinstr() - __start > __nsecs	\
> +	));							\
> +})

The generic implementation has the same problem as Ankur's current
series. smp_cond_load_relaxed() can't wait on anything other than the
variable at __addr. If it goes into a WFE, there's nothing executed to
read the timer and check for progress. Any generic implementation of
such function would have to use cpu_relax() and polling.
Will Deacon Nov. 6, 2024, 11:39 a.m. UTC | #3
On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> Relaxed poll until desired mask/value is observed at the specified
> address or timeout.
> 
> This macro is a specialization of the generic smp_cond_load_relaxed(),
> which takes a simple mask/value condition (vcond) instead of an
> arbitrary expression. It allows architectures to better specialize the
> implementation, e.g. to enable wfe() polling of the address on arm.

This doesn't make sense to me. The existing smp_cond_load() functions
already use wfe on arm64 and I don't see why we need a special helper
just to do a mask.

> Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> ---
>  include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index d4f581c1e21d..112027eabbfc 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -256,6 +256,31 @@ do {									\
>  })
>  #endif
>  
> +/**
> + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> + * `nsecs` elapse, and returns the last observed `*addr` value.
> + *
> + * @nsecs: timeout in nanoseconds
> + * @addr: pointer to an integer
> + * @mask: a bit mask applied to read values
> + * @val: Expected value with mask
> + */
> +#ifndef smp_vcond_load_relaxed

I know naming is hard, but "vcond" is especially terrible.
Perhaps smp_cond_load_timeout()?

Will
Okanovic, Haris Nov. 6, 2024, 5:06 p.m. UTC | #4
On Tue, 2024-11-05 at 11:36 -0800, Christoph Lameter (Ampere) wrote:
> On Tue, 5 Nov 2024, Haris Okanovic wrote:
> 
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
> 
> Please use an absolute time in nsecs instead of a timeout.

I went with relative time because it's clock agnostic. I agree a
deadline is nicer because it can propagate down through layers of
functions, but it pins the caller to a single time base.

> You do not know
> what will happen to your execution thread until the local_clock_noinstr()
> is run.


Not sure what you mean. Could you perhaps give an example where it
would break?

> 
> 

One alternative is to do timekeeping with delay() in all cases, to
decouple from sched/clock.
Okanovic, Haris Nov. 6, 2024, 5:18 p.m. UTC | #5
On Wed, 2024-11-06 at 11:39 +0000, Will Deacon wrote:
> On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > Relaxed poll until desired mask/value is observed at the specified
> > address or timeout.
> > 
> > This macro is a specialization of the generic smp_cond_load_relaxed(),
> > which takes a simple mask/value condition (vcond) instead of an
> > arbitrary expression. It allows architectures to better specialize the
> > implementation, e.g. to enable wfe() polling of the address on arm.
> 
> This doesn't make sense to me. The existing smp_cond_load() functions
> already use wfe on arm64 and I don't see why we need a special helper
> just to do a mask.

We can't turn an arbitrary C expression, which is what the existing
smp_cond_load() takes as its condition, into a wfe()/wfet() exit
condition. This API is therefore more amenable to hardware acceleration.
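
To make the distinction concrete, a rough sketch of the two call styles
(flags, MY_MASK, MY_VAL, timed_out() and timeout_ns are illustrative
names only):

```
/* Existing API: the condition is an arbitrary C expression, opaque to
 * the architecture, so it cannot become a wfe()/wfet() exit condition. */
v = smp_cond_load_relaxed(&flags,
			  ((VAL & MY_MASK) == MY_VAL || timed_out()));

/* Proposed API: a plain mask/value pair the architecture can encode
 * into a hardware wait. */
v = smp_vcond_load_relaxed(timeout_ns, &flags, MY_MASK, MY_VAL);
```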

> 
> > Signed-off-by: Haris Okanovic <harisokn@amazon.com>
> > ---
> >  include/asm-generic/barrier.h | 25 +++++++++++++++++++++++++
> >  1 file changed, 25 insertions(+)
> > 
> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > index d4f581c1e21d..112027eabbfc 100644
> > --- a/include/asm-generic/barrier.h
> > +++ b/include/asm-generic/barrier.h
> > @@ -256,6 +256,31 @@ do {                                                                     \
> >  })
> >  #endif
> > 
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
> > + * @addr: pointer to an integer
> > + * @mask: a bit mask applied to read values
> > + * @val: Expected value with mask
> > + */
> > +#ifndef smp_vcond_load_relaxed
> 
> I know naming is hard, but "vcond" is especially terrible.
> Perhaps smp_cond_load_timeout()?

I agree, naming is hard! I was trying to differentiate it from
smp_cond_load() in some meaningful way - that one takes an "expression"
condition, this one a "value" condition.

I'll think it over a bit more.

> 
> Will
Okanovic, Haris Nov. 6, 2024, 6:13 p.m. UTC | #6
On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > index d4f581c1e21d..112027eabbfc 100644
> > --- a/include/asm-generic/barrier.h
> > +++ b/include/asm-generic/barrier.h
> > @@ -256,6 +256,31 @@ do {                                                                     \
> >  })
> >  #endif
> > 
> > +/**
> > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > + *
> > + * @nsecs: timeout in nanoseconds
> 
> FWIW, I don't mind the relative timeout, it makes the API easier to use.
> Yes, it may take longer in absolute time if the thread is scheduled out
> before local_clock_noinstr() is read but the same can happen in the
> caller anyway. It's similar to udelay(), it can take longer if the
> thread is scheduled out.
> 
> > + * @addr: pointer to an integer
> > + * @mask: a bit mask applied to read values
> > + * @val: Expected value with mask
> > + */
> > +#ifndef smp_vcond_load_relaxed
> > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({    \
> > +     const u64 __start = local_clock_noinstr();              \
> > +     u64 __nsecs = (nsecs);                                  \
> > +     typeof(addr) __addr = (addr);                           \
> > +     typeof(*__addr) __mask = (mask);                        \
> > +     typeof(*__addr) __val = (val);                          \
> > +     typeof(*__addr) __cur;                                  \
> > +     smp_cond_load_relaxed(__addr, (                         \
> > +             (VAL & __mask) == __val ||                      \
> > +             local_clock_noinstr() - __start > __nsecs       \
> > +     ));                                                     \
> > +})
> 
> The generic implementation has the same problem as Ankur's current
> series. smp_cond_load_relaxed() can't wait on anything other than the
> variable at __addr. If it goes into a WFE, there's nothing executed to
> read the timer and check for progress. Any generic implementation of
> such function would have to use cpu_relax() and polling.

How would the caller enter wfe()? Can you give a specific scenario that
you're concerned about?

This code already reduces to a relaxed poll, something like this:

```
start = clock();
while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
  cpu_relax();
}
```

> 
> --
> Catalin
Catalin Marinas Nov. 6, 2024, 7:55 p.m. UTC | #7
On Wed, Nov 06, 2024 at 06:13:35PM +0000, Okanovic, Haris wrote:
> On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> > On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > > index d4f581c1e21d..112027eabbfc 100644
> > > --- a/include/asm-generic/barrier.h
> > > +++ b/include/asm-generic/barrier.h
> > > @@ -256,6 +256,31 @@ do {                                                                     \
> > >  })
> > >  #endif
> > > 
> > > +/**
> > > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > > + *
> > > + * @nsecs: timeout in nanoseconds
> > 
> > FWIW, I don't mind the relative timeout, it makes the API easier to use.
> > Yes, it may take longer in absolute time if the thread is scheduled out
> > before local_clock_noinstr() is read but the same can happen in the
> > caller anyway. It's similar to udelay(), it can take longer if the
> > thread is scheduled out.
> > 
> > > + * @addr: pointer to an integer
> > > + * @mask: a bit mask applied to read values
> > > + * @val: Expected value with mask
> > > + */
> > > +#ifndef smp_vcond_load_relaxed
> > > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({    \
> > > +     const u64 __start = local_clock_noinstr();              \
> > > +     u64 __nsecs = (nsecs);                                  \
> > > +     typeof(addr) __addr = (addr);                           \
> > > +     typeof(*__addr) __mask = (mask);                        \
> > > +     typeof(*__addr) __val = (val);                          \
> > > +     typeof(*__addr) __cur;                                  \
> > > +     smp_cond_load_relaxed(__addr, (                         \
> > > +             (VAL & __mask) == __val ||                      \
> > > +             local_clock_noinstr() - __start > __nsecs       \
> > > +     ));                                                     \
> > > +})
> > 
> > The generic implementation has the same problem as Ankur's current
> > series. smp_cond_load_relaxed() can't wait on anything other than the
> > variable at __addr. If it goes into a WFE, there's nothing executed to
> > read the timer and check for progress. Any generic implementation of
> > such function would have to use cpu_relax() and polling.
> 
> How would the caller enter wfe()? Can you give a specific scenario that
> you're concerned about?

Let's take the arm64 example with the event stream disabled. Without the
subsequent patches implementing smp_vcond_load_relaxed(), just expand
the arm64 smp_cond_load_relaxed() implementation in the above macro. If
the timer check doesn't trigger an exit from the loop,
__cmpwait_relaxed() only waits on the variable to change its value,
nothing to do with the timer.
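
Roughly, the generic macro expanded with arm64's smp_cond_load_relaxed()
behaves like this simplified sketch:

```
for (;;) {
	cur = READ_ONCE(*__addr);
	if ((cur & __mask) == __val ||
	    local_clock_noinstr() - __start > __nsecs)
		break;
	/*
	 * WFE: wakes only when *__addr changes (or on the event
	 * stream, if enabled); nothing re-reads the clock while
	 * waiting, so the timeout is never noticed.
	 */
	__cmpwait_relaxed(__addr, cur);
}
```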

> This code already reduces to a relaxed poll, something like this:
> 
> ```
> start = clock();
> while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
>   cpu_relax();
> }
> ```

Well, that's only if you also use the generic implementation of
smp_cond_load_relaxed(), but have you checked all the other architectures
that don't do something similar to the arm64 wfe (riscv comes close)?
Even if all other architectures just use cpu_relax(), that's still
abusing the smp_cond_load_relaxed() semantics. And what if one places
another loop in their __cmpwait()? That's allowed, because you are
supposed to wait on a single variable to change, not on multiple states.
Okanovic, Haris Nov. 6, 2024, 8:31 p.m. UTC | #8
On Wed, 2024-11-06 at 19:55 +0000, Catalin Marinas wrote:
> On Wed, Nov 06, 2024 at 06:13:35PM +0000, Okanovic, Haris wrote:
> > On Wed, 2024-11-06 at 11:08 +0000, Catalin Marinas wrote:
> > > On Tue, Nov 05, 2024 at 12:30:37PM -0600, Haris Okanovic wrote:
> > > > diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> > > > index d4f581c1e21d..112027eabbfc 100644
> > > > --- a/include/asm-generic/barrier.h
> > > > +++ b/include/asm-generic/barrier.h
> > > > @@ -256,6 +256,31 @@ do {                                                                     \
> > > >  })
> > > >  #endif
> > > > 
> > > > +/**
> > > > + * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
> > > > + * with no ordering guarantees. Spins until `(*addr & mask) == val` or
> > > > + * `nsecs` elapse, and returns the last observed `*addr` value.
> > > > + *
> > > > + * @nsecs: timeout in nanoseconds
> > > 
> > > FWIW, I don't mind the relative timeout, it makes the API easier to use.
> > > Yes, it may take longer in absolute time if the thread is scheduled out
> > > before local_clock_noinstr() is read but the same can happen in the
> > > caller anyway. It's similar to udelay(), it can take longer if the
> > > thread is scheduled out.
> > > 
> > > > + * @addr: pointer to an integer
> > > > + * @mask: a bit mask applied to read values
> > > > + * @val: Expected value with mask
> > > > + */
> > > > +#ifndef smp_vcond_load_relaxed
> > > > +#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({    \
> > > > +     const u64 __start = local_clock_noinstr();              \
> > > > +     u64 __nsecs = (nsecs);                                  \
> > > > +     typeof(addr) __addr = (addr);                           \
> > > > +     typeof(*__addr) __mask = (mask);                        \
> > > > +     typeof(*__addr) __val = (val);                          \
> > > > +     typeof(*__addr) __cur;                                  \
> > > > +     smp_cond_load_relaxed(__addr, (                         \
> > > > +             (VAL & __mask) == __val ||                      \
> > > > +             local_clock_noinstr() - __start > __nsecs       \
> > > > +     ));                                                     \
> > > > +})
> > > 
> > > The generic implementation has the same problem as Ankur's current
> > > series. smp_cond_load_relaxed() can't wait on anything other than the
> > > variable at __addr. If it goes into a WFE, there's nothing executed to
> > > read the timer and check for progress. Any generic implementation of
> > > such function would have to use cpu_relax() and polling.
> > 
> > How would the caller enter wfe()? Can you give a specific scenario that
> > you're concerned about?
> 
> Let's take the arm64 example with the event stream disabled. Without the
> subsequent patches implementing smp_vcond_load_relaxed(), just expand
> the arm64 smp_cond_load_relaxed() implementation in the above macro. If
> the timer check doesn't trigger an exit from the loop,
> __cmpwait_relaxed() only waits on the variable to change its value,
> nothing to do with the timer.
> 
> > This code already reduces to a relaxed poll, something like this:
> > 
> > ```
> > start = clock();
> > while((READ_ONCE(*addr) & mask) != val && (clock() - start) < nsecs) {
> >   cpu_relax();
> > }
> > ```
> 
> Well, that's if you also use the generic implementation of
> smp_cond_load_relaxed() but have you checked all the other architectures
> that don't do something similar to the arm64 wfe (riscv comes close)?
> Even if all other architectures just use a cpu_relax(), that's still
> abusing the smp_cond_load_relaxed() semantics. And what if one places
> another loop in their __cmpwait()? That's allowed because you are
> supposed to wait on a single variable to change not on multiple states.

I see what you mean now - I glossed over the use of __cmpwait_relaxed()
in smp_cond_load_relaxed(). I'll post another rev with the fix, similar
to the above "reduced" code.
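
A minimal sketch of what that fixed generic fallback could look like
(an illustration of the cpu_relax()-based approach, not the actual
follow-up patch):

```
#ifndef smp_vcond_load_relaxed
#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({	\
	const u64 __start = local_clock_noinstr();		\
	u64 __nsecs = (nsecs);					\
	typeof(addr) __addr = (addr);				\
	typeof(*__addr) __mask = (mask);			\
	typeof(*__addr) __val = (val);				\
	typeof(*__addr) __cur;					\
	for (;;) {						\
		__cur = READ_ONCE(*__addr);			\
		if ((__cur & __mask) == __val ||		\
		    local_clock_noinstr() - __start > __nsecs)	\
			break;					\
		cpu_relax();					\
	}							\
	__cur;							\
})
#endif
```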

> 
> --
> Catalin

Patch

diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index d4f581c1e21d..112027eabbfc 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -256,6 +256,31 @@  do {									\
 })
 #endif
 
+/**
+ * smp_vcond_load_relaxed() - (Spin) wait until an expected value at address
+ * with no ordering guarantees. Spins until `(*addr & mask) == val` or
+ * `nsecs` elapse, and returns the last observed `*addr` value.
+ *
+ * @nsecs: timeout in nanoseconds
+ * @addr: pointer to an integer
+ * @mask: a bit mask applied to read values
+ * @val: Expected value with mask
+ */
+#ifndef smp_vcond_load_relaxed
+#define smp_vcond_load_relaxed(nsecs, addr, mask, val) ({	\
+	const u64 __start = local_clock_noinstr();		\
+	u64 __nsecs = (nsecs);					\
+	typeof(addr) __addr = (addr);				\
+	typeof(*__addr) __mask = (mask);			\
+	typeof(*__addr) __val = (val);				\
+	typeof(*__addr) __cur;					\
+	smp_cond_load_relaxed(__addr, (				\
+		(VAL & __mask) == __val ||			\
+		local_clock_noinstr() - __start > __nsecs	\
+	));							\
+})
+#endif
+
 /**
  * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
  * @ptr: pointer to the variable to wait on