Message ID | 20240813-seq_optimize-v1-1-84d57182e6a7@gentwo.org
State      | New
Series     | [RFC] Avoid memory barrier in read_seqcount() through load acquire
On 8/13/24 14:26, Christoph Lameter via B4 Relay wrote:
> From: "Christoph Lameter (Ampere)" <cl@gentwo.org>
>
> Some architectures support load acquire, which can save us a memory
> barrier and some cycles.
>
> A typical sequence
>
>	do {
>		seq = read_seqcount_begin(&s);
>		<something>
>	} while (read_seqcount_retry(&s, seq));
>
> requires 13 cycles on ARM64 for an empty loop. Two read memory barriers
> are needed, one for each of the seqcount_* functions.
>
> We can replace the first read barrier with a load acquire of
> the seqcount, which saves us one barrier.
>
> On ARM64 doing so reduces the cycle count from 13 to 8.
>
> Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
> ---
>  arch/Kconfig            |  5 +++++
>  arch/arm64/Kconfig      |  1 +
>  include/linux/seqlock.h | 41 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 47 insertions(+)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 975dd22a2dbd..3f8867110a57 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1600,6 +1600,11 @@ config ARCH_HAS_KERNEL_FPU_SUPPORT
>  	  Architectures that select this option can run floating-point code in
>  	  the kernel, as described in Documentation/core-api/floating-point.rst.
>
> +config ARCH_HAS_ACQUIRE_RELEASE
> +	bool
> +	help
> +	  Architectures that support acquire / release can avoid memory fences
> +
>  source "kernel/gcov/Kconfig"
>
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2f8ff354ca6..19e34fff145f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -39,6 +39,7 @@ config ARM64
>  	select ARCH_HAS_PTE_DEVMAP
>  	select ARCH_HAS_PTE_SPECIAL
>  	select ARCH_HAS_HW_PTE_YOUNG
> +	select ARCH_HAS_ACQUIRE_RELEASE
>  	select ARCH_HAS_SETUP_DMA_OPS
>  	select ARCH_HAS_SET_DIRECT_MAP
>  	select ARCH_HAS_SET_MEMORY

Do we need a new ARCH flag? I believe barrier APIs like smp_load_acquire()
will fall back to a full barrier on those arches that don't define their
own smp_load_acquire().

BTW, acquire/release can be considered memory barriers too. Maybe you are
talking about preferring acquire/release barriers over read/write barriers.
Right?

Cheers,
Longman
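[ For context: the loop in the commit message pairs with a writer that bumps
the sequence count around its update. A minimal sketch using the stock
seqcount API; `my_seq`, `shared_a` and `shared_b` are made-up names, not
from the patch, and writers must be serialized externally (e.g. by a
spinlock) with preemption disabled: ]

#include <linux/seqlock.h>

static seqcount_t my_seq = SEQCNT_ZERO(my_seq);
static u64 shared_a, shared_b;	/* protected by my_seq */

/* Writer: the counter is odd while the update is in flight and even
 * again once it completes. Callers must serialize writers. */
static void writer_update(u64 a, u64 b)
{
	write_seqcount_begin(&my_seq);
	shared_a = a;
	shared_b = b;
	write_seqcount_end(&my_seq);
}

/* Reader: retry until an even, unchanged sequence count is observed,
 * i.e. a snapshot no writer raced with. */
static void reader_snapshot(u64 *a, u64 *b)
{
	unsigned seq;

	do {
		seq = read_seqcount_begin(&my_seq);
		*a = shared_a;
		*b = shared_b;
	} while (read_seqcount_retry(&my_seq, seq));
}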
On Tue, 13 Aug 2024, Waiman Long wrote:

> Do we need a new ARCH flag? I believe barrier APIs like smp_load_acquire()
> will fall back to a full barrier on those arches that don't define their
> own smp_load_acquire().

Well, this is a load acquire instruction. The generic fallback for
smp_load_acquire() is:

#define __smp_load_acquire(p)						\
({									\
	__unqual_scalar_typeof(*p) ___p1 = READ_ONCE(*p);		\
	compiletime_assert_atomic_type(*p);				\
	__smp_mb();							\
	(typeof(*p))___p1;						\
})

So without native support, acquire semantics are emulated with a plain
load followed by a full barrier.

> BTW, acquire/release can be considered memory barriers too. Maybe you are
> talking about preferring acquire/release barriers over read/write barriers.
> Right?

A load acquire is a single load instruction that does not impose full
barrier semantics on the critical section, but still ensures that the
value is loaded before all following loads. Regular barriers do not have
this single load that anchors the critical section; they order all loads
around them instead.
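[ To make the contrast concrete, here are the two read-side patterns side
by side; a sketch only, with typical arm64 codegen noted in comments
(exact instructions depend on compiler and config): ]

#include <linux/compiler.h>
#include <asm/barrier.h>

/* Native acquire: a single instruction, no separate barrier.
 * arm64: ldar w0, [x0] */
static inline unsigned seq_load_native_acquire(const unsigned *seq)
{
	return smp_load_acquire(seq);
}

/* The barrier-based pattern the acquire load replaces.
 * arm64: ldr w0, [x0]; dmb ishld */
static inline unsigned seq_load_with_rmb(const unsigned *seq)
{
	unsigned s = READ_ONCE(*seq);

	smp_rmb();	/* order critical-section reads after the load */
	return s;
}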
On Tue, 13 Aug 2024 at 12:01, Waiman Long <longman@redhat.com> wrote:
>
> Do we need a new ARCH flag?

I'm confused by that question.

That's clearly exactly what ARCH_HAS_ACQUIRE_RELEASE is.

Obviously all architectures "have" it - in the sense that we always
have access to smp_load_acquire()/smp_store_release().

But if the architecture doesn't support it natively, the old rmb/wmb
model may be preferred.

Although maybe we're at the point where we don't even care about that.

            Linus
On 8/13/24 15:48, Linus Torvalds wrote:
> On Tue, 13 Aug 2024 at 12:01, Waiman Long <longman@redhat.com> wrote:
>> Do we need a new ARCH flag?
>
> I'm confused by that question.
>
> That's clearly exactly what ARCH_HAS_ACQUIRE_RELEASE is.
>
> Obviously all architectures "have" it - in the sense that we always
> have access to smp_load_acquire()/smp_store_release().

Sorry for the confusion. What you said above is actually the reason that
I asked this question. In the same way, smp_rmb()/smp_wmb() are available
on all arches. I am really asking whether the flag should indicate the
arch's preference for acquire/release over rmb/wmb.

Cheers,
Longman

> But if the architecture doesn't support it natively, the old rmb/wmb
> model may be preferred.
>
> Although maybe we're at the point where we don't even care about that.
>
> Linus
On Tue, 13 Aug 2024 at 12:58, Waiman Long <longman@redhat.com> wrote:
>
> Sorry for the confusion. What you said above is actually the reason that
> I asked this question. In the same way, smp_rmb()/smp_wmb() are available
> on all arches. I am really asking whether the flag should indicate the
> arch's preference for acquire/release over rmb/wmb.

I think that if an arch says it has native acquire/release, we should
basically assume that it's the better model.

I mean, we could certainly use "PREFERS" instead of "HAS", but is there
any real reason to do that?

Do we suddenly expect that people would make a CPU that has native
acquire/release, and it would somehow then prefer a full read barrier?

            Linus
On 8/13/24 16:01, Linus Torvalds wrote:
> On Tue, 13 Aug 2024 at 12:58, Waiman Long <longman@redhat.com> wrote:
>> Sorry for the confusion. What you said above is actually the reason that
>> I asked this question. In the same way, smp_rmb()/smp_wmb() are available
>> on all arches. I am really asking whether the flag should indicate the
>> arch's preference for acquire/release over rmb/wmb.
>
> I think that if an arch says it has native acquire/release, we should
> basically assume that it's the better model.
>
> I mean, we could certainly use "PREFERS" instead of "HAS", but is there
> any real reason to do that?
>
> Do we suddenly expect that people would make a CPU that has native
> acquire/release, and it would somehow then prefer a full read barrier?

ARCH_HAS_ACQUIRE_RELEASE is fine, but the help text for this Kconfig
option should spell out that preference, since both the
ARCH_HAS_ACQUIRE_RELEASE and the !ARCH_HAS_ACQUIRE_RELEASE code paths
are valid.

Cheers,
Longman
On Tue, Aug 13, 2024 at 11:26:17AM -0700, Christoph Lameter via B4 Relay wrote:
> From: "Christoph Lameter (Ampere)" <cl@gentwo.org>
>
> Some architectures support load acquire, which can save us a memory
> barrier and some cycles.
>
> A typical sequence
>
>	do {
>		seq = read_seqcount_begin(&s);
>		<something>
>	} while (read_seqcount_retry(&s, seq));
>
> requires 13 cycles on ARM64 for an empty loop. Two read memory barriers
> are needed, one for each of the seqcount_* functions.
>
> We can replace the first read barrier with a load acquire of
> the seqcount, which saves us one barrier.
>
> On ARM64 doing so reduces the cycle count from 13 to 8.

The patch itself looks reasonable to me, but in the commit message could
you please replace "ARM64" with the specific micro-architecture you're
testing on? There are tens of contemporary micro-architectures, and for
future reference it'd be good to know what specifically you've tested on.

If you cannot disclose that for some reason, just say "on my ARM64 test
machine" or something like that, so that we're not implying that this is
true for all ARM64 implementations.

Mark.

> Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
> ---
>  arch/Kconfig            |  5 +++++
>  arch/arm64/Kconfig      |  1 +
>  include/linux/seqlock.h | 41 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 47 insertions(+)
>
> [... diff elided; see the full patch below ...]
>
> base-commit: 6b4aa469f04999c3f244515fa7491b4d093c5167
> change-id: 20240813-seq_optimize-68c48696c798
>
> Best regards,
> --
> Christoph Lameter <cl@gentwo.org>
On Mon, 19 Aug 2024 at 01:46, Mark Rutland <mark.rutland@arm.com> wrote:
>
> If you cannot disclose that for some reason, just say "on my ARM64 test
> machine" or something like that, so that we're not implying that this is
> true for all ARM64 implementations.

It's the same machine I have - an Ampere Altra. It's a standard
Neoverse N1 core, afaik.

It might also be a good idea to just point to the ARM documentation,
although I don't know how stable those web addresses are:

    https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

and quoting the relevant part on that page:

  "Weaker ordering requirements that are imposed by Load-Acquire and
   Store-Release instructions allow for micro-architectural
   optimizations, which could reduce some of the performance impacts
   that are otherwise imposed by an explicit memory barrier.

   If the ordering requirement is satisfied using either a Load-Acquire
   or Store-Release, then it would be preferable to use these
   instructions instead of a DMB"

where that last sentence is basically ARM saying that load-acquire is
better than load+DMB and should be preferred.

            Linus
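[ For readers who want to reproduce numbers like the 13-vs-8 in the commit
message, a rough sketch of such a microbenchmark; this is a reconstruction,
not Christoph's actual harness, and note that get_cycles() on arm64 counts
architected-timer ticks rather than CPU cycles, so absolute values differ: ]

#include <linux/seqlock.h>
#include <linux/timex.h>	/* get_cycles(), cycles_t */
#include <linux/printk.h>

static seqcount_t bench_seq = SEQCNT_ZERO(bench_seq);

static void bench_empty_read_loop(void)
{
	const unsigned long iters = 1000000;
	cycles_t t0, t1;
	unsigned long i;
	unsigned seq;

	t0 = get_cycles();
	for (i = 0; i < iters; i++) {
		/* The loop from the commit message, with an empty body;
		 * the volatile counter reads keep it from being elided. */
		do {
			seq = read_seqcount_begin(&bench_seq);
		} while (read_seqcount_retry(&bench_seq, seq));
	}
	t1 = get_cycles();

	pr_info("seqcount read loop: %llu ticks/iter\n",
		(unsigned long long)(t1 - t0) / iters);
}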
diff --git a/arch/Kconfig b/arch/Kconfig
index 975dd22a2dbd..3f8867110a57 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1600,6 +1600,11 @@ config ARCH_HAS_KERNEL_FPU_SUPPORT
 	  Architectures that select this option can run floating-point code in
 	  the kernel, as described in Documentation/core-api/floating-point.rst.
 
+config ARCH_HAS_ACQUIRE_RELEASE
+	bool
+	help
+	  Architectures that support acquire / release can avoid memory fences
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2f8ff354ca6..19e34fff145f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -39,6 +39,7 @@ config ARM64
 	select ARCH_HAS_PTE_DEVMAP
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_HW_PTE_YOUNG
+	select ARCH_HAS_ACQUIRE_RELEASE
 	select ARCH_HAS_SETUP_DMA_OPS
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_SET_MEMORY
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index d90d8ee29d81..353fcf32b800 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -176,6 +176,28 @@ __seqprop_##lockname##_sequence(const seqcount_##lockname##_t *s)	\
 	return seq;							\
 }									\
 									\
+static __always_inline unsigned						\
+__seqprop_##lockname##_sequence_acquire(const seqcount_##lockname##_t *s) \
+{									\
+	unsigned seq = smp_load_acquire(&s->seqcount.sequence);		\
+									\
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))				\
+		return seq;						\
+									\
+	if (preemptible && unlikely(seq & 1)) {				\
+		__SEQ_LOCK(lockbase##_lock(s->lock));			\
+		__SEQ_LOCK(lockbase##_unlock(s->lock));			\
+									\
+		/*							\
+		 * Re-read the sequence counter since the (possibly	\
+		 * preempted) writer made progress.			\
+		 */							\
+		seq = smp_load_acquire(&s->seqcount.sequence);		\
+	}								\
+									\
+	return seq;							\
+}									\
+									\
 static __always_inline bool						\
 __seqprop_##lockname##_preemptible(const seqcount_##lockname##_t *s)	\
 {									\
@@ -211,6 +233,11 @@ static inline unsigned __seqprop_sequence(const seqcount_t *s)
 	return READ_ONCE(s->sequence);
 }
 
+static inline unsigned __seqprop_sequence_acquire(const seqcount_t *s)
+{
+	return smp_load_acquire(&s->sequence);
+}
+
 static inline bool __seqprop_preemptible(const seqcount_t *s)
 {
 	return false;
@@ -259,6 +286,7 @@ SEQCOUNT_LOCKNAME(mutex, struct mutex, true, mutex)
 #define seqprop_ptr(s)			__seqprop(s, ptr)(s)
 #define seqprop_const_ptr(s)		__seqprop(s, const_ptr)(s)
 #define seqprop_sequence(s)		__seqprop(s, sequence)(s)
+#define seqprop_sequence_acquire(s)	__seqprop(s, sequence_acquire)(s)
 #define seqprop_preemptible(s)		__seqprop(s, preemptible)(s)
 #define seqprop_assert(s)		__seqprop(s, assert)(s)
 
@@ -293,6 +321,18 @@ SEQCOUNT_LOCKNAME(mutex, struct mutex, true, mutex)
  *
  * Return: count to be passed to read_seqcount_retry()
  */
+#ifdef CONFIG_ARCH_HAS_ACQUIRE_RELEASE
+#define raw_read_seqcount_begin(s)					\
+({									\
+	unsigned _seq;							\
+									\
+	while ((_seq = seqprop_sequence_acquire(s)) & 1)		\
+		cpu_relax();						\
+									\
+	kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);			\
+	_seq;								\
+})
+#else
 #define raw_read_seqcount_begin(s)					\
 ({									\
 	unsigned _seq = __read_seqcount_begin(s);			\
@@ -300,6 +340,7 @@ SEQCOUNT_LOCKNAME(mutex, struct mutex, true, mutex)
 	smp_rmb();							\
 	_seq;								\
 })
+#endif
 
 /**
  * read_seqcount_begin() - begin a seqcount_t read critical section
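[ Putting the pieces together: with CONFIG_ARCH_HAS_ACQUIRE_RELEASE=y the
reader fast path boils down to roughly the following; a simplified
expansion for a plain seqcount_t, ignoring PREEMPT_RT and the KCSAN
annotation: ]

static inline unsigned raw_read_seqcount_begin_acquire(const seqcount_t *s)
{
	unsigned seq;

	/* The acquire load replaces READ_ONCE() + smp_rmb(): it both
	 * fetches the counter and orders the critical-section reads
	 * after it. Spin while a writer holds the count odd. */
	while ((seq = smp_load_acquire(&s->sequence)) & 1)
		cpu_relax();

	return seq;
}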