[RFC] Avoid memory barrier in read_seqcount() through load acquire

Message ID 20240813-seq_optimize-v1-1-84d57182e6a7@gentwo.org (mailing list archive)
State New, archived
Series [RFC] Avoid memory barrier in read_seqcount() through load acquire

Commit Message

Christoph Lameter via B4 Relay Aug. 13, 2024, 6:26 p.m. UTC
From: "Christoph Lameter (Ampere)" <cl@gentwo.org>

Some architectures support load acquire which can save us a memory
barrier and save some cycles.

A typical sequence

	do {
		seq = read_seqcount_begin(&s);
		<something>
	} while (read_seqcount_retry(&s, seq));

requires 13 cycles on ARM64 for an empty loop. Two read memory barriers are
needed. One for each of the seqcount_* functions.

We can replace the first read barrier with a load acquire of
the seqcount which saves us one barrier.

On ARM64 doing so reduces the cycle count from 13 to 8.
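
Roughly, the read side goes from a plain load followed by a read
barrier to a single acquire load (simplified sketch; the actual change
is in the raw_read_seqcount_begin() macro):

	/* before: plain load, then smp_rmb() orders the section */
	seq = READ_ONCE(s->sequence);
	smp_rmb();

	/* after: one acquire load provides the read-side ordering */
	seq = smp_load_acquire(&s->sequence);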

Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
---
 arch/Kconfig            |  5 +++++
 arch/arm64/Kconfig      |  1 +
 include/linux/seqlock.h | 41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+)


---
base-commit: 6b4aa469f04999c3f244515fa7491b4d093c5167
change-id: 20240813-seq_optimize-68c48696c798

Best regards,

Christoph Lameter <cl@gentwo.org>

Comments

Waiman Long Aug. 13, 2024, 7:01 p.m. UTC | #1
On 8/13/24 14:26, Christoph Lameter via B4 Relay wrote:
> From: "Christoph Lameter (Ampere)" <cl@gentwo.org>
>
> Some architectures support load acquire which can save us a memory
> barrier and save some cycles.
>
> A typical sequence
>
> 	do {
> 		seq = read_seqcount_begin(&s);
> 		<something>
> 	} while (read_seqcount_retry(&s, seq));
>
> requires 13 cycles on ARM64 for an empty loop. Two read memory barriers are
> needed. One for each of the seqcount_* functions.
>
> We can replace the first read barrier with a load acquire of
> the seqcount which saves us one barrier.
>
> On ARM64 doing so reduces the cycle count from 13 to 8.
>
> Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
> ---
>   arch/Kconfig            |  5 +++++
>   arch/arm64/Kconfig      |  1 +
>   include/linux/seqlock.h | 41 +++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 47 insertions(+)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 975dd22a2dbd..3f8867110a57 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1600,6 +1600,11 @@ config ARCH_HAS_KERNEL_FPU_SUPPORT
>   	  Architectures that select this option can run floating-point code in
>   	  the kernel, as described in Documentation/core-api/floating-point.rst.
>   
> +config ARCH_HAS_ACQUIRE_RELEASE
> +	bool
> +	help
> +	  Architectures that support acquire / release can avoid memory fences
> +
>   source "kernel/gcov/Kconfig"
>   
>   source "scripts/gcc-plugins/Kconfig"
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2f8ff354ca6..19e34fff145f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -39,6 +39,7 @@ config ARM64
>   	select ARCH_HAS_PTE_DEVMAP
>   	select ARCH_HAS_PTE_SPECIAL
>   	select ARCH_HAS_HW_PTE_YOUNG
> +	select ARCH_HAS_ACQUIRE_RELEASE
>   	select ARCH_HAS_SETUP_DMA_OPS
>   	select ARCH_HAS_SET_DIRECT_MAP
>   	select ARCH_HAS_SET_MEMORY

Do we need a new ARCH flag? I believe barrier APIs like 
smp_load_acquire() will use the full barrier for those arch'es that 
don't define their own smp_load_acquire().

BTW, acquire/release can be considered memory barriers too. Maybe you 
are talking about preferring acquire/release barriers over read/write 
barriers. Right?

Cheers,
Longman
Christoph Lameter (Ampere) Aug. 13, 2024, 7:41 p.m. UTC | #2
On Tue, 13 Aug 2024, Waiman Long wrote:

> Do we need a new ARCH flag? I believe barrier APIs like smp_load_acquire() 
> will use the full barrier for those arch'es that don't define their own 
> smp_load_acquire().

Well, this is a load acquire instruction. The generic fallback of
smp_load_acquire() is:

#define __smp_load_acquire(p)                                           \
({                                                                      \
         __unqual_scalar_typeof(*p) ___p1 = READ_ONCE(*p);               \
         compiletime_assert_atomic_type(*p);                             \
         __smp_mb();                                                     \
         (typeof(*p))___p1;                                              \
})

Looks like we have an acquire + barrier here.
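
For contrast, on arm64 the native version is a single acquire load,
roughly along these lines (illustrative sketch only, not the exact
arm64 definition; the helper name is made up):

	static inline unsigned int load_acquire_u32(const unsigned int *p)
	{
		unsigned int v;

		/* LDAR: one load with acquire semantics, no separate DMB */
		asm volatile("ldar %w0, %1" : "=r" (v) : "Q" (*p) : "memory");
		return v;
	}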

> BTW, acquire/release can be considered memory barriers too. Maybe you are 
> talking about preferring acquire/release barriers over read/write barriers. 
> Right?

Load acquire is a single load instruction that does not require full 
barrier semantics for the critical section but ensures that the value is 
loaded before all following accesses. Regular barriers do not have this 
single load that isolates the critical section; they sequence all loads 
around them instead.
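
On the seqcount read side that means the acquire ties just the load of
the sequence count to the accesses that follow it, which is exactly the
ordering the smp_rmb() was providing (illustrative sketch only):

	do {
		/* only this load is ordered before the section below */
		seq = smp_load_acquire(&s->sequence);
		<reads of the protected data>
	} while (read_seqcount_retry(&s, seq));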
Linus Torvalds Aug. 13, 2024, 7:48 p.m. UTC | #3
On Tue, 13 Aug 2024 at 12:01, Waiman Long <longman@redhat.com> wrote:
>
> Do we need a new ARCH flag?

I'm confused by that question.

That's clearly exactly what that ARCH_HAS_ACQUIRE_RELEASE is.

Obviously all architectures "have" it - in the sense that we always
have access to a "smp_load_acquire()/smp_store_release()".

But if the architecture doesn't support it natively, the old rmb/wmb
model may be preferred.

Although maybe we're at the point where we don't even care about that.

              Linus
Waiman Long Aug. 13, 2024, 7:58 p.m. UTC | #4
On 8/13/24 15:48, Linus Torvalds wrote:
> On Tue, 13 Aug 2024 at 12:01, Waiman Long <longman@redhat.com> wrote:
>> Do we need a new ARCH flag?
> I'm confused by that question.
>
> That's clearly exactly what that ARCH_HAS_ACQUIRE_RELEASE is.
>
> Obviously all architectures "have" it - in the sense that we always
> have access to a "smp_load_acquire()/smp_store_release()".

Sorry for the confusion. What you said above is actually the reason that 
I ask this question. In the same way, smp_rmb()/wmb() is available for 
all arches. I am actually asking if it should be a flag that indicates 
the arch's preference to use acquire/release over rmb/wmb.

Cheers,
Longman

>
> But if the architecture doesn't support it natively, the old rmb/wmb
> model may be preferred.
>
> Although maybe we're at the point where we don't even care about that.
>
>                Linus
>
Linus Torvalds Aug. 13, 2024, 8:01 p.m. UTC | #5
On Tue, 13 Aug 2024 at 12:58, Waiman Long <longman@redhat.com> wrote:
>
> Sorry for the confusion. What you said above is actually the reason that
> I ask this question. In the same way, smp_rmb()/wmb() is available for
> all arches. I am actually asking if it should be a flag that indicates
> the arch's preference to use acquire/release over rmb/wmb.

I think that if an arch says it has native acquire/release, we should
basically assume that it's the better model.

I mean, we could certainly use "PREFERS" instead of "HAS", but is
there any real reason to do that?

Do we suddenly expect that people would make a CPU that has native
acquire/release, and it would somehow then prefer a full read barrier?

             Linus
Waiman Long Aug. 13, 2024, 8:23 p.m. UTC | #6
On 8/13/24 16:01, Linus Torvalds wrote:
> On Tue, 13 Aug 2024 at 12:58, Waiman Long <longman@redhat.com> wrote:
>> Sorry for the confusion. What you said above is actually the reason that
>> I ask this question. In the same way, smp_rmb()/wmb() is available for
>> all arches. I am actually asking if it should be a flag that indicates
>> the arch's preference to use acquire/release over rmb/wmb.
> I think that if an arch says it has native acquire/release, we should
> basically assume that it's the better model.
>
> I mean, we could certainly use "PREFERS" instead of "HAS", but is
> there any real reason to do that?
>
> Do we suddenly expect that people would make a CPU that has native
> acquire/release, and it would somehow then prefer a full read barrier?

ARCH_HAS_ACQUIRE_RELEASE is fine, but the help text for this Kconfig 
option should clarify this preference, as both the ARCH_HAS_ACQUIRE_RELEASE 
and the !ARCH_HAS_ACQUIRE_RELEASE code paths are valid.
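
For example, something along these lines (wording purely illustrative):

	config ARCH_HAS_ACQUIRE_RELEASE
		bool
		help
		  Architectures that select this option prefer single-instruction
		  load-acquire/store-release over smp_rmb()/smp_wmb() based
		  ordering. Not selecting it is still correct; the generic
		  barrier-based code is used instead.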

Cheers,
Longman
Mark Rutland Aug. 19, 2024, 8:45 a.m. UTC | #7
On Tue, Aug 13, 2024 at 11:26:17AM -0700, Christoph Lameter via B4 Relay wrote:
> From: "Christoph Lameter (Ampere)" <cl@gentwo.org>
> 
> Some architectures support load acquire which can save us a memory
> barrier and save some cycles.
> 
> A typical sequence
> 
> 	do {
> 		seq = read_seqcount_begin(&s);
> 		<something>
> 	} while (read_seqcount_retry(&s, seq));
> 
> requires 13 cycles on ARM64 for an empty loop. Two read memory barriers are
> needed. One for each of the seqcount_* functions.
> 
> We can replace the first read barrier with a load acquire of
> the seqcount which saves us one barrier.
> 
> On ARM64 doing so reduces the cycle count from 13 to 8.

The patch itself looks reasonable to me, but in the commit message could
you please replace "ARM64" with the specific micro-architecture you're
testing on? There are tens of contemporary micro-architectures, and for
future reference it'd be good to know what specifically you've tested
on.

If you cannot disclose that for some reason, just say "on my ARM64 test
machine" or something like that, so that we're not implying that this is
true for all ARM64 implementations.

Mark.

> 
> Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
> ---
>  arch/Kconfig            |  5 +++++
>  arch/arm64/Kconfig      |  1 +
>  include/linux/seqlock.h | 41 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 47 insertions(+)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 975dd22a2dbd..3f8867110a57 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1600,6 +1600,11 @@ config ARCH_HAS_KERNEL_FPU_SUPPORT
>  	  Architectures that select this option can run floating-point code in
>  	  the kernel, as described in Documentation/core-api/floating-point.rst.
>  
> +config ARCH_HAS_ACQUIRE_RELEASE
> +	bool
> +	help
> +	  Architectures that support acquire / release can avoid memory fences
> +
>  source "kernel/gcov/Kconfig"
>  
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2f8ff354ca6..19e34fff145f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -39,6 +39,7 @@ config ARM64
>  	select ARCH_HAS_PTE_DEVMAP
>  	select ARCH_HAS_PTE_SPECIAL
>  	select ARCH_HAS_HW_PTE_YOUNG
> +	select ARCH_HAS_ACQUIRE_RELEASE
>  	select ARCH_HAS_SETUP_DMA_OPS
>  	select ARCH_HAS_SET_DIRECT_MAP
>  	select ARCH_HAS_SET_MEMORY
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index d90d8ee29d81..353fcf32b800 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -176,6 +176,28 @@ __seqprop_##lockname##_sequence(const seqcount_##lockname##_t *s)	\
>  	return seq;							\
>  }									\
>  									\
> +static __always_inline unsigned						\
> +__seqprop_##lockname##_sequence_acquire(const seqcount_##lockname##_t *s) \
> +{									\
> +	unsigned seq = smp_load_acquire(&s->seqcount.sequence);		\
> +									\
> +	if (!IS_ENABLED(CONFIG_PREEMPT_RT))				\
> +		return seq;						\
> +									\
> +	if (preemptible && unlikely(seq & 1)) {				\
> +		__SEQ_LOCK(lockbase##_lock(s->lock));			\
> +		__SEQ_LOCK(lockbase##_unlock(s->lock));			\
> +									\
> +		/*							\
> +		 * Re-read the sequence counter since the (possibly	\
> +		 * preempted) writer made progress.			\
> +		 */							\
> +		seq = smp_load_acquire(&s->seqcount.sequence);		\
> +	}								\
> +									\
> +	return seq;							\
> +}									\
> +									\
>  static __always_inline bool						\
>  __seqprop_##lockname##_preemptible(const seqcount_##lockname##_t *s)	\
>  {									\
> @@ -211,6 +233,11 @@ static inline unsigned __seqprop_sequence(const seqcount_t *s)
>  	return READ_ONCE(s->sequence);
>  }
>  
> +static inline unsigned __seqprop_sequence_acquire(const seqcount_t *s)
> +{
> +	return smp_load_acquire(&s->sequence);
> +}
> +
>  static inline bool __seqprop_preemptible(const seqcount_t *s)
>  {
>  	return false;
> @@ -259,6 +286,7 @@ SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
>  #define seqprop_ptr(s)			__seqprop(s, ptr)(s)
>  #define seqprop_const_ptr(s)		__seqprop(s, const_ptr)(s)
>  #define seqprop_sequence(s)		__seqprop(s, sequence)(s)
> +#define seqprop_sequence_acquire(s)	__seqprop(s, sequence_acquire)(s)
>  #define seqprop_preemptible(s)		__seqprop(s, preemptible)(s)
>  #define seqprop_assert(s)		__seqprop(s, assert)(s)
>  
> @@ -293,6 +321,18 @@ SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
>   *
>   * Return: count to be passed to read_seqcount_retry()
>   */
> +#ifdef CONFIG_ARCH_HAS_ACQUIRE_RELEASE
> +#define raw_read_seqcount_begin(s)					\
> +({									\
> +	unsigned _seq;							\
> +									\
> +	while ((_seq = seqprop_sequence_acquire(s)) & 1)		\
> +		cpu_relax();						\
> +									\
> +	kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);			\
> +	_seq;								\
> +})
> +#else
>  #define raw_read_seqcount_begin(s)					\
>  ({									\
>  	unsigned _seq = __read_seqcount_begin(s);			\
> @@ -300,6 +340,7 @@ SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
>  	smp_rmb();							\
>  	_seq;								\
>  })
> +#endif
>  
>  /**
>   * read_seqcount_begin() - begin a seqcount_t read critical section
> 
> ---
> base-commit: 6b4aa469f04999c3f244515fa7491b4d093c5167
> change-id: 20240813-seq_optimize-68c48696c798
> 
> Best regards,
> -- 
> Christoph Lameter <cl@gentwo.org>
> 
> 
>
Linus Torvalds Aug. 19, 2024, 4:25 p.m. UTC | #8
On Mon, 19 Aug 2024 at 01:46, Mark Rutland <mark.rutland@arm.com> wrote:
>
> If you cannot disclose that for some reason, just say "on my ARM64 test
> machine" or something like that, so that we're not implying that this is
> true for all ARM64 implementations.

It's the same machine I have - an Ampere Altra. It's a standard
Neoverse N1 core, afaik.

It might also be a good idea to just point to the ARM documentation,
although I don't know how stable those web addresses are:

   https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

and quoting the relevant part on that page:

 "Weaker ordering requirements that are imposed by Load-Acquire and
  Store-Release instructions allow for micro-architectural
  optimizations, which could reduce some of the performance impacts that
  are otherwise imposed by an explicit memory barrier.

  If the ordering requirement is satisfied using either a Load-Acquire
  or Store-Release, then it would be preferable to use these
  instructions instead of a DMB"

where that last sentence is basically ARM saying that load-acquire is
better than load+DMB and should be preferred.

             Linus

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index 975dd22a2dbd..3f8867110a57 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1600,6 +1600,11 @@  config ARCH_HAS_KERNEL_FPU_SUPPORT
 	  Architectures that select this option can run floating-point code in
 	  the kernel, as described in Documentation/core-api/floating-point.rst.
 
+config ARCH_HAS_ACQUIRE_RELEASE
+	bool
+	help
+	  Architectures that support acquire / release can avoid memory fences
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2f8ff354ca6..19e34fff145f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -39,6 +39,7 @@  config ARM64
 	select ARCH_HAS_PTE_DEVMAP
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_HW_PTE_YOUNG
+	select ARCH_HAS_ACQUIRE_RELEASE
 	select ARCH_HAS_SETUP_DMA_OPS
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_SET_MEMORY
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index d90d8ee29d81..353fcf32b800 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -176,6 +176,28 @@  __seqprop_##lockname##_sequence(const seqcount_##lockname##_t *s)	\
 	return seq;							\
 }									\
 									\
+static __always_inline unsigned						\
+__seqprop_##lockname##_sequence_acquire(const seqcount_##lockname##_t *s) \
+{									\
+	unsigned seq = smp_load_acquire(&s->seqcount.sequence);		\
+									\
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))				\
+		return seq;						\
+									\
+	if (preemptible && unlikely(seq & 1)) {				\
+		__SEQ_LOCK(lockbase##_lock(s->lock));			\
+		__SEQ_LOCK(lockbase##_unlock(s->lock));			\
+									\
+		/*							\
+		 * Re-read the sequence counter since the (possibly	\
+		 * preempted) writer made progress.			\
+		 */							\
+		seq = smp_load_acquire(&s->seqcount.sequence);		\
+	}								\
+									\
+	return seq;							\
+}									\
+									\
 static __always_inline bool						\
 __seqprop_##lockname##_preemptible(const seqcount_##lockname##_t *s)	\
 {									\
@@ -211,6 +233,11 @@  static inline unsigned __seqprop_sequence(const seqcount_t *s)
 	return READ_ONCE(s->sequence);
 }
 
+static inline unsigned __seqprop_sequence_acquire(const seqcount_t *s)
+{
+	return smp_load_acquire(&s->sequence);
+}
+
 static inline bool __seqprop_preemptible(const seqcount_t *s)
 {
 	return false;
@@ -259,6 +286,7 @@  SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
 #define seqprop_ptr(s)			__seqprop(s, ptr)(s)
 #define seqprop_const_ptr(s)		__seqprop(s, const_ptr)(s)
 #define seqprop_sequence(s)		__seqprop(s, sequence)(s)
+#define seqprop_sequence_acquire(s)	__seqprop(s, sequence_acquire)(s)
 #define seqprop_preemptible(s)		__seqprop(s, preemptible)(s)
 #define seqprop_assert(s)		__seqprop(s, assert)(s)
 
@@ -293,6 +321,18 @@  SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
  *
  * Return: count to be passed to read_seqcount_retry()
  */
+#ifdef CONFIG_ARCH_HAS_ACQUIRE_RELEASE
+#define raw_read_seqcount_begin(s)					\
+({									\
+	unsigned _seq;							\
+									\
+	while ((_seq = seqprop_sequence_acquire(s)) & 1)		\
+		cpu_relax();						\
+									\
+	kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);			\
+	_seq;								\
+})
+#else
 #define raw_read_seqcount_begin(s)					\
 ({									\
 	unsigned _seq = __read_seqcount_begin(s);			\
@@ -300,6 +340,7 @@  SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
 	smp_rmb();							\
 	_seq;								\
 })
+#endif
 
 /**
  * read_seqcount_begin() - begin a seqcount_t read critical section