[v3] Avoid memory barrier in read_seqcount() through load acquire

Message ID	20240912-seq_optimize-v3-1-8ee25e04dffa@gentwo.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org> From: Christoph Lameter via B4 Relay <devnull+cl.gentwo.org@kernel.org> Date: Thu, 12 Sep 2024 15:44:08 -0700 Subject: [PATCH v3] Avoid memory barrier in read_seqcount() through load acquire MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Message-Id: <20240912-seq_optimize-v3-1-8ee25e04dffa@gentwo.org> To: Thomas Gleixner <tglx@linutronix.de>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@redhat.com>, Waiman Long <longman@redhat.com>, Boqun Feng <boqun.feng@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-arch@vger.kernel.org, "Christoph Lameter (Ampere)" <cl@gentwo.org> Precedence: list Reply-To: cl@gentwo.org Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	[v3] Avoid memory barrier in read_seqcount() through load acquire \| expand [v3] Avoid memory barrier in read_seqcount() through load acquire

Christoph Lameter via B4 Relay Sept. 12, 2024, 10:44 p.m. UTC

From: "Christoph Lameter (Ampere)" <cl@gentwo.org>

Some architectures support load acquire which can save us a memory
barrier and save some cycles.

A typical sequence

	do {
		seq = read_seqcount_begin(&s);
		<something>
	} while (read_seqcount_retry(&s, seq);

requires 13 cycles on ARM64 for an empty loop. Two read memory
barriers are needed. One for each of the seqcount_* functions.

We can replace the first read barrier with a load acquire of
the seqcount which saves us one barrier.

On ARM64 doing so reduces the cycle count from 13 to 8.

This is a general improvement for the ARM64 architecture and not
specific to a certain processor. The cycle count here was
obtained on a Neoverse N1 (Ampere Altra).

We can further optimize handling by using the cond_load_acquire logic
which will give an ARM CPU a chance to enter some power saving mode
while waiting for changes to a cacheline thereby avoiding busy loops
and therefore saving power.

The ARM documentation states that load acquire is more effective
than a load plus barrier. In general that tends to be true on all
compute platforms that support both.

See (as quoted by Linus Torvalds):
   https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

 "Weaker ordering requirements that are imposed by Load-Acquire and
  Store-Release instructions allow for micro-architectural
  optimizations, which could reduce some of the performance impacts that
  are otherwise imposed by an explicit memory barrier.

  If the ordering requirement is satisfied using either a Load-Acquire
  or Store-Release, then it would be preferable to use these
  instructions instead of a DMB"

The patch benefited significantly from the knowledge of the innards
of the seqlock code by Thomas Gleixner.

Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
---
V1->V2
- Describe the benefit of load acquire vs barriers
- Explain the CONFIG_ARCH_HAS_ACQUIRE_RELEASE option better
---
Changes in v3:
- Support cond_load_acquire to give the processor a chance to do some
  sort of power down until cacheline changes.
- Better code by Thomas Gleixner
- Link to v2: https://lore.kernel.org/r/20240819-seq_optimize-v2-1-9d0da82b022f@gentwo.org
---
 arch/Kconfig            |  8 +++++
 arch/arm64/Kconfig      |  1 +
 include/linux/seqlock.h | 85 ++++++++++++++++++++++++++++++++++++-------------
 3 files changed, 71 insertions(+), 23 deletions(-)


---
base-commit: 77f587896757708780a7e8792efe62939f25a5ab
change-id: 20240813-seq_optimize-68c48696c798

Best regards,

kernel test robot Sept. 13, 2024, 1:41 p.m. UTC | #1

Hi Christoph,

kernel test robot noticed the following build errors:

[auto build test ERROR on 77f587896757708780a7e8792efe62939f25a5ab]

url:    https://github.com/intel-lab-lkp/linux/commits/Christoph-Lameter-via-B4-Relay/Avoid-memory-barrier-in-read_seqcount-through-load-acquire/20240913-064557
base:   77f587896757708780a7e8792efe62939f25a5ab
patch link:    https://lore.kernel.org/r/20240912-seq_optimize-v3-1-8ee25e04dffa%40gentwo.org
patch subject: [PATCH v3] Avoid memory barrier in read_seqcount() through load acquire
config: x86_64-defconfig (https://download.01.org/0day-ci/archive/20240913/202409132135.ki3Mp5EA-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240913/202409132135.ki3Mp5EA-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409132135.ki3Mp5EA-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/gt/intel_gt.c:36:
   drivers/gpu/drm/i915/gt/intel_tlb.h: In function 'intel_gt_tlb_seqno':
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro "seqprop_sequence" requires 2 arguments, but only 1 given
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                                               ^
   In file included from include/linux/mmzone.h:17,
                    from include/linux/gfp.h:7,
                    from include/drm/drm_managed.h:6,
                    from drivers/gpu/drm/i915/gt/intel_gt.c:6:
   include/linux/seqlock.h:280: note: macro "seqprop_sequence" defined here
     280 | #define seqprop_sequence(s, a)          __seqprop(s, sequence)(s, a)
         | 
   In file included from drivers/gpu/drm/i915/gt/intel_gt.c:36:
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:16: error: 'seqprop_sequence' undeclared (first use in this function)
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                ^~~~~~~~~~~~~~~~
   drivers/gpu/drm/i915/gt/intel_tlb.h:21:16: note: each undeclared identifier is reported only once for each function it appears in
--
   In file included from drivers/gpu/drm/i915/gt/intel_tlb.c:14:
   drivers/gpu/drm/i915/gt/intel_tlb.h: In function 'intel_gt_tlb_seqno':
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro "seqprop_sequence" requires 2 arguments, but only 1 given
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                                               ^
   In file included from include/linux/mmzone.h:17,
                    from include/linux/gfp.h:7,
                    from include/linux/xarray.h:16,
                    from include/linux/radix-tree.h:21,
                    from include/linux/idr.h:15,
                    from include/linux/kernfs.h:12,
                    from include/linux/sysfs.h:16,
                    from include/linux/kobject.h:20,
                    from include/linux/energy_model.h:7,
                    from include/linux/device.h:16,
                    from include/linux/pm_qos.h:17,
                    from drivers/gpu/drm/i915/i915_drv.h:35,
                    from drivers/gpu/drm/i915/gt/intel_tlb.c:6:
   include/linux/seqlock.h:280: note: macro "seqprop_sequence" defined here
     280 | #define seqprop_sequence(s, a)          __seqprop(s, sequence)(s, a)
         | 
   In file included from drivers/gpu/drm/i915/gt/intel_tlb.c:14:
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:16: error: 'seqprop_sequence' undeclared (first use in this function)
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                ^~~~~~~~~~~~~~~~
   drivers/gpu/drm/i915/gt/intel_tlb.h:21:16: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/gpu/drm/i915/gt/intel_tlb.h:22:1: warning: control reaches end of non-void function [-Wreturn-type]
      22 | }
         | ^


vim +/seqprop_sequence +21 drivers/gpu/drm/i915/gt/intel_tlb.h

568a2e6f0b12ee Chris Wilson 2023-08-01  18  
568a2e6f0b12ee Chris Wilson 2023-08-01  19  static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
568a2e6f0b12ee Chris Wilson 2023-08-01  20  {
568a2e6f0b12ee Chris Wilson 2023-08-01 @21  	return seqprop_sequence(&gt->tlb.seqno);
568a2e6f0b12ee Chris Wilson 2023-08-01 @22  }
568a2e6f0b12ee Chris Wilson 2023-08-01  23

kernel test robot Sept. 13, 2024, 1:41 p.m. UTC | #2

Hi Christoph,

kernel test robot noticed the following build errors:

[auto build test ERROR on 77f587896757708780a7e8792efe62939f25a5ab]

url:    https://github.com/intel-lab-lkp/linux/commits/Christoph-Lameter-via-B4-Relay/Avoid-memory-barrier-in-read_seqcount-through-load-acquire/20240913-064557
base:   77f587896757708780a7e8792efe62939f25a5ab
patch link:    https://lore.kernel.org/r/20240912-seq_optimize-v3-1-8ee25e04dffa%40gentwo.org
patch subject: [PATCH v3] Avoid memory barrier in read_seqcount() through load acquire
config: i386-buildonly-randconfig-001-20240913 (https://download.01.org/0day-ci/archive/20240913/202409132145.0UdNx9kr-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240913/202409132145.0UdNx9kr-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409132145.0UdNx9kr-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/gem/i915_gem_pages.c:11:
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:40: error: too few arguments provided to function-like macro invocation
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                                               ^
   include/linux/seqlock.h:280:9: note: macro 'seqprop_sequence' defined here
     280 | #define seqprop_sequence(s, a)          __seqprop(s, sequence)(s, a)
         |         ^
   In file included from drivers/gpu/drm/i915/gem/i915_gem_pages.c:11:
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:9: error: use of undeclared identifier 'seqprop_sequence'; did you mean '__seqprop_sequence'?
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                ^~~~~~~~~~~~~~~~
         |                __seqprop_sequence
   include/linux/seqlock.h:228:24: note: '__seqprop_sequence' declared here
     228 | static inline unsigned __seqprop_sequence(const seqcount_t *s, bool acquire)
         |                        ^
   In file included from drivers/gpu/drm/i915/gem/i915_gem_pages.c:11:
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:9: error: incompatible pointer to integer conversion returning 'unsigned int (const seqcount_t *, bool)' (aka 'unsigned int (const struct seqcount *, _Bool)') from a function with result type 'u32' (aka 'unsigned int') [-Wint-conversion]
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                ^~~~~~~~~~~~~~~~
   3 errors generated.


vim +21 drivers/gpu/drm/i915/gt/intel_tlb.h

568a2e6f0b12ee Chris Wilson 2023-08-01  18  
568a2e6f0b12ee Chris Wilson 2023-08-01  19  static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
568a2e6f0b12ee Chris Wilson 2023-08-01  20  {
568a2e6f0b12ee Chris Wilson 2023-08-01 @21  	return seqprop_sequence(&gt->tlb.seqno);
568a2e6f0b12ee Chris Wilson 2023-08-01  22  }
568a2e6f0b12ee Chris Wilson 2023-08-01  23

Christoph Lameter (Ampere) Sept. 16, 2024, 5:52 p.m. UTC | #3

On Fri, 13 Sep 2024, kernel test robot wrote:

> >> drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro "seqprop_sequence" requires 2 arguments, but only 1 given

From 15d86bc9589f16947c5fb0f34d2947eacd48f853 Mon Sep 17 00:00:00 2001
From: Christoph Lameter <cl@gentwo.org>
Date: Mon, 16 Sep 2024 10:44:16 -0700
Subject: [PATCH] Update Intel DRM use of seqprop_sequence

One of Intels drivers uses seqprop_sequence() for its tlb sequencing.
We added a parameter so that we can use acquire. Its pretty safe to
assume that this will work without acquire.

Signed-off-by: Christoph Lameter <cl@linux.com>
---
 drivers/gpu/drm/i915/gt/intel_tlb.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.h b/drivers/gpu/drm/i915/gt/intel_tlb.h
index 337327af92ac..81998c4cd4fb 100644
--- a/drivers/gpu/drm/i915/gt/intel_tlb.h
+++ b/drivers/gpu/drm/i915/gt/intel_tlb.h
@@ -18,7 +18,7 @@ void intel_gt_fini_tlb(struct intel_gt *gt);

 static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
 {
-	return seqprop_sequence(&gt->tlb.seqno);
+	return seqprop_sequence(&gt->tlb.seqno, false);
 }

 static inline u32 intel_gt_next_invalidate_tlb_full(const struct intel_gt *gt)

Will Deacon Sept. 17, 2024, 7:12 a.m. UTC | #4

Hi Christoph,

On Thu, Sep 12, 2024 at 03:44:08PM -0700, Christoph Lameter via B4 Relay wrote:
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 975dd22a2dbd..3c270f496231 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1600,6 +1600,14 @@ config ARCH_HAS_KERNEL_FPU_SUPPORT
>  	  Architectures that select this option can run floating-point code in
>  	  the kernel, as described in Documentation/core-api/floating-point.rst.
>  
> +config ARCH_HAS_ACQUIRE_RELEASE
> +	bool
> +	help
> +	  Setting ARCH_HAS_ACQUIRE_RELEASE indicates that the architecture
> +	  supports load acquire and release. Typically these are more effective
> +	  than memory barriers. Code will prefer the use of load acquire and
> +	  store release over memory barriers if this option is enabled.
> +

Unsurprisingly, I'd be in favour of making this unconditional rather than
adding a new Kconfig option. Would that actually hurt any architectures
where we care about the last few shreds of performance?

>  source "kernel/gcov/Kconfig"
>  
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2f8ff354ca6..19e34fff145f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -39,6 +39,7 @@ config ARM64
>  	select ARCH_HAS_PTE_DEVMAP
>  	select ARCH_HAS_PTE_SPECIAL
>  	select ARCH_HAS_HW_PTE_YOUNG
> +	select ARCH_HAS_ACQUIRE_RELEASE
>  	select ARCH_HAS_SETUP_DMA_OPS
>  	select ARCH_HAS_SET_DIRECT_MAP
>  	select ARCH_HAS_SET_MEMORY
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index d90d8ee29d81..a3fe9ee8edef 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -23,6 +23,13 @@
>  
>  #include <asm/processor.h>
>  
> +#ifdef CONFIG_ARCH_HAS_ACQUIRE_RELEASE
> +# define USE_LOAD_ACQUIRE	true
> +# define USE_COND_LOAD_ACQUIRE	!IS_ENABLED(CONFIG_PREEMPT_RT)
> +#else
> +# define USE_LOAD_ACQUIRE	false
> +# define USE_COND_LOAD_ACQUIRE	false
> +#endif
>  /*
>   * The seqlock seqcount_t interface does not prescribe a precise sequence of
>   * read begin/retry/end. For readers, typically there is a call to
> @@ -132,6 +139,17 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
>  #define seqcount_rwlock_init(s, lock)		seqcount_LOCKNAME_init(s, lock, rwlock)
>  #define seqcount_mutex_init(s, lock)		seqcount_LOCKNAME_init(s, lock, mutex)
>  
> +static __always_inline unsigned __seqprop_load_sequence(const seqcount_t *s, bool acquire)
> +{
> +	if (!acquire || !USE_LOAD_ACQUIRE)
> +		return READ_ONCE(s->sequence);
> +
> +	if (USE_COND_LOAD_ACQUIRE)
> +		return smp_cond_load_acquire((unsigned int *)&s->sequence, (s->sequence & 1) == 0);

This looks wrong to me.

The conditional expression passed to smp_cond_load_acquire() should be
written in terms of 'VAL', otherwise you're introducing an additional
non-atomic access to the sequence counter.

Will

Will Deacon Sept. 17, 2024, 7:37 a.m. UTC | #5

On Mon, Sep 16, 2024 at 10:52:18AM -0700, Christoph Lameter (Ampere) wrote:
> On Fri, 13 Sep 2024, kernel test robot wrote:
> 
> > >> drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro "seqprop_sequence" requires 2 arguments, but only 1 given
> 
> From 15d86bc9589f16947c5fb0f34d2947eacd48f853 Mon Sep 17 00:00:00 2001
> From: Christoph Lameter <cl@gentwo.org>
> Date: Mon, 16 Sep 2024 10:44:16 -0700
> Subject: [PATCH] Update Intel DRM use of seqprop_sequence
> 
> One of Intels drivers uses seqprop_sequence() for its tlb sequencing.
> We added a parameter so that we can use acquire. Its pretty safe to
> assume that this will work without acquire.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_tlb.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.h b/drivers/gpu/drm/i915/gt/intel_tlb.h
> index 337327af92ac..81998c4cd4fb 100644
> --- a/drivers/gpu/drm/i915/gt/intel_tlb.h
> +++ b/drivers/gpu/drm/i915/gt/intel_tlb.h
> @@ -18,7 +18,7 @@ void intel_gt_fini_tlb(struct intel_gt *gt);
> 
>  static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
>  {
> -	return seqprop_sequence(&gt->tlb.seqno);
> +	return seqprop_sequence(&gt->tlb.seqno, false);
>  }

Yikes, why is the driver using the seqlock internals here? It's a bit of
a pity, as a quick grep suggest that this is the _only_ user of
'seqcount_mutex_t', yet it's still having to work around the API.

Will

Thomas Gleixner Sept. 17, 2024, 11:50 a.m. UTC | #6

Cc+ i915 people

On Tue, Sep 17 2024 at 08:37, Will Deacon wrote:
> On Mon, Sep 16, 2024 at 10:52:18AM -0700, Christoph Lameter (Ampere) wrote:
>> On Fri, 13 Sep 2024, kernel test robot wrote:
>> 
>> > >> drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro "seqprop_sequence" requires 2 arguments, but only 1 given
>> 
>> From 15d86bc9589f16947c5fb0f34d2947eacd48f853 Mon Sep 17 00:00:00 2001
>> From: Christoph Lameter <cl@gentwo.org>
>> Date: Mon, 16 Sep 2024 10:44:16 -0700
>> Subject: [PATCH] Update Intel DRM use of seqprop_sequence
>> 
>> One of Intels drivers uses seqprop_sequence() for its tlb sequencing.
>> We added a parameter so that we can use acquire. Its pretty safe to
>> assume that this will work without acquire.
>>
>> Signed-off-by: Christoph Lameter <cl@linux.com>
>> ---
>>  drivers/gpu/drm/i915/gt/intel_tlb.h | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.h b/drivers/gpu/drm/i915/gt/intel_tlb.h
>> index 337327af92ac..81998c4cd4fb 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_tlb.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_tlb.h
>> @@ -18,7 +18,7 @@ void intel_gt_fini_tlb(struct intel_gt *gt);
>> 
>>  static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
>>  {
>> -	return seqprop_sequence(&gt->tlb.seqno);
>> +	return seqprop_sequence(&gt->tlb.seqno, false);
>>  }
>
> Yikes, why is the driver using the seqlock internals here? It's a bit of
> a pity, as a quick grep suggest that this is the _only_ user of
> 'seqcount_mutex_t', yet it's still having to work around the API.

Why the hell can't i915 use the proper interfaces and has to bypass the
core code? Just because C allows that does not make it correct.

Can the i915 people please remove this blatant violation of layering?

Thanks,

        tglx

Thomas Gleixner Sept. 17, 2024, 11:52 a.m. UTC | #7

On Thu, Sep 12 2024 at 15:44, Christoph Lameter via wrote:

$Subject: .....

You still fail to provide a proper subsystem prefix. It's not rocket science.

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#patch-subject

> From: "Christoph Lameter (Ampere)" <cl@gentwo.org>
>
> Some architectures support load acquire which can save us a memory
> barrier and save some cycles.
>
> A typical sequence
>
> 	do {
> 		seq = read_seqcount_begin(&s);
> 		<something>
> 	} while (read_seqcount_retry(&s, seq);
>
> requires 13 cycles on ARM64 for an empty loop. Two read memory
> barriers are needed. One for each of the seqcount_* functions.
>
> We can replace the first read barrier with a load acquire of
> the seqcount which saves us one barrier.

We ....

Please write changelogs using passive voice. 'We' means nothing here.

Thanks,

        tglx

Rodrigo Vivi Sept. 18, 2024, 12:45 a.m. UTC | #8

On Tue, 2024-09-17 at 13:50 +0200, Thomas Gleixner wrote:
> Cc+ i915 people
> 
> On Tue, Sep 17 2024 at 08:37, Will Deacon wrote:
> > On Mon, Sep 16, 2024 at 10:52:18AM -0700, Christoph Lameter
> > (Ampere) wrote:
> > > On Fri, 13 Sep 2024, kernel test robot wrote:
> > > 
> > > > > > drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro
> > > > > > "seqprop_sequence" requires 2 arguments, but only 1 given
> > > 
> > > From 15d86bc9589f16947c5fb0f34d2947eacd48f853 Mon Sep 17 00:00:00
> > > 2001
> > > From: Christoph Lameter <cl@gentwo.org>
> > > Date: Mon, 16 Sep 2024 10:44:16 -0700
> > > Subject: [PATCH] Update Intel DRM use of seqprop_sequence
> > > 
> > > One of Intels drivers uses seqprop_sequence() for its tlb
> > > sequencing.
> > > We added a parameter so that we can use acquire. Its pretty safe
> > > to
> > > assume that this will work without acquire.
> > > 
> > > Signed-off-by: Christoph Lameter <cl@linux.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_tlb.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.h
> > > b/drivers/gpu/drm/i915/gt/intel_tlb.h
> > > index 337327af92ac..81998c4cd4fb 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_tlb.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_tlb.h
> > > @@ -18,7 +18,7 @@ void intel_gt_fini_tlb(struct intel_gt *gt);
> > > 
> > >  static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
> > >  {
> > > -	return seqprop_sequence(&gt->tlb.seqno);
> > > +	return seqprop_sequence(&gt->tlb.seqno, false);
> > >  }
> > 
> > Yikes, why is the driver using the seqlock internals here? It's a
> > bit of
> > a pity, as a quick grep suggest that this is the _only_ user of
> > 'seqcount_mutex_t', yet it's still having to work around the API.
> 
> Why the hell can't i915 use the proper interfaces and has to bypass
> the
> core code? Just because C allows that does not make it correct.
> 
> Can the i915 people please remove this blatant violation of layering?

Yeap, we gotta remove this. Just need to be careful on this TLB
invalidation code without causing some funny deadlocks...

> 
> Thanks,
> 
>         tglx
>

Christoph Lameter (Ampere) Sept. 18, 2024, 11:03 a.m. UTC | #9

On Tue, 17 Sep 2024, Will Deacon wrote:

> > +config ARCH_HAS_ACQUIRE_RELEASE
> > +	bool
> > +	help
> > +	  Setting ARCH_HAS_ACQUIRE_RELEASE indicates that the architecture
> > +	  supports load acquire and release. Typically these are more effective
> > +	  than memory barriers. Code will prefer the use of load acquire and
> > +	  store release over memory barriers if this option is enabled.
> > +
>
> Unsurprisingly, I'd be in favour of making this unconditional rather than
> adding a new Kconfig option. Would that actually hurt any architectures
> where we care about the last few shreds of performance?

Other arches do not have acquire / release and will create additional
barriers in the fallback implementation of smp_load_acquire. So it needs
to be an arch config option.

> > +	if (USE_COND_LOAD_ACQUIRE)
> > +		return smp_cond_load_acquire((unsigned int *)&s->sequence, (s->sequence & 1) == 0);
>
> This looks wrong to me.
>
> The conditional expression passed to smp_cond_load_acquire() should be
> written in terms of 'VAL', otherwise you're introducing an additional
> non-atomic access to the sequence counter.

Hmmm... The compiler seems to have optimized that out. Will use VAL in
next rollup.

Christoph Lameter (Ampere) Sept. 18, 2024, 11:11 a.m. UTC | #10

On Tue, 17 Sep 2024, Thomas Gleixner wrote:

> We ....
>
> Please write changelogs using passive voice. 'We' means nothing here.

Done using active voice.

Linus Torvalds Sept. 18, 2024, 3:22 p.m. UTC | #11

On Wed, 18 Sept 2024 at 13:15, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
>
> Other arches do not have acquire / release and will create additional
> barriers in the fallback implementation of smp_load_acquire. So it needs
> to be an arch config option.

Actually, I looked at a few cases, and it doesn't really seem to be true.

For example, powerpc doesn't have a "native" acquire model, but both
smp_load_acquire() and smp_rmb() end up being  LWSYNC after the load
(which in the good case is a "lwsync" instruction, in bad case it's a
heavier "sync" instruction on older cores, but the point is that it's
the same thing for smp_rmb() and for smp_load_acquire()).

So on powerpc, smp_load_acquire() isn't any better than
"READ_ONCE()+smp_rmb()", but it also isn't any worse.

And at least alpha is the same - it doesn't have smp_load_acquire(),
and it falls back on a full memory barrier for that case - but that's
what smp_rmb() is too. However, because READ_ONCE() on alpha already
contains a smp_mb(), it turns out that on alpha having "READ_ONCE +
smp_rmb()" actually results in *two* barriers, while a
"smp_load_acquire()" is just one.

And obviously technically x86 doesn't have explicit acquire, but with
every load being an acquire, it's a no-op either way.

So on at least three very different architectures, smp_load_acquire()
is at least no worse than READ_ONCE() followed by a smp_rmb(). And on
alpha and arm64, it's better.

So it does look like making it conditional doesn't actually buy us
anything. We might as well just unconditionally use the
smp_load_acquire() over "READ_ONCE+smp_rmb".

Other random architectures from a quick look:

RISC-V technically turns smp_rmb() into a "fence r,r", while a
smp_load_acquire() ends up being a "fence r,rw", so technically the
fences are different. But honestly, any microarchitecture that makes
those two be different is just crazy garbage (there's never any valid
reason to move later writes up before earlier reads).

Loongarch has acquire and is better off with it.

parisc has acquire and is better off with it.

s390 and sparc64 are like x86, in that it's just a build barrier either way.

End result: let's just simplify the patch and make it entirely unconditional.

                 Linus

Linus Torvalds Sept. 23, 2024, 4:28 p.m. UTC | #12

On Wed, 18 Sept 2024 at 08:22, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 18 Sept 2024 at 13:15, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
> >
> > Other arches do not have acquire / release and will create additional
> > barriers in the fallback implementation of smp_load_acquire. So it needs
> > to be an arch config option.
>
> Actually, I looked at a few cases, and it doesn't really seem to be true.

Bah. I ended up just committing the minimal version of this all. I
gave Christoph credit for the commit, because I stole his commit
message, and he did most of the work, I just ended up going "simplify,
simplify, simplify".

I doubt anybody will notice, and smp_load_acquire() is the future. Any
architecture that does badly on it just doesn't matter (and, as
mentioned, I don't think they even exist - "smp_rmb()" is generally at
least as expensive).

                 Linus

Peter Zijlstra Oct. 23, 2024, 7:45 p.m. UTC | #13

On Mon, Sep 23, 2024 at 09:28:31AM -0700, Linus Torvalds wrote:
> On Wed, 18 Sept 2024 at 08:22, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Wed, 18 Sept 2024 at 13:15, Christoph Lameter (Ampere) <cl@gentwo.org> wrote:
> > >
> > > Other arches do not have acquire / release and will create additional
> > > barriers in the fallback implementation of smp_load_acquire. So it needs
> > > to be an arch config option.
> >
> > Actually, I looked at a few cases, and it doesn't really seem to be true.
> 
> Bah. I ended up just committing the minimal version of this all. I
> gave Christoph credit for the commit, because I stole his commit
> message, and he did most of the work, I just ended up going "simplify,
> simplify, simplify".
> 
> I doubt anybody will notice, and smp_load_acquire() is the future. Any
> architecture that does badly on it just doesn't matter (and, as
> mentioned, I don't think they even exist - "smp_rmb()" is generally at
> least as expensive).

Do we want to do the complementing patch and make write_seqcount_end()
use smp_store_release() ?

I think at least ARM (the 32bit thing) has wmb but uses mb for
store_release. But I also think I don't really care about that.

Linus Torvalds Oct. 23, 2024, 8:34 p.m. UTC | #14

On Wed, 23 Oct 2024 at 12:45, Peter Zijlstra <peterz@infradead.org> wrote:
>
> Do we want to do the complementing patch and make write_seqcount_end()
> use smp_store_release() ?
>
> I think at least ARM (the 32bit thing) has wmb but uses mb for
> store_release. But I also think I don't really care about that.

So unlike the "acquire vs rmb", there are architectures where "wmb" is
noticeably cheaper than a "store release".

Just as an example, on alpha, a "store release" is a full memory
barrier followed by the store, because it needs to serialize previous
loads too. But wmp_wmb() is lightweight.

Typically in traditional (pre acquire/release) architectures "wmb"
only ordered the CPU write queues, so "wmb" has always been cheap
pretty much everywhere.

And I *suspect* that alpha isn't the outlier in having a much cheaper
wmb than store-release.

But yeah, it's kind of ugly how we now have three completely different
orderings for seqcounts:

 - the initial load is done with the smp_read_acquire

 - the final load (the "retry") is done with a smp_rmb (because an
acquire orders _subsequent_ loads, not the ones inside the lock: we'd
actually want a "smp_load_release()", but such a thing doesn't exist)

 - the writer side uses smp_wmb

(and arguably there's a fourth pattern: the latching cases uses double
smp_wmb, because it orders the sequence count wrt both preceding and
subsequent stores)

Anyway, obviously on x86 (and s390) none of this matters.

On arm64, I _suspect_ they are mostly the same, but it's going to be
very microarchitecture-dependent. Neither should be expensive, but wmb
really is a fundamentally lightweight operation.

On 32-bit arm, wmb should be cheaper ("ishst" only waits for earlier stores).

On powerpc, wmb is cheaper on older CPU's (eieio vs sync), but the
same on newer CPUs (lwsync).

On alpha, wmb is definitely cheaper, but I doubt anybody really cares.

Others? I stopped looking, and am not familiar enough.

          Linus

Christoph Lameter (Ampere) Oct. 23, 2024, 11:42 p.m. UTC | #15

On Wed, 23 Oct 2024, Peter Zijlstra wrote:

> > I doubt anybody will notice, and smp_load_acquire() is the future. Any
> > architecture that does badly on it just doesn't matter (and, as
> > mentioned, I don't think they even exist - "smp_rmb()" is generally at
> > least as expensive).
>
> Do we want to do the complementing patch and make write_seqcount_end()
> use smp_store_release() ?
>
> I think at least ARM (the 32bit thing) has wmb but uses mb for
> store_release. But I also think I don't really care about that.

The proper instruction would be something like

atomic_inc_release(&seqcount)

The current atomics do not provide such a macro.

The closest in the current tree is atomic_inc_return_release().

We would prefer atomic_inc_release(&seqcount) because such an
atomic may be executed as a far atomic in the ARM mesh.

This could be cheaper than a local atomic and could f.e. be executed
on the memory controller of a remote NUMA node in order to avoid a costly
transfer of cacheline ownership.

The code generated is a atomic that also does a release. So there would be
no extra barrier etc needed.

Peter Zijlstra Oct. 25, 2024, 7:42 a.m. UTC | #16

On Wed, Oct 23, 2024 at 04:42:36PM -0700, Christoph Lameter (Ampere) wrote:
> On Wed, 23 Oct 2024, Peter Zijlstra wrote:
> 
> > > I doubt anybody will notice, and smp_load_acquire() is the future. Any
> > > architecture that does badly on it just doesn't matter (and, as
> > > mentioned, I don't think they even exist - "smp_rmb()" is generally at
> > > least as expensive).
> >
> > Do we want to do the complementing patch and make write_seqcount_end()
> > use smp_store_release() ?
> >
> > I think at least ARM (the 32bit thing) has wmb but uses mb for
> > store_release. But I also think I don't really care about that.
> 
> The proper instruction would be something like
> 
> atomic_inc_release(&seqcount)

It would not be, making the increment itself atomic would make the whole
thing far more expensive.

Christoph Lameter (Ampere) Oct. 25, 2024, 7:30 p.m. UTC | #17

On Fri, 25 Oct 2024, Peter Zijlstra wrote:

> > atomic_inc_release(&seqcount)
>
> It would not be, making the increment itself atomic would make the whole
> thing far more expensive.

True. I was too much taken by a cool hw feature on ARM.

Will Deacon Oct. 28, 2024, 2:10 p.m. UTC | #18

On Wed, Oct 23, 2024 at 01:34:16PM -0700, Linus Torvalds wrote:
> On Wed, 23 Oct 2024 at 12:45, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Do we want to do the complementing patch and make write_seqcount_end()
> > use smp_store_release() ?
> >
> > I think at least ARM (the 32bit thing) has wmb but uses mb for
> > store_release. But I also think I don't really care about that.
> 
> So unlike the "acquire vs rmb", there are architectures where "wmb" is
> noticeably cheaper than a "store release".
> 
> Just as an example, on alpha, a "store release" is a full memory
> barrier followed by the store, because it needs to serialize previous
> loads too. But wmp_wmb() is lightweight.
> 
> Typically in traditional (pre acquire/release) architectures "wmb"
> only ordered the CPU write queues, so "wmb" has always been cheap
> pretty much everywhere.
> 
> And I *suspect* that alpha isn't the outlier in having a much cheaper
> wmb than store-release.
> 
> But yeah, it's kind of ugly how we now have three completely different
> orderings for seqcounts:
> 
>  - the initial load is done with the smp_read_acquire
> 
>  - the final load (the "retry") is done with a smp_rmb (because an
> acquire orders _subsequent_ loads, not the ones inside the lock: we'd
> actually want a "smp_load_release()", but such a thing doesn't exist)
> 
>  - the writer side uses smp_wmb
> 
> (and arguably there's a fourth pattern: the latching cases uses double
> smp_wmb, because it orders the sequence count wrt both preceding and
> subsequent stores)
> 
> Anyway, obviously on x86 (and s390) none of this matters.
> 
> On arm64, I _suspect_ they are mostly the same, but it's going to be
> very microarchitecture-dependent. Neither should be expensive, but wmb
> really is a fundamentally lightweight operation.

I agree here. An STLR additionally orders PO-prior loads on arm64, so
I'd stick with the wmb().

Will

[v3] Avoid memory barrier in read_seqcount() through load acquire

Commit Message

Comments

Patch