diff mbox series

[v3] Avoid memory barrier in read_seqcount() through load acquire

Message ID 20240912-seq_optimize-v3-1-8ee25e04dffa@gentwo.org (mailing list archive)
State New
Headers show
Series [v3] Avoid memory barrier in read_seqcount() through load acquire | expand

Commit Message

Christoph Lameter via B4 Relay Sept. 12, 2024, 10:44 p.m. UTC
From: "Christoph Lameter (Ampere)" <cl@gentwo.org>

Some architectures support load acquire which can save us a memory
barrier and save some cycles.

A typical sequence

	do {
		seq = read_seqcount_begin(&s);
		<something>
	} while (read_seqcount_retry(&s, seq);

requires 13 cycles on ARM64 for an empty loop. Two read memory
barriers are needed. One for each of the seqcount_* functions.

We can replace the first read barrier with a load acquire of
the seqcount which saves us one barrier.

On ARM64 doing so reduces the cycle count from 13 to 8.

This is a general improvement for the ARM64 architecture and not
specific to a certain processor. The cycle count here was
obtained on a Neoverse N1 (Ampere Altra).

We can further optimize handling by using the cond_load_acquire logic
which will give an ARM CPU a chance to enter some power saving mode
while waiting for changes to a cacheline thereby avoiding busy loops
and therefore saving power.

The ARM documentation states that load acquire is more effective
than a load plus barrier. In general that tends to be true on all
compute platforms that support both.

See (as quoted by Linus Torvalds):
   https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

 "Weaker ordering requirements that are imposed by Load-Acquire and
  Store-Release instructions allow for micro-architectural
  optimizations, which could reduce some of the performance impacts that
  are otherwise imposed by an explicit memory barrier.

  If the ordering requirement is satisfied using either a Load-Acquire
  or Store-Release, then it would be preferable to use these
  instructions instead of a DMB"

The patch benefited significantly from the knowledge of the innards
of the seqlock code by Thomas Gleixner.

Signed-off-by: Christoph Lameter (Ampere) <cl@gentwo.org>
---
V1->V2
- Describe the benefit of load acquire vs barriers
- Explain the CONFIG_ARCH_HAS_ACQUIRE_RELEASE option better
---
Changes in v3:
- Support cond_load_acquire to give the processor a chance to do some
  sort of power down until cacheline changes.
- Better code by Thomas Gleixner
- Link to v2: https://lore.kernel.org/r/20240819-seq_optimize-v2-1-9d0da82b022f@gentwo.org
---
 arch/Kconfig            |  8 +++++
 arch/arm64/Kconfig      |  1 +
 include/linux/seqlock.h | 85 ++++++++++++++++++++++++++++++++++++-------------
 3 files changed, 71 insertions(+), 23 deletions(-)


---
base-commit: 77f587896757708780a7e8792efe62939f25a5ab
change-id: 20240813-seq_optimize-68c48696c798

Best regards,

Comments

kernel test robot Sept. 13, 2024, 1:41 p.m. UTC | #1
Hi Christoph,

kernel test robot noticed the following build errors:

[auto build test ERROR on 77f587896757708780a7e8792efe62939f25a5ab]

url:    https://github.com/intel-lab-lkp/linux/commits/Christoph-Lameter-via-B4-Relay/Avoid-memory-barrier-in-read_seqcount-through-load-acquire/20240913-064557
base:   77f587896757708780a7e8792efe62939f25a5ab
patch link:    https://lore.kernel.org/r/20240912-seq_optimize-v3-1-8ee25e04dffa%40gentwo.org
patch subject: [PATCH v3] Avoid memory barrier in read_seqcount() through load acquire
config: x86_64-defconfig (https://download.01.org/0day-ci/archive/20240913/202409132135.ki3Mp5EA-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240913/202409132135.ki3Mp5EA-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409132135.ki3Mp5EA-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/gt/intel_gt.c:36:
   drivers/gpu/drm/i915/gt/intel_tlb.h: In function 'intel_gt_tlb_seqno':
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro "seqprop_sequence" requires 2 arguments, but only 1 given
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                                               ^
   In file included from include/linux/mmzone.h:17,
                    from include/linux/gfp.h:7,
                    from include/drm/drm_managed.h:6,
                    from drivers/gpu/drm/i915/gt/intel_gt.c:6:
   include/linux/seqlock.h:280: note: macro "seqprop_sequence" defined here
     280 | #define seqprop_sequence(s, a)          __seqprop(s, sequence)(s, a)
         | 
   In file included from drivers/gpu/drm/i915/gt/intel_gt.c:36:
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:16: error: 'seqprop_sequence' undeclared (first use in this function)
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                ^~~~~~~~~~~~~~~~
   drivers/gpu/drm/i915/gt/intel_tlb.h:21:16: note: each undeclared identifier is reported only once for each function it appears in
--
   In file included from drivers/gpu/drm/i915/gt/intel_tlb.c:14:
   drivers/gpu/drm/i915/gt/intel_tlb.h: In function 'intel_gt_tlb_seqno':
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro "seqprop_sequence" requires 2 arguments, but only 1 given
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                                               ^
   In file included from include/linux/mmzone.h:17,
                    from include/linux/gfp.h:7,
                    from include/linux/xarray.h:16,
                    from include/linux/radix-tree.h:21,
                    from include/linux/idr.h:15,
                    from include/linux/kernfs.h:12,
                    from include/linux/sysfs.h:16,
                    from include/linux/kobject.h:20,
                    from include/linux/energy_model.h:7,
                    from include/linux/device.h:16,
                    from include/linux/pm_qos.h:17,
                    from drivers/gpu/drm/i915/i915_drv.h:35,
                    from drivers/gpu/drm/i915/gt/intel_tlb.c:6:
   include/linux/seqlock.h:280: note: macro "seqprop_sequence" defined here
     280 | #define seqprop_sequence(s, a)          __seqprop(s, sequence)(s, a)
         | 
   In file included from drivers/gpu/drm/i915/gt/intel_tlb.c:14:
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:16: error: 'seqprop_sequence' undeclared (first use in this function)
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                ^~~~~~~~~~~~~~~~
   drivers/gpu/drm/i915/gt/intel_tlb.h:21:16: note: each undeclared identifier is reported only once for each function it appears in
>> drivers/gpu/drm/i915/gt/intel_tlb.h:22:1: warning: control reaches end of non-void function [-Wreturn-type]
      22 | }
         | ^


vim +/seqprop_sequence +21 drivers/gpu/drm/i915/gt/intel_tlb.h

568a2e6f0b12ee Chris Wilson 2023-08-01  18  
568a2e6f0b12ee Chris Wilson 2023-08-01  19  static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
568a2e6f0b12ee Chris Wilson 2023-08-01  20  {
568a2e6f0b12ee Chris Wilson 2023-08-01 @21  	return seqprop_sequence(&gt->tlb.seqno);
568a2e6f0b12ee Chris Wilson 2023-08-01 @22  }
568a2e6f0b12ee Chris Wilson 2023-08-01  23
kernel test robot Sept. 13, 2024, 1:41 p.m. UTC | #2
Hi Christoph,

kernel test robot noticed the following build errors:

[auto build test ERROR on 77f587896757708780a7e8792efe62939f25a5ab]

url:    https://github.com/intel-lab-lkp/linux/commits/Christoph-Lameter-via-B4-Relay/Avoid-memory-barrier-in-read_seqcount-through-load-acquire/20240913-064557
base:   77f587896757708780a7e8792efe62939f25a5ab
patch link:    https://lore.kernel.org/r/20240912-seq_optimize-v3-1-8ee25e04dffa%40gentwo.org
patch subject: [PATCH v3] Avoid memory barrier in read_seqcount() through load acquire
config: i386-buildonly-randconfig-001-20240913 (https://download.01.org/0day-ci/archive/20240913/202409132145.0UdNx9kr-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240913/202409132145.0UdNx9kr-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409132145.0UdNx9kr-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from drivers/gpu/drm/i915/gem/i915_gem_pages.c:11:
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:40: error: too few arguments provided to function-like macro invocation
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                                               ^
   include/linux/seqlock.h:280:9: note: macro 'seqprop_sequence' defined here
     280 | #define seqprop_sequence(s, a)          __seqprop(s, sequence)(s, a)
         |         ^
   In file included from drivers/gpu/drm/i915/gem/i915_gem_pages.c:11:
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:9: error: use of undeclared identifier 'seqprop_sequence'; did you mean '__seqprop_sequence'?
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                ^~~~~~~~~~~~~~~~
         |                __seqprop_sequence
   include/linux/seqlock.h:228:24: note: '__seqprop_sequence' declared here
     228 | static inline unsigned __seqprop_sequence(const seqcount_t *s, bool acquire)
         |                        ^
   In file included from drivers/gpu/drm/i915/gem/i915_gem_pages.c:11:
>> drivers/gpu/drm/i915/gt/intel_tlb.h:21:9: error: incompatible pointer to integer conversion returning 'unsigned int (const seqcount_t *, bool)' (aka 'unsigned int (const struct seqcount *, _Bool)') from a function with result type 'u32' (aka 'unsigned int') [-Wint-conversion]
      21 |         return seqprop_sequence(&gt->tlb.seqno);
         |                ^~~~~~~~~~~~~~~~
   3 errors generated.


vim +21 drivers/gpu/drm/i915/gt/intel_tlb.h

568a2e6f0b12ee Chris Wilson 2023-08-01  18  
568a2e6f0b12ee Chris Wilson 2023-08-01  19  static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
568a2e6f0b12ee Chris Wilson 2023-08-01  20  {
568a2e6f0b12ee Chris Wilson 2023-08-01 @21  	return seqprop_sequence(&gt->tlb.seqno);
568a2e6f0b12ee Chris Wilson 2023-08-01  22  }
568a2e6f0b12ee Chris Wilson 2023-08-01  23
Christoph Lameter (Ampere) Sept. 16, 2024, 5:52 p.m. UTC | #3
On Fri, 13 Sep 2024, kernel test robot wrote:

> >> drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro "seqprop_sequence" requires 2 arguments, but only 1 given

From 15d86bc9589f16947c5fb0f34d2947eacd48f853 Mon Sep 17 00:00:00 2001
From: Christoph Lameter <cl@gentwo.org>
Date: Mon, 16 Sep 2024 10:44:16 -0700
Subject: [PATCH] Update Intel DRM use of seqprop_sequence

One of Intels drivers uses seqprop_sequence() for its tlb sequencing.
We added a parameter so that we can use acquire. Its pretty safe to
assume that this will work without acquire.

Signed-off-by: Christoph Lameter <cl@linux.com>
---
 drivers/gpu/drm/i915/gt/intel_tlb.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.h b/drivers/gpu/drm/i915/gt/intel_tlb.h
index 337327af92ac..81998c4cd4fb 100644
--- a/drivers/gpu/drm/i915/gt/intel_tlb.h
+++ b/drivers/gpu/drm/i915/gt/intel_tlb.h
@@ -18,7 +18,7 @@ void intel_gt_fini_tlb(struct intel_gt *gt);

 static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
 {
-	return seqprop_sequence(&gt->tlb.seqno);
+	return seqprop_sequence(&gt->tlb.seqno, false);
 }

 static inline u32 intel_gt_next_invalidate_tlb_full(const struct intel_gt *gt)
Will Deacon Sept. 17, 2024, 7:12 a.m. UTC | #4
Hi Christoph,

On Thu, Sep 12, 2024 at 03:44:08PM -0700, Christoph Lameter via B4 Relay wrote:
> diff --git a/arch/Kconfig b/arch/Kconfig
> index 975dd22a2dbd..3c270f496231 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -1600,6 +1600,14 @@ config ARCH_HAS_KERNEL_FPU_SUPPORT
>  	  Architectures that select this option can run floating-point code in
>  	  the kernel, as described in Documentation/core-api/floating-point.rst.
>  
> +config ARCH_HAS_ACQUIRE_RELEASE
> +	bool
> +	help
> +	  Setting ARCH_HAS_ACQUIRE_RELEASE indicates that the architecture
> +	  supports load acquire and release. Typically these are more effective
> +	  than memory barriers. Code will prefer the use of load acquire and
> +	  store release over memory barriers if this option is enabled.
> +

Unsurprisingly, I'd be in favour of making this unconditional rather than
adding a new Kconfig option. Would that actually hurt any architectures
where we care about the last few shreds of performance?

>  source "kernel/gcov/Kconfig"
>  
>  source "scripts/gcc-plugins/Kconfig"
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a2f8ff354ca6..19e34fff145f 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -39,6 +39,7 @@ config ARM64
>  	select ARCH_HAS_PTE_DEVMAP
>  	select ARCH_HAS_PTE_SPECIAL
>  	select ARCH_HAS_HW_PTE_YOUNG
> +	select ARCH_HAS_ACQUIRE_RELEASE
>  	select ARCH_HAS_SETUP_DMA_OPS
>  	select ARCH_HAS_SET_DIRECT_MAP
>  	select ARCH_HAS_SET_MEMORY
> diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
> index d90d8ee29d81..a3fe9ee8edef 100644
> --- a/include/linux/seqlock.h
> +++ b/include/linux/seqlock.h
> @@ -23,6 +23,13 @@
>  
>  #include <asm/processor.h>
>  
> +#ifdef CONFIG_ARCH_HAS_ACQUIRE_RELEASE
> +# define USE_LOAD_ACQUIRE	true
> +# define USE_COND_LOAD_ACQUIRE	!IS_ENABLED(CONFIG_PREEMPT_RT)
> +#else
> +# define USE_LOAD_ACQUIRE	false
> +# define USE_COND_LOAD_ACQUIRE	false
> +#endif
>  /*
>   * The seqlock seqcount_t interface does not prescribe a precise sequence of
>   * read begin/retry/end. For readers, typically there is a call to
> @@ -132,6 +139,17 @@ static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
>  #define seqcount_rwlock_init(s, lock)		seqcount_LOCKNAME_init(s, lock, rwlock)
>  #define seqcount_mutex_init(s, lock)		seqcount_LOCKNAME_init(s, lock, mutex)
>  
> +static __always_inline unsigned __seqprop_load_sequence(const seqcount_t *s, bool acquire)
> +{
> +	if (!acquire || !USE_LOAD_ACQUIRE)
> +		return READ_ONCE(s->sequence);
> +
> +	if (USE_COND_LOAD_ACQUIRE)
> +		return smp_cond_load_acquire((unsigned int *)&s->sequence, (s->sequence & 1) == 0);

This looks wrong to me.

The conditional expression passed to smp_cond_load_acquire() should be
written in terms of 'VAL', otherwise you're introducing an additional
non-atomic access to the sequence counter.

Will
Will Deacon Sept. 17, 2024, 7:37 a.m. UTC | #5
On Mon, Sep 16, 2024 at 10:52:18AM -0700, Christoph Lameter (Ampere) wrote:
> On Fri, 13 Sep 2024, kernel test robot wrote:
> 
> > >> drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro "seqprop_sequence" requires 2 arguments, but only 1 given
> 
> From 15d86bc9589f16947c5fb0f34d2947eacd48f853 Mon Sep 17 00:00:00 2001
> From: Christoph Lameter <cl@gentwo.org>
> Date: Mon, 16 Sep 2024 10:44:16 -0700
> Subject: [PATCH] Update Intel DRM use of seqprop_sequence
> 
> One of Intels drivers uses seqprop_sequence() for its tlb sequencing.
> We added a parameter so that we can use acquire. Its pretty safe to
> assume that this will work without acquire.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_tlb.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.h b/drivers/gpu/drm/i915/gt/intel_tlb.h
> index 337327af92ac..81998c4cd4fb 100644
> --- a/drivers/gpu/drm/i915/gt/intel_tlb.h
> +++ b/drivers/gpu/drm/i915/gt/intel_tlb.h
> @@ -18,7 +18,7 @@ void intel_gt_fini_tlb(struct intel_gt *gt);
> 
>  static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
>  {
> -	return seqprop_sequence(&gt->tlb.seqno);
> +	return seqprop_sequence(&gt->tlb.seqno, false);
>  }

Yikes, why is the driver using the seqlock internals here? It's a bit of
a pity, as a quick grep suggest that this is the _only_ user of
'seqcount_mutex_t', yet it's still having to work around the API.

Will
Thomas Gleixner Sept. 17, 2024, 11:50 a.m. UTC | #6
Cc+ i915 people

On Tue, Sep 17 2024 at 08:37, Will Deacon wrote:
> On Mon, Sep 16, 2024 at 10:52:18AM -0700, Christoph Lameter (Ampere) wrote:
>> On Fri, 13 Sep 2024, kernel test robot wrote:
>> 
>> > >> drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro "seqprop_sequence" requires 2 arguments, but only 1 given
>> 
>> From 15d86bc9589f16947c5fb0f34d2947eacd48f853 Mon Sep 17 00:00:00 2001
>> From: Christoph Lameter <cl@gentwo.org>
>> Date: Mon, 16 Sep 2024 10:44:16 -0700
>> Subject: [PATCH] Update Intel DRM use of seqprop_sequence
>> 
>> One of Intels drivers uses seqprop_sequence() for its tlb sequencing.
>> We added a parameter so that we can use acquire. Its pretty safe to
>> assume that this will work without acquire.
>>
>> Signed-off-by: Christoph Lameter <cl@linux.com>
>> ---
>>  drivers/gpu/drm/i915/gt/intel_tlb.h | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.h b/drivers/gpu/drm/i915/gt/intel_tlb.h
>> index 337327af92ac..81998c4cd4fb 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_tlb.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_tlb.h
>> @@ -18,7 +18,7 @@ void intel_gt_fini_tlb(struct intel_gt *gt);
>> 
>>  static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
>>  {
>> -	return seqprop_sequence(&gt->tlb.seqno);
>> +	return seqprop_sequence(&gt->tlb.seqno, false);
>>  }
>
> Yikes, why is the driver using the seqlock internals here? It's a bit of
> a pity, as a quick grep suggest that this is the _only_ user of
> 'seqcount_mutex_t', yet it's still having to work around the API.

Why the hell can't i915 use the proper interfaces and has to bypass the
core code? Just because C allows that does not make it correct.

Can the i915 people please remove this blatant violation of layering?

Thanks,

        tglx
Thomas Gleixner Sept. 17, 2024, 11:52 a.m. UTC | #7
On Thu, Sep 12 2024 at 15:44, Christoph Lameter via wrote:

$Subject: .....

You still fail to provide a proper subsystem prefix. It's not rocket science.

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#patch-subject

> From: "Christoph Lameter (Ampere)" <cl@gentwo.org>
>
> Some architectures support load acquire which can save us a memory
> barrier and save some cycles.
>
> A typical sequence
>
> 	do {
> 		seq = read_seqcount_begin(&s);
> 		<something>
> 	} while (read_seqcount_retry(&s, seq);
>
> requires 13 cycles on ARM64 for an empty loop. Two read memory
> barriers are needed. One for each of the seqcount_* functions.
>
> We can replace the first read barrier with a load acquire of
> the seqcount which saves us one barrier.

We ....

Please write changelogs using passive voice. 'We' means nothing here.

Thanks,

        tglx
Vivi, Rodrigo Sept. 18, 2024, 12:45 a.m. UTC | #8
On Tue, 2024-09-17 at 13:50 +0200, Thomas Gleixner wrote:
> Cc+ i915 people
> 
> On Tue, Sep 17 2024 at 08:37, Will Deacon wrote:
> > On Mon, Sep 16, 2024 at 10:52:18AM -0700, Christoph Lameter
> > (Ampere) wrote:
> > > On Fri, 13 Sep 2024, kernel test robot wrote:
> > > 
> > > > > > drivers/gpu/drm/i915/gt/intel_tlb.h:21:47: error: macro
> > > > > > "seqprop_sequence" requires 2 arguments, but only 1 given
> > > 
> > > From 15d86bc9589f16947c5fb0f34d2947eacd48f853 Mon Sep 17 00:00:00
> > > 2001
> > > From: Christoph Lameter <cl@gentwo.org>
> > > Date: Mon, 16 Sep 2024 10:44:16 -0700
> > > Subject: [PATCH] Update Intel DRM use of seqprop_sequence
> > > 
> > > One of Intels drivers uses seqprop_sequence() for its tlb
> > > sequencing.
> > > We added a parameter so that we can use acquire. Its pretty safe
> > > to
> > > assume that this will work without acquire.
> > > 
> > > Signed-off-by: Christoph Lameter <cl@linux.com>
> > > ---
> > >  drivers/gpu/drm/i915/gt/intel_tlb.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_tlb.h
> > > b/drivers/gpu/drm/i915/gt/intel_tlb.h
> > > index 337327af92ac..81998c4cd4fb 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_tlb.h
> > > +++ b/drivers/gpu/drm/i915/gt/intel_tlb.h
> > > @@ -18,7 +18,7 @@ void intel_gt_fini_tlb(struct intel_gt *gt);
> > > 
> > >  static inline u32 intel_gt_tlb_seqno(const struct intel_gt *gt)
> > >  {
> > > -	return seqprop_sequence(&gt->tlb.seqno);
> > > +	return seqprop_sequence(&gt->tlb.seqno, false);
> > >  }
> > 
> > Yikes, why is the driver using the seqlock internals here? It's a
> > bit of
> > a pity, as a quick grep suggest that this is the _only_ user of
> > 'seqcount_mutex_t', yet it's still having to work around the API.
> 
> Why the hell can't i915 use the proper interfaces and has to bypass
> the
> core code? Just because C allows that does not make it correct.
> 
> Can the i915 people please remove this blatant violation of layering?

Yeap, we gotta remove this. Just need to be careful on this TLB
invalidation code without causing some funny deadlocks...

> 
> Thanks,
> 
>         tglx
>
diff mbox series

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index 975dd22a2dbd..3c270f496231 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1600,6 +1600,14 @@  config ARCH_HAS_KERNEL_FPU_SUPPORT
 	  Architectures that select this option can run floating-point code in
 	  the kernel, as described in Documentation/core-api/floating-point.rst.
 
+config ARCH_HAS_ACQUIRE_RELEASE
+	bool
+	help
+	  Setting ARCH_HAS_ACQUIRE_RELEASE indicates that the architecture
+	  supports load acquire and release. Typically these are more effective
+	  than memory barriers. Code will prefer the use of load acquire and
+	  store release over memory barriers if this option is enabled.
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a2f8ff354ca6..19e34fff145f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -39,6 +39,7 @@  config ARM64
 	select ARCH_HAS_PTE_DEVMAP
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_HW_PTE_YOUNG
+	select ARCH_HAS_ACQUIRE_RELEASE
 	select ARCH_HAS_SETUP_DMA_OPS
 	select ARCH_HAS_SET_DIRECT_MAP
 	select ARCH_HAS_SET_MEMORY
diff --git a/include/linux/seqlock.h b/include/linux/seqlock.h
index d90d8ee29d81..a3fe9ee8edef 100644
--- a/include/linux/seqlock.h
+++ b/include/linux/seqlock.h
@@ -23,6 +23,13 @@ 
 
 #include <asm/processor.h>
 
+#ifdef CONFIG_ARCH_HAS_ACQUIRE_RELEASE
+# define USE_LOAD_ACQUIRE	true
+# define USE_COND_LOAD_ACQUIRE	!IS_ENABLED(CONFIG_PREEMPT_RT)
+#else
+# define USE_LOAD_ACQUIRE	false
+# define USE_COND_LOAD_ACQUIRE	false
+#endif
 /*
  * The seqlock seqcount_t interface does not prescribe a precise sequence of
  * read begin/retry/end. For readers, typically there is a call to
@@ -132,6 +139,17 @@  static inline void seqcount_lockdep_reader_access(const seqcount_t *s)
 #define seqcount_rwlock_init(s, lock)		seqcount_LOCKNAME_init(s, lock, rwlock)
 #define seqcount_mutex_init(s, lock)		seqcount_LOCKNAME_init(s, lock, mutex)
 
+static __always_inline unsigned __seqprop_load_sequence(const seqcount_t *s, bool acquire)
+{
+	if (!acquire || !USE_LOAD_ACQUIRE)
+		return READ_ONCE(s->sequence);
+
+	if (USE_COND_LOAD_ACQUIRE)
+		return smp_cond_load_acquire((unsigned int *)&s->sequence, (s->sequence & 1) == 0);
+
+	return smp_load_acquire(&s->sequence);
+}
+
 /*
  * SEQCOUNT_LOCKNAME()	- Instantiate seqcount_LOCKNAME_t and helpers
  * seqprop_LOCKNAME_*()	- Property accessors for seqcount_LOCKNAME_t
@@ -155,9 +173,10 @@  __seqprop_##lockname##_const_ptr(const seqcount_##lockname##_t *s)	\
 }									\
 									\
 static __always_inline unsigned						\
-__seqprop_##lockname##_sequence(const seqcount_##lockname##_t *s)	\
+__seqprop_##lockname##_sequence(const seqcount_##lockname##_t *s,	\
+				bool acquire)				\
 {									\
-	unsigned seq = READ_ONCE(s->seqcount.sequence);			\
+	unsigned seq = __seqprop_load_sequence(&s->seqcount, acquire);	\
 									\
 	if (!IS_ENABLED(CONFIG_PREEMPT_RT))				\
 		return seq;						\
@@ -170,7 +189,7 @@  __seqprop_##lockname##_sequence(const seqcount_##lockname##_t *s)	\
 		 * Re-read the sequence counter since the (possibly	\
 		 * preempted) writer made progress.			\
 		 */							\
-		seq = READ_ONCE(s->seqcount.sequence);			\
+		seq = __seqprop_load_sequence(&s->seqcount, acquire);	\
 	}								\
 									\
 	return seq;							\
@@ -206,9 +225,9 @@  static inline const seqcount_t *__seqprop_const_ptr(const seqcount_t *s)
 	return s;
 }
 
-static inline unsigned __seqprop_sequence(const seqcount_t *s)
+static inline unsigned __seqprop_sequence(const seqcount_t *s, bool acquire)
 {
-	return READ_ONCE(s->sequence);
+	return __seqprop_load_sequence(s, acquire);
 }
 
 static inline bool __seqprop_preemptible(const seqcount_t *s)
@@ -258,35 +277,53 @@  SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
 
 #define seqprop_ptr(s)			__seqprop(s, ptr)(s)
 #define seqprop_const_ptr(s)		__seqprop(s, const_ptr)(s)
-#define seqprop_sequence(s)		__seqprop(s, sequence)(s)
+#define seqprop_sequence(s, a)		__seqprop(s, sequence)(s, a)
 #define seqprop_preemptible(s)		__seqprop(s, preemptible)(s)
 #define seqprop_assert(s)		__seqprop(s, assert)(s)
 
 /**
- * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier
- * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
- *
- * __read_seqcount_begin is like read_seqcount_begin, but has no smp_rmb()
- * barrier. Callers should ensure that smp_rmb() or equivalent ordering is
- * provided before actually loading any of the variables that are to be
- * protected in this critical section.
- *
- * Use carefully, only in critical code, and comment how the barrier is
- * provided.
+ * read_seqcount_begin_cond_acquire() - begin a seqcount_t read section
+ * @s:	     Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ * @acquire: If true, the read of the sequence count uses smp_load_acquire()
+ *	     if the architecure provides and enabled it.
  *
  * Return: count to be passed to read_seqcount_retry()
  */
-#define __read_seqcount_begin(s)					\
+#define read_seqcount_begin_cond_acquire(s, acquire)			\
 ({									\
 	unsigned __seq;							\
 									\
-	while ((__seq = seqprop_sequence(s)) & 1)			\
-		cpu_relax();						\
+	if (acquire && USE_COND_LOAD_ACQUIRE) {				\
+		__seq = seqprop_sequence(s, acquire);			\
+	} else {							\
+		while ((__seq = seqprop_sequence(s, acquire)) & 1)	\
+			cpu_relax();					\
+	}								\
 									\
 	kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);			\
 	__seq;								\
 })
 
+/**
+ * __read_seqcount_begin() - begin a seqcount_t read section w/o barrier
+ * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
+ *
+ * __read_seqcount_begin is like read_seqcount_begin, but it neither
+ * provides a smp_rmb() barrier nor does it use smp_load_acquire() on
+ * architectures which provide it.
+ *
+ * Callers should ensure that smp_rmb() or equivalent ordering is provided
+ * before actually loading any of the variables that are to be protected in
+ * this critical section.
+ *
+ * Use carefully, only in critical code, and comment how the barrier is
+ * provided.
+ *
+ * Return: count to be passed to read_seqcount_retry()
+ */
+#define __read_seqcount_begin(s)					\
+	read_seqcount_begin_cond_acquire(s, false)
+
 /**
  * raw_read_seqcount_begin() - begin a seqcount_t read section w/o lockdep
  * @s: Pointer to seqcount_t or any of the seqcount_LOCKNAME_t variants
@@ -295,9 +332,10 @@  SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
  */
 #define raw_read_seqcount_begin(s)					\
 ({									\
-	unsigned _seq = __read_seqcount_begin(s);			\
+	unsigned _seq = read_seqcount_begin_cond_acquire(s, true);	\
 									\
-	smp_rmb();							\
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_ACQUIRE_RELEASE))		\
+		smp_rmb();						\
 	_seq;								\
 })
 
@@ -326,9 +364,10 @@  SEQCOUNT_LOCKNAME(mutex,        struct mutex,    true,     mutex)
  */
 #define raw_read_seqcount(s)						\
 ({									\
-	unsigned __seq = seqprop_sequence(s);				\
+	unsigned __seq = seqprop_sequence(s, true);			\
 									\
-	smp_rmb();							\
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_ACQUIRE_RELEASE))		\
+		smp_rmb();						\
 	kcsan_atomic_next(KCSAN_SEQLOCK_REGION_MAX);			\
 	__seq;								\
 })