Message ID: 20210127203627.47510-2-alexander.sverdlin@nokia.com (mailing list archive)
State:      New
Series:     MIPS: qspinlock: Try to reduce the spinlock regression
On Wed, Jan 27, 2021 at 09:36:22PM +0100, Alexander A Sverdlin wrote:
> From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>
> On Octeon smp_mb() translates to SYNC while wmb+rmb translates to SYNCW
> only. This brings around 10% performance on tight uncontended spinlock
> loops.
>
> Refer to commit 500c2e1fdbcc ("MIPS: Optimize spinlocks.") and the link
> below.
>
> On 6-core Octeon machine:
> sysbench --test=mutex --num-threads=64 --memory-scope=local run
>
> w/o patch:  1.60s
> with patch: 1.51s
>
> Link: https://lore.kernel.org/lkml/5644D08D.4080206@caviumnetworks.com/
> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> ---
>  arch/mips/include/asm/barrier.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
> index 49ff172..24c3f2c 100644
> --- a/arch/mips/include/asm/barrier.h
> +++ b/arch/mips/include/asm/barrier.h
> @@ -113,6 +113,15 @@ static inline void wmb(void)
>  	".set arch=octeon\n\t"		\
>  	"syncw\n\t"			\
>  	".set pop" : : : "memory")
> +
> +#define __smp_store_release(p, v)				\
> +do {								\
> +	compiletime_assert_atomic_type(*p);			\
> +	__smp_wmb();						\
> +	__smp_rmb();						\
> +	WRITE_ONCE(*p, v);					\
> +} while (0)

This is wrong in general since smp_rmb() will only provide order between
two loads and smp_store_release() is a store.

If this is correct for all MIPS, this needs a giant comment on exactly
how that smp_rmb() makes sense here.
Hello Peter,

On 27/01/2021 23:32, Peter Zijlstra wrote:
>> Link: https://lore.kernel.org/lkml/5644D08D.4080206@caviumnetworks.com/

please check the discussion pointed to by the link above...

>> Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
>> ---
>>  arch/mips/include/asm/barrier.h | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
>> index 49ff172..24c3f2c 100644
>> --- a/arch/mips/include/asm/barrier.h
>> +++ b/arch/mips/include/asm/barrier.h
>> @@ -113,6 +113,15 @@ static inline void wmb(void)
>>  	".set arch=octeon\n\t"		\
>>  	"syncw\n\t"			\
>>  	".set pop" : : : "memory")
>> +
>> +#define __smp_store_release(p, v)				\
>> +do {								\
>> +	compiletime_assert_atomic_type(*p);			\
>> +	__smp_wmb();						\
>> +	__smp_rmb();						\
>> +	WRITE_ONCE(*p, v);					\
>> +} while (0)
> This is wrong in general since smp_rmb() will only provide order between
> two loads and smp_store_release() is a store.
>
> If this is correct for all MIPS, this needs a giant comment on exactly
> how that smp_rmb() makes sense here.

... the macro is provided for Octeon only, and __smp_rmb() is actually a NOP
there, but I included it anyway to "document" the flow of thoughts from the
discussion above.
On Thu, Jan 28, 2021 at 08:27:29AM +0100, Alexander Sverdlin wrote:
> >> +#define __smp_store_release(p, v)				\
> >> +do {								\
> >> +	compiletime_assert_atomic_type(*p);			\
> >> +	__smp_wmb();						\
> >> +	__smp_rmb();						\
> >> +	WRITE_ONCE(*p, v);					\
> >> +} while (0)
> > This is wrong in general since smp_rmb() will only provide order between
> > two loads and smp_store_release() is a store.
> >
> > If this is correct for all MIPS, this needs a giant comment on exactly
> > how that smp_rmb() makes sense here.
>
> ... the macro is provided for Octeon only, and __smp_rmb() is actually a NOP
> there, but I thought to "document" the flow of thoughts from the discussion
> above by including it anyway.

Random discussions on the internet do not absolve you from having to
write coherent comments. Especially so where memory ordering is
concerned.

This, from commit 6b07d38aaa52 ("MIPS: Octeon: Use optimized memory
barrier primitives."):

	#define smp_mb__before_llsc() smp_wmb()
	#define __smp_mb__before_llsc() __smp_wmb()

is also dodgy as hell and really wants a comment too. I'm not buying the
Changelog of that commit either, __smp_mb__before_llsc should also
ensure the LL cannot happen earlier, but SYNCW has no effect on loads.
So what stops the load from being speculated?
Hello Peter,

On 28/01/2021 12:33, Peter Zijlstra wrote:
> This, from commit 6b07d38aaa52 ("MIPS: Octeon: Use optimized memory
> barrier primitives."):
>
> 	#define smp_mb__before_llsc() smp_wmb()
> 	#define __smp_mb__before_llsc() __smp_wmb()
>
> is also dodgy as hell and really wants a comment too. I'm not buying the
> Changelog of that commit either, __smp_mb__before_llsc should also
> ensure the LL cannot happen earlier, but SYNCW has no effect on loads.
> So what stops the load from being speculated?

hmm, the commit message you point to above says:

"Since Octeon does not do speculative reads, this functions as a full barrier."
Hi!

On 28/01/2021 12:33, Peter Zijlstra wrote:
> On Thu, Jan 28, 2021 at 08:27:29AM +0100, Alexander Sverdlin wrote:
>
>>>> +#define __smp_store_release(p, v)				\
>>>> +do {							\
>>>> +	compiletime_assert_atomic_type(*p);			\
>>>> +	__smp_wmb();						\
>>>> +	__smp_rmb();						\
>>>> +	WRITE_ONCE(*p, v);					\
>>>> +} while (0)
>>> This is wrong in general since smp_rmb() will only provide order between
>>> two loads and smp_store_release() is a store.
>>>
>>> If this is correct for all MIPS, this needs a giant comment on exactly
>>> how that smp_rmb() makes sense here.
>>
>> ... the macro is provided for Octeon only, and __smp_rmb() is actually a NOP
>> there, but I thought to "document" the flow of thoughts from the discussion
>> above by including it anyway.
>
> Random discussions on the internet do not absolve you from having to
> write coherent comments. Especially so where memory ordering is
> concerned.

I actually hoped you would remember the discussion you participated in 5 years
ago and (in my understanding) had already agreed that the solution itself
is not broken:

https://lore.kernel.org/lkml/20151112180003.GE17308@twins.programming.kicks-ass.net/

Could you please just suggest the proper comment you expect to be added here,
because there is no doubt you have much more experience here than me?

> This, from commit 6b07d38aaa52 ("MIPS: Octeon: Use optimized memory
> barrier primitives."):
>
> 	#define smp_mb__before_llsc() smp_wmb()
> 	#define __smp_mb__before_llsc() __smp_wmb()
>
> is also dodgy as hell and really wants a comment too. I'm not buying the
> Changelog of that commit either, __smp_mb__before_llsc should also
> ensure the LL cannot happen earlier, but SYNCW has no effect on loads.
> So what stops the load from being speculated?
On Thu, Jan 28, 2021 at 12:52:22PM +0100, Alexander Sverdlin wrote:
> Hello Peter,
>
> On 28/01/2021 12:33, Peter Zijlstra wrote:
> > This, from commit 6b07d38aaa52 ("MIPS: Octeon: Use optimized memory
> > barrier primitives."):
> >
> > 	#define smp_mb__before_llsc() smp_wmb()
> > 	#define __smp_mb__before_llsc() __smp_wmb()
> >
> > is also dodgy as hell and really wants a comment too. I'm not buying the
> > Changelog of that commit either, __smp_mb__before_llsc should also
> > ensure the LL cannot happen earlier, but SYNCW has no effect on loads.
> > So what stops the load from being speculated?
>
> hmm, the commit message you point to above, says:
>
> "Since Octeon does not do speculative reads, this functions as a full barrier."

So then the only difference between SYNC and SYNCW is a pipeline drain?

I still worry about the transitivity thing.. ISTR that being a sticky
point back then too.
On Thu, Jan 28, 2021 at 01:09:39PM +0100, Alexander Sverdlin wrote:
> On 28/01/2021 12:33, Peter Zijlstra wrote:
> > On Thu, Jan 28, 2021 at 08:27:29AM +0100, Alexander Sverdlin wrote:
> >
> >>>> +#define __smp_store_release(p, v)				\
> >>>> +do {							\
> >>>> +	compiletime_assert_atomic_type(*p);			\
> >>>> +	__smp_wmb();						\
> >>>> +	__smp_rmb();						\
> >>>> +	WRITE_ONCE(*p, v);					\
> >>>> +} while (0)

> I actually hoped you will remember the discussion you've participated 5 years
> ago and (in my understanding) actually already agreed that the solution itself
> is not broken:
>
> https://lore.kernel.org/lkml/20151112180003.GE17308@twins.programming.kicks-ass.net/

My memory really isn't that good. I can barely remember what I did 5
weeks ago, 5 years ago might as well have never happened.

> Could you please just suggest the proper comment you expect to be added here,
> because there is no doubts, you have much more experience here than me?

So for store_release I'm not too worried, and provided no read
speculation, wmb is indeed sufficient. This is because our
store_release is RCpc.

Something like:

	/*
	 * Because Octeon does not do read speculation, an smp_wmb()
	 * is sufficient to ensure {load,store}->{store} order.
	 */
	#define __smp_store_release(p, v)				\
	do {								\
		compiletime_assert_atomic_type(*p);			\
		__smp_wmb();						\
		WRITE_ONCE(*p, v);					\
	} while (0)
On Thu, Jan 28, 2021 at 03:57:58PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 28, 2021 at 12:52:22PM +0100, Alexander Sverdlin wrote:
> > Hello Peter,
> >
> > On 28/01/2021 12:33, Peter Zijlstra wrote:
> > > This, from commit 6b07d38aaa52 ("MIPS: Octeon: Use optimized memory
> > > barrier primitives."):
> > >
> > > 	#define smp_mb__before_llsc() smp_wmb()
> > > 	#define __smp_mb__before_llsc() __smp_wmb()
> > >
> > > is also dodgy as hell and really wants a comment too. I'm not buying the
> > > Changelog of that commit either, __smp_mb__before_llsc should also
> > > ensure the LL cannot happen earlier, but SYNCW has no effect on loads.
> > > So what stops the load from being speculated?
> >
> > hmm, the commit message you point to above, says:
> >
> > "Since Octeon does not do speculative reads, this functions as a full barrier."
>
> So then the only difference between SYNC and SYNCW is a pipeline drain?
>
> I still worry about the transitivity thing.. ISTR that being a sticky
> point back then too.

Ah, there we are, it's called multi-copy-atomic these days:

  f1ab25a30ce8 ("memory-barriers: Replace uses of "transitive"")

Do those SYNCW / write-completion barriers guarantee this?
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index 49ff172..24c3f2c 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -113,6 +113,15 @@ static inline void wmb(void)
 	".set arch=octeon\n\t"		\
 	"syncw\n\t"			\
 	".set pop" : : : "memory")
+
+#define __smp_store_release(p, v)				\
+do {								\
+	compiletime_assert_atomic_type(*p);			\
+	__smp_wmb();						\
+	__smp_rmb();						\
+	WRITE_ONCE(*p, v);					\
+} while (0)
+
 #else
 #define smp_mb__before_llsc() smp_llsc_mb()
 #define __smp_mb__before_llsc() smp_llsc_mb()