diff mbox series

[RFC] parisc: Use ldcw,co on uniprocessor machines only

Message ID 20191211201634.GA13407@ls3530.fritz.box (mailing list archive)
State RFC, archived
Headers show
Series [RFC] parisc: Use ldcw,co on uniprocessor machines only | expand

Commit Message

Helge Deller Dec. 11, 2019, 8:16 p.m. UTC
The only atomic operation on parisc is the ldcw instruction, which loads
a 32bit word from an address and replaces it by zero (load and clear
word). This instruction is used to implement kernel internal spinlocks.

Up to now we tried to optimize the ldcw usage by using the coherent
completer of this command, which operates on the cache (instead of
memory) and thus might speed up things, and which was enabled by default
on our 64bit kernel build.

But we still see runtime locking problems, so this patch changes it back
to use ldcw for 32- and 64-bit kernels, and live-patches it at runtime
to use the coherent completer when running on a uniprocessor machine.

Signed-off-by: Helge Deller <deller@gmx.de>

Comments

John David Anglin Dec. 11, 2019, 9:29 p.m. UTC | #1
On 2019-12-11 3:16 p.m., Helge Deller wrote:
> Up to now we tried to optimize the ldcw usage by using the coherent
> completer of this command, which operates on the cache (instead of
> memory) and thus might speed up things, and which was enabled by default
> on our 64bit kernel build.
>
> But we still see runtime locking problems, so this patch changes it back
> to use ldcw for 32- and 64-bit kernels, and live-patches it at runtime
> to use the coherent completer when running on a uniprocessor machine.
I'm not convinced this is the problem.  Nominally, every PA 2.0 machine that we support is coherent.
Is there evidence that this actually helps?  I did a test where I switched "ldcw,co" to "ldcw" and
didn't find a significant difference.  So, I left the default assumption that most PA 2.0 machines
are coherent in gcc.

I'm seeing different behavior for pthread_mutex_lock/pthread_mutex_unlock with different
glibc versions.  The locking issues also seem to vary from one kernel version to the next.

I don't know that we can blame the two build failures of acl2_8.1dfsg-6 on phantom on a
locking issue, but phantom failed twice at the same pwasoint.  In both cases, cc1 terminated with
a segmentation fault.  Yet, mx3210 has been chugging away for more than a day on the package.
It also built -4 and -5.

I don't have a clue what's really wrong but I suspect the slowness of our locking infrastructure
is what exposes these issues.

I've seen one issue in user space where a pointer to a mutex got corrupted in apt-cacher-ng.
If I remember correctly, the LWS locking code was spinning with a pointer value of 0x12.  I think
the code should have faulted but the thread stuck.  Had to systemctl restart apt-cacher-ng.

Dave
Helge Deller Dec. 12, 2019, 9:01 a.m. UTC | #2
On 11.12.19 22:29, John David Anglin wrote:
> On 2019-12-11 3:16 p.m., Helge Deller wrote:
>> Up to now we tried to optimize the ldcw usage by using the coherent
>> completer of this command, which operates on the cache (instead of
>> memory) and thus might speed up things, and which was enabled by default
>> on our 64bit kernel build.
>>
>> But we still see runtime locking problems, so this patch changes it back
>> to use ldcw for 32- and 64-bit kernels, and live-patches it at runtime
>> to use the coherent completer when running on a uniprocessor machine.
> I'm not convinced this is the problem.  Nominally, every PA 2.0 machine that we support is coherent.
> Is there evidence that this actually helps?  I did a test where I switched "ldcw,co" to "ldcw" and
> didn't find a significant difference.  So, I left the default assumption that most PA 2.0 machines
> are coherent in gcc.
>
> I'm seeing different behavior for pthread_mutex_lock/pthread_mutex_unlock with different
> glibc versions.  The locking issues also seem to vary from one kernel version to the next.
>
> I don't know that we can blame the two build failures of acl2_8.1dfsg-6 on phantom on a
> locking issue, but phantom failed twice at the same pwasoint.  In both cases, cc1 terminated with
> a segmentation fault.  Yet, mx3210 has been chugging away for more than a day on the package.
> It also built -4 and -5.
>
> I don't have a clue what's really wrong but I suspect the slowness of our locking infrastructure
> is what exposes these issues.
>
> I've seen one issue in user space where a pointer to a mutex got corrupted in apt-cacher-ng.
> If I remember correctly, the LWS locking code was spinning with a pointer value of 0x12.  I think
> the code should have faulted but the thread stuck.  Had to systemctl restart apt-cacher-ng.

Dave, thanks for your feedback!
I'm neither convinced that this helps, nor did I tested it any further than just booting.
My main concern is the stuck kernel threads which happen seldom on the bigger debian buildd
servers.
For now I'll not apply my patch for upstream kernels, but will keep it in patchwork.

Helge
diff mbox series

Patch

diff --git a/arch/parisc/include/asm/alternative.h b/arch/parisc/include/asm/alternative.h
index 0ec54f43d6d2..2667ec07acb9 100644
--- a/arch/parisc/include/asm/alternative.h
+++ b/arch/parisc/include/asm/alternative.h
@@ -11,6 +11,7 @@ 
 #define ALT_COND_RUN_ON_QEMU	0x20	/* if running on QEMU */

 #define INSN_PxTLB	0x02		/* modify pdtlb, pitlb */
+#define INSN_LDCW_CO	0x03		/* change cc in ldcw to ldcw,co */
 #define INSN_NOP	0x08000240	/* nop */

 #ifndef __ASSEMBLY__
diff --git a/arch/parisc/include/asm/assembly.h b/arch/parisc/include/asm/assembly.h
index a39250cb7dfc..8d6e76279d80 100644
--- a/arch/parisc/include/asm/assembly.h
+++ b/arch/parisc/include/asm/assembly.h
@@ -44,8 +44,9 @@ 

 #define CALLEE_SAVE_FRAME_SIZE (CALLEE_REG_FRAME_SIZE + CALLEE_FLOAT_FRAME_SIZE)

+#define LDCW		ALTERNATIVE(., .+4, ALT_COND_NO_SMP, INSN_LDCW_CO) ! ldcw
+
 #ifdef CONFIG_PA20
-#define LDCW		ldcw,co
 #define BL		b,l
 # ifdef CONFIG_64BIT
 #  define PA_ASM_LEVEL	2.0w
@@ -53,7 +54,6 @@ 
 #  define PA_ASM_LEVEL	2.0
 # endif
 #else
-#define LDCW		ldcw
 #define BL		bl
 #define PA_ASM_LEVEL	1.1
 #endif
diff --git a/arch/parisc/include/asm/ldcw.h b/arch/parisc/include/asm/ldcw.h
index e080143e79a3..59130b0dbc3e 100644
--- a/arch/parisc/include/asm/ldcw.h
+++ b/arch/parisc/include/asm/ldcw.h
@@ -2,7 +2,8 @@ 
 #ifndef __PARISC_LDCW_H
 #define __PARISC_LDCW_H

-#ifndef CONFIG_PA20
+#include <asm/alternative.h>
+
 /* Because kmalloc only guarantees 8-byte alignment for kmalloc'd data,
    and GCC only guarantees 8-byte alignment for stack locals, we can't
    be assured of 16-byte alignment for atomic lock data even if we
@@ -19,22 +20,6 @@ 
 		& ~(__PA_LDCW_ALIGNMENT - 1);			\
 	(volatile unsigned int *) __ret;			\
 })
-#define __LDCW	"ldcw"
-
-#else /*CONFIG_PA20*/
-/* From: "Jim Hull" <jim.hull of hp.com>
-   I've attached a summary of the change, but basically, for PA 2.0, as
-   long as the ",CO" (coherent operation) completer is specified, then the
-   16-byte alignment requirement for ldcw and ldcd is relaxed, and instead
-   they only require "natural" alignment (4-byte for ldcw, 8-byte for
-   ldcd). */
-
-#define __PA_LDCW_ALIGNMENT	4
-#define __PA_LDCW_ALIGN_ORDER	2
-#define __ldcw_align(a) (&(a)->slock)
-#define __LDCW	"ldcw,co"
-
-#endif /*!CONFIG_PA20*/

 /* LDCW, the only atomic read-write operation PA-RISC has. *sigh*.
    We don't explicitly expose that "*a" may be written as reload
@@ -46,7 +31,8 @@ 
    usually used within code blocks surrounded by memory barriers.  */
 #define __ldcw(a) ({						\
 	unsigned __ret;						\
-	__asm__ __volatile__(__LDCW " 0(%1),%0"			\
+	__asm__ __volatile__("ldcw 0(%1),%0"			\
+		ALTERNATIVE(ALT_COND_NO_SMP, INSN_LDCW_CO)	\
 		: "=r" (__ret) : "r" (a) : "memory");		\
 	__ret;							\
 })
diff --git a/arch/parisc/include/asm/spinlock_types.h b/arch/parisc/include/asm/spinlock_types.h
index 42979c5704dc..82d2384c3f22 100644
--- a/arch/parisc/include/asm/spinlock_types.h
+++ b/arch/parisc/include/asm/spinlock_types.h
@@ -3,13 +3,8 @@ 
 #define __ASM_SPINLOCK_TYPES_H

 typedef struct {
-#ifdef CONFIG_PA20
-	volatile unsigned int slock;
-# define __ARCH_SPIN_LOCK_UNLOCKED { 1 }
-#else
 	volatile unsigned int lock[4];
 # define __ARCH_SPIN_LOCK_UNLOCKED	{ { 1, 1, 1, 1 } }
-#endif
 } arch_spinlock_t;

 typedef struct {
diff --git a/arch/parisc/kernel/alternative.c b/arch/parisc/kernel/alternative.c
index 3c66d5c4d90d..cf83a801cc2a 100644
--- a/arch/parisc/kernel/alternative.c
+++ b/arch/parisc/kernel/alternative.c
@@ -69,6 +69,12 @@  void __init_or_module apply_alternatives(struct alt_instr *start,
 			if (boot_cpu_data.cpu_type >= pcxu) /* >= pa2.0 ? */
 				replacement |= (1 << 10); /* set el bit */
 		}
+		/* Want to replace ldcw by a ldcw,co instruction? */
+		if (replacement == INSN_LDCW_CO) {
+			replacement = *from;
+			/* set cache-coherent completer bits: */
+			replacement |= (0x01 << 10);
+		}

 		/*
 		 * Replace instruction with NOPs?