From patchwork Sun Sep 27 16:27:18 2015
X-Patchwork-Id: 7273401
Subject: [PATCH] parisc: Re: [PATCH] parisc: adjust L1_CACHE_BYTES to 128
 bytes on PA8800 and PA8900 CPUs
From: John David Anglin
Date: Sun, 27 Sep 2015 12:27:18 -0400
To: Helge Deller
Cc: James Bottomley, linux-parisc@vger.kernel.org
Message-Id: <65929045-FF25-4BFB-BA89-F07A47328F1F@bell.net>
In-Reply-To: <56017FB3.5050709@gmx.de>
References: <20150902162000.GC2444@ls3530.box>
 <1441287043.2235.6.camel@HansenPartnership.com>
 <1441288665.2235.17.camel@HansenPartnership.com>
 <55EB5EFA.4040407@gmx.de>
 <56017FB3.5050709@gmx.de>
List-ID: linux-parisc@vger.kernel.org

On 2015-09-22, at 12:20 PM, Helge Deller wrote:

> On 05.09.2015 23:30, Helge Deller wrote:
>> Hi James,
>> ...
>> I haven't done any performance measurements yet, but your patch looks
>> absolutely correct.
>> ...
>
> Hello everyone,
>
> I did some timing tests with the various patches for
> a) the atomic_hash patch:
>    https://patchwork.kernel.org/patch/7116811/
> b) the alignment of LWS locks patch:
>    https://patchwork.kernel.org/patch/7137931/
>
> The testcase I used is basically the following:
> - It starts 32 threads.
> - We have 16 atomic ints organized in an array.
> - The first thread increments the first atomic int NITERS times.
> - The second thread decrements the first atomic int NITERS times.
> - The third/fourth thread increments/decrements the second atomic int, and so on...
> - So, we have 32 threads, of which 16 increment and 16 decrement 16 different atomic ints.
> - All threads run in parallel on a 4-way SMP PA8700 rp5470 machine.
> - I used the "time" command to measure the timings.
> - I did not stop other services on the machine, but I ran each test a few times
>   and the timing results did not show significant variation between runs.
> - All timings were done on a vanilla 4.2 kernel with only the mentioned patch applied.
>
> The code is a modified testcase from the libatomic-ops debian package:
>
> AO_t counter_array[16] = { 0, };
> #define NITERS 1000000
>
> void * add1sub1_thr(void * id)
> {
>   int me = (int)(AO_PTRDIFF_T)id;
>   AO_t *counter;
>   int i;
>
>   counter = &counter_array[me >> 1];
>   for (i = 0; i < NITERS; ++i)
>     if ((me & 1) != 0) {
>       (void)AO_fetch_and_sub1(counter);
>     } else {
>       (void)AO_fetch_and_add1(counter);
>     }
>   return 0;
> }
> ...
> run_parallel(32, add1sub1_thr)
> ...
>
> The baseline for all results is the timing with a vanilla kernel 4.2:
> real 0m13.596s
> user 0m18.152s
> sys  0m35.752s
>
> The next results are with the atomic_hash patch (a) applied:
>
> For ATOMIC_HASH_SIZE = 4:
> real 0m21.892s
> user 0m27.492s
> sys  0m59.704s
>
> For ATOMIC_HASH_SIZE = 64:
> real 0m20.604s
> user 0m24.832s
> sys  0m56.552s
>
> Next I applied the LWS locks patch (b) alone:
>
> For LWS_LOCK_ALIGN_BITS = 4:
> real 0m13.660s
> user 0m18.592s
> sys  0m35.236s
>
> For LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
> real 0m11.992s
> user 0m19.064s
> sys  0m28.476s
>
> Then I applied both patches (a and b):
>
> ATOMIC_HASH_SIZE = 64, LWS_LOCK_ALIGN_BITS = 4:
> real 0m13.232s
> user 0m17.704s
> sys  0m33.884s
>
> ATOMIC_HASH_SIZE = 64, LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
> real 0m12.300s
> user 0m20.268s
> sys  0m28.424s
>
> ATOMIC_HASH_SIZE = 4, LWS_LOCK_ALIGN_BITS = 4:
> real 0m13.181s
> user 0m17.584s
> sys  0m34.800s
>
> ATOMIC_HASH_SIZE = 4, LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
> real 0m11.692s
> user 0m18.232s
> sys  0m27.072s
>
> In summary, I'm astonished by those results.
> From patch (a) applied stand-alone I would have expected the same or better
> performance, because it makes the spinlocks more fine-grained.
> But a performance drop of roughly 50% is strange.
>
> Patch (b) stand-alone significantly increases performance (~20%), and
> together with patch (a) it even adds a few more percent on top.
>
> Given the numbers above, I would currently suggest applying both patches
> (with ATOMIC_HASH_SIZE = 4 and LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT).
>
> Thoughts?

Hi Helge,

Attached is a revised patch "a" that tries to improve the performance of
atomic_t variables. If you get a chance, could you see how it performs?

As noted in previous mail, I believe the L1 cache line size is 16 bytes on
all PA-RISC systems. This change pads the arch_spinlock_t type on PA2.0 to
16 bytes, so it is the same size as on PA1.x. As a result, the spinlocks in
__atomic_hash now fall on separate L1 cache lines.

I also bumped ATOMIC_HASH_SIZE up to give us 64 locks, and changed
ATOMIC_HASH to divide by SMP_CACHE_BYTES. SMP_CACHE_BYTES should be the
maximum alignment used by the kernel. This is an attempt to ensure the
hash indexes are evenly distributed.

Tested on v4.2.1+ on rp3440 with a gcc build.

Signed-off-by: John David Anglin <dave.anglin@bell.net>
---
John David Anglin  dave.anglin@bell.net

diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h
index 226f8ca9..a5c6de5 100644
--- a/arch/parisc/include/asm/atomic.h
+++ b/arch/parisc/include/asm/atomic.h
@@ -25,8 +25,8 @@
  * Hash function to index into a different SPINLOCK.
  * Since "a" is usually an address, use one spinlock per cacheline.
  */
-# define ATOMIC_HASH_SIZE 4
-# define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/L1_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
+# define ATOMIC_HASH_SIZE 64
+# define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/SMP_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
 
 extern arch_spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] __lock_aligned;
 
diff --git a/arch/parisc/include/asm/spinlock_types.h b/arch/parisc/include/asm/spinlock_types.h
index 8c373aa..8c54f5e 100644
--- a/arch/parisc/include/asm/spinlock_types.h
+++ b/arch/parisc/include/asm/spinlock_types.h
@@ -4,7 +4,8 @@
 typedef struct {
 #ifdef CONFIG_PA20
 	volatile unsigned int slock;
-# define __ARCH_SPIN_LOCK_UNLOCKED { 1 }
+	volatile unsigned int pad[3];
+# define __ARCH_SPIN_LOCK_UNLOCKED { 1, { 1, 1, 1 } }
 #else
 	volatile unsigned int lock[4];
 # define __ARCH_SPIN_LOCK_UNLOCKED { { 1, 1, 1, 1 } }