From patchwork Sun Sep 27 16:27:18 2015
X-Patchwork-Id: 7273401
Subject: [PATCH] parisc: Re: [PATCH] parisc: adjust L1_CACHE_BYTES to 128
 bytes on PA8800 and PA8900 CPUs
From: John David Anglin
Date: Sun, 27 Sep 2015 12:27:18 -0400
To: Helge Deller
Cc: James Bottomley, linux-parisc@vger.kernel.org
Message-Id: <65929045-FF25-4BFB-BA89-F07A47328F1F@bell.net>
In-Reply-To: <56017FB3.5050709@gmx.de>
References: <20150902162000.GC2444@ls3530.box>
 <1441287043.2235.6.camel@HansenPartnership.com>
 <1441288665.2235.17.camel@HansenPartnership.com>
 <55EB5EFA.4040407@gmx.de>
 <56017FB3.5050709@gmx.de>
List-ID: linux-parisc@vger.kernel.org

On 2015-09-22, at 12:20 PM, Helge Deller wrote:

> On 05.09.2015 23:30, Helge Deller wrote:
>> Hi James,
>> ...
>> I haven't done any performance measurements yet, but your patch looks
>> absolutely correct.
>> ...
>
> Hello everyone,
>
> I did some timing tests with the various patches for
> a) the atomic_hash patch:
>    https://patchwork.kernel.org/patch/7116811/
> b) the alignment of LWS locks patch:
>    https://patchwork.kernel.org/patch/7137931/
>
> The testcase I used is basically the following:
> - It starts 32 threads.
> - We have 16 atomic ints organized in an array.
> - The first thread increments the first atomic int NITERS times.
> - The second thread decrements the first atomic int NITERS times.
> - The third/fourth thread increments/decrements the second atomic int, and so on...
> - So, we have 32 threads, of which 16 increment and 16 decrement 16 different atomic ints.
> - All threads run in parallel on a 4-way SMP PA8700 rp5470 machine.
> - I used the "time" command to measure the timings.
> - I did not stop other services on the machine, but I ran each test a few times
>   and the timing results did not show significant variation between runs.
> - All timings were done on a vanilla 4.2 kernel with only the mentioned patch applied.
>
> The code is a modified testcase from the libatomic-ops debian package:
>
> AO_t counter_array[16] = { 0, };
> #define NITERS 1000000
>
> void * add1sub1_thr(void * id)
> {
>   int me = (int)(AO_PTRDIFF_T)id;
>   AO_t *counter;
>   int i;
>
>   counter = &counter_array[me >> 1];
>   for (i = 0; i < NITERS; ++i)
>     if ((me & 1) != 0) {
>       (void)AO_fetch_and_sub1(counter);
>     } else {
>       (void)AO_fetch_and_add1(counter);
>     }
>   return 0;
> }
> ...
> run_parallel(32, add1sub1_thr)
> ...
>
> The baseline for all results is the timing with a vanilla kernel 4.2:
> real 0m13.596s
> user 0m18.152s
> sys  0m35.752s
>
> The next results are with the atomic_hash patch (a) applied:
>
> For ATOMIC_HASH_SIZE = 4:
> real 0m21.892s
> user 0m27.492s
> sys  0m59.704s
>
> For ATOMIC_HASH_SIZE = 64:
> real 0m20.604s
> user 0m24.832s
> sys  0m56.552s
>
> Next I applied the LWS locks patch (b) alone:
>
> For LWS_LOCK_ALIGN_BITS = 4:
> real 0m13.660s
> user 0m18.592s
> sys  0m35.236s
>
> For LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
> real 0m11.992s
> user 0m19.064s
> sys  0m28.476s
>
> Then I applied both patches (a and b):
>
> ATOMIC_HASH_SIZE = 64, LWS_LOCK_ALIGN_BITS = 4:
> real 0m13.232s
> user 0m17.704s
> sys  0m33.884s
>
> ATOMIC_HASH_SIZE = 64, LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
> real 0m12.300s
> user 0m20.268s
> sys  0m28.424s
>
> ATOMIC_HASH_SIZE = 4, LWS_LOCK_ALIGN_BITS = 4:
> real 0m13.181s
> user 0m17.584s
> sys  0m34.800s
>
> ATOMIC_HASH_SIZE = 4, LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
> real 0m11.692s
> user 0m18.232s
> sys  0m27.072s
>
> In summary, I'm astonished by those results.
> From patch (a) applied stand-alone I would have expected the same or better
> performance, because it makes the spinlocks more fine-grained.
> But a performance drop of roughly 50% is strange.
>
> Patch (b) stand-alone significantly increases performance (~20%), and
> together with patch (a) it even adds a few more percent on top.
>
> Given the numbers above, I would currently suggest applying both patches
> (with ATOMIC_HASH_SIZE = 4 and LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT).
>
> Thoughts?

Hi Helge,

Attached is a revised patch "a" that tries to improve the performance of
atomic_t variables. If you get a chance, could you see how it performs?

As noted in previous mail, I believe the L1 cache line size is 16 bytes on
all PA-RISC systems. This change pads the arch_spinlock_t type on PA2.0 to
16 bytes, so it is the same size as on PA1.x. As a result, the spinlocks in
__atomic_hash now fall on separate L1 cache lines.

I also bumped ATOMIC_HASH_SIZE up to give us 64 locks, and changed
ATOMIC_HASH to divide by SMP_CACHE_BYTES. SMP_CACHE_BYTES should be the
maximum alignment used by the kernel. This is an attempt to ensure the
hash indexes are evenly distributed.

Tested on v4.2.1+ on rp3440 with a gcc build.

Signed-off-by: John David Anglin <dave.anglin@bell.net>
---
John David Anglin  dave.anglin@bell.net

diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h
index 226f8ca9..a5c6de5 100644
--- a/arch/parisc/include/asm/atomic.h
+++ b/arch/parisc/include/asm/atomic.h
@@ -25,8 +25,8 @@
  * Hash function to index into a different SPINLOCK.
  * Since "a" is usually an address, use one spinlock per cacheline.
  */
-# define ATOMIC_HASH_SIZE 4
-# define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/L1_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
+# define ATOMIC_HASH_SIZE 64
+# define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/SMP_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
 
 extern arch_spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] __lock_aligned;
 
diff --git a/arch/parisc/include/asm/spinlock_types.h b/arch/parisc/include/asm/spinlock_types.h
index 8c373aa..8c54f5e 100644
--- a/arch/parisc/include/asm/spinlock_types.h
+++ b/arch/parisc/include/asm/spinlock_types.h
@@ -4,7 +4,8 @@
 typedef struct {
 #ifdef CONFIG_PA20
 	volatile unsigned int slock;
-# define __ARCH_SPIN_LOCK_UNLOCKED { 1 }
+	volatile unsigned int pad[3];
+# define __ARCH_SPIN_LOCK_UNLOCKED { 1, { 1, 1, 1 } }
 #else
 	volatile unsigned int lock[4];
 # define __ARCH_SPIN_LOCK_UNLOCKED { { 1, 1, 1, 1 } }