parisc: adjust L1_CACHE_BYTES to 128 bytes on PA8800 and PA8900 CPUs
diff mbox

Message ID 1441288665.2235.17.camel@HansenPartnership.com
State New
Headers show

Commit Message

James Bottomley Sept. 3, 2015, 1:57 p.m. UTC
On Thu, 2015-09-03 at 06:30 -0700, James Bottomley wrote:
> On Wed, 2015-09-02 at 18:20 +0200, Helge Deller wrote:
> > PA8800 and PA8900 processors have a cache line length of 128 bytes.
> > 
> > Reported-by: John David Anglin <dave.anglin@bell.net>
> > Signed-off-by: Helge Deller <deller@gmx.de>
> > 
> > diff --git a/arch/parisc/include/asm/cache.h b/arch/parisc/include/asm/cache.h
> > index 47f11c7..a775f60 100644
> > --- a/arch/parisc/include/asm/cache.h
> > +++ b/arch/parisc/include/asm/cache.h
> > @@ -7,17 +7,19 @@
> >  
> > 
> >  /*
> > - * PA 2.0 processors have 64-byte cachelines; PA 1.1 processors have
> > - * 32-byte cachelines.  The default configuration is not for SMP anyway,
> > - * so if you're building for SMP, you should select the appropriate
> > - * processor type.  There is a potential livelock danger when running
> > - * a machine with this value set too small, but it's more probable you'll
> > - * just ruin performance.
> > + * Most PA 2.0 processors have 64-byte cachelines, but PA8800 and PA8900
> > + * processors have a cache line length of 128 bytes.
> > + * PA 1.1 processors have 32-byte cachelines.
> > + * There is a potential livelock danger when running a machine with this value
> > + * set too small, but it's more probable you'll just ruin performance.
> >   */
> > -#ifdef CONFIG_PA20
> > +#if defined(CONFIG_PA8X00)
> > +#define L1_CACHE_BYTES 128
> > +#define L1_CACHE_SHIFT 7
> > +#elif defined(CONFIG_PA20)
> >  #define L1_CACHE_BYTES 64
> >  #define L1_CACHE_SHIFT 6
> > -#else
> > +#else /* PA7XXX */
> >  #define L1_CACHE_BYTES 32
> >  #define L1_CACHE_SHIFT 5
> >  #endif
> > 
> > diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h
> > index 226f8ca9..b2bc4b7 100644
> > --- a/arch/parisc/include/asm/atomic.h
> > +++ b/arch/parisc/include/asm/atomic.h
> > @@ -19,14 +19,14 @@
> >  
> >  #ifdef CONFIG_SMP
> >  #include <asm/spinlock.h>
> > -#include <asm/cache.h>		/* we use L1_CACHE_BYTES */
> > +#include <asm/cache.h>		/* we use L1_CACHE_SHIFT */
> >  
> >  /* Use an array of spinlocks for our atomic_ts.
> >   * Hash function to index into a different SPINLOCK.
> >   * Since "a" is usually an address, use one spinlock per cacheline.
> >   */
> >  #  define ATOMIC_HASH_SIZE 4
> > -#  define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/L1_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
> > +#  define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a)) >> L1_CACHE_SHIFT) & (ATOMIC_HASH_SIZE-1) ]))
> >  
> >  extern arch_spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] __lock_aligned;
> 
> This doesn't look to be correct.  The L1_CACHE_BYTES is compile time not
> runtime, so it's the Architectural not the Actual width.  For us, there
> are only two architectural widths governed by our compile classes, which
> are PA1 and PA2 at 32 and 64.  There's no config way to produce a PA2
> kernel which is PA88/89 only, so we should follow the PA2 architectural
> width because the kernel can be booted on any PA2 system, not just
> PA88/89.  Even if we could produce a PA88/89 only kernel and would wish
> to, there's still not much point because 128 is the cache burst width.
> PA988/89 work perfectly OK with the architectural width, so the extra
> space is likely added for no benefit.

Having looked more closely at our atomics, we seem to have a bogus
assumption about cache lines.  Our cache is perfectly capable of
correctly writing to adjacent bytes in different lines even on different
processors and having the external visibility sequenced accordingly.
The reason we need the locks is for correctness not for cache coherence
problems.  This means that we shouldn't hash all locks in the same line
to the same lock because it introduces unnecessary contention in
adjacent atomics which could be sorted out much faster by the cache
logic.

The original way our hash worked would have been correct if the atomic_t
were cache line aligned ... meaning only one atomic per cache line, but
they're not, they have no alignment primitives.

On the same thought, to reduce atomic contention probabalistically, we
should have a much larger atomic array, say 64 or 128 ... it's a single
array of locks, so the amount of space consumed by expanding it isn't
huge.

This patch should get rid of the bogus cache assumption and at the same
time decrease our collision chances.  We should probably do some playing
around to see what the best size for ATOMIC_HASH_SIZE is.

James

---



--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Helge Deller Sept. 5, 2015, 9:30 p.m. UTC | #1
Hi James,

On 03.09.2015 15:57, James Bottomley wrote:
> Having looked more closely at our atomics, we seem to have a bogus
> assumption about cache lines.  Our cache is perfectly capable of
> correctly writing to adjacent bytes in different lines even on different
> processors and having the external visibility sequenced accordingly.
> The reason we need the locks is for correctness not for cache coherence
> problems.  This means that we shouldn't hash all locks in the same line
> to the same lock because it introduces unnecessary contention in
> adjacent atomics which could be sorted out much faster by the cache
> logic.
> 
> The original way our hash worked would have been correct if the atomic_t
> were cache line aligned ... meaning only one atomic per cache line, but
> they're not, they have no alignment primitives.
> 
> On the same thought, to reduce atomic contention probabalistically, we
> should have a much larger atomic array, say 64 or 128 ... it's a single
> array of locks, so the amount of space consumed by expanding it isn't
> huge.
> 
> This patch should get rid of the bogus cache assumption and at the same
> time decrease our collision chances.  We should probably do some playing
> around to see what the best size for ATOMIC_HASH_SIZE is.

I haven't done any performance measurements yet, but your patch looks
absolutely correct.

Do you mind sending a full patch including your signed-off line, so that
I can include it?

Thanks,
Helge 

> diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h
> index 226f8ca9..de3361b 100644
> --- a/arch/parisc/include/asm/atomic.h
> +++ b/arch/parisc/include/asm/atomic.h
> @@ -19,14 +19,13 @@
>  
>  #ifdef CONFIG_SMP
>  #include <asm/spinlock.h>
> -#include <asm/cache.h>		/* we use L1_CACHE_BYTES */
>  
>  /* Use an array of spinlocks for our atomic_ts.
>   * Hash function to index into a different SPINLOCK.
> - * Since "a" is usually an address, use one spinlock per cacheline.
> + * Since "a" is usually an address, use one spinlock per atomic.
>   */
> -#  define ATOMIC_HASH_SIZE 4
> -#  define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/L1_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
> +#  define ATOMIC_HASH_SIZE 64
> +#  define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/sizeof(atomic_t)) & (ATOMIC_HASH_SIZE-1) ]))
>  
>  extern arch_spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] __lock_aligned;

--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Helge Deller Sept. 22, 2015, 4:20 p.m. UTC | #2
On 05.09.2015 23:30, Helge Deller wrote:
> Hi James,
> ...
> I haven't done any performance measurements yet, but your patch looks
> absolutely correct.
> ...

Hello everyone,

I did some timing tests with the various patches for
a) atomic_hash patches:
        https://patchwork.kernel.org/patch/7116811/
b) alignment of LWS locks:
        https://patchwork.kernel.org/patch/7137931/

The testcase I used is basically the following:
- It starts 32 threads.
- We have 16 atomic ints organized in an array.
- The first thread increments NITERS times the first atomic int.
- The second thread decrements NITERS times the first atomic int.
- The third/fourth thread increments/decrements the second atomic int, and so on...
- So, we have 32 threads, of which 16 increments and 16 decrements 16 different atomic ints.
- All threads run in parallel on a 4-way SMP PA8700 rp5470 machine.
- I used the "time" command to measure the timings.
- I did not stopped other services on the machine, but ran each test a few times and the timing results did not show significant variation between each run.
- All timings were done on a vanilla kernel 4.2 with only the mentioned patch applied.

The code is a modified testcase from the libatomic-ops debian package:

AO_t counter_array[16] = { 0, };
#define NITERS 1000000

void * add1sub1_thr(void * id)
{
  int me = (int)(AO_PTRDIFF_T)id;
  AO_t *counter;
  int i;

  counter = &counter_array[me >> 1];
  for (i = 0; i < NITERS; ++i)
    if ((me & 1) != 0) {
      (void)AO_fetch_and_sub1(counter);
    } else {
      (void)AO_fetch_and_add1(counter);
    }
  return 0;
...
run_parallel(32, add1sub1_thr)
...
        

        
The baseline for all results is the timing with a vanilla kernel 4.2:
real    0m13.596s
user    0m18.152s
sys     0m35.752s


The next results are with the atomic_hash (a) patch applied:
For ATOMIC_HASH_SIZE = 4. 
real    0m21.892s 
user    0m27.492s
sys     0m59.704s

For ATOMIC_HASH_SIZE = 64.
real    0m20.604s
user    0m24.832s
sys     0m56.552s


Next I applied the LWS locks patch (b):
XXXXXXXXXXXXXXXXXXXX    LWS_LOCK_ALIGN_BITS = 4
real    0m13.660s
user    0m18.592s
sys     0m35.236s

XXXXXXXXXXXXXXXXXXXX    LWS_LOCK_ALIGN_BITS  = L1_CACHE_SHIFT
real    0m11.992s 
user    0m19.064s
sys     0m28.476s
    
      
    
Then I applied both patches (a and b):
ATOMIC_HASH_SIZE = 64,  LWS_LOCK_ALIGN_BITS = 4
ATOMIC_HASH_SIZE = 64,  LWS_LOCK_ALIGN_BITS = 4
real    0m13.232s
user    0m17.704s
sys     0m33.884s

ATOMIC_HASH_SIZE = 64,  LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT
real    0m12.300s
user    0m20.268s
sys     0m28.424s

ATOMIC_HASH_SIZE = 4,   LWS_LOCK_ALIGN_BITS = 4
real    0m13.181s
user    0m17.584s
sys     0m34.800s

ATOMIC_HASH_SIZE = 4,   LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT
real    0m11.692s
user    0m18.232s
sys     0m27.072s


In summary I'm astonished about those results.
Especially from patch (a) I would have expected (when applied stand-alone) the same or better performance, because it makes the spinlocks more fine-grained.
But a performance drop of 50% is strange.

Patch (b) stand-alone does significantly increases performance (~20%), and together with patch (a) it even adds a few more percent performance increase on top.

Given the numbers above I currently would suggest to apply both patches (with ATOMIC_HASH_SIZE = 4 and LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT).

Thoughts?

Helge
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John David Anglin Sept. 23, 2015, 12:12 a.m. UTC | #3
On 2015-09-22, at 12:20 PM, Helge Deller wrote:

> On 05.09.2015 23:30, Helge Deller wrote:
>> Hi James,
>> ...
>> I haven't done any performance measurements yet, but your patch looks
>> absolutely correct.
>> ...
> 
> Hello everyone,
> 
> I did some timing tests with the various patches for
> a) atomic_hash patches:
>        https://patchwork.kernel.org/patch/7116811/
> b) alignment of LWS locks:
>        https://patchwork.kernel.org/patch/7137931/
> 
> The testcase I used is basically the following:
> - It starts 32 threads.
> - We have 16 atomic ints organized in an array.
> - The first thread increments NITERS times the first atomic int.
> - The second thread decrements NITERS times the first atomic int.
> - The third/fourth thread increments/decrements the second atomic int, and so on...
> - So, we have 32 threads, of which 16 increments and 16 decrements 16 different atomic ints.
> - All threads run in parallel on a 4-way SMP PA8700 rp5470 machine.
> - I used the "time" command to measure the timings.
> - I did not stopped other services on the machine, but ran each test a few times and the timing results did not show significant variation between each run.
> - All timings were done on a vanilla kernel 4.2 with only the mentioned patch applied.
> 
> The code is a modified testcase from the libatomic-ops debian package:
> 
> AO_t counter_array[16] = { 0, };
> #define NITERS 1000000
> 
> void * add1sub1_thr(void * id)
> {
>  int me = (int)(AO_PTRDIFF_T)id;
>  AO_t *counter;
>  int i;
> 
>  counter = &counter_array[me >> 1];
>  for (i = 0; i < NITERS; ++i)
>    if ((me & 1) != 0) {
>      (void)AO_fetch_and_sub1(counter);
>    } else {
>      (void)AO_fetch_and_add1(counter);
>    }
>  return 0;
> ...
> run_parallel(32, add1sub1_thr)
> ...
> 
> 

Does libatomic-ops now use the GCC sync builtins and the LWS calls?

> 
> The baseline for all results is the timing with a vanilla kernel 4.2:
> real    0m13.596s
> user    0m18.152s
> sys     0m35.752s
> 
> 
> The next results are with the atomic_hash (a) patch applied:
> For ATOMIC_HASH_SIZE = 4. 
> real    0m21.892s 
> user    0m27.492s
> sys     0m59.704s
> 
> For ATOMIC_HASH_SIZE = 64.
> real    0m20.604s
> user    0m24.832s
> sys     0m56.552s
> 

I'm not sure why the atomic_hash patch would directly affect performance of this test as I
don't think the patch affects the LWS locks.

I question the the atomic hash changes as the original defines are taken directly from generic code.
Optimally, we want one spinlock per cacheline.  Why do we care about the size of atomic_t?

The above indicate the change is detrimental.

Dave
--
John David Anglin	dave.anglin@bell.net



--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Helge Deller Sept. 23, 2015, 7:30 p.m. UTC | #4
On 23.09.2015 02:12, John David Anglin wrote:
> On 2015-09-22, at 12:20 PM, Helge Deller wrote:
> 
>> On 05.09.2015 23:30, Helge Deller wrote:
>>> Hi James,
>>> ...
>>> I haven't done any performance measurements yet, but your patch looks
>>> absolutely correct.
>>> ...
>>
>> Hello everyone,
>>
>> I did some timing tests with the various patches for
>> a) atomic_hash patches:
>>        https://patchwork.kernel.org/patch/7116811/
>> b) alignment of LWS locks:
>>        https://patchwork.kernel.org/patch/7137931/
>>
>> The testcase I used is basically the following:
>> - It starts 32 threads.
>> - We have 16 atomic ints organized in an array.
>> - The first thread increments NITERS times the first atomic int.
>> - The second thread decrements NITERS times the first atomic int.
>> - The third/fourth thread increments/decrements the second atomic int, and so on...
>> - So, we have 32 threads, of which 16 increments and 16 decrements 16 different atomic ints.
>> - All threads run in parallel on a 4-way SMP PA8700 rp5470 machine.
>> - I used the "time" command to measure the timings.
>> - I did not stopped other services on the machine, but ran each test a few times and the timing results did not show significant variation between each run.
>> - All timings were done on a vanilla kernel 4.2 with only the mentioned patch applied.
>>
>> The code is a modified testcase from the libatomic-ops debian package:
>>
>> AO_t counter_array[16] = { 0, };
>> #define NITERS 1000000
>>
>> void * add1sub1_thr(void * id)
>> {
>>  int me = (int)(AO_PTRDIFF_T)id;
>>  AO_t *counter;
>>  int i;
>>
>>  counter = &counter_array[me >> 1];
>>  for (i = 0; i < NITERS; ++i)
>>    if ((me & 1) != 0) {
>>      (void)AO_fetch_and_sub1(counter);
>>    } else {
>>      (void)AO_fetch_and_add1(counter);
>>    }
>>  return 0;
>> ...
>> run_parallel(32, add1sub1_thr)
>> ...
>>
>>
> 
> Does libatomic-ops now use the GCC sync builtins and the LWS calls?

Yes, if you apply the patch from https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785654
on top of the libatomic-ops debian package.
That's what I tested.

>> The baseline for all results is the timing with a vanilla kernel 4.2:
>> real    0m13.596s
>> user    0m18.152s
>> sys     0m35.752s
>>
>>
>> The next results are with the atomic_hash (a) patch applied:
>> For ATOMIC_HASH_SIZE = 4. 
>> real    0m21.892s 
>> user    0m27.492s
>> sys     0m59.704s
>>
>> For ATOMIC_HASH_SIZE = 64.
>> real    0m20.604s
>> user    0m24.832s
>> sys     0m56.552s
>>
> 
> I'm not sure why the atomic_hash patch would directly affect performance of this test as I
> don't think the patch affects the LWS locks.

True, but even so more, the patch should not have slowed down the (unrelated) testcase.
 
> I question the the atomic hash changes as the original defines are taken directly from generic code.
> Optimally, we want one spinlock per cacheline.  Why do we care about the size of atomic_t?

Assume two unrelated code paths which are protected by two different spinlocks (which are of size atomic_t).
So, if the addresses of those spinlocks calculate to be (virtually) on the same cacheline they would block each other.
With James patch the possibility of blocking each other is theoretically lower (esp. if you increase the number of locks). 

> The above indicate the change is detrimental.

Yes, true for patch (a), but not (b).

Helge
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John David Anglin Sept. 23, 2015, 9 p.m. UTC | #5
On 2015-09-23 3:30 PM, Helge Deller wrote:
> On 23.09.2015 02:12, John David Anglin wrote:
>> On 2015-09-22, at 12:20 PM, Helge Deller wrote:
>>
>>> On 05.09.2015 23:30, Helge Deller wrote:
>>>> Hi James,
>>>> ...
>>>> I haven't done any performance measurements yet, but your patch looks
>>>> absolutely correct.
>>>> ...
>>> Hello everyone,
>>>
>>> I did some timing tests with the various patches for
>>> a) atomic_hash patches:
>>>         https://patchwork.kernel.org/patch/7116811/
>>> b) alignment of LWS locks:
>>>         https://patchwork.kernel.org/patch/7137931/
>>>
>>> The testcase I used is basically the following:
>>> - It starts 32 threads.
>>> - We have 16 atomic ints organized in an array.
>>> - The first thread increments NITERS times the first atomic int.
>>> - The second thread decrements NITERS times the first atomic int.
>>> - The third/fourth thread increments/decrements the second atomic int, and so on...
>>> - So, we have 32 threads, of which 16 increments and 16 decrements 16 different atomic ints.
>>> - All threads run in parallel on a 4-way SMP PA8700 rp5470 machine.
>>> - I used the "time" command to measure the timings.
>>> - I did not stopped other services on the machine, but ran each test a few times and the timing results did not show significant variation between each run.
>>> - All timings were done on a vanilla kernel 4.2 with only the mentioned patch applied.
>>>
>>> The code is a modified testcase from the libatomic-ops debian package:
>>>
>>> AO_t counter_array[16] = { 0, };
>>> #define NITERS 1000000
>>>
>>> void * add1sub1_thr(void * id)
>>> {
>>>   int me = (int)(AO_PTRDIFF_T)id;
>>>   AO_t *counter;
>>>   int i;
>>>
>>>   counter = &counter_array[me >> 1];
>>>   for (i = 0; i < NITERS; ++i)
>>>     if ((me & 1) != 0) {
>>>       (void)AO_fetch_and_sub1(counter);
>>>     } else {
>>>       (void)AO_fetch_and_add1(counter);
>>>     }
>>>   return 0;
>>> ...
>>> run_parallel(32, add1sub1_thr)
>>> ...
>>>
>>>
>> Does libatomic-ops now use the GCC sync builtins and the LWS calls?
> Yes, if you apply the patch from https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785654
> on top of the libatomic-ops debian package.
> That's what I tested.
>
>>> The baseline for all results is the timing with a vanilla kernel 4.2:
>>> real    0m13.596s
>>> user    0m18.152s
>>> sys     0m35.752s
>>>
>>>
>>> The next results are with the atomic_hash (a) patch applied:
>>> For ATOMIC_HASH_SIZE = 4.
>>> real    0m21.892s
>>> user    0m27.492s
>>> sys     0m59.704s
>>>
>>> For ATOMIC_HASH_SIZE = 64.
>>> real    0m20.604s
>>> user    0m24.832s
>>> sys     0m56.552s
>>>
>> I'm not sure why the atomic_hash patch would directly affect performance of this test as I
>> don't think the patch affects the LWS locks.
> True, but even so more, the patch should not have slowed down the (unrelated) testcase.
>   
>> I question the the atomic hash changes as the original defines are taken directly from generic code.
>> Optimally, we want one spinlock per cacheline.  Why do we care about the size of atomic_t?
> Assume two unrelated code paths which are protected by two different spinlocks (which are of size atomic_t).
> So, if the addresses of those spinlocks calculate to be (virtually) on the same cacheline they would block each other.
> With James patch the possibility of blocking each other is theoretically lower (esp. if you increase the number of locks).

I don't believe spinlocks have size atomic_t.  atomic_t is a different 
struct.

The arch_spinlock_t type is defined in spinlock_types.h.  It's size is 4 
on PA20 and 16 otherwise.
This is used for raw_lock field in declaration of the raw_spinlock_t.  
This is combined with some other
fields to create the spinlock_t type.  Through some tricky manipulation 
of the ldcw lock address field,
we avoid specifying any specific alignment for lock.  As far a I can 
tell, spinlock_t types can end up anywhere.

The set of set of locks in  __atomic_hash is used exclusively for 
operations on atomic_ts.  On PA20,
all are locks on PA8800/PA8900 are on two lines even with a size of 64.  
I think we can be a bit
more liberal with storage allocation.

Think ATOMIC_HASH_SIZE should be 
N*L1_CACHE_BYTES/sizeof(arch_spinlock_t) and we should hash
to one lock per line.  Another alternative to compare is hashing to 16 
byte increments.

Dave
James Bottomley Sept. 24, 2015, 2:20 p.m. UTC | #6
On Tue, 2015-09-22 at 20:12 -0400, John David Anglin wrote:
> I question the the atomic hash changes as the original defines are
> taken directly from generic code.

It's about scaling.  The fewer locks, the more contention in a hash lock
system.  The interesting question is where does the line tip over so
that we see less speed up for more locks.

> Optimally, we want one spinlock per cacheline.  Why do we care about
> the size of atomic_t?

OK, so I think we're not using the word 'line size' in the same way.
When Linux says 'line size' it generally means the cache ownership line
size: the minimum block the inter cpu coherence operates on.  Most of
the architectural evidence for PA systems suggests that this is 16  We
should be able to get this definitively: it's however many lower bits of
a virtual address the LCI instruction truncates.  128 seems to be the
cache burst fill size (the number of bytes that will be pulled into the
cache by a usual operation touching any byte in the area).  For
streaming operations, the burst fill size is what we want to use, but
for coherence operations it's the ownership line size.  The reason is
that different CPUs can own adjacent lines uncontended, so one spinlock
per this region is optimal.

The disadvantage to padding things out to the cache burst fill size is
that we blow the cache footprint: data is too far apart and we use far
more cache than we should meaning the cache thrashes much sooner as you
load up the CPU.  On SMP systems, Linux uses SMP_CACHE_BYTES ==
L1_CACHE_BYTES for padding on tons of critical structures if it's too
big we'll waste a lot of cache footprint for no gain.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John David Anglin Sept. 24, 2015, 4:39 p.m. UTC | #7
On 2015-09-24 10:20 AM, James Bottomley wrote:
> On Tue, 2015-09-22 at 20:12 -0400, John David Anglin wrote:
>> I question the the atomic hash changes as the original defines are
>> taken directly from generic code.
> It's about scaling.  The fewer locks, the more contention in a hash lock
> system.  The interesting question is where does the line tip over so
> that we see less speed up for more locks.
>
>> Optimally, we want one spinlock per cacheline.  Why do we care about
>> the size of atomic_t?
> OK, so I think we're not using the word 'line size' in the same way.
> When Linux says 'line size' it generally means the cache ownership line
> size: the minimum block the inter cpu coherence operates on.  Most of
> the architectural evidence for PA systems suggests that this is 16  We
> should be able to get this definitively: it's however many lower bits of
> a virtual address the LCI instruction truncates.  128 seems to be the
> cache burst fill size (the number of bytes that will be pulled into the
> cache by a usual operation touching any byte in the area).  For
> streaming operations, the burst fill size is what we want to use, but
> for coherence operations it's the ownership line size.  The reason is
> that different CPUs can own adjacent lines uncontended, so one spinlock
> per this region is optimal.
>
> The disadvantage to padding things out to the cache burst fill size is
> that we blow the cache footprint: data is too far apart and we use far
> more cache than we should meaning the cache thrashes much sooner as you
> load up the CPU.  On SMP systems, Linux uses SMP_CACHE_BYTES ==
> L1_CACHE_BYTES for padding on tons of critical structures if it's too
> big we'll waste a lot of cache footprint for no gain.
It looks to me like the LCI instruction must zero bits rather than 
truncate as drivers
(e.g., sba_iommu.c) drop the least significant 12 bits (ci >> 
PAGE_SHIFT).  I think we
should do the LCI test.  I had been assuming that the two lengths would 
be the same.

Dave
James Bottomley Sept. 24, 2015, 4:57 p.m. UTC | #8
On Thu, 2015-09-24 at 12:39 -0400, John David Anglin wrote:
> On 2015-09-24 10:20 AM, James Bottomley wrote:
> > On Tue, 2015-09-22 at 20:12 -0400, John David Anglin wrote:
> >> I question the the atomic hash changes as the original defines are
> >> taken directly from generic code.
> > It's about scaling.  The fewer locks, the more contention in a hash lock
> > system.  The interesting question is where does the line tip over so
> > that we see less speed up for more locks.
> >
> >> Optimally, we want one spinlock per cacheline.  Why do we care about
> >> the size of atomic_t?
> > OK, so I think we're not using the word 'line size' in the same way.
> > When Linux says 'line size' it generally means the cache ownership line
> > size: the minimum block the inter cpu coherence operates on.  Most of
> > the architectural evidence for PA systems suggests that this is 16  We
> > should be able to get this definitively: it's however many lower bits of
> > a virtual address the LCI instruction truncates.  128 seems to be the
> > cache burst fill size (the number of bytes that will be pulled into the
> > cache by a usual operation touching any byte in the area).  For
> > streaming operations, the burst fill size is what we want to use, but
> > for coherence operations it's the ownership line size.  The reason is
> > that different CPUs can own adjacent lines uncontended, so one spinlock
> > per this region is optimal.
> >
> > The disadvantage to padding things out to the cache burst fill size is
> > that we blow the cache footprint: data is too far apart and we use far
> > more cache than we should meaning the cache thrashes much sooner as you
> > load up the CPU.  On SMP systems, Linux uses SMP_CACHE_BYTES ==
> > L1_CACHE_BYTES for padding on tons of critical structures if it's too
> > big we'll waste a lot of cache footprint for no gain.
> It looks to me like the LCI instruction must zero bits rather than 
> truncate as drivers
> (e.g., sba_iommu.c) drop the least significant 12 bits (ci >> 
> PAGE_SHIFT).  I think we
> should do the LCI test.  I had been assuming that the two lengths would 
> be the same.

It's a backwards compatibility problem: You get to fix the cache
ownership size once per incompatible architecture.  Once PA produced 64
bit chips was the last opportunity to do this because an older 64 bit
kernel (for PA and Linux) must work reasonably well on newer processors.
This means that if you increase the actual ownership size and the older
kernel suddenly starts thrashing because of SMP cache interference,
everyone blames you and says your new chips are rubbish (this did
actually happen to Intel once, if I remember correctly).  That's why
it's usually avoided.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John David Anglin Sept. 25, 2015, 12:20 p.m. UTC | #9
On 2015-09-24, at 12:57 PM, James Bottomley wrote:

> On Thu, 2015-09-24 at 12:39 -0400, John David Anglin wrote:
>> On 2015-09-24 10:20 AM, James Bottomley wrote:
>>> On Tue, 2015-09-22 at 20:12 -0400, John David Anglin wrote:
>>>> I question the the atomic hash changes as the original defines are
>>>> taken directly from generic code.
>>> It's about scaling.  The fewer locks, the more contention in a hash lock
>>> system.  The interesting question is where does the line tip over so
>>> that we see less speed up for more locks.
>>> 
>>>> Optimally, we want one spinlock per cacheline.  Why do we care about
>>>> the size of atomic_t?
>>> OK, so I think we're not using the word 'line size' in the same way.
>>> When Linux says 'line size' it generally means the cache ownership line
>>> size: the minimum block the inter cpu coherence operates on.  Most of
>>> the architectural evidence for PA systems suggests that this is 16  We
>>> should be able to get this definitively: it's however many lower bits of
>>> a virtual address the LCI instruction truncates.  128 seems to be the
>>> cache burst fill size (the number of bytes that will be pulled into the
>>> cache by a usual operation touching any byte in the area).  For
>>> streaming operations, the burst fill size is what we want to use, but
>>> for coherence operations it's the ownership line size.  The reason is
>>> that different CPUs can own adjacent lines uncontended, so one spinlock
>>> per this region is optimal.
>>> 
>>> The disadvantage to padding things out to the cache burst fill size is
>>> that we blow the cache footprint: data is too far apart and we use far
>>> more cache than we should meaning the cache thrashes much sooner as you
>>> load up the CPU.  On SMP systems, Linux uses SMP_CACHE_BYTES ==
>>> L1_CACHE_BYTES for padding on tons of critical structures if it's too
>>> big we'll waste a lot of cache footprint for no gain.
>> It looks to me like the LCI instruction must zero bits rather than 
>> truncate as drivers
>> (e.g., sba_iommu.c) drop the least significant 12 bits (ci >> 
>> PAGE_SHIFT).  I think we
>> should do the LCI test.  I had been assuming that the two lengths would 
>> be the same.
> 
> It's a backwards compatibility problem: You get to fix the cache
> ownership size once per incompatible architecture.  Once PA produced 64
> bit chips was the last opportunity to do this because an older 64 bit
> kernel (for PA and Linux) must work reasonably well on newer processors.
> This means that if you increase the actual ownership size and the older
> kernel suddenly starts thrashing because of SMP cache interference,
> everyone blames you and says your new chips are rubbish (this did
> actually happen to Intel once, if I remember correctly).  That's why
> it's usually avoided.

The division by L1_CACHE_BYTES in ATOMIC_HASH simply reflects the alignment of
atomic_t types.  We want a hash value that's not always zero.

On systems with hardware atomic operations, atomic variables need to be on separate
cache lines to minimize contention.  Thus, the L1_CACHE_BYTES alignment.

Our atomic_t types are guarded by ldcw locks.  So, they are not really subject to contention.
You could be correct that L1_CACHE_BYTES could be reduced to 16 (i.e., maximum alignment
needed for any type on parisc) provided it isn't used somewhere where we need the actual L1
cache line size as returned by the PDC.

When this started, I was pushing to see if placing our ldcw locks on different lines
helped performance.  Possibly, the number could be increased as well.  On PA20, we currently
align __atomic_hash to __lock_aligned (16 bytes).  So, all four locks are on the same line.

Dave

--
John David Anglin	dave.anglin@bell.net



--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
John David Anglin Sept. 25, 2015, 3:56 p.m. UTC | #10
On 2015-09-25 8:20 AM, John David Anglin wrote:
> Our atomic_t types are guarded by ldcw locks.  So, they are not really subject to contention.
> You could be correct that L1_CACHE_BYTES could be reduced to 16 (i.e., maximum alignment
> needed for any type on parisc) provided it isn't used somewhere where we need the actual L1
> cache line size as returned by the PDC.
Digging through various documentation, I now believe that L1_CACHE_BYTES 
is 16 bytes on ALL PA-RISC
processors.  We are getting confused by the L2 length reported by the 
PDC.  The PA-8800 is essentially
two PA-8700s integrated on the same die.

See page 10 in this document:
https://parisc.wiki.kernel.org/images-parisc/e/e9/PA-8700wp.pdf

It shows the PA-8700 L1 design.

Dave

Patch
diff mbox

diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h
index 226f8ca9..de3361b 100644
--- a/arch/parisc/include/asm/atomic.h
+++ b/arch/parisc/include/asm/atomic.h
@@ -19,14 +19,13 @@ 
 
 #ifdef CONFIG_SMP
 #include <asm/spinlock.h>
-#include <asm/cache.h>		/* we use L1_CACHE_BYTES */
 
 /* Use an array of spinlocks for our atomic_ts.
  * Hash function to index into a different SPINLOCK.
- * Since "a" is usually an address, use one spinlock per cacheline.
+ * Since "a" is usually an address, use one spinlock per atomic.
  */
-#  define ATOMIC_HASH_SIZE 4
-#  define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/L1_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
+#  define ATOMIC_HASH_SIZE 64
+#  define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/sizeof(atomic_t)) & (ATOMIC_HASH_SIZE-1) ]))
 
 extern arch_spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] __lock_aligned;