| Message ID | 1441288665.2235.17.camel@HansenPartnership.com (mailing list archive) |
|---|---|
| State | Superseded, archived |
Hi James,

On 03.09.2015 15:57, James Bottomley wrote:
> Having looked more closely at our atomics, we seem to have a bogus
> assumption about cache lines. Our cache is perfectly capable of
> correctly writing to adjacent bytes in different lines even on different
> processors and having the external visibility sequenced accordingly.
> The reason we need the locks is for correctness not for cache coherence
> problems. This means that we shouldn't hash all locks in the same line
> to the same lock because it introduces unnecessary contention in
> adjacent atomics which could be sorted out much faster by the cache
> logic.
>
> The original way our hash worked would have been correct if the atomic_t
> were cache line aligned ... meaning only one atomic per cache line, but
> they're not, they have no alignment primitives.
>
> On the same thought, to reduce atomic contention probabilistically, we
> should have a much larger atomic array, say 64 or 128 ... it's a single
> array of locks, so the amount of space consumed by expanding it isn't
> huge.
>
> This patch should get rid of the bogus cache assumption and at the same
> time decrease our collision chances. We should probably do some playing
> around to see what the best size for ATOMIC_HASH_SIZE is.

I haven't done any performance measurements yet, but your patch looks
absolutely correct.

Do you mind sending a full patch including your signed-off line, so that
I can include it?

Thanks,
Helge

> diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h
> index 226f8ca9..de3361b 100644
> --- a/arch/parisc/include/asm/atomic.h
> +++ b/arch/parisc/include/asm/atomic.h
> @@ -19,14 +19,13 @@
>
>  #ifdef CONFIG_SMP
>  #include <asm/spinlock.h>
> -#include <asm/cache.h>	/* we use L1_CACHE_BYTES */
>
>  /* Use an array of spinlocks for our atomic_ts.
>   * Hash function to index into a different SPINLOCK.
> - * Since "a" is usually an address, use one spinlock per cacheline.
> + * Since "a" is usually an address, use one spinlock per atomic.
>   */
> -# define ATOMIC_HASH_SIZE 4
> -# define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/L1_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
> +# define ATOMIC_HASH_SIZE 64
> +# define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/sizeof(atomic_t)) & (ATOMIC_HASH_SIZE-1) ]))
>
>  extern arch_spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] __lock_aligned;
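To make the change concrete, here is a small user-space sketch (not kernel code; it assumes L1_CACHE_BYTES == 16 and sizeof(atomic_t) == 4, with the hash parameters taken from the patch; old_hash/new_hash are illustrative names) showing how two adjacent atomics map to lock indices before and after:

/* Worked example: two atomics 4 bytes apart land on the same lock with
 * the old per-cacheline hash, but on different locks with the new
 * per-atomic hash. */
#include <stdio.h>

#define L1_CACHE_BYTES 16

static unsigned old_hash(unsigned long a)
{
	return (a / L1_CACHE_BYTES) & (4 - 1);   /* old: ATOMIC_HASH_SIZE = 4 */
}

static unsigned new_hash(unsigned long a)
{
	return (a / 4) & (64 - 1);               /* new: ATOMIC_HASH_SIZE = 64 */
}

int main(void)
{
	unsigned long a = 0x1000, b = 0x1004;    /* same 16-byte line */

	printf("old: %u %u\n", old_hash(a), old_hash(b)); /* 0 0 -> same lock */
	printf("new: %u %u\n", new_hash(a), new_hash(b)); /* 0 1 -> different locks */
	return 0;
}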
On 05.09.2015 23:30, Helge Deller wrote:
> Hi James,
> ...
> I haven't done any performance measurements yet, but your patch looks
> absolutely correct.
> ...

Hello everyone,

I did some timing tests with the various patches for
a) atomic_hash patches:
   https://patchwork.kernel.org/patch/7116811/
b) alignment of LWS locks:
   https://patchwork.kernel.org/patch/7137931/

The testcase I used is basically the following:
- It starts 32 threads.
- We have 16 atomic ints organized in an array.
- The first thread increments the first atomic int NITERS times.
- The second thread decrements the first atomic int NITERS times.
- The third/fourth thread increments/decrements the second atomic int, and so on...
- So, we have 32 threads, of which 16 increment and 16 decrement 16 different atomic ints.
- All threads run in parallel on a 4-way SMP PA8700 rp5470 machine.
- I used the "time" command to measure the timings.
- I did not stop other services on the machine, but I ran each test a few times and the timing results did not show significant variation between runs.
- All timings were done on a vanilla kernel 4.2 with only the mentioned patch applied.

The code is a modified testcase from the libatomic-ops debian package:

AO_t counter_array[16] = { 0, };
#define NITERS 1000000

void * add1sub1_thr(void * id)
{
	int me = (int)(AO_PTRDIFF_T)id;
	AO_t *counter;
	int i;

	counter = &counter_array[me >> 1];
	for (i = 0; i < NITERS; ++i)
		if ((me & 1) != 0) {
			(void)AO_fetch_and_sub1(counter);
		} else {
			(void)AO_fetch_and_add1(counter);
		}
	return 0;
}
...
run_parallel(32, add1sub1_thr)
...

The baseline for all results is the timing with a vanilla kernel 4.2:
 real    0m13.596s
 user    0m18.152s
 sys     0m35.752s

The next results are with the atomic_hash (a) patch applied:

For ATOMIC_HASH_SIZE = 4:
 real    0m21.892s
 user    0m27.492s
 sys     0m59.704s

For ATOMIC_HASH_SIZE = 64:
 real    0m20.604s
 user    0m24.832s
 sys     0m56.552s

Next I applied the LWS locks patch (b):

For LWS_LOCK_ALIGN_BITS = 4:
 real    0m13.660s
 user    0m18.592s
 sys     0m35.236s

For LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
 real    0m11.992s
 user    0m19.064s
 sys     0m28.476s

Then I applied both patches (a and b):

ATOMIC_HASH_SIZE = 64, LWS_LOCK_ALIGN_BITS = 4:
 real    0m13.232s
 user    0m17.704s
 sys     0m33.884s

ATOMIC_HASH_SIZE = 64, LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
 real    0m12.300s
 user    0m20.268s
 sys     0m28.424s

ATOMIC_HASH_SIZE = 4, LWS_LOCK_ALIGN_BITS = 4:
 real    0m13.181s
 user    0m17.584s
 sys     0m34.800s

ATOMIC_HASH_SIZE = 4, LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
 real    0m11.692s
 user    0m18.232s
 sys     0m27.072s

In summary, I'm astonished by those results. From patch (a) applied
stand-alone I would have expected the same or better performance,
because it makes the spinlocks more fine-grained. But a performance
drop of 50% is strange.

Patch (b) stand-alone significantly increases performance (~20%), and
together with patch (a) it even adds a few more percent on top.

Given the numbers above I currently would suggest to apply both patches
(with ATOMIC_HASH_SIZE = 4 and LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT).

Thoughts?

Helge
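The run_parallel() helper is elided ("...") above; a minimal pthread-based sketch of what it plausibly does (hypothetical — the real helper lives in the libatomic-ops test suite and may differ) would be:

/* Hypothetical sketch of the elided run_parallel() helper: start
 * nthreads copies of fn, passing each its index as the id, then join
 * them all.  Assumes nthreads <= 32, matching the testcase. */
#include <pthread.h>

static void run_parallel(int nthreads, void *(*fn)(void *))
{
	pthread_t thr[32];
	int i;

	for (i = 0; i < nthreads; i++)
		pthread_create(&thr[i], NULL, fn, (void *)(long)i);
	for (i = 0; i < nthreads; i++)
		pthread_join(thr[i], NULL);
}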
On 2015-09-22, at 12:20 PM, Helge Deller wrote:

> On 05.09.2015 23:30, Helge Deller wrote:
>> Hi James,
>> ...
>> I haven't done any performance measurements yet, but your patch looks
>> absolutely correct.
>> ...
>
> Hello everyone,
>
> I did some timing tests with the various patches for
> a) atomic_hash patches:
>    https://patchwork.kernel.org/patch/7116811/
> b) alignment of LWS locks:
>    https://patchwork.kernel.org/patch/7137931/
>
> [...]
>
> The code is a modified testcase from the libatomic-ops debian package:
>
> [...]
> run_parallel(32, add1sub1_thr)
> ...

Does libatomic-ops now use the GCC sync builtins and the LWS calls?

> The baseline for all results is the timing with a vanilla kernel 4.2:
>  real    0m13.596s
>  user    0m18.152s
>  sys     0m35.752s
>
> The next results are with the atomic_hash (a) patch applied:
>
> For ATOMIC_HASH_SIZE = 4:
>  real    0m21.892s
>  user    0m27.492s
>  sys     0m59.704s
>
> For ATOMIC_HASH_SIZE = 64:
>  real    0m20.604s
>  user    0m24.832s
>  sys     0m56.552s

I'm not sure why the atomic_hash patch would directly affect the
performance of this test, as I don't think the patch affects the LWS
locks.

I question the atomic hash changes, as the original defines are taken
directly from generic code. Optimally, we want one spinlock per
cacheline. Why do we care about the size of atomic_t?

The above results indicate the change is detrimental.

Dave
--
John David Anglin	dave.anglin@bell.net
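For context on the LWS question: user code built against the patched libatomic-ops goes through the GCC __sync builtins, roughly as in this minimal sketch (illustrative; on hppa-linux these builtins are, to my understanding, implemented in libgcc on top of the kernel's light-weight-syscall compare-and-swap helper):

/* Minimal sketch of the kind of atomic traffic the testcase generates.
 * On hppa-linux the __sync builtins are believed to be backed by the
 * kernel's LWS compare-and-swap helper via libgcc. */
#include <stdio.h>

static int counter;

int main(void)
{
	__sync_fetch_and_add(&counter, 1);  /* atomic increment */
	__sync_fetch_and_sub(&counter, 1);  /* atomic decrement */
	printf("counter = %d\n", counter);  /* prints 0 */
	return 0;
}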
On 23.09.2015 02:12, John David Anglin wrote:
> On 2015-09-22, at 12:20 PM, Helge Deller wrote:
>> [...]
>> The code is a modified testcase from the libatomic-ops debian package:
>> [...]
>
> Does libatomic-ops now use the GCC sync builtins and the LWS calls?

Yes, if you apply the patch from
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785654
on top of the libatomic-ops debian package.
That's what I tested.

>> The baseline for all results is the timing with a vanilla kernel 4.2:
>>  real    0m13.596s
>>  user    0m18.152s
>>  sys     0m35.752s
>>
>> The next results are with the atomic_hash (a) patch applied:
>>
>> For ATOMIC_HASH_SIZE = 4:
>>  real    0m21.892s
>>  user    0m27.492s
>>  sys     0m59.704s
>>
>> For ATOMIC_HASH_SIZE = 64:
>>  real    0m20.604s
>>  user    0m24.832s
>>  sys     0m56.552s
>
> I'm not sure why the atomic_hash patch would directly affect the
> performance of this test, as I don't think the patch affects the LWS
> locks.

True, but even more so: the patch should not have slowed down the
(unrelated) testcase.

> I question the atomic hash changes, as the original defines are taken
> directly from generic code. Optimally, we want one spinlock per
> cacheline. Why do we care about the size of atomic_t?

Assume two unrelated code paths which are protected by two different
spinlocks (which are of size atomic_t). If the addresses of those
spinlocks happen to fall (virtually) on the same cacheline, they block
each other. With James' patch the probability of blocking each other is
theoretically lower (especially if you increase the number of locks).

> The above results indicate the change is detrimental.

Yes, true for patch (a), but not (b).

Helge
On 2015-09-23 3:30 PM, Helge Deller wrote:
> On 23.09.2015 02:12, John David Anglin wrote:
>> [...]
>> I question the atomic hash changes, as the original defines are taken
>> directly from generic code. Optimally, we want one spinlock per
>> cacheline. Why do we care about the size of atomic_t?
>
> Assume two unrelated code paths which are protected by two different
> spinlocks (which are of size atomic_t). If the addresses of those
> spinlocks happen to fall (virtually) on the same cacheline, they block
> each other. With James' patch the probability of blocking each other is
> theoretically lower (especially if you increase the number of locks).

I don't believe spinlocks have size atomic_t; atomic_t is a different
struct. The arch_spinlock_t type is defined in spinlock_types.h. Its
size is 4 on PA20 and 16 otherwise. This is used for the raw_lock field
in the declaration of raw_spinlock_t, which in turn is combined with
some other fields to create the spinlock_t type. Through some tricky
manipulation of the ldcw lock address, we avoid specifying any specific
alignment for the lock. As far as I can tell, spinlock_t types can end
up anywhere.

The set of locks in __atomic_hash is used exclusively for operations on
atomic_ts. On PA20, all locks on PA8800/PA8900 are on two lines even
with a size of 64. I think we can be a bit more liberal with storage
allocation: ATOMIC_HASH_SIZE should be
N * L1_CACHE_BYTES / sizeof(arch_spinlock_t), and we should hash to one
lock per line. Another alternative to compare is hashing to 16-byte
increments.

Dave
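A sketch of that sizing (hypothetical; N = 16 is an arbitrary illustration, and ATOMIC_HASH_SIZE must come out a power of two for the mask trick to work):

/* Hypothetical sizing per the suggestion above: N cachelines' worth of
 * locks, hashed one lock per line.  With N = 16, L1_CACHE_BYTES = 16
 * and sizeof(arch_spinlock_t) = 4 (PA20), this yields 64 locks. */
#define N 16
#define ATOMIC_HASH_SIZE (N * L1_CACHE_BYTES / sizeof(arch_spinlock_t))
#define ATOMIC_HASH(a) \
	(&(__atomic_hash[(((unsigned long)(a)) / L1_CACHE_BYTES) & (ATOMIC_HASH_SIZE - 1)]))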
On Tue, 2015-09-22 at 20:12 -0400, John David Anglin wrote:
> I question the atomic hash changes as the original defines are
> taken directly from generic code.

It's about scaling. The fewer locks, the more contention in a hash lock
system. The interesting question is where does the line tip over so
that we see less speed up for more locks.

> Optimally, we want one spinlock per cacheline. Why do we care about
> the size of atomic_t?

OK, so I think we're not using the word 'line size' in the same way.
When Linux says 'line size' it generally means the cache ownership line
size: the minimum block the inter-cpu coherence operates on. Most of
the architectural evidence for PA systems suggests that this is 16.
We should be able to get this definitively: it's however many lower
bits of a virtual address the LCI instruction truncates. 128 seems to
be the cache burst fill size (the number of bytes that will be pulled
into the cache by a usual operation touching any byte in the area).
For streaming operations, the burst fill size is what we want to use,
but for coherence operations it's the ownership line size. The reason
is that different CPUs can own adjacent lines uncontended, so one
spinlock per this region is optimal.

The disadvantage to padding things out to the cache burst fill size is
that we blow the cache footprint: data is too far apart and we use far
more cache than we should, meaning the cache thrashes much sooner as
you load up the CPU. On SMP systems, Linux uses SMP_CACHE_BYTES ==
L1_CACHE_BYTES for padding on tons of critical structures; if it's too
big we'll waste a lot of cache footprint for no gain.

James
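The footprint argument is easy to quantify; a quick user-space illustration (using the 16-byte ownership line and 128-byte burst fill figures above; the struct names are illustrative):

/* Cache-footprint cost of padding: 64 locks padded to the 128-byte
 * burst-fill size take 8x the space of locks padded to the 16-byte
 * ownership line size. */
#include <stdio.h>

struct lock16  { int lock; } __attribute__((aligned(16)));
struct lock128 { int lock; } __attribute__((aligned(128)));

int main(void)
{
	printf("64 locks @ 16-byte lines:  %zu bytes\n", 64 * sizeof(struct lock16));
	printf("64 locks @ 128-byte lines: %zu bytes\n", 64 * sizeof(struct lock128));
	return 0;	/* prints 1024 vs. 8192 */
}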
On 2015-09-24 10:20 AM, James Bottomley wrote:
> On Tue, 2015-09-22 at 20:12 -0400, John David Anglin wrote:
>> Optimally, we want one spinlock per cacheline. Why do we care about
>> the size of atomic_t?
> OK, so I think we're not using the word 'line size' in the same way.
> When Linux says 'line size' it generally means the cache ownership line
> size: the minimum block the inter-cpu coherence operates on. Most of
> the architectural evidence for PA systems suggests that this is 16.
> We should be able to get this definitively: it's however many lower
> bits of a virtual address the LCI instruction truncates.
> [...]

It looks to me like the LCI instruction must zero bits rather than
truncate, as drivers (e.g., sba_iommu.c) drop the least significant 12
bits (ci >> PAGE_SHIFT). I think we should do the LCI test. I had been
assuming that the two lengths would be the same.

Dave
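A sketch of what that LCI test might look like (kernel-side and hypothetical: LCI is a privileged instruction, the inline-asm form mimics the lci use in sba_iommu.c, and the exact operand syntax should be checked against the architecture manual; coherence_index and lci_line_size_test are made-up names):

/* Probe the coherence (ownership) line size: starting from a
 * line-aligned base address, find the first offset whose coherence
 * index differs from that of the base. */
#include <linux/kernel.h>

static unsigned long coherence_index(unsigned long va)
{
	unsigned long ci;

	asm("lci 0(%1), %0" : "=r" (ci) : "r" (va));
	return ci;
}

static void lci_line_size_test(unsigned long base)  /* base line-aligned */
{
	unsigned long off;

	/* 256 is an arbitrary upper bound; if no CI change is seen the
	 * result is inconclusive. */
	for (off = 4; off <= 256; off += 4)
		if (coherence_index(base + off) != coherence_index(base))
			break;
	printk(KERN_INFO "coherence line size appears to be %lu bytes\n", off);
}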
On Thu, 2015-09-24 at 12:39 -0400, John David Anglin wrote:
> On 2015-09-24 10:20 AM, James Bottomley wrote:
>> [...]
>> We should be able to get this definitively: it's however many lower
>> bits of a virtual address the LCI instruction truncates.
>> [...]
> It looks to me like the LCI instruction must zero bits rather than
> truncate, as drivers (e.g., sba_iommu.c) drop the least significant 12
> bits (ci >> PAGE_SHIFT). I think we should do the LCI test. I had been
> assuming that the two lengths would be the same.

It's a backwards compatibility problem: you get to fix the cache
ownership size once per incompatible architecture. The last opportunity
to do this was when PA produced 64 bit chips, because an older 64 bit
kernel (for PA and Linux) must work reasonably well on newer processors.
This means that if you increase the actual ownership size and the older
kernel suddenly starts thrashing because of SMP cache interference,
everyone blames you and says your new chips are rubbish (this did
actually happen to Intel once, if I remember correctly). That's why
it's usually avoided.

James
On 2015-09-24, at 12:57 PM, James Bottomley wrote:

> On Thu, 2015-09-24 at 12:39 -0400, John David Anglin wrote:
>> [...]
>> It looks to me like the LCI instruction must zero bits rather than
>> truncate, as drivers (e.g., sba_iommu.c) drop the least significant 12
>> bits (ci >> PAGE_SHIFT). I think we should do the LCI test.
>
> It's a backwards compatibility problem: you get to fix the cache
> ownership size once per incompatible architecture.
> [...]
> That's why it's usually avoided.

The division by L1_CACHE_BYTES in ATOMIC_HASH simply reflects the
alignment of atomic_t types. We want a hash value that's not always
zero. On systems with hardware atomic operations, atomic variables need
to be on separate cache lines to minimize contention; thus the
L1_CACHE_BYTES alignment.

Our atomic_t types are guarded by ldcw locks, so they are not really
subject to contention. You could be correct that L1_CACHE_BYTES could
be reduced to 16 (i.e., the maximum alignment needed for any type on
parisc), provided it isn't used somewhere where we need the actual L1
cache line size as returned by the PDC.

When this started, I was pushing to see if placing our ldcw locks on
different lines helped performance. Possibly, the number could be
increased as well.

On PA20, we currently align __atomic_hash to __lock_aligned (16 bytes).
So, all four locks are on the same line.

Dave
--
John David Anglin	dave.anglin@bell.net
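For comparison, a hypothetical layout that would give each hash lock its own 16-byte line (a sketch only, not what the kernel does today; padded_spinlock_t is a made-up name and the padding assumes the 4-byte PA20 lock):

/* Hypothetical: pad each hash lock out to a full 16-byte line so the
 * four PA20 locks no longer share one line. */
typedef struct {
	arch_spinlock_t slock;
	char pad[16 - sizeof(arch_spinlock_t)];  /* 12 bytes on PA20 */
} __attribute__((aligned(16))) padded_spinlock_t;

extern padded_spinlock_t __atomic_hash[ATOMIC_HASH_SIZE];

#define ATOMIC_HASH(a) \
	(&(__atomic_hash[(((unsigned long)(a)) / 16) & (ATOMIC_HASH_SIZE - 1)].slock))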
On 2015-09-25 8:20 AM, John David Anglin wrote:
> Our atomic_t types are guarded by ldcw locks, so they are not really
> subject to contention. You could be correct that L1_CACHE_BYTES could
> be reduced to 16 (i.e., the maximum alignment needed for any type on
> parisc), provided it isn't used somewhere where we need the actual L1
> cache line size as returned by the PDC.

Digging through various documentation, I now believe that L1_CACHE_BYTES
is 16 bytes on ALL PA-RISC processors. We are getting confused by the L2
length reported by the PDC. The PA-8800 is essentially two PA-8700s
integrated on the same die. See page 10 in this document:
https://parisc.wiki.kernel.org/images-parisc/e/e9/PA-8700wp.pdf
It shows the PA-8700 L1 design.

Dave
diff --git a/arch/parisc/include/asm/atomic.h b/arch/parisc/include/asm/atomic.h
index 226f8ca9..de3361b 100644
--- a/arch/parisc/include/asm/atomic.h
+++ b/arch/parisc/include/asm/atomic.h
@@ -19,14 +19,13 @@

 #ifdef CONFIG_SMP
 #include <asm/spinlock.h>
-#include <asm/cache.h>	/* we use L1_CACHE_BYTES */

 /* Use an array of spinlocks for our atomic_ts.
  * Hash function to index into a different SPINLOCK.
- * Since "a" is usually an address, use one spinlock per cacheline.
+ * Since "a" is usually an address, use one spinlock per atomic.
  */
-# define ATOMIC_HASH_SIZE 4
-# define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/L1_CACHE_BYTES) & (ATOMIC_HASH_SIZE-1) ]))
+# define ATOMIC_HASH_SIZE 64
+# define ATOMIC_HASH(a) (&(__atomic_hash[ (((unsigned long) (a))/sizeof(atomic_t)) & (ATOMIC_HASH_SIZE-1) ]))

 extern arch_spinlock_t __atomic_hash[ATOMIC_HASH_SIZE] __lock_aligned;