Message ID | 20160321171403.GE25466@e104818-lin.cambridge.arm.com (mailing list archive)
---|---
State | New, archived
On Mon, Mar 21, 2016 at 05:14:03PM +0000, Catalin Marinas wrote:
> On Fri, Mar 18, 2016 at 09:05:37PM +0000, Chalamarla, Tirumalesh wrote:
> > On 3/16/16, 2:32 AM, "linux-arm-kernel on behalf of Ganesh Mahendran" <linux-arm-kernel-bounces@lists.infradead.org on behalf of opensource.ganesh@gmail.com> wrote:
> > > Reverts commit 97303480753e ("arm64: Increase the max granular size").
> > >
> > > Commit 97303480753e ("arm64: Increase the max granular size")
> > > degrades system performance on some CPUs.
> > >
> > > We tested wifi network throughput with iperf on a Qualcomm msm8996 CPU:
> > > ----------------
> > > run on host:
> > >   # iperf -s
> > > run on device:
> > >   # iperf -c <device-ip-addr> -t 100 -i 1
> > > ----------------
> > >
> > > Test result:
> > > ----------------
> > > with commit 97303480753e ("arm64: Increase the max granular size"):
> > >     172 Mbits/sec
> > >
> > > without commit 97303480753e ("arm64: Increase the max granular size"):
> > >     230 Mbits/sec
> > > ----------------
> > >
> > > Some modules, such as slab and net, use L1_CACHE_SHIFT, so if we do not
> > > set the parameter correctly, it may affect system performance.
> > >
> > > So revert the commit.
> >
> > Is there any explanation why this is so? Maybe there is an
> > alternative to this, apart from reverting the commit.
>
> I agree we need an explanation, but in the meantime this patch has
> caused a regression on certain systems.
>
> > Until now it seems L1_CACHE_SHIFT was the max of the supported chips. But
> > now we are making it 64 bytes; is there any reason why not 32?
>
> We may have to revisit this logic and consider L1_CACHE_BYTES the
> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
> Do you have any benchmarks on Cavium boards that would show significant
> degradation with 64-byte L1_CACHE_BYTES vs 128?
>
> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
> _maximum_ of the supported systems:
>
> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> index 5082b30bc2c0..4b5d7b27edaf 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -18,17 +18,17 @@
>
>  #include <asm/cachetype.h>
>
> -#define L1_CACHE_SHIFT		7
> +#define L1_CACHE_SHIFT		6
>  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
>
>  /*
>   * Memory returned by kmalloc() may be used for DMA, so we must make
> - * sure that all such allocations are cache aligned. Otherwise,
> - * unrelated code may cause parts of the buffer to be read into the
> - * cache before the transfer is done, causing old data to be seen by
> - * the CPU.
> + * sure that all such allocations are aligned to the maximum *known*
> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
> + * parts of the buffer to be read into the cache before the transfer is
> + * done, causing old data to be seen by the CPU.
>   */
> -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
> +#define ARCH_DMA_MINALIGN	(128)

Does this actually fix the reported iperf regression? My assumption was
that ARCH_DMA_MINALIGN is the problem, but I could be wrong.

Will
On Mon, Mar 21, 2016 at 05:23:01PM +0000, Will Deacon wrote:
> On Mon, Mar 21, 2016 at 05:14:03PM +0000, Catalin Marinas wrote:
> > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> > index 5082b30bc2c0..4b5d7b27edaf 100644
> > --- a/arch/arm64/include/asm/cache.h
> > +++ b/arch/arm64/include/asm/cache.h
> > @@ -18,17 +18,17 @@
> >
> >  #include <asm/cachetype.h>
> >
> > -#define L1_CACHE_SHIFT		7
> > +#define L1_CACHE_SHIFT		6
> >  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
> >
> >  /*
> >   * Memory returned by kmalloc() may be used for DMA, so we must make
> > - * sure that all such allocations are cache aligned. Otherwise,
> > - * unrelated code may cause parts of the buffer to be read into the
> > - * cache before the transfer is done, causing old data to be seen by
> > - * the CPU.
> > + * sure that all such allocations are aligned to the maximum *known*
> > + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
> > + * parts of the buffer to be read into the cache before the transfer is
> > + * done, causing old data to be seen by the CPU.
> >   */
> > -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
> > +#define ARCH_DMA_MINALIGN	(128)
>
> Does this actually fix the reported iperf regression? My assumption was
> that ARCH_DMA_MINALIGN is the problem, but I could be wrong.

I can't tell. But since I haven't seen any better explanation in this
thread yet, I hope that at least someone would try this patch and come
back with numbers.

For networking, SKB_DATA_ALIGN() uses SMP_CACHE_BYTES (== L1_CACHE_BYTES).
I think (hope) this alignment is not meant for non-coherent DMA,
otherwise using SMP_CACHE_BYTES wouldn't make sense.
On 3/21/16, 10:33 AM, "Catalin Marinas" <catalin.marinas@arm.com> wrote:

> On Mon, Mar 21, 2016 at 05:23:01PM +0000, Will Deacon wrote:
> > On Mon, Mar 21, 2016 at 05:14:03PM +0000, Catalin Marinas wrote:
> > > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> > > index 5082b30bc2c0..4b5d7b27edaf 100644
> > > --- a/arch/arm64/include/asm/cache.h
> > > +++ b/arch/arm64/include/asm/cache.h
> > > @@ -18,17 +18,17 @@
> > >
> > >  #include <asm/cachetype.h>
> > >
> > > -#define L1_CACHE_SHIFT		7
> > > +#define L1_CACHE_SHIFT		6
> > >  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
> > >
> > >  /*
> > >   * Memory returned by kmalloc() may be used for DMA, so we must make
> > > - * sure that all such allocations are cache aligned. Otherwise,
> > > - * unrelated code may cause parts of the buffer to be read into the
> > > - * cache before the transfer is done, causing old data to be seen by
> > > - * the CPU.
> > > + * sure that all such allocations are aligned to the maximum *known*
> > > + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
> > > + * parts of the buffer to be read into the cache before the transfer is
> > > + * done, causing old data to be seen by the CPU.
> > >   */
> > > -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
> > > +#define ARCH_DMA_MINALIGN	(128)
> >
> > Does this actually fix the reported iperf regression? My assumption was
> > that ARCH_DMA_MINALIGN is the problem, but I could be wrong.
>
> I can't tell. But since I haven't seen any better explanation in this
> thread yet, I hope that at least someone would try this patch and come
> back with numbers.
>
> For networking, SKB_DATA_ALIGN() uses SMP_CACHE_BYTES (== L1_CACHE_BYTES).
> I think (hope) this alignment is not meant for non-coherent DMA,
> otherwise using SMP_CACHE_BYTES wouldn't make sense.

As I see the problem, maybe it is because fewer buffers are available
due to this alignment requirement. But that should be fixed by making
slab alignment a run-time thing.

I may be totally wrong. We are still coming up with a decent benchmark
that shows the degradation.

> --
> Catalin
On 4/5/2017 10:13 AM, Imran Khan wrote:
> Hi Catalin,
>
> > From: Catalin Marinas <catalin.marinas@arm.com>
> > Date: Mon, Mar 21, 2016 at 10:44 PM
> > Subject: Re: [PATCH] Revert "arm64: Increase the max granular size"
> > To: "Chalamarla, Tirumalesh" <Tirumalesh.Chalamarla@caviumnetworks.com>
> > Cc: Ganesh Mahendran <opensource.ganesh@gmail.com>, "will.deacon@arm.com" <will.deacon@arm.com>, "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>, "linux-kernel@vger.kernel.org" < >, "stable@vger.kernel.org" <stable@vger.kernel.org>
> >
> > On Fri, Mar 18, 2016 at 09:05:37PM +0000, Chalamarla, Tirumalesh wrote:
> > > On 3/16/16, 2:32 AM, "linux-arm-kernel on behalf of Ganesh Mahendran" <linux-arm-kernel-bounces@lists.infradead.org on behalf of opensource.ganesh@gmail.com> wrote:
> > > > Reverts commit 97303480753e ("arm64: Increase the max granular size").
> > > >
> > > > Commit 97303480753e ("arm64: Increase the max granular size")
> > > > degrades system performance on some CPUs.
> > > >
> > > > We tested wifi network throughput with iperf on a Qualcomm msm8996 CPU:
> > > > ----------------
> > > > run on host:
> > > >   # iperf -s
> > > > run on device:
> > > >   # iperf -c <device-ip-addr> -t 100 -i 1
> > > > ----------------
> > > >
> > > > Test result:
> > > > ----------------
> > > > with commit 97303480753e ("arm64: Increase the max granular size"):
> > > >     172 Mbits/sec
> > > >
> > > > without commit 97303480753e ("arm64: Increase the max granular size"):
> > > >     230 Mbits/sec
> > > > ----------------
> > > >
> > > > Some modules, such as slab and net, use L1_CACHE_SHIFT, so if we do not
> > > > set the parameter correctly, it may affect system performance.
> > > >
> > > > So revert the commit.
> > >
> > > Is there any explanation why this is so? Maybe there is an
> > > alternative to this, apart from reverting the commit.
> >
> > I agree we need an explanation, but in the meantime this patch has
> > caused a regression on certain systems.
> >
> > > Until now it seems L1_CACHE_SHIFT was the max of the supported chips. But
> > > now we are making it 64 bytes; is there any reason why not 32?
> >
> > We may have to revisit this logic and consider L1_CACHE_BYTES the
> > _minimum_ of cache line sizes in arm64 systems supported by the kernel.
> > Do you have any benchmarks on Cavium boards that would show significant
> > degradation with 64-byte L1_CACHE_BYTES vs 128?
> >
> > For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
> > _maximum_ of the supported systems:
> >
> > diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> > index 5082b30bc2c0..4b5d7b27edaf 100644
> > --- a/arch/arm64/include/asm/cache.h
> > +++ b/arch/arm64/include/asm/cache.h
> > @@ -18,17 +18,17 @@
> >
> >  #include <asm/cachetype.h>
> >
> > -#define L1_CACHE_SHIFT		7
> > +#define L1_CACHE_SHIFT		6
> >  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
> >
> >  /*
> >   * Memory returned by kmalloc() may be used for DMA, so we must make
> > - * sure that all such allocations are cache aligned. Otherwise,
> > - * unrelated code may cause parts of the buffer to be read into the
> > - * cache before the transfer is done, causing old data to be seen by
> > - * the CPU.
> > + * sure that all such allocations are aligned to the maximum *known*
> > + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
> > + * parts of the buffer to be read into the cache before the transfer is
> > + * done, causing old data to be seen by the CPU.
> >   */
> > -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
> > +#define ARCH_DMA_MINALIGN	(128)
> >
> >  #ifndef __ASSEMBLY__
> >
> > diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> > index 392c67eb9fa6..30bafca1aebf 100644
> > --- a/arch/arm64/kernel/cpufeature.c
> > +++ b/arch/arm64/kernel/cpufeature.c
> > @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
> >  	if (!cwg)
> >  		pr_warn("No Cache Writeback Granule information, assuming cache line size %d\n",
> >  			cls);
> > -	if (L1_CACHE_BYTES < cls)
> > -		pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
> > -			L1_CACHE_BYTES, cls);
> > +	if (ARCH_DMA_MINALIGN < cls)
> > +		pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
> > +			ARCH_DMA_MINALIGN, cls);
> >  }
> >
> >  static bool __maybe_unused
>
> This change was discussed at [1] but was not concluded, as apparently no one
> came back with a test report and numbers. After including this change in our
> local kernel we are seeing significant throughput improvement. For example, with:
>
>     iperf -c 192.168.1.181 -i 1 -w 128K -t 60
>
> the average throughput improves by about 30% (to 230 Mbps from 180 Mbps).
> Could you please let us know if this change can be included in the upstream kernel?
>
> [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs

Hi Catalin,

Could you please provide some feedback about the above mentioned query?

Thanks and Regards,
Imran
On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
> On 4/5/2017 10:13 AM, Imran Khan wrote:
> >> We may have to revisit this logic and consider L1_CACHE_BYTES the
> >> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
> >> Do you have any benchmarks on Cavium boards that would show significant
> >> degradation with 64-byte L1_CACHE_BYTES vs 128?
> >>
> >> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
> >> _maximum_ of the supported systems:
> >>
> >> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> >> index 5082b30bc2c0..4b5d7b27edaf 100644
> >> --- a/arch/arm64/include/asm/cache.h
> >> +++ b/arch/arm64/include/asm/cache.h
> >> @@ -18,17 +18,17 @@
> >>
> >>  #include <asm/cachetype.h>
> >>
> >> -#define L1_CACHE_SHIFT		7
> >> +#define L1_CACHE_SHIFT		6
> >>  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
> >>
> >>  /*
> >>   * Memory returned by kmalloc() may be used for DMA, so we must make
> >> - * sure that all such allocations are cache aligned. Otherwise,
> >> - * unrelated code may cause parts of the buffer to be read into the
> >> - * cache before the transfer is done, causing old data to be seen by
> >> - * the CPU.
> >> + * sure that all such allocations are aligned to the maximum *known*
> >> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
> >> + * parts of the buffer to be read into the cache before the transfer is
> >> + * done, causing old data to be seen by the CPU.
> >>   */
> >> -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
> >> +#define ARCH_DMA_MINALIGN	(128)
> >>
> >>  #ifndef __ASSEMBLY__
> >>
> >> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> >> index 392c67eb9fa6..30bafca1aebf 100644
> >> --- a/arch/arm64/kernel/cpufeature.c
> >> +++ b/arch/arm64/kernel/cpufeature.c
> >> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
> >>  	if (!cwg)
> >>  		pr_warn("No Cache Writeback Granule information, assuming cache line size %d\n",
> >>  			cls);
> >> -	if (L1_CACHE_BYTES < cls)
> >> -		pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
> >> -			L1_CACHE_BYTES, cls);
> >> +	if (ARCH_DMA_MINALIGN < cls)
> >> +		pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
> >> +			ARCH_DMA_MINALIGN, cls);
> >>  }
> >>
> >>  static bool __maybe_unused
> >
> > This change was discussed at [1] but was not concluded, as apparently no one
> > came back with a test report and numbers. After including this change in our
> > local kernel we are seeing significant throughput improvement. For example, with:
> >
> >     iperf -c 192.168.1.181 -i 1 -w 128K -t 60
> >
> > the average throughput improves by about 30% (to 230 Mbps from 180 Mbps).
> > Could you please let us know if this change can be included in the upstream kernel.
> >
> > [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
>
> Could you please provide some feedback about the above mentioned query?

Do you have an explanation on the performance variation when
L1_CACHE_BYTES is changed? We'd need to understand how the network stack
is affected by L1_CACHE_BYTES, in which context it uses it (is it for
non-coherent DMA?).

The Cavium guys haven't shown any numbers (IIUC) to back the
L1_CACHE_BYTES performance improvement, but I would not revert the
original commit since ARCH_DMA_MINALIGN definitely needs to cover the
maximum available cache line size, which is 128 for them.
2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>:
> On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
>> On 4/5/2017 10:13 AM, Imran Khan wrote:
>> >> We may have to revisit this logic and consider L1_CACHE_BYTES the
>> >> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
>> >> Do you have any benchmarks on Cavium boards that would show significant
>> >> degradation with 64-byte L1_CACHE_BYTES vs 128?
>> >>
>> >> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
>> >> _maximum_ of the supported systems:
>> >>
>> >> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>> >> index 5082b30bc2c0..4b5d7b27edaf 100644
>> >> --- a/arch/arm64/include/asm/cache.h
>> >> +++ b/arch/arm64/include/asm/cache.h
>> >> @@ -18,17 +18,17 @@
>> >>
>> >>  #include <asm/cachetype.h>
>> >>
>> >> -#define L1_CACHE_SHIFT		7
>> >> +#define L1_CACHE_SHIFT		6
>> >>  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
>> >>
>> >>  /*
>> >>   * Memory returned by kmalloc() may be used for DMA, so we must make
>> >> - * sure that all such allocations are cache aligned. Otherwise,
>> >> - * unrelated code may cause parts of the buffer to be read into the
>> >> - * cache before the transfer is done, causing old data to be seen by
>> >> - * the CPU.
>> >> + * sure that all such allocations are aligned to the maximum *known*
>> >> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
>> >> + * parts of the buffer to be read into the cache before the transfer is
>> >> + * done, causing old data to be seen by the CPU.
>> >>   */
>> >> -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
>> >> +#define ARCH_DMA_MINALIGN	(128)
>> >>
>> >>  #ifndef __ASSEMBLY__
>> >>
>> >> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>> >> index 392c67eb9fa6..30bafca1aebf 100644
>> >> --- a/arch/arm64/kernel/cpufeature.c
>> >> +++ b/arch/arm64/kernel/cpufeature.c
>> >> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
>> >>  	if (!cwg)
>> >>  		pr_warn("No Cache Writeback Granule information, assuming cache line size %d\n",
>> >>  			cls);
>> >> -	if (L1_CACHE_BYTES < cls)
>> >> -		pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
>> >> -			L1_CACHE_BYTES, cls);
>> >> +	if (ARCH_DMA_MINALIGN < cls)
>> >> +		pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
>> >> +			ARCH_DMA_MINALIGN, cls);
>> >>  }
>> >>
>> >>  static bool __maybe_unused
>> >
>> > This change was discussed at [1] but was not concluded, as apparently no one
>> > came back with a test report and numbers. After including this change in our
>> > local kernel we are seeing significant throughput improvement. For example, with:
>> >
>> >     iperf -c 192.168.1.181 -i 1 -w 128K -t 60
>> >
>> > the average throughput improves by about 30% (to 230 Mbps from 180 Mbps).
>> > Could you please let us know if this change can be included in the upstream kernel.
>> >
>> > [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
>>
>> Could you please provide some feedback about the above mentioned query?
>
> Do you have an explanation on the performance variation when
> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
> non-coherent DMA?).

The network stack uses SKB_DATA_ALIGN to align:

---
#define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
				 ~(SMP_CACHE_BYTES - 1))

#define SMP_CACHE_BYTES		L1_CACHE_BYTES
---

I think this is the reason for the performance regression.

> The Cavium guys haven't shown any numbers (IIUC) to back the
> L1_CACHE_BYTES performance improvement, but I would not revert the
> original commit since ARCH_DMA_MINALIGN definitely needs to cover the
> maximum available cache line size, which is 128 for them.

How about defining L1_CACHE_SHIFT like below:

---
#ifdef CONFIG_ARM64_L1_CACHE_SHIFT
#define L1_CACHE_SHIFT		CONFIG_ARM64_L1_CACHE_SHIFT
#else
#define L1_CACHE_SHIFT		7
#endif
---

Thanks

> --
> Catalin
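Ganesh's proposal presumably implies a matching Kconfig symbol. A hypothetical entry (CONFIG_ARM64_L1_CACHE_SHIFT is not an actual upstream option; the prompt text and range are guesses) might look like:

```
# Sketch of the Kconfig side of the proposal above; everything here is
# hypothetical, since no such option exists upstream.
config ARM64_L1_CACHE_SHIFT
	int "L1 cache shift (6 for 64-byte lines, 7 for 128-byte lines)"
	range 6 7
	default 7
	help
	  Build-time cache line size used for L1_CACHE_BYTES. Choosing a
	  value smaller than the hardware Cache Writeback Granule breaks
	  the single-image goal, which is why Catalin rejects this below.
```

Catalin's reply explains why this direction was not taken: a per-SoC Kconfig knob conflicts with the single kernel image goal.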
On Fri, Apr 07, 2017 at 10:06:31AM +0800, Ganesh Mahendran wrote:
> 2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>:
> > On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
> >> On 4/5/2017 10:13 AM, Imran Khan wrote:
> >> >> We may have to revisit this logic and consider L1_CACHE_BYTES the
> >> >> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
> >> >> Do you have any benchmarks on Cavium boards that would show significant
> >> >> degradation with 64-byte L1_CACHE_BYTES vs 128?
> >> >>
> >> >> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
> >> >> _maximum_ of the supported systems:
> >> >>
> >> >> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> >> >> index 5082b30bc2c0..4b5d7b27edaf 100644
> >> >> --- a/arch/arm64/include/asm/cache.h
> >> >> +++ b/arch/arm64/include/asm/cache.h
> >> >> @@ -18,17 +18,17 @@
> >> >>
> >> >>  #include <asm/cachetype.h>
> >> >>
> >> >> -#define L1_CACHE_SHIFT		7
> >> >> +#define L1_CACHE_SHIFT		6
> >> >>  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
> >> >>
> >> >>  /*
> >> >>   * Memory returned by kmalloc() may be used for DMA, so we must make
> >> >> - * sure that all such allocations are cache aligned. Otherwise,
> >> >> - * unrelated code may cause parts of the buffer to be read into the
> >> >> - * cache before the transfer is done, causing old data to be seen by
> >> >> - * the CPU.
> >> >> + * sure that all such allocations are aligned to the maximum *known*
> >> >> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
> >> >> + * parts of the buffer to be read into the cache before the transfer is
> >> >> + * done, causing old data to be seen by the CPU.
> >> >>   */
> >> >> -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
> >> >> +#define ARCH_DMA_MINALIGN	(128)
> >> >>
> >> >>  #ifndef __ASSEMBLY__
> >> >>
> >> >> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> >> >> index 392c67eb9fa6..30bafca1aebf 100644
> >> >> --- a/arch/arm64/kernel/cpufeature.c
> >> >> +++ b/arch/arm64/kernel/cpufeature.c
> >> >> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
> >> >>  	if (!cwg)
> >> >>  		pr_warn("No Cache Writeback Granule information, assuming cache line size %d\n",
> >> >>  			cls);
> >> >> -	if (L1_CACHE_BYTES < cls)
> >> >> -		pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
> >> >> -			L1_CACHE_BYTES, cls);
> >> >> +	if (ARCH_DMA_MINALIGN < cls)
> >> >> +		pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
> >> >> +			ARCH_DMA_MINALIGN, cls);
> >> >>  }
> >> >>
> >> >>  static bool __maybe_unused
> >> >
> >> > This change was discussed at [1] but was not concluded, as apparently no one
> >> > came back with a test report and numbers. After including this change in our
> >> > local kernel we are seeing significant throughput improvement. For example, with:
> >> >
> >> >     iperf -c 192.168.1.181 -i 1 -w 128K -t 60
> >> >
> >> > the average throughput improves by about 30% (to 230 Mbps from 180 Mbps).
> >> > Could you please let us know if this change can be included in the upstream kernel.
> >> >
> >> > [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
> >>
> >> Could you please provide some feedback about the above mentioned query?
> >
> > Do you have an explanation on the performance variation when
> > L1_CACHE_BYTES is changed? We'd need to understand how the network stack
> > is affected by L1_CACHE_BYTES, in which context it uses it (is it for
> > non-coherent DMA?).
>
> The network stack uses SKB_DATA_ALIGN to align:
>
> ---
> #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
> 				 ~(SMP_CACHE_BYTES - 1))
>
> #define SMP_CACHE_BYTES		L1_CACHE_BYTES
> ---
>
> I think this is the reason for the performance regression.

And is this alignment required for DMA coherency? (I hope not, since
SMP_CACHE_BYTES doesn't give this.)

Anyway, to be sure, it's worth changing SKB_DATA_ALIGN to use 64 bytes
(hard-coded) and check the results again.

> > The Cavium guys haven't shown any numbers (IIUC) to back the
> > L1_CACHE_BYTES performance improvement, but I would not revert the
> > original commit since ARCH_DMA_MINALIGN definitely needs to cover the
> > maximum available cache line size, which is 128 for them.
>
> How about defining L1_CACHE_SHIFT like below:
> ---
> #ifdef CONFIG_ARM64_L1_CACHE_SHIFT
> #define L1_CACHE_SHIFT		CONFIG_ARM64_L1_CACHE_SHIFT
> #else
> #define L1_CACHE_SHIFT		7
> #endif
> ---

I'd very much like to avoid this, still aiming for a single kernel
image.

My suggestion is to check whether L1_CACHE_BYTES is wrongly used for DMA
buffer alignment in the network code and, if not, move it back to 64
bytes while keeping ARCH_DMA_MINALIGN at 128 (as per my patch above). If
the performance on the Cavium system is affected by the L1_CACHE_BYTES
change, further patches would need to be backed by some numbers.
On 04/06/2017 11:58 AM, Catalin Marinas wrote:
> The Cavium guys haven't shown any numbers (IIUC) to back the
> L1_CACHE_BYTES performance improvement, but I would not revert the
> original commit since ARCH_DMA_MINALIGN definitely needs to cover the
> maximum available cache line size, which is 128 for them.

I am pinging them via other channels to ensure they've seen this.

Jon.
On 4/7/2017 7:36 AM, Ganesh Mahendran wrote:
> 2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>:
>> On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
>>> On 4/5/2017 10:13 AM, Imran Khan wrote:
>>>>> We may have to revisit this logic and consider L1_CACHE_BYTES the
>>>>> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
>>>>> Do you have any benchmarks on Cavium boards that would show significant
>>>>> degradation with 64-byte L1_CACHE_BYTES vs 128?
>>>>>
>>>>> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
>>>>> _maximum_ of the supported systems:
>>>>>
>>>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
>>>>> index 5082b30bc2c0..4b5d7b27edaf 100644
>>>>> --- a/arch/arm64/include/asm/cache.h
>>>>> +++ b/arch/arm64/include/asm/cache.h
>>>>> @@ -18,17 +18,17 @@
>>>>>
>>>>>  #include <asm/cachetype.h>
>>>>>
>>>>> -#define L1_CACHE_SHIFT		7
>>>>> +#define L1_CACHE_SHIFT		6
>>>>>  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
>>>>>
>>>>>  /*
>>>>>   * Memory returned by kmalloc() may be used for DMA, so we must make
>>>>> - * sure that all such allocations are cache aligned. Otherwise,
>>>>> - * unrelated code may cause parts of the buffer to be read into the
>>>>> - * cache before the transfer is done, causing old data to be seen by
>>>>> - * the CPU.
>>>>> + * sure that all such allocations are aligned to the maximum *known*
>>>>> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
>>>>> + * parts of the buffer to be read into the cache before the transfer is
>>>>> + * done, causing old data to be seen by the CPU.
>>>>>   */
>>>>> -#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
>>>>> +#define ARCH_DMA_MINALIGN	(128)
>>>>>
>>>>>  #ifndef __ASSEMBLY__
>>>>>
>>>>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
>>>>> index 392c67eb9fa6..30bafca1aebf 100644
>>>>> --- a/arch/arm64/kernel/cpufeature.c
>>>>> +++ b/arch/arm64/kernel/cpufeature.c
>>>>> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
>>>>>  	if (!cwg)
>>>>>  		pr_warn("No Cache Writeback Granule information, assuming cache line size %d\n",
>>>>>  			cls);
>>>>> -	if (L1_CACHE_BYTES < cls)
>>>>> -		pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
>>>>> -			L1_CACHE_BYTES, cls);
>>>>> +	if (ARCH_DMA_MINALIGN < cls)
>>>>> +		pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
>>>>> +			ARCH_DMA_MINALIGN, cls);
>>>>>  }
>>>>>
>>>>>  static bool __maybe_unused
>>>>
>>>> This change was discussed at [1] but was not concluded, as apparently no one
>>>> came back with a test report and numbers. After including this change in our
>>>> local kernel we are seeing significant throughput improvement. For example, with:
>>>>
>>>>     iperf -c 192.168.1.181 -i 1 -w 128K -t 60
>>>>
>>>> the average throughput improves by about 30% (to 230 Mbps from 180 Mbps).
>>>> Could you please let us know if this change can be included in the upstream kernel.
>>>>
>>>> [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
>>>
>>> Could you please provide some feedback about the above mentioned query?
>>
>> Do you have an explanation on the performance variation when
>> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
>> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
>> non-coherent DMA?).
>
> The network stack uses SKB_DATA_ALIGN to align:
>
> ---
> #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
> 				 ~(SMP_CACHE_BYTES - 1))
>
> #define SMP_CACHE_BYTES		L1_CACHE_BYTES
> ---
>
> I think this is the reason for the performance regression.

Yes, this is the reason for the performance regression. Due to the
increased L1 cache alignment, the object comes from the next kmalloc
slab and skb->truesize changes from 2304 bytes to 4352 bytes. This in
turn increases sk_wmem_alloc, which causes fewer send buffers to be
queued.

>> The Cavium guys haven't shown any numbers (IIUC) to back the
>> L1_CACHE_BYTES performance improvement, but I would not revert the
>> original commit since ARCH_DMA_MINALIGN definitely needs to cover the
>> maximum available cache line size, which is 128 for them.
>
> How about defining L1_CACHE_SHIFT like below:
> ---
> #ifdef CONFIG_ARM64_L1_CACHE_SHIFT
> #define L1_CACHE_SHIFT		CONFIG_ARM64_L1_CACHE_SHIFT
> #else
> #define L1_CACHE_SHIFT		7
> #endif
> ---
>
> Thanks
>
>> --
>> Catalin
On 4/11/17, 10:13 PM, "linux-arm-kernel on behalf of Imran Khan" <linux-arm-kernel-bounces@lists.infradead.org on behalf of kimran@codeaurora.org> wrote: On 4/7/2017 7:36 AM, Ganesh Mahendran wrote: > 2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>: >> On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote: >>> On 4/5/2017 10:13 AM, Imran Khan wrote: >>>>> We may have to revisit this logic and consider L1_CACHE_BYTES the >>>>> _minimum_ of cache line sizes in arm64 systems supported by the kernel. >>>>> Do you have any benchmarks on Cavium boards that would show significant >>>>> degradation with 64-byte L1_CACHE_BYTES vs 128? >>>>> >>>>> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the >>>>> _maximum_ of the supported systems: >>>>> >>>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h >>>>> index 5082b30bc2c0..4b5d7b27edaf 100644 >>>>> --- a/arch/arm64/include/asm/cache.h >>>>> +++ b/arch/arm64/include/asm/cache.h >>>>> @@ -18,17 +18,17 @@ >>>>> >>>>> #include <asm/cachetype.h> >>>>> >>>>> -#define L1_CACHE_SHIFT 7 >>>>> +#define L1_CACHE_SHIFT 6 >>>>> #define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT) >>>>> >>>>> /* >>>>> * Memory returned by kmalloc() may be used for DMA, so we must make >>>>> - * sure that all such allocations are cache aligned. Otherwise, >>>>> - * unrelated code may cause parts of the buffer to be read into the >>>>> - * cache before the transfer is done, causing old data to be seen by >>>>> - * the CPU. >>>>> + * sure that all such allocations are aligned to the maximum *known* >>>>> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause >>>>> + * parts of the buffer to be read into the cache before the transfer is >>>>> + * done, causing old data to be seen by the CPU. 
>>>>> */ >>>>> -#define ARCH_DMA_MINALIGN L1_CACHE_BYTES >>>>> +#define ARCH_DMA_MINALIGN (128) >>>>> >>>>> #ifndef __ASSEMBLY__ >>>>> >>>>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c >>>>> index 392c67eb9fa6..30bafca1aebf 100644 >>>>> --- a/arch/arm64/kernel/cpufeature.c >>>>> +++ b/arch/arm64/kernel/cpufeature.c >>>>> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void) >>>>> if (!cwg) >>>>> pr_warn("No Cache Writeback Granule information, assuming >>>>> cache line size %d\n", >>>>> cls); >>>>> - if (L1_CACHE_BYTES < cls) >>>>> - pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n", >>>>> - L1_CACHE_BYTES, cls); >>>>> + if (ARCH_DMA_MINALIGN < cls) >>>>> + pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n", >>>>> + ARCH_DMA_MINALIGN, cls); >>>>> } >>>>> >>>>> static bool __maybe_unused >>>> >>>> This change was discussed at: [1] but was not concluded as apparently no one >>>> came back with test report and numbers. After including this change in our >>>> local kernel we are seeing significant throughput improvement. For example with: >>>> >>>> iperf -c 192.168.1.181 -i 1 -w 128K -t 60 >>>> >>>> The average throughput is improving by about 30% (230Mbps from 180Mbps). >>>> Could you please let us know if this change can be included in upstream kernel. >>>> >>>> [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs >>> >>> Could you please provide some feedback about the above mentioned query ? >> >> Do you have an explanation on the performance variation when >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for >> non-coherent DMA?). > > network stack use SKB_DATA_ALIGN to align. 
> --- > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \ > ~(SMP_CACHE_BYTES - 1)) > > #define SMP_CACHE_BYTES L1_CACHE_BYTES > --- > I think this is the reason of performance regression. > Yes, this is the reason for the performance regression. Due to the increased L1 cache alignment the object is coming from the next kmalloc slab and skb->truesize is changing from 2304 bytes to 4352 bytes. This in turn increases sk_wmem_alloc, which causes queuing of fewer send buffers. We tried different benchmarks and found none that is really affected by the cache line change. If there is no correctness issue, I think we are fine with reverting the patch. Though I still think it is beneficial to do some more investigation of the perf loss; who knows, 32-byte alignment or no alignment might give even more perf benefit. Thanks, Tirumalesh. >> >> The Cavium guys haven't shown any numbers (IIUC) to back the >> L1_CACHE_BYTES performance improvement but I would not revert the >> original commit since ARCH_DMA_MINALIGN definitely needs to cover the >> maximum available cache line size, which is 128 for them. > how about defining L1_CACHE_SHIFT like below: > --- > #ifdef CONFIG_ARM64_L1_CACHE_SHIFT > #define L1_CACHE_SHIFT CONFIG_ARM64_L1_CACHE_SHIFT > #else > #define L1_CACHE_SHIFT 7 > #endif > --- > > Thanks > >> >> -- >> Catalin -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation _______________________________________________ linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
On 4/12/2017 7:30 PM, Chalamarla, Tirumalesh wrote: > > > On 4/11/17, 10:13 PM, "linux-arm-kernel on behalf of Imran Khan" <linux-arm-kernel-bounces@lists.infradead.org on behalf of kimran@codeaurora.org> wrote: > > On 4/7/2017 7:36 AM, Ganesh Mahendran wrote: > > 2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>: > >> On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote: > >>> On 4/5/2017 10:13 AM, Imran Khan wrote: > >>>>> We may have to revisit this logic and consider L1_CACHE_BYTES the > >>>>> _minimum_ of cache line sizes in arm64 systems supported by the kernel. > >>>>> Do you have any benchmarks on Cavium boards that would show significant > >>>>> degradation with 64-byte L1_CACHE_BYTES vs 128? > >>>>> > >>>>> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the > >>>>> _maximum_ of the supported systems: > >>>>> > >>>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h > >>>>> index 5082b30bc2c0..4b5d7b27edaf 100644 > >>>>> --- a/arch/arm64/include/asm/cache.h > >>>>> +++ b/arch/arm64/include/asm/cache.h > >>>>> @@ -18,17 +18,17 @@ > >>>>> > >>>>> #include <asm/cachetype.h> > >>>>> > >>>>> -#define L1_CACHE_SHIFT 7 > >>>>> +#define L1_CACHE_SHIFT 6 > >>>>> #define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT) > >>>>> > >>>>> /* > >>>>> * Memory returned by kmalloc() may be used for DMA, so we must make > >>>>> - * sure that all such allocations are cache aligned. Otherwise, > >>>>> - * unrelated code may cause parts of the buffer to be read into the > >>>>> - * cache before the transfer is done, causing old data to be seen by > >>>>> - * the CPU. > >>>>> + * sure that all such allocations are aligned to the maximum *known* > >>>>> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause > >>>>> + * parts of the buffer to be read into the cache before the transfer is > >>>>> + * done, causing old data to be seen by the CPU. 
> >>>>> */ > >>>>> -#define ARCH_DMA_MINALIGN L1_CACHE_BYTES > >>>>> +#define ARCH_DMA_MINALIGN (128) > >>>>> > >>>>> #ifndef __ASSEMBLY__ > >>>>> > >>>>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c > >>>>> index 392c67eb9fa6..30bafca1aebf 100644 > >>>>> --- a/arch/arm64/kernel/cpufeature.c > >>>>> +++ b/arch/arm64/kernel/cpufeature.c > >>>>> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void) > >>>>> if (!cwg) > >>>>> pr_warn("No Cache Writeback Granule information, assuming > >>>>> cache line size %d\n", > >>>>> cls); > >>>>> - if (L1_CACHE_BYTES < cls) > >>>>> - pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n", > >>>>> - L1_CACHE_BYTES, cls); > >>>>> + if (ARCH_DMA_MINALIGN < cls) > >>>>> + pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n", > >>>>> + ARCH_DMA_MINALIGN, cls); > >>>>> } > >>>>> > >>>>> static bool __maybe_unused > >>>> > >>>> This change was discussed at: [1] but was not concluded as apparently no one > >>>> came back with test report and numbers. After including this change in our > >>>> local kernel we are seeing significant throughput improvement. For example with: > >>>> > >>>> iperf -c 192.168.1.181 -i 1 -w 128K -t 60 > >>>> > >>>> The average throughput is improving by about 30% (230Mbps from 180Mbps). > >>>> Could you please let us know if this change can be included in upstream kernel. > >>>> > >>>> [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs > >>> > >>> Could you please provide some feedback about the above mentioned query ? > >> > >> Do you have an explanation on the performance variation when > >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack > >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for > >> non-coherent DMA?). > > > > network stack use SKB_DATA_ALIGN to align. 
> > --- > > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \ > > ~(SMP_CACHE_BYTES - 1)) > > > > #define SMP_CACHE_BYTES L1_CACHE_BYTES > > --- > > I think this is the reason of performance regression. > > > > Yes this is the reason for performance regression. Due to increases L1 cache alignment the > object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to > 4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers. > > We tried different benchmarks and found none which really affects with Cache line change. If there is no correctness issue, > I think we are fine with reverting the patch. > So, can we revert the patch that makes L1_CACHE_SHIFT 7, or should the patch suggested by Catalin be mainlined? We have verified the throughput degradation on 3.18 and 4.4 but I am afraid that this issue will be seen on other kernels too. > Though I still think it is beneficiary to do some more investigation for the perf loss, who knows 32 bit align or no align might > Give even more perf benefit. > Which perf loss are you referring to here? Did you mean throughput loss or some other perf benchmark? Thanks, Imran > > Thanks, > Tirumalesh. > >> > >> The Cavium guys haven't shown any numbers (IIUC) to back the > >> L1_CACHE_BYTES performance improvement but I would not revert the > >> original commit since ARCH_DMA_MINALIGN definitely needs to cover the > >> maximum available cache line size, which is 128 for them. > > > > how about define L1_CACHE_SHIFT like below: > > --- > > #ifdef CONFIG_ARM64_L1_CACHE_SHIFT > > #define L1_CACHE_SHIFT CONFIG_ARM64_L1_CACHE_SHIFT > > #else > > #define L1_CACHE_SHIFT 7 > > endif > > --- > > > > Thanks > > > >> > >> -- > >> Catalin > > > -- > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc.
is a member of the Code Aurora Forum, hosted by The Linux Foundation > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel > >
>> >> Do you have an explanation on the performance variation when >> >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack >> >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for >> >> non-coherent DMA?). >> > >> > network stack use SKB_DATA_ALIGN to align. >> > --- >> > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \ >> > ~(SMP_CACHE_BYTES - 1)) >> > >> > #define SMP_CACHE_BYTES L1_CACHE_BYTES >> > --- >> > I think this is the reason of performance regression. >> > >> >> Yes this is the reason for performance regression. Due to increases L1 cache alignment the >> object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to >> 4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers. With what traffic did you check 'skb->truesize'? The increase from 2304 to 4352 bytes doesn't seem to be real. I checked with ICMP pkts of the maximum size possible with a 1500-byte MTU and I don't see such a bump. If the bump is observed with iperf sending TCP packets then I suggest checking whether TSO is playing a part over here. And for 'sk_wmem_alloc', I have done iperf benchmarking on a 40G interface and I hit line rate irrespective of the cache line size being 64 or 128 bytes. I guess the transmit completion latency of your HW or driver is very high, and that seems to be the real issue for the low performance rather than the cache line size; basically you are not able to free up skbs/buffers fast enough for new ones to get queued up. Doesn't skb_orphan() solve your issue? FYI, https://patchwork.ozlabs.org/patch/455134/ http://lxr.free-electrons.com/source/drivers/net/ethernet/chelsio/cxgb3/sge.c#L1288 >> >> We tried different benchmarks and found none which really affects with Cache line change. If there is no correctness issue, >> I think we are fine with reverting the patch.
>> > So, can we revert the patch that makes L1_CACHE_SHIFT 7 or should the patch suggested by Catalin should be mainlined. This doesn't seem right; as someone said earlier, what if there is another arm64 platform with a 32-byte cacheline size that wants to reduce this further? Either this should be made platform-dependent or left as-is, i.e. the maximum of all supported cache line sizes. Thanks, Sunil.
On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote: > >> >> Do you have an explanation on the performance variation when > >> >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack > >> >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for > >> >> non-coherent DMA?). > >> > > >> > network stack use SKB_DATA_ALIGN to align. > >> > --- > >> > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \ > >> > ~(SMP_CACHE_BYTES - 1)) > >> > > >> > #define SMP_CACHE_BYTES L1_CACHE_BYTES > >> > --- > >> > I think this is the reason of performance regression. > >> > > >> > >> Yes this is the reason for performance regression. Due to increases L1 cache alignment the > >> object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to > >> 4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers. > > With what traffic did you check 'skb->truesize' ? > Increase from 2304 to 4352 bytes doesn't seem to be real. I checked > with ICMP pkts with maximum > size possible with 1500byte MTU and I don't see such a bump. If the > bump is observed with Iperf > sending TCP packets then I suggest to check if TSO is playing a part over here. I haven't checked truesize but I added some printks to __alloc_skb() (on a Juno platform) and the size argument to this function is 1720 on many occasions. With sizeof(struct skb_shared_info) of 320, the actual data allocation is exactly 2048 when using 64 byte L1_CACHE_SIZE. With a 128 byte cache size, it goes slightly over 2K, hence the 4K slab allocation. The 1720 figure surprised me a bit as well since I was expecting something close to 1500. The thing that worries me is that skb->data may be used as a buffer to DMA into. If that's the case, skb_shared_info is wrongly aligned based on SMP_CACHE_BYTES only and can lead to corruption on a non-DMA-coherent platform. It should really be ARCH_DMA_MINALIGN. 
IIUC, the Cavium platform has coherent DMA, so it shouldn't be an issue if we go back to 64 byte cache lines. However, we don't really have an easy way to check (maybe taint the kernel if CWG is different from ARCH_DMA_MINALIGN *and* the non-coherent DMA API is called).
On Tue, Apr 18, 2017 at 8:18 PM, Catalin Marinas <catalin.marinas@arm.com> wrote: > On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote: >> >> >> Do you have an explanation on the performance variation when >> >> >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack >> >> >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for >> >> >> non-coherent DMA?). >> >> > >> >> > network stack use SKB_DATA_ALIGN to align. >> >> > --- >> >> > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \ >> >> > ~(SMP_CACHE_BYTES - 1)) >> >> > >> >> > #define SMP_CACHE_BYTES L1_CACHE_BYTES >> >> > --- >> >> > I think this is the reason of performance regression. >> >> > >> >> >> >> Yes this is the reason for performance regression. Due to increases L1 cache alignment the >> >> object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to >> >> 4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers. >> >> With what traffic did you check 'skb->truesize' ? >> Increase from 2304 to 4352 bytes doesn't seem to be real. I checked >> with ICMP pkts with maximum >> size possible with 1500byte MTU and I don't see such a bump. If the >> bump is observed with Iperf >> sending TCP packets then I suggest to check if TSO is playing a part over here. > > I haven't checked truesize but I added some printks to __alloc_skb() (on > a Juno platform) and the size argument to this function is 1720 on many > occasions. With sizeof(struct skb_shared_info) of 320, the actual data > allocation is exactly 2048 when using 64 byte L1_CACHE_SIZE. With a > 128 byte cache size, it goes slightly over 2K, hence the 4K slab > allocation. Understood but still in my opinion this '4K slab allocation' cannot be considered as an issue with cache line size, there are many network drivers out there which do receive buffer or page recycling to minimize (sometimes almost to zero) the cost of buffer allocation. 
>The 1720 figure surprised me a bit as well since I was > expecting something close to 1500. > > The thing that worries me is that skb->data may be used as a buffer to > DMA into. If that's the case, skb_shared_info is wrongly aligned based > on SMP_CACHE_BYTES only and can lead to corruption on a non-DMA-coherent > platform. It should really be ARCH_DMA_MINALIGN. I didn't get this; if you see __alloc_skb() 229 size = SKB_DATA_ALIGN(size); 230 size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); both the DMA buffer and skb_shared_info are aligned to a cacheline separately, and considering that 128-byte alignment guarantees 64-byte alignment as well, how will this lead to corruption? And if the platform is non-DMA-coherent then again it's the driver which should take care of coherency by using the appropriate map/unmap APIs and should avoid any cacheline sharing between the DMA buffer and skb_shared_info. > > IIUC, the Cavium platform has coherent DMA, so it shouldn't be an issue > if we go back to 64 byte cache lines. Yes, the Cavium platform is DMA coherent and there is no issue with reverting back to 64-byte cachelines. But do we want to do this just because some platform has a performance issue and this is an easy way to solve it? IMHO there seem to be many ways to solve the performance degradation within the driver itself, and if those don't work then it probably makes sense to revert this. >However, we don't really have an > easy way to check (maybe taint the kernel if CWG is different from > ARCH_DMA_MINALIGN *and* the non-coherent DMA API is called). > > -- > Catalin
On 4/17/17, 12:35 AM, "Imran Khan" <kimran@codeaurora.org> wrote: On 4/12/2017 7:30 PM, Chalamarla, Tirumalesh wrote: > > > On 4/11/17, 10:13 PM, "linux-arm-kernel on behalf of Imran Khan" <linux-arm-kernel-bounces@lists.infradead.org on behalf of kimran@codeaurora.org> wrote: > > On 4/7/2017 7:36 AM, Ganesh Mahendran wrote: > > 2017-04-06 23:58 GMT+08:00 Catalin Marinas <catalin.marinas@arm.com>: > >> On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote: > >>> On 4/5/2017 10:13 AM, Imran Khan wrote: > >>>>> We may have to revisit this logic and consider L1_CACHE_BYTES the > >>>>> _minimum_ of cache line sizes in arm64 systems supported by the kernel. > >>>>> Do you have any benchmarks on Cavium boards that would show significant > >>>>> degradation with 64-byte L1_CACHE_BYTES vs 128? > >>>>> > >>>>> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the > >>>>> _maximum_ of the supported systems: > >>>>> > >>>>> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h > >>>>> index 5082b30bc2c0..4b5d7b27edaf 100644 > >>>>> --- a/arch/arm64/include/asm/cache.h > >>>>> +++ b/arch/arm64/include/asm/cache.h > >>>>> @@ -18,17 +18,17 @@ > >>>>> > >>>>> #include <asm/cachetype.h> > >>>>> > >>>>> -#define L1_CACHE_SHIFT 7 > >>>>> +#define L1_CACHE_SHIFT 6 > >>>>> #define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT) > >>>>> > >>>>> /* > >>>>> * Memory returned by kmalloc() may be used for DMA, so we must make > >>>>> - * sure that all such allocations are cache aligned. Otherwise, > >>>>> - * unrelated code may cause parts of the buffer to be read into the > >>>>> - * cache before the transfer is done, causing old data to be seen by > >>>>> - * the CPU. > >>>>> + * sure that all such allocations are aligned to the maximum *known* > >>>>> + * cache line size on ARMv8 systems. 
Otherwise, unrelated code may cause > >>>>> + * parts of the buffer to be read into the cache before the transfer is > >>>>> + * done, causing old data to be seen by the CPU. > >>>>> */ > >>>>> -#define ARCH_DMA_MINALIGN L1_CACHE_BYTES > >>>>> +#define ARCH_DMA_MINALIGN (128) > >>>>> > >>>>> #ifndef __ASSEMBLY__ > >>>>> > >>>>> diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c > >>>>> index 392c67eb9fa6..30bafca1aebf 100644 > >>>>> --- a/arch/arm64/kernel/cpufeature.c > >>>>> +++ b/arch/arm64/kernel/cpufeature.c > >>>>> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void) > >>>>> if (!cwg) > >>>>> pr_warn("No Cache Writeback Granule information, assuming > >>>>> cache line size %d\n", > >>>>> cls); > >>>>> - if (L1_CACHE_BYTES < cls) > >>>>> - pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n", > >>>>> - L1_CACHE_BYTES, cls); > >>>>> + if (ARCH_DMA_MINALIGN < cls) > >>>>> + pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n", > >>>>> + ARCH_DMA_MINALIGN, cls); > >>>>> } > >>>>> > >>>>> static bool __maybe_unused > >>>> > >>>> This change was discussed at: [1] but was not concluded as apparently no one > >>>> came back with test report and numbers. After including this change in our > >>>> local kernel we are seeing significant throughput improvement. For example with: > >>>> > >>>> iperf -c 192.168.1.181 -i 1 -w 128K -t 60 > >>>> > >>>> The average throughput is improving by about 30% (230Mbps from 180Mbps). > >>>> Could you please let us know if this change can be included in upstream kernel. > >>>> > >>>> [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs > >>> > >>> Could you please provide some feedback about the above mentioned query ? > >> > >> Do you have an explanation on the performance variation when > >> L1_CACHE_BYTES is changed? 
We'd need to understand how the network stack > >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for > >> non-coherent DMA?). > > > > network stack use SKB_DATA_ALIGN to align. > > --- > > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \ > > ~(SMP_CACHE_BYTES - 1)) > > > > #define SMP_CACHE_BYTES L1_CACHE_BYTES > > --- > > I think this is the reason of performance regression. > > > > Yes this is the reason for performance regression. Due to increases L1 cache alignment the > object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to > 4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers. > > We tried different benchmarks and found none which really affects with Cache line change. If there is no correctness issue, > I think we are fine with reverting the patch. > So, can we revert the patch that makes L1_CACHE_SHIFT 7 or should the patch suggested by Catalin should be mainlined. We have verified the throughput degradation on 3.18 and 4.4 but I am afraid that this issue will be seen on other kernels too. > Though I still think it is beneficiary to do some more investigation for the perf loss, who knows 32 bit align or no align might > Give even more perf benefit. > Which perf loss you are referring to here. Did you mean throughput loss here or some other perf benchmarking ? The iperf issue mentioned here looks incomplete to me. Thanks, Imran > > Thanks, > Tirumalesh. > >> > >> The Cavium guys haven't shown any numbers (IIUC) to back the > >> L1_CACHE_BYTES performance improvement but I would not revert the > >> original commit since ARCH_DMA_MINALIGN definitely needs to cover the > >> maximum available cache line size, which is 128 for them.
> > > > how about define L1_CACHE_SHIFT like below: > > --- > > #ifdef CONFIG_ARM64_L1_CACHE_SHIFT > > #define L1_CACHE_SHIFT CONFIG_ARM64_L1_CACHE_SHIFT > > #else > > #define L1_CACHE_SHIFT 7 > > endif > > --- > > > > Thanks > > > >> > >> -- > >> Catalin > > > -- > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation > > _______________________________________________ > linux-arm-kernel mailing list > linux-arm-kernel@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel > > -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
On Tue, Apr 18, 2017 at 10:35:02PM +0530, Sunil Kovvuri wrote: > On Tue, Apr 18, 2017 at 8:18 PM, Catalin Marinas > <catalin.marinas@arm.com> wrote: > > On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote: > >> >> >> Do you have an explanation on the performance variation when > >> >> >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack > >> >> >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for > >> >> >> non-coherent DMA?). > >> >> > > >> >> > network stack use SKB_DATA_ALIGN to align. > >> >> > --- > >> >> > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \ > >> >> > ~(SMP_CACHE_BYTES - 1)) > >> >> > > >> >> > #define SMP_CACHE_BYTES L1_CACHE_BYTES > >> >> > --- > >> >> > I think this is the reason of performance regression. > >> >> > > >> >> > >> >> Yes this is the reason for performance regression. Due to increases L1 cache alignment the > >> >> object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to > >> >> 4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers. > >> > >> With what traffic did you check 'skb->truesize' ? Increase from > >> 2304 to 4352 bytes doesn't seem to be real. I checked with ICMP > >> pkts with maximum size possible with 1500byte MTU and I don't see > >> such a bump. If the bump is observed with Iperf sending TCP packets > >> then I suggest to check if TSO is playing a part over here. > > > > I haven't checked truesize but I added some printks to __alloc_skb() (on > > a Juno platform) and the size argument to this function is 1720 on many > > occasions. With sizeof(struct skb_shared_info) of 320, the actual data > > allocation is exactly 2048 when using 64 byte L1_CACHE_SIZE. With a > > 128 byte cache size, it goes slightly over 2K, hence the 4K slab > > allocation. 
> > Understood but still in my opinion this '4K slab allocation' cannot be > considered as an issue with cache line size, there are many network > drivers out there which do receive buffer or page recycling to > minimize (sometimes almost to zero) the cost of buffer allocation. The slab allocation shouldn't make much difference (unless you are running on a memory constrained system) but I don't understand how skb->truesize (which is almost half unused) affects the sk_wmem_alloc and its interaction with other bits in the network stack (e.g. tcp_limit_output_bytes). However, I do think it's worth investigating further to fully understand the issue. > >The 1720 figure surprised me a bit as well since I was > > expecting something close to 1500. > > > > The thing that worries me is that skb->data may be used as a buffer to > > DMA into. If that's the case, skb_shared_info is wrongly aligned based > > on SMP_CACHE_BYTES only and can lead to corruption on a non-DMA-coherent > > platform. It should really be ARCH_DMA_MINALIGN. > > I didn't get this, if you see __alloc_skb() > > 229 size = SKB_DATA_ALIGN(size); > 230 size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > > both DMA buffer and skb_shared_info are aligned to a cacheline separately, > considering 128byte alignment guarantees 64byte alignment as well, how > will this lead to corruption ? It's the other way around: you align only to 64 byte while running on a platform with 128 byte cache lines and non-coherent DMA. > And if platform is non-DMA-coherent then again it's the driver which > should take of coherency by using appropriate map/unmap APIs and > should avoid any cacheline sharing btw DMA buffer and skb_shared_info. The problem is that the streaming DMA API can only work correctly on cacheline-aligned buffers (because of the cache invalidation it performs for DMA ops; even with clean&invalidate, the operation isn't always safe if a cacheline is shared between DMA and CPU buffers). 
In the skb case, we could have the data potentially sharing the last addresses of a DMA buffer with struct skb_shared_info. We may be able to get away with SKB_DATA_ALIGN not using ARCH_DMA_MINALIGN *if* skb_shared_info is *not* written before or during an inbound DMA transfer (though such tricks are arch specific). > > IIUC, the Cavium platform has coherent DMA, so it shouldn't be an issue > > if we go back to 64 byte cache lines. > > Yes, Cavium platform is DMA coherent and there is no issue with reverting back > to 64byte cachelines. But do we want to do this because some platform has a > performance issue and this is an easy way to solve it. IMHO there seems > to be many ways to solve performance degradation within the driver itself, and > if those doesn't work then probably it makes sense to revert this. My initial thought was to revert the change because it was causing a significant performance regression on certain SoC. But given that it took over a year for people to follow up, it doesn't seem too urgent, so we should rather try to understand the issue and potential side effects of moving back to a 64 byte cache line.
On Wed, Apr 19, 2017 at 5:31 PM, Catalin Marinas <catalin.marinas@arm.com> wrote: > On Tue, Apr 18, 2017 at 10:35:02PM +0530, Sunil Kovvuri wrote: >> On Tue, Apr 18, 2017 at 8:18 PM, Catalin Marinas >> <catalin.marinas@arm.com> wrote: >> > On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote: >> >> >> >> Do you have an explanation on the performance variation when >> >> >> >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack >> >> >> >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for >> >> >> >> non-coherent DMA?). >> >> >> > >> >> >> > network stack use SKB_DATA_ALIGN to align. >> >> >> > --- >> >> >> > #define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \ >> >> >> > ~(SMP_CACHE_BYTES - 1)) >> >> >> > >> >> >> > #define SMP_CACHE_BYTES L1_CACHE_BYTES >> >> >> > --- >> >> >> > I think this is the reason of performance regression. >> >> >> > >> >> >> >> >> >> Yes this is the reason for performance regression. Due to increases L1 cache alignment the >> >> >> object is coming from next kmalloc slab and skb->truesize is changing from 2304 bytes to >> >> >> 4352 bytes. This in turn increases sk_wmem_alloc which causes queuing of less send buffers. >> >> >> >> With what traffic did you check 'skb->truesize' ? Increase from >> >> 2304 to 4352 bytes doesn't seem to be real. I checked with ICMP >> >> pkts with maximum size possible with 1500byte MTU and I don't see >> >> such a bump. If the bump is observed with Iperf sending TCP packets >> >> then I suggest to check if TSO is playing a part over here. >> > >> > I haven't checked truesize but I added some printks to __alloc_skb() (on >> > a Juno platform) and the size argument to this function is 1720 on many >> > occasions. With sizeof(struct skb_shared_info) of 320, the actual data >> > allocation is exactly 2048 when using 64 byte L1_CACHE_SIZE. With a >> > 128 byte cache size, it goes slightly over 2K, hence the 4K slab >> > allocation. 
>> >> Understood but still in my opinion this '4K slab allocation' cannot be >> considered as an issue with cache line size, there are many network >> drivers out there which do receive buffer or page recycling to >> minimize (sometimes almost to zero) the cost of buffer allocation. > > The slab allocation shouldn't make much difference (unless you are > running on a memory constrained system) but I don't understand how > skb->truesize (which is almost half unused) affects the sk_wmem_alloc > and its interaction with other bits in the network stack (e.g. > tcp_limit_output_bytes). > > However, I do think it's worth investigating further to fully understand > the issue. Absolutely. > >> >The 1720 figure surprised me a bit as well since I was >> > expecting something close to 1500. >> > >> > The thing that worries me is that skb->data may be used as a buffer to >> > DMA into. If that's the case, skb_shared_info is wrongly aligned based >> > on SMP_CACHE_BYTES only and can lead to corruption on a non-DMA-coherent >> > platform. It should really be ARCH_DMA_MINALIGN. >> >> I didn't get this, if you see __alloc_skb() >> >> 229 size = SKB_DATA_ALIGN(size); >> 230 size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); >> >> both DMA buffer and skb_shared_info are aligned to a cacheline separately, >> considering 128byte alignment guarantees 64byte alignment as well, how >> will this lead to corruption ? > > It's the other way around: you align only to 64 byte while running on a > platform with 128 byte cache lines and non-coherent DMA. Okay, I mistook your statement. This is indeed a valid statement. >> And if platform is non-DMA-coherent then again it's the driver which >> should take of coherency by using appropriate map/unmap APIs and >> should avoid any cacheline sharing btw DMA buffer and skb_shared_info. 
> > The problem is that the streaming DMA API can only work correctly on > cacheline-aligned buffers (because of the cache invalidation it performs > for DMA ops; even with clean&invalidate, the operation isn't always safe > if a cacheline is shared between DMA and CPU buffers). In the skb case, > we could have the data potentially sharing the last addresses of a DMA > buffer with struct skb_shared_info. > > We may be able to get away with SKB_DATA_ALIGN not using > ARCH_DMA_MINALIGN *if* skb_shared_info is *not* written before or during > an inbound DMA transfer (though such tricks are arch specific). > >> > IIUC, the Cavium platform has coherent DMA, so it shouldn't be an issue >> > if we go back to 64 byte cache lines. >> >> Yes, Cavium platform is DMA coherent and there is no issue with reverting back >> to 64byte cachelines. But do we want to do this because some platform has a >> performance issue and this is an easy way to solve it. IMHO there seems >> to be many ways to solve performance degradation within the driver itself, and >> if those doesn't work then probably it makes sense to revert this. > > My initial thought was to revert the change because it was causing a > significant performance regression on certain SoC. But given that it > took over a year for people to follow up, it doesn't seem too urgent, so > we should rather try to understand the issue and potential side effects > of moving back to a 64 byte cache line. Yes. Thanks, Sunil. > > -- > Catalin
On 2017/4/19 21:11, Sunil Kovvuri wrote:
> On Wed, Apr 19, 2017 at 5:31 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
>> On Tue, Apr 18, 2017 at 10:35:02PM +0530, Sunil Kovvuri wrote:
>>> On Tue, Apr 18, 2017 at 8:18 PM, Catalin Marinas
>>> <catalin.marinas@arm.com> wrote:
>>>> On Mon, Apr 17, 2017 at 04:08:52PM +0530, Sunil Kovvuri wrote:
>>>>>>> >> Do you have an explanation on the performance variation when
>>>>>>> >> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
>>>>>>> >> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
>>>>>>> >> non-coherent DMA?).
>>>>>>> >
>>>>>>> > network stack use SKB_DATA_ALIGN to align.
>>>>>>> > ---
>>>>>>> > #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
>>>>>>> > 				 ~(SMP_CACHE_BYTES - 1))
>>>>>>> >
>>>>>>> > #define SMP_CACHE_BYTES	L1_CACHE_BYTES
>>>>>>> > ---
>>>>>>> > I think this is the reason of performance regression.
>>>>>>>
>>>>>>> Yes this is the reason for the performance regression. Due to the increased
>>>>>>> L1 cache alignment the object is coming from the next kmalloc slab and
>>>>>>> skb->truesize is changing from 2304 bytes to 4352 bytes. This in turn
>>>>>>> increases sk_wmem_alloc which causes queuing of fewer send buffers.
>>>>>
>>>>> With what traffic did you check 'skb->truesize'? The increase from
>>>>> 2304 to 4352 bytes doesn't seem to be real. I checked with ICMP
>>>>> pkts of the maximum size possible with a 1500 byte MTU and I don't see
>>>>> such a bump. If the bump is observed with iperf sending TCP packets
>>>>> then I suggest checking whether TSO is playing a part over here.
>>>>
>>>> I haven't checked truesize but I added some printks to __alloc_skb() (on
>>>> a Juno platform) and the size argument to this function is 1720 on many
>>>> occasions. With sizeof(struct skb_shared_info) of 320, the actual data
>>>> allocation is exactly 2048 when using a 64 byte L1_CACHE_SIZE.
>>>> With a 128 byte cache size, it goes slightly over 2K, hence the 4K slab
>>>> allocation.

[...]
Hi,

To continue this discussion: I have tested this patch on a Hisilicon D05
board (kernel version 4.1), and the lmbench numbers look much better with
the 128B cache line:

Lmbench (bandwidth, MB/s):

        128B/1core 128B/4core 128B/16core 128B/32core 64B/1core 64B/4core 64B/16core 64B/32core
rd         9393.81    9118.1    23693.96    42026.32    8433.26   8635.96   11563.52   20142.34
frd        7748.18    7003.02   23336.95    42271.16    7269.42   7091.55   11211.67   19188.03
wr         7880.8     7403.61   12471.28    22545.83    7453.13   7066.01   11140.08   19625.04
fwr        8783.97    8911.85   25318.05    46540       8953.55   8894.28   19094.92   30646.38
bzero      8848.33    8873.43   25416.46    46620.68    8998.88   8877.64   19092.39   32537.9
rdwr       5854.37    5759.17   12346.27    22928.25    5601.26   5517.21    6608.84   11717.91
cp         5251.07    5118.19    8324.99    15616.85    4997.66   4964.38    5833.04    9600.68
fcp        6883.94    5446.19   11588.7     21534.58    6766.76   6625.98    8757.97   15844.67
bcopy      6561.41    6402.99    9070.05    17076.24    6839.07   6959.03    8900.73   15868.25

You can see that with 32 cores the 128B L1 cache line is much better than
64B, roughly doubling the bandwidth (e.g. rd: 42026 vs 20142 MB/s, a ~110%
gain).

I think the performance impact will differ from chip to chip, so I suggest
not reverting this patch before it has been tested more widely.

Some details about the Hisilicon D05 board chip:
L1/L2 cache line: 64B
L3 cache line: 128B

Thanks,
Ding
diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index 5082b30bc2c0..4b5d7b27edaf 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -18,17 +18,17 @@
 
 #include <asm/cachetype.h>
 
-#define L1_CACHE_SHIFT		7
+#define L1_CACHE_SHIFT		6
 #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
 
 /*
  * Memory returned by kmalloc() may be used for DMA, so we must make
- * sure that all such allocations are cache aligned. Otherwise,
- * unrelated code may cause parts of the buffer to be read into the
- * cache before the transfer is done, causing old data to be seen by
- * the CPU.
+ * sure that all such allocations are aligned to the maximum *known*
+ * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
+ * parts of the buffer to be read into the cache before the transfer is
+ * done, causing old data to be seen by the CPU.
  */
-#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES
+#define ARCH_DMA_MINALIGN	(128)
 
 #ifndef __ASSEMBLY__
diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 392c67eb9fa6..30bafca1aebf 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
 	if (!cwg)
 		pr_warn("No Cache Writeback Granule information, assuming cache line size %d\n",
 			cls);
-	if (L1_CACHE_BYTES < cls)
-		pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
-			L1_CACHE_BYTES, cls);
+	if (ARCH_DMA_MINALIGN < cls)
+		pr_warn("ARCH_DMA_MINALIGN smaller than the Cache Writeback Granule (%d < %d)\n",
+			ARCH_DMA_MINALIGN, cls);
 }
 
 static bool __maybe_unused