Message ID | 1442944788-17254-1-git-send-email-rric@kernel.org (mailing list archive) |
---|---|
State | New, archived |
On Tue, Sep 22, 2015 at 06:59:48PM +0100, Robert Richter wrote:
> From: Tirumalesh Chalamarla <tchalamarla@cavium.com>
>
> Increase the standard cacheline size to avoid having locks in the same
> cacheline.
>
> Cavium's ThunderX core implements cache lines of 128 byte size. With
> the current granule size of 64 bytes (L1_CACHE_SHIFT=6) two locks could
> share the same cache line, leading to a performance degradation.
> Increasing the size fixes that.

Do you have an example of that happening?

> Increasing the size has no negative impact on cache invalidation on
> systems with a smaller cache line. There is an impact on memory usage,
> but that's not too important for arm64 use cases.

Do you have any before/after numbers to show the impact of this change
on other supported SoCs?

Will
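The scenario Will is asking about is two hot locks landing in one 128-byte
hardware line even though the kernel only guarantees 64-byte separation. A
minimal hypothetical sketch (struct and field names are illustrative, not
from the patch):

	#include <linux/spinlock.h>

	/*
	 * Hypothetical layout: with L1_CACHE_SHIFT=6 the kernel pads and
	 * aligns to at most 64 bytes, so 'a' and 'b' land in different
	 * 64-byte lines but can still share one 128-byte ThunderX line.
	 * Contention on 'a' then also bounces the line holding 'b'.
	 */
	struct dev_state {
		spinlock_t a;				/* offset 0 */
		char pad[64 - sizeof(spinlock_t)];	/* fill first 64B */
		spinlock_t b;				/* offset 64 */
	};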
Will,

On 22.09.15 19:29:02, Will Deacon wrote:
> On Tue, Sep 22, 2015 at 06:59:48PM +0100, Robert Richter wrote:
> > From: Tirumalesh Chalamarla <tchalamarla@cavium.com>
> >
> > Increase the standard cacheline size to avoid having locks in the same
> > cacheline.
> > [...]
>
> Do you have an example of that happening?

I did some 'poor man's kernel build all modules benchmarking' and
could not find significant performance improvements so far (second
part with the patch reverted):

build-allmodules-4.2.0-01404-g5818d6e89783.log:real	7m10.490s
build-allmodules-4.2.0-01404-g5818d6e89783.log:real	6m59.747s
build-allmodules-4.2.0-01404-g5818d6e89783.log:real	6m59.264s
build-allmodules-4.2.0-01404-g5818d6e89783.log:real	7m0.435s
build-allmodules-4.2.0-01404-g5818d6e89783.log:real	6m59.569s
build-allmodules-4.2.0-01404-g5818d6e89783.log:real	6m59.274s
build-allmodules-4.2.0-01404-g5818d6e89783.log:real	7m0.507s
build-allmodules-4.2.0-01404-g5818d6e89783.log:real	7m1.551s
build-allmodules-4.2.0-01404-g5818d6e89783.log:real	6m59.073s
build-allmodules-4.2.0-01404-g5818d6e89783.log:real	7m1.738s

build-allmodules-4.2.0-01406-g638c69fddc40.log:real	7m10.644s
build-allmodules-4.2.0-01406-g638c69fddc40.log:real	6m59.814s
build-allmodules-4.2.0-01406-g638c69fddc40.log:real	7m0.315s
build-allmodules-4.2.0-01406-g638c69fddc40.log:real	6m59.610s
build-allmodules-4.2.0-01406-g638c69fddc40.log:real	6m59.885s
build-allmodules-4.2.0-01406-g638c69fddc40.log:real	6m59.281s
build-allmodules-4.2.0-01406-g638c69fddc40.log:real	7m0.869s
build-allmodules-4.2.0-01406-g638c69fddc40.log:real	7m0.953s
build-allmodules-4.2.0-01406-g638c69fddc40.log:real	7m0.787s
build-allmodules-4.2.0-01406-g638c69fddc40.log:real	7m0.656s

I will check what kind of workloads this patch was written for.
Tirumalesh, any idea?

Thanks,

-Robert

> > Increasing the size has no negative impact on cache invalidation on
> > systems with a smaller cache line. There is an impact on memory usage,
> > but that's not too important for arm64 use cases.
>
> Do you have any before/after numbers to show the impact of this change
> on other supported SoCs?
On 09/25/2015 07:45 AM, Robert Richter wrote:
> Will,
>
> On 22.09.15 19:29:02, Will Deacon wrote:
>> On Tue, Sep 22, 2015 at 06:59:48PM +0100, Robert Richter wrote:
>>> From: Tirumalesh Chalamarla <tchalamarla@cavium.com>
>>> [...]
>> Do you have an example of that happening?
> I did some 'poor man's kernel build all modules benchmarking' and
> could not find significant performance improvements so far (second
> part with the patch reverted):
> [...]
> I will check what kind of workloads this patch was written for.
> Tirumalesh, any idea?

Mainly for workloads where the compiler optimizes based on cache line
size. Let me write a small benchmark.

> Thanks,
>
> -Robert
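One way such a benchmark could look, as a hypothetical userspace sketch
(not the benchmark Tirumalesh ended up writing): two threads increment
counters either 64 or 128 bytes apart, so on 128-byte-line hardware the
first case false-shares a line and the second does not.

	/* false-share.c: hypothetical false-sharing micro-benchmark.
	 * Build: gcc -O2 -pthread false-share.c -o false-share
	 */
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define ITERS 100000000UL

	static char buf[256] __attribute__((aligned(128)));
	static long stride;	/* 64: same 128-byte line; 128: separate lines */

	static void *worker(void *arg)
	{
		volatile long *ctr = (volatile long *)(buf + (long)arg * stride);
		unsigned long i;

		for (i = 0; i < ITERS; i++)
			(*ctr)++;	/* each increment dirties the line */
		return NULL;
	}

	int main(int argc, char **argv)
	{
		pthread_t t[2];
		long i;

		stride = argc > 1 ? atol(argv[1]) : 64;
		for (i = 0; i < 2; i++)
			pthread_create(&t[i], NULL, worker, (void *)i);
		for (i = 0; i < 2; i++)
			pthread_join(t[i], NULL);
		printf("done, stride=%ld\n", stride);
		return 0;
	}

Comparing the run time of './false-share 64' against './false-share 128',
with the two threads pinned to different cores (e.g. via taskset), would
expose the cost of sharing a 128-byte line.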
On Tue, Sep 22, 2015 at 12:59 PM, Robert Richter <rric@kernel.org> wrote:
>
> -#define L1_CACHE_SHIFT		6
> +#define L1_CACHE_SHIFT		7
>  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)

Would it be better if this were a Kconfig option, like it is on ARM32?

http://lxr.free-electrons.com/source/arch/arm/include/asm/cache.h#L7
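For reference, the ARM32 header behind that link simply takes the shift
from Kconfig, where CONFIG_ARM_L1_CACHE_SHIFT is selected per platform
(as of the linked version):

	/* arch/arm/include/asm/cache.h */
	#define L1_CACHE_SHIFT	CONFIG_ARM_L1_CACHE_SHIFT
	#define L1_CACHE_BYTES	(1 << L1_CACHE_SHIFT)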
On Sat, Oct 10, 2015 at 12:39:25PM -0500, Timur Tabi wrote:
> On Tue, Sep 22, 2015 at 12:59 PM, Robert Richter <rric@kernel.org> wrote:
> >
> > -#define L1_CACHE_SHIFT		6
> > +#define L1_CACHE_SHIFT		7
> >  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
>
> Would it be better if this were a Kconfig option, like it is on ARM32?
>
> http://lxr.free-electrons.com/source/arch/arm/include/asm/cache.h#L7

I don't think it adds anything, to be honest. We really want one kernel
that runs everywhere and we don't (yet) have the SoC variation that
exists on arch/arm/, so we may as well just keep it as big as it needs
to be.

Of course, if we start to get significant divergence between the
minimum and maximum value and that in turn shows a non-trivial impact
on kernel size and/or performance, then we could consider a Kconfig
option, but at that point we'd probably also need to consider whether
there are alternative ways of providing this information to the kernel.

If somebody really wants to change it for their particular kernel
build, modifying the #define isn't exactly rocket science.

Will
On Tue, Sep 22, 2015 at 12:59 PM, Robert Richter <rric@kernel.org> wrote:
> From: Tirumalesh Chalamarla <tchalamarla@cavium.com>
>
> Increase the standard cacheline size to avoid having locks in the same
> cacheline.
> [...]
> Signed-off-by: Tirumalesh Chalamarla <tchalamarla@cavium.com>
> Signed-off-by: Robert Richter <rrichter@cavium.com>

Acked-by: Timur Tabi <timur@codeaurora.org>

We need this patch, because on our silicon CTR_EL0[CWG] is set to 5,
which means that setup_processor() complains with this warning:

	cls = cache_line_size();
	if (L1_CACHE_BYTES < cls)
		pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback Granule (%d < %d)\n",
			L1_CACHE_BYTES, cls);
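CWG here is the Cache Writeback Granule field of CTR_EL0, bits [27:24].
A sketch of how the kernel derives the line size from it, based on the
cache_type_cwg()/cache_line_size() helpers quoted later in this thread:

	u32 ctr = read_cpuid_cachetype();		/* reads CTR_EL0 */
	u32 cwg = (ctr >> 24) & 0xf;			/* CWG; 5 on this silicon */
	int cls = cwg ? 4 << cwg : L1_CACHE_BYTES;	/* 4 << 5 = 128 bytes */

So a CWG of 5 means a 128-byte writeback granule, which trips the warning
whenever L1_CACHE_BYTES is only 64.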
On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
> From: Tirumalesh Chalamarla <tchalamarla@cavium.com>
>
> Increase the standard cacheline size to avoid having locks in the same
> cacheline.
> [...]
> Signed-off-by: Tirumalesh Chalamarla <tchalamarla@cavium.com>
> Signed-off-by: Robert Richter <rrichter@cavium.com>

Applied. Thanks.
On Tue, Nov 03, 2015 at 12:07:06PM +0100, Geert Uytterhoeven wrote:
> On Wed, Oct 28, 2015 at 8:09 PM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
> >> From: Tirumalesh Chalamarla <tchalamarla@cavium.com>
> >> [...]
> >
> > Applied. Thanks.
>
> This patch causes a BUG() on r8a7795/salvator-x, for which support is
> not yet upstream.
>
> My config (attached) uses SLAB. If I switch to SLUB, it works.
> The arm64 defconfig works, even if I switch from SLUB to SLAB.
[...]
> ------------[ cut here ]------------
> kernel BUG at mm/slab.c:2283!
> Internal error: Oops - BUG: 0 [#1] SMP
[...]
> Call trace:
> [<ffffffc00014f9b4>] __kmem_cache_create+0x21c/0x280
> [<ffffffc00068be50>] create_boot_cache+0x4c/0x80
> [<ffffffc00068bed8>] create_kmalloc_cache+0x54/0x88
> [<ffffffc00068bfc0>] create_kmalloc_caches+0x50/0xf4
> [<ffffffc00068db08>] kmem_cache_init+0x104/0x118
> [<ffffffc00067d7d8>] start_kernel+0x218/0x33c

I haven't managed to reproduce this on a Juno kernel. Could you please
add some debug printing to see what's being passed to kmalloc_slab()?
Is the freelist_size getting out of bounds?
On Tue, Nov 03, 2015 at 12:05:05PM +0000, Catalin Marinas wrote:
> On Tue, Nov 03, 2015 at 12:07:06PM +0100, Geert Uytterhoeven wrote:
> > This patch causes a BUG() on r8a7795/salvator-x, for which support is
> > not yet upstream.
> >
> > My config (attached) uses SLAB. If I switch to SLUB, it works.
> > The arm64 defconfig works, even if I switch from SLUB to SLAB.
> > [...]
> > kernel BUG at mm/slab.c:2283!
> > [...]
>
> I haven't managed to reproduce this on a Juno kernel.

I now managed to reproduce it with your config (slightly adapted to
allow Juno). I'll look into it.
Hi Catalin,

On Tue, Nov 3, 2015 at 3:38 PM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Tue, Nov 03, 2015 at 12:05:05PM +0000, Catalin Marinas wrote:
>> I haven't managed to reproduce this on a Juno kernel.
>
> I now managed to reproduce it with your config (slightly adapted to
> allow Juno). I'll look into it.

Good to hear that!

BTW, I see this:

	freelist_size = 32
	cache_line_size() = 64

It seems like the value returned by cache_line_size() in
arch/arm64/include/asm/cache.h disagrees with L1_CACHE_SHIFT == 7:

	static inline int cache_line_size(void)
	{
		u32 cwg = cache_type_cwg();
		return cwg ? 4 << cwg : L1_CACHE_BYTES;
	}

Making cache_line_size() always return L1_CACHE_BYTES doesn't help.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker.
But when I'm talking to journalists I just say "programmer" or something
like that.
				-- Linus Torvalds
On Tue, Nov 03, 2015 at 03:55:29PM +0100, Geert Uytterhoeven wrote:
> BTW, I see this:
>
>	freelist_size = 32
>	cache_line_size() = 64
>
> It seems like the value returned by cache_line_size() in
> arch/arm64/include/asm/cache.h disagrees with L1_CACHE_SHIFT == 7:
>
>	static inline int cache_line_size(void)
>	{
>		u32 cwg = cache_type_cwg();
>		return cwg ? 4 << cwg : L1_CACHE_BYTES;
>	}
>
> Making cache_line_size() always return L1_CACHE_BYTES doesn't help.

(cc'ing Joonsoo and Christoph; summary: slab failure with L1_CACHE_BYTES
of 128 and sizeof(kmem_cache_node) of 152)

If I revert commit 8fc9cf420b36 ("slab: make more slab management
structure off the slab") it works, but I still need to figure out how
slab indices are calculated. The size_index[] array is overridden so
that 0..15 are 7 and 16..23 are 8. But kmalloc_caches[7] has never been
populated, hence the BUG_ON. Another option may be to change
kmalloc_size() and kmalloc_index() to cope with a KMALLOC_MIN_SIZE of
128.

I'll do some more investigation tomorrow.
On Tue, 3 Nov 2015, Catalin Marinas wrote:

> (cc'ing Joonsoo and Christoph; summary: slab failure with L1_CACHE_BYTES
> of 128 and sizeof(kmem_cache_node) of 152)

Hmmm... Yes, that would mean using the 192-byte kmalloc array, which is
not a power-of-two slab. But the code looks fine to me.

> If I revert commit 8fc9cf420b36 ("slab: make more slab management
> structure off the slab") it works, but I still need to figure out how
> slab indices are calculated. The size_index[] array is overridden so
> that 0..15 are 7 and 16..23 are 8. But kmalloc_caches[7] has never been
> populated, hence the BUG_ON. Another option may be to change
> kmalloc_size() and kmalloc_index() to cope with a KMALLOC_MIN_SIZE of
> 128.
>
> I'll do some more investigation tomorrow.

The commit allows off-slab management for sizes from PAGE_SIZE >> 5,
that is 128 bytes. After that commit kmem_cache_create() would try to
allocate an off-slab management structure, which is not available
during early boot. But slab_early_init is set, which should prevent the
use of an off-slab management structure in kmem_cache_create().

However, the failure in line 2283 shows that the OFF_SLAB flag was
mistakenly set anyway! Something must have cleared slab_early_init?
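The decision Christoph describes looks roughly like this in mm/slab.c of
that era (a paraphrase for orientation, not a verbatim quote of any
particular kernel version):

	/*
	 * Off-slab decision in __kmem_cache_create(): after commit
	 * 8fc9cf420b36 the threshold is PAGE_SIZE >> 5, i.e. 128 bytes
	 * with 4K pages, exactly KMALLOC_MIN_SIZE with 128-byte lines.
	 * slab_early_init is meant to veto this during boot, because
	 * the off-slab freelist itself needs a working kmalloc.
	 */
	if (size >= (PAGE_SIZE >> 5) && !slab_early_init &&
	    !(flags & SLAB_NOLEAKTRACE))
		flags |= CFLGS_OFF_SLAB;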
On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
> From: Tirumalesh Chalamarla <tchalamarla@cavium.com>
>
> Increase the standard cacheline size to avoid having locks in the same
> cacheline.
>
> Cavium's ThunderX core implements cache lines of 128 byte size. With
> the current granule size of 64 bytes (L1_CACHE_SHIFT=6) two locks could
> share the same cache line, leading to a performance degradation.
> Increasing the size fixes that.

Besides the slab-side bug, I don't think this argument is valid. Even
if this change is applied, statically allocated spinlocks could share
the same cache line. If two locks should not share the same cache line,
you'd better use a compiler attribute such as
____cacheline_aligned_in_smp in the appropriate place.

Thanks.

> Increasing the size has no negative impact on cache invalidation on
> systems with a smaller cache line. There is an impact on memory usage,
> but that's not too important for arm64 use cases.
>
> Signed-off-by: Tirumalesh Chalamarla <tchalamarla@cavium.com>
> Signed-off-by: Robert Richter <rrichter@cavium.com>
> ---
>  arch/arm64/include/asm/cache.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
> index bde449936e2f..5082b30bc2c0 100644
> --- a/arch/arm64/include/asm/cache.h
> +++ b/arch/arm64/include/asm/cache.h
> @@ -18,7 +18,7 @@
>
>  #include <asm/cachetype.h>
>
> -#define L1_CACHE_SHIFT		6
> +#define L1_CACHE_SHIFT		7
>  #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
>
>  /*
> --
> 2.1.1
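A minimal example of the per-object annotation Joonsoo suggests (the
structure is hypothetical, for illustration only):

	#include <linux/cache.h>
	#include <linux/spinlock.h>

	/*
	 * Hypothetical: give only this hot lock its own SMP cache line,
	 * instead of growing L1_CACHE_BYTES for every object in the system.
	 */
	struct hot_path {
		spinlock_t lock ____cacheline_aligned_in_smp;
		u64 nr_events;
	};

Note that this only helps if SMP_CACHE_BYTES matches the real hardware
line size, which is exactly the 64-vs-128 question at hand.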
On Thu, Nov 05, 2015 at 01:40:14PM +0900, Joonsoo Kim wrote:
> On Tue, Sep 22, 2015 at 07:59:48PM +0200, Robert Richter wrote:
> > Cavium's ThunderX core implements cache lines of 128 byte size. With
> > the current granule size of 64 bytes (L1_CACHE_SHIFT=6) two locks could
> > share the same cache line, leading to a performance degradation.
> > Increasing the size fixes that.
>
> Besides the slab-side bug, I don't think this argument is valid. Even
> if this change is applied, statically allocated spinlocks could share
> the same cache line.

The benchmarks didn't show any difference with or without this patch
applied. What convinced me to apply it was this email:

http://lkml.kernel.org/g/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com

On ARM we have a notion of cache writeback granule (CWG) which tells us
"the maximum size of memory that can be overwritten as a result of the
eviction of a cache entry that has had a memory location in it
modified". What we actually needed was ARCH_DMA_MINALIGN to be 128
(currently defined to the L1_CACHE_BYTES value). However, this wouldn't
have fixed the KMALLOC_MIN_SIZE, unless we somehow generate different
kmalloc_caches[] and kmalloc_dma_caches[] and probably introduce a
size_dma_index[].

> If two locks should not share the same cache line, you'd better use a
> compiler attribute such as ____cacheline_aligned_in_smp in the
> appropriate place.

We could decouple SMP_CACHE_BYTES from L1_CACHE_BYTES, but see above
for the other issue we had to solve.
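The coupling Catalin describes is direct in the arm64 header of this
period (a sketch; the exact surrounding context may differ by version):

	/* arch/arm64/include/asm/cache.h */
	#define L1_CACHE_SHIFT		7
	#define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)
	#define ARCH_DMA_MINALIGN	L1_CACHE_BYTES

Raising L1_CACHE_SHIFT therefore raises the DMA alignment too, and in
this era KMALLOC_MIN_SIZE under SLAB follows ARCH_DMA_MINALIGN, which is
why the slab minimum size moved together with the cache line size.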
2015-11-05 19:32 GMT+09:00 Catalin Marinas <catalin.marinas@arm.com>:
> The benchmarks didn't show any difference with or without this patch
> applied. What convinced me to apply it was this email:
>
> http://lkml.kernel.org/g/CAOZdJXUiRMAguDV+HEJqPg57MyBNqEcTyaH+ya=U93NHb-pdJA@mail.gmail.com

Okay.

> On ARM we have a notion of cache writeback granule (CWG) which tells us
> "the maximum size of memory that can be overwritten as a result of the
> eviction of a cache entry that has had a memory location in it
> modified". What we actually needed was ARCH_DMA_MINALIGN to be 128
> (currently defined to the L1_CACHE_BYTES value). However, this wouldn't
> have fixed the KMALLOC_MIN_SIZE, unless we somehow generate different
> kmalloc_caches[] and kmalloc_dma_caches[] and probably introduce a
> size_dma_index[].

If we create separate kmalloc caches for DMA, can we apply this
alignment requirement only to the DMA caches? I guess some memory
allocation requests that will be used for DMA operations don't specify
GFP_DMA because they don't want memory from ZONE_DMA. In that case, we
would have to apply the DMA alignment requirement to all types of
caches.

In fact, I know someone who tried to implement this alignment
separation as you mentioned, to reduce memory waste. I first suggested
this solution to him, but now I realize that it isn't possible because
of the above reason. Am I missing something?

If it isn't possible, is there another way to reduce the memory waste
due to the increased DMA alignment requirement on arm64?

Thanks.
On Thu, Nov 05, 2015 at 08:45:08PM +0900, Joonsoo Kim wrote:
> If we create separate kmalloc caches for DMA, can we apply this
> alignment requirement only to the DMA caches? I guess some memory
> allocation requests that will be used for DMA operations don't specify
> GFP_DMA because they don't want memory from ZONE_DMA. In that case, we
> would have to apply the DMA alignment requirement to all types of
> caches.

I think you are right. While something like swiotlb (through the
streaming DMA API) could do bounce buffering and allocate one from
ZONE_DMA, this is not guaranteed if the buffer physical address happens
to match the dma_mask. Similarly with an IOMMU, no bouncing happens but
the alignment is still required.

> If it isn't possible, is there another way to reduce the memory waste
> due to the increased DMA alignment requirement on arm64?

I first need to see how significant the impact is (especially for
embedded/mobile platforms).

An alternative is to leave L1_CACHE_BYTES at 64 by default but warn if
the CWG is 128 on systems with non-coherent DMA (and hope that we won't
have any). It's not really a fix, rather an assumption. Anyway, I would
very much like the same kernel image for all platforms and no Kconfig
entry for the cache line size, but if the waste is significant, we may
add one for some specific builds (like a mobile phone; or such vendors
could patch the kernel themselves).
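A hypothetical shape for the warning Catalin suggests (the function is
invented for illustration; helper names are taken from the code quoted
earlier in the thread):

	/*
	 * Hypothetical boot-time check: keep L1_CACHE_BYTES at 64 but
	 * complain when the hardware writeback granule is larger than
	 * the alignment handed out for DMA buffers, since non-coherent
	 * DMA could then corrupt data adjacent to a buffer.
	 */
	static void __init check_cwg(void)
	{
		int cls = cache_line_size();	/* 4 << CTR_EL0.CWG */

		if (cls > ARCH_DMA_MINALIGN)
			pr_warn("CWG (%d) > ARCH_DMA_MINALIGN (%d): non-coherent DMA may be unsafe\n",
				cls, ARCH_DMA_MINALIGN);
	}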
2015-11-05 21:17 GMT+09:00 Catalin Marinas <catalin.marinas@arm.com>:
> I first need to see how significant the impact is (especially for
> embedded/mobile platforms).

I don't have any ARM64 device. What I have is just one report about
slab usage from one of our developers.

The report shows the slab usage just after the Android boot is done on
ARM64:

	Total slab usage: 90 MB
	kmalloc usage: 25 MB
	kmalloc (<=64) usage: 7 MB

This was measured without slab_nomerge, so there is a possibility that
some usage of kmem_caches is merged into the usage of kmalloc (<=64).

Anyway, if ARM64 increases L1_CACHE_BYTES to 128, roughly 7 MB would be
wasted. I don't know how this picture varies at runtime but, even as
boot-time overhead, 7 MB looks large to me.

> An alternative is to leave L1_CACHE_BYTES at 64 by default but warn if
> the CWG is 128 on systems with non-coherent DMA (and hope that we won't
> have any). It's not really a fix, rather an assumption. Anyway, I would
> very much like the same kernel image for all platforms and no Kconfig
> entry for the cache line size, but if the waste is significant, we may
> add one for some specific builds (like a mobile phone; or such vendors
> could patch the kernel themselves).

Okay. In fact, I'm not very familiar with Android devices, so it'd be
better for you to ask some other Android developers how significant the
impact is on Android or mobile devices.

Thanks.
On Mon, Nov 09, 2015 at 04:41:58PM +0900, Joonsoo Kim wrote:
> I don't have any ARM64 device. What I have is just one report about
> slab usage from one of our developers.
>
> The report shows the slab usage just after the Android boot is done on
> ARM64:
>
>	Total slab usage: 90 MB
>	kmalloc usage: 25 MB
>	kmalloc (<=64) usage: 7 MB
>
> [...]
>
> Anyway, if ARM64 increases L1_CACHE_BYTES to 128, roughly 7 MB would be
> wasted. I don't know how this picture varies at runtime but, even as
> boot-time overhead, 7 MB looks large to me.

7MB is considerable, but I guess it wouldn't all be wasted with
L1_CACHE_BYTES == 128, maybe half or slightly over. It would be good to
know the other kmalloc caches, maybe up to 256.

I don't have an Android filesystem but I just tried to boot Arch
(aarch64). Immediately after boot and with slab_nomerge, with a 128-byte
L1 I get:

	kmalloc-128: 6624
	kmalloc-256: 1488

With a 64-byte L1, I get:

	kmalloc-64:  5760
	kmalloc-128: 1152
	kmalloc-192: 1155
	kmalloc-256: 320

So that's about 1.2MB vs 0.8MB. The ratio is 3:2, though I'm not sure
it will stay the same as the slab usage increases.

It would be good to get more numbers; we could add a Kconfig option just
for specific builds while keeping the default at 128.
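The arithmetic behind those totals, derived directly from the counts
above:

	128-byte build: 6624*128 + 1488*256 = 847872 + 380928 ~= 1.23 MB
	 64-byte build: 5760*64 + 1152*128 + 1155*192 + 320*256
	              = 368640 + 147456 + 221760 + 81920  ~= 0.82 MB

and 1.23 / 0.82 ~= 1.5, hence the 3:2 ratio.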
On Mon, Nov 09, 2015 at 06:36:09PM +0000, Catalin Marinas wrote:
> I don't have an Android filesystem but I just tried to boot Arch
> (aarch64). Immediately after boot and with slab_nomerge, with a 128-byte
> L1 I get:
>
>	kmalloc-128: 6624
>	kmalloc-256: 1488
>
> With a 64-byte L1, I get:
>
>	kmalloc-64:  5760
>	kmalloc-128: 1152
>	kmalloc-192: 1155
>	kmalloc-256: 320
>
> So that's about 1.2MB vs 0.8MB. The ratio is 3:2, though I'm not sure
> it will stay the same as the slab usage increases.
>
> It would be good to get more numbers; we could add a Kconfig option just
> for specific builds while keeping the default at 128.

Okay. Thanks.
diff --git a/arch/arm64/include/asm/cache.h b/arch/arm64/include/asm/cache.h
index bde449936e2f..5082b30bc2c0 100644
--- a/arch/arm64/include/asm/cache.h
+++ b/arch/arm64/include/asm/cache.h
@@ -18,7 +18,7 @@

 #include <asm/cachetype.h>

-#define L1_CACHE_SHIFT		6
+#define L1_CACHE_SHIFT		7
 #define L1_CACHE_BYTES		(1 << L1_CACHE_SHIFT)

 /*