Message ID | 20180302143737.10788-2-jglauber@cavium.com (mailing list archive)
State | New, archived
On Fri, Mar 2, 2018 at 3:37 PM, Jan Glauber <jglauber@cavium.com> wrote:
> ThunderX1 dual socket has 96 CPUs and ThunderX2 has 224 CPUs.

Are you sure about those numbers? From my counting, I would have expected
twice that number in both cases: 48 cores, 2 chips and 2x SMT for ThunderX
vs 52 cores, 2 chips and 4x SMT for ThunderX2.

> Therefore raise the default number of CPUs from 64 to 256
> by adding an arm64 specific option to override the generic default.

Regardless of what the correct numbers for your chips are, I'd like
to hear some other opinions on how high we should raise that default
limit, both in arch/arm64/Kconfig and in the defconfig file.

As I remember it, there is a noticeable cost for taking the limit beyond
BITS_PER_LONG, both in terms of memory consumption and also
runtime performance (copying and comparing CPU masks).

I'm sure someone will keep coming up with even larger configurations
in the future, so we should try to decide how far we can take the
defaults for the moment without impacting users of the smallest
systems. Alternatively, you could add some measurements that show
how much memory and CPU time is used up on a typical configuration
for a small system (4 cores, no SMT, 512 MB RAM). If that's low
enough, we could just do it anyway.

      Arnd
On Tue, Mar 06, 2018 at 02:12:29PM +0100, Arnd Bergmann wrote:
> On Fri, Mar 2, 2018 at 3:37 PM, Jan Glauber <jglauber@cavium.com> wrote:
> > ThunderX1 dual socket has 96 CPUs and ThunderX2 has 224 CPUs.
>
> Are you sure about those numbers? From my counting, I would have expected
> twice that number in both cases: 48 cores, 2 chips and 2x SMT for ThunderX
> vs 52 cores, 2 chips and 4x SMT for ThunderX2.

That's what I have on those machines. I counted SMT as normal CPUs as it
doesn't make a difference for the config. I've not seen SMT on ThunderX.

The ThunderX2 number of 224 is already with 4x SMT (and 2 chips) but
there may be other versions planned that I'm not aware of.

> > Therefore raise the default number of CPUs from 64 to 256
> > by adding an arm64 specific option to override the generic default.
>
> Regardless of what the correct numbers for your chips are, I'd like
> to hear some other opinions on how high we should raise that default
> limit, both in arch/arm64/Kconfig and in the defconfig file.
>
> As I remember it, there is a noticeable cost for taking the limit beyond
> BITS_PER_LONG, both in terms of memory consumption and also
> runtime performance (copying and comparing CPU masks).

OK, that explains the default. My unverified assumption is that
increasing the CPU masks won't be a noticeable performance hit.

Also, I don't think that anyone who wants performance will use
defconfig. All server distributions would bump up NR_CPUS anyway
and really small systems will probably need to tune the config
anyway.

For me defconfig should produce a usable system, not with every last
driver configured but with all the basics like CPUs, networking, etc.
fully present.

> I'm sure someone will keep coming up with even larger configurations
> in the future, so we should try to decide how far we can take the
> defaults for the moment without impacting users of the smallest
> systems. Alternatively, you could add some measurements that
> show how much memory and CPU time is used up on a typical
> configuration for a small system (4 cores, no SMT, 512 MB RAM).
> If that's low enough, we could just do it anyway.

OK, I'll take a look.

--Jan
On Tue, Mar 6, 2018 at 3:02 PM, Jan Glauber <jan.glauber@caviumnetworks.com> wrote:
> On Tue, Mar 06, 2018 at 02:12:29PM +0100, Arnd Bergmann wrote:
>> On Fri, Mar 2, 2018 at 3:37 PM, Jan Glauber <jglauber@cavium.com> wrote:
>> > ThunderX1 dual socket has 96 CPUs and ThunderX2 has 224 CPUs.
>>
>> Are you sure about those numbers? From my counting, I would have expected
>> twice that number in both cases: 48 cores, 2 chips and 2x SMT for ThunderX
>> vs 52 cores, 2 chips and 4x SMT for ThunderX2.
>
> That's what I have on those machines. I counted SMT as normal CPUs as it
> doesn't make a difference for the config. I've not seen SMT on ThunderX.
>
> The ThunderX2 number of 224 is already with 4x SMT (and 2 chips) but
> there may be other versions planned that I'm not aware of.

I've never used one; the numbers I have are probably the highest
announced core counts that are produced, but it's possible that the
versions with fewer cores that you have (24 and 26, respectively) are
much more affordable and/or common.

>> > Therefore raise the default number of CPUs from 64 to 256
>> > by adding an arm64 specific option to override the generic default.
>>
>> Regardless of what the correct numbers for your chips are, I'd like
>> to hear some other opinions on how high we should raise that default
>> limit, both in arch/arm64/Kconfig and in the defconfig file.
>>
>> As I remember it, there is a noticeable cost for taking the limit beyond
>> BITS_PER_LONG, both in terms of memory consumption and also
>> runtime performance (copying and comparing CPU masks).
>
> OK, that explains the default. My unverified assumption is that
> increasing the CPU masks won't be a noticeable performance hit.

The cpumask macros are rather subtle and are written to be as efficient
as possible on configurations with 1 CPU, with up to BITS_PER_LONG CPUs,
and with large numbers of CPUs. There is also the CONFIG_CPUMASK_OFFSTACK
option that trades (stack) memory consumption for CPU cycles and is
usually used on configurations with more than 512 CPUs.

> Also, I don't think that anyone who wants performance will use
> defconfig. All server distributions would bump up NR_CPUS anyway
> and really small systems will probably need to tune the config
> anyway.
>
> For me defconfig should produce a usable system, not with every last
> driver configured but with all the basics like CPUs, networking, etc.
> fully present.

Agreed. If we can sacrifice a little bit of kernel performance in
exchange for running on a wider range of machines, we should do that,
but if either the CPU or memory cost is excessive for small machines,
then I think it's better to sacrifice access to some of the CPUs on
the larger systems.

I would expect that the performance impact of running without SMP on
ThunderX2 (52 CPUs instead of 224) is significant but also something
we can live with as a non-optimized configuration. On my 32-thread x86
build box, disabling SMT costs under 20%; for larger configurations I
would expect a smaller impact for similar workloads (because of
Amdahl's law), but your SMT implementation may be better than AMD's.

      Arnd
On Tue, Mar 06, 2018 at 03:02:01PM +0100, Jan Glauber wrote:
> On Tue, Mar 06, 2018 at 02:12:29PM +0100, Arnd Bergmann wrote:
> > On Fri, Mar 2, 2018 at 3:37 PM, Jan Glauber <jglauber@cavium.com> wrote:
> > > ThunderX1 dual socket has 96 CPUs and ThunderX2 has 224 CPUs.
> >
> > Are you sure about those numbers? From my counting, I would have expected
> > twice that number in both cases: 48 cores, 2 chips and 2x SMT for ThunderX
> > vs 52 cores, 2 chips and 4x SMT for ThunderX2.
>
> That's what I have on those machines. I counted SMT as normal CPUs as it
> doesn't make a difference for the config. I've not seen SMT on ThunderX.
>
> The ThunderX2 number of 224 is already with 4x SMT (and 2 chips) but
> there may be other versions planned that I'm not aware of.
>
> > > Therefore raise the default number of CPUs from 64 to 256
> > > by adding an arm64 specific option to override the generic default.
> >
> > Regardless of what the correct numbers for your chips are, I'd like
> > to hear some other opinions on how high we should raise that default
> > limit, both in arch/arm64/Kconfig and in the defconfig file.
> >
> > As I remember it, there is a noticeable cost for taking the limit beyond
> > BITS_PER_LONG, both in terms of memory consumption and also
> > runtime performance (copying and comparing CPU masks).
>
> OK, that explains the default. My unverified assumption is that
> increasing the CPU masks won't be a noticeable performance hit.
>
> Also, I don't think that anyone who wants performance will use
> defconfig. All server distributions would bump up NR_CPUS anyway
> and really small systems will probably need to tune the config
> anyway.
>
> For me defconfig should produce a usable system, not with every last
> driver configured but with all the basics like CPUs, networking, etc.
> fully present.
>
> > I'm sure someone will keep coming up with even larger configurations
> > in the future, so we should try to decide how far we can take the
> > defaults for the moment without impacting users of the smallest
> > systems. Alternatively, you could add some measurements that
> > show how much memory and CPU time is used up on a typical
> > configuration for a small system (4 cores, no SMT, 512 MB RAM).
> > If that's low enough, we could just do it anyway.
>
> OK, I'll take a look.

I've made some measurements on a 4 core board (Cavium 81xx) with
NR_CPUS set to 64 or 256:

- vmlinux grows by 0.04 % with 256 CPUs

- Kernel compile time was a bit faster with 256 CPUs (which does not
  make sense, but at least it seems to not suffer from the change).
  Is there a benchmark that would be better suited? Maybe even a
  micro-benchmark that will suffer from the longer cpumasks?

- Available memory decreased by 0.13 % (restricted memory to 512 MB),
  BSS increased by 5.3 %

Cheers,
Jan
On Mon, Mar 26, 2018 at 10:52 AM, Jan Glauber
<jan.glauber@caviumnetworks.com> wrote:
> On Tue, Mar 06, 2018 at 03:02:01PM +0100, Jan Glauber wrote:
>> On Tue, Mar 06, 2018 at 02:12:29PM +0100, Arnd Bergmann wrote:
>> > On Fri, Mar 2, 2018 at 3:37 PM, Jan Glauber <jglauber@cavium.com> wrote:
>> > > ThunderX1 dual socket has 96 CPUs and ThunderX2 has 224 CPUs.
>> >
>> > Are you sure about those numbers? From my counting, I would have expected
>> > twice that number in both cases: 48 cores, 2 chips and 2x SMT for ThunderX
>> > vs 52 cores, 2 chips and 4x SMT for ThunderX2.
>>
>> That's what I have on those machines. I counted SMT as normal CPUs as it
>> doesn't make a difference for the config. I've not seen SMT on ThunderX.
>>
>> The ThunderX2 number of 224 is already with 4x SMT (and 2 chips) but
>> there may be other versions planned that I'm not aware of.
>>
>> > > Therefore raise the default number of CPUs from 64 to 256
>> > > by adding an arm64 specific option to override the generic default.
>> >
>> > Regardless of what the correct numbers for your chips are, I'd like
>> > to hear some other opinions on how high we should raise that default
>> > limit, both in arch/arm64/Kconfig and in the defconfig file.
>> >
>> > As I remember it, there is a noticeable cost for taking the limit beyond
>> > BITS_PER_LONG, both in terms of memory consumption and also
>> > runtime performance (copying and comparing CPU masks).
>>
>> OK, that explains the default. My unverified assumption is that
>> increasing the CPU masks won't be a noticeable performance hit.
>>
>> Also, I don't think that anyone who wants performance will use
>> defconfig. All server distributions would bump up NR_CPUS anyway
>> and really small systems will probably need to tune the config
>> anyway.
>>
>> For me defconfig should produce a usable system, not with every last
>> driver configured but with all the basics like CPUs, networking, etc.
>> fully present.
>>
>> > I'm sure someone will keep coming up with even larger configurations
>> > in the future, so we should try to decide how far we can take the
>> > defaults for the moment without impacting users of the smallest
>> > systems. Alternatively, you could add some measurements that
>> > show how much memory and CPU time is used up on a typical
>> > configuration for a small system (4 cores, no SMT, 512 MB RAM).
>> > If that's low enough, we could just do it anyway.
>>
>> OK, I'll take a look.
>
> I've made some measurements on a 4 core board (Cavium 81xx) with
> NR_CPUS set to 64 or 256:
>
> - vmlinux grows by 0.04 % with 256 CPUs

Ok. Is this both with CONFIG_CPUMASK_OFFSTACK=n?

> - Kernel compile time was a bit faster with 256 CPUs (which does not
>   make sense, but at least it seems to not suffer from the change).

Do you mean compiling the same kernel configuration while running
on a system with less than 64 CPUs on either a CONFIG_NR_CPUS=64
or CONFIG_NR_CPUS=256 kernel, or do you mean the time to compile
a kernel with either CONFIG_NR_CPUS=64 or CONFIG_NR_CPUS=256,
while running on the same host?

I assume the former, which is a very interesting result, possibly
pointing to us doing something wrong in the NR_CPUS=64 case
that could be optimized.

If you ran with CONFIG_CPUMASK_OFFSTACK, that may have made
a significant difference, but I would expect it to be faster without it.

To get more insight into what is happening, you could rerun the same
test with 'perf record' and then compare the profiles. How significant
is the runtime difference compared to the jitter you get between normal
runs on the same configuration?

> Is there a benchmark that would be better suited? Maybe even a
> micro-benchmark that will suffer from the longer cpumasks?

Good question.

> - Available memory decreased by 0.13 % (restricted memory to 512 MB),
>   BSS increased by 5.3 %

0.13% of a few hundred megabytes is still several hundred kB, right?
I'd like to hear some other opinions on that, but it seems to be in the
range of enabling many additional device drivers, which is something
we don't do lightly.

      Arnd
On Mon, Mar 26, 2018 at 11:28:28AM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 10:52 AM, Jan Glauber
> <jan.glauber@caviumnetworks.com> wrote:
> > On Tue, Mar 06, 2018 at 03:02:01PM +0100, Jan Glauber wrote:
> >> On Tue, Mar 06, 2018 at 02:12:29PM +0100, Arnd Bergmann wrote:
> >> > On Fri, Mar 2, 2018 at 3:37 PM, Jan Glauber <jglauber@cavium.com> wrote:
> >> > > ThunderX1 dual socket has 96 CPUs and ThunderX2 has 224 CPUs.
> >> >
> >> > Are you sure about those numbers? From my counting, I would have expected
> >> > twice that number in both cases: 48 cores, 2 chips and 2x SMT for ThunderX
> >> > vs 52 cores, 2 chips and 4x SMT for ThunderX2.
> >>
> >> That's what I have on those machines. I counted SMT as normal CPUs as it
> >> doesn't make a difference for the config. I've not seen SMT on ThunderX.
> >>
> >> The ThunderX2 number of 224 is already with 4x SMT (and 2 chips) but
> >> there may be other versions planned that I'm not aware of.
> >>
> >> > > Therefore raise the default number of CPUs from 64 to 256
> >> > > by adding an arm64 specific option to override the generic default.
> >> >
> >> > Regardless of what the correct numbers for your chips are, I'd like
> >> > to hear some other opinions on how high we should raise that default
> >> > limit, both in arch/arm64/Kconfig and in the defconfig file.
> >> >
> >> > As I remember it, there is a noticeable cost for taking the limit beyond
> >> > BITS_PER_LONG, both in terms of memory consumption and also
> >> > runtime performance (copying and comparing CPU masks).
> >>
> >> OK, that explains the default. My unverified assumption is that
> >> increasing the CPU masks won't be a noticeable performance hit.
> >>
> >> Also, I don't think that anyone who wants performance will use
> >> defconfig. All server distributions would bump up NR_CPUS anyway
> >> and really small systems will probably need to tune the config
> >> anyway.
> >>
> >> For me defconfig should produce a usable system, not with every last
> >> driver configured but with all the basics like CPUs, networking, etc.
> >> fully present.
> >>
> >> > I'm sure someone will keep coming up with even larger configurations
> >> > in the future, so we should try to decide how far we can take the
> >> > defaults for the moment without impacting users of the smallest
> >> > systems. Alternatively, you could add some measurements that
> >> > show how much memory and CPU time is used up on a typical
> >> > configuration for a small system (4 cores, no SMT, 512 MB RAM).
> >> > If that's low enough, we could just do it anyway.
> >>
> >> OK, I'll take a look.
> >
> > I've made some measurements on a 4 core board (Cavium 81xx) with
> > NR_CPUS set to 64 or 256:
> >
> > - vmlinux grows by 0.04 % with 256 CPUs
>
> Ok. Is this both with CONFIG_CPUMASK_OFFSTACK=n?

Yes.

> > - Kernel compile time was a bit faster with 256 CPUs (which does not
> >   make sense, but at least it seems to not suffer from the change).
>
> Do you mean compiling the same kernel configuration while running
> on a system with less than 64 CPUs on either a CONFIG_NR_CPUS=64
> or CONFIG_NR_CPUS=256 kernel, or do you mean the time to compile
> a kernel with either CONFIG_NR_CPUS=64 or CONFIG_NR_CPUS=256,
> while running on the same host?

The former, compiling everything on a 4-core system using two
different kernels to compile the same thing.

> I assume the former, which is a very interesting result, possibly
> pointing to us doing something wrong in the NR_CPUS=64 case
> that could be optimized.
>
> If you ran with CONFIG_CPUMASK_OFFSTACK, that may have made
> a significant difference, but I would expect it to be faster without it.
>
> To get more insight into what is happening, you could rerun the same
> test with 'perf record' and then compare the profiles. How significant
> is the runtime difference compared to the jitter you get between normal
> runs on the same configuration?

I did retry once, but the odd case that CONFIG_NR_CPUS=256 was faster
was consistent. The difference was very small though, so it may be
completely due to jitter.

> > Is there a benchmark that would be better suited? Maybe even a
> > micro-benchmark that will suffer from the longer cpumasks?
>
> Good question.
>
> > - Available memory decreased by 0.13 % (restricted memory to 512 MB),
> >   BSS increased by 5.3 %
>
> 0.13% of a few hundred megabytes is still several hundred kB, right?
> I'd like to hear some other opinions on that, but it seems to be in
> the range of enabling many additional device drivers, which is
> something we don't do lightly.

Agreed, available memory was reduced by 128 KB.

--Jan
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 3594aefa496f..970950e8c76b 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -630,3 +630,4 @@ CONFIG_CRYPTO_AES_ARM64_CE_BLK=y
 CONFIG_CRYPTO_AES_ARM64_NEON_BLK=m
 CONFIG_CRYPTO_CHACHA20_NEON=m
 CONFIG_CRYPTO_AES_ARM64_BS=m
+CONFIG_NR_CPUS=256
ThunderX1 dual socket has 96 CPUs and ThunderX2 has 224 CPUs.

Therefore raise the default number of CPUs from 64 to 256
by adding an arm64 specific option to override the generic default.

Signed-off-by: Jan Glauber <jglauber@cavium.com>
---
 arch/arm64/configs/defconfig | 1 +
 1 file changed, 1 insertion(+)
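[Editorial note: the "arm64 specific option" mentioned in the changelog would be a Kconfig override along these lines. This is a sketch, not part of the patch above (which only touches the defconfig); the prompt text and range are illustrative.]

```
config NR_CPUS
	int "Maximum number of CPUs (2-4096)"
	range 2 4096
	default "256"
```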