Message ID | 20210319183453.4466-2-danielhb413@gmail.com (mailing list archive)
State      | New, archived
Series     | pseries: SMP sockets must match NUMA nodes
On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
> Kernel commit 4bce545903fa ("powerpc/topology: Update
> topology_core_cpumask") caused a regression in the pseries machine when
> defining certain SMP topologies [1]. The reasoning behind the change is
> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating
> cpu_core_mask"). In short, the cpu_core_mask logic was causing trouble
> with large VMs with lots of CPUs and was replaced by cpu_cpu_mask
> because, as far as the kernel's understanding of SMP topologies goes,
> both masks are equivalent.
>
> Further discussion on the kernel mailing list [2] showed that the
> powerpc kernel has always assumed that the number of sockets is equal
> to the number of NUMA nodes. The claim is that it doesn't make sense,
> for Power hardware at least, for 2+ sockets to be in the same NUMA
> node. The immediate conclusion is that all the SMP topologies the
> pseries machine was supplying to the kernel with more than one socket
> in the same NUMA node, as in [1], happened to be represented correctly
> in the kernel by accident all these years.
>
> There's a case to be made for virtual topologies being detached from
> hardware constraints, allowing maximum flexibility to users. At the
> same time, this freedom can't result in unrealistic hardware
> representations being emulated. If the real hardware and the pseries
> kernel don't support multiple chips/sockets in the same NUMA node,
> neither should we.
>
> Starting in 6.0.0, every socket must map to a unique NUMA node in the
> pseries machine. qtest changes were made to adapt to this new
> condition.

Oof. I really don't like this idea. It means a bunch of fiddly work for
users to match these up, for no real gain. I'm also concerned that this
will require follow-on changes in libvirt to not make this a really
cryptic and irritating point of failure.
> [1] https://bugzilla.redhat.com/1934421
> [2] https://lore.kernel.org/linuxppc-dev/daa5d05f-dbd0-05ad-7395-5d5a3d364fc6@gmail.com/
>
> CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> CC: Cédric Le Goater <clg@kaod.org>
> CC: Igor Mammedov <imammedo@redhat.com>
> CC: Laurent Vivier <lvivier@redhat.com>
> CC: Thomas Huth <thuth@redhat.com>
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  hw/ppc/spapr.c                 |  3 ++
>  hw/ppc/spapr_numa.c            |  7 +++++
>  include/hw/ppc/spapr.h         |  1 +
>  tests/qtest/cpu-plug-test.c    |  4 +--
>  tests/qtest/device-plug-test.c |  9 +++++-
>  tests/qtest/numa-test.c        | 52 ++++++++++++++++++++++++++++------
>  6 files changed, 64 insertions(+), 12 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d56418ca29..745f71c243 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -4611,8 +4611,11 @@ DEFINE_SPAPR_MACHINE(6_0, "6.0", true);
>   */
>  static void spapr_machine_5_2_class_options(MachineClass *mc)
>  {
> +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
>      spapr_machine_6_0_class_options(mc);
>      compat_props_add(mc->compat_props, hw_compat_5_2, hw_compat_5_2_len);
> +    smc->pre_6_0_smp_topology = true;
>  }
>
>  DEFINE_SPAPR_MACHINE(5_2, "5.2", false);
> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> index 779f18b994..0ade43dd79 100644
> --- a/hw/ppc/spapr_numa.c
> +++ b/hw/ppc/spapr_numa.c
> @@ -163,6 +163,13 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
>      int i, j, max_nodes_with_gpus;
>      bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
>
> +    if (!smc->pre_6_0_smp_topology &&
> +        nb_numa_nodes != machine->smp.sockets) {
> +        error_report("Number of CPU sockets must be equal to the number "
> +                     "of NUMA nodes");
> +        exit(EXIT_FAILURE);
> +    }
> +
>      /*
>       * For all associativity arrays: first position is the size,
>       * position MAX_DISTANCE_REF_POINTS is always the numa_id,
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 47cebaf3ac..98dc5d198a 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -142,6 +142,7 @@ struct SpaprMachineClass {
>      hwaddr rma_limit;          /* clamp the RMA to this size */
>      bool pre_5_1_assoc_refpoints;
>      bool pre_5_2_numa_associativity;
> +    bool pre_6_0_smp_topology;
>
>      bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
>                            uint64_t *buid, hwaddr *pio,
> diff --git a/tests/qtest/cpu-plug-test.c b/tests/qtest/cpu-plug-test.c
> index a1c689414b..946b9129ea 100644
> --- a/tests/qtest/cpu-plug-test.c
> +++ b/tests/qtest/cpu-plug-test.c
> @@ -118,8 +118,8 @@ static void add_pseries_test_case(const char *mname)
>      data->machine = g_strdup(mname);
>      data->cpu_model = "power8_v2.0";
>      data->device_model = g_strdup("power8_v2.0-spapr-cpu-core");
> -    data->sockets = 2;
> -    data->cores = 3;
> +    data->sockets = 1;
> +    data->cores = 6;
>      data->threads = 1;
>      data->maxcpus = data->sockets * data->cores * data->threads;
>
> diff --git a/tests/qtest/device-plug-test.c b/tests/qtest/device-plug-test.c
> index 559d47727a..dd7d8268d2 100644
> --- a/tests/qtest/device-plug-test.c
> +++ b/tests/qtest/device-plug-test.c
> @@ -91,7 +91,14 @@ static void test_spapr_cpu_unplug_request(void)
>  {
>      QTestState *qtest;
>
> -    qtest = qtest_initf("-cpu power9_v2.0 -smp 1,maxcpus=2 "
> +    /*
> +     * Default smp settings will prioritize sockets over cores and
> +     * threads, so '-smp 2,maxcpus=2' will add 2 sockets. However,
> +     * the pseries machine requires a NUMA node for each socket
> +     * (since 6.0.0). Specify sockets=1 to make life easier.
> +     */
> +    qtest = qtest_initf("-cpu power9_v2.0 "
> +                        "-smp 1,maxcpus=2,threads=1,cores=2,sockets=1 "
>                          "-device power9_v2.0-spapr-cpu-core,core-id=1,id=dev0");
>
>      /* similar to test_pci_unplug_request */
> diff --git a/tests/qtest/numa-test.c b/tests/qtest/numa-test.c
> index dc0ec571ca..bb13f7131b 100644
> --- a/tests/qtest/numa-test.c
> +++ b/tests/qtest/numa-test.c
> @@ -24,9 +24,17 @@ static void test_mon_explicit(const void *data)
>      QTestState *qts;
>      g_autofree char *s = NULL;
>      g_autofree char *cli = NULL;
> +    const char *arch = qtest_get_arch();
> +
> +    if (g_str_equal(arch, "ppc64")) {
> +        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
> +                       "-numa node,nodeid=0,memdev=ram,cpus=0-3 "
> +                       "-numa node,nodeid=1,cpus=4-7");
> +    } else {
> +        cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 "
> +                       "-numa node,nodeid=1,cpus=4-7");
> +    }
>
> -    cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 "
> -                   "-numa node,nodeid=1,cpus=4-7");
>      qts = qtest_init(cli);
>
>      s = qtest_hmp(qts, "info numa");
> @@ -57,10 +65,18 @@ static void test_mon_partial(const void *data)
>      QTestState *qts;
>      g_autofree char *s = NULL;
>      g_autofree char *cli = NULL;
> +    const char *arch = qtest_get_arch();
> +
> +    if (g_str_equal(arch, "ppc64")) {
> +        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
> +                       "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
> +                       "-numa node,nodeid=1,cpus=4-5 ");
> +    } else {
> +        cli = make_cli(data, "-smp 8 "
> +                       "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
> +                       "-numa node,nodeid=1,cpus=4-5 ");
> +    }
>
> -    cli = make_cli(data, "-smp 8 "
> -                   "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
> -                   "-numa node,nodeid=1,cpus=4-5 ");
>      qts = qtest_init(cli);
>
>      s = qtest_hmp(qts, "info numa");
> @@ -85,9 +101,17 @@ static void test_query_cpus(const void *data)
>      QObject *e;
>      QTestState *qts;
>      g_autofree char *cli = NULL;
> +    const char *arch = qtest_get_arch();
> +
> +    if (g_str_equal(arch, "ppc64")) {
> +        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
> +                       "-numa node,memdev=ram,cpus=0-3 "
> +                       "-numa node,cpus=4-7");
> +    } else {
> +        cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 "
> +                       "-numa node,cpus=4-7");
> +    }
>
> -    cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 "
> -                   "-numa node,cpus=4-7");
>      qts = qtest_init(cli);
>      cpus = get_cpus(qts, &resp);
>      g_assert(cpus);
> @@ -177,7 +201,7 @@ static void spapr_numa_cpu(const void *data)
>      QTestState *qts;
>      g_autofree char *cli = NULL;
>
> -    cli = make_cli(data, "-smp 4,cores=4 "
> +    cli = make_cli(data, "-smp 4,threads=1,cores=2,sockets=2 "
>                     "-numa node,nodeid=0,memdev=ram -numa node,nodeid=1 "
>                     "-numa cpu,node-id=0,core-id=0 "
>                     "-numa cpu,node-id=0,core-id=1 "
> @@ -554,7 +578,17 @@ int main(int argc, char **argv)
>
>      g_test_init(&argc, &argv, NULL);
>
> -    qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split);
> +    /*
> +     * Starting on 6.0.0, for the pseries machine, '-smp 8' will only work
> +     * if we have 8 NUMA nodes. If we specify 'smp 8,sockets=2' to match
> +     * 2 NUMA nodes, the CPUs will be split as 0123/4567 instead of
> +     * 0246/1357 that test_def_cpu_split expects. In short, this test is
> +     * no longer valid for ppc64 in 6.0.0.
> +     */
> +    if (!g_str_equal(arch, "ppc64")) {
> +        qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split);
> +    }
> +
>      qtest_add_data_func("/numa/mon/cpus/explicit", args, test_mon_explicit);
>      qtest_add_data_func("/numa/mon/cpus/partial", args, test_mon_partial);
>      qtest_add_data_func("/numa/qmp/cpus/query-cpus", args, test_query_cpus);
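To make the effect of the patch concrete, here is a minimal Python sketch (hypothetical helper names, not QEMU code) of the two behaviors it touches: the new sockets-vs-NUMA-nodes check added to spapr_numa_associativity_init(), and the reason test_def_cpu_split no longer holds on ppc64. The node-split functions are a simplified model of QEMU's default round-robin assignment of single-CPU sockets versus whole multi-core sockets per node:

```python
def validate_topology(sockets: int, numa_nodes: int,
                      pre_6_0_smp_topology: bool = False) -> None:
    """Mirror of the check the spapr_numa.c hunk adds: with the compat
    flag unset (pseries-6.0 and later), sockets must equal NUMA nodes."""
    if not pre_6_0_smp_topology and numa_nodes != sockets:
        raise ValueError("Number of CPU sockets must be equal to the "
                         "number of NUMA nodes")

def round_robin_split(n_cpus: int, n_nodes: int) -> list[list[int]]:
    """'-smp 8' defaults to 8 single-CPU sockets; round-robining those
    sockets over 2 nodes yields the 0246/1357 split the test expects."""
    return [[c for c in range(n_cpus) if c % n_nodes == node]
            for node in range(n_nodes)]

def per_socket_split(n_cpus: int, n_nodes: int) -> list[list[int]]:
    """'-smp 8,sockets=2' keeps each 4-core socket on one node, giving
    contiguous blocks: 0123/4567."""
    per_node = n_cpus // n_nodes
    return [list(range(node * per_node, (node + 1) * per_node))
            for node in range(n_nodes)]
```

With 8 CPUs and 2 nodes, `round_robin_split` gives `[[0, 2, 4, 6], [1, 3, 5, 7]]` while `per_socket_split` gives `[[0, 1, 2, 3], [4, 5, 6, 7]]`, which is exactly why the default-split qtest had to be skipped for ppc64.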
On 3/22/21 10:03 PM, David Gibson wrote:
> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>> [...]
>> Starting in 6.0.0, every socket must map to a unique NUMA node in the
>> pseries machine. qtest changes were made to adapt to this new
>> condition.
>
> Oof. I really don't like this idea. It means a bunch of fiddly work
> for users to match these up, for no real gain. I'm also concerned
> that this will require follow-on changes in libvirt to not make this a
> really cryptic and irritating point of failure.

Haven't thought about the required libvirt changes, although I can say
that there will be some amount to be made, and it will probably annoy
existing users (everyone who has a multiple-sockets-per-NUMA-node
topology).

There is not much we can do from the QEMU layer aside from what I've
proposed here. The other alternative is to keep interacting with the
kernel folks to see if there is a way to keep our use case untouched.
This also means that 'ibm,chip-id' will probably remain in use, since
it's the only place where we inform cores-per-socket information to the
kernel.

Thanks,

DHB
On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
> On 3/22/21 10:03 PM, David Gibson wrote:
>> [...]
>> Oof. I really don't like this idea. It means a bunch of fiddly work
>> for users to match these up, for no real gain. I'm also concerned
>> that this will require follow-on changes in libvirt to not make this
>> a really cryptic and irritating point of failure.
>
> Haven't thought about the required libvirt changes, although I can say
> that there will be some amount to be made, and it will probably annoy
> existing users (everyone who has a multiple-sockets-per-NUMA-node
> topology).
>
> There is not much we can do from the QEMU layer aside from what I've
> proposed here. The other alternative is to keep interacting with the
> kernel folks to see if there is a way to keep our use case untouched.

Right. Well.. not necessarily untouched, but I'm hoping for more
replies from Cédric to my objections and mpe's. Even with sockets
being a kinda meaningless concept in PAPR, I don't think tying it to
NUMA nodes makes sense.

> This also means that 'ibm,chip-id' will probably remain in use, since
> it's the only place where we inform cores-per-socket information to
> the kernel.

Well.. unless we can find some other sensible way to convey that
information. I haven't given up hope for that yet.
On 3/25/21 3:10 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>> [...]
>> There is not much we can do from the QEMU layer aside from what I've
>> proposed here. The other alternative is to keep interacting with the
>> kernel folks to see if there is a way to keep our use case untouched.
>
> Right. Well.. not necessarily untouched, but I'm hoping for more
> replies from Cédric to my objections and mpe's. Even with sockets
> being a kinda meaningless concept in PAPR, I don't think tying it to
> NUMA nodes makes sense.

I did a couple of replies in different email threads, but maybe not to
all. I felt it was going nowhere :/ A couple of thoughts:

Shouldn't we get rid of the socket concept, and die also, under pseries,
since they don't exist under PAPR? We only have NUMA nodes, cores, and
threads AFAICT.

Should we diverge from PAPR and add extra DT properties "qemu,..."?
There are a couple of places where Linux checks for the underlying
hypervisor already.

>> This also means that 'ibm,chip-id' will probably remain in use, since
>> it's the only place where we inform cores-per-socket information to
>> the kernel.
>
> Well.. unless we can find some other sensible way to convey that
> information. I haven't given up hope for that yet.

Well, we could start by fixing the value in QEMU. It is broken today.

This is all coming from some work we did last year to evaluate our HW
(mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM.
We saw some real problems because Linux did not have a clear view of the
topology. See the figures here:

http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/

The node id is a key parameter for system resource management, memory
allocation, interrupt affinity, etc. Linux scales much better if it is
used correctly.

C.
On 3/25/21 5:56 AM, Cédric Le Goater wrote:
> On 3/25/21 3:10 AM, David Gibson wrote:
>> [...]
>> Right. Well.. not necessarily untouched, but I'm hoping for more
>> replies from Cédric to my objections and mpe's. Even with sockets
>> being a kinda meaningless concept in PAPR, I don't think tying it to
>> NUMA nodes makes sense.
>
> I did a couple of replies in different email threads, but maybe not to
> all. I felt it was going nowhere :/ A couple of thoughts:
>
> Shouldn't we get rid of the socket concept, and die also, under
> pseries, since they don't exist under PAPR? We only have NUMA nodes,
> cores, and threads AFAICT.

I don't think we work with 'die'. Getting rid of the 'socket'
representation is sensible with regard to PAPR, but the effect for
pseries will be similar to what this patch is already doing: users could
have multiple sockets in the same NUMA node, and then they won't, either
because we got rid of the 'socket' representation or because socket ==
NUMA node.

>> Well.. unless we can find some other sensible way to convey that
>> information. I haven't given up hope for that yet.
>
> Well, we could start by fixing the value in QEMU. It is broken today.

I'll look into it. It makes more sense to talk about keeping it when
it's working properly.

DHB
On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
> On 3/25/21 3:10 AM, David Gibson wrote:
> > On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
> >> On 3/22/21 10:03 PM, David Gibson wrote:
> >>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
> >>>> [... patch commit message quoted above ...]
> >>>
> >>> Oof. I really don't like this idea. It means a bunch of fiddly work
> >>> for users to match these up, for no real gain. I'm also concerned
> >>> that this will require follow on changes in libvirt to not make this a
> >>> really cryptic and irritating point of failure.
> >>
> >> Haven't thought about the required libvirt changes, although I can say
> >> that there will be some amount to be made, and it will probably annoy
> >> existing users (everyone who has a multiple-sockets-per-NUMA-node
> >> topology).
> >>
> >> There is not much we can do from the QEMU layer aside from what I've
> >> proposed here. The other alternative is to keep interacting with the
> >> kernel folks to see if there is a way to keep our use case untouched.
> >
> > Right. Well.. not necessarily untouched, but I'm hoping for more
> > replies from Cédric to my objections and mpe's. Even with sockets
> > being a kinda meaningless concept in PAPR, I don't think tying it to
> > NUMA nodes makes sense.
>
> I did a couple of replies in different email threads but maybe not
> to all. I felt it was going nowhere :/ A couple of thoughts:

I think I saw some of those, but maybe not all.

> Shouldn't we get rid of the socket concept, and the die as well, under
> pseries, since they don't exist under PAPR ? We only have NUMA nodes,
> cores and threads AFAICT.

Theoretically, yes. I'm not sure it's really practical, though, since
AFAICT both qemu and the kernel have the notion of sockets (though
not dies) built into generic code.

It does mean that one possible approach here - maybe the best one - is
to simply declare that sockets are meaningless under pseries, so we simply
don't expect what the guest kernel reports to match what's given to
qemu.

It'd be nice to avoid that if we can: in a sense it's just cosmetic,
but it is likely to surprise and confuse people.

> Should we diverge from PAPR and add extra DT properties "qemu,..." ?
> There are a couple of places where Linux checks for the underlying
> hypervisor already.
>
> >> This also means that 'ibm,chip-id' will probably remain in use since
> >> it's the only place where we inform cores per socket information to
> >> the kernel.
> >
> > Well.. unless we can find some other sensible way to convey that
> > information. I haven't given up hope for that yet.
>
> Well, we could start by fixing the value in QEMU. It is broken
> today.

Fixing what value, exactly?

> This is all coming from some work we did last year to evaluate our HW
> (mostly for XIVE) on 2s, 4s and 16s systems on baremetal, KVM and PowerVM.
> We saw some real problems because Linux did not have a clear view of the
> topology. See the figures here :
>
> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/
>
> The node id is a key parameter for system resource management, memory
> allocation, interrupt affinity, etc. Linux scales much better if it is
> used correctly.

Well, sure. And we have all the ibm,associativity stuff to convey the
node ids to the guest (which has its own problems, but none that are
relevant here). What's throwing me is why getting node IDs correct
has anything to do with socket numbers.
On 3/29/21 6:20 AM, David Gibson wrote:
> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>> On 3/25/21 3:10 AM, David Gibson wrote:
>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>> [... patch commit message quoted above ...]
>>>>> [... earlier discussion quoted above ...]
>>
>> Shouldn't we get rid of the socket concept, die also, under pseries
>> since they don't exist under PAPR ? We only have numa nodes, cores,
>> threads AFAICT.
>
> Theoretically, yes. I'm not sure it's really practical, though, since
> AFAICT, both qemu and the kernel have the notion of sockets (though
> not dies) built into generic code.

Yes. But, AFAICT, these topology notions have not reached "arch/powerpc",
and PPC Linux only has a NUMA node id, on pseries and powernv.
> [... remainder of David's reply quoted above ...]
>
>> Well, we could start by fixing the value in QEMU. It is broken
>> today.
>
> Fixing what value, exactly?

The value of the "ibm,chip-id", since we are keeping the property under
QEMU.

> [...]
On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> On 3/29/21 6:20 AM, David Gibson wrote:
>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote:
>>> On 3/25/21 3:10 AM, David Gibson wrote:
>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote:
>>>>> On 3/22/21 10:03 PM, David Gibson wrote:
>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
>>>>>>> [... patch commit message quoted above ...]
>>>>>>> [... earlier discussion quoted above ...]
>>>
>>> Well, we could start by fixing the value in QEMU. It is broken
>>> today.
>>
>> Fixing what value, exactly?
>
> The value of the "ibm,chip-id", since we are keeping the property under
> QEMU.

David, I believe this has to do with the discussion we had last Friday.

I mentioned that the ibm,chip-id property is being calculated in a way that
assigns the same ibm,chip-id to CPUs that belong to different NUMA nodes,
e.g.:

-smp 4,cores=4,maxcpus=8,threads=1 \
-numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
-numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1

$ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
        ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
        ibm,pft-size = <0x00 0x19>;
        ibm,chip-id = <0x00>;
--
        ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>;
        ibm,pft-size = <0x00 0x19>;
        ibm,chip-id = <0x00>;
--
        ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>;
        ibm,pft-size = <0x00 0x19>;
        ibm,chip-id = <0x00>;
--
        ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>;
        ibm,pft-size = <0x00 0x19>;
        ibm,chip-id = <0x00>;

We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
different NUMA node than 0-1.
This would mean that the same socket would belong to different NUMA nodes
at the same time. I believe this is what Cédric wants to be addressed.
Given that the property is named after the OPAL property ibm,chip-id, the
kernel expects it to have the same semantics as in OPAL.


Thanks,

DHB

> [... remainder of the thread quoted above ...]
On Tue, 23 Mar 2021 12:03:58 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote:
> > [... patch commit message quoted above ...]
>
> Oof. I really don't like this idea. It means a bunch of fiddly work
> for users to match these up, for no real gain.
> I'm also concerned
> that this will require follow on changes in libvirt to not make this a
> really cryptic and irritating point of failure.

Yes, it is annoying to users, but we have been very slowly tightening
limitations on the NUMA path, especially where nonsense user configs cause
problems on the guest side (and I suspect there are a few things left to
enforce in the generic NUMA code; currently they are just warnings). So
this is nothing new. If the limit applies only to the new machine type it
should be fine (i.e. no existing VMs will suffer). Later on we can
deprecate invalid configurations altogether and just error out.

> >
> > [1] https://bugzilla.redhat.com/1934421
> > [2] https://lore.kernel.org/linuxppc-dev/daa5d05f-dbd0-05ad-7395-5d5a3d364fc6@gmail.com/
> >
> > CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > CC: Cédric Le Goater <clg@kaod.org>
> > CC: Igor Mammedov <imammedo@redhat.com>
> > CC: Laurent Vivier <lvivier@redhat.com>
> > CC: Thomas Huth <thuth@redhat.com>
> > Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> > ---
> >  hw/ppc/spapr.c                 |  3 ++
> >  hw/ppc/spapr_numa.c            |  7 +++++
> >  include/hw/ppc/spapr.h         |  1 +
> >  tests/qtest/cpu-plug-test.c    |  4 +--
> >  tests/qtest/device-plug-test.c |  9 +++++-
> >  tests/qtest/numa-test.c        | 52 ++++++++++++++++++++++++++++------
> >  6 files changed, 64 insertions(+), 12 deletions(-)
> >
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index d56418ca29..745f71c243 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -4611,8 +4611,11 @@ DEFINE_SPAPR_MACHINE(6_0, "6.0", true);
> >   */
> >  static void spapr_machine_5_2_class_options(MachineClass *mc)
> >  {
> > +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> > +
> >      spapr_machine_6_0_class_options(mc);
> >      compat_props_add(mc->compat_props, hw_compat_5_2, hw_compat_5_2_len);
> > +    smc->pre_6_0_smp_topology = true;
> >  }
> >
> >  DEFINE_SPAPR_MACHINE(5_2, "5.2", false);
> >
> > diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
> > index 779f18b994..0ade43dd79 100644
> > --- a/hw/ppc/spapr_numa.c
> > +++ b/hw/ppc/spapr_numa.c
> > @@ -163,6 +163,13 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
> >      int i, j, max_nodes_with_gpus;
> >      bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
> >
> > +    if (!smc->pre_6_0_smp_topology &&
> > +        nb_numa_nodes != machine->smp.sockets) {
> > +        error_report("Number of CPU sockets must be equal to the number "
> > +                     "of NUMA nodes");
> > +        exit(EXIT_FAILURE);
> > +    }
> > +
> >      /*
> >       * For all associativity arrays: first position is the size,
> >       * position MAX_DISTANCE_REF_POINTS is always the numa_id,
> >
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index 47cebaf3ac..98dc5d198a 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -142,6 +142,7 @@ struct SpaprMachineClass {
> >      hwaddr rma_limit;          /* clamp the RMA to this size */
> >      bool pre_5_1_assoc_refpoints;
> >      bool pre_5_2_numa_associativity;
> > +    bool pre_6_0_smp_topology;
> >
> >      bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
> >                            uint64_t *buid, hwaddr *pio,
> >
> > diff --git a/tests/qtest/cpu-plug-test.c b/tests/qtest/cpu-plug-test.c
> > index a1c689414b..946b9129ea 100644
> > --- a/tests/qtest/cpu-plug-test.c
> > +++ b/tests/qtest/cpu-plug-test.c
> > @@ -118,8 +118,8 @@ static void add_pseries_test_case(const char *mname)
> >      data->machine = g_strdup(mname);
> >      data->cpu_model = "power8_v2.0";
> >      data->device_model = g_strdup("power8_v2.0-spapr-cpu-core");
> > -    data->sockets = 2;
> > -    data->cores = 3;
> > +    data->sockets = 1;
> > +    data->cores = 6;
> >      data->threads = 1;
> >      data->maxcpus = data->sockets * data->cores * data->threads;
> >
> > diff --git a/tests/qtest/device-plug-test.c b/tests/qtest/device-plug-test.c
> > index 559d47727a..dd7d8268d2 100644
> > --- a/tests/qtest/device-plug-test.c
> > +++ b/tests/qtest/device-plug-test.c
> > @@ -91,7 +91,14 @@ static void test_spapr_cpu_unplug_request(void)
> >  {
> >      QTestState *qtest;
> >
> > -    qtest = qtest_initf("-cpu power9_v2.0 -smp 1,maxcpus=2 "
> > +    /*
> > +     * Default smp settings will prioritize sockets over cores and
> > +     * threads, so '-smp 2,maxcpus=2' will add 2 sockets. However,
> > +     * the pseries machine requires a NUMA node for each socket
> > +     * (since 6.0.0). Specify sockets=1 to make life easier.
> > +     */
> > +    qtest = qtest_initf("-cpu power9_v2.0 "
> > +                        "-smp 1,maxcpus=2,threads=1,cores=2,sockets=1 "
> >                          "-device power9_v2.0-spapr-cpu-core,core-id=1,id=dev0");
> >
> >      /* similar to test_pci_unplug_request */
> >
> > diff --git a/tests/qtest/numa-test.c b/tests/qtest/numa-test.c
> > index dc0ec571ca..bb13f7131b 100644
> > --- a/tests/qtest/numa-test.c
> > +++ b/tests/qtest/numa-test.c
> > @@ -24,9 +24,17 @@ static void test_mon_explicit(const void *data)
> >      QTestState *qts;
> >      g_autofree char *s = NULL;
> >      g_autofree char *cli = NULL;
> > +    const char *arch = qtest_get_arch();
> > +
> > +    if (g_str_equal(arch, "ppc64")) {
> > +        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
> > +                       "-numa node,nodeid=0,memdev=ram,cpus=0-3 "
> > +                       "-numa node,nodeid=1,cpus=4-7");
> > +    } else {
> > +        cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 "
> > +                       "-numa node,nodeid=1,cpus=4-7");
> > +    }
> >
> > -    cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 "
> > -                   "-numa node,nodeid=1,cpus=4-7");
> >      qts = qtest_init(cli);
> >
> >      s = qtest_hmp(qts, "info numa");
> > @@ -57,10 +65,18 @@ static void test_mon_partial(const void *data)
> >      QTestState *qts;
> >      g_autofree char *s = NULL;
> >      g_autofree char *cli = NULL;
> > +    const char *arch = qtest_get_arch();
> > +
> > +    if (g_str_equal(arch, "ppc64")) {
> > +        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
> > +                       "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
> > +                       "-numa node,nodeid=1,cpus=4-5 ");
> > +    } else {
> > +        cli = make_cli(data, "-smp 8 "
> > +                       "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
> > +                       "-numa node,nodeid=1,cpus=4-5 ");
> > +    }
> >
> > -    cli = make_cli(data, "-smp 8 "
> > -                   "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
> > -                   "-numa node,nodeid=1,cpus=4-5 ");
> >      qts = qtest_init(cli);
> >
> >      s = qtest_hmp(qts, "info numa");
> > @@ -85,9 +101,17 @@ static void test_query_cpus(const void *data)
> >      QObject *e;
> >      QTestState *qts;
> >      g_autofree char *cli = NULL;
> > +    const char *arch = qtest_get_arch();
> > +
> > +    if (g_str_equal(arch, "ppc64")) {
> > +        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
> > +                       "-numa node,memdev=ram,cpus=0-3 "
> > +                       "-numa node,cpus=4-7");
> > +    } else {
> > +        cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 "
> > +                       "-numa node,cpus=4-7");
> > +    }
> >
> > -    cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 "
> > -                   "-numa node,cpus=4-7");
> >      qts = qtest_init(cli);
> >      cpus = get_cpus(qts, &resp);
> >      g_assert(cpus);
> > @@ -177,7 +201,7 @@ static void spapr_numa_cpu(const void *data)
> >      QTestState *qts;
> >      g_autofree char *cli = NULL;
> >
> > -    cli = make_cli(data, "-smp 4,cores=4 "
> > +    cli = make_cli(data, "-smp 4,threads=1,cores=2,sockets=2 "
> >                     "-numa node,nodeid=0,memdev=ram -numa node,nodeid=1 "
> >                     "-numa cpu,node-id=0,core-id=0 "
> >                     "-numa cpu,node-id=0,core-id=1 "
> > @@ -554,7 +578,17 @@ int main(int argc, char **argv)
> >
> >      g_test_init(&argc, &argv, NULL);
> >
> > -    qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split);
> > +    /*
> > +     * Starting on 6.0.0, for the pseries machine, '-smp 8' will only work
> > +     * if we have 8 NUMA nodes. If we specify 'smp 8,sockets=2' to match
> > +     * 2 NUMA nodes, the CPUs will be split as 0123/4567 instead of
> > +     * 0246/1357 that test_def_cpu_split expects. In short, this test is
> > +     * no longer valid for ppc64 in 6.0.0.
> > +     */
> > +    if (!g_str_equal(arch, "ppc64")) {
> > +        qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split);
> > +    }
> > +
> >      qtest_add_data_func("/numa/mon/cpus/explicit", args, test_mon_explicit);
> >      qtest_add_data_func("/numa/mon/cpus/partial", args, test_mon_partial);
> >      qtest_add_data_func("/numa/qmp/cpus/query-cpus", args, test_query_cpus);
On Mon, 29 Mar 2021 15:32:37 -0300
Daniel Henrique Barboza <danielhb413@gmail.com> wrote:

> On 3/29/21 12:32 PM, Cédric Le Goater wrote:
> > [... earlier discussion quoted above ...]
> >
> > The value of the "ibm,chip-id" since we are keeping the property under
> > QEMU.
>
> David, I believe this has to do with the discussion we had last Friday.
> > I mentioned that the ibm,chip-id property is being calculated in a way that > promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes, > e.g.: > > -smp 4,cores=4,maxcpus=8,threads=1 \ > -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \ > -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 just heads up, 'cpus=' is going away (deprecation patch got lost during 6.0 but I plan on pushing it during 6.1 devel window), you can use '-numa cpus,node-id=a,core-id=b' instead > > > $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id > ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>; > ibm,pft-size = <0x00 0x19>; > ibm,chip-id = <0x00>; > -- > ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>; > ibm,pft-size = <0x00 0x19>; > ibm,chip-id = <0x00>; > -- > ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>; > ibm,pft-size = <0x00 0x19>; > ibm,chip-id = <0x00>; > -- > ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>; > ibm,pft-size = <0x00 0x19>; > ibm,chip-id = <0x00>; > > We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a different > NUMA node than 0-1. This would mean that the same socket would belong to > different NUMA nodes at the same time. > > I believe this is what Cedric wants to be addressed. Given that the property is > called after the OPAL property ibm,chip-id, the kernel expects that the property > will have the same semantics as in OPAL. > > > > Thanks, > > > DHB > > > > > > > >>> This is all coming from some work we did last year to evaluate our HW > >>> (mostly for XIVE) on 2s, 4s, 16s systems on baremetal, KVM and PowerVM. > >>> We saw some real problems because Linux did not have a clear view of the > >>> topology. See the figures here : > >>> > >>> http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210303174857.1760393-9-clg@kaod.org/ > >>> > >>> The node id is a key parameter for system resource management, memory > >>> allocation, interrupt affinity, etc. 
Linux scales much better if used > >>> correctly. > >> > >> Well, sure. And we have all the ibm,associativity stuff to convey the > >> node ids to the guest (which has its own problems, but not that are > >> relevant here). What's throwing me is why getting node IDs correct > >> has anything to do with socket numbers. > > > > >
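For reference, the ibm,associativity arrays quoted earlier in the thread can be decoded mechanically. The sketch below is illustrative only: it assumes (per the comment in QEMU's spapr_numa.c) that the first cell is the array length and the cell at index MAX_DISTANCE_REF_POINTS — 4 for this machine — holds the numa node id; treating the final cell as the vcpu index is an inference from the dump, not something stated in the thread.

```python
def decode_cpu_associativity(cells):
    """Decode one ibm,associativity array as printed by dtc.

    Assumptions (for illustration): cells[0] is the array length,
    cells[4] (MAX_DISTANCE_REF_POINTS) is the numa node id, and for
    CPU nodes the last cell is the vcpu index.
    """
    size = cells[0]
    assert size == len(cells) - 1, "first cell should be the array length"
    return {"numa_id": cells[4], "vcpu": cells[5]}

# The arrays from the dtc dump: CPUs 0-1 in node 0, CPUs 2-3 in node 1.
assert decode_cpu_associativity([5, 0, 0, 0, 0, 1]) == {"numa_id": 0, "vcpu": 1}
assert decode_cpu_associativity([5, 1, 1, 1, 1, 3]) == {"numa_id": 1, "vcpu": 3}
```

This is the sense in which associativity, not chip-id, is what actually carries NUMA placement to the guest.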
On 3/29/21 8:51 PM, Igor Mammedov wrote: > On Tue, 23 Mar 2021 12:03:58 +1100 > David Gibson <david@gibson.dropbear.id.au> wrote: > >> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote: >>> Kernel commit 4bce545903fa ("powerpc/topology: Update >>> topology_core_cpumask") cause a regression in the pseries machine when >>> defining certain SMP topologies [1]. The reasoning behind the change is >>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating >>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with >>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as >>> far as the kernel understanding of SMP topologies goes, both masks are >>> equivalent. >>> >>> Further discussions in the kernel mailing list [2] shown that the >>> powerpc kernel always considered that the number of sockets were equal >>> to the number of NUMA nodes. The claim is that it doesn't make sense, >>> for Power hardware at least, 2+ sockets being in the same NUMA node. The >>> immediate conclusion is that all SMP topologies the pseries machine were >>> supplying to the kernel, with more than one socket in the same NUMA node >>> as in [1], happened to be correctly represented in the kernel by >>> accident during all these years. >>> >>> There's a case to be made for virtual topologies being detached from >>> hardware constraints, allowing maximum flexibility to users. At the same >>> time, this freedom can't result in unrealistic hardware representations >>> being emulated. If the real hardware and the pseries kernel don't >>> support multiple chips/sockets in the same NUMA node, neither should we. >>> >>> Starting in 6.0.0, all sockets must match an unique NUMA node in the >>> pseries machine. qtest changes were made to adapt to this new >>> condition. >> >> Oof. I really don't like this idea. It means a bunch of fiddly work >> for users to match these up, for no real gain. 
I'm also concerned >> that this will require follow on changes in libvirt to not make this a >> really cryptic and irritating point of failure. > > yes, it is annoying to users but we were very slowly tightening limitations > on numa path, especially if it causes problems on guest side when user > specify nonsense configs (and suspect there are few things to enforce in > generic numa code left (currently just warnings)). So it's nothing new. Do you mind elaborating a bit on the 'nonsense configs' bit? Do you mean configs that are unrealistic because they're non-compliant with ACPI? I always thought that all possible SMP/NUMA topologies QEMU allows were ACPI/x86 compliant. If both x86 and PowerPC have restrictions in common we might want to consider putting this code in a common helper/place. > > So if limit applies to new machine type it should be fine (i.e. no existing > VMs will suffer). Later on we can deprecate invalid configurations altogether > and just error out. Hmmm, I'm not warning the user about the deprecated topologies in existing guests in this patch. I guess it doesn't hurt to add a warning for existing VMs. 
Thanks, DHB > >>> >>> [1] https://bugzilla.redhat.com/1934421 >>> [2] https://lore.kernel.org/linuxppc-dev/daa5d05f-dbd0-05ad-7395-5d5a3d364fc6@gmail.com/ >>> >>> CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com> >>> CC: Cédric Le Goater <clg@kaod.org> >>> CC: Igor Mammedov <imammedo@redhat.com> >>> CC: Laurent Vivier <lvivier@redhat.com> >>> CC: Thomas Huth <thuth@redhat.com> >>> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> >>> --- >>> hw/ppc/spapr.c | 3 ++ >>> hw/ppc/spapr_numa.c | 7 +++++ >>> include/hw/ppc/spapr.h | 1 + >>> tests/qtest/cpu-plug-test.c | 4 +-- >>> tests/qtest/device-plug-test.c | 9 +++++- >>> tests/qtest/numa-test.c | 52 ++++++++++++++++++++++++++++------ >>> 6 files changed, 64 insertions(+), 12 deletions(-) >>> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c >>> index d56418ca29..745f71c243 100644 >>> --- a/hw/ppc/spapr.c >>> +++ b/hw/ppc/spapr.c >>> @@ -4611,8 +4611,11 @@ DEFINE_SPAPR_MACHINE(6_0, "6.0", true); >>> */ >>> static void spapr_machine_5_2_class_options(MachineClass *mc) >>> { >>> + SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc); >>> + >>> spapr_machine_6_0_class_options(mc); >>> compat_props_add(mc->compat_props, hw_compat_5_2, hw_compat_5_2_len); >>> + smc->pre_6_0_smp_topology = true; >>> } >>> >>> DEFINE_SPAPR_MACHINE(5_2, "5.2", false); >>> diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c >>> index 779f18b994..0ade43dd79 100644 >>> --- a/hw/ppc/spapr_numa.c >>> +++ b/hw/ppc/spapr_numa.c >>> @@ -163,6 +163,13 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr, >>> int i, j, max_nodes_with_gpus; >>> bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr); >>> >>> + if (!smc->pre_6_0_smp_topology && >>> + nb_numa_nodes != machine->smp.sockets) { >>> + error_report("Number of CPU sockets must be equal to the number " >>> + "of NUMA nodes"); >>> + exit(EXIT_FAILURE); >>> + } >>> + >>> /* >>> * For all associativity arrays: first position is the size, >>> * position 
MAX_DISTANCE_REF_POINTS is always the numa_id, >>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h >>> index 47cebaf3ac..98dc5d198a 100644 >>> --- a/include/hw/ppc/spapr.h >>> +++ b/include/hw/ppc/spapr.h >>> @@ -142,6 +142,7 @@ struct SpaprMachineClass { >>> hwaddr rma_limit; /* clamp the RMA to this size */ >>> bool pre_5_1_assoc_refpoints; >>> bool pre_5_2_numa_associativity; >>> + bool pre_6_0_smp_topology; >>> >>> bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index, >>> uint64_t *buid, hwaddr *pio, >>> diff --git a/tests/qtest/cpu-plug-test.c b/tests/qtest/cpu-plug-test.c >>> index a1c689414b..946b9129ea 100644 >>> --- a/tests/qtest/cpu-plug-test.c >>> +++ b/tests/qtest/cpu-plug-test.c >>> @@ -118,8 +118,8 @@ static void add_pseries_test_case(const char *mname) >>> data->machine = g_strdup(mname); >>> data->cpu_model = "power8_v2.0"; >>> data->device_model = g_strdup("power8_v2.0-spapr-cpu-core"); >>> - data->sockets = 2; >>> - data->cores = 3; >>> + data->sockets = 1; >>> + data->cores = 6; >>> data->threads = 1; >>> data->maxcpus = data->sockets * data->cores * data->threads; >>> >>> diff --git a/tests/qtest/device-plug-test.c b/tests/qtest/device-plug-test.c >>> index 559d47727a..dd7d8268d2 100644 >>> --- a/tests/qtest/device-plug-test.c >>> +++ b/tests/qtest/device-plug-test.c >>> @@ -91,7 +91,14 @@ static void test_spapr_cpu_unplug_request(void) >>> { >>> QTestState *qtest; >>> >>> - qtest = qtest_initf("-cpu power9_v2.0 -smp 1,maxcpus=2 " >>> + /* >>> + * Default smp settings will prioritize sockets over cores and >>> + * threads, so '-smp 2,maxcpus=2' will add 2 sockets. However, >>> + * the pseries machine requires a NUMA node for each socket >>> + * (since 6.0.0). Specify sockets=1 to make life easier. 
>>> + */ >>> + qtest = qtest_initf("-cpu power9_v2.0 " >>> + "-smp 1,maxcpus=2,threads=1,cores=2,sockets=1 " >>> "-device power9_v2.0-spapr-cpu-core,core-id=1,id=dev0"); >>> >>> /* similar to test_pci_unplug_request */ >>> diff --git a/tests/qtest/numa-test.c b/tests/qtest/numa-test.c >>> index dc0ec571ca..bb13f7131b 100644 >>> --- a/tests/qtest/numa-test.c >>> +++ b/tests/qtest/numa-test.c >>> @@ -24,9 +24,17 @@ static void test_mon_explicit(const void *data) >>> QTestState *qts; >>> g_autofree char *s = NULL; >>> g_autofree char *cli = NULL; >>> + const char *arch = qtest_get_arch(); >>> + >>> + if (g_str_equal(arch, "ppc64")) { >>> + cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 " >>> + "-numa node,nodeid=0,memdev=ram,cpus=0-3 " >>> + "-numa node,nodeid=1,cpus=4-7"); >>> + } else { >>> + cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 " >>> + "-numa node,nodeid=1,cpus=4-7"); >>> + } >>> >>> - cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 " >>> - "-numa node,nodeid=1,cpus=4-7"); >>> qts = qtest_init(cli); >>> >>> s = qtest_hmp(qts, "info numa"); >>> @@ -57,10 +65,18 @@ static void test_mon_partial(const void *data) >>> QTestState *qts; >>> g_autofree char *s = NULL; >>> g_autofree char *cli = NULL; >>> + const char *arch = qtest_get_arch(); >>> + >>> + if (g_str_equal(arch, "ppc64")) { >>> + cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 " >>> + "-numa node,nodeid=0,memdev=ram,cpus=0-1 " >>> + "-numa node,nodeid=1,cpus=4-5 "); >>> + } else { >>> + cli = make_cli(data, "-smp 8 " >>> + "-numa node,nodeid=0,memdev=ram,cpus=0-1 " >>> + "-numa node,nodeid=1,cpus=4-5 "); >>> + } >>> >>> - cli = make_cli(data, "-smp 8 " >>> - "-numa node,nodeid=0,memdev=ram,cpus=0-1 " >>> - "-numa node,nodeid=1,cpus=4-5 "); >>> qts = qtest_init(cli); >>> >>> s = qtest_hmp(qts, "info numa"); >>> @@ -85,9 +101,17 @@ static void test_query_cpus(const void *data) >>> QObject *e; >>> QTestState *qts; >>> g_autofree char *cli = 
NULL; >>> + const char *arch = qtest_get_arch(); >>> + >>> + if (g_str_equal(arch, "ppc64")) { >>> + cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 " >>> + "-numa node,memdev=ram,cpus=0-3 " >>> + "-numa node,cpus=4-7"); >>> + } else { >>> + cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 " >>> + "-numa node,cpus=4-7"); >>> + } >>> >>> - cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 " >>> - "-numa node,cpus=4-7"); >>> qts = qtest_init(cli); >>> cpus = get_cpus(qts, &resp); >>> g_assert(cpus); >>> @@ -177,7 +201,7 @@ static void spapr_numa_cpu(const void *data) >>> QTestState *qts; >>> g_autofree char *cli = NULL; >>> >>> - cli = make_cli(data, "-smp 4,cores=4 " >>> + cli = make_cli(data, "-smp 4,threads=1,cores=2,sockets=2 " >>> "-numa node,nodeid=0,memdev=ram -numa node,nodeid=1 " >>> "-numa cpu,node-id=0,core-id=0 " >>> "-numa cpu,node-id=0,core-id=1 " >>> @@ -554,7 +578,17 @@ int main(int argc, char **argv) >>> >>> g_test_init(&argc, &argv, NULL); >>> >>> - qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split); >>> + /* >>> + * Starting on 6.0.0, for the pseries machine, '-smp 8' will only work >>> + * if we have 8 NUMA nodes. If we specify 'smp 8,sockets=2' to match >>> + * 2 NUMA nodes, the CPUs will be split as 0123/4567 instead of >>> + * 0246/1357 that test_def_cpu_split expects. In short, this test is >>> + * no longer valid for ppc64 in 6.0.0. >>> + */ >>> + if (!g_str_equal(arch, "ppc64")) { >>> + qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split); >>> + } >>> + >>> qtest_add_data_func("/numa/mon/cpus/explicit", args, test_mon_explicit); >>> qtest_add_data_func("/numa/mon/cpus/partial", args, test_mon_partial); >>> qtest_add_data_func("/numa/qmp/cpus/query-cpus", args, test_query_cpus); >> >
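The constraint the patch enforces in spapr_numa_associativity_init() can be sketched outside of QEMU in a few lines. This is a hypothetical Python mirror of the C check, written for illustration — the function name and shape are mine, only the error message and the compat-flag behavior come from the patch above.

```python
def validate_pseries_topology(sockets, numa_nodes, pre_6_0_smp_topology=False):
    """Sketch of the check the patch adds: from pseries-6.0 onward,
    each CPU socket must map to exactly one NUMA node. Older machine
    types opt out via the pre_6_0_smp_topology compat flag."""
    if not pre_6_0_smp_topology and numa_nodes != sockets:
        raise ValueError("Number of CPU sockets must be equal to the "
                         "number of NUMA nodes")

# e.g. '-smp 8,sockets=2' with two '-numa node' entries is accepted:
validate_pseries_topology(sockets=2, numa_nodes=2)

# '-smp 8,sockets=4' with two nodes would be rejected on >= 6.0 machines,
# but pseries-5.2 keeps accepting it through the compat flag:
validate_pseries_topology(sockets=4, numa_nodes=2, pre_6_0_smp_topology=True)
```

Note that the real check runs at machine init and calls exit(), so the failure is fatal rather than a catchable error.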
On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote: > > > On 3/29/21 12:32 PM, Cédric Le Goater wrote: > > On 3/29/21 6:20 AM, David Gibson wrote: > > > On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote: > > > > On 3/25/21 3:10 AM, David Gibson wrote: > > > > > On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote: > > > > > > > > > > > > > > > > > > On 3/22/21 10:03 PM, David Gibson wrote: > > > > > > > On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote: > > > > > > > > Kernel commit 4bce545903fa ("powerpc/topology: Update > > > > > > > > topology_core_cpumask") cause a regression in the pseries machine when > > > > > > > > defining certain SMP topologies [1]. The reasoning behind the change is > > > > > > > > explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating > > > > > > > > cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with > > > > > > > > large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as > > > > > > > > far as the kernel understanding of SMP topologies goes, both masks are > > > > > > > > equivalent. > > > > > > > > > > > > > > > > Further discussions in the kernel mailing list [2] shown that the > > > > > > > > powerpc kernel always considered that the number of sockets were equal > > > > > > > > to the number of NUMA nodes. The claim is that it doesn't make sense, > > > > > > > > for Power hardware at least, 2+ sockets being in the same NUMA node. The > > > > > > > > immediate conclusion is that all SMP topologies the pseries machine were > > > > > > > > supplying to the kernel, with more than one socket in the same NUMA node > > > > > > > > as in [1], happened to be correctly represented in the kernel by > > > > > > > > accident during all these years. 
> > > > > > > > > > > > > > > > There's a case to be made for virtual topologies being detached from > > > > > > > > hardware constraints, allowing maximum flexibility to users. At the same > > > > > > > > time, this freedom can't result in unrealistic hardware representations > > > > > > > > being emulated. If the real hardware and the pseries kernel don't > > > > > > > > support multiple chips/sockets in the same NUMA node, neither should we. > > > > > > > > > > > > > > > > Starting in 6.0.0, all sockets must match an unique NUMA node in the > > > > > > > > pseries machine. qtest changes were made to adapt to this new > > > > > > > > condition. > > > > > > > > > > > > > > Oof. I really don't like this idea. It means a bunch of fiddly work > > > > > > > for users to match these up, for no real gain. I'm also concerned > > > > > > > that this will require follow on changes in libvirt to not make this a > > > > > > > really cryptic and irritating point of failure. > > > > > > > > > > > > Haven't though about required Libvirt changes, although I can say that there > > > > > > will be some amount to be mande and it will probably annoy existing users > > > > > > (everyone that has a multiple socket per NUMA node topology). > > > > > > > > > > > > There is not much we can do from the QEMU layer aside from what I've proposed > > > > > > here. The other alternative is to keep interacting with the kernel folks to > > > > > > see if there is a way to keep our use case untouched. > > > > > > > > > > Right. Well.. not necessarily untouched, but I'm hoping for more > > > > > replies from Cédric to my objections and mpe's. Even with sockets > > > > > being a kinda meaningless concept in PAPR, I don't think tying it to > > > > > NUMA nodes makes sense. > > > > > > > > I did a couple of replies in different email threads but maybe not > > > > to all. I felt it was going nowhere :/ Couple of thoughts, > > > > > > I think I saw some of those, but maybe not all. 
> > > > > > > Shouldn't we get rid of the socket concept, die also, under pseries > > > > since they don't exist under PAPR ? We only have numa nodes, cores, > > > > threads AFAICT. > > > > > > Theoretically, yes. I'm not sure it's really practical, though, since > > > AFAICT, both qemu and the kernel have the notion of sockets (though > > > not dies) built into generic code. > > > > Yes. But, AFAICT, these topology notions have not reached "arch/powerpc" > > and PPC Linux only has a NUMA node id, on pseries and powernv. > > > > > It does mean that one possible approach here - maybe the best one - is > > > to simply declare that sockets are meaningless under, so we simply > > > don't expect what the guest kernel reports to match what's given to > > > qemu. > > > > > > It'd be nice to avoid that if we can: in a sense it's just cosmetic, > > > but it is likely to surprise and confuse people. > > > > > > > Should we diverged from PAPR and add extra DT properties "qemu,..." ? > > > > There are a couple of places where Linux checks for the underlying > > > > hypervisor already. > > > > > > > > > > This also means that > > > > > > 'ibm,chip-id' will probably remain in use since it's the only place where > > > > > > we inform cores per socket information to the kernel. > > > > > > > > > > Well.. unless we can find some other sensible way to convey that > > > > > information. I haven't given up hope for that yet. > > > > > > > > Well, we could start by fixing the value in QEMU. It is broken > > > > today. > > > > > > Fixing what value, exactly? > > > > The value of the "ibm,chip-id" since we are keeping the property under > > QEMU. > > David, I believe this has to do with the discussing we had last Friday. 
> > I mentioned that the ibm,chip-id property is being calculated in a way that > promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes, > e.g.: > > -smp 4,cores=4,maxcpus=8,threads=1 \ > -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \ > -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > > > $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id > ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>; > ibm,pft-size = <0x00 0x19>; > ibm,chip-id = <0x00>; > -- > ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>; > ibm,pft-size = <0x00 0x19>; > ibm,chip-id = <0x00>; > -- > ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>; > ibm,pft-size = <0x00 0x19>; > ibm,chip-id = <0x00>; > -- > ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>; > ibm,pft-size = <0x00 0x19>; > ibm,chip-id = <0x00>; > We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a > different NUMA node than 0-1. This would mean that the same socket > would belong to different NUMA nodes at the same time. Right... and I'm still not seeing why that's a problem. AFAICT that's a possible, if unexpected, situation under real hardware - though maybe not for POWER9 specifically. > I believe this is what Cedric wants to be addressed. Given that the > property is called after the OPAL property ibm,chip-id, the kernel > expects that the property will have the same semantics as in OPAL. Even on powernv, I'm not clear why chip-id is tied into the NUMA configuration, rather than getting all the NUMA info from associativity properties.
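The chip-id/node mismatch Daniel describes can be reproduced numerically. The sketch below assumes QEMU derives ibm,chip-id by integer-dividing the cpu index by the number of threads per socket — an assumption made purely for illustration (the actual derivation lives in QEMU's spapr code), chosen because it reproduces the dtc output quoted above.

```python
def chip_id(cpu_index, cores_per_socket, threads_per_core):
    # Assumed rule for illustration: consecutive cores fill one socket
    # before the next socket starts.
    return cpu_index // (cores_per_socket * threads_per_core)

# -smp 4,cores=4,maxcpus=8,threads=1 with -numa placing
# CPUs 0-1 in node 0 and CPUs 2-3 in node 1:
node_of = {0: 0, 1: 0, 2: 1, 3: 1}
chips = {cpu: chip_id(cpu, cores_per_socket=4, threads_per_core=1)
         for cpu in node_of}

# All four CPUs land in chip 0, matching the four ibm,chip-id = <0x00>
# entries in the dtc dump -- yet chip 0 spans two NUMA nodes:
assert set(chips.values()) == {0}
assert {node_of[cpu] for cpu in chips} == {0, 1}
```

Whether that final state is a bug or merely an unusual-but-legal topology is exactly the disagreement in this thread.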
David Gibson <david@gibson.dropbear.id.au> writes: > On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote: ... > >> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a >> different NUMA node than 0-1. This would mean that the same socket >> would belong to different NUMA nodes at the same time. > > Right... and I'm still not seeing why that's a problem. AFAICT that's > a possible, if unexpected, situation under real hardware - though > maybe not for POWER9 specifically. I think I agree. >> I believe this is what Cedric wants to be addressed. Given that the >> property is called after the OPAL property ibm,chip-id, the kernel >> expects that the property will have the same semantics as in OPAL. > > Even on powernv, I'm not clear why chip-id is tied into the NUMA > configuration, rather than getting all the NUMA info from > associativity properties. AFAIK we don't use chip-id for anything related to NUMA, if we do I'd consider that a bug. We do use it for topology_physical_package_id(), but that's almost completely unused. cheers
On 3/31/21 2:57 AM, David Gibson wrote: > On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote: >> >> >> On 3/29/21 12:32 PM, Cédric Le Goater wrote: >>> On 3/29/21 6:20 AM, David Gibson wrote: >>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote: >>>>> On 3/25/21 3:10 AM, David Gibson wrote: >>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote: >>>>>>> >>>>>>> >>>>>>> On 3/22/21 10:03 PM, David Gibson wrote: >>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote: >>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update >>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when >>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is >>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating >>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with >>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as >>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are >>>>>>>>> equivalent. >>>>>>>>> >>>>>>>>> Further discussions in the kernel mailing list [2] shown that the >>>>>>>>> powerpc kernel always considered that the number of sockets were equal >>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense, >>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The >>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were >>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node >>>>>>>>> as in [1], happened to be correctly represented in the kernel by >>>>>>>>> accident during all these years. >>>>>>>>> >>>>>>>>> There's a case to be made for virtual topologies being detached from >>>>>>>>> hardware constraints, allowing maximum flexibility to users. 
At the same >>>>>>>>> time, this freedom can't result in unrealistic hardware representations >>>>>>>>> being emulated. If the real hardware and the pseries kernel don't >>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we. >>>>>>>>> >>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the >>>>>>>>> pseries machine. qtest changes were made to adapt to this new >>>>>>>>> condition. >>>>>>>> >>>>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work >>>>>>>> for users to match these up, for no real gain. I'm also concerned >>>>>>>> that this will require follow on changes in libvirt to not make this a >>>>>>>> really cryptic and irritating point of failure. >>>>>>> >>>>>>> Haven't though about required Libvirt changes, although I can say that there >>>>>>> will be some amount to be mande and it will probably annoy existing users >>>>>>> (everyone that has a multiple socket per NUMA node topology). >>>>>>> >>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed >>>>>>> here. The other alternative is to keep interacting with the kernel folks to >>>>>>> see if there is a way to keep our use case untouched. >>>>>> >>>>>> Right. Well.. not necessarily untouched, but I'm hoping for more >>>>>> replies from Cédric to my objections and mpe's. Even with sockets >>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to >>>>>> NUMA nodes makes sense. >>>>> >>>>> I did a couple of replies in different email threads but maybe not >>>>> to all. I felt it was going nowhere :/ Couple of thoughts, >>>> >>>> I think I saw some of those, but maybe not all. >>>> >>>>> Shouldn't we get rid of the socket concept, die also, under pseries >>>>> since they don't exist under PAPR ? We only have numa nodes, cores, >>>>> threads AFAICT. >>>> >>>> Theoretically, yes. 
I'm not sure it's really practical, though, since >>>> AFAICT, both qemu and the kernel have the notion of sockets (though >>>> not dies) built into generic code. >>> >>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc" >>> and PPC Linux only has a NUMA node id, on pseries and powernv. >>> >>>> It does mean that one possible approach here - maybe the best one - is >>>> to simply declare that sockets are meaningless under, so we simply >>>> don't expect what the guest kernel reports to match what's given to >>>> qemu. >>>> >>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic, >>>> but it is likely to surprise and confuse people. >>>> >>>>> Should we diverged from PAPR and add extra DT properties "qemu,..." ? >>>>> There are a couple of places where Linux checks for the underlying >>>>> hypervisor already. >>>>> >>>>>>> This also means that >>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where >>>>>>> we inform cores per socket information to the kernel. >>>>>> >>>>>> Well.. unless we can find some other sensible way to convey that >>>>>> information. I haven't given up hope for that yet. >>>>> >>>>> Well, we could start by fixing the value in QEMU. It is broken >>>>> today. >>>> >>>> Fixing what value, exactly? >>> >>> The value of the "ibm,chip-id" since we are keeping the property under >>> QEMU. >> >> David, I believe this has to do with the discussing we had last Friday. 
>> >> I mentioned that the ibm,chip-id property is being calculated in a way that >> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes, >> e.g.: >> >> -smp 4,cores=4,maxcpus=8,threads=1 \ >> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \ >> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >> >> >> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id >> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>; >> ibm,pft-size = <0x00 0x19>; >> ibm,chip-id = <0x00>; >> -- >> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>; >> ibm,pft-size = <0x00 0x19>; >> ibm,chip-id = <0x00>; >> -- >> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>; >> ibm,pft-size = <0x00 0x19>; >> ibm,chip-id = <0x00>; >> -- >> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>; >> ibm,pft-size = <0x00 0x19>; >> ibm,chip-id = <0x00>; > >> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a >> different NUMA node than 0-1. This would mean that the same socket >> would belong to different NUMA nodes at the same time. > > Right... and I'm still not seeing why that's a problem. AFAICT that's > a possible, if unexpected, situation under real hardware - though > maybe not for POWER9 specifically. The ibm,chip-id property does not exist under PAPR. PAPR only has NUMA nodes, no sockets nor chips. And the property value is simply broken under QEMU. Try this : -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 # dmesg | grep numa [ 0.013106] numa: Node 0 CPUs: 0-1 [ 0.013136] numa: Node 1 CPUs: 2-3 # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id ibm,chip-id = <0x01>; ibm,chip-id = <0x02>; ibm,chip-id = <0x00>; ibm,chip-id = <0x03>; >> I believe this is what Cedric wants to be addressed. 
Given that the >> property is called after the OPAL property ibm,chip-id, the kernel >> expects that the property will have the same semantics as in OPAL.> > Even on powernv, I'm not clear why chip-id is tied into the NUMA > configuration, rather than getting all the NUMA info from > associativity properties. It is the case. The associativity properties are built from chip-id in OPAL though. The chip-id property is only used in low level PowerNV drivers, VAS, XSCOM, LPC, etc. It's also badly used in the common part of the XIVE driver, what I am trying to fix to introduce an IPI per node on all platforms. C.
On 3/31/21 6:58 AM, Michael Ellerman wrote: > David Gibson <david@gibson.dropbear.id.au> writes: >> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote: > ... >> >>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a >>> different NUMA node than 0-1. This would mean that the same socket >>> would belong to different NUMA nodes at the same time. >> >> Right... and I'm still not seeing why that's a problem. AFAICT that's >> a possible, if unexpected, situation under real hardware - though >> maybe not for POWER9 specifically. > > I think I agree. > >>> I believe this is what Cedric wants to be addressed. Given that the >>> property is called after the OPAL property ibm,chip-id, the kernel >>> expects that the property will have the same semantics as in OPAL. >> >> Even on powernv, I'm not clear why chip-id is tied into the NUMA >> configuration, rather than getting all the NUMA info from >> associativity properties. > > AFAIK we don't use chip-id for anything related to NUMA, if we do I'd > consider that a bug. Since PAPR only has NUMA nodes, is the use of chip-id in XIVE PAPR considered a bug? I would say so. > We do use it for topology_physical_package_id(), but that's almost > completely unused. In that case, I think it should be fine to return -1 like under PowerVM. Thanks, C.
On 3/31/21 12:18 PM, Cédric Le Goater wrote: > On 3/31/21 2:57 AM, David Gibson wrote: >> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote: >>> >>> >>> On 3/29/21 12:32 PM, Cédric Le Goater wrote: >>>> On 3/29/21 6:20 AM, David Gibson wrote: >>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote: >>>>>> On 3/25/21 3:10 AM, David Gibson wrote: >>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote: >>>>>>>> >>>>>>>> >>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote: >>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote: >>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update >>>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when >>>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is >>>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating >>>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with >>>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as >>>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are >>>>>>>>>> equivalent. >>>>>>>>>> >>>>>>>>>> Further discussions in the kernel mailing list [2] shown that the >>>>>>>>>> powerpc kernel always considered that the number of sockets were equal >>>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense, >>>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The >>>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were >>>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node >>>>>>>>>> as in [1], happened to be correctly represented in the kernel by >>>>>>>>>> accident during all these years. >>>>>>>>>> >>>>>>>>>> There's a case to be made for virtual topologies being detached from >>>>>>>>>> hardware constraints, allowing maximum flexibility to users. 
At the same >>>>>>>>>> time, this freedom can't result in unrealistic hardware representations >>>>>>>>>> being emulated. If the real hardware and the pseries kernel don't >>>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we. >>>>>>>>>> >>>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the >>>>>>>>>> pseries machine. qtest changes were made to adapt to this new >>>>>>>>>> condition. >>>>>>>>> >>>>>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work >>>>>>>>> for users to match these up, for no real gain. I'm also concerned >>>>>>>>> that this will require follow on changes in libvirt to not make this a >>>>>>>>> really cryptic and irritating point of failure. >>>>>>>> >>>>>>>> Haven't though about required Libvirt changes, although I can say that there >>>>>>>> will be some amount to be mande and it will probably annoy existing users >>>>>>>> (everyone that has a multiple socket per NUMA node topology). >>>>>>>> >>>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed >>>>>>>> here. The other alternative is to keep interacting with the kernel folks to >>>>>>>> see if there is a way to keep our use case untouched. >>>>>>> >>>>>>> Right. Well.. not necessarily untouched, but I'm hoping for more >>>>>>> replies from Cédric to my objections and mpe's. Even with sockets >>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to >>>>>>> NUMA nodes makes sense. >>>>>> >>>>>> I did a couple of replies in different email threads but maybe not >>>>>> to all. I felt it was going nowhere :/ Couple of thoughts, >>>>> >>>>> I think I saw some of those, but maybe not all. >>>>> >>>>>> Shouldn't we get rid of the socket concept, die also, under pseries >>>>>> since they don't exist under PAPR ? We only have numa nodes, cores, >>>>>> threads AFAICT. >>>>> >>>>> Theoretically, yes. 
I'm not sure it's really practical, though, since >>>>> AFAICT, both qemu and the kernel have the notion of sockets (though >>>>> not dies) built into generic code. >>>> >>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc" >>>> and PPC Linux only has a NUMA node id, on pseries and powernv. >>>> >>>>> It does mean that one possible approach here - maybe the best one - is >>>>> to simply declare that sockets are meaningless under, so we simply >>>>> don't expect what the guest kernel reports to match what's given to >>>>> qemu. >>>>> >>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic, >>>>> but it is likely to surprise and confuse people. >>>>> >>>>>> Should we diverged from PAPR and add extra DT properties "qemu,..." ? >>>>>> There are a couple of places where Linux checks for the underlying >>>>>> hypervisor already. >>>>>> >>>>>>>> This also means that >>>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where >>>>>>>> we inform cores per socket information to the kernel. >>>>>>> >>>>>>> Well.. unless we can find some other sensible way to convey that >>>>>>> information. I haven't given up hope for that yet. >>>>>> >>>>>> Well, we could start by fixing the value in QEMU. It is broken >>>>>> today. >>>>> >>>>> Fixing what value, exactly? >>>> >>>> The value of the "ibm,chip-id" since we are keeping the property under >>>> QEMU. >>> >>> David, I believe this has to do with the discussing we had last Friday. 
>>> >>> I mentioned that the ibm,chip-id property is being calculated in a way that >>> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes, >>> e.g.: >>> >>> -smp 4,cores=4,maxcpus=8,threads=1 \ >>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \ >>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>> >>> >>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id >>> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>; >>> ibm,pft-size = <0x00 0x19>; >>> ibm,chip-id = <0x00>; >>> -- >>> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>; >>> ibm,pft-size = <0x00 0x19>; >>> ibm,chip-id = <0x00>; >>> -- >>> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>; >>> ibm,pft-size = <0x00 0x19>; >>> ibm,chip-id = <0x00>; >>> -- >>> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>; >>> ibm,pft-size = <0x00 0x19>; >>> ibm,chip-id = <0x00>; >> >>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a >>> different NUMA node than 0-1. This would mean that the same socket >>> would belong to different NUMA nodes at the same time. >> >> Right... and I'm still not seeing why that's a problem. AFAICT that's >> a possible, if unexpected, situation under real hardware - though >> maybe not for POWER9 specifically. > The ibm,chip-id property does not exist under PAPR. PAPR only has > NUMA nodes, no sockets nor chips. > > And the property value is simply broken under QEMU. Try this : > > -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 > > # dmesg | grep numa > [ 0.013106] numa: Node 0 CPUs: 0-1 > [ 0.013136] numa: Node 1 CPUs: 2-3 > > # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id > ibm,chip-id = <0x01>; > ibm,chip-id = <0x02>; > ibm,chip-id = <0x00>; > ibm,chip-id = <0x03>; These values are not wrong. 
When you do:

-smp 4,cores=1,maxcpus=8 (....)

You didn't fill in threads and sockets. QEMU's default is to prioritize
sockets to fill in the missing information, up to the maxcpus value. This
means that what you did is equivalent to:

-smp 4,threads=1,cores=1,sockets=8,maxcpus=8 (....)

That is a config with 1 thread per core, 1 core per socket, and 8 sockets.
Each possible CPU will sit in its own core, having its own ibm,chip-id. So:

"-numa node,nodeid=0,cpus=0-1"

is in fact allocating sockets 0 and 1 to NUMA node 0.

Thanks,

DHB

>
>>> I believe this is what Cedric wants to be addressed. Given that the
>>> property is called after the OPAL property ibm,chip-id, the kernel
>>> expects that the property will have the same semantics as in OPAL.>
>> Even on powernv, I'm not clear why chip-id is tied into the NUMA
>> configuration, rather than getting all the NUMA info from
>> associativity properties.
>
> It is the case.
>
> The associativity properties are built from chip-id in OPAL though.
>
> The chip-id property is only used in low level PowerNV drivers, VAS,
> XSCOM, LPC, etc.
>
> It's also badly used in the common part of the XIVE driver, what I am
> trying to fix to introduce an IPI per node on all platforms.
>
> C.
>
>
>
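Daniel's explanation can be sketched in Python. This is not QEMU source; fill_smp and chip_id are hypothetical helpers modeling the socket-first completion of a partial -smp spec and the vcpus-per-socket division that spapr uses when emitting ibm,chip-id:

```python
def fill_smp(cpus, maxcpus=None, threads=None, cores=None, sockets=None):
    """Mimic QEMU's socket-first completion of a partial -smp spec:
    unspecified threads/cores default to 1, and the leftover factor of
    maxcpus is assigned to sockets."""
    maxcpus = maxcpus or cpus
    threads = threads or 1
    cores = cores or 1
    if sockets is None:
        sockets = maxcpus // (cores * threads)
    return {"threads": threads, "cores": cores, "sockets": sockets,
            "maxcpus": maxcpus}

def chip_id(cpu_index, smp):
    """One chip-id per socket: divide the cpu index by vcpus-per-socket."""
    return cpu_index // (smp["cores"] * smp["threads"])

# Cedric's command line: -smp 4,cores=1,maxcpus=8
smp = fill_smp(4, maxcpus=8, cores=1)
print(smp["sockets"])                        # 8: every possible CPU is a socket
print([chip_id(i, smp) for i in range(4)])   # [0, 1, 2, 3]

# Daniel's earlier example: -smp 4,cores=4,maxcpus=8,threads=1
smp = fill_smp(4, maxcpus=8, cores=4, threads=1)
print([chip_id(i, smp) for i in range(4)])   # [0, 0, 0, 0]: all in socket 0
```

With cores=1, "-numa node,nodeid=0,cpus=0-1" therefore spans two sockets, which is the situation this series would reject.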
On 3/31/21 7:29 PM, Daniel Henrique Barboza wrote: > > On 3/31/21 12:18 PM, Cédric Le Goater wrote: >> On 3/31/21 2:57 AM, David Gibson wrote: >>> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote: >>>> >>>> >>>> On 3/29/21 12:32 PM, Cédric Le Goater wrote: >>>>> On 3/29/21 6:20 AM, David Gibson wrote: >>>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote: >>>>>>> On 3/25/21 3:10 AM, David Gibson wrote: >>>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote: >>>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote: >>>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update >>>>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when >>>>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is >>>>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating >>>>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with >>>>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as >>>>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are >>>>>>>>>>> equivalent. >>>>>>>>>>> >>>>>>>>>>> Further discussions in the kernel mailing list [2] shown that the >>>>>>>>>>> powerpc kernel always considered that the number of sockets were equal >>>>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense, >>>>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The >>>>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were >>>>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node >>>>>>>>>>> as in [1], happened to be correctly represented in the kernel by >>>>>>>>>>> accident during all these years. 
>>>>>>>>>>> >>>>>>>>>>> There's a case to be made for virtual topologies being detached from >>>>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the same >>>>>>>>>>> time, this freedom can't result in unrealistic hardware representations >>>>>>>>>>> being emulated. If the real hardware and the pseries kernel don't >>>>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we. >>>>>>>>>>> >>>>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the >>>>>>>>>>> pseries machine. qtest changes were made to adapt to this new >>>>>>>>>>> condition. >>>>>>>>>> >>>>>>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work >>>>>>>>>> for users to match these up, for no real gain. I'm also concerned >>>>>>>>>> that this will require follow on changes in libvirt to not make this a >>>>>>>>>> really cryptic and irritating point of failure. >>>>>>>>> >>>>>>>>> Haven't though about required Libvirt changes, although I can say that there >>>>>>>>> will be some amount to be mande and it will probably annoy existing users >>>>>>>>> (everyone that has a multiple socket per NUMA node topology). >>>>>>>>> >>>>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed >>>>>>>>> here. The other alternative is to keep interacting with the kernel folks to >>>>>>>>> see if there is a way to keep our use case untouched. >>>>>>>> >>>>>>>> Right. Well.. not necessarily untouched, but I'm hoping for more >>>>>>>> replies from Cédric to my objections and mpe's. Even with sockets >>>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to >>>>>>>> NUMA nodes makes sense. >>>>>>> >>>>>>> I did a couple of replies in different email threads but maybe not >>>>>>> to all. I felt it was going nowhere :/ Couple of thoughts, >>>>>> >>>>>> I think I saw some of those, but maybe not all. 
>>>>>> >>>>>>> Shouldn't we get rid of the socket concept, die also, under pseries >>>>>>> since they don't exist under PAPR ? We only have numa nodes, cores, >>>>>>> threads AFAICT. >>>>>> >>>>>> Theoretically, yes. I'm not sure it's really practical, though, since >>>>>> AFAICT, both qemu and the kernel have the notion of sockets (though >>>>>> not dies) built into generic code. >>>>> >>>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc" >>>>> and PPC Linux only has a NUMA node id, on pseries and powernv. >>>>> >>>>>> It does mean that one possible approach here - maybe the best one - is >>>>>> to simply declare that sockets are meaningless under, so we simply >>>>>> don't expect what the guest kernel reports to match what's given to >>>>>> qemu. >>>>>> >>>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic, >>>>>> but it is likely to surprise and confuse people. >>>>>> >>>>>>> Should we diverged from PAPR and add extra DT properties "qemu,..." ? >>>>>>> There are a couple of places where Linux checks for the underlying >>>>>>> hypervisor already. >>>>>>> >>>>>>>>> This also means that >>>>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where >>>>>>>>> we inform cores per socket information to the kernel. >>>>>>>> >>>>>>>> Well.. unless we can find some other sensible way to convey that >>>>>>>> information. I haven't given up hope for that yet. >>>>>>> >>>>>>> Well, we could start by fixing the value in QEMU. It is broken >>>>>>> today. >>>>>> >>>>>> Fixing what value, exactly? >>>>> >>>>> The value of the "ibm,chip-id" since we are keeping the property under >>>>> QEMU. >>>> >>>> David, I believe this has to do with the discussing we had last Friday. 
>>>> >>>> I mentioned that the ibm,chip-id property is being calculated in a way that >>>> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes, >>>> e.g.: >>>> >>>> -smp 4,cores=4,maxcpus=8,threads=1 \ >>>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \ >>>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>>> >>>> >>>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id >>>> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>; >>>> ibm,pft-size = <0x00 0x19>; >>>> ibm,chip-id = <0x00>; >>>> -- >>>> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x01>; >>>> ibm,pft-size = <0x00 0x19>; >>>> ibm,chip-id = <0x00>; >>>> -- >>>> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x02>; >>>> ibm,pft-size = <0x00 0x19>; >>>> ibm,chip-id = <0x00>; >>>> -- >>>> ibm,associativity = <0x05 0x01 0x01 0x01 0x01 0x03>; >>>> ibm,pft-size = <0x00 0x19>; >>>> ibm,chip-id = <0x00>; >>> >>>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a >>>> different NUMA node than 0-1. This would mean that the same socket >>>> would belong to different NUMA nodes at the same time. >>> >>> Right... and I'm still not seeing why that's a problem. AFAICT that's >>> a possible, if unexpected, situation under real hardware - though >>> maybe not for POWER9 specifically. >> The ibm,chip-id property does not exist under PAPR. PAPR only has >> NUMA nodes, no sockets nor chips. >> >> And the property value is simply broken under QEMU. 
Try this : >> >> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >> >> # dmesg | grep numa >> [ 0.013106] numa: Node 0 CPUs: 0-1 >> [ 0.013136] numa: Node 1 CPUs: 2-3 >> >> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >> ibm,chip-id = <0x01>; >> ibm,chip-id = <0x02>; >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x03>; > > These values are not wrong. When you do: > > -smp 4,cores=1,maxcpus=8 (....) > > You didn't fill threads and sockets. QEMU default is to prioritize sockets > to fill the missing information, up to the maxcpus value. This means that what > you did is equivalent to: > > -smp 4,threads=1,cores=1,sockets=8,maxcpus=8 (....) > > > It's a 1 thread/core, 1 core/socket with 8 sockets config. Each possible CPU > will sit in its own core, having its own ibm,chip-id. So: > > "-numa node,nodeid=0,cpus=0-1" > > is in fact allocating sockets 0 and 1 to NUMA node 0. ok. Thanks, C.
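The situation just confirmed (sockets 0 and 1 both allocated to NUMA node 0) is exactly what the patch in this series rejects. A minimal sketch of that check, with a hypothetical helper name mirroring the spapr_numa.c hunk:

```python
def validate_pseries_topology(nb_numa_nodes, sockets):
    """Starting with the 6.0 machine type, the series requires each
    socket to map to exactly one NUMA node."""
    if nb_numa_nodes != sockets:
        raise ValueError("Number of CPU sockets must be equal to the "
                         "number of NUMA nodes")

validate_pseries_topology(2, 2)      # 2 sockets, 2 nodes: accepted
try:
    validate_pseries_topology(1, 2)  # 2 sockets in 1 node: rejected
except ValueError as e:
    print(e)
```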
On Wed, Mar 31, 2021 at 05:22:39PM +0200, Cédric Le Goater wrote: > On 3/31/21 6:58 AM, Michael Ellerman wrote: > > David Gibson <david@gibson.dropbear.id.au> writes: > >> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote: > > ... > >> > >>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a > >>> different NUMA node than 0-1. This would mean that the same socket > >>> would belong to different NUMA nodes at the same time. > >> > >> Right... and I'm still not seeing why that's a problem. AFAICT that's > >> a possible, if unexpected, situation under real hardware - though > >> maybe not for POWER9 specifically. > > > > I think I agree. > > > >>> I believe this is what Cedric wants to be addressed. Given that the > >>> property is called after the OPAL property ibm,chip-id, the kernel > >>> expects that the property will have the same semantics as in OPAL. > >> > >> Even on powernv, I'm not clear why chip-id is tied into the NUMA > >> configuration, rather than getting all the NUMA info from > >> associativity properties. > > > > AFAIK we don't use chip-id for anything related to NUMA, if we do I'd > > consider that a bug. > > Since PAPR only has NUMA nodes, is the use of chip-id in XIVE PAPR > considered as a bug ? I would say so. As noted in another thread, XIVE PAPR *doesn't* actually use chip_id. And even on PowerNV, I don't think this has any real connection to NUMA. For PowerNV we care about whether we're working within a single XIVE hardware instance, or across multiple. There's one XIVE per-chip, hence the relevance of chip-id. That happens to also match to NUMA topology on (currently existing) POWER9 chips, but I don't see that it inherently has to. > > We do use it for topology_physical_package_id(), but that's almost > > completely unused. > > In that case, I think it should be fine to return -1 like under PowerVM. > > Thanks, > > C. > >
On Wed, Mar 31, 2021 at 05:18:45PM +0200, Cédric Le Goater wrote: > On 3/31/21 2:57 AM, David Gibson wrote: > > On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote: > >> > >> > >> On 3/29/21 12:32 PM, Cédric Le Goater wrote: > >>> On 3/29/21 6:20 AM, David Gibson wrote: > >>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote: > >>>>> On 3/25/21 3:10 AM, David Gibson wrote: > >>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote: > >>>>>>> > >>>>>>> > >>>>>>> On 3/22/21 10:03 PM, David Gibson wrote: > >>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote: > >>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update > >>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when > >>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is > >>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating > >>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with > >>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as > >>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are > >>>>>>>>> equivalent. > >>>>>>>>> > >>>>>>>>> Further discussions in the kernel mailing list [2] shown that the > >>>>>>>>> powerpc kernel always considered that the number of sockets were equal > >>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense, > >>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The > >>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were > >>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node > >>>>>>>>> as in [1], happened to be correctly represented in the kernel by > >>>>>>>>> accident during all these years. 
> >>>>>>>>> > >>>>>>>>> There's a case to be made for virtual topologies being detached from > >>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the same > >>>>>>>>> time, this freedom can't result in unrealistic hardware representations > >>>>>>>>> being emulated. If the real hardware and the pseries kernel don't > >>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we. > >>>>>>>>> > >>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the > >>>>>>>>> pseries machine. qtest changes were made to adapt to this new > >>>>>>>>> condition. > >>>>>>>> > >>>>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work > >>>>>>>> for users to match these up, for no real gain. I'm also concerned > >>>>>>>> that this will require follow on changes in libvirt to not make this a > >>>>>>>> really cryptic and irritating point of failure. > >>>>>>> > >>>>>>> Haven't though about required Libvirt changes, although I can say that there > >>>>>>> will be some amount to be mande and it will probably annoy existing users > >>>>>>> (everyone that has a multiple socket per NUMA node topology). > >>>>>>> > >>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed > >>>>>>> here. The other alternative is to keep interacting with the kernel folks to > >>>>>>> see if there is a way to keep our use case untouched. > >>>>>> > >>>>>> Right. Well.. not necessarily untouched, but I'm hoping for more > >>>>>> replies from Cédric to my objections and mpe's. Even with sockets > >>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to > >>>>>> NUMA nodes makes sense. > >>>>> > >>>>> I did a couple of replies in different email threads but maybe not > >>>>> to all. I felt it was going nowhere :/ Couple of thoughts, > >>>> > >>>> I think I saw some of those, but maybe not all. 
> >>>> > >>>>> Shouldn't we get rid of the socket concept, die also, under pseries > >>>>> since they don't exist under PAPR ? We only have numa nodes, cores, > >>>>> threads AFAICT. > >>>> > >>>> Theoretically, yes. I'm not sure it's really practical, though, since > >>>> AFAICT, both qemu and the kernel have the notion of sockets (though > >>>> not dies) built into generic code. > >>> > >>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc" > >>> and PPC Linux only has a NUMA node id, on pseries and powernv. > >>> > >>>> It does mean that one possible approach here - maybe the best one - is > >>>> to simply declare that sockets are meaningless under, so we simply > >>>> don't expect what the guest kernel reports to match what's given to > >>>> qemu. > >>>> > >>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic, > >>>> but it is likely to surprise and confuse people. > >>>> > >>>>> Should we diverged from PAPR and add extra DT properties "qemu,..." ? > >>>>> There are a couple of places where Linux checks for the underlying > >>>>> hypervisor already. > >>>>> > >>>>>>> This also means that > >>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where > >>>>>>> we inform cores per socket information to the kernel. > >>>>>> > >>>>>> Well.. unless we can find some other sensible way to convey that > >>>>>> information. I haven't given up hope for that yet. > >>>>> > >>>>> Well, we could start by fixing the value in QEMU. It is broken > >>>>> today. > >>>> > >>>> Fixing what value, exactly? > >>> > >>> The value of the "ibm,chip-id" since we are keeping the property under > >>> QEMU. > >> > >> David, I believe this has to do with the discussing we had last Friday. 
> >>
> >> I mentioned that the ibm,chip-id property is being calculated in a way that
> >> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes,
> >> e.g.:
> >>
> >> -smp 4,cores=4,maxcpus=8,threads=1 \
> >> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \
> >> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
> >>
> >>
> >> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id
> >> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>;
> >> ibm,pft-size = <0x00 0x19>;
> >> ibm,chip-id = <0x00>;
> >
> >> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a
> >> different NUMA node than 0-1. This would mean that the same socket
> >> would belong to different NUMA nodes at the same time.
> >
> > Right... and I'm still not seeing why that's a problem. AFAICT that's
> > a possible, if unexpected, situation under real hardware - though
> > maybe not for POWER9 specifically.
> The ibm,chip-id property does not exist under PAPR. PAPR only has
> NUMA nodes, no sockets nor chips.
>
> And the property value is simply broken under QEMU. Try this :
>
> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1
>
> # dmesg | grep numa
> [ 0.013106] numa: Node 0 CPUs: 0-1
> [ 0.013136] numa: Node 1 CPUs: 2-3
>
> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id
> ibm,chip-id = <0x01>;
> ibm,chip-id = <0x02>;
> ibm,chip-id = <0x00>;
> ibm,chip-id = <0x03>;

Not seeing the problem here. You've specified 8 (possible) cpus with 1
core per socket, so every cpu gets its own socket; the 4 present cpus
hence show 4 distinct chip-ids. Again, I don't see where this assumption
that the chip-id is somehow related to the NUMA topology is coming from.

> >> I believe this is what Cedric wants to be addressed.
Given that the
> >> property is called after the OPAL property ibm,chip-id, the kernel
> >> expects that the property will have the same semantics as in OPAL.>
> > Even on powernv, I'm not clear why chip-id is tied into the NUMA
> > configuration, rather than getting all the NUMA info from
> > associativity properties.
>
> It is the case.

What exactly is the case?

> The associativity properties are built from chip-id in OPAL though.

Ok, so? Why do we care how OPAL builds the associativity properties
once we *have* the associativity properties it's built?

> The chip-id property is only used in low level PowerNV drivers, VAS,
> XSCOM, LPC, etc.

Which, AFAIK, really *do* need to know which chip they're on, not the
NUMA topology.

> It's also badly used in the common part of the XIVE driver, what I am
> trying to fix to introduce an IPI per node on all platforms.

See comments on other thread.
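David's point is that a PAPR guest can derive everything it needs for NUMA from the associativity arrays alone, without ibm,chip-id. A rough sketch of that lookup, a hypothetical simplification of Linux's of_node_to_nid logic; the reference-points value used here is illustrative, not taken from a real device tree:

```python
def numa_node_from_associativity(associativity, reference_points):
    """associativity: ibm,associativity cells, first cell is the count
    of domains that follow. reference_points: 1-based indices into the
    domain list (ibm,associativity-reference-points); the first entry
    is the domain Linux treats as the NUMA node boundary."""
    length = associativity[0]
    domains = associativity[1:1 + length]
    node_index = reference_points[0] - 1   # 1-based -> 0-based
    return domains[node_index]

# CPU 2 from Daniel's example: ibm,associativity = <5 1 1 1 1 2>
assoc = [5, 1, 1, 1, 1, 2]
refpoints = [4, 4]   # illustrative reference-points value
print(numa_node_from_associativity(assoc, refpoints))   # 1 -> NUMA node 1
```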
On 4/1/21 4:59 AM, David Gibson wrote: > On Wed, Mar 31, 2021 at 05:18:45PM +0200, Cédric Le Goater wrote: >> On 3/31/21 2:57 AM, David Gibson wrote: >>> On Mon, Mar 29, 2021 at 03:32:37PM -0300, Daniel Henrique Barboza wrote: >>>> >>>> >>>> On 3/29/21 12:32 PM, Cédric Le Goater wrote: >>>>> On 3/29/21 6:20 AM, David Gibson wrote: >>>>>> On Thu, Mar 25, 2021 at 09:56:04AM +0100, Cédric Le Goater wrote: >>>>>>> On 3/25/21 3:10 AM, David Gibson wrote: >>>>>>>> On Tue, Mar 23, 2021 at 02:21:33PM -0300, Daniel Henrique Barboza wrote: >>>>>>>>> >>>>>>>>> >>>>>>>>> On 3/22/21 10:03 PM, David Gibson wrote: >>>>>>>>>> On Fri, Mar 19, 2021 at 03:34:52PM -0300, Daniel Henrique Barboza wrote: >>>>>>>>>>> Kernel commit 4bce545903fa ("powerpc/topology: Update >>>>>>>>>>> topology_core_cpumask") cause a regression in the pseries machine when >>>>>>>>>>> defining certain SMP topologies [1]. The reasoning behind the change is >>>>>>>>>>> explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating >>>>>>>>>>> cpu_core_mask"). In short, cpu_core_mask logic was causing troubles with >>>>>>>>>>> large VMs with lots of CPUs and was changed by cpu_cpu_mask because, as >>>>>>>>>>> far as the kernel understanding of SMP topologies goes, both masks are >>>>>>>>>>> equivalent. >>>>>>>>>>> >>>>>>>>>>> Further discussions in the kernel mailing list [2] shown that the >>>>>>>>>>> powerpc kernel always considered that the number of sockets were equal >>>>>>>>>>> to the number of NUMA nodes. The claim is that it doesn't make sense, >>>>>>>>>>> for Power hardware at least, 2+ sockets being in the same NUMA node. The >>>>>>>>>>> immediate conclusion is that all SMP topologies the pseries machine were >>>>>>>>>>> supplying to the kernel, with more than one socket in the same NUMA node >>>>>>>>>>> as in [1], happened to be correctly represented in the kernel by >>>>>>>>>>> accident during all these years. 
>>>>>>>>>>> >>>>>>>>>>> There's a case to be made for virtual topologies being detached from >>>>>>>>>>> hardware constraints, allowing maximum flexibility to users. At the same >>>>>>>>>>> time, this freedom can't result in unrealistic hardware representations >>>>>>>>>>> being emulated. If the real hardware and the pseries kernel don't >>>>>>>>>>> support multiple chips/sockets in the same NUMA node, neither should we. >>>>>>>>>>> >>>>>>>>>>> Starting in 6.0.0, all sockets must match an unique NUMA node in the >>>>>>>>>>> pseries machine. qtest changes were made to adapt to this new >>>>>>>>>>> condition. >>>>>>>>>> >>>>>>>>>> Oof. I really don't like this idea. It means a bunch of fiddly work >>>>>>>>>> for users to match these up, for no real gain. I'm also concerned >>>>>>>>>> that this will require follow on changes in libvirt to not make this a >>>>>>>>>> really cryptic and irritating point of failure. >>>>>>>>> >>>>>>>>> Haven't though about required Libvirt changes, although I can say that there >>>>>>>>> will be some amount to be mande and it will probably annoy existing users >>>>>>>>> (everyone that has a multiple socket per NUMA node topology). >>>>>>>>> >>>>>>>>> There is not much we can do from the QEMU layer aside from what I've proposed >>>>>>>>> here. The other alternative is to keep interacting with the kernel folks to >>>>>>>>> see if there is a way to keep our use case untouched. >>>>>>>> >>>>>>>> Right. Well.. not necessarily untouched, but I'm hoping for more >>>>>>>> replies from Cédric to my objections and mpe's. Even with sockets >>>>>>>> being a kinda meaningless concept in PAPR, I don't think tying it to >>>>>>>> NUMA nodes makes sense. >>>>>>> >>>>>>> I did a couple of replies in different email threads but maybe not >>>>>>> to all. I felt it was going nowhere :/ Couple of thoughts, >>>>>> >>>>>> I think I saw some of those, but maybe not all. 
>>>>>> >>>>>>> Shouldn't we get rid of the socket concept, die also, under pseries >>>>>>> since they don't exist under PAPR ? We only have numa nodes, cores, >>>>>>> threads AFAICT. >>>>>> >>>>>> Theoretically, yes. I'm not sure it's really practical, though, since >>>>>> AFAICT, both qemu and the kernel have the notion of sockets (though >>>>>> not dies) built into generic code. >>>>> >>>>> Yes. But, AFAICT, these topology notions have not reached "arch/powerpc" >>>>> and PPC Linux only has a NUMA node id, on pseries and powernv. >>>>> >>>>>> It does mean that one possible approach here - maybe the best one - is >>>>>> to simply declare that sockets are meaningless under, so we simply >>>>>> don't expect what the guest kernel reports to match what's given to >>>>>> qemu. >>>>>> >>>>>> It'd be nice to avoid that if we can: in a sense it's just cosmetic, >>>>>> but it is likely to surprise and confuse people. >>>>>> >>>>>>> Should we diverged from PAPR and add extra DT properties "qemu,..." ? >>>>>>> There are a couple of places where Linux checks for the underlying >>>>>>> hypervisor already. >>>>>>> >>>>>>>>> This also means that >>>>>>>>> 'ibm,chip-id' will probably remain in use since it's the only place where >>>>>>>>> we inform cores per socket information to the kernel. >>>>>>>> >>>>>>>> Well.. unless we can find some other sensible way to convey that >>>>>>>> information. I haven't given up hope for that yet. >>>>>>> >>>>>>> Well, we could start by fixing the value in QEMU. It is broken >>>>>>> today. >>>>>> >>>>>> Fixing what value, exactly? >>>>> >>>>> The value of the "ibm,chip-id" since we are keeping the property under >>>>> QEMU. >>>> >>>> David, I believe this has to do with the discussing we had last Friday. 
>>>> >>>> I mentioned that the ibm,chip-id property is being calculated in a way that >>>> promotes the same ibm,chip-id in CPUs that belongs to different NUMA nodes, >>>> e.g.: >>>> >>>> -smp 4,cores=4,maxcpus=8,threads=1 \ >>>> -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 \ >>>> -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >>>> >>>> >>>> $ dtc -I dtb -O dts fdt.dtb | grep -B2 ibm,chip-id >>>> ibm,associativity = <0x05 0x00 0x00 0x00 0x00 0x00>; >>>> ibm,pft-size = <0x00 0x19>; >>>> ibm,chip-id = <0x00>; >>> >>>> We assign ibm,chip-id=0x0 to CPUs 0-3, but CPUs 2-3 are located in a >>>> different NUMA node than 0-1. This would mean that the same socket >>>> would belong to different NUMA nodes at the same time. >>> >>> Right... and I'm still not seeing why that's a problem. AFAICT that's >>> a possible, if unexpected, situation under real hardware - though >>> maybe not for POWER9 specifically. >> The ibm,chip-id property does not exist under PAPR. PAPR only has >> NUMA nodes, no sockets nor chips. >> >> And the property value is simply broken under QEMU. Try this : >> >> -smp 4,cores=1,maxcpus=8 -object memory-backend-ram,id=ram-node0,size=2G -numa node,nodeid=0,cpus=0-1,cpus=4-5,memdev=ram-node0 -object memory-backend-ram,id=ram-node1,size=2G -numa node,nodeid=1,cpus=2-3,cpus=6-7,memdev=ram-node1 >> >> # dmesg | grep numa >> [ 0.013106] numa: Node 0 CPUs: 0-1 >> [ 0.013136] numa: Node 1 CPUs: 2-3 >> >> # dtc -I fs /proc/device-tree/cpus/ -f | grep ibm,chip-id >> ibm,chip-id = <0x01>; >> ibm,chip-id = <0x02>; >> ibm,chip-id = <0x00>; >> ibm,chip-id = <0x03>; > > Not seeing the problem here. You've specified 8 (possible) cpus, 1 > core-per-socket, therefore there are 4 sockets, hence 4 chip-ids. > Again, I don't see where this assumption that the chip-id is somehow > related to the NUMA topology is coming from. > >>>> I believe this is what Cedric wants to be addressed. 
Given that the
>>>> property is called after the OPAL property ibm,chip-id, the kernel
>>>> expects that the property will have the same semantics as in OPAL.>
>>> Even on powernv, I'm not clear why chip-id is tied into the NUMA
>>> configuration, rather than getting all the NUMA info from
>>> associativity properties.
>>
>> It is the case.
>
> What exactly is the case?

Sorry, that was badly phrased. I meant: powernv gets the NUMA info from
the "ibm,associativity" property.

C.

>> The associativity properties are built from chip-id in OPAL though.
>
> Ok, so? Why do we care how OPAL builds the associativity properties
> once we *have* the associativity properties it's built?
>
>> The chip-id property is only used in low level PowerNV drivers, VAS,
>> XSCOM, LPC, etc.
>
> Which, AFAIK, really *do* need to know which chip they're on, not the
> NUMA topology.
>
>> It's also badly used in the common part of the XIVE driver, what I am
>> trying to fix to introduce an IPI per node on all platforms.
>
> See comments on other thread.
>
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d56418ca29..745f71c243 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -4611,8 +4611,11 @@ DEFINE_SPAPR_MACHINE(6_0, "6.0", true);
  */
 static void spapr_machine_5_2_class_options(MachineClass *mc)
 {
+    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
     spapr_machine_6_0_class_options(mc);
     compat_props_add(mc->compat_props, hw_compat_5_2, hw_compat_5_2_len);
+    smc->pre_6_0_smp_topology = true;
 }
 
 DEFINE_SPAPR_MACHINE(5_2, "5.2", false);
diff --git a/hw/ppc/spapr_numa.c b/hw/ppc/spapr_numa.c
index 779f18b994..0ade43dd79 100644
--- a/hw/ppc/spapr_numa.c
+++ b/hw/ppc/spapr_numa.c
@@ -163,6 +163,13 @@ void spapr_numa_associativity_init(SpaprMachineState *spapr,
     int i, j, max_nodes_with_gpus;
     bool using_legacy_numa = spapr_machine_using_legacy_numa(spapr);
 
+    if (!smc->pre_6_0_smp_topology &&
+        nb_numa_nodes != machine->smp.sockets) {
+        error_report("Number of CPU sockets must be equal to the number "
+                     "of NUMA nodes");
+        exit(EXIT_FAILURE);
+    }
+
     /*
      * For all associativity arrays: first position is the size,
      * position MAX_DISTANCE_REF_POINTS is always the numa_id,
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 47cebaf3ac..98dc5d198a 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -142,6 +142,7 @@ struct SpaprMachineClass {
     hwaddr rma_limit;          /* clamp the RMA to this size */
     bool pre_5_1_assoc_refpoints;
     bool pre_5_2_numa_associativity;
+    bool pre_6_0_smp_topology;
 
     bool (*phb_placement)(SpaprMachineState *spapr, uint32_t index,
                           uint64_t *buid, hwaddr *pio,
diff --git a/tests/qtest/cpu-plug-test.c b/tests/qtest/cpu-plug-test.c
index a1c689414b..946b9129ea 100644
--- a/tests/qtest/cpu-plug-test.c
+++ b/tests/qtest/cpu-plug-test.c
@@ -118,8 +118,8 @@ static void add_pseries_test_case(const char *mname)
     data->machine = g_strdup(mname);
     data->cpu_model = "power8_v2.0";
     data->device_model = g_strdup("power8_v2.0-spapr-cpu-core");
-    data->sockets = 2;
-    data->cores = 3;
+    data->sockets = 1;
+    data->cores = 6;
     data->threads = 1;
     data->maxcpus = data->sockets * data->cores * data->threads;
diff --git a/tests/qtest/device-plug-test.c b/tests/qtest/device-plug-test.c
index 559d47727a..dd7d8268d2 100644
--- a/tests/qtest/device-plug-test.c
+++ b/tests/qtest/device-plug-test.c
@@ -91,7 +91,14 @@ static void test_spapr_cpu_unplug_request(void)
 {
     QTestState *qtest;
 
-    qtest = qtest_initf("-cpu power9_v2.0 -smp 1,maxcpus=2 "
+    /*
+     * Default smp settings will prioritize sockets over cores and
+     * threads, so '-smp 2,maxcpus=2' will add 2 sockets. However,
+     * the pseries machine requires a NUMA node for each socket
+     * (since 6.0.0). Specify sockets=1 to make life easier.
+     */
+    qtest = qtest_initf("-cpu power9_v2.0 "
+                        "-smp 1,maxcpus=2,threads=1,cores=2,sockets=1 "
                         "-device power9_v2.0-spapr-cpu-core,core-id=1,id=dev0");
 
     /* similar to test_pci_unplug_request */
diff --git a/tests/qtest/numa-test.c b/tests/qtest/numa-test.c
index dc0ec571ca..bb13f7131b 100644
--- a/tests/qtest/numa-test.c
+++ b/tests/qtest/numa-test.c
@@ -24,9 +24,17 @@ static void test_mon_explicit(const void *data)
     QTestState *qts;
     g_autofree char *s = NULL;
     g_autofree char *cli = NULL;
+    const char *arch = qtest_get_arch();
+
+    if (g_str_equal(arch, "ppc64")) {
+        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
+                       "-numa node,nodeid=0,memdev=ram,cpus=0-3 "
+                       "-numa node,nodeid=1,cpus=4-7");
+    } else {
+        cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 "
+                       "-numa node,nodeid=1,cpus=4-7");
+    }
 
-    cli = make_cli(data, "-smp 8 -numa node,nodeid=0,memdev=ram,cpus=0-3 "
-                   "-numa node,nodeid=1,cpus=4-7");
     qts = qtest_init(cli);
 
     s = qtest_hmp(qts, "info numa");
@@ -57,10 +65,18 @@ static void test_mon_partial(const void *data)
     QTestState *qts;
     g_autofree char *s = NULL;
     g_autofree char *cli = NULL;
+    const char *arch = qtest_get_arch();
+
+    if (g_str_equal(arch, "ppc64")) {
+        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
+                       "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
+                       "-numa node,nodeid=1,cpus=4-5 ");
+    } else {
+        cli = make_cli(data, "-smp 8 "
+                       "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
+                       "-numa node,nodeid=1,cpus=4-5 ");
+    }
 
-    cli = make_cli(data, "-smp 8 "
-                   "-numa node,nodeid=0,memdev=ram,cpus=0-1 "
-                   "-numa node,nodeid=1,cpus=4-5 ");
     qts = qtest_init(cli);
 
     s = qtest_hmp(qts, "info numa");
@@ -85,9 +101,17 @@ static void test_query_cpus(const void *data)
     QObject *e;
     QTestState *qts;
     g_autofree char *cli = NULL;
+    const char *arch = qtest_get_arch();
+
+    if (g_str_equal(arch, "ppc64")) {
+        cli = make_cli(data, "-smp 8,threads=1,cores=4,sockets=2 "
+                       "-numa node,memdev=ram,cpus=0-3 "
+                       "-numa node,cpus=4-7");
+    } else {
+        cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 "
+                       "-numa node,cpus=4-7");
+    }
 
-    cli = make_cli(data, "-smp 8 -numa node,memdev=ram,cpus=0-3 "
-                   "-numa node,cpus=4-7");
     qts = qtest_init(cli);
     cpus = get_cpus(qts, &resp);
     g_assert(cpus);
@@ -177,7 +201,7 @@ static void spapr_numa_cpu(const void *data)
     QTestState *qts;
     g_autofree char *cli = NULL;
 
-    cli = make_cli(data, "-smp 4,cores=4 "
+    cli = make_cli(data, "-smp 4,threads=1,cores=2,sockets=2 "
                   "-numa node,nodeid=0,memdev=ram -numa node,nodeid=1 "
                   "-numa cpu,node-id=0,core-id=0 "
                   "-numa cpu,node-id=0,core-id=1 "
@@ -554,7 +578,17 @@ int main(int argc, char **argv)
     g_test_init(&argc, &argv, NULL);
 
-    qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split);
+    /*
+     * Starting on 6.0.0, for the pseries machine, '-smp 8' will only work
+     * if we have 8 NUMA nodes. If we specify 'smp 8,sockets=2' to match
+     * 2 NUMA nodes, the CPUs will be split as 0123/4567 instead of
+     * 0246/1357 that test_def_cpu_split expects. In short, this test is
+     * no longer valid for ppc64 in 6.0.0.
+ */ + if (!g_str_equal(arch, "ppc64")) { + qtest_add_data_func("/numa/mon/cpus/default", args, test_def_cpu_split); + } + qtest_add_data_func("/numa/mon/cpus/explicit", args, test_mon_explicit); qtest_add_data_func("/numa/mon/cpus/partial", args, test_mon_partial); qtest_add_data_func("/numa/qmp/cpus/query-cpus", args, test_query_cpus);
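The CPU-split comment in the numa-test.c hunk above can be illustrated with a toy Python model (this is not QEMU's actual assignment code; the helper name and the per-socket round-robin placement are assumptions for illustration). With the default '-smp 8' every CPU is its own socket, so distributing sockets round-robin across two nodes interleaves the CPUs (0246/1357); with 'sockets=2,cores=4' each node receives one whole socket, giving the contiguous 0123/4567 split:

```python
def default_numa_split(n_cpus, n_nodes, sockets):
    """Toy model: place each socket wholly on one NUMA node,
    distributing sockets across nodes round-robin."""
    cpus_per_socket = n_cpus // sockets
    nodes = [[] for _ in range(n_nodes)]
    for s in range(sockets):
        first = s * cpus_per_socket
        nodes[s % n_nodes].extend(range(first, first + cpus_per_socket))
    return nodes

# '-smp 8' defaults to 8 single-CPU sockets: nodes interleave.
print(default_numa_split(8, 2, sockets=8))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
# '-smp 8,sockets=2,cores=4': one whole socket per node.
print(default_numa_split(8, 2, sockets=2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

This is why test_def_cpu_split, which expects the interleaved layout, no longer applies to ppc64 once sockets must match NUMA nodes.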
Kernel commit 4bce545903fa ("powerpc/topology: Update topology_core_cpumask") caused a regression in the pseries machine when defining certain SMP topologies [1]. The reasoning behind the change is explained in kernel commit 4ca234a9cbd7 ("powerpc/smp: Stop updating cpu_core_mask"). In short, the cpu_core_mask logic was causing trouble in large VMs with many CPUs and was replaced by cpu_cpu_mask because, as far as the kernel's understanding of SMP topologies goes, both masks are equivalent.

Further discussion on the kernel mailing list [2] showed that the powerpc kernel has always considered the number of sockets to be equal to the number of NUMA nodes. The claim is that, for Power hardware at least, it doesn't make sense for two or more sockets to share a NUMA node. The immediate conclusion is that all the SMP topologies the pseries machine was supplying to the kernel with more than one socket in the same NUMA node, as in [1], happened to be represented correctly in the kernel by accident all these years.

There's a case to be made for virtual topologies being detached from hardware constraints, allowing maximum flexibility to users. At the same time, this freedom can't result in unrealistic hardware representations being emulated. If the real hardware and the pseries kernel don't support multiple chips/sockets in the same NUMA node, neither should we.

Starting in 6.0.0, each socket must match a unique NUMA node in the pseries machine. qtest changes were made to adapt to this new condition.
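The condition this patch enforces in spapr_numa_associativity_init() can be summarized with a small Python sketch (the function name and boolean return are illustrative only; the real C code calls error_report() and exit() instead of returning):

```python
def pseries_topology_valid(nb_numa_nodes, sockets,
                           pre_6_0_smp_topology=False):
    """Sketch of the new pseries startup check: machine types >= 6.0
    refuse to start unless the number of CPU sockets equals the number
    of NUMA nodes. Older machine types set the pre_6_0_smp_topology
    compat flag and skip the check."""
    if not pre_6_0_smp_topology and nb_numa_nodes != sockets:
        return False  # QEMU prints an error and exits at this point
    return True

assert pseries_topology_valid(2, 2)        # sockets match NUMA nodes
assert not pseries_topology_valid(1, 2)    # rejected on 6.0.0+ machines
assert pseries_topology_valid(1, 2, True)  # pre-6.0 machine types still boot
```

For example, '-smp 8,sockets=2' now requires exactly two '-numa node' options on the pseries-6.0 machine, while pseries-5.2 and older keep the previous permissive behavior.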
[1] https://bugzilla.redhat.com/1934421
[2] https://lore.kernel.org/linuxppc-dev/daa5d05f-dbd0-05ad-7395-5d5a3d364fc6@gmail.com/

CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
CC: Cédric Le Goater <clg@kaod.org>
CC: Igor Mammedov <imammedo@redhat.com>
CC: Laurent Vivier <lvivier@redhat.com>
CC: Thomas Huth <thuth@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 hw/ppc/spapr.c                 |  3 ++
 hw/ppc/spapr_numa.c            |  7 +++++
 include/hw/ppc/spapr.h         |  1 +
 tests/qtest/cpu-plug-test.c    |  4 +--
 tests/qtest/device-plug-test.c |  9 +++++-
 tests/qtest/numa-test.c        | 52 ++++++++++++++++++++++++++++------
 6 files changed, 64 insertions(+), 12 deletions(-)