Message ID | 20200518214418.18248-2-arbab@linux.ibm.com (mailing list archive)
---|---
State      | New, archived
Series     | None
On Mon, 18 May 2020 16:44:18 -0500
Reza Arbab <arbab@linux.ibm.com> wrote:

> NUMA nodes corresponding to GPU memory currently have the same
> affinity/distance as normal memory nodes. Add a third NUMA associativity
> reference point enabling us to give GPU nodes more distance.
>
> This is guest visible information, which shouldn't change under a
> running guest across migration between different qemu versions, so make
> the change effective only in new (pseries > 5.0) machine types.
>
> Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
>
> node distances:
> node   0   1   2   3   4   5
>   0:  10  40  40  40  40  40
>   1:  40  10  40  40  40  40
>   2:  40  40  10  40  40  40
>   3:  40  40  40  10  40  40
>   4:  40  40  40  40  10  40
>   5:  40  40  40  40  40  10
>
> After:
>
> node distances:
> node   0   1   2   3   4   5
>   0:  10  40  80  80  80  80
>   1:  40  10  80  80  80  80
>   2:  80  80  10  80  80  80
>   3:  80  80  80  10  80  80
>   4:  80  80  80  80  10  80
>   5:  80  80  80  80  80  10
>
> These are the same distances as on the host, mirroring the change made
> to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
> Add a new level of NUMA for GPU's").
>
> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> ---
>  hw/ppc/spapr.c             | 11 +++++++++--
>  hw/ppc/spapr_pci_nvlink2.c |  2 +-
>  2 files changed, 10 insertions(+), 3 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 88b4a1f17716..1d9193d5ee49 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -893,7 +893,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>      int rtas;
>      GString *hypertas = g_string_sized_new(256);
>      GString *qemu_hypertas = g_string_sized_new(256);
> -    uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> +    uint32_t refpoints[] = {
> +        cpu_to_be32(0x4),
> +        cpu_to_be32(0x4),
> +        cpu_to_be32(0x2),
> +    };
>      uint32_t nr_refpoints;
>      uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
>                                 memory_region_size(&MACHINE(spapr)->device_memory->mr);
> @@ -4544,7 +4548,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>      smc->linux_pci_probe = true;
>      smc->smp_threads_vsmt = true;
>      smc->nr_xirqs = SPAPR_NR_XIRQS;
> -    smc->nr_assoc_refpoints = 2;
> +    smc->nr_assoc_refpoints = 3;
>      xfc->match_nvt = spapr_match_nvt;
>  }
>
> @@ -4611,8 +4615,11 @@ DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
>   */
>  static void spapr_machine_5_0_class_options(MachineClass *mc)
>  {
> +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
>      spapr_machine_5_1_class_options(mc);
>      compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
> +    smc->nr_assoc_refpoints = 2;
>  }
>
>  DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> index 8332d5694e46..247fd48731e2 100644
> --- a/hw/ppc/spapr_pci_nvlink2.c
> +++ b/hw/ppc/spapr_pci_nvlink2.c
> @@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
>      uint32_t associativity[] = {
>          cpu_to_be32(0x4),
>          SPAPR_GPU_NUMA_ID,
> -        SPAPR_GPU_NUMA_ID,
> +        cpu_to_be32(nvslot->numa_id),

This is a guest visible change. It should theoretically be controlled
with a compat property of the PHB (look for "static GlobalProperty" in
spapr.c). But since this code is only used for GPU passthrough and we
don't support migration of such devices, I guess it's okay. Maybe just
mention it in the changelog.

>          SPAPR_GPU_NUMA_ID,
>          cpu_to_be32(nvslot->numa_id)
>      };
On Thu, May 21, 2020 at 01:36:16AM +0200, Greg Kurz wrote:
> On Mon, 18 May 2020 16:44:18 -0500
> Reza Arbab <arbab@linux.ibm.com> wrote:
> >
> > [...]
> >
> >      uint32_t associativity[] = {
> >          cpu_to_be32(0x4),
> >          SPAPR_GPU_NUMA_ID,
> > -        SPAPR_GPU_NUMA_ID,
> > +        cpu_to_be32(nvslot->numa_id),
>
> This is a guest visible change. It should theoretically be controlled
> with a compat property of the PHB (look for "static GlobalProperty" in
> spapr.c). But since this code is only used for GPU passthrough and we
> don't support migration of such devices, I guess it's okay. Maybe just
> mention it in the changelog.

Yeah, we might get away with it, but it should be too hard to get this
right, so let's do it.

>
> >          SPAPR_GPU_NUMA_ID,
> >          cpu_to_be32(nvslot->numa_id)
> >      };
On Thu, 21 May 2020 15:13:45 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, May 21, 2020 at 01:36:16AM +0200, Greg Kurz wrote:
> > On Mon, 18 May 2020 16:44:18 -0500
> > Reza Arbab <arbab@linux.ibm.com> wrote:
> > >
> > > [...]
> > >
> > >      uint32_t associativity[] = {
> > >          cpu_to_be32(0x4),
> > >          SPAPR_GPU_NUMA_ID,
> > > -        SPAPR_GPU_NUMA_ID,
> > > +        cpu_to_be32(nvslot->numa_id),
> >
> > This is a guest visible change. It should theoretically be controlled
> > with a compat property of the PHB (look for "static GlobalProperty" in
> > spapr.c). But since this code is only used for GPU passthrough and we
> > don't support migration of such devices, I guess it's okay. Maybe just
> > mention it in the changelog.
>
> Yeah, we might get away with it, but it should be too hard to get this

I guess you mean "it shouldn't be too hard" ?

> right, so let's do it.
>
> > >          SPAPR_GPU_NUMA_ID,
> > >          cpu_to_be32(nvslot->numa_id)
> > >      };
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 88b4a1f17716..1d9193d5ee49 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -893,7 +893,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
     int rtas;
     GString *hypertas = g_string_sized_new(256);
     GString *qemu_hypertas = g_string_sized_new(256);
-    uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
+    uint32_t refpoints[] = {
+        cpu_to_be32(0x4),
+        cpu_to_be32(0x4),
+        cpu_to_be32(0x2),
+    };
     uint32_t nr_refpoints;
     uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
                                memory_region_size(&MACHINE(spapr)->device_memory->mr);
@@ -4544,7 +4548,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
     smc->linux_pci_probe = true;
     smc->smp_threads_vsmt = true;
     smc->nr_xirqs = SPAPR_NR_XIRQS;
-    smc->nr_assoc_refpoints = 2;
+    smc->nr_assoc_refpoints = 3;
     xfc->match_nvt = spapr_match_nvt;
 }
 
@@ -4611,8 +4615,11 @@ DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
  */
 static void spapr_machine_5_0_class_options(MachineClass *mc)
 {
+    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
     spapr_machine_5_1_class_options(mc);
     compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
+    smc->nr_assoc_refpoints = 2;
 }
 
 DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
index 8332d5694e46..247fd48731e2 100644
--- a/hw/ppc/spapr_pci_nvlink2.c
+++ b/hw/ppc/spapr_pci_nvlink2.c
@@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
     uint32_t associativity[] = {
         cpu_to_be32(0x4),
         SPAPR_GPU_NUMA_ID,
-        SPAPR_GPU_NUMA_ID,
+        cpu_to_be32(nvslot->numa_id),
         SPAPR_GPU_NUMA_ID,
         cpu_to_be32(nvslot->numa_id)
     };
NUMA nodes corresponding to GPU memory currently have the same
affinity/distance as normal memory nodes. Add a third NUMA associativity
reference point enabling us to give GPU nodes more distance.

This is guest visible information, which shouldn't change under a
running guest across migration between different qemu versions, so make
the change effective only in new (pseries > 5.0) machine types.

Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):

node distances:
node   0   1   2   3   4   5
  0:  10  40  40  40  40  40
  1:  40  10  40  40  40  40
  2:  40  40  10  40  40  40
  3:  40  40  40  10  40  40
  4:  40  40  40  40  10  40
  5:  40  40  40  40  40  10

After:

node distances:
node   0   1   2   3   4   5
  0:  10  40  80  80  80  80
  1:  40  10  80  80  80  80
  2:  80  80  10  80  80  80
  3:  80  80  80  10  80  80
  4:  80  80  80  80  10  80
  5:  80  80  80  80  80  10

These are the same distances as on the host, mirroring the change made
to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
Add a new level of NUMA for GPU's").

Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
---
 hw/ppc/spapr.c             | 11 +++++++++--
 hw/ppc/spapr_pci_nvlink2.c |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)