| Message ID | 20200508175927.21791-1-arbab@linux.ibm.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | spapr: Add a new level of NUMA for GPUs |

On Fri, May 08, 2020 at 12:59:27PM -0500, Reza Arbab wrote:
> NUMA nodes corresponding to GPU memory currently have the same
> affinity/distance as normal memory nodes. Add a third NUMA associativity
> reference point enabling us to give GPU nodes more distance.
>
> Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
>
> node distances:
> node   0   1   2   3   4   5
>   0:  10  40  40  40  40  40
>   1:  40  10  40  40  40  40
>   2:  40  40  10  40  40  40
>   3:  40  40  40  10  40  40
>   4:  40  40  40  40  10  40
>   5:  40  40  40  40  40  10
>
> After:
>
> node distances:
> node   0   1   2   3   4   5
>   0:  10  40  80  80  80  80
>   1:  40  10  80  80  80  80
>   2:  80  80  10  80  80  80
>   3:  80  80  80  10  80  80
>   4:  80  80  80  80  10  80
>   5:  80  80  80  80  80  10
>
> These are the same distances as on the host, mirroring the change made
> to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
> Add a new level of NUMA for GPU's").

Urgh. I have a number of thoughts on this.

1)

This would all be much simpler if PAPR's representation of NUMA
distances weren't so awful. Somehow it manages to be both so complex
that it's very hard to understand, and yet very limited in that it has
no way to represent distances in any absolute units, or even specific
ratios between distances.

Both qemu and the guest kernel can have an arbitrary set of nodes,
with an arbitrary matrix of distances between each pair, which we then
have to lossily encode into this PAPR nonsense.

2)

Alas, I don't think we can simply change this information. We'll have
to do it conditionally on a new machine type. This is guest-visible
information, which shouldn't change under a running guest across
migration between different qemu versions. At least for Linux guests
we'd probably get away with it, since I think it only reads this info
at boot, and across a migration we'd at worst get non-optimal
behaviour, not actual breakage. Still, I'd need a stronger case than
"probably won't break" before breaking our rules about guest
environment stability within a machine type.

3)

I'm not sure that this version is totally correct w.r.t. PAPR. But
then, I'm also not really sure of that for the existing version.
Specifically, it's not at all clear from PAPR whether the IDs used at
each level of the ibm,associativity property need to be:
     a) globally unique,
     b) unique only within the associativity level they appear at,
  or c) unique only within the "node" at the next higher level they
        belong to.

4)

I'm not totally clear on the rationale for using the individual GPU's
NUMA ID at all levels, rather than just one. I'm guessing this is so
that the GPU memory blocks are distant from each other as well as
distant from main memory. Is that right?
> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> ---
>  hw/ppc/spapr.c             | 6 +++++-
>  hw/ppc/spapr_pci_nvlink2.c | 8 +++-----
>  2 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index c18eab0a2305..53567f98f0c6 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -892,7 +892,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>      int rtas;
>      GString *hypertas = g_string_sized_new(256);
>      GString *qemu_hypertas = g_string_sized_new(256);
> -    uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> +    uint32_t refpoints[] = {
> +        cpu_to_be32(0x4),
> +        cpu_to_be32(0x4),
> +        cpu_to_be32(0x2),
> +    };
>      uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
>                                 memory_region_size(&MACHINE(spapr)->device_memory->mr);
>      uint32_t lrdr_capacity[] = {
> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> index 8332d5694e46..f2cb26019e88 100644
> --- a/hw/ppc/spapr_pci_nvlink2.c
> +++ b/hw/ppc/spapr_pci_nvlink2.c
> @@ -37,8 +37,6 @@
>  #define PHANDLE_NVLINK(phb, gn, nn)    (0x00130000 | (((phb)->index) << 8) | \
>                                          ((gn) << 4) | (nn))
>
> -#define SPAPR_GPU_NUMA_ID (cpu_to_be32(1))
> -
>  typedef struct SpaprPhbPciNvGpuSlot {
>      uint64_t tgt;
>      uint64_t gpa;
> @@ -361,9 +359,9 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
>                                            "nvlink2-mr[0]", NULL);
>          uint32_t associativity[] = {
>              cpu_to_be32(0x4),
> -            SPAPR_GPU_NUMA_ID,
> -            SPAPR_GPU_NUMA_ID,
> -            SPAPR_GPU_NUMA_ID,
> +            cpu_to_be32(nvslot->numa_id),
> +            cpu_to_be32(nvslot->numa_id),
> +            cpu_to_be32(nvslot->numa_id),
>              cpu_to_be32(nvslot->numa_id)
>          };
>          uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
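
For readers trying to connect the reference points to the distance tables quoted above: the guest derives each pairwise distance by walking the indices listed in ibm,associativity-reference-points and, roughly speaking, doubling a base distance of 10 at every index where the two nodes' ibm,associativity entries differ. The sketch below is a toy model of that behaviour, patterned after how the Linux guest handles these properties; the helper name, the sample associativity lists, and the exact doubling rule are illustrative assumptions, not code from this thread.

```c
#include <stdint.h>
#include <stdio.h>

#define LOCAL_DISTANCE 10   /* a node's distance to itself */

/*
 * Toy model of how a PAPR guest turns ibm,associativity lists into the
 * distances numactl reports: walk the reference-point indices and double
 * the distance for every level where the two nodes' domain IDs differ.
 * (Illustrative only; the real logic lives in the guest kernel.)
 */
static int model_distance(const uint32_t *assoc_a, const uint32_t *assoc_b,
                          const int *refpoints, int nr_refpoints)
{
    int distance = LOCAL_DISTANCE;
    int i;

    for (i = 0; i < nr_refpoints; i++) {
        if (assoc_a[refpoints[i]] == assoc_b[refpoints[i]]) {
            break;
        }
        distance *= 2;
    }
    return distance;
}

int main(void)
{
    /* ibm,associativity lists: { length, d1, d2, d3, node-id } */
    const uint32_t mem0[] = { 4, 0, 0, 0, 0 };  /* normal memory, node 0      */
    const uint32_t mem1[] = { 4, 0, 0, 0, 1 };  /* normal memory, node 1      */
    const uint32_t gpu2[] = { 4, 2, 2, 2, 2 };  /* GPU memory, node 2 (patch) */
    const uint32_t gpu3[] = { 4, 3, 3, 3, 3 };  /* GPU memory, node 3 (patch) */

    const int old_refs[] = { 4, 4 };      /* old refpoints: 0x4, 0x4      */
    const int new_refs[] = { 4, 4, 2 };   /* new refpoints: 0x4, 0x4, 0x2 */

    printf("mem0-mem1, old refs: %d\n", model_distance(mem0, mem1, old_refs, 2)); /* 40 */
    printf("mem0-gpu2, old refs: %d\n", model_distance(mem0, gpu2, old_refs, 2)); /* 40 */
    printf("mem0-mem1, new refs: %d\n", model_distance(mem0, mem1, new_refs, 3)); /* 40 */
    printf("mem0-gpu2, new refs: %d\n", model_distance(mem0, gpu2, new_refs, 3)); /* 80 */
    printf("gpu2-gpu3, new refs: %d\n", model_distance(gpu2, gpu3, new_refs, 3)); /* 80 */
    return 0;
}
```

With only the two 0x4 reference points, any two distinct nodes differ at both levels and land at 40; the extra 0x2 reference point gives GPU nodes, whose second entry is unique, one more mismatch and hence 80, while ordinary memory nodes still agree there and stay at 40.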
Hi David,

Thanks for your quick response!

On Mon, May 11, 2020 at 04:17:45PM +1000, David Gibson wrote:
>1)
>
>This would all be much simpler if PAPR's representation of NUMA
>distances weren't so awful. Somehow it manages to be both so complex
>that it's very hard to understand, and yet very limited in that it
>has no way to represent distances in any absolute units, or even
>specific ratios between distances.
>
>Both qemu and the guest kernel can have an arbitrary set of nodes,
>with an arbitrary matrix of distances between each pair, which we then
>have to lossily encode into this PAPR nonsense.

Completely agree. I've revisited that section many times now and still
find the descriptions of these properties almost incomprehensible. The
only way I've been able to make sense of it comes from reading the
implementation, or experimentally tweaking the code/device tree to see
if distances behave the way I expect.

>2)
>
>Alas, I don't think we can simply change this information. We'll have
>to do it conditionally on a new machine type. This is guest-visible
>information, which shouldn't change under a running guest across
>migration between different qemu versions. At least for Linux guests
>we'd probably get away with it, since I think it only reads this info
>at boot, and across a migration we'd at worst get non-optimal
>behaviour, not actual breakage.

Sure, that makes sense. I'll try making the change conditional on a
flag that can be set on new machine types.

>3)
>
>I'm not sure that this version is totally correct w.r.t. PAPR. But
>then, I'm also not really sure of that for the existing version.
>Specifically, it's not at all clear from PAPR whether the IDs used at
>each level of the ibm,associativity property need to be:
>     a) globally unique,
>     b) unique only within the associativity level they appear at,
>  or c) unique only within the "node" at the next higher level they
>        belong to.

Again, I'm no authority, but it seems to be (b):

"To determine the associativity between any two resources, the OS
scans down the two resources associativity lists in all pair wise
combinations counting how many domains are the same until the first
domain where the two list do not agree."

FWIW, using the same number for the ID at multiple levels has been
working in practice.

>4)
>
>I'm not totally clear on the rationale for using the individual GPU's
>NUMA ID at all levels, rather than just one. I'm guessing this is so
>that the GPU memory blocks are distant from each other as well as
>distant from main memory. Is that right?

It's not necessary to use it at all levels--that was me trying to
completely replicate that old firmware change. Strictly speaking,
since only reference-points 4 (and now 2) are significant, that part
could just be:

@@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
         uint32_t associativity[] = {
             cpu_to_be32(0x4),
             SPAPR_GPU_NUMA_ID,
-            SPAPR_GPU_NUMA_ID,
+            cpu_to_be32(nvslot->numa_id),
             SPAPR_GPU_NUMA_ID,
             cpu_to_be32(nvslot->numa_id)
         };

I think the rationale is that if those other levels got added to
reference-points in the future, you'd likely want the GPU to be
distinct there too.
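
For what "conditional on a flag that can be set on new machine types" typically looks like in spapr: add a compatibility flag to SpaprMachineClass, set it in the class options of older machine types, and emit only the first two reference points when it is set. The fragment below is a rough sketch of that pattern; the flag name pre_5_1_assoc_refpoints, the version numbers, and the exact placement are guesses for illustration, not code from this series.

```c
/* include/hw/ppc/spapr.h -- hypothetical compat flag (name is a guess) */
struct SpaprMachineClass {
    /* ... existing fields ... */
    bool pre_5_1_assoc_refpoints;  /* old machine types: keep two refpoints */
};

/* hw/ppc/spapr.c -- only expose the third reference point on new machines */
static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
{
    SpaprMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
    uint32_t refpoints[] = {
        cpu_to_be32(0x4),
        cpu_to_be32(0x4),
        cpu_to_be32(0x2),
    };
    uint32_t nr_refpoints = ARRAY_SIZE(refpoints);
    /* ... existing body elided ... */

    if (smc->pre_5_1_assoc_refpoints) {
        nr_refpoints = 2;  /* legacy machine types keep the old property */
    }

    _FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
                     refpoints, nr_refpoints * sizeof(refpoints[0])));
    /* ... */
}

/* the older machine type opts out, so its guests see no change */
static void spapr_machine_5_0_class_options(MachineClass *mc)
{
    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);

    spapr_machine_5_1_class_options(mc);
    compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
    smc->pre_5_1_assoc_refpoints = true;
}
```

New machine types would leave the flag clear and expose the three-entry property; existing ones keep presenting exactly what they did before, which keeps the guest-visible device tree stable across migration as point 2 requires.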
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index c18eab0a2305..53567f98f0c6 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -892,7 +892,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
     int rtas;
     GString *hypertas = g_string_sized_new(256);
     GString *qemu_hypertas = g_string_sized_new(256);
-    uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
+    uint32_t refpoints[] = {
+        cpu_to_be32(0x4),
+        cpu_to_be32(0x4),
+        cpu_to_be32(0x2),
+    };
     uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
                                memory_region_size(&MACHINE(spapr)->device_memory->mr);
     uint32_t lrdr_capacity[] = {
diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
index 8332d5694e46..f2cb26019e88 100644
--- a/hw/ppc/spapr_pci_nvlink2.c
+++ b/hw/ppc/spapr_pci_nvlink2.c
@@ -37,8 +37,6 @@
 #define PHANDLE_NVLINK(phb, gn, nn)    (0x00130000 | (((phb)->index) << 8) | \
                                         ((gn) << 4) | (nn))
 
-#define SPAPR_GPU_NUMA_ID (cpu_to_be32(1))
-
 typedef struct SpaprPhbPciNvGpuSlot {
     uint64_t tgt;
     uint64_t gpa;
@@ -361,9 +359,9 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
                                           "nvlink2-mr[0]", NULL);
         uint32_t associativity[] = {
             cpu_to_be32(0x4),
-            SPAPR_GPU_NUMA_ID,
-            SPAPR_GPU_NUMA_ID,
-            SPAPR_GPU_NUMA_ID,
+            cpu_to_be32(nvslot->numa_id),
+            cpu_to_be32(nvslot->numa_id),
+            cpu_to_be32(nvslot->numa_id),
             cpu_to_be32(nvslot->numa_id)
         };
         uint64_t size = object_property_get_uint(nv_mrobj, "size", NULL);
NUMA nodes corresponding to GPU memory currently have the same
affinity/distance as normal memory nodes. Add a third NUMA associativity
reference point enabling us to give GPU nodes more distance.

Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):

node distances:
node   0   1   2   3   4   5
  0:  10  40  40  40  40  40
  1:  40  10  40  40  40  40
  2:  40  40  10  40  40  40
  3:  40  40  40  10  40  40
  4:  40  40  40  40  10  40
  5:  40  40  40  40  40  10

After:

node distances:
node   0   1   2   3   4   5
  0:  10  40  80  80  80  80
  1:  40  10  80  80  80  80
  2:  80  80  10  80  80  80
  3:  80  80  80  10  80  80
  4:  80  80  80  80  10  80
  5:  80  80  80  80  80  10

These are the same distances as on the host, mirroring the change made
to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
Add a new level of NUMA for GPU's").

Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
---
 hw/ppc/spapr.c             | 6 +++++-
 hw/ppc/spapr_pci_nvlink2.c | 8 +++-----
 2 files changed, 8 insertions(+), 6 deletions(-)
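
To make the two properties easier to read side by side: the reference-point values index into each ibm,associativity list with the leading length cell counted as index 0, so the new 0x2 entry makes the second domain significant, and the patch gives each GPU node its own value there. The schematic fragments below restate this; the normal-memory list reflects QEMU's existing encoding as I understand it, and node_id / nvslot->numa_id stand in for the values the surrounding code supplies, so treat this as illustration rather than verbatim source.

```c
/* ibm,associativity-reference-points after this patch */
uint32_t refpoints[] = {
    cpu_to_be32(0x4),   /* index 4 of each list: the NUMA node id      */
    cpu_to_be32(0x4),   /* repeated, as in the existing property       */
    cpu_to_be32(0x2),   /* new: index 2, where GPU nodes now differ    */
};

/* ibm,associativity for a normal memory node (illustrative encoding) */
uint32_t mem_associativity[] = {
    cpu_to_be32(0x4),       /* length: four domains follow             */
    cpu_to_be32(0x0),
    cpu_to_be32(0x0),       /* index 2: common to all normal memory    */
    cpu_to_be32(0x0),
    cpu_to_be32(node_id),   /* index 4: this node's id                 */
};

/* ibm,associativity for a GPU memory node, as written by the patch */
uint32_t gpu_associativity[] = {
    cpu_to_be32(0x4),
    cpu_to_be32(nvslot->numa_id),
    cpu_to_be32(nvslot->numa_id),   /* index 2: unique per GPU, so GPU
                                       nodes pick up one extra mismatch */
    cpu_to_be32(nvslot->numa_id),
    cpu_to_be32(nvslot->numa_id),   /* index 4: the GPU's node id       */
};
```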