
[v2,2/2] spapr: Add a new level of NUMA for GPUs

Message ID 20200518214418.18248-2-arbab@linux.ibm.com (mailing list archive)
State New, archived

Commit Message

Reza Arbab May 18, 2020, 9:44 p.m. UTC
NUMA nodes corresponding to GPU memory currently have the same
affinity/distance as normal memory nodes. Add a third NUMA associativity
reference point enabling us to give GPU nodes more distance.

This is guest visible information, which shouldn't change under a
running guest across migration between different qemu versions, so make
the change effective only in new (pseries > 5.0) machine types.

Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):

node distances:
node   0   1   2   3   4   5
  0:  10  40  40  40  40  40
  1:  40  10  40  40  40  40
  2:  40  40  10  40  40  40
  3:  40  40  40  10  40  40
  4:  40  40  40  40  10  40
  5:  40  40  40  40  40  10

After:

node distances:
node   0   1   2   3   4   5
  0:  10  40  80  80  80  80
  1:  40  10  80  80  80  80
  2:  80  80  10  80  80  80
  3:  80  80  80  10  80  80
  4:  80  80  80  80  10  80
  5:  80  80  80  80  80  10

These are the same distances as on the host, mirroring the change made
to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
Add a new level of NUMA for GPU's").

Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
---
 hw/ppc/spapr.c             | 11 +++++++++--
 hw/ppc/spapr_pci_nvlink2.c |  2 +-
 2 files changed, 10 insertions(+), 3 deletions(-)
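
The jump from 40 to 80 follows from how the guest kernel consumes
"ibm,associativity-reference-points": the distance between two nodes starts
at 10 and doubles at every reference-point level where their associativity
domains differ, so a third reference point permits one more doubling. The
standalone C sketch below reproduces that rule in the spirit of
__node_distance() in Linux's arch/powerpc/mm/numa.c; the per-node domain
values are illustrative only, not the exact device tree contents.

/*
 * Not QEMU code: a minimal model of the guest-side distance calculation.
 * With refpoints {0x4, 0x4, 0x2}, normal memory nodes still share the
 * third domain (distance 40), while GPU nodes differ at all three levels
 * (distance 80), matching the tables above.
 */
#include <stdio.h>

#define LOCAL_DISTANCE 10

/* assoc_x[i] is node x's associativity domain at reference point i. */
static int node_distance(const int *assoc_a, const int *assoc_b, int depth)
{
    int distance = LOCAL_DISTANCE;
    int i;

    for (i = 0; i < depth; i++) {
        if (assoc_a[i] == assoc_b[i]) {
            break;              /* nodes share this domain: stop doubling */
        }
        distance *= 2;          /* one more level of separation */
    }
    return distance;
}

int main(void)
{
    int ram0[] = { 0, 0, 0 };   /* normal memory, node 0 */
    int ram1[] = { 1, 1, 0 };   /* normal memory, node 1 */
    int gpu2[] = { 2, 2, 2 };   /* GPU memory, differs at every level */

    printf("ram0<->ram1: %d\n", node_distance(ram0, ram1, 3));  /* 40 */
    printf("ram0<->gpu2: %d\n", node_distance(ram0, gpu2, 3));  /* 80 */
    printf("ram0<->ram0: %d\n", node_distance(ram0, ram0, 3));  /* 10 */
    return 0;
}

Compiling and running this reproduces the 10/40/80 figures shown in the
"After" table.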

Comments

Greg Kurz May 20, 2020, 11:36 p.m. UTC | #1
On Mon, 18 May 2020 16:44:18 -0500
Reza Arbab <arbab@linux.ibm.com> wrote:

> NUMA nodes corresponding to GPU memory currently have the same
> affinity/distance as normal memory nodes. Add a third NUMA associativity
> reference point enabling us to give GPU nodes more distance.
> 
> This is guest visible information, which shouldn't change under a
> running guest across migration between different qemu versions, so make
> the change effective only in new (pseries > 5.0) machine types.
> 
> Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
> 
> node distances:
> node   0   1   2   3   4   5
>   0:  10  40  40  40  40  40
>   1:  40  10  40  40  40  40
>   2:  40  40  10  40  40  40
>   3:  40  40  40  10  40  40
>   4:  40  40  40  40  10  40
>   5:  40  40  40  40  40  10
> 
> After:
> 
> node distances:
> node   0   1   2   3   4   5
>   0:  10  40  80  80  80  80
>   1:  40  10  80  80  80  80
>   2:  80  80  10  80  80  80
>   3:  80  80  80  10  80  80
>   4:  80  80  80  80  10  80
>   5:  80  80  80  80  80  10
> 
> These are the same distances as on the host, mirroring the change made
> to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
> Add a new level of NUMA for GPU's").
> 
> Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> ---
>  hw/ppc/spapr.c             | 11 +++++++++--
>  hw/ppc/spapr_pci_nvlink2.c |  2 +-
>  2 files changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 88b4a1f17716..1d9193d5ee49 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -893,7 +893,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>      int rtas;
>      GString *hypertas = g_string_sized_new(256);
>      GString *qemu_hypertas = g_string_sized_new(256);
> -    uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> +    uint32_t refpoints[] = {
> +        cpu_to_be32(0x4),
> +        cpu_to_be32(0x4),
> +        cpu_to_be32(0x2),
> +    };
>      uint32_t nr_refpoints;
>      uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
>          memory_region_size(&MACHINE(spapr)->device_memory->mr);
> @@ -4544,7 +4548,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>      smc->linux_pci_probe = true;
>      smc->smp_threads_vsmt = true;
>      smc->nr_xirqs = SPAPR_NR_XIRQS;
> -    smc->nr_assoc_refpoints = 2;
> +    smc->nr_assoc_refpoints = 3;
>      xfc->match_nvt = spapr_match_nvt;
>  }
>  
> @@ -4611,8 +4615,11 @@ DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
>   */
>  static void spapr_machine_5_0_class_options(MachineClass *mc)
>  {
> +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
>      spapr_machine_5_1_class_options(mc);
>      compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
> +    smc->nr_assoc_refpoints = 2;
>  }
>  
>  DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
> diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> index 8332d5694e46..247fd48731e2 100644
> --- a/hw/ppc/spapr_pci_nvlink2.c
> +++ b/hw/ppc/spapr_pci_nvlink2.c
> @@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
>          uint32_t associativity[] = {
>              cpu_to_be32(0x4),
>              SPAPR_GPU_NUMA_ID,
> -            SPAPR_GPU_NUMA_ID,
> +            cpu_to_be32(nvslot->numa_id),

This is a guest visible change. It should theoretically be controlled
with a compat property of the PHB (look for "static GlobalProperty" in
spapr.c). But since this code is only used for GPU passthrough and we
don't support migration of such devices, I guess it's okay. Maybe just
mention it in the changelog.

>              SPAPR_GPU_NUMA_ID,
>              cpu_to_be32(nvslot->numa_id)
>          };
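
For reference, the PHB compat-property mechanism Greg points to would look
roughly like the fragments below. This is a sketch only: the property name
"pre-5.1-associativity" and the pre_5_1_assoc field are placeholders, and
it is meant to sit alongside (not replace) the smc->nr_assoc_refpoints
handling already in the patch.

/* include/hw/pci-host/spapr.h (sketch): a flag on the PHB state */
struct SpaprPhbState {
    /* ... existing fields ... */
    bool pre_5_1_assoc;                    /* hypothetical field */
};

/* hw/ppc/spapr_pci.c (sketch): expose the flag as a device property */
static Property spapr_phb_properties[] = {
    /* ... existing properties ... */
    DEFINE_PROP_BOOL("pre-5.1-associativity", SpaprPhbState,
                     pre_5_1_assoc, false),
    DEFINE_PROP_END_OF_LIST(),
};

/* hw/ppc/spapr.c (sketch): turn it on for pseries-5.0 and older */
static void spapr_machine_5_0_class_options(MachineClass *mc)
{
    static GlobalProperty compat[] = {
        { TYPE_SPAPR_PCI_HOST_BRIDGE, "pre-5.1-associativity", "on" },
    };
    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);

    spapr_machine_5_1_class_options(mc);
    compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
    compat_props_add(mc->compat_props, compat, G_N_ELEMENTS(compat));
    smc->nr_assoc_refpoints = 2;
}

/* hw/ppc/spapr_pci_nvlink2.c (sketch): pick the layout per PHB */
associativity[2] = sphb->pre_5_1_assoc ? SPAPR_GPU_NUMA_ID
                                       : cpu_to_be32(nvslot->numa_id);

With something like this in place, the GPU associativity layout is tied to
the machine type through the PHB, so a pseries-5.0 guest keeps the old
device tree contents regardless of the QEMU version running it.
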
David Gibson May 21, 2020, 5:13 a.m. UTC | #2
On Thu, May 21, 2020 at 01:36:16AM +0200, Greg Kurz wrote:
> On Mon, 18 May 2020 16:44:18 -0500
> Reza Arbab <arbab@linux.ibm.com> wrote:
> 
> > NUMA nodes corresponding to GPU memory currently have the same
> > affinity/distance as normal memory nodes. Add a third NUMA associativity
> > reference point enabling us to give GPU nodes more distance.
> > 
> > This is guest visible information, which shouldn't change under a
> > running guest across migration between different qemu versions, so make
> > the change effective only in new (pseries > 5.0) machine types.
> > 
> > Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
> > 
> > node distances:
> > node   0   1   2   3   4   5
> >   0:  10  40  40  40  40  40
> >   1:  40  10  40  40  40  40
> >   2:  40  40  10  40  40  40
> >   3:  40  40  40  10  40  40
> >   4:  40  40  40  40  10  40
> >   5:  40  40  40  40  40  10
> > 
> > After:
> > 
> > node distances:
> > node   0   1   2   3   4   5
> >   0:  10  40  80  80  80  80
> >   1:  40  10  80  80  80  80
> >   2:  80  80  10  80  80  80
> >   3:  80  80  80  10  80  80
> >   4:  80  80  80  80  10  80
> >   5:  80  80  80  80  80  10
> > 
> > These are the same distances as on the host, mirroring the change made
> > to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
> > Add a new level of NUMA for GPU's").
> > 
> > Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> > ---
> >  hw/ppc/spapr.c             | 11 +++++++++--
> >  hw/ppc/spapr_pci_nvlink2.c |  2 +-
> >  2 files changed, 10 insertions(+), 3 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 88b4a1f17716..1d9193d5ee49 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -893,7 +893,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> >      int rtas;
> >      GString *hypertas = g_string_sized_new(256);
> >      GString *qemu_hypertas = g_string_sized_new(256);
> > -    uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> > +    uint32_t refpoints[] = {
> > +        cpu_to_be32(0x4),
> > +        cpu_to_be32(0x4),
> > +        cpu_to_be32(0x2),
> > +    };
> >      uint32_t nr_refpoints;
> >      uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
> >          memory_region_size(&MACHINE(spapr)->device_memory->mr);
> > @@ -4544,7 +4548,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >      smc->linux_pci_probe = true;
> >      smc->smp_threads_vsmt = true;
> >      smc->nr_xirqs = SPAPR_NR_XIRQS;
> > -    smc->nr_assoc_refpoints = 2;
> > +    smc->nr_assoc_refpoints = 3;
> >      xfc->match_nvt = spapr_match_nvt;
> >  }
> >  
> > @@ -4611,8 +4615,11 @@ DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
> >   */
> >  static void spapr_machine_5_0_class_options(MachineClass *mc)
> >  {
> > +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> > +
> >      spapr_machine_5_1_class_options(mc);
> >      compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
> > +    smc->nr_assoc_refpoints = 2;
> >  }
> >  
> >  DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
> > diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> > index 8332d5694e46..247fd48731e2 100644
> > --- a/hw/ppc/spapr_pci_nvlink2.c
> > +++ b/hw/ppc/spapr_pci_nvlink2.c
> > @@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
> >          uint32_t associativity[] = {
> >              cpu_to_be32(0x4),
> >              SPAPR_GPU_NUMA_ID,
> > -            SPAPR_GPU_NUMA_ID,
> > +            cpu_to_be32(nvslot->numa_id),
> 
> This is a guest visible change. It should theoretically be controlled
> with a compat property of the PHB (look for "static GlobalProperty" in
> spapr.c). But since this code is only used for GPU passthrough and we
> don't support migration of such devices, I guess it's okay. Maybe just
> mention it in the changelog.

Yeah, we might get away with it, but it should be too hard to get this
right, so let's do it.

> 
> >              SPAPR_GPU_NUMA_ID,
> >              cpu_to_be32(nvslot->numa_id)
> >          };
>
Greg Kurz May 21, 2020, 9 a.m. UTC | #3
On Thu, 21 May 2020 15:13:45 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, May 21, 2020 at 01:36:16AM +0200, Greg Kurz wrote:
> > On Mon, 18 May 2020 16:44:18 -0500
> > Reza Arbab <arbab@linux.ibm.com> wrote:
> > 
> > > NUMA nodes corresponding to GPU memory currently have the same
> > > affinity/distance as normal memory nodes. Add a third NUMA associativity
> > > reference point enabling us to give GPU nodes more distance.
> > > 
> > > This is guest visible information, which shouldn't change under a
> > > running guest across migration between different qemu versions, so make
> > > the change effective only in new (pseries > 5.0) machine types.
> > > 
> > > Before, `numactl -H` output in a guest with 4 GPUs (nodes 2-5):
> > > 
> > > node distances:
> > > node   0   1   2   3   4   5
> > >   0:  10  40  40  40  40  40
> > >   1:  40  10  40  40  40  40
> > >   2:  40  40  10  40  40  40
> > >   3:  40  40  40  10  40  40
> > >   4:  40  40  40  40  10  40
> > >   5:  40  40  40  40  40  10
> > > 
> > > After:
> > > 
> > > node distances:
> > > node   0   1   2   3   4   5
> > >   0:  10  40  80  80  80  80
> > >   1:  40  10  80  80  80  80
> > >   2:  80  80  10  80  80  80
> > >   3:  80  80  80  10  80  80
> > >   4:  80  80  80  80  10  80
> > >   5:  80  80  80  80  80  10
> > > 
> > > These are the same distances as on the host, mirroring the change made
> > > to host firmware in skiboot commit f845a648b8cb ("numa/associativity:
> > > Add a new level of NUMA for GPU's").
> > > 
> > > Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
> > > ---
> > >  hw/ppc/spapr.c             | 11 +++++++++--
> > >  hw/ppc/spapr_pci_nvlink2.c |  2 +-
> > >  2 files changed, 10 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > index 88b4a1f17716..1d9193d5ee49 100644
> > > --- a/hw/ppc/spapr.c
> > > +++ b/hw/ppc/spapr.c
> > > @@ -893,7 +893,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> > >      int rtas;
> > >      GString *hypertas = g_string_sized_new(256);
> > >      GString *qemu_hypertas = g_string_sized_new(256);
> > > -    uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
> > > +    uint32_t refpoints[] = {
> > > +        cpu_to_be32(0x4),
> > > +        cpu_to_be32(0x4),
> > > +        cpu_to_be32(0x2),
> > > +    };
> > >      uint32_t nr_refpoints;
> > >      uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
> > >          memory_region_size(&MACHINE(spapr)->device_memory->mr);
> > > @@ -4544,7 +4548,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> > >      smc->linux_pci_probe = true;
> > >      smc->smp_threads_vsmt = true;
> > >      smc->nr_xirqs = SPAPR_NR_XIRQS;
> > > -    smc->nr_assoc_refpoints = 2;
> > > +    smc->nr_assoc_refpoints = 3;
> > >      xfc->match_nvt = spapr_match_nvt;
> > >  }
> > >  
> > > @@ -4611,8 +4615,11 @@ DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
> > >   */
> > >  static void spapr_machine_5_0_class_options(MachineClass *mc)
> > >  {
> > > +    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> > > +
> > >      spapr_machine_5_1_class_options(mc);
> > >      compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
> > > +    smc->nr_assoc_refpoints = 2;
> > >  }
> > >  
> > >  DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
> > > diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
> > > index 8332d5694e46..247fd48731e2 100644
> > > --- a/hw/ppc/spapr_pci_nvlink2.c
> > > +++ b/hw/ppc/spapr_pci_nvlink2.c
> > > @@ -362,7 +362,7 @@ void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
> > >          uint32_t associativity[] = {
> > >              cpu_to_be32(0x4),
> > >              SPAPR_GPU_NUMA_ID,
> > > -            SPAPR_GPU_NUMA_ID,
> > > +            cpu_to_be32(nvslot->numa_id),
> > 
> > This is a guest visible change. It should theoretically be controlled
> > with a compat property of the PHB (look for "static GlobalProperty" in
> > spapr.c). But since this code is only used for GPU passthrough and we
> > don't support migration of such devices, I guess it's okay. Maybe just
> > mention it in the changelog.
> 
> Yeah, we might get away with it, but it should be too hard to get this

I guess you mean "it shouldn't be too hard" ?

> right, so let's do it.
> 
> > 
> > >              SPAPR_GPU_NUMA_ID,
> > >              cpu_to_be32(nvslot->numa_id)
> > >          };
> > 
>

Patch

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 88b4a1f17716..1d9193d5ee49 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -893,7 +893,11 @@  static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
     int rtas;
     GString *hypertas = g_string_sized_new(256);
     GString *qemu_hypertas = g_string_sized_new(256);
-    uint32_t refpoints[] = { cpu_to_be32(0x4), cpu_to_be32(0x4) };
+    uint32_t refpoints[] = {
+        cpu_to_be32(0x4),
+        cpu_to_be32(0x4),
+        cpu_to_be32(0x2),
+    };
     uint32_t nr_refpoints;
     uint64_t max_device_addr = MACHINE(spapr)->device_memory->base +
         memory_region_size(&MACHINE(spapr)->device_memory->mr);
@@ -4544,7 +4548,7 @@  static void spapr_machine_class_init(ObjectClass *oc, void *data)
     smc->linux_pci_probe = true;
     smc->smp_threads_vsmt = true;
     smc->nr_xirqs = SPAPR_NR_XIRQS;
-    smc->nr_assoc_refpoints = 2;
+    smc->nr_assoc_refpoints = 3;
     xfc->match_nvt = spapr_match_nvt;
 }
 
@@ -4611,8 +4615,11 @@  DEFINE_SPAPR_MACHINE(5_1, "5.1", true);
  */
 static void spapr_machine_5_0_class_options(MachineClass *mc)
 {
+    SpaprMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
     spapr_machine_5_1_class_options(mc);
     compat_props_add(mc->compat_props, hw_compat_5_0, hw_compat_5_0_len);
+    smc->nr_assoc_refpoints = 2;
 }
 
 DEFINE_SPAPR_MACHINE(5_0, "5.0", false);
diff --git a/hw/ppc/spapr_pci_nvlink2.c b/hw/ppc/spapr_pci_nvlink2.c
index 8332d5694e46..247fd48731e2 100644
--- a/hw/ppc/spapr_pci_nvlink2.c
+++ b/hw/ppc/spapr_pci_nvlink2.c
@@ -362,7 +362,7 @@  void spapr_phb_nvgpu_ram_populate_dt(SpaprPhbState *sphb, void *fdt)
         uint32_t associativity[] = {
             cpu_to_be32(0x4),
             SPAPR_GPU_NUMA_ID,
-            SPAPR_GPU_NUMA_ID,
+            cpu_to_be32(nvslot->numa_id),
             SPAPR_GPU_NUMA_ID,
             cpu_to_be32(nvslot->numa_id)
         };
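
Note that the spapr.c hunk above only declares nr_refpoints; the code that
consumes it is outside the diff context (presumably introduced by patch 1/2
of this series, not shown here). Under that assumption, the property write
later in spapr_dt_rtas() would truncate the array to the machine class's
count along these lines (a sketch, not an actual hunk from the series):

/* hw/ppc/spapr.c, later in spapr_dt_rtas() (sketch): emit only as many
 * reference points as the machine class allows, so pseries-5.0 keeps the
 * old two-entry property while newer machines get all three. */
nr_refpoints = smc->nr_assoc_refpoints;
assert(nr_refpoints <= ARRAY_SIZE(refpoints));
_FDT(fdt_setprop(fdt, rtas, "ibm,associativity-reference-points",
                 refpoints, nr_refpoints * sizeof(refpoints[0])));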