
[v9,05/11] numa: Extend CLI to provide initiator information for numa nodes

Message ID 20190809065731.9097-6-tao3.xu@intel.com (mailing list archive)
State New, archived
Series Build ACPI Heterogeneous Memory Attribute Table (HMAT)

Commit Message

Tao Xu Aug. 9, 2019, 6:57 a.m. UTC
From: Tao Xu <tao3.xu@intel.com>

In ACPI 6.3, section 5.2.27 Heterogeneous Memory Attribute Table (HMAT),
an initiator is a processor that accesses memory. In 5.2.27.3 Memory
Proximity Domain Attributes Structure, the attached initiator is defined as
the initiator to which the memory controller responsible for a memory
proximity domain is directly attached. With attached-initiator information,
the topology of heterogeneous memory can be described.

Extend the CLI of the "-numa node" option to indicate the initiator NUMA
node-id. In the Linux kernel, the code in drivers/acpi/hmat/hmat.c parses
and reports the platform's HMAT tables.

Reviewed-by: Jingqi Liu <Jingqi.liu@intel.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
---

No changes in v9
---
 hw/core/machine.c     | 24 ++++++++++++++++++++++++
 hw/core/numa.c        | 13 +++++++++++++
 include/sysemu/numa.h |  3 +++
 qapi/machine.json     |  6 +++++-
 qemu-options.hx       | 27 +++++++++++++++++++++++----
 5 files changed, 68 insertions(+), 5 deletions(-)

Comments

Igor Mammedov Aug. 13, 2019, 3 p.m. UTC | #1
On Fri,  9 Aug 2019 14:57:25 +0800
Tao <tao3.xu@intel.com> wrote:

> From: Tao Xu <tao3.xu@intel.com>
> 
> In ACPI 6.3 chapter 5.2.27 Heterogeneous Memory Attribute Table (HMAT),
> The initiator represents processor which access to memory. And in 5.2.27.3
> Memory Proximity Domain Attributes Structure, the attached initiator is
> defined as where the memory controller responsible for a memory proximity
> domain. With attached initiator information, the topology of heterogeneous
> memory can be described.
> 
> Extend CLI of "-numa node" option to indicate the initiator numa node-id.
> In the linux kernel, the codes in drivers/acpi/hmat/hmat.c parse and report
> the platform's HMAT tables.
> 
> Reviewed-by: Jingqi Liu <Jingqi.liu@intel.com>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> ---
> 
> No changes in v9
> ---
>  hw/core/machine.c     | 24 ++++++++++++++++++++++++
>  hw/core/numa.c        | 13 +++++++++++++
>  include/sysemu/numa.h |  3 +++
>  qapi/machine.json     |  6 +++++-
>  qemu-options.hx       | 27 +++++++++++++++++++++++----
>  5 files changed, 68 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 3c55470103..113184a9df 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -640,6 +640,7 @@ void machine_set_cpu_numa_node(MachineState *machine,
>                                 const CpuInstanceProperties *props, Error **errp)
>  {
>      MachineClass *mc = MACHINE_GET_CLASS(machine);
> +    NodeInfo *numa_info = machine->numa_state->nodes;
>      bool match = false;
>      int i;
>  
> @@ -709,6 +710,16 @@ void machine_set_cpu_numa_node(MachineState *machine,
>          match = true;
>          slot->props.node_id = props->node_id;
>          slot->props.has_node_id = props->has_node_id;
> +
> +        if (numa_info[props->node_id].initiator_valid &&
> +            (props->node_id != numa_info[props->node_id].initiator)) {
> +            error_setg(errp, "The initiator of CPU NUMA node %" PRId64
> +                       " should be itself.", props->node_id);
> +            return;
> +        }
> +        numa_info[props->node_id].initiator_valid = true;
> +        numa_info[props->node_id].has_cpu = true;
> +        numa_info[props->node_id].initiator = props->node_id;
>      }
>  
>      if (!match) {
> @@ -1050,6 +1061,7 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
>      GString *s = g_string_new(NULL);
>      MachineClass *mc = MACHINE_GET_CLASS(machine);
>      const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(machine);
> +    NodeInfo *numa_info = machine->numa_state->nodes;
>  
>      assert(machine->numa_state->num_nodes);
>      for (i = 0; i < possible_cpus->len; i++) {
> @@ -1083,6 +1095,18 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
>              machine_set_cpu_numa_node(machine, &props, &error_fatal);
>          }
>      }
> +
> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
> +        if (numa_info[i].initiator_valid &&
> +            !numa_info[numa_info[i].initiator].has_cpu) {
                          ^^^^^^^^^^^^^^^^^^^^^^ possible out-of-bounds read, see below

> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
> +                         " does not exist.", numa_info[i].initiator, i);
> +            error_printf("\n");
> +
> +            exit(1);
> +        }
This only covers nodes that have CPUs and memory-only nodes whose initiator
is explicitly provided on the CLI. It still allows memory-only nodes
without an initiator to be mixed with nodes that have one.
Is such a mixed configuration valid?
Should we forbid it?

> +    }
> +
>      if (s->len && !qtest_enabled()) {
>          warn_report("CPU(s) not present in any NUMA nodes: %s",
>                      s->str);
> diff --git a/hw/core/numa.c b/hw/core/numa.c
> index 8fcbba05d6..cfb6339810 100644
> --- a/hw/core/numa.c
> +++ b/hw/core/numa.c
> @@ -128,6 +128,19 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>          numa_info[nodenr].node_mem = object_property_get_uint(o, "size", NULL);
>          numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
>      }
> +
> +    if (node->has_initiator) {
> +        if (numa_info[nodenr].initiator_valid &&
> +            (node->initiator != numa_info[nodenr].initiator)) {
> +            error_setg(errp, "The initiator of NUMA node %" PRIu16 " has been "
> +                       "set to node %" PRIu16, nodenr,
> +                       numa_info[nodenr].initiator);
> +            return;
> +        }
> +
> +        numa_info[nodenr].initiator_valid = true;
> +        numa_info[nodenr].initiator = node->initiator;
                                             ^^^
Unvalidated user input? (This could lead to a read beyond the numa_info[]
boundaries in the previous hunk.)
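
A minimal sketch of the kind of range check meant here, not the actual
QEMU code. DemoNodeInfo and demo_set_initiator() are hypothetical stand-ins
for NodeInfo and the parse_numa_node() hunk, and the MAX_NODES value is
assumed for illustration. The point is only that the user-supplied
initiator is rejected before it can ever be used as an array index.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 128   /* assumed bound; QEMU provides its own MAX_NODES */

/* Hypothetical, trimmed-down stand-in for QEMU's NodeInfo. */
typedef struct {
    bool present;
    bool has_cpu;
    bool initiator_valid;
    uint16_t initiator;
} DemoNodeInfo;

/*
 * Record the initiator of node `nodenr` only if both values can safely
 * be used as indexes into nodes[]. Returns false and leaves the node
 * untouched when either value is out of range.
 */
static bool demo_set_initiator(DemoNodeInfo *nodes, uint16_t nodenr,
                               uint16_t initiator)
{
    if (nodenr >= MAX_NODES || initiator >= MAX_NODES) {
        return false;
    }
    nodes[nodenr].initiator_valid = true;
    nodes[nodenr].initiator = initiator;
    return true;
}
```

With such a check in place, the later numa_info[numa_info[i].initiator]
lookup can no longer index past the end of the array.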

> +    }
>      numa_info[nodenr].present = true;
>      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
>      ms->numa_state->num_nodes++;
> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> index 76da3016db..46ad06e000 100644
> --- a/include/sysemu/numa.h
> +++ b/include/sysemu/numa.h
> @@ -10,6 +10,9 @@ struct NodeInfo {
>      uint64_t node_mem;
>      struct HostMemoryBackend *node_memdev;
>      bool present;
> +    bool has_cpu;
> +    bool initiator_valid;
> +    uint16_t initiator;
>      uint8_t distance[MAX_NODES];
>  };
>  
> diff --git a/qapi/machine.json b/qapi/machine.json
> index 6db8a7e2ec..05e367d26a 100644
> --- a/qapi/machine.json
> +++ b/qapi/machine.json
> @@ -414,6 +414,9 @@
>  # @memdev: memory backend object.  If specified for one node,
>  #          it must be specified for all nodes.
>  #
> +# @initiator: the initiator numa nodeid that is closest (as in directly
> +#             attached) to this numa node (since 4.2)
Well, it's pretty unclear what the doc comment means (unless the reader
knows the relevant part of the ACPI spec well).

I suggest rephrasing it into something more understandable for readers
unfamiliar with the spec (plus possibly a reference to the spec for those
who are interested in its definition, since this doc is meant for developers).

> +#
>  # Since: 2.1
>  ##
>  { 'struct': 'NumaNodeOptions',
> @@ -421,7 +424,8 @@
>     '*nodeid': 'uint16',
>     '*cpus':   ['uint16'],
>     '*mem':    'size',
> -   '*memdev': 'str' }}
> +   '*memdev': 'str',
> +   '*initiator': 'uint16' }}
>  
>  ##
>  # @NumaDistOptions:
> diff --git a/qemu-options.hx b/qemu-options.hx
> index 9621e934c0..c480781992 100644
> --- a/qemu-options.hx
> +++ b/qemu-options.hx
> @@ -161,14 +161,14 @@ If any on the three values is given, the total number of CPUs @var{n} can be omi
>  ETEXI
>  
>  DEF("numa", HAS_ARG, QEMU_OPTION_numa,
> -    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> -    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> +    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
> +    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
>      "-numa dist,src=source,dst=destination,val=distance\n"
>      "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
>      QEMU_ARCH_ALL)
>  STEXI
> -@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> -@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> +@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
> +@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
>  @itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
>  @itemx -numa cpu,node-id=@var{node}[,socket-id=@var{x}][,core-id=@var{y}][,thread-id=@var{z}]
>  @findex -numa
> @@ -215,6 +215,25 @@ split equally between them.
>  @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore,
>  if one node uses @samp{memdev}, all of them have to use it.
>  
> +@samp{initiator} indicate the initiator NUMA @var{initiator} that is
                                  ^^^^^^^       ^^^^^^^^^^^^^^
The above will render as "initiator NUMA initiator"; was that your intention?

> +closest (as in directly attached) to this NUMA @var{node}.
Again, I suggest replacing the spec language with something more user friendly
(this time without a spec reference, as it's geared toward end users).

> +For example, the following option assigns 2 NUMA nodes, node 0 has CPU.
Following example creates a machine with 2 NUMA ...

> +node 1 has only memory, and its' initiator is node 0. Note that because
> +node 0 has CPU, by default the initiator of node 0 is itself and must be
> +itself.
> +@example
> +-M pc \
> +-m 2G,slots=2,maxmem=4G \
> +-object memory-backend-ram,size=1G,id=m0 \
> +-object memory-backend-ram,size=1G,id=m1 \
> +-numa node,nodeid=0,memdev=m0 \
> +-numa node,nodeid=1,memdev=m1,initiator=0 \
> +-smp 2,sockets=2,maxcpus=2  \
> +-numa cpu,node-id=0,socket-id=0 \
> +-numa cpu,node-id=0,socket-id=1 \
> +@end example
> +
>  @var{source} and @var{destination} are NUMA node IDs.
>  @var{distance} is the NUMA distance from @var{source} to @var{destination}.
>  The distance from a node to itself is always 10. If any pair of nodes is
Tao Xu Aug. 14, 2019, 2:24 a.m. UTC | #2
On 8/13/2019 11:00 PM, Igor Mammedov wrote:
> On Fri,  9 Aug 2019 14:57:25 +0800
> Tao <tao3.xu@intel.com> wrote:
> 
>> From: Tao Xu <tao3.xu@intel.com>
>>
>> In ACPI 6.3 chapter 5.2.27 Heterogeneous Memory Attribute Table (HMAT),
>> The initiator represents processor which access to memory. And in 5.2.27.3
>> Memory Proximity Domain Attributes Structure, the attached initiator is
>> defined as where the memory controller responsible for a memory proximity
>> domain. With attached initiator information, the topology of heterogeneous
>> memory can be described.
>>
>> Extend CLI of "-numa node" option to indicate the initiator numa node-id.
>> In the linux kernel, the codes in drivers/acpi/hmat/hmat.c parse and report
>> the platform's HMAT tables.
>>
>> Reviewed-by: Jingqi Liu <Jingqi.liu@intel.com>
>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>> Signed-off-by: Tao Xu <tao3.xu@intel.com>
>> ---
>>
>> No changes in v9
>> ---
[...]
>> +
>> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
>> +        if (numa_info[i].initiator_valid &&
>> +            !numa_info[numa_info[i].initiator].has_cpu) {
>                            ^^^^^^^^^^^^^^^^^^^^^^ possible out of bounds read, see bellow
> 
I will add an error check, "if (numa_info[i].initiator >= MAX_NODES)", when parsing the input.
>> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
>> +                         " does not exist.", numa_info[i].initiator, i);
>> +            error_printf("\n");
>> +
>> +            exit(1);
>> +        }
> it takes care only about nodes that have cpus or memory-only ones that have
> initiator explicitly provided on CLI. And leaves possibility to have
> memory-only nodes without initiator mixed with nodes that have initiator.
> Is it valid to have mixed configuration?
> Should we forbid it?
> 
A mixed configuration may indeed trigger bugs in the future, because this
patch series generates the HMAT by default. With a mixed configuration, or
with no initiator set at all, the "Flags" field of a memory-only node is 0,
so the Proximity Domain for the Attached Initiator field is not valid.

There are three situations:

1) full configuration, for example:
-object memory-backend-ram,size=1G,id=m0 \
-object memory-backend-ram,size=1G,id=m1 \
-object memory-backend-ram,size=1G,id=m2 \
-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-numa node,nodeid=2,memdev=m2,initiator=0

2) mixed configuration, for example:
-object memory-backend-ram,size=1G,id=m0 \
-object memory-backend-ram,size=1G,id=m1 \
-object memory-backend-ram,size=1G,id=m2 \
-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-numa node,nodeid=2,memdev=m2

3) no configuration, for example:
-object memory-backend-ram,size=1G,id=m0 \
-object memory-backend-ram,size=1G,id=m1 \
-object memory-backend-ram,size=1G,id=m2 \
-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1 \
-numa node,nodeid=2,memdev=m2

I have 3 ideas:

1. HMAT option. Add a machine option like "-machine,hmat=yes" to enable
the HMAT in QEMU.

2. Default setting. A NUMA node without an initiator uses, by default, the
NUMA node that contains CPU 0 as its initiator.

3. Auto setting. Intelligent auto-configuration, like
numa_default_auto_assign_ram: automatically assign initiators to the
memory-only nodes, spread evenly.

Therefore, there are 2 possible solutions:

1) HMAT option + Default setting

2) HMAT option + Auto setting
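
To make the "Default setting" idea concrete, here is a rough sketch under
stated assumptions, not proposed patch code. SketchNode and
sketch_apply_default_initiator() are hypothetical names, node ids are
assumed to be contiguous from 0, and the caller is assumed to already know
which node contains CPU 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for QEMU's NodeInfo. */
typedef struct {
    bool present;
    bool has_cpu;
    bool initiator_valid;
    uint16_t initiator;
} SketchNode;

/*
 * "Default setting" policy: every present node that still lacks an
 * initiator gets `default_initiator` (for example, the node that
 * contains CPU 0). Returns the number of nodes that were filled in,
 * or -1 if the default node is out of range or has no CPU.
 */
static int sketch_apply_default_initiator(SketchNode *nodes, int num_nodes,
                                          uint16_t default_initiator)
{
    int filled = 0;

    if (default_initiator >= num_nodes ||
        !nodes[default_initiator].has_cpu) {
        return -1;
    }
    for (int i = 0; i < num_nodes; i++) {
        if (nodes[i].present && !nodes[i].initiator_valid) {
            nodes[i].initiator_valid = true;
            nodes[i].initiator = default_initiator;
            filled++;
        }
    }
    return filled;
}
```

Applied to situation 2) above, this would give node 2 the initiator 0 and
leave nodes 0 and 1 unchanged, so the mixed configuration turns into a full
one.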

>> +    }
>> +
>>       if (s->len && !qtest_enabled()) {
>>           warn_report("CPU(s) not present in any NUMA nodes: %s",
>>                       s->str);
>> diff --git a/hw/core/numa.c b/hw/core/numa.c
>> index 8fcbba05d6..cfb6339810 100644
>> --- a/hw/core/numa.c
>> +++ b/hw/core/numa.c
>> @@ -128,6 +128,19 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>>           numa_info[nodenr].node_mem = object_property_get_uint(o, "size", NULL);
>>           numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
>>       }
>> +
>> +    if (node->has_initiator) {
>> +        if (numa_info[nodenr].initiator_valid &&
>> +            (node->initiator != numa_info[nodenr].initiator)) {
>> +            error_setg(errp, "The initiator of NUMA node %" PRIu16 " has been "
>> +                       "set to node %" PRIu16, nodenr,
>> +                       numa_info[nodenr].initiator);
>> +            return;
>> +        }
>> +
>> +        numa_info[nodenr].initiator_valid = true;
>> +        numa_info[nodenr].initiator = node->initiator;
>                                               ^^^
> not validated  user input? (which could lead to read beyond numa_info[] boundaries
> in previous hunk).
> 
>> +    }
>>       numa_info[nodenr].present = true;
>>       max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
>>       ms->numa_state->num_nodes++;
>> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
>> index 76da3016db..46ad06e000 100644
>> --- a/include/sysemu/numa.h
>> +++ b/include/sysemu/numa.h
>> @@ -10,6 +10,9 @@ struct NodeInfo {
>>       uint64_t node_mem;
>>       struct HostMemoryBackend *node_memdev;
>>       bool present;
>> +    bool has_cpu;
>> +    bool initiator_valid;
>> +    uint16_t initiator;
>>       uint8_t distance[MAX_NODES];
>>   };
>>   
>> diff --git a/qapi/machine.json b/qapi/machine.json
>> index 6db8a7e2ec..05e367d26a 100644
>> --- a/qapi/machine.json
>> +++ b/qapi/machine.json
>> @@ -414,6 +414,9 @@
>>   # @memdev: memory backend object.  If specified for one node,
>>   #          it must be specified for all nodes.
>>   #
>> +# @initiator: the initiator numa nodeid that is closest (as in directly
>> +#             attached) to this numa node (since 4.2)
> well, it's pretty unclear what doc comment means (unless reader knows well
> specific part of ACPI spec)
> 
> suggest to rephrase to something more understandable for unaware
> readers (+ possible reference to spec for those who is interested
> in spec definition since this doc is meant for developers).
> 
>> +#
>>   # Since: 2.1
>>   ##
>>   { 'struct': 'NumaNodeOptions',
>> @@ -421,7 +424,8 @@
>>      '*nodeid': 'uint16',
>>      '*cpus':   ['uint16'],
>>      '*mem':    'size',
>> -   '*memdev': 'str' }}
>> +   '*memdev': 'str',
>> +   '*initiator': 'uint16' }}
>>   
>>   ##
>>   # @NumaDistOptions:
>> diff --git a/qemu-options.hx b/qemu-options.hx
>> index 9621e934c0..c480781992 100644
>> --- a/qemu-options.hx
>> +++ b/qemu-options.hx
>> @@ -161,14 +161,14 @@ If any on the three values is given, the total number of CPUs @var{n} can be omi
>>   ETEXI
>>   
>>   DEF("numa", HAS_ARG, QEMU_OPTION_numa,
>> -    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
>> -    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
>> +    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
>> +    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
>>       "-numa dist,src=source,dst=destination,val=distance\n"
>>       "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
>>       QEMU_ARCH_ALL)
>>   STEXI
>> -@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
>> -@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
>> +@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
>> +@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
>>   @itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
>>   @itemx -numa cpu,node-id=@var{node}[,socket-id=@var{x}][,core-id=@var{y}][,thread-id=@var{z}]
>>   @findex -numa
>> @@ -215,6 +215,25 @@ split equally between them.
>>   @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore,
>>   if one node uses @samp{memdev}, all of them have to use it.
>>   
>> +@samp{initiator} indicate the initiator NUMA @var{initiator} that is
>                                    ^^^^^^^       ^^^^^^^^^^^^^^
> above will result in "initiator NUMA initiator", was it your intention?
> 
>> +closest (as in directly attached) to this NUMA @var{node}.
> Again suggest replace spec language with something more user friendly
> (this time without spec reference as it's geared for end user)
> 
>> +For example, the following option assigns 2 NUMA nodes, node 0 has CPU.
> Following example creates a machine with 2 NUMA ...
> 
>> +node 1 has only memory, and its' initiator is node 0. Note that because
>> +node 0 has CPU, by default the initiator of node 0 is itself and must be
>> +itself.
>> +@example
>> +-M pc \
>> +-m 2G,slots=2,maxmem=4G \
>> +-object memory-backend-ram,size=1G,id=m0 \
>> +-object memory-backend-ram,size=1G,id=m1 \
>> +-numa node,nodeid=0,memdev=m0 \
>> +-numa node,nodeid=1,memdev=m1,initiator=0 \
>> +-smp 2,sockets=2,maxcpus=2  \
>> +-numa cpu,node-id=0,socket-id=0 \
>> +-numa cpu,node-id=0,socket-id=1 \
>> +@end example
>> +
>>   @var{source} and @var{destination} are NUMA node IDs.
>>   @var{distance} is the NUMA distance from @var{source} to @var{destination}.
>>   The distance from a node to itself is always 10. If any pair of nodes is
>
Dan Williams Aug. 14, 2019, 2:39 a.m. UTC | #3
On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov <imammedo@redhat.com> wrote:
>
> On Fri,  9 Aug 2019 14:57:25 +0800
> Tao <tao3.xu@intel.com> wrote:
>
> > From: Tao Xu <tao3.xu@intel.com>
> >
> > In ACPI 6.3 chapter 5.2.27 Heterogeneous Memory Attribute Table (HMAT),
> > The initiator represents processor which access to memory. And in 5.2.27.3
> > Memory Proximity Domain Attributes Structure, the attached initiator is
> > defined as where the memory controller responsible for a memory proximity
> > domain. With attached initiator information, the topology of heterogeneous
> > memory can be described.
> >
> > Extend CLI of "-numa node" option to indicate the initiator numa node-id.
> > In the linux kernel, the codes in drivers/acpi/hmat/hmat.c parse and report
> > the platform's HMAT tables.
> >
> > Reviewed-by: Jingqi Liu <Jingqi.liu@intel.com>
> > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Tao Xu <tao3.xu@intel.com>
> > ---
> >
> > No changes in v9
> > ---
> >  hw/core/machine.c     | 24 ++++++++++++++++++++++++
> >  hw/core/numa.c        | 13 +++++++++++++
> >  include/sysemu/numa.h |  3 +++
> >  qapi/machine.json     |  6 +++++-
> >  qemu-options.hx       | 27 +++++++++++++++++++++++----
> >  5 files changed, 68 insertions(+), 5 deletions(-)
> >
> > diff --git a/hw/core/machine.c b/hw/core/machine.c
> > index 3c55470103..113184a9df 100644
> > --- a/hw/core/machine.c
> > +++ b/hw/core/machine.c
> > @@ -640,6 +640,7 @@ void machine_set_cpu_numa_node(MachineState *machine,
> >                                 const CpuInstanceProperties *props, Error **errp)
> >  {
> >      MachineClass *mc = MACHINE_GET_CLASS(machine);
> > +    NodeInfo *numa_info = machine->numa_state->nodes;
> >      bool match = false;
> >      int i;
> >
> > @@ -709,6 +710,16 @@ void machine_set_cpu_numa_node(MachineState *machine,
> >          match = true;
> >          slot->props.node_id = props->node_id;
> >          slot->props.has_node_id = props->has_node_id;
> > +
> > +        if (numa_info[props->node_id].initiator_valid &&
> > +            (props->node_id != numa_info[props->node_id].initiator)) {
> > +            error_setg(errp, "The initiator of CPU NUMA node %" PRId64
> > +                       " should be itself.", props->node_id);
> > +            return;
> > +        }
> > +        numa_info[props->node_id].initiator_valid = true;
> > +        numa_info[props->node_id].has_cpu = true;
> > +        numa_info[props->node_id].initiator = props->node_id;
> >      }
> >
> >      if (!match) {
> > @@ -1050,6 +1061,7 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
> >      GString *s = g_string_new(NULL);
> >      MachineClass *mc = MACHINE_GET_CLASS(machine);
> >      const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(machine);
> > +    NodeInfo *numa_info = machine->numa_state->nodes;
> >
> >      assert(machine->numa_state->num_nodes);
> >      for (i = 0; i < possible_cpus->len; i++) {
> > @@ -1083,6 +1095,18 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
> >              machine_set_cpu_numa_node(machine, &props, &error_fatal);
> >          }
> >      }
> > +
> > +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
> > +        if (numa_info[i].initiator_valid &&
> > +            !numa_info[numa_info[i].initiator].has_cpu) {
>                           ^^^^^^^^^^^^^^^^^^^^^^ possible out of bounds read, see bellow
>
> > +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
> > +                         " does not exist.", numa_info[i].initiator, i);
> > +            error_printf("\n");
> > +
> > +            exit(1);
> > +        }
> it takes care only about nodes that have cpus or memory-only ones that have
> initiator explicitly provided on CLI. And leaves possibility to have
> memory-only nodes without initiator mixed with nodes that have initiator.
> Is it valid to have mixed configuration?
> Should we forbid it?

The spec talks about the "Proximity Domain for the Attached Initiator"
field only being valid if the memory controller for the memory can be
identified by an initiator id in the SRAT. So I expect the only way to
define a memory proximity domain without this local initiator is to
allow specifying a node-id that does not have an entry in the SRAT.

That would be a useful feature for testing OS HMAT parsing behavior,
and may match platforms that exist in practice.
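
As an illustration of that validity rule, a small sketch follows. It is an
assumption-laden model and not QEMU's HMAT builder: the struct is a
simplified view of the Memory Proximity Domain Attributes Structure, bit 0
of Flags is assumed to be the "attached initiator field is valid" bit, and
all names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed meaning of Flags bit 0: attached-initiator field is valid. */
#define DEMO_HMAT_INITIATOR_VALID (1u << 0)

/* Simplified view of the structure; reserved fields are omitted. */
typedef struct {
    uint16_t flags;
    uint32_t initiator_pxm;   /* only meaningful when the flag is set */
    uint32_t memory_pxm;
} DemoMemProximityAttrs;

/*
 * Describe one memory proximity domain. When no local initiator is
 * known, Flags stays 0 and the initiator field is left as 0, which a
 * guest OS is expected to ignore.
 */
static DemoMemProximityAttrs demo_describe_node(uint32_t memory_pxm,
                                                bool initiator_valid,
                                                uint32_t initiator_pxm)
{
    DemoMemProximityAttrs attrs = {
        .flags = initiator_valid ? DEMO_HMAT_INITIATOR_VALID : 0,
        .initiator_pxm = initiator_valid ? initiator_pxm : 0,
        .memory_pxm = memory_pxm,
    };
    return attrs;
}
```

A memory-only node with no initiator given would then be emitted with
Flags == 0, which is the case an OS HMAT parser needs to handle.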

>
> > +    }
> > +
> >      if (s->len && !qtest_enabled()) {
> >          warn_report("CPU(s) not present in any NUMA nodes: %s",
> >                      s->str);
> > diff --git a/hw/core/numa.c b/hw/core/numa.c
> > index 8fcbba05d6..cfb6339810 100644
> > --- a/hw/core/numa.c
> > +++ b/hw/core/numa.c
> > @@ -128,6 +128,19 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
> >          numa_info[nodenr].node_mem = object_property_get_uint(o, "size", NULL);
> >          numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
> >      }
> > +
> > +    if (node->has_initiator) {
> > +        if (numa_info[nodenr].initiator_valid &&
> > +            (node->initiator != numa_info[nodenr].initiator)) {
> > +            error_setg(errp, "The initiator of NUMA node %" PRIu16 " has been "
> > +                       "set to node %" PRIu16, nodenr,
> > +                       numa_info[nodenr].initiator);
> > +            return;
> > +        }
> > +
> > +        numa_info[nodenr].initiator_valid = true;
> > +        numa_info[nodenr].initiator = node->initiator;
>                                              ^^^
> not validated  user input? (which could lead to read beyond numa_info[] boundaries
> in previous hunk).
>
> > +    }
> >      numa_info[nodenr].present = true;
> >      max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
> >      ms->numa_state->num_nodes++;
> > diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> > index 76da3016db..46ad06e000 100644
> > --- a/include/sysemu/numa.h
> > +++ b/include/sysemu/numa.h
> > @@ -10,6 +10,9 @@ struct NodeInfo {
> >      uint64_t node_mem;
> >      struct HostMemoryBackend *node_memdev;
> >      bool present;
> > +    bool has_cpu;
> > +    bool initiator_valid;
> > +    uint16_t initiator;
> >      uint8_t distance[MAX_NODES];
> >  };
> >
> > diff --git a/qapi/machine.json b/qapi/machine.json
> > index 6db8a7e2ec..05e367d26a 100644
> > --- a/qapi/machine.json
> > +++ b/qapi/machine.json
> > @@ -414,6 +414,9 @@
> >  # @memdev: memory backend object.  If specified for one node,
> >  #          it must be specified for all nodes.
> >  #
> > +# @initiator: the initiator numa nodeid that is closest (as in directly
> > +#             attached) to this numa node (since 4.2)
> well, it's pretty unclear what doc comment means (unless reader knows well
> specific part of ACPI spec)
>
> suggest to rephrase to something more understandable for unaware
> readers (+ possible reference to spec for those who is interested
> in spec definition since this doc is meant for developers).
>
> > +#
> >  # Since: 2.1
> >  ##
> >  { 'struct': 'NumaNodeOptions',
> > @@ -421,7 +424,8 @@
> >     '*nodeid': 'uint16',
> >     '*cpus':   ['uint16'],
> >     '*mem':    'size',
> > -   '*memdev': 'str' }}
> > +   '*memdev': 'str',
> > +   '*initiator': 'uint16' }}
> >
> >  ##
> >  # @NumaDistOptions:
> > diff --git a/qemu-options.hx b/qemu-options.hx
> > index 9621e934c0..c480781992 100644
> > --- a/qemu-options.hx
> > +++ b/qemu-options.hx
> > @@ -161,14 +161,14 @@ If any on the three values is given, the total number of CPUs @var{n} can be omi
> >  ETEXI
> >
> >  DEF("numa", HAS_ARG, QEMU_OPTION_numa,
> > -    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> > -    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> > +    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
> > +    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
> >      "-numa dist,src=source,dst=destination,val=distance\n"
> >      "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
> >      QEMU_ARCH_ALL)
> >  STEXI
> > -@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> > -@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> > +@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
> > +@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
> >  @itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
> >  @itemx -numa cpu,node-id=@var{node}[,socket-id=@var{x}][,core-id=@var{y}][,thread-id=@var{z}]
> >  @findex -numa
> > @@ -215,6 +215,25 @@ split equally between them.
> >  @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore,
> >  if one node uses @samp{memdev}, all of them have to use it.
> >
> > +@samp{initiator} indicate the initiator NUMA @var{initiator} that is
>                                   ^^^^^^^       ^^^^^^^^^^^^^^
> above will result in "initiator NUMA initiator", was it your intention?
>
> > +closest (as in directly attached) to this NUMA @var{node}.
> Again suggest replace spec language with something more user friendly
> (this time without spec reference as it's geared for end user)
>
> > +For example, the following option assigns 2 NUMA nodes, node 0 has CPU.
> Following example creates a machine with 2 NUMA ...
>
> > +node 1 has only memory, and its' initiator is node 0. Note that because
> > +node 0 has CPU, by default the initiator of node 0 is itself and must be
> > +itself.
> > +@example
> > +-M pc \
> > +-m 2G,slots=2,maxmem=4G \
> > +-object memory-backend-ram,size=1G,id=m0 \
> > +-object memory-backend-ram,size=1G,id=m1 \
> > +-numa node,nodeid=0,memdev=m0 \
> > +-numa node,nodeid=1,memdev=m1,initiator=0 \
> > +-smp 2,sockets=2,maxcpus=2  \
> > +-numa cpu,node-id=0,socket-id=0 \
> > +-numa cpu,node-id=0,socket-id=1 \
> > +@end example
> > +
> >  @var{source} and @var{destination} are NUMA node IDs.
> >  @var{distance} is the NUMA distance from @var{source} to @var{destination}.
> >  The distance from a node to itself is always 10. If any pair of nodes is
>
Tao Xu Aug. 14, 2019, 5:13 a.m. UTC | #4
On 8/14/2019 10:39 AM, Dan Williams wrote:
> On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov <imammedo@redhat.com> wrote:
>>
>> On Fri,  9 Aug 2019 14:57:25 +0800
>> Tao <tao3.xu@intel.com> wrote:
>>
>>> From: Tao Xu <tao3.xu@intel.com>
>>>
>>> In ACPI 6.3 chapter 5.2.27 Heterogeneous Memory Attribute Table (HMAT),
>>> The initiator represents processor which access to memory. And in 5.2.27.3
>>> Memory Proximity Domain Attributes Structure, the attached initiator is
>>> defined as where the memory controller responsible for a memory proximity
>>> domain. With attached initiator information, the topology of heterogeneous
>>> memory can be described.
>>>
>>> Extend CLI of "-numa node" option to indicate the initiator numa node-id.
>>> In the linux kernel, the codes in drivers/acpi/hmat/hmat.c parse and report
>>> the platform's HMAT tables.
>>>
>>> Reviewed-by: Jingqi Liu <Jingqi.liu@intel.com>
>>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>>> Signed-off-by: Tao Xu <tao3.xu@intel.com>
>>> ---
>>>
>>> No changes in v9
>>> ---
>>>   hw/core/machine.c     | 24 ++++++++++++++++++++++++
>>>   hw/core/numa.c        | 13 +++++++++++++
>>>   include/sysemu/numa.h |  3 +++
>>>   qapi/machine.json     |  6 +++++-
>>>   qemu-options.hx       | 27 +++++++++++++++++++++++----
>>>   5 files changed, 68 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/hw/core/machine.c b/hw/core/machine.c
>>> index 3c55470103..113184a9df 100644
>>> --- a/hw/core/machine.c
>>> +++ b/hw/core/machine.c
>>> @@ -640,6 +640,7 @@ void machine_set_cpu_numa_node(MachineState *machine,
>>>                                  const CpuInstanceProperties *props, Error **errp)
>>>   {
>>>       MachineClass *mc = MACHINE_GET_CLASS(machine);
>>> +    NodeInfo *numa_info = machine->numa_state->nodes;
>>>       bool match = false;
>>>       int i;
>>>
>>> @@ -709,6 +710,16 @@ void machine_set_cpu_numa_node(MachineState *machine,
>>>           match = true;
>>>           slot->props.node_id = props->node_id;
>>>           slot->props.has_node_id = props->has_node_id;
>>> +
>>> +        if (numa_info[props->node_id].initiator_valid &&
>>> +            (props->node_id != numa_info[props->node_id].initiator)) {
>>> +            error_setg(errp, "The initiator of CPU NUMA node %" PRId64
>>> +                       " should be itself.", props->node_id);
>>> +            return;
>>> +        }
>>> +        numa_info[props->node_id].initiator_valid = true;
>>> +        numa_info[props->node_id].has_cpu = true;
>>> +        numa_info[props->node_id].initiator = props->node_id;
>>>       }
>>>
>>>       if (!match) {
>>> @@ -1050,6 +1061,7 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
>>>       GString *s = g_string_new(NULL);
>>>       MachineClass *mc = MACHINE_GET_CLASS(machine);
>>>       const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(machine);
>>> +    NodeInfo *numa_info = machine->numa_state->nodes;
>>>
>>>       assert(machine->numa_state->num_nodes);
>>>       for (i = 0; i < possible_cpus->len; i++) {
>>> @@ -1083,6 +1095,18 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
>>>               machine_set_cpu_numa_node(machine, &props, &error_fatal);
>>>           }
>>>       }
>>> +
>>> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
>>> +        if (numa_info[i].initiator_valid &&
>>> +            !numa_info[numa_info[i].initiator].has_cpu) {
>>                            ^^^^^^^^^^^^^^^^^^^^^^ possible out of bounds read, see bellow
>>
>>> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
>>> +                         " does not exist.", numa_info[i].initiator, i);
>>> +            error_printf("\n");
>>> +
>>> +            exit(1);
>>> +        }
>> it takes care only about nodes that have cpus or memory-only ones that have
>> initiator explicitly provided on CLI. And leaves possibility to have
>> memory-only nodes without initiator mixed with nodes that have initiator.
>> Is it valid to have mixed configuration?
>> Should we forbid it?
> 
> The spec talks about the "Proximity Domain for the Attached Initiator"
> field only being valid if the memory controller for the memory can be
> identified by an initiator id in the SRAT. So I expect the only way to
> define a memory proximity domain without this local initiator is to
> allow specifying a node-id that does not have an entry in the SRAT.
> 
Hi Dan,

So there may be situations where the Attached Initiator field is not
valid? If so, I would allow the user to specify an invalid initiator.

> That would be a useful feature for testing OS HMAT parsing behavior,
> and may match platforms that exist in practice.
> 
>>
>>> +    }
>>> +
>>>       if (s->len && !qtest_enabled()) {
>>>           warn_report("CPU(s) not present in any NUMA nodes: %s",
>>>                       s->str);
>>> diff --git a/hw/core/numa.c b/hw/core/numa.c
>>> index 8fcbba05d6..cfb6339810 100644
>>> --- a/hw/core/numa.c
>>> +++ b/hw/core/numa.c
>>> @@ -128,6 +128,19 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
>>>           numa_info[nodenr].node_mem = object_property_get_uint(o, "size", NULL);
>>>           numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
>>>       }
>>> +
>>> +    if (node->has_initiator) {
>>> +        if (numa_info[nodenr].initiator_valid &&
>>> +            (node->initiator != numa_info[nodenr].initiator)) {
>>> +            error_setg(errp, "The initiator of NUMA node %" PRIu16 " has been "
>>> +                       "set to node %" PRIu16, nodenr,
>>> +                       numa_info[nodenr].initiator);
>>> +            return;
>>> +        }
>>> +
>>> +        numa_info[nodenr].initiator_valid = true;
>>> +        numa_info[nodenr].initiator = node->initiator;
>>                                               ^^^
>> not validated  user input? (which could lead to read beyond numa_info[] boundaries
>> in previous hunk).
>>
>>> +    }
>>>       numa_info[nodenr].present = true;
>>>       max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
>>>       ms->numa_state->num_nodes++;
>>> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
>>> index 76da3016db..46ad06e000 100644
>>> --- a/include/sysemu/numa.h
>>> +++ b/include/sysemu/numa.h
>>> @@ -10,6 +10,9 @@ struct NodeInfo {
>>>       uint64_t node_mem;
>>>       struct HostMemoryBackend *node_memdev;
>>>       bool present;
>>> +    bool has_cpu;
>>> +    bool initiator_valid;
>>> +    uint16_t initiator;
>>>       uint8_t distance[MAX_NODES];
>>>   };
>>>
>>> diff --git a/qapi/machine.json b/qapi/machine.json
>>> index 6db8a7e2ec..05e367d26a 100644
>>> --- a/qapi/machine.json
>>> +++ b/qapi/machine.json
>>> @@ -414,6 +414,9 @@
>>>   # @memdev: memory backend object.  If specified for one node,
>>>   #          it must be specified for all nodes.
>>>   #
>>> +# @initiator: the initiator numa nodeid that is closest (as in directly
>>> +#             attached) to this numa node (since 4.2)
>> well, it's pretty unclear what doc comment means (unless reader knows well
>> specific part of ACPI spec)
>>
>> suggest to rephrase to something more understandable for unaware
>> readers (+ possible reference to spec for those who is interested
>> in spec definition since this doc is meant for developers).
>>
>>> +#
>>>   # Since: 2.1
>>>   ##
>>>   { 'struct': 'NumaNodeOptions',
>>> @@ -421,7 +424,8 @@
>>>      '*nodeid': 'uint16',
>>>      '*cpus':   ['uint16'],
>>>      '*mem':    'size',
>>> -   '*memdev': 'str' }}
>>> +   '*memdev': 'str',
>>> +   '*initiator': 'uint16' }}
>>>
>>>   ##
>>>   # @NumaDistOptions:
>>> diff --git a/qemu-options.hx b/qemu-options.hx
>>> index 9621e934c0..c480781992 100644
>>> --- a/qemu-options.hx
>>> +++ b/qemu-options.hx
>>> @@ -161,14 +161,14 @@ If any on the three values is given, the total number of CPUs @var{n} can be omi
>>>   ETEXI
>>>
>>>   DEF("numa", HAS_ARG, QEMU_OPTION_numa,
>>> -    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
>>> -    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
>>> +    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
>>> +    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
>>>       "-numa dist,src=source,dst=destination,val=distance\n"
>>>       "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
>>>       QEMU_ARCH_ALL)
>>>   STEXI
>>> -@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
>>> -@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
>>> +@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
>>> +@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
>>>   @itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
>>>   @itemx -numa cpu,node-id=@var{node}[,socket-id=@var{x}][,core-id=@var{y}][,thread-id=@var{z}]
>>>   @findex -numa
>>> @@ -215,6 +215,25 @@ split equally between them.
>>>   @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore,
>>>   if one node uses @samp{memdev}, all of them have to use it.
>>>
>>> +@samp{initiator} indicate the initiator NUMA @var{initiator} that is
>>                                    ^^^^^^^       ^^^^^^^^^^^^^^
>> above will result in "initiator NUMA initiator", was it your intention?
>>
>>> +closest (as in directly attached) to this NUMA @var{node}.
>> Again suggest replace spec language with something more user friendly
>> (this time without spec reference as it's geared for end user)
>>
>>> +For example, the following option assigns 2 NUMA nodes, node 0 has CPU.
>> Following example creates a machine with 2 NUMA ...
>>
>>> +node 1 has only memory, and its' initiator is node 0. Note that because
>>> +node 0 has CPU, by default the initiator of node 0 is itself and must be
>>> +itself.
>>> +@example
>>> +-M pc \
>>> +-m 2G,slots=2,maxmem=4G \
>>> +-object memory-backend-ram,size=1G,id=m0 \
>>> +-object memory-backend-ram,size=1G,id=m1 \
>>> +-numa node,nodeid=0,memdev=m0 \
>>> +-numa node,nodeid=1,memdev=m1,initiator=0 \
>>> +-smp 2,sockets=2,maxcpus=2  \
>>> +-numa cpu,node-id=0,socket-id=0 \
>>> +-numa cpu,node-id=0,socket-id=1 \
>>> +@end example
>>> +
>>>   @var{source} and @var{destination} are NUMA node IDs.
>>>   @var{distance} is the NUMA distance from @var{source} to @var{destination}.
>>>   The distance from a node to itself is always 10. If any pair of nodes is
>>
Dan Williams Aug. 14, 2019, 9:29 p.m. UTC | #5
On Tue, Aug 13, 2019 at 10:14 PM Tao Xu <tao3.xu@intel.com> wrote:
>
> On 8/14/2019 10:39 AM, Dan Williams wrote:
> > On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov <imammedo@redhat.com> wrote:
> >>
> >> On Fri,  9 Aug 2019 14:57:25 +0800
> >> Tao <tao3.xu@intel.com> wrote:
> >>
> >>> From: Tao Xu <tao3.xu@intel.com>
> >>>
> >>> In ACPI 6.3 chapter 5.2.27 Heterogeneous Memory Attribute Table (HMAT),
> >>> The initiator represents processor which access to memory. And in 5.2.27.3
> >>> Memory Proximity Domain Attributes Structure, the attached initiator is
> >>> defined as where the memory controller responsible for a memory proximity
> >>> domain. With attached initiator information, the topology of heterogeneous
> >>> memory can be described.
> >>>
> >>> Extend CLI of "-numa node" option to indicate the initiator numa node-id.
> >>> In the linux kernel, the codes in drivers/acpi/hmat/hmat.c parse and report
> >>> the platform's HMAT tables.
> >>>
> >>> Reviewed-by: Jingqi Liu <Jingqi.liu@intel.com>
> >>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> >>> Signed-off-by: Tao Xu <tao3.xu@intel.com>
> >>> ---
> >>>
> >>> No changes in v9
> >>> ---
> >>>   hw/core/machine.c     | 24 ++++++++++++++++++++++++
> >>>   hw/core/numa.c        | 13 +++++++++++++
> >>>   include/sysemu/numa.h |  3 +++
> >>>   qapi/machine.json     |  6 +++++-
> >>>   qemu-options.hx       | 27 +++++++++++++++++++++++----
> >>>   5 files changed, 68 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/hw/core/machine.c b/hw/core/machine.c
> >>> index 3c55470103..113184a9df 100644
> >>> --- a/hw/core/machine.c
> >>> +++ b/hw/core/machine.c
> >>> @@ -640,6 +640,7 @@ void machine_set_cpu_numa_node(MachineState *machine,
> >>>                                  const CpuInstanceProperties *props, Error **errp)
> >>>   {
> >>>       MachineClass *mc = MACHINE_GET_CLASS(machine);
> >>> +    NodeInfo *numa_info = machine->numa_state->nodes;
> >>>       bool match = false;
> >>>       int i;
> >>>
> >>> @@ -709,6 +710,16 @@ void machine_set_cpu_numa_node(MachineState *machine,
> >>>           match = true;
> >>>           slot->props.node_id = props->node_id;
> >>>           slot->props.has_node_id = props->has_node_id;
> >>> +
> >>> +        if (numa_info[props->node_id].initiator_valid &&
> >>> +            (props->node_id != numa_info[props->node_id].initiator)) {
> >>> +            error_setg(errp, "The initiator of CPU NUMA node %" PRId64
> >>> +                       " should be itself.", props->node_id);
> >>> +            return;
> >>> +        }
> >>> +        numa_info[props->node_id].initiator_valid = true;
> >>> +        numa_info[props->node_id].has_cpu = true;
> >>> +        numa_info[props->node_id].initiator = props->node_id;
> >>>       }
> >>>
> >>>       if (!match) {
> >>> @@ -1050,6 +1061,7 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
> >>>       GString *s = g_string_new(NULL);
> >>>       MachineClass *mc = MACHINE_GET_CLASS(machine);
> >>>       const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(machine);
> >>> +    NodeInfo *numa_info = machine->numa_state->nodes;
> >>>
> >>>       assert(machine->numa_state->num_nodes);
> >>>       for (i = 0; i < possible_cpus->len; i++) {
> >>> @@ -1083,6 +1095,18 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
> >>>               machine_set_cpu_numa_node(machine, &props, &error_fatal);
> >>>           }
> >>>       }
> >>> +
> >>> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
> >>> +        if (numa_info[i].initiator_valid &&
> >>> +            !numa_info[numa_info[i].initiator].has_cpu) {
> >>                            ^^^^^^^^^^^^^^^^^^^^^^ possible out of bounds read, see bellow
> >>
> >>> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
> >>> +                         " does not exist.", numa_info[i].initiator, i);
> >>> +            error_printf("\n");
> >>> +
> >>> +            exit(1);
> >>> +        }
> >> it takes care only about nodes that have cpus or memory-only ones that have
> >> initiator explicitly provided on CLI. And leaves possibility to have
> >> memory-only nodes without initiator mixed with nodes that have initiator.
> >> Is it valid to have mixed configuration?
> >> Should we forbid it?
> >
> > The spec talks about the "Proximity Domain for the Attached Initiator"
> > field only being valid if the memory controller for the memory can be
> > identified by an initiator id in the SRAT. So I expect the only way to
> > define a memory proximity domain without this local initiator is to
> > allow specifying a node-id that does not have an entry in the SRAT.
> >
> Hi Dan,
>
> So there may be a situation for the Attached Initiator field is not
> valid? If true, I would allow user to input Initiator invalid.

Yes it's something the OS needs to consider because the platform may
not be able to meet the constraint that a single initiator is
associated with the memory controller for a given memory target. In
retrospect it would have been nice if the spec reserved 0xffffffff for
this purpose, but it seems "not in SRAT" is the only way to identify
memory that is not attached to any single initiator.

> > That would be a useful feature for testing OS HMAT parsing behavior,
> > and may match platforms that exist in practice.
Tao Xu Aug. 15, 2019, 1:56 a.m. UTC | #6
On 8/15/2019 5:29 AM, Dan Williams wrote:
> On Tue, Aug 13, 2019 at 10:14 PM Tao Xu <tao3.xu@intel.com> wrote:
>>
>> On 8/14/2019 10:39 AM, Dan Williams wrote:
>>> On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov <imammedo@redhat.com> wrote:
>>>>
>>>> On Fri,  9 Aug 2019 14:57:25 +0800
>>>> Tao <tao3.xu@intel.com> wrote:
>>>>
>>>>> From: Tao Xu <tao3.xu@intel.com>
>>>>>
[...]
>>>>> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
>>>>> +        if (numa_info[i].initiator_valid &&
>>>>> +            !numa_info[numa_info[i].initiator].has_cpu) {
>>>>                             ^^^^^^^^^^^^^^^^^^^^^^ possible out of bounds read, see bellow
>>>>
>>>>> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
>>>>> +                         " does not exist.", numa_info[i].initiator, i);
>>>>> +            error_printf("\n");
>>>>> +
>>>>> +            exit(1);
>>>>> +        }
>>>> it takes care only about nodes that have cpus or memory-only ones that have
>>>> initiator explicitly provided on CLI. And leaves possibility to have
>>>> memory-only nodes without initiator mixed with nodes that have initiator.
>>>> Is it valid to have mixed configuration?
>>>> Should we forbid it?
>>>
>>> The spec talks about the "Proximity Domain for the Attached Initiator"
>>> field only being valid if the memory controller for the memory can be
>>> identified by an initiator id in the SRAT. So I expect the only way to
>>> define a memory proximity domain without this local initiator is to
>>> allow specifying a node-id that does not have an entry in the SRAT.
>>>
>> Hi Dan,
>>
>> So there may be a situation for the Attached Initiator field is not
>> valid? If true, I would allow user to input Initiator invalid.
> 
> Yes it's something the OS needs to consider because the platform may
> not be able to meet the constraint that a single initiator is
> associated with the memory controller for a given memory target. In
> retrospect it would have been nice if the spec reserved 0xffffffff for
> this purpose, but it seems "not in SRAT" is the only way to identify
> memory that is not attached to any single initiator.
> 
But as far as I know, QEMU can't emulate a NUMA node "not in SRAT". I am
wondering whether it would be enough to only mark the initiator as invalid?
Dan Williams Aug. 15, 2019, 2:31 a.m. UTC | #7
On Wed, Aug 14, 2019 at 6:57 PM Tao Xu <tao3.xu@intel.com> wrote:
>
> On 8/15/2019 5:29 AM, Dan Williams wrote:
> > On Tue, Aug 13, 2019 at 10:14 PM Tao Xu <tao3.xu@intel.com> wrote:
> >>
> >> On 8/14/2019 10:39 AM, Dan Williams wrote:
> >>> On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov <imammedo@redhat.com> wrote:
> >>>>
> >>>> On Fri,  9 Aug 2019 14:57:25 +0800
> >>>> Tao <tao3.xu@intel.com> wrote:
> >>>>
> >>>>> From: Tao Xu <tao3.xu@intel.com>
> >>>>>
> [...]
> >>>>> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
> >>>>> +        if (numa_info[i].initiator_valid &&
> >>>>> +            !numa_info[numa_info[i].initiator].has_cpu) {
> >>>>                             ^^^^^^^^^^^^^^^^^^^^^^ possible out of bounds read, see bellow
> >>>>
> >>>>> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
> >>>>> +                         " does not exist.", numa_info[i].initiator, i);
> >>>>> +            error_printf("\n");
> >>>>> +
> >>>>> +            exit(1);
> >>>>> +        }
> >>>> it takes care only about nodes that have cpus or memory-only ones that have
> >>>> initiator explicitly provided on CLI. And leaves possibility to have
> >>>> memory-only nodes without initiator mixed with nodes that have initiator.
> >>>> Is it valid to have mixed configuration?
> >>>> Should we forbid it?
> >>>
> >>> The spec talks about the "Proximity Domain for the Attached Initiator"
> >>> field only being valid if the memory controller for the memory can be
> >>> identified by an initiator id in the SRAT. So I expect the only way to
> >>> define a memory proximity domain without this local initiator is to
> >>> allow specifying a node-id that does not have an entry in the SRAT.
> >>>
> >> Hi Dan,
> >>
> >> So there may be a situation for the Attached Initiator field is not
> >> valid? If true, I would allow user to input Initiator invalid.
> >
> > Yes it's something the OS needs to consider because the platform may
> > not be able to meet the constraint that a single initiator is
> > associated with the memory controller for a given memory target. In
> > retrospect it would have been nice if the spec reserved 0xffffffff for
> > this purpose, but it seems "not in SRAT" is the only way to identify
> > memory that is not attached to any single initiator.
> >
> But As far as I konw, QEMU can't emulate a NUMA node "not in SRAT". I am
> wondering if it is effective only set Initiator invalid?

You don't need to emulate a NUMA node not in SRAT. Just put a number
in this HMAT entry larger than the largest proximity domain number
found in the SRAT.
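
A minimal sketch of that idea, assuming the SRAT proximity domains are
available as a plain array. The helper name pick_unattached_initiator()
is hypothetical and not part of QEMU, and the sketch only encodes the
suggestion above, not behavior confirmed by the spec:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pick a proximity domain number that is larger
 * than every domain listed in the SRAT, to be written into the HMAT
 * "Proximity Domain for the Attached Initiator" field of a memory
 * target that has no single attached initiator.
 *
 * Returns false if no such number exists (the largest SRAT domain is
 * already UINT32_MAX); otherwise stores the number in *out.
 */
static bool pick_unattached_initiator(const uint32_t *srat_pxm, size_t n,
                                      uint32_t *out)
{
    uint32_t max = 0;

    for (size_t i = 0; i < n; i++) {
        if (srat_pxm[i] > max) {
            max = srat_pxm[i];
        }
    }
    if (max == UINT32_MAX) {
        return false;   /* max + 1 would wrap around to 0 */
    }
    *out = max + 1;
    return true;
}
```

With SRAT domains 0, 1 and 2 the helper yields 3, which no SRAT entry
can match.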
>
Igor Mammedov Aug. 16, 2019, 2:47 p.m. UTC | #8
On Wed, 14 Aug 2019 10:24:03 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> On 8/13/2019 11:00 PM, Igor Mammedov wrote:
> > On Fri,  9 Aug 2019 14:57:25 +0800
> > Tao <tao3.xu@intel.com> wrote:
> >   
> >> From: Tao Xu <tao3.xu@intel.com>
> >>
> >> In ACPI 6.3 chapter 5.2.27 Heterogeneous Memory Attribute Table (HMAT),
> >> The initiator represents processor which access to memory. And in 5.2.27.3
> >> Memory Proximity Domain Attributes Structure, the attached initiator is
> >> defined as where the memory controller responsible for a memory proximity
> >> domain. With attached initiator information, the topology of heterogeneous
> >> memory can be described.
> >>
> >> Extend CLI of "-numa node" option to indicate the initiator numa node-id.
> >> In the linux kernel, the codes in drivers/acpi/hmat/hmat.c parse and report
> >> the platform's HMAT tables.
> >>
> >> Reviewed-by: Jingqi Liu <Jingqi.liu@intel.com>
> >> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> >> Signed-off-by: Tao Xu <tao3.xu@intel.com>

See comments below.

PS:
I'll continue reviewing the series in a week, when I'm back.

> >> ---
> >>
> >> No changes in v9
> >> ---  
> [...]
> >> +
> >> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
> >> +        if (numa_info[i].initiator_valid &&
> >> +            !numa_info[numa_info[i].initiator].has_cpu) {  
> >                            ^^^^^^^^^^^^^^^^^^^^^^ possible out of bounds read, see bellow
> >   
> I will add a error "if (numa_info[i].initiator >= MAX_NODES)" when input.

It would be better to validate the user input instead, at the place pointed out below.
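
A minimal sketch of such a check, assuming a simplified stand-in for
NodeInfo and a hypothetical helper name (set_node_initiator). The real
parse_numa_node() would report through error_setg() rather than return
a code, and MAX_NODES is assumed to be 128 here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 128   /* assumed value, for this sketch only */

/* Simplified stand-in for the NodeInfo fields added by the patch. */
typedef struct {
    bool initiator_valid;
    uint16_t initiator;
} NodeInfoSketch;

/*
 * Hypothetical helper: store the initiator of node 'nodenr' only after
 * checking that the value can safely be used as an index into nodes[].
 *
 * Returns  0 on success,
 *         -1 if nodenr or initiator is out of range,
 *         -2 if a different initiator was already set for this node.
 */
static int set_node_initiator(NodeInfoSketch *nodes, uint16_t nodenr,
                              uint16_t initiator)
{
    if (nodenr >= MAX_NODES || initiator >= MAX_NODES) {
        return -1;
    }
    if (nodes[nodenr].initiator_valid &&
        nodes[nodenr].initiator != initiator) {
        return -2;
    }
    nodes[nodenr].initiator_valid = true;
    nodes[nodenr].initiator = initiator;
    return 0;
}
```

Rejecting the value before it is stored means the later
numa_info[numa_info[i].initiator] lookup never sees an index outside
the array.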

> >> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
> >> +                         " does not exist.", numa_info[i].initiator, i);
> >> +            error_printf("\n");
> >> +
> >> +            exit(1);
> >> +        }  
> > it takes care only about nodes that have cpus or memory-only ones that have
> > initiator explicitly provided on CLI. And leaves possibility to have
> > memory-only nodes without initiator mixed with nodes that have initiator.
> > Is it valid to have mixed configuration?
> > Should we forbid it?
> >   
> Mixed configuration may indeed trigger bug in the future. Because in 
> this patches we default generate HMAT. But mixed configuration situation 
> or without initiator setting will let mem-only node "Flags" field 0, 
> then the Proximity Domain for the Attached Initiator field is not
> valid.
> 
> List are three situations:
> 
> 1) full configuration, just like
> -object memory-backend-ram,size=1G,id=m0 \
> -object memory-backend-ram,size=1G,id=m1 \
> -object memory-backend-ram,size=1G,id=m2 \
> -numa node,nodeid=0,memdev=m0 \
> -numa node,nodeid=1,memdev=m1,initiator=0 \
> -numa node,nodeid=2,memdev=m2,initiator=0
> 
> 2) mixed configuration, just like
> -object memory-backend-ram,size=1G,id=m0 \
> -object memory-backend-ram,size=1G,id=m1 \
> -object memory-backend-ram,size=1G,id=m2 \
> -numa node,nodeid=0,memdev=m0 \
> -numa node,nodeid=1,memdev=m1,initiator=0 \
> -numa node,nodeid=2,memdev=m2
> 
> 3) no configuration, just like
> -object memory-backend-ram,size=1G,id=m0 \
> -object memory-backend-ram,size=1G,id=m1 \
> -object memory-backend-ram,size=1G,id=m2 \
> -numa node,nodeid=0,memdev=m0 \
> -numa node,nodeid=1,memdev=m1 \
> -numa node,nodeid=2,memdev=m2
> 
> I have 3 ideas:
> 
> 1. HMAT option. Add a machine option like "-machine,hmat=yes", then qemu 
> can have HMAT.
I'd go with it. HMAT, even if it's broken, won't affect anything unless requested by the user.
So we could polish the implementation and experiment with it with little risk
of breaking something.


> 2. Default setting. The numa without initiator default set numa node 
> which has cpu 0 as initiator.
> 
> 3. Auto setting. intelligent auto configuration like 
> numa_default_auto_assign_ram, auto set initiator of the memory-only 
> nodes averagely.
numa_default_auto_assign_ram is deprecated.
Usually auto_something bites us back in the long term:
when we need to change the related code, we end up with a bunch of
compat code and the maintenance burden that it introduces.
(The same applies to made-up defaults, i.e. ones not dictated by the spec.)

> 
> Therefore, there are 2 different solution:
> 
> 1) HMAT option + Default setting
> 
> 2) HMAT option + Auto setting
> 
> >> +    }
> >> +
> >>       if (s->len && !qtest_enabled()) {
> >>           warn_report("CPU(s) not present in any NUMA nodes: %s",
> >>                       s->str);
> >> diff --git a/hw/core/numa.c b/hw/core/numa.c
> >> index 8fcbba05d6..cfb6339810 100644
> >> --- a/hw/core/numa.c
> >> +++ b/hw/core/numa.c
> >> @@ -128,6 +128,19 @@ static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
> >>           numa_info[nodenr].node_mem = object_property_get_uint(o, "size", NULL);
> >>           numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
> >>       }
> >> +
> >> +    if (node->has_initiator) {
> >> +        if (numa_info[nodenr].initiator_valid &&
> >> +            (node->initiator != numa_info[nodenr].initiator)) {
> >> +            error_setg(errp, "The initiator of NUMA node %" PRIu16 " has been "
> >> +                       "set to node %" PRIu16, nodenr,
> >> +                       numa_info[nodenr].initiator);
> >> +            return;
> >> +        }
> >> +
> >> +        numa_info[nodenr].initiator_valid = true;
> >> +        numa_info[nodenr].initiator = node->initiator;  
> >                                               ^^^
> > not validated  user input? (which could lead to read beyond numa_info[] boundaries
> > in previous hunk).
> >   
> >> +    }
> >>       numa_info[nodenr].present = true;
> >>       max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
> >>       ms->numa_state->num_nodes++;
> >> diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
> >> index 76da3016db..46ad06e000 100644
> >> --- a/include/sysemu/numa.h
> >> +++ b/include/sysemu/numa.h
> >> @@ -10,6 +10,9 @@ struct NodeInfo {
> >>       uint64_t node_mem;
> >>       struct HostMemoryBackend *node_memdev;
> >>       bool present;
> >> +    bool has_cpu;
> >> +    bool initiator_valid;
> >> +    uint16_t initiator;
> >>       uint8_t distance[MAX_NODES];
> >>   };
> >>   
> >> diff --git a/qapi/machine.json b/qapi/machine.json
> >> index 6db8a7e2ec..05e367d26a 100644
> >> --- a/qapi/machine.json
> >> +++ b/qapi/machine.json
> >> @@ -414,6 +414,9 @@
> >>   # @memdev: memory backend object.  If specified for one node,
> >>   #          it must be specified for all nodes.
> >>   #
> >> +# @initiator: the initiator numa nodeid that is closest (as in directly
> >> +#             attached) to this numa node (since 4.2)  
> > well, it's pretty unclear what doc comment means (unless reader knows well
> > specific part of ACPI spec)
> > 
> > suggest to rephrase to something more understandable for unaware
> > readers (+ possible reference to spec for those who is interested
> > in spec definition since this doc is meant for developers).
> >   
> >> +#
> >>   # Since: 2.1
> >>   ##
> >>   { 'struct': 'NumaNodeOptions',
> >> @@ -421,7 +424,8 @@
> >>      '*nodeid': 'uint16',
> >>      '*cpus':   ['uint16'],
> >>      '*mem':    'size',
> >> -   '*memdev': 'str' }}
> >> +   '*memdev': 'str',
> >> +   '*initiator': 'uint16' }}
> >>   
> >>   ##
> >>   # @NumaDistOptions:
> >> diff --git a/qemu-options.hx b/qemu-options.hx
> >> index 9621e934c0..c480781992 100644
> >> --- a/qemu-options.hx
> >> +++ b/qemu-options.hx
> >> @@ -161,14 +161,14 @@ If any on the three values is given, the total number of CPUs @var{n} can be omi
> >>   ETEXI
> >>   
> >>   DEF("numa", HAS_ARG, QEMU_OPTION_numa,
> >> -    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> >> -    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
> >> +    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
> >> +    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
> >>       "-numa dist,src=source,dst=destination,val=distance\n"
> >>       "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
> >>       QEMU_ARCH_ALL)
> >>   STEXI
> >> -@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> >> -@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
> >> +@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
> >> +@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
> >>   @itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
> >>   @itemx -numa cpu,node-id=@var{node}[,socket-id=@var{x}][,core-id=@var{y}][,thread-id=@var{z}]
> >>   @findex -numa
> >> @@ -215,6 +215,25 @@ split equally between them.
> >>   @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore,
> >>   if one node uses @samp{memdev}, all of them have to use it.
> >>   
> >> +@samp{initiator} indicate the initiator NUMA @var{initiator} that is  
> >                                    ^^^^^^^       ^^^^^^^^^^^^^^
> > above will result in "initiator NUMA initiator", was it your intention?
> >   
> >> +closest (as in directly attached) to this NUMA @var{node}.  
> > Again suggest replace spec language with something more user friendly
> > (this time without spec reference as it's geared for end user)
> >   
> >> +For example, the following option assigns 2 NUMA nodes, node 0 has CPU.  
> > Following example creates a machine with 2 NUMA ...
> >   
> >> +node 1 has only memory, and its' initiator is node 0. Note that because
> >> +node 0 has CPU, by default the initiator of node 0 is itself and must be
> >> +itself.
> >> +@example
> >> +-M pc \
> >> +-m 2G,slots=2,maxmem=4G \
> >> +-object memory-backend-ram,size=1G,id=m0 \
> >> +-object memory-backend-ram,size=1G,id=m1 \
> >> +-numa node,nodeid=0,memdev=m0 \
> >> +-numa node,nodeid=1,memdev=m1,initiator=0 \
> >> +-smp 2,sockets=2,maxcpus=2  \
> >> +-numa cpu,node-id=0,socket-id=0 \
> >> +-numa cpu,node-id=0,socket-id=1 \
> >> +@end example
> >> +
> >>   @var{source} and @var{destination} are NUMA node IDs.
> >>   @var{distance} is the NUMA distance from @var{source} to @var{destination}.
> >>   The distance from a node to itself is always 10. If any pair of nodes is  
> >   
> 
>
Igor Mammedov Aug. 16, 2019, 2:57 p.m. UTC | #9
On Wed, 14 Aug 2019 19:31:27 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> On Wed, Aug 14, 2019 at 6:57 PM Tao Xu <tao3.xu@intel.com> wrote:
> >
> > On 8/15/2019 5:29 AM, Dan Williams wrote:  
> > > On Tue, Aug 13, 2019 at 10:14 PM Tao Xu <tao3.xu@intel.com> wrote:  
> > >>
> > >> On 8/14/2019 10:39 AM, Dan Williams wrote:  
> > >>> On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov <imammedo@redhat.com> wrote:  
> > >>>>
> > >>>> On Fri,  9 Aug 2019 14:57:25 +0800
> > >>>> Tao <tao3.xu@intel.com> wrote:
> > >>>>  
> > >>>>> From: Tao Xu <tao3.xu@intel.com>
> > >>>>>  
> > [...]  
> > >>>>> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
> > >>>>> +        if (numa_info[i].initiator_valid &&
> > >>>>> +            !numa_info[numa_info[i].initiator].has_cpu) {  
> > >>>>                             ^^^^^^^^^^^^^^^^^^^^^^ possible out of bounds read, see bellow
> > >>>>  
> > >>>>> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
> > >>>>> +                         " does not exist.", numa_info[i].initiator, i);
> > >>>>> +            error_printf("\n");
> > >>>>> +
> > >>>>> +            exit(1);
> > >>>>> +        }  
> > >>>> it takes care only about nodes that have cpus or memory-only ones that have
> > >>>> initiator explicitly provided on CLI. And leaves possibility to have
> > >>>> memory-only nodes without initiator mixed with nodes that have initiator.
> > >>>> Is it valid to have mixed configuration?
> > >>>> Should we forbid it?  
> > >>>
> > >>> The spec talks about the "Proximity Domain for the Attached Initiator"
> > >>> field only being valid if the memory controller for the memory can be
> > >>> identified by an initiator id in the SRAT. So I expect the only way to
> > >>> define a memory proximity domain without this local initiator is to
> > >>> allow specifying a node-id that does not have an entry in the SRAT.
> > >>>  
> > >> Hi Dan,
> > >>
> > >> So there may be a situation for the Attached Initiator field is not
> > >> valid? If true, I would allow user to input Initiator invalid.  
> > >
> > > Yes it's something the OS needs to consider because the platform may
> > > not be able to meet the constraint that a single initiator is
> > > associated with the memory controller for a given memory target. In
> > > retrospect it would have been nice if the spec reserved 0xffffffff for
> > > this purpose, but it seems "not in SRAT" is the only way to identify
> > > memory that is not attached to any single initiator.
> > >  
> > But as far as I know, QEMU can't emulate a NUMA node "not in SRAT". I am
> > wondering if it is effective to only set the Initiator invalid?  
> 
> You don't need to emulate a NUMA node not in SRAT. Just put a number
> in this HMAT entry larger than the largest proximity domain number
> found in the SRAT.
> >  
> 

So the behavior is really not defined in the spec
(well, I wasn't able to convince myself that the above behavior is in the spec).

In this case I'd go with a strict check for now and not allow an invalid initiator
(we can easily relax the check later and allow it to point to nonsense, but not the other way around).
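
A minimal sketch of what such a strict check could look like, with the initiator index bounds-checked before it is used as an array subscript (the out-of-bounds read flagged in the review). The struct mirrors the fields this patch adds to NodeInfo; the helper name find_bad_initiator(), the MAX_NODES value and the return convention are assumptions made for illustration, not code from the series.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_NODES 128 /* illustrative array bound (assumption) */

/* Reduced copy of the NodeInfo fields touched by the patch. */
typedef struct NodeInfo {
    bool present;
    bool has_cpu;
    bool initiator_valid;
    uint16_t initiator;
} NodeInfo;

/*
 * Hypothetical helper: 'nodes' must point at MAX_NODES entries.
 * Returns the index of the first node whose initiator does not name
 * a present node with CPUs, or -1 if every given initiator is fine.
 */
static int find_bad_initiator(const NodeInfo *nodes, int num_nodes)
{
    for (int i = 0; i < num_nodes; i++) {
        if (!nodes[i].initiator_valid) {
            continue; /* no initiator given; not judged by this check */
        }

        uint16_t init = nodes[i].initiator;

        /* Range check first, so nodes[init] is never read out of bounds. */
        if (init >= MAX_NODES || !nodes[init].present || !nodes[init].has_cpu) {
            return i;
        }
    }
    return -1;
}
```

A caller would report the offending node and exit when the result is not -1. Relaxing the check later would only mean dropping the has_cpu condition, while the range check stays.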
Tao Xu Aug. 20, 2019, 8:34 a.m. UTC | #10
On 8/16/2019 10:57 PM, Igor Mammedov wrote:
> On Wed, 14 Aug 2019 19:31:27 -0700
> Dan Williams <dan.j.williams@intel.com> wrote:
> 
>> On Wed, Aug 14, 2019 at 6:57 PM Tao Xu <tao3.xu@intel.com> wrote:
>>>
>>> On 8/15/2019 5:29 AM, Dan Williams wrote:
>>>> On Tue, Aug 13, 2019 at 10:14 PM Tao Xu <tao3.xu@intel.com> wrote:
>>>>>
>>>>> On 8/14/2019 10:39 AM, Dan Williams wrote:
>>>>>> On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov <imammedo@redhat.com> wrote:
>>>>>>>
>>>>>>> On Fri,  9 Aug 2019 14:57:25 +0800
>>>>>>> Tao <tao3.xu@intel.com> wrote:
>>>>>>>   
>>>>>>>> From: Tao Xu <tao3.xu@intel.com>
>>>>>>>>   
>>> [...]
>>>>>>>> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
>>>>>>>> +        if (numa_info[i].initiator_valid &&
>>>>>>>> +            !numa_info[numa_info[i].initiator].has_cpu) {
>>>>>>>                              ^^^^^^^^^^^^^^^^^^^^^^ possible out of bounds read, see below
>>>>>>>   
>>>>>>>> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
>>>>>>>> +                         " does not exist.", numa_info[i].initiator, i);
>>>>>>>> +            error_printf("\n");
>>>>>>>> +
>>>>>>>> +            exit(1);
>>>>>>>> +        }
>>>>>>> it takes care only about nodes that have cpus or memory-only ones that have
>>>>>>> initiator explicitly provided on CLI. And leaves possibility to have
>>>>>>> memory-only nodes without initiator mixed with nodes that have initiator.
>>>>>>> Is it valid to have mixed configuration?
>>>>>>> Should we forbid it?
>>>>>>
>>>>>> The spec talks about the "Proximity Domain for the Attached Initiator"
>>>>>> field only being valid if the memory controller for the memory can be
>>>>>> identified by an initiator id in the SRAT. So I expect the only way to
>>>>>> define a memory proximity domain without this local initiator is to
>>>>>> allow specifying a node-id that does not have an entry in the SRAT.
>>>>>>   
>>>>> Hi Dan,
>>>>>
>>>>> So there may be a situation for the Attached Initiator field is not
>>>>> valid? If true, I would allow user to input Initiator invalid.
>>>>
>>>> Yes it's something the OS needs to consider because the platform may
>>>> not be able to meet the constraint that a single initiator is
>>>> associated with the memory controller for a given memory target. In
>>>> retrospect it would have been nice if the spec reserved 0xffffffff for
>>>> this purpose, but it seems "not in SRAT" is the only way to identify
>>>> memory that is not attached to any single initiator.
>>>>   
>>> But as far as I know, QEMU can't emulate a NUMA node "not in SRAT". I am
>>> wondering if it is effective to only set the Initiator invalid?
>>
>> You don't need to emulate a NUMA node not in SRAT. Just put a number
>> in this HMAT entry larger than the largest proximity domain number
>> found in the SRAT.
>>>   
>>
> 
> So behavior is really not defined in the spec
> (well I wasn't able to convince myself that above behavior is in the spec).
> 
> In this case I'd go with a strict check for now not allowing invalid initiator
> (we can easily relax check and allow it point to nonsense later but no other way around)
> 

Let me summarize the solution to avoid any misunderstanding; if anything
is wrong, please tell me:

1)
-machine,hmat=yes
-object memory-backend-ram,size=1G,id=m0 \
-object memory-backend-ram,size=1G,id=m1 \
-object memory-backend-ram,size=1G,id=m2 \
-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-numa node,nodeid=2,memdev=m2,initiator=0 \
-numa cpu,node-id=0,socket-id=0 \
-numa cpu,node-id=0,socket-id=1

then QEMU can use HMAT.

2)
If the initiators are configured like this:

-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-numa node,nodeid=2,memdev=m2

then QEMU refuses to boot and shows an error message.

3)
If the initiators are configured like this:

-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-numa node,nodeid=2,memdev=m2,initiator=1

then QEMU can boot, and the initiator of nodeid=2 is invalid.
Igor Mammedov Aug. 27, 2019, 1:12 p.m. UTC | #11
On Tue, 20 Aug 2019 16:34:44 +0800
Tao Xu <tao3.xu@intel.com> wrote:

> On 8/16/2019 10:57 PM, Igor Mammedov wrote:
> > On Wed, 14 Aug 2019 19:31:27 -0700
> > Dan Williams <dan.j.williams@intel.com> wrote:
> >   
> >> On Wed, Aug 14, 2019 at 6:57 PM Tao Xu <tao3.xu@intel.com> wrote:  
> >>>
> >>> On 8/15/2019 5:29 AM, Dan Williams wrote:  
> >>>> On Tue, Aug 13, 2019 at 10:14 PM Tao Xu <tao3.xu@intel.com> wrote:  
> >>>>>
> >>>>> On 8/14/2019 10:39 AM, Dan Williams wrote:  
> >>>>>> On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov <imammedo@redhat.com> wrote:  
> >>>>>>>
> >>>>>>> On Fri,  9 Aug 2019 14:57:25 +0800
> >>>>>>> Tao <tao3.xu@intel.com> wrote:
> >>>>>>>     
> >>>>>>>> From: Tao Xu <tao3.xu@intel.com>
> >>>>>>>>     
> >>> [...]  
> >>>>>>>> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
> >>>>>>>> +        if (numa_info[i].initiator_valid &&
> >>>>>>>> +            !numa_info[numa_info[i].initiator].has_cpu) {  
> >>>>>>>                              ^^^^^^^^^^^^^^^^^^^^^^ possible out of bounds read, see below
> >>>>>>>     
> >>>>>>>> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
> >>>>>>>> +                         " does not exist.", numa_info[i].initiator, i);
> >>>>>>>> +            error_printf("\n");
> >>>>>>>> +
> >>>>>>>> +            exit(1);
> >>>>>>>> +        }  
> >>>>>>> it takes care only about nodes that have cpus or memory-only ones that have
> >>>>>>> initiator explicitly provided on CLI. And leaves possibility to have
> >>>>>>> memory-only nodes without initiator mixed with nodes that have initiator.
> >>>>>>> Is it valid to have mixed configuration?
> >>>>>>> Should we forbid it?  
> >>>>>>
> >>>>>> The spec talks about the "Proximity Domain for the Attached Initiator"
> >>>>>> field only being valid if the memory controller for the memory can be
> >>>>>> identified by an initiator id in the SRAT. So I expect the only way to
> >>>>>> define a memory proximity domain without this local initiator is to
> >>>>>> allow specifying a node-id that does not have an entry in the SRAT.
> >>>>>>     
> >>>>> Hi Dan,
> >>>>>
> >>>>> So there may be a situation for the Attached Initiator field is not
> >>>>> valid? If true, I would allow user to input Initiator invalid.  
> >>>>
> >>>> Yes it's something the OS needs to consider because the platform may
> >>>> not be able to meet the constraint that a single initiator is
> >>>> associated with the memory controller for a given memory target. In
> >>>> retrospect it would have been nice if the spec reserved 0xffffffff for
> >>>> this purpose, but it seems "not in SRAT" is the only way to identify
> >>>> memory that is not attached to any single initiator.
> >>>>     
> >>> But as far as I know, QEMU can't emulate a NUMA node "not in SRAT". I am
> >>> wondering if it is effective to only set the Initiator invalid?  
> >>
> >> You don't need to emulate a NUMA node not in SRAT. Just put a number
> >> in this HMAT entry larger than the largest proximity domain number
> >> found in the SRAT.  
> >>>     
> >>  
> > 
> > So behavior is really not defined in the spec
> > (well I wasn't able to convince myself that above behavior is in the spec).
> > 
> > In this case I'd go with a strict check for now not allowing invalid initiator
> > (we can easily relax check and allow it point to nonsense later but no other way around)
> >   
> 
> So let me summarize the solution, in order to avoid misunderstanding, if 
> there are something wrong, pls tell me:
> 
> 1)
> -machine,hmat=yes
> -object memory-backend-ram,size=1G,id=m0 \
> -object memory-backend-ram,size=1G,id=m1 \
> -object memory-backend-ram,size=1G,id=m2 \
> -numa node,nodeid=0,memdev=m0 \
> -numa node,nodeid=1,memdev=m1,initiator=0 \
> -numa node,nodeid=2,memdev=m2,initiator=0 \
> -numa cpu,node-id=0,socket-id=0 \
> -numa cpu,node-id=0,socket-id=1
> 
> then qemu can use HMAT.
> 
> 2)
> if initiator this case:
> 
> -numa node,nodeid=0,memdev=m0 \
> -numa node,nodeid=1,memdev=m1,initiator=0 \
> -numa node,nodeid=2,memdev=m2
> 
> then qemu can't boot and show error message.
> 
> 3)
> if initiator this case:
> 
> -numa node,nodeid=0,memdev=m0 \
> -numa node,nodeid=1,memdev=m1,initiator=0 \
> -numa node,nodeid=2,memdev=m2,initiator=1
> 
> then qemu can boot and the initiator of nodeid=2 is invalid.
In this last case I'd error out instead of booting with an invalid config.
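
To make the agreed outcome for the three cases concrete, here is a rough sketch of the policy as a standalone predicate. It assumes that cases 2) and 3) keep the CPU assignment of case 1) (all CPUs in node 0), which the summary does not spell out. NodeCfg, NO_INITIATOR and config_is_valid() are hypothetical names used only for this illustration.

```c
#include <stdbool.h>

enum { NO_INITIATOR = -1 }; /* initiator= not given on the command line */

/* Hypothetical per-node view of the -numa options. */
typedef struct NodeCfg {
    bool has_cpu;   /* node got at least one "-numa cpu" assignment */
    int initiator;  /* value of initiator=, or NO_INITIATOR */
} NodeCfg;

/* Returns true only if every node ends up with a usable initiator. */
static bool config_is_valid(const NodeCfg *n, int num_nodes)
{
    for (int i = 0; i < num_nodes; i++) {
        if (n[i].has_cpu) {
            /* A node with CPUs is its own initiator and must stay so. */
            if (n[i].initiator != NO_INITIATOR && n[i].initiator != i) {
                return false;
            }
            continue;
        }

        int init = n[i].initiator;

        if (init == NO_INITIATOR) {
            return false; /* case 2: memory-only node without initiator */
        }
        if (init < 0 || init >= num_nodes) {
            return false; /* initiator does not name an existing node */
        }
        if (!n[init].has_cpu) {
            return false; /* case 3: initiator node has no CPUs */
        }
    }
    return true;
}
```

With node 0 holding the CPUs, case 1) passes, while cases 2) and 3) are both rejected, which matches erroring out rather than booting with an invalid initiator.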
Tao Xu Aug. 28, 2019, 1:09 a.m. UTC | #12
On 8/27/2019 9:12 PM, Igor Mammedov wrote:
> On Tue, 20 Aug 2019 16:34:44 +0800
> Tao Xu <tao3.xu@intel.com> wrote:
> 
>> On 8/16/2019 10:57 PM, Igor Mammedov wrote:
>>> On Wed, 14 Aug 2019 19:31:27 -0700
>>> Dan Williams <dan.j.williams@intel.com> wrote:
>>>    
>>>> On Wed, Aug 14, 2019 at 6:57 PM Tao Xu <tao3.xu@intel.com> wrote:
>>>>>
>>>>> On 8/15/2019 5:29 AM, Dan Williams wrote:
>>>>>> On Tue, Aug 13, 2019 at 10:14 PM Tao Xu <tao3.xu@intel.com> wrote:
>>>>>>>
>>>>>>> On 8/14/2019 10:39 AM, Dan Williams wrote:
>>>>>>>> On Tue, Aug 13, 2019 at 8:00 AM Igor Mammedov <imammedo@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> On Fri,  9 Aug 2019 14:57:25 +0800
>>>>>>>>> Tao <tao3.xu@intel.com> wrote:
>>>>>>>>>      
>>>>>>>>>> From: Tao Xu <tao3.xu@intel.com>
>>>>>>>>>>      
>>>>> [...]
>>>>>>>>>> +    for (i = 0; i < machine->numa_state->num_nodes; i++) {
>>>>>>>>>> +        if (numa_info[i].initiator_valid &&
>>>>>>>>>> +            !numa_info[numa_info[i].initiator].has_cpu) {
>>>>>>>>>                               ^^^^^^^^^^^^^^^^^^^^^^ possible out of bounds read, see below
>>>>>>>>>      
>>>>>>>>>> +            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
>>>>>>>>>> +                         " does not exist.", numa_info[i].initiator, i);
>>>>>>>>>> +            error_printf("\n");
>>>>>>>>>> +
>>>>>>>>>> +            exit(1);
>>>>>>>>>> +        }
>>>>>>>>> it takes care only about nodes that have cpus or memory-only ones that have
>>>>>>>>> initiator explicitly provided on CLI. And leaves possibility to have
>>>>>>>>> memory-only nodes without initiator mixed with nodes that have initiator.
>>>>>>>>> Is it valid to have mixed configuration?
>>>>>>>>> Should we forbid it?
>>>>>>>>
>>>>>>>> The spec talks about the "Proximity Domain for the Attached Initiator"
>>>>>>>> field only being valid if the memory controller for the memory can be
>>>>>>>> identified by an initiator id in the SRAT. So I expect the only way to
>>>>>>>> define a memory proximity domain without this local initiator is to
>>>>>>>> allow specifying a node-id that does not have an entry in the SRAT.
>>>>>>>>      
>>>>>>> Hi Dan,
>>>>>>>
>>>>>>> So there may be a situation for the Attached Initiator field is not
>>>>>>> valid? If true, I would allow user to input Initiator invalid.
>>>>>>
>>>>>> Yes it's something the OS needs to consider because the platform may
>>>>>> not be able to meet the constraint that a single initiator is
>>>>>> associated with the memory controller for a given memory target. In
>>>>>> retrospect it would have been nice if the spec reserved 0xffffffff for
>>>>>> this purpose, but it seems "not in SRAT" is the only way to identify
>>>>>> memory that is not attached to any single initiator.
>>>>>>      
>>>>> But as far as I know, QEMU can't emulate a NUMA node "not in SRAT". I am
>>>>> wondering if it is effective to only set the Initiator invalid?
>>>>
>>>> You don't need to emulate a NUMA node not in SRAT. Just put a number
>>>> in this HMAT entry larger than the largest proximity domain number
>>>> found in the SRAT.
>>>>>      
>>>>   
>>>
>>> So behavior is really not defined in the spec
>>> (well I wasn't able to convince myself that above behavior is in the spec).
>>>
>>> In this case I'd go with a strict check for now not allowing invalid initiator
>>> (we can easily relax check and allow it point to nonsense later but no other way around)
>>>    
>>
>> So let me summarize the solution, in order to avoid misunderstanding, if
>> there are something wrong, pls tell me:
>>
>> 1)
>> -machine,hmat=yes
>> -object memory-backend-ram,size=1G,id=m0 \
>> -object memory-backend-ram,size=1G,id=m1 \
>> -object memory-backend-ram,size=1G,id=m2 \
>> -numa node,nodeid=0,memdev=m0 \
>> -numa node,nodeid=1,memdev=m1,initiator=0 \
>> -numa node,nodeid=2,memdev=m2,initiator=0 \
>> -numa cpu,node-id=0,socket-id=0 \
>> -numa cpu,node-id=0,socket-id=1
>>
>> then qemu can use HMAT.
>>
>> 2)
>> if initiator this case:
>>
>> -numa node,nodeid=0,memdev=m0 \
>> -numa node,nodeid=1,memdev=m1,initiator=0 \
>> -numa node,nodeid=2,memdev=m2
>>
>> then qemu can't boot and show error message.
>>
>> 3)
>> if initiator this case:
>>
>> -numa node,nodeid=0,memdev=m0 \
>> -numa node,nodeid=1,memdev=m1,initiator=0 \
>> -numa node,nodeid=2,memdev=m2,initiator=1
>>
>> then qemu can boot and the initiator of nodeid=2 is invalid.
> In this last case I'd error out instead of booting with invalid config.
> 
OK

Patch

diff --git a/hw/core/machine.c b/hw/core/machine.c
index 3c55470103..113184a9df 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -640,6 +640,7 @@  void machine_set_cpu_numa_node(MachineState *machine,
                                const CpuInstanceProperties *props, Error **errp)
 {
     MachineClass *mc = MACHINE_GET_CLASS(machine);
+    NodeInfo *numa_info = machine->numa_state->nodes;
     bool match = false;
     int i;
 
@@ -709,6 +710,16 @@  void machine_set_cpu_numa_node(MachineState *machine,
         match = true;
         slot->props.node_id = props->node_id;
         slot->props.has_node_id = props->has_node_id;
+
+        if (numa_info[props->node_id].initiator_valid &&
+            (props->node_id != numa_info[props->node_id].initiator)) {
+            error_setg(errp, "The initiator of CPU NUMA node %" PRId64
+                       " should be itself.", props->node_id);
+            return;
+        }
+        numa_info[props->node_id].initiator_valid = true;
+        numa_info[props->node_id].has_cpu = true;
+        numa_info[props->node_id].initiator = props->node_id;
     }
 
     if (!match) {
@@ -1050,6 +1061,7 @@  static void machine_numa_finish_cpu_init(MachineState *machine)
     GString *s = g_string_new(NULL);
     MachineClass *mc = MACHINE_GET_CLASS(machine);
     const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(machine);
+    NodeInfo *numa_info = machine->numa_state->nodes;
 
     assert(machine->numa_state->num_nodes);
     for (i = 0; i < possible_cpus->len; i++) {
@@ -1083,6 +1095,18 @@  static void machine_numa_finish_cpu_init(MachineState *machine)
             machine_set_cpu_numa_node(machine, &props, &error_fatal);
         }
     }
+
+    for (i = 0; i < machine->numa_state->num_nodes; i++) {
+        if (numa_info[i].initiator_valid &&
+            !numa_info[numa_info[i].initiator].has_cpu) {
+            error_report("The initiator-id %"PRIu16 " of NUMA node %d"
+                         " does not exist.", numa_info[i].initiator, i);
+            error_printf("\n");
+
+            exit(1);
+        }
+    }
+
     if (s->len && !qtest_enabled()) {
         warn_report("CPU(s) not present in any NUMA nodes: %s",
                     s->str);
diff --git a/hw/core/numa.c b/hw/core/numa.c
index 8fcbba05d6..cfb6339810 100644
--- a/hw/core/numa.c
+++ b/hw/core/numa.c
@@ -128,6 +128,19 @@  static void parse_numa_node(MachineState *ms, NumaNodeOptions *node,
         numa_info[nodenr].node_mem = object_property_get_uint(o, "size", NULL);
         numa_info[nodenr].node_memdev = MEMORY_BACKEND(o);
     }
+
+    if (node->has_initiator) {
+        if (numa_info[nodenr].initiator_valid &&
+            (node->initiator != numa_info[nodenr].initiator)) {
+            error_setg(errp, "The initiator of NUMA node %" PRIu16 " has been "
+                       "set to node %" PRIu16, nodenr,
+                       numa_info[nodenr].initiator);
+            return;
+        }
+
+        numa_info[nodenr].initiator_valid = true;
+        numa_info[nodenr].initiator = node->initiator;
+    }
     numa_info[nodenr].present = true;
     max_numa_nodeid = MAX(max_numa_nodeid, nodenr + 1);
     ms->numa_state->num_nodes++;
diff --git a/include/sysemu/numa.h b/include/sysemu/numa.h
index 76da3016db..46ad06e000 100644
--- a/include/sysemu/numa.h
+++ b/include/sysemu/numa.h
@@ -10,6 +10,9 @@  struct NodeInfo {
     uint64_t node_mem;
     struct HostMemoryBackend *node_memdev;
     bool present;
+    bool has_cpu;
+    bool initiator_valid;
+    uint16_t initiator;
     uint8_t distance[MAX_NODES];
 };
 
diff --git a/qapi/machine.json b/qapi/machine.json
index 6db8a7e2ec..05e367d26a 100644
--- a/qapi/machine.json
+++ b/qapi/machine.json
@@ -414,6 +414,9 @@ 
 # @memdev: memory backend object.  If specified for one node,
 #          it must be specified for all nodes.
 #
+# @initiator: the initiator numa nodeid that is closest (as in directly
+#             attached) to this numa node (since 4.2)
+#
 # Since: 2.1
 ##
 { 'struct': 'NumaNodeOptions',
@@ -421,7 +424,8 @@ 
    '*nodeid': 'uint16',
    '*cpus':   ['uint16'],
    '*mem':    'size',
-   '*memdev': 'str' }}
+   '*memdev': 'str',
+   '*initiator': 'uint16' }}
 
 ##
 # @NumaDistOptions:
diff --git a/qemu-options.hx b/qemu-options.hx
index 9621e934c0..c480781992 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -161,14 +161,14 @@  If any on the three values is given, the total number of CPUs @var{n} can be omi
 ETEXI
 
 DEF("numa", HAS_ARG, QEMU_OPTION_numa,
-    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
-    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node]\n"
+    "-numa node[,mem=size][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
+    "-numa node[,memdev=id][,cpus=firstcpu[-lastcpu]][,nodeid=node][,initiator=node]\n"
     "-numa dist,src=source,dst=destination,val=distance\n"
     "-numa cpu,node-id=node[,socket-id=x][,core-id=y][,thread-id=z]\n",
     QEMU_ARCH_ALL)
 STEXI
-@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
-@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}]
+@item -numa node[,mem=@var{size}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
+@itemx -numa node[,memdev=@var{id}][,cpus=@var{firstcpu}[-@var{lastcpu}]][,nodeid=@var{node}][,initiator=@var{initiator}]
 @itemx -numa dist,src=@var{source},dst=@var{destination},val=@var{distance}
 @itemx -numa cpu,node-id=@var{node}[,socket-id=@var{x}][,core-id=@var{y}][,thread-id=@var{z}]
 @findex -numa
@@ -215,6 +215,25 @@  split equally between them.
 @samp{mem} and @samp{memdev} are mutually exclusive. Furthermore,
 if one node uses @samp{memdev}, all of them have to use it.
 
+@samp{initiator} indicates the initiator NUMA node @var{initiator} that is
+closest (as in directly attached) to this NUMA @var{node}.
+
+For example, the following options define 2 NUMA nodes: node 0 has CPUs;
+node 1 has only memory, and its initiator is node 0. Note that because
+node 0 has CPUs, by default the initiator of node 0 is itself and must be
+itself.
+@example
+-M pc \
+-m 2G,slots=2,maxmem=4G \
+-object memory-backend-ram,size=1G,id=m0 \
+-object memory-backend-ram,size=1G,id=m1 \
+-numa node,nodeid=0,memdev=m0 \
+-numa node,nodeid=1,memdev=m1,initiator=0 \
+-smp 2,sockets=2,maxcpus=2  \
+-numa cpu,node-id=0,socket-id=0 \
+-numa cpu,node-id=0,socket-id=1 \
+@end example
+
 @var{source} and @var{destination} are NUMA node IDs.
 @var{distance} is the NUMA distance from @var{source} to @var{destination}.
 The distance from a node to itself is always 10. If any pair of nodes is