[v3,00/18] APIC ID fixes for AMD EPYC CPU models

Message ID	157541968844.46157.17994918142533791313.stgit@naples-babu.amd.com (mailing list archive)
Headers	show Return-Path: <SRS0=LFK+=Z2=nongnu.org=qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 1DCEA2073F Subject: [PATCH v3 00/18] APIC ID fixes for AMD EPYC CPU models From: Babu Moger <babu.moger@amd.com> To: ehabkost@redhat.com, marcel.apfelbaum@gmail.com, mst@redhat.com, pbonzini@redhat.com, rth@twiddle.net, eblake@redhat.com, armbru@redhat.com, imammedo@redhat.com Date: Tue, 03 Dec 2019 18:36:54 -0600 Message-ID: <157541968844.46157.17994918142533791313.stgit@naples-babu.amd.com> User-Agent: StGit/unknown-version Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Received-SPF: None (protection.outlook.com: amd.com does not designate permitted sender hosts) Precedence: list Cc: babu.moger@amd.com, qemu-devel@nongnu.org Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>
Series	APIC ID fixes for AMD EPYC CPU models \| expand [v3,00/18] APIC ID fixes for AMD EPYC CPU models [v3,01/18] hw/i386: Rename X86CPUTopoInfo structure to X86CPUTopoIDs [v3,02/18] hw/i386: Introduce X86CPUTopoInfo to contain topology info [v3,03/18] hw/i386: Consolidate topology functions [v3,04/18] hw/i386: Introduce initialize_topo_info to initialize X86CPUTopoInfo [v3,05/18] machine: Add SMP Sockets in CpuTopology [v3,06/18] hw/core: Add core complex id in X86CPU topology [v3,07/18] machine: Add a new function init_apicid_fn in MachineClass [v3,08/18] hw/i386: Update structures for nodes_per_pkg [v3,09/18] i386: Add CPUX86Family type in CPUX86State [v3,10/18] hw/386: Add EPYC mode topology decoding functions [v3,11/18] i386: Cleanup and use the EPYC mode topology functions [v3,12/18] numa: Split the numa initialization [v3,13/18] hw/i386: Introduce apicid_from_cpu_idx in PCMachineState [v3,14/18] hw/i386: Introduce topo_ids_from_apicid handler PCMachineState [v3,15/18] hw/i386: Introduce apic_id_from_topo_ids handler in PCMachineState [v3,16/18] hw/i386: Introduce EPYC mode function handlers [v3,17/18] i386: Fix pkg_id offset for epyc mode [v3,18/18] tests: Update the Unit tests

Babu Moger Dec. 4, 2019, 12:36 a.m. UTC

This series fixes APIC ID encoding problems on AMD EPYC CPUs.
https://bugzilla.redhat.com/show_bug.cgi?id=1728166

Currently, the APIC ID is decoded based on the sequence
sockets->dies->cores->threads. This works for most standard AMD and other
vendors' configurations, but this decoding sequence does not follow that of
AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
inconsistency. When booting a guest VM, the kernel tries to validate the
topology, and finds it inconsistent with the enumeration of EPYC cpu models.

To fix the problem we need to build the topology as per the Processor
Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
Processors. It is available at https://www.amd.com/system/files/TechDocs/55570-B1_PUB.zip

Here is the text from the PPR.
Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
number of least significant bits in the Initial APIC ID that indicate core ID
within a processor, in constructing per-core CPUID masks.
Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
(MNC) that the processor could theoretically support, not the actual number of
cores that are actually implemented or enabled on the processor, as indicated
by Core::X86::Cpuid::SizeId[NC].
Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
• ApicId[6] = Socket ID.
• ApicId[5:4] = Node ID.
• ApicId[3] = Logical CCX L3 complex ID
• ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}

v3:
1. Consolidated the topology information in structure X86CPUTopoInfo.
2. Changed the ccx_id to llc_id as commented by upstream.
3. Generalized the apic id decoding. It is mostly similar to current apic id
except that it adds new field llc_id when numa configured. Removes all the
hardcoded values.
4. Removed the earlier parse_numa split. And moved the numa node initialization
inside the numa_complete_configuration. This is bit cleaner as commented by
Eduardo.
5. Added new function init_apicid_fn inside machine_class structure. This
will be used to update the apic id handler specific to cpu model.
6. Updated the cpuid unit tests.
7. TODO : Need to figure out how to dynamically update the handlers using cpu models.
I might some guidance on that.

v2:
https://lore.kernel.org/qemu-devel/156779689013.21957.1631551572950676212.stgit@localhost.localdomain/
1. Introduced the new property epyc to enable new epyc mode.
2. Separated the epyc mode and non epyc mode function.
3. Introduced function pointers in PCMachineState to handle the
differences.
4. Mildly tested different combinations to make things are working as expected.
5. TODO : Setting the epyc feature bit needs to be worked out. This feature is
supported only on AMD EPYC models. I may need some guidance on that.

v1:
https://lore.kernel.org/qemu-devel/20190731232032.51786-1-babu.moger@amd.com/

---

Babu Moger (18):
hw/i386: Rename X86CPUTopoInfo structure to X86CPUTopoIDs
hw/i386: Introduce X86CPUTopoInfo to contain topology info
hw/i386: Consolidate topology functions
hw/i386: Introduce initialize_topo_info to initialize X86CPUTopoInfo
machine: Add SMP Sockets in CpuTopology
hw/core: Add core complex id in X86CPU topology
machine: Add a new function init_apicid_fn in MachineClass
hw/i386: Update structures for nodes_per_pkg
i386: Add CPUX86Family type in CPUX86State
hw/386: Add EPYC mode topology decoding functions
i386: Cleanup and use the EPYC mode topology functions
numa: Split the numa initialization
hw/i386: Introduce apicid_from_cpu_idx in PCMachineState
hw/i386: Introduce topo_ids_from_apicid handler PCMachineState
hw/i386: Introduce apic_id_from_topo_ids handler in PCMachineState
hw/i386: Introduce EPYC mode function handlers
i386: Fix pkg_id offset for epyc mode
tests: Update the Unit tests

--

Igor Mammedov Feb. 3, 2020, 2:59 p.m. UTC | #1

On Tue, 03 Dec 2019 18:36:54 -0600
Babu Moger <babu.moger@amd.com> wrote:

> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
> https://bugzilla.redhat.com/show_bug.cgi?id=1728166
> 
> Currently, the APIC ID is decoded based on the sequence
> sockets->dies->cores->threads. This works for most standard AMD and other
> vendors' configurations, but this decoding sequence does not follow that of
> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
> inconsistency.  When booting a guest VM, the kernel tries to validate the
> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
> 
> To fix the problem we need to build the topology as per the Processor
> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
> Processors. It is available at https://www.amd.com/system/files/TechDocs/55570-B1_PUB.zip
> 
> Here is the text from the PPR.
> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
> number of least significant bits in the Initial APIC ID that indicate core ID
> within a processor, in constructing per-core CPUID masks.
> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
> (MNC) that the processor could theoretically support, not the actual number of
> cores that are actually implemented or enabled on the processor, as indicated
> by Core::X86::Cpuid::SizeId[NC].
> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
> • ApicId[6] = Socket ID.
> • ApicId[5:4] = Node ID.
> • ApicId[3] = Logical CCX L3 complex ID
> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}


After checking out all patches and some pondering, used here approach
looks to me too intrusive for the task at hand especially where it
comes to generic code.

(Ignore till ==== to see suggestion how to simplify without reading
reasoning behind it first)

Lets look for a way to simplify it a little bit.

So problem we are trying to solve,
 1: calculate APIC IDs based on cpu type (to e more specific: for EPYC based CPUs)
 2: it depends on knowing total number of numa nodes.

Externally workflow looks like following:
  1. user provides -smp x,sockets,cores,...,maxcpus
      that's used by possible_cpu_arch_ids() singleton to build list of
      possible CPUs (which is available to user via command 'hotpluggable-cpus')

      Hook could be called very early and possible_cpus data might be
      not complete. It builds a list of possible CPUs which user could
      modify later.

  2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa node,node_id=x,cpus="
      options to assign cpus to nodes, which is one way or another calling
      machine_set_cpu_numa_node(). The later updates 'possible_cpus' list
      with node information. It happens early when total number of nodes
      is not available.

  2.2 user does not provide explicit node mappings for CPUs.
      QEMU steps in and assigns possible cpus to nodes in machine_numa_finish_cpu_init()
      (using the same machine_set_cpu_numa_node()) right before calling boards
      specific machine init(). At that time total number of nodes is known.

In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to be defined before
boards init() is run.

In 2.2 case it calls get_default_cpu_node_id() -> x86_get_default_cpu_node_id()
which uses arch_id calculate numa node.
But then question is: does it have to use APIC id or could it infer 'pkg_id',
it's after, from ms->possible_cpus->cpus[i].props data?
  
With that out of the way APIC ID will be used only during board's init(),
so board could update possible_cpus with valid APIC IDs at the start of
x86_cpus_init().

====
in nutshell it would be much easier to do following:

 1. make x86_get_default_cpu_node_id() APIC ID in-depended or
    if impossible as alternative recompute APIC IDs there if cpu
    type is EPYC based (since number of nodes is already known)
 2. recompute APIC IDs in x86_cpus_init() if cpu type is EPYC based

this way one doesn't need to touch generic numa code, introduce
x86 specific init_apicid_fn() hook into generic code and keep
x86/EPYC nuances contained within x86 code only.

> v3:
>   1. Consolidated the topology information in structure X86CPUTopoInfo.
>   2. Changed the ccx_id to llc_id as commented by upstream.
>   3. Generalized the apic id decoding. It is mostly similar to current apic id
>      except that it adds new field llc_id when numa configured. Removes all the
>      hardcoded values.
>   4. Removed the earlier parse_numa split. And moved the numa node initialization
>      inside the numa_complete_configuration. This is bit cleaner as commented by 
>      Eduardo.
>   5. Added new function init_apicid_fn inside machine_class structure. This
>      will be used to update the apic id handler specific to cpu model.
>   6. Updated the cpuid unit tests.
>   7. TODO : Need to figure out how to dynamically update the handlers using cpu models.
>      I might some guidance on that.
> 
> v2:
>   https://lore.kernel.org/qemu-devel/156779689013.21957.1631551572950676212.stgit@localhost.localdomain/
>   1. Introduced the new property epyc to enable new epyc mode.
>   2. Separated the epyc mode and non epyc mode function.
>   3. Introduced function pointers in PCMachineState to handle the
>      differences.
>   4. Mildly tested different combinations to make things are working as expected.
>   5. TODO : Setting the epyc feature bit needs to be worked out. This feature is
>      supported only on AMD EPYC models. I may need some guidance on that.
> 
> v1:
>   https://lore.kernel.org/qemu-devel/20190731232032.51786-1-babu.moger@amd.com/
> 
> ---
> 
> Babu Moger (18):
>       hw/i386: Rename X86CPUTopoInfo structure to X86CPUTopoIDs
>       hw/i386: Introduce X86CPUTopoInfo to contain topology info
>       hw/i386: Consolidate topology functions
>       hw/i386: Introduce initialize_topo_info to initialize X86CPUTopoInfo
>       machine: Add SMP Sockets in CpuTopology
>       hw/core: Add core complex id in X86CPU topology
>       machine: Add a new function init_apicid_fn in MachineClass
>       hw/i386: Update structures for nodes_per_pkg
>       i386: Add CPUX86Family type in CPUX86State
>       hw/386: Add EPYC mode topology decoding functions
>       i386: Cleanup and use the EPYC mode topology functions
>       numa: Split the numa initialization
>       hw/i386: Introduce apicid_from_cpu_idx in PCMachineState
>       hw/i386: Introduce topo_ids_from_apicid handler PCMachineState
>       hw/i386: Introduce apic_id_from_topo_ids handler in PCMachineState
>       hw/i386: Introduce EPYC mode function handlers
>       i386: Fix pkg_id offset for epyc mode
>       tests: Update the Unit tests
> 
> 
>  hw/core/machine-hmp-cmds.c |    3 +
>  hw/core/machine.c          |   14 +++
>  hw/core/numa.c             |   62 +++++++++----
>  hw/i386/pc.c               |  132 +++++++++++++++++++---------
>  include/hw/boards.h        |    3 +
>  include/hw/i386/pc.h       |    9 ++
>  include/hw/i386/topology.h |  209 +++++++++++++++++++++++++++++++-------------
>  include/sysemu/numa.h      |    5 +
>  qapi/machine.json          |    7 +
>  target/i386/cpu.c          |  196 ++++++++++++-----------------------------
>  target/i386/cpu.h          |    9 ++
>  tests/test-x86-cpuid.c     |  115 ++++++++++++++----------
>  vl.c                       |    4 +
>  13 files changed, 455 insertions(+), 313 deletions(-)
> 
> --
>

Babu Moger Feb. 3, 2020, 7:31 p.m. UTC | #2

On 2/3/20 8:59 AM, Igor Mammedov wrote:
> On Tue, 03 Dec 2019 18:36:54 -0600
> Babu Moger <babu.moger@amd.com> wrote:
> 
>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1728166&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C50685202e372472d7b2c08d7a8b9afa6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637163387802886193&amp;sdata=N%2FaBBZ8G3D1gCNvabVQ%2FraHvINazcVeEc9FWdxQAWmg%3D&amp;reserved=0
>>
>> Currently, the APIC ID is decoded based on the sequence
>> sockets->dies->cores->threads. This works for most standard AMD and other
>> vendors' configurations, but this decoding sequence does not follow that of
>> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
>> inconsistency.  When booting a guest VM, the kernel tries to validate the
>> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
>>
>> To fix the problem we need to build the topology as per the Processor
>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
>> Processors. It is available at https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F55570-B1_PUB.zip&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C50685202e372472d7b2c08d7a8b9afa6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637163387802886193&amp;sdata=McjyMS3A3x5Jr57VxJmHDyh5jumdybzW%2FwLtE4FAKHQ%3D&amp;reserved=0
>>
>> Here is the text from the PPR.
>> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
>> number of least significant bits in the Initial APIC ID that indicate core ID
>> within a processor, in constructing per-core CPUID masks.
>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
>> (MNC) that the processor could theoretically support, not the actual number of
>> cores that are actually implemented or enabled on the processor, as indicated
>> by Core::X86::Cpuid::SizeId[NC].
>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
>> • ApicId[6] = Socket ID.
>> • ApicId[5:4] = Node ID.
>> • ApicId[3] = Logical CCX L3 complex ID
>> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}
> 
> 
> After checking out all patches and some pondering, used here approach
> looks to me too intrusive for the task at hand especially where it
> comes to generic code.
> 
> (Ignore till ==== to see suggestion how to simplify without reading
> reasoning behind it first)
> 
> Lets look for a way to simplify it a little bit.
> 
> So problem we are trying to solve,
>  1: calculate APIC IDs based on cpu type (to e more specific: for EPYC based CPUs)
>  2: it depends on knowing total number of numa nodes.
> 
> Externally workflow looks like following:
>   1. user provides -smp x,sockets,cores,...,maxcpus
>       that's used by possible_cpu_arch_ids() singleton to build list of
>       possible CPUs (which is available to user via command 'hotpluggable-cpus')
> 
>       Hook could be called very early and possible_cpus data might be
>       not complete. It builds a list of possible CPUs which user could
>       modify later.
> 
>   2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa node,node_id=x,cpus="
>       options to assign cpus to nodes, which is one way or another calling
>       machine_set_cpu_numa_node(). The later updates 'possible_cpus' list
>       with node information. It happens early when total number of nodes
>       is not available.
> 
>   2.2 user does not provide explicit node mappings for CPUs.
>       QEMU steps in and assigns possible cpus to nodes in machine_numa_finish_cpu_init()
>       (using the same machine_set_cpu_numa_node()) right before calling boards
>       specific machine init(). At that time total number of nodes is known.
> 
> In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to be defined before
> boards init() is run.
> 
> In 2.2 case it calls get_default_cpu_node_id() -> x86_get_default_cpu_node_id()
> which uses arch_id calculate numa node.
> But then question is: does it have to use APIC id or could it infer 'pkg_id',
> it's after, from ms->possible_cpus->cpus[i].props data?

Not sure if I got the question right. In this case because the numa
information is not provided all the cpus are assigned to only one node.
The apic id is used here to get the correct pkg_id.

>   
> With that out of the way APIC ID will be used only during board's init(),
> so board could update possible_cpus with valid APIC IDs at the start of
> x86_cpus_init().
> 
> ====
> in nutshell it would be much easier to do following:
> 
>  1. make x86_get_default_cpu_node_id() APIC ID in-depended or
>     if impossible as alternative recompute APIC IDs there if cpu
>     type is EPYC based (since number of nodes is already known)
>  2. recompute APIC IDs in x86_cpus_init() if cpu type is EPYC based
> 
> this way one doesn't need to touch generic numa code, introduce
> x86 specific init_apicid_fn() hook into generic code and keep
> x86/EPYC nuances contained within x86 code only.

I was kind of already working in the similar direction in v4.
1. We already have split the numa initialization in patch #12(Split the
numa initialization). This way we know exactly how many numa nodes are
there before hand.
2. Planning to remove init_apicid_fn
3. Insert the handlers inside X86CPUDefinition.
4. EPYC model will have its own apid id handlers. Everything else will be
initialized with a default handlers(current default handler).
5. The function pc_possible_cpu_arch_ids will load the model definition
and initialize the PCMachineState data structure with the model specific
handlers.

Does that sound similar to what you are thinking. Thoughts?

> 
>> v3:
>>   1. Consolidated the topology information in structure X86CPUTopoInfo.
>>   2. Changed the ccx_id to llc_id as commented by upstream.
>>   3. Generalized the apic id decoding. It is mostly similar to current apic id
>>      except that it adds new field llc_id when numa configured. Removes all the
>>      hardcoded values.
>>   4. Removed the earlier parse_numa split. And moved the numa node initialization
>>      inside the numa_complete_configuration. This is bit cleaner as commented by 
>>      Eduardo.
>>   5. Added new function init_apicid_fn inside machine_class structure. This
>>      will be used to update the apic id handler specific to cpu model.
>>   6. Updated the cpuid unit tests.
>>   7. TODO : Need to figure out how to dynamically update the handlers using cpu models.
>>      I might some guidance on that.
>>
>> v2:
>>   https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fqemu-devel%2F156779689013.21957.1631551572950676212.stgit%40localhost.localdomain%2F&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C50685202e372472d7b2c08d7a8b9afa6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637163387802886193&amp;sdata=ls1cxA1yh0P05zYsAf3sLXDM11DFHtxZvfWWaar7Mgg%3D&amp;reserved=0
>>   1. Introduced the new property epyc to enable new epyc mode.
>>   2. Separated the epyc mode and non epyc mode function.
>>   3. Introduced function pointers in PCMachineState to handle the
>>      differences.
>>   4. Mildly tested different combinations to make things are working as expected.
>>   5. TODO : Setting the epyc feature bit needs to be worked out. This feature is
>>      supported only on AMD EPYC models. I may need some guidance on that.
>>
>> v1:
>>   https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fqemu-devel%2F20190731232032.51786-1-babu.moger%40amd.com%2F&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C50685202e372472d7b2c08d7a8b9afa6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637163387802886193&amp;sdata=nT4T9RIL4EeSvB%2Ff9%2BjbU7lldopjglQ2X6uYx13WMPE%3D&amp;reserved=0
>>
>> ---
>>
>> Babu Moger (18):
>>       hw/i386: Rename X86CPUTopoInfo structure to X86CPUTopoIDs
>>       hw/i386: Introduce X86CPUTopoInfo to contain topology info
>>       hw/i386: Consolidate topology functions
>>       hw/i386: Introduce initialize_topo_info to initialize X86CPUTopoInfo
>>       machine: Add SMP Sockets in CpuTopology
>>       hw/core: Add core complex id in X86CPU topology
>>       machine: Add a new function init_apicid_fn in MachineClass
>>       hw/i386: Update structures for nodes_per_pkg
>>       i386: Add CPUX86Family type in CPUX86State
>>       hw/386: Add EPYC mode topology decoding functions
>>       i386: Cleanup and use the EPYC mode topology functions
>>       numa: Split the numa initialization
>>       hw/i386: Introduce apicid_from_cpu_idx in PCMachineState
>>       hw/i386: Introduce topo_ids_from_apicid handler PCMachineState
>>       hw/i386: Introduce apic_id_from_topo_ids handler in PCMachineState
>>       hw/i386: Introduce EPYC mode function handlers
>>       i386: Fix pkg_id offset for epyc mode
>>       tests: Update the Unit tests
>>
>>
>>  hw/core/machine-hmp-cmds.c |    3 +
>>  hw/core/machine.c          |   14 +++
>>  hw/core/numa.c             |   62 +++++++++----
>>  hw/i386/pc.c               |  132 +++++++++++++++++++---------
>>  include/hw/boards.h        |    3 +
>>  include/hw/i386/pc.h       |    9 ++
>>  include/hw/i386/topology.h |  209 +++++++++++++++++++++++++++++++-------------
>>  include/sysemu/numa.h      |    5 +
>>  qapi/machine.json          |    7 +
>>  target/i386/cpu.c          |  196 ++++++++++++-----------------------------
>>  target/i386/cpu.h          |    9 ++
>>  tests/test-x86-cpuid.c     |  115 ++++++++++++++----------
>>  vl.c                       |    4 +
>>  13 files changed, 455 insertions(+), 313 deletions(-)
>>
>> --
>>
>

Igor Mammedov Feb. 4, 2020, 8:02 a.m. UTC | #3

On Mon, 3 Feb 2020 13:31:29 -0600
Babu Moger <babu.moger@amd.com> wrote:

> On 2/3/20 8:59 AM, Igor Mammedov wrote:
> > On Tue, 03 Dec 2019 18:36:54 -0600
> > Babu Moger <babu.moger@amd.com> wrote:
> >   
> >> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
> >> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1728166&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C50685202e372472d7b2c08d7a8b9afa6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637163387802886193&amp;sdata=N%2FaBBZ8G3D1gCNvabVQ%2FraHvINazcVeEc9FWdxQAWmg%3D&amp;reserved=0
> >>
> >> Currently, the APIC ID is decoded based on the sequence
> >> sockets->dies->cores->threads. This works for most standard AMD and other
> >> vendors' configurations, but this decoding sequence does not follow that of
> >> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
> >> inconsistency.  When booting a guest VM, the kernel tries to validate the
> >> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
> >>
> >> To fix the problem we need to build the topology as per the Processor
> >> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
> >> Processors. It is available at https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F55570-B1_PUB.zip&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C50685202e372472d7b2c08d7a8b9afa6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637163387802886193&amp;sdata=McjyMS3A3x5Jr57VxJmHDyh5jumdybzW%2FwLtE4FAKHQ%3D&amp;reserved=0
> >>
> >> Here is the text from the PPR.
> >> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
> >> number of least significant bits in the Initial APIC ID that indicate core ID
> >> within a processor, in constructing per-core CPUID masks.
> >> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
> >> (MNC) that the processor could theoretically support, not the actual number of
> >> cores that are actually implemented or enabled on the processor, as indicated
> >> by Core::X86::Cpuid::SizeId[NC].
> >> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
> >> • ApicId[6] = Socket ID.
> >> • ApicId[5:4] = Node ID.
> >> • ApicId[3] = Logical CCX L3 complex ID
> >> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}  
> > 
> > 
> > After checking out all patches and some pondering, used here approach
> > looks to me too intrusive for the task at hand especially where it
> > comes to generic code.
> > 
> > (Ignore till ==== to see suggestion how to simplify without reading
> > reasoning behind it first)
> > 
> > Lets look for a way to simplify it a little bit.
> > 
> > So problem we are trying to solve,
> >  1: calculate APIC IDs based on cpu type (to e more specific: for EPYC based CPUs)
> >  2: it depends on knowing total number of numa nodes.
> > 
> > Externally workflow looks like following:
> >   1. user provides -smp x,sockets,cores,...,maxcpus
> >       that's used by possible_cpu_arch_ids() singleton to build list of
> >       possible CPUs (which is available to user via command 'hotpluggable-cpus')
> > 
> >       Hook could be called very early and possible_cpus data might be
> >       not complete. It builds a list of possible CPUs which user could
> >       modify later.
> > 
> >   2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa node,node_id=x,cpus="
> >       options to assign cpus to nodes, which is one way or another calling
> >       machine_set_cpu_numa_node(). The later updates 'possible_cpus' list
> >       with node information. It happens early when total number of nodes
> >       is not available.
> > 
> >   2.2 user does not provide explicit node mappings for CPUs.
> >       QEMU steps in and assigns possible cpus to nodes in machine_numa_finish_cpu_init()
> >       (using the same machine_set_cpu_numa_node()) right before calling boards
> >       specific machine init(). At that time total number of nodes is known.
> > 
> > In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to be defined before
> > boards init() is run.
> > 
> > In 2.2 case it calls get_default_cpu_node_id() -> x86_get_default_cpu_node_id()
> > which uses arch_id calculate numa node.
> > But then question is: does it have to use APIC id or could it infer 'pkg_id',
> > it's after, from ms->possible_cpus->cpus[i].props data?  
> 
> Not sure if I got the question right. In this case because the numa
> information is not provided all the cpus are assigned to only one node.
> The apic id is used here to get the correct pkg_id.

apicid was composed from socket/core/thread[/die] tuple which cpus[i].props is.

Question is if we can compose only pkg_id based on the same data without
converting it to apicid and then "reverse engineering" it back
original data?

Or more direct question: is socket-id the same as pkg_id?


> 
> >   
> > With that out of the way APIC ID will be used only during board's init(),
> > so board could update possible_cpus with valid APIC IDs at the start of
> > x86_cpus_init().
> > 
> > ====
> > in nutshell it would be much easier to do following:
> > 
> >  1. make x86_get_default_cpu_node_id() APIC ID in-depended or
> >     if impossible as alternative recompute APIC IDs there if cpu
> >     type is EPYC based (since number of nodes is already known)
> >  2. recompute APIC IDs in x86_cpus_init() if cpu type is EPYC based
> > 
> > this way one doesn't need to touch generic numa code, introduce
> > x86 specific init_apicid_fn() hook into generic code and keep
> > x86/EPYC nuances contained within x86 code only.  
> 
> I was kind of already working in the similar direction in v4.
> 1. We already have split the numa initialization in patch #12(Split the
> numa initialization). This way we know exactly how many numa nodes are
> there before hand.

I suggest to drop that patch, It's the one that touches generic numa
code and adding more legacy based extensions like cpu_indexes.
Which I'd like to get rid of to begin with, so only -numa cpu is left.

I think it's not necessary to touch numa code at all for apicid generation
purpose, as I tried to explain above. We should be able to keep
this x86 only business.

> 2. Planning to remove init_apicid_fn
> 3. Insert the handlers inside X86CPUDefinition.
what handlers do you mean?

> 4. EPYC model will have its own apid id handlers. Everything else will be
> initialized with a default handlers(current default handler).
> 5. The function pc_possible_cpu_arch_ids will load the model definition
> and initialize the PCMachineState data structure with the model specific
> handlers.
I'm not sure what do you mean here.
 
> Does that sound similar to what you are thinking. Thoughts?
If you have something to share and can push it on github,
I can look at, whether it has design issues to spare you a round trip on a list.
(it won't be proper review but at least I can help to pinpoint most problematic parts)

> 
> >   
> >> v3:
> >>   1. Consolidated the topology information in structure X86CPUTopoInfo.
> >>   2. Changed the ccx_id to llc_id as commented by upstream.
> >>   3. Generalized the apic id decoding. It is mostly similar to current apic id
> >>      except that it adds new field llc_id when numa configured. Removes all the
> >>      hardcoded values.
> >>   4. Removed the earlier parse_numa split. And moved the numa node initialization
> >>      inside the numa_complete_configuration. This is bit cleaner as commented by 
> >>      Eduardo.
> >>   5. Added new function init_apicid_fn inside machine_class structure. This
> >>      will be used to update the apic id handler specific to cpu model.
> >>   6. Updated the cpuid unit tests.
> >>   7. TODO : Need to figure out how to dynamically update the handlers using cpu models.
> >>      I might some guidance on that.
> >>
> >> v2:
> >>   https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fqemu-devel%2F156779689013.21957.1631551572950676212.stgit%40localhost.localdomain%2F&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C50685202e372472d7b2c08d7a8b9afa6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637163387802886193&amp;sdata=ls1cxA1yh0P05zYsAf3sLXDM11DFHtxZvfWWaar7Mgg%3D&amp;reserved=0
> >>   1. Introduced the new property epyc to enable new epyc mode.
> >>   2. Separated the epyc mode and non epyc mode function.
> >>   3. Introduced function pointers in PCMachineState to handle the
> >>      differences.
> >>   4. Mildly tested different combinations to make things are working as expected.
> >>   5. TODO : Setting the epyc feature bit needs to be worked out. This feature is
> >>      supported only on AMD EPYC models. I may need some guidance on that.
> >>
> >> v1:
> >>   https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fqemu-devel%2F20190731232032.51786-1-babu.moger%40amd.com%2F&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C50685202e372472d7b2c08d7a8b9afa6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637163387802886193&amp;sdata=nT4T9RIL4EeSvB%2Ff9%2BjbU7lldopjglQ2X6uYx13WMPE%3D&amp;reserved=0
> >>
> >> ---
> >>
> >> Babu Moger (18):
> >>       hw/i386: Rename X86CPUTopoInfo structure to X86CPUTopoIDs
> >>       hw/i386: Introduce X86CPUTopoInfo to contain topology info
> >>       hw/i386: Consolidate topology functions
> >>       hw/i386: Introduce initialize_topo_info to initialize X86CPUTopoInfo
> >>       machine: Add SMP Sockets in CpuTopology
> >>       hw/core: Add core complex id in X86CPU topology
> >>       machine: Add a new function init_apicid_fn in MachineClass
> >>       hw/i386: Update structures for nodes_per_pkg
> >>       i386: Add CPUX86Family type in CPUX86State
> >>       hw/386: Add EPYC mode topology decoding functions
> >>       i386: Cleanup and use the EPYC mode topology functions
> >>       numa: Split the numa initialization
> >>       hw/i386: Introduce apicid_from_cpu_idx in PCMachineState
> >>       hw/i386: Introduce topo_ids_from_apicid handler PCMachineState
> >>       hw/i386: Introduce apic_id_from_topo_ids handler in PCMachineState
> >>       hw/i386: Introduce EPYC mode function handlers
> >>       i386: Fix pkg_id offset for epyc mode
> >>       tests: Update the Unit tests
> >>
> >>
> >>  hw/core/machine-hmp-cmds.c |    3 +
> >>  hw/core/machine.c          |   14 +++
> >>  hw/core/numa.c             |   62 +++++++++----
> >>  hw/i386/pc.c               |  132 +++++++++++++++++++---------
> >>  include/hw/boards.h        |    3 +
> >>  include/hw/i386/pc.h       |    9 ++
> >>  include/hw/i386/topology.h |  209 +++++++++++++++++++++++++++++++-------------
> >>  include/sysemu/numa.h      |    5 +
> >>  qapi/machine.json          |    7 +
> >>  target/i386/cpu.c          |  196 ++++++++++++-----------------------------
> >>  target/i386/cpu.h          |    9 ++
> >>  tests/test-x86-cpuid.c     |  115 ++++++++++++++----------
> >>  vl.c                       |    4 +
> >>  13 files changed, 455 insertions(+), 313 deletions(-)
> >>
> >> --
> >>  
> >   
>

Babu Moger Feb. 4, 2020, 7:08 p.m. UTC | #4

On 2/4/20 2:02 AM, Igor Mammedov wrote:
> On Mon, 3 Feb 2020 13:31:29 -0600
> Babu Moger <babu.moger@amd.com> wrote:
> 
>> On 2/3/20 8:59 AM, Igor Mammedov wrote:
>>> On Tue, 03 Dec 2019 18:36:54 -0600
>>> Babu Moger <babu.moger@amd.com> wrote:
>>>   
>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1728166&amp;data=02%7C01%7Cbabu.moger%40amd.com%7Cbbd1693802184161c8c308d7a9489ee9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637164001686545394&amp;sdata=UtYAoTk4RfZZ1VfaP%2FhcYrCSNTcubEB7cB%2BoYlRLfhc%3D&amp;reserved=0
>>>>
>>>> Currently, the APIC ID is decoded based on the sequence
>>>> sockets->dies->cores->threads. This works for most standard AMD and other
>>>> vendors' configurations, but this decoding sequence does not follow that of
>>>> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
>>>> inconsistency.  When booting a guest VM, the kernel tries to validate the
>>>> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
>>>>
>>>> To fix the problem we need to build the topology as per the Processor
>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
>>>> Processors. It is available at https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F55570-B1_PUB.zip&amp;data=02%7C01%7Cbabu.moger%40amd.com%7Cbbd1693802184161c8c308d7a9489ee9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637164001686555390&amp;sdata=oHONRiXtpstKqwrxyzqR20bYDDr3zvmwq91a%2Br6iDqc%3D&amp;reserved=0
>>>>
>>>> Here is the text from the PPR.
>>>> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
>>>> number of least significant bits in the Initial APIC ID that indicate core ID
>>>> within a processor, in constructing per-core CPUID masks.
>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
>>>> (MNC) that the processor could theoretically support, not the actual number of
>>>> cores that are actually implemented or enabled on the processor, as indicated
>>>> by Core::X86::Cpuid::SizeId[NC].
>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
>>>> • ApicId[6] = Socket ID.
>>>> • ApicId[5:4] = Node ID.
>>>> • ApicId[3] = Logical CCX L3 complex ID
>>>> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}  
>>>
>>>
>>> After checking out all patches and some pondering, used here approach
>>> looks to me too intrusive for the task at hand especially where it
>>> comes to generic code.
>>>
>>> (Ignore till ==== to see suggestion how to simplify without reading
>>> reasoning behind it first)
>>>
>>> Lets look for a way to simplify it a little bit.
>>>
>>> So problem we are trying to solve,
>>>  1: calculate APIC IDs based on cpu type (to e more specific: for EPYC based CPUs)
>>>  2: it depends on knowing total number of numa nodes.
>>>
>>> Externally workflow looks like following:
>>>   1. user provides -smp x,sockets,cores,...,maxcpus
>>>       that's used by possible_cpu_arch_ids() singleton to build list of
>>>       possible CPUs (which is available to user via command 'hotpluggable-cpus')
>>>
>>>       Hook could be called very early and possible_cpus data might be
>>>       not complete. It builds a list of possible CPUs which user could
>>>       modify later.
>>>
>>>   2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa node,node_id=x,cpus="
>>>       options to assign cpus to nodes, which is one way or another calling
>>>       machine_set_cpu_numa_node(). The later updates 'possible_cpus' list
>>>       with node information. It happens early when total number of nodes
>>>       is not available.
>>>
>>>   2.2 user does not provide explicit node mappings for CPUs.
>>>       QEMU steps in and assigns possible cpus to nodes in machine_numa_finish_cpu_init()
>>>       (using the same machine_set_cpu_numa_node()) right before calling boards
>>>       specific machine init(). At that time total number of nodes is known.
>>>
>>> In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to be defined before
>>> boards init() is run.

In case of 2.1, we need to have the arch_id already generated. This is
done inside possible_cpu_arch_ids. The arch_id is used by
machine_set_cpu_numa_node to assign the cpus to correct numa node.

If we want to move the arch_id generation into board init(), then we need
to save the cpu indexes belonging to each node somewhere.

>>>
>>> In 2.2 case it calls get_default_cpu_node_id() -> x86_get_default_cpu_node_id()
>>> which uses arch_id calculate numa node.
>>> But then question is: does it have to use APIC id or could it infer 'pkg_id',
>>> it's after, from ms->possible_cpus->cpus[i].props data?  
>>
>> Not sure if I got the question right. In this case because the numa
>> information is not provided all the cpus are assigned to only one node.
>> The apic id is used here to get the correct pkg_id.
> 
> apicid was composed from socket/core/thread[/die] tuple which cpus[i].props is.
> 
> Question is if we can compose only pkg_id based on the same data without
> converting it to apicid and then "reverse engineering" it back
> original data?

Yes. It is possible.

> 
> Or more direct question: is socket-id the same as pkg_id?

Yes. Socket_id and pkg_id is same.

> 
> 
>>
>>>   
>>> With that out of the way APIC ID will be used only during board's init(),
>>> so board could update possible_cpus with valid APIC IDs at the start of
>>> x86_cpus_init().
>>>
>>> ====
>>> in nutshell it would be much easier to do following:
>>>
>>>  1. make x86_get_default_cpu_node_id() APIC ID in-depended or
>>>     if impossible as alternative recompute APIC IDs there if cpu
>>>     type is EPYC based (since number of nodes is already known)
>>>  2. recompute APIC IDs in x86_cpus_init() if cpu type is EPYC based
>>>
>>> this way one doesn't need to touch generic numa code, introduce
>>> x86 specific init_apicid_fn() hook into generic code and keep
>>> x86/EPYC nuances contained within x86 code only.  
>>
>> I was kind of already working in the similar direction in v4.
>> 1. We already have split the numa initialization in patch #12(Split the
>> numa initialization). This way we know exactly how many numa nodes are
>> there before hand.
> 
> I suggest to drop that patch, It's the one that touches generic numa
> code and adding more legacy based extensions like cpu_indexes.
> Which I'd like to get rid of to begin with, so only -numa cpu is left.
> 
> I think it's not necessary to touch numa code at all for apicid generation
> purpose, as I tried to explain above. We should be able to keep
> this x86 only business.

This is going to be difficult without touching the generic numa code.

> 
>> 2. Planning to remove init_apicid_fn
>> 3. Insert the handlers inside X86CPUDefinition.
> what handlers do you mean?

Apicid generation logic can be separated into 3 types of handlers.
x86_apicid_from_cpu_idx: Generate apicid from cpu index.
x86_topo_ids_from_apicid: Generate topo ids from apic id.
x86_apicid_from_topo_ids: Generate apicid from topo ids.

We should be able to generate one id from other(you can see topology.h).

X86CPUDefinition will have the handlers specific to each model like the
way we have features now. The above 3 handlers will be used as default
handler.


The EPYC model will have its corresponding handlers.

x86_apicid_from_cpu_idx_epyc
x86_topo_ids_from_apicid_epyc
x86_apicid_from_topo_ids_epyc.

> 
>> 4. EPYC model will have its own apid id handlers. Everything else will be
>> initialized with a default handlers(current default handler).
>> 5. The function pc_possible_cpu_arch_ids will load the model definition
>> and initialize the PCMachineState data structure with the model specific
>> handlers.
> I'm not sure what do you mean here.

PCMachineState will have the function pointers to the above handlers.
I was going to load the correct handler based on the mode type.

>  
>> Does that sound similar to what you are thinking. Thoughts?
> If you have something to share and can push it on github,
> I can look at, whether it has design issues to spare you a round trip on a list.
> (it won't be proper review but at least I can help to pinpoint most problematic parts)
> 
My code for the current approach is kind of ready(yet to be tested). I can
send it as v3.1 if you want to look. Or we can wait for our discussion to
settle. I will post it after our discussion.


There is one more problem we need to address. I was going to address later
in v4 or v5.

This works
-numa node,nodeid=0,cpus=0-3 -numa node,nodeid=1,cpus=4-7

This does not work
-numa node,nodeid=0,cpus=0-5 -numa node,nodeid=1,cpus=6-7

This requires the generic code to pass the node information to the x86
code which requires some handler changes. I was thinking my code will
simplify the changes to address this issue.

>>
>>>   
>>>> v3:
>>>>   1. Consolidated the topology information in structure X86CPUTopoInfo.
>>>>   2. Changed the ccx_id to llc_id as commented by upstream.
>>>>   3. Generalized the apic id decoding. It is mostly similar to current apic id
>>>>      except that it adds new field llc_id when numa configured. Removes all the
>>>>      hardcoded values.
>>>>   4. Removed the earlier parse_numa split. And moved the numa node initialization
>>>>      inside the numa_complete_configuration. This is bit cleaner as commented by 
>>>>      Eduardo.
>>>>   5. Added new function init_apicid_fn inside machine_class structure. This
>>>>      will be used to update the apic id handler specific to cpu model.
>>>>   6. Updated the cpuid unit tests.
>>>>   7. TODO : Need to figure out how to dynamically update the handlers using cpu models.
>>>>      I might some guidance on that.
>>>>
>>>> v2:
>>>>   https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fqemu-devel%2F156779689013.21957.1631551572950676212.stgit%40localhost.localdomain%2F&amp;data=02%7C01%7Cbabu.moger%40amd.com%7Cbbd1693802184161c8c308d7a9489ee9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637164001686555390&amp;sdata=phJxrK4AExmcfBfxcN8Ngtnuwv1vLR%2BW4PnqjUSPQfI%3D&amp;reserved=0
>>>>   1. Introduced the new property epyc to enable new epyc mode.
>>>>   2. Separated the epyc mode and non epyc mode function.
>>>>   3. Introduced function pointers in PCMachineState to handle the
>>>>      differences.
>>>>   4. Mildly tested different combinations to make things are working as expected.
>>>>   5. TODO : Setting the epyc feature bit needs to be worked out. This feature is
>>>>      supported only on AMD EPYC models. I may need some guidance on that.
>>>>
>>>> v1:
>>>>   https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flore.kernel.org%2Fqemu-devel%2F20190731232032.51786-1-babu.moger%40amd.com%2F&amp;data=02%7C01%7Cbabu.moger%40amd.com%7Cbbd1693802184161c8c308d7a9489ee9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637164001686555390&amp;sdata=P1Ghnsypj8uSuGiv9XW38nytrHXAIeOGumsbbAEUjCU%3D&amp;reserved=0
>>>>
>>>> ---
>>>>
>>>> Babu Moger (18):
>>>>       hw/i386: Rename X86CPUTopoInfo structure to X86CPUTopoIDs
>>>>       hw/i386: Introduce X86CPUTopoInfo to contain topology info
>>>>       hw/i386: Consolidate topology functions
>>>>       hw/i386: Introduce initialize_topo_info to initialize X86CPUTopoInfo
>>>>       machine: Add SMP Sockets in CpuTopology
>>>>       hw/core: Add core complex id in X86CPU topology
>>>>       machine: Add a new function init_apicid_fn in MachineClass
>>>>       hw/i386: Update structures for nodes_per_pkg
>>>>       i386: Add CPUX86Family type in CPUX86State
>>>>       hw/386: Add EPYC mode topology decoding functions
>>>>       i386: Cleanup and use the EPYC mode topology functions
>>>>       numa: Split the numa initialization
>>>>       hw/i386: Introduce apicid_from_cpu_idx in PCMachineState
>>>>       hw/i386: Introduce topo_ids_from_apicid handler PCMachineState
>>>>       hw/i386: Introduce apic_id_from_topo_ids handler in PCMachineState
>>>>       hw/i386: Introduce EPYC mode function handlers
>>>>       i386: Fix pkg_id offset for epyc mode
>>>>       tests: Update the Unit tests
>>>>
>>>>
>>>>  hw/core/machine-hmp-cmds.c |    3 +
>>>>  hw/core/machine.c          |   14 +++
>>>>  hw/core/numa.c             |   62 +++++++++----
>>>>  hw/i386/pc.c               |  132 +++++++++++++++++++---------
>>>>  include/hw/boards.h        |    3 +
>>>>  include/hw/i386/pc.h       |    9 ++
>>>>  include/hw/i386/topology.h |  209 +++++++++++++++++++++++++++++++-------------
>>>>  include/sysemu/numa.h      |    5 +
>>>>  qapi/machine.json          |    7 +
>>>>  target/i386/cpu.c          |  196 ++++++++++++-----------------------------
>>>>  target/i386/cpu.h          |    9 ++
>>>>  tests/test-x86-cpuid.c     |  115 ++++++++++++++----------
>>>>  vl.c                       |    4 +
>>>>  13 files changed, 455 insertions(+), 313 deletions(-)
>>>>
>>>> --
>>>>  
>>>   
>>
>

Igor Mammedov Feb. 5, 2020, 9:38 a.m. UTC | #5

On Tue, 4 Feb 2020 13:08:58 -0600
Babu Moger <babu.moger@amd.com> wrote:

> On 2/4/20 2:02 AM, Igor Mammedov wrote:
> > On Mon, 3 Feb 2020 13:31:29 -0600
> > Babu Moger <babu.moger@amd.com> wrote:
> >   
> >> On 2/3/20 8:59 AM, Igor Mammedov wrote:  
> >>> On Tue, 03 Dec 2019 18:36:54 -0600
> >>> Babu Moger <babu.moger@amd.com> wrote:
> >>>     
> >>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
> >>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1728166&amp;data=02%7C01%7Cbabu.moger%40amd.com%7Cbbd1693802184161c8c308d7a9489ee9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637164001686545394&amp;sdata=UtYAoTk4RfZZ1VfaP%2FhcYrCSNTcubEB7cB%2BoYlRLfhc%3D&amp;reserved=0
> >>>>
> >>>> Currently, the APIC ID is decoded based on the sequence
> >>>> sockets->dies->cores->threads. This works for most standard AMD and other
> >>>> vendors' configurations, but this decoding sequence does not follow that of
> >>>> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
> >>>> inconsistency.  When booting a guest VM, the kernel tries to validate the
> >>>> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
> >>>>
> >>>> To fix the problem we need to build the topology as per the Processor
> >>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
> >>>> Processors. It is available at https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F55570-B1_PUB.zip&amp;data=02%7C01%7Cbabu.moger%40amd.com%7Cbbd1693802184161c8c308d7a9489ee9%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637164001686555390&amp;sdata=oHONRiXtpstKqwrxyzqR20bYDDr3zvmwq91a%2Br6iDqc%3D&amp;reserved=0
> >>>>
> >>>> Here is the text from the PPR.
> >>>> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
> >>>> number of least significant bits in the Initial APIC ID that indicate core ID
> >>>> within a processor, in constructing per-core CPUID masks.
> >>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
> >>>> (MNC) that the processor could theoretically support, not the actual number of
> >>>> cores that are actually implemented or enabled on the processor, as indicated
> >>>> by Core::X86::Cpuid::SizeId[NC].
> >>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
> >>>> • ApicId[6] = Socket ID.
> >>>> • ApicId[5:4] = Node ID.
> >>>> • ApicId[3] = Logical CCX L3 complex ID
> >>>> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}    
> >>>
> >>>
> >>> After checking out all patches and some pondering, used here approach
> >>> looks to me too intrusive for the task at hand especially where it
> >>> comes to generic code.
> >>>
> >>> (Ignore till ==== to see suggestion how to simplify without reading
> >>> reasoning behind it first)
> >>>
> >>> Lets look for a way to simplify it a little bit.
> >>>
> >>> So problem we are trying to solve,
> >>>  1: calculate APIC IDs based on cpu type (to e more specific: for EPYC based CPUs)
> >>>  2: it depends on knowing total number of numa nodes.
> >>>
> >>> Externally workflow looks like following:
> >>>   1. user provides -smp x,sockets,cores,...,maxcpus
> >>>       that's used by possible_cpu_arch_ids() singleton to build list of
> >>>       possible CPUs (which is available to user via command 'hotpluggable-cpus')
> >>>
> >>>       Hook could be called very early and possible_cpus data might be
> >>>       not complete. It builds a list of possible CPUs which user could
> >>>       modify later.
> >>>
> >>>   2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa node,node_id=x,cpus="
> >>>       options to assign cpus to nodes, which is one way or another calling
> >>>       machine_set_cpu_numa_node(). The later updates 'possible_cpus' list
> >>>       with node information. It happens early when total number of nodes
> >>>       is not available.
> >>>
> >>>   2.2 user does not provide explicit node mappings for CPUs.
> >>>       QEMU steps in and assigns possible cpus to nodes in machine_numa_finish_cpu_init()
> >>>       (using the same machine_set_cpu_numa_node()) right before calling boards
> >>>       specific machine init(). At that time total number of nodes is known.
> >>>
> >>> In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to be defined before
> >>> boards init() is run.  
> 
> In case of 2.1, we need to have the arch_id already generated. This is
> done inside possible_cpu_arch_ids. The arch_id is used by
> machine_set_cpu_numa_node to assign the cpus to correct numa node.

I might have missed something but I don't see arch_id itself being used in
machine_set_cpu_numa_node(). It only uses props part of possible_cpus

 
> If we want to move the arch_id generation into board init(), then we need
> to save the cpu indexes belonging to each node somewhere.

when cpus are assigned explicitly, decision what cpus go to what nodes is
up to user and user configured mapping is stored in MachineState::possible_cpus
which is accessed by via possible_cpu_arch_ids() callback.
Hence I don see any reason to touch cpu indexes.

> 
> >>>
> >>> In 2.2 case it calls get_default_cpu_node_id() -> x86_get_default_cpu_node_id()
> >>> which uses arch_id calculate numa node.
> >>> But then question is: does it have to use APIC id or could it infer 'pkg_id',
> >>> it's after, from ms->possible_cpus->cpus[i].props data?    
> >>
> >> Not sure if I got the question right. In this case because the numa
> >> information is not provided all the cpus are assigned to only one node.
> >> The apic id is used here to get the correct pkg_id.  
> > 
> > apicid was composed from socket/core/thread[/die] tuple which cpus[i].props is.
> > 
> > Question is if we can compose only pkg_id based on the same data without
> > converting it to apicid and then "reverse engineering" it back
> > original data?  
> 
> Yes. It is possible.
> 
> > 
> > Or more direct question: is socket-id the same as pkg_id?  
> 
> Yes. Socket_id and pkg_id is same.
> 
> > 
> >   
> >>  
> >>>   
> >>> With that out of the way APIC ID will be used only during board's init(),
> >>> so board could update possible_cpus with valid APIC IDs at the start of
> >>> x86_cpus_init().
> >>>
> >>> ====
> >>> in nutshell it would be much easier to do following:
> >>>
> >>>  1. make x86_get_default_cpu_node_id() APIC ID in-depended or
> >>>     if impossible as alternative recompute APIC IDs there if cpu
> >>>     type is EPYC based (since number of nodes is already known)
> >>>  2. recompute APIC IDs in x86_cpus_init() if cpu type is EPYC based
> >>>
> >>> this way one doesn't need to touch generic numa code, introduce
> >>> x86 specific init_apicid_fn() hook into generic code and keep
> >>> x86/EPYC nuances contained within x86 code only.    
> >>
> >> I was kind of already working in the similar direction in v4.
> >> 1. We already have split the numa initialization in patch #12(Split the
> >> numa initialization). This way we know exactly how many numa nodes are
> >> there before hand.  
> > 
> > I suggest to drop that patch, It's the one that touches generic numa
> > code and adding more legacy based extensions like cpu_indexes.
> > Which I'd like to get rid of to begin with, so only -numa cpu is left.
> > 
> > I think it's not necessary to touch numa code at all for apicid generation
> > purpose, as I tried to explain above. We should be able to keep
> > this x86 only business.  
> 
> This is going to be difficult without touching the generic numa code.

Looking at current code I don't see why one would touch numa code.
Care to explain in more details why you'd have to touch it?

> >> 2. Planning to remove init_apicid_fn
> >> 3. Insert the handlers inside X86CPUDefinition.  
> > what handlers do you mean?  
> 
> Apicid generation logic can be separated into 3 types of handlers.
> x86_apicid_from_cpu_idx: Generate apicid from cpu index.
> x86_topo_ids_from_apicid: Generate topo ids from apic id.
> x86_apicid_from_topo_ids: Generate apicid from topo ids.
> 
> We should be able to generate one id from other(you can see topology.h).
> 
> X86CPUDefinition will have the handlers specific to each model like the
> way we have features now. The above 3 handlers will be used as default
> handler.

it probably shouldn't be a part of X86CPUDefinition,
as it's machines responsibility to generate and set APIC ID.

What you are doing with this topo functions in this version
looks more that enough to me.

> The EPYC model will have its corresponding handlers.
> 
> x86_apicid_from_cpu_idx_epyc
> x86_topo_ids_from_apicid_epyc
> x86_apicid_from_topo_ids_epyc.

CPU might use call backs, but does it have to?
I see cpu_x86_cpuid() uses these functions to decode apic_id back to topo
info and then compose various leaves based on it.
Within CPU code I'd just use
 if (i_am_epyc)
    x86_topo_ids_from_apicid_epyc()
 else
    x86_topo_ids_from_apicid()
it's easier to read and one doesn't have to go figure
indirection chain to figure out what code is called.
   
> >> 4. EPYC model will have its own apid id handlers. Everything else will be
> >> initialized with a default handlers(current default handler).
> >> 5. The function pc_possible_cpu_arch_ids will load the model definition
> >> and initialize the PCMachineState data structure with the model specific
> >> handlers.  
> > I'm not sure what do you mean here.  
> 
> PCMachineState will have the function pointers to the above handlers.
> I was going to load the correct handler based on the mode type.

Could be done like this, but considering that within machine we need
to calculate apic_id only once, the same 'if' trick would be simpler

x86_cpus_init() {

  if (cpu == epic) {
     make_epyc_apic_ids(mc->possible_cpu_arch_ids(ms))
  }

  // go on with creating cpus ...
}

> >> Does that sound similar to what you are thinking. Thoughts?  
> > If you have something to share and can push it on github,
> > I can look at, whether it has design issues to spare you a round trip on a list.
> > (it won't be proper review but at least I can help to pinpoint most problematic parts)
> >   
> My code for the current approach is kind of ready(yet to be tested). I can
> send it as v3.1 if you want to look. Or we can wait for our discussion to
> settle. I will post it after our discussion.
ok, lets wait till we finish this discussion

> There is one more problem we need to address. I was going to address later
> in v4 or v5.
> 
> This works
> -numa node,nodeid=0,cpus=0-3 -numa node,nodeid=1,cpus=4-7
> 
> This does not work
> -numa node,nodeid=0,cpus=0-5 -numa node,nodeid=1,cpus=6-7
Is it supposed to work (i.e. can real hardware do such topology)?

> This requires the generic code to pass the node information to the x86
> code which requires some handler changes. I was thinking my code will
> simplify the changes to address this issue.

without more information, it's hard to comment on issue and whether
extra complexity of callbacks is justificated. 

There could be 2 ways here, add fixes to this series so we could see the reason
or make this series simple to solve apic_id problem only and then on top of
it send the second series that solves another issue.

Considering that this series is already big/complicated enough,
personally I'd go for 2nd option. As it's easier to describe what patches are
doing and easier to review => should result in faster reaching consensus and merging.
[...]

Babu Moger Feb. 5, 2020, 4:10 p.m. UTC | #6

On 2/5/20 3:38 AM, Igor Mammedov wrote:
> On Tue, 4 Feb 2020 13:08:58 -0600
> Babu Moger <babu.moger@amd.com> wrote:
> 
>> On 2/4/20 2:02 AM, Igor Mammedov wrote:
>>> On Mon, 3 Feb 2020 13:31:29 -0600
>>> Babu Moger <babu.moger@amd.com> wrote:
>>>   
>>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:  
>>>>> On Tue, 03 Dec 2019 18:36:54 -0600
>>>>> Babu Moger <babu.moger@amd.com> wrote:
>>>>>     
>>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1728166&amp;data=02%7C01%7Cbabu.moger%40amd.com%7Cdbfd059a060a4851aad908d7aa1f3532%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637164923333568238&amp;sdata=P0I547X5r0s9emWu3ptIcm1U%2FhCMZmnMQOQ0IgLPzzQ%3D&amp;reserved=0
>>>>>>
>>>>>> Currently, the APIC ID is decoded based on the sequence
>>>>>> sockets->dies->cores->threads. This works for most standard AMD and other
>>>>>> vendors' configurations, but this decoding sequence does not follow that of
>>>>>> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
>>>>>> inconsistency.  When booting a guest VM, the kernel tries to validate the
>>>>>> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
>>>>>>
>>>>>> To fix the problem we need to build the topology as per the Processor
>>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
>>>>>> Processors. It is available at https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F55570-B1_PUB.zip&amp;data=02%7C01%7Cbabu.moger%40amd.com%7Cdbfd059a060a4851aad908d7aa1f3532%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637164923333568238&amp;sdata=AO6m%2FEI17iLoAa3gNnRSJKJAdvBRKh0Dmbr7bCVA0us%3D&amp;reserved=0
>>>>>>
>>>>>> Here is the text from the PPR.
>>>>>> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
>>>>>> number of least significant bits in the Initial APIC ID that indicate core ID
>>>>>> within a processor, in constructing per-core CPUID masks.
>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
>>>>>> (MNC) that the processor could theoretically support, not the actual number of
>>>>>> cores that are actually implemented or enabled on the processor, as indicated
>>>>>> by Core::X86::Cpuid::SizeId[NC].
>>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
>>>>>> • ApicId[6] = Socket ID.
>>>>>> • ApicId[5:4] = Node ID.
>>>>>> • ApicId[3] = Logical CCX L3 complex ID
>>>>>> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}    
>>>>>
>>>>>
>>>>> After checking out all patches and some pondering, used here approach
>>>>> looks to me too intrusive for the task at hand especially where it
>>>>> comes to generic code.
>>>>>
>>>>> (Ignore till ==== to see suggestion how to simplify without reading
>>>>> reasoning behind it first)
>>>>>
>>>>> Lets look for a way to simplify it a little bit.
>>>>>
>>>>> So problem we are trying to solve,
>>>>>  1: calculate APIC IDs based on cpu type (to e more specific: for EPYC based CPUs)
>>>>>  2: it depends on knowing total number of numa nodes.
>>>>>
>>>>> Externally workflow looks like following:
>>>>>   1. user provides -smp x,sockets,cores,...,maxcpus
>>>>>       that's used by possible_cpu_arch_ids() singleton to build list of
>>>>>       possible CPUs (which is available to user via command 'hotpluggable-cpus')
>>>>>
>>>>>       Hook could be called very early and possible_cpus data might be
>>>>>       not complete. It builds a list of possible CPUs which user could
>>>>>       modify later.
>>>>>
>>>>>   2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa node,node_id=x,cpus="
>>>>>       options to assign cpus to nodes, which is one way or another calling
>>>>>       machine_set_cpu_numa_node(). The later updates 'possible_cpus' list
>>>>>       with node information. It happens early when total number of nodes
>>>>>       is not available.
>>>>>
>>>>>   2.2 user does not provide explicit node mappings for CPUs.
>>>>>       QEMU steps in and assigns possible cpus to nodes in machine_numa_finish_cpu_init()
>>>>>       (using the same machine_set_cpu_numa_node()) right before calling boards
>>>>>       specific machine init(). At that time total number of nodes is known.
>>>>>
>>>>> In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to be defined before
>>>>> boards init() is run.  
>>
>> In case of 2.1, we need to have the arch_id already generated. This is
>> done inside possible_cpu_arch_ids. The arch_id is used by
>> machine_set_cpu_numa_node to assign the cpus to correct numa node.
> 
> I might have missed something but I don't see arch_id itself being used in
> machine_set_cpu_numa_node(). It only uses props part of possible_cpus

Before calling machine_set_cpu_numa_node, we call
cpu_index_to_instance_props -> x86_cpu_index_to_props->
possible_cpu_arch_ids->x86_possible_cpu_arch_ids.

This sequence sets up the arch_id(in x86_cpu_apic_id_from_index) for all
the available cpus. Based on the arch_id, it also sets up the props.
And these props values are used to assign the nodes in
machine_set_cpu_numa_node.

At this point we are still parsing the numa nodes and so we don't know the
total number of numa nodes. Without that information, the arch_id
generated here will not be correct for EPYC models.

This is the reason for changing the generic numa code(patch #12-Split the
numa initialization).

> 
>  
>> If we want to move the arch_id generation into board init(), then we need
>> to save the cpu indexes belonging to each node somewhere.
> 
> when cpus are assigned explicitly, decision what cpus go to what nodes is
> up to user and user configured mapping is stored in MachineState::possible_cpus
> which is accessed by via possible_cpu_arch_ids() callback.
> Hence I don see any reason to touch cpu indexes.

Please see my reasoning above.

> 
>>
>>>>>
>>>>> In 2.2 case it calls get_default_cpu_node_id() -> x86_get_default_cpu_node_id()
>>>>> which uses arch_id calculate numa node.
>>>>> But then question is: does it have to use APIC id or could it infer 'pkg_id',
>>>>> it's after, from ms->possible_cpus->cpus[i].props data?    
>>>>
>>>> Not sure if I got the question right. In this case because the numa
>>>> information is not provided all the cpus are assigned to only one node.
>>>> The apic id is used here to get the correct pkg_id.  
>>>
>>> apicid was composed from socket/core/thread[/die] tuple which cpus[i].props is.
>>>
>>> Question is if we can compose only pkg_id based on the same data without
>>> converting it to apicid and then "reverse engineering" it back
>>> original data?  
>>
>> Yes. It is possible.
>>
>>>
>>> Or more direct question: is socket-id the same as pkg_id?  
>>
>> Yes. Socket_id and pkg_id is same.
>>
>>>
>>>   
>>>>  
>>>>>   
>>>>> With that out of the way APIC ID will be used only during board's init(),
>>>>> so board could update possible_cpus with valid APIC IDs at the start of
>>>>> x86_cpus_init().
>>>>>
>>>>> ====
>>>>> in nutshell it would be much easier to do following:
>>>>>
>>>>>  1. make x86_get_default_cpu_node_id() APIC ID in-depended or
>>>>>     if impossible as alternative recompute APIC IDs there if cpu
>>>>>     type is EPYC based (since number of nodes is already known)
>>>>>  2. recompute APIC IDs in x86_cpus_init() if cpu type is EPYC based
>>>>>
>>>>> this way one doesn't need to touch generic numa code, introduce
>>>>> x86 specific init_apicid_fn() hook into generic code and keep
>>>>> x86/EPYC nuances contained within x86 code only.    
>>>>
>>>> I was kind of already working in the similar direction in v4.
>>>> 1. We already have split the numa initialization in patch #12(Split the
>>>> numa initialization). This way we know exactly how many numa nodes are
>>>> there before hand.  
>>>
>>> I suggest to drop that patch, It's the one that touches generic numa
>>> code and adding more legacy based extensions like cpu_indexes.
>>> Which I'd like to get rid of to begin with, so only -numa cpu is left.
>>>
>>> I think it's not necessary to touch numa code at all for apicid generation
>>> purpose, as I tried to explain above. We should be able to keep
>>> this x86 only business.  
>>
>> This is going to be difficult without touching the generic numa code.patch #12(Split the
>>>> numa initialization)
> 
> Looking at current code I don't see why one would touch numa code.
> Care to explain in more details why you'd have to touch it?

Please see the reasoning above.
> 
>>>> 2. Planning to remove init_apicid_fn
>>>> 3. Insert the handlers inside X86CPUDefinition.  
>>> what handlers do you mean?  
>>
>> Apicid generation logic can be separated into 3 types of handlers.
>> x86_apicid_from_cpu_idx: Generate apicid from cpu index.
>> x86_topo_ids_from_apicid: Generate topo ids from apic id.
>> x86_apicid_from_topo_ids: Generate apicid from topo ids.
>>
>> We should be able to generate one id from other(you can see topology.h).
>>
>> X86CPUDefinition will have the handlers specific to each model like the
>> way we have features now. The above 3 handlers will be used as default
>> handler.
> 
> it probably shouldn't be a part of X86CPUDefinition,
> as it's machines responsibility to generate and set APIC ID.
> 
> What you are doing with this topo functions in this version
> looks more that enough to me.

It is all the exact same topo functions. Only making these functions as
the handlers inside the X86CPUDefinition.

> 
>> The EPYC model will have its corresponding handlers.
>>
>> x86_apicid_from_cpu_idx_epyc
>> x86_topo_ids_from_apicid_epyc
>> x86_apicid_from_topo_ids_epyc.
> 
> CPU might use call backs, but does it have to?
> I see cpu_x86_cpuid() uses these functions to decode apic_id back to topo
> info and then compose various leaves based on it.
> Within CPU code I'd just use
>  if (i_am_epyc)
>     x86_topo_ids_from_apicid_epyc()
>  else
>     x86_topo_ids_from_apicid()
> it's easier to read and one doesn't have to go figure
> indirection chain to figure out what code is called.

Eduardo already commented on this idea. Anything specific to cpu models
should be part of the X86CPUDefinition. We should not compare the specific
model here. Comparing the specific model does not scale. We are achieving
this by loading the model definition(similar to what we do in
x86_cpu_load_model).

>    
>>>> 4. EPYC model will have its own apid id handlers. Everything else will be
>>>> initialized with a default handlers(current default handler).
>>>> 5. The function pc_possible_cpu_arch_ids will load the model definition
>>>> and initialize the PCMachineState data structure with the model specific
>>>> handlers.  
>>> I'm not sure what do you mean here.  
>>
>> PCMachineState will have the function pointers to the above handlers.
>> I was going to load the correct handler based on the mode type.
> 
> Could be done like this, but considering that within machine we need
> to calculate apic_id only once, the same 'if' trick would be simpler
> 
> x86_cpus_init() {
> 
>   if (cpu == epic) {
>      make_epyc_apic_ids(mc->possible_cpu_arch_ids(ms))
>   }

Once again, this does not scale. Please see my response above.

> 
>   // go on with creating cpus ...
> }
> 
>>>> Does that sound similar to what you are thinking. Thoughts?  
>>> If you have something to share and can push it on github,
>>> I can look at, whether it has design issues to spare you a round trip on a list.
>>> (it won't be proper review but at least I can help to pinpoint most problematic parts)
>>>   
>> My code for the current approach is kind of ready(yet to be tested). I can
>> send it as v3.1 if you want to look. Or we can wait for our discussion to
>> settle. I will post it after our discussion.
> ok, lets wait till we finish this discussion

I can post my draft patch to give you more idea about what i am talking
about now. Let me know.

> 
>> There is one more problem we need to address. I was going to address later
>> in v4 or v5.
>>
>> This works
>> -numa node,nodeid=0,cpus=0-3 -numa node,nodeid=1,cpus=4-7
>>
>> This does not work
>> -numa node,nodeid=0,cpus=0-5 -numa node,nodeid=1,cpus=6-7
> Is it supposed to work (i.e. can real hardware do such topology)?

Hardware does not support this configuration. That is why I did not think
it is serious enough to fix this problem right now.

> 
>> This requires the generic code to pass the node information to the x86
>> code which requires some handler changes. I was thinking my code will
>> simplify the changes to address this issue.
> 
> without more information, it's hard to comment on issue and whether
> extra complexity of callbacks is justificated. 
> 
> There could be 2 ways here, add fixes to this series so we could see the reason
> or make this series simple to solve apic_id problem only and then on top of
> it send the second series that solves another issue.
> 
> Considering that this series is already big/complicated enough,
> personally I'd go for 2nd option. As it's easier to describe what patches are
> doing and easier to review => should result in faster reaching consensus and merging.
> [...]
>

Igor Mammedov Feb. 5, 2020, 4:56 p.m. UTC | #7

On Wed, 5 Feb 2020 10:10:06 -0600
Babu Moger <babu.moger@amd.com> wrote:

> On 2/5/20 3:38 AM, Igor Mammedov wrote:
> > On Tue, 4 Feb 2020 13:08:58 -0600
> > Babu Moger <babu.moger@amd.com> wrote:
> >   
> >> On 2/4/20 2:02 AM, Igor Mammedov wrote:  
> >>> On Mon, 3 Feb 2020 13:31:29 -0600
> >>> Babu Moger <babu.moger@amd.com> wrote:
> >>>     
> >>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:    
> >>>>> On Tue, 03 Dec 2019 18:36:54 -0600
> >>>>> Babu Moger <babu.moger@amd.com> wrote:
> >>>>>       
> >>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
> >>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1728166&amp;data=02%7C01%7Cbabu.moger%40amd.com%7Cdbfd059a060a4851aad908d7aa1f3532%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637164923333568238&amp;sdata=P0I547X5r0s9emWu3ptIcm1U%2FhCMZmnMQOQ0IgLPzzQ%3D&amp;reserved=0
> >>>>>>
> >>>>>> Currently, the APIC ID is decoded based on the sequence
> >>>>>> sockets->dies->cores->threads. This works for most standard AMD and other
> >>>>>> vendors' configurations, but this decoding sequence does not follow that of
> >>>>>> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
> >>>>>> inconsistency.  When booting a guest VM, the kernel tries to validate the
> >>>>>> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
> >>>>>>
> >>>>>> To fix the problem we need to build the topology as per the Processor
> >>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
> >>>>>> Processors. It is available at https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F55570-B1_PUB.zip&amp;data=02%7C01%7Cbabu.moger%40amd.com%7Cdbfd059a060a4851aad908d7aa1f3532%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637164923333568238&amp;sdata=AO6m%2FEI17iLoAa3gNnRSJKJAdvBRKh0Dmbr7bCVA0us%3D&amp;reserved=0
> >>>>>>
> >>>>>> Here is the text from the PPR.
> >>>>>> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
> >>>>>> number of least significant bits in the Initial APIC ID that indicate core ID
> >>>>>> within a processor, in constructing per-core CPUID masks.
> >>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
> >>>>>> (MNC) that the processor could theoretically support, not the actual number of
> >>>>>> cores that are actually implemented or enabled on the processor, as indicated
> >>>>>> by Core::X86::Cpuid::SizeId[NC].
> >>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
> >>>>>> • ApicId[6] = Socket ID.
> >>>>>> • ApicId[5:4] = Node ID.
> >>>>>> • ApicId[3] = Logical CCX L3 complex ID
> >>>>>> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}      
> >>>>>
> >>>>>
> >>>>> After checking out all patches and some pondering, used here approach
> >>>>> looks to me too intrusive for the task at hand especially where it
> >>>>> comes to generic code.
> >>>>>
> >>>>> (Ignore till ==== to see suggestion how to simplify without reading
> >>>>> reasoning behind it first)
> >>>>>
> >>>>> Lets look for a way to simplify it a little bit.
> >>>>>
> >>>>> So problem we are trying to solve,
> >>>>>  1: calculate APIC IDs based on cpu type (to e more specific: for EPYC based CPUs)
> >>>>>  2: it depends on knowing total number of numa nodes.
> >>>>>
> >>>>> Externally workflow looks like following:
> >>>>>   1. user provides -smp x,sockets,cores,...,maxcpus
> >>>>>       that's used by possible_cpu_arch_ids() singleton to build list of
> >>>>>       possible CPUs (which is available to user via command 'hotpluggable-cpus')
> >>>>>
> >>>>>       Hook could be called very early and possible_cpus data might be
> >>>>>       not complete. It builds a list of possible CPUs which user could
> >>>>>       modify later.
> >>>>>
> >>>>>   2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa node,node_id=x,cpus="
> >>>>>       options to assign cpus to nodes, which is one way or another calling
> >>>>>       machine_set_cpu_numa_node(). The later updates 'possible_cpus' list
> >>>>>       with node information. It happens early when total number of nodes
> >>>>>       is not available.
> >>>>>
> >>>>>   2.2 user does not provide explicit node mappings for CPUs.
> >>>>>       QEMU steps in and assigns possible cpus to nodes in machine_numa_finish_cpu_init()
> >>>>>       (using the same machine_set_cpu_numa_node()) right before calling boards
> >>>>>       specific machine init(). At that time total number of nodes is known.
> >>>>>
> >>>>> In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to be defined before
> >>>>> boards init() is run.    
> >>
> >> In case of 2.1, we need to have the arch_id already generated. This is
> >> done inside possible_cpu_arch_ids. The arch_id is used by
> >> machine_set_cpu_numa_node to assign the cpus to correct numa node.  
> > 
> > I might have missed something but I don't see arch_id itself being used in
> > machine_set_cpu_numa_node(). It only uses props part of possible_cpus  
> 
> Before calling machine_set_cpu_numa_node, we call
> cpu_index_to_instance_props -> x86_cpu_index_to_props->
> possible_cpu_arch_ids->x86_possible_cpu_arch_ids.
> 
> This sequence sets up the arch_id(in x86_cpu_apic_id_from_index) for all
> the available cpus. Based on the arch_id, it also sets up the props.


x86_possible_cpu_arch_ids()
   arch_id = x86_cpu_apic_id_from_index(x86ms, i)
   x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores,  ms->smp.threads, &topo);
   // assign socket/die/core/thread from topo

so currently it uses indirect way to convert index in possible_cpus->cpus[]
to socket/die/core/thread ids.
But essentially it take '-smp' options and [0..max_cpus) number as original data
converts it into intermediate apic_id and then reverse engineer it back to
topo info.

Why not use x86_topo_ids_from_idx() directly to get rid of 'props' dependency on apic_id?



> And these props values are used to assign the nodes in
> machine_set_cpu_numa_node.
> 
> At this point we are still parsing the numa nodes and so we don't know the
> total number of numa nodes. Without that information, the arch_id
> generated here will not be correct for EPYC models.
> 
> This is the reason for changing the generic numa code(patch #12-Split the
> numa initialization).
> 
> > 
> >    
> >> If we want to move the arch_id generation into board init(), then we need
> >> to save the cpu indexes belonging to each node somewhere.  
> > 
> > when cpus are assigned explicitly, decision what cpus go to what nodes is
> > up to user and user configured mapping is stored in MachineState::possible_cpus
> > which is accessed by via possible_cpu_arch_ids() callback.
> > Hence I don see any reason to touch cpu indexes.  
> 
> Please see my reasoning above.
> 
> >   
> >>  
> >>>>>
> >>>>> In 2.2 case it calls get_default_cpu_node_id() -> x86_get_default_cpu_node_id()
> >>>>> which uses arch_id calculate numa node.
> >>>>> But then question is: does it have to use APIC id or could it infer 'pkg_id',
> >>>>> it's after, from ms->possible_cpus->cpus[i].props data?      
> >>>>
> >>>> Not sure if I got the question right. In this case because the numa
> >>>> information is not provided all the cpus are assigned to only one node.
> >>>> The apic id is used here to get the correct pkg_id.    
> >>>
> >>> apicid was composed from socket/core/thread[/die] tuple which cpus[i].props is.
> >>>
> >>> Question is if we can compose only pkg_id based on the same data without
> >>> converting it to apicid and then "reverse engineering" it back
> >>> original data?    
> >>
> >> Yes. It is possible.
> >>  
> >>>
> >>> Or more direct question: is socket-id the same as pkg_id?    
> >>
> >> Yes. Socket_id and pkg_id is same.
> >>  
> >>>
> >>>     
> >>>>    
> >>>>>   
> >>>>> With that out of the way APIC ID will be used only during board's init(),
> >>>>> so board could update possible_cpus with valid APIC IDs at the start of
> >>>>> x86_cpus_init().
> >>>>>
> >>>>> ====
> >>>>> in nutshell it would be much easier to do following:
> >>>>>
> >>>>>  1. make x86_get_default_cpu_node_id() APIC ID in-depended or
> >>>>>     if impossible as alternative recompute APIC IDs there if cpu
> >>>>>     type is EPYC based (since number of nodes is already known)
> >>>>>  2. recompute APIC IDs in x86_cpus_init() if cpu type is EPYC based
> >>>>>
> >>>>> this way one doesn't need to touch generic numa code, introduce
> >>>>> x86 specific init_apicid_fn() hook into generic code and keep
> >>>>> x86/EPYC nuances contained within x86 code only.      
> >>>>
> >>>> I was kind of already working in the similar direction in v4.
> >>>> 1. We already have split the numa initialization in patch #12(Split the
> >>>> numa initialization). This way we know exactly how many numa nodes are
> >>>> there before hand.    
> >>>
> >>> I suggest to drop that patch, It's the one that touches generic numa
> >>> code and adding more legacy based extensions like cpu_indexes.
> >>> Which I'd like to get rid of to begin with, so only -numa cpu is left.
> >>>
> >>> I think it's not necessary to touch numa code at all for apicid generation
> >>> purpose, as I tried to explain above. We should be able to keep
> >>> this x86 only business.    
> >>
> >> This is going to be difficult without touching the generic numa code.patch #12(Split the  
> >>>> numa initialization)  
> > 
> > Looking at current code I don't see why one would touch numa code.
> > Care to explain in more details why you'd have to touch it?  
> 
> Please see the reasoning above.
> >   
> >>>> 2. Planning to remove init_apicid_fn
> >>>> 3. Insert the handlers inside X86CPUDefinition.    
> >>> what handlers do you mean?    
> >>
> >> Apicid generation logic can be separated into 3 types of handlers.
> >> x86_apicid_from_cpu_idx: Generate apicid from cpu index.
> >> x86_topo_ids_from_apicid: Generate topo ids from apic id.
> >> x86_apicid_from_topo_ids: Generate apicid from topo ids.
> >>
> >> We should be able to generate one id from other(you can see topology.h).
> >>
> >> X86CPUDefinition will have the handlers specific to each model like the
> >> way we have features now. The above 3 handlers will be used as default
> >> handler.  
> > 
> > it probably shouldn't be a part of X86CPUDefinition,
> > as it's machines responsibility to generate and set APIC ID.
> > 
> > What you are doing with this topo functions in this version
> > looks more that enough to me.  
> 
> It is all the exact same topo functions. Only making these functions as
> the handlers inside the X86CPUDefinition.
> 
> >   
> >> The EPYC model will have its corresponding handlers.
> >>
> >> x86_apicid_from_cpu_idx_epyc
> >> x86_topo_ids_from_apicid_epyc
> >> x86_apicid_from_topo_ids_epyc.  
> > 
> > CPU might use call backs, but does it have to?
> > I see cpu_x86_cpuid() uses these functions to decode apic_id back to topo
> > info and then compose various leaves based on it.
> > Within CPU code I'd just use
> >  if (i_am_epyc)
> >     x86_topo_ids_from_apicid_epyc()
> >  else
> >     x86_topo_ids_from_apicid()
> > it's easier to read and one doesn't have to go figure
> > indirection chain to figure out what code is called.  
> 
> Eduardo already commented on this idea. Anything specific to cpu models
> should be part of the X86CPUDefinition. We should not compare the specific
> model here. Comparing the specific model does not scale. We are achieving
> this by loading the model definition(similar to what we do in
> x86_cpu_load_model).

ok

> 
> >      
> >>>> 4. EPYC model will have its own apid id handlers. Everything else will be
> >>>> initialized with a default handlers(current default handler).
> >>>> 5. The function pc_possible_cpu_arch_ids will load the model definition
> >>>> and initialize the PCMachineState data structure with the model specific
> >>>> handlers.    
> >>> I'm not sure what do you mean here.    
> >>
> >> PCMachineState will have the function pointers to the above handlers.
> >> I was going to load the correct handler based on the mode type.  
> > 
> > Could be done like this, but considering that within machine we need
> > to calculate apic_id only once, the same 'if' trick would be simpler
> > 
> > x86_cpus_init() {
> > 
> >   if (cpu == epic) {
> >      make_epyc_apic_ids(mc->possible_cpu_arch_ids(ms))
> >   }  
> 
> Once again, this does not scale. Please see my response above.
> 
> > 
> >   // go on with creating cpus ...
> > }
> >   
> >>>> Does that sound similar to what you are thinking. Thoughts?    
> >>> If you have something to share and can push it on github,
> >>> I can look at, whether it has design issues to spare you a round trip on a list.
> >>> (it won't be proper review but at least I can help to pinpoint most problematic parts)
> >>>     
> >> My code for the current approach is kind of ready(yet to be tested). I can
> >> send it as v3.1 if you want to look. Or we can wait for our discussion to
> >> settle. I will post it after our discussion.  
> > ok, lets wait till we finish this discussion  
> 
> I can post my draft patch to give you more idea about what i am talking
> about now. Let me know.
> 
> >   
> >> There is one more problem we need to address. I was going to address later
> >> in v4 or v5.
> >>
> >> This works
> >> -numa node,nodeid=0,cpus=0-3 -numa node,nodeid=1,cpus=4-7
> >>
> >> This does not work
> >> -numa node,nodeid=0,cpus=0-5 -numa node,nodeid=1,cpus=6-7  
> > Is it supposed to work (i.e. can real hardware do such topology)?  
> 
> Hardware does not support this configuration. That is why I did not think
> it is serious enough to fix this problem right now.
> 
> >   
> >> This requires the generic code to pass the node information to the x86
> >> code which requires some handler changes. I was thinking my code will
> >> simplify the changes to address this issue.  
> > 
> > without more information, it's hard to comment on issue and whether
> > extra complexity of callbacks is justificated. 
> > 
> > There could be 2 ways here, add fixes to this series so we could see the reason
> > or make this series simple to solve apic_id problem only and then on top of
> > it send the second series that solves another issue.
> > 
> > Considering that this series is already big/complicated enough,
> > personally I'd go for 2nd option. As it's easier to describe what patches are
> > doing and easier to review => should result in faster reaching consensus and merging.
> > [...]
> >   
>

Babu Moger Feb. 5, 2020, 7:07 p.m. UTC | #8

On 2/5/20 10:56 AM, Igor Mammedov wrote:
> On Wed, 5 Feb 2020 10:10:06 -0600
> Babu Moger <babu.moger@amd.com> wrote:
> 
>> On 2/5/20 3:38 AM, Igor Mammedov wrote:
>>> On Tue, 4 Feb 2020 13:08:58 -0600
>>> Babu Moger <babu.moger@amd.com> wrote:
>>>   
>>>> On 2/4/20 2:02 AM, Igor Mammedov wrote:  
>>>>> On Mon, 3 Feb 2020 13:31:29 -0600
>>>>> Babu Moger <babu.moger@amd.com> wrote:
>>>>>     
>>>>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:    
>>>>>>> On Tue, 03 Dec 2019 18:36:54 -0600
>>>>>>> Babu Moger <babu.moger@amd.com> wrote:
>>>>>>>       
>>>>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1728166&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C6b6d6af79fee45cc904808d7aa5c5f37%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637165186049856500&amp;sdata=vDAkIxR3U6LX%2FmnYjZPRC55smMqLend%2FHQjbfYWydBk%3D&amp;reserved=0
>>>>>>>>
>>>>>>>> Currently, the APIC ID is decoded based on the sequence
>>>>>>>> sockets->dies->cores->threads. This works for most standard AMD and other
>>>>>>>> vendors' configurations, but this decoding sequence does not follow that of
>>>>>>>> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
>>>>>>>> inconsistency.  When booting a guest VM, the kernel tries to validate the
>>>>>>>> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
>>>>>>>>
>>>>>>>> To fix the problem we need to build the topology as per the Processor
>>>>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
>>>>>>>> Processors. It is available at https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F55570-B1_PUB.zip&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C6b6d6af79fee45cc904808d7aa5c5f37%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637165186049856500&amp;sdata=rVMRN%2BbUeGWEksKO5uQ3Wxc71eeHCXMrkLVRbo4JHHI%3D&amp;reserved=0
>>>>>>>>
>>>>>>>> Here is the text from the PPR.
>>>>>>>> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
>>>>>>>> number of least significant bits in the Initial APIC ID that indicate core ID
>>>>>>>> within a processor, in constructing per-core CPUID masks.
>>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
>>>>>>>> (MNC) that the processor could theoretically support, not the actual number of
>>>>>>>> cores that are actually implemented or enabled on the processor, as indicated
>>>>>>>> by Core::X86::Cpuid::SizeId[NC].
>>>>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
>>>>>>>> • ApicId[6] = Socket ID.
>>>>>>>> • ApicId[5:4] = Node ID.
>>>>>>>> • ApicId[3] = Logical CCX L3 complex ID
>>>>>>>> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}      
>>>>>>>
>>>>>>>
>>>>>>> After checking out all patches and some pondering, used here approach
>>>>>>> looks to me too intrusive for the task at hand especially where it
>>>>>>> comes to generic code.
>>>>>>>
>>>>>>> (Ignore till ==== to see suggestion how to simplify without reading
>>>>>>> reasoning behind it first)
>>>>>>>
>>>>>>> Lets look for a way to simplify it a little bit.
>>>>>>>
>>>>>>> So problem we are trying to solve,
>>>>>>>  1: calculate APIC IDs based on cpu type (to e more specific: for EPYC based CPUs)
>>>>>>>  2: it depends on knowing total number of numa nodes.
>>>>>>>
>>>>>>> Externally workflow looks like following:
>>>>>>>   1. user provides -smp x,sockets,cores,...,maxcpus
>>>>>>>       that's used by possible_cpu_arch_ids() singleton to build list of
>>>>>>>       possible CPUs (which is available to user via command 'hotpluggable-cpus')
>>>>>>>
>>>>>>>       Hook could be called very early and possible_cpus data might be
>>>>>>>       not complete. It builds a list of possible CPUs which user could
>>>>>>>       modify later.
>>>>>>>
>>>>>>>   2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa node,node_id=x,cpus="
>>>>>>>       options to assign cpus to nodes, which is one way or another calling
>>>>>>>       machine_set_cpu_numa_node(). The later updates 'possible_cpus' list
>>>>>>>       with node information. It happens early when total number of nodes
>>>>>>>       is not available.
>>>>>>>
>>>>>>>   2.2 user does not provide explicit node mappings for CPUs.
>>>>>>>       QEMU steps in and assigns possible cpus to nodes in machine_numa_finish_cpu_init()
>>>>>>>       (using the same machine_set_cpu_numa_node()) right before calling boards
>>>>>>>       specific machine init(). At that time total number of nodes is known.
>>>>>>>
>>>>>>> In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to be defined before
>>>>>>> boards init() is run.    
>>>>
>>>> In case of 2.1, we need to have the arch_id already generated. This is
>>>> done inside possible_cpu_arch_ids. The arch_id is used by
>>>> machine_set_cpu_numa_node to assign the cpus to correct numa node.  
>>>
>>> I might have missed something but I don't see arch_id itself being used in
>>> machine_set_cpu_numa_node(). It only uses props part of possible_cpus  
>>
>> Before calling machine_set_cpu_numa_node, we call
>> cpu_index_to_instance_props -> x86_cpu_index_to_props->
>> possible_cpu_arch_ids->x86_possible_cpu_arch_ids.
>>
>> This sequence sets up the arch_id(in x86_cpu_apic_id_from_index) for all
>> the available cpus. Based on the arch_id, it also sets up the props.
> 
> 
> x86_possible_cpu_arch_ids()
>    arch_id = x86_cpu_apic_id_from_index(x86ms, i)
>    x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores,  ms->smp.threads, &topo);
>    // assign socket/die/core/thread from topo
> 
> so currently it uses indirect way to convert index in possible_cpus->cpus[]
> to socket/die/core/thread ids.
> But essentially it take '-smp' options and [0..max_cpus) number as original data
> converts it into intermediate apic_id and then reverse engineer it back to
> topo info.
> 
> Why not use x86_topo_ids_from_idx() directly to get rid of 'props' dependency on apic_id?

It might work. But this feels like a work-around and delaying the problem
for later. Just re-arranging the numa code little bit we can address this.

> 
> 
> 
>> And these props values are used to assign the nodes in
>> machine_set_cpu_numa_node.
>>
>> At this point we are still parsing the numa nodes and so we don't know the
>> total number of numa nodes. Without that information, the arch_id
>> generated here will not be correct for EPYC models.
>>
>> This is the reason for changing the generic numa code(patch #12-Split the
>> numa initialization).
>>
>>>
>>>    
>>>> If we want to move the arch_id generation into board init(), then we need
>>>> to save the cpu indexes belonging to each node somewhere.  
>>>
>>> when cpus are assigned explicitly, decision what cpus go to what nodes is
>>> up to user and user configured mapping is stored in MachineState::possible_cpus
>>> which is accessed by via possible_cpu_arch_ids() callback.
>>> Hence I don see any reason to touch cpu indexes.  
>>
>> Please see my reasoning above.
>>
>>>   
>>>>  
>>>>>>>
>>>>>>> In 2.2 case it calls get_default_cpu_node_id() -> x86_get_default_cpu_node_id()
>>>>>>> which uses arch_id calculate numa node.
>>>>>>> But then question is: does it have to use APIC id or could it infer 'pkg_id',
>>>>>>> it's after, from ms->possible_cpus->cpus[i].props data?      
>>>>>>
>>>>>> Not sure if I got the question right. In this case because the numa
>>>>>> information is not provided all the cpus are assigned to only one node.
>>>>>> The apic id is used here to get the correct pkg_id.    
>>>>>
>>>>> apicid was composed from socket/core/thread[/die] tuple which cpus[i].props is.
>>>>>
>>>>> Question is if we can compose only pkg_id based on the same data without
>>>>> converting it to apicid and then "reverse engineering" it back
>>>>> original data?    
>>>>
>>>> Yes. It is possible.
>>>>  
>>>>>
>>>>> Or more direct question: is socket-id the same as pkg_id?    
>>>>
>>>> Yes. Socket_id and pkg_id is same.
>>>>  
>>>>>
>>>>>     
>>>>>>    
>>>>>>>   
>>>>>>> With that out of the way APIC ID will be used only during board's init(),
>>>>>>> so board could update possible_cpus with valid APIC IDs at the start of
>>>>>>> x86_cpus_init().
>>>>>>>
>>>>>>> ====
>>>>>>> in nutshell it would be much easier to do following:
>>>>>>>
>>>>>>>  1. make x86_get_default_cpu_node_id() APIC ID in-depended or
>>>>>>>     if impossible as alternative recompute APIC IDs there if cpu
>>>>>>>     type is EPYC based (since number of nodes is already known)
>>>>>>>  2. recompute APIC IDs in x86_cpus_init() if cpu type is EPYC based
>>>>>>>
>>>>>>> this way one doesn't need to touch generic numa code, introduce
>>>>>>> x86 specific init_apicid_fn() hook into generic code and keep
>>>>>>> x86/EPYC nuances contained within x86 code only.      
>>>>>>
>>>>>> I was kind of already working in the similar direction in v4.
>>>>>> 1. We already have split the numa initialization in patch #12(Split the
>>>>>> numa initialization). This way we know exactly how many numa nodes are
>>>>>> there before hand.    
>>>>>
>>>>> I suggest to drop that patch, It's the one that touches generic numa
>>>>> code and adding more legacy based extensions like cpu_indexes.
>>>>> Which I'd like to get rid of to begin with, so only -numa cpu is left.
>>>>>
>>>>> I think it's not necessary to touch numa code at all for apicid generation
>>>>> purpose, as I tried to explain above. We should be able to keep
>>>>> this x86 only business.    
>>>>
>>>> This is going to be difficult without touching the generic numa code.patch #12(Split the  
>>>>>> numa initialization)  
>>>
>>> Looking at current code I don't see why one would touch numa code.
>>> Care to explain in more details why you'd have to touch it?  
>>
>> Please see the reasoning above.
>>>   
>>>>>> 2. Planning to remove init_apicid_fn
>>>>>> 3. Insert the handlers inside X86CPUDefinition.    
>>>>> what handlers do you mean?    
>>>>
>>>> Apicid generation logic can be separated into 3 types of handlers.
>>>> x86_apicid_from_cpu_idx: Generate apicid from cpu index.
>>>> x86_topo_ids_from_apicid: Generate topo ids from apic id.
>>>> x86_apicid_from_topo_ids: Generate apicid from topo ids.
>>>>
>>>> We should be able to generate one id from other(you can see topology.h).
>>>>
>>>> X86CPUDefinition will have the handlers specific to each model like the
>>>> way we have features now. The above 3 handlers will be used as default
>>>> handler.  
>>>
>>> it probably shouldn't be a part of X86CPUDefinition,
>>> as it's machines responsibility to generate and set APIC ID.
>>>
>>> What you are doing with this topo functions in this version
>>> looks more that enough to me.  
>>
>> It is all the exact same topo functions. Only making these functions as
>> the handlers inside the X86CPUDefinition.
>>
>>>   
>>>> The EPYC model will have its corresponding handlers.
>>>>
>>>> x86_apicid_from_cpu_idx_epyc
>>>> x86_topo_ids_from_apicid_epyc
>>>> x86_apicid_from_topo_ids_epyc.  
>>>
>>> CPU might use call backs, but does it have to?
>>> I see cpu_x86_cpuid() uses these functions to decode apic_id back to topo
>>> info and then compose various leaves based on it.
>>> Within CPU code I'd just use
>>>  if (i_am_epyc)
>>>     x86_topo_ids_from_apicid_epyc()
>>>  else
>>>     x86_topo_ids_from_apicid()
>>> it's easier to read and one doesn't have to go figure
>>> indirection chain to figure out what code is called.  
>>
>> Eduardo already commented on this idea. Anything specific to cpu models
>> should be part of the X86CPUDefinition. We should not compare the specific
>> model here. Comparing the specific model does not scale. We are achieving
>> this by loading the model definition(similar to what we do in
>> x86_cpu_load_model).
> 
> ok
> 
>>
>>>      
>>>>>> 4. EPYC model will have its own apid id handlers. Everything else will be
>>>>>> initialized with a default handlers(current default handler).
>>>>>> 5. The function pc_possible_cpu_arch_ids will load the model definition
>>>>>> and initialize the PCMachineState data structure with the model specific
>>>>>> handlers.    
>>>>> I'm not sure what do you mean here.    
>>>>
>>>> PCMachineState will have the function pointers to the above handlers.
>>>> I was going to load the correct handler based on the mode type.  
>>>
>>> Could be done like this, but considering that within machine we need
>>> to calculate apic_id only once, the same 'if' trick would be simpler
>>>
>>> x86_cpus_init() {
>>>
>>>   if (cpu == epic) {
>>>      make_epyc_apic_ids(mc->possible_cpu_arch_ids(ms))
>>>   }  
>>
>> Once again, this does not scale. Please see my response above.
>>
>>>
>>>   // go on with creating cpus ...
>>> }
>>>   
>>>>>> Does that sound similar to what you are thinking. Thoughts?    
>>>>> If you have something to share and can push it on github,
>>>>> I can look at, whether it has design issues to spare you a round trip on a list.
>>>>> (it won't be proper review but at least I can help to pinpoint most problematic parts)
>>>>>     
>>>> My code for the current approach is kind of ready(yet to be tested). I can
>>>> send it as v3.1 if you want to look. Or we can wait for our discussion to
>>>> settle. I will post it after our discussion.  
>>> ok, lets wait till we finish this discussion  
>>
>> I can post my draft patch to give you more idea about what i am talking
>> about now. Let me know.
>>
>>>   
>>>> There is one more problem we need to address. I was going to address later
>>>> in v4 or v5.
>>>>
>>>> This works
>>>> -numa node,nodeid=0,cpus=0-3 -numa node,nodeid=1,cpus=4-7
>>>>
>>>> This does not work
>>>> -numa node,nodeid=0,cpus=0-5 -numa node,nodeid=1,cpus=6-7  
>>> Is it supposed to work (i.e. can real hardware do such topology)?  
>>
>> Hardware does not support this configuration. That is why I did not think
>> it is serious enough to fix this problem right now.
>>
>>>   
>>>> This requires the generic code to pass the node information to the x86
>>>> code which requires some handler changes. I was thinking my code will
>>>> simplify the changes to address this issue.  
>>>
>>> without more information, it's hard to comment on issue and whether
>>> extra complexity of callbacks is justificated. 
>>>
>>> There could be 2 ways here, add fixes to this series so we could see the reason
>>> or make this series simple to solve apic_id problem only and then on top of
>>> it send the second series that solves another issue.
>>>
>>> Considering that this series is already big/complicated enough,
>>> personally I'd go for 2nd option. As it's easier to describe what patches are
>>> doing and easier to review => should result in faster reaching consensus and merging.
>>> [...]
>>>   
>>
>

Igor Mammedov Feb. 6, 2020, 1:08 p.m. UTC | #9

On Wed, 5 Feb 2020 13:07:31 -0600
Babu Moger <babu.moger@amd.com> wrote:

> On 2/5/20 10:56 AM, Igor Mammedov wrote:
> > On Wed, 5 Feb 2020 10:10:06 -0600
> > Babu Moger <babu.moger@amd.com> wrote:
> >   
> >> On 2/5/20 3:38 AM, Igor Mammedov wrote:  
> >>> On Tue, 4 Feb 2020 13:08:58 -0600
> >>> Babu Moger <babu.moger@amd.com> wrote:
> >>>     
> >>>> On 2/4/20 2:02 AM, Igor Mammedov wrote:    
> >>>>> On Mon, 3 Feb 2020 13:31:29 -0600
> >>>>> Babu Moger <babu.moger@amd.com> wrote:
> >>>>>       
> >>>>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:      
> >>>>>>> On Tue, 03 Dec 2019 18:36:54 -0600
> >>>>>>> Babu Moger <babu.moger@amd.com> wrote:
> >>>>>>>         
> >>>>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
> >>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1728166&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C6b6d6af79fee45cc904808d7aa5c5f37%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637165186049856500&amp;sdata=vDAkIxR3U6LX%2FmnYjZPRC55smMqLend%2FHQjbfYWydBk%3D&amp;reserved=0
> >>>>>>>>
> >>>>>>>> Currently, the APIC ID is decoded based on the sequence
> >>>>>>>> sockets->dies->cores->threads. This works for most standard AMD and other
> >>>>>>>> vendors' configurations, but this decoding sequence does not follow that of
> >>>>>>>> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
> >>>>>>>> inconsistency.  When booting a guest VM, the kernel tries to validate the
> >>>>>>>> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
> >>>>>>>>
> >>>>>>>> To fix the problem we need to build the topology as per the Processor
> >>>>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
> >>>>>>>> Processors. It is available at https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F55570-B1_PUB.zip&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C6b6d6af79fee45cc904808d7aa5c5f37%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637165186049856500&amp;sdata=rVMRN%2BbUeGWEksKO5uQ3Wxc71eeHCXMrkLVRbo4JHHI%3D&amp;reserved=0
> >>>>>>>>
> >>>>>>>> Here is the text from the PPR.
> >>>>>>>> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
> >>>>>>>> number of least significant bits in the Initial APIC ID that indicate core ID
> >>>>>>>> within a processor, in constructing per-core CPUID masks.
> >>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
> >>>>>>>> (MNC) that the processor could theoretically support, not the actual number of
> >>>>>>>> cores that are actually implemented or enabled on the processor, as indicated
> >>>>>>>> by Core::X86::Cpuid::SizeId[NC].
> >>>>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
> >>>>>>>> • ApicId[6] = Socket ID.
> >>>>>>>> • ApicId[5:4] = Node ID.
> >>>>>>>> • ApicId[3] = Logical CCX L3 complex ID
> >>>>>>>> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}        
> >>>>>>>
> >>>>>>>
> >>>>>>> After checking out all patches and some pondering, used here approach
> >>>>>>> looks to me too intrusive for the task at hand especially where it
> >>>>>>> comes to generic code.
> >>>>>>>
> >>>>>>> (Ignore till ==== to see suggestion how to simplify without reading
> >>>>>>> reasoning behind it first)
> >>>>>>>
> >>>>>>> Lets look for a way to simplify it a little bit.
> >>>>>>>
> >>>>>>> So problem we are trying to solve,
> >>>>>>>  1: calculate APIC IDs based on cpu type (to e more specific: for EPYC based CPUs)
> >>>>>>>  2: it depends on knowing total number of numa nodes.
> >>>>>>>
> >>>>>>> Externally workflow looks like following:
> >>>>>>>   1. user provides -smp x,sockets,cores,...,maxcpus
> >>>>>>>       that's used by possible_cpu_arch_ids() singleton to build list of
> >>>>>>>       possible CPUs (which is available to user via command 'hotpluggable-cpus')
> >>>>>>>
> >>>>>>>       Hook could be called very early and possible_cpus data might be
> >>>>>>>       not complete. It builds a list of possible CPUs which user could
> >>>>>>>       modify later.
> >>>>>>>
> >>>>>>>   2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa node,node_id=x,cpus="
> >>>>>>>       options to assign cpus to nodes, which is one way or another calling
> >>>>>>>       machine_set_cpu_numa_node(). The later updates 'possible_cpus' list
> >>>>>>>       with node information. It happens early when total number of nodes
> >>>>>>>       is not available.
> >>>>>>>
> >>>>>>>   2.2 user does not provide explicit node mappings for CPUs.
> >>>>>>>       QEMU steps in and assigns possible cpus to nodes in machine_numa_finish_cpu_init()
> >>>>>>>       (using the same machine_set_cpu_numa_node()) right before calling boards
> >>>>>>>       specific machine init(). At that time total number of nodes is known.
> >>>>>>>
> >>>>>>> In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to be defined before
> >>>>>>> boards init() is run.      
> >>>>
> >>>> In case of 2.1, we need to have the arch_id already generated. This is
> >>>> done inside possible_cpu_arch_ids. The arch_id is used by
> >>>> machine_set_cpu_numa_node to assign the cpus to correct numa node.    
> >>>
> >>> I might have missed something but I don't see arch_id itself being used in
> >>> machine_set_cpu_numa_node(). It only uses props part of possible_cpus    
> >>
> >> Before calling machine_set_cpu_numa_node, we call
> >> cpu_index_to_instance_props -> x86_cpu_index_to_props->
> >> possible_cpu_arch_ids->x86_possible_cpu_arch_ids.
> >>
> >> This sequence sets up the arch_id(in x86_cpu_apic_id_from_index) for all
> >> the available cpus. Based on the arch_id, it also sets up the props.  
> > 
> > 
> > x86_possible_cpu_arch_ids()
> >    arch_id = x86_cpu_apic_id_from_index(x86ms, i)
> >    x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores,  ms->smp.threads, &topo);
> >    // assign socket/die/core/thread from topo
> > 
> > so currently it uses indirect way to convert index in possible_cpus->cpus[]
> > to socket/die/core/thread ids.
> > But essentially it take '-smp' options and [0..max_cpus) number as original data
> > converts it into intermediate apic_id and then reverse engineer it back to
> > topo info.
> > 
> > Why not use x86_topo_ids_from_idx() directly to get rid of 'props' dependency on apic_id?  
> 
> It might work. But this feels like a work-around and delaying the problem
> for later. Just re-arranging the numa code little bit we can address this.

The idea behind possible_cpus is to allow users query topo information
board generates (based on -smp) at configuration time (or late) so users
could know what -numa cpu,topo_options [and -device foo-cpu,topo_options]
to use, initializing apic_id on the first access is secondary and I did
it only because I could do it without additional data.

But main purpose of possible_cpus is to keep topology information.
That includes numa node mapping, which should be stored in possible_cpus
along with the rest of cpu topology.

Looking [12/18] numa patch, it makes -numa node,cpus legacy option
to reintroduce data duplication, by storing mapping elsewhere and
then putting that mapping into possible_cpus at numa complete time
(that's what I dislike and don't see a valid reason to do so).

That also won't work if user queries hotpluggable-cpus before that time
and it also doesn't work if user uses preferable -numa cpu,topo_options
as both would initialize possible_cpus on the first access.

So if you need do some board specific post-processing done on topo
information when it's complete and recalculate apic_id do it at board
init time like was suggested before (x86_cpu_new() looks like a good
place to do it).

[...]

Babu Moger Feb. 6, 2020, 3:32 p.m. UTC | #10

On 2/6/20 7:08 AM, Igor Mammedov wrote:
> On Wed, 5 Feb 2020 13:07:31 -0600
> Babu Moger <babu.moger@amd.com> wrote:
> 
>> On 2/5/20 10:56 AM, Igor Mammedov wrote:
>>> On Wed, 5 Feb 2020 10:10:06 -0600
>>> Babu Moger <babu.moger@amd.com> wrote:
>>>   
>>>> On 2/5/20 3:38 AM, Igor Mammedov wrote:  
>>>>> On Tue, 4 Feb 2020 13:08:58 -0600
>>>>> Babu Moger <babu.moger@amd.com> wrote:
>>>>>     
>>>>>> On 2/4/20 2:02 AM, Igor Mammedov wrote:    
>>>>>>> On Mon, 3 Feb 2020 13:31:29 -0600
>>>>>>> Babu Moger <babu.moger@amd.com> wrote:
>>>>>>>       
>>>>>>>> On 2/3/20 8:59 AM, Igor Mammedov wrote:      
>>>>>>>>> On Tue, 03 Dec 2019 18:36:54 -0600
>>>>>>>>> Babu Moger <babu.moger@amd.com> wrote:
>>>>>>>>>         
>>>>>>>>>> This series fixes APIC ID encoding problems on AMD EPYC CPUs.
>>>>>>>>>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbugzilla.redhat.com%2Fshow_bug.cgi%3Fid%3D1728166&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C76bf8434899b41de094f08d7ab05bdf3%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637165913481441118&amp;sdata=34fZQpUjScKbbc35c7ot433HA1Rz03YG6aP1ucyGUsQ%3D&amp;reserved=0
>>>>>>>>>>
>>>>>>>>>> Currently, the APIC ID is decoded based on the sequence
>>>>>>>>>> sockets->dies->cores->threads. This works for most standard AMD and other
>>>>>>>>>> vendors' configurations, but this decoding sequence does not follow that of
>>>>>>>>>> AMD's APIC ID enumeration strictly. In some cases this can cause CPU topology
>>>>>>>>>> inconsistency.  When booting a guest VM, the kernel tries to validate the
>>>>>>>>>> topology, and finds it inconsistent with the enumeration of EPYC cpu models.
>>>>>>>>>>
>>>>>>>>>> To fix the problem we need to build the topology as per the Processor
>>>>>>>>>> Programming Reference (PPR) for AMD Family 17h Model 01h, Revision B1
>>>>>>>>>> Processors. It is available at https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.amd.com%2Fsystem%2Ffiles%2FTechDocs%2F55570-B1_PUB.zip&amp;data=02%7C01%7Cbabu.moger%40amd.com%7C76bf8434899b41de094f08d7ab05bdf3%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637165913481451075&amp;sdata=4YXG%2BrCP5UUXcCQX4Ly8B%2FXdlvZoFrPCgonjy0IwG0U%3D&amp;reserved=0
>>>>>>>>>>
>>>>>>>>>> Here is the text from the PPR.
>>>>>>>>>> Operating systems are expected to use Core::X86::Cpuid::SizeId[ApicIdSize], the
>>>>>>>>>> number of least significant bits in the Initial APIC ID that indicate core ID
>>>>>>>>>> within a processor, in constructing per-core CPUID masks.
>>>>>>>>>> Core::X86::Cpuid::SizeId[ApicIdSize] determines the maximum number of cores
>>>>>>>>>> (MNC) that the processor could theoretically support, not the actual number of
>>>>>>>>>> cores that are actually implemented or enabled on the processor, as indicated
>>>>>>>>>> by Core::X86::Cpuid::SizeId[NC].
>>>>>>>>>> Each Core::X86::Apic::ApicId[ApicId] register is preset as follows:
>>>>>>>>>> • ApicId[6] = Socket ID.
>>>>>>>>>> • ApicId[5:4] = Node ID.
>>>>>>>>>> • ApicId[3] = Logical CCX L3 complex ID
>>>>>>>>>> • ApicId[2:0]= (SMT) ? {LogicalCoreID[1:0],ThreadId} : {1'b0,LogicalCoreID[1:0]}        
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> After checking out all patches and some pondering, used here approach
>>>>>>>>> looks to me too intrusive for the task at hand especially where it
>>>>>>>>> comes to generic code.
>>>>>>>>>
>>>>>>>>> (Ignore till ==== to see suggestion how to simplify without reading
>>>>>>>>> reasoning behind it first)
>>>>>>>>>
>>>>>>>>> Lets look for a way to simplify it a little bit.
>>>>>>>>>
>>>>>>>>> So problem we are trying to solve,
>>>>>>>>>  1: calculate APIC IDs based on cpu type (to e more specific: for EPYC based CPUs)
>>>>>>>>>  2: it depends on knowing total number of numa nodes.
>>>>>>>>>
>>>>>>>>> Externally workflow looks like following:
>>>>>>>>>   1. user provides -smp x,sockets,cores,...,maxcpus
>>>>>>>>>       that's used by possible_cpu_arch_ids() singleton to build list of
>>>>>>>>>       possible CPUs (which is available to user via command 'hotpluggable-cpus')
>>>>>>>>>
>>>>>>>>>       Hook could be called very early and possible_cpus data might be
>>>>>>>>>       not complete. It builds a list of possible CPUs which user could
>>>>>>>>>       modify later.
>>>>>>>>>
>>>>>>>>>   2.1 user uses "-numa cpu,node-id=x,..." or legacy "-numa node,node_id=x,cpus="
>>>>>>>>>       options to assign cpus to nodes, which is one way or another calling
>>>>>>>>>       machine_set_cpu_numa_node(). The later updates 'possible_cpus' list
>>>>>>>>>       with node information. It happens early when total number of nodes
>>>>>>>>>       is not available.
>>>>>>>>>
>>>>>>>>>   2.2 user does not provide explicit node mappings for CPUs.
>>>>>>>>>       QEMU steps in and assigns possible cpus to nodes in machine_numa_finish_cpu_init()
>>>>>>>>>       (using the same machine_set_cpu_numa_node()) right before calling boards
>>>>>>>>>       specific machine init(). At that time total number of nodes is known.
>>>>>>>>>
>>>>>>>>> In 1 -- 2.1 cases, 'arch_id' in 'possible_cpus' list doesn't have to be defined before
>>>>>>>>> boards init() is run.      
>>>>>>
>>>>>> In case of 2.1, we need to have the arch_id already generated. This is
>>>>>> done inside possible_cpu_arch_ids. The arch_id is used by
>>>>>> machine_set_cpu_numa_node to assign the cpus to correct numa node.    
>>>>>
>>>>> I might have missed something but I don't see arch_id itself being used in
>>>>> machine_set_cpu_numa_node(). It only uses props part of possible_cpus    
>>>>
>>>> Before calling machine_set_cpu_numa_node, we call
>>>> cpu_index_to_instance_props -> x86_cpu_index_to_props->
>>>> possible_cpu_arch_ids->x86_possible_cpu_arch_ids.
>>>>
>>>> This sequence sets up the arch_id(in x86_cpu_apic_id_from_index) for all
>>>> the available cpus. Based on the arch_id, it also sets up the props.  
>>>
>>>
>>> x86_possible_cpu_arch_ids()
>>>    arch_id = x86_cpu_apic_id_from_index(x86ms, i)
>>>    x86_topo_ids_from_apicid(arch_id, x86ms->smp_dies, ms->smp.cores,  ms->smp.threads, &topo);
>>>    // assign socket/die/core/thread from topo
>>>
>>> so currently it uses indirect way to convert index in possible_cpus->cpus[]
>>> to socket/die/core/thread ids.
>>> But essentially it take '-smp' options and [0..max_cpus) number as original data
>>> converts it into intermediate apic_id and then reverse engineer it back to
>>> topo info.
>>>
>>> Why not use x86_topo_ids_from_idx() directly to get rid of 'props' dependency on apic_id?  
>>
>> It might work. But this feels like a work-around and delaying the problem
>> for later. Just re-arranging the numa code little bit we can address this.
> 
> The idea behind possible_cpus is to allow users query topo information
> board generates (based on -smp) at configuration time (or late) so users
> could know what -numa cpu,topo_options [and -device foo-cpu,topo_options]
> to use, initializing apic_id on the first access is secondary and I did
> it only because I could do it without additional data.
> 
> But main purpose of possible_cpus is to keep topology information.
> That includes numa node mapping, which should be stored in possible_cpus
> along with the rest of cpu topology.
> 
> Looking [12/18] numa patch, it makes -numa node,cpus legacy option
> to reintroduce data duplication, by storing mapping elsewhere and
> then putting that mapping into possible_cpus at numa complete time
> (that's what I dislike and don't see a valid reason to do so).
> 
> That also won't work if user queries hotpluggable-cpus before that time
> and it also doesn't work if user uses preferable -numa cpu,topo_options
> as both would initialize possible_cpus on the first access.
> 
> So if you need do some board specific post-processing done on topo
> information when it's complete and recalculate apic_id do it at board
> init time like was suggested before (x86_cpu_new() looks like a good
> place to do it).

Ok. Sure. Will start working on it. Thanks

[v3,00/18] APIC ID fixes for AMD EPYC CPU models

Message

Comments