
[v8,00/12] s390x: CPU Topology

Message ID 20220620140352.39398-1-pmorel@linux.ibm.com (mailing list archive)


Pierre Morel June 20, 2022, 2:03 p.m. UTC
Hi,

This new spin mainly brings coherence with the latest Linux CPU
Topology patches, functional testing, and coding style fixes.

Foreword
=======

The goal of this series is to implement CPU topology for S390. It
improves on the preceding series with the implementation of books and
drawers, of a non-uniform CPU topology, and with documentation.

To use these patches, you will need the Linux series version 10.
You can find it here:
https://lkml.org/lkml/2022/6/20/590

Currently this code is for KVM only; I do not know whether a TCG
implementation would be of interest. If ever, it will be done in another series.

To get a better understanding of the S390x CPU topology and its
implementation in QEMU, you can have a look at the documentation in the
last patch or follow the introduction below.

A short introduction
====================

CPU topology is described in the S390 PoP essentially through two
instructions:

PTF Perform Topology Function, used to poll for topology changes
    and to set the polarization; the polarization part is outside the
    scope of this item.

STSI Store System Information, with the SYSIB 15.1.x providing the
    topology configuration.

The S390 topology is a 6-level hierarchical topology with up to 5
    levels of containers. The last topology level specifies the CPU cores.

    This patch series only uses the two lowest levels, sockets and cores.

    To get information on the topology, S390 provides the STSI
    instruction, which stores a structure, the SYSIB, providing the
    list of the containers used in the machine topology.
    A selector within the STSI instruction allows choosing how many
    topology levels will be provided in the SYSIB.

    Using the Topology List Entries (TLEs) provided inside the SYSIB,
    the Linux kernel is able to compute the cache distance between two
    cores and can use this information to make scheduling decisions.

The design
==========

1) To be ready for hotplug, I chose an object-oriented design for
the topology containers:
- A node is a bridge on the SYSBUS and defines a "node bus".
- A drawer is hotplugged onto the "node bus".
- A book onto the "drawer bus".
- A socket onto the "book bus".
- And the CPU Topology List Entry (CPU-TLE) sits on the socket bus.
These objects will be enhanced with cache information when
NUMA is implemented.

This also allows for easy retrieval when building the different SYSIBs
for Store System Information (STSI).

2) The Perform Topology Function (PTF) instruction is made available to the
guest with a new KVM capability and intercepted in QEMU, allowing the
guest to poll for topology changes.


Features
========

- There is no direct match between the IDs shown by:
    - lscpu (an unrelated numbered list),
    - SYSIB 15.1.x (topology IDs).

- The CPU number, in the left column of lscpu, is used by Linux tools
    to reference a CPU, while the CPU address is used by QEMU for
    hotplug.

- Effect of -smp parsing on the topology, with an example:
    -smp 9,sockets=4,cores=4,maxcpus=16

    We have 4 sockets, each holding 4 cores, so that we have a maximum
    of 16 CPUs, 9 of which are active at boot.

# lscpu -e
CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED POLARIZATION ADDRESS
  0    0      0    0      0    0 0:0:0:0            yes yes        horizontal   0
  1    0      0    0      0    1 1:1:1:1            yes yes        horizontal   1
  2    0      0    0      0    2 2:2:2:2            yes yes        horizontal   2
  3    0      0    0      0    3 3:3:3:3            yes yes        horizontal   3
  4    0      0    0      1    4 4:4:4:4            yes yes        horizontal   4
  5    0      0    0      1    5 5:5:5:5            yes yes        horizontal   5
  6    0      0    0      1    6 6:6:6:6            yes yes        horizontal   6
  7    0      0    0      1    7 7:7:7:7            yes yes        horizontal   7
  8    0      0    0      2    8 8:8:8:8            yes yes        horizontal   8
# 


- To plug a new CPU into the topology, one can simply use the CPU
    address, as in:
  
(qemu) device_add host-s390x-cpu,core-id=12
# lscpu -e
CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED POLARIZATION ADDRESS
  0    0      0    0      0    0 0:0:0:0            yes yes        horizontal   0
  1    0      0    0      0    1 1:1:1:1            yes yes        horizontal   1
  2    0      0    0      0    2 2:2:2:2            yes yes        horizontal   2
  3    0      0    0      0    3 3:3:3:3            yes yes        horizontal   3
  4    0      0    0      1    4 4:4:4:4            yes yes        horizontal   4
  5    0      0    0      1    5 5:5:5:5            yes yes        horizontal   5
  6    0      0    0      1    6 6:6:6:6            yes yes        horizontal   6
  7    0      0    0      1    7 7:7:7:7            yes yes        horizontal   7
  8    0      0    0      2    8 8:8:8:8            yes yes        horizontal   8
  9    -      -    -      -    - :::                 no yes        horizontal   12
# chcpu -e 9
CPU 9 enabled
# lscpu -e
CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED POLARIZATION ADDRESS
  0    0      0    0      0    0 0:0:0:0            yes yes        horizontal   0
  1    0      0    0      0    1 1:1:1:1            yes yes        horizontal   1
  2    0      0    0      0    2 2:2:2:2            yes yes        horizontal   2
  3    0      0    0      0    3 3:3:3:3            yes yes        horizontal   3
  4    0      0    0      1    4 4:4:4:4            yes yes        horizontal   4
  5    0      0    0      1    5 5:5:5:5            yes yes        horizontal   5
  6    0      0    0      1    6 6:6:6:6            yes yes        horizontal   6
  7    0      0    0      1    7 7:7:7:7            yes yes        horizontal   7
  8    0      0    0      2    8 8:8:8:8            yes yes        horizontal   8
  9    0      0    0      3    9 9:9:9:9            yes yes        horizontal   12
#

It is up to the admin layer, Libvirt for example, to pin the right CPU to the right
vCPU, but as we can see, without NUMA, choosing separate sockets for CPUs is not easy
without hotplug, because, lacking that information, the code assigns the vCPUs by filling
the sockets one after the other.
Note that this is also the default behavior on the LPAR.

Conclusion
==========

This series, together with the associated KVM patch, makes it possible to provide
CPU topology information to the guest.
Currently, only dedicated vCPUs and CPUs are supported, and a NUMA topology can only
be handled using CPU hotplug inside the guest.

Regards,
Pierre

Pierre Morel (12):
  Update Linux Headers
  s390x/cpu_topology: CPU topology objects and structures
  s390x/cpu_topology: implementating Store Topology System Information
  s390x/cpu_topology: Adding books to CPU topology
  s390x/cpu_topology: Adding books to STSI
  s390x/cpu_topology: Adding drawers to CPU topology
  s390x/cpu_topology: Adding drawers to STSI
  s390x/cpu_topology: implementing numa for the s390x topology
  target/s390x: interception of PTF instruction
  s390x/cpu_topology: resetting the Topology-Change-Report
  s390x/cpu_topology: CPU topology migration
  s390x/cpu_topology: activating CPU topology

 hw/core/machine-smp.c              |  48 +-
 hw/core/machine.c                  |  22 +
 hw/s390x/cpu-topology.c            | 754 +++++++++++++++++++++++++++++
 hw/s390x/meson.build               |   1 +
 hw/s390x/s390-virtio-ccw.c         |  77 ++-
 include/hw/boards.h                |   8 +
 include/hw/s390x/cpu-topology.h    |  99 ++++
 include/hw/s390x/s390-virtio-ccw.h |   6 +
 include/hw/s390x/sclp.h            |   1 +
 linux-headers/asm-s390/kvm.h       |   9 +
 linux-headers/linux/kvm.h          |   1 +
 qapi/machine.json                  |  14 +-
 qemu-options.hx                    |   6 +-
 softmmu/vl.c                       |   6 +
 target/s390x/cpu-sysemu.c          |   7 +
 target/s390x/cpu.h                 |  52 ++
 target/s390x/cpu_models.c          |   1 +
 target/s390x/cpu_topology.c        | 169 +++++++
 target/s390x/kvm/kvm.c             |  93 ++++
 target/s390x/kvm/kvm_s390x.h       |   2 +
 target/s390x/meson.build           |   1 +
 21 files changed, 1359 insertions(+), 18 deletions(-)
 create mode 100644 hw/s390x/cpu-topology.c
 create mode 100644 include/hw/s390x/cpu-topology.h
 create mode 100644 target/s390x/cpu_topology.c

Comments

Janis Schoetterl-Glausch July 14, 2022, 6:43 p.m. UTC | #1
On 6/20/22 16:03, Pierre Morel wrote:
> Hi,
> 
> This new spin is essentially for coherence with the last Linux CPU
> Topology patch, function testing and coding style modifications.
> 
> Forword
> =======
> 
> The goal of this series is to implement CPU topology for S390, it
> improves the preceeding series with the implementation of books and
> drawers, of non uniform CPU topology and with documentation.
> 
> To use these patches, you will need the Linux series version 10.
> You find it there:
> https://lkml.org/lkml/2022/6/20/590
> 
> Currently this code is for KVM only, I have no idea if it is interesting
> to provide a TCG patch. If ever it will be done in another series.
> 
> To have a better understanding of the S390x CPU Topology and its
> implementation in QEMU you can have a look at the documentation in the
> last patch or follow the introduction here under.
> 
> A short introduction
> ====================
> 
> CPU Topology is described in the S390 POP with essentially the description
> of two instructions:
> 
> PTF Perform Topology function used to poll for topology change
>     and used to set the polarization but this part is not part of this item.
> 
> STSI Store System Information and the SYSIB 15.1.x providing the Topology
>     configuration.
> 
> S390 Topology is a 6 levels hierarchical topology with up to 5 level
>     of containers. The last topology level, specifying the CPU cores.
> 
>     This patch series only uses the two lower levels sockets and cores.
>     
>     To get the information on the topology, S390 provides the STSI
>     instruction, which stores a structures providing the list of the
>     containers used in the Machine topology: the SYSIB.
>     A selector within the STSI instruction allow to chose how many topology
>     levels will be provide in the SYSIB.
> 
>     Using the Topology List Entries (TLE) provided inside the SYSIB we
>     the Linux kernel is able to compute the information about the cache
>     distance between two cores and can use this information to take
>     scheduling decisions.

Do the socket, book, ... metaphors and looking at STSI from the existing
smp infrastructure even make sense?

STSI 15.1.x reports the topology to the guest and for a virtual machine,
this topology can be very dynamic. So a CPU can move from one topology
container to another, but the socket of a cpu changing while it's running seems
a bit strange. And this isn't supported by this patch series as far as I understand,
the only topology changes are on hotplug.
Pierre Morel July 14, 2022, 8:05 p.m. UTC | #2
On 7/14/22 20:43, Janis Schoetterl-Glausch wrote:
> On 6/20/22 16:03, Pierre Morel wrote:
>> Hi,
>>
>> This new spin is essentially for coherence with the last Linux CPU
>> Topology patch, function testing and coding style modifications.
>>
>> Forword
>> =======
>>
>> The goal of this series is to implement CPU topology for S390, it
>> improves the preceeding series with the implementation of books and
>> drawers, of non uniform CPU topology and with documentation.
>>
>> To use these patches, you will need the Linux series version 10.
>> You find it there:
>> https://lkml.org/lkml/2022/6/20/590
>>
>> Currently this code is for KVM only, I have no idea if it is interesting
>> to provide a TCG patch. If ever it will be done in another series.
>>
>> To have a better understanding of the S390x CPU Topology and its
>> implementation in QEMU you can have a look at the documentation in the
>> last patch or follow the introduction here under.
>>
>> A short introduction
>> ====================
>>
>> CPU Topology is described in the S390 POP with essentially the description
>> of two instructions:
>>
>> PTF Perform Topology function used to poll for topology change
>>      and used to set the polarization but this part is not part of this item.
>>
>> STSI Store System Information and the SYSIB 15.1.x providing the Topology
>>      configuration.
>>
>> S390 Topology is a 6 levels hierarchical topology with up to 5 level
>>      of containers. The last topology level, specifying the CPU cores.
>>
>>      This patch series only uses the two lower levels sockets and cores.
>>      
>>      To get the information on the topology, S390 provides the STSI
>>      instruction, which stores a structures providing the list of the
>>      containers used in the Machine topology: the SYSIB.
>>      A selector within the STSI instruction allow to chose how many topology
>>      levels will be provide in the SYSIB.
>>
>>      Using the Topology List Entries (TLE) provided inside the SYSIB we
>>      the Linux kernel is able to compute the information about the cache
>>      distance between two cores and can use this information to take
>>      scheduling decisions.
> 
> Do the socket, book, ... metaphors and looking at STSI from the existing
> smp infrastructure even make sense?

Sorry, I do not understand.
I admit the cover letter is old and I have not really rewritten it
properly since the first patch series.

What we do is:
Compute the STSI data from the QEMU SMP + NUMA + device parameters.

> 
> STSI 15.1.x reports the topology to the guest and for a virtual machine,
> this topology can be very dynamic. So a CPU can move from from one topology
> container to another, but the socket of a cpu changing while it's running seems
> a bit strange. And this isn't supported by this patch series as far as I understand,
> the only topology changes are on hotplug.

A CPU moving from one socket to another socket is, together with the
case of a new CPU being plugged in, the only case in which the PTF
instruction reports a change in the topology.
It is not expected to appear often but it does appear.
The code has been removed from the kernel in spin 10 for 2 reasons:
1) we decided to first support only dedicated and pinned CPUs
2) Christian fears it may happen too often due to Linux host scheduling 
and could be a performance problem

So yes, for now we only have a topology report on vCPU plug.

>
Janis Schoetterl-Glausch July 15, 2022, 9:31 a.m. UTC | #3
On 7/14/22 22:05, Pierre Morel wrote:
> 
> 
> On 7/14/22 20:43, Janis Schoetterl-Glausch wrote:
>> On 6/20/22 16:03, Pierre Morel wrote:
>>> Hi,
>>>
>>> This new spin is essentially for coherence with the last Linux CPU
>>> Topology patch, function testing and coding style modifications.
>>>
>>> Forword
>>> =======
>>>
>>> The goal of this series is to implement CPU topology for S390, it
>>> improves the preceeding series with the implementation of books and
>>> drawers, of non uniform CPU topology and with documentation.
>>>
>>> To use these patches, you will need the Linux series version 10.
>>> You find it there:
>>> https://lkml.org/lkml/2022/6/20/590
>>>
>>> Currently this code is for KVM only, I have no idea if it is interesting
>>> to provide a TCG patch. If ever it will be done in another series.
>>>
>>> To have a better understanding of the S390x CPU Topology and its
>>> implementation in QEMU you can have a look at the documentation in the
>>> last patch or follow the introduction here under.
>>>
>>> A short introduction
>>> ====================
>>>
>>> CPU Topology is described in the S390 POP with essentially the description
>>> of two instructions:
>>>
>>> PTF Perform Topology function used to poll for topology change
>>>      and used to set the polarization but this part is not part of this item.
>>>
>>> STSI Store System Information and the SYSIB 15.1.x providing the Topology
>>>      configuration.
>>>
>>> S390 Topology is a 6 levels hierarchical topology with up to 5 level
>>>      of containers. The last topology level, specifying the CPU cores.
>>>
>>>      This patch series only uses the two lower levels sockets and cores.
>>>           To get the information on the topology, S390 provides the STSI
>>>      instruction, which stores a structures providing the list of the
>>>      containers used in the Machine topology: the SYSIB.
>>>      A selector within the STSI instruction allow to chose how many topology
>>>      levels will be provide in the SYSIB.
>>>
>>>      Using the Topology List Entries (TLE) provided inside the SYSIB we
>>>      the Linux kernel is able to compute the information about the cache
>>>      distance between two cores and can use this information to take
>>>      scheduling decisions.
>>
>> Do the socket, book, ... metaphors and looking at STSI from the existing
>> smp infrastructure even make sense?
> 
> Sorry, I do not understand.
> I admit the cover-letter is old and I did not rewrite it really good since the first patch series.
> 
> What we do is:
> Compute the STSI from the SMP + numa + device QEMU parameters .
> 
>>
>> STSI 15.1.x reports the topology to the guest and for a virtual machine,
>> this topology can be very dynamic. So a CPU can move from from one topology
>> container to another, but the socket of a cpu changing while it's running seems
>> a bit strange. And this isn't supported by this patch series as far as I understand,
>> the only topology changes are on hotplug.
> 
> A CPU changing from a socket to another socket is the only case the PTF instruction reports a change in the topology with the case a new CPU is plug in.

Can a CPU actually change between sockets right now?
The socket-id is computed from the core-id, so it's fixed, is it not?

> It is not expected to appear often but it does appear.
> The code has been removed from the kernel in spin 10 for 2 reasons:
> 1) we decided to first support only dedicated and pinned CPU> 2) Christian fears it may happen too often due to Linux host scheduling and could be a performance problem

This seems sensible, but now it seems too static.
For example after migration, you cannot tell the guest which CPUs are in the same socket, book, ...,
unless I'm misunderstanding something.
And migration is rare, but something you'd want to be able to react to.
And I could imagine that the vCPUs are pinned most of the time, but the pinning changes occasionally.

> 
> So yes now we only have a topology report on vCPU plug.
> 
> 
> 
> 
> 
> 
> 
>>
>
Pierre Morel July 15, 2022, 1:47 p.m. UTC | #4
On 7/15/22 11:31, Janis Schoetterl-Glausch wrote:
> On 7/14/22 22:05, Pierre Morel wrote:
>>
>>
>> On 7/14/22 20:43, Janis Schoetterl-Glausch wrote:
>>> On 6/20/22 16:03, Pierre Morel wrote:
>>>> Hi,
>>>>
>>>> This new spin is essentially for coherence with the last Linux CPU
>>>> Topology patch, function testing and coding style modifications.
>>>>
>>>> Forword
>>>> =======
>>>>
>>>> The goal of this series is to implement CPU topology for S390, it
>>>> improves the preceeding series with the implementation of books and
>>>> drawers, of non uniform CPU topology and with documentation.
>>>>
>>>> To use these patches, you will need the Linux series version 10.
>>>> You find it there:
>>>> https://lkml.org/lkml/2022/6/20/590
>>>>
>>>> Currently this code is for KVM only, I have no idea if it is interesting
>>>> to provide a TCG patch. If ever it will be done in another series.
>>>>
>>>> To have a better understanding of the S390x CPU Topology and its
>>>> implementation in QEMU you can have a look at the documentation in the
>>>> last patch or follow the introduction here under.
>>>>
>>>> A short introduction
>>>> ====================
>>>>
>>>> CPU Topology is described in the S390 POP with essentially the description
>>>> of two instructions:
>>>>
>>>> PTF Perform Topology function used to poll for topology change
>>>>       and used to set the polarization but this part is not part of this item.
>>>>
>>>> STSI Store System Information and the SYSIB 15.1.x providing the Topology
>>>>       configuration.
>>>>
>>>> S390 Topology is a 6 levels hierarchical topology with up to 5 level
>>>>       of containers. The last topology level, specifying the CPU cores.
>>>>
>>>>       This patch series only uses the two lower levels sockets and cores.
>>>>            To get the information on the topology, S390 provides the STSI
>>>>       instruction, which stores a structures providing the list of the
>>>>       containers used in the Machine topology: the SYSIB.
>>>>       A selector within the STSI instruction allow to chose how many topology
>>>>       levels will be provide in the SYSIB.
>>>>
>>>>       Using the Topology List Entries (TLE) provided inside the SYSIB we
>>>>       the Linux kernel is able to compute the information about the cache
>>>>       distance between two cores and can use this information to take
>>>>       scheduling decisions.
>>>
>>> Do the socket, book, ... metaphors and looking at STSI from the existing
>>> smp infrastructure even make sense?
>>
>> Sorry, I do not understand.
>> I admit the cover-letter is old and I did not rewrite it really good since the first patch series.
>>
>> What we do is:
>> Compute the STSI from the SMP + numa + device QEMU parameters .
>>
>>>
>>> STSI 15.1.x reports the topology to the guest and for a virtual machine,
>>> this topology can be very dynamic. So a CPU can move from from one topology
>>> container to another, but the socket of a cpu changing while it's running seems
>>> a bit strange. And this isn't supported by this patch series as far as I understand,
>>> the only topology changes are on hotplug.
>>
>> A CPU changing from a socket to another socket is the only case the PTF instruction reports a change in the topology with the case a new CPU is plug in.
> 
> Can a CPU actually change between sockets right now?

To be exact, what I understand is that a shared CPU can be scheduled to 
another real CPU exactly as a guest vCPU can be scheduled by the host to 
another host CPU.

> The socket-id is computed from the core-id, so it's fixed, is it not?

The virtual socket-id is computed from the virtual core-id.

> 
>> It is not expected to appear often but it does appear.
>> The code has been removed from the kernel in spin 10 for 2 reasons:
>> 1) we decided to first support only dedicated and pinned CPU> 2) Christian fears it may happen too often due to Linux host scheduling and could be a performance problem
> 
> This seems sensible, but now it seems too static.
> For example after migration, you cannot tell the guest which CPUs are in the same socket, book, ...,
> unless I'm misunderstanding something.

No, to do this we would need to ask the kernel about it.

> And migration is rare, but something you'd want to be able to react to.
> And I could imaging that the vCPUs are pinned most of the time, but the pinning changes occasionally.

I think on migration we should just do a kvm_set_mtcr on post_load,
as Nico suggested; everything else seems complicated for a questionable
benefit.
Janis Schoetterl-Glausch July 15, 2022, 6:28 p.m. UTC | #5
On 7/15/22 15:47, Pierre Morel wrote:
> 
> 
> On 7/15/22 11:31, Janis Schoetterl-Glausch wrote:
>> On 7/14/22 22:05, Pierre Morel wrote:
>>>
>>>
>>> On 7/14/22 20:43, Janis Schoetterl-Glausch wrote:
>>>> On 6/20/22 16:03, Pierre Morel wrote:
>>>>> Hi,
>>>>>
>>>>> This new spin is essentially for coherence with the last Linux CPU
>>>>> Topology patch, function testing and coding style modifications.
>>>>>
>>>>> Forword
>>>>> =======
>>>>>
>>>>> The goal of this series is to implement CPU topology for S390, it
>>>>> improves the preceeding series with the implementation of books and
>>>>> drawers, of non uniform CPU topology and with documentation.
>>>>>
>>>>> To use these patches, you will need the Linux series version 10.
>>>>> You find it there:
>>>>> https://lkml.org/lkml/2022/6/20/590
>>>>>
>>>>> Currently this code is for KVM only, I have no idea if it is interesting
>>>>> to provide a TCG patch. If ever it will be done in another series.
>>>>>
>>>>> To have a better understanding of the S390x CPU Topology and its
>>>>> implementation in QEMU you can have a look at the documentation in the
>>>>> last patch or follow the introduction here under.
>>>>>
>>>>> A short introduction
>>>>> ====================
>>>>>
>>>>> CPU Topology is described in the S390 POP with essentially the description
>>>>> of two instructions:
>>>>>
>>>>> PTF Perform Topology function used to poll for topology change
>>>>>       and used to set the polarization but this part is not part of this item.
>>>>>
>>>>> STSI Store System Information and the SYSIB 15.1.x providing the Topology
>>>>>       configuration.
>>>>>
>>>>> S390 Topology is a 6 levels hierarchical topology with up to 5 level
>>>>>       of containers. The last topology level, specifying the CPU cores.
>>>>>
>>>>>       This patch series only uses the two lower levels sockets and cores.
>>>>>            To get the information on the topology, S390 provides the STSI
>>>>>       instruction, which stores a structures providing the list of the
>>>>>       containers used in the Machine topology: the SYSIB.
>>>>>       A selector within the STSI instruction allow to chose how many topology
>>>>>       levels will be provide in the SYSIB.
>>>>>
>>>>>       Using the Topology List Entries (TLE) provided inside the SYSIB we
>>>>>       the Linux kernel is able to compute the information about the cache
>>>>>       distance between two cores and can use this information to take
>>>>>       scheduling decisions.
>>>>
>>>> Do the socket, book, ... metaphors and looking at STSI from the existing
>>>> smp infrastructure even make sense?
>>>
>>> Sorry, I do not understand.
>>> I admit the cover-letter is old and I did not rewrite it really good since the first patch series.
>>>
>>> What we do is:
>>> Compute the STSI from the SMP + numa + device QEMU parameters .
>>>
>>>>
>>>> STSI 15.1.x reports the topology to the guest and for a virtual machine,
>>>> this topology can be very dynamic. So a CPU can move from from one topology
>>>> container to another, but the socket of a cpu changing while it's running seems
>>>> a bit strange. And this isn't supported by this patch series as far as I understand,
>>>> the only topology changes are on hotplug.
>>>
>>> A CPU changing from a socket to another socket is the only case the PTF instruction reports a change in the topology with the case a new CPU is plug in.
>>
>> Can a CPU actually change between sockets right now?
> 
> To be exact, what I understand is that a shared CPU can be scheduled to another real CPU exactly as a guest vCPU can be scheduled by the host to another host CPU.

Ah, ok, this is what I'm forgetting, and what made communication harder,
there are two ways by which the topology can change:
1. the host topology changes
2. the vCPU threads are scheduled on another host CPU

I've only been thinking about 2.
I assumed some outside entity (libvirt?) pins vCPU threads, and so it would
be the responsibility of that entity to set the topology which then is 
reported to the guest. So if you pin vCPUs for the whole lifetime of the vm
then you could do that by specifying the topology up front with -devices.
If you want to support migration, then the outside entity would need a way
to tell qemu the updated topology.
 
> 
>> The socket-id is computed from the core-id, so it's fixed, is it not?
> 
> the virtual socket-id is computed from the virtual core-id

Meaning cpu.env.core_id, correct? (which is the same as cpu.cpu_index which is the same as
ms->possible_cpus->cpus[core_id].props.core_id)
And a cpu's core id doesn't change during the lifetime of the vm, right?
And so it's socket id doesn't either.

> 
>>
>>> It is not expected to appear often but it does appear.
>>> The code has been removed from the kernel in spin 10 for 2 reasons:
>>> 1) we decided to first support only dedicated and pinned CPU> 2) Christian fears it may happen too often due to Linux host scheduling and could be a performance problem
>>
>> This seems sensible, but now it seems too static.
>> For example after migration, you cannot tell the guest which CPUs are in the same socket, book, ...,
>> unless I'm misunderstanding something.
> 
> No, to do this we would need to ask the kernel about it.

You mean polling /sys/devices/system/cpu/cpu*/topology/*_id ?
That should work if it isn't done too frequently, right?
And if it's done by the entity doing the pinning it could judge if the host topology change
is relevant to the guest and if so tell qemu how to update it.
> 
>> And migration is rare, but something you'd want to be able to react to.
>> And I could imaging that the vCPUs are pinned most of the time, but the pinning changes occasionally.
> 
> I think on migration we should just make a kvm_set_mtcr on post_load like Nico suggested everything else seems complicated for a questionable benefit.

But what is the point? The result of STSI reported to the guest doesn't actually change, does it?
Since the same CPUs with the same calculated socket-ids, ..., exist.
You cannot migrate to a vm with a different virtual topology, since the CPUs get matched via the cpu_index
as far as I can tell, which is the same as the core_id, or am I misunderstanding something?
Migrating the MTCR bit is correct: if it is 1, then there was a CPU hotplug that the guest did not yet observe,
but setting it to 1 after migration would be wrong if the STSI result would be the same.
> 
>
Pierre Morel July 18, 2022, 12:32 p.m. UTC | #6
On 7/15/22 20:28, Janis Schoetterl-Glausch wrote:
> On 7/15/22 15:47, Pierre Morel wrote:
>>
>>
>> On 7/15/22 11:31, Janis Schoetterl-Glausch wrote:
>>> On 7/14/22 22:05, Pierre Morel wrote:
>>>>
>>>>
>>>> On 7/14/22 20:43, Janis Schoetterl-Glausch wrote:
>>>>> On 6/20/22 16:03, Pierre Morel wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This new spin is essentially for coherence with the last Linux CPU
>>>>>> Topology patch, function testing and coding style modifications.
>>>>>>
>>>>>> Forword
>>>>>> =======
>>>>>>
>>>>>> The goal of this series is to implement CPU topology for S390, it
>>>>>> improves the preceeding series with the implementation of books and
>>>>>> drawers, of non uniform CPU topology and with documentation.
>>>>>>
>>>>>> To use these patches, you will need the Linux series version 10.
>>>>>> You find it there:
>>>>>> https://lkml.org/lkml/2022/6/20/590
>>>>>>
>>>>>> Currently this code is for KVM only, I have no idea if it is interesting
>>>>>> to provide a TCG patch. If ever it will be done in another series.
>>>>>>
>>>>>> To have a better understanding of the S390x CPU Topology and its
>>>>>> implementation in QEMU you can have a look at the documentation in the
>>>>>> last patch or follow the introduction here under.
>>>>>>
>>>>>> A short introduction
>>>>>> ====================
>>>>>>
>>>>>> CPU Topology is described in the S390 POP with essentially the description
>>>>>> of two instructions:
>>>>>>
>>>>>> PTF Perform Topology function used to poll for topology change
>>>>>>        and used to set the polarization but this part is not part of this item.
>>>>>>
>>>>>> STSI Store System Information and the SYSIB 15.1.x providing the Topology
>>>>>>        configuration.
>>>>>>
>>>>>> S390 Topology is a 6 levels hierarchical topology with up to 5 level
>>>>>>        of containers. The last topology level, specifying the CPU cores.
>>>>>>
>>>>>>        This patch series only uses the two lower levels, sockets and cores.
>>>>>>
>>>>>>        To get the information on the topology, S390 provides the STSI
>>>>>>        instruction, which stores a structure, the SYSIB, providing the
>>>>>>        list of the containers used in the machine topology.
>>>>>>        A selector within the STSI instruction allows choosing how many
>>>>>>        topology levels will be provided in the SYSIB.
>>>>>>
>>>>>>        Using the Topology List Entries (TLE) provided inside the SYSIB,
>>>>>>        the Linux kernel is able to compute the cache distance between
>>>>>>        two cores and can use this information to make scheduling
>>>>>>        decisions.
>>>>>
>>>>> Do the socket, book, ... metaphors and looking at STSI from the existing
>>>>> smp infrastructure even make sense?
>>>>
>>>> Sorry, I do not understand.
>>>> I admit the cover letter is old and I have not really rewritten it since
>>>> the first patch series.
>>>>
>>>> What we do is:
>>>> Compute the STSI data from the QEMU SMP + NUMA + device parameters.
>>>>
>>>>>
>>>>> STSI 15.1.x reports the topology to the guest, and for a virtual machine
>>>>> this topology can be very dynamic. So a CPU can move from one topology
>>>>> container to another, but the socket of a CPU changing while it's running
>>>>> seems a bit strange. And this isn't supported by this patch series as far
>>>>> as I understand; the only topology changes are on hotplug.
>>>>
>>>> A CPU moving from one socket to another, together with a new CPU being plugged in, are the only cases in which the PTF instruction reports a change in the topology.
>>>
>>> Can a CPU actually change between sockets right now?
>>
>> To be exact, what I understand is that a shared CPU can be scheduled to another real CPU exactly as a guest vCPU can be scheduled by the host to another host CPU.
> 
> Ah, ok, this is what I'm forgetting, and what made communication harder,
> there are two ways by which the topology can change:
> 1. the host topology changes
> 2. the vCPU threads are scheduled on another host CPU
> 
> I've been only thinking about the 2.
> I assumed some outside entity (libvirt?) pins vCPU threads, and so it would
> be the responsibility of that entity to set the topology which then is
> reported to the guest. So if you pin vCPUs for the whole lifetime of the vm
> then you could do that by specifying the topology up front with -devices.
> If you want to support migration, then the outside entity would need a way
> to tell qemu the updated topology.

Yes

>   
>>
>>> The socket-id is computed from the core-id, so it's fixed, is it not?
>>
>> the virtual socket-id is computed from the virtual core-id
> 
> Meaning cpu.env.core_id, correct? (which is the same as cpu.cpu_index which is the same as
> ms->possible_cpus->cpus[core_id].props.core_id)
> And a cpu's core id doesn't change during the lifetime of the vm, right?

right

> And so it's socket id doesn't either.

Yes

> 
>>
>>>
>>>> It is not expected to happen often, but it does happen.
>>>> The code has been removed from the kernel in spin 10 for two reasons:
>>>> 1) we decided to first support only dedicated and pinned CPUs
>>>> 2) Christian fears it may happen too often due to Linux host scheduling
>>>>    and could be a performance problem
>>>
>>> This seems sensible, but now it seems too static.
>>> For example after migration, you cannot tell the guest which CPUs are in the same socket, book, ...,
>>> unless I'm misunderstanding something.
>>
>> No, to do this we would need to ask the kernel about it.
> 
> You mean polling /sys/devices/system/cpu/cpu*/topology/*_id ?
> That should work if it isn't done too frequently, right?
> And if it's done by the entity doing the pinning it could judge if the host topology change
> is relevant to the guest and if so tell qemu how to update it.

Yes, I guess we will need to change the core-id, which may be complicated, 
or find another way to link the vCPU topology with the host CPU topology.
Initially I wanted to get this directly from the kernel, since it has all 
the information on both the vCPU and the host topology.
That is why I had a struct in the UAPI.

Viktor, as you say here, is for something in userland only.

For the moment I would like this patch series to stay with a fixed 
topology, set by the admin; updating the topology can come afterwards.


>>
>>> And migration is rare, but something you'd want to be able to react to.
>>> And I could imagine that the vCPUs are pinned most of the time, but the pinning changes occasionally.
>>
>> I think on migration we should just do a kvm_set_mtcr on post_load, as Nico suggested; everything else seems complicated for a questionable benefit.
> 
> But what is the point? The result of STSI reported to the guest doesn't actually change, does it?
> Since the same CPUs with the same calculated socket-ids, ..., exist.
> You cannot migrate to a vm with a different virtual topology, since the CPUs get matched via the cpu_index
> as far as I can tell, which is the same as the core_id, or am I misunderstanding something?
> Migrating the MTCR bit is correct: if it is 1, then there was a CPU hotplug that the guest did not yet observe,
> but setting it to 1 after migration would be wrong if the STSI result stayed the same.

That is a good point. IIUC, it follows that:
- a CPU hotplug cannot be done during the migration.
- a migration cannot be started while a CPU is being hot plugged.