[RFC,0/2] Specifying cache topology on ARM

Message ID 20240823125446.721-1-alireza.sanaee@huawei.com (mailing list archive)

Message

Alireza Sanaee Aug. 23, 2024, 12:54 p.m. UTC
Specifying the cache layout in virtual machines lets applications and
operating systems obtain accurate information about the cache structure
and adjust accordingly. Exposing correct sharing information can enable
better optimizations. This patch allows the cache layout to be specified
through a command line parameter, building on a patch set from Intel [1].
The ACPI PPTT table is populated based on the user-provided information
and the CPU topology.

Example:


+----------------+                            +----------------+
|    Socket 0    |                            |    Socket 1    |
|    (L3 Cache)  |                            |    (L3 Cache)  |
+--------+-------+                            +--------+-------+
         |                                             |
+--------+--------+                            +--------+--------+
|   Cluster 0     |                            |   Cluster 0     |
|   (L2 Cache)    |                            |   (L2 Cache)    |
+--------+--------+                            +--------+--------+
         |                                             |
+--------+--------+  +--------+--------+    +--------+--------+  +--------+----+
|   Core 0         | |   Core 1        |    |   Core 0        |  |   Core 1    |
|   (L1i, L1d)     | |   (L1i, L1d)    |    |   (L1i, L1d)    |  |   (L1i, L1d)|
+--------+--------+  +--------+--------+    +--------+--------+  +--------+----+
         |                   |                       |                   |
+--------+              +--------+              +--------+          +--------+
|Thread 0|              |Thread 1|              |Thread 1|          |Thread 0|
+--------+              +--------+              +--------+          +--------+
|Thread 1|              |Thread 0|              |Thread 0|          |Thread 1|
+--------+              +--------+              +--------+          +--------+


The following command line describes this system:

./qemu-system-aarch64 \
 -machine virt,**smp-cache=cache0** \
 -cpu max \
 -m 2048 \
 -smp sockets=2,clusters=1,cores=2,threads=2 \
 -kernel ./Image.gz \
 -append "console=ttyAMA0 root=/dev/ram rdinit=/init acpi=force" \
 -initrd rootfs.cpio.gz \
 -bios ./edk2-aarch64-code.fd \
 **-object '{"qom-type":"smp-cache","id":"cache0","caches":[{"name":"l1d","topo":"core"},{"name":"l1i","topo":"core"},{"name":"l2","topo":"cluster"},{"name":"l3","topo":"socket"}]}'** \
 -nographic
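
The JSON passed to -object above is easy to mistype by hand. A minimal
sketch of generating it from a cache-name to topology-level mapping is
shown below; the helper name smp_cache_object is made up for this
illustration and is not part of QEMU or of either patch set.

```python
import json
import shlex

# Cache name -> topology level at which the cache is shared,
# taken from the example command line above.
caches = {"l1d": "core", "l1i": "core", "l2": "cluster", "l3": "socket"}


def smp_cache_object(cache_id, caches):
    """Return the JSON value for -object (hypothetical helper)."""
    obj = {
        "qom-type": "smp-cache",
        "id": cache_id,
        "caches": [{"name": name, "topo": topo} for name, topo in caches.items()],
    }
    # Compact separators reproduce the spacing used in the command line.
    return json.dumps(obj, separators=(",", ":"))


arg = smp_cache_object("cache0", caches)
# shlex.quote adds the single quotes needed to paste this into a shell.
print("-object", shlex.quote(arg))
```

The object id ("cache0" here) has to match the value given to
smp-cache= on the -machine option.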

Failure cases:
    1) QEMU might not have any clusters selected in the -smp option while
    the user specifies caches to be shared at cluster level. In this
    situation, QEMU returns an error.

    2) Caches may exist in the system's registers but be left unspecified
    by the user. In this case, QEMU also returns an error.
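
The two failure cases can be modelled roughly as follows. This is only
a sketch of the rules as described in this cover letter, not the C code
in the patches; the function name, its arguments, and the error strings
are assumptions, and "no clusters selected" is modelled as the clusters
key being absent from the -smp settings.

```python
def check_cache_topology(smp, user_caches, register_caches):
    """Return a list of error strings; an empty list means accepted.

    smp:             levels given to -smp, e.g. {"sockets": 2, "cores": 2}
    user_caches:     cache name -> topology level, e.g. {"l2": "cluster"}
    register_caches: set of cache names described by the CPU registers
    """
    errors = []

    # Case 1: a cache is shared at cluster level, but -smp has no clusters.
    for name, topo in user_caches.items():
        if topo == "cluster" and "clusters" not in smp:
            errors.append(f"{name}: shared at cluster level, but -smp has no clusters")

    # Case 2: a cache exists in the registers, but the user did not specify it.
    for name in sorted(register_caches - set(user_caches)):
        errors.append(f"{name}: present in CPU registers, but not specified")

    return errors


smp = {"sockets": 2, "cores": 2, "threads": 2}
user = {"l1d": "core", "l1i": "core", "l2": "cluster"}
for err in check_cache_topology(smp, user, {"l1d", "l1i", "l2", "l3"}):
    print("error:", err)
```

With the inputs shown, both cases trigger: l2 is placed at cluster level
without clusters in -smp, and l3 is left unspecified.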

Currently, only three levels of cache can be specified from the command
line. However, increasing this limit does not require significant
changes. Further, this patch assumes that the L2 and L3 caches are
unified and does not allow l(2/3)(i/d). The topology levels are
currently thread/core/cluster/socket.

Here is the hierarchy assumed in this patch:
Socket level = Cluster level + 1 = Core level + 2 = Thread level + 3;
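
As an illustration of this ordering, the levels can be written as
consecutive integers, and the number of hardware threads sharing a cache
at each level follows from the -smp values. The names TopoLevel and
threads_sharing, and the choice of 0 as the base value, are assumptions
made for this sketch only.

```python
from enum import IntEnum


class TopoLevel(IntEnum):
    THREAD = 0  # base value is arbitrary; only the differences matter
    CORE = 1
    CLUSTER = 2
    SOCKET = 3


def threads_sharing(level, clusters, cores, threads):
    """Number of hardware threads below one instance of `level`."""
    per_level = {
        TopoLevel.THREAD: 1,
        TopoLevel.CORE: threads,
        TopoLevel.CLUSTER: cores * threads,
        TopoLevel.SOCKET: clusters * cores * threads,
    }
    return per_level[level]


# The relation stated above holds for this numbering.
assert (TopoLevel.SOCKET == TopoLevel.CLUSTER + 1
        == TopoLevel.CORE + 2 == TopoLevel.THREAD + 3)

# Example: -smp sockets=2,clusters=1,cores=2,threads=2
for level in (TopoLevel.CORE, TopoLevel.CLUSTER, TopoLevel.SOCKET):
    print(level.name, threads_sharing(level, clusters=1, cores=2, threads=2))
```

For the example system, a core-level cache (L1) is shared by 2 threads,
while the cluster-level (L2) and socket-level (L3) caches are each
shared by 4.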

[1] https://lore.kernel.org/qemu-devel/20240704031603.1744546-1-zhao1.liu@intel.com/#r

TODO:
1) Make the code work with an arbitrary number of levels.
2) Support separate data and instruction caches at L2 and L3.
3) Allow a data-only or instruction-only cache at a particular level.
4) Additional cache controls, e.g. the size of L3 may not want to just
match the underlying system, because only some of the associated host
CPUs may be bound to this VM.
5) Add device tree related code to generate cache information.

Alireza Sanaee (2):
  target/arm/tcg: increase cache level for cpu=max
  hw/acpi: add cache hierarchy node to pptt table

 hw/acpi/aml-build.c         | 307 +++++++++++++++++++++++++++++++++++-
 hw/arm/virt-acpi-build.c    | 137 +++++++++++++++-
 hw/arm/virt.c               |   5 +
 hw/core/machine-smp.c       |   6 +-
 hw/loongarch/acpi-build.c   |   3 +-
 include/hw/acpi/aml-build.h |  20 ++-
 target/arm/tcg/cpu64.c      |  35 ++++
 7 files changed, 503 insertions(+), 10 deletions(-)

Comments

Zhao Liu Aug. 31, 2024, 11:25 a.m. UTC | #1
Hi Alireza,

Great to see your Arm side implementation!

On Fri, Aug 23, 2024 at 01:54:44PM +0100, Alireza Sanaee wrote:
> Date: Fri, 23 Aug 2024 13:54:44 +0100
> From: Alireza Sanaee <alireza.sanaee@huawei.com>
> Subject: [RFC PATCH 0/2] Specifying cache topology on ARM
> X-Mailer: git-send-email 2.34.1
> 

[snip]

> 
> The following command will represent the system.
> 
> ./qemu-system-aarch64 \
>  -machine virt,**smp-cache=cache0** \
>  -cpu max \
>  -m 2048 \
>  -smp sockets=2,clusters=1,cores=2,threads=2 \
>  -kernel ./Image.gz \
>  -append "console=ttyAMA0 root=/dev/ram rdinit=/init acpi=force" \
>  -initrd rootfs.cpio.gz \
>  -bios ./edk2-aarch64-code.fd \
>  **-object '{"qom-type":"smp-cache","id":"cache0","caches":[{"name":"l1d","topo":"core"},{"name":"l1i","topo":"core"},{"name":"l2","topo":"cluster"},{"name":"l3","topo":"socket"}]}'** \
>  -nographic

I plan to post a refreshed version soon, in which the smp-cache array
will be fully integrated into -machine. I'll CC you then.

Regards,
Zhao

Alireza Sanaee Sept. 2, 2024, 10:25 a.m. UTC | #2
On Sat, 31 Aug 2024 19:25:47 +0800
Zhao Liu <zhao1.liu@intel.com> wrote:

> Hi Alireza,
> 
> Great to see your Arm side implementation!
> 
> On Fri, Aug 23, 2024 at 01:54:44PM +0100, Alireza Sanaee wrote:
> > Date: Fri, 23 Aug 2024 13:54:44 +0100
> > From: Alireza Sanaee <alireza.sanaee@huawei.com>
> > Subject: [RFC PATCH 0/2] Specifying cache topology on ARM
> > X-Mailer: git-send-email 2.34.1
> >   
> 
> [snip]
> 
> > 
> > The following command will represent the system.
> > 
> > ./qemu-system-aarch64 \
> >  -machine virt,**smp-cache=cache0** \
> >  -cpu max \
> >  -m 2048 \
> >  -smp sockets=2,clusters=1,cores=2,threads=2 \
> >  -kernel ./Image.gz \
> >  -append "console=ttyAMA0 root=/dev/ram rdinit=/init acpi=force" \
> >  -initrd rootfs.cpio.gz \
> >  -bios ./edk2-aarch64-code.fd \
> >  **-object '{"qom-type":"smp-cache","id":"cache0","caches":[{"name":"l1d","topo":"core"},{"name":"l1i","topo":"core"},{"name":"l2","topo":"cluster"},{"name":"l3","topo":"socket"}]}'** \
> >  -nographic
> 
> I plan to refresh a new version soon, in which the smp-cache array
> will be integrated into -machine totally. And I'cc you then.
> 
> Regards,
> Zhao
> 
> 


Hi Zhao,

Yes, please keep me CCed. 

One thing I noticed: because your series follows the Intel path, some
variables can never be NULL there. When I went down the ARM path,
though, I hit some scenarios where I ended up with uninitialized
variables, which is still OK but could have been avoided.

Looking forward to the next revision.

Alireza

Marcin Juszkiewicz Sept. 2, 2024, 11:49 a.m. UTC | #3
On 23.08.2024 14:54, Alireza Sanaee wrote:

> Failure cases:
>      1) there are cases where QEMU might not have any clusters selected in the
>      -smp option, while user specifies caches to be shared at cluster level. In
>      this situations, qemu returns error.
> 
>      2) There are other scenarios where caches exist in systems' registers but
>      not left unspecified by users. In this case qemu returns failure.

Sockets, clusters, cores, threads. And then caches. Sounds like more fun
than it is already.

IIRC Arm hardware can have up to 16 cores per cluster (virt uses 16,
sbsa-ref uses 8), as this is a GIC limitation.

I have a script to visualize Arm topology:

https://github.com/hrw/sbsa-ref-status/blob/main/parse-pptt-log.py

It uses 'EFIShell> acpiview -s PPTT' output and gives something like this:

-smp 24,sockets=1,clusters=2,cores=3,threads=4
socket:        offset: 0x24 parent: 0x0
   cluster:     offset: 0x38 parent: 0x24
     core:      offset: 0x4C parent: 0x38 cpuId: 0x0 L1i: 0x68 L1d: 0x84
       cache:   offset: 0x68 cacheId: 1 size: 0x10000 next: 0xA0
       cache:   offset: 0x84 cacheId: 2 size: 0x10000 next: 0xA0
       cache:   offset: 0xA0 cacheId: 3 size: 0x80000
       thread:  offset: 0xBC parent: 0x4C cpuId: 0x0
       thread:  offset: 0xD0 parent: 0x4C cpuId: 0x1
       thread:  offset: 0xE4 parent: 0x4C cpuId: 0x2
       thread:  offset: 0xF8 parent: 0x4C cpuId: 0x3
     core:      offset: 0x10C parent: 0x38 cpuId: 0x0 L1i: 0x128 L1d: 0x144
       cache:   offset: 0x128 cacheId: 4 size: 0x10000 next: 0x160
       cache:   offset: 0x144 cacheId: 5 size: 0x10000 next: 0x160
       cache:   offset: 0x160 cacheId: 6 size: 0x80000
       thread:  offset: 0x17C parent: 0x10C cpuId: 0x4
       thread:  offset: 0x190 parent: 0x10C cpuId: 0x5
       thread:  offset: 0x1A4 parent: 0x10C cpuId: 0x6
       thread:  offset: 0x1B8 parent: 0x10C cpuId: 0x7
     core:      offset: 0x1CC parent: 0x38 cpuId: 0x0 L1i: 0x1E8 L1d: 0x204
       cache:   offset: 0x1E8 cacheId: 7 size: 0x10000 next: 0x220
       cache:   offset: 0x204 cacheId: 8 size: 0x10000 next: 0x220
       cache:   offset: 0x220 cacheId: 9 size: 0x80000
       thread:  offset: 0x23C parent: 0x1CC cpuId: 0x8
       thread:  offset: 0x250 parent: 0x1CC cpuId: 0x9
       thread:  offset: 0x264 parent: 0x1CC cpuId: 0xA
       thread:  offset: 0x278 parent: 0x1CC cpuId: 0xB
   cluster:     offset: 0x28C parent: 0x24
     core:      offset: 0x2A0 parent: 0x28C cpuId: 0x0 L1i: 0x2BC L1d: 0x2D8
       cache:   offset: 0x2BC cacheId: 10 size: 0x10000 next: 0x2F4
       cache:   offset: 0x2D8 cacheId: 11 size: 0x10000 next: 0x2F4
       cache:   offset: 0x2F4 cacheId: 12 size: 0x80000
       thread:  offset: 0x310 parent: 0x2A0 cpuId: 0xC
       thread:  offset: 0x324 parent: 0x2A0 cpuId: 0xD
       thread:  offset: 0x338 parent: 0x2A0 cpuId: 0xE
       thread:  offset: 0x34C parent: 0x2A0 cpuId: 0xF
     core:      offset: 0x360 parent: 0x28C cpuId: 0x0 L1i: 0x37C L1d: 0x398
       cache:   offset: 0x37C cacheId: 13 size: 0x10000 next: 0x3B4
       cache:   offset: 0x398 cacheId: 14 size: 0x10000 next: 0x3B4
       cache:   offset: 0x3B4 cacheId: 15 size: 0x80000
       thread:  offset: 0x3D0 parent: 0x360 cpuId: 0x10
       thread:  offset: 0x3E4 parent: 0x360 cpuId: 0x11
       thread:  offset: 0x3F8 parent: 0x360 cpuId: 0x12
       thread:  offset: 0x40C parent: 0x360 cpuId: 0x13
     core:      offset: 0x420 parent: 0x28C cpuId: 0x0 L1i: 0x43C L1d: 0x458
       cache:   offset: 0x43C cacheId: 16 size: 0x10000 next: 0x474
       cache:   offset: 0x458 cacheId: 17 size: 0x10000 next: 0x474
       cache:   offset: 0x474 cacheId: 18 size: 0x80000
       thread:  offset: 0x490 parent: 0x420 cpuId: 0x14
       thread:  offset: 0x4A4 parent: 0x420 cpuId: 0x15
       thread:  offset: 0x4B8 parent: 0x420 cpuId: 0x16
       thread:  offset: 0x4CC parent: 0x420 cpuId: 0x17

You may find it useful. I tested it only with cache at either core or
cluster level.
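
As a cross-check of the dump above, the expected node counts can be
derived from the -smp values. A minimal sketch, assuming three cache
nodes per core as in this particular output; the function name
pptt_counts is made up for the illustration and is not part of the
script.

```python
def pptt_counts(sockets, clusters, cores, threads, caches_per_core=3):
    """Expected number of PPTT nodes of each kind for a given -smp setup.

    caches_per_core=3 matches the dump above (L1i, L1d, and the cache
    they both point to via "next"); it is an assumption, not a rule.
    """
    total_clusters = sockets * clusters
    total_cores = total_clusters * cores
    return {
        "sockets": sockets,
        "clusters": total_clusters,
        "cores": total_cores,
        "threads": total_cores * threads,
        "caches": total_cores * caches_per_core,
    }


# -smp 24,sockets=1,clusters=2,cores=3,threads=4
counts = pptt_counts(sockets=1, clusters=2, cores=3, threads=4)
print(counts)
```

For this configuration the sketch gives 2 clusters, 6 cores, 24 threads
and 18 caches, which agrees with the dump: thread cpuId runs from 0x0 to
0x17 and cacheId from 1 to 18.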

Zhao Liu Sept. 2, 2024, 12:23 p.m. UTC | #4
On Mon, Sep 02, 2024 at 11:25:19AM +0100, Alireza Sanaee wrote:
> 
> Hi Zhao,
> 
> Yes, please keep me CCed. 
> 
> One thing that I noticed, sometimes, since you were going down the
> Intel path, some variables couldn't be NULL. But when I was gonna go
> down to ARM path, I faced some scenarios where I ended up with
> some uninit vars which is still OK but could have been avoided.

Ah, I didn't quite get your point. Could you please point out those
places in my patches? Then I can fix them in my next version. :)

Thanks,
Zhao

> Looking forward to the next revision.
> 
> Alireza