[RFC,v2,00/29] PowerPC interrupt rework

Message ID 20220927201544.4088567-1-matheus.ferst@eldorado.org.br (mailing list archive)

Message

Matheus K. Ferst Sept. 27, 2022, 8:15 p.m. UTC
Link to v1: https://lists.gnu.org/archive/html/qemu-ppc/2022-08/msg00370.html
This series is also available as a git branch: https://github.com/PPC64/qemu/tree/ferst-interrupt-fix-v2

This version addresses Fabiano's feedback and fixes some issues found
with the tests suggested by Cédric. While working on it, I found two
intermittent problems on master:

 i) ~10% of boots with pSeries and 970/970mp/POWER5+ hard lockup after
    either SCSI or network initialization when using -smp 4. With
    -smp 2, the problem is harder to reproduce but still happens, and I
    couldn't reproduce with thread=single.
ii) ~52% of KVM guest initializations on PowerNV hang in different parts
    of the boot process when using more than one CPU.

With the complete series applied, I couldn't reproduce (i) anymore, and
(ii) became a little more frequent (~58%).

I've tested each patch of this series with [1], modified to use -smp for
machines that support more than one CPU. The machines I can currently
boot with FreeBSD (970/970mp/POWER5+/POWER7/POWER8/POWER9 pSeries,
POWER8/POWER9 PowerNV, and mpc8544ds) were tested with the images from
[2] and still boot after applying the patch series. Booting nested
guests inside a TCG pSeries machine also seems to be working fine.
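
A nested setup of that kind can be sketched as follows (a minimal sketch,
not the exact command lines used; cap-nested-hv is the standard pSeries
machine capability for nested HV, and the image/file names are
placeholders):

./qemu-system-ppc64 -M pseries,cap-nested-hv=on -cpu POWER9 \
                -accel tcg,thread=multi -m 8G -smp 4 -nographic \
                -kernel zImage -initrd rootfs.cpio.xz -append 'console=hvc0'

and then, inside the L1 guest (assuming an image with QEMU and the kvm_hv
module available):

qemu-system-ppc64 -M pseries -accel kvm -m 2G -smp 2 -nographic \
                -kernel zImage -initrd rootfs.cpio.xz -append 'console=hvc0'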

Using command lines like:

./qemu-system-ppc64 -M powernv9 -cpu POWER9 -accel tcg,thread=multi \
                -m 8G -smp $SMP -vga none -nographic -kernel zImage \
                -append 'console=hvc0' -initrd rootfs.cpio.xz \
                -serial pipe:pipe -monitor unix:mon,server,nowait

and

./qemu-system-ppc64 -M pseries -cpu POWER9 -accel tcg,thread=multi \
                -m 8G -smp $SMP -vga none -nographic -kernel zImage \
                -append 'console=hvc0' -initrd rootfs.cpio.xz \
                -serial pipe:pipe -monitor unix:mon,server,nowait

to measure the time to boot, log in, and shut down a compressed kernel
with a buildroot initramfs, over 100 iterations, we get:

+-----+------------------------------+-----------------------------+
|     |            PowerNV           |           pSeries           |
|-smp |------------------------------+-----------------------------+
|     |     master    | patch series |    master    | patch series |
+-----+------------------------------+-----------------------------+
|  1  |  45,84 ± 0,92 | 38,08 ± 0,66 | 23,56 ± 1,16 | 23,76 ± 1,04 |
|  2  |  80,21 ± 8,03 | 40,81 ± 0,45 | 26,59 ± 0,92 | 26,88 ± 0,99 |
|  4  | 115,98 ± 9,85 | 38,80 ± 0,44 | 28,83 ± 0,84 | 28,46 ± 0,94 |
|  6  | 199,14 ± 6,36 | 39,32 ± 0,50 | 29,22 ± 0,78 | 29,45 ± 0,86 |
|  8  | 47,85 ± 27,50 | 38,98 ± 0,49 | 29,63 ± 0,80 | 29,60 ± 0,78 |
+-----+------------------------------+-----------------------------+

These results show that the problem reported in [3] is solved, while
pSeries boot time is essentially unchanged.
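
A rough sketch of how such a timing loop can be scripted (not the exact
harness used: it assumes an initramfs that logs in and powers off on its
own, omits the pipe/monitor plumbing shown above, and the output file name
is a placeholder):

for SMP in 1 2 4 6 8; do
    for i in $(seq 1 100); do
        # GNU time appends the -smp value and the elapsed wall-clock seconds
        /usr/bin/time -f "$SMP %e" -a -o boot-times.txt \
            ./qemu-system-ppc64 -M powernv9 -cpu POWER9 -accel tcg,thread=multi \
                -m 8G -smp $SMP -vga none -nographic -kernel zImage \
                -append 'console=hvc0' -initrd rootfs.cpio.xz -no-reboot
    done
done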

With a non-compressed kernel, the difference on PowerNV is smaller, and
pSeries stays the same:

+-----+------------------------------+-----------------------------+
|     |            PowerNV           |           pSeries           |
|-smp |------------------------------+-----------------------------+
|     |     master    | patch series |    master    | patch series |
+-----+------------------------------+-----------------------------+
|  1  |  42,17 ± 0,92 | 38,13 ± 0,59 | 23,15 ± 1,02 | 23,46 ± 1,02 |
|  2  |  55,72 ± 3,54 | 40,30 ± 0,56 | 26,26 ± 0,82 | 26,38 ± 0,80 |
|  4  |  67,09 ± 3,02 | 38,26 ± 0,47 | 28,36 ± 0,77 | 28,19 ± 0,78 |
|  6  |  98,96 ± 2,49 | 39,01 ± 0,38 | 28,68 ± 0,75 | 29,02 ± 0,88 |
|  8  |  39,68 ± 0,42 | 38,44 ± 0,41 | 29,24 ± 0,81 | 29,44 ± 0,75 |
+-----+------------------------------+-----------------------------+

Finally, using command lines like

./qemu-system-ppc64 -M powernv9 -cpu POWER9 -accel tcg,thread=multi \
    -m 8G -smp 4 -device virtio-scsi-pci -boot c -vga none -nographic \
    -device nvme,bus=pcie.2,addr=0x0,drive=drive0,serial=1234 \
    -drive file=rootfs.ext2,if=none,id=drive0,format=raw,cache=none \
    -snapshot -serial pipe:pipe -monitor unix:mon,server,nowait \
    -kernel zImage -append 'console=hvc0 rootwait root=/dev/nvme0n1' \
    -device virtio-net-pci,netdev=br0,mac=52:54:00:12:34:57,bus=pcie.0 \
    -netdev bridge,id=br0

and

./qemu-system-ppc64 -M pseries -cpu POWER9 -accel tcg,thread=multi \
    -m 8G -smp 4 -device virtio-scsi-pci -boot c -vga none -nographic \
    -drive file=rootfs.ext2,if=scsi,index=0,format=raw -snapshot \
    -kernel zImage -append 'console=hvc0 rootwait root=/dev/sda' \
    -serial pipe:pipe -monitor unix:mon,server,nowait \
    -device virtio-net-pci,netdev=br0,mac=52:54:00:12:34:57 \
    -netdev bridge,id=br0

to test IO performance, with iperf to test the network and a 4Gb scp
transfer to test disk+network, over 100 iterations we saw:

+---------------------+---------------+-----------------+
|                     |    scp (s)    |   iperf (MB/s)  |
+---------------------+---------------+-----------------+
|PowerNV master       | 166,91 ± 8,37 | 918,06 ± 114,78 |
|PowerNV patch series | 166,25 ± 8,85 | 916,91 ± 107,56 |
|pSeries master       | 175,70 ± 8,22 | 958,73 ± 115,09 |
|pSeries patch series | 173,62 ± 8,13 | 893,42 ±  87,77 |
+---------------------+---------------+-----------------+

The scp data shows little difference, while the network-only test is a bit
slower with the patch series applied (although, given this variation, we'd
probably need to repeat the test more times to get a more robust result).
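
The guest-side measurements can be sketched as follows (the host address,
file names, and iperf server setup are assumptions, not the exact commands
used):

# network only, against an iperf server started on the host bridge
iperf -c 192.168.1.1 -t 60

# disk+network: time an scp transfer of a ~4Gb file onto the guest disk
time scp user@192.168.1.1:/srv/testfile-4g /root/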

[1] https://github.com/legoater/qemu-ppc-boot
[2] https://artifact.ci.freebsd.org/snapshot/14.0-CURRENT/latest_vm/powerpc
[3] https://lists.gnu.org/archive/html/qemu-ppc/2022-06/msg00336.html

Matheus Ferst (29):
  target/ppc: define PPC_INTERRUPT_* values directly
  target/ppc: always use ppc_set_irq to set env->pending_interrupts
  target/ppc: split interrupt masking and delivery from ppc_hw_interrupt
  target/ppc: prepare to split interrupt masking and delivery by excp_model
  target/ppc: create an interrupt masking method for POWER9/POWER10
  target/ppc: remove unused interrupts from p9_pending_interrupt
  target/ppc: create an interrupt deliver method for POWER9/POWER10
  target/ppc: remove unused interrupts from p9_deliver_interrupt
  target/ppc: remove generic architecture checks from p9_deliver_interrupt
  target/ppc: move power-saving interrupt masking out of cpu_has_work_POWER9
  target/ppc: add power-saving interrupt masking logic to p9_next_unmasked_interrupt
  target/ppc: create an interrupt masking method for POWER8
  target/ppc: remove unused interrupts from p8_pending_interrupt
  target/ppc: create an interrupt deliver method for POWER8
  target/ppc: remove unused interrupts from p8_deliver_interrupt
  target/ppc: remove generic architecture checks from p8_deliver_interrupt
  target/ppc: move power-saving interrupt masking out of cpu_has_work_POWER8
  target/ppc: add power-saving interrupt masking logic to p8_next_unmasked_interrupt
  target/ppc: create an interrupt masking method for POWER7
  target/ppc: remove unused interrupts from p7_pending_interrupt
  target/ppc: create an interrupt deliver method for POWER7
  target/ppc: remove unused interrupts from p7_deliver_interrupt
  target/ppc: remove generic architecture checks from p7_deliver_interrupt
  target/ppc: move power-saving interrupt masking out of cpu_has_work_POWER7
  target/ppc: add power-saving interrupt masking logic to p7_next_unmasked_interrupt
  target/ppc: remove ppc_store_lpcr from CONFIG_USER_ONLY builds
  target/ppc: introduce ppc_maybe_interrupt
  target/ppc: unify cpu->has_work based on cs->interrupt_request
  target/ppc: move the p*_interrupt_powersave methods to excp_helper.c

 hw/ppc/pnv_core.c        |   1 +
 hw/ppc/ppc.c             |  17 +-
 hw/ppc/spapr_hcall.c     |   6 +
 hw/ppc/spapr_rtas.c      |   2 +-
 hw/ppc/trace-events      |   2 +-
 target/ppc/cpu.c         |   4 +
 target/ppc/cpu.h         |  43 +-
 target/ppc/cpu_init.c    | 212 +---------
 target/ppc/excp_helper.c | 857 ++++++++++++++++++++++++++++++++++-----
 target/ppc/helper.h      |   1 +
 target/ppc/helper_regs.c |   2 +
 target/ppc/misc_helper.c |  11 +-
 target/ppc/translate.c   |   2 +
 13 files changed, 803 insertions(+), 357 deletions(-)

Comments

Cédric Le Goater Sept. 28, 2022, 5:31 p.m. UTC | #1
Hello Matheus,

On 9/27/22 22:15, Matheus Ferst wrote:
> Link to v1: https://lists.gnu.org/archive/html/qemu-ppc/2022-08/msg00370.html
> This series is also available as a git branch: https://github.com/PPC64/qemu/tree/ferst-interrupt-fix-v2

This is impressive work on QEMU PPC.

> This version addresses Fabiano's feedback and fixes some issues found
> with the tests suggested by Cédric. While working on it, I found two
> intermittent problems on master:
> 
>   i) ~10% of boots with pSeries and 970/970mp/POWER5+ hard lockup after

These CPUs never got real attention with KVM. The FW was even broken
before 7.0.

>      either SCSI or network initialization when using -smp 4. With
>      -smp 2, the problem is harder to reproduce but still happens, and I
>      couldn't reproduce with thread=single.
> ii) ~52% of KVM guest initializations on PowerNV hang in different parts
>      of the boot process when using more than one CPU.

Do you mean when the guest is SMP or the host?

> With the complete series applied, I couldn't reproduce (i) anymore, 

Super! Models are getting better. This is nice for the 970.

> and (ii) became a little more frequent (~58%).

Have you checked 'info pic'? XIVE is in charge of vCPU scheduling.
Could you please check with powersave=off in the host kernel also?

> I've tested each patch of this series with [1], modified to use -smp for
> machines that support more than one CPU. The machines I can currently
> boot with FreeBSD (970/970,p/POWER5+/POWER7/POWER8/POWER9 pSeries,
> POWER8/POWER9 PowerNV, and mpc8544ds) were tested with the images from
> [2] and still boot after applying the patch series. Booting nested
> guests inside a TCG pSeries machine also seems to be working fine.
> 
> Using command lines like:
> 
> ./qemu-system-ppc64 -M powernv9 -cpu POWER9 -accel tcg,thread=multi \
>                  -m 8G -smp $SMP -vga none -nographic -kernel zImage \
>                  -append 'console=hvc0' -initrdootfs.cpio.xz \
>                  -serial pipe:pipe -monitor unix:mon,server,nowait
> 
> and
> 
> ./qemu-system-ppc64 -M pseries -cpu POWER9 -accel tcg,thread=multi \
>                  -m 8G -smp $SMP -vga none -nographic -kernel zImage \
>                  -append 'console=hvc0' -initrd rootfs.cpio.xz \
>                  -serial pipe:pipe -monitor unix:mon,server,nowait
> 
> to measure the time to boot, login, and shut down a compressed kernel
> with a buildroot initramfs, with 100 iteration we get:
> 
> +-----+------------------------------+-----------------------------+
> |     |            PowerNV           |           pSeries           |
> |-smp |------------------------------+-----------------------------+
> |     |     master    | patch series |    master    | patch series |
> +-----+------------------------------+-----------------------------+
> |  1  |  45,84 ± 0,92 | 38,08 ± 0,66 | 23,56 ± 1,16 | 23,76 ± 1,04 |
> |  2  |  80,21 ± 8,03 | 40,81 ± 0,45 | 26,59 ± 0,92 | 26,88 ± 0,99 |
> |  4  | 115,98 ± 9,85 | 38,80 ± 0,44 | 28,83 ± 0,84 | 28,46 ± 0,94 |
> |  6  | 199,14 ± 6,36 | 39,32 ± 0,50 | 29,22 ± 0,78 | 29,45 ± 0,86 |
> |  8  | 47,85 ± 27,50 | 38,98 ± 0,49 | 29,63 ± 0,80 | 29,60 ± 0,78 |
> +-----+------------------------------+-----------------------------+
> 
> This results shows that the problem reported in [3] is solved, while

Yes. Nice work! The PowerNV results with -smp 8 on master are unexpected.
Did you do some profiling also?
  
> pSeries boot time is essentially unchanged.
>
> 
> With a non-compressed kernel, the difference with PowerNV is smaller,
> and pSeries stills the same:
> 
> +-----+------------------------------+-----------------------------+
> |     |            PowerNV           |           pSeries           |
> |-smp |------------------------------+-----------------------------+
> |     |     master    | patch series |    master    | patch series |
> +-----+------------------------------+-----------------------------+
> |  1  |  42,17 ± 0,92 | 38,13 ± 0,59 | 23,15 ± 1,02 | 23,46 ± 1,02 |
> |  2  |  55,72 ± 3,54 | 40,30 ± 0,56 | 26,26 ± 0,82 | 26,38 ± 0,80 |
> |  4  |  67,09 ± 3,02 | 38,26 ± 0,47 | 28,36 ± 0,77 | 28,19 ± 0,78 |
> |  6  |  98,96 ± 2,49 | 39,01 ± 0,38 | 28,68 ± 0,75 | 29,02 ± 0,88 |
> |  8  |  39,68 ± 0,42 | 38,44 ± 0,41 | 29,24 ± 0,81 | 29,44 ± 0,75 |
> +-----+------------------------------+-----------------------------+
> 
> Finally, using command lines like
> 
> ./qemu-system-ppc64 -M powernv9 -cpu POWER9 -accel tcg,thread=multi \
>      -m 8G -smp 4 -device virtio-scsi-pci -boot c -vga none -nographic \
>      -device nvme,bus=pcie.2,addr=0x0,drive=drive0,serial=1234 \
>      -drive file=rootfs.ext2,if=none,id=drive0,format=raw,cache=none \
>      -snapshot -serial pipe:pipe -monitor unix:mon,server,nowait \
>      -kernel zImage -append 'console=hvc0 rootwait root=/dev/nvme0n1' \
>      -device virtio-net-pci,netdev=br0,mac=52:54:00:12:34:57,bus=pcie.0 \
>      -netdev bridge,id=br0
> 
> and
> 
> ./qemu-system-ppc64 -M pseries -cpu POWER9 -accel tcg,thread=multi \
>      -m 8G -smp 4 -device virtio-scsi-pci -boot c -vga none -nographic \
>      -drive file=rootfs.ext2,if=scsi,index=0,format=raw -snapshot \
>      -kernel zImage -append 'console=hvc0 rootwait root=/dev/sda' \
>      -serial pipe:pipe -monitor unix:mon,server,nowait \
>      -device virtio-net-pci,netdev=br0,mac=52:54:00:12:34:57 \
>      -netdev bridge,id=br0
> 
> to tests IO performance, with iperf to test network and a 4Gb scp
> transfer to test disk+network, in 100 iterations we saw:
> 
> +---------------------+---------------+-----------------+
> |                     |    scp (s)    |   iperf (MB/s)  |
> +---------------------+---------------+-----------------+
> |PowerNV master       | 166,91 ± 8,37 | 918,06 ± 114,78 |
> |PowerNV patch series | 166,25 ± 8,85 | 916,91 ± 107,56 |
> |pSeries master       | 175,70 ± 8,22 | 958,73 ± 115,09 |
> |pSeries patch series | 173,62 ± 8,13 | 893,42 ±  87,77 |
> +---------------------+---------------+-----------------+

These are SMP machines under high IO load using MTTCG. It means
that the models are quite robust now.

> The scp data shows little difference, while testing just network shows
> that it's a bit slower with the patch series applied (although, with
> this variation, we'd probably need to repeat this test more times to
> have a more robust result...)

You could try with powersave=off.

Thanks,

C.



> [1] https://github.com/legoater/qemu-ppc-boot
> [2] https://artifact.ci.freebsd.org/snapshot/14.0-CURRENT/latest_vm/powerpc
> [3] https://lists.gnu.org/archive/html/qemu-ppc/2022-06/msg00336.html
> 
> Matheus Ferst (29):
>    target/ppc: define PPC_INTERRUPT_* values directly
>    target/ppc: always use ppc_set_irq to set env->pending_interrupts
>    target/ppc: split interrupt masking and delivery from ppc_hw_interrupt
>    target/ppc: prepare to split interrupt masking and delivery by excp_model
>    target/ppc: create an interrupt masking method for POWER9/POWER10
>    target/ppc: remove unused interrupts from p9_pending_interrupt
>    target/ppc: create an interrupt deliver method for POWER9/POWER10
>    target/ppc: remove unused interrupts from p9_deliver_interrupt
>    target/ppc: remove generic architecture checks from p9_deliver_interrupt
>    target/ppc: move power-saving interrupt masking out of cpu_has_work_POWER9
>    target/ppc: add power-saving interrupt masking logic to p9_next_unmasked_interrupt
>    target/ppc: create an interrupt masking method for POWER8
>    target/ppc: remove unused interrupts from p8_pending_interrupt
>    target/ppc: create an interrupt deliver method for POWER8
>    target/ppc: remove unused interrupts from p8_deliver_interrupt
>    target/ppc: remove generic architecture checks from p8_deliver_interrupt
>    target/ppc: move power-saving interrupt masking out of cpu_has_work_POWER8
>    target/ppc: add power-saving interrupt masking logic to p8_next_unmasked_interrupt
>    target/ppc: create an interrupt masking method for POWER7
>    target/ppc: remove unused interrupts from p7_pending_interrupt
>    target/ppc: create an interrupt deliver method for POWER7
>    target/ppc: remove unused interrupts from p7_deliver_interrupt
>    target/ppc: remove generic architecture checks from p7_deliver_interrupt
>    target/ppc: move power-saving interrupt masking out of cpu_has_work_POWER7
>    target/ppc: add power-saving interrupt masking logic to p7_next_unmasked_interrupt
>    target/ppc: remove ppc_store_lpcr from CONFIG_USER_ONLY builds
>    target/ppc: introduce ppc_maybe_interrupt
>    target/ppc: unify cpu->has_work based on cs->interrupt_request
>    target/ppc: move the p*_interrupt_powersave methods to excp_helper.c
> 
>   hw/ppc/pnv_core.c        |   1 +
>   hw/ppc/ppc.c             |  17 +-
>   hw/ppc/spapr_hcall.c     |   6 +
>   hw/ppc/spapr_rtas.c      |   2 +-
>   hw/ppc/trace-events      |   2 +-
>   target/ppc/cpu.c         |   4 +
>   target/ppc/cpu.h         |  43 +-
>   target/ppc/cpu_init.c    | 212 +---------
>   target/ppc/excp_helper.c | 857 ++++++++++++++++++++++++++++++++++-----
>   target/ppc/helper.h      |   1 +
>   target/ppc/helper_regs.c |   2 +
>   target/ppc/misc_helper.c |  11 +-
>   target/ppc/translate.c   |   2 +
>   13 files changed, 803 insertions(+), 357 deletions(-)
>
Matheus K. Ferst Oct. 3, 2022, 3:45 p.m. UTC | #2
On 28/09/2022 14:31, Cédric Le Goater wrote:
> Hello Matheus,
> 
> On 9/27/22 22:15, Matheus Ferst wrote:
>> Link to v1: 
>> https://lists.gnu.org/archive/html/qemu-ppc/2022-08/msg00370.html
>> This series is also available as a git branch: 
>> https://github.com/PPC64/qemu/tree/ferst-interrupt-fix-v2
> 
> This is impressive work on QEMU PPC.
> 
>> This version addresses Fabiano's feedback and fixes some issues found
>> with the tests suggested by Cédric. While working on it, I found two
>> intermittent problems on master:
>>
>>   i) ~10% of boots with pSeries and 970/970mp/POWER5+ hard lockup after
> 
> These CPUs never got real attention with KVM. The FW was even broken
> before 7.0.
> 
>>      either SCSI or network initialization when using -smp 4. With
>>      -smp 2, the problem is harder to reproduce but still happens, and I
>>      couldn't reproduce with thread=single.
>> ii) ~52% of KVM guest initializations on PowerNV hang in different parts
>>      of the boot process when using more than one CPU.
> 
> Do you mean when the guest is SMP or the host ?

I should've added more details: this percentage was from testing powernv9 
with "-smp 4" and a pSeries-POWER9 guest with "-smp 4", but I can also 
reproduce it with a multi-thread L0 and a single-thread L1. The firmware 
prints messages like:

Could not set special wakeup on 0:1: timeout waiting for SPECIAL_WKUP_DONE.

when it hangs, but I also have this message on some successful boots.

> 
>> With the complete series applied, I couldn't reproduce (i) anymore,
> 
> Super ! Models are getting better. This is nice for the 970.
> 
>> and (ii) became a little more frequent (~58%).
> 
> Have you checked 'info pic' ? XIVE is in charge of vCPU scheduling.

I don't have much knowledge in this area yet, so I don't know what to 
look for, but if it's useful, here is the output of the command when the 
problem occurs with a 4 core L0 and a single core L1:

(qemu) info pic
info pic
CPU[0000]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
CPU[0000]: USER    00   00  00    00   00  00  00   00  00000000
CPU[0000]:   OS    00   00  00    ff   ff  00  ff   ff  00000000
CPU[0000]: POOL    00   00  00    ff   00  00  00   00  00000000
CPU[0000]: PHYS    00   ff  00    00   00  00  00   ff  80000000
CPU[0001]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
CPU[0001]: USER    00   00  00    00   00  00  00   00  00000000
CPU[0001]:   OS    00   00  00    ff   ff  00  ff   ff  00000000
CPU[0001]: POOL    00   00  00    ff   00  00  00   00  00000001
CPU[0001]: PHYS    00   ff  00    00   00  00  00   ff  80000000
CPU[0002]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
CPU[0002]: USER    00   00  00    00   00  00  00   00  00000000
CPU[0002]:   OS    00   00  00    ff   ff  00  ff   ff  00000000
CPU[0002]: POOL    00   00  00    ff   00  00  00   00  00000002
CPU[0002]: PHYS    00   ff  00    00   00  00  00   ff  80000000
CPU[0003]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
CPU[0003]: USER    00   00  00    00   00  00  00   00  00000000
CPU[0003]:   OS    00   ff  00    00   ff  00  ff   ff  00000004
CPU[0003]: POOL    00   00  00    ff   00  00  00   00  00000003
CPU[0003]: PHYS    00   ff  00    00   00  00  00   ff  80000000
XIVE[0] #0 Source 00000000 .. 000fffff
   00000014 MSI --
   00000015 MSI --
   00000016 MSI --
   00000017 MSI --
   00000018 MSI --
   00000019 MSI --
   0000001a MSI --
   0000001b MSI --
   0000001e MSI P-
   00000023 MSI --
   00000024 MSI --
   00000025 MSI --
   00000026 MSI --
XIVE[0] #0 EAT 00000000 .. 000fffff
   00000014   end:00/000f data:00000010
   00000015   end:00/0017 data:00000010
   00000016   end:00/001f data:00000010
   00000017   end:00/0027 data:00000010
   00000018   end:00/004e data:00000010
   00000019   end:00/004e data:00000012
   0000001a   end:00/004e data:0000001b
   0000001b   end:00/004e data:00000013
   0000001e   end:00/004e data:00000016
   00000023   end:00/004e data:00000017
   00000024   end:00/004e data:00000018
   00000025   end:00/004e data:00000019
   00000026   end:00/004e data:0000001a
   000fb000   end:00/001f data:00000030
   000fb001   end:00/0027 data:00000031
   000fb002   end:00/000f data:00000032
   000fb003   end:00/000f data:00000033
   000fb004   end:00/0017 data:00000034
   000fb005   end:00/001f data:00000035
   000fb006   end:00/0027 data:00000036
   000fb7fe   end:00/000f data:00000029
   000fb7ff   end:00/0017 data:0000002a
   000fbffe   end:00/001f data:00000027
   000fbfff   end:00/0027 data:00000028
   000fcffe   end:00/000f data:00000025
   000fcfff   end:00/0017 data:00000026
   000fd000   end:00/001f data:00000037
   000fd001   end:00/000f data:00000038
   000fd002   end:00/0017 data:00000039
   000fd003   end:00/001f data:0000003a
   000fd004   end:00/0027 data:0000003b
   000fd7fe   end:00/001f data:00000023
   000fd7ff   end:00/0027 data:00000024
   000fdffe   end:00/000f data:00000021
   000fdfff   end:00/0017 data:00000022
   000feffe   end:00/001f data:0000001f
   000fefff   end:00/0027 data:00000020
   000ffff0   end:00/000f data:00000011
   000ffff1   end:00/0017 data:00000012
   000ffff2   end:00/001f data:00000013
   000ffff3   end:00/0027 data:00000014
   000ffff4   end:00/000f data:00000015
   000ffff5   end:00/0017 data:00000016
   000ffff6   end:00/001f data:00000017
   000ffff7   end:00/0027 data:00000018
   000ffff8   end:00/000f data:00000019
   000ffff9   end:00/0017 data:0000001a
   000ffffa   end:00/001f data:0000001b
   000ffffb   end:00/0027 data:0000001c
   000ffffc   end:00/000f data:0000001d
   000ffffd   end:00/0017 data:0000001e
XIVE[0] #0 ENDT
   0000000f -Q vqnb---f prio:7 nvt:00/0080 eq:@03400000   825/16384 ^1 [ 8000004f 8000004f 8000004f 8000004f 8000004f ^00000000 ]
   00000017 -Q vqnb---f prio:7 nvt:00/0084 eq:@03750000  1048/16384 ^1 [ 8000001e 8000001e 8000001e 8000001e 8000001e ^00000000 ]
   0000001f -Q vqnb---f prio:7 nvt:00/0088 eq:@037f0000   154/16384 ^1 [ 8000003a 8000003a 8000003a 8000003a 8000003a ^00000000 ]
   00000027 -Q vqnb---f prio:7 nvt:00/008c eq:@038a0000   340/16384 ^1 [ 80000014 80000014 80000014 80000014 8000003b ^00000000 ]
   0000004e -Q vqnbeu-- prio:6 nvt:00/0004 eq:@1d170000  1104/16384 ^1 [ 80000016 80000016 80000016 80000016 80000016 ^00000000 ]
   0000004f -Q v--be-s- prio:0 nvt:00/0000
XIVE[0] #0 END Escalation EAT
   0000004e -Q    end:00/004f data:00000000
   0000004f P-    end:00/000f data:0000004f
XIVE[0] #0 NVTT 00000000 .. 0007ffff
   00000000 end:00/0028 IPB:00
   00000001 end:00/0030 IPB:00
   00000002 end:00/0038 IPB:00
   00000003 end:00/0040 IPB:00
   00000004 end:00/0048 IPB:02
   00000080 end:00/0008 IPB:00
   00000084 end:00/0010 IPB:00
   00000088 end:00/0018 IPB:00
   0000008c end:00/0020 IPB:00
PSIHB Source 000ffff0 .. 000ffffd
   000ffff0 LSI --
   000ffff1 LSI --
   000ffff2 LSI --
   000ffff3 LSI --
   000ffff4 LSI --
   000ffff5 LSI --
   000ffff6 LSI --
   000ffff7 LSI --
   000ffff8 LSI --
   000ffff9 LSI --
   000ffffa LSI --
   000ffffb LSI --
   000ffffc LSI --
   000ffffd LSI --
PHB4[0:0] Source 000fe000 .. 000fefff  @6030203110100
   00000ffe LSI --
   00000fff LSI --
PHB4[0:5] Source 000fb000 .. 000fb7ff  @6030203110228
   00000000 MSI --
   00000001 MSI --
   00000002 MSI --
   00000003 MSI --
   00000004 MSI --
   00000005 MSI --
   00000006 MSI --
   000007fe LSI --
   000007ff LSI --
PHB4[0:4] Source 000fb800 .. 000fbfff  @6030203110220
   000007fe LSI --
   000007ff LSI --
PHB4[0:3] Source 000fc000 .. 000fcfff  @6030203110218
   00000ffe LSI --
   00000fff LSI --
PHB4[0:2] Source 000fd000 .. 000fd7ff  @6030203110210
   00000000 MSI --
   00000001 MSI --
   00000002 MSI --
   00000003 MSI --
   00000004 MSI --
   000007fe LSI --
   000007ff LSI --
PHB4[0:1] Source 000fd800 .. 000fdfff  @6030203110208
   000007fe LSI --
   000007ff LSI --

> Could you please check with powersave=off in the host kernel also ?
> 

It still hangs with this option.
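
That is, with powersave=off appended to the host kernel command line; a
sketch, keeping the rest of the arguments as in the earlier PowerNV command
line:

./qemu-system-ppc64 -M powernv9 -cpu POWER9 -accel tcg,thread=multi \
                -m 8G -smp 4 -vga none -nographic -kernel zImage \
                -append 'console=hvc0 powersave=off' -initrd rootfs.cpio.xz \
                -serial pipe:pipe -monitor unix:mon,server,nowait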

>> I've tested each patch of this series with [1], modified to use -smp for
>> machines that support more than one CPU. The machines I can currently
>> boot with FreeBSD (970/970,p/POWER5+/POWER7/POWER8/POWER9 pSeries,
>> POWER8/POWER9 PowerNV, and mpc8544ds) were tested with the images from
>> [2] and still boot after applying the patch series. Booting nested
>> guests inside a TCG pSeries machine also seems to be working fine.
>>
>> Using command lines like:
>>
>> ./qemu-system-ppc64 -M powernv9 -cpu POWER9 -accel tcg,thread=multi \
>>                  -m 8G -smp $SMP -vga none -nographic -kernel zImage \
>>                  -append 'console=hvc0' -initrdootfs.cpio.xz \
>>                  -serial pipe:pipe -monitor unix:mon,server,nowait
>>
>> and
>>
>> ./qemu-system-ppc64 -M pseries -cpu POWER9 -accel tcg,thread=multi \
>>                  -m 8G -smp $SMP -vga none -nographic -kernel zImage \
>>                  -append 'console=hvc0' -initrd rootfs.cpio.xz \
>>                  -serial pipe:pipe -monitor unix:mon,server,nowait
>>
>> to measure the time to boot, login, and shut down a compressed kernel
>> with a buildroot initramfs, with 100 iteration we get:
>>
>> +-----+------------------------------+-----------------------------+
>> |     |            PowerNV           |           pSeries           |
>> |-smp |------------------------------+-----------------------------+
>> |     |     master    | patch series |    master    | patch series |
>> +-----+------------------------------+-----------------------------+
>> |  1  |  45,84 ± 0,92 | 38,08 ± 0,66 | 23,56 ± 1,16 | 23,76 ± 1,04 |
>> |  2  |  80,21 ± 8,03 | 40,81 ± 0,45 | 26,59 ± 0,92 | 26,88 ± 0,99 |
>> |  4  | 115,98 ± 9,85 | 38,80 ± 0,44 | 28,83 ± 0,84 | 28,46 ± 0,94 |
>> |  6  | 199,14 ± 6,36 | 39,32 ± 0,50 | 29,22 ± 0,78 | 29,45 ± 0,86 |
>> |  8  | 47,85 ± 27,50 | 38,98 ± 0,49 | 29,63 ± 0,80 | 29,60 ± 0,78 |
>> +-----+------------------------------+-----------------------------+
>>
>> This results shows that the problem reported in [3] is solved, while
> 
> Yes. Nice work ! The PowerNV results with -smp 8 on master are unexpected.
> Did you do some profiling also ?
> 

We noticed in the original thread, when Frederic reported the issue, that 
this happens when -smp >= $(nproc), but I haven't looked too deeply into 
this case. Maybe some optimization in the Linux mutex implementation helps 
in the higher-contention case?

>> pSeries boot time is essentially unchanged.
>>
>>
>> With a non-compressed kernel, the difference with PowerNV is smaller,
>> and pSeries stills the same:
>>
>> +-----+------------------------------+-----------------------------+
>> |     |            PowerNV           |           pSeries           |
>> |-smp |------------------------------+-----------------------------+
>> |     |     master    | patch series |    master    | patch series |
>> +-----+------------------------------+-----------------------------+
>> |  1  |  42,17 ± 0,92 | 38,13 ± 0,59 | 23,15 ± 1,02 | 23,46 ± 1,02 |
>> |  2  |  55,72 ± 3,54 | 40,30 ± 0,56 | 26,26 ± 0,82 | 26,38 ± 0,80 |
>> |  4  |  67,09 ± 3,02 | 38,26 ± 0,47 | 28,36 ± 0,77 | 28,19 ± 0,78 |
>> |  6  |  98,96 ± 2,49 | 39,01 ± 0,38 | 28,68 ± 0,75 | 29,02 ± 0,88 |
>> |  8  |  39,68 ± 0,42 | 38,44 ± 0,41 | 29,24 ± 0,81 | 29,44 ± 0,75 |
>> +-----+------------------------------+-----------------------------+
>>
>> Finally, using command lines like
>>
>> ./qemu-system-ppc64 -M powernv9 -cpu POWER9 -accel tcg,thread=multi \
>>      -m 8G -smp 4 -device virtio-scsi-pci -boot c -vga none -nographic \
>>      -device nvme,bus=pcie.2,addr=0x0,drive=drive0,serial=1234 \
>>      -drive file=rootfs.ext2,if=none,id=drive0,format=raw,cache=none \
>>      -snapshot -serial pipe:pipe -monitor unix:mon,server,nowait \
>>      -kernel zImage -append 'console=hvc0 rootwait root=/dev/nvme0n1' \
>>      -device virtio-net-pci,netdev=br0,mac=52:54:00:12:34:57,bus=pcie.0 \
>>      -netdev bridge,id=br0
>>
>> and
>>
>> ./qemu-system-ppc64 -M pseries -cpu POWER9 -accel tcg,thread=multi \
>>      -m 8G -smp 4 -device virtio-scsi-pci -boot c -vga none -nographic \
>>      -drive file=rootfs.ext2,if=scsi,index=0,format=raw -snapshot \
>>      -kernel zImage -append 'console=hvc0 rootwait root=/dev/sda' \
>>      -serial pipe:pipe -monitor unix:mon,server,nowait \
>>      -device virtio-net-pci,netdev=br0,mac=52:54:00:12:34:57 \
>>      -netdev bridge,id=br0
>>
>> to tests IO performance, with iperf to test network and a 4Gb scp
>> transfer to test disk+network, in 100 iterations we saw:
>>
>> +---------------------+---------------+-----------------+
>> |                     |    scp (s)    |   iperf (MB/s)  |
>> +---------------------+---------------+-----------------+
>> |PowerNV master       | 166,91 ± 8,37 | 918,06 ± 114,78 |
>> |PowerNV patch series | 166,25 ± 8,85 | 916,91 ± 107,56 |
>> |pSeries master       | 175,70 ± 8,22 | 958,73 ± 115,09 |
>> |pSeries patch series | 173,62 ± 8,13 | 893,42 ±  87,77 |
>> +---------------------+---------------+-----------------+
> 
> These are SMP machines under high IO load using MTTCG. It means
> that the models are quite robust now.
> 
>> The scp data shows little difference, while testing just network shows
>> that it's a bit slower with the patch series applied (although, with
>> this variation, we'd probably need to repeat this test more times to
>> have a more robust result...)
> 
> You could try with powersave=off.
> 

Not a big difference, with 50 iterations:

+---------------------+---------------+-----------------+
|                     |    scp (s)    |   iperf (MB/s)  |
+---------------------+---------------+-----------------+
|PowerNV master       | 142.73 ± 8.38 | 924.34 ± 353.93 |
|PowerNV patch series | 145.75 ± 9.18 | 874.52 ± 286.21 |
+---------------------+---------------+-----------------+

Thanks,
Matheus K. Ferst
Instituto de Pesquisas ELDORADO <http://www.eldorado.org.br/>
Software Analyst
Legal Notice - Disclaimer <https://www.eldorado.org.br/disclaimer.html>
Cédric Le Goater Oct. 3, 2022, 8:58 p.m. UTC | #3
> (qemu) info pic
> info pic
> CPU[0000]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
> CPU[0000]: USER    00   00  00    00   00  00  00   00  00000000
> CPU[0000]:   OS    00   00  00    ff   ff  00  ff   ff  00000000
> CPU[0000]: POOL    00   00  00    ff   00  00  00   00  00000000
> CPU[0000]: PHYS    00   ff  00    00   00  00  00   ff  80000000
> CPU[0001]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
> CPU[0001]: USER    00   00  00    00   00  00  00   00  00000000
> CPU[0001]:   OS    00   00  00    ff   ff  00  ff   ff  00000000
> CPU[0001]: POOL    00   00  00    ff   00  00  00   00  00000001
> CPU[0001]: PHYS    00   ff  00    00   00  00  00   ff  80000000
> CPU[0002]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
> CPU[0002]: USER    00   00  00    00   00  00  00   00  00000000
> CPU[0002]:   OS    00   00  00    ff   ff  00  ff   ff  00000000
> CPU[0002]: POOL    00   00  00    ff   00  00  00   00  00000002
> CPU[0002]: PHYS    00   ff  00    00   00  00  00   ff  80000000
> CPU[0003]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR  W2
> CPU[0003]: USER    00   00  00    00   00  00  00   00  00000000
> CPU[0003]:   OS    00   ff  00    00   ff  00  ff   ff  00000004

vCPU 4 was scheduled to run on this CPU at some point, but it is not
anymore: no VALID bit.

> CPU[0003]: POOL    00   00  00    ff   00  00  00   00  00000003
> CPU[0003]: PHYS    00   ff  00    00   00  00  00   ff  80000000
> XIVE[0] #0 Source 00000000 .. 000fffff
>    00000014 MSI --
>    00000015 MSI --
>    00000016 MSI --
>    00000017 MSI --
>    00000018 MSI --
>    00000019 MSI --
>    0000001a MSI --
>    0000001b MSI --
>    0000001e MSI P-

The 0x1E HW interrupt (virtual device) is pending. And not queued.

>    00000023 MSI --
>    00000024 MSI --
>    00000025 MSI --
>    00000026 MSI --
> XIVE[0] #0 EAT 00000000 .. 000fffff
>    00000014   end:00/000f data:00000010
>    00000015   end:00/0017 data:00000010
>    00000016   end:00/001f data:00000010
>    00000017   end:00/0027 data:00000010 -> 0x10 == CPU IPI 
>    00000018   end:00/004e data:00000010 -> This is the vCPU IPI 
>    00000019   end:00/004e data:00000012
>    0000001a   end:00/004e data:0000001b
>    0000001b   end:00/004e data:00000013
>    0000001e   end:00/004e data:00000016

Notifications of the 0x1E HW interrupt will be pushed on vCPU 0 queue 0x4e,
with (Linux) effective interrupt number 0x16, maybe the console.

>    00000023   end:00/004e data:00000017
>    00000024   end:00/004e data:00000018
>    00000025   end:00/004e data:00000019
>    00000026   end:00/004e data:0000001a

Then follow the PHB interrupts, MSIs and LSIs.

>    000fb000   end:00/001f data:00000030
>    000fb001   end:00/0027 data:00000031
>    000fb002   end:00/000f data:00000032
>    000fb003   end:00/000f data:00000033
>    000fb004   end:00/0017 data:00000034
>    000fb005   end:00/001f data:00000035
>    000fb006   end:00/0027 data:00000036
>    000fb7fe   end:00/000f data:00000029
>    000fb7ff   end:00/0017 data:0000002a
>    000fbffe   end:00/001f data:00000027
>    000fbfff   end:00/0027 data:00000028
>    000fcffe   end:00/000f data:00000025
>    000fcfff   end:00/0017 data:00000026
>    000fd000   end:00/001f data:00000037
>    000fd001   end:00/000f data:00000038
>    000fd002   end:00/0017 data:00000039
>    000fd003   end:00/001f data:0000003a
>    000fd004   end:00/0027 data:0000003b
>    000fd7fe   end:00/001f data:00000023
>    000fd7ff   end:00/0027 data:00000024
>    000fdffe   end:00/000f data:00000021
>    000fdfff   end:00/0017 data:00000022
>    000feffe   end:00/001f data:0000001f
>    000fefff   end:00/0027 data:00000020

The OPAL events come after:

>    000ffff0   end:00/000f data:00000011 
>    000ffff1   end:00/0017 data:00000012
>    000ffff2   end:00/001f data:00000013 
>    000ffff3   end:00/0027 data:00000014 # opal-psi#0:lpchc
>    000ffff4   end:00/000f data:00000015
>    000ffff5   end:00/0017 data:00000016
>    000ffff6   end:00/001f data:00000017
>    000ffff7   end:00/0027 data:00000018
>    000ffff8   end:00/000f data:00000019
>    000ffff9   end:00/0017 data:0000001a
>    000ffffa   end:00/001f data:0000001b
>    000ffffb   end:00/0027 data:0000001c
>    000ffffc   end:00/000f data:0000001d
>    000ffffd   end:00/0017 data:0000001e # opal-psi#0:psu ? 
> XIVE[0] #0 ENDT
>    0000000f -Q vqnb---f prio:7 nvt:00/0080 eq:@03400000   825/16384 ^1 [ 8000004f 8000004f 8000004f 8000004f 8000004f ^00000000 ]

event queue of host CPU 0 is filling up with escalation interrupt
numbers, 0x4f.

host CPU 0 (queue 0xf) is serving its own IPI, some MSIs, some EEH PCI
interrupts, and some OPAL events.

>    00000017 -Q vqnb---f prio:7 nvt:00/0084 eq:@03750000  1048/16384 ^1 [ 8000001e 8000001e 8000001e 8000001e 8000001e ^00000000 ]

hmm, host CPU 1 is serving 0xffffd = opal-psi#0:psu. May be too much.

>    0000001f -Q vqnb---f prio:7 nvt:00/0088 eq:@037f0000   154/16384 ^1 [ 8000003a 8000003a 8000003a 8000003a 8000003a ^00000000 ]

0x3a is an MSI.

>    00000027 -Q vqnb---f prio:7 nvt:00/008c eq:@038a0000   340/16384 ^1 [ 80000014 80000014 80000014 80000014 8000003b ^00000000 ]

This is the console (0x14), and 0x3b is an MSI.

>    0000004e -Q vqnbeu-- prio:6 nvt:00/0004 eq:@1d170000  1104/16384 ^1 [ 80000016 80000016 80000016 80000016 80000016 ^00000000 ]

0x4e (0x48 + 6) is the event queue number of guest's vCPU 0 prio 6.
0x16 is the Linux interrupt number in the guest of HW interrupt 0x1e,
the one pending.

>    0000004f -Q v--be-s- prio:0 nvt:00/0000

0x4f is the escalation queue of vCPU 0 (when the vCPU is not dispatched on
any HW thread); 0x4f is also a source interrupt number for escalations.

> XIVE[0] #0 END Escalation EAT
>    0000004e -Q    end:00/004f data:00000000
>    0000004f P-    end:00/000f data:0000004f

The 0x4f interrupt number is pending. vCPU 0 should be dispatched, but the
escalation interrupt has not been served by the hypervisor at this point in
time. Since it is not queued, we may have reached some deadlock?

> XIVE[0] #0 NVTT 00000000 .. 0007ffff
>    00000000 end:00/0028 IPB:00
>    00000001 end:00/0030 IPB:00
>    00000002 end:00/0038 IPB:00
>    00000003 end:00/0040 IPB:00
>    00000004 end:00/0048 IPB:02

         ^

0x4 is the vCPU 0 notification virtual target number, and an interrupt is
pending on prio 6. vCPU 0 has not acknowledged it yet, because vCPU 0 (NVT=4)
has not been dispatched on any HW thread, because the escalation interrupt
was not handled on the host (CPU 0 should have). The question is: what is
CPU 0 up to?


>    00000080 end:00/0008 IPB:00
>    00000084 end:00/0010 IPB:00
>    00000088 end:00/0018 IPB:00
>    0000008c end:00/0020 IPB:00
> PSIHB Source 000ffff0 .. 000ffffd
>    000ffff0 LSI --
>    000ffff1 LSI --
>    000ffff2 LSI --
>    000ffff3 LSI --
>    000ffff4 LSI --
>    000ffff5 LSI --
>    000ffff6 LSI --
>    000ffff7 LSI --
>    000ffff8 LSI --
>    000ffff9 LSI --
>    000ffffa LSI --
>    000ffffb LSI --
>    000ffffc LSI --
>    000ffffd LSI --
> PHB4[0:0] Source 000fe000 .. 000fefff  @6030203110100
>    00000ffe LSI --
>    00000fff LSI --
> PHB4[0:5] Source 000fb000 .. 000fb7ff  @6030203110228
>    00000000 MSI --
>    00000001 MSI --
>    00000002 MSI --
>    00000003 MSI --
>    00000004 MSI --
>    00000005 MSI --
>    00000006 MSI --
>    000007fe LSI --
>    000007ff LSI --
> PHB4[0:4] Source 000fb800 .. 000fbfff  @6030203110220
>    000007fe LSI --
>    000007ff LSI --
> PHB4[0:3] Source 000fc000 .. 000fcfff  @6030203110218
>    00000ffe LSI --
>    00000fff LSI --
> PHB4[0:2] Source 000fd000 .. 000fd7ff  @6030203110210
>    00000000 MSI --
>    00000001 MSI --
>    00000002 MSI --
>    00000003 MSI --
>    00000004 MSI --
>    000007fe LSI --
>    000007ff LSI --
> PHB4[0:1] Source 000fd800 .. 000fdfff  @6030203110208
>    000007fe LSI --
>    000007ff LSI --
> 
>> Could you please check with powersave=off in the host kernel also ?
>>
> 
> It still hangs with this option.

This is going to need some serious digging to solve. It might not be worth
the time :/

C.